
What to Measure

3.1 Time Performance

Which is better: accuracy or precision? The following table illustrates the difference. Each row shows an experimental measurement of the speed of light in kilometers per second, published by the Nobel physicist Albert Michelson in different years.

Year    Result
1879    299,910 ± 50 km/s
1926    299,796 ± 4 km/s
1935    299,774 ± 11 km/s

The precision of an experimental result corresponds to how much variation is seen in repeated measurements: Michelson’s results ranged in precision from a low of ±50 km/s to a high of ±4 km/s, depending on his instruments. The accuracy of a result is how close it is to the truth: since none of these results overlap, at most one can be accurate. (The 1926 result turns out to be compatible with modern measurements. In 1983 the length of a meter was redefined by international agreement so that the speed of light is now exactly 299,792.458 km/s.)
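The distinction can be checked with a little interval arithmetic. The following Python sketch uses only the three published values from the table and the exact modern value quoted above; the helper names (`interval`, `overlaps`, `contains`) are illustrative and not part of the original text.

```python
# Michelson's published results as (year, value, uncertainty), all in km/s,
# taken from the table above.
MEASUREMENTS = [(1879, 299_910, 50), (1926, 299_796, 4), (1935, 299_774, 11)]

# Speed of light fixed by the modern definition of the meter, in km/s.
C_EXACT = 299_792.458


def interval(value, uncertainty):
    """Return the (low, high) range implied by value ± uncertainty."""
    return (value - uncertainty, value + uncertainty)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(rng, x):
    """True if x lies inside the closed interval rng."""
    return rng[0] <= x <= rng[1]


intervals = {year: interval(v, u) for year, v, u in MEASUREMENTS}

# Precision: the smallest uncertainty marks the most precise result.
most_precise = min(MEASUREMENTS, key=lambda m: m[2])[0]

# Accuracy: which intervals contain the modern value?
accurate_years = [y for y, rng in intervals.items() if contains(rng, C_EXACT)]

if __name__ == "__main__":
    for year, rng in intervals.items():
        print(year, rng, "contains c:", contains(rng, C_EXACT))
```

Here the most precise result and the only accurate one happen to coincide, but nothing forces that: the 1935 interval is narrower than the 1879 one and still misses the modern value.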

In algorithm analysis the two most common time performance indicators are dominant costs and CPU times. A dominant operation, such as a comparison, has the property that no other operation in the algorithm is performed more frequently. An asymptotic bound on the dominant cost, like “The algorithm performs $O(n^2)$ comparisons in the worst case,” is perfectly accurate. Furthermore, the bound is universal, since it holds no matter what input, platform, or implementation is used. But it lacks precision: will the program take seconds or hours to run?
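A dominant cost can also be measured directly by instrumenting the code with a counter. The sketch below does this for insertion sort, which is chosen here purely as a familiar example (the text does not name an algorithm); the function name `insertion_sort_count` is likewise an assumption.

```python
def insertion_sort_count(items):
    """Sort a copy of items; return (sorted_list, number_of_comparisons)."""
    a = list(items)
    comparisons = 0
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1      # one key comparison: a[j] against x
            if a[j] <= x:
                break             # x is already in position
            a[j + 1] = a[j]       # shift the larger element one slot right
            j -= 1
        a[j + 1] = x
    return a, comparisons


if __name__ == "__main__":
    n = 10
    _, worst = insertion_sort_count(range(n, 0, -1))  # reversed input
    _, best = insertion_sort_count(range(n))          # already sorted input
    print("worst case:", worst, "best case:", best)
```

The count is the same on every platform and in every implementation that follows this logic, which is the sense in which it is universal. For reversed input of size n it equals n(n−1)/2, matching the quadratic worst-case bound, yet it says nothing about elapsed seconds.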

A CPU time measurement (explained in Section 3.1.2) is precise down to fractions of seconds. But as we shall see, precision guarantees neither accuracy nor generality. System timing tools can return non-replicable results, and runtimes measured on one platform and input instance are notoriously difficult to translate to other scenarios.
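As a rough illustration of non-replicable timings, the sketch below repeats a CPU time measurement using Python's standard `time.process_time`. This is only one possible timing tool and is not necessarily the one discussed in Section 3.1.2; `cpu_time` and `workload` are hypothetical names introduced for the example.

```python
import time


def cpu_time(func, *args, repeats=5):
    """Run func(*args) `repeats` times; return the CPU time of each run in seconds."""
    times = []
    for _ in range(repeats):
        start = time.process_time()   # user + system CPU time of this process
        func(*args)
        times.append(time.process_time() - start)
    return times


def workload(n):
    """A stand-in computation to be timed (hypothetical)."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    samples = cpu_time(workload, 200_000)
    print("min %.6f s, max %.6f s" % (min(samples), max(samples)))
```

Repeated runs on the same machine and input typically report slightly different values, and the numbers change again on different hardware.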

To illustrate this point, Figure 3.1 shows results of time measurements carried out by participants in the 2000 DIMACS TSP Challenge [14]. CPU times of one TSP program on a suite of nine input instances were recorded on 11 different platforms, and runtimes varied between six seconds and 1.75 hours in these tests. Each line corresponds to a platform; the nine measurements for each are divided by times on a common benchmark platform P. Times are ordered by input size (ranging from $10^3$ to $10^7$ cities) on the x-axis. The dotted top line shows, for example, that on this platform the program ran about 4 times slower than on P with small inputs, and about 10 times slower with large inputs.

[Figure 3.1 plot: Ratio of Times (y-axis) against Instance 1 to 9 (x-axis), one line per platform.]

Figure 3.1. Relative CPU times. CPU times for a single program were recorded on 11 platforms and 9 inputs. The inputs are ordered by size on the x-axis. Each line shows the ratio of times on one platform to times on a common benchmark platform. Time ratios vary from 0.91 to 10, and there is no obvious pattern in these variations.

In an ideal world, each line would be horizontal, and mapping CPU times from one platform to another would be a simple matter of finding the right scaling coefficient. If the lines had the same general shape, we could build an empirical model that predicts times on the basis of input sizes. Instead there is no common pattern: it is like weighing an object on Earth and trying to predict its weight on the Moon, except there is no known function that describes the relationship between gravitational forces and object weights. Millisecond-scale measurement precision is useless if predictions based on CPU times can be wrong by seconds or hours.
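The "horizontal line" criterion can be stated concretely: divide each platform's times by the benchmark times and see whether the ratios stay constant. The sketch below does this. All timing values in it are invented for illustration and are not the DIMACS Challenge data; the function names are also assumptions.

```python
def time_ratios(platform_times, benchmark_times):
    """Ratio of each time on a platform to the time on the benchmark platform."""
    return [t / b for t, b in zip(platform_times, benchmark_times)]


def spread(ratios):
    """Largest ratio divided by smallest; 1.0 means a perfectly horizontal line."""
    return max(ratios) / min(ratios)


# HYPOTHETICAL times in seconds, ordered by increasing input size.
benchmark = [1.0, 2.0, 4.0, 8.0]
scaled = [2.0, 4.0, 8.0, 16.0]      # ideal case: exactly 2x the benchmark
drifting = [4.0, 10.0, 28.0, 80.0]  # ratio grows with input size

if __name__ == "__main__":
    for name, times in [("scaled", scaled), ("drifting", drifting)]:
        r = time_ratios(times, benchmark)
        print(name, r, "spread:", spread(r))
```

For the `scaled` platform a single coefficient translates times exactly. For the `drifting` platform no single coefficient works, which is the situation Figure 3.1 depicts.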

Fortunately, the experimenter has more choices than just these two performance indicators. The fundamental trade-off between precision and universality cannot be eliminated, but it can be controlled to obtain time measurements that are more precise than asymptotic bounds and more general than CPU times. The following section introduces a case study problem to illustrate various time performance indicators and their merits.

Case Study: Random Text Generation

The text generation problem is to generate a random text that looks as if it was written by a human being. The problem arises in diverse applications including generating parodies and circumventing spam filters. One well-known approach is to set a hundred monkeys working at typewriters until a good-looking text appears. But we can get faster results using a Markov Chain Monte Carlo (MCMC) approach. The algorithm described here (called MC) has been employed by Bentley [4] and by Kernighan and Pike [16] to illustrate principles of algorithm design and of code tuning.

An MCMC algorithm makes a random walk through a state space. At time t, the algorithm is in state $S_t$, and it steps to the next state $S_{t+1}$ according to transition probabilities that map from states to states. The output of the algorithm corresponds to a trace of the states it passes through during its random walk. The MC text generation algorithm creates a state space and transition probabilities based on a sample of real text and then generates a random text by stepping through the states according to those probabilities.
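The general random-walk idea can be sketched independently of text generation. In the Python fragment below, the three-state chain `TRANSITIONS` and the function `random_walk` are hypothetical and serve only to show how a trace of states is produced from a table of transition probabilities.

```python
import random


def random_walk(transitions, start, steps, rng=None):
    """Return the trace [S_0, S_1, ..., S_steps] of a walk through a state space.

    `transitions` maps each state to a dict {next_state: probability}.
    """
    rng = rng or random.Random()
    state = start
    trace = [state]
    for _ in range(steps):
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        # Choose S_{t+1} according to the transition probabilities out of S_t.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        trace.append(state)
    return trace


# A hypothetical three-state chain, used only for illustration.
TRANSITIONS = {
    "A": {"B": 1.0},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"A": 1.0},
}

if __name__ == "__main__":
    print(" ".join(random_walk(TRANSITIONS, "A", 12, random.Random(1))))
```

In MC the states would correspond to word sequences drawn from the sample text rather than to abstract labels.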

MC starts by reading text T containing n words, together with two parameters: m is the number of words to print, and k determines the size of the state space and the transition table.

Every sequence of k words is a key, and every key is followed by a one-word suffix. For example, Figure 3.2 shows all keys and suffixes for k = 2 and T = this is a test this is only a test this is a test of the emergency broadcasting system. The algorithm builds a dictionary D containing all k-word keys and their suffixes.

2-Word Key   Suffix   2-Word Key               Suffix
this is      a        this is                  a
is a         test     is a                     test
a test       this     a test                   of
test this    is       test of                  the
this is      only     of the                   emergency
is only      a        the emergency            broadcasting
only a       test     emergency broadcasting   system
a test       this     broadcasting system      this
test this    is       system this              is

Figure 3.2. Keys and suffixes. From the text T = this is a test this is only a test this is a test of the emergency broadcasting system, with k = 2. Notice the suffixes for the last two keys wrap around to the beginning.
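The key and suffix pairs of Figure 3.2 can be reproduced with a short Python sketch. It assumes the dictionary D is a hash map from k-word tuples to lists of suffixes; that is one convenient representation, and the implementations cited in the text may organize D differently. The name `build_dictionary` is illustrative.

```python
from collections import defaultdict


def build_dictionary(text, k):
    """Map each k-word key (a tuple) to the list of one-word suffixes following it.

    The text is treated as circular, so the last k keys take their suffixes
    from the beginning of the text, as in Figure 3.2.
    """
    words = text.split()
    n = len(words)
    table = defaultdict(list)
    for i in range(n):
        key = tuple(words[(i + j) % n] for j in range(k))
        suffix = words[(i + k) % n]
        table[key].append(suffix)   # duplicates are kept, preserving frequencies
    return dict(table)


T = ("this is a test this is only a test this is a test "
     "of the emergency broadcasting system")
D = build_dictionary(T, 2)

if __name__ == "__main__":
    for key, suffixes in D.items():
        print(" ".join(key), "->", suffixes)
```

Keeping repeated suffixes in each list means that a uniform random choice from a key's list selects each suffix in proportion to how often it follows that key in T.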

Once the dictionary is built, the string phrase is initialized to contain the first k words of T, and those words are printed. The remaining m − k words are generated by iterating this process:

1. Lookup. Find all keys in D that match phrase. Return a set S of suffixes of