5.2 Generating Random Inputs

5.2.2 Random Permutations

A permutation of the integers 1 . . . 5 is a particular ordering, such as (5, 1, 3, 2, 4) or (1, 2, 3, 5, 4). A random permutation of size n is generated such that each of the n! possible orderings of the first n integers is equally likely to appear.

Random permutations are common in algorithm studies. Algorithms like Random in Figure 2.3 and SIG in Figure 2.7 in Chapter 2 generate random permutations as a processing step; also, average-case inputs for many standard algorithms and data structures, including quicksort, insertion sort, and binary search trees, are defined in terms of random permutations. A random tour of a graph corresponds to a random permutation of the vertices.

The code to generate a random permutation in array perm[1 . . . n] appears in the following. This loop should be implemented as shown: plausible-looking variations, for example, choosing a random index in 1 . . . n instead of 1 . . . i at each iteration, do not yield outputs that are uniformly distributed.

RandomPermutation (n) {
    for (i = 1; i<=n; i++) perm[i] = i;   // initialize
    for (i = n; i>=2; i--) {
        r = (int) (rngReal()*i)+1;        // random index in 1..i
        tmp = perm[i];                    // swap
        perm[i] = perm[r];
        perm[r] = tmp;
    }
    return perm;
}
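To see why the range of the random index matters, the following Python sketch (an illustration, not the author's code) implements the same loop with 0-based indexing and, for a tiny n, enumerates every possible sequence of index choices. The names `random_permutation` and `outcome_counts` are hypothetical helpers introduced here. With indices drawn from 0 . . i, each of the n! permutations arises from exactly one choice sequence; with indices drawn from the full range, the number of choice sequences is not a multiple of n!, so some permutations must be more likely than others.

```python
import random
from collections import Counter
from itertools import product

def random_permutation(n, rng=random):
    """Return a uniformly random permutation of 1..n as a list."""
    perm = list(range(1, n + 1))
    for i in range(n - 1, 0, -1):
        r = rng.randrange(i + 1)          # random index in 0..i
        perm[i], perm[r] = perm[r], perm[i]
    return perm

def outcome_counts(n, full_range):
    """Tally the permutations produced by every possible choice sequence.

    full_range=False draws each index from 0..i (the loop shown above);
    full_range=True draws each index from 0..n-1 (the flawed variation).
    """
    positions = list(range(n - 1, 0, -1))
    ranges = [range(n) if full_range else range(i + 1) for i in positions]
    counts = Counter()
    for choices in product(*ranges):
        perm = list(range(1, n + 1))
        for i, r in zip(positions, choices):
            perm[i], perm[r] = perm[r], perm[i]
        counts[tuple(perm)] += 1
    return counts

if __name__ == "__main__":
    print(outcome_counts(3, full_range=False))  # every permutation once
    print(outcome_counts(3, full_range=True))   # unequal tallies
```

For n = 3 the flawed variation has 9 equally likely choice sequences spread over 6 permutations, so the tallies cannot all be equal.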

This procedure takes O(n) time to generate a random permutation of n integers.

5.2.3 Random Samples

A random sample is a subset of k elements drawn from a collection of n elements, such that each possible subset is equally likely to be drawn. We assume here that a sample is drawn from the integers 1 . . . n. The random sample may be drawn with replacement, which means that duplicates are allowed (that is, we imagine replacing each element after it is drawn from the set), or drawn without replacement, which means that duplicates are not allowed.

Random sampling can be used to create a small version of a large data set while preserving its statistical properties or to create a hybrid input instance by sampling from a space of real-world components. Sampling can also be used to create random combinatorial objects with certain types of structures: for example, random samples of vertices could be designated “source” and “sink” in a flow network.

Sampling with replacement is easy: just call the RNG k times. The loop that follows samples k integers with replacement from the integer range [1 . . . n] in O(k) time.

SampleWithReplacement (n, k) {
    for (i=0; i<k; i++)
        sample[i] = (int) (rngReal()*n)+1;
    return sample;
}

To generate a sample without replacement when k is much smaller than n, use the same loop with a set data structure to reject duplicates, until k distinct integers are collected:

SampleWithoutReplacement (n, k) {
    Set S = empty;
    while (S.size() < k) {
        r = (int) (rngReal()*n)+1;   // sample 1..n
        if (!S.contains(r)) S.insert(r);
    }
    return S;
}

static int k;       // sample size
static int n;       // sample range is [1...n]
static int s = 0;   // selected
static int c = 0;   // considered

int orderedSampleNoReplacement() {
    double p;
    int nextInt;
    boolean done = false;
    while (!done) {
        p = (double) (k-s)/(n-c);
        if (rngReal() < p) {   // with probability p
            nextInt = c + 1;
            s++;
            done = true;
        }
        c++;
    }
    return nextInt;
}

Figure 5.3. Ordered integer samples. Sampling k integers from 1 . . . n without replacement, in increasing order. The next integer in the sample is returned at each invocation of this routine.

If k is near n, this loop spends too much time rejecting duplicates toward the end of the process; in this case, it is more efficient to generate a random permutation of 1 . . . n and then select the first k elements of the permutation.
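The two strategies for sampling without replacement can be combined in one routine, as in the following Python sketch. This is only an illustration under stated assumptions: the function name is hypothetical, and the cutoff `2 * k <= n` for switching from rejection to the permutation-prefix method is an arbitrary choice made here, not a threshold given in the text.

```python
import random

def sample_without_replacement(n, k, rng=random):
    """Return a set of k distinct integers drawn uniformly from 1..n."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    if 2 * k <= n:                      # assumed cutoff: k is "small"
        chosen = set()
        while len(chosen) < k:
            chosen.add(rng.randint(1, n))   # duplicates are absorbed by the set
        return chosen
    # k is close to n: shuffle 1..n and keep the first k entries
    perm = list(range(1, n + 1))
    rng.shuffle(perm)
    return set(perm[:k])

if __name__ == "__main__":
    print(sorted(sample_without_replacement(100, 5)))
    print(sorted(sample_without_replacement(10, 9)))
```

With this cutoff each draw in the rejection branch succeeds with probability greater than 1/2, since fewer than half of the integers have been taken, so that branch needs fewer than 2k draws in expectation.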

Ordered Integer Samples

Sometimes the experiment calls for a random sample of size k from 1 . . . n, generated without replacement and sorted in increasing order. For example, an ordered integer sample may be used to select from a pool of n real-world elements, such as URLs from a trace of network activity, or a directory of test input files. The ordered sample is used in a linear scan through the pool to pull out sampled elements by index.

A simple approach is to generate the sample without replacement as shown previously, and then sort the sample. Sorting takes O(k log k) time and O(k) space, which may be fine in many situations.

Faster methods are known, however. The algorithm sketched in Figure 5.3, due to Fan et al. [7] and Jones [9], generates a sample of k integers from the range 1 . . . n.

static double k;         // total in sample
static double i = k;     // counter
static double m = 1.0;   // current top of range

double orderedSampleReals() {
    m = m * exp(ln (rngReal()) / i);
    i--;
    return m;
}

Figure 5.4. Ordered reals. Generating a sample of k doubles from [0,1) in decreasing order. The next double in the sample is returned at each invocation of the procedure.

To understand how it works, note that the integer 1 (or any particular integer) should be a part of the final sample with probability k/n. If 1 is selected, then 2 should be part of the final sample with probability (k − 1)/(n − 1); if 1 is not selected, 2 should be selected with probability k/(n − 1). Let c denote the number of integers that have been considered so far in the selection process, and let s denote the number that have already been selected for the sample. Then the probability of selecting the next integer is equal to (k − s)/(n − c). At each invocation the procedure considers integers in increasing order according to that probability, until one is selected.
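The selection rule just described can be sketched as a Python generator. This is an illustrative rewrite, not the routine of Figure 5.3: the name `ordered_sample` is hypothetical, and the counters are kept in local variables instead of static state. The test `random() * (n - c) < (k - s)` is the same comparison as `random() < (k - s)/(n - c)`, written without a division.

```python
import random

def ordered_sample(n, k, rng=random):
    """Yield k distinct integers from 1..n in increasing order."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    selected = 0                                  # s in the text
    for considered in range(n):                   # c in the text
        if selected == k:
            return
        # select integer c+1 with probability (k - s) / (n - c)
        if rng.random() * (n - considered) < (k - selected):
            selected += 1
            yield considered + 1

if __name__ == "__main__":
    print(list(ordered_sample(20, 5)))
```

When the number of integers still needed equals the number remaining, the selection probability is 1, so the generator always produces exactly k values.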

This algorithm requires constant space. The time to generate k of n integers is proportional to the last integer generated, or O(n − n/k). This may be preferable to the sample-sort-scan approach when the sample is too large to be stored conveniently in an array; it also may be more efficient if the test program can sometimes stop before the entire sample is generated.

Ordered Reals

A related problem is to generate a sorted sample of k reals uniformly without replacement from [0,1). Bentley and Saxe [2] describe an algorithm that creates such a sample on the fly, using constant space and constant time per invocation. Like the preceding method, this algorithm is preferred if the sample is too big to store in advance, or if sorting takes too much time. The algorithm generates the sample variates in decreasing order; for a sample in increasing order, just subtract each variate from 1.0.

The algorithm is shown in Figure 5.4. The idea is first to generate a random variate M_1 according to the distribution of the maximum of a sample of k variates from [0,1). Once M_1 is generated, the next variate represents the maximum M_2 of a sample of k − 1 variates from [0, M_1), and so forth. The first variate M_1 is generated

ReservoirSample(k, Pool) {
    // Pre: Pool contains at least k elements
    S.init(empty);                  // priority queue of pairs with key u
    for (i=1; i<=k; i++) {          // initial sample
        u = rngReal();
        p = Pool.take(i);           // take element i
        S.insert(p,u);
    }
    int n = k+1;
    while (Pool.hasElement(n)) {
        u = rngReal();
        if (u < S.maxKey()) {       // u is among the k smallest keys so far
            S.extractMax();         // delete old
            p = Pool.take(n);       // take element n
            S.insert(p, u);         // insert new
        }
        n++;
    }
    return S.data();
}

Figure 5.5. Reservoir sampling. This procedure returns a random sample of size k from a pool of n elements where the size of the pool is not known in advance.

using the formula M_1 = U_1^(1/k), which can be coded as shown in Figure 5.4; ln is the natural logarithm function, and exp exponentiates e, the base of the natural logarithm. Subsequent variates M_{i+1} are scaled by multiplying by M_i.
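The recurrence can be sketched as a Python generator. This is an illustration only: the name `ordered_reals` is hypothetical, and the uniform variate is taken as `1.0 - random()` (a value in (0, 1]) so that the logarithm is never evaluated at zero, which is a small deviation from the pseudocode in Figure 5.4.

```python
import math
import random

def ordered_reals(k, rng=random):
    """Yield k reals in non-increasing order, each distributed as the
    maximum of the variates that remain to be generated."""
    m = 1.0                                    # current top of range
    for remaining in range(k, 0, -1):
        u = 1.0 - rng.random()                 # uniform on (0, 1]
        m *= math.exp(math.log(u) / remaining)  # m * u**(1/remaining)
        yield m

if __name__ == "__main__":
    decreasing = list(ordered_reals(5))
    increasing = [1.0 - x for x in decreasing]  # flip for increasing order
    print(decreasing)
    print(increasing)
```

Because only the running value `m` and a counter are stored, the generator uses constant space however large k is.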

Reservoir Sampling

The reservoir sampling problem is to generate a random sample of k elements from the integers 1 . . . n on the fly, where n is not known in advance.

For example, the experiment may call for k lines to be sampled at random from an input file without prior knowledge of the file size, or of k packets from a router trace without knowing in advance how many packets will pass through the router. An application of reservoir sampling appears in the selection procedure of the markov.c program shown in Figure 3.4, which samples a random array element (k = 1) from a subarray a[i . . . j], without prior knowledge of the subarray size.

The reservoir sampling algorithm due to Knuth [10] is sketched in Figure 5.5. Start by selecting the sample (1, 2, . . . , k), which is the correct choice if k = n. If it turns out that n = k + 1, a new sample can be constructed from the old one as follows: with probability k/n, replace a random element in the current sample with element n; otherwise, leave the sample unchanged.

The algorithm continues this way for each new n in sequence, replacing random elements in the current sample according to appropriate probabilities.
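The replacement rule can be written directly, without a priority queue. The Python sketch below is an illustration and not the procedure of Figure 5.5; the function name is hypothetical. It relies on the fact that, for a uniform sample of size k from n items, item n must belong to the sample with probability k/n, and each current member must be equally likely to be displaced.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly from an iterable of unknown length
    (or all of its items if it yields fewer than k)."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)               # first k items fill the sample
        elif rng.random() * n < k:            # happens with probability k/n
            sample[rng.randrange(k)] = item   # displace a random member
    return sample

if __name__ == "__main__":
    print(reservoir_sample(iter(range(1, 1001)), 10))
```

The stream is consumed once and only k items are held at a time, which is what allows n to remain unknown until the input ends.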

A convenient way to manage these probabilities is to create a pair (i, u) that associates pool element i with a real number u chosen uniformly at random from [0,1). These pairs are stored in a priority queue S using u values as keys. When the algorithm is finished, the elements tied to the k smallest of n random reals u ∈ [0,1) form the sample in S.
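The key-based formulation can be sketched in Python with the standard `heapq` module. This is a sketch under stated assumptions: `heapq` provides only a min-heap, so keys are stored negated to make the largest u sit at the top, and the position n is added to each entry purely as a tie-breaker so that pool items are never compared with one another. The function name is hypothetical.

```python
import heapq
import random

def reservoir_sample_keys(stream, k, rng=random):
    """Keep the items paired with the k smallest random keys u."""
    if k <= 0:
        return []
    heap = []                                  # entries: (-u, n, item)
    for n, item in enumerate(stream, start=1):
        u = rng.random()
        if n <= k:
            heapq.heappush(heap, (-u, n, item))
        elif u < -heap[0][0]:                  # u beats the largest kept key
            heapq.heapreplace(heap, (-u, n, item))
    return [item for _, _, item in heap]

if __name__ == "__main__":
    lines = ["line %d" % i for i in range(1, 101)]
    print(reservoir_sample_keys(lines, 3))
```

Each arriving item costs one comparison against the largest kept key, and a heap update of O(log k) time only when the item displaces a current member.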