Nonuniform Random Variates - Generating Random Inputs



5.2 Generating Random Inputs

5.2.4 Nonuniform Random Variates

Variates generated from a nonuniform distribution have the property that some values appear more frequently than others. Nonuniform data can be used to model a wide variety of real-world phenomena, from patterns of access at Web sites, to population densities in maps, to distributions of file sizes, to lengths of words in text.

Nonuniform distributions are also widely used in robustness tests of algorithms. Many algorithms and data structures, for example, cell-based geometric algorithms, display best performance when the inputs have a uniform or near-uniform distribution; experiments are used to learn how performance degrades when inputs move away from this theoretical ideal.

A discrete probability distribution $P = (p_1, p_2, \ldots, p_n)$ specifies the probability that each integer $i \in \{1, \ldots, n\}$ appears next in a sequence of random variates $R_1, R_2, \ldots$. For example, the distribution P = (.5, .3, .2) specifies that 1 is generated with probability .5, 2 is generated with probability .3, and 3 is generated with probability .2. Sometimes probabilities are specified by a probability density function p(i) instead of a list. For example, the probability density function for the roll of a fair die is p(i) = 1/6 for $i \in \{1, \ldots, 6\}$. If P is a continuous probability distribution, the variates $R_i$ take values in some real range that may be bounded or unbounded. Continuous distributions are usually specified by probability density functions. No matter how the distribution is specified, the total probability is always 1: discrete probabilities sum to 1, and a continuous density integrates to 1.

We say a distribution P models a real-world scenario if the variates generated according to P tend to have the same statistical properties as naturally occurring events in that scenario. The following is a short list of classic scenarios and the distributions that have been used to model their properties. Each distribution corresponds to a family of density functions parameterized by the values in parentheses.

Code for simple generation tasks is shown in the list, but more complicated procedures are omitted here because of space constraints. Precise mathematical definitions of these distributions and generator code may be found in Devroye's [5] comprehensive and definitive text and in most simulation textbooks, for example, [4], [11], or [17].

• A Bernoulli trial is a single event with two possible outcomes A and B, which occur with probabilities p and 1−p. This distribution can be used to model, for example, a coin toss (p = .5), a random read or write to a memory address, or a random operation (insert, delete) on a data structure. The Bernoulli(p) distribution generates a sequence of random outcomes in Bernoulli trials as follows:

loop:
// rngReal() is uniform on [0,1), so A is generated with probability p
if (p > rngReal()) generate(A); else generate(B);

The binomial(t, p) distribution models the number of successes (A) among t independent Bernoulli trials with success (A) probability p; the number of failures (B) is the remaining t minus that count. Use this distribution, for example, to generate counts of how many heads and tails are observed in t coin flips.

The geometric(p) distribution models the number of initial events B before the first occurrence of A in a sequence of Bernoulli trials: use this distribution to generate, for example, a count of how many reads occur before the next write. The negative binomial(k, p) distribution models the number of initial events B before the kth event A. Use this to generate a random count of inserts before the kth delete operation.

• A Poisson process is a system where events occur independently and continuously at a constant average rate, for example, clients arriving at a Web site or packets arriving at a router. The discrete Poisson(λ) distribution models the number of events (arrivals) occurring in a fixed time interval, where λ is the average number of events in the interval.

Use the continuous exponential(λ) distribution to generate random time intervals between successive events in a Poisson process. To generate a random variate X according to this distribution use the formula X = (−ln U)/λ, where U is a random uniform real in [0,1) and ln is the natural logarithm. Because ln 0 is undefined, replace U by 1 − U (uniform on (0,1]) if the generator can return exactly 0.

• The normal(μ,σ) distribution is used to model quantities that tend to cluster around their mean μ; parameter σ denotes the amount of "spread" away from the mean. This is the famous bell-shaped curve, used to model situations in which variation from the mean can be interpreted as a sum of many small random errors. Classic examples include positions of randomly moving particles, cumulative rainfalls in a period, scores on standardized tests, and errors in scientific measurements.

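To generate a variate N according to the standard normal distribution with μ = 0 and σ = 1, draw two independent uniform reals $U_1$ and $U_2$ and apply the formula

$$N = \sqrt{-2 \ln U_1}\,\cos(2\pi U_2).$$

A normal(μ,σ) variate is then obtained as μ + σN.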

Alternative formulas that are faster and/or more accurate in some cases may be found in the references cited earlier.

• The lognormal(μ,σ) distribution describes variables that can be seen as the products, rather than sums, of many random errors; that is, the logarithms of these variates are normally distributed. This distribution is skewed, with a compressed range of low values near the mean and a long tail of high values spread over a larger range. Classic examples found in nature include size measurements of biological phenomena (height, weight, length of hair, length of beak) and financial data such as changes in exchange rates and stock market indices. The lognormal distribution and the related Pareto distribution have been used to model file sizes in Internet transactions, burst rates in communication networks, and job sizes in scheduling problems.

• Zipf's distribution (n) is commonly used to model discrete distributions arising in the social sciences, such as word and letter frequencies in texts. This distribution has been proposed [1] to model Internet features such as counts of links in Web pages and counts of e-mail contacts. Parameter n denotes the number of distinct elements in the set. Zipf's law states that the frequency of an element is inversely proportional to its rank in a table of frequencies: the most common element occurs with probability c, the second with probability c/2, the third with probability c/3, and so forth. The probability density function is $Z_n(i) = 1/(i H_n)$, where $H_n = \sum_{i=1}^{n} 1/i$ is called the nth harmonic number.

General Distributions
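Before turning to general methods, the simple generators named in the list of distributions above can be collected into one sketch. This is a minimal Python illustration, not code from the text: the function names are hypothetical, Python's `random.random` (uniform on [0,1)) stands in for rngReal(), and the counting loops for the binomial, geometric, and negative binomial cases are the most direct approach rather than the fastest one.

```python
import math
import random


def bernoulli(p, rng=random.random):
    """Return 'A' with probability p, otherwise 'B'."""
    return "A" if rng() < p else "B"


def binomial(t, p, rng=random.random):
    """Count outcomes A in t independent Bernoulli(p) trials."""
    return sum(1 for _ in range(t) if rng() < p)


def geometric(p, rng=random.random):
    """Count outcomes B before the first A (requires p > 0)."""
    count = 0
    while rng() >= p:
        count += 1
    return count


def negative_binomial(k, p, rng=random.random):
    """Count outcomes B before the k-th A, as a sum of k geometric counts."""
    return sum(geometric(p, rng) for _ in range(k))


def exponential(lam, rng=random.random):
    """Inverse transform X = -ln(U)/lam; 1 - rng() lies in (0, 1], avoiding ln 0."""
    return -math.log(1.0 - rng()) / lam


def standard_normal(rng=random.random):
    """Box-Muller transform: N = sqrt(-2 ln U1) * cos(2 pi U2)."""
    u1 = 1.0 - rng()  # shift to (0, 1] so the logarithm is defined
    u2 = rng()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
```

Passing the uniform source as the `rng` parameter is a design choice made here so that a fixed or recorded sequence can be substituted when an experiment must be reproduced.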

General methods are also known for generating random variates according to arbitrary discrete distributions where the probabilities $P = (p_1, p_2, \ldots, p_n)$ are specified in a list rather than by a function, and distributions (such as Zipf's) where no simple arithmetic transformation is known. Two techniques called the lookup method and the alias method are sketched here.
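As an illustration of the kind of input these methods take, the Zipf probabilities $Z_n(i) = 1/(i H_n)$ can be tabulated into a list. The following is a small Python sketch; the function name `zipf_probabilities` is a hypothetical choice, not something defined in the text.

```python
def zipf_probabilities(n):
    """Return [Z_n(1), ..., Z_n(n)] with Z_n(i) = 1 / (i * H_n)."""
    harmonic = sum(1.0 / i for i in range(1, n + 1))  # H_n, the nth harmonic number
    return [1.0 / (i * harmonic) for i in range(1, n + 1)]


probs = zipf_probabilities(4)
# H_4 = 25/12, so the list is approximately [0.48, 0.24, 0.16, 0.12]
```

The resulting list can be handed to either of the two methods described in this section.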

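The lookup method starts by constructing an array prob[1...n] containing the cumulative probabilities:

$$\mathrm{prob}[k] = \sum_{i=1}^{k} p_i$$

For example, if P = (.5, .3, .2), the array would contain [.5, .8, 1]. The cumulative probabilities mark the boundaries between segments of the interval from 0 to 1, one segment per element:

[Figure: the interval from 0 to 1 divided at .5 and .8 into segments of width .5, .3, and .2, labeled 1, 2, and 3.]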

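Once the table is built, each random variate r is generated by the following loop:

double p = rngReal(); // uniform on [0,1)
int r; // declared outside the loop so it is still in scope at the return
for (r = 1; r < n; r++) if (p < prob[r]) break; // stop at the first cumulative probability above p
return r; // r = n if no earlier entry matched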

The lookup method takes O(n) initialization time and O(n) time per element generated, worst case; the lookup is fastest if the original probabilities are sorted in decreasing order. Binary search can be applied to improve this to O(log n) time per element.
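The linear scan and its binary-search variant can be compared in a short Python sketch. The function names here are hypothetical, the cumulative array is stored 0-based while the returned variates stay 1-based as in the text, and `bisect.bisect_right` from the standard library is one way to do the binary search, not necessarily the one the author had in mind.

```python
import bisect
import itertools


def build_cumulative(probs):
    """Return cumulative sums; for (.5, .3, .2) this is approximately [.5, .8, 1]."""
    return list(itertools.accumulate(probs))


def lookup_linear(cum, u):
    """Return the 1-based index of the first entry with u < cum[r - 1]; O(n) scan."""
    for r in range(1, len(cum)):
        if u < cum[r - 1]:
            return r
    return len(cum)  # the last element absorbs any rounding error in the sums


def lookup_binary(cum, u):
    """Same result as lookup_linear, found by binary search in O(log n)."""
    r = bisect.bisect_right(cum, u) + 1  # number of entries <= u, plus 1
    return min(r, len(cum))


cum = build_cumulative([0.5, 0.3, 0.2])
```

A variate is then drawn with, for example, `lookup_binary(cum, random.random())` after importing `random`.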

The lookup method works well when n is fairly small. The alternative alias method takes only constant time per element generated, but uses more space.

This method starts by building an alias table table[1...n] that holds an alias probability and an alias value for each element. We denote these values table[r].prob and table[r].alias.

The generation code starts with a random real rx, generated uniformly from the real range [1, n+1). This value is separated into its integer part r (uniform on 1...n) and its real part x (uniform on [0,1)). The integer r becomes the table index, and the real x is compared to table[r].prob. Depending on the comparison, the code generates either r or table[r].alias:

double rx = (rngReal()*n)+1; // real on [1..n+1)

int r = (int) rx; // integer part

double x = rx - r; // real part

if (x < table[r].prob) return r; else return table[r].alias;

The initialization code to build the alias table is shown in Figure 5.6. The main loop pairs up low-probability elements j with high-probability elements k, so that k becomes the alias for j. The diagram that follows shows how this works for distribution P = (.5, .3, .2). The horizontal line marks the average probability for three elements, equal to 1/3. In the first step some of the excess probability for element 1 is mapped to form the alias for element 3; in the second step the remaining excess for element 1 becomes the alias for element 2. The three values in table[].alias are (x, 1, 1), where x marks an entry that is never used because table[1].prob is 1; the three values in table[].prob are (1, .9, .6), reflecting the cutoffs scaled to [0,1].

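[Diagram: bar charts for elements 1, 2, and 3 of P = (.5, .3, .2), showing the excess probability of element 1 above the average line being moved to elements 3 and 2.]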
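The table values quoted for P = (.5, .3, .2) can be checked by working backward from the table to the distribution it encodes. The Python sketch below assumes 0-based indices and uses `None` for the unused alias entry; the function name `implied_probabilities` is hypothetical.

```python
def implied_probabilities(prob, alias):
    """Recover the distribution encoded by an alias table (0-based lists).

    Slot r is chosen with probability 1/n; it yields r with probability
    prob[r] and alias[r] otherwise.
    """
    n = len(prob)
    p = [0.0] * n
    for r in range(n):
        p[r] += prob[r] / n
        if alias[r] is not None:
            p[alias[r]] += (1.0 - prob[r]) / n
    return p


# Table from the worked example, shifted to 0-based indices.
recovered = implied_probabilities([1.0, 0.9, 0.6], [None, 0, 0])
# recovered is approximately [0.5, 0.3, 0.2]
```

Element 1 of the text (index 0 here) receives 1/3 from its own slot plus .1/3 and .4/3 from the other two slots, which totals .5.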

The initialization process starts by creating a set H of elements with higher-than-average probabilities and a set L of elements with lower-than-average probabilities. All probabilities are multiplied by n to simplify the arithmetic, so element i is inserted into H or L according to whether the scaled probability R[i] is more or less than 1.

During the main loop an arbitrary element k is selected from H and an arbitrary element j is selected from L. The table[j] values are assigned, and j is removed from further consideration. The remaining probability for k is calculated; if the result is below average, k is moved from H to L. The initialization step takes O(n) time to build the table; the sets H and L can be implemented with a simple unordered array partitioned around high and low elements.
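The initialization and generation steps described above might be sketched in Python as follows. This is an illustration under stated assumptions, not the code of Figure 5.6: indices are 0-based, H and L are kept as two Python lists rather than one partitioned array, elements whose scaled probability equals 1 are placed in H, and all names are hypothetical.

```python
import random


def build_alias_table(probs):
    """Build (prob, alias) lists, 0-based, for the given probabilities in O(n)."""
    n = len(probs)
    scaled = [p * n for p in probs]  # R[i]: the average becomes 1
    prob = [1.0] * n
    alias = list(range(n))  # default alias: the element itself
    low = [i for i, s in enumerate(scaled) if s < 1.0]  # set L
    high = [i for i, s in enumerate(scaled) if s >= 1.0]  # set H
    while low and high:
        j = low.pop()  # j is removed from further consideration
        k = high[-1]
        prob[j] = scaled[j]  # cutoff for generating j itself
        alias[j] = k  # the rest of slot j is given to k
        scaled[k] -= 1.0 - scaled[j]  # remaining probability for k
        if scaled[k] < 1.0:  # k dropped below average: move it to L
            low.append(high.pop())
    return prob, alias  # any leftovers keep prob 1.0


def alias_sample(prob, alias, rng=random.random):
    """Draw one 0-based variate in constant time."""
    rx = rng() * len(prob)
    r = int(rx)  # integer part: table slot
    x = rx - r  # fractional part, uniform on [0,1)
    return r if x < prob[r] else alias[r]


prob, alias = build_alias_table([0.5, 0.3, 0.2])
```

For P = (.5, .3, .2) this yields cutoffs of approximately (1, .9, .6) with elements 2 and 3 of the text aliased to element 1, matching the worked example. Leaving unpaired elements at cutoff 1.0 is a choice made here to tolerate floating-point rounding.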

In document 9cgmv.A.Guide.to.Experimental.Algorithmics.pdf (Page 180-184)