Guideline 5.4 Write self-describing output files that require minimal reformatting
5.2 Generating Random Inputs
5.2.4 Nonuniform Random Variates
Variates generated from a nonuniform distribution have the property that some values appear more frequently than others. Nonuniform data can be used to model a wide variety of real-world phenomena, from patterns of access at Web sites, to population densities in maps, to distributions of file sizes, to lengths of words in text.
Nonuniform distributions are also widely used in robustness tests of algorithms. Many algorithms and data structures, for example, cell-based geometric algorithms, display best performance when the inputs have a uniform or near-uniform distribution; experiments are used to learn how performance degrades when inputs move away from this theoretical ideal.
A discrete probability distribution P = (p1, p2, ..., pn) specifies the probability that each integer i ∈ 1...n appears next in a sequence of random variates R1, R2, .... For example, the distribution P = (.5, .3, .2) specifies that 1 is generated with probability .5, 2 is generated with probability .3, and 3 is generated with probability .2. Sometimes probabilities are specified by a probability density function p(i) instead of a list. For example, the probability density function for the roll of a fair die is p(i) = 1/6 for i ∈ [1...6]. If P is a continuous probability distribution, the variates Ri take values in some real range that may be bounded or unbounded. Continuous distributions are usually specified by probability density functions. No matter how the distribution is specified, the probabilities always sum to 1 (for a continuous distribution, the density integrates to 1).
We say a distribution P models a real-world scenario if the variates generated according to P tend to have the same statistical properties as naturally occurring events in that scenario. The following is a short list of classic scenarios and the distributions that have been used to model their properties. Each distribution corresponds to a family of density functions parameterized by the values in parentheses.
Code for simple generation tasks is shown in the list, but more complicated procedures are omitted here because of space constraints. Precise mathematical definitions of these distributions and generator code may be found in Devroye's [5] comprehensive and definitive text and in most simulation textbooks, for example, [4], [11], or [17].
• A Bernoulli trial is a single event with two possible outcomes A and B, which occur with probabilities p and 1 − p. This distribution can be used to model, for example, a coin toss (p = .5), a random read or write to a memory address, or a random operation (insert, delete) on a data structure. The Bernoulli(p) distribution generates a sequence of random outcomes in Bernoulli trials as follows:
loop:
if (p > rngReal()) generate(A); else generate(B);
The binomial(t, p) distribution models the number of successes (A) among t independent Bernoulli trials with success probability p; the number of failures (B) is t minus that count. Use this distribution, for example, to generate counts of how many heads and tails are observed in t coin flips.
The geometric(p) distribution models the number of initial events B before the first occurrence of A in a sequence of Bernoulli trials: use this distribution to generate, for example, a count of how many reads occur before the next write. The negative binomial(k, p) distribution models the number of initial events B before the kth event A. Use this to generate a random count of inserts before the kth delete operation.
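The Bernoulli-based distributions above can all be generated by direct simulation of the trials. The following is a minimal Python sketch under stated assumptions: the function names are illustrative (they are not from the text), and Python's `random.random` stands in for `rngReal()`. Direct simulation is simple but takes time proportional to the number of trials; the faster generators in the cited references are not shown.

```python
import random

def bernoulli(p, rng=random.random):
    """Return 'A' with probability p, otherwise 'B'."""
    return 'A' if rng() < p else 'B'

def binomial(t, p, rng=random.random):
    """Count outcomes A among t independent Bernoulli(p) trials."""
    return sum(1 for _ in range(t) if bernoulli(p, rng) == 'A')

def geometric(p, rng=random.random):
    """Count outcomes B before the first A (assumes p > 0)."""
    count = 0
    while bernoulli(p, rng) == 'B':
        count += 1
    return count

def negative_binomial(k, p, rng=random.random):
    """Count outcomes B before the k-th A, as a sum of k geometric counts."""
    return sum(geometric(p, rng) for _ in range(k))
```

For example, `binomial(100, .5)` gives a count of heads in 100 coin flips (the tails count is 100 minus that value), and `geometric(.2)` gives a count of reads before the next write when writes occur with probability .2.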
• A Poisson process is a system where events occur independently and continuously at a constant average rate, for example, clients arriving at a Web site or packets arriving at a router. The discrete Poisson(λ) distribution models the number of events (arrivals) occurring in a fixed time interval, where λ is the average number of events in the interval.
Use the continuous exponential(λ) distribution to generate random time intervals between successive events in a Poisson process. To generate a random variate X according to this distribution use the formula X = (−ln U)/λ, where U is a random uniform real in [0, 1) and ln is the natural logarithm.
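The exponential formula, and its connection to Poisson counts, can be sketched in Python as follows. This is an illustration under assumptions rather than a reference implementation: the function names are hypothetical, `random.random` stands in for the uniform generator, and the code substitutes 1 − U for U (which is also uniform) so that the logarithm is never taken at zero. The Poisson count is obtained by accumulating exponential gaps over one unit-length interval, which is simple but slow for large λ.

```python
import math
import random

def exponential(lam, rng=random.random):
    """Return an exponential(lam) variate using X = -ln(U) / lam."""
    u = 1.0 - rng()  # rng() is in [0, 1), so u is in (0, 1] and log(u) is defined
    return -math.log(u) / lam

def poisson_count(lam, rng=random.random):
    """Count events in one unit interval when the gaps are exponential(lam)."""
    count = 0
    elapsed = exponential(lam, rng)  # time of the first event
    while elapsed < 1.0:
        count += 1
        elapsed += exponential(lam, rng)  # time of the next event
    return count
```

With rate λ = 2, for instance, the generated gaps average about 1/2 time unit and the counts per unit interval average about 2.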
• The normal(μ, σ) distribution is used to model quantities that tend to cluster around their mean μ; parameter σ denotes the amount of “spread” away from the mean. This is the famous bell-shaped curve, used to model situations in which variation from the mean can be interpreted as a sum of many small random errors. Classic examples include positions of randomly moving particles, cumulative rainfalls in a period, scores on standardized tests, and errors in scientific measurements.
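To generate a variate N according to the standard normal distribution with μ = 0 and σ = 1, generate two independent uniform reals U1 and U2 in (0, 1) and apply the formula
$N = \sqrt{-2 \ln U_1}\,\cos(2\pi U_2)$.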
Alternative formulas that are faster and/or more accurate in some cases may be found in the references cited earlier.
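As an illustration, the standard normal formula (the cosine form of the Box-Muller transform) might be coded in Python as below. The function names are hypothetical, `random.random` stands in for the uniform generator, and the sketch assumes that σ is the standard deviation, so that a normal(μ, σ) variate is obtained by scaling and shifting a standard normal variate. As in the usual implementations, 1 − U is used in place of U1 to keep the logarithm away from zero.

```python
import math
import random

def standard_normal(rng=random.random):
    """Return a standard normal variate via sqrt(-2 ln U1) * cos(2 pi U2)."""
    u1 = 1.0 - rng()  # in (0, 1], so log(u1) is defined
    u2 = rng()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def normal(mu, sigma, rng=random.random):
    """Return a normal variate with mean mu and standard deviation sigma."""
    return mu + sigma * standard_normal(rng)
```

Each call consumes two uniform variates and returns one normal variate.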
• The lognormal(μ, σ) distribution describes variables that can be seen as the products, rather than sums, of many random errors; that is, the logarithms of these variates are normally distributed. This distribution is skewed, with a compressed range of low values near the mean and a long tail of high values spread over a larger range. Classic examples found in nature include size measurements of biological phenomena (height, weight, length of hair, length of beak) and financial data such as changes in exchange rates and stock market indices. The lognormal distribution and the related Pareto distribution have been used to model file sizes in Internet transactions, burst rates in communication networks, and job sizes in scheduling problems.
• Zipf’s distribution (n) is commonly used to model discrete distributions arising in the social sciences, such as word and letter frequencies in texts. This distribution has been proposed [1] to model Internet features such as counts of links in Web pages and counts of e-mail contacts. Parameter n denotes the number of distinct elements in the set. Zipf’s law states that the frequency of an element is inversely proportional to its rank in a table of frequencies: the most common element occurs with probability c, the second with probability c/2, the third with probability c/3, and so forth. The probability density function is $Z_n(i) = 1/(i H_n)$, where $H_n = \sum_{i=1}^{n} 1/i$ is called the nth harmonic number.
General Distributions
General methods are also known for generating random variates according to arbitrary discrete distributions where the probabilities P = (p1, p2, ..., pn) are specified in a list rather than by a function, and distributions (such as Zipf’s) where no simple arithmetic transformation is known. Two techniques called the lookup method and the alias method are sketched here.
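The lookup method starts by constructing an array prob[1...n] containing the cumulative probabilities: $\text{prob}[k] = \sum_{i=1}^{k} p_i$.
For example, if P = (.5, .3, .2), the array would contain [.5, .8, 1]. The cumulative probabilities divide the range [0, 1) into one segment per value: [0, .5) for value 1, [.5, .8) for value 2, and [.8, 1) for value 3.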
Once the table is built, each random variate r is generated by the following loop:
double p = rngReal();
int r;  // declared outside the loop so it is still in scope at the return
for (r = 1; r < n; r++) if (p < prob[r]) break;
return r;
The lookup method takes O(n) initialization time and O(n) time per element generated, worst case; the lookup is fastest if the original probabilities are sorted in decreasing order. Binary search can be applied to improve this to O(log n) time per element.
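The binary-search variant of the lookup method can be sketched in Python with the standard `bisect` module. The function names are illustrative, `random.random` stands in for `rngReal()`, and the table is stored 0-based internally while the returned values run from 1 to n as in the text. The final clamp is a defensive assumption: it covers the case where floating-point rounding leaves the last cumulative entry slightly below 1.

```python
import bisect
import itertools
import random

def build_cumulative(probs):
    """Return the running sums of probs, e.g. (.5, .3, .2) -> [.5, .8, 1.0]."""
    return list(itertools.accumulate(probs))

def lookup(cumulative, rng=random.random):
    """Return a value in 1..n chosen according to the cumulative table."""
    p = rng()
    # bisect_right returns the first index whose cumulative entry exceeds p.
    r = bisect.bisect_right(cumulative, p)
    # Clamp in case rounding left the last entry just below 1.
    return min(r, len(cumulative) - 1) + 1
```

For the table built from P = (.5, .3, .2), a uniform draw of .49 falls in the first segment and yields 1, a draw of .79 yields 2, and a draw of .85 yields 3.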
The lookup method works well when n is fairly small. The alternative alias method takes only constant time per element generated, but uses more space.
This method starts by building an alias table table[1...n] that holds an
alias probability and an alias value for each element. We denote these values
table[r].prob and table[r].alias.
The generation code starts with a random real rx, generated uniformly from the real range [1, n + 1). This value is separated into its integer part r (uniform on 1...n) and its real part x (uniform on [0, 1)). The integer r becomes the table index, and the real x is compared to table[r].prob. Depending on the comparison, the code generates either r or table[r].alias:
double rx = (rngReal()*n)+1; // real on [1..n+1)
int r = (int) rx; // integer part
double x = rx - r; // real part
if (x < table[r].prob) return r; else return table[r].alias;
The initialization code to build the alias table is shown in Figure 5.6. The main loop pairs up low-probability elements j with high-probability elements k, so that k becomes the alias for j. The diagram that follows shows how this works for distribution P = (.5, .3, .2). The horizontal line marks the average probability for three elements, equal to 1/3. In the first step some of the excess probability for element 1 is mapped to form the alias for element 3; in the second step the remaining excess for element 1 becomes the alias for element 2. The three values in table[].alias are (x, 1, 1), where x marks an entry that is never used because table[1].prob is 1; the three values in table[].prob are (1, .9, .6), reflecting the cutoffs scaled to [0, 1).
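[Diagram: construction of the alias table for P = (.5, .3, .2), showing elements 1, 2, and 3 before pairing and after each of the two steps.]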
The initialization process starts by creating a set H of elements with higher-than-average probabilities and a set L of elements with lower-than-average probabilities. All probabilities are multiplied by n to simplify the arithmetic, so element i is inserted into H or L according to whether the scaled probability R[i] is more or less than 1.
During the main loop an arbitrary element k is selected from H and an arbitrary element j is selected from L. The table[j] values are assigned, and j is removed from further consideration. The remaining probability for k is calculated; if the result is below average, k is moved from H to L. The initialization step takes O(n) time to build the table; the sets H and L can be implemented with a simple unordered array partitioned around high and low elements.
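The initialization and generation steps of the alias method can be sketched in Python as follows. This is written from the prose description above and is not a copy of the code in Figure 5.6. The function names are hypothetical, `random.random` stands in for `rngReal()`, the sets H and L are kept as two plain lists rather than one partitioned array, and the tables are 0-based internally (the sampler adds 1 so that returned values run from 1 to n). Elements left over when either list empties keep a cutoff of 1, an assumption that absorbs floating-point rounding.

```python
import random

def build_alias_table(probs):
    """Return (prob, alias) lists for the alias method, 0-based."""
    n = len(probs)
    scaled = [p * n for p in probs]  # scaled probabilities; the average is 1
    prob = [1.0] * n                 # cutoff below which the element itself is returned
    alias = list(range(n))           # default alias is the element itself
    low = [i for i, s in enumerate(scaled) if s < 1.0]    # set L
    high = [i for i, s in enumerate(scaled) if s >= 1.0]  # set H
    while low and high:
        j = low.pop()        # arbitrary low element; it is now finished
        k = high[-1]         # arbitrary high element
        prob[j] = scaled[j]
        alias[j] = k
        scaled[k] -= 1.0 - scaled[j]  # k gives up the probability it lent to j
        if scaled[k] < 1.0:           # k is now below average: move it from H to L
            high.pop()
            low.append(k)
    return prob, alias

def alias_sample(prob, alias, rng=random.random):
    """Return a value in 1..n in constant time."""
    rx = rng() * len(prob)  # real on [0, n)
    r = int(rx)             # integer part: the table index
    x = rx - r              # real part on [0, 1)
    return (r if x < prob[r] else alias[r]) + 1
```

For P = (.5, .3, .2) this construction yields cutoffs of approximately (1, .9, .6), with elements 2 and 3 both aliased to element 1, matching the worked example above.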