Allele Frequency Distribution Under Recurrent Selective Sweeps


DOI: 10.1534/genetics.105.048447


Yuseob Kim¹

Department of Biology, University of Rochester, Rochester, New York 14620

Manuscript received July 18, 2005; accepted for publication November 29, 2005

ABSTRACT

The allele frequency of a neutral variant in a population is pushed either upward or downward by directional selection on a linked beneficial mutation ("selective sweeps"). DNA sequences sampled after the fixation of the beneficial allele thus contain an excess of rare neutral alleles. This study investigates the allele frequency distribution under selective sweep models using analytic approximation and simulation. First, given a single selective sweep at a fixed time, I derive an expression for the sampling probabilities of neutral mutants. This solution can be used to estimate the time of the fixation of a beneficial allele from sequence data. Next, I obtain an approximation to mean allele frequencies under recurrent selective sweeps. Under recurrent sweeps, the frequency spectrum is skewed toward rare alleles. However, the excess of high-frequency derived alleles, previously shown to be a signature of single selective sweeps, disappears with recurrent sweeps. It is shown that, using this approximation and multilocus polymorphism data, genome-wide parameters of directional selection can be estimated.

The origin and spread of mutations that increase their carriers' reproductive success are fundamental processes of Darwinian evolution. In contrast to deleterious and neutral mutations, beneficial mutations in plants and animals occur too infrequently to be observed directly in laboratories. As a result, we know much less about beneficial mutations (positive selection) than about neutral or deleterious mutations (purifying selection). Many basic facts about beneficial mutations (their rate, fitness effects, genomic locations, and molecular properties) remain unknown. Recent developments in population genetic theory, however, have suggested some promising approaches to obtaining these quantities. Given that beneficial mutations occur very rarely, basic information about them can be obtained only by inferring past evolutionary events of positive selection. As in other problems of evolutionary inference, the dominant approach is to search present-day sequence polymorphism and diversity for the footprint of past events, in this case, for the signature of genetic hitchhiking.

Genetic hitchhiking (or "selective sweeps") is the sudden loss of genetic variation at neutral loci when a new beneficial allele arises nearby and is fixed in the population (Maynard Smith and Haigh 1974; Kaplan et al. 1989; Stephan et al. 1992; Barton 2000). Recent selective sweeps are expected to generate specific patterns of sequence variation, including a local reduction of variation (Kim and Stephan 2002), a skew in the allele frequency distribution (Braverman et al. 1995; Fay and Wu 2000), and an increase in linkage disequilibrium (Thomson 1977; Przeworski 2002; Kim and Nielsen 2004). Developments of hitchhiking theory, along with the rapid increase of DNA sequence data, have led to the discovery of many episodes of recent positive selection in natural populations (for example, Enard et al. 2002; Clark et al. 2004; Schlenke and Begun 2004).

It is more difficult, however, to measure the cumulative effect of recurrent selective sweeps throughout the genome. The main effect of recurrent selective sweeps is to reduce the standing level of variation, which is well understood (Kaplan et al. 1989; Wiehe and Stephan 1993; Gillespie 2000). In particular, since recombination reduces the hitchhiking effect, a positive correlation between local recombination rate and sequence diversity is predicted under recurrent sweeps. Such a correlation has been observed in Drosophila melanogaster (Begun and Aquadro 1992) and was used to estimate the intensity of directional selection, $\alpha\lambda$, where $\alpha$ is the scaled selection coefficient and $\lambda$ is the frequency of hitchhiking events (Wiehe and Stephan 1993; Stephan 1995; Andolfatto 2001). However, this calculation assumes that selective sweeps alone create the correlation between recombination and genetic variation. In reality, the constant removal of deleterious mutations along the genome by purifying selection ("background selection"; Charlesworth et al. 1993) also reduces the level of variation and contributes to the relationship between recombination and variation (Hudson and Kaplan 1995; Charlesworth 1996). Therefore, fitting observed levels of variation to selective sweep models yields only an overestimate of $\alpha\lambda$.

¹Address for correspondence: Center for Evolutionary Functional Genomics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287-5301. E-mail: [email protected]


A second signature of the hitchhiking effect is evident in the distribution of neutral allele frequencies. With partial recombination, the hitchhiking effect causes neutral variants at intermediate frequencies to increase close to fixation or to decrease close to extinction. A sample of DNA sequences affected by selective sweeps thus contains fewer segregating alleles of intermediate frequency than expected under neutral equilibrium (Braverman et al. 1995). Background selection, on the other hand, only slightly changes the frequency spectrum (Golding 1997; Przeworski et al. 1997). Thus, it may be possible to isolate the effect of directional selection from that of purifying selection by examining frequency spectra throughout the genome. For example, the observation in Andolfatto and Przeworski (2001) that regions of the D. melanogaster genome with reduced crossing over harbor more rare variants can be used to support the hypothesis that recurrent selective sweeps affect neutral loci throughout the genome. Examination of frequency spectra may further allow us to estimate parameters of directional selection. This requires an analytic solution for sampling probabilities under selective sweep models.

Using a simple model of hitchhiking and a well-known diffusion approximation, this study first derives an approximation to the allele frequency distribution for a given time after one selective sweep event. This solution then leads to an approximation under recurrent selective sweeps. These approximations, along with coalescent simulations, are used to investigate patterns of polymorphism under recurrent selective sweeps and to calculate maximum-likelihood estimates of the parameters of directional selection.

SIMULATION METHODS

To investigate the frequency spectrum after recurrent selective sweeps, the coalescent-with-recombination algorithm of Kim and Stephan (2002) is used with modifications (described below). A genealogy (ancestral recombination graph) is constructed for a sample of $n$ sequences. Time is counted backward into the past in units of either $4N$ generations (during the neutral phase) or single generations (during the selective phase). Each sequence is $M$ nucleotides (sites) long, and recombination between adjacent sites occurs with probability $r$ per generation. The ancestral recombination graph starts with $n$ edges at time 0. Recombination and coalescent events occur according to the probabilities given in Kim and Stephan (2002).

Simulations are performed under two different models: a single selective sweep at a given time or recurrent selective sweeps. The genealogy for the single-sweep model has one selective phase, which begins at time $\tau$. This simulation is identical to that in Kim and Stephan (2002), except that it uses the simulated trajectory of beneficial allele frequency (see below). For the recurrent sweep model, the construction of a genealogy begins with the neutral phase, which lasts for $\tau_n$ units ($4N\tau_n$ generations). $\tau_n$ is exponentially distributed with mean $1/\Lambda = 1/(4NM\lambda)$, where $\lambda$ is the rate of strongly selected substitutions per generation per site. When the neutral phase ends, the selective phase begins, followed by another neutral phase. This process is repeated until the height of the ancestral recombination graph (cumulative time from the present) reaches $T_{\mathrm{limit}} = 5$ ($20N$ generations). For each selective phase, the location of the beneficial allele is chosen randomly between 1 and $M$. Before each selective phase begins, the time-dependent frequency of two genetic backgrounds, the beneficial allele ($B$) and the ancestral allele ($b$), is predetermined by a forward simulation as described by Innan and Kim (2004). It is modeled such that the beneficial allele starting from a single copy is fixed in a diploid population of size $N$ in which the fitnesses of genotypes $bb$, $Bb$, and $BB$ are $1$, $1 + 2hs$, and $1 + 2s$, respectively. For these simulations, I used $N = 10^5$, but the outcome of the selective sweep simulations depends little on $N$ as long as $\alpha = 2Ns$ is constant (data not shown).
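The alternation of neutral and selective phases described above can be sketched as follows. This is a minimal illustration, not the simulation code used in the study: the function name `draw_sweep_events` and its arguments are hypothetical, and each selective phase is treated as instantaneous on the $4N$ timescale, which the full algorithm does not assume.

```python
import random

def draw_sweep_events(big_lambda, n_sites, t_limit=5.0, rng=None):
    """Draw (time, position) pairs for sweeps, looking backward in time.

    Times are in units of 4N generations. Selective phases are treated as
    instantaneous on this scale (a simplifying assumption of this sketch).
    """
    rng = rng or random.Random()
    events = []
    elapsed = 0.0
    while True:
        # Neutral phase length: exponential with mean 1 / big_lambda.
        elapsed += rng.expovariate(big_lambda)
        if elapsed >= t_limit:
            break
        # Site of the beneficial allele: uniform on 1..n_sites.
        events.append((elapsed, rng.randint(1, n_sites)))
    return events

if __name__ == "__main__":
    # Lambda = 4 and M = 2 x 10^5 are example values only.
    for time, position in draw_sweep_events(4.0, 200_000, rng=random.Random(1)):
        print(f"sweep at tau = {time:.3f}, site {position}")
```

With rate $\Lambda$ and $T_{\mathrm{limit}} = 5$, the expected number of sweeps in a genealogy under this sketch is $5\Lambda$.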

The allele frequency distribution is observed either at fixed distances from the location of selection (single-sweep model) or from a short (1 or 2 kb long) segment in the center of the chromosome (recurrent sweep model). For the latter, marginal coalescent trees for sites $M/2 - 499$ to $M/2 + 500$ (for a 1-kb segment) are extracted from the recombination graph, and mutations are added on trees with probability $\theta$ per $4N$ generations. From the $k$th replicate of this simulation, $S_{kj}$ sites segregating $j$ mutant alleles are generated ($k = 1, \ldots, K$; $j = 1, \ldots, n - 1$). Because mutations occur on trees of variable length, the total number of segregating sites, $S_k = \sum_j S_{kj}$, may differ among replicates. The sampling probability for $j$ mutants per site, $P_{nj}$, is estimated by $\sum_k S_{kj}/(1000K)$ (for a 1-kb segment) from all $K$ replicates. Since the frequency spectrum ($p_{n1}, p_{n2}, \ldots, p_{n,n-1}$) is defined as the distribution of allele frequency conditional on polymorphism at the site, $p_{nj}$ is estimated from simulations by $(1/K)\sum_k S_{kj}/S_k$.
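The two estimators differ only in how replicates are pooled, which the following sketch makes explicit. The function name and the input layout (`counts[k][j-1]` holding $S_{kj}$) are assumptions made for illustration. Replicates with $S_k = 0$ are skipped when averaging proportions, a choice the text does not specify.

```python
def estimate_from_replicates(counts, segment_length=1000):
    """Estimate P_nj and p_nj from simulated site counts.

    counts[k][j-1] is S_kj: the number of sites with j mutant alleles
    in replicate k (j = 1, ..., n-1).
    """
    n_replicates = len(counts)
    classes = len(counts[0])  # n - 1 frequency classes
    # P_nj: per-site probability, pooled over all replicates.
    sampling_prob = [
        sum(rep[j] for rep in counts) / (segment_length * n_replicates)
        for j in range(classes)
    ]
    # p_nj: mean within-replicate proportion S_kj / S_k.
    # Assumption: replicates without segregating sites are left out.
    polymorphic = [rep for rep in counts if sum(rep) > 0]
    spectrum = [
        sum(rep[j] / sum(rep) for rep in polymorphic) / len(polymorphic)
        for j in range(classes)
    ]
    return sampling_prob, spectrum

if __name__ == "__main__":
    toy_counts = [[2, 1, 1], [1, 0, 1]]  # two made-up replicates, n = 4
    print(estimate_from_replicates(toy_counts))
```

Because $P_{nj}$ pools raw counts while $p_{nj}$ averages proportions, replicates with few segregating sites weigh more heavily in $p_{nj}$ than in $P_{nj}$.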

ANALYTIC APPROXIMATIONS

We consider single-nucleotide polymorphisms (SNPs) in $n$ homologous DNA sequences sampled from a random-mating population of $N$ diploids. The mutation rate is assumed to be very low (i.e., the infinite-sites model is assumed) such that the mutant allele can be unambiguously distinguished from the ancestral allele by using a close outgroup sequence. The probability that a given site is polymorphic with $k$ ($= 1, \ldots, n - 1$) mutant alleles is defined as $P_{nk}$. We denote by $P_{n0}$ the probability that the site is monomorphic. Under the model of neutral evolution in a constant-sized population, $P_{nk} = 4N\mu/k = \theta/k$ (Ewens 2004). Kim and Stephan (2002) obtained an approximation to $P_{nk}$ after a single selective sweep. However, their solution assumes that the sequences are sampled immediately after the fixation of the beneficial mutation. Below, a general solution of $P_{nk}$ for arbitrary time since the fixation event is derived.

Single sweeps: First, $P_{nk}$ after a single selective sweep is investigated. I assume that the fixation of a beneficial allele occurred $t$ generations ago and consider a neutral locus at a distance from the site of selection measured by the recombination fraction $r$. The selective advantage of the beneficial mutation is given by $s$, and genic selection is assumed ($h = 0.5$). I attempt to obtain $P_{nk}$ as a function of $t$. The number of sites segregating a neutral mutant in the frequency interval $[x, x + dx]$ at the time when the beneficial mutation occurs is given by $f(x)\,dx$. If the beneficial mutation occurs on a chromosome carrying this neutral variant, $x$ increases to $y + (1 - y)x$ on average, where $y$ is the mean increase of the identity by descent in the population due to hitchhiking (Gillespie 2000), approximated by $(4Ns)^{-r/s}$ (Kim and Nielsen 2004). This event happens with probability $x f(x)\,dx$ (Fay and Wu 2000). If the beneficial mutation occurs in repulsion phase with the variant, $x$ decreases to $(1 - y)x$ on average, with probability $(1 - x) f(x)\,dx$. This radical change of neutral allele frequency may be modeled as though it occurs instantaneously, as the sojourn time of the beneficial mutation [$2\log(4Ns)/s$ generations] is very short relative to $4N$, the coalescent timescale of the neutral model. If $f(x) = \theta/x$, the expected distribution of mutant allele frequencies under neutral equilibrium, and $r \ll s$, this hitchhiking effect will produce a skewed distribution in which high-frequency mutants are as frequent as low-frequency mutants but intermediate-frequency mutants are rare (Fay and Wu 2000; Kim and Stephan 2002). However, this U-shaped pattern decays quickly due to mutation and genetic drift: low-frequency mutants will increase due to new mutational input, while high-frequency mutants will drift either to lower-frequency classes or to fixation. At $t$ generations after the fixation of the beneficial allele, $P_{nk}$ is approximately

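$$
\begin{aligned}
P_{nk}(t; r) = \binom{n}{k} \Bigg\{ & \int_0^1 E\!\left[X^k (1 - X)^{n-k} \mid y + (1 - y)z,\ t\right] z f(z)\, dz \\
& + \int_0^1 E\!\left[X^k (1 - X)^{n-k} \mid (1 - y)z,\ t\right] (1 - z) f(z)\, dz \\
& + 2N\mu \int_0^t E\!\left[X^k (1 - X)^{n-k} \,\Big|\, \frac{1}{2N},\ t'\right] dt' \Bigg\}
\end{aligned}
\tag{1}
$$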

for $0 < k < n$. $E[X^a(1 - X)^b \mid z, t]$ is the expectation of $X^a(1 - X)^b$, where $X$ is the frequency of a neutral allele that has drifted for $t$ generations starting at frequency $z$. The first two integral terms of Equation 1 represent the probability of sampling $k$ copies of a neutral allele that was already segregating before the hitchhiking event occurred. The third term represents the contribution by neutral variants that entered the population after the sweep. $E[X^a(1 - X)^b \mid z, t]$ can be obtained using the diffusion approximation (Kimura 1955) (see the Appendix).

After rescaling time by $\tau = t/(4N)$, the explicit solution of Equation 1 is given by

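$$
P_{nk}(\tau; r) = \int_0^1 Q_{nk}(z) f(z)\, dz + \theta R_{nk},
\tag{2}
$$

where

$$
Q_{nk}(z) = \binom{n}{k} \sum_{i=0}^{n} \sum_{j=0}^{i} c_{ij}^{(k,\,n-k)} \left\{ z\,[y + (1 - y)z]^j + (1 - z)(1 - y)^j z^j \right\} e^{-i(i-1)\tau}
$$

and

$$
R_{nk} = \binom{n}{k} \left\{ \tau\, c_{11}^{(k,\,n-k)} + \sum_{i=2}^{n} \frac{c_{i1}^{(k,\,n-k)}}{i(i-1)} \left(1 - e^{-i(i-1)\tau}\right) \right\},
$$

ignoring terms on the order of $1/N$ (a sufficiently large $N$ is assumed). Coefficients $c_{ij}^{(a,b)}$ are defined in the Appendix.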

Figure 1 shows the predicted sampling probabilities for eight sequences with various combinations of time and map distance. The predictions from Equation 2 agree well with simulation results, although Equation 2 predicts a slightly greater excess of extreme-frequency classes (frequencies 1 and $n - 1$ in the sample), particularly for $\tau = 0$ (see also Kim and Stephan 2002). This is likely due to the assumption that sweeps are instantaneous, which results in ignoring the effect of genetic drift at neutral loci and the stochastic decay of linkage disequilibrium between the beneficial and neutral alleles during the period of the sweep. In the approximation above, $y$ represents only the deterministic increase in the identity by descent. However, in reality and in the simulations $y$ is a random variable due to genetic drift. Figure 2A shows the distribution of $y$ observed in the forward simulation of a selective sweep. This empirically determined distribution of $y$ can be used to predict the hitchhiking effect on the sampling probability at $\tau = 0$ (Figure 2B, shaded bars). It is shown that including the variance of $y$ produces a sampling probability distribution that is less skewed and is closer to the result of coalescent simulation than the prediction by Equation 2. In addition to the fluctuation of $y$, the genetic drift of the neutral allele along the lineages that escape the hitchhiking-induced coalescence (the frequency of this genetic background is $1 - y$ at the end of the sweep) may further reduce the skew of the frequency distribution.
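The constant-$y$ version of the Figure 2B procedure can be sketched as a short Monte Carlo routine. This is an illustration under stated assumptions rather than the code behind the figure: the function names are hypothetical, the $1/x$ density is truncated to $[1/(2N),\, 1 - 1/(2N)]$ so that it can be sampled (a truncation the text does not describe), and fewer draws are used than the $10^5$ quoted for the figure.

```python
import math
import random

def hitchhiking_y(N, s, r):
    """Deterministic approximation y = (4Ns)^(-r/s) used in the text."""
    return (4 * N * s) ** (-r / s)

def sweep_sampling_probs(n, y, N, draws=20000, seed=0):
    """Probability of k = 0..n mutant copies in a sample, right after a sweep."""
    rng = random.Random(seed)
    lo, hi = 1 / (2 * N), 1 - 1 / (2 * N)  # assumed truncation of the 1/x density
    probs = [0.0] * (n + 1)
    for _ in range(draws):
        # Inverse-CDF draw from a density proportional to 1/x on [lo, hi].
        x = lo * (hi / lo) ** rng.random()
        # With probability x the variant is linked to the beneficial allele
        # and rises; otherwise it is in repulsion phase and falls.
        x_new = y + (1 - y) * x if rng.random() < x else (1 - y) * x
        for k in range(n + 1):
            probs[k] += math.comb(n, k) * x_new**k * (1 - x_new) ** (n - k)
    return [p / draws for p in probs]

if __name__ == "__main__":
    y = hitchhiking_y(N=10_000, s=0.05, r=0.05 * 0.025)
    print(f"y = {y:.3f}")
    for k, p in enumerate(sweep_sampling_probs(8, y, N=10_000)):
        print(k, round(p, 4))
```

With the Figure 2 parameters ($N = 10{,}000$, $s = 0.05$, $r/s = 0.025$) the first function reproduces the value 0.827 quoted in the caption. Replacing the constant `y` with draws from an empirical distribution of $y$ would correspond to the variable-$y$ columns of the figure.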

To connect the current solution to earlier studies, the frequency spectrum obtained here may be transformed into conventional summary statistics of the frequency spectrum: Tajima's $D$ (Tajima 1989a) and Fay and Wu's $H$ (Fay and Wu 2000). Both statistics are expected to be negative following a single selective sweep. Tajima's $D$ compares the number of segregating sites ($S$) and the average number of pairwise differences ($\hat{\theta}_\pi$; Tajima 1983) in the sample. For a given $S$, the prediction of $\hat{\theta}_\pi$ under the current model is

$$
E[\hat{\theta}_\pi] = S \sum_{i=1}^{n-1} p_{ni}\, \frac{2i(n-i)}{n(n-1)},
$$

where $p_{ni} = P_{ni}/(P_{n1} + \cdots + P_{n,n-1})$. Then, the expected Tajima's $D$ might be obtained from Equation 38 of Tajima (1989a), using $E[\hat{\theta}_\pi]$ instead of $\hat{\theta}_\pi$. Fay and Wu's $H$ is the difference between $\hat{\theta}_\pi$ and $\hat{\theta}_H$, the expected homozygosity of mutant alleles summed over polymorphic sites (Fay and Wu 2000). I normalize Fay and Wu's $H$ by using $H = (\hat{\theta}_\pi - \hat{\theta}_H)/\hat{\theta}_\pi$ instead of $\hat{\theta}_\pi - \hat{\theta}_H$. Mean $H$ might be predicted by $(E[\hat{\theta}_\pi] - E[\hat{\theta}_H])/E[\hat{\theta}_\pi]$, where

$$
E[\hat{\theta}_H] = S \sum_{i=1}^{n-1} p_{ni}\, \frac{2i^2}{n(n-1)}.
$$
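The conversion from a frequency spectrum to these summary statistics can be sketched as below. The function names are hypothetical. For Tajima's $D$ the code uses the standard definition with the usual constants $a_1, a_2, b_1, b_2, c_1, c_2, e_1, e_2$, which I take to be what the text means by Equation 38 of Tajima (1989a); that identification is an assumption.

```python
import math

def expected_theta_pi(p, S):
    """E[theta_pi] for spectrum p, where p[i-1] = p_ni and n = len(p) + 1."""
    n = len(p) + 1
    return S * sum(p[i - 1] * 2 * i * (n - i) / (n * (n - 1)) for i in range(1, n))

def expected_theta_h(p, S):
    """E[theta_H] for the same spectrum."""
    n = len(p) + 1
    return S * sum(p[i - 1] * 2 * i * i / (n * (n - 1)) for i in range(1, n))

def normalized_h(p, S):
    """Normalized Fay and Wu's H = (theta_pi - theta_H) / theta_pi."""
    pi = expected_theta_pi(p, S)
    return (pi - expected_theta_h(p, S)) / pi

def tajima_d(theta_pi, S, n):
    """Tajima's D from theta_pi, S segregating sites, and sample size n."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (theta_pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

if __name__ == "__main__":
    n, S = 15, 20
    a1 = sum(1 / i for i in range(1, n))
    neutral = [(1 / i) / a1 for i in range(1, n)]  # p_ni proportional to 1/i
    print(tajima_d(expected_theta_pi(neutral, S), S, n), normalized_h(neutral, S))
```

Under the neutral spectrum, $p_{ni} \propto 1/i$, both sums reduce to $S/a_1$, so $D$ and the normalized $H$ are zero up to rounding. This gives a quick consistency check on the two formulas.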

Equation 2 can be used to examine how long the skewed distribution of allele frequency lasts. It was shown that the expected heterozygosity recovers to the before-sweep level in approximately $4N$ generations after the sweep: $E[\hat{\theta}_\pi] \approx \theta(1 - y^2 e^{-2\tau})$ (Kim and Stephan 2000). The expectations of relative heterozygosity ($\hat{\theta}_\pi/\theta$), Tajima's $D$, and Fay and Wu's $H$ were calculated for a single sweep with $\alpha = 1000$ and $n = 15$ and are plotted in Figure 3 for various map distances and times after the sweep. The decay of Tajima's $D$ is only slightly faster than the decay of heterozygosity. However, the decay of Fay and Wu's $H$ is much faster, as demonstrated in Przeworski (2002). $H$ even becomes positive after $\tau = 0.1$ in a position-dependent way. This reversal ($\hat{\theta}_\pi > \hat{\theta}_H$) was first observed in Kim and Stephan (2000) and is further examined in the next section. The spatial pattern of Fay and Wu's $H$ through time is rather complicated. It was argued that, to produce negative Tajima's $D$, the recombination fraction from the selective target scaled by the selection coefficient ($r/s$) should not be too small or too large. The optimal $r/s$ that generates the most negative $H$ shifts upward with $\tau$ (this is also true if unnormalized $H = \hat{\theta}_\pi - \hat{\theta}_H$ is used; result not shown).
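The recovery of heterozygosity quoted at the start of this paragraph is simple to tabulate. The sketch below only evaluates $1 - y^2 e^{-2\tau}$ with $y = (4Ns)^{-r/s}$ and $4Ns = 2\alpha$. The function names are hypothetical.

```python
import math

def y_from_alpha(alpha, r_over_s):
    """y = (4Ns)^(-r/s), written with 4Ns = 2 * alpha."""
    return (2.0 * alpha) ** (-r_over_s)

def relative_heterozygosity(alpha, r_over_s, tau):
    """E[theta_pi] / theta, approximated by 1 - y^2 * exp(-2 * tau)."""
    y = y_from_alpha(alpha, r_over_s)
    return 1.0 - y * y * math.exp(-2.0 * tau)

if __name__ == "__main__":
    for tau in (0.0, 0.1, 0.5, 1.0):
        row = [round(relative_heterozygosity(1000, d, tau), 3) for d in (0.0, 0.05, 0.2)]
        print(tau, row)
```

At the selected site itself ($r/s = 0$) the deficit is $e^{-2\tau}$, about 14% at $\tau = 1$, which matches the statement that variation is largely restored after roughly $4N$ generations.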

Recurrent selective sweeps: From the sampling probability for a single, isolated sweep given by Equation 1, we can obtain $P_{nk}$ under a model of recurrent selective sweeps. Consider a chromosome $M$ nucleotides long. Recombination occurs with probability $r$ between any two adjacent nucleotide sites per generation. It is assumed that each nucleotide site mutates into a beneficial allele with selection coefficient $s$ (no dominance). All sites on the chromosome have equal probability, $\lambda$, of fixing a beneficial mutation each generation ($\lambda \approx 4Nus$, where $u$ is the rate of new beneficial mutation per site). It is assumed that the site at which the variation is observed (focal site) is located $M_L$ nucleotides away from the left end and $M_R$ nucleotides from the right end ($M = M_L + M_R$). Under this model, the amount and pattern of variation at the focal site depend on the hitchhiking effect of recurrent substitutions on the chromosome (Wiehe and Stephan 1993; Braverman et al. 1995; Gillespie 2000). Averaging over time, an equilibrium distribution of mutant allele frequency will be established. Assume that, under this equilibrium, a neutral mutant at the focal site is found in the frequency interval $[x, x + dx]$ with probability $f(x)\,dx$. Let $P_{nk}$ be the corresponding sampling probability. Again, it is assumed that the substitution of a beneficial mutation occurs instantaneously. Going backward in time, the waiting time until the last fixation of a beneficial mutation is given by $\tau$, which is exponentially distributed with mean $1/(4MN\lambda) = 1/\Lambda$. Since equilibrium is assumed, the allele frequency distributions both at present and at time $\tau$ are given by $f(x)$. Then,

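$$
P_{nk} \approx \frac{1}{M} \int_{-M_L}^{M_R} \int_0^{\infty} P_{nk}(\tau; r|m|)\, \Lambda e^{-\Lambda\tau}\, d\tau\, dm = \int_0^1 \bar{Q}_{nk}(z) f(z)\, dz + \theta \bar{R}_{nk},
\tag{3}
$$

where

$$
\bar{Q}_{nk}(z) = \frac{1}{M} \int_{-M_L}^{M_R} \int_0^{\infty} Q_{nk}(z)\, \Lambda e^{-\Lambda\tau}\, d\tau\, dm = \binom{n}{k} \sum_{i=0}^{n} \sum_{j=0}^{i} c_{ij}^{(k,\,n-k)}\, h_j(z)\, \frac{\Lambda}{i(i-1) + \Lambda},
$$

$$
h_j(z) = \frac{1}{M} \int_{-M_L}^{M_R} \left[ z \left\{ z + (1 - z)(4Ns)^{-|m|r/s} \right\}^j + (1 - z) \left\{ z \left( 1 - (4Ns)^{-|m|r/s} \right) \right\}^j \right] dm,
$$

and

$$
\bar{R}_{nk} = \frac{1}{M} \int_{-M_L}^{M_R} \int_0^{\infty} R_{nk}\, \Lambda e^{-\Lambda\tau}\, d\tau\, dm = \binom{n}{k} \sum_{i=1}^{n} \frac{c_{i1}^{(k,\,n-k)}}{i(i-1) + \Lambda}.
$$

Figure 1. Sampling probabilities of neutral mutant alleles after a single selective sweep event ($n = 8$), for various distances from the selective target and time of the fixation of the beneficial allele. Analytic prediction is obtained from Equation 2. The simulation result is based on $10^5$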
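The position average $h_j(z)$ has no simple closed form for general $j$, but it is easy to evaluate numerically. The sketch below uses a midpoint rule. The function name, argument names, and the number of steps are illustrative choices, not part of the original method.

```python
def h_j(z, j, N, s, r, M_left, M_right, steps=2000):
    """Midpoint-rule approximation to h_j(z), averaging over sweep positions m."""
    M = M_left + M_right
    width = M / steps
    total = 0.0
    for step in range(steps):
        m = -M_left + (step + 0.5) * width
        # Hitchhiking strength at distance |m| from the focal site.
        y = (4 * N * s) ** (-abs(m) * r / s)
        coupled = z * (z + (1 - z) * y) ** j    # variant linked to the sweep
        repulsed = (1 - z) * (z * (1 - y)) ** j  # variant in repulsion phase
        total += (coupled + repulsed) * width
    return total / M

if __name__ == "__main__":
    # Example values: alpha = 2Ns = 1000 and R = 4NMr = 400 with M = 2 x 10^5.
    params = dict(N=1e5, s=0.005, r=5e-9, M_left=1e5, M_right=1e5)
    for j in range(4):
        print(j, h_j(0.3, j, **params))
```

Two identities follow directly from the integrand and serve as checks: $h_0(z) = 1$, and $h_1(z) = z$, the latter because a sweep leaves the expected frequency of a linked neutral variant unchanged.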

As $n$ increases, $P_{nk}$ should approach $f(k/n)(1/n)$. Then, with sufficiently large $n$, we may approximate Equation 3 by

$$
P_{nk} \approx \sum_{j=1}^{n-1} \bar{Q}_{nk}\!\left(\frac{j}{n}\right) P_{nj} + \theta \bar{R}_{nk}.
$$

Therefore, the sampling probabilities under the model of recurrent sweeps are given by

$$
\mathbf{P} = \theta (\mathbf{I} - \mathbf{Q})^{-1} \mathbf{R},
\tag{4}
$$

where $\mathbf{P} = (P_{n1}, \ldots, P_{n,n-1})^T$, $\mathbf{R} = (\bar{R}_{n1}, \ldots, \bar{R}_{n,n-1})^T$, and $\mathbf{Q}$ is an $(n - 1) \times (n - 1)$ matrix whose element in the $i$th row and $j$th column is $\bar{Q}_{ni}(j/n)$. The frequency spectrum ($p_{n1}, p_{n2}, \ldots, p_{n,n-1}$) is defined as the sampling distribution conditional on polymorphism. Therefore, under recurrent sweeps, it is approximated by

$$
p_{ni} = P_{ni} \Big/ \sum_{j=1}^{n-1} P_{nj}.
\tag{5}
$$
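Equations 4 and 5 amount to solving one linear system and normalizing the result. The sketch below assumes that the matrix $\mathbf{Q}$ and the vector $\mathbf{R}$ have already been computed from the Appendix coefficients, which are not reproduced here. The numbers in the usage example are made up purely to exercise the code, and the helper names are hypothetical. A plain Gauss-Jordan elimination is used so that the example needs no external library.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(size):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for k in range(col, size + 1):
                    aug[row][k] -= factor * aug[col][k]
    return [aug[i][size] / aug[i][i] for i in range(size)]

def recurrent_sweep_probs(Q, R, theta):
    """Return (P, p): sampling probabilities (Eq. 4) and spectrum (Eq. 5)."""
    size = len(R)
    # Build I - Q and solve (I - Q) P = theta * R.
    lhs = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(size)] for i in range(size)]
    P = solve_linear(lhs, [theta * value for value in R])
    total = sum(P)
    return P, [value / total for value in P]

if __name__ == "__main__":
    Q = [[0.2, 0.1], [0.0, 0.5]]  # hypothetical 2 x 2 example (n = 3)
    R = [1.0, 0.5]                # hypothetical values
    print(recurrent_sweep_probs(Q, R, theta=0.01))
```

Solving the system directly avoids forming $(\mathbf{I} - \mathbf{Q})^{-1}$ explicitly. Since $\theta$ only scales $\mathbf{P}$, the spectrum returned as the second element does not depend on it.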

Figure 4A shows that sampling probabilities given by Equation 4 are in good agreement with results from simulation of recurrent sweeps. As in Figure 1, the approximations of $P_{n1}$ and $P_{n,n-1}$ are slightly larger than those observed in the simulations. However, the frequency spectrum observed in the same set of simulated data shows the opposite result (Figure 4B): the analytic approximation ($p_{ni}$) predicts less skew of the frequency spectrum than that observed in the simulation. This discrepancy occurs due to the negative correlation between the number of segregating sites in the sample and the skew of the frequency spectrum. For example, if the last hitchhiking event occurs very recently and near the center, the simulated data will have smaller $S$ and greater skew of the frequency spectrum. As the sampling probability estimated from simulation is not normalized by $S$, it is expected to be less skewed than the frequency spectrum. Equation 5 is transformed to summary statistics as described in the section on single sweeps. Table 1 shows that the predicted Tajima's $D$ is less negative than the average observed in simulations, as expected from comparisons of the analytic and simulation results shown in Figure 4B. Despite this bias, the predicted Tajima's $D$ is still useful for tracking the general change of the frequency spectrum with changing parameters. First, as observed in Braverman et al. (1995), Tajima's $D$ decreases linearly with the reduction of variability (measured by $E[\hat{\theta}_W]$, where $\hat{\theta}_W$ is Watterson's estimate of $\theta$; Watterson 1975) relative to the neutral equilibrium (Figure 5). This result is in agreement with Figure 3, in which the heterozygosity and Tajima's $D$ return to equilibrium at approximately the same rate after a single sweep. However, the slope of the linear relationship between $\hat{\theta}_W$ and Tajima's $D$ changes with the scaled map length of the sequence ($R = 4NMr$) relative to the strength of selection ($\alpha$). With smaller $R/\alpha$, more negative Tajima's $D$ is obtained for the same reduced level of variation. This was also observed in the simulation (Table 1, cases 2–4).

Figure 2. (A) Distribution of $y$ at the end of a selective sweep. A two-locus forward simulation using the method in Kim and Stephan (2000) was conducted with $N = 10{,}000$, $s = 0.05$, and $r/s = 0.025$. (B) Sampling probability for $n = 8$ was determined by transforming an allele frequency $x$, randomly chosen from distribution $f(x) = \theta/x$, to $y + (1 - y)x$ with probability $x$ or to $(1 - y)x$ with probability $1 - x$. This procedure was repeated $10^5$ times and the resulting distribution of allele frequency was converted to the sampling probability. Columns with light shading are obtained by using a constant $y$ that is the mean of the distribution shown in A [mean = 0.815; $(4Ns)^{-r/s} = 0.827$]. Using variable $y$ in A results in probabilities shown by columns with dark shading. Solid columns represent the results from coalescent simulation ($\alpha = 1000$, $\theta = 0.01$,

Fay and Wu's $H$ with recurrent selective sweeps is positive ($\hat{\theta}_\pi > \hat{\theta}_H$) for all parameter sets examined, including those in Table 1. Figure 4B shows that, with a severe skew of the frequency spectrum caused by recurrent hitchhiking events, more mutant alleles are found at high frequency than at intermediate frequency. However, this skew does not translate into a negative $H$ because, compared to the frequency spectrum under neutrality (curves in Figure 4B), there is no excess of high-frequency mutants. On the other hand, the proportion of low-frequency mutants is much greater than that under the neutral expectation. A recent study using coalescent simulation also found that Fay and Wu's $H$ becomes positive under recurrent sweeps (Haddrill et al. 2005).

One may argue that this results from an excessively large proportion of singletons, particularly mutant alleles with frequency 1, relative to high-frequency derived alleles after selective sweeps. However, in all cases of recurrent sweeps in Table 1, Fay and Wu's $H$ is still positive even if singletons are removed from the data sets.

This is an interesting result, as previous studies found that the excess of high-frequency derived alleles (negative $H$) is an important signature of selective sweeps (Fay and Wu 2000; Kim and Stephan 2002; Przeworski 2002). However, these studies modeled recent selective sweeps with no further events in the past. This "single-sweep" model predicts a coalescent tree with long inner branches; i.e., most gene lineages find a very recent common ancestor, but one or two lineages escape the early coalescence via recombination [see Fay and Wu's (2000) Figure 2B]. Mutations mapped on this tree will be found at either very low or high frequency in the sample. In the recurrent sweep model, however, the probability of finding such a tree will be small because gene lineages that have escaped the coalescence at the most recent hitchhiking event are still subject to coalescence at hitchhiking events further back in the past. To confirm this explanation, I performed new simulations in which, after the first hitchhiking event at time $\tau_n$ (distributed exponentially with mean $1/\Lambda$), no more selective sweeps in the past were allowed. The mean values of $H$ from those simulations are negative (Table 1, cases 7–9).

Figure 3. Decay of skewed frequency spectrum after a single selective sweep. The expectations of relative heterozygosity ($\hat{\theta}_\pi/\theta$), Tajima's $D$ (given $S = 20$), and normalized Fay and Wu's $H$ are plotted along the distance from the selective target (given by $r/s$) for increasing values of $\tau$.

Figure 4. Allele frequency distribution of neutral mutant alleles under recurrent selective sweeps ($n = 15$, $\alpha = 2000$, $\theta = 0.01$, $M = 2 \times 10^5$, $R = 400$). Shaded and solid columns show analytic approximation and coalescent simulation, respectively. (A) Probability of sampling $i$

Finally, I examined the effect of the dominance of beneficial mutations on the frequency spectrum. For similar degrees of reduction in $\hat{\theta}_W$, Tajima's $D$ becomes slightly more negative with increasing dominance (Table 1, cases 10–12).

PARAMETER ESTIMATIONS OF DIRECTIONAL SELECTION

In this section, I examine whether the analytic approximation obtained above can be used to estimate parameters of directional selection from sequence data. First, Equation 2 may allow the inference of the time since the hitchhiking event, if only one selective sweep event has occurred in the recent past. In Kim and Stephan (2002), the composite likelihood of the data under the selective sweep model was calculated assuming that sequences were sampled immediately after the fixation of the beneficial allele ($\tau = 0$). Here, Equation 2 is used to perform a likelihood-ratio test as described in Kim and Stephan (2002) but with $\tau$ being an unknown variable. Przeworski (2003) also estimated $\tau$ using a different maximum-likelihood approach, a rejection-sampling method based on three summary statistics: $S$, Tajima's $D$, and the number of distinct haplotypes.

TABLE 1

Simulation of recurrent selective sweeps ($\alpha = 1000$, $\theta = 0.01$, $M_L = M_R = 10^5$)

| Case | $R/\alpha$ | $\Lambda$ | $\hat{\theta}_W$ (a) | $\hat{\theta}_\pi$ (a) | $D$ | $H$ |
|---|---|---|---|---|---|---|
| 1 | 0.4 | 2 | 0.697 [0.735] | 0.618 [0.652] | −0.562 [−0.412] | 0.068 [0.070] |
| 2 | 0.4 | 4 | 0.581 [0.596] | 0.471 [0.484] | −0.835 [−0.686] | 0.084 [0.102] |
| 3 | 1 | 8 | 0.555 [0.565] | 0.451 [0.466] | −0.796 [−0.634] | 0.063 [0.072] |
| 4 | 2 | 13 | 0.584 [0.562] | 0.493 [0.480] | −0.681 [−0.521] | 0.009 [0.064] |
| 5 | 0.4 | 10 | 0.396 [0.398] | 0.272 [0.273] | −1.19 [−1.10] | 0.114 [0.123] |
| 6 | 0.1 | 12 | 0.296 [0.289] | 0.173 [0.166] | −1.46 [−1.43] | 0.346 [0.341] |
| 7 (b) | 0.4 | 2 | 0.773 | 0.706 | −0.471 | −0.095 |
| 8 (b) | 0.4 | 4 | 0.691 | 0.597 | −0.678 | −0.227 |
| 9 (b) | 0.4 | 10 | 0.627 | 0.522 | −0.811 | −0.613 |
| 10 (c) | 0.4 | 10 | 0.441 | 0.328 | −1.01 | 0.089 |
| 11 | 0.4 | 8 | 0.439 [0.445] | 0.318 [0.319] | −1.09 [−1.02] | 0.106 [0.121] |
| 12 (d) | 0.4 | 8 | 0.431 | 0.303 | −1.15 | 0.161 |

Simulation results are based on 5000 replicates. Theoretical expectations are given in brackets.
(a) Mean value of estimates relative to $\theta$.
(b) Only one selective sweep occurs in the genealogy (see text).
(c) Dominance of the beneficial mutation ($h$) is 0.2.
(d) Dominance of the beneficial mutation ($h$) is 0.8.

Figure 5. Correlation between Tajima's $D$ and relative reduction of genetic variation under recurrent selective sweeps predicted by analytic approximation. It is assumed that sequence data ($n = 25$) are taken from a 1-kb segment in the middle of the chromosome and $\theta = 0.01$. The sampling probability given by Equation 4 was transformed to the expected level of variation ($E[\hat{\theta}_W]/\theta$), where $E[\hat{\theta}_W] = \sum_{i=1}^{n-1} P_{ni}$, and to expected Tajima's $D$, using $S = 10^3 E[\hat{\theta}_W]$. Solid curve: $\alpha$ increases from 150 to 5000 with $\Lambda = 4$, $R = 400$. Dashed curves: $\Lambda$ increases from 1 to 12 with $\alpha = 1000$ and $R = 400$, 1000, and 2000.

The test was applied to simulated data sets of 15 sequences (10 kb long, $4Nr = 0.05$, $\theta = 0.005$). Fixation of the beneficial mutation occurs in the middle of the sequence at time $\tau$. Three values of $\tau$ were used: 0, 0.05, and 0.1. Table 2 shows the mean and standard deviation of the estimate of sweep time ($\tau$) as well as those of the strength ($\alpha$) and location ($X$) of the beneficial allele. Different tests were performed depending on whether the parameters $\theta$ and $X$ are known or unknown and whether derived/ancestral states of alleles are distinguished. If the true value of $\theta$ is assumed to be unknown, Watterson's estimate of $\theta$ from the data itself is used for the composite likelihood. Estimation of $\tau$ is poor in general: standard deviations of $\hat{\tau}$ are greater than means in many cases. Using the true value of $\theta$ slightly improves the accuracy of $\hat{\tau}$. It is sometimes possible to infer the location of the target of directional selection from external information (for example, Enard et al. 2002). However, having the true value of $X$ does not improve the estimation of $\hat{\tau}$ and $\hat{\alpha}$. Not distinguishing derived vs. ancestral states of alleles generally results in poorer estimates, especially with larger $\tau$. When $\tau = 0$, mean $\hat{\tau}$ ranged from 0.002 to 0.018. Given that the coalescence of gene lineages due to hitchhiking occurs mainly when the frequency of the beneficial allele is low, the behavior of the genealogy during much of the selective phase is similar to that under neutrality. Therefore, the estimate of $\tau$ when $\tau = 0$ should reflect the length of the selective phase. The length of the selective phase in this case is $(1/\alpha)\ln(2\alpha) \approx 0.007$, in agreement with the range of $\hat{\tau}$ obtained.

Next, I asked if Equations 4 and 5 can be used to estimate genomic parameters of directional selection under a simple model of recurrent selective sweeps. I assume that selective sweeps occur at a rate $\lambda$ per generation per nucleotide, regardless of recombination rates that may vary along the genome. The strength of directional selection, $\alpha$, is constant. Since the distribution of allele frequency captured in the analytic approximation is only the average over many realizations of selective sweeps for a given parameter set, it will be essential to use data from many independently evolving loci. Assume that the data set consists of $L$ such loci. The recombination rate for each locus is assumed known. Let $S_{ij}$ be the number of nucleotide sites with $j$ mutant alleles from the $i$th locus ($j = 0, \ldots, n_i - 1$; $S_{i0}$ is defined to be the number of monomorphic sites). A composite likelihood based on the sampling probability given by Equation 4 can then be defined as

$$
\mathrm{CL}(\alpha, 4N\lambda) = \sum_{i=1}^{L} \sum_{j=0}^{n_i - 1} S_{ij} \log(P_{ij}).
\tag{6}
$$

Note that the likelihood is a function of 4Nl, not l, because the effect of recurrent sweeps depends on the frequency of hitchhiking events on the coalescent timescale (4N). The composite likelihood based on the frequency spectrum is identical except that p_ij is used instead of P_ij and terms with j = 0 disappear. Remember that the sampling probability can be calculated only when the scaled mutation rate, u, of the locus is known, while the frequency spectrum does not depend on u. The estimates of a and 4Nl (or l if N is known) may be found by maximization of the above function.
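The composite likelihood of Equation 6 and its maximization can be sketched in a few lines. This is a minimal illustration, not the code used for the analyses reported here: the probabilities P_ij come from Equation 4, which is not reproduced in the sketch, so they are supplied by a caller-provided function `model(a, rate)`. That callable, the function names, and the grid search are all assumptions made for illustration.

```python
import math


def composite_log_likelihood(site_counts, probabilities):
    """Sum of S_ij * log(P_ij) over loci i and frequency classes j (Equation 6).

    site_counts[i][j]   : S_ij, number of sites at locus i with j mutant alleles
    probabilities[i][j] : P_ij, the matching sampling probability
    """
    total = 0.0
    for counts_i, probs_i in zip(site_counts, probabilities):
        if len(counts_i) != len(probs_i):
            raise ValueError("counts and probabilities must have the same shape")
        for s_ij, p_ij in zip(counts_i, probs_i):
            if s_ij == 0:
                continue  # an empty class contributes nothing to the sum
            if p_ij <= 0.0:
                return float("-inf")  # observed data impossible under the model
            total += s_ij * math.log(p_ij)
    return total


def maximize_on_grid(site_counts, model, a_grid, rate_grid):
    """Brute-force search; returns (best log likelihood, a, rate).

    model(a, rate) is a hypothetical stand-in for Equation 4 and must return
    probabilities with the same shape as site_counts.
    """
    best = None
    for a in a_grid:
        for rate in rate_grid:
            value = composite_log_likelihood(site_counts, model(a, rate))
            if best is None or value > best[0]:
                best = (value, a, rate)
    return best
```

For the frequency-spectrum version, the same function applies if the j = 0 column is dropped from both inputs and p_ij is passed in place of P_ij. A grid search is used only because it makes the likelihood surface easy to inspect; any numerical optimizer could be substituted.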

This approach was tested against simulated data sets of L = 30, n_i = n = 15, and u = 0.01. One data set is produced by simulating 30 loci individually and then combining them. The ith locus was simulated with a

TABLE 2

Parameter estimates of single selective sweep (mean ± standard deviation)

        Derived/ancestral distinguished                          Derived/ancestral not distinguished
t       t̂                  â (×10⁻³)          X̂ (×10⁻³)          t̂                  â (×10⁻³)          X̂ (×10⁻³)

u = û_W, X unknown
0       0.0017 ± 0.0043    0.98 ± 0.51        5.03 ± 0.96        0.0013 ± 0.0030    0.86 ± 0.49        5.03 ± 0.92
0.05    0.031 ± 0.035      0.72 ± 0.38        5.02 ± 1.31        0.0079 ± 0.0229    0.65 ± 0.35        5.00 ± 1.05
0.1     0.084 ± 0.066      0.60 ± 0.36        4.92 ± 1.44        0.017 ± 0.031      0.50 ± 0.25        4.95 ± 1.18

u known, X unknown
0       0.0070 ± 0.0189    1.58 ± 1.09        5.05 ± 0.96        0.0078 ± 0.0041    1.66 ± 1.25        5.02 ± 0.92
0.05    0.047 ± 0.036      1.64 ± 1.15        4.97 ± 1.11        0.026 ± 0.045      1.50 ± 1.07        4.99 ± 1.02
0.1     0.094 ± 0.046      1.77 ± 1.41        4.99 ± 1.13        0.045 ± 0.058      1.51 ± 1.58        5.00 ± 1.21

u = û_W, X known
0       0.0090 ± 0.0202    0.90 ± 0.49        NA                 0.015 ± 0.047      0.75 ± 0.46        NA
0.05    0.056 ± 0.053      0.71 ± 0.38        NA                 0.066 ± 0.157      0.62 ± 0.35        NA
0.1     0.132 ± 0.109      0.61 ± 0.35        NA                 0.094 ± 0.137      0.55 ± 0.69        NA

u and X known
0       0.018 ± 0.031      1.61 ± 1.09        NA                 0.028 ± 0.068      1.82 ± 1.87        NA
0.05    0.061 ± 0.040      1.65 ± 1.07        NA                 0.084 ± 0.290      2.70 ± 11.5        NA
0.1     0.112 ± 0.057      1.82 ± 1.44        NA                 0.143 ± 0.537      3.01 ± 15.5        NA


recombination rate given by 4Nr_i = (i + 1) × 10⁻³ (i = 1, . . ., L) and with chromosome length M_i = 0.8a/(4Nr_i). (Here each “chromosome” models an independently evolving segment on the actual chromosome. M_i is approximately the size of the region affected by a beneficial mutation of strength a; if the scaled recombination rate is >0.4a, the heterozygosity decreases at most by 6%.) Polymorphism is observed in a 2-kb-long segment at the center of each chromosome. A total of 1000 replicate data sets were generated with a = 1000 and 4Nl = 4 × 10⁻⁵. The first four replicates were used to obtain the profile of the composite likelihood using the frequency spectrum in the parameter space of a and 4Nl (Figure 6). The profile for each replicate has a plateau over the region defined by a constant product of a and 4Nl. This suggests that the strength and the rate of selective sweeps may not be estimated separately from frequency spectra of multilocus genomic data. The same conclusion was drawn previously, using the heterozygosity of data: the effect of hitchhiking is given by the product al (Wiehe and Stephan 1993; Stephan 1995). However, it is not simple to estimate the composite parameter al using Equation 3 because this equation is not expressed by this single variable.
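The per-locus simulation settings described above can be tabulated with a short script. This is a sketch under stated assumptions: the function name `locus_design` is hypothetical, and 4Nr_i is taken to be a per-nucleotide rate so that M_i comes out in nucleotides, which the text implies but does not state explicitly.

```python
def locus_design(num_loci=30, a=1000.0):
    """Return a list of (4N r_i, M_i) pairs for loci i = 1, ..., num_loci.

    4N r_i = (i + 1) x 10^-3   scaled recombination rate (assumed per nucleotide)
    M_i    = 0.8 a / (4N r_i)  simulated chromosome length (assumed in nucleotides)
    """
    design = []
    for i in range(1, num_loci + 1):
        rho_i = (i + 1) * 1e-3
        length_i = 0.8 * a / rho_i
        design.append((rho_i, length_i))
    return design


for locus, (rho_i, length_i) in enumerate(locus_design(), start=1):
    print(locus, rho_i, round(length_i))
```

By construction, the product of 4Nr_i and M_i equals 0.8a at every locus, so each end of a simulated chromosome lies at a scaled recombination distance of 0.4a from the central 2-kb segment. This is the distance beyond which the text states that heterozygosity is reduced by at most 6%.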

It is often possible to obtain a separate estimate of the rate of substitutions driven by directional selection in the genome (e.g., Smith and Eyre-Walker 2002). If l is given by an external source, the estimate of a might be obtained. Figure 7 shows the distribution of â when the composite likelihood of the above data was maximized using the correct value of 4Nl (4 × 10⁻⁵). The composite likelihood based on sampling probability, assuming the correct u is known, yields better estimates of a, as expected. Using only the frequency spectrum lowers the accuracy of â, but it still gives reasonably unbiased estimates (mean â = 1045). When the same analyses were performed without distinguishing ancestral/derived alleles, the distribution of â changed slightly (from 1020 [mean] ± 495 [SD] to 1025 ± 521 using the sampling probability; from 1045 ± 647 to 1083 ± 783 using the frequency spectrum). This suggests that genetic information regarding the intensity of directional selection is contained in the excess of rare alleles rather than in the excess of high-frequency derived alleles. The error in the estimation of a shown in Figure 7 is quite large. Adding more independent loci will reduce this error. For example, by increasing L from 30 to 60 (each chromosome is duplicated in the new data set), â improves to 1045 ± 315 using the sampling probability and to 968 ± 392 using the frequency spectrum (distinguishing ancestral/derived alleles; based on 500 replicates).

DISCUSSION

A skew in the site frequency spectrum, i.e., the excess of rare alleles and high-frequency derived alleles compared to the expectation under neutral equilibrium, is well known to be characteristic of the genetic variation resulting from positive directional selection (Braverman et al. 1995; Simonsen et al. 1995; Fay and Wu 2000). The theoretical basis of this effect has been studied for single recent selective sweeps (Kim and Stephan 2002; Kim and Nielsen 2004), in which case one can make a simplifying assumption that genetic drift after the fixation of a beneficial allele is negligible. Relatively little progress has been made for the theory of frequency spectrum under recurrent selective sweeps. This study appears to represent the first step toward a complete mathematical analysis of the frequency spectrum under recurrent selective sweeps.

Figure 6. Contour plots of composite likelihood (Equation 6) calculated for four random sets of multilocus data (n = 15, L = 30, a = 1000, and 4Nl = 4 × 10⁻⁵). Contour lines were drawn in increments of five downward from the maximum value (included in the open area).

Figure 7. Distribution of the maximum composite-likelihood estimate of a when the true value of 4Nl = 4 × 10⁻⁵ is known.

There are two sources of the rare alleles in DNA sequences after a selective sweep. First, a selective sweep may generate a star-like genealogy with many short outer branches. Mutations mapping onto those branches will be found at low frequency in the sample. Second, when only one selective sweep has occurred in the recent past, one or two lineages may escape coalescence during the hitchhiking event and generate long inner branches. Alleles descended along such an “escaped” branch will also be found in low frequency in the sample. Both classes of rare alleles contribute to generate negative values of Tajima’s D. However, there are qualitative differences between the two classes. Only the second class is associated with an excess of high-frequency derived alleles and high linkage disequilibrium (Fay and Wu 2000; Kim and Nielsen 2004). The results of this study suggest that negative Tajima’s D generated under recurrent selective sweeps must be due mainly to the first class of rare alleles, since no excess of high-frequency derived alleles is predicted or observed in the simulations (Table 1). A neutral variant that has increased to a high frequency due to one hitchhiking event is likely to be dragged to fixation during subsequent hitchhiking events. An excess of high-frequency variants thus cannot be maintained under recurrent sweeps. Low-frequency variants, in contrast, are constantly replenished by new mutations and can therefore persist under recurrent sweeps. This result raises concerns about the interpretation of Fay and Wu’s H-test as an attempt to detect positive selection from sequence polymorphism data: a failure to observe a negative H should not be regarded as a failure to detect recent directional selection, particularly if the locus under test is believed to have experienced multiple adaptive substitutions (e.g., with excess nonsynonymous relative to synonymous substitutions). It is, however, possible to observe an excess of high- compared to intermediate-frequency derived alleles under high rates of recurrent sweeps (Figure 4). We may need to develop another summary statistic that can conveniently capture this feature of recurrent selective sweeps.
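The contrast between the two statistics discussed above can be made concrete by computing both from an unfolded site frequency spectrum. The sketch below uses the standard definitions of Tajima’s D (Tajima 1989a) and the unnormalized H = θ_π − θ_H of Fay and Wu (2000); the function name `sfs_summaries` and the input layout are hypothetical choices made for illustration only.

```python
import math


def sfs_summaries(sfs):
    """Return (Tajima's D, Fay and Wu's H) for an unfolded frequency spectrum.

    sfs[i - 1] is the number of sites at which the derived allele occurs in
    i of the n sampled sequences, for i = 1, ..., n - 1.
    """
    n = len(sfs) + 1
    num_seg = sum(sfs)                      # S, number of segregating sites
    pairs = n * (n - 1) / 2.0
    # Mean pairwise difference and the estimator weighted toward high-frequency alleles.
    theta_pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs
    theta_h = sum(x * i * i for i, x in enumerate(sfs, start=1)) / pairs

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    theta_w = num_seg / a1                  # Watterson's estimator

    # Variance constants from Tajima (1989a).
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * num_seg + e2 * num_seg * (num_seg - 1)

    tajima_d = (theta_pi - theta_w) / math.sqrt(variance) if variance > 0 else float("nan")
    fay_wu_h = theta_pi - theta_h
    return tajima_d, fay_wu_h


n = 15
singletons_only = [10] + [0] * (n - 2)      # excess of rare alleles
high_frequency_only = [0] * (n - 2) + [10]  # excess of high-frequency derived alleles
print(sfs_summaries(singletons_only))
print(sfs_summaries(high_frequency_only))
```

In these two artificial spectra Tajima’s D is negative in both cases, whereas H is negative only when high-frequency derived alleles are present. This mirrors the argument in the text that a skew toward rare alleles alone does not produce a negative H.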

The analytic approximations obtained here can be used to estimate genomic parameters of directional selection from multilocus data. However, it should be stressed that these approximations are based on a simple model of recurrent selective sweeps. In this model, selective sweeps are assumed to occur in a random-mating population of constant size and with no deleterious mutation. In reality, the frequency spectrum may be affected by both spatial and temporal changes of population structure. One possible way of removing these confounding factors and isolating the effect of selective sweeps alone is to examine whether the frequency spectrum is more skewed in genomic regions of lower recombination. Recurrent selective sweeps are expected to produce this kind of strong correlation. For example, Andolfatto and Przeworski (2001) observed more negative Tajima’s D in regions of lower crossing over in the D. melanogaster genome and argued that this correlation supports the model of recurrent selective sweeps. We may thus estimate the intensity of selection in the Drosophila genome by finding values that produce a similar profile of Tajima’s D over recombination rate. However, background selection, which was argued to be prevalent in D. melanogaster (Hudson and Kaplan 1995; Charlesworth 1996), makes it difficult to apply the current model to their data.

There are two potentially opposing effects of background selection on the frequency spectrum under the recurrent sweep model. First, because deleterious alleles segregate at low frequency, background selection may further skew the frequency spectrum in the direction of negative Tajima’s D (Charlesworth et al. 1995; Bachtrog 2004), particularly in regions of low recombination. In this case, the intensity of directional selection is likely to be overestimated by fitting the current model to the correlation between recombination and Tajima’s D. However, it is known that background selection can produce substantially negative Tajima’s D only when deleterious mutations are weakly selected and the mutation rate is high (Golding 1997; Przeworski et al. 1997; Bachtrog 2004). For example, purifying selection in Drosophila appears to be too strong to generate significantly negative Tajima’s D (Andolfatto and Przeworski 2001). The second effect of background selection is to reduce the short-term effective population size, also strongly in regions of low recombination. In this case, the effect of selective sweeps on a genealogy diminishes because, for given l, the rate of selective sweeps on the coalescent timescale, 4Nl, becomes smaller. Thus, Tajima’s D may become less negative when background selection is added. The per generation substitution rate l itself will also decrease because the efficacy of directional selection on beneficial mutations decreases by interference from purifying selection at linked loci (Peck 1994; Barton 1995; Kim and Stephan 2000). Therefore, the intensity of directional selection required to explain the observed pattern of the frequency spectrum may increase when background selection is added. In Drosophila, this second effect of background selection is likely to be more important than the first since the reduction of effective population size by background selection is predicted to be substantial (Hudson and Kaplan 1995; Charlesworth 1996).


For example, Haddrill et al. (2005) showed that a simple model of population bottleneck is the most parsimonious explanation for the pattern of the frequency spectrum along the genome of D. melanogaster. Although it might still be possible to confirm selective sweeps by observing a positive correlation between recombination rate and Tajima’s D, if selective sweeps occur along with demographic changes, the estimate of al using the above approximation may have substantial error. Another simplifying assumption of the current model is that a selective sweep starts with one copy of a new beneficial mutation. One recent study suggests, however, that adaptive substitutions starting from standing genetic variation might be as common as those from a new beneficial mutation (Hermisson and Pennings 2005). Selective sweeps starting with multiple copies of a beneficial mutation may produce frequency spectra that are quite different from those expected under the standard model of hitchhiking (Innan and Kim 2004; Hermisson and Pennings 2005).

I thank Wolfgang Stephan, Allen Orr, Andrea Betancourt, Daven Presgraves, Naoyuki Takahata, and two anonymous reviewers for helpful comments on the manuscript. This work was funded by National Science Foundation grant DEB-0449581.

LITERATURE CITED

Andolfatto, P., 2001 Adaptive hitchhiking effects on genome variability. Curr. Opin. Genet. Dev. 11:635–641.

Andolfatto, P., and M. Przeworski, 2001 Regions of lower crossing over harbor more rare variants in African populations of Drosophila melanogaster. Genetics 158:657–665.

Bachtrog, D., 2004 Evidence that positive selection drives Y-chromosome degeneration in Drosophila miranda. Nat. Genet. 36:518–522.

Barton, N. H., 1995 Linkage and the limits to natural selection. Genetics 140:821–884.

Barton, N. H., 2000 Genetic hitchhiking. Philos. Trans. R. Soc. Lond. B 355:1533–1562.

Begun, D. J., and C. F. Aquadro, 1992 Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356:519–520.

Braverman, J. M., R. R. Hudson, N. L. Kaplan, C. H. Langley and W. Stephan, 1995 The hitchhiking effect on the site frequency spectrum of DNA polymorphisms. Genetics 140:783–796.

Charlesworth, B., 1996 Background selection and patterns of genetic diversity in Drosophila melanogaster. Genet. Res. 68:131–149.

Charlesworth, B., M. T. Morgan and D. Charlesworth, 1993 The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–1303.

Charlesworth, D., B. Charlesworth and M. T. Morgan, 1995 The pattern of neutral molecular variation under the background selection model. Genetics 141:1619–1632.

Clark, R. M., E. Linton, J. Messing and J. F. Doebley, 2004 Pattern of diversity in the genomic region near the maize domestication gene tb1. Proc. Natl. Acad. Sci. USA 101:700–707.

Enard, W., M. Przeworski, S. E. Fisher, C. S. L. Lai, V. Wiebe et al., 2002 Molecular evolution of FOXP2, a gene involved in speech and language. Nature 418:869–872.

Ewens, W. J., 2004 Mathematical Population Genetics. I. Theoretical Introduction. Springer-Verlag, New York.

Fay, J. C., and C.-I Wu, 2000 Hitchhiking under positive Darwinian selection. Genetics 155:1405–1413.

Fu, Y.-X., 1997 Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147:915–925.

Gillespie, J. H., 2000 Genetic drift in an infinite population: the pseudohitchhiking model. Genetics 155:909–919.

Golding, G. B., 1997 The effect of purifying selection on genealogies, pp. 271–285 in Progress in Population Genetics and Human Evolution (IMA Volumes in Mathematics and Its Applications, Vol. 87), edited by P. Donnelly and S. Tavare. Springer-Verlag, New York.

Haddrill, P. R., K. R. Thornton, B. Charlesworth and P. Andolfatto, 2005 Multilocus patterns of nucleotide variability and the demographic and selection history of Drosophila melanogaster populations. Genome Res. 15:790–799.

Hermisson, J., and P. S. Pennings, 2005 Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics 169:2335–2352.

Hudson, R. R., and N. L. Kaplan, 1995 Deleterious background selection with recombination. Genetics 141:1605–1617.

Innan, H., and Y. Kim, 2004 Pattern of polymorphism after strong artificial selection in a domestication event. Proc. Natl. Acad. Sci. USA 101:10667–10672.

Kaplan, N. L., R. R. Hudson and C. H. Langley, 1989 The “hitchhiking effect” revisited. Genetics 123:887–899.

Kim, Y., and R. Nielsen, 2004 Linkage disequilibrium as a signature of selective sweeps. Genetics 167:1513–1524.

Kim, Y., and W. Stephan, 2000 Joint effects of genetic hitchhiking and background selection on neutral variation. Genetics 155:1415–1427.

Kim, Y., and W. Stephan, 2002 Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160:765–777.

Kimura, M., 1955 Solution of a process of random genetic drift with a continuous model. Proc. Natl. Acad. Sci. USA 41:144–150.

Maynard Smith, J., and J. Haigh, 1974 The hitch-hiking effect of a favourable gene. Genet. Res. 23:23–35.

Peck, J. R., 1994 A ruby in the rubbish: beneficial mutations, deleterious mutations and the evolution of sex. Genetics 137:597–606.

Przeworski, M., 2002 The signature of positive selection at randomly chosen loci. Genetics 160:1179–1189.

Przeworski, M., 2003 Estimating the time since the fixation of a beneficial allele. Genetics 164:1667–1676.

Przeworski, M., B. Charlesworth and J. D. Wall, 1997 Genealogies and weak purifying selection. Mol. Biol. Evol. 16:246–252.

Schlenke, T. A., and D. J. Begun, 2004 Strong selective sweep associated with a transposon insertion in Drosophila simulans. Proc. Natl. Acad. Sci. USA 101:1626–1631.

Simonsen, K. L., G. A. Churchill and C. F. Aquadro, 1995 Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–429.

Smith, N. G. C., and A. Eyre-Walker, 2002 Adaptive protein evolution in Drosophila. Nature 415:1022–1024.

Stephan, W., 1995 An improved method for estimating the rate of fixation of favorable mutations based on DNA polymorphism data. Mol. Biol. Evol. 12:959–962.

Stephan, W., T. H. E. Wiehe and M. W. Lenz, 1992 The effect of strongly selected substitutions on neutral polymorphism: analytical results based on diffusion theory. Theor. Popul. Biol. 41:237–254.

Tajima, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460.

Tajima, F., 1989a Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595.

Tajima, F., 1989b The effect of change in population size on DNA polymorphism. Genetics 123:597–601.

Thomson, G., 1977 The effect of a selected locus on linked neutral loci. Genetics 85:752–788.

Watterson, G. A., 1975 On the number of segregating sites. Theor. Popul. Biol. 7:256–276.

Wiehe, T. H. E., and W. Stephan, 1993 Analysis of a genetic hitch-hiking model, and its application to DNA polymorphism data from Drosophila melanogaster. Mol. Biol. Evol. 10:842–854.


APPENDIX: CALCULATION OF $E[X^a(1-X)^b \mid z, t]$

Consider a neutral allele with initial frequency $z$ reaching frequency $X$ after $t$ generations of random genetic drift. An approximate solution for the $n$th moment of $X$, $E[X^n] = E[X^n \mid z, t]$, can be obtained from the following diffusion approximation first described by Kimura (1955):

$$\frac{dE[X^n]}{dt} = \frac{n(n-1)}{4N}\left\{E[X^{n-1}] - E[X^n]\right\} \quad (n = 1, 2, \ldots). \tag{A1}$$

His solution is expressed as an infinite series. Unfortunately, this series converges very slowly for small $t$ ($< N$), the range this study mainly considers. However, from an examination of the exact solutions to the above equation for small $n$ [e.g., $E[X] = z$, $E[X^2] = z - z(1-z)\exp(-t/2N)$, and $E[X^3] = z - \tfrac{3}{2}z(1-z)\exp(-t/2N) + (z^3 - \tfrac{3}{2}z^2 + \tfrac{1}{2}z)\exp(-3t/2N)$], it can be easily shown that the solution for arbitrary $n$ takes the form

$$E[X^n \mid z, t] = \sum_{i=1}^{n} f_i^{(n)}(z)\, e^{-(i(i-1)/4N)t} \quad (n \ge 1),$$

where $f_i^{(n)}(z)$ is a polynomial function of $z$. Inserting $E[X^n]$ and $E[X^{n-1}]$ of the above form into Equation A1, we obtain

$$f_i^{(n)}(z) = f_i^{(n-1)}(z)\,\frac{n(n-1)}{n(n-1) - i(i-1)} \quad (1 \le i < n). \tag{A2}$$

Since $E[X^n] = z^n$ for $t = 0$,

$$f_n^{(n)}(z) = z^n - \sum_{i=1}^{n-1} f_i^{(n)}(z). \tag{A3}$$
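The recursion defined by Equations A2 and A3, started from $f_1^{(1)}(z) = z$ (which follows from $E[X] = z$), can be sketched as follows. This is an illustrative implementation rather than the program used for this study: the function names are hypothetical, polynomials are stored as lists of exact `Fraction` coefficients in increasing powers of $z$, and `mixed_moment` obtains $E[X^a(1-X)^b \mid z, t]$ by expanding $(1-X)^b$ binomially, the same route this appendix takes.

```python
from fractions import Fraction
import math


def moment_polynomials(n_max):
    """Return f with f[n][i] = coefficients of f_i^{(n)}(z), 1 <= i <= n <= n_max."""
    f = {1: {1: [Fraction(0), Fraction(1)]}}          # f_1^{(1)}(z) = z
    for n in range(2, n_max + 1):
        f[n] = {}
        running_sum = [Fraction(0)] * (n + 1)
        for i in range(1, n):
            # Equation A2: rescale f_i^{(n-1)}(z).
            factor = Fraction(n * (n - 1), n * (n - 1) - i * (i - 1))
            f[n][i] = [factor * c for c in f[n - 1][i]]
            for j, c in enumerate(f[n][i]):
                running_sum[j] += c
        # Equation A3: f_n^{(n)}(z) = z^n - sum_{i<n} f_i^{(n)}(z).
        last = [-c for c in running_sum]
        last[n] += 1
        f[n][n] = last
    return f


def moment(n, z, t, N, f=None):
    """E[X^n | z, t] as the finite sum of f_i^{(n)}(z) exp(-i(i-1)t/(4N))."""
    if n == 0:
        return 1.0
    f = f or moment_polynomials(n)
    value = 0.0
    for i, coeffs in f[n].items():
        poly = sum(float(c) * z ** j for j, c in enumerate(coeffs))
        value += poly * math.exp(-i * (i - 1) * t / (4.0 * N))
    return value


def mixed_moment(a, b, z, t, N):
    """E[X^a (1 - X)^b | z, t] via (1 - X)^b = sum_j C(b, j) (-1)^j X^j."""
    f = moment_polynomials(max(a + b, 1))
    return sum(math.comb(b, j) * (-1) ** j * moment(a + j, z, t, N, f)
               for j in range(b + 1))


# Expected heterozygosity factor E[X(1 - X)] after t = 500 generations, N = 1000.
print(mixed_moment(1, 1, z=0.3, t=500.0, N=1000.0))
```

Exact rational arithmetic is used for the coefficients because the ratios in Equation A2 and the subtraction in Equation A3 can lose precision in floating point as $n$ grows. For $n = 2$ and $n = 3$ the polynomials produced this way match the closed forms for $E[X^2]$ and $E[X^3]$ quoted in this appendix.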

$E[X^n \mid z, t]$ is thus given by obtaining $f_i^{(n)}(z)$ by recursion using Equations A2 and A3 and $f_1^{(1)}(z) = z$. Most importantly, this exact solution has a finite number of terms. $f_i^{(n)}(z)$ can be written as $\sum_{j=0}^{i} c_{ij}^{(n)} z^j$. Coefficients $c_{ij}^{(n)}$ are readily obtained from the above recursion. $E[X^a(1-X)^b \mid z, t]$ is now obtained from

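$$\begin{aligned}
E[X^a(1-X)^b \mid z, t] &= \sum_{j=0}^{b} \binom{b}{j} (-1)^j E[X^{a+j} \mid z, t] \\
&= \sum_{j=0}^{b} \binom{b}{j} (-1)^j \sum_{l=0}^{a+j} f_l^{(a+j)}(z)\, e^{-(l(l-1)/4N)t} \\
&= \sum_{i=0}^{a+b} f_i^{(a,b)}(z)\, e^{-(i(i-1)/4N)t},
\end{aligned}$$

where

$$f_i^{(a,b)}(z) = \sum_{j=\max(0,\, i-a)}^{b} \binom{b}{j} (-1)^j f_i^{(a+j)}(z).$$

$f_i^{(a,b)}(z)$ can be written as $\sum_{k=0}^{i} c_{ik}^{(a,b)} z^k$. Therefore, $c_{ik}^{(a,b)} = \sum_{j=\max(0,\, i-a)}^{b} \binom{b}{j} (-1)^j c_{ik}^{(a+j)}$. The table of $c_{ik}^{(a,b)}$ is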