Estimation of Allele Frequencies From High-Coverage Genome-Sequencing Projects


DOI: 10.1534/genetics.109.100479

Estimation of Allele Frequencies From High-Coverage

Genome-Sequencing Projects

Michael Lynch

Department of Biology, Indiana University, Bloomington, Indiana 47405

Manuscript received January 6, 2009; accepted for publication March 6, 2009

ABSTRACT

A new generation of high-throughput sequencing strategies will soon lead to the acquisition of high-coverage genomic profiles of hundreds to thousands of individuals within species, generating unprecedented levels of information on the frequencies of nucleotides segregating at individual sites. However, because these new technologies are error prone and yield uneven coverage of alleles in diploid individuals, they also introduce the need for novel methods for analyzing the raw read data. A maximum-likelihood method for the estimation of allele frequencies is developed, eliminating both the need to arbitrarily discard individuals with low coverage and the requirement for an extrinsic measure of the sequence error rate. The resultant estimates are nearly unbiased with asymptotically minimal sampling variance, thereby defining the limits to our ability to estimate population-genetic parameters and providing a logical basis for the optimal design of population-genomic surveys.

VERY soon, the field of population genomics will be overwhelmed with enormous data sets, including those generated by the "1000-genome" projects being planned for human, Drosophila, and Arabidopsis. With surveys of this magnitude, it should be possible to go well beyond summary measures of nucleotide diversity for large genomic regions to refined estimates of allele-frequency distributions at specific nucleotide sites. Such information will provide a deeper understanding of the mechanisms of evolution operating at the genomic level, as the properties of site-frequency distributions are functions of the relative power of mutation, drift, selection, and historical demographic events (which transiently modify the power of drift) (Kimura 1983; Ewens 2004; Keightley and Eyre-Walker 2007; McVean 2007). Accurate estimates of allele frequencies are also central to association studies for QTL mapping (Lynch and Walsh 1998), to the development of databases for forensic analysis (Weir 1998), and to the ascertainment of pairwise individual relationships (Weir et al. 2006).

If the promise of high-throughput population-genomic data is to be fulfilled, a number of analytical challenges will need to be met. High-throughput sequencing strategies lead to the unbalanced sampling of parental alleles (in nonselfing diploid species) and also introduce errors that can result in uncertainty about individual genotypic states (Johnson and Slatkin 2007; Hellmann et al. 2008; Lynch 2008; Jiang et al. 2009). Unless they are statistically accounted for, both types of problems will lead to elevated estimates of low-frequency alleles, especially singletons, potentially leading to exaggerated conclusions about purifying selection. Such aberrations are of concern because, for economic reasons, many plans for surveying large numbers of genomes envision the pursuit of relatively low (2×) depth-of-coverage sequence for each individual, a sampling design that will magnify the variance of parental-allele sampling, while also rendering the error-correction problem particularly difficult (Lynch 2008).

Simple ad hoc strategies for dealing with light-coverage data sets include the restriction of site-specific analyses to individuals that by chance have adequate coverage for a high probability of biallelic sampling (e.g., Jiang et al. 2009). However, such an approach can result in the discarding of large amounts of data, increasing the lower bound on the frequency of alleles that can be determined (as a result of the reduction in sample size). In addition, although methods exist for flagging and discarding low-quality reads (Ewing and Green 1998; Ewing et al. 1998; Huse et al. 2007), such treatment is unable to eliminate other sources of error. With low-coverage data, singleton reads cannot be discarded without inducing significant bias, as such treatment will also lead to the censoring of true heterozygotes with unbalanced parental-allele sampling.

Thus, there is a clear need for novel methods for efficiently estimating nucleotide frequencies at individual sites, not only to maximize the information that can be extracted from population-level samples, but also to guide in the design of such surveys. In the latter context, questions remain as to the relative merits of increasing the number of individuals sampled vs. the depth of sequence coverage per individual, e.g., a 1000-genome project involving 2× coverage per individual, as opposed to a 500-genome project with 4× coverage, etc. Because many genetic analyses rely on low-frequency markers (SNPs), there is also a need to know the limits to our ability to estimate the frequencies of rare alleles, particularly in the face of error rates of comparable or even higher levels.

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.100479/DC1.

As a first step toward addressing these issues, a maximum-likelihood (ML) method is developed for the estimation of allele frequencies at individual nucleotide sites. This method is shown to behave optimally in that the estimates are nearly unbiased with sampling variances asymptotically approaching the expectation under pure genotypic sampling at a high depth of sequence coverage. Moreover, the proposed approach eliminates the need for an extrinsic measure of the read-error rate, using the data themselves to remove this nuisance parameter, thereby reducing potential biases resulting from site-specific variation in error-generating processes.

THEORY AND ANALYSIS

Throughout, we assume that there are no more than two actual nucleotides segregating per site, a situation that experience has shown to be almost universally true at the population level [average nucleotide heterozygosity in most diploid species is generally well below 0.1, even at neutral sites (Lynch 2007)]. It is also assumed that prior to analysis the investigator has properly assembled the sequence reads to ensure that paralogous regions (including mobile elements and copy-number variants) that might aggregate to the same sites have been removed from the analysis. This is, of course, a consideration for all methods of sequence analysis, but with high-coverage projects involving small sequence reads, it is more manageable because unusually high depths of coverage can flag problematical regions.

For the classical situation in which the genotypes of N individuals have been determined without error, assuming a population in Hardy–Weinberg equilibrium, the likelihood that the frequency of the leading allele (denoted as 1) is equal to p is simply

$$L(p) = K\left[p^2\right]^{N_{11}}\left[2p(1-p)\right]^{N_{12}}\left[(1-p)^2\right]^{N_{22}}, \tag{1}$$

where N11, N12, and N22 are the numbers of times the three genotypes appear in the sample, and K is a constant that accounts for the multiplicity of orderings of individuals within the sample (which has no bearing on the likelihood analysis). The ML estimate of p, which maximizes L(p), has the well-known analytical solution (which also applies in the face of Hardy–Weinberg deviations) $\hat{p} = (N_{11} + 0.5\,N_{12})/N$ (Weir 1996).
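The closed-form solution can be checked numerically. Below is a minimal Python sketch (function and variable names are illustrative, not from the paper) that computes the estimate from genotype counts and confirms it against a brute-force scan of Equation 1's log likelihood:

```python
from math import log

def allele_freq_ml(n11, n12, n22):
    """Closed-form ML estimate of the leading-allele frequency
    from error-free genotype counts (Equation 1)."""
    return (n11 + 0.5 * n12) / (n11 + n12 + n22)

def log_lik(p, n11, n12, n22):
    # Log of Equation 1, dropping the constant K.
    return (n11 * log(p**2) + n12 * log(2 * p * (1 - p))
            + n22 * log((1 - p)**2))

# Example: 50 major homozygotes, 40 heterozygotes, 10 minor homozygotes.
p_hat = allele_freq_ml(50, 40, 10)
p_grid = max((i / 1000 for i in range(1, 1000)),
             key=lambda p: log_lik(p, 50, 40, 10))
print(p_hat, abs(p_grid - p_hat) < 1e-9)  # 0.7 True
```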

Unfortunately, uncertainty in the identity of individual genotypes, resulting from incomplete sampling and erroneous reads, substantially complicates the situation in a random-sequencing project, as there are no longer just three categories of individuals. Rather, each site in each individual will be characterized by a specific read-frequency array (the number of reads for A, C, G, and T). This necessitates that the full likelihood be broken up into two components, the first associated with the sampling of reads within individuals and the second associated with the sampling of individual genotypes (as above). Assuming no more than two alleles per site, the data can then be further condensed as the analysis proceeds.

For example, letting 1 and 2 denote the putative major and minor nucleotides at a site in the population (with respective frequencies above and below 0.5) and 3 denote the error bin containing incorrect reads to the remaining two nucleotides, for any candidate major/minor nucleotide combination, each individual surveyed at the site can be represented by an array (n1, n2, n3), where the three entries denote the numbers of reads at this site that coincide with the candidate major nucleotide, the minor nucleotide, and an erroneous read. Given the observed reads at the site for N individuals, our goal is to identify the major/minor nucleotides at the site and to estimate their frequencies. There are 12 possible arrangements of major and minor nucleotides, A/C, A/G, . . ., G/T, and the likelihoods of each of these alternatives must be considered sequentially.

We again start with the assumption of a population in Hardy–Weinberg equilibrium and further assume a homogeneous error distribution, such that each nucleotide is equally likely to be recorded erroneously as any other nucleotide with probability e/3 (the total error rate per site being e). The composite-likelihood function for a particular read configuration is the sum of three terms, each the product of the probability of a particular genotype and the probability of the read configuration conditional on being that genotype,

$$P(n_1, n_2, n_3 \mid n, p, e) = p^2 f_e\!\left(n_2, n_3; n, \tfrac{e}{3}, \tfrac{2e}{3}\right) + (1-p)^2 f_e\!\left(n_1, n_3; n, \tfrac{e}{3}, \tfrac{2e}{3}\right) + 2p(1-p)\, f_e\!\left(n_3; n, \tfrac{2e}{3}\right) p\!\left(n_1; n_1+n_2, 0.5\right). \tag{2}$$

The first two terms in this expression denote the contributions to the likelihood assuming the genotype is homozygous for the major or minor alleles, with the functions f_e(x, y; n, e/3, 2e/3) denoting joint probabilities of x errors to the alternative allele and y errors to the remaining two nonexistent alleles, assumed here to follow a trinomial distribution with n = n1 + n2 + n3. The third term accounts for heterozygotes, with f_e(n3; n, 2e/3) being the probability of n3 errors to non-1/2 nucleotides, with the total error rate being reduced by one-third to discount internal (unobservable) errors, and p(n1; n1 + n2, 0.5) being the probability of sampling n1 copies of the major allele among the remaining (n − n3) reads consistent with the designated genotype (Lynch 2008).
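To make the structure of Equation 2 concrete, the following self-contained Python sketch (names are illustrative) evaluates the read-array probability and checks that it is a proper distribution over all read arrays at a fixed coverage:

```python
from math import comb

def binom_pmf(k, n, q):
    return comb(n, k) * q**k * (1 - q)**(n - k)

def trinom_pmf(k1, k2, k3, q1, q2, q3):
    n = k1 + k2 + k3
    return comb(n, k1) * comb(n - k1, k2) * q1**k1 * q2**k2 * q3**k3

def read_config_prob(n1, n2, n3, p, e):
    """Equation 2: probability of read array (n1, n2, n3) at one site,
    mixing over the three genotypes under Hardy-Weinberg."""
    n = n1 + n2 + n3
    # Homozygotes: correct reads at rate 1 - e, errors to the other
    # segregating allele at e/3, errors to the two absent nucleotides at 2e/3.
    hom_major = p**2 * trinom_pmf(n1, n2, n3, 1 - e, e / 3, 2 * e / 3)
    hom_minor = (1 - p)**2 * trinom_pmf(n2, n1, n3, 1 - e, e / 3, 2 * e / 3)
    # Heterozygote: n3 errors out of n at rate 2e/3, and the n1 + n2 reads
    # consistent with the genotype split 50:50 between the parental alleles.
    het = (2 * p * (1 - p) * binom_pmf(n3, n, 2 * e / 3)
           * binom_pmf(n1, n1 + n2, 0.5))
    return hom_major + hom_minor + het

# Sanity check: summed over every read array with n = 4, the total is 1.
n = 4
total = sum(read_config_prob(a, b, n - a - b, p=0.3, e=0.01)
            for a in range(n + 1) for b in range(n + 1 - a))
print(round(total, 12))  # 1.0
```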

It is informative to start with the simple situation in which each individual is sequenced to the same depth of coverage (n), in which case the log likelihood for the entire data set, conditional on a particular candidate major/minor nucleotide pair, major-allele frequency (p), and error rate (e), is

$$L = \sum N(n_1, n_2, n_3)\,\ln\!\left[P(n_1, n_2, n_3 \mid n, p, e)\right], \tag{3}$$

where the unit of analysis is N(n1, n2, n3), the number of individuals sampled with read configuration (n1, n2, n3), and the summation is over all observed configurations. For sites with n× coverage, there are 1 + [n(n + 3)/2] possible read-array configurations for each major/minor nucleotide combination satisfying n = n1 + n2 + n3. To obtain the ML solution, this expression needs to be evaluated over all possible combinations of major and minor nucleotides to determine the joint combination of major/minor allele identities, major-allele frequency (p), and error rate (e) that maximizes L.
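As a concrete illustration of Equation 3, the sketch below performs a brute-force grid search over p and e for a single, fixed candidate major/minor pair (the grid bounds and resolution are arbitrary illustrative choices, and the per-array probability restates Equation 2):

```python
from math import comb, log

def config_prob(n1, n2, n3, p, e):
    # Equation 2: mixture over the three genotypes under Hardy-Weinberg.
    def multi(ks, qs):
        out, rem = 1.0, sum(ks)
        for k, q in zip(ks, qs):
            out *= comb(rem, k) * q**k
            rem -= k
        return out
    return (p**2 * multi((n1, n2, n3), (1 - e, e / 3, 2 * e / 3))
            + (1 - p)**2 * multi((n2, n1, n3), (1 - e, e / 3, 2 * e / 3))
            + 2 * p * (1 - p) * multi((n3, n1 + n2), (2 * e / 3, 1 - 2 * e / 3))
              * multi((n1, n2), (0.5, 0.5)))

def ml_grid(configs, steps=100):
    """Maximize Equation 3 by brute force over a (p, e) grid.
    `configs` maps read arrays (n1, n2, n3) to counts of individuals."""
    best = None
    for i in range(1, steps):
        p = i / steps
        for j in range(1, steps + 1):
            e = 0.05 * j / steps  # search e over (0, 0.05]
            ll = sum(cnt * log(config_prob(*arr, p, e))
                     for arr, cnt in configs.items())
            if best is None or ll > best[0]:
                best = (ll, p, e)
    return best[1], best[2]

# Example: 100 individuals at 4x coverage from a site with p near 0.8.
configs = {(4, 0, 0): 64, (2, 2, 0): 32, (0, 4, 0): 4}
p_hat, e_hat = ml_grid(configs)
print(p_hat, e_hat)
```

A full implementation would repeat this search for each of the 12 candidate major/minor nucleotide pairs and keep the overall maximum.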

To evaluate the behavior of Equations 2 and 3, stochastic simulations were performed by drawing random samples of N individuals from populations in Hardy–Weinberg equilibrium, with a constant sequence coverage per site n. Errors were randomly assigned to each read with probability e, such that each nucleotide had a probability of being erroneously recorded as any other with probability e/3. A relatively high error rate of e = 0.01 was assumed to evaluate the situation under the worst possible conditions. The ML estimates for each simulation were obtained by thorough grid searches of the full parameter space for p and e for each possible combination of major and minor alleles, for 250 to 500 replications.

The results of these analyses indicate that, provided the coverage is >1×, the proposed method yields estimates of p that are essentially unbiased, with the variance among replicate samples rapidly approaching the expectation based on individual sampling alone, p(1 − p)/(2N), once the coverage exceeds 4× (Figure 1). Small downward biases in the frequency estimates for rare alleles can arise when the sample size is low enough to cause rare-allele sampling to be extremely sporadic and the coverage is also low enough to cause sporadic sampling of errors. The problem is most severe with sequences with 1× coverage, as there is then no information within individuals on heterozygosity. However, even with 2× coverage, this bias is minor enough to be of little concern in most applications, as when it is likely to be of quantitative importance, it is also overwhelmed by the sampling variance. Moreover, as is demonstrated below, under the usual situation of variable coverage, even 1× sites are less problematic, as they supplement the information from sites with higher coverage.

The remaining issue is the extension of the above approach to sites with variable coverage among individuals. In this case, the absolute probabilities of the various configuration arrays are now functions of the probabilities of the various coverage levels, but the latter influence only the arbitrary constant in the log likelihood, so the approaches outlined above can be used without further modification, other than the summation over all configurations at all coverage levels in Equation 3. Evaluation of this procedure by computer simulation, assuming a Poisson distribution of read numbers across individuals and including 1×-covered sites in the analysis, yielded no substantive differences from the patterns illustrated in Figure 1. The estimator is essentially unbiased over all estimable allele frequencies (minor-allele frequencies ≥ 1/(2N)), and the sampling variance of the estimate p̂ asymptotically approaches p(1 − p)/(2N) at high coverages.

Figure 1.—Biases (top plot) and coefficients of sampling …

We may now inquire as to the optimal sampling strategy for accurate allele-frequency estimates under a fixed sequencing budget, which is assumed here to be proportional to T = Nn, the expected number of bases sequenced per site in the sample (with N individuals, each sequenced to average depth of coverage n). This effort function assumes that the bulk of the cost of data acquisition involves sequencing rather than the preparation/procurement of individual samples. Low coverage will magnify the problem of sampling variance within individuals, while also resulting in a fraction of individuals that are entirely lacking in observations at individual sites (e^−n under a Poisson distribution of coverages). On the other hand, under the constraint of constant Nn, increased sequencing depth per individual will result in a reduction in the number of individuals that can be assayed, thereby reducing the ability to detect rare alleles.

Assuming error rates in the range of e = 0.001–0.01, if allele-frequency estimation is the goal, it appears that there is often no advantage to sequencing above 1× coverage, except for alleles with frequencies on the order of e, where the sampling variance of p̂ is minimized at 1–3× coverage (Figure 2). Indeed, under the assumption of constant Nn as small as 500, for most allele frequencies, the sampling variance of p̂ actually continues to decline at coverages well below 1× provided e < 0.01, and in no case is there a gain in power above 2× coverage.
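The budget tradeoff can be made concrete with the Poisson zero class: at mean depth n, a fraction e^−n of individuals yields no reads at a given site. A quick sketch (the budget T and the depths shown are arbitrary illustrations, not values from the paper):

```python
from math import exp

T = 1000  # fixed sequencing effort per site: T = N * n
for n in (0.5, 1, 2, 4, 8):
    N = T / n                     # individuals affordable at this mean depth
    missing = exp(-n)             # Poisson fraction with zero reads at a site
    observed = N * (1 - missing)  # expected individuals contributing data
    print(f"n={n:>4}: N={N:6.0f}  P(no reads)={missing:.3f}  observed={observed:7.1f}")
```

Under this constraint, halving the depth doubles N while losing only the e^−n zero-coverage fraction, which is why low mean depths can retain more informative individuals per unit of effort.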

DISCUSSION

Contrary to previous studies of nucleotide variation in which careful enough attention has been given to individual sequences that genotypes are unambiguous, with the new generation of high-throughput sequencing strategies, quality is sacrificed for enormous quantity. Genotypes are then incompletely penetrant, in that one or both alleles need not be revealed in a subset of individuals, and those that are sampled are sometimes recorded erroneously. The method proposed herein provides a logical and efficient means for estimating allele frequencies in the face of these problems. Although the applications of the model in this article to simulated data involved brute-force searches of the full range of potential parameter space (the code for which is available as supporting information, File S1 and File S2), iterative methods will likely be desirable for large-scale studies involving millions of sites, and some derivations essential to such approaches are provided in the Appendix.

Two primary advantages of the proposed method are (1) its use of the full data set, which eliminates the need for arbitrary decisions on cutoffs for adequate coverage, and (2) its ability to internally separate the incidence of sequence errors from true variation, which eliminates the necessity of relying on external measures of the read-error rate that do not incorporate all sources of error. As the error rates assumed in a number of the simulations in this study are presumably near the upper end of what will occur in ultra-high-throughput methods now under development [although likely still too low in the case of ancient DNA samples (Briggs et al. 2007; Gilbert et al. 2008)], the fact that the estimates are nearly unbiased with asymptotic sampling variances defined by the number of individuals alone suggests that the ML method provides an optimal solution to the ascertainment of frequencies of both common and rare alleles, just as it does in the conventional situation where genotypes are assumed to be known without error (Weir 1996). Moreover, the fact that the proposed approach provides joint estimates of p and e on a site-by-site basis provides a potentially strong advantage in situations where the error rate might be heterogeneous, e.g., dependent on the total sequence context.

Figure 2.—Average sampling standard deviations for estimates of nucleotide frequencies (p̂) for rare alleles, obtained from computer simulations (250–1000 replicate samples for each set of conditions) for the situation in which individuals are subject to variable depth-of-coverage sequencing, given for two error rates (solid and open symbols denote e = 0.001 and 0.01, respectively). The left three symbols for each plot denote average coverages of 0.1, 0.25, and 0.5 per site. For each allele frequency, results in the top plot are given for a fixed number of 5000 sequences per site over the entire sample, i.e., T = 5000 = Nn, such that a doubling of the number of individuals is balanced by a 50% reduction in the sequence coverage per site. For the bottom plot, T = 500.

With this methodology in hand, it should be possible to ascertain the form of the site-frequency spectrum for broad classes of genomic positions down to a very fine level. Indeed, provided the depth of coverage per site is sufficiently high, the proposed method has the potential to yield accurate allele-frequency estimates even when p is smaller than the error rate, as is often the case with disease-associated alleles. Thus, with large enough sample sizes of individuals, there are no significant barriers to estimating allele frequencies ≤ 0.01.

The ML allele-frequency estimator extends previous methods for estimating nucleotide heterozygosity (π) from high-throughput data from one or a small number of individuals (Johnson and Slatkin 2007; Hellmann et al. 2008; Lynch 2008; Jiang et al. 2009). For a given genomic region sequenced from multiple individuals without error, π is generally estimated as $2\hat{p}(1-\hat{p})\,(2N)/(2N-1)$ averaged over all sites, the terms containing 2N correcting for the bias associated with individual sampling, which induces variance in the estimate of p equal to p(1 − p)/(2N) (Nei 1978). It can be shown that when sequence errors are present, the sampling variance of p̂ is elevated to $[1 + (4e/3)^2]\,p(1-p)/(2N) + e(1-4p)^2/(3Nn)$, but with e ≤ 0.01 and N ≫ 10, heterozygosities per nucleotide site will still be adequately estimated as $2\hat{p}(1-\hat{p})$.

A number of variants of the model assumed herein can be readily implemented. For example, it is straightforward to allow for non-Hardy–Weinberg conditions. For two alleles 1 and 2, this simply requires the incorporation of two unknown genotype frequencies (P11 and P12), with the third being constrained to be (1 − P11 − P12), and hence the estimation of just one additional unknown (Weir 1996). A simple likelihood-ratio test, contrasting the fit from the full model with that under Hardy–Weinberg assumptions, can then be used to evaluate whether sites are in Hardy–Weinberg equilibrium. By a similar extension, it is possible to test for interpopulation divergence by evaluating whether the likelihood of the full data set for multiple population samples is significantly improved by allowing for population-specific estimates of allele frequencies.
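A minimal helper for such a likelihood-ratio test, assuming the one-extra-parameter case (the log-likelihood values in the example are made up for illustration):

```python
from math import erf, sqrt

def lrt_pvalue_df1(ll_full, ll_null):
    """P-value for a likelihood-ratio test with one extra free parameter:
    2 * (ll_full - ll_null) ~ chi-square with 1 d.f. under the null."""
    stat = max(2.0 * (ll_full - ll_null), 0.0)
    # For 1 d.f., X = Z^2, so P(X > s) = 2 * (1 - Phi(sqrt(s))).
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(sqrt(stat) / sqrt(2.0))))

# A site whose full (non-HWE) fit improves the log likelihood by 1.92
# sits right at the conventional 5% threshold.
print(round(lrt_pvalue_df1(-1230.10, -1232.02), 3))  # 0.05
```

The interpopulation-divergence test sketched above is the same calculation with the appropriate number of extra parameters (and hence degrees of freedom).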

The assumption of a homogeneous site-specific error process can also be relaxed by allowing for unique error rates for any major/minor allele combination. Although there are 12 possible types of single-base substitution errors, for any particular major/minor nucleotide combination (under the two-allele model), a maximum of 4 are relevant to the specific likelihood expression: the reciprocal errors between the major and minor alleles and the movement of each of the two true nucleotides into the error bin. Again, the necessity of these types of embellishments can be evaluated by likelihood-ratio tests.

The overall analysis indicates that unless there is a need for specific information on each individual, from the standpoint of procuring accurate population estimates of allelic frequencies (and other associated parameters), there is little advantage in pursuing high-coverage data per individual. Indeed, given a fixed sequencing budget, an overemphasis on depth of coverage will result in a reduction in the accuracy of population-level parameters, as the optimal design for estimating most allele frequencies employs an average coverage of <1×, even though this magnifies the fraction of individuals with no data. Of course, if the ultimate goal is the procurement of accurate genotypic profiles for individuals (as, for example, in association-mapping studies), there is no substitute for relatively high coverages.

Finally, although the focus of this article has been on the development of an optimally efficient method for extracting allele-frequency estimates from high-throughput sequencing projects, the utility of any specific application will ultimately depend on the quality of the sequence assembly prior to analysis. The complex genomes of multicellular organisms are typically laden with repetitive DNAs, duplicate genes, and mobile elements, which will misassemble to a degree declining with the evolutionary distances between paralogous sequences. Prescreening of the data for regions of exceptionally high depth of coverage, combined with the masking of duplicated DNA (when a reference genome is available), will aid in the elimination of such problematical sequences. An additional potential source of error may be a bias in assembling reads to reference genomes, an issue that is likely to become of diminishing importance with increasing read lengths.

Two anonymous reviewers provided helpful comments on this manuscript. This work was funded by National Science Foundation grant EF-0827411, National Institutes of Health grant GM36827, and MetaCyte funding from the Lilly Foundation.

LITERATURE CITED

Briggs, A. W., U. Stenzel, P. L. Johnson, R. E. Green, J. Kelso et al., 2007 Patterns of damage in genomic DNA sequences from a Neanderthal. Proc. Natl. Acad. Sci. USA 104: 14616–14621.

Edwards, A. W. F., 1972 Likelihood. Cambridge University Press, New York.

Ewens, W., 2004 Mathematical Population Genetics, Ed. 2. Springer-Verlag, New York.

Ewing, B., and P. Green, 1998 Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186–194.

Ewing, B., L. Hillier, M. C. Wendl and P. Green, 1998 Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175–185.

Gilbert, M. T., D. L. Jenkins, A. Götherström, N. Naveran, J. J. Sanchez et al., 2008 DNA from pre-Clovis human coprolites in Oregon, North America. Science 320: 786–789.

Hellmann, I., Y. Mang, Z. Gu, P. Li, F. M. De La Vega et al., 2008 Population genetic analysis of shotgun assemblies of genomic sequence from multiple individuals. Genome Res. 18: 1020–1029.

Huse, S. M., J. A. Huber, H. G. Morrison, M. L. Sogin and D. M. Welch, 2007 Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. 8: R143.

Jiang, R., S. Tavaré and P. Marjoram, 2009 Population genetic inference from resequencing data. Genetics 181: 187–197.

Johnson, P. L., and M. Slatkin, 2007 Accounting for bias from sequencing error in population genetic estimates. Mol. Biol. Evol. 25: 199–206.

Keightley, P. D., and A. Eyre-Walker, 2007 Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies. Genetics 177: 2251–2261.

Kimura, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.

Lynch, M., 2007 The Origins of Genome Architecture. Sinauer Associates, Sunderland, MA.

Lynch, M., 2008 Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects. Mol. Biol. Evol. 25: 2421–2431.

Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Sunderland, MA.

McVean, G., 2007 The structure of linkage disequilibrium around a selective sweep. Genetics 175: 1395–1406.

Nei, M., 1978 Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics 89: 583–590.

Weir, B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.

Weir, B. S., 1998 Statistical methods employed in evaluation of single-locus probe results in criminal identity cases. Methods Mol. Biol. 98: 83–96.

Weir, B. S., A. D. Anderson and A. B. Hepler, 2006 Genetic relatedness analysis: modern data and new challenges. Nat. Rev. Genet. 7: 771–780.

Communicating editor: R. W. Doerge

APPENDIX

Although site-specific allele frequencies and error rates can be determined by solving the likelihood function (Equation 3) over the full range of parameter space and searching for the global maximum, numerical methods such as Newton–Raphson iteration (Edwards 1972) may provide a more time-economical solution when large numbers of sites are to be assayed. For any candidate pair of major/minor nucleotides, the ML solution satisfies the two equations obtained by setting the partial derivatives of the likelihood function equal to zero,

$$\frac{\partial L}{\partial p} = \sum \frac{N_{123}\,\partial P_{123}/\partial p}{P_{123}}, \qquad \frac{\partial L}{\partial e} = \sum \frac{N_{123}\,\partial P_{123}/\partial e}{P_{123}},$$

where N(n1, n2, n3) and P(n1, n2, n3) are abbreviated as N123 and P123, respectively, with P123, the likelihood of the data, being defined as Equation 2. For the model presented in the text (two alleles, with homogeneous error rates across all nucleotides),

$$\frac{\partial P_{123}}{\partial p} = 2\left[p\,\phi_{e1} + (p-1)\,\phi_{e2} + (1-2p)\,\phi_{e3}\,p_{n1}\right],$$

$$\frac{\partial P_{123}}{\partial e} = p^2\phi_{e1}\!\left[\frac{n_2+n_3}{e} - \frac{n_1}{1-e}\right] + (1-p)^2\phi_{e2}\!\left[\frac{n_1+n_3}{e} - \frac{n_2}{1-e}\right] + 2p(1-p)\,\phi_{e3}\,p_{n1}\!\left[\frac{n_3}{e} - \frac{(2/3)(n_1+n_2)}{1-2e/3}\right],$$

where the terms φe1, φe2, φe3, and p_n1, which are functions of n1, n2, and n3, are notationally abbreviated from their use in Equation 2 by dropping the terms in parentheses.

Estimation of the parameters by iterative methods (as well as approximations of their standard errors from the curvature of the likelihood surface) will generally require the second derivatives,

$$\frac{\partial^2 L}{\partial p^2} = \sum \frac{N_{123}\left\{P_{123}\,\partial^2 P_{123}/\partial p^2 - \left[\partial P_{123}/\partial p\right]^2\right\}}{P_{123}^2},$$

$$\frac{\partial^2 L}{\partial e^2} = \sum \frac{N_{123}\left\{P_{123}\,\partial^2 P_{123}/\partial e^2 - \left[\partial P_{123}/\partial e\right]^2\right\}}{P_{123}^2},$$

$$\frac{\partial^2 L}{\partial p\,\partial e} = \sum \frac{N_{123}\left\{P_{123}\,\partial^2 P_{123}/(\partial p\,\partial e) - \left[\partial P_{123}/\partial p\right]\left[\partial P_{123}/\partial e\right]\right\}}{P_{123}^2},$$

where

$$\frac{\partial^2 P_{123}}{\partial p^2} = 2\left[\phi_{e1} + \phi_{e2} - 2\,\phi_{e3}\,p_{n1}\right],$$

$$\begin{aligned}
\frac{\partial^2 P_{123}}{\partial e^2} ={}& p^2\phi_{e1}\!\left\{\left[\frac{n_2+n_3}{e} - \frac{n_1}{1-e}\right]^2 - \left[\frac{n_2+n_3}{e^2} + \frac{n_1}{(1-e)^2}\right]\right\} \\
&+ (1-p)^2\phi_{e2}\!\left\{\left[\frac{n_1+n_3}{e} - \frac{n_2}{1-e}\right]^2 - \left[\frac{n_1+n_3}{e^2} + \frac{n_2}{(1-e)^2}\right]\right\} \\
&+ 2p(1-p)\,\phi_{e3}\,p_{n1}\!\left\{\left[\frac{n_3}{e} - \frac{(2/3)(n_1+n_2)}{1-2e/3}\right]^2 - \left[\frac{n_3}{e^2} + \frac{(2/3)^2(n_1+n_2)}{(1-2e/3)^2}\right]\right\},
\end{aligned}$$

$$\frac{\partial^2 P_{123}}{\partial p\,\partial e} = 2\left\{p\,\phi_{e1}\!\left[\frac{n_2+n_3}{e} - \frac{n_1}{1-e}\right] + (p-1)\,\phi_{e2}\!\left[\frac{n_1+n_3}{e} - \frac{n_2}{1-e}\right] + (1-2p)\,\phi_{e3}\,p_{n1}\!\left[\frac{n_3}{e} - \frac{(2/3)(n_1+n_2)}{1-2e/3}\right]\right\}.$$
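As a simple illustration of the Newton–Raphson approach, the sketch below applies it to the one-parameter, error-free case of Equation 1, where the answer is known in closed form; for the full two-parameter model, the analytic first and second derivatives given in the Appendix would take the place of these one-dimensional ones:

```python
def newton_ml(grad, hess, x0, tol=1e-10, max_iter=50):
    """Generic one-dimensional Newton-Raphson ascent on a log likelihood."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Error-free genotype counts (Equation 1): the closed-form answer is
# (n11 + 0.5 * n12) / N = 0.7 for the counts below.
n11, n12, n22 = 50, 40, 10
grad = lambda p: (2 * n11 + n12) / p - (n12 + 2 * n22) / (1 - p)
hess = lambda p: -(2 * n11 + n12) / p**2 - (n12 + 2 * n22) / (1 - p)**2
print(round(newton_ml(grad, hess, 0.5), 6))  # 0.7
```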
