Measuring Gametic Disequilibrium From Multilocus Data


Copyright2001 by the Genetics Society of America


Karen L. Ayres and David J. Balding

Department of Applied Statistics, University of Reading, Reading RG6 6FN, United Kingdom

Manuscript received June 20, 1999
Accepted for publication September 19, 2000

ABSTRACT

We describe a Bayesian approach to analyzing multilocus genotype or haplotype data to assess departures from gametic (linkage) equilibrium. Our approach employs a Markov chain Monte Carlo (MCMC) algorithm to approximate the posterior probability distributions of disequilibrium parameters. The distributions are computed exactly in some simple settings. Among other advantages, posterior distributions can be presented visually, which allows the uncertainties in parameter estimates to be readily assessed. In addition, background knowledge can be incorporated, where available, to improve the precision of inferences. The method is illustrated by application to previously published datasets; implications for multilocus forensic match probabilities and for simple association-based gene mapping are also discussed.

Departures from gametic (or linkage) and Hardy-Weinberg (HW) equilibria can provide clues about aspects of population histories and mating behavior (see, e.g., Lewontin 1974) and can be useful in locating disease genes (Jorde 1995; Feder et al. 1996; Nielsen et al. 1998). They also play an important role in the forensic use of DNA profile evidence. Match probability calculations either rely on assumptions of equilibrium (National Research Council 1996) or else allow for patterns of departures that hold in simplified population models (Weir 1994; Balding and Nichols 1995; Ayres and Overall 1999). It is important that the validity of such assumptions in actual populations is verified empirically, as far as is feasible.

Traditional statistical treatments usually focus on testing hypotheses of equilibrium, with recent developments involving randomization tests (e.g., Zaykin et al. 1995; Slatkin and Excoffier 1996). Although they may form a useful first step, such hypothesis tests represent a limited form of statistical inference, since the tests concern only whether or not the data are consistent with equilibrium, rather than directly assessing how large are the departures from precise equilibrium that inevitably exist in real populations (e.g., Smith 1970). In forensic applications, for example, a hypothesis of equilibrium may be rejected with a sufficiently large sample, whereas a forensic scientist may nevertheless believe that the magnitude of the departure is sufficiently small that the hypothesis of equilibrium, though strictly false, is adequate for the application at hand.

Point estimation methods for disequilibrium parameters have been developed (see, e.g., Weir 1979). Here, we propose a Markov chain Monte Carlo (MCMC) method to investigate probability distributions for gametic disequilibrium measures given the data. This extends previous work (Ayres and Balding 1998; Shoemaker et al. 1998) on assessing departures from HW.

Perhaps the most important advantage of our approach is interpretability: the questions of interest are answered directly in terms of probabilities that can conveniently be presented graphically via probability density curves, providing an immediate yet detailed assessment of the variability associated with an estimate. A further advantage is that, since the approach is likelihood based, it is statistically powerful and can incorporate a wide range of modeling assumptions. Previous treatments assume random union of gametes (RUG) to infer population haplotype proportions from genotype data (e.g., Excoffier and Slatkin 1995). Although we also implement the RUG model in the following analyses, we note that other models can be readily applied, such as those that incorporate inbreeding measures.

The choice of prior distribution is sometimes seen as a barrier to the implementation of direct probability, or Bayesian, methods. We introduce a class of hierarchical prior distributions for the haplotype proportions, which allows the scientist some flexibility either to incorporate relevant background information, if desired, or to adopt a relatively "vague" prior.

We illustrate our method by analyzing samples of genotypes at two unlinked loci and at three linked loci. We also briefly discuss its application to forensic identification, and to haplotype data and simple disequilibrium gene mapping. Computer programs (C code) for the MCMC algorithms are available from the authors on request.

Corresponding author: David J. Balding, Department of Applied Statistics, University of Reading, P.O. Box 240, Earley Gate, Reading RG6 6FN, United Kingdom. E-mail: [email protected]


METHODS

Measures of gametic disequilibrium: Genetic equilibrium corresponds to statistical independence, and many authors (see, e.g., Weir 1979) measure gametic disequilibrium in terms of the differences between population haplotype proportions and the values that would be expected under equilibrium, given the allele proportions. Following this approach, for a two-locus haplotype consisting of alleles $A_i$ and $B_j$, we introduce the notation

$$D_{ij} = h_{ij} - p_i q_j, \tag{1}$$

where $h_{ij}$ denotes the population proportion of haplotype $A_iB_j$, while $p_i = \sum_j h_{ij}$ and $q_j = \sum_i h_{ij}$, the proportions of, respectively, alleles $A_i$ and $B_j$.

The range of $D_{ij}$ depends on $p_i$ and $q_j$, which makes cross-locus and cross-population comparisons difficult. To alleviate this problem, Lewontin (1964) defined the normalized difference, with range $[-1, 1]$, by

$$D'_{ij} = D_{ij} / D_{\max},$$

where $D_{\max}$ is

$$D_{\max} = \begin{cases} \min(p_i q_j,\ (1 - p_i)(1 - q_j)) & \text{if } D_{ij} < 0 \\ \min(p_i(1 - q_j),\ (1 - p_i) q_j) & \text{if } D_{ij} > 0. \end{cases}$$

When there are only two alleles at each locus, there is a unique value of $|D'_{ij}|$. Otherwise, it is usually of interest to have a summary measure of the gametic disequilibrium between the two loci; Hedrick (1987) proposed

$$D' = \sum_i \sum_j p_i q_j |D'_{ij}|. \tag{2}$$

The range of $D'$ is $[0, 1]$, independent of the $p_i$ and $q_j$. However, there remain difficulties in interpreting the value of $D'$. Lewontin (1988) noted that values of $D'$ at different loci and in different populations tend to vary with the values of the $p_i$ and $q_j$, so that the problem of cross-locus and cross-population comparisons is not fully overcome by use of $D'$.

Moreover, in practice the range of values of $D'$ consistent with gametic equilibrium is not readily apparent and can vary from locus to locus. Under equilibrium, each $D_{ij}$, and hence $D'$, takes value zero. However, just as a $\chi^2$ goodness-of-fit statistic is unlikely to be very close to zero even when the model is valid, so estimates of $D'$ based on data from equilibrium populations are unlikely to be very close to zero (furthermore, variances of $D'$ are difficult to calculate; see Zapata et al. 1997). Insight into whether or not the data are consistent with gametic equilibrium can be gained by reanalyzing them

Measures of gametic disequilibrium not based on $D_{ij}$ have also been proposed. Smouse (1974) specifies a log-linear model for the $h_{ij}$, with allele-specific parameters $a_i$ and $b_j$, and an interaction term $c_{ij}$ that can be employed as an alternative to $D_{ij}$. Weir (1996, pp. 127–133) details a closely related multiplicative model and extends the analysis to genotype data.

Here, we focus on $D'$ as a summary measure of gametic disequilibrium (together with an extension $D''$, introduced below). This measure is widely used and, although it suffers from the interpretability drawbacks described above, there seems to be no univariate measure that avoids such difficulties. When interest focuses on gametic disequilibrium due to linkage, such as in "simple" genetic mapping, then a natural criterion for choosing between disequilibrium measures is correlation with physical distance, and Devlin and Risch (1995) find that $D'$ has good properties in that setting.

Random union of gametes model: When only genotype counts are available, a model is required to relate the $h_{ij}$ to genotype proportions, which then implies a model for the $D_{ij}$. For two loci at which the population proportion of genotype $A_iA_{i'}B_jB_{j'}$ is denoted $p_{ii'jj'}$ (with $i \le i'$ and $j \le j'$), perhaps the simplest plausible model assumes RUG:

$$p_{ii'jj'} = \begin{cases} h_{ij}^2 & \text{if } i = i',\ j = j' \\ 2h_{ij}h_{ij'} & \text{if } i = i',\ j < j' \\ 2h_{ij}h_{i'j} & \text{if } i < i',\ j = j' \\ 2h_{ij}h_{i'j'} + 2h_{i'j}h_{ij'} & \text{if } i < i',\ j < j'. \end{cases} \tag{3}$$

Inbreeding and selection, for example, will invalidate this model: haplotype proportions will be incorrectly estimated because no allowance is made for the dependence of haplotypes within multilocus genotypes. However, for human populations and approximately neutral loci, the effect on inference should be negligible, and so the RUG assumption may be reasonable in such cases.

The log-likelihood for a random sample of genotypes is obtained by substituting (3) into the multinomial log-likelihood function,

$$\log L = \sum_{i \le i'} \sum_{j \le j'} n_{ii'jj'} \log(p_{ii'jj'}), \tag{4}$$

where the $n_{ii'jj'}$ are the observed genotype counts. The maximum-likelihood (ML) estimates $\hat{h}_{ij}$ can then be obtained by maximizing $\log L$ using any suitable method, such as the expectation-maximization (EM) algorithm of Excoffier and Slatkin (1995). Substitution of the $\hat{h}_{ij}$ into (1) and (2) then leads to point estimates $\hat{D}'_{ij}$ and $\hat{D}'$.
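As an illustration of definitions (1) to (4), the following Python sketch computes the $D'_{ij}$, Hedrick's $D'$, and the RUG log-likelihood from a table of haplotype proportions. It is not the authors' C program: the function names (`d_prime`, `rug_genotype_prob`, `rug_loglik`), the zero-based allele indices, the NumPy array layout for the $h_{ij}$, and the dictionary format for genotype counts are all assumptions made here for illustration.

```python
import numpy as np


def d_prime(h):
    """Return the matrix of D'_ij and Hedrick's summary D' (equations 1 and 2).

    h is an I x J array of haplotype proportions h_ij that sums to one.
    """
    h = np.asarray(h, dtype=float)
    p = h.sum(axis=1)  # allele proportions p_i at the first locus
    q = h.sum(axis=0)  # allele proportions q_j at the second locus
    pq = np.outer(p, q)
    d = h - pq  # D_ij = h_ij - p_i q_j
    dmax_neg = np.minimum(pq, np.outer(1 - p, 1 - q))
    dmax_pos = np.minimum(np.outer(p, 1 - q), np.outer(1 - p, q))
    dmax = np.where(d < 0, dmax_neg, dmax_pos)
    # D'_ij is set to zero wherever D_max is zero (an allele with zero proportion).
    d_norm = np.divide(d, dmax, out=np.zeros_like(d), where=dmax > 0)
    return d_norm, float(np.sum(pq * np.abs(d_norm)))


def rug_genotype_prob(h, i, i2, j, j2):
    """Genotype proportion under random union of gametes (equation 3).

    Alleles are indexed from zero, with i <= i2 and j <= j2.
    """
    if i == i2 and j == j2:
        return h[i, j] ** 2
    if i == i2:
        return 2 * h[i, j] * h[i, j2]
    if j == j2:
        return 2 * h[i, j] * h[i2, j]
    return 2 * h[i, j] * h[i2, j2] + 2 * h[i2, j] * h[i, j2]


def rug_loglik(h, counts):
    """Log-likelihood of equation (4); counts maps (i, i2, j, j2) to n_ii'jj'."""
    h = np.asarray(h, dtype=float)
    return sum(n * np.log(rug_genotype_prob(h, *g))
               for g, n in counts.items() if n > 0)
```

For two diallelic loci with $h_{11} = h_{22} = 0.5$ and $h_{12} = h_{21} = 0$, `d_prime` returns $D' = 1$, while any table of the form $h_{ij} = p_i q_j$ gives $D' = 0$ up to rounding error. A maximizer of `rug_loglik` over the haplotype simplex would play the role of the ML estimate $\hat{h}_{ij}$; no particular optimizer is implied here.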


corresponding to no background information about haplotype proportions. However, a uniform prior for the $h_{ij}$ does not correspond to an uninformative prior for $D'$, and the level of informativeness is fixed and cannot be controlled. Moreover, the uniform-on-haplotypes prior does not encapsulate the fact that haplotypes are composed of alleles and hence, for example, the $h_{1j}$, $j \neq 1$, and the $h_{i1}$, $i \neq 1$, are informative about $p_1$ and $q_1$ and thus may well be informative about $h_{11}$.

Suppose that information was available in advance, perhaps from surveys in other populations, which indicated that $p_i$ and $q_j$ were likely to be close to, say, $\alpha_i$ and $\beta_j$, respectively. (Conceptually, $\alpha_i$ and $\beta_j$ might be thought of as metapopulation allele proportions.) A tractable family of prior distributions for the $h_{ij}$ would then be the Dirichlet family with parameters $\lambda\alpha_i\beta_j$, where $\lambda$ is a constant, so that each $h_{ij}$ has prior expectation and variance given by

$$E[h_{ij}] = \alpha_i\beta_j, \qquad \mathrm{Var}[h_{ij}] = \frac{\alpha_i\beta_j(1 - \alpha_i\beta_j)}{1 + \lambda}. \tag{5}$$

Under this assumption, the $p_i$ and the $q_j$ are also Dirichlet, with parameters $\lambda\alpha_i$ and $\lambda\beta_j$, respectively. If $\lambda$ is large then $h_{ij}$, $p_i$, and $q_j$ will be close to, respectively, $\alpha_i\beta_j$, $\alpha_i$,


TABLE 1

Medians and equal-tailed 90% intervals of the prior and posterior distribution for $D'$ shown in Figure 1

No. alleles,                                  Prior                     Posterior
true $D'$        Prior parameters             Median  90% interval      Median  90% interval

$I = J = 3$,     $\lambda = IJ = 9$           0.453   (0.132–0.952)     0.090   (0.030–0.182)
$D' = 0.081$     $\lambda = 2IJ = 18$         0.327   (0.102–0.825)     0.088   (0.031–0.180)
                 $\lambda = 4IJ = 36$         0.236   (0.077–0.622)     0.084   (0.028–0.172)

$I = J = 6$,     $\lambda = IJ = 36$          0.329   (0.211–0.463)     0.305   (0.250–0.359)
$D' = 0.253$     $\lambda = 2IJ = 72$         0.242   (0.155–0.343)     0.247   (0.198–0.299)
                 $\lambda = 4IJ = 144$        0.176   (0.119–0.248)     0.231   (0.182–0.281)

and $\beta_j$, and hence the implied prior for $D_{ij}$ will be peaked at zero, implying little gametic disequilibrium. Decreasing the value of $\lambda$ makes strong disequilibrium more probable (the tails of the implied prior distribution for $D_{ij}$ are longer).

The sum of the Dirichlet parameters provides a measure of the information conveyed by the distribution. Choosing $\lambda$ so that the average of the $\lambda\alpha_i\beta_j$ is one (that is, $\lambda = IJ$ for $I$ and $J$ alleles at the two loci) would give a distribution that has the same information content as the uniform (for which all the parameters equal one) and may provide a reasonable vague prior for the $h_{ij}$.

This framework for specifying a prior distribution for $h_{ij}$ does not require that $\alpha_i$ and $\beta_j$ be specified precisely. Instead, they can be assigned probability distributions, leading to a hierarchical prior model. Below, we adopt independent uniform distributions for the $\alpha_i$ and $\beta_j$, although background information could in practice be incorporated into more informative distributions.

MCMC algorithm: We implement an MCMC stochastic simulation algorithm for genotype data to approximate the joint distribution of the $h_{ij}$, and hence of the gametic disequilibrium measures, under the RUG model and the hierarchical prior distribution described above. The MCMC algorithm adopted is of the Metropolis-Hastings type (Metropolis et al. 1953; Hastings 1970). At each iteration of the algorithm, a decision is made whether to keep the current vector of parameter values or reject it in favor of a new vector. The accept/reject decision is made in such a way that the proportion of iterations at which the current vector lies in any region of the parameter space approximates the probability that the true parameter vector lies in that region, with the approximation becoming more accurate as the number of iterations increases. Further details of the MCMC algorithm are given in the appendix.

Figure 1 shows the posterior density curves for $D'$, approximated via the MCMC algorithm, given two samples of two-locus genotypes simulated under the RUG model with $D' = 0.081$ (three alleles) and $D' = 0.253$ (six alleles). Three prior distributions were employed, shown as dotted curves. Key quantiles of the prior and posterior distributions are given in Table 1.

Even with a reasonably large sample size (200 individuals), $D'$ is a difficult parameter to estimate. This is because the data bear directly on the population genotype proportions, whereas differences between allele and haplotype proportions are the quantities of interest. This difficulty is reflected by the posterior curves of
Figure 1, which support a rather broad range of values for $D'$ and display some sensitivity to the choice of prior. However, in each case the posterior median is close to the true value and usually closer than the corresponding ML-based estimates (0.058 and 0.315), for which the sampling variance is difficult to calculate. Moreover, since $D'$ is univariate it is relatively easy to plot both prior and posterior density curves and hence assess visually the effect of the prior from the plots. Background information, when available, can be incorporated via the prior and may be invaluable in situations of little data and/or many alleles.

Also shown in Figure 1 are density curves averaged over 50 random permutations of the alleles, mimicking 50 samples from populations in gametic equilibrium with the same allele proportions. The data with three alleles at each locus are clearly consistent with equilibrium, but those with six alleles are not. These results are in accord with the $P$ values 0.56 and 0.00 obtained from an LR-based permutation test for gametic disequilibrium (Slatkin and Excoffier 1996).

Figure 1 corresponds to a single simulated dataset. We also applied the MCMC method (for the prior with $\lambda = IJ$) to 100 datasets of size $n = 1000$, simulated with $I = J = 3$. The underlying $h_{ij}$ were such that $D' = 0.404$. For each dataset we calculated the posterior median: the 100 estimated posterior medians had mean 0.403 and standard deviation 0.039. These values compare favorably with the MLE-based estimates $\hat{D}'$, for which the mean and standard deviation over these 100 simulated datasets were 0.407 and 0.040.

RESULTS

Two unlinked loci used for forensic identification: The MCMC method was applied to the genotypes at two unlinked forensic short tandem repeat (STR) loci, THO1 and TPOX, for samples of Maoris ($n = 1091$) and Samoans ($n = 139$) resident in New Zealand. Eight alleles were observed for locus THO1 and six for TPOX (additional alleles observed in other populations are ignored here, although they could be incorporated into the analysis if desired).

Figure 2 shows prior ($\lambda = IJ$) and posterior curves for the overall measure $D'$ together with a curve obtained from 50 random permutations of the data (mimicking equilibrium). There is a substantial overlap of these curves, suggesting that both samples are consistent with gametic equilibrium in the underlying populations; these conclusions are in agreement with $P$ values obtained from the LR-based permutation test (0.42 and 0.13).

A full multilocus match probability involves correlations of genes both within and between individuals (in

Figure 3. Posterior densities for disequilibrium coefficients $D_{6:11}$, $D_{8:8}$, $D_{9:10}$, and $D_{9:12}$ for the Maori and


Figure 4. Posterior densities (solid curves) for pairwise disequilibrium coefficients $D''$ from a sample of $n = 96$ genotypes, at loci LF261 (four alleles), LF168 (four alleles), and LF347 (five alleles) of the MOYO strains of the A. aegypti mosquito data of Yan et al. (1997). Point estimates (from ML estimates) are 0.492, 0.220, and 0.242. Dot-dashed curves are posterior densities averaged over 50 random permutations of the genotypes. The prior densities (dotted curves) are based on a Dirichlet prior for the $h_{ijk}$, with parameters $40\alpha_i\beta_j\gamma_k$, conditional on the $\alpha_i$, $\beta_j$, and $\gamma_k$, which are each (multivariate) uniform.

the latter case, between the defendant and an alternative possible source of the crime scene DNA). In current practice (see, e.g., Evett and Weir 1998), adjustment is often made for between-individual correlations at a single locus. However, multilocus forensic match probabilities are usually obtained by taking the product of the single-locus probabilities, thereby assuming independence between loci. Strong gametic disequilibrium may invalidate this assumption.

Although THO1 and TPOX are unlinked, gametic disequilibrium may nevertheless arise (due to founder effects, selection, or drift) and affect multilocus forensic match probabilities. It is therefore important to investigate levels of gametic disequilibrium, and a selection of marginal posterior density curves for the $D_{ij}$ is shown in Figure 3. All the posterior distributions support values close to zero, encouraging optimism that the effect of gametic disequilibrium on two-locus forensic match probabilities involving these loci may indeed be negligible.

Although these results tend to support current practice, note that we have not simultaneously taken all relevant correlations into account. In particular, other forms of association may invalidate the independence of genes assumption in the match probability (see Ayres


Figure 5. Posterior median (■) and central 90% posterior intervals (lines) for $D'$ in two populations, for the Xq25–Xq28 SNP data of Taillon-Miller et al. (2000); marker pairs analyzed here correspond to those presented in Table 2 of Taillon-Miller et al. (2000). A multivariate uniform prior was used for the $h_{ij}$.

inbreeding). Also, forensic identification involves many loci, often $> 10$, whereas here we have considered only the two-locus case.

Three linked loci: The MCMC algorithm for approximating the joint posterior of the haplotype proportions is readily extended to three loci. For forensic applications, we may be interested in investigating the difference $h_{ijk} - p_i q_j r_k$, which can be readily obtained from the MCMC output. For other problems, simultaneous estimation of the pairwise disequilibrium measures may be of more interest. However, multilocus systems impose additional constraints on the $D_{ij}$. For three diallelic loci, Robinson et al. (1991) describe a new pairwise normalized measure $D''_{ij}$ based on these adjusted bounds and with range $[-1, 1]$. These authors also note that additional loci beyond three add no further constraints on the $D_{ij}$. The multiallelic analogue of the formulas of Robinson et al. (1991) is given in the appendix.

from chromosome 3 of the MOYO strain of the mosquito Aedes aegypti. For the three RFLP loci LF261, LF168, and LF347, they reported that the values of $D'$ for all three pairs were significant at the 1% level. We have calculated posterior density curves for $D''$ on the basis of their original data, as well as curves based on random permutations (Figure 4).

Our results suggest strong disequilibrium between LF261 and LF168, since the curve based on the randomly permuted data has little overlap with that based on the observed data. In contrast, Figure 4 suggests little or no disequilibrium between the other two pairs of loci. The latter conclusion differs from that of Yan et al. (1997). It is, however, consistent with the relative map positions of the loci: LF168 is situated between LF261 and LF347, much closer to LF261 than to LF347. Although the MCMC results correctly identify LF168 and LF261 as the closest pair, disequilibrium between the other


TABLE 2

Posterior summaries of $D'$ for the analyses of Figure 5

Markers  Distance (Mb)  $D'_{\mathrm{CEPH}}$ median  $D'_{\mathrm{Fin}}$ median  Posterior $\Pr(D'_{\mathrm{CEPH}} < D'_{\mathrm{Fin}})$

Xq4007-1, Xq3774-2 0.044 0.94 0.96 0.59

Xq3804-1, Xq3812-1 0.090 0.42 0.58 0.75

Xq3774-2, Xq3773-1 0.093 0.95 0.97 0.59

Xq4007-1, Xq3773-1 0.137 0.96 0.97 0.50

Xq3812-1, Xq3917-1 0.281 0.30 0.55 0.82

Xq3862-1, Xq3773-1 0.323 0.31 0.52 0.89

Xq3804-1, Xq3917-1 0.371 0.96 0.98 0.68

Xq3773-1, Xq3804-1 0.380 0.16 0.50 0.97

Xq3773-1, Xq3812-1 0.470 0.23 0.51 0.88

Xq3774-2, Xq3804-1 0.473 0.68 0.74 0.66

Xq4007-1, Xq3804-1 0.517 0.19 0.52 0.97

Xq3774-2, Xq3812-1 0.563 0.42 0.65 0.79

Xq4007-1, Xq3812-1 0.607 0.23 0.30 0.58

Xq3862-1, Xq3804-1 0.703 0.26 0.59 0.96

Xq3773-1, Xq3917-1 0.751 0.15 0.48 0.95

Xq3774-2, Xq3917-1 0.844 0.79 0.74 0.34

Xq4007-1, Xq3917-1 0.888 0.18 0.51 0.96

Xq3774-2, Xq3698-1 1.245 0.10 0.25 0.83

Xq2816-1, Xq3274-1 0.028 0.96 0.94 0.28

Xq3471-1, Xq4001-1 0.038 0.97 0.98 0.65

Xq3413-1, Xq2816-1 0.044 0.59 0.50 0.33

Xq3449-1, Xq3471-1 0.048 0.97 0.97 0.53

Xq4001-1, Xq3413-1 0.067 0.84 0.81 0.41

Xq3413-1, Xq3274-1 0.072 0.63 0.53 0.29

Xq3449-1, Xq4001-1 0.086 0.96 0.97 0.56

Xq3471-1, Xq3413-1 0.105 0.90 0.82 0.24

Xq4001-1, Xq2816-1 0.111 0.63 0.51 0.24

Xq3476-1, Xq3449-1 0.117 0.59 0.55 0.45

Xq4001-1, Xq3274-1 0.139 0.66 0.53 0.25

Xq3471-1, Xq2816-1 0.149 0.68 0.52 0.19

Xq3449-1, Xq3413-1 0.153 0.89 0.80 0.24

Xq3476-1, Xq3471-1 0.165 0.56 0.56 0.50

Xq3471-1, Xq3274-1 0.177 0.69 0.53 0.21

Xq3449-1, Xq2816-1 0.197 0.65 0.50 0.18

Xq3476-1, Xq4001-1 0.203 0.56 0.56 0.51

Xq3449-1, Xq3274-1 0.225 0.65 0.52 0.22

Xq3476-1, Xq3413-1 0.270 0.29 0.53 0.94

Xq3476-1, Xq2816-1 0.314 0.29 0.26 0.43

Xq3476-1, Xq3274-1 0.342 0.28 0.17 0.28

Posterior medians for $D'$ in two populations (CEPH, $n = 92$; Finland (Fin), $n = 100$), for the Xq25–Xq28 SNP data of Taillon-Miller et al. (2000); also shown is the probability that $D'$ in the CEPH population is less than $D'$ in the Finnish population. Marker pairs analyzed here correspond to those presented in Table 2 of Taillon-Miller et al. (2000) and are ordered by distance between the markers. A multivariate uniform prior was used for the $h_{ij}$. Each entry is based on 1000 values sampled from the posterior distribution of $D'$.
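Summaries of the kind reported in Table 2 can be approximated by direct sampling when haplotype counts are observed, because a Dirichlet prior combined with a multinomial likelihood gives a Dirichlet posterior. The sketch below assumes a multivariate uniform prior (all Dirichlet parameters equal to one) and two independent samples. The haplotype counts are hypothetical values chosen only to match the sample sizes quoted above, and the function names and seed are likewise illustrative.

```python
import numpy as np


def hedrick_d_prime(h):
    """Hedrick's D' for an I x J array of haplotype proportions."""
    p, q = h.sum(axis=1), h.sum(axis=0)
    pq = np.outer(p, q)
    d = h - pq
    dmax = np.where(d < 0,
                    np.minimum(pq, np.outer(1 - p, 1 - q)),
                    np.minimum(np.outer(p, 1 - q), np.outer(1 - p, q)))
    # Cells with D_max equal to zero contribute nothing to the sum.
    d_norm = np.divide(d, dmax, out=np.zeros_like(d), where=dmax > 0)
    return float(np.sum(pq * np.abs(d_norm)))


def posterior_d_prime(counts, prior_param, size, rng):
    """Sample D' from the Dirichlet posterior with parameters counts + prior_param."""
    counts = np.asarray(counts, dtype=float)
    post = (counts + prior_param).ravel()
    draws = rng.dirichlet(post, size=size).reshape((size,) + counts.shape)
    return np.array([hedrick_d_prime(h) for h in draws])


rng = np.random.default_rng(5)  # arbitrary seed
# Hypothetical 2 x 2 haplotype counts for one SNP pair (not data from the paper).
counts_a = np.array([[40, 6], [5, 41]])  # sample of 92 haplotypes
counts_b = np.array([[48, 3], [4, 45]])  # sample of 100 haplotypes
d_a = posterior_d_prime(counts_a, 1.0, 1000, rng)
d_b = posterior_d_prime(counts_b, 1.0, 1000, rng)

low, median, high = np.quantile(d_a, [0.05, 0.5, 0.95])  # median and 90% interval
prob_a_less_b = float(np.mean(d_a < d_b))  # Monte Carlo estimate of Pr(D'_a < D'_b)
```

Comparing independent posterior draws pair by pair is appropriate here only because the two samples are treated as independent; as noted in the text, values of $D'$ across closely linked marker pairs within a population are not independent.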

markers on the basis of the joint posterior distribution of the $D''$ cannot be achieved with confidence, the correct order being assigned a probability of 42%.

Haplotype data: In some cases haplotype counts may be available, simplifying the direct probability approach. For example, for three-locus haplotypes $h_{ijk}$ and a hierarchical prior, we have

where $N \equiv \{n_{ijk}\}$ denotes the sample haplotype counts and $\alpha$, $\beta$, and $\gamma$ are vectors of hyperparameters specifying prior distributions for the population allele proportions at the three loci.

A method for sampling from this distribution is given in the appendix, together with a summary of implications for disequilibrium mapping. However, we focus


parameters $k_{ij}$ is implemented for the $h_{ij}$. When the likelihood is multinomial, the posterior distribution for the $h_{ij}$ will again be Dirichlet with parameters $n_{ij} + k_{ij}$, and a sample from this distribution can be obtained by standard random number generation (see, e.g., Appendix A of Gelman et al. 1995).

Taillon-Miller et al. (2000) analyzed several pairs of single nucleotide polymorphism (SNP) markers in the human Xq25–Xq28 region for three populations [general European (CEPH), Finnish, and Sardinian]. They found significant $P$ values ($P < 0.001$, from a $\chi^2$ test for gametic equilibrium) for markers separated by up to ~900 kb. They also found that, in general, point estimates of disequilibrium measures (such as $D'$) did not differ greatly between the large outbred population (represented by the CEPH sample) and the genetically isolated populations of Finland and Sardinia. These results were consistent with an STR analysis of similar populations (Eaves et al. 2000), though both conflict with the suggestion that genetically isolated populations tend to exhibit higher levels of disequilibrium and are therefore more useful for disease gene mapping (see, e.g., Wright et al. 1999). In summarizing the conclusions of Taillon-Miller et al. (2000) and Eaves et al. (2000), Boehnke (2000) argues that the levels of disequilibrium observed appeared slightly stronger in the isolates than in the general mixed populations.

We have reanalyzed the CEPH and Finnish SNP data of Taillon-Miller et al. (2000), implementing a multivariate uniform prior for $h_{ij}$ for each pair of markers analyzed. This distribution imposes a prior belief that none of the alleles is very rare [the implied prior for the $p_i$ and $q_j$ is Beta(2,2)], which is reasonable for these markers as they have been selected on the basis of polymorphism. Results are shown in Figure 5: posterior medians and 90% intervals for the two populations are plotted against physical distance.

The measure of variability provided by our MCMC approach allows more careful comparison of the levels of disequilibrium across the populations analyzed. For almost all of the marker pairs given in Table 2, the posterior 90% intervals from the two populations overlap substantially, indicating that there is little evidence of any difference across the populations. This is quantified in Table 2, which gives the posterior probability for each marker pair that $D'$ is larger in the Finnish population than in the CEPH population: these probabilities exceed 90% for only a handful of markers, and in no case exceed 97%. (Note the values of $D'$ across closely linked markers are not independent.)

Our results therefore quantify the observation made by Taillon-Miller et al. (2000) that disequilibrium levels were similar across the populations. The data provide little evidence that gametic disequilibrium is higher in the Finnish population than in the general European

DISCUSSION

The direct probability, or Bayesian, approach developed here permits interpretable visual answers to the question of interest about disequilibrium parameters. Moreover, it can readily incorporate complex models and background knowledge about a population, when available. For a discussion of the advantages of Bayesian approaches to problems in genetics, see Shoemaker et al. (1999). We have also developed a family of hierarchical prior distributions that allow the scientist some flexibility in specifying background knowledge.

Zapata et al. (1997) note that point estimates of $D'_{ij}$ are frequently reported without a corresponding measure of variability (such as the standard error), which can complicate comparisons over loci and populations. However, the calculation of $\mathrm{Var}(D'_{ij})$ is complicated by the different rescaling of positive and negative values in the definition of $D'_{ij}$. Zapata et al. derived an approximation to $\mathrm{Var}(D'_{ij})$ for biallelic loci only. Our direct probability approach provides an approximation not just for the variance of (multiallelic) $D'_{ij}$ and $D'$ but for their entire posterior distributions. A particular advantage is that the posterior intervals obtained can be directly interpreted in terms of probabilities, unlike standard confidence intervals that are routinely provided for some point estimates, which do not have such a direct interpretation.

There are no theoretical limits to the number of loci that can be analyzed simultaneously. However, for a fixed sample size, the information contained in the data decreases as the number of loci increases, and, as for hypothesis testing, useful inferences are usually not feasible for more than about three loci.

We thank the following for kindly providing the data analyzed in this study: John Buckleton (human STR data), Guiyun Yan (mosquito data), Patty Taillon-Miller, and Pui-Yan Kwok (SNP data). We thank Laurent Excoffier for helpful comments on an earlier draft, as well as two anonymous referees. Work was supported in part by the UK Biotechnology and Biological Sciences Research Council, under grant 45/G09617.

LITERATURE CITED

Ayres, K. L., 1998 Measuring genetic correlations within and between loci, with implications for disequilibrium mapping and forensic identification. Ph.D. Thesis, The University of Reading, Reading, UK.

Ayres, K. L., 2000 A two-locus forensic match probability for subdivided populations. Genetica 108: 137–143.

Ayres, K. L., and D. J. Balding, 1998 Measuring departures from Hardy-Weinberg: a Markov chain Monte Carlo method for estimating the inbreeding coefficient. Heredity 80: 769–777.

Ayres, K. L., and A. D. J. Overall, 1999 Allowing for within-subpopulation inbreeding in forensic match probabilities. Forensic Sci. Int. 103: 207–216.

Balding, D. J., and R. A. Nichols, 1995 A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96:


Best, N. G., M. K. CowlesandS. K. Vines,1995 CODA Manual strains selected for refractoriness to a malaria parasite. J. Hered. 88:187–194.

Version 0.30.MRC Biostatistics Unit, Cambridge, UK.

Zapata, C., G. AlvarezandC. Carollo,1997 Approximate vari-Boehnke, M.,2000 A look at linkage disequilibrium. Nat. Genet.

ance of the standardized measure of gametic disequilibriumD⬘. 25:246–247.

Am. J. Hum. Genet.61:771–774. Brooks, S. P., 1998 Markov chain Monte Carlo method and its

Zaykin, D., L. A. ZhivotovskyandB. S. Weir,1995 Exact tests for application. Statistician47:69–100.

association between alleles at arbitrary numbers of loci. Genetica Devlin, B.,andN. Risch,1995 A comparison of linkage

disequilib-96:169–178. rium measures for fine-scale mapping. Genomics29:311–322.

Eaves, I. A., T. R. Merriman, R. A. Barber, S. Nutland, E. Tuomilehto-Wolf et al., 2000 The genetically isolated populations of Finland and Sardinia may not be a panacea for linkage disequilibrium mapping of common disease genes. Nat. Genet. 25: 320–323.

Evett, I. W., and B. S. Weir, 1998 Interpreting DNA Evidence: Statistical Genetics for Forensic Scientists. Sinauer, Sunderland, MA.

Excoffier, L., and M. Slatkin, 1995 Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921–927.

Feder, J. N., A. Gnirke, W. Thomas, Z. Tsuchihashi, D. A. Ruddy et al., 1996 A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis. Nat. Genet. 13: 399–408.

Gelman, A., J. B. Carlin, H. S. Stern and D. B. Rubin, 1995 Bayesian Data Analysis. Chapman and Hall, London.

Hastings, W. K., 1970 Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109.

Hedrick, P. W., 1987 Gametic disequilibrium measures: proceed with caution. Genetics 117: 331–341.

Jorde, L. B., 1995 Linkage disequilibrium as a gene-mapping tool. Am. J. Hum. Genet. 56: 11–14.

Lewontin, R. C., 1964 The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49: 49–67.

Lewontin, R. C., 1974 The Genetic Basis of Evolutionary Change. Columbia University Press, New York.

Lewontin, R. C., 1988 On measures of gametic disequilibrium. Genetics 120: 849–852.

Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller, 1953 Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087–1092.

National Research Council, 1996 The Evaluation of Forensic DNA Evidence, NRC2. National Academy Press, Washington, DC.

Nielsen, D. M., M. G. Ehm and B. S. Weir, 1998 Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am. J. Hum. Genet. 63: 1531–1540.

Robinson, W. P., M. A. Asmussen and G. Thomson, 1991 Three-locus systems impose additional constraints on pairwise disequilibria. Genetics 129: 925–930.

Shoemaker, J., I. Painter and B. S. Weir, 1998 A Bayesian characterization of Hardy-Weinberg disequilibrium. Genetics 149: 2079–2088.

Shoemaker, J., I. Painter and B. S. Weir, 1999 Bayesian statistics in genetics: a guide for the uninitiated. Trends Genet. 15: 354–358.

Slatkin, M., and L. Excoffier, 1996 Testing for linkage disequilibrium in genotypic data using the Expectation-Maximization algorithm. Heredity 76: 377–383.

Smith, A. F. M., and J. M. Bernardo, 1994 Bayesian Theory. Wiley, Chichester, UK.

Smith, C. A. B., 1970 A note on testing the Hardy-Weinberg law. Ann. Hum. Genet. 33: 377–383.

Smouse, P. E., 1974 Likelihood analysis of recombinational disequilibrium in multiple-locus gametic frequencies. Genetics 76: 557–565.

Taillon-Miller, P., I. Bauer-Sardiña, N. L. Saccone, J. Putzel, T. Laitinen et al., 2000 Juxtaposed regions of extensive and minimal linkage disequilibrium in human Xq25 and Xq28. Nat. Genet. 25: 324–328.

Weir, B. S., 1979 Inferences about linkage disequilibrium. Biometrics 35: 235–254.

Weir, B. S., 1994 The effects of inbreeding on forensic calculation. Ann. Rev. Genet. 28: 597–621.

Weir, B. S., 1996 Genetic Data Analysis II. Sinauer, Sunderland, MA.

Wright, A. F., A. D. Carothers and M. Pirastu, 1999 Population choice in mapping genes for complex diseases. Nat. Genet. 23: 397–404.

Yan, G., B. M. Christensen and D. W. Severson, 1997 Comparisons

Communicating editor: G. A. Churchill

APPENDIX

MCMC algorithm for genotype data: Metropolis-Hastings algorithms are methods for generating a sample from an arbitrary probability distribution $\Pi$ (with probability density function $\pi$) by constructing a Markov chain whose stationary distribution is $\Pi$. If the current state of the chain is $x$, a candidate new state $x'$ is chosen with probability density $q(x'|x)$. The candidate is accepted with probability

$$\min\left(\frac{\pi(x')\,q(x|x')}{\pi(x)\,q(x'|x)},\ 1\right),$$

otherwise the current state $x$ is retained. A key feature of these algorithms is that $\pi$ need only be specified up to a normalizing constant, and so high-dimensional probability distributions can often be successfully handled. Although the states of the chain are correlated, selecting every $k$th iteration, after an initial “burn-in” period of length $b$, can lead to approximate random samples from $\Pi$ when suitable choices are made for $k$ and $b$. See Brooks (1998) for an introduction to MCMC algorithms.

For the algorithm implemented here, $\pi$ is the joint posterior density function of the $h_{ij}$. For two loci, each candidate $x'$ differs from $x$ at a randomly chosen pair of the $h_{ij}$, say $h_{rs}$ and $h_{wz}$. A proposal value $h'_{rs}$ is chosen uniformly in the interval

$$(\max(0,\ h_{rs} - \varepsilon),\ \min(h_{rs} + \varepsilon,\ h_{rs} + h_{wz})),$$

and $h'_{wz} = h_{wz} + h_{rs} - h'_{rs}$. The (positive) value of $\varepsilon$ is chosen to prevent proposed values from being rejected too often, which would result in slow movement of the chain around the sample space.

Slow convergence and poor mixing can arise in the presence of many alleles and/or loci. No difficulties were experienced with the examples discussed here that could not reasonably be overcome by choosing suitably large values for $k$ and $b$. A burn-in of $b = 30{,}000$ iterations was found to be adequate for the two-locus algorithm (50,000 for three loci), with every $k = 200$th (300) iteration output (these values having been determined by the inspection of sequential and autocorrelation plots of the output for initial runs). The output of each run underlying the figures and tables was analyzed with the MCMC diagnostic computer package CODA (Best et al. 1995), which indicated satisfactory convergence and

algorithm to the data, 5000 values were output (1000 for the permuted data).

MCMC algorithm for haplotype data: For three-locus haplotype data, assuming a multinomial log likelihood for the $h_{ijk}$ (given hyperparameters $\alpha$, $\beta$, and $\gamma$), together with a Dirichlet prior distribution, after observing the $n_{ijk}$ the $h_{ijk}$ have a Dirichlet distribution with parameters $n_{ijk} + \lambda\alpha_i\beta_j\gamma_k$. A posterior sample from $p(\{h_{ijk}\} \mid \alpha, \beta, \gamma, N)$ can therefore be readily obtained using standard methods for the Dirichlet distribution (see, e.g., Appendix A of Gelman et al. 1995). To obtain a distribution for the $h_{ijk}$ that does not involve $\alpha$, $\beta$, and $\gamma$, we can employ (6) together with a method of simulating from $p(\alpha, \beta, \gamma \mid N)$. A number of approaches are available, and details of an MCMC algorithm are given here.

The target distribution for the MCMC algorithm is $p(\alpha, \beta, \gamma \mid N)$, the probability density function of the hyperparameters $\alpha$, $\beta$, and $\gamma$ given the data $N$. The likelihood $p(N \mid \alpha, \beta, \gamma)$ is of a standard form known as the multinomial-Dirichlet (Smith and Bernardo 1994, p. 135), and by Bayes theorem we can write

$$p(\alpha, \beta, \gamma \mid N) = c\,p(\alpha, \beta, \gamma) \prod_{i=1}^{I} \prod_{j=1}^{J} \prod_{k=1}^{K} \frac{\Gamma(n_{ijk} + \lambda\alpha_i\beta_j\gamma_k)}{\Gamma(\lambda\alpha_i\beta_j\gamma_k)}, \tag{7}$$

in which $c$ is a constant (and hence does not need to be known here) and $p(\alpha, \beta, \gamma)$ denotes the prior distribution for the hyperparameters, assumed here to be the product of multivariate uniforms so that $p(\alpha, \beta, \gamma)$ is also a constant.

A suitable Metropolis-Hastings algorithm can proceed as follows: first select a locus $l$, chosen uniformly at random. Suppose for notational convenience that $\alpha$ is the hyperparameter vector corresponding to the chosen locus $l$, then choose two elements of $\alpha$, say $\alpha_v$ and $\alpha_w$. The proposal $\alpha'_v$ is chosen uniformly at random in the interval

$$(\max(0,\ \alpha_v - \varepsilon),\ \min(\alpha_v + \varepsilon,\ \alpha_v + \alpha_w)), \tag{8}$$

where $\varepsilon$ is again a tuning parameter chosen to ensure that proposal values are not rejected either too frequently or too rarely. Finally, $\alpha'_w$ is assigned value $\alpha_w + \alpha_v - \alpha'_v$.

est to the disease locus should exhibit higher levels of disequilibrium than those that are far away (e.g., Jorde 1995). Devlin and Risch (1995) detailed the use of point estimates of disequilibrium for inferring the closest of a number of biallelic markers (identifying the marker with the highest observed disequilibrium value as the closest). The direct probability approach, implementing the methods outlined above, adds a degree of interpretability to simple disequilibrium mapping, assigning probabilities to the event that a marker is closest to the disease locus; see Ayres (1998) for further details.

Multiallelic three-locus normalized measures: The following bounds apply to the $D_{ij}$ for two loci in a three-locus system (disequilibrium measures for the other locus pairs are denoted $D_{ik}$ and $D_{jk}$),

$$D_{ij,\min} = \max\{-p_i q_j,\ -(1 - p_i)(1 - q_j),\ -m_1,\ -m_2\}$$

$$D_{ij,\max} = \min\{p_i(1 - q_j),\ (1 - p_i)q_j,\ M_1,\ M_2\},$$

where

$$m_1 = p_i q_j r_k + (1 - p_i)(1 - q_j)(1 - r_k) + D_{ik} + D_{jk}$$

$$m_2 = p_i q_j (1 - r_k) + (1 - p_i)(1 - q_j) r_k - D_{ik} - D_{jk}$$

$$M_1 = p_i (1 - q_j) r_k + (1 - p_i) q_j (1 - r_k) + D_{ik} - D_{jk}$$

$$M_2 = p_i (1 - q_j)(1 - r_k) + (1 - p_i) q_j r_k - D_{ik} + D_{jk},$$

which are analogous to equations (12, a and b) and (13, a–d) of Robinson et al. (1991). The normalized parameters $D''_{ij}$ defined by these authors are then given by

$$D''_{ij} = \begin{cases} \dfrac{D_{ij}}{D_{ij,\max}} & \text{if } D_{ij} > 0 \text{ and } D_{ij,\min} \le 0 \\[2ex] \dfrac{D_{ij} - D_{ij,\min}}{D_{ij,\max} - D_{ij,\min}} & \text{if } D_{ij} > 0 \text{ and } D_{ij,\min} > 0 \\[2ex] \dfrac{D_{ij}}{-D_{ij,\min}} & \text{if } D_{ij} < 0 \text{ and } D_{ij,\max} \ge 0 \\[2ex] \dfrac{D_{ij} - D_{ij,\max}}{D_{ij,\max} - D_{ij,\min}} & \text{if } D_{ij} < 0 \text{ and } D_{ij,\max} < 0. \end{cases}$$
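As an illustration only, the three-locus bounds on $D_{ij}$ and the piecewise definition of the normalized measure $D''_{ij}$ given in this appendix might be computed as in the following Python sketch. The function and argument names (`dij_bounds`, `normalized_dij`, `p_i`, `q_j`, `r_k`, `d_ik`, `d_jk`) are hypothetical and do not come from the paper. Returning 0 when $D_{ij} = 0$ is an assumption, since the piecewise definition covers only nonzero $D_{ij}$.

```python
def dij_bounds(p_i, q_j, r_k, d_ik, d_jk):
    """Return (D_ij_min, D_ij_max) for allele i at one locus and allele j at
    another, given allele k at a third locus.

    p_i, q_j, r_k are the allele frequencies; d_ik and d_jk are the
    disequilibrium coefficients for the other two locus pairs.
    """
    m1 = p_i * q_j * r_k + (1 - p_i) * (1 - q_j) * (1 - r_k) + d_ik + d_jk
    m2 = p_i * q_j * (1 - r_k) + (1 - p_i) * (1 - q_j) * r_k - d_ik - d_jk
    big_m1 = p_i * (1 - q_j) * r_k + (1 - p_i) * q_j * (1 - r_k) + d_ik - d_jk
    big_m2 = p_i * (1 - q_j) * (1 - r_k) + (1 - p_i) * q_j * r_k - d_ik + d_jk

    d_min = max(-p_i * q_j, -(1 - p_i) * (1 - q_j), -m1, -m2)
    d_max = min(p_i * (1 - q_j), (1 - p_i) * q_j, big_m1, big_m2)
    return d_min, d_max


def normalized_dij(d_ij, d_min, d_max):
    """Normalized measure D''_ij, following the four cases of the definition."""
    if d_ij > 0:
        if d_min <= 0:
            return d_ij / d_max
        return (d_ij - d_min) / (d_max - d_min)
    if d_ij < 0:
        if d_max >= 0:
            return d_ij / (-d_min)
        return (d_ij - d_max) / (d_max - d_min)
    # Assumption: D_ij == 0 is not covered by the definition; report 0.
    return 0.0


if __name__ == "__main__":
    # Third locus in equilibrium with both others: only pairwise bounds bind.
    lo, hi = dij_bounds(0.5, 0.5, 0.5, 0.0, 0.0)
    print(lo, hi, normalized_dij(0.1, lo, hi))

    # Strong disequilibrium with the third locus forces D_ij to be positive.
    lo, hi = dij_bounds(0.5, 0.5, 0.5, 0.2, 0.2)
    print(lo, hi, normalized_dij(0.2, lo, hi))
```

In the second call the lower bound is positive, so the second case of the definition applies and $D_{ij}$ is rescaled relative to the interval between the two bounds rather than relative to zero.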

The $D''_{ij}$ can be interpreted as the amount by which $|D_{ij}|$ exceeds its minimum value (given its sign), divided by its range. The overall pairwise measure $D''$ is calculated

This algorithm, and modifications of it, can be useful in the location of disease loci via simple disequilibrium mapping. Briefly, under the assumption of a single dis-