MAXIMUM LIKELIHOOD ANALYSIS OF POPULATION DIFFERENCES IN ALLELIC FREQUENCIES


PETER E. SMOUSE² AND KEN-ICHI KOJIMA³

Department of Zoology, University of Texas at Austin, Austin, Texas 78712

Manuscript received October 19, 1971
Revised copy received July 7, 1972

ABSTRACT

Statistical techniques are presented for the analysis of geographic variation in allelic frequencies. Likelihood ratio test criteria are derived from a multinomial sampling distribution, and are used to answer three questions. (1) Are there geographic differences in allelic frequencies? (2) Are population differences in allelic frequencies associated with environmental differences? (3) Is there any residual “lack of fit” variation among populations, after accounting for that variation associated with environmental differences? The two- and three-allele cases are explicitly treated, and the extension to more alleles is indicated.

SINCE LEWONTIN and HUBBY (1966) and JOHNSON et al. (1966) first demonstrated the presence of large quantities of electrophoretically detectable genetic variation in natural populations of Drosophila, numerous workers have compiled evidence of genetic variation for an impressive array of enzyme systems in a number of other organisms. Inevitably, a number of workers have made comparisons of allelic frequencies for different populations of a single species (JOHNSON, KOJIMA and WHEELER 1969a; SELANDER, YANG and HUNT 1969). Several attempts have been made to detect associations between genetic and environmental characteristics (KOEHN and RASMUSSEN 1967; SELANDER et al. 1969; JOHNSON et al. 1969b), because such associations might well be indicative of selective pressures operating on the loci in question.

The strictly statistical aspects of detecting, measuring, and testing genotype-environment associations have not received the attention they warrant. KOEHN and RASMUSSEN (1967) and SELANDER et al. (1969) have employed standard regression analyses, while JOHNSON et al. (1969b) have made use of a principal components approach. All of this work, however, largely ignores the distributional features of gene frequency data. The purpose of this paper is to develop alternative statistical techniques particularly suited to the analysis of geographic variation in allelic frequencies.

¹ Supported by NSF Postdoctoral Fellowship Award WO12 to Dr. SMOUSE and by USPHS Grant GM 15769 to Dr. KOJIMA.

² Present address: Department of Human Genetics, University of Michigan, Ann Arbor, Michigan.

³ Professor of Zoology (deceased), University of Texas, Austin, Texas.


A SINGLE LOCUS WITH TWO ALLELES ($A_1$ AND $A_2$)

Consider a set of $I$ populations, each sampled for a number of individuals. Each individual is assayed for a particular genetic locus for which all alleles are expressed. Since each of the $(n_i/2)$ diploid individuals contains two alleles, we have a sample of $n_i$ alleles from the $i$th population.

If we may assume Hardy-Weinberg equilibrium within each of the populations, and if we denote the relative frequency of $A_1$ in the $i$th population by the parameter $P_i$, then the number of $A_1$ alleles has a binomial probability distribution:

$$f(X_i) = \binom{n_i}{X_i} P_i^{X_i} (1 - P_i)^{n_i - X_i}, \tag{1}$$

where $P_i$ is defined as above and $X_i$ is the observed number of $A_1$ alleles drawn in a sample of $n_i$ alleles. If a single allele is assayed from each of $n_i$ individuals, as is sometimes possible in the laboratory, the Hardy-Weinberg assumption is unnecessary.

The analysis consists of a set of statistical comparisons among various hypotheses concerning the values of the parameters $P_i$: $i = 1, \ldots, I$. The null hypothesis ($H_0$) is that $P_1 = P_2 = \cdots = P_I = P$, i.e., that there are no differences in $P$ among populations. This is formally equivalent to the hypothesis $H_0$: $P_i = \beta_0 Z_{0i}$, where $Z_{0i} = 1$ for all $i = 1, \ldots, I$. An alternative hypothesis ($H_1$) is given by the model:

$$P_i = \begin{cases} \beta_0 Z_{0i} + \beta_1 Z_{1i} + \cdots + \beta_K Z_{Ki} = \boldsymbol{\beta}'\mathbf{Z}_i & \text{if } 0 < \boldsymbol{\beta}'\mathbf{Z}_i < 1, \\ 0 & \text{if } \boldsymbol{\beta}'\mathbf{Z}_i \le 0, \\ 1 & \text{if } \boldsymbol{\beta}'\mathbf{Z}_i \ge 1. \end{cases} \tag{2}$$

The $K$ additional $Z$-variables represent the values of $K$ different environmental measures of interest. The definitions of $P_i$ for the last two situations amount to the assumption that there may be environmental thresholds, beyond which fixation occurs. The relationship of $P$ to the environment need not be linear, but in the absence of more specific information, the linear model will constitute a convenient point of departure. We shall restrict our attention in this paper to the linear case. The final, least restrictive hypothesis ($H_2$), is that there are population differences in $P$ which may or may not be predictable from the environment in linear fashion.

There are three questions of interest. (1) Are there any differences among populations? (2) Are there any linear associations between allelic frequency and quantitative environmental variables? (3) Are there any “unexplained” geographic differences, above and beyond those accounted for by the linear model?


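The corresponding likelihood ratio criteria are:

$$\lambda_A = \frac{\sup L(\mathbf{X} \mid H_0)}{\sup L(\mathbf{X} \mid H_2)}, \tag{3}$$

$$\lambda_R = \frac{\sup L(\mathbf{X} \mid H_0)}{\sup L(\mathbf{X} \mid H_1)}, \tag{4}$$

$$\lambda_L = \frac{\sup L(\mathbf{X} \mid H_1)}{\sup L(\mathbf{X} \mid H_2)}, \tag{5}$$

where the likelihood values, except for a constant, are given by

$$L(\mathbf{X} \mid H_0) = \prod_{i=1}^{I} \hat{P}^{X_i} (1 - \hat{P})^{n_i - X_i}, \tag{6}$$

$$L(\mathbf{X} \mid H_1) = \prod_{i=1}^{I} \tilde{P}_i^{X_i} (1 - \tilde{P}_i)^{n_i - X_i}, \tag{7}$$

$$L(\mathbf{X} \mid H_2) = \prod_{i=1}^{I} \hat{P}_i^{X_i} (1 - \hat{P}_i)^{n_i - X_i}, \tag{8}$$

where $\hat{P}$, $\tilde{P}_i$, and $\hat{P}_i$ are the maximum likelihood estimates of the parameters $P_i$ under the three hypotheses.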

To obtain these maximum likelihood estimates, the logarithms of the likelihood functions are partially differentiated with respect to the parameters in question, and these partial derivatives are equated to zero. The resulting sets of equations are solved for the estimates in question. It is well known that $\hat{P} = \left(\sum_{i=1}^{I} X_i\right) / \left(\sum_{i=1}^{I} n_i\right)$ and that $\hat{P}_i = X_i / n_i$ for $i = 1, \ldots, I$.
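As a minimal numerical sketch of these two closed-form estimates, the following Python fragment computes the pooled estimate under the null hypothesis and the per-population estimates under the least restrictive hypothesis, together with the corresponding binomial log-likelihoods (constant omitted). The allele counts and the names `binom_loglik`, `p_pooled`, and `p_each` are hypothetical and serve only as an illustration.

```python
import math

def binom_loglik(x, n, p):
    """Binomial log-likelihood without the combinatorial constant.

    x[i] of n[i] sampled alleles are A1; p[i] is the assumed frequency.
    Terms with a zero count are skipped, i.e. 0 * log(0) is taken as 0.
    """
    total = 0.0
    for xi, ni, pi in zip(x, n, p):
        if xi > 0:
            total += xi * math.log(pi)
        if ni - xi > 0:
            total += (ni - xi) * math.log(1.0 - pi)
    return total

# Hypothetical counts of allele A1 in samples of n alleles from 4 populations.
x = [30, 45, 52, 70]
n = [100, 100, 100, 100]

p_pooled = sum(x) / sum(n)                    # single common frequency
p_each = [xi / ni for xi, ni in zip(x, n)]    # one frequency per population

logL_null = binom_loglik(x, n, [p_pooled] * len(x))
logL_free = binom_loglik(x, n, p_each)
print(p_pooled, p_each, logL_null, logL_free)
```

The per-population fit can never have a lower likelihood than the pooled fit, since the pooled model is a special case of it.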

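The hypothesis $H_1$ specifies that $P_i = \beta_0 Z_{0i} + \beta_1 Z_{1i} + \cdots + \beta_K Z_{Ki}$, and one must maximize $L(\mathbf{X} \mid H_1)$ simultaneously for each of the $\beta$-coefficients:

$$\frac{\partial}{\partial \beta_j} \log L(\mathbf{X} \mid H_1) = \sum_{i=1}^{I} \left[ \frac{X_i}{\tilde{P}_i} - \frac{n_i - X_i}{1 - \tilde{P}_i} \right] Z_{ji} = \sum_{i=1}^{I} \frac{n_i (\hat{P}_i - \tilde{P}_i)}{\tilde{P}_i (1 - \tilde{P}_i)} Z_{ji} = 0, \qquad j = 0, 1, \ldots, K, \tag{9}$$

or in matrix form:

$$\frac{\partial}{\partial \boldsymbol{\beta}} \log L(\mathbf{X} \mid H_1) = \mathbf{Z}'\mathbf{V}^{-1}(\hat{\mathbf{P}} - \mathbf{Z}\hat{\boldsymbol{\beta}}) = \mathbf{0}, \tag{10}$$

where $\hat{\boldsymbol{\beta}}$ is the $(K+1)$ vector of estimated $\beta$-coefficients, $\hat{\mathbf{P}}' = (\hat{P}_1, \ldots, \hat{P}_I)$, $\mathbf{Z}$ is the $I \times (K+1)$ matrix of environmental values, and $\mathbf{V}^{-1} = \mathrm{diag}\,[n_i / \tilde{P}_i (1 - \tilde{P}_i)]$. The values of $\tilde{P}_i$, and therefore of $\mathbf{V}^{-1}$, are dependent upon the estimated $\beta$-coefficients, and it is necessary to iterate to a solution:

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z})^{-1} (\mathbf{Z}'\mathbf{V}^{-1}\hat{\mathbf{P}}). \tag{11}$$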

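If we differentiate (9) again, we obtain:

$$\frac{\partial^2}{\partial \beta_j \, \partial \beta_k} \log L(\mathbf{X} \mid H_1) = -\sum_{i=1}^{I} \left[ \frac{X_i}{\tilde{P}_i^{2}} + \frac{n_i - X_i}{(1 - \tilde{P}_i)^{2}} \right] Z_{ji} Z_{ki}. \tag{12}$$

Upon taking expectations, one obtains the information matrix:

$$\mathcal{I}(\hat{\boldsymbol{\beta}}) = (\mathbf{Z}'\mathbf{V}^{-1}\mathbf{Z}). \tag{13}$$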


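The iteration takes the form:

$$\hat{\boldsymbol{\beta}}_{(r+1)} = (\mathbf{Z}'\mathbf{V}_{(r)}^{-1}\mathbf{Z})^{-1} (\mathbf{Z}'\mathbf{V}_{(r)}^{-1}\hat{\mathbf{P}}), \tag{14}$$

where $\mathbf{V}_{(r)}^{-1} = \mathrm{diag}\,[n_i / \tilde{P}_{i(r)} (1 - \tilde{P}_{i(r)})]$, and $\tilde{\mathbf{P}}_{(r)} = \mathbf{Z}\hat{\boldsymbol{\beta}}_{(r)}$, $r \ge 1$. Since $\hat{\boldsymbol{\beta}}_{(r+1)}$ depends upon $\hat{\boldsymbol{\beta}}_{(r)}$ only through $\tilde{\mathbf{P}}_{(r)} = [\tilde{P}_{i(r)}]$, one may start with an initial $\tilde{\mathbf{P}}_{(0)}$, and simply iterate. Two of the more obvious choices for $\tilde{\mathbf{P}}_{(0)}$ are $\hat{P}$ and $\hat{P}_i$, the values of $P$ under $H_0$ and $H_2$, respectively. Equation (14) constitutes the familiar Gauss-Newton process (KENDALL and STUART 1963).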
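The iteration can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the authors' program: the data, the single environmental variable, the helper `solve`, the function `fit_linear`, and the convergence tolerance are all hypothetical, and fitted frequencies are held inside [0.001, 0.999] so that the weights stay finite.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    sol = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = M[r][m] - sum(M[r][k] * sol[k] for k in range(r + 1, m))
        sol[r] = s / M[r][r]
    return sol

def fit_linear(x, n, Z, iters=50, tol=1e-10):
    """Iterate beta = (Z'V^-1 Z)^-1 (Z'V^-1 P_hat), V^-1 = diag[n/(P(1-P))]."""
    I, q = len(x), len(Z[0])
    p_hat = [xi / ni for xi, ni in zip(x, n)]
    p = [sum(x) / sum(n)] * I          # start from the pooled estimate
    beta = [0.0] * q
    for _ in range(iters):
        w = [ni / (pi * (1.0 - pi)) for ni, pi in zip(n, p)]
        A = [[sum(w[i] * Z[i][a] * Z[i][b] for i in range(I))
              for b in range(q)] for a in range(q)]
        c = [sum(w[i] * Z[i][a] * p_hat[i] for i in range(I)) for a in range(q)]
        new_beta = solve(A, c)
        # Keep fitted frequencies in the admissible range (assumed bounds).
        p = [min(0.999, max(0.001, sum(bj * zj for bj, zj in zip(new_beta, Z[i]))))
             for i in range(I)]
        converged = max(abs(u - v) for u, v in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, p

# Hypothetical data: 4 populations, intercept plus one environmental measure.
x = [30, 45, 52, 70]
n = [100, 100, 100, 100]
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
beta, p_fit = fit_linear(x, n, Z)
# Likelihood equation Z'V^-1 (P_hat - Z beta), evaluated at the solution.
score = [sum(n[i] / (p_fit[i] * (1 - p_fit[i])) * Z[i][a] * (x[i] / n[i] - p_fit[i])
             for i in range(len(x))) for a in range(2)]
print(beta, score)
```

When the fitted frequencies stay strictly inside the admissible range, the final `score` vector should be numerically zero, which is the condition the likelihood equation imposes.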

There are multiple solutions to equation (10). Only those solutions are of interest, however, which yield $0 < \tilde{P}_i = \hat{\boldsymbol{\beta}}'\mathbf{Z}_i < 1$ for $i = 1, \ldots, I$. The question arises as to whether there are multiple admissible solutions. It can be seen from equation (12) that the matrix of second partial derivatives is strictly negative-definite for any such solution, indicating a relative maximum. Between any two relative maxima there must be a relative minimum, assuming continuity. Now the portion of the $\beta$-space circumscribed by the restrictions on the $\tilde{P}_i$ is convex, and any point lying between two internal points is also internal. No internal points are relative minima, however, and continuity obtains. Consequently, there is never more than one admissible solution to (10). If no admissible solution exists, then a peripheral absolute maximum (one or more $\tilde{P}_i = 0$ or 1) will exist.

It has been our experience that rapid convergence can usually be obtained by setting $\tilde{\mathbf{P}}_{(0)} = \hat{P}$ or $\hat{P}_i$. In the process of iteration, values of $\tilde{P}_i \le 0$ or $\ge 1$ sometimes arise. To remain in the admissible range, it is necessary to replace these values with $0 < \tilde{P}_i < 1$. We have routinely employed $\tilde{P}_i = .001$ or $.999$, and have found that the final values of $\tilde{P}_i$ are very often between 0 and 1. There remain some cases, however, for which convergence is not obtained.

Upon further investigation, it appears that non-convergence occurs in two separate cases. (1) An internal solution exists, but is so close to the periphery ($\tilde{P}_i \to 0$ or $\tilde{P}_i \to 1$) that the matrix $\mathbf{V}^{-1}$ is very sensitive to small changes in $\tilde{P}_i$, preventing convergence. (2) No internal solution exists, and the likelihood function is maximized for some combination of $\beta$-estimates leading to peripheral $\tilde{P}_i$ values.

In the first case, it seems likely that a more refined iterative technique might well yield the appropriate solution. We have found it easier to employ a simple search-grid procedure which tracks the optimum. Since the likelihood function is both continuous (equation 10) and strictly concave downwards (equation 12) in the region of interest, no difficulty is encountered in obtaining the proper solution.

In the second case, the same search procedure leads to an estimated $\beta$-vector such that one or more $\tilde{P}_i$ are external. In keeping with the assumptions (2), the offending values are set equal to zero or one. Only those populations with $\hat{P}_i = 0$ or 1 ($X_i = 0$ or $n_i$) may accommodate peripheral $\tilde{P}_i$ values, of course, since otherwise the likelihood function takes its minimum value (0), i.e.:

$$L(X_i \mid \tilde{P}_i) = \tilde{P}_i^{X_i} (1 - \tilde{P}_i)^{n_i - X_i} = 0 \quad \text{if } \tilde{P}_i = 0 \text{ or } 1 \text{ and } 0 < X_i < n_i. \tag{15}$$
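The text does not spell out the search-grid procedure, so the following is only one plausible sketch of such a search, assuming an intercept and a single environmental variable. The function names `loglik` and `grid_search`, the starting point, the grid size, and the shrink factor are all hypothetical choices. The likelihood follows the convention above: a fitted frequency outside the open unit interval is set to 0 or 1, and the likelihood vanishes unless the population concerned is itself fixed.

```python
import itertools
import math

def loglik(beta, x, n, Z):
    """Binomial log-likelihood (constant omitted) for P_i = beta'Z_i,
    with P_i set to 0 or 1 when beta'Z_i falls outside (0, 1)."""
    total = 0.0
    for xi, ni, zi in zip(x, n, Z):
        p = sum(b * z for b, z in zip(beta, zi))
        p = min(1.0, max(0.0, p))
        # A peripheral P_i is only compatible with X_i = 0 or X_i = n_i.
        if (p == 0.0 and xi > 0) or (p == 1.0 and xi < ni):
            return -math.inf
        if xi > 0:
            total += xi * math.log(p)
        if xi < ni:
            total += (ni - xi) * math.log(1.0 - p)
    return total

def grid_search(x, n, Z, start, half_width=0.5, steps=5, rounds=60, shrink=0.75):
    """Evaluate a grid around the current best point, move to the best
    grid point, then shrink the grid and repeat."""
    best, best_ll = list(start), loglik(start, x, n, Z)
    for _ in range(rounds):
        axes = [[c + half_width * k / steps for k in range(-steps, steps + 1)]
                for c in best]
        for cand in itertools.product(*axes):
            ll = loglik(cand, x, n, Z)
            if ll > best_ll:
                best, best_ll = list(cand), ll
        half_width *= shrink
    return best, best_ll

# Hypothetical data: 4 populations, intercept plus one environmental measure.
x = [30, 45, 52, 70]
n = [100, 100, 100, 100]
Z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
best, best_ll = grid_search(x, n, Z, start=[0.5, 0.0])
print(best, best_ll)
```

Because the log-likelihood is concave in the admissible region, a shrinking grid of this kind cannot be trapped at a secondary internal maximum. The cost grows quickly with the number of environmental variables, so the sketch is practical only for small models.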

Returning to the likelihood functions, it should be pointed out that $\Lambda_A = -2 \log \lambda_A$ is asymptotically distributed as $\chi^2$ with $(I - 1)$ degrees of freedom (KENDALL and STUART 1963). It is also clear that $\lambda_A = \lambda_R \cdot \lambda_L$, and that $\Lambda_R = -2 \log \lambda_R$ and $\Lambda_L = -2 \log \lambda_L$ are asymptotically distributed as $\chi^2$ variates with $K\ (< I)$ and $(I - K - 1)$ degrees of freedom, respectively. These three test criteria are given by:

$$\begin{aligned}
\Lambda_A &= 2\,[\log L(\mathbf{X} \mid H_2) - \log L(\mathbf{X} \mid H_0)], \\
\Lambda_L &= 2\,[\log L(\mathbf{X} \mid H_2) - \log L(\mathbf{X} \mid H_1)], \\
\Lambda_R &= 2\,[\log L(\mathbf{X} \mid H_1) - \log L(\mathbf{X} \mid H_0)],
\end{aligned} \tag{16}$$

where the likelihoods are understood to take their maximum values under the respective hypotheses.
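A short sketch of how the three criteria and their degrees of freedom fit together is given below. The maximized log-likelihood values are invented for illustration, and `lr_criteria` is a hypothetical helper name. The point of the example is the additivity of the regression and lack-of-fit components.

```python
def lr_criteria(logL0, logL1, logL2, I, K):
    """Return (criterion, degrees of freedom) for each source of variation,
    given maximized log-likelihoods under the null, linear, and
    unrestricted hypotheses."""
    return {
        "among populations": (2.0 * (logL2 - logL0), I - 1),
        "regression": (2.0 * (logL1 - logL0), K),
        "lack of fit": (2.0 * (logL2 - logL1), I - K - 1),
    }

# Hypothetical maximized log-likelihoods: I = 4 populations, K = 1 variable.
tests = lr_criteria(logL0=-277.20, logL1=-260.55, logL2=-260.10, I=4, K=1)
for source, (value, df) in tests.items():
    print(f"{source:18s} df={df}  criterion={value:.2f}")
```

Each criterion would then be referred to a chi-square distribution with the stated degrees of freedom.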

It can be shown that the values of the various likelihood functions do not depend upon whether $L$ is maximized relative to $P_1$ or $P_2 = (1 - P_1)$. Since the argument is symmetric with respect to $P_1$ and $(1 - P_1)$, we may choose either allele. We are ultimately led to the $\chi^2$ analysis of Table 1.

A SINGLE LOCUS WITH THREE ALLELES ($A_1$, $A_2$, AND $A_3$)

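In this section, we develop the analysis for a three-allele locus. If we denote the relative frequencies of the alleles $A_1$, $A_2$, and $A_3$ in the $i$th population by $P_{1i}$, $P_{2i}$, and $P_{3i} = (1 - P_{1i} - P_{2i})$, respectively, then the numbers of $A_1$ and $A_2$ alleles have a trinomial distribution:

$$f(X_{1i}, X_{2i}) = \frac{(n_i)!}{(X_{1i})!\,(X_{2i})!\,(n_i - X_{1i} - X_{2i})!} \; P_{1i}^{X_{1i}} P_{2i}^{X_{2i}} (1 - P_{1i} - P_{2i})^{n_i - X_{1i} - X_{2i}}, \tag{17}$$

where $X_{1i}$ and $X_{2i}$ are the observed numbers of $A_1$ and $A_2$ alleles in a sample of $n_i$ drawn from the $i$th population. The trinomial likelihood function, except for a constant, is given by:

$$L(\mathbf{X}) = \prod_{i=1}^{I} P_{1i}^{X_{1i}} P_{2i}^{X_{2i}} (1 - P_{1i} - P_{2i})^{n_i - X_{1i} - X_{2i}}. \tag{18}$$

The hypotheses of interest are:

$$\begin{aligned}
H_0 &: \; P_{1i} = P_1, \; P_{2i} = P_2 \quad (i = 1, \ldots, I), \\
H_1 &: \; P_{1i} = \boldsymbol{\beta}_1'\mathbf{Z}_i, \; P_{2i} = \boldsymbol{\beta}_2'\mathbf{Z}_i, \\
H_2 &: \; \text{not all } P_{1i} = P_1 \text{ or not all } P_{2i} = P_2.
\end{aligned}$$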

TABLE 1

Analysis of geographic variation for a single locus with two alleles

Source of variation     Degrees of freedom     $\chi^2$
Regression              $K$                    $\Lambda_R$
Lack of fit             $I - K - 1$            $\Lambda_L$
Among populations       $I - 1$                $\Lambda_A$


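The maximum likelihood estimates under $H_0$ and $H_2$ are:

$$\hat{P}_1 = \sum_{i=1}^{I} X_{1i} \Big/ \sum_{i=1}^{I} n_i, \qquad \hat{P}_2 = \sum_{i=1}^{I} X_{2i} \Big/ \sum_{i=1}^{I} n_i, \qquad \hat{P}_{1i} = X_{1i} / n_i, \qquad \hat{P}_{2i} = X_{2i} / n_i, \qquad i = 1, \ldots, I.$$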

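If $\log L(\mathbf{X} \mid H_1)$ is differentiated with respect to $\beta_{1j}$, we obtain:

$$\frac{\partial \log L}{\partial \beta_{1j}} = \sum_{i=1}^{I} \left[ \frac{X_{1i}}{\tilde{P}_{1i}} - \frac{n_i - X_{1i} - X_{2i}}{1 - \tilde{P}_{1i} - \tilde{P}_{2i}} \right] Z_{ji} = 0 \tag{20}$$

for $j = 0, 1, \ldots, K$, while differentiating with respect to $\beta_{2k}$ leads to:

$$\frac{\partial \log L}{\partial \beta_{2k}} = \sum_{i=1}^{I} \left[ \frac{X_{2i}}{\tilde{P}_{2i}} - \frac{n_i - X_{1i} - X_{2i}}{1 - \tilde{P}_{1i} - \tilde{P}_{2i}} \right] Z_{ki} = 0 \tag{21}$$

for $k = 0, \ldots, K$. These two sets of equations may be combined into a single matrix equation:

$$\frac{\partial \log L}{\partial \boldsymbol{\beta}} = \mathbf{Z}_*'\mathbf{W}^{-1}\hat{\mathbf{P}} - \mathbf{Z}_*'\mathbf{U}^{-1}\mathbf{Z}_*\hat{\boldsymbol{\beta}} = \mathbf{0}, \tag{22}$$

or

$$\hat{\boldsymbol{\beta}} = (\mathbf{Z}_*'\mathbf{U}^{-1}\mathbf{Z}_*)^{-1} (\mathbf{Z}_*'\mathbf{W}^{-1}\hat{\mathbf{P}}). \tag{23}$$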

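The matrix $\mathbf{Z}_*'$ is the $[2(K+1) \times 2I]$ matrix defined by

$$\mathbf{Z}_*' = \begin{bmatrix} \mathbf{Z}' & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}' \end{bmatrix}, \tag{24}$$

where each block is of order $(K+1) \times I$; the matrix $\mathbf{W}^{-1}$ is the $(2I \times 2I)$ matrix defined by:

$$\mathbf{W}^{-1} = \begin{bmatrix} \mathbf{D}_1 & \mathbf{D}_2 \\ \mathbf{D}_3 & \mathbf{D}_4 \end{bmatrix}, \tag{25}$$

with $\mathbf{D}_1 = \mathrm{diag}\,[n_i (1 - \tilde{P}_{2i}) / \tilde{P}_{1i} (1 - \tilde{P}_{1i} - \tilde{P}_{2i})]$, $\mathbf{D}_2 = \mathbf{D}_3 = \mathrm{diag}\,[n_i / (1 - \tilde{P}_{1i} - \tilde{P}_{2i})]$, and $\mathbf{D}_4 = \mathrm{diag}\,[n_i (1 - \tilde{P}_{1i}) / \tilde{P}_{2i} (1 - \tilde{P}_{1i} - \tilde{P}_{2i})]$; the vector $\hat{\mathbf{P}}' = (\hat{\mathbf{P}}_1', \hat{\mathbf{P}}_2')$; the matrix $\mathbf{U}^{-1}$ is the $(2I \times 2I)$ matrix:

$$\mathbf{U}^{-1} = \begin{bmatrix} \mathbf{U}_1^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{U}_2^{-1} \end{bmatrix}, \tag{26}$$

where each block is of order $(I \times I)$, with $\mathbf{U}_1^{-1} = \mathrm{diag}\,[n_i / \tilde{P}_{1i} (1 - \tilde{P}_{1i} - \tilde{P}_{2i})]$ and $\mathbf{U}_2^{-1} = \mathrm{diag}\,[n_i / \tilde{P}_{2i} (1 - \tilde{P}_{1i} - \tilde{P}_{2i})]$; the vector $\hat{\boldsymbol{\beta}}' = (\hat{\boldsymbol{\beta}}_1', \hat{\boldsymbol{\beta}}_2')$. The matrix $\mathbf{Z}$ and the vectors $\hat{\mathbf{P}}_1$, $\hat{\mathbf{P}}_2$, $\tilde{\mathbf{P}}_1$, $\tilde{\mathbf{P}}_2$, $\hat{\boldsymbol{\beta}}_1$, $\hat{\boldsymbol{\beta}}_2$ are the same as for the two allele cases. It should be pointed out that the $(2 \times 2)$ matrix constructed from the $(i,i)$th elements of the four D-matrices constitutes the inverse of the within-population covariance matrix $(\mathbf{W}_i)$ of the $i$th population:

$$\mathbf{W}_i = \frac{1}{n_i} \begin{bmatrix} \tilde{P}_{1i}(1 - \tilde{P}_{1i}) & -\tilde{P}_{1i}\tilde{P}_{2i} \\ -\tilde{P}_{1i}\tilde{P}_{2i} & \tilde{P}_{2i}(1 - \tilde{P}_{2i}) \end{bmatrix}. \tag{27}$$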

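The matrix $\mathbf{W}^{-1}$ is therefore analogous to $\mathbf{V}^{-1}$ of the two allele case. The matrix $\mathbf{U}^{-1}$ is a modified $\mathbf{W}^{-1}$ matrix. Since both $\mathbf{W}^{-1}$ and $\mathbf{U}^{-1}$ are functions of $\hat{\boldsymbol{\beta}}$, an iterative solution is in order.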
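The relation between the D-matrix elements and the within-population covariance matrix can be checked numerically. The sketch below, with hypothetical frequencies and sample size and hypothetical helper names, builds the covariance matrix of the two trinomial proportions for a single population, builds the 2 x 2 block of D-matrix elements, and multiplies the two.

```python
def within_cov(p1, p2, n):
    """Covariance matrix of the sample proportions of A1 and A2
    in a trinomial sample of n alleles."""
    return [[p1 * (1 - p1) / n, -p1 * p2 / n],
            [-p1 * p2 / n, p2 * (1 - p2) / n]]

def d_elements(p1, p2, n):
    """The (i,i)th elements of D1, D2, D3, D4 for one population."""
    p3 = 1.0 - p1 - p2
    return [[n * (1 - p2) / (p1 * p3), n / p3],
            [n / p3, n * (1 - p1) / (p2 * p3)]]

def matmul2(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

# Hypothetical population: P1 = 0.5, P2 = 0.3, n = 100 alleles.
product = matmul2(within_cov(0.5, 0.3, 100), d_elements(0.5, 0.3, 100))
print(product)
```

The product should be the 2 x 2 identity matrix up to rounding error, for any admissible pair of frequencies.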

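If the second partial derivatives are obtained from (23), and if expectations are taken, we obtain:

$$\mathcal{I}(\hat{\boldsymbol{\beta}}) = (\mathbf{Z}_*'\mathbf{W}^{-1}\mathbf{Z}_*). \tag{28}$$

The Gauss-Newton process is then:

$$\hat{\boldsymbol{\beta}}_{(r+1)} = \hat{\boldsymbol{\beta}}_{(r)} + (\mathbf{Z}_*'\mathbf{W}_{(r)}^{-1}\mathbf{Z}_*)^{-1} (\mathbf{Z}_*'\mathbf{W}_{(r)}^{-1}\hat{\mathbf{P}} - \mathbf{Z}_*'\mathbf{U}_{(r)}^{-1}\mathbf{Z}_*\hat{\boldsymbol{\beta}}_{(r)}), \tag{29}$$

which converges to (23) as (22) converges to the zero vector. We have found that (29) is not always well behaved, relative to restrictions on $\tilde{P}_{1i}$ and $\tilde{P}_{2i}$, and we prefer to iterate directly via (23).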

As with the two-allele case, the likelihood equation has multiple relative extrema. It can be shown that there is never more than one solution which satisfies $0 < \tilde{P}_{1i}, \tilde{P}_{2i}, \tilde{P}_{3i} < 1$ for $i = 1, \ldots, I$. It has been our experience that the sequence converges, except near the periphery or when there is no internal solution to (23). In both of these non-convergent cases, a simple search-grid technique will yield the appropriate solution.


ing error involved in the iteration. The three-allele equivalents of equations (6), (7), and (8) may be computed, and we are led again to the analysis of Table 1. The degrees of freedom for all entries are doubled, of course, since twice as many parameters are involved.

The extension to four or more alleles is straightforward, and we shall not describe it here. In practice, few multiple allelic loci will have more than two or three alleles in appreciable frequency. In view of the asymptotic properties of the test criteria and the consequent problems with small expected numbers, it seems advisable to pool all but the most frequent two or three alleles into an “other alleles” class. The loss of information resulting from this practice should be minimal.
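The pooling step can be illustrated with a small Python sketch. The allele labels, counts, and the function name `pool_rare_alleles` are hypothetical. The rule used here, ranking alleles by their total count over all populations, is one reasonable reading of “most frequent” and is an assumption of this example.

```python
from collections import Counter

def pool_rare_alleles(samples, keep=2):
    """Keep the `keep` most frequent alleles overall and pool the
    remaining alleles into a single 'other' class."""
    totals = Counter()
    for counts in samples:
        totals.update(counts)
    kept = [allele for allele, _ in totals.most_common(keep)]
    pooled = []
    for counts in samples:
        row = {allele: counts.get(allele, 0) for allele in kept}
        row["other"] = sum(c for allele, c in counts.items() if allele not in kept)
        pooled.append(row)
    return pooled

# Hypothetical allele counts for two populations at a four-allele locus.
samples = [
    {"A1": 60, "A2": 35, "A3": 4, "A4": 1},
    {"A1": 50, "A2": 44, "A3": 6},
]
pooled = pool_rare_alleles(samples)
print(pooled)
```

After pooling, each population is described by three classes, so the three-allele analysis applies directly.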

MULTIPLE LOCI

It is becoming rather common to assay a number of genetic loci on each individual sampled, and this practice opens up the possibility of a multilocus analysis. A full treatment is beyond the scope of this paper, and we shall content ourselves with two comments.

If one assumes that the various loci are segregating independently within populations, i.e., that the loci are in linkage equilibrium, then the multilocus analysis is simply the sum of the separate single-locus components. If the loci are not segregating independently within populations, a more elaborate analysis is appropriate.
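Under the linkage-equilibrium assumption, combining loci amounts to adding the single-locus criteria and their degrees of freedom. A minimal sketch follows, in which the locus labels and numerical values are invented for illustration.

```python
# Hypothetical single-locus results: (criterion, degrees of freedom) per locus.
single_locus = {
    "locus_1": (12.4, 3),
    "locus_2": (7.9, 3),
    "locus_3": (2.1, 3),
}

# Independent chi-square variates add, as do their degrees of freedom.
total_criterion = sum(value for value, _ in single_locus.values())
total_df = sum(df for _, df in single_locus.values())
print(total_criterion, total_df)
```

The same addition applies separately to the regression and lack-of-fit components.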

SEQUENTIAL TESTING AND PARTITIONING THE REGRESSION COMPONENT

There will sometimes be a question as to which of the possible environmental variables should be included in the linear model. One would prefer to obtain as complete a description as possible with a minimum number of environmental variables. It will thus be appropriate to test a variety of environmental models. This problem is a familiar one in multiple regression of normal variables, and is entirely analogous for the case at hand.

To facilitate the discussion, we assume a two-allele locus and four environmental $Z$-measures $(Z_0, Z_1, Z_2, Z_3)$*. It will be convenient to employ several hypotheses:

$$\begin{aligned}
H_0 &: \; P_i = \alpha_0 Z_{0i} \quad [\text{equivalent to } P_i = P] \\
H_1 &: \; P_i = \beta_0 Z_{0i} + \beta_1 Z_{1i} \\
H_2 &: \; P_i = \gamma_0 Z_{0i} + \gamma_1 Z_{1i} + \gamma_2 Z_{2i} \\
H_3 &: \; P_i = \delta_0 Z_{0i} + \delta_1 Z_{1i} + \delta_2 Z_{2i} + \delta_3 Z_{3i} \\
H_4 &: \; \text{Not all } P_i = P
\end{aligned}$$

* We shall make use of this particular illustration in the companion paper (KOJIMA et al. 1972).


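The likelihood ratio of $H_j$ to $H_k$ ($j < k$) is denoted $\lambda_{jk} = \sup L(\mathbf{X} \mid H_j) / \sup L(\mathbf{X} \mid H_k)$ for $k = 1, \ldots, 4$. The respective estimates are given by $\tilde{P}_{0i} = \hat{P}$, $\tilde{P}_{1i} = \hat{\boldsymbol{\beta}}'\mathbf{Z}_i$, $\tilde{P}_{2i} = \hat{\boldsymbol{\gamma}}'\mathbf{Z}_i$, $\tilde{P}_{3i} = \hat{\boldsymbol{\delta}}'\mathbf{Z}_i$, and $\tilde{P}_{4i} = \hat{P}_i$.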

If we denote $-2 \log \lambda$ by $\Lambda$, then it is clearly true that:

$$\Lambda_{01} \le \Lambda_{02} \le \Lambda_{03} \le \Lambda_{04}, \tag{31}$$

since adding an additional environmental variable never worsens the description. The additional information gained by adding an additional variable may not warrant its inclusion in the model, however, and one is led to ask such questions as: “Is a model with $Z_1$ and $Z_2$ a significantly better description than a model with only $Z_1$?” This question may be answered by contrasting hypotheses $H_1$ and $H_2$.

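Questions such as that asked immediately above may be approached analytically, since $\lambda_{04} = \lambda_{01} \cdot \lambda_{12} \cdot \lambda_{23} \cdot \lambda_{34}$, which implies that $\Lambda_{04} = \Lambda_{01} + \Lambda_{12} + \Lambda_{23} + \Lambda_{34}$. The reader will recognize that $\Lambda_{04} = \Lambda_A$. The partition into $\Lambda_R$ and $\Lambda_L$ will depend upon the model fit. When considering the model $P_i = \boldsymbol{\beta}'\mathbf{Z}_i$, we find that $\Lambda_R = \Lambda_{01}$ and that $\Lambda_L = \Lambda_{14} = \Lambda_{12} + \Lambda_{23} + \Lambda_{34}$. When one is considering the model $P_i = \boldsymbol{\gamma}'\mathbf{Z}_i$, we find that $\Lambda_R = \Lambda_{02} = \Lambda_{01} + \Lambda_{12}$ and that $\Lambda_L = \Lambda_{24} = \Lambda_{23} + \Lambda_{34}$, and similarly for the model $P_i = \boldsymbol{\delta}'\mathbf{Z}_i$.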
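The additive partition can be illustrated with a few lines of Python. The maximized log-likelihoods for the five nested hypotheses are invented for illustration, and the helper name `criterion` is hypothetical. The criterion for any pair of hypotheses is twice the difference of their maximized log-likelihoods, so the successive criteria telescope.

```python
# Hypothetical maximized log-likelihoods under the nested hypotheses H0 ... H4.
logL = {0: -277.20, 1: -262.00, 2: -260.90, 3: -260.60, 4: -260.10}

def criterion(j, k):
    """Criterion contrasting H_j with the less restrictive H_k."""
    return 2.0 * (logL[k] - logL[j])

steps = [criterion(0, 1), criterion(1, 2), criterion(2, 3), criterion(3, 4)]
among = criterion(0, 4)

# Partition for the model with two environmental variables (H2).
regression = criterion(0, 2)
lack_of_fit = criterion(2, 4)
print(steps, among, regression, lack_of_fit)
```

Whichever model is taken as the regression model, its regression and lack-of-fit components add to the same among-populations criterion.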

The analogy with weighted multiple regression of normal variables is both close and obvious. The partitioned analysis is shown in Table 2.

The order of fitting shown in Table 2 is arbitrary, of course, and one may employ more than one order. The decision problems involved in determining the order of fitting and the stopping strategy are familiar to users of normal regression, and we shall not belabor them here. The interested reader is referred to DRAPER and SMITH (1966). There are, however, two features of the present analysis which are slightly different from normal regression analysis and which require further comment.

First, it is possible with the analysis shown in Table 2 to obtain $\Lambda_R(Z_2 \mid Z_1) > \Lambda_R(Z_2)$, a feature not encountered with multiple regression of normal variables.

The explanation for this seemingly anomalous observation is nevertheless fairly simple. In normal regression analysis, the weight matrix, be it $\mathbf{I}$ or $\mathbf{V}^{-1}$, is assumed independent of the model. Consequently, the weight assigned a given population is the same for all models. The weight matrix ($\mathbf{V}^{-1}$) in multinomial regression is strictly dependent upon the model fit, and each model has its own particular set of weights. The difference is simply the difference in the underlying distributions.

TABLE 2

Partitioned $\chi^2$ analysis of population differences in allelic frequencies

Source                    Df               Asymptotic $\chi^2$
Regression                $K = 3$          $\Lambda_R = \Lambda_{03}$
  $Z_1$                   1                $\Lambda_{01}$
  $Z_2 \mid Z_1$          1                $\Lambda_{12}$
  $Z_3 \mid Z_1, Z_2$     1                $\Lambda_{23}$
Lack of fit               $(I - K - 1)$    $\Lambda_L = \Lambda_{34}$
Among populations         $(I - 1)$        $\Lambda_A = \Lambda_{04}$

Second, it should be pointed out that each of the $\chi^2$-tests of Tables 1 and 2 is analogous to an F-test. The implicit denominator of any such “F ratio” is the within population “pure” error. By construction, this pure error term takes the value $\sum_{i=1}^{I} (n_i - 1) = (N - I)$, and is asymptotically distributed as $\chi^2$ with $(N - I)$ degrees of freedom. The effect is to divide each “mean square” by unity. As a consequence of the above, one has an independent test criterion for both the regression and lack of fit components. It is very possible that both will be significant. In such a case, the proper interpretation would be that while the regression model is a significant improvement over the null hypothesis, indicating a general association between environmental and genetic variables, the model employed is not an entirely adequate description. Perhaps other environmental variables should be considered, or perhaps the relationship is not linear, or both. The addition of further variables to the linear model is straightforward, and requires no further comment. The addition of quadratic or higher order terms may be conveniently handled within the framework of the linear model proposed. The analytical framework for general non-linear models is essentially the same as that outlined in Tables 1 and 2. The estimation procedures will necessarily depend upon the type of model employed.

DISCUSSION

The test criterion $-2 \log \lambda$ is asymptotically distributed as $\chi^2$, but with finite samples is biased upwards. In general, the bias is of order $(n^{-1})$, and for reasonably large $(n)$ can be safely ignored. The approximation is poorest when several of the $\tilde{P}_i$ are near 0 or 1 and when sample sizes are small. We would suggest sample sizes on the order of $n_i = 100$ for all populations. This is particularly important where rare alleles are encountered. If one must operate very close to the periphery and if sample sizes are unavoidably small, caution is in order relative to exact probability levels. LAWLEY (1956) and PEERS (1971) have developed general methods for obtaining closer approximations to the limiting $\chi^2$-distribution.

It should be pointed out that the $\lambda$-criterion used here is related to the more familiar $\chi^2$-homogeneity test so often used in testing differences in allelic frequencies among populations. In view of the fact that $\chi^2$ can also be partitioned along the lines indicated above, the question arises as to which constitutes the better approach. KENDALL and STUART (1963) point out that the two test criteria are asymptotically identical and interchangeable. This is not the case for finite samples, of course, and the two criteria must be judged on small sample properties. Unfortunately, very little information exists on this point (see COCHRAN


likelihood function in both cases, the $\lambda$-criterion would seem the more natural approach, and we prefer it on that account.
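For readers who wish to compare the two criteria on their own data, the following sketch computes both the Pearson homogeneity statistic and the likelihood ratio criterion for a two-allele locus. The counts and the function name `homogeneity_statistics` are hypothetical. Expected counts are taken from the pooled allele frequency, as in the usual homogeneity test.

```python
import math

def homogeneity_statistics(x, n):
    """Return (Pearson chi-square, likelihood ratio criterion) for testing
    equality of the A1 frequency across populations.

    x[i] is the count of A1 among n[i] sampled alleles."""
    p = sum(x) / sum(n)
    pearson = 0.0
    lr = 0.0
    for xi, ni in zip(x, n):
        cells = ((xi, ni * p), (ni - xi, ni * (1.0 - p)))
        for observed, expected in cells:
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:
                lr += 2.0 * observed * math.log(observed / expected)
    return pearson, lr

# Hypothetical counts of allele A1 in four samples of 100 alleles each.
pearson, lr = homogeneity_statistics([30, 45, 52, 70], [100, 100, 100, 100])
print(pearson, lr)
```

With samples of this size the two statistics are close, in keeping with their asymptotic equivalence. They can diverge more noticeably when expected counts are small.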

The authors would like to express their thanks to Dr. R. H. RICHARDSON for lengthy discussions on this paper.

LITERATURE CITED

COCHRAN, W. G., 1952 The $\chi^2$ test of goodness of fit. Ann. Math. Stat. 23: 315-345.

DRAPER, N. R. and H. SMITH, 1966 Applied Regression Analysis. Wiley & Sons, New York. 407 pp.

JOHNSON, F. M., C. G. KANAPI, R. H. RICHARDSON, M. R. WHEELER and W. S. STONE, 1966 An analysis of polymorphisms among isozyme loci in dark and light Drosophila ananassae strains from American and Western Samoa. Proc. Nat. Acad. Sci. U.S. 56: 119-125.

JOHNSON, F. M., K. KOJIMA and M. R. WHEELER, 1969a Isozyme variation in Drosophila island populations II. An analysis of Drosophila ananassae populations in the Samoan, Fijian and Philippine Islands. Univ. Texas Publ. 6918: 187-205.

JOHNSON, F. M., H. E. SCHAFFER, J. E. GILLASPY and E. S. ROCKWOOD, 1969b Isozyme genotype-environment relationships in natural populations of the harvester ant, Pogonomyrmex barbatus from Texas. Biochem. Genet. 3: 429-450.

KENDALL, M. G. and A. STUART, 1963 The Advanced Theory of Statistics. Vol. 2. Griffin & Co., London.

KOEHN, R. K. and D. I. RASMUSSEN, 1967 Polymorphic and monomorphic serum esterase heterogeneity in catostomid fish populations. Biochem. Genet. 1: 131-144.

KOJIMA, K., P. SMOUSE, S. YANG, P. S. NAIR and D. BRNCIC, 1972 Isozyme frequency patterns in Drosophila pavani associated with geographical and seasonal variables. Genetics 72: 721-731.

LAWLEY, D. N., 1956 A general method for approximation to the distribution of likelihood ratio criteria. Biometrika 43: 295-303.

LEWONTIN, R. C. and J. L. HUBBY, 1966 A molecular approach to the study of genic heterozygosity in natural populations. II. Amount of variation and degree of heterozygosity in natural populations of Drosophila pseudoobscura. Genetics 54: 595-609.

PEERS, H. W., 1971 Likelihood ratio and associated test criteria. Biometrika 58: 577-587.

SELANDER, R. K., S. Y. YANG and W. G. HUNT, 1969 Polymorphism in esterases and hemoglobin in wild populations of the house mouse (Mus musculus). Univ. Texas Publ. 6918: