Maximum Likelihood Estimation: Binomial

(1)

Maximum Likelihood Estimation: Binomial

For a sample of n independently-sampled alleles, n_A of type A and n_a = n − n_A of type a, the likelihood of p_A is

L(p_A) = C(p_A)nA_{(1 − p}

A)n−nA

and this is maximized when p_A = n_A/n. The maximum likelihood estimate (MLE) of p_A is its sample value:

ˆ

(2)

Aside: MLE Details

The likelihood function L(p_A) is maximized by setting to zero its derivative with respect to p_A:

∂L(p_A|n_A) ∂p_A = 0 or when ∂ ln L(p_A|n_a) ∂p_A = 0 Now ln L(p_A|n_A) = ln C + n_A ln(p_A) + (n − n_A) ln(1 − p_A) so ∂ ln L(p_A|n_A) ∂p_A = n_A p_A − n − n_A 1 − p_A

(3)

Maximum Likelihood Estimation: Multinomial

If {n_i} are multinomial with parameters n and {P_i}, then the MLE’s of P_i are n_i/n. This will always hold for genotype propor-tions, but not always for allele proportions.

For two alleles, the MLE’s for genotype proportions are: ˆ P_AA = n_AA/n ˆ P_Aa = n_Aa/n ˆ Paa = naa/n

(4)

Maximum Likelihood Estimation

Because

P_AA = p2_A + f p_A(1 − p_A)

P_Aa = 2p_A(1 − p_A) − 2f p_A(1 − p_A) Paa = (1 − p_A)2 + f p_A(1 − p_A)

the likelihood function for p_A, f is

L(p_A, f ) = C[p2_A + p_A(1 − p_A)f ]nAA

×[2p_A(1 − p_A)f ]nAa_{[(1 − p}

A)2 + pA(1 − pA)f ]naa

and it is difficult to find, algebraically, the values of p_A and f that maximize this function or its logarithm.

(5)

Bailey’s Method

Because the number of parameters (2) equals the number of degrees of freedom in this case, we can just equate observed and expected genotype proportions based on the estimates of p_A and f :

n_AA/n = ˆp2_A + ˆf ˆp_A(1 − ˆp_A)

n_Aa/n = 2ˆp_A(1 − ˆp_A) − 2 ˆf ˆp_A(1 − ˆp_A) naa/n = (1 − ˆp_A)2 + ˆf ˆp_A(1 − ˆp_A)

Solving these equations (e.g. by adding the first equation to half the second equation to give solution for ˆp_A and then substituting that into one equation):

(6)

Aside: Three-allele Case

With three alleles, there are six genotypes and 5 df. To use Bailey’s method, would need five parameters: 2 allele frequencies and 3 inbreeding coefficients. For example

P₁₁ = p2₁ + f₁₂p₁p₂ + f₁₃p₁p₃ P₁₂ = 2p₁p₂ − 2f₁₂p₁p₂ P₂₂ = p2₂ + f₁₂p₁p₂ + f₂₃p₂p₃ P₁₃ = 2p₁p₃ − 2f₁₃p₁p₃ P₂₃ = 2p₂p₃ − 2f₂₃p₂p₃ P₃₃ = p2₃ + f₁₃p₁p₃ + f₂₃p₂p₃

(7)

Method of Moments

An alternative to maximum likelihood estimation is the method of moments (MoM) where observed values of statistics are set equal to their expected values regardless of degrees of freedom. In general, this does not lead to unique estimates or to estimates with variances as small as those for maximum likelihood.

(8)

Aside: MoM for Multiple Alleles

For the inbreeding coefficient at loci with m alleles A_u, two pos-sible MoM estimates are (for large sample sizes)

ˆ f_W = Pm u=1( ˜Puu − ˜p2u) Pm u=1 p˜u(1 − ˜pu) ˆ f_H = 1 m − 1 m X u=1 ˜ Puu − ˜p2_u ˜ p_u !

These both have low bias. Their variances depend on the value of f .

For loci with two alleles, m = 2, the two moment estimates are equal to each other and to the maximum likelihood estimate:

ˆ

(9)

MLE for Recessive Alleles

Suppose allele a is recessive to allele A, and a sample of n individ-uals has n_aa recessive homozygotes. The genotypes of the other (n − naa) individuals can be AA or Aa. If there is Hardy-Weinberg

equilibrium, the likelihood for the two phenotypes is L(p_a) = (p2_a)naa_{(1 − p}2 a)n−naa ln[L(p_a)] = 2n_aa ln(p_a) + (n − n_aa) ln(1 − p2_a) Differentiating wrt p_a: ∂ ln L(p_a) ∂pa = 2naa pa − 2p_a(n − n_aa) 1 − p2_a

(10)

Aside: EM Algorithm for Recessive Alleles

An alternative way of finding maximum likelihood estimates when there are “missing data” involves Estimation of the missing data and then Maximization of the likelihood.

For a locus with allele A dominant to a the missing information is the counts of the AA and Aa genotypes. Only the joint count (n − n_aa) of AA + Aa is observed.

(11)

Aside: EM Algorithm for Recessive Alleles

Maximize _{the likelihood (using Bailey’s method):}

ˆ pa = nAa + 2naa 2n = 1 2n 2p_a(n − n_aa) (1 + p_a) + 2naa ! = 2(npa + naa) 2n(1 + p_a)

An initial estimate p_a is put into the right hand side to give an updated estimated ˆp_a on the left hand side. This is then put back into the right hand side to give an iterative equation for pa.

(12)

EM Algorithm for Two Loci

An interesting application of the EM algorithm is the estimation of two-locus gamete frequencies from unphased genotype data. For locus A with alleles A, a and locus B with alleles B, b, the ten two-locus frequencies are:

Genotype Actual Expected Genotype Actual Expected

AB/AB P_ABAB p2_AB AB/Ab P_AbAB 2pABpAb

AB/aB P_aBAB 2pABpaB AB/ab P_abAB 2pABpab

Ab/Ab P_AbAb p2_Ab Ab/aB P_aBAb 2p_AbpaB

(13)

EM Algorithm for Two Loci

Gamete frequencies are marginal sums: p_AB = P_ABAB + 1 2(P AB Ab + PaBAB + PabAB) p_Ab = P_AbAb + 1 2(P Ab AB + PabAb + PaBAb) p_aB = P_aBaB + 1 2(P aB AB + PabaB + PAbaB) p_ab = P_abab + 1 2(P ab Ab + PaBab + PABab )

Arrange the gamete frequencies as a two-way table to show that only one of them is unknown when the allele frequencies are known:

p_AB p_Ab p_A p_aB p_ab p_a

(14)

EM Algorithm for Two Loci

The two double heterozygote counts nAB_ab , nAb_aB are “missing data.”

Assume initial value of p_AB and Estimate the missing counts as proportions of the total count n_AaBb of double heterozygotes:

nAB_ab = 2pABpab

2p_ABp_ab + 2p_Abp_aBnAaBb nAb_aB = 2pAbpaB

2p_ABp_ab + 2p_Abp_aBnAaBb and then Maximize the likelihood by setting

(15)

Example

As an example, consider the data for two SNPs:

BB Bb bb Total

AA n_AABB = 0 n_AABb = 0 n_AAbb = 2 n_AA = 2 Aa n_AaBB = 1 n_AaBb = 3 n_Aabb = 4 n_Aa = 8

aa n_aaBB = 0 n_aaBb = 1 n_aabb = 4 naa = 5

Total n_BB = 1 n_Bb = 4 n_bb = 10 n = 15

There is one unknown gamete count x = n_AB for AB:

B b Total

A n_AB = x n_Ab = 12 − x n_A = 12 a n_aB = 6 − x n_ab = x + 12 n_a = 18 Total n_B = 6 n_b = 24 2n = 30

(16)

Example

EM iterative equation:

x0 = 2n_AABB + n_AABb + n_AaBB + n_AB/ab

= 2n_AABB + n_AABb + n_AaBB + 2pABpab

2p_ABp_ab + 2p_Abp_aBnAaBb

= 0 + 0 + 1 + 3 × 2x(x + 12)

2x(x + 12) + 2(12 − x)(6 − x)

= 1 + 3x(x + 12)

(17)