model, it is important to notice that the infinite
sites model without recombination is identical to the
infinite alleles model. Besides, for alleles with low
frequency the expected number of different alleles
in a sample approximates well the number of different
sites segregating (Nei, 1977). Any difference in
the results using these two models must, therefore,
be related to the different aspects of allelic data used
as input.
For the step-wise mutation model, the form of
equation (2.29) is given as
[2nqj 2n
E ( k ) = T. ( .) B fB' + j ;6 + 2 n - j )/u(6B' + l) (2.39)
r j=l ] 1
where B(.,.) is the beta function and B' is as given
in (2.2). This expression is practically identical
to (2.29) for $\theta < 0.01$, as seen in extensive numerical
computations by Chakraborty et al. (1980).
2.4.1.1.3. Number of singletons ($k_s$).
An exact solution of (2.19) may be obtained for
singletons, or single copy alleles, in the sample ($k_s$).
By taking the binomial sampling equation for j=1, we
get

$$E(k_s) = 2n \int_0^1 x(1-x)^{2n-1}\,\Phi(x)\,dx = 2n\theta \int_0^1 (1-x)^{2n+\theta-2}\,dx = \frac{2n\theta}{2n+\theta-1} \tag{2.40}$$
which yields a method of moments estimator, $\hat{\theta}_s$, as

$$\hat{\theta}_s = \frac{k_s(2n-1)}{2n-k_s} \tag{2.41}$$
where $k_s$ is the observed number of singletons/locus
in the sample. For the infinite sites model,
° l k s k s /Zn (2.42)
■k
where k is the proportion of sites segregating as
singletons. Equation (2.42) is quite similar to
(2.41) for large 2n.
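
As a numerical illustration of (2.40) and (2.41), the following minimal Python sketch evaluates the expected number of singletons for a given $\theta$ and then inverts that value to recover the moment estimator. The function names and the example values ($\theta = 1.5$, $2n = 200$) are assumptions chosen only for illustration.

```python
def expected_singletons(theta, two_n):
    """Expected number of singletons, E(k_s) = 2n*theta / (2n + theta - 1), as in (2.40)."""
    return two_n * theta / (two_n + theta - 1)


def theta_from_singletons(k_s, two_n):
    """Moment estimator of theta, k_s*(2n - 1) / (2n - k_s), as in (2.41)."""
    if k_s >= two_n:
        raise ValueError("k_s must be smaller than the number of genes sampled (2n)")
    return k_s * (two_n - 1) / (two_n - k_s)


# Hypothetical example values, not taken from any data set.
theta_true, two_n = 1.5, 200
k_expected = expected_singletons(theta_true, two_n)
theta_back = theta_from_singletons(k_expected, two_n)
print(k_expected, theta_back)
```

Feeding the expected value from (2.40) into (2.41) returns the original $\theta$, which confirms that the two expressions are algebraic inverses of each other.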
The above solutions of the sampling equations
have been given on the assumption of sampling from
an infinite population with random mating, without
replacement. This is not approached in actual situations.
While the sampling procedures for finite populations will
be taken up in the next section, it may be relevant
to include the role of inbreeding in recovering the
number of alleles in the population.
Templeton (1980) has given the appropriate
expected number of neutral alleles under the infinite
model, as E (k I a) = £l-[l-x] 4'+ax(l-x) ] 0 2n-l 2n-j = E(k j a=0) - 0 E (x)dx f (2 n + 1 ) T (2 n + j + 0) j =0 2n-j r(j+l)r(2n+0) (2.43)
where $E(k \mid \alpha = 0)$ is equation (2.11); $2n$, $\Gamma$ and $\theta$ are
already defined. It is readily seen from (2.43) that the
mean number of alleles recovered in a sample decreases with increase in $\alpha$, the inbreeding coefficient of
the population. No simple expression for $\theta$ in
terms of $E(k \mid \alpha)$, leading to an estimator based on k, is, however, available.
In an earlier section it was pointed out that
$\hat{\theta}_F$ is a biased estimator of $\theta$. For $0.6 < \theta < 2.0$, Ewens
and Gillespie (1974) show, by simulation, that the
mean value of $\hat{\theta}_F$ is rather consistently about 40% or
more in excess of $\theta$. Although $\hat{\theta}_k$ is also a biased
estimator (no unbiased estimator of $\theta$ exists; Ewens,
1979a), its bias decreases to zero asymptotically. For 2n=200, the bias is negligible (Ewens, 1979a).
Furthermore, $\hat{\theta}_k$ has a very small mean square error,
typically 1/7th or 1/8th of that of $\hat{\theta}_F$. In the context of
strict neutrality there appears no excuse for
estimating $\theta$ through F.
Under generalized neutrality, however, the
above comparison does not hold universally for large values of $\alpha' = 2Ns$, where s is the selection coefficient.
For large $\alpha'$, $\hat{\theta}_F$ has less bias than $\hat{\theta}_k$ under a number of
conditions. It might seem paradoxical then to estimate
$\theta$ from $\hat{\theta}_k$ when $\alpha'$ is low and from $\hat{\theta}_F$ when $\alpha'$ is
high for the property of unbiasedness. In addition, the
two estimates will then represent "total" $\theta$ and "neutral" $\theta$.
In dealing with protein data, when strict neutrality is presumed but neither proved nor disproved using a statistical test, the choice of
$\hat{\theta}_F$ may, fortuitously, provide a less biased estimate
if the bias is removed by defining a new estimator $\hat{\theta}_F^*$, given by

$$\hat{\theta}_F^* = 0.71\,\hat{\theta}_F \tag{2.44}$$

This new estimator is designed to allow for the 40%
bias in $\hat{\theta}_F$. Thus $\hat{\theta}_F^*$ will be an approximately unbiased estimator of "total" $\theta$, i.e., when $\alpha' = 0$, and an unbiased estimator of "neutral only" $\theta$,
whichever might apply. This blind estimation procedure is the only alternative, unless strong test statistics are developed to discriminate between the various
aspects of neutrality.
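
The correction in (2.44) is simple arithmetic: an estimator that overshoots by about 40% is rescaled by roughly 1/1.4. The short Python sketch below shows this, with the function name and the input value being hypothetical.

```python
# Factor used in (2.44); it is close to 1/1.4, i.e. it undoes a 40% upward bias.
BIAS_FACTOR = 0.71


def theta_f_star(theta_f):
    """Bias-adjusted estimator of (2.44): 0.71 times the F-based estimate."""
    return BIAS_FACTOR * theta_f


# Hypothetical F-based estimate that is 40% above a true value of 1.0.
adjusted = theta_f_star(1.4)
print(adjusted, 1 / 1.4)
```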
2.4.2 Branching process models.
Alternative approaches to estimate the mutation rate from protein data were suggested by Neel and Thompson (1978) and Rothman and Adams (1978) using branching process models. Fisher (1930) and Karlin and McGregor (1967) had earlier considered these models for estimating the total number of heterozygous loci in the genome and number of alleles represented by j copies in the population, respectively.
One of the advantages in working with branching process models is that the assumptions of fixed population size
and equilibrium are not required to describe the transition of an allele from the jth allelic state to the ith allelic state, where an allelic state is defined by the number of copies by which an allele is
represented in the population. However, for the estimation of mutation rate these questions are difficult to avoid.
2.4.2.1. Rothman and Adams' method.
Let $K_s^t$ denote the number of different single copy alleles segregating at a locus in a population in the tth generation. The presence of these alleles can be attributed to three different sources, provided there is no immigration/emigration and intragenic recombination is low. These sources are:
(1) New mutations introduced in the tth
generation at a rate $2N^t\mu$, where $N^t$ is the population size in the tth generation. It is assumed that an infinite series of alleles can be generated at this locus, i.e., every new
allele is a novel allele,
(2) The drift of higher frequency alleles in the (t-1)th generation to the singleton class in the tth generation,
(3) The retention of singletons in the (t-1)th generation as singletons in the tth generation.
The drift to and retention of singletons is given by the probability transition matrix P,
where the individual elements of the matrix, $P_{ji}$, indicate
the probability that an allele present in j copies in the (t-1)th generation is changed to i copies in the tth generation. Quantitatively, this is given by Rothman and Adams (1978) as

$$E(K_s) = 2N\mu + K \sum_{j \ge 1} g(j)\,P_{j1} \tag{2.49}$$

where $E(K_s)$ is the expected number of singletons in the population, g(j) is the relative proportion of alleles each represented by j copies in the population, K is the expected number of alleles in the population and $P_{j1}$ is the transition probability vector. This
equation represents the balance, at equilibrium, between the expected number of alleles entering the singleton class and those alleles which exit.
The method of Rothman and Adams, of course, assumes that the mutational events are given under the infinite alleles model. The form of equation
(2.49) implicitly also assumes that mutation is
introduced as a replication error during gametogenesis and is expressed phenotypically in the offspring.
This being a unique event under the infinite alleles model, the possibility of a similar slip occurring again
An alternative model for the occurrence of mutation has been put forward by Vogel (1970;
1975). Under this model the mutation is introduced in the non-expressible form in the gamete cells of
one of the parents. The probability of transmission of such mutations is governed by the usual demographic processes. The form of equation (2.45) under this model will be

$$E(K_s^t) = \left[2N^{t-1}\mu + K_s^{t-1}\right]P_{11} + K^{t-1}\sum_{j \ge 2} g(j)\,P_{j1} = 2N^{t-1}\mu\,P_{11} + K^{t-1}\sum_{j \ge 1} g(j)\,P_{j1} \tag{2.50}$$

Expanding over t generations, and after rearranging, we get at equilibrium:

$$E(K_s) = 2N\mu\,P_{11} + K\sum_{j \ge 1} g(j)\,P_{j1} \tag{2.51}$$

The model derived here, however, assumes that the mutations are introduced during the pre-pubertal period. Adjustments to the transmission probability $P_{11}$ (associated with fresh mutations) will have
to be made if the mutation is introduced in the gametes of the parents during the reproductive period.
Although the second model is not entirely acceptable (Vogel, 1975), the above equations have implications when the model is extended to expanding or contracting populations. While under the first model the mutation rate is measured in terms of the
size of the population in the tth generation, under
the second model the size of the previous generation
is taken into consideration. A comparison of equations
(2.45) and (2.51) reveals that, under the first
model, the adjustment for the size of the previous
generation is not admissible.
Rothman and Adams (1978) have given the
equation which takes into consideration the growth
rate per generation in the estimation of $K_s$. Accordingly,
2Nt ‘1y + H Zg(j)Pn j=l J '
(2.52)
which is an extension of the approach taken by Lea
and Coulson (1949). However, this equation is not
extendable to any of the two models of mutation
mentioned earlier.
Neel and Rothman (1978) rewrite the expression
(2.49) as

$$2N\mu + K\sum_{j > 1} g(j)\,P_{j1} = K\,g(1)\,(1 - P_{11}) \tag{2.53}$$
which expresses the balance between the number of
singletons lost and gained per generation. This
The elements of the transition probability matrix
P are calculated as

$$P_{ji} = \sum_{h=1}^{\min(i,j)} \binom{j}{h}\binom{i-1}{i-h}\left(1 - \frac{b}{1-c}\right)^{j-h} b^{h}\,c^{\,i-h} \tag{2.54}$$

where b and c are the parameters of the geometric
distribution.
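
The calculation of the transition probabilities can be sketched in Python as follows. This is a minimal illustration which assumes a two-parameter (modified) geometric offspring distribution with $p_0 = 1 - b/(1-c)$ and $p_k = b\,c^{k-1}$ for $k \ge 1$, so that $P_{ji}$ is the probability that j independent copies leave i copies in total. The function names and the values of b and c are hypothetical; a brute-force convolution is included only as a cross-check of the closed form.

```python
from math import comb


def offspring_pmf(k, b, c):
    """Assumed offspring distribution: p_0 = 1 - b/(1-c), p_k = b*c**(k-1) for k >= 1."""
    if k == 0:
        return 1.0 - b / (1.0 - c)
    return b * c ** (k - 1)


def transition_prob(j, i, b, c):
    """P_ji: probability that an allele present in j copies is present in i copies next generation."""
    p0 = 1.0 - b / (1.0 - c)
    if i == 0:
        return p0 ** j  # every one of the j copies leaves no descendants
    total = 0.0
    # h = number of the j copies that leave at least one descendant
    for h in range(1, min(i, j) + 1):
        total += comb(j, h) * comb(i - 1, i - h) * p0 ** (j - h) * b ** h * c ** (i - h)
    return total


def transition_prob_bruteforce(j, i, b, c):
    """Same quantity by direct j-fold convolution of the offspring distribution."""
    dist = [1.0]  # distribution of the total for zero copies
    for _ in range(j):
        new = [0.0] * (i + 1)
        for s, ps in enumerate(dist):
            for k in range(i - s + 1):
                new[s + k] += ps * offspring_pmf(k, b, c)
        dist = new
    return dist[i]


# Hypothetical parameter values (mean offspring number b/(1-c)**2 = 1.2).
b, c = 0.3, 0.5
p_34 = transition_prob(3, 4, b, c)
p_34_check = transition_prob_bruteforce(3, 4, b, c)
row_sum = sum(transition_prob(2, i, b, c) for i in range(0, 200))
print(p_34, p_34_check, row_sum)
```

Under these assumptions $P_{11} = b$, and each row of the matrix sums to one once the extinction term $P_{j0}$ is included.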
The population values of g(i), the expected
relative frequencies of the alleles, are obtained
as

$$\sum_{j \ge 1} g(j)\,P_{ji} = g(i) \tag{2.55}$$

for $i \ge 2$. The relative frequency g(1), however,
is given as

$$g(1) = \sum_{j \ge 1} g(j)\,P_{j1} + 2N\mu/K \tag{2.56}$$
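
Rearranging the singleton balance for the mutation rate gives $\mu = K\,[\,g(1) - \sum_j g(j)P_{j1}\,]/2N$. The Python sketch below evaluates this rearrangement; the function name and all numerical inputs are hypothetical and serve only to show the arithmetic, not realistic magnitudes.

```python
def mutation_rate_from_balance(g, p_to_singleton, K, N):
    """Solve g(1) = sum_j g(j)*P_j1 + 2*N*mu/K for mu.

    g              : dict mapping copy number j -> relative proportion g(j)
    p_to_singleton : dict mapping copy number j -> transition probability P_j1
    K              : number of alleles in the population
    N              : population size
    """
    inflow = sum(g[j] * p_to_singleton[j] for j in g)
    return K * (g[1] - inflow) / (2 * N)


# Hypothetical inputs chosen for easy checking by hand.
g = {1: 0.5, 2: 0.3, 3: 0.2}
p_to_singleton = {1: 0.30, 2: 0.20, 3: 0.10}
mu = mutation_rate_from_balance(g, p_to_singleton, K=20, N=1000)
print(mu)
```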
The estimation of the relative proportions g(j),
however, needs well documented demographic data on
the population as well as extensive computations. In the
absence of such data, the rough estimates of g(j)
can be obtained from the observed distribution of rare
alleles by taking the number of copies over a set of
protein loci for sufficient sample sizes.
(1978) give the estimated number of alleles in the
population, using the number k in the sample, as

$$\hat{K} = k \Big/ \left[1 - \sum_{j=1}^{2N} g(j)\,\binom{2N-j}{2n} \Big/ \binom{2N}{2n}\right] \tag{2.57}$$

the binomial approximation of which is

$$\hat{K} = k \Big/ \left[1 - \sum_{j=1}^{2n} g(j)\,(1-f)^{j}\right] \tag{2.58}$$

where f = n/N is the sampling fraction. For j>30,
$g(j)(1-f)^j$ is negligible and the summation may be
truncated.
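
The extrapolation from the number of alleles in a sample to the number in the population can be sketched in Python as below. The sketch assumes that the correction term is the probability that an allele with j copies is missed by the sample: a hypergeometric term for sampling 2n of 2N genes without replacement, approximated by $(1-f)^j$. The function names, the truncation argument and the example values of g(j) are hypothetical.

```python
from math import comb


def prob_missed_exact(j, two_N, two_n):
    """Probability that none of the j copies falls in a sample of 2n genes out of 2N."""
    return comb(two_N - j, two_n) / comb(two_N, two_n)


def prob_missed_binomial(j, f):
    """Binomial approximation (1 - f)**j with sampling fraction f = n/N."""
    return (1.0 - f) ** j


def estimate_population_alleles(k, g, f, max_copies=30):
    """K-hat = k / (1 - sum_j g(j)*(1-f)**j), truncating the sum at max_copies."""
    missed = sum(gj * prob_missed_binomial(j, f) for j, gj in g.items() if j <= max_copies)
    return k / (1.0 - missed)


# Hypothetical check: if every allele were a singleton, K-hat reduces to k/f.
k_hat = estimate_population_alleles(k=10, g={1: 1.0}, f=0.25)

# Exact and approximate miss probabilities for j = 2, 2N = 2000, 2n = 500.
exact = prob_missed_exact(2, 2000, 500)
approx = prob_missed_binomial(2, 0.25)
print(k_hat, exact, approx)
```

For small j and a large population the two miss probabilities agree closely, which is why the binomial form is adequate for rare alleles.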
The estimation procedure of Neel and Rothman
(1978), however, is very difficult to utilise since
there are too many unknowns. In the absence of well
documented demographic data over a number of generations,
calculation of the values of the elements of the probability
transition matrix is difficult. Similarly, the population
values of g(j) are not known. Furthermore, extrapolation
of the values of k to obtain $\hat{K}$ is a very uncertain
proposition since k is a random variable rather than an
2.4.2.2. Neel and Thompson's method.
Thompson and Neel (1978), using the tth generation distribution forms of the number of copies given by Keiding and Nielsen (1975), have given the parameters of the cumulative distribution for the two-parameter-geometric
form. Neel and Thompson (1978) utilize these results
to give an estimator of mutation rate as

$$K(A \ge j) = \mu N \sum_{t}^{T} \left\{\left(1 - \frac{1}{m_t}\right)^{j-1}\right\} \tag{2.59}$$

where $K(A \ge j)$ represents the number of alleles with more
than or equal to j copies in the population/locus, and $m_t$ is the tth generation mean value of replicates
conditional on non-zero. However, the summation on
the right hand side is unbounded which, for the private variants, may be truncated to include only the time
since tribal differentiation. The approach is quite
useful for utilizing information on private polymorphisms.
2.4.3. Equilibrium methods.
The diffusion approximations approach outlined above helps in arriving at some of the results in
simple approximate forms. These approximations are,
however, based on a number of assumptions which may be considered unrealistic for natural populations.
Included in this section is the equilibrium approach of Ewens (1964) and Kimura and Ohta (1969),