Population Genetic Statistics – Within Population Estimates

Chapter 3: Materials and Methods

3.5.1 Population Genetic Statistics – Within Population Estimates

The genetic variation in populations can be summarized using several statistics. These statistics are generally descriptive, providing quantitative estimates of basic features for each population. They allow each population to be assessed relative to other populations of known diversity or size. These estimates are therefore useful in supplying a general understanding of the amount of variation in a population, but not necessarily the type of variation.

The first of these summary statistics is the gene diversity or average heterozygosity. This statistic is defined as the probability that any two alleles (either haplogroup or haplotype) randomly chosen from a population are different (Nei, 1978, 1987). The estimation of this statistic was implemented using Arlequin v3.11 (Excoffier, Laval, & Schneider, 2005). I calculated the gene diversities for both mtDNA and NRY data using haplogroup designations and haplotypes. Those calculated from haplogroup frequencies are referred to as “haplogroup diversities,” while those from the haplotype data are “haplotype diversities.”
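The definition above can be illustrated with a short Python sketch. It assumes the commonly used sample-size-corrected estimator, H = n/(n-1) × (1 - Σp_i²), where p_i is the sample frequency of allele i and n is the number of sampled gene copies. The function name and the sample of haplogroup calls are hypothetical, and the sketch is not the Arlequin implementation.

```python
from collections import Counter


def gene_diversity(alleles):
    """Sample-size-corrected gene diversity: n/(n-1) * (1 - sum(p_i^2)).

    `alleles` is a sequence of allele labels (haplogroups or haplotypes),
    one entry per sampled individual.
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled alleles")
    counts = Counter(alleles)
    # Sum of squared allele frequencies (the sample homozygosity)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical haplogroup calls, for illustration only
sample = ["A", "A", "B", "C", "C", "C", "D", "D"]
print(round(gene_diversity(sample), 4))
```

The same function applies to haplogroup and haplotype data; only the labels passed in differ.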

Haplogroup diversity is estimated from haplogroup frequencies, with each haplogroup essentially representing a different allele. For this reason, the definition of a haplogroup is important when describing the amount of variation in a population. A haplogroup is essentially a monophyletic clade in a phylogeny that shares a number of unique polymorphisms relative to other clades of equal depth (Richards et al., 1998).

The haplogroups for human mtDNA studies were defined by RFLPs (SNPs and indels). It has become clear, however, that not all of these SNPs are at the same level (or depth) in the phylogeny. For example, the SNP defining haplogroup U (12038) is quite old and encompasses a number of other haplogroups (U1-U7, U8/K and U9) (van Oven & Kayser, 2009). Throughout these analyses, all haplogroup designations are related to branches that are approximately the same depth in the phylogeny. The NRY haplogroups are similarly resolved for this analysis.

Haplotype diversity is calculated in the same way as haplogroup diversity. However, this gene diversity estimate is the probability that any two haplotypes (HVS1 DNA sequences for mtDNA or Y-STRs for NRY) randomly chosen from a single population are different. Comparisons are made at the nucleotide level between the same stretches of sequence or the same sets of repeats. For this reason, each unique haplotype serves as a unique allele. Moreover, haplotype diversities are not biased in the same way as haplogroup diversities because it is not necessary to group samples. In fact, no a priori categories are used for these estimates.

Differences at the haplotype level were examined with additional statistics. For the mtDNA sequences, nucleotide diversity, average pairwise differences, and mismatch distributions were estimated using Arlequin v3.11 (Excoffier et al., 2005). Nucleotide diversity is the probability that any two nucleotides randomly chosen from a population are different (Nei & Li, 1979; Nei & Tajima, 1981). It can be more informative than gene diversity estimates for relationships between haplotypes, since the statistic takes into account the amount of difference between sequences instead of simply whether two alleles are different. Similarly, the average number of pairwise differences takes into account the DNA sequence and is defined as the average number of differences between all pairs of haplotypes in a population (Tajima, 1983).
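A minimal Python sketch of these two sequence-based statistics follows. It assumes aligned sequences of equal length, counts simple site differences with no correction for multiple hits, and takes nucleotide diversity as the mean number of pairwise differences divided by the number of aligned sites. The function names and toy sequences are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_differences(seqs):
    """Average number of differences over all pairs of sampled sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    pairs = list(combinations(seqs, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site."""
    return mean_pairwise_differences(seqs) / len(seqs[0])


# Toy aligned sequences, for illustration only (not real HVS1 data)
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "ACGTACGT"]
print(mean_pairwise_differences(seqs), nucleotide_diversity(seqs))
```

Each sampled individual contributes one sequence, so identical haplotypes appear more than once in the input and pairs of identical sequences contribute zero differences.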

These estimates have become standards for describing the amount of genetic diversity within a population (Nei, 1987). This is mostly because they are not influenced by sample size in the same manner as counting the number of segregating sites between sequences or counting the number of alleles in a sample – both measures of which can be affected by deleterious mutations (Tajima, 1983). Nevertheless, it should be noted that large stochastic variances can be associated with average pairwise difference estimates.

The distribution of the observed number of differences between all pairs of haplotypes is called a mismatch distribution (Excoffier et al., 2005). Using a single stepwise expansion model, the shape and raggedness of the mismatch distribution curve can provide insight into a population’s demography (Rogers & Harpending, 1992). This approach was expanded to include rate heterogeneity, making it more suitable to mtDNA sequence analysis (Schneider & Excoffier, 1999). Simulation studies indicate that the population size parameters are too conservative when estimated with rate heterogeneity. Therefore, the magnitude of expansion cannot be determined with this method (Schneider & Excoffier, 1999), although the parameter for time of expansion is still unbiased (Excoffier et al., 2005; Slatkin, 1995).
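The observed mismatch distribution can be tabulated with a few lines of Python, sketched below. Raggedness is taken here as the sum of squared differences between the frequencies of adjacent mismatch classes, which is an assumption about the exact index. The sketch does not fit the expansion model or estimate its parameters, and the function names and toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Relative frequency of each pairwise-difference class 0..d,
    where d is the largest observed number of differences."""
    diffs = [sum(x != y for x, y in zip(a, b))
             for a, b in combinations(seqs, 2)]
    counts = Counter(diffs)
    total = len(diffs)
    return [counts.get(k, 0) / total for k in range(max(diffs) + 1)]


def raggedness(freqs):
    """Sum of squared differences between adjacent class frequencies.
    The class after the last observed one is given frequency 0."""
    ext = list(freqs) + [0.0]
    return sum((ext[i] - ext[i - 1]) ** 2 for i in range(1, len(ext)))


# Toy aligned sequences, for illustration only
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "ACGTACGT"]
dist = mismatch_distribution(seqs)
print(dist, raggedness(dist))
```

A smooth, unimodal distribution gives a small raggedness value, while a multimodal distribution gives a larger one.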

For Y-chromosome microsatellite data, two statistics were employed – the number of different alleles and the sum of squared differences. The number of different alleles is the equivalent of the number of unique haplotypes for the mtDNA data. The other statistic is specific to microsatellites. The sum of squared differences is dependent on the similarities in repeat length at each locus, and can therefore be used to estimate distance between haplotypes (Slatkin, 1995). The statistic “counts the sum of the squared number of repeat differences between two haplotypes” (Excoffier et al. 2005:104). The fewer the number of repeat differences per locus between two haplotypes, the more similar two STR haplotypes are.
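Both Y-STR statistics are simple to express in Python, as the sketch below shows. It assumes each haplotype is a tuple of repeat counts with loci in the same order. The repeat values are invented for illustration and do not correspond to any particular marker set.

```python
def sum_squared_differences(hap_a, hap_b):
    """Sum over loci of the squared difference in repeat number
    between two STR haplotypes."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must be typed at the same loci")
    return sum((a - b) ** 2 for a, b in zip(hap_a, hap_b))


def number_of_different_haplotypes(haplotypes):
    """Count of unique haplotypes (alleles) in the sample."""
    return len(set(tuple(h) for h in haplotypes))


# Hypothetical repeat counts at four loci, for illustration only
hap1 = (14, 13, 30, 24)
hap2 = (14, 12, 29, 22)
print(sum_squared_differences(hap1, hap2))
print(number_of_different_haplotypes([hap1, hap2, hap1]))
```

Because differences are squared, a two-repeat difference at one locus contributes more to the distance than one-repeat differences at two loci.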

The population parameter theta (θ = 2Nµ, where N is the inbreeding effective population size and µ is the neutral mutation rate) was estimated using mtDNA HVS1 sequences. Four different calculations were made using Arlequin v3.11 (Excoffier et al., 2005). The four estimates for mtDNA haplotypes are based on expected homozygosity (θ(H)), number of segregating sites (θ(S)), expected number of alleles (θ(k)), and the average number of pairwise differences (θ(π)) (Ewens, 1972; Excoffier et al., 2005; Tajima, 1983; Watterson, 1975; Zouros, 1979). This analysis provides information on the relative strength of mutation versus genetic drift in a population (Templeton, 2006). For the Y-STR haplotypes, only one θ estimate was calculated, and it was based on expected heterozygosity using a pure stepwise mutation model (Excoffier et al., 2005; Ohta & Kimura, 1973). The other three estimates were not used for Y-STRs because they are either based on sequence data (θ(S) and θ(π)) or inappropriate owing to violations of fundamental assumptions associated with the statistic (infinite-allele equilibrium for θ(k)).
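Two of these estimators have simple closed forms, sketched below in Python. The sketch assumes the standard Watterson form θ(S) = S/a, with a = Σ 1/i for i = 1 to n-1. For the stepwise mutation model it assumes the equilibrium relationship between expected heterozygosity H and θ, namely 1 - H = 1/√(1 + 2θ). θ(π) is simply the average number of pairwise differences. The function names are hypothetical, and no correction for bias or for multiple loci is attempted.

```python
def count_segregating_sites(seqs):
    """Number of alignment columns showing more than one nucleotide."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def theta_s(seg_sites, n):
    """Estimate of theta from the number of segregating sites in a
    sample of n sequences: S divided by sum(1/i for i in 1..n-1)."""
    if n < 2:
        raise ValueError("need at least two sequences")
    a = sum(1.0 / i for i in range(1, n))
    return seg_sites / a


def theta_h_stepwise(heterozygosity):
    """Theta from expected heterozygosity under a pure stepwise
    mutation model, assuming 1 - H = 1 / sqrt(1 + 2 * theta)."""
    if not 0.0 <= heterozygosity < 1.0:
        raise ValueError("heterozygosity must be in [0, 1)")
    return (1.0 / (1.0 - heterozygosity) ** 2 - 1.0) / 2.0


# Toy aligned sequences, for illustration only
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "ACGTACGT"]
s = count_segregating_sites(seqs)
print(s, theta_s(s, len(seqs)), theta_h_stepwise(0.5))
```

Comparing θ(S) with θ(π) for the same sample is the basis of Tajima's D, since the two estimators respond differently to rare variants.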

Neutrality indices were also calculated with Arlequin v3.11 using some of the aforementioned θ estimates. In Tajima’s test of neutrality, the D statistic is calculated using the difference between θ estimates derived from the number of segregating sites and the average number of pairwise differences (Excoffier et al., 2005; Tajima, 1989a, 1989b, 1996). Similarly, Fu’s FS test of neutrality evaluates the probability of observing, in a random neutral population, a number of alleles equal to or smaller than the number observed in the population in question, given the θ estimate obtained from the average number of pairwise differences (Excoffier et al., 2005; Fu, 1997). I considered any estimate with a P-value below 0.02 as significant for Fu’s FS test of neutrality, following the recommendations in Excoffier et al. (2005). Both of these neutrality tests essentially assess the number of rare alleles in a population. An excess of rare alleles can be due to selection, but it is also characteristic of population expansion. Conversely, the presence of several alleles at high frequencies can indicate balancing or diversifying selection or even population substructure.
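For illustration, Tajima's D can be computed from three summary values: the sample size n, the number of segregating sites S, and the average number of pairwise differences π. The Python sketch below uses the standard variance constants for the statistic. The function name is hypothetical, and the sketch does not compute a P-value, which Arlequin obtains by simulation.

```python
import math


def tajimas_d(n, seg_sites, mean_pairwise_diffs):
    """Tajima's D from sample size, segregating sites, and mean
    pairwise differences. Negative values reflect an excess of rare
    variants; positive values an excess of intermediate-frequency ones."""
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    if seg_sites < 1:
        raise ValueError("D is undefined without segregating sites")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Numerator: theta(pi) minus theta(S)
    return (mean_pairwise_diffs - seg_sites / a1) / math.sqrt(variance)


# Hypothetical summary values, for illustration only
print(round(tajimas_d(10, 16, 3.888889), 3))
```

With these values θ(S) exceeds θ(π), so D is negative, the pattern expected when rare alleles are in excess.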
