MtDNA data analysis - MtDNA methods


2. SUBJECTS AND METHODS

2.2 Methods

2.2.2 MtDNA methods

2.2.2.3 MtDNA data analysis

The designed minisequencing method was used to assign samples to their major haplogroups. Further classification was achieved by analysing HVS-I and II.

HVS-I and II sequences were aligned to the control region reference sequence (Andrews et al., 1999) using the Clustal W algorithm (Thompson et al., 1994) implemented in BioEdit v.7.0.5.3 (Hall, 1999). HVS-I and II sequences (positions 15997-16569 and 57-607) were then combined into one sequence of 1124 bp for further analysis. Unique haplotypes were identified using DnaSP v4.10 (Rozas et al., 2003) and variant sites were recorded electronically using S-compare (Nelson, 2006). Using the variant positions together with a phylogenetic approach, haplogroups were assigned according to the nomenclature of Behar et al. (2008).
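As a quick consistency check on the combined sequence length, the two inclusive coordinate ranges quoted above can be summed. This is only an illustrative sketch; the function and variable names are not from the original study.

```python
# Inclusive coordinate ranges quoted in the text (reference-sequence numbering).
HVS1_RANGE = (15997, 16569)
HVS2_RANGE = (57, 607)


def span_length(start, end):
    """Number of positions in an inclusive coordinate range."""
    return end - start + 1


hvs1_length = span_length(*HVS1_RANGE)   # 573 positions
hvs2_length = span_length(*HVS2_RANGE)   # 551 positions
combined_length = hvs1_length + hvs2_length

print(combined_length)  # 1124
```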

Variation in the HVS-II region 303-315 was not considered or reported in any of the analyses. Insertions in the poly-C repeat tract at positions 568-573 were treated as a single 1 bp C insertion. All other regions were considered, although some regions were differentially weighted as outlined in the analysis descriptions.

Phylogenetic trees were inferred by maximum likelihood using PHYML (Guindon et al., 2005). The HKY substitution model with gamma-distributed rates and invariable sites was identified as the best-fitting model by likelihood ratio tests in Modeltest 3.7 (Posada and Crandall, 1998), run in conjunction with PAUP v4.0b10 (Swofford, 1998), and was implemented in the maximum likelihood analysis. The tree topology search employed nearest neighbour interchange (NNI). An approximate likelihood ratio test (aLRT) was computed to determine branch support (Anisimova and Gascuel, 2006). Trees were visualized in MEGA4 (Tamura et al., 2007).

Networks of the sequences were constructed using the Median Joining algorithm (Bandelt et al., 1999) of Network v4.5.0.0 (Fluxus-engineering, 2008). Networks were subjected to maximum parsimony post-analysis using the Steiner maximum parsimony algorithm (Polzin and Daneschmand, 2003) within Network v4.5.0.0. For network analysis the epsilon parameter (Network program parameter for quick calculation of sparse networks) was set to 2, and transversions were given three times the weight of transitions. Furthermore, the weight of position 16189 was reduced 10-fold and the weight of each of the CA repeats at position 523 was reduced 5-fold per nucleotide in the repeat.

Sequences from other sources included in the phylogenetic and network analyses were the Neanderthal sequence (GenBank accession number: NC_011137) (Green et al., 2008) and the control region reference sequence (Andrews et al., 1999). Additional L0d sequences published in the literature (Gonder et al., 2007; Tishkoff et al., 2007; Behar et al., 2008) were included in the L0d network for comparison with our results. The data sets of Gonder et al. (2007) and Tishkoff et al. (2007) overlapped in some of the subjects; in each such case only one of the two sequences was selected.

Time estimates of L0d subgroups were calculated using the Rho statistic (Forster et al., 1996) with the associated standard deviation, sigma (Saillard et al., 2000), using a mutation rate of 2.5 x 10^-6 per nucleotide per generation (Ward et al., 1991) (25 yrs per generation; 1124 nucleotides). Time estimates were also calculated using other published mutation rates, i.e. 1.75 x 10^-6 per nucleotide per generation (Horai et al., 1995), 4.5 x 10^-6 per nucleotide per generation (Forster et al., 1996) and 2.1 x 10^-6 per nucleotide per generation (Soodyall et al., 1996), but because of its intermediate value the mutation rate of Ward et al. (1991) was used in subsequent discussions and analyses. A generation time of 25 years was used.
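The Rho calculation can be sketched as follows. This is a minimal illustration, not the software used in the study: Rho is taken as the mean number of mutations separating the sampled sequences from the ancestral haplotype, and sigma follows the branch-based formula of Saillard et al. (2000). The branch list in the example is hypothetical; only the mutation rate, sequence length and generation time come from the text.

```python
import math

MU_PER_SITE = 2.5e-6     # per nucleotide per generation (Ward et al., 1991)
N_SITES = 1124           # combined HVS-I + HVS-II length
GENERATION_YEARS = 25


def rho_sigma(branches, n_samples):
    """Return (rho, sigma) for a clade rooted at the ancestral haplotype.

    branches: list of (n_descendants, n_mutations) pairs, one per branch,
              where n_descendants is the number of sampled sequences that
              descend through that branch.
    """
    rho = sum(n * length for n, length in branches) / n_samples
    sigma = math.sqrt(sum(n * n * length for n, length in branches)) / n_samples
    return rho, sigma


def years_per_mutation(mu_per_site=MU_PER_SITE, n_sites=N_SITES,
                       generation_years=GENERATION_YEARS):
    """Average number of years for one mutation to occur in the sequence."""
    return generation_years / (mu_per_site * n_sites)


# Hypothetical clade of 10 sequences: one branch (1 mutation) leads to
# 6 samples, 2 of which carry 1 further mutation; another branch
# (2 mutations) leads to the remaining 4 samples.
example_branches = [(6, 1), (2, 1), (4, 2)]
rho, sigma = rho_sigma(example_branches, n_samples=10)
scale = years_per_mutation()

print(f"rho = {rho:.2f} +/- {sigma:.2f} mutations")
print(f"age = {rho * scale:.0f} +/- {sigma * scale:.0f} years")
```

Swapping `MU_PER_SITE` for one of the other published rates quoted above rescales the age estimate without changing Rho itself.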

Haplogroup isofrequency maps were generated by applying the Kriging method (Oliver and Webster, 1990; Xue et al., 2005) incorporated in the Surfer v.8.06.39 program (Golden-Software, 2006). Mitochondrial contour plots were based on the frequencies of the L0d/k subgroups relative to the L0d/k group as a whole. This was done to eliminate the effects that admixture from Bantu-speakers and non-Africans would have on the distribution of the L0d/k subgroups. When frequencies were calculated, sample size effects were corrected by adjusting the total sample sizes in all groups to the same value.

Mismatch distributions of populations and haplogroups were calculated in Arlequin v.3.11 (Excoffier et al., 2005). From these, the validity of demographic expansions and the dates of the expansions were inferred. The demographic expansion scenario is tested by simulating a population going through an expansion and testing whether the observed data differ significantly from the simulated expansion scenario. A non-significant sum of squared deviations (SSD) p-value is therefore taken to indicate a population or group of sequences that went through an expansion. The parameters calculated are θ₁, θ₀ and τ. Dividing θ₁ by θ₀ gives an indication of the magnitude of the expansion, while τ gives an indication of the time of the expansion. The mutation rate of 2.5 x 10^-6 per nucleotide per generation (Ward et al., 1991) and a generation time of 25 years were used to convert τ (Tau) to T (time BP when the expansion took place) using the equation T = (τ/2µ) x generation time. In this equation µ is the mutation rate per gene per generation, i.e. 2.5 x 10^-6 per nucleotide per generation (Ward et al., 1991) x 1124 sites, which gives µ = 2.81 x 10^-3.
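The conversion from τ to a date can be written out as a short sketch. The constants are those given in the text; the τ value in the example is hypothetical and chosen only to make the arithmetic easy to follow.

```python
MU_PER_SITE = 2.5e-6     # per nucleotide per generation (Ward et al., 1991)
N_SITES = 1124           # combined HVS-I + HVS-II length
GENERATION_YEARS = 25


def expansion_time_years(tau, mu_per_site=MU_PER_SITE, n_sites=N_SITES,
                         generation_years=GENERATION_YEARS):
    """Convert tau to years BP using T = (tau / 2µ) x generation time."""
    mu_per_gene = mu_per_site * n_sites      # about 2.81e-3 per generation
    generations = tau / (2 * mu_per_gene)
    return generations * generation_years


# Hypothetical tau of 5.62: 5.62 / (2 x 0.00281) = 1000 generations,
# which corresponds to about 25 000 years at 25 years per generation.
example_T = expansion_time_years(5.62)
print(round(example_T))
```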

The summary statistics for each group (number of sequences, number of haplotypes, gene diversity (Nei, 1987) and nucleotide diversity (Nei, 1987)) were calculated in DnaSP v4.10 (Rozas et al., 2003). Using DnaSP v4.10, the population mutation parameter (θ) was estimated from segregating sites (θs, per nucleotide site) as well as with the Watterson estimator (W-θs, per sequence) (Tajima, 1996). From W-θs the effective population size (Ne) was estimated by dividing W-θs by 2µ, where µ is the mutation rate per gene per generation of 2.81 x 10^-3 (Ward et al., 1991), as explained in the previous paragraph. The selective neutrality tests of Tajima’s D (Tajima, 1989), Fu’s Fs statistic (Fu, 1997) and the R2 statistic (Ramos-Onsins and Rozas, 2002) were also calculated using DnaSP v4.10.
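The Watterson estimate and the derived effective population size can be illustrated with a short sketch. It uses the standard definition of the estimator (number of segregating sites divided by the harmonic sum over n - 1) rather than DnaSP itself, and the input counts in the example are hypothetical.

```python
MU_PER_GENE = 2.81e-3    # per gene per generation (2.5e-6 x 1124 sites)


def harmonic_sum(n_sequences):
    """a_n = 1 + 1/2 + ... + 1/(n - 1), the Watterson correction term."""
    return sum(1.0 / i for i in range(1, n_sequences))


def watterson_theta(segregating_sites, n_sequences):
    """Watterson estimate of theta per sequence."""
    return segregating_sites / harmonic_sum(n_sequences)


def effective_size(theta_per_sequence, mu_per_gene=MU_PER_GENE):
    """Ne = theta / (2µ), as described in the text."""
    return theta_per_sequence / (2 * mu_per_gene)


# Hypothetical group: 10 sequences with 20 segregating sites.
theta_w = watterson_theta(segregating_sites=20, n_sequences=10)
ne = effective_size(theta_w)
print(f"theta_W = {theta_w:.3f}, Ne = {ne:.0f}")
```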

To visually represent changes in effective population size through time, Bayesian Skyline Plots (BSP) (Drummond et al., 2005) were constructed. For each of the haplogroups, BSPs of effective population size through time were constructed using a Markov Chain Monte Carlo (MCMC) sampling algorithm, as implemented in BEAST v. 1.4.8 (Drummond and Rambaut, 2007). The population size function of the BSP can be implemented using either a piecewise constant or a piecewise linear function of population size change. In the present study, a piecewise linear model made up of 10 control points was used. The general time-reversible (GTR) substitution model with estimated base frequencies and a Gamma + Invariant Sites heterogeneity model was used to infer the ancestral gene trees for each haplogroup. The mean substitution rate was fixed to the rate of Ward et al. (1991) and a relaxed molecular clock (uncorrelated lognormal) was employed. Each MCMC chain was run for 40 000 000 generations, sampled every 4 000 generations, with the first 4 000 000 generations discarded as burn-in. All runs had an effective sample size of at least 1 000 for the parameters of interest. Each independent run was repeated at least twice and results were combined using the LogCombiner v1.4.8 tool included in the BEAST package. BSPs were visualized in TRACER v. 1.4 (Rambaut and Drummond, 2007).
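The chain settings above imply the following sample counts per run. This is plain arithmetic on the stated values, offered as a sketch; whether the initial state is also logged (which would add one sample) depends on the software and is not assumed here.

```python
CHAIN_LENGTH = 40_000_000   # MCMC generations per run
SAMPLE_EVERY = 4_000        # sampling interval in generations
BURN_IN = 4_000_000         # generations discarded as burn-in

samples_logged = CHAIN_LENGTH // SAMPLE_EVERY      # 10 000 samples per run
samples_burned = BURN_IN // SAMPLE_EVERY           # 1 000 samples discarded
samples_kept = samples_logged - samples_burned     # 9 000 samples retained
burn_in_fraction = BURN_IN / CHAIN_LENGTH          # 0.1, i.e. 10 % of the chain


def combined_samples(n_runs):
    """Samples available after combining n_runs independent runs."""
    return n_runs * samples_kept


print(samples_kept, combined_samples(2))  # 9000 18000
```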

Population pairwise differences were calculated with Arlequin v3.11 (Excoffier et al., 2005) by using Fst distances (Reynolds et al., 1983) incorporating the nucleotide correction model of Tamura and Nei (Tamura and Nei, 1993) and a gamma correction of 0.532. An exact test of population differentiation (Raymond and Rousset, 1995) was also calculated using Arlequin v3.11 (Excoffier et al., 2005). The distance matrix was visualized through PCA and cluster analysis in PAST v.1.54 (Hammer et al., 2001b).

The relationship between physical and genetic distances was investigated in the Khoe-San and Coloured groups by linear regression using R v.2.5.0 (R-Project, 2006).

The regression was applied on a scatter plot resulting from pairwise comparisons of distance matrices based on physical and genetic distances. The linear regression model

against one another and assign significance values to each model. Additionally, a Mantel test implemented in Arlequin v3.11 (Excoffier et al., 2005) was done to test the correlation between the two distance matrices.
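The idea behind a Mantel test can be sketched as a permutation test on the correlation between the off-diagonal entries of two distance matrices. The version below is a simple one-sided test on the Pearson correlation and is not Arlequin's implementation; the two matrices, the number of permutations and the seed are hypothetical.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    rng = random.Random(seed)
    n = len(dist_a)
    flat_b = upper_triangle(dist_b)
    observed = pearson(upper_triangle(dist_a), flat_b)
    at_least_as_large = 0
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns of the first matrix together.
        permuted = [[dist_a[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(upper_triangle(permuted), flat_b) >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (permutations + 1)
    return observed, p_value


# Hypothetical example: four sites on a line (km) and genetic distances
# that are exactly proportional to the physical distances.
physical = [[0, 10, 20, 30],
            [10, 0, 10, 20],
            [20, 10, 0, 10],
            [30, 20, 10, 0]]
genetic = [[0.01 * d for d in row] for row in physical]

r_obs, p_val = mantel_test(physical, genetic)
print(f"r = {r_obs:.3f}, p = {p_val:.3f}")
```

Permuting rows and columns together keeps each site's distances intact, which is why the test respects the non-independence of the pairwise entries where an ordinary regression p-value does not.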

The physical distance matrix was constructed by obtaining latitude and longitude information of the different sampling locations from the website “Google Maps Latitude, Longitude Popup” (Gorissen, 2008) and calculating the great circle distance (in km) between the points using the “Latitude/Longitude Distance Calculation” website (Michels, 1997). The physical distance matrix is included in Appendix C.
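For illustration, a great circle distance matrix of this kind can be computed with the haversine formula. This is a sketch and not the website calculator used in the study; the mean Earth radius of 6371 km is an assumption, and the site names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great circle distance in km between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_matrix(coords):
    """Symmetric matrix of pairwise distances for a list of (lat, lon)."""
    return [[great_circle_km(*p, *q) for q in coords] for p in coords]


# Hypothetical sampling locations (latitude, longitude in decimal degrees).
sites = {
    "site_A": (-26.0, 28.0),
    "site_B": (-22.5, 17.0),
    "site_C": (-19.0, 23.5),
}
matrix = distance_matrix(list(sites.values()))
for name, row in zip(sites, matrix):
    print(name, [round(d) for d in row])
```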

Inter-population genetic distances were used in Analyses of Molecular Variance (AMOVA), implemented in Arlequin v.3.11 (Excoffier et al., 2005). The distribution of variance among three hierarchical levels was tested in order to assess relationships among groups of populations. The lowest level is the variation among individuals within the same population. The next level is the variation among populations (the populations in this case were the groups defined in Table 2.1). The third level is the variation among groupings of these populations. Different groupings of populations were tested, based on geographic distribution, language and self-identification of populations.

In document Genetic variation in Khoisan-speaking populations from southern Africa (Page 124-128)