Molecular data analyses - Material and Methods


4.2 Material and Methods

4.2.6 Molecular data analyses

Analyses of molecular data were conducted on each primer pair dataset that showed variation in fragment profiles among any species (initial primer-trial investigation) or among S. aethnensis, S. chrysanthemifolius, S. squalidus, S. vulgaris and S. cambrensis (i.e. in the more detailed analysis). Some of these variable datasets were also combined and analysed. The combined datasets were constructed either after removing samples that were not present in all fragment datasets (pruned, P datasets) or following the introduction of missing data (MD datasets). The datasets produced from the initial primer-trial were only subjected to individual based analyses through the generation of

Chapter 4 Material and Methods

4.2.6.1 Missing data

Missing data were treated by pairwise deletion (e.g. in PAST), whereby samples are excluded from any calculation for which they have missing data. Alternatively, missing data were interpolated from sample-by-sample pairwise distances (e.g. in GenAlEx) by inserting the average genetic distance for each group level.
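As an illustration of pairwise deletion, the following Python sketch compares two binary fragment profiles using only the fragments scored in both samples. The function name, the use of None for a missing score and the simple mismatch proportion are assumptions made for this example; they are not taken from PAST.

```python
def mismatch_with_pairwise_deletion(profile_a, profile_b):
    """Proportion of mismatching fragments between two samples.

    Any fragment that is missing (None) in either sample is skipped for
    this pair only. Hypothetical helper, for illustration.
    """
    scored = [(a, b) for a, b in zip(profile_a, profile_b)
              if a is not None and b is not None]
    if not scored:
        raise ValueError("no fragment is scored in both samples")
    mismatches = sum(1 for a, b in scored if a != b)
    return mismatches / len(scored)


# Made-up profiles: 1 = fragment present, 0 = absent, None = missing
sample_1 = [1, 0, 1, None, 1]
sample_2 = [1, 1, None, 0, 0]
print(mismatch_with_pairwise_deletion(sample_1, sample_2))
```

Only three fragments are scored in both samples here, so the distance is based on those three comparisons.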

4.2.6.2 Genetic distance analysis – Neighbour Joining (NJ) and Principal Coordinate (PCO) analyses

The Neighbour Joining (NJ) cluster method minimizes the total length of the phylogram by sequentially grouping similar OTUs (operational taxonomic units) (Saitou & Nei, 1987).
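The pair-selection step of NJ can be sketched in Python as follows. The sketch uses the Q criterion by which the method is commonly described, Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, and shows a single step only, not a full tree reconstruction. The function name and the distance matrix are made up for illustration.

```python
def nj_pair_to_join(dist):
    """Return the pair of OTUs with the smallest Q value and that value."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Made-up symmetric distance matrix for five OTUs
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pair_to_join(example))
```

Note that the pair joined first is not the pair with the smallest raw distance (OTUs 3 and 4), because the criterion also accounts for how far each OTU is from all the others.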

Principal Coordinate Analysis (PCO) finds the eigenvalues and eigenvectors of a distance or similarity matrix between all data points, so that the relationships between the data points can be visualized in a low-dimensional space that reflects the original distances as closely as possible. PCO is normally performed in three steps. Firstly, a similarity/distance matrix of all data points is produced. Secondly, the matrix is double-centred so that all rows and columns sum to zero. Thirdly, the transformed matrix is factored and an eigenanalysis is performed. The eigenvectors are normalised so that the sum of squares of the components of each eigenvector equals the corresponding eigenvalue.

The elements of the normalised eigenvectors are the coordinates of the data points, representing exactly the distances between them in multidimensional space. The coordinates are expressed relative to rectangular and independent principal axes. Thus, the first dimension accounts for the greatest amount of variance and each subsequent dimension explains progressively less of the variance.
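The three steps can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the PAST implementation: the input is taken to be a Euclidean distance matrix, the centring is applied to -0.5 times the squared distances (the classical Gower transformation), and a simple power iteration with deflation stands in for a library eigenanalysis routine. All function names are hypothetical.

```python
import math


def double_centre(dist):
    """Transform distances so that all rows and columns sum to zero."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    # a is symmetric, so the column means equal the row means
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]


def dominant_eigenpair(b, iterations=1000):
    """Power iteration: largest-magnitude eigenvalue and unit eigenvector."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * bv[i] for i in range(n)), v  # Rayleigh quotient


def pco(dist, n_axes=2):
    """Return eigenvalues and per-sample coordinates on the first axes."""
    b = double_centre(dist)
    n = len(b)
    eigenvalues, coords = [], [[] for _ in range(n)]
    for _ in range(n_axes):
        lam, vec = dominant_eigenpair(b)
        eigenvalues.append(lam)
        # Scale so the sum of squares on this axis equals the eigenvalue
        scale = math.sqrt(lam) if lam > 0 else 0.0
        for i in range(n):
            coords[i].append(vec[i] * scale)
        # Deflate: remove the axis just found before extracting the next
        b = [[b[i][j] - lam * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return eigenvalues, coords


# Made-up distances between three samples (a 3-4-5 right triangle)
example_dist = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
eigenvalues, coords = pco(example_dist)
print(eigenvalues)
print(coords)
```

Because three points always fit in a plane, two axes reproduce the example distances; with distances derived from a similarity index, some eigenvalues may be negative and the representation is then only approximate.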

NJ and PCO analyses were conducted on matrices of the Dice similarity index, which gives more weight to joint occurrences of fragments than to mismatches and ignores shared absences. For combined datasets, bootstrap values (Felsenstein, 1985) for the NJ trees were obtained using 1000 pseudoreplicates. All forms of analysis were conducted using the software PAST 1.99 (Hammer et al., 2001).
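For two binary fragment profiles, the Dice index is 2a / (2a + b + c), where a is the number of fragments present in both samples and b and c are the numbers present in only one of them. A short Python sketch, with a hypothetical function name and made-up profiles:

```python
def dice_similarity(profile_a, profile_b):
    """Dice similarity between two 0/1 fragment profiles."""
    both = sum(1 for x, y in zip(profile_a, profile_b) if x == 1 and y == 1)
    only_a = sum(1 for x, y in zip(profile_a, profile_b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(profile_a, profile_b) if x == 0 and y == 1)
    denominator = 2 * both + only_a + only_b
    if denominator == 0:
        raise ValueError("index undefined: neither sample has any fragment")
    return 2 * both / denominator


print(dice_similarity([1, 1, 0, 0], [1, 0, 1, 0]))
# Appending shared absences leaves the index unchanged
print(dice_similarity([1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0]))
```

Shared absences do not enter the formula, which suits dominant fragment data where the absence of a band in two samples is weak evidence of similarity.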


4.2.6.3 Genetic distance between NJ trees

To investigate whether datasets generated for different snoRNA genes and gene clusters across the same set of taxa contained similar phylogenetic information, distances were calculated between the trees for each single dataset used in the combined data analyses and the tree for the combined matrix containing all datasets, using TREEDIST implemented in the PHYLIP package version 3.67 (Felsenstein, 2007). The Branch Score Distance (Kuhner & Felsenstein, 1994) was used in the calculations because it takes branch lengths into account. Only datasets consisting of the same samples were used in these analyses; therefore, an NJ tree for each single primer pair matrix of the combined and pruned dataset, each containing 43 samples, was produced in PAST 1.99. The NJ trees obtained in Newick notation were copied into a single file, which was processed using TREEDIST (with option 2 changed to produce a full matrix of distances between all possible pairs of trees), and the resulting distance matrix was used for PCO analysis in GenAlEx 6.3 (Peakall & Smouse, 2006).
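The idea behind a branch score can be sketched as follows: each branch of a tree splits the taxa into two groups, and the score sums the squared differences in length over all such splits, counting a split that is absent from one tree as a branch of length zero. This Python sketch assumes that each tree has already been reduced to a dictionary mapping one side of every split to its branch length; the function name, the representation and the optional square root are assumptions for illustration, and the exact scaling reported by TREEDIST should be checked against its documentation.

```python
import math


def branch_score(tree_a, tree_b, take_square_root=False):
    """Sum of squared branch-length differences over all splits."""
    splits = set(tree_a) | set(tree_b)
    total = sum((tree_a.get(s, 0.0) - tree_b.get(s, 0.0)) ** 2
                for s in splits)
    return math.sqrt(total) if take_square_root else total


# Two made-up four-taxon trees; each key is the side of a split that
# does not contain taxon D
tree_1 = {frozenset("A"): 0.5, frozenset("B"): 0.5, frozenset("C"): 0.5,
          frozenset("ABC"): 0.5, frozenset("AB"): 1.0}
tree_2 = {frozenset("A"): 0.5, frozenset("B"): 0.5, frozenset("C"): 0.5,
          frozenset("ABC"): 0.5, frozenset("AC"): 2.0}
print(branch_score(tree_1, tree_2))
```

The terminal branches are identical in the two example trees, so only the two conflicting internal branches contribute to the score.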

4.2.6.4 Analyses of molecular variance (AMOVA)

This statistical procedure is used to partition genetic variation at different hierarchical levels (e.g. among individuals within populations, among populations within a region and between different regions). It was initially developed for RFLP haplotypes (Excoffier et al., 1992) but can also be used for many other markers. For binary data, pairwise genetic distances can be estimated using the Euclidean distance metric of Huff et al. (1993). The significance of the variance components can be tested by random permutation.
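A single-level version of this partitioning can be sketched in Python. The sketch assumes binary fragment profiles, takes the number of mismatching fragments as the squared distance between two individuals, and computes the among-group share of the variance together with a permutation p-value. It is a minimal illustration, not the GenAlEx implementation; all names and data are hypothetical.

```python
import random


def squared_distance(a, b):
    """Number of mismatching fragments between two binary profiles."""
    return sum(1 for x, y in zip(a, b) if x != y)


def phi_statistic(profiles, labels):
    """Among-group variance as a proportion of the total variance."""
    n_total = len(profiles)
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    n_groups = len(groups)

    def sum_of_squares(indices):
        total = sum(squared_distance(profiles[i], profiles[j])
                    for pos, i in enumerate(indices)
                    for j in indices[pos + 1:])
        return total / len(indices)

    ss_total = sum_of_squares(list(range(n_total)))
    ss_within = sum(sum_of_squares(members) for members in groups.values())
    ms_among = (ss_total - ss_within) / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    # Average group size, corrected for unequal group sizes
    n0 = (n_total - sum(len(m) ** 2 for m in groups.values()) / n_total) \
        / (n_groups - 1)
    var_among = (ms_among - ms_within) / n0
    return var_among / (var_among + ms_within)


def permutation_p_value(profiles, labels, n_permutations=999, seed=1):
    """Share of label permutations giving a statistic >= the observed one."""
    rng = random.Random(seed)
    observed = phi_statistic(profiles, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if phi_statistic(profiles, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Made-up data: two groups that are uniform within and differ between
profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 1], [0, 1, 1]]
labels = ["X", "X", "Y", "Y"]
print(phi_statistic(profiles, labels))
print(permutation_p_value(profiles, labels))
```

With only four individuals the permutation test cannot reach a small p-value, which illustrates why very small populations are of limited use in such analyses.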

To quantify levels of genetic differentiation within and among (groups of) species, estimates of variance components were assessed by analyses of molecular variance (Excoffier et al., 1992) performed in GenAlEx 6.3 (Peakall & Smouse, 2006). Species were grouped into ‘species groups’ based on the results of the genetic distance analyses, phylogenetic relationship and ploidy level. Therefore, S. aethnensis, S. chrysanthemifolius and S. squalidus (closely related diploids) were put in one ‘species group’, S. vulgaris and S. cambrensis (tetra/hexaploid and S. vulgaris is more distantly related) in another ‘species group’ and S. madagascariensis (distant relative), when available, into a third group. Other species could not be included because of the low number of samples available. However, analyses were performed with one, two (S. vulgaris, S. cambrensis and S. madagascariensis grouped together), and three ‘species groups’.

Furthermore, for datasets containing S. madagascariensis, additional analyses without this species were also carried out.

For combined datasets, pairwise ΦST values (analogous to Wright’s FST values) were estimated to measure differentiation between species. Furthermore, separate AMOVAs for each species, except S. madagascariensis, were conducted. Due to the low numbers of individuals per population, some populations were excluded from analysis, while others were assigned to populations in the same area, and a few were geographically grouped. For example, only one sample of S. vulgaris from the population in Egypt was available and it was therefore removed. The only S. squalidus sample from the Summerhill population was assigned to the population from Pentre, and all S. cambrensis individuals from different populations in Wales were treated as one population.

4.2.6.5 STRUCTURE assignment tests

The genetic structure of all variable primer pair matrices was analysed by a model-based clustering approach implemented in the computer programme STRUCTURE 2.3.3 (Pritchard et al., 2000; Falush et al., 2007; Hubisz et al., 2009), which can handle dominant markers by introducing a recessive allele. A single fragment of different size (fds) observation (i.e. one column in the data matrix) consists of the presence (1) or absence (0) of a fragment. Absence of a fragment is the recessive state, whereas presence of a fragment represents an ambiguous underlying genotype (in diploids: 11, 10 or 01). According to its probability, one of these ambiguous genotypes is randomly chosen in each iteration (Falush et al., 2007). The programme calculates the probability P(X|K) for different numbers of natural genetic groups (K), which are distinguished by their allele frequencies, using a Bayesian algorithm in combination with a Markov Chain Monte Carlo (MCMC) simulation. STRUCTURE analyses were performed for each variable dataset and their subsets (e.g. S. cam datasets) with K set from K = 1 to K = 9 (with 5 replicates for each K), assuming the no-admixture model and uncorrelated allele frequencies, and using a burn-in period of 20000 and 50000 MCMC repeats. These settings (burn-in and MCMC values) were long enough to stabilize log alpha and Ln likelihood (burn-in) and to obtain consistent end results (MCMC) (Pritchard et al., 2000).

Three functions, “Structure.deltaK”, “Structure.Table” and “Structure.simil”, of the R-script STRUCTURE-SUM-2009.R (Ehrich, 2006; Ehrich et al., 2007) were used to decide which K value and STRUCTURE run would best explain the data. The first function generated four plots (Mean L(K), Mean L’(K), Mean L’’(K) and Mean DeltaK) for the determination of the number of groups (K) using the method described in Evanno et al. (2005). The number of groups was indicated by a more or less clear break (plots Mean L(K) and Mean L’(K)) or by a peak in the slope (plots Mean L’’(K) and Mean DeltaK). However, the most reliable indication of the real K value was the modal value of the Mean DeltaK distribution, and its height may be used as an indicator of the strength of the signal (Evanno et al., 2005).
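The DeltaK statistic of Evanno et al. (2005) can be sketched in Python as below. The sketch assumes that the Ln P(X|K) values of the replicate runs are available as a dictionary keyed by K, and it computes the second-order difference from the mean L(K) values divided by the standard deviation of L(K) across replicates. It is an illustration with made-up numbers, not a port of the R script; the function name is hypothetical.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: DeltaK} for every K that has both neighbours K-1 and K+1."""
    mean_l = {k: mean(runs) for k, runs in lnp_by_k.items()}
    delta_k = {}
    for k, runs in lnp_by_k.items():
        if k - 1 not in mean_l or k + 1 not in mean_l:
            continue  # DeltaK is not defined for the smallest and largest K
        spread = stdev(runs)
        if spread == 0:
            continue  # DeltaK is undefined when replicates are identical
        second_difference = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
        delta_k[k] = second_difference / spread
    return delta_k


# Made-up Ln P(X|K) values for two replicate runs per K
example_runs = {
    1: [-1000.0, -1002.0],
    2: [-800.0, -802.0],
    3: [-790.0, -792.0],
    4: [-785.0, -787.0],
}
result = evanno_delta_k(example_runs)
print(result)
print(max(result, key=result.get))
```

In the example the likelihood rises steeply up to K = 2 and then levels off, so DeltaK peaks at K = 2. DeltaK cannot be evaluated for the first and last K tested, which is one reason to test a range wider than the expected number of groups.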

Alternatively, the number of groups was chosen using the latter two functions. “Structure.Table” plots the likelihood of each K value (lnP), while “Structure.simil” estimates and plots the similarity among the results of all replicates for each K. The number of groups (K) was chosen when either the lnP in the “Structure.Table” plot showed a maximum or the curve started to even out, the replicates displayed the highest similarity (“Structure.Table” and “Structure.simil” plots), and no empty groups were obtained. The run displaying the highest lnP was taken from the barplot outputs (see Nordborg et al., 2005), which were further examined to confirm the number of groups.

The ancestry of S. squalidus and S. cambrensis samples was estimated according to the admixture model by assuming that all hybrid individuals were derived from two populations representing their parents (i.e. S. aethnensis and S. chrysanthemifolius for S. squalidus; S. squalidus and S. vulgaris for S. cambrensis). The clustering procedure determines the proportions of an individual’s ancestry derived from these populations (Pritchard et al., 2000). STRUCTURE analyses were performed for combined datasets containing hybrid and parental samples, the latter being predefined (USEPOPINFO = 1), with K set to 2 (with 5 replicates) using a burn-in period of 20000 and 50000 MCMC repeats. The run with the highest lnP was taken from the barplot outputs.

Chapter 4 Results

4.3 Results

4.3.1 Radioactively labeled fragment analysis (initial primer-trial

In document Revealing the past : the potential of a novel small nucleolar RNA (snoRNA) marker system for studying plant evolution (Page 111-116)