Phylogenetic reconstruction - The ecological genetics of Pseudomonas syringae in the kiwifruit

2.4 Analysis

2.4.10 Phylogenetic reconstruction

Phylogenetics infers the evolutionary descent and hierarchical clustering of a study organism, e.g. to classify unknown species, understand the diversity of a sampled population, reveal events that occurred during its evolution or simply observe the fate of lineages. There are several methods to construct phylogenetic trees. Distance-based methods like UPGMA (Michener & Sokal, 1957) and Neighbor Joining (NJ) (Saitou & Nei, 1987) are both based on a pairwise genetic distance matrix; however, NJ allows for varying rates of evolution among lineages and does not assume a molecular clock. The other three popular methods to infer phylogenetic relationships are based on the actual nucleotide sequences themselves: Maximum Parsimony, Maximum Likelihood and Bayesian inference, the latter two relying on an explicit, best-fitting model of evolution to determine the tree.
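The difference between the two distance-based methods can be illustrated with a minimal sketch. The distance matrix below is hypothetical and not taken from the study data; the function names are likewise illustrative. UPGMA joins the pair with the smallest raw distance, whereas NJ joins the pair that minimises the rate-corrected criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of row i.

```python
from itertools import combinations

# Hypothetical pairwise distances between five taxa (illustrative only).
TAXA = ["a", "b", "c", "d", "e"]
DIST = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]


def upgma_first_pair(dist):
    """UPGMA joins the pair with the smallest raw distance."""
    n = len(dist)
    return min(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])


def nj_first_pair(dist):
    """NJ joins the pair minimising Q(i, j) = (n - 2) d(i, j) - r_i - r_j."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]

    def q(pair):
        i, j = pair
        return (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]

    return min(combinations(range(n), 2), key=q)


def names(pair):
    """Translate a pair of indices into taxon labels."""
    return tuple(TAXA[k] for k in pair)


print(names(upgma_first_pair(DIST)))  # ('d', 'e')
print(names(nj_first_pair(DIST)))     # ('a', 'b')
```

With this matrix the two methods already disagree on the first join, because the row-sum correction in NJ discounts taxa that are distant from everything else.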

Maximum Parsimony (Edwards & Cavalli-Sforza, 1963) searches for the tree that requires the smallest number of substitutions, i.e. the tree with the least character conflict. The caveat is that this approach can lead to inaccurate phylogenies when branches are long, substitution rates are high or evolutionary rates are unequal across different lineages (Felsenstein, 1978).
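Counting the minimum number of substitutions on a given tree can be sketched with Fitch's algorithm, one common way to score a fixed topology under parsimony. The four-taxon alignment and the nested-tuple tree encoding below are assumptions made for illustration only.

```python
def fitch(node, states):
    """Return (candidate ancestral states, minimum substitutions) for one site."""
    if isinstance(node, str):            # leaf: the observed nucleotide
        return {states[node]}, 0
    left, right = node
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:                           # children agree: no extra change needed
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1  # conflict: one more substitution


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch scores over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical alignment of four taxa (illustrative only).
ALIGNMENT = {"A": "AAG", "B": "ACG", "C": "AAG", "D": "ACT"}
TREE_1 = (("A", "B"), ("C", "D"))
TREE_2 = (("A", "C"), ("B", "D"))

print(parsimony_score(TREE_1, ALIGNMENT))  # 3
print(parsimony_score(TREE_2, ALIGNMENT))  # 2
```

Under parsimony the second topology would be preferred here, since it explains the same data with fewer substitutions.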

Edwards & Cavalli-Sforza (1965) first proposed using likelihood to reconstruct phylogenies, but the method only became practicable when Felsenstein (1981) published the first computationally feasible and reasonably fast Maximum Likelihood (ML) algorithm. Under the chosen model of evolution, the likelihood of each candidate tree topology is calculated and the tree with the highest likelihood is regarded as the best-fitting tree.
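How the likelihood of one tree is calculated can be sketched with the pruning recursion. The example assumes the Jukes-Cantor (JC69) model purely for brevity (it is not the model selected in this study), and the tree, branch lengths and sequences are hypothetical.

```python
import math

BASES = "ACGT"


def jc69(x, y, t):
    """JC69 probability that base x is replaced by y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partials(node, site):
    """Likelihood of the data below `node`, conditional on each base at `node`."""
    if isinstance(node, str):  # leaf: 1 for the observed base, 0 otherwise
        return [1.0 if b == site[node] else 0.0 for b in BASES]
    (left, t_left), (right, t_right) = node
    pl, pr = partials(left, site), partials(right, site)
    return [
        sum(jc69(x, y, t_left) * pl[k] for k, y in enumerate(BASES))
        * sum(jc69(x, y, t_right) * pr[k] for k, y in enumerate(BASES))
        for x in BASES
    ]


def site_likelihood(tree, site):
    """Average the root partials over the uniform JC69 base frequencies."""
    return sum(0.25 * p for p in partials(tree, site))


def log_likelihood(tree, alignment):
    """Sum the log site likelihoods over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        math.log(site_likelihood(tree, {k: s[i] for k, s in alignment.items()}))
        for i in range(n_sites)
    )


# Hypothetical tree: each internal node is ((child, length), (child, length)).
INNER = (("A", 0.1), ("B", 0.1))
TREE = ((INNER, 0.05), ("C", 0.2))
ALIGNMENT = {"A": "ACGT", "B": "ACGA", "C": "ACTT"}

print(log_likelihood(TREE, ALIGNMENT))
```

An ML search repeats this calculation for many candidate topologies and branch lengths and keeps the tree with the highest value.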

Bayesian inference (Mau et al., 1999; Yang & Rannala, 1997) is based on the posterior probability of a tree (the probability of the tree given the data, taking into account the prior probability of the tree) and uses Markov Chain Monte Carlo (MCMC) sampling. The algorithm returns a sample of trees with high posterior probability, from which a consensus tree is built. Bayesian methods allow the use of more complex evolutionary models.
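The relationship between prior, likelihood and posterior, and the way MCMC approximates the posterior, can be sketched on a toy problem. The three topologies, their priors and their likelihoods are hypothetical values chosen for illustration; real programs also sample branch lengths and model parameters, which this sketch leaves out.

```python
import random
from collections import Counter

# Hypothetical priors and likelihoods for the three unrooted four-taxon trees.
PRIOR = {"AB|CD": 1 / 3, "AC|BD": 1 / 3, "AD|BC": 1 / 3}
LIKELIHOOD = {"AB|CD": 6e-5, "AC|BD": 3e-5, "AD|BC": 1e-5}


def exact_posterior():
    """Posterior is prior times likelihood, normalised over all trees."""
    weights = {t: PRIOR[t] * LIKELIHOOD[t] for t in PRIOR}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def metropolis(n_iter, burn_in, seed=1):
    """Estimate posterior tree frequencies with a Metropolis sampler."""
    rng = random.Random(seed)
    trees = list(PRIOR)
    current = trees[0]
    visits = Counter()
    for step in range(n_iter):
        # Symmetric proposal: pick one of the other topologies at random.
        proposal = rng.choice([t for t in trees if t != current])
        ratio = (PRIOR[proposal] * LIKELIHOOD[proposal]) / (
            PRIOR[current] * LIKELIHOOD[current]
        )
        if rng.random() < min(1.0, ratio):
            current = proposal
        if step >= burn_in:  # discard the burn-in portion of the chain
            visits[current] += 1
    kept = n_iter - burn_in
    return {t: visits[t] / kept for t in trees}


print(exact_posterior())
print(metropolis(n_iter=50_000, burn_in=5_000))
```

The sampler never needs the normalising constant, only ratios of prior times likelihood, which is what makes MCMC usable when the number of possible trees is far too large to enumerate.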

In this study, phylogenies were reconstructed using Maximum Likelihood and Bayesian inference. Both methods were employed so that the phylogenetic trees resulting from the two inference methodologies could be compared.

2.4.10.1 Maximum Likelihood (ML) Tree

The program TREEPUZZLE v5.3 (Schmidt et al., 2002) was used to construct ML trees for each single gene and for the concatenated dataset. The best-fitting evolutionary model, as determined with jModeltest, was used, and program parameters were set to 100,000 puzzling steps, a neighbor-joining tree for parameter estimation and quartet puzzling as the tree search procedure. All duplicate sequences were removed, so the input file consisted of a single representative of each of the 45 unique sequence types (STs).

The reconstruction of trees takes place in three steps: (1) TREEPUZZLE evaluates the maximum likelihood of the possible trees for every group of four sequences (quartet) and weights them according to their posterior probabilities. (2) Starting from one quartet tree, the remaining sequences are added stepwise, using the ML information from the first step; repeating this puzzling step produces a number of intermediate trees. (3) A 50% majority-rule consensus tree is built from the intermediate trees.
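The third step can be sketched as follows. Each intermediate tree is represented, as an assumption of this sketch, by the set of taxon groups (splits) it contains; the four trees are hypothetical and this is not TREEPUZZLE's internal representation.

```python
from collections import Counter


def majority_rule_splits(trees, threshold=0.5):
    """Keep every split found in more than `threshold` of the trees."""
    counts = Counter(split for tree in trees for split in tree)
    n_trees = len(trees)
    return {
        split: count / n_trees
        for split, count in counts.items()
        if count / n_trees > threshold
    }


# Hypothetical intermediate trees on taxa A-E, each a set of splits.
AB, DE, CD, AC = (frozenset(s) for s in ("AB", "DE", "CD", "AC"))
INTERMEDIATE_TREES = [{AB, DE}, {AB, DE}, {AB, CD}, {AC, DE}]

consensus = majority_rule_splits(INTERMEDIATE_TREES)
for split, support in consensus.items():
    print("".join(sorted(split)), support)
```

Here the groups AB and DE each occur in three of four trees (75% support) and are retained, while CD and AC fall below the 50% threshold and are collapsed.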

The Shimodaira-Hasegawa (SH) test was applied to test the congruence between the single-gene trees (Shimodaira & Hasegawa, 1999). The SH-test compares tree topologies and thereby helps determine whether recombination has occurred between the housekeeping genes. The program Dnaml, incorporated in PHYLIP v3.695 (Felsenstein, 1989), was used to perform the SH-test. Dnaml is a DNA Maximum Likelihood program that allows user-specified trees to be compared by their log-likelihood values. The SH-test is a statistical test that makes use of the branch lengths and evaluates the differences in log-likelihood between trees, without altering branch lengths any further. The difference between the highest log-likelihood and each other tree's value is assessed, and the output indicates significance, i.e. whether a tree is significantly worse than the best one. Dnaml was run using the default parameters, but with user trees as input and a random number seed of 333.
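The logic of the SH-test can be sketched with a simplified bootstrap over per-site log-likelihoods (the RELL shortcut). This is an illustration of the general procedure, not a reproduction of Dnaml's implementation; the per-site values are hypothetical and only the seed of 333 is borrowed from the text.

```python
import random


def sh_test(site_lnl, n_boot=1000, seed=333):
    """Return one p-value per tree; a small value means 'worse than the best'."""
    rng = random.Random(seed)
    n_trees, n_sites = len(site_lnl), len(site_lnl[0])
    totals = [sum(row) for row in site_lnl]
    observed = [max(totals) - t for t in totals]  # delta_i = L_max - L_i

    # Resample sites with replacement and re-sum the per-site log-likelihoods.
    boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot.append([sum(row[i] for i in idx) for row in site_lnl])

    # Centre each tree's bootstrap values, then compare with the observed delta.
    means = [sum(rep[k] for rep in boot) / n_boot for k in range(n_trees)]
    exceed = [0] * n_trees
    for rep in boot:
        centred = [rep[k] - means[k] for k in range(n_trees)]
        best = max(centred)
        for k in range(n_trees):
            if best - centred[k] >= observed[k]:
                exceed[k] += 1
    return [e / n_boot for e in exceed]


# Hypothetical per-site log-likelihoods for three user trees (ten sites each).
SITE_LNL = [
    [-2.1, -3.4, -1.8, -2.9, -2.2, -3.1, -1.9, -2.7, -2.4, -3.0],
    [-2.2, -3.3, -1.9, -2.8, -2.3, -3.0, -2.0, -2.7, -2.5, -2.9],
    [-3.1, -4.2, -3.0, -3.8, -3.3, -4.1, -2.7, -3.9, -3.5, -3.9],
]

p_values = sh_test(SITE_LNL)
print(p_values)
```

The best tree always receives a p-value of 1, and a tree is flagged as significantly worse when its p-value falls below the chosen significance level.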

2.4.10.2 Bayesian Tree

For Bayesian inference of the genealogical relations between strains, taking into account both point mutation and homologous recombination, the program Clonalframe v1.1 (Didelot & Falush, 2007) was employed. As discussed previously, recombination is a major problem in phylogenetic inference, as horizontal gene transfer can obscure and distort the resulting tree. Clonalframe works under the assumption that each recombination event introduces an unknown number of novel polymorphisms; it does not, however, try to determine the origin of the stretches of DNA sequence introduced via homologous recombination.

Duplicate runs of Clonalframe were performed with run lengths ranging from 100,000 to 500,000 iterations, each with a burn-in of 50,000 iterations, based on the concatenated dataset of the 45 STs. The values of various parameters (θ, ρ, ...) were compared between duplicate runs to check whether they had converged, and the best run was chosen according to the converged values. In addition, the topologies of the trees were compared, with each branch in the tree having a minimum of 50% support based on the posterior distribution.
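One way to compare parameter values between duplicate runs is the Gelman-Rubin statistic. The text does not state which diagnostic was used, so its use here is an assumption, and the θ traces below are hypothetical. Values close to 1 suggest that the runs sample the same distribution.

```python
from statistics import mean, variance


def gelman_rubin(chains, burn_in=0):
    """Potential scale reduction factor for one parameter across several runs."""
    kept = [chain[burn_in:] for chain in chains]  # drop burn-in samples
    n = len(kept[0])
    chain_means = [mean(chain) for chain in kept]
    within = mean([variance(chain) for chain in kept])  # mean within-run variance
    between = n * variance(chain_means)                 # between-run variance
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Hypothetical post-burn-in traces of theta from three runs.
RUN_1 = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
RUN_2 = [0.9, 1.0, 1.1, 0.8, 1.2, 1.0]
RUN_3 = [3.0, 3.2, 2.9, 3.1, 3.0, 2.8]

print(gelman_rubin([RUN_1, RUN_2]))  # close to 1: the runs agree
print(gelman_rubin([RUN_1, RUN_3]))  # far above 1: the runs disagree
```

A commonly used rule of thumb treats values below about 1.1 as acceptable, although the threshold is a convention rather than a guarantee of convergence.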

2.4.10.3 Phylogenetic network

Phylogenetic trees are based on evolutionary models in which point mutations are assumed to be the sole source of evolutionary differences between taxa. However, microevolutionary events such as gene duplication or loss, recombination and horizontal gene transfer can obscure the true evolutionary history presented by phylogenetic trees. Although some reconstruction methods take recombination into account, phylogenetic networks offer another way of inferring the evolutionary history of individual taxa. They allow for reticulate events and model relationships as a network based on a choice of distances, sequences or trees, with nodes representing taxa and edges representing their evolutionary relationships, where multiple edges can be parallel (Huson & Bryant, 2006). Splitstree v4.13.1 (Huson & Bryant, 2006) was used to create a splits network with the neighbor-net method from the concatenated dataset of the 45 unique STs.
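Why a splits network can show more than a tree can be sketched with a compatibility check. Two splits fit on the same tree only if at least one of the four pairwise intersections of their sides is empty; incompatible splits are what a splits network draws as sets of parallel edges. The taxa and splits are hypothetical, and the sketch does not implement the neighbor-net method itself.

```python
def compatible(split_1, split_2, taxa):
    """Two splits are compatible if some pair of their sides does not overlap."""
    a1, b1 = set(split_1), set(taxa) - set(split_1)
    a2, b2 = set(split_2), set(taxa) - set(split_2)
    return any(not (x & y) for x in (a1, b1) for y in (a2, b2))


TAXA = "ABCDE"

# AB|CDE and DE|ABC can both be displayed on one tree.
print(compatible("AB", "DE", TAXA))  # True

# AB|CDE and BC|ADE conflict and can only be shown together in a network.
print(compatible("AB", "BC", TAXA))  # False
```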

2.4.11 Testing for biogeographic structure

In document The ecological genetics of Pseudomonas syringae in the kiwifruit phyllosphere : a thesis submitted in partial fulfilment of the requirements for the degree of Ph D in Evolutionary Genetics, New Zealand Institute for Advanced Study at Massey University, A (Page 97-102)