
2.9. DNA sequence analyses

2.9.5. Identifying the optimal model for DNA sequence evolution


Multiple ‘hits’, where pre-existing mutations are masked by more recent mutations that occur at the same site, will lead to an underestimate of the actual number of changes that have taken place at a particular site, thus obscuring the phylogenetic relationship of the taxa being compared. It is therefore necessary to apply a model of sequence evolution in order to ‘correct’ for such multiple ‘hits’ (Graur & Li, 2000).
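As an illustration of such a correction, the simplest case is the JC69 distance, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of sites that differ between two sequences. The following is a minimal sketch only; the function name `jc69_distance` is hypothetical and is not part of any program used in this study.

```python
import math


def jc69_distance(p):
    """Return the JC69-corrected number of substitutions per site.

    p is the observed proportion of differing sites between two aligned
    sequences. The correction is undefined for p >= 0.75, the level at
    which sequences are saturated under JC69.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# An observed difference of 30% corresponds to a larger corrected distance,
# because some changes are assumed to be hidden by multiple hits.
observed = 0.30
corrected = jc69_distance(observed)
```

The corrected distance always equals or exceeds the observed proportion, and the gap widens as p approaches saturation.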

Models require certain assumptions as to how variations in DNA sequences evolve. All possible assumptions for a given situation that are taken into account form a ‘conceptual model’ in which phylogenetic estimation is made. The more assumptions or parameters a model incorporates, the more complex it becomes. Several models can be used to account for DNA sequence evolution; these include the JC69 (Jukes & Cantor, 1969), F81 (Felsenstein, 1981), K2P (Kimura, 1980), HKY85 (Hasegawa et al., 1985), TN93 (Tamura & Nei, 1993) and general time-reversible (GTR) models (Rodriguez et al., 1990). The JC69 is the simplest model and assumes that all types of change (all substitutions) are equally likely, base frequencies are equal, all sites are equally likely to change and change independently of each other, and base composition is at equilibrium among all the sequences under consideration (Jukes & Cantor, 1969). The K2P is an extension of the JC69 model but allows transitions and transversions to have different substitution rates (Kimura, 1980). Likewise, the F81 model is an extension of the JC69 but allows for unequal base frequencies (Felsenstein, 1981). The HKY85 model allows for different rates of substitution for transitions and transversions as well as allowing for unequal base frequencies (Hasegawa et al., 1985). The TN93 model is an extension of the HKY85 model but distinguishes between the transition rates of purines and pyrimidines (Tamura & Nei, 1993). Finally, the GTR model allows all six types of substitution to have different rates as well as allowing for unequal base frequencies (Rodriguez et al., 1990).

Rate heterogeneity between sites can also be accounted for by incorporating gamma distributed rates (Γ) into the models (Yang, 1993). Gu et al. (1995) proposed to take into account the proportion of invariant sites (I) in the gamma distributed rates, hence the ‘Γ+I’ model. Yang (2006) describes this model as “pathological”, as a gamma distribution with a shape parameter less than 1 already accounts for the invariant sites.

Depending on the model under consideration, the base frequencies, rate matrix and shape parameter (α) of the gamma distribution using 16 rate categories were estimated using likelihood by iteration from an initial neighbor-joining (NJ) tree. The parameters derived from the initial tree were then used to build a new neighbor-joining tree and the parameters re-estimated, repeating the process until no noticeable improvement was seen in the likelihood.
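The relationship between the substitution models described above can be sketched by writing each one as a constrained form of the GTR rate matrix, in which the instantaneous rate from base i to base j is the product of a symmetric exchangeability r_ij and the frequency of the target base. The code below is an illustrative sketch under that assumption; the function names and the parameters `kappa`, `kappa_r` and `kappa_y` are hypothetical labels, the matrix is left unnormalised, and among-site rate variation (Γ) is not included.

```python
BASES = "ACGT"
PAIRS = ["AC", "AG", "AT", "CG", "CT", "GT"]  # the six substitution types
EQUAL_FREQS = dict.fromkeys(BASES, 0.25)


def gtr_rate_matrix(rates, freqs):
    """Build an unnormalised rate matrix Q with Q[i][j] = r_ij * freq_j.

    rates maps each unordered base pair (e.g. "AG") to its exchangeability;
    freqs maps each base to its equilibrium frequency. Each diagonal entry
    is set so that the row sums to zero.
    """
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                pair = "".join(sorted(i + j))
                row[j] = rates[pair] * freqs[j]
        row[i] = -sum(row.values())
        q[i] = row
    return q


def tn93(kappa_r, kappa_y, freqs):
    """TN93: separate purine (A<->G) and pyrimidine (C<->T) transition rates."""
    rates = dict.fromkeys(PAIRS, 1.0)
    rates["AG"] = kappa_r
    rates["CT"] = kappa_y
    return gtr_rate_matrix(rates, freqs)


def hky85(kappa, freqs):
    """HKY85: TN93 with a single transition rate."""
    return tn93(kappa, kappa, freqs)


def f81(freqs):
    """F81: all substitution types equal, unequal base frequencies."""
    return hky85(1.0, freqs)


def k2p(kappa):
    """K2P: HKY85 with equal base frequencies."""
    return hky85(kappa, EQUAL_FREQS)


def jc69():
    """JC69: all substitution types equal, equal base frequencies."""
    return k2p(1.0)
```

Each simpler model is obtained by fixing parameters of a more general one, which is the sense in which the models are described as nested.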

Models are generally selected based on their fit to the sequence data as measured by likelihood values (Kelchner & Thomas, 2007). Adding parameters to a model normally increases the likelihood score, but it also increases complexity and spreads the data more thinly across the parameters being estimated; if the likelihood score does not improve significantly, there is no justification for using the more complex model. One way to identify which model to use is the likelihood ratio test (LRT), a statistical test that compares the goodness of fit of two models to a particular dataset. It can be applied to nested models, since twice the difference in the log likelihood scores of two nested models is approximately chi-squared distributed.

The formula for this test is given as: LR = 2(lnL1 − lnL2), where lnL1 and lnL2 are the log likelihood scores of the more complex and the simpler of the two nested models being compared, respectively.

The LRT can then be used to determine whether the two models differ significantly in their log likelihood scores by identifying the degrees of freedom and looking up the P value in a chi-squared table. The number of degrees of freedom is the difference in the number of free parameters between the two models being compared.

For example, the GTR and the GTR+Γ models differ by one parameter (addition of the gamma distribution in the latter); therefore, the number of degrees of freedom for comparing these two models is 1 (Huelsenbeck & Crandall, 1997). Table 2.9 summarizes the number of parameters for a given model of DNA substitution. The model with the best likelihood score was selected but only if it was significantly better than a less complex model; otherwise, the simpler model was used.
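The test described above can be sketched in a few lines. This is a minimal illustration rather than the procedure run in PAUP*: the log likelihood values are invented for the example, the parameter counts of 8 for GTR and 9 for GTR+Γ are assumed here (they give the one degree of freedom stated in the text), and `chi2_sf` is a hypothetical helper that replaces the chi-squared table lookup with a closed-form tail probability for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = 1.0
        total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail term plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def likelihood_ratio_test(lnl_simple, lnl_complex, k_simple, k_complex):
    """Return (LR statistic, degrees of freedom, P value) for nested models."""
    lr = 2.0 * (lnl_complex - lnl_simple)
    df = k_complex - k_simple
    return lr, df, chi2_sf(lr, df)


# Hypothetical scores: GTR (8 free parameters) versus GTR+G (9).
lr, df, p_value = likelihood_ratio_test(-5094.0, -5050.0, 8, 9)
use_complex_model = p_value < 0.05
```

If the P value falls below the chosen significance level, the more complex model is retained; otherwise the simpler model is used.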


Table 2.9: Summary of the number of parameters of the different models of DNA substitution (taken from Morrison, 2006)

Model Number of Parameters

JC69 0

Twelve different models were evaluated; these were: (1) JC69, (2) JC69+Γ, (3) K2P, (4) K2P+Γ, (5) F81, (6) F81+Γ, (7) HKY85, (8) HKY85+Γ, (9) TN93, (10) TN93+Γ, (11) GTR and (12) GTR+Γ. Since the F81 and K2P models are not nested, they could not be compared with each other. F81 and K2P could, however, be compared with any other model. It was tempting to restrict the model search to the parameter-rich HKY85, TN93 and GTR models, as the Modeltest program determined them to be the optimal models in 80% of 208 published datasets in 2004 alone (Kelchner & Thomas, 2007). It was nevertheless more prudent to check the less parameter-rich models as well, so that they could be confidently ruled out if the more complex models had significantly higher likelihood scores. The likelihood scores for these models were computed in PAUP*, with the command lines summarized in Appendix 2.1, pp. 360-366.
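Which of the twelve models form nested pairs can be worked out mechanically. The sketch below encodes each model by three features (equal or unequal base frequencies, how finely the substitution types are divided, and whether Γ is included) and treats one model as nested within another when it is no more general on every feature. This encoding is an assumption made for illustration, and "+G" stands in for "+Γ" in the model names.

```python
from itertools import combinations

# (unequal base frequencies: 0/1, substitution-rate level)
# Rate levels: 0 = one rate, 1 = transitions vs transversions,
# 2 = two transition rates plus transversions, 3 = six rates.
BASE_MODELS = {
    "JC69": (0, 0),
    "K2P": (0, 1),
    "F81": (1, 0),
    "HKY85": (1, 1),
    "TN93": (1, 2),
    "GTR": (1, 3),
}

# Add the gamma-distributed-rates variant of every base model.
MODELS = {}
for name, (freq, rate) in BASE_MODELS.items():
    MODELS[name] = (freq, rate, 0)
    MODELS[name + "+G"] = (freq, rate, 1)


def is_nested(simple, complex_):
    """True if `simple` is a special case of `complex_` under this encoding."""
    a, b = MODELS[simple], MODELS[complex_]
    return a != b and all(x <= y for x, y in zip(a, b))


def nested_pairs():
    """List every (simpler, more complex) pair that an LRT can compare."""
    pairs = []
    for m1, m2 in combinations(MODELS, 2):
        if is_nested(m1, m2):
            pairs.append((m1, m2))
        elif is_nested(m2, m1):
            pairs.append((m2, m1))
    return pairs
```

Under this encoding, `is_nested("F81", "K2P")` and `is_nested("K2P", "F81")` are both false, matching the statement that these two models cannot be compared with each other, whereas both contain JC69 and are contained in HKY85.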

The application of the LRT described here is similar to that which is applied in the Modeltest program by Posada and Crandall (1998), except that the LRT used in this study allows for a comprehensive comparison of all models under consideration (apart from non-nested models), whereas Modeltest ‘traverses’ a model space through a series of pairwise comparisons of the different models. For instance, if Modeltest compares the likelihood scores of JC69 and F81 and finds the latter to be significantly better, then F81 is selected and compared with HKY85. If HKY85 is better than F81, then HKY85 is selected and compared with GTR. If GTR is better than HKY85, then GTR is compared with GTR+Γ. Otherwise, HKY85 and TN93 are compared instead. The problem with this approach is that it does not allow for a comprehensive comparison of all the different models being considered. In the above example, the GTR and TN93 models were not compared, and there is the possibility that GTR is not significantly better than the simpler TN93.
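The traversal described above can be sketched as follows. This follows only the path spelled out in the text and is not a reimplementation of Modeltest; the log likelihood scores are invented, the parameter counts are assumed standard free-parameter counts rather than values taken from Table 2.9, and the critical values are those of the chi-squared distribution at the 5% level.

```python
# Assumed free-parameter counts ("+G" stands for "+Γ").
N_PARAMS = {"JC69": 0, "F81": 3, "HKY85": 4, "TN93": 5, "GTR": 8, "GTR+G": 9}

# Chi-squared critical values at the 5% level, keyed by degrees of freedom.
CHI2_CRIT_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def significantly_better(lnl, simple, complex_):
    """LRT at the 5% level: is `complex_` a significant improvement?"""
    lr = 2.0 * (lnl[complex_] - lnl[simple])
    df = N_PARAMS[complex_] - N_PARAMS[simple]
    return lr > CHI2_CRIT_05[df]


def traverse(lnl):
    """Follow the stepwise path from the text; return (model, comparisons)."""
    comparisons = []

    def test(simple, complex_):
        comparisons.append((simple, complex_))
        return significantly_better(lnl, simple, complex_)

    # Branches the text does not describe simply stop at the current model.
    if not test("JC69", "F81"):
        return "JC69", comparisons
    if not test("F81", "HKY85"):
        return "F81", comparisons
    if test("HKY85", "GTR"):
        best = "GTR+G" if test("GTR", "GTR+G") else "GTR"
    else:
        best = "TN93" if test("HKY85", "TN93") else "HKY85"
    return best, comparisons


# Hypothetical log likelihood scores for one dataset.
scores = {"JC69": -5200.0, "F81": -5150.0, "HKY85": -5100.0,
          "TN93": -5097.0, "GTR": -5094.0, "GTR+G": -5050.0}
chosen, path = traverse(scores)
gtr_beats_tn93 = significantly_better(scores, "TN93", "GTR")
```

With these invented scores the traversal reaches GTR+Γ without ever testing GTR against TN93, although a direct test of that pair (LR = 6 on 3 degrees of freedom) would not favour GTR.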