CHLOROPLAST DNA VARIATION
3.2. Materials and methods
3.2.4. Tree construction from parsimony analysis
Details of tree construction from parsimony analysis have been broadly described and thoroughly reviewed elsewhere (Avise, 1994; Felsenstein, 1988; Hillis et al., 1996; Swofford, 1991; Swofford et al., 1996; Soltis et al., 1990). Thus, these approaches are discussed very briefly below, based on the aforementioned references.
Exhaustive search
In this approach, an initial tree for the first n taxa is constructed, and the next taxon (n + 1) is added and evaluated at every possible position. Each remaining taxon is then added in turn, and every resulting tree topology is evaluated. The main difficulty with this type of search is that the number of trees increases rapidly with the addition of further taxa. Exhaustive methods are not generally useful for more than 10 or 11 taxa, since they must evaluate over two million and over 34 million trees for 10 and 11 taxa, respectively (Swofford, 1991 & 1993).
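The growth in tree numbers quoted above can be checked with a short calculation. The sketch below assumes unrooted, strictly bifurcating trees, in which each newly added taxon can be attached to any existing branch; the function name is illustrative only and is not taken from any software cited here.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Count distinct unrooted, strictly bifurcating trees for n_taxa labelled taxa."""
    if n_taxa < 3:
        raise ValueError("at least three taxa are required")
    count = 1  # three taxa can be joined in only one way
    for k in range(4, n_taxa + 1):
        # a tree of k - 1 taxa has 2k - 5 branches, each a possible attachment point
        count *= 2 * k - 5
    return count


for n in (4, 5, 10, 11):
    print(n, count_unrooted_trees(n))
```

For 10 and 11 taxa this gives 2,027,025 and 34,459,425 trees, in line with the figures cited above.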
Branch-and-bound search
This is another exact algorithm for the identification of all optimal trees and closely resembles the exhaustive search. It differs in that not every tree needs to be constructed and evaluated in full: the search procedure has a provision for discarding groups of trees without evaluating them in detail once it is clear that they cannot be optimal, thus considerably reducing computing time.
Several factors influence the running time of the branch-and-bound algorithm, with the quality of the data being perhaps the most important. Large data sets with little homoplasy will run quickly because most paths of the search tree are terminated early. The speed with which the length of each tree can be evaluated, a function of the character types, is also important. For example, ordered (Wagner) characters are much faster than unordered characters. Finally, for obvious reasons, the speed of the available computer is also critical to the running time (Swofford, 1993).
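The early termination of search paths can be illustrated with a minimal sketch. It is not the PAUP implementation: it assumes unordered binary characters scored with Fitch parsimony, trees written as nested tuples with the first taxon held fixed as a reference point, and an invented five-taxon matrix. All function names are hypothetical.

```python
import math


def fitch_steps(node, matrix, i):
    """Return (state set, steps) for character i on a nested-tuple binary tree."""
    if isinstance(node, str):  # terminal taxon
        return {matrix[node][i]}, 0
    left_set, left_steps = fitch_steps(node[0], matrix, i)
    right_set, right_steps = fitch_steps(node[1], matrix, i)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total number of steps over all characters (unordered states)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(fitch_steps(tree, matrix, i)[1] for i in range(n_chars))


def insertions(subtree, taxon):
    """Yield every subtree obtained by attaching taxon to one branch of subtree."""
    yield (taxon, subtree)
    if not isinstance(subtree, str):
        left, right = subtree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def branch_and_bound(taxa, matrix):
    """Return (minimum length, all trees of that length)."""
    best = {"length": math.inf, "trees": []}

    def extend(tree, remaining):
        length = tree_length(tree, matrix)
        if length > best["length"]:
            return  # bound exceeded: adding taxa can never shorten a tree
        if not remaining:
            if length < best["length"]:
                best["length"], best["trees"] = length, [tree]
            else:
                best["trees"].append(tree)
            return
        anchor, rest = tree  # the first taxon stays fixed as a reference point
        for candidate in insertions(rest, remaining[0]):
            extend((anchor, candidate), remaining[1:])

    first, second, third = taxa[:3]
    extend((first, (second, third)), list(taxa[3:]))
    return best["length"], best["trees"]


# Invented binary matrix: rows are taxa, columns are characters.
matrix = {"A": "1001", "B": "1000", "C": "0010", "D": "0110", "E": "0110"}
length, trees = branch_and_bound(list(matrix), matrix)
print(length, trees)
```

The bound here starts at infinity and tightens as complete trees are found; in practice a heuristic search is often used to supply a good initial bound so that more paths are abandoned early.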
Heuristic methods
When a data set is too large to permit the use of exact methods, a heuristic approach is recommended, which sacrifices the guarantee of optimality in favour of reduced computing time. The search begins by surveying only a small sample of all possible tree topologies and the optimal tree is the shortest of these. A tree is then constructed and rearranged so as to bring it closer to the optimum; once no further alterations can improve the tree, the analysis is terminated. Heuristic methods based on this principle have proven to be very effective. Two basic strategies are used:
• Stepwise addition: this is the common method for obtaining a starting tree for subsequent rearrangement, by adding taxa one at a time to a growing tree. Three taxa are chosen for the initial tree, then one of the unplaced taxa is selected for the next addition. The trees resulting from the addition of a fourth taxon are evaluated and the one with the optimal score is saved for the next 'round'. Next, a fifth taxon is placed along one of the five possible branches on the tree saved from the previous round. The evaluation procedure is repeated with the best tree saved for the next round, and this process is concluded when all taxa have been joined to the growing tree.
• Branch swapping: in this strategy, the initial estimate provided by stepwise addition is subjected to a series of predefined rearrangements in search of a shorter tree. These rearrangements are performed until the tree cannot be improved any further; the resulting tree is then assumed to be the optimum, although it may be only a local one.
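The stepwise addition rounds described above can be sketched as follows. This is an illustrative sketch only, assuming unordered characters scored with Fitch parsimony, taxa added in the order given, and a single best tree kept per round; the matrix is invented and the function names are hypothetical.

```python
def fitch_steps(node, matrix, i):
    """Return (state set, steps) for character i on a nested-tuple binary tree."""
    if isinstance(node, str):  # terminal taxon
        return {matrix[node][i]}, 0
    left_set, left_steps = fitch_steps(node[0], matrix, i)
    right_set, right_steps = fitch_steps(node[1], matrix, i)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, matrix):
    """Total number of steps over all characters (unordered states)."""
    n_chars = len(next(iter(matrix.values())))
    return sum(fitch_steps(tree, matrix, i)[1] for i in range(n_chars))


def insertions(subtree, taxon):
    """Yield every subtree obtained by attaching taxon to one branch of subtree."""
    yield (taxon, subtree)
    if not isinstance(subtree, str):
        left, right = subtree
        for new_left in insertions(left, taxon):
            yield (new_left, right)
        for new_right in insertions(right, taxon):
            yield (left, new_right)


def stepwise_addition(taxa, matrix):
    """Build a starting tree by adding one taxon per round at its best position."""
    first, second, third = taxa[:3]
    tree = (first, (second, third))  # the only tree for three taxa
    for taxon in taxa[3:]:
        anchor, rest = tree
        # one candidate per branch of the current tree (3, then 5, then 7, ...)
        candidates = [(anchor, c) for c in insertions(rest, taxon)]
        tree = min(candidates, key=lambda t: tree_length(t, matrix))
    return tree, tree_length(tree, matrix)


# Invented binary matrix: rows are taxa, columns are characters.
matrix = {"A": "1001", "B": "1000", "C": "0010", "D": "0110", "E": "0110"}
start_tree, start_length = stepwise_addition(list(matrix), matrix)
print(start_tree, start_length)
```

Because only one tree is carried forward from each round, the result depends on the addition sequence and is not guaranteed to be the shortest tree, which is why it serves as the starting point for branch swapping.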
Figure 3.3 illustrates the stages completed in a heuristic approach and the three branch swapping algorithms as implemented in the software programme PAUP (Phylogenetic Analysis Using Parsimony; Swofford, 1993): the nearest neighbour interchanges (NNI); subtree pruning and regrafting (SPR); and, tree bisection and reconnection (TBR). A brief description of these algorithms is also shown in the same figure.
(A) Nearest neighbour interchanges (NNI). Each interior branch of the tree defines a local region of four subtrees connected by the interior branch. Interchanging a subtree on one side of the branch with one from the other constitutes an NNI. Two such rearrangements are possible for each interior branch.
(B) Subtree pruning and regrafting (SPR). A subtree is pruned from the tree (e.g. the subtree containing terminal nodes A and B as indicated). The subtree is then regrafted to a different location on the tree. All possible subtree removals and reattachment points are evaluated.
(C) Tree bisection and reconnection (TBR). The tree is bisected along a branch, yielding two disjoined subtrees. The subtrees are then reconnected by joining a pair of branches, one from each subtree. All possible bisections and pairwise reconnections are evaluated.
Figure 3.3. Schematic representation of the heuristic parsimony searches. The three different approaches to branch swapping are described (after Swofford et al., 1996).
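Of the three rearrangements, NNI is the simplest to express in code. The following sketch generates the NNI neighbours of one tree; it is not the PAUP implementation, and it assumes a nested-tuple tree whose first element is a terminal taxon held fixed so that each interior branch of the unrooted tree is visited once. SPR and TBR are not shown. Function names are hypothetical.

```python
def nni_below(subtree):
    """Yield NNI rearrangements for every interior branch inside a rooted subtree."""
    if isinstance(subtree, str):
        return
    left, right = subtree
    if not isinstance(left, str):  # interior branch joining this node to `left`
        a, b = left
        yield ((right, b), a)  # exchange a with the subtree on the other side
        yield ((a, right), b)  # exchange b with the subtree on the other side
    if not isinstance(right, str):  # interior branch joining this node to `right`
        a, b = right
        yield (a, (left, b))
        yield (b, (a, left))
    for new_left in nni_below(left):
        yield (new_left, right)
    for new_right in nni_below(right):
        yield (left, new_right)


def nni_neighbours(tree):
    """Yield all NNI neighbours of tree; tree[0] must be a terminal taxon."""
    anchor, rest = tree
    for new_rest in nni_below(rest):
        yield (anchor, new_rest)


def leaves(tree):
    """List the terminal taxa of a nested-tuple tree."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])


tree = ("A", ("B", (("E", "D"), "C")))
neighbours = list(nni_neighbours(tree))
for neighbour in neighbours:
    print(neighbour)
```

A five-taxon tree has two interior branches, so four neighbours are produced, two per interior branch as stated in panel (A). A branch-swapping search would score each neighbour and move to any that is shorter.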
Outgroup comparison
An important concept among the optimality criteria in parsimony methods is the use of an outgroup. An outgroup is any taxon used in phylogenetic analysis that is assumed to be outside the group of taxa under study (Swofford et al., 1996). Incorporation of an outgroup is useful for assigning the direction of change to character-state transformations and for determining the root of a phylogenetic tree. The sister group is often chosen as the outgroup, since it is genealogically the most closely related to the remaining taxa (i.e. the ingroup); it must not, however, be the ancestor of the ingroup. Swofford et al. (1996) emphasise, however, that the assignment of taxa to the outgroup constitutes an automatic assumption that the remaining taxa are monophyletic, an assumption that should be justified by evidence extrinsic to the phylogenetic data at hand. If this assumption is wrong, the tree will be rooted incorrectly.
Consensus trees
A consensus tree is a hierarchical summary of all relationships described by the equally parsimonious trees produced by an algorithm such as those previously described. However, a consensus tree does not necessarily give the best estimate of phylogenetic relationships among groups; it only summarises them and thus must be interpreted with caution. A large number of polytomies (i.e. unresolved regions in the phylogenetic tree) may become evident in the consensus tree when there is much disagreement among the rival trees it is summarising (Baum, 1992; Swofford et al., 1996). In general, a consensus tree is longer than the minimal trees it describes, since the consensus is less resolved than any of the minimal trees.
There are different types of consensus trees (Figure 3.4), with the strict consensus, semi-strict consensus and 50% majority rule trees as the most widely used (Swofford, 1993). A strict consensus tree describes species groupings (clades) that are present in all rival trees. Therefore, it is the most conservative consensus and the easiest to interpret, but unfortunately it may be too strict and result in a completely unresolved consensus for trees that only differ by the placement of one taxon.
The semi-strict consensus tree can be best explained by the example given in Figure 3.4. Trees having an ABC trichotomy or an (AB)C dichotomy, with A and B always together, will result in a semi-strict consensus where AB will be retained. In this circumstance, a semi-strict consensus will conserve the AB relationship while a strict consensus will not.
In contrast, the majority rule consensus defines groups that appear in a predefined percentage of the rival trees it describes (50% in this case). In turn, this means that a clade may be retained even if some of the trees do not resolve it (Swofford, 1993; Figure 3.4).
Figure 3.4. Types of consensus trees. (A) Rival trees to be summarised by each consensus tree; (B) strict consensus tree; (C) semi-strict consensus tree; and, (D) 50% majority rule consensus tree.
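The three consensus rules can be compared on small examples. The sketch below assumes rooted trees written as nested tuples (polytomies are tuples with more than two elements) and treats each tree as a set of clades; the rival trees are invented for illustration and are not those of Figure 3.4. Function names are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return (all taxa, set of clades) for a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    taxa, found = frozenset(), set()
    for child in tree:
        child_taxa, child_clades = clades(child)
        taxa |= child_taxa
        found |= child_clades
    found.add(taxa)
    return taxa, found


def compatible(x, y):
    """Two clades can coexist on one tree if they are nested or disjoint."""
    return x.isdisjoint(y) or x <= y or y <= x


def consensus_clades(trees, method="strict"):
    """Return the clades retained by the chosen consensus rule."""
    clade_sets = []
    for tree in trees:
        all_taxa, found = clades(tree)
        found.discard(all_taxa)  # the group of all taxa is uninformative
        clade_sets.append(found)
    counts = Counter(c for found in clade_sets for c in found)
    n_trees = len(trees)
    if method == "strict":  # present in every rival tree
        return {c for c, k in counts.items() if k == n_trees}
    if method == "majority":  # present in more than 50% of rival trees
        return {c for c, k in counts.items() if k > n_trees / 2}
    if method == "semistrict":  # present somewhere and contradicted nowhere
        return {c for c in counts
                if all(compatible(c, other) for found in clade_sets for other in found)}
    raise ValueError(f"unknown method: {method}")


def show(result):
    """Format clades as sorted strings for printing."""
    return sorted("".join(sorted(c)) for c in result)


# An (AB)C dichotomy and an ABC trichotomy, as in the semi-strict example in the text.
pair = [(("A", "B"), "C"), ("A", "B", "C")]
print(show(consensus_clades(pair, "strict")), show(consensus_clades(pair, "semistrict")))

# Three invented rival trees that disagree on the placement of some taxa.
rivals = [((("A", "B"), "C"), "D"), ((("A", "B"), "D"), "C"), ((("A", "C"), "B"), "D")]
print(show(consensus_clades(rivals, "majority")))
```

For the first pair of trees the strict rule retains nothing while the semi-strict rule retains AB, which is the behaviour described in the text.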
Robustness of an inferred tree: goodness-of-fit statistics
Several statistics can be calculated to determine the goodness of fit of phylogenetic trees with the data sets they are describing (Swofford, 1993). The most widely used measure of robustness is the consistency index (CI), which provides an indication of how well a particular tree topology explains the data. In simple terms, a transformation series with little or no homoplasy will yield a high CI value (1 being the maximum), while those with high homoplasy have a lower value (0 as a minimum). CIs are calculated on the basis of synapomorphies only, expressed as the minimum number of changes or steps necessary if all data agreed (m) divided by the actual number of steps (s) in the tree (i.e. CI = m/s; Kluge and Farris, 1969).
Another useful goodness-of-fit statistic is the retention index (RI), which indicates how well characters fit the tree that describes them. Farris (1989) described the RI to express the amount of synapomorphy in a data set by examining the actual amount of homoplasy as a function of the maximum possible homoplasy, or in other words, the proportion of similarities in a tree due to synapomorphies. In addition, there is also the homoplasy index, which provides an indication as to the amount of homoplasy in a tree (Swofford, 1991 & 1993). All of these statistics can be determined for individual characters in addition to entire data sets.
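The three indices are simple ratios of step counts, as the following sketch shows. Here m and s are the minimum and observed numbers of steps; the symbol g for the maximum possible number of steps is an assumption introduced for this example, with RI = (g - s)/(g - m) and the homoplasy index taken as HI = 1 - CI. The step counts used are invented.

```python
def fit_statistics(m, s, g):
    """Return (CI, RI, HI) from minimum (m), observed (s) and maximum (g) steps."""
    if not 0 < m <= s <= g:
        raise ValueError("expected 0 < m <= s <= g")
    ci = m / s  # consistency index
    hi = 1 - ci  # homoplasy index
    # RI is undefined when g == m, i.e. when no homoplasy is possible at all
    ri = (g - s) / (g - m) if g > m else None
    return ci, ri, hi


ci, ri, hi = fit_statistics(m=10, s=12, g=20)
print(round(ci, 3), round(ri, 3), round(hi, 3))
```

With these invented values, two of the twelve observed steps are extra steps due to homoplasy, out of a possible ten, so the RI is 0.8 while the CI is about 0.83.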
The robustness of a specific clade within a tree can also be assessed by: (1) observing the number of synapomorphies that support each branch in the tree; and, (2) obtaining a decay index. The latter is a useful index of support for a monophyletic group obtained by calculating the difference in tree lengths between the shortest trees that contain a group versus those that lack the group (Bremer, 1988). Further details of these and additional indices (e.g. the topology-dependent permutation tail probability, T-PTP test) can be found in Armstrong et al.
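The decay index calculation reduces to a difference of two minima, sketched below. The sketch assumes that a collection of candidate trees has already been scored, each represented by its set of clades and its length; the trees and lengths are invented, and the function name is hypothetical.

```python
def decay_index(scored_trees, group):
    """Length of the shortest tree lacking group minus the shortest containing it."""
    with_group = [length for found, length in scored_trees if group in found]
    without_group = [length for found, length in scored_trees if group not in found]
    if not with_group or not without_group:
        raise ValueError("need at least one tree with and one without the group")
    return min(without_group) - min(with_group)


AB, CD, DE, AC = (frozenset(x) for x in ("AB", "CD", "DE", "AC"))
# Invented (clades, length) pairs standing in for the output of a tree search.
scored_trees = [({AB, DE}, 4), ({AB, CD}, 5), ({AC, DE}, 6)]
print(decay_index(scored_trees, AB), decay_index(scored_trees, DE))
```

The value is only as reliable as the set of trees examined: if the shortest tree lacking the group has not been found, the index will be overestimated.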
Reliability of inferred trees
The question of certainty of the historical relationships represented in a particular tree has been addressed by assigning confidence limits to its branches. For this purpose, two procedures have been widely used: the bootstrap and the jack-knife methods. The bootstrap method (Felsenstein, 1985) is a non-parametric resampling method (Hillis et al., 1996) which operates by estimating the variance of the sampling distribution by repeatedly resampling data from the original data set. This method is based on the mathematical principle of constructing a series of fictional matrices of the same size as the original by sampling characters with replacement, so that data from the original matrix may be present once, more than once, or not at all in the new matrix. Each bootstrap data set is then analysed using a heuristic or branch-and-bound search to produce a tree or a set of trees (in the present study, heuristic searches were used). This procedure is repeated a predetermined number of times (normally 100) on the random samples, and the percentage of occurrence of a particular group or component that appears in the consensus of the bootstrapped trees is regarded as an index of support for monophyly in that group.
Although this method is widely used in phylogenetic studies, its support values cannot be considered true confidence limits in a statistical sense. Moreover, it has created controversy (Van Dongen, 1995; Felsenstein & Kishino, 1993; Hillis & Bull, 1993), as a necessary condition for the bootstrap to be valid is that configurations in the characters are independently and identically distributed. This is not true in all evolutionary processes or in the case of multistate characters recoded into binary data.
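The resampling step of the bootstrap can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the present study: characters are the columns of an invented binary matrix, and the tree search is replaced by a deliberately trivial stand-in (toy_infer_groups) that reports a group whenever some resampled character has state 1 in exactly those taxa. In a real analysis a heuristic or branch-and-bound parsimony search would be supplied in its place. All names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_matrix(matrix, rng):
    """Resample characters (columns) with replacement, keeping the matrix size."""
    n_chars = len(next(iter(matrix.values())))
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {taxon: "".join(states[i] for i in columns)
            for taxon, states in matrix.items()}


def bootstrap_support(matrix, infer_groups, n_replicates=100, seed=0):
    """Percentage of replicates in which each group is recovered by infer_groups."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        counts.update(infer_groups(bootstrap_matrix(matrix, rng)))
    return {group: 100.0 * k / n_replicates for group, k in counts.items()}


def toy_infer_groups(matrix):
    """Stand-in for a tree search: groups marked by a shared derived state."""
    n_chars = len(next(iter(matrix.values())))
    found = set()
    for i in range(n_chars):
        group = frozenset(t for t, states in matrix.items() if states[i] == "1")
        if 1 < len(group) < len(matrix) - 1:  # skip uninformative characters
            found.add(group)
    return found


# Invented binary matrix: rows are taxa, columns are characters.
matrix = {"A": "1001", "B": "1000", "C": "0010", "D": "0110", "E": "0110"}
support = bootstrap_support(matrix, toy_infer_groups, n_replicates=100, seed=0)
for group, percent in sorted(support.items(), key=lambda item: sorted(item[0])):
    print("".join(sorted(group)), percent)
```

Groups supported by a single character are absent from any replicate in which that character happens not to be drawn, which is why sparsely supported groups receive low bootstrap percentages.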
The jack-knife procedure (Mueller & Ayala, 1982) operates in a similar way to the bootstrap, but is based on gene frequency data. This method resamples the original data set by eliminating k data points at a time and recomputing the estimate from the remaining n - k observations. New trees are constructed from the resulting reduced matrix and a cluster from the original tree is confirmed as robust if it appears in the new tree.
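The deletion step of the jack-knife can be sketched in the same way. The sketch assumes that the "data points" are the characters (columns) of a discrete matrix, which is a simplification of the gene frequency setting described above, and it enumerates every way of removing k of the n characters. The matrix is invented and the function name is hypothetical; tree construction from each reduced matrix is not shown.

```python
from itertools import combinations


def jackknife_matrices(matrix, k=1):
    """Yield (dropped columns, reduced matrix) for every deletion of k characters."""
    n_chars = len(next(iter(matrix.values())))
    for dropped in combinations(range(n_chars), k):
        kept = [i for i in range(n_chars) if i not in dropped]
        reduced = {taxon: "".join(states[i] for i in kept)
                   for taxon, states in matrix.items()}
        yield dropped, reduced


# Invented binary matrix: rows are taxa, columns are characters.
matrix = {"A": "1001", "B": "1000", "C": "0010", "D": "0110", "E": "0110"}
for dropped, reduced in jackknife_matrices(matrix, k=1):
    print(dropped, reduced)
```

Each reduced matrix has n - k characters; a tree would be built from each, and a cluster from the original tree counted as robust when it reappears.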
3.2.5. Data analysis and phylogeny reconstruction based on cpDNA