Tree Reconstruction - Stochastic Processes and Phylogenetics

Chapter 3 Statistical Techniques for Functional Data Analysis

3.5 Stochastic Processes and Phylogenetics

3.5.2 Tree Reconstruction

The estimation method presented above makes a critical assumption: the phylogenetic relations among the “leaves” of the phylogenetic tree used are correct. This could not be further from the truth; in reality the phylogenetic tree is at best a sensible approximation [61; 154]. Three main approaches have been proposed as suitable approximation schemes:

• Distance-based trees

• Maximum Parsimony-based trees

• Maximum Likelihood-based trees

The idea behind distance-based trees is analogous to that used in clustering. One chooses a metric of similarity between the given extant taxa of the phylogeny and computes a distance matrix based on that metric. Aside from the obvious choice of Euclidean distance, other distance metrics, e.g. the Levenshtein distance in linguistic studies, are popular choices. Given the distance matrix, two approaches can be taken: top-down or bottom-up [129]. In the first case one finds a point that best partitions the data into two well-separated partitions and then recursively applies the same splitting to each of the two resulting partitions. In the bottom-up approach one first merges the two data points that are closest together and assigns them to the same cluster. The same step is then repeated, with the merged cluster treated as a single point. This approach, while rather straightforward, has a problematic property: it does not account for the root. This bottom-up approach, known in the Computer Science literature as agglomerative clustering, is the essence of one of the most popular early phylogenetic tree reconstruction algorithms, neighbour joining (NJ), the other being the Unweighted Pair Group Method with Arithmetic mean (UPGMA) [31]. Algorithmically, both NJ and UPGMA run in O(N^3) time. While computationally efficient in comparison with other approaches, neither method guarantees a tree in which no edge length is negative; both are also extremely sensitive to the choice of the distance metric used and, as such, have been deemed “inappropriate” for most modern-day phylogenetic analyses. Often an NJ or UPGMA tree serves as an “initial solution” tree for more advanced methods.
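The bottom-up procedure described above can be illustrated with a minimal sketch: a Levenshtein distance between strings followed by a naive UPGMA merge loop. The function names (`levenshtein`, `upgma`), the nested-tuple tree representation and the example word list are assumptions made for illustration only; this is not the implementation used in the cited works, and it makes no attempt at efficiency.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def upgma(labels, dist):
    """Naive UPGMA on distinct leaf labels.

    `dist` maps frozenset({a, b}) -> distance between leaves a and b.
    Returns a nested tuple (left, right, height), where `height` is the
    distance from the merged node down to the leaves below it.
    """
    # Each cluster is stored as (subtree, number_of_leaves) under an integer id.
    clusters = {i: (lab, 1) for i, lab in enumerate(labels)}
    d = {frozenset((i, j)): dist[frozenset((labels[i], labels[j]))]
         for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        height = d.pop(pair) / 2
        # Size-weighted average distance from the new cluster to all others.
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = ((ti, tj, height), ni + nj)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


# Hypothetical word list, used only to exercise the two functions.
words = ["notte", "noche", "nuit", "night"]
dist = {frozenset((a, b)): levenshtein(a, b) for a, b in combinations(words, 2)}
print(upgma(words, dist))
```

Because every merge places the new node at half the inter-cluster distance, the resulting tree is ultrametric; this is a property of UPGMA in particular and does not hold for NJ.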

The parsimony-based approach views each phylogeny as a model of evolution and tries to fit the most parsimonious model; it is based on the same theoretical principle as model selection: Occam’s razor [86]. The parameters of a model in the case of phylogenetic trees, though, are evolutionary transitions, roughly speaking speciation events. Unfortunately, while this appears reasonably coherent, it often results in over-simplified models. It enforces phylogenetic affinity by requiring two leaves that exhibit the same trait to be related. While this requirement is quite plausible, convergent evolution (footnote 17) among species has shown it not to be a necessary condition. This manifests in the well-known problem of long branch attraction, i.e. the clustering together of otherwise unrelated species. The main critique against maximum parsimony rests on its inconsistency. As with any information measure used for model selection, one would expect P(choose the true model) → 1 as the sample size N → ∞, but maximum parsimony does not guarantee that. Algorithmically, maximum parsimony is an NP-hard problem, and while certain well-adapted heuristic algorithms do exist, this also tends to make it undesirable. Additionally, it is often the case that a number of “equally” good parsimony trees are produced for a given dataset. In those cases a majority rule is enforced, but it is not guaranteed to resolve the tie, especially in cases where the data are not highly informative with regard to the phylogeny in question [154]. Ultimately, parsimony-reconstructed trees have generally been outperformed by ML-based methods.

Footnote 17: By convergent evolution we mean the independent evolution of similar features in species of different phylogenies, e.g. the presence of wings in bats and birds.
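To make the parsimony criterion and its sensitivity to convergent evolution concrete, the following minimal sketch scores a single character on a fixed rooted binary tree using Fitch's small-parsimony algorithm. Fitch's algorithm is chosen here for brevity and is not named in the text; the function name `fitch`, the nested-tuple trees and the toy trait table are assumptions for illustration.

```python
def fitch(tree, states):
    """Return (candidate_state_set, n_changes) for a rooted binary tree.

    `tree` is either a leaf name (str) or a pair (left, right);
    `states` maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one state change is charged at this node.
    return left_set | right_set, left_cost + right_cost + 1


# Toy character: presence of wings (hypothetical coding).
states = {"bat": "wings", "bird": "wings", "mouse": "none", "lizard": "none"}

mammals_together = (("bat", "mouse"), ("bird", "lizard"))
wings_together = (("bat", "bird"), ("mouse", "lizard"))

print(fitch(mammals_together, states)[1])  # 2 changes
print(fitch(wings_together, states)[1])    # 1 change
```

On this single character the tree that groups bats with birds needs fewer changes, so a parsimony criterion would prefer it, even though the shared trait arose independently. Searching over all tree topologies, rather than scoring one given tree, is the part of the problem that is NP-hard.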

The ML-based trees are exactly that: the phylogenies that maximize the likelihood of observing the extant taxa under the evolutionary model assumed [79]. In short, “likelihood methods produce a number of trees, one of which is usually found to be the most likely tree” [154]. Under this approach one specifies a model of evolutionary change (e.g. the OU model presented beforehand) and then calculates the probability of the data given the evolutionary history presented by the tree. Evidently the quality of this methodology depends on a successful choice of evolutionary model. ML-based methods, in contrast with maximum-parsimony-based methods, do use branch lengths to calculate the distance between points of the phylogeny; exactly because of that, they also enable the practitioner to seamlessly infer ancestral states along the phylogeny in question. The most obvious theoretical limitation of “simple” ML-based methods is that they assume a unique rate of change along a phylogeny. Multiple rates can be assumed, but overfitting can be an issue, especially when one is presented with a smaller dataset. Standard information criteria (e.g. AIC) can also be employed here. In practice one starts with a specified tree derived from the input set (e.g. an NJ tree) and then adjusts the branch lengths in order to produce the “ML-tree” [154]; other methodologies go as far as sampling the whole tree space, effectively examining 2^N different trees, but this is an extremely expensive approach for all but the smallest datasets. Direct generalizations of this approach are presented within a Bayesian setting [143; 273]. While we do not explore this in detail, priors are used there to account for prior assumptions regarding certain branch lengths, evolutionary optima, and other model parameters.
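The phrase “the probability of the data given the evolutionary history presented by the tree” can be sketched for a single discrete character using Felsenstein's pruning recursion. Note the substitution: the sketch assumes the Jukes-Cantor (JC69) model for a four-state character instead of the OU model mentioned in the text, purely because it has a closed-form transition probability. The names `jc69_prob`, `partial` and `site_likelihood`, the tree encoding and the leaf labels are hypothetical.

```python
import math

STATES = "ACGT"


def jc69_prob(same: bool, t: float) -> float:
    """JC69 transition probability over a branch of length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e


def partial(tree, site):
    """Conditional likelihoods P(leaf data below node | node state x).

    `tree` is a leaf name (str) or a pair ((left, t_left), (right, t_right));
    `site` maps each leaf name to its observed state.
    """
    if isinstance(tree, str):
        return {x: 1.0 if x == site[tree] else 0.0 for x in STATES}
    result = {x: 1.0 for x in STATES}
    for child, t in tree:
        below = partial(child, site)
        for x in STATES:
            # Sum over the unknown state y at the child end of the branch.
            result[x] *= sum(jc69_prob(x == y, t) * below[y] for y in STATES)
    return result


def site_likelihood(tree, site):
    """Likelihood of one site, with a uniform distribution at the root."""
    return sum(0.25 * v for v in partial(tree, site).values())


# Hypothetical three-leaf tree with made-up branch lengths.
inner = (("a", 0.1), ("b", 0.1))
tree = ((inner, 0.2), ("c", 0.4))
print(site_likelihood(tree, {"a": "A", "b": "A", "c": "G"}))
```

Producing an “ML-tree” from a starting topology would then amount to adjusting the branch lengths in `tree` so that the product of such site likelihoods over all characters is maximal; the optimizer itself is omitted here.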

Interestingly, none of the proposed methodologies internally addresses the issue of rooting a tree. In general there are two approaches: mid-point rooting and outgroup usage [165]. The first approach assumes that the longest path between two extant taxa denotes the most “archaic” split, and therefore the tree is rooted at the mid-point of that path. The second approach first fits an unrooted phylogenetic tree T to the original data and then adds an obvious “phylogenetic outlier” to that tree. The new node connecting the original unrooted tree to the outgroup taxon is considered to be the root of the tree. The rationale behind this technique is straightforward: if, for example, Greek is added to a phylogeny of Romance languages, the bifurcation between Greek and all the Romance languages must be the “oldest” one. Clearly this method can be problematic, because one might pick either an outgroup that is actually related to some of the original members of the phylogeny or an outgroup that is so extreme (for instance a Papuan language in the case of a Romance phylogeny) that the rooting results become “random”, as there are no meaningful similarities to start with.
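Mid-point rooting as described above can be sketched as follows, assuming the unrooted tree is given as a weighted adjacency dictionary with strictly positive edge lengths. The function names `farthest` and `midpoint_root`, the node labels and the edge lengths are illustrative assumptions. The longest leaf-to-leaf path is found with the standard two-pass search for the diameter of a tree.

```python
def farthest(adj, start):
    """Return (node, distance, path) for the node farthest from `start`."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nxt, w in adj[node].items():
            if nxt != parent:  # a tree has no cycles, so this suffices
                stack.append((nxt, node, dist + w, path + [nxt]))
    return best


def midpoint_root(adj):
    """Locate the mid-point of the longest leaf-to-leaf path.

    Returns (u, v, offset): the root lies on edge u-v, `offset` away from u.
    """
    any_node = next(iter(adj))
    a, _, _ = farthest(adj, any_node)    # one end of the longest path
    b, length, path = farthest(adj, a)   # the other end, plus the path itself
    half, walked = length / 2.0, 0.0
    for u, v in zip(path, path[1:]):
        w = adj[u][v]
        if walked + w >= half:
            return u, v, half - walked
        walked += w


# Hypothetical unrooted tree: leaves A-D, internal nodes X and Y.
adj = {
    "A": {"X": 1.0}, "B": {"X": 2.0},
    "C": {"Y": 4.0}, "D": {"Y": 1.0},
    "X": {"A": 1.0, "B": 2.0, "Y": 1.0},
    "Y": {"C": 4.0, "D": 1.0, "X": 1.0},
}
print(midpoint_root(adj))
```

In this toy tree the longest path runs between B and C (length 7), so the root is placed on the C-Y edge, 3.5 away from C. Outgroup rooting needs no such computation: the root is simply the node at which the outgroup attaches.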

In document Functional data analysis in phonetics (Page 82-85)