One possible strategy for inferring large phylogenies is to estimate a large number of “small” trees, then amalgamate them into a complete tree for the whole data set. Quartet methods take this approach, building the tree from a large number of quartets, that is, trees on four taxa.

For four taxa, there are only three possible topologies (see Figure 2.2). For taxa a, b, c, d, we denote by ab|cd the topology where a and b are separated from c and d by the middle edge.

The choice of quartets as the building block for larger trees is motivated by simplicity. Below, we will review the theoretical and practical algorithms for constructing trees from quartets.

2.3.1

Inferring a quartet

The simplest algorithm for inferring a four-taxon phylogeny is known as the four-point method. If the distances d are additive, then for a quartet topology ab|cd we should have

$$d_{ab} + d_{cd} < d_{ac} + d_{bd},\ d_{ad} + d_{bc}$$

In other words, the sum of the distances corresponding to the two pairs of siblings in the correct topology is smaller than the other two sums. The difference is equal to twice the length of the middle edge. It follows that the topology of a quartet can be estimated by picking the minimum of the three sums:

$$d_{ab} + d_{cd}, \quad d_{ac} + d_{bd}, \quad d_{ad} + d_{bc}$$
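The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `four_point`, the nested-dictionary layout of `dist`, and the toy distances (leaf edges of length 1, middle edge of length 2) are assumptions made for the example, not part of any particular program.

```python
def four_point(dist, a, b, c, d):
    """Return the quartet topology on {a, b, c, d} as a pair of pairs.

    `dist` is assumed to be a symmetric lookup, so that dist[x][y] is the
    estimated distance between taxa x and y.
    """
    sums = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    # The topology whose two sibling pairs give the smallest sum wins.
    return min(sums, key=sums.get)


# Toy additive distances for the tree ab|cd with unit leaf edges and a
# middle edge of length 2 (illustrative values only).
pairwise = {
    ("a", "b"): 2.0, ("c", "d"): 2.0,
    ("a", "c"): 4.0, ("a", "d"): 4.0,
    ("b", "c"): 4.0, ("b", "d"): 4.0,
}
dist = {x: {} for x in "abcd"}
for (x, y), value in pairwise.items():
    dist[x][y] = dist[y][x] = value

topology = four_point(dist, "a", "b", "c", "d")
```

With these distances the smallest sum is 4 and the other two sums are 8, so the gap is 4, twice the length of the middle edge.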

Figure 2.3: An example of three quartet topologies on five taxa that are not consistent with any five-taxa topology. For each pair of quartets from the set {ab|cd, bc|de, ae|bd}, there exists a single topology on five taxa consistent with that pair of quartets (shown in black).

In most programs, quartets are inferred by maximum likelihood. This is usually done numerically, by repeatedly optimizing branch lengths for each of the three topologies. The program then chooses the topology with the highest likelihood. This strategy results in higher accuracy compared to the four-point method, at the cost of increased runtime. Chor et al. [36] derived an analytical formula for maximizing the likelihood of ultrametric quartets.

2.3.2

Maximum Quartet Compatibility

It is not always possible to find a tree that agrees with every quartet in a given set of quartets. For example, there is no tree that is consistent with the quartets in Figure 2.3. Such inconsistent sets of quartets will arise frequently in practice as a result of errors in individual quartet inferences. Determining if a set of quartets is consistent is an NP-complete problem [153].
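For five taxa the claim about Figure 2.3 can be checked by brute force, since there are only 15 binary unrooted topologies. The sketch below assumes single-letter taxon names and a hypothetical representation in which a five-taxon tree is identified by its two cherries (sibling pairs); on five taxa, a tree displays a quartet exactly when one of the quartet's two pairs is a cherry of the tree.

```python
from itertools import combinations


def five_taxon_trees(taxa):
    """Yield each binary unrooted tree on five taxa as a pair of cherries."""
    for middle in taxa:
        w, x, y, z = [t for t in taxa if t != middle]
        # Pair w with each of the other three taxa; the remaining two
        # taxa form the second cherry.
        for partner in (x, y, z):
            rest = [t for t in (x, y, z) if t != partner]
            yield frozenset({w, partner}), frozenset(rest)


def displays(tree, quartet):
    """True if the tree displays a quartet written like 'ab|cd'."""
    left, right = (frozenset(side) for side in quartet.split("|"))
    return left in tree or right in tree


quartets = ["ab|cd", "bc|de", "ae|bd"]
trees = list(five_taxon_trees("abcde"))

# Trees consistent with all three quartets at once.
consistent_with_all = [t for t in trees if all(displays(t, q) for q in quartets)]

# Number of trees consistent with each pair of quartets.
pair_counts = {
    (q1, q2): sum(displays(t, q1) and displays(t, q2) for t in trees)
    for q1, q2 in combinations(quartets, 2)
}
```

Here `consistent_with_all` comes out empty while every entry of `pair_counts` is 1, matching the caption of Figure 2.3. Exhaustive enumeration like this is only feasible for very small taxon sets.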

A natural approach when dealing with inconsistent sets of quartets is to find a tree that is consistent with the maximum number of quartets. This is known as the Maximum Quartet Compatibility (MQC) problem. Finding an MQC tree is NP-hard, as proven by Berry et al. [14]. Berry and Gascuel [13] proposed an exponential-time algorithm for the problem. A series of more scalable exponential-time algorithms was proposed by Lin et al. [170, 169]. Jiang et al. [93] derived a polynomial-time approximation scheme for the problem.

Quartet Puzzling (QP) [158] is a widely used quartet method inspired by earlier MQC approaches. After inferring all $\binom{n}{4}$ quartets, it starts from a star tree of three taxa, and adds taxa sequentially until an n-taxon phylogeny is produced. At each insertion, the new taxon is attached to the existing tree so as to maximize the number of quartet topologies consistent with the location of the new taxon. The iterative insertion algorithm is repeated a large number of times for different orderings of taxa, and the algorithm outputs the consensus tree of all the trees produced during the puzzling step. While this approach does not offer any theoretical guarantees on accuracy, it gives reasonable results in practice and has found a large number of users, despite its quartic runtime.
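A single insertion step of this kind can be sketched as follows. This is a simplified illustration, not the Quartet Puzzling implementation: the adjacency-dictionary tree, the function names, the `"_" + taxon` label for new internal nodes, and the tie-breaking rule (first edge in sorted order) are all assumptions, and the repeated orderings and consensus step are omitted. It relies on the fact that a tree displays ab|cd exactly when the path from a to b shares no node with the path from c to d.

```python
def path(adj, src, dst):
    """Return the set of nodes on the unique path from src to dst."""
    stack = [(src, (src,))]
    while stack:
        node, trail = stack.pop()
        if node == dst:
            return set(trail)
        for nxt in adj[node]:
            if nxt not in trail:
                stack.append((nxt, trail + (nxt,)))
    raise ValueError("no path between the given nodes")


def displays(adj, quartet):
    """True if the tree displays the quartet ((a, b), (c, d))."""
    (a, b), (c, d) = quartet
    return path(adj, a, b).isdisjoint(path(adj, c, d))


def attach(adj, edge, taxon):
    """Return a copy of the tree with `taxon` hung off the middle of `edge`."""
    u, v = edge
    new = {node: set(nbrs) for node, nbrs in adj.items()}
    mid = "_" + taxon  # hypothetical label for the new internal node
    new[u].remove(v)
    new[v].remove(u)
    new[u].add(mid)
    new[v].add(mid)
    new[mid] = {u, v, taxon}
    new[taxon] = {mid}
    return new


def best_attachment(adj, taxon, quartets):
    """Attach `taxon` to the edge that satisfies the most given quartets."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})

    def score(edge):
        candidate = attach(adj, edge, taxon)
        return sum(displays(candidate, q) for q in quartets)

    best = max(edges, key=score)  # ties go to the first edge in sorted order
    return attach(adj, best, taxon), best


# Start from a star tree on a, b, c with centre "r", then add d and e.
star = {"r": {"a", "b", "c"}, "a": {"r"}, "b": {"r"}, "c": {"r"}}
tree4, edge_d = best_attachment(star, "d", [(("a", "b"), ("c", "d"))])
quartets_e = [
    (("a", "b"), ("c", "e")), (("a", "b"), ("d", "e")),
    (("a", "e"), ("c", "d")), (("b", "e"), ("c", "d")),
]
tree5, edge_e = best_attachment(tree4, "e", quartets_e)
```

In this toy run, d is attached next to c, and e is then placed on the middle edge of the resulting quartet tree, the only position that satisfies all four quartets involving e.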

A major flaw of the MQC approach is that it treats all quartet trees as equally likely to be correct. This is not true, as we will see in Chapter 3. For example, quartets consisting of closely related taxa are more likely to be inferred correctly since short distance estimates have lower variance. To account for this, several methods weight quartets by their likelihood score or a posteriori probability. If the three possible quartet topologies for a set of taxa have likelihoods $\ell_1, \ell_2, \ell_3$, we can approximate the posterior probability of the first topology by writing

$$p_1 = \frac{\ell_1}{\ell_1 + \ell_2 + \ell_3}$$
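As a small illustration, the normalization can be computed for all three topologies at once. The sketch below assumes the likelihoods are supplied as log-likelihoods, which is how they are commonly reported, and subtracts the maximum before exponentiating to avoid numerical underflow; the function name `quartet_weights` is hypothetical.

```python
import math


def quartet_weights(log_likelihoods):
    """Normalize three log-likelihoods into approximate posterior weights."""
    top = max(log_likelihoods)
    # Rescaling by the largest value leaves the ratios unchanged.
    scaled = [math.exp(value - top) for value in log_likelihoods]
    total = sum(scaled)
    return [value / total for value in scaled]


# Likelihoods in the ratio 6 : 3 : 1 (illustrative values only).
weights = quartet_weights([math.log(6.0), math.log(3.0), math.log(1.0)])
```

The resulting weights sum to one, and a quartet whose three topologies have nearly equal likelihoods receives weights close to 1/3 each, so it contributes little evidence for any one topology.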

This weighting scheme was first used in the second version of Quartet Puzzling [157], which led to considerable improvement in the quality of inferred trees. Ranwez and Gascuel [137] proposed an improved algorithm using a similar weighting scheme. In their experiments, they showed that their Weight Optimization algorithm outperformed QP, but the accuracy of both algorithms remained markedly lower than that of maximum likelihood.

St. John et al. [94] performed a large-scale experimental study comparing the accuracy of unweighted quartet methods to Neighbour Joining. They found that in most cases, unweighted quartet methods were much less accurate than Neighbour Joining.

2.4

Mathematical guarantees on the accuracy of phy-