Clustering Methods - Some applications of statistical phylogenetics : a thesis presented in par


7.1.2 Clustering Methods

The first step of each of the clustering methods we consider here is to build a distance matrix $d_{ij}$ from the sequence data. In some methods it is important to correct the distances for multiple changes, otherwise systematic error can mislead the analysis (Huson and Steel, 2004). This correction is a monotone transformation, so it does not affect the Complete or Single Linkage methods (described below). Although most clustering algorithms need $O(n^3)$ steps, computing the distances requires $O(n^2 m)$ steps, where $n$ and $m$ are the number of species and the length of the sequences, respectively. Thus if $m > n$, building the distance matrix may be computationally more costly than the clustering algorithm itself.
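To make the cost and the monotonicity argument concrete, the following minimal sketch builds a distance matrix from aligned sequences. It assumes the uncorrected distance is the proportion of differing sites and uses the Jukes-Cantor formula as one example of a correction for multiple changes; the function names and the toy sequences are illustrative and not taken from the thesis.

```python
import math
from itertools import combinations

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ (O(m))."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69(p):
    """Jukes-Cantor correction (assumed here as an example correction).

    It is strictly increasing for 0 <= p < 0.75, hence monotone.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs, correct=True):
    """n(n-1)/2 pairs, each compared over m sites: O(n^2 m) in total."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        p = p_distance(seqs[i], seqs[j])
        d[i][j] = d[j][i] = jc69(p) if correct else p
    return d

# Toy alignment (illustrative data only).
seqs = ["ACGTACGTAC", "ACGTACGTTT", "ACGAACGTTT", "TCGAACCTTA"]
raw = distance_matrix(seqs, correct=False)
cor = distance_matrix(seqs, correct=True)

# A monotone correction leaves the ranking of the pairs unchanged.
pairs = list(combinations(range(len(seqs)), 2))
same_order = (sorted(pairs, key=lambda p: raw[p[0]][p[1]])
              == sorted(pairs, key=lambda p: cor[p[0]][p[1]]))
```

Because only the ranking of the pairs matters to Single and Complete Linkage, `same_order` being true is what makes the correction irrelevant for those two methods.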

For our comparison we will use several of the agglomerative clustering algorithms (UPGMA, WPGMA, Single Linkage and Complete Linkage clustering). We describe agglomerative clustering in Algorithm 7.1; more details of these algorithms can be found in Sneath and Sokal (1973) or Kaufman and Rousseeuw (1990). I also showed some relations of UPGMA and WPGMA with linear models in section 5.1.1. In phylogenetic analysis the most widely used clustering method is the Neighbor Joining (NJ) algorithm (Saitou and Nei, 1987). We include NJ and some extensions (Weighbor by Bruno et al. (2000) and BioNJ by Gascuel (1997a)) in the comparison. We now describe the basic principle of agglomerative clustering (Kaufman and Rousseeuw, 1990; Sneath and Sokal, 1973). The clustering methods differ in their linkage functions (step 3) and also in the selection criteria (step 2):

Algorithm 7.1 Agglomerative clustering

1. Initial condition

Each species represents a cluster. The distances between the clusters correspond to the (possibly corrected) distances or dissimilarities between the species of one cluster and those of another.

2. Selection criterion

The two closest clusters $C_k$ and $C_l$ are merged to form a new cluster $C_m$:

$$d(C_k, C_l) = \min_{i \neq j} d(C_i, C_j) \tag{7.1}$$

$$C_m \leftarrow C_k \cup C_l$$

3. Reduction formula

The distance matrix is now updated. The clusters $C_k$ and $C_l$ are deleted, and the distances between the cluster $C_m$ and each of the remaining clusters are computed with a linkage function (see below).

Linkage functions

The four methods under consideration differ in their linkage functions. In each case the linked clusters are $C_v$ and $C_w$, where $d(C_v, C_w)$ is minimal over all cluster pairs. For UPGMA,

$$d(C_v, C_w) = d_U(C_v, C_w) = \frac{1}{|C_v||C_w|} \sum_{n \in C_v} \sum_{m \in C_w} d_{n,m}; \tag{7.2}$$

for WPGMA,

$$d(C_v, C_w) = d_W(C_v, C_w) = \frac{1}{2} \sum_{n \in C_v} \sum_{m \in C_w} d_{n,m}; \tag{7.3}$$

for Single Linkage,

$$d(C_v, C_w) = d_S(C_v, C_w) = \min_{n \in C_v,\, m \in C_w} d_{n,m}; \tag{7.4}$$

and for Complete Linkage,

$$d(C_v, C_w) = d_C(C_v, C_w) = \max_{n \in C_v,\, m \in C_w} d_{n,m}. \tag{7.5}$$

Hence in UPGMA the distance between clusters $C_v$ and $C_w$ is the average of the distances across each pair of taxa, with one taxon from each cluster. In WPGMA, the cluster distances are weighted by the size of the cluster. The distance between clusters in the Single Linkage method is the minimum distance between pairs of taxa, with one taxon from each cluster, and for Complete Linkage it is the maximum pairwise distance. It is reported (Sneath and Sokal, 1973) that trees reconstructed with Single Linkage are likely to be less balanced than those generated by UPGMA or WPGMA, and that Complete Linkage is more likely to produce balanced trees. Here we extend this result by comparing different biases among these four clustering methods, and contrast these with the NJ methods we describe later.
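The three steps of Algorithm 7.1 and the four linkage functions can be sketched together as follows. This is a minimal illustration, not the implementation used for the comparison. The UPGMA, Single Linkage and Complete Linkage updates are recursive forms that agree with equations (7.2), (7.4) and (7.5). For WPGMA the sketch assumes the commonly used recursive update, the unweighted mean of the two merged clusters' distances. All names and the example matrix are hypothetical.

```python
from itertools import combinations

def update(linkage, dk, dl, nk, nl):
    """Distance from the merged cluster C_k ∪ C_l to another cluster C_w.

    dk, dl: distances d(C_k, C_w) and d(C_l, C_w); nk, nl: cluster sizes.
    """
    if linkage == "upgma":       # average over all taxon pairs
        return (nk * dk + nl * dl) / (nk + nl)
    if linkage == "wpgma":       # assumed recursive form: plain mean
        return (dk + dl) / 2.0
    if linkage == "single":      # minimum pairwise distance
        return min(dk, dl)
    if linkage == "complete":    # maximum pairwise distance
        return max(dk, dl)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerate(d, linkage="upgma"):
    """Return the merge history as [(members_k, members_l, distance), ...]."""
    n = len(d)
    # Step 1: each species is its own cluster.
    clusters = {i: (i,) for i in range(n)}
    dist = {frozenset((i, j)): d[i][j] for i, j in combinations(range(n), 2)}
    merges, next_id = [], n
    while len(clusters) > 1:
        # Step 2: select the two closest clusters.
        k, l = min(combinations(sorted(clusters), 2),
                   key=lambda p: dist[frozenset(p)])
        height = dist[frozenset((k, l))]
        ck, cl = clusters.pop(k), clusters.pop(l)
        # Step 3: reduce the matrix using the linkage function.
        for w in clusters:
            dist[frozenset((next_id, w))] = update(
                linkage, dist[frozenset((k, w))], dist[frozenset((l, w))],
                len(ck), len(cl))
        clusters[next_id] = ck + cl
        merges.append((ck, cl, height))
        next_id += 1
    return merges

# Hypothetical, non-ultrametric distances between four taxa.
d = [[0, 2, 4, 9],
     [2, 0, 6, 7],
     [4, 6, 0, 8],
     [9, 7, 8, 0]]
heights = {name: [h for _, _, h in agglomerate(d, name)]
           for name in ("upgma", "wpgma", "single", "complete")}
```

On this matrix all four methods merge the clusters in the same order, but the merge distances differ: Single Linkage gives the smallest values and Complete Linkage the largest, with the two averaging methods in between.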

A distance is said to be ultrametric if it fulfills the following three-point condition:

$$d(i, j) \le \max\{d(i, k), d(j, k)\}. \tag{7.6}$$

Ultrametric distances can always be displayed on a weighted clock-like tree. A weaker condition is additivity: distances are said to be additive if the four-point condition holds,

$$d(i, j) + d(k, l) \le \max\{d(i, k) + d(j, l), d(i, l) + d(j, k)\}. \tag{7.7}$$

Additive distances can always be displayed on a weighted tree (Buneman, 1971). Complete Linkage and Single Linkage only require that distances are ordinal; they are invariant under (positive) monotone transformations of the distances.


[...] dissimilarity matrix is ultrametric, but may be inconsistent for additive distances; that is, given longer and longer sequences they might converge on the wrong phylogenetic tree.
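The three-point and four-point conditions can be checked directly on a distance matrix. The sketch below is a brute-force illustration with hypothetical function names and a numerical tolerance `tol` that is not part of the definitions. It tests the four-point condition in its equivalent form, that the two largest of the three pairwise sums are equal, and only over quadruples of distinct taxa.

```python
from itertools import combinations, permutations

def is_ultrametric(d, tol=1e-9):
    """Three-point condition (7.6) for every ordered triple of taxa."""
    n = len(d)
    return all(d[i][j] <= max(d[i][k], d[j][k]) + tol
               for i, j, k in permutations(range(n), 3))

def is_additive(d, tol=1e-9):
    """Four-point condition (7.7) for every set of four distinct taxa.

    The condition holds for all labelings of a quadruple exactly when
    the two largest of the three pairwise sums are equal.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if sums[2] - sums[1] > tol:
            return False
    return True

# Clock-like example: ultrametric, and therefore also additive.
clock = [[0, 2, 6, 10],
         [2, 0, 6, 10],
         [6, 6, 0, 10],
         [10, 10, 10, 0]]

# Path lengths on a tree with unequal pendant edges: additive only.
tree = [[0, 4, 7, 10],
        [4, 0, 9, 12],
        [7, 9, 0, 7],
        [10, 12, 7, 0]]

# Arbitrary distances that fit no tree exactly.
noisy = [[0, 2, 4, 9],
         [2, 0, 6, 7],
         [4, 6, 0, 8],
         [9, 7, 8, 0]]
```

The `tree` matrix is the case that matters for consistency: it satisfies the four-point condition but not the three-point condition.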

Neighbor Joining

In contrast to the methods above, NJ (Saitou and Nei, 1987; Studier and Keppler, 1988) is consistent for additive distances. At each step the two clusters with the minimal net divergence are merged. Consider a tree where two sister clusters are separated from the rest of the taxa by an edge, but there are no other internal edges. The net divergence is the length of this edge as determined by a least squares best fit.
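The difference between the NJ selection step and the closest-pair rule of Algorithm 7.1 can be illustrated with a small sketch. It assumes the criterion in the form commonly attributed to Studier and Keppler (1988), $Q(i,j) = (n-2)\,d_{ij} - r_i - r_j$ with $r_i = \sum_k d_{ik}$, and minimizes it over all pairs. The function names and the example tree are hypothetical.

```python
from itertools import combinations

def closest_pair(d):
    """Selection rule of Algorithm 7.1: the pair with minimal distance."""
    n = len(d)
    return min(combinations(range(n), 2), key=lambda p: d[p[0]][p[1]])

def nj_pair(d):
    """NJ selection, assuming the Q criterion (n - 2) d_ij - r_i - r_j."""
    n = len(d)
    r = [sum(row) for row in d]          # row sums r_i

    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - r[i] - r[j]

    return min(combinations(range(n), 2), key=q)

# Additive distances on a hypothetical tree in which A and C are sisters
# (pendant edges 1 and 5), B and D are sisters (pendant edges 1 and 5),
# and the internal edge has length 1. Taxa are ordered A, B, C, D.
d = [[0, 3, 6, 7],
     [3, 0, 7, 6],
     [6, 7, 0, 11],
     [7, 6, 11, 0]]

by_distance = closest_pair(d)   # A and B: close, but not neighbors
by_nj = nj_pair(d)              # A and C: true neighbors on the tree
```

The two short pendant edges make A and B the closest pair, so any method that merges by minimal distance joins them first, whereas the Q criterion recovers a true neighbor pair from the same additive distances.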

BioNJ (Gascuel, 1997a) and Weighbor (Bruno et al., 2000) are extensions of the NJ algorithm using a biological model to give more accurate distance estimates when reducing the distance matrix.

We introduce here a new heuristic clustering method that uses the NJ selection criterion on distances derived using a parsimony framework.

Parsimony Neighbor Joining (PNJ)

The Hamming distance between sequences $i$ and $j$ is the minimal number of substitutions required to change $i$ to $j$. For PNJ we need to generalize the idea of a character state to a set of character states. $C_i[k]$ denotes the set of possible states of sequence $i$ at site $k$. At site $k$, if the intersection $C_i[k] \cap C_j[k]$ is not empty (even if $C_i[k] \neq C_j[k]$), no substitution is required. The PNJ algorithm is outlined in table 7.2. The main difference between NJ and PNJ is in step 3 of algorithm 7.2, where we compute an ancestral sequence which is then used for updating the distance matrix. I developed an implementation of PNJ, which is part of the R package phangorn.
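The generalized distance on state sets can be sketched as follows. The distance function follows the description above: a site costs one substitution only when the two state sets are disjoint. The ancestral reconstruction is an assumption, since the steps of algorithm 7.2 are not reproduced in this section: the sketch uses the Fitch parsimony rule (intersection if non-empty, otherwise union). It is not the phangorn implementation, and all names are hypothetical.

```python
def to_sets(seq):
    """Represent an observed sequence as one singleton state set per site."""
    return [frozenset(c) for c in seq]

def set_distance(ci, cj):
    """Generalized Hamming distance: count sites with disjoint state sets."""
    return sum(1 for a, b in zip(ci, cj) if not (a & b))

def fitch_ancestor(ci, cj):
    """ASSUMED ancestral rule (Fitch): intersection if non-empty, else union."""
    return [(a & b) or (a | b) for a, b in zip(ci, cj)]

# Toy sequences (illustrative data only).
x, y, z = to_sets("ACGT"), to_sets("ACTT"), to_sets("AGTA")

d_xy = set_distance(x, y)        # ordinary Hamming distance on singletons
anc = fitch_ancestor(x, y)       # third site becomes the set {G, T}
d_anc_z = set_distance(anc, z)   # {G, T} and {T} intersect: no cost there
```

For singleton sets `set_distance` reduces to the ordinary Hamming distance, so the generalization only changes the distances involving reconstructed ancestral sequences.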

In document Some applications of statistical phylogenetics : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Biomathematics at Massey University (Page 135-138)