Sampling error - Experimental methods. Computers are useless. They can only give you answers.


4.2.2 Sampling error

This is essentially the random error which cannot be avoided with real data, due to the finite data set size. Typically, nucleotide sequences of up to about 1000 characters are obtained. In the simulations, I have often taken a range of sequence lengths, which shows the effect of sampling error on the performance of the phylogenetic methods. The set of sequence lengths used is given in Chapter 5: Experimental Methods.

The effect of having zero sampling error can, for small n, be assessed by calculating exactly the expected frequencies of character state patterns across the taxa and using these as input data for the phylogenetic methods (see later) [41].

4.2.3 "White noise" and "Pink noise"

The term white noise is taken for its meaning in modern music: it is, in this case, essentially random error in the data, with no bias in any particular direction. For example, white noise is introduced if, when carrying out DNA sequencing, errors are made which randomly misread one character, say x, as another, say y, with probability independent of x and y. It is this form of white noise which I have investigated here.

On the other hand, pink noise is random error in the data which introduces a bias in the parameter(s) estimated from that data, in a specific direction. If, in the above case of sequencing errors, the probability of misreading a character was not independent of the character, the random error introduced would be regarded as pink noise.
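The distinction between the two kinds of sequencing error can be illustrated with a small sketch. The Python below is only illustrative and rests on assumptions that are not in the text: a four-letter alphabet, a hypothetical per-site error rate `p`, and, for the pink case, an arbitrary choice in which only some characters are misread and always as one fixed `target` character. The function names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def add_white_noise(seq, p, rng):
    """Misread each site with probability p. The wrong character y is drawn
    uniformly, so the error does not depend on the true character x."""
    out = []
    for x in seq:
        if rng.random() < p:
            out.append(rng.choice([y for y in ALPHABET if y != x]))
        else:
            out.append(x)
    return "".join(out)

def add_pink_noise(seq, p_by_char, target, rng):
    """Misread character x with probability p_by_char[x], always as `target`.
    The error now depends on x and pushes the data in one direction."""
    out = []
    for x in seq:
        if x != target and rng.random() < p_by_char.get(x, 0.0):
            out.append(target)
        else:
            out.append(x)
    return "".join(out)

rng = random.Random(0)
seq = "".join(rng.choice(ALPHABET) for _ in range(1000))
white = add_white_noise(seq, 0.05, rng)           # unbiased errors
pink = add_pink_noise(seq, {"A": 0.2}, "G", rng)  # only A is misread, as G
```

In the white case every character is equally likely to be corrupted, so base frequencies are disturbed only by chance. In the pink case the frequency of `G` can only rise and that of `A` can only fall.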

species, and obtain also a set of sequences which are from lice of these birds. The underlying phylogenies of the birds and their lice may well be different [62].

Attempting to infer the phylogeny of the birds by using the data from both sets of taxa would then be hampered by the pink noise introduced by the parasite data. The combined data set would then be regarded as "contaminated" by the parasite data, as it could no longer be guaranteed to have been generated entirely from the tree describing the phylogeny of the birds.

The above is the type of pink noise which I have studied here.

4.3 General approach

There have been many studies performed to assess the performance of phylogenetic methods, but all are necessarily limited by computing time [13], [50], [54], [60], [80]. For example, in a 1981 study by Li [54] the sequence length c was 300, and only 20 trials could be carried out, though more recently the number of trials is around 300 [80]. The tree topologies have been limited to only a few, generally rooted, trees [50], [54].

I have therefore tried to exploit the computing power of the Sun SPARCStations available and fill in some of the gaps, testing over a large range of up to 41 different sequence lengths, all 27 topologies for $4 \le n \le 10$, and performing up to 1000 trials for any given set of parameters. (For the test of the effect of tree topology, using the two topologies with n = 6 pendant vertices, $10^5$ trials were carried out.)

The sequence lengths used ranged over

{10, 13, 16, 20, 25, 32, 40, 50, 64, 80, 100, 125, 160, ..., $10^5$}.

The large number of trials for each case allowed the general trends in the performance of these phylogenetic methods to be seen easily. Such an exhaustive set of experiments as this, with such a wide range of parameters, has not been carried out before, but is now possible with the advances of modern computers.

In each trial I have generated data according to some known generating tree $T_0$ and set of edge lengths q. Thus the same data are used for each method, to enable comparison between methods. Note that this may give a false impression of coincidentally fluctuating performance of the methods, with varying data sets, but this is in general not the case: such fluctuation is purely an artifact of sampling and other random error in the data.


With each trial, the labels of the pendant vertices were permuted at random, to reduce the effect of the order in which data are presented to the clustering algorithms (see Section D.1).

4.4 Small n

The general simulation procedure is as follows (more detail is provided after this list):

1. Specify the topology of the tree to be used (the list of unrooted binary tree topologies is given in Appendix A), which methods to include, and overall parameters describing the properties of the generating tree $T_0$.

2. Randomly choose edge lengths $q_i$, equal to expected numbers of character state changes between the end-points of each edge $e_i$ of $T_0$.

3. Calculate the expected bipartition frequencies $s_i$ of all $2^{n-1}$ bipartitions from the vector q of edge lengths.

4. Sample from this expected frequency spectrum to obtain an observed bipartition spectrum, s' ("obs" in the program listing).

5. If distance-based methods are being used, obtain the distance matrix D from s'.

6. Use the distance and/or bipartition data as inputs to the various methods under study.
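Steps 4 and 5 of the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used in the thesis: characters are taken to have two states, each bipartition is encoded as a bitmask over the taxa (a hypothetical encoding), the expected spectrum is passed in directly with made-up values rather than computed from q, and the distance is the uncorrected proportion of sites at which two taxa differ. The names `sample_spectrum` and `distance_matrix` are hypothetical.

```python
import random

def sample_spectrum(expected, c, rng):
    """Draw c sites from the expected spectrum (multinomial sampling) and
    return the observed relative frequency of each bipartition."""
    patterns = list(expected)
    weights = [expected[p] for p in patterns]
    counts = dict.fromkeys(patterns, 0)
    for p in rng.choices(patterns, weights=weights, k=c):
        counts[p] += 1
    return {p: counts[p] / c for p in patterns}

def distance_matrix(observed, n):
    """d[i][j] = total observed frequency of bipartitions separating i and j."""
    d = [[0.0] * n for _ in range(n)]
    for pattern, freq in observed.items():
        for i in range(n):
            for j in range(i + 1, n):
                if ((pattern >> i) & 1) != ((pattern >> j) & 1):
                    d[i][j] += freq
                    d[j][i] += freq
    return d

# Toy expected spectrum for n = 4 taxa (illustrative numbers only).
# Bit i of the key is 1 when taxon i lies on the opposite side of the
# bipartition from the last taxon, so keys range over 2^(n-1) = 8 values;
# key 0 is the constant pattern.
expected = {0b000: 0.70, 0b001: 0.05, 0b010: 0.05,
            0b100: 0.05, 0b111: 0.05, 0b011: 0.10}

rng = random.Random(1)
obs = sample_spectrum(expected, 1000, rng)  # step 4, sequence length c = 1000
D = distance_matrix(obs, 4)                 # step 5
```

Passing `expected` itself to `distance_matrix` gives the distances with zero sampling error, which is the exact-frequency case mentioned in Section 4.2.2.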

4.4.1 Choosing the tree topology and other parameters

For small values of n, I have used unrooted trees. This is because the compatibility, closest tree and maximum parsimony methods do not distinguish the root. Also, it should be noted that there is a lack of irrefutable evidence that, for small n, the most likely or biologically interesting cases are those in which the rooted tree is most appropriate. In fact there is positive evidence supporting differing rates of evolution on different lineages of trees in some cases [6], [31], [55], [56], and also [100]. The data generated from these unrooted trees did not satisfy the molecular clock hypothesis, which is that the expected distance between each extant taxon and the nearest common ancestor of all the taxa is the same.

I have also often operated under the assumption that all trees are equally likely (abbreviated 'a.t.e.l.') for simplicity and so as not to introduce any bias toward any particular type of tree [85]. This is more important in the case of large rooted trees, where investigating each topology and then providing a weighted average of the results is not possible due to the exponential growth with n in the number of tree topologies [8], [36].

Under the 'a.t.e.l.' assumption, the number of trees with a given topology X is calculated using Burnside's Theorem [29]. For any tree on n pendant vertices, divide n! by the order of each of the symmetries of X (that is, by the product of these orders).

For example, the tree UB5, 23 on n = 9 pendant vertices (see Appendix A) has three symmetries of order 2 (the three pendant neighbouring pairs) and one symmetry of order 3! = 6 (the central point). Hence the number of trees with this topology is

$$N(\text{UB5, 23}) = \frac{9!}{2!\,2!\,2!\,3!} = 7560.$$
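The arithmetic of this counting rule is easy to check numerically. The sketch below is a hedged illustration only: the function names are hypothetical, the symmetry orders must be supplied by hand from the tree diagrams, and the total F(n) = (2n - 5)!! is the formula quoted with Table 4.1.

```python
from math import factorial, prod

def count_labelled_trees(n, symmetry_orders):
    """n! divided by the order of each symmetry of the topology."""
    denom = prod(symmetry_orders)
    total = factorial(n)
    assert total % denom == 0, "symmetry orders must divide n!"
    return total // denom

def total_unrooted_binary_trees(n):
    """F(n) = (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5), for n >= 3."""
    return prod(range(1, 2 * n - 4, 2))

# UB5, 23 on n = 9: three symmetries of order 2 and one of order 3! = 6.
print(count_labelled_trees(9, [2, 2, 2, 6]))                  # 7560
print([total_unrooted_binary_trees(n) for n in range(4, 8)])  # [3, 15, 105, 945]
```

For a given n, the counts N(X) over all topologies X should sum to F(n), which gives a simple consistency check on a table of such counts.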

When the behaviour of phylogenetic methods was considered with respect to n, ignoring tree topology, the data from simulations with each topology X were amalgamated in proportion to the numbers N(X). N(X) is shown in Table 4.1 for all the tree topologies with from 4 to 10 pendant vertices. Diagrams showing these tree topologies are in Appendix A.
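One reading of this amalgamation is a weighted average in which each topology contributes in proportion to N(X). The sketch below assumes that the result for each topology is a single number, such as a proportion of trials in which the correct tree was recovered. The result values are made up for illustration, the function name is hypothetical, and only the counts for n = 6 are taken from Table 4.1.

```python
def amalgamate(results, counts):
    """Weighted average of per-topology results, weights proportional to N(X)."""
    total = sum(counts[x] for x in results)
    return sum(results[x] * counts[x] for x in results) / total

counts_n6 = {"UB4": 90, "UB3, 12": 15}        # N(X) for n = 6, from Table 4.1
results_n6 = {"UB4": 0.80, "UB3, 12": 0.90}   # made-up illustrative values
combined = amalgamate(results_n6, counts_n6)  # (0.80*90 + 0.90*15) / 105
```

Because UB4 accounts for 90 of the 105 trees on six pendant vertices, the combined figure lies much closer to the UB4 result than to that of UB3, 12.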

In this part of the investigation I have had to assign some parameters which have a degree of arbitrariness:

• The upper bound of the maximum path length u between taxa (in terms of maximum expected number of observed differences in character state, per site);

• The minimum edge length t;

• The type of distribution of the edge lengths (uniform, normal, and log normal);

• The ratio r of maximum internal edge length to maximum pendant edge


Table 4.1: The number of trees with each topology for $4 \le n \le 10$. In the first column are the values of n used. Columns two to five show the tree topology names, using the Tree Topology Description Notation (TTDN) described in Appendix A, and the last column shows the total number of trees for the given n, equal to F(n) = (2n - 5)!!.

n   Tree topologies X and number of trees N(X)   F(n) = (2n - 5)!!

4   UB2                                          3
    3
5   UB3                                          15
    15
6   UB4       UB3, 12                            105
    90        15
7   UB5       UB4, 12                            945
    630       315

In document Factors affecting the performance of phylogenetic methods : a thesis presented in partial fulfilment of the requirements for the degree of Ph D in Mathematics at Massey University (Page 67-71)