Testing Robustness - Extracting and exploiting signals in genetic sequences : a thesis presente

Earlier robustness analyses, such as Sullivan & Swofford (2001), generated datasets using a relatively general model, and then inferred trees using a method that assumed a more restricted class of models. Importantly, there remained commonalities between the generating model and inference model class: they both assumed a single tree topology. This made it possible to test robustness by measuring the proportion of 4-taxon tree topologies recovered correctly.

In our paper, we generated data under a mixture model using trees with different topologies and fed it into a single-tree inference method. Consequently it may very well be asked: What are the common parameters shared by the generating model and the inference model class? Or more plainly: What are we hoping to get?

5.2.1 Does “the” internal edge of a mixture of two trees really exist?

The main difficulty surrounds our use of the term “the internal edge” in describing the mixture of trees A and B on p. 87: it is not clear what single parameter in the generating model the inferred internal edge length can be said to be estimating.

No difficulties arise if we assume that one of the trees in the mixture, called the “true tree”, has a much larger proportion than the other, which we can call the “noise tree”. If the proportion p_A of tree A is large in relation to the proportion p_B = 1 − p_A of tree B, and if sufficiently many sites are available that sampling error is not a concern, then it is reasonable to expect that single-tree ML will infer A’s topology, so the internal branch length inferred is an estimate of A’s internal branch length (and vice versa when p_B ≫ p_A). But when p_A ≈ p_B, this interpretation is

5.2.2 Shared parameter values

In order to sensibly discuss the behaviour of inferences made using more-restricted model classes on data generated under more-general models, we need to make precise the notion of when parameter values are “shared” by different models.

Ever-present real-valued parameters like transition-transversion ratio are a simple case: if all k components of a mixture model have equal values for such a parameter, then we can sensibly describe this collection of parameters θ_i, 1 ≤ i ≤ k, as a single parameter θ of the overall mixture model, and attempt to infer it using a more specific, single-component model. The resulting estimate, θ̂, can be meaningfully compared with the θ_i, and statements can be made about the estimation procedure regarding the usual parameter-dependent statistical properties like convergence (or lack thereof) and bias.
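As a sketch of this simple case, the site-pattern probability under a k-component mixture can be written out explicitly. The notation here is an assumption, not taken from the original: w_i are the mixture weights, T_i stands for the remaining parameters of component i (its tree included), and x is a site pattern. When every component carries the same value θ, that value becomes a single argument of the whole mixture, which is the quantity a single-component model then attempts to estimate.

```latex
\begin{align*}
P(x) &= \sum_{i=1}^{k} w_i \, P(x \mid T_i, \theta_i),
  \qquad \sum_{i=1}^{k} w_i = 1, \\
\theta_1 = \dots = \theta_k = \theta
  \;&\Longrightarrow\;
  P(x) = \sum_{i=1}^{k} w_i \, P(x \mid T_i, \theta).
\end{align*}
```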

On the other hand, the structure of an edge-weighted phylogenetic tree means that it is not immediately clear how, or even whether, the parameters describing one tree T can be matched up with the parameters describing another tree U. T and U may in general have different topologies, which potentially makes their respective sets of parameters prima facie structurally different and thus incommensurable.

5.2.3 The edges of a mixture model

One way to overcome this is to follow the lead of the Hadamard conjugation technique (Hendy & Penny, 1993), and embed the topology-dependent parameter space of a particular tree in a larger, topology-independent space. Recall that an edge in a tree (considered without its length) is defined by a split of taxa X|Y, and that there are 2^{n−1} distinct splits on n taxa. This enables us to represent the 2n − 3 edge-length parameters describing any edge-weighted, unrooted binary tree T on n taxa using a vector s_T of 2^{n−1} parameters indexed by split: the 2n − 3 elements corresponding to edges present in T are assigned the corresponding length, while the remaining 2^{n−1} − 2n + 3 elements, which correspond to edges absent from T, are assigned the value 0.² Now that we have a set of parameters that is structurally identical across different tree topologies, we can safely say that two trees T and U share a parameter value whenever s_{T,i} = s_{U,i} for some split i.

² This describes the situation for an unrooted binary tree; much the same procedure works for
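The embedding can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the original: each split is keyed by the side that excludes the last taxon, the count of 2^{n−1} is taken to include the empty split (as it does when splits are indexed by subsets of the first n−1 taxa), and the edge lengths are made-up values. The function names `all_splits` and `split_vector` are hypothetical.

```python
from itertools import combinations

def all_splits(taxa):
    """List every split, keyed by the side that excludes the last taxon.

    Subsets of the first n-1 taxa give 2^(n-1) keys, including the
    empty split (an indexing assumption, not a real edge).
    """
    rest = taxa[:-1]
    return [frozenset(c) for r in range(len(rest) + 1)
            for c in combinations(rest, r)]

def split_vector(taxa, edges):
    """Embed a tree, given as {side of split: length}, in split space."""
    ref, full = taxa[-1], frozenset(taxa)
    vec = {s: 0.0 for s in all_splits(taxa)}  # absent edges stay at 0
    for side, length in edges.items():
        side = frozenset(side)
        # Normalise so the key never contains the reference taxon.
        key = full - side if ref in side else side
        vec[key] = length
    return vec

taxa = (1, 2, 3, 4)
external = {(1,): 0.1, (2,): 0.1, (3,): 0.1, (4,): 0.1}  # illustrative lengths
s_A = split_vector(taxa, {**external, (1, 2): 0.05})  # topology 12|34
s_B = split_vector(taxa, {**external, (1, 3): 0.05})  # topology 13|24

# Splits where the two vectors agree (zeros included), and the subset
# of those that are actual edges in both trees.
shared_values = [s for s in s_A if s_A[s] == s_B[s]]
shared_edges = [s for s in shared_values if s_A[s] > 0]
```

With these inputs each vector has 8 entries, 5 of them non-zero, and the two trees agree on an actual edge only for the four external splits. Exact float comparison is used purely for brevity; a tolerance would be more appropriate for estimated lengths.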

According to this formulation, the meaning of “the internal edge” of a mixture of trees is not well defined; but whenever all components in the mixture model contain some split A|B as an edge, and this edge has identical length across all components, the meaning of “the edge splitting taxon set A from taxon set B” is well defined, regardless of how the topologies of the components may otherwise differ. All such shared edges can be regarded as edges of the mixture model, capable of being inferred using a single-tree inference method (at least in principle).
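The definition of an edge of the mixture model can be sketched in Python as follows. This is a rough illustration under stated assumptions: trees are stored sparsely as {split: length} with absent splits treated as 0, splits are keyed by the side that excludes the last taxon, and the two 5-taxon topologies and all edge lengths are hypothetical examples rather than the trees from the paper. The names `split_key` and `mixture_edges` are invented for this sketch.

```python
def split_key(side, taxa):
    """Normalise a split to the side that excludes the last taxon."""
    full, side = frozenset(taxa), frozenset(side)
    return full - side if taxa[-1] in side else side

def mixture_edges(components, tol=1e-12):
    """Return the splits present in every component with equal length."""
    first, *rest = components
    return {split: length for split, length in first.items()
            if all(abs(c.get(split, 0.0) - length) <= tol for c in rest)}

taxa = (1, 2, 3, 4, 5)

def tree(edges):
    # Build a sparse {split: length} tree from {side: length}.
    return {split_key(s, taxa): l for s, l in edges.items()}

external = {(t,): 0.1 for t in taxa}  # illustrative lengths
# Hypothetical topologies: both contain the split 45|123,
# but they disagree on how taxa 1, 2 and 3 are grouped.
tree_A = tree({**external, (1, 2): 0.05, (4, 5): 0.02})
tree_B = tree({**external, (1, 3): 0.05, (4, 5): 0.02})

edges = mixture_edges([tree_A, tree_B])
```

Here `edges` holds the five external edges plus the split separating taxa 4 and 5 from the rest, while the conflicting splits 12|345 and 13|245 are excluded. Mixture proportions play no role in the definition, so they are omitted.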

It follows that the four external edges of the trees A and B analysed in the paper are shared parameter values. Likewise, for the 5-taxon analysis, all five external edges, plus the edge separating taxa 4 and 5 from the rest in the mixture of trees A and B, are genuine shared parameter values. Figure 1B shows that ML estimation of the four external edges in the 4-taxon analysis is indeed biased upward as the mixture approaches an even balance between the two topologies. That such a simple (and, we propose, frequently occurring) effect as a mixture of two trees is enough to distort the results of single-tree ML analysis is a persuasive argument for the use of network methods like Spectronet (Huber et al., 2002), which are inherently immune to such problems.

In document Extracting and exploiting signals in genetic sequences : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Mathematics at Massey University (Page 90-92)