Conclusion - Efficient algorithms in analyzing genomic data

In this chapter, we presented a tree-based quantitative GWA mapping algorithm, TreeQA. TreeQA utilizes local perfect phylogenies in detecting associations. Perfect phylogenies provide sensible groupings of samples at multiple resolutions. TreeQA explores the space of all possible groupings implied by the perfect phylogenies in a carefully designed order so that intermediate computations can be maximally reused. Our experimental results on both simulated and real data show that TreeQA can efficiently conduct quantitative GWA analysis and is more effective than previous methods.

Chapter 3 TreeQA+: Improving the Power of Phylogeny-based Genome-wide Association Mapping

3.1 Introduction

In Chapter 2, I discussed the TreeQA algorithm, a phylogeny-based genome-wide association mapping method. The experimental results on both synthetic and real data demonstrate that TreeQA outperforms single marker-based and haplotype-based methods. TreeQA also outperforms previous phylogeny-based methods such as TreeLD, Blossoc and TreeDT (Zollner and Pritchard (2005); Mailund et al. (2006); Sevon et al. (2006); Larribea et al. (2002); Morris et al. (2002); Minichiello and Durbin (2006)) in terms of runtime and the ability to handle quantitative traits.

However, neither TreeQA nor the other phylogeny-based methods consider the sample correlations implied by the tree topologies. Ignoring sample correlations can bias the significance of the associations and lead to spurious signals.

For example, three phylogeny trees are plotted in Fig. 3.1. At the leaf nodes, we use "s₁, s₂, ..." to represent the samples, with the phenotype values shown in parentheses. Consider the partition created by removing the edge in the middle of each tree. We get the partitions {{s₁, s₂},{s₃, s₆}}, {{s₁, s₂},{s₃, s₄, s₅, s₆}} and {{s₁, s₂},{s₃, s₄, s₅, s₆}} from the three trees. The mean phenotype values of the left group and right group in these partitions are {35,30}, {35,20} and {35,20}. If all samples are assumed to be independent, as in the previous methods, the associations between the partitions and the phenotype would be equally strong in trees (b) and (c), and weak in tree (a). However, since s₃, s₄ and s₅ are far more closely related to each other than to the remaining samples in tree (b) (indicated by the short branches between them), it is erroneous to treat them as independent samples. In fact, the associations between the partitions and the phenotype should be similarly weak in trees (a) and (b), and relatively strong in tree (c).

Figure 3.1: The sample correlations affect the significance of the association. Panels (a), (b) and (c) show three phylogeny trees; leaves are labeled with samples and phenotype values: s₁(20), s₂(50), s₃(10), s₄(10), s₅(10), s₆(50).
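The group means quoted in this example can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `group_means` and the dictionary layout are assumptions introduced here, not part of TreeQA or TreeQA+.

```python
# Phenotype values of the six samples in the example.
phenotype = {"s1": 20, "s2": 50, "s3": 10, "s4": 10, "s5": 10, "s6": 50}


def group_means(partition, values):
    """Return the mean phenotype of each group in a partition (hypothetical helper)."""
    return [sum(values[s] for s in group) / len(group) for group in partition]


# Partition from the four-sample tree and from the two six-sample trees.
four_sample_partition = [["s1", "s2"], ["s3", "s6"]]
six_sample_partition = [["s1", "s2"], ["s3", "s4", "s5", "s6"]]

means_four = group_means(four_sample_partition, phenotype)  # [35.0, 30.0]
means_six = group_means(six_sample_partition, phenotype)    # [35.0, 20.0]
```

The two six-sample trees yield identical means, which is why a test that treats samples as independent cannot tell them apart.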

Therefore, it is critical to properly take into account the sample correlations implied by the topology in an association study. However, this is not a trivial task, especially when we assess the association of partitions such as {{s₁, s₂},{s₃, s₄, s₅},{s₆}} (created by removing multiple edges).

In this chapter, I present TreeQA+, which incorporates the sample correlations modelled by local perfect phylogeny trees. As a phylogeny-based method, TreeQA+ inherits all advantages of TreeQA by examining all groupings induced by a perfect phylogeny (constructed in genomic regions exhibiting no evidence of historical recombination by the 4-gamete test (Hudson and Kaplan (1985))). In addition, by incorporating sample correlations, TreeQA+ is more effective and robust than TreeQA. TreeQA+ adopts the model of Brownian motion (Nelson (1967)), which was previously used to study phylogeny (Edwards and Cavalli-Sforza (1964); Felsenstein (1981)): for any two nodes (samples or hypothetical ancestors) in the phylogeny, if no causative mutation occurred during the evolution from one node to the other, the difference between the phenotype values of the two nodes should follow a normal distribution with mean 0 and variance proportional to the sum of the edge lengths between them. Thus, any significant deviation from this expectation suggests the existence of causative mutations during the evolution.
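As a rough illustration of the Brownian motion assumption, the difference between two node phenotypes can be standardized by the path length separating them. The sketch below is not the TreeQA+ test statistic; the function names and the rate parameter `sigma2` (the proportionality constant of the variance) are assumptions made for this example.

```python
import math


def brownian_z_score(x_a, x_b, path_length, sigma2=1.0):
    """Standardized phenotype difference between two nodes.

    Under the null model (no causative mutation on the path), the
    difference x_a - x_b is normal with mean 0 and variance
    sigma2 * path_length, where path_length is the sum of edge lengths.
    """
    if path_length <= 0:
        raise ValueError("path_length must be positive")
    return (x_a - x_b) / math.sqrt(sigma2 * path_length)


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal variable Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical numbers: a difference of 20 over a path of length 4 with sigma2 = 25.
z = brownian_z_score(30.0, 10.0, path_length=4.0, sigma2=25.0)  # 20 / sqrt(100) = 2.0
p = two_sided_p_value(z)
```

The same phenotype difference over a longer path gives a smaller z-score, which is how distance in the tree tempers the apparent strength of a signal.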

In TreeQA+, a grouping also consists of several non-overlapping subtrees created by removing edges from a perfect phylogeny tree. TreeQA+ utilizes Felsenstein’s tree pruning method (Felsenstein (1981)) to estimate the phenotype values of hypothetical ancestors (intermediate nodes) in each subtree. Then the estimated phenotype values at the two adjacent nodes of each removed edge are examined under the assumption of Brownian motion. A significant deviation between the two nodes implies a strong association between the grouping and the phenotype. For each phylogeny, TreeQA+ finds the strongest association between its induced groupings and the phenotype.
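The ancestor estimation step can be sketched with the standard pruning recursion for a continuous trait under Brownian motion: each internal node is estimated as the precision-weighted average of its children, and the uncertainty of that estimate is carried upward as extra branch length. The tree encoding and the function name `prune` below are assumptions for illustration; this is a generic sketch of the technique, not the TreeQA+ implementation, and it assumes strictly positive branch lengths.

```python
def prune(node):
    """Return (estimate, extra_length) for the subtree rooted at `node`.

    A leaf is a number (its observed phenotype). An internal node is a
    list of (child, branch_length) pairs. `extra_length` is the additional
    branch length that accounts for the uncertainty of the estimate.
    """
    if isinstance(node, (int, float)):
        return float(node), 0.0
    weighted_sum = 0.0
    total_weight = 0.0
    for child, length in node:
        child_estimate, child_extra = prune(child)
        # Weight each child by the inverse of its effective branch length.
        weight = 1.0 / (length + child_extra)
        weighted_sum += weight * child_estimate
        total_weight += weight
    return weighted_sum / total_weight, 1.0 / total_weight


# Two leaves with phenotypes 20 and 50 on branches of length 1 and 3.
estimate, extra = prune([(20, 1.0), (50, 3.0)])  # 27.5 and 0.75

# A nested subtree: two closely related leaves joined to a third leaf.
nested_estimate, nested_extra = prune([([(10, 1.0), (10, 1.0)], 0.5), (50, 1.0)])
```

For two children with branch lengths v1 and v2, the extra length reduces to v1·v2/(v1+v2), so closely related leaves contribute little more information than a single leaf would.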

A brute-force implementation of TreeQA+ is computationally expensive. TreeQA+ faces the same computational challenges as TreeQA:

• Both the number of trees and the number of groupings per tree can be very large¹ in GWA mapping.

• Permutation tests are necessary to ensure the statistical significance of the discovered associations, which further increases the computational burden.

¹ For example, the number of trees can exceed tens of thousands in a chromosome-wide association study, and up to 2^(2n−2) groupings can be generated from a tree of n samples.
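To give a feel for the scale of these two costs, the sketch below evaluates the upper bound on the number of groupings and runs a simplified permutation test for a single two-group partition. The statistic used here (absolute difference of group means) is a stand-in chosen for brevity, not the TreeQA+ statistic, and all function names and parameters are hypothetical.

```python
import random


def max_groupings(n_samples):
    """Upper bound 2^(2n-2) on the groupings of a tree with n samples."""
    return 2 ** (2 * n_samples - 2)


def mean_gap(groups, values):
    """Absolute difference between the mean phenotypes of two groups."""
    first, second = groups
    mean_first = sum(values[s] for s in first) / len(first)
    mean_second = sum(values[s] for s in second) / len(second)
    return abs(mean_first - mean_second)


def permutation_p_value(groups, values, n_perm=1000, seed=0):
    """Estimate a p-value by shuffling phenotypes across samples."""
    rng = random.Random(seed)
    observed = mean_gap(groups, values)
    samples = list(values)
    shuffled_values = [values[s] for s in samples]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled_values)
        permuted = dict(zip(samples, shuffled_values))
        if mean_gap(groups, permuted) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


phenotype = {"s1": 20, "s2": 50, "s3": 10, "s4": 10, "s5": 10, "s6": 50}
bound = max_groupings(len(phenotype))  # 2^10 = 1024
p_value = permutation_p_value([["s1", "s2"], ["s3", "s4", "s5", "s6"]], phenotype)
```

Even in this toy setting, every permutation re-evaluates the statistic; repeating that over all groupings of every tree is what makes reuse of intermediate computations essential.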

Since different statistical methods are used, the optimizations developed in TreeQA cannot be used in TreeQA+. However, the same strategy applies, i.e., maximizing the reuse of intermediate computations. Several new optimizations are developed that make TreeQA+ very efficient and effective in GWA mapping, as demonstrated by extensive experiments on both simulated datasets and inbred mouse strains.
