• No results found

Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis

N/A
N/A
Protected

Academic year: 2021

Share "Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis"

Copied!
15
0
0

Loading.... (view fulltext now)

Full text

(1)

Finding Clusters in Phylogenetic Trees:

A Special Type of Cluster Analysis

NUMBER: Why are

NUMBER: Why are

there so many

there so many

distinct clusters?

distinct clusters?

SYNCHRONY: Was the

SYNCHRONY: Was the

onset of diversification

onset of diversification

synchronized?

synchronized?

Why try to identify clusters in phylogenetic trees?

Example: “origin of HIV.”

(2)

Example

• Observe: 2 main features of HIV-1, type M - Approx. 10 distinct subtypes

- Subtypes are approx. equidistant (“sunburst”) Question: Could these features have arisen “naturally”? • Approach:

- quantitative comparison to simulated African epidemic.

Simulation details are in the models/tools:

- coalescent theory, phylogenetic tree estimation,

- estimate the number of subtypes, and

- classical statistics: are the 2 main features “outliers” with respect to our forward model?

(3)

This talk to focus on:

• To choose groups, consider:

Model-based clustering (Raftery et al: mclust in S+)

Max likelihood + bootstrap (State of art: PHYLIP, other)

Markov Chain Monte Carlo (BAMBE)

(4)

Complicated Genetic Data Structure

A94CY.034.11 - - - - GAGTGA

T

GG

G

GAC...

E90CM.243

ATGA

GAGTGA

A

GG

A

GAC...

Example sequence identifier: A94CY.034.11

A: subtype 94: isolation year

CY: country of origin

034: isolate 11: clone number

Ensure

: global coverage, include all known subtypes

widest possible span of isolation times

more than one region of genome

Avoid

: more than 1 clone from same isolate

(5)

Distance measures/micro evolutionary models

Pij(t) = 4-by-4 transition prob. matrix

P(A -> C in time t) = PAC(t), etc.

For some P matrices, can define an evolutionary distance between taxa

x and y each with N nucleotides (must correct for multiple substitutions)

NFxy =

- aπC G TA -GTA C - fπTA C G

-GTR: πiPij = πjPji has 8 free parameters.. Common models are special

cases with fewer parameters. Use NFxy to estimate parameters.

nAA nAC nAG nAT

nCA nCC nCG nCT nGA nGC nGG nGT nTA nTC nTG nTT

Qij/µ = P = eQt

JC1: Pij(t) = 1/4 + 3/4e-µt for i = j, and 1/4 - 1/4e-µt for i != j

K2: Pij(t) = 1/4 + 1/4e-µt + 1/2 e -µt (κ+1)/2, for i = j, etc. where κ is transition/transversion ratio

(6)

Number of subtypes: Model-based clustering

AWE No. subtypes No. subtypes x2 Xx1 x1 x2 Env Gag

(7)

Simulated data: 4 macro growth rates

(c) N = N0, then N= N0 ert

(a) N = N0 ert (b) N = N0

(d) N is quadratic from1970 to 1990

(8)

Example Real Trees

D B C F G H A E J G A D F C H B K J

The ML + bootstrap approach suggests 7 clusters (subtypes) in the 95

env sequences and 6 clusters (subtypes) in the 88 gag sequences. The

data is available at hiv-web.lanl.gov and accession numbers are available upon request.

NOTE: B, D are “similar” and H, J are rare (omitted in this analysis)

(9)

Model-based clustering as in mclust

-

Approximate Bayes method to choose the no. of groups G.

First assume: G is known and data is n cases of p-dim

observations x = (x1, x2, …, xn) with probability density

fk(x;

θ

) for observations from group k.

Let γ = (γ1,..,γn) be the group labels.

Choose (θ,γ) to maximize L(θ;γ) = Πifγi(xi;θ)

If f is MVN(µkk), get a sum of squares criterion, with

variations depending on assumptions on Σk.

BR (1993) use hierarchical agglomeration and iterative reallocation to maximize the classification likelihood:

1

L(x| , )

( |

,

),

i i n i i

x

ν ν

θ ν

φ

µ

=

=

Σ

Banfield and Raftery, Biometrics 1993

(10)

Model-based clustering as in mclust

BR approach: use the spectral decomposition

T

k

λ

k

D A D

k k k

Σ =

Next, to estimate p(G = r | x), approximate the distribution of the

Bayes factor p(x | G = r)/p(x | G = s).

Allow: a “noise” component for “new cluster cases” and use

heuristic method to address failure of a regularity condition in the clustering context.

where control the volume, shape, and orientation of group k

, ,

k Ak D k

(11)

Simulated Example

Evaluation of emclust for a simulated data set of 30 observations from each of 3 clusters (labeled 1, 2, 3 in top plot) with true model VEV denoting that the volume varies (V) among clusters, the shape does not vary (E for equal) among clusters, and the orientation varies (V) among clusters (model 6). The BIC correctly chooses 3 clusters but chooses VVV rather than the correct VEV.

x1 x2 -2 -1 0 1 -1 0 1 2 1 1 1 1 1 1 1 1 1 1 1 11 1 1 1 1 1 1 1 1 1 11 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 22 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 33 3 3 3 3 number of clusters BIC 2 4 6 8 10 12 -450 -400 -350 -300 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 1 EI 2 VI 3 EEE 4 VVV 5 EEV 6 VEV

(12)

mclust suggests 6 subtypes (tends to merge B and D)

A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B C C C C C C C C C C C C C C C D D D D D D D D D D D D D D D E E E E E E E E E E E E E E E E F F F F F F F F F G G G G G G G G G G 0. 0 0 .2 0. 4 x1 x2 -0.2 -0.1 0.0 0.1 0.2 -0. 2 0. 0 AA A A A A AA AAAA A A A B B B B B B B B B B B B BB B C C CC CC CC C C CCC C C D D D D D D DDDDD D DD D E E E E EE E E E E E E E E E E F F F F F F F F F G G G G G G G GG G number of clusters BIC 5 10 15 20 600 1000 1400 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 1 EI 2 VI 3 EEE 4 VVV 5 EEV 6 VEV

Env Data. (Top) Hierarchical Clustering; (Middle) Principle

(13)

MCMC via BAMBE

On different data with fewer taxa: Compare MCMC to ML + bootstrap in case where groups chosen in advance

(c) Influenza, HA gene, 91,93,94, or 95 vs 96 groups Probability via ML+Bootstrap

Probability via MCMC 0.0 0.2 0.4 0.6 0.8 1.0 0 .00 .2 0 .40 .6 0 .81 .0

(d) Influenza, NP gene, H, S, A groups Probability via ML+Bootstrap

Probability via MCMC 0.0 0.2 0.4 0.6 0.8 1.0 0 .00 .2 0 .40 .6 0 .81 .0

(14)

Summary

• Present method to choose the number of groups via ML + bootstrap or MCMC: “trial and error.” Usually: human eye studies tree, selects groups, then ML + bootstrap on specified groups. Similar with MCMC

• Model-based clustering offers automatic way to choose

groups, but relies on pair-wise distances (less efficient than likelihood). FUTURE: consider how to automate (without human eye) cluster choices in ML + bootstrap or MCMC (or any other method such as weighted parsimony + bootstrap) • Increasing the number of taxa: MCMC and ML are very

slow, so currently limited to few hundred taxa

• Consider: identify groups, then assign new taxa to existing groups.

(15)

References

Banfield, J., & Raftery, A. (1993). Model-based gaussian and non-gaussian

clustering. Biometrics. 49, 803-821.

Burr T., Myers G., & Hyman J. (2001). The Origin of AIDS – Darwinian or

Lamarkian?, Phil. Trans. R. Soc. Lond. B. 356, 877-887.

Burr, T., Skourikhine, A.N., Macken, C., & Bruno, W. (1999). Confidence measures for evolutionary trees: applications to molecular epidemiology.

Proc. of the 1999 IEEE Inter. Conference on Information, Intelligence and

Systems, 107-114.

Burr T., Doak J., Gattiker, J., & Stanbro, W. (2002a). Assessing confidence in phylogenetic trees: bootstrap versus Markov Chain Monte Carlo,

Mathematical and Engineering Methods in Medicine and Biological Sciences.

1, 181-187.

Burr, T., Gattiker, J., & LaBerge, G. (2002b). Genetic subtyping using cluster analysis, Special Interest Group on Knowledge Discovery and Data Mining Explorations. 3, 33-42.

References

Related documents

Using generalized linear models, we related firstly species richness of mayfly, stonefly and caddisfly (Ephemeroptera, Plecoptera, Trichoptera; EPT) and secondly richness of

Measurement tools and targeted data-mining Vision Mission Strategic objectives Operational objectives.. Vision Mission Strategic objectives Operational objectives Fedict Vademecum

The consequence is that China is further moving into the core (North), while at the same time other semi-peripheral countries are being pushed out of the

Lloyd's is now estimated to insure about 25% of world shipping and marine forms about 30% of Lloyd's total business. Few major marine risks in the world are placed without at

Glucolipotoxicity increases both toxic glucose and lipid myocardial intermediates which collectively alters the cardiac structure evident by cardiomyocyte hypertrophy and

We identi fi ed, in both GBM and matched GSCs, recurrent copy number alterations, as chromosome 7 polysomy, chromosome 10 monosomy, and chromosome 9p21deletions, which are

After Mendelising the pheno- typic data, GpaXI tar l could be more precisely mapped near markers GP163 and FEN427, thus anchoring GpaXI tar l to a region with a known R -gene

16 semi structured in-depth interviews with eight boys and eight girls, all aged 15, in the 10th grade (third and final year of lower secondary school) were conducted