
(1)

SUMMER SCHOOL 2008

PIACENZA, ITALY - 10 September 2008

Methods for the analysis of mitochondrial DNA data – part 1

Licia Colli, U.C.S.C. di Piacenza

licia.colli@unicatt.it

(2)

•The mitochondrial genome

•Sequence format and alignment

•Input file formats most frequently used in mtDNA analyses

•Molecular diversity indices

•Analysis of Molecular VAriance

•Mismatch distribution and estimates of population expansion


•Admixture analysis

•Trees:

-generalities;


-models of DNA sequence evolution and choice of the best-fitting model

-Tree reconstruction strategies

-Distance-based methods (NJ)


-Character-based methods (MP, ML, Bayesian)

-Molecular clock and calculations of divergence times

-Bootstrap and Jackknife

•Software list

•References

(3)

The mitochondrial genome (mtDNA)

• Its length varies among species (15-17 kb)

• multiple copies in each cell (a mammalian egg cell contains about 100,000 copies)

• lack of recombination

• HAPLOID - maternally inherited

• high mutation rate

• 13 protein-coding genes, 2 rRNA sequences (12s and 16s), 22 tRNA sequences and 1 non-coding region (control region or displacement loop)

• the mitochondrial genetic code differs slightly from the nuclear code:

    codon    nuclear       mitochondrial
    TGA      stop codon    Trp (W)
    ATA      Ile (I)       Met (M)
    AGA      Arg (R)       stop codon

(4)

The mitochondrial genome (mtDNA)

A useful molecule, indeed…

• genealogy


• phylogeny (cytochrome b, 12s, 16s, control region, whole mtDNA)

• phylogeography (cytb, control region, whole mtDNA)

• species identification (cytb, control region)


• population studies (+ other markers)

• detection of “cryptic species” and “barcoding” projects (COXI)

• studies on the domestication process

• studies on male fertility/infertility

• studies on ancient DNA (aDNA)…

(5)

Sequence format and alignment

EditPlus: a text editor useful to handle sequences and prepare input files.

Freely downloadable 30-day evaluation version:

http://www.editplus.com/download.html

FASTA (filename.txt)

    >Seq_1
    cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
    >Seq_2
    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    >Seq_3
    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    >Seq_4
    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    >Seq_5
    CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    >Seq_6
    CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    >Seq_7
    CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
    >Seq_8
    CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    >Seq_9
    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    >Seq_10
    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

ClustalX

(filename.aln)

    CLUSTAL X (1.83) multiple sequence alignment

    Seq_1      CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_2      CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_3      CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_4      CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_5      CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_6      CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_7      CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
    Seq_8      CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_9      CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_10     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
               ********  ** ******  ************** ************************
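The relationship between the FASTA input and the starred conservation line of the CLUSTAL output can be sketched in a few lines of Python. This is only an illustration: `parse_fasta` and `conservation_line` are hypothetical helper names (not part of ClustalX), the sequences are assumed to be already aligned and of equal length, and only three of the ten example sequences are used.

```python
def parse_fasta(text):
    """Return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # upper-case so that 'a' and 'A' count as the same base
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def conservation_line(seqs):
    """'*' where every aligned sequence has the same base, ' ' elsewhere."""
    return "".join("*" if len(set(col)) == 1 else " " for col in zip(*seqs))


fasta = """>Seq_1
cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
>Seq_5
CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_7
CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
"""

records = parse_fasta(fasta)
marks = conservation_line(list(records.values()))
print(marks)
print(marks.count("*"))  # 55 conserved columns out of 60
```

Real alignments also contain gap characters ("-"); this sketch treats a gap like any other symbol.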

(6)

Input file formats

Phylip (filename.txt; filename.phy)

    10 60
    Seq_1     cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
    Seq_2     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_3     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_4     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
    Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_9     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_10    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

or otherwise (interleaved layout)

    10 60
    Seq_1     cccctaatatgtacaataatgaatgttgta
    Seq_2     CCCCTAATATGTACAATAATGAATGTTGTA
    Seq_3     CCCCTAATATGTACAATAATGAATGTTGTA
    Seq_4     CCCCTAATATGTACAATAATGAATGTTGTA
    Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTA
    Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTA
    Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTA
    Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTA
    Seq_9     CCCCTAATATGTACAATAATGAATGTTGTA
    Seq_10    CCCCTAATATGTACAATAATGAATGTTGTA

              aattagtgttataacacatctatgtataat
              AATTAGTGTTATAACACATCTATGTATAAT
              AATTAGTGTTATAACACATCTATGTATAAT
              AATTAGTGTTATAACACATCTATGTATAAT
              AATTAGTGTTATAACACATCTATGTATAAT
              AATTAGTGTTATAACACATCTATGTATAAT
              AATTAATGTTATAACACATCTATGTATAAT
              AATTAGTGTTATAACACATCTATGTATAAT
              AATTAGTGTTATAACACATCTATGTATAAT
              AATTAGTGTTATAACACATCTATGTATAAT

MEGA (filename.meg)

    #Mega
    title: title_of_your_project

    #Seq_1  cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
    #Seq_2  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    #Seq_3  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    #Seq_4  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    #Seq_5  CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    #Seq_6  CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    #Seq_7  CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
    #Seq_8  CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    #Seq_9  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    #Seq_10 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
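Preparing such files by hand in a text editor is error-prone, so a short script can help. The sketch below writes the sequential Phylip layout (header line with the number of sequences and the alignment length, then one name-plus-sequence line per OTU). The function name `to_phylip` is hypothetical, and the 10-character name field is an assumption based on the strict Phylip convention; some programs accept relaxed variants.

```python
def to_phylip(records):
    """Format {name: aligned_sequence} as sequential Phylip text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        # strict Phylip reserves the first 10 characters for the name
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)


# toy input: the first ten bases of two of the example sequences
toy = {"Seq_1": "cccctaatat", "Seq_5": "CCCCTAATAG"}
phylip_text = to_phylip(toy)
print(phylip_text)
```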

(7)

Input file formats

NEXUS (filename.nex)

    #NEXUS
    BEGIN TAXA;
    DIMENSIONS NTAX=10;
    TAXLABELS
    Seq_1
    Seq_2
    Seq_3
    Seq_4
    Seq_5
    Seq_6
    Seq_7
    Seq_8
    Seq_9
    Seq_10;
    END;
    BEGIN CHARACTERS;
    DIMENSIONS NCHAR=60;
    FORMAT DATATYPE=DNA MISSING=? GAP=- MATCHCHAR=.;
    MATRIX
    Seq_1  cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
    Seq_2  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_3  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_4  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_5  CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_6  CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_7  CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
    Seq_8  CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_9  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_10 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT;
    END;

Arlequin (filename.arp)

    [Profile]
    Title="An example of DNA sequence data"
    NbSamples=3
    GenotypicData=0
    DataType=DNA
    LocusSeparator=NONE
    [Data]
    [[Samples]]
    SampleName="Population 1"
    SampleSize=3
    SampleData= {
    Seq_1 1 cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
    Seq_2 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_3 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    }
    SampleName="Population 2"
    SampleSize=3
    SampleData= {
    Seq_4 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_5 1 CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_6 1 CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    }
    SampleName="Population 3"
    SampleSize=4
    SampleData= {
    Seq_7 1 CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
    Seq_8 1 CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_9 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    Seq_10 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
    }
    [[Structure]]
    StructureName="A group of 3 populations analyzed for DNA"
    NbGroups=1
    Group= {
    "Population 1"
    "Population 2"
    "Population 3"
    }

(8)

Sequence alignment

Software of the Clustal family:

• ClustalW


online versions

http://www.ch.embnet.org/software/ClustalW.html

http://www.ebi.ac.uk/Tools/clustalw2/index.html

download

http://www.clustal.org/download/

• ClustalX

download

http://www.clustal.org/download/current/

Higgins & Sharp (1988; 1989); Higgins et al. (1992); Thompson et al. (1994; 1997).

SeaView is a sequence alignment editor which is able to read and write various alignment formats (NEXUS, CLUSTAL, FASTA, PHYLIP…). Free download from this website:

http://pbil.univ-lyon1.fr/software/seaview.html

(9)

Molecular diversity indices

Haplotype diversity (H)

It is defined as the probability that two randomly chosen haplotypes are different in the sample. Haplotype (gene) diversity is estimated as:

\[ \hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^{2}\right) \]

where n is the number of gene copies in the sample, k is the number of haplotypes, and p_i is the sample frequency of the i-th haplotype.

Nei (1987).
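As a worked illustration of the haplotype diversity estimator, the sketch below computes H from a list of haplotype labels. The function name `haplotype_diversity` is hypothetical, and the sample (six copies of one haplotype plus four singletons) is chosen to mirror the ten example sequences, six of which are identical.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Unbiased estimate H = n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two gene copies are needed")
    counts = Counter(haplotypes)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_p2)


# n = 10 gene copies, k = 5 haplotypes with counts 6, 1, 1, 1, 1
sample = ["A"] * 6 + ["B", "C", "D", "E"]
h = haplotype_diversity(sample)
print(round(h, 4))  # 0.6667
```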

(10)

Molecular diversity indices

Mean number of pairwise differences (π)

Mean number of differences between all pairs of haplotypes in the sample. It can be estimated as

\[ \hat{\pi} = \frac{n}{n-1}\sum_{i=1}^{k}\sum_{j=1}^{k} p_i\, p_j\, d_{ij} \]

where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, and n is the sample size.

Tajima (1983); (1993).

(11)

Molecular diversity indices

Nucleotide diversity (π_n)

It is computed as the probability that two randomly chosen homologous nucleotide sites are different. It is equivalent to the haplotype diversity at the nucleotide level.

\[ \hat{\pi}_n = \frac{n}{n-1}\,\frac{\sum_{i=1}^{k}\sum_{j=1}^{k} p_i\, p_j\, d_{ij}}{L} \]

where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, n is the sample size and L is the number of loci.

Tajima (1983); Nei (1987).
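The two indices based on pairwise differences can be sketched as follows. The code assumes that d_ij is simply the number of differing sites between two aligned haplotypes (no correction for multiple hits), uses made-up four-base toy sequences, and checks that the frequency-based formula agrees with a direct average over all pairs of gene copies. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def n_differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def pi_direct(seqs):
    """Average number of differences over all pairs of gene copies."""
    pairs = list(combinations(seqs, 2))
    return sum(n_differences(a, b) for a, b in pairs) / len(pairs)


def pi_from_frequencies(seqs):
    """n/(n-1) * sum_i sum_j p_i p_j d_ij over the distinct haplotypes."""
    n = len(seqs)
    freqs = {hap: c / n for hap, c in Counter(seqs).items()}
    total = sum(
        p_i * p_j * n_differences(h_i, h_j)
        for h_i, p_i in freqs.items()
        for h_j, p_j in freqs.items()
    )
    return n / (n - 1) * total


toy = ["ACGT", "ACGT", "ACGA", "TCGA"]   # made-up data, L = 4 sites
pi = pi_from_frequencies(toy)
pi_n = pi / len(toy[0])                  # nucleotide diversity
print(pi, pi_direct(toy), pi_n)
```

Both routes give 7/6 for the toy data, which is why the n/(n-1) factor appears in the frequency-based form.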

(12)

Molecular diversity indices

«Genetic loci from a centre of origin are expected to retain more ancestral variation and show higher haplotypic and nucleotide diversity, with lineage pruning through successive colonization events leading to a reduction in derived populations.»

Troy et al. (2001).

383 B. taurus mtDNA sequences (240 bp of the HVRI region):

    Region             Mean pairwise differences (± s.d.)
    Middle East        3.79 ± 2.03
    Anatolia           3.49 ± 1.81
    Mainland Europe    1.92 ± 1.10
    Britain            2.68 ± 1.45
    Northern Europe    1.47 ± 0.91
    Africa             2.09 ± 1.18

(13)

Analysis of MOlecular VAriance - AMOVA

The Analysis of MOlecular VAriance (AMOVA, Excoffier et al. 1992) is based on analyses of variance of gene frequencies, taking into account the number of mutations between molecular haplotypes.

User-defined groups of populations → particular genetic structure to test.

A hierarchical analysis of variance partitions the total variance into covariance components (Rousset, 2000).

The total molecular variance (σ²) is the sum of the components due to:

• σ_a² = differences among the groups of populations;

• σ_b² = differences among haplotypes in different populations within a group;

• σ_c² = differences among haplotypes within a population.

(14)

Analysis of MOlecular VAriance - AMOVA

Simple hierarchical genetic structure, e.g. haploid individuals in populations → the algorithm leads to a fixation index F_ST (Weir & Cockerham, 1984), which can be expressed in terms of inbreeding coefficients as

\[ F_{ST} = \frac{f_0 - f_1}{1 - f_1} \]

Slatkin (1991).

where f₀ is the probability of identity by descent of two different genes drawn from the same population, and f₁ is the probability of identity by descent of two genes drawn from two different populations.

(15)

Mismatch Distribution

It is the distribution of the observed number of differences between pairs of haplotypes. This distribution is usually multimodal in samples drawn from populations at demographic equilibrium, as it reflects the highly stochastic shape of gene trees…

(16)

Mismatch Distribution

…but it is usually unimodal in populations having passed through a recent demographic expansion.

Rogers & Harpending, (1992); Hudson & Slatkin, (1991).
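The observed mismatch distribution is straightforward to tabulate from an alignment. The sketch below counts the differences for every pair of sequences and returns the relative frequency of each class, together with the mean and variance of the distribution. The function name and the four-base toy sequences are illustrative assumptions, not data from the cited studies.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Relative frequency of pairs differing at 0, 1, 2, ... sites."""
    counts = Counter(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(seqs, 2)
    )
    total = sum(counts.values())
    return [counts.get(d, 0) / total for d in range(max(counts) + 1)]


toy = ["ACGT", "ACGT", "ACGA", "TCGA"]   # made-up aligned sequences
dist = mismatch_distribution(toy)

# mean and variance of the observed number of pairwise differences
mean = sum(d * f for d, f in enumerate(dist))
variance = sum((d - mean) ** 2 * f for d, f in enumerate(dist))
print(dist, mean, variance)
```

Plotting `dist` against the number of differences gives the curve whose shape (multimodal versus unimodal) is discussed above.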

Simulations of populations that underwent a sudden 100-fold growth at 7 units of mutational time before present (Rogers, 2004).

(17)

Mismatch Distribution and estimates of population expansion

In case of a sudden population growth (mismatch distribution = smooth unimodal wave), the time of the expansion τ₀ and the size of the pre-expansion population θ₁ can be estimated as follows

where π is the mean pairwise difference per sequence within the sample, m is the mean of pairwise differences, and v is the variance.

(18)

Estimates of population expansion – an alternative approach

Analysis of Bayesian skyline plots: an approach alternative to mismatch distribution analysis. Past changes in population size can be inferred from present-day genetic diversity without prior assumptions about population history.

Mitochondrial d-loop sequence data (also aDNA).

Four domestic species:

-Yak (Bos grunniens) n=71

-Water buffalo (Bubalus bubalis) n=110

-Mithan (Bos frontalis) n=24

-Cattle (Bos taurus) n=84

One closely related wild species:

-African buffalo (Syncerus caffer) n=195

Uniform mutation rate: 32% Myr^-1

Domestic species - sudden expansion during the last 10^4 years ~ time since domestication.

African buffalo - gradual population expansion followed by a sharp decline (consistent with documented epidemics and habitat loss since the 19th century).

Source: Finlay et al. (2007).

Software: BEAST, BEAUTI and TRACER.

(19)

Admixture analysis

This analysis evaluates the relative contributions of any number of parental populations to a derived, hybrid population.

It compares the composition of different gene pools rather than making inference about the admixture event itself (mY estimator; Dupanloup & Bertorelle, 2001).

Software: ADMIX ver. 1.0

Features:

- works with sequences, RFLPs, microsatellites

- needs 2 input files: DATA file (filename.dat) and MATRIX file (filename.mtx)

The DATA file should contain, for each locus, the sample sizes of the admixed and of the parental populations and the number of copies observed for each haplotype (allele) in each population.

DATA file example:

    LocusX
    n_AD, n_P1, n_P2
    cnH1(AD), cnH1(P1), cnH1(P2)
    cnH2(AD), cnH2(P1), cnH2(P2)
    cnH3(AD), cnH3(P1), cnH3(P2)

where AD = admixed pop.; P1 = parental pop. 1; P2 = parental pop. 2; n_AD = sample size of pop. AD, etc.; H1, H2, H3 = haplotypes; cnH1(AD) = count number for haplotype 1 in AD pop., etc.

(20)

Admixture analysis

MATRIX file example:

    n_X      number of analyzed loci
    LocusX
    n_H      number of haplotypes observed at the locus
    0        lower triangular matrix of molecular distances
    1 0      (number of substitutions in pairwise comparisons of haplotypes)
    3 2 0

ADMIX ver. 2.0 needs only one input file containing both the data and the matrix.

Pellecchia et al. (2007).

Admixture values ± s.e. calculated on Bos taurus mtDNA data (HVRI region) derived from autochthonous Italian breeds.

(21)

Trees

A tree is a graph which describes the evolutionary relationships between sequences.

•Nodes = Taxonomic Units (TUs);

•Branches = evolutionary relationships between TUs in terms of ancestry/descent.

A branch connects only two nodes. Internal nodes represent ancestral TUs, while terminal nodes represent present TUs (i.e. sequences), also called Operational Taxonomic Units (OTUs).

Cladogram: a tree describing only the relationships between nodes. Branch lengths have no specific meaning.

Phylogram: branch lengths are proportional to the evolutionary distance → calculations of genetic divergence between nodes.

[Figure: the ten example sequences (Seq 1 to Seq 10) drawn as a cladogram and as a phylogram.]

(22)

Trees

Rooted tree: a particular node, the “root”, represents the common ancestor of all the remaining nodes → all the branches can be oriented as a function of time.

Unrooted tree: describes exclusively the evolutionary relationships between OTUs. No information on the evolutionary process as a function of time → it is not possible to identify older/more recent nodes.

[Figure: the ten example sequences (Seq 1 to Seq 10) drawn as an unrooted tree and as a rooted tree with an outgroup.]

Rooted trees are usually built when the hypothesis of the “molecular clock” is assumed, i.e. …

(23)

Trees

To root a tree, a particular OTU, called “outgroup”, is included in the dataset. The outgroup is defined as “an OTU which started the process of divergence from its ancestor before all the remaining OTUs started diverging from each other” (information derived from non-genetic evidence, e.g. paleontology, morphology etc.).

Trees can also be represented in the Newick (computer readable) format with nested brackets:

((((Seq_9,(Seq_6,Seq_5)),Seq_10),((Seq_8,Seq_4),Seq_3)),(Seq_7,Seq_2),Seq_1);

Dedicated software reads trees in Newick format (e.g. TreeView; Page, 1996).
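To show how the nested brackets encode the topology, here is a minimal recursive parser for Newick strings of the simple kind used in this example (names only, no branch lengths, no quoted labels). It is a sketch under those assumptions, not a replacement for dedicated software; `parse_newick` and `leaves` are hypothetical names.

```python
def parse_newick(text):
    """Parse a Newick string (names only) into nested tuples of OTU names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(parse_node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]            # a terminal node (OTU name)

    return parse_node()


def leaves(node):
    """List the OTU names below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


newick = ("((((Seq_9,(Seq_6,Seq_5)),Seq_10),((Seq_8,Seq_4),Seq_3)),"
          "(Seq_7,Seq_2),Seq_1);")
tree = parse_newick(newick)
print(len(tree))      # 3 branches leave the basal node
print(leaves(tree))
```

The basal node has three descendants, which is the usual way an unrooted tree is written in Newick format.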

[Figure: the tree of Seq 1 to Seq 10 encoded by the Newick string.]

(24)

Trees

Aim of a phylogenetic analysis → determining the “topology” (structure) of the tree.

The number of possible trees grows very rapidly (faster than exponentially) with the number of OTUs.

For n OTUs, the numbers of rooted (N_R) and unrooted (N_U) trees are given by

\[ N_R = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!} \qquad\qquad N_U = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!} \]

N_U for n OTUs = N_R for (n-1) OTUs.

E.g. if n=10 there are about 35·10^6 possible rooted trees, only one of which correctly …
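The two counting formulas are easy to evaluate exactly with integer arithmetic. The sketch below (function names are hypothetical) reproduces the figure quoted for n = 10 and the identity between unrooted trees for n OTUs and rooted trees for n-1 OTUs.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 OTUs."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))

print(rooted_trees(10))    # 34459425, i.e. about 35 million
print(unrooted_trees(10))  # 2027025
```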

(25)

Tree reconstruction strategies

Tree-building methods can be classified according to

•the type of data (i.e. distance matrix vs. discrete characters);

•the reconstruction strategy (clustering algorithms vs. optimality criteria);

                            DATA
    METHOD                  distance matrix    discrete characters
    Clustering algorithm    UPGMA, NJ          -
    Optimality criterion    ME, FM             MP, ML, BA

UPGMA: unweighted pair-group method using arithmetic means; NJ: neighbor-joining; ME: minimum evolution; FM: Fitch-Margoliash's least-squares method; MP: maximum parsimony; ML: maximum likelihood; BA: Bayesian inference.

All the aforementioned methods (except MP) require the selection of an explicit model of sequence evolution (“substitution model”).

Substitution models describe in probabilistic terms the process by which a set of characters (nucleotides) …

(26)

Models of DNA sequences evolution

Pairwise and percentage difference

These are very rough estimates of evolutionary divergence between sequences. They are computed as the number/percentage of loci (nucleotides) for which two sequences are different:

\[ P = \frac{n_d}{L} \]

where n_d is the number of observed substitutions between two DNA sequences and L is the number of loci.
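A short sketch of this calculation, applied to Seq_2 and Seq_5 of the example alignment. The function name `p_distance` is hypothetical; sites are compared case-insensitively and gaps are not given any special treatment, which is a simplifying assumption.

```python
def p_distance(seq_a, seq_b):
    """Return (n_d, P): number and proportion of differing sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (same length)")
    n_d = sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return n_d, n_d / len(seq_a)


seq_2 = "CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT"
seq_5 = "CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT"

n_d, p = p_distance(seq_2, seq_5)
print(n_d, p)  # 3 differences over 60 sites, P = 0.05
```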

(27)

Models of DNA sequences evolution

The number of observed differences usually underestimates the real amount of evolutionary change (e.g. occurrence of “multiple hits”).

Substitution models incorporate some “correction” parameters, their number varying according to the a priori assumptions accepted (number of fixed/variable parameters).

A priori assumptions:

• Nucleotide sites evolve independently;

• All sites can mutate with equal probability;

• All types of substitutions are equally probable;

• Substitution rate is constant;

• The base composition is at equilibrium (sequences have the same base composition).

The higher the number of accepted assumptions, the simpler the model.

The lower the number of accepted assumptions, the higher the number of parameters that need to be estimated.

(28)

Models of DNA sequences evolution

The most renowned and used nucleotide substitution models are those from the General Time-Reversible (GTR) family (Lanave et al., 1984): 203 possible models differentiated by the number and type of fixed/variable parameters.

The nucleotide substitution models implemented in the most frequently used phylogenetics software packages (MEGA, PAUP*, PHYLIP, PHYML, MrBayes etc.) belong to the GTR family.

(29)

Models of DNA sequences evolution

Jukes and Cantor (JC69; 1969)

It is the simplest (parameter poorest) model, which assumes that:

•Nucleotide frequencies are equal (i.e. π_A = π_T = π_C = π_G = 0.25);

•All possible substitutions take place at a single rate → only the parameter α needs to be estimated (substitution rate).

         A    C    G    T
    A    -    α    α    α
    C    α    -    α    α
    G    α    α    -    α
    T    α    α    α    -
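Under this one-parameter model the observed proportion of differences P can be converted into an estimate of the number of substitutions per site with the standard JC69 correction, d = -(3/4) ln(1 - 4P/3). The formula is not shown on the slide and is added here as a well-known result; the function name `jc69_distance` is hypothetical.

```python
import math


def jc69_distance(p):
    """JC69-corrected distance (substitutions per site) from proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.01, 0.05, 0.20, 0.50):
    # the corrected distance always exceeds the raw proportion for p > 0
    print(p, round(jc69_distance(p), 4))
```

The gap between P and d widens as sequences diverge, which is the “multiple hits” effect that substitution models correct for.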

(30)

Models of DNA sequences evolution

Kimura 2-parameters (K80; 1980)

•Nucleotide frequencies are equal (i.e. π_A = π_C = π_G = π_T = 0.25);

•Different substitution rates between transitions (Ts) α and transversions (Tv) β.

The Ts/Tv ratio is estimated from the data.

         A    C    G    T
    A    -    β    α    β
    C    β    -    β    α
    G    α    β    -    β
    T    β    α    β    -

(31)

Models of DNA sequences evolution

Tamura (1992)

This model is an extension of the K80 method, allowing for unequal nucleotide frequencies.

•Base composition is not equal (A + T ≠ G + C and G + C = θ);

•Different substitution rates between Ts (α) and Tv (β).

The Ts/Tv ratio, as well as nucleotide frequencies, are computed from the data.

         A          C      G      T
    A    -          θβ     θα     (1-θ)β
    C    (1-θ)β     -      θβ     (1-θ)α
    G    (1-θ)α     θβ     -      (1-θ)β
    T    (1-θ)β     θα     θβ     -

(32)

Models of DNA sequences evolution

Felsenstein (F81; 1981)

It is an extension of the JC69 method, allowing for unequal nucleotide frequencies (i.e. π_A ≠ π_C ≠ π_G ≠ π_T).

The overall nucleotide frequencies are computed from the data.

         A        C        G        T
    A    -        π_C α    π_G α    π_T α
    C    π_A α    -        π_G α    π_T α
    G    π_A α    π_C α    -        π_T α
    T    π_A α    π_C α    π_G α    -

(33)

Models of DNA sequences evolution

Hasegawa-Kishino-Yano (HKY; 1985)

This model combines the assumptions of K80 and F81:

•Unequal nucleotide frequencies (i.e. π_A ≠ π_C ≠ π_G ≠ π_T).

•Different substitution rates between Ts (α) and Tv (β).

Overall nucleotide frequencies and the Ts/Tv ratio are computed from the data.

         A        C        G        T
    A    -        π_C β    π_G α    π_T β
    C    π_A β    -        π_G β    π_T α
    G    π_A α    π_C β    -        π_T β
    T    π_A β    π_C α    π_G β    -

(34)

Models of DNA sequences evolution

General Time Reversible (GTR) Lanave et al. (1984).

It is the most general and parameter-rich model.

•Unequal nucleotide frequencies (i.e. π_A ≠ π_C ≠ π_G ≠ π_T).

•Different substitution rates between the two transitions and the four transversions:

Ts: A → G = α₁; C → T = α₂

Tv: A → C = β₁; A → T = β₂; C → G = β₃; G → T = β₄.

• Unequal probability for each type of nucleotide substitution.

• Substitutions are reversible (A → G = G → A).

         A         C         G         T
    A    -         π_C β₁    π_G α₁    π_T β₂
    C    π_A β₁    -         π_G β₃    π_T α₂
    G    π_A α₁    π_C β₃    -         π_T β₄
    T    π_A β₂    π_C α₂    π_G β₄    -

(35)

Models of DNA sequences evolution

Γ (gamma) distribution and Invariant sites

An additional parameter is considered when the substitution rates cannot be assumed as uniform for all sites.

Not all the nucleotide positions within a sequence, in fact, are subject to the same evolutionary constraints (e.g. 1st-2nd vs. 3rd codon position in protein-coding genes). There are two strategies:

1) To analyze separately the sites subject to different evolutionary dynamics;

2) To adopt a model with additional parameters that account for the rate variation.

Γ distributions are used to model continuous variables that are always positive and have skewed distributions.

The shape of the Γ distribution is determined by a single parameter α (“shape parameter”), which specifies the range of rate variation among sites and is inversely proportional to the level of heterogeneity among site rates.

(36)

Models of DNA sequences evolution

The lower the values of α, the larger the range of rate variation and the more uneven the substitution rates.

As α → ∞, all sites have the same substitution rate.
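The effect of the shape parameter can be checked by simulation. The sketch below draws relative site rates from a gamma distribution with shape α and scale 1/α, a common parameterization in which the mean rate is 1 and the variance is 1/α (an assumption of this example, not something stated on the slide). The function name and the chosen α values are illustrative.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates with mean 1 and variance 1/alpha."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


rates_low = sample_site_rates(0.5, 50000)    # strong rate heterogeneity
rates_high = sample_site_rates(5.0, 50000)   # rates much more even

for label, rates in (("alpha=0.5", rates_low), ("alpha=5.0", rates_high)):
    print(label,
          round(statistics.fmean(rates), 2),
          round(statistics.pvariance(rates), 2))
```

With α = 0.5 most sites evolve very slowly and a few evolve very fast, whereas with α = 5 the rates cluster around the mean.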

[Figure: proportion of sites f(r) plotted against the substitution rate (r).]

Also the fraction of invariant sites (i.e. sites showing no variation within the sequence set) can be estimated and taken into account when modeling the …

(37)

Models of DNA sequences evolution

All models typically used to infer evolutionary relationships between DNA

sequences represent a special case of the GTR model.

Imposing constraints (i.e. a priori assumptions) on the parameters of the GTR leads to a different model which can, therefore, be considered as a special case of the GTR.

A model is said to be “nested” within a more complex one if the former can be

obtained by constraining the parameters of the latter.

E.g. JC69 is nested within K80, while F81 and K80 are not nested because fixing

parameter values of either one does not yield the other model.
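The nesting of models can be made concrete by building the off-diagonal entries of the GTR rate matrix and then constraining its parameters. In the sketch below each entry is the frequency of the target base times a rate shared by the two directions of a substitution; `gtr_matrix` and `ts_tv_rates` are hypothetical helper names, the numerical values are arbitrary, and diagonal entries and overall scaling are left out for brevity.

```python
BASES = "ACGT"


def gtr_matrix(freqs, rates):
    """Off-diagonal rates: entry (i, j) = pi_j * rate of the pair {i, j}."""
    q = {}
    for i in BASES:
        for j in BASES:
            if i != j:
                pair = "".join(sorted(i + j))   # e.g. "AG" for A->G and G->A
                q[i, j] = freqs[j] * rates[pair]
    return q


def ts_tv_rates(alpha, beta):
    """Constrain GTR to one transition rate and one transversion rate."""
    return {"AG": alpha, "CT": alpha,
            "AC": beta, "AT": beta, "CG": beta, "GT": beta}


equal = dict.fromkeys(BASES, 0.25)
unequal = {"A": 0.31, "C": 0.22, "G": 0.16, "T": 0.31}   # arbitrary values

jc69 = gtr_matrix(equal, ts_tv_rates(1.0, 1.0))    # equal freqs, one rate
k80 = gtr_matrix(equal, ts_tv_rates(2.0, 1.0))     # equal freqs, Ts != Tv
f81 = gtr_matrix(unequal, ts_tv_rates(1.0, 1.0))   # unequal freqs, one rate
hky = gtr_matrix(unequal, ts_tv_rates(2.0, 1.0))   # unequal freqs, Ts != Tv

print(sorted(set(jc69.values())))   # a single rate class
print(sorted(set(k80.values())))    # two rate classes
```

Setting α = β in K80 returns JC69, which is what "nested" means; no choice of parameters turns F81 into K80 or vice versa.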

(38)

Models of DNA sequences evolution

How do we select the best-fitting model?

To select the best-fitting substitution model, the Likelihood Ratio Test (LRT) is usually applied. In a maximum likelihood framework, it evaluates the statistical significance of the increase in fit of alternative nested models to the data as their number and types of parameters increase.

\[ \Delta = 2\,(\ln L_1 - \ln L_0) \]

L₁ = global ML estimate for the alternative hypothesis (more general, parameter-richer model)

L₀ = global ML estimate for the null hypothesis (simpler model).

Δ is approximately χ²-distributed, with d.f. = difference in the number of free parameters between the two alternative models.

The Akaike Information Criterion (AIC; Akaike, 1974; Posada & Buckley, 2004) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are methods alternative to the LRT; they simultaneously compare the relative fit of all competing models, be they nested or not.
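Both approaches reduce to simple arithmetic on the -lnL scores reported by the software. The sketch below computes the LRT statistic with its χ² P-value and the AIC (using the standard definition AIC = 2k - 2 lnL, which is not spelled out on the slide). The χ² tail probability is implemented with the Python standard library only and is valid for integer degrees of freedom; the mixed χ² distribution that ModelTest uses for the +G and +I tests is not covered. The input scores are taken from the ModelTest example output shown later in these slides, and k = 9 free parameters for TVM+I+G is an assumption that reproduces the reported AIC.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (integer df >= 1)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # recurrence Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lrt(neg_lnl_null, neg_lnl_alt, df):
    """Return (delta, P-value) from the -lnL scores of two nested models."""
    delta = 2.0 * (neg_lnl_null - neg_lnl_alt)
    return delta, chi2_sf(delta, df)


def aic(neg_lnl, n_params):
    """Akaike Information Criterion, AIC = 2k - 2 lnL."""
    return 2.0 * neg_lnl + 2.0 * n_params


delta_trn, p_trn = lrt(2482.0591, 2482.0227, df=1)   # HKY vs TrN
delta_f81, p_f81 = lrt(2562.9832, 2543.3635, df=3)   # JC vs F81
aic_tvm = aic(2358.6572, 9)                          # TVM+I+G, k assumed

print(round(delta_trn, 4), round(p_trn, 4))
print(round(delta_f81, 4), p_f81)
print(round(aic_tvm, 4))
```

A large P-value (HKY vs TrN) means the extra parameter does not improve the fit significantly, so the simpler model is retained.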

(39)

Models of DNA sequences evolution

ModelTest Posada & Crandall (1998).

A very popular tool which automatically selects the best-fitting substitution model from among 56 alternatives by performing LRT, AIC (software ver. 3.06) and BIC (software ver. 3.7) calculations. It returns the name and the parameter values of the best-fitting model.

Original software version: both the ModelTest application and the software PAUP* (Swofford, 1998) are needed. Unfortunately, PAUP* software is not free.

Input file format = Nexus (same as PAUP*) + “ModelTest” block.

More information on how to run ModelTest can be found here:

http://darwin.uvigo.es/software/modeltest.html

http://www.rhizobia.co.nz/phylogenetics/modeltest.html (Windows)

http://www.genedrift.org/mtgui.php (Windows and Linux).

A web-based tool to run ModelTest can be found here:

http://darwin.uvigo.es/software/modeltest_server.html (Posada, 2006).

A free web-based tool to choose among 28 nucleotide models with the AIC:

(40)

Models of DNA sequences evolution

Hierarchical structure of 56 models implemented in the ModelTest procedure (Posada and Crandall, 1998). It does not include all of the possible models in the GTR family.

(41)

Models of DNA sequences evolution

ModelTest ver. 3.06 output

Testing models of evolution - Modeltest Version 3.06
Department of Zoology, Brigham Young University WIDB 574, Provo, UT 84602, USA
_______________________________________________________________
Wed May 23 16:49:15 2007

Input format: Paup matrix file

** Log Likelihood scores **

                   +I         +G         +I+G
JC    = 2458.8540  2458.8540  2454.2852  2443.3606
F81   = 2440.0264  2440.0264  2434.6941  2424.4517
K80   = 2400.7991  2400.7991  2396.5891  2385.6047
HKY   = 2379.0457  2379.0457  2374.1394  2362.9192
TrNef = 2400.6252  2400.6252  2396.1169  2385.5432
TrN   = 2379.0442  2379.0442  2374.0200  2363.2795
K81   = 2398.1973  2398.1973  2393.5496  2382.9202
K81uf = 2377.6162  2377.6162  2372.7297  2362.2349
TIMef = 2398.0212  2398.0212  2393.4592  2382.8601
TIM   = 2376.7197  2376.7197  2372.6138  2362.1086
TVMef = 2395.8040  2395.8040  2391.3481  2380.6255
TVM   = 2375.0203  2375.0203  2369.3423  2358.6572
SYM   = 2395.6624  2395.6624  2391.2957  2380.6013
GTR   = 2374.8865  2374.8865  2369.2437  2358.5361

** Hierarchical Likelihood Ratio Tests (hLRTs) **

Equal base frequencies
  Null model = JC            -lnL0 = 2562.9832
  Alternative model = F81    -lnL1 = 2543.3635
  2(lnL1-lnL0) = 39.2393     df = 3
  P-value = <0.000001

Ti=Tv
  Null model = F81           -lnL0 = 2543.3635
  Alternative model = HKY    -lnL1 = 2482.0591
  2(lnL1-lnL0) = 122.6089    df = 1
  P-value = <0.000001

Equal Ti rates
  Null model = HKY           -lnL0 = 2482.0591
  Alternative model = TrN    -lnL1 = 2482.0227
  2(lnL1-lnL0) = 0.0728      df = 1
  P-value = 0.787369

Equal Tv rates
  Null model = HKY           -lnL0 = 2482.0591
  Alternative model = K81uf  -lnL1 = 2480.2668
  2(lnL1-lnL0) = 3.5845      df = 1
  P-value = 0.058322

Equal rates among sites
  Null model = HKY           -lnL0 = 2482.0591
  Alternative model = HKY+G  -lnL1 = 2374.1394
  2(lnL1-lnL0) = 215.8394    df = 1
  Using mixed chi-square distribution
  P-value = <0.000001

No Invariable sites
  Null model = HKY+G           -lnL0 = 2374.1394
  Alternative model = HKY+I+G  -lnL1 = 2362.9192
  2(lnL1-lnL0) = 22.4404       df = 1
  Using mixed chi-square distribution
  P-value = 0.000001
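The 1-d.f. tests in the output above can be re-computed by hand. The following is a minimal sketch, not ModelTest's own code: the function name `lrt` is hypothetical, the chi-square tail for 1 d.f. is obtained from the complementary error function, and the "mixed chi-square" option is assumed to be a 50:50 mixture of chi-square distributions with 0 and 1 d.f., which halves the P-value.

```python
import math

def lrt(neg_lnl_null, neg_lnl_alt, mixed=False):
    """Likelihood ratio test with 1 degree of freedom.

    Takes the two -lnL scores as printed by ModelTest and returns
    (statistic, P-value).
    """
    # 2(lnL1 - lnL0), written with the positive -lnL values of the output
    stat = max(2.0 * (neg_lnl_null - neg_lnl_alt), 0.0)
    # Upper tail of a chi-square distribution with 1 d.f.
    p = math.erfc(math.sqrt(stat / 2.0))
    if mixed:
        # Assumption: 50:50 mixture of chi-square with 0 and 1 d.f.
        p /= 2.0
    return stat, p

# Values copied from the hLRT output above
print(lrt(2482.0591, 2482.0227))               # HKY vs TrN
print(lrt(2482.0591, 2480.2668))               # HKY vs K81uf
print(lrt(2374.1394, 2362.9192, mixed=True))   # HKY+G vs HKY+I+G
```

Small differences in the last decimals with respect to the ModelTest output are expected, because the printed -lnL values are rounded to four decimals.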

(42)

Models of DNA sequences evolution

ModelTest output

** Hierarchical Likelihood Ratio Tests (hLRTs) **

Model selected: HKY+I+G
-lnL = 2362.9192

Base frequencies:
  freqA = 0.3123
  freqC = 0.2263
  freqG = 0.1618
  freqT = 0.2996
Substitution model:
  Ti/tv ratio = 2.0963
Among-site rate variation
  Proportion of invariable sites (I) = 0.6051
  Variable sites (G)
  Gamma distribution shape parameter = 0.9352

--PAUP* Commands Block: If you want to implement the previous estimates as likelihod settings in PAUP*, attach the next block of commands after the data in your PAUP file:

[!
Likelihood settings from best-fit model (HKY+I+G) selected by hLRT in Modeltest Version 3.06
]

BEGIN PAUP;
Lset Base=(0.3123 0.2263 0.1618) Nst=2 TRatio=2.0963 Rates=gamma Shape=0.9352 Pinvar=0.6051;
END;

** Akaike Information Criterion (AIC) **

Model selected: TVM+I+G
-lnL = 2358.6572
AIC = 4735.3145

Base frequencies:
  freqA = 0.3116
  freqC = 0.2168
  freqG = 0.1607
  freqT = 0.3109
Substitution model:
  Rate matrix
  R(a) [A-C] = 1.8176
  R(b) [A-G] = 8.0533
  R(c) [A-T] = 1.6254
  R(d) [C-G] = 3.5609
  R(e) [C-T] = 8.0533
  R(f) [G-T] = 1.0000
Among-site rate variation
  Proportion of invariable sites (I) = 0.6002
  Variable sites (G)
  Gamma distribution shape parameter = 0.9020

--PAUP* Commands Block: If you want to implement the previous estimates as likelihod settings in PAUP*, attach the next block of commands after the data in your PAUP file:

[!
Likelihood settings from best-fit model (TVM+I+G) selected by AIC in Modeltest Version 3.06
]

BEGIN PAUP;
Lset Base=(0.3116 0.2168 0.1607) Nst=6 Rmat=(1.8176 8.0533 1.6254 3.5609 8.0533) Rates=gamma Shape=0.9020 Pinvar=0.6002;
END;

---- _________________________________________________________________
Time processing: 0.001 seconds
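The AIC value reported above can be checked with the usual definition AIC = 2(-lnL) + 2K (Akaike, 1974). The sketch below is only illustrative: the parameter counts K are an assumption (substitution-model parameters only, branch lengths not counted), chosen because K = 9 reproduces the AIC printed for TVM+I+G.

```python
def aic(neg_lnl, k):
    """Akaike Information Criterion from a -lnL score and K free parameters."""
    return 2.0 * neg_lnl + 2.0 * k

# (-lnL, K) for the two models selected in the output above.
# Assumed K: HKY+I+G = 1 ti/tv ratio + 3 base frequencies + I + G = 6
#            TVM+I+G = 4 rates       + 3 base frequencies + I + G = 9
candidates = {
    "HKY+I+G": (2362.9192, 6),
    "TVM+I+G": (2358.6572, 9),
}

scores = {model: aic(neg_lnl, k) for model, (neg_lnl, k) in candidates.items()}
best = min(scores, key=scores.get)   # the smallest AIC is preferred

for model, value in scores.items():
    print(f"{model}: AIC = {value:.4f}")
print("Model selected:", best)
```

Under these assumptions the more parameter-rich TVM+I+G still has the smaller AIC, in agreement with the model selected by ModelTest.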

(43)

Tree reconstruction strategies

DATA                    distance matrix    discrete characters
METHOD
Clustering algorithm    UPGMA, NJ
Optimality criterion    ME, FM             MP, ML, BA

Distance methods - aligned sequences are converted into a pair-wise distance matrix → loss of information about the contribution of single sites and no inference on the ancestral character states.

Discrete methods - each nucleotide site is considered directly → inferences can be drawn on the ancestral character states.

Clustering methods follow an algorithm (set of steps) to produce a tree (usually a single one) → short computational times, but the results often depend on the order in which sequences are added to the growing tree. Competing hypotheses cannot be tested.

Optimality methods use a specific criterion to assign a score to each possible tree. The ranking is a function of the relationship between tree and data.

(44)

Tree reconstruction strategies

Neighbor-Joining (NJ)

Saitou & Nei (1987).

The clustering algorithm starts from a star topology (completely unresolved tree) and determines the branches between the nearest pair of OTUs (neighbors) and the remaining OTUs through an iterative process.

Each step is taken according to the choice that minimizes the sum of the lengths of all the branches of the tree.

The pair of OTUs chosen at each step will form a “composite OTU” treated as a single entity afterwards.

Advantages:

-Very fast computations;

-Allows for different evolutionary rates along the branches;

-Usually returns reliable results.

Disadvantages:

-the calculation of a distance matrix causes a loss of information.

Software: CLUSTALX, PHYLIP, PAUP*, MEGA and others.
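The iterative clustering described above can be sketched in a few lines. This is a minimal illustration, not the code of any of the listed programs: it uses the standard Q-matrix formulation of the NJ criterion, the function name `neighbor_joining` is hypothetical, and the five-taxon distance matrix is a made-up toy example.

```python
from itertools import combinations

def neighbor_joining(names, matrix):
    """Return an unrooted NJ tree as a Newick string (needs >= 3 OTUs)."""
    nodes = list(names)
    dist = {frozenset((a, b)): float(matrix[i][j])
            for (i, a), (j, b) in combinations(enumerate(names), 2)}

    def d(a, b):
        return dist[frozenset((a, b))]

    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence of each node from all the others
        r = {a: sum(d(a, b) for b in nodes if b != a) for a in nodes}
        # Pair of neighbors minimizing the Q criterion
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d(*p) - r[p[0]] - r[p[1]])
        len_a = 0.5 * d(a, b) + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d(a, b) - len_a
        # The joined pair becomes a "composite OTU"
        new = f"({a}:{len_a:g},{b}:{len_b:g})"
        for c in nodes:
            if c not in (a, b):
                dist[frozenset((new, c))] = 0.5 * (d(a, c) + d(b, c) - d(a, b))
        nodes = [c for c in nodes if c not in (a, b)] + [new]

    # Resolve the last three nodes around the central node
    a, b, c = nodes
    la = 0.5 * (d(a, b) + d(a, c) - d(b, c))
    lb = 0.5 * (d(a, b) + d(b, c) - d(a, c))
    lc = 0.5 * (d(a, c) + d(b, c) - d(a, b))
    return f"({a}:{la:g},{b}:{lb:g},{c}:{lc:g});"

# Toy distance matrix (hypothetical values)
otus = ["a", "b", "c", "d", "e"]
toy = [[0, 5, 9, 9, 8],
       [5, 0, 10, 10, 9],
       [9, 10, 0, 8, 7],
       [9, 10, 8, 0, 3],
       [8, 9, 7, 3, 0]]
tree = neighbor_joining(otus, toy)
print(tree)
```

Node labels double as Newick fragments here to keep the sketch short; real implementations keep an explicit tree structure.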

(45)

Tree reconstruction strategies

Neighbor-Joining (NJ)

[Figure: Neighbor-Joining example with six sequences (Seq_1 to Seq_6).]

(46)

Tree reconstruction strategies

Maximum Parsimony

Swofford & Berlocher (1987).

This method identifies the tree which needs the smallest number of substitutions (evolutionary changes) to explain the differences between the considered sequences. The branch length is proportional to the number of substitutions between the nodes connected by the branch itself.

“Parsimony informative sites” show at least two different character states occurring at least two times each.

Then the minimum number of substitutions is calculated for each possible unrooted tree. The MP tree is the one requiring the smallest number of changes.

Advantages:

-No loss of information.

Disadvantages:

-no explicit evolutionary model (all substitutions equally probable, equal base frequencies, no correction for multiple hits);

-often it returns a set of equally parsimonious trees.

Software: PHYLIP, PAUP*, MEGA and others.

(47)

Tree reconstruction strategies

Maximum Parsimony (MP)

Seq_1 GTACG
Seq_2 GTCGG
Seq_3 ACAGG
Seq_4 ACCGG

Tree: ((Seq_1,Seq_2),(Seq_3,Seq_4))

[Figure: two reconstructions of site 1 (states G/A) on the tree: Site 1 – 1 change; Site 1 – 5 changes.]

(48)

Tree reconstruction strategies

Maximum Parsimony (MP)

Seq_1 GTACG
Seq_2 GTCGG
Seq_3 ACAGG
Seq_4 ACCGG

Tree: ((Seq_1,Seq_2),(Seq_3,Seq_4))

[Figure: reconstructions of sites 2 to 5 on the tree.]

Site 2 – 1 change
Site 3 – 2 changes
Site 4 – 1 change
Site 5 – no changes

Tree             Sites: 1  2  3  4  5   total
((1,2),(3,4))           1  1  2  1  0   5
((1,3),(2,4))           2  2  1  1  0   6
((1,4),(2,3))           2  2  2  1  0   7
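The counts in the table can be reproduced with a short script. The sketch below uses Fitch's counting rule, which is named here as an assumption (the slides do not say how the changes were counted), and it roots each four-taxon tree on its internal branch, which does not change the number of changes.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes)."""
    if isinstance(tree, str):                     # a tip: its observed state
        return {states[tree]}, 0
    (left, n_left), (right, n_right) = fitch(tree[0], states), fitch(tree[1], states)
    common = left & right
    if common:                                    # no change needed at this node
        return common, n_left + n_right
    return left | right, n_left + n_right + 1     # one extra change

alignment = {"Seq_1": "GTACG", "Seq_2": "GTCGG",
             "Seq_3": "ACAGG", "Seq_4": "ACCGG"}

trees = {
    "((1,2),(3,4))": (("Seq_1", "Seq_2"), ("Seq_3", "Seq_4")),
    "((1,3),(2,4))": (("Seq_1", "Seq_3"), ("Seq_2", "Seq_4")),
    "((1,4),(2,3))": (("Seq_1", "Seq_4"), ("Seq_2", "Seq_3")),
}

def changes_per_site(tree):
    n_sites = len(alignment["Seq_1"])
    return [fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
            for i in range(n_sites)]

for label, tree in trees.items():
    per_site = changes_per_site(tree)
    print(label, per_site, "total =", sum(per_site))
```

The tree with the smallest total, ((1,2),(3,4)), is the MP tree for this alignment.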

(49)

Tree reconstruction strategies

Maximum Likelihood (ML)

Felsenstein (1981).

Often considered as the best approach to determine the most consistent tree topology. Formally, given a data set D (alignment) and the hypothesis H (tree), the likelihood is the probability of observing the data given the hypothesis:

L_D = Pr(D|H)

i.e. the conditional probability of D given H.

The tree which scores the highest value of L represents the ML estimate of the evolutionary relationships between the considered OTUs. In other words, the ML tree is the one which best explains the examined dataset.

Advantages:

-It usually returns consistent results;

-It permits the statistical testing of evolutionary hypotheses (Likelihood Ratio Test).

Disadvantages:

-very long computational times (often 100 bootstrap replicates are used instead of 1000).

Software: PHYLIP, PHYML, PAUP* and others.
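To make L = Pr(D|H) concrete, the sketch below evaluates the likelihood for the simplest possible "tree": two sequences joined by a single branch of length t, under the Jukes-Cantor (JC69) model. Both the reduction to a single branch and the choice of JC69 are simplifying assumptions made for illustration; real ML programs optimize the topology and all branch lengths under the selected model. The two sequences are Seq_1 and Seq_2 of the parsimony example.

```python
import math

def jc69_log_likelihood(t, n_sites, n_diff):
    """ln Pr(D | t) for two aligned sequences under JC69.

    t is the branch length in expected substitutions per site.
    """
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e     # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e     # probability of one specific different base
    return (n_sites * math.log(0.25)            # base frequency at one end
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))

seq_a, seq_b = "GTACG", "GTCGG"
n = len(seq_a)
k = sum(x != y for x, y in zip(seq_a, seq_b))

# Crude grid search for the branch length with the highest likelihood
grid = [i / 1000 for i in range(1, 3001)]
t_hat = max(grid, key=lambda t: jc69_log_likelihood(t, n, k))

# Closed-form JC69 distance, for comparison
analytic = -0.75 * math.log(1.0 - 4.0 * (k / n) / 3.0)
print(f"ML estimate: {t_hat:.3f}  analytic JC69 distance: {analytic:.3f}")
```

The grid maximum agrees with the closed-form JC69 distance, which is the ML estimate in this two-sequence case.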

(50)

Tree reconstruction strategies

Bayesian approach (BA)

A recent variant of ML. While ML seeks the tree that maximizes the probability of observing the data given the tree and the model, BA searches the set of trees that have the maximum probability of being observed given the data and the model.

BA produces a set of trees with approximately equal likelihoods.

Advantages:

-Results are easy to interpret: the frequency of a given clade within the set of trees is taken as the probability of that clade – no need for bootstrapping.

Disadvantages:

-Depending on the settings, it may require long computational times (not as long as for ML).

Software: MrBayes and others.

N. of sequences   Neighbor-joining   Maximum Parsimony   Maximum Likelihood   Bayesian
54                0.20 sec.          0.72 sec.           7.06 hr              3.8 hr
40                0.18 sec.          0.32 sec.           1.1 hr               2.4 hr
30                0.22 sec.          0.18 sec.           17.3 min             1.7 hr
20                0.22 sec.          0.10 sec.           1.8 min              1.05 hr

Computational times required for analysis by the four different methods. Source: Hall (2001). Thanks to faster present-day processors the times have proportionally shortened.

(51)

Tree reconstruction strategies

Calculation of divergence times

If the assumption of a molecular clock (genetic divergence proportional to evolutionary time) is correct, the reconstructed tree topology can be used to estimate the divergence times between all the OTUs.

The divergence time between at least two OTUs must be known from non-genetic evidence (e.g. paleontology). This time value is then used to calibrate the molecular clock for that given tree.
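The calibration step can be written down as simple arithmetic. The sketch below assumes a strict clock, so that the distance d between two OTUs that split T years ago accumulates along both lineages (d = 2rT); the function names and all the numbers are hypothetical.

```python
def calibrate_rate(distance, known_time):
    """Substitution rate per site per year from one dated split (d = 2 r T)."""
    return distance / (2.0 * known_time)

def divergence_time(distance, rate):
    """Divergence time of another pair of OTUs under the same clock."""
    return distance / (2.0 * rate)

# Hypothetical calibration: two OTUs differ by 0.10 substitutions per site
# and are known from the fossil record to have split 5 million years ago.
rate = calibrate_rate(0.10, 5.0e6)

# Hypothetical query: another pair of OTUs differs by 0.04 substitutions per site.
time = divergence_time(0.04, rate)
print(f"rate = {rate:.2e} per site per year, divergence = {time:.2e} years")
```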

(52)

Tree reconstruction strategies

Calculation of divergence times

A Likelihood Ratio Test can be performed on the ML values calculated with (L_clock) and without (L_noclock) the assumption of the validity of the molecular clock.

Δ = 2 (lnL_noclock – lnL_clock)

Δ is χ² distributed with d.f. = n-2, where n is the number of sequences.

Software: PAUP* and others.
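A minimal sketch of the clock test follows. The log-likelihood values are hypothetical, `chi2_sf` is a helper written for this example (closed-form chi-square tail for integer d.f., standard library only), and the names `clock_test`, `lnl_noclock` and `lnl_clock` are illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer d.f.)."""
    h = x / 2.0
    if df % 2 == 0:
        term = total = math.exp(-h)
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return total
    total = math.erfc(math.sqrt(h))
    term = math.exp(-h) * math.sqrt(h) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= h / (j + 0.5)
    return total

def clock_test(lnl_noclock, lnl_clock, n_sequences):
    """Return (delta, d.f., P-value) for the molecular clock LRT."""
    delta = 2.0 * (lnl_noclock - lnl_clock)
    df = n_sequences - 2
    return delta, df, chi2_sf(delta, df)

# Hypothetical log-likelihoods for a data set of 10 sequences
delta, df, p = clock_test(lnl_noclock=-2358.66, lnl_clock=-2365.10, n_sequences=10)
print(f"delta = {delta:.2f}, d.f. = {df}, P = {p:.3f}")
# If P is above the chosen threshold (e.g. 0.05), the clock is not rejected.
```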

The calculation of the Time to the Most Recent Common Ancestor (TMRCA) for a set of sequences can also be performed with a Bayesian approach.

Software: BEAST, BEAUTI and TRACER.

(53)

Tree reconstruction strategies

Bootstrap

Felsenstein (1985).

Non-parametric bootstrap is used to infer the robustness of tree reconstructions.

It estimates sampling error by resampling from the dataset instead of resampling from the population.

This approach can be applied to all the phylogenetic methods, with the exception of BA.

How does it work?

Data: n aligned sequences of length N (n x N matrix).

Objective: estimate confidence in particular features of the obtained tree (robustness of nodes).

(54)

Tree reconstruction strategies

Bootstrap

Felsenstein (1985).

Method:

Step 1 - create a large number of pseudo-datasets (100 or 1000) by re-sampling with replacement the columns of the original data matrix. In each of the bootstrapped replicates, some sites may occur more than once, while others are never sampled.

Original dataset
10 60
Seq_1  cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5  CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6  CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7  CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8  CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

Bootstrap pseudoreplicate
10 60
Seq_1  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_2  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_3  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_4  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_5  CTTGGTTAAAAATACCTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_6  CTTGGTTAAAAATATTTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_7  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_8  CTTTGTTCCAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_9  CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_10 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA

Step 2 - build a tree by applying the method of choice to each pseudo-dataset → disadvantage: it drastically increases the time required for computations.
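Step 1 can be sketched as follows. The function name `bootstrap_replicate` and the random seed are arbitrary choices, and the four short sequences of the parsimony example are reused only to keep the example small; Step 2 (building a tree from each pseudo-dataset) is left to the method of choice.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample the alignment columns with replacement (same length as the original)."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

alignment = {"Seq_1": "GTACG", "Seq_2": "GTCGG",
             "Seq_3": "ACAGG", "Seq_4": "ACCGG"}

rng = random.Random(2008)          # fixed seed, only for reproducibility
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]

for name, seq in replicates[0].items():
    print(name, seq)
```

Because whole columns are drawn, every sequence in a pseudo-dataset receives the same resampled sites.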

(55)

Tree reconstruction strategies

Bootstrap

Felsenstein (1985).

Step 3 – evaluate the bootstrap support of the nodes by calculating the proportion of replicates where the feature is present → “consensus tree”.

[Figure: consensus tree of Seq_1 to Seq_10 with bootstrap support values (%) at the nodes.]

The results are % values that are usually interpreted following a “rule of thumb”:

- value < 50%: weakly supported nodes, unlikely to be correct

- 50% < value < 70%: nodes to be interpreted with caution

- value > 70%: strongly supported nodes, likely to be correct.

Simulations have shown that bootstrap values greater than 70% correspond to a probability greater than 95% that the node is correct. In BA trees, instead, only the nodes with 95% PP values are considered as strongly supported.
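Step 3 amounts to counting how often each clade appears among the replicate trees. In the sketch below each replicate tree is represented simply as a list of clades (sets of sequence names); the four replicates are hypothetical, and the treatment of values of exactly 50% or 70% is an assumption, since the rule of thumb does not cover them.

```python
from collections import Counter

def clade_support(replicate_clades):
    """Percentage of replicate trees in which each clade is present."""
    counts = Counter(clade for clades in replicate_clades for clade in set(clades))
    n = len(replicate_clades)
    return {clade: 100.0 * count / n for clade, count in counts.items()}

def interpret(percent):
    """Rule of thumb from the slide (boundary values assigned arbitrarily)."""
    if percent < 50:
        return "weakly supported"
    if percent < 70:
        return "interpret with caution"
    return "strongly supported"

def clade(*names):
    return frozenset(names)

# Hypothetical clades found in four bootstrap replicate trees
replicate_trees = [
    [clade("Seq_5", "Seq_6"), clade("Seq_1", "Seq_2")],
    [clade("Seq_5", "Seq_6"), clade("Seq_1", "Seq_2")],
    [clade("Seq_5", "Seq_6"), clade("Seq_1", "Seq_3")],
    [clade("Seq_5", "Seq_6"), clade("Seq_2", "Seq_3")],
]

support = clade_support(replicate_trees)
for members, percent in support.items():
    print(sorted(members), f"{percent:.0f}%", interpret(percent))
```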

(56)

Tree reconstruction strategies

Jackknife

In the jackknife procedure the resampling occurs without replacement.

This is usually done by randomly deleting half of the characters in each replicate → subreplicates are smaller than the original dataset → the statistical properties of the samples may change.

Original dataset
10 60
Seq_1  cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5  CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6  CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7  CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8  CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9  CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

Jackknife subreplicate
10 30
Seq_1  CCTATAGCATTAATTAATTGTTTACATTAA
Seq_2  CCTATAGCATTAATTAATTGTTTACATTAA
Seq_3  CCTATAGCATTAATTAATTGTTTACATTAA
Seq_4  CCTATAGCATTAATTAATTGTTTACATTAA
Seq_5  CCTATAGCATCAATTAATTGTTTACATTAA
Seq_6  CCTATAGCATTAATTAATTGTTTACATTAA
Seq_7  CCTATTGCATTAATTAATTATTTACATTAA
Seq_8  CCTATAGCATTAATTAATTGTTTACATTAA
Seq_9  CCTATAGCATTAATTAATTGTTTACATTAA
Seq_10 CCTATAGCATTAATTAATTGTTTACATTAA
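The resampling without replacement can be sketched as below. The function name `jackknife_replicate`, the `keep_fraction` parameter and the seed are hypothetical choices; the toy alignment is made of the first ten columns of three sequences of the dataset shown above.

```python
import random

def jackknife_replicate(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of columns, each at most once, in the original order."""
    n_sites = len(next(iter(alignment.values())))
    kept = sorted(rng.sample(range(n_sites), int(n_sites * keep_fraction)))
    replicate = {name: "".join(seq[i] for i in kept)
                 for name, seq in alignment.items()}
    return replicate, kept

toy_alignment = {"Seq_1": "CCCCTAATAT",
                 "Seq_5": "CCCCTAATAG",
                 "Seq_7": "CCCCTAATTT"}

rng = random.Random(2008)          # fixed seed, only for reproducibility
subreplicate, kept_columns = jackknife_replicate(toy_alignment, rng)

print("kept columns:", kept_columns)
for name, seq in subreplicate.items():
    print(name, seq)
```

Unlike the bootstrap pseudo-dataset, the subreplicate is half as long as the original alignment and no column appears twice.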

(57)

Software

• ADMIX ver. 1.0 Dupanloup & Bertorelle (2001). http://web.unife.it/progetti/genetica/Giorgio/giorgio_soft.html FREE!!

• ADMIX ver. 2.0 http://cmpg.unibe.ch/software/admix/ FREE!!

• ARLEQUIN ver. 3.1 Excoffier et al. (2005). http://cmpg.unibe.ch/software/arlequin3/ FREE!!

• MEGA – Molecular Evolutionary Genetics Analysis ver. 4 Tamura et al. (2007). http://www.megasoftware.net/ FREE!!

• PAUP* - Phylogenetic Analysis Using Parsimony* ver. 4.0β Swofford (1998). http://paup.csit.fsu.edu/

• PHYLIP ver. 3.68 Felsenstein (2002). http://evolution.genetics.washington.edu/phylip.html FREE!!

• PHYML ver. 3.0 Guindon & Gascuel (2003). http://atgc.lirmm.fr/phyml/ FREE!!

• BEAST ver. 1.4.8… Drummond & Rambaut (2007). http://beast.bio.ed.ac.uk/ FREE!!

• …and BEAUTI ver. 1.4 Drummond & Rambaut (2007). http://beast.bio.ed.ac.uk/BEAUti FREE!!

• MrBAYES ver. 3.1 Huelsenbeck & Ronquist (2001). http://mrbayes.csit.fsu.edu/ FREE!!

• TRACER ver. 1.4 Rambaut & Drummond (2007). http://tree.bio.ed.ac.uk/software/tracer/ FREE!!

• TREEVIEW ver. 1.6.6 Page (1996). http://taxonomy.zoology.gla.ac.uk/rod/treeview.html FREE!!

A miscellany of phylogeny programs and tools is available here:

http://evolution.genetics.washington.edu/phylip/software.html

The BioPortal of the University of Oslo allows several applications to be run through a web server

(58)

THANK YOU!!

(59)

References:

• Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716-723.

• Drummond, A.J. and Rambaut, A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7: 214.

• Dupanloup, I. and Bertorelle, G. 2001. Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol. Biol. Evol. 18: 672-675.

• Excoffier, L., Laval, G. and Schneider, S. 2005. Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evolutionary Bioinformatics Online 1: 47-50.

• Excoffier, L., Smouse, P. and Quattro, J. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131: 479-491.

• Felsenstein, J. 1981. Evolutionary Trees from DNA Sequences: a Maximum Likelihood Approach. J. Mol. Evol. 17: 368-376.

• Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.

• Finlay, E.K., Gaillard, C., Vahidi, S.M.F., Mirhoseini, S.Z., Jianlin, H., Qi, X.B., El-Barody, M.A.A., Baird, J.F., Healy, B.C. and Bradley, D.G. 2007. Bayesian inference of population expansions in domestic bovines. Biology Letters 3: 449-452.

• Galtier, N., Gouy, M. and Gautier, C. 1996. SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput. Applic. Biosci. 12: 543-548.

• Guindon, S. and Gascuel, O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by Maximum Likelihood. Syst. Biol. 52(5): 696-704.

• Hall, B.G. 2001. Phylogenetic trees made easy. A how-to manual for molecular biologists. Sinauer Associates Inc., Publishers, Sunderland, Massachusetts, USA.

• Hasegawa, M., Kishino, H. and Yano, T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.

• Higgins, D.G., Bleasby, A.J. and Fuchs, R. 1992. CLUSTAL V: improved software for multiple sequence alignment. CABIOS 8: 189-191.

• Higgins, D.G. and Sharp, P.M. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5: 151-153.

• Higgins, D.G. and Sharp, P.M. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73: 237-244.

• Hudson, R.R. 1990. Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D. Futuyma and J. Antonovics. Oxford University Press, New York.

• Jukes, T. and Cantor, C. 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism, edited by Munro, H.N., New York: Academic Press, p. 21-132.

• Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.

• Lanave, C., Preparata, G., Saccone, C. and Serio, G. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20: 86-93.

• Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.

• Page, R.D.M. 1996. TREEVIEW: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences 12: 357-358.

• Pellecchia, M., Negrini, R., Colli, L., Patrini, M., Milanesi, E., Achilli, A., Bertorelle, G., Cavalli-Sforza, L.L., Piazza, A., Torroni, A. and Ajmone-Marsan, P. 2007. The mystery of Etruscan origins: novel clues from Bos taurus mitochondrial DNA. Proc. R. Soc. B 274: 1175-1179.

• Posada, D. 2006. ModelTest Server: a web-based tool for the statistical selection of models of nucleotide substitution online. Nucleic Acids Research 34: W700-W703.

(60)

References:

• Posada, D. and Buckley, T.R. 2004. Model selection and model averaging in phylogenetics: advantages of the AIC and Bayesian approaches over likelihood ratio tests. Systematic Biology 53: 793-808.

• Posada, D. and Crandall, K.A. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14(9): 817-818.

• Rambaut, A. and Drummond, A.J. 2007. Tracer v1.4. http://tree.bio.ed.ac.uk/software/tracer/

• Rogers, A.R. 2004. Lecture Notes on Gene Genealogies. www.anthro.utah.edu/~rogers/bio5410/Lectures/a_alu.pdf

• Rogers, A.R. and Harpending, H. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552-569.

• Rousset, F. 2000. Inferences from spatial population genetics, in Handbook of Statistical Genetics, D. Balding, M. Bishop and C. Cannings (eds.). Wiley & Sons, Ltd.

• Saitou, N. and Nei, M. 1987. The neighbor-joining method: a new method for reconstructing the phylogenetic tree. Mol. Biol. Evol. 4: 406-425.

• Schwarz, G. 1978. Estimating the dimension of a model. The Annals of Statistics 6: 461-464.

• Slatkin, M. 1991. Inbreeding coefficients and coalescence times. Genet. Res. Camb. 58: 167-175.

• Swofford, D.L. 1998. PAUP*. Phylogenetic Analysis Using Parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.

• Swofford, D.L. and Berlocher, S.H. 1987. Inferring evolutionary trees from gene frequency data under the principle of maximum parsimony. Systematic Zoology 36: 293-325.

• Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.

• Tajima, F. 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology, edited by Takahata, N. and Clark, A.G., Tokyo, Sunderland, MA: Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37-59.

• Tajima, F. and Nei, M. 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1: 269-285.

• Tamura, K. 1992. Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9: 678-687.

• Tamura, K., Dudley, J., Nei, M. and Kumar, S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.

• Tamura, K. and Nei, M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.

• Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.

• Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673-4680.

• Troy, C.S., MacHugh, D.E., Bailey, J.F., Magee, D.A., Loftus, R.T., Cunningham, P., Chamberlain, A.T., Sykes, B.C. and Bradley, D.G. 2001. Genetic evidence for Near-Eastern origins of European cattle. Nature 410: 1088-1091.

• Weir, B.S. and Cockerham, C.C. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.

• Wright, S. 1951. The genetical structure of populations. Ann. Eugen. 15: 323-354.