SUMMER SCHOOL 2008
PIACENZA, ITALY - 10 September 2008
Methods for the analysis of
mitochondrial DNA data – part 1
Licia Colli, U.C.S.C. di Piacenza
licia.colli@unicatt.it
•The mitochondrial genome
•Sequence format and alignment
•Input file formats most frequently used in mtDNA analyses
•Molecular diversity indices
•Analysis of MOlecular VAriance
•Mismatch distribution and estimates of population expansion
•Admixture analysis
•Trees:
-generalities;
-models of DNA sequence evolution and choice of the best-fitting model
-Tree reconstruction strategies
-Distance-based methods (NJ)
-Character-based methods (MP, ML, Bayesian)
-Molecular clock and calculations of divergence times
-Bootstrap and Jackknife
•Software list
•References
The mitochondrial genome (mtDNA)
• Its length varies among species (15-17 kb)
• multiple copies in each cell (a mammalian egg cell contains about 100,000 copies)
• lack of recombination
• HAPLOID - maternally inherited
• high mutation rate
• 13 protein-coding genes, 2 rRNA sequences (12S and 16S), 22 tRNA sequences and 1 non-coding region (control region or displacement loop)
• the mitochondrial genetic code differs slightly from the nuclear code:

codon   nuclear       mitochondrial
TGA     stop codon    Trp (W)
ATA     Ile (I)       Met (M)
AGA     Arg (R)       stop codon
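The three codon differences listed above can be checked programmatically. The following is a minimal sketch, assuming only the three codons shown on the slide (it is not a full translation table); the function name and the example reading frame are hypothetical.

```python
# Codons whose meaning differs between the two codes, as listed on the slide.
# "*" marks a stop codon. This is deliberately NOT a complete genetic code.
NUCLEAR = {"TGA": "*", "ATA": "I", "AGA": "R"}
MITOCHONDRIAL = {"TGA": "W", "ATA": "M", "AGA": "*"}


def codons_with_changed_meaning(cds):
    """Return (offset, codon, nuclear_aa, mito_aa) for every affected codon."""
    cds = cds.upper()
    hits = []
    # Walk the sequence codon by codon, ignoring an incomplete trailing codon.
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon in NUCLEAR:
            hits.append((start, codon, NUCLEAR[codon], MITOCHONDRIAL[codon]))
    return hits


example = "ATGATATGAAGA"  # hypothetical reading frame: ATG ATA TGA AGA
for hit in codons_with_changed_meaning(example):
    print(hit)
```

A check like this is useful before translating a mitochondrial coding sequence with a tool that defaults to the nuclear code.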
The mitochondrial genome (mtDNA)
A useful molecule, indeed…
• genealogy
• phylogeny (cytochrome b, 12s, 16s, control region, whole mtDNA)
• phylogeography (cytb, control region, whole mtDNA)
• species identification (cytb, control region)
• population studies (+ other markers)
• detection of “cryptic species” and “barcoding” projects (COXI)
• studies on the domestication process
• studies on male fertility/infertility
• studies on ancient DNA (aDNA)…
Sequence format and alignment
EditPlus:
a text editor useful to handle sequences and prepare input files. Freely downloadable 30-day evaluation version:
http://www.editplus.com/download.html

FASTA (filename.txt)
>Seq_1
cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
>Seq_2
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_3
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_4
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_5
CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_6
CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_7
CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
>Seq_8
CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_9
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
>Seq_10
CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

ClustalX (filename.aln)
CLUSTAL X (1.83) multiple sequence alignment

Seq_1           CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_2           CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3           CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4           CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5           CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6           CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7           CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8           CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9           CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10          CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
                ********  ** ******  ************** ************************
Input file formats

Phylip (filename.txt; filename.phy)
10 60
Seq_1     cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

or otherwise
10 60
Seq_1     cccctaatatgtacaataatgaatgttgta
Seq_2     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_3     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_4     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTA
Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTA
Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTA
Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTA
Seq_9     CCCCTAATATGTACAATAATGAATGTTGTA
Seq_10    CCCCTAATATGTACAATAATGAATGTTGTA

aattagtgttataacacatctatgtataat
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAATGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT
AATTAGTGTTATAACACATCTATGTATAAT

MEGA (filename.meg)
#Mega
title: title_of_your_project
#Seq_1 cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
#Seq_2 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_3 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_4 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_5 CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_6 CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_7 CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
#Seq_8 CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_9 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
#Seq_10 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT

Input file formats
NEXUS (filename.nex)
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=10;
TAXLABELS
Seq_1
Seq_2
Seq_3
Seq_4
Seq_5
Seq_6
Seq_7
Seq_8
Seq_9
Seq_10;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=60;
FORMAT DATATYPE=DNA MISSING=? GAP=- MATCHCHAR=.;
MATRIX
Seq_1     cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5     CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6     CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7     CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8     CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9     CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10    CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT;
END;

Arlequin (filename.arp)
[Profile]
Title="An example of DNA sequence data"
NbSamples=3
GenotypicData=0
DataType=DNA
LocusSeparator=NONE
[Data]
[[Samples]]
SampleName="Population 1"
SampleSize=3
SampleData= {
Seq_1 1 cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
}
SampleName="Population 2"
SampleSize=3
SampleData= {
Seq_4 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5 1 CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6 1 CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
}
SampleName="Population 3"
SampleSize=4
SampleData= {
Seq_7 1 CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8 1 CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10 1 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
}
[[Structure]]
StructureName="A group of 3 populations analyzed for DNA"
NbGroups=1
Group= {
"Population 1"
"Population 2"
"Population 3"
}
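Since the same alignment has to be rewritten for each program, a small converter can save manual editing. The sketch below turns FASTA text into the sequential Phylip layout shown earlier (a header with the number of sequences and sites, then names padded to 10 characters). The function names are hypothetical, and it assumes the sequences are already aligned.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            records.append([line[1:].split()[0], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def to_phylip(records):
    """Format aligned (name, sequence) pairs as sequential Phylip text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned (all the same length)")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Names are truncated or padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines)


fasta_text = """>Seq_1
cccctaatat
>Seq_2
CCCCTAATAT
"""
print(to_phylip(parse_fasta(fasta_text)))
```

Dedicated editors such as SeaView perform the same conversion; a script is mainly convenient when many files must be processed.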
Sequence alignment
Software of the Clustal family:
• ClustalW
online versions
http://www.ch.embnet.org/software/ClustalW.html
http://www.ebi.ac.uk/Tools/clustalw2/index.html
download
http://www.clustal.org/download/
• ClustalX
download
http://www.clustal.org/download/current/
Higgins & Sharp (1988; 1989); Higgins et al. (1992); Thompson et al. (1994; 1997).
SeaView
is a sequence alignment editor which is able to read and write various alignment formats (NEXUS, CLUSTAL, FASTA, PHYLIP…). Free download from this website:
http://pbil.univ-lyon1.fr/software/seaview.html
Molecular diversity indices
Haplotype diversity (H)
It is defined as the probability that two randomly chosen haplotypes are different in the sample. Haplotype (gene) diversity is estimated as:
$\hat{H} = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^2\right)$
where n is the number of gene copies in the sample, k is the number of haplotypes, and p_i is the sample frequency of the i-th haplotype.
Nei (1987).
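As a worked example of the haplotype diversity estimator, the sketch below applies it to a sample shaped like the ten example sequences used in this lecture, where six copies are identical (Seq_1 to Seq_4, Seq_9, Seq_10) and four are unique. The labels h1 to h5 and the function name are hypothetical; any hashable haplotype identifier (including the sequences themselves) would work.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's (1987) estimator: H = n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two gene copies are needed")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Six identical copies plus four singletons, i.e. k = 5 haplotypes in n = 10.
sample = ["h1"] * 6 + ["h2", "h3", "h4", "h5"]
print(round(haplotype_diversity(sample), 4))  # 10/9 * (1 - 0.40) = 0.6667
```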
Molecular diversity indices
Mean number of pairwise differences (π)
Mean number of differences between all pairs of haplotypes in the sample. It can be estimated as
$\hat{\pi} = \frac{n}{n-1}\sum_{i=1}^{k}\sum_{j=1}^{k} p_i p_j \hat{d}_{ij}$
where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, and n is the sample size.
Tajima (1983; 1993).
Molecular diversity indices
Nucleotide diversity (π_n)
It is computed as the probability that two randomly chosen homologous nucleotide sites are different. It is equivalent to the haplotype diversity at the nucleotide level.
$\hat{\pi}_n = \frac{n}{n-1}\,\frac{\sum_{i=1}^{k}\sum_{j=1}^{k} p_i p_j \hat{d}_{ij}}{L}$
where d_ij is an estimate of the number of mutations having occurred since the divergence of haplotypes i and j, k is the number of haplotypes, p_i is the frequency of haplotype i, p_j is the frequency of haplotype j, n is the sample size and L is the number of loci.
Tajima (1983); Nei (1987).
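Both indices can be obtained by averaging over all pairs of gene copies, which is equivalent to weighting haplotype pairs by their frequencies. The sketch below makes two simplifying assumptions: d_ij is taken as the raw number of differing sites (no substitution-model correction), and the four short sequences are invented for illustration.

```python
from itertools import combinations


def count_differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned (same length)")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def mean_pairwise_differences(seqs):
    """Average number of differences over all pairs of gene copies."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("at least two sequences are needed")
    return sum(count_differences(a, b) for a, b in pairs) / len(pairs)


def nucleotide_diversity(seqs):
    """Mean pairwise differences divided by the number of sites L."""
    return mean_pairwise_differences(seqs) / len(seqs[0])


toy = ["AAAA", "AAAT", "AATT", "AAAA"]  # hypothetical aligned sample
print(mean_pairwise_differences(toy))   # 7 differences over 6 pairs
print(nucleotide_diversity(toy))        # the same value divided by L = 4
```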
Molecular diversity indices
«Genetic loci from a centre of origin are expected to retain more ancestral variation and show higher haplotypic and nucleotide diversity, with lineage pruning through successive colonization events leading to a reduction in derived populations.»
Troy et al. (2001).
383 B. taurus mtDNA sequences (240 bp of the HVRI region):

Region            Mean pairwise differences (± s.d.)
Middle East       3.79 ± 2.03
Anatolia          3.49 ± 1.81
Mainland Europe   1.92 ± 1.10
Britain           2.68 ± 1.45
Northern Europe   1.47 ± 0.91
Africa            2.09 ± 1.18
Analysis of MOlecular VAriance - AMOVA
The Analysis of MOlecular VAriance (AMOVA, Excoffier et al. 1992) is based on analyses of variance of gene frequencies, taking into account the number of mutations between molecular haplotypes.
User-defined groups of populations → particular genetic structure to test.
A hierarchical analysis of variance partitions the total variance into covariance components (Rousset, 2000).
The total molecular variance (σ²) is the sum of the components due to:
• σa² = differences among the populations;
• σb² = differences among haplotypes in different populations within a group;
• σc² = differences among haplotypes within a population.
Analysis of MOlecular VAriance - AMOVA
Simple hierarchical genetic structure, e.g. haploid individuals in populations → the algorithm leads to a fixation index F_ST (Weir & Cockerham, 1984) which can be expressed in terms of inbreeding coefficients as
$F_{ST} = \frac{f_0 - f_1}{1 - f_1}$
Slatkin (1991).
where f0 is the probability of identity by descent of two different genes drawn from the same population, and f1 is the probability of identity by descent of two genes drawn from two different populations.
Mismatch Distribution
It is the distribution of the observed number of differences between pairs of haplotypes. This distribution is usually multimodal in samples drawn from populations at demographic equilibrium, as it reflects the highly stochastic shape of gene trees…
Mismatch Distribution
…but it is usually unimodal in populations having passed through a recent demographic expansion.
Rogers & Harpending (1992); Slatkin & Hudson (1991).
Simulations of populations that underwent a sudden 100-fold growth at 7 units of mutational time before present (Rogers, 2004).
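The observed mismatch distribution is simply a histogram of pairwise difference counts. A minimal sketch follows, assuming aligned sequences of equal length and raw difference counts; the function name and the toy sample are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mismatch_distribution(seqs):
    """Return a list where element d is the number of pairs differing at d sites."""
    if len(seqs) < 2:
        raise ValueError("at least two sequences are needed")
    counts = Counter(
        sum(x != y for x, y in zip(a.upper(), b.upper()))
        for a, b in combinations(seqs, 2)
    )
    # Fill in zero counts so the list can be plotted directly as a histogram.
    return [counts.get(d, 0) for d in range(max(counts) + 1)]


toy = ["AAAA", "AAAT", "AATT", "AAAA"]  # hypothetical aligned sample
print(mismatch_distribution(toy))       # pairs with 0, 1 and 2 differences
```

Plotting this list for a real sample shows at a glance whether the distribution is ragged and multimodal or a smooth single wave.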
Mismatch Distribution and estimates of population expansion
In case of a sudden population growth (mismatch distribution = smooth unimodal wave), the time of the expansion τ0 and the size of the pre-expansion population θ1 can be estimated as follows
where π is the mean pairwise difference per sequence within the sample, m is the mean of pairwise differences, and v is the variance.
Estimates of population expansion – an alternative approach
Analysis of Bayesian skyline plots: an alternative approach to mismatch distribution analysis. Past changes in population size can be inferred from present-day genetic diversity without prior assumptions about population history.
Mitochondrial d-loop sequence data (also aDNA).
Four domestic species:
- Yak (Bos grunniens) n=71
- Water buffalo (Bubalus bubalis) n=110
- Mithan (Bos frontalis) n=24
- Cattle (Bos taurus) n=84
One closely related wild species:
- African buffalo (Syncerus caffer) n=195
Uniform mutation rate: 32% Myr^-1
Domestic species - sudden expansion during the last 10^4 years ~ time since domestication.
African buffalo - gradual population expansion followed by a sharp decline (consistent with documented epidemics and habitat loss since the XIXth century).
Source: Finlay et al. (2007).
Software: BEAST, BEAUTI and TRACER.
Admixture analysis
This analysis evaluates the relative contributions of any number of parental populations to a derived, hybrid population.
It compares the composition of different gene pools rather than making inference about the admixture event itself (mY estimator; Dupanloup & Bertorelle, 2001).
Software: ADMIX ver. 1.0
Features:
- works with sequences, RFLPs, microsatellites
- needs 2 input files: DATA file (filename.dat) and MATRIX file (filename.mtx)
The DATA file should contain, for each locus, the sample sizes of the admixed and of the parental populations and the number of copies observed for each haplotype (allele) in each population.
DATA file example:
LocusX
nAD, nP1, nP2
cnH1(AD), cnH1(P1), cnH1(P2)
cnH2(AD), cnH2(P1), cnH2(P2)
cnH3(AD), cnH3(P1), cnH3(P2)
AD = admixed pop.; P1 = parental pop. 1; P2 = parental pop. 2
nAD = sample size of pop. AD; etc.
H1, H2, H3 = haplotypes
cnH1(AD) = count number for haplotype 1 in AD pop.; etc.
Admixture analysis
MATRIX file example:
nX       number of analyzed loci
LocusX
nH       number of haplotypes observed at the locus
0
1 0
3 2 0
The last three rows form the lower triangular matrix of molecular distances (number of substitutions in pairwise comparisons of haplotypes).
ADMIX ver. 2.0 needs only one input file containing both the data and the matrix.
Pellecchia et al. (2007).
Admixture values ± s.e. calculated on Bos taurus mtDNA data (HVRI region) derived from autochthonous Italian breeds.
Trees
A tree is a graph which describes the evolutionary relationships between sequences.
•Nodes = Taxonomic Units (TUs);
•Branches = evolutionary relationships between TUs in terms of ancestry/descent.
A branch connects only two nodes. Internal nodes represent ancestral TUs, while terminal nodes represent present TUs (i.e. sequences), also called Operational Taxonomic Units (OTUs).
Cladogram: a tree describing only the relationships between nodes. Branch lengths have no specific meaning.
Phylogram: branch lengths are proportional to the evolutionary distance → calculations of genetic divergence between nodes.
[Figure: the same tree of Seq 1 to Seq 10 drawn as a cladogram and as a phylogram]
Trees
Rooted tree: a particular node, the “root”, represents the common ancestor of all the remaining nodes → all the branches can be oriented as a function of time.
Unrooted tree: describes exclusively the evolutionary relationships between OTUs. No information on the evolutionary process as a function of time → it is not possible to identify older/more recent nodes.
Rooted trees are usually built when the hypothesis of the “molecular clock” is assumed, i.e.
[Figure: an unrooted tree and a rooted tree (with outgroup) of Seq 1 to Seq 10]
Trees
To root a tree, a particular OTU, called “outgroup”, is included in the dataset. The outgroup is defined as “an OTU which started the process of divergence from its ancestor before all the remaining OTUs started diverging from each other” (information derived from non-genetic evidence, e.g. paleontology, morphology etc.).
Trees can also be represented in the Newick (computer-readable) format with nested brackets:
((((Seq_9,(Seq_6,Seq_5)),Seq_10),((Seq_8,Seq_4),Seq_3)),(Seq_7,Seq_2),Seq_1);
Dedicated software reads trees in Newick format (e.g. TreeView; Page, 1996).
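To show how the nested brackets encode the tree, here is a minimal recursive parser for the simple Newick subset used above (names and brackets only, no branch lengths or internal labels). It is a sketch for illustration, not a replacement for a phylogenetics library; the function names are hypothetical.

```python
def parse_newick(text):
    """Parse a Newick string (no branch lengths) into nested tuples of names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced brackets")
            pos += 1  # skip ")"
            return tuple(children)
        # Otherwise read a leaf name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(node):
    """List the OTU names below a node, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


newick = "((((Seq_9,(Seq_6,Seq_5)),Seq_10),((Seq_8,Seq_4),Seq_3)),(Seq_7,Seq_2),Seq_1);"
tree = parse_newick(newick)
print(len(tree))     # number of branches leaving the outermost node
print(leaves(tree))  # the ten OTUs
```

The outermost node of this example has three descendants, which is the usual way an unrooted tree is written in Newick format.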
Trees
Aim of a phylogenetic analysis → determining the “topology” (structure) of the tree.
The number of possible trees grows exponentially with the number of OTUs.
For n OTUs, the numbers of rooted (N_R) and unrooted (N_U) trees are given by
$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} \qquad N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
N_U for n OTUs = N_R for (n-1) OTUs.
E.g. if n=10 there are about 35·10^6 possible trees, only one of which correctly
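The two formulas are easy to evaluate exactly with integer arithmetic. The sketch below (function names are hypothetical) reproduces the n = 10 example and checks the relationship between unrooted and rooted counts.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted bifurcating trees for n OTUs (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n OTUs (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


print(rooted_trees(10))    # 34459425, i.e. about 35 million
print(unrooted_trees(10))  # 2027025
print(unrooted_trees(10) == rooted_trees(9))
```

With 20 OTUs the rooted count already exceeds 10^21, which is why exhaustive searches are only feasible for very small datasets.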
Tree reconstruction strategies
Tree-building methods can be classified according to
•the type of data (i.e. distance matrix vs. discrete characters);
•the reconstruction strategy (clustering algorithms vs. optimality criteria);
                          DATA
METHOD                    distance matrix    discrete characters
Clustering algorithm      UPGMA, NJ
Optimality criterion      ME, FM             MP, ML, BA

UPGMA: unweighted pair-group method using arithmetic means; NJ: neighbor-joining; ME: minimum evolution; FM: Fitch-Margoliash's least-squares method; MP: maximum parsimony; ML: maximum likelihood; BA: Bayesian inference.
All the aforementioned methods (except MP) require the selection of an explicit model of sequence evolution (“substitution model”).
Substitution models describe in probabilistic terms the process by which a set of characters
( l id ) h i h f h l h i
Models of DNA sequences evolution
Pairwise and Percentage difference
These are very rough estimates of evolutionary divergence between sequences.
They are computed as the number/percentage of loci (nucleotides) for which two sequences are different:
$P = n_d / L$
where n_d is the number of observed substitutions between two DNA sequences and L is the number of loci.
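As a worked example, the sketch below computes n_d and P for Seq_1 and Seq_5 from the example alignment used earlier in this lecture. The function name is hypothetical, and the comparison is case-insensitive because Seq_1 is written in lowercase in the example files.

```python
def p_distance(seq_a, seq_b):
    """Return (n_d, P): the number and the proportion of differing sites."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (same length)")
    n_d = sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return n_d, n_d / len(seq_a)


seq_1 = "cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat"
seq_5 = "CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT"

n_d, p = p_distance(seq_1, seq_5)
print(n_d, p)  # 3 differences over L = 60 sites
```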
Models of DNA sequences evolution
The number of observed differences usually underestimates the real amount of evolutionary change (e.g. occurrence of “multiple hits”).
Substitution models incorporate some “correction” parameters, their number varying according to the a priori assumptions accepted (number of fixed/variable parameters).
A priori assumptions:
• Nucleotide sites evolve independently;
• All sites can mutate with equal probability;
• All types of substitutions are equally probable;
• Substitution rate is constant;
• The base composition is at equilibrium (sequences have the same base composition).
The higher the number of accepted assumptions, the simpler the model.
The lower the number of accepted assumptions, the higher the number of parameters that need to be estimated.
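To make the idea of a “correction” for multiple hits concrete, the sketch below applies the standard Jukes and Cantor (1969) distance formula, d = -(3/4) ln(1 - (4/3) P), which corresponds to the simplest model presented later in this section. The function name is hypothetical; the input is a proportion of differing sites such as P = 0.05.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor corrected distance from the proportion of differing sites."""
    if not 0 <= p < 0.75:
        # At P >= 0.75 the sequences are saturated and the formula is undefined.
        raise ValueError("P must be in the interval [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.20, 0.50):
    print(p, round(jc69_distance(p), 4))
```

The corrected distance is always at least as large as P, and the gap widens as the sequences diverge, which is exactly the underestimation described above.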
Models of DNA sequences evolution
The most renowned and used nucleotide substitution models are those from the General Time-Reversible (GTR) family (Lanave et al., 1984): 203 possible models differentiated by the number and type of fixed/variable parameters.
The nucleotide substitution models implemented in the most frequently used phylogenetics software packages (MEGA, PAUP*, PHYLIP, PHYML, MrBayes etc.) belong to the GTR family.
Models of DNA sequences evolution
Jukes and Cantor (JC69; 1969)
It is the simplest (parameter-poorest) model, which assumes that:
•Nucleotide frequencies are equal (i.e. π_A = π_T = π_C = π_G = 0.25);
•All possible substitutions take place at a single rate → only the parameter α needs to be estimated (substitution rate).

      A     C     G     T
A     -     α     α     α
C     α     -     α     α
G     α     α     -     α
T     α     α     α     -

Models of DNA sequences evolution
Kimura 2-parameters (K80; 1980)
•Nucleotide frequencies are equal (i.e. π_A = π_C = π_G = π_T = 0.25);
•Different substitution rates for transitions (Ts, α) and transversions (Tv, β).
The Ts/Tv ratio is estimated from the data.

      A     C     G     T
A     -     β     α     β
C     β     -     β     α
G     α     β     -     β
T     β     α     β     -
Models of DNA sequences evolution
Tamura (1992)
This model is an extension of the K80 model, allowing for unequal nucleotide frequencies.
•Base composition is not equal (A + T ≠ G + C and G + C = θ);
•Different substitution rates for Ts (α) and Tv (β).
The Ts/Tv ratio, as well as the nucleotide frequencies, are computed from the data.

      A          C          G          T
A     -          θβ         θα         (1-θ)β
C     (1-θ)β     -          θβ         (1-θ)α
G     (1-θ)α     θβ         -          (1-θ)β
T     (1-θ)β     θα         θβ         -

Models of DNA sequences evolution
Felsenstein (F81; 1981)
It is an extension of the JC69 model, allowing for unequal nucleotide frequencies (i.e. π_A ≠ π_C ≠ π_G ≠ π_T).
The overall nucleotide frequencies are computed from the data.

      A         C         G         T
A     -         π_C·α     π_G·α     π_T·α
C     π_A·α     -         π_G·α     π_T·α
G     π_A·α     π_C·α     -         π_T·α
T     π_A·α     π_C·α     π_G·α     -
Models of DNA sequences evolution
Hasegawa-Kishino-Yano (HKY; 1985)
This model combines the assumptions of K80 and F81:
•Unequal nucleotide frequencies (i.e. π_A ≠ π_C ≠ π_G ≠ π_T).
•Different substitution rates for Ts (α) and Tv (β).
Overall nucleotide frequencies and the Ts/Tv ratio are computed from the data.

      A         C         G         T
A     -         π_C·β     π_G·α     π_T·β
C     π_A·β     -         π_G·β     π_T·α
G     π_A·α     π_C·β     -         π_T·β
T     π_A·β     π_C·α     π_G·β     -

Models of DNA sequences evolution
General Time Reversible (GTR) Lanave et al. (1984).
It is the most general and parameter-rich model.
•Unequal nucleotide frequencies (i.e. π_A ≠ π_C ≠ π_G ≠ π_T).
•Different substitution rates between the two transitions and the four transversions:
Ts: A→G = α1; C→T = α2
Tv: A→C = β1; A→T = β2; C→G = β3; G→T = β4.
•Unequal probability for each type of nucleotide substitution.
•Substitutions are reversible (A→G = G→A).

      A          C          G          T
A     -          π_C·β1     π_G·α1     π_T·β2
C     π_A·β1     -          π_G·β3     π_T·α2
G     π_A·α1     π_C·β3     -          π_T·β4
T     π_A·β2     π_C·α2     π_G·β4     -
Models of DNA sequences evolution
Γ (gamma) distribution and Invariant sites
An additional parameter is considered when the substitution rates cannot be assumed as uniform for all sites.
Not all the nucleotide positions within a sequence, in fact, are subject to the same evolutionary constraints (e.g. 1st-2nd vs. 3rd codon position in protein-coding genes). There are two strategies:
1) To analyze separately the sites subject to different evolutionary dynamics;
2) To adopt a model with additional parameters that account for the rate variation.
Γ distributions are used to model continuous variables that are always positive and have skewed distributions.
The shape of the Γ distribution is determined by a single parameter α (“shape parameter”) which specifies the range of rate variation among sites and is inversely proportional to the level of heterogeneity among site rates.
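The effect of the shape parameter can be explored by simulation. The sketch below draws relative site rates from a Γ distribution with shape α and scale 1/α, a common parameterisation (assumed here) that keeps the mean rate at 1 so that the variance equals 1/α. The function name, the number of sites and the seed are arbitrary choices for illustration.

```python
import random
import statistics


def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates from Gamma(shape=alpha, scale=1/alpha)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 1.0, 5.0, 50.0):
    rates = sample_site_rates(alpha, 20000)
    print(
        f"alpha={alpha:5.1f}  mean={statistics.fmean(rates):.2f}  "
        f"variance={statistics.pvariance(rates):.2f}"
    )
```

Small values of α give many nearly invariant sites and a few fast ones, while large values concentrate all rates around the mean.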
Models of DNA sequences evolution
The lower the value of α, the larger the range of rate variation and the more uneven the substitution rates.
As α → ∞, all sites have the same substitution rate.
[Figure: proportion of sites f(r) plotted against the substitution rate (r)]
Also the fraction of Invariant sites (i.e. sites showing no variation within the sequence set) can be estimated and taken into account when modeling the
Models of DNA sequences evolution
All models typically used to infer evolutionary relationships between DNA sequences represent a special case of the GTR model.
Imposing constraints (i.e. a priori assumptions) on the parameters of the GTR leads to a different model which can, therefore, be considered as a special case of the GTR.
A model is said to be “nested” within a more complex one if the former can be
obtained by constraining the parameters of the latter.
E.g. JC69 is nested within K80, while F81 and K80 are not nested because fixing
parameter values of either one does not yield the other model.
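Nesting can be made explicit by building the rate matrices from a single function and constraining its parameters. The sketch below starts from the HKY form (rate from i to j = π_j·α for transitions, π_j·β for transversions) and obtains K80, F81 and JC69 as constrained cases. Two conventions are assumptions of this sketch rather than part of the slides: each diagonal entry is set to minus the sum of the rest of its row, and equal frequencies enter as the constant factor 0.25, which only rescales the rates. All function names are hypothetical.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(freqs, alpha, beta):
    """HKY rate matrix as a dict of dicts: q[i][j] = freqs[j] * (alpha or beta)."""
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                rate = alpha if (i, j) in TRANSITIONS else beta
                row[j] = freqs[j] * rate
        row[i] = -sum(row.values())  # diagonal convention: rows sum to zero
        q[i] = row
    return q


EQUAL = dict.fromkeys(BASES, 0.25)


def k80(alpha, beta):
    return hky_rate_matrix(EQUAL, alpha, beta)    # HKY with equal frequencies


def f81(freqs, alpha):
    return hky_rate_matrix(freqs, alpha, alpha)   # HKY with Ts rate = Tv rate


def jc69(alpha):
    return k80(alpha, alpha)                      # K80 with Ts rate = Tv rate


print(jc69(1.0) == k80(1.0, 1.0))    # JC69 is K80 with beta fixed to alpha
print(jc69(1.0) == f81(EQUAL, 1.0))  # JC69 is also F81 with equal frequencies
```

No choice of parameter values turns F81 into K80 or the reverse, since one relaxes the frequencies and the other the Ts/Tv rates, which is why those two are not nested.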
Models of DNA sequences evolution
How do we select the best-fitting model?
To select the best-fitting substitution model the Likelihood Ratio Test (LRT) is usually applied. In a maximum likelihood framework, it evaluates the statistical significance of the increase in fit of alternative nested models to the data as their number and types of parameters increase.
$\Delta = 2\,(\ln L_1 - \ln L_0)$
L1 = global ML estimate for the alternative hypothesis (more general, parameter-richer model)
L0 = global ML estimate for the null hypothesis (simpler model).
Δ is approximately χ²-distributed with d.f. = difference in the number of free parameters between the two alternative models.
The Akaike Information Criterion (AIC; Akaike, 1974; Posada & Buckley, 2004) and the Bayesian Information Criterion (BIC; Schwarz, 1978) are alternative methods to the LRT; they simultaneously evaluate the relative fit of all competing models, be they nested or not.
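The LRT and AIC calculations can be reproduced with a few lines of code. The sketch below uses only the standard library: the χ² tail probability is computed from its closed form for integer degrees of freedom, and the AIC is taken as 2k - 2 ln L. The input values are the -lnL scores reported in the ModelTest output shown later in this lecture; the number of free parameters k = 9 for TVM+I+G is inferred here from the reported AIC and -lnL rather than stated in that output. Function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        terms = (half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * sum(terms)
    terms = (
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * sum(terms)


def likelihood_ratio_test(neg_lnl_null, neg_lnl_alt, df):
    """Return (delta, p_value) from the -lnL scores of two nested models."""
    delta = 2.0 * (neg_lnl_null - neg_lnl_alt)  # equals 2 (lnL1 - lnL0)
    return delta, chi2_sf(delta, df)


def aic(neg_lnl, n_params):
    """Akaike Information Criterion: 2k - 2 lnL."""
    return 2.0 * n_params + 2.0 * neg_lnl


# HKY (null) against K81uf (alternative), one extra free parameter.
delta, p_value = likelihood_ratio_test(2482.0591, 2480.2668, df=1)
print(round(delta, 4), round(p_value, 4))

# TVM+I+G, with k = 9 inferred as explained above.
print(round(aic(2358.6572, 9), 4))
```

Note that ModelTest reports a mixed χ² distribution for the tests on rate heterogeneity and invariable sites, which this plain χ² sketch does not implement.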
Models of DNA sequences evolution
ModelTest Posada & Crandall (1998).
A very popular tool which automatically selects the best-fitting substitution model from among 56 alternatives by performing LRT, AIC (software ver. 3.06) and BIC (software ver. 3.7) calculations. It returns the name and the parameter values of the best-fitting model.
Original software version: both the ModelTest application and the software PAUP* (Swofford, 1998) are needed. Unfortunately, PAUP* software is not free.
Input file format = Nexus (same as PAUP*) + “ModelTest” block.
More information on how to run ModelTest can be found here:
http://darwin.uvigo.es/software/modeltest.html
http://www.rhizobia.co.nz/phylogenetics/modeltest.html (Windows)
http://www.genedrift.org/mtgui.php (Windows and Linux).
A web-based tool to run ModelTest can be found here:
http://darwin.uvigo.es/software/modeltest_server.html (Posada, 2006).
A free web-based tool to choose among 28 nucleotide models with the AIC:
Models of DNA sequences evolution
Hierarchical structure of 56 models implemented in the ModelTest procedure (Posada and Crandall, 1998). It does not include all of the possible models in the GTR family.
Models of DNA sequences evolution
ModelTest ver. 3.06 output

Testing models of evolution - Modeltest Version 3.06
(c) Copyright, 1998-2000 David Posada
Department of Zoology, Brigham Young University
WIDB 574, Provo, UT 84602, USA
_______________________________________________________________
Wed May 23 16:49:15 2007
Input format: Paup matrix file

** Log Likelihood scores **
            +I           +G           +I+G
JC     =    2458.8540    2454.2852    2443.3606
F81    =    2440.0264    2434.6941    2424.4517
K80    =    2400.7991    2396.5891    2385.6047
HKY    =    2379.0457    2374.1394    2362.9192
TrNef  =    2400.6252    2396.1169    2385.5432
TrN    =    2379.0442    2374.0200    2363.2795
K81    =    2398.1973    2393.5496    2382.9202
K81uf  =    2377.6162    2372.7297    2362.2349
TIMef  =    2398.0212    2393.4592    2382.8601
TIM    =    2376.7197    2372.6138    2362.1086
TVMef  =    2395.8040    2391.3481    2380.6255
TVM    =    2375.0203    2369.3423    2358.6572
SYM    =    2395.6624    2391.2957    2380.6013
GTR    =    2374.8865    2369.2437    2358.5361

** Hierarchical Likelihood Ratio Tests (hLRTs) **

Equal base frequencies
Null model = JC              -lnL0 = 2562.9832
Alternative model = F81      -lnL1 = 2543.3635
2(lnL1-lnL0) = 39.2393       df = 3
P-value = <0.000001

Ti=Tv
Null model = F81             -lnL0 = 2543.3635
Alternative model = HKY      -lnL1 = 2482.0591
2(lnL1-lnL0) = 122.6089      df = 1
P-value = <0.000001

Equal Ti rates
Null model = HKY             -lnL0 = 2482.0591
Alternative model = TrN      -lnL1 = 2482.0227
2(lnL1-lnL0) = 0.0728        df = 1
P-value = 0.787369

Equal Tv rates
Null model = HKY             -lnL0 = 2482.0591
Alternative model = K81uf    -lnL1 = 2480.2668
2(lnL1-lnL0) = 3.5845        df = 1
P-value = 0.058322

Equal rates among sites
Null model = HKY             -lnL0 = 2482.0591
Alternative model = HKY+G    -lnL1 = 2374.1394
2(lnL1-lnL0) = 215.8394      df = 1
Using mixed chi-square distribution
P-value = <0.000001

No Invariable sites
Null model = HKY+G           -lnL0 = 2374.1394
Alternative model = HKY+I+G  -lnL1 = 2362.9192
2(lnL1-lnL0) = 22.4404       df = 1
Using mixed chi-square distribution
P-value = 0.000001
Models of DNA sequences evolution
ModelTest output

Model selected: HKY+I+G
-lnL = 2362.9192
Base frequencies: freqA = 0.3123 freqC = 0.2263 freqG = 0.1618 freqT = 0.2996
Substitution model: Ti/tv ratio = 2.0963
Among-site rate variation
Proportion of invariable sites (I) = 0.6051
Variable sites (G)
Gamma distribution shape parameter = 0.9352
--PAUP* Commands Block: If you want to implement the previous estimates as likelihod settings in PAUP*, attach the next block of commands after the data in your PAUP file:
[!
Likelihood settings from best-fit model (HKY+I+G) selected by hLRT in Modeltest Version 3.06
]
BEGIN PAUP;
Lset Base=(0.3123 0.2263 0.1618) Nst=2 TRatio=2.0963 Rates=gamma Shape=0.9352 Pinvar=0.6051;
END;

** Akaike Information Criterion (AIC) **
Model selected: TVM+I+G
-lnL = 2358.6572 AIC = 4735.3145
Base frequencies: freqA = 0.3116 freqC = 0.2168 freqG = 0.1607 freqT = 0.3109
Substitution model: Rate matrix R(a) [A-C] = 1.8176 R(b) [A-G] = 8.0533 R(c) [A-T] = 1.6254 R(d) [C-G] = 3.5609 R(e) [C-T] = 8.0533 R(f) [G-T] = 1.0000
Among-site rate variation
Proportion of invariable sites (I) = 0.6002
Variable sites (G)
Gamma distribution shape parameter = 0.9020
--PAUP* Commands Block: If you want to implement the previous estimates as likelihod settings in PAUP*, attach the next block of commands after the data in your PAUP file:
[!
Likelihood settings from best-fit model (TVM+I+G) selected by AIC in Modeltest Version 3.06
]
BEGIN PAUP;
Lset Base=(0.3116 0.2168 0.1607) Nst=6 Rmat=(1.8176 8.0533 1.6254 3.5609 8.0533) Rates=gamma Shape=0.9020 Pinvar=0.6002;
END;
_________________________________________________________________
Time processing: 0.001 seconds
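The AIC ranking can be checked with a few lines of code. This is only a sketch under stated assumptions: AIC is taken as 2(-lnL) + 2k, and the parameter counts k (substitution-rate parameters, three free base frequencies, plus one each for I and G, with branch lengths not counted) are assumptions chosen because k = 9 reproduces the AIC reported for TVM+I+G. Only six of the +I+G models from the score table are compared here.

```python
def aic(neg_lnl, k):
    """Akaike Information Criterion from a -lnL score and k free parameters."""
    return 2.0 * neg_lnl + 2.0 * k


# (-lnL, assumed number of free parameters) for six +I+G models.
models = {
    "HKY+I+G": (2362.9192, 6),
    "TrN+I+G": (2363.2795, 7),
    "K81uf+I+G": (2362.2349, 7),
    "TIM+I+G": (2362.1086, 8),
    "TVM+I+G": (2358.6572, 9),
    "GTR+I+G": (2358.5361, 10),
}

scores = {name: aic(neg_lnl, k) for name, (neg_lnl, k) in models.items()}
best = min(scores, key=scores.get)

for name in sorted(scores, key=scores.get):
    print(f"{name:10s} AIC = {scores[name]:.4f}")
print("Lowest AIC:", best)
```

GTR+I+G has a slightly better likelihood than TVM+I+G, but its extra parameter costs more than the gain in fit, so the simpler model is preferred.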
Tree reconstruction strategies
                         DATA
METHOD                   distance matrix    discrete characters
Clustering algorithm     UPGMA, NJ
Optimality criterion     ME, FM             MP, ML, BA

Distance methods - aligned sequences are converted into a pairwise distance matrix → loss of information about single-site contributions and no inference on the ancestral character states.
Discrete methods - each nucleotide site is considered directly → they allow inference on the ancestral character states.
Clustering methods follow an algorithm (set of steps) to produce a tree (usually a single one) → short computational times, but the results often depend on the order in which sequences are added to the growing tree. Competing hypotheses cannot be tested.
Optimality methods use a specific criterion to assign a score to each possible tree. The ranking is a function of the relationship between tree and data.
Tree reconstruction strategies
Neighbor-Joining (NJ)
Saitou & Nei (1987). The clustering algorithm starts from a star topology (completely unresolved tree) and, through an iterative process, determines the branches between the nearest pair of OTUs (neighbors) and the remaining OTUs.
At each step, the pair chosen is the one that minimizes the sum of the lengths of all the branches of the tree.
The pair of OTUs chosen at each step forms a “composite OTU”, which is treated as a single entity afterwards.
Advantages:
-Very fast computations;
-Allows for different evolutionary rates along the branches;
-Usually returns reliable results.
Disadvantages:
-The calculation of a distance matrix causes a loss of information.
Software: CLUSTALX, PHYLIP, PAUP*, MEGA and others.
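A single neighbor-joining step can be sketched as follows. The 5 x 5 distance matrix is a hypothetical example, not data from these slides, and the function name `nj_step` is illustrative. The pair to join is picked with the usual Q criterion, which is assumed here to be equivalent to minimizing the total branch length described above.

```python
def nj_step(dist):
    """Run one neighbor-joining step on a symmetric distance matrix.

    Returns the joined pair, the two branch lengths to the new node, and
    the distances from the new composite OTU to each remaining OTU.
    """
    n = len(dist)
    totals = [sum(row) for row in dist]
    # Q criterion: the pair with the smallest Q is joined first.
    _, i, j = min(
        ((n - 2) * dist[a][b] - totals[a] - totals[b], a, b)
        for a in range(n)
        for b in range(a + 1, n)
    )
    # Branch lengths from each member of the pair to the new internal node.
    len_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    # Distances from the composite OTU to every other OTU.
    to_others = [
        0.5 * (dist[i][k] + dist[j][k] - dist[i][j])
        for k in range(n)
        if k not in (i, j)
    ]
    return (i, j), (len_i, len_j), to_others


# Hypothetical pairwise distances among five OTUs (indices 0 to 4).
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

pair, lengths, to_others = nj_step(dist)
print("joined OTUs:", pair)
print("branch lengths:", lengths)
print("composite OTU to the others:", to_others)
```

Repeating the step on the reduced matrix until three OTUs remain yields the full tree.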
Tree reconstruction strategies
Neighbor-Joining (NJ)
[Figure: neighbor-joining trees for six sequences, Seq_1 to Seq_6]
Tree reconstruction strategies
Maximum Parsimony
Swofford & Berlocher (1987). This method identifies the tree that requires the smallest number of substitutions (evolutionary changes) to explain the differences among the sequences considered. The length of each branch is proportional to the number of substitutions between the nodes it connects.
“Parsimony-informative sites” show at least two different character states, each occurring at least twice.
The minimum number of substitutions is then calculated for each possible unrooted tree. The MP tree is the one requiring the smallest number of changes.
Advantages:
-No loss of information.
Disadvantages:
-No explicit evolutionary model (all substitutions equally probable, equal base frequencies, no correction for multiple hits);
-It often returns a set of equally parsimonious trees.
Software: PHYLIP, PAUP*, MEGA and others.
Tree reconstruction strategies
Maximum Parsimony (MP)
Seq_1 GTACG
Seq_2 GTCGG
Seq_3 ACAGG
Seq_4 ACCGG
[Figure: character states mapped onto the tree, one panel per site. Panel titles: Site 1 - 1 change; Site 1 - 5 changes; Site 2 - 1 change; Site 3 - 2 changes; Site 4 - 1 change; Site 5 - no changes]
Number of changes per site for each tree:
Tree             Site 1  Site 2  Site 3  Site 4  Site 5  Total
((1,2),(3,4))      1       1       2       1       0       5
((1,3),(2,4))      2       2       1       1       0       6
((1,4),(2,3))      2       2       2       1       0       7
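The per-site counts for the three possible trees can be reproduced with a short script. This sketch uses Fitch's small-parsimony algorithm, a standard way of counting the minimum number of changes that the slides do not name. The trees are written as nested tuples with an arbitrary root, which does not affect the count.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):  # a leaf: its observed state, no change
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:  # the two children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def changes_per_site(tree, alignment):
    """Minimum number of changes at each alignment column for a given tree."""
    n_sites = len(next(iter(alignment.values())))
    counts = []
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in alignment.items()}
        counts.append(fitch(tree, states)[1])
    return counts


alignment = {"Seq_1": "GTACG", "Seq_2": "GTCGG", "Seq_3": "ACAGG", "Seq_4": "ACCGG"}
trees = {
    "((1,2),(3,4))": (("Seq_1", "Seq_2"), ("Seq_3", "Seq_4")),
    "((1,3),(2,4))": (("Seq_1", "Seq_3"), ("Seq_2", "Seq_4")),
    "((1,4),(2,3))": (("Seq_1", "Seq_4"), ("Seq_2", "Seq_3")),
}

scores = {label: changes_per_site(tree, alignment) for label, tree in trees.items()}
for label, counts in scores.items():
    print(label, counts, "total =", sum(counts))
mp_tree = min(scores, key=lambda label: sum(scores[label]))
print("MP tree:", mp_tree)
```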
Tree reconstruction strategies
Maximum Likelihood (ML)
Felsenstein (1981). Often considered the best approach for determining the most consistent tree topology. Formally, given a data set D (alignment) and a hypothesis H (tree), the likelihood is
L_D = Pr(D|H)
that is, the conditional probability of D given H.
The tree with the highest value of L is the ML estimate of the evolutionary relationships among the OTUs considered. In other words, the ML tree is the one that best explains the dataset examined.
Advantages:
-It usually returns consistent results;
-It permits the statistical testing of evolutionary hypotheses (Likelihood Ratio Test).
Disadvantages:
-Very long computational times (often 100 bootstrap replicates are used instead of 1000).
Software: PHYLIP, PHYML, PAUP* and others.
Tree reconstruction strategies
Bayesian approach (BA)
A recent variant of ML. While ML seeks the tree that maximizes the probability of observing the data given the tree and the model, BA searches for the set of trees that have the maximum probability given the data and the model.
BA produces a set of trees with approximately equal likelihoods.
Advantages:
-Results are easy to interpret: the frequency of a given clade within the set of trees is taken as the probability of that clade, so there is no need for bootstrapping.
Disadvantages:
-Depending on the settings, it may require long computational times (not as long as for ML).
Software: MrBayes and others.
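How a clade frequency is read off a set of trees can be illustrated with a toy example. In this sketch each sampled tree is reduced to the set of clades it contains; the four trees and the function name `clade_frequencies` are hypothetical, and real programs also handle details such as discarding the initial part of the sample, which is ignored here.

```python
from collections import Counter


def clade_frequencies(trees):
    """Fraction of trees in which each clade (a frozenset of OTUs) appears."""
    counts = Counter()
    for clades in trees:
        counts.update(set(clades))  # count each clade once per tree
    return {clade: n / len(trees) for clade, n in counts.items()}


def clade(*names):
    return frozenset(names)


# Four hypothetical sampled trees for four sequences, as sets of clades.
sampled_trees = [
    {clade("Seq_1", "Seq_2"), clade("Seq_3", "Seq_4")},
    {clade("Seq_1", "Seq_2"), clade("Seq_3", "Seq_4")},
    {clade("Seq_1", "Seq_2"), clade("Seq_1", "Seq_2", "Seq_3")},
    {clade("Seq_1", "Seq_3"), clade("Seq_2", "Seq_4")},
]

freqs = clade_frequencies(sampled_trees)
for members, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(sorted(members), f"{freq:.2f}")
```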
N. of sequences   Neighbor-joining   Maximum Parsimony   Maximum Likelihood   Bayesian
54                0.20 sec           0.72 sec            7.06 hr              3.8 hr
40                0.18 sec           0.32 sec            1.1 hr               2.4 hr
30                0.22 sec           0.18 sec            17.3 min             1.7 hr
20                0.22 sec           0.10 sec            1.8 min              1.05 hr
Computational times required for analysis by the four different methods. Source: Hall (2001). With faster present-day processors these times are proportionally shorter.
Tree reconstruction strategies
Calculation of divergence times
If the assumption of a molecular clock (genetic divergence proportional to evolutionary time) is correct, the reconstructed tree topology can be used to estimate the divergence times between all the OTUs.
The divergence time between at least two OTUs must be known from non-genetic evidence (e.g. paleontology). This time value is then used to calibrate the molecular clock for that tree.
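Under a strict clock the calibration reduces to a proportion. The sketch below assumes exactly that and nothing more; the distances and the 5 million year calibration point are hypothetical values, not data from these slides.

```python
def divergence_time(distance, calib_distance, calib_time):
    """Scale a genetic distance to time, assuming divergence is proportional to time."""
    if calib_distance <= 0:
        raise ValueError("the calibration distance must be positive")
    return calib_time * distance / calib_distance


# Hypothetical calibration: two OTUs with distance 0.10 diverged 5.0 Myr ago.
calib_distance, calib_time = 0.10, 5.0

# Hypothetical distances for two other pairs of OTUs on the same tree.
other_pairs = {"pair A": 0.04, "pair B": 0.16}
times = {
    name: divergence_time(d, calib_distance, calib_time)
    for name, d in other_pairs.items()
}
for name, t in times.items():
    print(f"{name}: about {t:.1f} Myr")
```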
Tree reconstruction strategies
Calculation of divergence times
A Likelihood Ratio Test can be performed on the ML values calculated with (L_clock) and without (L_noclock) the assumption of a molecular clock:
Δ = 2 (lnL_noclock - lnL_clock)
Δ is χ2 distributed with d.f. = n-2, where n is the number of sequences.
Software: PAUP* and others.
The Time to the Most Recent Common Ancestor (TMRCA) of a set of sequences can also be calculated with a Bayesian approach.
Software: BEAST, BEAUTI and TRACER.
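The clock test can be worked through numerically. In this sketch the two log-likelihoods and the number of sequences are hypothetical, and `chi2_sf` is a small helper written for the example (upper-tail probability of a chi-square variable with integer degrees of freedom), not a function of any phylogenetics package.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)  # exact tail for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # exact tail for df = 1
    while a < df / 2.0:  # raise df by 2 at each pass
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Hypothetical log-likelihoods of the same tree with and without a clock.
lnl_clock = -2370.10
lnl_noclock = -2362.92
n_sequences = 10  # hypothetical number of sequences

delta = 2.0 * (lnl_noclock - lnl_clock)
df = n_sequences - 2
p_value = chi2_sf(delta, df)
print(f"delta = {delta:.2f}, d.f. = {df}, P = {p_value:.4f}")
print("clock rejected at 5%" if p_value < 0.05 else "clock not rejected at 5%")
```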
Tree reconstruction strategies
Bootstrap
Felsenstein (1985). The non-parametric bootstrap is used to assess the robustness of tree reconstructions. It estimates sampling error by resampling from the dataset instead of resampling from the population.
This approach can be applied to all the phylogenetic methods, with the exception of BA.
How does it work?
Data: n aligned sequences of length N (n x N matrix).
Objective: estimate confidence in particular features of the obtained tree (robustness of nodes).
Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
Method:
Step 1 - create a large number of pseudo-datasets (100 or 1000) by resampling the columns of the original data matrix with replacement. In each bootstrap replicate, some sites may occur more than once, while others are never sampled.
Original dataset
10 60
Seq_1 cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5 CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6 CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7 CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8 CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Bootstrap pseudoreplicate
10 60
Seq_1 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_2 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_3 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_4 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_5 CTTGGTTAAAAATACCTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_6 CTTGGTTAAAAATATTTAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_7 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_8 CTTTGTTCCAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_9 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Seq_10 CTTTGTTAAAAATATTGAAAAATTGTGTAAATTAGTTTTTAAACCACCCCTTTGTATTAA
Step 2 - build a tree by applying the method of choice to each pseudo-dataset → disadvantage: it drastically increases the time required for computations.
Tree reconstruction strategies
Bootstrap
Felsenstein (1985).
Step 3 - evaluate the bootstrap support of the nodes by calculating the proportion of replicates in which the feature is present → “consensus tree”.
[Figure: consensus tree of Seq 1 to Seq 10 with bootstrap support values at the nodes]
The results are % values that are usually interpreted following a “rule of thumb”:
- value<50% - weakly supported nodes, unlikely to be correct
- 50%<value<70% - nodes to be interpreted with caution
- 70%<value - strongly supported nodes, likely to be correct.
Simulations have shown that bootstrap values greater than 70% correspond to a probability greater than 95% that the node is correct. In BA trees, instead, only the nodes with posterior probability (PP) values of at least 95% are considered strongly supported.
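Step 1 of the bootstrap, resampling alignment columns with replacement, can be sketched in a few lines. The toy alignment is the first 12 columns of three of the sequences shown on the slides; the function name `bootstrap_columns` and the fixed seed are illustrative choices, and tree building (step 2) is left out.

```python
import random
from collections import Counter


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    replicate = {
        name: "".join(alignment[name][i] for i in picks) for name in names
    }
    return replicate, picks


alignment = {
    "Seq_1": "CCCCTAATATGT",
    "Seq_5": "CCCCTAATAGGT",
    "Seq_7": "CCCCTAATTTGT",
}

rng = random.Random(2008)  # fixed seed so the example is repeatable
replicate, picks = bootstrap_columns(alignment, rng)

usage = Counter(picks)
n_sites = len(alignment["Seq_1"])
repeated = sorted(i for i, n in usage.items() if n > 1)
never = [i for i in range(n_sites) if i not in usage]
for name, seq in replicate.items():
    print(name, seq)
print("columns drawn more than once:", repeated)
print("columns never drawn:", never)
```

In a full analysis this function would be called 100 or 1000 times, each replicate being passed to the tree-building method of choice.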
Tree reconstruction strategies
Jackknife
In the jackknife procedure the resampling occurs without replacement.
This is usually done by randomly deleting half of the characters in each replicate → subreplicates are smaller than the original dataset → the statistical properties of the samples may change.
Original dataset
10 60
Seq_1 cccctaatatgtacaataatgaatgttgtaaattagtgttataacacatctatgtataat
Seq_2 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_3 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_4 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_5 CCCCTAATAGGTACAATAACTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_6 CCCCTAATAGGTACAATAATTAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_7 CCCCTAATTTGTACAATAATGAATGTTGTAAATTAATGTTATAACACATCTATGTATAAT
Seq_8 CCCCTAATATGTCCAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_9 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Seq_10 CCCCTAATATGTACAATAATGAATGTTGTAAATTAGTGTTATAACACATCTATGTATAAT
Jackknife subreplicate
10 30
Seq_1 CCTATAGCATTAATTAATTGTTTACATTAA
Seq_2 CCTATAGCATTAATTAATTGTTTACATTAA
Seq_3 CCTATAGCATTAATTAATTGTTTACATTAA
Seq_4 CCTATAGCATTAATTAATTGTTTACATTAA
Seq_5 CCTATAGCATCAATTAATTGTTTACATTAA
Seq_6 CCTATAGCATTAATTAATTGTTTACATTAA
Seq_7 CCTATTGCATTAATTAATTATTTACATTAA
Seq_8 CCTATAGCATTAATTAATTGTTTACATTAA
Seq_9 CCTATAGCATTAATTAATTGTTTACATTAA
Seq_10 CCTATAGCATTAATTAATTGTTTACATTAA
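The delete-half jackknife differs from the bootstrap only in how columns are drawn. The sketch below keeps a random half of the columns, each at most once and in their original order; the toy alignment (first 12 columns of three slide sequences), the function name `jackknife_columns` and the seed are illustrative assumptions.

```python
import random


def jackknife_columns(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of columns, sampled without replacement."""
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    n_keep = int(n_sites * keep_fraction)
    kept = sorted(rng.sample(range(n_sites), n_keep))  # no column twice
    subreplicate = {
        name: "".join(alignment[name][i] for i in kept) for name in names
    }
    return subreplicate, kept


alignment = {
    "Seq_1": "CCCCTAATATGT",
    "Seq_5": "CCCCTAATAGGT",
    "Seq_7": "CCCCTAATTTGT",
}

rng = random.Random(2008)  # fixed seed so the example is repeatable
subreplicate, kept = jackknife_columns(alignment, rng)
print("columns kept:", kept)
for name, seq in subreplicate.items():
    print(name, seq)
```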
Software
• ADMIX ver. 1.0 Dupanloup & Bertorelle (2001). http://web.unife.it/progetti/genetica/Giorgio/giorgio_soft.html FREE!!
• ADMIX ver. 2.0 http://cmpg.unibe.ch/software/admix/ FREE!!
• ARLEQUIN ver. 3.1 Excoffier et al. (2005). http://cmpg.unibe.ch/software/arlequin3/ FREE!!
• MEGA – Molecular Evolutionary Genetics Analysis ver. 4 Tamura et al. (2007). http://www.megasoftware.net/ FREE!!
• PAUP* - Phylogenetic Analysis Using Parsimony* ver. 4.0β Swofford (1998). http://paup.csit.fsu.edu/
• PHYLIP ver. 3.68 Felsenstein (2002). http://evolution.genetics.washington.edu/phylip.html FREE!!
• PHYML ver. 3.0 Guindon & Gascuel (2003). http://atgc.lirmm.fr/phyml/ FREE!!
• BEAST ver. 1.4.8… Drummond & Rambaut (2007). http://beast.bio.ed.ac.uk/ FREE!!
• …and BEAUTI ver. 1.4 Drummond & Rambaut (2007). http://beast.bio.ed.ac.uk/BEAUti FREE!!
• MrBayes ver. 3.1 Huelsenbeck & Ronquist (2001). http://mrbayes.csit.fsu.edu/ FREE!!
• TRACER ver. 1.4 Rambaut & Drummond (2007). http://tree.bio.ed.ac.uk/software/tracer/ FREE!!
• TREEVIEW ver. 1.6.6 Page (1996). http://taxonomy.zoology.gla.ac.uk/rod/treeview.html FREE!!
A miscellany of phylogeny programs and tools is available here:
http://evolution.genetics.washington.edu/phylip/software.html
The BioPortal of the University of Oslo allows several applications to be run through a web server.
THANK YOU!!
References:
• Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716-723.
• Drummond, A.J. and Rambaut, A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7: 214.
• Dupanloup, I. and Bertorelle, G. 2001. Inferring admixture proportions from molecular data: extension to any number of parental populations. Mol. Biol. Evol. 18: 672–675.
• Excoffier, L., Laval, G., and Schneider, S. 2005. Arlequin ver. 3.0: An integrated software package for population genetics data analysis. Evolutionary Bioinformatics Online 1: 47-50.
• Excoffier, L., Smouse, P., and Quattro, J. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: Application to human mitochondrial DNA restriction data. Genetics 131: 479-491.
• Felsenstein, J. 1981. Evolutionary Trees from DNA Sequences: a Maximum Likelihood Approach. J. Mol. Evol. 17: 368-376.
• Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39: 783-791.
• Finlay, E.K., Gaillard, C., Vahidi, S.M.F., Mirhoseini, S.Z., Jianlin, H., Qi, X.B., El-Barody, M.A.A., Baird, J.F., Healy, B.C. and Bradley, D.G. 2007. Bayesian inference of population expansions in domestic bovines. Biology Letters 3: 449-452.
• Galtier, N., Gouy, M. and Gautier, C. 1996. SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput. Applic. Biosci. 12: 543-548.
• Guindon, S., and Gascuel, O. 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by Maximum Likelihood. Syst Biol 52(5): 696-704.
• Hall, B.G. 2001. Phylogenetic trees made easy. A how-to manual for molecular biologists. Sinauer Associates Inc., Publishers, Sunderland, Massachusetts, USA.
• Hasegawa, M., Kishino, H. and Yano, T. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22: 160-174.
• Higgins, D.G., Bleasby, A.J. and Fuchs, R. 1992. CLUSTAL V: improved software for multiple sequence alignment. CABIOS 8: 189-191.
• Higgins, D.G. and Sharp, P.M. 1989. Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS 5: 151-153.
• Higgins, D.G. and Sharp, P.M. 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 73: 237-244.
• Hudson, R.R. 1990. Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology, edited by D.J. Futuyma and J.D. Antonovics. Oxford University Press, New York.
• Jukes, T. and Cantor, C. 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism, edited by Munro, H.N., New York: Academic Press, p. 21-132.
• Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitution through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
• Lanave, C., Preparata, G., Saccone, C. and Serio, G. 1984. A new method for calculating evolutionary substitution rates. J. Mol. Evol. 20: 86-93.
• Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New York, NY, USA.
• Page, R.D.M. 1996. TREEVIEW: An application to display phylogenetic trees on personal computers. Computer Applications in the Biosciences 12: 357-358.
• Pellecchia, M., Negrini, R., Colli, L., Patrini, M., Milanesi, E., Achilli, A., Bertorelle, G., Cavalli-Sforza, L.L., Piazza, A., Torroni, A. and Ajmone-Marsan, P. 2007. The mystery of Etruscan origins: novel clues from Bos taurus mitochondrial DNA. Proc. R. Soc. B 274: 1175-1179.
• Posada, D. 2006. ModelTest Server: a web-based tool for the statistical selection of models of nucleotide substitution online. Nucleic Acids Research 34: W700-W703.
• Posada, D. and Buckley, T.R. 2004. Model selection and model averaging in phylogenetics: advantages of the AIC and Bayesian approaches over likelihood ratio tests. Systematic Biology 53: 793-808.
• Posada, D. and Crandall, K.A. 1998. Modeltest: testing the model of DNA substitution. Bioinformatics 14(9): 817-818.
• Rambaut, A. and Drummond, A.J. 2007. Tracer v1.4. http://tree.bio.ed.ac.uk/software/tracer/
• Rogers, A.R. 2004. Lecture Notes on Gene Genealogies. www.anthro.utah.edu/~rogers/bio5410/Lectures/a_alu.pdf
• Rogers, A.R. and Harpending, H. 1992. Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552-569.
• Rousset, F. 2000. Inferences from spatial population genetics, in Handbook of Statistical Genetics, D. Balding, M. Bishop and C. Cannings (eds.). Wiley & Sons, Ltd.
• Saitou, N. and Nei, M. 1987. The neighbor-joining method: a new method for reconstructing the phylogenetic tree. Mol. Biol. Evol. 4: 406-425.
• Schwarz, G. 1978. Estimating the dimension of a model. The Annals of Statistics 6: 461-464.
• Slatkin, M., 1991 Inbreeding coefficients and coalescence times. Genet. Res. Camb. 58: 167-175.
• Swofford, D.L. 1998. PAUP*. Phylogenetic Analysis Using Parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
• Swofford, D.L. and Berlocher, S.H. 1987. Inferring evolutionary trees from gene frequency data under the principle of maximum parsimony. Systematic Zoology 36: 293−325.
• Tajima, F. 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437-460.
• Tajima, F. 1993. Measurement of DNA polymorphism. In: Mechanisms of Molecular Evolution. Introduction to Molecular Paleopopulation Biology, edited by Takahata, N. and Clark, A.G., Tokyo, Sunderland, MA:Japan Scientific Societies Press, Sinauer Associates, Inc., p. 37-59.
• Tajima, F. and Nei, M. 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1: 269-285.
• Tamura, K., 1992 Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C content biases. Mol. Biol. Evol. 9: 678-687.
• Tamura, K., Dudley, J., Nei, M., and Kumar, S. 2007. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Molecular Biology and Evolution 24: 1596-1599.
• Tamura, K. and Nei, M. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512-526.
• Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F. and Higgins, D.G. 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.
• Thompson, J.D., Higgins, D.G. and Gibson, T.J. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research 22: 4673-4680.
• Troy, C.S., MacHugh, D.E., Bailey, J.F., Magee, D.A., Loftus, R.T., Cunningham, P., Chamberlain, A.T., Sykes, B.C. and Bradley, D.G. 2001. Genetic evidence for Near-Eastern origins of European cattle. Nature 410: 1088-1091.
• Weir, B.S. and Cockerham, C.C. 1984. Estimating F-statistics for the analysis of population structure. Evolution 38: 1358-1370.
• Wright, S. 1951. The genetical structure of populations. Ann. Eugen. 15: 323-354.