• No results found

Consistency and identifiability of the polymorphism aware phylogenetic models

N/A
N/A
Protected

Academic year: 2020

Share "Consistency and identifiability of the polymorphism aware phylogenetic models"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

Contents lists available at ScienceDirect

Journal

of

Theoretical

Biology

journal homepage: www.elsevier.com/locate/jtb

Letter to Editor

Consistencyandidentifiabilityofthepolymorphism-aware phylogeneticmodels

1. Introduction

Phylogenies help to answer a multitude of questions regard-ing the origin of species, the tempo of evolution, the origin of particular traitsand theprocesses (either neutralor selective)of molecular evolution (e.g. see Pagel,1999). Phylogenies are there-forecentraltothestudyofpatternsandprocessesbywhich evolu-tionhappens.However,phylogeniescanonlyservethesepurposes iftheyarecorrectlyestimated.Fortunately,mathematical phyloge-neticsprovidescriteriathathelpustoassesswhetheragiven phy-logeny estimatoris statistically sound. Statistical consistency and identifiabilityaretwoexamplesofsuchcriteria.

Inphylogenetictheory,aphylogeneticreconstructionmethod

τ

ˆ isstatisticallyconsistentunderamodelifitconvergesin probabil-ity to the true tree

τ

as the number of sites S of the sequence alignment increases indefinitely (Wald, 1949; Felsenstein, 1973), i.e.

lim

S→∞p

(

|

τ

ˆS

τ

|

)

=1 . (1)

Identifiabilityismetifthemodelofevolutionisuniquely char-acterizedbytheprobabilitydistributionitdefines(Changand Har-tigan, 1991). An identifiable model is a necessary condition for consistency. Moreformal conditionsforidentifiabilityand consis-tencyaredescribedinSteel(1994),Chang(1996)andSteel(2013); these are revisited later in thisarticle. Lack ofstatistical consis-tency haslong beenanaspect thatphylogeneticistscared for.For example, some maximumparsimony methods were criticizedfor lackingstatisticalconsistencyearlyon(Felsenstein,1978).

Simplephylogenyestimationmethodsusingstandard substitu-tion model (e.g. Bayesian or maximum likelihood inference un-der the Jukes and Cantor, 1969; Kimura, 1980; Tajima and Nei, 1984 substitution models)can be shownto enjoy statistical con-sistency (Chang, 1996). However, the same principle cannot be directly extended to more complex andgeneral methods oftree inference that include rate variation across sites (

) and invari-ant sites (I). For such models of evolution, the model distribu-tion is not fully understood, which complicates proving identifi-ability. Identifiabilityhas beenproven for thepure general time-reversible model(GTR,Tavaré, 1986). TheGTRwithratevariation (i.e.GTR+

)(WuandSusko,2010)alsoenjoysidentifiability; how-ever, a rigorousproof forthe commonlyutilized GTR+

+I is still lacking (Rogers, 2001; Allmanet al., 2008; Chai and Housworth, 2011).

During the last years, alternative approaches to the clas-sic nucleotide substitution models have been proposed. The polymorphism-awarephylogeneticmodels(PoMo)aresuchan

ex-ampleofanalternative approachfortreeestimationthat also ac-countsforincompletelineagesorting(DeMaioetal.,2013).While PoMo canbe morebroadly classifiedasa nucleotidesubstitution model; it adds a new layer of complexity by accounting for the population-levelevolutionaryprocesses(suchasmutations,genetic drift,andselection)todescribetheevolutionaryprocess(DeMaio etal., 2015;Schrempf et al., 2016;Borgeset al., 2019). Todo so, PoMobuildson aGTR-likemutationschemeandexpandsthe {A, C,G,T}state-spacetoincludepolymorphicstates,thereby account-ing for current and ancestral intra-population variation. The lat-ter aspect sets PoMo apartfrom the classic models of evolution, whichtraditionallyonlyuseasinglerepresentativeDNAsequence perspecies.

PoMohasreceived substantial attentionfromthe evolutionary community (see Mirarab etal., 2014; Szöllsi et al., 2015; Leaché andOaks,2017).Severalpublicationshaveemployed ittosolve a widerange ofevolutionaryquestions,e.g., disentangling phyloge-neticrelationshipsamongbaboonspecies(Rogersetal.,2019), de-scribingthephylogeographichistoryofgreatapes(Schrempfetal., 2016), estimating patterns of GC-bias and mutational biases (De Maio et al., 2013; Borges et al., 2019) and inferring the site-frequency spectrum (Schrempf and Hobolth, 2017; Borges et al., 2019)frompopulationdata.

Allthis raised the questionof whetherPoMo is a statistically consistentphylogenyestimatorforphylogeneticdatasets.Building uponthe formalresultsprovided by Steel(2013) and identifiabil-ityofthePoMoratematrixandstationarydistribution,wepresent aproofthat theMAPtopology (i.e.thetreetopology thathasthe greatestposteriorprobability)isa statisticallyconsistentestimate ofthetruetree.ThisresultshowsthatPoMoisastatisticallysound methodofphylogeneticinference,anditprovidesvalidity for fur-therinvestigationsandusesofPoMomethodsonrealdatasets.

2. Polymorphism-awarephylogeneticmodelsinanutshell

PoMo assumes a Moran model (Moran, 1958) with N haploid individualsanddefinesthealleletrajectoryofa singlelocuswith fourpossibleallelesai,whereiA=

{

A,C,G,T

}

(Table1includes aglossaryofthesymbolsusedinthepresentproof).Theevolution ofthispopulationinthecourseoftime isdescribedbya contin-uous timeMarkovchainwitha discretestate-space definedbyN andthefour alleles.Statesare monomorphic (orboundary) ifall the N individuals have the allele i {Nai}; orpolymorphic, if two allelesarepresentinthepopulation

{

nai,

(

Nn

)

aj

}

(Fig.1).

A ratematrix Q describing the dynamic betweenthe bound-aryandpolymorphic statescanbedefinedby consideringseveral population-levelprocesses.Sofar,PoMoincludesmutation,genetic driftandallelicselection (Fig. 1) (DeMaio etal.,2015; Schrempf etal.,2016;Borgesetal.,2019).

https://doi.org/10.1016/j.jtbi.2019.110074

(2)

Table1

Glossary.Theorderofthesymbolsfollowstheirfirstoccurrenceinthetext.

Symbol Description

τ treetopology ˆ

τ treetopologyestimator

S numberofsites

ai allelei

A nucleotidebase{A,C,G,T}

N populationsize

n absolutefrequencyofanallele;relativefrequencywouldben/N Q PoMoinstantaneousratematrixwithelementsq

π componentofthemutationrate;stationaryfrequenciesintheGTRmodelofevolution ρ componentofthemutationrate;baseexchangeabilitiesintheGTRmodelofevolution σ allelicselectioncoefficients

k normalizationconstantofthestationaryvector

α PoMostate

K numberofalleles;K=4forPoMo

L numberoftaxa/sequences

P transitionmatrix

χ sitepatternor,better,PoMofrequencypatterns η rootvertex

=(ν1,ν2) directededge ω branchlengths

θ setofPoMoparametersandbranchlengths

Fig.1. PoMostate-spaceandtransitionrates.ThetwoallelesrepresentanyofthefournucleotidebasesA,C,G,andT.Orangeandgreydistinguishtheroleofmutation andgeneticdriftplusselection,respectively.ThePoMostate-spaceincludesmonomorphic(orboundarystates){Nai}andpolymorphicstates{nai,(Nn)aj}.Monomorphic

statesinteractwithpolymorphicstatesviamutationandpolymorphicstatesreachmonomorphicstatesviadriftorselection.Betweenpolymorphicstatesonlydriftand selectionoccur.PoMoisthusaparticularcaseofthemultivariateMoranmodelwithboundarymutationsandselectionwhenfouralleles(i.e.thefournucleotidebases)are considered.

• Mutations are assumed to occur only in the boundary states {Nai}withmutation rates

μ

aiaj=

ρ

aiaj

π

aj=

ρ

ajai

π

aj.Mutation rates are thus modeled according to a GTR model of evolu-tion(Tavaré,1986),where

π

representsthestationary frequen-cies and

ρ

the six exchangeability terms. Similar interpreta-tions of theseparameters are still validforthe neutralPoMo, as

π

immediately informs the stationary frequencies of the monomorphicstates.However,thestationaryfrequenciesofthe monomorphic states are no longer only defined by

π

in the more generalPoMo with allelicselection (Borgeset al., 2019) (Eq.(3)). Allin all, the GTR-likemutationscheme is a conve-nientstrategytoobtainquantitiesofinterestforPoMo. • Genetic drift rules the allele frequency changes in a

popula-tion. Genetic drift is modeled according to the Moran model, inwhichoneindividualischosentoreproduce(i.e.tocopy it-self)andonetodieineachtimestep(Moran,1958).Therefore, the rates by which an allelewith a startingfrequency n/N is bornor dies(i.e. thealleleincreasesor decreasesby one)are thesameandequalto n(N−nN ).Theallelefrequencychangesare thusneutral.

• Allelic selection may favor one allele over the other by dif-ferentiated reproductive success. The Moran machinery de-scribed previously can be adapted to model relative fitnesses by permitting that a given allele ai has a fitness advan-tage/disadvantage

(

1+

σ

ai

)

over the others (Durrett, 2008;

Borgesetal., 2019).

σ

refersto thevectorofrelative selection coefficients:i.e.areferencealleleischosentohavefitness1,or selectioncoefficient0.

Taking intoaccount thepopulation processesdescribedso far, thePoMoinstantaneousratematrixQis

q

(

{

n1 ai,

(

Nn1

)

aj

}

,

{

n2 ai,

(

Nn2

)

aj

}

)

=

μ

aiaj=

ρ

aiaj

π

aj n1 =N, n2 =N−1

μ

ajai=

ρ

aiaj

π

ai n1 =0, n2 =1

n1

N

(

Nn1

)(

1+

σ

ai

)

n2 =n1 +1, 0<n1 <N

n1

N

(

Nn1

)(

1+

σ

aj

)

n2 =n1 −1, 0<n1 <N

0

|

n1 −n2

|

>1

, (2)

where n1 andn2 represent a shift inthe allelefrequencies. Fre-quencyshiftslargerthan1aredisallowed(lastconditioninEq.(2)) makingPoMoratematricestypicallysparse.Thediagonalelements are defined such that the respective rowsum is 0. The station-ary distribution of PoMo is obtained by satisfying the condition

ψ

Q =0.

ψ

isthe normalizedstationaryvector andhas the solu-tion

ψ

(

α

)

=

π

ai

(

1+

σ

ai

)

N−1 k−1 if

α

=

{

Nai

}

π

ai

π

aj

ρ

aiaj

(

1+

σ

ai

)

n−1 if

α

=

{

na

i,

(

Nn

)

aj

}

(

1+

σ

aj

)

Nn−1 N n(N−n)k−1

,

(3)

where k is the normalization constant and

α

a PoMo state (Borgesetal.,2019).

When using PoMo to infer phylogenies, we make use of two important assumptions that are convenient for the proof of con-sistency presented here. Our firstassumption is that sites evolve independently. Thus the probability that sequence A evolves to sequence B equals the product of theprobability ofevolutionary pathsacrossallsitesS.Asaresult,thesitepatternscreatedbythe Lspeciesareindependentlyandidenticallydistributed(i.i.d.). The second assumption is that the evolutionary process is stationary and reversiblewith equilibriummeasure

ψ

. Proof ofreversibility andstationarityhavebeenprovidedbySchrempf etal.(2016) for theneutralPoMomodelandbyBorgesetal.(2019)forthemodel withallelicselection.

3. IdentifiabilityofPoMo

Identifiabilityisthe inverseproblemoffindingthetree

τ

and thetransitionmatrixPgivenjusttheprobabilityofthevarioussite patterns

χ

(orfrequencypatterns,inthecaseofPoMo)(Steeletal., 1998). Identifiability comes from the restrictions that must be placed on P for

τ

to be uniquely describedby theprobability of generating a pattern

χ

. These restrictions have been extensively studied(ChangandHartigan,1991;Steel,1994;Chang,1996;Steel etal.,1998).

Let us assume a tree

τ

with

η

representing the mostrecent commonancestorofLspecies(i.e.therootofthetree).Theedges

=

(

ν

1 ,

ν

2

)

of

τ

aredirectedawayfromtheroot

η

insuchaway that

ν

1 liesbetween

η

and

ν

2 .Ifweattribute statestoeach ver-texinthetree

τ

,beginningfromtheroot

η

toallthedescending vertices,we canrepresenttheprobability ofgeneratingpattern

χ

as

fχ= χ

p

(

χ

η

)

=(ν12)

P

(

χ

ν1,

χ

ν2

)

, (4)

where

χ

extends

χ

andrepresentsthetotalassignmentofstates, andp(

χ

η) istheprobability ofstate

χ

η attheroot. Thealphabet ofPoMohas4+6

(

N−1

)

statesandthusthereare[4+6

(

N−1

)

]L possiblefrequencypatterns

χ

inatreewithLleaves.

As PoMo makes some assumptions regarding theevolutionary process(Schrempfetal.,2016),wecanfurthersimplifyEq.(4):(i) frequency changes on edges are described by a continuous-time Markov process; (ii) the PoMo rate matrix Q is the same for all edges ofthe tree;and(iii)the distributionoffrequencystates at therootp(

χ

η)issimplytheequilibriumdistribution

ψ

.

fχ =

χ

ψ

(

χ

η

)

=(ν1,ν2)

exp

{

tQ

}

(

χ

ν1,

χ

ν2

)

, (5)

wheretisthelengthofedge

.Conditionsrepresentedby(4)and

(5)are alsoknown astheMarkovproperty, whichis anecessary thoughnotsufficientconditionforidentifiability.

Following(Steel,1994;Steeletal.,1998),anotherconditionfor identifiabilityadditionalto(5)needstohold:det

(

P

)

= 0,1,−1for all edges

inthetreeand

ψ

(

α

)=0foreachPoMo state

α

.Then thetree

τ

canbeuniquelyrecoveredfromthefrequencypatterns

χ

(Steel,1994;Steeletal.,1998).ByJacobi’sformula(theorem2.12 in Hall, 2015), we can write that det

(

P

)

=exp

{

tr Q

}

and there-fore

det

(

P

)

=exp

{

aiajA

ai=aj

ρ

aiaj

(

π

ai+

π

aj

)

1

2

(

N−1

)(

N+1

)

(

4+

aiA

σ

ai

)

}

(6)

Theselectivepressures andthe mutationrates,asmodeled in PoMo,canonlyberealpositivenumbers.Theseratescannot be0 because they would represent very unlikely and biologically un-reasonable scenarios: If 1+

σ

ai=0 the individual carrying allele ai wouldimmediatelydie (i.e.an extremely deleteriousallele); if

π

ai=0or

ρ

aiaj=0(bothimply

μ

ai=0)thealleleaidoesnotarise bymutation.Inbothsituations,suchalleleshouldnotbeobserved atallinthepopulation,andonecouldsimplyusePoMowithK −1 alleles,whereKisthenumberofallelesinthepopulation. There-fore,wecaneasilyconcludethat0<det

(

P)<1foralledges

of

τ

.

Because we assume the equilibrium distribution of allele fre-quencies inthe root, weneed to check whetheranyelementsof

ψ

canbe equal to0.The PoMo stationarydistribution isdefined for two different types of states

α

, the monomorphic (or fixed) andthe polymorphic ones. As shownin Eq.(3), thesestates can only have 0 probability if any of the population parameters are 0. Therefore, we conclude that

ψ

(

α

)>0 for all PoMo states

α

. Summarizing,PoMomodelsrespecttheconditionsfor identifiabil-ityandweconcludethat PoMo treesare uniquelyidentifiableby thefrequencypatternstheyinduce.Identifiabilitycanbeextended to the multivariate case by defining an alphabet A over K alle-lesanda state-spaceofK+KK−1

2

(

N−1

)

states.Forsimplicitywe haveconsidered thefour-variatecase, butthe proofsshownhere remainvalidforthe multivariate Moranmodel withallelic selec-tion(Borgesetal.,2019).

4. ConsistencyoftheBayesianPoMo

Toevaluate theconsistency propertyunder PoMo,we have to prove that a giventree estimatorconverges in probability to the truetreeasthenumberofsitesSincreasesindefinitely.Ithas al-readybeenformally shownthattheMAPtree(i.e.thetree topol-ogy that has the greatest posterior probability), estimated in a Bayesianframework,providesastatisticallyconsistentestimatorof thetrue tree(Steel, 2013). Consistencywasprovenunder a wide variety of conditions (Steel, 2013), including tree inference from alignedsequencesacrosstheentireparameterrange,andwiththe usageofgeneralpriorsinmodels wheretheidentifiability condi-tionholds.Here,weshowthattheseconditionsarealsometunder PoMo.

Supposewearegiveni.i.d.sitepatterns

χ

=

(

χ

1 ,...,

χ

S

)

gener-atedbysome unknownparameters(

τ

,

θ

)andwewishtoidentify thetopology

τ

from

χ

givenprior densitiesonthesetoffully re-solvedtreesTandthecontinuousparameters

θ

.Theseparameters includethe branch lengths

ω

,the mutations rates (defined by

π

and

ρ

) and the selection coefficients

σ

. Suppose we have a dis-cretepriorprobabilitydistributionp(

τ

)onT,and,foreach

τ

T,a continuouspriorprobabilitydistributionon (

τ

)witha probabil-itydensityfunctionpτ(

θ

).

Inparticular,ifthefollowingfourconditionshold:

• C1:p(

τ

)>0;

• C2: the densitypτ(

θ

) is continuous,bounded andnonzero on (

τ

);

• C3: the function

θ

p( τ, θ) (

χ

) is continuous and nonzero on (

τ

);

• C4: identifiability,guaranteeing that

τ

is uniquely identifiable by

χ

.

Steel(2013)hasshownthat

lim S→∞p

(

τ

,

θ

,S

|

χ)

=1 , (7)

where p(

τ

∗,

θ

, S|

χ

) is probability that the MAP correctly selects

(4)

Fig.2. Consistency oftheBayesianPoMo.(A)Topology(andrelativebranchlengths)usedforsimulations andthetwotop-visitedwrong topologiessampledfromthe posterior.Thesealternativetopologiesfollowtheexpectedtreediscordanceduetoincompletelineagesortingbetweenpopulations2and3.(B)Posteriorprobabilitiesofthe MAPandtruetopologiesforthetwosimulatedscenariosIandII.TheexpectedheterozygosityHeanddivergenceDusedtosimulateeachscenarioweretakenfromnatural

populationsofD.simulans(scenarioI(Begunetal.,2007))andgreatapes(scenarioII(Prado-Martinezetal.,2013)).TheasteriskindicatesthattheMAPtreecorrespondsto thetruetree.Numbersingreyindicatethenumberofsampledposteriortopologies.

θ

)] whereEθ is the expectation ofthe likelihood withrespectto thepriorprobabilityof (

τ

).

Consistency can then be inherently gained by Bayesian infer-ence.PoMocanbeeasilyplacedinaBayesianframework.Indeed, wecaneasilymeetconditions1and2instandardBayesian phylo-geneticinferencesoftware(e.g.BEAST(Bouckaertetal.,2019)and RevBayes(Höhnaetal.,2016)).C1requiresthatanonzeroprioris setonthetreetopology,whichcanbeeasilyimplementedby set-tingauniformprior. If

θ

takestheusual exponential/gamma pri-ors on the branch lengths and rate parameters (

ρ

,

σ

and

ω

) or theDirichlet distribution on

π

(which can actually be generated froma setofK-independent gammarandomvariables), condition C2ismet.ConditionsC3andC4holdforallMarkovprocesseson anytree with pendant edges of positive lengths (i.e. on any bi-nary metric-tree) forwhich identifiability wasproven. Therefore, asshownintheprevioussection, C3andC4 holdforPoMo. Con-sequently,the MAPtreeunder PoMo isa consistent estimator of thespeciestree.

5. Asimulation-basedexampleofconsistencywithPoMo

Consistencyguaranteestheidentificationofthecorrect param-etervalueswithinfinitesequencelengths. Inrealdatasituations, the sequence length is finite as is the running time. We have nevertheless tested the consistency of the tree topology for the BayesianPoMoestimatorusingsimulatedpopulationdatasets.

We simulatedalignments of10 000sitesunderPoMo usinga phylogeny of five populations as shown in Fig. 2A with relative branch lengths definedin such a way that we have two closely andtwodistantly(i.e.twicetheexpecteddivergence)related pop-ulations. We simulated two scenarios Iand II by mimicking fast andslowlyevolvingpopulations (expected divergence D equalto 0.3and0.02substitutionspersite,respectively)withexpected het-erozygosityHe asobservedinDrosophilasimulanspopulationsand among great apes (0.0015 and 0.018, respectively) (Begun et al., 2007;Prado-Martinezetal.,2013).

To test for consistency, we created four alignments including thefirst 10,100, 1000,10 000sites.We fed these alignments to

RevBayes (Höhna et al., 2016) and performed standard Bayesian phylogenyestimation onthem. Weran PoMo fortwo chainsand 20000generations,keepingevery10thiteration.Aburningperiod of 10% was defined by checking mixing, convergence, and auto-correlationoftheMCMCchains.

We observed that MAP already recovers the true treefor the averagegenelengthof1000sites(Fig.2B).Thesesimulated scenar-ioscomplementourproofsthattheMAPisaconsistentestimator ofthetruetree.Weobservedfurtherthat populationswithmore heterozygousalignments (i.e.scenarioI)convergetothetruetree faster.Theeffectofincompletelineagesortingisevidentinthese examples, as the top-sampled wrongtopologies (Fig. 2A) cluster the closelyrelated populations 3 and2 with theclade including population 4 and5.These topologies are,for a fewer numberof sites,theMAPtreesofscenariosIandII.

6. Conclusions

Here we prove that PoMo is identifiable and further that the MAPisastatisticallyconsistentestimatorofthespeciestreewhen PoMo is placed in a Bayesian framework. This is the first time identifiabilityandconsistencywereshownforthe polymorphism-awarephylogeneticmodels.

Identifiability of PoMo can be easily extended to important, more general models. Identifiability should be kept for general-izations that work onthe standard PoMo instantaneous rate ma-trix: This is, for example, the case with balancing selection act-ing along with geneticdrift. More complicated extensionsof the PoMomodelswouldincludethejointinferenceofgenetreeswith the species trees; currently, PoMo directly estimates the species tree. Steel (2013) suggested that identifiability and consistency should still apply, as the gene trees can be viewed as nuisance parameters. A formal proof for thisstatement, especially for the case of gene trees undergoing gene gain, loss, and transfer, is missing.

(5)

thetypeofdataPoMoisappliedto:population-scaleand genome-wide.Thisdataiscostlyandtechnicallydifficulttoobtainand in-deedthedatasetsavailableintheliteraturearefew(e.g.humans (Altshuler et al., 2010), great apes (Prado-Martinez et al., 2013), fruit flies(Lack et al., 2015) and Arabidopsis (Long et al., 2013)). It is only naturalto expect that higher samples sizes are repaid withlesserroneousestimatesofmodelparameters,includingthe speciestree.

Oursimulatedexamples,thoughsimple,showthattheMAP re-coversthetruetreeevenwhentheexpectedheterozygosityisvery low andwith incompletelineagesortingis present. ILS isa very well-known causeofdiscordancebetweengeneandspeciestrees (MaddisonandKnowles,2006),byaffectingtheprobability of vis-ited wrong topologies. Statistical consistency is thus a desirable property forphylogenyestimation onclosely relatedpopulations. As we have seen, the MAP tree corresponds already to the true topologyfor1000sites,evenforlessdiversespecies.

Asfuturework,wewouldwanttoexplorethesequencelength requirement underPoMo. This is essentially the sequence length thataphylogenyreconstructionmethodneedstorecoverthetrue tree with a considerably small error (Atteson, 1999). A low se-quencelengthrequirementisaconditionforacomputationally ef-ficient method.Inparticular, itwouldbe important todetermine whetherPoMoisafastconvergingmethod(i.e.amethodthat re-quiresasequencelengthofonlyO

(

poly

(

n

))

).

Funding

ThisworkwassupportedbytheViennaScienceandTechnology Fund(WWTF)[MA16-061].

DeclarationofCompetingInterest

Theauthorshavenoconflictsofinterest.

Acknowledgements

WethankBastienBoussauforhelpingwithRevBayes,andAsger Hobolt andLynette Mikula for usefulcomments and suggestions onanearlierversionofthismanuscript.

References

Allman,E.S., Ané, C.,Rhodes, J.A., 2008.IdentifiabilityofaMarkovianmodel of molecularevolutionwithgamma-distributedrates.Adv.Appl.Probab.40(01), 229–249.doi:10.1239/aap/1208358894.

Altshuler, D.L., Durbin, R.M., Abecasis, G.R., Bentley, D.R., Chakravarti, A., Clark,A.G.,Collins,F.S.,DeLaVega,F.M.,Donnelly,P.,Egholm,M., Flicek,P., Gabriel,S.B.,Gibbs,R.A.,Knoppers,B.M.,Lander,E.S.,Lehrach,H.,Mardis,E.R., McVean, G.A., et al., 2010. A map of human genome variation from population-scalesequencing.Nature467(7319),1061–1073.doi:10.1038/nature 09534.

Atteson,K.,1999.Theperformanceofneighbor-joiningmethodsofphylogenetic re-construction.Algorithmica25(2–3),251–278.doi:10.1007/PL00008277. Begun, D.J., Holloway, A.K., Stevens, K., Hillier, L.W., Poh, Y.-P., Hahn, M.W.,

Nista, P.M.,Jones, C.D., Kern,A.D., Dewey,C.N.,Pachter,L., Myers, E., Lang-ley,C.H.,2007.Populationgenomics:whole-genomeanalysisofpolymorphism and divergencein Drosophilasimulans. PLoS Biol. 5(11), e310. doi:10.1371/ journal.pbio.0050310.

Borges,R.,Szöll˝osi,G.J.,Kosiol,C.,2019.QuantifyingGC-biasedgeneconversionin greatapegenomesusingpolymorphism-awaremodels.Genetics212(4),1321– 1336.doi:10.1534/genetics.119.302074.

Bouckaert, R., Vaughan, T.G., Barido-Sottani, J., Duchêne, S., Fourment, M., Gavryushkina,A.,Heled,J.,Jones,G.,Kühnert,D.,DeMaio,N.,Matschiner,M., Mendes,F.K.,Müller,N.F.,Ogilvie,H.A.,duPlessis,L.,Popinga,A.,Rambaut,A., Rasmussen, D., etal., 2019. BEAST 2.5: anadvanced software platform for Bayesianevolutionaryanalysis. PLoSComput. Biol. 15(4), e1006650.doi:10. 1371/journal.pcbi.1006650.

Chai,J.,Housworth,E.A.,2011.OnRogers’proofofidentifiabilityfortheGTR++ imodel.Syst.Biol.60(5),713–718.doi:10.1093/sysbio/syr023.

Chang,J.,Hartigan,J., 1991.Reconstructionofevolutionarytreesfrompairwise dis-tributionsoncurrentspecies.In:ComputingScienceandStatistics:Proceedings ofthe23rdSymposiumontheInterface,pp.254–257.

Chang, J.T., 1996. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci. 137 (1), 51–73. doi:10.1016/ S0025-5564(96)00075-2.

DeMaio,N.,Schlötterer,C.,Kosiol,C.,2013.Linkinggreatapesgenomeevolution acrosstimescalesusingpolymorphism-awarephylogeneticmodels.Mol.Biol. Evol.30(10),2249–2262.doi:10.1093/molbev/mst131.

DeMaio,N.,Schrempf,D.,Kosiol,C., 2015.PoMo: anallelefrequency-based ap-proachforspeciestreeestimation.Syst.Biol. 64(6),1018–1031.doi:10.1093/ sysbio/syv048.

Durrett, R., 2008. Probability Models for DNA Sequence Evolution. Probabil-ity and its Applications. Springer New York, New York, NY doi:10.1007/ 978-0-387-78168-6.

Felsenstein,J.,1973.Maximumlikelihoodandminimum-stepsmethodsfor estimat-ingevolutionarytreesfromdataondiscretecharacters.Syst.Biol.22(3),240– 249.doi:10.1093/sysbio/22.3.240.

Felsenstein,J.,1978. Casesinwhichparsimonyorcompatibilitymethodswillbe positivelymisleading.Syst.Biol.27(4),401–410.doi:10.1093/sysbio/27.4.401. Hall,B.,2015.LieGroups,LieAlgebras,andRepresentations:AnElementary

Intro-duction,seconded.SpringerInternationalPublishing,Switzerlanddoi:10.1017/ s0025557200177174.

Höhna,S.,Landis,M.J.,Heath,T.A.,Boussau,B.,Lartillot,N.,Moore,B.R., Huelsen-beck,J.P., Ronquist,F.,2016. RevBayes:Bayesianphylogenetic inferenceusing graphicalmodelsandaninteractivemodel-specificationlanguage.Syst.Biol.65 (4),726–736.doi:10.1093/sysbio/syw021.

Jukes,T.,Cantor,C.,1969.Evolutionofproteinmolecules.In:MammalianProtein Metabolism.Elsevier,pp.21–132.doi:10.1016/B978-1-4832-3211-9.50009-7. Kimura,M.,1980.Asimplemethodforestimatingevolutionaryratesofbase

sub-stitutionsthroughcomparativestudiesofnucleotidesequences.J.Mol.Evol.16 (2),111–120.doi:10.1007/BF01731581.

Lack,J.B.,Cardeno,C.M.,Crepeau,M.W.,Taylor,W.,Corbett-Detig,R.B.,Stevens,K.A., Langley,C.H.,Pool,J.E.,2015.TheDrosophilagenomenexus:apopulation ge-nomicresourceof623Drosophilamelanogastergenomes,including 197from asingleancestralrangepopulation.Genetics199(4),1229–1241.doi:10.1534/ genetics.115.174664.

Leaché,A.D.,Oaks,J.R.,2017.Theutilityofsinglenucleotidepolymorphism(SNP) datainphylogenetics.Annu.Rev.Ecol.Evol.Syst.48(1),69–84.doi:10.1146/ annurev-ecolsys-110316-022645.

Long, Q., Rabanal, F.A.,Meng, D., Huber, C.D., Farlow, A., Platzer, A.,Zhang, Q., Vilhjálmsson,B.J., Korte,A., Nizhynska,V.,Voronin,V.,Korte,P., Sedman,L., Mandáková,T.,Lysak,M.A.,Seren,Ü.,Hellmann,I.,Nordborg,M.,2013. Mas-sivegenomicvariationandstrongselection inArabidopsisthalianalinesfrom Sweden.Nat.Genet.45(8),884–890.doi:10.1038/ng.2678.

Maddison,W.P.,Knowles,L.L.,2006.Inferringphylogenydespiteincompletelineage sorting.Syst.Biol.55(1),21–30.doi:10.1080/10635150500354928.

Mirarab,S.,Bayzid,M.S.,Boussau,B.,Warnow,T.,2014.Statisticalbinningenables anaccuratecoalescent-basedestimationoftheaviantree.Science346(6215), 1250463.doi:10.1126/science.1250463.

Moran,P.,1958.Randomprocessesingenetics.Math.Proc.CambridgePhilos.Soc. 54(01),60.doi:10.1017/S0305004100033193.

Pagel,M.,1999.Inferringthehistoricalpatternsofbiologicalevolution.Nature401 (6756),877–884.doi:10.1038/44766.

Prado-Martinez,J.,Sudmant,P.H.,Kidd,J.M.,Li,H.,Kelley,J.L.,Lorente-Galdos,B., Veeramah, K.R., Woerner, A.E., O’Connor, T.D., Santpere, G., Cagan, A., The-unert, C.,Casals, F.,Laayouni, H.,Munch,K., Hobolth,A., Halager,A.E., Ma-lig,M.,etal.,2013.Greatapegeneticdiversityandpopulationhistory.Nature 499(7459),471–475.doi:10.1038/nature12228.

Rogers,J.,Raveendran,M., Harris, R.A.,Mailund,T., Leppälä,K.,Athanasiadis, G., Schierup, M.H., Cheng, J., Munch, K., Walker, J.A., Konkel, M.K., Jordan, V., Steely, C.J., Beckstrom, T.O., Bergey, C., Burrell, A., Schrempf, D., Noll, A., Kothe,M.,etal.,2019.Thecomparativegenomicsandcomplexpopulation his-toryofpapiobaboons.Sci.Adv.5(1),eaau6947.doi:10.1126/sciadv.aau6947. Rogers,J.S.,2001.Maximumlikelihoodestimationofphylogenetictreesis

consis-tentwhensubstitutionratesvaryaccordingtotheinvariablesitesplusgamma distribution.Syst.Biol.50(5),713–722.doi:10.1080/106351501753328839. Schrempf,D.,Hobolth,A.,2017.Analternativederivationofthestationary

distribu-tionofthemultivariateneutralwrightfishermodelforlowmutationrateswith aviewtomutationrateestimationfromsitefrequencydata.Theor.Popul.Biol. 114,88–94.doi:10.1016/j.tpb.2016.12.001.

Schrempf,D.,Minh,B.Q.,DeMaio,N.,vonHaeseler,A.,Kosiol,C.,2016.Reversible polymorphism-awarephylogeneticmodelsandtheirapplicationtotree infer-ence.J.Theor.Biol.407,362–370.doi:10.1016/j.jtbi.2016.07.042.

Steel,M., 1994.Recovering a treefrom the leafcolourationsit generates under aMarkov model. Appl.Math. Lett.7 (2),19–23. doi:10.1016/0893-9659(94) 90024-8.

Steel,M.,2013.ConsistencyofBayesianinferenceofresolvedphylogenetictrees.J. Theor.Biol.336,246–249.doi:10.1016/j.jtbi.2013.08.012.

(6)

Szöllsi,G.J., Tannier,E.,Daubin,V.,Boussau,B.,2015.Theinferenceofgenetrees withspeciestrees.Syst.Biol.64(1),e42–e62.doi:10.1093/sysbio/syu048. Tajima,F.,Nei,M., 1984.Estimationofevolutionarydistancebetweennucleotide

sequences. Mol.Biol.Evol.1(3),269–285.doi:10.1093/oxfordjournals.molbev. a040317.

Tavaré,S.,1986.SomeprobabilisticandstatisticalproblemsintheanalysisofDNA sequences.Lect.Math.LifeSci.17(2),57–86.

Wald,A.,1949.Noteontheconsistencyofthemaximumlikelihoodestimate.Ann. Math.Stat.20(4),595–601.doi:10.1214/aoms/1177729952.

Wu, J., Susko, E., 2010. Rate-variation need not defeat phylogenetic inference throughpairwisesequencecomparisons.J.Theor.Biol.263(4),587–589.doi:10. 1016/j.jtbi.2009.12.022.

RuiBorges InstitutfürPopulationsgenetik,VetmeduniVienna,Veterinärplatz1, Wien1210,Austria

CarolinKosiol∗ InstitutfürPopulationsgenetik,VetmeduniVienna,Veterinärplatz1, Wien1210,Austria CentreforBiologicalDiversity,UniversityofStAndrews,StAndrews, FifeKY169TH,UK

Correspondingauthor. E-mailaddress:[email protected](C. Kosiol)

References

Related documents

innovation in payment systems, in particular the infrastructure used to operate payment systems, in the interests of service-users 3.. to ensure that payment systems

National Conference on Technical Vocational Education, Training and Skills Development: A Roadmap for Empowerment (Dec. 2008): Ministry of Human Resource Development, Department

UDL provides the following three guiding principles to consider when planning lessons so that children are properly stimulated in ways that allow for learning: (1) multiple means of

Marie Laure Suites (Self Catering) Self Catering 14 Mr. Richard Naya Mahe Belombre 2516591 [email protected] 61 Metcalfe Villas Self Catering 6 Ms Loulou Metcalfe

The corona radiata consists of one or more layers of follicular cells that surround the zona pellucida, the polar body, and the secondary oocyte.. The corona radiata is dispersed

Currently, National Instruments leads the 5G Test &amp; Measurement market, being “responsible for making the hardware and software for testing and measuring … 5G, … carrier

Creation of a self-assessment tool to assess perceived competence for counseling could provide an efficient means for evaluating longitudinal evaluation of resident training

2012 2011 U.S. The intent of this strategy is to minimize plan expenses by exceeding the interest growth in long-term plan liabilities. Risk tolerance is established