Journal of Theoretical Biology
journal homepage: www.elsevier.com/locate/jtb
Letter to Editor
Consistency and identifiability of the polymorphism-aware phylogenetic models
1. Introduction
Phylogenies help to answer a multitude of questions regarding the origin of species, the tempo of evolution, the origin of particular traits and the processes (either neutral or selective) of molecular evolution (e.g. see Pagel, 1999). Phylogenies are therefore central to the study of patterns and processes by which evolution happens. However, phylogenies can only serve these purposes if they are correctly estimated. Fortunately, mathematical phylogenetics provides criteria that help us to assess whether a given phylogeny estimator is statistically sound. Statistical consistency and identifiability are two examples of such criteria.
In phylogenetic theory, a phylogenetic reconstruction method $\hat{\tau}$ is statistically consistent under a model if it converges in probability to the true tree $\tau$ as the number of sites $S$ of the sequence alignment increases indefinitely (Wald, 1949; Felsenstein, 1973), i.e.

$$\lim_{S \to \infty} p\left(\left|\hat{\tau}_S \to \tau\right|\right) = 1. \tag{1}$$

Identifiability is met if the model of evolution is uniquely characterized by the probability distribution it defines (Chang and Hartigan, 1991). An identifiable model is a necessary condition for consistency. More formal conditions for identifiability and consistency are described in Steel (1994), Chang (1996) and Steel (2013); these are revisited later in this article. Lack of statistical consistency has long been a concern for phylogeneticists. For example, some maximum parsimony methods were criticized early on for lacking statistical consistency (Felsenstein, 1978).
Simple phylogeny estimation methods using standard substitution models (e.g. Bayesian or maximum likelihood inference under the Jukes and Cantor, 1969; Kimura, 1980; Tajima and Nei, 1984 substitution models) can be shown to enjoy statistical consistency (Chang, 1996). However, the same principle cannot be directly extended to more complex and general methods of tree inference that include rate variation across sites ($\Gamma$) and invariant sites (I). For such models of evolution, the model distribution is not fully understood, which complicates proving identifiability. Identifiability has been proven for the pure general time-reversible model (GTR, Tavaré, 1986). The GTR with rate variation (i.e. GTR+$\Gamma$) (Wu and Susko, 2010) also enjoys identifiability; however, a rigorous proof for the commonly utilized GTR+$\Gamma$+I is still lacking (Rogers, 2001; Allman et al., 2008; Chai and Housworth, 2011).
In recent years, alternative approaches to the classic nucleotide substitution models have been proposed. The polymorphism-aware phylogenetic models (PoMo) are an example of such an alternative approach for tree estimation that also accounts for incomplete lineage sorting (De Maio et al., 2013). While PoMo can be more broadly classified as a nucleotide substitution model, it adds a new layer of complexity by accounting for population-level evolutionary processes (such as mutation, genetic drift, and selection) to describe the evolutionary process (De Maio et al., 2015; Schrempf et al., 2016; Borges et al., 2019). To do so, PoMo builds on a GTR-like mutation scheme and expands the {A, C, G, T} state-space to include polymorphic states, thereby accounting for current and ancestral intra-population variation. The latter aspect sets PoMo apart from the classic models of evolution, which traditionally only use a single representative DNA sequence per species.
PoMo has received substantial attention from the evolutionary community (see Mirarab et al., 2014; Szöllősi et al., 2015; Leaché and Oaks, 2017). Several publications have employed it to solve a wide range of evolutionary questions, e.g., disentangling phylogenetic relationships among baboon species (Rogers et al., 2019), describing the phylogeographic history of great apes (Schrempf et al., 2016), estimating patterns of GC-bias and mutational biases (De Maio et al., 2013; Borges et al., 2019) and inferring the site-frequency spectrum (Schrempf and Hobolth, 2017; Borges et al., 2019) from population data.
All this raised the question of whether PoMo is a statistically consistent phylogeny estimator for phylogenetic data sets. Building upon the formal results provided by Steel (2013) and the identifiability of the PoMo rate matrix and stationary distribution, we present a proof that the MAP topology (i.e. the tree topology that has the greatest posterior probability) is a statistically consistent estimate of the true tree. This result shows that PoMo is a statistically sound method of phylogenetic inference, and it provides validity for further investigations and uses of PoMo methods on real data sets.
2. Polymorphism-aware phylogenetic models in a nutshell
PoMo assumes a Moran model (Moran, 1958) with $N$ haploid individuals and defines the allele trajectory of a single locus with four possible alleles $a_i$, where $i \in \mathcal{A} = \{A, C, G, T\}$ (Table 1 includes a glossary of the symbols used in the present proof). The evolution of this population in the course of time is described by a continuous-time Markov chain with a discrete state-space defined by $N$ and the four alleles. States are monomorphic (or boundary) if all the $N$ individuals have the same allele, $\{N a_i\}$, or polymorphic if two alleles are present in the population, $\{n a_i, (N-n) a_j\}$ (Fig. 1).

A rate matrix $Q$ describing the dynamics between the boundary and polymorphic states can be defined by considering several population-level processes. So far, PoMo includes mutation, genetic drift and allelic selection (Fig. 1) (De Maio et al., 2015; Schrempf et al., 2016; Borges et al., 2019).
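As an illustration of the state-space just described, the following minimal Python sketch enumerates the monomorphic and polymorphic states for a given $N$. The function name `pomo_states` and the tuple encoding of the states are assumptions made for this example only; they are not taken from any PoMo implementation.

```python
from itertools import combinations


def pomo_states(N, alleles=("A", "C", "G", "T")):
    """Enumerate the PoMo states for a population of size N.

    A monomorphic state {N a_i} is encoded as (a_i, N, None); a polymorphic
    state {n a_i, (N - n) a_j} is encoded as (a_i, n, a_j) with 0 < n < N.
    """
    states = [(a, N, None) for a in alleles]
    for ai, aj in combinations(alleles, 2):
        states.extend((ai, n, aj) for n in range(1, N))
    return states


states = pomo_states(N=10)
print(len(states))  # number of boundary plus polymorphic states
```

With $K$ alleles the list has one entry per boundary state plus $N-1$ polymorphic states for each of the $K(K-1)/2$ allele pairs, i.e. $4 + 6(N-1)$ entries in the four-allele case.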
https://doi.org/10.1016/j.jtbi.2019.110074
Table 1
Glossary. The order of the symbols follows their first occurrence in the text.

Symbol                 Description
$\tau$                 tree topology
$\hat{\tau}$           tree topology estimator
$S$                    number of sites
$a_i$                  allele $i$
$\mathcal{A}$          nucleotide bases {A, C, G, T}
$N$                    population size
$n$                    absolute frequency of an allele; the relative frequency would be $n/N$
$Q$                    PoMo instantaneous rate matrix with elements $q$
$\pi$                  component of the mutation rate; stationary frequencies in the GTR model of evolution
$\rho$                 component of the mutation rate; base exchangeabilities in the GTR model of evolution
$\sigma$               allelic selection coefficients
$k$                    normalization constant of the stationary vector
$\alpha$               PoMo state
$K$                    number of alleles; $K = 4$ for PoMo
$L$                    number of taxa/sequences
$P$                    transition matrix
$\chi$                 site pattern or, better, PoMo frequency pattern
$\eta$                 root vertex
$(\nu_1, \nu_2)$       directed edge
$\omega$               branch lengths
$\theta$               set of PoMo parameters and branch lengths
Fig. 1. PoMo state-space and transition rates. The two alleles represent any of the four nucleotide bases A, C, G, and T. Orange and grey distinguish the role of mutation and genetic drift plus selection, respectively. The PoMo state-space includes monomorphic (or boundary) states $\{N a_i\}$ and polymorphic states $\{n a_i, (N-n) a_j\}$. Monomorphic states interact with polymorphic states via mutation, and polymorphic states reach monomorphic states via drift or selection. Between polymorphic states only drift and selection occur. PoMo is thus a particular case of the multivariate Moran model with boundary mutations and selection when four alleles (i.e. the four nucleotide bases) are considered.
• Mutations are assumed to occur only in the boundary states $\{N a_i\}$ with mutation rates $\mu_{a_i a_j} = \rho_{a_i a_j} \pi_{a_j} = \rho_{a_j a_i} \pi_{a_j}$. Mutation rates are thus modeled according to a GTR model of evolution (Tavaré, 1986), where $\pi$ represents the stationary frequencies and $\rho$ the six exchangeability terms. Similar interpretations of these parameters are still valid for the neutral PoMo, as $\pi$ immediately informs the stationary frequencies of the monomorphic states. However, the stationary frequencies of the monomorphic states are no longer defined by $\pi$ alone in the more general PoMo with allelic selection (Borges et al., 2019) (Eq. (3)). All in all, the GTR-like mutation scheme is a convenient strategy to obtain quantities of interest for PoMo.
• Genetic drift rules the allele frequency changes in a population. Genetic drift is modeled according to the Moran model, in which one individual is chosen to reproduce (i.e. to copy itself) and one to die in each time step (Moran, 1958). Therefore, the rates at which an allele with a starting frequency $n/N$ is born or dies (i.e. the allele count increases or decreases by one) are the same and equal to $n(N-n)/N$. The allele frequency changes are thus neutral.
• Allelic selection may favor one allele over the other through differential reproductive success. The Moran machinery described previously can be adapted to model relative fitnesses by permitting a given allele $a_i$ to have a fitness advantage/disadvantage $(1 + \sigma_{a_i})$ over the others (Durrett, 2008; Borges et al., 2019). $\sigma$ refers to the vector of relative selection coefficients, i.e. a reference allele is chosen to have fitness 1, or selection coefficient 0.

Taking into account the population processes described so far, the PoMo instantaneous rate matrix $Q$ is
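$$q\left(\{n_1 a_i, (N-n_1) a_j\}, \{n_2 a_i, (N-n_2) a_j\}\right) =
\begin{cases}
\mu_{a_i a_j} = \rho_{a_i a_j}\pi_{a_j} & n_1 = N,\ n_2 = N-1 \\
\mu_{a_j a_i} = \rho_{a_i a_j}\pi_{a_i} & n_1 = 0,\ n_2 = 1 \\
\dfrac{n_1}{N}(N-n_1)(1+\sigma_{a_i}) & n_2 = n_1 + 1,\ 0 < n_1 < N \\
\dfrac{n_1}{N}(N-n_1)(1+\sigma_{a_j}) & n_2 = n_1 - 1,\ 0 < n_1 < N \\
0 & |n_1 - n_2| > 1,
\end{cases} \tag{2}$$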
where $n_1$ and $n_2$ represent a shift in the allele frequencies. Frequency shifts larger than 1 are disallowed (last condition in Eq. (2)), making PoMo rate matrices typically sparse. The diagonal elements are defined such that the respective row sum is 0. The stationary distribution of PoMo is obtained by satisfying the condition $\psi Q = 0$. $\psi$ is the normalized stationary vector and has the solution

$$\psi(\alpha) =
\begin{cases}
\pi_{a_i}(1+\sigma_{a_i})^{N-1}\, k^{-1} & \text{if } \alpha = \{N a_i\} \\
\pi_{a_i}\pi_{a_j}\rho_{a_i a_j}(1+\sigma_{a_i})^{n-1}(1+\sigma_{a_j})^{N-n-1}\dfrac{N}{n(N-n)}\, k^{-1} & \text{if } \alpha = \{n a_i, (N-n) a_j\},
\end{cases} \tag{3}$$

where $k$ is the normalization constant and $\alpha$ a PoMo state (Borges et al., 2019).

When using PoMo to infer phylogenies, we make use of two important assumptions that are convenient for the proof of consistency presented here. Our first assumption is that sites evolve independently. Thus the probability that sequence A evolves to sequence B equals the product, over all $S$ sites, of the probabilities of the per-site evolutionary paths. As a result, the site patterns created by the $L$ species are independently and identically distributed (i.i.d.). The second assumption is that the evolutionary process is stationary and reversible with equilibrium measure $\psi$. Proofs of reversibility and stationarity have been provided by Schrempf et al. (2016) for the neutral PoMo model and by Borges et al. (2019) for the model with allelic selection.

3. Identifiability of PoMo
Identifiability is the inverse problem of finding the tree $\tau$ and the transition matrix $P$ given just the probability of the various site patterns $\chi$ (or frequency patterns, in the case of PoMo) (Steel et al., 1998). Identifiability comes from the restrictions that must be placed on $P$ for $\tau$ to be uniquely described by the probability of generating a pattern $\chi$. These restrictions have been extensively studied (Chang and Hartigan, 1991; Steel, 1994; Chang, 1996; Steel et al., 1998).

Let us assume a tree $\tau$ with $\eta$ representing the most recent common ancestor of $L$ species (i.e. the root of the tree). The edges $(\nu_1, \nu_2)$ of $\tau$ are directed away from the root $\eta$ in such a way that $\nu_1$ lies between $\eta$ and $\nu_2$. If we attribute states to each vertex in the tree $\tau$, beginning from the root $\eta$ to all the descending vertices, we can represent the probability of generating pattern $\chi$ as

$$f_\chi = \sum_{\bar{\chi}} p(\bar{\chi}_\eta) \prod_{(\nu_1, \nu_2)} P(\bar{\chi}_{\nu_1}, \bar{\chi}_{\nu_2}), \tag{4}$$

where $\bar{\chi}$ extends $\chi$ and represents the total assignment of states, and $p(\bar{\chi}_\eta)$ is the probability of state $\bar{\chi}_\eta$ at the root. The alphabet of PoMo has $4 + 6(N-1)$ states and thus there are $[4 + 6(N-1)]^L$ possible frequency patterns $\chi$ in a tree with $L$ leaves.

As PoMo makes some assumptions regarding the evolutionary process (Schrempf et al., 2016), we can further simplify Eq. (4): (i) frequency changes on edges are described by a continuous-time Markov process; (ii) the PoMo rate matrix $Q$ is the same for all edges of the tree; and (iii) the distribution of frequency states at the root $p(\bar{\chi}_\eta)$ is simply the equilibrium distribution $\psi$. Hence

$$f_\chi = \sum_{\bar{\chi}} \psi(\bar{\chi}_\eta) \prod_{(\nu_1, \nu_2)} \exp\{tQ\}(\bar{\chi}_{\nu_1}, \bar{\chi}_{\nu_2}), \tag{5}$$

where $t$ is the length of edge $(\nu_1, \nu_2)$. The conditions represented by (4) and (5) are also known as the Markov property, which is a necessary though not sufficient condition for identifiability.
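To make Eq. (5) concrete, the sketch below assembles $Q$ from Eq. (2) and $\psi$ from Eq. (3) with NumPy and evaluates $f_\chi$ for every frequency pattern on the smallest possible tree, a root with two leaves. All function names, the dictionary-based parameterization and the parameter values are hypothetical choices for illustration. The matrix exponential is computed through the symmetrized matrix $D^{1/2} Q D^{-1/2}$ with $D = \mathrm{diag}(\psi)$, which relies on the reversibility of PoMo; this is a convenience of the sketch, not the method of any PoMo software.

```python
from itertools import combinations

import numpy as np


def pomo(N, pi, rho, sigma):
    """Return (Q, psi): rate matrix of Eq. (2) and stationary vector of Eq. (3).

    pi, sigma: dict allele -> value; rho: dict frozenset({a_i, a_j}) -> value.
    """
    alleles = sorted(pi)
    pairs = list(combinations(alleles, 2))
    states = [(a, N, None) for a in alleles]
    states += [(ai, n, aj) for ai, aj in pairs for n in range(1, N)]
    index = {s: k for k, s in enumerate(states)}

    def at(ai, n, aj):
        # State {n a_i, (N - n) a_j}; n = N and n = 0 are the monomorphic states.
        if n == N:
            return index[(ai, N, None)]
        if n == 0:
            return index[(aj, N, None)]
        return index[(ai, n, aj)]

    Q = np.zeros((len(states), len(states)))
    psi = np.zeros(len(states))
    for a in alleles:
        psi[index[(a, N, None)]] = pi[a] * (1 + sigma[a]) ** (N - 1)
    for ai, aj in pairs:
        r = rho[frozenset((ai, aj))]
        Q[at(ai, N, aj), at(ai, N - 1, aj)] = r * pi[aj]  # mutation a_i -> a_j
        Q[at(ai, 0, aj), at(ai, 1, aj)] = r * pi[ai]      # mutation a_j -> a_i
        for n in range(1, N):
            drift = n * (N - n) / N
            Q[at(ai, n, aj), at(ai, n + 1, aj)] = drift * (1 + sigma[ai])
            Q[at(ai, n, aj), at(ai, n - 1, aj)] = drift * (1 + sigma[aj])
            psi[at(ai, n, aj)] = (
                pi[ai] * pi[aj] * r
                * (1 + sigma[ai]) ** (n - 1) * (1 + sigma[aj]) ** (N - n - 1)
                * N / (n * (N - n))
            )
    np.fill_diagonal(Q, -Q.sum(axis=1))  # each row sums to zero
    return Q, psi / psi.sum()            # dividing by the sum plays the role of k


def transition_matrix(Q, psi, t):
    """exp(tQ) for a reversible Q, via the symmetric matrix D^(1/2) Q D^(-1/2)."""
    d = np.sqrt(psi)
    sym = d[:, None] * Q / d[None, :]
    w, V = np.linalg.eigh((sym + sym.T) / 2)
    return (V * np.exp(t * w)) @ V.T * d[None, :] / d[:, None]


def cherry_pattern_probs(Q, psi, t1, t2):
    """f_chi of Eq. (5) for all patterns on a two-leaf tree with edge lengths t1, t2."""
    P1, P2 = transition_matrix(Q, psi, t1), transition_matrix(Q, psi, t2)
    return np.einsum("x,xa,xb->ab", psi, P1, P2)


# Hypothetical parameter values, chosen only to exercise the functions.
alleles = "ACGT"
pi = dict(zip(alleles, (0.1, 0.2, 0.3, 0.4)))
sigma = dict(zip(alleles, (0.0, 0.1, 0.1, 0.0)))
rho = {frozenset(p): 0.01 * (k + 1) for k, p in enumerate(combinations(alleles, 2))}

Q, psi = pomo(5, pi, rho, sigma)
f = cherry_pattern_probs(Q, psi, t1=0.05, t2=0.1)
sign, logdet = np.linalg.slogdet(transition_matrix(Q, psi, 0.1))
print(np.abs(psi @ Q).max())      # close to zero if psi is stationary
print(f.sum())                    # pattern probabilities should sum to one
print(logdet, 0.1 * np.trace(Q))  # log det exp(tQ) should match t * tr(Q)
```

The last line compares $\log\det\exp\{tQ\}$ with $t\,\mathrm{tr}\,Q$, the identity behind the determinant condition discussed in this section. For larger trees the same ingredients would be combined with a pruning recursion rather than the explicit sum over root states used here.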
Following Steel (1994) and Steel et al. (1998), another condition for identifiability, in addition to (5), needs to hold: $\det(P) \neq 0, 1, -1$ for all edges in the tree and $\psi(\alpha) \neq 0$ for each PoMo state $\alpha$. Then the tree $\tau$ can be uniquely recovered from the frequency patterns $\chi$ (Steel, 1994; Steel et al., 1998).

By Jacobi's formula (theorem 2.12 in Hall, 2015), we can write that $\det(P) = \exp\{\operatorname{tr} Q\}$ and therefore

$$\det(P) = \exp\left\{ -\sum_{\substack{a_i a_j \in \mathcal{A} \\ a_i \neq a_j}} \rho_{a_i a_j}(\pi_{a_i} + \pi_{a_j}) - \frac{1}{2}(N-1)(N+1)\left(4 + \sum_{a_i \in \mathcal{A}} \sigma_{a_i}\right) \right\}, \tag{6}$$

where the first sum runs over the six unordered pairs of distinct alleles. Eq. (6) is written for an edge of unit length; for an edge of length $t > 0$ the exponent is multiplied by $t$, which does not change its sign.

The selective pressures and the mutation rates, as modeled in PoMo, can only be real positive numbers. These rates cannot be 0 because they would represent very unlikely and biologically unreasonable scenarios: if $1 + \sigma_{a_i} = 0$, the individual carrying allele $a_i$ would immediately die (i.e. an extremely deleterious allele); if $\pi_{a_i} = 0$ or $\rho_{a_i a_j} = 0$ (both imply $\mu_{a_i} = 0$), the allele $a_i$ does not arise by mutation. In both situations, such an allele should not be observed at all in the population, and one could simply use PoMo with $K - 1$ alleles, where $K$ is the number of alleles in the population. Therefore, we can easily conclude that $0 < \det(P) < 1$ for all edges of $\tau$.

Because we assume the equilibrium distribution of allele frequencies in the root, we need to check whether any elements of $\psi$ can be equal to 0. The PoMo stationary distribution is defined for two different types of states $\alpha$, the monomorphic (or fixed) and the polymorphic ones. As shown in Eq. (3), these states can only have 0 probability if any of the population parameters are 0. Therefore, we conclude that $\psi(\alpha) > 0$ for all PoMo states $\alpha$. In summary, PoMo models satisfy the conditions for identifiability, and we conclude that PoMo trees are uniquely identifiable by the frequency patterns they induce.

Identifiability can be extended to the multivariate case by defining an alphabet $\mathcal{A}$ over $K$ alleles and a state-space of $K + \frac{K(K-1)}{2}(N-1)$ states. For simplicity we have considered the four-variate case, but the proofs shown here remain valid for the multivariate Moran model with allelic selection (Borges et al., 2019).

4. Consistency of the Bayesian PoMo
To evaluate the consistency property under PoMo, we have to prove that a given tree estimator converges in probability to the true tree as the number of sites $S$ increases indefinitely. It has already been formally shown that the MAP tree (i.e. the tree topology that has the greatest posterior probability), estimated in a Bayesian framework, provides a statistically consistent estimator of the true tree (Steel, 2013). Consistency was proven under a wide variety of conditions (Steel, 2013), including tree inference from aligned sequences across the entire parameter range, and with the usage of general priors in models where the identifiability condition holds. Here, we show that these conditions are also met under PoMo.
Suppose we are given i.i.d. site patterns $\chi = (\chi_1, \ldots, \chi_S)$ generated by some unknown parameters $(\tau, \theta)$ and we wish to identify the topology $\tau$ from $\chi$ given prior densities on the set of fully resolved trees $T$ and the continuous parameters $\theta$. These parameters include the branch lengths $\omega$, the mutation rates (defined by $\pi$ and $\rho$) and the selection coefficients $\sigma$. Suppose we have a discrete prior probability distribution $p(\tau)$ on $T$, and, for each $\tau \in T$, a continuous prior probability distribution on the parameter space $\Theta(\tau)$ with a probability density function $p_\tau(\theta)$. In particular, if the following four conditions hold:
• C1: $p(\tau) > 0$;
• C2: the density $p_\tau(\theta)$ is continuous, bounded and nonzero on $\Theta(\tau)$;
• C3: the function $\theta \mapsto p_{(\tau, \theta)}(\chi)$ is continuous and nonzero on $\Theta(\tau)$;
• C4: identifiability, guaranteeing that $\tau$ is uniquely identifiable by $\chi$.

Steel (2013) has shown that
$$\lim_{S \to \infty} p(\tau^*, \theta, S \mid \chi) = 1, \tag{7}$$

where $p(\tau^*, \theta, S \mid \chi)$ is the probability that the MAP correctly selects […] $\theta)]$, where $E_\theta$ is the expectation of the likelihood with respect to the prior probability on $\Theta(\tau)$.

Consistency can then be inherently gained by Bayesian inference. PoMo can be easily placed in a Bayesian framework. Indeed, we can easily meet conditions C1 and C2 in standard Bayesian phylogenetic inference software (e.g. BEAST (Bouckaert et al., 2019) and RevBayes (Höhna et al., 2016)). C1 requires that a nonzero prior is set on the tree topology, which can be easily implemented by setting a uniform prior. If $\theta$ takes the usual exponential/gamma priors on the branch lengths and rate parameters ($\rho$, $\sigma$ and $\omega$) or the Dirichlet distribution on $\pi$ (which can actually be generated from a set of $K$ independent gamma random variables), condition C2 is met. Conditions C3 and C4 hold for all Markov processes on any tree with pendant edges of positive lengths (i.e. on any binary metric-tree) for which identifiability was proven. Therefore, as shown in the previous section, C3 and C4 hold for PoMo. Consequently, the MAP tree under PoMo is a consistent estimator of the species tree.

Fig. 2. Consistency of the Bayesian PoMo. (A) Topology (and relative branch lengths) used for simulations and the two top-visited wrong topologies sampled from the posterior. These alternative topologies follow the expected tree discordance due to incomplete lineage sorting between populations 2 and 3. (B) Posterior probabilities of the MAP and true topologies for the two simulated scenarios I and II. The expected heterozygosity $H_e$ and divergence $D$ used to simulate each scenario were taken from natural populations of D. simulans (scenario I (Begun et al., 2007)) and great apes (scenario II (Prado-Martinez et al., 2013)). The asterisk indicates that the MAP tree corresponds to the true tree. Numbers in grey indicate the number of sampled posterior topologies.

5. A simulation-based example of consistency with PoMo
Consistency guarantees the identification of the correct parameter values with infinite sequence lengths. In real data situations, the sequence length is finite, as is the running time. We have nevertheless tested the consistency of the tree topology for the Bayesian PoMo estimator using simulated population data sets.
We simulated alignments of 10 000 sites under PoMo using a phylogeny of five populations as shown in Fig. 2A, with relative branch lengths defined in such a way that we have two closely and two distantly (i.e. twice the expected divergence) related populations. We simulated two scenarios I and II by mimicking fast and slowly evolving populations (expected divergence $D$ equal to 0.3 and 0.02 substitutions per site, respectively) with expected heterozygosity $H_e$ as observed in Drosophila simulans populations and among great apes (0.0015 and 0.018, respectively) (Begun et al., 2007; Prado-Martinez et al., 2013).
To test for consistency, we created four alignments including the first 10, 100, 1000 and 10 000 sites. We fed these alignments to RevBayes (Höhna et al., 2016) and performed standard Bayesian phylogeny estimation on them. We ran PoMo with two chains for 20 000 generations, keeping every 10th iteration. A burn-in period of 10% was defined by checking mixing, convergence, and autocorrelation of the MCMC chains.
We observed that the MAP already recovers the true tree for the average gene length of 1000 sites (Fig. 2B). These simulated scenarios complement our proofs that the MAP is a consistent estimator of the true tree. We observed further that populations with more heterozygous alignments (i.e. scenario I) converge to the true tree faster. The effect of incomplete lineage sorting is evident in these examples, as the top-sampled wrong topologies (Fig. 2A) cluster the closely related populations 3 and 2 with the clade including populations 4 and 5. These topologies are, for smaller numbers of sites, the MAP trees of scenarios I and II.
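How a MAP topology and its posterior probability (the quantities reported in Fig. 2B) can be read off MCMC output may be sketched as follows. This minimal sketch assumes that each posterior sample has already been reduced to a canonical topology string (branch lengths removed, taxa ordered consistently); the function name `map_topology` and the toy labels are hypothetical, and the snippet does not reproduce the RevBayes workflow used for the simulations.

```python
from collections import Counter


def map_topology(sampled_topologies):
    """Return the most frequently sampled topology and its posterior frequency."""
    counts = Counter(sampled_topologies)
    topology, count = counts.most_common(1)[0]
    return topology, count / len(sampled_topologies)


# Toy posterior sample: labels stand in for canonical topology strings.
samples = ["T_true"] * 7 + ["T_alt1"] * 2 + ["T_alt2"]
best, posterior = map_topology(samples)
print(best, posterior)
```

Repeating this on the samples obtained from alignments of increasing length gives the kind of comparison summarized in Fig. 2B.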
6. Conclusions
Here we prove that PoMo is identifiable and, further, that the MAP is a statistically consistent estimator of the species tree when PoMo is placed in a Bayesian framework. This is the first time identifiability and consistency have been shown for the polymorphism-aware phylogenetic models.

Identifiability of PoMo can be easily extended to important, more general models. Identifiability should be preserved for generalizations that work on the standard PoMo instantaneous rate matrix; this is, for example, the case with balancing selection acting along with genetic drift. More complicated extensions of the PoMo models would include the joint inference of gene trees with the species trees; currently, PoMo directly estimates the species tree. Steel (2013) suggested that identifiability and consistency should still apply, as the gene trees can be viewed as nuisance parameters. A formal proof for this statement, especially for the case of gene trees undergoing gene gain, loss, and transfer, is missing.
[…] the type of data PoMo is applied to: population-scale and genome-wide. These data are costly and technically difficult to obtain, and indeed the data sets available in the literature are few (e.g. humans (Altshuler et al., 2010), great apes (Prado-Martinez et al., 2013), fruit flies (Lack et al., 2015) and Arabidopsis (Long et al., 2013)). It is only natural to expect that larger sample sizes are repaid with less erroneous estimates of model parameters, including the species tree.
Our simulated examples, though simple, show that the MAP recovers the true tree even when the expected heterozygosity is very low and when incomplete lineage sorting (ILS) is present. ILS is a very well-known cause of discordance between gene and species trees (Maddison and Knowles, 2006), and it affects the probability with which wrong topologies are visited. Statistical consistency is thus a desirable property for phylogeny estimation on closely related populations. As we have seen, the MAP tree already corresponds to the true topology for 1000 sites, even for less diverse species.
As future work, we would want to explore the sequence length requirement under PoMo. This is essentially the sequence length that a phylogeny reconstruction method needs to recover the true tree with a considerably small error (Atteson, 1999). A low sequence length requirement is a condition for a computationally efficient method. In particular, it would be important to determine whether PoMo is a fast converging method (i.e. a method that requires a sequence length of only $O(\mathrm{poly}(n))$).

Funding
This work was supported by the Vienna Science and Technology Fund (WWTF) [MA16-061].

Declaration of Competing Interest

The authors have no conflicts of interest.

Acknowledgements

We thank Bastien Boussau for helping with RevBayes, and Asger Hobolt and Lynette Mikula for useful comments and suggestions on an earlier version of this manuscript.
References
Allman, E.S., Ané, C., Rhodes, J.A., 2008. Identifiability of a Markovian model of molecular evolution with gamma-distributed rates. Adv. Appl. Probab. 40 (01), 229–249. doi:10.1239/aap/1208358894.
Altshuler, D.L., Durbin, R.M., Abecasis, G.R., Bentley, D.R., Chakravarti, A., Clark, A.G., Collins, F.S., De La Vega, F.M., Donnelly, P., Egholm, M., Flicek, P., Gabriel, S.B., Gibbs, R.A., Knoppers, B.M., Lander, E.S., Lehrach, H., Mardis, E.R., McVean, G.A., et al., 2010. A map of human genome variation from population-scale sequencing. Nature 467 (7319), 1061–1073. doi:10.1038/nature09534.
Atteson, K., 1999. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica 25 (2–3), 251–278. doi:10.1007/PL00008277.
Begun, D.J., Holloway, A.K., Stevens, K., Hillier, L.W., Poh, Y.-P., Hahn, M.W., Nista, P.M., Jones, C.D., Kern, A.D., Dewey, C.N., Pachter, L., Myers, E., Langley, C.H., 2007. Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol. 5 (11), e310. doi:10.1371/journal.pbio.0050310.
Borges, R., Szöllősi, G.J., Kosiol, C., 2019. Quantifying GC-biased gene conversion in great ape genomes using polymorphism-aware models. Genetics 212 (4), 1321–1336. doi:10.1534/genetics.119.302074.
Bouckaert, R., Vaughan, T.G., Barido-Sottani, J., Duchêne, S., Fourment, M., Gavryushkina, A., Heled, J., Jones, G., Kühnert, D., De Maio, N., Matschiner, M., Mendes, F.K., Müller, N.F., Ogilvie, H.A., du Plessis, L., Popinga, A., Rambaut, A., Rasmussen, D., et al., 2019. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLoS Comput. Biol. 15 (4), e1006650. doi:10.1371/journal.pcbi.1006650.
Chai, J., Housworth, E.A., 2011. On Rogers' proof of identifiability for the GTR+Γ+I model. Syst. Biol. 60 (5), 713–718. doi:10.1093/sysbio/syr023.
Chang, J., Hartigan, J., 1991. Reconstruction of evolutionary trees from pairwise distributions on current species. In: Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pp. 254–257.
Chang, J.T., 1996. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci. 137 (1), 51–73. doi:10.1016/S0025-5564(96)00075-2.
De Maio, N., Schlötterer, C., Kosiol, C., 2013. Linking great apes genome evolution across time scales using polymorphism-aware phylogenetic models. Mol. Biol. Evol. 30 (10), 2249–2262. doi:10.1093/molbev/mst131.
De Maio, N., Schrempf, D., Kosiol, C., 2015. PoMo: an allele frequency-based approach for species tree estimation. Syst. Biol. 64 (6), 1018–1031. doi:10.1093/sysbio/syv048.
Durrett, R., 2008. Probability Models for DNA Sequence Evolution. Probability and its Applications. Springer New York, New York, NY. doi:10.1007/978-0-387-78168-6.
Felsenstein, J., 1973. Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Syst. Biol. 22 (3), 240–249. doi:10.1093/sysbio/22.3.240.
Felsenstein, J., 1978. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Biol. 27 (4), 401–410. doi:10.1093/sysbio/27.4.401.
Hall, B., 2015. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction, second ed. Springer International Publishing, Switzerland. doi:10.1017/s0025557200177174.
Höhna, S., Landis, M.J., Heath, T.A., Boussau, B., Lartillot, N., Moore, B.R., Huelsenbeck, J.P., Ronquist, F., 2016. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Syst. Biol. 65 (4), 726–736. doi:10.1093/sysbio/syw021.
Jukes, T., Cantor, C., 1969. Evolution of protein molecules. In: Mammalian Protein Metabolism. Elsevier, pp. 21–132. doi:10.1016/B978-1-4832-3211-9.50009-7.
Kimura, M., 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16 (2), 111–120. doi:10.1007/BF01731581.
Lack, J.B., Cardeno, C.M., Crepeau, M.W., Taylor, W., Corbett-Detig, R.B., Stevens, K.A., Langley, C.H., Pool, J.E., 2015. The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population. Genetics 199 (4), 1229–1241. doi:10.1534/genetics.115.174664.
Leaché, A.D., Oaks, J.R., 2017. The utility of single nucleotide polymorphism (SNP) data in phylogenetics. Annu. Rev. Ecol. Evol. Syst. 48 (1), 69–84. doi:10.1146/annurev-ecolsys-110316-022645.
Long, Q., Rabanal, F.A., Meng, D., Huber, C.D., Farlow, A., Platzer, A., Zhang, Q., Vilhjálmsson, B.J., Korte, A., Nizhynska, V., Voronin, V., Korte, P., Sedman, L., Mandáková, T., Lysak, M.A., Seren, Ü., Hellmann, I., Nordborg, M., 2013. Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nat. Genet. 45 (8), 884–890. doi:10.1038/ng.2678.
Maddison, W.P., Knowles, L.L., 2006. Inferring phylogeny despite incomplete lineage sorting. Syst. Biol. 55 (1), 21–30. doi:10.1080/10635150500354928.
Mirarab, S., Bayzid, M.S., Boussau, B., Warnow, T., 2014. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science 346 (6215), 1250463. doi:10.1126/science.1250463.
Moran, P., 1958. Random processes in genetics. Math. Proc. Cambridge Philos. Soc. 54 (01), 60. doi:10.1017/S0305004100033193.
Pagel, M., 1999. Inferring the historical patterns of biological evolution. Nature 401 (6756), 877–884. doi:10.1038/44766.
Prado-Martinez, J., Sudmant, P.H., Kidd, J.M., Li, H., Kelley, J.L., Lorente-Galdos, B., Veeramah, K.R., Woerner, A.E., O'Connor, T.D., Santpere, G., Cagan, A., Theunert, C., Casals, F., Laayouni, H., Munch, K., Hobolth, A., Halager, A.E., Malig, M., et al., 2013. Great ape genetic diversity and population history. Nature 499 (7459), 471–475. doi:10.1038/nature12228.
Rogers, J., Raveendran, M., Harris, R.A., Mailund, T., Leppälä, K., Athanasiadis, G., Schierup, M.H., Cheng, J., Munch, K., Walker, J.A., Konkel, M.K., Jordan, V., Steely, C.J., Beckstrom, T.O., Bergey, C., Burrell, A., Schrempf, D., Noll, A., Kothe, M., et al., 2019. The comparative genomics and complex population history of Papio baboons. Sci. Adv. 5 (1), eaau6947. doi:10.1126/sciadv.aau6947.
Rogers, J.S., 2001. Maximum likelihood estimation of phylogenetic trees is consistent when substitution rates vary according to the invariable sites plus gamma distribution. Syst. Biol. 50 (5), 713–722. doi:10.1080/106351501753328839.
Schrempf, D., Hobolth, A., 2017. An alternative derivation of the stationary distribution of the multivariate neutral Wright-Fisher model for low mutation rates with a view to mutation rate estimation from site frequency data. Theor. Popul. Biol. 114, 88–94. doi:10.1016/j.tpb.2016.12.001.
Schrempf, D., Minh, B.Q., De Maio, N., von Haeseler, A., Kosiol, C., 2016. Reversible polymorphism-aware phylogenetic models and their application to tree inference. J. Theor. Biol. 407, 362–370. doi:10.1016/j.jtbi.2016.07.042.
Steel, M., 1994. Recovering a tree from the leaf colourations it generates under a Markov model. Appl. Math. Lett. 7 (2), 19–23. doi:10.1016/0893-9659(94)90024-8.
Steel, M., 2013. Consistency of Bayesian inference of resolved phylogenetic trees. J. Theor. Biol. 336, 246–249. doi:10.1016/j.jtbi.2013.08.012.
Szöllősi, G.J., Tannier, E., Daubin, V., Boussau, B., 2015. The inference of gene trees with species trees. Syst. Biol. 64 (1), e42–e62. doi:10.1093/sysbio/syu048.
Tajima, F., Nei, M., 1984. Estimation of evolutionary distance between nucleotide sequences. Mol. Biol. Evol. 1 (3), 269–285. doi:10.1093/oxfordjournals.molbev.a040317.
Tavaré, S., 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17 (2), 57–86.
Wald, A., 1949. Note on the consistency of the maximum likelihood estimate. Ann. Math. Stat. 20 (4), 595–601. doi:10.1214/aoms/1177729952.
Wu, J., Susko, E., 2010. Rate-variation need not defeat phylogenetic inference through pairwise sequence comparisons. J. Theor. Biol. 263 (4), 587–589. doi:10.1016/j.jtbi.2009.12.022.
Rui Borges
Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, Wien 1210, Austria

Carolin Kosiol∗
Institut für Populationsgenetik, Vetmeduni Vienna, Veterinärplatz 1, Wien 1210, Austria
Centre for Biological Diversity, University of St Andrews, St Andrews, Fife KY16 9TH, UK

∗Corresponding author.
E-mail address: [email protected] (C. Kosiol)