Concept for estimating mitochondrial DNA haplogroups using a maximum likelihood approach (EMMA)


Alexander W. Röck a, Arne Dür b, Mannis van Oven c, Walther Parson a,d,*

a Institute of Legal Medicine, Innsbruck Medical University, Innsbruck, Austria
b Institute of Mathematics, University of Innsbruck, Innsbruck, Austria
c Department of Forensic Molecular Biology, Erasmus MC, University Medical Center Rotterdam, The Netherlands
d Penn State Eberly College of Science, University Park, PA, USA

ARTICLE INFO

Article history:
Received 27 February 2013
Received in revised form 1 July 2013
Accepted 8 July 2013

Keywords: mtDNA; Haplogroup; EMPOP; Fluctuation rates; Phylotree

ABSTRACT

The assignment of haplogroups to mitochondrial DNA haplotypes contributes substantial value for quality control, not only in forensic genetics but also in population and medical genetics. The availability of Phylotree, a widely accepted phylogenetic tree of human mitochondrial DNA lineages, led to the development of several (semi-)automated software solutions for haplogrouping. However, currently existing haplogrouping tools only make use of haplogroup-defining mutations, whereas private mutations (beyond the haplogroup level) can be additionally informative allowing for enhanced haplogroup assignment. This is especially relevant in the case of (partial) control region sequences, which are mainly used in forensics. The present study makes three major contributions toward a more reliable, semi-automated estimation of mitochondrial haplogroups. First, a quality-controlled database consisting of 14,990 full mtGenomes downloaded from GenBank was compiled. Together with Phylotree, these mtGenomes serve as a reference database for haplogroup estimates. Second, the concept of fluctuation rates, i.e. a maximum likelihood estimation of the stability of mutations based on 19,171 full control region haplotypes for which raw lane data is available, is presented. Finally, an algorithm for estimating the haplogroup of an mtDNA sequence based on the combined database of full mtGenomes and Phylotree, which also incorporates the empirically determined fluctuation rates, is brought forward. On the basis of examples from the literature and EMPOP, the algorithm is not only validated, but both the strength of this approach and its utility for quality control of mitochondrial haplotypes is also demonstrated.

* Corresponding author at: Institute of Legal Medicine, Innsbruck Medical University, Innsbruck, Austria. Tel.: +43 512 9003 70640; fax: +43 512 9003 73640.
E-mail address: [email protected] (W. Parson).

Forensic Science International: Genetics 7 (2013) 601–609
Journal homepage: www.elsevier.com/locate/fsig
1872-4973 © 2013 The Authors. Published by Elsevier Ireland Ltd. Open access under CC BY-NC-SA license.
http://dx.doi.org/10.1016/j.fsigen.2013.07.005

1. Introduction

Human mitochondrial (mt)DNA is passed from mother to offspring and therefore inherited along a phylogeny. The first human mitochondrial genome (mtGenome) was sequenced in the early 1980s [1] and revised 18 years later [2], serving as a reference sequence (rCRS) relative to which other mtDNA sequences have been reported in a difference-coded format. A plethora of partial as well as complete mtGenomes has been produced since, permitting an increased understanding of the evolution of this molecule. Its dispersal through human migration left characteristic footprints induced by mutations that have been used to assign sequences to haplogroups [3]. The growing collection of established clades has meanwhile reached 3925 discernible haplogroups based on 16,810 full mtGenomes [Phylotree (www.phylotree.org) Build 15 [4]].

The understanding of sequences from the standpoint of their haplogroup affiliation has become increasingly valuable in studies of human mtDNA. Not only does haplogroup nomenclature and assignment facilitate comparison and communication of genetic variability, but it is also employed to characterize mitochondrial lineages for population [5], medical [6] and forensic genetic [7] purposes. Most importantly for forensics, haplogroup assignment has proven to be an important tool for sequence data quality control [8]. Haplogrouping of mtDNA sequences has been greatly simplified with the provision of Phylotree, which is widely accepted as "mitochondrial haplogroup dictionary" in the scientific community. The haplogroup-defining mutations listed in Phylotree not only facilitate manual haplogroup assignment, but they also serve as the basis for a number of software applications that perform the task (e.g. MitoTool [9,10], HmtDB [11], HaploGrep [12], and mtDNAoffice [13]). To date, however, none of the available automated solutions provide reliable and unbiased haplogroup estimates, especially in the case of partial mtDNA sequences [7]. The major limitation of existing tools is that they base their haplogroup assignment solely on defined motifs of diagnostic mutations (virtual haplotypes). The remaining "private" mutations of a sequence which can be additionally informative for the haplogroup status are not considered. As a consequence, the haplogroup assignments are therefore often incorrect or too coarse.

Consider for example the following control region haplotype from Argentina, 16189C 16292T 16519C 71A 153G 204C 207A 263G 315.1C 373G, which harbors no characteristic mutation to match a haplogroup in Phylotree's motif list (Build 15) but nevertheless seems to fall within superhaplogroup R0. This sequence was assigned to haplogroup H by mtDNAmanager [14] and to haplogroup H1+16189 by HaploGrep [12]. Additional coding region sequencing of this sample revealed haplogroup H55 status (4769G 10464A). This conclusion could also have been drawn from the control region haplotype alone, if its near match with the complete sequence JQ705203 known to belong to haplogroup H55 had been considered. In this study, we offer new software (EMMA) that bases haplogroup estimation on Phylotree's list of virtual haplotypes and a database of 14,990 quality-controlled full mtGenomes, and that employs a maximum likelihood approach. We demonstrate by comparative analysis that our tool yields more precise haplogroup assignments than other available software. For the Argentinean control region haplotype, EMMA correctly assigned the proper haplogroup status as H55 even without coding region information.
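The difference-coded notation used for the Argentinean haplotype can be turned into structured records before any comparison is made. The following Python sketch is only an illustration: the `Variant` record, the `parse_haplotype` helper and the token grammar are assumptions made here and are not part of EMMA.

```python
import re
from typing import NamedTuple


class Variant(NamedTuple):
    position: int   # position in rCRS numbering
    insertion: int  # 0 for a substitution/deletion, k for the k-th inserted base
    symbol: str     # observed symbol, e.g. "C"


# Hypothetical token grammar: digits, optional ".k" insertion index, then letters.
_TOKEN = re.compile(r"^(\d+)(?:\.(\d+))?([A-Za-z-]+)$")


def parse_haplotype(text: str) -> list[Variant]:
    """Split a whitespace-separated haplotype string into Variant records."""
    variants = []
    for token in text.split():
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognised variant token: {token!r}")
        position, insertion, symbol = match.groups()
        variants.append(Variant(int(position), int(insertion or 0), symbol))
    return variants


# Control region haplotype of the Argentinean sample quoted in the text.
ARGENTINA = "16189C 16292T 16519C 71A 153G 204C 207A 263G 315.1C 373G"
profile = parse_haplotype(ARGENTINA)
```

Here `315.1C` is read as the first inserted base after position 315, while all other tokens are plain substitutions relative to the rCRS.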

2. Materials and methods

2.1. Virtual Phylotree haplotypes and full mtGenome database

The phylogenetic tree in Phylotree [4] represents known global mtDNA variation by defining haplogroups and their signature mutations. This tree is regularly updated incorporating newly available mtGenomes and is made available to the user in HTML format. All results presented in this study use Phylotree Build 15 (September 30, 2012) as the reference tree. For our purposes, an R [15] script was developed that transforms the tree into a list of hypothetical haplotypes carrying the signature mutations of the respective haplogroups (tree nodes) as differences to the rCRS [2]. As these haplotypes are inferred rather than observed in the real world they are herein referred to as virtual haplotypes.
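The transformation of a tree into virtual haplotypes can be sketched as an accumulation of mutations along the path from the root to each node. The Python code below is not the authors' R script. The tree encoding, the function names and the rule that a token ending in "!" reverts a site to the reference state are assumptions made for illustration, and the toy tree is invented.

```python
import re

# Hypothetical token grammar: site, symbol, optional "!" marking a back mutation.
_MUTATION = re.compile(r"^(\d+(?:\.\d+)?)([A-Za-z-]+)(!?)$")


def _site_key(site: str) -> tuple:
    """Numeric sort key, so that "315.1" sorts after "263"."""
    return tuple(int(part) for part in site.split("."))


def virtual_haplotypes(tree: dict) -> dict:
    """Map every node to the differences accumulated on its path from the root.

    `tree` maps node name -> (parent name or None, list of mutation tokens).
    """
    cache = {}

    def resolve(node):
        if node not in cache:
            parent, mutations = tree[node]
            state = dict(resolve(parent)) if parent is not None else {}
            for token in mutations:
                site, symbol, back = _MUTATION.match(token).groups()
                if back:
                    state.pop(site, None)  # assumed: back to the reference state
                else:
                    state[site] = symbol
            cache[node] = state
        return cache[node]

    return {
        node: [site + symbol for site, symbol in
               sorted(resolve(node).items(), key=lambda kv: _site_key(kv[0]))]
        for node in tree
    }


# Toy tree with made-up node names; NOT taken from Phylotree.
TOY_TREE = {
    "X": (None, ["73G", "263G"]),
    "X1": ("X", ["16189C"]),
    "X1a": ("X1", ["73A!", "315.1C"]),
}
haplotypes = virtual_haplotypes(TOY_TREE)
```

In this toy case the node `X1a` inherits `263G` and `16189C`, loses `73G` through the assumed back mutation, and gains `315.1C`.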

Recently, a new reference sequence for mtDNA, the so-called Reconstructed Sapiens Reference Sequence (RSRS), has been proposed [16]. Instead of using a contemporary European mtGenome as reference sequence the authors suggest switching to a reconstructed ancestral sequence that is allocated between haplogroups L0 and L1'2'3'4'5'6. A switch to the RSRS in the forensic field, however, is not expected to occur soon [17]. The software presented here is primarily designed for forensic purposes, thus input of mtDNA data is currently based on the rCRS.

The defined haplogroup motifs in Phylotree are based on a database of published mtGenomes (http://www.phylotree.org/mtDNA_seqs.htm) that were downloaded from GenBank and evaluated for their application within EMMA. Some sequences were incomplete (e.g. [18], accession number EF657231, lack of control region; [19], accession number EF661002, sequencing frame 1–4167 4434–5483 5785–8314 8566–10683 10749–16548) and therefore excluded from this study. MtGenomes highlighted as problematic in Appendix S1 of Ref. [20] have also been removed. Additionally, mtGenomes that have been generated in the course of second generation sequencing attempts in Refs. [21,22], and flawed data published by [[23], I.P. Maksum, unpublished, V.C. Phan, unpublished] have been excluded from the database. Of the remaining mtGenomes, those containing ten or more 'N' designations in the FASTA string were excluded because of insufficient sequence quality.
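The last filtering step (exclusion of sequences with ten or more 'N' designations) is simple to express in code. The following Python sketch assumes plain FASTA text as input. The helper names `read_fasta` and `passes_n_filter` and the toy records are hypothetical and do not reproduce the in-house tooling.

```python
def read_fasta(text: str) -> dict:
    """Parse FASTA text into {identifier: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def passes_n_filter(sequence: str, n_limit: int = 10) -> bool:
    """Keep a sequence only if it has fewer than `n_limit` 'N' symbols."""
    return sequence.upper().count("N") < n_limit


# Made-up toy records (not GenBank data): 9 and 10 ambiguous bases.
toy_fasta = "\n".join([
    ">toyA", "ACGT" + "N" * 9, "ACGT",
    ">toyB", "ACGT" + "N" * 10 + "ACGT",
])
kept = {name: seq for name, seq in read_fasta(toy_fasta).items()
        if passes_n_filter(seq)}
```

With the threshold stated in the text, `toyA` (nine 'N') is retained and `toyB` (ten 'N') is dropped.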

Subsequent analysis of our quality-controlled mtGenome database revealed that 20 haplogroups of Phylotree 15 were not represented by complete sequences anymore due to the strict policy applied above. To ensure maximum coverage of haplogroups, 29 mtGenomes rejected in the first step were reincluded in the database. With the expected future availability of more reliable mtGenomes those questionable haplotypes will be replaced. A single haplogroup was considered unreliable and therefore not reincluded: according to Phylotree haplogroup M25 is represented by two mtGenomes from [24] (accession numbers DQ246830, DQ246833). Due to the lack of the first 250 bases in these mtGenomes and the presence of several doubtful variants, both sequences were not considered for the database, thus haplogroup M25 is the only haplogroup that is not represented by any mtGenome in the database for EMMA. See ESM1 for the list of genomes added and the reason for initial exclusion.

Supplementary material related to this article can be found, in the online version, at doi:10.1016/j.fsigen.2013.07.005.

Finally, all FASTA strings were translated into rCRS-coded haplotypes using SAM [25] and subsequently checked with in-house software to harmonize alignment. The quality filtering of the sequences finally resulted in a database of 14,990 full mtGenomes stored with their accession numbers and version. In conjunction with the 3925 virtual Phylotree motifs, these 18,915 virtual and real mtGenomes form the basis for haplogroup estimation.

2.2. Fluctuation rates

Haplogrouping of mtDNA data in rCRS-based format requires consistent alignment and notation of sequences following a phylogenetic approach [26] in order to assess the stability of mutations in defined haplogroups. Here, we refer to this mutational (in)stability as a fluctuation rate. The weighting scheme presented for the string-search method in Ref. [25] was updated by assessing the stability of mutations within the mtDNA control region among 19,171 full control region haplotypes for which raw lane data were available. Haplogroups were manually assigned to all sequences in this dataset between November 2011 and September 2012 following the classification outlined in Phylotree Builds 12 through 15. Consequently, the sequences were grouped into discernible control region haplogroup clusters (CR-HGs), i.e. clusters of haplogroups that can be confidently determined based on control region motifs. We set a minimum of four available sequences to define a CR-HG with the exception of CR-HGs L0, L2, L6, U4'9, K3, and P9 for which only one or two sequences were available. In these cases, merging the sequences with their parent haplogroups L and R would have rendered the resolution too coarse. For a list of CR-HGs based on Phylotree Build 15 and the number of samples for each cluster see ESM2. Samples that were assigned to multiple haplogroups due to uncertainty were split equally into the respective CR-HGs.

Assuming independent positions we estimated the fluctuation rate by

$$r_{\alpha\beta} = \frac{\sum_{\gamma} \min\bigl(n(\alpha,\gamma),\, n(\beta,\gamma)\bigr)}{\sum_{\gamma} n(\gamma)}$$

where $\alpha$, $\beta$ are elements of the set $\{A, C, G, T, -\}$ with $\alpha$ not equal to $\beta$, $\gamma$ runs over all CR-HGs where $\alpha$ or $\beta$ are dominant, $n(x,\gamma)$ denotes the number of samples in CR-HG $\gamma$ with symbol $x$ and $n(\gamma)$ denotes the total number of samples in CR-HG $\gamma$.

Heteroplasmies were split equally into the represented bases and each base adds its fraction to $n(x,\gamma)$. If the estimate for the fluctuation rate is zero, a minimum value of $10^{-6}$ for transitions and $10^{-9}$ for transversions or indels is assigned. Finally, we compute the diagonal rates as $r_{\alpha\alpha} = 1 - \sum_{\beta \neq \alpha} r_{\alpha\beta}$.

Zero weight was assigned to insertions in the C-stretches of HVS-I (around 16,191 and 16,193) and HVS-II (around 309) because they carry no phylogenetic signature. Furthermore, zero weight was applied to deletions at positions 521 and 522 as they are sometimes the result of a 5′ alignment of indels in the AC-stretch. For multiple insertions at positions 315 (C-insertions), 455 (T-insertions), 524 (AC-insertions), and 573 (C-insertions) only the first insertion was weighted. Additional insertions were assigned weight zero. Two unique duplication events, a 15 base pair insertion at position 16,032 found in one haplotype from the USA and a 204 base pair insertion at position 563 found in two haplotypes from Morocco and the USA, respectively, were interpreted as single events. Thus, the weight of the single insertion of the first base was taken as measure for the whole duplication event. The same logic was applied to deletions at positions 105–111 relative to the rCRS.

The resulting fluctuation rates for control region mutations were expanded to the coding region using the number of occurrences of coding region mutations following [27]. Taking T16519C as the most frequent mutation with 209 occurrences according to Ref. [27], we derived fluctuation rates for all other mutations following the formula r = r(T16519C) · n/209, with n denoting the number of occurrences of the respective mutation. Additionally, multiple C-insertions at positions 960, 965, 5899, 8276, 8278 and insertions of the string CCCCCTCTA at position 8289 were treated as a single event. For the comprehensive list of fluctuation rates see ESM3.

Some mutations were not considered for tree reconstruction by Phylotree. To reflect this in the collection of virtual haplotypes, we introduced an extended nomenclature for denoting the simultaneous appearance of mutations and deletions. In addition to the character set defined by the IUPAC code we used corresponding small letters that represent the same nucleotides as the respective capital letter plus the deletion. For example, the annotation "523a" represents both the rCRS nucleotide and the deletion at position 523. In the same manner, "524.1m" represents an insertion of an A or C at position 524.1 as well as the non-insertion. Using this extended character set in EMMA we applied pattern match rules between nucleotides as presented in Ref. [25]. Mutations 309.1c 309.2c 309.3c 315.1c 315.2c 315.3c 521a 522c 523a 524c 524.1m 524.2m 524.3m 524.4m 524.5m 524.6m 524.7m 524.8m 16181m 16182m 16183m 16191.1c 16191.2c 16193.1c 16193.2c 16193.3c 16519Y were artificially added to all virtual haplotypes. Consequently, any mismatch between a test profile and a virtual database profile for these mutations does not influence the haplogroup estimate since they were not treated as differences when applying pattern match rules. The same holds true for mutations set in brackets by Phylotree. These were considered optional for the affected haplogroup status.
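The extended character set can be illustrated with a small Python sketch. The exact pattern match rules are those of Ref. [25] and are not reproduced here. The rule used below (two symbols match when their sets of allowed states overlap) is an assumption, and the names `states` and `matches` are hypothetical.

```python
# Standard IUPAC nucleotide codes; "-" stands for a deletion or non-insertion.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
    "-": "-",
}


def states(symbol: str) -> set:
    """Allowed states of a symbol; lower case adds the deletion state."""
    if symbol.islower():
        return set(IUPAC[symbol.upper()]) | {"-"}
    return set(IUPAC[symbol])


def matches(database_symbol: str, test_symbol: str) -> bool:
    """Assumed rule: symbols match if their allowed states intersect."""
    return bool(states(database_symbol) & states(test_symbol))
```

Under this reading, the database symbol in "523a" matches both an observed A and a deletion at position 523, and the symbol in "524.1m" matches an inserted A or C as well as the absence of an insertion.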

2.3. Algorithm

A test profile is compared to every database profile with the same or a larger reading frame (sequence span) than the test profile. Differences between two profiles (costs) were computed as detailed below. The basic idea is, for test profile $t$, to maximize the likelihood of the base profile $b$

$$L_t(b) = \prod_i r(b_i \to t_i)$$

where the product is taken over all positions $i$ of the common reading frame and $r$ denotes the fluctuation rate. As the number of positions can be large, e.g. about 16.6 thousand for full genomes, and the computational work for evaluating the product would become too high, the ranking of the base profiles is determined as follows: Instead of maximizing the likelihood function we equivalently minimized the cost function

$$C_t(b) = \lg\!\left(\frac{\prod_i r(t_i \to t_i)}{L_t(b)}\right)$$

where $\lg(x) = \log_{10}(x)/3$ is the scaled decimal logarithm. If the test profile and the base profile are encoded by short lists of differences relative to a reference sequence such as the rCRS, then the cost function can be efficiently evaluated by the formula:

$$C_t(b) = \sum_i c(b_i, t_i)$$

where $i$ runs over all positions where the test and the base profile differ, and $c(b_i, t_i) = \lg\bigl(r(t_i \to t_i)/r(b_i \to t_i)\bigr)$ are real numbers termed positional costs for the change from the base profile symbol to the test profile symbol. The scaling of the logarithm was motivated by the costs of approximately 1.0 for an average mutation. For ambiguous symbols in base or test profiles, the maximum rates of matching nucleotides are used. Therefore, the ranking of the base profiles by their total costs equals the maximum likelihood ranking. In the output of the algorithm only base profiles with the lowest and second lowest costs were presented where a tolerance value (default 0.3) is used to cluster the optimal and suboptimal profiles.
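The cost evaluation over difference lists can be sketched as follows. This Python code is a simplified illustration, not the EMMA implementation: it ignores reading frames and ambiguous symbols, it uses invented positions, rates and profile names, and it assumes that rank 1 contains all profiles within the tolerance of the lowest cost and rank 2 all profiles within the tolerance of the next lowest cost.

```python
import math

SYMBOLS = "ACGT-"


def lg(x: float) -> float:
    """Scaled decimal logarithm lg(x) = log10(x) / 3."""
    return math.log10(x) / 3.0


def uniform_rates(rate: float) -> dict:
    """Toy rate table: the same off-diagonal rate for every symbol pair."""
    table = {(a, b): rate for a in SYMBOLS for b in SYMBOLS if a != b}
    for a in SYMBOLS:
        table[(a, a)] = 1.0 - (len(SYMBOLS) - 1) * rate
    return table


def total_cost(base: dict, test: dict, rates: dict, reference: dict) -> float:
    """Sum positional costs over the positions where base and test differ."""
    cost = 0.0
    for pos in set(base) | set(test):
        b = base.get(pos, reference[pos])
        t = test.get(pos, reference[pos])
        if b != t:
            cost += lg(rates[pos][(t, t)] / rates[pos][(b, t)])
    return cost


def rank_profiles(test, database, rates, reference, tolerance=0.3):
    """Return (rank 1, rank 2) lists of (name, cost), clustered by tolerance."""
    scored = sorted(
        ((name, total_cost(profile, test, rates, reference))
         for name, profile in database.items()),
        key=lambda item: item[1],
    )
    best = scored[0][1]
    rank1 = [item for item in scored if item[1] <= best + tolerance]
    rest = scored[len(rank1):]
    rank2 = [item for item in rest if item[1] <= rest[0][1] + tolerance] if rest else []
    return rank1, rank2


# Toy data: positions, rates and profile names are made up for illustration.
reference = {73: "A", 146: "T"}
rates = {73: uniform_rates(1e-3), 146: uniform_rates(1e-2)}
database = {"P0": {}, "P1": {73: "G"}, "P2": {73: "G", 146: "C"}}
test = {73: "G", 146: "C"}
rank1, rank2 = rank_profiles(test, database, rates, reference)
```

With an off-diagonal rate of 0.001 the positional cost is close to 1.0, in line with the scaling motivation given above. Positions where base and test agree contribute nothing, which is why only the difference lists need to be visited.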

3. Results and discussion

3.1. Validation of the concept

3.1.1. Preparation and validation of the mtGenome database

Any estimate of the haplogroup status of an mtDNA sequence can only be as good as the underlying database of sequences used to derive the estimate. EMMA uses two databases, one consisting of the 3925 virtual haplotypes of Phylotree Build 15 and one comprised of 14,990 full mtGenomes downloaded from GenBank. Because the latter come without haplogroup information, we used the database of virtual haplotypes from Phylotree Build 15 to estimate the haplogroup of the downloaded mtGenomes. This procedure was straightforward in most cases, as the presented coding region information in the virtual haplotypes was sufficient for unambiguous haplogroup determination of the GenBank sequences. To validate this procedure we extracted the 5920 mtGenomes that are cited as haplogroup-specific example sequences in Phylotree Build 15, serving as the basis for the virtual haplotypes. Of these, 140 were excluded for quality and coverage reasons. For the remaining 5780 genomes (38.6% of the 14,990 mtGenomes in the total database), the haplogroup status was determined with EMMA. We found full concordance between the Phylotree haplogroup labels and EMMA's haplogroup estimates in 5774 cases (99.9% of 5780). The different results observed in six cases could be attributed to two issues. First, a missing signature mutation caused a more conservative estimate (more ancestral haplogroup) by EMMA in three cases (AY922257, HQ873519, JQ797975). For example, AY922257 (haplogroup M30c1a1 in Phylotree Build 15) lacked the C16069T required for haplogroup M30c1a, and although the subhaplogroup M30c1a1 diagnostic G9966A was present, EMMA estimated haplogroup M30c1. In the other three cases (EU219921, EU545420, JQ704528) two different haplogroups were compatible with the sequence motif of the mtGenome, thus the haplogroup assignment relied on the costs of the differing mutations. EU545420 (haplogroup HV9-152 in Phylotree Build 15), for example, carried G8994A (costs of 0.79) and T152C (costs of 0.29) that together suggested haplogroup HV9-152. However, mutation C13449T (costs of 2.00) was also present in this haplotype and indicated haplogroup HV10, since the mismatch with the two HV9-152 diagnostic mutations resulted in lower overall costs than that with C13449T. For details on all six examples, see ESM4. This validation process revealed one error in mtGenome AY882385 of GenBank (haplogroup U3b1a), where position 3546 carried the transition C3546T instead of the U3b1 characteristic transversion C3546A. After communicating this finding to the authors of this mtGenome, its sequence data has been corrected and updated in GenBank.
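The EU545420 case can be checked with the three costs quoted above. The short Python sketch below encodes one reading of that paragraph, namely that each candidate haplogroup leaves the mutations diagnostic for the other candidate unexplained. The variable names are hypothetical.

```python
# Positional costs quoted in the text for EU545420.
costs = {"G8994A": 0.79, "T152C": 0.29, "C13449T": 2.00}

# Mutations left unexplained under each candidate haplogroup (assumed reading).
unexplained = {
    "HV9-152": ["C13449T"],
    "HV10": ["G8994A", "T152C"],
}

totals = {haplogroup: round(sum(costs[m] for m in mutations), 2)
          for haplogroup, mutations in unexplained.items()}
best = min(totals, key=totals.get)
```

The two HV9-152 mutations together cost 0.79 + 0.29 = 1.08, which is below the 2.00 of C13449T, so the lower total falls on HV10.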

The same procedure was employed for the example genomes of Phylotree Build 14, before Build 15 was released. In addition to high concordance of haplogroups, the EMMA haplogroup assignment process unveiled the following details. FJ383192 (haplogroup D4b2b5) lacked the diagnostic polymorphism 9296 for haplogroup D4b2b. DQ112751 (coding region only) listed as haplogroup S1, lacked marker 14384C that is characteristic for haplogroup S1. Thus, both mtGenomes chosen for Phylotree Build 14 did not constitute ideal representatives for these haplogroups and were replaced in Build 15. Furthermore, our analysis showed that the sequence motifs for haplogroups K1a8b and K1a22 were identical in Build 14. Haplogroup K1a22 was therefore deleted in Build 15. Finally, polymorphism 1718 for haplogroup C5b1a1 should have read 1719, which was also corrected in Build 15.

3.1.2. Validation using external/literature HVS-I/HVS-II and CR profiles

To further evaluate EMMA, we processed all 26 CR haplotypes from Table 2 in Ref. [7]. In general, estimates from EMMA were the same as or a refinement of the most likely haplogroup assigned by Bandelt et al. based on Phylotree 13. This was also true for the related GenBank entries found by EMMA. A summary of the haplotypes and haplogroup assignments of Table 2 in Ref. [7] is reproduced in Table 1, but modified to include results obtained with EMMA and updated haplogroup estimates from HaploGrep based on Phylotree Build 15.

Nearest samples from GenBank found by EMMA with respect to fluctuation rate induced costs coincided with those listed in the original table for 15 of the 26 samples. Example no. 4 lists sample EF093557 together with the most likely haplogroup M52a in the original table. However, taking Phylotree Build 15 into account, sample EF093557 is of haplogroup E1a1a status, not M52a as stated in Ref. [7]. EMMA correctly estimated the haplogroup for example no. 4 based on the virtual haplotype M52a. The nearest mtGenome according to EMMA was JQ703445 (haplogroup M52a) with costs of 5.74.

For example no. 7, GenBank entry JQ703419 with costs of 0.39 seemed more closely related to the profile in question than EU979418 (indicated in Ref. [7]) with costs of 1.10. Both samples however belong to haplogroup H1a3c. Example no. 12 was assigned haplogroup J1c6 status based on GenBank entry AY495209 [7]. In addition to the virtual haplotype J1c-16261-189, which was also returned by HaploGrep, EMMA included haplogroup J1c12 in the top ranks. If only the full mtGenomes were searched, samples JQ702857 (haplogroup J1c7a) and JQ705489 (haplogroup J1c-16261) at costs of 1.14 would have been returned. We therefore suggest J1c12 (not yet defined in Phylotree Build 13) as most likely haplogroup as this clade is a subbranch of the intermediate branch J1c-16261-189.

For the examples nos. 14–16 EMMA returned different GenBank samples than those indicated in the original table [7]. However, the most likely haplogroups according to EMMA were not distinct from the originals. Instead, they were refinements of those given by Bandelt et al. [7]. Differences between the EMMA and Bandelt et al. [7] analyses were also observed for examples 17, 18, 22, 24, and 26. In these cases, the most likely haplogroup was confirmed, but the GenBank entries favored by EMMA were different from those found by Bandelt et al. The reason for this can be explained by the underlying database: samples AF347006 and EF660967 were excluded from our database for quality reasons (see Ref. [20]). Sample HQ165756 is no longer available at GenBank and has apparently been removed ("this record was removed at the submitter's request").

Haplotype no. 20 was most likely assigned in Ref. [7] to haplogroup M43a based on GenBank entry FJ770954. For this profile, HaploGrep reported haplogroup D4e2a as best choice with a quality value of 103.3%. This CR profile cannot be unambiguously assigned to a single haplogroup without additional information from the coding region. Two equally matching GenBank samples of different haplogroups were found by EMMA: FJ770954 (haplogroup M43a) and JQ702232 (haplogroup M74b2), both at costs of 0.53, that only differed in a C-insertion at 309 and three C-insertions at 573. A search along Phylotree added haplogroups D4e2a, M10, M74, D4j-16311 and subhaplogroups thereof that only differed in the coding region. For haplotype no. 25 several different genomes that all perfectly matched the test profile (zero costs) were reported by EMMA. Among those, sample AY339515 (haplogroup Z1a1a) was found, but also haplogroups Z1a (FJ147318), Z1a1 (FJ493513), and Z1a2 (AY195761) were presented. In summary, this compilation shows that EMMA correctly classified all of the 26 examples.

3.2. Improved haplogroup assignment with full mtGenome sequences

Haplogrouping of partial mtDNA sequences such as CR sequences, when restricted to virtual haplotypes, may lead to imprecise or even inaccurate results, as demonstrated below. Taking a closer look at the Argentinean sample introduced earlier, its CR haplotype 16189C 16292T 16519C 71A 153G 204C 207A 263G 315.1C 373G (ABS133 [28]) was assigned to haplogroup H by mtDNAmanager and to haplogroup H1+16189 by HaploGrep with the quality value of 64.7% reflecting the low quality of the estimate. Applying EMMA together with the collection of 14,990 full mtGenomes, the best match was JQ705203 (haplogroup H55) with costs of 3.39 (ESM5b). The second best estimate was also a member of haplogroup H55 (JQ704460) with costs of 3.91. Additional sequencing of the coding region segments 3042–3549, 4270–4859, and 10,184–11,000 yielded the variants 4769G 10646A confirming haplogroup H55 status (Phylotree, Build 15).

The CR haplotype of sample "stain 3, GEDNAP 44" 16248T 146C 263G 309.1C 309.2C 315.1C [29] suggested haplogroup H status. The transition C16248T is only found in haplogroups H1ah1, H3w and H4c1 within haplogroup H in Phylotree (Build 15), with haplogroup H4c1 status also requiring a transition at 73. MtDNAmanager assigned this profile to haplogroup H, while the query in HaploGrep confirmed the manual search in Phylotree with quality values of 82.7% for H1ah1 and 94.2% for H3w. EMMA found two perfectly matching (costs of 0.00) GenBank entries belonging to haplogroup HV (GU592048, GU592033). Haplogroups H1ah1 and H3w followed with costs of 0.39 resulting from the private mutation at 146 (ESM5a). Sequencing of the coding region segments 3280–3513 6700–7140 7226–7749 9110–9360 14,520–14,950 yielded 7028T as single difference to the rCRS, therefore excluding haplogroup H status (thus rejecting haplogroups H1ah1 and H3w) and confirming haplogroup status HV. The two previous examples demonstrate that browsing virtual haplotypes in motif lists is not always sufficient to determine reliable haplogroup status. Private mutations that can only be matched with full mtGenome sequences provide relevant information when searching for matching haplotypes and nearest neighbors to support haplogroup assignment.

Table 1. Samples from Table 2 of Ref. [7] with updated haplogroup classification based on Phylotree Build 15 according to HaploGrep and EMMA.

| No. | Related GenBank sample in Ref. [7] | (Updated) haplogroup of related GenBank sample from Ref. [7] | Likely haplogroup [7] (a) | HaploGrep [12] (b) | QV (c) | EMMA (d) | Costs (e) | Rank 1 results EMMA (f) | Missing mutations (g) | Private mutations (h) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HM030505 | M20 | M20 | M20 | 90.8 | M20 | 0.72 | HM030505 | T16362C T16519C | None |
| 2 | HM030542 | N10a | N10a | N10a | 86.6 | N10a | 3.06 | HM030542 | C16111T T16224C T16519C | A16258T T16311C |
| 3 | HM030500 | N10b | N10b | N10b | 100.0 | N10b | 0.00 | N10b | None | None |
| | | | | | | | 0.00 | HM030500 (N10b) | None | 309.2C |
| 4 | EF093557 | E1a1a | M52a | M52a | 74.9 | M52a | 4.62 | M52a | None | C16114A T16126C C16218T C16291T T16356C G16391A |
| 5 | EF093544 | E1a1a | E1a1a | E1a1a | 100.0 | E1a1a, E1a1a1 | 0.00 | E1a1a | None | None |
| | | | | | | | 0.00 | E1a1a1 | None | None |
| | | | | | | | 0.00 | EF093544 (E1a1a) | None | 309.1C |
| 6 | GU296545 | U5b2a1b | U5b2a1b | U5b2a1b | 93.5 | U5b2a1b | 0.61 | U5b2a1b, GU296545 | None | G16319A |
| 7 | EU979418 | H1a3c | H1a3 | H1a3c | 90.4 | H1a3c | 0.39 | JQ703419 | None | T146C |
| 8 | GU903270 | J2a1a1a | J2a1a1 | J2a1a1a | 100.0 | J2a1a1a, J2a1a1a2 | 0.00 | J2a1a1a | None | None |
| | | | | | | | 0.00 | J2a1a1a2 | None | None |
| | | | | | | | 0.00 | GU903270 (J2a1a1a) | None | None |
| 9 | EU770310 | K2b1a1 | K2b1a | K2b1a1 | 94.1 | K2b1a1 | 0.00 | EU770310 | None | None |
| 10 | GU296544 | U5b2a1a1 | U5b2a1a1 | U5b2a1a2 | 75.1 | U5b2a1a1 | 0.00 | GU296544 | None | None |
| 11 | EU597496 | K1a4a1e | K1a(K1a4a1) | K1a4a1e | 94.4 | K1a4a1e | 0.52 | K1a4a1e | None | T204C |
| | | | | | | | 0.52 | EU597496 (K1a4a1e) | 524.3A 524.4C | T204C |
| 12 | AY495209 | J1c6 | J1c6 | J1c+16261+189 | 92.2 | J1c-16261-189, J1c12 | 0.63 | J1c-16261-189 | T16126C | None |
| | | | | | | | 0.63 | J1c12 | T16126C | None |
| 13 | EU597496 | K1a4a1e | K1a(K1a4a1) | K1a4a1e | 94.4 | K1a4a1e | 1.51 | K1a4a1e | None | T204C A272G |
| | | | | | | | 1.51 | EU597496 (K1a4a1e) | 524.3A 524.4C | T204C A272G |
| 14 | AY882396 | U1a1 | U1a1 | U1a1b | 95.1 | U1a1b | 0.00 | JX289842 | None | 16193.1C |
| 15 | FJ348157 | J2a1a1a2 | J2a1a | J2a1a1 | 86.0 | J2a1a1 | 0.00 | HQ699438 | None | None |
| 16 | AY882396 | U1a1 | U1a1 | U1a1b | 95.1 | U1a1b | 0.00 | JX289842 | 309.2C | 16193.1C |
| 17 | AF347006 | V7a1, excluded for EMMA [20] | V7a | V7a | 76.1 | V7a, V7a1 | 1.55 | V7a | None | A73G A95C |
| | | | | | | | 1.55 | V7a1 | None | A73G A95C |
| | | | | | | | 1.55 | JX171078 (V7a1) | 309.2C | A73G A95C |
| 18 | AF347006 | V7a1, excluded for EMMA [20] | V7a | V7a | 85.8 | V7a, V7a1 | 1.55 | V7a | None | A73G A95C |
| | | | | | | | 1.55 | V7a1 | None | A73G A95C |
| | | | | | | | 1.55 | JX171078 (V7a1) | 309.2C | A73G A95C |
| 19 | EU095550 | B2d | B4b(B2d) | B4b | 89.0 | B2d | 0.00 | EU095550 | None | 16193.1C 309.2C 309.3C |
| 20 | FJ770954 | M43a | M(M43a) | D4e2a | 103.3 | D4e2a, M10, M10a, M74, M74b, M74b2, D4j-16311, D4j11, M43a | 0.38 | D4e2a | None | T16311C 573.2C 573.3C |
| | | | | | | | 0.47 | M10 | None | T16362C 573.2C 573.3C |
| | | | | | | | 0.47 | M10a | None | T16362C 573.2C 573.3C |
| | | | | | | | 0.53 | M74 | None | 573.1C 573.2C 573.3C |
| | | | | | | | 0.53 | M74b | None | 573.1C 573.2C 573.3C |
| | | | | | | | 0.53 | M74b2 | None | 573.1C 573.2C 573.3C |
| | | | | | | | 0.53 | D4j-16311 | None | 573.1C 573.2C 573.3C |
| | | | | | | | 0.53 | D4j11 | None | 573.1C 573.2C 573.3C |
| | | | | | | | 0.53 | FJ770954 (M43a) | 309.1C | 573.1C 573.2C 573.3C |
| | | | | | | | 0.53 | JQ702232 (M74b2) | 309.1C | 573.1C 573.2C 573.3C |
| | | | | | | | 0.63 | AP008834 (D4e2a) | None | T16311C T16519C 573.2C 573.3C |
| | | | | | | | 0.63 | AP008468 (D4e2a) | 573.4C | T16311C T16519C |
| 21 | HM030520 | M74b | M74 | D4j1b2 | 79.8 | M74b | 2.71 | HM030520 | C16214T | T16093C T16172C T16297C |
| 22 | AF347006 | V7a1, excluded for EMMA [20] | V7a | V7a | 90.7 | V7a, V7a1 | 0.80 | V7a | None | A95C |
| | | | | | | | 0.80 | V7a1 | None | A95C |
| 23 | AY495306 | V-@72 | HV(V) | V+@72 | 100.0 | V-@72, V1a1, V1a, V1a1b | 0.00 | V-@72 | None | None |
| | | | | | | | 0.25 | AY495306 | 309.1C | T16519C |
| 24 | HQ165756 | sample not in GenBank anymore | HV1(HV1b) | HV1b3 | 96.0 | HV1b3 | 0.00 | JQ704284 | None | None |
| 25 | AY339515 | Z1a1a | Z1a1a | Z1a | 100.0 | Z1a, Z1a1, Z1a1a, Z1a2 | 0.00 | Z1a | None | None |
| | | | | | | | 0.00 | AY339515 (Z1a1a) | 309.1C | None |
| 26 | EF660967 | J2a2a, excluded for EMMA [20] | J2a2a | J2a2a | 86.8 | J2a2a | 0.00 | JQ797922 | None | None |

(a) Likely haplogroup assigned in Ref. [7] based on Phylotree Build 13.
(b) Haplogroup classification by HaploGrep using Phylotree Build 15.
(c) Quality value assigned by HaploGrep.
(d) Haplogroups of rank 1 results estimated by EMMA.
(e) Estimated costs by EMMA; the default cost range of 0.3 was applied for all calculations.
(f) Source for rank 1 haplogroups by EMMA; haplogroup labels indicate virtual haplotypes from Phylotree, GenBank accession numbers indicate mtGenomes.
(g) Mutations within the test profile's range that are not present in the test profile but in the database profile.
(h) Mutations within the test profile's range that are present in the test profile but absent in the database profile.

We noted that in some cases virtual haplotypes lead to better estimates (lower costs) of the haplogroup than the relevant full mtGenome, especially when only a few distant mtGenome sequences are available for defining a haplogroup, or when private mutations put an mtGenome at further distance than the virtual haplotype. This is for example evident in the CR haplotype 4 (Table 1), which was assigned to haplogroup M52a both by the virtual Phylotree haplotypes (costs of 4.62–4.92) and the database of full mtGenomes. However, the estimate produced with the full genomes (JQ703445) came with costs of 5.74. Eventually, our analyses confirmed the advantage gained by including both full mtGenomes and virtual haplotypes as primary sources for estimating mtDNA haplogroups using EMMA.

3.3. EMMAforQCpurposes

In the course of developing and refining the software EMMA all haplotypes stored in the EMPOP database were tested. Since the algorithm applied by EMMA bases haplogroup assignment on costs (fluctuation rates), high cost profiles may also pinpoint errors in sequence data. The following example describes this particular application of the software in more detail. The CR profile of sample FRE228 from southwest Germany [EMPOP, unpublished] 16192T 16223T 16356C 16519C 73G 195C 263G 315.1C resulted in a best match with mtGenome JQ704890 (haplogroup H7b) at costs of 1.23 for two private mutations 16192T and 73G. Further estimates suggested FJ467950 (R8a1b, costs of 1.83, private mutations 16192T 16223T 16356C) and several samples from haplogroup U4b1a (and subhaplogroups thereof) at costs of 1.86 (see ESM5b). All of these haplotypes harbored differences from the profile in question due to the private mutations 16192T 16223T, as well as 499A, a polymorphism absent in the query sequence but an otherwise stable signature mutation for haplogroup U4'9. To clarify the haplogroup status of this sample the coding region segments 4402–5448 and 11,996–12,860 were sequenced. The variants thus observed, 4646C 4769G 12308G 12372A, confirmed haplogroup U4 status. The relatively high weight of mutation G499A (costs of 0.77), which could have back-mutated within haplogroup U4, put all other haplogroup U4 database haplotypes at higher total costs and thus lower ranks. A more detailed analysis of the EMMA output gave rise to the assumption that the sample could belong to haplogroup U4a2b, which is defined by CR mutations 310C and 16223T on top of the haplogroup U4 motif. The lack of the transition at 310, however, led to inconclusive results by EMMA suggesting various subclades within U4 as best estimates. Re-examination of the CR raw data revealed that despite redundant sequence coverage (4/5 sequences) an interpretation error had occurred in the primary analysis of this sample. The sequence contig of sample FRE228 mistakenly contained a forward sequencing read of sample FRE208 with the HVS-II motif 204C 263G 315.1C, which led to an artificial recombinant in the consensus sequence reported for FRE228. For a screenshot of the sequence contig see ESM6. The corrected consensus haplotype for sample FRE228 reads 16192T 16223T 16356C 16519C 73G 195C 263G 310C 499A 524.1A 524.2C 524.3A 524.4C 524.5A 524.6C, which was fully compatible with haplogroup status U4a2b. This dataset was analyzed at a very early stage of the EMPOP database project and loaded with Release 1. Based on this finding the entire dataset from southwest Germany was inspected again and one other similar instance was observed where a wrongly imported sequence read caused an incorrect consensus haplotype. Upon reanalysis, it was established that the CR profile of sample FRE237 16069T 16126C 16189C 16519C 73G 185A 188G 228A 263G 295T 315.1C 462T 489C did not carry 309.1C. This was corrected in EMPOP Release 9 (see also sample history at EMPOP). This example demonstrates the power of haplogrouping as a tool for quality control by highlighting missing and private mutations of a sample in question.
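
The effect of the contig error on the FRE228 consensus can be made explicit by comparing the originally reported and the corrected haplotype as sets of variants. The following is a minimal sketch in Python; the helper names `parse_profile` and `position_key` are hypothetical and are not part of EMMA or EMPOP. Only the two haplotypes are taken from the text.

```python
def parse_profile(text):
    """Split a whitespace-separated, rCRS-coded haplotype into a set of variants."""
    return set(text.split())


def position_key(variant):
    """Sort key: numeric position (including an insertion index such as 315.1)."""
    return float(variant.rstrip("ACGTDEL"))


# Haplotypes as given in the text for sample FRE228.
reported = parse_profile("16192T 16223T 16356C 16519C 73G 195C 263G 315.1C")
corrected = parse_profile(
    "16192T 16223T 16356C 16519C 73G 195C 263G 310C 499A "
    "524.1A 524.2C 524.3A 524.4C 524.5A 524.6C"
)

# Variants present in only one of the two consensus haplotypes.
only_reported = sorted(reported - corrected, key=position_key)
only_corrected = sorted(corrected - reported, key=position_key)

print("only in reported profile: ", only_reported)
print("only in corrected profile:", only_corrected)
```

The variants listed as present only in the corrected profile include 310C and 499A, the two sites whose absence had kept the sample from matching haplogroup U4a2b.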

Reported instances of artificial recombination (Table 3 in Ref. [7]) were further tested with EMMA following the same logic (ESM7). In all but one sample (#7, ESM7) the haplogroups identified by EMMA were identical to one of the two reported components in Ref. [7]. For sample #7, haplogroups C4c1b and C7a2 were proposed by EMMA since the haplogroup C1-specific transition T16325C and the tandem deletion at positions 290 and 291 were missing in the given example. More interestingly, all haplogroup estimates were flagged with rather high costs (ESM7), indicating that the haplotypes may need to be checked with the submitting authors in a QC format. In our experience, cost values exceeding 2.00 were generally correlated with error or misalignment, except for a few unusual distant haplotypes. Therefore, we suggest prioritizing those samples for data review.
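
The suggested prioritization could be sketched as a simple filter over the lowest cost obtained for each sample. This is an illustration only: the function name, the sample IDs, and the cost values are hypothetical, and reading "exceeding 2.00" as a strict comparison is an assumption.

```python
REVIEW_THRESHOLD = 2.00  # value suggested in the text; strict ">" is an assumption


def flag_for_review(best_costs, threshold=REVIEW_THRESHOLD):
    """Return sample IDs whose lowest cost exceeds the threshold, highest cost first."""
    flagged = [(cost, sample) for sample, cost in best_costs.items() if cost > threshold]
    return [sample for cost, sample in sorted(flagged, reverse=True)]


# Hypothetical sample IDs and best-estimate costs, for illustration only.
best_costs = {"S1": 0.35, "S2": 2.00, "S3": 2.75, "S4": 4.10}
print(flag_for_review(best_costs))
```

A flagged sample is not necessarily erroneous; as noted above, unusual distant haplotypes can also produce high costs, so the list is a starting point for manual review.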

3.4. Limitations – critical samples

The full mtGenome sequence from Spain [Sample ID Z034, unpublished, personal communication] 73G 195C 263G 315.1C 497T 524.1A 524.2C 750G 1189C 1438G 1811G 2706G 3480G 4769G 5460A 5655C 7028T 7521A 8860G 9055A 9698C 10398G 10550G 11299C 11467G 11719A 11914A 12308G 12372A 14167T 14766T 14798C 15326G 16093C 16189C 16224C 16311C 16519C was initially assigned to haplogroup K1a12 by the authors. EMMA’s lowest cost estimate (costs of 3.41) was produced with a virtual haplotype for haplogroup K1a1 with private mutations 16189C 195C 5460A 5655C 7521A. Haplogroup K1a12 was ranked second with costs of 3.82 resulting from private mutations 16093C 16189C 195C 5655C 7521A 11914A. The costs for coding region mutations 5460A and 11914A were equal (costs of 1.05 each) and the two remaining private coding region mutations (5655C and 7521A) were found for both results. The additional costs of 0.41 for K1a12 were contributed by T16093C (see ESM5a). The best estimate based on the mtGenome database was found with JQ703323 (haplogroup K1a12) with costs of 3.97 for four private mutations (16189C 5655C 7521A 11914A) and additional costs of 0.94 for the missing mutation 16343G (ESM5b). Although in virtually all cases full mtGenomes can be assigned unambiguously to a haplogroup, this example highlights an exception to that general rule. In this case, primarily the resolution of haplogroups K1a1 and K1a12 by full mtGenomes and the subsequent definition of the mitochondrial phylogeny caused difficulties in haplogrouping of this sample, which can be attributed to the relatively high mutability (and hence weak diagnostic value) of sites 11,914 and 5460 defining haplogroups K1a1 and K1a12, respectively.
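
The cost difference between the two virtual-haplotype estimates can be checked by hand. The symbols below are introduced here for illustration only: $c(\cdot)$ denotes a cost and $S$ the summed cost of the four private mutations shared by both results (16189C, 195C, 5655C, 7521A). The check assumes that a total cost is the plain sum of the per-mutation costs, as the figures in the text suggest.

```latex
\begin{align*}
c(\text{K1a1})  &= S + c(5460\mathrm{A}) = S + 1.05 = 3.41
                 \;\Rightarrow\; S = 2.36,\\
c(\text{K1a12}) &= S + c(11914\mathrm{A}) + c(16093\mathrm{C})
                 = 2.36 + 1.05 + 0.41 = 3.82,\\
c(\text{K1a12}) - c(\text{K1a1}) &= c(16093\mathrm{C}) = 0.41.
\end{align*}
```

Because 5460A and 11914A carry the same cost, they cancel, and the ranking of the two haplogroups rests on the single control region mutation T16093C.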

Recurrent and unstable mutations are set in brackets by Phylotree [4]. These mutations are interpreted as optional for the respective clade of the mitochondrial phylogeny by EMMA. Thus, the absence of such a mutation in a test profile is not penalized with additional costs. This is different from HaploGrep [12] where brackets in Phylotree are ignored and the respective mutations treated in the same manner as other mutations. Consider the CR haplotype VEC033 from Venezuela 16223T 16241G 16325C 16362C 16519C 73G 94A 204C 263G 315.1C 489C in Ref. [30]. The best estimate by EMMA was haplogroup D4e1a with costs of 2.00. Haplogroup D1 and subhaplogroups thereof followed with costs of 2.26 (see ESM5a). Using HaploGrep the best estimate was haplogroup D1 with a quality value of 83.8%. D4e1a ranked fourth with a quality value of 79.7% since mutation 16092C for haplogroup D4e1 was missing in this haplotype. However, this mutation was stated in brackets in Phylotree, which is disregarded by HaploGrep, resulting in the reduced rank for haplogroup D4e1a. The northern Indian haplotype 16223T 16240G 16298C 16311C 16327T 16357C 73G 249DEL 263G 310C 315.1C 489C [Sample ID I_070, unpublished, range 16,024–16,365 73–340 438–576] was assigned to haplogroup C4a3b by EMMA (minimal costs of 1.59 toward mtGenome NA18547, see ESM5a). HaploGrep reported haplogroup D1 with a quality value of 88.7%. The correct haplogroup C4a3b ranked second with a quality value of 87.0%. Again, this was due to the fact that 16278T for haplogroup C4a3 was reported as missing in the haplotype even though this mutation was set in brackets in Phylotree. These examples demonstrate that unstable/recurrent mutations can play a critical role in the assignment of haplogroups and therefore must be properly handled in automated software solutions. Besides mutation stability, the amount of sequence information available is vital for reliable haplogroup estimation. The lack of information resulting from sequence stretches omitted for analyses causes ambiguous haplogroup estimates. In ESM8 and ESM9 we present four control region profiles of the EMPOP database for which the full mtGenome was analyzed, providing the maximum resolution possible. We demonstrate that haplogrouping cannot be performed in a fully automated manner and that manual inspection is required, especially for partial profiles.
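
The treatment of bracketed mutations as optional can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function name, the toy motif, and the cost values are hypothetical and do not reproduce the Phylotree definition of any haplogroup, EMMA's cost table, or HaploGrep's quality score. The sketch only shows how skipping optional mutations changes the penalty for missing mutations.

```python
def missing_cost(motif, optional, profile, costs):
    """Sum the costs of motif mutations that are absent from the profile.

    Mutations in `optional` (bracketed in Phylotree) are skipped,
    so their absence adds no cost.
    """
    return sum(costs[m] for m in motif if m not in profile and m not in optional)


# Hypothetical toy motif and costs, NOT a real haplogroup definition.
motif = ["16092C", "16223T", "16362C", "489C"]
optional = {"16092C"}  # assumed to be set in brackets
costs = {"16092C": 0.5, "16223T": 0.25, "16362C": 0.25, "489C": 1.0}
profile = {"16223T", "16362C", "489C", "73G", "263G"}

# Brackets respected: the absent optional mutation is not penalized.
with_brackets = missing_cost(motif, optional, profile, costs)
# Brackets ignored: the same absence is penalized like any other mutation.
ignoring_brackets = missing_cost(motif, set(), profile, costs)

print(with_brackets, ignoring_brackets)
```

Under these toy values the candidate incurs no penalty when brackets are respected but a nonzero penalty when they are ignored, which is the kind of difference that can change the rank of a haplogroup estimate.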

4. Conclusions

With the public availability of Phylotree [4], which incorporates data from myriad sources and ultimately serves as the most up-to-date and comprehensive mtDNA phylogeny, the assignment of haplogroups to haplotypes has been greatly facilitated. Nevertheless, and despite the broad acceptance of Phylotree in population, medical, and forensic genetics, the task of manually classifying mitochondrial haplotypes into haplogroups remains tedious and error-prone. The manual evaluation of the mutation rates/weights, an important factor in haplogroup assignment, can be biased, and the instability of particular mutations within haplogroups (as opposed to across the entire tree) is difficult to incorporate manually.

Though Phylotree gave rise to (semi-)automated tools for haplogrouping, these approaches have been shown to be restricted in their capacity to overcome the deficiencies of manual haplogrouping [7]. The main limitations of these tools stem from the fact that they rely solely on haplogroup-defining patterns of mutations derived from the phylogenetic tree and they lack a database of real haplotypes that takes private mutations into account.

To improve haplogrouping of mtDNA sequences we propose EMMA, a concept for estimating mitochondrial DNA haplogroups using a maximum likelihood approach. The concept is based on two pillars: a comprehensive and curated database of 14,990 full mtGenomes and 3925 virtual haplotypes of Phylotree Build 15 compiled to represent the backbone for drawing haplogroup estimates, and a database of 19,171 manually haplogrouped CR haplotypes that were used to determine the fluctuation rate of mutations using a maximum likelihood approach. Based on these fluctuation rates, cost values resulting from differences between the mtDNA profiles are calculated by EMMA. Lowest cost profiles within a defined tolerance are reported in the output.
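
The reporting step can be sketched as follows. This is a minimal illustration, not EMMA's implementation: the function name is hypothetical, the tolerance values are arbitrary assumptions (the text does not state the tolerance used), and the candidate costs are those quoted for sample FRE228 in the quality control example.

```python
def report_within_tolerance(candidates, tolerance):
    """Return (haplogroup, cost) pairs within `tolerance` of the lowest cost, sorted by cost."""
    if not candidates:
        return []
    best = min(cost for _, cost in candidates)
    kept = [(hg, cost) for hg, cost in candidates if cost - best <= tolerance]
    return sorted(kept, key=lambda pair: pair[1])


# Costs quoted in the text for the originally reported FRE228 profile.
candidates = [("U4b1a", 1.86), ("H7b", 1.23), ("R8a1b", 1.83)]

print(report_within_tolerance(candidates, tolerance=0.5))  # narrow tolerance (assumed value)
print(report_within_tolerance(candidates, tolerance=0.7))  # wider tolerance (assumed value)
```

A narrow tolerance returns only the single lowest cost estimate, whereas a wider one also surfaces the alternative haplogroups that, in the FRE228 case, pointed toward the correct U4 assignment.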

The approach was validated using full mtGenomes cited in Phylotree as blind samples, along with literature data for which the haplogroup assignments had been previously challenged. EMMA will be made available with a new EMPOP version that is currently under construction, to be used in the evaluation of the ever growing number of quality-controlled CR sequences. In turn, these new CR data will further enhance the quality of EMMA’s haplogroup estimates and improve this already useful tool for mtDNA analysis.

Acknowledgments

The authors would like to thank Martin Bodner, Liane Fendt, Gabriela Huber, Simone Nagl, Daniela Niederwieser, and Bettina Zimmermann for additional sequencing and haplogrouping of samples. The authors are grateful to Jodi Irwin for comments on the manuscript and fruitful discussions. The study was supported by the FWF Austrian Science Fund (TR397) and the National Institute of Justice (NIJ) grant 2011-MU-MU-K402 and has received funding from the European Union Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 285487 (EuroForGen-NoE). MvO was supported in part by the Netherlands Forensic Institute (NFI) and by a grant from the Netherlands Genomics Initiative (NGI)/Netherlands Organization for Scientific Research (NWO) within the framework of the Forensic Genomics Consortium Netherlands (FGCN).

References

[1] S. Anderson, A.T. Bankier, B.G. Barrell, M.H. de Bruijn, A.R. Coulson, J. Drouin, I.C. Eperon, D.P. Nierlich, B.A. Roe, F. Sanger, P.H. Schreier, A.J. Smith, R. Staden, I.G. Young, Sequence and organization of the human mitochondrial genome, Nature 290 (1981) 457–465.

[2] R.M. Andrews, I. Kubacka, P.F. Chinnery, R.N. Lightowlers, D.M. Turnbull, N. Howell, Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA, Nat. Genet. 23 (1999) 147.

[3] H.J. Bandelt, Q.P. Kong, M. Richards, V. Macaulay, Estimation of mutation rates and coalescence times: some caveats, in: H.J. Bandelt, V. Macaulay, M. Richards (Eds.), Human Mitochondrial DNA and the Evolution of Homo Sapiens, Springer-Verlag, Berlin, Germany, 2006, pp. 47–90.

[4] M. van Oven, M. Kayser, Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation, Hum. Mutat. 30 (2009) E386–E394.

[5] H.J. Bandelt, V. Macaulay, M. Richards, Human Mitochondrial DNA and the Evolution of Homo Sapiens, Springer-Verlag, Berlin/Heidelberg, 2006.

[6] H.J. Bandelt, A. Achilli, Q.P. Kong, A. Salas, S. Lutz-Bonengel, C. Sun, Y.P. Zhang, A. Torroni, Y.G. Yao, Low penetrance of phylogenetic knowledge in mitochondrial disease studies, Biochem. Biophys. Res. Commun. 333 (2005) 122–130.

[7] H.J. Bandelt, M. van Oven, A. Salas, Haplogrouping mitochondrial DNA sequences in legal medicine/forensic genetics, Int. J. Legal Med. 126 (2012) 901–916.

[8] A. Salas, A. Carracedo, V. Macaulay, M. Richards, H.J. Bandelt, A practical guide to mitochondrial DNA error prevention in clinical, forensic, and population genetics, Biochem. Biophys. Res. Commun. 335 (2005) 891–899.

[9] L. Fan, Y.G. Yao, MitoTool: a web server for the analysis and retrieval of human mitochondrial DNA sequence variations, Mitochondrion 11 (2011) 351–356.

[10] L. Fan, Y.G. Yao, An update to MitoTool: using a new scoring system for faster mtDNA haplogroup determination, Mitochondrion 13 (2013) 360–363.

[11] F. Rubino, R. Piredda, F.M. Calabrese, D. Simone, M. Lang, C. Calabrese, V. Petruzzella, M. Tommaseo-Ponzetta, G. Gasparre, M. Attimonelli, HmtDB, a genomic resource for mitochondrion-based human variability studies, Nucleic Acids Res. 40 (2012) D1150–D1159.

[12] A. Kloss-Brandstätter, D. Pacher, S. Schönherr, H. Weissensteiner, R. Binna, G. Specht, F. Kronenberg, HaploGrep: a fast and reliable algorithm for automatic classification of mitochondrial DNA haplogroups, Hum. Mutat. 32 (2011) 25–32.

[13] I. Soares, A. Amorim, A. Goios, mtDNAoffice: a software to assign human mtDNA macro haplogroups through automated analysis of the protein coding region, Mitochondrion 12 (2012) 666–668.

[14] H.Y. Lee, I. Song, E. Ha, S.B. Cho, W.I. Yang, K.J. Shin, mtDNAmanager: a Web-based tool for the management and quality analysis of mitochondrial DNA control-region sequences, BMC Bioinformatics 9 (2008) 483.

[15] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2012, ISBN 3-900051-07-0, http://www.R-project.org/.

[16] D.M. Behar, M. van Oven, S. Rosset, M. Metspalu, E.L. Loogvali, N.M. Silva, T. Kivisild, A. Torroni, R.A. Villems, Copernican reassessment of the human mitochondrial DNA tree from its root, Am. J. Hum. Genet. 90 (2012) 675–684.

[17] A. Salas, M. Coble, S. Desmyter, T. Grzybowski, L. Gusmao, C. Hohoff, M.M. Holland, J.A. Irwin, T. Kupiec, H.Y. Lee, B. Ludes, S. Lutz-Bonengel, T. Melton, T.J. Parsons, H. Pfeiffer, L. Prieto, A. Tagliabracci, W. Parson, A cautionary note on switching mitochondrial DNA reference sequences in forensic genetics, Forensic Sci. Int. Genet. 6 (2012) e182–e184.

A.W. Röck et al. / Forensic Science International: Genetics 7 (2013) 601–609


[18] C. Herrnstadt, J.L. Elson, E. Fahy, G. Preston, D.M. Turnbull, C. Anderson, S.S. Ghosh, J.M. Olefsky, M.F. Beal, R.E. Davis, N. Howell, Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences for the major African, Asian, and European haplogroups, Am. J. Hum. Genet. 70 (2002) 1152–1171.

[19] G. Gasparre, A.M. Porcelli, E. Bonora, L.F. Pennisi, M. Toller, L. Iommarini, A. Ghelli, M. Moretti, C.M. Betts, G.N. Martinelli, A.R. Ceroni, F. Curcio, V. Carelli, M. Rugolo, G. Tallini, G. Romeo, Disruptive mitochondrial DNA mutations in complex I subunits are markers of oncocytic phenotype in thyroid tumors, Proc. Natl. Acad. Sci. U.S.A. 104 (2007) 9001–9006.

[20] Y.G. Yao, A. Salas, I. Logan, H.J. Bandelt, mtDNA data mining in GenBank needs surveying, Am. J. Hum. Genet. 85 (2009) 929–933.

[21] E.D. Gunnarsdottir, M. Li, M. Bauchet, K. Finstermeier, M. Stoneking, High-throughput sequencing of complete human mtDNA genomes from the Philippines, Genome Res. 21 (2011) 1–11.

[22] A. Schönberg, C. Theunert, M. Li, M. Stoneking, I. Nasidze, High-throughput sequencing of complete human mtDNA genomes from the Caucasus and West Asia: high diversity and demographic inferences, Eur. J. Hum. Genet. 19 (2011) 988–994.

[23] M.K. Gonder, H.M. Mortensen, F.A. Reed, A. de Sousa, S.A. Tishkoff, Whole-mtDNA genome sequence analysis of ancient African lineages, Mol. Biol. Evol. 24 (2007) 757–768.

[24] R. Rajkumar, J. Banerjee, H.B. Gunturi, R. Trivedi, V.K. Kashyap, Phylogeny and antiquity of M macrohaplogroup inferred from complete mtDNA sequence of Indian specific lineages, BMC Evol. Biol. 5 (2005) 26.

[25] A.W. Röck, J. Irwin, A. Dür, T. Parsons, W. Parson, SAM: String-based sequence search algorithm for mitochondrial DNA database queries, Forensic Sci. Int. Genet. 5 (2011) 126–132.

[26] H.J. Bandelt, W. Parson, Consistent treatment of length variants in the human mtDNA control region: a reappraisal, Int. J. Legal Med. 122 (2008) 11–12.

[27] P. Soares, L. Ermini, N. Thomson, M. Mormina, R. Rito, A. Röhl, A. Salas, S. Oppenheimer, V. Macaulay, M.B. Richards, Correcting for purifying selection: an improved human mitochondrial molecular clock, Am. J. Hum. Genet. 84 (2009) 740–759.

[28] M.C. Bobillo, B. Zimmermann, A. Sala, G. Huber, A.W. Röck, H.J. Bandelt, D. Corach, W. Parson, Amerindian mitochondrial DNA haplogroups predominate in the population of Argentina: towards a first nationwide forensic mitochondrial DNA sequence database, Int. J. Legal Med. 124 (2010) 263–268.

[29] S. Rand, M. Schurenkamp, C. Hohoff, B. Brinkmann, The GEDNAP blind trial concept part II. Trends and developments, Int. J. Legal Med. 118 (2004) 83–89.

[30] D. Castro de Guerra, P.C. Figuera, C.M. Bravi, J. Saunier, M. Scheible, J. Irwin, M.D. Coble, A. Rodriguez-Larralde, Sequence variation of mitochondrial DNA control region in North Central Venezuela, Forensic Sci. Int. Genet. 6 (2012) e131–e133.