Research
paper
FDSTools:
A
software
package
for
analysis
of
massively
parallel
sequencing
data
with
the
ability
to
recognise
and
correct
STR
stutter
and
other
PCR
or
sequencing
noise
Jerry
Hoogenboom
a,b,*
,1,
Kristiaan
J.
van
der
Gaag
a,b,1,
Rick
H.
de
Leeuw
a,
Titia
Sijen
b,
Peter
de
Knijff
a,
Jeroen
F.J.
Laros
aa
DepartmentofHumanGenetics,LeidenUniversityMedicalCenter,Leiden,2300RC,TheNetherlands
b
DivisionBiologicalTraces,NetherlandsForensicInstitute,LaanvanYpenburg6,2497GB,TheHague,TheNetherlands
ARTICLE INFO
Articlehistory:
Received26September2016
Receivedinrevisedform31October2016 Accepted23November2016
Availableonline27November2016
Keywords: Forensicscience
Nextgenerationsequencing Massivelyparallelsequencing Shorttandemrepeat Powerseq
FDSTools Software
ABSTRACT
Massivelyparallelsequencing(MPS)isontheadventofabroadscaleapplicationinforensicresearchand
casework.Theimprovedcapabilitiestoanalyseevidentiarytracesrepresentingunbalancedmixturesis
oftenmentionedasoneofthemajoradvantagesof thistechnique.However,mostoftheavailable
softwarepackagesthatanalyseforensicshorttandemrepeat(STR)sequencingdataarenotwellsuited
forhighthroughputanalysisofsuchmixedtraces. Thelargestchallengeisthepresenceofstutter
artefactsinSTRamplifications,whicharenotreadilydiscernedfromminorcontributions.FDSToolsisan
open-sourcesoftwaresolutiondevelopedforthispurpose.Thelevelofstutterformationisinfluencedby
variousaspectsofthesequence,suchasthelengthofthelongestuninterruptedstretchoccurringinan
STR.WhenMPSis used,STRsareevaluatedassequencevariantsthateachhave particularstutter
characteristicswhichcanbepreciselydetermined.FDSToolsusesadatabaseofreferencesamplesto
determinestutterandothersystemicPCRorsequencingartefactsforeachindividualallele.Inaddition,
stuttermodelsarecreatedforeachrepeatingelementinordertopredictstutterartefactsforallelesthat
arenotincludedinthereferenceset.Thisinformationissubsequentlyusedtorecogniseandcompensate
forthenoiseinasequenceprofile.Theresultisabetterrepresentationofthetruecompositionofa
sample.UsingPromegaPowerseqTMAutoSystemdatafrom450referencesamplesand31two-person
mixtures,weshowthattheFDSToolscorrectionmoduledecreasesstutterratiosabove20%tobelow3%.
Consequently,muchlowerlevelsofcontributionsinthemixedtracesaredetected.FDSToolscontains
modules tovisualisethedata inaninteractiveformat allowinguserstofilterdata withtheirown
preferredthresholds.
ã2016TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCC
BY-NC-NDlicense(http://creativecommons.org/licenses/by-nc-nd/4.0/).
1.Introduction
AnalysisofShortTandemRepeats(STRs)hasbeenasuccessful
forensic tool in the past two decades. The comparison of STR
profiles from forensic DNA evidentiary traces with reference
samplesandDNAdatabaseshasprovidedessentialinformationin many forensic cases [1]. Standard practice is to use Capillary Electrophoresis (CE) to analyse STR length variation. In recent
years,Massively Parallel Sequencing(MPS) wasintroducedasa newmethodtoanalyseSTRsandotherforensicDNAmarkers[2,3].
MPS enables the simultaneous detection of both length and
sequence variation of STRs, which increasesthe discriminatory value substantially [4–6]. The output of CE consists of peaks reflectingfluorescentsignalintensitieswiththeirownrespective shapesandpeakheights.TheoutputofMPSdataanalysisconsists simplyofreadcountsoftheobservedsequences.Bothmethodscan sufferfromtheoccurrenceofPCRartefactssuchasSTRstutters[7]. This especially complicatestheanalysis of STR profilescoming
from multiple contributors, which is common in forensic
evidentiarytraces[8].Thelevelofstutterformationdependson a numberofdistinctaspects ofthesequence,includingtheA/T contentoftherepeatunitandthenumberofconsecutiverepeat units occurring in an STR [9]. Since any specific STR length
* Correspondingauthor.
E-mailaddresses:j.hoogenboom@nfi.minvenj.nl(J.Hoogenboom),
k.van.der.gaag@nfi.minvenj.nl(K.J. vanderGaag),[email protected]
(R.H.deLeeuw),t.sijen@nfi.minvenj.nl(T.Sijen),[email protected](P.deKnijff),
[email protected](J.F.J. Laros).
1
Theseauthorscontributedequally.
http://dx.doi.org/10.1016/j.fsigen.2016.11.007
1872-4973/ã2016TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCCBY-NC-NDlicense( http://creativecommons.org/licenses/by-nc-nd/4.0/).
ContentslistsavailableatScienceDirect
Forensic
Science
International:
Genetics
identifiedbyCEcanconsistofmultipledifferentsequences,these CE-identifiedlengthvariantsshowalargervariationinmeasured stutterpercentage than individual sequences analysed through MPS.ThisdecreasedvariationinstutterpercentageforMPSSTR datamayaidintheinterpretationofmixtures[2],asitallowsfora betterpredictionofstutterbehaviour,whichcanbeusedtofilter thedataforstutterproducts.Existingsoftwarepackagesforthe analysisofSTRsequencingdata[10–12]donotsupportextensive filteringandcorrectionofsystemicPCRand/orsequencingerrors andthereforeseemlesssuitedforanalysisofmixedDNAsamples. Thispromptedustodevelopasoftwarepackagethatharboursthe followingfeatures:1)characterisationandcorrectionofnoisein thesequencingdatacausedbyPCRstutterorothersystemicPCR and/orsequencingerrors;2)visualisationofsequencingdataas comprehensiveprofiles;3)filteringofdataingraphsandtables withuserdefinablethresholds and4)open-sourceaccessibility. Forensic DNA Sequencing Tools (FDSTools) is available via the PythonPackageIndex(eitherbymanualinstallationorbyusing thecommand‘pipinstallfdstools’).Weassesstheperformanceof
FDSToolson31two-personmixturesgenotypedviathePromega
PowerseqTMAutoSystemforwhichwefirstgeneratedareference datasetof450samples.
2.Materialsandmethods
2.1.Samplepreparation
PCRproductsandsequencinglibrarieswere preparedasdescribed previously[2]usingaprototypePromegaPowerseqTMAutoSystem containing23STRsandamelogenin.Asetof450Dutchsamples[13]
and31two-personmixtureswereamplifiedandsequenced.The
mixturesconsistedofthreecombinationsoftwodonorsselected randomlyfromapoolofunrelatedindividuals,whichweremixedin differentratios.Theminorcomponentsinthemixturescontributed 0.5%(sixmixtures),1%(sixmixtures),5%(fourmixtures),10%(six mixtures),20%(sixmixtures)and50%(threemixtures).
Sincethemixtureswereusedtotesttheperformanceofthe softwareandalsotodetermineanalysisthresholdsthatarefitfor purpose,webalancedtheinfluenceofvaryingDNAinputsinthe
PCR and increased drop-out due to low DNA input. This was
achievedbytheuseofa minimumof theminorcomponentof
60pginthe0.5%,1%and5%mixtures,resultinginatotalDNAinput of12ng,6ngand1.2ng,respectively(60pgresultedinlessthan 20%drop-outinthevalidationofPowerplex6C[14]).Thesame totalDNAinputof1.2ngwasusedforthe5%,10%,20%and50% mixtures(resultingin120pgand240pgoftheminorcomponents inthe10% and20% mixtures,respectively).TheDNAinputwas 0.5ngforsingledonorsamples.
Thegenotypesofthedonorsusedinthemixtureswereknown, which enablestheidentification ofdrop-in and drop-out allele calls.Paired-endsequencingdataofallampliconswasgenerated usingtheMiSeq1Sequencer(Illumina).
2.2.Initialdataprocessing
InFig.1,themaintoolsoftheFDSToolspackageandtheirrolein thedataanalysispipelinearedisplayed.Thetoolscanbesplitinto threefunctionalgroups:toolsforreferencedatabasecreation,tools forreferencedatabasecuration(dataqualityassessment)andtools forcasesamplefilteringanddatainterpretation.Inaddition,the packagecontainsinitialdataprocessingtoolssuchasTSSV[10]that arecommontoreferencedatabasesamplesandcasesamples.
Fig.1.Flowchartoftheanalysisprocess,showingthemaintoolsofFDSTools.Flowchartshowingthemaintools(bluerectangles)oftheFDSToolspackageandtheirrolesin thedataanalysispipeline.TheoutputofeachtoolcanbevisualisedusingtheVistool(notshown).(Forinterpretationofthereferencestocolourinthisfigurelegend,the readerisreferredtothewebversionofthisarticle.)
2.2.1.Paired-endreadmerging
Using paired-end sequencing, forward and reverse strand
molecules of each amplicon were sequenced from both ends.
Thefirst300nucleotidesfromeitherendwereobtained.These readpairsweremergedintoaconsensusreadbyaligningtheread pair such that the largest possible overlap is obtained while allowingforupto33%mismatchesintheoverlappedregion.Most ampliconswereabout300basepairsinlengthandprovidedfully complementaryreadpairs.
WithSTRampliconsthatarelongerthan300bp,aproblemmay occurwhenbothreadsendinthemiddleoftheSTRstructureand thepairmaybemergedintoatruncatedSTRsequence.Amodified versionofFLASH1.2.11[15](availableviagithub.com/Jerrythafast/ FLASH-lowercase-overhang)wasusedtomarkthebasesthatwere notintheoverlappedregioninlowercaseintheconsensusread. ThisenablesdetectionoftruncatedSTRsequencesindownstream analysis.
2.2.2.Linkingreadstolociandalleles
Themergedreadsarelinkedtospecificlociandallelesbythe TSSVtool,whichisawrapperaroundasimplifiedversionofthe TSSV[10] programcalled TSSV-Lite.TSSV linksreads tolociby scanningthereadsforthesequencesflankingtheSTRlociused. Theflankingsequencesofeachlocus,thatusuallyrepresentthe most50nucleotidesoftheprimers,areprovidedtoFDSToolsina libraryfile,togetherwithvariousotherdetailsaboutthelociused. SupplementaryFile1representsthelibraryfileusedinthisstudy. Thefilecontainsadescriptionforthecontentsofeachsection.
Eachreadisscannedfortheseflankingsequencesbycomputing
alignments. In this study, the flanking sequences were 18
nucleotidesinlengthand twosubstitutions(ortwoinsertedor deletedbases)perflankwereallowedinthealignment.Readsare categorised as ‘unrecognised’ if no flankingsequence is found. Furthermore,bothflankingsequencesarerequiredtohaveatleast oneuppercaseletter,whichensuresthatoverlappedreadsthatare potentially truncatedare categorised as ‘unrecognised’ as well. Readsinwhichonlyoneflankingsequenceisfoundwithatleast oneuppercaselettergetlinkedtoalocusbutflaggedas‘nostart’or ‘noend’dependingonwhethertheleftorrightflankismissing, respectively(optionally, thesereads canbewritten toseparate fastaorfastqfiles).
ThemainoutputofTSSVisatextfilewithtab-separatedvalues. Thefilecontainsonelineforeveryuniquesequenceofeachlocus. Thecolumnsincludethenameofthelocus,thesequence,andthe
numberofreadscarryingthisparticularsequence.Readcountsare givenseparatelyfortheforwardandreversestrand.
TSSVincludesadditionaloptionsforfilteringsequencesthatare seentoofewtimesandsequenceswithalengthoutsideagiven range(e.g.,primer-dimers).Thisrangecanbespecifiedseparately foreachlocus.Furthermore,filteredsequencescanbeaggregated into a single ‘othersequences’ category for each locus. In this project,onlysingletons(i.e.,sequenceswithonlyoneread)were aggregatedtothe‘othersequences’category.
2.3.Buildingareferencedatabase
OnefunctionofFDSToolsisthebuildingofareferencedatabase. Suchadatabasecanbeusedtoobtainestimatesofrecurring allele-specific systemic noise. Here, ‘noise’ refers to the complete collectionofsequencesobservedinasample,exceptthesample’s trueallelicsequences.Noiseincludesanyartefactderivingfrom thePCRaswellasthesequencing(suchasPCRstutteror single-nucleotide errors). Additionally, based on the reference data a statisticalmodelcanbederivedthataimstopredictstutterratios forallelesnotpresentinthereferenceset.
The creation of a reference database involves various tools includedintheFDSToolspackage,whichwillbediscussedinthe nextsections.Inadditiontotheseseparatetools,FDSToolsoffers the Pipeline tool,which convenientlyintegrates theentire data analysispipeline.UsersareadvisedtousePipelineasitremovesthe complexityofhavingtorunseveralseparatetoolsandtocombine theiroutput.Pipelinetakesasimpleconfigurationfilecontaining theanalysis parametersand automaticallyrunstheappropriate tools.
Buildingareferencedatabaseisatwo-phaseprocess.Inthefirst phase,thereferencesamplesareanalysedinaglobalmannerto identifytheirallelesandrejectthosesamplesinwhichthealleles arenotreadilyidentified.Inthesecondphase,thesystemicnoise ofeachoftheseallelesisanalysedindetail.
2.3.1.Allelecallingforreferencesamples
Determiningtheallelesofsingledonorreferencesamplesisa fairlystraightforwardprocessbecausethesegenerallyrepresent theoneortwomostabundantsequencesforanylocus.FDSTools includes Allelefinder to call alleles this way. It is applied after Stuttermark,whichisdescribedbelow.Anumberofthresholdsare used to guard against including alleles of potential low-level contaminations,whichareoutlinedinFig.2.Forheterozygousloci,
Fig.2.ThresholdsusedbyAllelefindertocallallelesinreferencesamples.Sequencevariantswithareadcountabovetheallelethresholdarecalledasalleles.Thefour lighter-shadedbarsrepresentstuttervariants(asrecognisedbyStuttermark),whichareignoredbyAllelefinder.
asecondalleleisonlycalledifitpassestheallelethreshold,which isdefined inthis project as 30%of theread countofthemost frequentalleleatthesamelocus.Asweexpectnostutterabove 30%[2],thisthresholdseparatestheallelesfromnoise.Noalleles arecalledatalocuswhenadditionalsequencesoccurthathavea
read count below the allele threshold but above the noise
threshold(whichisdefinedas15%ofthemostfrequentallelein thisproject)orifathirdsequencepassestheallelethreshold.If morethantwolociinthesamesamplefailtogivearesultforthese reasons,theoverallqualityofthesampleisconsideredtoopoorto reportanyalleles.Additionally,Allelefindercanbeconfiguredtocall atmostonealleleathaploidloci.
The three potential pitfalls are 1) PCR stutter artefacts that exceedthenoise threshold;2)strongread countimbalancefor heterozygousalleles,whichmaybetheresultofe.g.,primer-site sequencevariantsand3)autosomaltrisomy,whichisrare.Todeal
with the problem of stutter, each sample was analysed with
Stuttermark[2]beforecallingalleles.WithStuttermark,sequences thatareinastutterpositionofanothersequencewhilehavinga readcountbelowauser-suppliedpercentagewithrespecttothe othersequencearemarkedas‘stutter’.Sequencesthathavearead countthatistoohightobeexplainedbystutteralonewillnotbe markedas‘stutter’,astheymaycoincidewithagenuineallele.The thresholdsusedherewere30%for1stutter(lossofarepeatunit) and 10% for +1 stutter (gain of a repeat unit). For 2 stutter
products, a 30% threshold of the 1 stutter product is used.
Sequencesthataremarkedas‘stutter’arecompletelyignoredby Allelefinder.
Allelefinderproducesthelistofallelesandareportdetailingfor which samplesand loci allelecallingis rejected and for which reasons.
2.3.2.Estimatingaverageallele-specificsystemicnoise
Foreachallele,aprofileofrecurringsystemicnoise,including PCRstutterproductsaswellasanyother‘sideproducts’,canbe generatedbasedonthereferencedata.Noiseprofilesarealways computedseparatelyforforwardandreversereads,becausestrand biasmayexistinthesequencingtechnologyused.Profilesarealso computedseparatelyforeach locus,undertheassumptionthat noiseproductionisnotinfluencedbyallelesofotherloci.Thelevel ofnoiseisexpressedasthenumberofnoisereadsasapercentage ofthenumberofreadsoftheparentallele.InthecontextofPCR stutteranalysis, thisquantityis oftenreferredtoasthe‘stutter ratio’, despitetherepresentation as a percentageof the parent allele.Weusethegeneralisedterm‘noiseratio’(alsorepresented as a percentage of the parent allele) to account for all other systemicnoiseaswell.
Noiseratio¼NoiseAllelereadsreads100
In homozygoussamples,thenoiseratiocanbecalculatedby dividingthe number of reads of a non-allelic sequence bythe number of reads of the allele.Allele-specific noise profilesare readilycomputedfromhomozygoussamplescarryingthisalleleby scalingthereadcountsineachsamplesuchthattheparentalleleis 100andaveragingthenoiseratiosforeachnoisesequence.These per-allelenoisestatisticsandotherstatistics,suchasthestandard deviationsofthenoiseratioscanbeobtainedusingBGHomStats.In heterozygoussamplestheextractionofnoisesequencesismore complex,becauseithastobedeterminedwhichproportionseach allelecontributedtotheobservednoise sequences.Weassume thatnoiseinheterozygoussamplescorrespondstothesumofthe noiseprofilesofthetwoalleles,aftertheapplicationofascaling correctiontoaccountfordifferencesintheamountofeachallele amplified.Thisisneededasevenforheterozygousallelepairs,PCR efficiencymayvaryduetoprimerbindingsitesequencevariation
orSTRlength[16].Toextractnoisefromheterozygousreference samplesaniterativeapproachwastakenandimplementedinthe BGEstimatetoolinFDSTools.
Inessence,thealgorithm,whichisdiscussedinmoredetailin SupplementaryText1,seeksanon-negativeleastsquaressolution tothematrixequationAP=C.Inthisequation,CisanNMmatrix ofconstantsderivedfromthereadcountsinthereferencesamples, AisanNNmatrixsummarisingtheallelebalanceinthesamples, and P is an NM matrix containing the estimated profilesof systemicnoise.Nisthenumberofuniquegenuineallelesamong
the reference samples and thus also the number of profiles
producedandMisthetotalnumberofuniquesequencesobserved. MatrixCiscomputedonceatthestartofthealgorithm.Each rowinCcorrespondstoonealleleandcontainsthesumoftheread countsofallsamplesthathavethatparticularallele,afterscaling thealleleto100readsforhomozygoussamplesand50readsfor heterozygotes. The noise profiles in P are initialised with the assumptionthatnosystemicnoiseispresent,i.e.,allelementsare setto0, except for theelementsthat correspond totheactual alleles,whicharesetto100.
Thealgorithmthenproceeds byrepeatedlyre-estimatingthe allelebalancematrixAwhilereadingcross-contributionsbetween
the alleles from the current profiles P and subsequently
re-estimating P by finding a non-negative least squares optimal
solutiontoAP=C.ThevaluesthusobtainedinParetheaverage noiseratiosofallobservedsystemicnoiseforallalleles(i.e.,each rowinPcontainsthenoiseprofileofoneallele).
Toavoidnoisefromoneallelebeingincorporatedinthenoise profileofanotherallele,aminimumofthreedifferent heterozy-gousgenotypesperallelewasusedinthisstudy.Athresholdcanbe setfortheminimalreadcountofnoisetoconsiderandtheminimal
percentage (weused 80%)of reference sampleswiththe same
allelewhichshouldcontainthesamenoisebeforeitisincludedin thenoiseprofile.Eachoftheseparameterscanbesetusingvarious optionsofthe‘fdstoolsbgestimate’command.
2.3.3.Relatingtheamountofstuttertorepeatlength
Withthemethodsoutlined above,profilesofsystemicnoise wereobtainedforeachallelepresentinthereferenceset.However, one would also like to be able to filter and correct the noise originatingfromallelesthatarenot(yet)includedinthereference set,ascasesamplesmaybeencounteredthatcontainallelesfor which noreferencesample was available. For this purpose, we
developedamethodtopredictthesequenceandcorresponding
amountofPCRstutterartefactsthatwouldbeproducedforany alleleofagivenlocus.Notethatthismethoddoesnotpredictnoise otherthannoiseresultingfromSTRstutterorsinglenucleotide stretches.
Previous studies have shown that the amount of stutter is
strongly correlated with the length of the repeated sequence
[17]andeven more sowiththenumberof consecutive repeat
units [2,18]. The FDSTools tool Stuttermodel seeks to fit
polynomial functions to the repeat length and stutter ratio in
homozygous referencesamples. Stuttermodel scans each of the
alleles forall positionswhere a particularrepeatunit (e.g.,the
sequence ‘AGAT’) is repeated and records the length of this
repeat, as the number of nucleotides, including incomplete
repeatsatthebeginningorendoftherepeatedstretch.Foreach
sample with this allele, the number of noise reads that lack
exactly one repeat is counted. Reads that combinethe loss of one repeat with one or more other differences (e.g., substitu-tions,orstutterinanotherstretchofrepeatsinthesameallele) areincludedinthiscount.Thecountsthusobtainedareusedto
compute the noise ratios of individual stutter sites and a
polynomial function is fitted to quantify the relationship
Thisanalysisisrepeatedforeachuniquerepeatunitofalength
betweenoneandaconfigurablemaximumnumberofnucleotides
(inclusive), treating cyclicallyequivalent units (e.g., ‘ATAG’ and ‘AGAT’)andtheirrespectivereversecomplements(e.g.,‘CTAT’and ‘ATCT’)synonymously.Theamountof+1stutter,2stutteretc.is analysedthesameway.
Becausedifferentlocibehavedifferentinstutterformation,a separatefunctionisfittedforeachlocus.Additionally,apolynomial functionisfittedtoalldataatonce,whichisusedtopredictstutter inallelesoflociforwhichinsufficientreferencedatawasavailable tofitalocus-specificfunction.Separatefunctionsarefittedforthe forwardandreversestrands.
Foreachfittedfunction,Stuttermodelalsodeterminesthelower boundoftherepeatlengthforwhichthefunctiongivesmeaningful results.Thislowerboundisdefinedasthelowestrepeatlengthfor whichthefunctionproducesanonnegativeresultandthefunction isnon-decreasing.Belowthisthreshold,andinanyotherpoints wherethefunctionvaluewouldbenegative,thefunctionvalueis settozero.
Thequalityoffitis assessedbycomputing thecoefficientof determination, R2¼1
S
iðyiy^iÞ 2S
iðyiyÞ 2 where y^i¼ f0i ffi0 i<0withyithenoiseratiosofthereferencesamples,ythemean,fithe polynomialfunction’sestimateofthenoiseratioofsamplei,and^y themodifiedfunctionvalue.TheR2scorewillbeclosetoonewhen thefunctionisagoodfitandlowerotherwise.
Stuttermodel supports fitting polynomial functions of any
degree.To preventover-fittingwhile stillallowinga non-linear
relationship, second-degree polynomials (with a minimum R2
score)wereused.IncaseswherethefitforonestrandhasanR2 scoreabovethethresholdwhilethefitfortheotherstrandscores belowthethreshold,bothfitsarerejectedtopreventunintended introductionofstrandbiasbyfilteringstutterononlyonestrand. 2.3.4.Curatingthereferencedatabase
Tomakesureallreferencesampleswereofgoodqualityandall alleleswere called correctly, they wereput through the same
analysis pipeline as case samples, thereby performing noise
filteringandcorrectiononthereferencesamples.Itisimportant tonotethatthesereferencesampleswerepreviouslygenotyped byus in greatdetailusing CE [13].The remaining amountsof noisein eachsamplewereassessedusingBGAnalyse(described below)toidentifypotentiallyunsuitable referencesamplesthat still passed the thresholds of Allelefinder. Any sample with a
notablyhigher amountof remainingbackgroundwas manually
removedfromthesetofreferencesamplestopreventpollutionof thenoiseprofiles.
BGAnalyse was developed and employed to analyse the
remainingnoiseaftercorrection.Foreachlocusandeachsample, thistoolcalculatestheleastfrequent(thiscanbeanegativevalue becauseofover-correction),mostfrequent,and totalnoise asa percentageofthenumberof readsofthehighestalleleateach locus.Theseresultsaresubsequentlyvisualisedtoeasilyidentify potentiallyproblematicsamples.Inthevisualisation,samplescan besorted byany of thecalculated values orbycoverage (total numberofreads).Samplesweresubjectedtomanualinspection andanysamplethatexhibitednon-stutterproductswithcorrected readcountsabove4%ofthemostfrequentalleleorabove2%ofthe totalreadswasrejected.
2.4.Analysingcasesamples
Theanalysisofmockcasesampleswasperformedina three-stepprocesswhichisdescribedinthefollowingsections.
1. A prediction was made for the amount of stutter for each
sequenceinthesample,usingthefittedpolynomialfunctions obtainedfromrunningStuttermodelonthereferencesamples. Thesepredictionsareusedtoextendtheallele-specificnoise profiles obtained from running BGEstimate on the reference samples.
2. Theextractednoiseprofilesareusedtofilterandcorrectthe noiseinthecasesample.
3. Alleles are called and the sample is subjected to manual
interpretation.
Similartothecreationofareferencedatabase,analysingcase
samples involves multiple tools discussed in the following
sections.Pipelineoffersaconvenientwaytoautomaticallyanalyse acasesamplewithalltoolsdiscussed.
2.4.1.Predictingstutteramountsforunknownalleles
Becausecasesamplesmaycontainallelesthatarenotpresentin thereferencesamples,noiseprofilesfortheseallelesneedtobe predicted. FDSTools includes the BGPredict tool, which uses a previously created Stuttermodel file to predict the amounts of stutter artefacts for alleles not present in the reference data. BGPredictfindsallsequencesintheanalysedcasesampleinwhich aparticularrepeatunitisrepeated.Theexpectedamountofstutter in this repeatis then computed usingthe correspondingfitted
polynomial function from the Stuttermodel file. All possible
combinations of stutter aretaken into consideration when the
frequencies of each stutter artefact are computed. The noise
profilescreatedinthiswayareusedtoextendthenoiseprofilesin thepreviouslycreated BGEstimate file(atool calledBGMerge is includedinFDSToolsforthispurpose).
2.4.2.Noisefilteringandcorrectionincasesamples
To beabletofiltersystemicnoise in casesamples,one first needstodeterminewhichallelesarelikelypresentinthesample. Tothisend,thealgorithmofBGEstimateisessentiallyreversed,i.e., thegoalisnowtosolveforainaP=c,wherecisarowvectorwith thesample’sreadcountsfortheMsequencesinthenoiseprofiles andaisarowvectorwiththeestimatedamountofeachoftheN profilesin the sample. P is the NM matrix of noise profiles obtainedfromBGEstimate,extendedwiththepredictionsobtained fromBGPredict.Solvingforaisdoneinanon-negativeleastsquares senseasbefore,givingestimatedallelecontributionsthatbestfit thevarioussequences–allelesaswellasnoise–presentinthe sample.
Background-correctedread countscanthen becomputedby
firstsubtractingthescaledprofilesfromthesample’sreadcounts
d caP
and then adding the total size of each profile to the
correspondingallele,i.e., dn dnþan
S
M
m¼1Pn;m; 8n2½1:::N
Notethatdmayhavenegativeelementsifthesamplecontainsa
loweramountof acertain sequencethanwas predictedbythe
profilesofitsdominantalleles.
FDSToolsoffersBGCorrecttofilterandcorrectbackgroundnoise followingtheprocedureoutlinedabove.Givenasampledatafile
(obtained from TSSV for example) and a file containing noise
additional columns giving the amounts of each sequence attributedtonoiseandtheamountsofeachsequencethatwould berecovered by noise correction (i.e., adding the noise tothe originating allele). These values are given separately for the
forward and reverse strand. Although the method by which
BGCorrectcomputes them resultsin non-integer values, it was decidednottoroundthesenumberstoavoidunnecessarylossof precision.Ifnecessary,thesenumberscanberoundedtointeger values,thereby easingtheinterpretation as‘read counts’ when presentedinagraphortableinareport.
2.4.3.Allelecallingforcasesamples
ThenaïvemethodofcallingallelesthatAllelefinderusesisnot appropriateforcasesamples,sincethesemaycontainallelesof multiple contributors in different quantities. Therefore, calling alleles in case samplesis done bycomputing various statistics
based on the information of the detected sequences and
subsequentlysettinginterpretationthresholdsonthesestatistics. Forthis,Samplestatswasdeveloped,whichoperatesonandadds variouscolumnstotheoutputofBGCorrect.Samplestats automati-callymarkssequencesas‘allele’usingthethresholdsoutlinedin Table1.
Alleles canalso becalledwhile visualisingthe sampledata, hence,FDSToolsincludestheSamplevisvisualisation.Bymeansof theinteractivegraphicaluserinterfaceofSamplevis,thesamesetof thresholdsasdepictedinTable1areavailabletofilterthevisible sequences and to automatically call alleles. Thresholds can be specifiedseparatelyforthegraphsandforthetables.Whilethe tabledisplaysthecalledalleles,lessconservativesettingsmaybe usedforthefilteringofthecorrespondinggraphtoensurevisibility of allelesjust below theallele-callingthreshold. The results of
changing the thresholds are immediately visible. Clicking a
sequenceinanyofthegraphstogglesits‘allele’status.Thisallows theusertomanuallyaddallelestoandremoveallelesfromthe profile.Anoteisaddedtomanuallyaddedalleles,statingthatthe alleleis‘User-added’.Similarly,iftheuserremovesanyalleles,the alleleremainsvisiblebuta‘User-removed’noteisadded.Inthis
wayitremainseasytotracebackexactlywhichallelesmeetthe
thresholdsandwhichonesweremanuallyaddedandremoved.
Samplestatscanalsobeusedtofiltersequencesusingthesame typesofthresholds(albeitwithmorestringentthresholdvalues than used for allele calling, as potential alleles should not be filteredout)and(optionally)aggregatethefilteredsequencesper locustoasinglelinecategorised‘othersequences’.
2.5.Visualisation
For visualisation of the data, FDSTools makes use of the
JavaScript graphing library Vega [19]. Vega graphs can be
embedded ona web page, exposing a JavaScript programming
interfacethatallowsforupdatingthegraphsbasedontheuser’s interactionwiththewebpage.VegacanalsorunonNode.js,which allowsittobeincludedinautomatedanalysispipelinestogenerate (static)imagefiles.
FDSToolscomeswithVegagraphspecificationsand accompa-nyinginteractivewebpages(HTMLfiles)tovisualisetheoutputof eachtool.TheVistoolcanbeusedtoobtainself-containedHTML
files containing visualisations of various types of data files
generatedbytheothertools.Forexample,Samplevisvisualisesa sample data file as a sequence profile and Profilevis visualises backgroundnoiseprofilesobtainedfromBGEstimateorBGPredict. AdescriptionofeachvisualisationcanbefoundinSupplementary
Table1.Whenviewedinawebbrowser,thewebpageprovides
additionalcontrolsthatallowtheusertofilterthedata,switch betweenlinearandlogarithmicscales,orselectdifferentsubsetsof thedatatovisualise.Thedefaultvaluesforthesettingsontheweb pagecanbesetwhentheHTMLfileisgeneratedbytheVistool. Thewebpagesalsooffertheoptiontosavethedisplayedgraphs asaScalableVectorGraphics(SVG)orrasterisedPortableNetwork
Graphics (PNG) image, so that they can be imported into
documents. Alternatively, the Vis tool can supply a raw Vega
graphspecificationfile(eitherwithorwithoutembeddeddata),
whichcanthenbeusedbyVegatogenerateSVGorPNGimages
directlyonthecommandline.
Table1
InterpretationthresholdsforcasesamplesinSamplestatsandSamplevis.Sequencesthatmeeteitherthe‘Percentagecorrection’or‘Percentagerecovery’threshold(orboth)as wellasalltheotherthresholdswillbemarkedas‘allele’.Thesethresholdvaluesareevaluatedafternoisecorrection.The‘Allelecallingdefault’columnliststhedefault thresholdvaluesforcallingalleles.The‘Filteringdefault’columnliststhedefaultvaluesusedforfilteringdisplayedsequencesinSamplevisgraphs.
Threshold Description Allele
calling default
Filtering default
Totalreads Minimumnumberofreadsperallele.Non-systemic(andthusunfilterable)sequenceerrorsoccursporadically.Thisthreshold ensuresthataminimalamountofamplifiedproductispresenttosupporttheallelecall.
30 5
Readsper strand
Minimumnumberofreadsperalleleforbothstrands.Thisthresholdcanbeusedtoexcludelowtemplatesequenceswithstrong strandbias.
1 0
Percentage ofmost frequent
Thenumberofreadsasapercentageofthenumberofreadsofthemostfrequentalleleatthelocus.Thisthresholdsetsalimittothe mixtureproportionsthatcanbeanalysedinmixedsamplesortotheallelebalanceinsampleswithasinglecontributor.
2% 0.5%
Percentage oflocus
Eachallelecontributesatleastthispercentagetothetotalnumberofreadsofthelocus.Withthisthreshold,aminimumcontribution percentagecanbeenforced.
1.5% 0%
Percentage correction
Thispercentagederivesfromthenumberofreadsafternoisecorrectionminusthenumberofreadsbeforecorrection,whichis dividedbythenumberofreadsbeforecorrection.Consequently,thepercentagecorrectionisnegativeifnoisecorrectionresultedina reductionofthereadcountofasequence.Therefore,withthisthresholdsetto0%,anysequencerepresentingnoisewillnotbecalled asanallele.
Tobeabletodetectallelesofminorcontributorsthatcoincidewithnoiseproductsforthemajorcontributor’salleles,the‘percentage recovery’thresholddescribedbelowisallowedtooverrulethisthreshold.
0% 0%
Percentage recovery
Thenumberofreadsaddedbynoisecorrectionasapercentageofthetotalnumberofreadsafternoisecorrection.Afternoise correction,atleastthispercentageofreadsmusthaveoriginatedfromcorrectednoise.Therationalebehindthisthresholdisthat onlyallelicsequenceswillhavesubstantialamountsofrecoveredreads.Whenanalleleofaminorcontributioncoincideswiththe stutterofanalleleofthemajorcontributor,noisewillbeextractedandaddedtothemajorcontributor’sparentalleleresultingina negativepercentagecorrection.Yet,sincetheminorcontributor’scontributiontothereadsalsoresultsinnoiseproductsthatare corrected,theallelewillreceiverecoveredreadsandapercentagerecovery>0%.Toallowthecallingofallelesforwhichnonoise profileexists(ornonoisewasdetected)inthereferencedatabasethethresholdissetat0%bydefault.
3.Resultsanddiscussion
WedevelopedFDSTools,asoftwarepackagecontainingasuite oftoolsthatcanbeusedfortheanalysisofforensicMPSdata.With thesetools,FDSToolsprovidesdetailedinsightinthequalityofa sample and the noise profile of a certain allele (or sequence
variant). In Supplementary Table 1, an overview of all tools
currentlyavailableinthepackageisprovided,ofwhichaselection wasdescribedinmoredetailsinSection2.
Toenhancetheanalysisofmixedsamples,FDSToolsidentifies, extractsandcorrectsforPCRorsequencingnoisesuchasstutter fromareferencedatabasewiththeaimtodiscernlow mixture proportions. Different STR amplification assays and different amplification protocols could result in different noise. It is thereforeimportanttobasethedatabasefornoisecorrectionon referencesgenerated bya methodthat is representablefor the caseworksamplestobeanalysed.
Notethatitisnotpossibletocorrectallnoisecompletelyasthe levelofnoiseshowsvariationbetweensamples.
3.1.Referencedatabase
Our reference samples were sequenced with an average
coverageof65,000readsandamodeofabout45,000reads.For thepresentstudy,aminimumcoverageof6000readspersample wasrequired,whichrelatestoanaverageof250readsperlocusas 24 loci wereco-amplified.For heterozygousloci, less than 250 readsperlocusisnotsufficienttoquantifylowamountsofnoise accurately.
3.1.1.Referencesamplecuration
Sincethereferencedatabaseisusedtofilterandcorrectnoisein casesamples,itisessentialthatthereferencesamplescontainno contaminantsandreferenceallelesarecalledcorrectly.Although all other stepscan be performed automatically by FDSTools, a manualcurationofsamplesinthereferencedatabaseisneeded. BGAnalysewasdevelopedtofacilitatethisprocessbyvisualising potentialoutliers.
Allelefinder automatically rejected two out of the initial450 sampleswhichwereclearlycontaminatedandthreesamplesthat hadtoolowcoveragetodetectallelesreliably.Manualinspection ofsampleswithanotablyhigheramountofremainingnoiseafter correctioninBGAnalyseresultedintherejectionofanadditional16 samples.Reasonsforrejectionwerelow-levelcontamination,low coverageandlow sequencingquality.Theinteractive BGAnalyse visualisations displaying the remaining noise for the reference samplesareavailableinSupplementaryFile2a(beforedatabase curation)and2b(aftercuration).Forthemajorityofsamples,the highestremainingnoise variantin thecompleteprofile didnot exceed3%ofthenumberofreadsofthehighestalleleatthelocus whilewithoutcorrectionSTRstutterscanrepresentover20%.For theremaining429samples,nodrop-inordrop-outwasobserved whencallingallelesusingAllelefinderwiththesettingsdescribedin Section2.3.1.
3.1.2.Extendingnoiseprofilesfornoisecorrection
AsdescribedinSection2.4.1,casesamplesmaycontainalleles which arenotpresent inthereferencedatabase.In suchcases, FDSToolsresortstonoisepredictioninsteadofnoiseestimation.A columnintheoutputfileofBGCorrectmarksifcorrectionhasbeen performedusingdataobtainedfromBGEstimate(iftheallelewas availableinthereferencedatabase)orbyusingBGPredict(ifnot availableinthereferencedatabase).
FromtheresultsfromStuttermodelitbecomesevidentthatfor simpleSTRsconsistingofasinglerepeatingelementorforlong stretchesofaspecificrepeatingelementwithina complexSTR,
only few reference samplesare needed toreliably fit a stutter
model. However, when complex repeats consist of several
repeating elementsof which someshowlittle lengthvariation, correctionusingthestuttermodelissuboptimalasexemplifiedby thepredictionsforD12S391.ThisSTRlocusconsistsoftworepeat units; an AGAT repeatstretchof highly variablelength and an ACAGrepeatthatis repeated6 to8timesfor mostindividuals. Since Stuttermodel predictsthe amountofstutterbased onthe repeat length,at leastfour differentrepeatlengths need tobe availableinhomozygousreferencesamplestoobtainareliablefit. However, the setof reference samples used in this study only containedhomozygoteswith6to8repeatsofACAG,whichisnot sufficientlyvariabletoobtainareliablefit.Consequently,ACAGis
omitted from thestutter modelfor D12S391, even thoughthis
repeatstuttersupto9%forthelongerrepeats(8repeatunits,data notshown).WhenBGEstimatedoesnotobtainabackgroundnoise profile,BGPredictwillnotcorrectstutterinthisrepeatandthus stutterswillremainpresent.Asalastresort,BGPredictoffersthe possibilitytouseastuttermodelbasedondatafromalllocithat have the same repeat unit sequence if no locus-specific fit is available.SupplementaryFig. 1displaysthestuttermodelobtained fromthesetof429referencesamples,includingtheindividual observationsonwhichthemodelwasbased.
Combining BGEstimate and BGPredict (by using BGMerge)
insteadofusingBGPredictaloneisexpectedtoreducethenoise
remaining after correction, as the combined correction also
corrects for noise other than stutters. This is confirmed when
we determine the percentage of remaining noise (the reads
representingremainingnoiseasapercentageofthereadsforthe mostfrequentalleleatthelocus)andplotthehighestpercentage
and various percentiles (90th, 95th and 99th) (Supplementary
Fig.2a,b).The percentilesillustrate howoften samplesexhibit outlyingnoisesequencevariantsandwhenthe99thpercentileis regarded,BGPredictalone retainsonaverage2.6%noise andthe combinedcorrection2.4%.Also,thecombinedcorrectionresultsin lessovercorrectedvariants.
Thus,BGPredictcanbeusedwithoutBGEstimatewithaslightly reducedaccuracyincorrection.NotethatBGEstimateshouldnotbe usedwithoutBGPredictsinceallelesnotincludedinthereference databasewillnotbecorrected,whichcanresultinacombination ofcorrectedanduncorrectedallelesandremainingnoiseforthe uncorrectedalleles.
3.1.3.Referencedatabasesizeandcoverage
Totesttheeffectofthesamplesizeandtypefromwhichthe
reference database is built, we used the complete curated
reference database of 429 samples and a random selection of
100 samples (both with combined BGEstimate and BGPredict
correction,whichwasfoundtobeslightlybetterasdescribedin Section3.1.2).SupplementaryFig.2c,ddisplayanoverviewofthe mostfrequentandthetotalremainingnoiseateachlocusafter correction.Thedifferentpercentilesofthereferencesamplesare
given to illustrate how often samples exhibit outlying noise
sequencevariants.
Whencomparingtheresultsforthecompletedatabasewiththe resultsforthesubsetof100samples,thedifferenceinremaining
noise seems surprisingly small (Supplementary Fig. 2c, d).
However,withasmallerdatabase,lessalleleswillfitthecriteria tocreateaBGEstimatenoiseprofileandmoreallelesrelyonnoise prediction by BGPredict. Indeed, for the reference set of 429 samples,only3.5%oftheallelesarecorrectedusingBGPredict.This percentageincreasesto10.2%whenthecorrectionisbasedonthe subsetof100samples.
In alargerreferencedatabasemorealleleswillbeobserved. SupplementaryFig.3displaystheallelesobservedinthereference databasesof429and 100samples.Tofitthecriteriatocreatea
BGEstimate noise profile, alleles need to be present as a homozygousgenotypeorbeavailableaspartofsharedgenotypes withatleastthreeotherallelesthatmustalsofitthesecriteria.For thestuttermodel,onlythehomozygousgenotypesareused.Inthe larger429 database, more alleles fit these criteria than in the smaller100samplesetdatabase.
Toexaminetheeffectofreadcoverageofthereferencesamples onnoise profileanalysis, wegeneratedtwo subsets comprising sampleswithhighorlowcoverage,whichisspecifiedasatotal
read count between 82,000 and 350,000 or 8000 and 44,000
respectively.Thehighcoveragesetcomprised71samples;thelow coverageset70.Wenoticedthatinthelow-coveragenoiseprofiles, strandbiascanoccurespeciallyforthelow-percentagenoisethat isduetosingle-stranddrop-outofthisnoise.Thisisillustratedby theBGEstimatenoiseprofilesfortheCE10_TCTA[10]_-20T>Aallele forlocusD7S820inSupplementaryFig.4,inwhichforwardand reversereadsareingoodorreasonablebalanceforallsevennoise sequencesinthehighcoveragesamplesetwhilegoodbalanceis onlyseenforthetwomainnoisesequencesinthelowcoverageset. Sincethemostabundantnoiseaftercorrectioninasampleis usuallyintherangeof0.5–3%(forSTRanalysis),werecommenda coverageofatleast1000readsperlocus(whichrelatestoa24,000 totalreadcoverageforour24lociamplificationkit)forthesamples
of the reference database to obtain the most accurate noise
estimates.
3.1.4.Infrequentalleles
Depending on the composition of the reference database,
occasionallyalleleswillbeencounteredthatarenotincludedinthe database.BGPredict canpredict thenoise from stutterorother repeatingelementsbutcorrectionofothertypesofnoise(likelow levelSNPs causedby sequenceerrors)is not possibleforthese infrequentalleles.
WethereforerecommendtoobtainBGEstimatenoise profiles for asmany allelesas possible,while retaininggood qualityof thesenoiseprofiles.Severalfilteringcriteriacanbeapplied,suchas
the minimumnumber of different heterozygousgenotypes per
allele, the minimum number of samples per allele and the
minimumnumberofhomozygoussamplesperallele.Theeffectof increasingthestringencyonthefilteringcriteriaonthenumberof retrieved BGEstimatenoise profilesfor our429 referencesetis showninSupplementaryTable2.Thesettingsselectedforusein thisstudyare:atleasttwosamplesperallele(whichensuresnoise isnotbasedonasinglesampleasthatcouldbeanoutlier)that present at least three different heterozygous or at least one homozygote genotype (i.e., the samples can be three different
heterozygotesortwohomozygotesor onehomozygoteandone
heterozygote).
Whenanalleleat aheterozygouslocusfailsthecriteria,the completelocuscarrying thisallelecannotbeusedfor establish-mentofanoiseprofilesincethenoisecannotbeattributedtoany ofthetwoalleles.Thus,forbothallelesataheterozygouslocusno noiseprofileisextracted.
3.1.5.Accuracyofnoisereferencedatabaseandstuttermodel Toverifytheaccuracyofthenoiseprofilesobtainedthrough BGEstimateandBGPredict,itcanbeusefultocomparetheaverage noiseratioswiththenoiseratiosobservedinindividual homozy-gous samples. The noise ratios of all noise in all homozygous
referencesamplescan easilybecollectedusingtheBGHomRaw
tool.Thesedatapointscanbeplottedontopofanoiseprofileto inspecttheconsistencyandvariationinthenoiseratiosofvarious typesofnoiseforeachallele.InFig.3,thenoiseprofileofthemost frequentalleleof D7S820(CE10_TCTA[10]_-20T>A)is displayed,
which has foremost a 1 stutter (CE10_TCTA[9]_-20T>A) in
addition to a 1 nt slippage product at the A-stretch
Fig.3.NoiseprofileofD7S820alleleCE10_TCTA[10]_-20T>A.Thenoiseratioisshownforeachsystemicnoisesequenceobservedwithanoiseratioof0.1%orhigher. Individualobservationsinhomozygoussamples(above0.5%)aredisplayedascircles.Asexpected,themostfrequentlyobservednoisesequenceisthe1stutter,butsincethe allelecontainsasingle-nucleotidestretchof9Anucleotides,aconsiderableportionofthenoiseconsistsofsequenceswithslippageatthisA-stretch(oracombinationofthe two).
(CE9.3_TCTA[10]_-20T>-). The individual observations for the
homozygoussamples coincide nicely withthe estimated noise
profileratios.
Similarly, it is useful to compare the functions fitted by
Stuttermodel to the data points to which they were fitted.
Stuttermodelincludesanoptiontowritetherawdatapointstoa separateoutputfile,whichcanbevisualisedtogetherwiththefitted modelasshowninFig.4forD7S820.Thisexampleshowsthatthe
homozygouscallsandtheStuttermodelestimationfollowthesame trendandthatthereisnodiscrepancybetweenforwardandreverse reads.ThesameholdsfortheA-stretch(datanotshown).
In thestuttermodel, fits withanR2 score below0.75 were rejected.AlthoughthismayseemaverylowR2score,weobtained betterresultsbyincludingmorefitsthanbyexcludingthem,which wouldresultintheinabilityofthestuttermodeltobeusedtofilter andcorrectstutterfortherespectiverepeatunitsatall.
Fig.4.Stuttermodelforthe1stutterofD7S820.Onthex-axisthelengthoftherepeatisdisplayed(innucleotides)andonthey-axisthe1stutternoiseratio(aspercentage ofreadsoftheparentallele)isdisplayed.Eachhomozygousreferencesampleisdisplayedasadotandthelinesdisplaythefittedfunctionsusedforcalculatingtheexpected stutterofeachallele.
Fig.5.Sequenceprofileofasingle-sourcesample.SequenceprofileoflociD18S51andD19S433ofasingle-sourcesample.Asequenceprofiledisplaysthereadcountbefore correction(inpurplebars)andshowstheeffectsofnoisefiltering(lightpurpleforthereadsthatareremoved)andnoisecorrection(withthenoisereadsaddedtotheparent allelesindarkorange).Whenperformingcorrection,itispossiblethatanallelegainsreadsbecausethenoisereadsoriginatingfromthisalleleareadded,butlosesreadsatthe sametimesincethenoiseofanotheralleleintheprofileincludesreadsofthisallele.Thisoverlappingpartofaddedandremovedreadsismarkedseparatelyinlightorange. Thismeansthattheoriginalreadcountofanallelebeforecorrectionisthecombinationofthepurpleandthelightorangebar.Thelinesinthebarsindicatethestrandbalance; thelineisdrawnnearthetopofthebarifthemajorityofreadsofasequenceisontheforwardstrand,nearthebottomofthebarifthemajorityofreadsisonthereverse strand,andinthemiddleofthebarintheabsenceofstrandbias.Sequencesdisplayedingreeninthegraphsaretheallelesthatthesoftwareinferstobegenuineallelesinthe sample.Thesearealsodisplayedinthetable.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)
3.2.Sampleanalysis
3.2.1.Allelecalling,interpretationandvisualisation
Whenareferencedatabasehasbeencreated,onecanproceed withtheanalysisofsamples.FDSToolsanalysessequencingdata, calls alleles and interprets the data by correction for noise as inferredfromthereferencedatabase.Resultscanberepresentedas agraphicalsequenceprofileoutputandasaninteractiveprofile report.
InFig.5,anexampleofasequenceprofileoftwolociofa single-sourcesampleisdisplayed(generatedbythecommand‘fdstools vissample’).Asequenceprofiledisplaysthereadcountsbeforeand aftercorrectionand visualisesthe effects of noise filteringand noisecorrection.Amoredetailedexplanationoftheinterpretation ofasequenceprofilecanbefoundinSupplementaryFig.5.
The interactive sequence profile reports provide separate
filtering options for the graphs and tables displayed (see
Section2.4.3).In thegraphs,all alleles that are hidden bythe filtering options are (optionally) aggregated as a separate bar (displayingthecumulativenumbersofreads)withthelabel‘other sequences’.Inaddition,weaggregateallsingletonreadsinto‘other sequences’alreadyinthefirststepoftheanalysis(using‘fdstools tssv–minimum2–aggregate-filtered’)whichhastheadditional benefitsofspeedingupsubsequentanalysisanddecreasingdata storagedemand.
3.2.2.Improvingheterozygotebalancethroughnoisecorrection TheamplificationoflongSTRallelesinthePCRisgenerallyless efficientthanshorterallelesand,inaddition,longSTRallelessuffer fromahigherdegreeofstutterresultinginreducedheterozygote balancebetweenthetwoalleles. [2]SinceFDSTools determines which‘noisereads’arederivedfromwhichparentalleles,these readscan(optionally)beaddedtothereadcountsoftheparent alleles,whichtheoreticallywillimprovetheheterozygotebalance.
When we examine the heterozygote allele balance in the 429
single-source reference samples, an improved heterozygote
balanceisindeedobservedwhenthestutterreadsareaddedto the read counts of the parent alleles (Table 2). Heterozygote balancewasdeterminedperlocusbydividingthereadcountsfor thelessfrequentallelesbythoseforthemorefrequentalleles,and takingtheaverageofall429samples.
3.2.3.Mixtureanalysis
For the analysis mixtures, noise correction may assist in
identifyingtheallelesofalowminorcontributor.Weused31 two-personmixtureswithminorcontributionsof50%,20%,10%,5%,1% and0.5%toassessthisexpectation.
We varied the ‘percentage of locus’ threshold (Table 1) for callingalleles,whichsetsalimittothemixtureproportion.When nonoisecorrectionwasappliedthethresholdwasvariedbetween
5.0% and 1.5%; when noise correction was applied, a lower
Table2
Heterozygotebalancefororiginal,filteredandcorrecteddatasets.Thereadcountsforthelessfrequentallelesaredividedbythoseforthemorefrequentalleles,andthe averageforall429single-sourcereferencesamplesistaken.
Dataset\locus Amel CSF1P0 D10S1248 D12S391 D13S317 D16S539 D18S51 D19S433 D1S1656 D21S11 D22S1045 D2S1338 Uncorrecteddata 0.83 0.88 0.82 0.79 0.88 0.87 0.85 0.84 0.88 0.89 0.85 0.77 Filtereddatawithoutnoisereadsaddedto
allelereadcount
0.83 0.90 0.87 0.80 0.89 0.89 0.87 0.88 0.89 0.90 0.88 0.78
Correcteddatawithnoisereadsaddedto allelereadcount
0.83 0.91 0.89 0.84 0.90 0.91 0.89 0.90 0.91 0.91 0.93 0.80
Dataset\locus D2S441 D3S1358 D5S818 D7S820 D8S1179 FGA PentaD PentaE TH01 TPOX vWA Uncorrecteddata 0.90 0.88 0.90 0.88 0.89 0.85 0.87 0.78 0.88 0.88 0.85 Filtereddatawithoutnoisereadsaddedtoallelereadcount 0.90 0.90 0.91 0.89 0.90 0.87 0.87 0.78 0.88 0.89 0.88 Correcteddatawithnoisereadsaddedtoallelereadcount 0.91 0.91 0.91 0.91 0.91 0.89 0.88 0.81 0.89 0.90 0.90
Table3
Averagenumberofdrop-inallelesandaveragedrop-outpercentagepersamplefordifferent‘percentageoflocus’allele-callingthresholds. a)Summaryofdrop-inanddrop-outratesforvariousallele-callingthresholds
Minorcontribution 50%:600pg 20%:240pg 10%:120pg 5%:60pg 1%:60pg 0.5%:60pg Analysismethodand
threshold #drop-in/ %drop-out #drop-in/ %drop-out #drop-in/ %drop-out #drop-in/ %drop-out #drop-in/ %drop-out #drop-in/ %drop-out Withoutcorrection,5.0% 0.7/0.0% 0.8/2.1% 1.2/25.0% 1.0/33.1% 1.7/39.9% 0.8/40.8% Withoutcorrection,2.5% 4.3/0.0% 3.8/0.0% 4.3/2.5% 4.2/21.6% 3.5/38.3% 2.3/40.4% Withoutcorrection,2.0% 5.3/0.0% 5.3/0.0% 5.3/0.5% 4.8/15.3% 4.2/38.1% 2.8/40.1% Withoutcorrection,1.5% 7.7/0.0% 7.7/0.0% 7.0/0.0% 6.0/10.8% 6.0/37.4% 4.7/39.7% Withcorrection,3.0% 0.0/0.0% 0.3/0.0% 0.3/5.7% 0.0/30.3% 0.7/41.1% 0.3/41.1% Withcorrection,2.5% 0.3/0.0% 0.3/0.0% 0.5/1.4% 0.0/25.1% 0.7/40.6% 0.3/41.1% Withcorrection,2.0% 0.7/0.0% 1.2/0.0% 1.8/0.5% 0.8/17.4% 0.8/40.6% 1.2/40.8% Withcorrection,1.5% 1.7/0.0% 2.3/0.0% 2.5/0.0% 2.0/10.8% 2.7/39.9% 2.3/40.8% Withcorrection,1.0% 4.0/0.0% 5.2/0.0% 4.5/0.0% 4.5/3.8% 5.7/37.8% 3.2/40.6% Withcorrection,0.5% 16.3/0.0% 17.3/0.0% 14.0/0.0% 15.5/2.4% 14.0/30.3% 11.0/37.4% b)Categoriseddrop-outrateswhenusing1.5%allele-callingthreshold(withcorrection)
Minorcontribution 20%:240pg 10%:120pg 5%:60pg 1%:60pg 0.5%:60pg Allelesuniquetotheminor(homozygous) 0.0% 0.0% 0.0% 91.3% 100.0% Allelesuniquetotheminor(heterozygous) 0.0% 0.0% 31.6% 98.1% 99.4%
Allelesuniquetominor 0.0% 0.0% 27.0% 97.2% 99.4%
Allallelesoftheminor 0.0% 0.0% 18.5% 67.7% 69.3%
threshold could be used, varying between 3.0% to 0.5%. We comparedthevariousmethodsbycalculatingpercentagemissed alleles(a.k.a.drop-out)andthenumberoferroneousallelecalls(a. k.a.drop-in).Thepercentagedrop-outwascalculatedbydividing thenumber of donor alleles not calledby thetotal number of possiblealleles (homozygousand shared alleles arecountedas one,Amelisincluded),andthepercentagewasaveragedforthe mixtureswiththesamemixtureratio.Drop-inispresentedinthe averagenumberoccurringinprofileswiththesamemixtureratio. InTable 3, theresultsof theseanalyses aredisplayedand it is obviousthat withoutcorrectionmoredrop-inalleles occurthat mostlyrepresentstutters.Consequently,thethresholdforcalling anallelecanbelowerwhencorrectionisapplied,aslessstutters remaininthecorrectedprofilethatcanbewrongfullycalledasan allele.Asexpected,thepercentageofdrop-outdependslargelyon the‘percentageoflocus’thresholdforallelecalling(andhardlyon theapplication of noise correction); drop-out is more frequent withahigher(morestringent)threshold.Whenthethresholdfor correcteddataisdecreasedbelow1.5%,thenumberof drop-ins rapidly increases for all ratios. Not surprisingly, the drop-out percentageforthemixtureswith1%and0.5%isveryhighwhen
using a threshold that is higher than the minor component.
Therefore,thedatafromthe5%and10%minorcontributionwas usedtodeterminetheoptimalthresholdforallelecalling.
InFig.6,weshowtherelationbetweenthe‘percentageoflocus’ allele-calling threshold, drop-out and drop-in for the mixtures
with a 5% or 10% minor contribution. In the used dataset, a
threshold of 1.5% appears to be the most effective for calling genuine alleles in mixtures with minimal erroneous calling of remainingnoiseinthemixtures.Withmixtureratiosdownto10%,
no drop-out and only minimal drop-in is observed with this
threshold,whereaswithcontributionssmallerthan10%anoptimal balancebetweendrop-outanddrop-inisachieved(Table3).When investigating the noise that is erroneously called using this thresholditisapparentthatthedrop-inallelesarerarelyresulting fromstutterbutalmostexclusivelyconsistofPCRhybrids[20].In
Table3b,thepercentageofdrop-outwhenusingthe1.5% allele-callingthresholdiscategorisedandillustratesthatdrop-outalleles consistmostlyofheterozygousminoralleles.Mostofthese drop-out alleles represent minor contributions on stutter positions (wherestutterratioofthemajorcontributorwaslowerthanthe averageobservedinthesetofreferencesamples,therebycausing over-correction) and long alleles that suffer fromheterozygote imbalance.InFig.6thetrendsfromTable3areconfirmed: drop-outis hardlyand drop-inislargelyaffectedbytheuseofnoise correction. Thus, calling of genuine alleles is not negatively influencedbynoisecorrection.
Notethatthenumberofdrop-insmaybereducedfurtherby applying additional thresholds fromTable 1, but the effects of varyingadditionalthresholdvalueswerenotstudiedindepth.
InFig.7a, b,theeffectofnoisecorrectiononallelecallingis shown for ahighly unbalancedmixture (95:5mixtureratio)in whichtheallelesoftheminorcontributorhaveasimilarorlower read countthanthestutterproductsoftheallelesof themajor
contributor. Without noise correction the four most frequent
sequence variants are the major contributor’s alleles and the corresponding1stuttersandinterpretationofthelessfrequent sequencevariantsbecomesintractable;afternoisecorrection,the stutterproductsandotherPCRartefactsarefilteredoutandfour sequence variants meet the ‘percentage of locus’ allele-calling thresholdof1.5%,whichcorrespondtothefourallelesofthetwo heterozygousdonors.Also,theallelesofboththemajorandminor contributorhavegainedrecoveredreads.
Thisexamplealsodisplaysa pitfallof theinterpretationofa mixedDNAprofilewherethemajorcontributorhasaninfrequent alleleforwhichnoBGEstimatenoiseprofileisavailable.Thenoise correction ofalleleCE22_TAGA[16]CAGA[6]isonlybasedonthe stutter model (using BGPredict), which fails to correct for the CE22_TAGA[17]CAGA[5]PCRartefact.Lookingatthenoiseprofile
of the most resembling allele in the reference database,
CE21_TAGA[15]CAGA[6], we find a similar PCR artefact
CE21_TAGA[16]CAGA[5] thatrepresents a C toTsubstitution at
Fig.6.Averagenumberofdrop-inandpercentageofdrop-outpersamplefordifferent‘percentageoflocus’allele-callingthresholds.Effectofdifferent‘percentageoflocus’ allele-callingthresholdsonthedrop-inanddrop-outrates.Thenumbersnexttothepointsdisplayallele-callingthresholds.Thepositionofeachpointillustratesthenumber ofdrop-insandpercentageofdrop-outforthecorrespondingthresholdinmixtureswitharatioof90:10(orange)and95:5(blue).Pointsconnectedbydashedlines correspondtoresultsobtainedwithoutnoisecorrection,pointsconnectedbysolidlinescorrespondtoresultsobtainedafternoisecorrection.The1.5%‘percentageoflocus’ allele-callingthresholdthatappearsmostoptimalisindicatedinbold.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtotheweb versionofthisarticle.)
thefirstCAGArepeatunit,withanoiseratioofabout2%(Fig.7c). ThissuggeststhattheCE22_TAGA[17]CAGA[5]artefactwouldbe properlycorrectedifanoiseprofilefortheCE22_TAGA[16]CAGA[6] allelewouldbeavailable.Thus,additionalinspectionoftheapplied
method of correction (BGPredict or BGEstimate) may be useful
wheninfrequentallelesoccur.
3.3.Analysistimeandcomputermemorydemand
Toindicatetherequiredtimeandcomputermemorydemand,
fivesampleswithdifferentnumbersofreads(15,169–318,403total readpairs)wereanalysedandthetimeandpeakmemoryusagefor eachseparatetoolwasregistered(SupplementaryFig.6).Withthe used2.0GHzprocessor,theanalysistimeismostlyconsumedby TSSV(75%ofthetotalanalysistime)andthecompleteanalysis
only takes up to 13:30min for a sample with 318,403 reads.
BGCorrectshows thehighest peak memoryusagebut doesnot
exceed200MBforthelargestsample(ofthefivetestedsamples). Boththerequiredtimeandmemoryincreasemoreorlesslinearly whenthereadcountoftheanalysedsamplesisincreased. 4.Conclusions
WedevelopedFDSToolsfortheanalysisofforensicMPSdata.
FDSTools can determinesystemic PCR and/orsequencing noise
fromthedataofreferencesamples,buildadatabasefromthisdata and use it to correct for systemic noise in case samples. The softwareis also abletopredictthe noise caused bystutterfor
alleles not included in the reference database and uses this
informationinthecorrectionofcasesamples.
Withautomaticthreshold-basedallelecalling,noisecorrection reducestheoccurrenceofdrop-inanddrop-outsubstantiallyand improvesthebalancebetweenallelesofaheterozygotepair.This decreasesthedetectionlimitsofminorcontributionsinmixtures. STRstuttervariantsarenolongerthemostfrequentremaining noiseasPCRhybridartefactsnowgenerallyexceedthecorrected readcountsofstutters.
Althoughreliablenoisecorrectioncanalreadybeobtainedfrom adatabaseof100samples,alargerdatabaseispreferredasalarger numberofallelescanbecorrectedbytheuseofacompletenoise profileinsteadofrelyingonnoisepredictionsbasedonthestutter model.Thiswillalsoreducemanualinspectionofretained non-stutternoiseforinfrequentalleles.Whenbuildingthedatabase,it isimportanttouseanamplificationkitrepresentativeforthekit
used for the samples. Although not extensively tested, we
anticipate that noise prediction will be less precisewhen less DNAis used andmore stochasticPCR effects occur.Also, more strandbiaswilloccurduringthemassivelyparallelsequencing. Theseeffectsareintrinsictolow-levelDNAtypingand
probabilis-ticgenotypingsoftwarehasbeendevelopedthataccommodates
drop-inanddropoutduringprofileinterpretation[21–28].Such softwarearenotyetstraightforwardlyabletodealwithMPSdata, butthenecessaryadaptationsarefeasibleandinclude nomencla-tureforsequencevariants,allelefrequenciesdatabasesandread countsreplacingpeakheightsincontinuousmodels(notrequired forsemi-continuousmodels).InCE-basedanalysis,PCRreplicates areoftenusedtoreduceprofilinguncertainty[7];replicatescanbe enteredinprobabilisticgenotypingsoftwareorusedtopreparea
consensusprofile[7,8].AfutureversionofFDSToolswillfeaturea consensus-basedanalysismethodalikethoseusedwithCEdata [8]. Besides, export options for DNA databasesystems such as CODISwillbeadded.
FDSTools hasbeenvalidatedfollowing recommendations for
software validation [29,30]and is already implemented in the
ISO17025 certified environment of the LUMC for forensic
casework.Thevalidationforuseandperformanceofthesoftware incaseworkwasaseparatestudywhichwasnotbasedonthedata described inthismanuscript.By providingtoolstoevaluatethe performance of noise correction in referencesamples FDSTools facilitatesthedeterminationofanalysisthresholdsthatarefitfor purpose.
TheapplicationofFDSToolsisnotlimitedtotheanalysisofSTRs. FDSToolshasalreadybeenappliedsuccessfullytotheanalysisof multiplexassaysofSNPfragments(manuscriptinpreparation)and
complete mtDNAdata[31].Notethat theminimumnumber of
requiredreferencesamplesforlociotherthanSTRswilldependon theamountofvariationobservedintheseloci.
Acknowledgements
This study was supported by a grant from the Netherlands
Genomics Initiative/Netherlands Organization for Scientific
Re-search (NWO) withinthe frameworkof the ForensicGenomics
ConsortiumNetherlands.TheauthorswishtothankArieKoppelaar (NetherlandsForensicInstitute)forprovidinghelpandexpertisein setting upthecomputerclusterfortheanalyses.Wealsothank
Martin Ensenberger, Cynthia Sprecher and Douglas R. Storts
(Promega Corporation)for theircontributions todesigning and providingthePowerSeqTMAutoSystem.
AppendixA.Supplementarydata
Supplementarydataassociatedwiththisarticlecanbefound,in
the online version, at http://dx.doi.org/10.1016/j.
fsigen.2016.11.007. References
[1]M.A.Jobling,P.Gill,Encodedevidence:DNAinforensicanalysis,Nat.Rev. Genet.5(10)(2004)739–751.
[2]K.J.vanderGaag,R.H.deLeeuw,J.Hoogenboom,J.Patel,D.R.Storts,J.F.J.Laros, P.deKnijff,Massivelyparallelsequencingofshorttandemrepeats—Population dataandmixtureanalysisresultsforthePowerSeqTMsystem,ForensicSci.Int.
Genet.24(2016)86–96.
[3]K.B.Gettings,K.M.Kiesler,S.A.Faith,E.Montano,C.H.Baker,B.A.Young,R.A. Guerrieri,P.M.Vallone,Sequencevariationof22autosomalSTRlocidetected bynextgenerationsequencing,ForensicSci.Int.Genet.21(2016)15–21. [4]C. Gelardi,E.Rockenbauer, S.Dalsgaard,C. Børsting,N.Morling,Second
generationsequencingofthreeSTRsD3S1358:D12S391andD21S11inDanes andanewnomenclatureforsequencedSTRalleles,ForensicSci.Int.Genet.12 (2014)38–41.
[5]M.Scheible,O.Loreille,R.Just,J.Irwin,Shorttandemrepeattypingonthe454 platform:strategiesandconsiderationsfortargetedsequencingofcommon forensicmarkers,ForensicSci.Int.Genet.12(2014)104–119.
[6]X.Zeng,J.L.King,M.Stoljarova,D.H.Warshauer,B.L.LaRue,A.Sajantila,J.Patel, D.R.Storts,B.Budowle,Highsensitivitymultiplexshorttandemrepeatloci analyseswithmassivelyparallelsequencing,ForensicSci.Int.Genet.16(2015) 38–47.
[7]C.C.Benschop,C.P.vanderBeek,H.C.Meiland,A.G.vanGorp,A.A.Westen,T. Sijen,LowtemplateSTRtyping:effectofreplicatenumberandconsensus
Fig.7.Interpretationofamixedsequenceprofilebeforeandaftercorrection.a)SequenceprofileoflocusD12S391ofamixedsamplewitharatioof95:5,withoutnoise filteringandcorrection.b)SequenceprofileoflocusD12S391ofamixedsamplewitharatioof95:5,withnoisefilteringandcorrectionapplied.c)NoiseprofileofD12S391 alleleCE21_TAGA[15]CAGA[6].SequenceprofileoflocusD12S391ofamixedsamplewitharatioof95:5,AwithoutandBwithapplyingnoisefilteringandcorrection.The tabledisplaysallsequencevariantswithatleast1.5%ofthereadsofthelocus(the‘percentageoflocus’allele-callingthresholdused),whicharealsomarkedingreeninthe graph.AnoteinthetableinpanelBwarnsthatnonoiseprofilewasavailableforthemajorCE22_TAGA[16]CAGA[6]alleleandastutterpredictionhasbeenusedinstead.An additionalvariantCE22_TAGA[17]CAGA[5],whichisderivedfromthemajorCE22allele,remainsvisibleinthesequenceprofile(althoughnotmarkedgreenasitdoesnot meetthe1.5%threshold).CNoiseprofileofasimilarallele,showinganon-stutterPCRartefactwithanoiseratioofabout2%.(Forinterpretationofthereferencestocolourin thisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)
methodongenotypingreliabilityandDNAdatabasesearchresults,Forensic Sci.Int.Genet.5(2011)316–328.
[8]C.C.Benschop,T.Sijen,LoCIM-tool:anexpert’sassistantforinferringthemajor contributor’sallelesinmixedconsensusDNAprofiles,ForensicSci.Int.Genet. 11(2014)154–165.
[9]C. Brookes,J.A. Bright,S.Harbison,J. Buckleton,Characterisingstutterin forensicSTRmultiplexes,ForensicSci.Int.Genet.6(2012)58–63. [10]S.Y.Anvar,K.J.vanderGaag,J.W.vanderHeijden,M.H.Veltrop,R.H.Vossen,R.
H.deLeeuw,C.Breukel,H.P.Buermans,J.S.Verbeek,P.deKnijff,J.T.den Dunnen,J.F.J.Laros,TSSV:atoolforcharacterizationofcomplexallelicvariants inpureandmixedgenomes,Bioinformatics30(2014)1651–1659. [11]D.H.Warshauer,J.L.King,B.Budowle,STRaitrazorv2.0:theimprovedSTR
alleleidentificationtool–razor,ForensicSci.Int.Genet.14(2015)182–186. [12]S.L.Friis,A.Buchard,E.Rockenbauer,C.Børsting,N.Morling,Introductionof
thePythonscriptSTRinNGSforanalysisofSTRregionsinFASTQorBAMfiles andexpansionoftheDanishSTRsequencedatabaseto11STRs,ForensicSci. Int.Genet.21(2016)68–75.
[13]A.A.Westen,T.Kraaijenbrink,E.A.RoblesdeMedina,J.Harteveld,P.Willemse, S.B.Zuniga,K.J.vanderGaag,N.E.Weiler,J.Warnaar,M.Kayser,T.Sijen,P.de Knijff,ComparingsixcommercialautosomalSTRkitsinalargeDutch populationsample,ForensicSci.Int.Genet.10(2014)55–63.
[14]M.G.Ensenberger,K.A.Lenz,L.K.Matthies,G.M.Hadinoto,J.E.Schienman,A.J. Przech,M.W.Morganti,D.T.Renstrom,V.M.Baker,K.M.Gawrys,M. Hoogendoorn,C.R.Steffen,P.Martín,A.Alonso,H.R.Olson,C.J.Sprecher,D.R. Storts,DevelopmentalvalidationofthePowerPlex1Fusion6CSystem, ForensicSci.Int.Genet.21(2016)134–144.
[15]T. Mago9c, S.L.Salzberg,FLASH: fastlengthadjustment ofshort readsto improvegenomeassemblies,Bioinformatics27(2011)2957–2963. [16]B.C.Hendrickson,B.Leclair,S.Forrest,J.Ryan,B.E.Ward,D.Petersen,T.D.
Kupferschmid,T.Scholl,AccurateSTRalleledesignationsattheFGAandvWA locidespitesitepolymorphisms,J.ForensicSci.49(2)(2004)250–254. [17]D.Shinde,Y.Lai,F.Sun,N.Arnheim,TaqDNApolymeraseslippagemutation
ratesmeasuredbyPCRandquasi-likelihoodanalysis:(CA/GT)nand(A/T)n microsatellites,Nucl.AcidsRes.31(2003)974–980.
[18]M.Klintschar,P.Wiegand,Polymeraseslippageinrelationtotheuniformityof tetramericrepeatstretches,ForensicSci.Int.135(2)(2003)163–166. [19]A.Satyanarayan,R.Russell,J.Hoffswell,J.Heer,ReactiveVega:astreaming
dataflowarchitecturefordeclarativeinteractivevisualization,IEEETrans.Vis. Comput.Graph.22(2016)659–668.
[20]A.R.Shuldiner,A.Nirula,J.Roth,HybridDNAartifactfromPCRofclosely relatedtargetsequences,Nucl.AcidsRes.17(11)(1989)4409.
[21]H.Haned,P.Gill,AnalysisofcomplexDNAmixturesusingtheForensim package,ForensicSci.Int.Genet.Supp.Ser.3(2011)e79–e80.
[22]H.Swaminathan, A. Garg, C.M. Grgicak, M.Medard, D.S. Lun, CEESIt:a computationaltoolfortheinterpretationofSTRmixtures,ForensicSci.Int. Genet.(2016)149–160.
[23]D.J. Balding, Evaluation of mixed-source, low-template DNA profiles in forensicscience,Proc.Natl.Acad.Sci.U.S.A.110(30)(2013)12241–12246. [24]D.Taylor,J.A.Bright,J.Buckleton,Theinterpretationofsinglesourceandmixed
DNAprofiles,ForensicSci.Int.Genet.7(2013)516–528.
[25]R.G.Cowell,T.Graversen,S.L.Lauritzen,J.Mortera,AnalysisofforensicDNA mixtureswithartefacts,Appl.Stat.64(1)(2015)1–32.
[26]O.Bleka,G.Storvik,P.Gill,EuroForMix:anopensourcesoftwarebasedona continuousmodeltoevaluateSTRDNAprofilesfromamixtureofcontributors withartefacts,ForensicSci.Int.Genet.21(2016)35–44.
[27]M.W.Perlin,M.M.Legler,C.E.Spencer,J.L.Smith,W.P.Allan,J.L.Belrose,B.W. Duceman,ValidatingTrueAllele1DNAmixtureinterpretation,J.ForensicSci. 56(2011)1430–1447.
[28]R.Puch-Solis,L.Rodgers,A.Mazumder,S.Pope,I.W.Evett,J.Curran,D.Balding, EvaluatingforensicDNAprofilesusingpeakheightsallowingformultiple donors,allelelicdropoutandstutters,ForensicSci.Int.Genet.7(2013)555– 563.
[29]H.Haned,P.Gill,K.Lohmueller,K.Inman,N.Rudin,Validationofprobabilistic genotypingsoftwareforuseinforensicDNAcasework:definitionsand illustrations,Sci.Justice56(2)(2016)104–108.
[30]E.J.Rykiel,Testingecologicalmodels:themeaningofvalidation,Ecol.Model. 90(1996)229–244.