My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing

(1)

My-Forensic-Loci-queries

(MyFLq)

framework

for

analysis

of

forensic

STR

data

generated

by

massive

parallel

sequencing

§

Christophe

Van

Neste

a

,

Mado

Vandewoestyne

a

,

Wim

Van

Criekinge

b

,

Dieter

Deforce

a,1

,

Filip

Van

Nieuwerburgh

a,1,

*

a

LaboratoryofPharmaceuticalBiotechnology,FacultyofPharmaceuticalSciences,GhentUniversity,Ghent,Belgium b

Biobix,FacultyofBioscienceEngineering,GhentUniversity,Ghent,Belgium

1. Introduction

ForensicDNAprofilesofshorttandemrepeatlociarecurrently obtainedusingPCRfollowedbycapillaryelectrophoresis(CE).CE separatesfluorescentlylabeledPCRproductsbasedontheirlength [1].Becauseofthis,inordertoproduceunambiguousallelecalls,the sizeranges of STR lociwith the samefluorescenttag mustnot overlap.Thislimitsthenumberoflocithatcanbeinvestigatedina single PCR and in a single capillary injection. Massive parallel sequencing (MPS) technologies, also known as second or next generation sequencing, do not relyon sizeseparation and thus

relieve the limitation on locus multiplexy [2,3]. Additionally, multiplesamplescanbemultiplexedatthesametimeinasingle run.MPSallowsforanalysisofmillionsofindividualDNAstrands (reads) inaDNAmixture,whichintheorywouldallowforhigh resolutionmixtureanalysis.Sequencing alsomakesitpossibleto detectsinglenucleotidepolymorphisms(SNPs)andSTRsequence variantsinadditiontothegrossSTRrepeatnumber[4].Thisallows analyststotellthedifferencebetweenequilengthallelesinaDNA mixture.Certainmassspectrometrytechniquesalsomakeitpossible todifferentiateequilengthalleles,butcompletecharacterizationof polymorphismcanonlybeaccomplishedbysequencing[5].

Forensicscientistsarecurrentlyinvestigatinghowtotransition fromCEtoMPS.Severalbioinformatictoolsarebeingdeveloped to that end [2,6–8]. Previously, we reported that sequencing of multiplexed STR amplicons using Roche GS FLX titanium technology was technically feasible both in single contributor samplesandinmultiplepersonDNAmixtures,notwithstandinga ARTICLE INFO

Articlehistory: Received25July2013

Receivedinrevisedform4October2013 Accepted22October2013 Keywords: Illumina MiSeq STR Forensicloci MPS NGS ABSTRACT

Forensicscientistsarecurrentlyinvestigatinghowtotransitionfromcapillaryelectrophoresis(CE)to massiveparallelsequencing(MPS)foranalysisofforensicDNAproﬁles.MPSoffersseveraladvantages overCEsuchasvirtuallyunlimitedmultiplexyofloci,combiningbothshorttandemrepeat(STR)and singlenucleotidepolymorphism(SNP)loci,smallampliconswithoutconstraintsofsizeseparation,more discriminationpower,deepmixtureresolutionandsamplemultiplexing.Wepresentourbioinformatic frameworkMy-Forensic-Loci-queries(MyFLq)foranalysisofMPSforensicdata.Forallelecalling,the frameworkusesaMySQLreferencealleledatabasewithautomaticallydeterminedregionsofinterest (ROIs)byagenericmaximalﬂankingalgorithmwhichmakesitpossibletouseanySTRorSNPforensic locus.PythonscriptsweredesignedtoautomaticallymakeallelecallsstartingfromrawMPSdata.We alsopresentamethodtoassesstheusefulnessandoverallperformanceofaforensiclocuswithrespectto MPS,aswellasmethodstoestimatewhetheranunknownallele,whichsequenceisnotpresentinthe MySQLdatabase,isinfactanewalleleorasequencingerror.TheMyFLqframeworkwasappliedtoan Illumina MiSeq dataset of a forensic Illumina amplicon library, generated from multilocus STR polymerasechainreaction(PCR)onbothsinglecontributorsamplesandmultiplepersonDNAmixtures. AlthoughthemultilocusPCRwasnot yetoptimizedforMPSintermsofampliconlengthorlocus selection,theresultsshowexcellentresultsformostloci.Theresultsshowahighsignal-to-noiseratio, correctallelecalls,andalowlimitofdetectionforminorDNAcontributorsinmixedDNAsamples. Technically,forensicMPSaffordsgreatpromiseforroutineimplementationinforensicgenomics.The methodisalsoapplicabletoadjacentdisciplinessuchasmolecularautopsyinlegalmedicineandin mitochondrialDNAresearch.

ß2013TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCC BY-NC-SAlicense(http://creativecommons.org/licenses/by-nc-sa/3.0/).

§

Thisisan open-accessarticledistributedunderthetermsoftheCreative Commons Attribution-NonCommercial-No Derivative Works License, which permitsnon-commercial use, distribution,and reproduction in any medium, providedtheoriginalauthorandsourcearecredited.

* Correspondingauthor.Tel.:þ3292648052. 1

Theseauthorscontributedequallytothiswork.

ContentslistsavailableatScienceDirect

Forensic

Science

International:

Genetics

j o urn a l hom e pa ge : ww w. e l s e v i e r. c om/ l o ca t e / f si g

http://dx.doi.org/10.1016/j.fsigen.2013.10.012

1872-4973/ß2013TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCCBY-NC-SAlicense( http://creativecommons.org/licenses/by-nc-sa/3.0/).

(2)

poorperformanceforsomefrequentlyusedforensicloci[6].For those loci, the GS FLX reads needed to be transformed by a homopolymer-compressionalgorithmtoobtainresultsconsistent enoughformixtureanalysis.IntheGSFLXSTRanalysis,thereads neededtobeclusteredaroundaconsensussequence.Theregionof interest(ROI)wassetusingfixedflankingbasedonthespecificPCR primersforeachlocus.Asignalthresholdwasusedtodetermine whichreadclustersareconsideredasignalandwhichclustersare considerednoise.

In the current study we created a general bioinformatic framework, that we call My-Forensic-Loci-queries (MyFLq), for analyzingMPS-generatedforensic data.Itisdesignedtohandle multiplelocus types, includingSTR lengthpolymorphisms and SNPs.TheframeworkusesaMySQLreferencealleledatabasewith automaticallydeterminedROIsandpythonscriptstocompareMPS sequencesagainsttheknownalleledatabase.Wealsopresenta method to asses the usefulness and overall performance of a forensiclocuswhen usedin anMPS analysis,and a methodto estimatewhetheranallelewhichisnotpresentinthedatabaseis infactanewlytypedalleleoraPCR/sequencingerror.

TheMyFLqframeworkwasusedonanIlluminaMiSeqdatasetof anIlluminaforensicampliconlibrarygeneratedfromSTRmultilocus PCR on both single contributor samples and multiple person mixtures.ThemultilocusPCRwas notspeciﬁcallyoptimized for MPSintermsofcriteriasuchaslocusselectionorampliconlength. Theresultsarepromising,showingexcellentresultsonmostbutnot allloci.Therawreadaccuracywashighenoughsothatthereadsdid notneedtobeclusteredaroundaconsensusaccuracyaswiththeGS FLXreads[6].ThefrequencyofhomopolymererrorsintheMiSeq dataisinsigniﬁcantcomparedtotheGSFLXtechnology. Neverthe-less,thehomopolymercompressionalgorithmstillprovedusefulto groupsome readscontaining errors,whetherfromPCRor from sequencing,withthosethatwerecompletelyerror-freeinrespectto acontributor’sreferencesequence.

The MyFLqframeworkhas aCreative Commons opensource license(CCBY-SA3.0).Thesourcecodeisavailableassupplementary material or for the latest version at http://MyFLq.UGent.van-neste.be.CurrentlyweareimplementingitasanIlluminaBaseSpace application,whichshouldbeavailablebythebeginningof2014.

2. Materialsandmethods

2.1. SamplepreparationandprocessingusingIlluminachemistry DNA mixtures werepreparedaccordingtoTable 1from the followingpuriﬁedgenomicDNAsources:K562(Promega),9947A (Promega), 2800M (Promega) and two National Institute of Standards and Technology (NIST) standard reference materials (SRM2391c:DNAA,DNAB).DNAconcentrationofeachsample was measured using the Qubit1

dsDNA HS Assay and Qubit Fluorimeterfollowing manufacturer’sinstructions.A mixtureof foursourceDNAsandamixtureofﬁvesourceDNAs,alongwith 9947AandK562singlesourcesamples,wereusedinmultilocus PCR(Table1).

Primer sequences were ordered without ﬂuorescent tags (Integrated DNA Technologies) for loci: Amelogenin, CSF1PO, D13S317,D16S539,D18S51,D21S11,D3S1358,D5S818,D7S820, D8S1179, FGA, PentaD,PentaE, TH01,TPOX, and vWA[9]. This multiplex hasyet toundergo optimization (e.g., for intra- and inter-locus balance, polymerase stutter (slippage) and other artifacts) forusein MPS.Primers wereusedata concentration of2

m

MeachinPCRwith1ngofDNAorDNAmixturein1Gold STRbuffer(Promega)and0.16UAmpliTaqGOLD(Invitrogen).The samples were ampliﬁed using a BioRad Tetrad instrument as follows:958Cfor11min,968Cfor1min;10cyclesof948Cfor 30s,ramp0.58C/sto608Candthen30sat608C,ramp0.28C/sto 708Candthen45sat708C;22cyclesof908Cfor30s,ramp0.58C/ sto608Candthen30sat608C,ramp0.28C/sto708Candthen 45sat708C;holdat608Cfor30min;48Csoak.

PCR products were quantiﬁed using Qubit1

(Invitrogen) withoutpurification.LibrariesweregeneratedbyligatingTruSeq DNAadapterstothePCRproductsfrom50ngofunpurifiedPCR product(Illumina).Samplesweresubjectedto5cyclesofPCRand purifiedwithSPRI(TruSeqDNASamplePreparationGuide). The completed libraries were quantified using a qPCR assay as recommendedbyIllumina.

LibrarieswerepooledwithPhiXUniversalLibraryandaHuman DNAlibrary.Pooledlibrariesweredenaturedanddilutedto10pM followingIlluminaguidelinesandsequencedonaMiSequsinga MiSeqReagentKitv2,500cycleswithamodiﬁedrecipe.Samples weredemultiplexedusingtheindexsequences,FASTQﬁleswere generatedautomaticallyusingMiSeqReporter(MSR).

2.2. MiSeqdataanalysis 2.2.1. TheMyFLqframework

All necessary steps for an MPS forensic analysis were incorporatedintoouropen-sourceframework.MyFLqis notyet afullapplication.Theendresultsneedtobestatisticallyanalyzed: probabilities of allelecalls, stutterﬁltering, hetero- and homo-zygoticallelecallingandvisualizationarenotyetimplemented.

Theframeworkconsistsof twoparts:(1)A MySQLdatabase backendthatispopulatedbyknown referencealleles,and(2)a Pythonfrontendwithfunctionsforaddingreferenceallelestothe reference allele database, and analysis of MPS STR data from forensic samples. The source code and documentation of all functionscanbefoundinsupplementarymaterials.Italsolists speciﬁcallythefunctionsusedforthispaper,andprovidesashort descriptionforeach.

2.2.2. Buildingthereferencealleledatabase

Thealleledatabaseisideallypopulatedwiththesequencesof allknownSTRallelesthatexistinthegeneralpopulation.Because eachof thesesequencesiscurrentlynotavailable,thedatabase was initially populated withthe STR sequences from theDNA sources in Table 1. These reference sequences were manually inferredfromtheIlluminasequencingdataandtheSTRBaseallele database[10].Theyarenotthebestrepresentationofpopulation alleles,butsufﬁceforthecurrentstudy.Inthefuture,withMPSthe knowndiversitywillbebetterdetermined.

After building the reference allele database, the function processLociNameswasusedtodeterminetheﬂankingregionfor each locus. The function processLociAlleles produced a table containingtheROIforeachalleleanditsintegerallelenumber(if STR)accordingtostandardnomenclature[11,12].

Flanking regions are the maximum right- and left-end consensusbetween all alleles of a locusin the referenceallele database.Fig.1showsasimpliﬁedexample.Eachallelehastwo primers,twoﬂanksandtheROI.Ifthereferencedatabaseallelesfor alocusonlydifferbythenumberofSTRrepeatsandthusnoSNPs, Table1

DNAcompositionofsamples.

DNAstandard Sample1(%) Sample2 (%) Sample3 (%) Sample4 (%) K562 100 0.10 9947A 40 0.50 100 NISTSRM2391c DNAA 30 1 NISTSRM2391c DNAB 20 5 2800M 10 93.40

(3)

non-consensus,orpartialrepeatpatternsarepresent,oneofthe alleles for that locuscan have a ROI of zerolength, as Fig. 1b demonstrates.BecausetheROIiscalculateddynamicallybasedon thesequencespresentinthereferencealleledatabase,theROIof thelocicanchangewhenthereferencealleledatabaseisupdated. 2.2.3. STRdataanalysis

Afterbuildingthedatabase,allFASTQfileswereanalyzedwith theframeworksoftwareasiftheywouldhavebeenfromunknown samples. Analysis consisted of following steps: (1) reads were assigned toa locusbased on the presenceof both PCR primer sequencesforthat locus; (2)primers andflanks wereremoved fromthereads,leavingonlythereadROI;(3)readsweregrouped based on their exact sequence;(4) groups with an abundance lowerthan0.5%werediscarded;(5) theROI ofeachgroupwas comparedtothealleledatabasetableandanallelecallwasmade whenanexact matchwasfound;and (6)groupswithina locus werecomparedtoeachother,aconnectionwasflaggedwhentwo ROIsdifferedbymaximumtwoSNPsorSTRsizeindels.

Instep(2)thedeterminationoftheflankingregionswasdone asfollows:ifaread-endmatchedexactlywiththedatabaseflank, the read-end was removed and flagged ‘clean’. Otherwise, the databaseflankandread-endwerehomopolymer-compressed(two sameconsecutivebaseswerealreadyconsideredahomopolymer). Whenthisresultedinanexactmatchbetweenthosesequences, theread-endwasremovedandflagged‘compressed-clean’.Ifthe flankwasstillnotremoved,theread-endwasflaggedas‘unclean’ andwasremovedbyourflexibleflankingalgorithm(seebelow).In step (4), for each group, counts were gathered on ‘clean’, ‘compressed-clean’, and ‘unclean’ read-ends. Step (6) indicates howtheROIsin theresultsareinterrelated.Thishelpsforensic researcherstodecideonthelikelihoodofanallelecall,e.g.aROI thatisnotinthedatabase,hasalowabundanceintheresultsand onlydiffersbyoneSNPofanotherhighabundantROI,isprobably anerror-containingROI.

2.2.4. Flexibleﬂankingalgorithm

Thisalgorithmalwaysremovesaflankfromaread,nomatter howdissimilartothedatabaseflank.Thestarting hypothesisis thattheread-endtoremoveisaslongasthedatabaseflank.K-mers forthedatabaseflankwithincreasinglengtharesearchedaround their expected index in the read. The found index is scored dependingonhowinformativethek-meris:moreinformativeif

longerandifclosertotheflank-end.Thescoreiscalculatedasthe squarerootoftheproductofthosetwovalues(lengthofk-merand inversedistancetodatabaseflankend).Finally,theproposedindex withthehighestsummedscoreisconsideredtobethemostlikely ending flank position. Based on that position, the read-end is removed. For a more in depth explanation, documentation is provided forthis algorithm inthesourcefile insupplementary materials.

3. Results

3.1. Settinggeneralthreshold

In theanalysis ofMPS STR data,groups withanabundance lowerthan0.5%werediscarded(seestep4of2.2.3).Thisthreshold wasarbitrarilydeterminedbasedon theresultsinFig.2.Fig.2 showsahistogramoftheabundancesofallgroupedidenticalreads for the singlecontributor samples.To generate this ﬁgure, the completesequenceswereconsidered,exceptforthepartoutsideof theprimers.

Erroneousreadsareexpectedtohavemuchsmallerabundances than errorfree reads, especially in single contributor samples. Therearearound105_groups_of_identical_reads_with_abundances

smaller than 0.5%, which are in the context of the single contributor samples deﬁnitelyreadswith errors.The threshold foruniquereadswassetto0.5%,toavoidclutteringtheresultswith noise.Bydoingso,minorcontributorswhichcontributelessthan 0.5%totheDNAmixturewillnotbedetected.

3.2. GeneralpropertiesoftheIlluminadataset

Table2showsthetotalnumberofMiSeqreadsforeachsample, the number of reads filtered based on the 0.5% abundance threshold,andthenumberoferrorfreeROIafterfiltering.Filtered readsconsistofbothreadswithandwithouterrors.ErrorfreeROI arethenumberofreadsofwhichthecompleteROIisidenticaltoa referencedatabaseROI and ofwhich theROI isexpectedtobe presentintherelevantsample.Thesereadscanstillhaveerrorsin theflanksaroundtheROI,buttheseflanksarenotconsideredwhen matching the reads to the reference allele database. This percentageis influencedbymanyfactorssuchasPCRaccuracy, amplicon length, ROI and flank length, stutter, sequencing accuracyandtheabundancefiltercut-off.

A A AG T GCTAGGACAATGAT AGATA GATAGTACGCGCACCTACCTA GCTAGGACAATGATAGATAGATA GTACGCGCACCTACCTA GCTAGGACAATGATAGATAGATAGATAGAT GTACGCGCACCTACCTA

Primer Flank Flank Primer

Region of interest b) Without SNP's in STR region: GCTAGGACAATGATACA TAGATA GATAGTACGCGCACCTA CCTA GCTAGGACAATGATAGA TAGATA GTACGCGCACCTA CCTA GCTAGGACAATGATAGA TAGATAGATAGATTGTACGCGCACCTA CCTA

Primer Flank Flank Primer

Region of interest a) With SNP's in STR region:

Fig.1.Flanksandregionofinterestinthereferencealleledatabase.Agenericlocuswiththreeallelesinthedatabaseisconsideredfortwopossiblecases:withorwithoutSNPs withintheSTRregion.

(4)

3.3. STRallelecallsinfourandﬁvepersonDNAmixtures

Figs.3–6showtheallelecallsforthedifferentsamples.Inthe ﬁgures,bluebarsdenotethetheoreticalabundanceoftheallele basedonhowthesampleswereprepared,greenbarsthedetected read abundance, and red bars erroneous reads (including polymerase stutter). Pure samples (Figs. 3 and 6), contain, at most,two blue bars per locus. Erroneous read bars have been drawnnarrower tomakeit possibletoshowallerroneousread

groups.TheﬁguresareautomaticallygeneratedfromtheMyFLq resultﬁles(insupplementarymaterials),withmanualadditionof thebluebars.

3.4. Progressiveabundancethreshold

Fig.7shows,afterfullanalysis,foreachlocus,thepercentageof errorfreesequences(Y-axis)foragivenabundancethreshold (X-axis).Higherabundancesforerroneoussequencesarelesslikely. Table2

GeneralIlluminasamplescharacteristics.

Sample1 Sample2 Sample3 Sample4

TotalMiSeqreads 246,347 1,176,806 1,261,848 961,236

Filteredreads 203,181(82%) 981,935(83%) 1,073,164(85%) 771,723(80%)

ErrorfreeROI 173,692(71%) 912,643(72%) 996,466(79%) 681,918(71%)

Fig.3.Sample1proﬁle.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.

Fig.4.Sample2proﬁle.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance. 0.0000.0050.0100.0150.020 0.0250.0300.0350.0400.045 0.050 0.0550.0600.0650.0700.0750.080 0.0850.0900.0950.100 Abundance of sequence [0-1] 100 101 102 103 104 105 Number (log) Threshold 0.005

(5)

When theabundance thresholdis increased, thepercentage of errorfreesequencesincreasesand isexpectedtobecome100%. Lociforwhich lesserrorsequencesareproduced,have a lower thresholdforwhich100%errorfreereadsareremaining.Basedon thiscriterion,AmelogeninandD16S539arethebestperforming lociwhilePentaEandPentaDperformatadecreasedlevel.

3.5. Locusqualityanalysis

Foreachlocus,Table3showsthepercentageofthetheoretical abundance which is measured as MPS signal. The theoretical abundance can be calculated from the proportion of each contributor tothesample (seeTable 1)andare shownas blue Fig.5.Sample3proﬁle.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.

Fig.6.Sample4proﬁle.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.

(6)

barsinFigs.3–6.Themeasuredsignalsareshownasgreenbarsin thesamefigures.WhentherewouldbenoPCRerrorslikestutter and no sequencing errors,the MPS signalwould be100%. The numbersinTable3werecalculatedbylinearmodelforeachlocus separatelybutforalllocusspecificallelesofallsamplestogether. Fig.8showstheaveragepercentageofflankswhichare‘clean’ in reads with an error-free ROI and in reads with an error-containing ROI. For all loci, except PentaD and D5S818, the proportionofcleanflanksishigherinreadswhichalsocontainan error-freeROI.Theproportionofcleanflanksisalsoimpactedby thelength of the flanks. The negativecorrelation between the

lengthoftheﬂanksandtheproportionofcleanﬂanksisexpected aslongersequencescarryahigherprobabilityofcontainingerrors.

4. Discussion

4.1. Frameworkperformance

The MyFLq framework was designed to incorporate all algorithms necessary for an STR MPS analysis in one open framework. Itcontainsabout a thousandlinesof codeand has minimal dependencies: It only requires a working python2 environment and the common python packages numpy and MySQLdb. Currently the target audience of the framework are bioinformaticiansthatworkintheﬁeldofMPSforensicanalysis. An Illumina BaseSpace application is being developed, which shouldbeavailablebythebeginningof2014.Thisway,userswill nothavetointeractwiththeframework,butwillbeabletoanalyze theirSTRdataﬁleswiththeMyFLqapplication.

MyFLqcanalsohandleothertypesofforensicdata,suchasSNP or mitochondrialregions. However, it hasnotbeen extensively testedtooperate withsuchdata.Fromtheanalyzedlociinthis study,onlyAmelogenincouldbeconsideredaSNPlocus.STRait Razorisanothertoolthathasbeenrecentlydevelopedforforensic genomicsandasthenameindicatesonlydealswithSTR’s[8].In ouropinion,futuresoftwarepackagesshouldhandlebothSTRand SNP loci equally well, to provide a full solution to forensic researchers.

MyFLqhascurrentlynotbeendesignedtobecomputational efﬁcient.AfullanalysisofaMiSeqdatasettakesapproximatelyone houronasingleCPU.Thecodecontainsmanysectionsthatcould Table3

MPSsignalpercentageoftheoreticalabundance.

Locus MPSsignal/theoreticalabundance(%) Std-error(%)

Amelogenin 95 4 CSF1PO 79 5 D13S317 89 5 D16S539 88 6 D18S51 75 6 D21S11 78 5 D3S1358 86 5 D5S818 89 5 D7S820 84 6 D8S1179 95 6 FGA 73 6 PentaD 25 5 PentaE 27 7 TH01 92 5 TPOX 90 4 vWA 85 5 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 ] 1 -0 [ s k n a l f n a e l c f o n o i t r o p o r p : :I O R e e r f r o r r E 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Er

ror containing ROI: pr

oportion of clean flanks

Amelogenin CSF1PO D13S317 D16S539 D18S51 D21S11 D3S1358 D5S818 D7S820 D8S1179 FGA PentaD PentaE TH01 TPOX vWA 1.0

Fig.8.ProportionofcleanﬂanksinreadswithanexpectedROIandinreadswithanerror-containingROI.Sizeofcirclesisproportionaltoaverageallelelengthofthelocus, darknessofcirclesisproportionaltoﬂanklengthofthelocus.

(7)

beparallelizedtoreducetheanalysistimesigniﬁcantly.Thiswill alsobeimplementedintheBaseSpaceapplication.

4.2. Accuracyanddetectionofminorcontributors

IngeneraltheIlluminaMiSeqreadswereofhighquality,with morethan70%ofreadscontainingacompleteerrorfreeROIand constitute the MPS signal. Several factors are influencingthis percentage.Stutters(around10%)andthereadsfiltered bythe abundancethreshold(around20%)areabigpartoftheother30% of the reads. PCR and sequencing accuracy with a per base sequencing accuracy of approximately 99.5% are other main factors. Reducing the amplicon, and ROI length could further improvetheMPSsignal.Readscanbeusedwithoutclustering them to produce a consensus sequence. This is of particular importance,sinceclusteringshouldbeavoidedwhen investigat-ing forensic mixtures. An allele from a minor contributor containingaSNPoraninsertion/deletioncouldclusterwiththe sequencesfromthecorrespondingalleleofamajorcontributor andgo undetected.When thecontributors are unknown, itis impossibletocategorizeaprioribetweenallelesandamplification errors.Consequently,clusteringshouldbeavoidedifpossible,in order to confront the forensic investigator at least with the presenceofthesesequences.

Becausetheminimalabundancethresholdduringdataanalysis wassetto0.5%,theoreticallyitwasstillpossibletodetectalleles fromthe1%contributor in Sample3.However, only forthe5% contributor the alleles were detected consistently, showing a correct allele call for all loci except D13S317. However, the experimentwasnotsetuptodeterminetheallelicdetectionlimit. Manyallelesoftheminorcontributorsarethesameasoneofthe moreabundantcontributors. Futureresearchwill beneededto establish a general detection limit in mixtures for each MPS technologyandPCRmultiplex.

4.3. Lociperformance

Incapillaryelectrophoresis(CE),theallelespecificsignalis85– 90%,aspolymerasestutterproducts usuallycomprise approxi-mately10–15%oftheparentallele’ssignal.Thesestutterproducts were also observed in MPS data. These stutter sequences combined with sequencing errors results in an allele specific MPSsignalthatisslightlylowerthantheabsoluteDNAinput.For mostlocithesevaluesareveryreasonable,asshowninTable3, except for PentaE and PentaD. Those two loci are obvious underperformersinthecurrent experimentalsetup,bothwith allele specificsignalsaround25%.Comparedto theother loci, these2locihavelongamplicons,longROIandlongflanklengths. While these factors are partially contributing to a higher proportionof reads containing errors in theROI,they are not completelyexplainingthelowsignal.Itisunclearatwhichpoint thattheerrorsareintroduced,butgiventhehighabundanceof someoftheerrors,theyareprobablyintroducedatthePCRstep. Futureresearchwill showhowusefultheywill bein MPSSTR analyses.

Fig.8showsahigherproportionofcleanﬂanksinreadswith error-free ROIs. This data could be modeled to estimate the likelihoodoferrorinROIsofanunknownsample.Fig.8showsthat theusefulnessofthisstrategywilldependontheconsideredlocus because values for ‘clean’, ‘compressed-clean’ and ‘unclean’ dependonlocussequencecharacteristicssuchasaveragelength andhomopolymercontent.

ForanMPStechnologytobecomeavalidalternativetoCEit mustproduceatleastequivalentresultsforthecommonlyused loci,andproduceadditional,valuableinformationforanexpanded setofloci.MostlocialreadyperformedverywellwiththeIllumina

MiSeq, while some(PentaD,PentaE)needfurther optimization, suchasprimerre-design,inordertoeclipseCE.AsMPSbecomesa validalternative,ourframeworkcanbeusedtohelpidentifyan additionalsetofidealMPSloci.

4.4. Dynamicﬂankcalculation

Usingsequencing,onlytheROIneedstobeconsideredduring data analysis,i.e.thepartof thesequencethat differsbetween differentallelesforalocus.Anypartoutsidethatregioncanbe treatedasflanking.Asourframeworkisbuilttogenericallyhandle all forensicloci, theROI andtheflanksaround theROIare not predefined, but are dynamically determined. This dynamic flanking tries tomaximizethe flanksand to minimizetheROI for each locus based on the reference alleles present in the reference allele database. Removing flanks from the reads minimizestheimpactoferrorsintheflanksontheanalysis.This aidsdetectingallelesofminorDNAcontributors,asasignificant portionofthenoiseiseliminated.

Applicationofourmethodiscurrentlylimitedbythesizeof thereferencealleledatabase,whichcontainsasmallsubsetofall ofthepossiblealleles,i.c.onlyallelespresentintheDNA from Table 1. To determine the practical applicability of our framework, moresamplesneedtobeanalyzed toincreasethe allele databasesize.However, for futureapplications itwould also be useful to automatically limit the database size by subsetting it. Database alleles could, for example,be selected basedonthesizeofthesequencesinaspecificsample,because the size ofreads already reduces theset ofpossible allelesto which they could be assigned. This subset of possible alleles wouldthenservetocalculatemaximumflanksonthefly.Smaller subsets will resultin bigger flankswhich in turn reduces the impact oferrors.

5. Conclusion

When MPSbecomes routinelyusedtoanalyzeforensic DNA proﬁles,decisionswillneedtobemadeonhowthedatashouldbe processed.Wepresentourbioinformaticframework,MyFLq,that processes forensic MPS data prudently: without clustering, extracting maximal information withautomaticallydetermined regionsofinterest.TheresultsshowIlluminaMiSeqisreadyto analyze STR proﬁles. For routine implementation in forensic laboratories,acarefulselectionofloci,PCRmultiplexoptimization, andanMPS-basedSTRallelereferencedatabaseforalignment,are needed.

Supplementarymaterials

MyFLq.pyThemainframeworkprogram.

InstallMyFLqTables.pyInstallstheMySQLtablesnecessaryfor usingMyFLq.

results.cssNeededinthesamedirectoryasthesampleresult files, for opening theresult filesin a browser. Thisfile also containsdocumentationtointerprettheresultfiles.

sample1–4.xmlSampleresultﬁles.

Acknowledgments

FundingwasprovidedbytheGhentUniversityMultidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to net-works’.

TheauthorswouldalsoliketothankIllumina,Inc.forproviding their technical expertise in sample preparation and MiSeq sequencing.

(8)

AppendixA. Supplementarydata

Supplementarydataassociatedwiththisarticlecanbefound, in the online version, at http://dx.doi.org/10.1016/j.fsigen.2013. 10.012.

References

[1]D.L.Deforce,R.E.Millecamps,D.V.Hoofstat,E.G.V.denEeckhout,Comparisonof slabgelelectrophoresisandcapillaryelectrophoresisforthedetectionofthe ﬂuorescentlylabeledpolymerasechainreactionproductsofshorttandemrepeat fragments,J.Chromatogr.A806(1)(1998)149–155.,http://dx.doi.org/10.1016/ S0021-9673(97)00394-4.

[2]S.L.Fordyce,M.C.Avila-Arcos,E.Rockenbauer,C.Borsting,R.Frank-Hansen,F.T. Petersen,E.Willerslev,A.J.Hansen,N.Morling,M.T.P.Gilbert,High-throughput sequencingofcoreSTRlociforforensicgeneticinvestigationsusingtheRoche GenomeSequencer FLXplatform,Biotechniques51(2)(2011)127þ,http:// dx.doi.org/10.2144/000113721.

[3]K.K.Kidd,J.R.Kidd,W.C.Speed,R.Fang,M.R.Furtado,F.C.L.Hyland,A.J.Pakstis, ExpandingdataandresourcesforforensicuseofSNPsinindividualidentiﬁcation, ForensicSci.Int.Genet.6(5)(2012)646–652,doi:10.1016/j.fsigen.2012.02.012.

[4]L.J.McIver,J.W.FondonIII,M.A.Skinner,H.R.Garner,Evaluationofmicrosatellite variationinthe1000GenomesProjectpilotstudiesisindicativeofthequalityand utilityoftherawdataandalignments,Genomics97(4)(2011)193–199.,http:// dx.doi.org/10.1016/j.ygeno.2011.01.001.

[5]J.V.Planz,K.A.Sannes-Lowery,D.D.Duncan,S.Manalili,B.Budowle,R.Chakraborty, S.A.Hofstadler,T.A.Hall,AutomatedanalysisofsequencepolymorphisminSTR allelesbyPCRanddirectelectrosprayionizationmassspectrometry,ForensicSci. Int.Genet.6(5)(2012)594–606.,http://dx.doi.org/10.1016/j.fsigen.2012.02.002. [6]C.VanNeste,F.VanNieuwerburgh,D.VanHoofstat,D.Deforce,ForensicSTR analysisusingmassiveparallelsequencing,ForensicSci.Int.Genet.6(6)(2012) 810–818.,http://dx.doi.org/10.1016/j.fsigen.2012.03.004.

[7]M.Gymrek,D.Golan,S.Rosset,Y.Erlich,lobSTR:ashorttandemrepeatproﬁlerfor personalgenomes,GenomeRes.22(6)(2012)1154–1162.,http://dx.doi.org/ 10.1101/gr.135780.111.

[8]D.H.Warshauer,D.Lin,K.Hari,R.Jain,C.Davis,B.LaRue,J.L.King,B.Budowle, STRaitRazorAlength-basedforensicSTRallele-callingtoolforusewithsecond generationsequencingdata,ForensicSci.Int.Genet.7(7)(2013)409–417.

[9]A.Masibay,T.Mozer,C.Sprecher,PromegaCorporationrevealsprimersequences initstestingkits,J.ForensicSci.45(6)(2000)1360–1362.

[10]C.Ruitberg,D.Reeder,J.Butler,STRBase:ashorttandemrepeatDNAdatabase forthehumanidentitytestingcommunity,NucleicAcidsRes.29(1)(2001)320– 322.,http://dx.doi.org/10.1093/nar/29.1.320.

[11]P.Gill,C.Brenner,B.Brinkmann,B.Budowle,A.Carracedo,M.Jobling,P.deKnijff, M.Kayser,M.Krawczak,W.Mayr,N.Morling,B.Olaisen,V.Pascali,M.Prinz,L. Roewer, P.Schneider, A.Sajantila,C. Tyler-Smith,DNA Commission ofthe InternationalSocietyofForensicGenetics:recommendationsonforensicanalysis usingY-chromosomeSTRs,Int.J.LegalMed.114(6)(2001)305–309.,http:// dx.doi.org/10.1007/s004140100232.

[12]B.Olaisen,W.Bar,B.Brinkmann,B.Budowle,A.Carracedo,P.Gill,P.Lincoln,W. Mayr,S.Rand, DNArecommendations1997oftheInternational Societyfor ForensicGenetics,VoxSang.74(1)(1998)61–63.,http://dx.doi.org/10.1046/ j.1423-0410.1998.7410061.x.