My-Forensic-Loci-queries
(MyFLq)
framework
for
analysis
of
forensic
STR
data
generated
by
massive
parallel
sequencing
§
Christophe
Van
Neste
a,
Mado
Vandewoestyne
a,
Wim
Van
Criekinge
b,
Dieter
Deforce
a,1,
Filip
Van
Nieuwerburgh
a,1,*
a
LaboratoryofPharmaceuticalBiotechnology,FacultyofPharmaceuticalSciences,GhentUniversity,Ghent,Belgium b
Biobix,FacultyofBioscienceEngineering,GhentUniversity,Ghent,Belgium
1. Introduction
ForensicDNAprofilesofshorttandemrepeatlociarecurrently obtainedusingPCRfollowedbycapillaryelectrophoresis(CE).CE separatesfluorescentlylabeledPCRproductsbasedontheirlength [1].Becauseofthis,inordertoproduceunambiguousallelecalls,the sizeranges of STR lociwith the samefluorescenttag mustnot overlap.Thislimitsthenumberoflocithatcanbeinvestigatedina single PCR and in a single capillary injection. Massive parallel sequencing (MPS) technologies, also known as second or next generation sequencing, do not relyon sizeseparation and thus
relieve the limitation on locus multiplexy [2,3]. Additionally, multiplesamplescanbemultiplexedatthesametimeinasingle run.MPSallowsforanalysisofmillionsofindividualDNAstrands (reads) inaDNAmixture,whichintheorywouldallowforhigh resolutionmixtureanalysis.Sequencing alsomakesitpossibleto detectsinglenucleotidepolymorphisms(SNPs)andSTRsequence variantsinadditiontothegrossSTRrepeatnumber[4].Thisallows analyststotellthedifferencebetweenequilengthallelesinaDNA mixture.Certainmassspectrometrytechniquesalsomakeitpossible todifferentiateequilengthalleles,butcompletecharacterizationof polymorphismcanonlybeaccomplishedbysequencing[5].
Forensicscientistsarecurrentlyinvestigatinghowtotransition fromCEtoMPS.Severalbioinformatictoolsarebeingdeveloped to that end [2,6–8]. Previously, we reported that sequencing of multiplexed STR amplicons using Roche GS FLX titanium technology was technically feasible both in single contributor samplesandinmultiplepersonDNAmixtures,notwithstandinga ARTICLE INFO
Articlehistory: Received25July2013
Receivedinrevisedform4October2013 Accepted22October2013 Keywords: Illumina MiSeq STR Forensicloci MPS NGS ABSTRACT
Forensicscientistsarecurrentlyinvestigatinghowtotransitionfromcapillaryelectrophoresis(CE)to massiveparallelsequencing(MPS)foranalysisofforensicDNAprofiles.MPSoffersseveraladvantages overCEsuchasvirtuallyunlimitedmultiplexyofloci,combiningbothshorttandemrepeat(STR)and singlenucleotidepolymorphism(SNP)loci,smallampliconswithoutconstraintsofsizeseparation,more discriminationpower,deepmixtureresolutionandsamplemultiplexing.Wepresentourbioinformatic frameworkMy-Forensic-Loci-queries(MyFLq)foranalysisofMPSforensicdata.Forallelecalling,the frameworkusesaMySQLreferencealleledatabasewithautomaticallydeterminedregionsofinterest (ROIs)byagenericmaximalflankingalgorithmwhichmakesitpossibletouseanySTRorSNPforensic locus.PythonscriptsweredesignedtoautomaticallymakeallelecallsstartingfromrawMPSdata.We alsopresentamethodtoassesstheusefulnessandoverallperformanceofaforensiclocuswithrespectto MPS,aswellasmethodstoestimatewhetheranunknownallele,whichsequenceisnotpresentinthe MySQLdatabase,isinfactanewalleleorasequencingerror.TheMyFLqframeworkwasappliedtoan Illumina MiSeq dataset of a forensic Illumina amplicon library, generated from multilocus STR polymerasechainreaction(PCR)onbothsinglecontributorsamplesandmultiplepersonDNAmixtures. AlthoughthemultilocusPCRwasnot yetoptimizedforMPSintermsofampliconlengthorlocus selection,theresultsshowexcellentresultsformostloci.Theresultsshowahighsignal-to-noiseratio, correctallelecalls,andalowlimitofdetectionforminorDNAcontributorsinmixedDNAsamples. Technically,forensicMPSaffordsgreatpromiseforroutineimplementationinforensicgenomics.The methodisalsoapplicabletoadjacentdisciplinessuchasmolecularautopsyinlegalmedicineandin mitochondrialDNAresearch.
ß2013TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCC BY-NC-SAlicense(http://creativecommons.org/licenses/by-nc-sa/3.0/).
§
Thisisan open-accessarticledistributedunderthetermsoftheCreative Commons Attribution-NonCommercial-No Derivative Works License, which permitsnon-commercial use, distribution,and reproduction in any medium, providedtheoriginalauthorandsourcearecredited.
* Correspondingauthor.Tel.:þ3292648052. 1
Theseauthorscontributedequallytothiswork.
ContentslistsavailableatScienceDirect
Forensic
Science
International:
Genetics
j o urn a l hom e pa ge : ww w. e l s e v i e r. c om/ l o ca t e / f si g
http://dx.doi.org/10.1016/j.fsigen.2013.10.012
1872-4973/ß2013TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCCBY-NC-SAlicense( http://creativecommons.org/licenses/by-nc-sa/3.0/).
poorperformanceforsomefrequentlyusedforensicloci[6].For those loci, the GS FLX reads needed to be transformed by a homopolymer-compressionalgorithmtoobtainresultsconsistent enoughformixtureanalysis.IntheGSFLXSTRanalysis,thereads neededtobeclusteredaroundaconsensussequence.Theregionof interest(ROI)wassetusingfixedflankingbasedonthespecificPCR primersforeachlocus.Asignalthresholdwasusedtodetermine whichreadclustersareconsideredasignalandwhichclustersare considerednoise.
In the current study we created a general bioinformatic framework, that we call My-Forensic-Loci-queries (MyFLq), for analyzingMPS-generatedforensic data.Itisdesignedtohandle multiplelocus types, includingSTR lengthpolymorphisms and SNPs.TheframeworkusesaMySQLreferencealleledatabasewith automaticallydeterminedROIsandpythonscriptstocompareMPS sequencesagainsttheknownalleledatabase.Wealsopresenta method to asses the usefulness and overall performance of a forensiclocuswhen usedin anMPS analysis,and a methodto estimatewhetheranallelewhichisnotpresentinthedatabaseis infactanewlytypedalleleoraPCR/sequencingerror.
TheMyFLqframeworkwasusedonanIlluminaMiSeqdatasetof anIlluminaforensicampliconlibrarygeneratedfromSTRmultilocus PCR on both single contributor samples and multiple person mixtures.ThemultilocusPCRwas notspecificallyoptimized for MPSintermsofcriteriasuchaslocusselectionorampliconlength. Theresultsarepromising,showingexcellentresultsonmostbutnot allloci.Therawreadaccuracywashighenoughsothatthereadsdid notneedtobeclusteredaroundaconsensusaccuracyaswiththeGS FLXreads[6].ThefrequencyofhomopolymererrorsintheMiSeq dataisinsignificantcomparedtotheGSFLXtechnology. Neverthe-less,thehomopolymercompressionalgorithmstillprovedusefulto groupsome readscontaining errors,whetherfromPCRor from sequencing,withthosethatwerecompletelyerror-freeinrespectto acontributor’sreferencesequence.
The MyFLqframeworkhas aCreative Commons opensource license(CCBY-SA3.0).Thesourcecodeisavailableassupplementary material or for the latest version at http://MyFLq.UGent.van-neste.be.CurrentlyweareimplementingitasanIlluminaBaseSpace application,whichshouldbeavailablebythebeginningof2014.
2. Materialsandmethods
2.1. SamplepreparationandprocessingusingIlluminachemistry DNA mixtures werepreparedaccordingtoTable 1from the followingpurifiedgenomicDNAsources:K562(Promega),9947A (Promega), 2800M (Promega) and two National Institute of Standards and Technology (NIST) standard reference materials (SRM2391c:DNAA,DNAB).DNAconcentrationofeachsample was measured using the Qubit1
dsDNA HS Assay and Qubit Fluorimeterfollowing manufacturer’sinstructions.A mixtureof foursourceDNAsandamixtureoffivesourceDNAs,alongwith 9947AandK562singlesourcesamples,wereusedinmultilocus PCR(Table1).
Primer sequences were ordered without fluorescent tags (Integrated DNA Technologies) for loci: Amelogenin, CSF1PO, D13S317,D16S539,D18S51,D21S11,D3S1358,D5S818,D7S820, D8S1179, FGA, PentaD,PentaE, TH01,TPOX, and vWA[9]. This multiplex hasyet toundergo optimization (e.g., for intra- and inter-locus balance, polymerase stutter (slippage) and other artifacts) forusein MPS.Primers wereusedata concentration of2
m
MeachinPCRwith1ngofDNAorDNAmixturein1Gold STRbuffer(Promega)and0.16UAmpliTaqGOLD(Invitrogen).The samples were amplified using a BioRad Tetrad instrument as follows:958Cfor11min,968Cfor1min;10cyclesof948Cfor 30s,ramp0.58C/sto608Candthen30sat608C,ramp0.28C/sto 708Candthen45sat708C;22cyclesof908Cfor30s,ramp0.58C/ sto608Candthen30sat608C,ramp0.28C/sto708Candthen 45sat708C;holdat608Cfor30min;48Csoak.PCR products were quantified using Qubit1
(Invitrogen) withoutpurification.LibrariesweregeneratedbyligatingTruSeq DNAadapterstothePCRproductsfrom50ngofunpurifiedPCR product(Illumina).Samplesweresubjectedto5cyclesofPCRand purifiedwithSPRI(TruSeqDNASamplePreparationGuide). The completed libraries were quantified using a qPCR assay as recommendedbyIllumina.
LibrarieswerepooledwithPhiXUniversalLibraryandaHuman DNAlibrary.Pooledlibrariesweredenaturedanddilutedto10pM followingIlluminaguidelinesandsequencedonaMiSequsinga MiSeqReagentKitv2,500cycleswithamodifiedrecipe.Samples weredemultiplexedusingtheindexsequences,FASTQfileswere generatedautomaticallyusingMiSeqReporter(MSR).
2.2. MiSeqdataanalysis 2.2.1. TheMyFLqframework
All necessary steps for an MPS forensic analysis were incorporatedintoouropen-sourceframework.MyFLqis notyet afullapplication.Theendresultsneedtobestatisticallyanalyzed: probabilities of allelecalls, stutterfiltering, hetero- and homo-zygoticallelecallingandvisualizationarenotyetimplemented.
Theframeworkconsistsof twoparts:(1)A MySQLdatabase backendthatispopulatedbyknown referencealleles,and(2)a Pythonfrontendwithfunctionsforaddingreferenceallelestothe reference allele database, and analysis of MPS STR data from forensic samples. The source code and documentation of all functionscanbefoundinsupplementarymaterials.Italsolists specificallythefunctionsusedforthispaper,andprovidesashort descriptionforeach.
2.2.2. Buildingthereferencealleledatabase
Thealleledatabaseisideallypopulatedwiththesequencesof allknownSTRallelesthatexistinthegeneralpopulation.Because eachof thesesequencesiscurrentlynotavailable,thedatabase was initially populated withthe STR sequences from theDNA sources in Table 1. These reference sequences were manually inferredfromtheIlluminasequencingdataandtheSTRBaseallele database[10].Theyarenotthebestrepresentationofpopulation alleles,butsufficeforthecurrentstudy.Inthefuture,withMPSthe knowndiversitywillbebetterdetermined.
After building the reference allele database, the function processLociNameswasusedtodeterminetheflankingregionfor each locus. The function processLociAlleles produced a table containingtheROIforeachalleleanditsintegerallelenumber(if STR)accordingtostandardnomenclature[11,12].
Flanking regions are the maximum right- and left-end consensusbetween all alleles of a locusin the referenceallele database.Fig.1showsasimplifiedexample.Eachallelehastwo primers,twoflanksandtheROI.Ifthereferencedatabaseallelesfor alocusonlydifferbythenumberofSTRrepeatsandthusnoSNPs, Table1
DNAcompositionofsamples.
DNAstandard Sample1(%) Sample2 (%) Sample3 (%) Sample4 (%) K562 100 0.10 9947A 40 0.50 100 NISTSRM2391c DNAA 30 1 NISTSRM2391c DNAB 20 5 2800M 10 93.40
non-consensus,orpartialrepeatpatternsarepresent,oneofthe alleles for that locuscan have a ROI of zerolength, as Fig. 1b demonstrates.BecausetheROIiscalculateddynamicallybasedon thesequencespresentinthereferencealleledatabase,theROIof thelocicanchangewhenthereferencealleledatabaseisupdated. 2.2.3. STRdataanalysis
Afterbuildingthedatabase,allFASTQfileswereanalyzedwith theframeworksoftwareasiftheywouldhavebeenfromunknown samples. Analysis consisted of following steps: (1) reads were assigned toa locusbased on the presenceof both PCR primer sequencesforthat locus; (2)primers andflanks wereremoved fromthereads,leavingonlythereadROI;(3)readsweregrouped based on their exact sequence;(4) groups with an abundance lowerthan0.5%werediscarded;(5) theROI ofeachgroupwas comparedtothealleledatabasetableandanallelecallwasmade whenanexact matchwasfound;and (6)groupswithina locus werecomparedtoeachother,aconnectionwasflaggedwhentwo ROIsdifferedbymaximumtwoSNPsorSTRsizeindels.
Instep(2)thedeterminationoftheflankingregionswasdone asfollows:ifaread-endmatchedexactlywiththedatabaseflank, the read-end was removed and flagged ‘clean’. Otherwise, the databaseflankandread-endwerehomopolymer-compressed(two sameconsecutivebaseswerealreadyconsideredahomopolymer). Whenthisresultedinanexactmatchbetweenthosesequences, theread-endwasremovedandflagged‘compressed-clean’.Ifthe flankwasstillnotremoved,theread-endwasflaggedas‘unclean’ andwasremovedbyourflexibleflankingalgorithm(seebelow).In step (4), for each group, counts were gathered on ‘clean’, ‘compressed-clean’, and ‘unclean’ read-ends. Step (6) indicates howtheROIsin theresultsareinterrelated.Thishelpsforensic researcherstodecideonthelikelihoodofanallelecall,e.g.aROI thatisnotinthedatabase,hasalowabundanceintheresultsand onlydiffersbyoneSNPofanotherhighabundantROI,isprobably anerror-containingROI.
2.2.4. Flexibleflankingalgorithm
Thisalgorithmalwaysremovesaflankfromaread,nomatter howdissimilartothedatabaseflank.Thestarting hypothesisis thattheread-endtoremoveisaslongasthedatabaseflank.K-mers forthedatabaseflankwithincreasinglengtharesearchedaround their expected index in the read. The found index is scored dependingonhowinformativethek-meris:moreinformativeif
longerandifclosertotheflank-end.Thescoreiscalculatedasthe squarerootoftheproductofthosetwovalues(lengthofk-merand inversedistancetodatabaseflankend).Finally,theproposedindex withthehighestsummedscoreisconsideredtobethemostlikely ending flank position. Based on that position, the read-end is removed. For a more in depth explanation, documentation is provided forthis algorithm inthesourcefile insupplementary materials.
3. Results
3.1. Settinggeneralthreshold
In theanalysis ofMPS STR data,groups withanabundance lowerthan0.5%werediscarded(seestep4of2.2.3).Thisthreshold wasarbitrarilydeterminedbasedon theresultsinFig.2.Fig.2 showsahistogramoftheabundancesofallgroupedidenticalreads for the singlecontributor samples.To generate this figure, the completesequenceswereconsidered,exceptforthepartoutsideof theprimers.
Erroneousreadsareexpectedtohavemuchsmallerabundances than errorfree reads, especially in single contributor samples. Therearearound105groupsofidenticalreadswithabundances
smaller than 0.5%, which are in the context of the single contributor samples definitelyreadswith errors.The threshold foruniquereadswassetto0.5%,toavoidclutteringtheresultswith noise.Bydoingso,minorcontributorswhichcontributelessthan 0.5%totheDNAmixturewillnotbedetected.
3.2. GeneralpropertiesoftheIlluminadataset
Table2showsthetotalnumberofMiSeqreadsforeachsample, the number of reads filtered based on the 0.5% abundance threshold,andthenumberoferrorfreeROIafterfiltering.Filtered readsconsistofbothreadswithandwithouterrors.ErrorfreeROI arethenumberofreadsofwhichthecompleteROIisidenticaltoa referencedatabaseROI and ofwhich theROI isexpectedtobe presentintherelevantsample.Thesereadscanstillhaveerrorsin theflanksaroundtheROI,buttheseflanksarenotconsideredwhen matching the reads to the reference allele database. This percentageis influencedbymanyfactorssuchasPCRaccuracy, amplicon length, ROI and flank length, stutter, sequencing accuracyandtheabundancefiltercut-off.
A A AG T GCTAGGACAATGAT AGATA GATAGTACGCGCACCTACCTA GCTAGGACAATGATAGATAGATA GTACGCGCACCTACCTA GCTAGGACAATGATAGATAGATAGATAGAT GTACGCGCACCTACCTA
Primer Flank Flank Primer
Region of interest b) Without SNP's in STR region: GCTAGGACAATGATACA TAGATA GATAGTACGCGCACCTA CCTA GCTAGGACAATGATAGA TAGATA GTACGCGCACCTA CCTA GCTAGGACAATGATAGA TAGATAGATAGATTGTACGCGCACCTA CCTA
Primer Flank Flank Primer
Region of interest a) With SNP's in STR region:
Fig.1.Flanksandregionofinterestinthereferencealleledatabase.Agenericlocuswiththreeallelesinthedatabaseisconsideredfortwopossiblecases:withorwithoutSNPs withintheSTRregion.
3.3. STRallelecallsinfourandfivepersonDNAmixtures
Figs.3–6showtheallelecallsforthedifferentsamples.Inthe figures,bluebarsdenotethetheoreticalabundanceoftheallele basedonhowthesampleswereprepared,greenbarsthedetected read abundance, and red bars erroneous reads (including polymerase stutter). Pure samples (Figs. 3 and 6), contain, at most,two blue bars per locus. Erroneous read bars have been drawnnarrower tomakeit possibletoshowallerroneousread
groups.ThefiguresareautomaticallygeneratedfromtheMyFLq resultfiles(insupplementarymaterials),withmanualadditionof thebluebars.
3.4. Progressiveabundancethreshold
Fig.7shows,afterfullanalysis,foreachlocus,thepercentageof errorfreesequences(Y-axis)foragivenabundancethreshold (X-axis).Higherabundancesforerroneoussequencesarelesslikely. Table2
GeneralIlluminasamplescharacteristics.
Sample1 Sample2 Sample3 Sample4
TotalMiSeqreads 246,347 1,176,806 1,261,848 961,236
Filteredreads 203,181(82%) 981,935(83%) 1,073,164(85%) 771,723(80%)
ErrorfreeROI 173,692(71%) 912,643(72%) 996,466(79%) 681,918(71%)
Fig.3.Sample1profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.
Fig.4.Sample2profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance. 0.0000.0050.0100.0150.020 0.0250.0300.0350.0400.045 0.050 0.0550.0600.0650.0700.0750.080 0.0850.0900.0950.100 Abundance of sequence [0-1] 100 101 102 103 104 105 Number (log) Threshold 0.005
When theabundance thresholdis increased, thepercentage of errorfreesequencesincreasesand isexpectedtobecome100%. Lociforwhich lesserrorsequencesareproduced,have a lower thresholdforwhich100%errorfreereadsareremaining.Basedon thiscriterion,AmelogeninandD16S539arethebestperforming lociwhilePentaEandPentaDperformatadecreasedlevel.
3.5. Locusqualityanalysis
Foreachlocus,Table3showsthepercentageofthetheoretical abundance which is measured as MPS signal. The theoretical abundance can be calculated from the proportion of each contributor tothesample (seeTable 1)andare shownas blue Fig.5.Sample3profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.
Fig.6.Sample4profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.
barsinFigs.3–6.Themeasuredsignalsareshownasgreenbarsin thesamefigures.WhentherewouldbenoPCRerrorslikestutter and no sequencing errors,the MPS signalwould be100%. The numbersinTable3werecalculatedbylinearmodelforeachlocus separatelybutforalllocusspecificallelesofallsamplestogether. Fig.8showstheaveragepercentageofflankswhichare‘clean’ in reads with an error-free ROI and in reads with an error-containing ROI. For all loci, except PentaD and D5S818, the proportionofcleanflanksishigherinreadswhichalsocontainan error-freeROI.Theproportionofcleanflanksisalsoimpactedby thelength of the flanks. The negativecorrelation between the
lengthoftheflanksandtheproportionofcleanflanksisexpected aslongersequencescarryahigherprobabilityofcontainingerrors.
4. Discussion
4.1. Frameworkperformance
The MyFLq framework was designed to incorporate all algorithms necessary for an STR MPS analysis in one open framework. Itcontainsabout a thousandlinesof codeand has minimal dependencies: It only requires a working python2 environment and the common python packages numpy and MySQLdb. Currently the target audience of the framework are bioinformaticiansthatworkinthefieldofMPSforensicanalysis. An Illumina BaseSpace application is being developed, which shouldbeavailablebythebeginningof2014.Thisway,userswill nothavetointeractwiththeframework,butwillbeabletoanalyze theirSTRdatafileswiththeMyFLqapplication.
MyFLqcanalsohandleothertypesofforensicdata,suchasSNP or mitochondrialregions. However, it hasnotbeen extensively testedtooperate withsuchdata.Fromtheanalyzedlociinthis study,onlyAmelogenincouldbeconsideredaSNPlocus.STRait Razorisanothertoolthathasbeenrecentlydevelopedforforensic genomicsandasthenameindicatesonlydealswithSTR’s[8].In ouropinion,futuresoftwarepackagesshouldhandlebothSTRand SNP loci equally well, to provide a full solution to forensic researchers.
MyFLqhascurrentlynotbeendesignedtobecomputational efficient.AfullanalysisofaMiSeqdatasettakesapproximatelyone houronasingleCPU.Thecodecontainsmanysectionsthatcould Table3
MPSsignalpercentageoftheoreticalabundance.
Locus MPSsignal/theoreticalabundance(%) Std-error(%)
Amelogenin 95 4 CSF1PO 79 5 D13S317 89 5 D16S539 88 6 D18S51 75 6 D21S11 78 5 D3S1358 86 5 D5S818 89 5 D7S820 84 6 D8S1179 95 6 FGA 73 6 PentaD 25 5 PentaE 27 7 TH01 92 5 TPOX 90 4 vWA 85 5 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 ] 1 -0 [ s k n a l f n a e l c f o n o i t r o p o r p : :I O R e e r f r o r r E 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Er
ror containing ROI: pr
oportion of clean flanks
Amelogenin CSF1PO D13S317 D16S539 D18S51 D21S11 D3S1358 D5S818 D7S820 D8S1179 FGA PentaD PentaE TH01 TPOX vWA 1.0
Fig.8.ProportionofcleanflanksinreadswithanexpectedROIandinreadswithanerror-containingROI.Sizeofcirclesisproportionaltoaverageallelelengthofthelocus, darknessofcirclesisproportionaltoflanklengthofthelocus.
beparallelizedtoreducetheanalysistimesignificantly.Thiswill alsobeimplementedintheBaseSpaceapplication.
4.2. Accuracyanddetectionofminorcontributors
IngeneraltheIlluminaMiSeqreadswereofhighquality,with morethan70%ofreadscontainingacompleteerrorfreeROIand constitute the MPS signal. Several factors are influencingthis percentage.Stutters(around10%)andthereadsfiltered bythe abundancethreshold(around20%)areabigpartoftheother30% of the reads. PCR and sequencing accuracy with a per base sequencing accuracy of approximately 99.5% are other main factors. Reducing the amplicon, and ROI length could further improvetheMPSsignal.Readscanbeusedwithoutclustering them to produce a consensus sequence. This is of particular importance,sinceclusteringshouldbeavoidedwhen investigat-ing forensic mixtures. An allele from a minor contributor containingaSNPoraninsertion/deletioncouldclusterwiththe sequencesfromthecorrespondingalleleofamajorcontributor andgo undetected.When thecontributors are unknown, itis impossibletocategorizeaprioribetweenallelesandamplification errors.Consequently,clusteringshouldbeavoidedifpossible,in order to confront the forensic investigator at least with the presenceofthesesequences.
Becausetheminimalabundancethresholdduringdataanalysis wassetto0.5%,theoreticallyitwasstillpossibletodetectalleles fromthe1%contributor in Sample3.However, only forthe5% contributor the alleles were detected consistently, showing a correct allele call for all loci except D13S317. However, the experimentwasnotsetuptodeterminetheallelicdetectionlimit. Manyallelesoftheminorcontributorsarethesameasoneofthe moreabundantcontributors. Futureresearchwill beneededto establish a general detection limit in mixtures for each MPS technologyandPCRmultiplex.
4.3. Lociperformance
Incapillaryelectrophoresis(CE),theallelespecificsignalis85– 90%,aspolymerasestutterproducts usuallycomprise approxi-mately10–15%oftheparentallele’ssignal.Thesestutterproducts were also observed in MPS data. These stutter sequences combined with sequencing errors results in an allele specific MPSsignalthatisslightlylowerthantheabsoluteDNAinput.For mostlocithesevaluesareveryreasonable,asshowninTable3, except for PentaE and PentaD. Those two loci are obvious underperformersinthecurrent experimentalsetup,bothwith allele specificsignalsaround25%.Comparedto theother loci, these2locihavelongamplicons,longROIandlongflanklengths. While these factors are partially contributing to a higher proportionof reads containing errors in theROI,they are not completelyexplainingthelowsignal.Itisunclearatwhichpoint thattheerrorsareintroduced,butgiventhehighabundanceof someoftheerrors,theyareprobablyintroducedatthePCRstep. Futureresearchwill showhowusefultheywill bein MPSSTR analyses.
Fig.8showsahigherproportionofcleanflanksinreadswith error-free ROIs. This data could be modeled to estimate the likelihoodoferrorinROIsofanunknownsample.Fig.8showsthat theusefulnessofthisstrategywilldependontheconsideredlocus because values for ‘clean’, ‘compressed-clean’ and ‘unclean’ dependonlocussequencecharacteristicssuchasaveragelength andhomopolymercontent.
ForanMPStechnologytobecomeavalidalternativetoCEit mustproduceatleastequivalentresultsforthecommonlyused loci,andproduceadditional,valuableinformationforanexpanded setofloci.MostlocialreadyperformedverywellwiththeIllumina
MiSeq, while some(PentaD,PentaE)needfurther optimization, suchasprimerre-design,inordertoeclipseCE.AsMPSbecomesa validalternative,ourframeworkcanbeusedtohelpidentifyan additionalsetofidealMPSloci.
4.4. Dynamicflankcalculation
Usingsequencing,onlytheROIneedstobeconsideredduring data analysis,i.e.thepartof thesequencethat differsbetween differentallelesforalocus.Anypartoutsidethatregioncanbe treatedasflanking.Asourframeworkisbuilttogenericallyhandle all forensicloci, theROI andtheflanksaround theROIare not predefined, but are dynamically determined. This dynamic flanking tries tomaximizethe flanksand to minimizetheROI for each locus based on the reference alleles present in the reference allele database. Removing flanks from the reads minimizestheimpactoferrorsintheflanksontheanalysis.This aidsdetectingallelesofminorDNAcontributors,asasignificant portionofthenoiseiseliminated.
Applicationofourmethodiscurrentlylimitedbythesizeof thereferencealleledatabase,whichcontainsasmallsubsetofall ofthepossiblealleles,i.c.onlyallelespresentintheDNA from Table 1. To determine the practical applicability of our framework, moresamplesneedtobeanalyzed toincreasethe allele databasesize.However, for futureapplications itwould also be useful to automatically limit the database size by subsetting it. Database alleles could, for example,be selected basedonthesizeofthesequencesinaspecificsample,because the size ofreads already reduces theset ofpossible allelesto which they could be assigned. This subset of possible alleles wouldthenservetocalculatemaximumflanksonthefly.Smaller subsets will resultin bigger flankswhich in turn reduces the impact oferrors.
5. Conclusion
When MPSbecomes routinelyusedtoanalyzeforensic DNA profiles,decisionswillneedtobemadeonhowthedatashouldbe processed.Wepresentourbioinformaticframework,MyFLq,that processes forensic MPS data prudently: without clustering, extracting maximal information withautomaticallydetermined regionsofinterest.TheresultsshowIlluminaMiSeqisreadyto analyze STR profiles. For routine implementation in forensic laboratories,acarefulselectionofloci,PCRmultiplexoptimization, andanMPS-basedSTRallelereferencedatabaseforalignment,are needed.
Supplementarymaterials
MyFLq.pyThemainframeworkprogram.
InstallMyFLqTables.pyInstallstheMySQLtablesnecessaryfor usingMyFLq.
results.cssNeededinthesamedirectoryasthesampleresult files, for opening theresult filesin a browser. Thisfile also containsdocumentationtointerprettheresultfiles.
sample1–4.xmlSampleresultfiles.
Acknowledgments
FundingwasprovidedbytheGhentUniversityMultidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to net-works’.
TheauthorswouldalsoliketothankIllumina,Inc.forproviding their technical expertise in sample preparation and MiSeq sequencing.
AppendixA. Supplementarydata
Supplementarydataassociatedwiththisarticlecanbefound, in the online version, at http://dx.doi.org/10.1016/j.fsigen.2013. 10.012.
References
[1]D.L.Deforce,R.E.Millecamps,D.V.Hoofstat,E.G.V.denEeckhout,Comparisonof slabgelelectrophoresisandcapillaryelectrophoresisforthedetectionofthe fluorescentlylabeledpolymerasechainreactionproductsofshorttandemrepeat fragments,J.Chromatogr.A806(1)(1998)149–155.,http://dx.doi.org/10.1016/ S0021-9673(97)00394-4.
[2]S.L.Fordyce,M.C.Avila-Arcos,E.Rockenbauer,C.Borsting,R.Frank-Hansen,F.T. Petersen,E.Willerslev,A.J.Hansen,N.Morling,M.T.P.Gilbert,High-throughput sequencingofcoreSTRlociforforensicgeneticinvestigationsusingtheRoche GenomeSequencer FLXplatform,Biotechniques51(2)(2011)127þ,http:// dx.doi.org/10.2144/000113721.
[3]K.K.Kidd,J.R.Kidd,W.C.Speed,R.Fang,M.R.Furtado,F.C.L.Hyland,A.J.Pakstis, ExpandingdataandresourcesforforensicuseofSNPsinindividualidentification, ForensicSci.Int.Genet.6(5)(2012)646–652,doi:10.1016/j.fsigen.2012.02.012.
[4]L.J.McIver,J.W.FondonIII,M.A.Skinner,H.R.Garner,Evaluationofmicrosatellite variationinthe1000GenomesProjectpilotstudiesisindicativeofthequalityand utilityoftherawdataandalignments,Genomics97(4)(2011)193–199.,http:// dx.doi.org/10.1016/j.ygeno.2011.01.001.
[5]J.V.Planz,K.A.Sannes-Lowery,D.D.Duncan,S.Manalili,B.Budowle,R.Chakraborty, S.A.Hofstadler,T.A.Hall,AutomatedanalysisofsequencepolymorphisminSTR allelesbyPCRanddirectelectrosprayionizationmassspectrometry,ForensicSci. Int.Genet.6(5)(2012)594–606.,http://dx.doi.org/10.1016/j.fsigen.2012.02.002. [6]C.VanNeste,F.VanNieuwerburgh,D.VanHoofstat,D.Deforce,ForensicSTR analysisusingmassiveparallelsequencing,ForensicSci.Int.Genet.6(6)(2012) 810–818.,http://dx.doi.org/10.1016/j.fsigen.2012.03.004.
[7]M.Gymrek,D.Golan,S.Rosset,Y.Erlich,lobSTR:ashorttandemrepeatprofilerfor personalgenomes,GenomeRes.22(6)(2012)1154–1162.,http://dx.doi.org/ 10.1101/gr.135780.111.
[8]D.H.Warshauer,D.Lin,K.Hari,R.Jain,C.Davis,B.LaRue,J.L.King,B.Budowle, STRaitRazorAlength-basedforensicSTRallele-callingtoolforusewithsecond generationsequencingdata,ForensicSci.Int.Genet.7(7)(2013)409–417.
[9]A.Masibay,T.Mozer,C.Sprecher,PromegaCorporationrevealsprimersequences initstestingkits,J.ForensicSci.45(6)(2000)1360–1362.
[10]C.Ruitberg,D.Reeder,J.Butler,STRBase:ashorttandemrepeatDNAdatabase forthehumanidentitytestingcommunity,NucleicAcidsRes.29(1)(2001)320– 322.,http://dx.doi.org/10.1093/nar/29.1.320.
[11]P.Gill,C.Brenner,B.Brinkmann,B.Budowle,A.Carracedo,M.Jobling,P.deKnijff, M.Kayser,M.Krawczak,W.Mayr,N.Morling,B.Olaisen,V.Pascali,M.Prinz,L. Roewer, P.Schneider, A.Sajantila,C. Tyler-Smith,DNA Commission ofthe InternationalSocietyofForensicGenetics:recommendationsonforensicanalysis usingY-chromosomeSTRs,Int.J.LegalMed.114(6)(2001)305–309.,http:// dx.doi.org/10.1007/s004140100232.
[12]B.Olaisen,W.Bar,B.Brinkmann,B.Budowle,A.Carracedo,P.Gill,P.Lincoln,W. Mayr,S.Rand, DNArecommendations1997oftheInternational Societyfor ForensicGenetics,VoxSang.74(1)(1998)61–63.,http://dx.doi.org/10.1046/ j.1423-0410.1998.7410061.x.