• No results found

My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing

N/A
N/A
Protected

Academic year: 2021

Share "My-Forensic-Loci-queries (MyFLq) framework for analysis of forensic STR data generated by massive parallel sequencing"

Copied!
8
0
0

Loading.... (view fulltext now)

Full text

(1)

My-Forensic-Loci-queries

(MyFLq)

framework

for

analysis

of

forensic

STR

data

generated

by

massive

parallel

sequencing

§

Christophe

Van

Neste

a

,

Mado

Vandewoestyne

a

,

Wim

Van

Criekinge

b

,

Dieter

Deforce

a,1

,

Filip

Van

Nieuwerburgh

a,1,

*

a

LaboratoryofPharmaceuticalBiotechnology,FacultyofPharmaceuticalSciences,GhentUniversity,Ghent,Belgium b

Biobix,FacultyofBioscienceEngineering,GhentUniversity,Ghent,Belgium

1. Introduction

ForensicDNAprofilesofshorttandemrepeatlociarecurrently obtainedusingPCRfollowedbycapillaryelectrophoresis(CE).CE separatesfluorescentlylabeledPCRproductsbasedontheirlength [1].Becauseofthis,inordertoproduceunambiguousallelecalls,the sizeranges of STR lociwith the samefluorescenttag mustnot overlap.Thislimitsthenumberoflocithatcanbeinvestigatedina single PCR and in a single capillary injection. Massive parallel sequencing (MPS) technologies, also known as second or next generation sequencing, do not relyon sizeseparation and thus

relieve the limitation on locus multiplexy [2,3]. Additionally, multiplesamplescanbemultiplexedatthesametimeinasingle run.MPSallowsforanalysisofmillionsofindividualDNAstrands (reads) inaDNAmixture,whichintheorywouldallowforhigh resolutionmixtureanalysis.Sequencing alsomakesitpossibleto detectsinglenucleotidepolymorphisms(SNPs)andSTRsequence variantsinadditiontothegrossSTRrepeatnumber[4].Thisallows analyststotellthedifferencebetweenequilengthallelesinaDNA mixture.Certainmassspectrometrytechniquesalsomakeitpossible todifferentiateequilengthalleles,butcompletecharacterizationof polymorphismcanonlybeaccomplishedbysequencing[5].

Forensicscientistsarecurrentlyinvestigatinghowtotransition fromCEtoMPS.Severalbioinformatictoolsarebeingdeveloped to that end [2,6–8]. Previously, we reported that sequencing of multiplexed STR amplicons using Roche GS FLX titanium technology was technically feasible both in single contributor samplesandinmultiplepersonDNAmixtures,notwithstandinga ARTICLE INFO

Articlehistory: Received25July2013

Receivedinrevisedform4October2013 Accepted22October2013 Keywords: Illumina MiSeq STR Forensicloci MPS NGS ABSTRACT

Forensicscientistsarecurrentlyinvestigatinghowtotransitionfromcapillaryelectrophoresis(CE)to massiveparallelsequencing(MPS)foranalysisofforensicDNAprofiles.MPSoffersseveraladvantages overCEsuchasvirtuallyunlimitedmultiplexyofloci,combiningbothshorttandemrepeat(STR)and singlenucleotidepolymorphism(SNP)loci,smallampliconswithoutconstraintsofsizeseparation,more discriminationpower,deepmixtureresolutionandsamplemultiplexing.Wepresentourbioinformatic frameworkMy-Forensic-Loci-queries(MyFLq)foranalysisofMPSforensicdata.Forallelecalling,the frameworkusesaMySQLreferencealleledatabasewithautomaticallydeterminedregionsofinterest (ROIs)byagenericmaximalflankingalgorithmwhichmakesitpossibletouseanySTRorSNPforensic locus.PythonscriptsweredesignedtoautomaticallymakeallelecallsstartingfromrawMPSdata.We alsopresentamethodtoassesstheusefulnessandoverallperformanceofaforensiclocuswithrespectto MPS,aswellasmethodstoestimatewhetheranunknownallele,whichsequenceisnotpresentinthe MySQLdatabase,isinfactanewalleleorasequencingerror.TheMyFLqframeworkwasappliedtoan Illumina MiSeq dataset of a forensic Illumina amplicon library, generated from multilocus STR polymerasechainreaction(PCR)onbothsinglecontributorsamplesandmultiplepersonDNAmixtures. AlthoughthemultilocusPCRwasnot yetoptimizedforMPSintermsofampliconlengthorlocus selection,theresultsshowexcellentresultsformostloci.Theresultsshowahighsignal-to-noiseratio, correctallelecalls,andalowlimitofdetectionforminorDNAcontributorsinmixedDNAsamples. Technically,forensicMPSaffordsgreatpromiseforroutineimplementationinforensicgenomics.The methodisalsoapplicabletoadjacentdisciplinessuchasmolecularautopsyinlegalmedicineandin mitochondrialDNAresearch.

ß2013TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCC BY-NC-SAlicense(http://creativecommons.org/licenses/by-nc-sa/3.0/).

§

Thisisan open-accessarticledistributedunderthetermsoftheCreative Commons Attribution-NonCommercial-No Derivative Works License, which permitsnon-commercial use, distribution,and reproduction in any medium, providedtheoriginalauthorandsourcearecredited.

* Correspondingauthor.Tel.:þ3292648052. 1

Theseauthorscontributedequallytothiswork.

ContentslistsavailableatScienceDirect

Forensic

Science

International:

Genetics

j o urn a l hom e pa ge : ww w. e l s e v i e r. c om/ l o ca t e / f si g

http://dx.doi.org/10.1016/j.fsigen.2013.10.012

1872-4973/ß2013TheAuthors.PublishedbyElsevierIrelandLtd.ThisisanopenaccessarticleundertheCCBY-NC-SAlicense( http://creativecommons.org/licenses/by-nc-sa/3.0/).

(2)

poorperformanceforsomefrequentlyusedforensicloci[6].For those loci, the GS FLX reads needed to be transformed by a homopolymer-compressionalgorithmtoobtainresultsconsistent enoughformixtureanalysis.IntheGSFLXSTRanalysis,thereads neededtobeclusteredaroundaconsensussequence.Theregionof interest(ROI)wassetusingfixedflankingbasedonthespecificPCR primersforeachlocus.Asignalthresholdwasusedtodetermine whichreadclustersareconsideredasignalandwhichclustersare considerednoise.

In the current study we created a general bioinformatic framework, that we call My-Forensic-Loci-queries (MyFLq), for analyzingMPS-generatedforensic data.Itisdesignedtohandle multiplelocus types, includingSTR lengthpolymorphisms and SNPs.TheframeworkusesaMySQLreferencealleledatabasewith automaticallydeterminedROIsandpythonscriptstocompareMPS sequencesagainsttheknownalleledatabase.Wealsopresenta method to asses the usefulness and overall performance of a forensiclocuswhen usedin anMPS analysis,and a methodto estimatewhetheranallelewhichisnotpresentinthedatabaseis infactanewlytypedalleleoraPCR/sequencingerror.

TheMyFLqframeworkwasusedonanIlluminaMiSeqdatasetof anIlluminaforensicampliconlibrarygeneratedfromSTRmultilocus PCR on both single contributor samples and multiple person mixtures.ThemultilocusPCRwas notspecificallyoptimized for MPSintermsofcriteriasuchaslocusselectionorampliconlength. Theresultsarepromising,showingexcellentresultsonmostbutnot allloci.Therawreadaccuracywashighenoughsothatthereadsdid notneedtobeclusteredaroundaconsensusaccuracyaswiththeGS FLXreads[6].ThefrequencyofhomopolymererrorsintheMiSeq dataisinsignificantcomparedtotheGSFLXtechnology. Neverthe-less,thehomopolymercompressionalgorithmstillprovedusefulto groupsome readscontaining errors,whetherfromPCRor from sequencing,withthosethatwerecompletelyerror-freeinrespectto acontributor’sreferencesequence.

The MyFLqframeworkhas aCreative Commons opensource license(CCBY-SA3.0).Thesourcecodeisavailableassupplementary material or for the latest version at http://MyFLq.UGent.van-neste.be.CurrentlyweareimplementingitasanIlluminaBaseSpace application,whichshouldbeavailablebythebeginningof2014.

2. Materialsandmethods

2.1. SamplepreparationandprocessingusingIlluminachemistry DNA mixtures werepreparedaccordingtoTable 1from the followingpurifiedgenomicDNAsources:K562(Promega),9947A (Promega), 2800M (Promega) and two National Institute of Standards and Technology (NIST) standard reference materials (SRM2391c:DNAA,DNAB).DNAconcentrationofeachsample was measured using the Qubit1

dsDNA HS Assay and Qubit Fluorimeterfollowing manufacturer’sinstructions.A mixtureof foursourceDNAsandamixtureoffivesourceDNAs,alongwith 9947AandK562singlesourcesamples,wereusedinmultilocus PCR(Table1).

Primer sequences were ordered without fluorescent tags (Integrated DNA Technologies) for loci: Amelogenin, CSF1PO, D13S317,D16S539,D18S51,D21S11,D3S1358,D5S818,D7S820, D8S1179, FGA, PentaD,PentaE, TH01,TPOX, and vWA[9]. This multiplex hasyet toundergo optimization (e.g., for intra- and inter-locus balance, polymerase stutter (slippage) and other artifacts) forusein MPS.Primers wereusedata concentration of2

m

MeachinPCRwith1ngofDNAorDNAmixturein1Gold STRbuffer(Promega)and0.16UAmpliTaqGOLD(Invitrogen).The samples were amplified using a BioRad Tetrad instrument as follows:958Cfor11min,968Cfor1min;10cyclesof948Cfor 30s,ramp0.58C/sto608Candthen30sat608C,ramp0.28C/sto 708Candthen45sat708C;22cyclesof908Cfor30s,ramp0.58C/ sto608Candthen30sat608C,ramp0.28C/sto708Candthen 45sat708C;holdat608Cfor30min;48Csoak.

PCR products were quantified using Qubit1

(Invitrogen) withoutpurification.LibrariesweregeneratedbyligatingTruSeq DNAadapterstothePCRproductsfrom50ngofunpurifiedPCR product(Illumina).Samplesweresubjectedto5cyclesofPCRand purifiedwithSPRI(TruSeqDNASamplePreparationGuide). The completed libraries were quantified using a qPCR assay as recommendedbyIllumina.

LibrarieswerepooledwithPhiXUniversalLibraryandaHuman DNAlibrary.Pooledlibrariesweredenaturedanddilutedto10pM followingIlluminaguidelinesandsequencedonaMiSequsinga MiSeqReagentKitv2,500cycleswithamodifiedrecipe.Samples weredemultiplexedusingtheindexsequences,FASTQfileswere generatedautomaticallyusingMiSeqReporter(MSR).

2.2. MiSeqdataanalysis 2.2.1. TheMyFLqframework

All necessary steps for an MPS forensic analysis were incorporatedintoouropen-sourceframework.MyFLqis notyet afullapplication.Theendresultsneedtobestatisticallyanalyzed: probabilities of allelecalls, stutterfiltering, hetero- and homo-zygoticallelecallingandvisualizationarenotyetimplemented.

Theframeworkconsistsof twoparts:(1)A MySQLdatabase backendthatispopulatedbyknown referencealleles,and(2)a Pythonfrontendwithfunctionsforaddingreferenceallelestothe reference allele database, and analysis of MPS STR data from forensic samples. The source code and documentation of all functionscanbefoundinsupplementarymaterials.Italsolists specificallythefunctionsusedforthispaper,andprovidesashort descriptionforeach.

2.2.2. Buildingthereferencealleledatabase

Thealleledatabaseisideallypopulatedwiththesequencesof allknownSTRallelesthatexistinthegeneralpopulation.Because eachof thesesequencesiscurrentlynotavailable,thedatabase was initially populated withthe STR sequences from theDNA sources in Table 1. These reference sequences were manually inferredfromtheIlluminasequencingdataandtheSTRBaseallele database[10].Theyarenotthebestrepresentationofpopulation alleles,butsufficeforthecurrentstudy.Inthefuture,withMPSthe knowndiversitywillbebetterdetermined.

After building the reference allele database, the function processLociNameswasusedtodeterminetheflankingregionfor each locus. The function processLociAlleles produced a table containingtheROIforeachalleleanditsintegerallelenumber(if STR)accordingtostandardnomenclature[11,12].

Flanking regions are the maximum right- and left-end consensusbetween all alleles of a locusin the referenceallele database.Fig.1showsasimplifiedexample.Eachallelehastwo primers,twoflanksandtheROI.Ifthereferencedatabaseallelesfor alocusonlydifferbythenumberofSTRrepeatsandthusnoSNPs, Table1

DNAcompositionofsamples.

DNAstandard Sample1(%) Sample2 (%) Sample3 (%) Sample4 (%) K562 100 0.10 9947A 40 0.50 100 NISTSRM2391c DNAA 30 1 NISTSRM2391c DNAB 20 5 2800M 10 93.40

(3)

non-consensus,orpartialrepeatpatternsarepresent,oneofthe alleles for that locuscan have a ROI of zerolength, as Fig. 1b demonstrates.BecausetheROIiscalculateddynamicallybasedon thesequencespresentinthereferencealleledatabase,theROIof thelocicanchangewhenthereferencealleledatabaseisupdated. 2.2.3. STRdataanalysis

Afterbuildingthedatabase,allFASTQfileswereanalyzedwith theframeworksoftwareasiftheywouldhavebeenfromunknown samples. Analysis consisted of following steps: (1) reads were assigned toa locusbased on the presenceof both PCR primer sequencesforthat locus; (2)primers andflanks wereremoved fromthereads,leavingonlythereadROI;(3)readsweregrouped based on their exact sequence;(4) groups with an abundance lowerthan0.5%werediscarded;(5) theROI ofeachgroupwas comparedtothealleledatabasetableandanallelecallwasmade whenanexact matchwasfound;and (6)groupswithina locus werecomparedtoeachother,aconnectionwasflaggedwhentwo ROIsdifferedbymaximumtwoSNPsorSTRsizeindels.

Instep(2)thedeterminationoftheflankingregionswasdone asfollows:ifaread-endmatchedexactlywiththedatabaseflank, the read-end was removed and flagged ‘clean’. Otherwise, the databaseflankandread-endwerehomopolymer-compressed(two sameconsecutivebaseswerealreadyconsideredahomopolymer). Whenthisresultedinanexactmatchbetweenthosesequences, theread-endwasremovedandflagged‘compressed-clean’.Ifthe flankwasstillnotremoved,theread-endwasflaggedas‘unclean’ andwasremovedbyourflexibleflankingalgorithm(seebelow).In step (4), for each group, counts were gathered on ‘clean’, ‘compressed-clean’, and ‘unclean’ read-ends. Step (6) indicates howtheROIsin theresultsareinterrelated.Thishelpsforensic researcherstodecideonthelikelihoodofanallelecall,e.g.aROI thatisnotinthedatabase,hasalowabundanceintheresultsand onlydiffersbyoneSNPofanotherhighabundantROI,isprobably anerror-containingROI.

2.2.4. Flexibleflankingalgorithm

Thisalgorithmalwaysremovesaflankfromaread,nomatter howdissimilartothedatabaseflank.Thestarting hypothesisis thattheread-endtoremoveisaslongasthedatabaseflank.K-mers forthedatabaseflankwithincreasinglengtharesearchedaround their expected index in the read. The found index is scored dependingonhowinformativethek-meris:moreinformativeif

longerandifclosertotheflank-end.Thescoreiscalculatedasthe squarerootoftheproductofthosetwovalues(lengthofk-merand inversedistancetodatabaseflankend).Finally,theproposedindex withthehighestsummedscoreisconsideredtobethemostlikely ending flank position. Based on that position, the read-end is removed. For a more in depth explanation, documentation is provided forthis algorithm inthesourcefile insupplementary materials.

3. Results

3.1. Settinggeneralthreshold

In theanalysis ofMPS STR data,groups withanabundance lowerthan0.5%werediscarded(seestep4of2.2.3).Thisthreshold wasarbitrarilydeterminedbasedon theresultsinFig.2.Fig.2 showsahistogramoftheabundancesofallgroupedidenticalreads for the singlecontributor samples.To generate this figure, the completesequenceswereconsidered,exceptforthepartoutsideof theprimers.

Erroneousreadsareexpectedtohavemuchsmallerabundances than errorfree reads, especially in single contributor samples. Therearearound105groupsofidenticalreadswithabundances

smaller than 0.5%, which are in the context of the single contributor samples definitelyreadswith errors.The threshold foruniquereadswassetto0.5%,toavoidclutteringtheresultswith noise.Bydoingso,minorcontributorswhichcontributelessthan 0.5%totheDNAmixturewillnotbedetected.

3.2. GeneralpropertiesoftheIlluminadataset

Table2showsthetotalnumberofMiSeqreadsforeachsample, the number of reads filtered based on the 0.5% abundance threshold,andthenumberoferrorfreeROIafterfiltering.Filtered readsconsistofbothreadswithandwithouterrors.ErrorfreeROI arethenumberofreadsofwhichthecompleteROIisidenticaltoa referencedatabaseROI and ofwhich theROI isexpectedtobe presentintherelevantsample.Thesereadscanstillhaveerrorsin theflanksaroundtheROI,buttheseflanksarenotconsideredwhen matching the reads to the reference allele database. This percentageis influencedbymanyfactorssuchasPCRaccuracy, amplicon length, ROI and flank length, stutter, sequencing accuracyandtheabundancefiltercut-off.

A A AG T GCTAGGACAATGAT AGATA GATAGTACGCGCACCTACCTA GCTAGGACAATGATAGATAGATA GTACGCGCACCTACCTA GCTAGGACAATGATAGATAGATAGATAGAT GTACGCGCACCTACCTA

Primer Flank Flank Primer

Region of interest b) Without SNP's in STR region: GCTAGGACAATGATACA TAGATA GATAGTACGCGCACCTA CCTA GCTAGGACAATGATAGA TAGATA GTACGCGCACCTA CCTA GCTAGGACAATGATAGA TAGATAGATAGATTGTACGCGCACCTA CCTA

Primer Flank Flank Primer

Region of interest a) With SNP's in STR region:

Fig.1.Flanksandregionofinterestinthereferencealleledatabase.Agenericlocuswiththreeallelesinthedatabaseisconsideredfortwopossiblecases:withorwithoutSNPs withintheSTRregion.

(4)

3.3. STRallelecallsinfourandfivepersonDNAmixtures

Figs.3–6showtheallelecallsforthedifferentsamples.Inthe figures,bluebarsdenotethetheoreticalabundanceoftheallele basedonhowthesampleswereprepared,greenbarsthedetected read abundance, and red bars erroneous reads (including polymerase stutter). Pure samples (Figs. 3 and 6), contain, at most,two blue bars per locus. Erroneous read bars have been drawnnarrower tomakeit possibletoshowallerroneousread

groups.ThefiguresareautomaticallygeneratedfromtheMyFLq resultfiles(insupplementarymaterials),withmanualadditionof thebluebars.

3.4. Progressiveabundancethreshold

Fig.7shows,afterfullanalysis,foreachlocus,thepercentageof errorfreesequences(Y-axis)foragivenabundancethreshold (X-axis).Higherabundancesforerroneoussequencesarelesslikely. Table2

GeneralIlluminasamplescharacteristics.

Sample1 Sample2 Sample3 Sample4

TotalMiSeqreads 246,347 1,176,806 1,261,848 961,236

Filteredreads 203,181(82%) 981,935(83%) 1,073,164(85%) 771,723(80%)

ErrorfreeROI 173,692(71%) 912,643(72%) 996,466(79%) 681,918(71%)

Fig.3.Sample1profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.

Fig.4.Sample2profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance. 0.0000.0050.0100.0150.020 0.0250.0300.0350.0400.045 0.050 0.0550.0600.0650.0700.0750.080 0.0850.0900.0950.100 Abundance of sequence [0-1] 100 101 102 103 104 105 Number (log) Threshold 0.005

(5)

When theabundance thresholdis increased, thepercentage of errorfreesequencesincreasesand isexpectedtobecome100%. Lociforwhich lesserrorsequencesareproduced,have a lower thresholdforwhich100%errorfreereadsareremaining.Basedon thiscriterion,AmelogeninandD16S539arethebestperforming lociwhilePentaEandPentaDperformatadecreasedlevel.

3.5. Locusqualityanalysis

Foreachlocus,Table3showsthepercentageofthetheoretical abundance which is measured as MPS signal. The theoretical abundance can be calculated from the proportion of each contributor tothesample (seeTable 1)andare shownas blue Fig.5.Sample3profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.

Fig.6.Sample4profile.Bluebars=theoreticalabundance,greenbars=detectedreadabundance,redbars=erroneousreadabundance.

(6)

barsinFigs.3–6.Themeasuredsignalsareshownasgreenbarsin thesamefigures.WhentherewouldbenoPCRerrorslikestutter and no sequencing errors,the MPS signalwould be100%. The numbersinTable3werecalculatedbylinearmodelforeachlocus separatelybutforalllocusspecificallelesofallsamplestogether. Fig.8showstheaveragepercentageofflankswhichare‘clean’ in reads with an error-free ROI and in reads with an error-containing ROI. For all loci, except PentaD and D5S818, the proportionofcleanflanksishigherinreadswhichalsocontainan error-freeROI.Theproportionofcleanflanksisalsoimpactedby thelength of the flanks. The negativecorrelation between the

lengthoftheflanksandtheproportionofcleanflanksisexpected aslongersequencescarryahigherprobabilityofcontainingerrors.

4. Discussion

4.1. Frameworkperformance

The MyFLq framework was designed to incorporate all algorithms necessary for an STR MPS analysis in one open framework. Itcontainsabout a thousandlinesof codeand has minimal dependencies: It only requires a working python2 environment and the common python packages numpy and MySQLdb. Currently the target audience of the framework are bioinformaticiansthatworkinthefieldofMPSforensicanalysis. An Illumina BaseSpace application is being developed, which shouldbeavailablebythebeginningof2014.Thisway,userswill nothavetointeractwiththeframework,butwillbeabletoanalyze theirSTRdatafileswiththeMyFLqapplication.

MyFLqcanalsohandleothertypesofforensicdata,suchasSNP or mitochondrialregions. However, it hasnotbeen extensively testedtooperate withsuchdata.Fromtheanalyzedlociinthis study,onlyAmelogenincouldbeconsideredaSNPlocus.STRait Razorisanothertoolthathasbeenrecentlydevelopedforforensic genomicsandasthenameindicatesonlydealswithSTR’s[8].In ouropinion,futuresoftwarepackagesshouldhandlebothSTRand SNP loci equally well, to provide a full solution to forensic researchers.

MyFLqhascurrentlynotbeendesignedtobecomputational efficient.AfullanalysisofaMiSeqdatasettakesapproximatelyone houronasingleCPU.Thecodecontainsmanysectionsthatcould Table3

MPSsignalpercentageoftheoreticalabundance.

Locus MPSsignal/theoreticalabundance(%) Std-error(%)

Amelogenin 95 4 CSF1PO 79 5 D13S317 89 5 D16S539 88 6 D18S51 75 6 D21S11 78 5 D3S1358 86 5 D5S818 89 5 D7S820 84 6 D8S1179 95 6 FGA 73 6 PentaD 25 5 PentaE 27 7 TH01 92 5 TPOX 90 4 vWA 85 5 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 ] 1 -0 [ s k n a l f n a e l c f o n o i t r o p o r p : :I O R e e r f r o r r E 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Er

ror containing ROI: pr

oportion of clean flanks

Amelogenin CSF1PO D13S317 D16S539 D18S51 D21S11 D3S1358 D5S818 D7S820 D8S1179 FGA PentaD PentaE TH01 TPOX vWA 1.0

Fig.8.ProportionofcleanflanksinreadswithanexpectedROIandinreadswithanerror-containingROI.Sizeofcirclesisproportionaltoaverageallelelengthofthelocus, darknessofcirclesisproportionaltoflanklengthofthelocus.

(7)

beparallelizedtoreducetheanalysistimesignificantly.Thiswill alsobeimplementedintheBaseSpaceapplication.

4.2. Accuracyanddetectionofminorcontributors

IngeneraltheIlluminaMiSeqreadswereofhighquality,with morethan70%ofreadscontainingacompleteerrorfreeROIand constitute the MPS signal. Several factors are influencingthis percentage.Stutters(around10%)andthereadsfiltered bythe abundancethreshold(around20%)areabigpartoftheother30% of the reads. PCR and sequencing accuracy with a per base sequencing accuracy of approximately 99.5% are other main factors. Reducing the amplicon, and ROI length could further improvetheMPSsignal.Readscanbeusedwithoutclustering them to produce a consensus sequence. This is of particular importance,sinceclusteringshouldbeavoidedwhen investigat-ing forensic mixtures. An allele from a minor contributor containingaSNPoraninsertion/deletioncouldclusterwiththe sequencesfromthecorrespondingalleleofamajorcontributor andgo undetected.When thecontributors are unknown, itis impossibletocategorizeaprioribetweenallelesandamplification errors.Consequently,clusteringshouldbeavoidedifpossible,in order to confront the forensic investigator at least with the presenceofthesesequences.

Becausetheminimalabundancethresholdduringdataanalysis wassetto0.5%,theoreticallyitwasstillpossibletodetectalleles fromthe1%contributor in Sample3.However, only forthe5% contributor the alleles were detected consistently, showing a correct allele call for all loci except D13S317. However, the experimentwasnotsetuptodeterminetheallelicdetectionlimit. Manyallelesoftheminorcontributorsarethesameasoneofthe moreabundantcontributors. Futureresearchwill beneededto establish a general detection limit in mixtures for each MPS technologyandPCRmultiplex.

4.3. Lociperformance

Incapillaryelectrophoresis(CE),theallelespecificsignalis85– 90%,aspolymerasestutterproducts usuallycomprise approxi-mately10–15%oftheparentallele’ssignal.Thesestutterproducts were also observed in MPS data. These stutter sequences combined with sequencing errors results in an allele specific MPSsignalthatisslightlylowerthantheabsoluteDNAinput.For mostlocithesevaluesareveryreasonable,asshowninTable3, except for PentaE and PentaD. Those two loci are obvious underperformersinthecurrent experimentalsetup,bothwith allele specificsignalsaround25%.Comparedto theother loci, these2locihavelongamplicons,longROIandlongflanklengths. While these factors are partially contributing to a higher proportionof reads containing errors in theROI,they are not completelyexplainingthelowsignal.Itisunclearatwhichpoint thattheerrorsareintroduced,butgiventhehighabundanceof someoftheerrors,theyareprobablyintroducedatthePCRstep. Futureresearchwill showhowusefultheywill bein MPSSTR analyses.

Fig.8showsahigherproportionofcleanflanksinreadswith error-free ROIs. This data could be modeled to estimate the likelihoodoferrorinROIsofanunknownsample.Fig.8showsthat theusefulnessofthisstrategywilldependontheconsideredlocus because values for ‘clean’, ‘compressed-clean’ and ‘unclean’ dependonlocussequencecharacteristicssuchasaveragelength andhomopolymercontent.

ForanMPStechnologytobecomeavalidalternativetoCEit mustproduceatleastequivalentresultsforthecommonlyused loci,andproduceadditional,valuableinformationforanexpanded setofloci.MostlocialreadyperformedverywellwiththeIllumina

MiSeq, while some(PentaD,PentaE)needfurther optimization, suchasprimerre-design,inordertoeclipseCE.AsMPSbecomesa validalternative,ourframeworkcanbeusedtohelpidentifyan additionalsetofidealMPSloci.

4.4. Dynamicflankcalculation

Usingsequencing,onlytheROIneedstobeconsideredduring data analysis,i.e.thepartof thesequencethat differsbetween differentallelesforalocus.Anypartoutsidethatregioncanbe treatedasflanking.Asourframeworkisbuilttogenericallyhandle all forensicloci, theROI andtheflanksaround theROIare not predefined, but are dynamically determined. This dynamic flanking tries tomaximizethe flanksand to minimizetheROI for each locus based on the reference alleles present in the reference allele database. Removing flanks from the reads minimizestheimpactoferrorsintheflanksontheanalysis.This aidsdetectingallelesofminorDNAcontributors,asasignificant portionofthenoiseiseliminated.

Applicationofourmethodiscurrentlylimitedbythesizeof thereferencealleledatabase,whichcontainsasmallsubsetofall ofthepossiblealleles,i.c.onlyallelespresentintheDNA from Table 1. To determine the practical applicability of our framework, moresamplesneedtobeanalyzed toincreasethe allele databasesize.However, for futureapplications itwould also be useful to automatically limit the database size by subsetting it. Database alleles could, for example,be selected basedonthesizeofthesequencesinaspecificsample,because the size ofreads already reduces theset ofpossible allelesto which they could be assigned. This subset of possible alleles wouldthenservetocalculatemaximumflanksonthefly.Smaller subsets will resultin bigger flankswhich in turn reduces the impact oferrors.

5. Conclusion

When MPSbecomes routinelyusedtoanalyzeforensic DNA profiles,decisionswillneedtobemadeonhowthedatashouldbe processed.Wepresentourbioinformaticframework,MyFLq,that processes forensic MPS data prudently: without clustering, extracting maximal information withautomaticallydetermined regionsofinterest.TheresultsshowIlluminaMiSeqisreadyto analyze STR profiles. For routine implementation in forensic laboratories,acarefulselectionofloci,PCRmultiplexoptimization, andanMPS-basedSTRallelereferencedatabaseforalignment,are needed.

Supplementarymaterials

MyFLq.pyThemainframeworkprogram.

InstallMyFLqTables.pyInstallstheMySQLtablesnecessaryfor usingMyFLq.

results.cssNeededinthesamedirectoryasthesampleresult files, for opening theresult filesin a browser. Thisfile also containsdocumentationtointerprettheresultfiles.

sample1–4.xmlSampleresultfiles.

Acknowledgments

FundingwasprovidedbytheGhentUniversityMultidisciplinary Research Partnership ‘Bioinformatics: from nucleotides to net-works’.

TheauthorswouldalsoliketothankIllumina,Inc.forproviding their technical expertise in sample preparation and MiSeq sequencing.

(8)

AppendixA. Supplementarydata

Supplementarydataassociatedwiththisarticlecanbefound, in the online version, at http://dx.doi.org/10.1016/j.fsigen.2013. 10.012.

References

[1]D.L.Deforce,R.E.Millecamps,D.V.Hoofstat,E.G.V.denEeckhout,Comparisonof slabgelelectrophoresisandcapillaryelectrophoresisforthedetectionofthe fluorescentlylabeledpolymerasechainreactionproductsofshorttandemrepeat fragments,J.Chromatogr.A806(1)(1998)149–155.,http://dx.doi.org/10.1016/ S0021-9673(97)00394-4.

[2]S.L.Fordyce,M.C.Avila-Arcos,E.Rockenbauer,C.Borsting,R.Frank-Hansen,F.T. Petersen,E.Willerslev,A.J.Hansen,N.Morling,M.T.P.Gilbert,High-throughput sequencingofcoreSTRlociforforensicgeneticinvestigationsusingtheRoche GenomeSequencer FLXplatform,Biotechniques51(2)(2011)127þ,http:// dx.doi.org/10.2144/000113721.

[3]K.K.Kidd,J.R.Kidd,W.C.Speed,R.Fang,M.R.Furtado,F.C.L.Hyland,A.J.Pakstis, ExpandingdataandresourcesforforensicuseofSNPsinindividualidentification, ForensicSci.Int.Genet.6(5)(2012)646–652,doi:10.1016/j.fsigen.2012.02.012.

[4]L.J.McIver,J.W.FondonIII,M.A.Skinner,H.R.Garner,Evaluationofmicrosatellite variationinthe1000GenomesProjectpilotstudiesisindicativeofthequalityand utilityoftherawdataandalignments,Genomics97(4)(2011)193–199.,http:// dx.doi.org/10.1016/j.ygeno.2011.01.001.

[5]J.V.Planz,K.A.Sannes-Lowery,D.D.Duncan,S.Manalili,B.Budowle,R.Chakraborty, S.A.Hofstadler,T.A.Hall,AutomatedanalysisofsequencepolymorphisminSTR allelesbyPCRanddirectelectrosprayionizationmassspectrometry,ForensicSci. Int.Genet.6(5)(2012)594–606.,http://dx.doi.org/10.1016/j.fsigen.2012.02.002. [6]C.VanNeste,F.VanNieuwerburgh,D.VanHoofstat,D.Deforce,ForensicSTR analysisusingmassiveparallelsequencing,ForensicSci.Int.Genet.6(6)(2012) 810–818.,http://dx.doi.org/10.1016/j.fsigen.2012.03.004.

[7]M.Gymrek,D.Golan,S.Rosset,Y.Erlich,lobSTR:ashorttandemrepeatprofilerfor personalgenomes,GenomeRes.22(6)(2012)1154–1162.,http://dx.doi.org/ 10.1101/gr.135780.111.

[8]D.H.Warshauer,D.Lin,K.Hari,R.Jain,C.Davis,B.LaRue,J.L.King,B.Budowle, STRaitRazorAlength-basedforensicSTRallele-callingtoolforusewithsecond generationsequencingdata,ForensicSci.Int.Genet.7(7)(2013)409–417.

[9]A.Masibay,T.Mozer,C.Sprecher,PromegaCorporationrevealsprimersequences initstestingkits,J.ForensicSci.45(6)(2000)1360–1362.

[10]C.Ruitberg,D.Reeder,J.Butler,STRBase:ashorttandemrepeatDNAdatabase forthehumanidentitytestingcommunity,NucleicAcidsRes.29(1)(2001)320– 322.,http://dx.doi.org/10.1093/nar/29.1.320.

[11]P.Gill,C.Brenner,B.Brinkmann,B.Budowle,A.Carracedo,M.Jobling,P.deKnijff, M.Kayser,M.Krawczak,W.Mayr,N.Morling,B.Olaisen,V.Pascali,M.Prinz,L. Roewer, P.Schneider, A.Sajantila,C. Tyler-Smith,DNA Commission ofthe InternationalSocietyofForensicGenetics:recommendationsonforensicanalysis usingY-chromosomeSTRs,Int.J.LegalMed.114(6)(2001)305–309.,http:// dx.doi.org/10.1007/s004140100232.

[12]B.Olaisen,W.Bar,B.Brinkmann,B.Budowle,A.Carracedo,P.Gill,P.Lincoln,W. Mayr,S.Rand, DNArecommendations1997oftheInternational Societyfor ForensicGenetics,VoxSang.74(1)(1998)61–63.,http://dx.doi.org/10.1046/ j.1423-0410.1998.7410061.x.

Figure

Table 2 shows the total number of MiSeq reads for each sample, the number of reads filtered based on the 0.5% abundance threshold, and the number of error free ROI after filtering
Fig. 3. Sample 1 profile. Blue bars = theoretical abundance, green bars = detected read abundance, red bars = erroneous read abundance.
Fig. 6. Sample 4 profile. Blue bars = theoretical abundance, green bars = detected read abundance, red bars = erroneous read abundance.
Fig. 8 shows the average percentage of flanks which are ‘clean’

References

Related documents

We would expect such a model to predict that free compulsory education or vouchers are optimal for primary education because the level of temptation is comparatively high and that

Direct product of finite intuitionistic anti fuzzy normal subrings over non-associative rings..

Individuals who visit a festival will have opportunities to shop, eat, and travel at the festival site, or along the way, as they experience the activities the festival offers.

The graduate section of the Student Investment Fund, jointly managed by College of Business and Economics undergraduate Business Honors Students and MBA students, are at the end

When SVPWM pulse generator is connected to 3- phase bridge inverter with the induction motor load form an open loop drive. The reference speed command is

We will also show that (4) has a solution for all x in the basin of attraction of an exponentially stable equilibrium or periodic orbit, which implies error estimates of the

Although knowledge of BBB concepts is existent and recovery plans are produced for guidance there are complications in post-disaster environments such as: balancing the extent

Figure 3 shows the activation of six units that were chosen as to show the different types of units we identified, and how unit be- havior varies across network layers.. Figure 3(a)