Contents lists available at ScienceDirect
Theoretical Computer Science
www.elsevier.com/locate/tcs
Linear-time superbubble identification algorithm for genome assembly
Ljiljana Brankovic a,b, Costas S. Iliopoulos b, Ritu Kundu b, Manal Mohamed b, Solon P. Pissis b,*, Fatima Vayani b
a School of Electrical Engineering and Computer Science, The University of Newcastle, Newcastle NSW 2308, Australia
b Department of Informatics, King's College London, London WC2R 2LS, United Kingdom
ARTICLE INFO
Article history:
Received 25 May 2015
Received in revised form 16 September 2015
Accepted 16 October 2015
Available online 23 October 2015
Communicated by R. Giancarlo
Keywords: Genome assembly; de Bruijn graphs; Superbubble

ABSTRACT
DNA sequencing is the process of determining the exact order of the nucleotide bases of an individual's genome in order to catalogue sequence variation and understand its biological implications. Whole-genome sequencing techniques produce masses of data in the form of short sequences known as reads. Assembling these reads into a whole genome constitutes a major algorithmic challenge. Most assembly algorithms utilise de Bruijn graphs constructed from reads for this purpose. A critical step of these algorithms is to detect typical motif structures in the graph caused by sequencing errors and genome repeats, and filter them out; one such complex subgraph class is a so-called superbubble. In this paper, we propose an O(n + m)-time algorithm to detect all superbubbles in a directed acyclic graph with n vertices and m (directed) edges, improving the best-known O(m log m)-time algorithm by Sung et al.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1. Introduction
Since the publication of the first draft of the human genome [1,2], the field of genomics has changed dramatically. Recent developments in sequencing technologies (see [3], for example) have made it possible to sequence new genomes at a fraction of the time and cost required only a few years ago. With applications including sequencing the genome of a new species, an individual within a population, and RNA molecules from a particular sample, sequencing remains at the core of genomics.
Whole-genome sequencing creates masses of data, in the order of tens of gigabytes, in the form of short sequences (reads). Genome assembly involves piecing together these reads to form a set of contiguous sequences (contigs) representing the DNA sequence in the sample. Traditional assembly algorithms rely on the overlap-layout-consensus approach [4], representing each read as a vertex in an overlap graph and each detected overlap as a directed edge between the vertices corresponding to overlapping reads. These methods have proved their use through numerous de novo genome assemblies [5]. Subsequently, a fundamentally different approach based on de Bruijn graphs was proposed [6], where representation of data elements was organised around words of k nucleotides, or k-mers, instead of reads. Unlike in an overlap graph, in a de Bruijn graph [7], each (k − 1)-nucleotide-long prefix and suffix of the k-mers is represented as a vertex and each k-mer is represented as a directed edge between its prefix and suffix vertices. The marginal information contained by a k-mer is its last nucleotide. The sequence of those final nucleotides is called the sequence of the vertex. In a de Bruijn graph, the assembly problem is reduced to finding an Eulerian path, that is, a trail that visits each edge in the graph exactly once.

* Corresponding author.
E-mail addresses: [email protected] (L. Brankovic), [email protected] (C.S. Iliopoulos), [email protected] (R. Kundu), [email protected] (M. Mohamed), [email protected] (S.P. Pissis), [email protected] (F. Vayani).
http://dx.doi.org/10.1016/j.tcs.2015.10.021
However, sequencing errors and genome repeats significantly complicate the de Bruijn graph by adding false vertices and edges to it. Efficient and robust filtering methods have been proposed to simplify the graph by filtering out motifs such as tips, bubbles, and cross links, which proved to be caused by sequencing errors [8]. In particular, a bubble consists of multiple directed unipaths (where a unipath is a path in which all internal vertices are of degree 2) from a vertex v to a vertex u and is commonly caused by a small number of errors in the centre of reads. Although these types of motifs are simple and can easily be identified and filtered out, more complicated motifs prove to be more challenging.
Recently, a complex generalisation of a bubble, the so-called superbubble, was proposed as an important subgraph class for analysing assembly graphs [9]. A superbubble is defined as a minimal subgraph H in the de Bruijn graph with exactly one start vertex s and one end vertex t such that: (1) H is a directed, acyclic, single-source (s), single-sink (t) graph; (2) there is no edge from a vertex not in H going to a vertex in H \ {s}; and (3) there is no edge from a vertex in H \ {t} going to a vertex not in H. It is clear that many superbubbles are formed as a result of sequencing errors, inexact repeats, diploid/polyploid genomes, or frequent mutations. Thus, efficient detection of superbubbles is essential for the application of genome assembly [9].

Onodera et al. gave an O(nm)-time algorithm to detect superbubbles, where n is the number of vertices and m is the number of edges in the graph [9]. Very recently, Sung et al. gave an improved O(m log m)-time algorithm to solve this problem [10]. The algorithm partitions the given graph into a set of subgraphs such that the set of superbubbles in all these subgraphs is the same as the set of superbubbles in the given graph. This set consists of subgraphs corresponding to each non-singleton strongly connected component and a subgraph corresponding to the set of all the vertices involved in singleton strongly connected components. Superbubbles are then detected in each subgraph; if it is cyclic, it is first converted into a directed acyclic subgraph by means of depth-first search and by duplicating some vertices.

Our contribution. Note that the cost of partitioning the graph and transforming it into the directed acyclic subgraphs is linear with respect to the size of the graph. However, computing the superbubbles in each directed acyclic subgraph requires O(m log m) time [10], which dominates the time bound of the algorithm. In this paper, we propose a new O(n + m)-time algorithm to compute all superbubbles in a directed acyclic graph.

This paper is organised as follows: In Section 2 we define superbubbles and introduce some of their properties, and in Section 3 we outline the O(n + m)-time algorithm for computing superbubbles in a directed acyclic graph. In Section 4 we explain a method to validate a candidate superbubble in constant time. The algorithm is analysed in Section 5, while Section 6 provides some final remarks and directions for future research.
2. Properties
The concept of superbubbles was introduced and formally defined in [9] as follows.

Definition 1. (See [9].) Let G = (V, E) be a directed graph. For any ordered pair of distinct vertices s and t, ⟨s, t⟩ is called a superbubble if it satisfies the following:
• reachability: t is reachable from s;
• matching: the set of vertices reachable from s without passing through t is equal to the set of vertices from which t is reachable without passing through s;
• acyclicity: the subgraph induced by U is acyclic, where U is the set of vertices satisfying the matching criterion;
• minimality: no vertex in U other than t forms a pair with s that satisfies the conditions above.
Vertices s and t, and U \ {s, t} used in the above definition are the superbubble's entrance, exit and interior, respectively.

We note that a superbubble ⟨s, t⟩ in the above definition is equivalent to a single-source, single-sink, directed acyclic subgraph of G with source s and sink t, which does not have any cut vertices and preserves all in-degrees and out-degrees of vertices in U \ {s, t}, as well as the out-degree of s and in-degree of t. We next state a few important properties of superbubbles which enable the linear-time enumeration of superbubbles.
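Definition 1 can be checked directly, if inefficiently, by brute force. The following Python sketch is only an illustration of the four criteria, not the algorithm of this paper. The edge list EDGES is an assumption: it was chosen to be consistent with the orderings and arrays given in Figs. 2 to 4, since the drawing of Fig. 1 is not reproduced here. The helper names are hypothetical.

```python
# Assumed edge list for the graph of Fig. 1 (vertex i stands for v_i).
EDGES = {
    1: [2, 3], 2: [3], 3: [4, 5, 11], 4: [8], 5: [6, 9], 6: [7, 10],
    7: [8], 8: [13, 14], 9: [10], 10: [7], 11: [12], 12: [8],
    13: [14, 15], 14: [], 15: [14],
}

def reverse(adj):
    """Return the adjacency lists of the reversed graph."""
    radj = {v: [] for v in adj}
    for v, children in adj.items():
        for w in children:
            radj[w].append(v)
    return radj

def reach(adj, src, blocked):
    """Vertices reachable from src without passing through `blocked`
    (`blocked` itself may be reached, but is never expanded)."""
    seen, stack = {src}, [src]
    while stack:
        v = stack.pop()
        if v == blocked:
            continue
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def first_three(adj, radj, s, t):
    """Return U if (s, t) meets reachability, matching and acyclicity, else None."""
    fwd = reach(adj, s, t)
    if t not in fwd:                       # reachability
        return None
    if fwd != reach(radj, t, s):           # matching
        return None
    # Acyclicity of the subgraph induced by U, via repeated removal of sources.
    indeg = {v: 0 for v in fwd}
    for v in fwd:
        for w in adj[v]:
            if w in fwd:
                indeg[w] += 1
    queue = [v for v in fwd if indeg[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for w in adj[v]:
            if w in fwd:
                indeg[w] -= 1
                if indeg[w] == 0:
                    queue.append(w)
    return fwd if removed == len(fwd) else None

def is_superbubble(adj, s, t):
    """Check all four criteria of Definition 1 for the ordered pair (s, t)."""
    radj = reverse(adj)
    u = first_three(adj, radj, s, t)
    if u is None:
        return False
    # Minimality: no other vertex of U pairs with s under the first three criteria.
    return all(first_three(adj, radj, s, x) is None for x in u - {s, t})

if __name__ == "__main__":
    print(sorted((s, t) for s in EDGES for t in EDGES
                 if s != t and is_superbubble(EDGES, s, t)))
```

On the assumed graph this enumeration takes time polynomial in n per pair, which is what the linear-time algorithm of Section 3 avoids.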
Lemmas 1 and 2 were proved by Onodera et al. [9] and Sung et al. [10], respectively.

Lemma 1. (See [9].) Any vertex can be the entrance (respectively exit) of at most one superbubble.

Note that Lemma 1 does not exclude the possibility that a vertex is the entrance of a superbubble and the exit of another superbubble.

Lemma 2. (See [10].) Let G be a directed acyclic graph. We have the following two observations.
1) Suppose (p, c) is an edge in G, where p has one child and c has one parent, then ⟨p, c⟩ is a superbubble in G.
2) For any superbubble ⟨s, t⟩ in G, there must exist some parent p of t such that p has exactly one child t.

Fig. 1. A graph G with set of vertices V = {v1, v2, ..., v15}. Note that G has a single source v1 and a single sink v14.

Fig. 2. Vertices of Fig. 1 in topological order, where ordD[v1] = 1, ordD[v2] = 2, ordD[v3] = 3, ordD[v4] = 11, ordD[v5] = 6, ordD[v6] = 8, ordD[v7] = 10, ordD[v8] = 12, ordD[v9] = 7, ordD[v10] = 9, ordD[v11] = 4, ordD[v12] = 5, ordD[v13] = 13, ordD[v14] = 15 and ordD[v15] = 14.

In this paper we start by showing another important property of superbubbles that is closely related to Lemma 2.

Lemma 3. For any superbubble ⟨s, t⟩ in a directed acyclic graph G, there must exist some child c of s such that c has exactly one parent s.

Proof. Assume that all the children of s have more than one parent. Then, there must be some cycle or some child c which has a parent that does not belong to the superbubble ⟨s, t⟩. This is a contradiction. □
3. Finding a superbubble in a directed acyclic graph
The main contribution of this paper is an algorithm SuperBubble that reports all superbubbles in a directed acyclic graph G = (V, E) with exactly one source (the vertex with in-degree 0) and exactly one sink (the vertex with out-degree 0). If G has more than one source then a new source vertex r is added to V and an edge from r to each existing source is added to E. The same is done if G has more than one sink; in this case, a new sink vertex r′ is added to V and an edge from each existing sink to r′ is added to E. If such preprocessing is done, then among the superbubbles reported by the algorithm, only those which do not start at r and do not end at r′ represent the superbubbles in the original graph. For the sake of simplicity, for the rest of this paper and in all the propositions, lemmas and theorems that follow, we use G to denote a directed acyclic graph with exactly one source and exactly one sink, and we use n and m to denote the number of its vertices and edges respectively, that is, for G = (V, E) we have n = |V| and m = |E|.

A topological ordering ordD of G maps each vertex to an integer between 1 and n, such that ordD[x] < ordD[y] holds for all edges (x, y) ∈ E. There exists a classical linear-time algorithm for computing the topological ordering of a directed acyclic graph [11,12]. In its recursive form, the algorithm visits an unvisited vertex of the graph, finds its unvisited neighbour, say v, and performs another topological sort starting from v. The algorithm returns if the current vertex does not have unvisited neighbours. Algorithm TopologicalSort, given below, is a simplified version that takes as input a single-source, single-sink directed acyclic graph, and produces a topological ordering of vertices. For the graph G in Fig. 1, TopologicalSort produces an ordering given in Fig. 2.

TopologicalSort(G)
    order ← n
    foreach vertex v ∈ V do
        state[v] ← unvisited
    RecursiveTopologicalSort(G, source)

RecursiveTopologicalSort(G, v)
    state[v] ← visited
    foreach vertex w adjacent to v do
        if state[w] = unvisited then
            RecursiveTopologicalSort(G, w)
    ordD[v] ← order
    order ← order − 1

Proposition 1. For any topological ordering ordD of vertices in graph G, if vertex u is reachable from v, that is, if there is a path from v to u, then ordD[v] < ordD[u].

j          1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
vertex     v1   v2   v3   v11  v12  v5   v9   v6   v10  v7   v4   v8   v13  v15  v14
entrance   s1   .    s2   s3   .    s4   .    .    .    .    .    s5   s6   .    .
exit       .    .    t1   .    t2   .    .    .    t3   t4   .    t5   .    .    t6

Fig. 3. Candidates list for Fig. 1, candidates = {v1 (entrance), v3 (exit), v3 (entrance), v11 (entrance), v12 (exit), v5 (entrance), v10 (exit), v7 (exit), v8 (exit), v8 (entrance), v13 (entrance), v14 (exit)}. Note that both v3 and v8 appear twice in the list.
Proof. If the path from v to u is of length 1, i.e., there is an edge (v, u), then by the definition of topological ordering we have ordD[v] < ordD[u]. Otherwise, we denote the path from v to u of length k, k > 1, as v, x_1, ..., x_{k−1}, u. Then by the definition of topological ordering we have ordD[v] < ordD[x_1] < ··· < ordD[u]. Transitively, we have ordD[v] < ordD[u]. □
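Algorithm TopologicalSort can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the adjacency lists EDGES are an assumed reconstruction of the graph in Fig. 1 (vertex i stands for v_i), and the function name is hypothetical. Numbers are handed out from n downwards as each recursive call finishes, so the resulting ordering depends on the order in which neighbours are listed.

```python
# Assumed adjacency lists for the graph of Fig. 1 (vertex i stands for v_i).
EDGES = {
    1: [2, 3], 2: [3], 3: [4, 5, 11], 4: [8], 5: [6, 9], 6: [7, 10],
    7: [8], 8: [13, 14], 9: [10], 10: [7], 11: [12], 12: [8],
    13: [14, 15], 14: [], 15: [14],
}

def topological_sort(adj, source):
    """Return ordD as a dict: a vertex receives the current value of `order`
    once all of its unvisited neighbours have been processed."""
    ord_d = {}
    visited = set()
    order = len(adj)

    def visit(v):
        nonlocal order
        visited.add(v)
        for w in adj[v]:
            if w not in visited:
                visit(w)
        ord_d[v] = order          # v finishes: assign the largest unused number
        order -= 1

    visit(source)
    return ord_d

if __name__ == "__main__":
    ord_d = topological_sort(EDGES, 1)
    # Every edge (x, y) must satisfy ordD[x] < ordD[y].
    assert all(ord_d[x] < ord_d[y] for x in EDGES for y in EDGES[x])
    print(sorted(ord_d, key=ord_d.get))
```

With neighbours listed in increasing index order, the sketch yields the ordering stated in the caption of Fig. 2. For large graphs the recursion would have to be replaced by an explicit stack, since Python limits recursion depth.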
Importantly, in this paper we do not consider all topological orderings of graph G but only those obtained by algorithm TopologicalSort. Note that this algorithm finds a directed spanning tree T of G rooted at the source, which contains a path from the source to any vertex in G. The directed spanning tree T of G obtained by algorithm TopologicalSort is presented by bold edges in Fig. 2. It may be worth mentioning that a directed rooted tree is also known as an arborescence.

We next present another important property of the topological ordering obtained by algorithm TopologicalSort.
Proposition 2. Let ordD and T be a topological ordering and a directed rooted spanning tree of graph G obtained by algorithm TopologicalSort. If there is a path in T from a vertex v to a vertex u, then, for each vertex w such that ordD[v] < ordD[w] < ordD[u], there is a path from v to w.

Proof. Recall that T contains a path from the root to each vertex of the tree; this is also true for each subtree of T. Furthermore, if there is a path from v to u in T, then u is contained in a subtree of T rooted at v, and each w such that ordD[v] < ordD[w] < ordD[u] is also contained in the subtree rooted at v (but not in the subtree rooted at u). Therefore, there is a path from v to w, for each w such that ordD[v] < ordD[w] < ordD[u]. □
We next show that in an ordering obtained by TopologicalSort, a vertex has its topological order between the orders of the entrance and the exit of a superbubble if and only if it belongs to the superbubble.

Lemma 4. Let graph G contain a superbubble ⟨s, t⟩. Then a topological ordering obtained by TopologicalSort has the following properties.
1. For all x such that x ∈ U \ {s, t}, ordD[s] < ordD[x] < ordD[t].
2. For all y such that y ∉ U, ordD[y] < ordD[s] or ordD[y] > ordD[t].

Proof. Recall that U is the set of vertices forming a superbubble (see Definition 1).
1. Since there is a path from the start s of the superbubble to all x ∈ U \ {s}, by Proposition 1 we have ordD[s] < ordD[x] for all x such that x ∈ U \ {s}. Similarly, since there is a path from all x ∈ U \ {t} to the end t of the superbubble, by Proposition 1 we have ordD[x] < ordD[t] for all x such that x ∈ U \ {t}. Therefore, for all x such that x ∈ U \ {s, t}, ordD[s] < ordD[x] < ordD[t].
2. Suppose the opposite, that is, suppose that there exists some y ∉ U such that ordD[s] < ordD[y] < ordD[t]. Since the superbubble ⟨s, t⟩ is itself a single-source, single-sink subgraph of G, any directed spanning tree of G rooted at the source will contain a path from s to t. Then by Proposition 2 there also exists a path from s to y in T and thus also in G. However, by the definition of the superbubble, the only vertices reachable from s without going through t are the internal vertices of the superbubble, which is a contradiction. Therefore, for all y such that y ∉ U, either ordD[y] < ordD[s] or ordD[y] > ordD[t]. □
Algorithm SuperBubble starts by topologically ordering the vertices of graph G and then checks each vertex in V, in topological order, to identify whether it is an exit or an entrance candidate (or both). According to Lemmas 2 and 3, a vertex v is an exit candidate if it has at least one parent with exactly one child (out-degree 1) and an entrance candidate if it has at least one child with exactly one parent (in-degree 1). There are at most 2n candidates, thus the cost of constructing a doubly-linked list of all the candidates is linear in n. The elements of the candidates list are ordered according to ordD, and each candidate is labelled as an exit or an entrance candidate. Note that if a vertex v is both an exit and an entrance candidate, then v appears twice in the candidates list, first as an exit and then as an entrance (Fig. 3).
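The construction of the candidates list described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: a plain Python list stands in for the doubly-linked list, EDGES is an assumed reconstruction of the graph in Fig. 1 (vertex i stands for v_i), ORDER is the topological order from Fig. 2, and the function name is hypothetical.

```python
# Assumed adjacency lists for the graph of Fig. 1 (vertex i stands for v_i).
EDGES = {
    1: [2, 3], 2: [3], 3: [4, 5, 11], 4: [8], 5: [6, 9], 6: [7, 10],
    7: [8], 8: [13, 14], 9: [10], 10: [7], 11: [12], 12: [8],
    13: [14, 15], 14: [], 15: [14],
}
# Vertices listed by increasing ordD, as in Fig. 2.
ORDER = [1, 2, 3, 11, 12, 5, 9, 6, 10, 7, 4, 8, 13, 15, 14]

def candidate_list(adj, order):
    """Scan vertices in topological order and label exit/entrance candidates."""
    parents = {v: [] for v in adj}
    for v, children in adj.items():
        for w in children:
            parents[w].append(v)
    candidates = []
    for v in order:
        # Exit candidate (Lemma 2): some parent of v has exactly one child.
        if any(len(adj[p]) == 1 for p in parents[v]):
            candidates.append((v, "exit"))
        # Entrance candidate (Lemma 3): some child of v has exactly one parent.
        if any(len(parents[c]) == 1 for c in adj[v]):
            candidates.append((v, "entrance"))
    return candidates

if __name__ == "__main__":
    for vertex, label in candidate_list(EDGES, ORDER):
        print(f"v{vertex} ({label})")
```

Each edge is inspected a constant number of times, so the scan runs in O(n + m) time, and a vertex that is both kinds of candidate is appended as an exit before it is appended as an entrance.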
Algorithm SuperBubble processes the candidates list of graph G in decreasing topological order (backwards). Let v_1, v_2, ..., v_ℓ be the list of candidates. The algorithm performs the following:
• If v_j is an entrance candidate, then delete v_j;
• If v_j is an exit candidate, then subroutine ReportSuperBubble is called to find and report the superbubble ending at v_j, that is, the superbubble ⟨v_i, v_j⟩, for some entrance candidate v_i. Subroutine ReportSuperBubble also recursively finds and reports all nested superbubbles between v_i and v_j.

For clarity of presentation, we next provide a list and a short description of subroutines and arrays used by algorithm SuperBubble and subroutine ReportSuperBubble. Before that, it is worth mentioning that candidates is a doubly-linked list of entrance and exit candidates; specifically, an element of the list is a vertex along with a label specifying if it is an entrance or an exit candidate. For the sake of simplicity of the following routines, we use a vertex and its corresponding candidate (element in the candidates list) interchangeably. This does not add to the complexity of the algorithm as we can use an auxiliary array v, where v[i] stores a pointer to the corresponding element v_i in candidates so as to provide a constant-time conversion from a vertex to the corresponding candidate.
1. Entrance(v) takes as input a vertex v and outputs TRUE if v is an entrance candidate, that is, if it satisfies Lemma 3, and FALSE otherwise.
2. Exit(v) takes as input a vertex v and outputs TRUE if v is an exit candidate, that is, if it satisfies Lemma 2, and FALSE otherwise.
3. InsertEntrance(v) takes as input a vertex v, inserts it at the end of candidates and labels it as entrance.
4. InsertExit(v) takes as input a vertex v, inserts it at the end of candidates and labels it as exit.
5. Head(candidates) and Tail(candidates) return the first and the last element in candidates, respectively.
6. DeleteTail(candidates) deletes the last element in candidates.
7. Next(v) returns the candidate following v in candidates.
In addition to the above subroutines, the main algorithm also explicitly makes use of the following arrays.
1. The array ordD stores the topological order of the vertices.
2. The array previousEntrance stores the previous entrance candidate for each vertex v. Formally, previousEntrance[v] = s where s is an entrance candidate such that ordD[s] < ordD[v], and there does not exist another entrance candidate s′ such that ordD[s] < ordD[s′] < ordD[v].
3. The array alternativeEntrance is used to reduce the number of entrance-exit pairs that need to be considered as possible superbubbles. Array alternativeEntrance is further detailed in Section 4.1.

Note that subroutine ReportSuperBubble is called for each exit candidate in decreasing order either by algorithm SuperBubble or through a recursive call to identify a nested superbubble. A call to subroutine ReportSuperBubble(start, exit) checks the possible entrance candidates between start and exit, starting with the nearest previous entrance candidate (to exit). This task is accomplished with the help of subroutine ValidateSuperBubble, explained in the following section, which checks whether a given candidate ⟨s, t⟩ is a superbubble or not; if it is not, the subroutine returns either a "−1", which means that no superbubble ends at t, or an alternative entrance candidate for a superbubble that could end at t.

For the graph in Fig. 1, the algorithm detects and reports five superbubbles: ⟨v8, v14⟩, ⟨v3, v8⟩, ⟨v5, v7⟩, ⟨v11, v12⟩ and ⟨v1, v3⟩. Here, both ⟨v5, v7⟩ and ⟨v11, v12⟩ are nested superbubbles.

SuperBubble(G)
    TopologicalSort(G)
    prevEnt ← NULL
    foreach vertex v in topological order do
        alternativeEntrance[v] ← NULL
        previousEntrance[v] ← prevEnt
        if Exit(v) then
            InsertExit(v)
        if Entrance(v) then
            InsertEntrance(v)
            prevEnt ← v
    while candidates is not empty do
        if Entrance(Tail(candidates)) then
            DeleteTail(candidates)
        else ReportSuperBubble(Head(candidates), Tail(candidates))

ReportSuperBubble(start, exit)
 1  ▷ Report the superbubble ending at exit (if any)
 2  if (start = NULL) | (exit = NULL) | (ordD[start] ≥ ordD[exit]) then
 3      DeleteTail(candidates)
 4      return
 5  s ← previousEntrance[exit]
 6  while (ordD[s] ≥ ordD[start]) do
 7      valid ← ValidateSuperBubble(s, exit)
 8      if (valid = s) | (valid = alternativeEntrance[s]) | (valid = −1) then
 9          break
10      alternativeEntrance[s] ← valid
11      s ← valid
12  DeleteTail(candidates)
13  if (valid = s) then
14      Report ⟨s, exit⟩
15      while (Tail(candidates) is not s) do
16          if Exit(Tail(candidates)) then
17              ▷ Check for nested superbubbles
18              ReportSuperBubble(Next(s), Tail(candidates))
19          else DeleteTail(candidates)
20  return
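The two routines above can be transliterated into Python as a compact sketch. It is hedged in several ways. The edge list EDGES is an assumed reconstruction of the graph in Fig. 1 (vertex i stands for v_i). A Python list replaces the doubly-linked list. Naive scans over array slices stand in for the constant-time range queries of Section 4, so this sketch does not run in linear time. The checks against None and the use of list positions for the test on line 15 are defensive additions that are not part of the pseudocode. All function and variable names are hypothetical.

```python
EDGES = {  # assumed adjacency lists for Fig. 1
    1: [2, 3], 2: [3], 3: [4, 5, 11], 4: [8], 5: [6, 9], 6: [7, 10],
    7: [8], 8: [13, 14], 9: [10], 10: [7], 11: [12], 12: [8],
    13: [14, 15], 14: [], 15: [14],
}

def superbubbles(adj, source):
    """Report superbubbles of a single-source, single-sink DAG, in the order
    in which the pseudocode would report them."""
    n = len(adj)
    parents = {v: [] for v in adj}
    for v in adj:
        for w in adj[v]:
            parents[w].append(v)

    ord_d, counter = {}, [n]
    def visit(v):                       # TopologicalSort
        ord_d[v] = 0                    # placeholder marks v as visited
        for w in adj[v]:
            if w not in ord_d:
                visit(w)
        ord_d[v] = counter[0]
        counter[0] -= 1
    visit(source)
    vertex = {j: v for v, j in ord_d.items()}

    def is_entrance(v):                 # Lemma 3
        return any(len(parents[c]) == 1 for c in adj[v])
    def is_exit(v):                     # Lemma 2
        return any(len(adj[p]) == 1 for p in parents[v])

    out_parent = [n + 1] * (n + 2)      # indexed by topological order
    out_child = [0] * (n + 2)
    for v in adj:
        if parents[v]:
            out_parent[ord_d[v]] = min(ord_d[p] for p in parents[v])
        if adj[v]:
            out_child[ord_d[v]] = max(ord_d[c] for c in adj[v])

    candidates, entrance_pos = [], {}
    previous_entrance, alternative_entrance = {}, {}
    prev_ent = None
    for j in range(1, n + 1):
        v = vertex[j]
        alternative_entrance[v] = None
        previous_entrance[v] = prev_ent
        if is_exit(v):
            candidates.append((v, "exit"))
        if is_entrance(v):
            entrance_pos[v] = len(candidates)
            candidates.append((v, "entrance"))
            prev_ent = v

    def validate(s, t):                 # ValidateSuperBubble with naive scans
        start, end = ord_d[s], ord_d[t]
        if max(out_child[start:end]) != end:
            return -1
        outparent = min(out_parent[start + 1:end + 1])
        if outparent == start:
            return s
        u = vertex[outparent]
        return u if is_entrance(u) else previous_entrance[u]

    found = []
    def report(start, exit_):           # ReportSuperBubble
        if start is None or exit_ is None or ord_d[start] >= ord_d[exit_]:
            candidates.pop()
            return
        s, valid = previous_entrance[exit_], None
        while s is not None and ord_d[s] >= ord_d[start]:
            valid = validate(s, exit_)
            if valid == s or valid == alternative_entrance[s] or valid == -1:
                break
            alternative_entrance[s] = valid
            s = valid
        candidates.pop()
        if s is not None and valid == s:
            found.append((s, exit_))
            while len(candidates) - 1 > entrance_pos[s]:
                tail_vertex, label = candidates[-1]
                if label == "exit":     # check for nested superbubbles
                    report(candidates[entrance_pos[s] + 1][0], tail_vertex)
                else:
                    candidates.pop()

    while candidates:
        tail_vertex, label = candidates[-1]
        if label == "entrance":
            candidates.pop()
        else:
            report(candidates[0][0], tail_vertex)
    return found

if __name__ == "__main__":
    print(superbubbles(EDGES, 1))
```

The sketch has only been traced by hand on the assumed example graph, where it reports the five superbubbles listed above in the same order; it should be read as an aid to following the pseudocode rather than as a verified implementation.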
Remark 1. It is also possible to design the algorithm so as to move forward in topological order instead of backwards.

For graph G in Fig. 1, algorithm SuperBubble(G) makes exactly three calls to subroutine ReportSuperBubble:
1. ReportSuperBubble(v1, v14): First, it checks the exit candidate v14 against the nearest previous entrance candidate, i.e. vertex v13. Subroutine ValidateSuperBubble(v13, v14) returns v8 as an alternative entrance candidate. The new candidate is then checked and the superbubble ⟨v8, v14⟩ is reported.
2. ReportSuperBubble(v1, v8): First, it checks the exit candidate v8 against the nearest previous entrance candidate, i.e. vertex v5. Subroutine ValidateSuperBubble(v5, v8) returns v3 as an alternative entrance candidate. The new candidate is then checked and the superbubble ⟨v3, v8⟩ is reported. Additionally, two recursive calls are made:
   (a) ReportSuperBubble(v11, v7): First, it validates ⟨v5, v7⟩ and reports it. Then, it makes a recursive call to subroutine ReportSuperBubble(v10, v10) which terminates without reporting any superbubble.
   (b) ReportSuperBubble(v11, v12): validates ⟨v11, v12⟩ and reports it.
3. ReportSuperBubble(v1, v3): validates ⟨v1, v3⟩ and reports it.

4. Validating a superbubble

In this section, we describe subroutine ValidateSuperBubble. The ability to validate a candidate superbubble depends on the following result related to the Range Minimum Query problem.
The Range Minimum Query problem, RMQ for short, is to preprocess a given array A[1..n] for subsequent queries of the form: "Given indices i, j, what is the minimum value of A[i..j]?". The problem has been studied intensively for decades and several ⟨O(n), O(1)⟩-RMQ data structures have been proposed, many of which depend on the equivalence between the Range Minimum Query and the Lowest Common Ancestor problems [13–15].

In order to check whether a superbubble candidate ⟨s, t⟩ is a superbubble or not, we propose to utilise the range min/max query problem as follows:
• For a given graph G = (V, E) and for each vertex v ∈ V with topological order ordD[v], calculate the topological orderings of the parent and the child of v that are topologically furthest from v:

    OutParent[ordD[v]] = min({ordD[u_i] | (u_i, v) ∈ E}),
    OutChild[ordD[v]] = max({ordD[u_i] | (v, u_i) ∈ E}).

• For an integer array A and indices i and j we define RangeMin(A, i, j) and RangeMax(A, i, j) to return the minimum and maximum value of A[i..j], respectively.

Then for a given superbubble candidate ⟨s, t⟩, where s and t are an entrance and an exit candidate respectively (satisfying Lemmas 3 and 2, respectively), if ⟨s, t⟩ is a superbubble then the following two conditions are valid:

    RangeMin(OutParent, ordD[s] + 1, ordD[t]) = ordD[s],
    RangeMax(OutChild, ordD[s], ordD[t] − 1) = ordD[t].
j             1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
vertex        v1   v2   v3   v11  v12  v5   v9   v6   v10  v7   v4   v8   v13  v15  v14
OutParent[j]  –    1    1    3    4    3    6    6    7    8    3    5    12   13   12
OutChild[j]   3    3    11   5    12   8    9    10   10   12   12   15   15   15   –

Fig. 4. OutParent and OutChild arrays for the graph in Fig. 1.
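The arrays of Fig. 4 can be computed with a single pass over the edges. The following Python sketch illustrates this under stated assumptions: EDGES is an assumed reconstruction of the graph in Fig. 1 (vertex i stands for v_i), ORDER is the topological order from Fig. 2, None plays the role of the "–" entries, and the function name is hypothetical.

```python
# Assumed adjacency lists for the graph of Fig. 1 (vertex i stands for v_i).
EDGES = {
    1: [2, 3], 2: [3], 3: [4, 5, 11], 4: [8], 5: [6, 9], 6: [7, 10],
    7: [8], 8: [13, 14], 9: [10], 10: [7], 11: [12], 12: [8],
    13: [14, 15], 14: [], 15: [14],
}
# Vertices listed by increasing ordD, as in Fig. 2.
ORDER = [1, 2, 3, 11, 12, 5, 9, 6, 10, 7, 4, 8, 13, 15, 14]

def out_arrays(adj, order):
    """Return (OutParent, OutChild), both 1-indexed by topological order.
    Index 0 is unused; None marks a vertex without parents or children."""
    ord_d = {v: j for j, v in enumerate(order, start=1)}
    n = len(order)
    out_parent = [None] * (n + 1)
    out_child = [None] * (n + 1)
    for v, children in adj.items():
        for w in children:
            jv, jw = ord_d[v], ord_d[w]
            # w is a child of v: keep the topologically furthest child of v ...
            if out_child[jv] is None or jw > out_child[jv]:
                out_child[jv] = jw
            # ... and the topologically furthest parent of w.
            if out_parent[jw] is None or jv < out_parent[jw]:
                out_parent[jw] = jv
    return out_parent, out_child

if __name__ == "__main__":
    out_parent, out_child = out_arrays(EDGES, ORDER)
    print("OutParent:", out_parent[1:])
    print("OutChild: ", out_child[1:])
```

Each edge is touched once, so the arrays are filled in O(n + m) time, which matches the preprocessing budget of the algorithm.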
For example, Fig. 4 represents both OutParent and OutChild arrays computed for graph G in Fig. 1. Furthermore, a candidate ⟨v5, v8⟩ is not a superbubble as RangeMin(OutParent, ordD[v5] + 1, ordD[v8]) = 3 ≠ 6 = ordD[v5].

It should be clear that after an O(n + m)-time preprocessing, validating a superbubble requires O(1) time, which is the cost for a range max/min query. Subroutine ValidateSuperBubble(startVertex, endVertex) is designed to return an appropriate entrance candidate for a superbubble ending at endVertex (if any), as follows.

ValidateSuperBubble(startVertex, endVertex)
    start ← ordD[startVertex]
    end ← ordD[endVertex]
    outchild ← RangeMax(OutChild, start, end − 1)
    outparent ← RangeMin(OutParent, start + 1, end)
    if outchild ≠ end then
        return −1
    if outparent = start then
        return startVertex
    else if Entrance(Vertex(outparent)) then
        return Vertex(outparent)
    else return previousEntrance[Vertex(outparent)]

Note that subroutine ValidateSuperBubble utilises subroutine Entrance and the array previousEntrance defined in Section 3, as well as subroutine Vertex that takes as input an integer i and outputs the vertex v such that ordD[v] = i.

An important observation is that a subsequent call to subroutine ValidateSuperBubble, for a given entrance candidate, returns alternative entrance candidates in non-decreasing topological order, as proved by Lemma 5.
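Subroutine ValidateSuperBubble can be sketched in Python using the arrays of Fig. 4 and the entrance candidates of Fig. 3. One substitution should be noted plainly: the sketch answers range queries with a sparse table, which needs O(n log n) preprocessing rather than the O(n) of the structures cited in [13–15]; it is used here only because it is short and still gives O(1) queries. The class and function names, and the sentinel values replacing the "–" entries, are assumptions of this sketch.

```python
class SparseTable:
    """O(n log n)-space table answering idempotent range queries (min or max)
    over a fixed list in O(1) time."""
    def __init__(self, values, pick):
        self.pick = pick
        self.rows = [list(values)]
        width = 1
        while 2 * width <= len(values):
            prev = self.rows[-1]
            # Entry i of the new row covers values[i .. i + 2*width - 1].
            self.rows.append([pick(prev[i], prev[i + width])
                              for i in range(len(prev) - width)])
            width *= 2

    def query(self, lo, hi):
        """Return pick over values[lo..hi], inclusive; requires lo <= hi."""
        k = (hi - lo + 1).bit_length() - 1
        row = self.rows[k]
        return self.pick(row[lo], row[hi - (1 << k) + 1])

INF = 10 ** 9
# Index 0 is padding; "–" entries of Fig. 4 become sentinels (INF and 0).
OUT_PARENT = [INF, INF, 1, 1, 3, 4, 3, 6, 6, 7, 8, 3, 5, 12, 13, 12]
OUT_CHILD = [0, 3, 3, 11, 5, 12, 8, 9, 10, 10, 12, 12, 15, 15, 15, 0]
ORDER = [1, 2, 3, 11, 12, 5, 9, 6, 10, 7, 4, 8, 13, 15, 14]   # Fig. 2
ENTRANCES = {1, 3, 11, 5, 8, 13}                              # Fig. 3

ORD = {v: j for j, v in enumerate(ORDER, start=1)}
VERTEX = {j: v for v, j in ORD.items()}
PREVIOUS_ENTRANCE, _last = {}, None
for _v in ORDER:                      # nearest entrance candidate strictly before v
    PREVIOUS_ENTRANCE[_v] = _last
    if _v in ENTRANCES:
        _last = _v

RANGE_MIN = SparseTable(OUT_PARENT, min)
RANGE_MAX = SparseTable(OUT_CHILD, max)

def validate_superbubble(start_vertex, end_vertex):
    """Return start_vertex if the pair passes both range conditions, -1 if no
    superbubble can end at end_vertex, or an alternative entrance candidate."""
    start, end = ORD[start_vertex], ORD[end_vertex]
    outchild = RANGE_MAX.query(start, end - 1)
    outparent = RANGE_MIN.query(start + 1, end)
    if outchild != end:
        return -1
    if outparent == start:
        return start_vertex
    u = VERTEX[outparent]
    return u if u in ENTRANCES else PREVIOUS_ENTRANCE[u]

if __name__ == "__main__":
    print(validate_superbubble(13, 14), validate_superbubble(5, 8))
```

On this data the two calls mirror the examples discussed in the text: the candidate ⟨v13, v14⟩ is redirected to v8 and ⟨v5, v8⟩ to v3.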
Lemma 5. Let t be the alternative entrance candidate returned by subroutine ValidateSuperBubble(s, e). Then for any exit candidate e′ such that ordD[s] < ordD[e′] < ordD[e], the order of the alternative entrance candidate t′ returned by subroutine ValidateSuperBubble(s, e′) will be greater than or equal to the order of t.

Proof. Recall that the alternative entrance t returned by the subroutine ValidateSuperBubble(s, e) is either a vertex with topological order outparent, or the previousEntrance of this vertex. Since outparent = RangeMin(OutParent, ordD[s] + 1, ordD[e]), outparent′ = RangeMin(OutParent, ordD[s] + 1, ordD[e′]) and ordD[s] < ordD[e′] < ordD[e], the range queried for e′ is contained in the range queried for e, and so we have outparent ≤ outparent′. Therefore, ordD[t] ≤ ordD[t′]. □
4.1. Validation and alternativeEntrance

In case the validation of the candidate pair (t0, e) fails, subroutine ValidateSuperBubble(t0, e) returns either "−1" or an alternative candidate t1 which might be an entrance of a superbubble ending at e. This alternative candidate t1 is either a vertex u1, if u1 is an entrance candidate, or the previous entrance candidate of u1, such that

    ordD[u1] = OutParent[ordD[v1]] = RangeMin(OutParent, ordD[t0] + 1, ordD[e]),

where v1 is some vertex between t0 and e in the topological ordering.

Suppose t1 is also not a valid entrance of the superbubble ending at e. Then there must be a vertex v2, ordD[t1] < ordD[v2] < ordD[t0], with some parent u2, such that ordD[u2] = OutParent[ordD[v2]]. Then the alternative entrance is some t2, which is either a vertex u2 or its previous entrance, and thus ordD[t2] < ordD[t1]. A series of such failed validations produces a sequence t1, t2, ... of failed alternative entrance candidates.

An important observation here is that any entrance t_i, for i ≥ 1, from such a sequence is an invalid entrance not only for the superbubble ending at e but also for all those ending at any other exit vertex e′ such that ordD[t_{i−1}] < ordD[e′] < ordD[e] and t_i = ValidateSuperBubble(t_{i−1}, e′). This is the case because the vertex v_i which causes the alternative entrance t_i to fail is such that ordD[t_i] < ordD[v_i] < ordD[t_{i−1}] for i ≥ 1. Therefore, v_i does not depend on the exit e but rather on the previous failed candidate entrance. This is where array alternativeEntrance plays an important role. Storing alternativeEntrance[t_{i−1}] = t_i for i ≥ 1 enables us to skip this sequence at a later stage if t_i is returned by subroutine ValidateSuperBubble(t_{i−1}, e′).

5. Algorithm analysis
In this section, we analyse the correctness and the running time of the proposed algorithm SuperBubble. For simplicity, in the following lemma we will slightly abuse the terminology and refer to ⟨s, t⟩ as a superbubble if it satisfies the first three conditions given in Definition 1, and as a minimal superbubble if it also satisfies the last condition in the same definition.

Lemma 6. For a given exit candidate e, let s be the entrance candidate such that superbubble ⟨s, e⟩ is reported by subroutine ValidateSuperBubble(s, e). Then ⟨s, e⟩ is a minimal superbubble.

Proof. By contradiction, let e′ be an exit candidate such that ⟨s, e′⟩ is also a superbubble and ordD[s] < ordD[e′] < ordD[e]. Then, either ordD[e] = ordD[e′] + 1 or there is at least one vertex v such that ordD[e′] < ordD[v] < ordD[e].

In the first case, ordD[e] = ordD[e′] + 1 implies that e is the only child of e′ and e′ is the only parent of e, which, by Lemma 2, makes ⟨e′, e⟩ a superbubble.

In the second case, where there is at least one vertex v such that ordD[e′] < ordD[v] < ordD[e], we also argue that ⟨e′, e⟩ must be a superbubble. Indeed, ⟨e′, e⟩ satisfies the following conditions:
1. Reachability: Since ⟨s, e⟩ is a superbubble, e is reachable from s; since ⟨s, e′⟩ is also assumed to be a superbubble, any path from s to e must go through e′, therefore e is reachable from e′.
2. Matching: The only vertices reachable from e′ without going through e are those whose topological order is between ordD[e′] and ordD[e]. Indeed, since ⟨s, e′⟩ and ⟨s, e⟩ are superbubbles, all these vertices are reachable from s through e′, and no vertices with topological order greater than ordD[e] are reachable from e′ without going through e. Similarly, there are no edges between vertices with topological order less than ordD[e′] and those with topological order between ordD[e′] and ordD[e]. Therefore, the only vertices from which e is reachable without going through e′ are those whose topological order is between ordD[e′] and ordD[e].
3. Acyclicity: Since ⟨s, e⟩ is a superbubble it is acyclic; since ⟨e′, e⟩ is a subgraph of ⟨s, e⟩, it is also acyclic.

In both cases, since for each exit candidate the entrance candidates are checked in reverse topological order, subroutine ValidateSuperBubble would have been called on ⟨e′, e⟩ first, and would have reported ⟨e′, e⟩ instead of ⟨s, e⟩. Therefore, ⟨s, e⟩ is a minimal superbubble. □
Lemma 7. For the given entrance and exit candidates s and t, respectively, subroutine ValidateSuperBubble reports ⟨s, t⟩ if and only if ⟨s, t⟩ is a superbubble.

Proof. We prove the lemma by showing that if ⟨s, t⟩ is a superbubble then subroutine ValidateSuperBubble reports it, and if ValidateSuperBubble reports ⟨s, t⟩ then ⟨s, t⟩ is a superbubble.
1. We start by showing that if ⟨s, t⟩ is a superbubble then subroutine ValidateSuperBubble reports it. Indeed, by Lemma 4, all the vertices with topological orderings between s and t belong to the superbubble ⟨s, t⟩. Therefore, the minimum OutParent is s and the maximum OutChild is t, and thus subroutine ValidateSuperBubble reports ⟨s, t⟩.
2. We next show that if subroutine ValidateSuperBubble reports ⟨s, t⟩ then ⟨s, t⟩ is a superbubble. Let start and end be two integers, such that ordD[s] = start and ordD[t] = end. The graph G, as defined, has a single source r and a single sink r′; this implies that any vertex v ∈ V is reachable from r and, at the same time, can reach r′. This is also true for s, t and for any vertex v such that ordD[s] < ordD[v] < ordD[t].
First, we show that t is reachable from s. Recall that t is an exit candidate, so it has a parent p with out-degree 1. Assume that t is not reachable from s; then there must be a path r ⇝ t which does not involve s. This implies that either OutParent[end] < start, or there exists a vertex v such that start < ordD[v] < end, OutParent[ordD[v]] < start and there exists a path r ⇝ v ⇝ t, which is a contradiction.
Similarly, we can show that every vertex v such that start < ordD[v] < end satisfies the matching criterion of the superbubble.
The acyclicity criterion is guaranteed by the acyclicity of G, and the minimality is satisfied by the design of subroutine ReportSuperBubble, which assigns each exit of a superbubble to the nearest entrance, and by the correctness of Lemma 6. □
Lemma 8. For a given exit candidate e, let t be the alternative entrance candidate returned by subroutine ValidateSuperBubble(s, e). Then any entrance candidate between t and e cannot be a valid entrance for the superbubble ending at e.

Proof. By contradiction, assume that s′ is an entrance candidate between t and e such that ⟨s′, e⟩ is a superbubble. If s′ had been between s and e, it would have already been reported, as SuperBubble checks entrance candidates in reverse topological order starting from e. Therefore, s′ is between t and s, such that ordD[t] < ordD[s′] < ordD[s] < ordD[e].

Let outparent = RangeMin(OutParent, ordD[s] + 1, ordD[e]). Then, the vertex at outparent is between t and s′, otherwise subroutine ValidateSuperBubble(s, e) would have returned s′ (instead of t). Therefore, ordD[t] ≤ outparent < ordD[s′].

Let outparent′ = RangeMin(OutParent, ordD[s′] + 1, ordD[e]). Then outparent′ ≤ outparent. This implies that outparent′ ≤ outparent < ordD[s′]. However, for ⟨s′, e⟩ to be a valid superbubble, outparent′ should have been equal to ordD[s′]. Hence, the assumption is wrong and thus it is proved that there cannot be an entrance candidate, between t and e, which is a valid entrance for the superbubble ending at e. □
Lemma 9. For the given entrance and exit candidates s and e1, respectively, let alternativeEntrance[s] be set to t1, which later gets reset to t2 such that t2 ≠ t1, while considering s with another exit candidate e2. Then no exit candidate between s and e2 can reset alternativeEntrance[s] to t1 again.

Proof. Let e3 be an exit candidate between s and e2 such that subroutine ValidateSuperBubble(s, e3) returns t3. Then by Lemma 5, ordD[t1] ≤ ordD[t2] ≤ ordD[t3]. Since t1 ≠ t2, we have ordD[t1] < ordD[t2] ≤ ordD[t3]. Therefore, ordD[t1] < ordD[t3], so t3 ≠ t1 and alternativeEntrance[s] cannot be reset to the same value t1 again. □
Theorem 1. Algorithm SuperBubble reports all superbubbles, and only superbubbles, in graph G in decreasing topological order of their exit vertices in O(n + m) time.

Proof. Consider an execution of algorithm SuperBubble. Let superbubbles ⟨s1, t1⟩, ···, ⟨sk, tk⟩ be the successive superbubbles reported just after the execution of Line 14 of subroutine ReportSuperBubble, where ordD[t1] > ordD[t2] > ··· > ordD[tk].

1. First, we show that each ⟨si, ti⟩ reported by the algorithm in Line 14 is a superbubble. This follows from the correctness of Lemma 7.

2. Second, no superbubble is missed by the algorithm, as the following shows. Subroutine ReportSuperBubble is called for each exit candidate in decreasing order. The entrance candidate for the superbubble (if any) ending at exit can only lie between start and exit, where start is either the head of the candidates list (when subroutine ReportSuperBubble is called from algorithm SuperBubble) or the next candidate of the entrance of an outer superbubble (when called through a recursive call to identify a nested superbubble). A call to subroutine ReportSuperBubble(start, exit) checks the possible entrance candidates between start and exit, starting with the nearest previous entrance candidate (to exit). Subroutine ValidateSuperBubble either successfully validates an entrance candidate, or returns a "−1", or returns an alternative entrance candidate. From Lemma 8, there cannot be any valid entrance between this alternative entrance and exit. If this alternative entrance starts a sequence of entrances already checked for some exit candidate previously (as depicted by alternativeEntrance), then all entrances of that sequence will be skipped; otherwise this alternative entrance will be tested. However, as mentioned in Section 4.1, none of the entrance candidates in the skipped sequence can be valid. Therefore, for each exit candidate, every potential entrance candidate is checked for validity, and those which are not considered are not valid.

3. Third, the running time of SuperBubble is O(n + m). Indeed, the running time of the TopologicalSort and computing the candidates list is O(n + m). Furthermore, all list operations cost constant time each, and sum up to a linear cost of O(n), as there are at most 2n candidates in the list. Finally, each call to subroutine ValidateSuperBubble costs O(1). The total number of times ValidateSuperBubble is called is O(n + m). This is because subroutine ValidateSuperBubble is called once for each exit candidate in Line 7 of subroutine ReportSuperBubble, and the total number of such calls is bounded by O(n). Additionally, it is called every time a new alternativeEntrance sequence is generated by subroutine ValidateSuperBubble. It follows from Lemma 9 that once an alternativeEntrance sequence is reset, it cannot be generated again by subsequent calls to subroutine ValidateSuperBubble. This resetting of alternativeEntrance for each entrance candidate (Line 10) thus avoids repeated checks of the same sequences of entrance candidates. Resetting is done every time an edge is considered for the first time between a vertex (in between an entrance candidate startVertex and an exit candidate endVertex) and its topologically furthest parent (whose order is less than that of startVertex). Thus, the total number of times alternativeEntrance will be reset (for all the entrance candidates) is bounded by O(m).

Therefore, the total running time for reporting all superbubbles in graph G is O(n + m). □
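The O(n + m) preprocessing bound used in the third part of the proof can be illustrated with a minimal sketch. It uses Kahn's algorithm, a standard linear-time method that is not necessarily the TopologicalSort routine of the algorithm above. The function name topological_order, the 0-based vertex numbering and the edge-list input are assumptions made only for this illustration.

```python
from collections import deque


def topological_order(n, edges):
    """Return a list `order` with order[v] = position of vertex v.

    Illustrative only: Kahn's algorithm, assumed here as a stand-in for
    the paper's TopologicalSort. Each vertex is enqueued once and each
    edge is inspected once, so the running time is O(n + m).
    """
    children = [[] for _ in range(n)]
    indegree = [0] * n
    for u, v in edges:  # O(m): build adjacency lists and in-degrees
        children[u].append(v)
        indegree[v] += 1

    queue = deque(v for v in range(n) if indegree[v] == 0)
    order = [-1] * n
    position = 0
    while queue:  # O(n + m) overall
        u = queue.popleft()
        order[u] = position
        position += 1
        for v in children[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    if position != n:
        raise ValueError("graph contains a cycle; a DAG is required")
    return order
```

For every edge (u, v) of a directed acyclic graph, the returned list satisfies order[u] < order[v]. Vertices can then be scanned in decreasing order, which matches the order in which exit vertices are reported in Theorem 1.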
6. Final remarks
We presented an O(n + m)-time algorithm to compute all superbubbles in a directed acyclic graph, where n is the number of vertices and m is the number of edges, improving the best-known O(m log m)-time algorithm for this problem [10]. It is also interesting to note that in this type of graph, that is, constructed from sequences over a fixed-sized alphabet, the out-degree of each vertex is bounded by the size of the alphabet (four for the DNA alphabet); therefore, the time complexity of the proposed algorithm is essentially linear in n. Our immediate goal is to practically evaluate our algorithm and compare its implementation to an earlier result [9]. It would also be interesting to investigate other superbubble-like structures in assembly graphs, such as complex bulges [16].
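The out-degree remark can be checked on a small example. The sketch below builds a de Bruijn graph whose vertices are (k − 1)-mers and whose edges come from the k-mers of the reads. This particular construction, the function name de_bruijn_graph and the sample reads are assumptions made for illustration and are not taken from the text above.

```python
def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Illustrative construction (an assumption, not the paper's code):
    every k-mer of a read contributes one edge prefix -> suffix.
    """
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            graph.setdefault(prefix, set()).add(suffix)
            graph.setdefault(suffix, set())  # make sure sinks are vertices too
    return graph


def vertex_and_edge_counts(graph):
    """Return (n, m) for a graph given as vertex -> set of successors."""
    n = len(graph)
    m = sum(len(successors) for successors in graph.values())
    return n, m


reads = ["ACGTACGA", "CGTTCGAA"]  # hypothetical reads over {A, C, G, T}
graph = de_bruijn_graph(reads, k=3)
n, m = vertex_and_edge_counts(graph)
# A successor of vertex p has the form p[1:] + c for one of 4 letters c,
# so each out-degree is at most 4 and hence m <= 4 * n.
print(n, m, m <= 4 * n)
```

Since m ≤ 4n in such a graph, the bound O(n + m) becomes O(n), which is what "essentially linear in n" refers to.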
References
[1] E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, et al., Initial sequencing and analysis of the human genome, Nature 409 (6822) (2001) 860–921.
[2] J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, G.G. Sutton, H.O. Smith, M. Yandell, C.A. Evans, R.A. Holt, et al., The sequence of the human genome, Science 291 (5507) (2001) 1304–1351.
[3] S. Balasubramanian, D. Klenerman, C. Barnes, M. Osborne, Patent US20077232656, 2007.
[4] S. Batzoglou, Algorithmic challenges in mammalian genome sequence assembly, in: Encyclopedia of Genomics, Proteomics and Bioinformatics, John Wiley and Sons, Hoboken, New Jersey, 2005.
[5] J. Butler, I. MacCallum, M. Kleber, I.A. Shlyakhter, M.K. Belmonte, E.S. Lander, C. Nusbaum, D.B. Jaffe, ALLPATHS: de novo assembly of whole-genome shotgun microreads, Genome Res. 18 (5) (2008) 810–820.
[6] P.A. Pevzner, H. Tang, M.S. Waterman, An Eulerian path approach to DNA fragment assembly, Proc. Natl. Acad. Sci. USA 98 (17) (2001) 9748–9753.
[7] N.G. de Bruijn, A combinatorial problem, K. Ned. Akad. Wet. 49 (1946) 758–764.
[8] D.R. Zerbino, E. Birney, Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res. 18 (5) (2008) 821–829.
[9] T. Onodera, K. Sadakane, T. Shibuya, Detecting superbubbles in assembly graphs, in: WABI, 2013, pp. 338–348.
[10] W. Sung, K. Sadakane, T. Shibuya, A. Belorkar, I. Pyrogova, An O(m log m)-time algorithm for detecting superbubbles, IEEE/ACM Trans. Comput. Biology Bioinform. 12 (4) (2015) 770–777.
[11] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, MIT Press, Cambridge, MA, 2001.
[12] R. Tarjan, Edge-disjoint spanning trees and depth-first search, Acta Inform. 6 (2) (1976) 171–185.
[13] D. Harel, R. Tarjan, Fast algorithms for finding nearest common ancestors, SIAM J. Comput. 13 (2) (1984) 338–355.
[14] J. Fischer, V. Heun, Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE, in: M. Lewenstein, G. Valiente (Eds.), Combinatorial Pattern Matching, Proceedings of the 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5–7, 2006, in: Lecture Notes in Computer Science, vol. 4009, Springer, 2006, pp. 36–48.
[15] S. Durocher, A simple linear-space data structure for constant-time range minimum query, in: A. Brodnik, A. López-Ortiz, V. Raman, A. Viola (Eds.), Space-Efficient Data Structures, Streams, and Algorithms, in: Lecture Notes in Computer Science, vol. 8066, Springer Berlin Heidelberg, 2013, pp. 48–60.
[16] S. Nurk, A. Bankevich, D. Antipov, A.A. Gurevich, A. Korobeynikov, A. Lapidus, A.D. Prjibelski, A. Pyshkin, A. Sirotkin, Y. Sirotkin, R. Stepanauskas, S.R. Clingenpeel, T. Woyke, J.S. McLean, R. Lasken, G. Tesler, M.A. Alekseyev, P.A. Pevzner, Assembling single-cell genomes and mini-metagenomes from chimeric MDA products, J. Comput. Biol. 20 (10) (2013) 714–737.