• No results found

Linear-time superbubble identification algorithm for genome assembly

N/A
N/A
Protected

Academic year: 2021

Share "Linear-time superbubble identification algorithm for genome assembly"

Copied!
10
0
0

Loading.... (view fulltext now)

Full text

(1)

Contents lists available atScienceDirect

Theoretical

Computer

Science

www.elsevier.com/locate/tcs

Linear-time

superbubble

identification

algorithm

for

genome

assembly

Ljiljana Brankovic

a

,

b

,

Costas S. Iliopoulos

b

,

Ritu Kundu

b

,

Manal Mohamed

b

,

Solon P. Pissis

b

,

,

Fatima Vayani

b

aSchoolofElectricalEngineeringandComputerScience,TheUniversityofNewcastle,NewcastleNSW 2308,Australia bDepartmentofInformatics,King’sCollegeLondon,LondonWC2R2LS,UnitedKingdom

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory: Received 25 May 2015

Received in revised form 16 September 2015

Accepted 16 October 2015 Available online 23 October 2015 Communicated by R. Giancarlo Keywords:

Genome assembly de Bruijn graphs Superbubble

DNAsequencingistheprocessofdeterminingtheexactorderofthenucleotidebasesofan individual’sgenomeinordertocataloguesequencevariationandunderstanditsbiological implications.Whole-genomesequencingtechniquesproducemassesofdataintheformof shortsequencesknownasreads.Assemblingthesereadsintoawholegenomeconstitutesa majoralgorithmicchallenge.MostassemblyalgorithmsutilisedeBruijngraphsconstructed fromreadsforthispurpose.A criticalstepofthesealgorithmsistodetecttypicalmotif structuresinthegraphcausedbysequencingerrorsandgenomerepeats,andfilterthem out;onesuchcomplexsubgraphclassisaso-calledsuperbubble.Inthispaper,wepropose anO(n+m)-timealgorithmtodetectallsuperbubblesinadirectedacyclicgraphwithn

verticesandm(directed)edges,improvingthebest-knownO(mlogm)-timealgorithmby Sungetal.

©2015TheAuthors.PublishedbyElsevierB.V.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Since the publication of the first draft of the human genome [1,2], the field of genomics has changed dramatically. Recentdevelopmentsinsequencingtechnologies(see[3],forexample)havemadeitpossibletosequencenewgenomesat afractionofthetimeandcostrequiredonlyafewyearsago.Withapplicationsincludingsequencingthegenomeofanew species, anindividualwithinapopulation,andRNAmoleculesfromaparticularsample,sequencingremainsatthecoreof genomics.

Whole-genome sequencing creates massesof data, in the order oftens ofgigabytes, in the form of shortsequences (reads). Genomeassemblyinvolvespiecingtogetherthesereadstoformasetofcontiguoussequences(contigs) represent-ing the DNAsequence inthe sample.Traditional assemblyalgorithms rely onthe overlap-layout-consensus approach[4], representingeach readasavertexinanoverlapgraphandeachdetected overlapasadirectededgebetweenthevertices correspondingtooverlappingreads.Thesemethodshaveprovedtheirusethroughnumerousdenovogenomeassemblies[5]. Subsequently,afundamentally differentapproachbasedondeBruijngraphswasproposed [6],whererepresentationof data elementswas organisedaroundwords ofk nucleotides, ork-mers, insteadofreads.Unlikeinan overlapgraph,ina deBruijngraph [7],each k

1 nucleotidelongprefix andsuffix ofthek-mersisrepresented asavertexandeach k-mer

*

Corresponding author.

E-mailaddresses:[email protected](L. Brankovic), [email protected](C.S. Iliopoulos), [email protected](R. Kundu), [email protected](M. Mohamed), [email protected](S.P. Pissis), [email protected](F. Vayani).

http://dx.doi.org/10.1016/j.tcs.2015.10.021

0304-3975/©2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

(2)

isrepresentedasadirected edgebetweenitsprefix andsuffixvertices.Themarginalinformationcontainedby ak-meris its last nucleotide.Thesequence ofthosefinal nucleotides iscalledthesequence ofthevertex. Ina de Bruijngraph,the assemblyproblemisreducedtofindinganEulerianpath,thatis,atrailthatvisitseachedgeinthegraphexactlyonce.

However,sequencingerrorsandgenomerepeatssignificantlycomplicatethedeBruijngraphbyaddingfalseverticesand edgestoit.Efficientandrobustfilteringmethodshavebeenproposedtosimplifythegraphbyfilteringout motifssuchas tips,bubbles,andcrosslinks,whichprovedtobecausedbysequencingerrors[8].Inparticular,abubbleconsistsofmultiple directedunipaths (whereaunipathisa pathinwhichallinternalverticesare ofdegree 2)fromavertexv toavertexu andiscommonlycausedbyasmallnumberoferrorsinthecentreofreads.Althoughthesetypesofmotifsaresimpleand caneasilybeidentifiedandfilteredout,morecomplicatedmotifsprovetobemorechallenging.

Recently,acomplexgeneralisationofabubble,theso-calledsuperbubble,was proposedasanimportantsubgraphclass foranalysingassemblygraphs [9].Asuperbubble isdefinedasaminimal subgraph H inthede Bruijn graphwithexactly one start vertex s and one end vertex t such that: (1) H is a directed, acyclic, single-source (s), single-sink (t) graph (2) there isno edgefroma vertexnot in H going toa vertexin H

\{

s

}

and(3)there isno edge fromavertexin H

\{

t

}

goingtoavertexnotin H.Itisclearthat manysuperbubblesareformedasaresultofsequencingerrors,inexactrepeats, diploid/polyploidgenomes, orfrequentmutations.Thus, efficientdetectionofsuperbubblesisessential forthe application ofgenomeassembly[9].

Onoderaetal. gavean

O

(

nm

)

-timealgorithm todetectsuperbubbles,wheren isthenumberofverticesandm isthe number ofedges in the graph[9]. Very recently,Sung etal. gave an improved

O

(

mlogm

)

-time algorithm to solve this problem[10]. The algorithm partitionsthe given graph intoa set of subgraphssuch that the set ofsuperbubbles in all thesesubgraphsis thesame astheset ofsuperbubblesin thegivengraph. Thissetconsists ofsubgraphscorresponding toeach non-singletonstronglyconnectedcomponentandasubgraph correspondingtothe setofalltheverticesinvolved in singleton strongly connected components. Superbubbles are then detected in each subgraph; if it is cyclic, it is first convertedintoadirectedacyclicsubgraphbymeansofdepth-firstsearchandbyduplicatingsomevertices.

Our contribution.Notethatthecostofpartitioningthegraphandtransformingitintothedirectedacyclicsubgraphsis linearwithrespecttothesizeofthegraph.However,computingthesuperbubblesineachdirectedacyclicsubgraphrequires

O

(

mlogm

)

time [10],whichdominatesthetimeboundofthealgorithm.Inthispaper,we proposeanew

O

(

n

+

m

)

-time algorithmtocomputeallsuperbubblesinadirectedacyclicgraph.

Thispaperisorganisedasfollows:InSection 2wedefinesuperbubblesandintroducesomeoftheirproperties,andin Section 3 we outline the

O

(

n

+

m

)

-time algorithm forcomputingsuperbubbles in a directed acyclicgraph.In Section 4

we explaina methodtovalidate a candidatesuperbubblein constanttime. Thealgorithmis analysedinSection 5,while Section6providessomefinalremarksanddirectionsforfutureresearch.

2. Properties

Theconceptofsuperbubbleswasintroducedandformallydefinedin[9]asfollows.

Definition 1. (See[9].) LetG

=

(

V

,

E

)

beadirectedgraph.Foranyorderedpairofdistinctverticessandt, s

,

t

iscalleda superbubbleifitsatisfiesthefollowing:

reachability:t isreachablefroms;

matching:thesetofverticesreachablefromswithoutpassingthroughtisequaltothesetofverticesfromwhichtis reachablewithoutpassingthroughs;

acyclicity:thesubgraphinducedby U isacyclic,whereU isthesetofverticessatisfyingthematchingcriterion;

minimality:novertexinU otherthantformsapairwithsthatsatisfiestheconditionsabove;

verticessandt,andU

\{

s

,

t

}

usedintheabovedefinitionarethesuperbubble’sentrance,exitandinterior,respectively. We note that a superbubble

s

,

t

inthe above definitionisequivalent toa single-source, single-sink, directed acyclic subgraphofGwithsourcesandsinkt,whichdoesnothaveanycutverticesandpreservesallin-degreesandout-degrees ofverticesinU

\{

s

,

t

}

,aswellastheout-degreeofsandin-degreeoft.

Wenext state afew importantpropertiesofsuperbubbles whichenable thelinear-time enumerationofsuperbubbles.

Lemmas 1 and 2wereprovedbyOnoderaetal.[9]andSungetal.[10],respectively. Lemma 1. (See[9].) Anyvertexcanbetheentrance(respectivelyexit)ofatmostonesuperbubble.

NotethatLemma 1doesnotexcludethepossibilitythatavertexistheentranceofasuperbubbleandtheexitofanother superbubble.

Lemma 2. (See[10].)LetG beadirectedacyclicgraph.Wehavethefollowingtwoobservations.

1)Suppose

(

p

,

c

)

isanedgeinG,wherep hasonechildandc hasoneparent,then p

,

c

isasuperbubbleinG. 2)Foranysuperbubble s

,

t

inG,theremustexistsomeparentp oft suchthatp hasexactlyonechildt.
(3)

Fig. 1.A graphGwith set of verticesV= {v1,v2,· · ·,v15}. Note thatGhas as a single sourcev1and as a single sinkv14.

Fig. 2.Vertices of Fig. 1in topological order, where ordD[v1]=1, ordD[v2]=2, ordD[v3]=3, ordD[v4]=11, ordD[v5]=6, ordD[v6]=8, ordD[v7]=10, ordD[v8]=12, ordD[v9]=7, ordD[v10]=9, ordD[v11]=4, ordD[v12]=5, ordD[v13]=13, ordD[v14]=15 and ordD[v15]=14.

Inthispaperwestartbyshowinganotherimportantpropertyofsuperbubblesthatisclosely-relatedtoLemma 2. Lemma 3. Foranysuperbubble s

,

t

inadirectedacyclicgraphG,theremustexistsomechildc ofs suchthatc hasexactlyone parent s.

Proof. Assumethatallthechildrenofshavemorethanoneparent.Then,theremustbesomecycleorsomechildc which hasaparentthatdoesnotbelongtothesuperbubble s

,

t

.Thisisacontradiction.

2

3. Finding a superbubble in a directed acyclic graph

ThemaincontributionofthispaperisanalgorithmSuperBubblethatreportsallsuperbubblesinadirectedacyclicgraph G

=

(

V

,

E

)

withexactlyonesource(thevertexwithin-degree0)andexactlyonesink(vertexwithout-degree0).IfG has morethanonesourcethenanewsourcevertexrisaddedtoV andanedgefromrtoeachexistingsourceisaddedtoE. The sameis done ifG hasmorethan onesink; inthiscase, a newsinkvertext isadded to V andan edgefrom each existing sinkto t isaddedto E. Ifsuch preprocessingisdone, thenamong thesuperbubblesreported bythe algorithm, only those which donot start at r and donot end att representthe superbubbles inthe original graph.For thesake ofsimplicity,fortherestofthispaperandinall thepropositions,lemmasandtheorems thatfollow,weuse G todenote a directed acyclicgraphwithexactlyone source andexactly one sink,andweusen andm to denotethenumberofits verticesandedgesrespectively,thatis,forG

=

(

V

,

E

)

wehaven

= |

V

|

andm

= |

E

|

.

Atopologicalordering ordDofG mapseachvertextoanintegerbetween1 andn,suchthatordD

[

x

]

<

ordD

[

y

]

holdsfor alledges

(

x

,

y

)

E.Thereexistsaclassicallinear-timealgorithmforcomputingthetopologicalorderingofadirectedacyclic graph[11,12].Initsrecursiveform,thealgorithmvisitsanunvisitedvertexofthegraph,findsitsunvisitedneighbour,sayv, andperformsanothertopologicalsortstartingfromv.Thealgorithmreturnsifthecurrentvertexdoesnothaveunvisited neighbours.AlgorithmTopologicalSort,givenbelow,isasimplifiedversionthattakesasinputasingle-source,single-sink directedacyclicgraph,andproducesatopologicalorderingofvertices.ForthegraphG inFig. 1,TopologicalSortproduces anorderinggiveninFig. 2.

TopologicalSort

(

G

)

1 order

n

2 foreach vertex v

V do 3 state

[

v

] ←

unvisited

4 RecursiveTopologicalSort

(

G

,

source

)

RecursiveTopologicalSort

(

G

,

v

)

1 state

[

v

] ←

visited

2 foreach vertex wadjacent to vdo 3 ifstate

[

w

] =

unvisitedthen

4 RecursiveTopologicalSort

(

G

,

w

)

5 ordD

[

v

] ←

order

6 order

order

1

Proposition 1. ForanytopologicalorderingordD ofverticesingraphG,ifvertexu isreachablefromv,thatis,ifthereisapathfrom v tou,thenordD

[

v

]

<

ordD

[

u

]

.
(4)

j 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 v1 v2 v3 v11 v12 v5 v9 v6 v10 v7 v4 v8 v13 v15 v14

entrance s1 s2 s3 s4 s5 s6

exit t1 t2 t3 t4 t5 t6

Fig. 3.Candidates list for Fig. 1, candidates= {v1(entrance),v3(exit), v3(entrance), v11(entrance),v12(exit),v5(entrance), v10(exit), v7(exit), v8(exit),

v8(entrance),v13(entrance),v14(exit)}. Note that both v3and v8appear twice in the list.

Proof. Ifthepathfromv touisoflength1,i.e., thereisanedge

(

v

,

u

)

,thenbythedefinitionoftopologicalorderingwe haveordD

[

v

]

<

ordD

[

u

]

.Otherwise, wedenote thepath fromv to u oflength k,k

>

1,as v

,

x1

,

. . . ,

xk−1

,

u.Then bythe definitionoftopologicalorderingwehaveordD

[

v

]

<

ordD

[

x1]

<

· · ·

<

ordD

[

u

]

.Transitively,wehaveordD

[

v

]

<

ordD

[

u

]

.

2

Importantly,inthispaperwedonotconsideralltopologicalorderingsofgraphG butonlythoseobtainedbyalgorithm TopologicalSort.NotethatthisalgorithmfindsadirectedspanningtreeT ofGrootedatthesource,whichcontainsapath fromthesourcetoanyvertexinG.ThedirectedspanningtreeT ofG obtainedbyalgorithmTopologicalSortispresented byboldedgesinFig. 2.Itmaybeworthmentioningthatadirectedrootedtreeisalsoknownasarborescence.

WenextpresentanotherimportantpropertyoftopologicalorderingobtainedbyalgorithmTopologicalSort.

Proposition 2. LetordDand T bea topologicalorderingandadirectedrootedspanningtreeofgraphG obtainedbyalgorithm TopologicalSort.IfthereisapathinT fromavertexv toavertexu,then,foreachvertexw suchthatordD

[

v

]

<

ordD

[

w

]

<

ordD

[

u

]

, thereisapathfromv to w.

Proof. Recall that T contains a path from the root to each vertex of the tree; this is also true for each subtree of T. Furthermore,ifthereisapathfromv tou inT,thenu iscontainedinasubtreeofT rootedat v,andeach w suchthat ordD

[

v

]

<

ordD

[

w

]

<

ordD

[

u

]

isalsocontainedinthe subtreerooted atv (butnot inthesubtree rootedatu). Therefore, thereisapathfromv tow,foreach wsuchthatordD

[

v

]

<

ordD

[

w

]

<

ordD

[

u

]

.

2

We next show that in an ordering obtained by TopologicalSort, a vertex has the topological ordering betweenthe orderingsoftheentranceandtheexitofasuperbubbleifandonlyifitbelongstothesuperbubble.

Lemma 4. LetgraphG containasuperbubble s

,

t

.ThenatopologicalorderingobtainedbyTopologicalSorthasthefollowing properties.

1. Forallx suchthatx

U

\{

s

,

t

}

,ordD

[

s

]

<

ordD

[

x

]

<

ordD

[

t

]

. 2. Forally suchthaty

/

U ,ordD

[

y

]

<

ordD

[

s

]

orordD

[

y

]

>

ordD

[

t

]

.

Proof. RecallthatU isthesetofverticesformingasuperbubble(seeDefinition 1).

1. Sincethereisa pathfromthestart softhesuperbubbletoall x

U

\{

s

}

,byProposition 1 wehaveordD

[

s

]

<

ordD

[

x

]

forall x such that x

U

\{

s

}

. Similarly, since thereis a path fromall x

U

\{

t

}

to theend t ofthe superbubble, by

Proposition 1wehaveordD

[

x

]

<

ordD

[

t

]

forallxsuchthatx

U

\{

t

}

.Therefore,forallxsuchthatx

U

\{

s

,

t

}

,ordD

[

s

]

<

ordD

[

x

]

<

ordD

[

t

]

.

2. Suppose theopposite, that is,suppose thatthere exists some y

/

U such that ordD

[

s

]

<

ordD

[

y

]

<

ordD

[

t

]

. Sincethe superbubble

s

,

t

isitself a single-source, single-sink subgraph of G,any directed spanning tree of G rooted atthe source,willcontainapathfroms tot.Thenby Proposition 2therealsoexistsa pathfromsto y inT andthusalso inG.However,bythedefinitionofthesuperbubble,theonlyverticesreachablefromswithoutgoingthrought arethe internalverticesofthesuperbubble—acontradiction.Therefore,forall ysuchthat y

/

U,eitherordD

[

y

]

<

ordD

[

s

]

or ordD

[

y

]

>

ordD

[

t

]

.

2

AlgorithmSuperBubble starts by topologically ordering thevertices ofgraph G andthen checkseach vertexin V, in topologicalorder,toidentifywhetheritisanexitoranentrancecandidate(orboth).AccordingtoLemmas 2 and 3,a vertex visanexitcandidateifithasatleastoneparentwithexactlyonechild(out-degree1)andanentrancecandidateifithas atleastonechild withexactlyoneparent (in-degree1). Thereare atmost2n candidates,thusthecost ofconstructinga doubly-linked listof all thecandidates is linearinn.The elements of thecandidates listare ordered according toordD, andeach candidateislabelledasanexitoran entrancecandidate.Notethatifavertex v isbothanexitandanentrance candidate,then vappearstwiceinthecandidateslist,firstasanexitandthenasanentrance(Fig. 3).

(5)

Algorithm SuperBubble processes the candidates list of graph G in decreasing topological order (backwards). Let v1

,

v2

,

. . . ,

v bethelistofcandidates.Thealgorithmperformsthefollowing:

Ifvjisanentrancecandidate,thendeletevj;

If vj isan exitcandidate,then subroutineReportSuperBubbleiscalledto findandreport thesuperbubbleendingat vj,that is,the superbubble

vi

,

vj

, forsome entrance candidate vi. SubroutineReportSuperBubblealso recursively findsandreportsallnestedsuperbubblesbetweenvi andvj.

Forclarityofpresentation, we nextprovidea listandashortdescriptionofsubroutinesandarraysusedby algorithm SuperBubble and subroutine ReportSuperBubble. Before that, it is worth mentioning that candidates is a doubly-linked listofentrance andexitcandidates; specifically,an elementofthelistisa vertexalong withalabelspecifyingifitisan entrance oran exitcandidate.Forthesake ofsimplicityofthefollowingroutines, we useavertexanditscorresponding candidate (element in the candidates list) interchangeably. This does not add to the complexity ofthe algorithm aswe can usean auxiliaryarray v,where v

[

i

]

storesapointertothecorrespondingelement vi incandidatessoastoprovidea constant-timeconversionfromavertextothecorrespondingcandidate.

1. Entrance

(

v

)

takesasinput avertexv andoutputs

TRUE

ifv isanentrancecandidate,that is,ifitsatisfiesLemma 3, and

FALSE

otherwise.

2. Exit

(

v

)

takesasinputavertexvandoutputs

TRUE

ifvisanexitcandidate,thatis,ifitsatisfiesLemma 2,and

FALSE

otherwise.

3. InsertEntrance

(

v

)

takesasinputavertexv,insertsitattheendofcandidatesandlabelsitasentrance. 4. InsertExit

(

v

)

takesasinputavertexv,insertsitattheendofcandidatesandlabelsitasexit.

5.Head(candidates)andTail(candidates)returnthefirstandthelastelementincandidates,respectively. 6.DeleteTail(candidates)deletesthelastelementincandidates.

7.Next(v)returnsthecandidatefollowingv incandidates.

Inadditiontotheabovesubroutines,themainalgorithmalsoexplicitlymakesuseofthefollowingarrays. 1. ThearrayordDstoresthetopologicalorderofthevertices.

2. ThearraypreviousEntrancestoresthepreviousentrancecandidatesforeachvertexv.Formally,previousEntrance

[

v

]

=

s where sisanentrance candidatesuch thatordD

[

s

]

<

ordD

[

v

]

;andtheredoesnot existanotherentrancecandidates suchthatordD

[

s

]

<

ordD

[

s

]

<

ordD

[

v

]

.

3. ThearrayalternativeEntranceisusedtoreducethenumberofentrance

exitpairsthatneedtobeconsideredaspossible superbubbles.ArrayalternativeEntranceisfurtherdetailedinSection4.1.

Note thatsubroutineReportSuperBubble iscalledforeach exitcandidateindecreasingorder eitherbyalgorithm Su-perBubble orthrougharecursivecalltoidentifyanestedsuperbubble.AcalltosubroutineReportSuperBubble(start,exit) checks thepossibleentrance candidatesbetweenstart andexit,starting withthenearest previous entrancecandidate(to exit).ThistaskisaccomplishedwiththehelpofsubroutineValidateSuperBubble,explainedinthefollowingsection,which checkswhetheragivencandidate s

,

t

isasuperbubbleornot;ifitisnot,thealgorithmreturnseithera“

1”whichmeans that nosuperbubbleendsatt,oran alternativeentrancecandidateforasuperbubblethatcouldendatt.Forthegraphin

Fig. 1,thealgorithmdetects andreportsfivesuperbubbles: v8

,

v14, v3

,

v8,

v5

,

v7, v11

,

v12and v1

,

v3.Here,both

v5

,

v7and v11

,

v12arenestedsuperbubbles.

SuperBubble

(

G

)

1 TopologicalSort

(

G

)

2 prevEnt

NULL

3 foreach vertex vin topological orderdo 4 alternativeEntrance

[

v

] ←

NULL

5 previousEntrance

[

v

] ←

prevEnt 6 ifExit

(

v

)

then 7 InsertExit

(

v

)

8 ifEntrance

(

v

)

then 9 InsertEntrance

(

v

)

10 prevEnt

v

11 whilecandidatesis not emptydo 12 ifEntrance

(

Tail

(

candidates

))

then 13 DeleteTail

(

candidates

)

(6)

ReportSuperBubble

(

start, exit

)

1

Report the superbubble ending atexit(if any)

2 if

(

start

=

NULL

)

|

(

exit

=

NULL

)

|

(

ordD

[

start

] ≥

ordD

[

exit

]

)

then 3 DeleteTail

(

candidates

)

4 return

5 s

previousEntrance

[

exit

]

6 while

(

ordD

[

s

] ≥

ordD

[

start

]

)

do 7 valid

ValidateSuperBubble

(

s

,

exit

)

8 if

(

valid

=

s

)

|

(

valid

=

alternativeEntrance

[

s

]

)

|

(

valid

= −

1

)

then

9 break 10 alternativeEntrance

[

s

] ←

valid 11 s

valid 12 DeleteTail

(

candidates

)

13 if

(

valid

=

s

)

then 14 Report

(

s

,

exit

)

15 while

(

Tail

(

candidates

)

is nots

)

do 16 ifExit

(

Tail

(

candidates

))

then 17

Check for nested superbubbles

18 ReportSuperBubble

(

Next

(

s

)

,Tail

(

candidates

))

19 elseDeleteTail

(

candidates

)

20 return

Remark 1. Itisalsopossibletodesignthealgorithmsoastomoveforwardintopologicalorderinsteadofbackwards. ForgraphGinFig. 1,algorithmSuperBubble

(

G

)

makesexactlythreecallstosubroutineReportSuperBubble:

1. ReportSuperBubble

(

v1

,

v14

)

: First, it checks theexit candidate v14 against the nearest previous entrance candidate, i.e. vertex v13. Subroutine ValidateSuperBubble

(

v13

,

v14

)

returns v8 asan alternative entrance candidate.The new candidateisthencheckedandthesuperbubble v8

,

v14isreported.

2. ReportSuperBubble

(

v1

,

v8

)

: First, it checks the exit candidate v8 against the nearest previous entrance candidate, i.e. vertex v5.SubroutineValidateSuperBubble

(

v5

,

v8

)

returnsv3asanalternativeentrancecandidate.Thenew candi-dateisthencheckedandthesuperbubble v3

,

v8isreported.Additionally,tworecursivecallsaremade:

(a) ReportSuperBubble

(

v11

,

v7

)

:First,itvalidates v5

,

v7andreportsit.Then,itmakesarecursivecalltosubroutine ReportSuperBubble

(

v10

,

v10

)

whichterminateswithoutreportinganysuperbubble.

(b) ReportSuperBubble

(

v11

,

v12

)

:validates v11

,

v12andreportsit. 3. ReportSuperBubble

(

v1

,

v3

)

:validates

v1

,

v3

andreportsit. 4. Validating a superbubble

Inthissection,wedescribesubroutineValidateSuperBubble.Theabilitytovalidateacandidatesuperbubbledependson thefollowingresultrelatedtotheRangeMinimumQueryproblem.

TheRangeMinimumQueryproblem,RMQforshort,istopreprocessagivenarray A

[

1

.

.

n

]

forsubsequentqueriesofthe form:“Given indicesi

,

j,whatis theminimum valueof A

[

i

. .

j

]

?”. The problemhasbeenstudied intensivelyfordecades andseveral O

(

n

),

O

(

1

)

-RMQdatastructureshavebeenproposed,manyofwhichdependontheequivalencebetweenthe RangeMinimumQueryandtheLowestCommonAncestorproblems[13–15].

Inordertocheckwhetherasuperbubblecandidate s

,

t

isasuperbubbleornot,weproposetoutilisetherangemin/max queryproblemasfollows:

ForagivengraphG

=

(

V

,

E

)

andforeachvertexv

V withtopologicalorderordD

[

v

]

,calculatethetopological order-ingsoftheparentandthechildofv thataretopologicallyfurthestfromv.

OutParent

[

ordD

[

v

]] =

min(

{

ordD

[

ui

] |

(

ui

,

v

)

E

}

),

OutChild

[

ordD

[

v

]] =

max(

{

ordD

[

ui

] |

(

v

,

ui

)

E

}

).

Foran integerarray A andindicesi and j wedefine RangeMin

(

A

,

i

,

j

)

andRangeMax

(

A

,

i

,

j

)

toreturntheminimum andmaximumvalueof A

[

i

..

j

]

,respectively.

Thenforagivensuperbubblecandidate s

,

t

,wheresandt arean entranceandanexitcandidaterespectively (satis-fyingLemmas 1 and 2),if

s

,

t

isasuperbubblethenthefollowingtwoconditionsarevalid

RangeMin

(

OutParent

,

ordD

[

s

] +

1,ordD

[

t

])

=

ordD

[

s

],

RangeMax

(

OutChild

,

ordD

[

s

],

ordD

[

t

] −

1)

=

ordD

[

t

].

(7)

j 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 v1 v2 v3 v11 v12 v5 v9 v6 v10 v7 v4 v8 v13 v15 v14

OutParent[j] – 1 1 3 4 3 6 6 7 8 3 5 12 13 12

OutChild[j] 3 3 11 5 12 8 9 10 10 12 12 15 15 15 –

Fig. 4.OutParentandOutChildarrays for the graph inFig. 1.

Forexample,Fig. 4representsbothOutParentandOutChildarrayscomputedforgraphGinFig. 1.Furthermore,a candi-date v5

,

v8isnotasuperbubbleasRangeMin

(

OutParent

,

ordD

[

v5] +1

,

ordD

[

v8]

)

=

3

=

6

=

ordD

[

v5].

Itshouldbeclearthatafteran

O

(

n

+

m

)

-timepreprocessing,validatingasuperbubblerequires

O

(

1

)

timewhichisthe costforrangemax/minquery.SubroutineValidateSuperBubble

(

startVertex

,

endVertex

)

isdesignedtoreturnanappropriate entrancecandidateforasuperbubbleendingatendVertex(ifany),asfollows.

ValidateSuperBubble

(

startVertex, endVertex

)

1 start

ordD

[

startVertex

]

2 end

ordD

[

endVertex

]

3 outchild

RangeMax

(

OutChild

,

start

,

end

1

)

4 outparent

RangeMin

(

OutParent

,

start

+

1

,

end

)

5 ifoutchild

=

endthen

6 return

1

7 ifoutparent

=

startthen 8 returnstartVertex

9 elseifEntrance

(

Vertex

(

outparent

))

then 10 returnVertex

(

outparent

)

11 else returnpreviousEntrance

[

Vertex

(

outparent

)

]

Notethat subroutineValidateSuperBubbleutilises subroutineEntranceandthearray previousEntrancedefinedin Sec-tion3,aswellassubroutineVertexthattakesasinputanintegeri andoutputsvertexv suchthatordD

[

i

] =

v.

An importantobservationisthata subsequentcalltosubroutineValidateSuperBubble,foragivenentrancecandidate, returnsalternativeentrancecandidatesinstrictlynon-decreasingtopologicalorderasprovedbyLemma 5.

Lemma 5. Lett be the alternative entrance candidatereturned by subroutine ValidateSuperBubble

(

s

,

e

)

. Thenfor any exit candidateesuchthatordD

[

s

]

<

ordD

[

e

]

<

ordD

[

e

]

,theorderofthealternativeentrancecandidatet returnedbysubroutine ValidateSuperBubble

(

s

,

e

)

willbegreaterthanorequaltotheorderof t.

Proof. Recall thatthealternativeentrancet returnedby thesubroutineValidateSuperBubble

(

s

,

e

)

iseitheravertexwith topologicalorderoutparent,orthepreviousEntranceofthisvertex.

Sinceoutparent

=

RangeMin

(

OutParent

,

ordD

[

s

]

+

1

,

ordD

[

e

]

)

,outparent

=

RangeMin

(

OutParent

,

ordD

[

s

]

+

1

,

ordD

[

e

]

)

and ordD

[

s

]

<

ordD

[

e

]

<

ordD

[

e

]

,wehaveoutparent

outparent.Therefore,ordD

(

t

)

ordD

(

t

)

.

2

4.1. ValidationandalternativeEntrance

Incasethevalidationofthecandidatepair

(

t0

,

e

)

fails,subroutineValidateSuperBubble

(

t0

,

e

)

returnseither“

1”oran alternativecandidatet1 whichmightbeanentranceofasuperbubbleendingate.Thisalternativecandidatet1 iseithera vertexu1,ifu1 isanentrancecandidate,orthepreviousentrancecandidateofu1 suchthat

ordD

[

u1] =OutParent

[

ordD

[

v1]]

=

RangeMin

(

OutParent

,

ordD

[

t0] +1,ordD

[

e

]),

where v1issomevertexbetweent0 andeinthetopologicalordering.

Suppose t1 is also not a valid entrance of thesuperbubble ending at e. Then there must be a vertex v2, ordD

[

t1]

<

ordD

[

v2]

<

ordD

[

t0],withsomeparentu2,suchthatordD

[

u2] =OutParent

[

ordD

[

v2]].Thenthealternativeentranceissome

t2, which is eithera vertex u2 orits previous entrance andthus ordD

[

t2]

<

ordD

[

t1].A series of such failed validations producesasequencet1

,

t2

,

. . .

offailedalternativeentrancecandidates.

Animportantobservationhereisthatanyentranceti,fori

1,fromsuchasequenceisaninvalidentrancenotonlyfor thesuperbubbleendingatebutalsoforallthoseendingatanyotherexitvertexesuchthatordD

[

ti−1]

<

ordD

[

e

]

<

ordD

[

e

]

andti

=

ValidateSuperBubble

(

ti1

,

e

)

.Thisisthe casebecausethevertex vi whichcausesthe alternativeentrance ti to fail is suchthat ordD

[

ti

]

<

ordD

[

vi

]

<

ordD

[

ti−1] fori

1.Therefore, vi doesnot dependon theexite butratheron the previousfailedcandidateentrance.

ThisiswherearrayalternativeEntrance playsanimportantrole.Storing alternativeEntrance

[

ti−1] =tifori

1 enablesus toskipthissequenceatalaterstageiftiisreturnedbysubroutineValidateSuperBubble

(

ti1

,

e

)

.
(8)

5. Algorithm analysis

Inthissection,weanalysethecorrectnessandtherunningtimeoftheproposedalgorithmSuperBubble.Forsimplicity, inthefollowinglemmawewillslightlyabusetheterminologyandreferto s

,

t

asasuperbubbleifitsatisfiesthefirstthree conditionsgiveninDefinition 1,andasminimalsuperbubbleifitalsosatisfiesthelastconditioninthesamedefinition. Lemma 6. For agiven exitcandidatee, lets betheentrancecandidatesuchthatsuperbubble

s

,

e

isreportedbysubroutine ValidateSuperBubble

(

s

,

e

)

.Then s

,

e

isaminimalsuperbubble.

Proof. Bycontradiction, letebe anexitcandidatesuchthat

s

,

e

isalsoasuperbubbleandordD

[

s

]

<

ordD

[

e

]

<

ordD

[

e

]

. Then,eitherordD

[

e

] =

ordD

[

e

] +

1 orthereisatleastonevertexv suchthatordD

[

e

]

<

ordD

[

v

]

<

ordD

[

e

]

.

Inthe firstcase, ordD

[

e

] =

ordD

[

e

] +

1 implies thate isthe onlychildof e ande isthe onlyparentof e,which,by

Lemma 2makes e

,

e

asuperbubble.

Inthesecondcase,wherethereisatleastonevertexv suchthatordD

[

e

]

<

ordD

[

v

]

<

ordD

[

e

]

,wealsoarguethat e

,

e

mustbeasuperbubble.Indeed, e

,

e

satisfiesthefollowingconditions:

1. Reachability:Since s

,

e

isasuperbubble,eisreachablefroms;since s

,

e

isalsoassumedtobeasuperbubble,any pathfromstoemustgothroughe,thereforeeisreachablefrome.

2. Matching:Theonlyverticesreachablefrome withoutgoingthrough earethosewhosetopologicalorderisbetween ordD

(

e

)

andordD

(

e

)

.Indeed,since s

,

e

and s

,

e

aresuperbubbles,alltheseverticesarereachablefromsthroughe, andnoverticeswithtopologicalordergreater thanordD

(

e

)

arereachablefrome without goingthrough e.Similarly, there are no edges between vertices withtopological order lessthan ordD

(

e

)

andthose with the topologicalorder between ordD

(

e

)

and ordD

(

e

)

. Therefore,the only vertices from which e is reachable without going through e are thosewhosetopologicalorderisbetweenordD

(

e

)

andordD

(

e

)

.

3. Acyclicity:Since s

,

e

isasuperbubbleitisacyclic;since e

,

e

isasubgraphof s

,

e

,itisalsoacyclic.

Inbothcases,sinceforeachexitcandidatetheentrancecandidatesarecheckedinreversetopologicalorder,subroutine ValidateSuperBubblewould havebeen calledon

e

,

e

first, andwouldhavereported

e

,

e

instead of

s

,

e

. Therefore,

s

,

e

isaminimalsuperbubble.

2

Lemma 7. Forthegivenentranceandexitcandidatess ande,respectively,subroutineValidateSuperBubblereports s

,

t

ifandonly if s

,

t

isasuperbubble.

Proof. Weprovethelemmabyshowingthatif s

,

t

isasuperbubblethensubroutineValidateSuperBubblereportsit,and ifValidateSuperBubblereports s

,

t

then s

,

t

isasuperbubble.

1. Westartbyshowingthatif s

,

t

isasuperbubblethensubroutineValidateSuperBubblereportsit.Indeed,byLemma 4, all theverticeswithtopologicalorderings betweens andt belong tothe superbubble

s

,

t

.Therefore,the minimum OutParentissandthemaximumOutChildist andthussubroutineValidateSuperBubblereports s

,

t

.

2. Wenext show thatifsubroutineValidateSuperBubblereports

s

,

t

then

s

,

t

isa superbubble.Letstart andend be two integers,such that ordD

[

s

] =

start andordD

[

t

] =

end.Thegraph G asdefined, hasa single sourcer anda single sinkr;thisimpliesthatanyvertexv

V isreachablefromr and,atthesametime,canreachr.Thisisalsotruefor s,tandforanyvertexv suchthatordD

[

s

]

<

ordD

[

v

]

<

ordD

[

t

]

.

First,we show thatt isreachable froms.Recallthat t is anexitcandidate,so, ithasa parent p without-degree 1. Assumethattisnotreachablefroms,thentheremustbeapathfromr

t whichdoesnotinvolves.Thisimpliesthat eitherOutParent

[

end

]

<

start,orthereexists avertexv such thatstart

<

ordD

[

v

]

<

end,OutParent

[

v

]

<

start andthere existsapathr

v t,whichisacontradiction.

Similarly, we can show that every vertex v such that start

<

ordD

[

v

]

<

end satisfies the matching criterion of the superbubble.

Theacyclicitycriterion isguaranteed bythe acyclicityof G andtheminimalityis satisfiedbythe designof subrou-tineReportSuperBubblewhichassigns each exitofasuperbubbleto thenearest entrance,andby thecorrectnessof

Lemma 6.

2

Lemma 8. Foragivenexitcandidatee,lett bethealternativeentrancecandidatereturnedbysubroutineValidateSuperBubble

(

s

,

e

)

. Thenanyentrancecandidatebetweent ande cannotbeavalidentranceforthesuperbubbleendingat e.

Proof. Bycontradiction, assume that s is an entrancecandidatebetweent ande such that

s

,

e

is asuperbubble. If s hadbeen between s ande, it wouldhave alreadybeen reported,asSuperBubble checks entrance candidates inreverse topological order starting frome.Therefore, s is between t and s, such that ordD

[

t

]

<

ordD

[

s

]

<

ordD

[

s

]

<

ordD

[

e

]

. Let
(9)

outparent

=

RangeMin

(

OutParent

,

ordD

[

s

]

+

1

,

ordD

[

e

]

)

.Then, vertexatoutparentisbetweent ands,otherwisesubroutine ValidateSuperBubble

(

s

,

e

)

wouldhavereturneds(insteadoft).Therefore,ordD

[

t

]

outparent

<

ordD

[

s

]

.

Let outparent

=

RangeMin

(

OutParent

,

ordD

[

s

] +

1

,

ordD

[

e

]

)

.Then outparent

outparent.Thisimpliesthat outparent

outparent

<

ordD

[

s

]

.However, for s

,

e

tobe avalidsuperbubble, outparent shouldhave beenequalto ordD

[

s

]

. Hence, the assumptionis wrongandthus, itisproved that therecannot be anentrance candidate,betweent ande,whichis a validentranceforthesuperbubbleendingate.

2

Lemma 9. Forthegivenentranceandexitcandidatess ande1,respectively,letalternativeEntrance

[

s

]

besettot1whichlatergetsreset

tot2suchthatt2

=

t1,whileconsiderings withanotherexitcandidatee2.Thennoentrancecandidatebetweens ande2canreset

alternativeEntrance

[

s

]

tot1again.

Proof. Let e3 be an exit candidate between s and e2 such that subroutine ValidateSuperBubble

(

s

,

e3

)

returns t3. Then by Lemma 5, ordD

[

t1] ≤ordD

[

t2] ≤ordD

[

t3]. Sincet1

=

t2, we haveordD

[

t1]

<

ordD

[

t2] ≤ordD

[

t3]. Therefore,t1

<

t3 and

alternativeEntrance

[

s

]

cannotberesettothesamevaluet1again.

2

Theorem 1. AlgorithmSuperBubblereportsallsuperbubbles,andonlysuperbubbles,ingraphG indecreasingtopologicalorderof theirexitverticesin

O

(

n

+

m

)

time.

Proof. Consider anexecutionofalgorithmSuperBubble.Letsuperbubbles s1

,

t1

,

· · ·

,

sk

,

tk

be thesuccessive superbub-bles reported just after the execution of Line 14 of subroutine ReportSuperBubble, where ordD

(

t1

)

>

ordD

(

t2

)

>

· · ·

>

ordD

(

tk

)

.

1. First,weshowthateach si

,

ti

reportedbythealgorithminLine14isasuperbubble.Thisisprovedbythecorrectness ofLemma 7.

2. Second,nosuperbubbleismissedout bythealgorithm asproved bythefollowing.SubroutineReportSuperBubbleis calledforeach exitcandidatein decreasingorder.The entrancecandidate forthesuperbubble(ifany) endingatexit willonlybebetweenstartandexit,wherestartiseithertheheadofthe candidateslist(whensubroutine ReportSuper-BubbleiscalledfromalgorithmSuperBubble)ornextcandidateoftheentranceofanouter superbubble(whencalled through a recursivecall to identifya nestedsuperbubble).A callto subroutineReportSuperBubble(start,exit) checks thepossibleentrancecandidatesbetweenstart andexit,startingwiththenearestpreviousentrancecandidate(toexit). SubroutineValidateSuperBubbleeither successfullyvalidatesanentrance candidate,orreturnsa “

1”,orreturnsan alternative entrance candidate.From Lemma 8,there cannot be anyvalidentrance betweenthisalternativeentrance andexit.Ifthisalternativeentrancestartsasequenceofentrancesalreadycheckedforsome exitcandidatepreviously (asdepictedbyalternativeEntrance),then allentrancesofthatsequencewillbeskipped,otherwisethisalternative en-trancewillbetested. However,asmentionedinSection4.1,noneoftheentrancecandidatesintheskipped sequence can be valid. Therefore,foreach exit candidate,every potential entrancecandidate ischeckedfor validity, andthose whicharenotconsideredarenotvalid.

3. Third,therunningtimeofSuperBubbleis

O

(

n

+

m

)

.Indeed,therunningtimeoftheTopologicalSortandcomputing thecandidateslistis

O

(

n

+

m

)

.Furthermore,alllistoperationscostconstanttimeeach,andsumuptoalinearcostof

O

(

n

)

,asthereareatmost2ncandidatesinthelist.Finally,eachcallforsubroutineValidateSuperBubblecosts

O

(

1

)

. ThetotalnumberoftimesValidateSuperBubbleiscalledis

O

(

n

+

m

)

.ThisisbecausesubroutineValidateSuperBubble iscalledonceforeach exitcandidateinLine7ofsubroutineReportSuperBubble,andthetotalnumberofsuchcalls isboundedby

O

(

n

)

.Additionally,itiscalledeverytimeanewalternativeEntrancesequenceisgeneratedbysubroutine ValidateSuperBubble.ItfollowsfromLemma 9thatonceanalternativeEntrancesequenceisreset,itcannotbegenerated again by subsequent calls to subroutine ValidateSuperBubble. This resettingof alternativeEntrance foreach entrance candidate(Line10)thus enablesavoidingrepeatedchecksofthesamesequences ofentrancecandidates.Resettingis doneeverytimeanedgeisconsideredforthefirsttimebetweenavertex(inbetweenanentrancecandidatestartVertex andanexitcandidateendVertex)anditstopologicallyfurthestparent(whoseorderislessthanthatofstartVertex).Thus, thetotalnumberoftimesalternativeEntrancewillbereset(foralltheentrancecandidates)isboundedby

O

(

m

)

. Therefore,thetotalrunningtimeforreportingallsuperbubblesingraphG is

O

(

n

+

m

)

.

2

6. Final remarks

Wepresentedan

O

(

n

+

m

)

-timealgorithmtocomputeallsuperbubblesinadirectedacyclicgraph,wherenisthe num-berofverticesandmisthenumberofedges,improvingthebest-known

O

(

mlogm

)

-timealgorithmforthisproblem[10]. Itisalsointerestingtonotethatinthistypeofgraph,thatis,constructedfromsequencesoverafixed-sizedalphabet,the out-degreeofeachvertexisboundedbythesizeofthealphabet(fourforDNAalphabet);therefore,thetimecomplexityof theproposedalgorithmisessentiallylinearinn.

Ourimmediategoalistopractically evaluateouralgorithmandcompareits implementationtoan earlierresult[9].It wouldalsobeinterestingtoinvestigateothersuperbubble-likestructuresinassemblygraphs,suchascomplexbulges[16].

(10)

References

[1]E.S.Lander,L.M.Linton,B.Birren,C.Nusbaum,M.C.Zody,J.Baldwin,K.Devon,K.Dewar,M.Doyle,W.FitzHugh,etal.,Initialsequencingandanalysis ofthehumangenome,Nature409 (6822)(2001)860–921.

[2]J.C.Venter,M.D.Adams,E.W.Myers,P.W.Li,R.J.Mural,G.G.Sutton,H.O.Smith,M.Yandell,C.A.Evans,R.A.Holt,etal.,Thesequenceofthehuman genome,Science291 (5507)(2001)1304–1351.

[3] S. Balasubramanian, D. Klenerman, C. Barnes, M. Osborne, Patent US20077232656, 2007.

[4]S.Batzoglou,Algorithmicchallengesinmammaliangenomesequenceassembly,in:EncyclopediaofGenomics,ProteomicsandBioinformatics,John WileyandSons,Hoboken,NewJersey,2005.

[5]J.Butler,I.MacCallum,M.Kleber,I.A.Shlyakhter,M.K.Belmonte,E.S.Lander,C.Nusbaum,D.B.Jaffe,ALLPATHS:denovoassemblyofwhole-genome shotgunmicroreads,GenomeRes.18 (5)(2008)810–820.

[6]P.A.Pevzner,H.Tang,M.S.Waterman,AnEulerianpathapproachtoDNAfragmentassembly,Proc.Natl.Acad.Sci.USA98 (17)(2001)9748–9753. [7]N.G.deBruijn,Acombinatorialproblem,K.Ned.Akad.Wet.49(1946)758–764.

[8]D.R.Zerbino,E.Birney,Velvet:algorithmsfordenovoshortreadassemblyusingdeBruijngraphs,GenomeRes.18 (5)(2008)821–829. [9]T.Onodera,K.Sadakane,T.Shibuya,Detectingsuperbubblesinassemblygraphs,in:WABI,2013,pp. 338–348.

[10]W.Sung,K.Sadakane,T.Shibuya,A.Belorkar,I.Pyrogova,AnO(mlogm)-timealgorithmfordetectingsuperbubbles,IEEE/ACMTrans.Comput.Biology Bioinform.12 (4)(2015)770–777.

[11]R.L.R.Thomas,H.Cormen,CharlesE.Leiserson,C.Stein,IntroductiontoAlgorithms,MITPress,Cambridge,MA,2001. [12]R.Tarjan,Edge-disjointspanningtreesanddepth-firstsearch,ActaInform.6 (2)(1976)171–185.

[13]D.Harel,R.Tarjan,Fastalgorithmsforfindingnearestcommonancestors,SIAMJ.Comput.13 (2)(1984)338–355.

[14]J.Fischer,V.Heun,TheoreticalandpracticalimprovementsontheRMQ-problem,withapplicationstoLCAandLCE,in:M.Lewenstein,G.Valiente (Eds.),CombinatorialPatternMatching,Proceedingsofthe17thAnnualSymposium,CPM2006,Barcelona,Spain,July5–7,2006,in:LectureNotesin ComputerScience,vol. 4009,Springer,2006,pp. 36–48.

[15]S.Durocher,Asimplelinear-spacedatastructureforconstant–timerangeminimumquery,in:A.Brodnik,A.López-Ortiz,V.Raman,A.Viola(Eds.), Space-EfficientDataStructures,Streams,andAlgorithms,in:LectureNotesinComputerScience,vol. 8066,SpringerBerlinHeidelberg,2013,pp. 48–60. [16]S. Nurk,A.Bankevich,D.Antipov, A.A.Gurevich,A.Korobeynikov,A.Lapidus,A.D.Prjibelski,A.Pyshkin,A.Sirotkin,Y.Sirotkin,R.Stepanauskas, S.R. Clingenpeel,T.Woyke,J.S.McLean,R.Lasken,G.Tesler,M.A.Alekseyev,P.A.Pevzner,Assemblingsingle-cellgenomesandmini-metagenomesfrom chimericMDAproducts,J.Comput.Biol.20 (10)(2013)714–737.

Theoretical Computer Science 609 (2016) 374–383 ScienceDirect www.elsevier.com/locate/tcs (http://creativecommons.org/licenses/by/4.0/ 860–921. 1304–1351. 2005. 810–820. 9748–9753. 758–764. 821–829. 2013, 770–777. 2001. 171–185. 338–355. 2006, 2013, 714–737.

References

Related documents