
Linking indexing data structures to de Bruijn graphs: Construction and update


Contents lists available at ScienceDirect

Journal of Computer and System Sciences

www.elsevier.com/locate/jcss


Bastien Cazaux (a,b), Thierry Lecroq (c), Eric Rivals (a,b,*)

(a) LIRMM, CNRS and Université de Montpellier, 161 rue Ada, 34095 Montpellier Cedex 5, France
(b) Institut Biologie Computationnelle, CNRS and Université de Montpellier, 860 rue Saint Priest, 34095 Montpellier Cedex 5, France
(c) Normandie Univ. & UNIROUEN, UNIHAVRE, INSA Rouen, LITIS, 76000 Rouen, France

ARTICLE INFO

Article history: Received 25 June 2015; received in revised form 26 May 2016; accepted 27 June 2016; available online xxxx

Keywords: Index; Data structure; Suffix tree; Suffix array; Dynamic update; Overlap; Contracted de Bruijn graph; Assembly; Algorithms; Bioinformatics

ABSTRACT

DNA sequencing technologies have tremendously increased their throughput, and hence complicated DNA assembly. Numerous assembly programs use de Bruijn graphs (dBG) built from short reads to merge these into contigs, which represent putative DNA segments. In a dBG of order $k$, nodes are substrings of length $k$ of reads (or $k$-mers), while arcs are their $(k+1)$-mers. As analysing reads often requires indexing all their substrings, it is interesting to exhibit algorithms that directly build a dBG from a pre-existing index, and especially a contracted dBG, where non-branching paths are condensed into single nodes. Here, we exhibit linear time algorithms for constructing the full or contracted dBGs from suffix trees, suffix arrays, and truncated suffix trees. With the latter, the construction uses a space that is linear in the size of the dBG. Finally, we also provide algorithms to dynamically update the order of the graph without reconstructing it.

© 2016 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

In life sciences, determining the sequence of bio-molecules is an essential step towards the understanding of their functions and interactions within an organism. Powerful sequencing technologies make it possible to obtain huge quantities of short sequencing reads that need to be assembled to infer the complete target sequence. These constraints favour the use of a version of the de Bruijn Graph (dBG) dedicated to genome assembly, a version which differs from the combinatorial structure invented by N.G. de Bruijn [1]. Given a set $S = \{s_1, \ldots, s_n\}$ of $n$ reads and an integer $k$, an assembly de Bruijn Graph, or for short simply de Bruijn Graph, stores each $k$-mer ($k$-long substring) occurring in the reads as a node and has an arc joining two $k$-mers if they appear as successive (and hence overlapping) $k$-mers in at least one read.

The dBG is then traversed to extract long paths, which will form the contigs, i.e., the sequences of sub-regions of the molecule. In non-repetitive regions, the layout of the reads dictates a simple path of $k$-mers without bifurcations. Any simple path between an in-branching node and the next out-branching node can then be contracted into a single arc without losing any information on the graph structure. The sequences of such simple paths are called unitigs (the contraction of unique and contigs). The version of the dBG where each such "non-branching" path is condensed into a single arc is termed the Contracted dBG (CdBG).

* Corresponding author at: LIRMM, CNRS and Université de Montpellier, 161 rue Ada, 34095 Montpellier Cedex 5, France.
E-mail addresses: [email protected] (B. Cazaux), [email protected] (T. Lecroq), [email protected] (E. Rivals).
http://dx.doi.org/10.1016/j.jcss.2016.06.008

Sequencing technologies of the second generation can yield hundreds of millions of reads. Compared to the overlap graph or to the string graph, which were used with previous technologies, the dBG has a number of nodes that is not proportional to the number of reads: it depends on a user controlled parameter $k$, termed the order of the dBG. Its memory usage can be fine tuned through this parameter.

In bioinformatics, dBGs are heavily exploited for genome assembly [2], but for other purposes as well. Actually, some programs mine the dBG to seek graph patterns representing mutations, large insertions/deletions, or chromosomal rearrangements [3]. Others use it to correct sequencing errors in long reads [4].

The de Bruijn Graph is usually built directly from the set of reads, which is time and space consuming. Several compact data structures for storing dBGs have been developed [5,6], including probabilistic ones [7]. The emphasis is placed on the practical space needed to store the dBGs in memory. Moreover, some recent assembly algorithms put forward the advantage of using, for the same input, multiple dBGs with increasing orders [8], thereby emphasising the need for dynamically updating the dBG. In all cases, the construction algorithms need to scan through the whole set of reads.

Several genome assembly programs used hash tables to store the $k$-mers of the reads and allow navigating through the arcs of the dBG, but these solutions suffer from several limitations regarding e.g. functionalities and flexibility. With hash functions, it is often not possible to add extra information to the nodes, like for instance the number of times a $k$-mer is observed in the read set, which is used as a confidence measure. Hash tables make it difficult to compute the contracted dBG or to change the value of $k$. The main advantage of sophisticated hash functions is their memory footprint. For instance, Minia [9] offers a very space efficient storage to handle the dBG based on cascading Bloom filters, which are hash-based probabilistic data structures. This hash table based solution was used for long read error correction and also proves efficient in that context [4].

In studies involving the analysis of sequencing reads, distinct tasks require indexing either all substrings, or the $k$-mers of the reads. For instance, fast dBG assembly programs first count the $k$-mers before building the dBG to estimate the memory needed [7]. Another example: some error correction software builds a suffix tree of all short reads to correct them [10]. Hence, before the assembly starts, the read set has already been scanned through and indexed. It can thus be efficient to enable the construction of the dBG for the subsequent assembly directly from the index rather than from scratch. For these reasons, we set out to find algorithms that transform usual indexes into a dBG or a contracted dBG. It is also of theoretical interest to build bridges between well studied indexes and this graph on words. Despite recent results [11,12], formal methods for constructing dBGs from suffix trees are an open question. In comparison, Simpson and Durbin have proposed an algorithm to build the String Graph from an FM-index [13].

Here, we present algorithms to build directly the CdBG from a Generalised Suffix Tree or from a Generalised Suffix Array of the reads [14–17]. These algorithms take space and time that are linear in the input size. These well-known data structures index all substrings of the reads, and not only their $k$-mers. This results in one drawback and in one advantage.

The drawback is their space occupancy. We will then consider an indexing data structure that reduces the set of indexed substrings: the truncated suffix tree [18,19]. We introduce the reduced truncated suffix tree (TST) and then show how to construct with this index both the dBG and CdBG in time and space that are linear in the size of the final dBG, rather than in the cumulated length of the reads. By size of the dBG we mean the number of nodes plus the number of arcs. This algorithm achieves an optimal time and space complexity.

The advantage is the counterpart: as substrings of all lengths are indexed, it is possible to update the order of the graph, that is, to change dynamically the value of $k$ without reconstructing the dBG. Finally, we provide efficient algorithms for increasing or decreasing the value of $k$. Of course, if one uses the truncated suffix tree instead of the full suffix tree, only some updates remain possible. Our results nevertheless remain applicable to the truncated suffix tree, where the order can be dynamically decreased.

This article includes results that appeared in [20,21].

1.1. Indexing data structures

Suffix trees are well-known indexing data structures that make it possible to store and retrieve all the factors of a given string. The suffix tree of a string $y$ of length $s$ can be built in time and space $O(s)$ on a constant size alphabet [14,22]. Then, it is possible to check if a pattern $x$ of length $m$ is a factor of $y$ in time $O(m)$. Counting the number of occurrences of $x$ in $y$ can also be done in time $O(m)$, while enumerating the positions where $x$ occurs in $y$ can be performed in time $O(m + occ)$, where $occ$ denotes the number of occurrences of $x$ in $y$. Suffix trees can be adapted to a finite set of strings and are then called Generalised Suffix Trees (GSTs). Thus, given a set $S$ of $n$ strings of total length $\|S\|$ on a constant size alphabet, the generalised suffix tree for $S$ can be built in time and space $O(\|S\|)$. For a detailed exposition of properties of suffix trees we refer the reader to [17]. Suffix trees have been widely studied and used in a large number of applications (see [15] and [17]). In practice, they consume too much space and are often replaced by the more economical suffix arrays [16], which have the same properties [23].

When one is only interested in factors of a given length, truncated suffix trees only store the factors of length up to a given constant $k$ of a given string. They can also be built in linear time and space [18]. In practice, truncated suffix trees save a lot of nodes compared to suffix trees.

Fig. 1. $S := \{bacbab, cbabcaa, bcaacb, cbaac, bbacbaa\}$ is a set of words. We have $Support(ba) = \{(1,1),(1,4),(2,2),(4,2),(5,2),(5,5)\}$, $RC(ba) = \{\varepsilon, c, cb, cba, cbab, b, bc, bca, bcaa, a, ac, cbaa\}$, $LC(ba) = \{\varepsilon, c, ac, bac, b, bbac\}$ and $d(ba) = 0$. One has $RC(ba) \cap \Sigma = \{a, b, c\}$. Thus, the word $ba$ is not right extensible in $S$ (see Definition 2).

2. Definitions of de Bruijn graphs

2.1. Notation about strings

Here we introduce notation and basic definitions.

An alphabet $\Sigma$ is a finite set of letters. A finite sequence of elements of $\Sigma$ is called a word or a string. The set of all words over $\Sigma$ is denoted by $\Sigma^*$, and $\varepsilon$ denotes the empty word. For a word $x$, $|x|$ denotes the length of $x$. Given two words $x$ and $y$, we denote by $x \cdot y$ or simply $xy$ the concatenation of $x$ and $y$. For every $1 \le i \le j \le |x|$, $x[i]$ denotes the $i$-th letter of $x$, and $x[i..j]$ denotes the substring or factor $x[i]x[i+1] \ldots x[j]$. Let $k$ be a positive integer. If $|x| \ge k$, $first_k(x)$ is the prefix of length $k$ of $x$ and $last_k(x)$ is the suffix of length $k$ of $x$. A substring of length $k$ of $x$ is called a $k$-mer of $x$. For $i$ such that $1 \le i \le |x| - k + 1$, $(x)_{k,i}$ is the $k$-mer of $x$ starting in position $i$, i.e., $(x)_{k,i} = x[i..i+k-1]$. Thus we have $first_k(x) = (x)_{k,1}$ and $last_k(x) = (x)_{k,|x|-k+1}$. We denote by $\#(\Lambda)$ the cardinality of any finite set $\Lambda$.

Let $S = \{s_1, \ldots, s_n\}$ be a finite set of words. It is our running instance throughout. Let us denote the sum of the lengths of the input strings by $\|S\| := \sum_{s_i \in S} |s_i|$. We denote by

$F(S)$ the set of factors of words of $S$, i.e., $F(S) = \{ w \mid \exists u, v \in \Sigma^*,\ 1 \le i \le n,\ s_i = u w v \}$.

$F_k(S)$ the set of factors of length $k$ of $S$, where $k$ is a positive integer, i.e., $F_k(S) = F(S) \cap \Sigma^k$.

$Suff_k(S)$ the set of suffixes of length $k$ of words of $S$.
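As an illustration of this notation, here is a minimal Python sketch that computes $F(S)$, $F_k(S)$, $Suff_k(S)$ and $\|S\|$ by brute force on the set of words used in Fig. 1. The function names are ours and purely illustrative; the enumeration is quadratic and only meant to make the definitions concrete, not to be efficient. For convenience, `factors` omits the empty word.

```python
def factors(S):
    """F(S) without the empty word: every non-empty substring of a word of S."""
    return {s[i:j] for s in S
            for i in range(len(s))
            for j in range(i + 1, len(s) + 1)}

def kmers(S, k):
    """F_k(S): the factors of length k of the words of S."""
    return {s[i:i + k] for s in S for i in range(len(s) - k + 1)}

def suffixes_k(S, k):
    """Suff_k(S): the suffixes of length k of the words of S."""
    return {s[-k:] for s in S if len(s) >= k}

def first_k(x, k):
    """Prefix of length k of x (assumes 1 <= k <= len(x))."""
    return x[:k]

def last_k(x, k):
    """Suffix of length k of x (assumes 1 <= k <= len(x))."""
    return x[-k:]

# Running example of Fig. 1.
S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]
total_length = sum(len(s) for s in S)   # ||S||
two_mers = kmers(S, 2)
```

On this example, `two_mers` contains eight words, which are the nodes of the graph of order 2 discussed later.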

2.2. Classical definition of de Bruijn graph

All definitions below refer to the set $S$; however, as $S$ is clear from the context, we simply omit the "in $S$" in the notation.

For a word $w$ of $F(S)$,

$Support(w)$ is the set of pairs $(i, j)$, where $w$ is the substring $(s_i)_{|w|,j}$. $Support(w)$ is called the support of $w$ in $S$.

$RC(w)$ (resp. $LC(w)$) is the set of right contexts (resp. left contexts) of the word $w$ in $S$, i.e., the set of words $w'$ such that $w w' \in F(S)$ (resp. $w' w \in F(S)$).

$\lceil w \rceil$ is the word $w w'$ where $w'$ is the longest word of $RC(w)$ such that $Support(w) = Support(w w')$. In other words, such that $w$ and $w w'$ have exactly the same support in $S$.

$\lfloor w \rfloor$ is the word $w'$ where $w'$ is the longest prefix of $w$ such that $Support(w') \neq Support(w)$.

$d(w) := |\lceil w \rceil| - |w|$.

In other words, $\lceil w \rceil$ is the longest extension of $w$ having the same support as $w$ in $S$, while $\lfloor w \rfloor$ is the shortest reduction of $w$ with a support different from that of $w$ in $S$. These definitions are illustrated in a running example presented in Fig. 1.

We give the definition of a de Bruijn graph for assembly (dBG for short), which differs from the original definition of a complete graph over all possible words of length $k$ stated by de Bruijn [1].

Definition 1. Let $k$ be a positive integer. The de Bruijn graph of order $k$ for $S$, denoted by $DBG_k^+$, is a directed graph, $DBG_k^+ := (V_k^+, E_k^+)$, whose vertices are the $k$-mers of words of $S$ and where an arc links $u$ to $v$ if and only if $u$ and $v$ are two successive $k$-mers of a word of $S$, i.e.:

$$V_k^+ := F_k(S)$$

$$E_k^+ := \{ (u, v) \in (V_k^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } v[k] \in RC(u) \}.$$

Fig. 2. Examples of arcs from $DBG_k^+$. (a) shows letters in the right context of $ba$, and (b) the successors of node $ba$ in $DBG_2^+$; one for each letter in $RC(w)$. (c) shows letters in the left context of $ba$, and (d) the predecessors of node $ba$ in $DBG_2^+$; one for each letter in $LC(w)$.

Fig. 3. With solid arcs only, the graphs correspond to $DBG_2^+$ (a) and $DBG_3^+$ (b) for our running example. With both solid and dotted arcs, they represent $DBG_2^-$ (a) and $DBG_3^-$ (b).

An equivalent definition of $E_k^+$ can be stated using the left instead of the right context:

$$E_k^+ := \{ (u, v) \in (V_k^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } u[1] \in LC(v) \}.$$

Examples of arcs are displayed in Fig. 2. The size of $DBG_k^+$ is defined as $size(DBG_k^+) := \#(V_k^+) + \#(E_k^+)$. Note that another, simpler definition of the arcs in the de Bruijn graph coexists with that of Definition 1. There, an arc links $u$ to $v$ if and only if $u$ overlaps $v$ by $k-1$ symbols. This graph is denoted by $DBG_k^- = (V_k^-, E_k^-)$, where:

$$V_k^- := F_k(S)$$

$$E_k^- := \{ (u, v) \in (V_k^-)^2 \mid last_{k-1}(u) = first_{k-1}(v) \}.$$

The arcs of $E_k^-$ satisfy fewer constraints than those of $E_k^+$; hence, $E_k^+$ is a subset of $E_k^-$. Both definitions are illustrated in Fig. 3. Some assembly programs use $DBG_k^-$ [9]. All the algorithmic results that we obtain for $DBG_k^+$ remain valid for $DBG_k^-$. In the sequel, we focus only on $DBG_k^+$.

Let us now introduce the notions of extensibility for a substring of $S$ and that of a Contracted dBG (CdBG for short).

Definition 2 (Extensibility). Let $w$ be a word of $F(S)$.

$w$ is right extensible in $S$ if and only if $\#(RC(w) \cap \Sigma) = 1$.

$w$ is left extensible in $S$ if and only if $\#(LC(w) \cap \Sigma) = 1$.

Let $w$ be a word of $\Sigma^*$. The word $w$ is said to be a unique $k'$-mer of $S$ if and only if $k' \ge k$ and for all $i \in [1..k'-k+1]$,

Fig. 4. The graphs correspond to $CDBG_2^+$ (a) and $CDBG_3^+$ (b) for our running example.

Definition 3. A contracted de Bruijn graph of order $k$, denoted by $CDBG_k^+ = (V_{k,c}^+, E_{k,c}^+)$, is a directed graph where:

$$V_{k,c}^+ = \{ w \mid w \text{ is a } k'\text{-mer unique maximal by substring and } k' \ge k \}$$

$$E_{k,c}^+ = \{ (u, v) \in (V_{k,c}^+)^2 \mid last_{k-1}(u) = first_{k-1}(v) \text{ and } v[k] \in RC(last_k(u)) \}.$$

Examples of $CDBG_k^+$ are displayed in Fig. 4. Note that in the previous definition, an element $w$ in $V_{k,c}^+$ does not necessarily belong to $F(S)$, since $w$ may only exist as the substring of the agglomeration of two words of $S$. Thus, let $w$ be a $k'$-mer unique maximal by substring with $k' \ge k$:

$last_k(w)$ is not right extensible, or $RC(last_k(w)) \cap \Sigma = \{a\}$ and $last_{k-1}(w) \cdot a$ is not left extensible,

$first_k(w)$ is not left extensible, or $LC(first_k(w)) \cap \Sigma = \{a\}$ and $a \cdot first_{k-1}(w)$ is not right extensible.

With this argument, we have the two following propositions.

Proposition 1. Let $(u, v) \in E_{k,c}^+$; then $(last_k(u), first_k(v)) \in E_k^+$ and there exists $w \in V_k^+$ such that $(w, first_k(v)) \in E_k^+ \setminus \{(last_k(u), first_k(v))\}$ or $(last_k(u), w) \in E_k^+ \setminus \{(last_k(u), first_k(v))\}$.

Proposition 2. Let $(u, v) \in E_k^+$. If $u$ is right extensible and $v$ is left extensible, then there exists $w \in V_{k,c}^+$ such that $u \cdot v[k]$ is a substring of $w$. Otherwise, there exists $(u', v') \in E_{k,c}^+$ such that $u = last_k(u')$ and $v = first_k(v')$.

According to Propositions 1 and 2, $CDBG_k^+$ is the graph $DBG_k^+$ where the arcs $(u, v)$ are contracted if and only if $u$ is right extensible and $v$ is left extensible.

2.3. Constructive characterisation of the de Bruijn graph

Let $k$ be a positive integer. We define the following three subsets of $F(S)$.

$$InitExact_k = \{ w \in F(S) \mid |w| = k \text{ and } d(w) = 0 \}$$

$$Init_k = \{ w \in F(S) \mid |w| \ge k \text{ and } d(first_k(w)) = |w| - k \}$$

$$SubInit_k = InitExact_{k-1}$$

A word of $InitExact_k$ is either only the suffix of some $s_i$ or has at least two right extensions, while the first $k$-mer of a word in $Init_k \setminus InitExact_k$ has only one right extension.

Proposition 3. $InitExact_k = Init_k \cap \{ w \in F(S) \mid |w| = k \}$.

Proof. Let $w \in InitExact_k$. In this case, we get $first_k(w) = w$ and $|w| - k = 0$. This means that $d(first_k(w)) = |w| - k$ and therefore $w \in Init_k$. □

For $w$ an element of $Init_k$, $first_k(w)$ is a $k$-mer of $S$. Given two distinct words $w_1$ and $w_2$ of $Init_k$, $first_k(w_1)$ and $first_k(w_2)$ are distinct $k$-mers of $S$. Furthermore, for each $k$-mer $w'$ of $S$, there exists a word $w$ of $Init_k$ such that $first_k(w) = w'$. From this, we get the following proposition.

Proposition 4. There exists a bijection between $Init_k$ and the set of the $k$-mers of $S$.

According to Definition 1 and Proposition 4, each vertex of $DBG_k^+$ can be assimilated to a unique element of $Init_k$. As the

To define the arcs between the words of $Init_k$, which correspond to arcs of $DBG_k^+$, we need the following proposition, which states that each single letter that is a right extension of $w$ gives rise to a single arc.

Proposition 5. For $w \in InitExact_k$ and $a \in RC(w)$, there exists a unique $w' \in Init_k$ such that $last_{k-1}(w)\,a$ is a prefix of $w'$.

Proof. Let $w$ be a word of $InitExact_k$ and $a$ a letter of $RC(w)$. By definition of the right context, $last_{k-1}(w)\,a \in F(S)$. As $|last_{k-1}(w)\,a| = k$, there exists $w'$ such that $last_{k-1}(w)\,a$ is a prefix of $w'$ and $|last_{k-1}(w)\,a| + d(last_{k-1}(w)\,a) = |w'|$. By definition of $Init_k$, $w' \in Init_k$. □

The set $Init_k$ represents the nodes of $DBG_k^+$. Let us now build the set of arcs that is isomorphic to $E_k^+$. Let $w$ be a word of $Init_k$ and $Succ_k(w)$ denote the set of successors of $first_k(w)$: $Succ_k(w) := \{ x \in Init_k \mid (first_k(w), first_k(x)) \in E_k^+ \}$. We know that for each letter $a$ in $RC(w)$, there exists an arc from $first_k(w)$ to $first_k(last_{|w|-1}(w)\,a)$ in $DBG_k^+$. We consider two cases depending on the length of $w$:

Case 1. $|w| = k$.

According to Proposition 3, $w \in InitExact_k$ and hence $last_{k-1}(w) \in SubInit_k$. Therefore, the outgoing arcs of $w$ in $DBG_k^+$ are the arcs from $w$ to $w'$ satisfying the condition of Proposition 5. Then,

$$Succ_k(w) = \bigcup_{a \in RC(w)} \{ \lceil last_{k-1}(w)\,a \rceil \}.$$

Case 2. $|w| > k$.

As $w$ is longer than $k$, it contains the next $k$-mer; hence $first_k(last_{|w|-1}(w)\,a) = first_k(last_{|w|-1}(w))$, and there exists a unique outgoing arc of $w$: that from $w$ to $\lceil w[2..k+1] \rceil$. Indeed, by definition of $Init_k$, $\lceil w[2..k+1] \rceil \in Init_k$, and thus

$$Succ_k(w) = \{ \lceil w[2..k+1] \rceil \}.$$

Now, we can build $DBG_k^+$ in full, or more exactly a graph isomorphic to $DBG_k^+$.

Theorem 1. With the sets $Init_k$, $InitExact_k$ and $SubInit_k$, we can build a graph isomorphic to $DBG_k^+$ in time linear in the size of these sets.
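The characterisation in terms of words can be simulated without any index. The Python sketch below is a brute-force illustration under our own naming (`occurrences`, `ceil_word`, `init_k`, `succ_k` are hypothetical helpers): it computes $Init_k$ as the set of words $\lceil x \rceil$ for each $k$-mer $x$, then derives $Succ_k$ with the two cases above, Case 2 being implemented with the next $k$-mer $w[2..k+1]$. It is far from linear time; an index is what makes each step constant time.

```python
def occurrences(w, S):
    """All (word, 0-based start) pairs where w occurs."""
    return [(s, j) for s in S
            for j in range(len(s) - len(w) + 1) if s.startswith(w, j)]

def ceil_word(w, S):
    """Longest right extension of w with the same support."""
    while True:
        nxt = {s[j + len(w)] if j + len(w) < len(s) else None
               for s, j in occurrences(w, S)}
        if len(nxt) != 1 or None in nxt:
            return w
        w += nxt.pop()

def init_k(S, k):
    """Init_k: one word ceil(x) per k-mer x of S."""
    kmers = {s[i:i + k] for s in S for i in range(len(s) - k + 1)}
    return {ceil_word(x, S) for x in kmers}

def succ_k(w, S, k):
    """Successors of the node represented by w (a word of Init_k)."""
    if len(w) == k:
        # Case 1: one successor per letter that can follow w.
        letters = {s[j + k] for s, j in occurrences(w, S) if j + k < len(s)}
        return {ceil_word(w[1:] + a, S) for a in letters}
    # Case 2: w already contains the next k-mer, w[2..k+1] in 1-based notation.
    return {ceil_word(w[1:k + 1], S)}

def dbg_from_init(S, k):
    """Arcs of DBG_k^+ recovered from Init_k and Succ_k, as pairs of k-mers."""
    return {(w[:k], x[:k]) for w in init_k(S, k) for x in succ_k(w, S, k)}

S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]
```

On the running example, the arcs obtained this way coincide with the pairs of successive $k$-mers read directly from the words.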

For simplicity, from now on, we identify the graph we build with $DBG_k^+$.

2.4. Constructive characterisation of the contracted de Bruijn graph

To do the same with $CDBG_k^+$, we first explain the algorithm that we use to build this graph, and then characterise the concepts of right and left extensibility in terms of word properties.

Our algorithm to build $CDBG_k^+$. We present a generic algorithm to build $CDBG_k^+$ incrementally. It is explained in terms of words, and does not depend on any indexing data structure. In the following sections, we will use this generic algorithm and explain how it can be performed efficiently using a specified indexing structure.

The main algorithm (Algorithm 2) explores $DBG_k^+$ to find the nodes kept in $CDBG_k^+$ and to set all single arcs that represent whole non-branching paths of $DBG_k^+$ that are properly contracted. The key point is to find all starting nodes of simple paths and explore these paths from them; the exploration is done by Algorithm 1.

A more detailed explanation. First, note that to build $DBG_k^+$ it suffices to know the set $Succ_k(.)$ for each node. The algorithm below simulates a traversal of $DBG_k^+$ without building it, and stores only one node per unique maximal $k'$-mer of $DBG_k^+$. For such a $k'$-mer, say $m$, we choose to represent it by the node $v$ such that $first_k(v)$ is a prefix of $m$. In $DBG_k^+$, $m$ is represented by a simple (i.e., non-branching) path and $v$ is its first node. In the traversal algorithm, for a current starting node $v_c$ in $Init_k$, we traverse the simple path until we arrive at a node $u$ having several successors or such that its only successor is not left extensible (i.e., has several predecessors). In other words, until we find $u$ such that $u$ is not right extensible or $next(u)$ is not left extensible. In $DBG_k^+$, there exists a simple path between $v_c$ and $u$, and this must build a single node in $CDBG_k^+$. To contract this path, we choose to keep $v_c$, and for any successor $w$ of $u$, we insert an arc between $u$ and $w$, as this arc cannot be contracted. Noting that $w$ necessarily starts a chain (having at least a single node), if $w$ is not yet in $CDBG_k^+$, we launch a new path exploration starting from $w$; one gets that $first_k(w)$ is the prefix of a node of $CDBG_k^+$, and thus $w$ can appropriately represent the path. Now, if $w$ already belongs to $CDBG_k^+$, the case is trickier. If $v_f$ stores the first $v_c$ called by the procedure, it may not be the starting node of a path, but be

Algorithm 1: BuildAuxCDBG(V, E, v_f, v_c).

Input: The partial contracted graph CDBG_k^+ as (V, E), two nodes v_f and v_c; v_f is the initial starting node, and v_c the current starting node.
Output: The updated contracted graph (V, E), which now contains all paths starting from v_c.

begin
    u := v_c; mark u
    // search the node ending the chain that goes through v_c
    while u is right extensible and next(u) is left extensible do
        if v_f = next(u) then
            update (v_f, i) by (v_c, i) for all (v_f, i) ∈ E
            return (V \ {v_f}, E)
        u := next(u); mark u
    // now explore the path starting in the successor of u
    for w ∈ Succ_k(u) do
        if w ∈ V then
            (V, E) := (V, E ∪ {(v_c, w)})
        else
            (V, E) := BuildAuxCDBG(V ∪ {w}, E ∪ {(v_c, w)}, v_f, w)    // explore from node w
    return (V, E)

Algorithm 2: BuildCDBG(S).

Input: A set of words S.
Output: CDBG_k^+ of S.

begin
    (V, E) := (∅, ∅)
    // search for any node v of DBG_k^+ without predecessors
    // and build CDBG_k^+ from v
    for v ∈ Init_k do
        if there exists no w such that v ∈ Succ_k(w) then
            (V, E) := (V, E) ∪ BuildAuxCDBG(V ∪ {v}, E, v, v)
    // explore DBG_k^+ from any node not yet visited
    for v_c an unmarked node of Init_k do
        (V, E) := (V, E) ∪ BuildAuxCDBG(V ∪ {v_c}, E, v_c, v_c)
    return (V, E)

path: hence we must update $V$ by exchanging $v_f$ with $v_c$ and terminate the exploration. Otherwise, $v_f$ is traversed during the for loop (as the value of $w$), then it is a successor of $u$ and the beginning of a simple path: we just add an arc linking $v_c$ to $w$ and stop. Finally, if $w$ already belongs to $V$ but $w \neq v_f$, we also add an arc linking $v_c$ to $w$ and stop.

The process performed by Algorithm 1 augments the partial graph $CDBG_k^+$ restrained to the nodes visited when exploring the path starting from $v_c$. It suffices now to ensure that all arcs of $DBG_k^+$ are examined, which Algorithm 2 does. More precisely, it starts by visiting the simple paths starting at nodes having no predecessors (otherwise these nodes would not be visited). Once this is done, one must explore all nodes not yet marked and continue until all nodes have been visited/marked.
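The contraction itself can be illustrated with a short Python sketch. It is deliberately not a transcription of Algorithms 1 and 2: it is a simplified, iterative variant that first materialises $DBG_k^+$ from the words (which the algorithms above avoid), and it tests extensibility through out-degree and in-degree, which is an assumption of this sketch. The function name `contracted_dbg` is ours. Each contracted node is labelled by the word spelled along its non-branching path.

```python
from collections import defaultdict

def contracted_dbg(S, k):
    """Return (V, E) where V holds the words spelled by maximal non-branching paths."""
    nodes = {s[i:i + k] for s in S for i in range(len(s) - k + 1)}
    succ, pred = defaultdict(set), defaultdict(set)
    for s in S:
        for i in range(len(s) - k):
            u, v = s[i:i + k], s[i + 1:i + k + 1]
            succ[u].add(v)
            pred[v].add(u)

    def contractible(u, v):
        # Arc (u, v) is merged iff u has one successor and v has one predecessor.
        return len(succ[u]) == 1 and len(pred[v]) == 1

    visited = set()

    def walk(start):
        """Follow contractible arcs from start; return (spelled word, last k-mer)."""
        word, u = start, start
        visited.add(u)
        while len(succ[u]) == 1:
            (v,) = succ[u]
            if len(pred[v]) != 1 or v == start:   # branching, or cycle closed
                break
            word, u = word + v[-1], v
            visited.add(u)
        return word, u

    # Paths start at nodes that have no contractible incoming arc.
    paths = [walk(v) for v in sorted(nodes)
             if not any(contractible(u, v) for u in pred[v])]
    # Remaining nodes lie on isolated cycles made only of contractible arcs.
    for v in sorted(nodes):
        if v not in visited:
            paths.append(walk(v))

    by_first = {word[:k]: word for word, _ in paths}
    V = set(by_first.values())
    E = {(word, by_first[v]) for word, last in paths for v in succ[last]}
    return V, E

S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]
V2, E2 = contracted_dbg(S, 2)
```

On the running example with $k = 2$, the eight $2$-mers collapse into five contracted nodes, two of which ($acb$ and $abca$) are longer than $k$.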

From the above discussion, we obtain the following theorem.

Theorem 2. Assume one can determine in constant time, for an arc $(u, v)$ of $E_{k,c}^+$, whether $u$ is right extensible and whether $v$ is left extensible. Then, with the sets $Init_k$, $InitExact_k$ and $SubInit_k$, Algorithm 2 builds a graph that is isomorphic to $CDBG_k^+$ in time linear in the size of these sets.

Remark. Executing Algorithm 2 does not require building $DBG_k^+$, since the set of successors $Succ_k(u)$ of any node $u$ is computed in constant time.

Characterisation of the concepts of right and left extensibility. By the construction of $DBG_k^+$, we get the following properties, which will turn out to be useful for the construction of the CdBG from specific indexes (Sections 3 and 4).

Proposition 7. Let $w$ be a word of $Init_k$ such that $first_k(w)$ is right extensible. Let the letter $a$ be the unique element of $RC(first_k(w)) \cap \Sigma$; then $last_{k-1}(first_k(w))\,a$ is left extensible if and only if

$$\#(Support(first_k(w))) = \#\big(Support(last_{k-1}(first_k(w))\,a) \setminus \{(i, 1) \mid 1 \le i \le n\}\big).$$

Proof. Let $(i, j)$ be a pair of $Support(first_k(w))$. We have $(i, j+1) \in Support(last_{k-1}(first_k(w)))$. As $Support(last_{k-1}(first_k(w))) = Support(last_{k-1}(first_k(w))\,a)$, it follows that $(i, j+1) \in Support(last_{k-1}(first_k(w))\,a)$.

If there exists $(i, j) \in Support(last_{k-1}(first_k(w)))$ such that $j > 0$ and $(i, j-1) \notin Support(first_k(w))$, there exists a letter $b \neq w[1]$ such that $(i, j-1) \in Support(b \cdot last_{k-1}(first_k(w)))$. Hence $(b \cdot last_{k-1}(first_k(w)),\ last_{k-1}(first_k(w))\,a)$ also belongs to $E^+$, and thus $last_{k-1}(first_k(w))\,a$ is not left extensible. □

In summary, this section gives a formulation of the dBG of $S$ in terms of words. Now assume that the substrings of the words are indexed in a data structure, e.g. a generalised suffix array. How can we build the dBG or the contracted graph directly from this structure? To achieve this, it suffices to compute the three sets $Init_k$, $InitExact_k$, $SubInit_k$, as well as the sets $Support(.)$ and $Succ_k(.)$ for some appropriate substrings. In the following sections, we exhibit algorithms to compute $DBG_k^+$ and $CDBG_k^+$ for two important indexing structures and for a home-made truncated data structure.

3. Transition from an indexing data structure to de Bruijn graphs

3.1. From a generalised suffix tree

Suffix Trees (ST) belong to the most studied indexing data structures. A generalised ST can index the substrings of a set of words. Generally, for this purpose, all words are concatenated and separated by a special symbol not occurring elsewhere. However, this trick is not compulsory, and an alternative is to keep the indication of a terminating node within each node.

3.1.1. The suffix tree and its properties

The Generalised Suffix Tree of a set of words $S$ is the suffix tree of $S$, where each word of $S$ does not necessarily finish by a letter of unique occurrence. Hence, for each node $v$ of the Generalised Suffix Tree of $S$, we keep in memory the set, denoted by $Suff(v)$, of pairs $(i, j)$ such that the word represented by $v$ is the suffix of $s_i$ starting at position $j$. Let us denote by $T$ the generalised suffix tree of $S$ (from now on, we simply say the tree) and by $V_T$ its set of nodes. For $v \in V_T$, $Children(v)$ denotes its set of children and $f(v)$ its parent. See Fig. 5 for an example of GST.

Some nodes of $T$ may have just one child. The size of the union of $Suff(v)$ for all nodes $v$ of $T$ equals the number of leaves in the generalised suffix tree when the words end with a terminating symbol. Hence, the space to store $T$ and the sets $Suff(.)$ is linear in $\|S\|$. For simplicity, for a node $v$ of $T$, the word represented by $v$ is identified with $v$. For each node $v$ of $T$, $v \in F(S)$. As all elements of $F(S)$ are not necessarily represented by a node of $T$, we give the following proposition.

Proposition 8. The set of nodes of $T$ is exactly the set of words $w$ of $F(S)$ such that $d(w) = 0$.

We recall the notion of a suffix link (SL) for any node $v$ of $T$ (leaves included). Let $sl(v)$ denote the node targeted by the suffix link of $v$, i.e., $sl(v) = v[2..|v|]$. By definition of a suffix tree, for all $w \in F(S)$, there exists a node $v$ of $T$ such that $w$ is a prefix of $v$. Let $v'$ be the node of minimal length of $T$ such that $w$ is a prefix of $v'$; then $|v'| = |w| + d(w)$, and therefore $\lceil w \rceil = v'$.

Proposition 9. Let $w \in F(S)$. Then $|\lceil w \rceil| \ge |w| > |f(\lceil w \rceil)|$, where $f(\lceil w \rceil)$ is the parent of $\lceil w \rceil$ in $T$.

Proof. As $f(\lceil w \rceil) = \lfloor w \rfloor$, the result is obvious. □

3.1.2. Construction of $DBG_k^+$

Let $[x_1..x_m]$ be the set of $k$-mers of $S$. According to the definition of $Init_k$ and to Proposition 4, $Init_k = [\lceil x_1 \rceil .. \lceil x_m \rceil]$. Thus, by Proposition 9, $Init_k = \{ v \in V_T \mid |f(v)| < k \text{ and } |v| \ge k \}$. Similarly, $InitExact_k = \{ v \in V_T \mid |v| = k \}$. Now, it appears clearly that $InitExact_k$ is a subset of $Init_k$, since for all $v \in V_T$, $|f(v)| < |v|$.
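The selection rule $|f(v)| < k \le |v|$ can be tried out with the following Python sketch. As an assumption made to keep the example short, it does not build a real (linear-size) generalised suffix tree: it builds an uncompressed suffix trie and treats as tree nodes the root, the branching trie nodes and the nodes where a suffix ends, which are the words with $d(w) = 0$ of Proposition 8. The marker `"$"` and the function names are ours, and `"$"` is assumed not to belong to the alphabet.

```python
def build_suffix_trie(S):
    """Uncompressed trie of all suffixes; the key "$" marks the end of a suffix."""
    root = {}
    for s in S:
        for j in range(len(s)):
            node = root
            for ch in s[j:]:
                node = node.setdefault(ch, {})
            node["$"] = True
    return root

def init_k_from_tree(S, k):
    """Words of the tree nodes v such that |f(v)| < k <= |v|."""
    init = set()
    # Stack entries: (trie node, word spelled so far, length of nearest tree ancestor).
    stack = [(build_suffix_trie(S), "", 0)]
    while stack:
        node, word, parent_len = stack.pop()
        children = [c for c in node if c != "$"]
        is_tree_node = word == "" or "$" in node or len(children) >= 2
        if is_tree_node and parent_len < k <= len(word):
            init.add(word)
        if is_tree_node:
            parent_len = len(word)
        if parent_len < k:              # deeper nodes cannot qualify otherwise
            for c in children:
                stack.append((node[c], word + c, parent_len))
    return init

S = ["bacbab", "cbabcaa", "bcaacb", "cbaac", "bbacbaa"]
init2 = init_k_from_tree(S, 2)
```

The first $k$ letters of the selected words are pairwise distinct and cover all $k$-mers of $S$, in line with Proposition 4.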

Fig. 5. The generalised suffix tree for our running example and the constructed de Bruijn graph for $k := 2$. Square nodes represent words that occur as a suffix of some $s_i$, circle nodes are the other nodes of $T$. Nodes in grey are those used to represent the nodes of the dBG. Each square node stores its positions of occurrences in $S$; for simplicity, we display the starting position as a number and the word of $S$ in which it occurs as its colour, instead of showing the list of pairs $(i, j)$. The solid curved arrows are the edges of the de Bruijn graph for $k := 2$; those coloured in red correspond to Case 1 and those in blue to Case 2. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 6. The figures (a), (b) and (c) show Case 1 and Case 2 encountered when computing the arcs of $DBG_k^+$. The green node represents the node $v$, and the one in orange $sl(v)$. The dashed arcs correspond to suffix links. Arcs of $DBG_k^+$ are in solid line and coloured in red for Case 1 (a), or in blue for Case 2 (b), (c). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Case 1. $|v| = k$ (Fig. 6a).

As $v \in InitExact_k$, $sl(v) \in SubInit_k$. Therefore, each child $u$ of $sl(v)$ is an element of $Init_k$. Thus, the outgoing arcs of $v$ in $DBG_k^+$ are the arcs from $v$ to the child $u$ of $sl(v)$ where the first letter of the label between $sl(v)$ and $u$ is an element of the right context of $v$. As the set of the first letters of the labels between $v$ and the children of $v$ is exactly $RC(v)$, the number of outgoing arcs of $v$ in $DBG_k^+$ is the number of children of $v$. To build the outgoing arcs of $v$ in $DBG_k^+$, for each child $u$ of $v$, we associate $v$ with the node of $Init_k$ between the root and $sl(u)$, i.e., $\lceil first_k(sl(u)) \rceil$.

Case 2. $|v| > k$ (Figs. 6b and 6c).

We have that $sl(v)$ is a node of $V_T$. As $|v| > k$, $|sl(v)| \ge k$. Thus, there exists an element of $Init_k$ between the root and $sl(v)$. We associate $v$ with this node, i.e. $\lceil first_k(sl(v)) \rceil$.

We illustrate these two cases in Fig. 5:

Case 1. Case where $v$ is 6,6, $sl(v)$ is 7,7, the unique child $u$ of $v$ is 3, and $sl(u)$ is 4, which is in $Init_k$.

Case 2. Case where $v$ is 1, $sl(v)$ is 2, and $\lceil first_k(sl(v)) \rceil$ is .

In both cases, building the arcs of $E^+$ requires following the SL of some node. The node, say $u$, pointed at by a SL may not be initial. Hence, the initial node representing the associated first $k$-mer of $u$ is the only ancestral initial node of $u$. We equip each such node $u$ with a pointer $p(u)$ that points to the only initial node on its path from the root. In other words, for any $u \notin Init_k$ such that $|u| > k$, one has $p(u) := \lceil first_k(u) \rceil$.

The algorithm to build $DBG_k^+$ is as follows. An initial depth first traversal of $T$ collects the nodes of $Init_k$ and, for each such node, sets the pointer $p(.)$ of all its descendants in the tree. Finally, to build $E^+$, one scans through $Init_k$ and for each node $v$ one adds $Succ_k(v)$ to $E^+$ using the formula given above. Altogether this algorithm takes a time linear in the size of $T$. Moreover, the number of arcs in $E^+$ is linear in the total number of children of initial nodes. This gives us the following result.

Theorem 3. For a set of words $S$, building the de Bruijn Graph of order $k$, $DBG_k^+$, takes linear time and space in $|T|$, i.e., in $\|S\|$.

3.1.3. Construction of $CDBG_k^+$

In Section 2.4, we have seen an algorithm that computes $CDBG_k^+$ directly, provided that one can determine if a node $v$ is right extensible and if $next(v)$ is left extensible, where $next(v)$ denotes the only successor of $v$. Let us see how to compute the extensibility in the case of a Suffix Tree.

By applying Proposition 6 in the case of a tree, for an element $v$ of $Init_k$, $first_k(v)$ is right extensible if and only if $|v| > k$ or $\#(Children(v)) = 1$. Thus checking the right extensibility of a node takes constant time.

For the left extensibility of the single successor of a node, one only needs the size of the support of some nodes (Proposition 7). Let us see first how to compute $\#(Support(.))$ on the tree, and then how to apply Proposition 7.

Proposition 10. Let $v$ be a word of $F(S)$ and let $V_T(\lceil v \rceil)$ denote the set of nodes of the subtree rooted in $\lceil v \rceil$. Then

$$Support(v) = \bigcup_{v' \in V_T(\lceil v \rceil)} Suff(v').$$

Along a traversal of the tree, we can compute and store $\#(Support(v))$ and $\#(Support(v) \cap \{(i, 1) \mid 1 \le i \le n\})$ for each node $v$ in linear time in $|T|$.

Letv beawordofInitk suchthat f irstk

(

v

)

isrightextensible.

Case 1. If $|v| = k$, then $\mathit{first}_k(v) = v$ and $\#(\mathit{Children}(v)) = 1$. Let $u$ be the only child of $v$. Thus, $|u| > k$, $RC(v) = \{u[k+1]\}$, and $\mathit{last}_{k-1}(v)\,u[k+1] = \mathit{first}_k(sl(u))$. Hence, $\#(\mathit{Support}(v)) = \#(\mathit{Support}(\mathit{first}_k(sl(u))) \setminus \{(i,1) \mid 1 \le i \le n\})$ and, by Proposition 7, $\mathit{first}_k(sl(u))$ is left extensible.

Case 2. If $|v| > k$, then $RC(\mathit{first}_k(v)) = \{v[k+1]\}$ and

$$\mathit{last}_{k-1}(\mathit{first}_k(v))\,v[k+1] = \mathit{last}_k(\mathit{first}_{k+1}(v)) = \mathit{first}_k(sl(v)).$$

By Proposition 7, $\mathit{first}_k(sl(v))$ is left extensible if and only if

$$\#(\mathit{Support}(\mathit{first}_k(v))) = \#(\mathit{Support}(\mathit{first}_k(sl(v))) \setminus \{(i,1) \mid 1 \le i \le n\}).$$

As $\#(\mathit{Support}(\mathit{first}_k(v))) = \#(\mathit{Support}(v))$ and $\#(\mathit{Support}(v) \setminus \{(i,1) \mid 1 \le i \le n\}) = \#(\mathit{Support}(v)) - \#(\mathit{Support}(v) \cap \{(i,1) \mid 1 \le i \le n\})$, determining whether $\mathit{next}(v)$ is left extensible takes constant time. To conclude, since for any initial node $v$ we can compute in $O(1)$ time its set of successors $\mathit{Succ}_k(v)$, its right extensibility, and the left extensibility of its single successor, we can readily apply Algorithm 2 to build $CDBG_k^+$, and we obtain a complexity that is linear in the size of $DBG_k^+$, since each successor is accessed only once. This yields Theorem 4.

Theorem 4. For a set of words $S$, building the Contracted de Bruijn Graph of order $k$, $CDBG_k^+$, takes linear time and space in $|T|$, i.e., in $\|S\|$.

3.2. From a generalised suffix array

In the previous subsections we have shown how to build de Bruijn graphs from suffix trees. Suffix trees are very elegant data structures, but they are too space-consuming in practice. In many applications they have been replaced by suffix arrays, which are equivalent data structures and are more space economical. We will now show how to build de Bruijn graphs from suffix arrays.

Let SA and LCP be the generalised enhanced suffix array of $S$:

• $\forall\, 1 \le i < \|S\|$, if $SA[i] = (g,h)$ and $SA[i+1] = (g',h')$, then $s_g[h\,..\,|s_g|] < s_{g'}[h'\,..\,|s_{g'}|]$,

• $\forall\, 2 \le i \le \|S\|$, $LCP[i]$ is the length of the longest common prefix between the suffixes stored in $SA[i-1]$ and in $SA[i]$, and $LCP[1] = LCP[\|S\|+1] = -1$.


Let us recall the definition of an lcp-interval.

Definition 4 ([23]). An interval $[i,j]$, $1 \le i < j \le \|S\|$, is called an lcp-interval of value $\ell$, also denoted by $\ell$-$[i,j]$, iff:

1. $LCP[i] < \ell$,
2. $LCP[g] \ge \ell$ for $i < g \le j$,
3. $LCP[g] = \ell$ for at least one $g$ such that $i < g \le j$,
4. $LCP[j+1] < \ell$.

Let us now recall the definitions of the previous and next smaller values (PSV and NSV) arrays.

Definition 5 ([23]). For $2 \le i \le \|S\|$:

$$PSV[i] = \max\{j \mid 1 \le j < i \text{ and } LCP[j] < LCP[i]\},$$

$$NSV[i] = \min\{j \mid i < j \le \|S\|+1 \text{ and } LCP[j] < LCP[i]\}.$$
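As an illustration of Definition 5, both arrays can be filled with the classical stack-based scan for previous and next smaller values. The sketch below is not taken from the paper: `psv_nsv` is a hypothetical name, and the list `lcp` is assumed to be 1-indexed with an unused entry at position 0 and the sentinels `lcp[1] == lcp[n + 1] == -1`.

```python
def psv_nsv(lcp):
    """Return (psv, nsv) for a 1-indexed LCP list with -1 sentinels at both ends."""
    n = len(lcp) - 2                  # lcp[0] is padding, lcp[n + 1] is a sentinel
    psv = [None] * (n + 2)
    nsv = [None] * (n + 2)

    # Left-to-right scan: the stack holds indices with strictly increasing LCP.
    stack = [1]                       # lcp[1] == -1 is never popped
    for i in range(2, n + 1):
        while lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        psv[i] = stack[-1]            # closest j < i with lcp[j] < lcp[i]
        stack.append(i)

    # Right-to-left scan, symmetric, using the sentinel at position n + 1.
    stack = [n + 1]
    for i in range(n, 1, -1):
        while lcp[stack[-1]] >= lcp[i]:
            stack.pop()
        nsv[i] = stack[-1]            # closest j > i with lcp[j] < lcp[i]
        stack.append(i)
    return psv, nsv
```

Each index is pushed and popped at most once per scan, so the two arrays are obtained in time linear in the length of the LCP array.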

Recall that if $2 \le i \le \|S\|$ then $[PSV[i], NSV[i]-1]$ is an lcp-interval of value $LCP[i]$. The direct inclusion among lcp-intervals defines a tree relationship called the lcp-interval tree (see [23, Def. 4.4.3, p. 87]). Given an lcp-interval $\ell$-$[i,j]$, its parent lcp-interval $\ell'$-$[i',j']$ can be easily computed in constant time using the arrays LCP, PSV and NSV. Then:

• $\mathit{Init}_k$ consists of:
  – the lcp-intervals $\ell$-$[i,j]$ such that $\ell \ge k$ and the parent interval $\ell'$-$[i',j']$ of $\ell$-$[i,j]$ is such that $\ell' < k$ (the associated string is $s_{SA[i].g}[SA[i].h\,..\,SA[i].h+\ell-1]$);
  – the positions $SA[i] = (g,h)$ such that $i$ is not contained in lcp-intervals $\ell$-$[i',j']$ with $\ell \ge k$ and $h \le |s_g| - k + 1$ (the associated string is $s_g[h\,..\,|s_g|]$);
• $\mathit{InitExact}_k$ is composed of the lcp-intervals $k$-$[i,j]$;
• $\mathit{SubInit}_k = \mathit{InitExact}_{k-1}$.

Actually the lcp-interval tree does not need to be explicitly built, and the sets can be computed by a single scan of the SA and LCP arrays.

For an lcp-interval $\ell$-$[i,j] \in \mathit{Init}_k$ we have $\#(\mathit{Support}(s_{SA[i].g}[SA[i].h\,..\,SA[i].h+k-1])) = j - i + 1$.
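The single scan can be sketched as follows for the simplest by-product, the list of $k$-mers with the size of their support. The code is an assumption-laden illustration rather than the authors' procedure: the function names are hypothetical, the arrays are 0-indexed (unlike the 1-indexed convention above), and the suffix array is built by naive sorting, which is not linear.

```python
def generalised_suffix_array(words):
    """Naive generalised SA of (g, h) pairs plus its LCP array (0-indexed)."""
    sa = sorted(((g, h) for g, w in enumerate(words) for h in range(len(w))),
                key=lambda gh: (words[gh[0]][gh[1]:], gh))
    suffixes = [words[g][h:] for g, h in sa]

    def lcp_len(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    lcp = [-1] + [lcp_len(suffixes[i - 1], suffixes[i])
                  for i in range(1, len(sa))]
    return sa, lcp


def kmer_supports(words, k):
    """One pass over SA and LCP: each maximal run with LCP >= k is one k-mer."""
    sa, lcp = generalised_suffix_array(words)
    supports = {}
    current = None
    for i, (g, h) in enumerate(sa):
        if lcp[i] >= k:
            supports[current] += 1              # same k-mer as the previous suffix
        elif len(words[g]) - h >= k:
            current = words[g][h:h + k]         # a new run starts here
            supports[current] = 1
        # suffixes shorter than k do not start a k-mer and are skipped
    return supports
```

The count attached to each $k$-mer is the length of its run, which matches the $j - i + 1$ formula for an interval $[i,j]$.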

Theorem 5. The de Bruijn graph of order $k$, $CDBG_k^+$, for a set of words $S$ can be built in a time and space that are linear in $\|S\|$ using the generalised suffix array of $S$.

4. Transition from a truncated structure to de Bruijn graphs

This section is organised as follows. In Section 4.1, we define a simple condition that a set of input strings must satisfy to allow building a generalised index, and sketch a modification of McCreight's algorithm [14] for doing so. In Section 4.2, we introduce the reduced truncated suffix tree and specialise the previous algorithm for constructing it efficiently. Finally, in Section 4.3 we show how to construct both the de Bruijn Graph and its contracted version in optimal time from the reduced truncated suffix tree.

4.1. Set of chains of suffix-dependant strings and tree

Here, we introduce the notion of suffix dependence between strings, and the notion of chain of suffix-dependant strings, in order to define a unified index that generalises both the suffix tree [14] and the truncated suffix tree [18]. First, let us define the concept of suffix-dependant strings and of chains of suffix-dependant strings.

Definition 6.

1. A string $x$ is said to be suffix-dependant of another string $y$ if $x[2\,..\,|x|]$ is a prefix of $y$.

2. Let $w$ be a string and $m$ be a positive integer smaller than $|w| - 1$. An $m$-tuple of $m$ strings $(x_1, \ldots, x_m)$ is a chain of suffix-dependant strings of $w$ if $x_1$ is a prefix of $w$ and for each $i \in [2,m]$, $x_i$ is a prefix of $w[i,|w|]$ such that $|x_i| \ge |x_{i-1}| - 1$.
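Both parts of Definition 6 translate into short predicates. The following sketch is only meant to make the conditions concrete: the function names are hypothetical, strings are ordinary Python strings, and the bound on $m$ is deliberately not checked.

```python
def is_suffix_dependant(x, y):
    """x is suffix-dependant of y if x without its first letter is a prefix of y."""
    return y.startswith(x[1:])


def is_chain(chain, w):
    """Check that `chain` is a chain of suffix-dependant strings of `w`."""
    if not chain or not w.startswith(chain[0]):
        return False                      # x_1 must be a prefix of w
    for i in range(2, len(chain) + 1):    # 1-based index, as in the definition
        x_prev, x_i = chain[i - 2], chain[i - 1]
        if not w[i - 1:].startswith(x_i):
            return False                  # x_i must be a prefix of w[i..|w|]
        if len(x_i) < len(x_prev) - 1:
            return False                  # |x_i| >= |x_{i-1}| - 1
    return True
```

For instance, for the word `bacbab` the tuple `("bac", "acb", "cba", "bab", "ab")` passes the check, while `("bac", "cb")` does not because `cb` is not a prefix of `acbab`.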

Let $\mathcal{R} = \{C_1, \ldots, C_n\}$ be a set of tuples such that for each $i \in [1,n]$, $C_i$ is a chain of suffix-dependant strings of the string $s_i$. For $i \in [1,n]$ and $j \in [1,|C_i|]$, $C_i[j]$ is the $j$th string of the tuple $C_i$. Let $\hat{\mathcal{R}} = \{\hat{C}_1, \ldots, \hat{C}_n\}$ be the set of tuples such


With $\hat{\mathcal{R}}$ and $S$, we can easily compute $\mathcal{R}$. In the sequel, we use $\mathcal{R}$ to demonstrate our results, and $\hat{\mathcal{R}}$ to state the complexities of algorithms. Indeed, in the case where $C_i$ is the tuple of each suffix of $s_i$, the size of $C_i$ is linear in $|s_i|^2$ but $\hat{C}_i$ is linear in $|s_i|$.

Let $w$ be a string; $w$ may occur in distinct tuples of $\mathcal{R}$. Thus, we define $N(w)$ as the set of $(i,j)$ such that $w = C_i[j]$. In other words, $N(w)$ is the set of coordinates of the elements of $\mathcal{R}$ that are equal to $w$.

We define a contracted version of the well-known Aho–Corasick tree [17]. In fact, we apply nearly the same contraction process that turns a trie of a word into its compact Suffix Tree [17]. Consider the Aho–Corasick tree of $S$, in which each node represents a prefix of words in $S$. We contract the non-branching parts of the branches, except that we keep all nodes representing a word that belongs to a tuple in $\mathcal{R}$. From now on, let $T(\mathcal{R})$ denote this contracted version of the Aho–Corasick tree of $S$.

$\mathcal{N}$ and $\mathcal{L}$ denote respectively the set of nodes and the set of leaves of $T(\mathcal{R})$. Furthermore, we define for each node $v$ of $T(\mathcal{R})$ two weights:

• $s(v)$ is the number of times that an element of a tuple of $\mathcal{R}$ is equal to the word represented by $v$ (i.e., $s(v) := |N(v)|$).
• $t(v)$ is the number of times that the first element of a tuple of $\mathcal{R}$ is equal to the word represented by $v$ (i.e., $t(v) := |\{(i,1) \in N(v) \mid i \in [1,n]\}|$).

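Let $w$ be a string; we put $\mathit{Succ}(w) = \{(i,j) \mid (i,j-1) \in N(w) \text{ and } j \le |C_i|\}$. We define $\mathcal{H}$ as the subset of $\mathcal{L}$ such that:

$$\mathcal{H} := \{u \in \mathcal{L} \mid \exists\, C \in \mathcal{R} \text{ and } j < |C| \text{ such that } u = C[j]\}.$$

It is equivalent to say that $\mathcal{H} = \{u \in \mathcal{L} \mid \mathit{Succ}(u) \text{ is not empty}\}$. A mapping $m$ from $\mathcal{H}$ to $\mathcal{N}$ is called a possible link if for each node $v$ in $\mathcal{H}$, there exists $(i,j) \in \mathit{Succ}(v)$ such that $m(v) = C_i[j]$.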

Below we present an algorithm that constructs $T(\mathcal{R})$, and computes for each node $v$ in $\mathcal{N}$ the weights $s(v)$ and $t(v)$ and a possible link $P_0$.

Construction of $T(\mathcal{R})$. Now, we give an algorithm to construct $T(\mathcal{R})$. We use the version of McCreight's algorithm given by Na et al. [18] on our input and we build for each leaf $v$, $s(v)$, $t(v)$ and $P_0(v)$. For building $T(\mathcal{R})$, we start with a tree that contains only the root. Then, for each word $w$ in every chain $C$, we create or update (if it exists) the node $w$ as follows. Assume that we keep in memory the node $v$ that has been processed just before $w$.

If $w$ is the first word of $C$, we go down from the root by comparing $w$ to the labels of the tree. If we create the node $w$, $s(w)$ and $t(w)$ are initialised to 1, and $P_0(w)$ to nil. If $w$ already exists in the tree, we increment $s(w)$ and $t(w)$ by 1.

If $w$ is not the first word of $C$, we start from $v$, and as in McCreight's algorithm, we create or arrive at the node representing $w$. If we need to create this node, $s(w)$ is initialised to 1, $t(w)$ to 0, and $P_0(w)$ to nil. Otherwise, we add 1 to $s(w)$. We set $P_0(v) = w$.

The loop continues with the next word until the end, and we obtain $T(\mathcal{R})$.
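The bookkeeping of $s(\cdot)$, $t(\cdot)$ and $P_0(\cdot)$ in this loop can be sketched independently of the tree. In the simplified illustration below, a Python dictionary keyed by the word stands in for the node lookup, so it shows only the update rules and not the suffix-link navigation that gives the stated complexity; `weights_and_link` is a hypothetical name.

```python
def weights_and_link(chains):
    """Compute s, t and a possible link P0 for every word of every chain."""
    s, t, p0 = {}, {}, {}
    for chain in chains:
        previous = None                   # node processed just before w
        for j, w in enumerate(chain):
            if w not in s:                # "create the node w"
                s[w], t[w], p0[w] = 0, 0, None
            s[w] += 1                     # one more element equal to w
            if j == 0:
                t[w] += 1                 # w is the first word of its chain
            else:
                p0[previous] = w          # link the predecessor to w
            previous = w
    return s, t, p0
```

When a word occurs in several chains, `p0` keeps the successor seen last, which is still a valid choice since any element of $\mathit{Succ}(v)$ may serve as the image of $v$.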

Theorem 6. For a set of chains of suffix-dependant strings $\mathcal{R}$, we can construct $T(\mathcal{R})$ in $O(\|S\|)$ time and space.

Proof. To begin with, let us prove that $T(\mathcal{R})$ is in $O(\|S\|)$ space. Its number of leaves equals $\sum_{C \in \mathcal{R}} |C|$. Hence, its number of nodes is at most $2 \sum_{C \in \mathcal{R}} |C| - 1 \le 2\|S\|$, and its number of edges is at most $2\|S\|$. Thus the size of $T(\mathcal{R})$ is in $O(\|S\|)$.

Clearly, the construction algorithm of $T(\mathcal{R})$ computes both weights $s(\cdot)$ and $t(\cdot)$, and the possible link $P_0(\cdot)$ correctly.

For the complexity, for each chain of suffix-dependant strings $C_i$ of $\mathcal{R}$, the length of the path traversed on the tree is equal to $|w_i|$, thanks to the use of the suffix links. Thus, as in McCreight's algorithm, the complexity is in $O(\|S\|)$. $\square$

Now, we are equipped with an algorithm that builds $T(\mathcal{R})$ for any set of chains of suffix-dependant strings. Let us review some instances of sets $S$, for which $T(\mathcal{R})$ is in fact a well-known tree.

If $C := \bigcup_{w \in S} \{\text{tuple of suffixes of } w\}$, then $T(C)$ is the Generalised Suffix Tree of $S$ (see Fig. 7a). We have that the restricted mapping $sl(\cdot)$ is an example of a possible link.

If $B_k := \bigcup_{w \in S} \{\text{tuple of } k\text{-mers of } w \text{ and suffixes of length } k' < k \text{ of } w\}$, then $T(B_k)$ is the generalised $k$-truncated suffix tree of $S$, as defined in [19] (which generalises the $k$-truncated suffix tree of Na et al. [18]).

If $A_k := \bigcup_{w \in S} \{\text{tuple of } (k+1)\text{-mers of } w \text{ and suffixes of length } k \text{ of } w\}$, then $T(A_k)$ is the truncated suffix tree that we define below in Section 4.2 (see Fig. 7b).

4.2. Our truncated suffix tree


Fig. 7. (a) The generalised suffix tree for the set of words {bacbab, bbacbaa, bcaacb, cbaac, cbabcaa}. The part above the green line corresponds to the TST $T(A_2)$, which is shown in (b). (b) The truncated suffix tree $T(A_2)$ for the same set of words. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Definition 7.

1. For all $i \in [1,|S|]$ and $j \in [1,|s_i|-k+1]$, $A_{k,i}$ denotes the tuple such that its $j$th component is defined by

$$A_{k,i}[j] := \begin{cases} w_i[j, j+k] & \text{if } j \le |w_i| - k, \\ w_i[j, |w_i|] & \text{otherwise,} \end{cases}$$

2. and $A_k$ is the set of these tuples: $A_k := \bigcup_{i=1}^{n} A_{k,i}$.
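To make the case distinction concrete, the tuple $A_{k,i}$ of a single word can be computed as below. This is a minimal sketch with a hypothetical function name; it assumes the word has length at least $k$ and converts the 1-based inclusive positions of the definition into Python's 0-based half-open slices.

```python
def truncated_chain(w, k):
    """Tuple A_{k,i} of the word w: its (k+1)-mers, then its suffix of length k."""
    m = len(w) - k + 1                       # number of components
    return tuple(
        w[j - 1:j + k] if j <= len(w) - k    # w[j, j+k], a (k+1)-mer
        else w[j - 1:]                       # w[j, |w|], the suffix of length k
        for j in range(1, m + 1)
    )
```

For the word `cbaac` of Fig. 7 and $k = 2$, this gives `("cba", "baa", "aac", "ac")`: three 3-mers followed by the suffix of length 2.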

Proposition 11.

1. $A_{k,i}$ is a chain of suffix-dependant strings of $s_i$.
2. Moreover, $\{w \in A_{k,i} \mid A_{k,i} \in A_k\} = F_{k+1}(S) \cup \mathit{Suff}_k(S)$.

Proof.

1. For all $j \in [1, |A_{k,i}| - k]$, it is easy to see that $A_{k,i}[j]$ is a suffix-dependant string of $A_{k,i}[j+1]$.
2. For the second point,

$$\{w \in A_{k,i} \mid A_{k,i} \in A_k\} = \bigcup_{i=1}^{n} \Big( \bigcup_{j=1}^{|s_i|-k+1} \{A_{k,i}[j]\} \Big) = \bigcup_{i=1}^{n} \Big( \bigcup_{j=1}^{|s_i|-k} \{A_{k,i}[j]\} \cup \{A_{k,i}[|s_i|-k+1]\} \Big)$$
