Contents lists available atScienceDirect
Journal
of
Computer
and
System
Sciences
www.elsevier.com/locate/jcss
Linking
indexing
data
structures
to
de
Bruijn
graphs:
Construction
and
update
Bastien Cazaux
a,b,
Thierry Lecroq
c,
Eric Rivals
a,b,∗
aLIRMM,CNRSandUniversitédeMontpellier,161rueAda,34095MontpellierCedex5,FrancebInstitutBiologieComputationnelle,CNRSandUniversitédeMontpellier,860rueSaintPriest,34095MontpellierCedex5,France cNormandieUniv.&UNIROUEN,UNIHAVRE,INSARouen,LITIS,76000RouenFrance
a
r
t
i
c
l
e
i
n
f
o
a
b
s
t
r
a
c
t
Articlehistory: Received25June2015
Receivedinrevisedform 26May2016 Accepted27June2016 Availableonlinexxxx Keywords: Index Datastructure Suffixtree Suffixarray Dynamicupdate Overlap
ContracteddeBruijngraph Assembly
Algorithms Bioinformatics
DNA sequencing technologieshavetremendously increased theirthroughput, and hence complicatedDNAassembly.NumerousassemblyprogramsusedeBruijngraphs(dBG)built fromshortreadstomergetheseintocontigs,whichrepresentputativeDNAsegments.In adBGoforder
k
,nodesaresubstringsoflengthk
ofreads(ork
-mers),whilearcsaretheir k +1-mers.Asanalysingreadsoftenrequiretoindexalltheirsubstrings,itisinteresting toexhibitalgorithms thatdirectlybuildadBG fromapre-existingindex,and especially acontracteddBG,wherenon-branching pathsarecondensedintosinglenodes.Here,we exhibitlineartimealgorithmsforconstructingthefullorcontracteddBGsfromsuffixtrees, suffixarrays,andtruncatedsuffixtrees.Withthelattertheconstructionusesaspacethat islinearinthesizeofthedBG.Finally,wealsoprovidealgorithmstodynamicallyupdate theorderofthegraphwithoutreconstructingit.©2016TheAuthor(s).PublishedbyElsevierInc.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).
1. Introduction
In life sciences, determining the sequence of bio-molecules is an essential step towards the understanding of their functions and interactions within an organism. Powerful sequencing technologies allow to get huge quantities of short sequencingreads thatneed tobe assembledto inferthecomplete targetsequence. Theseconstraintsfavourthe useofa versionofthedeBruijnGraph(dBG)dedicatedtogenomeassembly–aversionwhichdiffersfromthecombinatorial struc-tureinventedbyN.G.de Bruijn[1].Givenaset S
= {
s1,
. . . ,
sn}
ofn readsandanintegerk,anassemblydeBruijn Graph,orforshortsimplyde Bruijn Graph,storeseach k-mer (k-longsubstring)occurringinthereadsasnodesandhasan arc joiningtwok-mersiftheyappearassuccessive(andhenceoverlapping)k-mersinatleastoneread.
The dBGis then traversed to extract long paths, which willform the contigs, i.e., the sequenceof sub-regions ofthe molecule.Innon-repetitive regions,thelayoutofthereadsdictatesasimplepathofk-merswithoutbifurcations.Anysimple path betweenan in-branchingnode andthe next out-branching node, can then be contractedinto a single arc without loosinganyinformationonthegraphstructure.Thesequencesofsuchsimplepathsarecalledunitigs(thecontractionfrom uniqueandcontigs).TheversionofthedBGwhereeachsuch“non-branching”pathiscondensedintoasinglearcistermed theContracteddBG(CdBG).
*
Correspondingauthorat:LIRMM,CNRSandUniversitédeMontpellier,161rueAda,34095MontpellierCedex5,France. E-mailaddresses:[email protected](B. Cazaux),[email protected](T. Lecroq),[email protected](E. Rivals). http://dx.doi.org/10.1016/j.jcss.2016.06.0080022-0000/©2016TheAuthor(s).PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).
Sequencingtechnologiesofthesecondgenerationcanyieldhundredsofmillionsofreads.Comparedtotheoverlapgraph ortothestringgraph,whichwereusedwithprevioustechnologies,thedBGhasanumberofnodesthatisnotproportional to thenumberofreads:itdependsonausercontrolledparameterk,termed theorder ofthedBG.Its memoryusage can befinetunedthroughthisparameter.
In bioinformatics,dBGs are heavily exploitedforgenomeassembly [2],butforother purposes aswell. Actually,some programs mine the dBG to seekgraph patterns representingmutations, large insertions/deletions, orchromosomal rear-rangements[3].Othersuseittocorrectsequencingerrorsinlongreads[4].
ThedeBruijnGraphisusuallybuiltdirectlyfromthesetofreads,whichistimeandspaceconsuming.Severalcompact datastructuresforstoring dBGshavebeendeveloped[5,6] includingprobabilistic ones[7].Theemphasisisplacedonthe practical spaceneededtostorethedBGs inmemory.Moreover,some recentassemblyalgorithms putforwardthe advan-tage ofusingforthesameinput,multipledBGswithincreasingorders[8],therebyemphasising theneedfordynamically updatingthedBG.Inallcases,theconstructionalgorithmsneedtoscanthroughthewholesetofreads.
Severalgenomeassemblyprograms usedhashtablesto storethek-mer ofthereadsandallownavigatingthrough the arcs ofthedBG,butthesesolutionssufferfromseverallimitationsregardinge.g. functionalities andflexibility.Withhash functions, itisoftennotpossibleto addextrainformationtothe nodes,likeforinstancethenumberoftimesak-mer is observed inthereadset,whichis usedasa confidencemeasure.Hashtablesmakeitdifficulttocompute thecontracted dBGortochangethevalueofk.Themainadvantageofsophisticatedhashfunctionsis theirmemoryfootprint.Forinstance Minia [9] offers a very spaceefficient storage to handlethe dBG based oncascading Bloom filters, which are a type of hash functions. This hashtable based solution was used forlong read errorcorrection andalso proves efficient in that context[4].
Instudiesinvolvingtheanalysisofsequencingreads,distincttasksrequiretoindexeitherallsubstrings,orthek-mersof thereads.Forinstance,fastdBGassemblyprogramsfirstcountthek-mersbeforebuildingthedBGtoestimatethememory needed [7]. Anotherexample: some errorcorrection software build a suffix tree ofall short readsto correct them [10]. Hence, before the assembly starts, the read set has already beenscanned through andindexed. It can thus be efficient to enable theconstruction ofthedBGforthesubsequentassembly, directlyfromtheindexratherthan fromscratch.For these reasons, we set out to find algorithms that transformusual indexes intoa dBGor a contracted dBG.It is also of theoretical interesttobuildbridges betweenwell studiedindexesandthisgraphonwords.Despiterecentresults[11,12], formal methods forconstructing dBGs fromsuffix trees are an open question. In comparison, Simpson and Durbin have proposedanalgorithmtobuildtheStringGraphfromaFM-index[13].
Here, we present algorithms to build directly the CdBG from a Generalised Suffix Tree or froma Generalised Suffix Array ofthereads[14–17].Thesealgorithmstake spaceandtimethatarelinearintheinputsize.Thesewell-knowndata structuresindexallsubstringsofthereads,andnotonlytheirk-mers.Thisresultsinonedrawbackandinoneadvantage.
Thedrawbackistheirspaceoccupancy.Wewillthenconsideranindexingdatastructurethatreducesthesetofindexed substrings: thetruncatedsuffixtree[18,19]. Weintroducethe reducedtruncatedsuffix tree(TST) andthen showhow to constructwiththisindexboththedBGandCdBGintimeandspacethatarelinearinthesizeofthefinaldBG,ratherthan inthecumulatedlengthofthereads.BysizeofthedBGwemeanthesumofnumberofnodes,plusthenumberofarcs. Thisalgorithmachievesanoptimaltimeandspacecomplexity.
The advantage isthecounterpart: assubstringsofall lengthsareindexed, itallows toupdate theorder ofthegraph, that is to changedynamically the value ofk without reconstructing thedBG. Finally,we provide efficientalgorithms for increasing ordecreasingthevalueofk.Ofcourse,ifoneusesthetruncatedsuffixtreeinsteadofthefull suffixtree,only some updatesremainpossible.Ourresultsneverthelessremainapplicabletothetruncatedsuffixtree,wheretheordercan bedynamicallydecreased.
Thisarticleincludesresultsthatappearedin[20,21]. 1.1. Indexingdatastructures
Suffix trees arewell-known indexingdatastructuresthat enable tostore andretrieveall thefactorsof agivenstring. The suffix tree of a string y of length s can be build in time and space in O
(
s)
on a constant size alphabet [14,22]. Then, itis possibletocheck ifapatternx oflengthm isa factorofastring y of S intime O(
m)
.Counting thenumber of occurrences of x in y can also be done intime O(
m)
while enumerating the positions where x occurs in y can be performedintime O(
m+
occ)
,whereocc denotesthe numberofoccurrencesof xin y.Suffix trees canbe adaptedto a finitesetofstringsandarethencalledGeneralisedSuffixTrees(GSTs).Thus,givenasetSofnstringsoftotallengthSon aconstantsizealphabet,thegeneralisedsuffixtreefor Scanbebuildintimeandspace O(
S)
.Foradetailedexposition ofpropertiesofsuffixtreeswereferthereaderto[17].Suffixtreeshavebeenwidelystudiedandusedinalargenumberof applications(see[15]and[17]).Inpractice,theyconsumetoomuchspaceandareoftenreplacedbythemoreeconomical suffixarrays[16],whichhavethesameproperties[23].When oneisonlyinterested infactorsofagivenlength,truncatedsuffix treesonlystorethefactors oflength uptoa givenconstantk ofagivenstring.Theycanalsobebuild inlineartime andspace[18].Inpractice,truncatedsuffix trees savealotofnodescompared tosuffixtrees.
Fig. 1.S:= {bacbab,cbabcaa,bcaacb,cbaac,bbacbaa}isasetofwords.Therefore,wehaveSupport(ba)= {(1,1),(1,4),(2,2),(4,2),(5,2),(5,5)},RC(ba)=
{ε,c,cb,cba,cbab,b,bc,bca,bcaa,a,ac,cbaa},LC(ba)= {ε,c,ac,bac,b,bbac}andd(ba)=0.OnehasRC(ba)∩= {a,b,c}.Thus,thewordbaisnotright extensibleinS(seeDefinition 2).
2. DefinitionsofdeBruijngraphs
2.1. Notationaboutstrings
Hereweintroduceanotationandbasicdefinitions.
An alphabet
is afinite setof letters.A finitesequence ofelements of
is calleda wordor a string.The set ofall wordsover
isdenotedby
,and
ε
denotestheemptyword.Forawordx,|
x|
denotesthelengthofx.Giventwowords xandy,wedenoteby x·
yorsimplyxytheconcatenationofxandy.Forevery1≤
i≤
j≤ |
x|
,x[
i]
denotesthei-thletter of x,and x[
i..
j]
denotes the substringor factor x[
i]
x[
i+
1]
. . .
x[
j]
.Letk be a positive integer. If|
x|
≥
k, f irstk(
x)
istheprefixoflength k ofxandlastk
(
x)
isthesuffix oflengthk of x.Then asubstring oflengthk of xis calledak-mer ofx.Fori such that 1
≤
i≤ |
x|
−
k+
1,(
x)
k,i isthek-merof xstarting inpositioni,i.e.,(
x)
k,i=
x[
i..
i+
k−
1]
.Thus we havef irstk
(
x)
=
(
x)
k,1andlastk(
x)
=
(
x)
k,|x|−k+1.Wedenoteby()
thecardinalityofanyfiniteset.
LetS
= {
s1,
. . . ,
sn}
beafinitesetofwords.Itisourrunninginstanceforallthefollowing.Letusdenotethesumofthelengthsoftheinputstringsby
S:=
si∈S
|
si|Wedenoteby
•
F(
S)
thesetoffactorsofwordsof S,i.e., F(
S)
= {
w∈
| ∃
u,
v∈
,
1≤
i≤
n,
si=
u w v}
.•
Fk(
S)
thesetoffactorsoflengthkofS wherekisapositiveinteger,i.e., Fk(
S)
=
F(
S)
∩
k.
•
Suffk(
S)
isthesetofsuffixesoflengthkofwordsofS.2.2. ClassicaldefinitionofdeBruijngraph
All definitions below refer to the set S; however, as S is clear from the context, we simply omit the “in S” in the notation.
ForawordwofF
(
S)
,•
Support(
w)
isthesetofpairs(
i,
j)
,where wisthesubstring(
si)
|w|,j.Support(
w)
iscalledthesupportofw inS.•
RC(
w)
(resp.LC(
w)
)isthesetofrightcontext(resp.leftcontext)ofthewordw inS,i.e.,thesetofwords w suchthat w w∈
F(
S)
(resp.ww∈
F(
S)
).•
wisthewordw w where w isthelongestwordof RC(
w)
suchthat Support(
w)
=
Support(
w w)
.Inother words, suchthat wandw w haveexactlythesamesupportinS.•
wisthewordw wherew isthelongestprefixof wsuchthat Support(
w)
=
Support(
w)
.•
d(
w)
:= |
w|
− |
w|
.Inotherwords,
wisthelongestextensionofwhavingthesamesupportaswinS,whilewistheshortestreduction ofwwithasupportdifferentfromthatofwinS.ThesedefinitionsareillustratedinarunningexamplepresentedinFig. 1. WegivethedefinitionofadeBruijn graphforassembly(dBGforshort),whichdiffersfromtheoriginaldefinitionofa completegraphoverallpossiblewordsoflengthkstatedbydeBruijn[1].Definition1. Let k be a positive integer. The deBruijngraph of order k for S, denoted by D B G+k, is a directed graph, D B Gk+
:=
(
Vk+,
Ek+)
,whoseverticesarethek-mersofwordsof Sandwhereanarclinksu to v ifandonlyifuandv are twosuccessivek-mersofawordofS,i.e.:Vk+
:=
Fk(
S)
Fig. 2.Examplesofarcsfrom D BG+k.(a)showslettersintherightcontextofba,and(b)thesuccessorsofnodebainD B G+2;oneforeachletterin RC(w)∩.(c)showslettersintheleftcontextofba,and(d)thepredecessorsofnodebainD BG+2;oneforeachletterinLC(w)∩.
Fig. 3.Withsolidarcsonly,thegraphscorrespondtoD BG+2 (a)and D BG+3 (b)forourrunningexample.Withbothsolidanddottedarcs,theyrepresent D B G−2 (a)andD BG−3 (b).
AnequivalentdefinitionofE+k canbestatedusingtheleftinsteadofrightcontext:
E+k
:= {
(
u,
v)
∈
Vk+2|
lastk−1(
u)
=
f irstk−1(
v)
andu[
1] ∈
LC(
v)
}
.
ExamplesofarcsaredisplayedonFig. 2.ThesizeofD B Gk+isdenotedbyanddefinedassize
(
D B G+k)
:=
(
Vk+)
+
(
E+k)
. Notethatanother,simplerdefinitionofthearcsinthedeBruijngraphcoexistswiththatofDefinition 1.There,anarclinks utov ifandonlyifu overlapsv byk−
1 symbols.Thisgraphisdenotedby D B G−k=
(
Vk−,
E−k)
,where:Vk−
:=
Fk(S)
E−k
:= {
(
u,
v)
∈
Vk−2|
lastk−1(
u)
=
f irstk−1(
v)
}
.
ThearcsofEk−satisfylessconstraintsthanthoseofE+k;hence,Ek+isasubsetof E−k.Bothdefinitionsareillustratedon Fig. 3.SomeassemblyprogramsuseD B Gk−[9].AllthealgorithmicresultsthatweobtainforD B Gk+remainvalidforD B G−k. Inthesequel,wefocusonlyonD B G+k.
LetusintroducenowthenotionsofextensibilityforasubstringofS andthatofaContracteddBG(CdBGforshort).
Definition2(Extensibility).LetwbeawordofF
(
S)
.•
w isrightextensibleinS ifandonlyif(
RC(
w)
∩
)
=
1.•
w isleftextensibleinS ifandonlyif(
LC(
w)
∩
)
=
1.Let wbeawordof
.Theword wissaidtobeaunique k-merof Sifandonlyifk
≥
kandforalli∈ [
1..
k−
k+
1]
,Fig. 4.The graphs correspond to CDBG+2 (a) and CDBG+3 (b) for our running example.
Definition3.AcontracteddeBruijngraphoforderk,denotedbyCDBGk+
=
(
Vk+,c,
E+k,c)
,isadirectedgraphwhere:Vk+,c
= {
w∈
|
wis ak-mer unique maximal by substring andk≥
k}
E+k,c= {(
u,
v)
∈
Vk+,c2|
lastk−1(
u)
=
f irstk−1(
v)
andv[
k] ∈
RC(
lastk(u))}.
ExamplesofCDBG+k aredisplayed onFig. 4.Notethatinthepreviousdefinition,anelement w inVk+,c doesnot neces-sarilybelong to F
(
S)
,since w mayonlyexistasthe substringoftheagglomerationoftwo wordsof S.Thus, let w bea k-meruniquemaximalbysubstringwithk≥
k:•
lastk(
w)
isnotrightextensibleorRC(
lastk(
w))
∩
= {
a}
andlastk−1(
w)
·
aisnotleftextensible,•
f irstk(
w)
isnotleftextensibleorLC(
f irstk(
w))
∩
= {
a}
anda·
f irstk−1(
w)
isnotrightextensible.Withthisargument,wehavebothfollowingpropositions.
Proposition1.Let
(
u,
v)
∈
E+k,c;(
lastk(
u),
f irstk(
v))
∈
Ek+andthereexists w∈
Vk+suchthat(
w,
f irstk(
v))
∈
Ek+\ {
(
lastk(
u),
f irstk
(
v))
}
or(
lastk(
u),
w)
∈
Ek+\ {
(
lastk(
u),
f irstk(
v))
}
.Proposition2.Let
(
u,
v)
∈
Ek+.Ifu isrightextensibleandv isleftextensible,thenthereexistsw∈
Vk+,csuchthatu·
v[
k]
isasubstring ofw.Otherwise,thereexists(
u,
v)
∈
E+k,csuchthatu=
lastk(
u)
andv=
f irstk(
v)
.Accordingto Propositions 1 and2,CDBG+k isthegraph D B G+k wherethearcs
(
u,
v)
are contractedifandonly ifuis rightextensibleandvisleftextensible.2.3. Constructivecharacterisation ofthedeBruijngraph
Letkbeapositiveinteger.WedefinethefollowingthreesubsetsofF
(
S)
.•
Init Exactk= {
w∈
F(
S)
| |
w|
=
kandd(
w)
=
0}
•
Initk= {
w∈
F(
S)
| |
w|
≥
kandd(
f irstk(
w))
= |
w|
−
k}
•
SubInitk=
Init Exactk−1AwordofInit Exactkiseitheronlythesuffixofsomesiorhasatleasttworightextensions,whilethefirstk-merofaword
inInitk
\
Init Exactkhasonlyonerightextension.Proposition3.Init Exactk
=
Initk∩ {
w∈
F(
S)
| |
w|
=
k}
.Proof. Letw
∈
Init Exactk.Inthiscase,weget f irstk(
w)
=
w and|
w|
−
k=
0.Thismeansthatd(
f irstk(
w))
= |
w|
−
kandthereforew
∈
Initk.2
Forw anelementofInitk, f irstk
(
w)
isak-merofS.Giventwowords w1 andw2 ofInitk, f irstk(
w1)
and f irstk(
w2)
aredistinctk-mersofS.Furthermoreforeachk-mer w ofS,thereexistsawordwofInitksuchthat f irstk
(
w)
=
w.Fromthis,wegetthefollowingproposition.
Proposition4.ThereexistsabijectionbetweenInitkandthesetofthek-mersofS.
AccordingtoDefinition 1andProposition 4,eachvertexofD B G+k canbeassimilatedtoauniqueelementofInitk.Asthe
TodefinethearcsbetweenthewordsofInitk,whichcorrespondtoarcsofD B Gk+,weneedthefollowingproposition,which
statesthateachsingleletterthatisarightextensionof wgivesrisetoasinglearc.
Proposition5.Forw
∈
Init Exactkanda∈
∩
RC(
w)
,thereexistsauniquew∈
Initksuchthatlastk−1(
w)
a isaprefixofw.Proof. Let w be a word of Init Exactk and a a letter of RC
(
w)
. By definition of right context, lastk−1(
w)
a∈
F(
S)
. As|
lastk−1(
w)
a|
=
k, there exists w such that lastk−1(
w)
a is a prefix of w and|
lastk−1(
w)
a|
+
d(
lastk−1(
w)
a)
= |
w|
. BydefinitionofInitk, w
∈
Initk.2
ThesetInitkrepresentsthenodesofD B G+k.LetusnowbuildthesetofarcsthatisisomorphictoEk+.Letwbeaword
of Initk andSucck
(
w)
denote theset ofsuccessors of f irstk(
w)
: Succk(
w)
:= {
x∈
Initk|
(
f irstk(
w),
f irstk(
x))
∈
Ek+}
. Weknowthat foreach lettera in RC
(
w)
,there existsan arcfrom f irstk(
w)
to f irstk(
last|w|−1(
w)
a)
in D B Gk+.We considertwocasesdependingonthelengthofw:
Case1.
|
w|
=
k.AccordingtoProposition 3, w
∈
Init Exactkandhencelastk−1(
w)
∈
SubInitk.Therefore,theoutgoingarcsofwinD B G+karethearcsfromwto w satisfyingtheconditionofProposition 5.Then,
Succk(w
)
=
a∈∩RC(w) lastk−1(
w)
a.
Case2.
|
w|
>
k.As w islongerthank,itcontainsthenextk-mer;hence f irstk
(
last|w|−1(
w)
a)
=
f irstk(
last|w|−1(
w))
,andthereexistsauniqueoutgoingarcofw:thatfromwto
w[
2..
k]
.Indeed,bydefinitionofInitk,w[
2..
k]
∈
Initk,andthusSucck
(
w)
= {
w[
2..
k]}
.
Now,wecanbuildintegrally D B Gk+ormoreexactlyanisomorphicgraphofD B G+k.
Theorem1.WiththesetsInitk,Init ExactkandSubInitk,wecanbuildanisomorphicgraphofD B G+k inlineartimeinthesizeof
thesesets.
Forsimplicity,fromnowon,weconfoundthegraphwebuildwithD B G+k. 2.4. Constructivecharacterisation ofthecontracted deBruijngraph
TodothesamewithCDBGk+,initiallywebeginbyexplainingthealgorithmthatweusetobuildthisgraphandinthe secondtimeweneedtocharacterisetheconceptsofrightandleftextensibilityintermsofwordproperties.
OuralgorithmtobuildCDBG+k.We present a generic algorithm to build incrementally CDBG+k. It is explained interms of words,anddoesnotdependonanyindexingdatastructure.Infollowingsections,wewillusethisgenericalgorithmand explainhowitcanbeperformedefficientlyusingaspecifiedindexingstructure.
Themainalgorithm(Algorithm 2)exploresD B Gk+tofindthenodeskeptinCDBGk+andsetallsinglearcsthatrepresent wholenon-branching pathsofD B G+k thatareproperlycontracted.Thekeypointistofindallstartingnodesofsimplepaths andexplorethesepathsfromthem;theexplorationisdonebyAlgorithm 1.
Amoredetailedexplanation.First, note that to build D B G+k it suffices to know the set Succk
(.)
foreach node. Thealgo-rithm belowsimulatesa traversalof D B G+k without buildingit,andstoresonlyone nodeper unique maximalk-merof D B G+k.Forsuchak-mer, saym,wechoose torepresentitby thenode v suchthat f irstk
(
v)
isaprefixofm.In D B G+k,m is represented by a simple (i.e., non-branching) path and v is its first node. In the traversalalgorithm, for a current starting node vc in Initk,we traversethesimple pathuntilwe arriveata node u havingseveralsuccessorsorsuchthat
its onlysuccessoris notleft extensible (i.e.,hasseveralpredecessors).Inother words,untilwe findu suchthat u is not right extensible ornext
(
u)
isnot left extensible. In D B Gk+,there exists a simplepath between vc andu,and thismustbuild a single node in CDBG+k.To contract thispath,we choose tokeep vc, andforanysuccessor w ofu,we insertan
arc betweenu and w, asthisarc cannot be contracted. Noting that w necessarily starts a chain(having atleast a sin-gle node), if w is not yetin CDBG+k, we launcha newpath exploration starting from w,one gets that f irstk
(
w)
is theprefix of a node ofCDBGk+, andthus w can appropriately represent the path.Now,if w alreadybelongsto CDBGk+, the case istrickier. If vf stores thefirst vc calledby theprocedure, it maynot be thestarting node ofa path,but be
Algorithm 1:BuildAuxCDBG
(
V,
E,
vf,
vc)
.Input : ThepartialcontractedgraphCDBG+k as(V,E),twonodesvf andvc.vf theinitialstartingnode,andvcthecurrentstartingnode. Output: Theupdatedcontractedgraph(V,E),whichnowcontainsallpathsstartingfromvc.
1 begin
2 u:=vc;marku
3 //searchthenodeendingthechainthatgoesthroughvc
4 whileu isrightextensibleandnext(u)isleftextensibledo
5 ifvf=next(u)then
6 update(vf,i)by(vc,i)forall(vf,i)∈E
7 return(V\ {vf},E)
8 u:=next(u);marku
9 //nowexplorethepathstartinginthesuccessorofu 10 forw∈Succk(u)do
11 ifw∈V then
12 (V,E):=(V,E∪ {(vc,w)});
13 else
14 (V,E):=Build AuxC D B G(V∪ {w},E∪ {(vc,w)},vf,w);; // explore from node w 15 return(V,E)
Algorithm 2:BuildCDBG
(
S)
. Input : AsetofwordsS. Output: CDBG+k ofS. 1 begin2 (V,E)=(∅,∅)
3 //searchforanynodevofD BG+k withoutpredecessors 4 //andbuildCDBG+k fromv
5 forv∈Initkdo
6 ifthereexistsnow suchthatv∈Succk(w)then
7 (V,E):=(V,E)Build AuxC D B G(V∪ {v},E,v,v) 8 //exploreD BG+k fromanynodenotyetvisited
9 forvcanunmarkednodeofInitkdo
10 (V,E):=(V,E)Build AuxC D B G(V∪ {vc},E,vc,vc)
11 return(V,E)
path:hencewe mustupdate V byexchanging vf withvc andterminatetheexploration.Otherwise, vf is traversed
dur-ing the for loop (as the value of w), then it is a successor of u and the beginning of a simple path: we just add an arc linking vc to w andstop. Finally,if w alreadybelongs to V but w
=
vf, we also add an arc linking vc to w andstop.
TheprocessperformedbyAlgorithm 1augmentsthepartialgraphCDBG+k restrainedtothenodesvisitedwhenexploring the pathstarting from vc.It sufficesnow toensure that all arcsof D B G+k are examined,which Algorithm 2 does.More
precisely, it starts by visiting the simple paths starting atnodes having no predecessors (otherwise these nodes would not be visited). Once this is done, one must explore all nodes not yet marked and continue until all nodes have been visited/marked.
Fromtheabovediscussion,weobtainthefollowingtheorem.
Theorem2.Assumeonecandetermineinconstanttimeforanarc
(
u,
v)
ofEk+,c,whetheru isrightextensibleandwhetherv isleft extensible.Then,withthesetsInitk,Init ExactkandSubInitk,Algorithm 2buildsagraphthatisisomorphictoCDBG+k inlineartimeinthesizeofthesesets.
Remark. Executing Algorithm 2 doesnot requireto build D B G+k, since the set ofsuccessors Succk
(
u)
of anynode u iscomputedinconstanttime.
Characterisation oftheconceptsofrightandleftextensibility.By the constructionof D B Gk+,we get thefollowing properties, whichwillturnusefulfortheconstructionoftheCdBGfromspecificindexes(Section3and4).
Proposition 7. Let w be a word of Initk such that f irstk
(
w)
is right extensible. Let the letter a be theunique element ofRC
(
f irstk(
w))
∩
,thenlastk−1
(
f irstk(
w))
a isleftextensibleifandonlyif(
Support(
f irstk(w)))
=
(
Support(
lastk−1(
f irstk(w))
a)
\ {(
i,
1)
|
1≤
i≤
n}).
Proof. Let
(
i,
j)
beapairofSupport(
f irstk(
w))
.Wehave(
i,
j+
1)
∈
Support(
lastk−1(
f irstk(
w))).
As Support
(
lastk−1(
f irstk(
w)))
=
Support(
lastk−1(
f irstk(
w))
a)
,itfollowsthat(
i,
j+
1)
∈
Support(
lastk−1(
f irstk(
w))
a).
Ifthere exists
(
i,
j)
∈
Support(
lastk−1(
f irstk(
w)))
such that j>
0 and(
i,
j−
1)
∈
/
Support(
f irstk(
w))
,thereexists aletterb
=
w[
1]
suchthat(
i,
j−
1)
∈
Support(
b·
lastk−1(
f irstk(
w)))
.Hence
(
b·
lastk−1(
f irstk(
w)),
lastk−1(
f irstk(
w))
a)
alsobelongsto E+,andthuslastk−1(
f irstk(
w))
a isnot leftextensi-ble.
2
Insummary,thissectiongivesaformulationofthedBGofS intermsofwords.Nowassumethatthesubstringsofthe words areindexed inadata structure,e.g. ageneralisedsuffix array.Howcan webuild thedBGor thecontractedgraph directlyfromthisstructure?Toachieve this,itsufficestocomputethethreesets Initk,Init Exactk, SubInitk,aswell asthe
sets Support
(.)
andSucck(.)
forsome appropriatesubstrings. Inthefollowing sections,we exhibitalgorithms tocomputeD B G+k andCDBG+k fortwoimportantindexingstructuresandforahome-madetruncateddatastructure.
3. TransitionfromanindexingdatastructuretodeBruijngraphs
3.1. Fromageneralisedsuffixtree
Suffix Trees(ST)belongtothemoststudiedindexingdatastructures.Ageneralised STcanindexthesubstringsofaset of words.Generallyforthissake,all wordsare concatenatedandseparatedby aspecialsymbol not occurringelsewhere. However,thistrickisnotcompulsory,andanalternativeistokeeptheindicationofaterminatingnodewithineachnode. 3.1.1. Thesuffixtree anditsproperties
The GeneralisedSuffixTreeofasetofwords S isthe suffixtreeof S,whereeachwordof S doesnotnecessarilyfinish by aletterofuniqueoccurrence. Hence,foreachnode v oftheGeneralised SuffixTreeof S,wekeepinmemorytheset, denoted by Suff
(
v)
, ofpairs(
i,
j)
such that the word representedby v is the suffix of si starting at position j. LetusdenotebyT thegeneralisedsuffixtreeofS (fromnowon,wesimplysaythetree)andby VT itssetofnodes.Forv
∈
VT,Children
(
v)
denotesitssetofchildrenand f(
v)
itsparent.SeeFig. 5foranexampleofGST.Some nodesof T mayhavejustonechild.The size oftheunionof Suff
(
v)
forallnode v ofT equals thenumberof leavesinthegeneralisedsuffix treewhen thewordsendwithaterminating symbol.Hence,thespacetostore T andthe sets Suff(.)
islinearinS.Bysimplicity,foranode v ofT,thewordrepresentedbyv isconfusedwithv.Foreachnode v ofT,v∈
F(
S)
.AsallelementsofF(
S)
arenotnecessarilyrepresentedbyanodeofT,wegivethefollowingproposition.Proposition8.ThesetofnodesofT isexactlythesetofwordsw ofF
(
S)
suchthatd(
w)
=
0.Werecallthenotionofasuffixlink(SL)foranynode vofT (leavesincluded).Letsl
(
v)
denotethenodetargetedbythe suffixlinkofv,i.e.,sl(
v)
=
v[
2..
|
v|]
.Bydefinitionofasuffixtree,forall w∈
F(
S)
,thereexistsanode v ofT suchthat w isaprefixof v.Let v thenodeofminimal lengthofT suchthat wisaprefixof v,then|
v|
= |
w|
+
d(
w)
,andtherefore w=
v.Proposition9.Letw
∈
F(
S)
.Then|
w|
≥ |
w|
>
|
f(
w)
|
,where f(
w)
istheparentofwinT .Proof. As f
(
w)
=
w,theresultisobvious.2
3.1.2. ConstructionofD B Gk+Let
[
x1..
xm]
bethesetofk-mersofS.AccordingtothedefinitionofInitkandtoProposition 4,Initk= [
x1..
xm]
.Thus,by Proposition 9, Initk
= {
v∈
VT| |
f(
v)
|
<
kand|
v|
≥
k}
.Similarly, Init Exactk= {
v∈
VT| |
v|
=
k}
.Now,itappearsclearlythat Init ExactkisasubsetofInitk,sinceforallv
∈
VT,|
f(
v)
|
<
|
v|
.Fig. 5.Thegeneralisedsuffixtree forourrunningexampleandtheconstructeddeBruijngraph fork:=2.Squarenodesrepresentwordsthatoccuras asuffixofsomesi,circlenodesaretheothernodesofT.NodesingreyarethoseusedtorepresentthenodesofthedBG.Eachsquarenodestoresits positionsofoccurrencesinS;forsimplicity,wedisplaythestartingpositionasanumberandthewordofSinwhichitoccursasitscolour,insteadof showingthelistofpairs(i,j).ThesolidcurvedarrowsaretheedgesofthedeBruijngraphfork:=2;thosecoloured inredcorrespondtoCase1and thoseinbluetoCase2.(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)
Fig. 6.Thefigures(a),(b)and(c)showCase1andCase2encounteredwhencomputingthearcsofD BG+k.Thegreennoderepresentsthenodev,and theoneinorangesl(v).Thedashedarcscorrespondtosuffixlinks.ArcsofD BG+k areinsolidlineandcolouredinredforCase1(a),orinblueforCase2 (b), (c).(Forinterpretationofthereferencestocolourinthisfigurelegend,thereaderisreferredtothewebversionofthisarticle.)
Case1.
|
v|
=
k(Fig. 6a).As v
∈
Init Exactk,sl(
v)
∈
SubInitk.Therefore,each childu ofsl(
v)
isan elementof Initk.Thus, theoutgoingarcsofvin D B Gk+are thearcsfrom v tothechildu ofsl
(
v)
wherethefirstletterofthelabelbetweensl(
v)
andu isanelement oftherightcontextof v.Asthesetofthefirstlettersofthelabelbetweenv andchildrenofv isexactly RC(
v)
∩
,the numberofoutgoingarcsof v inD B G+k isthenumberofchildrenof v.Tobuildtheoutgoingarcsof v inD B G+k,foreach childu ofv,weassociatev withthenodeofInitkbetweentherootandsl
(
u)
,i.e.,f irstk(
sl(
u))
.Case2.
|
v|
>
k(Figs. 6band 6c).Wehavethatsl
(
v)
isanode ofVT.As|
v|
>
k,|
sl(
v)
|
≥
k.Thus,thereexistsan elementofInitk betweentherootandsl
(
v)
.Weassociatev withthisnode,i.e.f irstk(
sl(
v))
.WeillustratethesetwocasesinFig. 5:
Case1.Casewherev is 6,6,sl
(
v)
is 7,7,theuniquechildu ofv is 3,andsl(
u)
is 4,whichisinInitk.Case2.Casewherev is 1,sl
(
v)
is 2,andf irstk(
sl(
v))
is .Inbothcases,buildingthearcsofE+requirestofollowtheSLofsomenode.Thenode,sayu,pointedatbyaSLmay notbeinitial.Hence,theinitialnoderepresentingtheassociatedfirstk-merofu istheonlyancestralinitialnodeofu.We
equipeachsuch nodeuwithapointer p
(
u)
that pointstotheonlyinitialnode onitspathfromtheroot. Inother words, foranyu∈
/
Initksuchthat|
u|
>
k,onehasp(
u)
:=
f irstk(
u)
.The algorithmtobuildthe D B G+k isasfollows.Aninitial depthfirsttraversalofT allows tocollectthenodesofInitk
andforeachsuchnodetosetthepointerp
(.)
ofallitsdescendantsinthetree.FinallytobuildE+,onescansthrough Initkandforeachnode v oneaddsSucck
(
v)
to E+usingtheformulagivenabove.AltogetherthisalgorithmtakesatimelinearinthesizeofT.Moreover,thenumberofarcsinE+islinearinthetotalnumberofchildrenofinitialnodes.Thisgivesus thefollowingresult.
Theorem3.ForasetofwordsS,buildingthedeBruijnGraphoforderk,D B Gk+takeslineartimeandspacein
|
T|
,i.e.,inS. 3.1.3. ConstructionofCDBG+kInSection2.3,wehaveseenanalgorithmthatallowstocomputedirectlyCDBG+k providedthatonecandetermineifa node v isrightextensibleandifnext
(
v)
isleftextensible,wherenext(
v)
denotestheonlysuccessorofv.Letusseehow tocomputetheextensibilityinthecaseofaSuffixTree.By applyingProposition 6 inthe caseof a tree,foran element v of Initk, f irstk
(
v)
is rightextensible ifandonly if|
v|
>
kor(
Children(
v))
=
1.Thuscheckingtherightextensibilityofanodetakesconstanttime.Fortheleft extensibilityofthesingle successorofa node,oneonlyneedsthesize ofsupportofsome nodes( Proposi-tion 7).Letusseefirsthowtocompute
(
Support(.))
onthetree,andthenhowtoapplyProposition 7.Proposition10.Letv beawordofF
(
S)
andVT(
v)
denotesthesetofnodesofthesubtreerootedinv. Support(
v)
=
v∈VT(v)
Suff
(
v).
Along atraversalofthetree,we cancompute andstore
(
Support(
v))
and(
Support(
v)
∩ {
(
i,
1)
|
1≤
i≤
n}
)
foreach node v inlineartimein|
T|
.Letv beawordofInitk suchthat f irstk
(
v)
isrightextensible.Case1. If
|
v|
=
k, then f irstk(
v)
=
v and(
Children(
v))
=
1. Let u be the only child of v. Thus,|
u|
>
k, RC(
v)
∩
=
{
u[
k+
1]}
,andlastk−1(
v)
u[
k+
1]
=
f irstk(
sl(
u))
.Hence,(
Support(
v))
=
(
Support(
f irstk(
sl(
u)))
\ {
(
i,
1)
|
1≤
i≤
n}
)
andbyProposition 7, f irstk
(
sl(
u))
isleftextensible.Case2.If
|
v|
>
k,thenRC(
f irstk(
v))
∩
= {
v[
k+
1]}
andlastk−1
(
f irstk(v))
v[
k+
1] =
lastk(f irstk+1(
v))
=
f irstk(sl(
v)).
ByProposition 7, f irstk
(
sl(
v))
isleftextensibleifandonlyif(
Support(
f irstk(v)))
=
(
Support(
f irstk(sl(
v)))
\ {(
i,
1)
|
1≤
i≤
n})
As
(
Support(
f irstk(
v)))
=
(
Support(
f irstk(
v)
))
and(
Support(
v)
\ {
(
i,
1)
|
1≤
i≤
n}
)
=
(
Support(
v))
−
(
Support(
v)
∩ {
(
i,
1)
|
1≤
i≤
n}
)
, determining if next(
v)
is left extensible takes constant time. Toconclude, asfor any initial node v,we cancompute in O(
1)
timeitssetofsuccessorsSucck(
v)
,itsrightextensibility,andtheleft extensibilityofits singlesuccessor,we canreadilyapply Algorithm 2tobuiltCDBG+k andweobtaina complexitythat islinearinthe sizeofD B G+k,sinceeachsuccessorisaccessedonlyonce.ThisyieldsTheorem 4.
Theorem4.ForasetofwordsS,buildingtheContracteddeBruijnGraphoforderk,CDBG+k takeslineartimeandspacein
|
T|
,i.e., inS.3.2. Fromageneralisedsuffixarray
IntheprevioussubsectionswehaveshownhowtobuilddeBruijngraphsfromsuffixtrees.Suffixtreesareveryelegant datastructuresbuttheyaretoospace-consuminginpractice.Inmanyapplicationstheyhavebeenreplacedbysuffixarrays thatareequivalent datastructuresandaremorespaceeconomical.WewillnowshowhowtobuilddeBruijngraphsfrom suffixarrays.
LetSAandLCPbethegeneralisedenhancedsuffixarrayofS:
• ∀
1≤
i<
S,SA[
i]
=
(
g,
h)
,SA[
i+
1]
=
(
g,
h)
thensg[
h. .
|
sg|]
<
sg[
h. .
|
sg|]
,• ∀
2≤
i≤
S,LCP[
i]
isthelengthofthelongestcommonprefix betweensuffixesstoredinSA[
i−
1]
andinSA[
i]
,and LCP[
1]
=
LCP[
S+
1]
= −
1.Letusrecallthedefinitionofanlcp-interval.
Definition4([23]).Aninterval
[
i,
j]
,1≤
i<
j≤
Siscalledalcp-intervalofvalue,alsodenotedby
-
[
i,
j]
,iff: 1. LCP[
i]
<
,2. LCP
[
g]
≥
fori
<
g≤
j,3. LCP
[
g]
=
foratleastone gsuchthati
<
g≤
j, 4. LCP[
j+
1]
<
.Letusnowrecallthedefinitionsofthepreviousandnextsmallervalues(PSV andNSV)arrays.
Definition5([23]).For2
≤
i≤
S:•
PSV[
i]
=
max{
j|
1≤
j<
iandLCP[
j]
<
LCP[
i]}
,•
NSV[
i]
=
min{
j|
i<
j≤
S+
1 andLCP[
j]
<
LCP[
i]}
.Recallthat if 2
≤
i≤
S then[
PSV[
i]
,
NSV[
i]
−
1]
is an lcp-intervalof value LCP[
i]
. The direct inclusion among lcp-intervalsdefinesatreerelationshipcalledthelcp-intervaltree(see [23,Def.4.4.3,p.87]).Givenanlcp-interval-
[
i,
j]
,its parentlcp-interval-
[
i,
j]
canbeeasilycomputedinconstanttimeusingthearraysLCP,PSV andNSV.Then:•
Initk consistsof:– thelcp-intervals
-
[
i,
j]
suchthat≥
kandtheparentinterval-
[
i,
j]
of-
[
i,
j]
issuchthat<
k(theassociated stringissSA[i].g[
SA[
i]
.
h. .
SA[
i]
.
h+
−
1]
);– thepositionsSA
[
i]
=
(
g,
h)
suchthati isnotcontainedinlcp-intervals-
[
i,
j]
with≥
kandh≤ |
sg|
−
k+
1 (theassociatedstringissg
[
h. .
|
sg]
);•
Init Exactkiscomposedofthelcp-intervalsk-[
i,
j]
;•
SubInitk=
Init Exactk−1.Actuallythelcp-intervaltreedoesnotneedtobe explicitlybuildandthesetscanbecomputedbyasinglescan ofthe SAandLCParrays.
Foranlcp-interval
-
[
i,
j]
∈
Initkwehave(
Support(
sSA[i].g[
SA[
i]
.
h. .
SA[
i]
.
h+
k−
1]
))
=
j−
i+
1.Theorem5.ThedeBruijngraphoforderk,CDBGk+,forasetofwordsS canbebuiltinatimeandspacethatarelinearin
Susing thegeneralisedsuffixarrayofS.4. TransitionfromatruncatedstructuretodeBruijngraphs
Thissectionisorganisedasfollows.InSection4.1,wedefine asimpleconditionthat asetofinputstringsmustsatisfy toallowbuildingageneralisedindexandsketchamodificationofMcCreight’salgorithm[14] fordoing so.InSection4.2, weintroducethereducedtruncatedsuffixtreeandspecialisetheprevious algorithmforconstructingitefficiently.Finally, in Section 4.3we show how toconstruct both thede Bruijn Graph andits contractedversion in optimaltime fromthe reducedtruncatedsuffixtree.
4.1. Setofchainsofsuffix-dependantstringsandtree
Here,weintroducethenotionofsuffixdependencebetweenstrings,andthenotionofchainofsuffix-dependantstrings in ordertodefineaunifiedindexthatgeneralisesboththesuffixtree[14]andthetruncatedsuffixtree[18].First,letusdefine theconceptofsuffix-dependantstringsandofchainsofsuffix-dependantstrings.
Definition6.
1. Astringxissaidtobesuffix-dependantofanotherstring yifx
[
2..
|
x|]
isprefixof y.2. Let w be a stringand m be a positive integer smaller than
|
w|
−
1. A m-tuple ofm strings(
x1,
. . . ,
xm)
is a chainofsuffix-dependantstringsof w if x1 is a prefix of w and for each i
∈ [
2,
m]
, xi is a prefix of w[
i,
|
w|]
such that|
xi|
≥ |
xi−1|
−
1.Let
R
= {
C1,
. . . ,
Cn}
be a set of tuples such that foreach i∈ [
1,
n]
, Ci is a chain of suffix-dependant strings ofthestringsi.Fori
∈ [
1,
n]
and j∈ [
1,
|
Ci|]
,Ci[
j]
isthe jth stringofthetupleCi.LetR
= {
C1,
. . . ,
Cn}
bethesetoftuplessuchWith
R
and S, we can easily computeR
. In the sequel, we useR
to demonstrate our results, andR
to state the complexitiesofalgorithms.Indeed,inthecasewhereCiisthetupleofeachsuffixofsi,thesizeofCiislinearin|
si|
2 butCi islinearin
|
si|
.Let w beastring; w mayoccurindistinct tuplesof
R
.Thus, wedefine N(
w)
thesetof(
i,
j)
such that w=
Ci[
j]
.Inotherwords,N
(
w)
isthesetofcoordinatesoftheelementsofR
thatareequalto w.Wedefineacontractedversionofthewell-knownAho–Corasicktree[17].Infact,weapplynearlythesamecontraction process that turnsa trie ofawordinto itscompact SuffixTree [17].Consider theAho–Corasicktree of S,inwhicheach noderepresentsaprefixofwordsinS.Wecontractthenon-branchingpartsofthebranchesexceptthatwekeepallnodes representingawordthatbelongstoatuplein
R
.Fromnowon,letT(
R
)
denotethiscontractedversionoftheAho–Corasick treeofS.N
andL
denoterespectivelythesetofnodesandthesetofleavesofT(
R
)
.Furthermore,wedefineforeachnodev of T(
R
)
twoweights:•
s(
v)
isthenumberoftimesthatanelementofatupleofR
isequaltothewordrepresentedbyv(i.e.,s(
v)
:= |
N(
v)
|
).•
t(
v)
isthenumberoftimesthatthefirstelementofatupleofR
isequaltothewordrepresentedby v (i.e.,t(
v)
:=
|{
(
i,
1)
∈
N(
v)
|
i∈ [
1,
n]}|
).Letw beastring,weputSucc
(
w)
= {
(
i,
j)
|
(
i,
j−
1)
∈
N(
w)
and j≤ |
Ci|}
.WedefineH
asthesubsetofL
suchthat:H
:= {
u∈
L
| ∃
C∈
R
andj<
|
C|
such thatu=
C[
j]}
It isequivalenttosaythat
H
= {
u∈
L
|
Succ(
u)
is not empty}
.Amappingm fromH
toN
iscalledpossiblelinkiffor eachnode v inH
,∃
(
i,
j)
∈
Succ(
v)
suchthatm(
v)
=
Ci[
j]
.Belowwe presentanalgorithm thatconstructs T
(
R
)
,andcomputesforeach node v inN
,the weightss(
v)
andt(
v)
andapossiblelinkP0.
ConstructionofT
(
R
)
.Now,we giveanalgorithmtoconstructT(
R
)
.WeusetheversionofMcCreight’salgorithmgivenby Naetal.[18] onourinputandwebuildforeachleafv,s(
v)
,t(
v)
and P0(
v)
.ForbuildingT(
R
)
,westartwithatreethatcontains onlythe root.Then, foreach word w ineverychain C,we createorupdate (ifitexists)thenode wasfollows. Assumethatwekeepinmemorythenode v thathasbeenprocessedjustbeforew.
If w is the first wordof C, we go down from the root by comparing w to the labels of the tree. If we create the node w,s
(
w)
andt(
w)
areinitialisedto1,andP0(
w)
tonil.If walreadyexistsonthetree,weincrements(
w)
andt(
w)
by 1.
If w is not the first word of C, we start from v, andas in McCreight’s algorithm, we createor arrive on the node representing w.Ifweneedtocreatethisnode,s
(
w)
isinitialisedto1,t(
w)
to0,andP0(
w)
tonil.Otherwise,weadd1 tos
(
w)
.Weset P0(
v)
=
w.Theloopcontinueswiththenextworduntiltheend,andweobtain T
(
R
)
.Theorem6.Forasetofchainofsuffix-dependantstrings
R
,wecanconstructT(
R
)
inO(
S)
timeandspace.Proof. Tobeginwith,letustoprovethatT
(
R
)
isinO(
S)
space.ItsnumberofleavesequalsC∈R|
C|
.Hence,itsnumber ofnodesisatmost2C∈R|
C|
−
1≤
2S,anditsnumberofedgesisatmost2S.ThusthesizeofT(
R
)
isinO(
S)
.Clearly, theconstruction algorithmof T
(
R
)
computesbothweights s(.)
andt(.)
,andthepossible link P0(.)
correctly.Forthecomplexity,foreachchainofsuffix-dependant Ciof
R
,thelengthofthetraversepathonthetreeisequalto|
wi|
,thankstotheuseofthesuffixlinks.ThusasinMcCreight’salgorithm,thecomplexityisinO
(
S)
.2
Now,weareequippedwithanalgorithmthatbuildsT
(
R
)
foranysetofchainsofsuffix-dependantstrings.Letusreview someinstancesofsets S,forwhich T(
R
)
isinfactawell-knowntree.•
IfC
:= ∪
w∈S{
tuple of suffixes ofw}
, then T(
C
)
is the Generalised Suffix Tree of S (see Fig. 7a). We have that therestrainedmappingsl
(.)
isanexampleofapossiblelink.•
If Bk:= ∪
w∈S{
tuple ofk-mer of wand suffixes of lengthk<
kofw}
,thenT(
Bk)
isthegeneralisedk-truncatedsuffixtreeofS,asdefinedin[19](whichgeneralisesthek-truncatedsuffixtreeofNaetal.[18]).
•
If Ak:= ∪
w∈S{
tuple ofk+
1-mer of wand suffixes of lengthkofw}
,then T(
Ak)
is thetruncatedsuffix tree thatwedefinebelowinSection4.2(seeFig. 7b). 4.2. Ourtruncatedsuffixtree
Fig. 7.(a)Thegeneralisedsuffixtreeforthesetofwords{bacbab,bbacbaa,bcaacb,cbaac,cbabcaa}.ThepartabovethegreenlinecorrespondstotheTST T(A2),whichisshownin(b).(b)ThetruncatedsuffixtreeT(A2)forthesamesetofwords.(Forinterpretationofthereferencestocolourinthisfigure legend,thereaderisreferredtothewebversionofthisarticle.)
Definition7.
1. Foralli
∈ [
1,
|
S|]
and j∈ [
1,
|
si|
−
k+
1]
, Ak,idenotesthetuplesuchthatits jth componentisdefinedbyAk,i[j
] :=
wi[j
,
j+
k]
ifj≤ |
wi| −k wi[j,
|
wi|] otherwise2. and Akisthesetofthesetuples: Ak
:=
ni=1Ak,i.
Proposition11.
1. Ak,iisachainofsuffix-dependantstringsofsi.
2. Moreover,
{
w∈
Ak,i|
Ak,i∈
Ak}
=
Fk+1(
S)
∪
Suffk(
S)
.Proof.
1. Forall j
∈ [
1,
|
Ak,i|
−
k]
,itiseasytoseethat Ak,i[
j]
isasuffix-dependantstringof Ak,i[
j+
1]
.2. Forthesecondpoint