Contents lists available atScienceDirect
Theoretical
Computer
Science
www.elsevier.com/locate/tcs
Wheeler
graphs:
A
framework
for
BWT-based
data
structures
✩
Travis Gagie
a,
Giovanni Manzini
b,
c,
∗
,
Jouni Sirén
daDiego Portales University and CEBIB, Santiago, Chile bUniversity of Eastern Piedmont, Alessandria, Italy cIIT-CNR, Pisa, Italy
dWellcome Trust Sanger Institute, Cambridge, UK
a
r
t
i
c
l
e
i
n
f
o
a
b
s
t
r
a
c
t
Article history: Received3March2017
Receivedinrevisedform3June2017 Accepted6June2017
Availableonline27June2017 Keywords:
Compresseddatastructures Burrows–Wheelertransform Patternmatching
ThefamousBurrows–WheelerTransform(BWT)wasoriginallydefined forasinglestring butvariationshavebeendevelopedforsetsofstrings,labeled trees,deBruijngraphs,etc. Inthispaperweproposeaframeworkthatincludesmanyofthesevariationsandthatwe hopewillsimplifythesearchformore.
WefirstdefineWheelergraphsandshowtheyhaveapropertywecallpathcoherence.We showthatifthestatediagramofafinite-stateautomatonisaWheelergraphthen,byits pathcoherence,wecanorderthenodessuchthat,foranystring,thenodesreachablefrom theinitialstateorstatesbyprocessingthatstringareconsecutive.Thismeansthateven iftheautomatonisnon-deterministic,wecanstill storeitcompactlyandprocessstrings withitquickly.
We thenrederiveseveralvariationsoftheBWTby designingstraightforwardfinite-state automata forthe relevant problemsand showingthat theirstatediagramsare Wheeler graphs.
©2017TheAuthors.PublishedbyElsevierB.V.Thisisanopenaccessarticleunderthe CCBYlicense(http://creativecommons.org/licenses/by/4.0/).
1. Introduction
TheBurrows–WheelerTransformation(BWT)hasaverypeculiarhistory.Firstconceivedin1983,itwas publishedonly elevenyearslaterinatechnicalreport[9],presumablybecauseitwassoinnovativethatthefirstreviewerswerenotableto graspitsfullsignificance.Afewyearslater,thecompressionalgorithmbzip2basedontheBWTbecamepopular,challenging gzip’sdominance,thankstothefinelyengineeredimplementationofJulian Seward[46](the verysamecomputer scientist whogaveusalsotheinvaluabletoolValgrind[47]).
Afterits introductionasa compression tool,interest in theBWTwas rekindledwhen manyresearchers realizedthat, amongthe differenttechniquesdiscovered attheturn ofthecentury fordesigning compressedindexes [19,29,34],those basedon theBWTare probablythesimplestandmostspaceefficient [15,43].Afterthisrealization, inthelast tenyears wehavewitnessedan unusualphenomenonincomputerscience:variantsoftheBWThavebeenproposedandappliedto moreandmorecomplexobjects:fromtrees,tographs,to alignments.Thesevariantsareclearly relatedtotheBWTeven
✩ ThisworkwassupportedbyAcademyofFinlandgrant268324,FONDECYT grant1171058,KITE-DiSITproject,INdAM-GNCSProject“Efficientalgorithms andtechniquesfortheorganization,managementandanalysisofbiologicalBigData”,andtheWellcomeTrustgrant[098051].
*
Correspondingauthor.E-mail addresses:[email protected](T. Gagie),
[email protected]
(G. Manzini),jouni.siren@iki.fi
(J. Sirén). http://dx.doi.org/10.1016/j.tcs.2017.06.0160304-3975/©2017TheAuthors.PublishedbyElsevierB.V.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).
68 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 1.Aneight-nodeWheelergraph.Node1hasin-degree0;edgeslabeledaentersinnodes2,3,4;edgeslabeledbinnodes5,6;edgeslabeledcin nodes7,8.
if some ofthem nolonger havethe two main features ofthe original BWT,namelyofbeing invertibleandof “helping” compression.
AtthispointitisnaturaltoaskwhetherwehaveapproachedtheBWTastheblind menapproachedtheelephant(see, e.g., [45]), withonetouching aleg andthinking theelephant islikea tree,anothertouching thetrunk andthinkingit is likea snake,andyetanothertouching thetailandthinking itislikearope.Wedonotdisparageprevioussurveysofthe BWTandrelateddatastructures,such as[1,35,42],sincewe toohavespent yearstryingtomakesense ofoursometimes disparate impressionsofthe it.Withoutpretending togive a completeanswer,in thispaperwe propose aunifyingview formanydifferentBWTvariants.Somewhatsurprisinglywegetourunifyingview consideringtheNondeterministicFinite Automatarelatedtodifferentpatternmatchingproblems.Weshowthatthestategraphsassociatedtotheseautomatahave commonpropertiesthatwesummarizewiththeconceptofWheelergraphs.1
Using thenotion of a Wheeler graph,we show that it is possible to process strings efficiently, e.g., in linear time if the alphabet is constant, even ifthe automaton is nondeterministic. In addition, we show that Wheeler graphs can be compactlyrepresentedandtraversedusinguptothreearrayswithadditionaldatastructuressupportingefficientrankand selectoperations. Itturnsout thatthesearrayscoincide with,oraresubstantiallyequivalent to,theoutputofmanyBWT variantsdescribedintheliterature.
We believeourunifyingview canhelpresearchersdevelopnewBWTvariantsandnewindexingdatastructures. How-ever, westress that not everyBWT-related datastructure fitsour framework:forexample Gangulyetal.’s parameterized BWT[25] andtheindexfororder-preservingmatchingin[23].Therefore,wehope thatourcontributionwillspurfurther researchresultinginawidervisionofthefascinatingfieldoriginatedbytheseminalworkofBurrowsandWheeler.
2. Definitionsandbasicresults
Consider adirected edge-labeledgraph
G
such thateach edgeislabeledby acharacterfromantotally-ordered alpha-bet A.Weuse≺
todenotetheorderingamongA’selements.Labelsontheedgesleavingagivennodearenotnecessarily distinct, andtherecanbe multipleedgeslinkingthesamepairofnodes(forsimplicitywe stillusethetermgraph
rather thanthemoreformallycorrectmulti-graph
).Definition1.
G
is a Wheeler graph if there is an ordering of the nodes such that nodes within-degree 0 precede those with positivein-degree and, foranypairofedgese
=
(u, v)
ande=
(u
,
v)
labeled a anda
respectively,the following monotonicitypropertieshold:a
≺
a⇒
v<
v,
(
a=
a)
∧
(
u<
u)
⇒
v≤
v.
(1)AnexampleofaWheelergraphisshowninFig. 1.Asanimmediateconsequenceof(1),alledgesenteringagivennode musthavethesamelabel.WenowshowthatWheelergraphsalsopossessthefollowingproperty:
Definition2.
G
ispath coherent
ifthereisatotalorderofthenodessuchthatforanyconsecutiverange[
i, j]
ofnodesand stringα
,thenodesreachable fromthose in[
i, j]
in|
α
|
steps byfollowingedges whoselabels forα
whenconcatenated, themselvesformaconsecutiverange.Lemma3.
If G is a Wheeler graph by an ordering
π
on its nodes then it is path coherent byπ
.Proof. Suppose
G
isaWheelergraphbyπ
.Consideraconsecutiverange[
i, j]
ofnodesandlet[
i, j
]
bethesmallestrange that contains all thenodes reachable fromthose in[
i, j]
in one step by followingedges labeled with some charactera
. Byourchoice of[
i, j
]
,both iand j arereachablefromnodesin[
i, j]
inonestepby followingedgeslabeleda.Byour1 OnmanyoccasionsMikeBurrowsstatedthat,asreportedalsoin[9],theoriginalideaoftheBWTisduetoDavidWheeler.Wethereforedecidedto
definitionofaWheelergraph,nodeswithin-degree 0precedethose withpositive in-degree,suchas
i
,so everynode in[
i, j
]
hasatleastoneincomingedge.Assume some node v strictly betweeni and j hasan incomingedge labeleda
=
a.Since i<
v we havea≺
a,by ourdefinitionofaWheelergraphandmodus tollens
;similarly,since v < jwe havea
≺
a,thusobtaining acontradiction. Itfollowsthattheedgesarrivingatnodesin[
i, j
]
arealllabeleda.Furthermore,sincethelabelsareequal,bythesecond implicationin(1)andmodus
tollenswegetthatanyedgewithdestinationstrictlybetweeni
and jmustoriginatein[
i, j]
. Itfollowsthatthenodesreachableinonestepfromthosein[
i, j]
byfollowingedges labeledaare theonesin[
i, j
]
, whichisaconsecutiverange.Foranystringα
,therefore,thenodesreachablein|
α
|
stepsfromthosein[
i, j]
byfollowing edgeswhoselabelsformα
,themselvesformaconsecutiverange,byinductiononthelengthofα
.2
Inthelatersectionsofthispaper,weexploretheimplicationsofpathcoherence.Intheremainderofthissectionwefirst showit ispossibletoobtainafastandcompactrepresentationofa Wheelergraph,then sketchwhyWheelergraphscan achievecompression.Wepointoutthat,evenwithoutexplicitlydefiningWheelergraphs,manyresearchershaveimplicitly consideredthemwhilestudyinghowtoimplementefficientlyBWTvariantsandhowtoboundtheirspaceusage,andgiven resultsanalogoustotheoneswegivenow.
Aplain,edge-by-edgerepresentationofalabeled graphwith
n
nodesande
edgesuses(e(
logn+
log|
A|
))
bits.Given aWheelergraphG
,letx
1<
x2<
· · ·
<
xn denotetheorderedsetofnodes.Fori
=
1,
. . . ,
nleti
andki
denoterespectively theout-degreeandin-degreeofnodex
i.Definethebinaryarraysoflengthe
+
nO
=
011 021· · ·
0n1,
I=
0k11 0k21· · ·
0kn1.
(2) NotethatO
(resp.I)consistsoftheconcatenatedunaryrepresentationsoftheout-degrees(resp. in-degrees).LetL
i denote themultisetoflabelsontheedges exitingfromxi
arrangedinan arbitraryorder,andlet L[
1..e
]
denotetheconcatenation L=
L1L2· · ·
Ln.Byconstruction,|
Li|
=
i
andthereisa naturalone-to-one correspondencebetweenthe 0’sin O andthe charactersinL
.Forexample,forthegraphofFig. 1itisO
=
0001 001 01 1 001 001 001 01L
=
aab ac b ac bc bc awhiletheindegreearrayis
I
=
101001001001001001001.Finally,letC
[
1..
|
A|]
denotethearraysuchthatC
[
c]
isthenumber ofedgeswithlabelsmallerthanc
.Forsimplicity,assumeevery distinctcharacterlabelssomeedge;otherwise,westorea bitvector of|
A|
bitsmarking thecharactersthatdo labeledges andworkwiththat subset.Withthisassumption, C[
c]
<
C[
c+
1]
andxj
hasallincomingedgeslabeledc
ifandonlyifC
[
c]
<
i≤jki≤
C[
c+
1]
.Given an array Z we use the standard notation rankc
(Z
,
i) to denote the numberof occurrences of c in Z[
1,
i]
, and selectc(Z, j)todenotethepositionofthe j-thc
inZ.Forsimplicityweassumerank
c(Z,
0)
=
0.Thefollowingtwoproperties arestraightforwardtoprove;indeed,similarpropertieshavebeenestablishedinmanypapersdealingwithBWT-relateddata structures.1. Theout-degree
i
ofnodex
iisselect
1(O,
i)−
select1(O
,
i−
1)
.Thelabelsintheedgesleavingx
i areL
[
wi−
i
+
1, w
i]
wherewi
=
select1(O,
i)−
i.2. If node xi hasone ormore outgoingedges labeled c, then thelargest index j such that
(xi,
xj) has labelc can be computedasfollows.Letwi bedefinedasaboveandleth
i=
rankc(L, w
i).Ifweorderedgesbylabelwithtiesbroken by origin,(xi,
xj) is the hi-thedge labeled c. Since there are C[
c]
edges with a label smaller than c, we have j=
1
+
rank1(I,
select0(I,
hi+
C[
c]
))
.Tofindthesmallerindices j suchthat(xi,
xj)haslabelc
,wedecrementhi
andrepeat thisprocedure,untilh
i=
rankc(L, wi−
i)
+
1.Withasymmetric reasoning,given xj and
c
we canestablishwhetherthereexists anedge labeledc
enteringin xj,and, ifthisisthecase,all indicesi
such that(x
i, xj)haslabelc.Using standarddatastructurestorepresentcompactly arrays supportingrankandselectqueries[43],wecanestablishthefollowingresult.Lemma4.
It is possible to represent an n-node, e-edge Wheeler graph with labels over the alphabet
A in 2(e
+
n)+
e log|
A|
+
|
A|
loge+
o(n+
e log|
A|
)
bits. The representation supports the forward and backward traversing of the edges in O(
log|
A|
)
time.2
We canoftenreduce even furtherthe spaceusedto representWheelergraphs. Forexample,suppose G is aWheeler graph;V
kisthesetofnodesofG
reachableinexactlyk
stepsfromsomeothernodes;S
isasetofstringsoflengthk
such thatevery nodeinV
k isreachableink
stepsbyfollowingedgeswhoselabelswhenconcatenatedformastringinS;and f isafunctionassigningto eachnode v∈
Vk astrings
∈
S such that v isreachableink
steps byfollowingedgeswhose labelswhenconcatenatedforms
.ConsiderthelistL
koflabelsofedgesoriginatinginV
kinorderbyoriginwithtiesbroken bylabel.ByLemma 3,for
s
∈
Sallthenodesv
∈
Vksuchthat f(v)
=
sformaconsecutiveinterval.Therefore,wecanpartitionLk
into|
S|
consecutiveintervalssuchthateachintervalL
sistheconcatenationofthelabelsofedgesleavingnodesv
suchthat f(v)
=
s;Ls
isemptyfors /
∈
S.Itfollowsthatifk
≤
(
1−
)
lgσ|
Vk|
,where>
0 isaconstantandσ
= |
A|
isthesizeofthe70 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 2.Deterministic (left) and non-deterministic (right) finite automata accepting all substrings ofABRACADABRA. All states are accepting.
alphabetoflabels,thenwecanstore
Lk
ins∈Ak|
Ls|
H0(Ls)
+
o(|
Vk|
)
bits,whereH0(Ls)
isthe0th-orderempiricalentropy of Ls.Informally,thismeansthat ifknowinghowwe cangettoa nodeink
steps tellsusalotaboutthelabelsthatare likelytobeontheedges leavingthatnode,thenwe cancompress Lkwell. Insome waysthisgeneralizesthewell-known analyses[14,39]ofthecompressionachievableusingtheBurrows–WheelerTransformonasinglestring.3. NFAs,WheelergraphsandFM-indexes
Suppose we haveneverheardthewords “SuffixArray” or“Burrows–Wheeler Transform”butwe wantto builda data structure supporting efficientqueries forthe substringsof,say, s
=
ABRACADABRA. Asimplesolution wouldbe tobuild the DeterministicFinite Automaton (DFA)acceptingall substringsof s,see Fig. 2(left). Althoughthere aretechniques to compactly representDFAs, such avenue willlikely lead toa data structure muchlarger than theO
(n
log|
A|
)
bitsof the originaltext.Amorespaceeconomicalalternative istobuildtheNondeterministicFinite Automaton(NFA)forthesamesetof sub-strings: Ithasalinearstructure,seeFig. 2(right),butbecauseofnondeterminism,processingofpatternsappearstobe a difficult task.It turnsout thatthisdifficulty isonlyamatter ofperspective:traversing theDFAbecomes muchsimplerif werecognizethatitisaWheelergraphindisguise.
The NFA of Fig. 2 is already a labeled directed graph. To make it a Wheeler graph we only need to eliminate the
-transitions(which violatetheproperty thatall theincomingedges toanode havethesamelabel), makeallthe states initial(sothelanguageisnotchanged),anddefineanorderingofthenodessuchthatthemonotonicityproperties(1)hold. Tothisend,weassociatetoeachnodeoftheNFAtheprefixof
s
thattakesustherewithoutan-transitionandorderthe NFAnodesaccordingtotheright-to-leftlexicographic rankofsuchstrings,seeFig. 3(leftandcenter).Withthisdefinition the NFAwithout
-transitionsis aWheelergraph.Oncethenodeordering isestablishedwe candiscardtheprefixesand identifyeachnodewithitsrankintheordering,seeFig. 3(right).
IntheWheelergraphderivedfromtheNFAofFig. 3everynodehasout-degree1exceptfromnode 5,correspondingto thelongestprefix,thathasout-degreezero.Hence,intherepresentationoftheWheelergraphdescribedinSect.2,wecan getridofthebitarray
O
andinsteadinsertinposition5ofLasymbolnotoccurringelsewhereins
,say$.Sinceallnodes havein-degree1exceptnode0(ignoringthesourcelessedgesindicatingthatallnodesareinitial)thereisnoneedtostore thebitarrayI
either.Summingup,wecanrepresentandnavigatetheWheelergraphusingthestringL
=
ABDBC$RRAAAA, enriched withdatastructuresforrank
/select operations,andthe arrayC[
1..
|
A|]
.Sincethestring Lcoincides withthelast columnoftheBWTmatrixofs
R,thestring sreversed,andC
isawell-known representationofthefirst columnofsuch matrix,wehaveestablishedthefollowingresult.Lemma5.
Given a string s, let NFA(s)
denote the corresponding NFA described above. The FM-index of sRis a compact representation of NFA(s). The Last-to-First and First-to-Last maps of the FM-index coincides with the navigation operations in NFA(s).2
Fig. 3.FromtheNFA(left)totheWheelergraph(center)totheFM-index(right).Allstatesareinitialandacceptingandnumberedontherighttoshow thatthediagramisaWheelergraph.
TheattentivereadermayaskwhytheWheelergraphrepresentstheFM-indexof
s
R whilehistoricallytheFM-indexwas definedfors.ThereasonisthattheFM-indexwas derivedfromtheBWTofs
.The consequenceisthat thesearchofthe patternmustbe done right-to-left:the infamousbackward-search procedure [20].Here, we havedefinedtheNFA sothat thesearchisdoneleft-to-rightandthroughtheWheelergraphwehavethereforeobtainedtheFM-indexofs
R.Lemma 5 provides anew perspective onFM-indexes that maybe interesting in its ownright and, moreimportantly, suggestsawaywecan systematicallygeneralize theideas behind FM-indexestoindexingother kindsofdatathansingle strings.Usingtheconceptofpathcoherence,wenowproveamorepowerfulversionofthislemma:
Theorem6.
Consider a finite-state automaton over an alphabet A without
-transitions and with either one initial state or all states initial. If its state diagram is a Wheeler graph with n nodes and e edges then we can store it in 2
(e
+
n)+
e log|
A|
+ |
A|
loge+
o(n+
e log|
A|
)
bits such that, for any stringα
, in O(
|
α
|
log|
A|
)
time we can compute the set of states reachable from the initial states onα
.Proof. Supposethestate diagramisa Wheelergraphbyan ordering
π
onits nodesso,byLemma 3,itispath coherent byπ
.Sinceeitheronestateisinitialorallare,theinitialstatesformaconsecutiverangeinπ
.ByDefinition 2,thenodes reachable fromtheinitial statesoneach prefix ofα
formaconsecutive interval.We canuse Lemma 4to mapfromthe intervalforeachprefixtotheintervalforthenextprefixinO
(
log|
A|
)
time.2
Theorem 6meansthat ifwehaveaproblemwecansolvewithafinite-stateautomaton,theneveniftheautomatonis non-deterministic wemaystill be abletoimplementitwithout theusual blowupsinspaceorquerytime. Intherestof thispaperweshowthatmanydatastructuresbasedonvariantsoftheBWTfitintothisframework.
WenoteasanasidethatFM-indexesusuallyincludeasampledsuffixarraywhichpermitsustolocateeachoccurrence ofapatternintheindexedstring.Extendingthisideatoautomataseems straightforward,bysamplingthenodes,butwe donotexplorethatinthispaper.
4. OtherWheelergraphsindisguise
4.1. Multi-string BWT and permuterm index
The first naturalgeneralization ofthe BWTis to extendit to acollection ofstrings. Thiswas done forthe first time in[36–38]wheretheauthorsdescribedareversiblemulti-stringtransformationinspiredbytheBWT,andshowedits effec-tivenessfordatacompressionandformeasuringstringsimilarity.
Inthecontextofpatternmatching,givenacollectionofstrings
s
1,
. . . ,
sd,inadditiontosubstringqueriesitisconvenient to offer alsothe possibility ofprefix/suffix queries. In a prefix/suffix query,given two substringα
,
β
we want tofind all strings si such that si isprefixed byα
andsuffixedbyβ
.As first observed in [21,22] withthe Compressed Permuterm Index,such queries can be solved by addinga specialsymbol $ to each stringandsearching forthe patternβ
$α
inside a circular versionof each si$. Fig. 4(left) showstheNFA supporting circularqueries forthestrings AT$, HOT$,HAT$. To each nodewe naturally associatea circularshiftofoneofthe strings; ifwe sortthenodesaccordingto theright-to-left lexicographicorderofsuchshiftstheresultinggraphisaWheelergraph,seeFig. 4(right).Since each node in the Wheeler graph has in-degree and out-degree 1, we can represent it with the approach of
Lemma 4, usingonly the arrays L andC. Note that the array L, L
=
AHHTTAOT$$$ in the example of Fig. 4, coincides withthemulti-stringBWTdefinedintermsofthe cyclicshiftsin[38].Instead,theCompressedPermutermIndexisbuilt computingthesingle-string BWToftheconcatenation s1$s2· · ·
$sd thatdoesnot supportnaturally thesearch forcircular patterns. The authors of [21] got around thisproblem by sorting the strings s1,
. . . ,
sd before the concatenation andby72 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 4.Left:AfiniteautomatonacceptingallcircularsubstringsofAT$,HOT$,HAT$;allstatesareinitialandaccepting.Right:Sortingthestatesaccording totheassociatedcircularshiftsgivesusaWheelergraph.
Fig. 5.AfiniteautomatonacceptingallsubstringsofAAC,ABA,ACAA,BAandBC.Allstatesareinitialandacceptingandnumberedtoshowthatthe diagramisaWheelergraph.
applying theso-calledjump2end function.Morerecentalgorithms usingcircularpatternsearch allmake useofthecyclic multi-stringBWT[2,7,8,30].
4.2. XBWT and trie representation
Ifwe are onlyinterested inthe substringsofacollection s1
,
. . . ,
sd, andnotincircular matches,asmaller automaton thantheoneconsideredintheprevioussectionistheonederivedfromthetrie
datastructure,seeFig. 5.Followingthelead from[16–18],weassociatetoeachnodethestringformedbythelabelsinthenode-to-rootupwardpath,andweorderall nodesaccordingtothelexicographicrankofsuchstrings.The resultinggraphisaWheeler graphwiththe samenumberofnodesastheoriginal trie.Ina trieall nodesexcept theroothavein-degree1sointherepresentationofLemma 4we donotneedtostore thein-degreebinary array.Hence therepresentationconsistsoftheout-degreebinaryarray O,thelabelsarray Landthecountarray
C
.ForthetrieinFig. 5wehave
O
=
001 0001 01 1 1 1 01 001 01 01 1L
=
AB ABC C A AC A AThis representation isessentially equivalent to the XBWT(eXtended BWT) introduced in [16] to representlabeled trees, withthearrays
O
andLcorrespondingrespectivelytoS
last andS
α in[16].Notethatwecanrestrictthesearchtoprefixesbysettingtherootastheonlystartstate,andwecanrestrictthesearchtosuffixesbysettingasacceptingstatesonlythose correspondingtooneoftheinputstringswithoutaffectingourrepresentation.
4.3. de Bruijn graphs
Bowe,Onodera,SadakaneandShibuya[6](seealso,e.g.,[5,10,33]) extendedtheXBWTfromtreestodeBruijn graphs, whicharewidelyusedinbioinformaticsfordenovoassembly,readcorrection,identifyinggeneticvariationsinapopulation,
Fig. 6.Afiniteautomatonacceptingallprefixesoflengthatleast3ofstringsinwhichtheonlytriplesareAAA,AAB,AAC,ABA,BAA,BAB,CABandCCA
andinwhichBAAAandCABAdonotoccur.State1isinitial,allstatesatdistance3fromtheinitialstateareaccepting.Statesarenumberedtoshowthat thediagramisaWheelergraph.
C C T C – A – A A C C C – C T C A A A C C C C T C C A – A A C A C C T C C A A A C A C C T T – A T A A C – C C T T A T A – A C C C T – – – – A A C C C – – – C T A A C C
Fig. 7.Left:AnalignmentofstringsCCTCAAACC,CCTCCAAACA,CCTTATAAC,andCCTAACC.Thealignmenthasbeenseparatedintocommonand non-commonregions.Right:ThealignmenttransformedforFMA.Thesuffixesmovedtothenon-commonregionsareCTforthefirstcommonregionandAC
forthesecondone.
andotherapplications.A
k
th-orderde Bruijngraphforastringorsetofstrings containsa nodeforeachdistinctk
-tuple thatoccursinthosestrings,andan edge(u, v)
ifthereisa(k
+
1)
-tupleinthestringsthatstarts withu
andendswithv
(whichimpliesthatv
canbeobtainedfromu
bydeletingu
’sfirstcharacterandappendingacharacter).TogetherwiththeGCSA,describedinSection4.5,Boweetal.’srepresentationisreallytheprototypicalBWT-baseddata structurethatrequiresustothinkintermsofgraphsinsteadofstrings.Themaindifferencebetweenrepresentingalabeled treeandrepresentingadeBruijngraphisthat,inagraph,nodescanhavein-degreemorethan 1,butthiscanbehandled usingthein-degreearray
I
asdescribedinLemma 4.We canalso rederive Bowe etal.’srepresentationfrom Theorem 6: we build a finite-stateautomaton that acceptsall prefixesoflengthatleast
k
ofstringscontaining onlyk
-tuplesand(k
+
1)
-tuplesfroma givenlist,by buildingatrie for thek
-tuplesandthenaddingedgesconnectingtheleavesappropriately.Fig. 6showsanexampleforthetriplesAAA,AAB, AAC,ABA, BAA, BAB, CABandCCA,withtheedges forBAAAandCABAmissing:thisautomaton acceptsall prefixesof lengthatleast3ofstringsinwhichtheonlytriplesareAAA,AAB,AAC,ABA,BAA,BAB,CABandCCAandinwhichBAAA andCABAdonotoccur.Bynumberingeachnodeaccordingtothelexicographicrankofthestringlabeling thenode-to-root path,thestatediagramisimmediatelyseentobeaWheelergraph.4.4. FM-index of alignment
Let
G
beaWheelergraphundernodeorderingπ
.Ifallincomingedgestonodesinrange[
i, j]
havethesamelabel,we canmergetherangeintoasinglenode v,andtheresultinggraphG willstillbeaWheelergraph.IfgraphG
hasanedge withlabela
fromnodeu
toanodeinrange[
i, j]
,graphG
willhaveanedge(u, v)
withlabela
.Similarly,ifgraphG
has anedgewithlabela
fromanodeinrange[
i, j]
tonode w,graphG
willhaveanedge(v
, w)
withlabela.Ifwehaveamulti-stringBWT,wecantransformitsWheelergraphintoamorecompactrepresentationofthestringsby mergingrangesofnodes.Thiscompactrepresentationmaycontain
false positives
:pathlabelsthatcombinesubstringsfrom differentoriginal strings. The FM-index of alignment (FMA)[40,41] hasan efficientprocedure fordetecting false positives, basedoncarefullychoosingthenodestomerge.We start with an alignment of the strings, and separate the alignment into common regions Xi shared between all strings,and
non-common
regions Yi wheresome stringsaredifferent. Weassume thateach commonregionhasaproper suffix Zi that doesnot occur anywhere else in the strings. Ifno such suffix exists, we merge the common region into the neighboringnon-common regions. We transformthe alignment intwo steps. First we move the suffixes Zi intothe followingnon-commonregionsYi
.Thenwejustifyeachnon-commonregiontotheright,anduse XiandY
itodenotethe commonandnon-commonregionsinthetransformedalignment.SeeFig. 7foranexample.The transformed alignment guides us in merging the Wheeler graph of the multi-string BWT into the FM-index of alignment. As we are buildingan FM-index for the original strings, we need to build a Wheeler graph for the reverse strings.Wemergenodescorrespondingtoalignedpositions,ifthenodesa) havein-degree 0;b) areina commonregion; orc) correspondtothesamesuffixofthenon-commonregion.SeeFig. 8foranexample.
74 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 8.Top:Wheelergraphofthemulti-stringBWTofthetransformedalignmentin
Fig. 7
.Nodeorderingisbasedonthelabelsinthenodes.Graynodes correspondtothecommonregions.Bottom:WheelergraphoftheFMAforthesamealignment.Thecolorsofeachedgemarktheoriginalstringsthat crosstheedge.Thelabelsofnon-mergednodesarethesameasinthetopgraphforconvenience.(Forinterpretationofthecolorsinthisfigure,thereader isreferredtothewebversionofthisarticle.)ByLemma 3,theWheelergraphofamulti-stringBWTispathcoherent.ToseethatthegraphremainsaWheelergraph aftermerging,wenotethat:
1. Sourcenodeswithin-degree0formaconsecutiverangebyDefinition 1.
2. Acommonregion
X
iisenteredeitherfromthesourcenodesorfromthefollowingnon-commonregionY
i.Inthelatter case, thesetofnodesreachablewithstring ZRi isthesetofnodeswithoutgoingedgestothecommonregion Xi.By
Lemma 3,thesenodesformaconsecutiverange.Asthecommonregionisenteredfromaconsecutiverangeofnodes, we see by iterating Definition 2 that nodes corresponding to aligned positions in the region also form consecutive ranges.
3. Bytheprevious twocases,thenon-commonregionisenteredfromaconsecutiverangeofnodes.ByDefinition 2,we seethatthenodescorrespondingtoasuffixoftheregionformconsecutiveranges.
False positivesare paths that donot correspond to anyof theoriginal strings. Theycan occur when the pathcomes from anon-common region Yi+1,enters a commonregion Xi+1, andexitsinto anothernon-common region Yi. Assume that foreachnode
v
wehavestoredthesetofstrings Sv passingthrough thenode.Tocheckforfalsepositives, wetake theintersectionofsets Sv overallnodesv
onthepath.Iftheintersectionisnon-empty,itcontainsthestringscompatible
withthepath.If(u, v)
isthe onlyoutgoingedgefromnode u,we have Su⊆
Sv.Similarly, if(u,
v) istheonlyincoming edgetonodev
,wehaveSv⊆
Su.Henceitisenoughtotaketheintersectiona)atthefinalnode;andb)atnodesu
,where thepathcrossesanedge(u, v)
toanode v withmultipleincomingedges.ItisalsoenoughtostorethesetSu explicitlyif a)nodeu
hasmultipleornooutgoingedges;orb)Su
=
Sv fortheonlyoutgoingedge(u, v)
.Let
α
beastring,letV
α bethesetofnodesreachablebyfollowingpathswithlabelα
,andlet Sα bethesetofstringscompatiblewiththepaths.BecausethegraphisaDFA,thesetofreachablenodesismonotonicallydecreasing:
|
Vαa| ≤ |
Vα| and|
Vaα| ≤ |
Vα|
foranycharactera
∈
A.If Zi isamoved substring,VZRia
≤
1 foranycharactera
∈
A,asthepaths can onlyendattherightmostnodeofthecommonregion Xi.Ifset VZRia isnon-empty,set SZiRa containsallstrings,as
a Zi
is asubstringofallstrings.When we search in FMA, we move from Vα to Vαa. We update the set of compatible strings S lazily. Initially the set contains all strings. If
|
Vα|>
1, there cannot be false positives. Either none of the paths enters a common region from anon-common region,orall such paths start within Zi andenter Xi through every incomingedge. Ineithercase, Sα=
v∈VαSv. If|
Vαa| =
1 and the node v∈
Vαa has multiple incoming edges, there is a risk of false positives. We thereforeupdateS
←
S∩
v∈VαSv.Whenthesearchfinishesat
V
α,thesetofcompatiblestringsisS
∩
v∈VαSv. 4.5. GCSA
While notall NFAs haveanordering ofnodesthatmakes themWheeler graphs,wecan useWheelergraphs toindex pathlabelsoflengthupto
k
inarbitrarygraphs[49].ThisistheideabehindGeneralizedCompressedSuffixArrays(GCSAs). WeemphasizethatWheelergraphscanbelargerthanthegraphstheyareusedtoindex.Fig. 9.Top:ADFA.Second:A3rd-orderde BruijngraphforpathlabelsintheDFA.NodelabelsindicatenodeorderingintheWheelergraph.Wealsoshow the
k
-tuplescorrespondingtoeachnode(inreversetomatchthesortingorder)andthemapping f tothenodesoftheDFA.Third:Aprunedde Bruijn graphfortheDFA.Nodelabelsarecompatiblewiththesecondgraph.Graycolorindicatesnodeswherethemappingmustbestoredexplicitly.Bottom: A pathgraphbasedona4th-orderde Bruijngraph.Definition7.Let
G
beanNFA,letG
beaWheelergraph,letα
beastring,andletV
α andV
α bethesetsofnodesreachablewith
α
inG
andG
,respectively.GraphG
isak
th-orderpath graph
ofG
forak >
0,ifthereisafunction f fromnodesof GtosetsofnodesofG
suchthat Vα=
v∈Vα f(v
)
forall|
α
| ≤
k.We can usea
k
th-order de Bruijngraph ofpath labels ingraph G asa kth-order path graphof G. SeeFig. 9 foran example. Function f hasthe same role as the suffix array.As withthe sets of strings in FMA, we must store set f(u)
explicitly,if a)node u hasmultipleorno outgoingedges;orb) we cannot derive f(u)
from f(v)
fortheonly outgoing edge(u, v)
.WeoftennumberthenodesoftheNFAinawaysuchthat f(u)
= {
x−
1|
x∈
f(v)
}
,if(u, v)
istheonlyoutgoing edgefromu
andtheonlyincomingedgetov
.Withlargevaluesof
k
,de Bruijngraphsoftenhavemanyredundantnodeswhenweusethemaspathgraphs.If|
α
| ≤
k, thenodes Vα reachablewithα
inthede Bruijn grapharetheoneswithα
asasuffix ofthecorrespondingk
-tuples.ByLemma 3,thenodesformaconsecutiverange.If f
(v)
=
f(v
)
forallv,
v∈
Vα,wecanmergethenodeswithoutaffecting reachability.GCSA2
[49]usessuchpruned de Bruijn graphs
tosavespacewhenindexingpathlabelsinanNFA.SeeFig. 9for anexample.The original
GCSA
[50]considers a differentscenario.Instead ofindexingpaths oflength up tok
inan arbitraryNFA, it indexespaths ofarbitrarylength in an acyclicDFA.Conceptually, GCSAconstruction searchesfor ade Bruijn graph G thatisequivalent totheinputgraphG
asaDFA,andthenusesthatde Bruijngraphasaninfinite-order pathgraphofG. Consecutive ranges of nodesare merged if f(v)
=
f(v
)
for all nodes v and v in the range,even ifthe ranges do not correspondtosharedsuffixesofthek
-tuples.SeeFig. 9(bottom)foranexample.4.6. PBWT, wavelet trees and wavelet matrices
Another remarkablevariant of the BWT is Durbin’spositional BWT [12] (PBWT), which he introduced for haplotype matchingbuthasalsobeenusedforreconstructingancestralrecombinationgraphs[48].Given
d
stringsofequallengththe PBWTsupportsthematchingofsubstringsstartingataspecifiedpositioninthestrings.76 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
Fig. 10.A standard representation of a wavelet tree and its representation as a Wheeler graph.
Fig. 11.A standard representation of a wavelet matrix and its representation as a Wheeler graph with edge labels over a larger alphabet.
WecanviewthePBWTasaWheelergraphwiththesamestructureasthatofthemulti-stringBWT.Ifthemulti-string BWThasanedge
(u, v)
withlabela
fromtextpositioni
topositioni
+
1,PBWTuses(i
+
1,
a)asthelabel.Becausealledges frompositioni
gotopositioni
+
1,thepositionalcomponentisalwaysi
+
1 foralloutgoingedgesinrange[
(i
−
1)d
+
1,
id]
. Hence, we caninferthepositionfromtherangeandstore justthecharactera
inthesuccinctrepresentation(Lemma 4). A generalization [44] ofPBWTrelaxes therequirements.Instead ofstoring stringsof equallength,we store thelabels of paths inagraph.Ifedge(u, v)
intheWheelergraphcorrespondstoedge(u
,
v)
withlabela
intheunderlyinggraph,we use(v
,
a)asthelabeloftheedge.Aswecannolongerinferthepositionalcomponent,wehavetostoreitexplicitly.Although introduced forcompletely differentapplications, the PBWT bearsa striking resemblanceto a datastructure calleda waveletmatrix,which Claude,Navarro andOrdóñez [11]introduced asa version ofthewavelettree [28] better suited forlarge alphabets.This suggeststhateven wavelettrees andmatricescan beviewed asWheelergraphs. Forthe sake ofbrevity, weassume thereaderis familiarwiththesedatastructureswith“myriadvirtues” [13,24],andnote only that,whilebuildingawavelettreeisanalogoustoamost-significant-bit-firstradixsort,buildingeitheraPBWTorawavelet matrixisanalogoustoaleast-significant-bit-firstradixsort.
ConsiderthewavelettreeshownontheleftinFig. 10.Sinceitisatree,wecanuseourconstructionfromSubsection4.2
to build an FSA (withone initial state)that acceptsall thebinary strings that are root-to-leaf paths inthe wavelettree. This doesnot store all theinformation inthe wavelet tree,however, sowe turnthe graphfrom an FSA intoa directed multi-graph, asshownonthe rightinFig. 10,whichisstill aWheelergraph.Thisgraphhassomeinteresting properties: e.g.,theorderoftheoutgoingedgesfromeachnodeencodethebitvectorstoredatthatnodeinthewavelettree;asinthe wavelettree,foreachinternalnode v excepttheroot, v’sin-degreeandout-degreeareequalanditsin-degreeisequalto thesumofthein-degreesofitsleafdescendants.Itisbeyondthescopeofthispapertoinvestigatethisrepresentationof wavelettreesasadatastructure,butitseems likeaninterestingdirectionforfutureresearch thatgoesbeyondthereach ofTheorem 6.
Now considerthewaveletmatrixshowontheleft inFig. 11.WecanturnitintoaWheelergraphif,asinthewavelet tree,wetransformitintoadirectedmulti-graphandthen,asinthePBWT,werenamethelabel
c
oneachedgewiththepair(i,
c),wherei
isthelevelofthestartingnode,asshownontherightinFig. 11.Again,theWheelergraphclearlyencodes thewaveletmatrix,butweleaveasfutureworkinvestigatingthepossiblebenefitsofthisalternativerepresentation.5. Conclusionsandfuturework
We have defined Wheeler graphs to try to capture the ideas behind many of the variants of the BWT,and given a frameworkfordevelopingnewvariantsby solvingproblemswithfinite-state automatawhosestatediagramsareWheeler graphs.Wedidnotpursuethetopichere,butWheelergraphsseemabletocapturepropertiesalsoofstringtransformations notrelatedtopatternmatching:byreversingtheinequality
u <
uin(1)wegetaslightlydifferentnotionofWheelergraph thatcanbeusedtosuccinctlyrepresentthevariantoftheBWTdefinedintermsofthealternatinglexicographicorder[26](seealso[27]foranancestorofthemulti-string BWT).
There has been so much work involving the BWT, however, that it would be surprising if one idea could subsume it all, and indeed some BWT-related results, such as the positional BWT, seemingly cannot be reasonably modeled by finite-state automata, evenif they canstill be viewed asWheeler graphs; other BWT-related results,such asGanguly et al.’sparameterizedBWT[25] andtheindexfororder-preservingmatchingin[23],seemnoteventobebasedonWheeler graphs atall.We havenot yeteven consideredbidirectional FM-indexes[3,31,32] andbidirectionalBWT-basedde Bruijn graphs[4].
Apartfromapplying andextending ourframework, we hope to develop algorithms to recognizeWheeler graphs effi-ciently,andtocharacterizeclassesoffinite-statediagramsthatareWheelergraphsorcanbeexpandedslightlytobecome Wheelergraphswithoutchangingthelanguageacceptedbytheautomata(althoughcharacterizingallsuchstate diagrams mightbedifficult).Inthisregardweobservethatnotallregularlanguageshavefinite-stateautomatawhosestatediagrams areWheelergraphs:e.g.,inanyfinite-stateautomatonforthelanguage
(ax
∗b)|
(cx
∗d),theremustbedisjointpathsforax
kb andcx
kdandbothendsofalledgeswithlabelxmustbeinthesameorder,sotheremustbeseparate nodesforx
ib and xidforallvaluesofi.Finally,wehope ournewperspective onBWTvariantsmakes themmoreaccessibletocomputer scientistsfromareas outsidestringalgorithmsanddatastructures.
References
[1]D.Adjeroh,T.Bell,A.Mukherjee,TheBurrows–WheelerTransform:DataCompression,SuffixArrays,andPatternMatching,SpringerScience&Business Media,2008.
[2]M.J.Bauer,A.J.Cox,G.Rosone,LightweightalgorithmsforconstructingandinvertingtheBWTofstringcollections,Theoret.Comput.Sci.483(2013) 134–148.
[3]D.Belazzougui,F.Cunial,J.Kärkkäinen,V.Mäkinen,VersatilesuccinctrepresentationsofthebidirectionalBurrows–Wheelertransform,in:European SymposiumonAlgorithms,Springer,2013,pp. 133–144.
[4]D.Belazzougui,T.Gagie,V.Mäkinen,M.Previtali,S.J.Puglisi,Bidirectionalvariable-orderdeBruijngraphs,in:LatinAmericanSymposiumonTheoretical Informatics,Springer,2016,pp. 164–178.
[5]T.Beller,E.Ohlebusch,EfficientconstructionofacompresseddeBruijngraph forpan-genomeanalysis,in:AnnualSymposiumonCombinatorial PatternMatching,Springer,2015,pp. 40–51.
[6]A.Bowe,T.Onodera,K.Sadakane,T.Shibuya,SuccinctdeBruijngraphs,in:Int.WorkshoponAlgorithmsinBioinformatics,Springer,2012,pp. 225–235. [7]N.R. Brisaboa, A.Cerdeira-Pena, A.Fariña, G. Navarro, Acompact RDFstoreusing suffix arrays,in: SPIRE, in:LNCS, vol. 9309, Springer,2015,
pp. 103–115.
[8]N.R.Brisaboa,A.Fariña,D.Galaktionov,M.A.Rodríguez,Compacttriprepresentationovernetworks,in:SPIRE,in:LNCS,vol. 9954,2016,pp. 240–253. [9]M.Burrows,D.Wheeler,ABlock-SortingLosslessDataCompressionAlgorithm,Tech.Rep.124,DigitalEquipmentCorporation,1994.
[10]R.Chikhi,A.Limasset,S.Jackman,J.T.Simpson,P.Medvedev,OntherepresentationofdeBruijngraphs,J.Comput.Biol.22 (5)(2015)336–352. [11]F.Claude,G.Navarro,A.Ordónez,Thewaveletmatrix:anefficientwavelettreeforlargealphabets,Inform.Sci.47(2015)15–32.
[12]R.Durbin,EfficienthaplotypematchingandstorageusingthepositionalBurrows–Wheelertransform(PBWT),Bioinformatics30 (9)(2014)1266–1272. [13]P.Ferragina,R.Giancarlo,G.Manzini,Themyriadvirtuesofwavelettrees,Inform.andComput.207(2009)849–866.
[14]P.Ferragina,R.Giancarlo,G.Manzini,M.Sciortino,Boostingtextualcompressioninoptimallineartime,J.ACM52(2005)688–713. [15]P.Ferragina,R.González,G.Navarro,R.Venturini,Compressedtextindexes:fromtheorytopractice,ACMJ.Exp.Algorithmics13(2009).
[16]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,Structuringlabeledtreesforoptimalsuccinctness,andbeyond,in:Proc.46thIEEESymposiumon FoundationsofComputerScience,FOCS’05,2005,pp. 184–193.
[17]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,CompressingandsearchingXMLdataviatwozips,in:Proc.15thInternationalWorldWideWeb Conference,WWW’06,2006,pp. 751–760.
[18]P.Ferragina,F.Luccio,G.Manzini,S.Muthukrishnan,Compressingandindexinglabeledtrees,withapplications,J.ACM57(2009).
[19]P.Ferragina,G.Manzini,Opportunisticdatastructureswithapplications,in:Proc.41stIEEESymp.onFound.ofComputerScience,2000,pp. 390–398. [20]P.Ferragina,G.Manzini,Indexingcompressedtext,J.ACM52 (4)(2005)552–581.
[21]P.Ferragina,R.Venturini,Compressedpermutermindex,in:SIGIR,ACM,2007,pp. 535–542.
[22]P.Ferragina,R.Venturini,Thecompressedpermutermindex,ACMTrans.Algorithms7 (1)(2010)10:1–10:21.
[23]T.Gagie,G.Manzini,R.Venturini,Anencodingfororder-preservingmatching,in:Proc.25thAnnualEuropeanSymposiumonAlgorithms,ESA2017, DagstuhlPublishing,Germany,2017.
[24]T.Gagie,G.Navarro,S.J.Puglisi,Newalgorithmsonwavelettreesandapplicationstoinformationretrieval,Theoret.Comput.Sci.426(2012)25–41. [25]A.Ganguly,R.Shah,S.V.Thankachan,pBWT:achievingsuccinctdatastructuresforparameterizedpatternmatchingandrelatedproblems,in:Proc.
28thACM–SIAMSymposiumonDiscreteAlgorithms,SODA’17,SIAM,2017,pp. 397–407.
[26] I.M.Gessel,A.Restivo,C.Reutenauer,Abijectionbetweenwordsandmultisetsofnecklaces,EuropeanJ.Combin.33 (7)(2012)1537–1546,
http://
dx.doi.org/10.1016/j.ejc.2012.03.016.[27]I.M.Gessel,C.Reutenauer,Countingpermutationswithgivencyclestructureanddescentset,J.Combin.TheorySer.A64 (2)(1993)189–215. [28]R.Grossi,A.Gupta,J.S.Vitter,High-orderentropy-compressedtextindexes,in:Proc.14thACM–SIAMSymposiumonDiscreteAlgorithms,SODA’03,
SIAM,2003,pp. 841–850.
[29]R.Grossi, J.S.Vitter,Compressedsuffixarraysandsuffixtreeswith applicationstotextindexingand stringmatching,in:Proc.ofthe32ndACM SymposiumonTheoryofComputing,2000,pp. 397–406.
78 T. Gagie et al. / Theoretical Computer Science 698 (2017) 67–78
[31]G.Kucherov,K.Salikhov,D.Tsur,Approximatestringmatchingusingabidirectionalindex,Theoret.Comput.Sci.638(2016)145–158.
[32]T.W.Lam,R.Li,A.Tam,S.Wong,E.Wu,S.-M.Yiu,Highthroughputshortreadalignmentviabi-directionalBWT,in:IEEEInternationalConferenceon BioinformaticsandBiomedicine,BIBM’09,IEEE,2009,pp. 31–36.
[33]D.Li,C.-M.Liu,R.Luo,K.Sadakane,T.-W.Lam,Megahit:anultra-fastsingle-nodesolutionforlargeandcomplexmetagenomicsassemblyviasuccinct de Bruijngraph,Bioinformatics31 (10)(2015)1674–1676.
[34]V.Mäkinen,Compactsuffixarray, in:Proc.ofthe11thSymposiumonCombinatorialPatternMatching,in:LNCS,vol. 1848,Springer-Verlag,2000, pp. 305–319.
[35]V.Mäkinen,D.Belazzougui,F.Cunial,A.I.Tomescu,Genome-ScaleAlgorithmDesign,CambridgeUniversityPress,2015.
[36]S.Mantaci,A.Restivo,G.Rosone,M.Sciortino,Anextensionoftheburrowswheelertransformandapplicationstosequencecomparisonanddata compression,in:CPM,in:LNCS,vol. 3537,Springer,2005,pp. 178–189.
[37]S.Mantaci,A.Restivo,G.Rosone,M.Sciortino,AnextensionoftheBurrows–Wheelertransform,Theoret.Comput.Sci.387 (3)(2007)298–312. [38]S.Mantaci,A.Restivo,M.Sciortino,AnextensionoftheBurrows–Wheelertransformtokwords,in:DCC,IEEEComputerSociety,2005,p. 469. [39]G.Manzini,AnanalysisoftheBurrows–Wheelertransform,J.ACM48 (3)(2001)407–430.
[40]J.C.Na,H.Kim,S.Min,H.Park,T.Lecroq,M.Léonard,L.Mouchard,K.Park,FM-indexofalignmentwithgaps,arXiv:1606.03897,2016.
[41]J.C.Na,H.Kim,H.Park,T.Lecroq,M.Léonard,L.Mouchard,K.Park,FM-indexofalignment:acompressedindexforsimilarstrings,Theoret.Comput. Sci.638(2016)159–170.
[42]G.Navarro,CompactDataStructures:APracticalApproach,CambridgeUniversityPress,2016. [43]G.Navarro,V.Mäkinen,Compressedfull-textindexes,ACMComput.Surv.39 (1)(2007)2.
[44]A.M.Novak,E.Garrison,B.Paten,AgraphextensionofthepositionalBurrows–Wheelertransformanditsapplications,in:InternationalWorkshopon AlgorithmsinBioinformatics,Springer,2016,pp. 246–256.
[45]J.G.Saxe,ThePoemsofJohnGodfreySaxe,Houghton,MifflinandCo.,1881. [46] J.Seward,Thebzip2homepage,
http://www.bzip.org
,1996.[47] J.Seward,Thevalgrindhomepage,
http://valgrind.org
,2000.[48]V.Shchur,R.Durbin,TreeconsistentPBWTandtheirapplicationtoreconstructingancestralrecombinationgraphsanddemographicinference,in: PosterPresentedatthe19thAnnualInternationalConferenceonResearchinComputationalMolecularBiology(RECOMB),2015.
[49]J.Sirén,Indexingvariationgraphs,in:Proc.19thMeetingonAlgorithmEngineeringandExperiments,ALENEX’17,SIAM,2017,pp. 13–27.
[50]J.Sirén,N.Välimäki,V.Mäkinen,Indexinggraphsforpathquerieswithapplicationsingenomeresearch,IEEE/ACMTrans.Comput.Biol.Bioinform. 11 (2)(2014)375–388.