Part II Index Data Structures
3.6 Applications
3.6.2 Traversing the suf ix tree
With the enhanced suf ix array consisting of suf ix array, lcp table, and child table, we are now able to top-down traverse the nodes of the suf ix tree of a text which actually is a node traversal of the corresponding lcp-interval tree.
Algorithm 3.13: L H(π , π)
input : text π and pa ern π
output : minimal π with π β€| |π πππΏππΊπ»[ ] 1 π β 0, π β 0 2 π β |π |, π β 0 3 whileπ < π do 4 π β 5 π β min{π , π } 6 ifπ πππΏππΊπ»[ ] <| | π then 7 π β π + 1 8 π β π + lcp π πππΏππΊπ»[ ] , π 9 else 10 π β π 11 π β π + lcp π πππΏππΊπ»[ ] , π 12 returnπ Algorithm 3.14: U H(π , π)
input : text π and pa ern π
output : minimal π with π <| |π πππΏππΊπ»[ ] 1 π β 0, π β 0 2 π β |π |, π β 0 3 whileπ < π do 4 π β 5 π β min{π , π } 6 ifπ πππΏππΊπ»[ ] β€| | π then 7 π β π + 1 8 π β π + lcp π πππΏππΊπ»[ ] , π 9 else 10 π β π 11 π β π + lcp π πππΏππΊπ»[ ] , π 12 returnπ
Figure 3.8:Binary search with mlr-heuristic. L H and U H return the
same interval boundaries as L and U . A heuristic determines a lower bound on the lcp length between π πππΏππΊπ»[ ]and π and thereby reduces the
number of character comparisons on average.
1 5 50 500 5000 0.0 0.4 0.8 1.2 hs_chr2 (Ξ£ =5) pattern length running time (s) LCPE MLR Naive SufTree 1 5 50 500 5000 0.0 0.4 0.8 1.2 sprot (Ξ£ =24) pattern length running time (s) 1 5 50 500 5000 0.0 0.4 0.8 1.2 rfc (Ξ£ =256) pattern length running time (s)
Figure 3.9:Comparison of the ESA based search algorithms available in SeqAn. We com-
pared the running times required to search 100,000 patterns using the naive binary search (Naive), the binary search with mlr-heuristic (MLR), the binary search with lcp-tree (LCPE), and the suf ix tree search (SufTree). As texts we used the irst 100 M characters of a DNA, amino acid, and natural language text. Patterns are random substrings of varying length.
Top-down traversal
We implemented a so-called top-down iterator and functions to go to the root node, to go down to the leftmost child, and to go right to the next sibling of the currently visited node, where the children are lexicographically ordered by their edge labels from left to right. The top-down iterator maintains the values ππ and ππ, the lcp-interval boundaries of the
Algorithm 3.15: R (ππ‘ππ)
input : suο¬x tree iterator ππ‘ππ
1 π β |π πΌπ| β 1 // π is the text length 2 ππ‘ππ.ππ β 0 3 ππ‘ππ.ππ β π 4 ππ‘ππ.ππππππ‘π π β β₯ Algorithm 3.16: N (π) input : β-index π 1 π β πΌπ π½[π] 2 ifπ < πandπ πΌπ[π] = π πΌπ[π]then 3 returntrue 4 returnfalse Algorithm 3.17: D (ππ‘ππ)
input : suο¬x tree iterator ππ‘ππ output : returns true on success 1 ifππ‘ππ.ππ β ππ‘ππ.ππ β€ 1then 2 returnfalse;
3 ifππ‘ππ.ππ β ππ‘ππ.ππππππ‘π πthen 4 ππ‘ππ.ππππππ‘π π β ππ‘ππ.ππ
// get ππ value of right boundary 5 ππ‘ππ.ππ β πΌπ π½[ππ‘ππ.ππ β 1]
6 else
// get π½πππ value of left boundary 7 ππ‘ππ.ππ β πΌπ π½[ππ‘ππ.ππ]
8 returntrue
Algorithm 3.18: R (ππ‘ππ)
input : suο¬x tree iterator ππ‘ππ output : returns true on success 1 ifππ‘ππ.ππππππ‘π π β {β₯, ππ‘ππ.ππ}then 2 returnfalse
3 ππ‘ππ.ππ β ππ‘ππ.ππ 4 if N (ππ‘ππ.ππ)then
// ππ‘ππ.ππ has a succeeding β-index
5 ππ‘ππ.ππ β πΌπ π½[ππ‘ππ.ππ] 6 else
// ππ‘ππ.ππ is the last β-index
7 ππ‘ππ.ππ β ππ‘ππ.ππππππ‘π π 8 returntrue
currently visited node, and ππππππ‘π π the right boundary of the parent node. It starts in the root of the lcp-interval tree which is the interval [ππ..ππ) = [0..π). For the root node, we initialize ππππππ‘π π with β₯, see Algorithm 3.15. The intervals in the lcp-interval tree are distinct from each other and two iterators can be compared by comparing their boundary pairs. For leaf nodes hold ππ β ππ = 1.
When moving the iterator to the irst child of the current node, the left boundary re- mains the same whereas the smallest β-index in β-indices(ππ, ππ) becomes the new right boundary ππ and ππππππ‘π π the former value of ππ. If the current node is not the last child of its parent (ππ β ππππππ‘π π), the smallest β-index in β-indices(ππ, ππ) is the ππ value of ππ stored at πΌπ π½[ππ β 1], otherwise it is the π½πππ value of ππ stored only in this case at πΌπ π½[ππ]. The corresponding pseudo-code is given in Algorithm 3.17. Moving the iterator to the next sibling is possible iff ππ β {β₯, ππππππ‘π π}, i.e. for all but the last sibling. Then ππ becomes the former ππ and ππ becomes, if existent, its next β-index or ππππππ‘π π, otherwise (see Algorithm 3.18).
Besides the D function that moves an iterator along the leftmost edge to the irst child, we also implemented variants to go down to a child at the end of an edge starting with a certain character or to go down along the path of string characters. All the traversal functions return a boolean which is true, iff the iterator could be successfully moved.
Random traversal
If only the 3 tables of the enhanced suf ix array are given, it is not possible to move a top-down iterator upwards the tree in πͺ(1) time. It is however possible to recover the state the iterator had in the parent node by manually maintaing a stack of top-down it-
Algorithm 3.19: N _P (ππ‘)
input : suο¬x tree iterator ππ‘ // try go down, if not try go right 1 if not D (ππ‘)and not R (ππ‘)then
2 while U (ππ‘)and not R (ππ‘)do
// go up until we can go right 3 ifI R (ππ‘)then
// end when entering the root node 4 (ππ‘.ππ, ππ‘.ππ) β (β1, β1)
5 returnfalse 6 returntrue
Algorithm 3.20: N _P (ππ‘)
input : suο¬x tree iterator ππ‘ 1 if R (ππ‘)then
2 while D (ππ‘)do
// try go right and down to leaf
3 else // otherwise go up
4 if not U (ππ‘)then
// end after halting in root 5 (ππ‘.ππ, ππ‘.ππ) β (β1, β1)
6 returnfalse
7 returntrue
erator copies. To provide a more convenient interface we implemented the subclass top-
down history iterator which extends the top-down iterator by a stack. The stack stores
the lcp-interval boundaries of nodes on the path to the root. We adapted the R and D functions to clear the stack and to push the current interval boundary pair (ππ, ππ)onto the stack before going down. The new function U simply replaces (ππ, ππ) by the topmost pair and removes it from stack. We also need to adapt ππππππ‘π π, which is set to the new topmost ππ value on stack.
Depth- irst search
While the two iterators above require 3 tables of the enhanced suf ix array, we have seen in Section 3.5.1 that only 2 tables are required to conduct a postorder depth- irst search (DFS). Algorithm 3.9 can be executed to report all nodes. However, in some applications it is more appropriate to have an iterator that can be moved node-by-node. Therefore, we implemented a light-weight bottom-up iterator which maintains ππ and ππ, the current node boundaries, and provides functions B and N . Like a coroutine [Knuth, 1997], N resumes and suspends the execution of Algorithm 3.9 between two calls of report(πππ‘πππ£ππ) and instead of reporting the interval it sets ππ and ππ accordingly. B executes Algorithm 3.9 up to the irst call of report(πππ‘πππ£ππ). The last node the iterator halts in is the root node. If N is called again, it sets the iterator in a de ined end-state and indicates that by returning false.
We provide the same depth- irst search interface for the top-down history iterator but implemented N by means of D , R , and U to traverse all nodes in the order of either a postorder or a preorder depth- irst search. The pseudo-codes of both DFS variants are shown in Algorithms 3.19 and 3.20. In contrast to postorder DFS, in a preorder DFS each node is traversed before its children.
Some applications require to traverse nodes only up to a certain tree or string depth or have a different criteria to skip a whole subtree in the traversal. To meet this require-
ment, we implemented N R and N U . N R skips the subtree of
the current node and proceeds with the next sibling by omitting D in line 1 of Algo- rithm 3.19. Analogously N U skips the subtree and all right siblings of the current node.