• No results found

Traversing the suf ix tree

Part II Index Data Structures

3.6 Applications

3.6.2 Traversing the suf ix tree

With the enhanced suf ix array consisting of suf ix array, lcp table, and child table, we are now able to top-down traverse the nodes of the suf ix tree of a text which actually is a node traversal of the corresponding lcp-interval tree.

Algorithm 3.13: L H(𝑠, 𝑝)

input : text 𝑠 and pa ern 𝑝

output : minimal 𝑙 with 𝑝 ≀| |π‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] 1 𝑙 ← 0, π‘˜ ← 0 2 𝑙 ← |𝑠|, π‘˜ ← 0 3 while𝑙 < 𝑙 do 4 𝑖 ← 5 π‘˜ ← min{π‘˜ , π‘˜ } 6 ifπ‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] <| | 𝑝 then 7 𝑙 ← 𝑖 + 1 8 π‘˜ ← π‘˜ + lcp π‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] , 𝑝 9 else 10 𝑙 ← 𝑖 11 π‘˜ ← π‘˜ + lcp π‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] , 𝑝 12 return𝑙 Algorithm 3.14: U H(𝑠, 𝑝)

input : text 𝑠 and pa ern 𝑝

output : minimal π‘Ÿ with 𝑝 <| |π‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] 1 π‘Ÿ ← 0, π‘˜ ← 0 2 π‘Ÿ ← |𝑠|, π‘˜ ← 0 3 whileπ‘Ÿ < π‘Ÿ do 4 𝑖 ← 5 π‘˜ ← min{π‘˜ , π‘˜ } 6 ifπ‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] ≀| | 𝑝 then 7 π‘Ÿ ← 𝑖 + 1 8 π‘˜ ← π‘˜ + lcp π‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] , 𝑝 9 else 10 π‘Ÿ ← 𝑖 11 π‘˜ ← π‘˜ + lcp π‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ] , 𝑝 12 returnπ‘Ÿ

Figure 3.8:Binary search with mlr-heuristic. L H and U H return the

same interval boundaries as L and U . A heuristic determines a lower bound on the lcp length between π‘ π—Œπ—Žπ–Ώπ—π–Ίπ–»[ ]and 𝑝 and thereby reduces the

number of character comparisons on average.

1 5 50 500 5000 0.0 0.4 0.8 1.2 hs_chr2 (Ξ£ =5) pattern length running time (s) LCPE MLR Naive SufTree 1 5 50 500 5000 0.0 0.4 0.8 1.2 sprot (Ξ£ =24) pattern length running time (s) 1 5 50 500 5000 0.0 0.4 0.8 1.2 rfc (Ξ£ =256) pattern length running time (s)

Figure 3.9:Comparison of the ESA based search algorithms available in SeqAn. We com-

pared the running times required to search 100,000 patterns using the naive binary search (Naive), the binary search with mlr-heuristic (MLR), the binary search with lcp-tree (LCPE), and the suf ix tree search (SufTree). As texts we used the irst 100 M characters of a DNA, amino acid, and natural language text. Patterns are random substrings of varying length.

Top-down traversal

We implemented a so-called top-down iterator and functions to go to the root node, to go down to the leftmost child, and to go right to the next sibling of the currently visited node, where the children are lexicographically ordered by their edge labels from left to right. The top-down iterator maintains the values 𝑙𝑏 and π‘Ÿπ‘, the lcp-interval boundaries of the

Algorithm 3.15: R (π‘–π‘‘π‘’π‘Ÿ)

input : suffix tree iterator π‘–π‘‘π‘’π‘Ÿ

1 𝑛 ← |𝗅𝖼𝗉| βˆ’ 1 // 𝑛 is the text length 2 π‘–π‘‘π‘’π‘Ÿ.𝑙𝑏 ← 0 3 π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ ← 𝑛 4 π‘–π‘‘π‘’π‘Ÿ.π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘ ← βŠ₯ Algorithm 3.16: N (𝑖) input : β„“-index 𝑖 1 𝑗 ← 𝖼𝗅𝖽[𝑖] 2 if𝑖 < 𝑗and𝗅𝖼𝗉[𝑖] = 𝗅𝖼𝗉[𝑗]then 3 returntrue 4 returnfalse Algorithm 3.17: D (π‘–π‘‘π‘’π‘Ÿ)

input : suffix tree iterator π‘–π‘‘π‘’π‘Ÿ output : returns true on success 1 ifπ‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ βˆ’ π‘–π‘‘π‘’π‘Ÿ.𝑙𝑏 ≀ 1then 2 returnfalse;

3 ifπ‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ β‰  π‘–π‘‘π‘’π‘Ÿ.π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘then 4 π‘–π‘‘π‘’π‘Ÿ.π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘ ← π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘

// get π—Žπ—‰ value of right boundary 5 π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ ← 𝖼𝗅𝖽[π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ βˆ’ 1]

6 else

// get π–½π—ˆπ—π—‡ value of left boundary 7 π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ ← 𝖼𝗅𝖽[π‘–π‘‘π‘’π‘Ÿ.𝑙𝑏]

8 returntrue

Algorithm 3.18: R (π‘–π‘‘π‘’π‘Ÿ)

input : suffix tree iterator π‘–π‘‘π‘’π‘Ÿ output : returns true on success 1 ifπ‘–π‘‘π‘’π‘Ÿ.π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘ ∈ {βŠ₯, π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘}then 2 returnfalse

3 π‘–π‘‘π‘’π‘Ÿ.𝑙𝑏 ← π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ 4 if N (π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘)then

// π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ has a succeeding β„“-index

5 π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ ← 𝖼𝗅𝖽[π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘] 6 else

// π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ is the last β„“-index

7 π‘–π‘‘π‘’π‘Ÿ.π‘Ÿπ‘ ← π‘–π‘‘π‘’π‘Ÿ.π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘ 8 returntrue

currently visited node, and π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘ the right boundary of the parent node. It starts in the root of the lcp-interval tree which is the interval [𝑙𝑏..π‘Ÿπ‘) = [0..𝑛). For the root node, we initialize π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘ with βŠ₯, see Algorithm 3.15. The intervals in the lcp-interval tree are distinct from each other and two iterators can be compared by comparing their boundary pairs. For leaf nodes hold π‘Ÿπ‘ βˆ’ 𝑙𝑏 = 1.

When moving the iterator to the irst child of the current node, the left boundary re- mains the same whereas the smallest β„“-index in β„“-indices(𝑙𝑏, π‘Ÿπ‘) becomes the new right boundary π‘Ÿπ‘ and π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘ the former value of π‘Ÿπ‘. If the current node is not the last child of its parent (π‘Ÿπ‘ β‰  π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘), the smallest β„“-index in β„“-indices(𝑙𝑏, π‘Ÿπ‘) is the π—Žπ—‰ value of π‘Ÿπ‘ stored at 𝖼𝗅𝖽[π‘Ÿπ‘ βˆ’ 1], otherwise it is the π–½π—ˆπ—π—‡ value of 𝑙𝑏 stored only in this case at 𝖼𝗅𝖽[𝑙𝑏]. The corresponding pseudo-code is given in Algorithm 3.17. Moving the iterator to the next sibling is possible iff π‘Ÿπ‘ βˆ‰ {βŠ₯, π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘}, i.e. for all but the last sibling. Then 𝑙𝑏 becomes the former π‘Ÿπ‘ and π‘Ÿπ‘ becomes, if existent, its next β„“-index or π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘, otherwise (see Algorithm 3.18).

Besides the D function that moves an iterator along the leftmost edge to the irst child, we also implemented variants to go down to a child at the end of an edge starting with a certain character or to go down along the path of string characters. All the traversal functions return a boolean which is true, iff the iterator could be successfully moved.

Random traversal

If only the 3 tables of the enhanced suf ix array are given, it is not possible to move a top-down iterator upwards the tree in π’ͺ(1) time. It is however possible to recover the state the iterator had in the parent node by manually maintaing a stack of top-down it-

Algorithm 3.19: N _P (𝑖𝑑)

input : suffix tree iterator 𝑖𝑑 // try go down, if not try go right 1 if not D (𝑖𝑑)and not R (𝑖𝑑)then

2 while U (𝑖𝑑)and not R (𝑖𝑑)do

// go up until we can go right 3 ifI R (𝑖𝑑)then

// end when entering the root node 4 (𝑖𝑑.𝑙𝑏, 𝑖𝑑.π‘Ÿπ‘) ← (βˆ’1, βˆ’1)

5 returnfalse 6 returntrue

Algorithm 3.20: N _P (𝑖𝑑)

input : suffix tree iterator 𝑖𝑑 1 if R (𝑖𝑑)then

2 while D (𝑖𝑑)do

// try go right and down to leaf

3 else // otherwise go up

4 if not U (𝑖𝑑)then

// end after halting in root 5 (𝑖𝑑.𝑙𝑏, 𝑖𝑑.π‘Ÿπ‘) ← (βˆ’1, βˆ’1)

6 returnfalse

7 returntrue

erator copies. To provide a more convenient interface we implemented the subclass top-

down history iterator which extends the top-down iterator by a stack. The stack stores

the lcp-interval boundaries of nodes on the path to the root. We adapted the R and D functions to clear the stack and to push the current interval boundary pair (𝑙𝑏, π‘Ÿπ‘)onto the stack before going down. The new function U simply replaces (𝑙𝑏, π‘Ÿπ‘) by the topmost pair and removes it from stack. We also need to adapt π‘π‘Žπ‘Ÿπ‘’π‘›π‘‘π‘…π‘, which is set to the new topmost π‘Ÿπ‘ value on stack.

Depth- irst search

While the two iterators above require 3 tables of the enhanced suf ix array, we have seen in Section 3.5.1 that only 2 tables are required to conduct a postorder depth- irst search (DFS). Algorithm 3.9 can be executed to report all nodes. However, in some applications it is more appropriate to have an iterator that can be moved node-by-node. Therefore, we implemented a light-weight bottom-up iterator which maintains 𝑙𝑏 and π‘Ÿπ‘, the current node boundaries, and provides functions B and N . Like a coroutine [Knuth, 1997], N resumes and suspends the execution of Algorithm 3.9 between two calls of report(π‘–π‘›π‘‘π‘’π‘Ÿπ‘£π‘Žπ‘™) and instead of reporting the interval it sets 𝑙𝑏 and π‘Ÿπ‘ accordingly. B executes Algorithm 3.9 up to the irst call of report(π‘–π‘›π‘‘π‘’π‘Ÿπ‘£π‘Žπ‘™). The last node the iterator halts in is the root node. If N is called again, it sets the iterator in a de ined end-state and indicates that by returning false.

We provide the same depth- irst search interface for the top-down history iterator but implemented N by means of D , R , and U to traverse all nodes in the order of either a postorder or a preorder depth- irst search. The pseudo-codes of both DFS variants are shown in Algorithms 3.19 and 3.20. In contrast to postorder DFS, in a preorder DFS each node is traversed before its children.

Some applications require to traverse nodes only up to a certain tree or string depth or have a different criteria to skip a whole subtree in the traversal. To meet this require-

ment, we implemented N R and N U . N R skips the subtree of

the current node and proceeds with the next sibling by omitting D in line 1 of Algo- rithm 3.19. Analogously N U skips the subtree and all right siblings of the current node.