Part II Index Data Structures
3.4 Construction of the lcp table
The first suffix array construction algorithm [Manber and Myers, 1993] as well as the skew algorithm [Kärkkäinen et al., 2006] can be extended to construct the lcp table as a byproduct with auxiliary data structures. The first optimal approach was a standalone linear-time algorithm published by Kasai et al. [2001].
3.4.1 The linear-time algorithm by Kasai et al.
The basic idea of the lcp table construction algorithm proposed in [Kasai et al., 2001] is to reuse the lcp length of a suffix and its lexicographical predecessor when comparing the next shorter suffix with its predecessor. The linear running time is possible due to the following lemma.
Lemma 3.3. Given a string $s$ of length $n$, the corresponding suffix array, and the lcp table. For every $i \in [0..n-1)$ with $\mathsf{lcp}[\mathsf{suftab}^{-1}[i]] = h$ and $\mathsf{suftab}^{-1}[i+1] \neq 0$ holds
$$\mathsf{lcp}[\mathsf{suftab}^{-1}[i+1]] \geq h - 1.$$
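Proof. Let $s_i$ be a suffix with $\mathsf{lcp}[\mathsf{suftab}^{-1}[i]] = h$. The assertion obviously holds for $h \leq 0$. For $h > 0$, $s_i$ has a lexicographical predecessor, say $s_j$, and for the next shorter suffixes $s_{i+1}$ and $s_{j+1}$ holds $\mathrm{lcp}\{s_{j+1}, s_{i+1}\} = h - 1$ and $s_{j+1} <_{lex} s_{i+1}$. The lexicographical predecessor of $s_{i+1}$ is $s_{\mathsf{suftab}[\mathsf{suftab}^{-1}[i+1]-1]}$ and it holds $s_{j+1} \leq_{lex} s_{\mathsf{suftab}[\mathsf{suftab}^{-1}[i+1]-1]} <_{lex} s_{i+1}$. From the latter follows:
$$\mathsf{lcp}[\mathsf{suftab}^{-1}[i+1]] = \mathrm{lcp}\{s_{\mathsf{suftab}[\mathsf{suftab}^{-1}[i+1]-1]}, s_{i+1}\} \geq \mathrm{lcp}\{s_{j+1}, s_{i+1}\} = h - 1. \qquad (3.23)$$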
As a consequence, the lcp values of the suffixes $s_i$ can be computed for increasing $i$, beginning with $i = 0$, and each pairwise suffix comparison can skip a common prefix of at least $\max(h-1, 0)$ characters, where $h$ is the lcp value of the previous comparison. In this way, the overall number of character comparisons is less than $2n$, so the inner loop in lines 8–9 of Algorithm 3.5, and with it the whole algorithm, takes $\mathcal{O}(n)$ time overall.
Algorithm 3.5: LcpTable(s, suftab)
input : text string s, suffix array suftab
output : lcp table lcp
1  n ← |s|, lcp[0] ← −1, lcp[n] ← −1
2  for i ← 0 to n − 1 do   // invert suffix array
3    suftab⁻¹[suftab[i]] ← i
4  h ← 0
5  for i ← 0 to n − 1 do
6    if suftab⁻¹[i] ≠ 0 then
7      j ← suftab[suftab⁻¹[i] − 1]   // determine the lex. predecessor s_j of s_i
8      while max(i, j) + h < n and s[i + h] = s[j + h] do   // compute lcp{s_i, s_j}
9        h ← h + 1
10     lcp[suftab⁻¹[i]] ← h
11   if h > 0 then h ← h − 1   // skip max(h − 1, 0) prefix in the next round
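Algorithm 3.5 can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the helper `naive_suffix_array` is a hypothetical stand-in that simply sorts the suffixes (it is not a linear-time construction), and the names `lcp_table` and `inv` are chosen here for readability. The loop bound is written with `max(i, j)` so that both compared positions stay inside the text.

```python
def naive_suffix_array(s):
    """Suffix array by direct sorting (illustration only, not linear time)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_table(s, suftab):
    """lcp table with n + 1 entries; lcp[0] = lcp[n] = -1 by convention."""
    n = len(s)
    lcp = [0] * (n + 1)
    lcp[0] = lcp[n] = -1
    inv = [0] * n                      # inverse suffix array
    for k in range(n):
        inv[suftab[k]] = k
    h = 0
    for i in range(n):
        if inv[i] != 0:
            j = suftab[inv[i] - 1]     # lexicographical predecessor of suffix i
            while max(i, j) + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[inv[i]] = h
        if h > 0:
            h -= 1                     # at least h - 1 characters carry over
    return lcp


text = "banana"
print(lcp_table(text, naive_suffix_array(text)))  # [-1, 1, 3, 0, 0, 2, -1]
```

The value `h` is never reset inside the loop, which is what keeps the number of character comparisons linear.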
3.4.2 Space-saving variant
Manzini [2004] found a way to save the $4n$ bytes² of additional memory consumed by the inverse suffix array by reusing the memory of $\mathsf{lcp}$. Before the lcp values are written, $\mathsf{lcp}$ stores for each suffix rank $k$ the rank of the next shorter suffix, i.e. $\mathsf{RankNext}[k] = \mathsf{suftab}^{-1}[\mathsf{suftab}[k] + 1]$. After substitution of $i$ by $\mathsf{suftab}[k]$ and $\mathsf{suftab}^{-1}[i]$ by $k$ in Algorithm 3.5, Manzini replaces lines 2–3 by a $\mathsf{RankNext}$ construction algorithm which utilizes the rank-preservation property of the Burrows-Wheeler transform [Burrows and Wheeler, 1994].
Algorithm 3.6: LcpTable_I(s, suftab)
input : text string s, suffix array suftab
output : lcp table lcp
1  n ← |s|
2  for i ← 0 to n − 1 do   // invert suffix array
3    lcp[suftab[i]] ← i
4  h ← 0
5  for i ← 0 to n − 1 do
6    if lcp[i] ≠ 0 then
7      j ← suftab[lcp[i] − 1]   // determine the lex. predecessor s_j of s_i
8      while max(i, j) + h < n and s[i + h] = s[j + h] do   // compute lcp{s_i, s_j}
9        h ← h + 1
10     lcp[i] ← −(h + 1)
11   if h > 0 then h ← h − 1   // skip max(h − 1, 0) prefix in the next round
12 for i ← 0 to n − 1 do   // transform lcp values from text to suffix array order
13   if lcp[i] < 0 then   // find a cycle that needs to be permuted
14     j ← i, tmp ← lcp[i]
15     while suftab[j] ≠ i do   // for j = suftab[i], suftab[suftab[i]], …, i
16       lcp[j] ← −lcp[suftab[j]] − 1   // move lcp[suftab[j]] → lcp[j]
17       j ← suftab[j]
18     lcp[j] ← −tmp − 1
19 lcp[0] ← −1, lcp[n] ← −1
20 return lcp
Independently of Manzini's approach, we found another simple way to reuse the memory of $\mathsf{lcp}$ by storing $\mathsf{suftab}^{-1}$ in it [Weese, 2006]. As the values of $\mathsf{suftab}^{-1}$ are read only once and in sequential order, each entry can be used after reading to store the computed lcp value. However, after all lcp values have been computed they are in text order and must be permuted in-place to be in suffix array order, i.e. an lcp value at position $i$ must be moved to position $\mathsf{suftab}^{-1}[i]$. To permute elements without overwriting others, we swap them along cycles $i, \mathsf{suftab}[i], \mathsf{suftab}[\mathsf{suftab}[i]], \ldots, i$. To permute all cycles exactly once, we iterate over all cycle start positions $i$ and mark non-permuted elements with negative values. The algorithmic details are shown in Algorithm 3.6. A similar algorithm that constructs a sparse lcp table was later published in [Kärkkäinen et al., 2009].
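The in-place idea described above can be sketched in Python as follows. This is an illustrative sketch under the conventions used in this section (an lcp table with n + 1 entries and boundary values −1); the function name `lcp_table_inplace` is chosen here and is not taken from the text. The single list `lcp` first holds the inverse suffix array, then the lcp values in text order encoded as −(h + 1), and finally the lcp values in suffix array order.

```python
def lcp_table_inplace(s, suftab):
    """lcp construction that reuses the lcp array for the inverse suffix array."""
    n = len(s)
    lcp = [0] * (n + 1)
    for k in range(n):
        lcp[suftab[k]] = k                 # lcp[i] temporarily holds the rank of suffix i
    h = 0
    for i in range(n):
        if lcp[i] != 0:
            j = suftab[lcp[i] - 1]         # lexicographical predecessor of suffix i
            while max(i, j) + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[i] = -(h + 1)              # text order, negative marks "not yet moved"
        if h > 0:
            h -= 1
    for i in range(n):                     # permute from text order to suffix array order
        if lcp[i] < 0:                     # start of a cycle that was not handled yet
            j, tmp = i, lcp[i]
            while suftab[j] != i:
                lcp[j] = -lcp[suftab[j]] - 1   # decode and move the value to position j
                j = suftab[j]
            lcp[j] = -tmp - 1
    lcp[0] = lcp[n] = -1
    return lcp


# suffix array of "banana"
print(lcp_table_inplace("banana", [5, 3, 1, 0, 4, 2]))  # [-1, 1, 3, 0, 0, 2, -1]
```

The only entry that stays non-negative after the second loop is the one of the lexicographically smallest suffix; its cycle always contains position 0, so it is still visited by the permutation loop.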
3.4.3 Adaptation to external memory
Adapting the algorithm in [Kasai et al., 2001] to use external memory efficiently is challenging, as it shows poor locality. Although in the main loop all accesses to $\mathsf{suftab}^{-1}$ and text accesses via $s[i+h]$ are in sequential order, accesses to $\mathsf{suftab}$, $\mathsf{lcp}$, and $s[j+h]$ are random. Our in-place algorithm, described in the previous section, suggests how $\mathsf{lcp}$ can be permuted such that accesses to it become sequential. A similar permutation is possible for $\mathsf{suftab}$, as it is clear beforehand in which pattern $\mathsf{suftab}$ values will be accessed. For text accesses via $s[j+h]$ this does not hold, and so far all approaches to an external memory lcp construction [Kasai et al., 2001; Kärkkäinen et al., 2009; Gog and Ohlebusch, 2011] are semi-external, i.e. they require the whole text [Gog and Ohlebusch, 2011] and an additional array of $n$ bytes [Gog and Ohlebusch, 2011] or $4n$ bytes [Kasai et al., 2001; Kärkkäinen et al., 2009] to reside in main memory.
We developed a window-based approach that is applicable even if the text does not fit into main memory. It processes consecutive non-overlapping text windows of an arbitrary size $w$ in $\lceil n/w \rceil$ rounds. If $s[a..b)$ is the current window, then character comparisons between $s[i+h]$ and $s[j+h]$ can only be conducted if $j + h \in [a..b)$. However, some suffix comparisons may exceed the window border. Those comparisons must be interrupted at the end of the current window and resumed in the next window. The following lemma will help to easily keep track of suffixes $s_j$ whose comparisons were interrupted. Whereas Lemma 3.3 states a relation between lcp lengths of suffixes and their lexicographical predecessors, the next lemma is its counterpart and gives a relation of suffixes and their successors.
Lemma 3.4. Given a string $s$ of length $n$ and the corresponding suffix array and lcp table. For every $i \in [0..n-1)$ with $\mathsf{lcp}[\mathsf{suftab}^{-1}[i] + 1] = h$ and $\mathsf{suftab}^{-1}[i+1] \neq n - 1$ holds
$$\mathsf{lcp}[\mathsf{suftab}^{-1}[i+1] + 1] \geq h - 1.$$
Proof. This lemma can be proven analogously to Lemma 3.3.
A direct consequence of Lemma 3.4 is that if the lcp comparison of a suffix $s_j$ with its lexicographical successor exceeds the window end $b$, the comparisons of all shorter suffixes $s_{j'}$ with $j < j' < b$ will leave the window as well. Let $I(b)$ be defined as the leftmost start position of such suffixes:
$$I(b) = \min\{\, j \mid j \in [0..n) \wedge j \leq b \leq j + \mathsf{lcp}[\mathsf{suftab}^{-1}[j] + 1] \,\}. \qquad (3.24)$$
Clearly, comparisons of suffixes $s_j$ will end left of $b$ if $j < I(b)$ and exceed $b$ if $I(b) \leq j < b$. This allows us to stop comparisons of suffixes $s_j$ at the window end $b$ and to determine $I(b)$, the smallest such $j$. Comparisons of suffixes $s_j$ with $I(a) \leq j < a$ were interrupted at the end of the previous window and can be resumed by setting $h$ to at least $a - j$.
Algorithm 3.7 shows the pseudo-code of our implementation. Lines 1–3 prepare the values $\mathsf{suftab}^{-1}[i]$ and $\mathsf{suftab}[\mathsf{suftab}^{-1}[i] - 1]$ for increasing $i$. The main loop from line 4 to 16 iterates over all non-overlapping windows $s[a..b)$, where $I_a$ equals $I(a)$ and $I_b$ […] was not interrupted or ends at the text end. Then the lcp value $\mathsf{lcp}[k]$ and its rank $k$ are appended to $L$. At the end, $L$ is permuted to be ascending in $k$ and filtered for the values $\mathsf{lcp}[k]$ in lines 17–18.
Algorithm 3.7: LcpTable_EM(s, suftab)
input : text string s, suffix array suftab
output : lcp table lcp
1  n ← |s|, h ← 0, I_a ← 0, L ← (0, −1) · (n, −1)   // L is a string of pairs (rank, lcp value)
2  A ← ⟨(k, suftab[k − 1], suftab[k])⟩ for k ∈ [0..n)
3  permute A such that (k, j, i) is moved to i   // prepare values
4  for r ← 1 to ⌈n/w⌉ do
5    a ← (r − 1) · w, b ← min(r · w, n), I_b ← b   // process window s[a..b)
6    foreach (k, j, i) ∈ A do
7      if k > 0 then
8        if I_a ≤ j and (j + h < b or b = n) then
9          h ← max(h, a − j)   // resume interrupted comparison
10         while (j + h) ∈ [a..b) and i + h < n and s[i + h] = s[j + h] do
11           h ← h + 1
12         if j + h < b or b = n then   // if comparison was not interrupted
13           L ← L · (k, h)   // append (rank, lcp value) to L
14       if j + h ≥ b then I_b ← min(I_b, j)
15     if h > 0 then h ← h − 1   // skip max(h − 1, 0) prefix in the next round
16   I_a ← I_b
17 permute L such that (k, h) is moved to k   // order lcp values by their rank
18 lcp ← ⟨h⟩ for (k, h) ∈ L
19 return lcp
We implemented the algorithm using the pipelining interface. The current window $s[a..b)$ is loaded into a memory buffer of size $w$. To minimize the running time, $w$ should be chosen as large as possible.
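The window scheme can be simulated in memory. The following Python sketch is one possible reading of it and not the pipelined implementation mentioned above. Its assumptions: the window buffer serves the random accesses to the predecessor suffix (position `j + h`), while positions `i + h` are read from the text in increasing order as a stand-in for a sequential stream; pending comparisons are recognised by `I_a <= j`; and in the last window every pending comparison is finalised. The names `lcp_table_windowed`, `I_a`, `I_b`, `A` and `L` are chosen here for illustration.

```python
def lcp_table_windowed(s, suftab, w):
    """Window-based lcp construction, simulated in memory with window size w >= 1."""
    n = len(s)
    if n == 0:
        return [-1]
    # Triples (k, j, i) ordered by text position i = suftab[k]; j = suftab[k - 1]
    # is the lexicographical predecessor (the value for k = 0 is never used).
    A = sorted(((k, suftab[k - 1], suftab[k]) for k in range(n)), key=lambda t: t[2])
    L = [(0, -1), (n, -1)]                 # pairs (rank, lcp value)
    I_a, h = 0, 0
    for a in range(0, n, w):
        b = min(a + w, n)
        window = s[a:b]                    # the only text part accessed at random
        I_b = b
        for k, j, i in A:
            if k > 0:
                if I_a <= j and (j + h < b or b == n):
                    h = max(h, a - j)      # resume an interrupted comparison
                    while j + h < b and i + h < n and s[i + h] == window[j + h - a]:
                        h += 1
                    if j + h < b or b == n:    # comparison was not interrupted
                        L.append((k, h))
                if j + h >= b:
                    I_b = min(I_b, j)      # leftmost suffix that leaves the window
            if h > 0:
                h -= 1
        I_a = I_b
    L.sort()                               # order lcp values by rank
    return [value for _, value in L]


# suffix array of "banana", windows of two characters
print(lcp_table_windowed("banana", [5, 3, 1, 0, 4, 2], 2))  # [-1, 1, 3, 0, 0, 2, -1]
```

Every round scans all triples, so the number of passes is the number of windows. This is consistent with the recommendation to choose the window as large as possible.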
3.4.4 Extension to multiple sequences
All of the lcp table construction algorithms described above can easily be adapted to multiple sequences. For a given set $\mathcal{S} = \{s^1, \ldots, s^m\}$ of strings of lengths $n_1, \ldots, n_m$ and the corresponding generalized suffix array $\mathsf{suftab}$, we define:
$$\varphi(i, j) = \sum_{k=1}^{i-1} n_k + j \quad\text{and}\quad n = \sum_{k=1}^{m} n_k. \qquad (3.25)$$
As the generalized suffix array stores pairs instead of single integers, its entries cannot directly be used to access $\mathsf{suftab}^{-1}$. Therefore, we adapt Algorithm 3.5 and use $\varphi$ as a unique mapping of suffix start positions onto the interval $[0..n)$ in lines 5, 9, 10, and 13 of Algorithm 3.8. The second adaptation concerns the lcp comparison in line 11.
Algorithm 3.8: LcpTable_M(s^1, …, s^m, suftab)
input : multiple text strings s^1, …, s^m, suffix array suftab
output : lcp table lcp
1  n ← 0
2  for i ← 1 to m do
3    n_i ← |s^i|, n ← n + n_i
4  for k ← 0 to n − 1 do   // invert suffix array
5    suftab⁻¹[φ(suftab[k])] ← k
6  h ← 0, lcp[0] ← −1, lcp[n] ← −1
7  for i ← 1 to m do
8    for j ← 0 to n_i − 1 do
9      if suftab⁻¹[φ(i, j)] ≠ 0 then
10       (i', j') ← suftab[suftab⁻¹[φ(i, j)] − 1]   // determine lex. predecessor
11       while j + h < n_i and j' + h < n_{i'} and s^i[j + h] = s^{i'}[j' + h] do
12         h ← h + 1   // compute lcp value
13       lcp[suftab⁻¹[φ(i, j)]] ← h
14     if h > 0 then h ← h − 1
15 return lcp
By implementing $\varphi$ using an array of length $m$ that stores at position $i$ the partial sum of the first $i - 1$ sequence lengths, $\varphi(i, j)$ can be determined in constant time and Algorithm 3.8 constructs the lcp table in $\mathcal{O}(n)$ time. Algorithms 3.6 and 3.7 can be adapted analogously without changing their asymptotic running time.
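The adaptation to multiple sequences can be sketched in Python as follows. This is an illustrative sketch with several assumptions: sequences are numbered from 0 instead of 1, the generalized suffix array is a list of pairs (sequence number, offset), the hypothetical helper `naive_generalized_suffix_array` sorts suffixes directly and breaks ties between equal suffixes by sequence number, and the partial sums are kept in a list called `offsets`.

```python
from itertools import accumulate


def naive_generalized_suffix_array(seqs):
    """Pairs (i, j) sorted by suffix, ties broken by sequence number (illustration only)."""
    pairs = [(i, j) for i, s in enumerate(seqs) for j in range(len(s))]
    return sorted(pairs, key=lambda p: (seqs[p[0]][p[1]:], p[0]))


def lcp_table_multi(seqs, suftab):
    """lcp table of a generalized suffix array over several sequences."""
    lengths = [len(s) for s in seqs]
    offsets = [0] + list(accumulate(lengths))   # offsets[i] = total length of sequences before i
    n = offsets[-1]

    def phi(i, j):                              # constant-time mapping onto [0..n)
        return offsets[i] + j

    inv = [0] * n
    for k, (i, j) in enumerate(suftab):
        inv[phi(i, j)] = k
    lcp = [0] * (n + 1)
    lcp[0] = lcp[n] = -1
    h = 0
    for i, s in enumerate(seqs):
        for j in range(lengths[i]):
            k = inv[phi(i, j)]
            if k != 0:
                i2, j2 = suftab[k - 1]          # lexicographical predecessor
                t = seqs[i2]
                while j + h < lengths[i] and j2 + h < lengths[i2] and s[j + h] == t[j2 + h]:
                    h += 1
                lcp[k] = h
            if h > 0:
                h -= 1
    return lcp


seqs = ["ab", "b"]
print(lcp_table_multi(seqs, naive_generalized_suffix_array(seqs)))  # [-1, 0, 1, -1]
```

The comparison stops at the end of either sequence, so lcp values never extend across sequence boundaries. The tie-breaking rule of the helper matters: equal suffixes have to be ordered consistently for the carried-over value `h` to remain a valid lower bound.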