
3.4 Construction of the lcp table

The first suffix array construction algorithm [Manber and Myers, 1993] as well as the skew algorithm [Kärkkäinen et al., 2006] can be extended to construct the lcp table as a byproduct with the help of auxiliary data structures. The first optimal approach was a standalone linear-time algorithm published by Kasai et al. [2001].

3.4.1 The linear-time algorithm by Kasai et al.

The basic idea of the lcp table construction algorithm proposed in [Kasai et al., 2001] is to reuse the lcp length of a suffix and its lexicographical predecessor when comparing the next shorter suffix with its predecessor. The linear running time follows from the following lemma.

Lemma 3.3. Let $s$ be a string of length $n$ with suffix array $\mathsf{suftab}$ and lcp table $\mathsf{lcp}$. For every $j \in [0..n-1)$ with $\mathsf{lcp}[\mathsf{suftab}^{-1}[j]] = l$ and $\mathsf{suftab}^{-1}[j+1] \neq 0$, it holds that

$$\mathsf{lcp}[\mathsf{suftab}^{-1}[j+1]] \geq l - 1.$$

Proof. Let $s_j$ be a suffix with $\mathsf{lcp}[\mathsf{suftab}^{-1}[j]] = l$. The claim obviously holds for $l \leq 0$. For $l > 0$, $s_j$ has a lexicographical predecessor, say $s_i$, and for the next shorter suffixes $s_{i+1}$ and $s_{j+1}$ it holds that $\mathrm{lcp}\{s_{i+1}, s_{j+1}\} = l - 1$ and $s_{i+1} <_{\mathrm{lex}} s_{j+1}$. The lexicographical predecessor of $s_{j+1}$ is $s_{\mathsf{suftab}[\mathsf{suftab}^{-1}[j+1]-1]}$, and it holds that $s_{i+1} \leq_{\mathrm{lex}} s_{\mathsf{suftab}[\mathsf{suftab}^{-1}[j+1]-1]} <_{\mathrm{lex}} s_{j+1}$. Since every suffix that lies lexicographically between $s_{i+1}$ and $s_{j+1}$ shares their common prefix, it follows from the latter that:

$$\mathsf{lcp}[\mathsf{suftab}^{-1}[j+1]] = \mathrm{lcp}\{s_{\mathsf{suftab}[\mathsf{suftab}^{-1}[j+1]-1]}, s_{j+1}\} \geq \mathrm{lcp}\{s_{i+1}, s_{j+1}\} = l - 1. \qquad (3.23)$$

As a consequence, the lcp values of suffixes $s_j$ can be computed for increasing $j$, beginning with $j = 0$, and each pairwise suffix comparison can skip a common prefix of at least $\max(l-1, 0)$ characters, where $l$ is the lcp value of the previous comparison. In this way, the overall number of character comparisons is less than $2n$, so the inner loop in lines 8–9 of Algorithm 3.5, and with it the whole algorithm, takes $\mathcal{O}(n)$ time overall.

Algorithm 3.5: LcpTable($s$, $\mathsf{suftab}$)

input : text string $s$, suffix array $\mathsf{suftab}$
output : lcp table $\mathsf{lcp}$

 1  $n \leftarrow |s|$, $\mathsf{lcp}[0] \leftarrow -1$, $\mathsf{lcp}[n] \leftarrow -1$
 2  for $i \leftarrow 0$ to $n-1$ do    // invert suffix array
 3      $\mathsf{suftab}^{-1}[\mathsf{suftab}[i]] \leftarrow i$
 4  $l \leftarrow 0$
 5  for $j \leftarrow 0$ to $n-1$ do
 6      if $\mathsf{suftab}^{-1}[j] \neq 0$ then
 7          $i \leftarrow \mathsf{suftab}[\mathsf{suftab}^{-1}[j] - 1]$    // determine the lex. predecessor $s_i$ of $s_j$
 8          while $\max(i, j) + l < n$ and $s[i+l] = s[j+l]$ do    // compute $\mathrm{lcp}\{s_i, s_j\}$
 9              $l \leftarrow l + 1$
10          $\mathsf{lcp}[\mathsf{suftab}^{-1}[j]] \leftarrow l$
11      if $l > 0$ then $l \leftarrow l - 1$    // skip $\max(l-1, 0)$ prefix in the next round
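Algorithm 3.5 translates almost line by line into executable code. The following is a minimal Python sketch, not the original implementation: the names `build_suffix_array`, `lcp_table`, and `inv` are chosen here for illustration, the suffix array is built naively so that the example is self-contained, and the inner loop checks both text positions against $n$ because Python strings carry no terminating sentinel.

```python
def build_suffix_array(s):
    """Naive suffix array by sorting all suffixes; for small examples only."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_table(s, suftab):
    """lcp table with n + 1 entries; lcp[k] refers to the suffixes of rank k - 1 and k."""
    n = len(s)
    lcp = [0] * (n + 1)
    lcp[0] = lcp[n] = -1              # boundary values (line 1)
    inv = [0] * n                     # inverse suffix array (lines 2-3)
    for k, start in enumerate(suftab):
        inv[start] = k
    l = 0
    for j in range(n):                # visit suffixes in text order
        if inv[j] != 0:
            i = suftab[inv[j] - 1]    # lexicographical predecessor of s_j
            # both positions are bounds-checked before characters are compared
            while max(i, j) + l < n and s[i + l] == s[j + l]:
                l += 1
            lcp[inv[j]] = l
        if l > 0:
            l -= 1                    # Lemma 3.3: the next lcp value is at least l - 1
    return lcp
```

For the text `"banana"` with suffix array `[5, 3, 1, 0, 4, 2]`, the function returns `[-1, 1, 3, 0, 0, 2, -1]`.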

3.4.2 Space-saving variant

Manzini [2004] found a way to save the $4n$ bytes of additional memory consumed by the inverse suffix array by reusing the memory of $\mathsf{lcp}$. Before the lcp values are written, $\mathsf{lcp}$ stores for each suffix rank $k$ the rank of the next shorter suffix, i.e. $\mathsf{RankNext}[k] = \mathsf{suftab}^{-1}[\mathsf{suftab}[k] + 1]$. After substituting $j$ by $\mathsf{suftab}[k]$ and $\mathsf{suftab}^{-1}[j]$ by $k$ in Algorithm 3.5, Manzini replaces lines 2–3 by a $\mathsf{RankNext}$ construction algorithm which utilizes the rank-preservation property of the Burrows-Wheeler transform [Burrows and Wheeler, 1994].
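To make the definition of $\mathsf{RankNext}$ concrete, the following Python sketch evaluates it directly from an explicit inverse suffix array. This is only an illustration of the definition and not the Burrows-Wheeler based construction mentioned above, which avoids the inverse array. The function name `rank_next` and the use of `None` for the suffix of length one, which has no next shorter non-empty suffix, are assumptions of this sketch.

```python
def rank_next(suftab):
    """RankNext[k] = rank of the suffix that starts one position after suftab[k]."""
    n = len(suftab)
    inv = [0] * n                     # explicit inverse suffix array
    for k, start in enumerate(suftab):
        inv[start] = k
    # the last text position has no successor, marked with None here
    return [inv[start + 1] if start + 1 < n else None for start in suftab]
```

For the suffix array `[5, 3, 1, 0, 4, 2]` of `"banana"` the result is `[None, 4, 5, 2, 0, 1]`. Starting at the rank of the whole text and following these links visits all suffixes in text order, which is exactly the order in which the main loop of Algorithm 3.5 needs them.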

Algorithm 3.6: LcpTable_Inplace($s$, $\mathsf{suftab}$)

input : text string $s$, suffix array $\mathsf{suftab}$
output : lcp table $\mathsf{lcp}$

 1  $n \leftarrow |s|$
 2  for $i \leftarrow 0$ to $n-1$ do    // invert suffix array
 3      $\mathsf{lcp}[\mathsf{suftab}[i]] \leftarrow i$
 4  $l \leftarrow 0$
 5  for $j \leftarrow 0$ to $n-1$ do
 6      if $\mathsf{lcp}[j] \neq 0$ then
 7          $i \leftarrow \mathsf{suftab}[\mathsf{lcp}[j] - 1]$    // determine the lex. predecessor $s_i$ of $s_j$
 8          while $\max(i, j) + l < n$ and $s[i+l] = s[j+l]$ do    // compute $\mathrm{lcp}\{s_i, s_j\}$
 9              $l \leftarrow l + 1$
10          $\mathsf{lcp}[j] \leftarrow -(l + 1)$
11      if $l > 0$ then $l \leftarrow l - 1$    // skip $\max(l-1, 0)$ prefix in the next round
12  for $j \leftarrow 0$ to $n-1$ do    // transform lcp values from text to suffix array order
13      if $\mathsf{lcp}[j] < 0$ then    // find a cycle that needs to be permuted
14          $i \leftarrow j$, $\mathit{tmp} \leftarrow \mathsf{lcp}[j]$
15          while $\mathsf{suftab}[i] \neq j$ do    // for $k = \mathsf{suftab}[j], \mathsf{suftab}[\mathsf{suftab}[j]], \dots, j$
16              $\mathsf{lcp}[i] \leftarrow -\mathsf{lcp}[\mathsf{suftab}[i]] - 1$    // move $\mathsf{lcp}[\mathsf{suftab}^{-1}[k]] \leftarrow \mathsf{lcp}[k]$
17              $i \leftarrow \mathsf{suftab}[i]$
18          $\mathsf{lcp}[i] \leftarrow -\mathit{tmp} - 1$
19  $\mathsf{lcp}[0] \leftarrow -1$, $\mathsf{lcp}[n] \leftarrow -1$
20  return $\mathsf{lcp}$

Independently of Manzini's approach, we found another simple way to reuse the memory of $\mathsf{lcp}$ by storing $\mathsf{suftab}^{-1}$ in it [Weese, 2006]. As the values of $\mathsf{suftab}^{-1}$ are read only once and in sequential order, each entry can be reused after reading to store the computed lcp value. However, after all lcp values have been computed they are in text order and must be permuted in-place into suffix array order, i.e. an lcp value at position $j$ must be moved to position $\mathsf{suftab}^{-1}[j]$. To permute elements without overwriting others, we move them along cycles $j, \mathsf{suftab}[j], \mathsf{suftab}[\mathsf{suftab}[j]], \dots, j$. To permute every cycle exactly once, we iterate over all cycle start positions $j$ and mark non-permuted elements with negative values. The algorithmic details are shown in Algorithm 3.6. A similar algorithm that constructs a sparse lcp table was later published in [Kärkkäinen et al., 2009].
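The in-place scheme can be traced with a short Python sketch that follows Algorithm 3.6 step by step. The function name `lcp_table_inplace` is chosen here for illustration, and, as an assumption of this sketch, the inner loop checks both text positions against $n$ since Python strings have no sentinel. An lcp value $l$ is stored as $-(l+1)$ while it is still in text order, so that a negative entry marks an element that has not been permuted yet.

```python
def lcp_table_inplace(s, suftab):
    """lcp table using a single array of n + 1 integers besides suftab."""
    n = len(s)
    lcp = [0] * (n + 1)
    for k in range(n):                 # lcp temporarily holds the inverse suffix array
        lcp[suftab[k]] = k
    l = 0
    for j in range(n):
        if lcp[j] != 0:                # lcp[j] is still the rank of suffix j
            i = suftab[lcp[j] - 1]     # lexicographical predecessor
            while max(i, j) + l < n and s[i + l] == s[j + l]:
                l += 1
            lcp[j] = -(l + 1)          # encoded lcp value, in text order
        if l > 0:
            l -= 1
    for j in range(n):                 # permute from text order to suffix array order
        if lcp[j] < 0:                 # j starts a cycle that was not handled yet
            i, tmp = j, lcp[j]
            while suftab[i] != j:
                lcp[i] = -lcp[suftab[i]] - 1   # decode while moving along the cycle
                i = suftab[i]
            lcp[i] = -tmp - 1
    lcp[0] = lcp[n] = -1               # boundary values
    return lcp
```

For `"banana"` with suffix array `[5, 3, 1, 0, 4, 2]` the array holds `[-1, -4, -3, -2, -1, 0, 0]` after the main loop and `[-1, 1, 3, 0, 0, 2, -1]` on return.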

3.4.3 Adaptation to external memory

Adapting the algorithm in [Kasai et al., 2001] to efficiently use external memory is challenging, as it exhibits poor locality. Although in the main loop all accesses to $\mathsf{suftab}^{-1}$ and text accesses via $s[j+l]$ are in sequential order, accesses to $\mathsf{suftab}$, $\mathsf{lcp}$, and $s[i+l]$ are random. Our in-place algorithm, described in the previous section, suggests how $\mathsf{lcp}$ can be permuted such that accesses to it become sequential. A similar permutation is possible for $\mathsf{suftab}$, as it is clear beforehand in which pattern $\mathsf{suftab}$ values will be accessed. For text accesses via $s[i+l]$ this does not hold, and so far all approaches to an external memory lcp construction [Kasai et al., 2001; Kärkkäinen et al., 2009; Gog and Ohlebusch, 2011] are semi-external, i.e. they require the whole text [Gog and Ohlebusch, 2011] and an additional array of $n$ [Gog and Ohlebusch, 2011] or $4n$ bytes [Kasai et al., 2001; Kärkkäinen et al., 2009] to reside in main memory.

We developed a window-based approach that is applicable even if the text does not fit into main memory. It processes consecutive non-overlapping text windows of an arbitrary size $w$ in $\lceil n/w \rceil$ rounds. If $s[a..b)$ is the current window, then character comparisons between $s[i+l]$ and $s[j+l]$ can only be conducted if $i + l \in [a..b)$. However, some suffix comparisons may exceed the window border. Those comparisons must be interrupted at the end of the current window and resumed in the next window. The following lemma helps to keep track of the suffixes $s_i$ whose comparisons were interrupted. Whereas Lemma 3.3 states a relation between lcp lengths of suffixes and their lexicographical predecessors, the next lemma is its counterpart and gives a relation for suffixes and their successors.

Lemma 3.4. Let $s$ be a string of length $n$ with suffix array $\mathsf{suftab}$ and lcp table $\mathsf{lcp}$. For every $j \in [0..n-1)$ with $\mathsf{lcp}[\mathsf{suftab}^{-1}[j] + 1] = l$ and $\mathsf{suftab}^{-1}[j+1] \neq n - 1$, it holds that

$$\mathsf{lcp}[\mathsf{suftab}^{-1}[j+1] + 1] \geq l - 1.$$

Proof. This lemma can be proven analogously to Lemma 3.3.

A direct consequence of Lemma 3.4 is that if an lcp comparison of a suffix $s_i$ with its lexicographical successor exceeds the window end $b$, comparisons of all shorter suffixes $s_{i'}$ with $i < i' < b$ will leave the window as well. Let $\omega(b)$ be defined as the leftmost start position of such suffixes:

$$\omega(b) = \min \{\, i \mid i \in [0..n) \wedge i \leq b \leq i + \mathsf{lcp}[\mathsf{suftab}^{-1}[i] + 1] \,\}. \qquad (3.24)$$

Clearly, comparisons of suffixes $s_i$ will end left of $b$ if $i < \omega(b)$ and exceed $b$ if $\omega(b) \leq i < b$. This allows us to stop comparisons of suffixes $s_i$ at the window end $b$ and to determine $\omega(b)$, the smallest such $i$. Comparisons of suffixes $s_i$ with $\omega(a) \leq i < a$ were interrupted at the end of the previous window and can be resumed by setting $l$ to at least $a - i$.

Algorithm 3.7 shows the pseudo-code of our implementation. Lines 1–3 prepare the values $\mathsf{suftab}^{-1}[j]$ and $\mathsf{suftab}[\mathsf{suftab}^{-1}[j] - 1]$ for increasing $j$. The main loop from line 4 to 16 iterates over all non-overlapping windows $s[a..b)$, where $\omega_a$ equals $\omega(a)$ and $\omega_b$ [...]

[...] was not interrupted or ends at the text end. Then the lcp value $\mathsf{lcp}[k]$ and its rank $k$ are appended to $L$. At the end, $L$ is permuted to be ascending in $k$ and filtered for the values $\mathsf{lcp}[k]$ in lines 17–18.

Algorithm 3.7: LcpTable_ExtMem($s$, $\mathsf{suftab}$)

input : text string $s$, suffix array $\mathsf{suftab}$
output : lcp table $\mathsf{lcp}$

 1  $n \leftarrow |s|$, $l \leftarrow 0$, $\omega_a \leftarrow 0$, $L \leftarrow (0, -1) \cdot (n, -1)$    // $L$ is a string of pairs (rank, lcp value)
 2  $A \leftarrow \big( (k, \mathsf{suftab}[k-1], \mathsf{suftab}[k]) \mid k \in [0..n) \big)$
 3  permute $A$ such that $(k, i, j)$ is moved to $j$    // prepare values
 4  for $r \leftarrow 1$ to $\lceil n/w \rceil$ do
 5      $a \leftarrow (r - 1) \cdot w$, $b \leftarrow \min(rw, n)$, $\omega_b \leftarrow b$    // process window $s[a..b)$
 6      foreach $(k, i, j) \in A$ do
 7          if $k > 0$ then
 8              if $\omega_a \leq i$ and $i + l < b$ then
 9                  $l \leftarrow \max(l, a - i)$    // resume interrupted comparison
10              while $(i + l) \in [a..b)$ and $j + l < n$ and $s[i+l] = s[j+l]$ do
11                  $l \leftarrow l + 1$
12              if $i + l < b$ or $b = n$ then    // if comparison was not interrupted
13                  $L \leftarrow L \cdot (k, l)$    // append (rank, lcp value) to $L$
14              if $i + l \geq b$ then $\omega_b \leftarrow \min(\omega_b, i)$
15          if $l > 0$ then $l \leftarrow l - 1$    // skip $\max(l-1, 0)$ prefix in the next round
16      $\omega_a \leftarrow \omega_b$
17  permute $L$ such that $(i, l)$ is moved to $i$    // order lcp values by their rank
18  $\mathsf{lcp} \leftarrow \big( l \mid (i, l) \in L \big)$
19  return $\mathsf{lcp}$

We implemented the algorithm using the pipelining interface. The current window $s[a..b)$ is loaded into a memory buffer of size $w$. To minimize the running time, $w$ should be chosen as large as possible.
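The role of $\omega(b)$ from Eq. (3.24) can be illustrated on a small in-memory example. The following Python sketch is not the external memory algorithm itself: it evaluates the definition by brute force, and the names `successor_lcp` and `omega` as well as the convention of returning $b$ when no suffix reaches position $b$ are assumptions made here.

```python
def successor_lcp(s, suftab):
    """succ[i] = lcp of suffix s_i and its lexicographical successor.

    The largest suffix has no successor and gets -1, which corresponds
    to the boundary value lcp[n] = -1. Brute force, small inputs only.
    """
    n = len(s)
    succ = [-1] * n
    for k in range(n - 1):
        i, j = suftab[k], suftab[k + 1]
        l = 0
        while max(i, j) + l < n and s[i + l] == s[j + l]:
            l += 1
        succ[i] = l
    return succ


def omega(succ, b):
    """Leftmost start position i <= b whose comparison reaches position b."""
    candidates = [i for i in range(len(succ)) if i <= b <= i + succ[i]]
    return min(candidates, default=b)   # empty set: nothing was interrupted
```

For `"banana"` the successor lcp values are `[0, 0, -1, 3, 2, 1]`, so the comparison end positions $i + \mathsf{lcp}[\mathsf{suftab}^{-1}[i] + 1]$ are `[0, 1, 1, 6, 6, 6]`. They never decrease with growing $i$, which is what Lemma 3.4 states. With a window end $b = 4$ this gives $\omega(4) = 3$: only the comparison starting at position 3 has to be resumed in the next window, with $l$ set to at least $4 - 3 = 1$.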

3.4.4 Extension to multiple sequences

All of the lcp table construction algorithms described above can easily be adapted to multiple sequences. For a given set $\mathcal{S} = \{s^1, \dots, s^m\}$ of strings of lengths $n_1, \dots, n_m$ and the corresponding generalized suffix array $\mathsf{suftab}$, we define:

$$\phi(i, j) = j + \sum_{k=1}^{i-1} n_k \quad \text{and} \quad n = \sum_{k=1}^{m} n_k. \qquad (3.25)$$

As the generalized suffix array stores pairs instead of single integers, its entries cannot directly be used to access $\mathsf{suftab}^{-1}$. Therefore, we adapt Algorithm 3.5 and use $\phi$ as a unique mapping of suffix start positions onto the interval $[0..n)$ in lines 5, 9, 10, and 13 of Algorithm 3.8. The second adaptation concerns the lcp comparison in line 11, which must not run past the end of either sequence.

Algorithm 3.8: LcpTable_Multi($s^1, \dots, s^m$, $\mathsf{suftab}$)

input : multiple text strings $s^1, \dots, s^m$, suffix array $\mathsf{suftab}$
output : lcp table $\mathsf{lcp}$

 1  $n \leftarrow 0$
 2  for $i \leftarrow 1$ to $m$ do
 3      $n_i \leftarrow |s^i|$, $n \leftarrow n + n_i$
 4  for $i \leftarrow 0$ to $n-1$ do    // invert suffix array
 5      $\mathsf{suftab}^{-1}[\phi(\mathsf{suftab}[i])] \leftarrow i$
 6  $l \leftarrow 0$, $\mathsf{lcp}[0] \leftarrow -1$, $\mathsf{lcp}[n] \leftarrow -1$
 7  for $i \leftarrow 1$ to $m$ do
 8      for $j \leftarrow 0$ to $n_i - 1$ do
 9          if $\mathsf{suftab}^{-1}[\phi(i, j)] \neq 0$ then
10              $(a, b) \leftarrow \mathsf{suftab}[\mathsf{suftab}^{-1}[\phi(i, j)] - 1]$    // determine lex. predecessor
11              while $b + l < n_a$ and $j + l < n_i$ and $s^a[b+l] = s^i[j+l]$ do
12                  $l \leftarrow l + 1$    // compute lcp value
13              $\mathsf{lcp}[\mathsf{suftab}^{-1}[\phi(i, j)]] \leftarrow l$
14          if $l > 0$ then $l \leftarrow l - 1$
15  return $\mathsf{lcp}$

By implementing $\phi$ using an array of length $m$ that stores at position $i$ the partial sum of the first $i - 1$ sequence lengths, $\phi(i, j)$ can be determined in constant time and Algorithm 3.8 constructs the lcp table in $\mathcal{O}(n)$ time. Algorithms 3.6 and 3.7 can be adapted analogously without changing their asymptotic running time.
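The prefix-sum implementation of $\phi$ and its use in Algorithm 3.8 can be sketched in Python as follows. This is an illustrative sketch under a few assumptions: the names `lcp_table_multi`, `offsets`, and `phi` are chosen here, sequence numbers are 0-based instead of 1-based, the offset array carries one extra entry holding the total length $n$, and the generalized suffix array is expected to break ties between identical suffixes of different sequences by sequence number.

```python
from itertools import accumulate


def lcp_table_multi(strings, suftab):
    """lcp table of a generalized suffix array.

    suftab holds pairs (a, b), meaning the suffix of strings[a] that
    starts at position b.
    """
    # offsets[a] = total length of strings[0..a), so phi is a constant-time lookup
    offsets = [0] + list(accumulate(len(t) for t in strings))
    n = offsets[-1]

    def phi(a, b):
        return offsets[a] + b

    inv = [0] * n                      # inverse suffix array indexed by phi
    for k, (a, b) in enumerate(suftab):
        inv[phi(a, b)] = k

    lcp = [0] * (n + 1)
    lcp[0] = lcp[n] = -1
    l = 0
    for i, t in enumerate(strings):
        for j in range(len(t)):
            k = inv[phi(i, j)]
            if k != 0:
                a, b = suftab[k - 1]   # lexicographical predecessor
                u = strings[a]
                # the comparison stops at the end of either sequence
                while b + l < len(u) and j + l < len(t) and u[b + l] == t[j + l]:
                    l += 1
                lcp[k] = l
            if l > 0:
                l -= 1
    return lcp
```

For the sequences `"abab"`, `"bab"`, and `"ab"` with the suffixes sorted by content and then by sequence number, the function returns `[-1, 2, 2, 2, 0, 1, 1, 1, 3, -1]`.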