Extraction - The New Algorithm - Fischer, Johannes (2007): Data Structures for Efficient

6.3 The New Algorithm

6.3.3 Extraction

We now describe how to output all strings that pass the frequency-based criterion. As mentioned above, this step is accomplished by a simulated bottom-up traversal of the suffix tree due to Kasai et al. (2001),³ calculating for each LCP-interval representing string $\varphi$ the values $S_{D_j}(\varphi)$ and $C_{D_j}(\varphi)$ for $j = 1, \dots, m$,

³ Observe that one could also simulate this DFS by means of the Enhanced Suffix Array (cf. Chapter 5). However, as we are doing a bottom-up traversal here, it is simpler to use Kasai et al.'s method.

Algorithm 6.2: Extraction of all substrings satisfying p.

Input: suffix array SA, LCP-array LCP, C'_{D_j} as computed by Alg. 6.1 (all of size n), frequency-based predicate p(supp_{D_1}, ..., supp_{D_m})

Output: all substrings satisfying p

 1  S is a stack holding tuples (v.h, v.S_{D_1}, ..., v.S_{D_m}, v.C_{D_1}, ..., v.C_{D_m})
 2  let v be a stopper element with v.h = −∞, push v on S
 3  for i = 1, ..., n+1 do
 4      v ← top(S)    {v represents the string to be examined next}
 5      S_{D_j} ← 0, C_{D_j} ← 0 for all j = 1, ..., m
 6      while v.h > LCP[i] do
 7          v ← pop(S), w ← top(S)    {w always points to top of stack}
 8          if w.h ≥ LCP[i] then
 9              {Otherwise w is not parent node of current node v.}
10              w.S_{D_j} += v.S_{D_j}, w.C_{D_j} += v.C_{D_j} for all j    {accumulate}
11          endif
12          freq_{D_j} ← v.S_{D_j} − v.C_{D_j} for all j = 1, ..., m
13          if p(freq_{D_1}, ..., freq_{D_m}) then
14              {Now v represents a maximally repeated substring satisfying p.}
15              for h = max{w.h, LCP[i]} + 1, ..., v.h do print t_{SA[i]..SA[i]+h−1}
16          endif
17          S_{D_j} ← v.S_{D_j}, C_{D_j} ← v.C_{D_j} for all j = 1, ..., m
18          v ← w
19      endw
20      if v.h < LCP[i] then push (LCP[i], S_{D_1}, ..., S_{D_m}, C_{D_1}, ..., C_{D_m}) on S
21      top(S).C_{D_j} += C'_{D_j}[i] for all j = 1, ..., m    {gather correction factors}
22      if i ≤ n then
23          let SA[i] point to D_j; set S_{D_j} ← 1 and all other S_{D_{j'}}'s to 0
24          push (n − SA[i] + 1, S_{D_1}, ..., S_{D_m}, 0, ..., 0) on S
25      endif
26  endfor

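thereby yielding the frequency of $\varphi$ in $D_j$ as $S_{D_j}(\varphi) - C_{D_j}(\varphi)$. The formula for the $C$-numbers is given by (6.4), and for the $S$-numbers we have

$$S_{D_j}(\varphi) = \sum_{\substack{l \le i \le r \\ \mathrm{SA}[i] \text{ points to } D_j}} 1 \tag{6.5}$$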

(again, $[l:r]$ is $\varphi$'s LCP-interval). This is simply because the interval $[l:r]$ in SA represents all suffixes of $t$ that are prefixed by $\varphi$. As for the $C$-numbers, (6.5) can be rewritten to allow a recursive calculation:

$$S_{D_j}(\varphi) = \begin{cases} 0 & \text{if } l = r \text{ and } \mathrm{SA}[l] \text{ points to } D_{j'},\ j' \ne j \\ 1 & \text{if } l = r \text{ and } \mathrm{SA}[l] \text{ points to } D_j \\ \sum\limits_{\substack{[l':r'] \text{ child-interval of } [l:r] \\ [l':r'] \text{ represents } \psi \ne \varphi}} S_{D_j}(\psi) & \text{otherwise} \end{cases} \tag{6.6}$$

Alg. 6.2 is used for the extraction phase. If one deletes lines 5, 21 and 23 from Alg. 6.2 and substitutes lines 8–17 by the single command "print $t_{\mathrm{SA}[i]..\mathrm{SA}[i]+v.h-1}$," this yields exactly the algorithm in Fig. 7 of Kasai et al. (2001), which solves the substring traversal problem, i.e., the enumeration of all maximally repeated substrings. The idea behind this algorithm is to visit all suffixes of $t$ in lexicographic order and to keep all maximally repeated prefixes of the current suffix on a stack $S$, ordered by their length, with the longest such prefix being on top of $S$. A more formal description is as follows. Each element on $S$ is represented by a tuple $(h, S_{D_1}, \dots, S_{D_m}, C_{D_1}, \dots, C_{D_m})$, where $h$ is the length of the prefix (i.e., the corresponding prefix is $t_{\mathrm{SA}[i]..\mathrm{SA}[i]+h-1}$), and the other variables are the counters as defined by (6.1) and (6.2). At the beginning of step $i$ of the for-loop (lines 3–26), we have that the $(i-1)$-th suffix and all maximally repeated prefixes of $t_{\mathrm{SA}[i-1]..n}$ are on $S$. Then the $(i-1)$-th suffix is visited (line 4) and the following steps are performed:

1. The while-loop (lines 6–19) removes from $S$ all tuples representing strings with length greater than $\mathrm{lce}(i-1, i) = \mathrm{LCP}[i]$. These are exactly the prefixes of $t_{\mathrm{SA}[i-1]..n}$ which are not a prefix of $t_{\mathrm{SA}[i]..n}$. All strings passing the statistical criterion are returned (line 15).

2. The counter-values $S_{D_j}(\varphi)$ and $C_{D_j}(\varphi)$ of the current string $v$ are added to the respective counters of the string on top of the stack (line 10). This step takes care of the last sums in Eq. (6.4) and (6.6), respectively, as $v$ represents a child of the string which is on top of $S$.

3. When pushing the longest common prefix of two lexicographically adjacent suffixes on $S$ (line 20), the counter-values are initialized correctly.

4. The $C'$-numbers are added to the correct string (line 21) which is again on top of the stack. This step takes care of the first sum in (6.4).

5. The suffix $t_{\mathrm{SA}[i]..n}$ is pushed on $S$ with the correct counter-values (lines 22–25). Line 23 accounts for the initialization of the $S_{D_j}$-values, i.e., the two base cases $l = r$ in (6.6).
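The five steps above can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the implementation from the thesis. Arrays are 0-based, so a leaf has height `n - sa[i]` rather than n − SA[i] + 1. The entries `lcp[0]` and `lcp[n]` are sentinels set to 0. The array `owner` (a hypothetical name) maps each text position to the index of its database. The correction factors `c_prime` are taken as given, because Alg. 6.1 is not reproduced here, and they are only read for ranks below n. The helper `naive_sa_lcp` is likewise hypothetical and far from linear time; it only makes the example runnable. Each output string is read from the previously visited suffix, following step 1, which identifies the popped strings as prefixes of that suffix.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

NEG_INF = float("-inf")


def naive_sa_lcp(t: str) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: suffix array and LCP array by brute force."""
    n = len(t)
    sa = sorted(range(n), key=lambda k: t[k:])

    def lce(a: int, b: int) -> int:
        k = 0
        while a + k < n and b + k < n and t[a + k] == t[b + k]:
            k += 1
        return k

    # lcp[0] and lcp[n] are sentinels (assumed to be 0 in this sketch).
    lcp = [0] + [lce(sa[i - 1], sa[i]) for i in range(1, n)] + [0]
    return sa, lcp


def extract(t: str, sa: Sequence[int], lcp: Sequence[int],
            owner: Sequence[int], c_prime: Sequence[Sequence[int]],
            m: int, p: Callable[[List[int]], bool]) -> Iterator[str]:
    """Stack-based traversal in the spirit of Alg. 6.2 (0-based sketch)."""
    n = len(sa)
    # Stack entries are [h, S-counters, C-counters]; the first is the stopper.
    stack = [[NEG_INF, [0] * m, [0] * m]]
    for i in range(n + 1):
        v = stack[-1]
        s_acc, c_acc = [0] * m, [0] * m
        while v[0] > lcp[i]:
            v = stack.pop()
            w = stack[-1]
            if w[0] >= lcp[i]:  # w is the parent of v: accumulate
                for j in range(m):
                    w[1][j] += v[1][j]
                    w[2][j] += v[2][j]
            freq = [v[1][j] - v[2][j] for j in range(m)]
            if p(freq):
                start = sa[i - 1]  # v is a prefix of the previous suffix
                for h in range(max(w[0], lcp[i]) + 1, v[0] + 1):
                    yield t[start:start + h]
            s_acc, c_acc = list(v[1]), list(v[2])  # carried to a new parent
            v = w
        if v[0] < lcp[i]:
            stack.append([lcp[i], s_acc, c_acc])
        if i < n:
            for j in range(m):  # gather correction factors
                stack[-1][2][j] += c_prime[j][i]
            s_leaf = [0] * m
            s_leaf[owner[sa[i]]] = 1
            stack.append([n - sa[i], s_leaf, [0] * m])


# Demo: D_1 = {"ab"}, D_2 = {"b"}, concatenated with separators '#' and '$'.
t = "ab#b$"
sa, lcp = naive_sa_lcp(t)
owner = [0, 0, 0, 1, 1]
zero = [[0] * len(t) for _ in range(2)]  # all-zero corrections (assumption)
common = list(extract(t, sa, lcp, owner, zero, 2,
                      lambda f: f[0] >= 1 and f[1] >= 1))
print(common)  # ['b']
```

With all-zero corrections, as in the demo, the difference S − C reduces to the number of occurrences, so the demo only exercises the stack mechanics. Strings that run across a separator are not filtered out by the sketch; a caller would have to discard them.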

It is shown by Kasai et al. (2001) that this algorithm visits all maximally repeated substrings of $t$, and its running time is $O(n)$ (apart from the for-loop that outputs the solutions, line 15). The discussion from Sect. 6.3.2 shows that the $S$- and $C$-values are calculated correctly, and thus in line 12 we have that the frequency of the string $\varphi$ that is represented by $v$ is calculated correctly. We thus have the following

Theorem 6.2 (Frequency-based string mining). For a constant number of databases of strings of total length $n$, all strings that satisfy a frequency-based criterion (e.g., emerging substrings) can be calculated in $O(n+s)$ time, solely by using array-based data structures occupying $O(n)$ words of additional space (apart from the output), where $s$ is the total size of the strings that satisfy the criterion.
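Theorem 6.2 leaves the criterion $p$ abstract. As an illustration only, a predicate for emerging substrings over two databases might look as follows. The sketch assumes the common definition via a minimum support in the first database and a minimum growth rate relative to the second; that definition, the function name `make_emerging_predicate`, and the parameters `rho_s` and `rho_g` are assumptions that do not appear in this section.

```python
from typing import Callable, Sequence


def make_emerging_predicate(size_d1: int, size_d2: int,
                            rho_s: float, rho_g: float
                            ) -> Callable[[Sequence[int]], bool]:
    """Build p(freq_1, freq_2) for emerging substrings (assumed definition).

    size_d1, size_d2: number of strings in each database.
    rho_s: minimum support required in the first database.
    rho_g: minimum growth rate support_1 / support_2.
    """
    def p(freq: Sequence[int]) -> bool:
        supp1 = freq[0] / size_d1
        supp2 = freq[1] / size_d2
        if supp1 < rho_s:
            return False
        if supp2 == 0:
            return True  # growth rate is treated as infinite
        return supp1 / supp2 >= rho_g

    return p


p = make_emerging_predicate(size_d1=4, size_d2=4, rho_s=0.5, rho_g=2.0)
print(p([2, 1]), p([1, 0]), p([4, 3]))  # True False False
```

The predicate only reads the frequencies $S_{D_j}(\varphi) - C_{D_j}(\varphi)$ of a single string, which is the form in which Alg. 6.2 calls it.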

In document Fischer, Johannes (2007): Data Structures for Efficient String Algorithms. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 102-105)