6.3 The New Algorithm
6.3.3 Extraction
We now describe how to output all strings that pass the frequency-based criterion. As mentioned above, this step is accomplished by a simulated bottom-up traversal of the suffix tree due to Kasai et al. (2001),³ calculating for each LCP-interval representing string φ the values SDj(φ) and CDj(φ) for j = 1, . . . , m,
³ Observe that one could also simulate this DFS by means of the Enhanced Suffix Array (cf. Chapter 5). However, as we are doing a bottom-up traversal here, it is simpler to use Kasai et al.’s method.
Algorithm 6.2: Extraction of all substrings satisfying p.
Input: suffix array SA, LCP-array LCP, CD′j as computed by Alg. 6.1 (all of size n), frequency-based predicate p(suppD1, . . . , suppDm)
Output: All substrings satisfying p

 1  S is a stack holding tuples (v.h, v.SD1, . . . , v.SDm, v.CD1, . . . , v.CDm)
 2  Let v be a stopper element with v.h = −∞, push v on S
 3  for i = 1, . . . , n + 1 do
 4      v ← top(S)    {v represents the string to be examined next}
 5      SDj ← 0, CDj ← 0 for all j = 1, . . . , m
 6      while v.h > LCP[i] do
 7          v ← pop(S), w ← top(S)    {w always points to top of stack}
 8          if w.h ≥ LCP[i] then
 9              {Otherwise w is not parent node of current node v.}
10              w.SDj += v.SDj, w.CDj += v.CDj for all j    {accumulate}
11          endif
12          freqDj ← v.SDj − v.CDj for all j = 1, . . . , m
13          if p(freqD1, . . . , freqDm) then
14              {Now v represents a maximally repeated substring satisfying p.}
15              for h = max{w.h, LCP[i]} + 1, . . . , v.h do print t_{SA[i]..SA[i]+h−1}
16          endif
17          SDj ← v.SDj, CDj ← v.CDj for all j = 1, . . . , m
18          v ← w
19      endw
20      if v.h < LCP[i] then push (LCP[i], SD1, . . . , SDm, CD1, . . . , CDm) on S
21      top(S).CDj += CD′j[i] for all j = 1, . . . , m    {gather correction factors}
22      if i ≤ n then
23          Let SA[i] point to Dj; set SDj ← 1 and all other SDj′’s to 0
24          push (n − SA[i] + 1, SD1, . . . , SDm, 0, . . . , 0) on S
25      endif
26  endfor
thereby yielding the frequency of φ in Dj as SDj(φ) − CDj(φ). The formula for the C-numbers is given by (6.4), and for the S-numbers we have

\[
S_{D_j}(\phi) \;=\; \sum_{\substack{l \le i \le r \\ SA[i] \text{ points to } D_j}} 1
\tag{6.5}
\]

(again, [l : r] is φ’s LCP-interval). This is simply because the interval [l : r] in SA represents all suffixes of t that are prefixed by φ. As for the C-numbers, (6.5) can be rewritten to allow a recursive calculation:

\[
S_{D_j}(\phi) \;=\;
\begin{cases}
0 & \text{if } l = r \text{ and } SA[l] \text{ points to } D_{j'} \text{ with } j' \ne j, \\[2pt]
1 & \text{if } l = r \text{ and } SA[l] \text{ points to } D_j, \\[2pt]
\displaystyle\sum_{\substack{[l':r'] \text{ child-interval of } [l:r] \\ [l':r'] \text{ represents } \psi \ne \phi}} S_{D_j}(\psi) & \text{otherwise.}
\end{cases}
\tag{6.6}
\]

Alg. 6.2 is used for the extraction phase. If one deletes lines 5, 21 and 23 from Alg. 6.2 and substitutes lines 8–17 by the single command “print t_{SA[i]..SA[i]+v.h−1},” this yields exactly the algorithm in Fig. 7 of Kasai et al.
(2001), which solves the substring traversal problem, i.e., the enumeration of all maximally repeated substrings. The idea behind this algorithm is to visit all suffixes of t in lexicographic order and to keep all maximally repeated prefixes of the current suffix on a stack S, ordered by their length, with the longest such prefix on top of S. A more formal description is as follows. Each element v on S is represented by a tuple (h, SD1, . . . , SDm, CD1, . . . , CDm), where h is the length of the prefix (i.e., the corresponding prefix is t_{SA[i]..SA[i]+v.h−1}), and the other variables are the counters as defined by (6.1) and (6.2). At the beginning of step i of the for-loop (lines 3–26), the (i−1)-th suffix and all maximally repeated prefixes of t_{SA[i−1]..n} are on S. Then the (i−1)-th suffix is visited (line 4) and the following steps are performed:
1. The while-loop (lines 6–19) removes from S all tuples representing strings with length greater than lce(i−1, i) = LCP[i]. These are exactly the prefixes of t_{SA[i−1]..n} which are not a prefix of t_{SA[i]..n}. All strings passing the statistical criterion are returned (line 15).
2. The counter-values SDj(φ) and CDj(φ) of the current string v are added to the respective counters of the string on top of the stack (line 10). This step takes care of the last sums in Eq. (6.4) and (6.6), respectively, as v represents a child of the string which is on top of S.
3. When pushing the longest common prefix of two lexicographically adjacent suffixes on S (line 20), the counter-values are initialized correctly.
4. The C′-numbers are added to the correct string (line 21), which is again on top of the stack. This step takes care of the first sum in (6.4).
5. The suffix t_{SA[i]..n} is pushed on S with the correct counter-values (lines 22–25). Line 23 accounts for the initialization of the SDj-values, i.e., the two cases with l = r in (6.6).
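
The five steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation behind Alg. 6.2: indices are 0-based, lcp[0] = lcp[n] = 0 is chosen here as the boundary convention, the hypothetical array owner maps each text position to the index of its database, and the C′-values are taken as a given input corr (all zero in the example, so the reported frequencies are plain occurrence counts). The output string is read off the previous suffix sa[i-1], of which the popped element is a prefix. The helper naive_sa_lcp is a quadratic stand-in for a real suffix-array and LCP construction.

```python
from math import inf

def naive_sa_lcp(t):
    """Quadratic helper (assumption, for small examples only)."""
    n = len(t)
    sa = sorted(range(n), key=lambda k: t[k:])
    def lce(a, b):
        k = 0
        while a + k < n and b + k < n and t[a + k] == t[b + k]:
            k += 1
        return k
    # lcp[i] = longest common prefix of suffixes sa[i-1] and sa[i];
    # lcp[0] = lcp[n] = 0 is a convention assumed in this sketch.
    lcp = [0] + [lce(sa[i - 1], sa[i]) for i in range(1, n)] + [0]
    return sa, lcp

def extract(t, sa, lcp, owner, m, corr, p):
    """Stack-based traversal in the spirit of Alg. 6.2 (0-based).

    owner[pos] : database index of text position pos (hypothetical name)
    corr[j][i] : stands in for the C'-value of database j at rank i
    p          : predicate on the list of m frequencies
    """
    n = len(t)
    out = []
    # Stack entries are [h, S-counters, C-counters]; the first is the stopper.
    stack = [[-inf, [0] * m, [0] * m]]
    for i in range(n + 1):
        v = stack[-1]
        s_last, c_last = [0] * m, [0] * m
        while v[0] > lcp[i]:
            v = stack.pop()
            w = stack[-1]
            if w[0] >= lcp[i]:                 # w is the parent of v: accumulate
                for j in range(m):
                    w[1][j] += v[1][j]
                    w[2][j] += v[2][j]
            freq = [v[1][j] - v[2][j] for j in range(m)]
            if p(freq):
                start = sa[i - 1]              # v is a prefix of the previous suffix
                for h in range(max(w[0], lcp[i]) + 1, v[0] + 1):
                    out.append(t[start:start + h])
            s_last, c_last = v[1], v[2]
            v = w
        if v[0] < lcp[i]:                      # new node for the common prefix
            stack.append([lcp[i], list(s_last), list(c_last)])
        if i < n:
            top = stack[-1]
            for j in range(m):                 # gather correction factors
                top[2][j] += corr[j][i]
            s = [0] * m
            s[owner[sa[i]]] = 1
            stack.append([n - sa[i], s, [0] * m])
    return out

t = "banana"
sa, lcp = naive_sa_lcp(t)
found = extract(t, sa, lcp, owner=[0] * len(t), m=1,
                corr=[[0] * len(t)], p=lambda f: f[0] >= 2)
print(sorted(found))   # ['a', 'an', 'ana', 'n', 'na']
```

With a single database and zero corrections, the predicate f[0] >= 2 selects exactly the substrings of "banana" that occur at least twice. Supplying the C′-values of Alg. 6.1 as corr would turn the occurrence counts into the corrected frequencies SDj(φ) − CDj(φ).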
It is shown by Kasai et al. (2001) that this algorithm visits all maximally repeated substrings of t, and that its running time is O(n) (apart from the for-loop that outputs the solutions, line 15). The discussion from Sect. 6.3.2 shows that the S- and C-values are calculated correctly, and thus in line 12 the frequency of the string φ represented by v is calculated correctly. We thus have the following theorem.
Theorem 6.2 (Frequency-based string mining). For a constant number of databases of strings of total length n, all strings that satisfy a frequency-based criterion (e.g., emerging substrings) can be calculated in O(n+s) time, solely by using array-based data structures occupying O(n) words of additional space (apart from the output), where s is the total size of the strings that satisfy the
criterion.
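
For checking an implementation of Theorem 6.2 on small inputs, a naive baseline can serve as an oracle. The sketch below assumes that the frequency of φ in Dj is the number of strings in Dj that contain φ (a definition not restated in this section); the function names, the two toy databases, and the example predicate are hypothetical. It enumerates every substring explicitly and therefore runs in polynomial rather than O(n + s) time, but it makes the output size s of the theorem concrete.

```python
def frequency(phi, database):
    """Number of strings in `database` that contain phi at least once."""
    return sum(1 for s in database if phi in s)

def mine_naive(databases, p):
    """Return all non-empty substrings whose frequency vector satisfies p."""
    candidates = {s[a:b]
                  for db in databases for s in db
                  for a in range(len(s)) for b in range(a + 1, len(s) + 1)}
    return sorted(phi for phi in candidates
                  if p([frequency(phi, db) for db in databases]))

d1 = ["abab", "bab"]
d2 = ["ba", "aa"]
# Example predicate: contained in at least two strings of d1 and in none of d2.
result = mine_naive([d1, d2], lambda f: f[0] >= 2 and f[1] == 0)
s = sum(len(phi) for phi in result)   # total size of the output
print(result, s)   # ['ab', 'bab'] 5
```

Here s = 5 is the total length of the two reported strings, which is the output-dependent term in the O(n + s) bound.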