Require: Sequence database D, σ, γ, λ, f-list Fσ,0,1(D)
1: Map(T):
2:   for all distinct w ∈ T s.t. w ∈ Fσ,0,1(D) do
3:     Construct a sequence database Pw(T) that is (w, γ, λ)-equivalent to { T }
4:     For each S ∈ Pw(T), output (w, S)
5:   end for
6:
7: Reduce(w, Pw):
8:   Fσ,γ,λ(Pw) ← FSMσ,γ,λ(Pw)
9:   for all S ∈ Fσ,γ,λ(Pw) do
10:    if p(S) = w and S ≠ w then
11:      Output (S, fγ(S, Pw))
12:    end if
13:  end for
w ∈ Σ and then mines frequent length- and gap-constrained sequences in each partition independently. The item w is referred to as the pivot item of partition Pw. The MG-FSM algorithm is divided into a preprocessing phase, a partitioning phase, and a mining phase, all of which are fully parallelized.
Preprocessing phase
In the preprocessing phase, we compute the frequency of each item w ∈ Σ and construct the set Fσ,0,1(D) of frequent items, commonly called the f-list. This can be done efficiently in a single MapReduce job (by running a version of WordCount that ignores repeated occurrences of items within an input sequence). We use the f-list to establish a total order < on Σ: set w < w′ if f0(w, D) > f0(w′, D); ties are broken arbitrarily. Thus items are ordered by decreasing frequency. Write S ≤ w if w′ ≤ w for all w′ ∈ S, and denote by Σ^+_{≤w} = { S ∈ Σ^+ : w ∈ S, S ≤ w } the set of all sequences that contain w but no items larger than w. Finally, denote by p(S) = min_{w ∈ S}(S ≤ w) the pivot item of sequence S, i.e., the largest item in S. Note that p(S) = w ⇐⇒ w ∈ S ∧ S ≤ w ⇐⇒ S ∈ Σ^+_{≤w}. For example, when S = abc, then S ≤ c and p(S) = c; here, as well as in all subsequent examples, we assume the order a < b < c < d.
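The preprocessing step can be sketched on a single machine as follows. This is a minimal illustration rather than the MapReduce job itself: the function names `f_list` and `pivot`, the toy database, the threshold test `>= sigma`, and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter

def f_list(database, sigma):
    """Return the frequent items, ordered by decreasing frequency."""
    counts = Counter()
    for seq in database:
        counts.update(set(seq))  # repeated items in one sequence count once
    frequent = [w for w in counts if counts[w] >= sigma]
    # The text breaks ties arbitrarily; alphabetical order is an assumption.
    return sorted(frequent, key=lambda w: (-counts[w], w))

def pivot(seq, rank):
    """p(S): the largest item of seq under the order encoded by rank."""
    return max(seq, key=lambda w: rank[w])

database = ["abca", "abd", "ab", "ca"]  # hypothetical toy database
order = f_list(database, sigma=1)
rank = {w: i for i, w in enumerate(order)}  # smaller rank = more frequent
```

On this toy database the order is a < b < c < d, so `pivot("abc", rank)` yields `c`, in line with the example above.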
Partitioning phase
The partitioning and mining phases of MG-FSM are performed in a single MapReduce job. In the partitioning phase, we construct partitions Pw in the map phase: for each distinct item w in each input sequence T ∈ D, we compute a small sequence database Pw(T) and output each of its sequences with reduce key w. We require Pw(T) to be “(w, γ, λ)-equivalent” to T; see Section 3.2.3. For now, assume that Pw(T) = { T }; a key ingredient of MG-FSM is to use rewrites that make Pw(T) as small as possible.
Mining phase
The mining phase is carried out in the reduce function. The input to the mining phase is given by
$$P_w = \biguplus_{T \in D,\; w \in T} P_w(T),$$
which is automatically constructed by the MapReduce framework. Each reduce function runs an arbitrary FSM algorithm with parameters σ, γ, and λ on Pw (denoted FSMσ,γ,λ(Pw) in Algorithm 3.1) to obtain the frequent sequences Fσ,γ,λ(Pw) as well as their frequencies. Since a frequent sequence may be generated at multiple partitions, MG-FSM performs a filtering step to produce each frequent sequence exactly once. In particular, we output sequence S at partition P_{p(S)}, i.e., at the partition corresponding to its largest item.
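The interplay of the map, shuffle, and reduce steps of Algorithm 3.1 can be simulated in a few lines. The sketch below is a single-machine illustration under stated assumptions: it uses the simplification Pw(T) = { T }, a naive enumeration in place of a real FSM algorithm, and hypothetical names (`gamma_subsequences`, `map_phase`, `reduce_phase`, `mg_fsm`) that do not come from the text.

```python
from collections import Counter, defaultdict

def gamma_subsequences(seq, gamma, lam):
    """Distinct subsequences with gaps of at most gamma and length at most lam."""
    found = set()
    def extend(prefix, last):
        found.add(prefix)
        if len(prefix) == lam:
            return
        for j in range(last + 1, min(last + gamma + 2, len(seq))):
            extend(prefix + seq[j], j)
    for i in range(len(seq)):
        extend(seq[i], i)
    return found

def map_phase(seq, frequent):
    # Simplification from the text: P_w(T) = {T}, i.e. no rewriting.
    for w in set(seq) & set(frequent):
        yield w, seq

def reduce_phase(w, partition, sigma, gamma, lam, rank):
    counts = Counter()
    for seq in partition:  # naive stand-in for FSM(P_w)
        counts.update(gamma_subsequences(seq, gamma, lam))
    for s, freq in counts.items():
        # Filtering step: emit S only at the partition of its pivot item.
        is_pivot_seq = max(s, key=lambda x: rank.get(x, len(rank))) == w
        if freq >= sigma and s != w and is_pivot_seq:
            yield s, freq

def mg_fsm(database, sigma, gamma, lam):
    item_counts = Counter(w for seq in database for w in set(seq))
    frequent = sorted((w for w in item_counts if item_counts[w] >= sigma),
                      key=lambda w: (-item_counts[w], w))
    rank = {w: i for i, w in enumerate(frequent)}
    partitions = defaultdict(list)  # stands in for the MapReduce shuffle
    for seq in database:
        for w, s in map_phase(seq, frequent):
            partitions[w].append(s)
    result = {}
    for w, part in partitions.items():
        result.update(reduce_phase(w, part, sigma, gamma, lam, rank))
    return result
```

For the hypothetical database `["abc", "acb", "ab"]` with σ = 2, γ = 1, and λ = 2, the sequence ab is frequent in all three partitions but is emitted only at partition Pb, and ac only at Pc.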
3.2.3 Constructing Partitions
We now summarize the partition construction of MG-FSM and, in particular, the rewriting techniques for constructing Pw(T) for an input sequence T. These rewriting techniques aim to minimize partition size and thereby reduce the communication cost between the map and reduce phases, the computational cost at each partition, and partition skew, all while maintaining correctness.
w-equivalency.
w-equivalency is a necessary and sufficient condition for the correctness of MG-FSM. A sequence S is a pivot sequence w.r.t. w ∈ Σ if p(S) = w and 2 ≤ |S| ≤ λ. Denote by
$$G_{w,\gamma,\lambda}(T) = \left[ F_{1,\gamma,\lambda}(\{T\}) \cap \Sigma^{+}_{\le w} \right] \setminus \{ w \}$$
the set of pivot sequences that occur in T, i.e., that are γ-subsequences of T with largest item w. If S ∈ Gw,γ,λ(T), then T is said to (w, γ, λ)-generate (or simply w-generate) S. For example,
$$G_{c,1,2}(acbfdeacfc) = \{\, ac,\; cb,\; cc \,\}.$$
Two sequences T and T′ are said to be (w, γ, λ)-equivalent (or simply w-equivalent) if
$$G_{w,\gamma,\lambda}(T) = G_{w,\gamma,\lambda}(T'),$$
i.e., they both generate the same set of pivot sequences. Similarly, two sequence databases D and Pw are (w, γ, λ)-equivalent (or simply w-equivalent) iff
$$G_{w,\gamma,\lambda}(D) = G_{w,\gamma,\lambda}(P_w).$$
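The set of generated pivot sequences of a single sequence can be computed by brute force, which is enough to check w-equivalence on small examples. The following is a minimal sketch; the function names and the `rank` dictionary encoding the assumed order a < b < ... < f are illustrative choices, not part of MG-FSM.

```python
def generated_pivot_sequences(seq, w, gamma, lam, rank):
    """G_{w,gamma,lam}(seq): gamma-subsequences of length 2..lam that contain w
    and no item larger than w."""
    # Items outside rank (e.g. blanks) are treated as irrelevant.
    relevant = [x in rank and rank[x] <= rank[w] for x in seq]
    found = set()
    def extend(prefix, last):
        if len(prefix) >= 2 and w in prefix:
            found.add(prefix)
        if len(prefix) == lam:
            return
        for j in range(last + 1, min(last + gamma + 2, len(seq))):
            if relevant[j]:
                extend(prefix + seq[j], j)
    for i, ok in enumerate(relevant):
        if ok:
            extend(seq[i], i)
    return found

def w_equivalent(t1, t2, w, gamma, lam, rank):
    """True if both sequences generate the same set of pivot sequences."""
    return (generated_pivot_sequences(t1, w, gamma, lam, rank)
            == generated_pivot_sequences(t2, w, gamma, lam, rank))

rank = {x: i for i, x in enumerate("abcdef")}  # assumed order a < b < ... < f
```

With this order, `generated_pivot_sequences("acbfdeacfc", "c", 1, 2, rank)` returns the set { ac, cb, cc } from the example, and replacing the items d, e, and f by blanks yields a c-equivalent sequence.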
Constructing Pw(T ).
We now summarize rewriting techniques that aim to reduce the overall size of Pw(T). Let T = t1 . . . t|T| be an input sequence and consider pivot w. An index 1 ≤ i ≤ |T| is w-relevant if ti is w-relevant, i.e., if ti ≤ w; otherwise it is w-irrelevant. When ti = w, we say that index i is a pivot index. Since irrelevant items do not contribute to a pivot sequence, MG-FSM replaces these items with “blanks”. For example, sequence abddc is written as ab␣␣c (for pivot c). Replacing irrelevant items with blanks enables effective compression (e.g., abddc can be written as ab␣²c).
Perhaps the most important rewrite is unreachability reduction, which removes unreachable items, i.e., items that are “far away” from any pivot item. For example, consider an input sequence T = cadbabeadcddae and the corresponding sequence T′ = ca␣bab␣a␣c␣␣a␣ obtained after replacing c-irrelevant items with blanks. Here indexes 1 and 10 are pivot indexes. To remove unreachable items, we compute the left and the right distance to a pivot item. The left distance of an index i is the smallest number of items (number of “hops” + 1) from a pivot index to index i; only relevant indexes are considered, and consecutive indexes must satisfy the gap constraint (at most γ items in between). Similarly, the right distance of an index i is the distance to the closest pivot to the right of i. For example, we obtain the following for γ = 1:
i      1  2  3  4  5  6  7  8  9  10 11 12 13 14
ti     c  a  ␣  b  a  b  ␣  a  ␣  c  ␣  ␣  a  ␣
left   1  2  2  3  4  4  −  −  −  1  2  2  −  −
right  1  −  −  4  4  3  3  2  2  1  −  −  −  −
Here, − corresponds to an infinite distance. The left distance for index 5, for example, is 4, which is determined by indexes 1, 2, 4, 5 (the path 1, 3, 5 is not allowed since index 3 is irrelevant). Indexes whose minimum distance min(left, right) exceeds λ are unreachable and can be safely removed. For example, for λ = 3 only indexes 5, 13, and 14 are unreachable, and we obtain T′ = ca␣bb␣a␣c␣␣.
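The distance computation and the resulting reduction can be sketched as follows. This is an illustrative single-pass dynamic program, not necessarily how MG-FSM implements it; the names `distances` and `unreachability_reduction` are hypothetical, and the input is assumed to already have its irrelevant items replaced by blanks. The sketch does not cap distances, so an index that the table marks with − may receive a finite value larger than 4 here; this does not affect the reduction for λ = 3.

```python
import math

BLANK = "␣"

def distances(seq, w, gamma, reverse=False):
    """Left distances to pivot w (right distances if reverse=True).
    Hops go only through non-blank indexes at most gamma+1 positions apart."""
    s = seq[::-1] if reverse else seq
    dist = [math.inf] * len(s)
    for i, x in enumerate(s):
        if x == w:
            dist[i] = 1
            continue
        # Best distance among relevant indexes inside the gap window before i.
        window = [dist[j] for j in range(max(0, i - gamma - 1), i)
                  if s[j] != BLANK]
        dist[i] = min(window, default=math.inf) + 1
    return dist[::-1] if reverse else dist

def unreachability_reduction(seq, w, gamma, lam):
    """Drop every index whose distance to the closest pivot exceeds lam."""
    left = distances(seq, w, gamma)
    right = distances(seq, w, gamma, reverse=True)
    return "".join(x for x, l, r in zip(seq, left, right) if min(l, r) <= lam)
```

Applied to `"ca␣bab␣a␣c␣␣a␣"` with pivot c, γ = 1, and λ = 3, the function removes indexes 5, 13, and 14 and returns `"ca␣bb␣a␣c␣␣"`, matching the example above.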
Other important rewrites include prefix/suffix reduction, where leading and trailing blanks are removed (e.g., ca␣bb␣a␣c␣␣ is reduced to ca␣bb␣a␣c), and blank reduction, where any run of more than γ + 1 blanks is replaced with exactly γ + 1 blanks (e.g., ca␣␣␣␣cba can be reduced to ca␣␣cba for γ = 1). MG-FSM also performs blank separation, where a sequence is rewritten as multiple shorter sequences (e.g., acb␣␣bca can be split into acb and bca for pivot c). Blank separation, however, is ineffective when items are not often repeated.
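These three rewrites are simple string operations once irrelevant items have been replaced by blanks. The sketch below is illustrative only; the function names are hypothetical, and the separation rule (split at runs of at least γ + 1 blanks and keep only pieces that contain the pivot) is an assumption derived from the gap constraint rather than a rule stated in the text.

```python
import re

BLANK = "␣"

def prefix_suffix_reduction(seq):
    """Remove leading and trailing blanks."""
    return seq.strip(BLANK)

def blank_reduction(seq, gamma):
    """Cut every run of more than gamma+1 blanks down to exactly gamma+1."""
    long_run = re.escape(BLANK) + "{%d,}" % (gamma + 2)
    return re.sub(long_run, BLANK * (gamma + 1), seq)

def blank_separation(seq, w, gamma):
    """Split at runs of at least gamma+1 blanks (assumed rule); such a run
    cannot be bridged by any gamma-subsequence. Pieces without the pivot w
    cannot generate pivot sequences and are dropped."""
    separator = re.escape(BLANK) + "{%d,}" % (gamma + 1)
    return [piece for piece in re.split(separator, seq) if w in piece]
```

For example, `blank_reduction("ca␣␣␣␣cba", 1)` gives `"ca␣␣cba"`, and `blank_separation("acb␣␣bca", "c", 1)` gives the two sequences acb and bca, assuming γ = 1 for the separation example.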
Rewrites in practice
In practice, the above rewrites are performed as follows. For each sequence T and each frequent item w ∈ T, MG-FSM performs a backward scan to obtain the right distances of all indexes. It then performs a forward scan of T in which it simultaneously (1) computes the left distances, (2) performs unreachability reduction, (3) replaces irrelevant items by blanks, and (4) performs prefix/suffix and blank reduction to obtain Pw(T).
3.3 Mining Partitions
In this section, we first briefly discuss how the existing FSM approaches of Section 2.2 can be adapted to mine length- and gap-constrained sequences. These approaches mine all frequent sequences and must be combined with a filtering step to restrict the output to pivot sequences (cf. line 7, Algorithm 3.1). We then propose a more efficient, special-purpose sequence miner that directly mines pivot sequences.
3.3.1 Sequential FSM algorithms
We first briefly describe how we extend the BFS and DFS approaches to handle length and gap constraints and then discuss the overhead associated with them in the context of MG-FSM.
BFS with length and gap constraints
Recall that BFS uses a level-wise approach to iteratively generate sequences of length 1, then length 2, and so on. To adapt BFS to handle the length constraint λ, we stop the iterative process after the λ-th iteration, i.e., we add the condition k ≤ λ in line 3 of Algorithm 2.1. To handle the gap constraint γ, we modify the posting list intersection so that two postings T⟨pos⟩ and T′⟨pos′⟩ are merged only when T = T′ and pos < pos′ ≤ pos + γ + 1. For example, consider two sequences a and d and their posting lists La = T1⟨1⟩, T2⟨2⟩, T3⟨1⟩ and Ld = T1⟨3⟩, T3⟨2⟩. We obtain Lad = T3⟨2⟩ for γ = 0 and Lad = T1⟨1⟩, T3⟨1⟩ for γ = 1.
DFS with length and gap constraints
We adapt the DFS approach (Algorithm 2.2) to handle length and gap constraints as follows. To handle the length constraint, we only expand a sequence S if |S| < λ. To handle the gap constraint, when we expand S, we only consider the set of right items in input sequences T ∈ DS, which is given by ΣS(T) = { w | Sw ⊆γ T }, i.e., we look for occurrences of S and then consider the items that occur to the right of S and are at most γ + 1 positions away. For example, if T = cabda, then we have Σca(T) = { b } for γ = 0 and Σca(T) = { b, d } for γ = 1.
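The set of right items can be computed by tracking where gap-constrained occurrences of S end. The following is a minimal sketch with a hypothetical function name `right_items`; it assumes that S is non-empty and that blanks are never used as expansion items.

```python
def right_items(seq, s, gamma, blank="␣"):
    """Sigma_S(T): items w such that S followed by w is a gamma-subsequence
    of seq."""
    # ends: indexes at which some gap-constrained occurrence of the
    # current prefix of s ends
    ends = {i for i, x in enumerate(seq) if x == s[0]}
    for item in s[1:]:
        ends = {j for i in ends
                for j in range(i + 1, min(i + gamma + 2, len(seq)))
                if seq[j] == item}
    # Items at most gamma+1 positions to the right of an occurrence of s.
    return {seq[j] for i in ends
            for j in range(i + 1, min(i + gamma + 2, len(seq)))
            if seq[j] != blank}
```

For T = cabda this returns { b } for γ = 0 and { b, d } for γ = 1, as in the example above.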
Overhead
In the context of MG-FSM, the BFS and DFS approaches have substantial computational overhead: they compute and output all frequent sequences, whether or not these sequences are pivot sequences (i.e., p(S) = w), and thus non-pivot sequences need to be pruned. To see this, consider pivot d and the example partition
$$P_d = \{\, adda,\; cabd,\; ca\text{␣}db,\; b\text{␣}aadbc \,\} \tag{3.1}$$
for σ = 2, γ = 1, and λ = 4. Both the BFS and DFS methods will produce sequences such as ca and ab, neither of which contains pivot d, and which thus need to be filtered out by MG-FSM. Unfortunately, neither BFS nor DFS can be readily extended to avoid enumerating non-pivot sequences. This is because short non-pivot sequences might contribute to longer pivot sequences. In BFS, we obtain the frequent pivot sequence cad from ca (a non-pivot sequence) and ad (a pivot sequence). Similarly, DFS obtains cad by expanding the non-pivot sequence ca. This costly computation of non-pivot sequences cannot be avoided without sacrificing correctness. Note that both approaches also compute frequent sequences that do not contribute to a pivot sequence later on (e.g., sequence ab).
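The frequencies in this example can be checked with a small gap-constrained support counter. The sketch below is for illustration only; `occurs` and `frequency` are hypothetical helper names, and the partition is written as a plain Python list.

```python
def occurs(seq, s, gamma):
    """True if s is a gamma-subsequence of seq."""
    ends = {i for i, x in enumerate(seq) if x == s[0]}
    for item in s[1:]:
        ends = {j for i in ends
                for j in range(i + 1, min(i + gamma + 2, len(seq)))
                if seq[j] == item}
    return bool(ends)

def frequency(s, partition, gamma):
    """Number of sequences in the partition that contain s."""
    return sum(occurs(seq, s, gamma) for seq in partition)

P_d = ["adda", "cabd", "ca␣db", "b␣aadbc"]  # partition of Equation (3.1)
```

With γ = 1, each of ca, ab, and cad has frequency 2 and is therefore frequent for σ = 2, although only cad contains the pivot d.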
3.3.2 Pivot Sequence Miner
In what follows, we propose PSM, an effective and efficient algorithm that significantly reduces the computational cost of mining each partition. In contrast to the methods discussed above, PSM restricts its search space to pivot sequences only and is thus customized to MG-FSM. We also describe optimizations that further improve the performance of PSM.
Algorithm
The key goal of PSM is to enumerate only pivot sequences. PSM is based on DFS but, in contrast, starts with the pivot w (instead of the empty sequence) and expands a sequence to the left and to the right (instead of just to the right). Since PSM starts with the pivot, every intermediate sequence will be a pivot sequence. The PSM algorithm is shown as Algorithm 3.2. We assume that p(T) = w for all T ∈ Pw; this property is ensured by MG-FSM’s partitioning framework.
PSM starts with S = w (the pivot item) and determines the support set Dw (line 1); under our assumptions, Dw = Pw, so that nothing needs to be done. We then perform a series of right-expansions, much like the expansions in DFS (lines 2 and 13); the only difference is that we do not right-expand with the pivot item (cf. line 11). After the right-expansions are completed, we have produced all frequent pivot sequences that start with the pivot item (and do not contain another occurrence of the pivot item).
Figure 3.1 illustrates PSM on the partition of Equation (3.1) with pivot d. Solid nodes represent frequent sequences; dotted nodes represent infrequent sequences that are explored by PSM. Each edge corresponds to an expansion and is labeled with its type (RE = right expansion, LE = left expansion) and order of expansion. We start
Algorithm 3.2 Mining pivot sequences