Prism: An effective approach for frequent sequence mining via prime-block encoding

(1)

Karam Gouda

a

, Mosab Hassaan

a

, Mohammed J. Zaki

b

,

∗

a_{Mathematics Dept., Faculty of Science, Benha, Egypt}

b_{Computer Science Dept., Rensselaer Polytechnic Institute, Troy, NY, USA}

a r t i c l e

i n f o

a b s t r a c t

Article history:

Received 30 September 2007 Received in revised form 4 June 2008 Available online 23 May 2009

Keywords:

Frequent sequence mining Prime encoding Data mining

Sequence mining is one of the fundamental data mining tasks. In this paper we present a novel approach for mining frequent sequences, called Prism. It utilizes a vertical approach for enumeration and support counting, based on the novel notion of primal block encoding, which in turn is based on prime factorization theory. Via an extensive evaluation on both synthetic and real datasets, we show that Prism outperforms popular sequence mining methods like SPADE [M.J. Zaki, SPADE: An efficient algorithm for mining frequent sequences, Mach. Learn. J. 42 (1/2) (Jan/Feb 2001) 31–60], PrefixSpan [J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M.-C. Hsu, PrefixSpan: Mining sequential patterns efficiently by prefixprojected pattern growth, in: Int’l Conf. Data Engineering, April 2001] and SPAM [J. Ayres, J.E. Gehrke, T. Yiu, J. Flannick, Sequential pattern mining using bitmaps, in: SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, July 2002], by an order of magnitude or more.

1. Introduction

Many real world applications, such as in bioinformatics, web mining, text mining and so on, have to deal with sequen-tial/temporal data. Sequence mining helps to discover frequent sequential patterns across time or positions in a given data set. Mining frequent sequences is one of the basic exploratory mining tasks, and has attracted a lot of attention [1,2,8–10, 13,14].

Problem deﬁnition. The problem of mining sequential patterns can be stated as follows: Let

I = {

i1

,

i2

, . . . ,

im

}

be a set of

m distinct attributes, also called items. An itemset is a non-empty unordered collection of items (without loss of generality,

we assume that items of an itemset are sorted in increasing order). A sequence is an ordered list of itemsets. An itemset

i is denoted as

(

i1i2

· · ·

ik

)

, where ij is an item. An itemset with k items is called a k-itemset. A sequence S is denoted as

(

s1

→

s2

→ · · · →

sq

)

, where the sequence element sj is an itemset. The number of itemsets (or elements) in the sequence

gives its size, and the total number of items in the sequence gives its length (k

=

_j

|

sj

|

). A sequence of length k is also

called a k-sequence. For example,

(

b

→

ac

)

is a 3-sequence of size 2.

A sequence S

= (

s1

→ · · · →

sn

)

is a subsequence of another sequence R

= (

r1

→ · · · →

rm

)

, denoted as S

⊆

R, if there

exist integers i1

<

i2

<

· · · <

in such that sj

⊆

rij for all sj. For example the sequence

(

b

→

ac

)

is a subsequence of

(

ab

→

e

→

acd

)

, but

(

ab

→

e

)

is not a subsequence of

(

abe

)

, and vice versa. If S

⊆

R, we also say that R contains S.

✩ _{This work was supported in part by NSF CAREER Award IIS-0092978, and NSF grants EIA-0103708 and EMT-0432098.}

*

Corresponding author.

E-mail address:[email protected](M.J. Zaki).

(2)

Given a database

D

of sequences, each having a unique sequence identiﬁer, and given some sequence S

= (

s1

→

· · · →

sn

)

, the absolute support of S in

D

is deﬁned as the total number of sequences in

D

that contain S, given as

sup

(

S

,

D) = |{

Si

∈

D |

S

⊆

Si

}|

. The relative support of S is given as the fraction of database sequences that contain S. We

use absolute and relative supports interchangeably. Given a user-speciﬁed threshold called the minimum support (denoted

minsup), we say that a sequence is frequent if occurs more than minsup times. A frequent sequence is maximal if it is not a

subsequence of any other frequent sequence. A frequent sequence is closed if it is not a subsequence of any other frequent sequence with the same support. Given a database

D

of sequences and minsup, the problem of mining sequential patterns is to ﬁnd all frequent sequences in the database.

Related work. The problem of mining sequential patterns was introduced in [1]. Many other approaches have followed since then [2,8–11,13,14]. Methods for mining closed sequences appear in [3,12]. Sequence mining is essentially an enumeration problem over the sub-sequence partial order looking for those sequences that are frequent. The search can be performed in a breadth-first or depth-first manner, starting with more general (shorter) sequences and extending them towards more specific (longer) ones. The existing methods essentially differ in the data structures used to “index” the database to facilitate fast enumeration. The existing methods utilize three main approaches to sequence mining:

•

Horizontal method: These are exempliﬁed by the GSP method [11]. Here the database sequences are not indexed at

all, whereas the candidate frequent patterns are stored in some hash-based/preﬁx-based structure. Pattern frequency is determined by directly checking which candidate sequences appear in which database sequences. The methods in [1,9] also follow a horizontal approach.

•

Vertical method: The prototype method here is SPADE [14]. Here each item/symbol is associated with an inverted index,

called tidlist, which lists the sequence ids, and the positions where the symbol occurs. Frequency of a pattern is de-termined by performing temporal joins on these tidlists. SPAM [2] is also a vertical approach, that uses bit-vectors to represent the tidlists. The episode mining method in [8] is a variant of the vertical approach tailored to mining a single long sequence, as opposed to many sequences.

•

Projection method: PreﬁxSpan [10] follows a database projection approach, which is a hybrid between the horizontal

and vertical approaches. Given any prefix sequence P , the main idea is to project the horizontal database, so that the projected (or conditional) database contains only those sequences that have prefix P . The frequency of extensions of P can be directly counted in the projected database. Via recursive projections all frequent sequences can be enumerated. PrefixSpan is a hybrid method, since the projected database is equivalent to a horizontal representation of the tidlists of sequences that share a given prefix P . LAPIN [13] is another method that applies the so-called last-position induction optimization with a projection approach.

Domain speciﬁc sequence mining methods, such as in bioinformatics, have a rich literature [6,7]. Many of these tech-niques are geared towards ﬁnding consecutive sequences, whereas, we are interested in more unconstrained sequences. Furthermore, sequential patterns allow each sequence element to be an itemset, whereas most biosequence methods allow only the base symbols as sequence elements. On the other hand, biosequence mining methods allow other variations, such as approximate matching and symbol substitution matrices, which we do not consider here.

Our contributions. In this paper we present a novel approach called Prism (which stands for the bold letters in: PRIme-Encoding Based Sequence Mining) for mining frequent sequences. Note that a preliminary short paper on Prism appeared in [5]. Prism utilizes a vertical approach for enumeration and support counting, based on the novel notion of primal block

encoding, which in turn is based on prime factorization theory. Via an extensive evaluation on both synthetic and real

datasets, we show that Prism outperforms popular sequence mining methods like SPADE [14], PreﬁxSpan [10] and SPAM [2], by an order of magnitude or more.

2. Preliminary concepts

2.1. Prime factors & generators

An integer p is a prime integer if p

>

1 and the only positive divisors of p are 1 and p.

Theorem 2.1 (Unique factorization theorem). (See [4].) Every positive integer n is either 1 or can be expressed as a product of prime

integers, and this factorization is unique except for the order of the factors.

Let p1

,

p2

, . . . ,

prbe the distinct prime factors of n, arranged in order, so that p1

<

p2

<

· · · <

pr. All repeated factors can

be collected together and expressed using exponents, so that n

=

pm1

1

×

p

m2

2

× · · · ×

p

mr

r , where each miis a positive integer,

called the multiplicity of pi, and this factorization of n is called the standard form of n. For example, n

=

31 752

=

23

×

34

×

72.

Given two integers a

=

ra_i₌₁pmia_ia and b

=

rb i=1p

mib

ib in their standard forms, the greatest common divisor of the two

(3)

Fig. 1. Lattice overP(G). Each node shows a set S∈P(G)using its bit-vector SB_{and the value obtained by multiplying its elements}_S. with 1

j

ra, 1

k

rb. For example, if a

=

7056

=

24

×

32

×

72 and b

=

18 900

=

22

×

33

×

52

×

7, then gcd

(

a

,

b

)

=

22

_×

₃2

_×

₇

₌

_252.

For our purposes, we are particularly interested in square-free integers n, deﬁned as integers, whose prime factors pi all

have multiplicity mi

=

1 (note: the name square-free suggests that no multiplicity is 2 or more, i.e., the number does not

contain a square of any factor). Given a set G, let P

(

G

)

denote the set of all subsets of G. If we assume that G is ordered and indexed by the set

{

1

,

2

, . . . ,

|

G

|}

, then any subset S

∈

P

(

G

)

can be represented as a

|

G

|

-length bit-vector (or binary vector), denoted SB_{, whose ith bit (from left) is 1 if the ith element of G is in S, or else the ith bit is 0. For example, if}

G

= {

2

,

3

,

5

,

7

}

, and S

= {

2

,

5

}

, then SB

₌

_1010.

Given a set S

∈

P

(

G

)

, we deﬁne a multiplication operator

⊗

on S, as follows:

S

=

s1

×

s2

× · · · ×

s|S|, with si

∈

S.

If S

= ∅

, deﬁne

S

=

1. In other words,

S is the value obtained by multiplying all members of S. Further, deﬁne

P

(

G

)

= {

S: S

∈

P

(

G

)

}

to be the set obtained by applying the multiplication operator on all sets in P

(

G

)

. In this case we also say that G is a generator of

P

(

G

)

under the multiplication operator.

We say that a set G is a square-free generator if each X

∈

P

(

G

)

is square-free. In case a generator G consists of only prime integers, we call it a prime generator. Recall that a semi-group is a set that is closed under an associative binary operator

⊗

. We say that a set P is a square-free semi-group (under operator

⊗

) iff for all X

,

Y

∈

P , if Z

=

X

⊗

Y is

square-free, then Z

∈

P .

Theorem 2.2. A set P is a square-free semi-group with operator

⊗

iff it has a square-free prime generator G. In other words, P is a

square-free semi-group iff P

=

P

(

G

)

.

Proof. If P is a square-free semi-group, then we choose as its generator set G, the distinct prime factors over all elements of P . On the other hand, if G is a square-free prime generator, then

P

(

G

)

is a square-free semi-group. It follows that

P

=

P

(

G

)

.

2

As an example, let G

= {

2

,

3

,

5

,

7

}

be the set of the ﬁrst four prime numbers. Then

P

(

G

)

= {

1

,

2

,

3

,

5

,

7

,

6

,

10

,

14

,

15

,

21

,

35

,

30

,

42

,

70

,

105

,

210

}

. It is easy to see that G is a square-free generator of

P

(

G

)

, which in turn is a square-free semi-group, since the product of any two of its elements that is square-free is already in the set.

The set P

(

G

)

induces a lattice over the semi-group

P

(

G

)

as shown in Fig. 1. In this lattice, the meet operation (

∧

)

is set intersection over elements of P

(

G

)

, which corresponds to the gcd of the corresponding elements of

P

(

G

)

. The

join operation

(

∨)

is set union (over P

(

G

)

), which corresponds to the least common multiple (lcm) over

P

(

G

)

. For

exam-ple, 1010

(

10

)

∧

1001

(

14

)

=

1000

(

2

)

, conﬁrming that gcd

(

10

,

14

)

=

2, and 1010

(

10

)

∨

1001

(

14

)

=

1011

(

70

)

, indicating that

lcm

(

10

,

14

)

=

70. More formally, we have:

Lemma 2.3. Let

P

(

G

)

be a square-free semi-group with prime generator G, and let X

,

Y

∈

P

(

G

)

be two distinct elements, then gcd

(

X

,

Y

)

=

(

SX

∩

SY

)

, and lcm

(

X

,

Y

)

=

(

SX

∪

SY

)

, where X

=

SXand Y

=

SY, and SX

,

SY

∈

P

(

G

)

are the prime factors

(4)

Proof. Since X and Y are square-free, gcd

(

X

,

Y

)

=

_ipi, where pi is a factor common to both X and Y . But this is exactly

the expression given by

(

SX

∩

SY

)

. Likewise lcm

(

X

,

Y

)

=

ipi, where pi is a factor of either X and Y . This is the same

as

(

SX

∪

SY

)

.

2

Deﬁne the factor-cardinality, denoted

X

G, for any X

∈

P

(

G

)

, as the number of prime factors from G in the

factoriza-tion of X . Let X

=

SX, with SX

⊆

G. Then

X

G

= |

SX

|

. For example,

21

G

= |{

3

,

7

}| =

2. Note that

{

1

}

G

=

0, since 1

has no prime factors in G.

Corollary 2.4. Let

P

(

G

)

be a square-free semi-group with prime generator G, and let X

,

Y

∈

P

(

G

)

be two distinct elements, then gcd

(

X

,

Y

)

∈

P

(

G

)

.

Proof. gcd

(

X

,

Y

)

=

_ipi, where pi

∈

G is a factor common to both X and Y . It follows that gcd

(

X

,

Y

)

∈

P

(

G

)

.

2

2.2. Primal block encoding

Let

T = [

1

:

N

] = {

1

,

2

, . . . ,

N

}

be the set of the ﬁrst N positive integers, let G be a base set of prime num-bers sorted in increasing order. Without loss of generality assume that N is a multiple of

|

G

|

, i.e., N

=

m

× |

G

|

. Let

B

∈ {

0

,

1

}

N be a bit-vector of length N. Then B can partitioned into m

=

_|N_G_| consecutive blocks, where each block

Bi

=

B [

(

i

−

1

)

× |

G

| +

1

:

i

× |

G

|

], with 1

i

m. In fact, each Bi

∈ {

0

,

1

}

|G|, is the indicator bit-vector SB

represent-ing some subset S

⊆

G. Let Bi

[

j

]

denote the jth bit in Bi, and let G

[

j

]

denote the jth prime in G. Deﬁne the

value of Bi with respect to G as follows,

ν

(

Bi

,

G

)

=

{

G

[

j

]

Bi[j]

_}

_{. For example if B}

i

=

1001, and G

= {

2

,

3

,

5

,

7

}

, then

ν

(

Bi

,

G

)

=

21

×

30

×

50

×

71

=

2

×

7

=

14. Note also that if Bi

=

0000 then

ν

(

Bi

,

G

)

=

1.

Deﬁne

ν

(

B

,

G

)

= {

ν

(

Bi

,

G

)

: 1

i

m

}

, as the primal block encoding of B with respect to the base prime set G. It should be

clear that each

ν

(

Bi

,

G

)

∈

P

(

G

)

. Note that when there is no ambiguity, we write

ν

(

Bi

,

G

)

as

ν

(

Bi

)

, and

ν

(

B

,

G

)

as

ν

(

B

)

. As

an example, let

T = {

1

,

2

, . . . ,

12

}

, G

= {

2

,

3

,

5

,

7

}

, and B

=

100111100100. Then there are m

=

12

/

4

=

3 blocks, B1

=

1001,

B2

=

1110 and B3

=

0100. We have

ν

(

B1

)

=

SG

(

B1

)

=

{

2

,

7

} =

2

×

7

=

14, and the primal block encoding of B is given as

ν

(

B

)

= {

14

,

30

,

3

}

. We also deﬁne the inverse operation

ν

−1

₍

_{

₁₄

_,

₃₀

_,

₃

_{}) =}

_ν

−1

₍

₁₄

₎

_ν

−1

₍

₃₀

₎

_ν

−1

₍

₃

₎

₌

_100111100100

₌

_B. We also write i

∈

ν

−1

₍

_·)

_{if the ith bit is set in}

_ν

−1

₍

_·)

_{. Also a bit-vector of all zeros (of any length) is denoted as 0, and} its corresponding value/encoding is denoted as 1. For example, if C

=

00000000, then we also write C

=

0, and

ν

(

C

)

=

{

1

,

1

} =

1.

Let G be the base prime set, and let A

=

A1A2

· · ·

Am, and B

=

B1B2

· · ·

Bmbe any two bit-vectors in

{

0

,

1

}

N, with N

=

m

× |

G

|

, and Ai

,

Bi

∈ {

0

,

1

}

|G|. Deﬁne gcd

(

ν

(

A

),

ν

(

B

))

= {

gcd

(

ν

(

Ai

),

ν

(

Bi

))

: 1

i

m

}

. For example, for

ν

(

B

)

= {

14

,

30

,

5

}

and

ν

(

A

)

= {

2

,

210

,

2

}

, we have gcd

(

ν

(

B

),

ν

(

A

))

= {

gcd

(

14

,

2

),

gcd

(

30

,

210

),

gcd

(

5

,

2

)

} = {

2

,

30

,

1

}

.

Let A

=

A1A2

· · ·

Ambe a bit-vector of length N, where each Ai is a

|

G

|

length bit-vector. Let fA

=

arg minj

{

A

[

j

] =

1

}

be

the position of the ﬁrst ‘1’ in A, across all blocks Ai. Deﬁne a masking operator

(

A

)

as follows:

(

A

)

[

j

] =

0

,

j

fA

,

1

,

j

>

fA

.

In other words,

(

A

)

is the bit vector obtained by setting A

[

fA

] =

0 and setting A

[

j

] =

1 for all j

>

fA. For example, if

A

=

001001100100, then fA

=

3, and

(

A

)

=

000111111111. Likewise, we can deﬁne the masking operator for a primal

block encoding as follows:

(

ν

(

A

))

=

ν

((

A

)

. For example,

(

ν

(

A

))

=

ν

((

001001100100

)

=

ν

(

000111111111

)

=

ν

(

0001

)

ν

(

1111

)

ν

(

1111

)

= {

7

,

210

,

210

}

. In other words,

(

{

5

,

15

,

3

})

= {

7

,

210

,

210

}

, since

ν

(

A

)

=

ν

(

001001100100

)

=

ν

(

0010

)

ν

(

0110

)

ν

(

0100

)

= {

5

,

15

,

3

}

. 3. The PRISMalgorithm

Sequence mining involves a combinatorial enumeration or search for frequent sequences over the sequence partial order. There are three key aspects of Prism that need elucidation: (i) the search space traversal strategy, (ii) the data structures used to represent the database and intermediate candidate information, and (iii) how support counting is done for candi-dates. Prism uses the primal block encoding approach to represent candidates sequences, and uses join operations over the primal blocks to determine the frequency for each candidate. We detail these three aspects below.

3.1. Search space

The partial order induced by the subsequence relation is typically represented as a search tree. The sequence search tree can be deﬁned recursively as follows: The root of the tree is at level zero and is labeled with the null sequence

∅

. A node labeled with sequence S at level k, i.e., a k-sequence, is repeatedly extended by adding one item from

I

to generate a child node at the next level (k

+

1), i.e., a

(

k

+

1

)

-sequence. There are two ways to extend a sequence by an item: sequence

extension and itemset extension. In a sequence extension, the item is appended to the sequential pattern as a new itemset.

(5)

Fig. 2. Partial Sequence Search Tree: solid lines denote sequence extensions, whereas dotted lines denote itemset extensions.

greater than all items in the last itemset. Thus, a sequence-extension always increases the size of the sequence, whereas, an itemset-extension does not. For example, if we have a node S

=

ab

→

a and an item b for extending S, then ab

→

a

→

b is

a sequence-extension, and ab

→

ab is an itemset extension. Fig. 2 shows an example of a partial sequence search tree for

two items, a and b, showing sequences up to size three; only the subtree under a is shown. Essentially all of the current sequence mining methods follow the sequence and itemset extension steps to traverse the search tree. The main difference lies in how the support for candidate patterns is determined. Below we describe how Prism computes the frequency for each extension.

3.2. Primal block encoding

Consider the example sequence database shown in Fig. 3(a), consisting of 5 sequences of different lengths over the items

I = {

a

,

b

,

c

}

. Let G

= {

2

,

3

,

5

,

7

}

be the base square-free prime generator set. Let us see how Prism constructs the primal block encoding for a single item a. In the ﬁrst step, Prism constructs the primal encoding of the positions within each sequence. For example, since a occurs in positions 1, 4, and 6 (assuming positions/indexes starting at 1) in sequence 1, we obtain the bit-encoding of a’s occurrences: 100101. Prism next pads this bit-vector so that it is a multiple of

|

G

| =

4, to obtain A

=

10010100 (note: bold bits denote padding). Next we compute

ν

(

A

)

=

ν

(

1001

)

ν

(

0100

)

= {

14

,

3

}

. The position encoding for a over all the sequences is shown in Fig. 3(b).

Prism next computes the primal encoding for the sequence ids. Since a occurs in all sequences, except for 4, we can represent a’s sequence occurrences as a bit-vector A

=

11101000 after padding. This yields the primal encoding shown in Fig. 3(c), since

ν

(

A

)

=

ν

(

1110

)

ν

(

1000

)

= {

30

,

2

}

. The full primal encoding for item a consists of all the sequence and position blocks, as shown in Fig. 3(d). A block Ai

=

0000

=

0, with

ν

(

Ai

)

= {

1

} =

1, is also called an empty block. Note

that the full encoding retains all the empty position blocks, for example, a does not occur in the second position block in sequence 5, and thus its bit-vector is 0000, and the primal code is

{

1

}

. In general, since items are expected to be sparse, there may be many blocks within a sequence where an item does not appear.

To eliminate those empty blocks, Prism retains only the non-empty blocks in the primal encoding. To do this it needs to keep an index with each sequence block to indicate which non-empty position blocks correspond to a given sequence block. Fig. 3(e) shows the actual (compact) primal block encoding for item a. The ﬁrst sequence block is 30, with factor-cardinality

30

G

=

3, which means that there are 3 valid (i.e., with non-empty position blocks) sequences in this block, and for each

of these, we store the offsets into the position blocks. For example, the offset of sequence 1 is 1, with the ﬁrst two position blocks corresponding to this sequence. Thus the offset for sequence 2 is 3, with only one position block, and ﬁnally, the offset of sequence 3 is 4. Note that the sequences which represent the sequence block 30, can be found directly from the corresponding bit-vector

ν

−1

₍

₃₀

₎

₌

_{1110, which indicates that sequence 4 is not valid. The second sequence block for a} is 2 (corresponding to

ν

−1

₍

₂

₎

₌

_{1000), indicating that only sequence 5 is valid, and its position blocks begin as position 5.} The beneﬁt of this sparse representation becomes clear when we consider the primal encoding for c. Its full encoding (see Fig. 3(d)) contains a lot of redundant information, which has been eliminated in the compact primal block encoding (see Fig. 3(e)).

It is worth noting that the support of a sequence S can be directly determined from its sequence blocks in the primal block encoding. Let

E(

S

)

= (

S

,

P

S

)

denote the primal block encoding for sequence S, where

S

S is the set of all encoded

sequence blocks, and

P

S is the set of all encoded position blocks for S.

(6)

Fig. 3. Example of primal block encoding: (a) Example database. (b) Position encoding for a. (c) Sequence encoding for a. (d) Full primal blocks for a, b and c. (e) Primal block encoding for a, b and c.

Proof.

vi

G gives the number of prime factors from G in the factorization of block vi. But a prime factor appears in

block vi at position j iff S occurs in sequence id j associated with that block. Thus

vi∈SS

vi

G gives a count of all the

sequences in the database where S occurs across all blocks, which equals sup

(

S

)

.

2

For example, for S

=

a, since

S

a

= {

30

,

2

}

, we have sup

(

a

)

=

30

G

+

2

G

=

3

+

1

=

4. Given a list of full or compact

position blocks

P

S for a sequence S, we use the notation

P

Si to denote those positions blocks, which come from

se-quence id i. For example, in

P

_a1

= {

14

,

3

}

. In the full encoding

P

_a5

= {

210

,

1

}

, but in the compact encoding

P

_a5

= {

210

}

(see Fig. 3(d)–(e)).

3.3. Support counting via primal block joins

We now show how the support for itemset and sequence extensions can be determined via primal block joins. Assume that we are at some node S in the sequence search tree, where S is a k-sequence. Let P be the parent node of S, which means that P is a

(

k

−

1

)

length preﬁx of S. Let us assume that we have available the primal block encoding for P (namely

E(

P

)

), as well as all its children. As illustrated in Fig. 2, S is extended by considering the other sibling nodes of S under parent P . For example, if P

=

ab, and S

=

ab

→

a, then we can generate other extensions of S by also considering which

(7)

Fig. 4. Itemset Extensions via primal block joins.

items a and b can be used to extend S

=

ab

→

a. There are two possible sequence extensions, namely ab

→

a

→

a and

ab

→

a

→

b, and one itemset extension ab

→

ab. In fact, the support for any extension is always the result of joining the

primal block encodings of two (possible the same) sibling nodes under the same preﬁx P . For example,

E(

ab

→

a

→

b

)

is obtained as result of a primal block sequence join operation over

E(

ab

→

a

)

and

E(

ab

→

b

)

, whereas

E(

ab

→

ab

)

is obtained as result of a primal block itemset join operation over

E(

ab

→

a

)

and

E(

ab

→

b

)

.

The frequent sequence enumeration process starts with the root of the search tree as the preﬁx node P

= ∅

, and Prism assumes that initially we know the primal block encodings for all single items. Prism then recursively extends each node in the search tree, computes the support via the primal block joins, and retains new candidates (or extensions) only if they are frequent. The search is essentially depth-ﬁrst, the main difference being that for any node S, all of its extensions are evaluated before the depth-ﬁrst recursive call. When there are no new frequent extensions found, the search stops. To complete the description, we now detail the primal block join operations. We will illustrate the primal block itemset and sequence joins using the primal encodings

E(

a

)

and

E(

b

)

, for items a and b, respectively, as shown in Fig. 3(e).

Itemset extensions. Let us ﬁrst consider how to obtain the primal block encoding for the itemset extension

E(

ab

)

, which is illustrated in Fig. 4. Note that the sequence blocks

S

a

= {

30

,

2

}

and

S

b

= {

210

,

2

}

contain all information about the

relevant sequence ids where a and b occur, respectively. To ﬁnd the sequence block for itemset extension ab, we simply have to compute the gcd for the corresponding elements from the two sequence blocks, namely gcd

(

30

,

210

)

=

30 (which corresponds to the bit-vector 1110), and gcd

(

2

,

2

)

=

2 (which corresponds to bit-vector 1000). We say that a sequence id i

∈

gcd

(

S

a

,

S

b

)

if the ith bit is set in the bit vector

ν

−1

(

gcd

(

S

a

,

S

b

))

. Since

ν

−1

(

gcd

(

S

a

,

S

b

))

=

ν

−1

(

{

30

,

2

}, {

210

,

2

}) =

ν

−1

₍

₃₀

_,

₂

₎

₌

_{11101000, we ﬁnd that sids 1, 2, 3 and 5 are the ones that contain occurrences of both a and b.}

All that remains to be done is to determine, by looking at the position blocks, if a and b, in fact, occur simultaneously at some position in those sequences. Let us consider each sequence separately. Looking at sequence 1, we ﬁnd in Fig. 3(e) that its positions blocks are

P

a1

= {

14

,

3

}

in

E(

a

)

and

P

b1

= {

210

,

2

}

in

E(

b

)

. To ﬁnd where a and b co-occur in sequence 1,

all we have to do is compute the gcd of these position blocks to obtain gcb

(

a

,

b

)

= {

gcd

(

14

,

210

),

gcd

(

3

,

2

)

} = {

14

,

1

}

, which indicates that ab only occur at positions 1 and 4 in sequence 1 (since

ν

−1

₍

₁₄

₎

₌

_{1001). A quick look at Fig. 3(a) conﬁrms} that this is indeed correct. If we continue in like manner for the remaining sequences (2, 3 and 5), we obtain the results shown in Fig. 4, which also shows the ﬁnal primal block encoding

E(

ab

)

. Note that there is at least one non-empty block for each of the sequences, even though for sequence 1, the second position block is discarded in the ﬁnal primal encoding. Thus sup

(

ab

)

=

30

G

+

2

G

=

3

+

1

=

4. In general, we can state that:

Theorem 3.2. Let X and Y be any two different siblings sharing a preﬁx node P in the sequence search tree, and let

E(

X

)

= (

S

X

,

P

X

)

and

E(

Y

)

= (

S

Y

,

P

Y

)

be their primal block encodings. Let gcdX Y

=

gcd

(S

X

,

S

Y

)

. The primal block for the itemset extension of X with

Y is given as

E(

X Y

)

= (

S

X Y

,

P

X Y

)

, where

P

X Y

= {

gcd

(P

Xi

,

P

Yi

)

|

i

∈

ν

−1

(

gcdX Y

)

}

, and

S

X Y

=

ν

(

BX Y

)

, where BX Yis the bit-vector

(8)

Fig. 5. Sequence extensions via primal block joins.

Proof. Note that gcd_{X Y} gives exactly the set of sequences where both X and Y occur. Further i

∈

ν

−1

₍

_gcd

X Y

)

gives the

sequence ids of those sequences. Next,

P

X Y is obtained by looking at precisely the set of blocks from those sequence ids.

The set of positions within sequence id i where both X and Y co-occur are given by gcd

(P

i_X

,

P

_Yi

)

. Finally,

S

X Y is derived

from those position-blocks in sequence i that are non-empty (i.e.,

P

i

X Y

=

1).

2

For our example above, gcd_{X Y}

= {

gcd

(

30

,

210

),

gcd

(

2

,

2

)

} = {

30

,

1

}

. Since

ν

−1

₍

_{

₃₀

_,

₁

_{}) =}

_{11101000, we have to look} at the position blocks from sids 1, 2, 3 and 5. Thus

P

X Y

= {

gcd

(

P

_X1

,

P

_Y1

),

gcd

(

P

2_X

,

P

_Y2

),

gcd

(

P

_X3

,

P

_Y3

),

gcd

(

P

5_X

,

P

_Y5

)

} =

{

gcd

(

{

14

,

3

}, {

210

,

2

}),

gcd

(

{

2

}, {

30

}),

gcd

(

{

3

}, {

6

}),

gcd

(

{

210

}, {

30

})} = {{

14

,

1

}, {

2

}, {

3

}, {

30

}}

. Eliminating the empty blocks, we have

P

X Y

= {{

14

}, {

2

}, {

3

}, {

30

}}

. Since all sequences (i.e., those with sids 1, 2, 3, 5) have at least one non-empty block,

we have BX Y

=

11101000, and

S

X Y

=

ν

(

BX Y

)

= {

30

,

2

}

. This corresponds to the compact primal encoding shown in Fig. 4.

Sequence extensions. Let us consider how to obtain the primal block encoding for the sequence extension

E(

a

→

b

)

, which is illustrated in Fig. 5. The ﬁrst step involves computing the gcd for the sequence blocks as before, which yields sequences 1, 2, 3 and 5, as those which may potentially contain the sequence a

→

b.

For sequence 1, we have the positions blocks

P

a1

= {

14

,

3

}

for a and

P

b1

= {

210

,

2

}

for b. The key difference with the

itemset extension is the way in which we process each sequence. Instead of computing gcd

(

{

14

,

3

}, {

210

,

2

})

, we com-pute gcd

((

{

14

,

3

})

,

{

210

,

2

}) =

gcd

(

{

105

,

210

}, {

210

,

2

}) = {

gcd

(

105

,

210

),

gcd

(

210

,

2

)

} = {

105

,

2

}

. Note that

ν

−1

₍

_{

₁₀₅

_,

₂

_{}) =}

01111000, which precisely indicate those positions in sequence 1, where b occurs after an a. Thus, sequence joins always keep track of the positions of the last item in the sequence. Proceeding in like manner for sequences 2, 3, and 5, we obtain the results shown in Fig. 5. Note that for sequence 3, even though it contains both items a and b, b never oc-curs after an a, and thus sequence 3 does not contribute to the support of a

→

b. This is also conﬁrmed by computing gcd

((

3

)

,

6

)

=

gcd

(

35

,

6

)

=

1, which leads to an empty block (

ν

−1

₍

₁

₎

₌

_{0000). Thus in the compact primal encoding of}

E(

a

→

b

)

, sequence 3 drops out. The remaining sequences 1, 2, and 5, contribute at least one non-empty block, which yields

S

a→b

=

ν

(

11001000

)

= {

6

,

2

}

, as shown in Fig. 5, with support sup

(

a

→

b

)

=

6

G

+

2

G

=

2

+

1

=

3. In general, we can

state that:

Theorem 3.3. Let X and Y be any two (possibly the same) sibling sharing a preﬁx node P in the sequence search tree, and let

E(

X

)

=

(S

X

,

P

X

)

and

E(

Y

)

= (

S

Y

,

P

Y

)

be their primal block encodings. Let gcdX→Y

=

gcd

(S

X

,

S

Y

)

. The primal block for the sequence

extension of X with Y is given as

E(

X

→

Y

)

= (

S

X→Y

,

P

X→Y

)

, where

P

X→Y

= {

gcd

((P

iX

)

,

P

Yi

)

|

i

∈

ν

−1

(

gcdX Y

)

}

, and

S

X→Y

=

ν

(

BX→Y

)

, where BX→Yis the bit-vector deﬁned as follows: BX→Y

[

i

] =

1

⇔

P

i_X_→_Y

=

1.

Proof. The proof is similar to that in Theorem 3.2. i

∈

ν

−1

₍

_gcd

X Y

)

gives the sequence ids which contain both X and Y . We

have look at the position blocks within each such sid i. gcd

((P

i_X

)

,

P

_Yi

)

is obtained from those positions for Y that come after some position for X , since

(P

i

X

)

turns on all the bits after the ﬁrst set bit, which is the ﬁrst position where X occurs.

(9)

Thus each block size is now 8 instead of 4. Note that with the new G, the largest element in

P

(

G

)

is

G

=

9 699 690. In total there are

|

P

(

G

)

| =

256 possible elements in semi-group

P

(

G

)

.

In a naive implementation, the GCD lookup table can be stored as a two-dimensional array with cardinality 9 699 690

×

9 699 690, where GCD

(

i

,

j

)

=

gcd

(

i

,

j

)

for any two integers i

,

j

∈ [

1

:

9 699 690

]

. This is clearly grossly ineﬃcient, since there are in fact only 256 distinct (square-free) products in

P

(

G

)

, and we thus really need a table of size 256

×

256 to store all the gcd values. We achieve this by representing each element in

P

(

G

)

by its rank, as opposed to its value. Let

S

∈

P

(

G

)

, and let SB _its

_|

_G

_|

_{-length indicator bit-vector, whose ith bit is ‘1’ iff the i-element of G is in S. Then deﬁne the}

rank of

S as equal to the decimal value of SB_{, i.e., rank}

₍

_S

₎

₌

_decimal

₍

_SB

₎

_{(with the left-most bit being the least}

signiﬁcant bit), which imposes a total order on

P

(

G

)

. For example, the rank

(

1

)

=

decimal

(

00000000

)

=

0, rank

(

13

)

=

decimal

(

00000100

)

=

32, rank

(

35

)

=

decimal

(

00110000

)

=

12, and rank

(

9 699 690

)

=

decimal

(

11111111

)

=

255.

Lemma 3.4. Let S

,

T

∈

P

(

G

)

, and let SB

_,

_TB _{be their indicator bit-vectors with respect to generator set G. Then}

rank

(

gcd

(

S

,

T

))

=

decimal

(

SB

_∧

_TB

₎

_.

Proof. By Lemma 2.3 gcd

(

S

,

T

)

=

(

S

∩

T

)

. Thus rank

(

gcd

(

S

,

T

))

=

rank

(

S

∩

T

))

=

decimal

(

SB

_∧

_TB

₎

_.

₂

Consider for example, gcd

(

35

,

6

)

=

1. We have rank

(

gcd

(

35

,

6

))

=

decimal

(

00110000

∧

11000000

)

=

decimal

(

00000000

)

=

0, which matches the computation rank

(

gcd

(

35

,

6

))

=

rank

(

1

)

=

0.

Instead of using direct values, all gcd computations are performed in terms of the ranks of the corresponding elements. Thus each cell in the GCD table stores: GCD

(

rank

(

i

),

rank

(

j

))

=

rank

(

gcd

(

i

,

j

))

, where i

,

j

∈

P

(

G

)

. This brings down the storage requirements of the GCD table to just 256

×

256

=

65 536 bytes, since each rank requires only one byte of memory (since rank

∈ [

0

:

255

]

).

Other lookup tables. Once the ﬁnal sequence blocks are computed for after a join operation, we need to determine the ac-tual support, by adding the factor cardinalities for each sequence block. To speed up this support determination, Prism main-tains a one-dimensional look-up array called CARD to store the factor-cardinality for each element in the set

P

(

G

)

. That is we store CARD

(

rank

(

x

))

=

x

G for all x

∈

P

(

G

)

. For example, since

35

G

=

2, we have CARD

(

rank

(

35

))

=

CARD

(

12

)

=

2.

Furthermore, in sequence block joins, we need to compute the masking operation for each position block. For this Prism maintains another one-dimensional array called MASK, where MASK

(

rank

(

x

))

=

rank

((

x

)

for each x

∈

P

(

G

)

. For example

MASK

(

rank

(

2

))

=

rank

((

2

)

=

rank

(

4 849 845

)

=

254.

Finally, as an optimization for fast joins, once we determine gcd_{X Y} or gcd_X_→_Y in the primal itemset/sequence block joins, if the number of supporting sequences is less than minsup, we can stop further processing of position blocks, since the resulting extensions cannot be frequent in this case.

4. Experiments

In this section we study the performance of Prism by varying different database parameters and by comparing it with other state-of-the-art sequence mining algorithms like SPADE [14], PreﬁxSpan [10] and SPAM [2]. The codes/executables for these methods were obtained from their authors. All experiments were performed on a laptop with 2.4 GHz Intel Celeron processor, and with 512 MB memory, running Linux.

Synthetic datasets. We used several synthetic datasets, generated using the approach outlined in [1]. These datasets mimic real-world transactions, where people buy a sequence of sets of items. Some customers may buy only some items from the sequences, or they may buy items from multiple sequences. The input sequence size and itemset size are clustered around a mean and a few of them may have many elements. The datasets are generated using the following process. First NI maximal

(10)

Fig. 6. Performance comparison with varying minimum support.

by assigning itemsets from NI to each sequence. Next a customer (or input sequence) of average C transactions (or itemsets)

is created, and sequences in NS are assigned to different customer elements, respecting the average transaction size of T .

The generation stops when D input-sequences have been generated. For example, the dataset

C20T50S20I10N1kD100k

, means that it has D

=

100k sequences, with C

=

20 average transactions, T

=

50 average transaction size, chosen from a pool with average sequence size S

=

20 and average transaction size I

=

10, with N

=

1k different items. The default itemset and sequence pool sizes are always set to NS

=

5000 and NI

=

25 000, respectively.

Real datasets. We also compared the methods on two real datasets taken from [13].

Gazelle

was part of the KDD Cup 2000 challenge dataset. It contains log data from a (defunct) web retailer. It has 59 602 sequences, with an average length of 2.5, length range of

[

1

,

267

]

, and 497 distinct items. The

Protein

dataset contains 116 142 proteins sequences down-loaded from the Entrez database at NCBI/NIH. The average sequence length is 482, with length range of

[

400

,

600

]

, and 24 distinct items (the different amino acids).

4.1. Performance comparison

Figs. 6, 7 and 8 show the performance comparison of the four algorithms, namely, SPAM [2], PreﬁxSpan [10], SPADE [14] and Prism, on different synthetic and real datasets, with varying minimum support. As noted earlier, for Prism we used the ﬁrst 8 primes as the base prime generator set G.

Fig. 6(a)–(d) show (small) datasets where all four methods can run for at least some support values. The datasets used in (a) and (b) are identical, except that (a) has only 100 distinct items

(

N

=

0

.

1k

)

, whereas (b) has 1000

(

N

=

1k

)

; (c) has much larger average transaction length

(

T

=

60

)

compared to (b), with

(

T

=

20

)

; (d) has larger sequence length

(

C

=

35

)

, and larger parameters compared to (a)–(c). For these datasets, we ﬁnd that, with a few exceptions, Prism has the best overall performance. For the support values where SPAM can run, it is generally in the second spot (in fact, it is the fastest

(11)

on

C10T20S4I4N0.1kD10k

). However, SPAM fails to run for lower support values. Compared to PreﬁxSpan and SPADE, Prismgenerally outperforms them by over an order of magnitude.

Fig. 7(a)–(d) show larger datasets (generally with D

=

100k sequences). On these SPAM could not run on our laptop, and thus is not shown in Fig. 7(a)–(d). Prism again outperforms PreﬁxSpan and SPADE, by up to an order of magnitude, except on (a), where PreﬁxSpan does the best. Nevertheless, if we reduce the number of items (to N

=

0

.

1k in (b)), increase the sequence length (to C

=

20 in (c)) or increase other parameters (in (d)), we ﬁnd that Prism is clearly superior.

Fig. 8(a)–(b) show performance on two datasets with larger average sequence (C

=

50 in (a)) and transaction lengths (T

=

50 in (b)). On these even PreﬁxSpan is unable to run, and SPADE runs only for larger support values. For these datasets, Prism is the most effective.

Finally, Fig. 8(c)–(d) show the performance comparison on the real datasets,

Gazelle

and

Protein

. SPAM failed to run on both these datasets on our laptop, and PreﬁxSpan did not run on

Protein

. On

Gazelle

Prism is an order of magnitude faster than PreﬁxSpan, but is comparable to SPADE. On

Protein

Prism_{outperforms SPADE by an order of} magnitude (for lower support).

Based on these extensive set of results on diverse datasets, we can observe some general trends. Across the board, our new approach, Prism, is the fastest (with a few exceptions), and runs for lower support values than competing methods. SPAM generally works only for smaller datasets due to its very high memory consumption (see below); when it runs, SPAM is generally the second best. SPADE and PreﬁxSpan do not suffer from the same problems as SPAM, but they are much slower than Prism, or they fail to run for lower support values, when the database parameters are large. Since LAPIN [13] has also been shown to outperform existing approaches, especially on long, dense datasets, in the future, we would like to experimentally compare our approach with LAPIN.

Scalability. Fig. 9 shows the scalability of the different methods when we vary the different dataset parameters. The base values used are as follows: C

=

20, T

=

20, S

=

20, I

=

10, N

=

1k, and D

=

100k. We vary a single parameter at a time,

(12)

keeping all others fixed to the default values. Since SPAM failed to run on these larger datasets, we could not report on its scalability. Fig. 9(a) shows the effect of increasing the number of sequences from 10k to 100k. The trend is approximately linear. Fig. 9(b) shows the effect of increasing the average number of transactions per sequence (or the average sequence length). PrefixSpan is very sensitive to this parameter, displaying an exponential growth; it fails to run on larger values. Prism scales gracefully, and outperforms SPADE by over 5 times. Fig. 9(c) shows what happens when we increase the average transaction size within the sequences. All methods display an exponential growth, though PrefixSpan grows faster. Both PrefixSpan and SPADE could not run on T

=

50. Fig. 9(d) shows the effect of larger itemsets in the dataset generation pool. The three methods are not very sensitive to this parameter, and the running time grows slowly. Finally, Fig. 9(e) shows the impact of longer sequences in the pool. We ﬁnd that the time decreases exponentially as the maximal sequence size becomes longer. This mainly because fewer sequences are embedded into each customer sequence as the sequence length increases, resulting in fewer frequent sequences. It is worth noting that for shorter values, Prism is over an order of magnitude faster than PreﬁxSpan, and about two times faster than SPADE.

Memory usage. Fig. 10 shows the memory consumption of the four methods on a sample of the datasets. Fig. 10 plots the peak memory consumption during execution (measured using the

memusage

command in Linux). Fig. 10(a)–(b) quickly demonstrate why SPAM is not able to run on all except very small datasets. We ﬁnd that its memory consumption is well beyond the physical memory available (512 MB), and thus the program aborts when the operating system runs out of memory. We can also note that SPADE generally has a 3–5 times higher memory consumption than PreﬁxSpan and Prism. The latter two have comparable and very low memory requirements.

(13)

(14)

Fig. 10. Memory consumption. 5. Conclusion

Based on the extensive experimental comparison with popular sequence mining methods, we conclude that, across the board, Prism is one of the most eﬃcient methods for frequent sequence mining. It outperforms existing methods by an order of magnitude or more, and has a very low memory footprint. It also has good scalability with respect to a number of database parameters. Future work will consider the tasks of mining all the closed and maximal frequent sequences, as well as the task of pushing constraints within the mining process to make the method suitable for domain-speciﬁc sequence mining tasks. For example, allowing approximate matches, allowing substitution costs, and so on.

References

[1] R. Agrawal, R. Srikant, Mining sequential patterns, in: 11th IEEE Intl. Conf. on Data Engineering, 1995.

[2] J. Ayres, J.E. Gehrke, T. Yiu, J. Flannick, Sequential pattern mining using bitmaps, in: SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, July 2002.

[3] J.L. Balcazar, G. Casas-Garriga, On horn axiomatizations for sequential data, in: 10th International Conference on Database Theory, 2005. [4] J. Gilbert, L. Gilbert, Elements of Modern Algebra, PWS Publishing Co., 1995.

[5] K. Gouda, M. Hassaan, M.J. Zaki, PRISM: A prime-encoding approach for frequent sequence mining, in: IEEE Int’l Conf. on Data Mining, October 2007. [6] D. Gusﬁeld, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York, 1997. [7] K.L. Jensen, M.P. Styczynski, I. Rigoutsos, G.N. Stephanopoulos, A generic motif discovery algorithm for sequential data, Bioinformatics 22 (2006) 21–28. [8] H. Mannila, H. Toivonen, I. Verkamo, Discovery of frequent episodes in event sequences, Data Min. Knowl. Discov. 1 (3) (1997) 259–289.

[9] F. Masseglia, F. Cathala, P. Poncelet, The PSP approach for mining sequential patterns, in: European Conf. on Principles of Data Mining and Knowledge Discovery, 1998.

[10] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M.-C. Hsu, PrefixSpan: Mining sequential patterns efficiently by prefixprojected pattern growth, in: Int’l Conf. Data Engineering, April 2001.

[11] R. Srikant, R. Agrawal, Mining sequential patterns: Generalizations and performance improvements, in: 5th Intl. Conf. Extending Database Technology, March 1996.

(15)