Discovering Emerging Patterns Using Improved PrefixSpan Algorithm SK International Journal of Multidisciplinary Research Hub

ISSN: 2394-3122 (Online) Volume 2, Issue 5, May 2015

SK International Journal of Multidisciplinary Research Hub

Journal for all Subjects

Research Article / Survey Paper / Case Study Published By: SK Publisher (www.skpublisher.com)

Discovering Emerging Patterns Using Improved PrefixSpan Algorithm

Jhankhana Koshti¹ Computer Engineering Department

KITRC, Kalol India

Chetna Chand² Computer Engineering Department

KITRC, Kalol India

Abstract: The concept of sequence data mining was first introduced by Rakesh Agrawal and Ramakrishnan Srikant in 1995. Sequential pattern mining is a significant data-mining method for determining time-related behavior in sequence databases. Discovering sequential patterns is a well-studied area in data mining and has found many diverse applications, such as customer purchase behavior analysis, web-log analysis, medical treatments, natural disasters, telecommunication, network detection, etc. Several algorithms have been proposed. The very first was the Apriori algorithm, which was put forward by the founders themselves. Later, more scalable algorithms for complex applications were developed, such as GSP, SPADE, and PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the Apriori-based algorithm GSP, FreeSpan, and SPADE (a sequential pattern mining algorithm that adopts a vertical data format), and that PrefixSpan integrated with pseudo-projection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures. In this work, a survey is made on improving the efficiency of the traditional sequential pattern mining algorithm PrefixSpan by incorporating various constraints effectively and efficiently into the sequential mining process, in order to discover interesting and valuable sequential patterns from sequence databases. At the end, a performance analysis is done on the basis of pseudo-projection, memory space, number of sequential patterns, and execution time for the various algorithms.

Keywords: PrefixSpan, Sequential Pattern Mining, Pattern Growth Approach, Projected Database

I. INTRODUCTION

Sequential pattern mining finds the frequent subsequences in a sequence database and is an important data mining problem with many applications, such as the analysis of customer purchase behavior, web access patterns, disease diagnosis, and DNA sequences. It also supports the analysis of sequencing or time-related processes in different experiments.

The sequential pattern mining problem was first introduced by Agrawal and Srikant [1]. Given a set of sequences, where each sequence consists of a list of elements (itemsets) and each element consists of a set of items, and given a user-specified min_support threshold, sequential pattern mining finds all frequent subsequences, i.e., the subsequences that occur frequently in the set of sequences and satisfy min_support.

In general, the goal of sequential pattern mining algorithms is to discover the sequential patterns in a sequence database.
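The definition above can be illustrated with a minimal sketch. It assumes that a sequence is represented as a list of Python sets and that support is counted as the number of sequences containing the pattern; the function names `contains` and `support` and the toy database are hypothetical and are not taken from the original work.

```python
from typing import List, Set

Sequence = List[Set[str]]  # a sequence is an ordered list of itemsets


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """Return True if `pattern` is a subsequence of `sequence`.

    Each pattern element must be a subset of a distinct itemset of the
    sequence, and the matched itemsets must appear in the same order.
    """
    pos = 0
    for element in pattern:
        # Greedily take the earliest remaining itemset that covers the element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(database: List[Sequence], pattern: Sequence) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


# Hypothetical toy database, used only for illustration.
toy_db = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}],
    [{"a", "d"}, {"c"}],
    [{"b"}, {"c"}],
]
print(support(toy_db, [{"a"}, {"c"}]))  # 2
print(support(toy_db, [{"a", "c"}]))    # 1
```

With min_support = 2, the pattern <(a)(c)> would be reported as frequent in this toy database, while <(ac)> would not.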

Recently, researchers have found that frequency alone is not the best measure for identifying useful and valuable patterns in different applications. When frequency is the only constraint, conventional mining approaches normally produce a large number of patterns and rules, most of which are of no use. Because of this shortcoming, constraint-based pattern mining has grown in importance. In several cases, neither the user's expectations of the discovery process nor the user's background knowledge is taken into account, which makes the mining process costly and hard to manage. Sequential pattern mining, which handles sequential data (e.g., the analysis of frequent behaviors), faces the same drawbacks. Sequential pattern mining algorithms therefore use constraints that limit the number and range of mined patterns to reduce this complexity.

Many approaches have been proposed to extract sequential patterns from sequence databases. Most of the early work focuses on the following two directions:

» Improve the efficiency of the mining process

This direction of research focuses on the efficient mining of sequential patterns in time-related data. In general, these methods can be categorized into three classes: (1) Apriori-based, horizontal formatting methods, such as GSP [2]; (2) Apriori-based, vertical formatting methods, such as SPADE [6]; and (3) projection-based pattern growth methods, such as PrefixSpan [3].

» Extend the mining of sequential patterns to other time-related patterns.

Emerging Patterns

Emerging patterns[12] may be described as patterns, which occur frequently in one set of data and seldom in another.

II. EXISTING ALGORITHM

PREFIXSPAN ALGORITHM

The PrefixSpan algorithm [3] is a sequential pattern mining algorithm. Its main idea is that any frequent subsequence can always be found by growing frequent prefixes.

Table 1. Sequence Database

Table 2. Projected database and Sequential patterns

The PrefixSpan algorithm steps are:

1. Find length-1 sequential patterns. Scan S once to find all frequent items in the sequences. Each of these frequent items is a length-1 sequential pattern. They are <a>:4, <b>:4, <c>:4, <d>:3, <e>:3 and <f>:3, where <pattern>:count represents the pattern and its associated support count.

2. Divide the search space. The complete set of sequential patterns can be partitioned into the following six subsets according to the six prefixes:

(1) the ones having prefix <a>; (2) the ones having prefix <b>; … (6) the ones having prefix <f>.

3. Find subsets of sequential patterns. Each subset of sequential patterns can be mined by constructing the corresponding projected database and mining it recursively.
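Step 1 and the partitioning of step 2 can be sketched as follows. The contents of Table 1 are not reproduced in this text, so the sketch assumes the four-sequence running example of Pei et al. [3] with a minimum support of 2; this assumption reproduces the support counts listed in step 1. The names `S`, `MIN_SUPPORT`, and `length1_patterns` are hypothetical.

```python
from collections import Counter

# Assumed running example from Pei et al. [3]; not taken from Table 1 itself.
S = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
MIN_SUPPORT = 2  # assumed threshold


def length1_patterns(db, min_support):
    """Scan the database once and return the frequent items with their support."""
    counts = Counter()
    for sequence in db:
        # An item contributes at most once per sequence, however often it occurs.
        counts.update(set().union(*sequence))
    return {item: n for item, n in sorted(counts.items()) if n >= min_support}


frequent = length1_patterns(S, MIN_SUPPORT)
print(frequent)  # {'a': 4, 'b': 4, 'c': 4, 'd': 3, 'e': 3, 'f': 3}

# Step 2: one search-space partition per frequent length-1 prefix.
prefixes = [f"<{item}>" for item in frequent]
print(prefixes)
```

Item g occurs in only one sequence and is therefore dropped, which leaves the six prefixes used to partition the search space.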

The PrefixSpan algorithm is designed for mining in projected databases. In this study our database is a long continuous sequence, so we used the PrefixSpan algorithm to mine a continuous sequence database. The key property of PrefixSpan is that it grows longer sequential patterns only from shorter frequent ones; this property was the starting point for our approach.

Algorithm of PrefixSpan is as follows:

Input: A sequence database S, and the minimum support threshold θ

Output: The complete set of sequential patterns and emerging patterns.

Method: Call PrefixSpan(<>, 0, S)

Subroutine PrefixSpan(α, l, S|α)

Parameters: α: a sequential pattern; l: the length of α; S|α: the α-projected database if α ≠ <>, otherwise the sequence database S.
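The recursive structure of the subroutine can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes every element of a sequence is a single item (no multi-item itemsets) and represents each projected database by (sequence index, offset) pointers, in the spirit of pseudo-projection. The names `prefixspan` and `mine` are hypothetical.

```python
from collections import defaultdict


def prefixspan(db, min_support):
    """Return (pattern, support) pairs for sequences of single items."""
    patterns = []

    def mine(alpha, length, projected):
        # `projected` stands for S|alpha: (sequence index, start offset) pairs.
        support = defaultdict(int)
        for sid, start in projected:
            for item in set(db[sid][start:]):
                support[item] += 1
        for item in sorted(support):
            if support[item] < min_support:
                continue
            alpha_new = alpha + (item,)  # grow the prefix by one frequent item
            patterns.append((alpha_new, support[item]))
            # Project on the first occurrence of `item` in each postfix.
            projected_new = [
                (sid, db[sid].index(item, start) + 1)
                for sid, start in projected
                if item in db[sid][start:]
            ]
            mine(alpha_new, length + 1, projected_new)

    mine((), 0, [(sid, 0) for sid in range(len(db))])
    return patterns


# Hypothetical toy database of single-item sequences.
toy = [list("abc"), list("ac"), list("bc")]
for pattern, count in prefixspan(toy, min_support=2):
    print(pattern, count)
```

Because only offsets are stored, no postfix is physically copied, which is the saving that pseudo-projection aims at when the database fits in main memory.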

Conclusion of Existing Work:

» Candidate subsequence generation is reduced.

» Uses pattern growth method.

» Better than both GSP and FreeSpan.

» Explores prefix-projection in sequential pattern mining.

» Mines the complete set of patterns.

» As the prefix-projection method is used, the projected database shrinks in size.

» The cost of constructing the projected database is high.

» Bi-level projection and pseudo-projection may improve mining efficiency.

2.1 Constraint-based Sequential Pattern Mining

Constraint-based sequential pattern mining makes the following contributions:

» First, various kinds of constraints are classified in two orthogonal ways, based on their application semantics and their roles in sequential pattern mining, respectively.

» Constraint-based mining prunes a large search space effectively in sequential pattern mining.

Categories of constraints

A constraint C for sequential pattern mining is a boolean function C(α) on the set of all sequences. The problem of constraint-based sequential pattern mining is to find the complete set of sequential patterns satisfying a given constraint C.

Constraints can be examined and characterized from different points of views.

From the application point of view, the following seven categories of constraints, based on the semantics and the forms of the constraints, can be used.

Constraint 1 (Item Constraint) An item constraint specifies a subset of items that should or should not be present in the patterns.

For example, when mining sequential patterns over a web log, a user may be interested in only patterns about visits to online bookstores.

Constraint 2 (Length Constraint) A length constraint specifies the requirement on the length of the patterns, where the length can be either the number of occurrences of items or the number of transactions. Length constraints can also be specified as the number of distinct items, or even the maximal number of items per transaction.

For example, a user may want to find only long patterns (e.g., patterns consisting of at least 50 transactions) in bio-sequence analysis. Such a requirement can be expressed by a length constraint Clen(α) ≡ (len(α) ≥ 50).

Constraint 3 (Super-pattern Constraint) A super-pattern constraint finds patterns that contain a particular set of patterns as sub-patterns.

For example, an analyst might want to find sequential patterns in which a customer first buys a PC and then buys a digital camera. The constraint can be expressed as Cpat(α) ≡ <(PC)(digital_camera)> ⊑ α, where ⊑ denotes the subsequence relation.

Constraint 4 (Aggregate Constraint) An aggregate constraint is the constraint on an aggregate of items in a pattern, where the aggregate function can be sum, avg, max, min, standard deviation, etc.

For example, a marketing analyst may want sequential patterns where the average price of all the items in each pattern is over Rs.1000.

Constraint 5 (Regular Expression Constraint) A regular expression constraint CRE is a constraint specified as a regular expression over the set of items using the established set of regular expression operators, such as disjunction and Kleene closure. A sequential pattern satisfies CRE if and only if the pattern is accepted by its equivalent deterministic finite automaton.

For example, to find sequential patterns about a Web click stream starting from Yahoo’s home page and reaching hotels in New York City, one may use regular expression constraint Travel (New York | New York City ) ( Hotels | Hotels and Motels | Lodging ), where “|” stands for disjunction.

Constraint 6 (Duration Constraint) A duration constraint is defined only in sequence databases where each transaction in every sequence has a time-stamp. It requires that the sequential patterns in the sequence database must have the property such that the time-stamp difference between the first and the last transactions in a sequential pattern must be longer or shorter than a given period.

Constraint 7 (Gap Constraint) A gap constraint is defined only in sequence databases where each transaction in every sequence has a timestamp. It requires that the timestamp difference between every two adjacent transactions in a sequential pattern be longer or shorter than a given gap.

Among the constraints listed above, duration constraints and gap constraints are support-related, i.e., they are applied to confine how a sequence matches a pattern. To find whether a sequential pattern satisfies these constraints, one needs to examine the sequence databases. For the other constraints, such as the item constraint, aggregate constraint, and regular expression constraint, whether the constraint is satisfied can be determined from the frequent patterns themselves, without referring to the support counting process.
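The constraints that depend only on the pattern itself can be written directly as boolean functions C(α). The following is a minimal sketch covering the item, length, super-pattern, and aggregate constraints; the function names, the price table, and the example pattern are hypothetical, and the length is assumed to be measured in transactions.

```python
from statistics import mean
from typing import Callable, Dict, List, Set

Pattern = List[Set[str]]              # a pattern is an ordered list of itemsets
Constraint = Callable[[Pattern], bool]


def item_constraint(allowed: Set[str]) -> Constraint:
    """Every item of the pattern must belong to `allowed`."""
    return lambda alpha: all(element <= allowed for element in alpha)


def length_constraint(min_len: int) -> Constraint:
    """The pattern must span at least `min_len` transactions."""
    return lambda alpha: len(alpha) >= min_len


def super_pattern_constraint(required: Pattern) -> Constraint:
    """The pattern must contain `required` as a sub-pattern."""
    def check(alpha: Pattern) -> bool:
        pos = 0
        for element in required:
            while pos < len(alpha) and not element <= alpha[pos]:
                pos += 1
            if pos == len(alpha):
                return False
            pos += 1
        return True
    return check


def aggregate_constraint(price: Dict[str, float], min_avg: float) -> Constraint:
    """The average price of the items in the pattern must exceed `min_avg`."""
    return lambda alpha: mean(price[i] for e in alpha for i in e) > min_avg


# Hypothetical prices and pattern, for illustration only.
prices = {"PC": 30000.0, "printer": 5000.0, "digital_camera": 12000.0}
alpha = [{"PC"}, {"printer"}, {"digital_camera"}]

c_pat = super_pattern_constraint([{"PC"}, {"digital_camera"}])
c_avg = aggregate_constraint(prices, min_avg=1000.0)
print(c_pat(alpha), c_avg(alpha), length_constraint(50)(alpha))  # True True False
```

Duration and gap constraints cannot be expressed this way, because they need the timestamps of the matching transactions in the sequence database.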

III. PROPOSED ALGORITHM

The Proposed algorithm discovers Sequential as well as Emerging Patterns by using some constraints.

Algorithm Analysis

Consider the following database. The database contains 4 sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at day 1. It is followed by the itemset {1, 2, 3} at day 2. Finally, itemset {1, 2} appeared at day 3.

Table 3. Sequence Database

From a sequence database, the basic task of sequential pattern mining is to find all sequential patterns whose support meets min_sup. Each sequential pattern must also respect the following parameters.

Constraints Used:

1. Minimum Support (min_sup)

2. Minimum Confidence (min_conf)

3. Maximum Gap (max_gap)

4. Maximum Compactness (max_compact)

5. Recency Support (R_sup)

6. Quantitative Support (Q_sup)

Algorithm

For generating Sequential Patterns

1. Scan database to find length-1 Sequential Pattern.

2. Generate the pseudo sequence database by removing from the sequence database the length-1 items that are not frequent, and also eliminate those that do not satisfy the quantitative constraint.

3. Divide complete set of sequential patterns into different subsets according to set of length-1 sequential patterns (prefix).

4. Construct projected database for each prefix.

5. Find frequent item b from each projected database which satisfies Max_gap and Q_Sup.

6. For each frequent item b append it to prefix to generate new prefix in such a way that

» b can be assembled to the last element of prefix to form a sequential pattern or;

» <b> can be appended to prefix to form a sequential pattern.

7. Recursively generate projected database for each new prefix which satisfies Max_compact and mine it to find local frequent patterns.

ID Sequences

S1 <0> 1(2), <1> 2(1) 1(3), <2> 3(4) 5(2)

S2 <0> 3(6), <1> 5(3) 4(3), <2> 6(4) 5(3)

S3 <0> 1(1), <1> 7(4) 2(3), <2> 5(2) 3(2)

S4 <0> 2(3), <1> 5(2) 1(2), <2> 8(4) 4(1)

8. Merge the local frequent patterns that satisfy R_sup to generate the global frequent patterns.

9. Output the complete set of sequential patterns.
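How the gap, compactness, and quantity constraints can restrict support counting is sketched below on the four example sequences S1 to S4. The exact semantics of the constraints are not spelled out in the steps above, so the sketch makes the following assumptions: `<t>` is a timestamp and `item(quantity)` an item with its quantity; max_gap bounds the timestamp difference between consecutive matched transactions; max_compact bounds the difference between the first and last matched transactions; and min_qty is the minimum quantity each matched item must have. Patterns are restricted to one item per element, and all function names are hypothetical.

```python
import re

ROWS = {
    "S1": "<0> 1(2), <1> 2(1) 1(3), <2> 3(4) 5(2)",
    "S2": "<0> 3(6), <1> 5(3) 4(3), <2> 6(4) 5(3)",
    "S3": "<0> 1(1), <1> 7(4) 2(3), <2> 5(2) 3(2)",
    "S4": "<0> 2(3), <1> 5(2) 1(2), <2> 8(4) 4(1)",
}


def parse(row):
    """Turn one row into [(timestamp, {item: quantity}), ...]."""
    sequence = []
    for t, body in re.findall(r"<(\d+)>([^<]*)", row):
        items = {int(i): int(q) for i, q in re.findall(r"(\d+)\((\d+)\)", body)}
        sequence.append((int(t), items))
    return sequence


def occurs(pattern, sequence, max_gap, max_compact, min_qty):
    """Check whether `pattern` (a list of items) occurs under the constraints."""
    def extend(k, prev_t, first_t):
        if k == len(pattern):
            return True
        for t, items in sequence:
            if prev_t is not None and not (prev_t < t <= prev_t + max_gap):
                continue  # violates ordering or the gap constraint
            if first_t is not None and t - first_t > max_compact:
                continue  # violates the compactness constraint
            if items.get(pattern[k], 0) >= min_qty:
                # Backtracking is needed: an earlier match may break the gap
                # constraint while a later one satisfies it.
                if extend(k + 1, t, t if first_t is None else first_t):
                    return True
        return False
    return extend(0, None, None)


def constrained_support(pattern, db, max_gap, max_compact, min_qty):
    return sum(occurs(pattern, s, max_gap, max_compact, min_qty) for s in db)


DB = [parse(row) for row in ROWS.values()]
print(constrained_support([1, 5], DB, max_gap=2, max_compact=2, min_qty=1))  # 2
print(constrained_support([1, 5], DB, max_gap=1, max_compact=2, min_qty=1))  # 1
print(constrained_support([1, 5], DB, max_gap=2, max_compact=2, min_qty=2))  # 1
```

Tightening any of the three parameters can only lower the support of a pattern, which is why the constrained algorithm reports fewer patterns than the unconstrained one.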

For generating Emerging Patterns

1. Supply the new sequence database file, which contains the updated data.

2. Generate global frequent patterns from the new sequence database file.

3. Compare the support counts of the frequent patterns in both files; the patterns whose support count changes drastically are the emerging patterns.
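The comparison in step 3 can be sketched as follows. The text does not define "drastic change" numerically, so this sketch assumes the common growth-rate formulation: a pattern is emerging when the ratio of its relative support in the new database to that in the old database reaches a threshold. The function name, the threshold `min_growth`, and the support values are hypothetical.

```python
def emerging_patterns(old_support, new_support, min_growth=2.0):
    """Return {pattern: growth rate} for patterns whose support grew sharply.

    Both arguments map a pattern (a tuple of items) to its relative support,
    i.e. the fraction of sequences that contain it.
    """
    emerging = {}
    for pattern, new in new_support.items():
        old = old_support.get(pattern, 0.0)
        # A pattern that was absent before has an unbounded growth rate.
        growth = float("inf") if old == 0 else new / old
        if growth >= min_growth:
            emerging[pattern] = growth
    return emerging


# Hypothetical supports mined from the old and the updated database file.
old = {("a",): 0.5, ("a", "b"): 0.125, ("c",): 0.5}
new = {("a",): 0.5, ("a", "b"): 0.5, ("d",): 0.25}

for pattern, growth in emerging_patterns(old, new).items():
    print(pattern, growth)
```

Calling the function with the two arguments swapped would, under the same assumption, list the patterns whose support dropped sharply.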

IV. IMPLEMENTATION DETAILS

The sequential database consists of a set of sequences that include timestamps and quantities. Here we use two sequential database files (*.txt), as we have to compare the databases to discover the emerging patterns.

We use synthetic datasets generated with the SPMF dataset generator.

Parameters Description

|C| Number of Customers (in 1000’s)

|S| Average number of transactions per sequence

|T| Average number of items per transaction

|N| Number of distinct items (in 1000’s)

Table 4. Parameters of Synthetic Dataset

We use the following datasets, which contain timestamps and quantities along with the sequences.

Name |C| |S| |T| |N|

C1S8T2.5N1 1 8 2.5 1

C10S4T2.5N10 10 4 2.5 10

C10S6T2.5N10 10 6 2.5 10

C10S8T2.5N10 10 8 2.5 10

C40S4T2.5N10 40 4 2.5 10

C4S4T2.5N1 4 4 2.5 1

Table 5. Parameter setting.
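The dataset names in Table 5 encode the parameters of Table 4. A small helper can decode them, assuming the name is always the concatenation of C, S, T, and N with their values and that |C| and |N| are given in thousands; the function name and the dictionary keys are hypothetical.

```python
import re


def parse_dataset_name(name):
    """Decode a name such as 'C10S4T2.5N10' into its generator parameters."""
    match = re.fullmatch(r"C([\d.]+)S([\d.]+)T([\d.]+)N([\d.]+)", name)
    if match is None:
        raise ValueError(f"unexpected dataset name: {name!r}")
    c, s, t, n = map(float, match.groups())
    return {
        "customers": int(c * 1000),            # |C| is given in thousands
        "avg_transactions_per_sequence": s,    # |S|
        "avg_items_per_transaction": t,        # |T|
        "distinct_items": int(n * 1000),       # |N| is given in thousands
    }


print(parse_dataset_name("C10S4T2.5N10"))
```

For example, C10S4T2.5N10 describes 10,000 customers with on average 4 transactions per sequence and 2.5 items per transaction, drawn from 10,000 distinct items.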

4.1 Platform

We have performed a simulation study to compare the Constrained PrefixSpan algorithm with the traditional sequential pattern mining algorithm PrefixSpan. Both algorithms are tested on an Intel Core i7 processor running Windows 8 with 8 gigabytes of main memory.

4.2 Language and database tool

The whole algorithm is implemented in Java using the NetBeans IDE.

V. RESULTS

A number of experiments were carried out to evaluate the performance of the Constrained PrefixSpan algorithm:

First, the traditional PrefixSpan algorithm and the Constrained PrefixSpan algorithm are compared on the 1st dataset on the basis of the running time and the number of patterns generated. The results show that in both cases Constrained PrefixSpan outperforms the traditional one.

Fig. 1 Runtime vs Support

Fig. 2 Number of Patterns vs Support

Second, the traditional PrefixSpan algorithm and the Constrained PrefixSpan algorithm are compared on the 2nd dataset on the basis of the memory consumed while the algorithm runs.

Fig. 3 Memory Usage vs Support

Third, the comparison is done on the basis of quantity on the 2nd dataset. Here, as the quantity increases, the number of patterns decreases.

Fig. 4 Number of Patterns vs Quantity

Next, the comparison is done by varying the gap from 0 to 4 for the 3rd and 4th datasets. The comparison below shows that the number of patterns varies according to the gap.

Fig. 5 Number of pattern vs Gap

Lastly, we turn to our goal, which is finding emerging patterns. The next figure shows the emerging patterns discovered using the 5th and 6th datasets.

Fig. 6 Emerging Patterns

VI. CONCLUSION

Most existing sequential pattern mining methods focus on the concept of frequency in the sequence database and do not consider other factors such as recency, gap, compactness, and quantity. To apply these factors in the sequential pattern mining process, we first modify the traditional sequential pattern mining algorithm PrefixSpan to apply gap, compactness, quantity, and recency constraints. Detailed experiments are presented. Seven synthetic datasets are used in our performance analysis, and the results are also presented. The results show that the constrained PrefixSpan algorithm performs well in all kinds of situations. In future, the constrained PrefixSpan algorithm can be implemented on a multiprocessor system. We may also consider adding other useful constraints; for example, the time constraint can be made a fuzzy time constraint, or a constraint such as Profitable Customer can be added, so that more valuable patterns can be discovered.

ACKNOWLEDGMENT

First of all, I would like to thank the Almighty God for the blessings and strength given to me to complete this task. Next, I thank my guide, Prof. Ms. Chetna Chand, for her support, guidance, patience, and constant encouragement, without which this work would not have been possible. Last but not least, my big thanks go to my entire family for their endless love and support, because of which I have accomplished this task.

References

1. Rakesh Agrawal and Ramakrishnan Srikant, “Mining Sequential Patterns”, 11th Int. Conf. on Data Engineering, IEEE Computer Society Press, Taiwan, pp. 3-14, 1995.

2. R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements”, Proceedings of the 5th International Conference on Extending Database Technology, 1057, 3-17, 1996.

3. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, “Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach”, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 10, October 2004.

4. M. Garofalakis, R. Rastogi, and K. Shim, “SPIRIT: Sequential pattern mining with regular expression constraints”, VLDB’99, 1999.

5. Han J., Dong G., Mortazavi-Asl B., Chen Q., Dayal U., Hsu M.-C., “FreeSpan: Frequent pattern-projected sequential pattern mining”, Proceedings 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pp. 355-359, 2000.

6. M. Zaki, “SPADE: An efficient algorithm for mining frequent sequences”, Machine Learning, 2001.

7. Han, J., Pei, J., Mortazavi-Asl, B. and Zhu, H., “Mining access patterns efficiently from web logs”, In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’00), Kyoto, Japan, 2000.

8. Jian Pei, Jiawei Han, Wei Wang, “Constraint-based sequential pattern mining: the pattern growth methods”, J Intell Inf Syst, Vol. 28, No. 2, pp. 133-160, 2007.

9. Dhany Saputra, Dayang Rohaya Awang Rambali and Foong Oi Mean, “Sequential Pattern Mining using PrefixSpan with Pseudoprojection and Separator Database”, 978-1-4244-2328-6/08 IEEE, 2008

10. Bhawna Mallick, Deepak Garg, and Preetam Singh Grover, “Constraint-Based Sequential Pattern Mining: A Pattern Growth Algorithm Incorporating Compactness, Length and Monetary”.

11. Jiawei Han & Micheline Kamber, “Data Mining: Concepts and Techniques” 2nd ed., Morgan Kaufmann Publishers, 2006.

12. Kotagiri Ramamohanarao and James Bailey, “Discovery of Emerging Patterns and Their Use in Classification”.

13. Jyoti Mehta and Rajni Mehta, “Prefix Projection: A Technique for Mining Sequential Pattern Included Length and Aggregate”, Vol.7 No.11 (2012), International Journal of Applied Engineering Research.

14. Anita Zala, Mehul Barot, “Mining Sequential Pattern with Time-Constraint”, International Journal of Engineering and Innovative Technology (IJEIT) Volume 2, Issue 7, January 2013.

15. C. K. Bhensdadia & Y. P. Kosta, “Discovering Active and Profitable Patterns with RFM (Recency, Frequency and Monetary) Sequential Pattern Mining: A Constraint Based Approach”, International Journal of Information Technology and Knowledge Management, January-June 2011, Volume 4, No. 1, pp. 27-32.

16. Dhany Saputra, Dayang R. A. Rambli, Oi Mean Foong, “Mining Sequential Patterns Using I-PrefixSpan”, World Academy of Science, Engineering and Technology, Vol:1 2007-11-20.

17. K. Suneetha & Dr.M.Usha Rani, “Web Page Recommendation Approach Using Weighted Sequential Patterns and Markov Model”, Global Journal of Computer Science and Technology, Volume 12 Issue 9 Version 1.0 April 2012.