A New Tree Structure To Extract Frequent Pattern


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 3, Issue 3, March 2013)


Minu Pandya


Department of Computer Science, SBTC, Rajasthan Technical University, Jaipur, India

Guided by Priyanka Trikha

SBTC, Rajasthan Technical University, Jaipur, India

Abstract: Frequent pattern mining is a heavily researched area of data mining with a wide range of applications. Finding frequent patterns (or itemsets) plays an essential role in data mining, so efficient algorithms for discovering them are essential to data mining research. A number of research works have presented new algorithms, or improvements on existing algorithms, to solve this problem efficiently. The Apriori algorithm was the first algorithm proposed in this field. Following successive improvements to Apriori, algorithms were developed that compress a large database into a compact tree data structure, such as the FP-Tree, Can-Tree and CP-Tree. These algorithms are partition-based: they use a divide-and-conquer method that decomposes the mining task into a set of smaller tasks, each mining confined patterns in a conditional database, which dramatically reduces the search space. In this paper I propose a novel tree structure, an extension of the CP-Tree, that extracts all frequent patterns from a transactional database. This tree structure builds a compact prefix-tree structure with one database scan and provides the same mining performance as the FP-growth technique by means of an efficient tree restructuring process. It also supports interactive and incremental mining without rescanning the original database.

Keywords: frequent pattern, association rule, data mining, data stream

I. INTRODUCTION

Data mining is a powerful technology with great potential to help companies focus on the most important information in the data they collect. It discovers information within the data that queries and reports cannot effectively reveal. By performing data mining, interesting knowledge, regularities or high-level information can be extracted from databases and viewed from different angles. The discovered knowledge can be applied to decision making, process control, information management and query processing [13]. Mining frequent itemsets is a foundational step in discovering association rules. The problem of mining association rules, and the more general problem of finding frequent patterns, from large databases has been the subject of numerous studies [8]. These studies can be broadly divided into the following categories. (a) Functionalities: the central question is what kind of rule or pattern to compute. (b) Performance: the central question is how to compute association rules or frequent patterns as efficiently as possible.


II. PREVIOUS WORK

In this section we discuss the FP-Tree, Can-Tree and CP-Tree and the algorithms based on them. We also indicate whether each one supports interactive mining or incremental mining.

A. Frequent Pattern Tree (FP Tree)

The FP-Growth algorithm was introduced in [5]. The algorithm first constructs the tree from the original data set and then grows the frequent patterns from it. FP-Growth generates frequent itemsets while avoiding the huge number of candidates that the Apriori algorithm must generate. It requires two database scans to construct the FP-Tree. It follows a divide-and-conquer strategy: it compresses the database of frequent items into a frequent pattern tree that preserves the itemset association information, then divides this compressed database into a set of conditional databases, each associated with one frequent item, and mines each conditional database separately. This approach dramatically reduces the size of the subsequent conditional pattern bases and conditional FP-Trees. The benefit of the FP-Tree structure is that it provides completeness and compactness [9].

To discover all frequent itemsets, the FP-Growth algorithm examines each level of the tree, starting from the bottom, and generates all possible itemsets that include the nodes at that level. Once the frequent patterns for every level have been mined, they are stored in the complete set of frequent patterns.
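The two-scan construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [5]: the names `Node` and `build_fp_tree`, the sample database and the alphabetical tie-breaking rule are all hypothetical, and the header table and node-link pointers of a full FP-Tree are omitted.

```python
from collections import Counter


class Node:
    """One prefix-tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support):
    # Scan 1: count the support of every item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in support.items() if c >= min_support}
    # Frequency-descending order; ties broken alphabetically (an assumption).
    ordered = sorted(frequent, key=lambda i: (-support[i], i))
    rank = {item: r for r, item in enumerate(ordered)}
    # Scan 2: insert only the frequent items of each transaction, in rank order.
    root = Node()
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in frequent), key=rank.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, rank


# Hypothetical four-transaction database.
db = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
root, rank = build_fp_tree(db, min_support=2)
```

Because the item order depends on supports that are only known after the first scan, the second scan cannot be avoided in this scheme; the single-pass trees discussed next are designed to remove that dependency.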

B. The Canonical-Order Tree (CAN Tree)

The Can-Tree is designed for incremental mining [3]. Its construction requires only one database scan, unlike the FP-Tree, which requires two. In a Can-Tree, items are arranged in some canonical order, which can be determined by the user before the mining process or at run time during it, so the order is unaffected by item frequency, unlike in the FP-Tree. For example, items can be arranged in lexicographic or alphabetic order. The Can-Tree has two nice properties: (i) the ordering of items is unaffected by changes in frequency caused by incremental updates; (ii) the frequency of a node in the Can-Tree is at least as high as the sum of the frequencies of its children. Frequent patterns are mined from the Can-Tree in a similar way to the FP-Tree, using a divide-and-conquer approach, and the tree is traversed in the upward direction only. Since items are consistently ordered in the Can-Tree, insertions, deletions and modifications of transactions have no effect on the ordering of items in the tree. As a result, swapping of tree nodes, which may lead to merging and splitting of tree nodes, is not needed.
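As an illustration of the single-scan, canonical-order insertion and of property (ii), consider the following minimal sketch. It assumes lexicographic order as the canonical order; the names `Node`, `insert_canonical` and `satisfies_count_property` and the sample transactions are hypothetical and do not come from [3].

```python
class Node:
    """One prefix-tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert_canonical(root, transaction):
    """Insert one transaction with its items in lexicographic order."""
    node = root
    for item in sorted(set(transaction)):
        node = node.children.setdefault(item, Node(item))
        node.count += 1


def satisfies_count_property(node):
    """Check property (ii): each node's count >= sum of its children's counts."""
    for child in node.children.values():
        if child.count < sum(g.count for g in child.children.values()):
            return False
        if not satisfies_count_property(child):
            return False
    return True


# One pass over a hypothetical database; every item is kept, frequent or not.
root = Node()
for t in [["c", "a", "b"], ["b", "a"], ["d", "a"]]:
    insert_canonical(root, t)
```

Since the order never depends on the counts, a later insertion or deletion touches only the path of the affected transaction, which is why no node swapping is required.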

C. The Compact Pattern Tree (CP Tree)

Although the Can-Tree offers a simple single-pass construction process, it usually yields poor compaction in tree size compared with the FP-Tree. It is therefore inefficient in storage and runtime, and mining takes longer because the items in the tree are not stored in frequency-descending order [4]. The CP-Tree (Compact Pattern tree) [1] is a compact prefix-tree structure that is constructed with one database scan and provides the same mining performance as the FP-growth technique by means of an efficient tree restructuring process. Construction consists mainly of two phases: the insertion phase, which inserts transaction(s) into the CP-Tree according to item appearance order and updates the frequency count of the respective items in the I-list; and the restructuring phase, which rearranges the I-list in frequency-descending order of items and restructures the tree nodes according to the new I-list. It uses the Branch Restructuring Method (BRM) in the restructuring phase [2]. In this method, the unsorted paths are sorted one after another according to the I-list in frequency-descending order. The CP-Tree is a far more compact data structure than the Can-Tree and, despite the small additional tree restructuring cost, it achieves a remarkable gain in overall runtime.
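The insertion phase and the I-list reordering can be sketched as below. This is a simplified illustration, not the code of [1]: it assumes the I-list is an insertion-ordered dictionary whose key order is the item appearance order, and the names `insert_phase` and `frequency_descending` are hypothetical. The restructuring of the tree nodes themselves is not shown here.

```python
class Node:
    """One prefix-tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert_phase(root, i_list, transaction):
    """Insert one transaction in current I-list order and update the I-list."""
    items = list(dict.fromkeys(transaction))  # drop duplicates, keep order
    for item in items:
        i_list[item] = i_list.get(item, 0) + 1  # new items go to the end
    position = {item: pos for pos, item in enumerate(i_list)}
    node = root
    for item in sorted(items, key=position.get):
        node = node.children.setdefault(item, Node(item))
        node.count += 1


def frequency_descending(i_list):
    """Return the I-list items rearranged in frequency-descending order."""
    # sorted() is stable, so ties keep their appearance order.
    return sorted(i_list, key=lambda item: -i_list[item])


root, i_list = Node(), {}
for t in [["b", "a"], ["c", "a"], ["a", "d"]]:
    insert_phase(root, i_list, t)
new_order = frequency_descending(i_list)
```

In this example item b appears first, so it sits above the more frequent item a after insertion; the restructuring phase exists to repair exactly this kind of ordering.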

III. OVERVIEW OF PROPOSED DATA STRUCTURE

In this paper, we introduce a new tree data structure that is an extension of the CP-Tree. We have seen that the CP-Tree contains all items (frequent and non-frequent) at mining time. Our new tree contains only frequent items at mining time, like the FP-Tree, but still extracts frequent itemsets from the database with only one database scan, like the CP-Tree.

A. Construction of our new tree in detail


In this method, if a path is not sorted according to the new I-list order, it is removed from the tree, its non-frequent items are deleted, the remaining items are sorted according to the new I-list order in a temporary array, and the path is then inserted back into the tree in order. After the tree is constructed, frequent patterns are mined using the FP-Growth algorithm, as with all the other tree structures.

Algorithm (Branch Sorting Method):

Input: T and I

Output: Tsort and Isort with only frequent items.

1. Compute Isort from I in frequency-descending order, keeping only frequent items.

2. For each branch Bi in T

3. For each unprocessed path Pj in Bi

4. If Pj contains only frequent items in order according to Isort

5. Process_Branch(Pj)

6. Else Sort_Path(Pj)

7. Terminate when all the branches contain only frequent items in order according to Isort, and output Tsort and Isort

8. Process_Branch(P) {

9. For each branching node nb in P from the leafp node

10. For each sub-path from nb to leafk with k≠p

11. If item ranks of all nodes between nb and leafk are greater than that of nb

12. P = sub-path from nb to leafk

13. If P contains only frequent items in order according to Isort

14. Process_Branch(P)

15. Else P = path from the root to leafk

16. Sort_Path(P)

17. }

18. Sort_Path(Q) {

19. Reduce the count of all nodes of Q by the value of the leafQ count

20. Delete all nodes that contain non-frequent items and sort the remaining nodes in an array according to Isort order

21. Delete all nodes having count zero from Q

22. Insert sorted Q into T at the location from where it was taken

23. }

Algorithm for the BSM method (for my proposed data structure)
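The effect of the restructuring phase can be illustrated with the simplified sketch below. It is not the in-place Branch Sorting Method listed above: as an assumption made for brevity, it reads every path out of the old tree together with the number of transactions ending on it, drops the non-frequent items, sorts the rest by Isort and inserts the result into a fresh tree. The names `insert_path`, `paths_with_counts` and `restructure`, and the sample data, are hypothetical.

```python
class Node:
    """One prefix-tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def insert_path(root, items, count):
    """Insert a path of items, adding `count` to every node along it."""
    node = root
    for item in items:
        node = node.children.setdefault(item, Node(item))
        node.count += count


def paths_with_counts(node, prefix=()):
    """Yield (path, n) for each node at which n > 0 transactions end."""
    for child in node.children.values():
        path = prefix + (child.item,)
        ending = child.count - sum(g.count for g in child.children.values())
        if ending > 0:
            yield path, ending
        yield from paths_with_counts(child, path)


def restructure(root, i_list, min_support):
    """Rebuild the tree with only frequent items, in Isort order."""
    i_sort = [i for i in sorted(i_list, key=lambda i: -i_list[i])
              if i_list[i] >= min_support]
    rank = {item: r for r, item in enumerate(i_sort)}
    new_root = Node()
    for path, count in paths_with_counts(root):
        kept = sorted((i for i in path if i in rank), key=rank.get)
        insert_path(new_root, kept, count)
    return new_root, i_sort


# Hypothetical tree after the insertion phase (item appearance order b, a, c, d).
old_root = Node()
for p in [["b", "a"], ["b", "a", "c"], ["a", "c"], ["a", "d"]]:
    insert_path(old_root, p, 1)
i_list = {"b": 2, "a": 4, "c": 2, "d": 1}
new_root, i_sort = restructure(old_root, i_list, min_support=2)
```

With a minimum support of 2, item d is removed and the two branches that started with b and a merge under a single a node, which is the compaction the restructuring phase aims for.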

Let us go step by step through my proposed data structure for the database given in Table 1.

Figure 3.1 shows the I-list and the tree after the insertion phase, which is the same as the insertion phase in the CP-Tree. For simplicity, the figures do not show the node traversal pointers in the tree; however, they are maintained in the same fashion as in the FP-Tree.

As shown in the figure, the I-list maintains every item with its frequency, and together with the tree it serves as the input to the restructuring phase.


Figure 1.2: Complete processing of the restructuring phase of my proposed data structure, parts (a), (b), (c) and (d)

B. Comparison with various tree structures

| | FP-Tree | Can-Tree | CP-Tree | My proposed Tree Structure |
|---|---|---|---|---|
| No. of database scans required | Two | One | One | One |
| Item order | Frequency-descending order | Canonical order (i.e. lexicographic order) | Insertion: item appearance order; Restructure: frequency-descending order | Insertion: item appearance order; Restructure: frequency-descending order |
| Which items does it contain? | Only frequent items | All items | All items | Only frequent items |
| Interactive mining | No | Yes | Yes | Yes |
| Incremental mining | No | Yes | Yes | Yes |


C. Interactive mining

Our proposed tree structure also supports interactive mining. In interactive mining, the user-specified minimum support can be changed for the same database. In this case, we can save the tree after all transactions have been inserted and then restructure it according to the user-specified minimum support. So, unlike the FP-Tree, this tree structure does not need to rescan the database for interactive mining. The I-list holds the support of all items, so we can keep only the items that are frequent under each different minimum support without rescanning the database.
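A small sketch of this idea follows. It assumes the saved I-list is a dictionary from item to support; the name `frequent_items` and the sample supports are hypothetical. Only the derivation of the frequent item order is shown, and the saved tree would then be restructured with that order.

```python
def frequent_items(i_list, min_support):
    """Frequency-descending list of the items that meet min_support."""
    # sorted() is stable, so ties keep their I-list order.
    return [item for item in sorted(i_list, key=lambda i: -i_list[i])
            if i_list[item] >= min_support]


# Hypothetical I-list saved after the insertion phase.
saved_i_list = {"b": 2, "a": 4, "c": 2, "d": 1}

# The same saved I-list answers queries for different thresholds;
# the database itself is never read again.
for threshold in (1, 2, 3):
    print(threshold, frequent_items(saved_i_list, threshold))
```

The design point is that the threshold is applied to the saved supports rather than during the database scan, so changing it costs one pass over the I-list plus a restructuring of the saved tree.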

D. Incremental mining

Our proposed tree structure also supports incremental mining [8,9]. In incremental mining, transaction(s) are added to or deleted from the original database. In this case, we can save the tree after all transactions of the original database have been inserted. When new transaction(s) are added or some transaction(s) are deleted, we can easily add or delete those transaction(s) in the saved tree and restructure it without rescanning the original database, unlike the FP-Tree.
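The update of the saved tree can be sketched as follows. This is a minimal illustration under two assumptions: the saved tree is the one produced by the insertion phase (items in I-list appearance order), and any transaction being deleted was inserted earlier. The name `update` and the sample transactions are hypothetical; the restructuring that follows the update is not shown.

```python
class Node:
    """One prefix-tree node: an item, a count, and children keyed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def update(root, i_list, transaction, delta):
    """Add (delta=+1) or delete (delta=-1) one transaction in the saved tree."""
    items = list(dict.fromkeys(transaction))  # drop duplicates, keep order
    for item in items:
        i_list[item] = i_list.get(item, 0) + delta
    position = {item: pos for pos, item in enumerate(i_list)}
    node = root
    for item in sorted(items, key=position.get):
        child = node.children.setdefault(item, Node(item))
        child.count += delta
        if child.count == 0:
            # No transaction passes through this node any more, so the
            # rest of the path below it is empty as well: prune and stop.
            del node.children[item]
            return
        node = child


root, i_list = Node(), {}
for t in [["b", "a"], ["a", "c"], ["a", "d"]]:
    update(root, i_list, t, +1)   # original database, one scan
update(root, i_list, ["a", "c"], -1)  # later deletion
update(root, i_list, ["b", "c"], +1)  # later insertion
```

Items whose support drops to zero keep their position in the I-list here, so the appearance order used for later insertions stays consistent with the paths already in the tree.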

IV. CONCLUSION

a) The new tree data structure saves search space compared with the CP-Tree because it stores only frequent items.

b) It also supports interactive mining like the Can-Tree and CP-Tree: if the user-specified minimum support is changed, it can still extract frequent patterns without rescanning the database.

c) It also supports incremental mining like the Can-Tree and CP-Tree: if transaction(s) are added later or old transaction(s) are deleted, it can still extract frequent patterns without rescanning the original database.

REFERENCES

[1] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, and Young-Koo Lee, “CP-Tree: A Tree Structure for Single-Pass Frequent Pattern Mining.” In: Springer-Verlag Berlin Heidelberg 2008, pp. 1-6.

[2] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, Young-Koo Lee, “Efficient single-pass frequent pattern mining using a prefix-tree.” In: Information Sciences 179, 2008, pp. 559–583.

[3] Leung, C.K., Khan, Q.I., Li, Z., Hoque, T., “CanTree: A Canonical-Order Tree for Incremental Frequent-Pattern Mining.” In: Knowledge and Information Systems 11(3), 2007, pp. 287-311.

[4] Carson Kai-Sang Leung, Quamrul I. Khan, Zhan Li, Tariqul Hoque, “CanTree: a canonical-order tree for incremental frequent-pattern mining.” In: Springer-Verlag London Limited 2006.

[5] Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao, “Mining Frequent Patterns Without Candidate Generation: A Frequent-Pattern Tree Approach.” In: Kluwer Academic Publishers 2004, pp. 53-87.

[6] Han, Jiawei; Pei, Jian; Yin, Yiwen, “Mining Frequent Patterns without Candidate Generation.” In: ACM SIGMOD Intl. Conference on Management of Data, ACM Press 1999, pp. 1-12.

[7] Agrawal, Rakesh; Srikant, Ramakrishnan, “Fast Algorithms for Mining Association Rules.” In: Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994.

[8] Koh J-L, Shieh S-F, “An efficient approach for maintaining association rules based on adjusting FP-tree structures.” In: Springer-Verlag, Berlin Heidelberg New York, 2004, pp. 417–424.

[9] Cheung, W., Zaïane, O.R., “Incremental Mining of Frequent Patterns without Candidate Generation or Support Constraint.” In: Seventh International Database Engineering and Applications Symposium (IDEAS) 2003.

[10] Bharat Gupta, Dr. Deepak Garg, “FP-Tree Based Algorithms Analysis: FPGrowth, COFI-Tree and CT-PRO.” In: International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 3 No. 7, July 2011, pp. 1-9.

[11] W. Cheung and O.R. Zaïane, “Incremental mining of frequent patterns without candidate generation or support constraint.” In: Proc. IDEAS 2003, pp. 111–116.

[12] J.-L. Koh and S.-F. Shieh, “An efficient approach for maintaining association rules based on adjusting FP-tree structures.” In: Proc. DASFAA 2004, pp. 417–424.

[13] R. Dass, A. Mahanti, “An Efficient Algorithm for Real-Time Frequent Pattern Mining for Real-Time Business Intelligence Analytics”, Proceedings of the 39th Hawaii International