Association Rules Mining with Multiple Constraints


Procedia Engineering 00 (2011) 000–000 www.elsevier.com/locate/procedia

* Corresponding author. Tel: (0)13977177085 E-mail address: [email protected].

Advanced in Control Engineering and Information Science


Li Guang-yuan^{a,b,*}, Cao Dan-yang^{a}, Guo Jian-wei^{a}

^{a} School of Computer and Communication Engineering, University of Science & Technology Beijing, Beijing, China
^{b} School of Computer and Information Engineering, Guangxi Teachers Education University, Nanning, China

Abstract

Association rules mining (ARM) is an important task in the field of data mining, and mining frequent itemsets is a key step of many ARM algorithms. In a very large dataset, the number of rules generated may be very large, but some of them are useless to the users. To improve the effectiveness and efficiency of mining tasks, constraint-based mining enables users to concentrate on the association rules they are interested in instead of the complete set of association rules. Most previously proposed methods deal with a single constraint. In this paper, we present an algorithm for mining association rules with multiple constraints. The proposed algorithm copes with two different kinds of constraints simultaneously and consists of three phases. First, the frequent 1-itemsets are generated. Second, we exploit the properties of the given constraints to prune the search space or save constraint checking in the conditional databases. Third, for each itemset that may satisfy the constraints, we generate its conditional database and perform the three phases in the conditional database recursively. Experimental results show that the proposed method outperforms the revised FP-growth algorithm.

Selection and/or peer-review under responsibility of [CEIS 2011]
Open access under CC BY-NC-ND license.

Keywords: Data mining; Association rules mining; Constraints-based mining.

1. Introduction

Association rules mining is an important task in the field of data mining, and frequent itemset mining is a key step of many algorithms for association rules mining. A lot of work has been done on mining association rules. When the dataset is large, the number of rules generated may be very large, but some of them are not interesting to the users, so it is common to set parameters that reduce the number of rules generated. Support and confidence are two common parameters, but using only support and confidence has some drawbacks [1]: first, a lack of user exploration and control; second, a lack of focus; third, a rigid notion of relationship. To improve the effectiveness and efficiency of mining tasks, constraint-based mining enables users to concentrate on the association rules they are interested in instead of the complete set of association rules.

According to the properties of the constraints, there are four kinds of constraints: monotone constraints, anti-monotone constraints, succinct constraints, and convertible constraints. Discovering all frequent itemsets that satisfy the constraints is a difficult problem, and the difficulty stems from two facts [2]. First, testing for minimum support and maximum support cannot be done simultaneously, since, when valid, one is always true for subsets while the other is always true for supersets. Second, despite their selective power, some constraints cannot be checked to filter candidate itemsets until a very late stage of the mining process, depending on the type of constraint and the search space traversal strategy used.

Several efficient algorithms have been proposed to deal with this problem [2-7], but most of them cope with only one constraint. In this paper, we present an algorithm to mine association rules with multiple constraints; it copes with two different kinds of constraints simultaneously. The proposed method consists of three phases. First, the frequent 1-itemsets are generated. Second, we exploit the properties of the given constraints to prune the search space or save constraint checking in the conditional databases. Third, for each itemset that may satisfy the constraints, we generate its conditional database and perform the three phases in the conditional database recursively. Experimental results show that the proposed method outperforms the revised FP-growth algorithm.

2. Problem Definition

Let $Items = \{x_1, x_2, \ldots, x_n\}$ be a set of distinct items. An itemset $X$ is a non-empty subset of $Items$. If $X$ has $k$ items, then $X$ is called a $k$-itemset. A transaction is a couple $\langle ID_t, X \rangle$, where $ID_t$ is the transaction identifier and $X$ is the content of the transaction. A transaction database $DB$ is a set of transactions. An itemset $X$ is contained in a transaction $\langle ID_t, Y \rangle$ if $X \subseteq Y$. The support of an itemset $X$, written $Support(X)$, is the number of transactions in $DB$ that contain $X$. Given a user-defined minimum support $\delta$, an itemset $X$ is called frequent in $DB$ if $Support(X) \geq \delta$.
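As a small illustration of these definitions, the following Python sketch counts support over the five transactions of the example database given in Table 1 of Section 3. The function names `support` and `is_frequent` and the threshold value 2 are assumptions made for this example only; the paper does not fix a value of δ for the example.

```python
# Transactions of the example database (Table 1): TID -> content.
DB = {
    1: {"A", "B", "E"},
    2: {"B", "C"},
    3: {"A", "D", "E"},
    4: {"A", "B", "C", "E"},
    5: {"B", "D"},
}


def support(itemset, db):
    """Number of transactions in db whose content contains every item of itemset."""
    items = set(itemset)
    return sum(1 for content in db.values() if items <= content)


def is_frequent(itemset, db, delta):
    """True if Support(itemset) >= delta (delta is an absolute count here)."""
    return support(itemset, db) >= delta


if __name__ == "__main__":
    print(support({"A", "E"}, DB))         # 3 (transactions 1, 3 and 4)
    print(is_frequent({"B", "C"}, DB, 2))  # True (transactions 2 and 4)
    print(is_frequent({"A", "D"}, DB, 2))  # False (only transaction 3)
```

Support is counted here as an absolute number of transactions, which matches the definition above; a relative threshold would divide by the size of DB.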

A constraint $C$ is a predicate on the powerset of the set of items $I$, i.e., $C: 2^I \rightarrow \{true, false\}$. An itemset $X$ satisfies a constraint $C$ if and only if $C(X)$ is true. The complete set of itemsets satisfying a constraint $C$ is

$$SAT_C(I) = \{X \mid X \subseteq I \wedge C(X) = true\}.$$

Definition 1. Given an itemset $X$, a constraint $C$ is anti-monotone if $\forall Y \subseteq X: C(X) = true \Rightarrow C(Y) = true$.

Definition 2. Given an itemset $X$, a constraint $C$ is monotone if $\forall Y \subseteq X: C(Y) = true \Rightarrow C(X) = true$.

Definition 3. A constraint $C$ is convertible anti-monotone provided there is an order on items such that whenever an itemset $X$ satisfies $C$, so does any prefix of $X$. A constraint $\Phi$ is convertible monotone provided there is an order on items such that whenever an itemset $X$ violates $\Phi$, so does any prefix of $X$.

In this paper, the proposed algorithm deals with two constraints, one anti-monotone and one monotone. We use the FP-Growth algorithm as the basic approach to mine frequent itemsets, since it is more efficient than many other algorithms such as Apriori-like algorithms. Given a $DB$ and two constraints $C_1$ and $C_2$, where $C_1$ is an anti-monotone constraint and $C_2$ is a monotone constraint, our goal is to generate all the itemsets in

$$SAT_{C_1 \wedge C_2}(I) = \{X \mid X \subseteq I \wedge C_1(X) \wedge C_2(X) = true\}.$$

3. The Proposed Method

To illustrate how the proposed algorithm works, we use the example data set shown in Table 1 and Table 2 below, together with two constraints: (max(S.cost) ≤ min(S.price)) and (total(S.price) ≥ 100). Obviously, the former constraint is anti-monotone and the latter is monotone. Here S is an itemset and each item in S has two attributes, cost and price; max(S.cost) denotes the maximum cost of all items in S, min(S.price) denotes the minimum price of all items in S, and total(S.price) denotes the total price of all items in S. Before describing the algorithm, we first present some definitions and lemmas:
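The anti-monotone and monotone properties of the two example constraints can be checked by brute force over all non-empty itemsets built from the five items of Table 2. The sketch below is only an illustration; the helper names `c1`, `c2`, `is_anti_monotone` and `is_monotone` are assumptions introduced here and are not part of the proposed algorithm.

```python
from itertools import combinations

# Cost and price of each item, copied from Table 2.
COST = {"A": 30, "B": 35, "C": 50, "D": 25, "E": 20}
PRICE = {"A": 45, "B": 40, "C": 60, "D": 40, "E": 45}


def c1(s):
    """max(S.cost) <= min(S.price)"""
    return max(COST[i] for i in s) <= min(PRICE[i] for i in s)


def c2(s):
    """total(S.price) >= 100"""
    return sum(PRICE[i] for i in s) >= 100


def nonempty_subsets(items):
    """All non-empty subsets of items, as frozensets."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def is_anti_monotone(c, items):
    """C(X) = true implies C(Y) = true for every non-empty Y contained in X."""
    return all(c(y)
               for x in nonempty_subsets(items) if c(x)
               for y in nonempty_subsets(x))


def is_monotone(c, items):
    """C(Y) = true implies C(X) = true for every X containing Y."""
    return all(c(x)
               for x in nonempty_subsets(items)
               for y in nonempty_subsets(x) if c(y))


if __name__ == "__main__":
    print(is_anti_monotone(c1, COST), is_monotone(c1, COST))  # True False
    print(is_monotone(c2, COST), is_anti_monotone(c2, COST))  # True False
    # {A, B, E}: max cost 35 <= min price 40, total price 130 >= 100.
    print(c1({"A", "B", "E"}) and c2({"A", "B", "E"}))        # True
```

The check only confirms the properties on this particular item table; in general the second constraint is monotone as long as prices are non-negative.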

Definition 4 [4]. Given a database $T$ and a project condition $pc$.

1. $pc(s_1, s_2) = true$ if the relationship $pc$ between itemsets $s_1$ and $s_2$ is correct. For example, let $s_1 = ab$, $s_2 = abcd$, and $pc$ = prefix relationship; then $pc(s_1, s_2) = true$ because $s_1$ is the prefix of $s_2$.

2. Itemset $b$ is called the max-$a$ projection of a transaction $\langle tid, I_t \rangle$ w.r.t. $pc$ if and only if <i> $a \subseteq I_t$ and $b \subseteq I_t$; <ii> $pc(a, b) = true$; <iii> there exists no proper superset $c$ of $b$ such that $c \subseteq I_t$ and $pc(a, c) = true$.

3. The $a$-conditional database is the collection of max-$a$ projections of transactions containing $a$ w.r.t. $pc$.

Definition 5 [4]. Let $\alpha$ be a frequent itemset and $\lambda$ be the set of frequent items in $\alpha$'s conditional database. $\alpha \cup \lambda$ forms the potential largest frequent itemset in $\alpha$'s conditional database.
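A minimal sketch of Definitions 4 and 5 follows, assuming that the project condition pc is the prefix relationship and that items are ordered alphabetically; both choices, as well as the function names and the threshold value 2, are assumptions made for this illustration. Under the prefix relationship, the max-a projection of a transaction keeps a itself plus every item of the transaction that follows the last item of a in the chosen order.

```python
from collections import Counter

ORDER = "ABCDE"  # assumed item order for the prefix relationship

# Transactions of the example database (Table 1): TID -> content.
DB = {
    1: {"A", "B", "E"},
    2: {"B", "C"},
    3: {"A", "D", "E"},
    4: {"A", "B", "C", "E"},
    5: {"B", "D"},
}


def max_projection(a, transaction, order=ORDER):
    """Max-a projection of one transaction w.r.t. the prefix relationship.

    Returns None when the transaction does not contain the non-empty itemset a.
    """
    rank = {item: r for r, item in enumerate(order)}
    a = sorted(a, key=rank.__getitem__)
    if not set(a) <= set(transaction):
        return None
    last = rank[a[-1]]
    tail = sorted((i for i in transaction if rank[i] > last),
                  key=rank.__getitem__)
    return tuple(a) + tuple(tail)


def conditional_database(a, db, order=ORDER):
    """The a-conditional database: max-a projections of transactions containing a."""
    projections = (max_projection(a, t, order) for t in db.values())
    return [p for p in projections if p is not None]


def frequent_extensions(a, cond_db, delta):
    """The set of frequent items (lambda in Definition 5) in a's conditional database."""
    counts = Counter(i for p in cond_db for i in p if i not in a)
    return {i for i, n in counts.items() if n >= delta}


if __name__ == "__main__":
    t_a = conditional_database(("A",), DB)
    print(t_a)  # [('A', 'B', 'E'), ('A', 'D', 'E'), ('A', 'B', 'C', 'E')]
    lam = frequent_extensions(("A",), t_a, delta=2)
    print(sorted({"A"} | lam))  # ['A', 'B', 'E'], the potential largest frequent itemset
```

With these assumptions, the potential largest frequent itemset in the conditional database of {A} is {A, B, E}.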

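Lemma 1 [4]. Let $\beta$ be the set of frequent items in $T_\alpha$ and $\gamma$ be a subset of the set of frequent items in $T_\alpha$. If we have confirmed $\Phi(\alpha \cup \beta) = true$ in $T_\alpha$, where $\Phi$ is an anti-monotone constraint, we do not need to check $\Phi$ in $T_{\{a\} \cup \alpha}$ for each $a \in \beta$, because $(\{a\} \cup \alpha \cup \gamma) \subseteq (\alpha \cup \beta)$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly true.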

Lemma 2 [4]. Let $\gamma$ be a subset of the set of frequent items in $T_{\{a\} \cup \alpha}$. If we have confirmed $\Phi(\{a\} \cup \alpha) = false$, where $\Phi$ is an anti-monotone constraint and $a$ is an individual frequent item in $T_\alpha$, we do not need to generate $T_{\{a\} \cup \alpha}$, because $\{a\} \cup \alpha \cup \gamma$ contains $\{a\} \cup \alpha$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly false.

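Lemma 3. Let $\beta$ be the set of frequent items in $T_\alpha$ and $\gamma$ be a subset of the set of frequent items in $T_\alpha$. If we have confirmed $\Phi(\alpha \cup \beta) = false$ in $T_\alpha$, where $\Phi$ is a monotone constraint, we do not need to check $\Phi$ in $T_{\{a\} \cup \alpha}$ for each $a \in \beta$, because $(\{a\} \cup \alpha \cup \gamma) \subseteq (\alpha \cup \beta)$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly false.

Lemma 4. Let $\gamma$ be a subset of the set of frequent items in $T_{\{a\} \cup \alpha}$. If we have confirmed $\Phi(\{a\} \cup \alpha) = true$, where $\Phi$ is a monotone constraint and $a$ is an individual frequent item in $T_\alpha$, we do not need to check $\Phi$ in $T_{\{a\} \cup \alpha}$, because $\{a\} \cup \alpha \cup \gamma$ contains $\{a\} \cup \alpha$ and thus $\Phi(\{a\} \cup \alpha \cup \gamma)$ is certainly true.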
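The way these four lemmas can save work is sketched below with a simple recursive pattern-growth miner in Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it projects plain sets instead of building FP-trees, orders items alphabetically, uses the example data of Tables 1 and 2 with an assumed minimum support of 2, and the names `mine` and `brute_force` are hypothetical. In this sketch, once the monotone constraint holds for a prefix it is not re-checked for any superset of that prefix.

```python
from itertools import combinations

DB = [{"A", "B", "E"}, {"B", "C"}, {"A", "D", "E"}, {"A", "B", "C", "E"}, {"B", "D"}]
COST = {"A": 30, "B": 35, "C": 50, "D": 25, "E": 20}
PRICE = {"A": 45, "B": 40, "C": 60, "D": 40, "E": 45}


def c1(s):
    """Anti-monotone constraint: max(S.cost) <= min(S.price)."""
    return max(COST[i] for i in s) <= min(PRICE[i] for i in s)


def c2(s):
    """Monotone constraint: total(S.price) >= 100."""
    return sum(PRICE[i] for i in s) >= 100


def mine(alpha, cond_db, delta, c1_known=False, c2_known=False, out=None):
    """Collect frequent itemsets extending alpha that satisfy c1 and c2.

    cond_db holds, for each transaction containing alpha, the items that
    come after the last item of alpha. c1_known / c2_known record that the
    constraint is already guaranteed for every itemset in this branch.
    """
    out = [] if out is None else out
    counts = {}
    for t in cond_db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    beta = sorted(i for i, n in counts.items() if n >= delta)
    if not beta:
        return out
    full = alpha | set(beta)
    # Lemma 3: if monotone c2 fails on alpha U beta, it fails on every subset.
    if not c2_known and not c2(full):
        return out
    # Lemma 1: if anti-monotone c1 holds on alpha U beta, it holds on every subset.
    c1_known = c1_known or c1(full)
    for a in beta:
        x = alpha | {a}
        # Lemma 2: if c1 fails on {a} U alpha, skip its conditional database.
        if not c1_known and not c1(x):
            continue
        # Lemma 4: if c2 holds on {a} U alpha, it holds on every superset.
        x_c2 = c2_known or c2(x)
        if x_c2:
            out.append(x)
        sub_db = [{i for i in t if i > a} for t in cond_db if a in t]
        mine(x, sub_db, delta, c1_known, x_c2, out)
    return out


def brute_force(db, delta):
    """Reference answer: test every non-empty itemset directly."""
    items = sorted(set().union(*db))
    result = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if sum(1 for t in db if s <= t) >= delta and c1(s) and c2(s):
                result.add(s)
    return result


if __name__ == "__main__":
    found = mine(frozenset(), DB, delta=2)
    print([sorted(x) for x in found])       # [['A', 'B', 'E']]
    print(set(found) == brute_force(DB, 2))  # True
```

On the example data with the assumed threshold, the only frequent itemset satisfying both constraints is {A, B, E}, and the pruned search agrees with exhaustive enumeration.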

Table 1. An example database T

TID   Items
1     A, B, E
2     B, C
3     A, D, E
4     A, B, C, E
5     B, D

Table 2. Items in database T

Item  Cost  Price
A     30    45
B     35    40
C     50    60
D     25    40
E     20    45

Now, we give a brief description of the algorithm below.

MCAL(α, T_α)

Input: DB, anti-monotone constraint C1, monotone constraint C2, minimum support threshold δ
Output: All frequent itemsets X satisfying C1(X) ∧ C2(X).

1. Collect the set L of frequent items and their supports from the FP-tree header table of T_α
2. β = L; if C2(β ∪ α) = false, then exit: there are no frequent itemsets X that satisfy C1(X) ∧ C2(X)
3. if C2(β ∪ α) = true
     Apply Lemma 3 and Lemma 4 to calculate the number N of items that satisfy C2
4. for each a ∈ β
5.   MCAL(α, T_α)
6.   if (|L| < N)
7.     continue
8.   else
9.     for each χ ∈ L
10.      if C1(χ ∪ a) = true
11.        Generate T_{χ∪a}, if (|L| > N)
12.          Output L
13.    endfor
14. endfor

4. Experimental Results

In order to evaluate the performance of the proposed algorithm, we compare it with FP-growth+ [5]. All the experiments were performed on a Pentium IV 3.2GHz personal computer with 2MB main memory, running Windows XP. The program is written in C++ and compiled with Microsoft Visual C++ 6.0. The data set is generated in a way similar to [8] and is denoted as V25F20T50I1L100, where V25 denotes that the average size of the transactions is 25, F20 denotes that the average size of the maximal potentially frequent itemsets is 20, T50 denotes that the number of transactions is 50K, I1 denotes that the number of items is 1K, and L100 denotes that the number of maximal potentially frequent itemsets is 100. The experimental results are shown in Fig. 1 and Fig. 2.

Fig.1. Scalability with number of transactions (x-axis: number of transactions, 20K to 100K; y-axis: execution time in seconds; series: FP-growth+ and the proposed algorithm)

Fig.2. Scalability with minimum support threshold (data set V25F20T50I1L100; x-axis: support threshold, 0.1% to 1%; y-axis: execution time in seconds; series: FP-growth+ and the proposed algorithm)

Conclusions

In this paper, we present an efficient algorithm for mining association rules with multiple constraints. The proposed method consists of three phases. First, the frequent 1-itemsets are generated. Second, we exploit the properties of the given constraints to prune the search space or save constraint checking in the conditional databases. Third, for each itemset that may satisfy the constraints, we generate its conditional database and perform the three phases in the conditional database recursively. Experimental results show that the proposed method outperforms revised FP-growth algorithms such as FP-growth+. In the future, we plan to investigate mining with multiple constraints on uncertain data.


Acknowledgements

This work was supported by the Science and Technology Project of Beijing, Beijing, China.

References

[1] Raymond T. Ng, Laks V.S. Lakshmanan, Jiawei Han. Exploratory mining and pruning optimizations of constrained associations rules. Proceedings of the ACM SIGMOD International Conference on Management of Data, June 1998, Seattle, Washington, USA.

[2] Mohammad El-Hajj, Osmar R. Zaiane, Paul Nalos. Bifold constraint-based mining by simultaneous monotone and anti-monotone checking. Proceedings of the 5th IEEE International Conference on Data Mining, November 2005, Houston, Texas, USA.

[3] L. Lakshmanan, R. Ng, J. Han, A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In ACM SIGMOD Conference on Management of Data, p. 157–168, 1999.

[4] Anthony J.T. Lee, Wan-chuen Lin, Chun-sheng Wang. Mining association rules with multi-dimensional constraints. The Journal of Systems and Software 2006, (79), p. 79–92.

[5] J. Pei, J. Han, L. Lakshmanan. Mining frequent itemsets with convertible constraints. In IEEE ICDE Conference, p. 433–442, 2001.

[6] C. Bucila, J. Gehrke, D. Kifer, W. White. Dualminer: A dual-pruning algorithm for itemsets with constraints. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, p. 42–51, Edmonton, Alberta, August 2002.

[7] R. M. Ting, J. Bailey, K. Ramamohanarao. Paradualminer: An efficient parallel implementation of the dualminer algorithm. In Eighth Pacific-Asia Conference, PAKDD 2004, p. 96–105, Sydney, Australia, May 2004.

[8] Agrawal, R., Srikant, R. Fast algorithms for mining association rules. Proceedings of International Conference on Very Large Data Bases, p. 487–499.