
Frequent Itemsets, Association Rules, Evaluation


Academic year: 2020


(1)

DATA MINING
LECTURE 4

Frequent Itemsets, Association Rules, Evaluation

Alternative Algorithms

(2)

RECAP

(3)

Mining Frequent Itemsets

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

(4)

The itemset lattice

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given d items, there are $2^d$ possible itemsets

Too expensive to test all!

(5)

The Apriori Principle

• Apriori principle (main observation):

– If an itemset is frequent, then all of its subsets must also be frequent

– If an itemset is not frequent, then none of its supersets can be frequent

$$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$$

– The support of an itemset never exceeds the support of its subsets

– This is known as the anti-monotone property of support
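The anti-monotone property can be checked directly on the five market-basket transactions from the recap. The sketch below is only illustrative; the helper names `support_count` and `check_anti_monotone` are assumptions, not part of the lecture.

```python
from itertools import combinations

# Market-basket transactions from the recap slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def check_anti_monotone(itemset, transactions):
    """True if every proper non-empty subset X of `itemset` has s(X) >= s(itemset)."""
    s_y = support_count(itemset, transactions)
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            if support_count(set(subset), transactions) < s_y:
                return False
    return True

y = {"Milk", "Diaper", "Beer"}
print(support_count(y, transactions), check_anti_monotone(y, transactions))
```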

(6)

Illustration of the Apriori principle

[Figure: itemset lattice in which one itemset is found to be frequent, so all of its subsets are frequent as well]

(7)

Illustration of the Apriori principle

[Figure: itemset lattice over items A to E; one itemset is found to be infrequent, so all of its supersets are infrequent and are pruned]

(8)

The Apriori algorithm

Level-wise approach

C_k = candidate itemsets of size k
L_k = frequent itemsets of size k

1. k = 1, C_1 = all items
2. While C_k not empty
3.    Scan the database to find which itemsets in C_k are frequent and put them into L_k (frequent itemset generation)
4.    Use L_k to generate a collection of candidate itemsets C_{k+1} of size k+1 (candidate generation)
5.    k = k+1

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Databases, 1994.
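The level-wise loop can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the implementation from the cited paper: candidates are counted by a plain scan rather than a hash structure, candidate generation takes unions of any two frequent k-itemsets, and the function name `apriori` and the use of absolute counts for `minsup` are choices made here.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support count} for all itemsets with count >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}  # C_1
    frequent = {}
    k = 1
    while candidates:
        # Scan the database and count each candidate in C_k
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n for c, n in counts.items() if n >= minsup}  # L_k
        frequent.update(level)
        # C_{k+1}: unions of two frequent k-itemsets whose k-subsets are all frequent
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
result = apriori(transactions, minsup=3)
```

With `minsup=3` on the recap data, the only 3-itemset that survives candidate pruning is {Bread, Milk, Diaper}, and it is then rejected by the database scan.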

(9)

Candidate Generation

• Basic principle (Apriori):

• An itemset of size k+1 is a candidate to be frequent only if all of its subsets of size k are known to be frequent

• Main idea:

• Construct a candidate of size k+1 by combining two frequent itemsets of size k

• Prune the generated (k+1)-itemsets that do not have all of their k-subsets frequent
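One common way to realize the join and prune steps is to keep itemsets as sorted tuples and merge two k-itemsets only when they agree on their first k-1 items. The sketch below assumes that representation; the function name `generate_candidates` is hypothetical.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Build (k+1)-candidates from frequent k-itemsets given as sorted tuples."""
    frequent = set(frequent_k)
    ordered = sorted(frequent)
    candidates = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Join step: merge two itemsets that agree on their first k-1 items
            if a[:-1] != b[:-1]:
                break  # sorted order: later itemsets cannot share this prefix
            candidate = a + (b[-1],)
            # Prune step: every k-subset must itself be frequent
            if all(s in frequent for s in combinations(candidate, len(a))):
                candidates.append(candidate)
    return candidates

l2 = [("Beer", "Diaper"), ("Bread", "Diaper"), ("Bread", "Milk"), ("Diaper", "Milk")]
c3 = generate_candidates(l2)
```

Joining on a shared prefix generates each candidate once, which avoids the duplicate unions that an all-pairs merge would produce.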

(10)

Computing Frequent Itemsets

• Given the set of candidate itemsets C_k, we need to compute the support and find the frequent itemsets L_k.

• Scan the data, and use a hash structure to keep a counter for each candidate itemset that appears in the data

Transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

[Figure: the transactions are scanned against the candidate set C_k]

(11)

A simple hash structure

• Create a dictionary (hash table) that stores the candidate itemsets as keys, and the number of appearances as the value.

• Initialize with zero

• Increment the counter for each itemset that you see in the data

(12)

Example

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

The hash table stores the counts of the candidate itemsets as they have been computed so far:

Key      Value
{3 6 7}  0
{3 4 5}  1
{1 3 6}  3
{1 4 5}  5
{2 3 4}  2
{1 5 9}  1
{3 6 8}  0
{4 5 7}  2
{6 8 9}  0
{5 6 7}  3
{1 2 4}  8
{3 5 7}  1
{1 2 5}  0
{3 5 6}  1
{4 5 8}  0

(13)

Example

Tuple {1,2,3,5,6} generates the following itemsets of length 3:

{1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}

Increment the counters for the itemsets in the dictionary

Key      Value
{3 6 7}  0
{3 4 5}  1
{1 3 6}  3
{1 4 5}  5
{2 3 4}  2
{1 5 9}  1
{3 6 8}  0
{4 5 7}  2
{6 8 9}  0
{5 6 7}  3
{1 2 4}  8
{3 5 7}  1
{1 2 5}  0
{3 5 6}  1
{4 5 8}  0

(14)

Example

Tuple {1,2,3,5,6} generates the following itemsets of length 3:

{1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}

Increment the counters for the itemsets in the dictionary. The counters of {1 3 6}, {1 2 5} and {3 5 6} each increase by one:

Key      Value
{3 6 7}  0
{3 4 5}  1
{1 3 6}  4
{1 4 5}  5
{2 3 4}  2
{1 5 9}  1
{3 6 8}  0
{4 5 7}  2
{6 8 9}  0
{5 6 7}  3
{1 2 4}  8
{3 5 7}  1
{1 2 5}  1
{3 5 6}  2
{4 5 8}  0
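The counting step of this example can be reproduced with a Python dictionary keyed by sorted tuples. This is a sketch of the idea only; the function name `count_transaction` is an assumption.

```python
from itertools import combinations

# Counts computed so far for the 15 candidate 3-itemsets (values from the example table)
counts = {
    (3, 6, 7): 0, (3, 4, 5): 1, (1, 3, 6): 3, (1, 4, 5): 5, (2, 3, 4): 2,
    (1, 5, 9): 1, (3, 6, 8): 0, (4, 5, 7): 2, (6, 8, 9): 0, (5, 6, 7): 3,
    (1, 2, 4): 8, (3, 5, 7): 1, (1, 2, 5): 0, (3, 5, 6): 1, (4, 5, 8): 0,
}

def count_transaction(counts, transaction, k):
    """Increment the counter of every candidate k-itemset contained in `transaction`."""
    for subset in combinations(sorted(transaction), k):
        if subset in counts:  # dictionary lookup by hashing the tuple
            counts[subset] += 1

count_transaction(counts, {1, 2, 3, 5, 6}, 3)
print(counts[(1, 3, 6)], counts[(1, 2, 5)], counts[(3, 5, 6)])  # prints: 4 1 2
```

Enumerating the k-subsets of each transaction is cheap when transactions are short; for long transactions it may be preferable to iterate over the candidates instead.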

(15)

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Association Rule

– An implication expression of the form X → Y, where X and Y are itemsets

– Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics

– Support (s): fraction of transactions that contain both X and Y = the probability P(X,Y) that X and Y occur together

– Confidence (c): how often Y appears in transactions that contain X = the conditional probability P(Y|X) that Y occurs given that X has occurred.

Example: {Milk, Diaper} ⇒ {Beer}

$$s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$$

$$c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67$$

• Problem Definition

– Input: A set of transactions T, over a set of items I, minsup, minconf values

– Output: All rules with items in I having s ≥ minsup and c ≥ minconf
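The two metrics for {Milk, Diaper} → {Beer} can be recomputed from the transaction table. The function names below (`sigma`, `support`, `confidence`) are illustrative choices for this sketch.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    """s(X -> Y) = sigma(X ∪ Y) / |T|"""
    return sigma(x | y) / len(transactions)

def confidence(x, y):
    """c(X -> Y) = sigma(X ∪ Y) / sigma(X)"""
    return sigma(x | y) / sigma(x)

x, y = {"Milk", "Diaper"}, {"Beer"}
s, c = support(x, y), confidence(x, y)
print(round(s, 2), round(c, 2))  # prints: 0.4 0.67
```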

(16)

Mining Association Rules

• Two-step approach:

1. Frequent Itemset Generation

Generate all itemsets whose support ≥ minsup

2. Rule Generation

Generate high confidence rules from each frequent itemset, where each rule is a partitioning of a frequent itemset into Left-Hand-Side (LHS) and Right-Hand-Side (RHS)

Frequent itemset: {A,B,C,D}

Rule: AB→CD

(17)

Association Rule anti-monotonicity

• Confidence is anti-monotone w.r.t. number of items on the RHS of the rule (or monotone with respect to the LHS of the rule)

• e.g., L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
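The inequality follows from the anti-monotone property of support. All three rules are built from the same itemset L = {A,B,C,D}, so they share the numerator s(ABCD), and moving items from the LHS to the RHS can only increase the denominator:

$$c(ABC \rightarrow D) = \frac{s(ABCD)}{s(ABC)}, \qquad c(AB \rightarrow CD) = \frac{s(ABCD)}{s(AB)}, \qquad c(A \rightarrow BCD) = \frac{s(ABCD)}{s(A)}$$

$$s(A) \geq s(AB) \geq s(ABC) \;\Rightarrow\; c(ABC \rightarrow D) \geq c(AB \rightarrow CD) \geq c(A \rightarrow BCD)$$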

(18)

Rule Generation for Apriori Algorithm

• Candidate rule is generated by merging two rules that share the same prefix in the RHS

• join(CD→AB, BD→AC) would produce the candidate rule D → ABC

• Prune rule D → ABC if its subset AD→BC does not have high confidence

• Essentially we are doing Apriori on the RHS
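A level-wise search over the RHS can be sketched as below. It is a simplified variant, not the exact procedure on the slide: it merges any two surviving consequents instead of joining on a shared prefix, and the names `sigma` and `rules_from_itemset` are assumptions. It uses the market-basket transactions from the earlier slides.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, minconf):
    """Rules LHS -> RHS with LHS ∪ RHS = itemset and confidence >= minconf."""
    itemset = frozenset(itemset)
    total = sigma(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]  # start with 1-item RHS
    while consequents:
        kept = []
        for rhs in consequents:
            lhs = itemset - rhs
            if not lhs:
                continue  # the LHS must not be empty
            conf = total / sigma(lhs)
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.append(rhs)
        if not kept:
            break
        # Grow the RHS by one item; keep a larger RHS only if all of its
        # sub-consequents produced high-confidence rules (Apriori on the RHS)
        size = len(kept[0]) + 1
        kept_set = set(kept)
        merged = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        consequents = [m for m in merged
                       if all(frozenset(s) in kept_set for s in combinations(m, size - 1))]
    return rules

rules = rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6)
pairs = {(lhs, rhs) for lhs, rhs, _ in rules}
```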

(19)

RESULT

POST-PROCESSING

(20)

Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support as their supersets

TID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10

1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

3 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

6 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

7 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

8 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

9 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

10 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1

$$\text{Number of frequent itemsets} = 3 \times \sum_{k=1}^{10} \binom{10}{k}$$

Need a compact representation
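To get a feel for the size of this number, the sum can be evaluated directly. The snippet assumes Python 3.8 or later for `math.comb`.

```python
from math import comb

# Three disjoint groups of 10 items; every non-empty subset of a group is frequent
number_of_frequent_itemsets = 3 * sum(comb(10, k) for k in range(1, 11))
print(number_of_frequent_itemsets)  # prints: 3069
```

Each group contributes $2^{10} - 1 = 1023$ non-empty subsets, all with the same support as the full 10-item group.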

(21)

Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent

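[Figure: itemset lattice with a border separating the frequent itemsets from the infrequent itemsets; the maximal itemsets lie on the frequent side of the border]

Maximal itemsets = positive border

Maximal: no superset has this property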

(22)

Negative Border

Itemsets that are not frequent, but all their immediate subsets are frequent.

[Figure: itemset lattice with the border between frequent and infrequent itemsets; the negative border lies on the infrequent side]

Minimal: no subset has this property

(23)

Border

• Border = Positive Border + Negative Border

• Itemsets such that all their immediate subsets are frequent and all their immediate supersets are infrequent.

• Either the positive or the negative border is sufficient to summarize all frequent itemsets.

(24)

Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset    Support
{A}        4
{B}        5
{C}        3
{D}        4
{A,B}      4
{A,C}      2
{A,D}      3
{B,C}      3
{B,D}      4
{C,D}      3
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2

(25)

Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

Itemset lattice, each itemset annotated with the IDs of the transactions that contain it:

null
A:124  B:123  C:1234  D:245  E:345
AB:12  AC:124  AD:24  AE:4  BC:123  BD:2  BE:3  CD:24  CE:34  DE:45
ABC:12  ABD:2  ABE:none  ACD:24  ACE:4  ADE:4  BCD:2  BCE:3  BDE:none  CDE:4
ABCD:2  ABCE:none  ABDE:none  ACDE:4  BCDE:none
ABCDE: not supported by any transactions

(26)

Maximal vs Closed Frequent Itemsets

Minimum support = 2

[Figure: the same lattice annotated with transaction IDs; frequent itemsets are marked as "closed but not maximal" or "closed and maximal"]

# Closed = 9

# Maximal = 4
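The two counts can be reproduced by brute force on the five transactions ABC, ABCD, BCE, ACDE, DE with minimum support 2. The sketch below enumerates all itemsets, which is only practical for a toy example with 5 items; the helper names are assumptions.

```python
from itertools import combinations

transactions = ["ABC", "ABCD", "BCE", "ACDE", "DE"]
minsup = 2
items = sorted(set("".join(transactions)))

def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

# All frequent itemsets with their support counts
frequent = {}
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        s = support(c)
        if s >= minsup:
            frequent[frozenset(c)] = s

def immediate_supersets(itemset):
    """Itemsets obtained by adding exactly one item."""
    return [itemset | {i} for i in items if i not in itemset]

# Closed: no immediate superset has the same support
closed = [x for x, s in frequent.items()
          if all(support(sup) != s for sup in immediate_supersets(x))]
# Maximal: no immediate superset is frequent
maximal = [x for x in frequent
           if all(sup not in frequent for sup in immediate_supersets(x))]

print(len(closed), len(maximal))  # prints: 9 4
```

Every maximal frequent itemset is also closed, since a frequent itemset with an equally supported superset would have a frequent superset.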

(27)

Maximal vs Closed Itemsets

(28)

Pattern Evaluation

Association rule algorithms tend to produce too many rules, but many of them are uninteresting or redundant

Redundant if {A,B,C} → {D} and {A,B} → {D} have same support & confidence

Summarization techniques

Uninteresting, if the pattern that is revealed does not offer useful information.

Interestingness measures: a hard problem to define

Interestingness measures can be used to prune/rank the derived patterns

Subjective measures: require human analyst

Objective measures: rely on the data.

In the original formulation of association rules, support & confidence are the only measures used

(29)

Computing Interestingness Measure

• Given a rule X → Y, information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y

      Y     ¬Y
X     f11   f10   f1+
¬X    f01   f00   f0+
      f+1   f+0   N

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures: support, confidence, lift, Gini, J-measure, etc.

(30)

Drawback of Confidence

        Coffee  ¬Coffee
Tea     15      5        20
¬Tea    75      5        80
        90      10       100

Number of people that drink tea: 20
Number of people that drink coffee and tea: 15
Number of people that drink coffee but not tea: 75
Number of people that drink coffee: 90

(31)

Statistical Independence

• Population of 1000 students

• 600 students know how to swim (S)

• 700 students know how to bike (B)

• 420 students know how to swim and bike (S,B)

• P(S∧B) = 420/1000 = 0.42

• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S∧B) = P(S) × P(B) => Statistical independence

(32)

Statistical Independence

• Population of 1000 students

• 600 students know how to swim (S)

• 700 students know how to bike (B)

• 500 students know how to swim and bike (S,B)

• P(S∧B) = 500/1000 = 0.5

• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S∧B) > P(S) × P(B) => Positively correlated

(33)

Statistical Independence

• Population of 1000 students

• 600 students know how to swim (S)

• 700 students know how to bike (B)

• 300 students know how to swim and bike (S,B)

• P(S∧B) = 300/1000 = 0.3

• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S∧B) < P(S) × P(B) => Negatively correlated
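The three cases differ only in how P(S∧B) compares with P(S) × P(B). A small sketch of that comparison, using integer counts to avoid floating-point rounding (the function name `correlation` is an assumption):

```python
def correlation(n_total, n_s, n_b, n_sb):
    """Compare P(S∧B) with P(S) × P(B), both scaled by n_total squared."""
    observed = n_sb * n_total  # P(S∧B) × N²
    expected = n_s * n_b       # P(S) × P(B) × N²
    if observed == expected:
        return "independent"
    if observed > expected:
        return "positively correlated"
    return "negatively correlated"

for n_sb in (420, 500, 300):
    print(n_sb, correlation(1000, 600, 700, n_sb))
```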

(34)

Statistical-based Measures

(35)

Example: Lift/Interest

        Coffee  ¬Coffee
Tea     15      5        20
¬Tea    75      5        80
        90      10       100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75 but P(Coffee) = 0.9

⇒ Lift = 0.75/0.9 = 0.15/(0.9*0.2) = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
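Both numbers can be read off the contingency table. The sketch below takes the four cell counts f11, f10, f01, f00 as arguments; the function names are illustrative.

```python
def confidence(f11, f10):
    """P(Y|X) = f11 / f1+"""
    return f11 / (f11 + f10)

def lift(f11, f10, f01, f00):
    """P(X,Y) / (P(X) P(Y)) = f11 * N / (f1+ * f+1)"""
    n = f11 + f10 + f01 + f00
    return (f11 * n) / ((f11 + f10) * (f11 + f01))

# Tea -> Coffee: f11 = 15, f10 = 5, f01 = 75, f00 = 5
print(confidence(15, 5), round(lift(15, 5, 75, 5), 4))  # prints: 0.75 0.8333
```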

(36)

Another Example

                        of     the    of, the
Fraction of documents   0.9    0.9    0.8

If I was creating a document by picking words randomly, (of, the) have more or less the same probability of appearing together by chance.

No correlation

                        hong   kong   hong, kong
Fraction of documents   0.2    0.2    0.19

(hong, kong) have a much lower probability of appearing together by chance than they actually do. The two words almost always appear together.

Positive correlation

                        obama  karagounis  obama, karagounis
Fraction of documents   0.2    0.2         0.001

(obama, karagounis) have a much higher probability of appearing together by chance than they actually do. The two words almost never appear together.

Negative correlation

(37)

Drawbacks of Lift/Interest/Mutual Information

                        honk    konk    honk, konk
Fraction of documents   0.0001  0.0001  0.0001

                        hong   kong   hong, kong
Fraction of documents   0.2    0.2    0.19

Rare co-occurrences are deemed more interesting.

But this is not always what we want
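Computing lift for the two word pairs makes the drawback concrete. The sketch uses `fractions.Fraction` so that the ratios are exact; `lift_from_fractions` is a hypothetical helper name.

```python
from fractions import Fraction

def lift_from_fractions(p_x, p_y, p_xy):
    """Lift = P(X,Y) / (P(X) P(Y)), computed exactly from decimal strings."""
    return Fraction(p_xy) / (Fraction(p_x) * Fraction(p_y))

rare = lift_from_fractions("0.0001", "0.0001", "0.0001")  # honk, konk
common = lift_from_fractions("0.2", "0.2", "0.19")        # hong, kong
print(float(rare), float(common))
```

The rare pair gets a lift of 10000 while the common pair gets 4.75, although both pairs appear together almost every time either word appears.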

(38)

THE FP-TREE AND THE FP-GROWTH ALGORITHM

Slides from course lecture of E. Pitoura

(39)

Overview

• The FP-tree contains a compressed representation of the transaction database.

• A trie (prefix-tree) data structure is used

• Each transaction is a path in the tree; paths can overlap.

• Once the FP-tree is constructed, the recursive, divide-and-conquer FP-Growth algorithm is used to enumerate all frequent itemsets.

(40)

FP-tree Construction

• The FP-tree is a trie (prefix tree)

• Since transactions are sets of items, we need to transform them into ordered sequences so that we can have prefixes

• Otherwise, there is no common prefix between sets {A,B} and {B,C,A}

• We need to impose an order to the items

• Initially, assume a lexicographic order.

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

(41)

FP-tree Construction

• Initially the tree is empty

null

(42)

FP-tree Construction

• Reading transaction TID = 1

null
  A:1
    B:1

Node label = item:support

• Each node in the tree has a label consisting of the item and the support (number of transactions that reach that node, i.e. follow that path)

(43)

FP-tree Construction

• Reading transaction TID = 2

null
  A:1
    B:1
  B:1
    C:1
      D:1

Each transaction is a path in the tree

• We add pointers between nodes that refer to the same item

(44)

FP-tree Construction

After reading transactions TID=1, 2:

null
  A:1
    B:1
  B:1
    C:1
      D:1

Header Table

Item  Pointer
A
B
C
D
E

The Header Table and the pointers assist in computing the itemset support

(45)

FP-tree Construction

• Reading transaction TID = 3

Tree before inserting {A,C,D,E}:

null
  A:1
    B:1
  B:1
    C:1
      D:1

(46)

FP-tree Construction

• Reading transaction TID = 3

Tree after inserting {A,C,D,E}:

null
  A:2
    B:1
    C:1
      D:1
        E:1
  B:1
    C:1
      D:1

(47)

FP-tree Construction

• Reading transaction TID = 3

null
  A:2
    B:1
    C:1
      D:1
        E:1
  B:1
    C:1
      D:1

Each transaction is a path in the tree

(48)

FP-Tree Construction

Transaction Database

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Each transaction is a path in the tree

null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Header table

Item  Pointer
A
B
C
D
E

Pointers are used to assist frequent itemset generation
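The construction can be sketched in a few lines of Python. This is an illustrative version under stated assumptions: the class `FPNode` and the function `build_fp_tree` are names chosen here, and the header table is kept as a list of nodes per item rather than as a chain of node-to-node pointers.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, order):
    """Insert each transaction as a path; `order` lists the items in the chosen order."""
    rank = {item: i for i, item in enumerate(order)}
    root = FPNode(None)
    header = {item: [] for item in order}  # item -> all nodes carrying that item
    for t in transactions:
        node = root
        for item in sorted(t, key=rank.__getitem__):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # one more transaction follows this path
            node = child
    return root, header

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]
root, header = build_fp_tree(transactions, order="ABCDE")

# The support of an item is the sum of the counts of its nodes in the header table
support_e = sum(node.count for node in header["E"])
```

Storing the nodes of each item in a list gives the same access pattern as following the pointers: all nodes for an item can be visited without traversing the whole tree.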

(49)

FP-tree size

• Every transaction is a path in the FP-tree

• The size of the tree depends on the compressibility of the data

• Extreme case: All transactions are the same, the FP-tree is a single branch

• Extreme case: All transactions are different, the size of the tree is the same as that of the database (bigger actually, since we need additional pointers)

(50)

Item ordering

The size of the tree also depends on the ordering of the items.

Heuristic: order the items according to their frequency, from larger to smaller.

We would need to do an extra pass over the dataset to count frequencies

Example:

σ(A)=7, σ(B)=8, σ(C)=7, σ(D)=5, σ(E)=3

Ordering: B, A, C, D, E

TID  Items (original)  Items (reordered)
1    {A,B}             {B,A}
2    {B,C,D}           {B,C,D}
3    {A,C,D,E}         {A,C,D,E}
4    {A,D,E}           {A,D,E}
5    {A,B,C}           {B,A,C}
6    {A,B,C,D}         {B,A,C,D}
7    {B,C}             {B,C}
8    {A,B,C}           {B,A,C}
9    {A,B,D}           {B,A,D}
10   {B,C,E}           {B,C,E}
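The extra counting pass and the reordering can be sketched as follows. Breaking ties alphabetically (A before C, both with frequency 7) is an assumption made here so that the result matches the ordering used in the example.

```python
from collections import Counter

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]

# Extra pass over the data: count how often each item occurs
freq = Counter(item for t in transactions for item in t)

# Most frequent item first; ties are broken alphabetically (an assumption)
order = sorted(freq, key=lambda item: (-freq[item], item))
rank = {item: i for i, item in enumerate(order)}

# Rewrite every transaction as a sequence in the new order
reordered = [sorted(t, key=rank.__getitem__) for t in transactions]
print(order)  # prints: ['B', 'A', 'C', 'D', 'E']
```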

(51)

Finding Frequent Itemsets

• Input: The FP-tree

• Output: All frequent itemsets and their support

• Method:

• Divide and Conquer:

• Consider all itemsets that end in: E, D, C, B, A

For each possible ending item, consider the itemsets whose next-to-last item is one of the items preceding it in the ordering

E.g., for E, consider the next-to-last items D, C, B, A. This way we get all the itemsets ending in DE, CE, BE, AE

Proceed recursively this way.

Do this for all items.

(52)

Frequent itemsets

All Itemsets

E D C B A

DE CE BE AE CD BD AD BC AC AB

CDE BDE ADE BCE ACE ABE BCD ACD ABD ABC

ACDE BCDE ABDE ABCE ABCD

ABCDE

(53)

Frequent Itemsets

[Figure: the lattice of all itemsets arranged by suffix; "Frequent?" marks the itemsets being tested along the branch E, DE, CDE, ACDE, ABCDE]

(54)

Frequent Itemsets

[Figure: the same lattice of all itemsets arranged by suffix; "Frequent?" marks the next itemsets being tested]

(55)

Frequent Itemsets

[Figure: the same lattice of all itemsets arranged by suffix; "Frequent?" marks the next itemsets being tested]

We can generate all itemsets this way

We expect the FP-tree to contain a lot less

(56)

Using the FP-tree to find frequent itemsets

[Figure: the transaction database, its FP-tree and the header table]

Bottom-up traversal of the tree.

First, itemsets ending in E, then D, etc.; each time a suffix-based class

(57)

Finding Frequent Itemsets

Subproblem: find frequent itemsets ending in E

[Figure: the full FP-tree and header table, considered for itemsets ending in E]

• We will then see how to compute the support for the possible itemsets

(58)

Finding Frequent Itemsets

Ending in D

[Figure: the full FP-tree and header table, considered for itemsets ending in D]

(59)

Finding Frequent Itemsets

Ending in C

[Figure: the full FP-tree and header table, considered for itemsets ending in C]

(60)

Finding Frequent Itemsets

Ending in B

[Figure: the full FP-tree and header table, considered for itemsets ending in B]

(61)

Finding Frequent Itemsets

Ending in A

[Figure: the full FP-tree and header table, considered for itemsets ending in A]

(62)

Algorithm

• For each suffix X

• Phase 1

• Construct the prefix tree for X as shown before, and compute the support using the header table and the pointers

• Phase 2

• If X is frequent, construct the conditional FP-tree for X in the following steps

1. Recompute support

2. Prune infrequent items

3. Prune leaves and recurse

(63)

Example

Phase 1 – construct prefix tree

Find all prefix paths that contain E

[Figure: the full FP-tree and header table, with the E nodes reached through the header-table pointers]

Prefix paths for E: {A,C,D,E}, {A,D,E}, {B,C,E}

(64)

Example

Phase 1 – construct prefix tree

Find all prefix paths that contain E

null
  A:7
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      E:1

Prefix paths for E: {A,C,D,E}, {A,D,E}, {B,C,E}

(65)

Example

Compute support for E (minsup = 2)

How? Follow the pointers while summing up counts: 1+1+1 = 3 > 2

E is frequent

null
  A:7
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      E:1

{E} is frequent so we can now consider suffixes DE, CE, BE, AE

(66)

Example

E is frequent so we proceed with Phase 2

Phase 2

Convert the prefix tree of E into a conditional FP-tree

Two changes

(1) Recompute support
(2) Prune infrequent

(67)

Example

Recompute Support

The support counts for some of the nodes include transactions that do not end in E

For example, in null->B->C->E we count {B, C}

The support of any node is equal to the sum of the support of leaves with label E in its subtree

(68)

Example

Recompute support (animation over slides 68 to 74): the counts are updated one node at a time. The C below B goes from 3 to 1, B goes from 3 to 1, and A goes from 7 to 2.

null
  A:2
    C:1
      D:1
        E:1
    D:1
      E:1
  B:1
    C:1
      E:1

(75)

Example

Truncate

Delete the nodes of E

null
  A:2
    C:1
      D:1
    D:1
  B:1
    C:1

(78)

Example

Prune infrequent

In the conditional FP-tree some nodes may have support less than minsup, e.g., B needs to be pruned

This means that B appears with E less than minsup times

null
  A:2
    C:1
      D:1
    D:1
  C:1

(81)

Example

null
  A:2
    C:1
      D:1
    D:1
  C:1

The conditional FP-tree for E

Repeat the algorithm for {D,E}, {C,E}, {A,E}

(82)

Example

Phase 1

Find all prefix paths that contain D (DE) in the conditional FP-tree

null
  A:2
    C:1
      D:1
    D:1

(84)

Example

null
  A:2
    C:1
      D:1
    D:1

Compute the support of {D,E} by following the pointers in the tree: 1+1 = 2 ≥ 2 = minsup

{D,E} is frequent

(85)

Example

Phase 2

Construct the conditional FP-tree

1. Recompute support

2. Prune nodes

After recomputing the support and deleting the D nodes:

null
  A:2
    C:1

C has small support (1 < minsup), so it is pruned

(90)

Example

null
  A:2

Final conditional FP-tree for {D,E}

The support of A is ≥ minsup so {A,D,E} is frequent

Since the tree has a single node we return to the next subproblem

(91)

Example

null
  A:2
    C:1
      D:1
    D:1
  C:1

The conditional FP-tree for E

We repeat the algorithm for {D,E}, {C,E}, {A,E}

(92)

Example

Phase 1

Find all prefix paths that contain C (CE) in the conditional FP-tree

null
  A:2
    C:1
  C:1

(94)

Example

null
  A:2
    C:1
  C:1

Compute the support of {C,E} by following the pointers in the tree: 1+1 = 2 ≥ 2 = minsup

{C,E} is frequent

(95)

Example

Phase 2

Construct the conditional FP-tree

1. Recompute support

2. Prune nodes

After recomputing the support, A has count 1. Deleting the C nodes leaves:

null
  A:1

A has support 1 < minsup, so it is pruned and the tree is empty

Return to the previous subproblem

(101)

Example

null
  A:2
    C:1
      D:1
    D:1
  C:1

The conditional FP-tree for E

We repeat the algorithm for {D,E}, {C,E}, {A,E}

(102)

Example

Phase 1

Find all prefix paths that contain A (AE) in the conditional FP-tree

null
  A:2

(104)

Example

null
  A:2

Compute the support of {A,E} by following the pointers in the tree: 2 ≥ minsup

{A,E} is frequent

There is no conditional FP-tree for {A,E}

(105)

Example

• So for E we have the following frequent itemsets: {E}, {D,E}, {A,D,E}, {C,E}, {A,E}

• We proceed with D

(106)

Example

Ending in D

[Figure: the full FP-tree and header table, considered for itemsets ending in D]

(107)

Example

Phase 1 – construct prefix tree

Find all prefix paths that contain D

null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
    D:1
  B:3
    C:3
      D:1

Support 5 > minsup, D is frequent

Phase 2

Convert prefix tree into conditional FP-tree

(108)

Example

Recompute support (animation over slides 108 to 113): the count of each node becomes the sum of the counts of the D leaves in its subtree.

null
  A:4
    B:2
      C:1
        D:1
      D:1
    C:1
      D:1
    D:1
  B:1
    C:1
      D:1

Prune nodes

(114)

Example

Prune nodes: delete the nodes of D

null
  A:4
    B:2
      C:1
    C:1
  B:1
    C:1

Construct conditional FP-trees for {C,D}, {B,D}, {A,D}

And so on.
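The recursion of the walkthrough can be sketched without the tree itself. The version below is a projection-based stand-in, not the pointer-based FP-Growth: each conditional FP-tree is replaced by a plain list of (prefix, count) pairs, which follows the same suffix-based divide and conquer but gives up the compression of shared prefixes. The function name `mine_suffixes` is hypothetical.

```python
from collections import Counter

def mine_suffixes(cond_db, order, minsup, suffix=(), out=None):
    """Find all frequent itemsets that extend `suffix`.
    cond_db is a list of (prefix, count) pairs with each prefix sorted by `order`."""
    if out is None:
        out = {}
    rank = {item: i for i, item in enumerate(order)}
    # Recompute the support of each item inside the conditional database
    support = Counter()
    for prefix, count in cond_db:
        for item in prefix:
            support[item] += count
    # Try ending items from last to first in the order, skipping infrequent ones
    for item in sorted(support, key=rank.__getitem__, reverse=True):
        if support[item] < minsup:
            continue
        itemset = (item,) + suffix
        out[itemset] = support[item]
        # Conditional database: the part of each path that precedes `item`
        projected = [(prefix[:prefix.index(item)], count)
                     for prefix, count in cond_db if item in prefix]
        mine_suffixes(projected, order, minsup, itemset, out)
    return out

transactions = ["AB", "BCD", "ACDE", "ADE", "ABC", "ABCD", "BC", "ABC", "ABD", "BCE"]
order = "ABCDE"
database = [(tuple(sorted(t, key=order.index)), 1) for t in transactions]
frequent = mine_suffixes(database, order, minsup=2)
ending_in_e = {k: v for k, v in frequent.items() if k[-1] == "E"}
```

For the suffix E this reproduces the itemsets derived step by step in the example, including {A,D,E} from the conditional tree of {D,E}.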

(116)

Observations

• At each recursive step we solve a subproblem

• Construct the prefix tree

• Compute the new support

• Prune nodes

• Subproblems are disjoint so we never consider the same itemset twice

• Support computation is efficient: it happens together with the computation of the frequent itemsets.

(117)

Observations

• The efficiency of the algorithm depends on the compaction factor of the dataset

• If the tree is bushy then the algorithm does not work well: the number of subproblems that need to be solved increases a lot.

(118)

FREQUENT ITEMSET

RESEARCH

(119)
