What Is Frequent Pattern Analysis?

(1)

DATA MINING

CSE -4229

Sajal Halder

(2)

What Is Frequent Pattern Analysis?

■ Frequent pattern: a pattern (a set of items, subsequences, substructures,

etc.) that occurs frequently in a data set

■ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

■ Motivation: Finding inherent regularities in data

■ What products were often purchased together?— Beer and diapers?!

■ What are the subsequent purchases after buying a PC?

■ What kinds of DNA are sensitive to this new drug?

■ Can we automatically classify web documents?

■ Applications

■ Basket data analysis, cross-marketing, catalog design, sale campaign

(3)

Why Is Freq. Pattern Mining Important?

■ Discloses an intrinsic and important property of data sets

■ Forms the foundation for many essential data mining tasks

■ Association, correlation, and causality analysis

■ Sequential, structural (e.g., sub-graph) patterns

■ Pattern analysis in spatiotemporal, multimedia,

time-series, and stream data

■ Classification: associative classification

■ Cluster analysis: frequent pattern-based clustering

■ Data warehousing: iceberg cube and cube-gradient

■ Semantic data compression: fascicles

(4)

Market-Basket Data

■

A large set of

items

, e.g., things sold in a

supermarket.

■

A large set of baskets, each of which is a small

(5)

Market-Baskets – (2)

■

Really, a general many-to-many mapping

(association) between two kinds of things, where

the one (the baskets) is a set of the other (the

items

)

■

But we ask about connections among “items,”

not “baskets.”

■

The technology focuses on

common events

, not

(6)

• Given a set of transactions, find combinations of items (itemsets) that occur frequently

Market-Basket transactions

{Bread}: 4 {Milk} : 4 {Diaper} : 4 {Beer}: 3

{Diaper, Beer} : 3 {Milk, Bread} : 3

Frequent Itemsets

(7)

Applications – (1)

■

Items

= products; baskets = sets of products

someone bought in one trip to the store.

■

Example

application

: given that many people buy

beer and diapers together:

■

Run a sale on diapers; raise price of beer.

(8)

Applications – (2)

■

Baskets = Web pages;

items

= words.

■

Example

application:

Unusual words appearing

together in a large number of documents, e.g.,

“Brad” and “Angelina,” may indicate an

(9)

Applications – (3)

■

Baskets = sentences;

items

= documents

containing those sentences.

■

Example

application:

Items that appear together

too often could represent plagiarism.

(10)

Definition: Frequent Itemset

■ Itemset

■ A collection of one or more items

■ Example: {Milk, Bread, Diaper}

■ k-itemset

■ An itemset that contains k items

■ Support count (σ)

■ Frequency of occurrence of an itemset ■ E.g. σ({Milk, Bread,Diaper}) = 2

■ Support

■ Fraction of transactions that contain an

itemset

■ E.g. s({Milk, Bread, Diaper}) = 2/5=40%

■ Frequent Itemset

■ An itemset whose support is greater than

(11)

Association Rule Mining

■ Given a set of transactions, find rules that will predict the

occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

Example of Association Rules

{Diaper} → {Beer},

{Milk, Bread} → {Eggs,Coke}, {Beer, Bread} → {Milk},

(12)

Definition: Association Rule

Let D be database of

transactions

■

e.g.:

■

Let I be the set of items that appear in the

database, e.g., I={A,B,C,D,E,F}

■

A

rule

is defined by X

⇒

Y, where X

⊂

I, Y

⊂

I,

and X

∩

Y=

∅

(13)

Definition: Association Rule

Exampl e:

Association Rule

■ An implication expression of the

form X → Y, where X and Y are itemsets

■ Example:

{Milk, Diaper} → {Beer}

Rule Evaluation Metrics ■ Support (s)

Fraction of transactions that contain both X and Y

■ Confidence (c)

Measures how often items in Y appear in transactions that

(14)

Rule Measures: Support and Confidence

Find all the rules X ⇒ Y with minimum confidence and support

■ support, s, probability that a

transaction contains {X ∪ Y}

■ confidence, c, conditional

probability that a transaction having X also contains Y

Let minimum support 50%, and minimum confidence 50%, we have

■ A ⇒ C (50%, 66.6%) ■ C ⇒ A (50%, 100%)

(15)

Basic Concepts: Frequent Patterns and

Association Rules

■ Let sup

min = 50%, confmin = 50%

■ Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}

■ Association rules:

■ A D (60%, 100%) ■ D A (60%, 75%)

Transaction-id Items bought

(16)

TID date items_bought

100 10/10/99 {F,A,D,B}

200 15/10/99 {D,A,C,E,B}

300 19/10/99 {C,A,B,E}

400 20/10/99 {B,A,D}

Example

■

What is the

support

and

confidence

of the rule:

{B,D}

⇒

{A}

Support:

■ percentage of tuples that contain {A,B,D} =

(17)

Association Rule Mining Task

■ Given a set of transactions T, the goal of association rule

mining is to find all rules having

■ support ≥ minsup threshold

■ confidence ≥ minconf threshold

■ Brute-force approach:

■ List all possible association rules

■ Compute the support and confidence for each rule

■ Prune rules that fail the minsup and minconf thresholds

(18)

Mining Association Rules

Example of Rules:

{Milk,Diaper} → {Beer} (s=0.4, c=0.67) {Milk,Beer} → {Diaper} (s=0.4, c=1.0) {Diaper,Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk,Diaper} (s=0.4, c=0.67) {Diaper} → {Milk,Beer} (s=0.4, c=0.5) {Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence

(19)

Mining Association Rules

■

Two-step approach:

■

Frequent Itemset Generation

■ Generate all itemsets whose support ≥ minsup

■

Rule Generation

■ Generate high confidence rules from each frequent

itemset, where each rule is a binary partitioning of a frequent itemset

■

Frequent

itemset

generation

is

still

(20)

Scalable Methods for Mining Frequent Patterns

■ The downward closure property of frequent patterns

■ Any subset of a frequent itemset must be frequent

■ If {beer, diaper, nuts} is frequent, so is {beer,

diaper}

■ i.e., every transaction having {beer, diaper, nuts} also

contains {beer, diaper}

■ Scalable mining methods: Three major approaches

■ Apriori (Agrawal & Srikant@VLDB’94)

■ Freq. pattern growth (FPgrowth—Han, Pei & Yin

@SIGMOD’00)

■ Vertical data format approach (Charm—Zaki & Hsiao

(21)

The Apriori Principle

• Apriori

principle (Main observation):

– If an itemset is frequent, then all of its subsets must also be frequent

– If an itemset is not frequent, then all of its supersets

cannot be frequent

– The support of an itemset never exceeds the support of its subsets

(22)

Illustration of the Apriori principle

Found to be frequent

(23)

Illustration of the Apriori principle

Found to be

Infrequent

Pruned

(24)

Apriori: A Candidate Generation-and-Test Approach

■ Apriori pruning principle: If there is any itemset which is

infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)

■ Method:

■ Initially, scan DB once to get frequent 1-itemset

■ Generate length (k+1) candidate itemsets from length k

frequent itemsets

■ Test the candidates against DB

■ Terminate when no frequent or candidate set can be

(25)

The Apriori Algorithm—An Example

Database TDB

1st scan

C₁ L1

L₂

C₂ _C

2

2nd scan

C₃ ₃rd_scan L₃ Tid Items

10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E

Itemset sup {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 Itemset sup {A} 2 {B} 3 {C} 3 {E} 3 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} Itemset sup

{A, B} 1 {A, C} 2 {A, E} 1 {B, C} 2 {B, E} 3 {C, E} 2

Itemset sup

{A, C} 2 {B, C} 2 {B, E} 3 {C, E} 2

Itemset

{B, C, E}

Itemset sup

{B, C, E} 2

(26)

The Apriori algorithm

Level-wise approach C_Lk = candidate itemsets of size k

k = frequent itemsets of size k

Candidate generation

Frequent itemset generation

1. k = 1

,

C

₁

= all items

2. While

C

_k

not empty

3. Scan the database to find which itemsets in

C

_k

are

frequent

and put them into

L

_k

4. Use

L

_k

to generate a collection of candidate

itemsets

C

_k+1

of size

k+1

(27)

Items

(1-itemsets)

Pairs (2-itemsets)

(No need to generate candidates involving Coke or Eggs)

Triplets

(3-itemsets)

minsup = 3

Illustration of the Apriori principle

(28)

How to Generate Candidates?

■ Suppose the items in L

k-1 are listed in an order

■ Step 1: self-joining L

k-1

insert into C_k

select p.item₁, p.item₂, …, p.item_k-1, q.item_k-1 from L_k-1 p, L_k-1q

where p.item₁=q.item₁, …, p.item_k-2=q.item_k-2, p.item_k-1< q.item_k-1

■ Step 2: pruning

forall itemsets c in C_k do

forall (k-1)-subsets s of c do

(29)

Generating Candidates C

_k+1

in SQL

• self-join L

k

insert into C_k+1

select p.item₁, p.item₂, …, p.item_k, q.item_k from L_k p, L_kq

(30)

• L

₃

={abc, abd, acd, ace, bcd}

• Self-join

:

L

₃

*L

₃

– abcd from abc and abd

– acde from acd and ace

Example I

item 1 item 2 item 3

a b c

a b d

a c d

a c e

b c d

item 1 item 2 item 3

a b c

a b d

a c d

a c e

b c d

(31)

• L

₃

={abc, abd, acd, ace, bcd}

• Self-joining

:

L

₃

*L

₃

Example I

item 1 item 2 item 3

a b c

a b d

a c d

a c e

b c d

item 1 item 2 item 3

a b c

a b d

a c d

a c e

b c d

(32)

• L

₃

={abc, abd, acd, ace, bcd}

• Self-joining

:

L

₃

*L

₃

Example I

{a,b,c} {a,b,d}

{a,b,c,d} item1 item2 item3

a b c

a b d

a c d

a c e

b c d

item1 item2 item3

a b c

a b d

a c d

a c e

b c d

(33)

• L

₃

={abc, abd, acd, ace, bcd}

• Self-joining

:

L

₃

*L

₃

{a,c,d} {a,c,e}

{a,c,d,e}

Example I

item 1 item 2 item 3

a b c

a b d

a c d

a c e

b c d

item 1 item 2 item 3

a b c

a b d

a c d

a c e

b c d

(34)

Example II

{Bread,Diape r{}Bread,Mil

k}

{Diaper, Milk}

(35)

• L₃={abc, abd, acd, ace, bcd}

• Self-joining: L₃*L₃

– abcd from abc and abd

– acde from acd and ace

• Pruning:

– abcd is kept since all subset itemsets are in L₃

– acde is removed because ade is not in L₃

• C₄={abcd}

{a,c,d} {a,c,e}

{a,c,d,e}

acd ace ade cde

√ √ X

Example I

{a,b,c} {a,b,d}

{a,b,c,d}

abc abd acd bcd

(36)

• We have all frequent k-itemsets L_k

• Step 1: self-join L_k

• Create set C_k+1 by joining frequent k-itemsets that share the first k-1 items

• Step 2: prune

• Remove from C_k+1 the itemsets that contain a subset k-itemset that is not frequent

(37)

Apriori

■ Step 1: scans all of the transactions in order to count the

(38)

Apriori

(39)

Apriori

(40)

Apriori

■ Step 4: Generate C2- candidate 2-itemsets and scan the

(41)

Apriori

(42)

Apriori

(43)

Apriori

(44)

Apriori

(45)

Apriori

(46)

Apriori

(47)

Generate Association Rule

■

Once the frequent itemsets from database have

been found, it is straightforward to generate

strong association rules from them

■

strong association rules

-association rules satisfy

(48)

Generate Association Rule

■

Suppose the data contain the frequent itemset

l={I1,I2,I5}

■

The nonempty subsets of l are:

■ {I1, I2}, ■ {I1, I5},

■ {I2, I5}, ■ {I1},

■ {I2},

(49)

Generate Association Rule

■ Suppose the data contain the frequent

itemset l={I1,I2,I5}

■ The resulting association rules are as shown

(50)