What Is Frequent Pattern Analysis?

Academic year: 2020

(1)

DATA MINING

CSE -4229

Sajal Halder

(2)

What Is Frequent Pattern Analysis?

■ Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

■ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

■ Motivation: Finding inherent regularities in data

■ What products were often purchased together? Beer and diapers?!

■ What are the subsequent purchases after buying a PC?

■ What kinds of DNA are sensitive to this new drug?

■ Can we automatically classify web documents?

■ Applications

■ Basket data analysis, cross-marketing, catalog design, sale campaign

(3)

Why Is Freq. Pattern Mining Important?

■ Discloses an intrinsic and important property of data sets

■ Forms the foundation for many essential data mining tasks

■ Association, correlation, and causality analysis

■ Sequential, structural (e.g., sub-graph) patterns

■ Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

■ Classification: associative classification

■ Cluster analysis: frequent pattern-based clustering

■ Data warehousing: iceberg cube and cube-gradient

■ Semantic data compression: fascicles

(4)

Market-Basket Data

A large set of items, e.g., things sold in a supermarket.

A large set of baskets, each of which is a small

(5)

Market-Baskets – (2)

Really, a general many-to-many mapping (association) between two kinds of things, where the one (the baskets) is a set of the other (the items)

But we ask about connections among “items,” not “baskets.”

The technology focuses on common events, not

(6)

• Given a set of transactions, find combinations of items (itemsets) that occur frequently

Market-Basket transactions

Frequent Itemsets

{Bread}: 4
{Milk}: 4
{Diaper}: 4
{Beer}: 3
{Diaper, Beer}: 3
{Milk, Bread}: 3
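
The counts above can be reproduced with a brute-force sketch. The transaction table did not survive extraction, so the five baskets below are an assumption chosen to be consistent with the listed counts. The function name `frequent_itemsets` and the threshold of 3 are also illustrative.

```python
from itertools import combinations

# Hypothetical baskets (not taken from the slides), chosen so that the
# counts match the "Frequent Itemsets" list above.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def frequent_itemsets(baskets, min_count, max_size=2):
    """Brute force: count every itemset up to max_size, keep the frequent ones."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for b in baskets if set(candidate) <= b)
            if count >= min_count:
                result[frozenset(candidate)] = count
    return result

found = frequent_itemsets(baskets, min_count=3)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With these assumed baskets the output also contains pairs that the slide does not list, such as {Bread, Diaper}. Brute force counts every combination, which is the cost that the Apriori method later in the deck is designed to avoid.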

(7)

Applications – (1)

Items = products; baskets = sets of products someone bought in one trip to the store.

Example application: given that many people buy beer and diapers together:

Run a sale on diapers; raise price of beer.

(8)

Applications – (2)

Baskets = Web pages; items = words.

Example application: Unusual words appearing together in a large number of documents, e.g., “Brad” and “Angelina,” may indicate an

(9)

Applications – (3)

Baskets = sentences; items = documents containing those sentences.

Example application: Items that appear together too often could represent plagiarism.

(10)

Definition: Frequent Itemset

Itemset

■ A collection of one or more items

■ Example: {Milk, Bread, Diaper}

■ k-itemset

■ An itemset that contains k items

Support count (σ)

■ Frequency of occurrence of an itemset
■ E.g. σ({Milk, Bread, Diaper}) = 2

Support

■ Fraction of transactions that contain an itemset
■ E.g. s({Milk, Bread, Diaper}) = 2/5 = 40%

Frequent Itemset

■ An itemset whose support is greater than
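
A minimal sketch of the two measures defined on this slide. The transaction table is missing from the extracted text, so the five transactions below are an assumption that reproduces σ = 2 and s = 2/5. The function names are illustrative.

```python
# Hypothetical transactions (not from the slides), consistent with the
# example values sigma = 2 and s = 40% given above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

example = {"Milk", "Bread", "Diaper"}
print(support_count(example, transactions))  # 2
print(support(example, transactions))        # 0.4
```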

(11)

Association Rule Mining

■ Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions

Example of Association Rules

{Diaper} → {Beer},

{Milk, Bread} → {Eggs,Coke}, {Beer, Bread} → {Milk},

(12)

Definition: Association Rule

Let D be a database of transactions, e.g.:

Let I be the set of items that appear in the database, e.g., I = {A, B, C, D, E, F}

A rule is defined by X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅

(13)

Definition: Association Rule

Example:

Association Rule

■ An implication expression of the form X → Y, where X and Y are itemsets
■ Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics

■ Support (s): fraction of transactions that contain both X and Y
■ Confidence (c): measures how often items in Y appear in transactions that

Rule Measures: Support and Confidence

Find all the rules X ⇒ Y with minimum confidence and support

■ support, s, probability that a transaction contains {X ∪ Y}

■ confidence, c, conditional probability that a transaction having X also contains Y

Let minimum support 50%, and minimum confidence 50%, we have

A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
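
Written as formulas, the two measures and the figures quoted above fit together as follows. The notation σ and D is borrowed from the other slides. The values of s(A) and s(C) are not stated in the extracted text; they are derived here from the quoted rule measures.

```latex
\begin{align*}
s(X \Rightarrow Y) &= P(X \cup Y) = \frac{\sigma(X \cup Y)}{|D|}, \\
c(X \Rightarrow Y) &= P(Y \mid X) = \frac{s(X \cup Y)}{s(X)}, \\
s(A) &= \frac{s(A \Rightarrow C)}{c(A \Rightarrow C)} = \frac{0.50}{0.666} \approx 0.75, \\
s(C) &= \frac{s(C \Rightarrow A)}{c(C \Rightarrow A)} = \frac{0.50}{1.00} = 0.50 .
\end{align*}
```

Both rules share the same support because they are built from the same itemset {A, C}. Only the denominator of the confidence changes.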

(15)

Basic Concepts: Frequent Patterns and Association Rules

Let supmin = 50%, confmin = 50%

Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}

■ Association rules:

A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)

Transaction-id Items bought

(16)

Example

TID   date       items_bought
100   10/10/99   {F,A,D,B}
200   15/10/99   {D,A,C,E,B}
300   19/10/99   {C,A,B,E}
400   20/10/99   {B,A,D}

What is the support and confidence of the rule: {B,D} → {A}

Support:

■ percentage of tuples that contain {A,B,D} =
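
For the four transactions in this example, both measures can be computed directly. This is a small sketch; the helper name `count` is illustrative and the data is copied from the table above.

```python
# The four transactions from the example table (TID -> items bought).
transactions = {
    100: {"F", "A", "D", "B"},
    200: {"D", "A", "C", "E", "B"},
    300: {"C", "A", "B", "E"},
    400: {"B", "A", "D"},
}

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items)

lhs, rhs = {"B", "D"}, {"A"}
support = count(lhs | rhs) / len(transactions)   # fraction containing {A,B,D}
confidence = count(lhs | rhs) / count(lhs)       # among those containing {B,D}
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 75%, confidence = 100%
```

Transactions 100, 200, and 400 contain {A,B,D}, and the same three are the only ones containing {B,D}, which is why the confidence is 100%.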

(17)

Association Rule Mining Task

■ Given a set of transactions T, the goal of association rule mining is to find all rules having

■ support ≥ minsup threshold

■ confidence ≥ minconf threshold

■ Brute-force approach:

■ List all possible association rules

■ Compute the support and confidence for each rule

■ Prune rules that fail the minsup and minconf thresholds
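
To get a feel for why the brute-force approach does not scale, the sketch below counts the candidate rules X → Y over d items, where X and Y are nonempty and disjoint. It checks the count against the closed form 3^d − 2^(d+1) + 1, a standard counting result that the slides do not state. The function name `count_rules` is illustrative.

```python
from itertools import product

def count_rules(d):
    """Count rules X -> Y over d items with X, Y nonempty and disjoint.

    Each item is assigned to X, to Y, or to neither ("-"); an assignment
    is a valid rule only if both X and Y received at least one item.
    """
    total = 0
    for assignment in product("XY-", repeat=d):
        if "X" in assignment and "Y" in assignment:
            total += 1
    return total

for d in range(1, 7):
    assert count_rules(d) == 3**d - 2**(d + 1) + 1

print(count_rules(6))  # 602
```

Even six items give 602 candidate rules, and the count grows roughly as 3^d, so listing and testing every rule quickly becomes infeasible.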

(18)

Mining Association Rules

Example of Rules:

{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)

Observations:

• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}

• Rules originating from the same itemset have identical support but can have different confidence
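
The observations can be checked with a short sketch that enumerates every binary partition of {Milk, Diaper, Beer}. The transaction table is not in the extracted text, so the five transactions below are an assumption; they were chosen because they reproduce the s and c values listed above. The names `count` and `rules_from` are illustrative.

```python
from itertools import combinations

# Hypothetical transactions (not from the slides), consistent with the
# support and confidence values listed in "Example of Rules".
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset):
    """Yield (X, Y, support, confidence) for each binary partition X -> Y.

    Assumes the itemset occurs at least once, so count(X) is never zero.
    """
    itemset = frozenset(itemset)
    joint = count(itemset)
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            x = frozenset(lhs)
            yield x, itemset - x, joint / len(transactions), joint / count(x)

rules = list(rules_from({"Milk", "Diaper", "Beer"}))
for x, y, s, c in rules:
    print(sorted(x), "->", sorted(y), f"s={s:.2f} c={c:.2f}")
```

Every rule has the same numerator σ({Milk, Diaper, Beer}), so the support is constant. The confidence varies only through σ(X).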

(19)

Mining Association Rules

Two-step approach:

Frequent Itemset Generation

■ Generate all itemsets whose support ≥ minsup

Rule Generation

■ Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still

(20)

Scalable Methods for Mining Frequent Patterns

■ The downward closure property of frequent patterns

■ Any subset of a frequent itemset must be frequent

■ If {beer, diaper, nuts} is frequent, so is {beer, diaper}

■ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

■ Scalable mining methods: Three major approaches

■ Apriori (Agrawal & Srikant @VLDB’94)

■ Freq. pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD’00)

■ Vertical data format approach (Charm: Zaki & Hsiao

(21)

The Apriori Principle

Apriori principle (Main observation):

– If an itemset is frequent, then all of its subsets must also be frequent

– If an itemset is not frequent, then all of its supersets cannot be frequent

– The support of an itemset never exceeds the support of its subsets
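
The last point (support never grows when items are added) can be verified exhaustively on a small database. The sketch below uses the four-transaction database TDB from the Apriori example in these slides. The names `sup` and `violations` are illustrative.

```python
from itertools import combinations

# Database TDB from the Apriori example slide (Tid 10, 20, 30, 40).
tdb = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]

def sup(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in tdb if set(itemset) <= t)

items = sorted(set().union(*tdb))
violations = []
for k in range(2, len(items) + 1):
    for itemset in combinations(items, k):
        # Compare each k-itemset with each of its (k-1)-subsets.
        for subset in combinations(itemset, k - 1):
            if sup(subset) < sup(itemset):
                violations.append((subset, itemset))

print(violations)  # []
```

The list is always empty: any transaction that contains an itemset also contains each of its subsets, so a subset can only be counted at least as often.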

(22)

Illustration of the Apriori principle

Found to be frequent

(23)

Illustration of the Apriori principle

[Figure labels: Found to be infrequent; Pruned]

(24)

Apriori: A Candidate Generation-and-Test Approach

■ Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @KDD’94)

■ Method:

■ Initially, scan DB once to get frequent 1-itemset

■ Generate length (k+1) candidate itemsets from length k frequent itemsets

■ Test the candidates against DB

■ Terminate when no frequent or candidate set can be

(25)

The Apriori Algorithm—An Example

Database TDB

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1

Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1

Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2

Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan: C2

Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2

Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3

Itemset
{B, C, E}

3rd scan: L3

Itemset     sup
{B, C, E}   2

(26)

The Apriori algorithm

Level-wise approach

Ck = candidate itemsets of size k
Lk = frequent itemsets of size k

1. k = 1, C1 = all items
2. While Ck not empty
3. Scan the database to find which itemsets in Ck are frequent and put them into Lk (frequent itemset generation)
4. Use Lk to generate a collection of candidate itemsets Ck+1 of size k+1 (candidate generation)
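
The level-wise loop can be sketched in a few lines of Python. This is an illustrative version, not the deck's own code: the function name `apriori` is hypothetical, and candidates are generated by taking unions of two frequent k-itemsets and pruning, rather than by the prefix-based self-join shown on later slides. The `k += 1` step is assumed, since the extracted algorithm stops at step 4. The data is the TDB database from the worked example, with a minimum support count of 2 inferred from that example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {k: {itemset: count}} of frequent itemsets, found level by level."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset([i]) for i in items]      # C1 = all items
    levels = {}
    k = 1
    while candidates:
        # Scan the database: count each candidate, keep the frequent ones (Lk).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels[k] = frequent
        # Build C(k+1): unions of two frequent k-itemsets that have size k+1
        # and whose k-subsets are all frequent (Apriori pruning).
        next_candidates = set()
        for a, b in combinations(frequent, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent for s in combinations(union, k)
            ):
                next_candidates.add(union)
        candidates = sorted(next_candidates, key=sorted)
        k += 1  # assumed step; not present in the extracted slide text
    return levels

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
levels = apriori(tdb, min_count=2)
for k, frequent in levels.items():
    print(k, sorted("".join(sorted(s)) for s in frequent))
```

On TDB this yields L1 = {A, B, C, E}, L2 = {AC, BC, BE, CE}, and L3 = {BCE}, matching the worked example.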

(27)

Illustration of the Apriori principle

minsup = 3

Items (1-itemsets)

Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs)

Triplets (3-itemsets)

(28)

How to Generate Candidates?

■ Suppose the items in Lk-1 are listed in an order

■ Step 1: self-joining Lk-1

insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

■ Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do

(29)

Generating Candidates Ck+1 in SQL

self-join Lk

insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk
from Lk p, Lk q

(30)

Example I

L3 = {abc, abd, acd, ace, bcd}

Self-join: L3*L3

- abcd from abc and abd
- acde from acd and ace

L3 as a table (joined with itself):

item1   item2   item3
a       b       c
a       b       d
a       c       d
a       c       e
b       c       d

(34)

Example II

{Bread, Diaper}

{Bread, Milk}

{Diaper, Milk}

(35)

Example I

L3={abc, abd, acd, ace, bcd}

Self-joining: L3*L3

- abcd from abc and abd
- acde from acd and ace

Pruning:

- abcd is kept since all subset itemsets are in L3
- acde is removed because ade is not in L3

C4={abcd}

{a,b,c} joined with {a,b,d} gives {a,b,c,d}; its 3-item subsets: abc, abd, acd, bcd

{a,c,d} joined with {a,c,e} gives {a,c,d,e}; its 3-item subsets: acd √, ace √, ade X, cde

(36)

• We have all frequent k-itemsets Lk

Step 1: self-join Lk

• Create set Ck+1 by joining frequent k-itemsets that share the first k-1 items

Step 2: prune

• Remove from Ck+1 the itemsets that contain a subset k-itemset that is not frequent
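
The two steps can be sketched as one small function and run on L3 from Example I. The name `apriori_gen` is a conventional label used here for illustration. Itemsets are represented as sorted tuples so that "the first k-1 items" is well defined.

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Join + prune: candidate (k+1)-itemsets from frequent k-itemsets."""
    frequent = sorted(tuple(sorted(s)) for s in frequent_k)
    frequent_set = set(frequent)
    candidates = []
    for p, q in combinations(frequent, 2):
        # Step 1 (self-join): same first k-1 items, last item of p < last of q.
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            c = p + (q[-1],)
            # Step 2 (prune): every k-item subset of c must be frequent.
            if all(s in frequent_set for s in combinations(c, len(p))):
                candidates.append(c)
    return candidates

l3 = ["abc", "abd", "acd", "ace", "bcd"]
c4 = ["".join(c) for c in apriori_gen(l3)]
print(c4)  # ['abcd']
```

The join produces abcd and acde; acde is dropped in the prune step because ade is not in L3, leaving C4 = {abcd} as in the example.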

(37)

Apriori

■ Step 1: scans all of the transactions in order to count the

(40)

Apriori

■ Step 4: Generate C2- candidate 2-itemsets and scan the

(47)

Generate Association Rule

Once the frequent itemsets from database have been found, it is straightforward to generate strong association rules from them

strong association rules - association rules satisfy

(48)

Generate Association Rule

Suppose the data contain the frequent itemset l={I1,I2,I5}

The nonempty subsets of l are: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2},

(49)

Generate Association Rule

■ Suppose the data contain the frequent itemset l={I1,I2,I5}

■ The resulting association rules are as shown
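
The candidate rules for l can be enumerated mechanically: each nonempty proper subset s of l gives one rule s → (l − s). The sketch below only lists the candidates; the function name `candidate_rules` is illustrative. Filtering by minimum confidence would need the support counts from the transaction database, which are not part of the extracted text.

```python
from itertools import combinations

def candidate_rules(itemset):
    """List every rule s -> (l - s) for nonempty proper subsets s of l."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            rhs = tuple(i for i in items if i not in lhs)
            rules.append((lhs, rhs))
    return rules

rules = candidate_rules({"I1", "I2", "I5"})
for lhs, rhs in rules:
    print("{" + ", ".join(lhs) + "} -> {" + ", ".join(rhs) + "}")
```

A 3-itemset has 2^3 − 2 = 6 nonempty proper subsets, so six candidate rules are printed.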

(50)
