FP Mining with Vertical Data Format

(1)

DATA MINING

CSE -4229

Sajal Halder

Assistant Professor, Dept. of CSE

Jagannath University

(2)

FP Mining with Vertical Data Format

Both Apriori and FP-growth use horizontal data format, where each row pairs a TID with its list of item IDs

(3)

FP Mining with Vertical Data Format

Alternatively, data can also be represented in vertical format

itemset   TID_set
I1        {T100,T400,T500,T700,T800,T900}
I2        {T100,T200,T300,T400,T600,T800,T900}
I3        {T300,T500,T600,T700,T800,T900}
I4        {T200,T400}
I5        {T100,T800}

(4)

Transform the horizontally formatted data to the vertical format by scanning the database once

The support count of an itemset is simply the length of the TID_set of the itemset

Horizontal format:

TID    List of item IDs
T100   I1,I2,I5
T200   I2,I4
T300   I2,I3
T400   I1,I2,I4
T500   I1,I3
T600   I2,I3
T700   I1,I3
T800   I1,I2,I3,I5
T900   I1,I2,I3

Vertical format:

itemset   TID_set
I1        {T100,T400,T500,T700,T800,T900}
I2        {T100,T200,T300,T400,T600,T800,T900}
I3        {T300,T500,T600,T700,T800,T900}
I4        {T200,T400}
I5        {T100,T800}
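The one-scan transformation can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `horizontal`, `to_vertical`, `vertical` and `support` are made up for the example, and the data is the nine-transaction table from the slide.

```python
from collections import defaultdict

# Horizontal format: TID -> list of item IDs (table from the slide).
horizontal = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}

def to_vertical(db):
    """Scan the database once and collect, for each item, the set of TIDs."""
    vertical = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

vertical = to_vertical(horizontal)

# The support count of an item is the size of its TID_set.
support = {item: len(tids) for item, tids in vertical.items()}
```

Here `support` maps each single item to its support count, for example 7 for I2 and 2 for I4.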

(5)

Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets
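As a small illustration of the intersection step, the snippet below derives the TID_set of a 2-itemset from two 1-itemsets, and then a 3-itemset from two of its 2-subsets. The variable names are hypothetical; the TID_sets are the ones listed for I1, I2 and I3 in the vertical table.

```python
# TID_sets of three single items, copied from the vertical-format table.
t_i1 = {"T100", "T400", "T500", "T700", "T800", "T900"}
t_i2 = {"T100", "T200", "T300", "T400", "T600", "T800", "T900"}
t_i3 = {"T300", "T500", "T600", "T700", "T800", "T900"}

# k = 2: intersect the TID_sets of two 1-subsets.
t_i1_i2 = t_i1 & t_i2
t_i1_i3 = t_i1 & t_i3

# k = 3: intersect the TID_sets of two 2-subsets of {I1, I2, I3}.
t_i1_i2_i3 = t_i1_i2 & t_i1_i3

# Support is the length of the resulting TID_set; no database rescan is needed.
support_i1_i2 = len(t_i1_i2)
support_i1_i2_i3 = len(t_i1_i2_i3)
```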

(6)

The frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on the Apriori property

Frequent 1-itemsets in vertical format

itemset   TID_set
I1        {T100,T400,T500,T700,T800,T900}
I2        {T100,T200,T300,T400,T600,T800,T900}
I3        {T300,T500,T600,T700,T800,T900}
I4        {T200,T400}
I5        {T100,T800}

(7)

The frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on the Apriori property

min_sup = 2

Frequent 2-itemsets in vertical format

itemset    TID_set
{I1,I2}    {T100,T400,T800,T900}
{I1,I3}    {T500,T700,T800,T900}
{I1,I5}    {T100,T800}
{I2,I3}    {T300,T600,T800,T900}
{I2,I4}    {T200,T400}
{I2,I5}    {T100,T800}

Frequent 3-itemsets in vertical format

itemset       TID_set
{I1,I2,I3}    {T800,T900}
{I1,I2,I5}    {T100,T800}
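The level-wise procedure that produces these tables can be sketched as below. This is one possible reading of the slides, assuming candidates are formed by joining pairs of frequent k-itemsets and pruned with the Apriori property; `mine_vertical` and the other names are hypothetical, and the code is not meant as an optimized miner.

```python
from itertools import combinations

# Frequent 1-itemsets in vertical format (TID_sets from the slide).
vertical = {
    "I1": {"T100", "T400", "T500", "T700", "T800", "T900"},
    "I2": {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    "I3": {"T300", "T500", "T600", "T700", "T800", "T900"},
    "I4": {"T200", "T400"},
    "I5": {"T100", "T800"},
}

def mine_vertical(vertical, min_sup):
    """Return {frequent itemset: TID_set}, built one level at a time."""
    level = {frozenset([item]): tids
             for item, tids in vertical.items() if len(tids) >= min_sup}
    frequent = {}
    while level:
        frequent.update(level)
        next_level = {}
        for (a, tids_a), (b, tids_b) in combinations(level.items(), 2):
            candidate = a | b
            # Keep only joins that grow the itemset by exactly one item.
            if len(candidate) != len(a) + 1 or candidate in next_level:
                continue
            # Apriori property: every k-subset of the candidate must be frequent.
            if any(candidate - {x} not in level for x in candidate):
                continue
            tids = tids_a & tids_b  # support via TID_set intersection
            if len(tids) >= min_sup:
                next_level[candidate] = tids
        level = next_level
    return frequent

frequent = mine_vertical(vertical, min_sup=2)
two_itemsets = {s for s in frequent if len(s) == 2}
three_itemsets = {s for s in frequent if len(s) == 3}
```

With min_sup = 2 the sketch finds the six 2-itemsets and two 3-itemsets listed in the tables, and no frequent 4-itemset.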

(8)

Mining multilevel association

Mining multidimensional association

Mining quantitative association

Mining interesting correlation patterns

(9)

Mining Various Kinds of Association Rules

Multilevel association rules involve concepts at different levels of abstraction.

Multidimensional association rules involve more than one dimension or predicate

■ e.g., rules relating what a customer buys as well as the customer’s age.

Quantitative association rules involve numeric attributes that have an implicit ordering among values

(10)

Mining Multiple-Level Association Rules

It is difficult to find interesting purchase patterns at the level of raw, individual items

An AllElectronics store, showing the items

(11)

Mining Multiple-Level Association Rules

A low-level item such as “IBM-ThinkPad-R40/P4M” or “Symantec-Norton-Antivirus-2003” occurs in only a very small fraction of the transactions

(12)

Mining Multiple-Level Association Rules

Data can be generalized by replacing low-level concepts within the data by their higher-level concepts.

✔ It is easier to find strong associations between generalized abstractions of the items

(13)

Mining Multiple-Level Association Rules

Items often form a hierarchy.

Items at the lower level are expected to have lower support.

Rules regarding itemsets at appropriate levels could be quite useful.

A transactional database can be encoded based on dimensions and levels.

We can explore shared multi-level mining.

[Figure: concept hierarchy with Food at the root; milk and bread below it; skim and full fat under milk; wheat and white under bread; Fraser at the lowest level]
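To make the effect of the hierarchy concrete, the sketch below extends every transaction with the ancestors of its items and then counts support at all levels at once. The `parent` links follow the Food / milk / bread hierarchy on the slide (with leaf names spelled out as "skim milk", "wheat bread", and so on); the four transactions are hypothetical and exist only for illustration.

```python
from collections import Counter

# Child -> parent links read off the slide's concept hierarchy.
parent = {
    "milk": "food",
    "bread": "food",
    "skim milk": "milk",
    "full fat milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
}

def ancestors(item):
    """Return all higher-level concepts above an item, nearest first."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain

def generalize(transaction):
    """Add every ancestor of every item to the transaction."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item))
    return extended

# Hypothetical transactions (not from the slides).
transactions = [
    {"skim milk", "wheat bread"},
    {"full fat milk", "white bread"},
    {"full fat milk"},
    {"wheat bread"},
]

counts = Counter(item for t in transactions for item in generalize(t))
support = {item: n / len(transactions) for item, n in counts.items()}
```

In this toy data the support of a lower-level item never exceeds the support of its parent, which is the behaviour the slide describes.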

(14)

Items often form hierarchies

Flexible support settings

■ Items at the lower level are expected to have lower support

Exploration of shared multi-level mining

(15)

Mining Multiple-Level Association Rules

Using uniform minimum support for all levels

■ Uniform support: level 1 min_sup = 5% and level 2 min_sup = 5%

■ Milk and fat milk are frequent

■ Milk is frequent but skim milk is infrequent

Level 1 (min_sup = 5%): Milk [support = 10%]

Level 2 (min_sup = 5%): Fat Milk [support = 6%], Skim Milk [support = 4%]

(16)

Mining Multiple-Level Association Rules

Using reduced minimum support at lower levels

■ Reduced support: level 1 min_sup = 5% and level 2 min_sup = 3%

■ Milk and fat milk are frequent

■ Skim milk is now frequent as well

Level 1 (min_sup = 5%): Milk [support = 10%]

Level 2 (min_sup = 3%): Fat Milk [support = 6%], Skim Milk [support = 4%]
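The difference between the two threshold settings can be checked with a few lines of Python. The supports and levels are the ones shown for Milk, Fat Milk and Skim Milk; the function name `frequent_items` and the dictionary layout are assumptions made for this sketch.

```python
# Supports and hierarchy levels taken from the slide.
support = {"milk": 0.10, "fat milk": 0.06, "skim milk": 0.04}
level = {"milk": 1, "fat milk": 2, "skim milk": 2}

def frequent_items(min_sup_by_level):
    """Items whose support meets the minimum support of their own level."""
    return {item for item, s in support.items()
            if s >= min_sup_by_level[level[item]]}

# Uniform support: 5% at both levels.
uniform = frequent_items({1: 0.05, 2: 0.05})

# Reduced support: 5% at level 1, 3% at level 2.
reduced = frequent_items({1: 0.05, 2: 0.03})
```

Under the uniform setting skim milk (4%) is missed; lowering the level-2 threshold to 3% recovers it.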

(17)

A top down, progressive deepening approach:

■ First find high-level strong rules:

milk → bread [20%, 60%].

■ Then find their lower-level “weaker” rules:

full fat milk → wheat bread [6%, 50%].

(18)

Variations of mining multiple-level associations

■ Level-crossed association rules:

full fat milk → Wonder wheat bread

■ Association rules with multiple, alternative hierarchies:

full fat milk → Wonder bread

(19)

Multi-Dimensional Association: Concepts

Single-dimensional rules:

buys(X, “milk”) ⇒ buys(X, “bread”)

Multi-dimensional rules: ≥ 2 dimensions or predicates

■ Inter-dimension association rules (no repeated predicates)

age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)

■ Hybrid-dimension association rules (repeated predicates)

age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)

Categorical Attributes

■ finite number of possible values, no ordering among values

Quantitative Attributes

■ numeric, implicit ordering among values

(20)

Numeric attributes are dynamically discretized

■ Such that the confidence or compactness of the rules mined is maximized.

2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat

Cluster “adjacent” association rules to form general rules using a 2-D grid.

Example:

age(X, “30-34”) ∧ income(X, “24K - 48K”) ⇒ buys(X, “high resolution TV”)
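A deliberately simplified sketch of the clustering idea: suppose each grid cell (age, income bin) already satisfies the thresholds for the same consequent, and adjacent ages inside one income bin are merged into a single interval. Real grid-based methods also merge across the second axis; this example merges along age only. The cells and the function name `merge_adjacent_ages` are hypothetical.

```python
# Hypothetical grid cells (age, income bin) that each support the rule
# "... => buys(X, high resolution TV)". Not taken from the slides.
cells = [
    (30, "24K - 48K"), (31, "24K - 48K"), (32, "24K - 48K"),
    (33, "24K - 48K"), (34, "24K - 48K"), (40, "24K - 48K"),
]

def merge_adjacent_ages(cells):
    """Merge runs of consecutive ages within each income bin."""
    merged = {}
    for age, income in sorted(set(cells), key=lambda c: (c[1], c[0])):
        ranges = merged.setdefault(income, [])
        if ranges and age == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], age)  # extend the current run
        else:
            ranges.append((age, age))          # start a new run
    return merged

merged = merge_adjacent_ages(cells)
```

For these cells the ages 30 to 34 collapse into one interval, giving a general rule of the same shape as the age 30-34 example, while the isolated age 40 stays on its own.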

(21)

Whether a rule is interesting or not can be assessed either subjectively or objectively

Uninteresting rules occur especially when mining at low support thresholds or mining for long patterns

Objective measures

Two popular measurements:

● support; and
● confidence

Subjective measures

A rule (pattern) is interesting if it is

● actionable (the user can do something with it)

(22)

Example of a misleading “strong” association rule

■ Analyze transactions of AllElectronics data about computer games and videos

■ Of the 10,000 transactions analyzed:

6,000 of the transactions include computer games
7,500 of the transactions include videos
4,000 of the transactions include both

■ Suppose that min_sup = 30% and min_confidence = 60%

■ The following association rule is discovered:

buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%]
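The support and confidence quoted for this rule follow directly from the transaction counts. The short check below uses hypothetical variable names and the thresholds stated on the slide.

```python
# Transaction counts from the slide.
n_total = 10_000
n_games = 6_000
n_videos = 7_500
n_both = 4_000

# support(games => videos): fraction of all transactions containing both.
support = n_both / n_total

# confidence(games => videos): fraction of game transactions that also contain videos.
confidence = n_both / n_games

# A rule is "strong" when it meets both minimum thresholds.
min_sup, min_conf = 0.30, 0.60
is_strong = support >= min_sup and confidence >= min_conf
```

The confidence works out to 4,000 / 6,000, roughly 66.7%, so the rule passes both thresholds.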

(23)

buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%]

This rule is strong but it is misleading

The probability of purchasing videos is 75%, which is even larger than 66%

In fact, computer games and videos are negatively associated

■ the purchase of one of these items actually decreases the likelihood of purchasing the other

(24)

buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%]

The confidence of a rule A ⇒ B can be deceiving

■ It is only an estimate of the conditional probability of itemset B given itemset A.

■ It does not measure the real strength of the correlation and implication between A and B

Need to use Correlation Analysis

(25)

Association Analysis to Correlation Analysis

The support and confidence measures are insufficient for filtering out uninteresting association rules

Need to use Correlation Analysis

A ⇒ B [support, confidence, correlation].

A correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.

There are many different correlation measures

(26)

Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y

        Y      ¬Y
X       f₁₁    f₁₀    f₁₊
¬X      f₀₁    f₀₀    f₀₊
        f₊₁    f₊₀    |T|

f₁₁: support of X and Y
f₁₀: support of X and ¬Y
f₀₁: support of ¬X and Y
f₀₀: support of ¬X and ¬Y

Used to define various measures

● support, confidence, lift, Gini, J-measure, etc.

(27)

Example: Lift/Interest

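Lift is a simple correlation measure

The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A) × P(B); otherwise A and B are dependent and correlated

Here P(A ∪ B) is the probability that a transaction contains all items of both A and B

lift(A, B) = P(A ∪ B) / (P(A) × P(B))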
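Lift is commonly defined as the ratio of the joint probability to the product of the individual probabilities. A minimal Python sketch, with a hypothetical helper `lift` that works from raw counts, applied to the computer games and videos counts given earlier in the deck:

```python
def lift(n_a, n_b, n_ab, n_total):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    return p_ab / (p_a * p_b)

# Counts from the computer games / videos example:
# 6,000 games, 7,500 videos, 4,000 both, 10,000 transactions.
games_videos = lift(6_000, 7_500, 4_000, 10_000)
```

The result is 0.40 / (0.60 × 0.75), about 0.89, which is below 1 and agrees with the earlier observation that games and videos are negatively associated.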

(28)

Statistical Independence

Population of 1000 students

■ 600 students know how to swim (S)

■ 700 students know how to bike (B)

■ 420 students know how to swim and bike (S, B)

■ P(S ∧ B) = 420/1000 = 0.42

■ P(S) × P(B) = 0.6 × 0.7 = 0.42

■ P(S ∧ B) = P(S) × P(B), so S and B are statistically independent

(29)

Statistical Independence

Population of 1000 students

■ 600 students know how to swim (S)

■ 700 students know how to bike (B)

■ 500 students know how to swim and bike (S, B)

■ P(S ∧ B) = 500/1000 = 0.5

■ P(S) × P(B) = 0.6 × 0.7 = 0.42

■ P(S ∧ B) > P(S) × P(B), so S and B are positively correlated

(30)

Statistical Independence

Population of 1000 students

■ 600 students know how to swim (S)

■ 700 students know how to bike (B)

■ 300 students know how to swim and bike (S, B)

■ P(S ∧ B) = 300/1000 = 0.3

■ P(S) × P(B) = 0.6 × 0.7 = 0.42

■ P(S ∧ B) < P(S) × P(B), so S and B are negatively correlated
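The three student populations can be compared in one go. The sketch below tests P(S ∧ B) against P(S) × P(B) with integer cross-multiplication, so the independent case is detected exactly without floating-point rounding. The function name `relation` is hypothetical.

```python
def relation(n_s, n_b, n_both, n_total):
    """Compare P(S and B) with P(S) * P(B) using integer arithmetic.

    n_both / n_total  vs  (n_s / n_total) * (n_b / n_total)
    is equivalent to
    n_both * n_total  vs  n_s * n_b.
    """
    joint = n_both * n_total
    product = n_s * n_b
    if joint == product:
        return "independent"
    return "positively correlated" if joint > product else "negatively correlated"

# 1000 students, 600 swimmers, 700 bikers; only the overlap changes.
case_420 = relation(600, 700, 420, 1000)
case_500 = relation(600, 700, 500, 1000)
case_300 = relation(600, 700, 300, 1000)
```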

(31)

Example: Lift/Interest

If the resulting lift value is greater than 1,

■ A and B are positively correlated

If the resulting lift value is less than 1,

■ A and B are negatively correlated

If the resulting lift value is equal to 1,

■ A and B are independent

(32)

Example: Lift/Interest

         Coffee   ¬Coffee
Tea      15       5         20
¬Tea     75       5         80
         90       10        100

15: number of people that drink coffee and tea
75: number of people that drink coffee but not tea
90: number of people that drink coffee
20: number of people that drink tea

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75

but P(Coffee) = 0.9
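The figures for Tea → Coffee can be read off the four cells of the contingency table. In the sketch below the cell names f11, f10, f01, f00 follow the contingency-table notation used earlier in the deck; the function `measures` is a hypothetical helper.

```python
def measures(f11, f10, f01, f00):
    """Support, confidence and lift of X -> Y from a 2x2 contingency table.

    f11: X and Y, f10: X and not Y, f01: not X and Y, f00: neither.
    """
    n = f11 + f10 + f01 + f00
    support = f11 / n
    confidence = f11 / (f11 + f10)   # P(Y | X)
    p_y = (f11 + f01) / n            # P(Y)
    return {"support": support, "confidence": confidence,
            "p_y": p_y, "lift": confidence / p_y}

# X = tea, Y = coffee, counts from the table.
tea_coffee = measures(f11=15, f10=5, f01=75, f00=5)
```

The confidence of 0.75 is lower than the baseline P(Coffee) = 0.9, so the lift is below 1: knowing that someone drinks tea makes coffee less likely, not more.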

(33)

Example: Lift/Interest

play basketball ⇒ eat cereal [40%, 66.7%] is misleading

■ The overall % of students eating cereal is 75% > 66.7%.

play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence

Measure of dependent/correlated events: lift

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000
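As a worked example, lift can be computed for both rules from the table, writing B for "play basketball" and C for "eat cereal". The column total for basketball is 2000 + 1000 = 3000 and the grand total is 3750 + 1250 = 5000; lift is taken here as the joint probability divided by the product of the marginals.

```latex
\begin{align*}
\mathrm{lift}(B, C)
  &= \frac{2000/5000}{(3000/5000)\,(3750/5000)}
   = \frac{0.40}{0.60 \times 0.75} \approx 0.89 \\
\mathrm{lift}(B, \neg C)
  &= \frac{1000/5000}{(3000/5000)\,(1250/5000)}
   = \frac{0.20}{0.60 \times 0.25} \approx 1.33
\end{align*}
```

The first value is below 1 and the second is above 1, which matches the claim that the "not eat cereal" rule describes the data more accurately.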

(34)

Example: χ²

To compute the χ² value, we take the squared difference between the observed and expected value for each slot (A and B pair) in the contingency table, divide it by the expected value, and sum the result over all slots.

(35)

Example: χ²

The χ² value is greater than one

The observed value of the slot (game, video) = 4,000, which is less than the expected value 4,500, so buying games and buying videos are negatively correlated
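The χ² computation for the games and videos data can be sketched as follows. The four observed counts are derived from the totals given earlier (6,000 games, 7,500 videos, 4,000 both, 10,000 transactions); the expected count of a slot is its row total times its column total divided by the grand total. Variable names are hypothetical.

```python
# Observed counts, derived from the totals in the games/videos example.
observed = {
    ("game", "video"): 4000,
    ("game", "no video"): 2000,       # 6000 - 4000
    ("no game", "video"): 3500,       # 7500 - 4000
    ("no game", "no video"): 500,     # 10000 - 4000 - 2000 - 3500
}

n = sum(observed.values())

# Row and column totals.
row_total = {}
col_total = {}
for (r, c), count in observed.items():
    row_total[r] = row_total.get(r, 0) + count
    col_total[c] = col_total.get(c, 0) + count

# Expected count under independence: row total * column total / grand total.
expected = {(r, c): row_total[r] * col_total[c] / n for (r, c) in observed}

# Chi-square: sum over all slots of (observed - expected)^2 / expected.
chi2 = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
```

The expected value for the (game, video) slot comes out as 4,500, as stated on the slide, and χ² is roughly 555.6.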

(37)

“Buy walnuts ⇒ buy milk [1%, 80%]” is misleading

■ if 85% of customers buy milk

Support and confidence are not good at representing correlations

             Milk     No Milk   Sum (row)
Coffee       m, c     ~m, c     c
No Coffee    m, ~c    ~m, ~c    ~c
Sum (col.)   m        ~m        Σ
