DATA MINING
CSE -4229
Sajal Halder
Assistant Professor, Dept. of CSE
Jagannath University
FP Mining with Vertical Data Format
Both Apriori and FP-growth use the horizontal data format (TID : list of item IDs).
FP Mining with Vertical Data Format
Alternatively, data can also be represented in the vertical format (itemset : TID_set).
Transform the horizontally formatted data to the vertical format by scanning the database once.
The support count of an itemset is simply the length of the TID_set of the itemset.
Horizontal format
TID    List of item IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Vertical format
itemset   TID_set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}
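The one-scan transformation can be sketched as follows. This is a minimal illustration using the nine transactions from the table; the names `horizontal`, `to_vertical` and `support` are chosen here for illustration and do not come from the slides.

```python
from collections import defaultdict

# Horizontal format: TID -> list of item IDs (data from the table above).
horizontal = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}

def to_vertical(db):
    """Build item -> TID_set with a single pass over the database."""
    vertical = defaultdict(set)
    for tid, items in db.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

vertical = to_vertical(horizontal)
# The support count of an item is the size of its TID_set.
support = {item: len(tids) for item, tids in vertical.items()}
```

With this data, `support["I2"]` is 7 and `support["I5"]` is 2, matching the lengths of the TID_sets in the vertical table.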
Determine the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets:
TID_set(I1) ∩ TID_set(I2) → TID_set({I1, I2})
The frequent k-itemsets can be used to construct the candidate (k+1)-itemsets based on the Apriori property.
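A level-wise miner over the vertical format might look like the sketch below. It is one possible reading of the procedure, not the lecturer's implementation: the function name `mine_vertical`, the tuple encoding of itemsets, and the threshold min_sup = 2 are assumptions made for this example.

```python
from itertools import combinations

# Vertical format: item -> TID_set (data from the example database).
vertical = {
    "I1": {"T100", "T400", "T500", "T700", "T800", "T900"},
    "I2": {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    "I3": {"T300", "T500", "T600", "T700", "T800", "T900"},
    "I4": {"T200", "T400"},
    "I5": {"T100", "T800"},
}

def mine_vertical(vertical, min_sup):
    """Level-wise mining: support of an itemset = size of its intersected TID_set."""
    # Frequent 1-itemsets, keyed by a sorted tuple of items.
    level = {(item,): tids for item, tids in vertical.items() if len(tids) >= min_sup}
    frequent = dict(level)
    while level:
        next_level = {}
        for a, b in combinations(sorted(level), 2):
            # Join two k-itemsets only if they share their first k-1 items.
            if a[:-1] != b[:-1]:
                continue
            candidate = a + (b[-1],)
            # Apriori property: every k-subset of the candidate must be frequent.
            if any(sub not in level for sub in combinations(candidate, len(a))):
                continue
            tids = level[a] & level[b]  # TID_set intersection gives the support
            if len(tids) >= min_sup:
                next_level[candidate] = tids
        frequent.update(next_level)
        level = next_level
    return frequent

frequent = mine_vertical(vertical, min_sup=2)
```

No further database scan is needed after the vertical format is built: each support count comes from a set intersection.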
FP Mining with Vertical Data Format
Frequent 1-itemsets in vertical format
itemset   TID_set
I1        {T100, T400, T500, T700, T800, T900}
I2        {T100, T200, T300, T400, T600, T800, T900}
I3        {T300, T500, T600, T700, T800, T900}
I4        {T200, T400}
I5        {T100, T800}
FP Mining with Vertical Data Format
Frequent 2-itemsets in vertical format
min_sup = 2
itemset    TID_set
{I1,I2}    {T100, T400, T800, T900}
{I1,I3}    {T500, T700, T800, T900}
{I1,I5}    {T100, T800}
{I2,I3}    {T300, T600, T800, T900}
{I2,I4}    {T200, T400}
{I2,I5}    {T100, T800}

Frequent 3-itemsets in vertical format
itemset       TID_set
{I1,I2,I3}    {T800, T900}
{I1,I2,I5}    {T100, T800}
Mining multilevel association
Mining multidimensional association
Mining quantitative association
Mining interesting correlation patterns
Mining Various Kinds of Association Rules
Multilevel association rules involve concepts at different levels of abstraction.
Multidimensional association rules involve more than one dimension or predicate
■ e.g., rules relating what a customer buys as well as the customer’s age.
Quantitative association rules involve numeric attributes that have an implicit ordering among values.
Mining Multiple-Level Association Rules
It is difficult to find interesting purchase patterns at the raw, primitive level of the data.
[Table: transactions of an AllElectronics store, showing the items purchased in each transaction]
Mining Multiple-Level Association Rules
“IBM-ThinkPad-R40/P4M” or “Symantec-Norton-Antivirus-2003” occurs in a very small fraction of the transactions.
Mining Multiple-Level Association Rules
Data can be generalized by replacing low-level concepts within the data by their higher-level concepts.
✔ It is easier to find strong associations between generalized abstractions of the items.
Mining Multiple-Level Association Rules
Items often form a hierarchy.
Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels could be quite useful.
A transactional database can be encoded based on dimensions and levels.
We can explore shared multi-level mining.
[Figure: concept hierarchy. Food → milk, bread; milk → skim, full fat; bread → wheat, white; “Fraser” appears as a brand-level node.]
Items often form hierarchies
Flexible support settings
■ Items at the lower level are expected to have lower support
Exploration of shared multi-level mining
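Encoding a transactional database at a higher level of the hierarchy can be sketched as below. The `parent` links loosely follow the Food hierarchy above, and the four transactions are made up purely for illustration; neither comes from the slides.

```python
# Hypothetical parent links, loosely following the Food > milk/bread hierarchy.
parent = {
    "skim milk": "milk",
    "full fat milk": "milk",
    "wheat bread": "bread",
    "white bread": "bread",
    "milk": "food",
    "bread": "food",
}

def generalize(item, levels_up):
    """Replace an item by its ancestor `levels_up` steps higher, if one exists."""
    for _ in range(levels_up):
        item = parent.get(item, item)
    return item

def support(transactions, item):
    """Fraction of transactions that contain `item`."""
    return sum(item in t for t in transactions) / len(transactions)

# Made-up transactions at the lowest level of the hierarchy.
low_level = [
    {"skim milk", "wheat bread"},
    {"full fat milk", "white bread"},
    {"full fat milk", "wheat bread"},
    {"skim milk"},
]
# The same transactions encoded one level higher.
high_level = [{generalize(i, 1) for i in t} for t in low_level]
```

In this toy data, "milk" appears in every generalized transaction while "skim milk" appears in only half of the original ones, which illustrates why lower-level items tend to have lower support.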
Mining Multiple-Level Association Rules
Using uniform minimum support for all levels
■ Uniform support: level 1 min_sup = 5% and level 2 min_sup = 5%
■ Milk and Fat Milk are frequent.
■ Milk is frequent, but Skim Milk is infrequent.
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): Fat Milk [support = 6%], Skim Milk [support = 4%]
Mining Multiple-Level Association Rules
Using reduced minimum support at lower levels
■ Reduced support: level 1 min_sup = 5% and level 2 min_sup = 3%
■ Milk and Fat Milk are frequent.
■ Skim Milk is also frequent.
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): Fat Milk [support = 6%], Skim Milk [support = 4%]
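The effect of the two threshold settings can be checked with a few lines of code. This is only a sketch: the supports are the ones quoted in the example, while the function name `frequent_items` and the per-level dictionary are assumptions of this illustration.

```python
# (item, level, support) triples taken from the Milk example.
items = [("Milk", 1, 0.10), ("Fat Milk", 2, 0.06), ("Skim Milk", 2, 0.04)]

def frequent_items(items, min_sup_by_level):
    """Keep the items whose support meets the threshold set for their level."""
    return [name for name, level, sup in items if sup >= min_sup_by_level[level]]

# Uniform support: 5% at both levels.
uniform = frequent_items(items, {1: 0.05, 2: 0.05})
# Reduced support: 5% at level 1, 3% at level 2.
reduced = frequent_items(items, {1: 0.05, 2: 0.03})
```

Under the uniform setting Skim Milk (4%) is dropped; lowering the level-2 threshold to 3% keeps it.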
A top-down, progressive deepening approach:
■ First find high-level strong rules: milk → bread [20%, 60%].
■ Then find their lower-level “weaker” rules: full fat milk → wheat bread [6%, 50%].
Variations in mining multiple-level association rules
■ Level-crossed association rules: full fat milk → Wonder wheat bread
■ Association rules with multiple, alternative hierarchies: full fat milk → Wonder bread
Multi-Dimensional Association: Concepts
Single-dimensional rules:
buys(X, “milk”) ⇒ buys(X, “bread”)
Multi-dimensional rules: ≥ 2 dimensions or predicates
■ Inter-dimension association rules (no repeated predicates)
age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
■ Hybrid-dimension association rules (repeated predicates)
age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
Categorical Attributes
■ finite number of possible values, no ordering among values
Quantitative Attributes
■ numeric, implicit ordering among values
Numeric attributes are dynamically discretized
■ such that the confidence or compactness of the rules mined is maximized.
2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
Cluster “adjacent” association rules to form general rules using a 2-D grid.
Example:
age(X, “30-34”) ∧ income(X, “24K - 48K”) ⇒ buys(X, “high resolution TV”)
Whether a rule is interesting or not can be assessed either subjectively or objectively.
Uninteresting rules occur especially when mining at low support thresholds or mining for long patterns.
Objective measures
Two popular measurements:
● support; and
● confidence
Subjective measures
A rule (pattern) is interesting if it is
● actionable (the user can do something with it)
Example of a misleading “strong” association rule
■ Analyze transactions of AllElectronics data about computer games and videos
■ Of the 10,000 transactions analyzed:
   6,000 of the transactions include computer games
   7,500 of the transactions include videos
   4,000 of the transactions include both
■ Suppose that min_sup = 30% and min_confidence = 60%
■ The following association rule is discovered:
buys(X, “computer games”) ⇒ buys(X, “videos”) [support = 40%, confidence = 66%]
This rule is strong, but it is misleading.
The probability of purchasing videos is 75%, which is even larger than 66%.
In fact, computer games and videos are negatively associated
■ the purchase of one of these items actually decreases the likelihood of purchasing the other
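The numbers behind this argument can be reproduced directly from the transaction counts. The sketch below uses only the counts and thresholds quoted in the example; the variable names are illustrative.

```python
# Counts from the AllElectronics example.
n_total = 10_000
n_games = 6_000
n_videos = 7_500
n_both = 4_000

support = n_both / n_total      # P(games and videos)
confidence = n_both / n_games   # P(videos | games)
p_videos = n_videos / n_total   # P(videos), regardless of games

# The rule passes both thresholds (min_sup = 30%, min_confidence = 60%) ...
is_strong = support >= 0.30 and confidence >= 0.60
# ... yet knowing that a customer buys games lowers the chance of videos.
lowers_likelihood = confidence < p_videos
```

Here `support` is 0.4, `confidence` is about 0.667 and `p_videos` is 0.75, so the rule is strong and still points in the wrong direction.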
The confidence of a rule A ⇒ B can be deceiving
■ It is only an estimate of the conditional probability of itemset B given itemset A.
■ It does not measure the real strength of the correlation and implication between A and B.
Need to use Correlation Analysis
Association Analysis to Correlation Analysis
The support and confidence measures are insufficient at filtering out uninteresting association rules.
This leads to correlation rules of the form
A ⇒ B [support, confidence, correlation]
A correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B.
There are many different correlation measures.
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.
Contingency table for X → Y
       Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|
f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y
Used to define various measures
● support, confidence, lift, Gini, J-measure, etc.
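The four cells can be derived from simple counts, as in the sketch below. It reuses the games/videos counts from the earlier example (X = computer games, Y = videos); the helper name `contingency` and its parameters are assumptions of this illustration.

```python
def contingency(n_x, n_y, n_xy, n_total):
    """Return the 2x2 cell counts (f11, f10, f01, f00) for rule X -> Y."""
    f11 = n_xy                        # X and Y
    f10 = n_x - n_xy                  # X and not Y
    f01 = n_y - n_xy                  # not X and Y
    f00 = n_total - n_x - n_y + n_xy  # not X and not Y
    return f11, f10, f01, f00

# X = computer games, Y = videos, counts from the AllElectronics example.
f11, f10, f01, f00 = contingency(n_x=6000, n_y=7500, n_xy=4000, n_total=10000)
total = f11 + f10 + f01 + f00

support = f11 / total           # fraction of transactions with X and Y
confidence = f11 / (f11 + f10)  # f11 / f1+
```

The row and column sums (f1+, f0+, f+1, f+0) follow by adding the corresponding cells.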
Example: Lift/Interest
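Lift is a simple correlation measure.
The occurrence of itemset A is independent of the occurrence of itemset B if
P(A ∪ B) = P(A) × P(B)
Otherwise, itemsets A and B are dependent and correlated as events.
lift(A, B) = P(A ∪ B) / (P(A) × P(B))
Here P(A ∪ B) denotes the probability that a transaction contains both A and B.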
Statistical Independence
Population of 1000 students
■ 600 students know how to swim (S)
■ 700 students know how to bike (B)
■ P(S) × P(B) = 0.6 × 0.7 = 0.42
Case 1: 420 students know how to swim and bike (S, B)
■ P(S ∧ B) = 420/1000 = 0.42
■ P(S ∧ B) = P(S) × P(B): S and B are statistically independent
Case 2: 500 students know how to swim and bike (S, B)
■ P(S ∧ B) = 500/1000 = 0.5
■ P(S ∧ B) > P(S) × P(B): S and B are positively correlated
Case 3: 300 students know how to swim and bike (S, B)
■ P(S ∧ B) = 300/1000 = 0.3
■ P(S ∧ B) < P(S) × P(B): S and B are negatively correlated
Example: Lift/Interest
If the resulting lift value is greater than 1,
■ A and B are positively correlated
If the resulting value is less than 1,
■ A and B are negatively correlated
If the resulting value is equal to 1,
■ A and B are independent and there is no correlation between them
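These three cases can be tried out on the swim/bike population of 1000 students (600 swimmers, 700 bikers). The sketch below is illustrative: `lift` and `interpret` are names chosen here, and the small tolerance is an assumption added to absorb floating-point rounding.

```python
def lift(p_a, p_b, p_ab):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)

def interpret(value, tol=1e-9):
    """Map a lift value to the three cases; `tol` absorbs rounding error."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"

# P(S) = 0.6, P(B) = 0.7; the number who both swim and bike varies.
cases = {
    n_both: interpret(lift(0.6, 0.7, n_both / 1000))
    for n_both in (420, 500, 300)
}
```

With 420 students doing both, the lift is 1; with 500 it is about 1.19; with 300 it is about 0.71.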
Example: Lift/Interest
        Coffee   ¬Coffee
Tea       15        5       20
¬Tea      75        5       80
          90       10      100
15: number of people that drink coffee and tea
75: number of people that drink coffee but not tea
90: number of people that drink coffee
20: number of people that drink tea
Association Rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 15/20 = 0.75
but P(Coffee) = 0.9
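Putting the two probabilities together gives the lift of the rule. The worked step below is added for illustration and uses only the values from the table:

```latex
\mathrm{lift}(\text{Tea}, \text{Coffee})
  = \frac{P(\text{Coffee} \mid \text{Tea})}{P(\text{Coffee})}
  = \frac{0.75}{0.9} \approx 0.83 < 1
```

A lift below 1 indicates that tea drinkers are slightly less likely than the overall population to drink coffee, despite the high confidence of the rule.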
Example: Lift/Interest
play basketball ⇒ eat cereal [40%, 66.7%] is misleading
■ The overall percentage of students eating cereal is 75% > 66.7%.
play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.
Measure of dependent/correlated events: lift
             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
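The lift of both rules can be computed from the four cell counts. The sketch below is one way to do it; the dictionary layout and the helper names `marginal` and `lift` are choices made for this illustration.

```python
# Cell counts from the table, keyed by (sport, food).
table = {
    ("basketball", "cereal"): 2000,
    ("not basketball", "cereal"): 1750,
    ("basketball", "not cereal"): 1000,
    ("not basketball", "not cereal"): 250,
}
total = sum(table.values())  # 5000 students

def marginal(label, position):
    """Sum the cells whose key has `label` at `position` (0 = sport, 1 = food)."""
    return sum(n for key, n in table.items() if key[position] == label)

def lift(sport, food):
    """Joint probability divided by the product of the two marginals."""
    p_joint = table[(sport, food)] / total
    return p_joint / ((marginal(sport, 0) / total) * (marginal(food, 1) / total))

lift_cereal = lift("basketball", "cereal")          # about 0.89
lift_not_cereal = lift("basketball", "not cereal")  # about 1.33
```

The first rule has lift below 1 and the second has lift above 1, which agrees with the remark that the second rule is the more accurate one.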
Example: χ²
To compute the χ² value, we take the squared difference between the observed and expected value for a slot (A and B pair) in the contingency table, divided by the expected value. This amount is summed over all slots of the table:
χ² = Σ (observed − expected)² / expected
Example: χ²
The χ² value is greater than one, and the observed value of the slot (game, video) = 4,000 is less than the expected value 4,500, so buying games and buying videos are negatively correlated.
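The χ² computation for the games/videos data can be sketched as follows. The cell counts other than 4,000 are derived here from the totals given earlier (6,000 games, 7,500 videos, 10,000 transactions), and the function name `chi_square` is an assumption of this example.

```python
# Observed counts, keyed by (row, column); derived from the example totals.
observed = {
    ("game", "video"): 4000,
    ("game", "no video"): 2000,
    ("no game", "video"): 3500,
    ("no game", "no video"): 500,
}
total = sum(observed.values())

def chi_square(observed, total):
    """Sum (observed - expected)^2 / expected over all slots of the table."""
    rows = {r for r, _ in observed}
    cols = {c for _, c in observed}
    row_sum = {r: sum(n for (rr, _), n in observed.items() if rr == r) for r in rows}
    col_sum = {c: sum(n for (_, cc), n in observed.items() if cc == c) for c in cols}
    chi2 = 0.0
    for (r, c), obs in observed.items():
        # Expected count under independence of the row and column events.
        expected = row_sum[r] * col_sum[c] / total
        chi2 += (obs - expected) ** 2 / expected
    return chi2

chi2 = chi_square(observed, total)
expected_game_video = 6000 * 7500 / total
```

For this table the expected count of the (game, video) slot is 4,500 and the χ² value comes out at roughly 555.6.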
“Buy walnuts ⇒ buy milk [1%, 80%]” is misleading
■ if 85% of customers buy milk
Support and confidence are not good at representing correlations.
             Milk     No Milk   Sum (row)
Coffee       m, c     ~m, c     c
No Coffee    m, ~c    ~m, ~c    ~c
Sum (col.)   m        ~m        Σ