DATA MINING
CSE -4229
Sajal Halder
What Is Frequent Pattern Analysis?
■ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
■ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
■ Motivation: Finding inherent regularities in data
■ What products were often purchased together?— Beer and diapers?!
■ What are the subsequent purchases after buying a PC?
■ What kinds of DNA are sensitive to this new drug?
■ Can we automatically classify web documents?
■ Applications
■ Basket data analysis, cross-marketing, catalog design, sale campaign
Why Is Freq. Pattern Mining Important?
■ Discloses an intrinsic and important property of data sets
■ Forms the foundation for many essential data mining tasks
■ Association, correlation, and causality analysis
■ Sequential, structural (e.g., sub-graph) patterns
■ Pattern analysis in spatiotemporal, multimedia,
time-series, and stream data
■ Classification: associative classification
■ Cluster analysis: frequent pattern-based clustering
■ Data warehousing: iceberg cube and cube-gradient
■ Semantic data compression: fascicles
Market-Basket Data
■
A large set of
items
, e.g., things sold in a
supermarket.
■
A large set of baskets, each of which is a small
Market-Baskets – (2)
■
Really, a general many-to-many mapping
(association) between two kinds of things, where
the one (the baskets) is a set of the other (the
items
)
■
But we ask about connections among “items,”
not “baskets.”
■
The technology focuses on
common events
, not
• Given a set of transactions, find combinations of items (itemsets) that occur frequently
Market-Basket transactions
{Bread}: 4 {Milk} : 4 {Diaper} : 4 {Beer}: 3
{Diaper, Beer} : 3 {Milk, Bread} : 3
Frequent Itemsets
Applications – (1)
■
Items
= products; baskets = sets of products
someone bought in one trip to the store.
■
Example
application
: given that many people buy
beer and diapers together:
■
Run a sale on diapers; raise price of beer.
Applications – (2)
■
Baskets = Web pages;
items
= words.
■
Example
application:
Unusual words appearing
together in a large number of documents, e.g.,
“Brad” and “Angelina,” may indicate an
Applications – (3)
■
Baskets = sentences;
items
= documents
containing those sentences.
■
Example
application:
Items that appear together
too often could represent plagiarism.
Definition: Frequent Itemset
■ Itemset
■ A collection of one or more items
■ Example: {Milk, Bread, Diaper}
■ k-itemset
■ An itemset that contains k items
■ Support count (σ)
■ Frequency of occurrence of an itemset ■ E.g. σ({Milk, Bread,Diaper}) = 2
■ Support
■ Fraction of transactions that contain an
itemset
■ E.g. s({Milk, Bread, Diaper}) = 2/5=40%
■ Frequent Itemset
■ An itemset whose support is greater than
Association Rule Mining
■ Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs,Coke}, {Beer, Bread} → {Milk},
Definition: Association Rule
Let D be database of
transactions
■
e.g.:
■
Let I be the set of items that appear in the
database, e.g., I={A,B,C,D,E,F}
■
A
rule
is defined by X
⇒
Y, where X
⊂
I, Y
⊂
I,
and X
∩
Y=
∅
Definition: Association Rule
Exampl e:
Association Rule
■ An implication expression of the
form X → Y, where X and Y are itemsets
■ Example:
{Milk, Diaper} → {Beer}
Rule Evaluation Metrics ■ Support (s)
Fraction of transactions that contain both X and Y
■ Confidence (c)
Measures how often items in Y appear in transactions that
Rule Measures: Support and Confidence
Find all the rules X ⇒ Y with minimum confidence and support
■ support, s, probability that a
transaction contains {X ∪ Y}
■ confidence, c, conditional
probability that a transaction having X also contains Y
Let minimum support 50%, and minimum confidence 50%, we have
■ A ⇒ C (50%, 66.6%) ■ C ⇒ A (50%, 100%)
Basic Concepts: Frequent Patterns and
Association Rules
■ Let sup
min = 50%, confmin = 50%
■ Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
■ Association rules:
■ A D (60%, 100%) ■ D A (60%, 75%)
Transaction-id Items bought
TID date items_bought
100 10/10/99 {F,A,D,B}
200 15/10/99 {D,A,C,E,B}
300 19/10/99 {C,A,B,E}
400 20/10/99 {B,A,D}
Example
■
What is the
support
and
confidence
of the rule:
{B,D}
⇒
{A}
Support:
■ percentage of tuples that contain {A,B,D} =
Association Rule Mining Task
■ Given a set of transactions T, the goal of association rule
mining is to find all rules having
■ support ≥ minsup threshold
■ confidence ≥ minconf threshold
■ Brute-force approach:
■ List all possible association rules
■ Compute the support and confidence for each rule
■ Prune rules that fail the minsup and minconf thresholds
Mining Association Rules
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67) {Milk,Beer} → {Diaper} (s=0.4, c=1.0) {Diaper,Beer} → {Milk} (s=0.4, c=0.67) {Beer} → {Milk,Diaper} (s=0.4, c=0.67) {Diaper} → {Milk,Beer} (s=0.4, c=0.5) {Milk} → {Diaper,Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
Mining Association Rules
■
Two-step approach:
■
Frequent Itemset Generation
■ Generate all itemsets whose support ≥ minsup
■
Rule Generation
■ Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a frequent itemset
■
Frequent
itemset
generation
is
still
Scalable Methods for Mining Frequent Patterns
■ The downward closure property of frequent patterns
■ Any subset of a frequent itemset must be frequent
■ If {beer, diaper, nuts} is frequent, so is {beer,
diaper}
■ i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
■ Scalable mining methods: Three major approaches
■ Apriori (Agrawal & Srikant@VLDB’94)
■ Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
■ Vertical data format approach (Charm—Zaki & Hsiao
The Apriori Principle
•
Apriori
principle (Main observation):
– If an itemset is frequent, then all of its subsets must also be frequent
– If an itemset is not frequent, then all of its supersets
cannot be frequent
– The support of an itemset never exceeds the support of its subsets
Illustration of the Apriori principle
Found to be frequent
Illustration of the Apriori principle
Found to be
Infrequent
Pruned
Apriori: A Candidate Generation-and-Test Approach
■ Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
■ Method:
■ Initially, scan DB once to get frequent 1-itemset
■ Generate length (k+1) candidate itemsets from length k
frequent itemsets
■ Test the candidates against DB
■ Terminate when no frequent or candidate set can be
The Apriori Algorithm—An Example
Database TDB
1st scan
C1 L1
L2
C2 C
2
2nd scan
C3 3rd scan L3 Tid Items
10 A, C, D 20 B, C, E 30 A, B, C, E 40 B, E
Itemset sup {A} 2 {B} 3 {C} 3 {D} 1 {E} 3 Itemset sup {A} 2 {B} 3 {C} 3 {E} 3 Itemset {A, B} {A, C} {A, E} {B, C} {B, E} {C, E} Itemset sup
{A, B} 1 {A, C} 2 {A, E} 1 {B, C} 2 {B, E} 3 {C, E} 2
Itemset sup
{A, C} 2 {B, C} 2 {B, E} 3 {C, E} 2
Itemset
{B, C, E}
Itemset sup
{B, C, E} 2
The Apriori algorithm
Level-wise approach CLk = candidate itemsets of size k
k = frequent itemsets of size k
Candidate generation
Frequent itemset generation
1.
k = 1
,
C
1= all items
2. While
C
knot empty
3. Scan the database to find which itemsets in
C
kare
frequent
and put them into
L
k4. Use
L
kto generate a collection of candidate
itemsets
C
k+1of size
k+1
Items
(1-itemsets)
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Triplets
(3-itemsets)
minsup = 3
Illustration of the Apriori principle
How to Generate Candidates?
■ Suppose the items in L
k-1 are listed in an order
■ Step 1: self-joining L
k-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
■ Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
Generating Candidates C
k+1in SQL
• self-join L
k
insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk from Lk p, Lk q
•
L
3={abc, abd, acd, ace, bcd}
•
Self-join
:
L
3*L
3– abcd from abc and abd
– acde from acd and ace
Example I
item 1 item 2 item 3a b c
a b d
a c d
a c e
b c d
item 1 item 2 item 3
a b c
a b d
a c d
a c e
b c d
•
L
3={abc, abd, acd, ace, bcd}
•
Self-joining
:
L
3*L
3– abcd from abc and abd
– acde from acd and ace
Example I
item 1 item 2 item 3a b c
a b d
a c d
a c e
b c d
item 1 item 2 item 3
a b c
a b d
a c d
a c e
b c d
•
L
3={abc, abd, acd, ace, bcd}
•
Self-joining
:
L
3*L
3– abcd from abc and abd
– acde from acd and ace
Example I
{a,b,c} {a,b,d}
{a,b,c,d} item1 item2 item3
a b c
a b d
a c d
a c e
b c d
item1 item2 item3
a b c
a b d
a c d
a c e
b c d
•
L
3={abc, abd, acd, ace, bcd}
•
Self-joining
:
L
3*L
3– abcd from abc and abd
– acde from acd and ace
{a,c,d} {a,c,e}
{a,c,d,e}
Example I
item 1 item 2 item 3a b c
a b d
a c d
a c e
b c d
item 1 item 2 item 3
a b c
a b d
a c d
a c e
b c d
Example II
{Bread,Diape r{}Bread,Mil
k}
{Diaper, Milk}
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace
• Pruning:
– abcd is kept since all subset itemsets are in L3
– acde is removed because ade is not in L3
• C4={abcd}
{a,c,d} {a,c,e}
{a,c,d,e}
acd ace ade cde
√ √ X
Example I
{a,b,c} {a,b,d}
{a,b,c,d}
abc abd acd bcd
• We have all frequent k-itemsets Lk
• Step 1: self-join Lk
• Create set Ck+1 by joining frequent k-itemsets that share the first k-1 items
• Step 2: prune
• Remove from Ck+1 the itemsets that contain a subset k-itemset that is not frequent
Apriori
■ Step 1: scans all of the transactions in order to count the
Apriori
Apriori
Apriori
■ Step 4: Generate C2- candidate 2-itemsets and scan the
Apriori
Apriori
Apriori
Apriori
Apriori
Apriori
Generate Association Rule
■
Once the frequent itemsets from database have
been found, it is straightforward to generate
strong association rules from them
■
strong association rules
-association rules satisfy
Generate Association Rule
■
Suppose the data contain the frequent itemset
l={I1,I2,I5}
■
The nonempty subsets of l are:
■ {I1, I2}, ■ {I1, I5},
■ {I2, I5}, ■ {I1},
■ {I2},
Generate Association Rule
■ Suppose the data contain the frequent
itemset l={I1,I2,I5}
■ The resulting association rules are as shown