Data Mining
Association Analysis: Basic Concepts
and Algorithms
Lecture Notes for Chapter 6
Introduction to Data Mining
by
Association Rule Mining
●
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs,Coke},
{Beer, Bread} → {Milk},
Implication means co-occurrence,
not causality!
Definition: Frequent Itemset
●
Itemset
–
A collection of one or more items
◆Example: {Milk, Bread, Diaper}
–
k-itemset
◆
An itemset that contains k items
●
Support count (σ)
–
Frequency of occurrence of an itemset
–
E.g. σ({Milk, Bread,Diaper}) = 2
●
Support
–
Fraction of transactions that contain
an itemset
–
E.g. s({Milk, Bread, Diaper}) = 2/5
●
Frequent Itemset
–
An itemset whose support is greater
than or equal to a minsup threshold
Definition: Association Rule
Example:
●
Association Rule
–
An implication expression of the form
X → Y, where X and Y are itemsets
–
Example:
{Milk, Diaper} → {Beer}
●
Rule Evaluation Metrics
–
Support (s)
◆
Fraction of transactions that contain
both X and Y
–
Confidence (c)
◆
Measures how often items in Y
appear in transactions that
Association Rule Mining Task
●
Given a set of transactions T, the goal of
association rule mining is to find all rules having
–
support ≥ minsup threshold
–
confidence ≥ minconf threshold
●
Brute-force approach:
–
List all possible association rules
–
Compute the support and confidence for each rule
–
Prune rules that fail the minsup and minconf
thresholds
Mining Association Rules
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
Mining Association Rules
●
Two-step approach:
1.
Frequent Itemset Generation
–
Generate all itemsets whose support ≥ minsup
2.
Rule Generation
–
Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
●
Frequent itemset generation is still
Frequent Itemset Generation
Given d items, there
are 2
d
possible
Frequent Itemset Generation
●
Brute-force approach:
–
Each itemset in the lattice is a
candidate
frequent itemset
–
Count the support of each candidate by scanning the
database
–
Match each transaction against every candidate
Computational Complexity
●
Given d unique items:
–
Total number of itemsets = 2
d
–
Total number of possible association rules:
Frequent Itemset Generation Strategies
●
Reduce the
number of candidates
(M)
–
Complete search: M=2
d
–
Use pruning techniques to reduce M
●
Reduce the
number of transactions
(N)
–
Reduce size of N as the size of itemset increases
–
Used by DHP and vertical-based mining algorithms
●
Reduce the
number of comparisons
(NM)
–
Use efficient data structures to store the candidates or
transactions
–
No need to match every candidate against every
Reducing Number of Candidates
●
Apriori principle
:
–
If an itemset is frequent, then all of its subsets must also
be frequent
●
Apriori principle holds due to the following property
of the support measure:
–
Support of an itemset never exceeds the support of its
subsets
Found to be
Infrequent
Illustrating Apriori Principle
Pruned
supersets
Illustrating Apriori Principle
Items (1-itemsets)
Pairs (2-itemsets)
(No need to generate
candidates involving Coke
or Eggs)
Triplets (3-itemsets)
Minimum Support = 3
If every subset is considered,
6