Chapter 28
Definitions of Data Mining
The discovery of new information in terms of
patterns or rules from vast amounts of data.
The process of finding interesting structure in
data.
The process of employing one or more computer
Data Warehousing
The data warehouse is a historical database
designed for decision support.
Data mining can be applied to the data in a
warehouse to help with certain types of decisions.
Proper construction of a data warehouse is
Knowledge Discovery in Databases
(KDD)
Data mining is actually one step of a larger
process known as knowledge discovery in
databases (KDD).
The KDD process model comprises six phases:
Data selection
Data cleansing
Enrichment
Data transformation or encoding
Data mining
Goals of Data Mining and Knowledge
Discovery (PICO)
Prediction:
Determine how certain attributes will behave in the
future.
Identification:
Identify the existence of an item, event, or activity.
Classification:
Types of Discovered Knowledge
Association Rules
Classification
Classification
Classification is the process of learning a model
that is able to describe different classes of data.
Learning is supervised as the classes to be
learned are predetermined.
Learning is accomplished by using a training set
of pre-classified data.
The model produced is usually in the form of a
K – Nearest Neighbor Algorithm
Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
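The slide gives only the distance formula. As a minimal sketch, assuming the usual k-NN procedure (find the k training points closest to a query under Euclidean distance, then take a majority vote of their labels), classification could look like the following. The training points, the labels, and the function name `knn_classify` are hypothetical illustrations and do not come from the source.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two 2-D points given as (x, y)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `training` is a list of ((x, y), label) pairs (pre-classified data).
    """
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical pre-classified training set (an assumption, not from the slides).
training = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((5.0, 4.0), "high"), ((6.0, 5.0), "high"), ((5.5, 4.5), "high"),
]

print(knn_classify(training, (1.8, 1.4), k=3))
```

The choice of k is left to the caller here; an odd k avoids ties when there are two classes.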
Association Rules
Association rules are frequently used to generate rules
from market-basket data.
A market basket corresponds to the set of items a
consumer purchases during one visit to a supermarket.
An association rule is of the form X => Y.
X is called the antecedent (left-hand side or LHS) and Y is called the consequent (right-hand side or RHS) of the rule.
Example of a rule:
{onions, potatoes} => {beef}
If found in the sales data of a supermarket, this rule would
indicate that a customer who buys onions and
potatoes together is likely to also buy beef.
Such information can be used as the basis for
Mathematical formulation of Rule
Let I = {i1, i2, ..., in} be a set of n binary attributes called items.
Let D = {t1, t2, ..., tm} be a set of transactions called the database.
Each transaction in D has a unique transaction ID and contains a subset of the items in I.
A rule is defined as an implication of the form X => Y, where X ∩ Y = ∅ and X, Y ⊆ I.
Itemset
A collection of one or more items
Example: {Milk, Butter, Bread}
Association rule:
Support
Every association rule has a support and a confidence.
“The support is the percentage of transactions that demonstrate the rule.”
Example: Database with transactions (customer_#: item_a1, item_a2, …)
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
An itemset is called frequent if its support is equal to or greater than an agreed-upon minimum value, the support threshold.
Applied to the previous example with a threshold of 50%: the frequent itemsets are {1}, {8}, {12}, {1, 8}, and {8, 12}.
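As a rough sketch of how support and the threshold test work, the snippet below computes support as the fraction of transactions containing an itemset, using the four-transaction example database from the slide. The function names and the brute-force enumeration (limited to `max_size` items) are assumptions made for illustration; this is not the Apriori algorithm.

```python
from itertools import combinations

# Transactions from the slide's example (customer number -> items bought).
transactions = {
    1: {1, 3, 5},
    2: {1, 8, 14, 17, 12},
    3: {4, 6, 8, 12, 9, 104},
    4: {2, 1, 8},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def frequent_by_enumeration(transactions, threshold, max_size=2):
    """Brute-force scan of all itemsets with up to `max_size` items."""
    items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= threshold:
                result[candidate] = s
    return result

for itemset, s in frequent_by_enumeration(transactions, 0.5).items():
    print(itemset, f"{s:.0%}")
```

On this data, items 1 and 8 each reach 75% support, while 12 and the pairs (1, 8) and (8, 12) sit exactly at the 50% threshold.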
Association Rules: Confidence
Confidence
An association rule is of the form: X => Y
• X => Y: if someone buys X, he also buys Y
The confidence is the conditional probability that, given X is
present in a transaction, Y will also be present.
Confidence measure, by definition:
conf(X => Y) = supp(X ∪ Y) / supp(X)
We should only consider rules derived from
itemsets with high support, and that also have high confidence.
“A rule with low confidence is not meaningful.”
Example: Database with transactions (customer_#: item_a1, item_a2, …)
1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
conf({5} => {8}) ?
Given the support counts supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4,
conf({5} => {8}) = supp({5,8}) / supp({5}) = 4/5 = 0.8, or 80%.
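The confidence calculation can be sketched in a few lines, assuming confidence is the support count of X and Y together divided by the support count of X. The function names are hypothetical, and the `baskets` list is invented data that only echoes the earlier {onions, potatoes} => {beef} rule; the counts 4 and 5 are the ones quoted in the example.

```python
from fractions import Fraction

def confidence(count_x_and_y, count_x):
    """conf(X => Y) = supp(X u Y) / supp(X), computed from support counts."""
    if count_x == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return Fraction(count_x_and_y, count_x)

def confidence_from_transactions(transactions, x, y):
    """Count the supporting transactions directly, then take the ratio."""
    x, y = set(x), set(y)
    count_x = sum(1 for t in transactions if x <= t)
    count_xy = sum(1 for t in transactions if (x | y) <= t)
    return confidence(count_xy, count_x)

# Support counts quoted in the example: supp({5}) = 5, supp({5,8}) = 4.
print(float(confidence(4, 5)))  # 0.8

# Hypothetical baskets (not from the slides).
baskets = [
    {"onions", "potatoes", "beef"},
    {"onions", "potatoes"},
    {"onions", "potatoes", "beef", "milk"},
    {"milk", "bread"},
]
print(confidence_from_transactions(baskets, {"onions", "potatoes"}, {"beef"}))  # 2/3
```

Using `Fraction` keeps the ratio exact; converting to `float` gives the percentage form used on the slide.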
Association Rules: Confidence
1. Generate all itemsets that have a support that
exceeds the threshold. These sets of items are
called large (or frequent) itemsets. Note that large here means large support.
2. For each large itemset, all the rules that have a
Generating Association Rules:
Apriori Algorithm
Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent.
Generating Association Rules:
Illustrating the Apriori Principle
Items (1-itemsets)
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1
Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Itemset Count
{Bread,Milk} 3
{Bread,Beer} 2
{Bread,Diaper} 3
{Milk,Beer} 2
{Milk,Diaper} 3
{Beer,Diaper} 3
Triplets (3-itemsets)
(No need to generate candidates involving {Bread,Beer} and {Milk,Beer})
Generating Association Rules:
Apriori Algorithm
N = number of attributes
Attribute_list = all attributes
Rule = empty set
For i = 1 to N
    Frequent_item_set = generate itemsets of i attributes using Attribute_list
    Frequent_item_set = remove all infrequent itemsets from Frequent_item_set
    Rule = Rule U generate rules using Frequent_item_set
    Attribute_list = only attributes contained in Frequent_item_set
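The loop above can be sketched in Python as follows. This is a hedged illustration that mirrors the slide's pseudocode (candidates of size i are built from the attributes that survived the previous level) rather than the classic join-and-prune candidate generation, and it stops at frequent itemsets without generating rules. The five transactions and the minimum support count of 3 are assumptions: they were chosen only because they reproduce the item and pair counts shown in the Apriori illustration.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_count):
    """Level-wise search: size-i candidates use only attributes still frequent."""
    attributes = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(attributes) + 1):
        level = {}
        for candidate in combinations(attributes, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_count:
                level[frozenset(candidate)] = count
        if not level:
            break  # no frequent itemset of this size, so none of a larger size
        frequent.update(level)
        # Keep only the attributes contained in some frequent itemset.
        attributes = sorted(set().union(*level))
    return frequent

# Assumed transactions (not given on the slides); they match the counts above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

for itemset, count in apriori_frequent_itemsets(transactions, 3).items():
    print(sorted(itemset), count)
```

Because Coke and Eggs fail the threshold at the first level, no pair containing them is ever counted, which is the saving the Apriori principle provides.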
Clustering
Unsupervised learning or clustering builds
models from data without predefined classes.
The goal is to place records into groups where
the records in a group are highly similar to each other and dissimilar to records in other groups.
The k-Means algorithm is a simple yet effective
Problem
Example
Suppose we have 4 types of medicine, each with two attributes (weight index and pH). Our goal is to group these objects into K = 2 groups of medicines.
Medicine Weight-Index pH
A 1 1
B 2 1
C 4 3
D 5 4
Example
Step 1: Use initial seed points for partitioning
c1 = A = (1, 1), c2 = B = (2, 1)
Euclidean distance, for example for object D:
$d(D, c_1) = \sqrt{(5 - 1)^2 + (4 - 1)^2} = 5$
$d(D, c_2) = \sqrt{(5 - 2)^2 + (4 - 1)^2} = 4.24$
Assign each object to the cluster with the nearest seed point.
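As a small sketch of Step 1, the snippet below computes the Euclidean distance from every medicine to the two seed points and assigns each object to the nearer one. The dictionary layout and the function name `assign` are illustrative assumptions; the coordinates come from the table in the example.

```python
import math

# (weight index, pH) for each medicine, from the example table.
medicines = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
seeds = {"c1": medicines["A"], "c2": medicines["B"]}

def assign(points, centroids):
    """Map each object to the name of its nearest centroid (Euclidean distance)."""
    return {
        name: min(centroids, key=lambda c: math.dist(p, centroids[c]))
        for name, p in points.items()
    }

for name, p in medicines.items():
    d1 = math.dist(p, seeds["c1"])
    d2 = math.dist(p, seeds["c2"])
    print(f"{name}: d(c1) = {d1:.2f}, d(c2) = {d2:.2f}")

print(assign(medicines, seeds))
```

With these seeds, only A stays with c1; B, C, and D are all closer to c2 after the first assignment.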
Example
Step 2: Compute new centroids of the current
partition
Knowing the members of each cluster, now we compute the new
centroid of each group based on these new memberships.
Example
Step 2: Renew membership based on new
centroids
Compute the distance of all objects to the new centroids
Example
Step 3: Repeat the first two steps until
convergence
Knowing the members of each
cluster, now we compute the new centroid of each group based on these new memberships.
Example
Step 3: Repeat the first two steps until
convergence
Compute the distance of all objects to the new centroids
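Putting the steps together, a minimal k-means sketch for the medicine example might look like this. It assumes convergence means that no object changes cluster between two iterations, and it keeps the old centroid if a cluster becomes empty; both choices, along with the function name `k_means` and the `max_iter` cap, are assumptions rather than details given on the slides.

```python
import math

def k_means(points, seeds, max_iter=100):
    """Alternate assignment and centroid update until memberships stop changing."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        # Renew membership: index of the nearest centroid for each point.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no object changed cluster
        assignment = new_assignment
        # Compute new centroids as the mean of each cluster's members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids

points = [(1, 1), (2, 1), (4, 3), (5, 4)]  # medicines A, B, C, D
assignment, centroids = k_means(points, [(1, 1), (2, 1)])  # seeds c1 = A, c2 = B
print(assignment)  # [0, 0, 1, 1]
print(centroids)   # [(1.5, 1.0), (4.5, 3.5)]
```

On this data the procedure settles after the second reassignment, grouping A with B and C with D.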