Data Mining

(1)

(2)

Chapter 28

(3)

Definitions of Data Mining

• The discovery of new information in terms of patterns or rules from vast amounts of data.

• The process of finding interesting structure in data.

• The process of employing one or more computer

(4)

Data Warehousing

• The data warehouse is a historical database designed for decision support.

• Data mining can be applied to the data in a warehouse to help with certain types of decisions.

• Proper construction of a data warehouse is

(5)

Knowledge Discovery in Databases (KDD)

• Data mining is actually one step of a larger process known as knowledge discovery in databases (KDD).

• The KDD process model comprises six phases:

  ◦ Data selection
  ◦ Data cleansing
  ◦ Enrichment
  ◦ Data transformation or encoding
  ◦ Data mining

(6)

Goals of Data Mining and Knowledge Discovery (PICO)

• Prediction:

  ◦ Determine how certain attributes will behave in the future.

• Identification:

  ◦ Identify the existence of an item, event, or activity.

• Classification:

(7)

Types of Discovered Knowledge

• Association Rules

• Classification

(8)

Classification

• Classification is the process of learning a model that is able to describe different classes of data.

• Learning is supervised, as the classes to be learned are predetermined.

• Learning is accomplished by using a training set of pre-classified data.

• The model produced is usually in the form of a

(9)

(10)

(11)

(12)

K – Nearest Neighbor Algorithm

(13)

K – Nearest Neighbor Algorithm

Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
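As a rough illustration of how a k-nearest-neighbor classifier can use a Euclidean distance, the sketch below labels a query point by majority vote among its k closest training points. The function names, the choice of k = 3, and the toy data are all assumptions made for this example; none of them come from the slides.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups of 2-D points.
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b")]

print(knn_classify(train, (2, 2), k=3))  # nearest three points are all "a"
print(knn_classify(train, (6, 5), k=3))  # nearest three points are all "b"
```

Sorting the whole training set is the simplest approach and is adequate for small data; an odd k avoids ties when there are two classes.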

(14)

(15)

(16)

Association Rules

• Association rule mining is frequently used to generate rules from market-basket data.

• A market basket corresponds to the set of items a consumer purchases during one visit to a supermarket.

• An association rule is of the form X => Y.

• X is called the antecedent (left-hand side, or LHS) and Y is called the consequent (right-hand side, or RHS) of the rule.

(17)

Example of rule

{onions, potatoes} => {beef}

• Found in the sales data of a supermarket, this rule would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef.

• Such information can be used as the basis for

(18)

Mathematical formulation of Rule

• Let I = {i1, i2, ..., in} be a set of n binary attributes called items.

• Let D = {t1, t2, ..., tm} be a set of transactions called the database.

• Each transaction in D has a unique transaction ID and contains a subset of the items in I.

• A rule is defined as an implication of the form X => Y, where X ∩ Y = ∅ and X, Y ⊆ I.

• Itemset: a collection of one or more items

  ◦ Example: {Milk, Butter, Bread}

(19)

Association Rules: Support

Every association rule has a support and a confidence.

“The support is the percentage of transactions that demonstrate the rule.”

Example: Database with transactions (customer_# : item_a1, item_a2, …)

1: 1, 3, 5.

2: 1, 8, 14, 17, 12.

3: 4, 6, 8, 12, 9, 104.

4: 2, 1, 8.
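The support of an itemset in this small database can be checked with a few lines of Python. This is a minimal sketch: the function name `support` and the dictionary layout are choices made for illustration, and support is returned as a fraction rather than a percentage.

```python
# Transactions from the example above, keyed by customer number.
transactions = {
    1: {1, 3, 5},
    2: {1, 8, 14, 17, 12},
    3: {4, 6, 8, 12, 9, 104},
    4: {2, 1, 8},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(support({8, 12}, transactions))  # 2 of 4 transactions -> 0.5
print(support({1}, transactions))      # 3 of 4 transactions -> 0.75
print(support({1, 5}, transactions))   # 1 of 4 transactions -> 0.25
```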

(20)

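Association Rules: Support

An itemset is called frequent if its support is greater than or equal to an agreed-upon minimum value, the support threshold.

Continuing the previous example with a support threshold of 50%: the itemsets {1}, {8}, {12}, {1, 8}, and {8, 12} each appear in at least 2 of the 4 transactions, so they are frequent.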
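To see which itemsets of the previous example pass a 50% threshold, one can enumerate them by brute force. The sketch below is only practical for tiny databases, and the function name `frequent_itemsets` is an assumption for this example; it tests every itemset that occurs inside at least one transaction.

```python
from itertools import combinations

# The four transactions of the previous example.
transactions = [
    {1, 3, 5},
    {1, 8, 14, 17, 12},
    {4, 6, 8, 12, 9, 104},
    {2, 1, 8},
]

def frequent_itemsets(transactions, threshold):
    """Brute force: return {itemset: support} for support >= threshold."""
    n = len(transactions)
    seen = set()
    for t in transactions:
        # Every non-empty subset of a transaction is a possible itemset.
        for size in range(1, len(t) + 1):
            seen.update(frozenset(c) for c in combinations(sorted(t), size))
    result = {}
    for itemset in seen:
        supp = sum(1 for t in transactions if itemset <= t) / n
        if supp >= threshold:
            result[itemset] = supp
    return result

for itemset, supp in sorted(frequent_itemsets(transactions, 0.5).items(),
                            key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

With the threshold at 0.5 this yields {1}, {8}, {12}, {1, 8}, and {8, 12}. Enumerating all subsets grows exponentially with transaction size, which is the cost that level-wise methods such as Apriori are designed to avoid.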

(21)

Association Rules: Confidence

Every association rule has a support and a confidence.

An association rule is of the form: X => Y

• X => Y: if someone buys X, he or she also buys Y

The confidence is the conditional probability that, given X present in a transaction, Y will also be present.

Confidence measure, by definition:

conf(X => Y) = supp(X ∪ Y) / supp(X)

(22)

Association Rules: Confidence

We should only consider rules derived from itemsets with high support, and that also have high confidence.

“A rule with low confidence is not meaningful.”

(23)

Association Rules: Confidence

Example: Database with transactions (customer_# : item_a1, item_a2, …)

1: 3, 5, 8.

2: 2, 6, 8.

3: 1, 4, 7, 10.

4: 3, 8, 10.

5: 2, 5, 8.

conf({5} => {8}) ?

supp({5}) = 5, supp({8}) = 7, supp({5, 8}) = 4,

then conf({5} => {8}) = 4/5 = 0.8 or 80%
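The same calculation can be sketched in Python. Two caveats: the function names are chosen for illustration, and only five transactions are listed in the example, so the code below runs on those five rows alone. The counts quoted in the example (5, 7, and 4) presumably refer to a larger database that is not shown, which is why the result on the listed rows differs from 80%.

```python
# Only the five transactions listed in the example; the quoted counts
# supp({5}) = 5, supp({8}) = 7, supp({5, 8}) = 4 cannot come from these alone.
transactions = [
    {3, 5, 8},
    {2, 6, 8},
    {1, 4, 7, 10},
    {3, 8, 10},
    {2, 5, 8},
]

def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def confidence(lhs, rhs, transactions):
    """conf(lhs => rhs) = count(lhs and rhs together) / count(lhs)."""
    lhs_count = count(lhs, transactions)
    if lhs_count == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return count(set(lhs) | set(rhs), transactions) / lhs_count

print(confidence({5}, {8}, transactions))  # 2 / 2 = 1.0 on the listed rows
print(confidence({8}, {5}, transactions))  # 2 / 4 = 0.5 on the listed rows
print(4 / 5)                               # 0.8, using the counts quoted above
```

Note that confidence is not symmetric: {5} => {8} and {8} => {5} share the same joint count but divide by different antecedent counts.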

(24)

Association Rules: Confidence

1. Generate all itemsets that have a support that exceeds the threshold. These sets of items are called large (or frequent) itemsets. Note that large here means large support.

2. For each large itemset, all the rules that have a

(25)

Generating Association Rules: Apriori Algorithm

• Apriori principle:

  ◦ If an itemset is frequent, then all of its subsets must also be frequent.

(26)

Generating Association Rules: Illustrating Apriori Principle

Items (1-itemsets)

Item      Count
Bread     4
Coke      2
Milk      4
Beer      3
Diaper    4
Eggs      1

Pairs (2-itemsets)

No need to generate candidates involving Coke or Eggs.

Itemset            Count
{Bread, Milk}      3
{Bread, Beer}      2
{Bread, Diaper}    3
{Milk, Beer}       2
{Milk, Diaper}     3
{Beer, Diaper}     3

Triplets (3-itemsets)

No need to generate candidates involving {Bread, Beer} and {Milk, Beer}.

(27)

Generating Association Rules: Apriori Algorithm

• N = number of attributes

• Attribute_list = all attributes

• For i = 1 to N

  ◦ Frequent_item_set = generate itemsets of i attributes using Attribute_list

  ◦ Frequent_item_set = remove all infrequent itemsets from Frequent_item_set

  ◦ Rule = Rule ∪ rules generated from Frequent_item_set

  ◦ Attribute_list = only the attributes contained in Frequent_item_set
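The frequent-itemset part of this loop could be sketched in Python roughly as follows. Several things here are assumptions rather than content from the slides: the five transactions were chosen so that they reproduce the item and pair counts of the Apriori illustration, the minimum support count of 3 is inferred from which items that illustration prunes, and the function name `apriori` is hypothetical. Rule generation from the frequent itemsets is left out.

```python
from itertools import combinations

# Assumed transactions: picked to match the item and pair counts in the
# illustration (Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the current candidates and keep only the frequent ones.
        level = {c: count(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, keeping only those
        # whose k-subsets are all frequent (the Apriori pruning step).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return frequent

for itemset, n in sorted(apriori(transactions, 3).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On this data the only 3-itemset that survives pruning is {Bread, Milk, Diaper}, and it occurs in just 2 transactions, so the search stops after the pairs.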

(28)

Clustering

• Unsupervised learning or clustering builds models from data without predefined classes.

• The goal is to place records into groups where the records in a group are highly similar to each other and dissimilar to records in other groups.

• The k-Means algorithm is a simple yet effective

(29)

Example

• Problem

Suppose we have 4 types of medicines, each with two attributes (pH and weight index). Our goal is to group these objects into K = 2 groups of medicines.

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4

(30)

Example

• Step 1: Use initial seed points for partitioning

$c_1 = A, \quad c_2 = B$

Euclidean distance:

$$d(D, c_1) = \sqrt{(5-1)^2 + (4-1)^2} = 5$$

$$d(D, c_2) = \sqrt{(5-2)^2 + (4-1)^2} \approx 4.24$$

Assign each object to the cluster with the nearest seed point.

(31)

Example

• Step 2: Compute new centroids of the current partition

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.
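As a worked instance, assume the first assignment with seeds $c_1 = A = (1, 1)$ and $c_2 = B = (2, 1)$ leaves A alone in cluster 1 and puts B, C, and D in cluster 2, which is what assigning each object to its nearest seed gives for this data. Each new centroid is then the coordinate-wise mean of its members:

$$c_1 = (1, 1), \qquad c_2 = \left(\frac{2 + 4 + 5}{3}, \frac{1 + 3 + 4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right) \approx (3.67, 2.67)$$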

(32)

Example

• Step 2: Renew membership based on new centroids

Compute the distance of all objects to the new centroids

(33)

Example

• Step 3: Repeat the first two steps until convergence

Knowing the members of each cluster, now we compute the new centroid of each group based on these new memberships.

(34)

Example

• Step 3: Repeat the first two steps until convergence

Compute the distance of all objects to the new centroids
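The whole example can be replayed with a short k-means sketch. It assumes Python 3.8 or later for `math.dist`, starts from the seeds A and B as in Step 1, and treats "convergence" as the point where no object changes cluster; the function name `kmeans` and the `max_iter` safeguard are choices made for this illustration.

```python
import math

# Medicine data from the example, as (weight, pH index) pairs.
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def kmeans(points, seeds, max_iter=100):
    """Alternate assignment and centroid update until memberships are stable."""
    names = list(points)
    centroids = list(seeds)  # copy so the caller's list is not modified
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: math.dist(points[n], centroids[j]))
            for n in names
        ]
        if new_assignment == assignment:
            break  # memberships unchanged: converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [points[n] for n, a in zip(names, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return dict(zip(names, assignment)), centroids

labels, centroids = kmeans(points, [points["A"], points["B"]])
print(labels)     # A and B end up together, C and D together
print(centroids)  # final centroids (1.5, 1.0) and (4.5, 3.5)
```

On this data the memberships stop changing after the second reassignment, so the third pass only confirms the grouping {A, B} and {C, D}.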