Data Mining
CS 341, Spring 2007
Lecture 6: Classification - issues, regression, Bayesian classification
© Prentice Hall 2
Review:
- Decision Trees
- Neural Networks
Data Mining Core Techniques
- Classification
- Clustering
- Association Rules
Classification Outline
- Classification Problem Overview
- Classification Techniques
  - Regression
  - Bayesian classification
  - Distance
  - Decision Trees
  - Rules
  - Neural Networks
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms
Classification Outline
- Classification Problem Overview
- Classification Techniques
  - Regression
  - Bayesian classification
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms
Classification Problem
- Given a database $D = \{t_1, t_2, \dots, t_n\}$ and a set of classes $C = \{C_1, \dots, C_m\}$, the Classification Problem is to define a mapping $f: D \to C$ where each $t_i$ is assigned to one class.
- Actually divides D into equivalence classes.
- Prediction is similar, but may be viewed as having an infinite number of classes.
Classification Examples
- Teachers classify students’ grades as A, B, C, D, or F.
- Identify mushrooms as poisonous or edible.
- Predict when a river will flood.
- Identify individuals with credit risks.
- Speech recognition
- Pattern recognition
Classification Ex: Grading
- If x >= 90 then grade = A.
- If 80 <= x < 90 then grade = B.
- If 70 <= x < 80 then grade = C.
- If 60 <= x < 70 then grade = D.
- If x < 60 then grade = F.
[Figure: decision tree that tests x against 90, 80, 70, and 60 in turn; each >= branch ends in a leaf (A, B, C, D) and the final < 60 branch ends in F]
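The grading rules can be read as a chain of threshold tests, which is what the decision tree encodes. A minimal sketch follows; the function name `grade` and the sample scores are illustrative assumptions, not part of the slides.

```python
def grade(x: float) -> str:
    """Map a numeric score to a letter grade using the slide's thresholds."""
    # Each test is only reached if all higher thresholds failed,
    # so "x >= 80" here really means 80 <= x < 90, and so on.
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"


# Hypothetical scores, chosen to sit on or near the class boundaries.
scores = [95, 80, 79.5, 60, 59]
print([grade(s) for s in scores])
```

Because the classes are fixed in advance and every score falls into exactly one of them, this is a mapping f: D → C in the sense of the problem definition.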
Classification Ex: Letter Recognition
View letters as constructed from 5 components:
[Figure: letters A, B, C, D, E, and F, each built from the components]
Classification Techniques
- Approach:
  1. Create specific model by evaluating training data (or using domain experts’ knowledge).
  2. Apply model developed to new data.
- Classes must be predefined
- Most common techniques use DTs, NNs, or are based on distances or statistical methods.
Defining Classes
[Figure: two panels, Partitioning Based and Distance Based]
Issues in Classification
- Missing Data
  - Ignore
  - Replace with assumed value
- Overfitting
  - Large set of training data
  - Filter out erroneous or noisy data
- Measuring Performance
  - Classification accuracy on test data
  - Confusion matrix
  - OC Curve
Classification Accuracy
- True positive (TP)
  - $t_i$ predicted to be in $C_j$ and is actually in it.
- False positive (FP)
  - $t_i$ predicted to be in $C_j$ but is not actually in it.
- True negative (TN)
  - $t_i$ not predicted to be in $C_j$ and is not actually in it.
- False negative (FN)
  - $t_i$ not predicted to be in $C_j$ but is actually in it.
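The four outcomes above are always counted with respect to one target class C_j. The sketch below tallies them for a small made-up example; the function name `outcome_counts` and both label lists are hypothetical and serve only to illustrate the definitions.

```python
def outcome_counts(actual, predicted, target):
    """Count TP, FP, TN, FN for one target class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == target:
            # Predicted to be in the class: correct (TP) or not (FP).
            counts["TP" if a == target else "FP"] += 1
        else:
            # Not predicted to be in the class: missed (FN) or correct (TN).
            counts["FN" if a == target else "TN"] += 1
    return counts


# Hypothetical labels for five tuples.
actual = ["Tall", "Medium", "Tall", "Short", "Medium"]
predicted = ["Tall", "Tall", "Medium", "Short", "Medium"]
print(outcome_counts(actual, predicted, "Tall"))
```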
Classification Performance
[Figure: regions labeled True Positive, False Negative, False Positive, and True Negative]
Confusion Matrix
- An m x m matrix
- Entry $C_{i,j}$ indicates the number of tuples assigned to $C_j$, but where the correct class is $C_i$
- The best solution will only have non-zero values on the diagonal.
Height Example Data
Name       Gender  Height  Output1  Output2
Kristina   F       1.6 m   Short    Medium
Jim        M       2 m     Tall     Medium
Maggie     F       1.9 m   Medium   Tall
Martha     F       1.88 m  Medium   Tall
Stephanie  F       1.7 m   Short    Medium
Bob        M       1.85 m  Medium   Medium
Kathy      F       1.6 m   Short    Medium
Dave       M       1.7 m   Short    Medium
Worth      M       2.2 m   Tall     Tall
Steven     M       2.1 m   Tall     Tall
Debbie     F       1.8 m   Medium   Medium
Todd       M       1.95 m  Medium   Medium
Kim        F       1.9 m   Medium   Tall
Amy        F       1.8 m   Medium   Medium
Wynette    F       1.75 m  Medium   Medium
Confusion Matrix Example
Using the height example data, with Output1 as the correct class and Output2 as the actual assignment:

              Actual Assignment
Membership    Short   Medium   Tall
Short         0       4        0
Medium        0       5        3
Tall          0       1        2
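The matrix can be rebuilt directly from the Output1 and Output2 columns of the height data. This is a minimal sketch; the function name `confusion_matrix` and the accuracy calculation are illustrative additions, with rows indexed by the correct class and columns by the assigned class as defined earlier.

```python
CLASSES = ["Short", "Medium", "Tall"]

# Output1 (correct class) and Output2 (assigned class), in table order.
output1 = ["Short", "Tall", "Medium", "Medium", "Short", "Medium", "Short", "Short",
           "Tall", "Tall", "Medium", "Medium", "Medium", "Medium", "Medium"]
output2 = ["Medium", "Medium", "Tall", "Tall", "Medium", "Medium", "Medium", "Medium",
           "Tall", "Tall", "Medium", "Medium", "Tall", "Medium", "Medium"]


def confusion_matrix(correct, assigned, classes):
    """Entry [i][j] counts tuples of correct class i that were assigned to class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for c, a in zip(correct, assigned):
        matrix[index[c]][index[a]] += 1
    return matrix


matrix = confusion_matrix(output1, output2, CLASSES)
# Accuracy is the share of tuples on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(CLASSES))) / len(output1)

for name, row in zip(CLASSES, matrix):
    print(f"{name:<8}{row}")
print(f"accuracy = {accuracy:.3f}")
```

Only 7 of the 15 tuples sit on the diagonal, which shows how far Output2 is from the ideal case where all off-diagonal entries are zero.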
Operating Characteristic Curve
[Figure: operating characteristic curve]
Regression
- Assume data fits a predefined function
- Determine best values for parameters in the model
- Estimate an output value based on input values
- Can be used for classification and prediction
Linear Regression
- Assume the relation of the output variable to the input variables is a linear function of some parameters.
- Determine best values for regression coefficients $c_0, c_1, \dots, c_n$.
- Assume an error: $y = c_0 + c_1 x_1 + \dots + c_n x_n + \epsilon$
- Estimate error using the mean squared error over the training set.
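The formula for the mean squared error did not survive on this slide. One standard way to write it is shown below, under assumed notation: k is the number of training tuples, and $x_{ji}$ is the value of input variable j in tuple i. Neither symbol appears in the slides.

```latex
\mathrm{MSE} = \frac{1}{k} \sum_{i=1}^{k}
  \Bigl( y_i - \bigl( c_0 + c_1 x_{1i} + \dots + c_n x_{ni} \bigr) \Bigr)^2
```

Choosing the coefficients $c_0, \dots, c_n$ that minimize this quantity is the usual least-squares reading of "determine best values".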
Example 4.3
- $y = c_0 + \epsilon$
- Find the value for $c_0$ that best partitions the height values into classes: short and medium
- The training data for $y_i$ is {1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75}
- How?
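One possible answer to "How?", assuming the least-squares criterion: for the model y = c_0 + ε, the value of c_0 that minimizes the sum of squared errors is the mean of the training values. The sketch below computes it and uses it as a dividing line; the `classify` helper and the rule "below c_0 is short" are illustrative assumptions rather than something the slide states.

```python
heights = [1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75]

# Least squares for a constant model: the minimizer is the sample mean.
c0 = sum(heights) / len(heights)


def classify(height, threshold=c0):
    """Assumed division rule: below the fitted constant is short, otherwise medium."""
    return "short" if height < threshold else "medium"


print(f"c0 = {c0:.3f}")
print([(h, classify(h)) for h in (1.7, 1.8)])
```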
Example 4.4
- $y = c_0 + c_1 x_1 + \epsilon$
- Find the values for $c_0$ and $c_1$ that best predict the class.
- Assume 0 for the short class, 1 for the medium class
- The training data for $(x_i, y_i)$ is {(1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0), (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1)}
- How?
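A sketch of one way to fit this model, assuming ordinary least squares with the closed-form solution for a single input variable. The class labels are taken from the Output1 column of the height data (0 for short, 1 for medium). The 0.5 cutoff used to turn the fitted value into a class is an assumption for illustration.

```python
# (height, class) pairs: 0 = short, 1 = medium, from Output1 of the height data.
points = [(1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0),
          (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1)]


def fit_line(points):
    """Closed-form least squares for y = c0 + c1 * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


c0, c1 = fit_line(points)


def predict_class(x, cutoff=0.5):
    """Assumed rule: fitted value at or above the cutoff means medium."""
    return "medium" if c0 + c1 * x >= cutoff else "short"


print(f"c0 = {c0:.3f}, c1 = {c1:.3f}")
print(predict_class(1.6), predict_class(1.9))
```

The fitted line is not bounded to [0, 1], so its output is only loosely interpretable as class membership.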
Linear Regression Poor Fit
[Figure: linear regression poor fit]
Classification Using Regression
- Division: Use regression function to divide area into regions.
- Prediction: Use regression function to predict a class membership function.
[Figure: Division]
[Figure: Prediction]
Logistic Regression
- A generalized linear model
- Extensively used in the medical and social sciences
- It has the following form:
  $\log_e\left(\frac{p}{1-p}\right) = c_0 + c_1 x_1 + \dots + c_k x_k$
  where p is the probability of being in the class and 1 - p is the probability of not being in it.
- The parameters $c_0, c_1, \dots, c_k$ are usually estimated by maximum likelihood (maximizing the probability of observing the given values).
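Solving the logistic form for p gives p = 1 / (1 + e^(-z)), where z is the linear combination on the right-hand side. The sketch below evaluates that for one input; the coefficient values are hypothetical and chosen only for illustration, and no maximum-likelihood fitting is performed here.

```python
import math


def logistic(z):
    """Inverse of the log-odds: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def class_probability(x, coeffs):
    """coeffs = [c0, c1, ..., ck], x = [x1, ..., xk]."""
    z = coeffs[0] + sum(c * xi for c, xi in zip(coeffs[1:], x))
    return logistic(z)


# Hypothetical coefficients for a one-variable model on height in meters.
coeffs = [-30.0, 17.0]
p = class_probability([1.9], coeffs)
print(f"p = {p:.3f}, log-odds = {math.log(p / (1 - p)):.3f}")
```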
Why Logistic Regression
- p is in the range [0, 1]
  - A good model would like to have p values close to 0 or 1
- A linear function is not suitable for p
- Consider the odds p/(1-p)
  - As p increases, the odds p/(1-p) increase
  - The odds lie in the range [0, +∞), which is asymmetric
  - The log odds lie in the range -∞ to +∞, which is symmetric
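A short numeric check of the asymmetry and symmetry claims: the odds of p and of 1 - p are reciprocals, while their log odds differ only in sign. The helper names `odds` and `log_odds` are illustrative.

```python
import math


def odds(p):
    """Odds in favor of the class for a probability 0 <= p < 1."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural log of the odds, defined for 0 < p < 1."""
    return math.log(odds(p))


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p={p:.2f}  odds={odds(p):.3f}  log-odds={log_odds(p):+.3f}")
```

Since the log odds can take any real value, they can be matched to an unbounded linear function in a way that p itself cannot.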
Linear Regression vs. Logistic Regression
[Figure: linear regression vs. logistic regression]
Bayes Theorem
- Posterior Probability: $P(h_1 \mid x_i)$
- Prior Probability: $P(h_1)$
- Bayes Theorem: $P(h_1 \mid x_i) = \frac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$
- Assign probabilities of hypotheses given a data value.
Naïve Bayes Classification
- Assume that the contributions of all attributes are independent and that each contributes equally to the classification problem.
- $t_i$ has m independent attributes $\{x_{i1}, \dots, x_{im}\}$:
  $P(t_i \mid C_j) = \prod_{k=1}^{m} P(x_{ik} \mid C_j)$
Example: using Output1 as the classification result
Uses the Height Example Data, with the Output1 column as the class label.
Example 4.5
- Step 1: Calculate the prior probability
  - P(short) = 4/15 = 0.267
  - P(medium) = 8/15 = 0.533
  - P(tall) = 3/15 = 0.2
- Step 2: Calculate the conditional probability
  - $P(\mathrm{Gender}_i \mid C_j)$, with $\mathrm{Gender}_i$ = F or M and $C_j$ = short, medium, or tall
  - $P(\mathrm{Height}_i \mid C_j)$, with $\mathrm{Height}_i$ in (0, 1.6], (1.6, 1.7], (1.7, 1.8], (1.8, 1.9], (1.9, 2.0], (2.0, ∞)
Example 4.5 (cont’d)
                         Count                    Probability p(x_i | C_j)
Attribute  Value         Short  Medium  Tall      Short  Medium  Tall
Gender     M             1      2       3         1/4    2/8     3/3
           F             3      6       0         3/4    6/8     0/3
Height     (0, 1.6]      2      0       0         2/4    0       0
           (1.6, 1.7]    2      0       0         2/4    0       0
           (1.7, 1.8]    0      3       0         0      3/8     0
           (1.8, 1.9]    0      4       0         0      4/8     0
           (1.9, 2.0]    0      1       1         0      1/8     1/3
           (2.0, ∞)      0      0       2         0      0       2/3
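Steps 1 and 2 amount to counting. The sketch below derives the priors and the conditional probabilities from the height data using exact fractions. The bin labels, the helper names `height_bin` and `conditional`, and the use of `Fraction` are illustrative choices, not part of the slides.

```python
from collections import Counter
from fractions import Fraction

# (gender, height in m, Output1 class) for the 15 tuples of the height data.
rows = [
    ("F", 1.6, "Short"), ("M", 2.0, "Tall"), ("F", 1.9, "Medium"),
    ("F", 1.88, "Medium"), ("F", 1.7, "Short"), ("M", 1.85, "Medium"),
    ("F", 1.6, "Short"), ("M", 1.7, "Short"), ("M", 2.2, "Tall"),
    ("M", 2.1, "Tall"), ("F", 1.8, "Medium"), ("M", 1.95, "Medium"),
    ("F", 1.9, "Medium"), ("F", 1.8, "Medium"), ("F", 1.75, "Medium"),
]

# Upper bounds of the closed-right height ranges, plus an open-ended last range.
BOUNDS = [1.6, 1.7, 1.8, 1.9, 2.0]
LABELS = ["(0,1.6]", "(1.6,1.7]", "(1.7,1.8]", "(1.8,1.9]", "(1.9,2.0]", "(2.0,inf)"]


def height_bin(h):
    """Return the label of the range that contains height h."""
    for upper, label in zip(BOUNDS, LABELS):
        if h <= upper:
            return label
    return LABELS[-1]


# Step 1: priors P(C_j) from class frequencies.
class_counts = Counter(cls for _, _, cls in rows)
priors = {c: Fraction(n, len(rows)) for c, n in class_counts.items()}

# Step 2: counts of each attribute value within each class.
gender_counts = Counter((g, cls) for g, _, cls in rows)
height_counts = Counter((height_bin(h), cls) for _, h, cls in rows)


def conditional(counts, value, cls):
    """P(attribute = value | class = cls) as a relative frequency."""
    return Fraction(counts[(value, cls)], class_counts[cls])


print(priors)
print(conditional(gender_counts, "M", "Medium"))
print(conditional(height_counts, "(1.9,2.0]", "Tall"))
```

`Fraction` reduces automatically, so 2/8 prints as 1/4 and 3/15 as 1/5.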
Example 4.5 (cont’d)
- Given a tuple t = {Adam, M, 1.95 m}
- Step 3: Calculate $P(t \mid C_j)$
  - P(t | short) = 1/4 × 0 = 0
  - P(t | medium) = 2/8 × 1/8 = 0.031
  - P(t | tall) = 3/3 × 1/3 = 0.333
- Step 4: Calculate P(t)
  - P(t) = P(t | short) P(short) + P(t | medium) P(medium) + P(t | tall) P(tall) ≈ 0.0833
Example 4.5 (cont’d)
- Step 5: Calculate $P(C_j \mid t)$ using Bayes Rule
  - P(short | t) = P(t | short) P(short) / P(t) = 0
  - P(medium | t) = 0.2
  - P(tall | t) = 0.8
- Last step:
  - Classify the new tuple as tall.
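Steps 3 to 5 for the tuple {Adam, M, 1.95 m} can be checked with a few lines of exact arithmetic. This sketch hard-codes the priors and the two relevant rows of the conditional probability table (Gender = M, Height in (1.9, 2.0]); the variable names are illustrative.

```python
from fractions import Fraction as F

# Priors and the conditional probabilities that apply to t = {Adam, M, 1.95 m}.
priors = {"short": F(4, 15), "medium": F(8, 15), "tall": F(3, 15)}
p_gender_m = {"short": F(1, 4), "medium": F(2, 8), "tall": F(3, 3)}
p_height_bin = {"short": F(0), "medium": F(1, 8), "tall": F(1, 3)}

# Step 3: naive Bayes likelihood P(t | C_j) as a product over attributes.
likelihood = {c: p_gender_m[c] * p_height_bin[c] for c in priors}

# Step 4: P(t) by summing over the classes.
evidence = sum(likelihood[c] * priors[c] for c in priors)

# Step 5: posterior P(C_j | t) by Bayes rule.
posterior = {c: likelihood[c] * priors[c] / evidence for c in priors}

# Last step: pick the class with the highest posterior.
best = max(posterior, key=posterior.get)

print(f"P(t) = {float(evidence):.4f}")
print({c: float(v) for c, v in posterior.items()})
print(best)
```

Exact fractions avoid rounding of intermediate values. Note that the zero likelihood for the short class comes from a zero count in the table, so that class can never be selected for this tuple regardless of its other attributes.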
A Summary
- Step 1: Calculate the prior probability of each class, $P(C_j)$
- Step 2: Calculate the conditional probability for each attribute value, e.g. $P(\mathrm{Gender}_i \mid C_j)$
- Step 3: Calculate the conditional probability $P(t \mid C_j)$
- Step 4: Calculate the prior probability of a tuple, $P(t)$
- Step 5: Calculate the posterior probability for each class given the tuple, $P(C_j \mid t)$, using Bayes Rule
- Step 6: Classify a tuple based on $P(C_j \mid t)$: the tuple belongs to the class that has the highest posterior probability.
Next Lecture:
- Classification:
  - Distance-based algorithms
  - Decision tree-based algorithms