Data Mining
CS 341, Spring 2007
Lecture 6: Classification – issues, regression, Bayesian classification

Review:
- Decision Trees
- Neural networks

Data Mining Core Techniques
- Classification
- Clustering
- Association Rules

Classification Outline
- Classification Problem Overview
- Classification Techniques
   – Regression
   – Bayesian classification
   – Distance
   – Decision Trees
   – Rules
   – Neural Networks
Goal: Provide an overview of the classification problem and introduce some of the basic algorithms.

Classification Outline
- Classification Problem Overview
- Classification Techniques
   – Regression
   – Bayesian classification

Classification Problem
- Given a database $D = \{t_1, t_2, \dots, t_n\}$ and a set of classes $C = \{C_1, \dots, C_m\}$, the Classification Problem is to define a mapping $f: D \to C$ where each $t_i$ is assigned to one class.
- Actually divides D into equivalence classes.
- Prediction is similar, but may be viewed as having an infinite number of classes.


Classification Examples
- Teachers classify students' grades as A, B, C, D, or F.
- Identify mushrooms as poisonous or edible.
- Predict when a river will flood.
- Identify individuals with credit risks.
- Speech recognition
- Pattern recognition

Classification Ex: Grading
- If x >= 90 then grade = A.
- If 80 <= x < 90 then grade = B.
- If 70 <= x < 80 then grade = C.
- If 60 <= x < 70 then grade = D.
- If x < 60 then grade = F.

[Figure: decision tree that tests x against 90, 80, 70, and 60 in turn, with leaves A, B, C, D, and F.]
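The grading rules above can be sketched as a small function. This is only an illustration of the rule set on the slide; the function name `grade` and the sample scores are made up for the example.

```python
def grade(x: float) -> str:
    """Map a numeric score x to a letter grade using the slide's thresholds."""
    # Test the thresholds from highest to lowest, like the decision tree.
    if x >= 90:
        return "A"
    if x >= 80:
        return "B"
    if x >= 70:
        return "C"
    if x >= 60:
        return "D"
    return "F"


# Illustrative scores (not from the lecture).
scores = [95, 83, 71, 60, 42]
grades = [grade(s) for s in scores]
print(grades)
```

Testing the thresholds in descending order means each branch only needs its lower bound, which is how the decision tree on the slide is organized.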

Classification Ex: Letter Recognition
View letters as constructed from 5 components:
[Figure: the letters A, B, C, D, E, and F, each built from the five components.]

Classification Techniques
- Approach:
   1. Create a specific model by evaluating training data (or using domain experts' knowledge).
   2. Apply the model developed to new data.
- Classes must be predefined.
- Most common techniques use decision trees (DTs), neural networks (NNs), or are based on distances or statistical methods.

Defining Classes
[Figure: two panels, "Partitioning Based" and "Distance Based".]

Issues in Classification
- Missing Data
   – Ignore
   – Replace with assumed value
- Overfitting
   – Large set of training data
   – Filter out erroneous or noisy data
- Measuring Performance
   – Classification accuracy on test data
   – Confusion matrix
   – OC Curve


Classification Accuracy
- True positive (TP)
   – $t_i$ predicted to be in $C_j$ and is actually in it.
- False positive (FP)
   – $t_i$ predicted to be in $C_j$ but is not actually in it.
- True negative (TN)
   – $t_i$ not predicted to be in $C_j$ and is not actually in it.
- False negative (FN)
   – $t_i$ not predicted to be in $C_j$ but is actually in it.
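The four outcomes can be counted for one class of interest as in the sketch below. The helper name `outcome_counts` and the mushroom labels are hypothetical, chosen only to illustrate the definitions.

```python
def outcome_counts(actual, predicted, cls):
    """Count TP, FP, TN, FN for class `cls` over paired label lists."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == cls and a == cls:
            counts["TP"] += 1   # predicted in the class, actually in it
        elif p == cls:
            counts["FP"] += 1   # predicted in the class, not actually in it
        elif a == cls:
            counts["FN"] += 1   # not predicted in the class, actually in it
        else:
            counts["TN"] += 1   # not predicted in the class, not in it
    return counts


# Made-up labels for illustration only.
actual = ["poisonous", "edible", "edible", "poisonous", "edible", "poisonous"]
predicted = ["poisonous", "poisonous", "edible", "edible", "edible", "poisonous"]
results = outcome_counts(actual, predicted, "poisonous")
print(results)
```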

Classification Performance
[Figure: four panels labeled True Positive, True Negative, False Positive, and False Negative.]

Confusion Matrix
- An m x m matrix
- Entry $C_{i,j}$ indicates the number of tuples assigned to $C_j$, but where the correct class is $C_i$.
- The best solution will only have non-zero values on the diagonal.

Height Example Data

Name       Gender  Height   Output1  Output2
Kristina   F       1.6 m    Short    Medium
Jim        M       2 m      Tall     Medium
Maggie     F       1.9 m    Medium   Tall
Martha     F       1.88 m   Medium   Tall
Stephanie  F       1.7 m    Short    Medium
Bob        M       1.85 m   Medium   Medium
Kathy      F       1.6 m    Short    Medium
Dave       M       1.7 m    Short    Medium
Worth      M       2.2 m    Tall     Tall
Steven     M       2.1 m    Tall     Tall
Debbie     F       1.8 m    Medium   Medium
Todd       M       1.95 m   Medium   Medium
Kim        F       1.9 m    Medium   Tall
Amy        F       1.8 m    Medium   Medium
Wynette    F       1.75 m   Medium   Medium

Confusion Matrix Example
Using height data example with Output1 (correct) and Output2 (actual) assignment

Actual        Assignment
Membership    Short  Medium  Tall
Short         0      4       0
Medium        0      5       3
Tall          0      1       2
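The matrix above can be reproduced from the Height Example Data with a short sketch. The names `PAIRS` and `confusion_matrix` are illustrative; the label pairs are the Output1 and Output2 columns in table order. The accuracy line is one common summary (the diagonal sum over the total), added here as an assumption rather than something computed on the slide.

```python
CLASSES = ["Short", "Medium", "Tall"]

# (Output1 = correct class, Output2 = assigned class), in table row order.
PAIRS = [
    ("Short", "Medium"), ("Tall", "Medium"), ("Medium", "Tall"),
    ("Medium", "Tall"), ("Short", "Medium"), ("Medium", "Medium"),
    ("Short", "Medium"), ("Short", "Medium"), ("Tall", "Tall"),
    ("Tall", "Tall"), ("Medium", "Medium"), ("Medium", "Medium"),
    ("Medium", "Tall"), ("Medium", "Medium"), ("Medium", "Medium"),
]


def confusion_matrix(pairs, classes):
    """Rows index the correct class, columns index the assigned class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for correct, assigned in pairs:
        matrix[index[correct]][index[assigned]] += 1
    return matrix


matrix = confusion_matrix(PAIRS, CLASSES)
# Fraction of tuples on the diagonal, i.e. assigned to their correct class.
accuracy = sum(matrix[i][i] for i in range(len(CLASSES))) / len(PAIRS)
for name, row in zip(CLASSES, matrix):
    print(name, row)
```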

Operating Characteristic Curve



Regression
- Assume data fits a predefined function.
- Determine best values for parameters in the model.
- Estimate an output value based on input values.
- Can be used for classification and prediction.

Linear Regression
- Assume the relation of the output variable to the input variables is a linear function of some parameters.
- Determine best values for regression coefficients $c_0, c_1, \dots, c_n$.
- Assume an error: $y = c_0 + c_1 x_1 + \dots + c_n x_n + \epsilon$
- Estimate error using mean squared error for the training set.
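The error formula from the original slide did not survive extraction. The usual definition of mean squared error, given here as an assumption about what the slide showed, averages the squared residuals over k training tuples (k and the double index $x_{ij}$ are notation introduced for this sketch):

```latex
\mathrm{MSE} = \frac{1}{k} \sum_{i=1}^{k}
  \left( y_i - \left( c_0 + c_1 x_{i1} + \dots + c_n x_{in} \right) \right)^2
```

The "best" coefficients are then the ones that make this quantity as small as possible on the training set.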

Example 4.3
- $y = c_0 + \epsilon$
- Find the value for $c_0$ that best partitions the height values into classes: short and medium.
- The training data for $y_i$ is {1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75}.
- How?
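One way to answer the question, assuming "best" means minimizing the sum of squared errors: for the model $y = c_0 + \epsilon$, setting the derivative of $\sum_i (y_i - c_0)^2$ to zero gives the mean of the training values. The sketch below checks this numerically. The helper `sse` is a name introduced here, and reading heights below $c_0$ as short and above it as medium is an interpretation, not something stated on the slide.

```python
heights = [1.6, 1.9, 1.88, 1.7, 1.85, 1.6, 1.7, 1.8, 1.95, 1.9, 1.8, 1.75]


def sse(c0, ys):
    """Sum of squared errors of the constant model y = c0."""
    return sum((y - c0) ** 2 for y in ys)


# The least-squares estimate of a constant is the sample mean.
c0 = sum(heights) / len(heights)
print(round(c0, 3))

# Nudging c0 in either direction should not reduce the error.
print(sse(c0, heights) <= sse(c0 + 0.01, heights))
print(sse(c0, heights) <= sse(c0 - 0.01, heights))
```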

Example 4.4
- $y = c_0 + c_1 x_1 + \epsilon$
- Find the values for $c_0$ and $c_1$ that best predict the class.
- Assume 0 for the short class, 1 for the medium class.
- The training data for $(x_i, y_i)$, with $y_i$ taken from Output1 of the Height Example Data, is
  {(1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0), (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1)}
- How?
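A minimal least-squares sketch for this example follows. It assumes the class labels come from Output1 in the Height Example Data (0 = short, 1 = medium), uses the closed-form solution for one input variable, and classifies with a 0.5 cutoff. The cutoff and the names `fit_line` and `predict_class` are assumptions made for illustration.

```python
# (height, class) with 0 = short and 1 = medium, labels taken from Output1.
data = [
    (1.6, 0), (1.9, 1), (1.88, 1), (1.7, 0), (1.85, 1), (1.6, 0),
    (1.7, 0), (1.8, 1), (1.95, 1), (1.9, 1), (1.8, 1), (1.75, 1),
]


def fit_line(points):
    """Closed-form least squares for y = c0 + c1 * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


c0, c1 = fit_line(data)


def predict_class(x):
    """Assumed rule: a fitted value of at least 0.5 means 'medium'."""
    return "medium" if c0 + c1 * x >= 0.5 else "short"


print(c0, c1)
print(predict_class(1.6), predict_class(1.9))
```

The fitted line is not confined to [0, 1], which is one reason the later slides turn to logistic regression.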

Linear Regression Poor Fit


Classification Using Regression
- Division: Use regression function to divide area into regions.
- Prediction: Use regression function to predict a class membership function.

[Figure: Division]

[Figure: Prediction]

Logistic Regression
- A generalized linear model
- Extensively used in the medical and social sciences
- It has the following form:
  $\log_e\left(\frac{p}{1 - p}\right) = c_0 + c_1 x_1 + \dots + c_k x_k$
  where p is the probability of being in the class and 1 - p is the probability of not being in it.
- The parameters $c_0, c_1, \dots, c_k$ are usually estimated by maximum likelihood (maximize the probability of observing the given values).
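Solving the log-odds form for p gives the familiar logistic function. The symbol z is introduced here only as shorthand for the linear part:

```latex
z = c_0 + c_1 x_1 + \dots + c_k x_k, \qquad
\log_e\!\left(\frac{p}{1 - p}\right) = z
\;\Longrightarrow\;
\frac{p}{1 - p} = e^{z}
\;\Longrightarrow\;
p = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
```

Whatever value z takes, the resulting p stays strictly between 0 and 1.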

Why Logistic Regression
- p is in the range [0, 1].
   – A good model would like to have p values close to 0 or 1.
- A linear function is not suitable for p.
- Consider the odds p/(1 - p).
   – As p increases, the odds p/(1 - p) increase.
   – The odds are in the range [0, +∞), asymmetric.
   – The log odds lie in the range -∞ to +∞, symmetric.
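The asymmetry of the odds and the symmetry of the log odds can be seen numerically. The function names below are chosen for this sketch; `probability` is the inverse of the log-odds transform.

```python
import math


def odds(p):
    """Odds p / (1 - p), defined for 0 <= p < 1."""
    return p / (1 - p)


def log_odds(p):
    """Natural log of the odds, defined for 0 < p < 1."""
    return math.log(odds(p))


def probability(z):
    """Inverse of log_odds: maps any real z back into (0, 1)."""
    return 1 / (1 + math.exp(-z))


for p in (0.1, 0.2, 0.5, 0.8, 0.9):
    print(p, round(odds(p), 3), round(log_odds(p), 3))
```

Complementary probabilities such as 0.2 and 0.8 have very different odds (0.25 versus 4) but log odds of equal size and opposite sign, which is what makes the log odds a natural target for a linear function.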

Linear Regression vs. Logistic Regression

Classification Outline
- Classification Problem Overview
- Classification Techniques
   – Bayesian classification

Bayes Theorem
- Posterior Probability: $P(h_1 | x_i)$
- Prior Probability: $P(h_1)$
- Bayes Theorem: $P(h_1 | x_i) = \frac{P(x_i | h_1) \, P(h_1)}{P(x_i)}$
- Assign probabilities of hypotheses given a data value.

Naïve Bayes Classification
- Assume that the contributions of all attributes are independent and that each contributes equally to the classification problem.
- $t_i$ has m independent attributes $\{x_{i1}, \dots, x_{im}\}$.
- $P(t_i | C_j) = \prod_{k=1}^{m} P(x_{ik} | C_j)$

Example: using Output1 of the Height Example Data as the classification results


Example 4.5
- Step 1: Calculate the prior probability
   – P(short) = 4/15 = 0.267
   – P(medium) = 8/15 = 0.533
   – P(tall) = 3/15 = 0.2
- Step 2: Calculate the conditional probability
   – $P(Gender_i | C_j)$, where $Gender_i$ = F or M and $C_j$ = short, medium, or tall
   – $P(Height_i | C_j)$, where $Height_i$ is in (0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0], or (2.0,∞)


Example 4.5 (cont'd)

Attribute  Value       Count                   Probability p(x_i | C_j)
                       short  medium  tall     short  medium  tall
Gender     M           1      2       3        1/4    2/8     3/3
           F           3      6       0        3/4    6/8     0/3
Height     (0,1.6]     2      0       0        2/4    0       0
           (1.6,1.7]   2      0       0        2/4    0       0
           (1.7,1.8]   0      3       0        0      3/8     0
           (1.8,1.9]   0      4       0        0      4/8     0
           (1.9,2.0]   0      1       1        0      1/8     1/3
           (2.0,∞)     0      0       2        0      0       2/3


Example 4.5 (cont'd)
- Given a tuple t = {Adam, M, 1.95m}
- Step 3: Calculate $P(t | C_j)$
   P(t | short) = 1/4 x 0 = 0
   P(t | medium) = 2/8 x 1/8 = 0.031
   P(t | tall) = 3/3 x 1/3 = 0.333
- Step 4: Calculate P(t)
   P(t) = P(t | short) P(short) + P(t | medium) P(medium) + P(t | tall) P(tall)
        = 0.0826


Example 4.5 (cont'd)
- Step 5: Calculate $P(C_j | t)$ using Bayes Rule
   P(short | t) = P(t | short) P(short) / P(t) = 0
   P(medium | t) = 0.2
   P(tall | t) = 0.799
- Last step:
   – Classify the new tuple as tall.

A Summary
- Step 1: Calculate the prior probability of each class, $P(C_j)$.
- Step 2: Calculate the conditional probability for each attribute value, e.g. $P(Gender_i | C_j)$.
- Step 3: Calculate the conditional probability $P(t | C_j)$.
- Step 4: Calculate the prior probability of a tuple, P(t).
- Step 5: Calculate the posterior probability for each class given the tuple, $P(C_j | t)$, using Bayes Rule.
- Step 6: Classify a tuple based on $P(C_j | t)$: the tuple belongs to the class that has the highest posterior probability.
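The six steps can be sketched end to end on the Height Example Data, using Output1 as the class and the height ranges from Example 4.5. The names (`TRAINING`, `height_bin`, `posteriors`) are introduced for this illustration. The sketch uses exact fractions rather than the rounded intermediate values on the slides, so it yields P(t) = 1/12 (about 0.0833) and P(tall | t) = 0.8 where the slides show 0.0826 and 0.799.

```python
from bisect import bisect_left
from collections import Counter

# (gender, height in metres, Output1 class) in table row order.
TRAINING = [
    ("F", 1.6, "Short"), ("M", 2.0, "Tall"), ("F", 1.9, "Medium"),
    ("F", 1.88, "Medium"), ("F", 1.7, "Short"), ("M", 1.85, "Medium"),
    ("F", 1.6, "Short"), ("M", 1.7, "Short"), ("M", 2.2, "Tall"),
    ("M", 2.1, "Tall"), ("F", 1.8, "Medium"), ("M", 1.95, "Medium"),
    ("F", 1.9, "Medium"), ("F", 1.8, "Medium"), ("F", 1.75, "Medium"),
]

# Upper edges of (0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2.0]; above is (2.0,inf).
EDGES = [1.6, 1.7, 1.8, 1.9, 2.0]


def height_bin(h):
    """Index of the half-open range (lower, upper] that contains h."""
    return bisect_left(EDGES, h)


def posteriors(gender, height, rows):
    """Return P(C_j | t) for every class, following steps 1 to 5."""
    class_counts = Counter(c for _, _, c in rows)
    gender_counts = Counter((g, c) for g, _, c in rows)
    height_counts = Counter((height_bin(h), c) for _, h, c in rows)
    joint = {}
    for c, n_c in class_counts.items():
        prior = n_c / len(rows)                                   # step 1
        p_gender = gender_counts[(gender, c)] / n_c               # step 2
        p_height = height_counts[(height_bin(height), c)] / n_c   # step 2
        joint[c] = p_gender * p_height * prior                    # step 3, times prior
    p_t = sum(joint.values())                                     # step 4
    return {c: j / p_t for c, j in joint.items()}                 # step 5


post = posteriors("M", 1.95, TRAINING)
label = max(post, key=post.get)                                   # step 6
print(post, label)
```

This sketch does no smoothing, so a zero count (such as no short person in the (1.9,2.0] range) forces that class's posterior to 0, exactly as in the worked example.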


Next Lecture:
- Classification:
   – Distance-based algorithms
   – Decision tree-based algorithms