Lecture 11-Classification-M-Detailed

(1)

CSC479

Data Mining

Lecture # 11

Classification

Basic Concepts

Decision Trees

(2)

Catching tax-evasion

Tax-return data for year 2011

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A new tax return for 2012

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Is this a cheating tax return?

An instance of the classification problem: learn a method for discriminating

between records of different classes (cheaters vs non-cheaters)

(3)

What is classification?

- Classification is the task of learning a target function f that maps attribute set x to one of the predefined class labels y

[Table: the tax-return data from the previous slide. Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Cheat is the class.]

One of the attributes is the class attribute. In this case: Cheat

(4)

What is classification? (cont.)

- The target function f is known as a classification model

- Descriptive modeling: Explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes)

- Predictive modeling: Predict the class of a previously unseen record

(5)

Examples of Classification Tasks

- Predicting tumor cells as benign or malignant

- Classifying credit card transactions as legitimate or fraudulent

- Categorizing news stories as finance, weather, entertainment, sports, etc.

- Identifying spam email, spam web pages, adult content

(6)

General approach to classification

- Training set consists of records with known class labels

- Training set is used to build a classification model

- A labeled test set of previously unseen data records is used to evaluate the quality of the model.

- The classification model is applied to new records with unknown class labels

(7)

Illustrating Classification Task

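[Figure: Learn Model (induction) builds a model from the training set using a learning algorithm; Apply Model (deduction) uses that model on the test set.]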

Tid Attrib1 Attrib2 Attrib3 Class

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

10

Tid Attrib1 Attrib2 Attrib3 Class

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

15 No Large 67K ?

Learning algorithm

(8)

Evaluation of classification models

- Counts of test records that are correctly (or incorrectly) predicted by the classification model

- Confusion matrix

                    Predicted Class = 1   Predicted Class = 0
Actual Class = 1    f11                   f10
Actual Class = 0    f01                   f00
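The four counts can be tallied directly from actual and predicted labels. The following is a minimal sketch, assuming the usual convention that f_ij counts records of actual class i predicted as class j; the label lists are hypothetical, and accuracy is added here as one common summary that the slide itself does not name.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return a Counter keyed by (actual class, predicted class)."""
    return Counter(zip(actual, predicted))

# Hypothetical test-set labels (1 and 0 are the two classes).
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

f = confusion_counts(actual, predicted)
f11, f10, f01, f00 = f[(1, 1)], f[(1, 0)], f[(0, 1)], f[(0, 0)]

# Correct predictions are f11 and f00; the other two cells are errors.
accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
print(f11, f10, f01, f00, accuracy)
```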

(9)

Classification Techniques

- Decision Tree based Methods

- Rule-based Methods

- Memory based reasoning

- Neural Networks

- Naïve Bayes and Bayesian Belief Networks

- Support Vector Machines

(11)

Decision Trees

- Decision tree
  - A flow-chart-like tree structure
  - Internal node denotes a test on an attribute
  - Branch represents an outcome of the test
  - Leaf nodes represent class labels or class distributions

(12)

Example of a Decision Tree

Training Data: the tax-return table (Refund and Marital Status categorical, Taxable Income continuous, Cheat the class)

Model: Decision Tree

Splitting attributes: Refund, MarSt, TaxInc. Branches are test outcomes; leaves are class labels.

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

(13)

Another Example of Decision Tree

Training Data: the same tax-return table (Refund and Marital Status categorical, Taxable Income continuous, Cheat the class)

MarSt?
  Married          -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

(14)

Decision Tree Classification Task

[Figure: the training set (Tid 1-10) and test set (Tid 11-15) from the "Illustrating Classification Task" slide, with a decision tree as the model.]

(15)

Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

[Slides 15-20 repeat the same tree and test record while stepping down the tree.]

Refund?
  Yes -> NO
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

1. Refund = No: follow the "No" branch to MarSt
2. MarSt = Married: follow the "Married" branch to the leaf NO

Assign Cheat to “No”
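The walk down the tree can be sketched as a short function. This is an illustrative encoding of the example tree, not code from the lecture; the dictionary keys are names chosen here, and sending an income of exactly 80K to the "> 80K" side is an assumption, since the slide labels the branches only "< 80K" and "> 80K".

```python
def classify(record):
    """Walk the example tree from the root and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Assumption: exactly 80K is treated like the "> 80K" branch.
    return "No" if record["TaxInc"] < 80 else "Yes"

test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
prediction = classify(test_record)
print(prediction)  # prints: No
```

The record never reaches the TaxInc test, so the boundary assumption does not affect this prediction.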


(22)

Tree Induction

- Finding the best decision tree is NP-hard

- Greedy strategy.
  - Split the records based on an attribute test that optimizes a certain criterion.

- Many Algorithms:
  - Hunt’s Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT

(23)

General Structure of Hunt’s Algorithm

- Let D_t be the set of training records that reach a node t

- General Procedure:
  - If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t
  - If D_t contains records with identical attribute values, then t is a leaf node labeled with the majority class y_t
  - If D_t is an empty set, then t is a leaf node labeled by the default class, y_d
  - If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
    • Recursively apply the procedure to each subset.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

D_t
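The general procedure can be sketched as a recursive function. This is a simplified illustration under stated assumptions: only the two categorical attributes of the tax data are used, attributes are tested in the order given (the procedure itself does not say which test to pick), and the names `hunt`, `majority` and `domains` are introduced here.

```python
from collections import Counter

def majority(labels):
    """Most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def hunt(records, attributes, domains, default):
    """records: list of (attribute_dict, label) pairs that reach this node.
    Returns a class label (leaf) or a nested dict {attribute: {value: subtree}}."""
    if not records:                                   # D_t is empty
        return default
    labels = [label for _, label in records]
    if len(set(labels)) == 1:                         # all records in one class
        return labels[0]
    first = records[0][0]
    if not attributes or all(attrs == first for attrs, _ in records):
        return majority(labels)                       # identical attribute values
    # Assumption: test attributes in the order given.
    attr, rest = attributes[0], attributes[1:]
    children = {}
    for value in domains[attr]:
        subset = [(a, y) for a, y in records if a[attr] == value]
        children[value] = hunt(subset, rest, domains, default)
    return {attr: children}

rows = [  # (Refund, Marital Status, Cheat) from the tax-return table
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
records = [({"Refund": r, "MarSt": m}, c) for r, m, c in rows]
domains = {"Refund": ["Yes", "No"], "MarSt": ["Single", "Married", "Divorced"]}
default = majority([c for _, _, c in rows])           # default class y_d
tree = hunt(records, ["Refund", "MarSt"], domains, default)
print(tree)
```

Without Taxable Income, the Single records with Refund = No cannot be separated, so that leaf falls back to the majority class.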

(24)

Tree Induction

- Issues
  - How to classify a leaf node
    • Assign the majority class
    • If the leaf is empty, assign the default class: the most frequent class.
  - Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  - Determine when to stop splitting

(25)

How to Specify Test Condition?

- Depends on attribute types
  - Nominal
  - Ordinal
  - Continuous

- Depends on number of ways to split
  - 2-way split
  - Multi-way split

(26)

Splitting Based on Nominal Attributes

- Multi-way split: Use as many partitions as distinct values.
    CarType -> Family | Sports | Luxury

- Binary split: Divides values into two subsets. Need to find optimal partitioning.
    CarType -> {Family, Luxury} | {Sports}   OR   CarType -> {Sports, Luxury} | {Family}
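To see what "find optimal partitioning" has to search over, the candidate binary splits of a nominal attribute can be enumerated. This is a small sketch; the function name `binary_partitions` is introduced here for illustration. For k distinct values there are 2^(k-1) - 1 such splits, which is 3 for CarType.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield every way to split `values` into two non-empty subsets."""
    values = list(values)
    first, rest = values[0], values[1:]
    # Keeping `first` on the left side avoids listing each partition twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            yield left, right

splits = list(binary_partitions(["Family", "Sports", "Luxury"]))
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```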

(27)

Splitting Based on Ordinal Attributes

- Multi-way split: Use as many partitions as distinct values.
    Size -> Small | Medium | Large

- Binary split: Divides values into two subsets, respecting the order. Need to find optimal partitioning.
    Size -> {Small, Medium} | {Large}   OR   Size -> {Medium, Large} | {Small}

- What about this split?

Size

(28)

Splitting Based on Continuous Attributes

- Different ways of handling
  - Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
  - Binary Decision: (A < v) or (A ≥ v)
    • considers all possible splits and finds the best cut
    • can be more compute intensive
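The binary decision on a continuous attribute can be sketched as an exhaustive search over cut points. This is one possible version under stated assumptions: candidate values of v are midpoints between adjacent sorted values, and "best" means lowest weighted entropy of the two sides (entropy is the impurity measure defined later in the lecture). The data are Taxable Income (in thousands) and Cheat from the tax-return table.

```python
from math import log2

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n) for c in set(labels))

def best_cut(values, labels):
    """Try each midpoint between adjacent distinct sorted values and return
    (threshold, weighted child entropy) for the purest split A < v / A >= v."""
    pairs = sorted(zip(values, labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        v = (a + b) / 2
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (v, score)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
threshold, score = best_cut(income, cheat)
print(threshold, round(score, 3))
```

Every candidate requires a pass over the records, which is why this option can be more compute intensive than a one-time discretization.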

(29)

Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? -> Yes | No

(ii) Multi-way split: Taxable Income? -> < 10K | ...

(30)

How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Own Car?
  Yes: C0: 6, C1: 4
  No:  C0: 4, C1: 6

Car Type?
  Family: C0: 1, C1: 3
  Sports: C0: 8, C1: 0
  Luxury: C0: 1, C1: 7

Student ID?
  c1:  C0: 1, C1: 0
  ...
  c10: C0: 1, C1: 0
  c11: C0: 0, C1: 1
  ...
  c20: C0: 0, C1: 1

Which test condition is the best?

(31)

The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree

We would like to select the attribute which is most useful for classifying examples

For this we need a good quantitative measure

For this purpose a statistical property called information gain is used

(32)

Which Attribute is the Best Classifier?

Definition of Entropy

- In order to define information gain precisely, we begin by defining entropy

- Entropy is a measure commonly used in information theory.

- Entropy characterizes the impurity of an arbitrary collection of examples

(33)

- Entropy (D)
  - Entropy of data set D is denoted by H(D)
  - The C_i are the possible classes
  - p_i = fraction of records from D that have class C_i

\[
\mathrm{Entropy}(D) = H(D) = -\sum_{i=1}^{c} p_i \log_2 p_i
\]

where c is the number of classes.

(34)

Entropy Examples

- Example:
  - 10 records have class A
  - 20 records have class B
  - 30 records have class C
  - 40 records have class D
  - Entropy = -[(.1 log2 .1) + (.2 log2 .2) + (.3 log2 .3) + (.4 log2 .4)]
  - Entropy = 1.846
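The example above can be checked with a few lines of code. This is a minimal sketch, assuming base-2 logarithms (which is what yields 1.846); the function name `entropy` and its count-based signature are choices made here.

```python
from math import log2

def entropy(counts):
    """H(D) = -sum_i p_i * log2(p_i), computed from per-class record counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

h = entropy([10, 20, 30, 40])   # classes A, B, C, D
print(round(h, 3))  # prints: 1.846
```

Classes with zero records are skipped, following the usual convention that 0 log 0 counts as 0.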

(35)

Splitting Criterion

- Example:
  - Two classes, +/-
  - 100 records overall (50 +s and 50 -s)
  - A and B are two binary attributes
    • Records with A=0: 48+, 2-; records with A=1: 2+, 48-
    • Records with B=0: 26+, 24-; records with B=1: 24+, 26-
  - Splitting on A is better than splitting on B: A's partitions are almost pure, while B's stay close to 50/50

(36)

Which Attribute is the Best Classifier?

Information Gain

The expected information needed to classify a tuple in D is H(D), the entropy of D.

How much more information would we still need (after partitioning at attribute A) to arrive at an exact classification? This amount is measured by H(D, A).

Info Gain (D, A) = H(D) – H(D, A)

In general, we write Gain (D, A), where D is the collection of examples and A is an attribute.

(37)

(38)

Information Gain

- Gain of an attribute split: compare the impurity of the parent node with the weighted average impurity of the child nodes

- Maximizing the gain ⇔ Minimizing the weighted average impurity measure of children nodes

\[
\mathrm{InfoGain} = H(\text{parent}) - \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, H(D_j)
\]

where D_1, ..., D_v are the partitions of D produced by the split.
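As a check on the formula, the gain can be computed for the earlier Splitting Criterion example (50+/50- overall, attributes A and B). This is a sketch with illustrative names (`entropy`, `info_gain`); class counts are passed as lists in the order [+, -].

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) from per-class record counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    """parent: class counts at the node; children: class counts per partition."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

gain_a = info_gain([50, 50], [[48, 2], [2, 48]])     # split on A
gain_b = info_gain([50, 50], [[26, 24], [24, 26]])   # split on B
print(round(gain_a, 3), round(gain_b, 3))
```

The gain for A is far larger than the gain for B, which matches the earlier statement that splitting on A is better.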

(39)

(40)

Examples Constructing Decision Tree

- So the attribute Age will be placed at root level.

- For placement at second level we find

(41)

Which Attribute is the Best Classifier?: Information Gain

(42)

Which Attribute is the Best Classifier?: Information Gain

The collection of examples has 9 positive values and 5 negative ones.

Eight of these examples (6 positive and 2 negative) have the attribute value Wind = Weak.

Six of these examples (3 positive and 3 negative) have the attribute value Wind = Strong.

(43)

Which Attribute is the Best Classifier?: Information Gain

The information gain obtained by separating the examples according to the attribute Wind is calculated as:
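Written out with the counts given for Wind (9 positive and 5 negative overall, 6 and 2 for Weak, 3 and 3 for Strong), the calculation would look roughly as follows. The subscripted names D_Weak and D_Strong are notation introduced here, and the values are rounded to three decimals.

```latex
\begin{align*}
H(D) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
H(D_{\text{Weak}}) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
H(D_{\text{Strong}}) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
\mathrm{Gain}(D, \text{Wind}) &= H(D) - \tfrac{8}{14}\,H(D_{\text{Weak}}) - \tfrac{6}{14}\,H(D_{\text{Strong}}) \\
&\approx 0.940 - 0.463 - 0.429 = 0.048
\end{align*}
```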

(44)

Which Attribute is the Best Classifier?: Information Gain

We calculate the Info Gain for each attribute and select the attribute having the highest Info Gain.

(45)

Example

Which attribute should be selected as the first test?

“Outlook” provides the most information

(46)


(47)

Example

The process of selecting a new attribute is now repeated for each (non-terminal) descendant node, this time using only training examples associated with that node

Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree

(48)

Example

This process continues for each new leaf node until either:

1. Every attribute has already been included along this path through the tree

2. The training examples associated with a leaf node have zero entropy
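The selection step and the two stopping conditions can be sketched together in a compact ID3-style recursion. This is an illustration under stated assumptions, not the lecture's own code: it reuses the two categorical attributes of the tax-return table as data, labels an impure leaf with its majority class, and the function names are chosen here.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(k / n * log2(k / n) for k in Counter(labels).values())

def gain(rows, attr):
    """Information gain of splitting rows [(attribute_dict, label)] on attr."""
    remainder = 0.0
    for value in {x[attr] for x, _ in rows}:
        subset = [y for x, y in rows if x[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([y for _, y in rows]) - remainder

def id3(rows, attributes):
    labels = [y for _, y in rows]
    if entropy(labels) == 0:              # stop: zero entropy at this node
        return labels[0]
    if not attributes:                    # stop: every attribute used on this path
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    rest = [a for a in attributes if a != best]   # excluded further down this path
    return {best: {value: id3([(x, y) for x, y in rows if x[best] == value], rest)
                   for value in {x[best] for x, _ in rows}}}

data = [  # (Refund, Marital Status, Cheat) from the tax-return table
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
rows = [({"Refund": r, "MarSt": m}, c) for r, m, c in data]
tree = id3(rows, ["Refund", "MarSt"])
print(tree)
```

On this data MarSt has the higher gain and becomes the root; the Refund test then appears at most once on each path below it.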


(49)

Example

(50)

From Decision Trees to Rules

Next Step: Make rules from the decision tree

After making the identification tree, we trace each path from the root node to a leaf node, recording the test outcomes as antecedents and the leaf node classification as the consequent.

Simple way: one rule for each leaf

For our example we have:

If the Outlook is Sunny and the Humidity is High then No

If the Outlook is Sunny and the Humidity is Normal then Yes

...
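The path-tracing step can be sketched as a recursive walk over a nested-dictionary tree. The representation and the name `tree_to_rules` are assumptions made for illustration, and the tree below contains only the Sunny branch quoted in the rules above, since the remaining branches are elided in the source.

```python
def tree_to_rules(tree, path=()):
    """Return one rule string per leaf of a nested-dict decision tree."""
    if not isinstance(tree, dict):                      # leaf: emit the rule
        tests = " and ".join(f"the {a} is {v}" for a, v in path)
        return [f"If {tests} then {tree}"]
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            # Each test outcome on the path becomes one antecedent.
            rules += tree_to_rules(subtree, path + ((attribute, value),))
    return rules

# Partial tree: only the branch whose rules are quoted in the text.
tree = {"Outlook": {"Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}}}}
rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```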