CSC479
Data Mining
Lecture # 11
Classification
Basic Concepts
Decision Trees
Catching tax-evasion
Tax-return data for year 2011
A new tax return for 2012: is this a cheating tax return?
The data analysis task is classification, where a model or classifier is constructed to predict class.
What is classification?
● Classification is the task of learning a target function f that maps attribute set x to one of the predefined class labels y
[Table: tax-return records; column types are categorical, categorical, continuous, and class]
One of the attributes is the class attribute
In this case: Cheat
Two class labels (or classes): Yes (1), No (0)
What is classification (cont…)
● Two major types of prediction problems
  • Classification: the model is constructed to predict a class label
  • Regression / numeric prediction: the constructed model predicts a continuous value
Examples of Classification Tasks
● Predicting tumor cells as benign or malignant
● Classifying credit card transactions as legitimate or fraudulent
● Categorizing news stories as finance, weather, entertainment, sports, etc.
● Identifying spam, spam web pages, adult content
General approach to classification
● Training set consists of records with known class labels
● Training set is used to build a classification model
● A labeled test set of previously unseen data records is used to evaluate the quality of the model
● The classification model is applied to new records with unknown class labels
Evaluation of classification models
● Counts of test records that are correctly (or incorrectly) predicted by the classification model
● Confusion matrix

                      Predicted Class
                      Class = 1   Class = 0
Actual   Class = 1    f11         f10
Class    Class = 0    f01         f00
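The four cells can be counted directly from paired actual and predicted labels. The following is a minimal sketch, assuming f_ij means "actual class i, predicted class j" as the table layout suggests; the label lists are made up for illustration and `confusion_counts` is a hypothetical helper name.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return {(i, j): count}, where i is the actual class and j the predicted class."""
    return Counter(zip(actual, predicted))

# Hypothetical test-set labels (1 = Yes, 0 = No), made up for illustration.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

f = confusion_counts(actual, predicted)
f11, f10, f01, f00 = f[(1, 1)], f[(1, 0)], f[(0, 1)], f[(0, 0)]

correct = f11 + f00      # diagonal cells: predictions that match the actual class
incorrect = f10 + f01    # off-diagonal cells: misclassified records
print(f"f11={f11} f10={f10} f01={f01} f00={f00}")
print(f"correct={correct} incorrect={incorrect}")
```

The diagonal cells hold the correctly predicted test records and the off-diagonal cells hold the errors.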
Classification Techniques
● Decision Tree based Methods
● Rule-based Methods
● Memory based reasoning
● Neural Networks
● Naïve Bayes and Bayesian Belief Networks
Decision Trees
● Decision tree
  • A flow-chart-like tree structure
  • Internal node denotes a test on an attribute
  • Branch represents an outcome of the test
  • Leaf nodes represent class labels or class distribution
Example of a Decision Tree
[Figure: training data (column types: categorical, categorical, continuous, class) beside the model, a decision tree. Splitting attributes: Refund (Yes / No), MarSt (Married / Single, Divorced), TaxInc (< 80K / > 80K). Leaves: NO, NO, NO, YES. Callouts: test outcome, class labels.]
Another Example of Decision Tree
[Figure: the same training data with a different tree: MarSt (Married / Single, Divorced) at the root, then Refund (Yes / No) and TaxInc (< 80K / > 80K). Leaves: NO, NO, NO, YES.]
There could be more than one tree that fits the same data!
Decision Tree Classification Task
Decision Tree
Apply Model to Test Data
[Figure sequence: the test record is traced from the root of the tree (Refund, MarSt, TaxInc), one test at a time, until it reaches a leaf.]
Assign Cheat to “No”
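Applying the model amounts to walking the tree from the root and following the branch that matches each test. The sketch below assumes the tree reads Refund, then MarSt, then TaxInc, as suggested by the figure labels; the handling of a value of exactly 80K and the test record itself are assumptions, since neither is recoverable from the text.

```python
def classify(record):
    """Walk the assumed tree from the root and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: test taxable income (in thousands).
    # Treating exactly 80 as the upper branch is an assumption.
    return "No" if record["TaxInc"] < 80 else "Yes"

# Hypothetical test record, made up for illustration.
test_record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(test_record))  # No
```

Each call performs at most three attribute tests, one per level of the tree.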
Tree Induction
● Finding the best decision tree is NP-hard
● Greedy strategy
  • Split the records based on an attribute test that optimizes a certain criterion
● Many algorithms:
  • Hunt’s Algorithm (one of the earliest)
  • CART
  • ID3, C4.5
How to Specify Test Condition?
● Depends on attribute types
  • Nominal
  • Ordinal
  • Continuous
● Depends on number of ways to split
  • 2-way split
  • Multi-way split
Splitting Based on Nominal Attributes
● Multi-way split: use as many partitions as distinct values.
● Binary split: divides values into two subsets. Need to find optimal partitioning.
[Figure: CarType split three ways into Family, Sports, Luxury; binary alternatives {Family, Luxury} vs. {Sports} OR {Sports, Luxury} vs. {Family}.]
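Finding the optimal binary partitioning of a nominal attribute means searching over every way to divide its values into two non-empty subsets. A minimal sketch of that enumeration follows; `binary_partitions` is a hypothetical helper name, and the CarType values are the ones from the figure.

```python
from itertools import combinations

def binary_partitions(values):
    """Yield every split of a set of nominal values into two non-empty subsets."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    # Keep `first` on the left side so each unordered partition appears only once.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            yield left, right

splits = list(binary_partitions({"Family", "Sports", "Luxury"}))
for left, right in splits:
    print(sorted(left), "vs", sorted(right))
print(len(splits))  # 2**(k-1) - 1 = 3 for k = 3 values
```

The number of candidates grows as 2^(k-1) - 1 for k distinct values, which is why the search for the best binary split becomes expensive for attributes with many values.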
Splitting Based on Ordinal Attributes
● Multi-way split: use as many partitions as distinct values.
● Binary split: divides values into two subsets that respect the order. Need to find optimal partitioning.
[Figure: Size split three ways into Small, Medium, Large; binary alternatives {Medium, Large} vs. {Small} OR {Small, Medium} vs. {Large}.]
● What about this split?
[Figure: another binary split on Size.]
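For an ordinal attribute, an order-respecting binary split is just a cut point in the ordered list of values. The sketch below lists those cuts and checks whether a given grouping respects the order; the function names are hypothetical, and the {Small, Large} vs. {Medium} grouping is an illustrative case chosen here, not necessarily the one shown in the slide.

```python
def ordered_binary_splits(ordered_values):
    """Binary splits of an ordinal attribute that keep the value order intact."""
    return [(ordered_values[:i], ordered_values[i:])
            for i in range(1, len(ordered_values))]

def respects_order(group_a, group_b, ordered_values):
    """True if one group is exactly the low end of the ordering and the other the high end."""
    for low, high in ((group_a, group_b), (group_b, group_a)):
        k = len(low)
        if set(ordered_values[:k]) == set(low) and set(ordered_values[k:]) == set(high):
            return True
    return False

sizes = ["Small", "Medium", "Large"]
print(ordered_binary_splits(sizes))
print(respects_order({"Medium", "Large"}, {"Small"}, sizes))   # True
print(respects_order({"Small", "Large"}, {"Medium"}, sizes))   # False
```

An attribute with k ordered values has only k - 1 order-respecting binary splits, far fewer than the nominal case.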
Splitting Based on Continuous Attributes
● Different ways of handling
  • Discretization
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
  • Binary decision: (A < v) or (A ≥ v)
    - Consider all possible splits and find the best cut
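One way to "consider all possible splits" is to try a cut v halfway between each pair of adjacent distinct values and score the two sides. The sketch below scores a cut by the size-weighted entropy of the two sides, which is an assumption (the slide does not name the criterion at this point); the income values and labels are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return 0.0 - sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Try a cut v between adjacent distinct values; return (v, weighted entropy) of the best."""
    pairs = list(zip(values, labels))
    xs = sorted(set(values))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        v = (lo + hi) / 2
        left = [y for x, y in pairs if x < v]     # records with A < v
        right = [y for x, y in pairs if x >= v]   # records with A >= v
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (v, score)
    return best

# Hypothetical taxable incomes (in thousands) and class labels.
income = [60, 70, 75, 85, 90, 95]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(best_cut(income, cheat))
```

With n distinct values there are n - 1 candidate cuts, so the search is linear in the number of distinct values once they are sorted.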
How to determine the Best Split
Before Splitting: 10 records of class 0, 10 records of class 1
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree
We would like to select the attribute that is most useful for classifying examples
For this we need a good quantitative measure
For this purpose a statistical property, called information gain, is used
- In order to define information gain precisely, we begin by defining entropy
- Entropy, as it relates to machine learning, is a measure of the randomness in the information being processed
- Entropy characterizes the impurity of an arbitrary collection of examples
- The higher the entropy, the harder it is to draw any conclusions from that information
Definition of Entropy
● Entropy(D)
  • Entropy of data set D is denoted by H(D)
  • $C_i$ are the possible classes
  • $p_i$ = fraction of records from D that have class $C_i$
  • $H(D) = -\sum_i p_i \log_2 p_i$
Entropy Examples
● Example:
  • 10 records have class A
  • 20 records have class B
  • 30 records have class C
  • 40 records have class D
  • Entropy = -[(.1 log .1) + (.2 log .2) + (.3 log .3) + (.4 log .4)]
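The example can be evaluated numerically. This is a minimal sketch, assuming the logarithm is base 2 (so the result is in bits); `entropy` is a hypothetical helper that takes class counts.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of record counts."""
    total = sum(counts)
    return 0.0 - sum((c / total) * log2(c / total) for c in counts if c > 0)

# Class counts from the example: A = 10, B = 20, C = 30, D = 40.
h = entropy([10, 20, 30, 40])
print(round(h, 3))  # 1.846

# For comparison, four equally likely classes give the maximum of 2 bits.
print(entropy([25, 25, 25, 25]))  # 2.0
```

The skewed distribution has lower entropy than the uniform one, which matches the reading of entropy as impurity.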
Splitting Criterion
● Example:
  • Two classes, +/-
  • 100 records overall (50 +s and 50 -s)
  • A and B are two binary attributes
    - Records with A=0: 48+, 2-
    - Records with A=1: 2+, 48-
    - Records with B=0: 26+, 24-
    - Records with B=1: 24+, 26-
● Splitting on A is better than splitting on B
  • A does a good job of separating +s and -s
Which Attribute is the Best Classifier?: Information Gain
The expected information needed to classify a tuple in D is
$Info(D) = -\sum_i p_i \log_2 p_i = H(D)$, the entropy of D
How much more information would we still need (after partitioning on attribute A) to arrive at an exact classification? This amount is measured by
$Info_A(D) = \sum_j \frac{|D_j|}{|D|} H(D_j) = H(D, A)$, where $D_j$ is the subset of D with the j-th value of A
Info Gain (D, A) = H(D) - H(D, A)
In general, we write Gain (D, A), where D is the collection of examples and A is an attribute
Information Gain
● Gain of an attribute split: compare the impurity of the parent node with the average impurity of the child nodes
● Maximizing the gain ⇔ minimizing the weighted average impurity measure of the children nodes
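The comparison of parent impurity with weighted child impurity can be checked on the earlier Splitting Criterion example (attributes A and B, 50 +s and 50 -s). The sketch below assumes entropy as the impurity measure; `entropy` and `gain` are hypothetical helper names.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as record counts."""
    total = sum(counts)
    return 0.0 - sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Counts are (+, -) pairs taken from the Splitting Criterion example.
parent = (50, 50)
gain_a = gain(parent, [(48, 2), (2, 48)])     # A=0 and A=1
gain_b = gain(parent, [(26, 24), (24, 26)])   # B=0 and B=1
print(round(gain_a, 3), round(gain_b, 3))
```

The gain for A is far larger than the gain for B, which agrees with the statement that splitting on A is better.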
Examples Constructing Decision Tree
● So the attribute Age will be placed at the root level.
● For placement at the second level, we find InfoGain for all the remaining attributes under every branch of the parent node.
Which Attribute is the Best Classifier?: Information Gain
The collection of examples has 9 positive values and 5 negative ones
Eight (6 positive and 2 negative ones) of these examples have the attribute value Wind = Weak
Six (3 positive and 3 negative ones) of these examples have the attribute value Wind = Strong
The information gain obtained by separating the examples according to the attribute Wind is calculated as:
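A worked version of this calculation, using the counts stated above (9+/5- overall, 6+/2- for Weak, 3+/3- for Strong) and assuming base-2 logarithms:

```latex
\begin{align*}
H(D) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
H(D_{\text{Weak}}) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
H(D_{\text{Strong}}) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
\mathrm{Gain}(D, \text{Wind}) &= H(D) - \tfrac{8}{14}\,H(D_{\text{Weak}}) - \tfrac{6}{14}\,H(D_{\text{Strong}}) \\
&\approx 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1.000) \approx 0.048
\end{align*}
```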
We calculate the Info Gain for each attribute and select the attribute having the highest Info Gain
Example
Which attribute should be selected as the first test?
“Outlook” provides the most information
The process of selecting a new attribute is now repeated for each (non-terminal) descendant node, this time using only training examples associated with that node
Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree
This process continues for each new leaf node until either:
1. Every attribute has already been included along this path through the tree
2. The training examples associated with a leaf node have zero entropy
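The recursive procedure and its two stopping conditions can be sketched as follows. This is an illustrative ID3-style sketch, not a definitive implementation. The 14-row table is the standard PlayTennis data set, assumed here to be the one behind the example (it matches the stated 9+/5- and Wind counts); the column name `Play` and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Entropy in bits of the class label over a list of rows."""
    n = len(rows)
    return 0.0 - sum(c / n * log2(c / n) for c in Counter(r["Play"] for r in rows).values())

def info_gain(rows, attr):
    """Entropy of the node minus the weighted entropy after splitting on attr."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

def id3(rows, attrs):
    labels = Counter(r["Play"] for r in rows)
    # Stop when the node has zero entropy or no attribute is left on this path.
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a))
    remaining = [a for a in attrs if a != best]  # an attribute appears at most once per path
    return {best: {v: id3([r for r in rows if r[best] == v], remaining)
                   for v in sorted({r[best] for r in rows})}}

# Assumed data: the standard PlayTennis table (Outlook, Temperature, Humidity, Wind, Play).
COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "Play")
DATA = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
rows = [dict(zip(COLUMNS, line.split())) for line in DATA.splitlines()]

tree = id3(rows, ["Outlook", "Temperature", "Humidity", "Wind"])
print(tree)
```

On this data the root test is Outlook, and each subtree is built only from the rows that reach it.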
Next Step: Make rules from the decision tree
After making the identification tree, we trace each path from the root node to a leaf node, recording the test outcomes as antecedents and the leaf node classification as the consequent
Simple way: one rule for each leaf
For our example we have:
If the Outlook is Sunny and the Humidity is High then No
If the Outlook is Sunny and the Humidity is Normal then Yes
...
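Tracing every root-to-leaf path can be sketched with a short recursive function. In the sketch below the tree is written as a nested dictionary; the Outlook = Sunny branch follows the two rules quoted above, while the Overcast and Rain branches are assumptions taken from the standard PlayTennis example and are not stated in the slides. `tree_to_rules` is a hypothetical helper name.

```python
def tree_to_rules(tree, conditions=()):
    """Yield one 'If ... then ...' rule per leaf by tracing each root-to-leaf path."""
    if not isinstance(tree, dict):
        # Leaf: the tests collected so far are the antecedents, the label is the consequent.
        antecedent = " and ".join(f"the {attr} is {value}" for attr, value in conditions)
        yield f"If {antecedent} then {tree}"
        return
    (attr, branches), = tree.items()   # each internal node tests exactly one attribute
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attr, value),))

# Sunny branch as in the rules above; Overcast and Rain branches are assumed.
tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

rules = list(tree_to_rules(tree))
for rule in rules:
    print(rule)
```

The number of rules equals the number of leaves, which is the "one rule for each leaf" approach.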
From Decision Trees to Rules