Lecture 4: Decision Trees
What is a decision tree?
Constructing decision trees
Dealing with noise
Classification problem example
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Discover a “rule” for the PlayTennis predicate!
Decision trees
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
A decision tree consists of:
a set of nodes, where each node tests the value of an attribute
and branches on all possible values
a set of leaves, where each leaf gives a class value
Using decision trees for classification
Suppose we get a new instance:
Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong
How do we classify it?
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
At every node, test the corresponding attribute
Send the instance down the appropriate branch of the tree
If at a leaf, output the corresponding classification
Real example: the “hepatitis” task
[Figure: decision tree learned for the hepatitis task. Internal nodes test age < 40.00, spiders = no, bilirubin < 1.40, albumin < 2.90, spleen_palpable = no and liver_firm = yes, each with t/f branches. Leaf counts shown: live (4) die (1); die (10) live (1); live (65) die (3); live (11); die (6) live (2); die (9) live (3); live (37) die (3).]
Good things about decision trees
Provide a general representation of classification rules
Easy to understand!
Fast learning algorithms (e.g. C4.5, CART)
Robust to noise (attribute and classification noise, missing
values)
Good accuracy
Decision trees are widely used in large, realistic classification
problems, e.g.:
Star classification
Medical diagnosis
Industrial applications
Often incorporated in data mining software (e.g. SGI Mineset).
Decision trees as logical representations
Each decision tree has an equivalent representation in propositional
logic. For example:
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
corresponds to:
$(\text{Outlook}=\text{Sunny} \land \text{Humidity}=\text{Normal})$
$\lor\ (\text{Outlook}=\text{Overcast})$
$\lor\ (\text{Outlook}=\text{Rain} \land \text{Wind}=\text{Weak})$
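One way to convince yourself of the equivalence is to compare the tree and the formula on every combination of attribute values. The sketch below does that; the function names `tree_says_yes` and `formula_says_yes` are made up for this illustration.

```python
from itertools import product


def tree_says_yes(outlook, humidity, wind):
    """The PlayTennis tree written as nested tests, root first."""
    if outlook == "Sunny":
        return humidity == "Normal"
    if outlook == "Overcast":
        return True
    return wind == "Weak"  # remaining branch: Rain


def formula_says_yes(outlook, humidity, wind):
    """One conjunction per path that ends in a Yes leaf, joined by 'or'."""
    return (
        (outlook == "Sunny" and humidity == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and wind == "Weak")
    )


# Temperature is left out because the tree never tests it.
cases = list(product(["Sunny", "Overcast", "Rain"],
                     ["High", "Normal"],
                     ["Strong", "Weak"]))
agree = all(tree_says_yes(*c) == formula_says_yes(*c) for c in cases)
print(len(cases), agree)  # prints "12 True"
```

Each disjunct corresponds to one root-to-leaf path ending in Yes, which is why any decision tree can be rewritten this way.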
What is easy/hard for decision trees to represent?
How would we represent:
XOR
$(A \land B) \lor (C \land D)$
M of N
Natural to represent disjunctions, hard to represent functions like
parity, XOR (need exponential-size trees).
Sometimes duplication occurs (same subtree on various paths).
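To make the size claim concrete, the sketch below counts the leaves of a tree that tests the inputs in a fixed order and stops as soon as the function is constant on the remaining inputs. The fixed test order and the names `count_leaves`, `parity` and `disjunction` are assumptions of this example; for parity the order does not matter, since the function is never constant until every input is known.

```python
from itertools import product


def count_leaves(f, n, fixed=()):
    """Leaves of the tree that tests x1, x2, ... in order for f over n bits."""
    rest = n - len(fixed)
    outputs = {f(fixed + tail) for tail in product((0, 1), repeat=rest)}
    if len(outputs) == 1:      # f is constant here, so a single leaf suffices
        return 1
    # Otherwise test the next bit and count the leaves under both branches.
    return count_leaves(f, n, fixed + (0,)) + count_leaves(f, n, fixed + (1,))


def parity(bits):
    return sum(bits) % 2


def disjunction(bits):
    return int(any(bits))


print(count_leaves(parity, 4))       # prints 16, i.e. 2**4
print(count_leaves(disjunction, 4))  # prints 5, i.e. 4 + 1
```

The disjunction needs one more leaf per extra input, while parity doubles the number of leaves with every extra input.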
When would one use a decision tree?
Data is represented as attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data, missing values
Need to construct a classifier fast
Need an understandable classifier
Existing applications include:
Equipment/medical diagnosis
Learning to fly
Scene analysis and image segmentation
Standard algorithms were developed in the ’80s and are now available in
commercial packages (C4.5). Quite successful in practice.
Top-down induction of decision trees
Given a set of labeled training instances:
1. If all the training instances have the same class, create a leaf
with that class label and exit.
2. Pick the best attribute to split the data on
3. Add a node that tests the attribute
4. Split the training set according to the value of the attribute
5. Recurse on each subset of the training data
This is the ID3 algorithm (Quinlan, 1983) and is at the core of C4.5
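The five steps can be sketched as a short recursive function. This is a minimal illustration, not Quinlan's implementation: it uses information gain (introduced later in the lecture) as the "best attribute" criterion, represents the tree as nested dictionaries, and falls back to the majority class when no attributes remain. The fallback and all function names are assumptions of this sketch.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]

# The 14 PlayTennis examples (D1..D14) from the classification problem example.
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""

rows, labels = [], []
for line in RAW.strip().splitlines():
    *values, label = line.split()
    rows.append(dict(zip(ATTRIBUTES, values)))
    labels.append(label)


def entropy(labels):
    """Impurity of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Expected reduction in entropy from splitting on attr."""
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attrs):
    # Step 1: all instances share one class, so create a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption (not in the five steps): no attributes left, use the majority class.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the best attribute (here: highest information gain).
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    # Steps 3-5: add a node, split on each observed value, recurse.
    node = {"attr": best, "branches": {}}
    remaining = [a for a in attrs if a != best]
    for value in sorted({row[best] for row in rows}):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = id3(
            [rows[i] for i in keep], [labels[i] for i in keep], remaining
        )
    return node


tree = id3(rows, labels, ATTRIBUTES)
print(tree)
```

On these 14 examples the sketch puts Outlook at the root, Humidity under Sunny and Wind under Rain, which matches the tree shown earlier in the lecture.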
Which attribute is best?
Consider that we have 30 positive examples and 10 negative ones, and we
are considering two attributes that would give the following splits of
instances:
A1=?  [30+,10-]        A2=?  [30+,10-]
  f: [20+,10-]           f: [15+,7-]
  t: [10+,0-]            t: [15+,3-]
Intuitively, we would like an attribute that separates the training
instances as well as possible
We need a mathematical measure for the purity of a set of instances
A bit of information theory
Suppose you want to guess a number in a set $S$, and you can
ask yes/no questions.
What is the best questioning strategy?
Pick the “middle” of $S$ and ask if the number is less than that, then
pick the middle of the remaining range etc.
You need $\log_2 |S|$ questions.
A bit of information theory (2)
Now suppose that the number can be in one of two subsets $S_1$ and
$S_2$ (which together partition $S$), with probability $p_1$ and $p_2$
respectively, and you are told if the number is in $S_1$ or $S_2$.
What is the expected number of questions to ask?
$$p_1 \log_2 |S_1| + p_2 \log_2 |S_2|$$
where $p_1 = |S_1| / |S|$ and $p_2 = |S_2| / |S|$
A bit of information theory (3)
How much information do you gain by knowing if the number is in
$S_1$ or $S_2$?
$$\log_2 |S| - p_1 \log_2 |S_1| - p_2 \log_2 |S_2| = -p_1 \log_2 p_1 - p_2 \log_2 p_2$$
This is the entropy function
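The argument of the last three slides can be checked numerically. The sketch below counts the halving questions for a set of a given size, then compares the saving in questions with the entropy expression for one concrete split. The sizes 64, 16 and 48 and all names are made up for this example, and the halving count rounds up when the size is not a power of two.

```python
import math


def questions_needed(size):
    """Yes/no questions the halving strategy needs to isolate one of `size` numbers."""
    count = 0
    while size > 1:
        size = math.ceil(size / 2)  # worst case: the larger half remains
        count += 1
    return count


print(questions_needed(64))  # prints 6, which equals log2(64)

# Hypothetical split of a 64-element set into two parts of 16 and 48 elements.
total, size_1, size_2 = 64, 16, 48
p_1, p_2 = size_1 / total, size_2 / total

# Expected questions still needed once we are told which part holds the number.
expected_after = p_1 * math.log2(size_1) + p_2 * math.log2(size_2)

# Information gained by being told the part, computed in two ways.
gain = math.log2(total) - expected_after
entropy = -p_1 * math.log2(p_1) - p_2 * math.log2(p_2)

print(round(gain, 4), round(entropy, 4))  # prints "0.8113 0.8113"
```

The two numbers agree because the $\log_2 |S|$ terms cancel once $p_1 + p_2 = 1$ is used.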
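Entropy
Consider:
$S$ is a sample of training examples
$p_+$ is the proportion of positive examples in $S$
$p_-$ is the proportion of negative examples in $S$
Entropy measures the impurity of $S$:
$$\text{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$
[Figure: Entropy(S) plotted against $p_+$, both axes running from 0.0 to 1.0]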
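Information Gain
We will use entropy to determine what is the best attribute
Gain(S,A) = expected reduction in entropy due to sorting on attribute A
$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \, \text{Entropy}(S_v)$$
where $S_v$ is the subset of $S$ for which attribute $A$ has value $v$
A1=?  [30+,10-]        A2=?  [30+,10-]
  f: [20+,10-]           f: [15+,7-]
  t: [10+,0-]            t: [15+,3-]
Check that in this case, A1 wins.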
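The check can be done by hand or with a few lines of code. The sketch below computes the gain of both splits directly from the [positive, negative] counts in the figure; the helper names `entropy` and `gain` are chosen for this example.

```python
import math


def entropy(pos, neg):
    """Entropy in bits of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # an empty class contributes 0 (0 * log2(0) is taken as 0)
            p = count / total
            result -= p * math.log2(p)
    return result


def gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - remainder


gain_a1 = gain((30, 10), [(20, 10), (10, 0)])
gain_a2 = gain((30, 10), [(15, 7), (15, 3)])
print(round(gain_a1, 3), round(gain_a2, 3))  # prints "0.123 0.022"
```

A1 wins mainly because one of its branches, [10+,0-], is perfectly pure and so contributes no entropy at all.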
Decision tree construction as search
State space: all possible trees
Actions: which attribute to test
Goal: tree consistent with the training data
Depth-first search, no backtracking
Heuristic: information gain (or other variations)
Can get stuck in a local minimum, but is fairly robust (because
of the heuristic)
Inductive bias of decision tree construction
The hypothesis space is complete! We can represent any
Boolean function of the attributes
So there is no representational bias
Outputs a single hypothesis: the “shortest” tree, as anticipated
by the information gain
Because there is no backtracking, it is subject to local minima
But because the search choices are statistically based, it is
robust to noise in the data
Algorithmic bias: prefer shorter (smaller) trees; prefer trees that
place attributes with high information gain close to the root
Occam’s Razor: Why prefer short hypotheses?
Pro:
There are fewer short hypotheses than long hypotheses
So if we find one that fits the data, it is less likely to be a
coincidence
Con:
There are many ways to define short hypotheses (e.g. all trees
with prime numbers of nodes)
So what is so special about the size of the hypotheses?
A formal answer to this question can be given using the universal
distribution: the probability of a hypothesis $h$ is $2^{-K(h)}$, where
$K(h)$ is the length of the shortest program that can write down $h$.
For details, see Kirchherr and Li (1997).
Attributes with continuous values
Example:
Temperature:  40  48  60   72   80   90
PlayTennis:   No  No  Yes  Yes  Yes  No
A decision tree needs to perform tests on these attributes as well
What kind of test do we want?
Value of the attribute less than a cut point!
What cut points should we consider?
We need to consider only cut points where the class label changes!
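For the Temperature example, the candidate cut points can be found with a short scan over the sorted values. Placing each cut at the midpoint between the two neighbouring values is an assumption of this sketch (any threshold between them splits the data identically), and `candidate_cut_points` is a made-up name.

```python
temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]


def candidate_cut_points(values, labels):
    """Midpoints between consecutive sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2 and v1 != v2:       # the label changes between v1 and v2
            cuts.append((v1 + v2) / 2)  # assumption: cut halfway between them
    return cuts


print(candidate_cut_points(temperature, play_tennis))  # prints [54.0, 85.0]
```

Only two of the five possible gaps need to be evaluated here; each surviving cut point then becomes a Boolean test such as Temperature < 54 that can be scored like any other attribute.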
Using decision trees for real data
How do we estimate classifier error?
How to deal with noise in the data?
How to deal with missing attributes?
How to incorporate attribute costs?
Example: CRX data, UCI Repository
| This file concerns credit card applications. All attribute names
| and values have been changed to meaningless symbols to protect
| confidentiality of the data.
+, -. | classes
A1: b,a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t,f.
A10: t,f.
A11: continuous.
A12: t,f.
A13: g, p, s.
A14: continuous.
A15: continuous.
Estimating the accuracy of a classifier
We want to estimate the true error of the classifier
Resubstitution: Build the classifier using all the training data,
and test it using the same data
Too optimistic! (why?)
Test sample estimation: Divide the data into a training set and
a test set.
Wastes data
Cross-validation: General method for determining accuracy.
$k$-fold cross-validation procedure
1. Split the training data into $k$ partitions (folds), ensuring that the
class distribution is roughly the same in each partition
2. Repeat $k$ times:
(a) Take one fold (a different one each time) to be the test set
(b) Take the remaining $k-1$ folds to form the training set
(c) Train the decision tree on the training set, then measure the error
on the training set and the error on the test set
3. Report the average training error and the average test error.
Magic number: $k = 10$.
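The procedure can be sketched in plain Python as follows. To keep the example self-contained, a majority-class predictor stands in for the decision tree learner; that stand-in, the round-robin way of balancing classes across folds, and all function names are assumptions of this sketch.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Step 1: deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for i in indices:
            folds[position % k].append(i)
            position += 1
    return folds


def cross_validate(examples, labels, k, train, error):
    """Steps 2 and 3: each fold is the test set once; average both errors."""
    train_errors, test_errors = [], []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        train_x = [examples[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        test_x = [examples[i] for i in test_idx]
        test_y = [labels[i] for i in test_idx]
        model = train(train_x, train_y)
        train_errors.append(error(model, train_x, train_y))
        test_errors.append(error(model, test_x, test_y))
    return sum(train_errors) / k, sum(test_errors) / k


# Placeholder learner (an assumption): always predict the majority class.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def error_rate(model, xs, ys):
    return sum(1 for y in ys if y != model) / len(ys)


# Same class balance as the PlayTennis data: 9 Yes, 5 No; features are unused here.
labels = ["Yes"] * 9 + ["No"] * 5
examples = list(range(len(labels)))
avg_train, avg_test = cross_validate(examples, labels, 5, train_majority, error_rate)
print(avg_train, avg_test)
```

With only 14 examples the demo uses 5 folds rather than 10; swapping in a real tree learner only requires passing different `train` and `error` functions.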
More about cross-validation
Leave-one-out cross-validation: special case in which the test is
performed on just one instance.
Used especially if data is scarce.
If for any reason we need a validation set (for the algorithm
itself), that will be kept separate from the training and test sets
E.g. One fold is for testing, one for validation and the remaining
for training
If we are comparing different algorithms, test them on the SAME folds!
Dealing with noise in the training data
Noise is inevitable!
Values of attributes can be misrecorded
Values of attributes may be missing
The class label can be misrecorded
What happens when adding a noisy example?
Example: The effect of noise
Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes
Suppose we add to the data a noisy example:
Sunny, Hot, Normal, Strong, PlayTennis=No
The tree grows unnecessarily!
Overfitting
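Consider error of hypothesis $h$ over
Training data: $\text{error}_{\text{train}}(h)$
Entire distribution $\mathcal{D}$ of data: $\text{error}_{\mathcal{D}}(h)$
Hypothesis $h \in H$ overfits training data if there is an alternative
hypothesis $h' \in H$ such that
$$\text{error}_{\text{train}}(h) < \text{error}_{\text{train}}(h') \quad \text{and} \quad \text{error}_{\mathcal{D}}(h) > \text{error}_{\mathcal{D}}(h')$$
This is a general problem for all supervised learning methods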
Overfitting in decision trees
[Figure: accuracy (0.5 to 0.9) versus size of tree (number of nodes, 0 to 100), with one curve on training data and one on test data]
As the tree grows, the accuracy on test data degrades, because the
algorithm is finding irrelevant attributes.
Do not believe anyone’s results unless they report them on
separate training and test sets!
Avoiding overfitting
1. Stop growing the tree when further splitting the data does not
yield a statistically significant improvement
2. Grow a full tree, then prune the tree, by eliminating nodes
The second approach has been more successful in practice
In both cases, the leaves of the tree will now be impure:
The leaf can be assigned the class label of the majority of the
instances which reached the leaf
Alternatively, one can use probability estimates of the class
membership, based on instance counts.
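Both options for an impure leaf are easy to compute from the instances that reached it. The sketch below uses the counts of one leaf from the hepatitis tree, live (65) and die (3); the function name `leaf_prediction` and the use of plain relative frequencies (with no smoothing) are assumptions of this example.

```python
from collections import Counter


def leaf_prediction(labels_at_leaf):
    """Return the majority label and relative-frequency estimates for a leaf."""
    counts = Counter(labels_at_leaf)
    total = sum(counts.values())
    majority = counts.most_common(1)[0][0]
    probabilities = {label: c / total for label, c in counts.items()}
    return majority, probabilities


majority, probs = leaf_prediction(["live"] * 65 + ["die"] * 3)
print(majority)                # prints "live"
print(round(probs["die"], 3))  # prints 0.044
```

The probability estimate keeps information that the majority label throws away: this leaf predicts "live", but about 4% of the training instances that reached it died.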
How to select the “best” tree
1. Measure performance over training data only
2. Measure performance over a separate validation data set
3. Minimum description length principle: minimize
$\text{size}(\text{tree}) + \text{size}(\text{misclassifications}(\text{tree}))$