What is a decision tree?

Lecture 4: Decision Trees

What is a decision tree?

Constructing decision trees

Dealing with noise

Classification problem example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Discover a “rule” for the PlayTennis predicate!

Decision trees

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

A decision tree consists of:

a set of nodes, where each node tests the value of an attribute and branches on all possible values

a set of leaves, where each leaf gives a class value

Using decision trees for classification

Suppose we get a new instance:

Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong

How do we classify it?

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

At every node, test the corresponding attribute

Send the instance down the appropriate branch of the tree

If at a leaf, output the corresponding classification
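The classification procedure above can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding of the PlayTennis tree and the names `TREE` and `classify` are assumptions made for this example, not part of the lecture.

```python
# Hypothetical encoding of the PlayTennis tree: an internal node is an
# (attribute, {value: subtree}) pair and a leaf is a class label string.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})


def classify(tree, instance):
    """Walk from the root to a leaf, following the branch for each tested attribute."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
label = classify(TREE, instance)
print(label)  # No
```

The new instance goes down the Sunny branch, then the High branch of Humidity, so it is classified as No. Temperature is never tested.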

Real example: the “hepatitis” task

[Figure: decision tree learned for the hepatitis task. Internal nodes test liver_firm = yes, spleen_palpable = no, albumin < 2.90, bilirubin < 1.40, spiders = no, and age < 40.00, each with a t and an f branch; each leaf lists the number of “live” and “die” instances that reach it.]

Good things about decision trees

Provide a general representation of classification rules

Easy to understand!

Fast learning algorithms (e.g. C4.5, CART)

Robust to noise (attribute and classification noise, missing values)

Good accuracy

Decision trees are widely used in large, realistic classification problems, e.g.:

Star classification

Medical diagnosis

Industrial applications

Often incorporated in data mining software (e.g. SGI Mineset).

Decision trees as logical representations

Each decision tree has an equivalent representation in propositional logic. For example:

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

corresponds to:

$$(\text{Outlook}=\text{Sunny} \wedge \text{Humidity}=\text{Normal}) \vee (\text{Outlook}=\text{Overcast}) \vee (\text{Outlook}=\text{Rain} \wedge \text{Wind}=\text{Weak})$$
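As a quick sanity check, the tree and the propositional formula can be compared on every combination of the three attributes they test. The function names below are hypothetical; the sketch simply hard-codes the tree as nested conditionals and the formula as one Boolean expression.

```python
from itertools import product


def tree_says_yes(outlook, humidity, wind):
    """The PlayTennis tree written as nested tests."""
    if outlook == "Sunny":
        return humidity == "Normal"
    if outlook == "Overcast":
        return True
    return wind == "Weak"  # remaining case: outlook == "Rain"


def formula_says_yes(outlook, humidity, wind):
    """One conjunction per path that ends in a Yes leaf, joined by 'or'."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Overcast"
            or (outlook == "Rain" and wind == "Weak"))


combos = list(product(["Sunny", "Overcast", "Rain"],
                      ["High", "Normal"],
                      ["Strong", "Weak"]))
agree = all(tree_says_yes(*c) == formula_says_yes(*c) for c in combos)
print(len(combos), agree)  # 12 True
```

Each disjunct corresponds to one root-to-leaf path ending in Yes, which is why any decision tree can be written in this form.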

What is easy/hard for decision trees to represent?

How would we represent:

XOR

$(A \wedge B) \vee (C \wedge D)$

M of N

Natural to represent disjunctions, hard to represent functions like parity, XOR (need exponential-size trees).

Sometimes duplication occurs (same subtree on various paths).
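To see why parity needs exponential-size trees, note that flipping any single bit changes the output, so every path must test every bit. The sketch below, with hypothetical helper names, builds such a tree for n-bit parity and counts its leaves.

```python
def parity_tree(n_bits, parity=0):
    """Tree for n-bit parity: each level tests one bit; a leaf holds the parity of its path."""
    if n_bits == 0:
        return parity
    return {0: parity_tree(n_bits - 1, parity),
            1: parity_tree(n_bits - 1, parity ^ 1)}


def count_leaves(tree):
    """A leaf is any non-dict value; an internal node is a dict of branches."""
    if not isinstance(tree, dict):
        return 1
    return sum(count_leaves(child) for child in tree.values())


sizes = [count_leaves(parity_tree(n)) for n in range(1, 6)]
print(sizes)  # [2, 4, 8, 16, 32]
```

The number of leaves doubles with every added bit, and the two subtrees below each test are mirror images of each other, which also illustrates the duplication mentioned above.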

When would one use a decision tree?

Data is represented as attribute-value pairs

Target function is discrete valued

Disjunctive hypothesis may be required

Possibly noisy training data, missing values

Need to construct a classifier fast

Need an understandable classifier

Existing applications include:

Equipment/medical diagnosis

Learning to fly

Scene analysis and image segmentation

Standard algorithms were developed in the ’80s, and commercial packages are now available (C4.5). Quite successful in practice.

Top-down induction of decision trees

Given a set of labeled training instances:

1. If all the training instances have the same class, create a leaf with that class label and exit.

2. Pick the best attribute to split the data on

3. Add a node that tests the attribute

4. Split the training set according to the value of the attribute

5. Recurse on each subset of the training data

This is the ID3 algorithm (Quinlan, 1983) and is at the core of C4.5
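The five steps can be sketched as a short recursive function. This is a minimal illustration rather than Quinlan's implementation: it assumes "best attribute" means highest information gain (the measure the lecture develops for this purpose), it falls back to the majority class when no attributes remain (a case the steps above do not cover), and all names (`id3`, `information_gain`, `ROWS`, `LABELS`) are chosen for this example.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
RAW = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
ROWS = [dict(zip(ATTRIBUTES, line.split())) for line in RAW.splitlines()]
LABELS = [line.split()[-1] for line in RAW.splitlines()]


def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    # Step 1: all instances share one class, so create a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption beyond the steps above: with no attributes left, use the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    # Steps 3-5: add a node testing it, split the data, recurse on each subset.
    branches = {}
    for value in sorted(set(row[best] for row in rows)):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = id3([rows[i] for i in keep],
                              [labels[i] for i in keep],
                              [a for a in attributes if a != best])
    return (best, branches)


tree = id3(ROWS, LABELS, ATTRIBUTES)
print(tree)
```

On the 14 PlayTennis instances this sketch places Outlook at the root, tests Humidity under Sunny and Wind under Rain, and makes Overcast a Yes leaf.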

Which attribute is best?

Consider that we have 30 positive examples and 10 negative ones, and we are considering two attributes that would give the following splits of instances:

A1=?  [30+,10-]  splits (branches f, t) into  [20+,10-] and [10+,0-]
A2=?  [30+,10-]  splits (branches f, t) into  [15+,7-] and [15+,3-]

Intuitively, we would like an attribute that separates the training instances as well as possible

We need a mathematical measure for the purity of a set of instances

A bit of information theory

Suppose you want to guess a number that lies in a set $S$, and you can ask yes/no questions.

What is the best questioning strategy?

Pick the “middle” of $S$ and ask if the number is less than that, then pick the middle of the remaining range, etc.

You need $\log_2 |S|$ questions.
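The halving strategy can be simulated to check the question count. The sketch below assumes the set is the integer range 1..n and that each question has the form "is the number less than mid?"; the function name is hypothetical. When n is not a power of two, the worst case rounds the logarithm up.

```python
from math import ceil, log2


def questions_needed(secret, low, high):
    """Count the yes/no questions asked until the range [low, high] shrinks to one number."""
    count = 0
    while low < high:
        mid = (low + high + 1) // 2   # the "middle" of the remaining range
        count += 1
        if secret < mid:
            high = mid - 1
        else:
            low = mid
    return count


n = 16
worst = max(questions_needed(s, 1, n) for s in range(1, n + 1))
print(worst, ceil(log2(n)))  # 4 4
```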

A bit of information theory (2)

Now suppose that the number can be in one of two subsets $S_+$ and $S_-$ of the set $S$, with probability $p_+$ and $p_-$ respectively, and you are told if the number is in $S_+$ or $S_-$. What is the expected number of questions to ask?

$$p_+ \log_2 |S_+| + p_- \log_2 |S_-|$$

where $p_+ = |S_+|/|S|$ and $p_- = |S_-|/|S|$

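A bit of information theory (3)

How much information do you gain by knowing if the number is in $S_+$ or $S_-$?

$$\log_2 |S| - \left(p_+ \log_2 |S_+| + p_- \log_2 |S_-|\right) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$

The equality uses $|S_+| = p_+ |S|$, $|S_-| = p_- |S|$ and $p_+ + p_- = 1$.

This is the entropy function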

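Entropy

Consider:

$S$ - a sample of training examples

$p_+$ is the proportion of positive examples in $S$

$p_-$ is the proportion of negative examples in $S$

Entropy measures the impurity of $S$:

$$\text{Entropy}(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-$$

[Figure: Entropy(S) plotted against $p_+$ on the range 0.0 to 1.0; the curve is 0 at $p_+ = 0$ and $p_+ = 1$ and peaks at 1.0 for $p_+ = 0.5$.]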
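The two-class entropy is easy to tabulate. This sketch assumes the usual convention that a zero proportion contributes nothing to the sum; the function name `entropy` is chosen for the example.

```python
from math import log2


def entropy(p_pos):
    """Two-class entropy of a sample whose proportion of positive examples is p_pos."""
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0:  # convention: a zero proportion contributes 0
            total -= p * log2(p)
    return total


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, round(entropy(p), 3))
```

A pure sample (all positive or all negative) has entropy 0, and an evenly mixed sample has entropy 1, the maximum for two classes.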

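Information Gain

We will use entropy to determine what is the best attribute

Gain(S,A) = expected reduction in entropy due to sorting on attribute A

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$

where $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.

A1=?  [30+,10-]  splits (branches f, t) into  [20+,10-] and [10+,0-]
A2=?  [30+,10-]  splits (branches f, t) into  [15+,7-] and [15+,3-]

Check that in this case, A1 wins.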
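The check can be done numerically. The sketch below computes the gain directly from the positive and negative counts in the two splits; the helper names and the (positives, negatives) pair representation are assumptions made for this example.

```python
from math import log2


def entropy(pos, neg):
    """Two-class entropy from counts of positive and negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result


def gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children.

    parent and each child are (positives, negatives) count pairs.
    """
    total = sum(parent)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in children)
    return entropy(*parent) - remainder


gain_a1 = gain((30, 10), [(20, 10), (10, 0)])
gain_a2 = gain((30, 10), [(15, 7), (15, 3)])
print(round(gain_a1, 3), round(gain_a2, 3))  # 0.123 0.022
```

A1 produces one pure branch ([10+,0-]), which is why its gain is several times larger than that of A2.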

Decision tree construction as search

State space: all possible trees

Actions: which attribute to test

Goal: tree consistent with the training data

Depth-first search, no backtracking

Heuristic: information gain (or other variations)

Can get stuck in a local minimum, but is fairly robust (because of the heuristic)

Inductive bias of decision tree construction

The hypothesis space is complete! We can represent any Boolean function of the attributes

So there is no representational bias

Outputs a single hypothesis: the “shortest” tree, as anticipated by the information gain

Because there is no backtracking, it is subject to local minima

But because the search choices are statistically based, it is robust to noise in the data

Algorithmic bias: prefer shorter (smaller) trees; prefer trees that place attributes with high information gain close to the root

Occam’s Razor: Why prefer short hypotheses?

Pro:

There are fewer short hypotheses than long hypotheses

So if we find one that fits the data, it is less likely to be a coincidence

Con:

There are many ways to define short hypotheses (e.g. all trees with prime numbers of nodes)

So what is so special about the size of the hypotheses?

A formal answer to this question can be given using the universal distribution: the probability of a hypothesis $h$ is $2^{-K(h)}$, where $K(h)$ is the length of the shortest program that can write down $h$.

For details, see Kirchherr and Li (1997).

Attributes with continuous values

Example:

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No

A decision tree needs to perform tests on these attributes as well

What kind of test do we want?

Value of the attribute less than a cut point!

What cut points should we consider?

We need to consider only cut points where the class label changes!
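A minimal sketch of finding those candidates on the example above. Placing each cut at the midpoint between the two neighbouring values is an assumption of this sketch (the slide only says where the cuts fall, not their exact position), and the function name is hypothetical.

```python
def candidate_cut_points(values, labels):
    """Midpoints between consecutive sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2 and v1 != v2:
            cuts.append((v1 + v2) / 2)  # assumption: cut halfway between neighbours
    return cuts


temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
cuts = candidate_cut_points(temperature, play_tennis)
print(cuts)  # [54.0, 85.0]
```

Only two of the five possible boundaries need to be evaluated here, for example with tests of the form Temperature < 54 and Temperature < 85.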

Using decision trees for real data

How do we estimate classifier error

How to deal with noise in the data

How to deal with missing attributes

How to incorporate attribute costs

Example: CRX data, UCI Repository

| This file concerns credit card applications. All attribute names

| and values have been changed to meaningless symbols to protect

| confidentiality of the data.

+, -. | classes

A1: b,a.

A2: continuous.

A3: continuous.

A4: u, y, l, t.

A5: g, p, gg.

A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.

A7: v, h, bb, j, n, z, dd, ff, o.

A8: continuous.

A9: t,f.

A10: t,f.

A11: continuous.

A12: t,f.

A13: g, p, s.

A14: continuous.

A15: continuous.

Estimating the accuracy of a classifier

We want to estimate the true error of the classifier

Resubstitution: Build the classifier using all the training data, and test it using the same data

Too optimistic! (why?)

Test sample estimation: Divide the data into a training set and a test set.

Wastes data

Cross-validation: General method for determining accuracy.

$k$-fold cross-validation procedure

1. Split the training data into $k$ partitions (folds), ensuring that the class distribution is roughly the same in each partition

2. Repeat $k$ times:

(a) Take one fold to be the test set

(b) Take the remaining $k-1$ folds to form the training set

(c) We train the decision tree on the training set, then measure the error on the training set and the error on the test set

3. Report the average of the training error and the average of the test error.

Magic number: $k = 10$.
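The procedure can be sketched in plain Python. Everything here is illustrative: `stratified_folds` deals the instances of each class out round-robin so that class proportions stay roughly equal across folds, `cross_validate` takes a hypothetical callback that trains on one index list and returns a (training error, test error) pair, and a majority-class predictor stands in for the decision tree learner.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign instance indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validate(labels, k, evaluate):
    """Use each fold once as the test set; average the errors that evaluate returns."""
    folds = stratified_folds(labels, k)
    train_errors, test_errors = [], []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        e_train, e_test = evaluate(train_idx, test_idx)
        train_errors.append(e_train)
        test_errors.append(e_test)
    return sum(train_errors) / k, sum(test_errors) / k


def majority_baseline(labels):
    """Hypothetical stand-in for 'train a decision tree': predict the majority training class."""
    def evaluate(train_idx, test_idx):
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]

        def error(idx):
            return sum(labels[i] != majority for i in idx) / len(idx)

        return error(train_idx), error(test_idx)
    return evaluate


labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
avg_train, avg_test = cross_validate(labels, 5, majority_baseline(labels))
print(avg_train, avg_test)
```

The example uses 5 folds only because the PlayTennis data has just 14 instances. When comparing algorithms, the same `folds` list should be reused for each of them.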

More about cross-validation

Leave-one-out cross-validation: special case in which each test is performed on just one instance.

Used especially if data is scarce.

If for any reason we need a validation set (for the algorithm itself), that will be kept separate from the training and test sets

E.g. one fold is for testing, one for validation and the remaining for training

If we are comparing different algorithms, test them on the SAME folds!

Dealing with noise in the training data

Noise is inevitable!

Values of attributes can be misrecorded

Values of attributes may be missing

The class label can be misrecorded

What happens when adding a noisy example?

Example: The effect of noise

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes

Suppose we add to the data a noisy example:

Sunny, Hot, Normal, Strong, PlayTennis=No

The leaf for Outlook=Sunny, Humidity=Normal is no longer pure, so another test must be added below it.

The tree grows unnecessarily!

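Overfitting

Consider the error of hypothesis $h$ over

Training data: $\text{error}_{\text{train}}(h)$

Entire distribution $D$ of data: $\text{error}_D(h)$

Hypothesis $h \in H$ overfits the training data if there is an alternative hypothesis $h' \in H$ such that

$$\text{error}_{\text{train}}(h) < \text{error}_{\text{train}}(h') \quad \text{and} \quad \text{error}_D(h) > \text{error}_D(h')$$

This is a general problem for all supervised learning methods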

Overfitting in decision trees

[Figure: accuracy (0.5 to 0.9) versus size of tree (number of nodes, 0 to 100), with one curve on training data and one on test data.]

As the tree grows, the accuracy on test data degrades, because the algorithm is finding irrelevant attributes.

Do not believe anyone’s results unless they report them on separate training and test sets!

Avoiding overfitting

1. Stop growing the tree when further splitting the data does not yield a statistically significant improvement

2. Grow a full tree, then prune the tree, by eliminating nodes

The second approach has been more successful in practice

In both cases, the leaves of the tree will now be impure:

The leaf can be assigned the class label of the majority of the instances which reached the leaf

Alternatively, one can use probability estimates of the class membership, based on instance counts.
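Both ways of labelling an impure leaf take only a few lines. The sketch uses raw relative frequencies as the probability estimates, which is one simple choice among several; the function names are hypothetical, and the example counts of 15 positive and 3 negative instances are the [15+,3-] branch from the attribute-selection example.

```python
from collections import Counter


def majority_label(labels):
    """Class of the majority of the instances that reached the leaf."""
    return Counter(labels).most_common(1)[0][0]


def class_probabilities(labels):
    """Relative frequency of each class among the instances that reached the leaf."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


reached = ["+"] * 15 + ["-"] * 3
print(majority_label(reached))       # +
print(class_probabilities(reached))
```

In case of a tie, `most_common` returns the class that was seen first, so a real implementation would need an explicit tie-breaking rule.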

How to select the “best” tree

1. Measure performance over training data only

2. Measure performance over a separate validation data set

3. Minimum description length principle: minimize size(tree) + size(misclassifications(tree))

The second one (training and validation set) is the most common.

Example: Reduced-error pruning

1. Split data into a training set and a validation set

2. Grow a large tree (e.g. until each leaf is pure)

3. For each node:

(a) Evaluate the validation set accuracy of pruning the subtree rooted at the node

(b) Greedily remove the node that most improves validation set accuracy, with its corresponding subtree

(c) Replace the removed node by a leaf with the majority class of the corresponding examples.

4. Stop when pruning starts hurting the accuracy on the validation set.
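The pruning loop can be sketched as follows, under several assumptions made for this example. Internal nodes are dicts that store the tested attribute, the branches, and the majority training class at that node. Pruning continues as long as validation accuracy does not drop. The tree is a hand-written version of the PlayTennis tree after it has grown an extra Temperature test to fit a noisy example, and the six validation instances are invented for illustration.

```python
def predict(node, instance):
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)


def internal_nodes(node, path=()):
    """Yield the path (sequence of branch values) leading to every internal node."""
    if isinstance(node, dict):
        yield path
        for value, child in node["branches"].items():
            yield from internal_nodes(child, path + (value,))


def pruned_copy(node, path):
    """Copy of the tree with the node at `path` replaced by its majority-class leaf."""
    if not path:
        return node["majority"]
    copy = dict(node, branches=dict(node["branches"]))
    copy["branches"][path[0]] = pruned_copy(node["branches"][path[0]], path[1:])
    return copy


def reduced_error_prune(tree, validation):
    best = accuracy(tree, validation)
    while isinstance(tree, dict):
        candidates = [pruned_copy(tree, p) for p in internal_nodes(tree)]
        top = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(top, validation) < best:  # pruning starts hurting: stop
            break
        tree, best = top, accuracy(top, validation)
    return tree


def node(attribute, majority, branches):
    return {"attribute": attribute, "majority": majority, "branches": branches}


noisy_tree = node("Outlook", "Yes", {
    "Sunny": node("Humidity", "No", {
        "High": "No",
        "Normal": node("Temperature", "Yes",
                       {"Hot": "No", "Mild": "Yes", "Cool": "Yes"})}),
    "Overcast": "Yes",
    "Rain": node("Wind", "Yes", {"Strong": "No", "Weak": "Yes"}),
})


def example(outlook, temperature, humidity, wind, label):
    return ({"Outlook": outlook, "Temperature": temperature,
             "Humidity": humidity, "Wind": wind}, label)


validation = [  # invented validation instances
    example("Sunny", "Hot", "Normal", "Weak", "Yes"),
    example("Sunny", "Mild", "High", "Strong", "No"),
    example("Sunny", "Cool", "Normal", "Strong", "Yes"),
    example("Overcast", "Hot", "High", "Weak", "Yes"),
    example("Rain", "Mild", "High", "Strong", "No"),
    example("Rain", "Cool", "Normal", "Weak", "Yes"),
]
pruned = reduced_error_prune(noisy_tree, validation)
print(accuracy(noisy_tree, validation), accuracy(pruned, validation))
```

On this data the first pruning step removes the Temperature subtree, which raises validation accuracy from 5/6 to 1. Every further candidate would lower it, so pruning stops there.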

Example: Effect of reduced-error pruning

[Figure: accuracy (0.5 to 0.9) versus size of tree (number of nodes, 0 to 100), with curves on training data, on test data, and on test data during pruning.]