Decision Trees, Bagging and Boosting

(1)

Decision Trees, Bagging and Boosting

CSE 446: Machine Learning

created by Emily Fox

(2)

Predicting potential loan defaults

(3)

3 What makes a loan risky?

I want a to buy a

new house! Credit History

★★★★

Income

★★★

★★★★★ Term

Personal Info

★★★

Loan

Application

(4)

Credit history explained

Credit History

★★★★

Income

★★★

★★★★★ Term

Personal Info

★★★

Did I pay previous loans on time?

Example: excellent,

good, or fair

(5)

5 Income

Credit History

★★★★

Income

★★★

★★★★★ Term

Personal Info

★★★

What’s my income?

Example:

$80K per year

(6)

Loan terms

Credit History

★★★★

Income

★★★

★★★★★ Term

Personal Info

★★★

How soon do I need to pay the loan?

Example: 3 years,

5 years,…

(7)

7 Personal information

Credit History

★★★★

Income

★★★

★★★★★ Term

Personal Info

★★★

Age, reason for the loan, marital status,…

Example: Home loan for a

married couple

(8)

Intelligent application

Safe ✓

Risky

✘ Risky

✘

Intelligent loan application review system

Loan

Applications

(9)

9 Classifier review

Loan

Application Classifier

MODEL

Input: x _i ^{Output: ŷ} _Predicted

class

Safe

ŷ _i = +1

Risky

ŷ _i = -1

(10)

This module ... decision trees

Start

Credit?

Safe

excellent

Income?

poor

Term?

Risky Safe

fair

5 years 3 years

Risky Low Term?

Risky Safe

high

5 years

3 years

(11)

11 Scoring a loan application

x _i = (Credit = poor, Income = high, Term = 5 years)

Credit?

Safe Term?

Risky Safe

Income?

Term?

Risky Safe

Risky

Start

excellent poor

fair

5 years

3 years high Low

5 years 3 years

Credit?

Safe Term?

Risky Safe

Income?

Term?

Risky Safe

Risky

Start

excellent poor

fair

5 years

3 years high Low

5 years 3 years

ŷ _i = Safe

(12)

Decision tree learning task

(13)

13 Decision tree learning problem

Optimize quality metric on training data

Training data: N observations (x _i ,y _i )

Credit Term Income y

excellent 3 yrs high safe

fair 5 yrs low risky

fair 3 yrs high safe

poor 5 yrs high risky

excellent 3 yrs low risky

fair 5 yrs low safe

poor 3 yrs high risky

poor 5 yrs low safe

fair 3 yrs high safe

T(X)

(14)

Quality metric: Classification error

• Error measures fraction of mistakes

- Best possible value : 0.0 - Worst possible value: 1.0

Error = # incorrect predictions

# examples

(15)

15 How do we find the best tree?

Exponentially large number of possible trees makes decision tree learning hard!

T

₁

(X) T

₂

(X) T

₃

(X)

T

₄

(X) T

₅

(X) T

₆

(X)

Learning the smallest

decision tree is an

NP-hard problem

[Hyafil & Rivest ’76]

(16)

Greedy decision tree learning

(17)

17 Our training data table

Assume N = 40, 3 features

Credit Term Income y

excellent 3 yrs high safe

fair 5 yrs low risky

fair 3 yrs high safe

poor 5 yrs high risky

excellent 3 yrs low risky

fair 5 yrs low safe

poor 3 yrs high risky

poor 5 yrs low safe

fair 3 yrs high safe

(18)

(all data)

Start with all the data

Loan status: Safe Risky

N = 40 examples

# of Safe loans

22 # of Risky loans

18

(19)

19 22 Root 18

Compact visual notation: Root node

Loan status: Safe Risky

N = 40 examples

# of Safe loans

# of Risky loans

(20)

Decision stump: Single level tree

22 Root 18 Loan status:

Safe Risky

4 poor 14 fair

9 4 excellent

9 0

Credit?

Split on Credit

Subset of data with

Credit = excellent Subset of data with

Credit = fair Subset of data with

Credit = poor

(21)

21 Visual notation: Intermediate nodes

22 Root 18

excellent

9 0 fair

9 4 poor

4 14 Loan status:

Safe Risky

Credit?

Intermediate nodes

(22)

Proceed to build a tree on each of children

22 root 18

excellent

9 0 fair

9 4 poor

4 14 Loan status:

Safe Risky

credit?

Safe

(23)

Selecting best feature to split on

(24)

How do we pick our first split?

22 Root 18

excellent

9 0 fair

9 4 poor

4 14 Loan status:

Safe Risky

Credit?

Find the “best”

feature to split on!

(25)

25 How do we select the best feature?

Root 22 18

excellent 9 0

fair

9 4 poor

4 14 Loan status:

Safe Risky

Credit?

Choice 1: Split on Credit

Root 22 18

3 years

16 4 5 years

6 14 Loan status:

Safe Risky

Term?

Choice 2: Split on Term

OR

(26)

How do we measure effectiveness of a split?

Error = # mistakes

# data points

Root 22 18

poor 4 14 Loan status:

Safe Risky

Credit?

excellent 9 0

9 fair 4

Idea: Calculate classification error

of this decision stump

(27)

27 Calculating classification error

• Step 1: ŷ = class of majority of data in node

• Step 2: Calculate classification error of predicting ŷ for this data

Root 22 18 Loan status:

Safe Risky

18 mistakes 22 correct

ŷ = majority class

Safe Tree Classification error

(root) 0.45

(28)

Choice 1: Split on Credit history?

Does a split on Credit reduce classification error below 0.45?

22 Root 18

excellent

9 0 fair

9 4 poor

4 14 Loan status:

Safe Risky

Credit?

Choice 1: Split on Credit

(29)

29 Split on Credit: Classification error

22 Root 18

excellent

9 0 fair

9 4 poor

4 14 Loan status:

Safe Risky

Credit?

0 mistakes 4 mistakes 4 mistakes

Safe Safe Risky

Choice 1: Split on Credit

Error = 4+4 _

^.

22 + 18

= 0.2

Tree Classification error

(root) 0.45

Split on credit 0.2

(30)

Choice 2: Split on Term?

22 Root 18

3 years

16 4 5 years

6 14 Loan status:

Safe Risky

Term?

Safe Risky

Choice 2: Split on Term

(31)

31 Evaluating the split on Term

22 Root 18

3 years

16 4 5 years

6 14 Loan status:

Safe Risky

Term?

4 mistakes 6 mistakes

Safe Risky

Error = 4+6

^.

= 0.25 40

Tree Classification error

(root) 0.45

Split on credit 0.2

Split on term 0.25

Choice 2: Split on Term

(32)

Choice 1 vs Choice 2:

Comparing split on Credit vs Term

Root 22 18

excellent 9 0

fair

8 4 poor

4 14 Loan status:

Safe Risky Root

22 18

3 years

16 4 5 years

6 14 Loan status:

Safe Risky

OR

Credit? Term?

Tree Classification

error

(root) 0.45

split on credit 0.2

split on loan term 0.25

WINNER

Choice 2: Split on Term

Choice 1: Split on Credit

(33)

33 Feature split selection algorithm

• Given a subset of data M (a node in a tree)

• For each feature ϕ _i (x):

1. Split data of M according to feature ϕ _i (x) 2. Compute classification error split

• Chose feature ϕ ^* (x) with lowest classification error

(34)

Recursion & Stopping conditions

(35)

35 We’ve learned a decision stump, what next?

22 Root 18

excellent

9 0 fair

9 4 poor

4 14 Loan status:

Safe Risky

Credit?

Safe All data points are Safe è

nothing else to do with this subset of data

Leaf node

(36)

Tree learning = Recursive stump learning

22 Root 18

excellent

9 0 fair

9 4 poor

4 14 Loan status:

Safe Risky

Credit?

Safe

Build decision stump with subset of data where

Credit = poor Build decision stump with

subset of data where

Credit = fair

(37)

37 Second level

22 Root 18

Loan status:

Safe Risky

Credit?

excellent

9 0 fair

9 4 poor

4 14

Safe

3 years

0 4 5 years

9 0 Term?

Risky Safe

Build another stump these data points

4 high 5 Low 0 9 Income?

Risky

(38)

Final decision tree

Root 22 18

excellent

9 0 Fair

9 4

poor 4 14

Loan status:

Safe Risky

Credit?

Safe

5 years 9 0 3 years

0 4

Term?

Risky Safe

low 0 9 high 4 5

Income?

5 years 4 3 3 years

0 2

Term?

Risky Safe

Risky

(39)

39 Simple greedy decision tree learning

Pick best feature to split on

Learn decision stump with this split

For each leaf of decision stump, recurse

When do we stop???

(40)

Stopping condition 1: All data agrees on y

Root 22 18

excellent

9 0 Fair

9 4

poor 4 14

Loan status:

Safe Risky

Credit?

Safe

5 years 9 0 3 years

0 4

Term?

Risky Safe

low 0 9 high 4 5

Income?

5 years 4 3

Term?

Risky Safe

Risky

3 years 0 2 3 years

0 2

All data in these nodes have same

y value è Nothing to do

excellent 9 0

5 years 9 0 3 years

0 4

low

0 9

(41)

41 Stopping condition 2: Already split on all features

Root 22 18

excellent

9 0 Fair

9 4

poor 4 14

Loan status:

Safe Risky

Credit?

Safe

5 years 9 0 3 years

0 4

Term?

Risky Safe

low 0 9 high 4 5

Income?

5 years 4 3

Term?

Risky Safe

Risky

3 years 0 2

Already split on all possible features è

Nothing to do

5 years

4 3

(42)

Recursion

Stopping conditions 1

& 2

Greedy decision tree learning

• Step 1: Start with an empty tree

• Step 2: Select a feature to split data

• For each split of the tree:

• Step 3: If nothing more to do, make predictions

• Step 4: Otherwise, go to Step 2 &

continue (recurse) on this split

Pick feature split

leading to lowest

classification error

(43)

43 Is this a good idea?

Proposed stopping condition 3:

Stop if no split reduces the

classification error

(44)

Stopping condition 3:

D on’t stop if error doesn’t decrease???

y values

True False ^Root ₂ ₂

Error = 2

^.

= 0.5 4

Tree Classification error

(root) 0.5

x[1] x[2] y

False False False False True True

True False True True True False

y ⁼ x[1] ^xor x[2]

(45)

45 Consider split on x[1]

y values

True False ^Root ₂ ₂

Error = 2

^.

= 0.5 4

Tree Classification error

(root) 0.5

Split on x[1] 0.5

1 True 1 False

1 1 x[1]

x[1] x[2] y

False False False False True True

True False True True True False

y ⁼ x[1] ^xor x[2]

(46)

Consider split on x[2]

y values

True False ^Root ₂ ₂

Error = 2

^.

= 0.5 4

Tree Classification error

(root) 0.5

Split on x[1] 0.5

Split on x[2] 0.5

1 True 1 False

1 1 x[2]

Neither features

improve training error…

Stop now???

x[1] x[2] y

False False False False True True

True False True True True False

y ⁼ x[1] ^xor x[2]

(47)

47 Final tree with stopping condition 3

Tree Classification error

with stopping

condition 3 0.5

y values

True False ^Root ₂ ₂

Predict True

x[1] x[2] y

False False False False True True

True False True True True False

y ⁼ x[1] ^xor x[2]

(48)

Without stopping condition 3

y values True False

Root 2 2

1 True 1 False

1 1 x[1]

0 True 1

x[2]

1 True 0 False

1 0

x[2]

False 0 1

True False

False True

Tree Classification

error with stopping

condition 3 0.5

without stopping condition 3

x[1] x[2] y

False False False False True True

True False True True True False

y ⁼ x[1] ^xor x[2]

Condition 3 (stopping when training error doesn’t improve) not always recommended!

0

(49)

Decision tree learning:

Real valued features

(50)

How do we use real values inputs?

Income Credit Term y

$105 K excellent 3 yrs Safe

$112 K good 5 yrs Risky

$73 K fair 3 yrs Safe

$69 K excellent 5 yrs Safe

$217 K excellent 3 yrs Risky

$120 K good 5 yrs Safe

$64 K fair 3 yrs Risky

$340 K excellent 5 yrs Safe

$60 K good 3 yrs Risky

(51)

51 Threshold split

Root 22 18 Loan status:

Safe Risky

Split on the feature Income

< $60K

8 13 >= $60K

14 5 Income?

Subset of data with

Income >= $60K

(52)

Finding the best threshold split

Infinite possible values of t

Income < t ^* Income >= t ^*

Safe Risky

Income

$120K

$10K

Income = t ^*

(53)

53 Consider a threshold between points

Safe Risky

Income

$120K

$10K

v

_A

v

_B

Same classification error for any

threshold split between v _A and v _B

(54)

Only need to consider mid-points

Safe Risky

Income

$120K

$10K

Finite number of splits

to consider

(55)

55 Threshold split selection algorithm

• Step 1: Sort the values of a feature h _j (x) :

Let {v ₁ , v ₂ , v ₃ , … v _N } denote sorted values

• Step 2:

- For i = 1 … N-1

• Consider split t _i = (v _i + v _i+1 ) / 2

• Compute classification error for threshold split h _j (x) >= t _i

- Choose the t ^* with the lowest classification error

(56)

Visualizing the threshold split

0 10 20 30 40 …

$0K

$40K

$80K

…

Age

Income Threshold split is the line Age = 38

(57)

57 Split on Age >= 38

Age Income age < 38 age >= 38

Predict Safe Predict Risky

0 10 20 30 40 …

$0K

$40K

$80K

…

(58)

Depth 2: Split on Income >= $60K

Age Income

0 10 20 30 40 …

$0K

$40K

$80K

…

Threshold split is the line Income = 60K

(59)

59 Each split partitions the 2-D space

Age

Age >= 38 Income >= 60K Age < 38

Age >= 38 Income < 60K

Income

0 10 20 30 40 …

$0K

$40K

$80K

…

(60)

Overfitting in decision trees

(61)

61 Decision tree recap

Root 22 18

excellent

9 0 Fair

9 4

poor 4 14

Loan status:

Safe Risky

Credit?

Safe

5 years 9 0 3 years

0 4

Term?

Risky Safe

low 0 9 high 4 5

Income?

5 years 4 3 3 years

0 2

Term?

Risky Safe

Risky

For each leaf node,

set ŷ = majority value

(62)

Tree depth depth = 1 depth = 2 depth = 3 depth = 5 depth = 10

Training error 0.22 0.13 0.10 0.03 0.00

Decision boundary

Training error reduces with depth

What happens when we increase depth?

(63)

63 How do we regularize? In this context, simplicity

Complex Tree

Simpler Tree

(64)

Two approaches to picking simpler trees

1. Early Stopping:

Stop the learning algorithm before tree becomes too complex 2. Pruning:

Simplify the tree after the learning algorithm terminates

(65)

65 Technique 1: Early stopping

• Stopping conditions (recap):

1. All examples have the same target value 2. No more features to split on

• Early stopping conditions:

1. Limit tree depth (choose max_depth using validation set)

2. Do not consider splits that do not cause a sufficient decrease in classification error

3. Do not split an intermediate node which contains too few data points

(66)

Challenge with early stopping condition 1

Cl as sif ic at io n Er ro r

Complex trees

True error Training error

Simple trees

Tree depth

max_depth

Hard to know exactly when to stop

Also, might want

some branches of

tree to go deeper

while others remain

shallow

(67)

67 Early stopping condition 2: Pros and Cons

• Pros:

- A reasonable heuristic for early stopping to avoid useless splits

• Cons:

- Too short sighted: We may miss out on “good” splits may occur right after “useless” splits

- Saw this with “xor” example

(68)

Two approaches to picking simpler trees

1. Early Stopping:

Stop the learning algorithm before tree becomes too complex 2. Pruning:

Simplify the tree after the learning algorithm terminates

Complements early stopping

(69)

69 Pruning: Intuition

Train a complex tree, simplify later

Complex Tree

Simplify

Simpler Tree

(70)

Pruning motivation

Cl as sif ic at io n Er ro r ^{True Error}

Training Error

Tree depth

Don’t stop too early Complex

tree

Simplify after

tree is built

Simple tree

(71)

71 Scoring trees: Desired total quality format

Want to balance:

i. How well tree fits data ii. Complexity of tree

Total cost =

measure of fit + measure of complexity

(classification error) Large # = bad fit to

training data

Large # = likely to overfit

want to balance

(72)

Simple measure of complexity of tree

Credit?

Start

Safe

Risky

excellent poor

bad Safe

Safe

Risky

good fair

L(T) = # of leaf nodes

(73)

73 Balance simplicity & predictive power

Credit?

Term?

Risky

Start

fair

3 years Safe

Safe

Income?

Term?

Risky Safe

Risky

excellent poor

5 years

high low

5 years 3 years

Risky

Start

Too complex, risk of overfitting

Too simple, high

classification error

(74)

Balancing fit and complexity

Total cost C(T) = Error(T) + λ L(T) tuning parameter If λ =0: build out full tree.

If λ =∞: won’t ever split.

If λ in between:

(75)

75 Having said all that…

Vanilla decision trees just don’t really work very well.

(76)

CSE 446: Machine Learning

Bagging

(77)

77 Suppose we want to reduce variance

Recall definition of variance

If we had lots of independent data sets, we could average their results and get the variance to approach 0

Given m independent data sets (D ₁ , …, D _m ), each of size n, learn m classifiers (f ₁ , …, f _m ), where f _i trained on D _i.

By law of large numbers

But we don’t have lots of data sets

(78)

Bootstrap Aggregation (aka bagging)

• A technique for addressing high variance (e.g. in decision trees)

- Given sample D, draw m samples (D ₁ , …, D _m ) of size n, with replacement.

- Train a classifier on each.

- Use average (for regression) or majority vote (for classification).

(79)

79 Advantages of bagging

• Easy to implement

• Reduces variance

• Gives information about uncertainty of prediction.

• Provides estimate of test error, “out of bag” error without using validation set.

Let and define

Then out of bag error is:

(80)

Bagging produces an ensemble classifier

(81)

81 Ensemble classifier in general

• Goal:

- Predict output y

• Either +1 or -1

- From input x

• Learn ensemble model:

- Classifiers: f ₁ (x),f ₂ (x),…,f _T (x) - Coefficients: ŵ ₁ ,ŵ ₂ ,…,ŵ _T

• Prediction:

ˆ

y = sign

X T t=1

w ˆ _t f _t (x)

!

(82)

CSE 446: Machine Learning

Random Forests

(83)

83 One of the most useful bagged algorithms

• Given data set D, draw m samples (D ₁ , …, D _m ) of size n from D, with replacement.

• For each D _j , train a full decision tree h _j with one crucial modification:

- Before each split, randomly subsample k features (without replacement) and only consider these for your split.

• Final classifier obtained by averaging.

(84)

One of best, most popular and easiest to use algorithms

• Only two hyperparameters m and k.

• Extremely insensitive to both.

- Classification: good choice for k is 𝑑 (where 𝑑 is number of features)

- Regression: good choice for k is d/3 - Set m as large as you can afford.

• Works really well!

• Features can be of different scale, magnitude, etc.

- Very convenient with heterogeneous features.

(85)

CSE 446: Machine Learning Emily Fox

University of Washington January 30, 2017

Boosting

(86)

Simple (weak) classifiers are good!

Logistic regression w.

simple features

Low variance. Learning is fast!

But high bias…

Shallow

decision trees Decision stumps

Income>$100K?

Safe Risky

No

Yes

(87)

87 Finding a classifier that’s just right

Model complexity Cl as sif ic at io n er ro r

train error

true error

Weak learner Need stronger learner

Option 1: add more features or depth

Option 2: ?????

(88)

Boosting question

“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)

Yes! Schapire (1990)

Boosting

Amazing impact: simple approach widely used in industry

wins most Kaggle competitions

(89)

Ensemble classifier

(90)

A single classifier

Output: ŷ = f(x)

- Either +1 or -1

Input: x

Classifier

Income>$100K?

Safe Risky

No

Yes

(91)

91 Income>$100K?

Safe Risky

No

Yes

(92)

Ensemble methods: Each classifier “votes” on prediction

x

_i

= (Income=$120K, Credit=Bad, Savings=$50K, Market=Good)

f ₁ (x _i ) = +1

Combine?

F(x _i ) = sign(w ₁ f ₁ (x _i ) + w ₂ f ₂ (x _i ) + w ₃ f ₃ (x _i ) + w ₄ f ₄ (x _i ))

Ensemble

model Learn coefficients

Income>$100K?

Safe Risky

No Yes

Credit history?

Risky Safe

Good Bad

Savings>$100K?

Safe Risky

No Yes

Market conditions?

Risky Safe

Good Bad

Income>$100K?

Safe Risky

No Yes

f ₂ (x _i ) = -1

Credit history?

Risky Safe

Good Bad

f ₃ (x _i ) = -1

Savings>$100K?

Safe Risky

No Yes

f ₄ (x _i ) = +1

Market conditions?

Risky Safe

Good

Bad

(93)

93 Prediction with ensemble

w ₁ 2

w ₂ 1.5

w ₃ 1.5

w ₄ 0.5

f ₁ (x _i ) = +1

Combine?

F(x _i ) = sign(w ₁ f ₁ (x _i ) + w ₂ f ₂ (x _i ) + w ₃ f ₃ (x _i ) + w ₄ f ₄ (x _i ))

Ensemble

model Learn coefficients

Income>$100K?

Safe Risky

No Yes

Credit history?

Risky Safe

Good Bad

Savings>$100K?

Safe Risky

No Yes

Market conditions?

Risky Safe

Good Bad

Income>$100K?

Safe Risky

No Yes

f ₂ (x _i ) = -1

Credit history?

Risky Safe

Good Bad

f ₃ (x _i ) = -1

Savings>$100K?

Safe Risky

No Yes

f ₄ (x _i ) = +1

Market conditions?

Risky Safe

Good

Bad

(94)

Ensemble classifier in general

• Goal:

- Predict output y

• Either +1 or -1

- From input x

• Learn ensemble model:

- Classifiers: f ₁ (x),f ₂ (x),…,f _T (x) - Coefficients: ŵ ₁ ,ŵ ₂ ,…,ŵ _T

• Prediction:

ˆ

y = sign

X T t=1

w ˆ _t f _t (x)

!

(95)

Boosting

(96)

Training a classifier

Training data Learn

classifier Predict ŷ = sign(f(x))

f(x)

(97)

97 Learning decision stump

Credit Income y

A $130K Safe

B $80K Risky

C $110K Risky

A $110K Safe

A $90K Safe

B $120K Safe

C $30K Risky

C $60K Risky

B $95K Safe

A $60K Safe

A $98K Safe

Credit Income y

A $130K Safe

B $80K Risky

C $110K Risky

A $110K Safe

A $90K Safe

B $120K Safe

C $30K Risky

C $60K Risky

B $95K Safe

A $60K Safe

A $98K Safe

Income?

> $100K ≤ $100K

ŷ = Safe ŷ = Safe

3 1 4 3

X X

X

(98)

Boosting = Focus learning on “hard” points

Training data Learn

classifier Predict ŷ = sign(f(x))

f(x)

Learn where f(x)

makes mistakes Evaluate Boosting: focus next

classifier on places where

f(x) does less well

(99)

99 Learning on weighted data:

More weight on “hard” or more important points

• Weighted dataset:

- Each x _i ,y _i weighted by α _i

• More important point = higher weight α _i

• Learning:

- Data point i counts as α _i data points

• E.g., α _i = 2 è count point twice

(100)

Credit Income y

A $130K Safe

B $80K Risky

C $110K Risky

A $110K Safe

A $90K Safe

B $120K Safe

C $30K Risky

C $60K Risky

B $95K Safe

A $60K Safe

A $98K Safe

Learning a decision stump on weighted data

Credit Income y Weight α

A $130K Safe 0.5

B $80K Risky 1.5

C $110K Risky 1.2

A $110K Safe 0.8

A $90K Safe 0.6

B $120K Safe 0.7

C $30K Risky 3

C $60K Risky 2

B $95K Safe 0.8

A $60K Safe 0.7

A $98K Safe 0.9

Income?

> $100K ≤ $100K

ŷ = Risky ŷ = Safe

2 1.2 3 6.5

Increase weight α of harder/misclassified points

X X

X

(101)

101 Learning from weighted data in general

• Learning from weighted data generally no harder.

- Data point i counts as α _i data points

• E.g., gradient descent for logistic regression:

w ^(t+1) _j w ^(t) _j + ⌘ X N

i=1

h _j (x _i ) ⇣

1[y _i = +1] P (y = +1 | x i , w ^(t) ) ⌘ w ^(t+1) _j w ^(t) _j + ⌘

X N

i=1

h _j (x _i ) ⇣

1[y _i = +1] P (y = +1 | x i , w ^(t) ) ⌘

α _i

Sum over data points

Weigh each point by α _i

(102)

Boosting = Greedy learning ensembles from data

Training data

Predict ŷ = sign(f ₁ (x)) Learn classifier

f ₁ (x)

Weighted data

Learn classifier & coefficient

ŵ,f ₂ (x)

Predict ŷ = sign( ^ŵ ₁ ^f ₁ ^{(x) + ŵ} ₂ ^f ₂ ^(x) )

Higher weight for points where f ₁ (x)

is wrong

(103)

AdaBoost

(104)

AdaBoost: learning ensemble

[Freund & Schapire 1999]

• Start with same weight for all points: α _i = 1/N

• For t = 1,…,T

- Learn f _t (x) with data weights α _i - Compute coefficient ŵ _t

- Recompute weights α _i

• Final model predicts by:

ˆ

y = sign

X T

t=1

w ˆ _t f _t (x)

!

(105)

Computing coefficient ŵ _t

(106)

AdaBoost: Computing coefficient ŵ _t of classifier f _t (x)

• f _t (x) is good è f _t has low training error

• Measuring error in weighted data?

- Just weighted # of misclassified points

Is f _t (x) good? ^Yes ŵ _t large ŵ _t small

No

(107)

107 Data point

Weighted classification error

(Sushi was great, ,α=1.2)

Learned classifier

Hide label

Weight of correct Weight of mistakes

Sushi was great

ŷ =

Correct! _0.5 ^1.2 ₀ ⁰

(Food was OK, ,α=0.5) Food was OK Mistake!

α=1.2

α=0.5

(108)

Weighted classification error

• Total weight of mistakes:

• Total weight of all points:

• Weighted error measures fraction of weight of mistakes:

- Best possible value is 0.0

weighted_error = Total weight of all mistakes .

Total weight of all data

Worst possible value is 1.0 Actual worst is 0.5

(109)

109 AdaBoost:

Formula for computing coefficient ŵ _t of classifier f _t (x) ˆ

w _t = 1 2 ln

✓ 1 weighted error(f _t ) weighted error(f _t )

◆ weighted_error(f _t )

on training data ŵ _t

Is f _t (x) good? ^Yes

No

1 weighted error(f

t

) weighted error(f

t

)

0.01 (1-0.01)/0.01 = 99 0.5 ln(99) = 2.3

0.5 0.99

(1-0.5)/0.5 = 1

(1-0.99)/0.99 = 0.01

0.5 ln(1) = 0 0.5 ln(0.01) = -2.3

Last one is a terrible classifier, but - f

_t

(x) is great!

(110)

AdaBoost: learning ensemble

• Start with same weight for all points: α _i = 1/N

• For t = 1,…,T

- Learn f _t (x) with data weights α _i - Compute coefficient ŵ _t

- Recompute weights α _i

• Final model predicts by:

ˆ

y = sign

X T

t=1

w ˆ _t f _t (x)

!

w ˆ t = 1 2 ln

✓ 1 weighted error(f _t ) weighted error(f _t )

◆

(111)

Recompute weights α _i

(112)

AdaBoost: Updating weights ^α _i based on where classifier f _t (x) makes mistakes

Did f _t get x _i right?

Decrease α _i

Yes

Increase α _i

No

(113)

113 AdaBoost: Formula for updating weights α _i

f _t (x _i )=y _i ? ŵ _t ^{Multiply α} ⁱ ^by Implication

correct correct mistake mistake

Did f _t get x _i right? ^Yes

No

α _i ^ç α _i e , -ŵ _t ^if ^f t (x _i )=y _i α _i e , ŵ _t ^if ^f t (x _i )≠y _i

2.3 0 2.3 0

e

^-2.3

= 0.1 e

^-0

= 1 e

^2.3

= 9.98 e

^-0

= 1

Decrease importance of x

_i

, y

_i

Keep importance same

Increase importance

Keep importance same

(114)

AdaBoost: learning ensemble

• Start with same weight for all points: α _i = 1/N

• For t = 1,…,T

- Learn f _t (x) with data weights α _i - Compute coefficient ŵ _t

- Recompute weights α _i

• Final model predicts by:

ˆ

y = sign

X T

t=1

w ˆ _t f _t (x)

!

w ˆ t = 1 2 ln

✓ 1 weighted error(f _t ) weighted error(f _t )

◆ α _i ^ç α _i e , -ŵ _t ^if ^f t (x _i )=y _i

α _i e , ŵ _t ^if ^f t (x _i )≠y _i

(115)

CSE 446: Machine Learning

115 AdaBoost: Normalizing weights α _i

If x _i often mistake, weight α _i gets very

large

If x _i often correct, weight α _i gets very

small

Can cause numerical instability after many iterations

Normalize weights to

add up to 1 after every iteration

↵ _i ↵ _i P N

j=1 ↵ _j

.

(116)

CSE 446: Machine Learning

116 AdaBoost: learning ensemble

• Start with same weight for all points: α _i = 1/N

• For t = 1,…,T

- Learn f _t (x) with data weights α _i - Compute coefficient ŵ _t

- Recompute weights α _i - Normalize weights α _i

• Final model predicts by:

ˆ

y = sign

X T

t=1

ˆ

w _t f _t (x)

!

ˆ

w _t = 1 2 ln

✓ 1 weighted error(f t ) weighted error(f _t )

◆ α _i ^ç α _i e , -ŵ _t ^if ^f t (x _i )=y _i α _i e , ŵ _t ^if ^f t (x _i )≠y _i

↵ _i ↵ _i P N

j=1 ↵ _j

.

(117)

AdaBoost example

(118)

t=1: Just learn a classifier on original data

Learned decision stump f ₁ (x)

Original data

(119)

119 Updating weights α _i

Learned decision stump f ₁ (x) New data weights α _i

Boundary

Increase weight α _i

of misclassified points

(120)

t=2: Learn classifier on weighted data

Learned decision stump f ₂ (x) on weighted data

Weighted data: using α _i chosen in previous iteration

f ₁ (x)

(121)

121 Ensemble becomes weighted sum of learned classifiers

=

f ₁ (x) f ₂ (x)

0.61 ŵ ₁

+ ^0.53

ŵ ₂

(122)

Decision boundary of ensemble classifier after 30 iterations

training_error = 0

(123)

Boosting convergence & overfitting

(124)

Boosting question revisited

“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)

Yes! Schapire (1990)

Boosting

(125)

125 After some iterations,

training error of boosting goes to zero!!!

Tr ai ni ng e rr or

Iterations of boosting

Boosted decision stumps on toy dataset

Training error of ensemble of 30 decision stumps = 0%

Training error of

1 decision stump = 22.5%

(126)

AdaBoost Theorem

Under some technical conditions…

Training error of

boosted classifier → 0

as T→∞ _Tr ^ai ⁿⁱ

ng e rr or

Iterations of boosting

May oscillate a bit

But will

generally decrease, &

eventually become 0!

(127)

127 Condition of AdaBoost Theorem

Under some technical conditions…

Training error of

boosted classifier → 0 as T→∞

Extreme example:

No classifier can separate a +1

on top of -1

Condition = At every t, can find a weak learner with

weighted_error(f _t ) < 0.5

Not always possible

Nonetheless, boosting often

yields great training error

(128)

Overfitting in boosting

(129)

129 Boosted decision stumps on loan data Decision trees on loan data

39% test error

8% training error Overfitting

32% test error

28.5% training error

Better fit & lower test error

(130)

Boosting tends to be robust to overfitting

Test set performance about

constant for many iterations

è Less sensitive to choice of T

(131)

131 But boosting will eventually overfit,

so must choose max number of components T

Best test error around 31%

Test error eventually increases to

33% (overfits)

(132)

How do we decide when to stop boosting?

Choosing T ?

Not on training data

Never ever ever ever on

test data Validation set Cross-validation

Like selecting parameters in other ML approaches (e.g., λ in regularization)

Not useful: training error improves

as T increases

If dataset is large For smaller data

(133)

Summary of boosting

(134)

Variants of boosting and related algorithms

There are hundreds of variants of boosting, most important:

• Like AdaBoost, but useful beyond basic classification

• XGBoost (Tianqi Chen, UW)

Gradient

boosting

(135)

135 Variants of boosting and related algorithms

There are hundreds of variants of boosting, most important:

Many other approaches to learn ensembles, most important:

• Like AdaBoost, but useful beyond basic classification

• XGBoost (Tianqi Chen, UW)

Gradient boosting

• Bagging: Pick random subsets of the data - Learn a tree in each subset

- Average predictions

• Simpler than boosting & easier to parallelize

• Typically higher error than boosting for same # of trees (# iterations T)

Random

forests

(136)

Impact of boosting ( spoiler alert... HUGE IMPACT )

• Standard approach for face detection, for example

Extremely useful in computer vision

• Malware classification, credit fraud detection, ads click through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection,…

Used by most winners of ML competitions

(Kaggle, KDD Cup,…)

• Coefficients chosen manually, with boosting, with bagging, or others

Most deployed ML systems use model ensembles

Amongst most useful ML methods ever created

(137)

137 What you should know about today’s material for final

• What a decision tree is, basic greedy approach to constructing them, what the decision boundary will look like if data is real-valued (axis-aligned partitions)

• Mitigating overfitting and high variance by limiting depth, pruning tree, bagging and random forests.

• Understand idea of ensemble classifier

• Outline the boosting framework –

sequentially learn classifiers on weighted data. Key idea: put more weight on data points where making error.

• Understand the AdaBoost algorithm at a high level

- Learn each classifier on weighted data - Compute coefficient of classifier

- Recompute data weights

- Normalize weights

Decision Trees, Bagging and Boosting