Decision Trees, Bagging and Boosting
CSE 446: Machine Learning
created by Emily Fox
Predicting potential loan defaults
3
What makes a loan risky?
I want a to buy a
new house! Credit History
★★★★
Income
★★★
★★★★★ Term
Personal Info
★★★
Loan
Application
Credit history explained
Credit History
★★★★
Income
★★★
★★★★★ Term
Personal Info
★★★
Did I pay previous loans on time?
Example: excellent,
good, or fair
5
Income
Credit History
★★★★
Income
★★★
★★★★★ Term
Personal Info
★★★
What’s my income?
Example:
$80K per year
Loan terms
Credit History
★★★★
Income
★★★
★★★★★ Term
Personal Info
★★★
How soon do I need to pay the loan?
Example: 3 years,
5 years,…
7
Personal information
Credit History
★★★★
Income
★★★
★★★★★ Term
Personal Info
★★★
Age, reason for the loan, marital status,…
Example: Home loan for a
married couple
Intelligent application
Safe ✓
Risky
✘ Risky
✘
Intelligent loan application review system
Loan
Applications
9
Classifier review
Loan
Application Classifier
MODEL
Input: x i Output: ŷ Predicted
class
Safe
ŷ i = +1
Risky
ŷ i = -1
This module ... decision trees
Start
Credit?
Safe
excellent
Income?
poor
Term?
Risky Safe
fair
5 years 3 years
Risky Low Term?
Risky Safe
high
5 years
3 years
11
Scoring a loan application
x i = (Credit = poor, Income = high, Term = 5 years)
Credit?
Safe Term?
Risky Safe
Income?
Term?
Risky Safe
Risky
Start
excellent poor
fair
5 years
3 years high Low
5 years 3 years
Credit?
Safe Term?
Risky Safe
Income?
Term?
Risky Safe
Risky
Start
excellent poor
fair
5 years
3 years high Low
5 years 3 years
ŷ i = Safe
Decision tree learning task
13
Decision tree learning problem
Optimize quality metric on training data
Training data: N observations (x i ,y i )
Credit Term Income y
excellent 3 yrs high safe
fair 5 yrs low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs low safe
poor 3 yrs high risky
poor 5 yrs low safe
fair 3 yrs high safe
T(X)
Quality metric: Classification error
• Error measures fraction of mistakes
- Best possible value : 0.0 - Worst possible value: 1.0
Error = # incorrect predictions
# examples
15
How do we find the best tree?
Exponentially large number of possible trees makes decision tree learning hard!
T
1(X) T
2(X) T
3(X)
T
4(X) T
5(X) T
6(X)
Learning the smallest
decision tree is an
NP-hard problem
[Hyafil & Rivest ’76]
Greedy decision tree learning
17
Our training data table
Assume N = 40, 3 features
Credit Term Income y
excellent 3 yrs high safe
fair 5 yrs low risky
fair 3 yrs high safe
poor 5 yrs high risky
excellent 3 yrs low risky
fair 5 yrs low safe
poor 3 yrs high risky
poor 5 yrs low safe
fair 3 yrs high safe
(all data)
Start with all the data
Loan status: Safe Risky
N = 40 examples
# of Safe loans
22 # of Risky loans
18
19
22 Root 18
Compact visual notation: Root node
Loan status: Safe Risky
N = 40 examples
# of Safe loans
# of Risky loans
Decision stump: Single level tree
22 Root 18 Loan status:
Safe Risky
4 poor 14 fair
9 4 excellent
9 0
Credit?
Split on Credit
Subset of data with
Credit = excellent Subset of data with
Credit = fair Subset of data with
Credit = poor
21
Visual notation: Intermediate nodes
22 Root 18
excellent
9 0 fair
9 4 poor
4 14 Loan status:
Safe Risky
Credit?
Intermediate nodes
Proceed to build a tree on each of children
22 root 18
excellent
9 0 fair
9 4 poor
4 14 Loan status:
Safe Risky
credit?
Safe
Selecting best feature to split on
How do we pick our first split?
22 Root 18
excellent
9 0 fair
9 4 poor
4 14 Loan status:
Safe Risky
Credit?
Find the “best”
feature to split on!
25
How do we select the best feature?
Root 22 18
excellent 9 0
fair
9 4 poor
4 14 Loan status:
Safe Risky
Credit?
Choice 1: Split on Credit
Root 22 18
3 years
16 4 5 years
6 14 Loan status:
Safe Risky
Term?
Choice 2: Split on Term
OR
How do we measure effectiveness of a split?
Error = # mistakes
# data points
Root 22 18
poor 4 14 Loan status:
Safe Risky
Credit?
excellent 9 0
9 fair 4
Idea: Calculate classification error
of this decision stump
27
Calculating classification error
• Step 1: ŷ = class of majority of data in node
• Step 2: Calculate classification error of predicting ŷ for this data
Root 22 18 Loan status:
Safe Risky
18 mistakes 22 correct
ŷ = majority class
Safe Tree Classification error
(root) 0.45
Choice 1: Split on Credit history?
Does a split on Credit reduce classification error below 0.45?
22 Root 18
excellent
9 0 fair
9 4 poor
4 14 Loan status:
Safe Risky
Credit?
Choice 1: Split on Credit
29
Split on Credit: Classification error
22 Root 18
excellent
9 0 fair
9 4 poor
4 14 Loan status:
Safe Risky
Credit?
0 mistakes 4 mistakes 4 mistakes
Safe Safe Risky
Choice 1: Split on Credit
Error = 4+4 _
.22 + 18
= 0.2
Tree Classification error
(root) 0.45
Split on credit 0.2
Choice 2: Split on Term?
22 Root 18
3 years
16 4 5 years
6 14 Loan status:
Safe Risky
Term?
Safe Risky
Choice 2: Split on Term
31
Evaluating the split on Term
22 Root 18
3 years
16 4 5 years
6 14 Loan status:
Safe Risky
Term?
4 mistakes 6 mistakes
Safe Risky
Error = 4+6
.= 0.25 40
Tree Classification error
(root) 0.45
Split on credit 0.2
Split on term 0.25
Choice 2: Split on Term
Choice 1 vs Choice 2:
Comparing split on Credit vs Term
Root 22 18
excellent 9 0
fair
8 4 poor
4 14 Loan status:
Safe Risky Root
22 18
3 years
16 4 5 years
6 14 Loan status:
Safe Risky
OR
Credit? Term?
Tree Classification
error
(root) 0.45
split on credit 0.2
split on loan term 0.25
WINNER
Choice 2: Split on Term
Choice 1: Split on Credit
33
Feature split selection algorithm
• Given a subset of data M (a node in a tree)
• For each feature ϕ i (x):
1. Split data of M according to feature ϕ i (x) 2. Compute classification error split
• Chose feature ϕ * (x) with lowest classification error
Recursion & Stopping conditions
35
We’ve learned a decision stump, what next?
22 Root 18
excellent
9 0 fair
9 4 poor
4 14 Loan status:
Safe Risky
Credit?
Safe All data points are Safe è
nothing else to do with this subset of data
Leaf node
Tree learning = Recursive stump learning
22 Root 18
excellent
9 0 fair
9 4 poor
4 14 Loan status:
Safe Risky
Credit?
Safe
Build decision stump with subset of data where
Credit = poor Build decision stump with
subset of data where
Credit = fair
37
Second level
22 Root 18
Loan status:
Safe Risky
Credit?
excellent
9 0 fair
9 4 poor
4 14
Safe
3 years
0 4 5 years
9 0 Term?
Risky Safe
Build another stump these data points
4 high 5 Low 0 9 Income?
Risky
Final decision tree
Root 22 18
excellent
9 0 Fair
9 4
poor 4 14
Loan status:
Safe Risky
Credit?
Safe
5 years 9 0 3 years
0 4
Term?
Risky Safe
low 0 9 high 4 5
Income?
5 years 4 3 3 years
0 2
Term?
Risky Safe
Risky
39
Simple greedy decision tree learning
Pick best feature to split on
Learn decision stump with this split
For each leaf of decision stump, recurse
When do we stop???
Stopping condition 1: All data agrees on y
Root 22 18
excellent
9 0 Fair
9 4
poor 4 14
Loan status:
Safe Risky
Credit?
Safe
5 years 9 0 3 years
0 4
Term?
Risky Safe
low 0 9 high 4 5
Income?
5 years 4 3
Term?
Risky Safe
Risky
3 years 0 2 3 years
0 2
All data in these nodes have same
y value è Nothing to do
excellent 9 0
5 years 9 0 3 years
0 4
low
0 9
41
Stopping condition 2: Already split on all features
Root 22 18
excellent
9 0 Fair
9 4
poor 4 14
Loan status:
Safe Risky
Credit?
Safe
5 years 9 0 3 years
0 4
Term?
Risky Safe
low 0 9 high 4 5
Income?
5 years 4 3
Term?
Risky Safe
Risky
3 years 0 2
Already split on all possible features è
Nothing to do
5 years
4 3
Recursion
Stopping conditions 1
& 2
Greedy decision tree learning
• Step 1: Start with an empty tree
• Step 2: Select a feature to split data
• For each split of the tree:
• Step 3: If nothing more to do, make predictions
• Step 4: Otherwise, go to Step 2 &
continue (recurse) on this split
Pick feature split
leading to lowest
classification error
43
Is this a good idea?
Proposed stopping condition 3:
Stop if no split reduces the
classification error
Stopping condition 3:
D on’t stop if error doesn’t decrease???
y values
True False Root 2 2
Error = 2
.= 0.5 4
Tree Classification error
(root) 0.5
x[1] x[2] y
False False False False True True
True False True True True False
y = x[1] xor x[2]
45
Consider split on x[1]
y values
True False Root 2 2
Error = 2
.= 0.5 4
Tree Classification error
(root) 0.5
Split on x[1] 0.5
1 True 1 False
1 1 x[1]
x[1] x[2] y
False False False False True True
True False True True True False
y = x[1] xor x[2]
Consider split on x[2]
y values
True False Root 2 2
Error = 2
.= 0.5 4
Tree Classification error
(root) 0.5
Split on x[1] 0.5
Split on x[2] 0.5
1 True 1 False
1 1 x[2]
Neither features
improve training error…
Stop now???
x[1] x[2] y
False False False False True True
True False True True True False
y = x[1] xor x[2]
47
Final tree with stopping condition 3
Tree Classification error
with stopping
condition 3 0.5
y values
True False Root 2 2
Predict True
x[1] x[2] y
False False False False True True
True False True True True False
y = x[1] xor x[2]
Without stopping condition 3
y values True False
Root 2 2
1 True 1 False
1 1 x[1]
0 True 1
x[2]
1 True 0 False
1 0
x[2]
False 0 1
True False
False True
Tree Classification
error with stopping
condition 3 0.5
without stopping condition 3
x[1] x[2] y
False False False False True True
True False True True True False
y = x[1] xor x[2]
Condition 3 (stopping when training error doesn’t improve) not always recommended!
0
Decision tree learning:
Real valued features
How do we use real values inputs?
Income Credit Term y
$105 K excellent 3 yrs Safe
$112 K good 5 yrs Risky
$73 K fair 3 yrs Safe
$69 K excellent 5 yrs Safe
$217 K excellent 3 yrs Risky
$120 K good 5 yrs Safe
$64 K fair 3 yrs Risky
$340 K excellent 5 yrs Safe
$60 K good 3 yrs Risky
51
Threshold split
Root 22 18 Loan status:
Safe Risky
Split on the feature Income
< $60K
8 13 >= $60K
14 5 Income?
Subset of data with
Income >= $60K
Finding the best threshold split
Infinite possible values of t
Income < t * Income >= t *
Safe Risky
Income
$120K
$10K
Income = t *
53
Consider a threshold between points
Safe Risky
Income
$120K
$10K
v
Av
BSame classification error for any
threshold split between v A and v B
Only need to consider mid-points
Safe Risky
Income
$120K
$10K
Finite number of splits
to consider
55
Threshold split selection algorithm
• Step 1: Sort the values of a feature h j (x) :
Let {v 1 , v 2 , v 3 , … v N } denote sorted values
• Step 2:
- For i = 1 … N-1
• Consider split t i = (v i + v i+1 ) / 2
• Compute classification error for threshold split h j (x) >= t i
- Choose the t * with the lowest classification error
Visualizing the threshold split
0 10 20 30 40 …
$0K
$40K
$80K
…
Age
Income Threshold split is the line Age = 38
57
Split on Age >= 38
Age Income age < 38 age >= 38
Predict Safe Predict Risky
0 10 20 30 40 …
$0K
$40K
$80K
…
Depth 2: Split on Income >= $60K
Age Income
0 10 20 30 40 …
$0K
$40K
$80K
…
Threshold split is the line Income = 60K
59
Each split partitions the 2-D space
Age
Age >= 38 Income >= 60K Age < 38
Age >= 38 Income < 60K
Income
0 10 20 30 40 …
$0K
$40K
$80K
…
Overfitting in decision trees
61
Decision tree recap
Root 22 18
excellent
9 0 Fair
9 4
poor 4 14
Loan status:
Safe Risky
Credit?
Safe
5 years 9 0 3 years
0 4
Term?
Risky Safe
low 0 9 high 4 5
Income?
5 years 4 3 3 years
0 2
Term?
Risky Safe
Risky
For each leaf node,
set ŷ = majority value
Tree depth depth = 1 depth = 2 depth = 3 depth = 5 depth = 10
Training error 0.22 0.13 0.10 0.03 0.00
Decision boundary
Training error reduces with depth
What happens when we increase depth?
63
How do we regularize? In this context, simplicity
Complex Tree
Simpler Tree
Two approaches to picking simpler trees
1. Early Stopping:
Stop the learning algorithm before tree becomes too complex 2. Pruning:
Simplify the tree after the learning algorithm terminates
65
Technique 1: Early stopping
• Stopping conditions (recap):
1. All examples have the same target value 2. No more features to split on
• Early stopping conditions:
1. Limit tree depth (choose max_depth using validation set)
2. Do not consider splits that do not cause a sufficient decrease in classification error
3. Do not split an intermediate node which contains too few data points
Challenge with early stopping condition 1
Cl as sif ic at io n Er ro r
Complex trees
True error Training error
Simple trees
Tree depth
max_depth
Hard to know exactly when to stop
Also, might want
some branches of
tree to go deeper
while others remain
shallow
67
Early stopping condition 2: Pros and Cons
• Pros:
- A reasonable heuristic for early stopping to avoid useless splits
• Cons:
- Too short sighted: We may miss out on “good” splits may occur right after “useless” splits
- Saw this with “xor” example
Two approaches to picking simpler trees
1. Early Stopping:
Stop the learning algorithm before tree becomes too complex 2. Pruning:
Simplify the tree after the learning algorithm terminates
Complements early stopping
69
Pruning: Intuition
Train a complex tree, simplify later
Complex Tree
Simplify
Simpler Tree
Pruning motivation
Cl as sif ic at io n Er ro r True Error
Training Error
Tree depth
Don’t stop too early Complex
tree
Simplify after
tree is built
Simple tree
71
Scoring trees: Desired total quality format
Want to balance:
i. How well tree fits data ii. Complexity of tree
Total cost =
measure of fit + measure of complexity
(classification error) Large # = bad fit to
training data
Large # = likely to overfit
want to balance
Simple measure of complexity of tree
Credit?
Start
Safe
Risky
excellent poor
bad Safe
Safe
Risky
good fair
L(T) = # of leaf nodes
73
Balance simplicity & predictive power
Credit?
Term?
Risky
Start
fair
3 years Safe
Safe
Income?
Term?
Risky Safe
Risky
excellent poor
5 years
high low
5 years 3 years
Risky
Start
Too complex, risk of overfitting
Too simple, high
classification error
Balancing fit and complexity
Total cost C(T) = Error(T) + λ L(T) tuning parameter If λ =0: build out full tree.
If λ =∞: won’t ever split.
If λ in between:
75
Having said all that…
Vanilla decision trees just don’t really work very well.
CSE 446: Machine Learning
Bagging
77
Suppose we want to reduce variance
Recall definition of variance
If we had lots of independent data sets, we could average their results and get the variance to approach 0
Given m independent data sets (D 1 , …, D m ), each of size n, learn m classifiers (f 1 , …, f m ), where f i trained on D i.
By law of large numbers
But we don’t have lots of data sets
Bootstrap Aggregation (aka bagging)
• A technique for addressing high variance (e.g. in decision trees)
- Given sample D, draw m samples (D 1 , …, D m ) of size n, with replacement.
- Train a classifier on each.
- Use average (for regression) or majority vote (for classification).
79
Advantages of bagging
• Easy to implement
• Reduces variance
• Gives information about uncertainty of prediction.
• Provides estimate of test error, “out of bag” error without using validation set.
Let and define
Then out of bag error is:
Bagging produces an ensemble classifier
81
Ensemble classifier in general
• Goal:
- Predict output y
• Either +1 or -1
- From input x
• Learn ensemble model:
- Classifiers: f 1 (x),f 2 (x),…,f T (x) - Coefficients: ŵ 1 ,ŵ 2 ,…,ŵ T
• Prediction:
ˆ
y = sign
X T t=1
w ˆ t f t (x)
!
CSE 446: Machine Learning
Random Forests
83
One of the most useful bagged algorithms
• Given data set D, draw m samples (D 1 , …, D m ) of size n from D, with replacement.
• For each D j , train a full decision tree h j with one crucial modification:
- Before each split, randomly subsample k features (without replacement) and only consider these for your split.
• Final classifier obtained by averaging.
One of best, most popular and easiest to use algorithms
• Only two hyperparameters m and k.
• Extremely insensitive to both.
- Classification: good choice for k is 𝑑 (where 𝑑 is number of features)
- Regression: good choice for k is d/3 - Set m as large as you can afford.
• Works really well!
• Features can be of different scale, magnitude, etc.
- Very convenient with heterogeneous features.
CSE 446: Machine Learning Emily Fox
University of Washington January 30, 2017
Boosting
Simple (weak) classifiers are good!
Logistic regression w.
simple features
Low variance. Learning is fast!
But high bias…
Shallow
decision trees Decision stumps
Income>$100K?
Safe Risky
No
Yes
87
Finding a classifier that’s just right
Model complexity Cl as sif ic at io n er ro r
train error
true error
Weak learner Need stronger learner
Option 1: add more features or depth
Option 2: ?????
Boosting question
“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)
Yes! Schapire (1990)
Boosting
Amazing impact: simple approach widely used in industry
wins most Kaggle competitions
Ensemble classifier
A single classifier
Output: ŷ = f(x)
- Either +1 or -1
Input: x
Classifier
Income>$100K?
Safe Risky
No
Yes
91
Income>$100K?
Safe Risky
No
Yes
Ensemble methods: Each classifier “votes” on prediction
x
i= (Income=$120K, Credit=Bad, Savings=$50K, Market=Good)
f 1 (x i ) = +1
Combine?
F(x i ) = sign(w 1 f 1 (x i ) + w 2 f 2 (x i ) + w 3 f 3 (x i ) + w 4 f 4 (x i ))
Ensemble
model Learn coefficients
Income>$100K?
Safe Risky
No Yes
Credit history?
Risky Safe
Good Bad
Savings>$100K?
Safe Risky
No Yes
Market conditions?
Risky Safe
Good Bad
Income>$100K?
Safe Risky
No Yes
f 2 (x i ) = -1
Credit history?
Risky Safe
Good Bad
f 3 (x i ) = -1
Savings>$100K?
Safe Risky
No Yes
f 4 (x i ) = +1
Market conditions?
Risky Safe
Good
Bad
93
Prediction with ensemble
w 1 2
w 2 1.5
w 3 1.5
w 4 0.5
f 1 (x i ) = +1
Combine?
F(x i ) = sign(w 1 f 1 (x i ) + w 2 f 2 (x i ) + w 3 f 3 (x i ) + w 4 f 4 (x i ))
Ensemble
model Learn coefficients
Income>$100K?
Safe Risky
No Yes
Credit history?
Risky Safe
Good Bad
Savings>$100K?
Safe Risky
No Yes
Market conditions?
Risky Safe
Good Bad
Income>$100K?
Safe Risky
No Yes
f 2 (x i ) = -1
Credit history?
Risky Safe
Good Bad
f 3 (x i ) = -1
Savings>$100K?
Safe Risky
No Yes
f 4 (x i ) = +1
Market conditions?
Risky Safe
Good
Bad
Ensemble classifier in general
• Goal:
- Predict output y
• Either +1 or -1
- From input x
• Learn ensemble model:
- Classifiers: f 1 (x),f 2 (x),…,f T (x) - Coefficients: ŵ 1 ,ŵ 2 ,…,ŵ T
• Prediction:
ˆ
y = sign
X T t=1
w ˆ t f t (x)
!
Boosting
Training a classifier
Training data Learn
classifier Predict ŷ = sign(f(x))
f(x)
97
Learning decision stump
Credit Income y
A $130K Safe
B $80K Risky
C $110K Risky
A $110K Safe
A $90K Safe
B $120K Safe
C $30K Risky
C $60K Risky
B $95K Safe
A $60K Safe
A $98K Safe
Credit Income y
A $130K Safe
B $80K Risky
C $110K Risky
A $110K Safe
A $90K Safe
B $120K Safe
C $30K Risky
C $60K Risky
B $95K Safe
A $60K Safe
A $98K Safe
Income?
> $100K ≤ $100K
ŷ = Safe ŷ = Safe
3 1 4 3
X X
X
X
Boosting = Focus learning on “hard” points
Training data Learn
classifier Predict ŷ = sign(f(x))
f(x)
Learn where f(x)
makes mistakes Evaluate Boosting: focus next
classifier on places where
f(x) does less well
99
Learning on weighted data:
More weight on “hard” or more important points
• Weighted dataset:
- Each x i ,y i weighted by α i
• More important point = higher weight α i
• Learning:
- Data point i counts as α i data points
• E.g., α i = 2 è count point twice
Credit Income y
A $130K Safe
B $80K Risky
C $110K Risky
A $110K Safe
A $90K Safe
B $120K Safe
C $30K Risky
C $60K Risky
B $95K Safe
A $60K Safe
A $98K Safe
Learning a decision stump on weighted data
Credit Income y Weight α
A $130K Safe 0.5
B $80K Risky 1.5
C $110K Risky 1.2
A $110K Safe 0.8
A $90K Safe 0.6
B $120K Safe 0.7
C $30K Risky 3
C $60K Risky 2
B $95K Safe 0.8
A $60K Safe 0.7
A $98K Safe 0.9
Income?
> $100K ≤ $100K
ŷ = Risky ŷ = Safe
2 1.2 3 6.5
Increase weight α of harder/misclassified points
X X
X
X
101
Learning from weighted data in general
• Learning from weighted data generally no harder.
- Data point i counts as α i data points
• E.g., gradient descent for logistic regression:
w (t+1) j w (t) j + ⌘ X N
i=1
h j (x i ) ⇣
1[y i = +1] P (y = +1 | x i , w (t) ) ⌘ w (t+1) j w (t) j + ⌘
X N
i=1
h j (x i ) ⇣
1[y i = +1] P (y = +1 | x i , w (t) ) ⌘
α i
Sum over data points
Weigh each point by α i
Boosting = Greedy learning ensembles from data
Training data
Predict ŷ = sign(f 1 (x)) Learn classifier
f 1 (x)
Weighted data
Learn classifier & coefficient
ŵ,f 2 (x)
Predict ŷ = sign( ŵ 1 f 1 (x) + ŵ 2 f 2 (x) )
Higher weight for points where f 1 (x)
is wrong
AdaBoost
AdaBoost: learning ensemble
[Freund & Schapire 1999]
• Start with same weight for all points: α i = 1/N
• For t = 1,…,T
- Learn f t (x) with data weights α i - Compute coefficient ŵ t
- Recompute weights α i
• Final model predicts by:
ˆ
y = sign
X T
t=1
w ˆ t f t (x)
!
Computing coefficient ŵ t
AdaBoost: Computing coefficient ŵ t of classifier f t (x)
• f t (x) is good è f t has low training error
• Measuring error in weighted data?
- Just weighted # of misclassified points
Is f t (x) good? Yes ŵ t large ŵ t small
No
107
Data point
Weighted classification error
(Sushi was great, ,α=1.2)
Learned classifier
Hide label
Weight of correct Weight of mistakes
Sushi was great
ŷ =
Correct! 0.5 1.2 0 0
(Food was OK, ,α=0.5) Food was OK Mistake!
α=1.2
α=0.5
Weighted classification error
• Total weight of mistakes:
• Total weight of all points:
• Weighted error measures fraction of weight of mistakes:
- Best possible value is 0.0
weighted_error = Total weight of all mistakes .
Total weight of all data
Worst possible value is 1.0 Actual worst is 0.5
109
AdaBoost:
Formula for computing coefficient ŵ t of classifier f t (x) ˆ
w t = 1 2 ln
✓ 1 weighted error(f t ) weighted error(f t )
◆
weighted_error(f t )
on training data ŵ t
Is f t (x) good? Yes
No
1 weighted error(f
t) weighted error(f
t)
0.01 (1-0.01)/0.01 = 99 0.5 ln(99) = 2.3
0.5 0.99
(1-0.5)/0.5 = 1
(1-0.99)/0.99 = 0.01
0.5 ln(1) = 0 0.5 ln(0.01) = -2.3
Last one is a terrible classifier, but - f
t(x) is great!
AdaBoost: learning ensemble
• Start with same weight for all points: α i = 1/N
• For t = 1,…,T
- Learn f t (x) with data weights α i - Compute coefficient ŵ t
- Recompute weights α i
• Final model predicts by:
ˆ
y = sign
X T
t=1
w ˆ t f t (x)
!
w ˆ t = 1 2 ln
✓ 1 weighted error(f t ) weighted error(f t )
◆
Recompute weights α i
AdaBoost: Updating weights α i based on where classifier f t (x) makes mistakes
Did f t get x i right?
Decrease α i
Yes
Increase α i
No
113
AdaBoost: Formula for updating weights α i
f t (x i )=y i ? ŵ t Multiply α i by Implication
correct correct mistake mistake
Did f t get x i right? Yes
No
α i ç α i e , -ŵ t if f t (x i )=y i α i e , ŵ t if f t (x i )≠y i
2.3 0 2.3 0
e
-2.3= 0.1 e
-0= 1 e
2.3= 9.98 e
-0= 1
Decrease importance of x
i, y
iKeep importance same
Increase importance
Keep importance same
AdaBoost: learning ensemble
• Start with same weight for all points: α i = 1/N
• For t = 1,…,T
- Learn f t (x) with data weights α i - Compute coefficient ŵ t
- Recompute weights α i
• Final model predicts by:
ˆ
y = sign
X T
t=1
w ˆ t f t (x)
!
w ˆ t = 1 2 ln
✓ 1 weighted error(f t ) weighted error(f t )
◆
α i ç α i e , -ŵ t if f t (x i )=y i
α i e , ŵ t if f t (x i )≠y i
CSE 446: Machine Learning
115
AdaBoost: Normalizing weights α i
©2017 Emily Fox
If x i often mistake, weight α i gets very
large
If x i often correct, weight α i gets very
small
Can cause numerical instability after many iterations
Normalize weights to
add up to 1 after every iteration
↵ i ↵ i P N
j=1 ↵ j
.
CSE 446: Machine Learning
116
AdaBoost: learning ensemble
• Start with same weight for all points: α i = 1/N
• For t = 1,…,T
- Learn f t (x) with data weights α i - Compute coefficient ŵ t
- Recompute weights α i - Normalize weights α i
• Final model predicts by:
©2017 Emily Fox
ˆ
y = sign
X T
t=1
ˆ
w t f t (x)
!
ˆ
w t = 1 2 ln
✓ 1 weighted error(f t ) weighted error(f t )
◆
α i ç α i e , -ŵ t if f t (x i )=y i α i e , ŵ t if f t (x i )≠y i
↵ i ↵ i P N
j=1 ↵ j
.
AdaBoost example
t=1: Just learn a classifier on original data
Learned decision stump f 1 (x)
Original data
119
Updating weights α i
Learned decision stump f 1 (x) New data weights α i
Boundary
Increase weight α i
of misclassified points
t=2: Learn classifier on weighted data
Learned decision stump f 2 (x) on weighted data
Weighted data: using α i chosen in previous iteration
f 1 (x)
121
Ensemble becomes weighted sum of learned classifiers
=
f 1 (x) f 2 (x)
0.61 ŵ 1
+ 0.53
ŵ 2
Decision boundary of ensemble classifier after 30 iterations
training_error = 0
Boosting convergence & overfitting
Boosting question revisited
“Can a set of weak learners be combined to create a stronger learner?” Kearns and Valiant (1988)
Yes! Schapire (1990)
Boosting
125
After some iterations,
training error of boosting goes to zero!!!
Tr ai ni ng e rr or
Iterations of boosting
Boosted decision stumps on toy dataset
Training error of ensemble of 30 decision stumps = 0%
Training error of
1 decision stump = 22.5%
AdaBoost Theorem
Under some technical conditions…
Training error of
boosted classifier → 0
as T→∞ Tr ai ni
ng e rr or
Iterations of boosting
May oscillate a bit
But will
generally decrease, &
eventually become 0!
127
Condition of AdaBoost Theorem
Under some technical conditions…
Training error of
boosted classifier → 0 as T→∞
Extreme example:
No classifier can separate a +1
on top of -1
Condition = At every t, can find a weak learner with
weighted_error(f t ) < 0.5
Not always possible
Nonetheless, boosting often
yields great training error
Overfitting in boosting
129
Boosted decision stumps on loan data Decision trees on loan data
39% test error
8% training error Overfitting
32% test error
28.5% training error
Better fit & lower test error
Boosting tends to be robust to overfitting
Test set performance about
constant for many iterations
è Less sensitive to choice of T
131
But boosting will eventually overfit,
so must choose max number of components T
Best test error around 31%
Test error eventually increases to
33% (overfits)
How do we decide when to stop boosting?
Choosing T ?
Not on training data
Never ever ever ever on
test data Validation set Cross-validation
Like selecting parameters in other ML approaches (e.g., λ in regularization)
Not useful: training error improves
as T increases
If dataset is large For smaller data
Summary of boosting
Variants of boosting and related algorithms
There are hundreds of variants of boosting, most important:
• Like AdaBoost, but useful beyond basic classification
• XGBoost (Tianqi Chen, UW)
Gradient
boosting
135
Variants of boosting and related algorithms
There are hundreds of variants of boosting, most important:
Many other approaches to learn ensembles, most important:
• Like AdaBoost, but useful beyond basic classification
• XGBoost (Tianqi Chen, UW)
Gradient boosting
• Bagging: Pick random subsets of the data - Learn a tree in each subset
- Average predictions
• Simpler than boosting & easier to parallelize
• Typically higher error than boosting for same # of trees (# iterations T)
Random
forests
Impact of boosting ( spoiler alert... HUGE IMPACT )
• Standard approach for face detection, for example
Extremely useful in computer vision
• Malware classification, credit fraud detection, ads click through rate estimation, sales forecasting, ranking webpages for search, Higgs boson detection,…
Used by most winners of ML competitions
(Kaggle, KDD Cup,…)
• Coefficients chosen manually, with boosting, with bagging, or others