Ensemble Methods
Lecture 07
STAT 479: Machine Learning, Fall 2018 Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/
Overview
Ensemble Methods: Majority Voting, Bagging, Boosting, Random Forests, Stacking
Majority Voting
Unanimity
Majority
Plurality
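Majority Vote Classifier
[Figure: a training set is used to fit n classification models h1, h2, ..., hn; on new data, their predictions y1, y2, ..., yn are combined by voting into the final prediction yf.]
$\hat{y}_f = \text{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\}$, where $h_i(x) = \hat{y}_i$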
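As a minimal sketch of the mode rule above, the final prediction can be computed by counting the class labels predicted for one input. The function name `majority_vote` and the example votes are hypothetical and only serve as an illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent class label (the mode) among the predictions."""
    # most_common(1) yields [(label, count)] for the top label;
    # ties are resolved in favor of the label that appears first in the input.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical predictions h_1(x), ..., h_5(x) for a single input x
votes = [1, 0, 1, 1, 0]
y_f = majority_vote(votes)
print(y_f)  # 1
```

With more than two classes, the same function implements plurality voting, since the mode does not need to reach an absolute majority.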
Why Majority Vote?
• assume n independent classifiers with a base error rate $\epsilon$
• here, independent means that the errors are uncorrelated
• assume a binary classification task
• assume the error rate is better than random guessing (i.e., lower than 0.5 for binary classification): $\forall \epsilon_i \in \{\epsilon_1, \epsilon_2, \ldots, \epsilon_n\}, \; \epsilon_i < 0.5$
The probability that exactly k of the n classifiers predict the same wrong class label:
$P(k) = \binom{n}{k} \epsilon^k (1 - \epsilon)^{n-k}$
The ensemble makes a wrong prediction when $k \ge \lceil n/2 \rceil$ (for odd n).
Ensemble error:
$\epsilon_{ens} = \sum_{k=\lceil n/2 \rceil}^{n} \binom{n}{k} \epsilon^k (1 - \epsilon)^{n-k}$
Example with n = 11 and $\epsilon = 0.25$:
$\epsilon_{ens} = \sum_{k=6}^{11} \binom{11}{k} 0.25^k (1 - 0.25)^{11-k} = 0.034$
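The ensemble error can be checked numerically. The sketch below assumes the sum starts at k = ⌈n/2⌉ (k = 6 for n = 11), which reproduces the value 0.034 from the example; the function name `ensemble_error` is hypothetical.

```python
from math import ceil, comb

def ensemble_error(n, eps):
    """Probability that at least ceil(n/2) of n independent classifiers are wrong.

    Assumes an odd n, so that no ties can occur, and an identical
    base error rate eps for all classifiers.
    """
    k_start = ceil(n / 2)
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(k_start, n + 1))

print(round(ensemble_error(11, 0.25), 3))  # 0.034
```

Under the same assumptions, a base error rate above 0.5 makes the ensemble worse than its members, which is why the error rate is required to be better than random guessing.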
"Soft" Voting
$\hat{y} = \arg\max_j \sum_{i=1}^{n} w_i \, p_{i,j}$
$p_{i,j}$: predicted class membership probability of the ith classifier for class label j
$w_i$: optional weighting parameter, default $w_i = 1/n, \; \forall w_i \in \{w_1, \ldots, w_n\}$
Use only for well-calibrated classifiers!
"Soft" Voting: binary classification example
$j \in \{0, 1\}$, $h_i \; (i \in \{1, 2, 3\})$
$h_1(x) \rightarrow [0.9, 0.1]$
$h_2(x) \rightarrow [0.8, 0.2]$
$h_3(x) \rightarrow [0.4, 0.6]$
With weights $w_1 = 0.2$, $w_2 = 0.2$, $w_3 = 0.6$:
$p(j = 0 | x) = 0.2 \cdot 0.9 + 0.2 \cdot 0.8 + 0.6 \cdot 0.4 = 0.58$
$p(j = 1 | x) = 0.2 \cdot 0.1 + 0.2 \cdot 0.2 + 0.6 \cdot 0.6 = 0.42$
$\hat{y} = \arg\max_j \{p(j = 0 | x), p(j = 1 | x)\} = 0$
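The worked example can be reproduced with a few lines of Python. This is a sketch only: the function name `soft_vote` is hypothetical, and the weights 0.2, 0.2, 0.6 are read off the arithmetic in the example.

```python
def soft_vote(probas, weights):
    """Weighted soft voting: return (argmax_j sum_i w_i * p_ij, per-class scores)."""
    n_classes = len(probas[0])
    scores = [sum(w * p[j] for w, p in zip(weights, probas))
              for j in range(n_classes)]
    y_hat = max(range(n_classes), key=scores.__getitem__)
    return y_hat, scores

# Class membership probabilities predicted by h1, h2, h3 for one input x
probas = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]]
weights = [0.2, 0.2, 0.6]

y_hat, scores = soft_vote(probas, weights)
print(y_hat, [round(s, 2) for s in scores])  # 0 [0.58, 0.42]
```

Note that a hard majority vote over the same three classifiers would also predict class 0 here (two votes against one), but the two rules can disagree when a heavily weighted classifier is very confident.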
Bagging
(Bootstrap Aggregating)
Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.
Algorithm 1 Bagging
Let n be the number of bootstrap samples
for i = 1 to n do
    Draw bootstrap sample of size m, $D_i$
    Train base classifier $h_i$ on $D_i$
$\hat{y} = \text{mode}\{h_1(x), \ldots, h_n(x)\}$
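The bagging procedure can be sketched in plain Python as follows. The base classifier is an assumption made for brevity: a hypothetical one-dimensional threshold rule (`fit_stump`), not the trees typically used in practice, and the toy dataset is made up for illustration.

```python
import random
from collections import Counter

def fit_stump(data):
    """Hypothetical base classifier: predict 1 if x > t, else 0.

    The threshold t is the training value with the fewest training errors.
    """
    best_t, best_err = None, float("inf")
    for t, _ in data:
        err = sum((x > t) != bool(y) for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x: int(x > best_t)

def bagging(data, n_estimators, rng):
    """Train one base classifier per bootstrap sample."""
    m = len(data)
    models = []
    for _ in range(n_estimators):
        sample = rng.choices(data, k=m)  # draw m examples with replacement
        models.append(fit_stump(sample))
    return models

def predict(models, x):
    """Final prediction: mode of the base classifiers' predictions."""
    votes = [h(x) for h in models]
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
# Toy dataset (an assumption): label 1 whenever x > 5
data = [(x, int(x > 5)) for x in range(11)]
models = bagging(data, n_estimators=25, rng=rng)
print(predict(models, 1), predict(models, 9))
```

Each bootstrap sample here has the same size as the original dataset, which is a common choice but not the only one the algorithm allows.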
Bootstrap Sampling
[Figure: an Original Dataset (x1, ..., x10) and three samples drawn from it with replacement (Bootstrap 1, Bootstrap 2, Bootstrap 3), which serve as Training Sets; the examples not drawn in a given round form the corresponding Test Sets.]
$P(\text{not chosen}) = \left(1 - \frac{1}{n}\right)^n$, which approaches $\frac{1}{e} \approx 0.368$ as $n \rightarrow \infty$.
$P(\text{chosen}) = 1 - \left(1 - \frac{1}{n}\right)^n \approx 0.632$
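Both quantities are easy to check numerically. The sketch below evaluates the closed-form expression for a few values of n and compares it against one simulated bootstrap sample; the function names and the random seed are arbitrary choices for illustration.

```python
import random

def p_not_chosen(n):
    """Probability that a given example is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

def unique_fraction(n, rng):
    """Fraction of distinct examples in one simulated bootstrap sample of size n."""
    sample = [rng.randrange(n) for _ in range(n)]
    return len(set(sample)) / n

for n in (10, 100, 1000):
    print(n, round(p_not_chosen(n), 4))

rng = random.Random(1)
frac = unique_fraction(10_000, rng)
print(round(frac, 3))  # expected to be close to 0.632
```

The closed-form value is already close to 1/e for moderately sized datasets, so roughly a third of the examples are left out of each bootstrap sample.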
Training example indices   Bagging round 1   Bagging round 2   …
1                          2                 7                 …
2                          2                 3                 …
3                          1                 2                 …
4                          3                 1                 …
5                          7                 1                 …
6                          2                 7                 …
7                          4                 7                 …
Classifier                 h1                h2                hn
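Bagging Classifier
[Figure: bootstrap samples T1, T2, ..., Tn are drawn from the training set; classification models h1, h2, ..., hn are fit to them; on new data, their predictions y1, y2, ..., yn are combined by voting into the final prediction yf.]
$\hat{y}_f = \text{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\}$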
Bias-Variance Decomposition
Loss = Bias + Variance + Noise
(more technical details in next lecture on model evaluation)
Bias-Variance Intuition
[Figure: four panels combining Low Bias (Accurate) or High Bias (Not Accurate) with Low Variance (Precise) or High Variance (Not Precise).]
Bias and Variance Example
[Figure: f(x) is some true (target) function; the blue dots are a training dataset to which random Gaussian noise was added.]
[Figure: High Bias.]
[Figure: High Variance. Why?]
[Figure: High Variance; suppose we have multiple training sets.]
So, why does bagging work/what does it do?
Boosting
Adaptive Boosting, e.g., AdaBoost (here!)
Gradient Boosting, e.g., XGBoost
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189-1232.
Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD
Differ mainly in terms of how
• weights are updated
• classifiers are combined
General Boosting
[Figure: Training Sample → h1(x); Weighted Training Sample → h2(x); ...; Weighted Training Sample → hm(x).]
$h_m(x) = \text{sign}\left(\sum_{j=1}^{m} w_j h_j(x)\right)$ for $h(x) \in \{-1, 1\}$
or
$h_m(x) = \arg\max_i \left(\sum_{j=1}^{m} w_j \, 1[h_j(x) = i]\right)$ for $h(x) = i, \; i \in \{1, \ldots, n\}$
General Boosting
‣ Initialize a weight vector with uniform weights
‣ Loop:
    ‣ Apply weak learner* to weighted training examples (instead of the original training set, may draw bootstrap samples with weighted probability)
    ‣ Increase weight for misclassified examples
‣ (Weighted) majority voting on trained classifiers
AdaBoost
Algorithm 1 AdaBoost
Initialize k: the number of AdaBoost rounds
Initialize D: the training dataset, $D = \{\langle x^{[1]}, y^{[1]} \rangle, \ldots, \langle x^{[n]}, y^{[n]} \rangle\}$
Initialize $w_1(i) = 1/n, \; i = 1, \ldots, n, \; w_1 \in \mathbb{R}^n$
for r = 1 to k do
    For all i: $w_r(i) := w_r(i) / \sum_i w_r(i)$ [normalize weights]
    $h_r := \text{FitWeakLearner}(D, w_r)$
    $\epsilon_r := \sum_i w_r(i) \, 1(h_r(i) \neq y_i)$ [compute error]
    if $\epsilon_r > 1/2$ then stop
    $\alpha_r := \frac{1}{2} \log[(1 - \epsilon_r) / \epsilon_r]$ [small if error is large and vice versa]
    $w_{r+1}(i) := w_r(i) \times e^{-\alpha_r}$ if $h_r(x^{[i]}) = y^{[i]}$, and $w_{r+1}(i) := w_r(i) \times e^{\alpha_r}$ if $h_r(x^{[i]}) \neq y^{[i]}$
Predict: $h_k(x) = \arg\max_j \sum_r \alpha_r \, 1[h_r(x) = j]$
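A compact Python sketch of the AdaBoost steps listed above is shown below. Several pieces are assumptions made for illustration: the weak learner is a hypothetical exhaustive decision stump (`fit_stump`), the error is clipped away from 0 to avoid a division by zero, and the ten-point toy dataset is made up so that no single stump can fit it.

```python
import math

def fit_stump(X, y, w):
    """Weak learner: the rule 'x[f] > t' with the lowest weighted error.

    Returns (feature, threshold, label_if_greater, label_otherwise).
    """
    labels = sorted(set(y))
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for hi in labels:        # label predicted when x[f] > t
                for lo in labels:    # label predicted otherwise
                    if hi == lo:
                        continue
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if (hi if xi[f] > t else lo) != yi)
                    if err < best_err:
                        best, best_err = (f, t, hi, lo), err
    return best

def stump_predict(stump, x):
    f, t, hi, lo = stump
    return hi if x[f] > t else lo

def adaboost(X, y, k):
    n = len(X)
    w = [1 / n] * n                                   # uniform initial weights
    ensemble = []
    for _ in range(k):
        total = sum(w)
        w = [wi / total for wi in w]                  # normalize weights
        h = fit_stump(X, y, w)
        miss = [stump_predict(h, xi) != yi for xi, yi in zip(X, y)]
        eps = sum(wi for wi, m in zip(w, miss) if m)  # weighted error
        if eps > 0.5:
            break
        eps = max(eps, 1e-10)                         # assumption: avoid log of 1/0
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # Increase weights of misclassified examples, decrease the others
        w = [wi * math.exp(alpha if m else -alpha) for wi, m in zip(w, miss)]
    return ensemble

def predict(ensemble, x):
    """Weighted vote: argmax_j of the summed alphas of classifiers predicting j."""
    scores = {}
    for alpha, h in ensemble:
        j = stump_predict(h, x)
        scores[j] = scores.get(j, 0.0) + alpha
    return max(scores, key=scores.get)

# Toy dataset (an assumption): one feature, labels not separable by one stump
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(X, y, k=3)
preds = [predict(ensemble, x) for x in X]
print(preds == y)  # True
```

On this toy data a single stump misclassifies three examples, while the weighted vote of three stumps classifies all ten training examples correctly.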
Modify AdaBoost w. Bootstrap
(Algorithm 1 AdaBoost, repeated from the previous slide)
Decision Tree Stumps
[Figure: four panels numbered 1 to 4, each with axes x1 and x2.]
Random Forests
Random Forests = Bagging w. trees + random feature subsets
Random Feature Subset for each Tree or Node?
• Tin Kam Ho used the “random subspace method,” where each tree got a random subset of features.
  “Our method relies on an autonomous, pseudo-random procedure to select a small number of dimensions from a given feature space …”
  Ho, Tin Kam. “The random subspace method for constructing decision forests.” IEEE Transactions on Pattern Analysis and Machine Intelligence 20.8 (1998): 832-844.
• “Trademark” random forest:
  “… random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on.”
  Breiman, Leo. “Random Forests.” Machine Learning 45.1 (2001): 5-32.
$\text{num features} = \log_2 m + 1$, where m is the number of input features
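The difference between the two sampling schemes, and the subset size log2(m) + 1, can be illustrated with a short sketch. The function names are hypothetical, and truncating log2(m) to an integer is an assumption, since the slide does not state how non-integer values are rounded.

```python
import math
import random

def n_sub_features(m):
    """Subset size log2(m) + 1, with log2(m) truncated to an integer (assumption)."""
    return int(math.log2(m)) + 1

def sample_features(m, rng):
    """Draw a random subset of feature indices without replacement."""
    return sorted(rng.sample(range(m), n_sub_features(m)))

rng = random.Random(0)
m = 16  # hypothetical number of input features

# Random subspace method: one subset per tree, reused at every node
features_for_tree = sample_features(m, rng)

# Random forest: a fresh subset is drawn at each node
features_per_node = [sample_features(m, rng) for _ in range(3)]

print(n_sub_features(m))  # 5
print(features_for_tree)
print(features_per_node)
```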
In contrast to the original publication
[Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001]
the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
"Soft Voting"
Will discuss Random Forests and feature importance in
Feature Selection lecture
(Loose) Upper Bound for the Generalization Error
$PE \le \bar{\rho} \cdot \frac{1 - s^2}{s^2}$
$\bar{\rho}$: Average correlation among trees
$s$: "Strength" of the ensemble
Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001
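To get a feel for the bound, it can be evaluated for a few values of the average correlation and the strength. The numbers below are hypothetical and chosen only to show the direction of the effect; the function name is an assumption as well.

```python
def generalization_error_bound(rho_bar, s):
    """Upper bound rho_bar * (1 - s^2) / s^2 for a strength s in (0, 1]."""
    if not 0 < s <= 1:
        raise ValueError("strength s must lie in (0, 1]")
    return rho_bar * (1 - s**2) / s**2

# Hypothetical values: same strength, decreasing correlation among trees
for rho_bar in (0.6, 0.4, 0.2):
    print(rho_bar, round(generalization_error_bound(rho_bar, s=0.8), 4))
```

For a fixed strength, the bound shrinks in proportion to the average correlation, which matches the motivation for decorrelating the trees through random feature subsets.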
Extremely Randomized Trees (ExtraTrees)
ExtraTrees algorithm adds one more random component
Random Forest random components:
1) _____________
2) _____________
3) _____________
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63(1), 3-42.
Stacking
Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.
Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.
Stacking Algorithm
[Figure: a training set is used to fit the classification models h1, h2, ..., hn; on new data, their predictions y1, y2, ..., yn are passed to a meta-classifier, which produces the final prediction yf.]
Cross-Validation
[Figure: ten examples (1 to 10) split into Training and Evaluation portions under the Holdout Method, 2-Fold Cross-Validation, and Repeated Holdout.]
k-fold Cross-Validation
[Figure, panels A to C: K Iterations (K-Folds), shown for 5 folds. In each iteration, one fold serves as the Validation Fold and the remaining folds as Training Folds. The Learning Algorithm, with given Hyperparameter Values, is fit on the Training Fold Data and Training Fold Labels to produce a Model; the Model's Prediction for the Validation Fold Data is compared with the Validation Fold Labels to obtain one Performance value per iteration.]
$\text{Performance} = \frac{1}{5} \sum_{i=1}^{5} \text{Performance}_i$
Stacking Algorithm with Cross-Validation
Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.
[Figure: the training set is split into training folds and a validation fold. The base classifiers h1, h2, ..., hn are fit on the training folds and produce the level-1 predictions y1, y2, ..., yn for the validation fold in the k-th iteration. This is repeated k times, and all level-1 predictions are then used to train the meta-classifier.]
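The cross-validated stacking procedure can be sketched in plain Python as follows. Everything concrete here is an assumption for illustration: the base learners are hypothetical single-feature threshold rules, the meta-classifier is a simple lookup table over tuples of base predictions, folds are assigned by index modulo k without shuffling, and the toy dataset is made up.

```python
from collections import Counter, defaultdict
from functools import partial

def fit_threshold(X, y, feature):
    """Hypothetical base learner: best midpoint threshold on one feature."""
    values = sorted({x[feature] for x in X})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_t, best_err = None, float("inf")
    for t in candidates:
        err = sum(int(x[feature] > t) != yi for x, yi in zip(X, y))
        if err < best_err:
            best_t, best_err = t, err
    return lambda x: int(x[feature] > best_t)

def fit_lookup(Z, y):
    """Hypothetical meta-classifier: majority label per tuple of base predictions."""
    table = defaultdict(Counter)
    for z, yi in zip(Z, y):
        table[tuple(z)][yi] += 1
    default = Counter(y).most_common(1)[0][0]
    return lambda z: (table[tuple(z)].most_common(1)[0][0]
                      if tuple(z) in table else default)

def stacking_cv(X, y, base_fitters, k):
    n = len(X)
    Z = [None] * n                       # level-1 (out-of-fold) predictions
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        # Fit the base classifiers on the training folds only
        models = [fit([X[i] for i in train], [y[i] for i in train])
                  for fit in base_fitters]
        # Predict the validation fold to obtain its level-1 predictions
        for i in held_out:
            Z[i] = [h(X[i]) for h in models]
    meta = fit_lookup(Z, y)              # train meta-classifier on all level-1 predictions
    final_models = [fit(X, y) for fit in base_fitters]  # refit on the full training set
    return (lambda x: meta([h(x) for h in final_models])), Z

# Toy dataset (an assumption): two features, alternating class labels
X = [[i + 100 * (i % 2), i % 10 + 10 * (i % 2)] for i in range(20)]
y = [i % 2 for i in range(20)]
base_fitters = [partial(fit_threshold, feature=0),
                partial(fit_threshold, feature=1)]

predict, Z = stacking_cv(X, y, base_fitters, k=5)
print([predict(x) for x in X] == y)  # True
```

The key design point is that every level-1 prediction comes from base classifiers that never saw the corresponding example during fitting, so the meta-classifier is not trained on overly optimistic base predictions.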
Leave-One-Out CV
[Figure: ten examples (1 to 10), repeated over ten iterations, each split into Training and evaluation portions.]
Demos
http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
http://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html#sphx-glr-auto-examples-ensemble-plot-bias-variance-py
http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_hastie_10_2.html#sphx-glr-auto-examples-ensemble-plot-adaboost-hastie-10-2-py