Ensemble Methods

(1)

Ensemble Methods

Lecture 07

STAT 479: Machine Learning, Fall 2018 Sebastian Raschka

http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/

(2)

Overview

Ensemble Methods

Majority Voting Bagging

Boosting

Random Forests Stacking

(3)

Majority Voting

(4)

Unanimity

Majority

Plurality

(5)

h (x) = ̂y

Training set

h₁ h₂

. . .

h_n

y₁ y₂

. . .

y_n

Voting

y_f

New data

Classification models

Predictions

Final prediction

where

Majority Vote Classifier

̂yf = mode{h₁(x), h₂(x), . . . h_n(x)}

(6)

•

assume n independent classifiers with a base error rate

•

here, independent means that the errors are uncorrelated

•

assume a binary classification task

•

assume the error rate is better than random guessing (i.e., lower than 0.5 for binary classification)

∀ϵ_i ∈ {ϵ₁, ϵ₂, . . . , ϵ_n}, ϵ_i < 0.5

ϵ

Why Majority Vote?

(7)

•

assume n independent classifiers with a base error rate

•

here, independent means that the errors are uncorrelated

•

assume a binary classification task

•

assume the error rate is better than random guessing (i.e., lower than 0.5 for binary classification)

∀ϵ_i ∈ {ϵ₁, ϵ₂, . . . , ϵ_n}, ϵ_i < 0.5

ϵ

k > ⌈n/2⌉

Why Majority Vote?

P(k) = ( n

k) ϵ

^k

(1 − ϵ)

^n−k

The probability that we make a wrong

prediction via the ensemble if k classifiers predict the same class label

(8)

k > ⌈n/2⌉

Why Majority Vote?

P(k) = ( n

k) ϵ

^k

(1 − ϵ)

^n−k

The probability that we make a wrong

prediction via the ensemble if k classifiers predict the same class label

Ensemble error:

ϵ

_ens

= ∑

ⁿ

k

( n

k) ϵ

^k

(1 − ϵ)

^n−k

ϵ_ens

= ∑

¹¹

( 11 0.25

^k

(1 − 0.25)

^11−k

= 0.034

(9)

ϵ

_ens

= ∑

ⁿ

k

( n

k) ϵ

^k

(1 − ϵ)

^n−k

(10)

"Soft" Voting

̂y = arg max

j

n

∑

i=1

w

_i

p

_i,j

w

_j optional weighting parameter, default w_i = 1/n, ∀w_i ∈ {w₁, . . . , w_n}

p

_i,j predicted class membership

probability of the ith classifier for class label j

:

(11)

"Soft" Voting

̂y = arg max

j

n

∑

i=1

w

_i

p

_i,j

w

_j optional weighting parameter, default w_i = 1/n, ∀w_i ∈ {w₁, . . . , w_n}

p

_i,j predicted class membership

probability of the ith classifier for class label j

:

Use only for well-calibrated

classifiers!

(12)

"Soft" Voting

̂y = arg max

j

n

∑

i=1

w

_i

p

_i,j

Binary classification example

j ∈ {0,1} h_i(i ∈ {1,2,3})

h₁

(x) → [0.9,0.1]

h₂

(x) → [0.8,0.2]

h₃

(x) → [0.4,0.6]

(13)

"Soft" Voting ̂y = arg max

j

n

∑

i=1

w

_i

p

_i,j

Binary classification example

j ∈ {0,1} h_i(i ∈ {1,2,3})

h₁

(x) → [0.9,0.1]

h₂

(x) → [0.8,0.2]

h₃

(x) → [0.4,0.6]

p(j = 0|x) = 0.2 ⋅ 0.9 + 0.2 ⋅ 0.8 + 0.6 ⋅ 0.4 = 0.58 p(j = 1|x) = 0.2 ⋅ 0.1 + 0.2 ⋅ 0.2 + 0.6 ⋅ 0.6 = 0.42

̂y = arg max

j {p(j = 0|x), p(j = 1|x)}

(14)

Bagging

(Bootstrap Aggregating)

Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.

(15)

Sebastian Raschka STAT 479: Machine Learning FS 2018 15

Algorithm 1 Bagging

1: Let n be the number of bootstrap samples

2:

3: for i=1 to n do

4: Draw bootstrap sample of size m, Dⁱ

5: Train base classifier h_i on Dⁱ

6: y = modeˆ {h¹(x), ..., h_n(x)}

(16)

x₁

x₁ x₁

x₁

x₂ x₃ x₄ x₅ x₆ x₇ x₈ x₉ x₁₀

x₂

x₂ x₈ x₈ x₃ x₇ x₁₀

x₆ x₉

x₃ x₇ x₈ x₁₀ x₂

x₂ x₆ x₉

x₆ x₅ x₄ x₄

x₁₀ x₃ x₅ x₇ x₄ x₈

x₄ x₅

x₆

x₈ x₉

Training Sets Test Sets

Bootstrap 1 Bootstrap 2 Bootstrap 3 Original Dataset

Bootstrap Sampling

(17)

P(not chosen) = (1 − 1 n )

n

, 1

e ≈ 0.368, n → ∞ .

Bootstrap Sampling

(18)

P(not chosen) = (1 − 1 n )

n

, 1

e ≈ 0.368, n → ∞ . P(chosen) = 1 − (1 − 1

n )

n

≈ 0.632

(19)

Training example indices

Bagging round 1

Bagging

round 2 …

1 2 7 …

2 2 3 …

3 1 2 …

4 3 1 …

5 7 1 …

6 2 7 …

7 4 7 …

h₁ h₂ h_n

Bootstrap Sampling

(20)

h₁ h₂

. . .

h_n

y₁ y₂

. . .

y_n

Voting

y_f

New data

Predictions

Final prediction

. . .

T₂

Training set

T_n T₁

T₂

Bootstrap samples

̂yf = mode{h₁(x), h₂(x), . . . h_n(x)}

Bagging Classifier

(21)

Bias-Variance Decomposition

Loss = Bias + Variance + Noise

(more technical details in next lecture on model evaluation)

(22)

Low Variance (Precise)

High Variance (Not Precise)

Low Bias (Accurate)t Accurate)

Bias-Variance Intuition

(23)

Low Bias (Accurate)ot Accurate)

Bias-Variance Intuition

(24)

Low Bias (Accurate)rate)

Bias-Variance Intuition

(25)

Low Bias (Accurate)High Bias (Not Accurate)

Bias-Variance Intuition

(26)

Low Bias (Accurate)High Bias (Not Accurate)

Bias-Variance Intuition

(27)

Bias and Variance Example

where f(x) is some true (target) function

(28)

where f(x) is some true (target) function the blue dots are a training dataset;

here, I added some random Gaussian noise

Bias and Variance Example

(29)

here, I added some random Gaussian noise High Bias

Bias and Variance Example

(30)

here, I added some random Gaussian noise High Variance

Why?

Bias and Variance Example

(31)

where f(x) is some true (target) function suppose we have multiple training sets

Bias and Variance Example

(32)

High Variance

Bias and Variance Example

(33)

So, why does bagging work/what does it do?

(34)

Boosting

(35)

Adaptive Boosting

Gradient Boosting

e.g., AdaBoost (here!)

e.g., XGBoost

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting.

Journal of computer and system sciences, 55(1), 119-139.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.

Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD

(36)

Adaptive Boosting

Gradient Boosting

e.g., AdaBoost (here!)

e.g., XGBoost

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting.

Journal of computer and system sciences, 55(1), 119-139.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.

Diﬀer mainly in terms of how

• weights are updated

• classifiers are combined

(37)

General Boosting

Training Sample

Weighted Training Sample

h₁(x)

h₂(x)

h_m(x)

h_m(x) = sign(∑^m

j=1

w_j h_j(x)) h_m(x) = arg max

i (

m

∑j=1

w_j 1[h_j(x) = i])

for h(x) ∈ {−1,1}

or for h(x) = i, i ∈ {1,...,n}

(38)

General Boosting

‣

Initialize a weight vector with uniform weights

‣

^Loop:

‣

Apply weak learner* to weighted training examples (instead of orig. training set,  may draw bootstrap samples with weighted probability)

‣

Increase weight for misclassified examples 

‣

(Weighted) majority voting on trained classifiers

(39)

AdaBoost

Algorithm 1 AdaBoost

1: Initialize k: the number of AdaBoost rounds

2: Initialize D: the training dataset, D = {hx^[1], y^[1]i, ..., x^[n], y^[n]i}

3: Initialize w₁(i) = 1/n, i = 1, ..., n, w₁ 2 Rⁿ

4:

5: for r=1 to k do

6: For all i : w_r(i) := w_r(i)/P

i w_r(i) [normalize weights]

7: h_r := F itW eakLearner(D, w^r)

8: ✏_r := P

i w_r(i) 1(h_r(i) 6= yⁱ) [compute error]

9: if ✏_r > 1/2 then stop

10: ↵_r := ¹₂ log[(1 ✏_r)/✏_r] [small if error is large and vice versa]

11: w_r+1(i) := w_r(i)⇥

( e ^↵^r if h_r(x^[i]) = y^[i]

e^↵^r if h_r(x^[i]) 6= y^[i]

12: Predict: h_k(x) = arg max_j P

r ↵_r1[h_r(x) = j]

13:

(40)

Modify AdaBoost w. Bootstrap

Algorithm 1 AdaBoost

1: Initialize k: the number of AdaBoost rounds

2: Initialize D: the training dataset, D = {hx^[1], y^[1]i, ..., x^[n], y^[n]i}

3: Initialize w₁(i) = 1/n, i = 1, ..., n, w₁ 2 Rⁿ

4:

5: for r=1 to k do

6: For all i : w_r(i) := w_r(i)/P

i w_r(i) [normalize weights]

7: h_r := F itW eakLearner(D, w^r)

8: ✏_r := P

i w_r(i) 1(h_r(i) 6= yⁱ) [compute error]

9: if ✏_r > 1/2 then stop

10: ↵_r := ¹₂ log[(1 ✏_r)/✏_r] [small if error is large and vice versa]

11: w_r+1(i) := w_r(i)⇥

( e ^↵^r if h_r(x^[i]) = y^[i]

e^↵^r if h_r(x^[i]) 6= y^[i]

12: Predict: h_k(x) = arg max_j P

r ↵_r1[h_r(x) = j]

13:

(41)

Decision Tree Stumps

(42)

x

₁

x

₁

x

₁

x

₁

x

₂

x

₂

x

₂

x

₂

2

3 4 1

(43)

x

₁

x

₁

x

₁

x

₁

x

₂

x

₂

x

₂

x

₂

2

3 4 1

(44)

x

₁

x

₁

x

₁

x

₁

x

₂

x

₂

x

₂

x

₂

2

3 4 1

(45)

x

₁

x

₁

x

₁

x

₁

x

₂

x

₂

x

₂

x

₂

2

3 4 1

(46)

Random Forests

(47)

Random Forests

= Bagging w. trees + random feature subsets

(48)

Tin Kam Ho used the “random subspace method,” where each tree got a random subset of features.

“Our method relies on an autonomous, pseudo-random procedure to select a small number of dimensions from a given feature space …”

•

Ho, Tin Kam. “The random subspace method for constructing decision forests.”

IEEE transactions on pattern analysis and machine intelligence 20.8 (1998):

832-844.

“Trademark” random forest:

“… random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on.”

• Random Feature Subset for each Tree or Node?

(49)

Tin Kam Ho used the “random subspace method,” where each tree got a random subset of features.

“Our method relies on an autonomous, pseudo-random procedure to select a small number of dimensions from a given feature space …”

•

Ho, Tin Kam. “The random subspace method for constructing decision forests.”

IEEE transactions on pattern analysis and machine intelligence 20.8 (1998):

832-844.

“Trademark” random forest:

“… random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on.”

•

Breiman, Leo. “Random Forests” Machine learning 45.1 (2001): 5-32.

Random Feature Subset for each Tree or Node?

num features = log₂ m + 1 where m is the

number of input features

(50)

In contrast to the original publication  

[Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001]  

the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.

"Soft Voting"

(51)

Will discuss Random Forests and feature importance in

Feature Selection lecture

(52)

PE ≤ ¯ρ ⋅ (1 − s

²

) s

²

(Loose) Upper Bound for the Generalization Error

Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001

¯ρ

: Average correlation among trees : "Strength" of the ensemble

s

(53)

Extremely Randomized Trees (ExtraTrees)

ExtraTrees algorithm adds one more random component

Random Forest random components:

1) _____________

2) _____________

3) _____________

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63(1), 3-42.

(54)

Stacking

(55)

Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.

Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.

Stacking Algorithm

(56)

Stacking Algorithm

Training set

h₁ h₂

. . .

h_n

y₁ y₂

. . .

y_n

Meta-Classifier

y_f

New data

Predictions

Final prediction

(57)

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10 Holdout Method

2-Fold Cross-Validation

Repeated Holdout Training Evaluation

Cross-Validation

(58)

1st 2nd

3rd 4th 5th

K Iterations (K-Folds)

Validation Fold

Training Fold

Learning Algorithm

Hyperparameter Values

Training Fold Data Training Fold Labels

Prediction

Performance Model

Validation Fold Data

Validation Fold Labels

Performance

Performance Performance

1

2

3

4

5

Performance

1 5

∑

⁵

i =1

Performancei

=

A

B C

k-fold Cross-Validation

(59)

1st 2nd

3rd 4th 5th

K Iterations (K-Folds)

Validation Fold

Training Fold

Learning Algorithm Hyperparameter

Values

Training Fold Data Training Fold Labels

Prediction

Performance Model

Validation Fold Data

Validation Fold Labels

Performance

Performance Performance

1

2

3

4

5

Performance

1 5 ∑⁵

i =1

Performancei

=

A

B C

(60)

Stacking Algorithm with Cross-Validation

Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.

(61)

Stacking Algorithm with Cross-Validation

Training set

h₁ h₂ . . . h_n

y₁ y₂ . . . y_n

Meta-Classifier

Base Classifiers

Level-1 predictions in k-th iteration

Training folds Validation fold

Repeat ktimes

All level-1 predictions

Train

(62)

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

Training evaluation

Leave-One-Out CV

(63)

Demos

http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/

http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/

http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

http://scikit-learn.org/stable/auto_examples/ensemble/

plot_bias_variance.html#sphx-glr-auto-examples-ensemble-plot-bias-variance-py http://scikit-learn.org/stable/auto_examples/ensemble/

plot_adaboost_hastie_10_2.html#sphx-glr-auto-examples-ensemble-plot-adaboost- hastie-10-2-py

(64)