(1)

Ensemble Methods

Lecture 07

STAT 479: Machine Learning, Fall 2018 Sebastian Raschka

http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/

(2)

Overview

Ensemble Methods

Majority Voting

Bagging

Boosting

Random Forests

Stacking

(3)

Majority Voting

(4)

Unanimity

Majority

Plurality

(5)

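Majority Vote Classifier

[Figure: a training set is used to fit the classification models h_1, h_2, ..., h_n; for new data, their predictions y_1, y_2, ..., y_n are combined by voting into the final prediction y_f.]

$\hat{y}_f = \text{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\}$, where $h_i(x) = \hat{y}_i$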
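The mode rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `majority_vote`, the example votes, and the tie-breaking rule (first label encountered wins) are all assumptions.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the mode of a list of class-label predictions.

    Ties go to the label that appears first in the list (an assumption;
    the slides do not specify a tie-breaking rule).
    """
    counts = Counter(predictions)
    # most_common orders labels with equal counts by first appearance (Python 3.7+)
    return counts.most_common(1)[0][0]

# Hypothetical predictions h_1(x), ..., h_5(x) for one new example x
votes = [1, 0, 1, 1, 0]
y_f = majority_vote(votes)
print(y_f)
```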

(7)

Why Majority Vote?

• assume n independent classifiers with a base error rate $\epsilon$
• here, independent means that the errors are uncorrelated
• assume a binary classification task
• assume the error rate is better than random guessing (i.e., lower than 0.5 for binary classification): $\forall \epsilon_i \in \{\epsilon_1, \epsilon_2, \ldots, \epsilon_n\}, \; \epsilon_i < 0.5$

The probability that we make a wrong prediction via the ensemble if k classifiers predict the same (wrong) class label, where k is a majority, i.e., $k > n/2$ (for odd n, $k \geq \lceil n/2 \rceil$):

$P(k) = \binom{n}{k} \epsilon^k (1 - \epsilon)^{n-k}$

(8)

Why Majority Vote?

Ensemble error (summing over all majorities of wrong classifiers, n odd):

$\epsilon_{ens} = \sum_{k = \lceil n/2 \rceil}^{n} \binom{n}{k} \epsilon^k (1 - \epsilon)^{n-k}$

Example with n = 11 and $\epsilon = 0.25$:

$\epsilon_{ens} = \sum_{k=6}^{11} \binom{11}{k} 0.25^k (1 - 0.25)^{11-k} = 0.034$
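The 0.034 figure can be checked numerically. The sketch below assumes an odd number of classifiers, so the sum starts at k = ⌈n/2⌉ (6 for n = 11); the function name `ensemble_error` is hypothetical.

```python
from math import ceil, comb

def ensemble_error(n, eps):
    """Probability that a majority of n independent classifiers is wrong.

    Assumes n is odd, so a wrong majority means at least ceil(n/2) errors.
    """
    k_start = ceil(n / 2)
    return sum(
        comb(n, k) * eps**k * (1 - eps) ** (n - k)
        for k in range(k_start, n + 1)
    )

# 11 classifiers with a base error rate of 0.25, as in the example
print(round(ensemble_error(11, 0.25), 3))
```

With a base error above 0.5 the same sum exceeds the base error, which is why the better-than-random assumption matters.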

(9)

$\epsilon_{ens} = \sum_{k = \lceil n/2 \rceil}^{n} \binom{n}{k} \epsilon^k (1 - \epsilon)^{n-k}$

(11)

"Soft" Voting

$\hat{y} = \arg\max_j \sum_{i=1}^{n} w_i \, p_{i,j}$

$p_{i,j}$: predicted class membership probability of the ith classifier for class label j
$w_i$: optional weighting parameter, default $w_i = 1/n, \; \forall w_i \in \{w_1, \ldots, w_n\}$

Use only for well-calibrated classifiers!

(13)

"Soft" Voting

$\hat{y} = \arg\max_j \sum_{i=1}^{n} w_i \, p_{i,j}$

Binary classification example

$j \in \{0, 1\}$, $h_i \; (i \in \{1, 2, 3\})$

$h_1(x) \rightarrow [0.9, 0.1]$
$h_2(x) \rightarrow [0.8, 0.2]$
$h_3(x) \rightarrow [0.4, 0.6]$

$p(j = 0 \mid x) = 0.2 \cdot 0.9 + 0.2 \cdot 0.8 + 0.6 \cdot 0.4 = 0.58$
$p(j = 1 \mid x) = 0.2 \cdot 0.1 + 0.2 \cdot 0.2 + 0.6 \cdot 0.6 = 0.42$

$\hat{y} = \arg\max_j \{p(j = 0 \mid x), \, p(j = 1 \mid x)\} = 0$
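The binary example can be reproduced with a short sketch. The weights 0.2, 0.2, 0.6 are read off the sums in the example; the function name `soft_vote` and its return format are assumptions made for illustration.

```python
def soft_vote(probas, weights):
    """Weighted soft voting.

    probas[i][j] is classifier i's predicted probability for class j;
    weights[i] is the weight w_i of classifier i.
    Returns (predicted class index, list of weighted class scores).
    """
    n_classes = len(probas[0])
    scores = [
        sum(w * p[j] for w, p in zip(weights, probas))
        for j in range(n_classes)
    ]
    y_hat = max(range(n_classes), key=lambda j: scores[j])
    return y_hat, scores

probas = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]]  # h_1(x), h_2(x), h_3(x)
weights = [0.2, 0.2, 0.6]                      # w_1, w_2, w_3
y_hat, scores = soft_vote(probas, weights)
print(y_hat, [round(s, 2) for s in scores])
```

Note that a hard majority vote over the same three classifiers would also pick class 0 here, but the two rules can disagree when one confident classifier carries a large weight.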

(14)

Bagging

(Bootstrap Aggregating)

Breiman, L. (1996). Bagging predictors. Machine learning, 24(2), 123-140.

(15)

Algorithm 1 Bagging

Let n be the number of bootstrap samples

for i = 1 to n do
    Draw bootstrap sample of size m, $D_i$
    Train base classifier $h_i$ on $D_i$

$\hat{y} = \text{mode}\{h_1(x), \ldots, h_n(x)\}$
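A minimal sketch of the bagging loop, using only the standard library. The base learner here is a one-dimensional threshold classifier chosen purely to keep the example short (bagging is more commonly applied to unpruned decision trees); `fit_stump`, `bagging_fit`, `bagging_predict`, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def fit_stump(data):
    """Fit a 1-D threshold classifier (stand-in for any base learner).

    data is a list of (x, y) pairs with y in {0, 1}. Returns a predict function.
    """
    best = None
    for threshold, _ in data:
        for left_label in (0, 1):
            errors = sum(
                (left_label if x <= threshold else 1 - left_label) != y
                for x, y in data
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label

def bagging_fit(data, n_estimators, rng):
    """Train n_estimators base classifiers, each on its own bootstrap sample."""
    m = len(data)
    models = []
    for _ in range(n_estimators):
        # Draw m examples with replacement (bootstrap sample D_i)
        sample = [rng.choice(data) for _ in range(m)]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the base classifiers by majority vote (mode)."""
    votes = Counter(h(x) for h in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Toy data (an assumption for illustration): the label is 1 when x > 5
data = [(x, int(x > 5)) for x in range(11)]
models = bagging_fit(data, n_estimators=25, rng=rng)
print(bagging_predict(models, 1), bagging_predict(models, 9))
```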

(16)

Bootstrap Sampling

[Figure: an original dataset $x_1, \ldots, x_{10}$ and three bootstrap samples (Bootstrap 1, Bootstrap 2, Bootstrap 3) drawn from it with replacement. Each bootstrap sample serves as a training set and may contain the same example several times; the examples not drawn in a round form the corresponding test set.]

(18)

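Bootstrap Sampling

$P(\text{not chosen}) = \left(1 - \frac{1}{n}\right)^n$, which approaches $\frac{1}{e} \approx 0.368$ as $n \rightarrow \infty$.

$P(\text{chosen}) = 1 - \left(1 - \frac{1}{n}\right)^n \approx 0.632$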
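Both limits are easy to check numerically. The sketch below evaluates the closed form for a few values of n and then draws one bootstrap sample to estimate the fraction of distinct examples; the seed and sample size are arbitrary choices.

```python
import math
import random

def p_not_chosen(n):
    """Probability that one particular example is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n

for n in (10, 100, 10_000):
    print(n, round(p_not_chosen(n), 4))
print("1/e =", round(1 / math.e, 4))

# Empirical check: fraction of distinct examples in one bootstrap sample
rng = random.Random(1)
n = 10_000
sample = [rng.randrange(n) for _ in range(n)]
frac_unique = len(set(sample)) / n
print(round(frac_unique, 3))
```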

(19)

Bootstrap Sampling

Training example indices   Bagging round 1   Bagging round 2   ...
1                          2                 7
2                          2                 3
3                          1                 2
4                          3                 1
5                          7                 1
6                          2                 7
7                          4                 7

Classifier                 h_1               h_2               ... h_n

(20)

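Bagging Classifier

[Figure: bootstrap samples T_1, T_2, ..., T_n are drawn from the training set; a classification model h_i is fit to each T_i; for new data, the predictions y_1, y_2, ..., y_n are combined by voting into the final prediction y_f.]

$\hat{y}_f = \text{mode}\{h_1(x), h_2(x), \ldots, h_n(x)\}$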

(21)

Bias-Variance Decomposition

Loss = Bias + Variance + Noise

(more technical details in next lecture on model evaluation)
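For squared-error loss, one standard way to write this decomposition is sketched below. The notation is an assumption added for illustration: $y = f(x) + \text{noise}$ with noise variance $\sigma^2$, $h(x)$ is the model's prediction, and the expectations are taken over training sets (and the noise). Note that the bias term enters squared in this form.

```latex
\mathbb{E}\big[(y - h(x))^2\big]
  = \underbrace{\big(\mathbb{E}[h(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(h(x) - \mathbb{E}[h(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Noise}}
```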

(22)

Bias-Variance Intuition

[Figure: four panels showing the combinations of Low Variance (Precise) or High Variance (Not Precise) with Low Bias (Accurate) or High Bias (Not Accurate).]

(27)

Bias and Variance Example

[Figure: plot of f(x), where f(x) is some true (target) function.]

(28)

Bias and Variance Example

[Figure: f(x) with a training dataset drawn as blue dots; here, I added some random Gaussian noise.]

(29)

Bias and Variance Example

[Figure: same plot, labeled "High Bias".]

(30)

Bias and Variance Example

[Figure: same plot, labeled "High Variance".]

Why?

(31)

Bias and Variance Example

[Figure: f(x); suppose we have multiple training sets.]

(32)

Bias and Variance Example

[Figure: panel labeled "High Variance".]

(33)

So, why does bagging work/what does it do?

(34)

Boosting

(35)

Adaptive Boosting: e.g., AdaBoost (here!)

Gradient Boosting: e.g., XGBoost

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.

Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.

Chen, T., & Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD

(36)

Adaptive boosting and gradient boosting differ mainly in terms of how

• weights are updated
• classifiers are combined

(37)

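General Boosting

[Figure: the training sample is used to fit h_1(x); reweighted versions of the training sample are then used to fit h_2(x), ..., h_m(x) in sequence.]

$h_m(x) = \text{sign}\left(\sum_{j=1}^{m} w_j h_j(x)\right)$ for $h(x) \in \{-1, 1\}$

or

$h_m(x) = \arg\max_i \left(\sum_{j=1}^{m} w_j \, 1[h_j(x) = i]\right)$ for $h(x) = i, \; i \in \{1, \ldots, n\}$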

(38)

General Boosting

Initialize a weight vector with uniform weights

Loop:

Apply weak learner* to weighted training examples (instead of the original training set, may draw bootstrap samples with weighted probability)

Increase weight for misclassified examples


(Weighted) majority voting on trained classifiers

(39)

AdaBoost

Algorithm 1 AdaBoost

Initialize k: the number of AdaBoost rounds
Initialize D: the training dataset, $D = \{\langle x^{[1]}, y^{[1]} \rangle, \ldots, \langle x^{[n]}, y^{[n]} \rangle\}$
Initialize $w_1(i) = 1/n, \; i = 1, \ldots, n, \; w_1 \in \mathbb{R}^n$

for r = 1 to k do
    For all i: $w_r(i) := w_r(i) / \sum_i w_r(i)$ [normalize weights]
    $h_r := \text{FitWeakLearner}(D, w_r)$
    $\epsilon_r := \sum_i w_r(i) \, 1(h_r(i) \neq y_i)$ [compute error]
    if $\epsilon_r > 1/2$ then stop
    $\alpha_r := \frac{1}{2} \log[(1 - \epsilon_r)/\epsilon_r]$ [small if error is large and vice versa]
    $w_{r+1}(i) := w_r(i) \times e^{-\alpha_r}$ if $h_r(x^{[i]}) = y^{[i]}$, and $w_r(i) \times e^{\alpha_r}$ if $h_r(x^{[i]}) \neq y^{[i]}$

Predict: $h_k(x) = \arg\max_j \sum_r \alpha_r \, 1[h_r(x) = j]$
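The AdaBoost loop can be sketched in plain Python for labels in {-1, +1}, where the arg max over weighted votes reduces to the sign of the weighted sum. The weak learner is a one-dimensional threshold stump; `fit_weighted_stump`, `adaboost_fit`, `adaboost_predict`, the clamp on the error, and the toy data are assumptions for illustration, not the lecture's implementation.

```python
import math

def fit_weighted_stump(X, y, w):
    """Weak learner: 1-D threshold stump minimizing the weighted error.

    Returns (threshold, sign); the prediction is sign if x <= threshold else -sign.
    """
    best = None
    for threshold in X:
        for sign in (-1, 1):
            err = sum(
                wi for xi, yi, wi in zip(X, y, w)
                if (sign if xi <= threshold else -sign) != yi
            )
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best[1], best[2]

def stump_predict(stump, x):
    threshold, sign = stump
    return sign if x <= threshold else -sign

def adaboost_fit(X, y, k):
    n = len(X)
    w = [1 / n] * n
    ensemble = []  # list of (alpha_r, h_r)
    for _ in range(k):
        total = sum(w)
        w = [wi / total for wi in w]  # normalize weights
        stump = fit_weighted_stump(X, y, w)
        preds = [stump_predict(stump, xi) for xi in X]
        eps = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)  # weighted error
        if eps > 0.5:
            break
        # Added safeguard (not in the pseudocode): avoid log(0) for a perfect stump
        eps = min(max(eps, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # Shrink weights of correct examples, grow weights of misclassified ones
        w = [wi * math.exp(-alpha if p == yi else alpha)
             for wi, p, yi in zip(w, preds, y)]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote; for labels {-1, +1} this is the sign of the weighted sum."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data (hypothetical): no single stump separates the two classes
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
model = adaboost_fit(X, y, k=3)
train_acc = sum(adaboost_predict(model, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(len(model), train_acc)
```

On this toy set the best single stump misclassifies three of the ten examples, while the three-round ensemble fits the training data.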

(40)

Modify AdaBoost w. Bootstrap

(41)

Decision Tree Stumps

(42)

[Figure: four numbered panels (1 to 4), each with axes x_1 and x_2.]

(46)

Random Forests

(47)

Random Forests

= Bagging w. trees + random feature subsets

(49)

Random Feature Subset for each Tree or Node?

Tin Kam Ho used the “random subspace method,” where each tree got a random subset of features.

“Our method relies on an autonomous, pseudo-random procedure to select a small number of dimensions from a given feature space …”

Ho, Tin Kam. “The random subspace method for constructing decision forests.” IEEE transactions on pattern analysis and machine intelligence 20.8 (1998): 832-844.

“Trademark” random forest:

“… random forest with random features is formed by selecting at random, at each node, a small group of input variables to split on.”

Breiman, Leo. “Random Forests” Machine learning 45.1 (2001): 5-32.

num features = $\log_2 m + 1$, where m is the number of input features
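A small sketch of per-node feature sampling under this rule. Truncating log2(m) + 1 to an integer is an assumption (the slide does not state the rounding), and the function names and the value m = 16 are hypothetical.

```python
import math
import random

def n_features_per_split(m):
    """Number of candidate features per node: log2(m) + 1, truncated to an integer."""
    return int(math.log2(m) + 1)

def sample_feature_subset(m, rng):
    """Pick a random subset of feature indices (without replacement) for one node."""
    return sorted(rng.sample(range(m), n_features_per_split(m)))

rng = random.Random(0)
m = 16  # hypothetical number of input features
print(n_features_per_split(m))
print(sample_feature_subset(m, rng))
```

In the per-node variant this sampling is repeated at every split; in the random subspace variant it would be done once per tree.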

(50)

In contrast to the original publication [Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001], the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.

"Soft Voting"

(51)

Will discuss Random Forests and feature importance in the Feature Selection lecture

(52)

(Loose) Upper Bound for the Generalization Error

$PE \leq \frac{\bar{\rho} \, (1 - s^2)}{s^2}$

$\bar{\rho}$: Average correlation among trees
$s$: "Strength" of the ensemble

Breiman, “Random Forests”, Machine Learning, 45(1), 5-32, 2001
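To get a feel for the bound, it can be evaluated for a few values. The function name and the numbers plugged in are purely illustrative; they are not estimates from any real forest.

```python
def generalization_error_bound(mean_correlation, strength):
    """Upper bound PE <= rho_bar * (1 - s^2) / s^2 (requires s > 0)."""
    if strength <= 0:
        raise ValueError("strength must be positive")
    return mean_correlation * (1 - strength**2) / strength**2

# Hypothetical values: lower correlation or higher strength tightens the bound
print(round(generalization_error_bound(0.2, 0.6), 3))
print(round(generalization_error_bound(0.1, 0.6), 3))
print(round(generalization_error_bound(0.2, 0.8), 3))
```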

(53)

Extremely Randomized Trees (ExtraTrees)

ExtraTrees algorithm adds one more random component

Random Forest random components:

1) _____________

2) _____________

3) _____________

Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63(1), 3-42.

(54)

Stacking

(55)

Tang, J., S. Alelyani, and H. Liu. "Data Classification: Algorithms and Applications." Data Mining and Knowledge Discovery Series, CRC Press (2015): pp. 498-500.

Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.

Stacking Algorithm

(56)

Stacking Algorithm

[Figure: a training set is used to fit the classification models h_1, h_2, ..., h_n; for new data, their predictions y_1, y_2, ..., y_n are passed to a meta-classifier, which outputs the final prediction y_f.]

(57)

Cross-Validation

[Figure: a dataset of 10 examples (indices 1 to 10) partitioned into training and evaluation subsets under three schemes: Holdout Method, 2-Fold Cross-Validation, and Repeated Holdout.]

(58)

k-fold Cross-Validation

[Figure: (A) the data is split into K folds; in each of the K iterations (1st to 5th here), one fold is the validation fold and the rest are training folds. (B) In each iteration, a learning algorithm with given hyperparameter values is fit to the training fold data and labels; the resulting model predicts the validation fold data, and the predictions are compared with the validation fold labels to obtain a performance estimate. (C) The five estimates are averaged.]

$\text{Performance} = \frac{1}{5} \sum_{i=1}^{5} \text{Performance}_i$
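The k-fold procedure can be sketched without any library. The fold layout (contiguous, unshuffled), the helper names `k_fold_indices` and `cross_val_score`, and the trivial majority-label "model" are assumptions chosen to keep the example small.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, nearly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

def cross_val_score(fit, score, X, y, k=5):
    """Average validation performance over k folds.

    fit(X_train, y_train) returns a model; score(model, X_val, y_val) returns a number.
    Both are caller-supplied placeholders.
    """
    performances = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        performances.append(
            score(model, [X[i] for i in val_idx], [y[i] for i in val_idx])
        )
    return sum(performances) / k

# Toy usage: the "model" is just the majority label of the training folds
X = list(range(10))
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
fit = lambda X_tr, y_tr: Counter(y_tr).most_common(1)[0][0]
score = lambda model, X_va, y_va: sum(label == model for label in y_va) / len(y_va)
cv_accuracy = cross_val_score(fit, score, X, y, k=5)
print(cv_accuracy)
```

In practice the data is usually shuffled (and often stratified by class) before the folds are formed.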

(60)

Stacking Algorithm with Cross-Validation

Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.

(61)

Stacking Algorithm with Cross-Validation

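[Figure: the training set is split into training folds and a validation fold. The base classifiers h_1, h_2, ..., h_n are fit on the training folds, and their predictions y_1, y_2, ..., y_n on the validation fold are the level-1 predictions in the k-th iteration. This is repeated k times; all level-1 predictions are then used to train the meta-classifier.]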
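A compact sketch of stacking with cross-validation: out-of-fold predictions of the base classifiers become the training inputs of the meta-classifier. Everything named here is hypothetical: the base learners are single-feature threshold stumps, the meta-classifier is a lookup table over tuples of base predictions, and folds are assigned by index modulo k.

```python
from collections import Counter, defaultdict

def make_stump_learner(feature):
    """Return a fit function for a threshold stump on one feature (stand-in base learner)."""
    def fit(X, y):
        best = None
        for row in X:
            t = row[feature]
            for left in (0, 1):
                err = sum(
                    (left if r[feature] <= t else 1 - left) != yi
                    for r, yi in zip(X, y)
                )
                if best is None or err < best[0]:
                    best = (err, t, left)
        _, t, left = best
        return lambda row: left if row[feature] <= t else 1 - left
    return fit

def fit_lookup_meta(Z, y):
    """Meta-classifier: majority label for each tuple of base predictions."""
    table = defaultdict(Counter)
    for z, yi in zip(Z, y):
        table[tuple(z)][yi] += 1
    default = Counter(y).most_common(1)[0][0]
    def predict(z):
        key = tuple(z)
        return table[key].most_common(1)[0][0] if key in table else default
    return predict

def stacking_cv_fit(X, y, learners, meta_fit, k):
    n = len(X)
    Z = [None] * n  # out-of-fold level-1 predictions
    for fold in range(k):
        val = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        models = [fit([X[i] for i in train], [y[i] for i in train]) for fit in learners]
        for i in val:
            Z[i] = [h(X[i]) for h in models]
    meta = meta_fit(Z, y)                   # train meta-classifier on all level-1 predictions
    base = [fit(X, y) for fit in learners]  # refit base classifiers on the full training set
    return lambda row: meta([h(row) for h in base])

# Toy data (hypothetical): two features, label is 1 when the first feature is >= 10
X = [(i, 19 - i) for i in range(20)]
y = [int(i >= 10) for i in range(20)]
learners = [make_stump_learner(0), make_stump_learner(1)]
predict = stacking_cv_fit(X, y, learners, fit_lookup_meta, k=4)
train_acc = sum(predict(row) == yi for row, yi in zip(X, y)) / len(X)
print(train_acc)
```

Training the meta-classifier on out-of-fold predictions, rather than on predictions for data the base classifiers were fit to, is what keeps it from simply trusting overfit base models.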

(62)

Leave-One-Out CV

[Figure: a dataset of 10 examples (indices 1 to 10); in each of the 10 iterations, one example is held out for evaluation and the remaining nine are used for training.]

(63)

Demos

http://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/

http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/

http://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

http://scikit-learn.org/stable/auto_examples/ensemble/plot_bias_variance.html#sphx-glr-auto-examples-ensemble-plot-bias-variance-py

http://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_hastie_10_2.html#sphx-glr-auto-examples-ensemble-plot-adaboost-hastie-10-2-py

(64)

Reading Assignments

Python Machine Learning, 2nd Ed., Ch07
