CS570 Data Mining Classification: Ensemble Methods

(1)

Cengiz Günay

Dept. Math & CS, Emory University

Fall 2013

Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong

(2)

Today

Due today midnight:

– Homework #2 – Frequent itemsets

Given today:

– Homework #3 – Classification

Today’s menu:

– Classification: Ensemble Methods

(3)

Ensemble Methods

• Given a data set, generate multiple models and combine the results

• Bagging

• Random Forests

• Boosting

– PAC learning significance

(4)

General Idea

(5)

Why does it work?

Suppose there are 25 base classifiers

• Each classifier has error rate ε = 0.35

• Assume classifiers are independent

• Probability that the ensemble classifier makes a wrong prediction (a majority, i.e. at least 13 of the 25, must be wrong):

$$\sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} = 0.06$$
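The sum above can be checked numerically. The following is a minimal sketch, assuming a simple majority vote over independent base classifiers; the function name `ensemble_error` is made up for this illustration.

```python
from math import comb

def ensemble_error(n_classifiers: int, eps: float) -> float:
    """Probability that a majority of independent base classifiers is wrong."""
    majority = n_classifiers // 2 + 1  # 13 when there are 25 classifiers
    return sum(
        comb(n_classifiers, i) * eps**i * (1 - eps) ** (n_classifiers - i)
        for i in range(majority, n_classifiers + 1)
    )

error = ensemble_error(25, 0.35)
print(f"ensemble error: {error:.3f}")
```

The benefit depends on ε < 0.5: at ε = 0.5 the same sum gives 0.5, so the ensemble is no better than a single base classifier.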

(6)

Types of Ensemble Methods

Can be obtained by manipulating:

1 Training set:

– Bagging

– Boosting

2 Input features:

– Random forests

– Multi-objective evolutionary algorithms

– Forward/backward elimination?

3 Class labels:

– Multi-classes

– Active learning

4 Learning algorithm:

– ANNs

– Decision trees

(10)

Bagging

• Create a data set by sampling data points with replacement

• Create model based on the data set

• Generate more data sets and models

• Predict by combining votes

– Classification: majority vote

– Numeric prediction (regression): average
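The steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the base model is a one-feature threshold classifier (a decision stump), the toy data is made up, and the helper names `train_stump`, `bagging`, and `predict` are hypothetical.

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a one-feature threshold classifier (decision stump) by exhaustive search."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def bagging(data, n_models, rng):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def predict(models, x):
    """Combine the models by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): the label is 1 when x > 0.5.
rng = random.Random(0)
data = [(i / 10, int(i / 10 > 0.5)) for i in range(1, 11)]
models = bagging(data, n_models=25, rng=rng)
print([predict(models, x) for x, _ in data])
```

For numeric prediction, the vote in `predict` would be replaced by the mean of the model outputs.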

(11)

Bagging

Sampling with replacement

Build classifier on each bootstrap sample

Each record has probability 1 – (1 – 1/n)ⁿ of being selected in a bootstrap sample of size n (about 0.632 for large n)

Original Data        1   2   3   4   5   6   7   8   9  10
Bagging (Round 1)    7   8  10   8   2   5  10  10   5   9
Bagging (Round 2)    1   4   9   1   2   3   2   7   3   2
Bagging (Round 3)    1   8   5  10   5   5   9   6   3   7
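As a small illustration of bootstrap sampling: each of the n draws misses a given record with probability 1 − 1/n, so the record appears at least once with probability 1 − (1 − 1/n)ⁿ. The sketch below assumes uniform sampling with replacement; the function names and the seed are arbitrary choices, so the drawn rounds will not match the table above.

```python
import random

def inclusion_probability(n: int) -> float:
    """Chance that a given record appears at least once in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

def bootstrap_sample(records, rng):
    """Draw len(records) records uniformly at random, with replacement."""
    return [rng.choice(records) for _ in records]

rng = random.Random(42)  # fixed seed, chosen arbitrarily
records = list(range(1, 11))
for round_no in range(1, 4):
    print(f"Bagging (Round {round_no})", bootstrap_sample(records, rng))

print(round(inclusion_probability(10), 3))
print(round(inclusion_probability(10_000), 3))
```

As n grows, the probability approaches 1 − 1/e, which is where the figure of roughly 63% comes from.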

(12)

Bagging

Advantages:

– Less overfitting

– Helps when classifier is unstable (has high variance)

Disadvantages:

– Not useful when classifier is stable and has large bias

(13)

PAC learning

• A formal model of learning: reach a given accuracy with a given confidence, using a number of samples that grows only polynomially (polynomial sample complexity)

• References:

– L. Valiant. A theory of the learnable.

• http://web.mit.edu/6.435/www/Valiant84.pdf

– D. Haussler. Overview of the Probably Approximately Correct (PAC) Learning Framework

• http://www.cs.iastate.edu/~honavar/pac.pdf
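The first bullet can be restated in symbols. The notation is an assumption and is not taken from the slides: h is the learned hypothesis, ε the accuracy parameter (not the base classifier error rate used elsewhere in these slides), δ the confidence parameter, and m the number of training samples, which may also depend polynomially on the size of the problem.

$$\Pr\big[\operatorname{error}(h) \le \epsilon\big] \ge 1 - \delta, \qquad m \le \operatorname{poly}\!\left(\tfrac{1}{\epsilon}, \tfrac{1}{\delta}\right)$$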

(14)

Boosting

• Combine weak learners to form a strong learner, in the PAC learning sense

• Learn using a weak learner

• Boost the accuracy by reweighting the examples misclassified by the previous weak learner and forcing the next weak learner to focus on the “hard” examples

• Predict by using a weighted combination of the weak learners

– Each weak learner's weight is determined by its accuracy

(15)

Boosting

• An iterative procedure to adaptively change the distribution of training data by focusing more on previously misclassified records

• Initially, all N records are assigned equal weights

• Unlike bagging, weights may change at the end of each boosting round

(16)

Boosting

• Records that are wrongly classified will have their weights increased

• Records that are classified correctly will have their weights decreased

Original Data         1   2   3   4   5   6   7   8   9  10
Boosting (Round 1)    7   3   2   8   7   9   4  10   6   3
Boosting (Round 2)    5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)    4   4   8  10   4   5   4   6   3   4

• Example 4 is hard to classify

• Its weight is increased, therefore it is more likely to be chosen again in subsequent rounds
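To see why a heavier weight makes a record show up more often, here is a minimal sketch of weighted resampling. The weights are hypothetical, picked only so that record 4 carries half of the total weight; they are not the weights behind the table above.

```python
import random
from collections import Counter

records = list(range(1, 11))
# Hypothetical weights after a few rounds: record 4 keeps being misclassified,
# so it carries half of the total weight; the other nine share the rest equally.
weights = [0.5 if r == 4 else 0.5 / 9 for r in records]

rng = random.Random(0)  # fixed seed, chosen arbitrarily
sample = rng.choices(records, weights=weights, k=1000)
counts = Counter(sample)
print(counts.most_common(3))
```

With these weights, record 4 should make up roughly half of the draws, while each other record appears only about one time in eighteen.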

(17)

Boosting

Advantages:

– Focuses on samples that are hard to classify

Sample weights can be used for:

1 Sampling probability

2 Weighting records inside the classifier so that hard ones count more

AdaBoost:

– Computes an importance for each classifier and uses it to weight the votes, instead of a plain majority vote

– Exponential weight update rules

– But susceptible to overfitting

(18)

Example: AdaBoost

• Base classifiers: $C_1, C_2, \ldots, C_T$

• Error rate:

$$\varepsilon_i = \frac{1}{N} \sum_{j=1}^{N} w_j \, \delta\big(C_i(x_j) \neq y_j\big)$$

• Importance of a classifier:

$$\alpha_i = \frac{1}{2} \ln\left(\frac{1-\varepsilon_i}{\varepsilon_i}\right)$$
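A quick numeric look at the importance formula α = ½ ln((1 − ε)/ε) shows how it behaves. This is a minimal sketch; the function name `importance` and the error rates tried are arbitrary.

```python
from math import log

def importance(eps: float) -> float:
    """Classifier importance alpha = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * log((1 - eps) / eps)

for eps in (0.1, 0.35, 0.5, 0.65):
    print(eps, round(importance(eps), 3))
```

The importance is large for accurate classifiers, zero at ε = 0.5 (no better than chance on two classes), and negative above that.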

(19)

Example: AdaBoost

• Weight update:

$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$$

where $Z_j$ is the normalization factor (here j indexes the boosting round and i the record)

• If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/N and the resampling procedure is repeated

• Classification:

$$C^*(x) = \arg\max_{y} \sum_{j=1}^{T} \alpha_j \, \delta\big(C_j(x) = y\big)$$
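The pieces of AdaBoost (weighted error, classifier importance, exponential weight update, weighted vote) can be put together in a short sketch. Several things here are assumptions rather than part of the slides: the base classifier is a decision stump on one feature, records are reweighted directly instead of resampled, the weights are kept normalized to sum to 1 so the weighted error is computed as Σ w·δ without a further 1/N factor, and the toy data and helper names are made up.

```python
import math
from collections import defaultdict

def train_stump(xs, ys, weights):
    """Pick the threshold stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for left, right in ((-1, 1), (1, -1)):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (left if x <= threshold else right) != y)
            if best is None or err < best[0]:
                best = (err, threshold, left, right)
    err, threshold, left, right = best
    return err, (lambda x: left if x <= threshold else right)

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n                  # equal initial weights
    ensemble = []                          # list of (alpha_j, C_j)
    for _ in range(rounds):
        eps, stump = train_stump(xs, ys, weights)
        if eps > 0.5:                      # reset rule: go back to uniform weights
            weights = [1 / n] * n
            continue
        eps = max(eps, 1e-10)              # guard against division by zero (assumption)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # Shrink weights of correctly classified records, grow the others.
        weights = [w * math.exp(-alpha if stump(x) == y else alpha)
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)                   # normalization factor Z_j
        weights = [w / z for w in weights]
    return ensemble

def classify(ensemble, x):
    """C*(x) = argmax_y sum_j alpha_j * delta(C_j(x) = y)."""
    score = defaultdict(float)
    for alpha, stump in ensemble:
        score[stump(x)] += alpha
    return max(score, key=score.get)

# Toy data (made up): no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([classify(ensemble, x) for x in xs])
```

On this toy data a single stump misclassifies at least three records, while three boosting rounds should be enough for the weighted vote to classify all ten correctly.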

(20)

(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002

Illustrating AdaBoost

[Figure: data points for training, with the initial weights for each data point]

(21)

Illustrating AdaBoost

(22)

Random Forests

• Sample a data set with replacement

• Select m variables at random from p variables

• Create a tree

• Similarly create more trees

• Combine the results

• Reference:

– Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Chapter 15
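The steps above can be sketched as follows. This is a heavily simplified illustration, not a full random forest: each "tree" is a single-split stump, the m variables are drawn once per tree, and the toy data and helper names are made up.

```python
import random
from collections import Counter

def train_stump(rows, labels, features):
    """Best single-split tree restricted to the given feature indices."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            for left, right in ((0, 1), (1, 0)):
                errors = sum((left if row[f] <= threshold else right) != y
                             for row, y in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left, right)
    _, f, threshold, left, right = best
    return lambda row: left if row[f] <= threshold else right

def random_forest(rows, labels, n_trees, m, rng):
    p = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        features = rng.sample(range(p), m)               # m of the p variables
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Combine the trees by majority vote."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# Toy data (made up): class 0 has all features in [0, 0.4], class 1 in [0.6, 1.0].
rng = random.Random(1)
labels = [0] * 10 + [1] * 10
rows = [[rng.uniform(0.0, 0.4) + 0.6 * y for _ in range(4)] for y in labels]
forest = random_forest(rows, labels, n_trees=15, m=2, rng=rng)
print(predict(forest, [0.0] * 4), predict(forest, [1.0] * 4))
```

Compared with plain bagging, the only new ingredient is the random feature subset, which makes the individual trees less correlated with each other.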

(23)

Random Forests

Advantages:

– Only for decision trees

– Lowers generalization error

– Uses randomization in tree construction: #features = log₂d + 1

– Equivalent accuracy to AdaBoost, but faster

See table in Tan et al p. 294 for comparison of ensemble methods.
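As a small illustration of the #features = log₂d + 1 rule, the sketch below computes the subset size and draws one random subset. Rounding log₂d down is an assumption, since the slide does not say how a non-integer value is handled, and the function names are hypothetical.

```python
import math
import random

def n_features_per_split(d: int) -> int:
    """Number of candidate features: floor(log2(d)) + 1 (the flooring is an assumption)."""
    return int(math.log2(d)) + 1

def random_feature_subset(d: int, rng: random.Random):
    """Pick that many distinct feature indices uniformly at random."""
    return sorted(rng.sample(range(d), n_features_per_split(d)))

rng = random.Random(0)  # fixed seed, chosen arbitrarily
for d in (4, 16, 100):
    print(d, n_features_per_split(d), random_feature_subset(d, rng))
```

Because the subset size grows only logarithmically in d, each split examines far fewer candidates than a full search over all d features, which is one reason for the speed advantage noted above.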
