• No results found

Model Evaluation 1:

N/A
N/A
Protected

Academic year: 2021

Share "Model Evaluation 1:"

Copied!
57
0
0

Loading.... (view fulltext now)

Full text

(1)

Model Evaluation 1:

Introduction to Overfitting and Underfitting

Lecture 08

STAT 479: Machine Learning, Fall 2018 Sebastian Raschka

http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/

(2)

Sebastian Raschka STAT 479: Machine Learning FS 2018 2

Overview

Model Eval Lectures

Basics

Bias and Variance

Overfitting and Underfitting Holdout method

Confidence Intervals

Resampling methods Repeated holdout

Empirical confidence intervals Cross-Validation

Hyperparameter tuning Model selection

Algorithm Selection Statistical Tests

Evaluation Metrics

(3)

Overfitting and Underfitting

(4)

Sebastian Raschka STAT 479: Machine Learning FS 2018 4

Overfitting and Underfitting

• Want a model to "generalize" well to unseen data


("high generalization accuracy" or "low generalization error")

"Generalization Performance"

(5)

Overfitting and Underfitting

Assumptions

• i.i.d. assumption: inputs are independent, and training and test examples are identically distributed (drawn from the same probability distribution)

• The training error or accuracy provides an (optimistically) biased estimate of the generalization performance

• For some random model that has not been fitted to the training set, we

expect both the training and test error to be equal

(6)

Sebastian Raschka STAT 479: Machine Learning FS 2018 6

• Underfitting: both training and test error are large

• Overfitting: gap between training and test error (where test error is higher)

• Large hypothesis space being searched by a learning algorithm 
 -> high tendency to overfit

Overfitting and Underfitting

Model Capacity

(7)

Overfitting and Underfitting

(8)

Sebastian Raschka STAT 479: Machine Learning FS 2018 8

"[...] model has high bias/variance" -- What does that mean?

(9)

"[...] model has high bias/variance" -- What does that mean?

Originally formulated for regression

(10)

Sebastian Raschka STAT 479: Machine Learning FS 2018 10

Bias-Variance Decomposition and Trade-off

(11)

Bias-Variance Decomposition

• Decomposition of the loss into bias and variance help us understand

learning algorithms, concepts are correlated to underfitting and overfitting


• Helps explain why ensemble methods (last lecture) might perform better

than single models

(12)

Sebastian Raschka STAT 479: Machine Learning FS 2018 12

Low Variance (Precise)

High Variance (Not Precise)

Low Bias (Accurate)High Bias (Not Accurate)

Bias-Variance Intuition

(13)

Bias and Variance Intuition

(14)

Sebastian Raschka STAT 479: Machine Learning FS 2018 14

Bias and Variance Intuition

(15)

Bias and Variance Intuition

(16)

Sebastian Raschka STAT 479: Machine Learning FS 2018 16

Bias and Variance Intuition

(17)

Bias and Variance Intuition

High bias

(There are two points where the bias is zero)

(18)

Sebastian Raschka STAT 479: Machine Learning FS 2018 18

(here, I fit an unpruned decision tree) High Variance

Why?

Bias and Variance Intuition

(19)

where f(x) is some true (target) function suppose we have multiple training sets

Bias and Variance Example

(20)

Sebastian Raschka STAT 479: Machine Learning FS 2018 20

Bias and Variance Example

High variance

(21)

Bias and Variance Example

High variance

What happens if we take the average?

(22)

Sebastian Raschka STAT 479: Machine Learning FS 2018 22

Terminology

̂θ

Point estimator of some parameter θ

(could also be a function, e.g., the hypothesis is

an estimator of some target function)

(23)

Terminology

̂θ

Point estimator of some parameter θ

(could also be a function, e.g., the hypothesis is an estimator of some target function)

Bias = E[ ̂θ] − θ

(24)

Sebastian Raschka STAT 479: Machine Learning FS 2018 24

Bias-Variance Decomposition

Bias ( ̂θ) = E[ ̂θ] − θ

Var ( ̂θ) = E[ ̂θ

2

] − (E[ ̂θ])

2

General Definition:

Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)

2

]

(25)

Bias-Variance Decomposition

Bias ( ̂θ) = E[ ̂θ] − θ

Var ( ̂θ) = E[ ̂θ

2

] − (E[ ̂θ])

2

General Definition: Intuition:

Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)

2

]

(we ignore noise in this lecture for simplicity)

(26)

Sebastian Raschka STAT 479: Machine Learning FS 2018 26

Bias-Variance Decomposition of Squared Error

Bias ( ̂θ) = E[ ̂θ] − θ

Var ( ̂θ) = E[ ̂θ

2

] − (E[ ̂θ])

2

General Definition:

The variance provides an estimate of how much the estimate varies as we vary the training data (e.g,. by resampling).

Intuition:

Bias is the difference between the average estimator from different training samples and the true value.

(The expectation is over the training sets.)

Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)

2

]

(27)

Bias-Variance Decomposition

Loss = Bias + Variance + Noise

(28)

Sebastian Raschka STAT 479: Machine Learning FS 2018 28

Bias-Variance Decomposition of Squared Error

Bias ( ̂θ) = E[ ̂θ] − θ

Var ( ̂θ) = E[ ̂θ

2

] − (E[ ̂θ])

2

General Definition:

Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)

2

]

y = f(x)

"ML notation" for the Squared Error Loss:

S = (y − ̂y)

2

̂y = ̂f(x) = h(x)

(target, target function)

(For the sake of

simplicity, we ignore the noise term in this lecture) (Next slides: the expectation is over the training data, i.e, the average

estimator from different training samples)

(29)

Bias-Variance Decomposition of Squared Error

y = f(x)

S = (y − ̂y)

2

̂y = ̂f(x) = h(x)

(target, target function)

S = (y − ̂y)

2

(y − ̂y)

2

= (y − E[ ̂y] + E[ ̂y] − ̂y)

2

= (y − E[ ̂y])

2

+ (E[ ̂y] − y)

2

+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)

"ML notation" for the Squared Error Loss:

(x is a particular data point e.g,. in the test set;

the expectation is over training sets)

(30)

Sebastian Raschka STAT 479: Machine Learning FS 2018 30

Bias-Variance Decomposition of Squared Error

S = (y − ̂y)

2

(y − ̂y)

2

= (y − E[ ̂y] + E[ ̂y] − ̂y)

2

= (y − E[ ̂y])

2

+ (E[ ̂y] − y)

2

+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)

E[S] = E[(y − ̂y)

2

]

E[(y − ̂y)

2

] = (y − E[ ̂y])

2

+ E[(E[ ̂y] − ̂y)

2

]

= [Bias of the fit]

2

+ Variance of the fit

(The expectation is over the training data, i.e, the average estimator

from different training samples)

(31)

Bias-Variance Decomposition of Squared Error

S = (y − ̂y)

2

(y − ̂y)

2

= (y − E[ ̂y] + E[ ̂y] − ̂y)

2

= (y − E[ ̂y])

2

+ (E[ ̂y] − y)

2

+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)

E[S] = E[(y − ̂y)

2

]

E[(y − ̂y)

2

] = (y − E[ ̂y])

2

+ E[(E[ ̂y] − ̂y)

2

]

= [Bias]

2

+ Variance

???

(32)

Sebastian Raschka STAT 479: Machine Learning FS 2018 32

Bias-Variance Decomposition of Squared Error

S = (y − ̂y)

2

(y − ̂y)

2

= (y − E[ ̂y] + E[ ̂y] − ̂y)

2

= (y − E[ ̂y])

2

+ (E[ ̂y] − y)

2

+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)

???

E[2(y − E[ ̂y])(E[ ̂y] − ̂y)] = 2E[(y − E[ ̂y])(E[ ̂y] − ̂y)]

= 2(y − E[ ̂y])E[(E[ ̂y] − ̂y)]

= 2(y − E[ ̂y])(E[E[ ̂y]] − E[ ̂y])

= 2(y − E[ ̂y])(E[ ̂y] − E[ ̂y])

= 0

(33)
(34)

Sebastian Raschka STAT 479: Machine Learning FS 2018 34

"several authors have proposed bias-variance decompositions related to zero- one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996;

Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has significant shortcomings."

Domingos, P. (2000). A unified bias-variance decomposition. 


In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

(35)

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

Squared Loss Generalized Loss

(y − ̂y)

2

L(y, ̂y)

E[(y − ̂y)

2

] E[L(y, ̂y)]

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

(36)

Sebastian Raschka STAT 479: Machine Learning FS 2018 36

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

Squared Loss Generalized Loss

(y − ̂y)

2

L(y, ̂y)

E[(y − ̂y)

2

] E[L(y, ̂y)]

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

E[(y − ̂y)

2

] = (y − E[ ̂y])

2

+ E[(E[ ̂y] − ̂y)

2

]

Bias

2

+ Variance

(37)

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

Squared Loss Generalized Loss

(y − ̂y)

2

L(y, ̂y)

E[(y − ̂y)

2

] E[L(y, ̂y)]

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

E[(y − ̂y)

2

] = (y − E[ ̂y])

2

+ E[(E[ ̂y] − ̂y)

2

] Bias

2

+ Variance

Bias

2

: (y − E[ ̂y])

2

L(y, E[ ̂y])

(38)

Sebastian Raschka STAT 479: Machine Learning FS 2018 38

Define "Main Prediction"

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

The main prediction is the prediction that minimizes the average loss

E[L( ̂y, ̂y′)]

¯̂y = argmin

̂y′

For squared loss -> Mean

For 0-1 loss -> Mode

(39)

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

Squared Loss 0-1 Loss

(y − ̂y)

2

L(y, ̂y)

E[(y − ̂y)

2

] E[L(y, ̂y)]

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

E[(y − ̂y)

2

] = (y − E[ ̂y])

2

+ E[(E[ ̂y] − ̂y)

2

] Bias

2

+ Variance

Bias Main prediction -> Mean

2

: (y − E[ ̂y])

2

Main prediction -> Mode L(y, E[ ̂y])

(40)

Sebastian Raschka STAT 479: Machine Learning FS 2018 40

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

Squared Loss 0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias

2

: (y − E[ ̂y])

2

L(y, E[ ̂y])

Variance: E[(E[ ̂y] − ̂y)

2

] E[L( ̂y, E[ ̂y])]

Main prediction -> Mean Main prediction -> Mode

Bias = { 1 if y ≠ ¯̂y

0 otherwise

Variance = P( ̂y ≠ ̂¯y)

E[(y − ̂y)

2

] E[L(y, ̂y)]

P(y ≠ ̂y)

(41)

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias = { 1 if y ≠ ¯̂y

0 otherwise

P( ̂y ≠ y)

Loss = Bias + Variance =

P( ̂y ≠ y)

Loss = Variance =

Variance = P( ̂y ≠ ̂¯y)

(42)

Sebastian Raschka STAT 479: Machine Learning FS 2018 42

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias = { 1 if y ≠ ¯̂y

0 otherwise

P( ̂y ≠ y)

Loss = Bias + Variance =

P( ̂y ≠ y)

Loss = Variance =

Variance = P( ̂y ≠ ̂¯y)

(43)

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias = { 1 if y ≠ ¯̂y

0 otherwise

P( ̂y ≠ y)

Loss =

P( ̂y ≠ y) = 1 − P( ̂y = y)

Loss =

(44)

Sebastian Raschka STAT 479: Machine Learning FS 2018 44

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias = { 1 if y ≠ ¯̂y

0 otherwise

P( ̂y ≠ y)

Loss =

P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)

Loss =

(45)

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias = { 1 if y ≠ ¯̂y

0 otherwise

P( ̂y ≠ y)

Loss =

P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)

Loss =

(46)

Sebastian Raschka STAT 479: Machine Learning FS 2018 46

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias = { 1 if y ≠ ¯̂y

0 otherwise

P( ̂y ≠ y)

Loss =

P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)

Loss =

Loss = Bias - Variance

(47)

Bias-Variance Decomposition of 0-1 Loss

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

0-1 Loss

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

Bias = { 1 if y ≠ ¯̂y

0 otherwise

P( ̂y ≠ y)

Loss =

P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)

Loss =

Loss = Bias - Variance

Variance can improve loss!!

Why is that so?

(48)

Sebastian Raschka STAT 479: Machine Learning FS 2018 48

Statistical Bias vs "Machine Learning Bias"

• "Machine learning bias" sometimes also called "inductive bias"


• e.g., decision tree algorithms consider small trees before they consider large trees


(if training data can be classified by small tree, large trees are not considered)

(49)

Hypothesis Space

Entire hypothesis space

Hypothesis space a particular learning algorithm can sample Hypothesis space

a particular learning algorithm category has access to

Particular hypothesis

(From Lecture 1)

e.g,. decision tree + KNN

e.g,. decision tree

e.g,. ID3

(50)

Sebastian Raschka STAT 479: Machine Learning FS 2018 50

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

(51)

Bias-Variance Simulation of C 4.5

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

0 2 4 6 8 10 12 14

0 2 4 6 8 10 12 14

x2

x1 class c0

class c1

0 2 4 6 8 10 12 14

0 2 4 6 8 10 12 14

x1

• simulation on 200 training sets with 200 examples each (0-1 labels)

• 200 hypotheses

• test set: 22,801 examples (1 data point for each grid point)

• mean error rate is 536 errors (out of the 22,801 test examples)

• 297 as a result of bias

• 239 as a result of variance

(remember that trees use a

"staircase" to approximate

diagonal boundaries)

(52)

Sebastian Raschka STAT 479: Machine Learning FS 2018 52

Bias-Variance Simulation of C 4.5

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

errors due to bias: 1788

errors due to variance:1046

(53)

Bias-Variance Simulation of C 4.5

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

errors due to bias: 1788

errors due to variance: 1046

why?

(54)

Sebastian Raschka STAT 479: Machine Learning FS 2018 54

Bias-Variance Simulation of C 4.5

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

errors due to bias: 0

errors due to variance: 17

(55)

Bias-Variance Simulation of C 4.5

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

0 5 10 15 20 25 30 35 40 45

0 5 10 15 20

Loss (%)

k

L B V Vu Vb

0 2 4 6 8 10 12 14

0 5 10 15 20

Loss (%)

L B V Vu Vb

L = Loss B = Bias

V = (net) Variance

(56)

Sebastian Raschka STAT 479: Machine Learning FS 2018 56

Recommended Reading Resources for Bias-Decomposition

or more precisely

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State

University.

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

0-1 loss

includes noise

and more general: Loss = Bias + c Variance c

1

N(x) + B(x) + c

2

V(x)

c

1

= c

2

= 1

where, e.g., for squared loss

(57)

Model Eval Lectures

Basics

Bias and Variance

Overfitting and Underfitting Holdout method

Confidence Intervals

Resampling methods Repeated holdout

Empirical confidence intervals Cross-Validation

Hyperparameter tuning Model selection

Algorithm Selection Statistical Tests

Evaluation Metrics

Next Lecture

References

Related documents

LTG Perna's other command assignments include: Commander, Joint Munitions Command and Joint Munitions and Lethality Lifecycle Management Command, responsible for the

In conclusion, a very high level of fear avoidance is a risk factor for long-term sickness absence among workers with musculoskeletal pain regardless of the level of

We then discuss optimal student credit arrangements that balance three important objectives: (i) providing credit for students to access college and finance consumption while in

This can be defined as “those considerations of morals and rights upon which democracy must be founded and according to which it must be built to be right and just” (Clancy, n.d.)

The traditional academic framework for the analysis of inequality is based on a-spatial inequality measures producing a global estimate of the inhomo- geneities inside the

The Horizons Target Date Funds are invested in units/shares of the following underlying funds: Horizons Stable Income Fund, Horizons Bond Fund, Horizons Large Cap Fund,

Outcomes _ 165 Overview 165 Introduction 166 Determining ethical risk 167 Ethical risk assessment 168 Codifying ethical standards 169 Directional codes 169 Aspirational codes