Model Evaluation 1:

(1)

Model Evaluation 1: Introduction to Overfitting and Underfitting

Lecture 08

STAT 479: Machine Learning, Fall 2018 Sebastian Raschka

http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/

(2)

Overview

Model Eval Lectures

Basics
• Bias and Variance
• Overfitting and Underfitting
• Holdout method

Confidence Intervals
• Resampling methods
• Repeated holdout
• Empirical confidence intervals

Cross-Validation
• Hyperparameter tuning
• Model selection

Statistical Tests
• Algorithm Selection

Evaluation Metrics

(3)

Overfitting and Underfitting

(4)

Overfitting and Underfitting

• Want a model to "generalize" well to unseen data ("high generalization accuracy" or "low generalization error")

"Generalization Performance"

(5)

Overfitting and Underfitting

Assumptions

• i.i.d. assumption: inputs are independent, and training and test examples are identically distributed (drawn from the same probability distribution)

• The training error or accuracy provides an (optimistically) biased estimate of the generalization performance

• For some random model that has not been fitted to the training set, we expect both the training and test error to be equal

(6)

• Underfitting: both training and test error are large

• Overfitting: gap between training and test error (where test error is higher)

• Large hypothesis space being searched by a learning algorithm -> high tendency to overfit

Overfitting and Underfitting

Model Capacity
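
The definitions of underfitting and overfitting above can be illustrated with a small experiment. The following is a minimal sketch, not taken from the lecture: it assumes a sine target function with Gaussian noise and uses a hand-written k-nearest-neighbor regressor as a stand-in model, where a small k corresponds to high capacity and a large k to low capacity. All function names are hypothetical.

```python
import math
import random

def target(x):
    """Assumed true function f(x) for this illustration."""
    return math.sin(x)

def make_data(n, rng, noise=0.3):
    """Draw n i.i.d. examples (x, y) with y = f(x) + Gaussian noise."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, target(x) + rng.gauss(0.0, noise)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of the k-NN model built from `train`."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(50, rng)
test = make_data(500, rng)  # drawn from the same distribution (i.i.d. assumption)

results = {}
for k in (1, 5, 50):
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:2d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With k=1 the model memorizes the training set, so the training error is zero while the test error is not (a gap, i.e., overfitting). With k=50 every prediction is the mean of all training targets, so both errors are large (underfitting).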

(7)

Overfitting and Underfitting

(8)

"[...] model has high bias/variance" -- What does that mean?

(9)

Originally formulated for regression

(10)

Bias-Variance Decomposition and Trade-oﬀ

(11)

Bias-Variance Decomposition

• Decomposing the loss into bias and variance helps us understand learning algorithms; these concepts are related to underfitting and overfitting

• Helps explain why ensemble methods (last lecture) might perform better than single models

(12)

Bias-Variance Intuition

[Figure: grid of panels comparing Low Variance (Precise) vs. High Variance (Not Precise) and Low Bias (Accurate) vs. High Bias (Not Accurate)]

(13)

Bias and Variance Intuition

(17)

Bias and Variance Intuition

High bias

(There are two points where the bias is zero)

(18)

High Variance (here, I fit an unpruned decision tree)

Why?

Bias and Variance Intuition

(19)

Suppose we have multiple training sets, where f(x) is some true (target) function

Bias and Variance Example

(20)

Bias and Variance Example

High variance

(21)

Bias and Variance Example

High variance

What happens if we take the average?
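
One way to explore this question is a small simulation. The sketch below is an illustration under stated assumptions, not the lecture's own experiment: it assumes a sine target with noisy labels and uses 1-nearest-neighbor regression as a stand-in for a high-variance model (the slides use an unpruned decision tree). It compares the typical error of a single model with the error of the average prediction over many training sets.

```python
import math
import random

def target(x):
    """Assumed true (target) function f(x)."""
    return math.sin(x)

def sample_training_set(n, rng, noise=0.5):
    """Draw one training set of n noisy examples."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, target(x) + rng.gauss(0.0, noise)) for x in xs]

def one_nn_predict(train, x):
    """1-nearest-neighbor regression: copy the target of the closest point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

rng = random.Random(1)
x0 = 2.0  # a fixed query point

# One prediction at x0 per independently drawn training set
predictions = [one_nn_predict(sample_training_set(30, rng), x0)
               for _ in range(2000)]

average_prediction = sum(predictions) / len(predictions)
mean_abs_error_single = sum(abs(p - target(x0)) for p in predictions) / len(predictions)
error_of_average = abs(average_prediction - target(x0))

print(f"typical error of one model:  {mean_abs_error_single:.3f}")
print(f"error of averaged prediction: {error_of_average:.3f}")
```

In this setup the individual predictions scatter widely around f(x0), while their average lands much closer to it, which is the intuition behind a low-bias, high-variance model.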

(23)

Terminology

Point estimator $\hat{\theta}$ of some parameter $\theta$

(could also be a function, e.g., the hypothesis is an estimator of some target function)

$\text{Bias} = E[\hat{\theta}] - \theta$

(26)

Bias-Variance Decomposition of Squared Error

General Definition:

$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$

$\text{Var}(\hat{\theta}) = E[\hat{\theta}^2] - (E[\hat{\theta}])^2$

$\text{Var}(\hat{\theta}) = E[(E[\hat{\theta}] - \hat{\theta})^2]$

Intuition:

Bias is the difference between the average estimator from different training samples and the true value. (The expectation is over the training sets.)

The variance provides an estimate of how much the estimate varies as we vary the training data (e.g., by resampling).
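
The definitions above can be approximated numerically by drawing many training samples and applying the estimator to each. The following is a minimal sketch under assumptions that are not in the slides: the data are Gaussian with mean θ = 3 and unit variance, and `shrunken_mean` is a hypothetical, deliberately biased estimator used only for contrast with the sample mean.

```python
import random

def estimate_bias_and_variance(estimator, theta, n, repeats, rng):
    """Approximate Bias and Var of `estimator` over many training samples."""
    estimates = []
    for _ in range(repeats):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        estimates.append(estimator(sample))
    mean_est = sum(estimates) / repeats  # approximates E[theta_hat]
    bias = mean_est - theta              # E[theta_hat] - theta
    # Var as E[(E[theta_hat] - theta_hat)^2]
    var_def = sum((mean_est - e) ** 2 for e in estimates) / repeats
    # Var as E[theta_hat^2] - (E[theta_hat])^2
    var_alt = sum(e * e for e in estimates) / repeats - mean_est ** 2
    return bias, var_def, var_alt

def sample_mean(sample):
    return sum(sample) / len(sample)

def shrunken_mean(sample):
    """Hypothetical biased estimator: shrink the sample mean toward zero."""
    return 0.5 * sample_mean(sample)

rng = random.Random(0)
theta = 3.0
mean_stats = estimate_bias_and_variance(sample_mean, theta, 10, 5000, rng)
shrunk_stats = estimate_bias_and_variance(shrunken_mean, theta, 10, 5000, rng)

for name, (bias, var_def, var_alt) in (("sample mean", mean_stats),
                                       ("shrunken mean", shrunk_stats)):
    print(f"{name:13s} bias={bias:+.3f}  var={var_def:.4f}  var (alt. form)={var_alt:.4f}")
```

The two variance formulas agree up to floating-point error, and the shrunken estimator trades a large bias for a smaller variance.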

(27)

Bias-Variance Decomposition

Loss = Bias + Variance + Noise

(28)

Bias-Variance Decomposition of Squared Error

General Definition:

$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$

$\text{Var}(\hat{\theta}) = E[\hat{\theta}^2] - (E[\hat{\theta}])^2$

$\text{Var}(\hat{\theta}) = E[(E[\hat{\theta}] - \hat{\theta})^2]$

"ML notation" for the Squared Error Loss:

$y = f(x)$ (target, target function)

$\hat{y} = \hat{f}(x) = h(x)$

$S = (y - \hat{y})^2$

(For the sake of simplicity, we ignore the noise term in this lecture)

(Next slides: the expectation is over the training data, i.e., the average estimator from different training samples)

(29)

Bias-Variance Decomposition of Squared Error

"ML notation" for the Squared Error Loss:

$y = f(x)$ (target, target function)

$\hat{y} = \hat{f}(x) = h(x)$

$S = (y - \hat{y})^2$

(x is a particular data point, e.g., in the test set; the expectation is over training sets)

$(y - \hat{y})^2 = (y - E[\hat{y}] + E[\hat{y}] - \hat{y})^2$

$= (y - E[\hat{y}])^2 + (E[\hat{y}] - \hat{y})^2 + 2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})$

(30)

Bias-Variance Decomposition of Squared Error

Taking the expectation of $S = (y - \hat{y})^2$ after expanding around $E[\hat{y}]$:

$E[S] = E[(y - \hat{y})^2]$

$E[(y - \hat{y})^2] = (y - E[\hat{y}])^2 + E[(E[\hat{y}] - \hat{y})^2]$

$= [\text{Bias of the fit}]^2 + \text{Variance of the fit}$

(The expectation is over the training data, i.e., the average estimator from different training samples)

(32)

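Bias-Variance Decomposition of Squared Error

What happens to the cross term $2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})$ from the expansion of $(y - \hat{y})^2$? Its expectation is zero, because $y - E[\hat{y}]$ is a constant with respect to the expectation over training sets:

$E[2(y - E[\hat{y}])(E[\hat{y}] - \hat{y})] = 2E[(y - E[\hat{y}])(E[\hat{y}] - \hat{y})]$

$= 2(y - E[\hat{y}])E[(E[\hat{y}] - \hat{y})]$

$= 2(y - E[\hat{y}])(E[E[\hat{y}]] - E[\hat{y}])$

$= 2(y - E[\hat{y}])(E[\hat{y}] - E[\hat{y}])$

$= 0$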
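
The decomposition can be checked numerically at a single test point. The sketch below rests on assumptions that are not in the slides: the target is f(x) = x², noise is ignored as in the lecture, and the hypothesis is a least-squares line fitted to 10 random training points (a hypothetical choice that produces both nonzero bias and nonzero variance). Expectations are replaced by averages over 1000 training sets.

```python
import random

def target(x):
    """Assumed target function f(x); noise is ignored as in the slides."""
    return x * x

def fit_line(train):
    """Ordinary least-squares fit of h(x) = a + b*x to one training set."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

rng = random.Random(0)
x0 = 0.9        # a particular test point
y = target(x0)  # its target value

# Predictions y_hat at x0 from hypotheses fitted on different training sets
y_hats = []
for _ in range(1000):
    xs = [rng.uniform(0.0, 1.0) for _ in range(10)]
    train = [(x, target(x)) for x in xs]
    y_hats.append(fit_line(train)(x0))

m = len(y_hats)
mean_y_hat = sum(y_hats) / m                               # E[y_hat]
expected_loss = sum((y - p) ** 2 for p in y_hats) / m      # E[(y - y_hat)^2]
bias_sq = (y - mean_y_hat) ** 2                            # (y - E[y_hat])^2
variance = sum((mean_y_hat - p) ** 2 for p in y_hats) / m  # E[(E[y_hat] - y_hat)^2]
cross_term = sum(2 * (y - mean_y_hat) * (mean_y_hat - p) for p in y_hats) / m

print(f"E[S]            = {expected_loss:.6f}")
print(f"Bias^2 + Var    = {bias_sq + variance:.6f}")
print(f"cross term mean = {cross_term:.2e}")
```

Because the same empirical mean is used for E[ŷ] throughout, the identity holds up to floating-point rounding and the averaged cross term is numerically zero.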

(33)

(34)

"several authors have proposed bias-variance decompositions related to zero-one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996; Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has significant shortcomings."

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

(35)

Bias-Variance Decomposition of 0-1 Loss

Squared Loss: $(y - \hat{y})^2$, with expected loss $E[(y - \hat{y})^2]$

Generalized Loss: $L(y, \hat{y})$, with expected loss $E[L(y, \hat{y})]$

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

(37)

Bias-Variance Decomposition of 0-1 Loss

Squared loss decomposes as

$E[(y - \hat{y})^2] = (y - E[\hat{y}])^2 + E[(E[\hat{y}] - \hat{y})^2] = \text{Bias}^2 + \text{Variance}$

Bias term:

Squared Loss: $\text{Bias}^2 = (y - E[\hat{y}])^2$

Generalized Loss: $L(y, E[\hat{y}])$

(38)

Define "Main Prediction"

The main prediction is the prediction that minimizes the average loss:

$\bar{\hat{y}} = \operatorname{argmin}_{\hat{y}'} E[L(\hat{y}, \hat{y}')]$

For squared loss -> Mean

For 0-1 loss -> Mode
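
The claim that the main prediction is the mean under squared loss and the mode under 0-1 loss can be checked by brute force. The following is a small sketch with made-up predictions; the helper names and the candidate grid are assumptions chosen for illustration only.

```python
from collections import Counter

def average_loss(candidate, predictions, loss):
    """Average loss between `candidate` and each individual prediction."""
    return sum(loss(p, candidate) for p in predictions) / len(predictions)

def squared_loss(a, b):
    return (a - b) ** 2

def zero_one_loss(a, b):
    return 0 if a == b else 1

# Hypothetical class-label predictions from models fitted on different training sets
label_predictions = [1, 1, 1, 0, 1, 0, 1, 1]
mode = Counter(label_predictions).most_common(1)[0][0]
best_label = min(set(label_predictions),
                 key=lambda c: average_loss(c, label_predictions, zero_one_loss))

# Hypothetical regression predictions; search a grid of candidate values
value_predictions = [2.0, 2.5, 3.5, 4.0]
mean = sum(value_predictions) / len(value_predictions)
candidates = [i / 100 for i in range(0, 501)]  # 0.00, 0.01, ..., 5.00
best_value = min(candidates,
                 key=lambda c: average_loss(c, value_predictions, squared_loss))

print("0-1 loss:     argmin =", best_label, " mode =", mode)
print("squared loss: argmin =", best_value, " mean =", mean)
```

In both cases the brute-force minimizer coincides with the closed-form main prediction (mode and mean, respectively).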

(40)

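Bias-Variance Decomposition of 0-1 Loss

Squared Loss (main prediction -> mean):

• Expected loss: $E[(y - \hat{y})^2]$

• Bias²: $(y - E[\hat{y}])^2$

• Variance: $E[(E[\hat{y}] - \hat{y})^2]$

0-1 Loss (main prediction -> mode):

• Expected loss: $E[L(y, \hat{y})] = P(y \neq \hat{y})$

• Bias: $L(y, E[\hat{y}])$, that is, $\text{Bias} = \begin{cases} 1 & \text{if } y \neq \bar{\hat{y}} \\ 0 & \text{otherwise} \end{cases}$

• Variance: $E[L(\hat{y}, E[\hat{y}])]$, that is, $\text{Variance} = P(\hat{y} \neq \bar{\hat{y}})$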

(41)

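Bias-Variance Decomposition of 0-1 Loss

0-1 Loss

$\text{Bias} = \begin{cases} 1 & \text{if } y \neq \bar{\hat{y}} \\ 0 & \text{otherwise} \end{cases}$

$\text{Variance} = P(\hat{y} \neq \bar{\hat{y}})$

$\text{Loss} = P(\hat{y} \neq y)$

If the bias is 0 (that is, $\bar{\hat{y}} = y$):

$\text{Loss} = \text{Bias} + \text{Variance} = \text{Variance} = P(\hat{y} \neq y)$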

(47)

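Bias-Variance Decomposition of 0-1 Loss

0-1 Loss

$\text{Bias} = \begin{cases} 1 & \text{if } y \neq \bar{\hat{y}} \\ 0 & \text{otherwise} \end{cases}$

$\text{Loss} = P(\hat{y} \neq y)$

If the bias is 1 (that is, $\bar{\hat{y}} \neq y$), then in the two-class case $\hat{y} = y$ holds exactly when $\hat{y} \neq \bar{\hat{y}}$, so

$\text{Loss} = P(\hat{y} \neq y) = 1 - P(\hat{y} = y) = 1 - P(\hat{y} \neq \bar{\hat{y}})$

$\text{Loss} = \text{Bias} - \text{Variance}$

Variance can improve loss!!

Why is that so?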
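
A tiny worked example may make both cases concrete. The sketch below assumes a two-class problem and uses made-up prediction lists for a single test point with true label 1; `zero_one_decomposition` is a hypothetical helper that follows the definitions of bias, variance, and loss given on these slides.

```python
from collections import Counter

def zero_one_decomposition(y, predictions):
    """Return (loss, bias, variance) under 0-1 loss at one test point.

    `predictions` holds the labels predicted by hypotheses that were
    fitted on different training sets.
    """
    n = len(predictions)
    main_prediction = Counter(predictions).most_common(1)[0][0]    # mode
    loss = sum(p != y for p in predictions) / n                    # P(y_hat != y)
    bias = 1 if main_prediction != y else 0
    variance = sum(p != main_prediction for p in predictions) / n  # P(y_hat != main)
    return loss, bias, variance

y = 1
unbiased = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # mode is 1, so Bias = 0
biased = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]    # mode is 0, so Bias = 1

loss_u, bias_u, var_u = zero_one_decomposition(y, unbiased)
loss_b, bias_b, var_b = zero_one_decomposition(y, biased)

print(f"Bias=0 case: loss={loss_u:.1f}  bias={bias_u}  variance={var_u:.1f}")
print(f"Bias=1 case: loss={loss_b:.1f}  bias={bias_b}  variance={var_b:.1f}")
```

In the biased case, every prediction that deviates from the (wrong) main prediction is a correct one, which is why a higher variance lowers the loss there.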

(48)

Statistical Bias vs "Machine Learning Bias"

• "Machine learning bias" sometimes also called "inductive bias" 

• e.g., decision tree algorithms consider small trees before they consider large trees (if the training data can be classified by a small tree, large trees are not considered)

(49)

Hypothesis Space (From Lecture 1)

[Figure: hypothesis space diagram with regions labeled "Entire hypothesis space", "Hypothesis space a particular learning algorithm category has access to", "Hypothesis space a particular learning algorithm can sample", and "Particular hypothesis"; example annotations: "e.g., decision tree + KNN", "e.g., decision tree", "e.g., ID3"]

(50)

Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.

(51)

Bias-Variance Simulation of C4.5

[Figure: scatter plot of two classes (class c0, class c1) over features x1 and x2; both axes range from 0 to 14]

• simulation on 200 training sets with 200 examples each (0-1 labels)

• 200 hypotheses

• test set: 22,801 examples (1 data point for each grid point)

• mean error rate is 536 errors (out of the 22,801 test examples)

• 297 as a result of bias

• 239 as a result of variance

(remember that trees use a "staircase" to approximate diagonal boundaries)

(53)

Bias-Variance Simulation of C4.5

errors due to bias: 1788

errors due to variance: 1046

why?

(54)

Bias-Variance Simulation of C4.5

errors due to bias: 0

errors due to variance: 17

(55)

Bias-Variance Simulation of C4.5

[Figure: two plots of Loss (%) with curves L, B, V, Vu, and Vb; one plot is indexed by k]

L = Loss, B = Bias, V = (net) Variance

Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).

(56)

0-1 loss, or more precisely:

$\text{Loss} = \text{Bias} + c \, \text{Variance}$

and more general (includes noise):

$\text{Loss} = c_1 N(x) + B(x) + c_2 V(x)$

where, e.g., for squared loss, $c_1 = c_2 = 1$

(57)

Model Eval Lectures

Basics
• Bias and Variance
• Overfitting and Underfitting
• Holdout method

Confidence Intervals
• Resampling methods
• Repeated holdout
• Empirical confidence intervals

Cross-Validation
• Hyperparameter tuning
• Model selection

Statistical Tests
• Algorithm Selection

Evaluation Metrics

Next Lecture
