Model Evaluation 1:
Introduction to Overfitting and Underfitting
Lecture 08
STAT 479: Machine Learning, Fall 2018 Sebastian Raschka
http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/
Sebastian Raschka STAT 479: Machine Learning FS 2018 2
Overview
Model Eval Lectures
Basics
Bias and Variance
Overfitting and Underfitting Holdout method
Confidence Intervals
Resampling methods Repeated holdout
Empirical confidence intervals Cross-Validation
Hyperparameter tuning Model selection
Algorithm Selection Statistical Tests
Evaluation Metrics
Overfitting and Underfitting
Sebastian Raschka STAT 479: Machine Learning FS 2018 4
Overfitting and Underfitting
• Want a model to "generalize" well to unseen data
("high generalization accuracy" or "low generalization error")
"Generalization Performance"
Overfitting and Underfitting
Assumptions
• i.i.d. assumption: inputs are independent, and training and test examples are identically distributed (drawn from the same probability distribution)
• The training error or accuracy provides an (optimistically) biased estimate of the generalization performance
• For some random model that has not been fitted to the training set, we
expect both the training and test error to be equal
Sebastian Raschka STAT 479: Machine Learning FS 2018 6
• Underfitting: both training and test error are large
• Overfitting: gap between training and test error (where test error is higher)
• Large hypothesis space being searched by a learning algorithm -> high tendency to overfit
Overfitting and Underfitting
Model Capacity
Overfitting and Underfitting
Sebastian Raschka STAT 479: Machine Learning FS 2018 8
"[...] model has high bias/variance" -- What does that mean?
"[...] model has high bias/variance" -- What does that mean?
Originally formulated for regression
Sebastian Raschka STAT 479: Machine Learning FS 2018 10
Bias-Variance Decomposition and Trade-off
Bias-Variance Decomposition
• Decomposition of the loss into bias and variance help us understand
learning algorithms, concepts are correlated to underfitting and overfitting
• Helps explain why ensemble methods (last lecture) might perform better
than single models
Sebastian Raschka STAT 479: Machine Learning FS 2018 12
Low Variance (Precise)
High Variance (Not Precise)
Low Bias (Accurate)High Bias (Not Accurate)
Bias-Variance Intuition
Bias and Variance Intuition
Sebastian Raschka STAT 479: Machine Learning FS 2018 14
Bias and Variance Intuition
Bias and Variance Intuition
Sebastian Raschka STAT 479: Machine Learning FS 2018 16
Bias and Variance Intuition
Bias and Variance Intuition
High bias
(There are two points where the bias is zero)
Sebastian Raschka STAT 479: Machine Learning FS 2018 18
(here, I fit an unpruned decision tree) High Variance
Why?
Bias and Variance Intuition
where f(x) is some true (target) function suppose we have multiple training sets
Bias and Variance Example
Sebastian Raschka STAT 479: Machine Learning FS 2018 20
Bias and Variance Example
High variance
Bias and Variance Example
High variance
What happens if we take the average?
Sebastian Raschka STAT 479: Machine Learning FS 2018 22
Terminology
̂θ
Point estimator of some parameter θ
(could also be a function, e.g., the hypothesis is
an estimator of some target function)
Terminology
̂θ
Point estimator of some parameter θ
(could also be a function, e.g., the hypothesis is an estimator of some target function)
Bias = E[ ̂θ] − θ
Sebastian Raschka STAT 479: Machine Learning FS 2018 24
Bias-Variance Decomposition
Bias ( ̂θ) = E[ ̂θ] − θ
Var ( ̂θ) = E[ ̂θ
2] − (E[ ̂θ])
2
General Definition:
Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)
2]
Bias-Variance Decomposition
Bias ( ̂θ) = E[ ̂θ] − θ
Var ( ̂θ) = E[ ̂θ
2] − (E[ ̂θ])
2
General Definition: Intuition:
Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)
2]
(we ignore noise in this lecture for simplicity)
Sebastian Raschka STAT 479: Machine Learning FS 2018 26
Bias-Variance Decomposition of Squared Error
Bias ( ̂θ) = E[ ̂θ] − θ
Var ( ̂θ) = E[ ̂θ
2] − (E[ ̂θ])
2
General Definition:
The variance provides an estimate of how much the estimate varies as we vary the training data (e.g,. by resampling).
Intuition:
Bias is the difference between the average estimator from different training samples and the true value.
(The expectation is over the training sets.)
Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)
2]
Bias-Variance Decomposition
Loss = Bias + Variance + Noise
Sebastian Raschka STAT 479: Machine Learning FS 2018 28
Bias-Variance Decomposition of Squared Error
Bias ( ̂θ) = E[ ̂θ] − θ
Var ( ̂θ) = E[ ̂θ
2] − (E[ ̂θ])
2
General Definition:
Var ( ̂θ) = E[(E[ ̂θ] − ̂θ)
2]
y = f(x)
"ML notation" for the Squared Error Loss:
S = (y − ̂y)
2̂y = ̂f(x) = h(x)
(target, target function)
(For the sake of
simplicity, we ignore the noise term in this lecture) (Next slides: the expectation is over the training data, i.e, the average
estimator from different training samples)
Bias-Variance Decomposition of Squared Error
y = f(x)
S = (y − ̂y)
2̂y = ̂f(x) = h(x)
(target, target function)
S = (y − ̂y)
2(y − ̂y)
2= (y − E[ ̂y] + E[ ̂y] − ̂y)
2= (y − E[ ̂y])
2+ (E[ ̂y] − y)
2+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)
"ML notation" for the Squared Error Loss:
(x is a particular data point e.g,. in the test set;
the expectation is over training sets)
Sebastian Raschka STAT 479: Machine Learning FS 2018 30
Bias-Variance Decomposition of Squared Error
S = (y − ̂y)
2(y − ̂y)
2= (y − E[ ̂y] + E[ ̂y] − ̂y)
2= (y − E[ ̂y])
2+ (E[ ̂y] − y)
2+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)
E[S] = E[(y − ̂y)
2]
E[(y − ̂y)
2] = (y − E[ ̂y])
2+ E[(E[ ̂y] − ̂y)
2]
= [Bias of the fit]
2+ Variance of the fit
(The expectation is over the training data, i.e, the average estimator
from different training samples)
Bias-Variance Decomposition of Squared Error
S = (y − ̂y)
2(y − ̂y)
2= (y − E[ ̂y] + E[ ̂y] − ̂y)
2= (y − E[ ̂y])
2+ (E[ ̂y] − y)
2+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)
E[S] = E[(y − ̂y)
2]
E[(y − ̂y)
2] = (y − E[ ̂y])
2+ E[(E[ ̂y] − ̂y)
2]
= [Bias]
2+ Variance
???
Sebastian Raschka STAT 479: Machine Learning FS 2018 32
Bias-Variance Decomposition of Squared Error
S = (y − ̂y)
2(y − ̂y)
2= (y − E[ ̂y] + E[ ̂y] − ̂y)
2= (y − E[ ̂y])
2+ (E[ ̂y] − y)
2+ 2(y − E[ ̂y])(E[ ̂y] − ̂y)
???
E[2(y − E[ ̂y])(E[ ̂y] − ̂y)] = 2E[(y − E[ ̂y])(E[ ̂y] − ̂y)]
= 2(y − E[ ̂y])E[(E[ ̂y] − ̂y)]
= 2(y − E[ ̂y])(E[E[ ̂y]] − E[ ̂y])
= 2(y − E[ ̂y])(E[ ̂y] − E[ ̂y])
= 0
Sebastian Raschka STAT 479: Machine Learning FS 2018 34
"several authors have proposed bias-variance decompositions related to zero- one loss (Kong & Dietterich, 1995; Breiman, 1996b; Kohavi & Wolpert, 1996;
Tibshirani, 1996; Friedman, 1997). However, each of these decompositions has significant shortcomings."
Domingos, P. (2000). A unified bias-variance decomposition.
In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
Squared Loss Generalized Loss
(y − ̂y)
2L(y, ̂y)
E[(y − ̂y)
2] E[L(y, ̂y)]
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Sebastian Raschka STAT 479: Machine Learning FS 2018 36
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
Squared Loss Generalized Loss
(y − ̂y)
2L(y, ̂y)
E[(y − ̂y)
2] E[L(y, ̂y)]
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
E[(y − ̂y)
2] = (y − E[ ̂y])
2+ E[(E[ ̂y] − ̂y)
2]
Bias
2+ Variance
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
Squared Loss Generalized Loss
(y − ̂y)
2L(y, ̂y)
E[(y − ̂y)
2] E[L(y, ̂y)]
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
E[(y − ̂y)
2] = (y − E[ ̂y])
2+ E[(E[ ̂y] − ̂y)
2] Bias
2+ Variance
Bias
2: (y − E[ ̂y])
2L(y, E[ ̂y])
Sebastian Raschka STAT 479: Machine Learning FS 2018 38
Define "Main Prediction"
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
The main prediction is the prediction that minimizes the average loss
E[L( ̂y, ̂y′)]
¯̂y = argmin
̂y′
For squared loss -> Mean
For 0-1 loss -> Mode
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
Squared Loss 0-1 Loss
(y − ̂y)
2L(y, ̂y)
E[(y − ̂y)
2] E[L(y, ̂y)]
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
E[(y − ̂y)
2] = (y − E[ ̂y])
2+ E[(E[ ̂y] − ̂y)
2] Bias
2+ Variance
Bias Main prediction -> Mean
2: (y − E[ ̂y])
2Main prediction -> Mode L(y, E[ ̂y])
Sebastian Raschka STAT 479: Machine Learning FS 2018 40
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
Squared Loss 0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias
2: (y − E[ ̂y])
2L(y, E[ ̂y])
Variance: E[(E[ ̂y] − ̂y)
2] E[L( ̂y, E[ ̂y])]
Main prediction -> Mean Main prediction -> Mode
Bias = { 1 if y ≠ ¯̂y
0 otherwise
Variance = P( ̂y ≠ ̂¯y)
E[(y − ̂y)
2] E[L(y, ̂y)]
P(y ≠ ̂y)
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias = { 1 if y ≠ ¯̂y
0 otherwise
P( ̂y ≠ y)
Loss = Bias + Variance =
P( ̂y ≠ y)
Loss = Variance =
Variance = P( ̂y ≠ ̂¯y)
Sebastian Raschka STAT 479: Machine Learning FS 2018 42
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias = { 1 if y ≠ ¯̂y
0 otherwise
P( ̂y ≠ y)
Loss = Bias + Variance =
P( ̂y ≠ y)
Loss = Variance =
Variance = P( ̂y ≠ ̂¯y)
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias = { 1 if y ≠ ¯̂y
0 otherwise
P( ̂y ≠ y)
Loss =
P( ̂y ≠ y) = 1 − P( ̂y = y)
Loss =
Sebastian Raschka STAT 479: Machine Learning FS 2018 44
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias = { 1 if y ≠ ¯̂y
0 otherwise
P( ̂y ≠ y)
Loss =
P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)
Loss =
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias = { 1 if y ≠ ¯̂y
0 otherwise
P( ̂y ≠ y)
Loss =
P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)
Loss =
Sebastian Raschka STAT 479: Machine Learning FS 2018 46
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias = { 1 if y ≠ ¯̂y
0 otherwise
P( ̂y ≠ y)
Loss =
P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)
Loss =
Loss = Bias - Variance
Bias-Variance Decomposition of 0-1 Loss
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
0-1 Loss
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
Bias = { 1 if y ≠ ¯̂y
0 otherwise
P( ̂y ≠ y)
Loss =
P( ̂y ≠ y) = 1 − P( ̂y = y) = 1 − P( ̂y ≠ ¯̂y)
Loss =
Loss = Bias - Variance
Variance can improve loss!!
Why is that so?
Sebastian Raschka STAT 479: Machine Learning FS 2018 48
Statistical Bias vs "Machine Learning Bias"
• "Machine learning bias" sometimes also called "inductive bias"
• e.g., decision tree algorithms consider small trees before they consider large trees
(if training data can be classified by small tree, large trees are not considered)
Hypothesis Space
Entire hypothesis space
Hypothesis space a particular learning algorithm can sample Hypothesis space
a particular learning algorithm category has access to
Particular hypothesis
(From Lecture 1)
e.g,. decision tree + KNN
e.g,. decision tree
e.g,. ID3
Sebastian Raschka STAT 479: Machine Learning FS 2018 50
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.
Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.
0 2 4 6 8 10 12 14
0 2 4 6 8 10 12 14
x2
x1 class c0
class c1
0 2 4 6 8 10 12 14
0 2 4 6 8 10 12 14
x1
• simulation on 200 training sets with 200 examples each (0-1 labels)
• 200 hypotheses
• test set: 22,801 examples (1 data point for each grid point)
• mean error rate is 536 errors (out of the 22,801 test examples)
• 297 as a result of bias
• 239 as a result of variance
(remember that trees use a
"staircase" to approximate
diagonal boundaries)
Sebastian Raschka STAT 479: Machine Learning FS 2018 52
Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.
errors due to bias: 1788
errors due to variance:1046
Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.
errors due to bias: 1788
errors due to variance: 1046
why?
Sebastian Raschka STAT 479: Machine Learning FS 2018 54
Bias-Variance Simulation of C 4.5
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State University.
errors due to bias: 0
errors due to variance: 17
Bias-Variance Simulation of C 4.5
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
0 5 10 15 20 25 30 35 40 45
0 5 10 15 20
Loss (%)
k
L B V Vu Vb
0 2 4 6 8 10 12 14
0 5 10 15 20
Loss (%)
L B V Vu Vb
L = Loss B = Bias
V = (net) Variance
Sebastian Raschka STAT 479: Machine Learning FS 2018 56
Recommended Reading Resources for Bias-Decomposition
or more precisely
Dietterich, T. G., & Kong, E. B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical report, Department of Computer Science, Oregon State
University.
Domingos, P. (2000). A unified bias-variance decomposition. In Proceedings of 17th International Conference on Machine Learning (pp. 231-238).
0-1 loss
includes noise
and more general: Loss = Bias + c Variance c
1N(x) + B(x) + c
2V(x)
c
1= c
2= 1
where, e.g., for squared loss
Model Eval Lectures
Basics
Bias and Variance
Overfitting and Underfitting Holdout method
Confidence Intervals
Resampling methods Repeated holdout
Empirical confidence intervals Cross-Validation
Hyperparameter tuning Model selection
Algorithm Selection Statistical Tests
Evaluation Metrics
Next Lecture