• No results found

Model Evaluation 2:

N/A
N/A
Protected

Academic year: 2021

Share "Model Evaluation 2:"

Copied!
51
0
0

Loading.... (view fulltext now)

Full text

(1)

Sebastian Raschka STAT 479: Machine Learning FS 2018

Model Evaluation 2:

Confidence Intervals

Lecture 09

1

STAT 479: Machine Learning, Fall 2018 Sebastian Raschka

http://stat.wisc.edu/~sraschka/teaching/stat479-fs2018/

(2)

Sebastian Raschka STAT 479: Machine Learning FS 2018 2

Model Eval Lectures

Basics

Bias and Variance

Overfitting and Underfitting Holdout method

Confidence Intervals

Resampling methods Repeated holdout

Empirical confidence intervals Cross-Validation

Hyperparameter tuning Model selection

Algorithm Selection Statistical Tests

Evaluation Metrics

This Lecture

Overview

(3)

Sebastian Raschka STAT 479: Machine Learning FS 2018 3

• Concepts first

• (More) Code at the end of the lecture

(4)

Sebastian Raschka STAT 479: Machine Learning FS 2018 4

Main points why we evaluate the predictive performance of a model:

1. Want to estimate the generalization performance, the predictive performance of our model on future (unseen) data.

2. Want to increase the predictive performance by

tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.

3. Want to identify the ML algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best performing model from the algorithm’s

hypothesis space.

(5)

Sebastian Raschka STAT 479: Machine Learning FS 2018 5

• Training set error is an optimistically biased estimator of the generalization error

• Test set error is an unbiased estimator of the generalization error (test sample and hypothesis chosen independently)

• (in practice, it is actually pessimistically biased; why?)

(6)

Sebastian Raschka STAT 479: Machine Learning FS 2018 6

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

...

1 2 3 4 5 6 7 8 9 10

1 2 3 4 5 6 7 8 9 10 Holdout Method

2-Fold Cross-Validation

Repeated Holdout

Training Evaluation

(7)

Sebastian Raschka STAT 479: Machine Learning FS 2018

Often using the holdout method is not a good idea ...

7

(8)

Sebastian Raschka STAT 479: Machine Learning FS 2018

Often using the holdout method is not a good idea ...

8

Test set error as generalization error estimator is pessimistically biased (not so bad)


But it does not account for variance in the training data (bad)

(9)

Sebastian Raschka STAT 479: Machine Learning FS 2018 9

Suppose we have the following ranking based on accuracy:

h

2

: 75% > h

1

: 70% > h

3

: 65%,

we would still rank them the same way if we add a 10% pessimistic bias:

h

2

: 65% > h

1

: 60% > h

3

: 55%.

Why is pessimistic bias not "so bad"?

(10)

Sebastian Raschka STAT 479: Machine Learning FS 2018

Often using the holdout method is not a good idea ...

10

• Test set error as generalization error estimator is pessimistically biased (not so bad)


• Does not account for variance in the training data (bad)

(11)

Sebastian Raschka STAT 479: Machine Learning FS 2018 11

1.3 Resubstitution Validation and the Holdout Method

The holdout method is inarguably the simplest model evaluation technique; it can be summarized as follows. First, we take a labeled dataset and split it into two parts: A training and a test set. Then, we fit a model to the training data and predict the labels of the test set. The fraction of correct predictions, which can be computed by comparing the predicted labels to the ground truth labels of the test set, constitutes our estimate of the model’s prediction accuracy. Here, it is important to note that we do not want to train and evaluate a model on the same training dataset (this is called resubstitution validation or resubstitution evaluation), since it would typically introduce a very optimistic bias due to overfitting. In other words, we cannot tell whether the model simply memorized the training data, or whether it generalizes well to new, unseen data. (On a side note, we can estimate this so-called optimism bias as the difference between the training and test accuracy.)

Typically, the splitting of a dataset into training and test sets is a simple process of random subsam- pling. We assume that all data points have been drawn from the same probability distribution (with respect to each class). And we randomly choose 2/3 of these samples for the training set and 1/3 of the samples for the test set. Note that there are two problems with this approach, which we will discuss in the next sections.

1.4 Stratification

We have to keep in mind that a dataset represents a random sample drawn from a probability distribution, and we typically assume that this sample is representative of the true population – more or less. Now, further subsampling without replacement alters the statistic (mean, proportion, and variance) of the sample. The degree to which subsampling without replacement affects the statistic of a sample is inversely proportional to the size of the sample. Let us have a look at an example using the Iris dataset1, which we randomly divide into 2/3 training data and 1/3 test data as illustrated in Figure 1. (The source code for generating this graphic is available on GitHub2.)

All samples (n = 150)

Training samples (n = 100) Test samples (n = 50)

Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets.

1https://archive.ics.uci.edu/ml/datasets/iris

2https://github.com/rasbt/model-eval-article-supplementary/blob/master/code/iris-random-dist.ipynb

6

Issues with Subsampling (Independence violation)

The Iris dataset consists of 50 Setosa, 50 Versicolor, and 50 Virginica flowers; the flower species are distributed uniformly:

33.3% Setosa

33.3% Versicolor

33.3% Virginia

If our random function assigns 2/3 of the flowers (100) to the training set and 1/3 of the flowers (50) to the test set, it may yield the following:

training set → 38 x Setosa, 28 x Versicolor, 34 x Virginica

test set → 12 x Setosa, 22 x Versicolor, 16 x Virginica

(12)

Sebastian Raschka STAT 479: Machine Learning FS 2018 12

Learning Algorithm Hyperparameter

Values

Model

Prediction

Test Labels

Performance Model

Learning Algorithm Hyperparameter

Values Final

Model

2

3

4 1

Test Labels Test Data

Training Labels Data

Labels

Data Labels Training Data Training Labels

Test Data

Holdout evaluation

(13)

Sebastian Raschka STAT 479: Machine Learning FS 2018 13

Can we use the holdout method for model selection?

(14)

Sebastian Raschka STAT 479: Machine Learning FS 2018 14

Holdout validation (hyperparam. tuning)

2

1 LabelsData Validation

Data Validation

Labels Test Data Test Labels

Training Labels

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Best Model Learning Algorithm

Hyperparameter values

Model

Hyperparameter values

Hyperparameter values

Model

Model Training Data

Training Labels

Learning Algorithm

Best Hyperparameter

Values Final

Model

6 Data

Labels

3

Best Hyperparameter

values

Prediction

Test Labels

Performance Model

4

Test Data

Learning Algorithm

Best Hyperparameter

Values

Model Training Data

Training Labels

5

Validation Data Validation

Labels

(15)

Sebastian Raschka STAT 479: Machine Learning FS 2018 15

Holdout validation (hyperparam. tuning)

2

1

LabelsData

Training Data

Validation Data Validation

Labels Test Data

Test Labels

Training Labels

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Best Model Learning

Algorithm

Hyperparameter values

Model

Hyperparameter values

Hyperparameter values

Model

Model

Training Data Training Labels

Learning Algorithm

Best

Hyperparameter

Values Final

Model

6

Data

Labels

3

Best Hyperparameter

values

Prediction

Test Labels

Performance Model

4

Test Data

Learning Algorithm

Best

Hyperparameter Values

Model

Training Data Training Labels

5

Validation Data Validation

Labels

(16)

Sebastian Raschka STAT 479: Machine Learning FS 2018 16

Holdout validation (hyperparam. tuning)

2

1

LabelsData

Training Data

Validation Data Validation

Labels Test Data

Test Labels

Training Labels

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Best Model Learning

Algorithm

Hyperparameter values

Model

Hyperparameter values

Hyperparameter values

Model

Model

Training Data Training Labels

Learning Algorithm

Best

Hyperparameter

Values Final

Model

6

Data

Labels

3

Best Hyperparameter

values

Prediction

Test Labels

Performance Model

4

Test Data

Learning Algorithm

Best

Hyperparameter Values

Model

Training Data Training Labels

5

Validation Data Validation

Labels

(17)

Sebastian Raschka STAT 479: Machine Learning FS 2018 17

Holdout validation (hyperparam. tuning)

2

1

LabelsData

Training Data

Validation Data Validation

Labels Test Data

Test Labels

Training Labels

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Performance Model

Validation Data

Validation Labels Prediction

Best Model Learning

Algorithm

Hyperparameter values

Model

Hyperparameter values

Hyperparameter values

Model

Model

Training Data Training Labels

Learning Algorithm

Best

Hyperparameter

Values Final

Model

6

Data

Labels

3

Best Hyperparameter

values

Prediction

Test Labels

Performance Model

4

Test Data

Learning Algorithm

Best

Hyperparameter Values

Model

Training Data Training Labels

5

Validation Data Validation

Labels

(18)

Sebastian Raschka STAT 479: Machine Learning FS 2018 18

Cross-Validation is generally better

... but ...

Bengio, Y., & Grandvalet, Y. (2004). No unbiased estimator of the variance

of k-fold cross-validation. Journal of machine learning research, 5(Sep),

1089-1105.

(19)

Sebastian Raschka STAT 479: Machine Learning FS 2018 19

𝒩(μ, σ 2 )

Bias of Estimators Example

Normal Distribution:

f(x

[i]

; μ, σ

2

) = 1

2πσ

2

exp( − 1

2 (x

[i]

− μ)

2

σ

2

)

Probability density function:

Is the sample mean an unbiased estimator of the mean of the Gaussian?

̂μ = 1

n ∑

i

x

[i]

(20)

Sebastian Raschka STAT 479: Machine Learning FS 2018 20

Is the sample mean an unbiased estimator of the mean of the Gaussian?

̂μ = 1

n ∑

i

x

[i]

Bias( ̂μ) = E[ ̂μ] − μ

= E[ 1 n ∑

i

x

[i]

] − μ

= 1 n ∑

i

E[x

[i]

] − μ

= 1 n ∑

i

μ − μ

= μ − μ = 0

Bias of Estimators Example

(21)

Sebastian Raschka STAT 479: Machine Learning FS 2018 21

̂σ

2

= 1

n ∑

i

(x

[i]

− ̂μ)

2

Bias( ̂σ

2

) = E[ ̂σ

2

] − σ

2

= E[ 1

n ∑

i

(x

[i]

− ̂μ)

2

] − σ

2

= . . .

= m − 1

m σ

2

− σ

2

Is the sample variance an unbiased estimator of the mean of the Gaussian?

Bias of Estimators Example

(22)

Sebastian Raschka STAT 479: Machine Learning FS 2018 22

̂σ

2

= 1

n ∑

i

(x

[i]

− ̂μ)

2

Bias( ̂σ

2

) = E[ ̂σ

2

] − σ

2

= E[ 1

n ∑

i

(x

[i]

− ̂μ)

2

] − σ

2

= . . .

= m − 1

m σ

2

− σ

2

̂σ

′2

= 1

n − 1 ∑

i

(x

[i]

− ̂μ)

2

The unbiased estimator is actually

Is the sample variance an unbiased estimator of the mean of the Gaussian?

Bias of Estimators Example

(23)

Sebastian Raschka STAT 479: Machine Learning FS 2018 23

Model Eval Lectures

Basics

Bias and Variance

Overfitting and Underfitting Holdout method

Confidence Intervals

Resampling methods Repeated holdout

Empirical confidence intervals Cross-Validation

Hyperparameter tuning Model selection

Algorithm Selection Statistical Tests

Evaluation Metrics

(24)

Sebastian Raschka STAT 479: Machine Learning FS 2018 24

(Image credit: Screenshot from https://en.wikipedia.org/wiki/Binomial_distribution_)

Binomial distribution

(25)

Sebastian Raschka STAT 479: Machine Learning FS 2018 25

• coin lands on head ("success")

• probability of success p

ERR

S

(h) =

1n

x∈S

δ( f(x), h(x)) ERR

𝒟

(h) = Pr

x∈𝒟

[ f(x) ≠ h(x)]

true error

Coin Flip (Bernoulli Trial) 0-1 Loss

• example misclassified (0-1 loss) Pr(k) = n!

k!(n − k)! p

k

(1 − p)

n−k

.

Binomial distribution

• sample (test set) error

• , estimator of k

n p

• mean, number of successes

μ

k

= np

(26)

Sebastian Raschka STAT 479: Machine Learning FS 2018 26

Coin Flip (Bernoulli Trial) 0-1 Loss

Pr(k) = n!

k!(n − k)! p

k

(1 − p)

n−k

.

Binomial distribution

• mean, number of successes μ

k

= np

variance

σ

k

= np(1 − p)

• standard deviation

σ

k2

= np(1 − p)

ERR

S

(h) =

1n

x∈S

δ( f(x), h(x))

σ

ERRS(h)

= σ

k

n = np(1 − p)

n = p(1 − p) n

We are interested in proportions!

σ

ERRS(h)

ERR

S

(h)(1 − ERR

S

(h))

n

(27)

Sebastian Raschka STAT 479: Machine Learning FS 2018 27

Confidence Intervals

By definition, a XX% confidence interval of some

parameter p is an interval that is expected to contain p

with probability XX%

(28)

Sebastian Raschka STAT 479: Machine Learning FS 2018 28

Normal Approximation

• Less tedious than confidence interval for Binomial distribution and hence often used in (ML) practice for large n

Rule of thumb: if n larger than 40, the Binomial distribution can be reasonably approximated by a Normal distribution; and np and n(1 - p) should be greater than 5

CI = ERR

S

(h) ± z ERR

S

(h)(1 − ERR

S

(h))

n

(29)

Sebastian Raschka STAT 479: Machine Learning FS 2018 29 The z constant for different confidence intervals:

99%: z=2.58

95%: z=1.96

90%: z=1.64

(30)

Sebastian Raschka STAT 479: Machine Learning FS 2018 30

Model Eval Lectures

Basics

Bias and Variance

Overfitting and Underfitting Holdout method

Confidence Intervals

Resampling methods Repeated holdout

Empirical confidence intervals Cross-Validation

Hyperparameter tuning Model selection

Algorithm Selection Statistical Tests

Evaluation Metrics

(31)

Sebastian Raschka STAT 479: Machine Learning FS 2018 31

Proportionally large test sets increase the pessimistic bias if a model has not reached its full capacity, yet.

To produce the plot above, I took 500 random samples of each of the ten classes from MNIST

The sample was then randomly divided into a 3500-example training subset and a test set (1500 examples) via stratification.

Even smaller subsets of the 3500-sample training set were produced via randomized,

stratified splits, and I used these subsets to fit softmax classifiers and used the same 1500- sample test set to evaluate their performances; samples may overlap between these training subsets.

(32)

Sebastian Raschka STAT 479: Machine Learning FS 2018 32

Decreasing the size of the test set brings up another problem:

It may result in a substantial variance increase of our model’s performance estimate.

Dataset DistributionSample 1Sample 2Sample 3

Train (70%)

Test (30%) Train

(70%)

Test (30%)

n=1000 n=100

Real World Distribution

Resampling

The reason is that it depends on which instances end up in training set, and which particular instances end up in test set.

Keeping in mind that each time we resample our data, we alter the statistics of the distribution of the sample.

Here, I repeatedly subsampled a two-dimensional Gaussian

(33)

Sebastian Raschka STAT 479: Machine Learning FS 2018 33

Repeated Holdout: Estimate Model Stability

ACC

avg

= 1 k

k

j=1

ACC

j

,

ACC

j

= 1 − 1 n

n

i=1

L(h(x

[i]

), f(x

[i]

)) . Average performance over k repetitions

where ACC

j

is the accuracy estimate of the jth test set of size m,

(also called Monte Carlo Cross-Validation)

(34)

Sebastian Raschka STAT 479: Machine Learning FS 2018 34

Repeated Holdout: Estimate Model Stability

50/50 Train-Test Split Avg. Acc. 0.95

90/10 Train-Test Split Avg. Acc. 0.96

Left: I performed 50 stratified training/test splits with 75 samples in the test and training set each; a K- nearest neighbors model was fit to the training set and evaluated on the test set in each repetition.

Right: Here, I repeatedly performed 90/10 splits, though, so that the test set consisted of only 15 samples.

How repeated holdout validation may look like for different training- test split using the Iris dataset to fit to 3-nearest neighbors

classifiers:

(35)

Sebastian Raschka STAT 479: Machine Learning FS 2018 35

The Bootstrap Method and Empirical Confidence Intervals

Circa 1900, to pull (oneself) up by (one’s) bootstraps was used figuratively of an

impossible task (Among the “practical questions” at the end of chapter one of Steele’s

“Popular Physics” schoolbook (1888) is, “30. Why can not a man lift himself by pulling up on his boot-straps?”). By 1916 its meaning expanded to include “better oneself by rigorous, unaided effort.” The meaning “fixed sequence of instructions to load the

operating system of a computer” (1953) is from the notion of the first-loaded program pulling itself, and the rest, up by the bootstrap.

(Source: Online Etymology Dictionary)

(36)

Sebastian Raschka STAT 479: Machine Learning FS 2018 36

The Bootstrap Method and Empirical Confidence Intervals

• The bootstrap method is a resampling technique for estimating a sampling distribution

• Here, we are particularly interested in estimating the uncertainty of our performance estimate

• The bootstrap method was introduced by Bradley Efron in 1979 [1]

• About 15 years later, Bradley Efron and Robert Tibshirani even devoted a whole book to the bootstrap, “An Introduction to the Bootstrap” [2]

• In brief, the idea of the bootstrap method is to generate new data from a population by repeated sampling from the original dataset with replacement — in contrast, the repeated holdout method can be understood as sampling without replacement.

[1] Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics 7 (1).

Institute of Mathematical Statistics: 1–26. doi:10.1214/aos/1176344552.

[2] Efron, Bradley, and Robert Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall.

(37)

Sebastian Raschka STAT 479: Machine Learning FS 2018 37

The Bootstrap Method and Empirical Confidence Intervals

1. We are given a dataset of size n.

2. For b bootstrap rounds:

1. We draw one single instance from this dataset and assign it to our jth bootstrap sample. We repeat this step until our

bootstrap sample has size n (the size of the original dataset).

Each time, we draw samples from the same original dataset so that certain samples may appear more than once in our bootstrap sample and some not at all.

3. We fit a model to each of the b bootstrap samples and compute the resubstitution accuracy.

4. We compute the model accuracy as the average over the b accuracy estimates

ACC

boot

= 1 b

b

j=1

1 n

n

i=1

(1 − L(h(x

[i]

), f(x

[i]

)).

(38)

Sebastian Raschka STAT 479: Machine Learning FS 2018 38

The Bootstrap Method and Empirical Confidence Intervals

• As we discussed previously, the resubstitution accuracy usually leads to an extremely optimistic bias, since a model can be overly sensible to noise in a dataset.

• Originally, the bootstrap method aims to determine the statistical properties of an estimator when the underlying distribution was unknown and additional samples are not available.

• So, in order to exploit this method for the evaluation of predictive models, such as hypotheses for

classification and regression, we may prefer a slightly different approach to bootstrapping using the so- called Leave-One-Out Bootstrap (LOOB) technique.

• Here, we use out-of-bag samples as test sets for evaluation instead of evaluating the model on the training data. Out-of-bag samples are the unique sets of instances that are not used for model fitting

(39)

Sebastian Raschka STAT 479: Machine Learning FS 2018 39

x

1

x

1

x

1

x

1

x

1

x

1

x

2

x

3

x

4

x

5

x

6

x

7

x

8

x

9

x

10

x

2

x

2

x

2

x

2

x

8

x

8

x

3

x

7

x

10

x

6

x

9

x

3

x

7

x

8

x

10

x

2

x

2

x

6

x

9

x

6

x

5

x

4

x

4

x

10

x

3

x

5

x

7

x

4

x

8

x

4

x

5

x

6

x

8

x

9

Training Sets Test Sets

Bootstrap 1 Bootstrap 2 Bootstrap 3 Original Dataset

Bootstrap Sampling

Out-of-bag samples

(40)

Sebastian Raschka STAT 479: Machine Learning FS 2018 40

The Bootstrap Method and Empirical Confidence Intervals

We can compute the 95% confidence interval of the bootstrap estimate as

ACC

boot

= 1 b

b

i=1

ACC

i

and use it to compute the standard error

(In practice, at least 200 bootstrap rounds are recommended) SE

boot

= 1

b − 1

b

i=1

(ACC

i

− ACC

boot

)

2

.

Finally, we can then compute the confidence interval around the mean estimate as

ACC

boot

± t × SE

boot

.

For instance, given a sample with n=100, we find that

t

95

= 1.984

(41)

Sebastian Raschka STAT 479: Machine Learning FS 2018 41

The Bootstrap Method and Empirical Confidence Intervals

And if our samples do not follow a normal distribution? A more robust, yet

computationally straight-forward approach is the percentile method as described by B. Efron (Efron, 1981). Here, we pick our lower and upper confidence bounds as

follows:

ACC

lower

= α

1

th percentile of the ACC

boot

distribution

ACC

upper

= α

1

th percentile of the ACC

boot

distribution

where and and is our degree of confidence to compute 


the confidence interval.

α

1

= α α

2

= 1 − α α 100 × (1 − 2 × α)

For instance, to compute a 95% confidence interval, we pick 


to obtain the 2.5th and 97.5th percentiles of the b bootstrap samples distribution as our upper and lower confidence bounds.

α = 0.025

(42)

Sebastian Raschka STAT 479: Machine Learning FS 2018 42

The Bootstrap Method and Empirical Confidence Intervals

In the left subplot, I applied the Leave-One-Out Bootstrap technique to evaluate 3-nearest neighbors models on Iris, and the right subplot shows the results of the same model evaluation approach on MNIST, using the same softmax algorithm as mentioned earlier.

(43)

Sebastian Raschka STAT 479: Machine Learning FS 2018 43

The 0.632 Bootstrap Method

In 1983, Bradley Efron described the .632 Estimate, a further improvement to address the pessimistic bias of the bootstrap [1].

The pessimistic bias in the “classic” bootstrap method can be attributed to the fact that the bootstrap samples only contain approximately 63.2% of the unique samples from the original dataset.

[1] Efron, Bradley. 1983. “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation.” Journal of the American Statistical Association 78 (382): 316. doi:10.2307/2288636.

(44)

Sebastian Raschka STAT 479: Machine Learning FS 2018 44

P(not chosen) = (1 − 1 n )

n , 1

e ≈ 0.368, n → ∞ .

Bootstrap Sampling

(45)

Sebastian Raschka STAT 479: Machine Learning FS 2018 45

P(not chosen) = (1 − 1 n )

n , 1

e ≈ 0.368, n → ∞ . P(chosen) = 1 − (1 − 1

n )

n ≈ 0.632

(46)

Sebastian Raschka STAT 479: Machine Learning FS 2018 46

The .632 Bootstrap Method

[1] Efron, Bradley. 1983. “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation.” Journal of the American Statistical Association 78 (382): 316. doi:10.2307/2288636.

The .632 Estimate, is computed via the following equation:

ACC

boot

= 1 b

b

i=1

(0.632 ⋅ ACC

h,i

+ 0.368 ⋅ ACC

r,i

),

ACC

r,i

ACC

h,i

where

is the resubstitution accuracy

is the accuracy on the out-of-bag sample.

(47)

Sebastian Raschka STAT 479: Machine Learning FS 2018 47

The .632+ Bootstrap Method

Now, while the .632 Boostrap attempts to address the pessimistic bias of the estimate, an optimistic bias may occur with models that tend to overfit so that Bradley Efron and Robert Tibshirani proposed the The .632+ Bootstrap Method [1].


Instead of using a fixed “weight”

ω = 0.632 ACCboot = 1

b

b

i=1

(ω ⋅ ACC

h,i

+ (1 − ω) ⋅ ACC

r,i

),

in

we compute the weight as

ω = 0.632

1 − 0.368 × R ,

where R is the relative overfitting rate

R = (−1) × (ACC

h,i

− ACC

r,i

) γ − (1 − ACC

h,i

) .

[1] Efron, Bradley, and Robert Tibshirani. 1997. “Improvements on Cross-Validation: The .632+ Bootstrap Method.”

Journal of the American Statistical Association 92 (438): 548. doi:10.2307/2965703.

(48)

Sebastian Raschka STAT 479: Machine Learning FS 2018 48

The .632+ Bootstrap Method

R is the relative overfitting rate

R = (−1) × (ACC

h,i

− ACC

r,i

) γ − (1 − ACC

h,i

) .

[1] Efron, Bradley, and Robert Tibshirani. 1997. “Improvements on Cross-Validation: The .632+ Bootstrap Method.”

Journal of the American Statistical Association 92 (438): 548. doi:10.2307/2965703.

Now, we need to determine the no-information rate γ in order to compute R.

For instance, we can compute γ

by fitting a model to a dataset that contains all possible combinations between the examples and target class labels:

γ = 1 n

2

n

i=1 n

i′=1

(1 − L(h(x

[i]

), f(x

[i]

)) .

(49)

Sebastian Raschka STAT 479: Machine Learning FS 2018 49

The .632+ Bootstrap Method

R is the relative overfitting rate

R = (−1) × (ACC

h,i

− ACC

r,i

) γ − (1 − ACC

h,i

) .

[1] Efron, Bradley, and Robert Tibshirani. 1997. “Improvements on Cross-Validation: The .632+ Bootstrap Method.”

Journal of the American Statistical Association 92 (438): 548. doi:10.2307/2965703.

Now, we need to determine the no-information rate γ in order to compute R.

For instance, we can compute γ

by fitting a model to a dataset that contains all possible combinations between the examples and target class labels:

γ = 1 n

2

n

i=1 n

i′=1

(1 − L(h(x

[i]

), f(x

[i]

)) .

(50)

Sebastian Raschka STAT 479: Machine Learning FS 2018 50

Code Examples

https://github.com/rasbt/stat479-machine-learning-fs18/

blob/master/09_eval-ci/09_eval-ci_code.ipynb

(51)

Sebastian Raschka STAT 479: Machine Learning FS 2018 51

Model Eval Lectures

Basics

Bias and Variance

Overfitting and Underfitting Holdout method

Confidence Intervals

Resampling methods Repeated holdout

Empirical confidence intervals Cross-Validation

Hyperparameter tuning Model selection

Algorithm Selection Statistical Tests

Evaluation Metrics

Next Lecture

Overview

References

Related documents

The Asia Tour: Tokyo Dental College (Japan), Korean Dental Society (Korea), National Kaohsiung Medical College, National Tainan Medical College (Republic of China), Hong

There are eight government agencies directly involved in the area of health and care and public health: the National Board of Health and Welfare, the Medical Responsibility Board

The Lithuanian authorities are invited to consider acceding to the Optional Protocol to the United Nations Convention against Torture (paragraph 8). XII-630 of 3

The course gives students the opportunity to develop skills and knowledge in the key policy areas of spatial planning, local and regional economic development, site planning and

Reporting. 1990 The Ecosystem Approach in Anthropology: From Concept to Practice. Ann Arbor: University of Michigan Press. 1984a The Ecosystem Concept in

Insurance Absolute Health Europe Southern Cross and Travel Insurance • Student Essentials. • Well Being

A number of samples were collected for analysis from Thorn Rock sites in 2007, 2011 and 2015 and identified as unknown Phorbas species, and it initially appeared that there were

stop 0.3-5 mU/l <0.3(-0.5) mU/l T4L, T3 hyperthyroïdie infraclinique non-thyroidal illness hyperthyroïdie normales élevées scintigraphie captation élevée captation basse