• No results found

Lecture 13: Validation

N/A
N/A
Protected

Academic year: 2021

Share "Lecture 13: Validation"

Copied!
12
0
0

Loading.... (view fulltext now)

Full text

(1)

Lecture 13: Validation

g

Motivation

g

The Holdout

g

Re-sampling techniques

g

Three-way data splits

(2)

Motivation

g

Validation techniques are motivated by two fundamental

problems in pattern recognition: model selection and

performance estimation

g

Model selection

n Almost invariably, all pattern recognition techniques have one or more

free parameters

g The number of neighbors in a kNN classification rule

g The network size, learning parameters and weights in MLPs

n How do we select the “optimal” parameter(s) or model for a given

classification problem?

g

Performance estimation

n Once we have chosen a model, how do we estimate its performance?

g Performance is typically measured by the TRUE ERROR RATE, the

(3)

Motivation

g

If we had access to an unlimited number of examples these

questions have a straightforward answer

n Choose the model that provides the lowest error rate on the entire

population and, of course, that error rate is the true error rate

g

In real applications we only have access to a finite set of

examples, usually smaller than we wanted

n One approach is to use the entire training data to select our classifier

and estimate the error rate

g This naïve approach has two fundamental problems

n The final model will normally overfit the training data

g This problem is more pronounced with models that have a large number of

parameters

n The error rate estimate will be overly optimistic (lower than the true error rate) g In fact, it is not uncommon to have 100% correct classification on training

data

n A much better approach is to split the training data into disjoint subsets:

(4)

The holdout method

g Split dataset into two groups

n Training set: used to train the classifier

n Test set: used to estimate the error rate of the trained classifier

g A typical application the holdout method is determining a stopping

point for the back propagation error

Training Set Test Set

Total number of examples

Epochs MSE

Training set error Test set error Stopping point

(5)

The holdout method

g The holdout method has two basic drawbacks

n In problems where we have a sparse dataset we may not be able to afford the

“luxury” of setting aside a portion of the dataset for testing

n Since it is a single train-and-test experiment, the holdout estimate of error rate

will be misleading if we happen to get an “unfortunate” split

g The limitations of the holdout can be overcome with a family of

resampling methods at the expense of more computations

n Cross Validation

g Random Subsampling g K-Fold Cross-Validation

g Leave-one-out Cross-Validation

(6)

Random Subsampling

g

Random Subsampling performs K data splits of the dataset

n Each split randomly selects a (fixed) no. examples without replacement n For each data split we retrain the classifier from scratch with the training

examples and estimate Ei with the test examples

g

The true error estimate is obtained as the average of the

separate estimates E

i

n This estimate is significantly better than the holdout estimate Total number of examples

Experiment 1 Experiment 2 Experiment 3 Test example

= = K 1 i i E K 1 E

(7)

K-Fold Cross-validation

g

Create a K-fold partition of the the dataset

n For each of K experiments, use K-1 folds for training and the remaining

one for testing

g

K-Fold Cross validation is similar to Random Subsampling

n The advantage of K-Fold Cross validation is that all the examples in the

dataset are eventually used for both training and testing

g

As before, the true error is estimated as the average error rate

Total number of examples

Experiment 1 Experiment 2 Experiment 3 Test examples Experiment 4

K 1

(8)

Leave-one-out Cross Validation

g

Leave-one-out is the degenerate case of K-Fold Cross

Validation, where K is chosen as the total number of examples

n For a dataset with N examples, perform N experiments

n For each experiment use N-1 examples for training and the remaining

example for testing

g

As usual, the true error is estimated as the average error rate on

test examples

= = N 1 i i E N 1 E

Total number of examples

Experiment 1 Experiment 2 Experiment 3

Experiment N

(9)

How many folds are needed?

g

With a large number of folds

+ The bias of the true error rate estimator will be small (the estimator will be very accurate)

- The variance of the true error rate estimator will be large

- The computational time will be very large as well (many experiments)

g

With a small number of folds

+ The number of experiments and, therefore, computation time are reduced

+ The variance of the estimator will be small

- The bias of the estimator will be large (conservative or higher than the true error rate)

g

In practice, the choice of the number of folds depends on the

size of the dataset

n For large datasets, even 3-Fold Cross Validation will be quite accurate n For very sparse datasets, we may have to use leave-one-out in order to

train on as many examples as possible

(10)

Three-way data splits

g

If model selection and true error estimates are to be computed

simultaneously, the data needs to be divided into three disjoint

sets

n Training set: a set of examples used for learning: to fit the parameters

of the classifier

g In the MLP case, we would use the training set to find the “optimal” weights

with the back-prop rule

n Validation set: a set of examples used to tune the parameters of of a

classifier

g In the MLP case, we would use the validation set to find the “optimal”

number of hidden units or determine a stopping point for the back propagation algorithm

n Test set: a set of examples used only to assess the performance of a

fully-trained classifier

g In the MLP case, we would use the test to estimate the error rate after we

have chosen the final model (MLP size and actual weights)

g After assessing the final model with the test set, YOU MUST NOT further

(11)

Three-way data splits

g

Why separate test and validation sets?

n The error rate estimate of the final model on validation data will be

biased (smaller than the true error rate) since the validation set is used to select the final model

n After assessing the final model with the test set, YOU MUST NOT tune

the model any further

g

Procedure outline

n This outline assumes a holdout method

g If CV or Bootstrap are used, steps 3 and 4 have to be repeated for each of 1. Divide the available data into training, validation and test set

2. Select architecture and training parameters 3. Train the model using the training set

4. Evaluate the model using the validation set

5. Repeat steps 2 through 4 using different architectures and training parameters 6. Select the best model and train it using data from the training and validation set 7. Assess this final model using the test set

1. Divide the available data into training, validation and test set 2. Select architecture and training parameters

3. Train the model using the training set

4. Evaluate the model using the validation set

5. Repeat steps 2 through 4 using different architectures and training parameters 6. Select the best model and train it using data from the training and validation set 7. Assess this final model using the test set

(12)

Three-way data splits

Σ

Model1 Error1

Σ

Model2 Error2

Σ

Model3 Error3

Σ

Model4 Error4 Min Final Model

Σ

Final Error Test set Training set Validation set Model Selection Error Rate

References

Related documents

Are stress factors (psychosocial, musculoskeletal, infectious and mental) leading to work- related illnesses in occupational therapists. Is working as an occupational therapist a

“mis prácticas políticas como activista lesbiana y feminista queer, mis prácticas pedagógicas como maestra en una escuela donde la pobreza acecha con la muerte a los cuerpos

exceptionalism; (2) scientific evidence does not influence decision making; (3) an expedient view of human rights in individual advocacy; (4) government disinformation; (5) advocacy

The high frequency of TMPRSS2:ERG fusion in prostate cancer and the de- tection of the fusion transcript in early stage cancers and even pre-cancerous lesions [27, 31] has

This study was conducted to examine the effect of performance appraisal communication (i.e., feedback and interactional style) and procedural justice on

• Third Party Software is defined as certain third-party software which is provided along with the Univa Software, and such software is licensed under the license terms located

For the blended learning version of the class, I chose to consolidate the lecture material on the first class meeting of the week and use the second class meeting as an