
4.4 Evaluation of learning methods

4.4.2 Cross-validation

Since in general we have only a limited amount of data for training and testing, the classification procedure sets aside a certain amount for testing and uses the remainder for training. In practical terms, it is common to hold out one-third of the data for testing and use the remaining two-thirds for training. Both the training and the test samples have to be representative subsets of the underlying problem. Therefore, the random sampling has to be done in such a way as to guarantee that each class is properly represented in both the training and the test set. This procedure is called stratification, and we may speak of stratified holdout.

A general way to reduce any bias caused by the particular sample chosen for holdout is to repeat the whole process, training and testing, many times with different random samples. In each iteration a certain proportion of the data is randomly selected for training, possibly with stratification, and the remainder is used for testing. The error rates of the different iterations are averaged to estimate an overall error rate. This is the repeated holdout method of error rate estimation.
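As an illustration, the stratified and repeated holdout described above might be sketched as follows. This is a minimal sketch, not the implementation used in this study: the function names, the one-dimensional features and the nearest-neighbour classifier are all assumptions chosen to keep the example short, and every class is assumed to contain enough instances for the test set to be non-empty.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1/3, rng=None):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        idx = idx[:]                       # copy before shuffling
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

def nearest_neighbour(train_x, train_y, x):
    """Toy classifier (an assumption): label of the closest training point."""
    j = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[j]

def repeated_holdout_error(xs, ys, repeats=10, seed=0):
    """Average the test error over several stratified random holdouts."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        train, test = stratified_holdout(ys, rng=rng)
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        wrong = sum(nearest_neighbour(tx, ty, xs[i]) != ys[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)
```

Any other classifier could be substituted for the toy nearest-neighbour rule; only the way the indices are split and the errors averaged is relevant here.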

In a single holdout procedure, one could consider swapping the roles of the test and training data and averaging the two results, thus reducing the effect of uneven representation in the training and test sets. However, this is only possible with a 50:50 split between training and test data, which is generally not ideal: it is preferable to use more than half of the data for training, even at the expense of test data. A simple variant of this basic technique is a statistical method called cross-validation. In this procedure, one divides the data into a fixed number of folds, or partitions of the dataset. For example, we can divide the data into three approximately equal partitions; each in turn is used for testing while the remaining two are used for training. That is, we use two-thirds of the data for training and one-third for testing, and we repeat the procedure three times so that, in the end, every instance has been used exactly once for testing. This is called three-fold cross-validation, and if stratification is applied as well, it is called stratified three-fold cross-validation. This was only an example: there are several ways of measuring the error rate of a learning scheme on a particular dataset. Two that are particularly important are leave-one-out cross-validation and the bootstrap.
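The fold mechanics can be made concrete with a short sketch. It is only illustrative: the helper names and the toy one-dimensional nearest-neighbour rule are assumptions, stratification is omitted for brevity, and the error is pooled over all instances (which equals the average over folds when the folds have equal size).

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split range(n) into k disjoint folds of approximately equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validation_error(xs, ys, k=3, seed=0):
    """Each fold is used once for testing; the other k - 1 folds train."""
    folds = k_fold_indices(len(xs), k, seed)
    wrong = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        for i in test:
            # Toy classifier (an assumption): copy the label of the
            # closest training instance.
            nearest = min(train, key=lambda j: abs(xs[j] - xs[i]))
            wrong += ys[nearest] != ys[i]
    # Every instance has been tested exactly once.
    return wrong / len(xs)
```

With k = 3 this reproduces the three-fold scheme of the example; k = 10 gives the 10-fold variant in which 90% of the data is used for training in each iteration.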

Leave-one-out

Leave-one-out cross-validation is simply n-fold cross-validation, where n is the number of instances in the dataset. Each instance in turn is left out, and the classifier is trained on all the remaining n − 1 instances. The results of all n judgments are averaged, and that average represents the final error estimate.

This procedure has two advantages. First, the greatest possible amount of data is used for training in each iteration, which presumably increases the chance that the classifier is an accurate one. Second, the procedure is deterministic: no random sampling is involved. There is no point in repeating it, since the same result will be obtained every time.

Nevertheless, this method has a high computational cost, because the entire learning procedure has to be repeated n times, which is very expensive for large datasets. Another disadvantage is that this method cannot be stratified. Stratification involves getting the correct proportion of examples of each class into the test set, and this is impossible when the test set contains only a single example.

However, leave-one-out cross-validation offers a chance of getting the most out of a small dataset, yielding as accurate an estimate as possible.
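The leave-one-out procedure can be sketched in a few lines. This is an illustrative sketch only: the function name and the toy nearest-neighbour rule on one-dimensional data are assumptions, standing in for whatever learning scheme is being evaluated.

```python
def leave_one_out_error(xs, ys):
    """n-fold cross-validation with n = len(xs): each instance is tested once."""
    n = len(xs)
    wrong = 0
    for i in range(n):
        # Train on the remaining n - 1 instances.
        train = [j for j in range(n) if j != i]
        # Toy classifier (an assumption): label of the closest training point.
        nearest = min(train, key=lambda j: abs(xs[j] - xs[i]))
        wrong += ys[nearest] != ys[i]
    return wrong / n
```

No random number generator appears anywhere in the function, which reflects the deterministic nature of the method: calling it twice on the same data returns the same estimate. The loop also makes the cost visible, since the learning step is executed n times.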

A variant of leave-one-out that we have adopted in the current study is the commonly used leave-two-out cross-validation approach, which provides a relatively unbiased estimate of the true generalization performance [39]. In each trial, the observations from one subject from each group are held out for testing, and the observations from the remaining subjects are used to train the classifier.
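One possible reading of leave-two-out is sketched below. Several points are assumptions rather than details taken from this study: there are exactly two groups labelled 0 and 1, each subject contributes a single observation, one subject from each group is held out for testing in each trial, every possible pair is enumerated, and a toy nearest-neighbour rule replaces the real classifier.

```python
from itertools import product

def leave_two_out_error(xs, ys):
    """Hold out one subject per group in each trial; train on the rest."""
    group_a = [i for i, y in enumerate(ys) if y == 0]
    group_b = [i for i, y in enumerate(ys) if y == 1]
    wrong = total = 0
    for a, b in product(group_a, group_b):   # all pairs (an assumption)
        train = [j for j in range(len(xs)) if j not in (a, b)]
        for i in (a, b):
            # Toy classifier (an assumption): label of the closest training point.
            nearest = min(train, key=lambda j: abs(xs[j] - xs[i]))
            wrong += ys[nearest] != ys[i]
            total += 1
    return wrong / total
```

Holding out one subject from each group keeps the class proportions of the training set unchanged from trial to trial, which plain leave-one-out cannot do.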

The bootstrap

The bootstrap is based on the statistical procedure of sampling with replacement. In the preceding methods, whenever a sample was taken from the dataset to form a training or test set, it was drawn without replacement: the same instance, once selected, could not be chosen again. The idea of the bootstrap, instead, is to sample the dataset with replacement to form the training set.

Figure 4.4: An example of ROC curve [35].

A particular variant is called the 0.632 bootstrap, where a dataset of n instances is sampled n times, with replacement, to give another dataset of n instances. Because some instances will almost certainly be drawn more than once, there must be instances in the original dataset that have not been picked, and these can be used to form the test set. We can evaluate the chance that a particular instance is not picked in any of the n draws with replacement. It has a probability 1/n of being picked at each draw and therefore a probability 1 − 1/n of not being picked. Multiplying these probabilities over the n draws gives:

$$p = \left(1 - \frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.368 \qquad (4.42)$$

This is the chance of a particular instance not being picked at all. Therefore, for a reasonably large dataset, the test set will contain about 36.8% of the instances and the training set about 63.2% of them (which is why the method is called the 0.632 bootstrap).
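The 36.8% figure can be checked numerically. The sketch below is illustrative only and its function names are assumptions: it evaluates the probability of never being picked for a given n and draws one bootstrap sample to count the instances that were left out.

```python
import math
import random

def prob_never_picked(n):
    """Chance that a given instance is missed by all n draws with replacement."""
    return (1 - 1 / n) ** n

def bootstrap_split(n, rng):
    """Draw n indices with replacement; the unpicked indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    picked = set(train)
    test = [i for i in range(n) if i not in picked]
    return train, test

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, prob_never_picked(n), math.exp(-1))
    train, test = bootstrap_split(10_000, random.Random(0))
    print("fraction left out:", len(test) / 10_000)
```

The training set always has n entries, but because of the repeats only about 63.2% of the original instances appear in it; the fraction left out approaches 1/e as n grows.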

However, this method usually leads to a pessimistic estimate of the true error rate, because the training set contains only about 63% of the distinct instances, which is not a great deal compared with, for example, the 90% used in 10-fold cross-validation. Often the whole bootstrap procedure is repeated several times and the results are averaged.

In document Analysis of Brain Magnetic Resonance Images: Voxel-Based Morphometry and Pattern Classification Approaches (Page 68-70)