CHAPTER 2 Training neural networks
3.4. The use of the validation set
For simplicity, the discussion will consider the case of a classifier but the same ideas are valid for other types of neural networks.
Let us consider an input space X, a set of classes C and a classifier d defined on X and taking values in C. R*(d) will denote the true misclassification rate of the classifier d. The meaning of R*(d) is the following: using a sample set L drawn from some population, construct the classifier d. Draw another very large (virtually infinite) set of samples from the same population as L and compare the prediction of the classifier d with the correct classification for each sample. The proportion of misclassifications given by d is the value of R*(d).
The size of the sample set is required to be sufficiently large so that the statistical techniques used and various probabilities are meaningful. If the size of the various sets (training, validation, etc.) is not large enough, the estimates given by the statistical techniques will be poor.
A more detailed framework for the true misclassification rate and various validation techniques can be found in [Breiman, 1993]. The most commonly used validation techniques are briefly presented below.
A distinction must be made between computer simulations and real world problems. In the case of a computer simulation, the true misclassification rate can be calculated using the definition. A random number generator is used to construct the sample set. Then, the classifier is built using these data. Another set of samples is obtained using the same distribution and the true misclassification rate can be easily calculated according to the definition.
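The simulation case described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two Gaussian classes, the midpoint-threshold classifier, the sample sizes and all function names are hypothetical choices made for the example, not taken from the text.

```python
import random


def draw_samples(n, rng):
    """Draw n labelled points: class 0 ~ N(0, 1), class 1 ~ N(2, 1).

    Labels alternate so that both classes are always present.
    (The distribution is an assumption made for this sketch.)
    """
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def build_classifier(learning_set):
    """Return d(x): a threshold halfway between the two class means."""
    means = []
    for c in (0, 1):
        xs = [x for x, label in learning_set if label == c]
        means.append(sum(xs) / len(xs))
    threshold = (means[0] + means[1]) / 2.0
    return lambda x: 1 if x > threshold else 0


def misclassification_rate(d, samples):
    """Proportion of samples for which d disagrees with the true class."""
    errors = sum(1 for x, label in samples if d(x) != label)
    return errors / len(samples)


rng = random.Random(0)              # fixed seed, for repeatability only
L = draw_samples(50, rng)           # sample set used to construct d
d = build_classifier(L)
fresh = draw_samples(100_000, rng)  # stands in for the "virtually infinite" set
r_true = misclassification_rate(d, fresh)
print(f"estimate of R*(d) from fresh samples: {r_true:.3f}")
```

Because the fresh samples come from the same generator but were not used to build d, the printed value approximates R*(d) directly from its definition.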
The real world problems can be divided into two types: problems in which the amount of data is virtually infinite and problems in which the amount of data is finite. If the amount of data is very large, the estimate of the true misclassification rate can be calculated again according to the definition with the precaution that
independence of various pieces of data be assured. In many problems though, only a finite set of samples is available, with little possibility of obtaining an additional very large set of correctly classified samples. Because the same data set is then used both to construct and to validate the classifier, the estimate of the true misclassification rate R*(d) is called an internal estimate.
3.4.1 The resubstitution estimate.
A common technique for calculating R*(d) is the resubstitution technique. After the classifier d is constructed, the data used in its construction is fed to its input and the misclassification rate is calculated by comparing the classification given by the classifier to the real class of each input pattern. Thus, the resubstitution estimate is obtained.
The main disadvantage of this estimate of the misclassification rate is the fact that usually, the construction algorithm tends to minimise a value proportional to the difference between the output given by the classifier and the desired target and therefore proportional to the resubstitution estimate. Therefore, this estimate is bound to be overly optimistic. For instance, there are techniques which ensure that all the patterns in the training set are correctly classified. In this case, the resubstitution estimate of the true misclassification rate is zero. It is difficult to accept that this classifier will correctly classify all the patterns drawn from the same distribution.
In conclusion, the resubstitution estimate is not sufficiently accurate for most purposes. It reflects only how good the classifier construction algorithm is and says very little about how the resulting classifier will behave with new data. In a neural framework, the error at the end of training is a measure of the resubstitution estimate, and the dangers of using it as a measure of the performance on new data have long been understood.
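The optimism of the resubstitution estimate can be illustrated with a classifier that memorises its training patterns. The sketch below assumes a one-dimensional two-class Gaussian problem and a nearest-neighbour rule; both are hypothetical choices for the example, as are all names.

```python
import random


def draw_samples(n, rng):
    """Draw n labelled points: class 0 ~ N(0, 1), class 1 ~ N(2, 1).

    (The distribution is an assumption made for this sketch.)
    """
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def nearest_neighbour_classifier(learning_set):
    """Return d(x): the class of the stored pattern closest to x."""
    def d(x):
        _, label = min(learning_set, key=lambda s: abs(s[0] - x))
        return label
    return d


def misclassification_rate(d, samples):
    """Proportion of samples for which d disagrees with the true class."""
    errors = sum(1 for x, label in samples if d(x) != label)
    return errors / len(samples)


rng = random.Random(1)              # fixed seed, for repeatability only
L = draw_samples(100, rng)
d = nearest_neighbour_classifier(L)

# Resubstitution: feed the construction data back into the classifier.
resubstitution = misclassification_rate(d, L)
# Fresh samples from the same distribution approximate R*(d).
fresh_rate = misclassification_rate(d, draw_samples(20_000, rng))

print(f"resubstitution estimate: {resubstitution:.3f}")
print(f"rate on fresh samples:   {fresh_rate:.3f}")
```

Every training pattern is its own nearest neighbour, so the resubstitution estimate is zero, while the rate on fresh samples cannot be: the two classes overlap.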
3.4.2 The test sample estimate.
This method requires the division of the input pattern set into two sets usually called the training set T and the validation set V. The classifier is constructed using the samples in T and the misclassification estimate is calculated using V. This method offers the advantage of using independent samples to construct and to test the classifier. The most important drawback of this method is that the size of the training set is reduced. A frequent but not theoretically justified division of the
available data puts 2/3 of it in the training set and the remaining 1/3 in the validation set ([Breiman, 1993]). In some cases, this loss of information can dramatically affect the resulting classifier. However, the larger the data set, the lower the probability that the samples in the validation set (which are missing from the training set) are important for the construction of the classifier.
Another critical condition of this method is that the training set and the validation set must be drawn from the same distribution, i.e. they must reflect to the same extent the intrinsic properties of the phenomenon ([Breiman, 1993], [Denker, 1987]). For instance, if a function is to be approximated on an interval I and all the training samples are chosen from the first 2/3 of I while the validation samples are chosen from the last 1/3 of the interval, both the resulting model and the test sample estimate can be very poor. The common way to ensure that this representativeness condition is satisfied is to construct the validation set by randomly choosing patterns from the available data.
In conclusion, two conditions are necessary for the test sample estimate to be accurate: i) a (very) large data set and ii) the training set and the validation set to reflect to the same extent the properties of the underlying phenomenon.
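A minimal sketch of the test sample estimate with a random 2/3 to 1/3 division follows. The data generator, the threshold classifier and the function names are assumptions made for illustration only; the random shuffle is what implements the representativeness condition.

```python
import random


def draw_samples(n, rng):
    """Draw n labelled points: class 0 ~ N(0, 1), class 1 ~ N(2, 1).

    (The distribution is an assumption made for this sketch.)
    """
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def build_classifier(training_set):
    """Return d(x): a threshold halfway between the two class means."""
    means = []
    for c in (0, 1):
        xs = [x for x, label in training_set if label == c]
        means.append(sum(xs) / len(xs))
    threshold = (means[0] + means[1]) / 2.0
    return lambda x: 1 if x > threshold else 0


def misclassification_rate(d, samples):
    """Proportion of samples for which d disagrees with the true class."""
    errors = sum(1 for x, label in samples if d(x) != label)
    return errors / len(samples)


def split_learning_set(data, train_fraction, rng):
    """Randomly divide data into a training set T and a validation set V."""
    shuffled = list(data)
    rng.shuffle(shuffled)           # random choice of validation patterns
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(2)              # fixed seed, for repeatability only
data = draw_samples(300, rng)
T, V = split_learning_set(data, 2 / 3, rng)

d = build_classifier(T)             # constructed from T only
test_sample_estimate = misclassification_rate(d, V)
print(f"test sample estimate: {test_sample_estimate:.3f}")
```

The classifier never sees the patterns in V, so the estimate is computed on independent samples, at the price of building d from only 200 of the 300 available patterns.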
3.4.3 The V-fold-cross-validation estimate.
This method requires the division of the available data L into V sets of the same size (or as close as possible): L1, L2, ..., LV. V test sample estimates are calculated, each time using the training set L - Li (L minus Li) and the validation set Li. The V-fold-cross-validation estimate is the arithmetic mean of the V test sample estimates. At the end, the classifier d is constructed using the entire data set L. A variation of this validation method is the "leave-one-out" method. If the data set contains N patterns, one of the patterns will be ignored each time and a classifier will be constructed using the remaining N-1 patterns. Then the ignored pattern will be used as a single-case test, and R*(d) is estimated as the mean of the misclassification estimates over the N cases.
Cross-validation is parsimonious with data. Every sample is used to construct the classifier and every sample is used exactly once in a test sample. The main drawback of this approach is that the process is very tedious. The construction of V classifiers is required and each such construction can be difficult.
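The procedure can be sketched as follows, with leave-one-out obtained by setting the number of folds equal to N. The data generator, the threshold classifier and all function names are hypothetical choices for this example; only the fold logic mirrors the description above.

```python
import random


def draw_samples(n, rng):
    """Draw n labelled points: class 0 ~ N(0, 1), class 1 ~ N(2, 1).

    (The distribution is an assumption made for this sketch.)
    """
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


def build_classifier(training_set):
    """Return d(x): a threshold halfway between the two class means."""
    means = []
    for c in (0, 1):
        xs = [x for x, label in training_set if label == c]
        means.append(sum(xs) / len(xs))
    threshold = (means[0] + means[1]) / 2.0
    return lambda x: 1 if x > threshold else 0


def misclassification_rate(d, samples):
    """Proportion of samples for which d disagrees with the true class."""
    errors = sum(1 for x, label in samples if d(x) != label)
    return errors / len(samples)


def make_folds(data, n_folds, rng):
    """Shuffle data and deal it into n_folds sets of near-equal size."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def cross_validation_estimate(data, n_folds, rng):
    """Arithmetic mean of the n_folds test sample estimates."""
    folds = make_folds(data, n_folds, rng)
    estimates = []
    for i, held_out in enumerate(folds):
        # Training set is L minus Li; Li is the validation set.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        d = build_classifier(training)
        estimates.append(misclassification_rate(d, held_out))
    return sum(estimates) / len(estimates)


rng = random.Random(3)              # fixed seed, for repeatability only
L = draw_samples(60, rng)
cv_estimate = cross_validation_estimate(L, 10, rng)       # V = 10
loo_estimate = cross_validation_estimate(L, len(L), rng)  # leave-one-out
d_final = build_classifier(L)       # final classifier uses all of L
print(f"10-fold: {cv_estimate:.3f}  leave-one-out: {loo_estimate:.3f}")
```

The cost noted above is visible in the loop: one classifier is built per fold, so leave-one-out on N patterns requires N constructions before the final one.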
3.4.4 Conclusions regarding the use of the validation set.
The validation set is useful for obtaining an estimate of the true misclassification rate, which is an indicator of the expected performance of the classifier in normal use after training. The validation set will not be able to distinguish between different weight states which satisfy both the training set and the validation set. These indistinguishable weight states are those which would have been obtained if the validation set had been included in the training set. However, the validation set will emphasise those weight states or training methods which are able to guess some points even if they are not present in the training set. This justifies hopes that the nature of the underlying phenomenon has been embedded into the model built by the network.
If the sample set is small (much smaller than Baum's upper bound of (W/ε) log (M/ε), for instance), each pattern is important and a cross-validation technique will offer the best results. After the misclassification estimate is calculated, a new classifier can be constructed taking all available data into consideration. This should ensure a true misclassification rate not worse than the test sample estimate obtained in the first place.
This conclusion follows from the approach to generalisation presented in [Schwartz, 1990]. Their results show that learning determines "a monotonic increase of the average generalisation ability with increasing m" where m is the number of patterns in the training set. Therefore, the training set must contain as many patterns as possible, and each pattern removed from the training set will negatively affect the generalisation. This is why the final phase of retraining the net with all available data is very important.
If the sample set is large or very large, the individual patterns are far less important. Amari shows in [Amari, 1993] that the average information gain (the average of the logarithm of the probability of correct classification of a new pattern after t patterns have been learned) converges to 0 as d/t where d is the number of modifiable parameters. If t is very large, the information gain brought by an individual pattern is very small and some of them can be taken out from the learning set without damaging the performance of the net. In this case, the test sample estimate is more feasible. Although potentially useful, the final training with all available data can be skipped in this case if the generalisation performance is ensured by some bounds (e.g. Baum's) or is declared satisfactory by the user.
However, in the case of the test sample estimate, precautions must be taken to ensure the training set and the validation set reflect the properties of the phenomenon to the same extent.