Multilayer nets and backpropagation
6.10 Fostering generalization
6.10.1 Using validation sets
Consider again networks with too many hidden units like those associated with the left hand side of Figures 6.14 and 6.15. The diagrams show the decision surface and function respectively after exhaustive training, but what form do these take in the early stages of learning? It is reasonable to suppose that the smoother forms (indicated on the right hand side of the respective figures) or something like them may be developed as intermediates at this time. If we then curtail the training at a suitable stage it may be possible to "freeze" the net in a form suitable for generalization.
A striking graphical demonstration that this is indeed what happens in pattern space was provided by Rosin & Fierens (1995). They trained a net on two classes in a pattern space in 2D (to allow easy visualization), each of which consisted of a circularly symmetric cluster of dots in the plane. The two classes overlapped and so the error-free decision boundary for the training set was highly convoluted. However, in terms of the underlying statistical distributions from which the training set was drawn they should be thought of as being separated optimally by a straight line. On training, a straight line did emerge at first as the decision boundary but, later, this became highly convoluted and was not a useful reflection of the true situation.
How, then, are we to know when to stop training the net? One approach is to divide the available training data into two sets: a training set proper T, and a so-called validation set V. The idea is to train the net in the normal way with T but, every so often, to determine the error with respect to the validation set V. This process is referred to as cross-validation and typical network behaviour is shown in Figure 6.16. One criterion for stopping training, therefore, is to do so when the validation error reaches a minimum, for then generalization with respect to the unseen patterns of V is optimal. Cross-validation is a technique borrowed from regression analysis in statistics and has a long history (Stone 1974). That such a technique should find its way into the "toolkit" of supervised training in feedforward neural networks should not be surprising because of the similarities between the two fields. Thus, feedforward nets are performing a smooth function fit to some data, a process that can be thought of as a kind of nonlinear regression. These similarities are explored further in the review article by Cheng & Titterington (1994).
Figure 6.16 Cross-validation behaviour.
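The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than a prescribed procedure: the names `train_with_early_stopping`, `train_epoch`, `validation_error`, `check_every` and `patience` are hypothetical, the training pass over T and the error measure on V are supplied by the caller, and the `patience` rule (stop after several successive checks without improvement) is just one simple way of deciding that the validation error has passed its minimum.

```python
import copy
from typing import Callable, Tuple, TypeVar

Params = TypeVar("Params")


def train_with_early_stopping(
    params: Params,
    train_epoch: Callable[[Params], Params],
    validation_error: Callable[[Params], float],
    max_epochs: int = 1000,
    check_every: int = 5,
    patience: int = 3,
) -> Tuple[Params, float, int]:
    """Train on T, monitor the error on V, and return the weights
    recorded at the lowest validation error seen so far."""
    best_params = copy.deepcopy(params)
    best_error = validation_error(params)
    best_epoch = 0
    checks_without_improvement = 0

    for epoch in range(1, max_epochs + 1):
        # One ordinary training pass over T (assumed to return the new weights).
        params = train_epoch(params)
        if epoch % check_every != 0:
            continue

        # V is only used for measurement, never for weight updates.
        error = validation_error(params)
        if error < best_error:
            best_params = copy.deepcopy(params)
            best_error = error
            best_epoch = epoch
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                break  # the validation error appears to have passed its minimum

    return best_params, best_error, best_epoch
```

The function returns the weights stored at the validation minimum rather than the final weights, which corresponds to "freezing" the net at the point where generalization with respect to V was best.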
6.10.2 Adequate training set size
If, in Figures 6.14 and 6.15, the test data had originally been part of the training set, then they would have forced the network to classify them properly. The problem in the original nets is that they were underconstrained by the training data. In particular, if there are too few patterns near the decision surface then this may be allowed to acquire spurious convolutions. If, however, pattern space is filled to a sufficient density with training data, there will be no regions of "indecision" so that, given a sufficiently large training set of size N, generalization can be guaranteed. Such a result has been established for single output nets by Baum & Haussler (1989), who showed that the required value of N increases with the number of weights W and the number of hidden units H, and also increases as the tolerated fraction f of misclassified training patterns is made smaller; in this sense W and H are the network "degrees of freedom". The only problem here is that, although this result provides theoretical lower bounds on the size of N, it is often unrealistic to use sets of this size. For example, for a single output net with ten hidden units, each of 30 weights, and with f=0.01 (1 per cent misclassified in the training set) N is more than 10.2 million; increasing f to 0.1 reduces this to 0.8 million but, either way, these are usually not realizable in practice. Baum and Haussler's result has subsequently been extended and sharpened for a special class of nets with internal symmetries by Shawe-Taylor (1992).
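To get a feel for how quickly the required training set size grows, the following sketch evaluates a commonly quoted form of the Baum & Haussler bound, N ≥ (32W/f) ln(32M/f), where M is the number of nodes. Several things here are assumptions and not taken from the text: the exact form of the bound, the use of the number of hidden units H in place of M, and the function name `sample_size_bound`. The text also does not say exactly how weights and nodes were counted in its example, so the values produced should be expected to agree with the quoted figures only approximately.

```python
import math


def sample_size_bound(num_weights: int, num_nodes: int, error_fraction: float) -> float:
    """Evaluate (32 W / f) * ln(32 M / f), an assumed form of the
    Baum & Haussler sufficient training set size."""
    if num_weights <= 0 or num_nodes <= 0:
        raise ValueError("num_weights and num_nodes must be positive")
    if not 0.0 < error_fraction < 1.0:
        raise ValueError("error_fraction must lie strictly between 0 and 1")
    return (32.0 * num_weights / error_fraction) * math.log(
        32.0 * num_nodes / error_fraction
    )


if __name__ == "__main__":
    # Ten hidden units with 30 weights each, as in the example in the text.
    for f in (0.01, 0.1):
        n = sample_size_bound(num_weights=300, num_nodes=10, error_fraction=f)
        print(f"f = {f}: N is roughly {n:,.0f}")
```

With W = 300 and H = 10 this gives on the order of ten million patterns for f = 0.01 and somewhat under one million for f = 0.1, the same order of magnitude as the figures quoted above.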
6.10.3 Net pruning
The last section introduced the idea that poor generalization may result from a large network being underconstrained by the training set. The other side of this coin is that there are too many internal parameters (weights) to model the data that are supplied. The number of weights is, of course, partly determined by the number of hidden units and we set out by showing that this must be kept to a minimum for good generalization. However, rather than eliminate complete units, it may be possible to place constraints on the weights across the network as a whole so that some of the network's freedom of configuration is removed. One way to achieve this is to extend the network error function to incorporate a term that takes on small values for simple nets and large values for more complex ones. The utility of this hinges, of course, on what we mean by "complex" and one definition is that nets with many weights close to zero should be favoured in contrast to those with weights that all take significant numerical values. It transpires that there is a very simple way of enforcing this (Hinton 1987). Thus, define
$$E_c = \sum_i w_i^2 \qquad (6.12)$$
where the sum is over all weights in the net, and let Et be the error used so far based on input-output differences (5.15). Now put E = Et + λEc and perform gradient descent on this total risk E. The new term Ec is a complexity penalty and will favour nets whose weights are all close to zero. However, the original performance measure Et also has to be small, which is favoured by nets with significantly large weights that enable the training set to be correctly classified. The value of λ determines the relative importance attached to the complexity penalty. With λ = 0 we obtain the original backpropagation algorithm, while very large values of λ may force the net to ignore the training data and simply assume small weights throughout. With the right choice of λ the result is a compromise; those weights that are important for the correct functioning of the net are allowed to grow, while those that are not important decay to zero, which is exactly what is required. In effect, each very small weight is contributing nothing and represents a non-connection; it has been "pruned" from the network. In this way connections that simply fine-tune the net (possibly to outliers and noise in the data) are removed, leaving those that are essential to model the underlying data trends.
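As an illustration of how such a penalty acts during gradient descent, suppose the penalty is the sum of squared weights scaled by a coefficient λ. Each weight then receives an extra gradient contribution of 2λw_i, which pulls it towards zero at every step unless the training error pushes back. The sketch below assumes exactly this form; the function names, the toy error Et = 0.5(w0 - 3)^2 and the parameter values are hypothetical and chosen only to make the effect visible.

```python
from typing import Callable, List, Sequence


def complexity_penalty(weights: Sequence[float]) -> float:
    """Sum of squared weights (the assumed form of the penalty Ec)."""
    return sum(w * w for w in weights)


def weight_decay_step(
    weights: Sequence[float],
    grad_et: Callable[[Sequence[float]], Sequence[float]],
    learning_rate: float,
    lam: float,
) -> List[float]:
    """One gradient descent step on E = Et + lam * Ec.
    The penalty adds 2 * lam * w to the gradient of every weight."""
    grads = grad_et(weights)
    return [w - learning_rate * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]


def toy_grad_et(weights: Sequence[float]) -> List[float]:
    # Hypothetical training error Et = 0.5 * (w0 - 3)^2:
    # only the first weight matters; the second has no effect on Et.
    return [weights[0] - 3.0, 0.0]


def run(lam: float, steps: int = 500, learning_rate: float = 0.1) -> List[float]:
    weights = [0.0, 1.0]
    for _ in range(steps):
        weights = weight_decay_step(weights, toy_grad_et, learning_rate, lam)
    return weights


if __name__ == "__main__":
    print("lam = 0.00:", run(0.0))
    print("lam = 0.05:", run(0.05))
```

In this toy setting the weight that the training error depends on settles at a large value (slightly shrunk by the penalty), while the weight that contributes nothing to Et decays towards zero, which is the pruning effect described above. With the coefficient set to zero the unimportant weight is simply left where it started.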
Many variations have been tried for the form of Ec, and other heuristics, not based on a cost function like Ec, have been used to prune networks for better generalization; see Reed (1993) for a review.
So far it has been assumed that the network topology (the number of layers and number of nodes in each layer) is fixed. Our initial analysis, however, showed that the obvious thing to try is to determine a suitable number of hidden nodes. This may be done in one of two ways. We can try to determine an optimum topology at the outset and then proceed to train using backpropagation, or alter the topology dynamically in conjunction with the normal gradient descent. Either way, the resulting algorithms tend to be fairly complex and so we only give the barest outline of two examples, one for each approach.
Weymaere & Martens (1994) provide an example of the topology initialization technique in which the data are first sent through a conventional clustering algorithm to help determine candidate hyperplanes, and hence hidden units. These candidate units are then used in a network construction algorithm to estimate the optimal topology. Finally the net is fine-tuned with a limited number of backpropagation epochs. This algorithm is perhaps best considered as a hybrid technique in which the hidden units are trained in part via the initial data clustering, as well as the normal gradient descent.
In the method of Nabhan & Zomaya (1994), nodes are dynamically added to or removed from a network that is concurrently undergoing training using backpropagation. Their algorithm is based on the hypothesis that nets which model the data well will train in such a way that the root mean square (rms) value of the weight changes decreases from epoch to epoch. If this fails to take place then structural changes are made to the network by selecting what they call "promising" structures, that is, nets that start to decrease their rms weight change.
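The quantity being monitored here is easy to compute. The sketch below shows only the monitoring step (the rms weight change between successive epochs and a check that it is falling); it does not attempt to reproduce the structural changes of Nabhan & Zomaya's algorithm. The function names and the `window` parameter are hypothetical.

```python
import math
from typing import Sequence


def rms_weight_change(old_weights: Sequence[float], new_weights: Sequence[float]) -> float:
    """Root mean square of the per-weight changes between two epochs."""
    if len(old_weights) != len(new_weights) or not old_weights:
        raise ValueError("weight vectors must be non-empty and of equal length")
    squared = [(new - old) ** 2 for old, new in zip(old_weights, new_weights)]
    return math.sqrt(sum(squared) / len(squared))


def rms_is_decreasing(history: Sequence[float], window: int = 3) -> bool:
    """True if the last `window` rms values are strictly decreasing.
    With fewer than two values there is nothing to compare, so the
    check passes by default."""
    recent = list(history[-window:])
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

A training loop could append `rms_weight_change(previous, current)` to a history list after each epoch and treat a failed `rms_is_decreasing` check as the signal to consider a change of structure.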
Other constructive algorithms, such as the cascade-correlation procedure (Fahlman & Lebiere 1990), make use of cost-function optimization but do not use gradient descent per se and are therefore only distantly related to backpropagation.