7.1 Generalisation and uncertainty in neural networks
7.1.2 Improving generalisation
This simulated handshape data-set was used to train 20 neural networks with a 16:18:20 topology.²⁰ As would be expected for such a small training set, all of the networks quickly learned to classify all of the training examples correctly. Average training times were on the order of 10,000 pattern presentations using a learning rate of 0.1. Several test sets of imperfect examples were then generated to test the generalisation properties of these networks. In the absence of accurate models of the types of variation likely in handshapes, the test sets were generated by adding varying amounts of random noise to each of the input values of the training examples. Each test set consisted of 5 examples of each of the 20 handshapes. Table 7.1 summarises the performance of the networks at each noise level.
Table 7.1 Summary of the performance of twenty networks on the simulated handshape data with different levels of added noise
Noise level   Minimum % correct   Maximum % correct   Mean % correct
0.0           100                 100                 100.0
0.1            85                 100                  95.8
0.2            85                  97                  93.2
0.3            82                  93                  88.7
0.4            67                  84                  78.4
0.5            60                  74                  67.7
0.6            50                  66                  57.6
As the level of noise is increased, the performance of the networks degrades gradually. Even at a noise level of 0.3 (which is 30% of the range of the initial input values) the networks average almost 90% correct. However, performance varies widely between networks, as can be seen by comparing the minimum and maximum values in Table 7.1.
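The construction of the noisy test sets described above can be sketched as follows. This is a minimal illustration, not the original procedure: it assumes the noise is drawn uniformly from [-level, +level] and added independently to each input value, that the inputs lie in the range 0 to 1, and that no clipping is applied afterwards. The function name make_noisy_test_set and the random stand-in prototypes are hypothetical.

```python
import random

def make_noisy_test_set(prototypes, noise_level, copies=5, seed=0):
    """Return (inputs, label) pairs: `copies` noisy versions of each prototype.

    Assumption: noise is uniform on [-noise_level, +noise_level] and is
    added independently to every input value, with no clipping.
    """
    rng = random.Random(seed)
    test_set = []
    for label, pattern in enumerate(prototypes):
        for _ in range(copies):
            noisy = [x + rng.uniform(-noise_level, noise_level) for x in pattern]
            test_set.append((noisy, label))
    return test_set

# Hypothetical stand-in data: 20 handshapes, 16 input values each in [0, 1].
data_rng = random.Random(1)
prototypes = [[data_rng.random() for _ in range(16)] for _ in range(20)]

test_set = make_noisy_test_set(prototypes, noise_level=0.3)
print(len(test_set))  # 5 examples of each of the 20 handshapes
```

One test set per noise level (0.0 to 0.6) would be produced by calling the function once for each level.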
This variation in the test set performance of the networks trained on the simulated handshapes led to the development of a technique aimed at improving generalisation performance. This approach, labelled a committee system, combines the outputs of multiple networks to produce a classification rather than using only a single network. The rationale is the realisation that there is not one global minimum into which every network trains, but many minima at which adequate classification of the training examples can be obtained. Although networks in different minima all perform similarly on the training set, their responses to new data vary, and hence their robustness is not necessarily the same. By combining the outputs of several networks it may be possible to obtain better generalisation than that of any single network.
20 Note that, unlike the other networks used in this thesis, these networks used the asymmetric sigmoid activation function, following advice from the extant literature at the time.
The outputs of the networks can be combined in two different ways. In a voting committee system, the output node with the highest value in each network is taken to be the classification made by that network, and the overall classification is the class indicated by the largest number of networks. Alternatively, in a summing committee system, the outputs for each class are summed over all the networks, and the final classification is the class with the highest sum. If only a small number of networks make up the committee, the voting system can often give rise to situations where the votes are evenly split between two classes and hence no clear classification can be made. This situation is much less likely to arise in a summing system, as the numbers being compared are floating point values and are therefore unlikely to be exactly equal. Early results indicated that, other than this, the two systems performed similarly, and so only the summing committee was tested extensively.
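The two combination rules can be sketched in a few lines. This is an illustrative reading of the description above rather than the code used in the original experiments; the function names and the example output values are invented, and returning None to signal a split vote is an assumption.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value (the first one wins if several are equal)."""
    return max(range(len(values)), key=lambda i: values[i])

def voting_committee(outputs):
    """outputs[n][c] is the output of network n for class c.

    Returns the class with the most votes, or None when the highest vote
    count is shared by more than one class.
    """
    votes = Counter(argmax(net_out) for net_out in outputs)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # votes evenly split: no clear classification
    return ranked[0][0]

def summing_committee(outputs):
    """Returns the class whose outputs, summed over all networks, are largest."""
    totals = [sum(column) for column in zip(*outputs)]
    return argmax(totals)

# Illustrative values only: four networks, three classes.
outputs = [
    [0.9, 0.1, 0.2],
    [0.6, 0.5, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.9, 0.1],
]
print(voting_committee(outputs))   # None: two votes each for classes 0 and 1
print(summing_committee(outputs))  # 1: class totals are roughly 2.1, 2.2, 0.7
```

In this example the vote is split two against two, while the summed outputs still separate the classes, which is the situation the summing committee is intended to handle.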
Table 7.2 Performance of committee systems on the simulated handshape data-set with added noise
Noise level (+/-)   Mean of 20 networks   Committee: 20 networks   Committee: 5 networks   Committee: 5 networks, 2,000 pps each
None                100.0                 100.0                    100.0                   100.0
0.1                  95.8                 100.0                     98.0                    98.0
0.2                  93.2                  95.0                     94.5                    95.0
0.3                  88.7                  93.0                     92.0                    91.0
0.4                  78.4                  87.0                     87.5                    83.0
0.5                  67.7                  77.0                     75.0                    75.0
0.6                  57.6                  67.0                     67.0                    69.0
Table 7.2 summarises the results obtained when various committees were applied to the simulated handshape test sets. Three different committee structures were used. One consisted of all 20 networks, whilst the second contained only five randomly chosen networks. The third committee consisted of five networks each trained for only 2,000 pattern presentations, so that the total training time for this committee was equal to the time taken to train each of the original individual networks.
All three of the committees clearly outperformed the mean of the 20 networks, particularly at high levels of noise. The larger committee was marginally better than either of the small committees, although at the cost of having to train far more networks. Perhaps the most interesting result was the performance of the committee of 5 networks trained for only 2,000 pattern presentations each. Training this committee required 10,000 pattern presentations in total, the same as was used to train each of the 20 individual networks, yet the committee generalised much better than the mean of those networks.
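The comparison made in Table 7.2, between the mean accuracy of the individual networks and the accuracy of a summing committee built from them, can be sketched as below. The helper names are hypothetical, and the output values are a toy example constructed so that each network errs on a different test example; they are not results from the experiments.

```python
def argmax(values):
    """Index of the largest value (the first one wins if several are equal)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predictions, labels):
    """Percentage of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * correct / len(labels)

def mean_individual_accuracy(outputs, labels):
    """outputs[n][e][c]: output of network n for test example e and class c."""
    scores = [accuracy([argmax(out) for out in net], labels) for net in outputs]
    return sum(scores) / len(scores)

def summing_committee_accuracy(outputs, labels):
    """Accuracy when the per-class outputs are summed over all networks."""
    predictions = []
    for per_example in zip(*outputs):  # the outputs of every network for one example
        totals = [sum(column) for column in zip(*per_example)]
        predictions.append(argmax(totals))
    return accuracy(predictions, labels)

# Toy data: three networks, three test examples, two classes.
labels = [0, 1, 0]
outputs = [
    [[0.4, 0.6], [0.2, 0.9], [0.8, 0.1]],    # wrong on the first example
    [[0.9, 0.2], [0.55, 0.45], [0.7, 0.2]],  # wrong on the second example
    [[0.8, 0.1], [0.1, 0.8], [0.45, 0.5]],   # wrong on the third example
]
print(mean_individual_accuracy(outputs, labels))    # about 66.7
print(summing_committee_accuracy(outputs, labels))  # 100.0
```

Because each network's error here is made with low confidence and is outweighed by the other two, the committee classifies every example correctly. Real networks need not disagree so conveniently.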
Similar results were obtained when the committee systems were applied to an unrelated data set, which involves the classification of weed seeds into 10 different plant types on the basis of 7 measurements of the seed.²¹ Twenty networks were trained on 298 examples from this data set, and tested on the remaining 100 examples. The number of hidden nodes in the networks was varied from 6 to 12, with relatively little effect on the resulting performance. As with the handshape data, these networks were tested individually and also when grouped into committees of 5 and 20 networks, with the results summarised in Table 7.3. As with the simulated handshapes, the performance of the committee systems was well above the mean of the individual networks.
Table 7.3 Performance of individual networks and committee systems on the weed seed test data set
Mean of 20 nets   5 net committee   20 net committee
59.9              64.9              65.3
The concept of the committee system has since been further explored by other members of the Artificial Neural Networks Research Group. Waugh and Adams (1993) applied this approach to several data sets and three different learning algorithms (pattern-presentation backpropagation, batch backpropagation and Quickprop). On the majority of these data sets the committee systems generalised marginally (and in some cases significantly) better than the average of their component networks. Freeman and Adams (1993) also applied committee systems to the classification of heart data.
21 This data set is originally from the Scottish Crop Research Institute, and was obtained courtesy of Mr Phil Collier from the Expert Systems Research Group in the Department of Computer Science, University of Tasmania. This data set has been widely used as a benchmark within the Artificial Neural Networks Research Group.