3 Generalisation In Neural Networks
3.4 The Mitchellian View
3.4.2 Relating Mitchell's Technique to Neural Networks
For Mitchell, a concept is a link between instances, which groups certain instances under one category (those which match the concept) and other instances under another category (those which do not match the concept). If instances are seen as stimuli to a system, the only possible responses of the system are "Matches" or "Does not match".
In this thesis, a concept shall be taken to be a body of IO behaviour. This allows for a more active model, with the potential for a richer set of responses than the two possibilities with Mitchell's technique. The concept of danger, for example, may be seen as the linking of certain stimuli to appropriate responses. For instance, given the stimulus of "toadstool", the trained response might be "don't eat". In neural terms, this broadening of the idea of a concept allows for the possibility of many output units. One output unit would suffice to provide a strict link with the symbolic technique. An output of 1 could indicate, "the input matches the concept", and an output of 0, "the input does not match the concept".
Since the weights and topology are what give rise to the IO behaviour in neural networks, concept space can be seen as the weight space for a given topology. The set of weights that correctly classify the patterns presented so far is referred to as the version space for that set of patterns by some authors.51 Weight space, however, is an indirect way of looking at the IO behaviour. The view of version space in this thesis will be the space of all IO behaviours that are consistent with the data. This space may be limited to the space of all IO behaviours realisable by the given topology. This provides the link with weight space.
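The relation between the two views of version space can be illustrated with a minimal sketch. It assumes a single threshold unit with two binary inputs, and it replaces the real-valued weight space with a coarse, hypothetical grid (GRID) so that the space can be enumerated; the names threshold_unit, version_space and io_behaviours are illustrative only and do not come from the literature cited.

```python
from itertools import product

def threshold_unit(weights, bias, x):
    """Single threshold unit: output 1 if the weighted sum exceeds 0, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Assumption: a coarse grid stands in for the real-valued weight space.
GRID = [-1.0, -0.5, 0.0, 0.5, 1.0]
INPUTS = list(product([0, 1], repeat=2))  # every binary input vector

def version_space(patterns):
    """Weight states (w1, w2, bias) on the grid that classify all patterns correctly."""
    return [
        (w1, w2, b)
        for w1, w2, b in product(GRID, repeat=3)
        if all(threshold_unit((w1, w2), b, x) == t for x, t in patterns)
    ]

def io_behaviours(weight_states):
    """Distinct IO behaviours (the output on every input) realised by the weight states."""
    return {
        tuple(threshold_unit((w1, w2), b, x) for x in INPUTS)
        for w1, w2, b in weight_states
    }

# Two training patterns: (input vector, target).
patterns = [((0, 0), 0), ((1, 1), 1)]
vs = version_space(patterns)
behaviours = io_behaviours(vs)
print(len(vs), "consistent weight states,", len(behaviours), "distinct IO behaviours")
```

Many weight states collapse onto each IO behaviour, which is why the IO view is the more direct description; restricting the behaviours to those the unit can realise is what ties it back to weight space.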
Version space in neural networks is much larger than in the symbolic technique, if weight space or IO behaviour is taken to be the analogue. This is because the weights may take any real value, as indicated earlier, whereas the symbolic case has a fixed, finite set of symbols for attribute values in version space. Reaching the no-alternative situation in neural networks, even within a certain degree of accuracy, could take a long time because there are so many possibilities. This is indicated by VC theory.
Instances are simply the training patterns. Training patterns combine input and target vectors for the input and output units of the neural topology. The neural analogue of the instance language is the specification of the sets from which the inputs and targets may be drawn. For example,
inputs might be from R^n, and targets from the set {1, 0}^m, where n is the number of input units, and m is the number of output units.
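As a rough illustration, the neural instance language can be written down as a check on whether a training pattern is admissible. The sketch below assumes real-valued inputs and binary targets, as in the example above; the class name InstanceLanguage and its method admits are hypothetical.

```python
from dataclasses import dataclass
from numbers import Real

@dataclass(frozen=True)
class InstanceLanguage:
    """Hypothetical description of which training patterns are admissible."""
    n_inputs: int   # n: number of input units
    n_outputs: int  # m: number of output units

    def admits(self, inputs, targets):
        """True if inputs has n real components and targets has m components in {0, 1}."""
        return (
            len(inputs) == self.n_inputs
            and len(targets) == self.n_outputs
            and all(isinstance(v, Real) and not isinstance(v, bool) for v in inputs)
            and all(t in (0, 1) for t in targets)
        )

language = InstanceLanguage(n_inputs=3, n_outputs=2)
print(language.admits((0.2, -3.0, 7), (1, 0)))  # admissible pattern
print(language.admits((0.2, -3.0, 7), (1, 2)))  # target outside {0, 1}
```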
The neural analogue of the generalisation language is the topology. This constrains the set of IO behaviours the neural network is capable of realising. The choice of topology represents the bias of the generalisation, in the Mitchellian sense just as in the Geman et al. sense. In Mitchell's technique, the choice of generalisation language is made by the user. To maintain consistency with this, the neural implementations discussed in chapters 5 and 6 also either restrict the choice of bias (in terms of the topology), or leave it to the user. Therefore, techniques for automatically determining the topology during training are not considered.
Mitchell's technique uses representatives of the boundaries of version space to mark the shrinking of version space as instances are introduced. In order to establish those boundaries, it is necessary to have an ordering of the concepts. In the neural case, partial orderings of the Mitchellian kind are hard to find because IO behaviours carry no inherent general/specific relation: there is no a priori way to order IO behaviour. For example, should a concept that has response "eat" to the stimulus "toadstool" come before or after one that has "don't eat", or "run away"? The work of this thesis centres around finding such orderings in order to implement Mitchell's technique in a neural environment.
A further requirement, once there is the possibility for representing the boundaries of version space, is the ability to make changes to those boundaries under pressure from the instances. These are the neural analogues of updating and selection within.
Thus, certain prime directives for a neural implementation of Mitchell's technique may be established. These are given below:
• The implementation must have an ordering of IO behaviour which enables boundary representation of version space, with a many-one correspondence between the IO behaviour and the ordering, and any two weight states deemed to have the same IO behaviour must also have the same value in the ordering. If this directive is not upheld, then two networks with the same IO behaviour might have different values in the ordering. The no-alternative situation is detected by the networks having the same value in the ordering. Hence, without this directive, it is not possible to guarantee the detection of the no-alternative situation using the ordering.
• There must be mechanisms for updating and selection within.
• Updating must always be by the minimum amount necessary for correct classification, in one direction for the S analogue, and in the opposite direction for the G analogue.
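The directives can be made concrete with a deliberately simple toy case, which is not one of the implementations of chapters 5 and 6. It assumes concepts of the form "x >= theta" over a range of integers, so that theta itself supplies the ordering: each IO behaviour has exactly one value of theta. The class name ThresholdVersionSpace and its attributes s and g are hypothetical stand-ins for the S and G analogues.

```python
class ThresholdVersionSpace:
    """Toy version space for concepts 'x >= theta' over the integers lo..hi.

    theta orders the concepts from most general (theta = lo, matches
    everything) to most specific (theta = hi + 1, matches nothing).
    """

    def __init__(self, lo, hi):
        self.g = lo      # G analogue: most general threshold still consistent
        self.s = hi + 1  # S analogue: most specific threshold still consistent

    def update(self, x, is_positive):
        if is_positive:
            # Generalise S by the minimum amount needed to match x.
            self.s = min(self.s, x)
        else:
            # Specialise G by the minimum amount needed to reject x.
            self.g = max(self.g, x + 1)
        if self.g > self.s:
            raise ValueError("no threshold concept is consistent with the patterns")

    def no_alternative(self):
        """The boundaries share a value in the ordering: one concept remains."""
        return self.g == self.s

vs = ThresholdVersionSpace(lo=0, hi=10)
for x, is_positive in [(7, True), (2, False), (5, True), (4, False)]:
    vs.update(x, is_positive)
print(vs.g, vs.s, vs.no_alternative())  # prints: 5 5 True
```

The two boundaries move only towards each other, and only as far as each pattern forces them, so their meeting is what signals that no alternative concept remains.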
3.5 Conclusion
Mitchell's technique has the advantage that, by using boundary representatives of version space during learning, it need not examine all the possible concepts. Exhaustive search techniques, such as that of Schwarz et al., and the sampling technique of Opper and Haussler, do not have this advantage, and suffer from relatively high computational costs.
The main advantage of the boundary representatives, however, is the ability to recognise the no-alternative situation. This is an important consideration for anyone who is trying to fit some data:
In some cases, we may be interested in global, rather than local questions. Not, "how good is this fit?", but rather, "how sure am I that there is not a very much better fit in some other corner of parameter space?"52
The bidirectional convergence of the search enables this valuable property, since if there is no alternative but the current solution, then the fit must be the best possible. Having found and recognised the no-alternative situation also gives a terminating condition. There is no point in training further if it is known that there are no better alternatives. The validation technique also has a terminating criterion, but as indicated in section 3.3.1, there is ambiguity about which minimum of validation error should be
used. Hence, the terminating criterion of the validation technique is not certain to recognise the optimum fit. MacKay's technique, which is also unidirectional, suffers from the same disadvantage.
There is more potential for the Opper and Haussler technique to indicate the no-alternative situation. If all the samples of version space give the same output for any randomly chosen input, then this might be taken to indicate the convergence of version space. This could, however, be due to poor or insufficient sampling of version space, rather than reaching the no-alternative situation.
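The agreement test described above, and its weakness, can be sketched as follows. This is not Opper and Haussler's procedure itself, only an illustration of the check suggested here; the function names, the one-dimensional threshold hypotheses and the probe count are all assumptions made for the example.

```python
import random

def make_threshold(theta):
    """Hypothetical member of version space: outputs 1 when x >= theta."""
    return lambda x: 1 if x >= theta else 0

def samples_agree(hypotheses, draw_input, n_probes=1000, seed=0):
    """True if all sampled hypotheses give the same output on every random probe."""
    rng = random.Random(seed)
    for _ in range(n_probes):
        x = draw_input(rng)
        if len({h(x) for h in hypotheses}) > 1:
            return False
    return True

def draw_unit_interval(rng):
    """Draw a random input from [0, 1)."""
    return rng.random()

# Suppose every threshold between 0.4 and 0.6 is consistent with the patterns.
well_spread = [make_threshold(t) for t in (0.4, 0.5, 0.6)]
poorly_spread = [make_threshold(0.5), make_threshold(0.5)]

print(samples_agree(well_spread, draw_unit_interval))    # disagreement is found
print(samples_agree(poorly_spread, draw_unit_interval))  # agreement, but misleading
```

The second sample set agrees on every probe even though version space has not converged, which is the ambiguity between insufficient sampling and the no-alternative situation.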
The Bayesian literature relates to Mitchell's methods for partially learned version spaces. Selecting the most probable weight state from version space, or the most probable classification given the weight states in version space, is a useful method for guessing the generalisation when it is clear that there are several alternatives for a given set of patterns.
VC theory and average generalisation theory provide measures which relate to the likely generalisation ability. The VC theory estimate was shown in section 3.3.4 to place excessive demands on the number of patterns, because it predicts high discrepancies between estimated and actual generalisation error even for extremely good average generalisation abilities.
Mitchell's technique is able to offer more than probabilities of generalisation, within a certain set of assumptions. If the no-alternative situation is reached, then Mitchell's technique can offer a guarantee that, given the assumptions, the generalisation is correct. This means that any unsatisfactory generalisation results arise from the assumptions of the user (and the designer of the technique), rather than being due to the probabilities not working in the favour of a good generalisation.
For example, if the average generalisation performance is estimated to be 90% on the basis of a set of patterns, the actual generalisation performance need not, in fact, have this value. The explanation of why the generalisation performance is different from the expected value does not rest on the assumptions of the user, so much as on the particular set of patterns chosen.
The VC and average generalisation theories show that in neural networks, large data sets are needed to constrain the number of alternative IO behaviours to a reasonable quantity, each with an acceptable generalisation performance. The Mitchellian guarantee, which rests on a single alternative, might therefore seem rather a remote possibility. The techniques discussed in chapters 5 and 6 are both neural implementations of Mitchell's technique which, within certain constraints, aim to achieve the no-alternative situation using a reasonable number of carefully chosen patterns.
Issues in Topology Determination Introduction