3 Generalisation In Neural Networks
3.3 Generalisation in Neural Networks in the Literature
3.3.2 Average Generalisation Error
Schwarz et al22 provide a formalism for computing the distribution of generalisation abilities given the number of patterns, and a prior distribution of generalisation abilities which embodies some kind of prior knowledge about the problem. The formalism is also discussed in Hertz et al,23 and the results are summarised in the following:
Let V0 be the volume of weight space. This is finite if weight values are considered only within a certain range; Hertz et al suggest a range of [-10,10].24 Let V0(f) be the volume of weight space (in the same region of weight space) that implements a given function f. (More than one weight state may implement a given function because of symmetries in weight space; Chapter 5 has more on these. It is also possible that the functions implemented by weight states in a given neighbourhood do not differ significantly.)
Let fd be the target, or desired, function. Let X be the set of all possible inputs, x. A given training set, T, takes random members x_i of X as input, with the target for each given by fd(x_i). Let E(f, x) be the following function:

$$E(f, x) = \begin{cases} 1 & \text{if } f(x) = f_d(x) \\ 0 & \text{otherwise} \end{cases} \qquad [3.1]$$
Let g(f) be the probability that f agrees with fd on any randomly chosen
input — the generalisation ability of f. This is the mean value of E(f, x) for
all members of X:
$$g(f) = \langle E(f, x) \rangle \qquad [3.2]$$
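As a minimal sketch of [3.2], the generalisation ability of a candidate function can be approximated by averaging the agreement indicator E(f, x) over randomly drawn inputs. The function names, the two example step functions, and the uniform input range are illustrative assumptions and do not come from the source.

```python
import random

def agreement(f, f_d, x):
    """E(f, x) from [3.1]: 1 if f agrees with the target f_d on x, else 0."""
    return 1 if f(x) == f_d(x) else 0

def estimate_g(f, f_d, sample_input, n_samples=100_000, seed=0):
    """Monte Carlo estimate of g(f) = <E(f, x)> from [3.2]."""
    rng = random.Random(seed)
    total = sum(agreement(f, f_d, sample_input(rng)) for _ in range(n_samples))
    return total / n_samples

# Hypothetical example: two step functions on the interval [-10, 10].
target = lambda x: 1 if x > 3 else 0       # assumed target function f_d
candidate = lambda x: 1 if x > 0 else 0    # assumed candidate function f
uniform_input = lambda rng: rng.uniform(-10, 10)

g_hat = estimate_g(candidate, target, uniform_input)
```

Here the two functions disagree only on inputs between 0 and 3, which is 3/20 of the input range, so the estimate should be close to 0.85.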
Let p0(g) be the prior distribution of generalisation abilities, g. This is the distribution of the fraction of weight space under consideration that has generalisation ability g:
22 Schwarz et al, 1990
23 Hertz et al, 1991, pp. 148-153
24 Hertz et al, 1991, p. 148
Generalisation in Neural Networks Generalisation in the Literature
$$p_0(g) = \frac{\sum_f V_0(f)\,\delta(g - g(f))}{\sum_f V_0(f)} \qquad [3.3]$$

where δ(x) is the Kronecker delta:

$$\delta(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases}$$
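The prior distribution in [3.3] can be sketched as a volume-weighted tally over functions. The helper name `prior_distribution` and the toy volumes and abilities are hypothetical, chosen only to illustrate the bookkeeping; exact fractions are used so that the Kronecker delta comparison is not affected by rounding.

```python
from collections import defaultdict
from fractions import Fraction

def prior_distribution(functions):
    """p_0(g) from [3.3]: fraction of weight-space volume with ability g.

    `functions` is an iterable of (V_0(f), g(f)) pairs.
    """
    volume_by_g = defaultdict(Fraction)
    total = Fraction(0)
    for volume, g in functions:
        volume_by_g[g] += volume   # the Kronecker delta keeps only g(f) == g
        total += volume
    return {g: v / total for g, v in volume_by_g.items()}

# Hypothetical toy data: three functions with volumes 2, 1 and 1.
toy = [
    (Fraction(2), Fraction(1, 2)),
    (Fraction(1), Fraction(1, 2)),
    (Fraction(1), Fraction(1)),
]
p0 = prior_distribution(toy)
```

In this toy case three quarters of the volume has ability 1/2 and one quarter has ability 1.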
Consider p patterns from T. Let Vp(f) be the volume of weight space consistent with both a function f and the p training examples x_i from T:

$$V_p(f) = V_0(f) \prod_{i=1}^{p} E(f, x_i) \qquad [3.4]$$
The product is 1 if the function is consistent with all p training examples, and 0 otherwise. This represents the elimination of a function from the space of possible functions if that function misclassifies a pattern. Since each factor E(f, x_i) has mean value g(f) for a randomly chosen input, Vp(f) may be estimated using g(f) to give:

$$V_p(f) \approx V_0(f)\,[g(f)]^p \qquad [3.5]$$

The distribution of generalisation abilities after p patterns, pp(g), is then as per [3.6] below, where Vp is the volume of weight space consistent with the p patterns (the size of version space):
$$p_p(g) = \frac{\sum_f V_p(f)\,\delta(g - g(f))}{V_p} \qquad [3.6]$$
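The difference between the exact elimination rule [3.4] and its estimate [3.5] can be illustrated with a short sketch. The step functions, the pattern lists, and the helper names are assumptions made for illustration only.

```python
from math import prod

def consistent_volume(v0, f, f_d, patterns):
    """V_p(f) from [3.4]: V_0(f) if f matches f_d on every pattern, else 0."""
    return v0 * prod(1 if f(x) == f_d(x) else 0 for x in patterns)

def estimated_volume(v0, g, p):
    """Estimate of V_p(f) from [3.5]: V_0(f) * g(f)**p."""
    return v0 * g ** p

# Hypothetical step functions on [-10, 10]; they disagree only on (0, 3].
target = lambda x: 1 if x > 3 else 0
candidate = lambda x: 1 if x > 0 else 0

kept = consistent_volume(1.0, candidate, target, [-5.0, 4.0, 8.0])
eliminated = consistent_volume(1.0, candidate, target, [-5.0, 2.0, 8.0])
expected = estimated_volume(1.0, 0.85, 3)   # g = 0.85 for this candidate
```

For any one training set the exact volume is all or nothing: the candidate survives the first set and is eliminated by the pattern at 2.0 in the second. The estimate instead gives the volume expected to survive three random patterns.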
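Using the estimate of Vp(f) in [3.5] above, pp(g) may also be estimated by substituting for Vp(f) in [3.6], with Vp written as the sum of Vp(f) over all f:

$$p_p(g) \approx \frac{\sum_f V_0(f)\,[g(f)]^p\,\delta(g - g(f))}{\sum_f V_0(f)\,[g(f)]^p} \qquad [3.7]$$

In the numerator, [g(f)]^p can be taken outside of the sum as g^p, since if g(f) ≠ g the Kronecker delta evaluates to zero. Thus, for the sum over all f, [g(f)]^p is effectively constant and equal to g^p. Therefore:

$$p_p(g) \approx \frac{g^p \sum_f V_0(f)\,\delta(g - g(f))}{\sum_f V_0(f)\,[g(f)]^p} \qquad [3.8]$$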
Hence the estimate of pp(g) can be written in terms of the prior distribution, p0(g):

$$p_p(g) \approx \frac{g^p\,p_0(g)}{\int_0^1 g'^{\,p}\,p_0(g')\,dg'} \qquad [3.9]$$

where the integral is used to normalise the distribution.
The average generalisation ability after p patterns, G(p), is the mean of the distribution of generalisation abilities:
$$G(p) = \int_0^1 g\,p_p(g)\,dg \qquad [3.10]$$
This can be estimated as well, and hence expressed solely in terms of the prior distribution of generalisation abilities, by substituting for the estimate of pp(g):
$$G(p) \approx \frac{\int_0^1 g^{p+1}\,p_0(g)\,dg}{\int_0^1 g^p\,p_0(g)\,dg} \qquad [3.11]$$
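For a prior that takes only a finite set of values, the integrals in [3.9] and [3.11] reduce to sums. The following is a minimal sketch under that assumption; the dictionary representation of the prior and the example numbers are hypothetical.

```python
def posterior(prior, p):
    """p_p(g) from [3.9] for a discrete prior given as {g: p_0(g)}."""
    weights = {g: (g ** p) * mass for g, mass in prior.items()}
    norm = sum(weights.values())          # discrete form of the integral
    return {g: w / norm for g, w in weights.items()}

def average_generalisation(prior, p):
    """G(p) from [3.11]: ratio of the (p+1)-th to the p-th moment of the prior."""
    numerator = sum(g ** (p + 1) * mass for g, mass in prior.items())
    denominator = sum(g ** p * mass for g, mass in prior.items())
    return numerator / denominator

# Hypothetical prior: 75% of weight space has g = 0.5, 25% has g = 1.0.
prior = {0.5: 0.75, 1.0: 0.25}

g0 = average_generalisation(prior, 0)    # mean of the prior itself
g1 = average_generalisation(prior, 1)    # after one training pattern
post1 = posterior(prior, 1)
mean_post1 = sum(g * mass for g, mass in post1.items())
```

The mean of the posterior from [3.9] agrees with the moment ratio from [3.11], and each further pattern shifts weight towards the functions with higher generalisation ability.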
This can be used to estimate the number of patterns needed to achieve a good average expected generalisation ability, given knowledge of the prior distribution of generalisation abilities. To calculate the prior distribution from [3.3] requires knowledge of the underlying function, however, which is not always possible. If the underlying function is not known, the prior distribution must be estimated:
The prior distribution of generalisation abilities p0(g) is computed by testing all networks ... on a randomly chosen set of ... examples, large enough to obtain the intrinsic generalisation ability of each network with a precision of at least 6%.[25]
This kind of estimation requires sufficient data sample sizes to be sure of the precision of the estimation. If the amount of available data is limited, this may not be possible.
Even if the underlying function is known, it is still necessary to exhaustively consider all the possible weight states. Severe restrictions on allowable weight values are required if the prior distribution is to be feasibly calculated. Schwarz et al use weight values of ±1.[26]
For example, consider a simple topology with a single input unit and a single output unit. There are two weights: one weight, w, from the input unit to the output unit, and a bias weight, b, to the output unit. This topology separates the one-dimensional input space into two halves at the point x = -b/w. Let the weights b and w each be any non-zero integer between -10 and 10 inclusive, and let the input be any real number in the same range. The desired output is 0 for all inputs between -10 and 3 inclusive, and 1 for all other inputs in the given range. Figure 3.8 shows the problem and indicates how the generalisation ability is calculated. For any given weight state f(w, b), with each weight taken from the specified set of values, the generalisation ability g(f) of the weight state for this problem is given by:
$$g(f) = \begin{cases} 1 - \dfrac{|x - 3|}{20} & \text{if } w > 0 \\[1ex] \dfrac{|x - 3|}{20} & \text{if } w < 0 \end{cases} \qquad \text{where } x = -\frac{b}{w} \qquad [3.12]$$

Figure 3.8: The simple problem is illustrated by the solid line. The output is zero from -10 to 3, and rises to 1 thereafter. A candidate solution, indicated by a dashed line, rises from 0 to 1 at a different point. The generalisation error is represented by the shaded area, and the generalisation ability of the candidate solution by the rest of input space between -10 and 10. (Axes: Input from -10 to 10, Output from 0 to 1.)
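Equation [3.12] can be evaluated directly for any weight state. The sketch below assumes the unit outputs 1 when w times the input plus b is positive, so that for w > 0 the output rises at the boundary and for w < 0 it falls; the function name is hypothetical.

```python
from fractions import Fraction

def generalisation_ability(w, b):
    """g(f) from [3.12] for a threshold unit with weight w and bias b.

    Both w and b are assumed to be non-zero integers in [-10, 10].
    """
    x = Fraction(-b, w)             # decision boundary x = -b/w
    overlap = abs(x - 3) / 20       # |x - 3| / 20
    return 1 - overlap if w > 0 else overlap

perfect = generalisation_ability(1, -3)     # boundary at 3, rising
inverted = generalisation_ability(-1, 3)    # boundary at 3, falling
shifted = generalisation_ability(2, -10)    # boundary at 5, rising
constant = generalisation_ability(-1, -10)  # boundary at -10, falling
```

The first state reproduces the target exactly, the second is its exact complement, the third misclassifies the inputs between 3 and 5, and the fourth outputs 0 everywhere and so agrees with the target only on the 13/20 of the range from -10 to 3.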
The prior distribution of generalisation abilities can then be calculated for this problem by considering each weight state, and assuming it represents a unit volume of weight space. The expected average generalisation ability after p patterns can then be calculated using [3.11]. Figure 3.9 shows how the average generalisation ability increases as the number of patterns is increased for this problem. Also shown is the decrease in the size of version space with increasing patterns.
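The calculation described above can be sketched as follows: enumerate the 400 weight states, treat each as unit volume, and apply [3.5] and [3.11] with sums in place of integrals. This is an illustrative reconstruction under the same threshold-unit assumption as [3.12], not the original computation behind Figure 3.9, and all names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Non-zero integer weights between -10 and 10 inclusive.
WEIGHTS = [v for v in range(-10, 11) if v != 0]

def ability(w, b):
    """g(f) from [3.12] for weight w and bias b."""
    x = Fraction(-b, w)
    overlap = abs(x - 3) / 20
    return 1 - overlap if w > 0 else overlap

# Each weight state counts as one unit of weight-space volume.
volume_by_g = Counter(ability(w, b) for w in WEIGHTS for b in WEIGHTS)

def version_space_size(p):
    """Estimated V_p: sum of V_0(f) * g(f)**p over all weight states, per [3.5]."""
    return sum(volume * g ** p for g, volume in volume_by_g.items())

def average_generalisation(p):
    """G(p) from [3.11], written as a ratio of successive version-space sizes."""
    return version_space_size(p + 1) / version_space_size(p)

initial_size = version_space_size(0)
initial_ability = average_generalisation(0)
```

With no patterns the version space contains all 400 weight states and the average generalisation ability is exactly 1/2, because each state (w, b) is paired with (-w, -b), whose ability is the complement. As p grows, `version_space_size(p)` falls and `average_generalisation(p)` rises.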
There are strong links with Mitchell in this technique. It shows how, as functions are eliminated from the space of all possible functions under the weight of increasing numbers of patterns (see equations [3.4] and [3.5]), the average expected generalisation ability increases (equations [3.10] and [3.11]). This is akin to the concept of version space shrinking, leaving fewer and fewer candidate concepts under consideration. The remaining concepts have a greater degree of agreement with the target concept, and hence the average generalisation ability of the remaining concepts is increased.

Figure 3.9: (a) The effect on the average generalisation ability of increasing the number of patterns. (b) The corresponding effect on the size of version space. (Horizontal axis: Number of Patterns.)