
CHAPTER 2 Training neural networks

3.2. Are there enough patterns in the training set?

Let us consider a net used to model a real system. The system's transfer function is sampled and the samples are the I/O patterns in fig. 2. One looks at the points and gets the image of a straight line. A net having a straight line as its overall transfer function will be considered as having good generalisation.

[Plot: output against input]

Fig. 2. The observer interpolates the I/O patterns to obtain an expected generalisation which will be used to assess the generalisation of the net. In this case, the observer's expected generalisation is a straight line.

Fig. 3. A possible transfer function of a net modelling the system which produced the samples in fig. 2. The net will be judged as giving "good" generalisation.

Now, the user of the system compares the net's model (the straight line) with the real transfer function of the system, presented in fig. 4, and declares their disappointment. A neural net person defending the net could say that it did not have enough samples to model that transfer function. The user of the real system could reply that the transfer function was just a sine and that there were 6 samples for 3 periods, which is the minimum number of samples required by a Fourier technique, for instance. Thus, the problem of the number of patterns in the training set (i.e. samples) arises. If we know the function to be modelled, how many samples do we need? And if we do not know the function and are limited to a finite set of samples, as in the example above, what confidence can we have that the model will reflect the real properties of the underlying function?

Fig. 4. The samples in fig. 2 were samples from a sine function. The 'good' generalisation of the net proves to be a very poor one.
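The situation in figs. 2 to 4 can be reproduced numerically. The following is a minimal sketch which assumes that the six samples fall exactly on the zero crossings of sin(x), two per period; the sample positions, the least-squares line fit and the helper name `fit_line` are illustrative choices and are not taken from the figures.

```python
import math

# Assumption: two samples per period, placed on the zero crossings of sin(x).
xs = [k * math.pi for k in range(6)]
ys = [math.sin(x) for x in xs]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

a, b = fit_line(xs, ys)

# Largest error of the straight line on the training samples themselves.
train_err = max(abs(a * x + b - y) for x, y in zip(xs, ys))

# Largest error of the same line on a dense grid covering three periods.
grid = [i * 6 * math.pi / 1200 for i in range(1201)]
true_err = max(abs(a * x + b - math.sin(x)) for x in grid)

print(f"max error on the samples:    {train_err:.2e}")
print(f"max error on the dense grid: {true_err:.2f}")
```

The line fits the samples almost perfectly, yet its error against the sine itself is as large as the amplitude of the sine. The error on the training set alone therefore says nothing about the quality of the model between the samples.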

One could say that in a neural net framework the number of samples is large enough only when the samples 'sketch' the shape of the function. In this case, what does 'sketch' mean? Does one have a valid sketch when the shape of the underlying function is 'the same' as the shape of the function obtained by linear interpolation between the samples? If so, there are curves which can never be sketched no matter the number of samples, for instance the Mandelbrot set generated by the iteration z → z² + c (fig. 5). No matter how many samples one uses, the curve will never be correctly 'sketched', because between any two sample points there is another structure of infinite complexity. In order to appreciate this, one has only to change the scale of the exploration.

Does one have a valid sketch when the difference between the linear interpolation of the samples and the underlying function is below an error limit at any point? If so the Mandelbrot set would not be a problem.


Fig. 5. No matter the number of samples, the shape of the function obtained by linear interpolation will be different from the shape of the underlying function.

The answer to this question depends very much on the application. The following example has been used by W.W. Sawyer in [Sawyer, 1966].

[Figure 6, panels (a), (b), (c). Legend: y=f(x), the function to be modelled; y=g(x), the model; sample points.]

Fig. 6. The underlying function y=f(x) is sampled. From these samples the function y=g(x) can be obtained through linear interpolation. Are there enough samples?

In fig. 6, in each of the diagrams (a), (b) and (c), we see an underlying function y=f(x). This function is sampled and an interpolation y=g(x) is obtained. The question is whether the number of samples is enough to characterise the underlying function y=f(x). The answer depends very much on one's purpose. In (a), y=g(x) goes a long way from y=f(x) but only stays away for a very short time. If we were mainly interested in the areas under the two curves, it might well be that these areas would differ by very little, so y=g(x) would be a good approximation and the number of samples would be sufficient. With this criterion, the curves in (b) and (c) would be close together as well and therefore the number of samples would be sufficiently large in these cases too. However, it might be that we want to ensure that the difference between the underlying function and its approximation is not greater than a given error limit for any x value. In this case, the number of samples is insufficient for the function in (a) but is still sufficient for the functions in (b) and (c). In an investigation where we are particularly concerned with the length of the curves, the number of samples would be seen as insufficient for the functions in both (a) and (c).
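The three criteria can be compared numerically. The sketch below is only an illustration of case (a): the narrow Gaussian `spike`, the six sample positions and the grid size are all assumptions chosen so that y=g(x) leaves y=f(x) briefly but by a large amount; they are not taken from Sawyer's diagrams.

```python
import bisect
import math

def interpolate(xs, ys, x):
    """g(x): piecewise-linear interpolation through the samples (xs sorted)."""
    i = bisect.bisect_right(xs, x) - 1
    i = max(0, min(i, len(xs) - 2))  # clamp so that both end points are covered
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def trapezoid(values, h):
    """Area under a curve given on a uniform grid with spacing h."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

def arc_length(values, h):
    """Length of the polyline through the grid values."""
    return sum(math.hypot(h, b - a) for a, b in zip(values, values[1:]))

def criteria(f, xs, n=20000):
    """Compare f with its interpolation g under the three criteria."""
    ys = [f(x) for x in xs]
    h = (xs[-1] - xs[0]) / n
    grid = [xs[0] + i * h for i in range(n + 1)]
    fv = [f(x) for x in grid]
    gv = [interpolate(xs, ys, x) for x in grid]
    return {
        "area": abs(trapezoid(fv, h) - trapezoid(gv, h)),
        "max_error": max(abs(a - b) for a, b in zip(fv, gv)),
        "length": abs(arc_length(fv, h) - arc_length(gv, h)),
    }

def spike(x):
    """Hypothetical f for case (a): flat, with one narrow peak at x = 0.5."""
    return math.exp(-((x - 0.5) / 0.01) ** 2)

samples = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]  # none of them lands on the peak
result = criteria(spike, samples)
for name, value in result.items():
    print(f"{name:>9}: {value:.4f}")
```

For this function the area difference is small, while the maximum pointwise error and the difference in curve length are both large. The same six samples are thus sufficient under the first criterion and insufficient under the other two.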

The conclusion of this example is that the number of samples itself cannot be declared as sufficient or insufficient independently of the particular problem or type of problem. On the other hand, once the type of problem has been stated - and only in these conditions - one can assess the fitness of the sample set.

In this context, a problem is defined as a triplet consisting of a training set (an I/O problem), an error measure and an error limit. The error measure and the error limit depend on the application. The same I/O problem can be part of two different problems if the error measure or the error limit is different.
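The triplet can be written down as a small data structure. This is a sketch under stated assumptions: the class name `Problem`, the method `is_solved_by`, the two error measures and the decision to evaluate the model on the training patterns only are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Pattern = Tuple[float, float]  # one (input, output) pair

@dataclass(frozen=True)
class Problem:
    """A problem as a triplet: training set, error measure, error limit."""
    training_set: Sequence[Pattern]
    error_measure: Callable[[Sequence[float], Sequence[float]], float]
    error_limit: float

    def is_solved_by(self, model: Callable[[float], float]) -> bool:
        # Assumption: the model is judged on the training patterns only.
        outputs = [model(x) for x, _ in self.training_set]
        targets = [y for _, y in self.training_set]
        return self.error_measure(outputs, targets) <= self.error_limit

def max_abs_error(outputs, targets):
    return max(abs(o - t) for o, t in zip(outputs, targets))

def mean_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

# The same I/O problem used in two different problems.
patterns = ((0.0, 0.0), (1.0, 1.0), (2.0, 2.0))
strict = Problem(patterns, max_abs_error, 0.1)
lenient = Problem(patterns, mean_squared_error, 0.05)

def model(x):
    return 0.9 * x

print(strict.is_solved_by(model), lenient.is_solved_by(model))
```

The same model fails the first problem and solves the second, although both share the same I/O patterns.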

3.2.2 A neural network point of view.

The decision whether a training set contains enough patterns or not depends on the type of network as well. In this context, the type of a network is given by its intrinsic mechanism. Some such mechanisms are:

1. The classical linear network:

$$F(W,X) = W^{T}X \qquad (1)$$

where W is a weight vector and X is an input vector. This corresponds to a network without hidden units.

2. The basis functions network:

$$F(W,X) = \sum_{i=1}^{n} w_i \phi_i(X) \qquad (2)$$

where the φ_i are functions which form a basis. This corresponds to a radial basis function network, for instance.

3. The multilayer sigmoid network:

$$F(W,X) = \sigma\Big(\sum_{k} w_k\, \sigma\Big(\cdots\, \sigma\Big(\sum_{i=1}^{n} w_i x_i\Big) \cdots\Big)\Big) \qquad (3)$$

where σ is a sigmoidal function.

It is assumed that each type can use as many units as necessary for the training to be successful.
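The three mechanisms listed above can be sketched as plain functions. The names `linear_net`, `basis_net`, `gaussian` and `sigmoid_net`, the Gaussian form of the basis function and the logistic form of σ are assumptions made for illustration; the text fixes only the general shape of equations (1) to (3).

```python
import math

def sigmoid(a):
    """Logistic function, one common choice of sigmoidal function."""
    return 1.0 / (1.0 + math.exp(-a))

def linear_net(w, x):
    """Eq. (1): inner product of the weight vector and the input vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def gaussian(centre, width=1.0):
    """Hypothetical radial basis function centred at `centre`."""
    def phi(x):
        d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
        return math.exp(-d2 / (2.0 * width ** 2))
    return phi

def basis_net(w, basis, x):
    """Eq. (2): weighted sum of the basis functions evaluated at x."""
    return sum(wi * phi(x) for wi, phi in zip(w, basis))

def sigmoid_net(layers, x):
    """Eq. (3): nested sigmoids; `layers` is a list of weight matrices,
    each matrix holding one row of weights per unit of that layer."""
    activation = list(x)
    for weights in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)))
            for row in weights
        ]
    return activation

x = [1.0, 2.0]
print(linear_net([0.5, -0.25], x))
print(basis_net([1.0, 2.0], [gaussian([0.0, 0.0]), gaussian([1.0, 2.0])], x))
print(sigmoid_net([[[0.1, 0.2], [0.3, -0.4]], [[1.0, -1.0]]], x))
```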

Let us suppose that the problem is to model a linear function in a 2-dimensional space (one input, one output). If the net uses a linear mechanism, 2 samples on the hyperplane (here, a straight line) are enough for the net to build its best representation of the problem. If the net uses a radial basis mechanism with hyperspherical basis functions, a number of samples equal to the number of hidden units will be necessary for the net to build its best possible representation. The linear net's model will be much better than the radial net's, but this is not relevant to the argument. The important point is that different types of nets need different numbers of samples (training points) to build the best representation of the given underlying function that they are able to build. For this argument, the particular method used to compare different representations is not important as long as it is the same for all comparisons.
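The argument can be illustrated with a small experiment. The sketch below makes several assumptions that are not in the text: the target function 2x + 1, Gaussian basis functions of width 0.5 with one hidden unit centred on each sample, a linear mechanism with a bias input (so that two samples fix both slope and offset), and the maximum error on a dense grid as the common method of comparison.

```python
import math

def target(x):
    """The underlying linear function (an arbitrary choice)."""
    return 2.0 * x + 1.0

def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w

def fit_linear(xs, ys):
    """Linear mechanism with a bias input: two samples determine the line."""
    (x0, x1), (y0, y1) = xs[:2], ys[:2]
    slope = (y1 - y0) / (x1 - x0)
    return lambda x: slope * (x - x0) + y0

def fit_rbf(xs, ys, width=0.5):
    """Gaussian basis functions, one hidden unit centred on each sample."""
    def phi(x, c):
        return math.exp(-((x - c) ** 2) / (2.0 * width ** 2))
    w = solve([[phi(x, c) for c in xs] for x in xs], ys)
    return lambda x: sum(wi * phi(x, c) for wi, c in zip(w, xs))

def max_error(model, lo=0.0, hi=4.0, n=400):
    """Largest deviation from the target on a dense grid."""
    grid = [lo + (hi - lo) * i / n for i in range(n + 1)]
    return max(abs(model(x) - target(x)) for x in grid)

two = [0.0, 4.0]
five = [0.0, 1.0, 2.0, 3.0, 4.0]
linear = fit_linear(two, [target(x) for x in two])
rbf2 = fit_rbf(two, [target(x) for x in two])
rbf5 = fit_rbf(five, [target(x) for x in five])

print(f"linear net, 2 samples: {max_error(linear):.3f}")
print(f"radial net, 2 samples: {max_error(rbf2):.3f}")
print(f"radial net, 5 samples: {max_error(rbf5):.3f}")
```

Under these assumptions the linear net reproduces the function from two samples, whereas the radial net keeps improving as samples, and with them hidden units, are added. The sketch only illustrates that the two mechanisms need different numbers of samples; it is not a general comparison of the two types of net.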