an output for every possible input.
In fact, the RBF network is a universal approximator, just like the MLP. To see this, imagine that we fill the entire space with RBF nodes equally spaced in all directions, so that their receptive fields just overlap, as in Figure 5.4. Now, no matter what the input, there is an RBF node that recognises it and can respond appropriately to it. If we need to make the outputs more finely grained, then we just add more RBFs at the relevant positions and reduce the radius of the receptive fields; and if we don’t care, we can just make the receptive fields of each node bigger and use fewer of them.
RBF networks never have more than one layer of non-linear neurons, in contrast to the MLP. However, there are many similarities between the two networks: they are both supervised learning algorithms that form universal approximators. In fact, it turns out that you can turn one into the other because the two types of neuron firing rules (RBFs based on distance and MLPs on inner product) are related. This fact will turn up in another form in Section 14.1.3. The most important difference between them is the fact that the MLP uses the hidden nodes to separate the space using hyperplanes, which are global, while the RBF uses them to match functions locally.
In an RBF network, when we see an input several of the nodes will activate to some degree or other, according to how close they are to the input, and the combination of these activations will enable the network to decide how to respond. Using the analogy we had earlier of looking at a star, suppose that the star is replaced by a torch, and somebody is signalling directions to us. If the torch is high (at 12 o’clock) we go forwards, low (6 o’clock) we go backwards, and left and right (9 and 3 o’clock, respectively) mean that we turn. The RBF network would work in such a way that if the torch was at 2 o’clock or thereabouts, then we would do some of the 12 o’clock action and a bit more of the 3 o’clock action, but none of the 6 or 9 o’clock actions. So we would move forwards and to the right. This adding
up of the contributions from the different basis functions according to how active they are means that our responses are local.
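The torch analogy can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the text: four Gaussian nodes placed at the 12, 3, 6, and 9 o'clock positions on a unit circle, a hand-picked width sigma of 0.5, and the hypothetical helper names clock_to_xy and activations.

```python
import numpy as np

def clock_to_xy(hour):
    """Convert a clock position (12 at the top) to a point on the unit circle."""
    angle = np.pi / 2 - hour * (2 * np.pi / 12)
    return np.array([np.cos(angle), np.sin(angle)])

# Hypothetical set-up: one RBF node per signalled direction.
centres = {hour: clock_to_xy(hour) for hour in (12, 3, 6, 9)}
sigma = 0.5  # assumed receptive field width

def activations(x):
    """Gaussian activation of each node for the input position x."""
    return {hour: np.exp(-np.sum((x - c) ** 2) / (2 * sigma ** 2))
            for hour, c in centres.items()}

# The torch is held at 2 o'clock.
acts = activations(clock_to_xy(2))
for hour, a in acts.items():
    print(f"{hour:2d} o'clock node: {a:.3f}")
```

With these settings the 3 o'clock node responds most strongly, the 12 o'clock node responds less, and the 6 and 9 o'clock nodes are close to silent, so a weighted sum of the four actions is dominated by "turn right" with some "go forwards".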
5.2.1 Training the RBF Network
In the MLP we used back-propagation of error to adjust first the output layer weights, and then the hidden layer weights. We can do exactly the same thing with the RBF network, by differentiating the relevant activation functions. However, there are simpler and better alternatives for RBF networks. They do not need to compute gradients for the hidden nodes and so they are significantly faster. The important thing to notice is that the two types of node provide different functions, and so they do not need to be trained together. The purpose of the RBF nodes in the hidden layer is to find a non-linear representation of the inputs, while the purpose of the output layer is to find a linear combination of those hidden nodes that does the classification. So we can split the training into two parts: position the RBF nodes, and then use the activations of those nodes to train the linear outputs. This makes things much simpler. For the linear outputs we can use an algorithm that we already know:
the Perceptron (Section 3.3).
However, we need to work out something different for the first layer weights, which control the positions of the RBF nodes. One thing that we can do is to avoid the problem of training completely by randomly picking some of the datapoints to act as basis locations.
Provided that our training data are representative of the full dataset, this often turns out to be a good solution. The other thing that we can do is to try to position the nodes so that they are representative of typical inputs. This is precisely the problem solved by several unsupervised learning methods, and we are going to see several algorithms for doing this in Chapter 14. For the RBF network, the most common one is the k-means algorithm that is described in Section 14.1. Thus, training an RBF network can be reduced to using two other algorithms that are commonly used in machine learning, one after the other. This is known as a hybrid algorithm, since it combines supervised and unsupervised learning.
The Radial Basis Function Algorithm
• Position the RBF centres by either:
– using the k-means algorithm to initialise the positions of the RBF centres OR
– setting the RBF centres to be randomly chosen datapoints
• Calculate the activations of the RBF nodes using Equation (5.1)
• Train the output weights by either:
– using the Perceptron OR
– computing the pseudo-inverse of the activations of the RBF centres (this will be described shortly)
To implement this in Python we can simply import the other algorithms and use them directly (if they are in different directories, then you need to add them to the PYTHONPATH variable; precisely how to do this varies between programming IDEs). The training is then very simple:
import numpy as np

def rbftrain(self,inputs,targets,eta=0.25,niterations=100):

    if self.usekmeans==0:
        # Version 1: set RBFs to be datapoints
        # (np.random.permutation gives a shuffled index array; shuffling a
        # range object in place fails in Python 3)
        indices = np.random.permutation(self.ndata)
        for i in range(self.nRBF):
            self.weights1[:,i] = inputs[indices[i],:]
    else:
        # Version 2: use k-means
        self.weights1 = np.transpose(self.kmeansnet.kmeanstrain(inputs))

    # Gaussian activation of each RBF node for every input
    for i in range(self.nRBF):
        self.hidden[:,i] = np.exp(-np.sum((inputs - np.ones((1,self.nin))*self.weights1[:,i])**2,axis=1)/(2*self.sigma**2))
    if self.normalise:
        # Divide each row by its sum (the last column is the bias node)
        self.hidden[:,:-1] /= np.transpose(np.ones((1,np.shape(self.hidden)[0]))*self.hidden[:,:-1].sum(axis=1))

    # Call Perceptron without bias node (since it adds its own)
    self.perceptron.pcntrain(self.hidden[:,:-1],targets,eta,niterations)
In fact, because of this separation of the two learning parts, we can do better than a Perceptron for training the output weights. For each input vector, we compute the activation of all the hidden nodes, and assemble them into a matrix $G$. So each element of $G$, say $G_{ij}$, consists of the activation of hidden node $j$ for input $i$. The outputs of the network can then be computed as $y = GW$ for a set of weights $W$. Except that we don’t know what the weights are (that is what we set out to compute), and we want to choose them using the target outputs $t$.
If we were able to get all of the outputs correct, then we could write $t = GW$. Now we just need to calculate the matrix inverse of $G$, to get $W = G^{-1}t$. Unfortunately, there is a little problem here. The matrix inverse is only defined if a matrix is square, and this one probably isn’t; there is no reason why the number of hidden nodes should be the same as the number of training inputs. In fact, we hope it isn’t, since that would probably be serious overfitting.
Fortunately, there is a well-defined pseudo-inverse $G^{+}$ of a matrix, which is $G^{+} = (G^{T}G)^{-1}G^{T}$ (this form requires the columns of $G$ to be linearly independent, so that $G^{T}G$ can be inverted). Since the point of the inverse $G^{-1}$ of a matrix $G$ is that $G^{-1}G = I$, where $I$ is the identity matrix, the pseudo-inverse is the matrix that satisfies $G^{+}G = I$. If $G$ is a square, non-singular (i.e., with non-zero determinant) matrix then $G^{+} = G^{-1}$. In NumPy the pseudo-inverse is np.linalg.pinv(). This gives us an alternative to the Perceptron network that is even faster, since the training only needs one iteration:
self.weights2 = np.dot(np.linalg.pinv(self.hidden),targets)
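To see the two-stage training working outside the class, here is a minimal self-contained sketch. It is not the book's rbf class: the toy sine data, the choice of ten nodes, the width sigma = 1.0, and the helper name hidden_activations are all assumptions made for illustration. The centres are randomly chosen datapoints and the output weights come from the pseudo-inverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (an assumption for illustration): noisy samples of sin(x).
inputs = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
targets = np.sin(inputs) + 0.05 * rng.standard_normal(inputs.shape)

n_rbf = 10    # number of hidden nodes
sigma = 1.0   # shared receptive field width, picked by hand here

# Step 1: position the centres on randomly chosen datapoints.
centres = inputs[rng.choice(len(inputs), size=n_rbf, replace=False)]

def hidden_activations(x, centres, sigma):
    """Matrix G with G[i, j] = Gaussian activation of node j for input i."""
    sq_dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-sq_dist / (2 * sigma ** 2))
    # Append a constant column so that the linear output layer has a bias.
    return np.hstack([G, np.ones((len(x), 1))])

# Step 2: compute the activations, then solve for the output weights in one step.
G = hidden_activations(inputs, centres, sigma)
weights2 = np.linalg.pinv(G) @ targets

predictions = G @ weights2
mse = np.mean((predictions - targets) ** 2)
print("training mean squared error:", mse)
```

Note that there is no loop over training iterations anywhere: once the centres are fixed, the output weights are the least-squares solution of $t = GW$.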
There is one thing that we haven’t considered yet, and that is the size of the receptive fields σ for the nodes. We can avoid the problem by giving all of the nodes the same size, and
testing lots of different sizes using a validation set to select one that works. Alternatively, we can select it in advance by arguing that the important thing is that the whole space is covered by the receptive fields of the entire set of basis functions, and so the width of the Gaussians should be set according to the maximum distance between the locations of the hidden nodes ($d$) and the number of hidden nodes. The most common choice is to pick the width of the Gaussian as $\sigma = d/\sqrt{2M}$, where $M$ is the number of RBFs.
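The width rule is easy to compute once the centres are known. The following is a small sketch; the function name rbf_width and the example centres are hypothetical.

```python
import numpy as np

def rbf_width(centres):
    """Width sigma = d / sqrt(2M), with d the largest distance between centres."""
    centres = np.asarray(centres, dtype=float)
    n_centres = centres.shape[0]
    # All pairwise Euclidean distances between the centres.
    diffs = centres[:, None, :] - centres[None, :, :]
    d_max = np.sqrt((diffs ** 2).sum(axis=2)).max()
    return d_max / np.sqrt(2 * n_centres)

# Four example centres in two dimensions; the furthest pair is 5 apart.
centres = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0], [2.0, 0.0]])
sigma = rbf_width(centres)
print(sigma)
```

Here $d = 5$ and $M = 4$, so the rule gives $\sigma = 5/\sqrt{8}$.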
There is another way to deal with the fact that there may be inputs that are outside the receptive fields of all nodes, and that is to use normalised Gaussians, so that there is always at least one node firing: the node that is closest to the current input, even if that is a long way off. It is a modification of Equation (5.1) and it looks like the soft-max function:
$$g(x, w, \sigma) = \frac{\exp\left(-\|x - w\|^2 / 2\sigma^2\right)}{\sum_i \exp\left(-\|x - w_i\|^2 / 2\sigma^2\right)}. \qquad (5.2)$$

Using the RBF network on the iris dataset that was used in Section 4.4.3 with five RBF centres gives similar results to the MLP, with well over 90% classification accuracy.
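Returning to the normalised Gaussians of Equation (5.2), the following sketch shows the effect on an input that is far away from every centre. The function name normalised_rbf and the example values are assumptions. Subtracting the row maximum before exponentiating is a standard numerical trick that is not in the text: it leaves the ratio unchanged but avoids dividing zero by zero.

```python
import numpy as np

def normalised_rbf(x, centres, sigma):
    """Normalised Gaussian activations: each row sums to one."""
    sq_dist = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    logits = -sq_dist / (2 * sigma ** 2)
    # Shift each row so its largest entry is 0; the normalised ratio is the
    # same, but the exponentials can no longer all underflow to zero.
    logits -= logits.max(axis=1, keepdims=True)
    g = np.exp(logits)
    return g / g.sum(axis=1, keepdims=True)

centres = np.array([[0.0], [1.0], [2.0]])
# The second input lies far outside every receptive field.
x = np.array([[0.9], [50.0]])
acts = normalised_rbf(x, centres, sigma=0.5)
print(acts)
```

For the distant input, the node centred at 2.0 is the closest, and after normalisation it takes essentially all of the activation, whereas the unnormalised Gaussians would all be numerically zero.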
With the MLP, one question that we failed to find a nice answer to was how to pick the number of hidden nodes, and we were reduced to training lots of networks with different numbers of nodes and using the one that performed best on the validation set. The same problem occurs with the RBF network.
In the RBF network the activations of the hidden nodes are based on the distance between the current input and the weights. There are various measures of distance that we can use, as will be discussed in Section 7.2.3; we generally use the Euclidean distance. These distances can be computed for any number of dimensions, but as the number of dimensions increases, something rather worrying happens, which is that we start needing more RBF nodes to cover the space. The number of input dimensions has a profound effect on learning, something that has a suitably impressive name that we have already seen: the curse of dimensionality (Section 2.1.2). For RBFs, the effect of the curse is that the amount of the space covered by an RBF with a fixed receptive field will decrease, and so we will need many more of them to cover the space.
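We can put rough numbers on this with a short calculation. As a simplifying assumption, treat the receptive field as a hard ball of radius 0.5 sitting inside a unit hypercube, and use the standard formula for the volume of a ball in $n$ dimensions, $\pi^{n/2} r^n / \Gamma(n/2 + 1)$. The function name ball_fraction is hypothetical.

```python
import math

def ball_fraction(dim, radius=0.5):
    """Fraction of the unit hypercube's volume covered by one ball."""
    ball_volume = math.pi ** (dim / 2) * radius ** dim / math.gamma(dim / 2 + 1)
    # The unit hypercube has volume 1, so the volume is also the fraction.
    return ball_volume

fractions = {dim: ball_fraction(dim) for dim in (1, 2, 3, 5, 10)}
for dim, frac in fractions.items():
    print(f"{dim:2d} dimensions: {frac:.4f}")
```

In one dimension the ball covers the whole unit interval, in two dimensions it covers $\pi/4$ of the square, and by ten dimensions it covers well under 1% of the hypercube.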
5.3 INTERPOLATION AND BASIS FUNCTIONS
One of the problems that we looked at in Chapter 1 was that of function approximation:
given some data, find a function that goes through the data without overfitting to the noise, so that values between the known datapoints can be inferred or interpolated. The RBF network solves this problem by each of the basis functions making a contribution to the output whenever the input is within its receptive field. So several RBF nodes will probably respond for each input.
We are now going to make the problem a bit simpler. We won’t allow the receptive fields to overlap, and we’ll space them out so that they just meet up with each other. Obviously, we won’t need the Gaussian part that decides how much each one matches now, either. If the datapoint is within the receptive field of this function then we listen only to this function, otherwise we ignore it and listen to some other function. If each function just returns the average value within its patch, then for one-dimensional data we get a histogram output, as is shown in Figure 5.5. We can extend this a bit further so that the lines are not horizontal, but instead reflect the first derivative of the curve at that point, as is shown in Figure 5.6.
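The histogram version can be sketched as follows. The sine curve, the eight equal-width bins, and the function name predict are assumptions for illustration; each bin plays the role of one non-overlapping receptive field.

```python
import numpy as np

# Toy data (an assumption for illustration): samples from a smooth curve.
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x)

n_bins = 8
edges = np.linspace(0, 1, n_bins + 1)
# Index of the bin (receptive field) that each datapoint falls into.
bin_index = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)

# Each basis function returns the average target value within its own patch.
bin_means = np.array([y[bin_index == b].mean() for b in range(n_bins)])

def predict(x_new):
    """Listen only to the one function whose patch contains x_new."""
    idx = np.clip(np.digitize(x_new, edges) - 1, 0, n_bins - 1)
    return bin_means[idx]

print(predict(np.array([0.05, 0.3, 0.8])))
```

Any two inputs in the same bin get exactly the same output, which is what produces the flat steps of the histogram.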
This is all right, but we might want the output to be continuous, so that the line within the first bin meets up with the line in the second bin at the boundary, so we can add the extra constraint that the lines have to meet up as well. This gives the curve in Figure 5.7.
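One simple way to get lines that meet at the bin boundaries is to place a knot on each boundary and join neighbouring knots with straight lines, which np.interp does directly. This is a sketch under assumptions: the toy curve is known at the knots, and the nine knots are equally spaced. It is not necessarily how Figure 5.7 was produced.

```python
import numpy as np

def curve(x):
    # Toy target function, assumed for illustration.
    return np.sin(2 * np.pi * x)

# Knots sit on the bin boundaries; neighbouring line segments share a knot,
# so they are forced to meet there.
knots = np.linspace(0, 1, 9)
knot_values = curve(knots)

def predict(x_new):
    """Continuous piecewise-linear approximation through the knots."""
    return np.interp(x_new, knots, knot_values)

x_test = np.linspace(0, 1, 500)
max_error = np.max(np.abs(predict(x_test) - curve(x_test)))
print("largest error on the test grid:", max_error)
```

Because each pair of adjacent segments shares its end point, the output no longer jumps as the input crosses from one bin into the next.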
FIGURE 5.5 Top: Curve showing a function. Second: A set of datapoints from the curve.
Third: Putting a straight horizontal line through each point creates a histogram that
describes an approximation to the curve. Bottom: That approximation.
FIGURE 5.6