Classifiers and mapping networks - Realising a simple model of perception using RBF

Chapter 6 Realising a simple model of perception using RBF

6.2 Classifiers and mapping networks

Classifiers or mapping networks must take the various features emanating from the feature extractors and combine them into a known form. For example, in one of Kohonen's applications of his feature map (his phonetic typewriter), identical phonemes were distributed at various places within the map, separated by other phonemes [Koh88b]. The mapping network would map identical phonemes together and combine the result with the stream of phonemes already received to enable word classification to take place.

One of the earliest classifiers was the perceptron, which was reviewed in chapters 1 and 3. It is perhaps ironic that this simple one layer structure was responsible both for the initial interest in learning machines in the 1960s and, because of its theoretical limitations, for the decline in interest in neural networks as a research discipline in the 1970s. Although interesting, the basic one layer device could not produce an arbitrary mapping from input to output. Such a mapping could be produced by a three layer device, but no training algorithm existed for such structures when the limitations of the perceptron were being explored. However, algorithms did emerge in the latter part of the 1970s and the early 1980s; two of the most notable were backpropagation and simulated annealing, both of which were considered in chapter 3.

In general, the training time for the simulated annealing algorithm is impractically large. Backpropagation, although better in this regard, is still slow to converge. In chapter 3, we surmised that this is due, in part, to the flexibility of the structure. Briefly, the nodes in the first layer of a three layer MLP form the sides of geometric figures which surround the region of interest in the input state space. However, any node in the first layer can form any one of the sides. Hence, it would seem that a great deal of the training time is wasted in sorting out which node performs which task. To address this, we proposed a two layer network based on the RBF. By using predefined regions, this excess flexibility was removed to a large extent. Also, by using hyperspherical decision regions that are, by definition, continuous and continuously differentiable, we were able to apply a modified form of the delta rule to train the new network. The transfer function of the RBF network and its training equations (for the hyperellipsoid) were described in chapter 3 and are summarised again here for convenience:

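$$net_k = \sum_j \left[ O_j \, W_{jk} \right] + W_{ck} \tag{6.2.1}$$

$$O_k = f(net_k) = \frac{1}{1 + \exp(-net_k)} \tag{6.2.2}$$

$$net_j = \sum_i \left[ A_{ij} \, (X_i - W_{ij})^2 \right] \tag{6.2.3}$$

$$O_j = f(net_j) = \exp(-net_j) \tag{6.2.4}$$

and the training equations are:

$$W_{jk}(n+1) = W_{jk}(n) + \Delta W_{jk} \tag{6.2.5}$$

$$W_{ck}(n+1) = W_{ck}(n) + \Delta W_{ck} \tag{6.2.6}$$

$$W_{ij}(n+1) = W_{ij}(n) + \Delta W_{ij} \tag{6.2.7}$$

$$A_{ij}(n+1) = A_{ij}(n) + \Delta A_{ij} \tag{6.2.8}$$

where

$$\Delta W_{jk} = g \, (R_k - O_k) \, O_k \, (1 - O_k) \, O_j \tag{6.2.9}$$

$$\Delta W_{ck} = g \, (R_k - O_k) \, O_k \, (1 - O_k) \, O_c \tag{6.2.10}$$

$$\Delta W_{ij} = -g \, \frac{\partial E}{\partial O_j} \, (-O_j) \, \bigl( -2 A_{ij} (X_i - W_{ij}) \bigr) \tag{6.2.11}$$

$$\Delta A_{ij} = -g \, \frac{\partial E}{\partial O_j} \, (-O_j) \, (X_i - W_{ij})^2 \tag{6.2.12}$$

and

$$\frac{\partial E}{\partial O_j} = -\sum_k \left[ (R_k - O_k) \, O_k \, (1 - O_k) \, W_{jk} \right] \tag{6.2.13}$$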

where $O_j$, $O_k$ is the output of neuron $j$ in layer 1, $k$ in layer 2; $W_{ij}$ and $W_{jk}$ are the weights which connect node $i$ to node $j$ and node $j$ to node $k$; $A_{ij}$ is a positive constant which plays the role of the variance of the Gaussian distribution in each dimension and therefore determines the width of the region; $net_j$, $net_k$ is the total input to neuron $j$, $k$; $X_i$ is input $i$; $R_j$, $R_k$ are the target values of the network during training; and the constant $g$ determines the learning rate.

Simulations have shown (see chapter 3) that the convergence rate of the network is considerably better than that of the MLP, which gives some credence to the flexibility argument outlined earlier. One may argue that training only has to be done once, so the length of training is relatively unimportant. However, at present there is very little theoretical knowledge with which to predetermine the size of network required to perform a particular task. Hence many trials are needed before a network with the appropriate performance is found. Thus a network that offers perhaps a 10- to 100-fold reduction in training time may make a large difference to the time needed to determine a suitable network structure, and may therefore reduce the cost of a neural network solution to a level that is viable compared with one using conventional techniques. Also, one of the advantages of an ANN solution is that the network can be retrained, either continuously or at predetermined intervals, to allow for a change in environment. In this circumstance, a reduction in training time will again make a large difference to the viability of a solution, since the hardware cost of reaching a solution within the required time would increase, at best, linearly with the training time.
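The transfer function and update rules (6.2.1) to (6.2.13) can be illustrated with a minimal Python sketch. It is not the implementation used for the simulations in chapter 3. The function names `forward` and `train_step`, the nested-list layout of the weights, the error measure E = ½ Σ (R_k − O_k)², the bias input O_c = 1 and the lower clamp `a_min` on the width terms are all assumptions made for the illustration. The first-layer update is written as the chain rule, with the factor −O_j from the derivative of exp(−net_j) and the factor −2A_ij(X_i − W_ij) from the derivative of net_j with respect to W_ij.

```python
import math


def forward(x, W_in, A, W_out, W_c):
    """Forward pass: Gaussian first layer (6.2.3, 6.2.4), sigmoid second layer (6.2.1, 6.2.2)."""
    # W_in[j][i] is the centre and A[j][i] the width term of hidden node j along input i.
    net_j = [sum(a * (xi - w) ** 2 for a, xi, w in zip(A[j], x, W_in[j]))
             for j in range(len(W_in))]
    O_j = [math.exp(-n) for n in net_j]
    # W_out[j][k] connects hidden node j to output node k; W_c[k] is the bias weight.
    net_k = [sum(O_j[j] * W_out[j][k] for j in range(len(O_j))) + W_c[k]
             for k in range(len(W_c))]
    O_k = [1.0 / (1.0 + math.exp(-n)) for n in net_k]
    return O_j, O_k


def train_step(x, R, W_in, A, W_out, W_c, g=0.1, a_min=1e-6):
    """Apply one update of (6.2.5)-(6.2.13) in place; return the error before the update."""
    O_j, O_k = forward(x, W_in, A, W_out, W_c)
    # Assumed error measure: half the summed squared output error.
    error = 0.5 * sum((r - o) ** 2 for r, o in zip(R, O_k))
    delta_k = [(r - o) * o * (1.0 - o) for r, o in zip(R, O_k)]
    # (6.2.13), evaluated with the second-layer weights before they are changed.
    dE_dOj = [-sum(delta_k[k] * W_out[j][k] for k in range(len(O_k)))
              for j in range(len(O_j))]
    for j in range(len(O_j)):
        for i in range(len(x)):
            diff = x[i] - W_in[j][i]
            dW = -g * dE_dOj[j] * (-O_j[j]) * (-2.0 * A[j][i] * diff)  # (6.2.11)
            dA = -g * dE_dOj[j] * (-O_j[j]) * diff ** 2                # (6.2.12)
            W_in[j][i] += dW                                           # (6.2.7)
            # Assumption: clamp so that the width term stays positive.
            A[j][i] = max(A[j][i] + dA, a_min)                         # (6.2.8)
        for k in range(len(O_k)):
            W_out[j][k] += g * delta_k[k] * O_j[j]                     # (6.2.5), (6.2.9)
    for k in range(len(O_k)):
        W_c[k] += g * delta_k[k] * 1.0  # (6.2.6), (6.2.10) with O_c assumed to be 1
    return error


if __name__ == "__main__":
    # Hypothetical toy set: points near (0, 0) belong to the class, (1, 1) does not.
    patterns = [([0.0, 0.0], [1.0]), ([0.1, -0.1], [1.0]), ([1.0, 1.0], [0.0])]
    W_in, A = [[0.2, 0.2]], [[1.0, 1.0]]
    W_out, W_c = [[0.5]], [0.0]
    for epoch in range(200):
        total = sum(train_step(x, R, W_in, A, W_out, W_c, g=0.5) for x, R in patterns)
    print("summed error in final epoch:", total)
```

Each first-layer node in this sketch owns one closed region, with centre `W_in[j]` and widths `A[j]`, so the nodes do not have to negotiate which side of a figure each of them forms. This is the reduction in flexibility referred to earlier.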

As far as the biological plausibility of training algorithms based on the generalised delta rule is concerned, the requirement for pre-classified data, with this classification being an integral part of the training strategy, is thought to be a distinct disadvantage. How does the system know what the required output should be? It is this fundamental difficulty that we shall address in the next section.

In document The theory, design, unification and implementation of a class of artificial neural network (Page 137-140)