4 Artificial Neural Networks
4.5 Growth Algorithms
The fundamental issue in attaching a neural network to the output of MCRDR is how to handle the changing size of the input space. As the input space to the neural network increases, the network structure must also grow. The most promising approach was seen in the work on growth-based neural network algorithms. These are a group of algorithms that learn not only the weights on connections (as backpropagation and the basic RBF algorithm do) but simultaneously the structure of the network.
Growth algorithms were developed to address the inadequacies of the more standard algorithms. They have been found to converge quickly and to overcome problems of local minima (Bose and Liang 1996). Their primary difficulty is the risk of overfitting, resulting in poor generalisation. Growth algorithms generally take a ‘divide and conquer’ approach to learning a solution to a problem. For instance, such a system may start with only a single threshold logic unit (TLU) and then progressively add nodes as they are required. Overfitting occurs when a new node is added to capture the difference between cases for every case, thereby providing no generalisation.
4.5.1 Upstart Algorithm
The Upstart algorithm, suggested by Frean (1990), starts with a single TLU, u0. The system is then trained for a given length of time using a perceptron style of algorithm. After training, u0 may still make two possible kinds of error: wrongly ON or wrongly OFF. When this occurs, new TLUs are added to capture the errors. For instance, another TLU, u1n, could be added which only activates for exemplars that turned u0 ON incorrectly. This new node sends a large inhibitory signal to u0, thereby turning u0 OFF. It is trained using a reduced form of the original dataset, from which the exemplars for which u0 was correctly OFF are removed. Similarly, wrongly OFF is corrected by adding another node, u1p. If the new nodes cannot learn a correct solution within a given training period, then additional nodes are added in the same way.
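The construction of the reduced datasets and the effect of the correcting signal can be sketched as follows. This is a minimal illustration under stated assumptions, not Frean's implementation: the function names, the `BIG` correction weight and the hand-set OR and AND units are hypothetical, and the dataset for u1p is built by mirroring the description given for u1n.

```python
BIG = 10.0  # assumed size of the correcting weights; must exceed any |net| of u0


def step(net):
    """Threshold logic unit output: ON (1) if the net input is positive."""
    return 1 if net > 0 else 0


def daughter_training_sets(patterns, targets, outputs):
    """Build the reduced datasets for u1n and u1p from u0's outputs.

    u1n must fire only where u0 was wrongly ON, so exemplars where u0 was
    correctly OFF are dropped (inhibiting an OFF unit changes nothing).
    u1p is treated symmetrically (an assumption): it must fire only where
    u0 was wrongly OFF, and exemplars where u0 was correctly ON are dropped.
    """
    u1n_set, u1p_set = [], []
    for x, t, o in zip(patterns, targets, outputs):
        if not (t == 0 and o == 0):  # keep everything except correctly OFF
            u1n_set.append((x, 1 if (o == 1 and t == 0) else 0))
        if not (t == 1 and o == 1):  # keep everything except correctly ON
            u1p_set.append((x, 1 if (o == 0 and t == 1) else 0))
    return u1n_set, u1p_set


def corrected_output(u0_net, u1n_out, u1p_out):
    """u0's output after the inhibitory (u1n) and excitatory (u1p) signals."""
    return step(u0_net - BIG * u1n_out + BIG * u1p_out)


# XOR example: u0 is hand-set to behave like OR, so it is wrongly ON for (1, 1).
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 0]
u0_nets = [x1 + x2 - 0.5 for x1, x2 in patterns]
u0_outputs = [step(n) for n in u0_nets]

u1n_set, u1p_set = daughter_training_sets(patterns, targets, u0_outputs)

# A hand-set AND unit satisfies u1n's dataset; u1p has nothing to correct here.
u1n_outputs = [step(x1 + x2 - 1.5) for x1, x2 in patterns]
final = [corrected_output(n, d, 0) for n, d in zip(u0_nets, u1n_outputs)]
```

In this example the dataset for u1n contains only three exemplars, and the single inhibitory unit is enough to make the corrected outputs match the XOR targets.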
4.5.2 Divide and Conquer
The basic ‘divide and conquer’ algorithm, proposed by Liang (1990; 1991), is simple in its approach. The network can be either a single- or two-layer network, shown in Figure (4-2a), which grows wider instead of growing deeper like the Upstart algorithm. It begins with a single hidden TLU, which attempts to learn the best linear separation between classes within a certain training period. If the problem is not linearly separable, this first node is treated as having broken the input patterns into two subsets. At this stage a new neuron can be added to attempt to divide the input patterns further into two smaller subsets. This process continues while the system cannot classify all cases correctly. It will always solve any problem with a finite input pattern space, because at some stage a subset will contain only two cases needing separation, and these can always be divided, giving the correct answer for each. While this method is very effective, it is susceptible to generating too many hidden nodes, causing a lack of generalisation. One solution is to include a merging phase in which, after successfully dividing the input patterns, the system removes redundant nodes, improving generalisation.
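The termination argument rests on the fact that two distinct input patterns can always be separated by a single TLU. One way of constructing such a unit is sketched below. The construction (weights along the difference vector, decision boundary through the midpoint) and the function names are illustrative assumptions and are not taken from Liang's algorithm.

```python
def separating_tlu(a, b):
    """Return (weights, bias) of a TLU that is ON for pattern a and OFF for b.

    The weight vector points from b to a and the decision boundary passes
    through the midpoint of the two patterns, which requires a != b.
    """
    if tuple(a) == tuple(b):
        raise ValueError("identical patterns cannot be separated")
    weights = [ai - bi for ai, bi in zip(a, b)]
    bias = -sum(ai * ai - bi * bi for ai, bi in zip(a, b)) / 2.0
    return weights, bias


def tlu_output(weights, bias, x):
    """Threshold logic unit: ON (1) if the weighted sum plus bias is positive."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net > 0 else 0


# Two binary patterns that differ in a single input.
w, b = separating_tlu((1, 0, 1), (1, 1, 1))
```

With these weights the net input is half the squared distance between the patterns for `a` and the negative of that value for `b`, so the two cases always fall on opposite sides of the threshold.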
4.5.3 Tiling Algorithm
The Tiling algorithm, developed by Mezard and Nadal (1990), grows a multilayer network, Figure (4-2a). It once again starts with a single TLU, called the master unit, which is trained for a period of time. Should this node be unable to learn a separation between the classes, a TLU is introduced for each of the two subsets that contain exemplars from both classes. These new TLUs, referred to as ancillary units, are then trained with a subset of exemplars. Once again, if the ancillary units are unable to separate their subset into single-class exemplars, more ancillary units are added. Eventually, ancillary units will be found that separate the inputs correctly. When this is accomplished, a new layer is created with a single new master unit. This new layer takes its input from the layer that has just been fully trained and then follows the same procedure, possibly adding ancillary units of its own. Once a new layer is created that can correctly classify all inputs without adding any ancillary units, its master unit becomes the output node of the network.
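The test that decides whether a layer needs further ancillary units can be sketched as a grouping of exemplars by the layer's output pattern. The sketch below is a minimal illustration that assumes hand-set unit outputs rather than trained units; the function name `mixed_subsets` is hypothetical.

```python
from collections import defaultdict


def mixed_subsets(layer_codes, targets):
    """Group exemplars by the current layer's output code and return the
    groups that still contain exemplars from both classes.

    layer_codes[i] is the tuple of unit outputs produced for exemplar i and
    targets[i] is its class (0 or 1). Each returned group is a list of
    exemplar indices for which a further ancillary unit would be trained.
    """
    groups = defaultdict(list)
    for i, code in enumerate(layer_codes):
        groups[tuple(code)].append(i)
    return {code: idx for code, idx in groups.items()
            if len({targets[i] for i in idx}) > 1}


# XOR with a master unit that behaves like OR: exemplars (0,1), (1,0) and
# (1,1) all receive the code (1,), but (1,1) belongs to the other class.
targets = [0, 1, 1, 0]
codes_master_only = [(0,), (1,), (1,), (1,)]
needs_work = mixed_subsets(codes_master_only, targets)

# After adding an ancillary unit that behaves like AND, every code is pure.
codes_with_ancillary = [(0, 0), (1, 0), (1, 0), (1, 1)]
done = mixed_subsets(codes_with_ancillary, targets)
```

An empty result indicates that every distinct output pattern of the layer corresponds to a single class, so the layer can be passed on as input to the next master unit.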
4.5.4 Cascade Correlation (CC)
The popular Cascade Correlation (CC) network, by Fahlman and Lebiere (1990), starts with only an input and output layer. Over time, hidden nodes are added to the network, one at a time, as they are needed. The network is trained using supervised learning, and the total number of hidden nodes added depends on the error bound set for the network. As each hidden node is added, it receives inputs from all the input nodes plus any previously created hidden nodes. Initially, it is not connected to the output of the system. It is then trained, usually using a gradient descent type algorithm. Once the training results in a stable local minimum, the inputs to the hidden node are frozen and it is connected to the output. If the overall result of the network is not satisfactory, additional hidden nodes are added and trained in the same way. The resulting structure is not a layered network but instead a cascade of hidden nodes between the inputs and outputs.
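The cascaded wiring, in which each hidden node sees every input and every earlier hidden node, can be illustrated with a forward pass. This is a sketch of the connectivity only and does not train anything. The function name, the weight layout, the linear output node and the `tanh` activation are assumptions made for the example.

```python
import math


def cascade_forward(x, hidden_weights, output_weights):
    """Forward pass through a cascade of hidden nodes.

    hidden_weights[k] holds the frozen input weights of the k-th hidden node:
    one weight per network input, one per earlier hidden node, then a bias,
    so it must have len(x) + k + 1 entries. output_weights has one weight per
    input and per hidden node, then a bias.
    """
    activations = list(x)
    for k, w in enumerate(hidden_weights):
        if len(w) != len(x) + k + 1:
            raise ValueError(f"hidden node {k} has the wrong number of weights")
        # zip stops at the last activation, so w[-1] is left over as the bias.
        net = sum(wi * ai for wi, ai in zip(w, activations)) + w[-1]
        activations.append(math.tanh(net))  # tanh is an assumed activation
    if len(output_weights) != len(activations) + 1:
        raise ValueError("output node has the wrong number of weights")
    return (sum(wi * ai for wi, ai in zip(output_weights, activations))
            + output_weights[-1])


# Two inputs and two cascaded hidden nodes: the second hidden node has three
# incoming weights (two inputs plus the first hidden node) and a bias.
y = cascade_forward(
    [1.0, 2.0],
    hidden_weights=[[0.5, -0.25, 0.0], [0.0, 0.0, 1.0, 0.0]],
    output_weights=[1.0, 1.0, 5.0, 5.0, 0.5],
)
```

The growing length of each entry in `hidden_weights` is what distinguishes the cascade from a layered network, where every node in a layer would have the same number of inputs.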
4.5.5 Resource Allocating Network (RAN)
One of the problems associated with the RBF network, described earlier in section 4.4, is that the hyperellipsoids need to be defined prior to learning, either from previous knowledge or through the use of a heuristic. One popular extension to the RBF algorithm is to provide a means of allocating new hyperellipsoids dynamically during the training process. This was first proposed by Moody and Darken (1988; 1989) and later extended by Moody (1989), where the hyperellipsoids learn their centres and widths during training. Platt’s (1991) Resource Allocating Network (RAN) extended this further by also adding new nodes as well as learning the appropriate centres, which effectively turned the RBF network into a growth-based algorithm. One aspect of interest to the work in this thesis is that the RAN-based RBF network allocates units in such a way that they respond only to a narrow region of the input space; newly allocated units therefore do not interfere with previously allocated units.
Unlike many other growth algorithms, the RAN starts with no hidden nodes. The network selects particular input patterns as they are presented and allocates a hidden node to capture each selected pattern. Input patterns are selected when they are sufficiently distant from the previously seen cases. In other words, a hidden node is created when a deficiency is found in a particular area of the input space. Generally, the method initially allocates neurons that are quite coarse. However, over time the widths of the basis functions are reduced, producing a much finer representation.
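The allocation rule can be sketched as a novelty test on each presented exemplar: a unit is allocated only when the input is far from every existing centre and the current output error is large. The code below is a simplified illustration of that idea, not Platt's implementation. The function names, the dictionary layout and the default values of `epsilon`, `kappa` and `rate` are assumptions, and adaptation of the centres is omitted for brevity.

```python
import math


def distance(a, b):
    """Euclidean distance between two input patterns."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def predict(units, x):
    """Sum of Gaussian basis functions; each unit has a centre, width and height."""
    return sum(u["height"] * math.exp(-distance(x, u["centre"]) ** 2 / u["width"] ** 2)
               for u in units)


def observe(units, x, y, delta, epsilon=0.1, kappa=0.87, rate=0.05):
    """Present one exemplar; allocate a unit if it is novel, else adapt heights.

    delta   -- minimum distance from every existing centre (assumed to be
               shrunk by the caller over time, giving narrower new units)
    epsilon -- minimum absolute error that counts as a deficiency
    kappa   -- overlap factor that scales the width of a new unit
    Returns True if a new unit was allocated.
    """
    error = y - predict(units, x)
    nearest = min((distance(x, u["centre"]) for u in units), default=math.inf)
    if nearest > delta and abs(error) > epsilon:
        # With no existing units, fall back to delta as the length scale.
        width = kappa * (delta if math.isinf(nearest) else nearest)
        units.append({"centre": list(x), "width": width, "height": error})
        return True
    # Not novel: take a small gradient step on the heights only.
    for u in units:
        response = math.exp(-distance(x, u["centre"]) ** 2 / u["width"] ** 2)
        u["height"] += rate * error * response
    return False


units = []
first = observe(units, [0.0], 1.0, delta=0.5)    # empty network: allocate
second = observe(units, [0.1], 1.0, delta=0.5)   # too close to the first centre
third = observe(units, [2.0], -1.0, delta=0.5)   # distant and badly predicted
```

Because the width of a new unit is tied to the distance to its nearest neighbour, a caller that decreases `delta` across presentations would obtain progressively narrower units in this sketch.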
4.5.5.1 Adaptive Response Function Neurons (ARFNs)
In situations where there is only a small error, the RAN method can move the centres of the basis functions. However, it cannot adjust the widths of the functions during online learning. A recent alternative to the RAN method is Ollington and Vamplew’s (2003; 2004) Adaptive Response Function Neurons (ARFNs). This method was of particular interest to this thesis as it is the only method currently developed that can be applied to the idea of learning inputs.
The ARFN, shown in Figure 4-3, is based on the biological cortical neuron and receives inputs from both an excitatory and an inhibitory interneuron. Each interneuron is excited by the common cortical input and inhibited by a standard bias. An appropriate threshold allows the output neuron to produce a Gaussian-like response to the original cortical input. This forms a selective neuron response rather than the more common monotonic function.
Figure 4-3: ARFN: A neuron for implementing a Gaussian-like response function. White neurons are excitatory and grey neurons are inhibitory. This diagram is based on the ARFN model in Ollington (2004, p 42).
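The way two monotonic interneurons can combine into a selective, Gaussian-like response may be illustrated with a difference of two sigmoids. This is only an illustrative reading of the description above and not the formulation given by Ollington and Vamplew. The sigmoidal interneurons, the gains, the threshold parameters and the function names are all assumptions.

```python
import math


def sigmoid(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def arfn_response(x, theta_exc, theta_inh, gain_exc=4.0, gain_inh=4.0):
    """Illustrative selective response built from two monotonic interneurons.

    The excitatory interneuron switches on once x passes theta_exc and the
    inhibitory interneuron switches on once x passes theta_inh, so with
    theta_exc < theta_inh the difference is high only between the thresholds.
    """
    excitatory = sigmoid(gain_exc * (x - theta_exc))
    inhibitory = sigmoid(gain_inh * (x - theta_inh))
    return excitatory - inhibitory


peak = arfn_response(0.0, theta_exc=-1.0, theta_inh=1.0)
left_tail = arfn_response(-5.0, theta_exc=-1.0, theta_inh=1.0)
right_tail = arfn_response(5.0, theta_exc=-1.0, theta_inh=1.0)
```

In this sketch the two thresholds set where the response rises and falls, and choosing unequal gains makes the two flanks differ in steepness.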
Training of the interneurons is based on the inputs received rather than on the error propagated from the output. For instance, if the response from the excitatory interneuron is high then the threshold is trained up; if it is low, the threshold is trained down. The opposite occurs for the inhibitory interneuron. Effectively, this system builds an asymmetric function which tightly fits the input pattern. The advantage of this approach is that it allows the neuron to reshape its hyperellipsoid centre and widths while training by only