Other training techniques for feedforward networks

2.3.2 Other training techniques for feedforward networks

Although backpropagation is still the most widely used algorithm for training multilayer perceptrons, there are other methods that are not based on backpropagating the error. Some of them are reviewed below.

Derivative estimation by perturbation

The derivative estimation by perturbation method first injects perturbations into the network, propagates them forward and calculates the resulting change in error. This change in error is then used to approximate the gradient of the error function. The idea has two variations: the perturbation can be injected locally, at the level of one neuron, as in MRIII [Andes, 1990], or globally, as in model-free distributed learning [Dembo, 1990].
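The idea can be illustrated with a minimal sketch. It is not MRIII or the model-free distributed learning algorithm; it is the simplest per-weight form of the idea, in which each weight of a single sigmoid neuron is perturbed in turn and the change in error divided by the perturbation serves as the gradient estimate. The names `forward`, `total_error`, `estimate_gradient`, `train` and `OR_SAMPLES`, as well as the values of `delta`, `rate` and `epochs`, are assumptions made for the illustration.

```python
import math


def forward(weights, x):
    """Single sigmoid neuron; the last entry of `weights` is the bias."""
    s = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-s))


def total_error(weights, samples):
    """Sum of squared errors over the training set."""
    return sum((forward(weights, x) - t) ** 2 for x, t in samples)


def estimate_gradient(error_fn, weights, delta=1e-4):
    """Approximate dE/dw_i by perturbing one weight at a time."""
    base = error_fn(weights)
    grad = []
    for i in range(len(weights)):
        perturbed = list(weights)
        perturbed[i] += delta  # inject the perturbation
        grad.append((error_fn(perturbed) - base) / delta)  # change in error
    return grad


def train(samples, weights, rate=0.5, epochs=200):
    """Gradient descent that uses only forward passes, no backpropagation."""
    for _ in range(epochs):
        grad = estimate_gradient(lambda w: total_error(w, samples), weights)
        weights = [w - rate * g for w, g in zip(weights, grad)]
    return weights


# Toy training set (logical OR), chosen only for illustration.
OR_SAMPLES = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
              ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]
```

Each gradient estimate costs one extra forward pass per weight, which is the price paid for not propagating the error backwards.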

Direct update of the weights

Directly updating the weights by perturbation eliminates the need for gradient computation. The weights are simply changed by an arbitrary quantity and the new error value is calculated. If the change decreases the error, it is accepted. If the change increases the error, it is discarded and another change is tried. Again, there are two variations of this idea: one applies the changes individually, to each weight, and the other uses a perturbation matrix which perturbs the whole weight matrix [Baba, 1989].
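The accept-or-discard loop for the individual-weight variation might look like the sketch below. The function name `perturbation_search`, the uniform distribution of the changes and the parameters `step`, `trials` and `seed` are assumptions; `error_fn` stands for any function that returns the training error for a given weight list.

```python
import random


def perturbation_search(error_fn, weights, step=0.5, trials=2000, seed=0):
    """Change one weight at a time; keep the change only if the error drops."""
    rng = random.Random(seed)
    weights = list(weights)
    best = error_fn(weights)
    for _ in range(trials):
        i = rng.randrange(len(weights))
        old = weights[i]
        weights[i] = old + rng.uniform(-step, step)  # arbitrary change
        err = error_fn(weights)
        if err < best:
            best = err        # the change decreased the error: accept it
        else:
            weights[i] = old  # otherwise discard it and try another change
    return weights, best
```

Because a change is kept only when it lowers the error, the returned error can never exceed the error of the initial weights.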

Genetic algorithms

Another approach to training is genetic algorithms. In this method, the weight state and/or the architecture are encoded into a binary string called a chromosome. For each chromosome, a fitness measure is calculated. This fitness measure is inversely proportional to the error on the training set. Initially, a population of chromosomes is generated randomly and subsequently new generations are created by using a set of genetic operators. There are two categories of genetic operators: cross-over operators, which combine bits from two chromosomes to create a third one, and mutation operators, which randomly change some bits in the chromosome they act upon. At each generation, some individuals die if their fitness value is below a certain threshold. After a certain number of generations, a solution is obtained by choosing the best fit (or any) member of the current generation. The main disadvantage of the genetic approach is its extreme sensitivity to the binary codification of the genes, to the genetic operations and the rate at which they are applied, and to the calculation of the fitness values. Very small changes in any of these elements could lead to a very slow evolution speed or even failure. Genetic algorithms are described in [Chang, 1990], [Dodd, 1990], [de Garis, 1990] and others.
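A minimal sketch of such a scheme is given below; it does not reproduce any of the cited algorithms. Several choices are assumptions made for the illustration: 8 bits per weight, weights decoded into the range [-4, 4], a fitness of 1 / (1 + error) instead of a strictly inverse-proportional measure (to avoid division by zero), one-point cross-over, and a survival threshold set at the median fitness so that the better half of each generation survives.

```python
import random

BITS = 8  # bits per encoded weight (assumption)


def decode(chromosome, low=-4.0, high=4.0):
    """Turn a list of bits into a list of real-valued weights."""
    weights = []
    for i in range(0, len(chromosome), BITS):
        value = int("".join(map(str, chromosome[i:i + BITS])), 2)
        weights.append(low + (high - low) * value / (2 ** BITS - 1))
    return weights


def fitness(chromosome, error_fn):
    """Higher fitness for lower training error."""
    return 1.0 / (1.0 + error_fn(decode(chromosome)))


def crossover(a, b, rng):
    """One-point cross-over: combine bits from two chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome, rng, rate=0.02):
    """Flip each bit with a small probability."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def evolve(error_fn, n_weights, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    length = n_weights * BITS
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda c: fitness(c, error_fn),
                        reverse=True)
        history.append(fitness(ranked[0], error_fn))
        survivors = ranked[:pop_size // 2]  # the weaker half dies
        children = [mutate(crossover(rng.choice(survivors),
                                     rng.choice(survivors), rng), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    best = max(population, key=lambda c: fitness(c, error_fn))
    history.append(fitness(best, error_fn))
    return best, history
```

Since survivors are carried over unchanged, the best fitness never decreases from one generation to the next. Changing `BITS`, the mutation `rate` or the survival rule alters the behaviour considerably, which is consistent with the sensitivity noted above.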

Basis functions

In this case, the network approximates the desired function using a set of functions which form a basis in the function space:

$$ f(\mathbf{w}, \mathbf{x}) = \sum_{i=1}^{n} w_i \, \phi_i(\mathbf{x}) $$

where $\phi_i$ are the basis functions, $w_i$ are the weights, $\mathbf{x}$ is the input pattern and $n$ is the number of basis functions. Each neuron in the hidden layer is fully connected to the input neurons and implements one basis function. The output layer performs a linear summation of the basis functions supplied by the hidden layer. The main difference from the backpropagation variations is that the weights from the input layer to the hidden layer are not trained by error propagation but are calculated directly from the training samples. The weights to the output layer, however, can be trained using the delta rule or calculated directly by solving a linear system.

The basis functions can be localised or not. If the functions are local, their effect is present only in a limited region of the input space. The most commonly used type in this category is the radial basis function, in which the basis functions are Gaussian functions [Broomhead, 1988], [Musavi, 1992], [Moody, 1989], [Poggio, 1990], [Girosi, 1990]. Non-localised basis functions can be used as well; one possible choice is a set of orthogonal polynomials.
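As a sketch of the Gaussian case, the example below places one basis function on each training sample (so the hidden layer is obtained directly from the training set) and calculates the output weights by solving the resulting linear system. Placing a centre on every sample, using a single common `width`, and the names `gaussian`, `solve`, `fit_rbf` and `predict` are assumptions made for the illustration, not the procedure of any cited paper.

```python
import math


def gaussian(x, centre, width):
    """Gaussian radial basis function centred on `centre`."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-dist2 / (2.0 * width ** 2))


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - s) / m[r][r]
    return w


def fit_rbf(inputs, targets, width=1.0):
    """Hidden layer taken from the samples; output weights solved directly."""
    centres = [tuple(x) for x in inputs]
    design = [[gaussian(x, c, width) for c in centres] for x in inputs]
    return centres, solve(design, targets)


def predict(x, centres, weights, width=1.0):
    """Linear summation of the basis functions, as in the formula for f."""
    return sum(w * gaussian(x, c, width) for w, c in zip(weights, centres))
```

With one centre per distinct training sample the system is square, so the network reproduces the training targets up to rounding error. With fewer centres than samples, a least-squares solution or the delta rule would take the place of `solve`.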

Converting decision trees

A classification network performs the same task as a classification tree. Several authors have shown that it is possible to construct a neural network equivalent to a given classification tree. This network can be used as it is or as a starting point for backpropagation or other training techniques. The advantage of the approach is that many classical techniques for designing tree classifiers can be used directly in building neural networks. A possible disadvantage is that the construction of the classification tree is usually a process with heavy computational demands. The connection between tree classifiers and neural networks is explored in [Sethi, 1990a], [Sethi, 1990b] and [Sirat, 1990].
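One possible mapping is sketched below for a tree with axis-parallel threshold tests. It uses three layers of threshold units: one unit per internal node of the tree, one AND unit per leaf, and one OR unit per class. This is an illustrative construction under stated assumptions and is not claimed to be identical to those in the cited papers. The tuple encoding of the tree and the names `TREE`, `tree_classify`, `tree_to_network` and `network_classify` are hypothetical.

```python
def step(s):
    """Hard threshold unit."""
    return 1 if s > 0 else 0


# ("split", feature, threshold, left, right): go right if x[feature] > threshold
# ("leaf", label)
TREE = ("split", 0, 0.5,
        ("leaf", "A"),
        ("split", 1, 0.3, ("leaf", "B"), ("leaf", "A")))


def tree_classify(tree, x):
    node = tree
    while node[0] == "split":
        _, feature, threshold, left, right = node
        node = right if x[feature] > threshold else left
    return node[1]


def tree_to_network(tree):
    """Collect one test per internal node and one path per leaf."""
    splits, leaves = [], []

    def walk(node, path):
        if node[0] == "leaf":
            leaves.append((path, node[1]))
            return
        _, feature, threshold, left, right = node
        index = len(splits)
        splits.append((feature, threshold))
        walk(left, path + [(index, 0)])   # this path needs unit `index` off
        walk(right, path + [(index, 1)])  # this path needs unit `index` on
    walk(tree, [])
    return splits, leaves


def network_classify(x, splits, leaves):
    # Layer 1: one threshold unit per internal node of the tree.
    hidden = [step(x[f] - t) for f, t in splits]
    # Layer 2: one AND unit per leaf (weight +1 or -1, bias set by the path).
    leaf_units = []
    for path, _ in leaves:
        total = sum((1 if want else -1) * hidden[i] for i, want in path)
        needed = sum(want for _, want in path)
        leaf_units.append(step(total - needed + 0.5))
    # Layer 3: one OR unit per class over the leaves carrying that label.
    sums = {}
    for unit, (_, label) in zip(leaf_units, leaves):
        sums[label] = sums.get(label, 0) + unit
    return {label: step(s - 0.5) for label, s in sums.items()}
```

For any input, exactly one leaf unit fires, so exactly one class output is active and it matches the tree's decision. Replacing the hard thresholds with sigmoids would give a network that can then be refined by backpropagation.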

Learning from examples and queries

The techniques presented so far are based on a training set which is independent of the learning process. Query learning starts with a partial training on very few examples, after which the algorithm calculates some 'interesting' points in the input space and asks for the output values at those points. The advantages of this approach are, firstly, that a good approximation can be constructed using a small training set and, secondly, that the algorithm itself can find the points which are more important (such as the boundary points in a classification problem) and pay more attention to them. The disadvantage is that the approach requires the existence of an oracle, i.e. a mechanism able to give the correct output for any input point the algorithm might ask about. There are many real-world problems for which such an oracle does not exist. Two examples of query learning algorithms are [Hwang, 1990] and [Baum, 1991].
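A toy sketch of the idea, for a one-dimensional two-class problem, is given below. It is not the algorithm of [Hwang, 1990] or [Baum, 1991]. The learner starts from two labelled examples and repeatedly asks the oracle about the point halfway between the closest pair of examples with different labels, that is, about a point near the class boundary. The function name `query_learn`, the parameter `queries`, and the assumption that the two initial examples belong to different classes are all choices made for the illustration.

```python
def query_learn(oracle, low, high, queries=10):
    """Locate a 1-D class boundary by querying points near it.

    Assumes oracle(low) != oracle(high); otherwise no boundary pair exists
    and the search below raises StopIteration.
    """
    # Very small initial training set: the two end points.
    examples = [(low, oracle(low)), (high, oracle(high))]

    def boundary_pair():
        examples.sort()
        return next((p, q) for p, q in zip(examples, examples[1:])
                    if p[1] != q[1])

    for _ in range(queries):
        a, b = boundary_pair()
        x = (a[0] + b[0]) / 2.0          # the 'interesting' point
        examples.append((x, oracle(x)))  # ask the oracle for its output
    a, b = boundary_pair()
    return (a[0] + b[0]) / 2.0, examples
```

With ten queries on the interval [0, 1] the boundary is located to within about 0.0005, whereas a training set fixed in advance would need on the order of a thousand evenly spaced samples for the same accuracy. The sketch works only because `oracle` can label any point it is asked about.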
