
A Fast Converging Backpropagation^

9.2 Linear update function at the output layer

In section 4.1 we summarised the derivation of the BP algorithm and showed that the delta rule at the output layer is given by (4.5)


^ This chapter is based on the paper (El-Deredy and Branston, 1994) http://nobelprizes.com/nobel/peace/1991a.html.

Chapter 9: Fast backpropagation

$$\Delta w_{ij} = \eta\,(t_i - o_i)\,f'(net_i)\,o_j = \eta\,\delta_i\,o_j \qquad (9.1)$$

where

$$f'(net_i) = \frac{\partial o_i}{\partial net_i} = o_i\,(1 - o_i)$$

for a unit-slope sigmoid (in general, $f'(net) = \beta\,f(net)\,(1 - f(net))$) is the derivative of the sigmoid function. Re-writing the delta rule (4.9):

$$\delta_i = (t_i - o_i)\,o_i\,(1 - o_i)$$

Fig. 9.1 shows $\delta_i$ plotted against $(t_i - o_i)$.

Figure 9.1. Back-propagated delta functions: a. the original delta function (8.2) and b. the proposed linear function. [Plot: Delta (vertical axis) against Error (horizontal axis).]

It is clear from the figure that beyond a certain point the weight update value actually decreases as the error increases, such that when the actual output is totally wrong (e.g. $t_i = 1$ while $o_i = 0$), $\delta_i$ is zero. Thus, although a large error may be present at the output, the weights at the output layer may not be updated accordingly. This situation can slow down the training of the whole network significantly, and if the nodes are saturated in the flat regions of the sigmoid, training may be inhibited. This often occurs at the beginning of training, when the random initial weights shift some node activations towards the flat (inactive) regions of the sigmoid, thereby saturating their outputs. In this case it is difficult for such nodes to be brought back to the dynamic region of the sigmoid within a practical training time.
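The shape of the original delta function can be checked with a short worked calculation. This is a sketch under two assumptions that are not stated in the text: a unit-slope sigmoid and a target of $t_i = 1$. The symbol $e$ for the error is introduced here only for illustration.

```latex
% Assumptions: unit-slope sigmoid, target t_i = 1, error e = t_i - o_i, so o_i = 1 - e.
\begin{align*}
\delta_i(e) &= (t_i - o_i)\,o_i\,(1 - o_i) = e\,(1 - e)\,e = e^{2}(1 - e), \\
\frac{d\delta_i}{de} &= 2e - 3e^{2} = e\,(2 - 3e) = 0
  \quad\Longrightarrow\quad e = \tfrac{2}{3}, \\
\delta_i\!\left(\tfrac{2}{3}\right) &= \tfrac{4}{9}\cdot\tfrac{1}{3} = \tfrac{4}{27} \approx 0.148,
  \qquad \delta_i(1) = 0 .
\end{align*}
```

Under these assumptions the delta peaks at an error of 2/3 and then falls back to zero as the error approaches 1, which is the behaviour described above.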

Therefore, we introduce the update function

$$\delta_i = \lambda\,(t_i - o_i) \qquad (9.6)$$

where $\lambda$ is a constant, which is more representative of the deviation between $t_i$ and $o_i$ (Fig. 9.1). We use (9.6) in place of the delta derived from the squared error in (4.2) to update the weights between the output layer and its nearest hidden layer. The update steps for the hidden layers (4.5 and 4.6) remain unchanged.

The change proposed in (9.6) guarantees that an amount proportional to the error updates the weights at the output layer and is propagated to the hidden layer, ensuring active learning throughout the training session.
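The difference between the two update functions can be illustrated numerically. The following is a minimal sketch, not the original implementation: it assumes a unit-slope sigmoid, a target of 1, and a constant of 1 in (9.6). The function names and the sample outputs are hypothetical.

```python
import numpy as np

def standard_delta(t, o):
    """Output delta for squared error with a unit-slope sigmoid: (t - o) * o * (1 - o)."""
    return (t - o) * o * (1.0 - o)

def linear_delta(t, o, lam=1.0):
    """Linear delta in the spirit of (9.6); `lam` stands in for the constant (assumed 1 here)."""
    return lam * (t - o)

# Target of 1 with outputs ranging from nearly correct to almost totally wrong
# (illustrative values, not data from the experiments).
t = 1.0
o = np.array([0.9, 0.5, 0.1, 0.001])
std = standard_delta(t, o)
lin = linear_delta(t, o)

for oi, s, l in zip(o, std, lin):
    print(f"o={oi:.3f}  error={t - oi:.3f}  standard={s:.4f}  linear={l:.4f}")
```

In this sketch the standard delta shrinks again once the output moves far from the target, whereas the linear delta keeps growing with the error.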

It is important to note that the problem highlighted in Fig. 9.1 is specific to classification tasks when the quadratic error function (4.2) is used and the output node activations are sigmoids (4.4). The problem is not encountered in regression tasks since the output node activations are linear. For classification, it is possible to define a different error function whose derivative can alleviate this problem. It has become standard to define the error function in classification in terms of the negative log-likelihood of outputs (Bishop, 1995).

9.3 Designing a sigmoid curve for each layer

The sigmoid function is responsible for creating the decision boundaries of the classifier. The slope of the function controls the transition between the hyperplanes of the decision surface: the higher the slope, the sharper the edges of the decision surface (Fig. 9.2). In section 4.3, we discussed approaches to accelerate the network training by adapting the slope of the sigmoid. The basic principle in most of these approaches was to apply gradient descent to the slope of the sigmoid in a similar way to the weights (section 4.1.1). Each node would have a different slope that determines the steepness of the sigmoid function transforming the activation of that node. This obviously adds to the computational time of training, which in the case of MRS is already burdened by the large dimensionality.

Figure 9.2 The slope of the sigmoid $\beta$ (4.4) controls the steepness of the classification surface.

Figure 9.3 A large slope (compared to the spread of the training data) may lead to premature saturation of some of the network activations (x), slowing down convergence or resulting in suboptimal solutions.

Figure 9.4 The inappropriate choice of the initial weights and displacement t can also result in premature saturation.


training process, rather than during training (Fig. 9.3 and 9.4), and that making a sensible choice of the slope and location of the sigmoid before training starts can help speed up the convergence and prevent the nodes from getting trapped on the flat regions prematurely.

In this section we introduce a simple procedure to choose the displacement $t$ and slope $\beta$ of the sigmoid before training starts. For each trainable layer in the network, this procedure fits a sigmoid to the training data, regardless of the choice of the initial weights, and in doing so it places all network activations for all training patterns on the linear part of the sigmoid. This ensures that all nodes and all patterns take an active part in the training and that none of the network nodes is trapped on the flat regions at either end of the curve (Fig. 9.5).

Figure 9.5 Fitting the sigmoid to the training data (and the initial weights). t is chosen by shifting the centre of the sigmoid to the centre of the data, while the slope is chosen such that the minimum and maximum node activations in a layer (mnnet and mxnet) are transformed to the lower and upper shoulder of the sigmoid, respectively. This ensures that the network starts training with all its nodes on the active part of the curve for all patterns and hence forces these nodes to contribute to the training process.

The procedure works by calculating the two parameters characterising the sigmoid curve for each of the trainable layers by forward propagation using the training data and the initial set of weights. From (4.3) and (4.4) the node activations and outputs for any layer $l$ are given by

$$net^{l} = w^{l}\,o^{l-1} \quad\text{and}\quad o^{l} = f\big(\beta^{l}\,(net^{l} + t^{l})\big)$$

Now let

$$mnnet^{l} = \min_{p}\{net^{p,l}\} \quad\text{and}\quad mxnet^{l} = \max_{p}\{net^{p,l}\}$$

be the minimum and maximum activations of layer $l$ given all the data patterns $p$. $mxnet^{l}$ and $mnnet^{l}$ represent the spread of the activations across all the nodes in layer $l$ and all the data patterns given the initial weights. This spread of activations is to be fitted to the desired region of the sigmoid by selecting the values on the curve onto which $mxnet^{l}$ and $mnnet^{l}$ are projected (Fig. 9.5), such that

$$\varepsilon = f\big(\beta^{l}\,(mnnet^{l} + t^{l})\big) \qquad (9.7)$$

and

$$1 - \varepsilon = f\big(\beta^{l}\,(mxnet^{l} + t^{l})\big) \qquad (9.8)$$

where $\varepsilon$ is a small positive value (e.g. 0.1). Therefore, all activations will initially project onto the region between $\varepsilon$ and $(1 - \varepsilon)$.

Solving (9.7) and (9.8) for $\beta^{l}$ and $t^{l}$ we get

$$\beta^{l} = \frac{2\,\log\big((1 - \varepsilon)/\varepsilon\big)}{mxnet^{l} - mnnet^{l}} \qquad (9.9)$$

$$t^{l} = -\,\frac{mxnet^{l} + mnnet^{l}}{2} \qquad (9.10)$$
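The fit for one layer, and its application layer by layer through forward propagation, can be sketched as follows. This is an illustration under stated assumptions, not the original code: the logistic sigmoid and natural logarithm are assumed, bias terms are omitted, and the function names, random data and network shape are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, assumed here as the form of (4.4)."""
    return 1.0 / (1.0 + np.exp(-x))

def fit_sigmoid(net, eps=0.1):
    """Return (beta, t) for one layer following (9.9) and (9.10).

    `net` holds the activations of every node in the layer for every
    training pattern; the activations must not all be identical.
    """
    mn, mx = net.min(), net.max()
    beta = 2.0 * np.log((1.0 - eps) / eps) / (mx - mn)  # natural logarithm
    t = -(mx + mn) / 2.0
    return beta, t

def design_sigmoids(X, weights, eps=0.1):
    """Fit (beta, t) for each layer in turn, using the initial weights only."""
    params = []
    out = X
    for W in weights:
        net = out @ W                       # activations of this layer (no bias term)
        beta, t = fit_sigmoid(net, eps)
        params.append((beta, t))
        out = sigmoid(beta * (net + t))     # outputs feed the next layer
    return params, out

# Hypothetical data and initial weights for an 8-5-2 network.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
weights = [rng.normal(size=(8, 5)), rng.normal(size=(5, 2))]
params, out = design_sigmoids(X, weights, eps=0.1)
print(out.min(), out.max())
```

With these choices, the smallest and largest activations of each layer map to $\varepsilon$ and $1 - \varepsilon$ respectively, so every output starts inside that band.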

The procedure starts from the first hidden layer and propagates forward. Using the input data and the initial weights, it calculates all the activations of the hidden layer (4.3) resulting from all the input patterns. It finds the maximum and minimum values of these activations and fits the sigmoid curve to them using (9.9) and (9.10). The values of $\beta$ and $t$ are then used to compute the outputs of the nodes (4.4). These outputs and the initial weights are used to calculate $\beta$ and $t$ of the next hidden layer. The process is repeated until all hidden layers and the output layer have their sigmoid curve parameters evaluated. Note that this procedure is done only once, before training starts. It guarantees that all network activations are on the active part of the sigmoid. During training these activations are then free to move towards either end of the sigmoid. Any nodes that saturate their activations at either end of the curve will do so as a result of the training.

9.4 Experiments and Results

We refer to a network updated by (9.6) and whose sigmoid curves are designed by (9.9) and (9.10) as a linear update backpropagation (LUBP) network. We assess the training speed and generalisation performance of LUBP compared to the speed and performance of standard BP (4.5) in two different problems.

The first problem is to learn the parity function for orders between 4 and 7 (section 6.3), while the second is to classify NMR spectroscopy data derived from 6 types of normal and tumour animal tissues (section 6.4). The weights (randomly chosen) are updated after every pattern presentation, and the average of the training time (the number of epochs) over 20 re-training sessions, each starting with different initial weights, is recorded for those networks where convergence occurred within a pre-specified time. Networks that timed out have zero contribution to the average. Identical sets of initial weights were used in the comparison.


9.4.1 Learning the parity function

The parity function is not an easy task for BP because of the exponential growth of learning time as the order N increases and because of local minima [7, Lisboa]. For a network with one hidden layer and N-2N-2 node structure, the task is to train the network until it correctly produces the sum, mod 2, of all $2^N$ binary combinations, with one output indicating the odd parity and the other the even parity. We used N ranging from 4 to 7 in two BP networks with learning rates of 0.1 and 0.3, in comparison with an LUBP network with a learning rate of 0.1, while fixing the momentum at 0.9 in all cases. Learning rate and momentum were not optimised. Table 9.1 summarises the results.
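The training set for the N-bit parity task described above can be generated as in the following sketch. The function name and the ordering of the two output columns are choices made here for illustration and are not taken from the original experiments.

```python
import itertools
import numpy as np

def parity_dataset(n):
    """All 2**n binary patterns of length n with two-output parity targets.

    Column 0 of the targets flags odd parity and column 1 even parity
    (the column order is an arbitrary choice made for this sketch).
    """
    X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    odd = X.sum(axis=1) % 2          # sum of the bits, mod 2
    T = np.column_stack([odd, 1.0 - odd])
    return X, T

X, T = parity_dataset(4)
print(X.shape, T.shape)  # (16, 4) (16, 2)
```

For N = 4 this yields 16 patterns, half with odd parity and half with even parity, matching an N-2N-2 network with 4 inputs and 2 outputs.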

Table 9.1 Training time (in training epochs) for an N-bit parity

N                        BP1        BP2        LUBP
4                        121        97         80
5                        341        233        175
6                        721        564        410
7                        1213       993        860
Learning rate, momentum  0.1, 0.9   0.3, 0.9   0.1, 0.9

It is clear that even when the learning rate for BP is increased, LUBP is faster. The increase in training time as a function of N is to be expected as the complexity of the problem increases (Lisboa and Perantonis, 1991). Figures 9.6, 9.7 and 9.8 show the number of misclassified patterns as a function of training time for N = 5, 6 and 7 respectively. Although BP starts by learning a larger number of patterns than LUBP, it finds difficulty in learning the remaining patterns or fails completely, whereas LUBP manages to escape the local minima (Fig. 9.7) and converges for all patterns.

[Plots: number of misclassifications (vertical axis) against training epochs (horizontal axis) for LUBP, BP1 and BP2.]

Figure 9.8. Number of misclassified patterns as a function of training time in epochs for parity 6.


9.4.2 Classification of NMR spectra

We used the rat data with injected noise as explained in chapter 6. We split the data into a training set (210 patterns) and a test set (150 patterns) in which the 6 classes were distributed with equal probability. The dimension of all patterns was 180 points. We used networks with 180-H-6 nodes, where H = {4, 6, 8, 10}. Training was terminated when all the patterns in the test set were correctly classified. We compared LUBP with a learning rate of 0.1 to BP with a rate of 0.3 while fixing the momentum at 0.9. Table 9.2 shows the results. In the case of 4 hidden nodes, BP did not converge in time. As with the parity function, LUBP converged faster than BP. Fig. 9.10 shows the number of misclassified patterns as a function of training time for H = 6.

Table 9.2 Comparison of the convergence rate of LUBP and BP

H                        BP         LUBP
4                        -          570
6                        400        340
8                        340        320
10                       310        290
Learning rate, momentum  0.3, 0.9   0.7, 0.9

[Plot: number of misclassifications (vertical axis) against epochs (horizontal axis) for LUBP and BP2.]

Figure 9.10 Comparison of the number of misclassified patterns during training of BP2 and LUBP.

It is noted that if the learning rate of the BP network is increased so as to attempt to approach the performance of the LUBP network, the BP network becomes unstable and oscillates.

Summary

In this chapter we have used a linear function proportional to the error between target and actual values at the output in order to update the weights of an MLP trained by backpropagation. Unlike the original (non-linear) update function, the proposed linear function does not collapse to zero when the error increases, and hence ensures that weight updates proportional to the actual error take place at each training cycle. In addition, we have proposed a procedure to fit a sigmoid curve to the activations of each of the non-linear layers in the network prior to training. The proposed procedure ensures that training starts, under any set of initial weights, with all node activations on the active part of the sigmoid, for all input patterns. Hence, it prevents the premature saturation of the network activations while adding no computational burden to the training process itself.

The modifications were tested on the parity function and on MRS data. The results demonstrated that together the two modifications enable the network to converge faster and escape from local minima more readily than if updated using the standard BP algorithm. This speeded-up MLP is, therefore, more practical for high-dimensionality MRS data and we use it in the implementation part of this thesis (chapters 11 and 12) as well as in the following chapter.

References

Bishop, C.M. 1995. Neural networks for pattern recognition. Oxford: Clarendon Press.

El-Deredy, W. and Branston, N.M. 1994. An update function that speeds up backpropagation learning. Proc. of the IEEE Int. Conf. on Neural Networks, Orlando, FL, June 1994, pp. 477-482.

Gori, M. and Tesi, A. 1992. On the problem of local minima in backpropagation. IEEE Trans. PAMI 14, 76-86.

Lisboa, P.J.G. and Perantonis, S.J. 1991. Complete solution of the local minima in the XOR problem. Network-Computation in Neural Systems 2, 119-124.

Nguyen, D. and Widrow, B. 1990. Improving the learning speed of 2-layer neural networks by choosing the initial values of the adaptive weights. Proc. of IJCNN, pp. (III) 21-26.

Osowski, S. 1993. New approach to selection of initial values of weights in neural function approximation. Electronics Letters 29, 313-315.


Chapter Ten

Variable Selection and Identification of Relevant