6.4 Recurrent training algorithms

6.4.2 Backpropagation Through Time (BPTT)

The BPTT algorithm was originally formulated by Rumelhart, Hinton and Williams (1986). It is a more powerful learning algorithm than that of the SRN, as it calculates the true derivative of the error with respect to each weight, taking into account the recurrent nature of the network. It can be used to train any of the recurrent architectures described earlier, but for the purposes of this discussion the general TOS network from Section 6.3.3 will be used as an example.

BPTT is based on the observation, originally made by Minsky and Papert (1969), that for any recurrent network a feed-forward network can be constructed which exhibits identical behaviour over a finite period of time. Figure 6.8 illustrates a strictly feed-forward network which will produce exactly the same output over a period of three time frames as the recurrent TOS network shown in Figure 6.6.

[Figure 6.8: diagram of the unrolled network, with layers labelled Input and Output/State at times t, t+1, t+2 and t+3]

Figure 6.8 A feed-forward network which will produce identical results over a period of 3 time frames to the TOS network in Figure 6.6

BPTT works by converting the recurrent network into an equivalent feed-forward structure, and backpropagating the error through this feed-forward network. The conversion is performed by 'unrolling' the network for a number of time-frames equal to the length of the input sequence. The error on the output nodes can then be backpropagated through the structure, calculating a change for each weight in the network. As the feed-forward network is created by duplicating elements of the recurrent network (creating an instantiation of each node and connection for each time frame in the input sequence), it is necessary to sum the changes for each duplicate weight and alter the weight by this summed value. This ensures that the updated feed-forward network can be 'collapsed' back to the recurrent network.
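The need to sum the changes over duplicate weights can be checked numerically. The following is a minimal sketch in Python, assuming a hypothetical network with one input and one self-connected node, a symmetric logistic activation, and made-up values for the inputs, weights and target; none of these are taken from the TOS network. It unrolls the node over three time frames, estimates the error gradient for each duplicate of the recurrent weight by finite differences, and compares the sum of those gradients with the gradient for the single shared weight.

```python
import math

def activation(x):
    # Assumed symmetric logistic function with range (-0.5, 0.5).
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def unrolled_error(inputs, w_copies, r_copies, target):
    """Error of the unrolled chain; layer t uses its own copies of w and r."""
    state = 0.0  # initial state of the recurrent node
    for x_t, w_t, r_t in zip(inputs, w_copies, r_copies):
        state = activation(w_t * x_t + r_t * state)
    return 0.5 * (target - state) ** 2

def finite_difference(fn, value, eps=1e-6):
    """Central-difference estimate of the slope of fn at value."""
    return (fn(value + eps) - fn(value - eps)) / (2.0 * eps)

# Hypothetical example values.
inputs = [0.5, -0.3, 0.8]
w, r, target = 0.7, 0.4, 0.2
frames = len(inputs)

def gradient_for_copy(t):
    """Gradient for the duplicate of r in layer t, other copies held fixed."""
    def error_with_copy(value):
        r_copies = [r] * frames
        r_copies[t] = value
        return unrolled_error(inputs, [w] * frames, r_copies, target)
    return finite_difference(error_with_copy, r)

# Sum over the duplicates, as required before collapsing the network.
summed_gradient = sum(gradient_for_copy(t) for t in range(frames))

# Gradient when all layers share one recurrent weight.
shared_gradient = finite_difference(
    lambda value: unrolled_error(inputs, [w] * frames, [value] * frames, target),
    r,
)

print(summed_gradient, shared_gradient)  # the two estimates should agree closely
```

Because every duplicate stands for the same underlying weight, a small change to that weight moves all of the copies at once, which is why the individual changes add.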

In practice the 'unrolling' and 'collapsing' of the network is not performed explicitly. All that is necessary is to present the entire input sequence to the network one frame at a time, and store the resulting activation of each node in the network. Once the end of the sequence is reached the appropriate weight updates can then be calculated as shown in the following equations.

$T_{j,t}$: the target value for the $j$th output node at time $t$ (a target value may not be defined for every node at every time frame)

$\alpha$: the learning rate

$P$: the length of the input sequence $I$

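For the input sequence $I$ the weight updates $\Delta w_{i,j}$ and $\Delta r_{i,j}$ are given by:

$$
ER_{j,t} =
\begin{cases}
\left(T_{j,t} - R_{j,t}\right)\left(\tfrac{1}{4} - R_{j,t}^{2}\right) & \text{if } T_{j,t} \text{ is defined} \\
\left(\tfrac{1}{4} - R_{j,t}^{2}\right) \sum_{k=1}^{b+c} r_{j,k}\, ER_{k,t+1} & \text{if } T_{j,t} \text{ is undefined and } t < P \\
0 & \text{if } T_{j,t} \text{ is undefined and } t = P
\end{cases}
\tag{6.7}
$$

$$
\Delta w_{i,j} = \alpha \sum_{t=1}^{P} I_{i,t}\, ER_{j,t} \tag{6.8}
$$

$$
\Delta r_{i,j} = \alpha \sum_{t=0}^{P-1} R_{i,t}\, ER_{j,t+1} \tag{6.9}
$$

For long sequences and large networks the storage requirements can become impractical. This weakness of BPTT is addressed by the Real-Time Recurrent Learning algorithm discussed in the next section.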
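Equations 6.7 to 6.9 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this research: it assumes a network in which every node receives all inputs through `w[i][j]` and all node activations from the previous time frame through `r[i][j]`, a symmetric logistic activation (chosen because its slope is 1/4 - R^2, matching equation 6.7), and a zero initial state. The function names, the `targets` dictionary and the example values are hypothetical.

```python
import math

def activation(x):
    # Assumed symmetric logistic function; its slope at output R is 1/4 - R^2.
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

def forward(w, r, sequence):
    """Present the sequence frame by frame and store every activation.

    w[i][j] is the weight from input i to node j; r[i][j] is the weight from
    node i at one time frame to node j at the next. Returns R, where R[t][j]
    is the activation of node j at time t and R[0] is the zero initial state.
    """
    n_nodes = len(r)
    R = [[0.0] * n_nodes]
    for frame in sequence:
        previous = R[-1]
        R.append([
            activation(
                sum(w[i][j] * frame[i] for i in range(len(frame)))
                + sum(r[i][j] * previous[i] for i in range(n_nodes))
            )
            for j in range(n_nodes)
        ])
    return R

def bptt_updates(w, r, sequence, targets, alpha):
    """Weight updates (dw, dr) for one sequence.

    targets maps (t, j) to the target for node j at time t (t = 1..P);
    a missing key means no target is defined there.
    """
    P = len(sequence)
    n_inputs, n_nodes = len(w), len(r)
    R = forward(w, r, sequence)

    # ER[t][j] for t = 1..P (row 0 is unused), filled from the last frame back.
    ER = [[0.0] * n_nodes for _ in range(P + 1)]
    for t in range(P, 0, -1):
        for j in range(n_nodes):
            slope = 0.25 - R[t][j] ** 2
            if (t, j) in targets:
                ER[t][j] = (targets[(t, j)] - R[t][j]) * slope
            elif t < P:
                ER[t][j] = slope * sum(r[j][k] * ER[t + 1][k] for k in range(n_nodes))
            # Otherwise (no target and t == P) ER[t][j] stays at 0.

    # Equation 6.8: input weights, summed over t = 1..P.
    dw = [[alpha * sum(sequence[t - 1][i] * ER[t][j] for t in range(1, P + 1))
           for j in range(n_nodes)] for i in range(n_inputs)]
    # Equation 6.9: recurrent weights, summed over t = 0..P-1.
    dr = [[alpha * sum(R[t][i] * ER[t + 1][j] for t in range(P))
           for j in range(n_nodes)] for i in range(n_nodes)]
    return dw, dr

# Hypothetical example: 2 inputs, 2 nodes, 3 frames, one target at the final frame.
w = [[0.3, -0.2], [0.1, 0.4]]
r = [[0.5, -0.3], [0.2, 0.1]]
sequence = [[0.5, -0.1], [0.2, 0.3], [-0.4, 0.25]]
targets = {(3, 0): 0.3}
dw, dr = bptt_updates(w, r, sequence, targets, alpha=0.1)
print(dw)
print(dr)
```

The list `R` holds one activation per node for every time frame, so its size grows with the sequence length multiplied by the number of nodes; this is the storage cost referred to above.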

BPTT may have difficulty in learning to classify long sequences. The situation is analogous to attempting to train a deeply layered feed-forward network. Changes made to weights in the lower layers of the network will likely have only a small impact on the final output of the network, and so the adjustments calculated for these weights will also be small, meaning the network will take a long time to find appropriate values for these weights. In addition the effect of changing these weights early in the sequence will be distorted by the effect of the other weights in the structure until they have trained to appropriate values (Mozer 1994). Hence the training algorithm will have difficulty in forming network weights suitable for detecting relevant features early in the input sequence.
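The shrinking of the adjustments can be illustrated with the undefined-target case of equation 6.7. The sketch below assumes a hypothetical single self-connected node with recurrent weight `r` and applies the recursion ER_t = (1/4 - R_t^2) * r * ER_{t+1} backwards from the final frame. The function name and the example values are illustrative only.

```python
def backpropagated_errors(r, activations, final_error):
    """Return [ER_1, ..., ER_P] for a single self-connected node.

    activations holds R_1..R_P and final_error is ER_P. Each earlier value is
    (1/4 - R_t^2) * r * ER_{t+1}, as for a node with no target at time t.
    """
    errors = [final_error]
    for R_t in reversed(activations[:-1]):
        errors.append((0.25 - R_t ** 2) * r * errors[-1])
    errors.reverse()
    return errors

# Hypothetical best case: every activation is 0, so the slope is at its maximum of 1/4.
errors = backpropagated_errors(r=1.0, activations=[0.0] * 10, final_error=1.0)
for t, value in enumerate(errors, start=1):
    print(t, value)
```

The slope 1/4 - R^2 never exceeds 1/4, so with a recurrent weight of magnitude 1 the error shrinks by a factor of at least four for each frame it is propagated back. In this example the error reaching the first of ten frames is smaller than the final error by a factor of 4^9 = 262144, and the corresponding weight adjustments are smaller by the same factor.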

It is possible to specify target values for the output nodes at any point in the sequence. This would be done, for example, if the network were being trained to reproduce a particular sequence on the output nodes. However, for tasks such as those discussed in this thesis, which involve the classification of an input sequence, only the values of the output nodes at the end of the sequence are important. Target output values are therefore usually provided only for the final time-frame of the sequence. Lewis (1995a, 1995b) demonstrated that providing the network with target values at earlier points in the sequence can improve learning performance. That work describes an algorithm called signal-melding, which generates adaptive, class-dependent training signals that can both reduce training times and increase performance on some sequence classification tasks. However, that work was in progress at the same time as the SLARTI system was being developed and hence has not been used in this research. The recurrent networks described in this thesis were trained with target outputs provided only at the end of the input sequence.

In using any of the recurrent architectures, all of the nodes which feed into recurrent connections must be set to an initial state prior to presentation of the first frame of each input sequence. This is a relatively minor issue and is included here only to aid in replicating these results. As long as the same values are used consistently throughout the training and subsequent use of the network, their exact values should not be of any consequence, as the training algorithm will be able to adapt the weights to work from any initial values. For the purposes of this research the strategy used was to initialise these recurrent nodes to 0, which is the mid-range of the symmetric activation function used in these networks.
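The initialisation strategy can be sketched as follows, assuming the symmetric activation function is a logistic function shifted to the range (-0.5, 0.5); that choice, along with the names `initial_activations`, `classify_all` and `run_network`, is an assumption made for illustration.

```python
import math

def activation(x):
    # Assumed symmetric logistic function; outputs lie in (-0.5, 0.5).
    return 1.0 / (1.0 + math.exp(-x)) - 0.5

INITIAL_STATE = 0.0  # mid-range of the assumed activation function

def initial_activations(n_nodes, value=INITIAL_STATE):
    """Fresh state vector to use before the first frame of every sequence."""
    return [value] * n_nodes

def classify_all(sequences, run_network, n_nodes):
    """Run each sequence from the same initial state.

    run_network is a hypothetical callable taking (sequence, initial_state).
    """
    return [run_network(seq, initial_activations(n_nodes)) for seq in sequences]

print(activation(0.0), initial_activations(3))
```

Creating a new state vector for every sequence keeps the activations left over from one sequence from leaking into the next, during both training and later use.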