The Network Architecture - Analysis and Comparison of Algorithms for Training Recurrent Neural

The task of learning the operator implicit in the Roessler equations means that the network gets two coordinates as inputs and has to generate the third coordinate as output. This task was also used for a case study by Steil [1999]; Steil and Ritter [1999b]. It is known, and also intuitive, that the case of learning $z(t)$ from $x(t)$ and $y(t)$ is much harder than learning $x(t)$ or $y(t)$ from the respective other two because the prediction of the width and the height of the peaks is difficult and strongly dependent on the time-course of the coordinates. Since the scope of this work was not to investigate what tasks can be learned but to compare the behavior of different algorithms, the easier task of learning $y(t)$ from $x(t)$ and $z(t)$ was used for the further analysis.

The network architecture was chosen to be a fully-connected recurrent neural network of ten neurons. The network has two inputs $u_k(t)$, $k = 1, 2$, that are connected to every neuron $x_i$, $1 \le i \le 10$, in the network with weights $w_{ik}$. The activation of neuron $x_i$ is given by the equation

$$\dot{x}_i = -x_i(t) + \sum_{j=1}^{10} w_{ij} f(x_j(t)) + \sum_{k=1}^{2} w_{ik} u_k(t), \qquad (3.2)$$

where $f(x_j) = \tanh(x_j)$. The output is taken from neuron $x_1$:

$$o_1(t) = x_1(t). \qquad (3.3)$$

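Equation (3.2) is rewritten in discrete form using Euler integration, with the input index written as $l$ to keep it apart from the time step $k$:

$$x_i(k+1) = (1 - \Delta t)\, x_i(k) + \Delta t \left( \sum_{j=1}^{10} w_{ij} f(x_j(k)) + \sum_{l=1}^{2} w_{il}\, u_l(k) \right), \qquad (3.4)$$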


where the discrete time variable $k$ corresponds to the continuous time $k\,\Delta t$, that is, $k \mathrel{\hat{=}} k\,\Delta t$.
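The Euler update of equation (3.4) can be sketched in a few lines. The following is a minimal illustration, not the implementation used for the simulations: the function name `euler_step`, the list-based weight layout, the random seed and the example input are assumptions made here, while the network size, the time step of 0.05 and the initialization range are taken from the setup described in this work.

```python
import math
import random

def euler_step(x, u, w_rec, w_in, dt=0.05):
    """One Euler step of Eq. (3.4), applied to all neurons at once."""
    fx = [math.tanh(xj) for xj in x]
    x_next = []
    for i in range(len(x)):
        recurrent = sum(w_rec[i][j] * fx[j] for j in range(len(x)))
        external = sum(w_in[i][l] * u[l] for l in range(len(u)))
        x_next.append((1.0 - dt) * x[i] + dt * (recurrent + external))
    return x_next

random.seed(0)  # fixed seed, only so that the sketch is reproducible
N, M = 10, 2    # ten neurons, two inputs
w_rec = [[random.uniform(-0.2, 0.2) for _ in range(N)] for _ in range(N)]
w_in = [[random.uniform(-0.2, 0.2) for _ in range(M)] for _ in range(N)]
x = [random.uniform(-0.2, 0.2) for _ in range(N)]

u = [0.05, -0.01]   # hypothetical input sample (u_1, u_2)
x = euler_step(x, u, w_rec, w_in)
o_1 = x[0]          # Eq. (3.3): the output is the state of neuron 1
```

With all weights set to zero the step reduces to $x_i(k+1) = (1 - \Delta t)\, x_i(k)$, which is the pure decay term of equation (3.2).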

The initial weights and states are generated randomly from a uniform distribution on the range from $-0.2$ to $0.2$. The data representing the Roessler trajectory are obtained by a fourth-order Runge-Kutta integration of equations (3.1) with initial conditions $(0.495, 0.166, 0.3)$. The resulting coordinate functions are divided by ten to scale them into a range suitable for the network. The inputs are thus provided by $u_1(k) = \frac{1}{10} x(k)$ and $u_2(k) = \frac{1}{10} z(k)$. The target output is given by $d_1(k) = \frac{1}{10} y(k)$.
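The data generation described above can be illustrated as follows. This is a sketch under several assumptions that are not stated in this section: the Roessler system is taken in its standard form with the commonly used parameters $a = 0.2$, $b = 0.2$, $c = 5.7$ (equations (3.1) may use other values), the integration step `h` is assumed to equal the network time step, and the initial conditions are used as printed. The helper names `roessler` and `rk4_step` are hypothetical.

```python
def roessler(state, a=0.2, b=0.2, c=5.7):
    """Right-hand side of the Roessler system (standard form; a, b, c are assumed values)."""
    x, y, z = state
    return (-y - z, x + a * y, b + z * (x - c))

def rk4_step(f, state, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    def shifted(s, k, factor):
        return tuple(si + factor * ki for si, ki in zip(s, k))
    k1 = f(state)
    k2 = f(shifted(state, k1, h / 2))
    k3 = f(shifted(state, k2, h / 2))
    k4 = f(shifted(state, k3, h))
    return tuple(
        s + h / 6 * (p + 2 * q + 2 * r + w)
        for s, p, q, r, w in zip(state, k1, k2, k3, k4)
    )

h = 0.05                       # assumed to equal the network time step
state = (0.495, 0.166, 0.3)    # initial conditions as printed in the text
inputs, targets = [], []
for _ in range(1000):
    x, y, z = state
    inputs.append((x / 10, z / 10))   # u_1(k), u_2(k)
    targets.append(y / 10)            # d_1(k)
    state = rk4_step(roessler, state, h)
```

The division by ten is applied when the samples are stored, so the integration itself always runs on the unscaled coordinates.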

The training of the network is arranged into epochs. During every epoch, a certain part of the Roessler trajectory is presented to the network and the weights are updated using either RTRL or APRL. To monitor the training progress, the training error is calculated using the mean square error of the deviation of the network output from the desired output

$$E_{\text{train}} = \frac{1}{L} \sum_{k=1}^{L} \left( o_1(k) - d_1(k) \right)^2, \qquad (3.5)$$

where $L$ is the number of learning steps in one epoch.

The network's generalization capability is evaluated by running the network for the next 10000 steps on a previously not presented part of the Roessler trajectory. The weights are not adapted further. The generalization error is calculated according to

$$E_{\text{gen}} = \frac{1}{10000} \sum_{k=1}^{10000} \left( o_1(k) - \frac{1}{10}\, y(k) \right)^2. \qquad (3.6)$$
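Equations (3.5) and (3.6) are the same mean square error evaluated over different stretches of the trajectory, so one helper covers both. The sketch below uses a hypothetical function name and toy values chosen only to show the call; an actual run would pass the $L$ samples of an epoch or the 10000 generalization samples.

```python
def mean_square_error(outputs, targets):
    """Mean of squared deviations, as in Eqs. (3.5) and (3.6)."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    return sum((o - d) ** 2 for o, d in zip(outputs, targets)) / len(outputs)

# Hypothetical toy values, not data from the simulations.
outputs = [0.10, -0.05, 0.20, 0.00]
targets = [0.12, -0.05, 0.15, 0.02]
e = mean_square_error(outputs, targets)
```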

The training setup described here was used for all simulations presented in the following chapters.


4 Recurrent Learning: APRL vs. RTRL

The algorithms that this work is primarily concerned with are real-time recurrent learning (RTRL) and Atiya-Parlos recurrent learning (APRL). Although both are derived from a gradient descent approach to minimizing the error, their strategies used to approximate the gradient direction are different. In order to investigate the properties of RTRL and APRL, both algorithms were applied to the task of learning the Roessler attractor as described in the previous chapter.

The results of the simulation are presented, and the training and generalization errors are analyzed as performance measures of the algorithms. Special attention is paid to the course of the error during the training process. On the basis of the results, further experiments are conducted with varying learning rates. The results are discussed with respect to the insights that can be gained into the behavior of the respective algorithms.

4.1 Real-Time Recurrent Learning of the Roessler Attractor

In the following, the results obtained with real-time recurrent learning of the Roessler attractor will be presented. The setup was as described in chapter 3. Network size and learning parameters were adopted from Steil [1999], where it was shown that the Roessler attractor can be learned with a ten-neuron network using a time step of $\Delta t = 0.05$ and a learning rate of 0.05. This work will not deal with any optimization of the learning parameters.

To obtain data for analysis, 25 training runs were carried out. The networks for the trials had different initial conditions, and they were partitioned into groups of five between which the length of the Roessler trajectory presented during one learning epoch varied.¹ As a consequence of this variation, different points on the attractor are reached after iterating the Roessler dynamics for one epoch. This results in a smaller or larger distance (or "jump") between the last point of one epoch and the first point of the next. Since these points are used as training examples, the learning has to cope with these jumps. This gives rise to transients during the first steps of an epoch because the network states have to follow the jump in the training examples. More precisely, the learning procedure forces the network to follow the displacement in the target output to get back onto the desired attractor trajectory again. Since at the beginning of an epoch no learning memory is available, this effect could lead to a network learning the initial transient too strongly. It will turn out that these initial transients can in fact be crucial for successful learning. As the length of the transient is mainly dependent on the skip in the target output between epochs, the number of iterations per epoch was chosen according to the displacement of the $y$-component of the Roessler attractor.²

¹ In practice, this variation was achieved by iterating the network and the attractor dynamics for different numbers of learning steps per epoch.

| iterations per epoch | end point $x$ | end point $y$ | end point $z$ | distance of $y$-component $\lvert y - y_1 \rvert$ |
|---|---|---|---|---|
| 1 | 0.531761 | -0.117349 | -0.163011 | - |
| 1000 | -4.192038 | -8.029749 | 0.018934 | 7.9124 |
| 1019 | 10.836308 | -2.858789 | 2.540039 | 2.74144 |
| 1020 | 10.737583 | -1.824671 | 4.260116 | 1.707322 |
| 1250 | 7.697906 | -3.259310 | 0.284886 | 3.141961 |
| 1430 | 6.831037 | 1.369152 | 0.577286 | 1.486501 |

Table 4.1: The number of iterations per epoch for which simulations were carried out, with the end point reached and the distance of the $y$-components between the last and the first point of the attractor.

| learning steps per epoch | avg training error after 100 epochs | stddev of error after 100 epochs | avg min of training error | stddev of min error | avg epoch with min error | stddev of epoch with min error |
|---|---|---|---|---|---|---|
| 1000 | 1.546635 | 0.108703 | 1.502200 | 0.085739 | 87.2 | 11.8727 |
| 1019 | 0.697316 | 0.256011 | 0.345977 | 0.048451 | 46.25 | 30.2768 |
| 1020 | 0.640156 | 0.145145 | 0.442379 | 0.054683 | 52.2 | 34.9251 |
| 1250 | 0.040773 | 0.001144 | 0.040650 | 0.001003 | 99.4 | 0.8000 |
| 1430 | 0.565376 | 0.049930 | 0.469762 | 0.011277 | 28.2 | 7.9599 |

Table 4.2: Training errors for real-time recurrent learning of the Roessler attractor.

Table 4.1 lists the numbers of iterations per epoch together with the end points on the attractor and the respective distances of the $y$-components. The numbers are of three types: for 1000 iterations, the distance of the $y$-component is large (about 8), whereas for 1019 and 1250 iterations it is at a medium value of about 3, and for 1020 and 1430 iterations it is in the range of the minimal distance achievable in terms of the discretization. For the types with medium and small distance, the two distinct numbers of iterations make it possible to examine whether the effect of the initial transient can be compensated by a longer training trajectory. In the case of 1020 and 1430 iterations, the $y$-components of the end points reached are located on different sides of the first point. This makes it possible to inspect whether the direction of the displacement between epochs plays a role concerning initial transients.
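The distances and the side of the displacement can be recomputed directly from the end points in Table 4.1. The following sketch only reuses the tabulated values; the variable names are chosen here for illustration.

```python
# End points (x, y, z) copied from Table 4.1, keyed by iterations per epoch.
end_points = {
    1000: (-4.192038, -8.029749, 0.018934),
    1019: (10.836308, -2.858789, 2.540039),
    1020: (10.737583, -1.824671, 4.260116),
    1250: (7.697906, -3.259310, 0.284886),
    1430: (6.831037, 1.369152, 0.577286),
}
y_first = -0.117349   # y-component of the first point (row "1" of Table 4.1)

# Jump of the target output between the last point of one epoch and the first of the next.
jumps = {steps: abs(p[1] - y_first) for steps, p in end_points.items()}

for steps, (_, y_end, _) in end_points.items():
    side = "above" if y_end > y_first else "below"
    print(f"{steps:5d}  |y - y_1| = {jumps[steps]:.6f}  (end point {side} the first point)")
```

Only the 1430-step end point lies above the first point; the end points of all other step counts, including 1020, lie below it.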

Training Errors

Each of the 25 networks was trained for 100 epochs. Table 4.2 summarizes the results of the simulation. The complete listing of the results can be found in table C.1. For each number of learning steps per epoch, the average training error after 100 epochs of training is given, where the average goes over the respective group of five networks with different initial conditions. Secondly, the table contains the average of the minimal training error achieved for the respective learning steps, together with the average epoch after which this error was reached. Alongside the averages, the standard deviations of the respective values are listed. In the case of 1019 iterations per epoch, only four runs were taken into account because the fifth trial did not converge.³

The best training error is achieved with 1250 steps of the Roessler dynamics presented in one training epoch. It is about one order lower than the minimal training error for 1019, 1020 and 1430 steps, the latter three being of comparable size. The training error for 1000 learning steps is the highest, namely about three times that of 1019, 1020 and 1430 steps, and about 37 times that of 1250 steps.
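As a worked check of these factors, the ratios can be computed from the average minimal training errors in Table 4.2 (values rounded to two decimals):

$$\frac{E_{1000}}{E_{1250}} = \frac{1.502200}{0.040650} \approx 36.95, \qquad \frac{E_{1000}}{E_{1430}} = \frac{1.502200}{0.469762} \approx 3.20, \qquad \frac{E_{1000}}{E_{1019}} = \frac{1.502200}{0.345977} \approx 4.34,$$

$$\frac{E_{1019}}{E_{1250}} = \frac{0.345977}{0.040650} \approx 8.51, \qquad \frac{E_{1430}}{E_{1250}} = \frac{0.469762}{0.040650} \approx 11.56.$$

Here $E_n$ is shorthand, introduced only for this check, for the average minimal training error with $n$ learning steps per epoch. The factor of "about three" therefore ranges from roughly 3.2 to 4.3 depending on the group compared, and "one order lower" corresponds to factors between roughly 8.5 and 11.6.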

The standard deviations of the errors with respect to different initial conditions are relatively small, which shows that the error levels are typical for the number of learning steps. This is also indicated by the average error over the last 20 epochs of learning, which compensates for varying initial states in the epochs. For all but four trials, it is in the range of the minimal training error, as can be seen in table C.1. This shows that the differences in the error levels arise from initial transients dependent on the number of learning steps. This is consistent with the fact that the number of iterations yielding the largest displacement of the target output between epochs should cause the longest transient and therefore should result in the highest error. The largest displacement in fact belongs to 1000 steps per epoch and shows the highest error. However, the best training error at 1250 steps does not belong to the minimal displacement between epochs. The relation between initial transients and the displacement is not straightforward in the latter case.

² The $y$-component of the Roessler attractor is used as target output here.

| learning steps per epoch | generalization error at minimum (avg) | at minimum (stddev) | after 50 epochs (avg) | after 50 epochs (stddev) | after 100 epochs (avg) | after 100 epochs (stddev) |
|---|---|---|---|---|---|---|
| 1000 | 0.400198 | 0.120934 | 0.299193 | 0.042347 | 0.608198 | 0.503270 |
| 1019 | 0.084052 | 0.033356 | 0.409390 | 0.484723 | 0.395957 | 0.295087 |
| 1020 | 0.091808 | 0.071455 | 0.114114 | 0.040406 | 0.232750 | 0.096556 |
| 1250 | 0.008197 | 0.003299 | 0.011794 | 0.002412 | 0.004166 | 0.001018 |
| 1430 | 0.055814 | 0.012143 | 0.089477 | 0.008884 | 0.130885 | 0.022725 |

Table 4.3: Generalization errors for real-time recurrent learning of the Roessler attractor.

Generalization Errors

The generalization errors for the respective numbers of learning steps are shown in table 4.3. As before, the average goes over the group of five networks with different initial conditions. Table C.2 lists the generalization errors for every trial. The initial state for the generalization is the last point of the previous epoch, which after some time of training is close to the desired trajectory of the Roessler attractor. This leads to generalization errors being lower than the training error because the generalization trajectory is not affected by initial conditions.

Like the training error, the generalization error is the lowest with 1250 steps of the Roessler dynamics learned during one epoch. The difference is up to a factor of about 100, and the best generalization is achieved at the end of the training. For 1019, 1020 and 1430 learning steps per epoch, the generalization is still very good but now appears to be better at the minimal training error than after further epochs. The data in table C.2 shows that this applies to all single trials. Again, the situation is inferior for 1000 iterations per epoch, where the generalization error is the highest among all numbers of iterations. In this case, the generalization error obtained after 50 epochs is better than that at the minimal training error, which means that a better generalization behavior is achieved prior to the minimal training error. These results indicate that the difference in error levels of trials with different numbers of iterations is mainly determined by initial transients at the beginning of an epoch. They influence the quality of generalization of the trained networks as well.

Epochs Needed to Reach the Minimal Training Error

A hint on how the achievable error is influenced by initial transients can be gained from the epoch after which the minimal training error is reached. For 1250 steps per epoch, the minimal error is reached after 99.4 epochs on average. In detail, the minimal error is reached at the end of learning in three of five cases and always in the last three epochs. Given that the training and the generalization errors for this number of steps are the lowest ones observed in the simulation, it can be concluded that in this case the initial transient has no adverse effect on learning.
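The figures quoted for 1250 steps pin down the individual epochs: three runs reach their minimum at epoch 100, the other two lie within the last three epochs, and the mean is 99.4, which leaves only epochs 99 and 98 for the remaining two runs. The short check below recomputes the mean and the spread from that reconstructed list. It suggests that the tabulated standard deviations are population values (division by the number of runs); this is an inference from the numbers, not something stated in the text.

```python
from statistics import mean, pstdev, stdev

# Epochs with minimal training error for the five 1250-step runs,
# reconstructed from the statements in the text (not copied from table C.1).
epochs_with_min_error = [100, 100, 100, 99, 98]

avg = mean(epochs_with_min_error)
population_spread = pstdev(epochs_with_min_error)   # divides by N
sample_spread = stdev(epochs_with_min_error)        # divides by N - 1, for comparison
```

The population form reproduces the value 0.8000 listed in Table 4.2, whereas the sample form gives about 0.894.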

In the case of 1000 steps per epoch, the minimal training error mostly occurs at a later epoch during learning, too. But at that point, the generalization error is already higher than previously. This suggests that the initial transient accounts for a bias in the training error by means of the deviations in the first steps. At some point of the learning process, the trajectory of the Roessler attractor might already be approximated quite well while a certain error remains. As learning proceeds, the error can only be reduced by changing the behavior of the network on the transient. Since the transient is not a part of the operator represented by the Roessler dynamics, such modifications will most probably deteriorate the generalization.

For the other runs with 1019, 1020 and 1430 learning steps per epoch, the minimal training error is reached early compared to the cases above. While the standard deviation is high for 1019 and 1020 steps, an early occurrence of the minimal training error seems typical for 1430 steps. Here, the tendency to reduce the error on the initial transient possibly leads to greater deviations on the rest of the training trajectory. Of course, the generalization becomes inferior then, too.

Compared to 1430 steps, the results for 1019 and 1020 steps are more ambiguous. Not only is the standard deviation of the epoch with minimal error high, but the standard deviation of the training error over the last 20 epochs also exceeds the normal level in some cases. In four cases the training error after 100 epochs is significantly higher than the minimum (about 0.5, cf. table C.1). An interpretation of these observations is not obvious. Maybe overfitting plays a role here as well, but it is also possible that trials with these numbers of steps are more sensitive to the initial conditions. The search for (local) minima on the landscape of the error functional might run through very different paths depending on the starting points.

Error Curves

The different error levels can be observed well in the curves of the training error. Three typical cases are shown in figure 4.1. The curve in 4.1(a) belongs to a trial with 1250 steps of the Roessler dynamics learned per epoch. The training error decays fast in the beginning and reaches a level of below 0.1 around the 20th epoch. From then on, it decreases slowly, and the final training error is about 0.04 in this case. The training seems determined here and after some epochs reaches a level which seems optimal with respect to the choice of the learning parameters. Further improvement of the error would presumably require an adjustment of the parameters. The slow decay of the training error in later epochs might be a consequence of vanishing gradient [Hochreiter, 1998].

In figure 4.1(b), an error curve for 1000 learning steps is shown. The shape is similar to that of 4.1(a) but with the error settling to a higher error level. Again, a slower decay from around the 20th epoch might indicate vanishing gradient. The higher error level is caused by the initial transient, where larger deviations from the target output occur because the initial states perturb the desired trajectory. The curve shows that this mainly results in a bias shifting the error to a higher level without having an adverse effect on the general capability of learning the task. This is also indicated by the generalization errors discussed above.

Another case is displayed in figure 4.1(c), which belongs to a trial with 1019 learning steps per epoch. The error also decreases fast during the first epochs and reaches a level which hardly decreases further. But here, apparent fluctuations in the training error occur, and the figure also shows a large peak after 95 epochs. The fluctuations are responsible for the higher standard deviation over the last 20 epochs, which is about 0.8 (cf. table 4.2). This indicates that the initial transients affect the learning in a manner which makes it difficult to settle into a minimum of the error functional. While trying to adjust the weights to fit the transient more closely, the network predicts the rest of the Roessler trajectory more poorly. This conflict can possibly be solved by an adjustment of the learning parameters.

Summary of Results for RTRL

The results of this section show that the operator represented by the Roessler dynamics can be learned by RTRL. This is depicted qualitatively in figure 4.2, which shows the trajectory of the Roessler attractor and the network prediction after one epoch (4.2(a)), after 50 epochs (4.2(b)) and for the generalization (4.2(c)). The figure shows that the part of the trajectory which was not presented during training is predicted well by the trained network. The network has learned the input-output function of the Roessler dynamics as desired.

[Figure 4.1: mean square error plotted against epoch (0 to 100). (a) Error curve for a trial with 1250 steps per epoch. (b) Error curve for a trial with 1000 steps per epoch. (c) Panel titled "1019 Steps/Epoch".]

[Figure 4.2: Trajectories of the network prediction (red) for real-time recurrent learning of the Roessler attractor (blue). (a) Network trajectory after one epoch of training. (b) Network trajectory after 50 epochs of training.]

Differences in the learning performance arise due to initial transients, which are caused by displacements in the target output between epochs. The discussion above revealed the following effects:

For certain numbers of steps, learning is hardly perturbed by initial transients. The training

error improves continuously as learning proceeds and the generalization does so, too. In the
