Reducing the Learning Rate - Analysis and Comparison of Algorithms for Training Recurrent Neura

steps p.        error at change        error after last epoch   min. training error    epoch w. min. error
epoch    η      avg        stddev      avg         stddev       avg        stddev      avg       stddev
1000     0.5    1.218058   0.513981    16.499091   0.812398     1.545076   0.566893     3.8000    2.2271
1000     0.1    1.218058   0.513981     9.812188   2.575565     5.208858   1.736871    19.2000    8.2316
1019     0.5    0.365420   0.489347     3.636548   7.028277     0.636772   1.041158    23.2000   22.1215
1019     0.1    0.365420   0.489347     1.558096   2.714096     0.771933   1.181099     2.4000    1.8547
1020     0.5    0.594852   0.634128    10.630867   6.295093     1.208501   0.851564     5.8000    5.4553
1020     0.1    0.594852   0.634128     5.737520   4.626163     1.254207   0.728580    21.0000   20.1196
1430     0.5    1.346518   1.092831    11.629692   3.018186     1.421379   0.422199     9.0000    9.9599
1430     0.1    1.346518   1.092831     6.800986   2.279404     2.495142   0.832775    21.8000   11.2677

Table 4.6: Training errors for Atiya-Parlos recurrent learning with reduced learning rate.

The application of APRL therefore requires more careful monitoring of the error in order to decide when to stop the training. It would be desirable to find out whether there are ways to compensate for the influence of the initial transients.
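The monitoring mentioned above can be illustrated with a minimal sketch. The text does not specify a stopping criterion, so the patience-based rule, the function name `best_epoch_with_patience` and the error values below are assumptions made purely for illustration.

```python
def best_epoch_with_patience(errors, patience=5):
    """Return (best_epoch, best_error, stop_epoch) for a per-epoch error sequence.

    Epochs are counted from 1. Training is treated as stopped once the error
    has failed to improve for `patience` consecutive epochs (an assumed rule,
    not one taken from the text).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(errors, start=1):
        if err < best_error:
            # New minimum: remember where it occurred.
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, best_error, epoch
    return best_epoch, best_error, len(errors)


# Made-up error curve with an overshoot after the minimum.
curve = [5.0, 2.0, 1.2, 1.5, 3.0, 8.0, 12.0, 16.0, 16.5]
print(best_epoch_with_patience(curve, patience=3))  # (3, 1.2, 6)
```

In practice the weight configuration at `best_epoch` would have to be saved as well, since the later epochs move the network away from it.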

4.3 Reducing the Learning Rate

The results presented in section 4.2 show that an overshoot in the training error is common for Atiya-Parlos recurrent learning. It would be useful to know how the error overshoot can be avoided or how it can be handled adequately. This section investigates whether a reduction of the learning rate improves the learning performance in further epochs. For this purpose, the networks from section 4.2 are taken in the configuration that yielded the minimal training error. From this configuration on, the training is continued with a smaller learning rate of η = 0.5 or η = 0.1, respectively. Since a comparable point is difficult to choose for the networks that did not show the error overshoot, no network is trained with 1250 learning steps. Networks are trained with 1020 steps, though, for which the 50th epoch is chosen arbitrarily as the point of switching to the lower learning rate.
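The procedure described above (restart from the configuration with minimal training error and continue once per reduced learning rate) can be sketched as follows. This is only an outline of the protocol: `continue_with_reduced_rate`, `train_epoch` and `toy_train_epoch` are hypothetical names, and the toy update is a plain gradient step on a quadratic, not the Atiya-Parlos algorithm.

```python
import copy


def continue_with_reduced_rate(weights, train_epoch, rates=(0.5, 0.1), epochs=50):
    """Continue training from one saved configuration, once per learning rate.

    `weights` is the configuration at the minimal training error of the
    original run. `train_epoch(weights, rate)` is an assumed callback that
    performs one epoch and returns (new_weights, training_error).
    Returns a dict mapping each rate to its list of per-epoch errors.
    """
    curves = {}
    for rate in rates:
        # Every continuation starts from the same saved configuration.
        w = copy.deepcopy(weights)
        errors = []
        for _ in range(epochs):
            w, err = train_epoch(w, rate)
            errors.append(err)
        curves[rate] = errors
    return curves


def toy_train_epoch(weights, rate):
    # Stand-in for one training epoch (NOT APRL): one gradient-like step
    # on the toy error sum(w^2).
    new_weights = [w - rate * 0.2 * w for w in weights]
    error = sum(w * w for w in new_weights)
    return new_weights, error


start = [1.0, -2.0]
curves = continue_with_reduced_rate(start, toy_train_epoch)
print({rate: len(errs) for rate, errs in curves.items()})
```

Copying the saved weights for each rate keeps the two continuations independent, so their error curves are comparable from the same starting point.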

Training Errors

The training is continued with the smaller learning rates for 50 epochs. The results are listed in table C.5 in the appendix; a summary is given in table 4.6. It lists the average training error at the point where the learning rate was reduced, the average training error after 50 epochs of training with the lower learning rate, as well as the average minimum training error and the epoch in which this minimum was achieved.
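The summary quantities named above can be computed from per-trial error curves roughly as in the following sketch. The function name `summarize_trials`, the data layout and the example numbers are assumptions. The text also does not say whether the standard deviations are sample or population values; the sample standard deviation is used here.

```python
from statistics import mean, stdev


def summarize_trials(curves, errors_at_change):
    """Summarize several trials continued with a reduced learning rate.

    `curves` holds one list of per-epoch training errors per trial, where
    index 0 is the first epoch after the reduction. `errors_at_change`
    holds each trial's training error at the point of reduction.
    Each entry of the result is an (average, sample stddev) pair.
    """
    minima = [min(c) for c in curves]
    # Epoch of the minimum, counted from 1 relative to the reduction.
    min_epochs = [c.index(min(c)) + 1 for c in curves]
    finals = [c[-1] for c in curves]
    return {
        "error_at_change": (mean(errors_at_change), stdev(errors_at_change)),
        "error_after_last_epoch": (mean(finals), stdev(finals)),
        "min_training_error": (mean(minima), stdev(minima)),
        "epoch_with_min_error": (mean(min_epochs), stdev(min_epochs)),
    }


# Made-up curves for two trials of three epochs each.
example = summarize_trials([[2.0, 1.0, 3.0], [4.0, 5.0, 6.0]], [1.0, 3.0])
for name, (avg, sd) in example.items():
    print(f"{name}: avg={avg:.4f} stddev={sd:.4f}")
```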

The table shows that the average minimum of the training error is in all cases higher than before. The detailed results indicate that this is not due to any outliers, but that most trials have a higher minimum. Only in some rare cases (e.g. two for 1019 and 1020 steps and one for 1430 steps) is the minimal training error achieved with the smaller learning rate lower than before. In three cases, this was achieved with a learning rate of η = 0.5 and only in one case with η = 0.1.

Except for 1019 learning steps per epoch, the training error after 50 epochs of training with the smaller learning rate is higher than the minimal error. The error levels are on average higher for η = 0.5 than for η = 0.1.

Evidently, the error overshoot occurs here as well, and it is more distinct for trials trained with a learning rate of η = 0.5. It is not clear whether there is a general connection between the value of the learning rate and the extent of the error overshoot, because the overshoot might develop more slowly with a smaller learning rate but reach the same error level in a later epoch. It is remarkable that the

4 Recurrent Learning: APRL vs. RTRL

learning steps        generalization error
p. epoch       η      at change              at minimum              after last epoch
                      avg        stddev      avg         stddev      avg         stddev
1000           0.5    5.524299   9.913341     6.066210    9.248754   17.331240    1.577941
1000           0.1    5.524299   9.913341    11.984670   15.449741    7.696659    4.092301
1019           0.5    0.761783   1.369415     0.742849    1.366318    4.753109    9.370178
1019           0.1    0.761783   1.369415     1.154715    2.153925    2.189632    4.198517
1020           0.5    0.957756   1.430449     2.388553    2.860810   26.639483   21.041452
1020           0.1    0.957756   1.430449     1.524745    1.432534    4.517109    4.202715
1430           0.5    2.102300   2.731254     2.029266    1.302511   20.563283    6.311599
1430           0.1    2.102300   2.731254     1.728727    1.937844    1.755113    0.810571

Table 4.7: Generalization errors for Atiya-Parlos recurrent learning with reduced learning rate.

error overshoot occurs only in those cases where it did before with the higher learning rate, e.g. it does not occur for the trials with 1019 steps. This shows that the trigger for the error overshoot is not influenced by the learning rate. It is likely that the initial transients in connection with initial conditions of the network are responsible for the error overshoot.

The fact that the training errors with the smaller learning rates are inferior might also be due to the initial transients. Possibly, the small learning rate prevents the learning from compensating for the initial transients in later steps of the epoch. The changes in the weights would then be too small to reach the same configurations as with higher learning rates. This would explain why the training performance does not improve in these simulations.

Generalization Errors

The situation is slightly better with the generalization errors. The averages are given in table 4.7; the complete list of generalization errors can be found in table C.6 in the appendix. For 1430 learning steps per epoch, the generalization error obtained with reduced learning rates is lower than before. For 1019 steps, it is lower when learning is continued with a learning rate of η = 0.5.

The detailed list of generalization errors shows that for all different numbers of steps, trials that generalize better than before can be found.

The improved generalization is related to the minimal training error: If training is continued after the minimum, the generalization error increases again in analogy to the training error, except in cases where the training error does not overshoot (e.g. for 1019 steps). It is interesting that the generalization error can decrease, although the training error does not. This is partly on account of the first point of the generalization being the last point of the training such that no initial transient occurs during the generalization. But it may also be that the lower learning rate contributes to the better generalization because the smaller changes in the weight matrix prevent the network from overfitting the initial transients. It is possible that a better trade-off between the error on the training trajectory and the generalization error is achieved.

Epochs Needed to Reach the Minimal Training Error

The epochs in table C.5 in which the minimal training error is achieved are given relative to the epoch at which the learning rate is reduced. An epoch denoted as 2 means that the minimal training error after the reduction of the learning rate is achieved after two epochs of learning with the smaller learning rate. The full number of epochs for which the network has been trained can be obtained by adding the epoch of the reduction, which is also listed in table C.5. Consequently, the average epoch in table 4.6 specifies how long the networks have to be trained with the lower learning rate until the minimal training error during the continuation is achieved. The values show large standard deviations, so it can be assumed that the progress of learning depends strongly on the configurations of the networks at the point of reduction. Except for 1019 steps, the minimum is reached earlier for the higher learning rate of η = 0.5 than for η = 0.1, which indicates that the learning proceeds faster if the learning rate is higher.
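The conversion between the relative epochs and the full training length amounts to a single addition, sketched below. The function name `total_epochs_trained` is hypothetical. The example uses the 1020-step networks, for which the text gives the 50th epoch as the switching point, and a relative epoch of 2 as in the explanation of table C.5.

```python
def total_epochs_trained(reduction_epoch, epochs_after_reduction):
    """Full number of training epochs up to the minimum after the reduction.

    `reduction_epoch` is the epoch at which the learning rate was reduced;
    `epochs_after_reduction` is the epoch of the minimum, counted relative
    to the reduction.
    """
    return reduction_epoch + epochs_after_reduction


# Switch at epoch 50, minimum reached 2 epochs after the reduction.
print(total_epochs_trained(50, 2))  # 52
```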

Error Curves

In figure 4.5, the results of training with a reduced learning rate are visualized. The curves for the training error are plotted for the three learning rates η = 1, η = 0.5 and η = 0.1. The dotted lines represent the transition to the lower learning rate. The curves show jumps in the error, which means that the training error does not stay at the minimum. This is partly due to missing learning memory, such that the weight adjustments take the weight matrix away from the minimum. Not until some learning memory has been established does the training error decrease again. Only in figure 4.5(b) does the curve reach an error level lower than that achieved with the higher learning rate. In the other cases, the error overshoot occurs independently of the learning rate. From a certain point, the same erratic behavior as before can be observed here as well.

Summary for Learning with Reduced Learning Rates

In conclusion, reducing the learning rate of the Atiya-Parlos algorithm neither avoids the error overshoot nor improves the training error. This might have different reasons: It is possible that the selected numbers of steps per epoch are too small for the lower learning rates, such that the influence of the initial transients perturbs the training too strongly. Too many steps of the epoch direct the network into a suboptimal region of the weight space. When the network output reaches the desired attractor, the remaining steps are not sufficient to obtain an efficient weight configuration. It might be that the Atiya-Parlos algorithm is in general inferior if it has to cope with initial transients. This might be due to the approximative approach, which does not perform an exact gradient descent. The learning procedure might become confused by steps differing strongly from the gradient, and the algorithm could get stuck in regions of the weight space away from satisfactory error levels. However, it cannot be excluded that previous learning with η = 1 drives the network into a configuration that is unfavorable for learning with lower learning rates.

The jump in the error at the reduction gives a subtle hint that the Atiya-Parlos algorithm might benefit from the high learning rate by using it to control the training error. Deviations could be compensated faster because the high learning rate allows for larger weight changes. This would explain why the generalization is sometimes bad, although the training error is good: The network implements some kind of overfitting by directly following the local deviations in the target output. The corresponding learning steps are not reasonable with respect to the global task.

Being directly dependent on the learning rate, this kind of local control cannot be established with smaller learning rates. Therefore the error jumps at the reduction point and cannot be decreased to the previously achieved level again. If learning is continued, the network gets further away from good errors and shows the error overshoot.


[Figure 4.5, three panels: mean square error plotted against epoch for η = 1, η = 0.5 and η = 0.1.]

(a) Error curves for a trial with 1000 steps per epoch.

(b) Error curves for a trial with 1020 steps per epoch.

Third panel: 1430 steps per epoch.

Figure 4.5: Error curves for Atiya-Parlos recurrent learning with reduced learning rates. At the minimum of the red curve for η = 1 the learning rate is reduced to η = 0.5 (magenta) or η = 0.1 (cyan). The curve for η = 1 is plotted for comparison. The dotted lines

4.4 Learning with Smaller Learning Rates 41
