Switching from APRL to RTRL - Algorithm Switching with Weight Exchange

5.6 Algorithm Switching with Weight Exchange

5.6.1 Switching from APRL to RTRL

For this direction of the switching, the weight matrices obtained with APRL in section 4.2 are used. Those yielding the minimal training error are taken as initial weight matrices for real-time recurrent learning, except for 1250 steps per epoch, where the first weight matrix achieving a training error less than 0.05 is taken. The networks are then trained with RTRL on the same task using the training parameters from section 4.1, e.g. $\Delta t = 0.05$ and $\eta = 0.05$. Furthermore, the switching is applied in the same manner after 80 epochs of Atiya-Parlos recurrent learning. The results of these simulations are summarized in tables 5.3 and 5.4. The complete list is provided in table C.9 in appendix C.
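The rule for choosing the weight matrix to hand over can be sketched as follows. This is only an illustration of the selection described above, not the code used for the simulations; the function name `select_exchange_epoch`, the argument names and the example error curve are hypothetical.

```python
from typing import Optional, Sequence


def select_exchange_epoch(training_errors: Sequence[float],
                          threshold: Optional[float] = None) -> int:
    """Return the index of the epoch whose weight matrix is handed over.

    Without a threshold, the epoch with the minimal training error is chosen.
    With a threshold, the first epoch whose error lies below it is chosen
    (the rule the text applies for 1250 steps per epoch, with 0.05).
    """
    if not training_errors:
        raise ValueError("training_errors must not be empty")
    if threshold is not None:
        for epoch, error in enumerate(training_errors):
            if error < threshold:
                return epoch
        raise ValueError("no epoch reaches an error below the threshold")
    return min(range(len(training_errors)), key=training_errors.__getitem__)


# Hypothetical APRL error curve, for illustration only (not thesis data).
aprl_errors = [2.1, 0.9, 0.4, 0.047, 0.03, 0.6, 5.0]
epoch_min = select_exchange_epoch(aprl_errors)                  # minimal error
epoch_thr = select_exchange_epoch(aprl_errors, threshold=0.05)  # first below 0.05
print(epoch_min, epoch_thr)
```

In this made-up curve the two rules pick different epochs, which is why the choice of rule matters for the 1250-step case.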

Training Errors after Switching

When switching to RTRL at the minimal training error achieved with APRL, the average minimal training error afterwards is inferior for 1000, 1019 and 1250 learning steps per epoch, whereas for 1020 and 1430 steps it is better. In detail, four of the five trials with 1000 steps do not reach a training error as low as before. For 1250 steps this is the case for two of five trials, and for 1019 steps, no trial reaches an error as low as before. Two of the five trials achieve lower training errors for 1020 steps while two trials do not converge at all. The lower average training error is likely to be an effect of poor statistics. In contrast, four of five trials reach a lower training error for 1430 steps.

learning steps  error at exchange       minimal error           epoch with min error    error after 50 epochs
per epoch       avg         stddev      avg         stddev      avg         stddev      avg         stddev
1000            1.218058    0.513981    1.577457    0.141594    45.6000     2.8705      1.599973    0.137556
1019            0.339507    0.501045    4.144006    7.926465    20.4000     17.7719     4.249738    7.873576
1020            0.594852    0.634128    0.408999    0.028606    41.6667     5.4365      0.420943    0.029602
1250            0.048760    0.001125    0.060039    0.037091    47.4000     2.8705      0.061051    0.036734
1430            1.387777    1.077365    0.592098    0.353531    35.2000     13.9485     0.709196    0.379407

Table 5.3: Training errors for algorithm switching from APRL to RTRL. The networks were first trained with APRL. At the minimum of the error, training was continued with RTRL.

learning steps  error at exchange       minimal error           epoch with min error    error after 50 epochs
per epoch       avg         stddev      avg         stddev      avg         stddev      avg         stddev
1000            16.754543   0.038779    5.029472    1.142046    1.4000      0.4899      19.304372   0.013726
1019            3.473784    6.786168    0.962673    1.523163    27.2000     18.7339     4.191498    7.937247
1020            17.400340   0.741500    4.308207    0.363312    1.0000      0.0000      20.002802   0.024411
1250            0.036236    0.007286    0.037905    0.016998    50.0000     0.0000      0.037905    0.016998
1430            16.579919   0.011070    5.158634    0.765002    1.2000      0.4000      19.995786   0.007531

Table 5.4: Training errors for algorithm switching from APRL to RTRL. The networks were first trained with APRL. After 80 epochs the training was continued with RTRL.
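The comparison of the averages in Table 5.3 can be reproduced with a short script. The numbers are copied from the table; the names `table_5_3` and `classify` are introduced here for illustration only. Note that the script looks at averages, so it says nothing about individual trials.

```python
# Averages from Table 5.3:
# steps per epoch -> (avg error at exchange, avg minimal error after switching)
table_5_3 = {
    1000: (1.218058, 1.577457),
    1019: (0.339507, 4.144006),
    1020: (0.594852, 0.408999),
    1250: (0.048760, 0.060039),
    1430: (1.387777, 0.592098),
}


def classify(at_exchange: float, minimal_after: float) -> str:
    """Label the outcome of continuing with RTRL relative to the APRL minimum."""
    return "better" if minimal_after < at_exchange else "inferior"


outcome = {steps: classify(*errors) for steps, errors in table_5_3.items()}
for steps, label in outcome.items():
    print(f"{steps} steps per epoch: {label}")
```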

The data indicates that RTRL cannot always make use of the previously optimized weight matrix to improve the learning performance further. It can be assumed that this is due to the different strategies to minimize the error functional. APRL reaches the minimum by approximating the gradient with respect to the states, i.e. $\partial E / \partial x$. It might be that the gradient $\partial E / \partial w$ with respect to the weights does not vanish at this point, such that RTRL moves away from it. It might reach a different point in the weight space which gives rise to the higher training error.
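The two strategies can be contrasted in a compact form. The following is a sketch in assumed notation: the symbols $x(t)$ for the network states, $d(t)$ for the targets, $w$ for the weights and $\eta$ for the learning rate, as well as the quadratic form of the error, are chosen here for illustration and are not taken from this section.

```latex
E = \frac{1}{2} \sum_{t} \lVert x(t) - d(t) \rVert^{2},
\qquad
\Delta w_{\mathrm{RTRL}} = -\eta \, \frac{\partial E}{\partial w},
\qquad
\Delta x_{\mathrm{APRL}} = -\eta \, \frac{\partial E}{\partial x}
```

In this reading, RTRL steps directly along the weight gradient, whereas APRL first fixes a desired change of the states and then looks for a weight change that approximately produces it. Because the second step is only an approximation, the point where APRL stops need not be a stationary point with respect to the weights, which is the possibility raised above.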

The situation is likewise ambiguous when switching from APRL to RTRL after 80 epochs. At this epoch, the error overshoot has occurred in most cases, and the training error achieved by APRL is high at the point of the switching.

For 1000 learning steps per epoch, RTRL can improve the training error in all five trials, although the obtained errors are not as low as for training with RTRL from the outset. It is remarkable that the training error becomes inferior 50 epochs after switching the algorithms. This is unusual for RTRL, and the reason for this might be that the weight matrix at the switching is unfavorable for RTRL. The weights might have been scaled too large by APRL, such that they are too far off a good minimum for RTRL.
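The summary statistics in Table 5.4 give a hint how early the minimum is reached after the switching. For 1000 steps per epoch the table reports an average epoch of 1.4000 with a standard deviation of 0.4899. The sketch below shows one set of five epoch values that would produce exactly these numbers, under the assumption that the standard deviation is the population form (division by N). The individual values are hypothetical and are not taken from the thesis data.

```python
from statistics import mean, pstdev, stdev

# Hypothetical epochs of minimal error in five trials (not the thesis data):
# three trials at epoch 1 and two trials at epoch 2 after the switching.
epochs = [1, 1, 1, 2, 2]

avg = mean(epochs)
population_sd = pstdev(epochs)  # divides by N
sample_sd = stdev(epochs)       # divides by N - 1

print(f"avg = {avg:.4f}")
print(f"population stddev = {population_sd:.4f}")
print(f"sample stddev = {sample_sd:.4f}")
```

Only the population form matches the tabulated 0.4899. If this reading is right, the best training error is reached within the first two epochs after the exchange, and the remaining epochs only make it worse.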

For 1250 steps, only two of the five trials reach a lower training error than APRL before. The other trials reach errors in the same range. Since the minimal training errors after switching are reached at the end of the training with RTRL, it can be assumed that further improvement could be achieved by training longer. RTRL might then reach training errors as good as those of APRL. For 1019 steps, the training errors stay in the range of those achieved with APRL in four trials, though they are distinguishably higher.

The results for 1020 and 1430 steps of the Roessler dynamics learned per epoch are similar to those of 1000 steps per epoch. RTRL yields an improvement of the training error after the switching, but in the long run, the training errors become worse. This emphasizes that the weight matrix provided by APRL after the occurrence of the error overshoot is not suitable for further refinement with RTRL.

learning steps  generalization error
per epoch       at exchange                 at minimum                  after 50 epochs
                avg           stddev        avg           stddev        avg           stddev
1000            5.524299      9.913341      0.232411      0.106022      0.257354      0.098675
1019            0.745576      1.377380      4.630173      9.048144      4.598801      9.063910
1020            0.957756      1.430449      0.054736      0.011885      0.055010      0.009068
1250            0.040527      0.008428      0.018051      0.023917      0.017862      0.024002
1430            2.755099      2.768327      1.224864      1.467367      3.561670      5.868782

Table 5.5: Generalization errors for algorithm switching from APRL to RTRL at the minimum of the training error. Corresponds to table 5.3.

learning steps  generalization error
per epoch       at exchange                 at minimum                  after 50 epochs
                avg           stddev        avg           stddev        avg           stddev
1000            23.417161     0.081745      576.013573    696.462325    25.297191     0.056009
1019            5.308546      10.499892     204.735481    409.429470    4.581537      9.109724
1020            25.640779     0.838584      3559.956146   3138.707647   22.784936     0.026728
1250            0.012977      0.013427      0.013123      0.010001      0.013123      0.010001
1430            25.102470     0.012161      1252.027856   592.591444    22.420633     0.002691

Table 5.6: Generalization errors for algorithm switching from APRL to RTRL after 80 epochs. Corresponds to table 5.4.

Generalization Errors after Switching

Since RTRL achieved a better generalization than APRL in the previous simulations, it is interesting to consider the generalization here as well. Table 5.5 shows that the generalization errors after switching at the minimal training error are on average better than those achieved with APRL. The detailed results indicate that three of the five trials for 1000 steps learned per epoch achieve significantly lower generalization errors. The situation is similar for 1020, 1250 and 1430 steps. This is remarkable because the training errors were higher in most cases except for 1430 steps. Obviously, RTRL succeeds in finding a trade-off between the training error and the generalization error. For 1019 steps per epoch, the generalization error is higher for all five trials, but the increase is moderate. The results show that the weight matrix obtained with APRL is adjusted further by RTRL and a better trade-off between training error and generalization is achieved. Probably, the steps in the direction of $\partial E / \partial w$ applied by RTRL are suitable to refine the weight matrix which was obtained with APRL by taking steps in the direction of $\partial E / \partial x$.
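The trade-off can be made visible by putting the averages of Tables 5.3 and 5.5 side by side. The following sketch uses the tabulated averages only; the helper `improved` and the variable names are introduced here for illustration. Since averages hide the per-trial behaviour discussed in the text, the result is a rough indication rather than a substitute for the detailed results.

```python
# steps per epoch -> (avg at exchange, avg at minimal training error after switching)
training = {            # Table 5.3
    1000: (1.218058, 1.577457),
    1019: (0.339507, 4.144006),
    1020: (0.594852, 0.408999),
    1250: (0.048760, 0.060039),
    1430: (1.387777, 0.592098),
}
generalization = {      # Table 5.5
    1000: (5.524299, 0.232411),
    1019: (0.745576, 4.630173),
    1020: (0.957756, 0.054736),
    1250: (0.040527, 0.018051),
    1430: (2.755099, 1.224864),
}


def improved(pair):
    """True if the value after switching is lower than at the exchange."""
    before, after = pair
    return after < before


# Step counts where generalization improves although the training error does not.
trade_off = [steps for steps in training
             if improved(generalization[steps]) and not improved(training[steps])]
# Step counts where the generalization error gets worse on average.
worse_generalization = [steps for steps in generalization
                        if not improved(generalization[steps])]

print("generalization gained at the cost of training error:", trade_off)
print("generalization worse after switching:", worse_generalization)
```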

The generalization errors for switching from APRL to RTRL after 80 epochs are noticeably different. For 1019 and 1250 learning steps per epoch, they are in most cases lower than before. Remarkably, the exception for 1019 steps is the trial that shows the error overshoot. For 1000, 1020 and 1430 learning steps per epoch, the generalization at the minimal training error after the switching is very poor. It can be improved by training further and 50 epochs after the switching it is better than the generalization of APRL in several trials.

These results show that after the error overshoot, the weight matrix is in an unfavorable configuration. After switching to RTRL, it takes a long time until a more sensible configuration is reached, and the likelihood of getting stuck in inferior local minima is relatively high. Moreover, the vanishing gradient problem might prevent RTRL from improving the weight configuration sufficiently. It can therefore be concluded that RTRL is not capable of improving the learning performance of APRL after the error overshoot has occurred.

Error Curves

Figure 5.3 visualizes the results of switching from APRL to RTRL. The plots show the training error for APRL and the training error for RTRL. The switching of the algorithms is indicated by the dotted line. As a consequence of the exchange of the weight matrix between APRL and RTRL, the error jumps at the point of switching in all three plots. This is partly due to the lack of learning memory when switching to RTRL. The latter has no information about how APRL reached the weight matrix and adjusts the weight matrix in an independent way. The plots also show that after the jump the training error decreases fast until the decay slows down. Then the training error hardly decreases further.

In figure 5.3(a), the training error for 1000 steps reaches a level which is higher than the training error at the point of switching achieved with APRL. It is also higher than the error level that is reached with RTRL from the outset.

The plot in figure 5.3(b) belongs to 1020 steps per epoch. The training error reaches a level in the range of the training error at the point of switching. Moreover, the training error is in this case comparable to that achieved with RTRL from the outset.

Figure 5.3(c) for 1430 steps per epoch shows that it is also possible that RTRL reaches a better training error compared to both the error at the point of switching and the error achieved with RTRL applied from the outset.

Summary for Switching from APRL to RTRL

Altogether, the data shows that the outcome of switching from APRL to RTRL depends strongly on the weight matrix being exchanged. After the error overshoot, RTRL takes the weight matrix away from the configuration achieved with APRL and increases the training error significantly. Although further training can compensate for this increase to some extent, the error stays high and RTRL can in no case counterbalance the error overshoot. Obviously, the exchanged weight matrix is far away from reasonable regions of the weight space.

When exchanging at the minimal training error obtained with APRL, the training error jumps at the point of switching and only in some cases reaches a level as low as before again. This is due to the lack of learning memory at the point of exchange and due to the vanishing gradient in later epochs. In some cases, RTRL succeeds in achieving a better trade-off between the training error and the generalization. However, since there is no way to predict whether RTRL can yield any improvement, a systematic technique for switching between the algorithms cannot be derived.

60 5 Analyzing the Weight Dynamics of Recurrent Learning

Figure 5.3: Error curves for algorithm switching. The networks were first trained with APRL (red), and at the minimum of the training error the training was continued with RTRL (blue). The point of switching is indicated by the dotted line. Each panel plots the mean square error against the epoch. (a) Error curves for a trial with 1000 steps per epoch. (b) Error curves for a trial with 1020 steps per epoch. (c) Error curves for a trial with 1430 steps per epoch.


learning steps  error at exchange       minimal error           epoch with min error    error after 50 epochs
per epoch       avg         stddev      avg         stddev      avg         stddev      avg         stddev
1000            1.739209    0.055705    0.679710    0.139548    5.4000      0.4899      16.761417   0.040967
1250            0.059518    0.003128    0.023532    0.002537    38.0000     10.7517     0.033646    0.011852
1020            0.598306    0.126017    0.361781    0.209311    12.2000     10.0479     10.390257   8.109234
1430            0.515192    0.030499    0.343521    0.247962    12.8000     7.4673      14.607924   4.586135
1019            0.665196    0.222220    2.568315    1.155420    8.5000      6.8374      17.047714   0.001588

Table 5.7: Training errors for algorithm switching from RTRL to APRL.

In document Analysis and Comparison of Algorithms for Training Recurrent Neural Networks (Page 62-67)