5.6 Algorithm Switching with Weight Exchange
5.6.2 Switching from RTRL to APRL
The switch from RTRL to APRL is made after 50 epochs of real-time recurrent learning. The weight matrix obtained by RTRL is then trained with APRL on the same part of the Roessler trajectory. The learning parameters are $\Delta t = 0.1$ and $\eta = 1$, as in section 4.2, and the training is carried out for 50 further epochs. Since one trial of RTRL for 1019 steps did not converge, only the four other trials are used for the switching here.
The averages of the training and generalization errors are listed in tables 5.7 and 5.8. The detailed results of the simulation can be found in tables C.11 and C.12 in the appendix.
Training Errors after Switching
The minimal training errors achieved by APRL after switching are mostly better than the training errors at the point of exchange. Only four trials for 1019 steps and one trial each for 1020 and 1430 steps have a higher minimal training error. The increase is remarkable because in the simulations with APRL applied from the outset, the results were better for 1019 steps than for 1020 steps, and the situation is reversed here. Evidently, the initial training with RTRL influences the performance of APRL.
The training error 50 epochs after switching the algorithms shows that the error overshoot appears here as well. Except for two trials with 1020 steps per epoch and the trials with 1250 steps per epoch, the training error after 50 epochs is significantly higher than at the minimum. The small number of epochs after which the minimal training error is reached indicates that the error overshoot occurs soon after the point of switching.
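The three quantities compared above (error at the minimum, epoch of the minimum, and error after the last epoch) can be read off an error curve with a few lines of code. The following is a minimal sketch, not the evaluation code used for the tables; the function name `overshoot_summary` and the ratio used as an overshoot indicator are assumptions introduced for illustration.

```python
import numpy as np

def overshoot_summary(errors):
    """Summarize an error curve recorded after the switch (one value per epoch).

    Hypothetical helper: the ratio final/minimum is only one possible
    indicator of the error overshoot, not a definition from the text.
    """
    errors = np.asarray(errors, dtype=float)
    k_min = int(np.argmin(errors))           # 0-based epoch of the minimum
    return {
        "min_error": float(errors[k_min]),
        "min_epoch": k_min,
        "final_error": float(errors[-1]),
        "ratio": float(errors[-1] / errors[k_min]),
    }
```

A ratio far above one together with a small `min_epoch` corresponds to the situation described above, in which the overshoot sets in soon after the point of switching.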
The fact that the behavior of the trials with 1019 and 1020 steps per epoch is different after switching from RTRL to APRL is interesting. In the case of 1020 steps, RTRL can improve the performance, while for 1019 steps it has a deteriorating effect on the subsequent training with APRL. As discussed in section 4.2, the learning performance of APRL depends strongly on initial conditions of the networks and on initial transients arising from the different numbers of steps per epoch. Since the number of steps is not changed when switching the algorithms, the initial transients are the same for RTRL and APRL. Therefore it can be concluded that the different behavior is evoked by different initial conditions due to the previous adjustments of the weight matrix.
For 1019 steps per epoch, RTRL leads to weight configurations which are disadvantageous for APRL. This causes high training errors and the error overshoot. Conversely, for 1020 steps the adjustments applied to the weight matrix by RTRL improve the performance of APRL and can prevent the error overshoot. Nevertheless, in most other cases the initial learning with RTRL does not prevent the error overshoot.
These observations underline that APRL and RTRL behave differently. Evidently, the different gradients followed by the respective algorithms give rise to different paths through the weight space. It is not clear how these paths are related, because switching between RTRL and APRL can lead to both improvement and deterioration.
62 5 Analyzing the Weight Dynamics of Recurrent Learning
Learning steps   Generalization error
per epoch        at exchange             at minimum              after 50 epochs
                 avg        stddev       avg        stddev       avg         stddev
1000             0.299193   0.042347     0.357629   0.082173     23.324170    0.125673
1250             0.011794   0.002412     0.009434   0.000810      0.018624    0.012359
1020             0.114114   0.040406     0.291170   0.186806     15.742722   12.782175
1430             0.089477   0.008884     0.277987   0.245435     21.272876    5.595922
1019             0.409390   0.484723     6.556482   3.827462     26.140439    0.097646
Table 5.8: Generalization errors for algorithm switching from RTRL to APRL.
Generalization Errors after Switching
The generalization errors achieved by APRL after the switching are higher than before, except for 1250 steps learned per epoch. For the latter, they are comparable to those achieved by training with APRL from the outset, but not as low as those achieved by RTRL. For all other numbers of steps per epoch, the generalization error at the minimum is higher for most trials. The large generalization error 50 epochs after the switching is a clear sign of the error overshoot. This is in contrast to switching from APRL to RTRL, where subsequent training with RTRL yields a higher training error alongside a better generalization. The poorer generalization of APRL was also observed while training with APRL from the outset. Therefore it can be assumed that the higher generalization error is a principal property of the Atiya-Parlos algorithm.
Error Curves
Three plots of the training error are shown in figure 5.4. All three curves show a jump in the training error at the point of exchange of the weight matrix. This is again partly due to the lack of learning memory for APRL at the point of switching. After the jump, the training error decreases quickly for some epochs. The level which is reached and the behavior after the decrease differ in the three cases.
Figure 5.4(a) belongs to a trial with 1000 learning steps per epoch. The training error decreases beneath that of RTRL and becomes significantly lower. But after reaching the minimum it increases again, exceeds the training error of RTRL, and the error overshoot occurs.
For the case of 1250 learning steps per epoch in figure 5.4(b), the training error reaches a level beneath RTRL after 6 epochs and stays there until the end of learning. APRL can lower the training error further than RTRL, but at the cost of a poorer generalization. Both the training error and the generalization are comparable to those obtained by learning with APRL from the outset.
In the last plot 5.4(c) for 1019 steps per epoch, the training error never falls beneath the training error of RTRL after the exchange of the weight matrix. It shows erratic behavior after the first jump and increases to a high error level. In this case, the subsequent training with APRL deteriorates the learning performance. It is even worse than training with APRL from the outset, where the error overshoot did not appear in most cases.
Summary for Switching from RTRL to APRL
In conclusion, these results emphasize that APRL and RTRL behave differently while trying to minimize the same error functional. The error overshoot of APRL also occurs if the weight matrix has previously been optimized with RTRL. It can be assumed that the error overshoot is a general feature of APRL, whose occurrence depends on the weight configuration and on the task. In particular, the transients arising from initial conditions are important.
Previous learning with RTRL influences the behavior of APRL: for 1019 learning steps per epoch the error overshoot occurs more often, whereas for 1020 steps it seems to occur more rarely than observed in section 4.2. Evidently, RTRL adjusts the weight matrix to a configuration which is sometimes more and sometimes less favorable for APRL. This might mean that the different approaches to minimizing the error functional exploit different properties of the weight space. The simulations reveal that APRL succeeds in achieving a lower training error but generalizes more poorly in the majority of cases. It can be concluded that the weight configurations reached by APRL in general have different generalization capabilities than the weight configurations reached by RTRL. This indicates that the paths through the weight space are different and depend on which gradient of the error functional is followed.

Figure 5.4: Error curves for algorithm switching from RTRL to APRL. The networks were trained with RTRL (blue) for 50 epochs and were then trained with APRL (red). The point of switching is indicated by the dotted line. Each panel plots the mean square error against the epoch.
(a) Error curves for a trial with 1000 steps per epoch.
(b) Error curves for a trial with 1250 steps per epoch.
(c) Error curves for a trial with 1019 steps per epoch.

Altogether, the differences between RTRL and APRL are essential, and a general strategy of switching the algorithms in order to exploit their respective advantages cannot readily be derived. Nevertheless, some insights into the properties of recurrent learning can be gained from the results of this section. Some consequences for the structure of the weight space and the paths followed by the respective algorithms will be discussed in chapter 8.
6 Aspects of Stability
Recurrent neural networks are known to be capable of producing rich dynamical behavior [Wang, 1991; Sompolinsky and Crisanti, 1988]. Even a network of two fully connected neurons can show oscillatory and chaotic behavior [Haschke et al., 2001]. Recurrent neural networks are in this sense nonlinear dynamical systems, and stability is an important question in their study. Since the behavior of unstable systems is highly unpredictable, applications are normally restricted to stable systems that converge to fixed points.
The difficulty with recurrent neural networks is that due to their adaptive nature, the stability analysis is more complicated. This is especially the case for online learning, where the network dynamics and the dynamics of the adaptive weights are run simultaneously and hence are coupled. During learning, the location of fixed points and the boundaries of their basins of attraction can change [Pearlmutter, 1995]. This is unfavorable if the network has been trained on a certain task and is adapted further while already in use. In such cases, an algorithm that does not change the location of fixed points while adapting the weights is desirable.
The input-output stability of recurrent neural networks was extensively studied by Steil [1999, 2002]. Techniques from control and system theory are combined with neural computation to derive criteria for stability analysis. An approach to incorporate them into the learning process is to restrict the weight changes to a stable region in weight space, which can be determined by linear matrix inequalities [Steil and Ritter, 1999a].
In this chapter a less theoretical approach will be taken to explore the impact of the weight dynamics on the stability of neural networks. The aim is to trace the fixed points during the course of learning. It is an interesting question how fixed points evolve and how their stability is influenced by the learning process. Insights into the relation between learning and stability of recurrent neural networks will be helpful in understanding existing algorithms. Moreover, results on whether some relation exists between the task to be learned and the stability properties arising during training could be useful for the development of new algorithms and regularization techniques.
6.1 Tracing Fixed Points
For tracing the fixed points, the weight matrices of networks learning the Roessler attractor are saved after every epoch. These weight matrices are then used for a stability analysis. Since the task of learning the Roessler attractor leads to networks representing an input-output operator, the stability of these networks is of interest over the full range of possible inputs. Of course, not all possible inputs can be used practically because there are infinitely many. The analysis here is restricted to a part of 100 points of the Roessler trajectory. This part is shown in figure 6.1 and comprises the typical features of the Roessler flow. The x- and y-components both go through a complete oscillation, and the z-component shows a peak that is about 20 steps wide and has a height of about 1.6 on the input scale. For each of the weight matrices obtained during training, the stability analysis is carried out for the 100 input points.
Figure 6.1: Coordinate functions x(t), y(t), and z(t) of the part of the Roessler trajectory which was used to generate the inputs for the stability analysis.

To investigate the behavior of the network with a certain weight matrix and the respective inputs, the following procedure is applied: After programming the weight matrix into the network, it is first iterated with inputs from the Roessler trajectory for 250 steps to ensure that the internal states of the network reach the attractor. The inputs to the network are then clamped to the respective components of the input point under observation. Keeping this input constant, the network is then iterated for 32768 steps¹ to settle down to its limit behavior.
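The warm-up and clamping steps described above can be sketched as follows. This is an illustration under stated assumptions, not the simulation code of this work: the network update (an Euler step of dx/dt = -x + W_rec f(x) + W_in u with f = tanh), the split of the weight matrix into `W_rec` and `W_in`, and all function names are hypothetical and chosen only to make the procedure concrete.

```python
import numpy as np

def euler_step(x, u, W_rec, W_in, dt=0.1):
    # ASSUMED dynamics (not specified in this section):
    # dx/dt = -x + W_rec f(x) + W_in u, with f = tanh, discretized by Euler.
    return x + dt * (-x + W_rec @ np.tanh(x) + W_in @ u)

def settle(W_rec, W_in, warmup_inputs, u_clamped, n_settle=32768, dt=0.1):
    """Drive the network with trajectory inputs, then clamp the input."""
    x = np.zeros(W_rec.shape[0])
    for u in warmup_inputs:           # e.g. 250 points of the trajectory
        x = euler_step(x, u, W_rec, W_in, dt)
    for _ in range(n_settle):         # constant input: let transients fade
        x = euler_step(x, u_clamped, W_rec, W_in, dt)
    return x

def record(x, u_clamped, W_rec, W_in, n_record=32768, dt=0.1):
    """Record the next n_record states under the same clamped input."""
    states = np.empty((n_record, x.size))
    for t in range(n_record):
        x = euler_step(x, u_clamped, W_rec, W_in, dt)
        states[t] = x
    return states
```

The array returned by `record` is what a subsequent spectral analysis would operate on; the step counts are left as parameters so that shorter runs can be used for quick checks.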
In order to find out what kind of behavior is present, a Fourier analysis is carried out. The Fast Fourier Transform algorithm of Frigo and Johnson [1998] is applied to the next $T = 32768$ states of the network.² This yields the Fourier coefficients

$$c_i(k) = \sum_{t=0}^{T-1} x_i(t)\, e^{-\frac{2\pi \mathrm{i}}{T} k t}, \qquad k = 0, \ldots, T-1.$$

The absolute value $|c_i(k)|$ indicates how strongly the harmonic oscillation with period $T_k = T/k$ is contained in the activity pattern of $x_i(t)$. The largest Fourier coefficient $\max_k |c_i(k)|$ is compared against a threshold of $10^{-20}$ to decide whether non-fixed-point behavior is present. For non-fixed points, the period length is determined by $T^* = T/k^*$, where $k^* = \arg\max_k |c_i(k)|$. No attempt is made to distinguish periodic and chaotic behavior. For fixed points, $T^*$ is set to one. After obtaining the period length, the network is iterated for $T^*$ steps (i.e. one period). For each step the eigenvalues $\lambda_j$ of the Jacobi matrix

$$J(x) = -I + W' \operatorname{diag}\bigl(f'(x_i)\bigr)$$

are calculated, where $W'$ denotes the non-input part of the weight matrix. The largest real part of the eigenvalues, $\lambda_{\max} = \max_j \operatorname{Re}(\lambda_j)$, is recorded for the evaluation of stability.
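The spectral test and the eigenvalue computation can be illustrated with NumPy. The sketch below is not the implementation used here (which relies on the FFT of Frigo and Johnson); it makes several assumptions that the text does not state: the constant component k = 0 is skipped, only the non-redundant half of the spectrum of the real-valued states is searched, the maximum is taken over all neurons, the period is rounded to an integer, and f = tanh. The function names are hypothetical.

```python
import numpy as np

def dominant_period(states, threshold=1e-20):
    """Return 1 for a fixed point, else the period T / k* (rounded).

    states: array of shape (T, n), one row per time step.
    """
    T = states.shape[0]
    # |c_i(k)| for k = 0 .. T//2 (real input, so the rest is redundant)
    magnitude = np.abs(np.fft.rfft(states, axis=0))
    magnitude[0, :] = 0.0                    # ASSUMPTION: ignore k = 0
    per_frequency = magnitude.max(axis=1)    # ASSUMPTION: max over neurons
    k_star = int(np.argmax(per_frequency))
    if per_frequency[k_star] <= threshold:
        return 1                             # fixed point
    return int(round(T / k_star))

def max_real_eigenvalue(x, W_rec):
    """lambda_max of J(x) = -I + W_rec diag(f'(x_i)), ASSUMING f = tanh."""
    f_prime = 1.0 - np.tanh(x) ** 2
    # scaling column j of W_rec by f'(x_j) equals W_rec @ diag(f_prime)
    J = -np.eye(x.size) + W_rec * f_prime
    return float(np.max(np.linalg.eigvals(J).real))
```

Under these assumptions a negative value of `max_real_eigenvalue` at a fixed point indicates local stability. In floating-point practice the threshold may need to be chosen larger than $10^{-20}$, which is why it is kept as a parameter.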
The steps of the overall procedure are outlined in algorithm 6.1. It should be emphasized that this is only a rough technique which is unlikely to reveal any detailed information beyond the stability of the fixed points. However, the approach suffices to trace the fixed points during the learning process. Moreover, it is extensible, and refinements could make the technique applicable to more sophisticated analyses, such as the investigation of fixed point bifurcations.
¹ This number of steps is an empirically chosen power of two that was high enough for all spurious oscillations to fade.