Conclusions - Analysis and Comparison of Algorithms for Training Recurrent Neural Networks

Altogether, the results of this chapter show that it is possible to trace the evolution of the fixed points of recurrent neural networks subject to weight adaptation. The observable behavior is highly variable, which makes it difficult to draw general conclusions. Therefore only some aspects can be mentioned here.

For RTRL, the stability of the fixed points shows different behavior in nearly every trial. The distribution of the largest real parts of the eigenvalues sometimes expands during the course of learning. The distribution can also drift without expanding further. The variability in the fixed points reflects the flexibility of the algorithm. RTRL can change the weight matrix in an arbitrary direction, and a systematic connection to the properties of fixed points is difficult to discover, if one exists at all.

For APRL, the error overshoot can at least be related to certain observations concerning the fixed points. The typical boxplot shows that the real parts of the eigenvalues decrease to low values far from instability. Moreover, the distribution changes only slightly after the error overshoot. It can be assumed that APRL reaches a region of the weight space where the stability properties of the network are unfavorable. Since APRL is restricted to scaling the inner weights of the network, it is not able to avoid such configurations.

The investigation of the output components of the stable states reveals that they resemble the course of the input trajectory. The oscillation in the inputs is related to a modulation of the stability of the respective fixed points. The passages through zero coincide with jumps in the largest real part of the eigenvalues, which suggests that the inputs to the network are used to control the location of the fixed points in order to produce the desired input-output behavior. There may be a general mechanism by which the network implements the desired task in terms of the stability properties of the weight matrix.

The data obtained so far are too sparse to draw any detailed conclusions. Further experiments should clarify many aspects. For example, a more accurate sampling of the input trajectory could reveal whether the boundary to instability is actually crossed on the passage through zero. In addition, the evolution of the fixed points during learning could be analyzed with respect to bifurcations and basins of attraction.

The technique described in section 6.1 proved useful for tracing the fixed points, and it can be extended in order to carry out more detailed investigations. Hence this chapter can be thought of as an inspiration for future experiments.

7 Perspectives on new Algorithms

The analysis of the one-output behavior of APRL revealed a weight dynamic that is characterized by a functional division into a readout layer and a scalable reservoir. The inner weights are strongly coupled, and their change is restricted to a sub-manifold of the weight space. The echo state approach shows that this functional division is in general feasible for training recurrent neural networks. This suggests incorporating the structure into the learning algorithms. In the next sections, I will propose two variants of the Atiya-Parlos algorithm that exploit the special weight dynamics. As a result, the complexity of these algorithms, though asymptotically in the same class, is reduced with respect to the factors in the quadratic terms. No experiments have been carried out with the algorithms yet, but since they do not change the underlying calculations, they should yield comparable results.

7.1 Hybrid Batch-Online APRL

The flexibility of updating the weights online comes at the price of a higher complexity of the online algorithm. While the batch algorithm of Atiya and Parlos needs on the order of $3N^2$ operations per data point [Atiya and Parlos, 2000], the complexity of the online algorithm as presented in section 2.4 is on the order of $7N^2$ operations per data point. In the case of only one output neuron, the inner weights are only scaled with constant factors and their rates of change are lower than those of the readout layer. Evidently, the flexibility of online updates is mainly used for the adjustment of the output weights, while the reservoir weights change only slightly. This raises the question of whether the online update of the inner weights is worth the higher computational effort.

I suggest a hybrid batch-online variant of Atiya-Parlos recurrent learning that updates only the output weights online and uses batch updates for the inner weights. The details are given in algorithm 7.1.

To evaluate the complexity of the hybrid algorithm, we first count the number of operations in step 3 of the algorithm. The computation of $\gamma(k)$ needs $3N_O N + N$ multiplications. The weight update for the output layer can be done in $N^2 + 3N$ operations: still $N^2$ for $V^{-1}(k-1) f(x(k-1))$, but only the first element of $B(k-1) V^{-1}(k-1) f(x(k-1))$ is needed and only $N$ weights have to be updated. The update for $B(k)$ needs $N^2$ operations, and computing $V^{-1}(k)$ adds $2N^2$ operations. Altogether the online step needs $4N^2 + 3N_O N + 4N$ operations per data point. The operations for the batch step amount to $(N+1)N(N-N_O)$ multiplications for $\Delta W_{ij}(K)$, exploiting that $B(K)$ and $V^{-1}(K)$ have been computed online. This results in an overall complexity of $4N^2 K + 3N_O N K + 4NK + N^3 - N^2 N_O + N^2 - N N_O$. Typically $N \ll K$, and it was assumed that $N_O = 1$.

Hence we have approximately $4N^2 K + 7NK$ operations, whereas the original APRL algorithm needs $7N^2 K + 4NK$ operations. The asymptotic complexity is still $O(N^2)$, but the factor of the quadratic term is reduced to about half. In practice, this can be a noticeable improvement. The hybrid batch-online algorithm has a complexity between the batch and the online variant. It is a trade-off in the sense that the number of operations is lowered by updating the inner weights in a batch fashion while retaining the flexibility of updating the output weights online.
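The operation counts quoted above can be tabulated for concrete network sizes. The following is a minimal sketch that only evaluates the formulas from the text; the function names `online_aprl_ops` and `hybrid_aprl_ops` are illustrative choices, not part of the original work, and the online count is the one stated in the text for a single output neuron.

```python
def online_aprl_ops(N, K):
    """Operation count quoted in the text for online APRL: 7 N^2 K + 4 N K."""
    return 7 * N**2 * K + 4 * N * K


def hybrid_aprl_ops(N, K, N_O=1):
    """Operation count quoted in the text for the hybrid variant.

    Online part per data point: 4 N^2 + 3 N_O N + 4 N.
    Batch part, done once:      (N + 1) N (N - N_O).
    """
    online_part = (4 * N**2 + 3 * N_O * N + 4 * N) * K
    batch_part = (N + 1) * N * (N - N_O)
    return online_part + batch_part


def hybrid_to_online_ratio(N, K, N_O=1):
    """Fraction of the online cost that the hybrid variant needs."""
    return hybrid_aprl_ops(N, K, N_O) / online_aprl_ops(N, K)
```

For example, `hybrid_to_online_ratio(100, 10000)` evaluates to slightly under 0.58, close to the ratio 4/7 of the quadratic factors, as expected when $N \ll K$.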


Hybrid batch-online APRL

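1. $k=0$: Initialize $x(0)$ and $W(0)$.

2. $k=1$: Iterate the forward dynamics of the network

$$x(1) = (1-\Delta t)\,x(0) + \Delta t\,W f(x(0)),$$

and compute for all $j$

$$[e(1)]_1 = x_1(1) - d_1(1), \qquad [e(1)]_i = 0 \ \text{for } i > 1,$$

$$\gamma(1) = -e(1),$$

$$B(1) = \gamma(1)\, f(x^T(0)),$$

$$V^{-1}(1) = \frac{I}{\epsilon} - \frac{f(x(0))\, f(x^T(0))}{\epsilon^2 + \epsilon\, f(x^T(0))\, f(x(0))},$$

$$W_{1j}(1) = W_{1j}(0) + \Delta W_{1j}(1) = W_{1j}(0) + \frac{\eta}{\Delta t}\,\gamma_1(1) \sum_l f(x_l(0))\, V^{-1}_{lj}(1).$$

3. $k=k+1$: Iterate the forward dynamics of the network

$$x(k) = (1-\Delta t)\,x(k-1) + \Delta t\,W f(x(k-1)),$$

and compute for all $j$

$$[e(k)]_1 = x_1(k) - d_1(k), \qquad [e(k)]_i = 0 \ \text{for } i > 1,$$

$$D(k-1) = \mathrm{diag}\bigl(f'(x(k-1))\bigr),$$

$$\gamma(k) = -e(k) + \bigl[(1-\Delta t)\,I + \Delta t\,W D(k-1)\bigr]\, e(k-1),$$

$$\Delta W_{1j}(k) = \frac{\eta}{\Delta t}\, \frac{\bigl[\gamma(k) - B(k-1)\,V^{-1}(k-1)\,f(x(k-1))\bigr]_1 \,\bigl[V^{-1}(k-1)\,f(x(k-1))\bigr]_j}{1 + f(x^T(k-1))\,V^{-1}(k-1)\,f(x(k-1))},$$

$$W_{1j}(k) = W_{1j}(k-1) + \Delta W_{1j}(k),$$

$$B(k) = B(k-1) + \gamma(k)\, f(x^T(k-1)),$$

$$V^{-1}(k) = V^{-1}(k-1) - \frac{\bigl[V^{-1}(k-1)\,f(x(k-1))\bigr]\,\bigl[V^{-1}(k-1)\,f(x(k-1))\bigr]^T}{1 + f(x^T(k-1))\,V^{-1}(k-1)\,f(x(k-1))}.$$

4. Go to step 3 until end of data.

5. Compute the batch update for all $i > 1$ and for all $j$

$$\Delta W_{ij}(K) = \frac{\eta}{\Delta t} \sum_m B_{im}(K)\, V^{-1}_{mj}(K), \qquad W_{ij}(K) = W_{ij}(0) + \Delta W_{ij}(K).$$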
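The steps above can be sketched in plain Python for a single pass over a target sequence. This is a minimal illustration under stated assumptions, not the author's implementation: the activation is assumed to be `tanh`, the network has no external input, the output neuron is unit index 0 (unit 1 in the text), and the names `hybrid_aprl`, `matvec`, `eta`, `dt` and `eps` are hypothetical. Instead of a separate first step, the sketch starts from $B(0)=0$ and $V^{-1}(0)=I/\epsilon$ with a regularization constant `eps`, which is assumed here to be the initialization behind step 2.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def hybrid_aprl(W0, x0, targets, eta=0.1, dt=1.0, eps=1.0):
    """One pass of a hybrid batch-online APRL sketch (assumptions in the text).

    The output row W[0] is updated online; rows i > 0 get one batch
    update after the last data point. Returns (W, B, Vinv).
    """
    N = len(x0)
    W = [row[:] for row in W0]
    x = x0[:]
    e_prev = [0.0] * N
    B = [[0.0] * N for _ in range(N)]
    Vinv = [[1.0 / eps if i == j else 0.0 for j in range(N)] for i in range(N)]

    for d in targets:
        # Forward dynamics: x(k) = (1 - dt) x(k-1) + dt W f(x(k-1)).
        fx = [math.tanh(u) for u in x]
        Wfx = matvec(W, fx)
        x_new = [(1.0 - dt) * x[i] + dt * Wfx[i] for i in range(N)]

        # The error is nonzero only at the output unit.
        e = [x_new[0] - d] + [0.0] * (N - 1)

        # gamma(k) = -e(k) + [(1 - dt) I + dt W D(k-1)] e(k-1),
        # with D(k-1) = diag(f'(x(k-1))) and f' = 1 - tanh^2.
        De = [(1.0 - math.tanh(x[i]) ** 2) * e_prev[i] for i in range(N)]
        WDe = matvec(W, De)
        gamma = [-e[i] + (1.0 - dt) * e_prev[i] + dt * WDe[i] for i in range(N)]

        # Shared quantities: V^{-1}(k-1) f and 1 + f^T V^{-1}(k-1) f.
        Vf = matvec(Vinv, fx)
        denom = 1.0 + sum(a * b for a, b in zip(fx, Vf))

        # Online update of the output weights only (first row).
        resid = gamma[0] - sum(B[0][m] * Vf[m] for m in range(N))
        for j in range(N):
            W[0][j] += (eta / dt) * resid * Vf[j] / denom

        # B(k) = B(k-1) + gamma(k) f^T, then the rank-one update of V^{-1}.
        for i in range(N):
            for j in range(N):
                B[i][j] += gamma[i] * fx[j]
                Vinv[i][j] -= Vf[i] * Vf[j] / denom

        x, e_prev = x_new, e

    # Batch update of the inner weights (rows i > 0) from B(K) V^{-1}(K).
    for i in range(1, N):
        for j in range(N):
            W[i][j] = W0[i][j] + (eta / dt) * sum(
                B[i][m] * Vinv[m][j] for m in range(N)
            )
    return W, B, Vinv
```

A useful consistency check on such a sketch is that the online increments of the output row telescope: after the last data point, `W[0]` should agree, up to rounding, with the first row of `W0` plus `eta / dt` times the first row of `B` times `Vinv`, which is what a pure batch update of that row would give.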
