Other Optimization Methods

A crucial factor in all APRL algorithms is that every element of the vector $V^{-1}(k) f(x(k))$ is needed for the update of the weights. Therefore the matrix $V^{-1}(k)$ has to be computed in every time step. These two computations amount to $3N^2$ operations. This partly reflects the complexity needed to propagate the error information through the network.
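The per-step cost can be illustrated with a small sketch. It assumes that $V(k)$ grows by the outer product $f(x(k-1)) f(x(k-1))^T$ in every step, as the recursion for $V^{-1}(k)$ in the accelerated one-output APRL listing suggests, so that the inverse can be updated with a rank-one (Sherman-Morrison) correction instead of being recomputed. The function name `update_vinv`, the identity start matrix and the random test vectors are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def update_vinv(vinv, fx):
    """Rank-one (Sherman-Morrison) update of a symmetric inverse.

    Given vinv = V^{-1}, returns the inverse of V + fx fx^T together with
    the vector V^{-1} fx. Only a matrix-vector product and an outer
    product are needed, so the cost is O(N^2) per step.
    """
    vf = vinv @ fx                      # about N^2 multiplications
    denom = 1.0 + fx @ vf               # scalar normalisation
    return vinv - np.outer(vf, vf) / denom, vf

rng = np.random.default_rng(0)
N = 5
V = np.eye(N)                           # assumed regularised start value
vinv = np.linalg.inv(V)
for _ in range(10):
    fx = np.tanh(rng.normal(size=N))    # stand-in for f(x(k-1))
    vinv, _ = update_vinv(vinv, fx)
    V += np.outer(fx, fx)

# Deviation between the recursive inverse and a direct O(N^3) inversion
max_dev = np.max(np.abs(vinv - np.linalg.inv(V)))
```

The vector returned alongside the updated inverse is the quantity $V^{-1} f$ that the weight update needs, so no additional matrix-vector product is required for it.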

Since for the one-output case, the error propagation of the Atiya-Parlos algorithm leads to a linear scaling of the reservoir, it stands to reason to exploit this to reduce the costs of the error propagation further. This might be possible by decoupling the weight updates of the reservoir from those of the output layer. Thus, different optimization strategies could be applied for the output weights on the one hand, and the internal weights on the other. Many methods are available from optimization theory, among which are direct strategies like Newton’s method. The latter seems especially promising because the form of the Atiya-Parlos learning rule

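$$\Delta w \propto \left[ \left( \frac{\partial g}{\partial w} \right)^T \frac{\partial g}{\partial w} \right]^{-1} \left( \frac{\partial g}{\partial w} \right)^T$$

has a structure similar to the Newton direction

$$h = -[H(x)]^{-1} \nabla f(x).$$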

It is, however, not obvious how these methods could be used to navigate through the weight space more efficiently. The problem is to find an appropriate way to incorporate the error in order to find an optimal scaling for the reservoir. Future investigations could clarify whether new formulations of this problem make it accessible to more efficient optimization techniques.
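The structural similarity can be checked on a toy problem. The sketch below uses the quadratic objective $f(x) = \frac{1}{2}\|Ax - y\|_2^2$, which is chosen purely for illustration and is not a formulation from the thesis. Its Hessian is $A^T A$ and its gradient is $A^T (Ax - y)$, so the Newton direction takes the form $[(\cdot)^T (\cdot)]^{-1} (\cdot)^T$ applied to the residual. The helper name `newton_direction` and the random data are assumptions.

```python
import numpy as np

def newton_direction(hessian, grad):
    """Newton direction h = -H^{-1} grad, obtained by solving H h = -grad."""
    return np.linalg.solve(hessian, -grad)

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))   # illustrative data, full column rank almost surely
y = rng.normal(size=8)
x = np.zeros(3)               # arbitrary starting point

H = A.T @ A                   # Hessian of 0.5 * ||A x - y||^2
grad = A.T @ (A @ x - y)      # gradient at x
h = newton_direction(H, grad)

# The same direction written in the [(.)^T (.)]^{-1} (.)^T form
h_ls = -np.linalg.inv(A.T @ A) @ A.T @ (A @ x - y)

# For a quadratic objective a single Newton step reaches the minimiser
x_new = x + h
x_ref = np.linalg.lstsq(A, y, rcond=None)[0]
```

For this quadratic case the Newton step and the least squares solution coincide. Whether a comparable shortcut exists for the reservoir scaling is exactly the open question raised in the text.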


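Accelerated one-output APRL

1. $k = 0$: Initialize $x(0)$ and $W(0)$.

2. $k = 1$: Iterate the forward dynamics of the network

$$x(1) = (1 - \Delta t)\, x(0) + \Delta t\, W f(x(0)),$$

and compute

$$\begin{aligned}
e_1(1) &= x_1(1) - d_1(1), \\
\gamma_1(1) &= -e_1(1), \\
B_{1j}(1) &= \gamma_1(1)\, f(x_j(0)), \\
B_{ij}(1) &= 0 \quad \text{for } i > 1, \\
V^{-1}(1) &= \frac{I}{\epsilon} - \frac{f(x(0))\, f(x(0))^T}{\epsilon^2 + \epsilon\, f(x(0))^T f(x(0))}, \\
W_{1j}(1) &= W_{1j}(0) + \frac{\eta}{\Delta t} \sum_l B_{1l}(1)\, V^{-1}_{lj}(1), \\
W_{ij}(1) &= W_{ij}(0) \quad \text{for } i > 1.
\end{aligned}$$

3. $k = k + 1$: Iterate the forward dynamics of the network

$$x(k) = (1 - \Delta t)\, x(k-1) + \Delta t\, W f(x(k-1)),$$

and compute the weight updates for the first two rows of $W$, i.e. for $i = 1, 2$ compute

$$\begin{aligned}
e_1(k) &= x_1(k) - d_1(k), \\
\gamma_1(k) &= -e_1(k) + (1 - \Delta t)\, e_1(k-1) + \Delta t\, w_{11} f'(x_1(k-1))\, e_1(k-1), \\
\gamma_2(k) &= \Delta t\, w_{21} f'(x_1(k-1))\, e_1(k-1), \\
\Delta W_{ij}(k) &= \frac{\eta}{\Delta t}\, \frac{\left[ \gamma(k) - B(k-1)\, V^{-1}(k-1) f(x(k-1)) \right]_i \left[ V^{-1}(k-1) f(x(k-1)) \right]_j}{1 + f(x(k-1))^T V^{-1}(k-1) f(x(k-1))}, \\
B_{ij}(k) &= B_{ij}(k-1) + \gamma_i(k)\, f(x_j(k-1)), \\
V^{-1}(k) &= V^{-1}(k-1) - \frac{V^{-1}(k-1) f(x(k-1)) \left[ V^{-1}(k-1) f(x(k-1)) \right]^T}{1 + f(x(k-1))^T V^{-1}(k-1) f(x(k-1))}.
\end{aligned}$$

4. Compute the remaining weight updates for $i > 2$ by columnwise scaling

$$\Delta W_{ij}(k) = \frac{W_{i1}(0)}{W_{21}(0)}\, \Delta W_{2j}(k),$$

and update the weights for all $i, j$

$$W_{ij}(k) = W_{ij}(k-1) + \Delta W_{ij}(k).$$

5. Go to step 3 until end of data.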
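The steps above can be transcribed into a short NumPy sketch. This is a direct, unvalidated transcription and not a tested implementation; the thesis itself states that the algorithm has not been implemented. The following are assumptions made for illustration: `tanh` as the activation $f$, the output unit at index 0, the parameter names `dt`, `eta` and `eps` for the step size, the learning rate and the regularization constant in $V^{-1}(1)$, and the toy target signal.

```python
import numpy as np

def accelerated_one_output_aprl(d, W0, x0, dt=0.5, eta=0.01, eps=1.0):
    """Transcription of steps 1-5 for one output unit (index 0)."""
    f = np.tanh
    df = lambda x: 1.0 - np.tanh(x) ** 2
    N = W0.shape[0]
    W = W0.astype(float).copy()
    scale = W0[:, 0] / W0[1, 0]          # W_i1(0) / W_21(0), fixed by the init

    # Step 2 (k = 1): only the output row is updated
    fx = f(x0)
    x = (1 - dt) * x0 + dt * W @ fx
    e_prev = x[0] - d[0]
    B = np.zeros((2, N))                 # rows 1 and 2 of B
    B[0] = -e_prev * fx                  # gamma_1(1) = -e_1(1)
    Vinv = np.eye(N) / eps - np.outer(fx, fx) / (eps**2 + eps * (fx @ fx))
    W[0] += eta / dt * (B[0] @ Vinv)
    errors = [e_prev]

    # Steps 3 to 5
    for target in d[1:]:
        x_prev, fx = x, f(x)
        x = (1 - dt) * x_prev + dt * W @ fx
        e = x[0] - target
        # gamma_i(k) for i = 1, 2: error fed back through the first column
        gamma = dt * W[:2, 0] * df(x_prev[0]) * e_prev
        gamma[0] += -e + (1 - dt) * e_prev
        Vf = Vinv @ fx
        denom = 1.0 + fx @ Vf
        dW = np.empty_like(W)
        dW[:2] = eta / dt * np.outer(gamma - B @ Vf, Vf) / denom
        dW[2:] = np.outer(scale[2:], dW[1])   # step 4: columnwise scaling
        B += np.outer(gamma, fx)
        Vinv -= np.outer(Vf, Vf) / denom
        W += dW
        e_prev = e
        errors.append(e)
    return W, np.array(errors)

rng = np.random.default_rng(0)
N, T = 6, 50
W0 = rng.uniform(-0.2, 0.2, size=(N, N))
W0[1, 0] = 0.1                           # keep the scaling denominator away from zero
x0 = rng.uniform(-0.1, 0.1, size=N)
d = 0.3 * np.sin(0.2 * np.arange(1, T + 1))   # toy target, an assumption
W, errors = accelerated_one_output_aprl(d, W0, x0)
```

Only two rows of the update are computed from the error in each step; all other rows are obtained by scaling the second row, which is where the saving over the original algorithm would come from. The sketch says nothing about learning quality, which the text leaves to future simulations.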

8 Summary and Conclusive Remarks

In this work, algorithms for training recurrent neural networks were analyzed and compared with respect to different aspects of recurrent learning. The continuous time version of the recently introduced Atiya-Parlos algorithm [Atiya and Parlos, 2000] was derived, which reduces the complexity of recurrent learning to $O(N^2)$. The algorithm is based on the unified formulation of gradient descent in terms of a constrained optimization problem. In contrast to other algorithms like real-time recurrent learning (RTRL), APRL does not compute the gradient of the error with respect to the weights but with respect to the states. The constraint equation is used to determine the weight change that leads to a state change in the negative direction of this gradient. The derivation of the algorithm involves a least squares solution of a matrix inversion, and hence APRL is, like RTRL, an approximate algorithm. However, the approximations are different and yield different weight updates.

The online variant of APRL was compared to RTRL in chapter 4 with respect to the learning performance. Both algorithms were applied to the task of learning the input-output operator implicitly given by the Roessler dynamics. The results show that in either case good minima of the training error can be found. APRL can be used with a higher learning rate and reaches the minimum training error more quickly than RTRL, but it generalizes worse to parts of the Roessler trajectory that were not presented during training.

While RTRL gradually improves the training error, APRL shows an error overshoot after the minimum is reached. Experiments with a variable learning rate revealed that this error overshoot is a characteristic feature (at least on the given task) and cannot be avoided by lowering the learning rate. Evidently, the weight update of APRL does not vanish in the vicinity of the minimum but drives the weight matrix away from it again.

Transients to the Roessler attractor arising in the initial part of the training epochs influence the learning performance of both algorithms. The type of influence differs: RTRL is affected by a bias in the error but is otherwise relatively robust, whereas APRL is more sensitive to the transients. The simulation data show that the transients may even influence whether the error overshoot of APRL occurs or not.

In chapter 5, the weight dynamics of recurrent learning was investigated. A formal treatment of the one-output behavior of APRL revealed that the network is structured into a dynamical reservoir and a readout layer. While the output weights may change arbitrarily, the internal weights are coupled and scale equally and at a constant rate in every column. The scaling factors are determined by the initialization of the weights. This behavior is a consequence of the special error propagation of APRL and also reflects the lower $O(N^2)$ complexity of the algorithm. A comparison showed that the one-output behavior of APRL is closely related to echo state networks [Jaeger, 2001, 2002a,b].

The weight change of APRL and RTRL was also investigated in experiments. The data show that the weights of the output layer change faster than the weights of the reservoir. For APRL this is more pronounced than for RTRL. The slower change of the reservoir for RTRL might be due to the vanishing gradient during error backpropagation, whereas for APRL it reflects the functional division of the network, which leads to the reservoir being only scaled. The error overshoot shows up in the weight change when the reservoir change exceeds the change of the output layer. The relevance of the initialization of the weights was analyzed by applying different strategies to generate the initial weights. As could be expected, the results show a poorer learning performance if the variability in the initial weights is reduced. This underlines the importance of a sensible weight initialization when using APRL for training recurrent neural networks.

An attempt was made to exploit the properties of both algorithms by exchanging the weights and continuing to adjust them with the other algorithm in order to improve the learning performance. However, the outcome of the simulation shows that this is of little use. Only when switching from APRL to RTRL could the error sometimes be improved. Switching to RTRL cannot compensate for the error overshoot of APRL, nor can the overshoot be prevented by adjusting the weight matrix with RTRL before switching to APRL. It can hence be assumed that the two algorithms behave differently and that the respective minima of the training error correspond to distinct weight configurations.

Stability was dealt with in chapter 6, where a technique to trace the fixed points during the learning process was developed. The weight matrices obtained during training were used to determine the limit behavior of the network for 100 inputs from the Roessler trajectory. Boxplots of the distribution of the largest real part of the eigenvalues revealed that the stability behavior during training is highly variable. A connection between stability and the learning process is not evident from the results, but the error overshoot can be identified in the boxplots by a characteristic pattern. Moreover, the relation of the fixed points to the course of the input trajectory suggests that there might be some mechanism to control the stability behavior by means of the input. However, this is far from conclusive because the data do not yet provide sufficient evidence.
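The kind of quantity summarized in those boxplots can be illustrated with a minimal sketch. It is not the tracing technique of chapter 6, which is not reproduced here. It assumes the forward dynamics used throughout, extended by a hypothetical constant input vector `u`, `tanh` as activation, and a small random weight matrix so that plain iteration converges. The fixed point $x^*$ then satisfies $x^* = W f(x^*) + u$, and stability of the underlying continuous-time system is judged by the largest real part of the eigenvalues of the Jacobian $-I + W \operatorname{diag}(f'(x^*))$.

```python
import numpy as np

def fixed_point(W, u, dt=0.5, steps=500):
    """Iterate x <- (1 - dt) x + dt (W tanh(x) + u) from x = 0.

    Converges only if the map is contracting, which is assumed here
    by choosing a small weight matrix.
    """
    x = np.zeros(len(u))
    for _ in range(steps):
        x = (1 - dt) * x + dt * (W @ np.tanh(x) + u)
    return x

def largest_real_eigenvalue(W, x_star):
    """Largest real part of the eigenvalues of J = -I + W diag(f'(x*))."""
    slope = 1.0 - np.tanh(x_star) ** 2          # f'(x*) for tanh
    J = -np.eye(len(x_star)) + W * slope        # scales column j by f'(x*_j)
    return np.max(np.linalg.eigvals(J).real)

rng = np.random.default_rng(4)
N = 10
W = 0.2 * rng.normal(size=(N, N)) / np.sqrt(N)  # small weights (assumption)
u = rng.uniform(-0.5, 0.5, size=N)              # hypothetical constant input

x_star = fixed_point(W, u)
residual = np.linalg.norm(W @ np.tanh(x_star) + u - x_star)
max_real = largest_real_eigenvalue(W, x_star)   # negative means locally stable
```

Repeating this for every weight matrix saved during training and for a set of inputs would yield one distribution of `max_real` per training step, which is the type of data a boxplot can summarize.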

Chapter 7 proposed how new algorithms can be derived on the basis of the special properties of APRL. Two possibilities can be deduced directly from the formal analysis of the one-output behavior of APRL. Hybrid batch-online APRL exploits the functional division of the network and adapts only the output layer online, while the internal weights are subject to batch adaptation. Accelerated one-output APRL takes into account the constant scaling of the reservoir by explicitly computing the weight updates by means of the scaling factors. The proposed algorithms have asymptotic complexity $O(N^2)$ but with nearly half the constant factor compared to the original APRL algorithm. However, they have not been implemented yet, and future simulations have to verify whether they are suitable.

The aim of the analysis and the comparison of the algorithms was to gain insight into the issues of recurrent learning. In the following, I will make a few conclusive remarks on some aspects.

Conclusive Remarks ...

... on Recurrent Learning

Gradient based algorithms for recurrent learning are well understood and can be brought to a unified formulation in terms of a constrained optimization problem. There are two possibilities to approximate the exact gradient: Conventionally, the derivative $\partial E / \partial w$ with respect to the weights is computed and a stochastic approximation yields the weight updates in the negative direction of this gradient. Secondly, the gradient $\partial E / \partial x$ of the error with respect to the states can be computed and the weight update is obtained from the constraint equation and should yield a state change in the negative direction of $\partial E / \partial x$. Simulations show that both approaches are feasible and find good minima of the training error. The algorithms have different complexity, and they also differ in the number of learning steps required to reach the minimum and in their robustness to local deviations.

... on the Weight Dynamics

The analysis of the weight change for one-output APRL revealed a functional division of the network into a dynamical reservoir and an output layer. The reservoir weights are strongly coupled and change more slowly than the output weights. The resulting weight dynamics is restricted to a sub-manifold of the weight space which is determined by the initialization of the weights. Nevertheless, this sub-manifold provides enough variability for the network to implement an approximation of the desired output. The drawback is that the weight dynamics does not come to rest at the weight configuration that leads to the minimal training error. Instead, the reservoir is scaled to unreasonable configurations that cause an error overshoot. Evidently, the weight change that approximates the gradient $\partial E / \partial x$ does not vanish at the minimum. In contrast, the stochastic approximation of $\partial E / \partial w$ applied by RTRL maintains the full flexibility of the weight dynamics. In most cases it does not depart from good weight configurations once they have been reached (provided that the learning rate is small enough). The price is the higher complexity of the RTRL algorithm.

... on the Weight Space

When the weights are initialized with small uniformly distributed random values, APRL mostly finds good minima of the training error despite the restricted weight dynamics. Therefore it can be assumed that the structure of the weight space provides minima of the error in nearly every direction, at least in a certain vicinity of the origin. The fact that APRL quickly finds a minimum suggests that the weight space around the minima is not very jagged. These speculations apply equally to RTRL.

From another point of view, I suppose that the structure of the weight space is completely different with respect to the gradients $\partial E / \partial w$ or $\partial E / \partial x$. This is indicated by the different behavior of the algorithms after reaching the minimum. RTRL stays at the minimum, hence it is likely to correspond to a valley of the landscape in the weight space, to which a descending path leads that starts at the initial weight configuration. The simulations do not provide any information about the structure far away from the minimum.

APRL shows an error overshoot, which means that the weights are driven away from the minimum. Experiments show that this is not due to the learning rate being too high. Apparently, the minimum does not lie in a valley of the landscape, or at least one other path leads away from it. This could be due to the fact that the weight dynamics of APRL is also influenced by $\partial g / \partial w$, which might lead to perturbations of the structure imposed by $\partial E / \partial x$. In the experiments, the error overshoot was accompanied by fluctuations. Therefore I suppose that the error overshoot leads into a more jagged region of the weight space, where acceptable minima cannot be found.


Of course, much more experimental data is required to support these hypotheses. Therefore the algorithms should be applied to different tasks. Suitable examples could be provided by other complex attractors, like the butterfly attractor [Elwakil et al., 2002] or n-scroll attractors [Yalçin et al., 2001].

... on Algorithms

Although APRL is not optimal in all respects, the low complexity of $O(N^2)$ makes it very attractive compared to other algorithms. The special structure of networks with one output motivates basing further research on this algorithm. Echo state networks are an example showing that the functional division of the network is suitable for deriving new learning rules. For APRL, the functional division is not an a-priori assumption but is implicit in the learning rule based on the constrained optimization problem. This could be exploited to derive new algorithms that take into account the special structure of the network in order to solve the credit assignment problem. Instead of scaling the reservoir by hand as in the echo state approach, the reservoir update could be determined according to the error function. Possibly, other optimization methods can be incorporated to compute the weight updates more efficiently. In this way, it seems possible to reduce the complexity of recurrent learning further.

A Methods from Matrix Algebra

The next sections provide some methods and derivations that are used in this thesis.

A.1 Pseudoinverse Matrices

The pseudoinverse of a matrix is defined as follows [Rao and Mitra, 1971]:

Definition A.1 (Pseudoinverse) A pseudoinverse or generalized inverse of an $m \times n$ matrix $A$ is an $n \times m$ matrix $A^{-}$ such that

$$A A^{-} A = A. \tag{A.1}$$

Pseudoinverses are often useful in solving overdetermined equation systems. A common approach is the least squares solution.

Definition A.2 (Least squares solution) Consider the equation $Ax = y$. $\hat{x}$ is called a least squares solution if

$$\|A\hat{x} - y\| = \inf_x \|Ax - y\|. \tag{A.2}$$

Theorem A.3 Let $A$ and $G$ be real matrices. $Gy$ is a least squares solution of $Ax = y$, iff

$$AGA = A \quad \text{and} \quad (AG)^T = AG. \tag{A.3}$$

A proof of theorem A.3 can be found in [Rao and Mitra, 1971]. A least squares solution with respect to $\|\cdot\|_2$ can be derived as follows: We want to minimize

$$\|Ax - y\|_2^2 = (Ax - y)^T (Ax - y). \tag{A.4}$$

The minimum has to fulfill the condition

$$\nabla \|Ax - y\|_2^2 = 0. \tag{A.5}$$

By evaluating the gradient, we get the equation

$$\nabla \|Ax - y\|_2^2 = 2 (Ax - y)^T A = 2 x^T A^T A - 2 y^T A = 2 (A^T A x - A^T y)^T \overset{!}{=} 0. \tag{A.6}$$

If $A$ has full rank, then $A^T A$ is invertible and (A.6) has the solution

$$x = (A^T A)^{-1} A^T y \equiv A^{-} y. \tag{A.7}$$
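The derivation can be checked numerically. The sketch below assumes an overdetermined system with more rows than columns and full column rank, which holds almost surely for the random matrix used here. It solves the normal equations behind (A.7) and compares the result with NumPy's built-in least squares routine. The function name `least_squares_normal_eq` and the test data are illustrative assumptions.

```python
import numpy as np

def least_squares_normal_eq(A, y):
    """Solve A^T A x = A^T y, i.e. equation (A.7), without an explicit inverse."""
    return np.linalg.solve(A.T @ A, A.T @ y)

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))   # 6 equations, 3 unknowns
y = rng.normal(size=6)

x_hat = least_squares_normal_eq(A, y)
x_ref, *_ = np.linalg.lstsq(A, y, rcond=None)

# Gradient from (A.6) evaluated at the solution; it should vanish
grad = 2 * (A.T @ A @ x_hat - A.T @ y)

# Any perturbation of the solution increases the residual norm
res_opt = np.linalg.norm(A @ x_hat - y)
res_shifted = np.linalg.norm(A @ (x_hat + 0.1) - y)
```

Solving the linear system is used here instead of forming $(A^T A)^{-1}$ explicitly, which is the usual choice for numerical reasons and does not change the result.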

The following theorem states that under the given assumptions, equation (A.7) yields in general a least squares solution.


Theorem A.4 Let $\|x\| = \sqrt{x^T x}$ and $A$ be a real $m \times n$ matrix of full rank. Then

$$A^{-} = (A^T A)^{-1} A^T \tag{A.8}$$

is a pseudoinverse of $A$ and $A^{-} y$ is a least squares solution of $Ax = y$.

Proof: We have

$$A A^{-} A = A (A^T A)^{-1} (A^T A) = A, \tag{A.9}$$

$$(A A^{-})^T = \left[ A (A^T A)^{-1} A^T \right]^T = A (A^T A)^{-1} A^T = A A^{-}. \tag{A.10}$$

Hence $A^{-}$ is a pseudoinverse of $A$ by definition A.1 and $A^{-} y$ is a least squares solution of $Ax = y$ by theorem A.3.
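The two conditions used in the proof can also be verified numerically. The following sketch builds the matrix from (A.8) for a random matrix with more rows than columns, which is assumed to have full column rank, and checks the identities (A.9) and (A.10). The variable names are illustrative. The comparison with `np.linalg.pinv` is an additional sanity check, since for full column rank the Moore-Penrose pseudoinverse has the same closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(7, 4))              # assumed full column rank
A_ginv = np.linalg.inv(A.T @ A) @ A.T    # equation (A.8)

# (A.9): the defining property (A.1) of a pseudoinverse
cond_a1 = np.allclose(A @ A_ginv @ A, A)

# (A.10): the product A A_ginv is symmetric, the second condition in (A.3)
cond_sym = np.allclose((A @ A_ginv).T, A @ A_ginv)

# Agreement with NumPy's Moore-Penrose pseudoinverse
matches_pinv = np.allclose(A_ginv, np.linalg.pinv(A))
```

If $A$ does not have full column rank, $A^T A$ is singular and the closed form (A.8) is no longer available, whereas `np.linalg.pinv` still returns a generalized inverse.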

A.2 The Small Rank Adjustment Matrix Inversion Lemma

In document Analysis and Comparison of Algorithms for Training Recurrent Neural Networks (Page 83-90)