Learning with Smaller Learning Rates - Analysis and Comparison of Algorithms for Training Recur

| steps p. epoch | $\Delta t$ | $\eta$ | trained epochs | error after last epoch | min training error | epoch w. min error | avg error over last 200 epochs | stddev over last 200 epochs |
|---|---|---|---|---|---|---|---|---|
| 1000 | 0.1 | 0.01 | 2500 | 18.797245 | 3.482553 | 247 | 18.791067 | 0.075973 |
| 1019 | 0.1 | 0.01 | 2400 | 0.200479 | 0.173642 | 1701 | 0.201548 | 0.000902 |
| 1020 | 0.1 | 0.01 | 2400 | 3.418433 | 0.297886 | 1001 | 7.308657 | 7.131754 |
| 1250 | 0.1 | 0.01 | 2000 | 0.087244 | 0.087244 | 2000 | 0.096889 | 0.006188 |
| 1430 | 0.1 | 0.01 | 1700 | 2.881307 | 1.418606 | 1263 | 3.993059 | 0.540261 |

Table 4.8: Training errors for Atiya-Parlos recurrent learning with low learning rate.

| learning steps p. epoch | epoch with min training error | generalization error at min epoch | last epoch | generalization error after last epoch |
|---|---|---|---|---|
| 1000 | 247 | 0.930662 | 2500 | 13.643557 |
| 1019 | 1701 | 0.126878 | 2400 | 0.102271 |
| 1020 | 1001 | 0.102463 | 2400 | 5.185716 |
| 1250 | 2000 | 0.070850 | 2000 | 0.070850 |
| 1430 | 1263 | 0.822627 | 1700 | 0.628722 |

Table 4.9: Generalization errors for Atiya-Parlos recurrent learning with low learning rate.

4.4 Learning with Smaller Learning Rates

As mentioned above, the Atiya-Parlos algorithm might drive the network into unfavorable regions of the weight space if the learning rate is chosen too high. A later reduction of the learning rate may then yield no improvement of the training error. This section investigates whether this can be prevented by learning with a small learning rate from the outset. Five networks are trained with a learning rate of $\eta = 0.01$ on the same parts of the Roessler attractor as before. The other parameters were kept fixed, e.g. $\Delta t = 0.1$ and […] $= 1$.

Training Errors

Training had to be run for a large number of epochs in order to obtain comparable results. The training errors are given in table 4.8. In none of the cases is the learning performance improved: all minima of the training error are higher than those obtained with $\eta = 1$. For 1000, 1020 and 1430 learning steps, the error at the end of training is significantly higher than at the minimum, which indicates an error overshoot. The average training error over the last 200 epochs shows a large standard deviation for 1020 learning steps. In contrast, the trials with 1019 and 1250 steps per epoch do not show the error overshoot. However, their training error does not reach a value as low as with $\eta = 1$. This might be due to slower decay as a result of the smaller learning rate and a possibly more severe impact of the vanishing gradient. Continued training could probably decrease the training error further in these cases, especially for the trial with 1250 steps, where the minimum is reached at the end of training.
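The overshoot pattern can be read directly off table 4.8 by comparing the error after the last epoch with the minimum training error. The following minimal sketch does this for the five trials. The function name `overshoot_trials` and the threshold of 1.5 are illustrative assumptions, not a criterion used in the experiments.

```python
# Values copied from table 4.8:
# steps per epoch -> (error after last epoch, min training error)
table_4_8 = {
    1000: (18.797245, 3.482553),
    1019: (0.200479, 0.173642),
    1020: (3.418433, 0.297886),
    1250: (0.087244, 0.087244),
    1430: (2.881307, 1.418606),
}


def overshoot_trials(table, threshold=1.5):
    """Return the trials whose final error exceeds `threshold` times the minimum.

    The threshold is a hypothetical cut-off chosen only for illustration.
    """
    return [
        steps
        for steps, (last, best) in sorted(table.items())
        if last / best > threshold
    ]


flagged = overshoot_trials(table_4_8)
print(flagged)
```

With this cut-off the trials with 1000, 1020 and 1430 steps per epoch are flagged, which matches the reading of the table given above.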

Generalization Errors

The generalization errors are listed in table 4.9. They also tend to be worse than those obtained with $\eta = 1$. The overshoot occurs as well, with the exception that the generalization error for 1430 steps is better after 1700 epochs than at the minimum of the training error. This can be considered a matter of chance. For the cases without an error overshoot, the generalization improves towards the end of training.


Error Curves

The curves of the training error are plotted in figure 4.6. They exhibit the same phenomena as observed before. The curve for 1250 steps of the Roessler dynamics learned per epoch (4.6(b)) improves gradually, apart from some slight increases. The decay of the error is fast in the first 500 epochs, then slows down, and in the last 1000 epochs the change is very small. In the case of 1000 steps (4.6(a)), the error decreases fast in the beginning and reaches its minimum after 247 epochs. Thereafter, it increases slightly again, and after about 1000 epochs the overshoot occurs. The error overshoot is hence not dependent on the learning rate. Erratic behavior can be observed as well, cf. figure 4.6(c), which shows the trial for 1020 steps per epoch. Again, the decay of the error is fast in the beginning. At a certain point the error jumps, and large fluctuations appear afterwards. It can hence be concluded that the learning rate is not responsible for the erratic behavior either.

Summary for Learning with Smaller Learning Rates

Altogether, the errors show characteristics comparable to those of training with $\eta = 1$. The effect of the smaller learning rate is merely a stretching of the time frame over which the training proceeds. Of course, the material presented in this section does not provide enough data to draw reliable conclusions. Nevertheless, since error overshoot and erratic behavior can be observed despite the smaller learning rate, it can be concluded that they are not primarily caused by an unfavorable learning rate. They are rather a typical feature of the Atiya-Parlos algorithm when applied to the task of learning the Roessler attractor.

The results of this chapter show that RTRL and APRL behave differently with respect to training and generalization errors. While RTRL is able to learn the task well, regardless of whether transients occur, APRL shows completely different behavior when the number of learning steps per epoch is varied. A characteristic feature of APRL is the error overshoot that occurs after the training error has reached its minimum. This suggests that the weight updates made by APRL deviate from the direction of the gradient of the error with respect to the weights. Evidently, APRL uses a different strategy to navigate through the weight space than RTRL.

[Figure 4.6, panels (a) to (c): mean square error plotted against epoch.]

(a) Error curve for a trial with 1000 steps per epoch.

(b) Error curve for a trial with 1250 steps per epoch.

(c) Error curve for a trial with 1020 steps per epoch.

Figure 4.6: Error curves for Atiya-Parlos recurrent learning with low learning rate.

5 Analyzing the Weight Dynamics of Recurrent Learning

This chapter deals with the weight dynamics of recurrent learning. The aim is to gain insight into how different algorithms navigate through the weight space. For fully connected recurrent networks, the dimension of the weight space is high, and it is difficult to obtain concrete descriptions of its structure. In the special case of only one output neuron, which is actually common in practical applications, an analytical result for the weight dynamics of APRL can be derived. This result reveals a functional division of the network. After the presentation of the formal analysis, the result is compared to the related echo state approach. Simulations that investigate the practical consequences are presented.

5.1 The One-Output-Behavior of the Atiya-Parlos Algorithm

In the following, a formal analysis of the one-output behavior of the APRL algorithm will be presented. This analysis was recently carried out by Schiller and Steil [2003].

Theorem 5.1 Consider a recurrent neural network with only one output $x_1$, which is trained using the Atiya-Parlos algorithm. The weight changes for the internal weights scale equally and with constant rate in every column. The scaling factors are determined by the initial weights in the first column. It holds

$$\forall k\ \forall i>1\ \forall j>1:\quad \Delta W_{ih}(k) = \frac{W_{i1}(0)}{W_{j1}(0)}\,\Delta W_{jh}(k), \qquad 1 \le h \le N, \tag{$\ast$}$$

where $N$ denotes the number of units, including input units.

A similar statement holds for the other quantities used in the Atiya-Parlos algorithm:

Lemma 5.2 Assuming the same as in theorem 5.1, the following equations hold for all $k$, $i>1$, $j>1$ and $1 \le h \le N$:

$$\gamma_i(k) = \frac{W_{i1}(0)}{W_{j1}(0)}\,\gamma_j(k), \tag{L1}$$

$$B_{ih}(k) = \frac{W_{i1}(0)}{W_{j1}(0)}\,B_{jh}(k), \tag{L2}$$

$$W_{i1}(k) = \frac{W_{i1}(0)}{W_{j1}(0)}\,W_{j1}(k). \tag{L3}$$

Note that $W_{j1}$

Proof: The key to the proof is the fact that for one output neuron only the first component of the error vector $e(k)$ is non-zero:

$$\forall k\ \forall i>1:\quad e_i(k) = 0. \tag{5.1}$$

We prove $(\ast)$ and (L1) to (L3) simultaneously by induction on the time step $k$.

$k = 1$: Let $i>1$, $j>1$. By the definitions of $\gamma(1)$, $B(1)$ and $\Delta W(1)$ in algorithm 2.1 we have

$$\gamma_i(1) = -e_i(1) = 0 = -e_j(1) = \gamma_j(1),$$

$$B_{ih}(1) = \gamma_i(1)\,f(x_h^T(0)) = 0 = \gamma_j(1)\,f(x_h^T(0)) = B_{jh}(1),$$

and

$$\Delta W_{ih}(1) = \frac{\eta}{\Delta t}\sum_m B_{im}(1)\,V^{-1}_{mh}(1) = 0 = \frac{\eta}{\Delta t}\sum_m B_{jm}(1)\,V^{-1}_{mh}(1) = \Delta W_{jh}(1).$$

Substituting this into $W(1)$ yields

$$W_{i1}(1) = W_{i1}(0) + \Delta W_{i1}(1) = W_{i1}(0) = \frac{W_{i1}(0)}{W_{j1}(0)}\,W_{j1}(0) = \frac{W_{i1}(0)}{W_{j1}(0)}\bigl(W_{j1}(0) + \Delta W_{j1}(1)\bigr) = \frac{W_{i1}(0)}{W_{j1}(0)}\,W_{j1}(1).$$

With the above equations, we have proven that $(\ast)$ and (L1) to (L3) hold for $k=1$.

$k \to k+1$: Let $i>1$, $j>1$ and suppose that $(\ast)$ and (L1) to (L3) hold for all $k' \le k$.

From the formula for $\gamma(k+1)$ in algorithm 2.1 we get

$$\frac{\gamma_i(k+1)}{\gamma_j(k+1)} = \frac{-e_i(k+1) + \sum_m \bigl[(1-\Delta t)I + \Delta t\,W(k)D(k)\bigr]_{im}\,e_m(k)}{-e_j(k+1) + \sum_m \bigl[(1-\Delta t)I + \Delta t\,W(k)D(k)\bigr]_{jm}\,e_m(k)} = \frac{\Delta t\,W_{i1}(k)\,f'(x_1(k))\,e_1^T(k)}{\Delta t\,W_{j1}(k)\,f'(x_1(k))\,e_1^T(k)} = \frac{W_{i1}(k)}{W_{j1}(k)} = \frac{W_{i1}(0)}{W_{j1}(0)}.$$

We have used that $e_i(k+1)$ and $e_j(k+1)$ vanish by equation (5.1). The sum contributes only the term for $m=1$, and since $i \ne 1$, the identity in the square brackets drops out. The last equality follows from the induction hypothesis (L3). If $e_1(k) = 0$, then $\gamma_i(k+1) = \gamma_j(k+1) = 0$ and (L1) trivially holds. Hence we have proven that (L1) holds for all $k$.

For $B(k+1)$ we get

$$\frac{B_{ih}(k+1)}{B_{jh}(k+1)} = \frac{B_{ih}(k) + \gamma_i(k+1)\,f(x_h^T(k))}{B_{jh}(k) + \gamma_j(k+1)\,f(x_h^T(k))} = \frac{\dfrac{W_{i1}(0)}{W_{j1}(0)}\bigl(B_{jh}(k) + \gamma_j(k+1)\,f(x_h^T(k))\bigr)}{B_{jh}(k) + \gamma_j(k+1)\,f(x_h^T(k))} = \frac{W_{i1}(0)}{W_{j1}(0)}.$$

We have used (L1) and the induction hypothesis to substitute $\gamma_i(k+1)$ and $B_{ih}(k)$. If $B_{jh}(k+1) = 0$, then also $B_{ih}(k+1) = 0$. Hence (L2) holds for all $k$.

Now we turn to $\Delta W(k+1)$. We use equation (2.43) and cancel $\eta/\Delta t$ and the denominator to write

$$\frac{\Delta W_{ih}(k+1)}{\Delta W_{jh}(k+1)} = \frac{\bigl(\gamma_i(k+1) - \sum_m B_{im}(k)\,[V^{-1}(k)f(x(k))]_m\bigr)\,[V^{-1}(k)f(x(k))]^T_h}{\bigl(\gamma_j(k+1) - \sum_m B_{jm}(k)\,[V^{-1}(k)f(x(k))]_m\bigr)\,[V^{-1}(k)f(x(k))]^T_h} = \frac{\dfrac{W_{i1}(0)}{W_{j1}(0)}\bigl(\gamma_j(k+1) - \sum_m B_{jm}(k)\,[V^{-1}(k)f(x(k))]_m\bigr)}{\gamma_j(k+1) - \sum_m B_{jm}(k)\,[V^{-1}(k)f(x(k))]_m} = \frac{W_{i1}(0)}{W_{j1}(0)}.$$

The terms $[V^{-1}(k)f(x(k))]^T_h$ cancel. Then we use (L1) and (L2) to substitute $\gamma_i(k+1)$ and $B_{im}(k)$. If $\Delta W_{jh}(k+1) = 0$, then also $\Delta W_{ih}(k+1) = 0$. Hence $(\ast)$ holds for all $k$.

Finally, we have to prove the induction step for (L3). From algorithm 2.1 we get:

$$\frac{W_{i1}(k+1)}{W_{j1}(k+1)} = \frac{W_{i1}(k) + \Delta W_{i1}(k+1)}{W_{j1}(k) + \Delta W_{j1}(k+1)} = \frac{\dfrac{W_{i1}(0)}{W_{j1}(0)}\bigl(W_{j1}(k) + \Delta W_{j1}(k+1)\bigr)}{W_{j1}(k) + \Delta W_{j1}(k+1)} = \frac{W_{i1}(0)}{W_{j1}(0)}.$$

Here, we have used $(\ast)$ and the induction hypothesis to substitute $\Delta W_{i1}(k+1)$ and $W_{i1}(k)$. If $W_{j1}(k+1) = 0$, then also $W_{i1}(k+1) = 0$, and consequently (L3) holds.

Altogether, it follows by complete induction that Theorem 5.1 and Lemma 5.2 hold.
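The proportionality stated in Theorem 5.1 can also be checked numerically. The following is a minimal sketch, not the implementation of algorithm 2.1: it iterates only the recurrences for the error term, the matrix B and the weight change as they are used in the proof, and feeds them arbitrary (random) states and output errors, since the argument is purely algebraic. The function name, the tanh activation, the way V is accumulated, and the omission of the scalar denominator that the proof cancels are assumptions made for illustration.

```python
import numpy as np


def check_one_output_scaling(n_units=5, n_steps=20, dt=0.1, eta=0.01, seed=0):
    """Iterate the recurrences from the proof with a single output neuron.

    Index 0 plays the role of the output neuron x_1. Returns the largest
    violation of the proportionality in Theorem 5.1 and the largest inner
    weight change (to show that the check is not trivially zero).
    """
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_units, n_units))  # W(0)
    w0 = W[1:, 0].copy()              # W_i1(0) for the inner neurons i > 1
    B = np.zeros((n_units, n_units))
    V = np.eye(n_units)               # assumed stand-in for the matrix V
    e_prev = np.zeros(n_units)
    worst, largest = 0.0, 0.0
    for _ in range(n_steps):
        x = rng.normal(size=n_units)  # arbitrary state x(k)
        fx = np.tanh(x)               # assumed activation f = tanh
        D = np.diag(1.0 - fx ** 2)    # D(k) = diag(f'(x(k)))
        e = np.zeros(n_units)
        e[0] = rng.normal()           # only the output carries an error
        gamma = -e + ((1.0 - dt) * np.eye(n_units) + dt * W @ D) @ e_prev
        V += np.outer(fx, fx)
        u = np.linalg.solve(V, fx)    # V^{-1} f(x(k))
        dW = (eta / dt) * np.outer(gamma - B @ u, u)
        B += np.outer(gamma, fx)
        W += dW
        inner = dW[1:, :]             # rows i > 1
        # cross[i, j, h] = dW_ih * W_j1(0) - W_i1(0) * dW_jh, zero if the theorem holds
        cross = (inner[:, None, :] * w0[None, :, None]
                 - w0[:, None, None] * inner[None, :, :])
        worst = max(worst, float(np.abs(cross).max()))
        largest = max(largest, float(np.abs(inner).max()))
        e_prev = e
    return worst, largest


worst, largest = check_one_output_scaling()
print(worst, largest)
```

The first returned value should stay at the level of floating-point rounding while the second is clearly non-zero. The cross-multiplied form avoids dividing by weights that may be zero.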

Theorem 5.1 shows that the network consists of two different groups of weights:

• the weights that connect arbitrary neurons to the output neuron $x_1$, which may change arbitrarily;

• the weights that interconnect the inner neurons $x_i$, $i>1$, which are systematically coupled: their rates of change are proportional, and the constant factor is determined beforehand by the initialization.

This result holds for the case of a single output neuron and a generalization to more than one output is not straightforward.

The explanation for the functional structuring of the network is the special way in which errors are propagated by the Atiya-Parlos algorithm. This can be seen in the expression for $\gamma(k)$ in algorithm 2.1. The term $-e(k) + (1-\Delta t)\,I\,e(k-1)$ reduces to $-e_1(k) + (1-\Delta t)\,e_1(k-1)$ and influences the change of the output weights. The error propagation is contained in $\Delta t\,W D(k-1)\,e(k-1)$, which reduces to $\Delta t\,w_{i1}\,f'(x_1(k-1))\,e_1(k-1)$. The error is propagated over the feedback connections only, and the weight change in each column is proportional to $f'(x_1(k-1))$. The latter factor is small for activations $x_1$ close to saturation, and hence the weight change is larger for activations in the working area around zero. In further time steps $k+s$, $s>1$, the error $e(k)$ is only contained in the accumulated sum in the matrix $B$. Since only the feedback weights $w_{i1}$ are involved in the error propagation, a special structural credit assignment evolves that leads to the functional division of the network. On the one hand, this special structure reflects the lower $O(N^2)$ complexity of APRL. On the other hand, it could lead to poorer performance of APRL in tasks with long-term dependencies. This would explain the effect of the transients observed in chapter 4.
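Written out per component, the two contributions separate as follows. This is a restatement for illustration, assuming that the error term in algorithm 2.1 has the form used in the induction step of the proof (with the time index shifted by one) and using equation (5.1). The symbol $w_{11}$ for the self-feedback weight of the output neuron is introduced here only for this purpose.

```latex
\begin{align*}
\gamma_1(k) &= -e_1(k) + (1-\Delta t)\,e_1(k-1)
               + \Delta t\, w_{11}\, f'(x_1(k-1))\, e_1(k-1), \\
\gamma_i(k) &= \Delta t\, w_{i1}\, f'(x_1(k-1))\, e_1(k-1),
               \qquad i > 1 .
\end{align*}
```

Only the output row receives the direct error terms. Every inner row sees the output error solely through its feedback weight $w_{i1}$, which is why the inner rows remain proportional to each other.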

In document Analysis and Comparison of Algorithms for Training Recurrent Neural Networks (Page 47-53)