
5.5 Initialization of the Reservoir

The theoretical analysis of the weight dynamics of APRL for the one-output case revealed a functional division into a layer of output weights and a reservoir comprising the inner weights. The inner weights are updated such that their relative rates of change remain constant throughout the training. The constant scaling factor between the rates of change is determined beforehand by the initialization of the first column of the weight matrix. The question therefore arises whether the learning performance depends on the initialization of the weights. If such differences could be identified, a strategy to initialize the weights which is especially suited for a certain task would be desirable.

In order to evaluate the impact of the initialization, simulations are carried out using different strategies to generate the initial weights. The networks are trained with APRL on the first 1250 steps of the Roessler trajectory. From section 4.2 it is known that this task can be solved well by APRL. The discretization time step for all simulations was Δt = 0.1, and the learning rate was set to one unless noted otherwise.

The training errors are given in table C.7 in the appendix. A summary is presented in table 5.1, where the errors have been averaged over the trials with the respective initialization methods. The generalization errors are listed in tables 5.2 and C.8, respectively. For comparison, the results obtained with the normal initialization are repeated (cf. section 4.2).

5.5.1 Initialization with Positive Bias

This initialization generates the weights uniformly distributed from 0.0 to 0.2. The mean value of the distribution of the initial weights is shifted from 0.0 to 0.1, and the width of the distribution is reduced to 0.2. Since the initial weights are not distributed over positive and negative values but are all positive, it can be expected that this initialization will deteriorate the learning performance. The training errors are in fact much higher than those for the normal initialization (cf. table 5.1). The minimal training error is over 200 times higher, and the factor is even larger for the generalization error. The minimal training error is on average reached early, which means that an error overshoot occurs in most trials. Evidently, the Atiya-Parlos algorithm cannot cope well with this kind of initialization. Possibly, the reservoir weights cannot be scaled to a configuration which can be used by the output layer for a good readout function. This might be due to the lower variability of the scaling factors, which are all positive due to the initialization. No weight configuration can be found which minimizes the error sufficiently before the error overshoot occurs.
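The sign restriction on the scaling factors can be illustrated with a minimal sketch. It assumes NumPy, a hypothetical network size of n = 10 (the size used in the experiments is not stated here), and a range of -0.2 to 0.2 for the normal initialization; the function names are illustrative only. The scaling factors are taken to be the ratios W_i1(0)/W_21(0) of the first column, as used in section 5.5.2. Note that the code uses 0-based indices, so W_21 is `W[1, 0]`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 10                          # hypothetical network size (an assumption)


def init_uniform(low, high, size, rng):
    """Draw a size x size weight matrix uniformly from [low, high)."""
    return rng.uniform(low, high, size=(size, size))


def first_column_factors(W):
    """Ratios W_i1(0) / W_21(0) for rows i > 1 (1-based indices as in the text)."""
    return W[1:, 0] / W[1, 0]


W_normal = init_uniform(-0.2, 0.2, n, rng)  # assumed range of the normal initialization
W_bias = init_uniform(0.0, 0.2, n, rng)     # initialization with positive bias

factors_normal = first_column_factors(W_normal)
factors_bias = first_column_factors(W_bias)

print("mean of biased weights:          ", W_bias.mean())
print("bias: all factors positive:      ", bool((factors_bias > 0).all()))
print("normal: some factors negative:   ", bool((factors_normal < 0).any()))
```

With the biased initialization every ratio of two first-column entries is positive, whereas the symmetric range typically yields factors of both signs.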

5.5.2 Initialization with Scaled Columns

Since the relative rates of change of the inner weights remain constant, the relative rates of the weights themselves are strongly dependent on the initial values, too. More precisely, all reachable weight configurations lie on a hyperplane defined by

\[ W_{ij}(K) = W_{ij}(0) + \frac{W_{i1}(0)}{W_{21}(0)} \sum_{k=1}^{K} \Delta W_{2j}(k), \qquad i > 1. \]

The accessible weight space is thus reduced to a sub-manifold of the whole weight space. The sub-manifold is the hyperplane which is defined by the initial weights of the first column. To investigate whether a restriction in the variability of the rates has an effect, a scaled initialization is applied. The first two rows and the first column of the weight matrix are initialized randomly and uniformly distributed from -0.2 to 0.2. The rest of the weights are generated by scaling the respective rates of the first column:

\[ W_{ij}(0) = \frac{W_{i1}(0)}{W_{21}(0)} W_{2j}(0), \qquad i > 2,\; j > 1. \]

As a consequence, the relative rates of the weights (not only of the weight changes) are the same for every column:

\[ W_{ij}(K) = \frac{W_{i1}(0)}{W_{21}(0)} W_{2j}(0) + \frac{W_{i1}(0)}{W_{21}(0)} \sum_{k=1}^{K} \Delta W_{2j}(k) = \frac{W_{i1}(0)}{W_{21}(0)} W_{2j}(K), \qquad i > 1. \]

The results for this initialization show that the learning performance is as poor as for the initialization with bias. The generalization errors are even worse. Since the minimal training error is reached in early epochs, it can be assumed that the error overshoot occurs. The detailed training errors in table C.7 show that this is in fact the case. It can be concluded that the scaled initialization does not provide enough variability for the weight matrix to generate an appropriate input-output function. Possibly, the restriction of the reachable weight configurations prevents the network from implementing enough memory for the task. Since the Atiya-Parlos algorithm cannot break the coupling between the weights, the scaling of the weight matrix does not lead to a favorable weight configuration. The output layer cannot access sufficient dynamics of the reservoir, and therefore the approximation of the target output is very poor.

| method | learning rate | avg error after 100 epochs | stddev after 100 epochs | avg min training error | stddev of min error | avg epoch w. min error | stddev of epoch w. min error |
|---|---|---|---|---|---|---|---|
| normal | 1 | 0.033417 | 0.005047 | 0.033415 | 0.005047 | 99.8000 | 0.4000 |
| bias | 1 | 16.961311 | 0.281640 | 6.732568 | 2.445212 | 6.4000 | 4.9639 |
| scaled | 1 | 17.243574 | 0.343552 | 8.208835 | 0.548522 | 5.4000 | 4.1761 |
| large | 1 | 17.383524 | 0.280863 | 6.330327 | 3.528913 | 7.8000 | 3.1875 |
| large | 0.1 | 11.232833 | 7.962248 | 3.545691 | 2.292551 | 49.0000 | 21.9870 |

Table 5.1: Training errors for Atiya-Parlos recurrent learning with different reservoir initialization.
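The construction of the scaled initialization and the resulting coupling of the rows can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: the network size `n`, the number of updates `K`, and the random stand-in for the changes of the second row are hypothetical, and the stand-in is not the APRL update rule. The only property taken from the text is that each row i > 1 changes in proportion to the second row with the constant factor W_i1(0)/W_21(0). Indices in the code are 0-based.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen only for reproducibility
n, K = 6, 20                    # hypothetical network size and number of updates

# Free parameters: first two rows and first column, uniform in [-0.2, 0.2).
W0 = np.zeros((n, n))
W0[:2, :] = rng.uniform(-0.2, 0.2, size=(2, n))
W0[:, 0] = rng.uniform(-0.2, 0.2, size=n)

# Constant factors W_i1(0) / W_21(0) for rows i > 1 (1-based); factors[0] is 1.
factors = W0[1:, 0] / W0[1, 0]

# Remaining entries (i > 2, j > 1): W_ij(0) = factors_i * W_2j(0).
W0[2:, 1:] = np.outer(factors[1:], W0[1, 1:])

# Stand-in training: arbitrary changes of the second row, with every other
# row i > 1 changing proportionally. The first (output) row is left untouched.
W = W0.copy()
for _ in range(K):
    delta_row2 = rng.normal(scale=0.01, size=n)  # placeholder for Delta W_2j(k)
    W[1:, :] += np.outer(factors, delta_row2)

# Every row i > 1 is still a multiple of the second row: W_ij(K) = factors_i * W_2j(K).
deviation = np.abs(W[1:, :] - np.outer(factors, W[1, :])).max()
print("max deviation from the scaled form:", deviation)
```

In this sketch the rows i > 1 span only a one-dimensional subspace at every step, which is one way to read the limited variability of the scaled initialization discussed above.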

5.5.3 Initialization with Large Weights

Finally, a strategy with large initial weights is investigated. The weights are generated uniformly distributed from -2.0 to 2.0. This initialization increases the range over which the initial weights are spread. It also potentially increases the constant factors for the relative change of the weights. As this might also increase the learning steps, the strategy is used with two different learning rates.

For a learning rate of 1.0, the results are as poor as with the initialization strategies above. There is one trial which achieves a minimal training error of about 0.76 and a generalization error of about 1.12 at the minimum. The error overshoot is observable in all trials. Apparently, the larger initial weights cannot be scaled to an efficient weight configuration. Possibly, the higher rates have the effect that the weight space is sampled too coarsely. Therefore no reservoir configuration is reached which the output layer can use to generate a good approximation of the target output. The training error is high, and the learning steps are large. This might have the effect that the network skips through the weight space instead of refining the weight configuration.


| method | learning rate | at minimum: avg | at minimum: stddev | after 50 epochs: avg | after 50 epochs: stddev | after 100 epochs: avg | after 100 epochs: stddev |
|---|---|---|---|---|---|---|---|
| normal | 1 | 0.009401 | 0.009620 | 0.012977 | 0.013427 | 0.009376 | 0.009616 |
| bias | 1 | 16.326431 | 12.209981 | 22.170402 | 2.240131 | 21.755566 | 3.090461 |
| scaled | 1 | 22.391564 | 15.160038 | 20.556044 | 4.037547 | 18.658886 | 3.773244 |
| large | 1 | 9.102609 | 6.601609 | 17.087690 | 3.096830 | 17.118008 | 3.081007 |
| large | 0.1 | 5.933696 | 3.791278 | 8.160569 | 4.826686 | 15.920765 | 10.434357 |

Table 5.2: Generalization errors for Atiya-Parlos recurrent learning with different reservoir initializations.

In order to investigate whether the negative effects of the large initialization can be compensated by a smaller learning rate, the large initialization was also applied with a learning rate of 0.1. The results are listed for seven trials, and the averages in tables 5.1 and 5.2 are calculated over these seven trials.

The average of the training error after 100 epochs has a significantly higher standard deviation. The same holds for the average epoch after which the minimal training error is reached. The individual results of the trials reveal that there are two cases performing noticeably better than the others. One of them reaches a minimal training error in the range of those with the normal initialization of the weights. The generalization is also good, and no error overshoot occurs. Another trial performs at least one order of magnitude better than the others with respect to both the training and the generalization error. Although an error overshoot is identifiable, it is not as drastic here.

The observations show that it is in some cases possible to compensate for the large initial weights by means of a smaller learning rate. This indicates that larger step sizes are responsible for the deterioration of the learning performance when the initial weights are too large. The smaller learning rate avoids overly large skips through the weight space and prevents the network from missing good weight configurations. Nevertheless, a relevant number of trials still perform worse than with the normal initialization. Hence there must be other effects perturbing the learning success. Possibly, the initialization with larger weights yields a distribution of the scaling factors which restricts the reachable subspace for the weights such that the reservoir cannot be scaled to a sensible configuration. A detailed analysis of the distribution of the relative rates of the weights and their respective changes resulting from the initialization could give more insight into these effects. However, such an analysis is beyond the scope of this work.

The material presented in this section shows that the initialization strategy is crucial for the learning performance of the Atiya-Parlos algorithm. The results indicate that a certain variability in the weights is essential for successful learning, and restricting the variability of the weights can significantly deteriorate the learning performance. The realization of a good readout function by the output layer seems to rely on the reservoir providing sufficient dynamical behavior. This is necessary in order to implement enough memory for the task. During learning, APRL establishes a suitable reservoir scaling and an appropriate readout function for the output layer.

Since the variability in the weights required to provide the dynamical behavior of the reservoir is primarily determined by the initialization, a clever initialization strategy is essential for the performance of the Atiya-Parlos algorithm. Initialization with small uniformly distributed weights seems to work well. Further investigations should clarify whether better strategies can be found.

56 5 Analyzing the Weight Dynamics of Recurrent Learning

In document Analysis and Comparison of Algorithms for Training Recurrent Neural Networks (Page 59-62)