
5.4 Echo state networks with working memory

5.4.1 Model

Our model is obtained by adding a set of special output units to an otherwise standard ESN. We call these units WM-units; they differ from normal output units by having trainable connections to one another and to themselves, as in Figure 5.3. In our setup we only allow feedback connections to the reservoir¹ from the WM-units, but not from the other, regular output units. Another difference between WM-units and output units is that the former are binary-state neurons which can store memory bits. To achieve this behaviour we use a sharp threshold function $\sigma^{(mem)}$ as the activation function of these units:

$$\sigma^{(mem)}(x) = \begin{cases} -0.5 & x \le 0 \\ +0.5 & x > 0 \end{cases} \tag{5.2}$$
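As a minimal sketch of the threshold in Equation (5.2), the function below maps any pre-activation to one of two states. It assumes the lower branch is -0.5 for non-positive inputs; the name `sigma_mem` and the use of NumPy are illustrative choices, not part of the original model description.

```python
import numpy as np

def sigma_mem(x):
    """Sharp threshold of Equation (5.2): -0.5 where x <= 0, +0.5 where x > 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0.0, 0.5, -0.5)

# The unit flips exactly at zero; the magnitude of x carries no information.
states = sigma_mem([-2.0, 0.0, 1e-9, 3.0])
```

Because the output depends only on the sign of the pre-activation, small perturbations of the input to a WM-unit leave the stored bit unchanged unless they cross zero.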

The network is described by Equations (5.3), (5.4) and (5.5). Note the addition of the WM-units $m$, as well as the feedback connections $W^{(feedback)}$ from the memory units. Compared to a standard RNN, we also have direct connections from the input to the output units and to the memory units, as well as connections between the memory units. For clarity of notation these connections were folded into the matrices $W^{(out)}$ and $W^{(mem)}$ respectively. We do not use biases (we do, however, force one of the input units to have a constant negative value), and rely on the tanh activation function. The only learnt weights of the model are $W^{(mem)}$ and $W^{(out)}$.

$$h[n+1] = \sigma\left(W^{(in)} u[n+1] + W^{(rec)} h[n] + W^{(feedback)} m[n]\right) \tag{5.3}$$

$$y[n+1] = \sigma^{(out)}\left(W^{(out)} \begin{bmatrix} u[n+1] \\ h[n+1] \end{bmatrix}\right) \tag{5.4}$$

$$m[n+1] = \sigma^{(mem)}\left(W^{(mem)} \begin{bmatrix} u[n+1] \\ h[n+1] \\ m[n] \end{bmatrix}\right) \tag{5.5}$$

1. In the Reservoir Computing literature, the hidden state of a recurrent network is usually referred to as a reservoir.
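One update of the network in Equations (5.3) to (5.5) can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the function name `esn_wm_step`, the layer sizes, the random weight values and the choice of tanh for the output activation are all hypothetical, and the threshold assumes states of -0.5 and +0.5.

```python
import numpy as np

def sigma_mem(x):
    """Assumed sharp threshold: -0.5 where x <= 0, +0.5 where x > 0."""
    return np.where(x > 0.0, 0.5, -0.5)

def esn_wm_step(u_next, h, m, W_in, W_rec, W_fb, W_out, W_mem,
                sigma_out=np.tanh):
    """One time step: new reservoir state, output and WM-unit state."""
    # (5.3) reservoir driven by input, previous state and WM feedback
    h_next = np.tanh(W_in @ u_next + W_rec @ h + W_fb @ m)
    # (5.4) output reads the stacked input and new reservoir state
    y_next = sigma_out(W_out @ np.concatenate([u_next, h_next]))
    # (5.5) WM-units also read their own previous state
    m_next = sigma_mem(W_mem @ np.concatenate([u_next, h_next, m]))
    return h_next, y_next, m_next

# Hypothetical sizes and random weights, for shape checking only.
rng = np.random.default_rng(0)
n_in, n_res, n_mem, n_out = 3, 20, 2, 4
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_rec = 0.1 * rng.normal(size=(n_res, n_res))
W_fb = rng.uniform(-0.5, 0.5, (n_res, n_mem))
W_out = rng.normal(size=(n_out, n_in + n_res))
W_mem = rng.normal(size=(n_mem, n_in + n_res + n_mem))

u = np.array([0.2, -0.1, -0.5])   # last unit: constant negative input, no bias
h0 = np.zeros(n_res)
m0 = np.full(n_mem, -0.5)
h1, y1, m1 = esn_wm_step(u, h0, m0, W_in, W_rec, W_fb, W_out, W_mem)
```

The column counts of `W_out` and `W_mem` reflect how the direct input connections and the WM-to-WM connections are folded into those matrices.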

Training this model is similar to the standard training of ESNs. Because the WM-units feed back into the reservoir, we need the value of the WM-units at time $t-1$ in order to compute the state of the reservoir $h[t]$. We use a teacher-forcing approach, relying on the true targets of the WM-units to obtain the activations of the reservoir for a given input sequence.
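The teacher-forced state collection described above might look like the sketch below. The function `harvest_states`, the zero initial reservoir state and the toy sizes are assumptions made for illustration; the key point is that the target WM values, not the model's own WM outputs, are fed back at every step.

```python
import numpy as np

def harvest_states(U, M_target, W_in, W_rec, W_fb, m_init):
    """Run the reservoir over inputs U (T x n_in), feeding back the target
    WM-unit values M_target (T x n_mem) instead of the model's own."""
    T, n_res = U.shape[0], W_rec.shape[0]
    H = np.zeros((T, n_res))
    h = np.zeros(n_res)          # assumed initial reservoir state
    m_prev = m_init
    for n in range(T):
        h = np.tanh(W_in @ U[n] + W_rec @ h + W_fb @ m_prev)
        H[n] = h
        m_prev = M_target[n]     # teacher forcing: use the target value
    return H

# Toy data with hypothetical sizes.
rng = np.random.default_rng(0)
T, n_in, n_res, n_mem = 12, 2, 10, 1
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_rec = 0.1 * rng.normal(size=(n_res, n_res))
W_fb = rng.uniform(-0.5, 0.5, (n_res, n_mem))
U = rng.uniform(-0.5, 0.5, (T, n_in))
m_init = np.full(n_mem, -0.5)

M_low = np.full((T, n_mem), -0.5)   # memory bit never set
M_flip = M_low.copy()
M_flip[5:] = 0.5                    # memory bit set from step 5 onwards

H_low = harvest_states(U, M_low, W_in, W_rec, W_fb, m_init)
H_flip = harvest_states(U, M_flip, W_in, W_rec, W_fb, m_init)
```

With identical inputs, the two state trajectories coincide until the step after the targets diverge, which shows how the stored bit influences the reservoir only through the feedback connections.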

There is one important observation to be made here. In order to train the model we need targets for the WM-units. This means that their meaning and behaviour are predefined and not learnt. We use hints that tell us how the memory units should behave for a given sequence. A similar approach is used in Gulcehre and Bengio (2013). This model, therefore, cannot be used on a dataset for which we do not know what, when and for how long information needs to be memorized. As such, the utility of the model is limited. However, the model itself is interesting because, putting aside the learning problem, it offers evidence that stable working memory can be obtained in recurrent networks. As we will show, these memory traces can continuously influence the ongoing processing and keep information stored for unbounded periods of time, and the model can learn to replace stored information when it receives different input cues. As such, our model exhibits all the prerequisites we have enumerated in Section 5.3. An analysis of the model provides intuitions about the internal mechanisms that result in this behaviour, a particular type of attractor-like phenomenon which we refer to as an input-induced attractor.

Because of the sharp activation of the memory units we do not need to add noise to the teacher signal² when we use it to compute the activation of the reservoir. However, if we forego the sharp activation function, this form of regularization is vital for the memory units to be stable and not quickly diverge in the presence of noise or new input sequences.

2. When relying on a teacher-forcing strategy to train a model, we feed back through the recurrent connections of the model the target behaviour of a unit instead of its actual value. This target value is usually called the teacher signal.

To compute the output weights $W^{(out)}$, the reservoir state vectors, together with the activations of the input units, are stored row-wise in a data collection matrix $G$. If $Y^{(target)}$ is the target signal, then the output weights are computed using linear regression as shown in Equation (5.6), where $\dagger$ stands for the pseudo-inverse. A similar process is used for learning the weights $W^{(mem)}$ of the memory units, where the inverse of the activation function is taken to be the identity.

$$W^{(out)} = \left(G^{\dagger} \cdot \left(\sigma^{(out)}\right)^{-1}\left(Y^{(target)}\right)\right)^{T} \tag{5.6}$$
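The regression of Equation (5.6) can be sketched with NumPy's pseudo-inverse. The helper `train_readout`, the synthetic data and the choice of tanh as the output activation (so that its inverse is arctanh) are assumptions for illustration only; the source does not specify the output activation here.

```python
import numpy as np

def train_readout(G, Y_target, inv_activation=lambda y: y):
    """W = (pinv(G) @ inv_activation(Y_target))^T, following Equation (5.6).
    G holds one collected row per time step; the default inverse is the identity."""
    return (np.linalg.pinv(G) @ inv_activation(Y_target)).T

# Synthetic check: recover known weights from noiseless targets.
rng = np.random.default_rng(1)
T, n_feat, n_out = 200, 12, 3
G = rng.normal(size=(T, n_feat))              # stand-in for collected states
W_true = 0.1 * rng.normal(size=(n_out, n_feat))
Y_target = np.tanh(G @ W_true.T)              # assumed tanh output units
W_out = train_readout(G, Y_target, inv_activation=np.arctanh)

# Memory units: binary targets, inverse activation taken as the identity.
M_target = np.where(G[:, :1] > 0.0, 0.5, -0.5)
W_mem = train_readout(G, M_target)
```

In this sketch `G` is a single matrix for brevity; for the memory units the collected rows would also include the previous WM-unit states, matching the stacked vector in Equation (5.5).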

An important condition to make the learning of output weights by regression a well-defined procedure is the echo state property (ESP). This is a property of the reservoir and the admitted input. Roughly stated, a reservoir has the ESP with respect to a given admitted input range if, for any infinite input sequence, the network states $h[n]$ asymptotically forget the (arbitrary) initial state $h[0]$ used at startup time. Formal definitions of the ESP are given in Jaeger (2001), and refined algebraic conditions are in Buehner and Young (2006); Manjunath and Jaeger (2013). In practice, the ESP is usually ensured when the spectral radius of the reservoir weight matrix $W^{(rec)}$ is set to a value below unity, but we emphasize that this is neither a necessary nor a sufficient criterion (Jaeger, 2007b), in spite of a folklore belief in the field that it is both. The value of 1 for the spectral radius is sometimes also referred to as the edge of chaos, though going over a spectral radius of 1 does not imply a chaotic regime.
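The sketch below rescales a random reservoir matrix to a chosen spectral radius and then runs a simple empirical washout check: two copies of the reservoir start from different states, receive the same inputs, and the distance between their states is measured. This is only a heuristic illustration of the ESP, not a proof of it. The function names, the sizes, the input range and the omission of WM feedback are assumptions.

```python
import numpy as np

def scale_to_spectral_radius(W, rho):
    """Rescale W so that its largest absolute eigenvalue equals rho."""
    radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (rho / radius)

def state_gap(W_in, W_rec, U, h_a, h_b):
    """Drive two reservoir copies with the same inputs U and return the
    Euclidean distance between their final states."""
    for u in U:
        h_a = np.tanh(W_in @ u + W_rec @ h_a)
        h_b = np.tanh(W_in @ u + W_rec @ h_b)
    return np.linalg.norm(h_a - h_b)

rng = np.random.default_rng(2)
n_in, n_res = 2, 50
W_rec = scale_to_spectral_radius(rng.normal(size=(n_res, n_res)), rho=0.9)
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
U = rng.uniform(-0.5, 0.5, (300, n_in))

h_a = np.zeros(n_res)
h_b = rng.uniform(-1.0, 1.0, n_res)
gap_start = np.linalg.norm(h_a - h_b)
gap_end = state_gap(W_in, W_rec, U, h_a, h_b)
```

A shrinking gap on one input sequence is consistent with the ESP for that input range but does not establish it, in line with the caveat that the spectral radius criterion is neither necessary nor sufficient.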

Depending on the task that needs to be solved, there are a few global parameters that need to be tuned for optimal learning, namely the global scalings of the input weights, the reservoir weights, and the output feedback weights. In the reservoir computing field, the global scaling of the reservoir weights $W^{(rec)}$ is typically specified through the spectral radius of this matrix. All these tunable parameters are explained in more detail in Jaeger (2001).
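The three global scalings can be exposed as explicit arguments when the fixed weights are drawn, as in this minimal sketch. The function `init_weights`, the uniform and Gaussian distributions, and the numeric values in the usage line are hypothetical; only the existence of the three knobs comes from the text.

```python
import numpy as np

def init_weights(n_in, n_res, n_mem, input_scale, spectral_radius,
                 feedback_scale, seed=0):
    """Draw the fixed (untrained) weights with three global scalings."""
    rng = np.random.default_rng(seed)
    W_in = input_scale * rng.uniform(-1.0, 1.0, (n_res, n_in))
    W_fb = feedback_scale * rng.uniform(-1.0, 1.0, (n_res, n_mem))
    W_rec = rng.normal(size=(n_res, n_res))
    # Reservoir scaling is specified through the spectral radius.
    W_rec *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W_rec)))
    return W_in, W_rec, W_fb

W_in, W_rec, W_fb = init_weights(3, 40, 2, input_scale=0.5,
                                 spectral_radius=0.8, feedback_scale=0.1)
```

Keeping the scalings as separate arguments makes it straightforward to tune them per task, for instance with a grid search over the three values.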

At first sight, the strong couplings between WM-units through the trained $W^{(mem)}$ might appear problematic for a clean³ storing of memory items, because in technical storage devices one does not usually desire dynamical interaction between stored items. However, we will demonstrate that such interactions can be harnessed for realizing desirable processing functionalities which go beyond pure storage and retrieval.

3. By clean we mean that there is no interference with the stored information due to the ongoing processing of information by the model.

Figure 5.4: A fragment example of the rich graphic script used as input. The image was scaled for better visualization.

In document On Recurrent and Deep Neural Networks (Page 182-185)