Analysing the model - Echo state networks with working memory

5.4 Echo state networks with working memory

5.4.3 Analysing the model

In light of the discussion in Section 5.3, one question we need to address is which underlying dynamical mechanism provides the working memory behaviour that we were able to simulate with this model.

First of all, we observe that the memory is stable¹. As long as no curly bracket appears in the input, the model preserves the current bracket level reliably for very long periods of time. This suggests some form of attractor-like behaviour.

1. By stable we mean that the model can remember the stored information for what seem to be very long periods of time, longer than a standard model can. Unfortunately, there is no theoretical quantification of what stable means in this case, and we rely only on empirical evidence.

There are some difficulties with this claim, as attractors are rigorously defined only for autonomous systems. It was previously suggested, for example in Bengio et al. (1994), that one can regard the input as bounded noise, which allows a straightforward extension of the notion. As discussed previously, this is not a perfect solution, because the input is highly structured and the model is trained to respond to it. In Pascanu and Jaeger (2011) we attempt to provide a definition of pseudo-attractors, called -attractors, for input-driven dynamical systems (which we refer to as -systems). For now, let us take the simpler approach of ignoring the input and assuming that it is just bounded noise.
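The bounded-noise simplification can be written down as follows. This is only a sketch in assumed notation: the symbols x(t) for the network state, u(t) for the input, f for the update map and the bound ε do not appear in the text above and are introduced here purely for illustration.

```latex
% Assumed notation (not from the source): state x(t), input u(t),
% update map f, noise bound \varepsilon.
x(t+1) = f\bigl(x(t), u(t)\bigr), \qquad \|u(t)\| \le \varepsilon \quad \text{for all } t .
% Autonomous reference system, obtained by switching the input off:
x(t+1) = f\bigl(x(t), 0\bigr) .
```

Under this view, attractors are defined for the autonomous reference system, and the driven system is treated as a perturbation of it whose size is bounded by ε.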

While the WM-units seem locked in an attractor state, the same cannot be said about the rest of the network. The hidden units of the model, or at least a subset of them, seem to be free. These units are used to solve the payload task. Because there is a strong connection between the behaviour of the model and the number of open brackets, one might be tempted to fold the current behaviour into the attractor. This would mean that the current behaviour is confined to the support of the corresponding attractor, and hence that the whole model is locked into this attractor. For this to be true, the attractor itself would have to be fairly complex. However, preliminary experimentation showed that the payload task can be independent of the memory content; in such a situation, the number of open brackets does not influence the performance on the payload task. This clearly suggests that at least some of the hidden units are not constrained by the attractor representing the number of open brackets. If this were not the case, the different attractors corresponding to different bracket levels would have to share some of their support set, which contradicts the definition of an attractor.

This suggests that the kind of attractor-behaviour that we observe is similar to the one reported by Maass et al. (2007), namely a form of high-dimensional attractor or partial attractor.

Another important trait of the trained model is the switching mechanism between different attractor states. This does not happen randomly, but rather when specific patterns appear in the input. This suggests that the input, for these attractors, is not equivalent to noise.
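The two observations made so far (states that persist under noise-like input, and switching only when a specific input pattern appears) can be illustrated with a toy system. The sketch below is not the WM model: it assumes a single tanh unit with a self-feedback weight w = 2, a uniform noise input bounded by 0.3, and a single strong negative pulse standing in for a "switching pattern". All of these names and values are hypothetical choices made for illustration.

```python
import numpy as np

def run_unit(x0, inputs, w=2.0):
    """Iterate x(t+1) = tanh(w * x(t) + u(t)) and return all visited states."""
    states = [x0]
    for u in inputs:
        states.append(np.tanh(w * states[-1] + u))
    return np.array(states)

rng = np.random.default_rng(0)
# Bounded "noise" input with |u| <= 0.3 (hypothetical bound).
noise = rng.uniform(-0.3, 0.3, size=2000)

# Phase 1: noise only. The state is started near the positive fixed point.
held = run_unit(0.95, noise[:1000])

# Phase 2: one strong negative pulse (the toy "switching pattern"),
# followed by noise-only input again.
switched = run_unit(held[-1], np.concatenate(([-3.0], noise[1000:])))
```

In this toy setting the state remains positive throughout phase 1, and after the single pulse it moves to the negative side within one update and stays there under further noise. Whether the trained network realises its memory states in a comparable way is not something this sketch can establish.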

Figure 5.5 looks at the role of the input in the behaviour of the model. The plot was obtained as follows:

– We ran the WM model from the previous section 7 times for approximately 45,000 network updates, each time with the memory units clamped in one of the 7 settings coding for one bracketing level; the driving input was in each case generated from an input character sequence whose Markov chain properties were the same as used in the previous section, not containing curly brackets;

– the obtained 7 sets of reservoir states and input vectors were concatenated and the first principal components (PCs) of the reservoir states and inputs were computed;

– separately for each of the 7 datasets, the first PC of the inputs was plotted against the first two PCs of the reservoir states.

[Figure 5.5: three-dimensional scatter plot; axes: Reservoir PC 1, Reservoir PC 2, Input PC 1.]

Figure 5.5: Visualization of the memory states of the WM model. The first PC of input signals is plotted against the first two PCs of reservoir states, for the 7 WM unit configurations described in the previous section. Different colors correspond to different WM configurations. Projections of the reservoir state PCs are shown in darker shading on the ground plane. 6000 points are plotted per attractor. Notice that the value ranges no longer correspond to the (-1, 1) range of tanh reservoirs because we display projections onto the PCs. Picture best seen in color. For details see the text.
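The procedure listed above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the code used to produce Figure 5.5: the function names `first_pcs` and `project_runs` are hypothetical, the PCs are obtained from a singular value decomposition of the centred data, and the reservoir states and inputs are synthetic stand-ins because the recorded data are not part of this text.

```python
import numpy as np

def first_pcs(data, k):
    """Return the mean and the top-k principal directions of data (rows = samples)."""
    mean = data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:k]

def project_runs(reservoir_runs, input_runs, n_res_pcs=2, n_in_pcs=1):
    """Fit PCs on the concatenation of all runs, then project each run separately."""
    res_mean, res_pcs = first_pcs(np.concatenate(reservoir_runs), n_res_pcs)
    in_mean, in_pcs = first_pcs(np.concatenate(input_runs), n_in_pcs)
    return [
        ((x - res_mean) @ res_pcs.T, (u - in_mean) @ in_pcs.T)
        for x, u in zip(reservoir_runs, input_runs)
    ]

# Synthetic stand-in data (NOT the thesis data): 7 runs, one per memory setting.
rng = np.random.default_rng(0)
n_steps, n_res, n_in = 500, 50, 5          # hypothetical sizes
offsets = rng.normal(size=(7, n_res))      # hypothetical per-setting shift
reservoir_runs = [
    np.tanh(o + 0.1 * rng.normal(size=(n_steps, n_res))) for o in offsets
]
input_runs = [rng.normal(size=(n_steps, n_in)) for _ in range(7)]

# One (reservoir projection, input projection) pair per memory setting.
projections = project_runs(reservoir_runs, input_runs)
```

Each entry of `projections` holds the coordinates for one memory setting: two reservoir PC coordinates and one input PC coordinate per time step, which could then be drawn as one colored point cloud per setting in a three-dimensional scatter plot. Fitting the PCs on the concatenated data, rather than per run, is what makes the 7 point clouds comparable in a single coordinate system.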

One sees that even in only the first two reservoir PCs, the reservoir state sets corresponding to the different memory configurations are very well separated. This suggests that, in addition to only part of the system being locked in an attractor-like state, these states are stable under most inputs, with the exception of the learnt switching patterns, which allow the model to leave the basin of attraction of the current attractor within one step. We do not have a proper mathematical understanding of this behaviour. In Pascanu and Jaeger (2011) we do, however, provide a formalization of these observations. Specifically, we define an input-driven system called a -system and define, for these systems, -attractors, which have all the properties that we see empirically in our experiments.

Further work is needed to understand such phenomena. For example, one would need better mathematical tools that enable a deeper understanding of such behaviour. A more in-depth comparison of -attractors with other similar observed phenomena and, if possible, a common framework that describes all these different phenomena would also be useful.

For practical applications it would be useful to understand when these attractors can manifest themselves (what properties of the parameters are required for such phenomena to be possible). How can we learn them? What is the capacity of these attractors as a function of the model size? Can they only represent discrete information? Most of these questions we leave for future work.

In document On Recurrent and Deep Neural Networks (Page 190-193)