Recurrent architectures for temporal processing

Recurrent architectures aim to overcome the finite memory length of tapped delay line networks by adding recurrent links to the strictly feed-forward network architecture. These links provide the network with a memory which is not bounded by a fixed length of time. In addition to standard feed-forward inputs from nodes in lower levels of the network, nodes may also take time-delayed inputs from nodes in the same or higher layers of the network. This means that the activation of these nodes (and hence of the network as a whole) is dependent not only on the current input values, but also on the network's own internal state.

Many different architectures can be created by adding recurrent links at different points in the basic feed-forward architecture. This section is intended to describe some of the more commonly used recurrent networks, and in particular those architectures used in this research.

6.3.1 Elman network

Figure 6.3 illustrates a simple recurrent architecture in which each node in the hidden layer receives a recurrent link from every node in the hidden layer (including itself) as well as standard feed-forward connections from each input node. This architecture is called an Elman network (Elman 1990).

Figure 6.3 The Elman recurrent network architecture (layers shown: Input, Hidden, Output). The grey arrow indicates time-delayed recurrent connections between the nodes in the hidden layer.

The feedback from the hidden nodes provides the network with a memory of its internal state. In Mozer's classification this architecture has a TS (transformed state) memory. This style of network can be trained to either recognise input sequences or generate output sequences. This architecture is one of the more commonly used recurrent models and was used in the work on Neural Transplant Surgery described in Section 7.5 so as to allow easier comparison with previous work.
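The forward pass of such a network can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sigmoid activation, the all-zero initial hidden state, the bias stored in column 0 of each weight row, and the names elman_step and run_elman are choices made here, not details taken from Elman (1990) or from this research.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_sum(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))

def elman_step(x, h_prev, w_in, w_rec, w_out):
    """One time step. w_in[j] = [bias, input weights...] for hidden node j,
    w_rec[j] = weights from every hidden node (time-delayed) to hidden node j,
    w_out[k] = [bias, hidden weights...] for output node k."""
    # Each hidden node sees the current input and the previous activation
    # of every hidden node, itself included.
    h = [sigmoid(weighted_sum(w_in[j], [1.0] + x) + weighted_sum(w_rec[j], h_prev))
         for j in range(len(w_in))]
    y = [sigmoid(weighted_sum(w_out[k], [1.0] + h)) for k in range(len(w_out))]
    return h, y

def run_elman(sequence, w_in, w_rec, w_out):
    h = [0.0] * len(w_in)  # assumed initial state: all hidden nodes at zero
    outputs = []
    for x in sequence:
        h, y = elman_step(x, h, w_in, w_rec, w_out)
        outputs.append(y)
    return outputs

# Illustrative, hand-picked weights: 1 input node, 2 hidden nodes, 1 output node.
w_in = [[0.0, 1.0], [0.0, -1.0]]
w_rec = [[0.5, -0.5], [0.3, 0.8]]
w_out = [[0.0, 1.0, -1.0]]
outputs = run_elman([[1.0], [0.0], [1.0]], w_in, w_rec, w_out)
```

The first and third inputs in the usage example are identical, yet the corresponding outputs differ, because the hidden state carried over from earlier steps also contributes to the activation.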

An alternative implementation of recurrent networks is often used in the literature in which the recurrent links are implemented using special context nodes. These nodes take input from time-delayed 'copy' links (fixed to a weight of one) which copy the activation of other nodes in higher layers of the network back to the context nodes in a lower level. These context nodes are then linked to nodes in the next layer in a feed-forward fashion. Figure 6.4 illustrates the same network as in Figure 6.3, viewed in this manner. This use of context nodes can be viewed as merely an alternative means of illustrating recurrent networks. However, some architectures (such as the Jordan network discussed below) also add recurrent links between the nodes in the context layer, which cannot be easily represented in the first style of illustration.

Figure 6.4 The same architecture as in Figure 6.3, illustrated using the context node view of recurrent architectures (layers shown: Input, Hidden, Output, Context). The dashed line represents a time-delayed connection from each node in the hidden layer to a single context node. These connections have a fixed weight of 1, so that each context node is merely a copy of the previous activation of its corresponding hidden node.
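In code, the context node view amounts to treating the hidden layer as an ordinary feed-forward layer whose input vector is extended with the context activations. The sketch below assumes a sigmoid activation, an all-zero initial context, and a weight layout of [bias, inputs, context nodes] per hidden node; the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feed_forward_layer(weights, activations):
    # Plain feed-forward layer: one weight row per receiving node.
    return [sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in weights]

def context_view_run(sequence, w_hidden, w_out):
    """Each row of w_hidden is laid out as [bias, inputs..., context nodes...];
    each row of w_out is laid out as [bias, hidden nodes...]."""
    context = [0.0] * len(w_hidden)  # assumed initial context: all zeros
    outputs = []
    for x in sequence:
        hidden = feed_forward_layer(w_hidden, [1.0] + x + context)
        outputs.append(feed_forward_layer(w_out, [1.0] + hidden))
        # Copy links with a fixed weight of 1: each context node becomes a
        # time-delayed copy of its corresponding hidden node.
        context = [1.0 * h for h in hidden]
    return outputs

# Illustrative weights: 1 input node, 2 hidden nodes (hence 2 context nodes), 1 output node.
w_hidden = [[0.0, 1.0, 0.5, -0.5], [0.0, -1.0, 0.3, 0.8]]
w_out = [[0.0, 1.0, -1.0]]
outputs = context_view_run([[1.0], [0.0], [1.0]], w_hidden, w_out)
```

The trainable weights from the context nodes to the hidden nodes play exactly the role of the hidden-to-hidden recurrent weights in the first style of illustration; only the bookkeeping differs.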

6.3.2 Jordan network

Jordan (1986, 1989) proposed a recurrent architecture in which the activity of the output nodes is recurrently copied back into the context nodes. In addition, each context node has trainable recurrent links to all of the other context nodes, as shown in Figure 6.5. The recurrent connections between the context nodes provide the network with a memory of its previous state, and therefore allow it to process sequences. Under Mozer's classification scheme this network has a TOS (transformed output and state) memory which retains transformed versions of both its previous output and previous internal state. This style of network can be trained to generate sequences when presented with a time-invariant input signal, or to classify a time-varying input sequence.

Note that the recurrent connections between the context nodes are essential in creating a suitably powerful memory within this network. If these links were removed the only state information available to the network would be the previous value of the output nodes. Whilst this architecture would be sufficiently powerful for the network to be able to learn to act as a finite-state automaton (with each output node corresponding to a particular state of the automaton), it would not allow the network to develop its own internal representation of state, and hence would not be generally applicable to other temporal classification tasks.

Figure 6.5 A Jordan recurrent network (layers shown: Input, Hidden, Output, Context). Each context node takes input from a single output node, and is also recurrently connected to all context nodes, including itself.
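A minimal sketch of this connectivity follows. Several details are assumptions made for illustration rather than taken from Jordan (1986, 1989): the context nodes are treated as linear units, the previous output and context are initialised to zero, the remaining nodes use a sigmoid activation, and the name jordan_run is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(weights, activations):
    return [sigmoid(sum(w * a for w, a in zip(row, activations)))
            for row in weights]

def jordan_run(sequence, w_hidden, w_out, w_ctx):
    """w_hidden rows: [bias, inputs..., context nodes...];
    w_out rows: [bias, hidden nodes...];
    w_ctx[j][k]: trainable weight from context node k to context node j."""
    n_out = len(w_out)
    context = [0.0] * n_out  # one context node per output node (assumed zero start)
    y = [0.0] * n_out        # previous output, assumed zero before the first step
    outputs = []
    for x in sequence:
        # Context node j: fixed-weight copy of output j from the previous step
        # plus trainable recurrent input from every context node (assumed linear).
        context = [y[j] + sum(w_ctx[j][k] * context[k] for k in range(n_out))
                   for j in range(n_out)]
        hidden = layer(w_hidden, [1.0] + x + context)
        y = layer(w_out, [1.0] + hidden)
        outputs.append(y)
    return outputs

# Illustrative weights: 1 input node, 2 hidden nodes, 1 output node (hence 1 context node).
w_hidden = [[0.0, 1.0, 2.0], [0.0, -1.0, 1.0]]
w_out = [[0.0, 1.0, -1.0]]
w_ctx = [[0.5]]
outputs = jordan_run([[1.0]] * 3, w_hidden, w_out, w_ctx)
```

In the usage example the input is held constant, yet the output changes from step to step because the context nodes accumulate a transformed record of earlier outputs. This is the property that lets such a network generate a sequence from a time-invariant input.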

6.3.3 General transformed output and state (TOS) network

The network architecture used in researching the motion recognition network for the SLARTI system is illustrated in Figure 6.6. This model has some similarities to the Jordan network (such as the recurrent links from the output nodes), and like that model it implements a TOS memory. However it eliminates both the non-recurrent hidden layer and the distinction between output and context nodes.

Figure 6.6 The general TOS recurrent architecture (layers shown: Input, Output/State). The output and state nodes are fully connected to the input layer, and completely recurrently connected to each other.

This network consists of only two layers – the input layer and an output/context layer. The second layer contains both output nodes and additional state nodes. Every node in this layer is fully connected to all the input nodes, and also recurrently connected to all nodes in the output/context layer. Hence the connectivity is identical for the output and state nodes, and they are distinguished only during training and when calculating the network's classification of the input sequence (at which stage only the activation of the output nodes is considered).

Consider a TOS architecture with $a$ input nodes, $b$ state nodes and $c$ output nodes, where:

$I_{j,t}$: the activation of the $j$th input node at time $t$

$I_0$: a bias node with a fixed value of 1

$R_{j,t}$: the activation of the $j$th node in the recurrent layer at time $t$; if $j \le b$ the node is a state node, and if $j > b$ the node is an output node

$w_{i,j}$: the weight of the connection from the $i$th input node to the $j$th node in the recurrent layer

$r_{i,j}$: the weight of the recurrent connection from the $i$th node to the $j$th node in the recurrent layer

$f(x)$: the activation function of the recurrent nodes

For a given input sequence $I$, the activation of the recurrent units at time $t$ is given by:

$$R_{j,t} = f\left( \sum_{i=0}^{a} I_{i,t}\, w_{i,j} + \sum_{i=1}^{b+c} R_{i,t-1}\, r_{i,j} \right) \tag{6.6}$$
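Equation 6.6 translates almost directly into code. In the sketch below the weight indexing follows the definitions above (w[i][j] from input i to recurrent node j, r[i][j] from recurrent node i to recurrent node j). The sigmoid activation, the all-zero initial state, and the choice of reading the class from the most active output node after the last time step are assumptions made for illustration; tos_step and tos_classify are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tos_step(inputs, r_prev, w, r, f=sigmoid):
    """Equation 6.6 for every recurrent node j (0-based lists).
    inputs: the a input activations at time t (bias is prepended here as I_0 = 1).
    r_prev: the b + c recurrent activations at time t - 1."""
    i_full = [1.0] + inputs
    n = len(r_prev)
    return [f(sum(i_full[i] * w[i][j] for i in range(len(i_full)))
              + sum(r_prev[i] * r[i][j] for i in range(n)))
            for j in range(n)]

def tos_classify(sequence, w, r, b, f=sigmoid):
    state = [0.0] * len(r)  # assumed initial activations: all zeros
    for x in sequence:
        state = tos_step(x, state, w, r, f)
    outputs = state[b:]     # the first b nodes are state nodes, the rest output nodes
    return outputs.index(max(outputs))

# Illustrative weights: a = 1 input node, b = 1 state node, c = 2 output nodes.
w = [[0.0, 0.0, 0.0],       # weights from the bias node I_0
     [1.0, 2.0, -2.0]]      # weights from the single input node
r = [[0.0, 0.5, -0.5],      # weights from the state node
     [0.0, 1.0, -1.0],      # weights from output node 1
     [0.0, -1.0, 1.0]]      # weights from output node 2
label = tos_classify([[1.0], [1.0]], w, r, b=1)
```

Because state and output nodes share the same update rule, a single list comprehension covers the whole recurrent layer; the distinction between the two kinds of node appears only where the classification is read out.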

This simple network structure has been described in the literature (e.g. Miller and Giles 1993) and there is some evidence that it trains more stably than the Elman architecture on classification problems, due to the addition of recurrent links between the output nodes (Lewis 1995a, 1995b). It was observed that during training the output nodes generally evolved positive self-recurrent links and negatively weighted recurrent links to the other output nodes. This acts to stabilise the activity of the output nodes over the course of an input sequence. This behaviour is well suited to the task of classifying input sequences, but may make this network structure less suitable for sequence generation. The comparatively simple feed-forward structure and the uniform connectivity of all the processing nodes (output and state nodes) also make this model extremely easy to implement.
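The stabilising effect of such a weight pattern can be illustrated with a toy example of two output nodes. The weights below are hand-picked for illustration, not values observed in this research, and the sigmoid activation and function name are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_two_outputs(input_drive, self_w, cross_w):
    """input_drive[t] = net feed-forward input to each of the two output nodes.
    self_w: weight of each node's self-recurrent link.
    cross_w: weight of the recurrent link from the other output node."""
    o = [0.0, 0.0]
    history = []
    for d0, d1 in input_drive:
        # Both new activations are computed from the previous step's values.
        o = [sigmoid(d0 + self_w * o[0] + cross_w * o[1]),
             sigmoid(d1 + cross_w * o[0] + self_w * o[1])]
        history.append(o)
    return history

# One step of evidence favouring node 0, then five steps with no input at all.
drive = [(2.0, -2.0)] + [(0.0, 0.0)] * 5
with_links = run_two_outputs(drive, self_w=4.0, cross_w=-4.0)
without_links = run_two_outputs(drive, self_w=0.0, cross_w=0.0)
```

With positive self-recurrent and negative cross links, node 0 stays strongly active and node 1 stays suppressed after the evidence disappears. Without recurrent links both nodes fall back to 0.5 as soon as the input is removed.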

The lack of a hidden layer could potentially cause difficulty in learning some problems. For example, the network could be presented with a single bit as input at each time step and required to calculate the xor of this bit and the previous input. Due to its lack of a hidden layer the network could not learn this task. However if the problem were modified to require calculating the xor of the two previous input bits the network could learn this task by using its recurrent state nodes to perform the same role that the hidden nodes play in a spatial xor network. This limitation of the architecture did not cause any difficulties in the classification tasks explored during this research.
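The difference between the two tasks is easiest to see by writing out their target sequences. The helper names below are hypothetical and the bit sequence is arbitrary.

```python
def xor_current_and_previous(bits):
    # Target at time t is bits[t] XOR bits[t-1]; defined from t = 1 onwards.
    return [bits[t] ^ bits[t - 1] for t in range(1, len(bits))]

def xor_two_previous(bits):
    # Target at time t is bits[t-1] XOR bits[t-2]; defined from t = 2 onwards.
    return [bits[t - 1] ^ bits[t - 2] for t in range(2, len(bits))]

bits = [1, 0, 0, 1, 1]
immediate = xor_current_and_previous(bits)  # targets for t = 1..4
delayed = xor_two_previous(bits)            # targets for t = 2..4
```

The second target sequence is the first one delayed by a single time step. That extra step is what gives the recurrent state nodes the opportunity to compute an intermediate result before the output is required.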
