Time Delay Neural Networks - Non-recurrent architectures for temporal processing

6.2 Non-recurrent architectures for temporal processing

6.2.2 Time Delay Neural Networks

Within Mozer's taxonomy, simple tapped delay line networks would be categorised as having a memory content of type RI, meaning that the memory contains only raw input values from previous time frames. The RI memory of the simple delay line network can be extended by applying delay lines to nodes in the hidden layer to create a transformed input (TI) memory. An example of this style of network is the Time-Delay Neural Network (TDNN), which was designed to overcome some of the weaknesses of the simple tapped delay line architecture by buffering both the raw input values and the activations of the hidden layer nodes. It has been applied to voice recognition tasks with promising results (Lang and Hinton 1988, Lang et al. 1990, Hampshire and Waibel 1990).

The TDNN architecture described in the literature contains two hidden layers. The units in the first hidden layer are connected to the inputs by a small number of delay lines, and are intended to detect temporally-localised features of the input sequence. One way to view these nodes is that they pass a relatively small window along the entire input sequence. The outputs of the first hidden layer are also passed through delay lines, which feed into the nodes in the second hidden layer. These nodes sample their delay lines over a longer period of time and so are designed to detect higher-level features which are spread over a longer period of the input sequence. Again, the output of these nodes is buffered and, in this case, fully connected to the output nodes, which integrate the temporally disparate features of the sequence to produce the final classification. This style of architecture is illustrated in Figure 6.2.

Figure 6.2 An example of a TDNN architecture. Delay lines (represented by shaded rectangles) are applied to the nodes in the input and both hidden layers. To simplify the figure only a single node has been included in each layer.

Consider a TDNN architecture with $a$ input nodes, $b_1$ nodes in the first hidden layer, $b_2$ nodes in the second hidden layer and $c$ output nodes:

$I_{j,t}$  the activation of the $j$th input node at time $t$

$H1_{j,t}$  the activation of the $j$th node in the first hidden layer at time $t$

$H2_{j,t}$  the activation of the $j$th node in the second hidden layer at time $t$

$O_j$  the activation of the $j$th output node

$I_0$, $H_0$  bias nodes with a fixed value of 1

$T_I$  the length of the delay buffer on the input nodes

$T_{H1}$  the length of the delay buffer on the first hidden layer nodes

$T_{H2}$  the length of the delay buffer on the second hidden layer nodes

$u_{i,j,d}$  the weight of the connection from the $i$th input node at delay $d$ to the $j$th node in the first hidden layer

$v_{i,j,d}$  the weight of the connection from the $i$th node in the first hidden layer at delay $d$ to the $j$th node in the second hidden layer

$w_{i,j,d}$  the weight of the connection from the $i$th node in the second hidden layer at delay $d$ to the $j$th output node

$f(x)$  the activation function of the hidden and output nodes

For a given input pattern $I$, the activation of the nodes in each layer for a particular time-step cannot be calculated until the buffer for the nodes in the previous layer has been filled. The hidden and output node activations are given by Equations 6.3 to 6.5.

$$H1_{j,t} = f\left(\sum_{i=0}^{a} \sum_{d=0}^{T_I-1} I_{i,t-d}\, u_{i,j,d}\right) \tag{6.3}$$

$$H2_{j,t} = f\left(\sum_{i=0}^{b_1} \sum_{d=0}^{T_{H1}-1} H1_{i,t-d}\, v_{i,j,d}\right) \tag{6.4}$$

$$O_j = f\left(\sum_{i=0}^{b_2} \sum_{d=0}^{T_{H2}-1} H2_{i,T_I-d}\, w_{i,j,d}\right) \tag{6.5}$$
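Equations 6.3 to 6.5 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited work: the function names (`delay_layer`, `tdnn_forward`, `random_weights`), the logistic choice for f, the list-based data layout, the layer sizes, and the reading of the output layer as sampling the final T_H2 frames of the second hidden layer are all assumptions made for the example.

```python
import math
import random

def f(x):
    # Logistic activation; the text leaves f(x) unspecified, so this is an assumption.
    return 1.0 / (1.0 + math.exp(-x))

def delay_layer(frames, weights):
    """One delayed layer, following the form of Equations 6.3 and 6.4.

    frames  : list over time of activation lists (bias node not included)
    weights : weights[i][j][d], where i = 0 is the bias node
    returns : list over time of output activation lists, starting at the
              first time step for which the delay buffer is full
    """
    n_out = len(weights[0])
    n_delays = len(weights[0][0])
    if len(frames) < n_delays:
        raise ValueError("sequence is shorter than the delay buffer")
    biased = [[1.0] + list(frame) for frame in frames]  # node 0 has fixed value 1
    outputs = []
    for t in range(n_delays - 1, len(biased)):
        outputs.append([
            f(sum(biased[t - d][i] * weights[i][j][d]
                  for i in range(len(weights))
                  for d in range(n_delays)))
            for j in range(n_out)
        ])
    return outputs

def tdnn_forward(inputs, u, v, w):
    h1 = delay_layer(inputs, u)   # Equation 6.3
    h2 = delay_layer(h1, v)       # Equation 6.4
    t_h2 = len(w[0][0])
    # Equation 6.5, read here as integrating the final T_H2 frames of h2 (assumption).
    return delay_layer(h2[-t_h2:], w)[-1]

def random_weights(n_in, n_out, n_delays):
    # Hypothetical helper: n_in + 1 rows, because row 0 holds the bias weights.
    return [[[random.uniform(-1.0, 1.0) for _ in range(n_delays)]
             for _ in range(n_out)]
            for _ in range(n_in + 1)]

random.seed(0)
a, b1, b2, c = 3, 4, 2, 5          # illustrative layer sizes
T_I, T_H1, T_H2 = 3, 4, 2          # illustrative buffer lengths
u = random_weights(a, b1, T_I)
v = random_weights(b1, b2, T_H1)
w = random_weights(b2, c, T_H2)

# Seven frames is the shortest sequence that fills every buffer with these lengths.
sequence = [[random.random() for _ in range(a)] for _ in range(7)]
outputs = tdnn_forward(sequence, u, v, w)
print(outputs)                     # one activation per output node
```

Each call to `delay_layer` shortens the sequence by the buffer length minus one, which reflects the requirement that a layer cannot be evaluated until the buffer of the previous layer is full.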

The use of buffers of varying length throughout the layers of the network addresses some of the problems which occur when the basic tapped delay line approach is applied to long sequences. One of the most important of these is that it reduces the total number of weights in the network. In the standard tapped delay line network, each hidden node has a separate weight for each input for each time-frame contained in the input buffer (which may be as long as the entire input sequence). By contrast, in the TDNN the nodes in the first hidden layer are connected to only a few time-frames of the input and hence have far fewer weights. This helps to reduce the likelihood of overtraining damaging the network's ability to generalise. It should be noted that the TDNN architecture still requires large numbers of weights to process long sequences, but is less extreme in this need than the basic tapped delay line model.
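The size of this saving can be illustrated with a simple count. In the sketch below, all layer sizes, buffer lengths and the sequence length are hypothetical values chosen for the example, bias weights are counted as one extra node per delay (matching the sums from index 0 in Equations 6.3 to 6.5), and the TDNN buffers are assumed to advance one frame at a time so that lengths of 3, 5 and 44 cover 3 + 5 + 44 - 2 = 50 input frames.

```python
def layer_weights(n_in, n_out, n_delays):
    # One weight per (source node including bias, destination node, delay).
    return (n_in + 1) * n_out * n_delays

a, hidden, c, seq_len = 16, 8, 3, 50    # hypothetical sizes

# Tapped delay line network: the input buffer spans the whole sequence.
tapped_first = layer_weights(a, hidden, seq_len)
tapped_total = tapped_first + layer_weights(hidden, c, 1)

# TDNN covering the same 50 frames with T_I = 3, T_H1 = 5, T_H2 = 44.
b1, b2 = 8, 4
tdnn_first = layer_weights(a, b1, 3)
tdnn_total = tdnn_first + layer_weights(b1, b2, 5) + layer_weights(b2, c, 44)

print(tapped_first, tdnn_first)   # 6800 408
print(tapped_total, tdnn_total)   # 6827 1248
```

With these figures the first hidden layer shrinks by more than an order of magnitude, while the long buffer feeding the output layer still accounts for about half of the TDNN's weights.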

A second advantage of using temporally-localised feature detectors in the first hidden layer is that the network is less sensitive to temporal shifting of the input sequence. The same low-level feature detectors are applied to the entire input sequence, meaning that the TDNN is less likely to become reliant on detecting a particular feature at a specific location in the input sequence.
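The effect of sharing one feature detector across the whole sequence can be seen in a small sketch. It assumes a single input node, a hypothetical three-frame detector, and omits the activation function f so that the responses are easy to read; none of these values come from the cited work.

```python
def detect(sequence, weights):
    # Slide one feature detector along the sequence: the response at time t
    # is the weighted sum of the frames t, t-1, ..., t-(T_I - 1).
    n = len(weights)
    return [sum(sequence[t - d] * weights[d] for d in range(n))
            for t in range(n - 1, len(sequence))]

weights = [1.0, -2.0, 1.0]                 # hypothetical detector, T_I = 3
pattern = [0, 0, 1, 3, 1, 0, 0, 0, 0]      # a feature early in the sequence
shifted = [0, 0] + pattern[:-2]            # the same feature two frames later

early = detect(pattern, weights)
late = detect(shifted, weights)

# The detector produces the same response profile, displaced by two frames.
print(late[2:] == early[:-2])              # True
```

Because the response simply moves with the feature, later layers receive the same evidence regardless of where in the sequence the feature occurred.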

The TDNN architecture fails to overcome the basic limitation of delay line networks, however. The overall length of the input sequence processed by the network (and hence the length of the network's memory) is still fixed in advance at a finite size. Hence this style of network is still incapable of handling problems where the length of the relevant input sequence is not known in advance. In addition, the TDNN still contains a large number of weights and takes substantial amounts of time to train on problems involving long input sequences. Whilst the use of localised feature detectors in the first hidden layer provides this system with its main advantages over simple delay line networks, it also makes it more difficult to train these nodes as the task they are learning is more complex.
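The fixed memory length can be made explicit. As a sketch, assume that every delay line advances one frame at a time and that the output layer samples $T_{H2}$ consecutive frames of the second hidden layer (assumptions about the layout rather than details stated above). Counting back through the layers gives the number of input frames $N$ that can influence the output:

```latex
\begin{align*}
\text{one } H1 \text{ frame spans} &\quad T_I \text{ input frames} \\
\text{one } H2 \text{ frame spans} &\quad T_I + (T_{H1} - 1) \text{ input frames} \\
N &= T_I + (T_{H1} - 1) + (T_{H2} - 1) = T_I + T_{H1} + T_{H2} - 2
\end{align*}
```

For example, buffer lengths of 3, 5 and 44 give $N = 50$ frames. Under these assumptions, input that falls outside that window cannot affect the classification unless the buffers, and with them the number of weights, are enlarged.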

In document Recognition of sign language using neural networks (Page 96-99)