
Incremental language processing in a LSTM neural network

Chapter 2: Computational modelling of the incremental processing of a sentence

2.3. Modelling prediction with neural networks

2.3.5 Incremental language processing in an LSTM neural network

An LSTM network is a more sophisticated version of the RNN that preserves the benefits of the RNN as a model of incremental speech comprehension and additionally captures long-distance dependencies. In language modelling, the LSTM is one of the most commonly adopted architectures for data mining and network training. Recently, Google announced an LSTM network trained on a 1-billion-word benchmark that accurately predicts the following word from the given context in a sentence (Jozefowicz et al., 2016). The neural network model used in this thesis is this LSTM model. Here, I briefly walk through the LSTM architecture (see also Gers & Schmidhuber, 2000; Sundermeyer et al., 2015) and explain how it solves the vanishing gradient problem.

Instead of performing a single operation in the recurrent hidden layer, as an RNN does, an LSTM performs several operations that decide which information to preserve and which to add inside the hidden layer. A useful analogy for the LSTM hidden layer is a memory cell with three gates that input, forget and output the contents of memory. First, the network decides what to forget from the previous memory using the sigmoid function. Recall that the sigmoid function outputs a value between 0 and 1, which can be interpreted as a weight determining the strength of projection among the operators (the gates, in this analogy). The vector of weights $\emptyset(t)$ then reflects the state of the forget gate in the memory cell at a particular time $t$:

$$\emptyset(t) = \sigma\left(x(t)W_{x\emptyset} + s(t-1)W_{s\emptyset} + c(t-1)W_{c\emptyset} + b_{\emptyset}\right) \tag{13}$$

where $\sigma$ is the sigmoid function, $x(t)$ is the current input with associated weights $W_{x\emptyset}$, $s(t-1)$ is the previous state of the hidden layer with associated weights $W_{s\emptyset}$, $c(t-1)$ is the previous state of the memory cell with associated weights $W_{c\emptyset}$, and $b_{\emptyset}$ is the bias term. Note that the cell state term $c(t-1)W_{c\emptyset}$ does not exist in the RNN architecture. The forget gate state $\emptyset(t)$ directly manipulates the memory content: a value of 0 means that the content is completely forgotten, and a value of 1 means that it is fully remembered.
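The role of the forget gate as a soft switch can be illustrated with a minimal Python sketch. The pre-activation values and the previous memory content below are made-up numbers chosen only for illustration, and the gate is applied to the memory by element-wise multiplication, as defined later in (15).

```python
import math


def sigmoid(z):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical pre-activations of three forget-gate neurons (made-up values).
pre_activations = [-10.0, 0.0, 10.0]
forget_gate = [sigmoid(z) for z in pre_activations]

# Hypothetical previous memory content c(t-1), one feature per gate neuron.
c_prev = [0.8, -0.5, 0.3]

# Element-wise gating: each gate neuron scales exactly one memory feature.
kept = [c * f for c, f in zip(c_prev, forget_gate)]

print(forget_gate)  # first value close to 0, second exactly 0.5, third close to 1
print(kept)         # first feature almost erased, third almost unchanged
```

A strongly negative pre-activation drives the gate towards 0 (forget), a strongly positive one drives it towards 1 (remember), and values in between attenuate the stored feature proportionally.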

Next, the LSTM network uses the sigmoid function to decide which information from the input to add and store in the memory. By the same logic as above, the state of the input gate $\theta(t)$ can be expressed as:

$$\theta(t) = \sigma\left(x(t)W_{x\theta} + s(t-1)W_{s\theta} + c(t-1)W_{c\theta} + b_{\theta}\right) \tag{14}$$

Note that the weights to be trained in the input gate are different from those in the forget gate. From the weights $\emptyset(t)$, which decide which memory contents to preserve from the previous cell state (or memory), and the weights $\theta(t)$, which decide which information to store from the current input, we can construct the new memory content as follows:

$$c(t) = c(t-1) \circledast \emptyset(t) + \tanh\left(x(t)W_{xc} + s(t-1)W_{sc} + b_{c}\right) \circledast \theta(t) \tag{15}$$

where tanh is the hyperbolic tangent function described in 2.3.1 and $\circledast$ denotes the element-wise product. Recall that tanh is a rescaled version of the sigmoid with outputs between -1 and 1. The input activation of the current hidden layer is therefore constructed through tanh before it passes into the memory cell, and is then modified by the state of the input gate $\theta(t)$. Note also that the element-wise product $\circledast$ allows each weight (a gate neuron in the input or forget gate) to directly modify one particular feature (from either the previous memory content or the current input) via a one-to-one mapping, since every gate in the memory cell has the same number of neurons. In summary, (15) shows that the input representation modified by the input gate is combined with the memory representation modified by the forget gate to generate the new memory content.
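The statement that tanh is a rescaled sigmoid can be made explicit with a short derivation. Here $z$ is just a placeholder for any pre-activation value and is not a symbol used elsewhere in this section:

```latex
\tanh(z)
  = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
  = \frac{1 - e^{-2z}}{1 + e^{-2z}}
  = \frac{2}{1 + e^{-2z}} - 1
  = 2\,\sigma(2z) - 1
```

Since $\sigma(2z)$ lies between 0 and 1, the right-hand side lies between -1 and 1, which is the rescaling referred to above.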

Lastly, the network decides what to output. As with the other gates, the state of the output gate is computed with the sigmoid function, and it directly modulates the new memory content from (15):

$$\omega(t) = \sigma\left(x(t)W_{x\omega} + s(t-1)W_{s\omega} + c(t)W_{c\omega} + b_{\omega}\right) \tag{16}$$

These weights are used to modify the current memory content that is going to be output:

$$s(t) = \omega(t) \circledast \tanh(c(t)) \tag{17}$$


As above, the unfiltered version of the memory content at the output gate is constructed through tanh and is then weighted by the state of the output gate through a one-to-one mapping with every neuron in the output gate. Note that no bias term is needed inside the tanh of (17), because every term that makes up the new memory content $c(t)$ has already been adjusted by a bias; see (13), (14) and (15). The gate response $s(t)$ (equivalent to the hidden layer activation in an RNN) is then projected to the output layer of the network, as in an RNN (see (9)):

$$o(t) = \varphi\left(s(t)W_{go} + b_{2}\right) \tag{18}$$

where $\varphi$ is the softmax function, which generates a probabilistic response. The BPTT algorithm can then be applied to optimize every weight matrix (12 in total) through the memory cell from (13) to (18); see Figure 2-6 for an illustration.
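The forward pass described by (13) to (18) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation of Jozefowicz et al. (2016). The parameter names (`W_xf`, `b_f` and so on, where `f`, `i` and `o` stand for the forget, input and output gates), the layer sizes, the random initialisation and the one-hot inputs are all hypothetical. The cell-state weights are treated as full matrices, as the notation in (13), (14) and (16) suggests.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(v, W):
    """Row vector v (length n) times matrix W (n rows, m columns)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]


def affine(pairs, b):
    """Sum of v @ W over all (v, W) pairs, plus the bias vector b."""
    out = list(b)
    for v, W in pairs:
        out = [o + p for o, p in zip(out, matvec(v, W))]
    return out


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def lstm_step(x, s_prev, c_prev, p):
    """One time step following equations (13) to (18)."""
    forget = [sigmoid(z) for z in affine(
        [(x, p["W_xf"]), (s_prev, p["W_sf"]), (c_prev, p["W_cf"])], p["b_f"])]  # (13)
    inp = [sigmoid(z) for z in affine(
        [(x, p["W_xi"]), (s_prev, p["W_si"]), (c_prev, p["W_ci"])], p["b_i"])]  # (14)
    cand = [math.tanh(z) for z in affine(
        [(x, p["W_xc"]), (s_prev, p["W_sc"])], p["b_c"])]
    c = [cp * f + g * i for cp, f, g, i in zip(c_prev, forget, cand, inp)]  # (15)
    out_gate = [sigmoid(z) for z in affine(
        [(x, p["W_xo"]), (s_prev, p["W_so"]), (c, p["W_co"])], p["b_o"])]  # (16)
    s = [w * math.tanh(ci) for w, ci in zip(out_gate, c)]  # (17)
    o = softmax(affine([(s, p["W_go"])], p["b_2"]))  # (18)
    return o, s, c


def init_params(n_in, n_hid, n_out, seed=0):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for g in ("f", "i", "o"):  # three gates, three weight matrices each
        p["W_x" + g] = mat(n_in, n_hid)
        p["W_s" + g] = mat(n_hid, n_hid)
        p["W_c" + g] = mat(n_hid, n_hid)
        p["b_" + g] = [0.0] * n_hid
    p["W_xc"] = mat(n_in, n_hid)   # input activation inside tanh in (15)
    p["W_sc"] = mat(n_hid, n_hid)
    p["b_c"] = [0.0] * n_hid
    p["W_go"] = mat(n_hid, n_out)  # projection to the output layer in (18)
    p["b_2"] = [0.0] * n_out
    return p


# Usage: feed two hypothetical one-hot "words" through the cell.
params = init_params(n_in=4, n_hid=3, n_out=5)
s, c = [0.0] * 3, [0.0] * 3
for x in ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]):
    o, s, c = lstm_step(x, s, c, params)

n_weight_matrices = sum(1 for name in params if name.startswith("W_"))
print(n_weight_matrices)  # number of weight matrices in this sketch
print(sum(o))             # softmax output sums to 1 up to rounding
```

The sketch carries both `s` and `c` from one step to the next, which mirrors the two recurrent quantities $s(t-1)$ and $c(t-1)$ in the equations. Counting the entries whose names start with `W_` gives the 12 weight matrices mentioned above.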

Figure 2-6: A schematic illustration of the LSTM architecture (see Equations (13)–(18))

To understand how this architecture effectively prevents the error gradient from vanishing as it passes through more layers, we need to see how the gradient back-propagates from $t$ to $t-1$ in the cell state. From (17), it is clear that the hidden layer activation in the LSTM, $s(t)$, is determined by the cell state $c(t)$. Therefore, we just need to show that the gradient does not necessarily diminish from $c(t)$ to $c(t-1)$. Using an arbitrary loss function $H(Y, O)$ and the chain rule, the BPTT can simply be expressed as:

$$\frac{\partial}{\partial c(t-1)} H(Y, O) = \frac{\partial H(Y, O)}{\partial c(t)} \circledast \emptyset(t) \tag{19}$$

From (15), $\emptyset(t)$ is the forget gate activation, which controls the rate at which the neural network forgets its past memory. Hence, (19) simply follows from (15), which defines how the new memory content at $t$ is constructed: note that no non-linear activation function is involved in generating this new content. In other words, the new memory content is generated by an identity function on the weighted combination of the previous cell state and the current input activation in the hidden layer. As a result, the error gradient neither decreases exponentially (the derivative of an identity function is 1) nor explodes (the forget gate activation, which is a vector of sigmoid weights, is always less than 1), even when it passes through a number of previous cell states. The gradient is only linearly modulated by the forget gate activation $\emptyset(t)$. This is how the LSTM architecture can preserve long-distance dependency information in its memory if it decides to.
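The behaviour described by (19) can be checked numerically for a single memory neuron. The sketch below is a simplified illustration: the gate values are held fixed (so only the direct path through the cell state is considered, as in (19)), and all numbers are hypothetical.

```python
def cell_update(c_prev, forget, candidate, inp):
    """Equation (15) for one memory neuron, with the gate values held fixed."""
    return c_prev * forget + candidate * inp


def direct_gradient(forget_gates):
    """Gradient factor along the cell-state path: (19) applied once per time step."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad


# Finite-difference check of (19) for one step (made-up values).
forget, candidate, inp = 0.9, 0.4, 0.7
eps = 1e-6
numeric = (cell_update(0.5 + eps, forget, candidate, inp)
           - cell_update(0.5 - eps, forget, candidate, inp)) / (2 * eps)

# Back-propagating through 50 earlier cell states.
steps = 50
open_gate = direct_gradient([0.99] * steps)  # gate chooses to remember
half_gate = direct_gradient([0.5] * steps)   # gate chooses to forget

print(numeric)    # matches the forget gate value, up to rounding
print(open_gate)  # a sizeable fraction of the gradient survives
print(half_gate)  # the gradient has effectively been erased
```

The one-step derivative equals the forget gate value, and over many steps the gradient is scaled by the product of the forget gate activations. When the gate stays close to 1 the error signal reaches distant time steps, and when it stays well below 1 the network has, in effect, decided to forget.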

2.4. Quantifying the “degree” in prediction: the information-theoretic