2.4 Classification and Tagging Models
2.4.3 Deep Neural Network Models
Deep neural network (deep learning) models, which have recently become popular, can be used for both classification and tagging. They contain a larger number of hidden layers than conventional neural networks and, in some architectures, feedback connections between their components.
Recurrent neural networks (RNNs) are among the state-of-the-art neural networks, especially when we have sequential data in which the relationship between the components (words or sentences in the case of language processing) matters. The architecture of these networks is represented in Figure 2.4.
Figure 2.4: Graphical representation of a simple RNN.
The outputs of RNNs are often vectors that can be fed into other network components that try to predict the final labels. In this sense, RNNs are trained to produce informative representations for upper layers, i.e. they are used as ‘feature extractors’ (Goldberg, 2017). RNNs allow the representation of arbitrarily sized sequential inputs in fixed-size vectors, while paying attention to the structured properties of the inputs.
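At a high level of abstraction, as shown in Figure 2.4, the RNN is a function that takes as input an arbitrary-length ordered sequence of $n$ $d_{in}$-dimensional vectors $x_{1:n} = x_1, x_2, \ldots, x_n$ and an initial state $s_0$, and returns as output a single $d_{out}$-dimensional vector $y_n$ (more generally, one output vector $y_i$ per position, as in Equation 2.9). Each unit, $R$, takes as input the previous state vector $s_{i-1}$ and an input vector $x_i$, and returns a new state vector $s_i$ (Goldberg, 2017).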
\[
\begin{aligned}
\mathrm{RNN}(x_{1:n}; s_0) &= y_{1:n} \\
s_i &= R(s_{i-1}; x_i)
\end{aligned}
\tag{2.9}
\]
The function R is the same across the sequence positions, but the RNN keeps track of the states of computation through the state vector si. RNNs
are trained like any neural network by adding a loss function and using the back-propagation algorithm to compute the gradients with respect to that loss.
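The recurrence in Equation 2.9 can be illustrated with a minimal NumPy sketch. The concrete form of R used here (a tanh layer over the previous state and the current input), the choice of taking each output y_i to be the state s_i, and the names `make_simple_rnn` and `rnn` are assumptions made for illustration; they are not prescribed by the text above.

```python
import numpy as np

def make_simple_rnn(d_in, d_state, seed=0):
    """Build a step function R(s_prev, x). The tanh form is an assumed choice of R."""
    rng = np.random.default_rng(seed)
    W_s = rng.normal(scale=0.1, size=(d_state, d_state))  # state-to-state weights
    W_x = rng.normal(scale=0.1, size=(d_state, d_in))     # input-to-state weights
    b = np.zeros(d_state)

    def R(s_prev, x):
        return np.tanh(W_s @ s_prev + W_x @ x + b)

    return R

def rnn(R, xs, s0):
    """Apply the same R at every position: s_i = R(s_{i-1}, x_i).
    Returns y_{1:n}, taking y_i = s_i (an assumption of this sketch)."""
    s = s0
    ys = []
    for x in xs:
        s = R(s, x)
        ys.append(s)
    return np.stack(ys)

d_in, d_state, n = 4, 3, 5
R = make_simple_rnn(d_in, d_state)
xs = np.random.default_rng(1).normal(size=(n, d_in))  # toy input sequence
ys = rnn(R, xs, np.zeros(d_state))
print(ys.shape)  # (5, 3): one d_state-dimensional vector per position
```

The last row, `ys[-1]`, plays the role of y_n: a fixed-size vector that depends on the whole input sequence, whatever its length.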
RNNs come in several variations; the RNN-based architecture that we focus on in this study is the LSTM (Hochreiter and Schmidhuber, 1997), a gated architecture devised to mitigate the vanishing gradients problem (Pascanu et al., 2012). LSTM stands for Long Short-Term Memory; these networks provide more controlled memory access. In an LSTM, the state vector $s_i$ is split into two halves, where one half is treated as ‘memory cells’ and the other as working memory. At each input step, gates decide how much of the new input should be written to the memory cell and how much of the current content of the memory cell should be forgotten. The architecture is represented in Figure 2.5 and the mathematical computations are detailed in Equation 2.10.
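Figure 2.5: One of the components of the LSTM architecture. The image is from http://colah.github.io/posts/2015-08-Understanding-LSTMs/

\[
\begin{aligned}
s_t &= R_{\mathrm{LSTM}}(s_{t-1}, x_t) = [C_t; h_t] \\
C_t &= f_t \ast C_{t-1} + i_t \ast \tilde{C}_t \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\
h_t &= o_t \ast \tanh(C_t)
\end{aligned}
\tag{2.10}
\]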
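A single LSTM step following Equation 2.10 can be sketched in NumPy as below. This is an illustration only, not the Keras implementation used in the experiments; the helper names `init_lstm_params` and `lstm_step`, the random initialisation, and the toy dimensions are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(d_in, d_hid, seed=0):
    """Hypothetical initialisation: one weight matrix and bias per layer (i, f, C, o)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("i", "f", "C", "o"):
        params["W_" + gate] = rng.normal(scale=0.1, size=(d_hid, d_hid + d_in))
        params["b_" + gate] = np.zeros(d_hid)
    return params

def lstm_step(params, C_prev, h_prev, x_t):
    """One application of R_LSTM; returns the new memory cell C_t and working memory h_t."""
    z = np.concatenate([h_prev, x_t])                         # [h_{t-1}, x_t]
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])          # input gate
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])          # forget gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])      # candidate memory
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])          # output gate
    C_t = f_t * C_prev + i_t * C_tilde                        # element-wise products
    h_t = o_t * np.tanh(C_t)
    return C_t, h_t

d_in, d_hid = 4, 3
params = init_lstm_params(d_in, d_hid)
C, h = np.zeros(d_hid), np.zeros(d_hid)
for x_t in np.random.default_rng(1).normal(size=(6, d_in)):  # toy sequence of 6 inputs
    C, h = lstm_step(params, C, h, x_t)
s_t = np.concatenate([C, h])  # the state s_t = [C_t; h_t]
print(s_t.shape)  # (6,): two halves of size d_hid
```

The four weight matrices correspond to the four layers inside each LSTM component; the forget gate scales the old memory cell and the input gate scales the candidate that is written to it.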
As can be seen, in each component of an LSTM network (A), instead of a single neural network layer, there are four. These layers are interconnected in a systematic way, controlling the information stored in or forgotten from memory. The LSTM also has many minor variants, whose exposition is beyond the scope of this thesis. The Keras software package (Chollet et al., 2015) implements its most standard form, which we use in the experiments in this thesis. The combination of two LSTMs that traverse the sequential data in opposite directions is called a bidirectional LSTM (biLSTM) and has proven to be very effective in language processing tasks. More details on the settings that we choose are provided in the experiment sections of the relevant chapters of this thesis.
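The bidirectional idea can be sketched independently of the recurrent unit. In the NumPy sketch below, a simple tanh cell stands in for the LSTM step to keep the example short; that cell, the helper names `make_cell`, `run`, and `bidirectional`, and the choice to concatenate the two directions at each position are assumptions for illustration.

```python
import numpy as np

def make_cell(d_in, d_state, seed):
    """Hypothetical tanh recurrent cell, used here as a stand-in for an LSTM step."""
    rng = np.random.default_rng(seed)
    W_s = rng.normal(scale=0.1, size=(d_state, d_state))
    W_x = rng.normal(scale=0.1, size=(d_state, d_in))
    return lambda s, x: np.tanh(W_s @ s + W_x @ x)

def run(cell, xs, d_state):
    """Traverse xs in the given order and collect the state at every position."""
    s = np.zeros(d_state)
    out = []
    for x in xs:
        s = cell(s, x)
        out.append(s)
    return np.stack(out)

def bidirectional(fwd_cell, bwd_cell, xs, d_state):
    """Run one cell left-to-right and another right-to-left, then concatenate per position."""
    fwd = run(fwd_cell, xs, d_state)
    bwd = run(bwd_cell, xs[::-1], d_state)[::-1]  # reverse again to re-align with xs
    return np.concatenate([fwd, bwd], axis=1)

d_in, d_state, n = 4, 3, 5
xs = np.random.default_rng(2).normal(size=(n, d_in))  # toy input sequence
fwd_cell = make_cell(d_in, d_state, seed=0)
bwd_cell = make_cell(d_in, d_state, seed=1)
H = bidirectional(fwd_cell, bwd_cell, xs, d_state)
print(H.shape)  # (5, 6): forward and backward states side by side
```

Each row of `H` combines a summary of the words up to that position with a summary of the words from that position onwards, which is what makes the representation useful for tagging individual words.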
Another recently popular neural network model that is also an effective feature extractor is the Convolutional Neural Network (CNN). A convolutional neural network is a combination of layers that function as convolving filters over local features in a large structure in order to capture important information for the prediction task. A CNN usually includes two consecutive operations: convolution and pooling. The convolution operation can be seen as a filter (a function applied to each instantiation of a k-word sliding window) that passes over the input sentence.
The architecture of a convolution layer on a sample sentence is presented in Figure 2.6 which is from Goldberg (2017).
Figure 2.6: A narrow convolution with a window of size k = 2 and 3-dimensional output (l = 3), in the vector-concatenation notation.
Subsequently, a pooling operation is optionally used to combine the vectors resulting from the different windows into a single l-dimensional vector. This is achieved by taking the max or the average value observed in each of the l dimensions over the different windows. Pooling is usually used to compress or subsample the input (Hu et al., 2014). Since we do not want to filter out any information, we do not use pooling in our convolutional neural network layers in this thesis, and we refer to the resulting network as ConvNet.
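For reference, the two pooling variants described above can be sketched in a few lines of NumPy. The matrix `P` is a made-up example of a convolution output with four windows and l = 3 dimensions; it does not come from any experiment in this thesis.

```python
import numpy as np

# Hypothetical convolution output: one row per window, l = 3 columns.
P = np.array([[ 0.1, -0.5,  0.3],
              [ 0.7,  0.2, -0.1],
              [-0.2,  0.4,  0.9],
              [ 0.0,  0.1,  0.5]])

max_pooled = P.max(axis=0)   # largest value in each dimension: 0.7, 0.4, 0.9
avg_pooled = P.mean(axis=0)  # average in each dimension: about 0.15, 0.05, 0.4

print(max_pooled.shape, avg_pooled.shape)  # (3,) (3,)
```

Either way, the four window vectors are reduced to a single l-dimensional vector, which is the compression that the ConvNet used in this thesis avoids by omitting the pooling step.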
Following Kim (2014), let $x_{1:n} = x_1 \oplus x_2 \oplus \ldots \oplus x_n$ be a sentence of length $n$, where $x_i$ is the $k$-dimensional word vector corresponding to the $i$-th word in the sentence and $\oplus$ is the concatenation operator. Let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \ldots, x_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of $h$ words to produce a new feature. In Equation 2.11, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$.
\[
c_i = f(w \cdot x_{i:i+h-1} + b) \tag{2.11}
\]
Here, b is a bias term and f is a non-linear function such as the hyperbolic tangent.
Applying this convolution to every window of the text results in $m$ vectors $c_{1:m}$ (for a narrow convolution with a step size of 1, $m = n - h + 1$).
It is also possible to apply multiple filters with different window sizes and step sizes, allowing the ConvNet to detect multiple features.
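Equation 2.11, applied at every window position with several filters at once, can be sketched in NumPy as follows. The function name `narrow_conv`, the stacking of l filters as rows of a matrix `W`, the step size of 1, and the toy dimensions are assumptions for illustration; no pooling is applied, in line with the ConvNet described above.

```python
import numpy as np

def narrow_conv(X, W, b, f=np.tanh):
    """Narrow convolution with step size 1 (assumed).
    X: (n, k) word vectors; W: (l, h*k), one filter per row; b: (l,) biases.
    Returns an (n - h + 1, l) array with one l-dimensional vector per window."""
    n, k = X.shape
    h = W.shape[1] // k  # window size implied by the filter length
    # Row i is the concatenation x_i (+) x_{i+1} (+) ... (+) x_{i+h-1}.
    windows = np.stack([X[i:i + h].reshape(-1) for i in range(n - h + 1)])
    return f(windows @ W.T + b)

n, k, h, l = 6, 4, 2, 3  # sentence length, word dimension, window size, number of filters
rng = np.random.default_rng(0)
X = rng.normal(size=(n, k))                 # toy word vectors
W = rng.normal(scale=0.1, size=(l, h * k))  # toy filters
b = np.zeros(l)
C = narrow_conv(X, W, b)
print(C.shape)  # (5, 3): m = n - h + 1 windows, l features each
```

Filters with a different window size would need their own `W` of matching width, and each would yield a different number of windows for the same sentence.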
One of the best-known works applying CNNs to NLP is by Collobert and Weston (2008), who use a CNN to predict part-of-speech tags, chunks, named entity tags, semantic roles, and semantically similar sentences, and also to learn a language model.