
Answer Ranking with Convolutional Neural Networks

In the previous section we reported answer ranking experiments in which we used an RNN to encode questions and answers, i.e. to represent them as fixed-size vectors. Besides an RNN, a convolutional neural network (CNN) can also be used to encode a text. Initially designed for computer vision tasks, CNNs have become very popular in NLP and have been applied to answer selection (Severyn and Moschitti, 2015a; Tymoshenko et al., 2016a; dos Santos et al., 2016), sentiment analysis (Kim, 2014; dos Santos and Gatti, 2014; Kalchbrenner et al., 2014) and question type classification (Kim, 2014; Kalchbrenner et al., 2014). We have already used a convolutional architecture in Chapter 3 for the task of semantically equivalent question detection. In this chapter, we use an extended version of this architecture, with various filter sizes and better regularisation.

We use a convolutional architecture similar to the one presented by Kim (2014). Let x be a text, e.g. a question or an answer, and $\mathbf{x}_i \in \mathbb{R}^k$ the embedding of the i-th word in the text, i.e.

$x = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$

A filter of size h is a vector $\mathbf{w} \in \mathbb{R}^{hk}$ which is applied to a word window of size h and produces a feature $c_i$:

$c_i = f(\mathbf{w} \cdot [\mathbf{x}_i, \ldots, \mathbf{x}_{i+h-1}] + b)$ (6.10)

where $b \in \mathbb{R}$ is a bias and f is a non-linear function, such as the ReLU or the hyperbolic tangent. The filter is applied to every possible word window of size h, and the vector of the produced features is called a feature map:

$\mathbf{c} = (c_1, c_2, \ldots, c_{n-h+1})$ (6.11)

After that, a max-pooling operation takes the maximum from each feature map:

$\hat{c} = \max(c_1, c_2, \ldots, c_{n-h+1})$ (6.12)
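As a minimal sketch of Equations 6.10 to 6.12, the following pure-Python snippet applies a single filter to every word window and max-pools the resulting feature map. The toy embeddings, the filter weights and the name `feature_map` are illustrative assumptions, and the word window is assumed to be the concatenation of its h word vectors.

```python
import math

def feature_map(embeddings, w, b, h, f=math.tanh):
    """Apply one filter of size h to every word window (Eqs. 6.10 and 6.11)."""
    n = len(embeddings)
    features = []
    for i in range(n - h + 1):
        # Concatenate the h word vectors of the window into one vector of length h*k.
        window = [v for word in embeddings[i:i + h] for v in word]
        features.append(f(sum(wj * xj for wj, xj in zip(w, window)) + b))
    return features

# Toy input (assumed values): n = 4 words, k = 2 dimensions, filter size h = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w = [0.5, -0.5, 0.25, 0.25]    # filter weights, length h*k = 4
b = 0.0

c = feature_map(x, w, b, h=2)  # feature map with n - h + 1 = 3 entries
c_hat = max(c)                 # max-pooling (Eq. 6.12)
```

Only the first window produces a non-zero pre-activation here, so the pooled value is the feature of that window.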

The intuition behind the max-pooling operation is to capture the most important information from each feature map (Kim, 2014). Usually, not one but m filters are applied to each window, i.e.:

$\mathbf{c}_i = f(\mathbf{W}^\top [\mathbf{x}_i, \ldots, \mathbf{x}_{i+h-1}] + \mathbf{b})$ (6.13)

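where $\mathbf{W}$ is a matrix of size $hk \times m$ and $\mathbf{b} \in \mathbb{R}^m$ is a bias vector. The representation of the text x is obtained with max-pooling, as in Equation 6.12, applied to each of the m feature maps.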
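The multi-filter case of Equation 6.13 can be sketched in the same way. The snippet below is an illustration under assumptions, not the implementation used in the experiments: the function name `encode`, the toy weights and the choice of ReLU are ours, and the text is assumed to contain at least h words.

```python
def relu(z):
    return max(0.0, z)

def encode(embeddings, W, b, h, f=relu):
    """Max-pooled output of m filters of size h.

    W has h*k rows and m columns; b has m entries (Eq. 6.13).
    """
    n, m = len(embeddings), len(b)
    pooled = [float("-inf")] * m
    for i in range(n - h + 1):
        window = [v for word in embeddings[i:i + h] for v in word]
        for j in range(m):
            # The j-th filter is the j-th column of W.
            c_ij = f(sum(W[r][j] * window[r] for r in range(len(window))) + b[j])
            # Keep the maximum of each feature map over all window positions.
            pooled[j] = max(pooled[j], c_ij)
    return pooled

# Toy input (assumed values): n = 3 words, k = 2, h = 2, m = 2 filters.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, -1.0],
     [0.0,  1.0],
     [0.0,  0.0],
     [1.0, -1.0]]
b = [0.0, 0.5]

rep = encode(x, W, b, h=2)  # [2.0, 0.5]
```

The result has one entry per filter, so its length is m regardless of the length of the text.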

Figure 6.4 illustrates the CNN that we use to encode a question or an answer. The main difference from the architecture used in Chapter 3 is the use of various filter (i.e. word window) sizes instead of a single fixed filter size. We also train the system differently. We use two separate CNNs to encode the question and the answer, then the representations are concatenated and passed to an MLP. The network is trained in the same way as the RNN-based system described in Section 6.1, i.e. by minimising cross-entropy on the training set.

Figure 6.4: Illustration of a CNN encoder for answer ranking. First, words are represented as word embeddings. Second, a convolution with multiple filter sizes is applied to the word embeddings. Finally, max-pooling is applied to the output of the convolutions.
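The encoder in Figure 6.4 can be sketched as follows, assuming that the max-pooled outputs of the different filter sizes are concatenated into a single vector. The helper names `conv_max` and `encode_text`, the `filter_banks` dictionary and all weights are hypothetical; the MLP and its training are omitted.

```python
def conv_max(embeddings, W, b, h):
    """ReLU convolution with the filters in W (h*k rows, m columns), then max-pooling."""
    windows = [[v for word in embeddings[i:i + h] for v in word]
               for i in range(len(embeddings) - h + 1)]
    return [max(max(0.0, sum(W[r][j] * win[r] for r in range(len(win))) + b[j])
                for win in windows)
            for j in range(len(b))]

def encode_text(embeddings, filter_banks):
    """Concatenate the pooled outputs of all filter sizes.

    filter_banks maps a filter size h to a (W, b) pair.
    """
    rep = []
    for h in sorted(filter_banks):
        W, b = filter_banks[h]
        rep.extend(conv_max(embeddings, W, b, h))
    return rep

# Toy inputs (assumed values): k = 2, filter sizes 1 and 2, one filter per size.
question = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
answer = [[0.0, 1.0], [1.0, 1.0]]

# Two separate encoders, i.e. separate weights for questions and answers.
q_banks = {1: ([[1.0], [1.0]], [0.0]),
           2: ([[1.0], [0.0], [0.0], [1.0]], [0.0])}
a_banks = {1: ([[1.0], [-1.0]], [0.0]),
           2: ([[0.5], [0.5], [0.5], [0.5]], [0.0])}

q_rep = encode_text(question, q_banks)  # [2.0, 2.0]
a_rep = encode_text(answer, a_banks)    # [0.0, 1.5]
pair = q_rep + a_rep                    # concatenation that would be passed to the MLP
```

The length of each representation is the sum of the numbers of filters over all filter sizes, so it does not depend on the length of the question or the answer.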

6.2.1 Hyperparameters

We experiment with filter (word window) sizes from one to five, and set the number of filters of each size to 100. On the YA dataset, the dropout probability was set to 0.2 for both the CNN and the MLP and the L2 regularisation rate to $10^{-7}$; on the AU dataset, these were set to 0.3 and $10^{-6}$, respectively. The model was trained with SGD with a mini-batch of size 100 and evaluated on the development set every 500 steps. The training was stopped if there was no improvement on the development set for 10 consecutive evaluations.
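The stopping criterion can be sketched as a simple patience counter. This is an illustration only: `train_step`, `evaluate` and the `max_steps` safety cap are hypothetical names, and the development score is assumed to be a number where higher is better.

```python
def train_with_early_stopping(train_step, evaluate, eval_every=500,
                              patience=10, max_steps=1_000_000):
    """Run training steps until the development score stops improving.

    train_step() performs one SGD update on a mini-batch;
    evaluate() returns the development score (higher is better).
    """
    best, bad_evals, step = float("-inf"), 0, 0
    while step < max_steps:
        train_step()
        step += 1
        if step % eval_every == 0:
            score = evaluate()
            if score > best:
                best, bad_evals = score, 0
            else:
                bad_evals += 1
                if bad_evals >= patience:
                    break  # 10 consecutive evaluations without improvement
    return best, step

# Dummy run: the score peaks at the second evaluation and never improves again.
scores = iter([0.30, 0.35, 0.34] + [0.33] * 20)
best, steps = train_with_early_stopping(lambda: None, lambda: next(scores))
```

In this dummy run the best score is reached at step 1000, and training stops at step 6000, after ten further evaluations without improvement.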

Yahoo! Answers

| Encoder | Test P@1 | Test MRR |
|---|---|---|
| LSTM | 37.45* | 58.12 |
| CNN | 35.45 | 55.98 |
| Paragraph Vector | 37.37 | 57.05 |
| Random baseline | 15.74 | 37.40 |
| CR baseline | 22.63 | 47.17 |
| Jansen et al. (2014) | 30.49 | 51.89 |
| Fried et al. (2015) | 33.01 | 53.96 |

Ask Ubuntu

| Encoder | Test P@1 | Test MRR |
|---|---|---|
| LSTM | 42.64* | 65.28 |
| CNN | 34.76 | 59.96 |
| Paragraph Vector | 41.48 | 64.33 |
| Random baseline | 26.60 | 53.64 |
| CR baseline | 35.36 | 60.17 |
| Chronological baseline | 37.68 | 60.06 |

Table 6.4: Answer ranking performance when using the RNN versus the CNN encoder. *The improvement is statistically significant (p < 0.05).
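For reference, the two metrics in Table 6.4 can be computed as sketched below, assuming the standard definitions of P@1 and MRR and scores in the range 0 to 1 (the table reports them multiplied by 100). The function names and the toy rankings are illustrative.

```python
def precision_at_1(ranked_labels):
    """Fraction of questions whose top-ranked answer is correct."""
    return sum(labels[0] for labels in ranked_labels) / len(ranked_labels)

def mean_reciprocal_rank(ranked_labels):
    """Mean of 1/rank of the first correct answer (0 if none is correct)."""
    total = 0.0
    for labels in ranked_labels:
        for rank, is_correct in enumerate(labels, start=1):
            if is_correct:
                total += 1.0 / rank
                break
    return total / len(ranked_labels)

# Each inner list holds 0/1 correctness flags of the answers in ranked order.
ranked = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
p1 = precision_at_1(ranked)         # 0.25
mrr = mean_reciprocal_rank(ranked)  # (1 + 1/2 + 1/3 + 1/2) / 4
```

P@1 only rewards a correct answer in the first position, whereas MRR also gives partial credit when the correct answer is ranked lower.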

6.2.2 CNN versus RNN for Answer Ranking

We apply the CNN with the hyperparameters providing the highest development P@1 to the test set. Table 6.4 reports the performance of the CNN versus the best performing RNN-based system described in the previous section. On the YA dataset, the CNN proves competitive with the RNN, although the LSTM produces significantly better results. On the AU dataset, however, the CNN performs similarly to the candidate retrieval baseline and is far below the RNN-based systems. The explanation for this is that the AU dataset contains much longer questions and answers. CNNs are in some sense similar to an n-gram model: they encode local features. Unlike RNNs, they lack the ability to represent long-term dependencies, which is essential when encoding long texts. Nonetheless, CNNs succeed in encoding sentences (dos Santos and Gatti, 2014; Kim, 2014) or short texts, e.g. Twitter data (Severyn and Moschitti, 2015b; Kalchbrenner et al., 2014). A two-level CNN like the one presented by Denil et al. (2014), which first composes words into sentences and then sentences into documents, might be a better variant of the CNN architecture for longer texts.

6.3

Multi-Channel Recurrent Convolutional Neu-