Knowledgeable Reader: Enhancing Cloze Style Reading Comprehension with External Commonsense Knowledge

(1)

821

Knowledgeable Reader: Enhancing Cloze-Style Reading Comprehension

with External Commonsense Knowledge

Todor Mihaylov and Anette Frank Research Training Group AIPHES

Department of Computational Linguistics, Heidelberg University Heidelberg, Germany

{mihaylov,frank}@cl.uni-heidelberg.de

Abstract

We introduce a neural reading comprehen-sion model that integrates external com-monsense knowledge, encoded as a key-value memory, in a cloze-style setting. Instead of relying only on document-to-question interaction or discrete features as in prior work, our model attends to rel-evant external knowledge and combines this knowledge with the context represen-tation before inferring the answer. This al-lows the model to attract and imply knowl-edge from an external knowlknowl-edge source that is not explicitly stated in the text, but that is relevant for inferring the an-swer. Our model improves results over a very strong baseline on a hard Common Nounsdataset, making it a strong competi-tor of much more complex models. By including knowledgeexplicitly, our model can also provideevidenceabout the back-ground knowledge used in the RC process.

1 Introduction

Reading comprehension (RC) is a language under-standing task similar to question answering, where a system is expected to read a given passage of text and answer questions about it. Cloze-style reading comprehension is a task setting where the question is formed by replacing a token in a sentence of the story with a placeholder (left part of Figure1).

In contrast to many previous complex models

(Weston et al., 2015; Dhingra et al., 2017; Cui

et al., 2017; Munkhdalai and Yu, 2016; Sordoni

et al.,2016) that perform multi-turn readingof a

story and a question before inferring the correct answer, we aim to tackle the cloze-style RC task in a way that resembles how humans solve it: using, in addition, background knowledge. We develop

The prince put his away and prepared for his long trip..

He mounted his XXXX and rode away.

Story

Question

Commonsense knowledge

Candidates

horse

[image:1.595.333.501.226.316.2]

sword … hand

[IsUsedfor] horse riding

[Causes]

sword death

[HasA] human hand

… …

[RelatedTo] mount animal

Task setup

The prince was on his white , with a in his .sword hand horse

sword

… …

…

Figure 1: Cloze-style reading comprehension with external commonsense knowledge.

a neural model for RC that can successfully deal with tasks where most of the information to infer answers from is given in the document (story), but where additional information is needed to predict the answer, which can be retrieved from a knowl-edge base and added to the context representations explicitly.1An illustration is given in Figure1.

Such knowledge may be commonsense knowl-edgeorfactual background knowledge about enti-ties and eventsthat is not explicitly expressed but can be found in a knowledge base such as Con-ceptNet (Speer et al., 2017), BabelNet (Navigli

and Ponzetto,2012), Freebase (Tanon et al.,2016)

or domain-specific KBs collected with Informa-tion ExtracInforma-tion approaches (Fader et al., 2011;

Mausam et al.,2012;Bhutani et al.,2016). Thus,

we aim to define a neural model that encodes pre-selected knowledge in a memory, and that learns to include the available knowledge as an enrichment to the context representation.

The main difference of our model to prior state-of-the-art is that instead of relying only on document-to-question interaction or discrete fea-tures while performing multiple hops over the doc-ument, our model (i)attends to relevant selected

1

(2)

external knowledgeand (ii)combines this knowl-edge with the context representation before infer-ring the answer, in a single hop. This allows the model to explicitly imply knowledge that is not stated in the text, but is relevant for inferring the answer, and that can be found in an external knowledge source. Moreover, by including knowl-edge explicitly, our model provides evidence and insight about the used knowledge in the RC.

Our main contributions are: (i) We develop a method for integrating knowledge in a simple but effective reading comprehension model (AS Reader, Kadlec et al.(2016)) and improve its re-sults significantly whereas other models employ features or multiple hops. (ii) We examine two sources of common knowledge: WordNet (Miller

et al., 1990) and ConceptNet (Speer et al.,2017)

and show that this type of knowledge is important for answering common nouns questions and also improves slightly the performance fornamed enti-ties.

(iii)We show that knowledge facts can be added directly to the text-only representation, enrich-ing the neural context encodenrich-ing. (iv) We demon-strate the effectiveness of the injected knowledge by case studies and data statistics in a qualitative evaluation study.

2 Reading Comprehension with Background Knowledge Sources

In this work, we examine the impact of using ex-ternal knowledge as supporting information for the task ofcloze stylereading comprehension.

We build a system with two modules. The first, Knowledge Retrieval, performsfact retrieval and selects a number of facts f1, ...,fp that might be

relevant for connecting story, question and can-didate answers. The second, main module, the Knowledgeable Reader, is a knowledge-enhanced neural module. It uses the input of the story con-text tokensd1..m, the question tokensq1..n, the set

of answer candidatesa1..k and a set of ‘relevant’

background knowledge facts f1..p in order to

se-lect the right answer. To include external knowl-edge for the RC task, we encode each fact f1..p

and use attention to select the most relevant among them for each token in the story and question. We expect that enriching the text with additional knowledge about the mentioned concepts will im-prove the prediction of correct answers in a strong single-passsystem. See Figure1for illustration.

2.1 Knowledge Retrieval

In our experiments we use knowledge from the Open Mind Common Sense (OMCS,Singh et al.

(2002)) part of ConceptNet, a crowd-sourced re-source of commonsense knowledge with a total of

∼630k facts. Each factfiis represented as a triple

fi=(subject, relation, object), where subject and

objectcan be multi-word expressions andrelation is a relation type. An example is: ([bow]subj,

[IsUsedFor]_rel, [hunt, animals]_obj)

We experiment with three set-ups: using (i) all facts from OMCS that pertain to Concept-Net, referred to as CN5All, (ii) using all facts

from CN5All excluding some WordNet relations

referred to asCN5Sel(ected)(see Section3), and using (iii) facts from OMCS that havesourceset toWordNet(CN5WN3).

Retrieving relevant knowledge. For each in-stance (D, Q, A1..10) we retrieve relevant

com-monsense background facts. We first retrieve facts that contain lemmas that can be looked up via to-kens contained in anyD(ocument),Q(uestion) or A(nswer candidates). We add a weight value for each node: 4, if it contains a lemma of a candi-date token fromA;3, if it contains a lemma from the tokens ofQ; and2if it contains a lemma from the tokens ofD. The selected weights are chosen heuristically such that they model relative fact im-portance in different interactions asA+A>A+Q >A+D>D+Q>D+D. We weight the fact triples that contain these lemmas as nodes, by summing the weights of the subject and object arguments. Next, we sort the knowledge triples by this overall weight value. To limit the memory of our model, we run experiments with different sizes of the top number of facts (P) selected from all instance fact candidates,P ∈ {50,100,200}. As additional re-trieval limitation, we force the number of facts per answer candidate to be the same, in order to avoid a frequency bias for an answer candidate that ap-pears more often in the knowledge source. Thus, if we select the maximum 100 facts for each task in-stance and we have 10 answer candidatesai=1..10,

we retrieve the top 10 facts for each candidateai

that has either a subject or an object lemma for a token in ai. If the same fact contains lemmas of

two candidatesai andaj (j > i), we add the fact

once foraiand do not add the same fact again for

(3)

𝜶"#$"%&'"= 𝑊*𝜶𝒄𝒕𝒙𝒄𝒕𝒙+ 𝑊.𝜶𝒄𝒕𝒙𝒄𝒕𝒙/𝒌𝒏+ 𝑊2𝜶𝒄𝒕𝒙𝒄𝒕𝒙/𝒌𝒏+ 𝑊3𝜶𝒄𝒕𝒙/𝒌𝒏𝒄𝒕𝒙/𝒌𝒏

swordUsedFor kill Weighted fact representations

Weighted facts sum horse UsedForride

IsA animal horse horse ran fast sword and horse rode his XXXXX

… … Document Question … … Context Representation

Knowledge facts memory

Context + Knowledge Representation

𝜶𝒄𝒕𝒙𝒄𝒕𝒙 𝑟5678

t 𝜶"#

$"%

9'"

…

𝑃 ℎ𝑜𝑟𝑠𝑒|𝑞, 𝑑 = C 𝑎E= 𝑎F+ 𝑎F/H

I

EJK(MNO$",P)

g

Token-‐wise connection Answer placeholder

𝑐P678/S#

𝑐P678

𝑟5678/S#

𝑟5678

𝑟5678/S#

𝜶𝒄𝒕𝒙/𝒌𝒏𝒄𝒕𝒙

𝜶𝒄𝒕𝒙𝒄𝒕𝒙/𝒌𝒏

𝜶𝒄𝒕𝒙/𝒌𝒏𝒄𝒕𝒙/𝒌𝒏

Token-‐wise connection

𝑐S#

At

t +

S

of

tm

ax

multiply

[image:3.595.77.297.61.199.2]

𝑓$U&F_𝑓O"'_𝑓N&F

Figure 2: The Knowledgeable Reader combines plaincontext &enhanced (context + knowledge) repres. ofDandQand retrieved knowledge from the explicit memory with theKey-Valueapproach.

the first in the order of the list2, i.e., the order of retrieval from the database. If one candidate has less than 10 facts, the overall fact candidates for the sample will be less than the maximum (100).

2.2 Neural Model: Extending the Attention Sum Reader with a Knowledge Memory

We implement our Knowledgeable Reader (Kn-Reader)using as a basis theAttention Sum Reader as one of the strongest core models for single-hop RC. We extend it with a knowledge fact memory that is filled with pre-selected facts. Our aim is to examine how adding commonsense knowledge to a simple yet effective model can improve the RC process and to show some evidence of that by at-tending on the incorporated knowledge facts. The model architecture is shown in Figure2.

Base Attention Model. The Attention-Sum Reader (Kadlec et al., 2016), our base model for RC reads the input of story tokensd1..n, the

ques-tion tokens q1..m, and the set of candidates a1..10

that occur in the story text. The model calcu-lates the attention between the question represen-tation rq and the story token context encodings

of the candidate tokensa1..10and sums the

atten-tion scores for the candidates that appear multiple times in the story. The model selects as answer the candidate that has the highest attention score.

Word Embeddings Layer. We represent input document and question tokens w by looking up their embedding representations ei = Emb(wi),

whereEmbis an embedding lookup function. We apply dropout (Srivastava et al.,2014) with keep

2_{We also experimented with re-ranking the facts with the}

same weight sums using tf-idf but we did not notice a differ-ence in performance.

probability p = 0.8 to the output of the embed-dings lookup layer.

Context Representations. To represent the document and question contexts, we first encode the tokens with a Bi-directional GRU (Gated Re-current Unit) (Chung et al., 2014) to obtain con-text-encoded representations for document (cctx_d

1..n)

and question (cctx_q₁_..m) encoding:

cctx_d₁_..n=BiGRUctx(ed1..n)∈R

n×2h ₍₁₎

cctx_q₁_..m =BiGRUctx(eq1..m)∈R

m×2h ₍₂₎

, where di and qi denote the ith token of a text

sequence d(document) andq (question), respec-tively, n andm is the size of dand q andh the output hidden size (256) of a single GRU unit. BiGRU is defined in (3), withei a word

embed-ding vector

BiGRUctx(ei, hiprev) =

[−−−→GRU(ei,

−−−→

hiprev), ←−−−

GRU(ei,

←−−−

hiprev)]

(3)

, where hiprev = [ −−−→

hiprev, ←−−−

hiprev], and −−−→

hiprev

and ←−−−hiprev are the previous hidden states of the

forward and backward layers. Below we use BiGRUctx(ei)without the hidden state, for short.

Question Query Representation. For the question we construct a single vector representa-tionrctx_q by retrieving the token representation at the placeholder (XXXX) indexpl(cf. Figure2):

r_qctx=cctx_q_i..m[pl]∈R2h (4)

where[pl]is an element pickup operation.

Our question vector representation is different from the originalAS Reader that builds the ques-tion by concatenating the last states of a for-ward and backfor-ward layer[−−−→GRU(em),

←−−−

GRU(e1)].

We changed the original representation as we ob-served some very long questions and in this way aim to prevent the context encoder from ’forget-ting’ where the placeholder is.

Answer Prediction: Qctx to Dctx Attention.

In order to predict the correct answer to the given question, we rank the given answer candidates a1..aL according to the normalized attention sum

score between the context (ctx) representation of the question placeholderrctx_q and the representa-tion of the candidate tokens in the document:

P(ai|q, d) =sof tmax( X

αij) (5)

αij =Att(r

ctx

(4)

, wherej is an index pointer from the list of in-dices that point to the candidate ai token

occur-rences in the document context representationcd.

Attis a dot product.

Enriching Context Representations with

Knowledge (Context+Knowledge). To

en-hance the representation of the context, we add knowledge, retrieved as a set of knowledge facts.

Knowledge Encoding. For each instance in the dataset, we retrieve a number of relevant facts (cf. Section2.1). Each retrieved fact is represented as a triple f = (wsubj₁_..L

subj, w

rel

0 , w

obj

1..Lobj), where wsubj₁_..L

subj and w

obj

1..Lobj are a multi-word

expres-sions representing thesubjectandobjectwith se-quence lengthsLsubjandLobj, andwrel0 is a word

token corresponding to a relation.3 As a result of fact encoding, we obtain a separate knowledge memory for each instance in the data.

To encode the knowledge we use a BiGRU to encode the triple argument tokens into the follow-ing context-encoded representations:

f_lastsubj =BiGRU(Emb(w₁subj_..L

subj),0) (7) f_lastrel =BiGRU(Emb(wrel₀ ), f_lastsubj) (8)

f_lastobj =BiGRU(Emb(w₁obj_..L subj), f

rel

last) (9)

, where f_lastsubj, f_lastrel, f_lastobj are the final hidden states of the context encoder BiGRU, that are also used as initial representations for the encod-ing of the next triple attribute in left-to-right order. SeeSupplementfor comprehensive visualizations. The motivation behind this encoding is: (i) We encode the knowledge fact attributes in the same vector space as the plain tokens; (ii) we preserve the triple directionality; (iii) we use the relation type as a way of filtering thesubject information to initialize theobject.

Querying the Knowledge Memory. To en-rich the context representation of the document and question tokens with the facts collected in the knowledge memory, we select a singlesum of weighted fact representationsfor each token using Key-Value retrieval (Miller et al., 2016). In our model thekeyM_ik(ey)can be eitherf_lastsubj orf_lastobj and thevalueM_iv(alue)isf_lastobj.

For each context-encoded tokencctx_s_i (s =d, q; i the token index) we attend over all knowledge

3_The₀_in_wrel

0 indicates that we encode the relation as a

singlerelation typeword. Ex./r/IsUsedFor.

memory keys M_ik in the retrieved P knowledge facts. We use an attention functionAtt, scale the scalar attention value usingsof tmax, multiply it with the value representationM_iv and sum the re-sult into a single vector value representationckn_s_i :

ckn_s_i =Xsof tmax(Att(cctx, M₁k_..P))TM₁v_..P (10) Attis a dot product, but it can be replaced with another attention function. As a result of this op-eration, the context token representationcctx_s_i and the corresponding retrieved knowledgeckn_s_i are in the same vector space∈_R2h_.

Combine Context and Knowledge(ctx+kn).

We combine the original context token representa-tioncctx

si , with the acquired knowledge

representa-tionckn_s_i to obtaincctx_s_i +kn:

cctx_s_i +kn=γcctx_s_i + (1−γ)ckn_s_i (11)

, whereγ = 0.5. We keepγ static but it can be replaced with a gating function.

Answer Prediction: Qctx(+kn) to Dctx(+kn).

To rank answer candidatesa1..aLwe use attention

sum similar to Eq.5 over an attention αensemble_i j

that combines attentions between context (ctx) and context+knowledge (ctx+kn) representations of the question (rctxq (+kn)) and candidate token

oc-currencesaijin the documentc

ctx(+kn)

dj :

P(ai|q, d) =sof tmax( X

αensemble_i_j ) (12)

αensemble_i_j =

W1Att(rctxq , cctxdj )

+W2Att(rqctx, cctxdj+kn)

+W3Att(rqctx+kn, cctxdj )

+W4Att(rqctx+kn, cctxdj+kn)

(13)

, where j is an index pointer from the list of indices that point to the candidate ai token

oc-currences in the document context representation cctx_d (+kn).W1..4are scalar weights initialized with

1.0 and optimized during training.4 We pro-pose the combination ofctxandctx+kn atten-tions because our task does not provide supervi-sion whether the knowledge is needed or not.

4_{An example for learned}_W

1..4is (2.13,1.41,1.49,1.84)

(5)

CN NE Train 120,769 / 470 108,719 / 433

Dev 2,000 / 448 2,000 / 412

Test 2,500 / 461 2,500 / 424

[image:5.595.90.278.62.132.2]

Vocab 53,185 53,063

Table 1: Characteristics of Children Book Test datasets. CN: Common Nouns, NE: Named En-tities. Cells forTrain, Dev, Testshow overall num-bers of examples and average story size in tokens.

3 Data and Task Description

We experiment with knowledge-enhanced cloze-style reading comprehension using the Common NounsandNamed Entitiespartitions of the Chil-dren’s Book Test (CBT) dataset (Hill et al.,2015). In the CBT cloze-style task a system is asked to read a children story context of 20 sentences. The following 21stsentence involves a placeholder to-ken that the system needs to predict, by choosing from a given set of 10 candidate words from the document. An example with suggested external knowledge facts is given in Figure 1. While in itsCommon Nounssetup, the task can be consid-ered as a language modeling task,Hill et al.(2015) show that humans can answer the questions with-out the full context with an accuracy of only64.4% and a language model alone with57.7%. By con-trast, the human performance when given the full context is at 81.6%. Since the best neural model

(Munkhdalai and Yu,2016) achieves only72.0%

on the task, we hypothesize that the task itself can benefit from external knowledge. The characteris-tics of the data are shown in Table1.

Other popular cloze-style datasets such as CNN/Daily Mail (Hermann et al.,2015) or Who-DidWhat (Onishi et al., 2016) are mainly focu-sed on finding Named Entities where the benefit of adding commonsense knowledge (as we show for theNEpart of CBT) would be more limited.

Knowledge Source. As a source of common-sense knowledge we use the Open Mind Com-mon Sense part of ConceptNet 5.0 that contains 630k fact triples. We refer to this entire source as CN5All. We conduct experiments with subparts of this data:CN5WN3which is the WordNet 3 part of CN5All(213k triples) andCN5Sel, which excludes the following WordNet relations: RelatedTo, IsA, Synonym, SimilarTo, HasContext.

4 Related Work

Cloze-Style Reading Comprehension. Follow-ing the original MCTest (Richardson et al.,2013) dataset multiple-choice version of cloze-style RC) recently several large-scale, automatically gener-ated datasets for cloze-style reading comprehen-sion gained a lot of attention, among others the ‘CNN/Daily Mail’ (Hermann et al., 2015;

On-ishi et al., 2016) and the Children’s Book Test

(CBTest) data set (Hill et al.,2015). Early work in-troduced simple but goodsingle turn models(

Her-mann et al.,2015;Kadlec et al.,2016;Chen et al.,

2016), that read the document once with the ques-tion representaques-tion ‘in mind’ and select an answer from a given set of candidates. More complex models (Weston et al.,2015;Dhingra et al.,2017;

Cui et al., 2017;Munkhdalai and Yu, 2016;

Sor-doni et al.,2016) performmulti-turn readingof the

story context and the question, before inferring the correct answer or use features (GA Reader,

Dhin-gra et al. (2017). Performing multiple hops and

modeling a deeper relation between question and documentwas further developed by several mod-els (Seo et al., 2017; Xiong et al., 2016; Wang

et al., 2016, 2017; Shen et al., 2016) on another

generation of RC datasets, e.g. SQuAD (Rajpurkar

et al., 2016), NewsQA (Trischler et al., 2017) or

TriviaQA (Joshi et al.,2017).

Integrating Background Knowledge in Neural Models. Integrating background knowledge in a neural model was proposed in theneural-checklist model by Kiddon et al. (2016) for text genera-tion of recipes. They copy words from a list of ingredients instead of inferring the word from a global vocabulary. Ahn et al. (2016) proposed a language model that copies fact attributes from a topic knowledge memory. The model predicts a fact in the knowledge memory using a gating mechanism and given this fact, the next word to be selected is copied from the fact attributes. The knowledge facts are encoded using embeddings obtained usingTransE(Bordes et al.,2013).Yang

et al.(2017) extended aseq2seqmodel with

atten-tion to external facts for dialogue and recipe gen-eration and a co-reference resolution-aware lan-guage model. A similar model was adopted byHe

et al.(2017) for answer generation in dialogue.

In-corporating external knowledge in a neural model has proven beneficial for several other tasks:Yang

(6)

di-rectly into the LSTM cell state to improve event and entity extraction. They used knowledge em-beddings trained on WordNet (Miller et al.,1990) and NELL (Mitchell et al.,2015) using the

BILIN-EAR(Yang et al.,2014) model.

Work similar to ours is by Long et al. (2017), who have introduced a new task of Rare Entity Prediction. The task is to read a paragraph from WikiLinks (Singh et al.,2012) and to fill a blank field in place of a missing entity. Each miss-ing entity is characterized with a short description derived from Freebase, and the system needs to choose one from a set of pre-selected candidates to fill the field. While the task is superficially sim-ilar to cloze-style reading comprehension, it dif-fers considerably: first, when considering the text without the externally provided entity information, it is clearly ambiguous. In fact, the task is more similar to Entity Linking tasks in the Knowledge Base Population (KBP) tracks at TAC 2013-2017, which aim at detecting specific entities from Free-base. Our work, by contrast, examines the impact of injecting external knowledge in a reading com-prehension, or NLU task, where the knowledge is drawn from a commonsense knowledge base, ConceptNet in our case. Another difference is that in their setup, the reference knowledge for the candidates is explicitly provided as a single, fixed set of knowledge facts (the entity description), en-coded in a single representation. In our work, we are retrieving (typically) distinct sets of knowl-edge facts that might (or might not) be relevant for understanding the story and answering the ques-tion. Thus, in our setup, we crucially depend on the ability of the attention mechanism to retrieve relevant pieces of knowledge. Our aim is to exam-ine to what extent commonsense knowledge can contribute to and improve the cloze-style RC task, that in principle is supposed to be solvable without explicitly given additional knowledge. We show that by integrating external commonsense knowl-edge we achieve clear improvements in reading comprehension performance over a strong base-line, and thus we can speculate that humans, when solving this RC task, are similarly using common-sense knowledge as implicitly understood back-ground knowledge.

Recent unpublished work inWeissenborn et al.

(2017) is driven by similar intentions. The au-thors exploit knowledge from ConceptNet to im-prove the performance of a reading

comprehen-sion model, experimenting on the recent SQuAD

(Rajpurkar et al.,2016) and TriviaQA (Joshi et al.,

2017) datasets. While the source of the back-ground knowledge is the same, the way of inte-grating this knowledge into the model and task is different. (i) We are using attention to select un-ordered fact triples using key-value retrieval and (ii) we integrate the knowledge that is consid-ered relevant explicitly for each token in the con-text. The model ofWeissenborn et al.(2017), by contrast, explicitly reads the acquired additional knowledge sequentially after reading the docu-ment and question, but transfers the background knowledge implicitly, by refining the word embed-dings of the words in the document and the ques-tion with the words from the supporting knowl-edge that share the same lemma. In contrast to the implicit knowledge transfer of Weissenborn

et al. (2017), our explicit attention over

exter-nal knowledge facts can deliver insights about the used knowledge and how it interacts with specific context tokens (see Section6).

5 Experiments and Results

We perform quantitative analysis through experi-ments. We study the impact of the used knowl-edge and different model components that employ the external knowledge. Some of the experiments below focus only on the Common Nouns (CN) dataset, as it has been shown to be more challeng-ing thanNamed Entities (NE)in prior work.

5.1 Model Parameters

We experiment with different model parameters.

Number of facts. We explore different sizes of knowledge memories, in terms of number of ac-quired facts. If not stated otherwise, we use 50 facts per example.

Key-Value Selection Strategy. We use two strategies for defining key and value (Key/Value): Subj/ObjandObj/Obj, whereSubjandObjare the subject and object attributes in the fact triples and they are selected as Key and Value for the KV memory (see Section 2.2, Querying the Knowl-edge Memory). If not stated otherwise, we use the Subj/Objstrategy.

Answer Selection Components. If not stated otherwise, we use ensemble attention αensemble

(7)

Source Dev Test

CN5All 71.40 66.72

CN5WN3 (WN3) 70.70 68.48

[image:7.595.114.250.62.110.2]

CN5Sel(ected) 71.85 67.64

Table 2: Results with different knowledge sources, for CBT-CN (Full model, 50 facts).

# facts 50 100 200 500

Dev 71.85 71.35 71.40 71.20

Test 67.64 67.44 68.12 67.24

Table 3: Results for CBT (CN) with different num-bers of facts. (Full model, CN5Sel)

Hyper-parameters. For our experiments we use pre-trained Glove (Pennington et al., 2014) em-beddings, BiGRU with hidden size 256, batch size of 64 and learning reate of0.001as they were shown (Kadlec et al., 2016) to perform good on the AS Reader.

5.2 Empirical Results

We perform experiments with the different model parameters described above. We report accuracy on theDevandTestand use the results onDevset for pruning the experiments.

Knowledge Sources. We experiment with dif-ferent configuration ofConceptNetfacts (see Sec-tion3). Results on theCBT CNdataset are shown in Table2. CN5Selworks best on theDevset but CN5WN3works much better onTest. Further ex-periments use theCN5Selsetup.

Number of facts. We further experiment with different numbers of facts on theCommon Nouns dataset (Table3). The best result on theDevset is for 50 facts so we use it for further experiments.

Component ablations. We ensemble the atten-tions from different combinaatten-tions of the inter-action between the question and document con-text (ctx) representations and context+knowledge (ctx+kn)representations in order to infer the right answer (see Section2.2, Answer Ranking).

Table 4 shows that the combination of differ-ent interactions betweenctxandctx+kn represen-tations leads to clear improvement over the w/o knowledge setup, in particular for the Common Nouns dataset. We also performed ablations for a model with 100 facts (seeSupplement).

Key-Value Selection Strategy. Table 5 shows that for theNEdataset, the two strategies perform

NE CN

DreprtoQreprinteraction Dev Test Dev Test

Dctx,Qctx(w/o know) 75.50 70.30 68.20 64.80

Dctx+kn,Qctx+kn 76.45 69.68 70.85 66.32

Dctx,Qctx+kn 77.10 69.72 70.80 66.32

Dctx+kn,Qctx 75.65 70.88 71.20 67.96

Full model 76.80 70.24 71.85 67.64

w/oDctx,Qctx 75.95 70.24 70.65 67.12

w/oDctx+kn,Qctx+kn 76.20 69.80 70.75 67.00

w/oDctx,Qctx+kn 76.55 70.52 71.75 66.32

w/oDctx+kn,Qctx 76.05 70.84 70.80 66.80

Table 4: Results for different combinations of in-teractions between document (D) and question (Q) context (ctx) and context + knowledge (ctx+kn) representations. (CN5Sel, 50 facts)

NE CN

Key/Value Dev Test Dev Test

Subj/Obj 76.65 71.52 71.85 67.64

Obj/Obj 76.70 71.28 71.25 67.48

Table 5: Results for key-value knowledge retrieval and integration. (CN5Sel, 50 facts). Subj/Obj means: we attend over the fact subject (Key) and take the weighted fact object as value (Value).

NE CN

Models dev test dev test

Human (ctx + q) - 81.6 - 81.6

Single interaction

LSTMs (ctx + q) (Hill et al.,2015) 51.2 41.8 62.6 56.0

AS Reader 73.8 68.6 68.8 63.4

AS Reader (our impl) 75.5 70.3 68.2 64.8

KnReader (ours) 77.4 71.4 71.8 67.6

Multiple interactions

MemNNs (Weston et al.,2015) 70.4 66.6 64.2 63.0

EpiReader (Trischler et al.,2016) 74.9 69.0 71.5 67.4

GA Reader (Dhingra et al.,2017) 77.2 71.4 71.6 68.0

IAA Reader (Sordoni et al.,2016) 75.3 69.7 72.1 69.2

AoA Reader (Cui et al.,2017) 75.2 68.6 72.2 69.4

GA Reader (+feat) 77.8 72.0 74.4 70.7

NSE (Munkhdalai and Yu,2016) 77.0 71.4 74.3 71.9

Table 6: Comparison of KnReader to existing end-to-end neural models on the benchmark datasets.

equally well on theDevset, whereas theSubj/Obj strategy works slightly better on theTestset. For Common Nouns,Subj/Objis better.

Comparison to Previous Work. Table 6 com-pares our model (Knowledgeable Reader) to pre-vious work on the CBT datasets. We show the results of our model with the settings that per-formed best on the Dev sets of the two datasets NE and CN: for NE, (Dctx+kn, Qctx) with 100

facts; for CN the Full model with 50 facts, both withCN5Sel.

[image:7.595.310.528.63.185.2]

(8)

inter-action (single-hop)between the document context and the question so we primarily compare to and aim at improving over similar models. KnReader clearly outperforms prior single-hop models on both datasets. While we do not improve over the state of the art, our model stands well among other models that perform multiple hops. In the Supple-mentwe also give comparison to ensemble models and some models that use re-ranking strategies.

6 Discussion and Analysis

6.1 Analysis of the empirical results.

Our experiments examined key parameters of the KnReader. As expected, injection of background knowledge yields only small improvements over the baseline model forNamed Entities. However, on this dataset our single-hop model is competitive to most multi-hop neural architectures.

The integration of knowledge clearly helps for

the Common Nouns task. The impact of

knowl-edge sources (Table2) is different on theDevand Testsets which indicates that either the model or the data subsets are sensitive to different knowl-edge types and retrieved knowlknowl-edge. Table 5

shows that attending over theSubj of the knowl-edge triple is slightly better thanObj. This shows that using aKey-Valuememory is valuable. A rea-son for lower performance ofObj/Objis that the model picks facts that are similar to the candidate tokens, not adding much new information. From the empirical results we see that training and eval-uation with less facts is slightly better. We hypoth-esize that this is related to the lack of supervision on the retrieved and attended knowledge.

6.2 Interpreting Component Importance

Figure 3 shows the impact on prediction accu-racy of individual components of theFull model, including the interaction between D andQ with ctx or ctx+kn(w/o ctx-only). The values for each component are obtained from the attention weights, without retraining the model. The differ-ence between blue (left) and orange (right) values indicates how much the module contributes to the model. Interestingly, the ranking of the contribu-tion (Dctx, Qctx+kn> Dctx+kn, Qctx > Dctx+kn,

Qctx+kn) corresponds to the component

impor-tance ablation on theDevset, lines 5-8, Table4.

0 50 100 150 200 250

Subj/Obj, 50 facts

incorrect -‐> correct

0 50 100 150 200 250

Obj/Obj, 50 facts

correct -‐> incorrect

Figure 3: # of items with reversed prediction (±correct) for each combination of (ctx+kn, ctx) for Q and D. We report the number of wrong

→ correct (blue) and correct→ wrong (orange) changes when switching from score w/o knowl-edge to score w/ knowlknowl-edge. The best model type isEnsemble. (Full model w/oDctx,Qctx).

6.3 Qualitative Data Investigation

We will use the attention values of the interactions between Dctx(+kn) andQctx(+kn) and attentions

to facts from each candidate token and the ques-tion placeholder to interpret how knowledge is em-ployed to make a prediction for a single example.

Method: Interpreting Model Components.

We manually inspect examples from the evalu-ation sets where KnReader improves prediction (blue (left) category, Fig.3) or makes the predic-tion worse (orange (right) category, Fig.3). Figure

[image:8.595.323.509.64.206.2]

(9)

in-bird head legs sides wood Dctx, Qctx (w/o know) 0.00 0.26 0.40 0.33 0.02

Q: UNK_59 did not say anything ; but when the other two had passed on she bent down to the bird , brushed aside the feathers from his xxxxx , and kissed his closed eyes gently

.

0.0 0.4

Q

bird _head legs _sides _wood

bird /r/PartOf bird beak /r/PartOf bird wing /r/PartOf bird a bird /r/UsedFor testing air in a mine bird /r/CapableOf head south a bird /r/CapableOf sing to other birds head /r/PartOf animal basilar artery /r/PartOf head ear /r/PartOf head porch /r/PartOf house a bird /r/HasA two legs wood /r/Antonym fire spite /r/DistinctFrom like wood /r/AtLocation a fire wood /r/DistinctFrom carpet

0.2 0.4 0.6 0.8

bird head legs sides wood

Ensemble Dctx + kn, Qctx

Dctx + kn, Qctx + kn

Dctx, Qctx + kn

0.02 0.70 0.12 0.05 0.02 0.01 0.36 0.37 0.23 0.03 0.07 0.68 0.19 0.02 0.01 0.01 0.83 0.13 0.01 0.00

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

Figure 4: Interpreting the components of Kn-Reader. Adding knowledge toQandDincreases the score for the correct answer. Results for top 5 candidates are shown. (Full model, CN data, CN5Sel, Subj/Obj, 50 facts)

stead the model selects the factear /r/PartOf head which receives the highest attention fromQ. The weightedObjrepresentation (head) is added to the question with the highest weight, together with an-imalandbirdfrom the next highly weighted facts This results in a high score for theQctxtoDctx+kn

interaction with candidate head. See Supplement for more details.

Using the method described above, we analyze several example cases (presented in Supplement) that highlight different aspects of our model. Here we summarize our observations.

(i.) Answer prediction from Q or Q+D. In both human and machine RC, questions can be an-swered based on the question alone (Figure4) or jointly with the story context (Case 2,Suppl.). We show that empirically, enriching the question with knowledge is crucial for the first type, while en-richment of Q and D is required for the second.

(ii.) Overcoming frequency bias.. We show

that when appropriate knowledge is available and selected, the model is able to correct a frequency bias towards an incorrect answer (Cases 1 and 3).

(iii.) Providing appropriate knowledge. We observe a lack of knowledge regarding events (e.g. take off vs.put on clothes, Case 2;climb up, Case 5). Nevertheless relevant knowledge from CN5 can help predicting infrequent candidates (Case 2).

(iv.) Knowledge, Q and D encoding. The con-text encoding of facts allows the model to detect knowledge that is semantically related, but not sur-face near to phrases in Q and D (Case 2). The model finds facts to non-trivial paraphrases (e.g. undressed–naked, Case 2).

7 Conclusion and Future Work

We propose a neural cloze-style reading prehension model that incorporates external com-monsense knowledge, building on a single-turn neural model. Incorporating external knowledge improves its results with a relative error rate re-duction of 9% onCommon Nouns, thus the model is able to compete with more complex RC mod-els. We show that the types of knowledge con-tained in ConceptNet are useful. We provide quan-titative and qualitative evidence of the effective-ness of our model, that learns how to select rel-evant knowledge to improve RC. The attractive-ness of our model lies in itstransparency and flex-ibility: due to the attention mechanism, we can trace and analyze the facts considered in answer-ing specific questions. This opens up for deeper investigation and future improvement of RC mod-els in a targeted way, allowing us to investigate what knowledge sources are required for different data sets and domains. Since our model directly integrates background knowledge with the docu-ment and questioncontext representations, it can be adapted to very different task settings where we have a pair of two arguments (i.e. entailment, question answering, etc.) In future work, we will investigate even tighter integration of the attended knowledge and stronger reasoning methods.

Acknowledgments

[image:9.595.77.291.91.382.2]

(10)

References

Sungjin Ahn, Heeyoul Choi, Tanel P¨arnamaa, and Yoshua Bengio. 2016. A neural knowledge lan-guage model. InCoRR, volume abs/1608.00318.

Nikita Bhutani, H V Jagadish, and Dragomir Radev. 2016. Nested propositions in open information ex-traction. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Process-ing, pages 55–64, Austin, Texas. Association for Computational Linguistics.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. InAdvances in Neural Information Processing Systems 26, pages 2787–2795. Curran Associates, Inc.

Danqi Chen, Jason Bolton, and Christopher D. Man-ning. 2016. A thorough examination of the cnn/daily mail reading comprehension task. In Pro-ceedings of the 54th Annual Meeting of the Associa-tion for ComputaAssocia-tional Linguistics (Volume 1: Long Papers), pages 2358–2367, Berlin, Germany. Asso-ciation for Computational Linguistics.

Junyoung Chung, Ç alar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence model-ing. arXiv e-prints, abs/1412.3555. Presented at the Deep Learning workshop at NIPS2014.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehen-sion. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 593–602. Association for Computational Linguistics.

Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Pro-ceedings of the 55th Annual Meeting of the Associa-tion for ComputaAssocia-tional Linguistics (Volume 1: Long Papers), pages 1832–1846. Association for Compu-tational Linguistics.

Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011.Identifying relations for open information ex-traction. InProceedings of the 2011 Conference on Empirical Methods in Natural Language Process-ing, pages 1535–1545, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Shizhu He, Cao Liu, Kang Liu, and Jun Zhao. 2017. Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. InProceedings of the 55th An-nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 199– 208. Association for Computational Linguistics.

Karl Moritz Hermann, Tomas Kocisky, Edward

Grefenstette, Lasse Espeholt, Will Kay, Mustafa Su-leyman, and Phil Blunsom. 2015. Teaching ma-chines to read and comprehend. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representa-tions.volume abs/1511.02301.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehen-sion. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vol-ume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics (ACL) 2017.

Rudolf Kadlec, Martin Schmid, Ondˇrej Bajgar, and Jan Kleindienst. 2016. Text understanding with the at-tention sum reader network. In Proceedings of the 54th Annual Meeting of the Association for Compu-tational Linguistics (Volume 1: Long Papers), pages 908–918. Association for Computational Linguis-tics.

Chlo´e Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016.Globally coherent text generation with neural checklist models. InProceedings of the 2016 Con-ference on Empirical Methods in Natural Language Processing, pages 329–339, Austin, Texas. Associa-tion for ComputaAssocia-tional Linguistics.

Teng Long, Emmanuel Bengio, Ryan Lowe, Jackie Chi Kit Cheung, and Doina Precup. 2017. World knowledge for reading comprehension: Rare en-tity prediction with hierarchical lstms using exter-nal descriptions. InProceedings of the 2017 Con-ference on Empirical Methods in Natural Language Processing, pages 825–834, Copenhagen, Denmark. Association for Computational Linguistics.

Mausam, Michael Schmitz, Stephen Soderland, Robert Bart, and Oren Etzioni. 2012. Open language learn-ing for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 523–534, Jeju Island, Korea. Association for Computational Lin-guistics.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason We-ston. 2016. Key-value memory networks for directly reading documents. In Proceedings of the 2016 Conference on Empirical Methods in Natural Lan-guage Processing, pages 1400–1409, Austin, Texas. Association for Computational Linguistics.

(11)

1990. Introduction to wordnet: An on-line lexical database*. In International Journal of Lexicogra-phy, volume 3, pages 235–244.

T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Rit-ter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling. 2015. Never-ending learning. In Pro-ceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15).

Tsendsuren Munkhdalai and Hong Yu. 2016. Reason-ing with Memory Augmented Neural Networks for Language Comprehension. InInternational Confer-ence on Learning Representations (ICLR) 2017.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual se-mantic network. Artificial Intelligence, 193:217– 250.

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gim-pel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Pro-ceedings of the 2016 Conference on Empirical Meth-ods in Natural Language Processing, pages 2230– 2235, Austin, Texas. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Con-ference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Associa-tion for ComputaAssocia-tional Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natu-ral Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. InProceedings of the 2013 Conference on Em-pirical Methods in Natural Language Processing, pages 193–203, Seattle, Washington, USA. Associ-ation for ComputAssoci-ational Linguistics.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hananneh Hajishirzi. 2017. Bi-Directional Atten-tion Flow for Machine Comprehension. In Proceed-ings of International Conference of Learning Repre-sentations 2017, pages 1–12.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. 2016. Reasonet: Learning to stop reading in machine comprehension. InProceedings

of the Workshop on Cognitive Computation: Inte-grating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neu-ral Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016.

Parmjit Singh, T Lin, E.T. Mueller, G Lim, T Perkins, and W.L. Zhu. 2002. Open mind common sense: Knowledge acquisition from the general public. In

Lecture Notes in Computer Science, volume 2519, pages 1223–1237.

Sameer Singh, Amarnag Subramanya, Fernando

Pereira, and Andrew McCallum. 2012. Wikilinks: A large-scale cross-document coreference corpus la-beled via links to Wikipedia. Technical Report UM-CS-2012-015.

Alessandro Sordoni, Phillip Bachman, and Yoshua Bengio. 2016. Iterative alternating neural attention for machine reading. abs/1606.02245.

Robert Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. InAAAI.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014.

Dropout: A simple way to prevent neural networks from overfitting. volume 15, pages 1929–1958.

Thomas Pellissier Tanon, Denny Vrande, San Fran-cisco, Sebastian Schaffert, and Thomas Steiner. 2016. From Freebase to Wikidata : The Great Mi-gration. In Proceedings of the 25th International Conference on World Wide Web, pages 1419–1428.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Har-ris, Alessandro Sordoni, Philip Bachman, and Ka-heer Suleman. 2017. Newsqa: A machine compre-hension dataset. In Proceedings of the 2nd Work-shop on Representation Learning for NLP, pages 191–200, Vancouver, Canada. Association for Com-putational Linguistics.

Adam Trischler, Zheng Ye, Xingdi Yuan, Philip Bach-man, Alessandro Sordoni, and Kaheer Suleman. 2016. Natural language comprehension with the epireader. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Pro-cessing, pages 128–137. Association for Computa-tional Linguistics.

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching net-works for reading comprehension and question an-swering. InProceedings of the 55th Annual Meet-ing of the Association for Computational LMeet-inguistics (Volume 1: Long Papers), pages 189–198. Associa-tion for ComputaAssocia-tional Linguistics.

(12)

Dirk Weissenborn, Tomas Kocisky, and Chris Dyer.

2017. Dynamic integration of background

knowledge in neural NLU systems. CoRR,

abs/1706.02596.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. InInternational Confer-ence on Learning Representations (ICLR), 2015.

Caiming Xiong, Victor Zhong, and Richard Socher. 2016. Dynamic coattention networks for ques-tion answering. In International Conference on Learning Representations (ICLR), 2017, volume abs/1611.01604.

Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in lstms for improving machine reading. In Proceedings of the 55th Annual Meet-ing of the Association for Computational LMeet-inguistics (Volume 1: Long Papers), pages 1436–1446. Asso-ciation for Computational Linguistics.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In International Conference on Learning Representations (ICLR), 2015.

Zichao Yang, Phil Blunsom, Chris Dyer, and Wang Ling. 2017. Reference-aware language models. In