
CHAPTER 2 : Background and Related Work

2.3 Measuring the progress towards NLU

Evaluation protocols are critical in incentivizing the field to solve the right problems. One of the earliest proposals is due to Alan Turing: if you had a pen-pal for years, you would not know whether you’re corresponding with a human or a machine (Turing, 1950; Harnad, 1992). A major limitation of this test (and many of its extensions) is that it is “expensive” to compute (Hernandez-Orallo, 2000; French, 2000).

A more practical alternative is to pose questions; if an actor (human or computer) understands a given text, it should be able to answer

any questions about it. Throughout this thesis, we will refer to this protocol as Question Answering (QA). This has been used in the field for many years (McCarthy, 1976; Winograd, 1972; Lehnert, 1977). A few other terms have been popularized in the community to refer to the same task. The phrase Reading Comprehension, borrowed from standardized tests (SAT, TOEFL, etc.), usually refers to the scenario where a paragraph is attached to the given question. Another similar phrase is Machine Comprehension. Throughout this thesis, we use these phrases interchangeably to refer to the same task.

To make it more formal, for an assumed scenario described by a paragraph P, a system f

equipped with NLU should be able to answer any questions Q about the given paragraph

P. One can measure the expected performance of the system on a set of questions D, via some distance measure d(., .) between the predicted answers f(Q;P) and the correct

answers f∗(Q;P) (usually a prediction agreed upon by multiple humans):

$$R(f; D) = \mathbb{E}_{(Q,P) \sim D} \Big[ d\big( f(Q;P),\, f^{*}(Q;P) \big) \Big]$$
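
As an illustration, the expectation in R(f; D) can be estimated by averaging the distance over a finite set of (question, paragraph, answer) triples. The sketch below is only one possible reading: the names `empirical_risk` and `exact_match_distance`, the 0/1 exact-match choice for d, and the toy system and data are assumptions made for this example, not something specified in the text.

```python
from typing import Callable, Iterable, Tuple

# One evaluation instance: (question Q, paragraph P, gold answer f*(Q;P)).
Example = Tuple[str, str, str]


def exact_match_distance(predicted: str, gold: str) -> float:
    """Assumed choice of d: 0.0 if the normalized strings agree, else 1.0."""
    return 0.0 if predicted.strip().lower() == gold.strip().lower() else 1.0


def empirical_risk(
    system: Callable[[str, str], str],
    dataset: Iterable[Example],
    distance: Callable[[str, str], float] = exact_match_distance,
) -> float:
    """Average distance between predicted and gold answers over the dataset."""
    examples = list(dataset)
    if not examples:
        raise ValueError("dataset must contain at least one example")
    total = sum(distance(system(q, p), gold) for q, p, gold in examples)
    return total / len(examples)


def first_word_system(question: str, paragraph: str) -> str:
    """Hypothetical toy system: always answers with the paragraph's first word."""
    return paragraph.split()[0]


# Hypothetical toy data, invented for this illustration only.
toy_dataset = [
    ("Which city is described?", "Paris is large.", "Paris"),
    ("Which city is described?", "Rome is old.", "Milan"),
]
risk = empirical_risk(first_word_system, toy_dataset)
```

Here a lower value of `risk` means the predictions are closer to the gold answers; any other distance (for instance, a token-overlap measure) can be passed in through the `distance` argument.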

A critical question here is the choice of the question set D so that R(f; D) is an effective measure of f’s progress towards NLU. Denote the set of all possible English questions as D_u. This is an enormous set and, in practice, it is unlikely that we could write them all in one place. Instead, it might be more practical to sample from this set. In practice, this sampling is replaced with static datasets. This introduces a problem: datasets are hardly a uniform subset of D_u; instead, they are heavily skewed towards simplicity.
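
To see why a skewed sample matters, consider a small worked example. All numbers below are hypothetical and chosen only for illustration: a system is assumed to err on 10% of "simple" questions and 60% of "hard" ones, the universal set is assumed to be an even mix of the two, and a static dataset is assumed to be 90% simple. The helper name `mixture_risk` is likewise an assumption of this sketch.

```python
from typing import Dict


def mixture_risk(error_by_type: Dict[str, float],
                 proportion_by_type: Dict[str, float]) -> float:
    """Expected error when question types occur with the given proportions."""
    if abs(sum(proportion_by_type.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(share * error_by_type[kind]
               for kind, share in proportion_by_type.items())


# Hypothetical per-type error rates of some system f.
error_by_type = {"simple": 0.1, "hard": 0.6}

# Hypothetical composition of the universal set versus a static dataset.
universal_mix = {"simple": 0.5, "hard": 0.5}
dataset_mix = {"simple": 0.9, "hard": 0.1}

universal_risk = mixture_risk(error_by_type, universal_mix)  # about 0.35
dataset_risk = mixture_risk(error_by_type, dataset_mix)      # about 0.15
```

Under these assumed numbers, the error measured on the skewed dataset is less than half of the error on the universal mix, so the dataset alone would overstate the system's progress.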

Figure 6 depicts a hypothetical high-dimensional manifold of all natural language questions in terms of an arbitrary representation (bytes, characters, etc.). Unfortunately, datasets are usually biased samples of the universal set D_u, and they are often biased towards simplicity. As a result, performance on a single set might not be a true representative of our progress. Two chapters of this work are dedicated to the construction of QA datasets.

Figure 6: A hypothetical manifold of all the NLU instances. Static datasets make it easy to evaluate our progress but since they usually give a biased estimate, they limit the scope of the challenge.

There are a few flavors of QA in terms of their answer representations (see Table 2): (i) questions with multiple candidate answers, a subset of which are correct; (ii) extractive questions, where the correct answer is a substring of a given paragraph; (iii) direct-answer questions, for which a system has to generate a string. The choice of answer representation has direct consequences for the representational richness of the dataset and the ease of evaluation. The first two settings (multiple-choice and extractive questions) are easy to evaluate but restrict the richness of the dataset. Direct-answer questions can result in richer datasets but are more expensive to evaluate.
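
The difference in evaluation cost can be made concrete with a small sketch. The scoring functions below are assumptions of this example rather than the metrics used in the thesis: an all-or-nothing match over the selected option labels for multiple-choice questions, and a token-overlap F1 (a commonly used measure for extractive answers) for answer strings.

```python
from collections import Counter
from typing import Set


def multiple_choice_score(predicted: Set[str], correct: Set[str]) -> float:
    """1.0 only if the selected option labels equal the correct subset."""
    return 1.0 if predicted == correct else 0.0


def token_f1(predicted: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Multiple-choice example from Table 2: option (A) is the correct one.
mc_score = multiple_choice_score({"A"}, {"A"})

# Extractive example from Table 2: a partial span receives partial credit.
partial_score = token_f1("Robert", "Robert Curthose")
```

For direct-answer questions, a token-overlap score of this kind is at best a crude proxy, since a correct free-form answer can share few words with the reference; this is one reason such questions are more expensive to evaluate.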

Datasets make it possible to automate the evaluation of progress towards NLU and to compare systems to each other on fixed problem sets. One of the earliest NLU datasets published in the field is the Remedia dataset (Hirschman et al., 1999), which contains short stories written in simple language for kids, provided by Remedia Publications. Each story has 5 types of questions (who, when, why, where, what). Since then, there have been many suggestions as to what kind of question-answering dataset is a better test of NLU.

Multiple-choice

Dirk Diggler was born as Steven Samuel Adams on April 15, 1961 outside of Saint Paul, Minnesota. His parents were a construction worker and a boutique shop owner who attended church every Sunday and believed in God. Looking for a career as a male model, Diggler dropped out of school at age 16 and left home. He was discovered at a falafel stand by Jack Horner. Diggler met his friend, Reed Rothchild, through Horner in 1979 while working on a film.

Question: How old was Dirk when he met his friend Reed?

Answers: *(A) 18 (B) 16 (C) 22

Extractive

The city developed around the Roman settlement Pons Aelius and was named after the castle built in 1080 by Robert Curthose, William the Conqueror’s eldest son. The city grew as an important centre for the wool trade in the 14th century, and later became a major coal mining area. The port developed in the 16th century and, along with the shipyards lower down the River Tyne, was amongst the world’s largest shipbuilding and ship-repairing centres.

Question: Who built a castle in Newcastle in 1080?

Answers: “Robert Curthose”

Direct-answer

Question: Some birds fly south in the fall. This seasonal adaptation is known as migration. Explain why these birds migrate.

Answers: “A(n) bird can migrate, which helps cope with lack of food resources in harsh cold conditions by getting it to a warmer habitat with more food resources.”

Table 2: Various answer representation paradigms in QA systems; examples selected from Khashabi et al. (2018a); Rajpurkar et al. (2016); Clark et al. (2016).

multiple-choice challenge sets that are easy for children but difficult for computers. In a

similar spirit, Clark and Etzioni (2016) advocate elementary-school science tests. Many

science questions have answers that are not explicitly stated in text and instead, require

combining information together. In Chapters 2 and 3, we use elementary-school science tests as

our target challenge.

While the field has produced many datasets in the past few years, many of these datasets

are either too restricted in terms of their linguistic richness or they contain annotation biases (Gururangan et al., 2018; Poliak et al., 2018). For many of these datasets, it has

been pointed out that many of the high-performing models neither need to ‘comprehend’

in order to correctly predict an answer, nor learn to ‘reason’ in a way that generalizes

across datasets (Chen et al., 2016; Jia and Liang, 2017; Kaushik and Lipton, 2018). In

Section 3.4.4 we show that adversarially-selected candidate-answers result in a significant

in Chapters 4 and 5 we propose two new challenge datasets which, we believe, pose better

challenges for systems.

A closely related task is Recognizing Textual Entailment (RTE) (Khashabi et al., 2018c; Dagan et al., 2013), as QA can be cast as entailment (does P entail Q + A? (Bentivogli et al., 2008)). While we do not directly address this task, in some cases we use it as a component within our proposed QA systems (in Chapters 3 and 4).
