CHAPTER 2 : Background and Related Work
2.3 Measuring the progress towards NLU
Evaluation protocols are critical in incentivizing the field to solve the right problems. One
of the earliest proposals is due to Alan Turing: if you had a pen-pal for years, you would
not know whether you are corresponding with a human or a machine (Turing, 1950; Harnad,
1992). A major limitation of this test (and many of its extensions) is that it is “expensive”
to compute (Hernandez-Orallo, 2000; French, 2000).
A more practical protocol is to evaluate understanding by asking questions; if an actor (human or computer) understands a given text, it should be able to answer
any questions about it. Throughout this thesis, we will refer to this protocol as Question
Answering (QA). This protocol has been used in the field for many years (McCarthy, 1976; Winograd, 1972; Lehnert, 1977). There are a few other terms popularized in the community that
refer to the same task. The phrase Reading Comprehension is borrowed from standardized tests (SAT, TOEFL, etc.) and usually refers to the scenario where a paragraph is attached to the given question. Another similar phrase is Machine Comprehension. Throughout this thesis, we use these phrases interchangeably to refer to the
same task.
To make it more formal, for an assumed scenario described by a paragraph P, a system f
equipped with NLU should be able to answer any questions Q about the given paragraph
P. One can measure the expected performance of the system on a set of questions D, via some distance measure d(., .) between the predicted answers f(Q;P) and the correct
answers f∗(Q;P) (usually a prediction agreed upon by multiple humans):
$$ R(f; D) = \mathbb{E}_{(Q,P) \sim D} \Big[ d\big( f(Q;P),\, f^*(Q;P) \big) \Big] $$
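As an illustration, the expected performance above can be estimated on a finite question set by averaging the distance between predicted and reference answers. The following is a minimal sketch, not the evaluation code used in this thesis: the exact-match distance, the tuple layout of an instance, and the names `empirical_risk` and `constant_system` are all assumptions made for the example.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical instance layout: (question Q, paragraph P, reference answer f*(Q;P)).
Instance = Tuple[str, str, str]


def exact_match_distance(predicted: str, reference: str) -> float:
    """Assumed distance d: 0.0 when the normalized strings match, else 1.0."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    return 0.0 if normalize(predicted) == normalize(reference) else 1.0


def empirical_risk(
    system: Callable[[str, str], str],
    dataset: Iterable[Instance],
    distance: Callable[[str, str], float] = exact_match_distance,
) -> float:
    """Average distance between f(Q;P) and f*(Q;P) over a finite dataset."""
    instances = list(dataset)
    if not instances:
        raise ValueError("dataset must contain at least one instance")
    total = sum(distance(system(q, p), ref) for q, p, ref in instances)
    return total / len(instances)


# Toy stand-in for f: a system that ignores its input.
def constant_system(question: str, paragraph: str) -> str:
    return "Robert Curthose"


# Toy question set; the paragraphs are shortened paraphrases for illustration.
toy_dataset = [
    ("Who built a castle in Newcastle in 1080?",
     "The castle was built in 1080 by Robert Curthose.",
     "Robert Curthose"),
    ("How old was Dirk when he met his friend Reed?",
     "Diggler was born in 1961 and met Reed in 1979.",
     "18"),
]

toy_risk = empirical_risk(constant_system, toy_dataset)
```

With this distance, lower values are better; any other choice of d (for example, one minus a token-overlap score) can be passed through the `distance` argument.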
A critical question here is the choice of the question set D so that R(f;D) is an effective measure of f’s progress towards NLU. Denote the set of all possible English questions as D_u. This is an enormous set and, in practice, it is unlikely that we could write them all in one
place. Instead, it might be more practical to sample from this set. In practice, this sampling
is replaced with static datasets. This introduces a problem: datasets are hardly a uniform
subset of D_u; instead, they are heavily skewed towards simplicity.
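To see why a skewed dataset can misreport progress, consider a toy calculation. The sketch below is purely illustrative: the ten difficulty levels, the error model `toy_error`, and the composition of `easy_dataset` are invented numbers, not measurements from any real system or dataset.

```python
from statistics import mean


def toy_error(difficulty: int) -> float:
    """Assumed per-question error of a system; grows with difficulty (0..9)."""
    return difficulty / 10


# Stand-in for the universal question set: one question per difficulty level.
universe = list(range(10))

# Stand-in for a static dataset that over-represents easy questions.
easy_dataset = [0, 0, 1, 1, 2, 3]

risk_universe = mean(toy_error(q) for q in universe)
risk_dataset = mean(toy_error(q) for q in easy_dataset)

print(f"risk on the universal set: {risk_universe:.3f}")
print(f"risk on the skewed dataset: {risk_dataset:.3f}")
```

Under these assumptions the skewed dataset reports a much lower error than the universal set would, so a system can appear closer to NLU than it actually is.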
Figure 6 depicts a hypothetical high-dimensional manifold of all the natural language questions in terms of an arbitrary representation (bytes, characters, etc.). Unfortunately, datasets are usually biased samples of the universal set D_u, and they are often biased towards simplicity.
As a result, performance on a single set might not be truly representative of our progress. Two chapters of this
work are dedicated to the construction of QA datasets.
Figure 6: A hypothetical manifold of all the NLU instances. Static datasets make it easy to evaluate our progress, but since they usually give a biased estimate, they limit the scope of the challenge.
There are a few flavors of QA in terms of their answer representations (see Table 2): (i)
questions with multiple candidate answers, a subset of which are correct; (ii) extractive questions, where the correct answer is a substring of a given paragraph; and (iii) direct-answer
questions, for which a system has to generate a string. The choice
of answer representation has direct consequences for the representational richness of the
dataset and the ease of evaluation. The first two settings (multiple-choice and extractive questions) are easy to evaluate but restrict the richness of the dataset. Direct-answer questions
can result in richer datasets but are more expensive to evaluate.
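The difference in evaluation cost can be made concrete with a small sketch. The functions below are common-practice approximations rather than the metrics used in this thesis: `multiple_choice_correct` assumes that answers are compared as sets of option labels, and `token_f1` is a simple whitespace-token overlap score; both names are hypothetical.

```python
from collections import Counter
from typing import Set


def multiple_choice_correct(predicted: Set[str], reference: Set[str]) -> bool:
    """Multiple-choice: the selected option labels must match the correct ones."""
    return predicted == reference


def token_f1(predicted: str, reference: str) -> float:
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Multiple-choice and extractive answers are cheap to check automatically.
mc_ok = multiple_choice_correct({"A"}, {"A"})
span_score = token_f1("Robert Curthose", "Robert Curthose")

# A reasonable free-form answer can share few tokens with the reference.
reference_answer = (
    "A(n) bird can migrate, which helps cope with lack of food resources "
    "in harsh cold conditions by getting it to a warmer habitat with more "
    "food resources."
)
free_form_score = token_f1(
    "They fly to warmer places where food is easier to find.",
    reference_answer,
)
```

The last score is low even though the free-form answer is arguably acceptable, which is one reason direct-answer questions often need human judgment or more elaborate metrics.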
Datasets make it possible to automate the evaluation of progress towards NLU and
to compare systems to each other on fixed problem sets. One of the earliest NLU datasets published in the field is the Remedia dataset (Hirschman et al., 1999), which
contains short stories written in simple language for kids, provided by Remedia Publications.
Each story has 5 types of questions (who, when, why, where, what). Since then, there have
been many suggestions as to what kind of question-answering dataset is a better test of NLU.
Multiple-choice
Dirk Diggler was born as Steven Samuel Adams on April 15, 1961 outside of Saint Paul, Minnesota. His parents were a construction worker and a boutique shop owner who attended church every Sunday and believed in God. Looking for a career as a male model, Diggler dropped out of school at age 16 and left home. He was discovered at a falafel stand by Jack Horner. Diggler met his friend, Reed Rothchild, through Horner in 1979 while working on a film.
Question: How old was Dirk when he met his friend Reed?
Answers: *(A) 18 (B) 16 (C) 22
Extractive
The city developed around the Roman settlement Pons Aelius and was named after the castle built in 1080 by Robert Curthose, William the Conqueror’s eldest son. The city grew as an important centre for the wool trade in the 14th century, and later became a major coal mining area. The port developed in the 16th century and, along with the shipyards lower down the River Tyne, was amongst the world’s largest shipbuilding and ship-repairing centres.
Question: Who built a castle in Newcastle in 1080?
Answers: “Robert Curthose”
Direct-answer
Question: Some birds fly south in the fall. This seasonal adaptation is known as migration. Explain why these birds migrate.
Answers: “A(n) bird can migrate, which helps cope with lack of food resources in harsh cold conditions by getting it to a warmer habitat with more food resources.”
Table 2: Various answer representation paradigms in QA systems; examples selected from Khashabi et al. (2018a); Rajpurkar et al. (2016); Clark et al. (2016).
multiple-choice challenge sets that are easy for children but difficult for computers. In a
similar spirit, Clark and Etzioni (2016) advocate elementary-school science tests. Many
science questions have answers that are not explicitly stated in text and instead, require
combining multiple pieces of information. In Chapters 2 and 3, we use elementary-school science tests as
our target challenge.
While the field has produced many datasets in the past few years, many of them
are either too restricted in terms of their linguistic richness or contain annotation biases (Gururangan et al., 2018; Poliak et al., 2018). For several of these datasets, it has
been pointed out that many of the high-performing models neither need to ‘comprehend’
in order to correctly predict an answer, nor learn to ‘reason’ in a way that generalizes
across datasets (Chen et al., 2016; Jia and Liang, 2017; Kaushik and Lipton, 2018). In
Section 3.4.4 we show that adversarially-selected candidate answers result in a significant
drop in performance. In Chapters 4 and 5, we propose two new challenge datasets which, we believe, pose better
challenges for systems.
A closely related task is Recognizing Textual Entailment (RTE) (Khashabi
et al., 2018c; Dagan et al., 2013), as QA can be cast as entailment: does P entail Q+A?
(Bentivogli et al., 2008). While we do not directly address this task, in some cases we use it as a component within our proposed QA systems (in Chapters 3 and 4).
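The reduction from QA to entailment can be sketched as follows: each candidate answer A is appended to the question Q to form a hypothesis, and the candidate whose hypothesis is best supported by P is selected. This is only an illustration of the idea, not the systems proposed in later chapters. In particular, `toy_entailment_score` is a crude word-overlap stand-in for a real entailment component, and all function names are hypothetical.

```python
import re
from typing import Callable, List, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Assumed stand-in for an RTE model: fraction of hypothesis tokens in the premise."""
    premise_tokens = set(tokenize(premise))
    hypothesis_tokens = tokenize(hypothesis)
    if not hypothesis_tokens:
        return 0.0
    covered = sum(token in premise_tokens for token in hypothesis_tokens)
    return covered / len(hypothesis_tokens)


def answer_by_entailment(
    paragraph: str,
    question: str,
    candidates: Sequence[str],
    entails: Callable[[str, str], float] = toy_entailment_score,
) -> str:
    """Pick the candidate A for which P best supports the hypothesis Q+A."""
    if not candidates:
        raise ValueError("at least one candidate answer is required")
    return max(candidates, key=lambda a: entails(paragraph, f"{question} {a}"))


# Toy usage; the paragraph is a shortened paraphrase for illustration.
best = answer_by_entailment(
    "The castle was built in 1080 by Robert Curthose.",
    "Who built a castle in Newcastle in 1080?",
    ["Robert Curthose", "William the Conqueror"],
)
```

Passing a trained entailment model through the `entails` argument would replace the overlap heuristic, which is easily misled by distracting words in longer paragraphs.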