Lecture 14:
Question Answering
Wei Xu
QA is very broad
‣ “Question answering” as a term is so broad as to be meaningless
‣ Is P=NP?
‣ What is 4+5?
‣ What is the translation of [sentence] into French? [McCann et al., 2018]
‣ Factoid QA: what states border Mississippi?, when was Barack Obama born?
‣ Lots of this could be handled by QA from a knowledge base, if we had a big enough knowledge base
Classical Question Answering
Q: “where was Barack Obama born”
‣ Form semantic representation from semantic parsing, execute against structured knowledge base
λx. type(x, Location) ∧ born_in(Barack_Obama, x)
(other representations like SQL possible too…; see the sketch below)
‣ How to deal with open-domain data/relations? Need data to learn how to ground every predicate, or need to be able to produce predicates in a zero-shot way
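A minimal sketch (my illustration, not from the lecture) of what “executing against a structured knowledge base” can look like, with the KB as a toy set of triples; all entity and relation names below are invented:

```python
# Toy triple-store KB. The lambda-calculus query above corresponds to:
# find all x with type(x, Location) and born_in(Barack_Obama, x).
KB = {
    ("type", "Honolulu", "Location"),
    ("type", "Barack_Obama", "Person"),
    ("born_in", "Barack_Obama", "Honolulu"),
}

def entities():
    # Every entity mentioned in any triple
    return {a for (_, a, _) in KB} | {b for (_, _, b) in KB}

def answer(person):
    # Enumerate all x satisfying both conjuncts of the logical form
    return [x for x in entities()
            if ("type", x, "Location") in KB
            and ("born_in", person, x) in KB]

print(answer("Barack_Obama"))  # ['Honolulu']
```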
Reading Comprehension
‣ “AI challenge problem”: answer a question given a context
Richardson (2013)
‣ MCTest (2013): 500 passages, 4 questions per passage
‣ Two questions per passage explicitly require cross-sentence reasoning
‣ Recognizing Textual Entailment (2006)
Dataset Explosion
‣ 10+ QA datasets released since 2015
‣ Question answering: questions are in natural language
‣ Answers: multiple choice or require picking from the passage
‣ Require human annotation
‣ “Cloze” task: a word (often an entity) is removed from a sentence
‣ Answers: multiple choice, pick from passage, or pick from vocabulary
‣ Can be created automatically from things that aren’t questions (sketch below)
‣ Children’s Book Test, CNN/Daily Mail, SQuAD, TriviaQA are most well-known (others: SearchQA, MS Marco, RACE, WikiHop, …)
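As a concrete illustration of automatic cloze creation (my sketch, not from the lecture): blank out a later mention of an entity, and the removed word becomes the answer.

```python
# A minimal sketch of building a cloze example automatically from raw
# text: replace the last mention of an entity with a placeholder.
def make_cloze(passage: str, entity: str):
    """Return (query, answer): query has the last mention of `entity`
    replaced by @placeholder; the removed entity is the answer."""
    idx = passage.rfind(entity)
    assert idx > 0, "need a later mention to blank out"
    query = passage[:idx] + "@placeholder" + passage[idx + len(entity):]
    return query, entity

text = "Mary visited England. While there, Mary toured London."
print(make_cloze(text, "Mary"))
# ('Mary visited England. While there, @placeholder toured London.', 'Mary')
```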
Children’s Book Test
Hill et al. (2015)
‣ Children’s Book Test: take a section of a children’s story, block out an entity and predict it (one-doc multi-sentence cloze task)
bAbI
‣ Evaluation on 20 tasks proposed as building blocks for building “AI-complete” systems
‣ Various levels of difficulty, exhibit different linguistic phenomena
Weston et al. (2014)
Dataset Properties
‣ Axis 1: QA vs. cloze (Children’s Book Test)
‣ Axis 2: single-sentence vs. passage
‣ Often shallow methods work well because most answers are in a single sentence (SQuAD, MCTest)
‣ Some explicitly require linking between multiple sentences (MCTest)
‣ Axis 3: single-document (datasets in this lecture) vs. multi-document (TriviaQA, WikiHop, HotPotQA, …)
Memory Networks
‣ Memory networks let you reference input with attention
‣ Encode input items into two vectors: a key and a value
‣ Keys compute attention weights given a query, weighted sum of values gives the output
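A minimal numpy sketch of one such key-value attention step (my simplification; the papers use learned embedding matrices to produce keys and values):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(query, keys, values):
    """query: (d,); keys, values: (n_items, d). Returns a (d,) output."""
    scores = keys @ query      # relevance of each memory item to the query
    alpha = softmax(scores)    # attention weights over memory items
    return alpha @ values      # weighted sum of value vectors

d, n = 8, 5
rng = np.random.default_rng(0)
u = rng.normal(size=d)                                 # query vector
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
o = memory_hop(u, K, V)                                # attention readout
```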
Memory Networks
‣ Three layers of memory network where the query representation is updated additively based on the memories at each step
Sukhbaatar et al. (2015)
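Continuing the sketch above, the additive multi-hop update just folds each hop’s readout back into the query (illustrative; it matches the description here rather than the paper’s exact layer-tying scheme):

```python
# Three hops: each readout is added to the query before the next hop
for _ in range(3):
    u = u + memory_hop(u, K, V)
```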
‣ How to encode the sentences?
‣ Bag of words (average embeddings)
‣ Positional encoding: multiply each word by a vector capturing its position in the sentence
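A sketch of the position-encoding variant, following the l_kj = (1 − j/J) − (k/d)(1 − 2j/J) weighting from Sukhbaatar et al. (2015); treat the exact formula as my reconstruction:

```python
import numpy as np

def encode_sentence(word_embs):
    """word_embs: (J, d) embeddings for a J-word sentence.
    Elementwise-scale each word by a position-dependent vector, then sum,
    so the sentence vector is no longer order-invariant."""
    J, d = word_embs.shape
    j = np.arange(1, J + 1)[:, None]    # word positions 1..J
    k = np.arange(1, d + 1)[None, :]    # embedding dimensions 1..d
    l = (1 - j / J) - (k / d) * (1 - 2 * j / J)   # (J, d) position weights
    return (l * word_embs).sum(axis=0)            # (d,) sentence vector
```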
Evaluation: bAbI
‣ 3-hop memory network does pretty well, better than an LSTM at processing these types of examples
Evaluation: Children’s Book Test
‣ Outperforms LSTMs substantially
Memory Network Takeaways
‣ Memory networks provide a way of attending to abstractions over the input
‣ Useful for cloze tasks where far-back context is necessary
‣ What can we do with more basic attention?
CNN/Daily Mail
Hermann et al. (2015), Chen et al. (2016)
‣ Single-document, (usually) single-sentence cloze task
‣ Formed based on article summaries; information should mostly be present, which makes it easier than Children’s Book Test
‣ Need to process the question, can’t just use LSTM LMs
CNN/Daily Mail
Hermann et al. (2015), Chen et al. (2016)
X visited England ||| Mary visited England
‣ LSTM reader: encode question, encode passage, predict entity
‣ Can also use textual entailment-like models: does “Mary visited England” support “X visited England” with X = Mary?
Multiclass classification problem over entities in the document
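A minimal PyTorch sketch of the LSTM-reader setup (my simplification of the architecture; sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class LSTMReader(nn.Module):
    def __init__(self, vocab_size, num_entities, d=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d)
        self.q_enc = nn.LSTM(d, d, batch_first=True)
        self.p_enc = nn.LSTM(d, d, batch_first=True)
        self.out = nn.Linear(2 * d, num_entities)  # score each entity id

    def forward(self, question_ids, passage_ids):
        _, (q, _) = self.q_enc(self.emb(question_ids))  # final question state
        _, (p, _) = self.p_enc(self.emb(passage_ids))   # final passage state
        return self.out(torch.cat([q[-1], p[-1]], dim=-1))  # (batch, num_entities)

model = LSTMReader(vocab_size=10000, num_entities=500)
logits = model(torch.randint(0, 10000, (1, 12)),   # toy question
               torch.randint(0, 10000, (1, 80)))   # toy passage
```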
CNN/Daily Mail
Hermann et al. (2015)
‣ Attentive reader:
  u = encode(query)
  s = encode(sentence)
  r = attention(u → s)
  prediction = f(candidate, u, r)
‣ Uses fixed-size representations for the final prediction, multi-class classification
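The attention step above as a sketch (my simplification; `W_entities` is a hypothetical candidate-scoring matrix, not a name from the paper):

```python
import torch
import torch.nn.functional as F

def attentive_read(u, H, W_entities):
    """u: (d,) query encoding; H: (T, d) passage token states;
    W_entities: (num_entities, 2d) candidate scoring weights."""
    scores = H @ u                    # (T,) relevance of each token to query
    alpha = F.softmax(scores, dim=0)  # attention weights over tokens
    r = alpha @ H                     # (d,) attention-weighted passage summary
    return W_entities @ torch.cat([u, r])  # logits over candidate entities
```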
CNN/Daily Mail
‣ Chen et al. (2016): small changes to the attentive reader
[Results table: Stanford Attentive Reader accuracies 76.2 / 76.5 / 79.5 / 78.7]
‣ Additional analysis of the task found that many of the remaining questions were unanswerable or extremely difficult
SQuAD
‣ Single-document, single-sentence question-answering task where the answer is always a substring of the passage
Rajpurkar et al. (2016)
SQuAD
What was Marie Curie the first female recipient of?
Rajpurkar et al. (2016)
first female recipient of the Nobel Prize . (START and END markers select the answer span)
‣ Like a tagging problem over the sentence (not multi-class classification), but we need some way of attending to the query
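A sketch of span prediction as token-level scoring (my simplification; real systems batch this and vectorize the span search):

```python
import torch
import torch.nn.functional as F

def best_span(H, w_start, w_end, max_len=15):
    """H: (T, d) passage token encodings; w_start, w_end: (d,) scorers.
    Score every token as a potential start and end, then take the best
    legal span (start <= end, bounded length)."""
    p_start = F.log_softmax(H @ w_start, dim=0)   # (T,) start log-probs
    p_end = F.log_softmax(H @ w_end, dim=0)       # (T,) end log-probs
    best, span = -float("inf"), (0, 0)
    for i in range(len(H)):
        for j in range(i, min(i + max_len, len(H))):
            if p_start[i] + p_end[j] > best:
                best, span = (p_start[i] + p_end[j]).item(), (i, j)
    return span
```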
Bidirectional Attention Flow
‣ Passage (context) and query are both encoded with BiLSTMs
‣ Context-to-query attention: compute softmax over columns of S, take weighted sum of u based on attention weights for each passage word
Seo et al. (2016)
passage H = [h_1, …, h_T]; query U = [u_1, …, u_J]
S_ij = h_i · u_j
α_ij = softmax_j(S_ij)  ‣ dist over query words
ũ_i = Σ_j α_ij u_j  ‣ query “specialized” to each passage word
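The same computation as a numpy sketch (dot-product similarity for S_ij; BiDAF’s actual similarity function is learned):

```python
import numpy as np

def context_to_query(H, U):
    """H: (T, d) passage encodings; U: (J, d) query encodings."""
    S = H @ U.T                                   # (T, J) similarity matrix
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)          # softmax over query words
    return A @ U                                  # (T, d): one ũ_i per passage word
```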
Bidirectional Attention Flow
Seo et al. (2016)
Each passage word now “knows about” the query
QA with BERT
Devlin et al. (2019)
What was Marie Curie the first female recipient of ? [SEP] One of the most famous people born in Warsaw was Marie …
‣ Predict start and end positions in passage
‣ No need for cross-attention mechanisms!
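A sketch of running extractive QA with a BERT-style model, assuming the HuggingFace transformers library and a publicly available SQuAD-fine-tuned checkpoint:

```python
from transformers import pipeline

# Any SQuAD-fine-tuned extractive QA checkpoint works here
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")
result = qa(question="What was Marie Curie the first female recipient of?",
            context="One of the most famous people born in Warsaw was Marie "
                    "Curie, the first female recipient of the Nobel Prize.")
print(result["answer"])  # e.g. "the Nobel Prize"
```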
SQuAD SOTA: 2018
‣ nlnet, QANet, r-net: dueling super-complex systems (much more than BiDAF…)
‣ BERT: transformer-based approach with pretraining on 3B tokens
SQuAD 2.0 SOTA: Spring 2019
‣ Industry contest
‣ SQuAD 2.0: harder dataset because some questions are unanswerable
SQuAD 2.0 SOTA: Fall 2019
SQuAD 2.0 SOTA: Today
‣ Performance is very saturated
‣ Harder QA settings are needed!
TriviaQA
Joshi et al. (2017)
‣ Totally figuring this out is very challenging
‣ Coref: “the failed campaign” ↔ “movie of the same name”
‣ Lots of surface clues: 1961, campaign, etc.
‣ Systems can do well without really understanding the text
What are these models learning?
‣ “Who…”: knows to look for people
‣ “Which film…”: can identify movies and then spot keywords that are related to the question
‣ Unless questions are made super tricky (target closely-related entities who are easily confused), they’re usually not so hard to answer
Takeaways
‣ Memory networks let you reference input in an attention-like way, useful for generalizing language models to long-range reasoning
‣ Complex attention schemes can match queries against input texts and identify answers
‣ Many flavors of reading comprehension tasks: cloze or actual questions, single or multi-sentence