Answer Selection in Question/Answering
Hugo Patinho Rodrigues
Dissertation for the Degree of Master of
Information Systems and Computer Engineering
Jury
President: Prof. Doctor Joaquim Armando Pires Jorge
Supervisor: Prof. Doctor Maria Luísa Torres Ribeiro Marques da Silva Coheur
Member: Prof. Doctor Bruno Emanuel da Graça Martins
July 2012
Resumo
This thesis addresses the problem of answer selection in Question Answering systems. Answer selection is one of the main stages of these systems; its goal is to choose the answer to return from a set of candidates. To this end we propose AnSelMo (ANSwering SELection MOdule), an answer selection module. Its approach is based on the context in which the candidate answer appears: it measures the distances between question terms and answer terms, and uses similarity measures to compare passages related to the question with passages related to the answer. Another approach explored is based on semantic spaces, using Latent Semantic Analysis.
AnSelMo was tested in three distinct scenarios: with data from ‘Quem Quer Ser Milionário’ (WWBM), the famous multiple-choice quiz show; in the context of the joint evaluation Question Answering for Machine Reading Evaluation (QA4MRE), a reading comprehension task of the Cross Language Evaluation Forum; and in Just.Ask, the L2F Question Answering system. The results for WWBM surpass the state of the art, while for QA4MRE they are better than most of those obtained by the systems participating in 2011. We also improved the accuracy of Just.Ask by integrating AnSelMo.
Abstract
This thesis addresses the problem of answer selection in Question Answering (QA) systems. Answer selection is one of the main steps of those systems; its goal is to choose the answer to be returned from a set of candidate answers. To that end, we propose AnSelMo, an ANSwering SELection MOdule. Its approach is based on the context in which candidate answers appear: it measures distances between question and answer terms, and uses similarity measures to compare passages related to the question with passages related to the answer. Another approach explored is based on Semantic Spaces, using Latent Semantic Analysis.
AnSelMo was tested in three different scenarios: in ‘Who Wants to Be Millionaire?’ (WWBM), the famous multiple-answer question contest; in Question Answering for Machine Reading Evaluation (QA4MRE), a Cross Language Evaluation Forum reading comprehension task; and with Just.Ask, the L2F QA system. Results for WWBM surpass the state of the art, while for QA4MRE results are better than most of the 2011 competing systems’ results. We were also able to improve the accuracy of Just.Ask by integrating AnSelMo into it.
Keywords
Question Answering, Answer Selection, Context, Word Proximity, Similarity Measures, Semantic Spaces
Acknowledgements
First of all, I want to thank my advisor, Prof. Luísa Coheur, for the opportunity to work with her on such an amazing topic, and for providing me not only with a fantastic work environment, but also with the motivation and fruitful advice that made this work possible.
I also want to thank everyone at L2F, especially Ana Mendes, Ricardo Ribeiro and David Matos, for their help, ideas and enlightening discussions.
I am also really grateful to my family, who made me who I am today, for their tireless support.
I could not forget my friends, namely André Carvalho, Cátia Pereira and Diogo Godinho, who were so important throughout these years and were always there when I needed them.
And last, but not least, special thanks to my friend Pedro Mota, who was my companion during this journey, put up with me when anything went wrong and gave me his insight in all matters.
Lisboa, July 2012 Hugo Rodrigues
To my grandparents,
António, Maria do Carmo,
Maria de Lurdes and Vasco
Contents
1 Introduction
1.1 Motivation
1.2 Goals
1.3 Approach
1.4 Contributions
1.5 Structure of this Document
2 Related Work
2.1 Answer Selection in Question Answering
2.1.1 Pattern-based approach
2.1.2 Knowledge-based approach
2.1.3 Noisy-Channel-based approach
2.1.4 N-grams approach
2.1.5 Large-Scale Question Answering
2.2 ‘Who Wants to Be Millionaire?’
2.2.1 The approach by Clarke et al.
2.2.2 The approach by Lam et al.
2.3 Question Answering for Machine Reading Evaluation
2.4 Just.Ask
2.5 Latent Semantic Analysis
2.6 Information Sources
2.6.1 World Wide Web as an information source
2.6.2 (Semi-)Structured data sources
2.7 Summary
3 AnSelMo
3.1 Architecture
3.2 Pre-Processing
3.3 Counting
3.4 Lexical Approaches
3.4.1 Word Proximity
3.4.2 Similarity Measures
3.5 Latent Semantic Analysis
3.6 Scoring
3.7 Summary
4 Evaluation
4.1 Experimental Setup
4.1.1 Data Sets
4.1.1.1 ‘Who Wants to Be Millionaire?’
4.1.1.2 Question Answering for Machine Reading Evaluation
4.1.1.3 Just.Ask
4.1.1.4 Discussion
4.1.2 Evaluation Measures
4.2 Counting
4.3 Word Proximity
4.3.1 ‘Who Wants to Be Millionaire?’
4.3.2 Question Answering for Machine Reading Evaluation
4.3.3 Just.Ask
4.3.4 Discussion
4.4 Similarity Measures
4.4.1 ‘Who Wants to Be Millionaire?’
4.4.2 Question Answering for Machine Reading Evaluation
4.4.3 Discussion
4.5 Latent Semantic Analysis
4.5.1 ‘Who Wants to Be Millionaire?’
4.5.2 Question Answering for Machine Reading Evaluation
4.5.3 Discussion
4.6 Discussion
5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
A Counting Results
B Word Proximity Results
C Extents Results
D LSA Results
List of Figures
1.1 QA system typical architecture.
2.1 Algorithm 1 performance for three 90-questions runs, with different radius. The last point corresponds to the baseline approach.
2.2 Application of SVD to matrix A.
3.1 AnSelMo’s Architecture.
3.2 Score functions used in Word Proximity.
List of Tables
3.1 Scores for each Part of Speech (POS) tag.
4.1 WWBM corpora characteristics.
4.2 QA4MRE corpus characteristics.
4.3 Just.Ask corpus characteristics.
4.4 Comparison of the different corpora average statistics.
4.5 Best results accomplished with Counting, for WWBM English corpus. The whole table may be consulted in annex (Tables A.1 and A.2).
4.6 Best results accomplished with Counting, for WWBM Portuguese corpus. The whole table may be consulted in annex (Tables A.3 and A.4).
4.7 Best results accomplished with Word Proximity, for WWBM English corpus, using Mean as combination method and Linear as scoring algorithm. The whole table is in annex (Table B.1).
4.8 Parameterizations used in further tests to WWBM scenario.
4.9 Best results accomplished with Word Proximity, for WWBM English corpus, using Mean as combination method and Linear as scoring algorithm, for different radius values. The whole table is in annex (Table B.2).
4.10 Results with Word Proximity, for WWBM English corpus, using different combination methods and scoring algorithms.
4.11 Best results accomplished with Word Proximity, for WWBM Portuguese corpus, using Mean as combination method and Linear as scoring algorithm, for different radius values.
4.12 Results with Word Proximity, for WWBM Portuguese corpus, using different combination methods and scoring algorithms.
4.13 Best results accomplished with Word Proximity, for QA4MRE 2011 corpus, using Linear as scoring algorithm, for different radius values. Last columns contain runs using Lucene, indexed with and without the Background Collection.
4.14 Results accomplished with Word Proximity, for QA4MRE 2011 corpus, using different scoring algorithms, for 20 and 50 as radius values.
4.15 Best results accomplished with Word Proximity, for Just.Ask corpus, using AnSelMo as a full system.
4.16 Best results accomplished with Word Proximity, for Just.Ask corpus, using AnSelMo as an answering selection module.
4.17 Best results accomplished with Similarity Measures, for WWBM English corpus, using Extents and different similarity measures. The whole table may be consulted in annex (Table C.1).
4.18 Best results accomplished with Similarity Measures, for WWBM English corpus, using Extents Points and different similarity measures. The whole table may be consulted in annex (Table C.2).
4.19 Best results accomplished with Similarity Measures, for WWBM Portuguese corpus, using Extents and different similarity measures. The whole table may be consulted in annex (Table C.3).
4.20 Best results accomplished with Similarity Measures, for WWBM Portuguese corpus, using Extents and different similarity measures. The whole table may be consulted in annex (Table C.4).
4.21 Results accomplished with Similarity Measures, for QA4MRE 2011 corpus, using different similarity measures and both Extents and Extents Points.
4.22 Best results accomplished with LSA, for WWBM English corpus. The whole table may be consulted in annex (Table D.1).
4.23 Best results accomplished with LSA, for WWBM Portuguese corpus. The whole table may be consulted in annex (Table D.2).
4.24 Results accomplished with LSA, for QA4MRE 2011 corpus.
4.25 Results accomplished with LSA, for QA4MRE 2011 corpus, expanding the document with passages from Lucene.
4.26 Best results accomplished for all scenarios, with all techniques. The presented values are accuracies.
A.1 Results for Counting strategy, using Bing, for English WWBM corpus.
A.2 Results for Counting strategy, using Blekko, for English WWBM corpus.
A.3 Results for Counting strategy, using Bing, for Portuguese WWBM corpus.
A.4 Results for Counting strategy, using Blekko, for Portuguese WWBM corpus.
B.1 Results for Word Proximity strategy, for English WWBM corpus.
B.2 Results for Word Proximity strategy, for English WWBM corpus, with different radius values.
C.1 Results for Extents strategy, for English WWBM corpus.
C.2 Results for Extents Points strategy, for English WWBM corpus.
C.3 Results for Extents strategy, for Portuguese WWBM corpus.
C.4 Results for Extents Points strategy, for Portuguese WWBM corpus.
D.1 Results for LSA, for English WWBM corpus.
D.2 Results for LSA, for Portuguese WWBM corpus.
Acronyms
QA Question Answering
WWBM ‘Who Wants to Be Millionaire?’
QA4MRE Question Answering for Machine Reading Evaluation
LSA Latent Semantic Analysis
SVD Singular Value Decomposition
MRR Mean Reciprocal Rank
1 Introduction
1.1 Motivation
Question Answering (QA) systems try to answer questions posed in natural language, in contrast to search engines that return a set of related documents given some keywords. This is challenging because natural language is a powerful tool and questions may be formulated in different forms. Also, returning a single and concise answer is a much harder problem than returning a set of more or less related documents.
QA has been studied for several decades, and most systems have settled on the architecture depicted in Figure 1.1.
Figure 1.1: QA system typical architecture.
First, the question is processed and interpreted (Question Interpretation step). This step produces one or more queries, which are passed on to the Passage Retrieval step. Here, the system tries to find information related to the question. For this purpose, it can use pre-defined corpora, local databases, or the web. Finally, from the fetched snippets or documents, the system extracts possible answers and ranks them, selecting one or more as correct answers (Answer Extraction/Selection step).
To exemplify the process, consider the following question: What is the capital of Portugal? The first step could generate queries like ‘capital Portugal’, ‘Portugal’s capital is’ and/or ‘capital(Portugal, x?)’, depending on the target information source. The following step would retrieve data from corpora, databases or the web, gathering a set of snippets or documents. Continuing with the same example, after the answer extraction step a system could have the following set of answer candidates: {Lisboa, Guimarães, Lisbon, Lisbona, Porto}. It would then try to select one or more candidates as correct answers.
The focus of this thesis is on the last task: answer selection. This work explores this particular step where, given a set of candidate answers, which may or may not be related to each other, we have to choose one or more correct answers.
This problem, as explained, is present in several QA systems, such as Just.Ask [1, 47, 28], the L2F QA system. Just.Ask already addresses the problem of relating candidates [29]: it joins related candidates into a single cluster, which then represents a single answer. There is, however, more to exploit. Candidates are extracted, according to their type, from the snippets returned by Bing, but Just.Ask does not take into consideration the context in which each candidate appears. Returning to the previous example, Porto comes from ‘Porto was chosen, in Portugal, as the European Capital of Culture’, while Lisboa comes from ‘Lisboa, capital of Portugal’. Given the context, one could rule out the first candidate in favor of the second based on the distance between the question keywords, capital and Portugal, and each candidate answer.
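The intuition in this example can be illustrated with a small sketch. It is not the scoring used by Just.Ask or AnSelMo; it only assumes a simple word tokenizer and measures, for each snippet, the mean token distance between the candidate and the nearest occurrence of each question keyword. The names `tokenize` and `mean_keyword_distance` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def mean_keyword_distance(snippet, candidate, keywords):
    """Mean distance, in tokens, from the candidate to each keyword.

    For every keyword, the closest (candidate, keyword) pair of positions
    is used. Returns None if the candidate or a keyword is missing.
    """
    tokens = tokenize(snippet)
    cand_pos = [i for i, tok in enumerate(tokens) if tok == candidate.lower()]
    if not cand_pos:
        return None
    distances = []
    for keyword in keywords:
        kw_pos = [i for i, tok in enumerate(tokens) if tok == keyword.lower()]
        if not kw_pos:
            return None
        distances.append(min(abs(c - k) for c in cand_pos for k in kw_pos))
    return sum(distances) / len(distances)


keywords = ["capital", "Portugal"]
porto = mean_keyword_distance(
    "Porto was chosen, in Portugal, as the European Capital of Culture",
    "Porto", keywords)
lisboa = mean_keyword_distance("Lisboa, capital of Portugal", "Lisboa", keywords)
print(porto, lisboa)  # the smaller value indicates the closer context
```

Under this toy measure, Lisboa sits closer to the question keywords than Porto, which matches the preference argued for above.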
This problem can also be found in other scenarios in the Natural Language Processing (NLP) field. For example, Question Answering for Machine Reading Evaluation (QA4MRE), a task introduced at the Cross Language Evaluation Forum (CLEF) in 2011 [34], aims at evaluating a system’s reading capability through multiple-answer questions. Briefly, given a text and some questions about it, competing systems have to choose the correct answer to each question among five candidates, showing in this way their level of “comprehension” of the text. The information needed to choose correctly among the candidates can be found in the given texts. The task also provides a Background Collection, which consists of a collection of documents that can help systems answer the questions.
Another similar scenario, already studied by the NLP community, is ‘Who Wants to Be Millionaire?’ (WWBM), the famous multiple-answer question contest. In this case, contestants face different questions, each with four possible answers.
These two tasks can be distinguished from the answer selection problem in QA for two reasons: (1) they always have a correct candidate answer; and (2) the candidate answers are not related, as opposed to the candidate answers extracted, for instance, by Just.Ask. Thus, these scenarios can be seen as a subproblem of the answer selection problem, as there are at most five unrelated candidates, of which one and only one is the correct answer.
However, these two are also distinct scenarios, due to their nature and complexity (QA4MRE questions tend to be trickier, as it has a Machine Reading (MR) task associated with it).
1.2 Goals
In summary, this dissertation aims at:
• Presenting the current state of the art on answer selection;
• Proposing techniques that consider the context of candidate answers to select them;
• Developing an answer selection module based on those techniques;
• Applying those techniques to the Just.Ask, WWBM and QA4MRE scenarios and evaluating their impact.
1.3 Approach
We follow two different approaches to the answer selection task: a Lexical Distance-based approach and a Semantic Spaces approach. The first is based on two works from the beginning of the century, applied to WWBM, that we recover, modify and extend for this purpose. Both are directly related to the context of the candidate answer, and use different techniques to measure it. The second one is based on semantic spaces, and tries a different approach to the problem. The idea is to discover latent topics from documents, and associate the question and candidate answers with those topics, distinguishing them in the process.
1.4 Contributions
The work of this thesis led to the following contributions:
• A survey and analysis of the current state of the art on answer selection;
• The development of a set of techniques that consider the context of candidate answers, from where they are extracted, in order to select them;
• An answer selection module, AnSelMo, based on those techniques;
• The study of applying AnSelMo to three different scenarios: WWBM, QA4MRE and Just.Ask.
This work also resulted in a paper entitled 2B$ – Testing Past Algorithms in Nowadays Web, accepted at TSD 2012, the 15th International Conference on Text, Speech and Dialogue, which will take place in Brno, Czech Republic, on September 3–7, 2012.
This work also allowed L2F to participate in the 2012 QA4MRE track.
1.5 Structure of this Document
This document is structured as follows: in Chapter 2 we present related work; Chapter 3 describes our system and its techniques, followed by the experiments performed, detailed in Chapter 4. Finally, in Chapter 5, we present the conclusions of this work and point to some future work.
2 Related Work
In this chapter we present the related work. We first analyze the problem of answer selection in QA systems (Section 2.1), followed by two works applied to the WWBM problem (Section 2.2). In Section 2.3 we describe the systems that participated in the 2011 QA4MRE track, in Section 2.4 we present Just.Ask and in Section 2.5 we give some insight into Latent Semantic Analysis (LSA). Finally, we describe some information sources and end with a summary of the chapter.
2.1 Answer Selection in Question Answering
This section describes, in separate subsections, the different approaches to answer selection that we studied.
2.1.1 Pattern-based approach
Some systems are based on patterns, and in such systems it is sometimes hard to distinguish the extraction step from the answer selection step. Gonzalez et al. [15] developed a system to participate in CLEF 2006 that uses regular expressions to extract candidate answers from previously gathered passages. These patterns are designed to retrieve the answer depending on the question type. Answers are then described by a set of 17 attributes. These attributes may represent the length of the questions, the similarity between the question and the candidate answer (by the number of common lemmas, named entities, etc.) and the redundancy of the candidate answer across all passages. They are used by a Naïve Bayes classifier, which selects the candidate answer with the highest probability of being correct. CINDI QA [16], from the CINDI group, also participated in CLEF (2007) and uses a similar approach. Answers are extracted from what the authors call templates, which are no more than pre-defined lexical patterns. Raposa [45], a QA system for the Portuguese language, also uses patterns to extract the candidate answers. Although the authors refer to the patterns as “a set of context evaluation rules”, they are no more than strict patterns that are applied depending on the question type.
There are also other examples of systems that use patterns to extract candidate answers, such as those of Kupiec [21], Soubbotin [49] and Fleischman et al. [13]. Hearst [18] also used patterns, but in a different task: to automatically acquire WordNet [30] relations [17]. In fact, Fleischman et al. [13] report that their patterns occur 40 times more often than those employed by Hearst. Joho and Sanderson [20] also based their work on Hearst’s to answer definition questions by applying similar patterns. An example of those patterns is ‘<name>, such as <hyponym1> or <hyponym2>’, which identifies hyponyms of a given word.
Patterns are also used by Ravichandran and Hovy [39]. However, the applied patterns are automatically extracted (in fact, a large part of the patterns applied by Hearst are discovered automatically too). To do this, the authors learn the patterns from a seed. A seed is a sample containing the question term (for example ‘Mozart’) and the correct answer (‘1756’), for the given question category (in this case ‘birthyear’). The pair is submitted to a search engine and the top N results containing both terms are used to extract a pattern able to match the question term and answer. Those patterns are then generalized, by replacing the question term and the answer with the <NAME> and <ANSWER> tags, respectively. In this example, patterns like ‘born in <ANSWER> , <NAME>’, ‘<NAME> was born on <ANSWER>’, ‘<NAME> ( <ANSWER> -’ and ‘<NAME> ( <ANSWER> - )’ could be extracted. The process is repeated with other seed pairs, thus learning more patterns and refining the existing ones. After this, new queries are created from the seeds with only the question terms (without the answer this time) and the obtained patterns are used to extract the answers. The answers retrieved with the patterns are then compared with the expected answer. The ratio of correctly extracted answers becomes the precision of the pattern that produced them. These values represent the probability that each pattern finds the correct answer and are used later to decide which candidate answers are returned as the correct ones.
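The generalization and precision steps described for Ravichandran and Hovy [39] can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, answers are assumed to be single word tokens, and the snippets are made up for the example rather than retrieved from a search engine.

```python
import re


def generalize(sentence, name, answer):
    """Turn a seed sentence into a pattern with <NAME> and <ANSWER> tags."""
    return sentence.replace(name, "<NAME>").replace(answer, "<ANSWER>")


def pattern_to_regex(pattern, name):
    """Instantiate <NAME> and make <ANSWER> a capture group (one word token)."""
    escaped = re.escape(pattern)
    escaped = escaped.replace(re.escape("<NAME>"), re.escape(name))
    escaped = escaped.replace(re.escape("<ANSWER>"), r"(\w+)")
    return re.compile(escaped)


def pattern_precision(pattern, examples):
    """Fraction of pattern matches whose extracted answer is the expected one.

    `examples` is a list of (name, expected_answer, snippets) triples.
    """
    matched = correct = 0
    for name, expected, snippets in examples:
        regex = pattern_to_regex(pattern, name)
        for snippet in snippets:
            hit = regex.search(snippet)
            if hit:
                matched += 1
                correct += hit.group(1) == expected
    return correct / matched if matched else 0.0


# Seed pair from the text; the seed sentence itself is illustrative.
pattern = generalize("Mozart was born in 1756", "Mozart", "1756")

# Made-up snippets used only to exercise the precision computation.
examples = [
    ("Gandhi", "1869", ["Gandhi was born in 1869 in Porbandar.",
                        "Gandhi was born in India."]),
    ("Einstein", "1879", ["Einstein was born in 1879."]),
]
precision = pattern_precision(pattern, examples)
print(pattern, precision)
```

The second Gandhi snippet shows why precision matters: the pattern matches but extracts a place instead of a year, which lowers the confidence assigned to that pattern.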
Sun et al. [53] use a similar strategy to extract answers: a statistical approach for pattern matching. However, unlike the previous approach, which relies on the patterns’ confidence, their system calculates the similarity between the question and the snippet containing the candidate answer. The system allows two different kinds of matching: a strict one, where the stems from both sentences must be the same, and a flexible one, which relates words with the help of WordNet. Matches are weighted and summed, yielding a score for the given answer.
Priberam’s QA system [7] also uses patterns to extract candidate answers. However, it uses an answer validation module to ensure the correctness of the answers. This is done by applying some sanity checking techniques, such as matching of named entities (similar to the previous matching technique). The idea is to match the named entities present in the snippet containing the candidate answer with those in the question. In case of failure, the candidate is discarded. Rodrigo et al. [41] use a similar approach in the 2007 Answer Validation Exercise (AVE) at CLEF. In AVE, systems are required to decide whether an answer from a QA system is correct, labeling it as Validated, Selected or Rejected [42].
2.1.2 Knowledge-based approach
Another strategy sometimes applied is known as model-based answer selection and can be seen as a knowledge-based approach. The idea suggested by Sinha and Narayanan [48] is to model the relations between events, entities and their properties, and then reason about them to reach the correct answer. The system is able to parse documents and extract properties, relating them to those from the question. As an example, consider the question Does Pakistan possess the technological infrastructure to produce biological weapons? The system extracts two relations, possess(Pakistan, technologicalInfrastructure) and produce(Pakistan, biologicalWeapons), and later consults an ontology, searching for the preconditions necessary for a country to produce biological weapons, what expertise is needed and whether Pakistan possesses it.
2.1.3 Noisy-Channel-based approach
The two techniques presented before (pattern-based and knowledge-based) are described by Echihabi et al. [11], together with a noisy-channel-based approach to answer selection. Noisy-channel models are commonly used for error correction. Given a word, which may have some letters scrambled, missing or added, the model gives the probability that the given word corresponds to some other word (belonging to an allowed lexicon). The authors’ proposal is to consider sentences instead of words and, thus, words instead of letters. Given a sentence obtained from an information retrieval system, one can calculate the probability of transforming that sentence into the original question. The trained model gives this probability, and the answer is extracted from the snippet whose probability is highest.
2.1.4 N-grams approach
Aranea [26, 25] and OpenEphyra [46] use similar strategies. Aranea generates all n-grams of terms, from unigrams to tetragrams, from the retrieved passages. These n-grams are the candidate answers, and each receives an initial score that depends on the query that produced it. Likewise, candidate answers extracted by patterns in OpenEphyra are given an initial score. Identical candidates (or n-grams, in the case of Aranea) are then merged, summing their scores. Although this process gives more weight to popular answers, it is able to filter out some candidates resulting from malformed documents and incorrect information, as they should be rarer. Even though the previous step already provides some filtering, a dedicated filtering step follows. Candidates are filtered using heuristics, such as whether the candidate matches the expected answer type.
Because Aranea deals with n-grams, another combining step is performed. The remaining candidates are now combined taking into account the overlap between the various n-grams. If a candidate answer is a portion of another one, then the two are merged, again summing their scores. Longer answers are preferred, which counteracts the weight given by the previous combination step to shorter n-grams. Finally, Aranea multiplies candidate scores by a factor that considers the terms present in the answer.
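The two combination steps described for Aranea can be sketched roughly as follows. This is a simplified illustration, not Aranea's actual code: it assumes whitespace tokenization, a uniform initial score per passage unless weights are given, and it interprets the second merge as adding to each n-gram the scores of the shorter n-grams it contains. The filtering step and the final term-based factor are omitted, and all function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n=4):
    """Yield all n-grams, from unigrams up to max_n-grams, as tuples."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def vote(passages, passage_scores=None, max_n=4):
    """First combination step: identical n-grams have their scores summed."""
    scores = Counter()
    for idx, passage in enumerate(passages):
        weight = 1.0 if passage_scores is None else passage_scores[idx]
        for gram in ngrams(passage.lower().split(), max_n):
            scores[gram] += weight
    return scores


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))


def merge_subsumed(scores):
    """Second step (assumed reading): longer n-grams absorb contained ones."""
    merged = {}
    for gram, score in scores.items():
        bonus = sum(s for other, s in scores.items()
                    if len(other) < len(gram) and contains(gram, other))
        merged[gram] = score + bonus
    return merged


passages = ["lisbon", "lisbon portugal", "lisbon"]
scores = vote(passages)
merged = merge_subsumed(scores)
print(scores[("lisbon",)], merged[("lisbon", "portugal")])
```

In this sketch the bigram ends up with a higher score than the popular unigram, which is the preference for longer answers that the text describes.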
The authors of Aranea also point out some limitations of their system. Questions like Who was the first president of Portugal? could lead to the current president. The reason is that Aranea does not recognize temporal expressions: due to its approach, words like first or last are considered unimportant. We found no evidence that OpenEphyra avoids the same problems.
2.1.5 Large-Scale Question Answering
DeepQA (or Watson) [12], unlike the previous examples, relies on massive parallelism and many sources of expertise, i.e., it is able to run many different techniques, often computationally heavy and slow, at the same time and much faster. Thus, a number of different algorithms are applied. None of them is predominant; it is their combination that makes it possible to find the correct answer to almost all questions, no matter how they are posed. DeepQA employs over 2500 3.5GHz POWER7 cores and 16 Terabytes of RAM, being able to process 500 gigabytes per second. As we can see, Watson is a very large IBM project: this power enables the computation of dozens of algorithms, which is out of our reach, since we do not possess such resources. Nevertheless, the process deserves mention.
Candidate answers are gathered from different information sources and undergo soft filtering. These filters discard all but about 100 candidates. The filtering is done with ‘lightweight’ techniques, such as checking whether the candidate answer matches the question type. After this, there is a scoring step. As mentioned, DeepQA supports various techniques. About 50 scoring methods are used, from counting categorical features to information from triple stores. Each scorer may focus on one or more features and works independently, making it easier to add more scorers in the future.
Because it would be hard to compare scores from different methods directly, DeepQA needs to normalize the candidates’ scores. Like the previous systems, Watson also merges similar candidates. It uses candidate matching and coreference resolution, among others, to identify related candidates. In the end, candidates are ranked and the confidence in each one is calculated (this confidence is part of the strategy to play Jeopardy!, which is out of the scope of this thesis).
2.2 ‘Who Wants to Be Millionaire?’
As mentioned before, WWBM is one of the scenarios of this thesis. It consists of a set of multiple-answer questions, each with four candidate answers, exactly one of which is correct.
Two approaches were found in the literature, and are presented in this section.
2.2.1 The approach by Clarke et al.
To our knowledge, the first computational attempt to solve WWBM is the one described by Clarke et al. [9], where the authors present an approach to the QA task (first used at the Text REtrieval Conference (TREC) [8]) and, ultimately, apply it to WWBM. The main idea is to retrieve extents and score them according to the likelihood that they contain the answer. Given a document $D$, seen as a sequence of terms, one can define a subsequence by a pair $(u, v)$, referred to as an extent. This extent represents a passage, starting at term $t_u$ and ending at term $t_v$. These extents are created based on queries ($Q$), generated from the original question. The authors also establish two concepts: satisfaction and cover.
An extent is said to satisfy a term set $T \subseteq Q$ if the subsequence that the extent represents contains each term in $T$ at least once. If no shorter subsequence within the extent satisfies $T$, then the extent is considered a cover for $T$. $T$ ranges over the subsets of $Q$, that is, over the elements of the power set of $Q$, $\mathcal{P}(Q)$.
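The definitions of satisfaction and cover can be made concrete with a brute-force sketch. It is written for clarity rather than efficiency and is not the algorithm of Clarke et al.; positions are zero-based token indices, and the function names are hypothetical.

```python
def satisfies(tokens, u, v, terms):
    """True if the extent (u, v), inclusive, contains every term at least once."""
    return terms <= set(tokens[u:v + 1])


def covers(tokens, terms):
    """Return all covers of `terms`: satisfying extents with no satisfying sub-extent."""
    terms = set(terms)
    found = []
    for u in range(len(tokens)):
        for v in range(u, len(tokens)):
            if satisfies(tokens, u, v, terms):
                # v is the smallest end for this start, so (u, v - 1) fails.
                # The extent is a cover if dropping the first token also fails.
                if not satisfies(tokens, u + 1, v, terms):
                    found.append((u, v))
                break
    return found


tokens = "lisboa capital of portugal and porto in portugal".split()
print(covers(tokens, {"capital", "portugal"}))
print(covers("a x b a b x".split(), {"a", "b"}))
```

In the first example the extent starting at ‘lisboa’ satisfies the term set but is not a cover, because the shorter extent starting at ‘capital’ also satisfies it.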
The extents defined as covers are then scored. The score reflects the likelihood of finding the answer in the extent or in its neighborhood. For each term $t \in T$, there is an associated probability $p_t$, which corresponds to the likelihood of the term appearing at any given location in the document. The assumption is made that a document is a sequence of independent terms. Although this assumption does not hold in general, as some terms appear together more frequently, it is often a fair approximation given a large corpus. The probability $p_t$ can be roughly estimated from the frequency of the term in the document, $p_t = f_t/N$, with $N$ being the length of the document, that is, the number of terms it contains. The probability of a given term being present in an extent $(u, v)$, with length $l = v - u + 1$, can thus be defined as:
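\begin{align*}
P(t, l) &= 1 - \overline{P}(t, l) \\
&= 1 - \overline{P}(t, u) \times \overline{P}(t, u+1) \times \cdots \times \overline{P}(t, v) \\
&= 1 - (1 - P(t, u)) \times (1 - P(t, u+1)) \times \cdots \times (1 - P(t, v))
\end{align*}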
Given the assumption that a term appears at any location independently with probability $p_t$, the equation reduces to:
\begin{align*}
P(t, l) &= 1 - (1 - p_t)^l \\
&= 1 - (1 - l p_t + x p_t^2 - y p_t^3 + \cdots \pm p_t^l) \\
&\simeq 1 - \left[ 1 - l p_t + O(p_t^2) \right]
\end{align*}
The coefficients $l$, $x$ and $y$ are binomial coefficients and can be read from row $l$ of Pascal's triangle, immediately after its leading 1. As $p_t$ is necessarily smaller than 1, the higher its power, the smaller the term becomes. Therefore, the terms from $p_t^2$ onwards can be collapsed into $O(p_t^2)$, keeping only the first two terms of the expansion. After applying the minus sign, we end up with a friendlier approximate value:
\begin{align*}
P(t, l) &\simeq l p_t \tag{2.1}
\end{align*}
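A quick numerical check shows how the approximation in Equation 2.1 behaves. The sketch below compares the exact expression $1 - (1 - p_t)^l$ with $l p_t$; the sample values of $p_t$ and $l$ are arbitrary choices made for illustration, and the function names are hypothetical.

```python
def exact_presence(p, length):
    """Exact probability that a term with per-position probability p
    appears at least once in an extent of the given length."""
    return 1 - (1 - p) ** length


def approx_presence(p, length):
    """First-order approximation from Equation 2.1."""
    return length * p


# A rare term in a short extent: the two values are close.
rare_exact = exact_presence(0.001, 10)
rare_approx = approx_presence(0.001, 10)

# A frequent term in a long extent: the approximation exceeds 1
# and can no longer be read as a probability.
frequent_exact = exact_presence(0.1, 20)
frequent_approx = approx_presence(0.1, 20)

print(rare_exact, rare_approx)
print(frequent_exact, frequent_approx)
```

The approximation is therefore reasonable when the product $l p_t$ is small, which is the typical situation for question terms in short extents of a large document.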
Having found the value of $P(t, l)$, we can use it to calculate the probability that all terms in $T$ appear in the extent $(u, v)$:
\begin{align*}
P(T, l) &= \prod_{t \in T} P(t, l) \\
&\simeq \prod_{t \in T} l p_t \\
&= l^{|T|} \prod_{t \in T} p_t \\
&= l^{|T|} \prod_{t \in T} f_t / N \tag{2.2}
\end{align*}
After this we are able to score the extents. The authors calculate the score of each extent as the amount of self-information it contains by having all terms T ⊂ Q. Self-information measures the information associated with the occurrence of a given event and its outcome. The smaller the probability of an event, the greater the self-information associated with the occurrence of that event. This measure is additive and positive. Thus, if an event is the composition of two other independent events, its self-information is the sum of the self-information of those two events. The formula for self-information is:
\[
I(\omega_n) = \log \frac{1}{P(\omega_n)} = -\log P(\omega_n), \tag{2.3}
\]
where $\omega_n$ denotes the outcome of the event with probability $P(\omega_n)$.
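The additivity property can be verified with a short sketch. The natural logarithm is an assumption here, since the text does not fix the base of the logarithm; the function name is illustrative.

```python
import math


def self_information(probability):
    """Self-information of an outcome, as in Equation 2.3.
    The natural logarithm is assumed; the base is not fixed in the text."""
    return -math.log(probability)


# Two independent events: the probability of both occurring is the
# product, and the self-information of the joint event is the sum.
joint = self_information(0.5 * 0.25)
separate = self_information(0.5) + self_information(0.25)
```

Here `joint` and `separate` agree up to floating-point error, and a rarer outcome always yields a larger value than a more likely one.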
The score of an extent (u, v) is simply the self-information contained by (T, l), which is obtained by substituting Equation 2.2 into Equation 2.3. Because we are assuming the independence of each P(t, l), we can simplify the equation:
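\begin{align}
I(T, l) &= -\log P(T, l) \notag \\
&= -\log\Bigl(l^{|T|} \prod_{t \in T} p_t\Bigr) \notag \\
&= -\log\bigl(l^{|T|}\bigr) - \log\Bigl(\prod_{t \in T} p_t\Bigr) \notag \\
&= -|T| \log l - \sum_{t \in T} \log(p_t) \tag{2.4} \\
&= \sum_{t \in T} \log(1/p_t) - |T| \log(l) \notag \\
&= \sum_{t \in T} \log(N/f_t) - |T| \log(l) \tag{2.5}
\end{align}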
Equation 2.4 gives a higher score to snippets with a lower probability of occurrence. These higher values do not necessarily imply a greater likelihood of the answer being present in the extent or its proximity, but the authors found empirically that this relation holds. With this passage retrieval technique alone, at TREC-8, the system found about 63% of correct answers.
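The scoring of a cover through Equation 2.5 can be sketched as follows. The sketch assumes a tokenized document, 0-based inclusive extent positions, the natural logarithm, and that every term of the set occurs in the document (which holds for any cover); the function name `cover_score` is hypothetical.

```python
import math
from collections import Counter


def cover_score(doc_terms, term_set, u, v):
    """Score of the extent (u, v) for term_set, following Equation 2.5:
    sum over t of log(N / f_t), minus |T| * log(l)."""
    n = len(doc_terms)            # N: document length in terms
    freq = Counter(doc_terms)     # f_t: frequency of each term
    length = v - u + 1            # l: extent length
    rarity = sum(math.log(n / freq[t]) for t in term_set)
    return rarity - len(term_set) * math.log(length)
```

Under this formula, shorter covers and covers of rarer terms receive higher scores, which matches the behaviour described for Equation 2.4.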
Initially, as mentioned before, this strategy was applied to TREC. Here, an answer is a passage, not a term. Thus, a passage centered in the extent was returned. To consider single terms as candidates, instead of complete snippets, the score is based on a simplification of Equation 2.5. The second part of the equation is reduced to 0, since l is now 1. The first one loses the sum for the same reason (only one term). The scoring for a term t is then:
\[
w_t = c_t \, I(t, 1) = c_t \log(N / f_t), \tag{2.6}
\]
where $c_t$ is the redundancy component, representing the number of occurrences of t in the different passages.
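Equation 2.6 can be sketched in a few lines. The sketch assumes that $c_t$ is the total number of occurrences of the term across the retrieved passages (the text leaves open whether occurrences or passages are counted) and that the corpus frequency and corpus length are supplied by the caller; all names are illustrative.

```python
import math


def term_weight(term, passages, corpus_freq, corpus_length):
    """Weight of a single candidate term, following Equation 2.6.

    passages: list of token lists retrieved for the question.
    corpus_freq, corpus_length: f_t and N, supplied by the caller.
    """
    # Redundancy component c_t, assumed here to be the total number
    # of occurrences of the term across all passages.
    c_t = sum(passage.count(term) for passage in passages)
    return c_t * math.log(corpus_length / corpus_freq)
```

A term that never appears in the passages receives a weight of zero, regardless of how rare it is in the corpus.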
Finally, the authors applied these techniques to multiple-answer questions from the WWBM contest. Because there is a set of possible answers, the passage retrieval module was modified to take that into account: the scoring process also considers the presence of one of the answers in the passages (still using Equation 2.4). The passages are then ranked by votes (one for each answer present in the snippet). The accuracy achieved on a 108-question test was 70%.
2.2.2 The approach by Lam et al.
In 2003, Lam et al. [22] took another approach to WWBM. The strategy is simple, yet yields good results. The authors used the web as their only resource, and reported three different techniques for this problem, with accuracies ranging from 50 to 75%.
The first technique comes directly from the problem formulation. Having the possible answers (and, of course, the correct one among them), is it easier to confirm each one of them instead of searching just for the correct one? Relying on the web, is it possible to use search engines to answer the questions this way? The authors established their baseline by choosing as the correct answer the one with the most results retrieved by querying a search engine with a query in the form ‘answer . questionModified’. The term ‘questionModified’ refers to the question filtered of stopwords. Also, for questions containing a not, like Which of these plays is not written by Shakespeare? A: Hamlet B: Othello C: Romeo and Juliet D: Cats, the chosen answer is the one with the fewest results (an inversion due to the presence of not). Using Google, the accuracy was about 50%. Some modifications, such as enclosing answers in quotes in the queries, allowed the system to reach 60% of correct answers.
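The baseline just described can be sketched as below. The sketch assumes that the hit counts have already been obtained from a search engine and are passed in as a dictionary, and that the ‘.’ in the query form denotes plain concatenation; the function names and the stopword handling are illustrative assumptions.

```python
def build_query(answer, question_terms, stopwords, quote_answer=False):
    """Build a query of the form 'answer questionModified', where the
    question is filtered of stopwords. Quoting the answer is optional."""
    kept = [w for w in question_terms if w.lower() not in stopwords]
    ans = f'"{answer}"' if quote_answer else answer
    return " ".join([ans] + kept)


def pick_by_hits(hits, question_terms):
    """Choose the answer with the most hits, or the fewest hits when
    the question contains 'not' (the inversion described above).

    hits: dict mapping each candidate answer to its number of results,
    assumed to have been fetched beforehand."""
    negated = any(w.lower() == "not" for w in question_terms)
    chooser = min if negated else max
    return chooser(hits, key=hits.get)
```

Keeping the search step outside these functions means the selection rule can be exercised with any source of hit counts.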
Another technique employed by the authors considers the position of the answer in a given document, when compared with other words present in the question. The rationale for this heuristic is the observation that answers often appear in documents close to question words.
The intent is to weigh the distances, within a maximum radius, so that documents with too many references to an answer but not to the corresponding question words are worth less. The first ten pages returned by the queries are used in the algorithm presented in Algorithm 1. The parameter ‘documentSplit’ represents an array where each position is a term in the document.
Algorithm 1 Pseudocode for a scoring algorithm, giving more weight to answers near question words, within radius words.
1: DistanceScore(documentSplit, questWords, ansWords, radius)
2: score, ansFoundWords = 0
3: for i = 1 to ||documentSplit|| do
4:   if documentSplit[i] ∈ ansWords then
5:     ansFoundWords += 1
6:     for j = (i − radius) to (i + radius) do
7:       if documentSplit[j] ∈ questWords then
8:         score += (radius − |i − j|)/radius
9:       end if
10:     end for
11:   end if
12: end for
13: if ansFoundWords == 0 then
14:   return 0
15: else
16:   return score/ansFoundWords
17: end if
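A runnable sketch of Algorithm 1 is given below. It follows the pseudocode step by step, with two assumptions that the pseudocode leaves open: positions are 0-based, and the window is clamped to the document boundaries. The function and parameter names are adapted to Python conventions.

```python
def distance_score(document_split, quest_words, ans_words, radius):
    """Average, over the answer terms found, of the linearly decaying
    scores of the question words lying within `radius` positions."""
    score = 0.0
    ans_found_words = 0
    for i, term in enumerate(document_split):
        if term not in ans_words:
            continue
        ans_found_words += 1
        # Assumption: clamp the window so it stays inside the document.
        start = max(0, i - radius)
        end = min(len(document_split) - 1, i + radius)
        for j in range(start, end + 1):
            if document_split[j] in quest_words:
                score += (radius - abs(i - j)) / radius
    if ans_found_words == 0:
        return 0.0
    return score / ans_found_words
```

For instance, with the tokens `x ans q`, an answer word `ans`, a question word `q` and a radius of 2, the single question word lies at distance 1 and the result is 0.5.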
Figure 2.1, extracted from Lam et al. [22, Figure 1], depicts the accuracies for three runs of 90 questions for different values of radius. The graph also shows the accuracy of the baseline approach.
Figure 2.1: Algorithm 1 performance for three 90-question runs, with different values of radius. The last point corresponds to the baseline approach.
Inspecting the figure we can see that the optimal value for radius is not trivial to derive, but for higher values the algorithm performs better than simply counting the results for each possible answer.
Greater accuracies (70-75%) were accomplished by combining different runs and different search engines.
2.3 Question Answering for Machine Reading Evaluation
QA4MRE is another of the scenarios proposed in this thesis. Recall that in this task systems must answer a set of multiple-answer questions related to a document (a TED talk transcription). There are five candidate answers and, as in WWBM, candidate answers are not related, with one of them being the correct answer.
In 2011, about twelve systems participated in this challenge, ten of them in the English task. Of the twelve, only eight have working notes available [34]. Of all submitted runs (43 for English, 11 for German and 9 for Romanian), nearly half reported scores below the baseline [34, Table 11], which is set at 0.2 in c@1, corresponding to random guesses for all questions.
The winning system [33] achieved a 48.3% accuracy (0.57 considering the c@1 measure).
The systems presented in 2011 can be divided into three categories: the ones using Information Retrieval (IR) techniques, those which view QA4MRE as a QA problem and, finally, those applying logic-based strategies.
In the last group is the system proposed by Babych et al. [3], for German. It parses the input text, representing it as a dependency graph. From it, the system uses a set of rules, as Hearst [18] did, to extract hyponym and other relations. Finally, all sentences and rules from the previous step are transformed into logical formulae and, given the candidate answer C, the input text T, and the Background Collection B, the system tries to infer whether (T ∧ B) ⊢ C.
Similarly, Glöckner et al. [14], also for the German task, take each candidate answer and, together with the question, build a hypothesis, which should be proved from the reading test and the Background Collection.
Verberne [54] submitted a system based on IR techniques. After data preparation, including coreference resolution (which the last mentioned system also does), sentences from the input text are extended with sentences from the Background Collection. This is done with the BM25 ranking function, which ranks passages from the Background Collection according to their similarity to the input text. Also, questions are extended with facts retrieved from the Background Collection. For this, the system uses a set of rules, which extract facts that establish a relation between a subject and an object. Then, if a question contains any of the subjects and/or objects, the respective fact is added, extending the question in this way.
These two strategies are optional, and the system performs better when using only the first one. Regardless of which one is used, the system then again uses BM25 to measure the similarity between questions (extended or not) and the passages from the input text (extended or not), ranking them in the process. Finally, the system uses BM25 once more to score each answer against the selected fragments (extended or not), resulting in the final answer.
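For reference, a minimal sketch of the BM25 ranking function is shown below. It uses a common formulation of Okapi BM25; the parameter values k1 = 1.5 and b = 0.75 are frequently used defaults and are assumptions here, not the settings reported by Verberne [54]. Passages are assumed to be non-empty lists of tokens.

```python
import math
from collections import Counter


def bm25_scores(query_terms, passages, k1=1.5, b=0.75):
    """Score each passage (a list of tokens) against the query terms
    using a common formulation of Okapi BM25."""
    n = len(passages)
    avg_len = sum(len(p) for p in passages) / n
    # Number of passages in which each term occurs at least once.
    doc_freq = Counter(t for p in passages for t in set(p))
    scores = []
    for passage in passages:
        tf = Counter(passage)
        total = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long passages are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(passage) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        scores.append(total)
    return scores
```

The same function can serve each of the three steps described above by changing what plays the role of the query (input text, question, or answer) and what plays the role of the passages.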
Iftene et al. [19] also applied a simple IR technique, which consisted in applying the Lucene indexer¹ to the reading test texts and the Background Collection. The resulting index is then queried using the questions, creating a second index built from the retrieved passages/documents. Then, based on this new index, answers are used as queries in a new retrieval step. The relevance scores from each retrieval step are then used to compute a final score for each answer. To transform questions and answers into queries, the authors use Named Entity (NE) Recognition (NER), lemmatization and stopword elimination, giving more weight to NEs when building the indexes.
Saias and Quaresma [44] also used an approach based on Lucene. With the texts and Background Collection indexed, the system uses WordNet [30] to check for synonyms and uses the answers as queries, filtered of stopwords. Then, for each answer, the system checks two rules: the first one depends on the question classification, which, according to the authors, did not get any results. Each classification is associated with a pattern, which is used to try to extract the answer. If this is unsuccessful, then the second rule applies. This second rule measures the distance between the key elements (i.e. the object of the question) and the answer.
The answer selection follows a set of if-else clauses regarding the two rules. The conditions are the rules that fired, how many times they did fire, or the minimal distance calculated.
Cao et al. [6] presented a strategy that simulates the approach people take when learning a new language and answering reading tests. According to the authors, people will first locate the passages related to the questions, searching for them by NEs. Thus, their system performs NER and then finds related passages by comparing the NEs between the question and the passages. We should note that the texts are preprocessed to resolve anaphora. The second and third steps are hardly dissociable: the system uses a lexical chain to score each hypothesis, then selects the one with the greatest score. The lexical chain was based on previous work [31].
The idea is to relate terms by WordNet relations, such as synonym, hypernym and gloss.
The terms may be directly related (a V relation – the letter is associated with how many intermediate terms exist) or indirectly, having one or more terms between (W, VW and WW relations). Each kind of relation has a weight associated, which contributes to the final score. Unfortunately, the authors’ description is not clear on how answers and passages are compared.

¹ http://lucene.apache.org/
Martinez-Romo and Araujo [27] used a completely different approach. Although the Background Collection is also preprocessed and indexed by Lucene, the system links all nouns (proper and common) and verbs within a given document, establishing a co-occurrence graph.
This means that the terms appearing in a given document are related under the same topic.
After this, the system uses WalkTrap [37] to automatically discover communities. These communities are basically clusters that gather terms belonging to the same topic. Then, each question is assigned to a community based on their similarity. Following this, each answer is also assigned to a community; the selected answer is the one with the greatest similarity to the question context (i.e., community). We should note that the authors do not specify which similarity measures are used to compare questions with communities, answers with communities, and the communities themselves.
Pakray et al. [33] developed a system that uses two different strategies: an Answer Validation (AV) approach and a QA approach. The best results were achieved by using the system as a hybrid of the two. The AV module is based on Textual Entailments (TEs): for each answer to a given question, a hypothesis H is generated according to a set of patterns.
These are then used to retrieve passages from the texts, which are indexed with Lucene. The topmost sentence, T , is paired with the corresponding hypothesis, resulting in the pair T-H.
These pairs are then processed by a pipeline of different strategies to determine if they are TEs. Among these strategies are the comparison of NEs, the number of co-occurring unigrams, bigrams and skip-bigrams between T and H, and the matching of question and answer types.
Finally, the pair with the greatest score from all strategies is chosen and the associated answer is considered the correct answer. The QA module starts by doing similar tasks. Following some rules, each question is transformed into a pattern, where the wh-word is substituted by one of the candidate answers. Stopwords are also removed from these patterns, creating a keyword list. Then, each pattern is compared against each sentence from the documents. If they do not match, the same is done between the respective keyword list and the sentences.
Whichever matches, a score is assigned. Finally, the answer associated with the sentence with the greatest score is chosen as the correct answer. It is important to note that anaphora is resolved in each document, and NER is performed to help in this task. First person personal pronouns are substituted with the document’s author; however, in direct speech, these pronouns refer to the last NE, and the swap is made accordingly. For third person pronouns the system uses similar rules to detect the referring NE, depending on whether it is direct or indirect speech.
2.4 Just.Ask
Just.Ask [1, 47, 28] is a QA system developed by the L2F group and it is the third scenario of this thesis.
Just.Ask receives a question and starts by analyzing it. For that it uses tools like tokenizers and syntactic analyzers. The question headword is extracted and the question is classified according to the two-layer taxonomy of Li and Roth [24]. This information is passed down to the passage retrieval module. Just.Ask uses different information sources and, thus, different query formulations, depending on their purpose.
The relevant passages retrieved in the previous step are the input for the answer extraction and selection step. Just.Ask uses a set of different approaches, each one for a subset of questions.
For Numeric and Abbreviation type questions, Just.Ask utilizes a set of patterns to extract candidate answers. However, questions of type Human:Individual require a more flexible approach, because names may appear in different formats and are therefore difficult to express in patterns. For this Just.Ask uses a machine learning-based named entity recognizer. There is a third kind of question, the type-of questions. These questions ask for hyponyms of the question’s headword. For instance, in the question Which animal is the fastest?, the answer we are looking for is not an individual, but a type of animal: cheetah, which is a hyponym of animal. Just.Ask uses WordNet [30] at run time, extracting such relations to answer this type of question.
After candidate answers are extracted, Just.Ask chooses one to be returned as final answer.
Candidates go through three steps: normalization, clustering and filtering.
For the first step, Just.Ask normalizes candidate answers that correspond to the same entity.
The idea is to try to eliminate the variation of answers. This is accomplished by transforming candidates to a unique representation. For example, numbers (numeric or textual versions) are all transformed into a number with one decimal place.
After normalization, the clustering step is performed. The goal is to merge candidate answers representing the same entity into a unique cluster. Just.Ask uses two distance measures to determine the similarity between candidates: Overlap distance and Levenshtein distance [23].
The first is defined as follows:
\[
\mathrm{Overlap}(X, Y) = 1 - \frac{|X \cap Y|}{\min(|X|, |Y|)}, \tag{2.7}
\]
where |X ∩ Y | is the number of tokens present in both candidate answers X and Y , and min(|X|, |Y |) is the size of the smallest candidate.
Initially each candidate answer has its own cluster. After each iteration, the two closest clusters are merged together. The distance between two clusters is defined as the minimum distance between any pair of members, one from each cluster. When the process halts, the longest candidate of each cluster is chosen as the representative answer of that cluster.
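The clustering step can be sketched as follows, using only the Overlap distance of Equation 2.7. Several details are assumptions, since the text does not specify them: candidates are non-empty strings tokenized on whitespace, merging stops when the closest pair of clusters is farther apart than a caller-supplied `threshold`, and the longest candidate is measured in characters. The function names are hypothetical.

```python
def overlap_distance(x, y):
    """Overlap distance of Equation 2.7 over whitespace tokens."""
    x_tokens, y_tokens = set(x.split()), set(y.split())
    shared = len(x_tokens & y_tokens)
    return 1.0 - shared / min(len(x_tokens), len(y_tokens))


def cluster_candidates(candidates, threshold):
    """Single-link agglomerative clustering of candidate answers.
    Returns (representative, members) pairs, one per cluster."""
    clusters = [[c] for c in candidates]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cluster distance: minimum over cross-cluster pairs.
                d = min(overlap_distance(x, y)
                        for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Assumed stopping rule: halt when the closest pair is too far.
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    # The longest candidate represents each cluster.
    return [(max(members, key=len), members) for members in clusters]
```

With a threshold of 0.0, only candidates whose shorter form is fully contained in the longer one (such as a surname and the corresponding full name) end up in the same cluster.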
Finally, a filtering step occurs. Here, clusters with any member present in the original question are discarded. After filtering, the system chooses the cluster with the highest score and returns its representative answer. The score of a given cluster is simply the sum of the scores of its members, which are attributed during the answer extraction phase.
2.5 Latent Semantic Analysis
LSA is a mathematical technique first applied to IR [10] that has also been applied to different areas, such as summarization [40, 32]. LSA can help to identify latent topics from documents, even without knowing them a priori.
LSA is a vector-space approach, based on Singular Value Decomposition (SVD) to reduce the dimensionality of the original matrix A. This matrix is usually a term by document matrix, but sentences instead of entire documents may also be used. The matrix has dimensions T × N, with T being the total number of terms and N the total number of documents.
The matrix is filled with the scores of each term in the given document. These scores usually have a local and a global component (for instance, the tf.idf score may be used), but local weights alone, such as term frequencies or other features, may also be used. Applying SVD to matrix A yields a decomposition into three matrices:
\[
A = U \Sigma V^{T},
\]
with U being a T × m matrix, Σ an m × m diagonal matrix whose elements are non-negative singular values sorted in descending order, and $V^{T}$ an m × N matrix. Figure 2.2 depicts the transformation.
Figure 2.2: Application of SVD to matrix A.
We can see that one can now define terms by topics (or concepts), and also topics by document.
This allows another view of the documents. For example, in summarization one can extract only a subset of topics from a document and then select passages from that topic. The singular values of the Σ matrix follow a Zipf distribution, meaning that the top-n topics are the most relevant, while the remaining ones have approximately the same importance.
This decomposition allows different analyses. For example, one can compare two documents by evaluating their similarity. This is done by comparing their vectors (which are normalized due to SVD), applying, for instance, a cosine function between them.
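The document comparison can be written out explicitly. The following is a sketch under stated assumptions: the truncation rank k (keeping only the k largest singular values) and the choice of representing document j by column j of $\Sigma_k V_k^{T}$ are common conventions, not details given in the text; $\mathbf{e}_j$ denotes the jth standard basis vector of length N.

```latex
\[
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
\mathbf{d}_j = \Sigma_k V_k^{T} \mathbf{e}_j, \qquad
\operatorname{sim}(\mathbf{d}_i, \mathbf{d}_j) =
\frac{\mathbf{d}_i \cdot \mathbf{d}_j}
     {\lVert \mathbf{d}_i \rVert \, \lVert \mathbf{d}_j \rVert}
\]
```

Here $U_k$ is T × k, $\Sigma_k$ is k × k and $V_k^{T}$ is k × N. Dividing by the vector norms makes the cosine independent of any normalization of the document vectors.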
There are similar approaches, using the same paradigm (semantic spaces and dimension reduction), but that rely on probabilities instead of frequencies. These are called Probabilistic Topic Models, and two examples are pLSA (LSA with a probabilistic component) and LDA (Latent Dirichlet Allocation), which differs by adding a Dirichlet prior on the topic distribution [50].
2.6 Information Sources
Although we are focused on answer selection, such a task is not possible without performing passage retrieval. In fact, if the correct answer has not been extracted beforehand, there is no hope for the system to answer the question correctly later. QA systems thus need information sources. Information sources can be classified into two different groups: non-structured data and (semi-)structured data. The first refers to textual documents on the World Wide Web (WWW). The latter is related to databases and other platforms where information can be easily extracted. The two types are discussed in detail in the following subsections.
2.6.1 World Wide Web as an information source
The web is one of the most accessible information sources and is an example of non-structured data, in the sense that the data is contained in free text, that is, the content of the web is not structured, although web pages are structured in HTML. One significant advantage of the web is that it has redundant data. Most systems that use the web rely on this to support their answers. The assumption is that correct answers appear often on the web, while less accurate answers appear fewer times. However, one of the main limitations of this strategy is that a system may end up returning not the correct answer, but the most popular one, as the WWW contains much noise due to being widespread and free to use (anyone can create a blog nowadays and write there whatever (s)he wants).
Search engines are undoubtedly an important component of any system that uses the web.
They make it easy to fetch a set of (more or less) related documents, given a number of keywords. Typical examples are Google or Yahoo. People are used to entering some terms in their browser and quickly finding the intended information, almost always in the first page of returned results. Although humans are pretty accurate doing this, the task of extracting an answer from the snippets returned by the search engine is much harder for a computer.
Concretely, a system may be confused by surrounding terms, misleading the answer extraction module. For instance, systems using patterns (like those presented in Section 2.1.1) may not extract the correct answer due to this fact. Thus, the snippets retrieved should contain the least possible noise, to further facilitate the extraction.
As examples of systems that use the web we have the previously mentioned Just.Ask [1, 47, 28]
using Bing, Aranea [26, 25] using Google, and OpenEphyra [46] using Yahoo. OpenEphyra also uses Indri [51], from the Lemur Project. Indri is a search engine developed at UMass that combines different features to allow more refined searches.
Nowadays, to our knowledge, there are only three available search engines able to answer computational requests, besides the aforementioned Indri: Bing, Blekko and Entire Web.
This is due to the following facts: a) the Yahoo API has been deprecated and is no longer available, as is AltaVista, which belongs to Yahoo!; b) the Google API is deprecated as well, and the new one does not allow a significant number of queries free of charge; c) AllTheWeb also ceased to exist. In this way, the only survivors are these three. Bing is a traditional search engine, and works the same way as Google does. Blekko, on the other hand, has a limitation, as it only accepts queries with a length of ten terms; any term beyond the tenth is blindly discarded. However, it has an interesting feature: the so-called slashtags. Slashtags allow a query to be parameterized by restricting the search domain (slashtags can restrict to a domain, such as health or sport). Finally, Entire Web has only recently made an API available.
2.6.2 (Semi-)Structured data sources
QA systems may also use semi-structured information sources. These are, usually, platforms where information is ensured to be correct, or at least assumed to be correct (for example, Wikipedia is an information source that usually is accurate but, as anyone can edit it, it may contain some errors). In this group we have some examples actually used by some systems, as explained next.
WordNet [30] is one of the most used sources. WordNet is used to search for synonyms, hyponyms, and other relations between words. This can be useful not only for query expansion, creating different queries with the same meaning, but also when it comes to answer extraction and selection, as seen in Sections 2.1 and 2.3, as these relations may be important (for instance when clustering candidate answers or when matching terms from sentences with the question).
DeepQA, Just.Ask and Prager et al. [38] use WordNet as explained.
DeepQA also uses other information sources, such as Freebase [4], DBpedia [2] and Yago [52].
Freebase is a collection of facts about people, places and so on. DBpedia is a knowledge base that gathers Wikipedia semi-structured information into a structured platform, allowing easy access to facts and descriptions about anything found in Wikipedia. In fact, as pages use the same suffix, Just.Ask uses the Wikipedia page titles to index DBpedia pages and consult them to answer description questions. Yago is another online platform that consolidates information from various sources and tries to establish relations between entities.
Aranea uses yet other information sources, such as CIA World Factbook², 50states.com, Biography.com, and Internet Movie Database (IMDB)³ to help answer questions about countries, people, and movies, respectively. As each information source is defined independently, Aranea uses a set of hand-crafted wrappers to access each one of these sources.
Wikipedia is also being used as a source more frequently, especially after 2007, when CLEF started to allow its usage. Systems like Joost [5], a Dutch QA system, and Just.Ask, as mentioned before, use Wikipedia to answer not only description questions, but also some factoid questions about capitals, languages and other facts that can be easily extracted from Wikipedia pages.

² https://www.cia.gov/library/publications/the-world-factbook/index.html
³ http://www.imdb.com/
2.7 Summary
In this chapter we started by presenting the state of the art in answer selection, followed by different works applied to WWBM and QA4MRE. Of all the presented solutions, except for those using patterns, only a few consider the context in which candidate answers appear, such as Lam et al. [22] (Section 2.2.2) and Echihabi et al. [11] (Section 2.1.3).
3 AnSelMo
Following some of the work presented in the previous chapter, we built a system, as language independent as possible, that aims at answering multiple-answer questions. This chapter presents this system, AnSelMo (ANSwering SELection MOdule), starting with its architecture, followed by a section dedicated to each one of its modules: Pre-Processing, Counting, Lexical Approaches, Latent Semantic Analysis and Scoring.
3.1 Architecture
Figure 3.1 depicts the system’s architecture. The core is the column of strategies that can be used to perform the answer selection task, with the grey box representing other techniques that may be developed and added to the system. Each of those modules will be described in greater detail in the following sections.
Figure 3.1: AnSelMo’s Architecture.
The system starts by performing a pre-processing step (Section 3.2). This step yields the set of candidate answers¹ together with the snippets or documents that are related to them.
These are then passed to one (or more) of the designed techniques. There are three classes of algorithms: Counting (Section 3.3), Lexical Approaches (Section 3.4) – which includes both Word Proximity and Similarity Measures – and Latent Semantic Analysis (Section 3.5).
Finally, the scoring is done according to the weights given by each of the modules.
3.2 Pre-Processing
The pre-processing step includes different behaviors, depending on what is intended, which makes this step not totally language independent. In the following paragraphs we detail the possibilities of this step.
As there is sometimes the need to perform query formulation and IR steps (the cases where only the questions along with the candidate answers are given to the system), snippets need to be retrieved for each of the hypotheses. This is done using two search engines, Bing and Blekko (see Section 2.6). For each question, distinct queries are formulated, one for each candidate answer. The process of query formulation involves different formats and filters (the system works with any combination of these):
• The answer can be quoted or not;
• The answer terms can appear before or after the question terms (AQ vs. QA format);
• Different filters can be applied (these are, actually, used throughout the whole system)²:
– Wh-words–filter, where words such as Where and Who are eliminated;
– Prepositions–filter, where prepositions like At are removed;
– To Be–filter, that ignores different forms of the verb to be.
¹ Already known a priori.
² Henceforth, when referring to filtering in any of the strategies, any combination of these can be applied.
These filters are an example of the language dependence of the system.
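The query formulation options listed above can be sketched as follows. The word lists are small illustrative samples for English and are assumptions, as are the function name and its parameters; the actual lists used by the system are not given in the text.

```python
# Illustrative English word lists (assumptions, not the system's own).
WH_WORDS = {"who", "what", "where", "when", "which", "why", "how"}
PREPOSITIONS = {"at", "in", "on", "of", "by", "to", "for", "from", "with"}
TO_BE = {"be", "am", "is", "are", "was", "were", "been", "being"}


def formulate_query(question, answer, quoted=True, answer_first=True, filters=()):
    """Build one query for a (question, candidate answer) pair.

    quoted: enclose the answer in quotes.
    answer_first: AQ format if True, QA format otherwise.
    filters: any combination of the word sets above."""
    blocked = set().union(*filters)
    terms = [w for w in question.rstrip("?").split() if w.lower() not in blocked]
    ans = f'"{answer}"' if quoted else answer
    parts = [ans] + terms if answer_first else terms + [ans]
    return " ".join(parts)
```

Calling the function once per candidate answer yields the distinct queries for a question; swapping the word lists is what a port to another language would require.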
The system also works with local documents (or collections of documents). Each question should be bound to the document where the answer can (or should) be found. Thus, in this situation, the IR step does not take place. We may also use Lucene to index those documents and later retrieve relevant snippets, given a query created by concatenating the question with the answer, filtered of stopwords, as before. Two indexing strategies are applied: in chunks of one or three sentences.
We may perform a simple form of anaphora resolution too. This only applies to documents that represent speeches and whose speaker is well defined. Pronouns like I and me are commonly used in such cases, so we substitute all those pronouns by the text’s author.
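This substitution can be sketched with a regular expression. The sketch handles only the two pronouns named above (I and me), matches them case-sensitively, and skips contractions such as I'm; these choices, together with the function name, are assumptions made for illustration.

```python
import re

# Matches the standalone pronouns "I" and "me", skipping contractions
# such as "I'm" (followed by a straight or curly apostrophe).
FIRST_PERSON = re.compile(r"\b(?:I|me)\b(?!['’])")


def resolve_first_person(text, speaker):
    """Replace first person pronouns with the name of the speaker."""
    return FIRST_PERSON.sub(speaker, text)
```

A fuller treatment would also need to handle possessives and contractions, which this sketch deliberately leaves untouched.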
3.3 Counting
This is one of the simplest modules and is based on the work of Lam et al. [22], described in Section 2.2.2. The authors performed experiments with this strategy, achieving interesting results. Recall that this strategy is based on the assumption that the correct candidate answer is the one that has the greatest number of hits when querying a search engine. The queries used are obtained as explained in Section 3.2, containing both the candidate answer and the question terms.
3.4 Lexical Approaches
3.4.1 Word Proximity
The Word Proximity strategy is based on the assumption that answers occur close to question terms. The algorithm calculates the distance between each candidate answer’s terms and the question terms in the neighborhood, that is, it weighs the distances, considering a maximum radius³, so that documents with too many references to an answer but not to the corresponding question terms are worth less. The algorithm was presented in Algorithm 1, in Section 2.2.2.
This way, by consulting a set of documents, one can search for the answers and, in case of finding them, check within the radius for the question terms. Other variations were developed, in an attempt to soften the linear decrease given by the algorithm. The change was made in line 8 of the algorithm, which calculates the scores given the distance between the answer and question terms. The different options available in our system are the following and their functions are depicted in Figure 3.2:
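• Linear: $\dfrac{\mathit{radius} - |\mathit{answerIdx} - \mathit{windowIdx}|}{\mathit{radius}}$
• Quadratic: $\dfrac{\mathit{radius}^2 - |\mathit{answerIdx} - \mathit{windowIdx}|^2}{\mathit{radius}^2}$
• Cubic Root: $\dfrac{-\sqrt[3]{|\mathit{answerIdx} - \mathit{windowIdx}| - \mathit{radius}}}{\sqrt[3]{\mathit{radius}}}$
• Cubic: $\dfrac{\mathit{radius}^3 - |\mathit{answerIdx} - \mathit{windowIdx}|^3}{\mathit{radius}^3}$
• Tetra (Fourth Power): $\dfrac{\mathit{radius}^4 - |\mathit{answerIdx} - \mathit{windowIdx}|^4}{\mathit{radius}^4}$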
The point C in the figure represents the center point, that is, the point where an answer term was found. This is the answerIdx in the function list (i in the original algorithm).
Recall that, when an answer term is found, the algorithm looks for question terms in the surrounding terms, within a given radius; windowIdx represents the index in the window and takes values between −radius and +radius (j in the original algorithm). The original function is the Linear one. All functions return a score between 0.0 and 1.0, representing respectively a greater or smaller distance to the center. The process is applied for each term present in the candidate answer, which will have a score of 0.0 if no question term is found within the radius.
The idea of the functions is, as noted, to soften the linear decrease of the original function. The Quadratic, Cubic and Tetra functions are all similar, only differing on how late they start to decrease the returned score. The Cubic Root function is an attempt to flatten the abrupt decrease present in the powered functions (it is visible in the figure as the line that crosses all others): it starts almost linearly and decreases faster only close to the extremes.

Figure 3.2: Score functions used in Word Proximity.

³ We use the term radius as it was introduced by Lam et al. [22], instead of window. Notice that the size of a window is twice the radius.
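The five score functions can be sketched as below, with `distance` standing for answerIdx − windowIdx. The Cubic Root function is written in the equivalent form ∛((radius − |distance|)/radius), which avoids taking the cube root of a negative number; the function names and the dictionary are illustrative assumptions.

```python
def make_power_score(power):
    """Build a score function of the form (r^p - |d|^p) / r^p,
    covering the Linear, Quadratic, Cubic and Tetra options."""
    def score(distance, radius):
        return (radius ** power - abs(distance) ** power) / radius ** power
    return score


def cubic_root_score(distance, radius):
    """Cubic Root option, in the equivalent non-negative form
    cbrt((radius - |distance|) / radius)."""
    return ((radius - abs(distance)) / radius) ** (1.0 / 3.0)


SCORE_FUNCTIONS = {
    "linear": make_power_score(1),
    "quadratic": make_power_score(2),
    "cubic_root": cubic_root_score,
    "cubic": make_power_score(3),
    "tetra": make_power_score(4),
}
```

Every function returns 1.0 at the center and 0.0 at a distance equal to the radius; they differ only in how the score decays in between, so any of them can replace the expression in line 8 of Algorithm 1.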
Consider the following example, from the QA4MRE 2011 corpus: Who is the founder of the SING campaign? is the question, with two candidate answers, Zackie Achmat and Annie Lennox, the latter being the correct one. Given the passages ‘to have met Zackie Achmat, the founder of Treatment Action Campaign’ and ‘And this is the name of Annie Lennox campaign, SING Campaign’⁴, both in the associated document, the result for Word Proximity would be a score for each answer calculated as follows (assuming a radius of 10 and the Linear scoring algorithm):
• Zackie Achmat:
Zackie distance to founder is 3, score is 0.7; distance to campaign is 7, score accumu-
⁴ After anaphora resolution.