Question answering, the identification of short, accurate answers to users' questions, is a long-standing challenge that has been widely studied over the last decades in the open domain. However, it still requires further efforts in the biomedical domain. In this paper, we describe our participation in phase B of task 5b of the 2017 BioASQ challenge using our biomedical question answering system. Our system, which deals with four types of questions (i.e., yes/no, factoid, list, and summary), is based on (1) a dictionary-based approach for generating the exact answers of yes/no questions, (2) the UMLS Metathesaurus and a term frequency metric for extracting the exact answers of factoid and list questions, and (3) the BM25 model and UMLS concepts for retrieving the ideal answers (i.e., paragraph-sized summaries). Preliminary results show that our system achieves competitive results in both the exact and ideal answer extraction tasks compared with the other participating systems.
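Component (3) ranks candidate snippets with BM25. A minimal sketch of Okapi BM25 scoring over pre-tokenized snippets follows; the parameter values k1 and b are common defaults, not necessarily those used by the system described above:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Snippets with more query-term occurrences score higher, while terms common to most snippets contribute little via the IDF factor.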
Biomedical Question Answering (QA) aims at providing automated answers to user questions regarding a variety of biomedical topics. For example, these questions may ask for information related to diseases, drugs, symptoms, or medical procedures. Automated biomedical QA systems could improve the retrieval of the information necessary to answer these questions. The MEDIQA challenge consisted of three tasks concerning various aspects of biomedical QA. This challenge aimed at advancing approaches to Natural Language Inference (NLI) and Recognizing Question Entailment (RQE), which would then result in enhanced approaches to biomedical QA.
Biomedical question answering has always been a hot topic of research in the QA community at large, due to the significance of the problem and the challenge of dealing with a non-standard vocabulary and vast knowledge sources. The BioASQ challenge has seen large-scale participation from research groups across the world. One of the most prominent among such works is from Chandu et al. (2017), who experiment with different biomedical ontologies, agglomerative clustering, Maximum Marginal Relevance (MMR), and sentence compression. However, they only address ideal answer generation with their model. Peng et al. (2015), in their BioASQ submission, use a three-step pipeline for generating the exact answers for the various question types. The first step is question analysis, where they subdivide each question type into finer categories and classify each question into these subcategories using a rule-based system. They then perform candidate answer generation using POS taggers and use a word frequency-based approach to rank the candidate entities. Wiese et al. (2017) propose a neural QA approach to answer the factoid and list type questions: they use FastQA, a machine comprehension model (Weissenborn et al., 2017), pre-train it on the SQuAD dataset (Rajpurkar et al., 2016), and then fine-tune it on the BioASQ dataset. They report state-of-the-art results on the factoid and list type questions of the BioASQ dataset. Another prominent work is from Sarrouti and Alaoui (2017), who handle the generation of the exact answers. They use a sentiment analysis based approach to answer the yes/no type questions, making use of SentiWordNet. For the factoid and list type questions, they use the UMLS Metathesaurus and a term frequency metric for extracting the exact answers.
In this paper, we described a deep learning approach to the task of biomedical question answering using domain adaptation techniques. Our experiments reveal that mere fine-tuning, in combination with biomedical word embeddings, yields state-of-the-art performance on biomedical QA, despite the small amount of in-domain training data and the lack of domain-dependent feature engineering. Techniques to overcome catastrophic forgetting, such as a forgetting cost, can further boost performance for factoid questions. Overall, we show that employing domain adaptation on neural QA systems trained on large-scale, open-domain datasets can yield good performance in domains where large datasets are not available.
this score, answer snippets were weighted differently depending on the strength of their match.

Factoid and list questions. Factoid and list questions demand slightly different approaches. For both categories, we implemented a rule-based priority queue over answer candidates. The highest priority was given to answers where the question and the answer snippet contained the same predicate and for which the argument type of the answer matched the argument type of the question word (e.g., “what”). The next highest priority was given to answers that were otherwise related to the matching predicate. Here, the argument types “Arg0” and “Arg1” have higher priority. For factoid questions, the top five answers were selected for the submission. For list questions, the maximum number of answers to be listed decreased as the priority levels got lower. This should ensure that, with high probability, we do not leave out a high-priority answer that is correct in our model. Additionally, too many low-priority answers should be avoided to keep an acceptable precision level. Besides the SRL-based priority queue, we introduced a rule for the list question approach. Given the nature of list questions, we detect enumerations by looking for symbols such as commas or the conjunction “and”.

Summary questions and ideal answers. We also investigated SRL for the summarization task. For summary questions, out of the given sets of answer snippets, the system selects the ones with the largest semantic conformity. Similar to factoid and list questions, the semantic conformity is determined by the degree to which question and answer snippet contain similar predicate-argument structures or vocabulary. The same, previously described priority queue is applied. The ideal answers for yes/no, factoid, and list questions were retrieved by selecting the whole answer snippet that included the highest-priority answer to the corresponding question.
If no answer could be determined, we followed the same procedure as for summary questions.
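The rule-based priority queue over answer candidates can be sketched as follows. The priority levels and category names here are illustrative assumptions, not the exact SRL-based rules of the system:

```python
import heapq

# Illustrative priority levels (lower value = higher priority); the
# actual system derives these from SRL-based predicate/argument matching.
PRIORITIES = {
    "predicate_and_argtype_match": 0,  # same predicate, matching argument type
    "related_to_predicate": 1,         # otherwise related to the predicate
    "other": 2,
}

def rank_candidates(candidates, top_k=5):
    """Rank (answer_text, match_category) pairs by rule-based priority.

    Ties are broken by the original candidate order via the index i.
    """
    heap = [(PRIORITIES.get(cat, 2), i, ans)
            for i, (ans, cat) in enumerate(candidates)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(min(top_k, len(heap)))]
```

For factoid questions, the first five entries popped from such a queue would form the submitted answer list.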
The BioASQ competition provides the training and test datasets. The training dataset consists of questions, gold standard documents, concepts, and ideal answers. The test dataset is split between phase A and phase B. The phase A dataset consists of the questions, unique IDs, and question types. The phase B dataset consists of the questions, gold standard documents and snippets, unique IDs, and question types. Exact answers for factoid type questions are evaluated using strict accuracy, lenient accuracy, and MRR (Mean Reciprocal Rank). Answers for list type questions are evaluated based on precision, recall, and F-measure. Ideal answers are evaluated using automatic and manual scores. Automatic evaluation scores consist of ROUGE-2 and ROUGE-SU4, and manual evaluation is done by measuring readability, repetition, recall, and precision.
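The exact-answer metrics above are straightforward to compute; a minimal sketch of MRR for factoid questions and precision/recall/F-measure for a list question (matching here is exact string equality, a simplification of the official evaluation):

```python
def mrr(ranked_answers, gold):
    """Mean Reciprocal Rank: ranked_answers is one ranked candidate list
    per question, gold the correct answer per question."""
    total = 0.0
    for candidates, g in zip(ranked_answers, gold):
        for rank, c in enumerate(candidates, start=1):
            if c == g:
                total += 1.0 / rank
                break
    return total / len(gold)

def list_f_measure(predicted, gold):
    """Precision, recall, and F1 for a single list question."""
    tp = len(set(predicted) & set(gold))
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

Strict accuracy would count only questions whose top-ranked candidate is correct; lenient accuracy counts a question as correct if the gold answer appears anywhere in the candidate list.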
BioKIT's preprocessing pipeline was meant to eliminate problematic characters, but some symbols (e.g., “æ”, “¨o”, or “˙o”) were not handled by the system. As a result, when it ran into such a special character, BioKIT crashed with an error after processing thousands of sentences without returning any result. We collected a set of almost 20 such characters, which we eliminated in our own script-based preprocessing step. Depending on the length of the question or snippet, processing one question or snippet took between roughly 600 milliseconds and a few seconds. This can be rated as fast, but only when labeling many questions at once. If BioKIT was used to process just a single question, its runtime exceeded one minute, given the time necessary to load the models into memory. In spite of the problem with special characters, we found BioKIT to be reliable and stable.
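Such a script-based preprocessing step can be as simple as filtering out the collected characters. The character set below is illustrative; the actual list reportedly contained almost 20 symbols:

```python
# Illustrative subset of the characters that crashed the pipeline;
# the real list collected by the authors had almost 20 entries.
PROBLEM_CHARS = {"æ", "ö", "ō"}

def strip_problem_chars(text, problem_chars=PROBLEM_CHARS):
    """Remove characters known to crash the downstream NLP pipeline."""
    return "".join(ch for ch in text if ch not in problem_chars)
```

Running this once over all questions and snippets before invoking the pipeline avoids the crash without touching ordinary text.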
Abstract. This article describes the participation of the Fudan team in the 2015 BioASQ challenge. The challenge consists of two tasks: large-scale biomedical semantic indexing (task 3a) and biomedical question answering (task 3b). In task 3a, our method, MeSHLabeler, achieved first place in all 15 weeks of the three batches. Based on 3215 annotated citations (June 6, 2015) out of all 4435 citations in the official test set (batch 1, week 2), our best submission achieved 0.6194 in flat micro F-measure. This is 0.0576 (10.25%) higher than 0.5618, obtained by the official NLM solution, Medical Text Indexer (MTI). Task 3b includes two phases. Given the questions raised by a team of biomedical experts from around Europe, the main task of phase A is to find relevant documents, snippets, concepts, and RDF triples, while the main task of phase B is to provide exact and ideal answers. In phase A of task 3b, our submission, fdu, achieved first place in both document and snippet retrieval in batch 5 (June 6, 2015).
Open-domain Arabic question answering. The state of current Arabic QA systems is summarized in (Shaheen and Ezzeldin, 2014): research has focused mostly on open-ended QA using classical information retrieval (IR) methods, and there are no common datasets for comparisons. Consequently, progress has been slow. Furthermore, the Arabic language presents its own set of difficulties: given the highly intricate nature of the language, proper understanding can be difficult. For instance, فسيأكلونها means “so they will eat it”, which demonstrates the complexity that can be expressed by a single word. Moreover, Arabic words require diacritization for their meaning to be completely understood. For example, عَلَّمَ translates into “he taught”, and عَلِمَ means “found out”; modifying one diacritic changes the meaning entirely.
With the widespread availability of information in the internet era, there is renewed interest in retrieving answers to questions that are short and accurate. QA systems aim to return precise, pointed answers rather than flooding the user with documents, or even matching passages, as most information retrieval systems do. For example, for “Who is the first prime minister of India?” the exact answer expected by the user is “Pandit Jawaharlal Nehru”; the user does not intend to read through passages or documents that merely match words like “first”, “prime minister”, or “India”.
Several approaches to this problem have been proposed (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019), but they are not specific to the biomedical domain. Instead, we highlight BioBERT (Lee et al., 2019), a biomedical version of BERT (Devlin et al., 2019), a deeply bidirectional transformer (Vaswani et al., 2017) able to incorporate rich context into the encoding or embedding process, which has been pre-trained on the Wikipedia and PubMed corpora. However, this model fails to account for the spatial and temporal aspects of diseases in biomedical literature, as temporality is not encoded into its input. Furthermore, BioBERT uses a WordPiece tokeniser (Wu et al., 2016), which keeps a fixed-size vocabulary dictionary for learning new words. However, the vocabulary within the model is derived from Wikipedia, a general-domain corpus, and thus BioBERT is unable to learn the distinct morphological semantics of medical terms like -phobia (where ‘-’ denotes suffixation), meaning fear, as it only has an internal representation for -bia.
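WordPiece's greedy longest-match-first segmentation can be sketched as below. The toy vocabulary is hypothetical and chosen to show how a meaningful suffix like -phobia fragments when it is missing from a general-domain vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword segmentation (WordPiece-style).

    Non-initial pieces carry the '##' continuation prefix.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no piece matches: whole word is unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical general-domain vocabulary lacking the suffix "##phobia".
TOY_VOCAB = {"claustro", "##pho", "##bia"}
```

Because "##phobia" is absent, "claustrophobia" splits into pieces that carry no representation of the fear-denoting suffix as a unit.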
Each training instance consists of a question-answer pair with the KB-word specified in the answer. In this paper, we only consider the case of simple factoid questions, which means each question-answer pair is associated with a single fact (i.e., one triple) of the KB. Without loss of generality, we mainly focus on forward relation QA, where the question is on subject and predicate and the answer points to the object. Table 1 shows some examples of the training instances.
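Forward relation QA over a triple store reduces to a lookup keyed on subject and predicate; a minimal sketch over a toy KB (the triples below are illustrative, not from the paper's dataset):

```python
# Toy KB of (subject, predicate, object) triples; entries are illustrative.
KB = [
    ("aspirin", "treats", "headache"),
    ("insulin", "regulates", "blood_sugar"),
]

def answer_forward(subject, predicate, kb=KB):
    """Forward relation QA: the question fixes subject and predicate,
    and the answer is the object of the single matching triple."""
    for s, p, o in kb:
        if s == subject and p == predicate:
            return o
    return None
```

A learned model replaces the exact matching here with scoring, since the subject and predicate must first be recognized in the natural-language question.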
The main feature of our approach is the use of ALM as a language for encoding the meaning of action verbs (i.e., the effects and constraints for the execution of the actions they denote). In addition, we propose to extend an NLP resource about verbs, VerbNet (Kipper-Schuler, 2005; Palmer, 2006), with ALM-based semantic annotations. Other logic formalisms have been used for other NLP tasks (e.g., Recognizing Textual Entailment (NIST, 2008)), but unlike ALM they cannot perform temporal reasoning (MacCartney and Manning, 2007; Harmeling, 2009) or reasoning by cases (Bos and Markert, 2005). This makes them less suitable for answering questions about discourses describing sequences of events.
As the complexity of question answering (QA) datasets evolves, moving away from restricted formats like span extraction and multiple-choice (MC) toward free-form answer generation, it is imperative to understand how well current metrics perform in evaluating QA. This is especially important as existing metrics (BLEU, ROUGE, METEOR, and F1) are computed using n-gram similarity and have a number of well-known drawbacks. In this work, we study the suitability of existing metrics in QA. For generative QA, we show that while current metrics do well on existing datasets, converting multiple-choice datasets into free-response datasets is challenging for current metrics. We also look at span-based QA, where F1 is a reasonable metric.
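The token-overlap F1 commonly used for span-based QA can be sketched as follows (a simplified version without the answer normalization, such as lowercasing and article removal, that official evaluation scripts apply):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer span."""
    pred_toks, gold_toks = prediction.split(), gold.split()
    common = Counter(pred_toks) & Counter(gold_toks)   # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    p = overlap / len(pred_toks)
    r = overlap / len(gold_toks)
    return 2 * p * r / (p + r)
```

Being purely n-gram (here unigram) based, this metric rewards lexical overlap rather than meaning, which is one of the drawbacks discussed above.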
is “interesting” enough to the scenario such that it deserves to be mentioned in a topic-relevant question. For example, Figure 4 illustrates an answer that includes two predicates and four entities. In this case, four types of reference are used to associate these linguistic objects with other related objects: (a) definitional reference, used to link entity (E1) “Anwar Sadat” to a corresponding attribute, “Egyptian President”; (b) metonymic reference, since (E1) can be coerced into (E2); (c) part-whole reference, since “BW stockpiles” (E4) necessarily imply the existence of a “BW program” (E5); and (d) relational reference, since validating is subsumed as part of the meaning of declaring (as determined by WordNet glosses), while admitting can be defined in terms of declaring, as in declaring [to be true].
The third strand of research uses paraphrases more directly. The idea is to paraphrase the question and then submit the rewritten version to a QA module. Various resources have been used to produce question paraphrases, such as rule-based machine translation (Duboue and Chu-Carroll, 2006), lexical and phrasal rules from the Paraphrase Database (Narayan et al., 2016), as well as rules mined from Wiktionary (Chen et al., 2016) and large-scale paraphrase corpora (Fader et al., 2013). A common problem with the generated paraphrases is that they often contain inappropriate candidates. Hence, treating all paraphrases as equally felicitous and using them to answer the question could degrade performance. To remedy this, a scoring model is often employed, albeit independently of the QA system used to find the answer (Duboue and Chu-Carroll, 2006; Narayan et al., 2016). Problematically, the separate paraphrase models used in previous work do not fully utilize the supervision signal from the training data, and as such cannot be properly tuned.
For CLEF 2004 (Magnini et al. 2004), the main evaluation shifted to nine source languages and seven target languages, constituting fifty different mono- and bilingual tasks. Question sets were augmented with How and Definition questions, and only one exact answer per question was allowed. In parallel, a new pilot task was added: list questions, but with conjunctive and disjunctive series, and temporal restrictions in questions (before, during, or after a date, a period, or an event). It studied the self-confidence ability of QA systems (QAS).
This paper presents a system which learns to answer questions on a broad range of topics from a knowledge base using few hand-crafted features. Our model learns low-dimensional embeddings of words and knowledge base constituents; these representations are used to score natural language questions against candidate answers. Training our system using pairs of questions and structured representations of their answers, as well as pairs of question paraphrases, yields competitive results on a recent benchmark from the literature.
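The scoring step can be sketched as below: questions and candidate answers are embedded by summing the vectors of their constituents, and a dot product scores the pair. The vocabulary and random vectors are placeholders; in the paper, the embeddings are learned from the training pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embedding table; in the actual system, these vectors are learned
# from question-answer pairs and question paraphrases.
VOCAB = ["who", "wrote", "hamlet", "shakespeare", "einstein"]
EMB = {w: rng.normal(size=8) for w in VOCAB}

def embed(text):
    """Bag-of-words embedding: sum of the word vectors of known words."""
    return sum(EMB[w] for w in text.split() if w in EMB)

def score(question, candidate):
    """Dot-product score between question and candidate-answer embeddings."""
    return float(np.dot(embed(question), embed(candidate)))
```

At answer time, the candidate with the highest score against the question embedding is returned.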
To develop additional instances of question-answer sets for the embedded reuse corpus, the TREC QA document collections would be a valuable resource. The embedded reuse test collection requires two general types of questions. One set of questions should provide what we called the base questions in the embedded relation. As pointed out, these are trivia-like questions that inquire about a BasicNP. The BasicNPs in these base questions help to populate the BasicNP lexicon. Much of the earlier work in the field of QA has centered around fact-based questions. Certainly, selected questions in the data sets of the earlier TREC QA tracks can be used with no or only minor modification for the base set. These include the TREC9 QA task and the TREC10 QA main and list tasks data sets.