Thesis Overview - Using natural language processing for question answering in closed and open d

processing several steps, such as classifying the question, determining the EAT, and generating constraints. The constraints form the core of the QSiS; they formulate the relevant keywords syntactically and semantically so that downstream steps can use them. In addition, the approach uses dependency parsing, WordNet and Semantic Web technologies to build and complete the conceptual information of the QSiS. Although the same methodology implements the QSiS in both scenarios, supplementary semantic information is added in the ontology-based closed-domain scenario. The QSiS automatically provides a core dictionary of variables, which binds related terms to their corresponding variables, by means of two technical facilities. First, the constraints integrate all of the syntactic information from dependency parsing with the lexical meaning and semantic information from WordNet or the ontology. Second, the nature of the constraints gives each question type a uniform formulation from which the dictionary of variables and the relationships between them are built. This makes available the information needed for proper inference and for handling the challenges of mapping a question pattern to a SPARQL query template.
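The idea of a dictionary of variables backed by constraints can be illustrated with a minimal sketch. The field names, the example terms, and the `ontology:` labels below are assumptions made for illustration only; they are not the actual QSiS layout used in the thesis.

```python
from dataclasses import dataclass, field


@dataclass
class Constraint:
    """One constraint tying a question term to a variable (hypothetical layout)."""
    variable: str    # variable name, e.g. "?x1"
    term: str        # surface word(s) taken from the question
    dependency: str  # syntactic relation from the dependency parse
    sense: str       # semantic label, e.g. a WordNet synset or an ontology class


@dataclass
class QSiS:
    """Toy container for the syntactic-semantic information of one question."""
    question_type: str
    eat: str  # expected answer type
    constraints: list = field(default_factory=list)

    def add(self, variable, term, dependency, sense):
        self.constraints.append(Constraint(variable, term, dependency, sense))

    def variables(self):
        """Dictionary of variables: variable name -> bound question term."""
        return {c.variable: c.term for c in self.constraints}


# Invented example: "Who is the manager of the sales department?"
qsis = QSiS(question_type="who", eat="Person")
qsis.add("?x1", "manager", "nsubj", "ontology:Manager")
qsis.add("?x2", "sales department", "nmod", "ontology:Department")
print(qsis.variables())
```

Keeping the syntactic relation and the semantic label on the same record is what lets a later step read both kinds of information from a single place.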

The fourth contribution is the definition and generation of a question graph (QGraph) that represents the core components of the question, further enriched with implicit knowledge (from the ontology in the first scenario). This graph is a subgraph of the graph representing the domain, and it is generated precisely, coherently and completely from the QSiS provided by the upstream process. The QGraph is used both as a search space for locating the answer and as a resource for enriching the constraint sets and EATs.
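As a rough illustration of a question graph, the sketch below builds an adjacency map from triples that link question variables to ontology entities. The triple format, the relation names, and the function `build_qgraph` are hypothetical and stand in for whatever representation the thesis actually uses.

```python
from collections import defaultdict


def build_qgraph(triples):
    """Build an undirected adjacency map from (subject, relation, object) triples."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
        graph[obj].append((rel, subj))
    return dict(graph)


# Invented triples linking question variables to ontology classes.
triples = [
    ("?x1", "instanceOf", "Manager"),
    ("?x1", "manages", "?x2"),
    ("?x2", "instanceOf", "Department"),
]
qgraph = build_qgraph(triples)
print(sorted(qgraph))
```

Because every edge is stored in both directions, the same structure can serve as a search space that is traversed from any variable or entity.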

The fifth contribution is a graph-based inference approach for the first (closed-domain) scenario. A graph-based answer inference algorithm is applied to the QGraph in the closed-domain scenario. The structure of the QGraph is analysed to find the relations among all of the involved variables and ontology entities that lead to the EAT ontology items. For many fundamental problems in Artificial Intelligence, adopting a graph-based framework can be straightforward and very effective. The significance of the proposed algorithm (an empirical technique) lies in extracting a precise answer from the whole body of semantic information generated for the question.
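The thesis's own inference algorithm is not reproduced here. As a stand-in, the following sketch uses plain breadth-first search to find a chain of relations from a question variable to an EAT node in a small graph. The graph contents and the function name `shortest_path_to_eat` are assumptions for illustration.

```python
from collections import deque


def shortest_path_to_eat(graph, start, eat):
    """Breadth-first search for a shortest path from `start` to the EAT node."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == eat:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the EAT node is not reachable from `start`


# Invented graph: variables (?x1, ?x2) connected to ontology classes.
graph = {
    "?x2": ["Department", "?x1"],
    "?x1": ["?x2", "Manager"],
    "Manager": ["?x1", "Person"],
    "Department": ["?x2"],
    "Person": ["Manager"],
}
path = shortest_path_to_eat(graph, "?x2", "Person")
print(path)
```

A path such as this makes explicit which variables and entities connect the question terms to the expected answer type.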

1.8 Thesis Overview

In this section, we outline the content and organization of the remaining chapters of this document. After this introduction, Chapter 2 gives an extensive review of the state of the art, in which the relevant approaches proposed for semantic-based QA systems are presented and analyzed. Some of the presented QA models were built within a specific domain. Some are domain-independent (open-domain) systems, such as QuestIO [12], AquaLog [37], DeepQA [38], [39] (IBM Watson¹), QAKiS [40] and SINA [41], and some are domain-dependent (closed-domain) systems, such as QACID [17], ONLI+ [42] and Pythia [26]. Moreover, PANTO [43], AquaLog [37] and QuestIO [12] are systems that act as natural language interfaces. The frameworks, tools, and combined solutions and techniques used for IR, text mining and QA in NLP are also introduced. Recently, specialized QA systems have been developed, such as EAGLi¹ for health and life scientists. An increasing number of QA systems use the World Wide Web or LOD as a text corpus and knowledge base.

Chapter 3 presents the general architecture of the proposed ScoQAS, which consists of several steps that are briefly explained below. The core of the thesis work is presented in this chapter. The ScoQAS operates over ontologies rather than free text, and in two scenarios: a closed-domain scenario, where the Enterprise ontology supports the domain of questions and answers, and an open-domain scenario, where the answers are retrieved from a LOD² knowledge base. There are common and specific components, modules, and KBs. Most of them are common and used in both scenarios; the scenario-specific modules are described individually in this chapter. We use common components such as the Stanford CoreNLP³ parser, question preprocessing, NLTK WordNet, and a SPARQL query engine, while the specific components, such as the question representation, the rule-based question classifier, the constraint-building component (QSiS), the graph construction component, the answer extraction component, the pattern-to-SPARQL mapping, OpenLink Virtuoso, and the SPARQL query construction component, were designed and implemented in this work.
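To give a feel for what a rule-based question classifier does, here is a deliberately simple sketch based on regular expressions over the opening words of a question. It is not the semantic-based structure-feature pattern approach of the thesis; the rules, type labels, and EAT names are invented for illustration.

```python
import re

# Hypothetical rules: (pattern over the lower-cased question, question type, EAT).
RULES = [
    (re.compile(r"^who\b"), "who", "Person"),
    (re.compile(r"^where\b"), "where", "Location"),
    (re.compile(r"^when\b"), "when", "Date"),
    (re.compile(r"^how many\b"), "how-many", "Number"),
]


def classify(question):
    """Return (question type, expected answer type) for the first matching rule."""
    text = question.strip().lower()
    for pattern, qtype, eat in RULES:
        if pattern.search(text):
            return qtype, eat
    return "unknown", "Thing"  # fallback when no rule applies


print(classify("Who manages the sales department?"))
```

Rules are tried in order, so more specific patterns would need to be listed before more general ones.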

¹ http://bitem.hesge.ch/content/eagli-eagle-eye
² http://linkeddata.org/

In the first two steps of the ScoQAS, the user question is pre-processed by NLP parsing, and the output is put into a specific format so that its content can be easily used in the next steps. In step 3, a method is introduced to represent the syntactic structure of a question (e.g. morphological analysis and dependency relations), which is described in more detail in Section 3.3. In step 4, the typology of questions is presented and a question classifier is built by implementing a semantic-based structure-feature pattern approach. In step 5, the role of the remaining words in the question is specified semantically and syntactically: all of these words are analyzed in terms of their position and their relationship with the pattern items, and constraints are built to create the question syntactic-semantic information structure (QSiS) for the associated question type. Up to this step, all of the mentioned components operate alike in both scenarios; the downstream steps use separate components for each scenario. For the first scenario, in step 6, an empirical method is presented for creating a question graph (QGraph) whose nodes and edges indicate the dependencies between ontology entities and the corresponding question variables. The QGraph is produced from the constraint information generated when the QSiS was formed. Finally, in step 7, the inference method is applied to the QGraph to extract the precise answer. For the second scenario, the ScoQAS goes on to generate a formal query (SPARQL) by mapping a structural format using information obtained from the upstream steps (e.g. the QC and Constraints modules). The structure of the produced SPARQL query templates is bound with the constraint information. In the final step, to get the answer, the generated formal query is sent to the Virtuoso¹ DBpedia endpoint² to crawl the LOD resource.
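Binding a SPARQL query template with constraint information can be sketched as simple template substitution. The template text, the binding names, and the example DBpedia URIs below are illustrative assumptions, not the templates used by the ScoQAS. The request URL follows the standard SPARQL protocol convention of passing the query in a `query` parameter; no network call is made.

```python
from string import Template
from urllib.parse import urlencode

# Hypothetical template for a "what is the <property> of <entity>" question pattern.
TEMPLATE = Template(
    "SELECT DISTINCT ?answer WHERE { <$entity> <$prop> ?answer . }"
)

ENDPOINT = "http://dbpedia.org/sparql"


def build_query(template, bindings):
    """Fill the template slots with URIs taken from the constraint information."""
    return template.substitute(bindings)


# Invented bindings for a question such as "What is the capital of Germany?"
bindings = {
    "entity": "http://dbpedia.org/resource/Germany",
    "prop": "http://dbpedia.org/ontology/capital",
}
query = build_query(TEMPLATE, bindings)
request_url = ENDPOINT + "?" + urlencode({"query": query})
print(query)
```

Using `substitute` rather than `safe_substitute` means a missing binding raises a `KeyError`, so an incompletely bound template is never sent to the endpoint.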

Chapter 4 provides an empirical evaluation of our implemented ScoQAS system in both scenarios. First, we analyze the first scenario with a set of questions posed over the Enterprise ontology. Then, in the second scenario, the preliminary results are analyzed and the accuracy of the system is tested on the QALD³ training and test sets, a standard benchmark.

Chapter 5 presents the conclusions and the important aspects of the thesis work with respect to the research questions, and summarizes the main contributions. Finally, directions for future research are suggested.

The thesis closes with Appendices A, B, C, D, E, F, G, H, I, and J. Appendix A lists the questions used in the first scenario. Appendix B is a table of questions and corresponding question types for our training questions, chosen from the QALD-3 training and test sets, which were provided for the second scenario. Appendix C shows the details of the bound variables and corresponding constraints for a sample question in the first scenario. Appendix D shows the detailed results of the ScoQAS over the QALD-2 test set, which consists of 99 questions. Appendix E shows the analysis parameters and detailed results obtained by the ScoQAS over the QALD-3 test set. Appendices F and G illustrate the analysis items for the QALD-4 and QALD-5 test sets, respectively. Other results are summarized in Section 4.3. Appendix H shows the generalized pseudocode for building the QGraph. Appendix I contains our publications during this research work, which show the basic and developed framework of the ScoQAS.

¹ https://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VOSSparqlProtocol#SPARQL%20Service%20Endpoint
² http://dbpedia.org/sparql
³ http://qald.sebastianwalter.org/

2 State of the Art

In document Using natural language processing for question answering in closed and open domains (Page 33-37)