
Main Contribution: DoSeR, a Robust Entity Linking Framework

Most existing EL systems are highly optimized toward a specific data set, KB or domain, but do not (fully) provide Structural Robustness and Consistency. To create such a robust EL system, we first analyze three crucial components of EL algorithms in Chapters 5, 6 and 7 to gain new insights into the techniques and algorithms whose usage substantially influences Robustness. In these chapters, we reveal the following three core findings in terms of robust EL, which are considered in our robust EL framework:

1. We show that a federated approach leveraging knowledge from entity-centric and document-centric KBs can (significantly) improve the Consistency of EL systems.

2. We present a new state-of-the-art entity relatedness measure for topical coherence computation that provides Structural Robustness and Consistency.

3. We present Doc2Vec as a textual context matching technique that provides Structural Robustness and Consistency in terms of low quantity (short) entity descriptions.

Based on these findings and outcomes, we aim to construct a robust, state-of-the-art EL framework. More specifically, in Chapter 8, we present DoSeR (Disambiguation of Semantic Resources). DoSeR is a KB-agnostic EL framework that extracts relevant entity information from multiple (entity-centric and document-centric) KBs in a fully automatic way. Further, it creates the indexes and models that are required later on by the algorithms used. DoSeR accepts different types of input documents such as tables, news articles and tweets, whereby each document provides one or multiple previously annotated surface forms. Our main EL algorithm in DoSeR utilizes semantic entity and document embeddings for entity relatedness and textual context matching computation and represents a new collective, graph-based approach. The DoSeR algorithm is also able to abstain if no appropriate candidate entity can be found for a specific surface form. To evaluate the EL accuracy, we conducted experiments on general-domain KBs (e.g., Wikipedia, DBpedia, YAGO3) and special-domain KBs (e.g., Uniprot). In our evaluation, we compare DoSeR to other publicly available (e.g., Wikifier [Rat11], AIDA [Hof11] and AGDISTIS [Usb14]) and non-publicly available (e.g., Probabilistic Bag-Of-Hyperlinks model [Gan16]) EL systems and discuss the achieved results in detail. In our experiments, DoSeR outperforms current state-of-the-art EL systems over a wide range of very different data sets and domains. Moreover, DoSeR provides Structural Robustness and Consistency in terms of most criteria. We also provide DoSeR as well as the underlying KBs as open source solutions.

Table 4.2 provides an overview of the conducted experiments in the respective chapters. A ‘check’ in parentheses indicates that this experiment was either not fully conducted or that its outcomes are deduced from other experiments (some additional experiments may be required to fully confirm the results).

Table 4.2: Overview of conducted experiments in the respective chapters

Experiments                      Knowledge Bases   Entity Relatedness   Textual Context   DoSeR
                                 (Chapter 5)       (Chapter 6)          (Chapter 7)       (Chapter 8)

Different Types of KBs           ✓                 ✓                    ✓                 ✓

Different Document Structures    ✗                 ✓                    ✓                 ✓

Various domains                  (✓)               (✓)                  ✗                 ✓

Large and heterogeneous KBs      ✓                 ✗                    ✗                 ✗

Low quantity of entity data      ✓                 (✓)                  ✓                 ✓

Knowledge Bases

In this chapter, we investigate how and to which extent various knowledge base (KB) properties influence Entity Linking (EL) results. The evaluated KB properties are (i) the entity format, i.e., the way entities are described (intensionally or extensionally), (ii) user data, i.e., the quantity and quality of externally disambiguated entities, and (iii) the quantity and heterogeneity of entities, i.e., the number and size of different domains in a KB. To this end, we implemented three ranking-based EL systems to address various entity definitions and provide a systematic evaluation of the defined KB properties in the biomedical domain. In our evaluation, we show that (i) the choice of the entity format to achieve the best EL results depends on the amount of available user data, (ii) the entity format strongly affects EL results with large-scale and heterogeneous KBs, (iii) all evaluated approaches are robust against a moderate amount of noise in user data, and (iv) a federated approach that leverages both entity formats (i.e., intensional and extensional entity definitions) can significantly improve the Consistency of EL systems. This chapter covers and combines the ideas, findings and materials published in the works [Zwi13b], [Zwi15a] and [Zwi15c].

The remainder of the chapter is structured as follows: In Section 5.1, we briefly introduce the chapter’s core question, the contributions and the results. In Section 5.2, we model the evaluated KB properties. Section 5.3 describes the implementations of our EL systems. Section 5.4 analyzes the biomedical data set CalbC that is used in our evaluation. Section 5.5 presents experiments in the form of an in-depth evaluation. Finally, we conclude the chapter in Section 5.6.

5.1 Introduction

KBs represent an important aspect in EL systems by defining the basic conditions. These include the underlying domain, the specific set of entities and the entity information that can be leveraged for EL. A robust EL system, however, should be able to achieve consistent results on various domains, with a large number of entities and with a low quantity and poor quality of entity definitions (cf. Chapter 4). Basically, all these Consistency criteria refer to content-related KB properties. So far, it is unclear how and to which extent content-related KB properties influence EL results in general.

In this chapter, we pose the following research question:

Research Question: How and to which extent do content-related KB properties influence EL results?

To answer this question, we select the following three crucial KB properties whose influences are investigated throughout this chapter:

• Entity format, i.e., the way entities are described, that is intensionally (i.e., logical representations like descriptions) or extensionally (i.e., through instances and usage).

• User data, i.e., quantity and quality of externally disambiguated entities within entity-annotated documents.

• Quantity and heterogeneity of entities to disambiguate, i.e., the number and size of different domains in a KB.
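To make the three properties more concrete, the following is a minimal Python sketch of how a KB entry and a simple KB profile could be represented. It is an illustration only, not the data model used in this work: the class `Entity`, its field names and the helper `kb_profile` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """Hypothetical KB entry; all field names are illustrative assumptions."""
    uri: str
    domain: str
    # Intensional definition: a textual description of the entity.
    description: str = ""
    # Extensional definition: contexts in which the entity was annotated (user data).
    annotated_contexts: List[str] = field(default_factory=list)


def kb_profile(entities: List[Entity]) -> Dict[str, object]:
    """Summarize entity format, user data and domain sizes for a list of entities."""
    domains: Dict[str, int] = {}
    for e in entities:
        domains[e.domain] = domains.get(e.domain, 0) + 1
    return {
        "with_description": sum(1 for e in entities if e.description),
        "with_user_data": sum(1 for e in entities if e.annotated_contexts),
        "user_data_size": sum(len(e.annotated_contexts) for e in entities),
        "domains": domains,
    }
```

In this sketch, an entity may carry an intensional definition, an extensional definition, both, or neither, and the number of entities per domain gives a rough view of the quantity and heterogeneity of the KB.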

To evaluate these KB properties, we focus on the biomedical domain, which is extensively represented by several large data sets and KBs. Moreover, the problems of missing user data and of large-scale, heterogeneous KBs are particularly relevant in this domain. Generally, biomedical EL is a challenging task due to a considerable degree of ambiguity and, thus, has attracted much research attention over the last decade [Zwi15b].

In terms of EL approaches, we implemented three Learning-To-Rank-based (LTR) algorithms. Two approaches rely on intensional and extensional entity definitions, respectively. With our third and federated approach, we investigate whether we can further improve EL results by leveraging the knowledge from different entity formats, such as intensional and extensional entity definitions. To this end, our federated approach combines the result lists of both single approaches by means of LTR.
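The idea of combining two result lists can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation evaluated in this chapter: the function `federate` is hypothetical, and a fixed linear weighting stands in for a trained LTR model, which would learn the combination from annotated training data.

```python
def federate(intensional, extensional, weights=(0.5, 0.5)):
    """Merge the candidate scores of two single approaches into one ranking.

    intensional / extensional: dicts mapping a candidate entity id to the
    score assigned by the respective approach. `weights` is an assumed
    stand-in for a learned LTR model.
    """
    w_int, w_ext = weights
    candidates = set(intensional) | set(extensional)
    # A candidate missing from one list contributes a score of 0.0 for it.
    combined = {
        c: w_int * intensional.get(c, 0.0) + w_ext * extensional.get(c, 0.0)
        for c in candidates
    }
    # Highest combined score first; ties are broken by id for determinism.
    return sorted(combined, key=lambda c: (-combined[c], c))


# Hypothetical candidate ids and scores for a single surface form.
ranking = federate({"A": 0.9, "B": 0.4}, {"B": 0.8, "C": 0.7})
```

With equal weights, candidate `B` is ranked first because it receives support from both approaches, which is the intuition behind leveraging both entity formats.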

Overall, our contributions in this chapter can be summarized as follows:

• We provide a systematic evaluation of (biomedical) EL with respect to the entity format, user data and the quantity and heterogeneity of entities.

• We show that the choice of the entity format, which is used to attain the best EL results, strongly depends on the amount of available user data.

• We show that the entity format strongly affects EL results with large-scale and heterogeneous KBs.

• We show that all evaluated approaches are robust against a moderate amount of noise in user data.

• We show that by using a federated approach, which leverages both entity formats, the Consistency of EL systems can be improved significantly.