Doc2Vec Parameter Study - Textual Context Matching Based on Semantic Document Embeddings


7.4.4 Doc2Vec Parameter Study

The Doc2Vec implementation in Gensim provides a wealth of parameter settings that may influence the resulting document embeddings. In our experiments, we chose the settings suggested in [Dai15; Lau16; Zwi16b] (negative-sampling=5, window-size=5, minimum-count=8 and iterations=200). In the following, we compare the PV-DM and PV-DBOW Doc2Vec models with varying numbers of dimensions to determine the best settings for the EL task.
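The settings above can be written down as a Gensim configuration. The following is a minimal sketch, not the configuration code used in our experiments: it assumes Gensim 4.x keyword names (`vector_size`, `epochs`; older releases used `size` and `iter`), and the mapping of "iterations" to `epochs` and of "minimum-count" to `min_count` is our reading of the parameter names. The helper names `doc2vec_kwargs` and `train` are hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Settings quoted in the text. Mapping them onto Gensim keyword names
# (negative, window, min_count, epochs) is an assumption for Gensim >= 4.0.
BASE_SETTINGS: Dict[str, Any] = {
    "negative": 5,    # negative-sampling
    "window": 5,      # window-size
    "min_count": 8,   # minimum-count
    "epochs": 200,    # iterations
}


def doc2vec_kwargs(architecture: str, dimensions: int) -> Dict[str, Any]:
    """Build keyword arguments for gensim.models.doc2vec.Doc2Vec."""
    if architecture not in ("PV-DM", "PV-DBOW"):
        raise ValueError(f"unknown architecture: {architecture}")
    kwargs = dict(BASE_SETTINGS)
    # In Gensim, dm=1 selects PV-DM and dm=0 selects PV-DBOW.
    kwargs["dm"] = 1 if architecture == "PV-DM" else 0
    kwargs["vector_size"] = dimensions
    return kwargs


def train(paragraphs: Iterable[List[str]], architecture: str, dimensions: int):
    """Train one model on tokenized paragraphs (requires Gensim to be installed)."""
    # Imported lazily so that the configuration helper works without Gensim.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [
        TaggedDocument(words=tokens, tags=[i])
        for i, tokens in enumerate(paragraphs)
    ]
    return Doc2Vec(documents=corpus, **doc2vec_kwargs(architecture, dimensions))
```

Calling `doc2vec_kwargs("PV-DBOW", 400)` yields the argument set for a 400-dimensional PV-DBOW model. Note that with `min_count=8`, `train` needs a corpus large enough for words to occur at least eight times; a toy corpus would leave the vocabulary empty.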

Figure 7.5 depicts the micro-averaged F1 values across all data sets of our EL approach when using either the PV-DM or the PV-DBOW Doc2Vec architecture with a specific number of dimensions. The reported results refer to Doc2Vec models based on the Wikipedia paragraph corpus whose construction was explained earlier. Both architectures achieve their best results with d = 400 dimensions, with PV-DM leading PV-DBOW by ≈2 F1 percentage points. One reason for this outcome might be that, in contrast to PV-DBOW, PV-DM takes the word order into consideration, at least in a small context, in the same way that an n-gram model with a large n would do [Le14]. It is also interesting to see that the averaged F1 values of both architectures with 800 dimensions drop by up to 5 F1 percentage points compared to d = 400. One reason might be that a high number of dimensions leads to some kind of overfitting; thus, the optimal number of dimensions for embeddings probably depends on the number of entities and the amount of training data [Zwi16b]. Similar to the results achieved in [Zwi16b], PV-DBOW provides slightly more robust F1 results with fewer dimensions. With 200 or fewer dimensions, PV-DBOW tops its counterpart by up to ≈3 F1 percentage points. Since we are interested in the best overall results, we suggest using PV-DM for context matching in EL systems in the future. However, a careful analysis of the underlying corpus and an adaptation of the Doc2Vec parameter settings are required to confirm the results.
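For readers unfamiliar with micro-averaging, the following sketch shows one common way to compute a micro-averaged F1 value across data sets: the true positives, false positives and false negatives are pooled over all data sets before precision and recall are computed. This is an illustration under that assumption, not the evaluation code behind the reported numbers; the function name `micro_f1` and the example counts are hypothetical.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one data set
Counts = Tuple[int, int, int]


def micro_f1(per_data_set: Dict[str, Counts]) -> float:
    """Pool the counts of all data sets, then compute a single F1 value."""
    tp = sum(counts[0] for counts in per_data_set.values())
    fp = sum(counts[1] for counts in per_data_set.values())
    fn = sum(counts[2] for counts in per_data_set.values())
    if tp == 0:
        return 0.0  # no correct annotation at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts, purely for illustration.
example = {"large_set": (8, 2, 2), "small_set": (1, 4, 4)}
score = micro_f1(example)
```

Because the counts are pooled, data sets with many annotations dominate the score. In the made-up example, the per-data-set F1 values are 0.8 and 0.2 (macro average 0.5), whereas the pooled value is 0.6.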

[Plot of Figure 7.5. x-axis: Doc2Vec feature space (number of dimensions; ticks 0, 45, 90, 150, 250, 400, 600, 800); y-axis: F1 (0.2 to 0.55); series: PV-DM, PV-DBOW]

Figure 7.5: Micro-averaged F1 values of our approach with the Doc2Vec architectures PV-DM and PV-DBOW and varying numbers of dimensions

In summary, we note that in our experiments with Wikipedia paragraphs as documents, the PV-DM architecture performs best when using d = 400 dimensions. Further increasing the number of dimensions leads to a (significant) decrease of EL results with both architectures. However, the optimal number of dimensions has to be re-determined for other KBs.

7.5 Conclusion

In this chapter, we presented the neural-network-based approach Doc2Vec as a textual context matching technique for robust EL. In this context, we provided a systematic comparison to four other popular context matching techniques: the VSM with Apache Lucene TF-IDF weights, Okapi BM-25, the Entity-Context Model and the Thematic Context Distance with LDA. In our experiments, we evaluated all approaches by first determining the best textual context length, measured in words, for each approach across all data sets. Then, we analyzed and discussed the results of all approaches on different data sets when using Wikipedia as the entity-describing source. Further, we re-conducted the data set experiments with entity descriptions located in the entity-centric KB DBpedia to investigate the robustness of the approaches in terms of short entity descriptions. Finally, we provided a parameter study of the Doc2Vec architectures PV-DM and PV-DBOW.

We showed that Doc2Vec achieves state-of-the-art results on most data sets and provides Structural Robustness. Moreover, it provides consistent results with (very) short entity descriptions in the underlying KB. We also showed that the VSM with adapted TF-IDF weights outperforms other state-of-the-art context matching techniques if a sufficient number of surface form context words is given.

A limitation of our work is the small number of evaluated KBs. With Wikipedia and DBpedia denoting general-domain KBs, we cannot entirely generalize the achieved results to any (special-domain) KB. Moreover, we did not evaluate the performance of the context matching techniques. We are going to tackle both types of experiments in the near future.

A Robust Entity Linking System

DoSeR - Disambiguation of Semantic Resources

In this chapter, we combine the results of the previous chapter to construct DoSeR (Disambiguation of Semantic Resources), a robust (i.e., providing Structural Robustness and Consistency), state-of-the-art Entity Linking (EL) system. DoSeR is a knowledge base (KB) agnostic EL framework that extracts relevant entity information from multiple (entity-centric and document-centric) KBs in a fully automatic way. The main EL algorithm in DoSeR utilizes semantic entity and document embeddings for entity relatedness and textual context matching computation and represents a new collective, graph-based approach. Our approach is also able to abstain if no appropriate entity can be found for a specific surface form. In our evaluation, we analyze how DoSeR performs on general-domain KBs (i.e., Wikipedia, DBpedia, YAGO3) and special-domain KBs (e.g., Uniprot). We compare DoSeR to other publicly (e.g., Wikifier [Rat11]) and non-publicly (e.g., Probabilistic Bag-Of-Hyperlinks model [Gan16]) available EL systems. Our system achieves significantly (>5%) better results than all other publicly available approaches on various document structures and types (e.g., news, tables). This chapter partially covers the ideas, findings and materials published in the works [Zwi16a] and [Zwi16b].

The remainder of this chapter is structured as follows: After introducing the chapter in Section 8.1, we provide an overview of the DoSeR framework in Section 8.2. Section 8.3 presents the data sets used in our evaluation. In Section 8.4, we describe the experimental setup and the achieved results. We conclude the chapter in Section 8.5.

8.1 Introduction

The ultimate goal and main research question in this work is to create a robust EL system in terms of Structural Robustness and Consistency. To this end, we first analyzed three crucial components of EL algorithms to gain new insights into techniques and algorithms whose usage essentially influences Robustness in EL systems. These components are the underlying KB, the entity relatedness measure and the textual context matching technique. Overall, we revealed the following three core findings in terms of robust EL, which we aim to consider in our EL framework:

1. Knowledge Bases (Chapter 5): We showed that a federated approach leveraging knowledge from entity-centric and document-centric KBs can (significantly) improve the Consistency of EL systems.

2. Entity Relatedness (Chapter 6): We proposed a new state-of-the-art entity relatedness measure that provides Structural Robustness and consistent results with a low quantity and poor quality of entity definitions.

3. Textual Context (Chapter 7): We presented Doc2Vec as a textual context matching technique that provides Structural Robustness and consistent results with long and short entity definitions.

Based on these findings, we present DoSeR, a robust EL framework in terms of Structural Robustness and Consistency that achieves state-of-the-art results on various KBs, domains, and document structures and types. DoSeR is KB-agnostic in order to complement entity-centric and document-centric KBs in terms of entity coverage, i.e., the total number of entities available in a KB, and entity description, i.e., the completeness and quality of the description of one entity. Further, the graph-based EL algorithm in DoSeR unifies our proposed and robust semantic entity embeddings (cf. Chapter 6) for collective EL and entity-context embeddings (cf. Chapter 7) for surrounding context matching. In the case of our algorithm being uncertain about the correct entity target, our approach abstains by returning the pseudo-entity NIL.
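To make the idea of embedding-based context matching with abstention concrete, the following sketch scores each candidate entity by the cosine similarity between a context embedding and the entity's embedding, and returns NIL if no candidate is similar enough. It is a deliberately simplified, purely local illustration and not the collective, graph-based DoSeR algorithm; the function names, the fixed `threshold` value and the toy vectors are all assumptions made for this example.

```python
import math
from typing import Dict, Sequence

NIL = "NIL"  # pseudo-entity returned when the linker abstains


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link(
    context_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # hypothetical value, would need tuning
) -> str:
    """Return the best-matching candidate, or NIL if none exceeds the threshold."""
    best_entity, best_score = NIL, threshold
    for entity, entity_vec in candidates.items():
        score = cosine(context_vec, entity_vec)
        if score > best_score:
            best_entity, best_score = entity, score
    return best_entity


# Toy two-dimensional vectors, purely for illustration.
linked = link([1.0, 0.0], {"Entity_A": [1.0, 0.1], "Entity_B": [0.0, 1.0]})
abstained = link([1.0, 0.0], {"Entity_B": [0.0, 1.0]})
```

In the toy example, `linked` is `"Entity_A"` because its vector points in nearly the same direction as the context vector, while `abstained` is `"NIL"` because the only candidate is orthogonal to the context.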

In particular, we provide the following contributions:

• We present DoSeR, a new state-of-the-art (named) EL framework that emphasizes Robustness in terms of Structural Robustness and Consistency.

• We evaluate our algorithm against other state-of-the-art EL systems on 16 data sets overall and show that our approach outperforms all other systems by a significant margin on nearly all data sets.

• We discuss the influence of the quality of the underlying KB on the EL accuracy and indicate that our algorithm achieves better results than non-publicly available state-of-the-art algorithms.

• We provide our EL system as well as the underlying KB as open source solutions¹. These resources allow a fair comparison between future EL algorithms and our approach that is not biased by the KB.

In document Robust Entity Linking in Heterogeneous Domains (Page 147-154)