8.2 DoSeR Framework
8.2.5 Entity Linking Graph and PageRank
In our approach, we generate an EL graph twice in order to link high probability candidate entities first and to perform abstaining afterwards. On this graph, we perform a random walk and determine the entity relevance, which can be seen as the average number of its visits. The random walk is simulated by a PageRank algorithm that permits edge weights and non-uniformly-distributed random jumps [Bri98;Whi03].
First, we create a complete, directed $K$-partite graph whose set of nodes $V$ is divided into the disjoint subsets $V_0, V_1, \ldots, V_K$. $K$ refers to the number of surface forms $M$, and $V_i$ is the set of generated candidate entities $\{e^i_1, \ldots, e^i_{|V_i|}\}$ for surface form $m_i$. We define $V_0$ as a pseudo surface form and use the subset $V_0 = \{e^0_1\}$ to contain the topic node. The topic node represents the average topic of all already linked entities. Hence, the edge weight between an entity $e^i_j$ and the topic node $e^0_1$ represents the relatedness between $e^i_j$ and all already linked entities. Since our graph is $K$-partite, there are only directed, weighted edges between candidate entities that belong to different surface forms. Connecting the entities that belong to the same surface form would be wrong, since the correct target entity of a surface form is determined by the candidate entities of the other surface forms (coherence).
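The edge structure described above can be sketched as follows. This is a minimal illustration, not the framework's implementation: the function name `build_k_partite_edges`, the list-of-lists representation and the node labels (including the second "New York" candidate) are assumptions made for this example.

```python
from itertools import product


def build_k_partite_edges(partitions):
    """Return all directed edges between nodes of different partitions.

    partitions[0] holds the topic node; partitions[i] holds the candidate
    entities of surface form m_i (hypothetical representation).
    """
    edges = []
    for i, source_part in enumerate(partitions):
        for h, target_part in enumerate(partitions):
            if i == h:
                continue  # candidates of the same surface form stay unconnected
            edges.extend(product(source_part, target_part))
    return edges


# Toy input: topic node, one candidate for "TS", two candidates for "New York".
# "New_York_(state)" is a made-up second candidate for illustration only.
partitions = [["topic"], ["Time_Square"], ["New_York_City", "New_York_(state)"]]
edges = build_k_partite_edges(partitions)
```

Each pair of nodes from different subsets is connected in both directions, while the two "New York" candidates share no edge.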
The edge weights in our graph represent entity transition probabilities (ETP), which describe the likelihood of walking from a node to an adjacent node. We compute these probabilities by first computing the Transition Harmonic Mean (THM) between two nodes. The THM is the harmonic mean of the two nodes' entity relatedness and the context similarity of the target entity (cf. Equation 8.2).
The entity relatedness between two nodes (entities) is the cosine similarity ($\cos$) of the entities' semantic embeddings (vectors) $vec(e^i_j)$ and $vec(e^h_k)$. The semantic embedding of our topic node $e^0_1$ is the sum of the embeddings of all already linked entities.
The context similarity between the target entity $e^h_k$ and the surrounding context of its surface form $m_h$ is the cosine similarity of $e^h_k$'s entity-context embedding $cvec(e^h_k)$ and the inferred surrounding context vector $cvec(m_h)$ of $m_h$. If the target entity is our topic node, the context similarity equals 0. The ETP is computed by normalizing the respective THM value (cf. Equation 8.3).

$$THM(e^i_j, e^h_k) = \frac{2 \cdot \cos(vec(e^i_j), vec(e^h_k)) \cdot \cos(cvec(e^h_k), cvec(m_h))}{\cos(vec(e^i_j), vec(e^h_k)) + \cos(cvec(e^h_k), cvec(m_h))} \tag{8.2}$$

$$ETP(e^i_j, e^h_k) = \frac{THM(e^i_j, e^h_k)}{\sum_{e \in (V \setminus V_i)} THM(e^i_j, e)} \tag{8.3}$$
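Equations 8.2 and 8.3 can be illustrated with a short sketch. The helper names (`cosine`, `thm`, `etp`) and the toy values are assumptions made for this example. The zero-denominator guards are also an assumption, since the text does not specify how that case is handled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def thm(relatedness, context_similarity):
    """Transition Harmonic Mean of two similarity values (Equation 8.2)."""
    denominator = relatedness + context_similarity
    if denominator == 0.0:  # assumption: return 0 instead of dividing by zero
        return 0.0
    return 2.0 * relatedness * context_similarity / denominator


def etp(thm_by_target):
    """Normalize the THM values of one source node's outgoing edges (Equation 8.3)."""
    total = sum(thm_by_target.values())
    if total == 0.0:  # assumption: no transition mass if all THM values are 0
        return {target: 0.0 for target in thm_by_target}
    return {target: value / total for target, value in thm_by_target.items()}


# Toy values: relatedness 0.8 and context similarity 0.4 for one edge.
example_thm = thm(0.8, 0.4)
example_etp = etp({"candidate_a": 0.6, "candidate_b": 0.2})
```

In this sketch, `etp` receives the THM values of all edges leaving one source node, so the resulting probabilities of that node sum to 1.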
Given the current graph, we additionally integrate a possibility to jump from any node to any other node in the graph during the random walk with probability $\alpha$. Typical values for $\alpha$ (according to the original paper [Whi03]) are in the range [0.1, 0.2]. We compute a probability for each candidate entity being the next jump target. Again, we either deploy the Sense Prior probability located in our EL index or the Entity Prior probability as jump probability for each node (entity). The Entity Prior probability is used if no document-centric KBs are available. The probability to jump to or from the topic node equals 0.
Figure 8.2 shows a possible candidate entity graph. The surface form "TS" has only one candidate entity and consequently has already been linked to the entity Time_Square. The surface form "New York" is still ambiguous, providing two candidates. The topic node $e^0_1$ comprises the entity Time_Square of the already disambiguated surface form "TS". We omit the edge weights and jump probabilities in the figure to improve visualization.
[Figure 8.2 node labels: $m_1$ (TS), $m_2$ (New York), Time_Square, New_York_City, $e^0_1$ (Time_Square); legend: Entity Transition, Random Jump]
Figure 8.2: Example EL graph with candidates for the surface forms "TS" and "New York" and a topic vector. Solid lines denote entity transition probabilities and dashed lines denote jump probabilities between entity pairs.
After constructing the EL graph, we apply the PageRank algorithm and compute a relevance score for each candidate entity. Depending on the EL task, our approach decides which candidate entity is the correct target entity or abstains if no appropriate candidate is available (cf. Algorithm 8.2.3).
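A random walk with weighted edges and non-uniform jumps can be sketched as a simple power iteration. This is an illustrative sketch in the spirit of PageRank with priors [Whi03], not DoSeR's actual code. The function name, the dictionary-based graph representation and all toy probabilities are assumptions. It also assumes that every node's outgoing ETP values sum to 1 and that the jump probabilities sum to 1.

```python
def pagerank_with_priors(nodes, etp, jump, alpha=0.1, iterations=100):
    """Approximate node relevance by power iteration.

    etp[u][v] is the entity transition probability from u to v,
    jump[v] is the prior probability of jumping to v.
    """
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Mass that arrives through a random jump, distributed by the prior.
        updated = {node: alpha * jump.get(node, 0.0) for node in nodes}
        # Mass that arrives by following weighted edges.
        for source in nodes:
            for target, probability in etp.get(source, {}).items():
                updated[target] += (1.0 - alpha) * rank[source] * probability
        rank = updated
    return rank


# Toy graph: a topic node and two hypothetical candidates of one surface form.
nodes = ["topic", "New_York_City", "New_York_(state)"]
etp = {
    "topic": {"New_York_City": 0.7, "New_York_(state)": 0.3},
    "New_York_City": {"topic": 1.0},
    "New_York_(state)": {"topic": 1.0},
}
# The jump probability of the topic node is 0, as stated in the text.
jump = {"topic": 0.0, "New_York_City": 0.6, "New_York_(state)": 0.4}
scores = pagerank_with_priors(nodes, etp, jump, alpha=0.1, iterations=100)
```

With these toy values, the candidate with the higher transition and prior probabilities receives the higher relevance score, and the scores keep summing to 1.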
8.3 Data Sets
To evaluate DoSeR on general-domain entities, we make use of the same data sets as proposed in Section 6.4 on Page 109. All these data sets are integrated in the online EL evaluation framework GERBIL by default. Further, when we evaluate DoSeR in the biomedical domain, we use the CalbCSmall and CalbCBig corpus as training corpus and test data set, similar to Chapter 5. For an in-depth data set description, we refer to Section 5.4.
Apart from natural language text data sets, we also investigate how DoSeR links entities in tables. For this purpose, we use six data sets from different domains whose entities are contained in Wikipedia. An overview of the data set statistics is given in Table 8.1.
1. Wiki-Manual: Limaye et al. [Lim10] created a small data set of 36 Wikipedia tables extracted from Wikipedia article texts (non-Infobox tables). Some columns overlap with the Web-Manual data set.
2. Web-Manual: A set of 371 web tables was crawled by Limaye et al. [Lim10]. The difference between Wiki-Manual and Web-Manual is that the cell and header texts in the latter are noisier. The data set comprises 51 898 cells, but only 9239 of them are annotated with ground truth entities.
3. Wiki-Links: This data set was specifically created to evaluate cell EL algorithms at large scale. The table set consists of Wikipedia tables where at least 90% of the cells internally link to entities in Wikipedia [Lim10].
4. LimayeAll: The LimayeAll data set was re-created in the context of the table annotation approach TableMiner by Zhang et al. [Zha14]. The authors re-created the Limaye et al. [Lim10] data sets Wiki-Manual, Web-Manual and Wiki-Links to correct wrong or changed Wikipedia annotations, and combined them. In addition, it was assumed that the original ground truth annotations of the data sets are very sparse and possibly biased. Thus, the authors changed a huge number of surface forms to complicate the EL process.
5. IMDb: The IMDb data set, also created in the context of the table EL approach TableMiner [Zha14], contains 7416 tables randomly extracted from the IMDb movie website. Each movie web page contains a table listing the actors/actresses and the corresponding characters played.
6. MusicBrainz: Our last data set MusicBrainz comprises about 1400 tables which were randomly extracted from the MusicBrainz record label web pages by Zhang et al. [Zha14]. Typically, a web page lists the music released by a production company. A table has about 8 columns with one listing the music release titles and one listing the respective artists.
We note that the table data sets listed above are exclusively annotated with named entities (i.e., persons, locations and organizations). Basically, we could have omitted all candidate entities that do not belong to these types during our experiments to further improve the underlying results. However, since we do not adapt our approach or EL index to specific data sets, we have used the same general-domain entity index (Wikipedia) for all data sets.
Table 8.1: Table data set statistics
Data Set       #Tables   #Average Rows   #Average Columns   #Entity Annotations
Wiki-Manual         36              37                  4                  1691
Web-Manual         371              35                  2                  9239
Wiki-Links        6085              20                  3               131 807
LimayeAll         6310              22                110               231 657
IMDb              7416              14                  1                66 564
MusicBrainz       1406              78                  2                93 110
8.4 Evaluation
In our evaluation, we show that DoSeR achieves state-of-the-art results across different domains and document structures and types. Before we report the results in detail, we first describe the experimental setup in Section 8.4.1. Next, we present how DoSeR performs on linking entities from general-domain KBs in news documents, RSS feeds, tweets and tables in Sections 8.4.2 and 8.4.3. This is followed by an evaluation of how DoSeR performs in the biomedical domain in Section 8.4.4. In Section 8.4.5, we analyze the EL results after enabling the abstaining mechanism in our algorithm. Finally, we present a parameter study of our semantic embeddings in terms of Word2Vec and Doc2Vec architectures and their optimal dimensions in Section 8.4.6.
8.4.1 Experimental Setup
The DoSeR framework is fully implemented in Java and Python. For the Word2Vec and Doc2Vec algorithms, we chose Gensim [Řeh10], a robust and efficient framework for unsupervised semantic modeling from plain text. Before our algorithm is able to link entities, we first have to perform some preprocessing steps. First, we choose a set of KBs whose entities define our target entity set $\Omega$. When we disambiguate general-domain entities (as in Sections 8.4.2 and 8.4.3), we make use of the current version of DBpedia (v.2015-10) as entity database (i.e., core KB). This version reflects information from last year's Wikipedia version. Overall, we extracted ≈4.1 million entities (all entities belonging to the owl:Thing class) from DBpedia that we would like to link in our work. Next, we selected Wikipedia (≈81 million annotations) and the Google Wikilinks Corpus (≈40 million annotations) as entity-annotated document KBs that serve as training data for our semantic entity embeddings (Word2Vec). To create the Doc2Vec entity-context embeddings, our framework parses the entities' Wikipedia pages and removes all Wikipedia syntax elements as well as tables. The resulting natural language text documents serve as input for the Doc2Vec algorithm. We note that in contrast to Chapter 7, DoSeR does not subdivide the entity texts into paragraphs to increase the performance of our approach.
In Section 8.4.4, we evaluate DoSeR on the biomedical data sets CalbCSmall and CalbCBig. To create our entity database, we again (similar to Chapter 5) focus on the four major namespaces UMLS, Disease, Uniprot and EntrezGene in both CalbC data sets. Here, we use the original entity-annotated CalbC documents and crawl the respective entity-centric KBs in the LOD cloud (i.e., LinkedLifeData, Uniprot, NCBI) to gather the respective entity information. More information about the CalbC data sets can be found in Chapter 5.
In the following, DoSeR learns entity embeddings and entity-context embeddings with Word2Vec and Doc2Vec. To train the entity embeddings with Word2Vec, we defined a feature space of 400 dimensions. DoSeR typically employs the skip-gram architecture, which performs better with infrequent words [Mik13a]. For Doc2Vec, we defined a feature space of 1000 dimensions. DoSeR learns the entity-document embeddings with the PV-DM architecture. An experimental comparison between the architectures and various settings for the dimensionality parameter is presented in Section 8.4.6. Training Word2Vec took ≈90 minutes on our personal computer with a 4x3.4 GHz Intel Core i7 processor and 16 GB RAM (1 corpus iteration). Training Doc2Vec took ≈2 days on our server with 20 cores and 25 GB RAM (5 iterations overall).
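The training settings stated above could be expressed with Gensim roughly as follows. This is a hedged sketch, not the original training script. The keyword names (`vector_size`, `sg`, `dm`, `epochs`) follow the Gensim 4.x API, whereas the function names and the input formats (`entity_sequences`, `entity_texts`) are hypothetical. All other hyperparameters are left at Gensim's defaults because the text does not state them.

```python
# Settings taken from the text; keyword names follow Gensim 4.x.
WORD2VEC_PARAMS = {"vector_size": 400, "sg": 1, "epochs": 1}  # sg=1: skip-gram
DOC2VEC_PARAMS = {"vector_size": 1000, "dm": 1, "epochs": 5}  # dm=1: PV-DM


def train_entity_embeddings(entity_sequences):
    """Train entity embeddings on sequences of entity identifiers.

    entity_sequences: list of lists of entity identifiers (hypothetical format).
    """
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    return Word2Vec(sentences=entity_sequences, **WORD2VEC_PARAMS)


def train_entity_context_embeddings(entity_texts):
    """Train entity-context embeddings on one text document per entity.

    entity_texts: dict mapping an entity identifier to its plain text
    (hypothetical format).
    """
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [
        TaggedDocument(words=text.split(), tags=[entity])
        for entity, text in entity_texts.items()
    ]
    return Doc2Vec(documents=documents, **DOC2VEC_PARAMS)
```

Whitespace tokenization via `text.split()` is a simplification chosen for this sketch.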
Our approach offers several parameters to tweak the results. In the following, we will mention only those that have the most impact on the results.
• Surrounding Context: For Doc2Vec, DoSeR uses a surrounding context of 200 words, i.e., the 100 words before and the 100 words after the surface form constitute the context. Using more context words results in less meaningful query vectors (cf. Chapter 7).
• Candidate Filter: The cosine similarity ranges from -1 (unequal) to 1 (equal). A reasonable way to tune the filter parameter (the necessary similarity) is to sweep its value between 0.25 and 0.8. We selected the value 0.57 according to the best averaged F1 values throughout the experiments.
• PageRank: DoSeR performs 100 PageRank iterations since the overall results do not change with more iterations. In terms of the PageRank jump probability $\alpha$, we chose $\alpha = 0.1$ in algorithm step 3 (according to the original paper [Whi03]). In algorithm step 4, we chose $\alpha = 0.2$ to increase the prior influence (i.e., a robust baseline) since the correct entity could not be determined with the help of topical coherence in the steps before. In the disambiguation step High Probability Candidate Linking, we set the step's parameter to 0.5 after sweeping its value between 0.2 and 0.6. Again, the best value was selected according to the best averaged F1 values throughout the experiments.
• Abstaining: We note that abstaining is disabled by default by setting the abstaining parameter to −∞. To provide the best abstaining results, we chose the value 0.3 by sweeping the parameter between 0.2 and 0.6 as described above.
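The Surrounding Context setting from the list above can be illustrated with a small helper. This is a minimal sketch under the assumption that the document is already tokenized into a list of words. The function name `surrounding_context` and the token-index interface are hypothetical.

```python
def surrounding_context(tokens, start, end, window=100):
    """Return up to `window` tokens before and after the surface form.

    tokens[start:end] is the surface form itself and is not part of the context.
    """
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return left + right


# Toy document with 300 tokens and a two-token surface form at positions 150-151.
tokens = ["w%d" % i for i in range(300)]
context = surrounding_context(tokens, start=150, end=152)
```

Near the beginning or end of a document, fewer than 100 words are available on one side, so the helper simply returns the shorter context.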