Entity Linking Graph and PageRank

8.2 DoSeR Framework

8.2.5 Entity Linking Graph and PageRank

In our approach, we generate an EL graph twice in order to link high probability candidate entities first and to perform abstaining afterwards. On this graph, we perform a random walk and determine the entity relevance, which can be seen as the average number of its visits. The random walk is simulated by a PageRank algorithm that permits edge weights and non-uniformly-distributed random jumps [Bri98;Whi03].

First, we create a complete, directed $K$-partite graph whose set of nodes $V$ is divided into $K$ disjoint subsets $V_0, V_1, \ldots, V_K$. $K$ refers to the number of surface forms $S$, and $V_i$ is the set of generated candidate entities $\{e^i_1, \ldots, e^i_{|V_i|}\}$ for surface form $m_i$. We define $m_0$ as a pseudo surface form and use the subset $V_0 = \{e^0_1\}$ to contain the topic node. The topic node represents the average topic of all already linked entities in $E_d$. Hence, the edge weight between an entity $e^i_j$ and the topic node $e^0_1$ represents the relatedness between $e^i_j$ and all already linked entities. Since our graph is $K$-partite, there are only directed, weighted edges between candidate entities that belong to different surface forms. Connecting the entities that belong to the same surface form would be wrong since the correct target entities of surface forms are determined by the other surface forms’ candidate entities (coherence).
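The edge structure described above can be illustrated with a minimal Python sketch. It only enumerates the directed edges between candidates of different surface forms; the function name, the list-of-lists input format and the second candidate `New_York_State` are assumptions made for illustration and are not taken from the DoSeR implementation.

```python
from itertools import product


def build_kpartite_edges(partitions):
    """Enumerate directed edges between nodes of different partitions.

    partitions[0] is assumed to hold only the topic node; partitions[i]
    holds the candidate entities of surface form m_i.
    """
    edges = []
    for i, sources in enumerate(partitions):
        for h, targets in enumerate(partitions):
            if i == h:
                continue  # candidates of the same surface form stay unconnected
            edges.extend(product(sources, targets))
    return edges


# Hypothetical example: topic node, one candidate for 'TS',
# two candidates for 'New York'.
partitions = [
    ["topic"],
    ["Time_Square"],
    ["New_York_City", "New_York_State"],
]
edges = build_kpartite_edges(partitions)
```

In this toy graph, the two 'New York' candidates are connected to the topic node and to `Time_Square` in both directions, but not to each other.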

The edge weights in our graph represent entity transition probabilities (ETP), which describe the likelihood to walk from a node to the adjacent node. We compute these probabilities by first computing the Transition Harmonic Mean (THM) between two nodes. The THM is the harmonic mean between two nodes’ entity relatedness and the context similarity of the target entity (cf. Equation 8.2).

The entity relatedness between two nodes (entities) is the cosine similarity ($\cos$) of the entities’ semantic embeddings (vectors) $vec(e^i_j)$ and $vec(e^h_k)$. The semantic embedding of our topic node $e^0_1$ is the sum of all entity embeddings in $E_d$ (i.e., $vec(e^0_1) = \sum_{e_j \in E_d} vec(e_j)$).
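As a small illustration of the entity relatedness and the topic node embedding, the following Python sketch uses plain lists as vectors. The function names are hypothetical, and returning 0 for zero-length vectors is an assumption of this sketch, not something the text specifies.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def topic_vector(linked_embeddings):
    """Component-wise sum of the embeddings of all already linked entities."""
    return [sum(components) for components in zip(*linked_embeddings)]


# Toy two-dimensional embeddings of two already linked entities.
topic = topic_vector([[1.0, 0.0], [0.0, 1.0]])
relatedness = cosine([1.0, 0.0], topic)
```

Because the cosine similarity is invariant to scaling, using the sum instead of the average of the linked entity embeddings yields the same relatedness values.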

The context similarity between the target entity $e^h_k$ and the surrounding context of its surface form $m_h$ is the cosine similarity of $e^h_k$’s entity-context embedding $cvec(e^h_k)$ and the inferred surrounding context vector $cvec(m_h)$ of $m_h$. In case the target entity is our topic node, the context similarity equals 0. The ETP is computed by normalizing the respective THM value (cf. Equation 8.3).

$$THM(e^i_j, e^h_k) = \frac{2 \cdot \cos(vec(e^i_j), vec(e^h_k)) \cdot \cos(cvec(e^h_k), cvec(m_h))}{\cos(vec(e^i_j), vec(e^h_k)) + \cos(cvec(e^h_k), cvec(m_h))} \qquad (8.2)$$

$$ETP(e^i_j, e^h_k) = \frac{THM(e^i_j, e^h_k)}{\sum_{l \in (V \setminus V_i)} THM(e^i_j, l)} \qquad (8.3)$$
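Equations 8.2 and 8.3 can be sketched in a few lines of Python. The function names and the dictionary-based input are assumptions for illustration. The sketch also assumes non-negative similarity values and returns 0 when a denominator vanishes; how the original implementation treats these corner cases is not stated in the text.

```python
def thm(relatedness, context_similarity):
    """Transition Harmonic Mean (Equation 8.2) of two similarity values."""
    denominator = relatedness + context_similarity
    if denominator == 0:
        return 0.0  # assumption: avoid division by zero
    return 2 * relatedness * context_similarity / denominator


def etp(thm_row):
    """Normalize the THM values of one source node (Equation 8.3).

    thm_row maps every target node outside the source's own partition
    to THM(source, target).
    """
    total = sum(thm_row.values())
    if total == 0:
        return {target: 0.0 for target in thm_row}
    return {target: value / total for target, value in thm_row.items()}


# Toy values: relatedness 0.8, context similarity 0.4.
example_thm = thm(0.8, 0.4)
example_etp = etp({"a": 1.0, "b": 3.0})
```

After normalization, the outgoing transition probabilities of a node sum to 1, which is what the random walk requires.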

Given the current graph, we additionally integrate a possibility to jump from any node to any other node in the graph during the random walk with probability 𝛼. Typical values for 𝛼 (according to the original paper [Whi03]) are in the range [0.1,0.2]. We compute a probability for each candidate entity being the next jump target. Again, we either deploy the Sense Prior probability located in our EL index or the Entity Prior probability as jump probability for each node (entity). The Entity Prior probability is used if no document-centric KBs are available. The probability to jump to or from the topic node equals 0.
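A minimal sketch of how such a non-uniform jump distribution could be derived from prior values is given below. The function name is hypothetical, and the normalization of the priors to a probability distribution over all candidate entities is an assumption of this sketch; the text only states that the priors serve as jump probabilities and that the topic node receives probability 0.

```python
def jump_distribution(priors, topic_node):
    """Turn per-entity priors into a jump distribution over the graph nodes.

    priors maps each node to its Sense Prior or Entity Prior value.
    The topic node is never a jump target.
    """
    weights = {
        node: 0.0 if node == topic_node else prior
        for node, prior in priors.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one candidate needs a positive prior")
    return {node: weight / total for node, weight in weights.items()}


# Toy priors; the value given for the topic node is ignored.
jump = jump_distribution({"topic": 0.9, "A": 0.3, "B": 0.1}, "topic")
```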

Figure 8.2 shows a possible candidate entity graph. The surface form ‘TS’ has only one candidate entity and consequently has already been linked to the entity Time Square. The surface form ‘New York’ is still ambiguous, providing two candidates. The topic node $e^0_1$ comprises the already disambiguated surface form ‘Time Square’. We omit the edge weights and jump probabilities in the figure to improve visualization.

Figure 8.2: Example EL graph with candidates for the surface forms ‘TS’ and ‘New York’ and a topic vector. Solid lines denote entity transition probabilities and dashed lines denote jump probabilities between entity pairs.

After constructing the EL graph, we apply the PageRank algorithm and compute a relevance score for each candidate entity. Depending on the EL task, our approach decides which candidate entity is the correct target entity or abstains if no appropriate candidate is available (cf. Algorithm 8.2.3).
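The random walk can be approximated with a simple power iteration. The following Python sketch assumes the common weighted PageRank update with a non-uniform jump vector, in which a node's new score is 𝛼 times its jump probability plus (1 − 𝛼) times the score flowing in over the weighted edges. It is not taken from [Whi03] or from the DoSeR code, and all names and numbers are illustrative; `New_York_State` is a hypothetical second candidate.

```python
def pagerank(etp, jump, alpha=0.1, iterations=100):
    """Weighted PageRank with non-uniform random jumps.

    etp[source][target] is the transition probability of the edge
    source -> target; jump[node] is the probability that a random
    jump ends in node. Both are assumed to be normalized.
    """
    nodes = list(jump)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {node: alpha * jump[node] for node in nodes}
        for source, targets in etp.items():
            for target, probability in targets.items():
                updated[target] += (1 - alpha) * rank[source] * probability
        rank = updated
    return rank


# Toy graph loosely following Figure 8.2 (all weights are made up).
etp = {
    "topic": {"Time_Square": 0.4, "New_York_City": 0.5, "New_York_State": 0.1},
    "Time_Square": {"topic": 0.3, "New_York_City": 0.6, "New_York_State": 0.1},
    "New_York_City": {"topic": 0.5, "Time_Square": 0.5},
    "New_York_State": {"topic": 0.5, "Time_Square": 0.5},
}
jump = {
    "topic": 0.0,
    "Time_Square": 0.5,
    "New_York_City": 0.4,
    "New_York_State": 0.1,
}
scores = pagerank(etp, jump, alpha=0.1, iterations=100)
```

With these made-up weights, `New_York_City` receives a higher relevance score than `New_York_State`, since it is favored both by the incoming transition probabilities and by the jump distribution.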

8.3 Data Sets

To evaluate DoSeR on general-domain entities, we make use of the same data sets as proposed in Section 6.4 on Page 109. All these data sets are integrated in the online EL evaluation framework GERBIL by default. Further, when we evaluate DoSeR in the biomedical domain, we use the CalbCSmall and CalbCBig corpus as training corpus and test data set, similar to Chapter 5. For an in-depth data set description, we refer to Section 5.4.

Apart from natural language text data sets, we also investigate how DoSeR links entities in tables. For this purpose, we use six data sets from different domains whose entities are contained in Wikipedia. An overview of the data set statistics is given in Table 8.1.

1. Wiki-Manual: Limaye et al. [Lim10] created a small data set of 36 Wikipedia tables extracted from Wikipedia article texts (non-Infobox tables). Some columns overlap with the Web-Manual data set.

2. Web-Manual: A set of 371 web tables was crawled by Limaye et al. [Lim10]. The difference between Wiki-Manual and Web-Manual is that the cell and header texts in the latter are noisier. The data set comprises a huge number of 51 898 cells, but only 9239 of them are annotated with ground truth entities.

3. Wiki-Links: This data set was specifically created to evaluate cell EL algorithms at large scale. The table set consists of Wikipedia tables where at least 90% of the cells internally link to entities in Wikipedia [Lim10].

4. LimayeAll: The LimayeAll data set was re-created in the context of the table annotation approach TableMiner by Zhang et al. [Zha14]. The authors re-created the Limaye et al. [Lim10] data sets Wiki-Manual, Web-Manual and Wiki-Links to correct wrong or changed Wikipedia annotations, and combined them. In addition, it was assumed that the original ground truth annotations of the data sets are very sparse and possibly biased. Thus, the authors changed a huge number of surface forms to complicate the EL process.

5. IMDb: The IMDb data set, also created in the context of the table EL approach TableMiner [Zha14], contains 7416 tables randomly extracted from the IMDb movie website. Each movie web page contains a table listing the actors/actresses and the corresponding characters played.

6. MusicBrainz: Our last data set MusicBrainz comprises about 1400 tables which were randomly extracted from the MusicBrainz record label web pages by Zhang et al. [Zha14]. Typically, a web page lists the music released by a production company. A table has about 8 columns with one listing the music release titles and one listing the respective artists.

We note that the table data sets listed above are exclusively annotated with named entities (i.e., persons, locations and organizations). In principle, we could have omitted all candidate entities that do not belong to these types during our experiments to further improve the results. However, since we do not adapt our approach or EL index to specific data sets, we have used the same general-domain entity index (Wikipedia) for all data sets.

Table 8.1: Table data set statistics

Data Set      #Tables   #Average Rows   #Average Columns   #Entity Annotations
Wiki-Manual        36              37                  4                  1691
Web-Manual        371              35                  2                  9239
Wiki-Links       6085              20                  3               131 807
LimayeAll        6310              22                110               231 657
IMDb             7416              14                  1                66 564
MusicBrainz      1406              78                  2                93 110

8.4 Evaluation

In our evaluation, we show that DoSeR achieves state-of-the-art results across different domains and document structures and types. Before we report the results in detail, we first describe the experimental setup in Section 8.4.1. Next, we present how DoSeR performs on linking entities from general-domain KBs in news documents, RSS feeds, tweets and tables in Sections 8.4.2 and 8.4.3. We then evaluate how DoSeR performs in the biomedical domain in Section 8.4.4. In Section 8.4.5, we analyze the EL results after enabling the abstaining mechanism in our algorithm. Finally, we present a parameter study of our semantic embeddings in terms of Word2Vec and Doc2Vec architectures and their optimal dimensions in Section 8.4.6.

8.4.1 Experimental Setup

The DoSeR framework is fully implemented in Java and Python. For the Word2Vec and Doc2Vec algorithms, we chose Gensim [Řeh10], a robust and efficient framework to realize unsupervised semantic modeling from plain text. Before our algorithm is able to link entities, we first have to perform some preprocessing steps. First, we choose a set of KBs whose entities define our target entity set 𝛺. When we disambiguate general-domain entities (as in Sections 8.4.2 and 8.4.3), we make use of the current version of DBpedia (v.2015-10) as entity database (i.e., core KB). This version reflects information from last year’s Wikipedia version. Overall, we extracted ≈4.1 million entities (all entities belonging to the owl:Thing class) out of DBpedia that we would like to link in our work. Next, we selected Wikipedia (≈81 million annotations) and the Google Wikilinks Corpus (≈40 million annotations) as entity-annotated document KBs that serve as training data for our semantic entity embeddings (Word2Vec). To create the Doc2Vec entity-context embeddings, our framework parses the entities’ Wikipedia pages and removes all Wikipedia syntax elements as well as tables. The resulting natural language text documents serve as input for the Doc2Vec algorithm. We note that in contrast to Chapter 7, DoSeR does not subdivide the entity texts into paragraphs to increase the performance of our approach.

In Section 8.4.4, we evaluate DoSeR on the biomedical data sets CalbCSmall and CalbCBig. To create our entity database, we again (similar to Chapter 5) focus on the four major namespaces UMLS, Disease, Uniprot and EntrezGene in both CalbC data sets. Here, we use the original entity-annotated CalbC documents and crawl the respective entity-centric KBs in the LOD cloud (i.e., LinkedLifeData, Uniprot, NCBI) to gather the respective entity information. More information about the CalbC data sets can be found in Chapter 5.

In the following, DoSeR learns entity embeddings and entity-context embeddings with Word2Vec and Doc2Vec. To train the entity embeddings with Word2Vec, we defined a feature space of 𝑑 = 400 dimensions. DoSeR typically employs the skip-gram architecture that performs better with infrequent words [Mik13a]. In terms of Doc2Vec, we defined a feature space of 𝑑 = 1000 dimensions. DoSeR learns the entity-document embeddings with the PV-DM architecture. An experimental comparison between the architectures and various settings for parameter 𝑑 is presented in Section 8.4.6. Training Word2Vec took ≈90 minutes on our personal computer with a 4x3.4 GHz Intel Core i7 processor and 16 GB RAM (1 corpus iteration). Training Doc2Vec took ≈2 days on our server with 20 cores and 25 GB RAM with 5 iterations overall.
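For orientation, the stated settings could be passed to Gensim roughly as sketched below. The sketch assumes the parameter names of Gensim 4.x (`vector_size`, `epochs`), which differ from the names used by older Gensim releases. The helper functions and their input formats are hypothetical, and all parameters not mentioned in the text (e.g., window size or minimum count) are left at the library defaults, which may differ from the settings actually used.

```python
# Settings stated in the text, expressed with Gensim 4.x parameter names.
WORD2VEC_PARAMS = {"vector_size": 400, "sg": 1, "epochs": 1}   # skip-gram
DOC2VEC_PARAMS = {"vector_size": 1000, "dm": 1, "epochs": 5}   # PV-DM


def train_entity_embeddings(entity_sequences):
    """entity_sequences: iterable of token lists from the annotated corpora."""
    from gensim.models import Word2Vec  # imported lazily; requires Gensim

    return Word2Vec(sentences=entity_sequences, **WORD2VEC_PARAMS)


def train_entity_context_embeddings(entity_documents):
    """entity_documents: dict mapping an entity identifier to its token list."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    tagged = [
        TaggedDocument(words=tokens, tags=[entity])
        for entity, tokens in entity_documents.items()
    ]
    return Doc2Vec(documents=tagged, **DOC2VEC_PARAMS)
```

Keeping the hyperparameters in two dictionaries makes it easy to vary 𝑑 and the architecture flags (`sg`, `dm`) in a parameter study.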

Our approach offers several parameters to tweak the results. In the following, we will mention only those that have the most impact on the results.

• Surrounding Context: For Doc2Vec, DoSeR uses a surrounding context of 200 words, i.e., the 100 words before and the 100 words after the surface form constitute the context. Using more context words results in less meaningful query vectors (cf. Chapter 7).

• Candidate Filter: The cosine similarity ranges from -1 (unequal) to 1 (equal). A reasonable way to tune 𝜆 (the necessary similarity) is to sweep the value in the range 0.25 < 𝜆 < 0.8. We selected the value 𝜆 = 0.57 according to the best averaged F1 values throughout the experiments.

• PageRank: DoSeR performs 100 PageRank iterations since the overall results do not change with more iterations. In terms of the PageRank jump probability 𝛼, we chose 𝛼 = 0.1 in algorithm step 3 (according to the original paper [Whi03]). In algorithm step 4, we chose 𝛼 = 0.2 to increase the prior influence (i.e., a robust baseline) since the correct entity could not be determined with the help of topical coherence in the steps before. In the disambiguation step High Probability Candidate Linking, we determined the parameter 𝑚𝑎𝑟𝑔𝑖𝑛1 = 0.5 by sweeping the value between 0.2 < 𝑚𝑎𝑟𝑔𝑖𝑛1 < 0.6. Again, the best value was selected according to the best averaged F1 values throughout the experiments.

• Abstaining: We note that abstaining is disabled by default using 𝑚𝑎𝑟𝑔𝑖𝑛2 = −∞. To provide the best abstaining results, we chose 𝑚𝑎𝑟𝑔𝑖𝑛2 = 0.3 by sweeping the value between 0.2 < 𝑚𝑎𝑟𝑔𝑖𝑛2 < 0.6 as described above.
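The surrounding-context setting from the parameter list above can be illustrated with a short Python sketch. The function name and the token-list input are hypothetical, and excluding the surface form itself from the context is an assumption, since the text does not state whether it is part of the 200 words.

```python
def surrounding_context(tokens, start, end, window=100):
    """Return up to `window` words before and after a surface form.

    tokens is the tokenized document; the surface form occupies
    tokens[start:end] and is excluded from the context (assumption).
    """
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return left + right


# Toy document with 300 tokens and a two-token surface form at position 150.
tokens = [f"w{i}" for i in range(300)]
context = surrounding_context(tokens, 150, 152)
```

Near the beginning or end of a document, fewer than 200 context words are available, so the returned context is correspondingly shorter.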

In document Robust Entity Linking in Heterogeneous Domains (Page 160-165)