
3.3 Disambiguation Algorithms

3.3.4 Graph-Based Approaches

Graph-based approaches are particularly well suited to modeling the interdependence of entities within collective EL approaches. Given a set of candidate entities across several surface forms within an input document, graph-based approaches construct a graph 𝐺 = (𝑉, 𝐸) that commonly contains all candidate entities as nodes. Depending on the graph definition, nodes may also represent surface forms. Edges typically denote relations between a pair of candidate entities or between a surface form and one of its candidate entities. Figure 3.9 shows an example graph containing surface forms, their candidate entities and relations (the transition weights are not normalized and are for illustrative purposes only).

[Figure 3.9 graphic not reproduced. Node labels: Bulls, Chicago Bulls, Michael Jordan, Chicago, Michael I. Jordan, Michael B. Jordan, Brooklyn (twice), Bull. Edge weights shown: 0.05, 0.2, 0.35, 0.8, 0.1, 0.01, 0.2, 0.2. Legend: Surface Form, Entity.]

Figure 3.9: An example graph with surface forms, entities and transition weights
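As a rough illustration of the graph definition above, the following Python sketch builds a small graph whose nodes are surface forms and candidate entities. The function name `build_el_graph`, the tagged-tuple node representation and the assignment of weights to edges are assumptions made for this example; some labels are borrowed from Figure 3.9, but the edges are not a reconstruction of the figure.

```python
def build_el_graph(candidates, relatedness):
    """Build G = (V, E) as an adjacency map {node: {neighbor: weight}}.

    candidates:  {surface_form: {entity: weight}}   surface form -> entity edges
    relatedness: {(entity_a, entity_b): weight}     entity <-> entity edges
    Nodes are tagged tuples so that a surface form and an entity can share
    the same label (e.g. "Brooklyn").
    """
    graph = {}
    for surface_form, entities in candidates.items():
        sf_node = ("surface_form", surface_form)
        graph.setdefault(sf_node, {})
        for entity, weight in entities.items():
            graph[sf_node][("entity", entity)] = weight
            graph.setdefault(("entity", entity), {})
    for (a, b), weight in relatedness.items():
        # Relatedness is symmetric here, so it is stored in both directions.
        graph.setdefault(("entity", a), {})[("entity", b)] = weight
        graph.setdefault(("entity", b), {})[("entity", a)] = weight
    return graph


# Illustrative input only: the weights are made up.
candidates = {
    "Bulls": {"Chicago Bulls": 0.8, "Bull": 0.05},
    "Brooklyn": {"Brooklyn": 0.35},
}
relatedness = {("Chicago Bulls", "Brooklyn"): 0.2}
graph = build_el_graph(candidates, relatedness)
```

Tagging each node with its type is one simple way to keep surface forms and entities apart when their labels coincide; other representations would work equally well.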

Graph algorithms typically do not use training data to weight the graph nodes (entities). Instead, training data in the form of annotated entities is often used to compute graph features such as transition probabilities (e.g., entity relatedness).

A simple yet effective approach for collectively linking Linked Data resources is AGDISTIS [Usb14], a KB-agnostic approach for RDF KBs such as DBpedia or YAGO. All candidate entities across all surface forms form the set of initial nodes in a directed, unweighted graph. The graph is then extended with additional entities by a breadth-first search in the RDF graph, starting at the candidate entities: all entities within a given range are added to the EL graph. Further, an edge between an entity pair is added if there is a direct relation between both entities in the KB. To rank all candidate entities, Usbeck et al. [Usb14] applied the Hyperlink-Induced Topic Search (HITS) algorithm [Kle99] and proposed selecting, as target entities, the candidates with the highest authority values.
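The ranking step can be illustrated with a minimal HITS sketch in Python. This is not the AGDISTIS implementation: the function names, the fixed iteration count and the toy edges are assumptions for illustration, and the graph expansion in the KB is omitted.

```python
import math


def hits(edges, iterations=20):
    """Power-iteration HITS on a directed, unweighted graph.

    edges: iterable of (source, target) pairs.
    Returns (authority, hub) score dictionaries, each L2-normalized.
    """
    edges = list(edges)
    nodes = {node for edge in edges for node in edge}
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a node: sum of the hub scores of nodes linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub score of a node: sum of the authority scores it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return auth, hub


def pick_by_authority(candidates, auth):
    """Select, per surface form, the candidate with the highest authority."""
    return {
        surface_form: max(entities, key=lambda e: auth.get(e, 0.0))
        for surface_form, entities in candidates.items()
    }


# Toy KB relations (hypothetical), including two non-candidate neighbors.
edges = [
    ("Michael Jordan", "Chicago Bulls"),
    ("Chicago", "Chicago Bulls"),
    ("Chicago Bulls", "Michael Jordan"),
    ("Chicago Bulls", "Chicago"),
    ("Bull", "Cattle"),
    ("Michael I. Jordan", "Machine learning"),
]
candidates = {
    "Bulls": ["Chicago Bulls", "Bull"],
    "Jordan": ["Michael Jordan", "Michael I. Jordan"],
}
authority, _ = hits(edges)
linked = pick_by_authority(candidates, authority)
```

In this toy graph, the candidates that receive links from other candidates end up with non-zero authority, while candidates without incoming edges do not.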

Alhelbawy et al. [Alh14a; Alh14b] collectively linked entities with the help of an undirected, weighted candidate entity graph. Depending on the entity relatedness measure, the edge weights are either binary or normalized similarity values between 0 and 1. First, the authors computed a local compatibility score 𝐼𝐶𝑜𝑛𝑓(𝑒𝑗) for each candidate entity in the graph (e.g., either the Sense Prior or the cosine similarity between the surface form and the name of the candidate entity) to indicate how well a candidate entity fits a given surface form. They then proposed three different approaches to select the most relevant candidate entities. The clique approach recursively searches for cliques in a graph with binary edges between entities. Cliques are scored according to the highest 𝐼𝐶𝑜𝑛𝑓 values in the clique, and the candidates of the highest-scoring clique represent the disambiguated entities. The disambiguated entities are combined into a new node in the graph, and the nodes (candidates) of the already linked surface forms are removed. In the second approach, the authors applied the PageRank algorithm [Bri98], which ranks the graph nodes according to their overall importance within the graph. The third approach combines the PageRank scores with the 𝐼𝐶𝑜𝑛𝑓 scores to incorporate both local compatibility and global coherence.
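To make the second and third approaches more concrete, here is a minimal Python sketch of PageRank on an undirected, weighted candidate graph, followed by a combination with local scores. The source does not state how the two scores are combined, so the product used in `rank_candidates` is purely an assumption, as are the function names, the damping factor and the toy values.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration.

    weights: {node: {neighbor: weight}}; every neighbor must also be a key.
    For an undirected graph, store each edge in both directions.
    Returns {node: score}; the scores sum to 1.
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_sum = {v: sum(weights[v].values()) for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            if out_sum[v] == 0:
                # Isolated node: spread its mass evenly over all nodes.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
            else:
                for u, w in weights[v].items():
                    new_rank[u] += damping * rank[v] * w / out_sum[v]
        rank = new_rank
    return rank


def rank_candidates(candidates, weights, iconf):
    """Pick, per surface form, the candidate maximizing PageRank * IConf.

    The product is an assumed combination, chosen only for illustration.
    """
    pr = pagerank(weights)
    return {
        surface_form: max(entities, key=lambda e: pr[e] * iconf[e])
        for surface_form, entities in candidates.items()
    }


# Toy example with made-up relatedness and IConf values.
weights = {
    "Chicago Bulls": {"Michael Jordan": 0.8},
    "Michael Jordan": {"Chicago Bulls": 0.8},
    "Bull": {},
    "Michael I. Jordan": {},
}
iconf = {"Chicago Bulls": 0.5, "Bull": 0.5, "Michael Jordan": 0.6, "Michael I. Jordan": 0.4}
candidates = {
    "Bulls": ["Chicago Bulls", "Bull"],
    "Jordan": ["Michael Jordan", "Michael I. Jordan"],
}
linked = rank_candidates(candidates, weights, iconf)
```

Here the two mutually related candidates reinforce each other through the graph, so they outrank the isolated candidates even where the local scores are tied.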

The EL systems AIDA by Hoffart et al. [Hof11] and WAT by Piccinno and Ferragina [Pic14], as well as the work by Han et al. [Han11b], are also purely collective EL approaches. All three rely on an undirected, weighted graph in which nodes represent surface forms and candidate entities. The approaches integrate the textual context similarity between a candidate entity and its surface form (i.e., the edge weight of a surface form - candidate entity pair) and the global interdependence between different EL decisions (i.e., the edge weight of a candidate entity pair). An edge between a surface form node and a candidate entity node is only added if the entity is a candidate of the respective surface form. Moreover, an edge between an entity pair is only added if the entity relatedness value between both entities is greater than 0. Hoffart et al. [Hof11] defined the weights of surface form - entity edges as a linear combination of the Sense Prior and a textual context matching score. In contrast, Han et al. [Han11b] simply used the cosine similarity between the textual surface form context and the entity description as edge weight. Piccinno and Ferragina [Pic14] computed context matching using the BM25 similarity score [Jon00]. To compute an entity relatedness score, all three works leverage the WLM [Mil08a]. To rank the candidate entities modeled within the graphs, Han et al. [Han11b] and Piccinno and Ferragina [Pic14] applied the (Personalized) PageRank algorithm [Hav03] or the HITS algorithm [Kle99]. AIDA [Hof11], however, computes a dense subgraph that ideally contains one surface form - entity edge for each surface form. Since this problem is NP-hard, the authors approximated it with an extension of the greedy algorithm proposed in [Soz10]. This algorithm iteratively removes the entity node with the smallest weighted degree (i.e., total weight of the incident edges).
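The greedy removal step can be sketched in Python as follows. This is a simplified illustration and not the AIDA algorithm itself: the stopping rule (never remove the last candidate of a surface form, stop when every surface form has one candidate left), the data layout and the name `greedy_prune` are assumptions, and the sketch expects every surface form to have at least one candidate.

```python
def greedy_prune(sf_edges, ent_edges):
    """Iteratively drop the entity with the smallest weighted degree.

    sf_edges:  {surface_form: {entity: weight}}
    ent_edges: {frozenset({entity_a, entity_b}): weight}
    Returns {surface_form: remaining_entity}.
    """
    remaining = {sf: dict(cands) for sf, cands in sf_edges.items()}

    def weighted_degree(entity):
        # Total weight of incident edges among the nodes still in the graph.
        alive = {e for cands in remaining.values() for e in cands}
        degree = sum(cands.get(entity, 0.0) for cands in remaining.values())
        degree += sum(
            w for pair, w in ent_edges.items() if entity in pair and pair <= alive
        )
        return degree

    while True:
        # Only entities of surface forms with more than one candidate may go.
        removable = {
            e for cands in remaining.values() if len(cands) > 1 for e in cands
        }
        if not removable:
            break
        victim = min(sorted(removable), key=weighted_degree)
        for cands in remaining.values():
            if len(cands) > 1:
                cands.pop(victim, None)
    return {sf: next(iter(cands)) for sf, cands in remaining.items() if cands}


# Toy example with made-up weights.
sf_edges = {
    "Bulls": {"Chicago Bulls": 0.4, "Bull": 0.5},
    "Jordan": {"Michael Jordan": 0.5, "Michael I. Jordan": 0.45},
}
ent_edges = {frozenset({"Chicago Bulls", "Michael Jordan"}): 0.8}
linked = greedy_prune(sf_edges, ent_edges)
```

In the toy input, "Bull" has the larger local weight for the surface form "Bulls", yet it is pruned because "Chicago Bulls" gains additional degree from its relatedness edge to "Michael Jordan".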

The Babelfy system by Moro et al. [Mor14] is based on the semantic network BabelNet [Nav12], which contains Wikipedia entities and other concepts. Each node in the underlying directed graph represents a surface form together with a corresponding candidate meaning, i.e., an entity or another concept. The authors connected two candidate meanings of different surface forms if one is in the semantic signature of the other. The semantic signature of an entity can be seen as a set of highly related other entities. More details about the computation of semantic signatures can be found in the original work [Nav12]. To drastically reduce the degree of ambiguity while keeping the coherence as high as possible, the authors proposed a novel densest-subgraph heuristic. The resulting subgraph contains the most important nodes in terms of coherence. The remaining candidate entities are then ranked by their normalized weighted degree in this subgraph. The weight of a node is the fraction of other nodes in the graph to which the candidate entity is connected.

Other graph-based approaches were proposed to collectively link entities in tweets. For instance, Shen et al. [She13] proposed the tweet linking system KAURI. One assumption is that every Twitter user has their own interest distribution over various named entities. The underlying disambiguation graph comprises all candidate entities of multiple surface forms across several tweets. Each node provides an intra-tweet local information score, which is computed with an LTR approach incorporating the Sense Prior, textual context similarity (cosine similarity of TF-IDF-weighted context and entity description vectors), and coherence between entities within the same tweet (using WLM [Mil08a]). The graph edges denote the coherence between all candidate entities in the graph. As in the approaches described before, the authors used the WLM as the entity relatedness measure and weighted the graph edges accordingly. Finally, a personalized PageRank algorithm [Hav03] ranks the candidate entities across all surface forms by considering the intra-tweet local information score of each node and the respective user interest information.
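A personalized PageRank of this kind can be sketched as follows. The sketch assumes that the local information score and the user interest have already been merged into a single non-negative prior per node, which then serves as the teleport distribution; this is a simplification for illustration and not KAURI's actual formulation. The name `personalized_pagerank` and all values are hypothetical.

```python
def personalized_pagerank(weights, prior, damping=0.85, iterations=50):
    """Personalized PageRank: random jumps follow `prior` instead of uniform.

    weights: {node: {neighbor: weight}}; every neighbor must also be a key.
    prior:   {node: non-negative score} over the graph nodes, with a
             positive total; it is normalized internally.
    Returns {node: score}; the scores sum to 1.
    """
    total = sum(prior.get(v, 0.0) for v in weights)
    jump = {v: prior.get(v, 0.0) / total for v in weights}
    rank = dict(jump)
    out_sum = {v: sum(nb.values()) for v, nb in weights.items()}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) * jump[v] for v in weights}
        for v, neighbors in weights.items():
            if out_sum[v] == 0:
                # Isolated node: redistribute its mass along the prior.
                for u in weights:
                    new_rank[u] += damping * rank[v] * jump[u]
            else:
                for u, w in neighbors.items():
                    new_rank[u] += damping * rank[v] * w / out_sum[v]
        rank = new_rank
    return rank


# Toy candidate graph; relatedness and prior values are made up.
weights = {
    "Chicago Bulls": {"Michael Jordan": 0.8},
    "Michael Jordan": {"Chicago Bulls": 0.8},
    "Bull": {},
    "Michael I. Jordan": {},
}
prior = {"Chicago Bulls": 0.6, "Michael Jordan": 0.7, "Bull": 0.4, "Michael I. Jordan": 0.3}
scores = personalized_pagerank(weights, prior)
```

Compared with plain PageRank, the only change is the jump distribution, which lets node-level evidence bias the otherwise purely structural ranking.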

Another collective tweet Wikification approach, by Huang et al. [Hua14], also creates an undirected, weighted disambiguation graph in which each node represents a surface form - candidate entity pair. The approach is based on the following three principles:

• Local compatibility: Two surface form - candidate entity pairs < 𝑒𝑗, 𝑚𝑖 > that both exhibit strong local compatibility tend to have similar characteristics. For instance, a surface form and a corresponding candidate entity usually share a set of characteristics such as the string similarity between 𝑚𝑖 and 𝑒𝑗 (e.g., the pairs < 𝐻1𝑁1, 𝐻1𝑁1 > and < 𝐶ℎ𝑖𝑐𝑎𝑔𝑜, 𝐶ℎ𝑖𝑐𝑎𝑔𝑜 > have a high local compatibility in terms of similar labels). The local compatibility score between two pairs is defined as the cosine similarity between the respective local feature vectors.

• Coreference: If two surface forms such as ‘North Carolina’ and ‘nc’ are coreferential, both should be linked to the same entity, here North Carolina. The coreference score of a surface form pair 𝑚𝑖 and 𝑚ℎ is 1 if both have the same label or one is an abbreviation of the other; otherwise, the score is 0.

• Semantic relatedness: Two semantically related surface forms are more likely to be linked to entities that are also semantically related. The relatedness score between two nodes is defined as a combination of the relatedness between both surface forms and the relatedness between both candidate entities (computed with WLM [Mil08a]).

The three features are then combined linearly to obtain the edge weight between a node pair in the graph. To select the surface forms’ target entities, the authors proposed a semi-supervised graph regularization framework based on the graph-based learning framework in [Zhu03]. More information can be found in [Hua14].
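The three principles and their linear combination can be sketched in Python as follows. The coefficients `alpha`, `beta` and `gamma`, the initials-based abbreviation check and the node layout are assumptions made for illustration; the relatedness value is passed in as a precomputed number, and the graph regularization step is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def initials(label):
    """First letters of each word, e.g. 'north carolina' -> 'nc'."""
    return "".join(word[0] for word in label.split())


def coreference(m_i, m_h):
    """1.0 for identical labels or an initials-style abbreviation, else 0.0.

    The abbreviation test is a crude stand-in, assumed for this sketch.
    """
    a, b = m_i.lower(), m_h.lower()
    if a == b or a == initials(b) or b == initials(a):
        return 1.0
    return 0.0


def edge_weight(node_a, node_b, relatedness, alpha=1 / 3, beta=1 / 3, gamma=1 / 3):
    """Linear combination of the three features for one node pair.

    Each node is a dict with a 'mention' string and a local 'features' vector.
    """
    local = cosine(node_a["features"], node_b["features"])
    coref = coreference(node_a["mention"], node_b["mention"])
    return alpha * local + beta * coref + gamma * relatedness


# Toy nodes with made-up local feature vectors and relatedness.
node_a = {"mention": "North Carolina", "features": [1.0, 0.0, 1.0]}
node_b = {"mention": "nc", "features": [1.0, 0.0, 1.0]}
weight = edge_weight(node_a, node_b, relatedness=0.6)
```

Equal coefficients are only a placeholder here; any weighting of the three features would fit the same structure.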

In summary, graph-based algorithms are well suited for collective EL. These approaches have been thoroughly researched across different domains (e.g., the general domain and the biomedical domain [Zhe14]). Essentially, all graph-based approaches are related and differ mainly in the feature set used to compute the edge weights between candidate entities. To rank the graph nodes, many works applied the PageRank algorithm [Bri98], which has been well researched in the last decade.

In document Robust Entity Linking in Heterogeneous Domains (Page 65-69)