Exploiting Global Coherence without Global Inference

3.2 Global Coherence in Entity Linking

3.2.1 Exploiting Global Coherence without Global Inference

While linguistically well-founded in the concept of cohesion (Halliday and Hasan, 1976), global inference approaches (Kulkarni et al., 2009; Hoffart, Yosef, et al., 2011) do not scale well in the number of mentions and in the number of candidate entities considered for each mention. Inference becomes prohibitively slow when processing long texts with many mentions or when disambiguating many highly ambiguous entity mentions. In contrast, local approaches do not suffer from scalability issues, since they only optimize the similarity between a single mention’s context and a given candidate entity’s knowledge base entry, without considering other entities mentioned in the document (Bunescu and Paşca, 2006; Cucerzan, 2007). Recent local inference approaches achieve state-of-the-art results by using convolutional neural networks to capture similarity at multiple context sizes (Francis-Landau et al., 2016). However, local approaches fail, by definition, to take global coherence among entities into account.

[Figure 3.8 panels: Entity Linking; Knowledge Base; Linked Entities; Semantic Profile (geographic locations, temporal spans, entity types); Verification. Features for each linked mention: local context features and global coherence features (geographic outlier detection, geographic distance, temporal outlier detection, temporal overlap, entity type distribution similarity).]

FIGURE 3.8: Overview of our entity linking verification system.

To avoid the trade-off between the efficiency of local inference on the one hand and the coherence benefits of global inference on the other, we propose a two-stage approach: In the first stage, candidate entities are ranked by a fast, local inference-based entity linking system. In the second stage, these results are used to create a semantic profile of the given text, derived from rich data the knowledge base contains about the top-ranked candidates. Since the linking precision of current entity linking systems is relatively high, we assume that this profile is reasonably accurate and leverage it to measure the cohesive strength between a given candidate entity and the other, already-linked entities mentioned in the text. We then automatically verify the results from the first stage by classifying entity links as correct if they exhibit high coherence, and as wrong if there are only weak or no cohesive ties to the semantic profile. Figure 3.8 gives an overview of this process.

The verification results obtained in the second stage can then be used in at least three ways:

1. to increase linking precision by filtering out all entity links classified as wrong;

2. to rerank candidate entities by the class probability estimated by the verifier, i.e., prefer candidates that were predicted as correct with higher probability; or

3. to employ a more sophisticated entity linking system to re-link all entity links classified as wrong, using the entity links deemed correct as additional context.
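Options 1 and 2 can be illustrated with a minimal sketch. The `VerifiedLink` type, the function names, the 0.5 threshold, and the probability values are all hypothetical and chosen only for illustration; they are not taken from the system described here.

```python
from dataclasses import dataclass


@dataclass
class VerifiedLink:
    mention: str
    entity: str
    p_correct: float  # verifier's estimated probability that the link is correct


def filter_links(links, threshold=0.5):
    """Option 1: drop every link the verifier classifies as wrong."""
    return [link for link in links if link.p_correct >= threshold]


def rerank_candidates(candidates):
    """Option 2: order the candidates of one mention by verifier probability."""
    return sorted(candidates, key=lambda link: link.p_correct, reverse=True)


# Illustrative (made-up) verifier outputs for three linked mentions.
links = [
    VerifiedLink("DUBLIN", "Dublin", 0.97),
    VerifiedLink("Breeders Stakes", "Breeders'_Stakes", 0.12),
    VerifiedLink("The Curragh", "Curragh_Racecourse", 0.93),
]
kept = filter_links(links)

# Illustrative candidates for a single mention, reranked by probability.
candidates = [
    VerifiedLink("Breeders Stakes", "Breeders'_Stakes", 0.12),
    VerifiedLink("Breeders Stakes", "NIL", 0.64),
]
best = rerank_candidates(candidates)[0]
```

Filtering trades recall for precision, since a removed link leaves its mention unlinked, whereas reranking always returns some candidate for the mention.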

DUBLIN 1996-08-31 Result of the Tattersalls Breeders Stakes, a race for two-year-olds run over six furlongs at The Curragh ...

FIGURE 3.9: An example of misleading generic coherence. Text source: document 1112testa from the CoNLL development set.

In the following, we investigate options 1 and 2, leaving option 3 to future work.

As a motivating example for our approach, consider the sentence shown in Figure 3.9, in which an entity linking system correctly linked the following mentions:

• DUBLIN→DUBLIN, the capital of Ireland;

• Tattersalls→TATTERSALLS, a race horse auctioneer based in the UK and Ireland; and

• The Curragh→CURRAGH_RACECOURSE, a course for horse races in Ireland.

These entities clearly situate the text in Ireland. However, several current entity linking systems compared in our experiments link the mention Breeders Stakes to the Wikipedia article about a Canadian horse race of the same name. This mistake was likely made because the actual referent, a horse race sponsored by Tattersalls and held in Ireland, does not have a Wikipedia article. The system is then misled by other evidence: high similarity between the mention context and the Wikipedia article due to the appositive "race", as well as an almost perfect string match between the mention string and the article title. This mistake results in an interpretation in which all entities except one are located in Ireland or the UK, while one entity is isolated in Canada (Figure 3.10).

FIGURE 3.10: Example showing a geographical outlier: Breeders’ Stakes (red, in Canada) and contextual entities located in Ireland and the UK (green). Image source: https://www.bing.com/maps.

We aim to prevent these kinds of mistakes by verifying entity linking results, using aspects of coherence that have not been employed for entity linking so far, such as geographical coherence in the example above. To do so, we assume that an existing entity linking system has linked all entity mentions found in a document to an entry in the knowledge base. Due to entity linking mistakes, some of these entities may, in fact, not be referred to in the document. However, we can also expect that some of these entities have been correctly linked by the entity linking system (see footnote 9).

We now query the knowledge base for information about the entities that, according to the entity linking system, are mentioned in the document, regardless of whether this is actually the case or not. This information includes geographic data such as the locations of the entities mentioned in the document, temporal data such as years of birth or death, and the semantic types of all mentioned entities. We call this information the semantic profile of the document. The semantic profile of our motivating example is visualized as the blue box in Figure 3.8. The main idea in our approach is that the semantic profile allows judging whether a given linked entity is coherent with the document, that is, whether it is highly related to other entities mentioned in the document or not. We cast the comparison of a linked entity to the document’s semantic profile as a supervised classification task.
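To make the geographic part of the semantic profile concrete, the following sketch flags the Canadian entity of the motivating example as an outlier. The coordinates are the ones shown in Figure 3.8, converted to decimal degrees. The median-based outlier rule, the factor of 5, and all function names are assumptions made for illustration, not the feature definitions used in this work.

```python
import math
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


# Geographic slice of the semantic profile: linked entity -> (lat, lon).
profile = {
    "Dublin": (53.3478, -6.2597),
    "Tattersalls": (52.1667, 1.0),          # Suffolk
    "Breeders'_Stakes": (43.7, -79.4),      # Toronto
    "Curragh_Racecourse": (53.1653, -6.8453),
}


def median_distance_to_others(entity, profile):
    """Median distance from one linked entity to all other linked entities."""
    return median(
        haversine_km(profile[entity], coords)
        for other, coords in profile.items()
        if other != entity
    )


def geographic_outliers(profile, factor=5.0):
    """Flag entities whose median distance is far above the typical one (assumed rule)."""
    scores = {e: median_distance_to_others(e, profile) for e in profile}
    typical = median(scores.values())
    return [e for e, score in scores.items() if score > factor * typical]


outliers = geographic_outliers(profile)
```

In a classifier, the raw median distance would more plausibly serve as a numeric feature than as a hard decision; the threshold here only serves to show the outlier in the example.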

The input for our classifier is the output of an entity linking system, which consists of links to entries in the knowledge base for all entity mentions found in a given set of documents. Next, we extract a rich set of global, pairwise, and local features for each linked mention. Using the gold annotations, which provide the correct knowledge base entry for all mentions in the document set, we then train a classifier to predict whether a given mention was linked correctly by the system or not.
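The construction of training examples described above can be sketched as follows. The mention identifiers, the `NIL` gold label for the unlinkable race, and the stub feature extractor are hypothetical; the sketch only shows how the binary label is derived by comparing the system link with the gold annotation.

```python
def build_training_examples(system_links, gold_links, extract_features):
    """Pair each linked mention's features with a binary label.

    The label is 1 if the system linked the mention to the gold entity, else 0.
    """
    examples = []
    for mention_id, linked_entity in system_links.items():
        label = int(gold_links.get(mention_id) == linked_entity)
        examples.append((extract_features(mention_id, linked_entity), label))
    return examples


# Hypothetical system output and gold annotations for the motivating example.
system_links = {
    "m1": "Dublin",
    "m2": "Tattersalls",
    "m3": "Breeders'_Stakes",
    "m4": "Curragh_Racecourse",
}
gold_links = {
    "m1": "Dublin",
    "m2": "Tattersalls",
    "m3": "NIL",  # assumed gold label: the referent has no knowledge base entry
    "m4": "Curragh_Racecourse",
}


def stub_features(mention_id, entity):
    """Placeholder for the global, pairwise, and local feature extractors."""
    return {"entity": entity}


examples = build_training_examples(system_links, gold_links, stub_features)
labels = [label for _, label in examples]
```

Any standard binary classifier that outputs class probabilities could then be trained on these pairs.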

Recall that in global disambiguation approaches to entity linking (Kulkarni et al., 2009; Hoffart, Yosef, et al., 2011; Fahrni and Strube, 2012; Moro et al., 2014), global inference is an NP-hard problem, since all combinations of all candidate entities of all mentions are considered simultaneously. In our proposed automatic verification setting, inference scales linearly in the number of mentions since we only need to compare the top candidate entity for each mention to the document’s semantic profile. This allows employing knowledge-rich, global coherence features that otherwise would have prohibitively high computational cost. Our features are designed to exploit several aspects of global coherence, such as geographic or temporal coherence.
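The difference in scaling can be illustrated with a small calculation. The candidate counts are hypothetical, and the product is the size of the exhaustive search space, which global systems typically approximate rather than enumerate.

```python
import math

# Hypothetical number of candidate entities for each of four mentions.
candidates_per_mention = [30, 30, 30, 30]

# Joint global inference: one assignment per combination of candidates.
joint_assignments = math.prod(candidates_per_mention)

# Verification: one comparison of the top candidate per mention to the profile.
verification_comparisons = len(candidates_per_mention)

print(joint_assignments, verification_comparisons)
```

With these numbers, the joint search space contains 810,000 assignments, whereas verification needs four comparisons against the semantic profile.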

9 Since entity linking precision ranges between 60 and 90 percent on common datasets, this is not an unrealistic expectation. See, for example, the baseline precisions on the CoNLL and TAC15 datasets in Figure 3.12.

Predicate
/location/location/geolocation
/organization/organization/geographic_scope
/time/event/locations
/sports/sports_team/location
/organization/organization/headquarters

TABLE 3.7: Freebase predicates for querying geo-coordinates of locations, geo-political entities, events, and organizations.

In document Aspects of Coherence for Entity Analysis (Page 57-62)