3.1.1 Name Dictionary Methods

A name dictionary is the most commonly used technique for generating candidate entities. EL approaches build a dictionary that can be seen as a <key, value> structure. The key stores a surface form, and the value is a list containing all entities that may be referred to by that surface form. Table 3.1 shows an extract of a name dictionary.

Table 3.1: Part of a name dictionary

Key (Surface Form)         Value (Entities)
Apple Inc.                 Apple Inc.
Jordan                     Michael Jordan; Jordan, New York; Jordan River
M. Jordan                  Michael Jordan; Michael I. Jordan; Michael Jordan (Football)
President G. Washington    George Washington
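The <key, value> structure can be illustrated with a small Python sketch. It assumes that the rows of Table 3.1 group as shown in the comments; the names NAME_DICTIONARY and candidates are hypothetical and only serve the illustration.

```python
# A minimal sketch of a name dictionary: each surface form (key) maps to
# the list of entities (value) it may refer to. The grouping of entities
# under "Jordan" and "M. Jordan" is an assumed reading of Table 3.1.
NAME_DICTIONARY = {
    "Apple Inc.": ["Apple Inc."],
    "Jordan": ["Michael Jordan", "Jordan, New York", "Jordan River"],
    "M. Jordan": [
        "Michael Jordan",
        "Michael I. Jordan",
        "Michael Jordan (Football)",
    ],
}


def candidates(surface_form, dictionary=NAME_DICTIONARY):
    """Return all entities stored under the given surface form."""
    return list(dictionary.get(surface_form, []))


print(candidates("Jordan"))
print(candidates("Unknown name"))  # no entry, so the list is empty
```

A surface form such as 'Jordan' is ambiguous, which is why the value is a list rather than a single entity.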

The dictionary is constructed by leveraging the features offered by the respective knowledge base (KB) and/or other external resources. For instance, many works that link to Wikipedia entities extract possible surface forms from the following Wikipedia sources: entity pages, redirect pages, disambiguation pages and bold phrases in the first paragraph of an article [She15]. These kinds of features are used by nearly all Wikification systems (e.g., [Bun06;Gat13;Guo13]). However, the most important feature is the information about the entities’ usage context. More specifically, manually or automatically entity-annotated documents provide a rich source of relevant surface forms. For instance, Wikipedia articles contain hyperlinks that point to other Wikipedia entities. The anchor text of a link represents a surface form of the target entity and provides a useful source of synonyms and other name variations. In Example 3.1, the surface forms ‘TS’ and ‘New York’ refer to the entities New York Times Square and New York City.

Example 3.1. The TS has been a New York attraction for over a century.
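How anchor texts can feed a name dictionary may be sketched as follows. The sketch assumes that the annotated text is available as MediaWiki markup, in which internal links are written as [[Target]] or [[Target|anchor text]]. The function name harvest_surface_forms and the page titles in the sample sentence are illustrative assumptions, not taken from any cited system.

```python
import re
from collections import defaultdict

# Matches MediaWiki internal links: [[Target]] or [[Target|anchor text]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def harvest_surface_forms(wikitext, dictionary=None):
    """Add every (anchor text -> link target) pair found in the markup."""
    if dictionary is None:
        dictionary = defaultdict(set)
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without an explicit anchor, the target title is the visible text.
        anchor = (match.group(2) or target).strip()
        dictionary[anchor].add(target)
    return dictionary


# Example 3.1 written as hypothetical wiki markup.
sample = (
    "The [[Times Square|TS]] has been a "
    "[[New York City|New York]] attraction for over a century."
)
name_dictionary = harvest_surface_forms(sample)
print(dict(name_dictionary))
```

Running the function over many articles accumulates several targets per anchor text, which yields the ambiguous entries that a name dictionary is meant to capture.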

There also exist external corpora that are annotated with Wikipedia entities. A popular example is the Google Wikilinks corpus, which provides ≈42 million surface forms and ≈3 million distinct Wikipedia entity annotations. Further corpora were proposed in [Art10] and [Day08], both providing ≈55 000 annotated Wikipedia entities. In general, corpora with a large number of manually annotated entities are rare because they require significant human effort. If entity annotations were created automatically, the accuracy of the annotation system has to be taken into account.

Given a (name) dictionary, candidate entities are usually determined by exactly matching the query surface forms against those stored in the dictionary while ignoring the distinction between upper and lower case letters. Depending on the domain, however, capitalization may play a crucial role, since capital letters can further specify the underlying entity (e.g., gene entities).
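The following sketch shows exact matching with an optional case-sensitive mode. The helper names build_index and exact_match as well as the toy entries (including the gene example) are assumptions made for illustration.

```python
def build_index(dictionary, case_sensitive=False):
    """Index surface forms, folding case unless case_sensitive is set."""
    index = {}
    for surface_form, entities in dictionary.items():
        key = surface_form if case_sensitive else surface_form.casefold()
        bucket = index.setdefault(key, [])
        for entity in entities:
            if entity not in bucket:
                bucket.append(entity)
    return index


def exact_match(query, index, case_sensitive=False):
    """Return the entities whose surface form equals the query."""
    key = query if case_sensitive else query.casefold()
    return list(index.get(key, []))


# Hypothetical entries: in a gene dictionary, case can carry meaning.
toy_dictionary = {
    "Jordan": ["Michael Jordan", "Jordan River"],
    "CAT": ["Catalase (gene)"],
    "cat": ["Cat (animal)"],
}

folded = build_index(toy_dictionary)
strict = build_index(toy_dictionary, case_sensitive=True)

print(exact_match("JORDAN", folded))
print(exact_match("CAT", folded))                       # both readings
print(exact_match("CAT", strict, case_sensitive=True))  # gene only
```

Folding case merges 'CAT' and 'cat' into one entry, which raises recall in general text but loses the distinction that matters in the gene domain.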

One of the major obstacles that makes exact term matching insufficient is the problem of term variations. Consequently, besides exact matching, partial term matching is essential to achieve a high recall in candidate entity generation. Tsuruoka et al. [Tsu07] described the following most frequent term variations:

• Spelling mistakes

• Orthographic variation (e.g., gene ‘IL2’ and ‘IL-2’)

• Morphological variation (e.g., ‘Transcriptional factor’ and ‘Transcription factor’)

• Roman-Arabic (e.g., ‘Leopold 3’ and ‘Leopold III’)

• Acronym-definition (e.g., ‘NATO’ and ‘North Atlantic Treaty Organization’)

• Extra words (e.g., ‘United States’ and ‘United States of America’)

• Different word ordering (e.g., ‘Serotonin receptor 1D’ and ‘Serotonin 1D receptor’)

• Parenthetical material (e.g., ‘The Noise Conspiracy’ and ‘The (International) Noise Conspiracy’)

Term variants often result from a combination of these variations and can be very complex. One way to alleviate the problem is to normalize surface forms [Fan06;Usb14] if no appropriate candidate entities could be found. Normalization includes converting capital letters to lower case and deleting hyphens and spaces, which can resolve some of the mismatches caused by orthographic variation [Tsu07]. Some approaches additionally apply a spell checker to handle misspelled surface forms. For instance, Chen et al. [Che10] applied the Apache Lucene spell checker to obtain the correct surface form. In contrast, Zhang et al. [Zha10a] made use of the Wikipedia built-in feature “Did you mean?”, which provides an entity suggestion for a misspelled string (surface form). Several other works correct spelling mistakes by using the spelling correction service supplied by the Google search engine (e.g., [She12b;Zhe10]).
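The normalization step can be sketched in a few lines. The functions normalize and lookup are hypothetical names, and the sketch covers only lower-casing and the removal of hyphens and whitespace; it is applied as a fallback when the unmodified surface form has no dictionary entry.

```python
def normalize(surface_form):
    """Lower-case the string and delete hyphens and whitespace."""
    return "".join(
        ch for ch in surface_form.lower() if ch != "-" and not ch.isspace()
    )


def lookup(surface_form, dictionary):
    """Try the raw surface form first, then fall back to normalized keys."""
    if surface_form in dictionary:
        return list(dictionary[surface_form])
    normalized_index = {}
    for key, entities in dictionary.items():
        normalized_index.setdefault(normalize(key), []).extend(entities)
    return normalized_index.get(normalize(surface_form), [])


# Illustrative entry for the orthographic variants 'IL2' and 'IL-2'.
gene_dictionary = {"IL-2": ["Interleukin 2"]}

print(normalize("IL-2"), normalize("il 2"))
print(lookup("IL2", gene_dictionary))
```

Rebuilding the normalized index on every call keeps the sketch short; a real system would presumably precompute it once.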

Furthermore, many works apply string similarity measures, such as Levenshtein distance, Hamming distance, Dice score or Skip Bigram Dice score, to match surface forms in documents to surface forms in the dictionary. Such approximate string matching methods resolve some of the term variation issues listed above. A survey of string matching methods can be found in [Had11].
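Two of the measures named above can be sketched as follows: the Levenshtein distance, computed with the standard dynamic programming recurrence, and a Dice score over sets of character bigrams (one common variant among several). The function fuzzy_candidates and its threshold max_distance=1 are assumptions for illustration, not settings reported by any cited work.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row that is stored as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def dice(a, b):
    """Dice score over the sets of character bigrams of both strings."""
    x = {a[i:i + 2] for i in range(len(a) - 1)}
    y = {b[i:i + 2] for i in range(len(b) - 1)}
    if not x and not y:
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def fuzzy_candidates(query, dictionary, max_distance=1):
    """Collect entities of all keys within max_distance edits of the query."""
    found = []
    for key, entities in dictionary.items():
        if levenshtein(query.lower(), key.lower()) <= max_distance:
            for entity in entities:
                if entity not in found:
                    found.append(entity)
    return found


print(levenshtein("IL2", "IL-2"))  # 1 (one inserted hyphen)
print(dice("night", "nacht"))      # 0.25 (one shared bigram out of 4 + 4)
print(fuzzy_candidates("Jordon", {"Jordan": ["Michael Jordan"]}))
```

Comparing the query against every key is linear in the dictionary size, so large dictionaries would need an index or a blocking step in practice.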

Other approaches apply more advanced techniques. For instance, Moreau et al. [Mor08a] proposed a robust, generic model based on Soft TF-IDF [Coh03] to show that similarity measures may be combined in numerous ways. Their model outperforms all other evaluated measures on two corpora. In the biomedical domain in particular, string similarity measures have been studied extensively, since small character changes may lead to different entity interpretations. In this context, Tsuruoka et al. [Tsu07] proposed a logistic-regression-based approach that learns a string similarity measure from a dictionary. Their results indicate that the learned measure outperforms all other evaluated measures, such as a hidden Markov model [Smi03], Soft TF-IDF, Jaro-Winkler distance [Win90] and Levenshtein distance, in dictionary look-up tasks. Rudniy et al. [Rud14] introduced the Longest Approximately Common Prefix (LACP) method for biomedical string comparison. LACP runs in linear time and outperforms nine other string similarity measures, such as cosine similarity with TF-IDF weights [Sal88], Jaro-Winkler distance [Win90] and the Needleman-Wunsch algorithm [Nee70], in terms of precision and performance.

In summary, the most common rules for partial name dictionary matching applied in EL systems include [She15]:

• The entity name is contained in or contains the surface form.

• The entity name exactly matches the first letters of all words in the surface form or vice versa.

• The entity name shares one or more common words with the surface form.

• The entity name is very similar to the surface form but does not exactly match it.

If a surface form matches a key in the dictionary during partial matching by satisfying at least one of these rules, all entities stored under that key are added as candidate entities. A drawback of partial matching is that the increase in recall may come at the cost of (significantly) decreased precision. In general, the order in which exact and partial matching methods are applied depends on the respective approach. Typically, partial matching is applied only if exact matching does not retrieve any candidate entities. Overall, name dictionary methods for candidate entity generation are used by most EL systems, but they depend strongly on the quantity and quality of the underlying entity data.
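The four partial matching rules and the typical fallback order (exact first, partial second) can be sketched as below. The names partially_matches and generate_candidates are hypothetical, and both the use of difflib.SequenceMatcher from the Python standard library for the "very similar" rule and the threshold of 0.85 are assumptions rather than choices made by the surveyed systems.

```python
from difflib import SequenceMatcher


def initials(text):
    """First letters of all whitespace-separated words."""
    return "".join(word[0] for word in text.split())


def partially_matches(query, key, min_similarity=0.85):
    """True if at least one of the four partial matching rules holds."""
    q, k = query.lower().strip(), key.lower().strip()
    if not q or not k:
        return False
    if q in k or k in q:                        # rule 1: containment
        return True
    if q == initials(k) or k == initials(q):    # rule 2: first letters
        return True
    if set(q.split()) & set(k.split()):         # rule 3: shared word
        return True
    # Rule 4: very similar strings (assumed similarity measure and threshold).
    return SequenceMatcher(None, q, k).ratio() >= min_similarity


def generate_candidates(query, dictionary):
    """Exact (case-insensitive) matching first, partial matching as fallback."""
    def collect(keys):
        found = []
        for key in keys:
            for entity in dictionary[key]:
                if entity not in found:
                    found.append(entity)
        return found

    exact = collect(k for k in dictionary if k.lower() == query.lower())
    if exact:
        return exact
    return collect(k for k in dictionary if partially_matches(query, k))


# Toy dictionary with illustrative entries.
toy = {
    "North Atlantic Treaty Organization": ["NATO"],
    "United States of America": ["United States"],
    "Jordan": ["Michael Jordan", "Jordan River"],
}

print(generate_candidates("NATO", toy))           # matched via rule 2
print(generate_candidates("United States", toy))  # matched via rule 1
print(generate_candidates("jordan", toy))         # exact match
```

Rule 3 in particular is permissive: any shared word triggers a match, which illustrates why partial matching tends to trade precision for recall.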