2.4 Named Entity Annotation

With entities being a key concept in our document model, the recognition of entity mentions in documents is a central aspect of document preprocessing and annotation. For many applications, the document collection can be expected to be too large for the contained entities to be annotated manually. Therefore, automated approaches are required, which we discuss in the following. This annotation of entities can be split into three steps, namely the recognition and classification of entities, the disambiguation of entities, and the linking of mentions to an entry in a knowledge base or gazetteer. While the latter two steps could be considered separate, they are typically performed jointly. For entity disambiguation and linking in particular, we also provide some details on using Wikipedia and YAGO as knowledge graphs.

2.4.1 Named entity recognition

The first question that one should address in regard to entities is "What is an entity?", the answer to which is of course highly philosophical. In practice, entity annotation approaches have therefore typically focussed on named entities, which are somewhat easier to categorize, since names can be seen as rigid designators that uniquely identify the entity in question [110]. For example, the name Perth, Western Australia is a unique identifier for the Australian city Perth. However, note that shortened forms of a name might be ambiguous (and therefore non-rigid), such as Perth, which could also refer to the city in Scotland. The concept of a rigid designator is what enables the disambiguation of entities and the representation of entities with multiple names by one unique identifier. Historically, most emphasis has been put on named entities of the types person, location, and organization [79], although temporal expressions are often considered to be named entities as well. In the evaluation of our work in the following, we focus on these four types of entities, but note that this choice depends on the data sets that we use and not on the model itself, which is agnostic to the types of entities. In fact, the implicit network representation as introduced in Chapter 3 can be applied to any type of entity annotation that is appropriate in a given context, even if this definition of an entity were to be novel.

From a conceptual point of view, the recognition, annotation, and classification of entities can then simply be described as the task of finding chunks in the text that represent the mention of an entity, determining the type of entity, and annotating it accordingly. In practice, however, this task is complex and the subject of ongoing research. For an overview and as a starting point, we recommend the survey by Nadeau and Sekine [139], which covers both the recognition of entities and their classification. In our work, we use three tools in particular, namely the Stanford NLP toolkit, Ambiverse, and HeidelTime. However, since these tools also provide entity disambiguation and linking functionality, we discuss them only after we introduce these concepts.
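The conceptual description of recognition and classification given above (find chunks, determine their type, annotate them) can be sketched with a simple dictionary lookup. The following is a minimal illustration only, not the method of any of the tools named above: the gazetteer contents, the `Mention` type, and the `recognize` function are hypothetical names introduced for this example, and real systems rely on statistical or neural sequence models rather than exact string matching.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int   # character offset of the first character of the mention
    end: int     # character offset one past the last character
    text: str    # surface form as it appears in the document
    label: str   # entity type assigned to the mention


# Hypothetical toy gazetteer that maps names to entity types.
GAZETTEER = {
    "John F. Kennedy": "PERSON",
    "Kennedy": "PERSON",
    "Perth": "LOCATION",
}


def recognize(text: str) -> List[Mention]:
    """Find gazetteer names in the text and annotate them with their type."""
    mentions: List[Mention] = []
    # Try longer names first so that "John F. Kennedy" wins over "Kennedy".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            overlaps = any(
                match.start() < m.end and m.start < match.end()
                for m in mentions
            )
            if not overlaps:
                mentions.append(
                    Mention(match.start(), match.end(), match.group(), GAZETTEER[name])
                )
    # Return the mentions in document order.
    return sorted(mentions)


if __name__ == "__main__":
    for mention in recognize("John F. Kennedy never visited Perth."):
        print(mention)
```

Note that such a lookup leaves the mention Perth ambiguous between the Australian and the Scottish city, which is precisely the problem that disambiguation and linking address.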

2.4.2 Named entity disambiguation and linking

Once entity mentions in a document have been recognized (automatically), ambiguous mentions can be problematic. Consider, for example, a document that mentions The President, which could refer to numerous different individuals and is thus highly ambiguous. However, the context and metadata can often be helpful in resolving these ambiguities. For example, if the mention above occurs in a political document of the United States, it becomes more likely that it refers to the President of the United States. This, of course, is still ambiguous since more than one person has held this office over the years, but metadata (such as the document date) or contained information (such as temporal expressions) may be used to resolve it. In our example, if the document can be dated to 1963, then the person in question could be John F. Kennedy or Lyndon B. Johnson. Even other ambiguous entity mentions in the text can be helpful in resolving the ambiguities. For example, if the text contains the phrase The President and his wife Jacqueline, it becomes very likely that the document is referring to the 35th president, John F. Kennedy.

In practice, the task of automated entity disambiguation can be modelled as the assignment of probability scores to possible target entities for each mention in the text, based on which a selection is performed. Typically, a list of candidate entities is generated from an ontology, knowledge base, or gazetteer for each mention. Therefore, once they are disambiguated, the entity mentions are also linked to an external resource, which motivates the term entity linking. The task is sometimes also referred to as wikification when Wikipedia is used as the target knowledge base. Approaches to solving this task are diverse and include, for example, the use of the context of mentions alongside knowledge bases to construct a coherence graph from which the best candidates for all mentions are determined jointly [97], or the computation of centrality scores on graphs that are derived from the similarity of word embeddings [233].
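The scoring view of disambiguation described above can be made concrete with a small sketch. It is an illustration under strong simplifying assumptions and not the approach of [97] or [233]: each mention is scored on its own rather than jointly, the candidate list, prior values, and context keywords are hand-picked hypothetical data, and the function names are introduced here for the example only.

```python
import math
import re
from typing import Dict, Set

# Hypothetical candidates for the mention "The President" in a document
# dated 1963, each with a prior and a set of associated context keywords.
CANDIDATES: Dict[str, dict] = {
    "John F. Kennedy": {"prior": 0.5, "keywords": {"1963", "jacqueline"}},
    "Lyndon B. Johnson": {"prior": 0.5, "keywords": {"1963", "lady", "bird"}},
}


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def score_candidates(context: str, candidates: Dict[str, dict]) -> Dict[str, float]:
    """Assign a probability score to each candidate entity for one mention."""
    tokens = tokenize(context)
    raw = {
        name: math.log(info["prior"]) + len(tokens & info["keywords"])
        for name, info in candidates.items()
    }
    # Softmax normalization turns the raw scores into probabilities.
    highest = max(raw.values())
    exp = {name: math.exp(score - highest) for name, score in raw.items()}
    total = sum(exp.values())
    return {name: value / total for name, value in exp.items()}


def disambiguate(context: str, candidates: Dict[str, dict]) -> str:
    """Select the candidate with the highest probability score."""
    scores = score_candidates(context, candidates)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    sentence = "In 1963, The President and his wife Jacqueline attended."
    print(score_candidates(sentence, CANDIDATES))
    print(disambiguate(sentence, CANDIDATES))
```

Because the selected candidate is an entry of the candidate list, the mention is linked to that entry as a side effect of the selection, which mirrors why disambiguation and linking are typically performed jointly.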

2 Background and Related Work

Of course, the phenomenon of ambiguity is not unique to entities, but affects all kinds of words, meaning that the task is a special case of word sense disambiguation [141], although there are some differences in practice [40]. Finally, the problem is not limited to explicit mentions of entities, but also includes the dereferencing of pronouns, and is thus linked to the task of anaphora resolution [91].

2.4.3 Entity annotation toolkits

For the evaluation and exploration of our entity-centric implicit networks and their applications, we require large collections of documents that are annotated for named entities. In the following, we therefore discuss the annotation tools that we later use.

HeidelTime. HeidelTime is a rule-based temporal tagger [188] that we use for the extraction of temporal expressions. In addition to the extraction of temporal expressions, HeidelTime also normalizes them to the TimeML standard [151], thereby providing entity disambiguation capabilities for temporal expressions. While the Stanford NLP toolkit also offers dates as one of the types of entities that it can extract, HeidelTime has the advantage of being domain sensitive [189], meaning that it can be adapted to the formatting in different text domains. In particular, it supports both the narrative domain (which includes Wikipedia texts) and the news domain (which we require for document stream analyses).

Stanford CoreNLP. The Stanford CoreNLP toolkit is a general-purpose natural language processing pipeline that includes entity recognition and classification functionality, but also provides sentence splitting, tokenization, and part-of-speech tagging [126]. Since it does not include entity linking, we use it in our experiments in applications where we focus on entities that are annotated and classified, but not linked.
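To illustrate what the normalization of temporal expressions to TimeML mentioned for HeidelTime above amounts to, the following sketch tags dates of one fixed format with TIMEX3 annotations. It is a toy example and not HeidelTime itself, which is a far more comprehensive rule-based system: the single pattern, the `tag_dates` function, and the example sentence are assumptions made for this illustration, while the `tid`, `type`, and `value` attributes follow the TIMEX3 tag of TimeML.

```python
import re

# Map English month names to their month number.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches dates such as "June 26, 1963" (one fixed format only).
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b"
)


def tag_dates(text: str) -> str:
    """Wrap each matched date in a TIMEX3 tag with a normalized value."""
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        month, day, year = match.group(1), match.group(2), match.group(3)
        # Normalize to the ISO-style value YYYY-MM-DD.
        value = f"{int(year):04d}-{MONTHS[month]:02d}-{int(day):02d}"
        return (
            f'<TIMEX3 tid="t{counter}" type="DATE" value="{value}">'
            f"{match.group(0)}</TIMEX3>"
        )

    return DATE_PATTERN.sub(replace, text)


if __name__ == "__main__":
    print(tag_dates("The report is dated June 26, 1963."))
```

The normalized value acts as a unique identifier for the date, so that differently formatted mentions of the same day can be mapped to the same entity.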

Ambiverse. The Ambiverse natural language understanding suite is our primary tool for entity linking. It includes named entity recognition based on KnowNER [166], and provides named entity linking capability derived from AIDA [97]. Thus, it can be used as an end-to-end tool for annotating named entities in documents and linking them to Wikidata and YAGO identifiers. The last step is particularly important, since it is beneficial to rely on links to both knowledge bases, as we discuss in the following.

2.4.4 Linking and classifying entities with knowledge graphs

For the linking of entities, choosing the knowledge graph that is used for obtaining the entity names is not trivial. In addition to providing additional external knowledge about the linked entities (which differs from knowledge graph to knowledge graph), its class