
Towards an Entity-centric Document Model

hierarchy also serves as the primary source of information in the classification of entities. In this regard, both Wikidata and YAGO have some advantages and drawbacks.

The benefit of Wikidata, as mentioned in Chapter 2.3, is its inherently collaborative design, which enables almost immediate updates of the knowledge graph as new entities emerge and are referenced in documents. While this is of prime relevance for the processing of document streams such as news articles, it also comes at the cost of a less stable class hierarchy. Since the Wikidata knowledge graph must be able to adapt to changing content to accommodate edits, its hierarchy also changes over time. Furthermore, the potential for vandalism [87] and the fact that errors are committed not systematically by one algorithm but randomly by a multitude of users make the hierarchy of Wikidata less predictable than that of a traditional extracted knowledge base. As a result, using Wikidata class hierarchies is a difficult task even for the classification of the common entity types persons, locations, and organizations (for a detailed discussion, see [176]).

In contrast, the class hierarchy of YAGO is partially derived from the WordNet ontology [132], which provides a slim hierarchy that is well suited for entity classification. This benefit, of course, comes at the cost of topicality, since YAGO is not updated incrementally but extracted in batch. As a result, keeping up to date with newly emerging entities is problematic when only YAGO is used.

To benefit from both approaches, we use a combination of the two knowledge bases in our work, which has the added advantage of potentially providing data from both knowledge bases to any linked entity. For the identification of entities, we use Wikidata identifiers. Since both Wikidata and YAGO are part of the Semantic Web, entities can in principle be matched through RDF relations. However, in our case, we found it simpler to use the Wikipedia page links that are contained in both Wikidata (due to the connection between the two projects) and YAGO (since it is extracted from Wikipedia content). We rely on this approach whenever we disambiguate entities for our evaluations and demonstrations in the following chapters.
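The matching through shared Wikipedia pages can be illustrated with a minimal sketch. The function name, the dictionary layout, and the toy identifiers below are assumptions made for illustration only and are not taken from the actual pipeline; real data would come from the respective knowledge base dumps.

```python
from typing import Dict, Optional


def link_by_wikipedia_page(
    wikidata_to_page: Dict[str, str],
    yago_to_page: Dict[str, str],
) -> Dict[str, Optional[str]]:
    """Map each Wikidata identifier to the YAGO entity that shares its
    Wikipedia page title, or to None if YAGO has no such entity."""
    # Invert the YAGO mapping so that pages can be looked up directly.
    page_to_yago = {page: yago_id for yago_id, page in yago_to_page.items()}
    return {
        wd_id: page_to_yago.get(page)
        for wd_id, page in wikidata_to_page.items()
    }


# Toy data (hypothetical excerpt; "Q_NEW" stands for a newly emerged
# entity that is present in Wikidata but absent from the YAGO batch).
wikidata_to_page = {"Q64": "Berlin", "Q183": "Germany", "Q_NEW": "Some_New_Entity"}
yago_to_page = {"<Berlin>": "Berlin", "<Germany>": "Germany"}

links = link_by_wikipedia_page(wikidata_to_page, yago_to_page)
```

In this sketch, an entity without a YAGO counterpart simply maps to None, which mirrors the situation in which only Wikidata information is available for a newly emerging entity.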

2.5 Towards an Entity-centric Document Model

In this chapter, we have discussed numerous different approaches to the representation of documents, word and entity relations, and cooccurrences. However, when viewed through the lens of the requirements for entity-centric explorations of the documents and with a focus on entity relations within the context of their mentions, these approaches exhibit

2 Background and Related Work

one or several weaknesses. In the following, we summarize these issues and highlight what direction we are taking to propose a solution.

Vector space representations

In contrast to all other document models, the vector space representation has the advantage of an essentially unlimited window size. Any two terms that are contained in the same document are part of the same document vector. However, the model is focused on the occurrence of terms to model document content, but does not contain a fine-grained notion of cooccurrence beyond document-level cooccurrences. As a result, while this model still provides a solid performance for document retrieval tasks despite its age, it is ill-suited for the representation and exploration of relations between terms inside the documents. By using a network approach in the following, we focus on the edges (that is, the relations between entities and terms), not on the nodes (that is, the terms). Thus, unlike the vector space model, weights in our network represent more flexible relation weights instead of term weights. In particular, relation weights can always be transformed to term weights if the localized scope around a finite set of entities is considered, while the reverse would be more difficult.
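To illustrate the last point, the following sketch derives term weights from relation weights for the local scope around a given set of entities. The aggregation by summation, the function name, and the toy edge weights are assumptions chosen for illustration; other aggregation functions are equally conceivable.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def term_weights_from_edges(
    edges: Dict[Tuple[str, str], float],
    query_entities: Set[str],
) -> Dict[str, float]:
    """Aggregate the weights of all edges that connect a query entity
    to a node outside the query set into one weight per outside node."""
    weights: Dict[str, float] = defaultdict(float)
    for (u, v), w in edges.items():
        if u in query_entities and v not in query_entities:
            weights[v] += w
        elif v in query_entities and u not in query_entities:
            weights[u] += w
        # Edges inside or entirely outside the query set are ignored.
    return dict(weights)


# Toy undirected network with hypothetical relation weights.
edges = {
    ("Merkel", "Berlin"): 2.0,
    ("Merkel", "chancellor"): 1.5,
    ("Berlin", "capital"): 1.0,
    ("Paris", "capital"): 3.0,
}

local_weights = term_weights_from_edges(edges, {"Merkel", "Berlin"})
```

The reverse direction is not available in the same way: a single weight per term does not reveal which pairs of nodes gave rise to it.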

Word embeddings

While word embeddings are the de facto standard model and current state-of-the-art for word representation and numerous retrieval applications, they suffer from three shortcomings. One is the limited size of the sliding window that limits the scope of observed cooccurrences. In many of the approaches, this window size is an artefact of the neural network architecture and cannot be increased without substantially increasing the training time of the models. The second problem concerns the issue of similarity and relatedness. Word embeddings are typically good at recovering word similarity due to the way in which they are trained, meaning that embeddings tend to be similar if the words they represent are (roughly speaking) somewhat interchangeable. However, they tend to perform worse when modelling the relatedness between words that occur in similar contexts but are not interchangeable. Therefore, despite the fact that embeddings are often used for such tasks, they are a poor choice when a measure of entity or term relatedness is required [8]. A third issue is the frequency of words, which embedding approaches typically address by limiting the vocabulary to the most frequent words. As a result, embeddings neglect potentially interesting parts of the document collection, which especially includes rare entities. While such rare entities and their connections might be spurious, it is also well known that a substantial fraction of entities is in the long tail and thus rare [53, 194]. Therefore, disregarding them equates to disregarding a substantial amount of information in the document collection. However, precisely because these entities are so rare, even their global context is limited to few occurrence instances. Therefore, using a network representation of such entities with a very small neighbourhood size in the network may be more space efficient than relying on a relatively high-dimensional vector representation of the words. Thus, networks stand to offer a way of retaining the full relation information for rarer entities while requiring minimal training effort.

Collocations

Similar to embeddings, the analysis of collocations utilizes extremely narrow window sizes. However, in contrast to embeddings, this window size is narrow by design, since the focus lies on adjacent words that cooccur with significant frequency. For the exploration of entity relations, however, one finds that entities seldom occur in close proximity. Since entities are typically nouns, there is no syntactic basis for them to be adjacent (or even closely located in many cases). Thus, while the concept of considering cooccurrences for the sake of detecting word relations is similar to our proposed entity-centric network, the scope and scale of collocations do not match the task.

Relation extraction and knowledge graphs

The strength of knowledge graphs and relation extraction lies in identifying concrete relations between entities with high precision. However, due to this focus on precision, they are prone to low recall, and stand to miss relations that are a priori unknown or hard to specify in concise syntactic terms. Furthermore, while the edges of knowledge graphs are attributed, they are typically not weighted, which allows inference and reasoning, but complicates ranking and recommendation tasks. For the exploration of neighbourhoods in densely populated regions of the graph, they thus face difficulties. In contrast to embeddings, knowledge graphs are good sources of entity relation information, but less well suited to tasks that require a similarity of entity contexts, since much of the context information is lost during the extraction process. Thus, due to the focus on knowledge, the data that makes it into the knowledge base is a fraction of the content of the documents and too sparse to use as a model. However, knowledge bases can provide useful information not just for entity linking and disambiguation, but also as external knowledge in a suitable document model of entities, as we discuss in Chapter 7.


The way ahead

A final and central observation from our discussion of the related work is the lack of a notion of distance, as none of the presented approaches considers the distance between term and entity mentions in the text. However, proximity-based methods of representing cooccurrences have been shown to play a major role in the quality of the retrieved information [197]. Therefore, we conjecture that a model that includes not just cooccurrences but also cooccurrence distances is well situated to derive meaningful ranking scores from this information.
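As a rough illustration of what including cooccurrence distances could look like, the following sketch accumulates a weight for each pair of mentions that decreases with their distance in sentences. The exponential decay, the cut-off max_distance, and the function name are assumptions made for this example and do not anticipate the formal model introduced later.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def distance_weighted_cooccurrences(
    mentions: List[Tuple[str, int]],
    max_distance: int = 2,
) -> Dict[Tuple[str, str], float]:
    """Accumulate a distance-dependent weight for every pair of distinct
    terms. `mentions` holds (term, sentence index) pairs of one document."""
    weights: Dict[Tuple[str, str], float] = defaultdict(float)
    for (a, i), (b, j) in combinations(mentions, 2):
        distance = abs(i - j)
        if a == b or distance > max_distance:
            continue
        # Sort the pair so that the edge is treated as undirected.
        key = tuple(sorted((a, b)))
        # Assumed decay: closer mentions contribute more weight.
        weights[key] += math.exp(-distance)
    return dict(weights)


# Toy document with hypothetical mention positions.
mentions = [("Berlin", 0), ("Merkel", 0), ("Paris", 2), ("Merkel", 3)]
cooc = distance_weighted_cooccurrences(mentions)
```

In contrast to a plain cooccurrence count, two mentions in the same sentence contribute more to the edge weight than two mentions that are several sentences apart, and pairs beyond the cut-off do not contribute at all.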

With regard to the representation of entity relations, we find similar room for improvement. On a spectrum of potential entity relations that ranges from continuous weights that are available for all pairs of words on the one side to discrete relations of only entities on the other, vector embeddings and knowledge graphs represent two extremes. While knowledge graphs contain only explicit relations between few entities, vector embeddings can be used to derive similarities between any two words in the document collection. In the following, we consider a middle ground in which edges in the network are continuously weighted but relations are not potentially omnipresent. Based on the implicit network model, we explore the context of entities and terms to distinguish those that should be semantically related from those that could potentially be related.

Instead of extracting discrete relations between the terms and entities of a document collection, or learning vector embeddings that serve to later derive such relations, we investigate whether it might be more reasonable to represent the entire collection as a network in the first place.

3 Implicit Entity Networks

Based on the deliberations in the previous chapter and the clear lack of entity-centric document models, we introduce implicit entity networks in the following. To address the shortcomings of existing document models, we design the model to be both a conceptual representation of the data that supports a multitude of different retrieval tasks, as well as an efficient index that supports entity-centric retrieval operations on the underlying document collections. In doing so, we provide a basis for the more specialized applications that we introduce in the subsequent chapters.

Contributions. In this chapter, we thus make the following three contributions.

I We introduce and formalize entity-centric implicit networks as an efficient and versatile model of large document collections.

II We evaluate the performance of implicit networks for information retrieval tasks, and determine the scalability of such queries.

III We evaluate the performance of implicit networks against word embeddings as the current state-of-the-art in word representation.

References. Parts of this chapter are based on the following peer-reviewed publication and manuscript in preparation:

A. Spitz and M. Gertz. “Terms over LOAD: Leveraging Named Entities for Cross-Document Extraction and Summarization of Events”. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2016, pp. 503–512. doi: 10.1145/2911451.2911529

G. Feher, A. Spitz and M. Gertz. “Evaluating Word Embeddings for the Prediction of Entity Participation in News Events”. In preparation. 2019


3.1 Motivation

The textual description of phenomena in the real world, such as events, incidents, facts, or concepts, typically involves important entities as central components, which is reflected in mentions of (named) entities in the texts. Given a large collection of documents, however, such descriptions may be incomplete in a single context, or distributed across multiple documents. Consider, for example, the documentation of the Olympic Games and affiliated events in Wikipedia. For each iteration of the games, there are pages that go into great detail about the games themselves, along with a multitude of pages about individual athletes. However, not all information about everything that transpired at the games can be found on the central page, while the pages of individual athletes or sports events may be lacking important information, such as exact dates or even the location, and merely reference the Olympic Games of a given year. Thus, only the combination of data about multiple entities and from multiple contexts enables the full reconstruction of all pieces of a given iteration of the Olympic Games, including the place, the time, the involved participants, and their performances. As a result, numerous central tasks in information retrieval, such as event detection or summarization, are closely tied to named entity extraction and entity linking. For large document collections with a diverse set of entities, contexts, and concepts, a global approach that accounts for this distribution of mentions then also requires the incorporation of efficient indexing strategies across the documents.

To address entity-centric retrieval tasks in these cases, it can be beneficial to leverage partial information about the involved entities from multiple contexts to reconstitute the distributed relations. When we are investigating texts with a focus on the occurrences and cooccurrences of different types of entities across documents, we are therefore no longer bound by the concepts of one sense per discourse [63] or one sense per collocation [225], since we are not limited to one discourse or to one collocation. While these two interpretations have proven valid in many situations, there is room for improvement when large collections of documents are considered that contain partial and distributed relations. Since the task of entity disambiguation becomes less taxing when multiple entities of differing types are involved in a given context, merging the shared contexts of entity mentions stands to improve the retrieval performance. Thus, a model of the entire document collection that accounts for the context in which entities are mentioned appears sensible. This notion is further supported by recent research into web query evaluation, where the importance of a distinction between so-called content words and intent words has been highlighted [158].