4.3 Entity-centric Summarization

4.3.6 Implications and Outlook

Two issues that we uncovered in the evaluation above deserve further investigation. We discuss them in the following.

4 Applications of Implicit Networks

Effects of the length of sentences

As we described in the evaluation of sentence extraction with the four scoring functions, a normalization of the sentence scores by the length of the sentence directly influences the performance, regardless of whether this normalization is based on character or term counts. Specifically, a normalization by length increases the precision, but decreases the recall. While the overall benefit outweighs the drawbacks, as is evident from the obtained F1-scores, such a restriction may not always be desirable. For example, the extraction of concise sentences is paramount for extractive summarizers, but such limitations may not apply to a composition technique for summarization that first extracts multiple sentences covering different aspects in relation to a set of focus entities and then combines them. As a result, a less restrictive extraction may be favourable at times, depending on the application. Thus, the selection of the proper sentence ranking method should be based on the task at hand and on subsequent uses of the extracted sentences.
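
The effect of length normalization on a ranking can be illustrated with a minimal sketch. It is not the implementation evaluated above: the function names, the raw scores, and the choice of a plain division by length are assumptions made for illustration only, and the four scoring functions themselves are treated as a black box that supplies a raw score per sentence.

```python
def sentence_length(sentence, unit="terms"):
    """Length of a sentence in whitespace-separated terms or in characters."""
    if unit == "terms":
        return len(sentence.split())
    if unit == "chars":
        return len(sentence)
    raise ValueError("unit must be 'terms' or 'chars'")


def normalized_score(raw_score, sentence, unit="terms"):
    """Hypothetical normalization: divide the raw score by the sentence length."""
    length = sentence_length(sentence, unit)
    return raw_score / length if length > 0 else 0.0


def rank_sentences(scored_sentences, normalize=True, unit="terms"):
    """Rank (sentence, raw_score) pairs, highest score first.

    With normalize=False the raw scores are used unchanged, which tends to
    favour long sentences that accumulate score over many entity mentions.
    """
    def key(pair):
        sentence, raw = pair
        return normalized_score(raw, sentence, unit) if normalize else raw

    return [s for s, _ in sorted(scored_sentences, key=key, reverse=True)]


# Illustrative input; the raw scores are made up.
example = [
    ("A short sentence about the focus entity.", 3.0),
    ("A much longer sentence that mentions the focus entity and many other "
     "things along the way without adding much.", 4.0),
]
print(rank_sentences(example, normalize=False)[0])  # the long sentence
print(rank_sentences(example, normalize=True)[0])   # the short sentence
```

Switching `normalize` off corresponds to the less restrictive extraction discussed above, which may be preferable when the extracted sentences are later combined by a composition step.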

Interplay between entity types

In our extraction of example sentences for complex location relations, we found that a substantial number of location relations pertain to the movement of people between places of residence. While we see this largely as an artefact of the amount of biographical data that is stored in Wikipedia and our focus on cities as examples, our previous experience with EVELIN suggests that this can be exploited with the inclusion of additional entity types. The most direct approach would be to exclude or limit sentences that mention person movements from the results, where such movements can be identified. However, additional entities could also be used to focus on certain aspects of relations for a subset of locations. For example, the exchange of scientists between universities could be considered through the extraction of substantiating sentences that explain the transfers. The addition of dates as entities would then even allow an analysis of the flow of persons from one place to another, given an appropriate document collection. Such an inclusion of temporal data would be a step towards an analysis of the evolution of relations between locations in documents with spatio-temporal content.

Finally, we note that terms have so far played a minor role in the extraction of sentences, and only served to represent the context of entities indirectly, without being given a primary role in the graph structure. However, we further investigate the possibilities that arise from using descriptive terms as first-class citizens in Chapter 6, where we use a similar intuition for the extraction of network topics around entities.

4.4 Summary and Discussion

In this chapter, we have taken a closer look at the capabilities of implicit networks for entity retrieval tasks. We have considered several ranking tasks in the context of an entity-centric search engine that supports the exploration of a static document collection, represented by an annotated dump of Wikipedia. Motivated by the efficiency of computing localized rankings in the neighbourhood of nodes, we have demonstrated that implicit networks can be used to support interactive and ad-hoc retrieval operations on large document collections, even if the data is stored on secondary storage. While we have approached the extraction of descriptive subgraphs for the visual display of the context around query nodes, these subgraphs are static and the efficiency of their extraction is limited in dense regions of the graph. In Chapter 6, we build on this concept and extract subgraphs not only to describe static relations, but also to extract dynamic topics.

Based on the potential behind describing entities or sets of entities by extracting sentences, we then addressed the task of extractive summarization and focussed on sentence nodes in the network. We introduced further ranking functions to account for the length of sentences. Our empirical tests indicate that a normalization by sentence length improves the perceived ranking results due to the elimination of overly long sentences, as well as the overall quality of the top-ranked sentences. In addition to the extraction of descriptive sentences, there could be a future application in the multi-sentence summarization of entities, where the balance of recall and brevity remains to be investigated for an optimal coverage of contexts. Given the fast response times, such an extension could even enable near real-time document summarization techniques.

Practical implications

Consider the journalists and data analysts from our running example, and the possibilities that the methods discussed in this chapter open up for them. In contrast to the basic implicit network that is constructed from annotated but not disambiguated entities, an implicit network that is linked to a knowledge base provides new functionalities. In practice, this not only allows them to investigate the unstructured data in a more focused manner, but also provides additional information on entities from an external knowledge base such as Wikidata or even from internal knowledge repositories that are maintained within the investigative team. On top of this representation, an entity-centric search engine can then help in locating implicit entity relations comfortably, and visualizing them as graph structures. Through easily included links to the document repository and the knowledge base used during the creation of the network, following up on information and obtaining further details on specific entities becomes a simple matter of identifying the corresponding nodes in the network.

However, a central aspect is not the retrieval of information about known entities, but the extraction of definitions and descriptions for previously unknown entities. Here, the investigators can directly benefit from an extractive summarization that provides them with descriptions of previously unknown entities from within the document collection itself. Furthermore, they can even retrieve descriptions of relations to give them a starting point for an investigation into the relations between two entities. Since these relations are derived from their cooccurrences across all documents, they would be difficult or impossible to find in a manual analysis or by traditional search algorithms.

Outlook

Beyond the selection of applications that we presented here, implicit networks also have the potential to enhance further tasks in NLP or (social) network analysis. For example, the extraction of implicit social networks from document collections stands to be a useful tool in uncovering latent social relations [69]. Similarly, as we have demonstrated in our comparison to Wikidata, novel location relations beyond the typical hierarchical structures can be uncovered [70]. Networks of such latent relations could then also serve as semi-structured knowledge for the support of NLP tasks, similar to knowledge bases. There are already examples of applications in which such networks can improve the disambiguation of person mentions [68] or toponyms [178].

One limitation of the implicit networks that we have considered so far is their static nature. However, this is not an inherent feature, but an artefact of their construction from static document collections. In the following chapter, we therefore investigate the adaptation of implicit networks to a dynamic setting, in which they can be used to model not just document collections, but streams of documents.

5 Dynamic Implicit Entity Networks

In Chapter 3, we introduced implicit entity networks for the representation of large static document collections. In practice, however, many collections of documents are actually streams with a temporal dimension, such as news articles or blog posts. In this chapter, we therefore address how the implicit network model can be adapted to a streaming setting. In particular, since the context of entity occurrences and cooccurrences is more likely to change in new documents if a temporal dimension is considered, we also include new aggregation strategies for entity relations that go beyond a complete aggregation strategy like the one we used in Chapter 3.2.

Contributions. In this chapter, we thus make the following four contributions.

I We extend the implicit entity network model from static document collections to dynamic document streams, including publication dates as a temporal dimension.

II We include the context of entity mentions to enable a context-based aggregation of entity relations instead of a content-agnostic, complete aggregation, and propose a dynamic and a static partial aggregation approach.

III We evaluate the performance of dynamic implicit networks for information retrieval tasks on a large collection of entangled online news streams.

IV Based on the context of entity relations, we explore the extraction of dynamically evolving topics over time.

References. Parts of this chapter are based on the peer-reviewed publication:

A. Spitz and M. Gertz. “Exploring Entity-centric Networks in Entangled News Streams”. In: Proceedings of the 27th International Conference on World Wide Web (WWW), Companion Volume. 2018, pp. 555–563. doi:10.1145/3184558.3188726

5.1 Motivation

“Reading it in the paper in the morning” is a common idiom for catching up with the news that is becoming increasingly less applicable. Putting aside the obvious departure from printed news, both the temporal aspect of “the morning” and the grammatical singular of “the paper” are less and less accurate. News is not reported and consumed in the morning, but in a constant news cycle throughout the day, published by a multitude of news outlets with varying degrees of reliability, political bias, and overlapping content. It is these entangled streams of news that the reader of news has to wade through to stay informed. As a result, the ever-increasing number of news outlets and the frequency of the news cycle have made it all but impossible to obtain the full picture from online news. Consolidating news from different sources has thus become a necessity in online news processing. Despite similarities between the news cycle and streams of microblogs, and despite the abundance of research into extracting insights from online social networks, social media cannot take on the mantle of investigative journalism, which relies on argumentative texts and is less focused on the instant than it is on the evolution of stories. In this context, the so-called Five Ws of Who?, When?, Where?, What?, and Why? are questions of central importance that serve the journalist and the reader in uncovering news. Naturally, these questions put an emphasis on entities as pivotal components of news. In information retrieval, this is reflected in the definition of an event as something that happens at a given place and time between a group of actors [6] that we have already applied to the modelling of events described in Wikipedia. Thus, it stands to reason that entities play a similarly (or even more) important role in inducing structure in the unstructured texts of news articles.

In contrast to Wikipedia, however, far more than one news article tends to be required to retrieve the full picture in large entangled news streams [192], thereby making it necessary to consolidate information snippets from different sources. On the other hand, a lot of information is replicated between or even within individual news streams and thus redundant. Intuitively, this motivates two major subtasks in automated news analysis: identifying event mentions in unstructured texts, and aggregating them across documents. These tasks are referred to as new event detection and event tracking [7], and can be augmented by detecting topics [28] that put individual documents into context. To make identified events accessible to users, a central step is thus their aggregation into threads of events along some dimension(s). Many different approaches have been proposed to this end. Some focus on a geographic aggregation and visualization of news sources [199], while others focus on the temporal aggregation [83], or both [229]. Alternative approaches use the participating entities directly [39, 111]. In the case of a temporal aggregation, different temporal dimensions can be considered, such as the dates in the documents [135], or external information like the publication date [10] and edit histories [106]. With regard to timelines, another important aspect is then the temporal order, as the SemEval-2015 task for cross-document event ordering shows [134]. Beyond the above dimensions, more recent approaches include aggregation on a topic level [4] or based on word embeddings [138].

When we consider the above approaches for a contrastive exploration of the content of a news stream, we find that they suffer from two critical drawbacks: the limited number of aggregation dimensions and the aggregation granularity level. None of the approaches covers the entirety of available dimensions and it is indeed questionable whether an aggregation along all dimensions at once is realistically possible. Perhaps even more critically, the results are always coarse structures due to an aggregation either on the document, event, or topic level. However, if we consider incidents or events to be composite mentions of (named) entities, then they constitute the stitching points between individual news streams and can be used for a fine-grained consolidation. After all, we consume news about people, organizations, or locations of interest and follow them over time and in different contexts. Is it then not a more reasonable approach to retain this entity-centric structure of news in a suitable document model for subsequent analyses, and aggregate only where necessary and in exactly the dimensions that fit the exploratory task?

To address these shortcomings, we can adapt the implicit entity network model for document collections to a model for document streams. While we do so with the context of news streams as a primary application in mind, it should be easy to see how this approach is applicable to any other type of document stream that includes entity mentions, such as blog posts or even scientific publications. Although this adaptation is conceptually simple, it is not entirely straightforward. In particular, we have to include a temporal dimension beyond temporal expressions that are (potentially) included in the documents, and also consider document time stamps. Furthermore, taking into account the context of entity mentions is necessary to properly consolidate mentions from multiple sources and obtain a comprehensive framework for entity-centric analyses.

On the technical side, our model then serves to address the inherent scaling issues of multiple entangled news streams by utilizing efficient entity-centric queries to localized graph substructures. The streaming graph updates can take advantage of incremental adjustments to relevance measures for queries against the data [36, 223]. Furthermore, the implicit representation serves as an (inverse) index for retrieval tasks without requiring the storage of the proprietary content of news articles, which is an increasingly important aspect of news aggregation services.
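
To make the idea of time-stamped, entity-centric queries more concrete, the following is a deliberately simplified sketch and not the model introduced later in this chapter. It assumes that entity mentions have already been annotated per sentence, reduces the network to sentence-level entity cooccurrences, and attaches the publication date and document identifier to every cooccurrence. The class name, method names, and entity labels are hypothetical. Note that only identifiers and dates are stored, not the article text.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


class StreamingCooccurrenceGraph:
    """Toy entity graph that is updated one document at a time."""

    def __init__(self):
        # entity -> neighbouring entity -> list of (publication date, doc id)
        self._adj = defaultdict(lambda: defaultdict(list))

    def add_document(self, doc_id, published, sentences):
        """Insert one document from the stream.

        `sentences` is an iterable of entity lists, one list per sentence.
        Each pair of distinct entities in a sentence yields one cooccurrence.
        """
        for entities in sentences:
            for a, b in combinations(sorted(set(entities)), 2):
                self._adj[a][b].append((published, doc_id))
                self._adj[b][a].append((published, doc_id))

    def neighbours(self, entity, start, end):
        """Rank the neighbours of `entity` by cooccurrence count in [start, end].

        Only the adjacency list of the query entity is touched, so the cost
        depends on the local neighbourhood and not on the size of the graph.
        """
        counts = {}
        for other, mentions in self._adj.get(entity, {}).items():
            n = sum(1 for when, _ in mentions if start <= when <= end)
            if n:
                counts[other] = n
        # Highest count first; ties are broken alphabetically.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative stream with made-up identifiers and entities.
graph = StreamingCooccurrenceGraph()
graph.add_document("d1", date(2016, 6, 1),
                   [["Person A", "City B"], ["Person A", "Person C"]])
graph.add_document("d2", date(2016, 6, 2), [["Person A", "Person C"]])
graph.add_document("d3", date(2016, 7, 1), [["Person A", "City B"]])
print(graph.neighbours("Person A", date(2016, 6, 1), date(2016, 6, 30)))
```

Restricting the query to a date range is one simple way to expose the temporal dimension. Counting raw cooccurrences is a stand-in chosen for brevity and should not be read as the relevance measure used in this work.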

On the application side, our model stands to provide a more fine-grained and versatile representation of entangled news streams than previous approaches, by relying on the entity-centric representation of implicit networks. Instead of utilizing document- or event-centric indexing, we focus on the level of entities and contexts, and use them as stitching points between individual news threads. The model then supports a wide range of tasks, including entity-centric topic and event extraction and tracking, contextual search, contrastive source comparison, and exploratory visualizations of the underlying streams, as we discuss towards the end of the chapter.

Structure. In Chapter 5.2, we discuss previous work that is related to the analysis of document streams with a focus on news, and give a brief background on the entity-centric exploration of such documents. Afterwards, we introduce the dynamic implicit network model in Chapter 5.3, and use it for the exploration of evolving contextual topics around the edges of such a network that we construct from news articles in Chapter 5.4. Finally, in Chapter 5.5, we evaluate the improved network model on a set of news events for the task of event completion.