6.4 Comparison to Traditional Topic Models

6.4.3 Network topic extraction and exploration

In contrast to extracting lists of terms, we can also fully utilize the network representation to extract complex topic substructures. Instead of listing the terms from nodes that surround seed edges, we extract the nodes themselves and thereby grow a descriptive network structure. Conceptually, we proceed in the same way as described for traditional topics above: we extract a ranked list of entity-centric edges and select the top-ranked edges. For each edge, we then include a number of terms that are adjacent to both entities in the network. Instead of retrieving the terms as a list, however, we directly visualize the resulting subgraphs. The number of adjacent terms per seed edge can be selected arbitrarily, but a value between two and five term nodes per edge tends to result in a visually interpretable network that is not too cluttered, while still being descriptive enough to provide insight into the topics' contents.

Figure 6.3: Comparison of the topic substructures and topic evolution of entities (purple) and terms (teal) between two time slices for the article subset of CNN. Shown are the eight highest ranked edges and the three most relevant connected terms.
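The extraction procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the toy edge weights, the `extract_topic_network` function, its parameter names, and the choice to rank a shared term by the smaller of its two edge weights are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: entity nodes and weighted, undirected edges.
# All names and weights are illustrative assumptions.
ENTITIES = {"Trump", "Clinton", "Republican Party", "Brexit", "UK"}
EDGES = {
    ("Trump", "Clinton"): 9.0,
    ("Trump", "Republican Party"): 7.0,
    ("Brexit", "UK"): 6.0,
    ("Trump", "election"): 5.0,
    ("Clinton", "election"): 4.0,
    ("Trump", "campaign"): 3.0,
    ("Clinton", "campaign"): 3.5,
    ("Trump", "email"): 1.0,
    ("Clinton", "email"): 4.5,
    ("Republican Party", "election"): 2.0,
    ("Brexit", "referendum"): 5.0,
    ("UK", "referendum"): 4.0,
    ("Brexit", "vote"): 2.0,
}


def build_adjacency(edges):
    """Turn an edge-weight mapping into a symmetric adjacency mapping."""
    adj = defaultdict(dict)
    for (u, v), weight in edges.items():
        adj[u][v] = weight
        adj[v][u] = weight
    return adj


def extract_topic_network(edges, entities, num_seeds=8, terms_per_edge=3):
    """Return (seed edges, edges of the topic subgraph)."""
    adj = build_adjacency(edges)
    # Seed edges: entity-entity edges, ranked by weight (ties broken by name).
    entity_edges = [e for e in edges if e[0] in entities and e[1] in entities]
    seeds = sorted(entity_edges, key=lambda e: (-edges[e], e))[:num_seeds]

    sub_edges = set()
    for u, v in seeds:
        sub_edges.add((u, v))
        # Candidate terms are non-entity nodes adjacent to BOTH entities.
        shared = [t for t in adj[u] if t in adj[v] and t not in entities]
        # Assumption: a term's relevance is the weaker of its two links.
        shared.sort(key=lambda t: (-min(adj[u][t], adj[v][t]), t))
        for term in shared[:terms_per_edge]:
            sub_edges.add((u, term))
            sub_edges.add((v, term))
    return seeds, sub_edges


seeds, sub_edges = extract_topic_network(
    EDGES, ENTITIES, num_seeds=2, terms_per_edge=2
)
```

Because the term "election" is adjacent to both seed edges in this toy network, the two seed subgraphs share a node, which is the kind of overlap that the visualization makes directly visible.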

As discussed in Chapter 6.3, by projecting the network according to the publication date of the corresponding article, we can introduce a temporal dimension and investigate the evolution of topic networks over time or even dynamically for selected intervals. Since the temporal information is an integral part of the network, this selection can be changed dynamically by the user. Similarly, if outlet identifiers are stored alongside document identifiers, we can create projections to a part of the network that corresponds to articles published by a selected news outlet or subset of outlets.
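As a rough illustration of such projections, the following sketch filters co-occurrence records by publication date and outlet before aggregating edge weights. The record layout, the `project` function, and the use of simple co-occurrence counts as weights are assumptions made for this example; the weighting actually used for the network may differ.

```python
from datetime import date

# Hypothetical co-occurrence records:
# (node_a, node_b, publication_date, outlet). Illustrative data only.
OCCURRENCES = [
    ("Brexit", "UK", date(2016, 6, 20), "The Guardian"),
    ("Brexit", "UK", date(2016, 6, 22), "CNN"),
    ("Brexit", "UK", date(2016, 8, 1), "The Guardian"),
    ("Trump", "Clinton", date(2016, 6, 21), "CNN"),
    ("Trump", "Clinton", date(2016, 8, 2), "CNN"),
]


def project(occurrences, start=None, end=None, outlets=None):
    """Aggregate edge weights over a date interval and a set of outlets.

    A filter that is None is not applied. Weights are plain counts,
    which is an assumption of this sketch.
    """
    weights = {}
    for a, b, published, outlet in occurrences:
        if start is not None and published < start:
            continue
        if end is not None and published > end:
            continue
        if outlets is not None and outlet not in outlets:
            continue
        weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights


# Temporal projection: all outlets, June 2016 only.
june = project(OCCURRENCES, start=date(2016, 6, 1), end=date(2016, 6, 30))
# Combined projection: CNN articles published in August 2016.
cnn_august = project(
    OCCURRENCES,
    start=date(2016, 8, 1),
    end=date(2016, 8, 31),
    outlets={"CNN"},
)
```

Since the filters are applied at query time, changing the interval or the outlet selection only requires calling `project` again with different arguments, which mirrors the dynamic selection described above.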

Temporal topic comparison

In Figure 6.3, we show two such temporal snapshots for the subnetwork of articles from CNN as a prototypical example. The results for other news outlets are similar, albeit with a regional or political bias that depends on the news outlet. While the network supports the extraction of topics from all articles of a collection simultaneously, such a restriction to particular news outlets can provide facets to the exploration. In the graph visualizations of the network topics, we still find the same descriptive terms as we do in the case of ranked term lists, but we also observe the additional structure of the underlying network. Unlike term lists, which represent isolated topics that are only implicitly linked by terms that occur in multiple topics, the overlaps of edges show topic relations directly. In fact, we observe fused subgraph structures that emerge from the top-ranked edges and lend further support to their topics. For example, the topics of Trump, Clinton, and the Republican Party are clearly related, as is evident from both the direct connections and the related terms. On the temporal axis, we find that the topics correlate well with political events. For example, the Brexit topic disappears after the date of the referendum in June 2016 (although it is more pronounced in British outlets), while several war-related topics shift in focus to follow ongoing campaign locations in the war against the Islamic State. As expected, the topic covering the drawn-out process of the U.S. election is stable over time. Overall, we find that the network representation adds a structure to the visualization that is easily recognizable and explorable, thus allowing the user to observe changes in the topics more easily. Such changes could be further visualized with network animations that highlight the changes over time.

Figure 6.4: Comparison of the topic substructures of entities (purple) and terms (teal) for news articles from the news outlets CNN and The Guardian in Summer 2016. Shown are the eight highest ranked seed edges and the three most relevant connected terms per seed edge. Due to topic fusion, the number of connected components is less than eight, and we observe fusion of topic subgraphs on both terms and entities.
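Both observations, the fusion of topic subgraphs and the change of topics between time slices, can be checked programmatically. The sketch below is one possible way to do so and is not taken from this work: fusion is measured as the number of connected components of the extracted subgraph, and temporal change as the set difference of seed edges between two slices. The function names and the toy edges are hypothetical.

```python
def connected_components(edges):
    """Return the connected components of an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


def changed_seed_edges(before, after):
    """Return (disappeared, appeared) seed edges, ignoring edge direction."""
    old = {frozenset(edge) for edge in before}
    new = {frozenset(edge) for edge in after}
    return old - new, new - old


# Hypothetical topic subgraph built from three seed edges.
topic_edges = [
    ("Trump", "Clinton"), ("Trump", "Republican Party"),
    ("Trump", "election"), ("Clinton", "election"),
    ("Brexit", "UK"), ("Brexit", "referendum"), ("UK", "referendum"),
]
components = connected_components(topic_edges)

# Hypothetical seed edges of two consecutive time slices.
disappeared, appeared = changed_seed_edges(
    before=[("Brexit", "UK"), ("Trump", "Clinton")],
    after=[("Clinton", "Trump")],
)
```

In this toy example, three seed edges yield only two components because two of them share the entity Trump, which corresponds to the fusion effect noted in the caption of Figure 6.4.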

Comparison of topics by news outlet

In Figure 6.4, we show a contrastive comparison for the same time frame, but between the topics that are extracted from two different news outlets. Here, we observe a clear bias in the importance of topics that is caused by the location of the news outlet. CNN, which is based in the United States but concerned with giving an overview of global news, covers a wider range of topics, including the Brexit discussion. The Guardian, on the other hand, which is a British newspaper, is clearly more focused on this regional topic that is potentially of greater importance to its readers. Upon closer inspection, the snapshot of topics from The Guardian contains two topics that are related to Brexit, which are likely to fuse once more seed edges are considered. As in the case of temporal projections, the focus on news outlets allows us to quickly compare what is being talked about in a given news outlet or region (if multiple outlets are grouped) in comparison to other outlets or regions. While such a comparison is certainly possible even for lists of terms, it would be less intuitive and less visual.

In document Implicit Entity Networks: A Versatile Document Model (Page 174-177)