6.3 An Entity-centric Topic Model
6.3.3 Topic construction and growth
Based on the derived edge weights, we now face the question of extracting topics from the implicit network. To this end, we argue that the core of topics is formed by edges between frequently cooccurring nodes, and that topics can be grown around such edges in a well-defined manner. Thus, we propose two growth approaches that specifically enable an interactive exploration of topics, and discuss the potential evolution of topics over time.
Assuming a non-increasing ordering of edges in G by weight ω, the top-ranked edges then correspond to topic seeds as described above. Thus, we can select the k top-ranked edges for some value of k and treat them as seeds around which the topics are grown. To grow topic substructures around the selected edges, we introduce two types of growth patterns, namely triangular growth and external node growth. Note that some seed edges may share nodes, an aspect that leads to the fusion of two topics.
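The seed selection step can be sketched as follows. This is a minimal illustration and not the implementation used in this work: it assumes that the edge weights ω are available as a dictionary keyed by node pairs, and the function name select_seed_edges as well as the toy weights are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def select_seed_edges(weights: Dict[Edge, float], k: int) -> List[Edge]:
    """Return the k top-ranked edges in non-increasing order of weight.

    Ties are broken by the edge tuple itself so that the result is
    deterministic (an assumption made for this sketch only).
    """
    ranked = sorted(weights, key=lambda edge: (-weights[edge], edge))
    return ranked[:k]


# Hypothetical toy weights between entity nodes.
entity_edge_weights = {
    ("Merkel", "Berlin"): 0.9,
    ("Obama", "Washington"): 0.8,
    ("Merkel", "Obama"): 0.5,
    ("Berlin", "Paris"): 0.2,
}

seeds = select_seed_edges(entity_edge_weights, k=2)
print(seeds)
```

Since k is only a cut-off on a fixed ranking, increasing it later extends the list of seeds without changing the seeds that were already selected.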
Triangular growth
Given an edge a = (v, w) between entities v and w along with a network substructure that only contains a, v, and w, this initial substructure can be grown by adding neighbours of both entities, similar to the construction of evolving contextual topics in Chapter 5.4. Formally, recall that N(v) and N(w) denote the neighbours of nodes v and w, respectively. Then N(v) ∩ N(w) is the set of all nodes in G that have both v and w as neighbours. To rank nodes in this potentially very large set, we utilize a scoring function on the edge weights. Specifically, let ϱ_vw : V → R such that
$$\varrho_{vw}(x) := \min\{\omega(x,v),\ \omega(x,w)\} \tag{6.2}$$
Nodes with a higher score cooccur more often and more consistently with both entities of the seed edge. Ranking nodes in the shared neighbourhood of v and w according to ϱ_vw thus allows us to select the terms that are most closely related to the topic represented by the seed edge (v, w). It is then a simple matter of adding any number of such term nodes to the network substructure (along with the edges connecting them to v and w) in order to incrementally grow the topic. Since all nodes in the shared neighbourhood can be ranked according to ϱ_vw, we obtain a relevance score for nodes in relation to the seed edge. In addition to the two seed words v and w, a topic can thus be interpreted as a list of ranked words that are added to the initial two words based on their cooccurrence patterns, much like in a classic topic model. However, this growth strategy also results in a descriptive network substructure, as illustrated in Figure 6.1 (top).
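As a small worked example of the scoring function in Equation 6.2, the following sketch ranks the shared neighbourhood of a seed edge by the minimum of the two edge weights. It assumes that the network is stored as a nested dictionary with adj[x][y] = ω(x, y); the helper add_edge, the function name triangular_growth, and all toy weights are hypothetical.

```python
from typing import Dict, List, Tuple

Adjacency = Dict[str, Dict[str, float]]


def add_edge(adj: Adjacency, a: str, b: str, weight: float) -> None:
    """Store the weight in both directions (symmetric toy data)."""
    adj.setdefault(a, {})[b] = weight
    adj.setdefault(b, {})[a] = weight


def triangular_growth(adj: Adjacency, v: str, w: str, n: int) -> List[Tuple[str, float]]:
    """Rank the shared neighbours of v and w by min(w(x, v), w(x, w))."""
    shared = set(adj[v]) & set(adj[w])
    scored = [(x, min(adj[x][v], adj[x][w])) for x in shared]
    # Highest score first; node name as a deterministic tie-breaker.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


adj: Adjacency = {}
add_edge(adj, "Merkel", "Berlin", 0.9)      # seed edge
add_edge(adj, "chancellor", "Merkel", 0.7)
add_edge(adj, "chancellor", "Berlin", 0.4)
add_edge(adj, "summit", "Merkel", 0.3)
add_edge(adj, "summit", "Berlin", 0.6)
add_edge(adj, "election", "Merkel", 0.5)    # not connected to Berlin

ranking = triangular_growth(adj, "Merkel", "Berlin", n=2)
print(ranking)
```

In this toy network, the term "election" is never a candidate because it is adjacent to only one of the two seed nodes, while "chancellor" outranks "summit" because its weaker connection to the seed edge is still the stronger of the two minima.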
6 Entity-centric Network Topics
Figure 6.1: Visualization of topic growth with entities e_i and terms t_j. Top: seed edge selection and triangular growth. Seed edges of topics are ranked and selected based on their global edge weight ω. From the set of terms that are connected to either entity of a seed edge, the most closely related are selected by edge weight, and used for triangular growth. Both the number of seed edges and the number of added term triangles can be selected or changed on the fly. Bottom: external growth of a topic. Entities or terms from the (joint) neighbourhood of any node can be added to grow the topic, for example based on edge weights or node type.
The incremental addition of words to the seed edge supports different aspects of topic and cooccurrence exploration. First, instead of adding terms as described above, it is equally viable to select entities or even specific types η of entities in the shared neighbourhood. For example, if η(v) = η(w) = Loc, one could restrict the growth process to add only other locations, or instead only add persons. In a medical setting, this might be useful to restrict relations to subsets of entities that correspond to symptoms of diseases and to drugs, for example, which could offer insights into their complex relations beyond simple dyadic edges.
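Restricting triangular growth by node type amounts to a filter on the shared neighbourhood before ranking. The sketch below assumes that the type function η is available as a dictionary from node to a type label and that the network is stored as a nested dictionary of weights; the function name typed_triangular_growth, the type labels, and the toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Adjacency = Dict[str, Dict[str, float]]


def add_edge(adj: Adjacency, a: str, b: str, weight: float) -> None:
    """Store the weight in both directions (symmetric toy data)."""
    adj.setdefault(a, {})[b] = weight
    adj.setdefault(b, {})[a] = weight


def typed_triangular_growth(
    adj: Adjacency,
    node_type: Dict[str, str],
    v: str,
    w: str,
    allowed_types: Set[str],
    n: int,
) -> List[Tuple[str, float]]:
    """Rank shared neighbours of v and w, keeping only the allowed node types."""
    shared = set(adj[v]) & set(adj[w])
    scored = [
        (x, min(adj[x][v], adj[x][w]))
        for x in shared
        if node_type.get(x) in allowed_types
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:n]


adj: Adjacency = {}
add_edge(adj, "Berlin", "Paris", 0.8)       # seed edge between two locations
add_edge(adj, "Brussels", "Berlin", 0.6)
add_edge(adj, "Brussels", "Paris", 0.5)
add_edge(adj, "Macron", "Berlin", 0.4)
add_edge(adj, "Macron", "Paris", 0.9)
add_edge(adj, "treaty", "Berlin", 0.7)
add_edge(adj, "treaty", "Paris", 0.7)

node_type = {
    "Berlin": "Loc", "Paris": "Loc", "Brussels": "Loc",
    "Macron": "Per", "treaty": "Term",
}

only_locations = typed_triangular_growth(adj, node_type, "Berlin", "Paris", {"Loc"}, n=5)
only_persons = typed_triangular_growth(adj, node_type, "Berlin", "Paris", {"Per"}, n=5)
print(only_locations, only_persons)
```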
Depending on the technical realization of the storage that is used for the implicit network, the network can also serve as an inverted index, as we showed for EVELIN in Chapter 4.2. As a result, it becomes feasible to include functions that enable the user to inspect the articles and sentences in which two words cooccur during the incremental construction and exploration of the network substructures. Thus, an interactive exploration of the topics in a collection or stream is viable, a possibility that we investigate in more detail for our topic exploration interface in Chapter 6.5.
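To illustrate the idea of using edges as an inverted index, the following sketch maps each pair of cooccurring words to the sentences that contain both. It is a simplified stand-in and not the storage used by EVELIN: it assumes sentence-level cooccurrence and an in-memory dictionary, and the names build_cooccurrence_index and lookup as well as the toy sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Posting = Tuple[str, int]  # (document id, sentence id)


def build_cooccurrence_index(
    sentences: List[Tuple[str, int, List[str]]],
) -> Dict[Tuple[str, str], List[Posting]]:
    """Map each unordered word pair to the sentences in which both words occur."""
    index: Dict[Tuple[str, str], List[Posting]] = defaultdict(list)
    for doc_id, sent_id, words in sentences:
        # Sorting the distinct words yields one canonical key per pair.
        for a, b in combinations(sorted(set(words)), 2):
            index[(a, b)].append((doc_id, sent_id))
    return index


def lookup(index: Dict[Tuple[str, str], List[Posting]], a: str, b: str) -> List[Posting]:
    """Return all sentences in which a and b cooccur (empty list if none)."""
    key = tuple(sorted((a, b)))
    return index.get(key, [])


sentences = [
    ("d1", 0, ["Merkel", "Berlin", "summit"]),
    ("d1", 1, ["Merkel", "summit"]),
    ("d2", 0, ["Berlin", "wall"]),
]
index = build_cooccurrence_index(sentences)
print(lookup(index, "summit", "Merkel"))
```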
Figure 6.2: Visualization of topic fusion for two topic subgraphs that overlap on two shared nodes. Dotted lines denote the shared entity e_2 and term t_3.
External node growth
While the construction of edge triangles and term rankings is a key component in extracting and exploring a topic in the classic sense, the set of nodes that is determined in this way can also be explored with further expansion techniques. As a topic substructure grows, the seed edge and its incident nodes are not the only available attachment points for further edges and nodes. Instead, we can also add nodes that are connected to only one of the initial words v or w, or to some of the other nodes that were added in subsequent triangles. This process is illustrated in Figure 6.1 (bottom), where an entity and a term are attached to an existing topic substructure by dotted lines. Since these new nodes are not entirely connected to the seed nodes, we refer to this step as external node growth. While this attachment step has no analogy in classic topic extraction, it introduces additional degrees of freedom in an interactive exploration of topics, as we show in Chapter 6.5. The addition of external nodes also leads back to the ranking and recommendation of nodes in the immediate neighbourhood of a given query node, which highlights the connection to the approaches that we discussed in the previous chapters.
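External node growth can be sketched as a ranking of all nodes that lie outside the current topic substructure but are adjacent to at least one of its nodes. The ranking criterion used here (the strongest edge to any topic node) is an assumption made for illustration, since the text leaves the choice open; the function name external_candidates and the toy network are likewise hypothetical.

```python
from typing import Dict, List, Set, Tuple

Adjacency = Dict[str, Dict[str, float]]


def add_edge(adj: Adjacency, a: str, b: str, weight: float) -> None:
    """Store the weight in both directions (symmetric toy data)."""
    adj.setdefault(a, {})[b] = weight
    adj.setdefault(b, {})[a] = weight


def external_candidates(
    adj: Adjacency, topic_nodes: Set[str], n: int
) -> List[Tuple[str, str, float]]:
    """Return up to n triples (candidate, attachment point, weight).

    Each candidate lies outside the topic and is scored by its strongest
    edge to any node that is already part of the topic.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for anchor in sorted(topic_nodes):
        for x, weight in adj.get(anchor, {}).items():
            if x in topic_nodes:
                continue
            if x not in best or weight > best[x][0]:
                best[x] = (weight, anchor)
    ranked = sorted(best.items(), key=lambda item: (-item[1][0], item[0]))
    return [(x, anchor, weight) for x, (weight, anchor) in ranked[:n]]


adj: Adjacency = {}
add_edge(adj, "Merkel", "Berlin", 0.9)       # seed edge
add_edge(adj, "chancellor", "Merkel", 0.7)   # term added by triangular growth
add_edge(adj, "chancellor", "Berlin", 0.4)
add_edge(adj, "election", "Merkel", 0.5)     # adjacent to one seed node only
add_edge(adj, "Scholz", "chancellor", 0.6)   # adjacent to an added term only
add_edge(adj, "wall", "Berlin", 0.3)

topic = {"Merkel", "Berlin", "chancellor"}
candidates = external_candidates(adj, topic, n=2)
print(candidates)
```

Returning the attachment point together with the candidate makes it possible to draw the connecting edge when the node is added to a visualized substructure.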
Topic fusion and overlap
During the network-based extraction of topics using the steps described above, the substructure that is grown around the top-ranked seed edge is self-contained. However, this is not necessarily the case for substructures that are grown from subsequent edges in the list of top-ranked entity edges if k > 1 (that is, if more than the top-ranked seed edge is considered). Assume two edges a′ and a with ω(a′) < ω(a) from the list of seed edges. While new nodes are added to the substructure around a′, it is possible that an edge is added that is incident to a node of the previously extracted substructure around the edge a. In practice, this overlap between two topics may occur for term or entity nodes. In the most extreme case, even seed edges may overlap in one of their entity nodes, leading to the fusion of two topics. In Figure 6.2, we show an example of the fusion of two topics that overlap on one common entity and one common term. In classic topic models, the same word may belong to different topics with a high probability, which is analogous to partially overlapping topics in our model, where the same node can be part of different topics. In fact, we argue that topics should overlap for entities in news articles that participate in multiple topics. Consider, for example, a politician who meets with members of both her own and a foreign government to discuss matters of state with one group, and foreign relations with the other. In Chapter 6.4, we show how a network visualization can be used to highlight such overlapping substructures during the exploration of topics.
Unlike the fixed number of extracted topics in traditional topic models, the number of extracted topics in our model adjusts itself to fit the data. If we assume that one connected subgraph describes one topic, then the number of topics is not necessarily equal to the number of selected seed edges. Instead, it adjusts dynamically with the number of related topics in the data, which is especially beneficial when the topics and their relations are visualized as graph structures.
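Under the assumption that one connected subgraph describes one topic, the number of topics after fusion can be obtained by counting the connected components of the union of all grown substructures. The following sketch uses a simple depth-first traversal; the function name fused_topics and the toy substructures (loosely modelled on Figure 6.2, with an additional unrelated topic) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def fused_topics(substructures: List[Set[Edge]]) -> List[Set[str]]:
    """Return the node sets of the connected components of the union graph."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for edges in substructures:
        for a, b in edges:
            adjacency[a].add(b)
            adjacency[b].add(a)

    visited: Set[str] = set()
    components: List[Set[str]] = []
    for start in sorted(adjacency):
        if start in visited:
            continue
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        visited |= component
        components.append(component)
    return components


# Three substructures grown from three seed edges (k = 3).
topic_a = {("e1", "e2"), ("e1", "t1"), ("e2", "t1"), ("e1", "t3"), ("e2", "t3")}
topic_b = {("e2", "e3"), ("e2", "t3"), ("e3", "t3"), ("e2", "t4"), ("e3", "t4")}
topic_c = {("e5", "e6"), ("e5", "t9"), ("e6", "t9")}

topics = fused_topics([topic_a, topic_b, topic_c])
print(len(topics), topics)
```

Here, the first two substructures share the entity e2 and the term t3 and are therefore fused, so three seed edges yield only two topics.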