4.2 Interactive Entity and Event Exploration
4.2.3 Graph-based entity ranking model
To better support interactive browsing of the implicit network and the underlying document collection, it is necessary to slightly adapt the ranking functions for nodes in the network as introduced in Chapter 3.2. In the following, we motivate and introduce the updated ranking functions that we use to retrieve information from the graph.
4 Applications of Implicit Networks
Recall the definition of an implicit network structure G = (V, E) with edges E and nodes V := T ∪ E ∪ D ∪ S that are comprised of entities E, terms T, sentences S, and documents D. Also recall that N(v) denotes the neighbourhood of a node v, that is, the set of all nodes in the graph that are connected to v. For two sets of nodes X and Y, edges between entities x ∈ X and y ∈ Y are then weighted by directed weights ω⃗ that are derived from the textual distances of their mentions. For the entire derivation, see Chapter 3.2, but recall that the weight can be summarized as
\[
\vec{\omega}(x, y) = \log\left(\frac{|Y|}{|N(x) \cap Y|}\right) \sum_{i \in I_{xy}} \exp(-\delta(x, y, i)), \tag{4.1}
\]
where δ(x, y, i) denotes the distance in sentences between the occurrences of x and y in some cooccurrence instance i, and I_xy is the set of all such cooccurrence instances. Since these weights encode the directed importance of one entity for another, their relations can be used to rank nodes in the neighbourhood of one or several query nodes. In the following, we further refine these rankings depending on the type of target node.
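As a minimal sketch of how Equation 4.1 can be evaluated for a single pair of nodes, the following Python function takes the sentence distances of all cooccurrence instances together with the two set sizes from the logarithmic factor. The function name, the argument names, and the choice to return 0 when there are no cooccurrences are assumptions made for illustration and are not taken from the original implementation.

```python
import math


def edge_weight(distances, num_targets, num_connected):
    """Directed weight of Eq. (4.1) for one pair (x, y).

    distances     -- sentence distances delta(x, y, i), one per cooccurrence instance
    num_targets   -- |Y|, the size of the target node set
    num_connected -- |N(x) ∩ Y|, the number of neighbours of x that lie in Y
    """
    # Assumption: pairs without any cooccurrence have no edge and get weight 0.
    if not distances or num_connected == 0:
        return 0.0
    scale = math.log(num_targets / num_connected)
    return scale * sum(math.exp(-d) for d in distances)
```

Cooccurrences within the same sentence (distance 0) contribute 1 to the sum, and each additional sentence of distance reduces the contribution by a factor of e.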
We can represent the common core of all rankings for retrieving information from the graph as a query ⟨X | Q, n⟩, which consists of a target set X ∈ {T, S, D} or X ⊆ E (the latter for different types of entities), an integer n that specifies the number of nodes to retrieve from X, and a set of query entities Q ⊆ V. To answer a query, the aim is then to order all nodes in X based on some ranking function ϱ that measures the importance of their relations to query entities q ∈ Q by using the network structure. The answer to a query is a set X_n ⊆ X of the n top-ranked nodes in X such that ϱ(x_n) > ϱ(x) ∀ x_n ∈ X_n, x ∈ X \ X_n.

Ranking entities
A ranking of entities means that we equate the target set X with some subset of entities or terms. Thus, X ⊆ E or X = T. For a subset of entities or terms as target set, we can distinguish between two different scenarios. First, if the set of query entities contains only a single entity q (that is, |Q| = 1), then we rank entities in X by the weights of edges starting at q in the graph, and let ϱ_e(x) = ω⃗(q, x). If q and x are not connected, then we simply set ϱ_e(x) = 0. This ranking directly retrieves the most important entities x in the neighbourhood of entity q and is identical to the entity ranking that we introduced in Chapter 3.2. To obtain a score in the interval [0, 1], we normalize with the maximum observed score ω⃗_max to any entity in the result set,
\[
\vec{\omega}_{\max} := \max_{x \in X} \vec{\omega}(q, x). \tag{4.2}
\]
Second, if we have multiple query entities (that is, |Q| > 1), then we employ a two-tier ranking system, in which we decompose the ranking score into two components ϱ_e = coh.sum such that coh denotes an integer and 0 ≤ sum < 1 (the dot thus denotes a decimal separator). In contrast to simply using the cohesion as a binary indicator as discussed in Chapter 3.2, this allows us to rank candidate neighbours first by the cohesion component coh and break ties according to the second component sum. Here, coh denotes the cohesion of the subgraph that is induced by the query entities. Formally, coh is the number of query entities that are connected to a target entity beyond the first, that is,
\[
\mathrm{coh}(x) = |N(x) \cap Q| - 1. \tag{4.3}
\]
Thus, we have coh ∈ {0, ..., |Q| − 1}, with larger values indicating a greater connectedness of a target entity to the query entities. For the second component, we again use the normalized sum of edge weights
\[
\mathrm{sum}(x) = \frac{1}{\mathrm{sum}_{\max}} \sum_{q \in Q} \vec{\omega}(q, x), \tag{4.4}
\]
where sum_max is the maximum obtained sum of scores such that sum ∈ [0, 1]. For the resulting ranking score in the case of multiple query entities, we can thus set ϱ_e = coh + sum. We observe that ϱ_e ∈ [0, |Q|], where greater scores denote a greater importance.
The score then effectively ranks entities in the target set by the cohesion as the main component, and breaks ties by the summed directed edge weights. The case of a single query entity, in which |Q| = 1, then corresponds to the special case where coh = 0, as described above.
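The two-component entity score can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary-based graph representation, the function name `rank_entities`, and the decision to assign a score of 0 to targets that are not connected to any query entity (generalizing the single-query rule) are assumptions and do not reflect the actual implementation.

```python
def rank_entities(weights, neighbours, targets, query, n):
    """Return the n top-ranked target nodes as (node, score) pairs.

    weights    -- dict mapping (q, x) to the directed weight of the edge q -> x
    neighbours -- dict mapping a node x to its neighbourhood N(x) as a set
    targets    -- candidate set X
    query      -- set of query entities Q
    """
    # Unnormalized sum of edge weights from all query entities to each target.
    sums = {x: sum(weights.get((q, x), 0.0) for q in query) for x in targets}
    sum_max = max(sums.values(), default=0.0)

    scores = {}
    for x in targets:
        connected = len(neighbours.get(x, set()) & query)
        if connected == 0:
            scores[x] = 0.0  # assumption: unconnected targets score 0
            continue
        coh = connected - 1                                  # Eq. (4.3)
        norm = sums[x] / sum_max if sum_max > 0 else 0.0     # Eq. (4.4)
        scores[x] = coh + norm
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

For a single query entity, coh is 0 for every connected target, so the score reduces to the normalized edge weight of the single-query case.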
Ranking sentences
To rank sentences, we equate the target set with the set of sentences, meaning that X = S. As discussed in Chapter 3.2, obtaining a ranking for sentences is less direct than a ranking for entities, since the edge weights between entities and sentences are binary and there is no notion of weight for these edges as a result. Therefore, we employ a slightly different two-component ranking scheme ϱ = coh.sum to improve upon the existing sentence ranking. The first component is identical to the case of entity ranking and denotes the cohesion. That is, coh here denotes the number of query entities that are contained in a sentence x. To obtain the second component, we consider the k most important terms for each query entity and assign to each sentence a score that indicates how many of these
terms it contains. Formally, let T_Q be the union of the k most important terms for all query entities in Q. Then we obtain sum as the fraction of important terms that are contained in the sentence
\[
\mathrm{sum}(x) = \frac{|N(x) \cap T_Q|}{|T_Q|}. \tag{4.5}
\]
Similar to the case of entity rankings, the combined ranking score ϱ_s is then the sum of the individual scores coh and sum. Here, other sentence ranking schemes exist, in particular schemes that also normalize the results for the lengths of sentences. We address the construction and performance of these alternative rankings in more detail in Chapter 4.3 for the purpose of summarization.
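The sentence score can be illustrated with the following Python sketch. It assumes that each sentence is given as the set of entity and term nodes it contains, and that the k most important terms per query entity have already been retrieved with an entity ranking over X = T. Two further assumptions are made for illustration: coh is computed as in Equation 4.3 (number of contained query entities minus one), and sentences without any query entity are skipped. All names are hypothetical.

```python
def rank_sentences(sentence_nodes, query, top_terms, n):
    """Return the n top-ranked sentences as (sentence, score) pairs.

    sentence_nodes -- dict mapping a sentence id to the set of entities and terms it contains
    query          -- set of query entities Q
    top_terms      -- dict mapping each query entity to its k most important terms
    """
    # T_Q: union of the top-k terms over all query entities.
    t_q = set().union(*(top_terms.get(q, []) for q in query))

    scores = {}
    for s, nodes in sentence_nodes.items():
        contained = len(nodes & query)
        if contained == 0:
            continue  # assumption: only sentences mentioning a query entity are ranked
        coh = contained - 1                                  # cohesion as in Eq. (4.3)
        frac = len(nodes & t_q) / len(t_q) if t_q else 0.0   # Eq. (4.5)
        scores[s] = coh + frac
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Because the second component is a fraction of T_Q rather than a maximum-normalized sum, it only reaches 1 for a sentence that contains every important term.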
Ranking documents
For documents, a ranking can be obtained in almost the same way as for sentences, but with a target set X = D. Here, we make use of the observation that each sentence belongs to exactly one document. Thus, we easily arrive at a ranking of documents by computing a ranking of sentences according to the query entities, and then propagating the scores from the sentences to their respective documents. Using the same two-component score as above, we define the cohesion of a document d as the maximum cohesion of any of its sentences
\[
\mathrm{coh}(d) = \max_{s \in d} \mathrm{coh}(s). \tag{4.6}
\]
For the second component, we sum over the contributions of all individual sentences and obtain
\[
\mathrm{sum}(d) = \sum_{s \in d} \mathrm{sum}(s). \tag{4.7}
\]
After normalizing with the maximum of all sum(d) values, and combining ϱ_d = coh + sum, we again obtain a score ϱ_d ∈ [0, |Q|]. While this approach is intuitive, it is not straightforward to implement on the dense network structure due to the two-hop relation between entities or terms and documents. Thus, we subsequently consider a merging of document edges into sentence edges when discussing the application architecture.
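The propagation of sentence scores to documents in Equations 4.6 and 4.7 can be sketched as follows. The sketch assumes that the two sentence-level components are already available as (coh, sum) pairs and that a mapping from sentences to documents is given. The function and argument names are hypothetical.

```python
from collections import defaultdict


def rank_documents(sentence_scores, sentence_doc, n):
    """Return the n top-ranked documents as (document, score) pairs.

    sentence_scores -- dict mapping a sentence id to its (coh, sum) pair
    sentence_doc    -- dict mapping a sentence id to the id of its document
    """
    coh = defaultdict(int)
    total = defaultdict(float)
    for s, (c, v) in sentence_scores.items():
        d = sentence_doc[s]
        coh[d] = max(coh[d], c)   # Eq. (4.6): maximum cohesion over sentences
        total[d] += v             # Eq. (4.7): sum over sentence contributions

    # Normalize the summed component by its maximum over all documents.
    total_max = max(total.values(), default=0.0)
    scores = {
        d: coh[d] + (total[d] / total_max if total_max > 0 else 0.0)
        for d in total
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Note that this formulation operates on precomputed sentence scores and therefore sidesteps the two-hop traversal from entities or terms to documents.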
Subgraph extraction
Figure 4.1: Schematic view of the data processing pipeline and system architecture of EVELIN. Here, we use Wikipedia as the input document collection and Wikidata as the knowledge base for entity linking, but the process can be applied to any document collection or document stream.

As a final extraction method that is valuable for an exploratory visualization tool, we also consider an implementation of subgraph extraction. With the aim of highlighting the immediate neighbourhood of a given set of query entities, we first include all query entities in the subgraph. To discover additional nodes, we rely on entity queries as defined above. Specifically, we rank entities in each of the entity type sets according to their importance for the query entities. Then, we select the three highest ranked entities in each set that have a cohesion of coh ≥ 1 and include them in the graph. In a final step, we extract all edges between the selected nodes and include edge weights that can be used to visualize the importance of relations. In essence, this procedure equates to determining the highest ranked neighbours, and extracting the complete graph between all nodes that are selected in this way. Unfortunately, this extraction is comparatively slow for dense local subnetworks, and may require a few seconds for highly connected nodes in the Wikipedia data. In Chapter 6, we consider a more efficient way of extracting descriptive subgraphs.
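The subgraph extraction procedure can be sketched as follows, assuming that an entity ranking per entity type has already been computed and is provided as a list of (entity, coh, score) triples in descending score order. The data layout, the function name, and the exclusion of query entities from the candidate lists are assumptions for illustration only.

```python
def extract_subgraph(weights, ranked_by_type, query, per_type=3):
    """Return the node set and weighted edge dict of the query neighbourhood.

    weights        -- dict mapping (u, v) to the directed weight of the edge u -> v
    ranked_by_type -- dict mapping an entity type to a list of (entity, coh, score)
                      triples sorted by descending score
    query          -- set of query entities Q
    per_type       -- number of entities to select per type (three in the text)
    """
    nodes = set(query)  # all query entities are always included
    for ranking in ranked_by_type.values():
        # Keep only candidates connected to at least two query entities (coh >= 1).
        # Assumption: query entities do not take up one of the per-type slots.
        picked = [x for x, coh, _ in ranking if coh >= 1 and x not in query]
        nodes.update(picked[:per_type])

    # Extract every edge whose endpoints were both selected.
    edges = {(u, v): w for (u, v), w in weights.items() if u in nodes and v in nodes}
    return nodes, edges
```

The final filtering step scans all edges in this sketch, which is only practical for small graphs. An adjacency-based lookup restricted to the selected nodes would be the natural choice for a large network.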