

5.3 Stream Compatible and Context Sensitive Implicit Networks

5.3.3 Aggregated graph attributes

Based on the edge attributes as introduced above, we now lay the foundations for an entity-centric exploration of document streams that is sensitive to the context of entity relations. A shortcoming of the static implicit network model is the aggregation of all parallel edges to obtain a simple graph. While such an aggregation makes the extraction of graph representations from large document collections feasible, it does not distinguish between mentions in different contexts. In many streaming applications such as news, however, the


number of contexts in which two entities cooccur is likely much more limited than in a comprehensive dictionary like Wikipedia. Thus, an aggregation of edges by context still results in a stark reduction of the number of edges, while also preserving the context of entity cooccurrences for later analyses. Here, it seems reasonable to design such an approach to be flexible enough to handle arbitrary and varying numbers of contexts for a given edge. Furthermore, an aggregation by context is very likely to partially preserve the multiplicity of edges (albeit with reduced multiplicity), while simultaneously collapsing unjustifiably duplicate edges to enable a more focused extraction of relations from the resulting graph. Such an approach is particularly beneficial for entangled streams of news articles with redundant information.

In the following, we thus discuss how we can obtain an aggregated graph $G_A = (V, A)$ from the original multigraph $G$. While the set of nodes remains unchanged, we require a new set of aggregated edges $A$ with aggregated attributes. Let $v$ and $w$ denote two entities, and let $I_a$ denote a set of instances that induce parallel edges $E_a$ with

\[ E_a := \{(v, w, i) \in E \mid i \in I_a\} \tag{5.6} \]

between the two nodes $v$ and $w$. In the following, we consider the derivation of edge attributes when the set of parallel edges $E_a$ is aggregated to a single edge $a$. We begin by deriving the aggregated edge attributes.
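As a minimal sketch of Equation (5.6), the set of parallel edges can be selected from the multigraph by filtering on the node pair and the instance set. The `Edge` record, the function name `parallel_edges`, and the node labels are hypothetical and serve only as illustration; they are not part of the model itself.

```python
from collections import namedtuple

# Hypothetical edge record: endpoints v and w plus the instance i
# (the cooccurrence) that induced the edge in the multigraph.
Edge = namedtuple("Edge", ["v", "w", "i"])


def parallel_edges(E, v, w, I_a):
    """Return E_a = {(v, w, i) in E | i in I_a} for the node pair (v, w)."""
    return {e for e in E if e.v == v and e.w == w and e.i in I_a}


# Illustrative multigraph with three parallel edges between "A" and "B".
E = {
    Edge("A", "B", 1),
    Edge("A", "B", 2),
    Edge("A", "B", 3),
    Edge("A", "C", 4),
}
E_a = parallel_edges(E, "A", "B", I_a={1, 2})
```

Here, only the two edges induced by instances 1 and 2 are selected for aggregation; the third parallel edge between the same nodes remains available for a different aggregated edge.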

Aggregated edge importance

This weight derives an overall strength of the relation between two entities from the sentence distances of individual edges, and corresponds to the importance weight of the static implicit network that we introduced in Chapter 3.2. Thus, the dissimilarity of individual sentence distances is transformed into similarities by a decaying exponentiation. The individual similarities are then added over all aggregated edges in a process that corresponds to the edge weighting in the basic implicit network. Thus, we compute an importance weight for the aggregated edge $a = (v, w, j)$ as

\[ \omega(a) := \sum_{e \in E_a} \exp(-\delta(e)) \tag{5.7} \]
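The importance weight of Equation (5.7) can be sketched in a few lines. The function name `aggregated_importance` and the example distances are assumptions made for illustration only.

```python
import math


def aggregated_importance(distances):
    """Sum exp(-delta(e)) over the sentence distances of all edges in E_a."""
    return sum(math.exp(-d) for d in distances)


# Three parallel edges with sentence distances 0, 1, and 2 (illustrative values).
omega = aggregated_importance([0, 1, 2])
```

An intra-sentence cooccurrence (distance 0) contributes a full weight of 1, while each additional sentence of distance reduces the contribution by a factor of $e$.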

Aggregated publication dates

For a temporal analysis, we store the set of all publication dates, which we assume to be distinct, as long as the granularity of time is fine enough. For coarser granularities, this

5 Dynamic Implicit Entity Networks

attribute is effectively a multiset of dates, since multiple documents with identical dates are likely to occur. Formally, we thus have

\[ T(a) := \bigcup_{e \in E_a} \{\tau(e)\} \tag{5.8} \]
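To illustrate the difference between the set and the multiset view of Equation (5.8), the following sketch stores publication dates in a `collections.Counter`. Using a `Counter` is an assumption of this example, not a prescribed data structure, and the dates are made up for illustration.

```python
from collections import Counter
from datetime import date


def aggregated_dates(dates):
    """Collect the publication dates tau(e) of all edges in E_a as a multiset.

    With sufficiently fine time granularity all counts are 1, so the
    multiset coincides with a plain set.
    """
    return Counter(dates)


# Day-level granularity: two documents share a publication date.
T = aggregated_dates([date(2020, 1, 1), date(2020, 1, 1), date(2020, 1, 2)])
```

At day-level granularity, the repeated date is retained with a count of 2 instead of being collapsed, which keeps the number of contributing documents recoverable.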

Aggregated context

The context is the primary component of the edge aggregation (as we discuss in Chapter 5.3.4). However, once edges are aggregated, a single context vector is sufficient to represent an edge and facilitate context-sensitive queries. For reasons of storage efficiency, it is also sensible to not store more vector-valued edge attributes than necessary. Therefore, the contexts of individual edges are aggregated as the mean of the context vectors. Since the context of two entities whose mentions are separated by a couple of sentences is likely less important than that of two mentions within the same sentence, we normalize individual contributions by the distance of the mentions $\delta$.

Definition 5.4 (Aggregated context embedding $\kappa$). Let $E_a$ denote a set of parallel edges that are to be aggregated into a single aggregated edge $a$. Then the context for the aggregated edge is defined as

\[ \kappa(a) := \frac{1}{|E_a|} \sum_{e \in E_a} \frac{\kappa(e)}{\delta(e) + 1}, \tag{5.9} \]

where the addition of 1 prevents a division by zero for intra-sentence occurrences.
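A minimal sketch of Equation (5.9) in plain Python follows. It assumes that each edge is given as a pair of a context vector (a list of floats) and a sentence distance; this representation and the function name `aggregated_context` are hypothetical.

```python
def aggregated_context(edges):
    """Distance-normalized mean of context vectors, following Equation (5.9).

    edges: non-empty list of (kappa_e, delta_e) pairs, where kappa_e is a
    list of floats and delta_e is a non-negative sentence distance.
    """
    if not edges:
        raise ValueError("E_a must contain at least one edge")
    dim = len(edges[0][0])
    total = [0.0] * dim
    for kappa_e, delta_e in edges:
        for k in range(dim):
            # Adding 1 avoids division by zero for intra-sentence mentions.
            total[k] += kappa_e[k] / (delta_e + 1)
    return [x / len(edges) for x in total]


# Two edges: one intra-sentence (distance 0), one at distance 1.
kappa_a = aggregated_context([([1.0, 0.0], 0), ([0.0, 2.0], 1)])
```

In this example, the second context vector is halved before averaging because its mentions are one sentence apart, so both edges contribute equally to the result `[0.5, 0.5]`.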

Multiplicity of aggregated edges

To maintain the context centroid in the streaming aggregation model, we also have to store the number of individual edges that were aggregated.

Definition 5.5 (Multiplicity $\lambda$). Let $E_a$ denote a set of parallel edges that are aggregated into a single aggregated edge $a$. Then the multiplicity $\lambda : A \to \mathbb{N}$ is defined as $\lambda(a) := |E_a|$.
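To see why the multiplicity is needed in a streaming setting, consider the following sketch of an incremental update. It is one possible realization under the assumption that the stored centroid is rescaled by $\lambda$ before a new contribution is added, which follows directly from Equation (5.9); the class `AggregatedEdge` and its field names are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class AggregatedEdge:
    omega: float = 0.0                          # importance, Equation (5.7)
    dates: list = field(default_factory=list)   # publication dates, Equation (5.8)
    kappa: list = None                          # context centroid, Equation (5.9)
    lam: int = 0                                # multiplicity, Definition 5.5

    def add(self, delta, tau, kappa_e):
        """Fold one new parallel edge into the aggregated attributes."""
        weighted = [x / (delta + 1) for x in kappa_e]
        if self.lam == 0:
            self.kappa = weighted
        else:
            # Undo the division by lam, add the new term, divide by lam + 1.
            self.kappa = [
                (self.lam * old + new) / (self.lam + 1)
                for old, new in zip(self.kappa, weighted)
            ]
        self.omega += math.exp(-delta)
        self.dates.append(tau)
        self.lam += 1


a = AggregatedEdge()
a.add(delta=0, tau="2020-01-01", kappa_e=[1.0, 0.0])
a.add(delta=1, tau="2020-01-02", kappa_e=[0.0, 2.0])
```

Without the stored multiplicity, the running mean could not be updated from the centroid alone, and all individual context vectors would have to be retained.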

Based on these four attributes, we can combine parallel edges in $G = (V, E)$ to create the aggregated graph $G_A = (V, A)$ as we describe in the following. For containment edges, only the importance and the number of aggregated edges are meaningful. For the importance of containment edges, note that the exponentiation turns the distances into a value of 1 for existing edges and 0 for missing edges. An overview of the edge attributes is shown in Table 5.1.


Symbol  Attribute
τ       publication time
win     context window
δ       textual sentence distance
κ       context embedding
ς       sentence index
λ       number of aggregated edges
ι       instance of cooccurrence
ω       edge importance
ε       term embedding
η       node type

Table 5.1: Overview of edge and node attributes in the network.