
5.5 News Event Completion Evaluation

5.5.2 Evaluation setup

We briefly discuss the setup of the evaluation, the parameters used, and the ground truth before presenting the evaluation results.

Context extraction schemes

To derive contexts for entity cooccurrences, we consider two schemes according to the definition in Chapter 5.3.2: the complete context and the verb context. For the complete context, as a straightforward approach, we simply use the weighted average embedding of all non-stopwords inside the context window. In contrast, motivated by the importance of verbs in traditional event extraction, we also consider the verb context. In this case, we use only the embeddings of verbs inside the context window, excluding all forms of the auxiliary verbs "to be" and "to have". Both schemes are applied separately during network and ground truth construction.
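The two schemes can be illustrated with a minimal sketch. It assumes that tokens arrive with a part-of-speech tag and a weight, and that the verb context uses the same weighted average as the complete context; the weighting scheme, the stopword list, the tag name `VERB`, and the function name `context_embedding` are all hypothetical and not taken from the text.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical word lists; the actual lists used in the evaluation are not given.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and"}
AUXILIARY_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being",
                   "have", "has", "had", "having"}

Token = Tuple[str, str, float]  # (word, part-of-speech tag, weight)


def context_embedding(tokens: Sequence[Token],
                      vectors: Dict[str, List[float]],
                      verbs_only: bool = False) -> Optional[List[float]]:
    """Weighted average of the embeddings of the tokens in one context window.

    With verbs_only=False, every non-stopword contributes (complete context).
    With verbs_only=True, only verbs contribute, and all forms of
    "to be" and "to have" are skipped (verb context).
    """
    total: Optional[List[float]] = None
    weight_sum = 0.0
    for word, pos, weight in tokens:
        word = word.lower()
        if word in STOPWORDS or word not in vectors:
            continue
        if verbs_only and (pos != "VERB" or word in AUXILIARY_FORMS):
            continue
        vec = vectors[word]
        if total is None:
            total = [0.0] * len(vec)
        for i, x in enumerate(vec):
            total[i] += weight * x
        weight_sum += weight
    if total is None or weight_sum == 0.0:
        return None  # no usable token in the window
    return [x / weight_sum for x in total]
```

Returning `None` for an empty window is a design choice of this sketch; how windows without usable words are handled is not specified here.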

Ground truth data

Since we cannot rely on historic events for the evaluation of contemporary news streams, we need a set of news events. Thus, to obtain ground truth events, we use the Wikipedia Current Events portal [214], which contains summaries of news events with short descriptions that are manually updated by Wikipedia editors (for an example, see Figure 5.3). We crawl the pages to extract each listed item as a news event, and perform named entity recognition and disambiguation by following Wikipedia links in the text, similar to our annotation of Wikipedia in Chapter 4.2. Since the Wikipedia summaries contain references to the original news articles from which the events were taken, we match the references to articles in our input stream to ensure that the described events are covered in our documents. We exclude all events that consist of fewer than two entities or have no reference to an article in our network. We obtain 97 individual events that correspond to at least one article in our collection. For each such event, we generate a query from each contained entity by using the remaining entities as query input and the removed entity as ground truth (that is, an event with k entities induces k queries). All words in the event descriptions that are not annotated as entities are extracted as terms for the generation of query contexts. To obtain the verb context, we also manually annotate the verbs in the event summaries. In total, we obtain 293 queries for the evaluation.

As an example, consider the second item in the list of events in Figure 5.3. In addition to the date of July 16, 2016, it contains mentions of the President of Turkey as a person,


Figure 5.3: Snapshot of an event description for July 16, 2016, from the Wikipedia Current Events portal. Events are sorted by days and annotated with brief event descriptions, as well as links to the source articles that describe the event.

and Istanbul Airport as a location. From this set of three entities, we can construct three evaluation queries by removing one hold-out entity from the set and using the remaining two as query input. For example, we try to predict the location Istanbul Airport based on the set of query entities containing July 16, 2016, and President of Turkey. We also construct two further queries with the other two entities as targets. In contrast to this example, entities are of course represented by their Wikidata identifiers (or normalized dates) in the evaluation. All remaining terms in the sentence are then considered for the generation of the context embeddings. However, the verb context is constructed from only the embeddings of return, indicate, and falter.
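The hold-out construction described above can be sketched as follows. The function name `holdout_queries` is hypothetical, and readable labels stand in for the Wikidata identifiers and normalized dates used in the actual evaluation.

```python
from typing import List, Sequence, Tuple


def holdout_queries(event_entities: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Turn one event with k entities into k (query entities, target) pairs."""
    # Remove duplicate mentions while keeping the original order.
    entities = list(dict.fromkeys(event_entities))
    if len(entities) < 2:
        return []  # events with fewer than two entities are excluded
    return [([e for e in entities if e != target], target)
            for target in entities]


# Placeholder labels for the example event from Figure 5.3.
event = ["2016-07-16", "President of Turkey", "Istanbul Airport"]
queries = holdout_queries(event)
```

For the three-entity example, this yields three queries, one of which uses the date and the person as input and the location as the hold-out target.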

Clustering aggregation setup

For the clustering approach, we require a clustering algorithm that does not enforce a fixed number of clusters, since no single number of clusters can be expected to suit all pairs of nodes equally. Thus, we select DBSCAN [54] with cosine as a distance measure. To determine values for the two required parameters ϵ and minPts, we conduct a number of preparatory tests. These tests show that the quality of the results suffers for high values of minPts, while minPts = 5 works well. Since a value of $\mathrm{minPts} > |E_a|$ exceeds the number of edges and would therefore be meaningless for edge aggregation, we use the scheme $\mathrm{minPts} = \min\{5,\ |E_a| / 5\}$, which performs best in our experiments. We then employ the min-points heuristic to obtain a reasonable density value of ϵ = 0.3 as a starting point for the evaluation.
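As a rough illustration of the parameter scheme, the following sketch computes minPts from the number of edges between a pair of nodes, together with the cosine distance used for clustering. The text does not say how non-integer values of |E_a| / 5 are handled, so rounding up and enforcing a minimum of 1 are assumptions of this sketch, as are the function names.

```python
import math
from typing import Sequence


def min_pts(num_edges: int, cap: int = 5, divisor: int = 5) -> int:
    """minPts = min{cap, num_edges / divisor}.

    Assumption: the quotient is rounded up and the result is at least 1.
    """
    if num_edges < 1:
        raise ValueError("at least one edge is required")
    return max(1, min(cap, math.ceil(num_edges / divisor)))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, that is, 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm
```

The resulting value would be passed to whichever DBSCAN implementation is used; in scikit-learn, for example, the corresponding parameters are `min_samples`, `eps`, and `metric="cosine"`. Which implementation was used in the evaluation is not stated.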


[Plot: Comparison of streaming aggregation thresholds. x-axis: rank k (0 to 50); y-axis: recall@k (0.2 to 0.8); series: th = 0.3, th = 0.4, th = 0.5, th = 0.6, and complete aggregation.]

Figure 5.4: Recall comparison for different aggregation threshold values in the streaming aggregation approach, when using the complete context to generate context vectors, and cosine as a similarity measure (the results for the verb context are almost identical and omitted). Shown is the fraction of correctly identified entities in the top k ranks versus the rank k. Complete aggregation corresponds to the static implicit network model and is included as a baseline.

                  th = 0.3   th = 0.4   th = 0.5   th = 0.6
context (all)     0.218      0.218      0.232      0.253
context (verb)    0.225      0.222      0.215      0.208
no context        0.157

Table 5.2: Evaluation results of the streaming edge aggregation with cosine similarity. Shown is the precision@1 for the complete context embedding derived from all words in the context window, as well as the context that is derived only from verbs. The aggregation without context corresponds to the static implicit network model and is included as a baseline.

5.5.3 Evaluation results

As discussed above, each evaluation query has exactly one correct answer. Therefore, suitable evaluation metrics are the fraction of queries for which the top-ranked prediction is correct (that is, precision@1), and the fraction of queries for which the correct entity appears among the top k predictions (that is, recall@k). We discuss the results in the following.
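Because every query has a single correct answer, both metrics reduce to simple hit counts. A minimal sketch, with hypothetical function names and toy rankings that are not taken from the evaluation data:

```python
from typing import List, Sequence


def precision_at_1(rankings: Sequence[Sequence[str]], targets: Sequence[str]) -> float:
    """Fraction of queries whose top-ranked prediction is the hold-out entity."""
    if not targets:
        raise ValueError("no queries given")
    hits = sum(1 for ranked, target in zip(rankings, targets)
               if ranked and ranked[0] == target)
    return hits / len(targets)


def recall_at_k(rankings: Sequence[Sequence[str]], targets: Sequence[str], k: int) -> float:
    """Fraction of queries whose hold-out entity is among the top k predictions."""
    if not targets:
        raise ValueError("no queries given")
    hits = sum(1 for ranked, target in zip(rankings, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Toy data: three queries with ranked predictions and their hold-out targets.
toy_rankings: List[List[str]] = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
toy_targets = ["a", "z", "s"]
```

With a single correct answer per query, recall@1 coincides with precision@1, and recall@k can only grow as k increases.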

Streaming aggregation

We first compare the two approaches for context generation over varying aggregation thresholds, and show the resulting precision scores in Table 5.2. We omit threshold values of th < 0.3 since no further changes occur below this point in our data. We find that both context generation methods outperform the static implicit network baseline by a large margin (up to 61% improvement). However, the verb context aggregation shows


          full aggr.   streaming aggr.      clustering aggr.
                       all       verb       ϵ = 0.2   ϵ = 0.3   ϵ = 0.4
cor@1     44           71        61         35        27        25
prc@1     0.165        0.266     0.228      0.131     0.101     0.094
recall    0.655        0.955     0.955      0.955     0.955     0.955

Table 5.3: Performance comparison of the clustering and streaming (th = 0.6) edge aggregation approaches on a subset of the evaluation data. We show the correct predictions at rank one (cor@1), precision@1, and recall. The full aggregation corresponds to the static implicit network model and is included as a baseline.

a slight decline in performance as the threshold increases. In contrast, the precision of the complete context increases with the threshold value, and it performs better overall. In Figure 5.4, we show the corresponding recall values of the complete context approach, which are almost identical for the verb context, and thus omitted. Varying the threshold values has little influence on the recall, which makes low thresholds attractive in settings where a compact graph representation with as few edges as possible is important and a good recall@5 score is sufficient. Overall, we find that even a heavily aggregated context is still sufficient to increase the performance in comparison to the static implicit network that aggregates all edges regardless of context.

Clustering aggregation

In Table 5.3, we show the performance of the clustering aggregation for a subset of 267 evaluation queries (the clusterings for the remaining 26 queries did not finish within 48 hours). We also include the static implicit network as a baseline and the best results of the streaming aggregation approach for comparison. Note that, due to the smaller evaluation set, the values differ slightly from those in Table 5.2. We find that the clustering aggregation performs better than the static implicit network baseline for some of the ϵ settings, but not by a large margin. Interestingly, higher values of ϵ decrease the performance. The recall values shown in Figure 5.5 support the observation of a lower performance of the clustering approach. While some of the clustering aggregation settings eventually outperform the static baseline, clustering does not rival the streaming aggregation.
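The precision@1 values in Table 5.3 follow directly from the counts of correct top-ranked predictions and the subset size of 267 queries. The short check below recomputes them; the dictionary keys are shorthand labels chosen for this sketch.

```python
NUM_QUERIES = 267  # size of the evaluation subset used for Table 5.3

# Correct predictions at rank one (cor@1), as reported in Table 5.3.
cor_at_1 = {
    "full": 44,
    "stream_all": 71,
    "stream_verb": 61,
    "cluster_eps_0.2": 35,
    "cluster_eps_0.3": 27,
    "cluster_eps_0.4": 25,
}

# precision@1 = cor@1 / number of queries, rounded to three decimals.
prc_at_1 = {name: round(count / NUM_QUERIES, 3) for name, count in cor_at_1.items()}

for name, value in prc_at_1.items():
    print(f"{name}: {value:.3f}")
```

The recomputed values match the prc@1 row of the table, for example 71 / 267 ≈ 0.266 for the streaming aggregation with the complete context.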

Overall, we therefore find that streaming edge aggregation is superior to clustering aggregation in this setting. While other clustering algorithms may perform better, our tests were extensive, and the ease of use for the streaming method is much higher. While the optimal parameter settings for clustering approaches are typically difficult to obtain, there


[Plot: Comparison of context aggregation methods. x-axis: rank k (0 to 50); y-axis: recall@k (0.1 to 0.8); series: stream (verb), th = 0.6; stream (all), th = 0.6; clustering, ϵ = 0.2; clustering, ϵ = 0.3; clustering, ϵ = 0.4; no context.]

Figure 5.5: Performance and recall comparison for different density values in the clustering aggregation approach using cosine similarity. DBSCAN is used as a clustering algorithm with minPts = 5 and varying ϵ values. Shown is the fraction of correctly identified entities in the top k ranks versus the rank k. The aggregation without context corresponds to the static implicit network model and is included as a baseline.

is a direct correlation between the threshold and the prediction quality in the streaming approach. Therefore, the streaming approach is preferable for the analysis of news, not least because it also saves storage space in comparison to the clustering approach, which requires all edges to be collected prior to the aggregation. What remains to be evaluated is the performance of the streaming approach during aggregation, which depends on the threshold selection. We consider this aspect of the model in the following.