3.4 Event Completion on Wikipedia Data
3.4.5 Event completion evaluation: date prediction
To evaluate the performance of entity-centric rankings that use the Wikipedia implicit network based on the two ground-truth event data sets, we use the description of events as query input. Specifically, we use all entities of type location, organization, and person as query input and evaluate the resulting ranking of dates with respect to the known ground truth date of the event. Thus, for each event, the query input is a set of entities. For query- ing strategies that include the context, this set may also contain terms. We focus on the prediction of dates since there exists exactly one date of granularity day per event in the
3.4 Event Completion on Wikipedia Data
evaluation data set. Furthermore, the coverage of dates in Wikipedia is very comprehen- sive for the considered time frame[185],which reduces the risk of erroneously evaluating the completeness of Wikipedia or the performance of the named entity recognition tool. For the output ranking, we only consider dates with a granularity of days since this is the granularity in the evaluation data sets.
Ranking methods and baselines
Based on this evaluation task, we test several possible ranking methods that utilize the implicit network model. With LOAD, we denote the basic method of inputting all query entities and ranking dates according to the combined edge scoring functionϱ. For LOADsq, we use the same approach but also enable subqueries to split names into components. For LOADTsq, we use the descriptive terms of events as part of the query in addition to the participating entities. Similarly, LOADTsq+ is a special query for the dates of death evaluation set, in which we also include the terms death and died in addition to the entities and terms that are contained in the event description. Furthermore, we use two baseline approaches. For BASEr, we rank dates that are connected to any query entity at random (this process is repeated 1,000 times and the results are averaged). Conversely, BASEw ranks dates by their weighted connection ~ω to any one entity in the query.
Evaluation results
In Table3.4, we show the results for the dates of death and historic events evaluation data sets. Independent of the data set, we find that the LOAD methods are able to locate an overall larger number of correct dates at some position in their ranking. The number of identified dates rises as we include more entities and terms in the queries. In the case of dates of death, the LOAD approach outperforms the randomized baseline, but is almost equivalent in performance to the weighted baseline. Since we only have one class of in- put query entity in this case (namely persons), this result is expected. The overall best performing approach is LOADsq with its included subqueries, which retrieves the largest number of correct dates that are ranked at the top position. The results are similar for the historical event data, although the weighted baseline performs much poorer with the addition of further entities to the input query. Interestingly, the inclusion of terms in the query helps with the overall recall but not with the precision of the results, indicating that entities are better suited as query input than the overall more frequent terms. This ob- servation is reinforced by the results of LOADTsq+, which ultimately has the best recall values at the cost of a low precision.
3 Implicit Entity Networks
data method found missing cor@1 prc@1
dates of death LOAD 869 613 122 0.082 LOADsq 1326 156 207 0.140 LOADTsq 1374 108 125 0.084 LOADTsq+ 1443 39 13 0.009 BASEw 869 613 206 0.140 BASEr 869 613 63.25 0.043 historic events LOAD 290 210 39 0.078 LOADsq 341 159 40 0.080 LOADTsq 414 86 33 0.066 BASEw 290 210 19 0.038 BASEr 290 210 0.48 0.001
Table 3.4: Evaluation results for both event data sets. Shown are the used methods and the two base- lines, the number of identified dates, the number of missing dates that are not predicted at any rank in the results, the number of top ranked dates that were in the evaluation data sets (cor@1), and the resulting precision at rank 1 (prc@1). The best achieved score for each metric and data set is highlighted in bold.
Historic Dates 0.0 0.1 0.2 0.3 0 10 20 30 40 50 60 70 80 90 100 maximum rank fraction of include d
dates methodLOAD
LOADsq LOADTsq LOADTsq+ BASEw BASEr Dates of Death 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
Figure 3.5: Prediction of dates based on the event data. Shown is the fraction of identified correct dates in the topk ranks (recall@k) versus the rank k.
3.4 Event Completion on Wikipedia Data
In order to evaluate the browsing potential of the method when used as an entity-centric search engine, we also consider the number of correctly identified dates with progressing rank in the list, as shown in Figure3.5. Here, we find that LOADsq performs best for the date of death data, for which it includes 50% of the correct dates in the first 10 posi- tions of the ranking. In comparison, the performance of LOADTsq is worse. Even though LOADTsq and LOADTsq+ ultimately include the most correct dates, this happens at a point in the ranking where it is of little relevance, unless this were to be used as a pre-selection step for further analyses. For both data sets, the LOAD methods with subqueries perform significantly better than either baseline. For the historic event data, the inclusion of terms leads to a further improvement over the other LOAD methods, which emphasizes the need to include terms in the model, since they may be essential components or even the only known features of an event description that is contained in the texts of the document col- lection. We further explore this importance of terms (and the context that they provide) for event relations in the following chapters.