
4.5 Event Extraction

4.5.3 Annotated Data Sets

For the development and evaluation of spatio-temporal event extraction methods that go beyond the cooccurrence approach, a development data set and an evaluation data set are required. As mentioned above, the development data set (the initial data set) is used to evaluate the cooccurrence approach, which allows for detecting specific patterns that can be used to develop the more complex approaches. Because these approaches are developed on the initial data set, they cannot be evaluated on it, so further test data are required.

Document Selection

To decide what types of documents should be included in the data sets, the following requirements for the document selection are formulated:

(a) Temporal expressions have to be annotated.

(b) Geographic expressions have to be annotated.

(c) Spatio-temporal events have to be annotated.

(d) The documents should contain many temporal and geographic expressions as well as a reasonable amount of spatio-temporal events.

(e) Documents of different domains should be considered (according to the domain definition in Section 3.3.1, Definition 3.1, page 48).

(f) Documents of different languages should be included.

Obviously, in the context of developing and evaluating spatio-temporal event extraction approaches, (c) is the most important requirement. However, since there is no data set containing annotations of spatio-temporal events, requirements (a) and (b) become important: existing annotations of temporal and geographic expressions simplify the manual annotation of spatio-temporal events and reduce the effort it requires. Unfortunately, there is no suitable corpus containing both manually annotated temporal expressions and manually annotated geographic expressions. Thus, we build our data sets from a subset of the documents of the temporally annotated corpora described in Section 3.2.3. Using these corpora also satisfies requirement (d), because all of them contain many temporal expressions and, as we observed during the annotation process, also several geographic expressions and spatio-temporal events.

In Chapter 3, we explained the different challenges of temporally tagging documents of different domains. Since the event-centric search and exploration scenarios that will be introduced in Chapter 6 should also be suitable for documents of different domains, we also want to evaluate the spatio-temporal event extraction approaches on different domains (requirement (e)). Thus, for English, we select documents of the WikiWars corpus and the TimeBank corpus, containing narrative- and news-style documents, respectively. Since both corpora are already annotated with temporal expressions, only geographic expressions and spatio-temporal events have to be manually annotated.

4 The Concept of Spatio-temporal Events

Finally, to satisfy requirement (f), we also select documents of the German WikiWarsDE corpus, which contains narrative-style documents. As for the English corpora, we manually annotate geographic expressions and spatio-temporal events. Although this data set is only used to evaluate the event extraction approaches, and although the more complex extraction methods are developed using an English data set only, we aim at developing language-independent event extraction methods. All event-centric search and exploration scenarios that will be developed in Chapter 6 are applicable to multilingual corpora.

Annotation Procedure

For the development and the evaluation of approaches to extract spatio-temporal events, it is not important to process full documents. Instead, single sentences containing at least one temporal and one geographic expression are required. Thus, we run the following procedure to build the manually annotated data sets. Note that all documents are taken from temporally annotated corpora.

• Each document is split into sentences; sentences without date or time expressions are removed.

• All geographic expressions (toponyms) are manually annotated without normalization information.

• All sentences without at least one temporal and one geographic expression are removed.

• Some of the sentences – as will be detailed below – are randomly selected.

• Sentences are duplicated so that for each pair of temporal and geographic expressions, a separate sentence exists.

• In each sentence, the temporal expression of analysis and the geographic expression of analysis are marked as expressions of analysis to distinguish them from further occurring temporal and geographic expressions. Thus, each sentence instance contains a single cooccurrence of analysis.

• Following the annotation guidelines described in Section 4.5.2, each cooccurrence of analysis is manually annotated as (i) spatio-temporal event, (ii) agent-based spatio-temporal event, or (iii) no spatio-temporal event.
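The duplication and marking steps of this procedure can be sketched as follows. This is a minimal sketch under stated assumptions, not the tooling used to build the data sets: it assumes that expressions are marked inline with the tags <T>, <G>, <TEA>, and <GEA> as in Table 4.8, and the function name cooccurrence_instances is hypothetical.

```python
import re
from itertools import product

# Matches <T>...</T> or <G>...</G>; tag names are assumed from Table 4.8.
TAG = re.compile(r"<(T|G)>(.*?)</\1>")


def cooccurrence_instances(sentence):
    """Return one copy of `sentence` per (geographic, temporal) pair,
    with that pair marked as <GEA>/<TEA> (hypothetical helper)."""
    matches = list(TAG.finditer(sentence))
    temporal = [m for m in matches if m.group(1) == "T"]
    geographic = [m for m in matches if m.group(1) == "G"]

    instances = []
    for g, t in product(geographic, temporal):
        out = sentence
        # Rewrite from right to left so earlier character offsets stay valid.
        for m in sorted((g, t), key=lambda m: m.start(), reverse=True):
            tag = m.group(1) + "EA"  # T -> TEA, G -> GEA
            marked = f"<{tag}>{m.group(2)}</{tag}>"
            out = out[:m.start()] + marked + out[m.end():]
        instances.append(out)
    return instances


sentence = ("German forces surrendered in <G>Italy</G> on <T>April 29</T> "
            "and in <G>Western Europe</G> on <T>May 7</T>.")
instances = cooccurrence_instances(sentence)
for instance in instances:
    print(instance)
```

For a sentence with m temporal and n geographic expressions, the sketch yields m times n sentence instances, each containing exactly one cooccurrence of analysis. Deciding whether an instance is an event, an agent-based event, or no event remains a manual step.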

Except for the number of sentences that are randomly selected, the same procedure is applied to all documents of the three corpora. As the initial data set, we use 150 sentences of the WikiWars corpus. As evaluation data sets, we use 50 sentences of each of the three corpora. In Table 4.7, the four data sets are listed together with the number of cooccurrences in each set. In addition, an example sentence containing two temporal and two geographic expressions is shown in Table 4.8. Each cooccurrence with explicitly marked temporal and geographic expressions of analysis (TEA and GEA) is manually annotated.

Evaluation Results for the Cooccurrence Approach

As a starting point for the development of more complex event extraction methods, we analyze all potential spatio-temporal events in the initial data set, i.e., the cooccurrences manually annotated as events, as agent-based events, or as non-valid events. Obviously, the cooccurrence approach extracts all potential spatio-temporal events as spatio-temporal events and makes no distinction between clearly valid events and agent-based events. In Table 4.9, the respective evaluation numbers are presented.

Considering only clearly valid events, the precision is already above 50%. Combining this precision with the cooccurrence approach's recall of 100% results in an f1-score (cf. Section 2.6.1) of 67.2%.


                    development data set    evaluation data sets
name                WW-150                  WW-50       TB-50       WWde-50
corpus              WikiWars                WikiWars    TimeBank    WikiWarsDE
unique sentences    150                     50          50          50
cooccurrences       411                     111         91          102

Table 4.7: Development and evaluation data sets containing potential spatio-temporal events.

sentence with annotated expressions of analysis                                                                     manual annotation
German forces surrendered in <GEA>Italy</GEA> on <TEA>April 29</TEA> and in <G>Western Europe</G> on <T>May 7</T>.  event
German forces surrendered in <GEA>Italy</GEA> on <T>April 29</T> and in <G>Western Europe</G> on <TEA>May 7</TEA>.  no event
German forces surrendered in <G>Italy</G> on <TEA>April 29</TEA> and in <GEA>Western Europe</GEA> on <T>May 7</T>.  no event
German forces surrendered in <G>Italy</G> on <T>April 29</T> and in <GEA>Western Europe</GEA> on <TEA>May 7</TEA>.  event

Table 4.8: Manual event annotations; each cooccurrence is annotated separately.

When considering clearly valid and agent-based events as correct, the f1-score even rises to 82.1%. Thus, the baseline for the evaluation of more complex approaches is already very strong.
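The precision values and f1-scores reported for the cooccurrence approach can be reproduced from the raw counts in Table 4.9. The following is a minimal sketch that assumes the standard definition of the f1-score as the harmonic mean of precision and recall; the function and variable names are illustrative only.

```python
def precision(correct, extracted):
    """Fraction of extracted candidates that are correct."""
    return correct / extracted


def f1_score(p, r):
    """Harmonic mean of precision p and recall r (assumed standard definition)."""
    return 2 * p * r / (p + r)


# Counts for the initial data set WW-150, taken from Table 4.9.
cooccurrences, valid, agent_based, non_valid = 411, 208, 78, 125

# The cooccurrence approach extracts every candidate, so recall is 100%.
recall = 1.0

p_valid = precision(valid, cooccurrences)
p_both = precision(valid + agent_based, cooccurrences)
f1_valid = f1_score(p_valid, recall)
f1_both = f1_score(p_both, recall)

print(f"valid only:          precision {p_valid:.1%}, f1 {f1_valid:.1%}")
print(f"valid + agent-based: precision {p_both:.1%}, f1 {f1_both:.1%}")
```

Because recall is fixed at 100%, the f1-score of this baseline depends only on its precision, which is why counting agent-based events as correct lifts both values at once.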

Note that we also performed an evaluation of the cooccurrence approach in (Strötgen and Gertz, 2012a). There, however, we only used a data set containing Wikipedia articles. All sentences were taken from the WikiWars and WikiWarsDE corpora and contained manually annotated temporal expressions. For geographic expressions, however, we relied on the automatic annotations of a geo-tagger. In contrast, as described above, we now use data sets of different domains for the development and evaluation of the further approaches and for a first evaluation of the cooccurrence approach. In addition, temporal expressions and geographic expressions are now manually annotated, and all cooccurrences of temporal and geographic expressions are manually checked for whether they form an event. Thus, this procedure allows for a better comparison of different approaches for spatio-temporal event extraction because errors of the geo-tagger and temporal tagger do not occur and thus do not influence the event extraction task.

Before describing some heuristic and linguistically motivated approaches for spatio-temporal event extraction, note that preliminary advanced approaches for event extraction were studied in two bachelor theses. Kaufmann (2012) developed some heuristic and linguistically motivated methods using manually created rules, and Limpert (2013) applied relation extraction methods followed by a machine learning post-processing step. While these works are partially similar to the work we will present in the following sections, their evaluation data sets have the same deficits as the one we used for the evaluation described in (Strötgen and Gertz, 2012a). Furthermore, as mentioned above, we focus in this thesis on language-independent event extraction methods. Nevertheless, both works can be considered helpful preliminary studies.


          cooccurrences   valid events   agent-based events   non-valid events   precision (valid)   precision (valid/agent-based)
WW-150    411             208            78                   125                50.6%               69.6%

Table 4.9: Evaluation of the cooccurrence approach on the initial data set.

In document Domain-sensitive Temporal Tagging for Event-centric Information Retrieval (Page 169-172)