2.5 Capturing events in their diversity
2.5.1 Partitioning events into types
The challenge presented by diversity is exhibited in the transition from muc to more general type-based extraction in the ace programme. muc was defined by its selectiveness; it targeted a “fixed and closely circumscribed subject domain” (Yangarber and Grishman, 1997) for each evaluation (for instance, management succession and aircraft accident). Through iterative refinement of templates and detailed annotation specifications, this yielded human inter-annotator agreement of 70 to 90 percent (Will, 1993), with the best system in each evaluation performing in the 50-60% F1 range (Chinchor, 1998b).16 The ace programme was the natural descendant of the muc evaluations in terms of its tasks and participants; it further specified the extraction of entities and other values, the entity coreference considered by muc-6 (Sundheim, 1995) and the entity-entity relations of muc-7 (Chinchor, 1998a), before considering events (Doddington et al., 2004).
One outcome of muc was the understanding that targeted applications could utilise small, portable, textually fine-grained components, to be determined and benchmarked separately (Grishman and Sundheim, 1996). In an attempt to parallel entity and relation extraction, ace thus targeted more general notions of event extraction than the application-specific scenario templates considered in muc. Hence the first ace attempt at event annotation considered five very broad event types (LDC, 2003):
• destruction/damage;
• creation/improvement;
• transfer of possession or control;
• movement; and
• interaction of agents.
15 The edition for 10-11 September 2010.
16 Note this agreement is calculated over detailed templates rather than over whether an event of the target type is present.
Event type     Event subtype
Life           Be-Born, Marry, Divorce, Injure, Die
Movement       Transport
Transaction    Transfer-Ownership, Transfer-Money
Business       Start-Org, Merge-Org, Declare-Bankruptcy, End-Org
Conflict       Attack, Demonstrate
Contact        Meet, Phone-Write
Personnel      Start-Position, End-Position, Nominate, Elect
Justice        Arrest-Jail, Release-Parole, Trial-Hearing, Charge-Indict, Sue, Convict, Sentence, Fine, Execute, Extradite, Acquit, Appeal, Pardon
Table 2.2: Event types and subtypes in the ace05 evaluation (NIST, 2005).
The annotation guide (LDC, 2003) provides arrest and winning an election as examples of
transfer of possession; presumably hijacking falls into this category as well, rather than with
other attack-like events in destruction/damage, where tsunami might also reside. So while these categories capture ontological families of event and may represent a vast proportion of newsworthy events, such broad types naturally hide distinguishing features of event semantics. This pilot annotation produced much lower inter-annotator agreement than entity or relation detection tasks (Strassel et al., 2004) and so the schema was reinvented for future evaluations. ace05 introduced events into the evaluation, categorising those of interest into eight more thematic types, which break down further into 33 sub-types17 listed in Table 2.2. It distinguishes, for instance, the birth of a person from the creation of an organisation, where the 2003 schema did not, but it does not completely cover that earlier typology. For example, it cannot mark the creation of an interesting artifact.
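The two-level schema of Table 2.2 can be written down as a simple mapping, which makes the counts quoted above easy to check. The following is only an illustrative sketch: the names ACE05_SCHEMA and qualified_subtypes are our own and are not part of any ace tooling, and the lower-cased type:subtype labels merely mirror the notation used in the text (for instance conflict:attack).

```python
# Illustrative encoding of Table 2.2; the names below are hypothetical,
# chosen for this sketch and not taken from any ACE software.
ACE05_SCHEMA = {
    "Life": ["Be-Born", "Marry", "Divorce", "Injure", "Die"],
    "Movement": ["Transport"],
    "Transaction": ["Transfer-Ownership", "Transfer-Money"],
    "Business": ["Start-Org", "Merge-Org", "Declare-Bankruptcy", "End-Org"],
    "Conflict": ["Attack", "Demonstrate"],
    "Contact": ["Meet", "Phone-Write"],
    "Personnel": ["Start-Position", "End-Position", "Nominate", "Elect"],
    "Justice": [
        "Arrest-Jail", "Release-Parole", "Trial-Hearing", "Charge-Indict",
        "Sue", "Convict", "Sentence", "Fine", "Execute", "Extradite",
        "Acquit", "Appeal", "Pardon",
    ],
}


def qualified_subtypes(schema):
    """Return lower-cased 'type:subtype' labels, e.g. 'conflict:attack'."""
    return [
        f"{event_type}:{subtype}".lower()
        for event_type, subtypes in schema.items()
        for subtype in subtypes
    ]


labels = qualified_subtypes(ACE05_SCHEMA)
print(len(ACE05_SCHEMA), "types,", len(labels), "subtypes")  # 8 types, 33 subtypes
```

Note how unevenly the subtypes are distributed over types even at the schema level: justice alone accounts for 13 of the 33 subtypes, while movement has only one.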
To consider the heterogeneity of ace05 event subtypes, we plot the frequency of each in an annotated corpus (Walker et al., 2006) against the average length of its coreference chains. As shown in Figure 2.8, frequencies of event subtypes vary from two justice:pardon to 1119 conflict:attack events. The distribution of the conflict type is also clearly imbalanced between its two constituent subtypes, with attack over ten times more frequent than demonstrate. The infrequent types are too scarce for supervising a learnt extractor, while the most frequent types are impractically broad for application: annotated movement:transport instances include withdrawal of troops, climbing Mount Everest, a Mars Rover voyage, swimming and weapons smuggling. Even so, numerous interesting events are missed by the schema, from natural disasters to construction to legislation and other publication. The frequency variation is notably present in a corpus that was not sampled randomly from its sources, but selected to ensure sufficient instances of targeted types within a corpus of predetermined size.18 Variation
17 All systems known to the author focus on the sub-types, ignoring the broader groupings.
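[Figure 2.8 plot. Axes: Event frequency (log scale, 10^0 to 10^3) and References per event (1 to 2.2). Points are event subtypes, keyed by event type: Business, Conflict, Contact, Justice, Life, Movement, Personnel, Transaction. Labelled points: Pardon, Divorce, Release-Parole, Transport, Attack.]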
Figure 2.8: Frequencies of event subtypes in all 600 ace05 training documents.
Evaluation                      attack  transport  die  meet  injure  charge-indict
# gold references               984     472        392  160   87      85
Annotator 1                     84      78         89   80    89      89
Annotator 2                     88      85         92   79    88      87
Inter-annotator                 73      61         82   64    76      76
Naughton et al. Trigger-based   25      20         80   65    65      80
Naughton et al. svm             70      40         75   70    60      80
Table 2.3: Human and system (Naughton et al., 2010) performance (F1) on a sentence-level event type identification task, over six frequent event types in the newswire portion of the ace05 corpus (Walker et al., 2006).
on the other axis, the number of references per distinct event, indicates a few categories of event subtypes: divorce and release-parole are both subjects of documents, with a number of references to the same generic concept of such events, rather than specific referents; in contrast, pardon’s annotations tend to be single references in passing; while attack exhibits a mix of single focal events and cases where a number of distinct events of that type are mentioned in an article.
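The two quantities plotted in Figure 2.8 can be derived from event mention annotations roughly as follows. This is a minimal sketch under stated assumptions: we assume that frequency counts distinct events (coreference chains) and that references per event is the number of mentions divided by the number of distinct events; the function name subtype_statistics, the (subtype, event_id) input format and the toy data are all hypothetical and are not drawn from the ace05 corpus.

```python
from collections import defaultdict


def subtype_statistics(mentions):
    """Summarise event mentions per subtype.

    mentions: iterable of (subtype, event_id) pairs, one per annotated mention;
    mentions sharing an event_id are assumed to corefer.
    Returns {subtype: (number of distinct events, mean mentions per event)}.
    """
    chains = defaultdict(lambda: defaultdict(int))
    for subtype, event_id in mentions:
        chains[subtype][event_id] += 1

    stats = {}
    for subtype, events in chains.items():
        n_events = len(events)
        n_mentions = sum(events.values())
        stats[subtype] = (n_events, n_mentions / n_events)
    return stats


# Made-up toy annotations, for illustration only.
toy_mentions = [
    ("justice:pardon", "e1"),                      # a single reference in passing
    ("life:divorce", "e2"), ("life:divorce", "e2"), ("life:divorce", "e2"),
    ("conflict:attack", "e3"), ("conflict:attack", "e3"),
    ("conflict:attack", "e4"),                     # a second, distinct attack
]

toy_stats = subtype_statistics(toy_mentions)
for subtype, (n_events, refs_per_event) in sorted(toy_stats.items()):
    print(subtype, n_events, round(refs_per_event, 2))
```

In the toy data, the divorce event behaves like a document subject (one event, three references), the pardon is mentioned once, and attack mixes a repeated event with a singleton, which are the three patterns described above.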
By reducing ace05 event detection to a sentence-level classification task, Naughton et al. (2010) illustrate the difficulty of identifying such broad event types. Despite its relative infrequency, a homogeneous type like charge-indict is reliably recognised by the human annotators
such as phone number and events as in Table 2.2; some of these may substantially bias the corpus domain. Type-targeted sampling was first adopted for the 2005 evaluation in place of random sampling (Walker et al., 2006), and follows from muc evaluations where corpora were selected to match the target event domain: the muc-3 corpus includes only documents matching topical keywords (Chinchor et al., 1993); muc-6 collates an
(see Table 2.3), while broader types of event such as meet and transport are recognised with reasonably high precision, resulting in high annotator F1 with respect to the final corpus, but with lower recall, such that the annotators fail to mark the same sentences, presumably due to sub-salient references. Using support vector machines (svm), Naughton et al. (2010) are able to approach inter-annotator performance for most types, but perform half as well for transport as for charge-indict; for the latter type, using a small list of trigger words is equally effective, while for the former trigger terms perform only half as well as svm.19 The attack type is also notable for being identifiable with a machine-learning model, but not with a word list, suggesting that unlike four of the six types that Naughton et al. (2010) consider, this type is lexically diverse.20
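To make the contrast between a word list and a learnt model concrete, the sketch below shows what a sentence-level trigger-based detector and its F1 evaluation might look like. It is not the system of Naughton et al. (2010): the trigger list, the example sentences and the gold labels are invented for illustration, and the function names trigger_classifier and f1_score are our own.

```python
import re


def trigger_classifier(triggers):
    """Build a detector that fires if any trigger word occurs in a sentence."""
    alternatives = "|".join(re.escape(t) for t in triggers)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return lambda sentence: bool(pattern.search(sentence))


def f1_score(gold, predicted):
    """Harmonic mean of precision and recall over parallel boolean labels."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented sentences with invented gold labels for a charge-indict-like type.
sentences = [
    "Prosecutors charged the executive with fraud.",
    "A grand jury indicted two officers on Tuesday.",
    "The troops moved north overnight.",
    "He was formally accused of theft in court.",   # missed: no trigger present
    "The battery was fully charged.",               # false alarm: ambiguous trigger
]
gold = [True, True, False, True, False]

is_charge = trigger_classifier(["charged", "indicted", "indictment"])
predicted = [is_charge(s) for s in sentences]
score = f1_score(gold, predicted)
print(predicted, round(score, 2))
```

Even in this toy setting the word list both misses a paraphrase and fires on an unrelated sense of a trigger; for a lexically diverse type such as attack, misses of the first kind would be expected to dominate.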
Having reviewed two (correlated) attempts to schematise broad-coverage event types, we find that the extreme variability within and across types suggests that this approach does not readily generalise to the breadth of events. Although we again consider such a typology in Section 3.1, the data presented here suggest that this approach is flawed: while considering a few prescribed event types may be suited to specific applications, alternatives must be considered for more general event processing.