
3.2 Type-driven annotation experiment

3.2.2 Inter-annotator agreement

In our task, as in ace05, it is difficult to aggregate inter-annotator agreement directly. Token annotation is underspecified in both schemas, although ours more so, leading to sparse agreements.¹⁶ We therefore assess inter-annotator agreement as a sentence-level (or document-level) binary decision of whether the unit refers to an event of a particular type, following Naughton et al. (2010). This fails to account for many aspects of the task, including:

• multiple events of the same type marked in the evaluation unit;¹⁷

• type confusion for an agreed event;¹⁸

• the relative salience of references and referents: one annotator missing an adverbial reference is considered no less an error than missing a typical verbal reference; and

• marked attributes other than type, particularly in ace05.
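The reduction from individual event references to binary unit-level decisions can be sketched as follows. This is a minimal illustration, not the tooling used in the experiment: the `Mention` record and both function names are assumptions introduced here. The toy input shows the first limitation listed above, since two references of the same type in one sentence collapse into a single decision.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Mention(NamedTuple):
    """One annotated event reference (hypothetical record layout)."""
    doc_id: str
    sentence: int    # sentence index within the document
    event_type: str  # e.g. "Finance" or "Transaction"


def sentence_units(mentions: Iterable[Mention]) -> Set[Tuple[str, int, str]]:
    """Collapse mentions to binary (document, sentence, type) decisions."""
    return {(m.doc_id, m.sentence, m.event_type) for m in mentions}


def document_units(mentions: Iterable[Mention]) -> Set[Tuple[str, str]]:
    """Collapse mentions to binary (document, type) decisions."""
    return {(m.doc_id, m.event_type) for m in mentions}


# Toy input (invented): two Transaction references in the same sentence.
toy = [
    Mention("doc1", 1, "Finance"),
    Mention("doc1", 2, "Transaction"),
    Mention("doc1", 2, "Transaction"),
]
print(len(sentence_units(toy)))  # 2: the duplicate Transaction is merged
print(len(document_units(toy)))  # 2: one decision per type in the document
```

Using sets makes the loss of information explicit: the number of events per unit, the marked span and any attribute other than type are all discarded before agreement is measured.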

Though the sample is very small, from the raw agreement counts in Table 3.3 we can see annotators often disagree on the identification or classification of events. Annotator B is alone in identifying Correspondence, Disaster and New Release events, such as those in the following examples:

(18) “Ninety-five million tonnes will be rolled out when the market needs it,” Mr Forrest told a [Correspondence conference]B of the Securities and Derivatives Industry Association in Sydney yesterday.

¹⁵ These counts – and all statistics presented here – exclude the five annotator training documents.

¹⁶ Micro-averaged event type (and subtype) F1 agreement between the first-pass annotators for the full ace05 training corpus (Walker et al., 2006) is 79% (75% for subtypes) at the document level, 65% (64%) at the sentence level and 57% (57%) at the anchor-text (i.e. token) level. This 8 percentage point drop from sentence to token reflects both anchor choice and disagreement in the number of events of the same type.

¹⁷ In our annotation, the number of distinct events of the same type per document is 2.1 for the median type. Over the ace05 training corpus (Walker et al., 2006) the equivalent statistic is 1.7 for subtypes, and 2.7 for coarse types.

¹⁸ Our schema does not strictly allow us to determine when annotators have identified the same referent, even in token-level evaluation, since the anchor to mark is underspecified and multiple references with the same attributes may appear within a sentence. In ace05 the specification of event participants makes this less

                          Documents                     Sentences
Type                      |A∩B|  |A\B|  |B\A|     F1    |A∩B|  |A\B|  |B\A|     F1     ∆F1
Conflict                      4      0      4    0.7       10      1     18    0.5    -0.2
Correspondence                0      0      5    0.0        0      0     17    0.0     0.0
Disaster                      0      0      2    0.0        0      0      3    0.0     0.0
Employment / Award            2      3      1    0.5        5      5      3    0.6     0.1
Finance                       2      3      0    0.6        4     12      5    0.3    -0.3
Governance                    0      1      3    0.0        0      1      7    0.0     0.0
Justice                       1      0      2    0.5        5      1      3    0.7     0.2
Lifecycle                     2      0      0    1.0        7      1      3    0.8    -0.2
New Release                   0      0      5    0.0        0      0      5    0.0     0.0
Organisation Lifecycle        2      3      0    0.6        1     12      1    0.1    -0.4
Sports Match                  1      0      1    0.7        1      0      2    0.5    -0.2
Transaction                   4      2      1    0.7        7      8     10    0.4    -0.3

Table 3.3: Agreement and disagreement counts for our type-driven annotation and the derived F measure (F1) for document and sentence-level binary event type identification.

Here A and B are the sets of (unit, type) pairs produced respectively by our annotators, with $F_1 = \frac{2\,|A \cap B|}{|A| + |B|}$. ∆F1 is the gain in F measure when moving from document to sentence-level evaluation.
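As a worked illustration of these definitions, the sketch below recomputes F1 and ∆F1 for the Organisation Lifecycle row of Table 3.3 from its three counts, using |A| = |A∩B| + |A\B| and |B| = |A∩B| + |B\A|. The function name and the choice to return 0.0 when neither annotator marks anything are assumptions made for this example.

```python
def f1_from_counts(both: int, only_a: int, only_b: int) -> float:
    """F1 = 2|A∩B| / (|A| + |B|), computed from agreement counts."""
    size_a = both + only_a  # |A| = |A∩B| + |A\B|
    size_b = both + only_b  # |B| = |A∩B| + |B\A|
    if size_a + size_b == 0:
        return 0.0  # assumed convention when neither annotator marks the type
    return 2 * both / (size_a + size_b)


# Organisation Lifecycle counts as listed in Table 3.3.
doc_f1 = f1_from_counts(both=2, only_a=3, only_b=0)    # 4/7
sent_f1 = f1_from_counts(both=1, only_a=12, only_b=1)  # 2/15
delta_f1 = sent_f1 - doc_f1

print(f"{doc_f1:.1f} {sent_f1:.1f} {delta_f1:.1f}")  # 0.6 0.1 -0.4
```

The difference of the rounded values would be -0.5, so the tabulated -0.4 suggests that ∆F1 was computed from unrounded F1 scores.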

(19) The two poor divisional performances were triggered by the harsh [Disaster recession]B in New Zealand and by the [New Release relaunch]B of the Dick Smith brand to compete better with the market leader, JB Hi-Fi.

Excluding these types, our annotators produce the same quantity of distinct document-level type annotations, but with substantial disagreement: for each type there are at least as many documents annotated in disagreement as there are documents annotated in agreement. Except where document-level agreement is already poor, agreement measured with F1¹⁹ at the sentence level is consistently much lower than document-level agreement (see Table 3.3, column ∆F1). Some of these errors may result from insufficient specification in the annotation schema, or real differences in interpreting the annotated texts; others result from the difficulty of identifying all relevant event references.
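The statement that both annotators produce the same quantity of document-level annotations once the types marked only by B are set aside can be checked by summing the document columns of Table 3.3. The sketch below does this. The dictionary layout and variable names are assumptions made for the example, and the pooled F1 it prints is derived here from the tabulated counts, not reported in the text.

```python
# Document-level counts from Table 3.3: (|A∩B|, |A\B|, |B\A|) per type.
DOC_COUNTS = {
    "Conflict": (4, 0, 4),
    "Correspondence": (0, 0, 5),
    "Disaster": (0, 0, 2),
    "Employment / Award": (2, 3, 1),
    "Finance": (2, 3, 0),
    "Governance": (0, 1, 3),
    "Justice": (1, 0, 2),
    "Lifecycle": (2, 0, 0),
    "New Release": (0, 0, 5),
    "Organisation Lifecycle": (2, 3, 0),
    "Sports Match": (1, 0, 1),
    "Transaction": (4, 2, 1),
}

# Types that only annotator B identified.
B_ONLY_TYPES = {"Correspondence", "Disaster", "New Release"}

kept = [c for t, c in DOC_COUNTS.items() if t not in B_ONLY_TYPES]
agreed = sum(both for both, _, _ in kept)
only_a = sum(a for _, a, _ in kept)
only_b = sum(b for _, _, b in kept)

size_a = agreed + only_a  # annotations by A over the remaining types
size_b = agreed + only_b  # annotations by B over the remaining types
pooled_f1 = 2 * agreed / (size_a + size_b)

print(size_a, size_b, agreed, only_a + only_b)  # 30 30 18 24
print(f"{pooled_f1:.1f}")                       # 0.6
```

Each annotator contributes 30 (document, type) pairs over the remaining nine types, of which 18 are shared and 12 are unique to each annotator.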

Various types of disagreement can be identified in the story shown in Figure 3.4, exemplifying the difficulty of such an annotation task. In sentences 1. and 4., the transfer of money into a fund is disputed as either a Finance or Transaction event, according to the respective annotations of A and B; our annotation guidelines provide no clear reference for such an example. Annotator B alone marks an investigation in sentence 1.; A alone marks the Employment event of hiring administrative oversight (sentence 2.). Both seem to be clearly


1. Annotator A: On June 30, managers acting for Trio Capital [Finance poured]A_a $47 million of their investments into a fund that is now being investigated for the whereabouts of $118 million in hedge fund investments.

   Annotator B: On June 30, managers acting for Trio Capital [Transaction poured]B_a $47 million of their investments into a fund that is now being [Justice investigated]B_b for the whereabouts of $118 million in hedge fund investments.

2. Annotator A: Administrators called in before Christmas to [Employment oversee]A_b Trio Capital say they are unable to determine what assets have been [Transaction bought]A_c with $118 million invested through the Astarra Strategic Fund.

   Annotator B: Administrators called in before Christmas to oversee Trio Capital say they are unable to determine what assets have been bought with $118 million [Transaction invested]B_a through the Astarra Strategic Fund.

3. Annotator A: Inquiries have focused on a company [Org. Lifecycle registered]A_c in the British Virgin Islands, EMA International, which has provided statements but no proof of investments in hedge funds.

   Annotator B: Inquiries have focused on a company registered in the British Virgin Islands, EMA International, which has provided statements but no proof of investments in hedge funds.

4. Annotator A: The annual report of Astarra Strategic Fund, one of 24 Trio Capital managed investment schemes now under administration, reveals that on June 30 its assets were [Finance topped]A_d up with a $47 million transfer of assets.

   Annotator B: The annual report of Astarra Strategic Fund, one of 24 Trio Capital managed investment schemes now under administration, reveals that on June 30 its assets were topped up with a $47 million [Transaction transfer]B_a of assets.

5. Annotator A: On December 22, Astarra Asset Management was [Org. Lifecycle placed into voluntary administration]A_e at the request of creditors.

   Annotator B: On December 22, Astarra Asset Management was [Org. Lifecycle placed]B_c into voluntary administration at the request of creditors.

Figure 3.4: An extract from a news story with two predominantly differing annotations (Stuart Washington, Mystery deepens over Trio’s missing $118m, Sydney Morning Herald,

events within the typology, but each annotator failed to identify one. Other superficially similar instances – what assets have been [Transaction bought] (2.), the $118 million [Transaction invested] (2.) and a company [Org. Lifecycle registered] in . . . (3.) – seem more dubious in terms of their reference to a specific event. It is unclear, for instance, whether registered refers to the act or the state of registration. We discuss similar examples below. There are also apparent errors in the annotation of coreference: A marks poured in 1. as not being coreferent with topped in 4.; B incorrectly coindexes invested in 2. with poured, although a partitive relation might hold. Finally, in sentences 4. and 5. we find disputes with respect to the marked span, which is not considered disagreement within our schema.
