
5.4 Inter-annotator agreement and corpus analysis

5.4.2 Quantitative inter-annotator agreement

For a quantitative evaluation of corpus quality and task difficulty, we seek to determine the extent to which annotators agree with each other and with the final, adjudicated corpus. At the token level, we are guaranteed at most one annotation per annotator; when considering a span of tokens, a sentence, or a document, we may compare the set (or bag) of annotations. Hence we consider a number of inter-annotator decision comparisons:

Binary, such as: is token x linked?

KB node, such as: what KB entry is token x linked to, given that it is linked?

Multi-valued, such as: what KB entries are linked to from document x?

In all cases we apply F measure (or Dice’s coefficient), which is defined as the harmonic mean of precision and recall:

\[
P_{AB} = R_{BA} = \frac{|A \cap B|}{|A|}, \qquad
F_1(A, B) = \frac{2 P_{AB} R_{AB}}{P_{AB} + R_{AB}} = \frac{2\,|A \cap B|}{|A| + |B|}
\]

Here, A and B represent two annotations, defined as sets of (u, l) pairs for unit u annotated with label l. Chance-corrected metrics like Cohen’s κ do not apply to linking, where the random chance of selecting a particular target is minuscule; F measure effectively accounts for chance in binary decisions with a vast majority negative class.

the event’s occurrence, in part because events are often reported according to the information of some other source, and in part because of the need to keep the schema brief for low-skill annotators.

Unit           Decision         AB                AC
                                P     R     F     P     R     F
Tokens         Is marked        0.56  0.26  0.35  0.56  0.32  0.40
Tokens         Is linkable      0.55  0.22  0.32  0.56  0.26  0.35
Tokens         Is linked        0.56  0.18  0.27  0.48  0.14  0.21
Linked tokens  Target KB node   0.5   0.5   0.5   0.7   0.7   0.7
Linked tokens  Target date      0.6   0.6   0.6   0.8   0.8   0.8
Documents      Target KB nodes  0.5   0.2   0.3   0.5   0.3   0.4
Documents      Target dates     0.6   0.3   0.4   0.5   0.4   0.4

Table 5.4: Inter-annotator agreement over selected units and decisions. For each annotator pair (AB and AC), A’s annotations are considered ground truth and the others’ are compared to calculate precision (P), recall (R) and F measure (F). Annotator training documents are removed.
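The F measure over (u, l) pairs can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce the tables: the function name `precision_recall_f1` and the toy token identifiers are hypothetical, and one annotation is arbitrarily treated as ground truth, as in Table 5.4.

```python
from typing import Hashable, Set, Tuple

# An annotation is a set of (unit, label) pairs, e.g. (token id, "linked").
Annotation = Set[Tuple[Hashable, Hashable]]


def precision_recall_f1(gold: Annotation, other: Annotation):
    """Compare `other` against `gold` and return (P, R, F1).

    Hypothetical helper: F1 is symmetric in its arguments, whereas
    P and R swap when the roles of the two annotators are swapped.
    """
    overlap = len(gold & other)
    precision = overlap / len(other) if other else 0.0
    recall = overlap / len(gold) if gold else 0.0
    total = len(gold) + len(other)
    f1 = 2 * overlap / total if total else 0.0
    return precision, recall, f1


# Toy data (not from the corpus): the gold annotator links four tokens,
# the other annotator links two, and they share one.
gold = {(1, "linked"), (2, "linked"), (3, "linked"), (4, "linked")}
other = {(1, "linked"), (5, "linked")}
p, r, f = precision_recall_f1(gold, other)
print(p, r, round(f, 2))  # 0.5 0.25 0.33
```

Because only positive decisions enter the sets, the large number of tokens that neither annotator marks never contributes to the score.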

Marking tokens. Inter-annotator agreement at the token level is poor; Table 5.4 indicates under 40% F measure agreement as to whether a token is linkable.[19] Most disagreement lies in the decision to mark a particular token as a newsworthy, past event reference. Some of this reflects individual annotator biases: in their shared portion, annotator A marked 3.7 times as many tokens as B. Hence while B and C recalled fewer than 23% of the tokens marked by A, 56% of the tokens marked by B and C were also marked by A.

The schema underspecifies definitions of ‘event’ and ‘newsworthiness’, accounting for some of this token-level disagreement, but not directly affecting the task of linking a specified mention to the archive. For example, an adjectival mention such as Apple’s new CEO is easy to miss and questionable as an explicit past event reference. Events are also confused with facts and abstract entities, such as bans, plans, reports and laws; unlike many other facts, events can be grounded to a particular time of occurrence. Nominal event references such as graft, e-mail or fire may also ambiguously refer to an event or the theme of that event.[20]

Annotators may also select different tokens for the same event reference, such as in the Black Saturday fires burned or another acquisition. The low per-token agreement is therefore a result of the schema’s loose prescriptions and requirement of a single token per reference, while highlighting the general difficulty of newsworthy event identification and anchoring.

[19] For the moment we ignore adjudication. Full token-level agreement statistics are tabulated in Appendix Section C.1.

[20] Other ambiguous references include impressionistic language such as scandal, tragedy and troubles. In one instance, carrot was found to refer to a government’s offer of incentive! Negated event references such as missed out and overlooked also present a problem. The schema asks annotators to focus on explicit event references, but this too could be more explicit.

Categorising event tokens. The top section of Table 5.4 also shows agreement decreasing with increasingly fine-grained annotation decisions, such that annotators B and C respectively recall only 18% and 14% of tokens successfully linked by A, with precision around 50%.[21] We provide raw pairwise agreement data for token-level annotation in Appendix C.1.
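As a worked check of the harmonic-mean definition, the ‘Is linked’ row of Table 5.4 for pair AB (P = 0.56, R = 0.18) gives the tabulated F. The inputs are themselves rounded to two decimal places, so the check only holds to the table's precision.

\[
F_1 = \frac{2 \times 0.56 \times 0.18}{0.56 + 0.18} = \frac{0.2016}{0.74} \approx 0.27
\]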

Considering only tokens marked by both annotators in a pair, the most confused token label is compound. For every 10 tokens in which annotator pairs agree on compound, there are 28 where they disagree, with one choosing compound, and the other usually (25 of 28 times) choosing linkable.
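Confusion counts of this kind can be tallied as unordered label pairs over the tokens both annotators marked. The following is a sketch under assumptions: the function `label_confusion`, the dictionary representation (token id to label) and the toy labels are all hypothetical and are not drawn from the corpus.

```python
from collections import Counter


def label_confusion(labels_a, labels_b):
    """Count unordered label pairs over tokens marked by both annotators.

    labels_a, labels_b: dict mapping token id -> label (hypothetical format).
    Tokens marked by only one annotator are ignored here.
    """
    counts = Counter()
    for token in labels_a.keys() & labels_b.keys():
        # Sort so that (compound, linkable) and (linkable, compound) coincide.
        pair = tuple(sorted((labels_a[token], labels_b[token])))
        counts[pair] += 1
    return counts


# Toy data: tokens 1-3 are marked by both annotators.
a = {1: "compound", 2: "compound", 3: "linkable", 4: "linkable"}
b = {1: "compound", 2: "linkable", 3: "linkable", 5: "compound"}
confusion = label_confusion(a, b)
print(confusion[("compound", "compound")])  # agreements on compound: 1
print(confusion[("compound", "linkable")])  # compound vs linkable: 1
```

A ratio such as "10 agreements to 28 disagreements" then compares the count for a label paired with itself against the summed counts of every pair in which it appears with a different label.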

Among difficult compound–linkable ambiguities are bureaucratic and legal processes, such as large business transactions and changes in law. One example in our corpus states that the Carr government loosened restrictions . . . . The government’s loosening initially consists of their presenting a bill to parliament, but is not concluded until two houses of parliament vote in its favour, and the bill receives vice-regal approbation (which, as a formality, usually goes unnoticed in news). Generally there would be a further delay before the loosening comes into effect. So the referent event space is technically compound. Yet given another similar reference, a parliamentary victory might be the unambiguous referent.

There is also frequent ambiguity among compound, multiple and aggregate, suggesting that these are not natural delineations of event reference. For example, does food riots in 30 countries [over a short period] constitute reference to a single event reported through its sub-events, a collection of distinct events, or an emergent aggregate? In an earlier schema (in Appendix B.1) these categories were conflated as plural, which is too broad for annotators to work with. These delineations were therefore intended to help annotators decide what is not linkable, but they do not affect the present event linking task.

Identifying link targets. To assess how often annotators agree on a canonical link target, we first consider only those tokens a pair of annotators both successfully linked. For these units, there is reasonable (31 out of 64) agreement between A and B and high but statistically weak (11 out of 15) agreement between A and C. Since annotators may mark different tokens for the same event reference, or may mark different numbers of references with the same link target, we also compare the set of distinct link targets identified within each document. Annotators B and C respectively recall 22% and 34% of A’s link targets with about 50% precision. Overall, this level of agreement suggests the feasibility of the event linking task, when ignoring complexities introduced by exhaustive and archive-internal linking.
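The document-level comparison can be illustrated by collapsing each annotator's token-level links to a set of (document, target) pairs before measuring overlap. This is only a sketch: the names `document_targets` and `doc_level_agreement`, the tuple layout, and the choice to pool pairs across all documents (a micro-average) are assumptions, since the text does not specify how per-document results are combined.

```python
def document_targets(token_links):
    """Collapse token-level links to distinct (document, target) pairs.

    token_links: iterable of (doc_id, token_id, target) triples
    (a hypothetical representation). Token ids are discarded, so two
    annotators who anchor the same target on different tokens still agree.
    """
    return {(doc, target) for doc, _token, target in token_links}


def doc_level_agreement(gold_links, other_links):
    """Return (precision, recall) of `other` against `gold`, pooled over documents."""
    gold = document_targets(gold_links)
    other = document_targets(other_links)
    overlap = len(gold & other)
    precision = overlap / len(other) if other else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the gold annotator links art1 twice in d1, which counts once.
gold_links = [("d1", 3, "art1"), ("d1", 9, "art1"), ("d1", 12, "art2"), ("d2", 4, "art3")]
other_links = [("d1", 5, "art1"), ("d2", 4, "art9")]
p, r = doc_level_agreement(gold_links, other_links)
print(p, round(r, 2))  # 0.5 0.33
```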

Unit           Decision         JA                JB                JC
                                P     R     F     P     R     F     P     R     F
Tokens         Is marked        0.58  0.76  0.66  0.77  0.44  0.56  0.76  0.60  0.67
Tokens         Is linkable      0.54  0.73  0.62  0.69  0.39  0.50  0.72  0.46  0.57
Tokens         Is linked        0.55  0.69  0.61  0.70  0.30  0.42  0.61  0.24  0.34
Linked tokens  Target KB node   0.84  0.84  0.84  0.8   0.8   0.8   0.7   0.7   0.7
Linked tokens  Target date      0.87  0.87  0.87  0.9   0.9   0.9   0.9   0.9   0.9
Documents      Target KB nodes  0.68  0.69  0.69  0.8   0.4   0.5   0.6   0.4   0.4
Documents      Target dates     0.72  0.71  0.71  0.8   0.4   0.5   0.7   0.5   0.6

Table 5.5: Adjudicator-annotator agreement over selected units and decisions, as per Table 5.4. The adjudicator J’s annotations are considered ground truth for calculation of P, R and F. From J’s perspective, P and R are rates of acceptance and contribution.

In some cases, there may be multiple articles published on the same day that describe the event in question from different angles; Table 5.4 shows agreement increase substantially when relaxed to accept date agreement. Where a definitive link target is not available, an annotator may erroneously select another candidate: an opinion article describing the event, an article where the event is mentioned as background, or an article anticipating the event. One annotator linked the reference the survivors were flown to an article where the survivors were to be flown, which implies the event in question is uncertain and either imminent or happening at present.
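Relaxing target agreement to date agreement amounts to mapping each link target to its publication date before comparing annotations. A minimal sketch follows, in which the helper `relax_to_dates`, the `pub_date` lookup and the article identifiers are hypothetical.

```python
from datetime import date


def relax_to_dates(annotation, pub_date):
    """Replace each link target with its publication date.

    annotation: set of (unit, target) pairs.
    pub_date: dict mapping target id -> datetime.date (assumed to be available
    from the archive's metadata).
    """
    return {(unit, pub_date[target]) for unit, target in annotation}


# Toy data: two articles on the same day describing one event.
pub_date = {"art1": date(2009, 2, 9), "art2": date(2009, 2, 9)}
a = {("t1", "art1")}
b = {("t1", "art2")}

strict_overlap = len(a & b)
relaxed_overlap = len(relax_to_dates(a, pub_date) & relax_to_dates(b, pub_date))
print(strict_overlap, relaxed_overlap)  # 0 1
```

Under the strict comparison the two annotators disagree, while under the relaxed comparison they agree, which is the effect visible in the ‘Target date’ rows.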

Determining whether a particular archival story reports an event is difficult, as suggested by high confusion between not found and reported here annotations. For every 10 tokens where annotators agree on not found, there are 10 cases of not found–reported here confusion, and 6 cases of not found–linked confusion. Some confusion results from cases where the SMH only belatedly reports an event, either because it was not sufficiently newsworthy at the time, or because the event’s occurrence only later became public knowledge. The disagreements otherwise indicate a lack of clear discourse features for annotators to discern whether a story reports or merely mentions an event.

The task is complicated by changed perspective between an event’s first report and its later reference. Can overpayed link to what had been acquired? Can 10 died be linked to a story s where only nine are confirmed dead? For this example, if the tenth death occurred in the same event as the first nine (unbeknownst to the reporter of s), its mention is strictly a reference to the event reported in s. If instead the 10th death occurred as a result of the same event as the first nine, its mention may be better considered an aggregate. For the application of adding hyperlinks to news, such a link might still be beneficial; such are the challenges in determining appropriate event link targets.


Adjudication. Agreement statistics for adjudication are shown in Table 5.5. For all annotation decisions listed, each annotator achieves high precision against the adjudicator; that is, the adjudicator accepts most input annotations. Since annotators rarely agreed among themselves, this shows that each annotator is likely to miss event references in the exhaustive linking task.

The lowest precision (54%) is reported in annotator A’s marking of tokens, suggesting A over-generates annotations or chooses different anchor tokens from J. A’s thoroughness is apparent in her recall (or high contribution) of link targets per document, which is substantially higher than that of B and C.

In all, we find that the primary disagreements in the annotation task regard whether to mark a particular token and whether it can be linked. We have seen similar recall problems in other fine-grained event reference annotation; linking requires a further sustained effort to examine candidates and refine queries, such that some annotators make the effort to identify many more links than others. An exhaustive annotation therefore requires redundant annotations to be merged. Regarding the selection of link target, there is relatively little dispute, suggesting that the event linking task is feasible. Yet agreement statistics also suggest identifying a link target is not trivial, a result which is supported by further analysis of the resulting corpus; in particular, the relationship between link source and target.

In document Grounding event references in news (Page 128-132)