
3.3 Annotated corpora

3.3.1 The TERN corpus

The TERN corpus (Ferro et al., 2004) is the corpus employed in the TERN 2004 competition (Ferro, 2004), whose aim was to evaluate systems capable of performing automatic TIMEX2 annotation. The TERN 2004 exercise extends the MUC definition of the TIMEX category in terms of broader coverage of expressions, and by introducing attributes that capture the meaning of a temporal expression. The corpus includes both English and Chinese data annotated according to the TIDES TIMEX2 annotation standard. The following discussion focuses on the English part of TERN, because only the English data was used for experiments in this thesis.

                               Annotator 1   Annotator 2   Annotator 3
Partial recognition (TIMEX2)         0.973         0.972         0.915
Full extent (TEXT)                   0.963         0.911         0.894
VAL                                  0.981         0.939         0.940
MOD                                  0.983         0.800         0.564
SET                                  0.980         0.835         0.833
ANCHOR_DIR                           0.982         0.879         0.777
ANCHOR_VAL                           0.942         0.856         0.728

Table 3.4: Official inter-annotator agreement figures for the TERN corpus.

The English TERN data were assembled from a variety of sources selected from broadcast news programs, newspapers and newswire reports, and included 767 training documents and 192 test documents.² They were annotated by three annotators using the Alembic Workbench (Day et al., 1997) and the Callisto annotation tool (Day et al., 2004). The entire process went through the stages of annotation, discussion and reconciliation until reaching an inter-annotator agreement of 90% or above on partially identifying TEs³ and on the value of the VAL attribute. The inter-annotator agreement was computed by scoring each annotator against the final adjudicated gold standard generated by Lisa Ferro, the co-author of the TIMEX2 guidelines. Table 3.4 presents the scores achieved by the three annotators when their annotations were compared to the reconciled gold standard.

The scores on the row Partial recognition (TIMEX2) indicate the percentage of temporal expressions correctly annotated with a TIMEX2 tag by each of the three annotators, in the sense that they annotated at least a part of the markable TE present in the gold standard. The task was difficult because annotators had to mark not only time nouns and numeric expressions, but also other parts of speech such as adverbs and adjectives. For this reason there was occasional disagreement over whether something was markable. Annotators were observed to miss certain TEs, particularly pre-nominals such as the adjective former in the context the former senator, which in many cases was not annotated as a TE. However, the annotation proved to be quite accurate for time nouns and numeric expressions.

2. These figures are extracted from an official presentation on TERN Evaluation Task Overview and Corpus that is available online at http://fofoca.mitre.org/tern_2004/ferro1_TERN2004_task_corpus.pdf

3. Two annotators are considered to have annotated the same TE even if their annotations match only partially. In the rest of the thesis, the numeric figures corresponding to partial matches will be labelled TIMEX2.

The scores on the row labelled Full extent (TEXT) indicate in how many cases the annotated span of text representing a TE is exactly the same as the extent of the TE encountered in the gold standard (the byte offsets for the start and end of the TE are the same as in the gold standard). Problems appeared because human annotators often did not look beyond the head and they did not include post-modifiers (e.g. a year when most candidates are afraid of appearing negative), or pre-modifiers (e.g. almost a decade) and determiners (e.g. the 1960s). They also had problems with embedding, especially in the case of appositives (e.g. The speaker focused on 1955, the year he was born.), and they were confused over where the head was in contexts like a three-hour meeting.
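The difference between the two span-level rows of Table 3.4 can be illustrated with a minimal Python sketch. It assumes that each TE is represented as a pair of character offsets and that agreement is reported as the fraction of gold-standard TEs matched; the function names, the example sentence and the annotations are hypothetical, and the official TERN scorer may compute its figures differently.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_of(text: str, phrase: str) -> Span:
    """Offsets of the first occurrence of `phrase` in `text` (illustrative helper)."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def agreement(gold: List[Span], annotated: List[Span]) -> Tuple[float, float]:
    """Return (partial, exact) agreement as fractions of the gold spans.

    partial: a gold TE counts if any annotated span overlaps it.
    exact:   a gold TE counts only if an annotated span has identical offsets.
    """
    if not gold:
        raise ValueError("gold standard contains no spans")
    annotated_set = set(annotated)
    partial = sum(any(overlaps(g, a) for a in annotated) for g in gold)
    exact = sum(g in annotated_set for g in gold)
    return partial / len(gold), exact / len(gold)


# Hypothetical example: the annotator omits the determiner of "the 1960s".
text = "The speaker arrived in the 1960s and stayed almost a decade."
gold = [span_of(text, "the 1960s"), span_of(text, "almost a decade")]
annotated = [span_of(text, "1960s"), span_of(text, "almost a decade")]

partial, exact = agreement(gold, annotated)
print(f"partial={partial:.3f} exact={exact:.3f}")
```

In this toy case both gold TEs are found, so partial agreement is 1.0, but the missing determiner means that only one of the two extents matches exactly, giving 0.5 for the full extent.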

The following rows in Table 3.4 represent the agreement obtained when assigning values to the TIMEX2 attributes. In the case of the VAL attribute, human annotators made errors⁴ when typing the value, when selecting list-items, when calculating the value of the attribute, and when using the calendar. Some errors then propagated, as certain dates or times were saved and reused to fill in the value of VAL for other underspecified expressions. The annotators also had problems understanding the guidelines, or remembering all the details specified by the guidelines. There were cases when the annotators were not to blame for the inconsistencies in annotation, as the guidelines sometimes offer more than one choice for encoding the same thing, or the text is simply too ambiguous and annotators have to follow their own interpretation (e.g. on the night of a presidential debate).

4. The source of information concerning error sources in the manual annotation process is an official presentation on Annotating the TERN Corpus, available online at http://fofoca.mitre.org/tern_2004/ferro2_TERN2004_annotation_sanitized.pdf

The MOD attribute was also subject to human error, mostly because modified expressions are rare in text, so the annotators were not used to specifying a value for the MOD attribute. Sometimes they did not notice that the expression was modified, and when they did notice, they disagreed over the MOD type (e.g. for the expression nearly 3 years one annotator selected the MOD value APPROX and another annotator selected LESS_THAN).

The SET attribute was also subject to human forgetfulness and to disagreement over what a set expression is (e.g. set expressions were confused with generic expressions, as in winter snowstorms).

The annotation errors for the anchoring attributes ANCHOR_DIR and ANCHOR_VAL arose because annotators forgot to apply them, or did not pay attention to all the information present in a document. There were also problems in distinguishing durations from points, and in deciding what the granularity of ANCHOR_VAL should be. Consistency was therefore difficult for the annotators, especially because natural language is vague about when durations begin and end.

Despite all the errors that appeared during the annotation process, the TERN corpus is the most reliably annotated resource for temporal processing developed so far, and is used frequently for the evaluation of systems performing TIMEX2 annotation. No other resource bearing temporal annotation has reached the level of inter-annotator agreement achieved by the TERN data. The detailed annotation guidelines contributed greatly to this achievement.