2 Background and Related Work
2.2 Central Concepts
2.2.2 Corpus annotation
Corpus annotation is the process of building interpretative information into corpora. Unannotated corpora consist of raw or plain texts that can be used as a basis for linguistic study, but they become far more useful if they are further refined and developed into annotated corpora. In this case, they are enriched with different kinds of linguistic information referred to as "annotations" to enable the manipulation of the data contained in the corpus in more diverse ways (McEnery & Wilson, 2001, p. 32). The term annotation refers both to the task of adding annotations to the text and to the actual linguistic symbols which are added (Leech, 1997, p. 2). Corpus annotation has been utilized extensively in corpus-based language study and NLP over the last several decades, and various annotation schemes and tools have been developed. The main focus has generally been on the English language, but since the turn of the millennium similar tools for other languages have become increasingly common.
Corpus annotation offers many advantages. According to Leech (1997, pp. 4–6), the first advantage is that it is easier to extract information from a corpus which is enriched with annotations. Secondly, an annotated corpus can constitute a valuable resource that can be reused by other members of the research community. Thirdly, annotations are multifunctional; there are different levels of annotation, and one level prepares the way for the following level. For example, POS tagging can be seen as the first step towards more challenging levels of annotation, such as syntactic and semantic annotation; these will be discussed in more detail in the following sections.
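The first advantage can be illustrated with a minimal sketch. The toy corpus, the tag labels, and the helper `find` below are invented for illustration and do not come from Leech (1997) or from any particular tagset or tool. The sketch shows how POS tags allow a search to separate the noun "can" from the modal auxiliary "can", a distinction that a search over raw text cannot make.

```python
# Toy POS-tagged corpus as (token, tag) pairs.
# Sentences and tag labels are hypothetical, chosen only for illustration.
tagged = [
    ("I", "PRON"), ("can", "AUX"), ("open", "VERB"),
    ("the", "DET"), ("can", "NOUN"), (".", "PUNCT"),
    ("She", "PRON"), ("can", "AUX"), ("swim", "VERB"), (".", "PUNCT"),
]


def find(tokens, word, tag=None):
    """Return the positions of `word`, optionally restricted to one POS tag."""
    return [
        i
        for i, (tok, t) in enumerate(tokens)
        if tok.lower() == word and (tag is None or t == tag)
    ]


# Without annotation, every occurrence of the word form is returned.
all_hits = find(tagged, "can")

# With annotation, the search can be narrowed to the noun reading only.
noun_hits = find(tagged, "can", tag="NOUN")

print(all_hits)
print(noun_hits)
```

In an unannotated corpus, only the first kind of query is available, and the researcher has to sort the hits by hand.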
Leech (1997, p. 6) remarks that, over the history of corpus annotation, some of the annotation types that have been employed have proved difficult or even impossible for other members of the research community to use. To overcome this problem, Leech (1997, pp. 6–8) drafts some practical guidelines for successful annotation of corpora:
1) The raw corpus should be recoverable, in other words, it should be easy to delete the annotations, if necessary.
2) Correspondingly, it should be easy to remove the annotations from the corpus and store them independently, if necessary.
3) An annotated corpus should come with appropriate documentation, including information about the annotation scheme itself and about how, where, and by whom the annotations have been applied. Furthermore, there should be some account of the quality of the annotations.
4) No annotation scheme should claim to represent "God’s Truth". The people who use ready-annotated corpora do so simply for practical reasons. They consider it a much wiser choice than compiling their own corpora from scratch and devising and applying their own annotations.
5) The annotation schemes used should be based as far as possible on consensual or theory-neutral analyses of the data to avoid misunderstandings and misapplications.
6) No annotation scheme should claim to represent the absolute standard. The nature of the corpus as well as the particular needs of the task at hand have a decisive effect on what kind of annotation scheme is considered to be the most useful and sensible.
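Guidelines 1 and 2 can be made concrete with a minimal sketch. It assumes a hypothetical inline format in which each token is written as `word_TAG`; this format, the example sentence, and the function names `split_annotations` and `merge_annotations` are illustrative assumptions rather than a scheme proposed by Leech (1997). The sketch shows that, in such a format, the raw text can be recovered by deleting the annotations, and the annotations can be stored separately and merged back later.

```python
# Inline-annotated text in a hypothetical word_TAG format (illustration only).
annotated = "The_DET cat_NOUN sat_VERB on_ADP the_DET mat_NOUN ._PUNCT"


def split_annotations(text):
    """Separate inline-annotated text into raw tokens and their tags."""
    tokens, tags = [], []
    for item in text.split():
        # Split on the last underscore so that words containing "_" survive.
        word, _, tag = item.rpartition("_")
        tokens.append(word)
        tags.append(tag)
    return tokens, tags


def merge_annotations(tokens, tags):
    """Rebuild the inline-annotated text from tokens and separately stored tags."""
    return " ".join(f"{word}_{tag}" for word, tag in zip(tokens, tags))


tokens, tags = split_annotations(annotated)

# Guideline 1: the raw corpus is recoverable once the annotations are deleted.
raw = " ".join(tokens)

# Guideline 2: the annotations can be stored independently, keyed by token position.
standoff = list(enumerate(tags))

print(raw)
print(standoff)
```

Keeping the tags keyed by token position is only one possible design; character offsets into the raw text would serve the same purpose when tokenization itself is open to dispute.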
Leech (1997, pp. 7–8) raises two good points supporting the idea of a certain degree of unification in corpus annotation practices. The first advantage to be gained is in saving time and effort. It is clearly sensible to adhere to an annotation scheme that one is already familiar with and that has been found to be effective and useful. The second advantage is related to the reusability factor indicated above. If researchers wish to exchange data and resources, this is obviously easier if the corpora have been made compatible by following the same standards and guidelines worldwide. In fact, there was an attempt to standardize corpus annotation practices in the 1990s, when a large community of language engineers set out to propose standards, guidelines, and recommendations for good practice in the core areas of the field. These were named the "EAGLES (Expert Advisory Group on Language Engineering Standards) Guidelines", and they covered computational lexicons, text corpora, computational linguistic formalisms, spoken language resources, as well as assessment and evaluation (Institute for Computational Linguistics "A. Zampolli", n.d.).
Though corpus annotation clearly offers advantages, not everyone has fully supported its use. Sinclair (2004, pp. 190–191), for instance, admits that corpus annotation can be a helpful procedure, but he strongly cautions against its overuse. In his opinion, it allows the handling of documents without engaging in the interpretation of the language they contain. As long as a text is marked up with annotations, the computer works with the annotations and ignores the language, resulting in a study of the annotations as opposed to a study of the language used. Sinclair also points out that if corpus data is observed through annotations, anything the annotations are not sensitive to will be missed. Hunston (2002, p. 93) has made similar observations. She suggests that while annotations add to the usefulness of corpora, they also make them less readily updated, expanded, or discarded. Furthermore, since the categories used for the annotation are typically determined before any actual annotation work has been carried out, this, in her opinion, limits the type of research questions that can be asked.
There are several different types of corpus annotation. While this thesis deals with linguistic annotation, which will be discussed in the following subsections, other types include textual and extra-textual annotation, orthographical annotation, prosodic annotation, and phonetic transcription.