Coding and marking-up corpora - Multi-Modal Corpus Design Methodology

Chapter 3: Multi-Modal Corpus Design Methodology

3.5. Coding and marking-up corpora

3.5.1. Coding conventions

Coding is the next phase of the corpus development process. This stage involves ‘the assignment of events to stipulated symbolic categories’ (Bird and Liberman, 2001: 26; see also Brundell and Knight, 2005). This is where qualitative records of events start to become quantifiable, as specific ‘items relevant to the variables being quantified’ are marked up for future analyses (Scholfield, 1995: 46). Coding is closely linked to the transcription phase; however, instead of providing written accounts abstracted from spoken interaction, it provides abstract definitions of these abstractions.

Coding and annotation are commonly undertaken with the use of computational software. Some current corpora are described as unannotated, that is, without tags and mark-up, although the majority are annotated, because the addition of such annotations allows corpora to be navigated using digital software.

These annotations range from the word level (tagging) through to the sentence and text level, involving ‘the addition of typographic or structural information to a document’ (mark-up; see Bird and Liberman, 2001: 26). Corpora can also be annotated at a higher, discourse-based level, wherein specific semantic or pragmatic function-based codes are added. In short, various features of the discourse can be annotated, such as information on speakers (demographic information), context (extra-linguistic information), part of speech (P-O-S, a form of grammatical tagging, such as the CLAWS[25] ‘word class annotation scheme’ used in the BNC; see Garside, 1987), prosodic features (marking stress in spoken corpora), phonetic features (marking speech sounds), or a combination of these (for more information see Leech, 2005 and McEnery and Xiao, 2004; also refer back to the metadata section in 3.3.2.4).
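To make the word-level case concrete, the following is a minimal sketch, in Python, of how part-of-speech tags might be attached to the words of a transcribed utterance as XML attributes and then counted. The element names (`u`, `w`), the attribute names (`who`, `pos`), the speaker ID and the example utterance are all assumptions made purely for illustration, and the tags are hand-assigned in the style of CLAWS-type labels rather than produced by any tagger.

```python
import xml.etree.ElementTree as ET

# Hand-assigned (word, tag) pairs. The tags imitate CLAWS-style labels
# and are illustrative only; no automatic tagger is involved.
TAGGED = [
    ("the", "AT0"), ("cat", "NN1"), ("sat", "VVD"),
    ("on", "PRP"), ("the", "AT0"), ("mat", "NN1"),
]


def annotate_utterance(tagged_tokens, speaker_id):
    """Wrap each word in a <w> element that carries its tag as an attribute."""
    utterance = ET.Element("u", {"who": speaker_id})
    for word, tag in tagged_tokens:
        w = ET.SubElement(utterance, "w", {"pos": tag})
        w.text = word
    return utterance


def count_tags(utterance):
    """Count how often each tag occurs: the step at which coded data becomes quantifiable."""
    counts = {}
    for w in utterance.iter("w"):
        tag = w.get("pos")
        counts[tag] = counts.get(tag, 0) + 1
    return counts


utterance = annotate_utterance(TAGGED, "S1")  # "S1" is a hypothetical speaker ID
xml_string = ET.tostring(utterance, encoding="unicode")
tag_counts = count_tags(utterance)
print(xml_string)
print(tag_counts)
```

Because the tags are stored as attributes rather than typed into the transcript itself, the words remain recoverable unchanged, while software can search or count by tag.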

Early standards for the mark-up of corpora, known as SGML (Standard Generalised Mark-up Language), have generally been succeeded by XML (Extensible Markup Language; see Ide, 1998). These standards were developed in the 1980s, when the electronic-corpora ‘revolution’ was just beginning to take off with the transition from 1st to 2nd generation corpora (refer back to Section 2.2.2 of Chapter 2 for further details). SGML was traditionally used for marking up features such as line breaks and paragraph boundaries, typeface and page layout, providing standards for structuring both transcription and annotation.

[25] CLAWS, the Constituent-Likelihood Automatic Word-Tagging System, is a system for …

Modern advances in technology, and associated advances in the sophistication of corpora and corpus tools, have prompted a movement towards a redefinition of SGML. Since the late 1990s, efforts have been made to establish some ‘encoding conventions for linguistic corpora designed to be optimally suited for use in language engineering and to serve as a widely accepted set of encoding standards for corpus-based work’ (Ide, 1998: 1, discussing the Corpus Encoding Standard, CES[26], specifically). There are various schemes of this nature, including the Open Language Archives Community (OLAC[27]; see Bird and Simons, 2000), the CES, the ISLE[28] Metadata Initiative (IMDI; see Wittenberg et al., 2000) and the Text Encoding Initiative (TEI[29], as used in the BNC; see Sperberg-McQueen and Burnard, 1999).

In general, these schemes aim to cater for corpora of any size and/or form, including spoken and/or written corpora and specialised and/or general corpora. Thus, they work on the premise that the standardised nature of corpus encoding conventions will allow coded data and related analyses to be re-used and transferred across different corpora. However, while many of these schemes share some similarities, and the same intentions in respect of standardisation, at present there remains no universally used, prescribed method of corpus mark-up and encoding, although TEI is perhaps currently the closest to this.

[26] More information about the CES can be found at: http://www.cs.vassar.edu/CES/

[27] OLAC aimed to provide a ‘common framework across electronic preprint archives’. For more information see: http://www.language-archives.org/docs/white-paper.html

[28] Details of the ISLE project can be found at the following website: http://isle.nis.sdu.dk/

[29] The TEI is ‘a consortium which collectively develops and maintains a standard for the representation of texts in digital form’. For more information on the TEI see: http://www.tei-c.org/index.xml

As with the record and transcription phases, the level of detail used in the coding phase, that is, ‘the actual symbolic presentations used’ (Leech, 2005) when annotating a corpus, is generally dependent on the purpose and aims of the corpus (i.e. coding decisions are ‘hypothesis-driven’; refer to Rayson, 2003: 1, also see Allwood et al., 2007a). So, ‘there is no purely objective, mechanistic way of deciding what label or labels should be applied to a given linguistic phenomenon’ (Leech, 1997: 2). However, it should be noted that, regardless of the standards and systems of notation used to encode corpora, the majority tend to integrate this information into the corpus in the same way. Specific codes and tag-sets are usually integrated within the underlying infrastructure of a corpus, contained within searchable header information, separating the ‘extra-textual and textual information’ from the ‘corpus data (or transcripts) proper’ (McEnery et al., 2006: 23). This infrastructure is usually XML-based.
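The separation of header information from the transcript proper can be sketched as follows. This is a simplified illustration only: the element names (`doc`, `header`, `speaker`, `setting`, `text`, `u`), the attribute names and the sample content are invented for the example and do not follow the exact schema of TEI, CES or any other standard discussed above.

```python
import xml.etree.ElementTree as ET

# An invented miniature corpus file: metadata lives in <header>,
# the transcript itself in <text>. Not valid against any real schema.
SAMPLE = """<doc>
  <header>
    <speaker id="S1" role="speaker A"/>
    <speaker id="S2" role="speaker B"/>
    <setting>office</setting>
  </header>
  <text>
    <u who="S1">so how is it going</u>
    <u who="S2">yeah</u>
    <u who="S1">good</u>
  </text>
</doc>"""


def split_header_and_text(xml_source):
    """Return (speaker metadata, utterances) as two separate structures."""
    root = ET.fromstring(xml_source)
    header = root.find("header")
    # Extra-textual information: who the speakers are.
    speakers = {s.get("id"): s.get("role") for s in header.iter("speaker")}
    # Corpus data proper: the ordered (speaker ID, words) pairs.
    utterances = [(u.get("who"), u.text) for u in root.find("text").iter("u")]
    return speakers, utterances


speakers, utterances = split_header_and_text(SAMPLE)

# The header can be searched without touching the transcript, and the
# two can be joined on the speaker ID when an analysis needs both.
turns_per_role = {}
for who, _words in utterances:
    role = speakers[who]
    turns_per_role[role] = turns_per_role.get(role, 0) + 1
print(turns_per_role)
```

Keeping the metadata in one searchable place means that a query such as "all turns by speaker A" needs only the speaker ID stored on each utterance, not repeated demographic detail.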

3.5.2. Gestural coding schemes

It is important to note that while the majority of current encoding schemes and approaches deal with the mark-up of selected extra-linguistic information, they do not have provision for marking up discourse beyond the text in any great detail, insofar as they are not fully extendable to all MM features of talk. As Baldry and Thibault indicate (2006: 148):

In spite of the important advances made in the past 30 or so years in the development of linguistic corpora and related techniques of analysis, a central and unexamined theoretical problem remains, namely that the methods adapted for collecting and coding texts isolate the linguistic semiotic from the other semiotic modalities with which language interacts…. [In] other words, linguistic corpora as so far conceived remains intra-semiotic in orientation…. [In] contrast multi-modal corpora are, by definition, inter-semiotic in their analytical procedures and theoretical orientations.

Thus, within the field of linguistics, no scheme really exists with the capacity to fully support the mark-up of NVC or NVB, nor does any existing scheme integrate information from both spoken and non-verbal stimuli. However, beyond the area of AL (Applied Linguistics) and CL research, there are many schemes which deal with the coding and annotation of visual and/or multi-modal datasets, together with associated methodological approaches to their application. Therefore, it is relevant to discuss these briefly here.

Firstly, there is a wide variety of coding schemes which concentrate solely on facilitating the mark-up and labelling of gestures according to kinesic properties. These function to explicitly define the specific action, size, shape and relative position of movements that comprise forms of gesticulation (see Frey et al., 1983; McNeill, 1992 and Holler and Beattie, 2002, 2003, 2004 for examples of these). One widely used scheme of this nature, the Facial Action Coding System (FACS; Ekman and Friesen, 1978), is perhaps the one which is most relevant to the current thesis, in that it specifically deals with head movements (in addition to facial expressions). For examples of studies that use FACS, see Buck, 1990; Black and Yacoob, 1998; Pantic and Rothkrantz, 1999, 2000; Kanade et al., 2000; Tian et al., 2000; Kawato and Ohya, 2000 and Rosenberg et al., 2001.

FACS provides the referential guidelines for appropriately sub-dividing and encoding a facial image generated from a video recording, according to key ‘motion reference points’, defined by specific facial muscles known as Action Units (AUs, see Ekman and Friesen, 1978). There are 46 different locations of AUs for facial expression and 12 locations that account for head orientation and gaze. Two AUs from the FACS system are presented in Figure 3.4 (based on Ekman and Friesen, 1978).
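As a rough illustration of how AU-based codes might be stored once a video has been coded, the sketch below represents each coded event as an AU number with a start and end time, and then picks out the head-movement events. The data structure, the timings and the grouping of codes into a "head" set are assumptions for the example, not part of FACS itself; the handful of AU numbers and labels shown are those commonly listed for FACS and should be checked against Ekman and Friesen (1978) before being relied upon.

```python
from dataclasses import dataclass

# A small excerpt of AU codes and labels (illustrative, not the full scheme).
AU_LABELS = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    12: "lip corner puller",
    53: "head up",
    54: "head down",
}

# Assumption for this sketch: only these two codes count as head movement.
HEAD_CODES = {53, 54}


@dataclass(frozen=True)
class AUEvent:
    code: int     # AU number
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video


def head_events(events):
    """Keep only the events whose AU code is in the head-movement set."""
    return [e for e in events if e.code in HEAD_CODES]


# Invented timings for a short stretch of video.
EVENTS = [
    AUEvent(12, 0.40, 1.10),
    AUEvent(54, 1.20, 1.45),
    AUEvent(53, 1.45, 1.70),
    AUEvent(1, 2.00, 2.30),
]

head = head_events(EVENTS)
head_labels = [AU_LABELS[e.code] for e in head]
print(head_labels)
```

Time-stamping each event in this way is what would allow head movements to be aligned later with the words of a transcript.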

In document A multi-modal corpus approach to the analysis of backchanneling behaviour (Page 136-141)