Coding Reliability - Summary and Discussion

4.1.5 Coding Reliability

Annotation-based data can be problematic because they rest on the subjective judgements of the coders. The reliability of the annotation is therefore of vital importance for the significance of the results. To make research results replicable, it has to be shown that different people agree on the coding judgements on which the statistical analyses are based (Carletta, 1996). To make a database as sound as possible, coding decisions must be evaluated with respect to their reliability. Two kinds of agreement are distinguished, depending on whether one coder re-codes the same data (intra-rater agreement) or several coders annotate the same data (inter-rater agreement). The latter captures the reproducibility of the annotation and is the stricter kind of reliability, as it goes beyond intra-coder inconsistencies (Krippendorff, 1980).

Reliability Measures

Depending on the type and scale of the data, different statistical methods are available. A qualitative distinction has to be made between Type I and Type II ratings (Gwet, 2001). Type I measurements are those where the degree to which a rating is subject to human interpretation is well understood and the outcome is easily interpretable. As an example, Stegmann and Lücking (2005) cite a doctor measuring a patient’s blood pressure. The outcome displayed on the blood pressure gauge reflects the actual level of the patient’s blood pressure (or at least approximates it sufficiently), and other doctors will come to the same result. Type II data, in contrast, are subject to interpretation by the coder. Here Stegmann and Lücking (2005) give the example of a classification task in which raters have to determine, on the basis of data from a psychological questionnaire, the satisfaction level of various subjects by assigning them to categories of emotion such as ‘happy’ or ‘sad’. This difference has to be considered when evaluating the respective annotations: Type II ratings must be adjusted for chance-based agreement, whereas this is not necessary for Type I ratings.

The SaGA corpus comprises both types of annotation. The classification of gestures in terms of representation techniques, reference objects and dialogue context information is interpretive and therefore of Type II. The respective annotation labels are categories on a nominal scale. Descriptions of gesture form make up data of Type I. With one exception (handshape, see below), the labels for annotating a gesture performance lie on an ordinal scale. Accordingly, different methods are employed to evaluate annotations of representation techniques and context information on the one hand, and annotations of gesture form on the other.

As a chance-corrected coefficient for determining the level of agreement in Type II data, the first-order agreement coefficient AC1 developed by Gwet (2001) was chosen, since the widely used Kappa coefficient (Cohen, 1960) is often criticized for delivering counterintuitive results under certain configurations (kappa paradoxes; for a discussion see Stegmann and Lücking (2005)).
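To make the coefficient concrete, the following is a minimal sketch of AC1 for nominal labels, following the formulation commonly given for Gwet's coefficient: observed agreement is the average proportion of agreeing rater pairs per item, and chance agreement is the sum of p(1 - p) over the category prevalences p, divided by the number of categories minus one. The function name `gwet_ac1` and the input layout (one tuple of labels per item, one label per rater) are assumptions made for illustration; this is not the tooling used for the SaGA evaluation.

```python
from collections import Counter


def gwet_ac1(ratings, categories=None):
    """Gwet's first-order agreement coefficient AC1 (illustrative sketch).

    ratings: list of items, each a sequence with one nominal label per rater.
             Every item must be rated by the same number of raters (>= 2).
    categories: optional full list of categories; inferred from the data if None.
    """
    if categories is None:
        categories = sorted({label for item in ratings for label in item})
    n = len(ratings)        # number of rated items
    r = len(ratings[0])     # number of raters
    q = len(categories)     # number of categories
    if n == 0 or r < 2 or q < 2:
        raise ValueError("need at least one item, two raters and two categories")
    if any(len(item) != r for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_a = sum(
        sum(c[k] * (c[k] - 1) for k in categories) / (r * (r - 1))
        for c in counts
    ) / n

    # Prevalence of each category across all ratings.
    pi = {k: sum(c[k] for c in counts) / (n * r) for k in categories}

    # Chance agreement as defined for AC1.
    p_e = sum(pi[k] * (1 - pi[k]) for k in categories) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy example with two raters and four items (made-up data, not SaGA data).
toy = [("a", "a"), ("a", "a"), ("b", "b"), ("a", "b")]
print(round(gwet_ac1(toy), 3))  # 9/17, i.e. 0.529
```

Because the chance term shrinks when one category dominates, the coefficient stays high for near-unanimous ratings, which is the configuration in which Kappa is reported to behave paradoxically.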

Regarding the interpretation of agreement coefficients, different quality thresholds exist. The values above which the agreement of raters is judged acceptable range from 0.4 to the more stringent convention of 0.8 (cf. Artstein and Poesio, 2008). A frequently employed agreement level, also applied to the SaGA data, is 0.7 with an α-error of 0.05 and a β-error of 0.85 for Type II annotations.

In addition, to assess the extent of association between annotations of the Type I gesture morphology, an approach based on angle measures was employed for codings of directions and orientations. Since the disagreement between, e.g., ‘movement to the right’ and ‘movement to the right and slightly down’ is smaller than that between ‘movement to the right’ and ‘movement to the left’, comparing annotation labels merely for sameness would not capture the degree of spatial difference between them. This problem was addressed by translating the annotation labels into angular measures, which can be analyzed in terms of numeric differences.

Movement directions, palm orientation and BoH orientation were compared by calculating the angle between the two orientation vectors. For instance, there is an angle of 90° between ‘PTL’ and ‘PUP’, and an angle of 45° between ‘PTL’ and ‘PTL/PUP’. The maximal angle of 180° is present if the two vectors oppose each other (e.g. ‘PTL’ and ‘PTR’) and can be considered the worst match.
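The angle comparison can be sketched as follows. The coordinate axes assigned to ‘PTL’, ‘PTR’ and ‘PUP’, the treatment of a combined label such as ‘PTL/PUP’ as the sum of its component vectors, and the names `BASE_VECTORS`, `label_vector` and `angle_deg` are assumptions for illustration only; the text does not specify how labels were mapped to vectors.

```python
import math

# Hypothetical unit vectors for the three labels mentioned in the text.
# The axis convention (x: left/right, y: down/up, z: unused here) is an assumption.
BASE_VECTORS = {
    "PTL": (-1.0, 0.0, 0.0),
    "PTR": (1.0, 0.0, 0.0),
    "PUP": (0.0, 1.0, 0.0),
}


def label_vector(label):
    """Turn a label such as 'PTL' or 'PTL/PUP' into a direction vector."""
    parts = label.split("/")
    return tuple(sum(BASE_VECTORS[p][i] for p in parts) for i in range(3))


def angle_deg(label_a, label_b):
    """Angle in degrees between the directions encoded by two labels."""
    va, vb = label_vector(label_a), label_vector(label_b)
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(x * x for x in vb))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("label does not encode a direction")
    cos = sum(x * y for x, y in zip(va, vb)) / (norm_a * norm_b)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


for a, b in [("PTL", "PUP"), ("PTL", "PTL/PUP"), ("PTL", "PTR")]:
    print(a, b, round(angle_deg(a, b), 2))
```

Under these assumptions the three pairs reproduce the angles named above (90°, 45° and 180°), and identical labels yield 0°.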

Reliability Results

Inter-rater agreement was calculated on a sample of 477 gestures (∼10% of the data), which were classified independently by four annotators.

Type II data. The first-order agreement coefficient AC1 for the rating of gestures’ representation techniques was 0.784 with a confidence interval of (0.758, 0.81), based on independent classifications of the sample by four annotators. The proportion of agreement on gestures’ representation techniques, given that the agreement was not due to chance, was thus significantly greater than 0.7. This result meets the reliability level that was demanded at the outset.

The reliability of the annotations of reference objects and context information was assessed on the basis of two independent annotators. The agreement coefficient AC1 was 0.91 for the classification of reference objects, 0.95 for information structure, 0.86 for information state, and 0.88 for communicative goals. All values are collected in Table 4.4. In sum, the highly interpretive Type II data showed a reasonable degree of inter-rater reliability.

Type I data. The annotations that make up the Type I data of the SaGA corpus transcribe orientations and movement directions, as these have a clear spatial interpretation. The reliability of these data was assessed by angle-based measures. The smallest angular deviation is 2.36° for the movement direction of hand shapes, and the largest is 46.16° for BoH orientation. On average, the angular difference as a whole is 27° (with an average standard deviation of SD = 45). Given that the annotation categories resolve gesture space into ‘slices’ of 45° each, the average difference comes close to the theoretically undecidable mean value of 22.5°, i.e. half the width of a slice, where a direction lies exactly between two adjacent categories.

Evaluating the annotation of handshapes required special treatment, since the categories developed to classify the observed handshape comprise both Type I and Type II shares. On the one hand, there is a set of basic shapes derived from the ASL lexicon. On the other hand, these Type I labels are enhanced by Type II modifiers such as ‘loose’ or ‘spread’. Therefore, all modified handshapes were mapped onto their basic type and treated as Type I data. As a result, it was found that the four annotators agreed on 83% (AC1 = 0.9, to give the Type II statistic for comparison) of the handshapes within the reliability sample of gestures.
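The mapping step can be illustrated with a short sketch. The label format (`"B-loose"`, i.e. basic shape, hyphen, modifier), the use of pairwise agreement as the summary figure, and the names `basic_shape` and `pairwise_agreement` are all assumptions for illustration; the text does not state how the 83% figure was computed, and the toy data below are made up.

```python
from itertools import combinations


def basic_shape(label, sep="-"):
    """Strip a modifier from a handshape label, e.g. 'B-loose' -> 'B'.

    The '<basic><sep><modifier>' format is a hypothetical encoding.
    """
    return label.split(sep, 1)[0]


def pairwise_agreement(items):
    """Share of annotator pairs that agree on the basic handshape.

    items: list of tuples, one handshape label per annotator.
    """
    agree = 0
    total = 0
    for labels in items:
        basics = [basic_shape(label) for label in labels]
        for a, b in combinations(basics, 2):
            total += 1
            agree += a == b
    if total == 0:
        raise ValueError("need at least one item with two annotators")
    return agree / total


# Made-up example: four annotators, two gestures.
toy = [
    ("B", "B-loose", "B-spread", "B"),   # all agree after stripping modifiers
    ("C", "C", "5", "C-loose"),          # one annotator chose a different shape
]
print(pairwise_agreement(toy))  # 9 of 12 pairs agree -> 0.75
```

Stripping the modifiers first means that ‘B’ and ‘B-loose’ count as the same shape, so only differences in the basic type register as disagreement.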

Table 4.4: Reliability results for the annotation of the SaGA corpus.

                                              AC1     Angular Deviation (SD)
Gesture
    Representation Technique                  0.78
    Handedness                                0.92
    Handshape                                 0.90
    Palm Orientation                                  19.14° (1.92)
    BoH Orientation                                   20.66° (2.47)
    Wrist Movement Direction                          37.08° (6.5)
Discourse Context
    Thematization                             0.95
    Information State                         0.86
    Communicative Goal                        0.88
    Referent Object                           0.91

In sum, the evaluation of the secondary data of the SaGA corpus revealed a satisfactory degree of reliability. Chance-corrected agreement on Type II data surpassed the threshold of 0.7. Observed inter-rater agreement on Type I data resulted in angular values which, by and large, indicate only minor disagreement between annotators.

Hence, the SaGA corpus provides a reproducible database which can be exploited for empirically driven research.

In document The Production of Co-Speech Iconic Gestures: Empirical Study and Computational Simulation with Virtual Agents (Page 83-86)