Measuring Inter-coder Reliability - Assessing Inter-Coder Agreement

3.7 Assessing Inter-Coder Agreement

3.7.3 Measuring Inter-coder Reliability

This section explains how the reliability tests were applied. Before describing in detail how inter-coder agreement was measured, I describe the two procedures used to judge it. The first is a pairwise comparison that matches error tags by location. The second is a comparison of individual error tags.

In the first procedure, agreement was checked by pairwise comparison of the error tags. Two error tags were compared when they occupied the same location: the first error tag assigned by Coder 1 was compared with the first error tag assigned by Coder 2, and so on. For instance, in Table 3.20, agreement was achieved for Tag 1 and Tag 2. No agreement was recorded for Tag 3 and Tag 4, even though both coders assigned the same error tags, because the tags were in opposite locations. In the Agreement row, Yes means full agreement and No means full disagreement.
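The location-by-location comparison described above can be sketched in Python as follows. This is only an illustration, not the procedure's actual implementation: the function name `pairwise_agreement` and the tag labels T1 to T4 are placeholders, and the example data merely mirrors the situation described for Table 3.20 (the same tags for Tag 3 and Tag 4, but in opposite locations).

```python
from itertools import zip_longest

def pairwise_agreement(coder1_tags, coder2_tags):
    """Compare error tags slot by slot and return "Yes" or "No" per slot."""
    results = []
    # zip_longest pads the shorter sequence with None, so an empty slot never matches
    for tag1, tag2 in zip_longest(coder1_tags, coder2_tags):
        agree = tag1 is not None and tag1 == tag2
        results.append("Yes" if agree else "No")
    return results

# Hypothetical annotation: same tags in slots 3 and 4, but in opposite locations
coder1 = ["T1", "T2", "T3", "T4"]
coder2 = ["T1", "T2", "T4", "T3"]
print(pairwise_agreement(coder1, coder2))  # ['Yes', 'Yes', 'No', 'No']
```

Because the comparison is strictly positional, a single skipped tag shifts every later tag out of its slot and turns all of the following comparisons into disagreements.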

Table 3.21 shows an example of the coders' annotation. Coder 2 omits one tag, T2, which causes disagreement from Tag 2 through Tag 4. This may be due to the human mistake of skipping a tag. The problem can be solved by rearranging the sequence of tags. For example, Coder 2's T3 and T4 tags are shifted forward one place. The result of the rearrangement is depicted in Table 3.22. In both tables, a dash marks an empty slot.

Table 3.21: Sequence of error tags.

            Tag 1   Tag 2   Tag 3   Tag 4
Coder 1     T1      T2      T3      T4
Coder 2     T1      T3      T4      -
Agreement   Yes     No      No      No

Table 3.22: Error tags realignment.

            Tag 1   Tag 2   Tag 3   Tag 4
Coder 1     T1      T2      T3      T4
Coder 2     T1      -       T3      T4

Tag misalignment is also an issue addressed by Michaud (2002). Michaud applies a realignment algorithm (Smith and Waterman, 1981), imposing a penalty for each realignment that was needed. There are two reasons why I did not perform any realignment. Firstly, if realignment is required, the question arises of whose error tags should be moved. In Table 3.21 again, if Coder 1 is chosen, T3 and T4 are moved backward one place. As a result, T2 is deleted, and Coder 1's error annotation no longer corresponds to the errors that occur in the respective utterance. Nevertheless, this problem can be avoided if only Coder 2's T3 and T4 tags are brought forward one location.

The second reason arises when both coders have assigned the maximum number of error tags. As shown in Table 3.23, if Coder 2's T4 and T5 are shifted forward one location, the total number of error tags exceeds four. If T5 is removed, the error annotation again does not tally with the errors that occur in the utterance. Therefore, I decided not to apply tag realignment. Furthermore, only five such cases occurred in my error annotation.
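The two situations discussed above can be illustrated with a short Python sketch. It is not a realignment algorithm such as the one Michaud uses, and no realignment was applied in this study; it only shows what shifting a coder's tags forward one place does. The helper `shift_forward` and the constant `MAX_TAGS` are hypothetical names, and the limit of four tags is taken from the statement that a shifted sequence would exceed four.

```python
MAX_TAGS = 4  # assumed limit, based on the remark that shifting would exceed four tags

def shift_forward(tags, position, max_tags=MAX_TAGS):
    """Open an empty slot (None) at `position`, moving later tags one place forward."""
    shifted = list(tags[:position]) + [None] + list(tags[position:])
    if len(shifted) > max_tags:
        raise ValueError("realignment would exceed the maximum number of error tags")
    return shifted

# Table 3.21 to Table 3.22: Coder 2's T3 and T4 move forward one place
print(shift_forward(["T1", "T3", "T4"], 1))  # ['T1', None, 'T3', 'T4']

# Table 3.23: Coder 2 already has four tags, so T4 and T5 cannot be shifted
try:
    shift_forward(["T1", "T2", "T4", "T5"], 2)
except ValueError as err:
    print(err)
```

In the second call the only way to stay within the limit would be to drop T5, which is exactly the loss of information that motivated leaving the tags unaligned.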

Table 3.23: Limited number of error tags.

            Tag 1   Tag 2   Tag 3   Tag 4
Coder 1     T1      T2      T3      T4
Coder 2     T1      T2      T4      T5

Table 3.24: Equivalent predicates in the same sequence order.

            Tag 1    Tag 2
Coder 1     sva(X)   ins(X)
Coder 2     sva(X)   ins(X)
Agreement   Yes      Yes

Now I explain the second procedure for judging inter-coder agreement, which involves matching individual error tags. As stated before, each error tag is represented in predicate form, and each predicate has up to two arguments. There are 15 predicates to choose from; among them, two predicates take one argument and three take two arguments. More than twenty linguistic forms are available for identifying the arguments (see Table 3.10 on page 84 and the Tense Errors section on page 81). Because of the many predicate types and their respective arguments, I decided to assess inter-coder agreement at two levels, as outlined below:

Level 1: Individual error tag predicates are compared, without looking at their arguments.

Level 2: For each agreed pair of predicates, the arguments of the pair are compared.

In Level 1, the comparison was based only on the predicates of the error tags. Full agreement was recorded if both predicates matched, even if their arguments were different, as shown in Table 3.24.

When the predicates did not match, this was assessed as full disagreement, as shown in Table 3.25. Another case of disagreement arose when both coders annotated the same error tags but in opposite locations, as shown in Table 3.26.

A further case of disagreement arose when a coder skipped at least one tag. An example is shown in Table 3.27. Full agreement was judged for Tag 1 and Tag 2 but not for Tag 3 and Tag 4. Even though Coder 2 assigned only three tags, matching was still carried out in the Tag 4 column.
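The Level 1 rules can be summarised in a small Python sketch. The functions `predicate` and `level1_agreement` are hypothetical names, and I assume here that a tag is written as a string such as `sva(have)`, with the predicate name preceding the opening parenthesis. The arguments in the first example (`be`, `adj`) are chosen only for illustration.

```python
from itertools import zip_longest

def predicate(tag):
    """Return the predicate name of a tag, e.g. 'sva' for 'sva(have)'."""
    return tag.split("(", 1)[0].strip()

def level1_agreement(coder1_tags, coder2_tags):
    """Level 1: compare predicates slot by slot, ignoring their arguments."""
    results = []
    for tag1, tag2 in zip_longest(coder1_tags, coder2_tags):
        # A slot left empty by one coder (None) counts as disagreement
        agree = None not in (tag1, tag2) and predicate(tag1) == predicate(tag2)
        results.append("Yes" if agree else "No")
    return results

# Matching predicates agree at Level 1 even when the arguments differ
print(level1_agreement(["sva(have)", "ins(noun)"], ["sva(be)", "ins(adj)"]))
# ['Yes', 'Yes']

# Table 3.26: the same tags in opposite locations
print(level1_agreement(["del(X)", "ins(X)"], ["ins(X)", "del(X)"]))
# ['No', 'No']

# Table 3.27: Coder 2 skipped a tag, yet the Tag 4 column is still compared
print(level1_agreement(["del(X)", "del(X)", "sva(X)", "ins(X)"],
                       ["del(X)", "del(X)", "ins(X)"]))
# ['Yes', 'Yes', 'No', 'No']
```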

Table 3.25: Different error tags annotation.

            Tag 1
Coder 1     det-n-ag
Coder 2     del(X)
Agreement   No

Table 3.26: Equivalent tags but in different order.

            Tag 1    Tag 2
Coder 1     del(X)   ins(X)
Coder 2     ins(X)   del(X)
Agreement   No       No

Table 3.27: Missing tags.

            Tag 1    Tag 2    Tag 3    Tag 4
Coder 1     del(X)   del(X)   sva(X)   ins(X)
Coder 2     del(X)   del(X)   ins(X)   -

Table 3.28: Agreement in arguments of predicate.

            Tag 1       Tag 2       Tag 3
Coder 1     sva(have)   ins(noun)   del(noun)
Coder 2     sva(have)   ins(noun)   del(adj)
Agreement   Yes         Yes         No

I now describe Level 2 of the agreement test. At this level, I calculated the agreement of arguments for each predicate on which both coders agreed. The error tag predicates that take arguments are del(X), ins(X), sva(X), subst-with(X,Y), tense-error(A,B), and transp(X,Y). An example of full agreement and full disagreement is shown in Table 3.28.

During this measurement, some cases of partial agreement occurred between the two coders. Consider the example below:

Coder 1: ins(will)
Coder 2: ins(modal-aux)

Coder 1 tagged ins(will) and Coder 2 tagged ins(modal-aux). This was considered partial agreement because will is one type of modal auxiliary. Although both arguments are included in the error classification scheme, modal-aux refers to the other modal auxiliary types, such as can and may.

Another case of partial agreement can occur in tense-err(X,Y) error tags. For instance,

Coder 1: tense-err(past,pres)
Coder 2: tense-err(past,progr)

Since one of the arguments matches, this was assessed as a case of partial agreement.

To measure inter-rater agreement for my annotation scheme, I applied the α reliability test. α can be applied because it caters for different levels of agreement, as noted earlier. The weight is indicated by a distance metric, δ. In this study, I assigned δ=0.5 for the above two types of partial agreement. For full agreement, δ=0, and for full disagreement, δ=1. A summary of the distance metric values for the α test is shown in Table 3.29.

Table 3.29: Distance metric used in the α test.

Value   Description                                    Example
0       Coder 1's tag = Coder 2's tag                  Coder 1: ins(noun)
                                                       Coder 2: ins(noun)
0.5     1) The coders' arguments are modal-aux         Coder 1: del(modal-aux)
           and will (or vice versa), or                Coder 2: del(will)
           inf-mrkr-to and to (or vice versa)
        2) The 1st arguments of the coders' tags,      Coder 1: tense-err(progr,pres)
           or their 2nd arguments, are the same        Coder 2: tense-err(progr,inf)
1       Coder 1's tag ≠ Coder 2's tag                  Coder 1: del(be)
                                                       Coder 2: subst-with(be,will)
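The distance metric of Table 3.29 and its use in an α coefficient can be sketched in Python as below. Several points are assumptions rather than statements from this section: I take α to be Krippendorff's α for two coders with no missing tags, computed as 1 − D_o/D_e; δ is used directly as the difference function (it is not squared); and the partial-agreement rule for two-argument tags is applied to any two-argument predicate, although the text illustrates it only with tense-err. The names `RELATED`, `parse`, `delta` and `alpha` are hypothetical.

```python
from itertools import permutations

# Argument pairs treated as partial matches in Table 3.29
RELATED = {frozenset(("will", "modal-aux")), frozenset(("to", "inf-mrkr-to"))}

def parse(tag):
    """Split 'pred(a,b)' into ('pred', ('a', 'b')); a tag without arguments gives ()."""
    name, _, rest = tag.partition("(")
    args = tuple(a.strip() for a in rest.rstrip(")").split(",")) if rest else ()
    return name.strip(), args

def delta(tag1, tag2):
    """Distance between two tags: 0 (full agreement), 0.5 (partial), 1 (disagreement)."""
    pred1, args1 = parse(tag1)
    pred2, args2 = parse(tag2)
    if (pred1, args1) == (pred2, args2):
        return 0.0
    if pred1 == pred2 and len(args1) == len(args2):
        # Case 1: a specific form against its general class, e.g. will / modal-aux
        if len(args1) == 1 and frozenset((args1[0], args2[0])) in RELATED:
            return 0.5
        # Case 2 (assumed to hold for any two-argument tag): one argument matches
        if len(args1) == 2 and (args1[0] == args2[0] or args1[1] == args2[1]):
            return 0.5
    return 1.0

def alpha(coder1_tags, coder2_tags):
    """Alpha = 1 - D_o / D_e for two coders who tagged the same slots."""
    # Observed disagreement: mean distance within each slot
    observed = sum(delta(a, b) for a, b in zip(coder1_tags, coder2_tags)) / len(coder1_tags)
    # Expected disagreement: mean distance over all ordered pairs of pooled tags
    pooled = list(coder1_tags) + list(coder2_tags)
    pairs = list(permutations(pooled, 2))
    expected = sum(delta(a, b) for a, b in pairs) / len(pairs)
    if expected == 0:
        raise ValueError("alpha is undefined when all tags are identical")
    return 1.0 - observed / expected

# Rows of Table 3.29
print(delta("ins(noun)", "ins(noun)"))                          # 0.0
print(delta("del(modal-aux)", "del(will)"))                     # 0.5
print(delta("tense-err(progr,pres)", "tense-err(progr,inf)"))   # 0.5
print(delta("del(be)", "subst-with(be,will)"))                  # 1.0

# Hypothetical annotation mixing partial agreement and full disagreement
print(alpha(["ins(will)", "tense-err(past,pres)", "del(noun)"],
            ["ins(modal-aux)", "tense-err(past,progr)", "del(adj)"]))
```

With δ=0.5, a partial match contributes half as much to the observed disagreement as a full mismatch, which is how the coefficient distinguishes between levels of agreement.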

In document A Statistical Model of Error Correction for Computer Assisted Language Learning Systems (Page 118-123)