Inter-coder reliability - Prompts for verbal recall

3. Methodology

3.6. Study 2

3.6.5.3. Inter-coder reliability

Once I had coded all of the data, 10 percent of all quotations were double-coded by a second coder (coder 2) and another 10 percent by a third coder (coder 3) to establish the reliability of the coding process. The two additional coders were language assessment specialists and university lecturers with many years of experience in listening test development. One of them held an MA in teaching English and an MA in language testing and was in the process of completing a PhD in language testing, while the other held an MA in teaching English, an MA in language testing, and a PhD in applied linguistics with a focus on language testing.

Due to the large number of codes, the two coders focussed on different parts of the coding scheme and were therefore assigned different quotations. Coder 2 was assigned quotations which I had initially coded as listening strategies. In addition, the data for coder 2 included quotations originally coded as test-management, as test-management strategies were sometimes evident in combination with a listening strategy (see discussion in Section 3.6.5.2 above), as well as the meta-commentary “different behaviour”. Coder 3, on the other hand, was assigned quotations originally coded as cognitive processes, test-taking strategies (including test-management and test-wiseness), anxiety, and the remaining meta-commentary codes which emerged during the coding (“reactivity” and “prefer double play”). In total, coder 2 focussed on 11 codes and coder 3 on 10 codes, as illustrated in Figure 10.

To get a representative sample of the data for the double-coding process, I made sure to include quotations from all participants, all tasks, both conditions, and all stages of recall for both coders. Each coder was sent their version of the coding scheme, including short descriptions of the codes and an example from the data for each code, and a coding document, which included the quotations to be coded and a column to enter the codes (see Appendix 7). They were asked to familiarise themselves with the coding scheme first and then assign a code to each of the quotations in the coding document.

Figure 10: Study 2: coding categories for double-coding for coder 2 and coder 3

[Figure: two panels, labelled coder 2 and coder 3]

Following the double-coding, I calculated inter-coder agreement between myself and the double-coders. To that end, I first transformed my original codings as well as the double-coders’ codings into nominal data by assigning 0 (code not applied) and 1 (code applied) for each code and each quotation. The data for all quotations were then entered in Microsoft Excel and Gwet’s AC2 was calculated separately for each double-coder, using the Excel add-in by Zaiontz (2019). Gwet’s AC2 was chosen as it is considered more robust than other inter-coder reliability coefficients such as Cohen’s kappa, Fleiss’s kappa, Conger’s kappa, or Krippendorff’s alpha (Quarfoot & Levine, 2016).
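The transformation into 0/1 data and the agreement coefficient can be illustrated with a minimal Python sketch. It is not the Excel add-in used in this study, and it only returns the point estimate, not the standard error. It assumes two coders and unweighted nominal categories, in which case Gwet’s AC2 with identity weights reduces to Gwet’s AC1. The function names, the code labels ("code_A", "code_B"), and the example ratings are hypothetical.

```python
from collections import Counter


def to_indicators(codes_per_quotation, scheme):
    """Flatten codings into 0/1 flags: one flag per quotation and per code."""
    return [1 if code in applied else 0
            for applied in codes_per_quotation
            for code in scheme]


def gwet_ac1(ratings_a, ratings_b, categories=(0, 1)):
    """Gwet's AC1 for two coders and nominal ratings (point estimate only)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both coders must rate the same non-empty set of units")
    n = len(ratings_a)
    q = len(categories)

    # Observed agreement: share of units on which both coders gave the same rating.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: based on the average propensity pi_k of each category.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = 0.0
    for k in categories:
        pi_k = (count_a[k] + count_b[k]) / (2 * n)
        p_e += pi_k * (1 - pi_k)
    p_e /= (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Hypothetical example: five quotations, a two-code scheme, two coders.
scheme = ["code_A", "code_B"]
researcher = [{"code_A"}, {"code_A"}, {"code_B"}, set(), set()]
second_coder = [{"code_A"}, {"code_A"}, set(), set(), set()]

flags_researcher = to_indicators(researcher, scheme)
flags_second = to_indicators(second_coder, scheme)
agreement = gwet_ac1(flags_researcher, flags_second)
```

In this hypothetical example the two coders disagree on one of the ten 0/1 flags, so the observed agreement is 0.9 and the chance agreement is 0.375, which yields a coefficient of 0.84.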

As a final step, the extent of agreement was determined according to the benchmarking procedure suggested by Gwet (2014, pp. 164–181). Interpreting the strength of agreement between coders from the reliability coefficient alone might mask the true level of agreement, as the associated error of measurement is not taken into account. Gwet therefore suggests using the standard error to calculate the coefficient’s membership probabilities for each range on a given benchmark scale. To classify the extent of agreement, the benchmark scale by Landis and Koch (1977) was used, which differentiates between poor (<0.00), slight (0.00-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), and almost perfect (0.81-1.00) agreement. The membership probabilities for each of these categories are reported.
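The logic of the benchmarking procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the add-in’s implementation: it assumes that the coefficient’s sampling distribution is approximately normal, it uses 0.0, 0.2, 0.4, 0.6, and 0.8 as cut points between the Landis and Koch ranges, and it treats the lowest and highest ranges as open-ended. The function and variable names are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Landis and Koch (1977) ranges; the open-ended outer ranges are an assumption.
BANDS = [
    ("poor", -math.inf, 0.0),
    ("slight", 0.0, 0.2),
    ("fair", 0.2, 0.4),
    ("moderate", 0.4, 0.6),
    ("substantial", 0.6, 0.8),
    ("almost perfect", 0.8, math.inf),
]


def membership_probabilities(coefficient, standard_error):
    """Probability that the agreement coefficient falls into each benchmark range."""
    probabilities = {}
    for label, lower, upper in BANDS:
        # Distance of the estimate from each bound, in standard-error units.
        z_lower = (coefficient - lower) / standard_error
        z_upper = (coefficient - upper) / standard_error
        probabilities[label] = normal_cdf(z_lower) - normal_cdf(z_upper)
    return probabilities


# Values reported for listening strategies in Table 10 (AC2 = 0.765, s.e. = 0.054).
listening = membership_probabilities(0.765, 0.054)
percentages = {label: round(100 * p) for label, p in listening.items()}
```

For a coefficient of 0.765 with a standard error of 0.054, this sketch gives roughly 74 percent for substantial and 26 percent for almost perfect agreement. A larger standard error spreads the probability over more ranges, which is why a coefficient estimated from few quotations is harder to classify.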

The results are displayed in Table 10 and Table 11. The tables include Gwet’s AC2 for each coding category and for the overall agreement with each coder, as well as the associated standard errors and the membership probabilities (in percent) in relation to Landis and Koch’s (1977) benchmark scale. The highest membership probability is highlighted in each row to make the results more immediately interpretable. As shown in the tables, inter-coder agreement was high. The overall agreement between coder 2 and myself was almost perfect (with a 93 percent probability) and between coder 3 and myself it was substantial (with a 62 percent probability) to almost perfect (38 percent probability). Agreement was also calculated separately for the four main response processes of interest to inspect whether certain code groups attracted more agreement than others. As shown in Table 10, for listening strategies agreement between coder 2 and myself was substantial (74 percent probability) to almost perfect (26 percent probability) and for test-taking strategies (only test-management) it was almost perfect (99 percent probability). For coder 3 (Table 11), agreement for cognitive processes was substantial (83 percent probability) to almost perfect (15 percent probability) and for test-taking strategies (test-management and test-wiseness) it was closer to almost perfect (84 percent probability) than substantial (16 percent probability). The agreement probabilities for anxiety are more scattered because there was only one coding category and only a small number of quotations received that code in the double-coding document (N=7), which resulted in a high standard error. Coder 3 and I agreed in 6 out of 7 cases that a quotation was related to anxiety, resulting in substantial (34 percent probability) to almost perfect (57 percent probability) agreement. Despite these high levels of inter-coder agreement, I discussed all quotations where there was disagreement with the two double-coders to reach a consensus decision for each case. I then individually double-checked all of my original codings in light of the discussions.

Table 10: Study 2: inter-coder agreement between the researcher and coder 2

Response processes       Gwet’s AC2   S.e.    Probability (in percent) for agreement to be
                                              moderate   substantial   almost perfect
listening strategies     0.765        0.054   0          74            26
test-taking strategies   0.935        0.054   0          1             99
overall                  0.864        0.035   0          7             93

Table 11: Study 2: inter-coder agreement between the researcher and coder 3

Response processes       Gwet’s AC2   S.e.    Probability (in percent) for agreement to be
                                              moderate   substantial   almost perfect
cognitive processes      0.733        0.063   2          83            15
test-taking strategies   0.822        0.071   0          16            84
anxiety                  0.831        0.171   9          34            57
