4. Results Study 1
4.3. Answer change during the second play
4.3.2. Effects of answer change
It was also investigated whether students benefitted from changing their answers during the second play, along with the types of changes made and the frequency of each type. The analysis followed the categories outlined in Table 16 above. To allow a comparison of the two task types, the two NF-specific categories in Table 16 (categories 2 and 3) were added to NF category 8. The results are displayed in Table 18. The table includes percentages for each category and task type, chi-square statistics to explore potentially significant differences between the two task types, and effect sizes (Cohen’s h) for categories with statistically significant differences. The chi-square statistics were calculated separately for each category based on a comparison of the proportions for each task type, using the website medcalc.org (Schoonjans, 2018), which uses the chi-squared test recommended by Campbell (2007) and Richardson (2011) and the confidence interval calculation recommended by Altman, Machin, Bryant, and Gardner (2000). Effect sizes were calculated manually in SPSS (version 24 for Mac).
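The per-category comparison of two proportions described above can be illustrated with a minimal sketch. It is not the procedure run on medcalc.org or in SPSS: it assumes the "N-1" chi-squared test described by Campbell (2007) and the textbook definition of Cohen’s h (J. Cohen, 1988), since the text does not spell out the effect size computation. The function names and the counts in the usage lines are hypothetical and are not taken from the study.

```python
import math


def n_minus_1_chi_square(x1, n1, x2, n2):
    """'N-1' chi-squared test for two independent proportions (1 df).

    x1 of n1 answers fall into the category for task type 1,
    x2 of n2 answers for task type 2. Returns (chi2, p_value).
    """
    n = n1 + n2
    # Cells of the 2x2 table: in category / not in category, per task type.
    a, b = x1, n1 - x1
    c, d = x2, n2 - x2
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total of the 2x2 table is zero")
    # Pearson's chi-squared statistic multiplied by (N - 1) / N.
    chi2 = (n - 1) * (a * d - b * c) ** 2 / denom
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def cohens_h(p1, p2):
    """Textbook Cohen's h: absolute difference of arcsine-transformed proportions."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))


# Hypothetical counts for illustration only (not the study's data).
chi2, p = n_minus_1_chi_square(203, 1000, 179, 1000)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}, h = {cohens_h(0.203, 0.179):.2f}")
```

The test needs the number of answers per task type, not only the percentages, because the same percentage difference yields a larger statistic with more answers.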
Starting with beneficial answer changes at the top of the table, it can be seen that 20.3 percent of all MC answers were left blank during the first play but were correct after the second play (category 0), compared to 17.9 percent of NF answers. There was also a difference in frequency between the two task types for answers which were incorrect after the first play but correct after the second (category 1), with 5.7 percent for the MC tasks and 3.6 percent for the NF tasks. As shown in the table, the differences between the two task types are statistically significant for both of these categories, with small (category 1) to medium (category 0) effect sizes (see J. Cohen, 1988). When comparing the subtotals for beneficial answer changes, it can be seen that students benefitted significantly more from the second play for MC tasks (25.9 percent of all answers were changed beneficially) than for NF tasks (21.5 percent of all answers were changed beneficially) (χ²(1, N = 1,068) = 11.93, p < .001), with a small to medium effect size (Cohen’s h = 0.22). This confirms the findings presented in Section 4.2, where it was shown that students benefitted more from a second play in MC tasks than in NF tasks in terms of average item difficulty.
Table 18: Study 1: frequencies and chi-square statistics for the answer change categories across the two task types

Change or   Benef. or    Category   MC %   NF %   Diff. %   95% CI        χ²      DF   p         Cohen’s h
no change   not benef.
change      benef.       0          20.3   17.9   2.4       0.1 to 4.8    4.15    1    0.04*     0.28
                         1          5.7    3.6    2.1       0.9 to 3.4    11.45   1    0.00***   0.05
                         subtotal   25.9   21.5   4.4       1.9 to 6.9    11.93   1    0.00***   0.22
            not benef.   4          11.3   10.3   1.0       -0.8 to 2.9   1.15    1    0.28
                         5          1.5    2.6    1.1       0.3 to 1.9    6.31    1    0.01*     0.91
                         6          1.9    1.1    0.8       0.1 to 1.6    5.04    1    0.03*     0.13
                         subtotal   14.7   14.1   0.6       -1.5 to 2.7   0.32    1    0.57
no change                7          0.5    5.9    5.4       4.5 to 6.4    89.21   1    0.00***   2.22
                         8          48.2   50.4   2.2       -0.8 to 5.1   2.13    1    0.14
                         9          10.7   7.8    2.9       1.2 to 4.7    11.36   1    0.00***   0.10
                         subtotal   59.4   64.2   4.8       2.2 to 8.0    11.41   1    0.00***   0.48
unclear                             -      0.3    -         -             -       -    -
total                               100    100
For answer changes that were not beneficial, the two task types performed similarly, with no statistically significant difference overall: 14.7 percent of all answers were changed to no benefit in the MC tasks compared to 14.1 percent in the NF tasks. However, there were differences for two of the three categories within this type of answer change. While category 4 (no answer after first, incorrect after second) performed similarly between the two task types, with 11.3 percent for MC and 10.3 percent for NF, category 5 (incorrect after first, changed after second but still incorrect) was observed significantly more often for NF tasks (2.6 percent) than MC tasks (1.5 percent) with a large effect size (Cohen’s h = 0.91), presumably because students have more opportunities to change open answers than MC answers. Similarly, there was a significant difference between the two task types for category 6 (correct after first, incorrect after second), in that students erroneously changed their correct answers significantly more often in MC tasks than in NF tasks during the second play (1.9 percent compared to 1.1 percent). The effect size, however, was small (Cohen’s h = 0.13). This may again have to do with the nature of the task type: when students are not sure about their answer, it is very easy for them to choose a different answer to an MC question, but it takes more effort to change an open answer to an NF question.
Interesting findings also emerged for the three categories related to no changes at the bottom of Table 18. For category 7 (no answer in either first or second), a significant difference between the two task types with a large effect size (Cohen’s h = 2.22) was observed: only 0.5 percent of all MC answers were left blank after the second listening, compared to 5.9 percent of NF answers. This seems to be evidence that MC tasks may be more prone to guessing than NF tasks. There was no significant difference between the two task types for category 8 (correct after first, no changes after second), likely because students who are sure of their answer during the first listening do not change it during the second, regardless of test format. In contrast, category 9 (incorrect after first, no changes after second) was observed significantly more often for MC tasks (10.7 percent of all answers) than NF tasks (7.8 percent of all answers), although the effect size was again small (Cohen’s h = 0.10). This may confirm the task type effect observed for category 5 outlined above, i.e. that test takers have more opportunities to change open answers during the second play compared to MC answers. Overall, as outlined in the last section, students changed their answers significantly less often for NF tasks than MC tasks, with a medium effect size.
As a last step, the two NF-specific answer change categories were examined in detail (categories 2 and 3 in Table 16). These answer changes were regarded as beneficial, as they are evidence of students understanding more of the listening text during the second play, or at least an indication that students have more opportunities to show their understanding in the second play than in the first. The category frequencies are shown in Table 19.
For 6.4 percent of all NF answers, students added more details during the second play, whereas for 2.5 percent they chose a different correct answer (for questions which allowed more than one answer).
Table 19: Study 1: frequencies for the two NF-specific answer change categories

Category                                                                 N     %
2  correct after first, more details after second and "more correct"     175   6.4
3  correct after first, different correct answer after second            69    2.5
total                                                                    244   8.9
In summary, the analysis of answer changes in the second play of double play revealed a number of benefits for test takers. Candidates changed their answers in about 40 percent of all cases for the MC tasks and 35 percent of all cases for the NF tasks. Out of these changes, about 60 percent resulted in benefits for the test takers, in that they were able to change their missing or incorrect answer to a correct answer. In the case of NF tasks, candidates were also able to add more details or choose a different correct answer in a number of cases. This indicates that for about 20 to 25 percent of all questions participants understood more of the listening text during the second play, or at least that they had more opportunities to showcase their understanding.
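The rounded figures in this summary can be traced back to the subtotals in Table 18. The short sketch below only redoes that arithmetic; the function and variable names are illustrative assumptions. The NF-specific categories 2 and 3 are not included here, because they were counted under category 8 for the comparison of task types.

```python
# Subtotals (percent of all answers) as reported in Table 18.
SUBTOTALS = {
    "MC": {"beneficial": 25.9, "not_beneficial": 14.7},
    "NF": {"beneficial": 21.5, "not_beneficial": 14.1},
}


def change_summary(beneficial, not_beneficial):
    """Return (percent of answers changed, percent of those changes that were beneficial)."""
    changed = beneficial + not_beneficial
    beneficial_share = 100 * beneficial / changed
    return round(changed, 1), round(beneficial_share, 1)


for task, row in SUBTOTALS.items():
    changed, share = change_summary(row["beneficial"], row["not_beneficial"])
    print(f"{task}: {changed}% of answers changed, {share}% of changes beneficial")
```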
4.4. Questionnaire 1: strategies and anxiety
Questionnaire 1 targeted test-taking strategies and listening strategies as well as test-taking anxiety and listening anxiety. Participants had to indicate their level of agreement with 25 statements (24 for the NF tasks) on a four-point Likert scale. They completed the questionnaire twice: once after the single play condition and once after the double play condition (see Section 3.5.3).
The questionnaire data were analysed in three stages. First, descriptive statistics were calculated for each question. Then, an exploratory factor analysis was performed to group questionnaire responses. In a final step, differences in test takers’ strategic behaviour and anxiety levels between the two conditions were explored by means of Wilcoxon signed-rank tests.
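To illustrate the final stage, the following is a minimal sketch of a two-sided Wilcoxon signed-rank test for paired Likert responses. It is not the analysis actually run for the study: it assumes the large-sample normal approximation without tie or continuity correction, and the function name and the responses in the usage lines are hypothetical.

```python
import math


def wilcoxon_signed_rank(first, second):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    `first` and `second` hold paired responses of the same participants.
    Zero differences are dropped; tied absolute differences share the
    average of the ranks they occupy. Returns (w_plus, w_minus, z, p).
    """
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Under the null hypothesis W+ has this mean and standard deviation.
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w_plus, w_minus, z, p


# Hypothetical four-point Likert responses of eight participants to one statement.
single_play = [3, 4, 2, 3, 4, 3, 2, 4]
double_play = [2, 3, 2, 2, 3, 3, 1, 3]
print(wilcoxon_signed_rank(single_play, double_play))
```

With few non-zero differences the normal approximation is rough, so an exact test would be preferable for small samples.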