
6.3 Results and Discussion: AMS Version 1 CTT study

6.3.2 Difficulty and dynamic difficulty

Two versions of the difficulty statistic are calculated here: the difficulty and the dynamic difficulty, as previously defined in Subsection 6.2.2. Data pertaining to these two types of difficulty are presented in Table 6.1 and Figure 6.5 below. Note that the data used for these calculations were the Version 1 AMS responses marked by the UHM. Further note that in Table 6.1 and subsequent tables in the thesis, FRQ denotes that the question was free-response; MRQ denotes that the question was multiple-response; and MCQ denotes that the question was multiple-choice.
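The distinction between the two statistics can be illustrated with a minimal sketch. The formal definitions are those of Subsection 6.2.2 and are not reproduced here; the sketch assumes, based on the discussion in this section, that the dynamic difficulty is the proportion of correct answers among all respondents who attempted a given question, whereas the difficulty is the same proportion restricted to respondents who attempted every question. Dichotomous (0 or 1) scoring, the function names and the toy data are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# One row per test-taker, one entry per question:
# 1 = correct, 0 = incorrect, None = question not attempted.
Responses = Sequence[Sequence[Optional[int]]]


def dynamic_difficulty(responses: Responses, q: int) -> float:
    """Proportion correct among everyone who attempted question q."""
    scores = [row[q] for row in responses if row[q] is not None]
    return sum(scores) / len(scores)


def difficulty(responses: Responses, q: int) -> float:
    """Proportion correct on question q among complete attempts only."""
    complete = [row for row in responses if all(s is not None for s in row)]
    return sum(row[q] for row in complete) / len(complete)


# Hypothetical toy data: three complete attempts and one partial attempt.
toy = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [0, None, None],  # gave up after the first question
]

print(dynamic_difficulty(toy, 0))  # all four test-takers contribute
print(difficulty(toy, 0))          # only the three complete attempts contribute
```

In the toy data the partial attempt lowers the dynamic difficulty of the first question below its difficulty, while the two statistics coincide for the questions that only complete attempts reached. Both functions raise a division error if no respondent qualifies, which a real analysis would need to handle.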

Figure 6.5: Graph showing the dynamic difficulty (blue) and difficulty (orange) of each question on Version 1 of the AMS. The red horizontal lines indicate the lower and upper bounds of the acceptable range of values for the difficulty; the blue horizontal line indicates the mean value of the difficulty; and the green horizontal line indicates the mean value of the dynamic difficulty. Note that higher values indicate easier items, whereas lower values indicate harder items.

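| Question | Question type | Dynamic Difficulty | Difficulty |
|----------|---------------|--------------------|------------|
| Q1       | FRQ           | 0.75               | 0.76       |
| Q2       | FRQ           | 0.52               | 0.52       |
| Q3       | FRQ           | 0.95*              | 0.97*      |
| Q4       | FRQ           | 0.70               | 0.70       |
| Q5       | FRQ           | 0.41               | 0.43       |
| Q6       | MRQ           | 0.36               | 0.39       |
| Q7       | MRQ           | 0.42               | 0.45       |
| Q8       | MCQ           | 0.85               | 0.88       |
| Q9       | MCQ           | 0.72               | 0.74       |
| Q10      | MCQ           | 0.70               | 0.71       |
| Q11      | FRQ           | 0.61               | 0.64       |
| Q12      | MCQ           | 0.75               | 0.77       |
| Q13      | FRQ           | 0.39               | 0.41       |
| Q14      | MCQ           | 0.83               | 0.84       |
| Q15      | MCQ           | 0.56               | 0.57       |
| Q16      | MCQ           | 0.63               | 0.63       |
| Q17      | FRQ           | 0.38               | 0.37       |
| Q18      | MCQ           | 0.74               | 0.74       |
| Q19      | FRQ           | 0.68               | 0.69       |
| Q20      | FRQ           | 0.82               | 0.82       |
| Q21      | MRQ           | 0.56               | 0.57       |
| Q22      | FRQ           | 0.55               | 0.57       |
| Q23      | FRQ           | 0.69               | 0.70       |
| Q24      | MCQ           | 0.53               | 0.54       |
| Q25      | FRQ           | 0.75               | 0.74       |
| Q26      | MCQ           | 0.58               | 0.57       |
| Q27      | FRQ           | 0.80               | 0.80       |
| Q28      | MCQ           | 0.57               | 0.58       |
| Q29      | FRQ           | 0.81               | 0.81       |
| Q30      | FRQ           | 0.82               | 0.82       |
| Q31      | FRQ           | 0.61               | 0.61       |
| Q32      | FRQ           | 0.77               | 0.77       |
| Q33      | MRQ           | 0.49               | 0.49       |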

Table 6.1: Table showing the dynamic difficulty and difficulty of each question on Version 1 of the AMS. Note that values marked with an asterisk were identified as being problematic, and this convention is applied throughout this thesis.

Table 6.1 and Figure 6.5 show that the dynamic difficulty was larger than the difficulty for Q17, Q25 and Q26. This means that the respondents who attempted these questions found them easier on the whole than did the respondents who attempted all of the questions. The dynamic difficulty was equal to the difficulty for Q2, Q4, Q16, Q18, Q20, Q27 and Q29, meaning that these questions were of the same difficulty for test-takers who submitted partially complete attempts and for test-takers who submitted complete attempts. Further, the dynamic difficulty was also equal to the difficulty for Q30, Q31, Q32 and Q33; this occurred because, in these cases, the total number of respondents who attempted these questions was equal to the total number of respondents who attempted all of the questions. The dynamic difficulty was smaller than the difficulty for the other questions, so the respondents who attempted these questions found them harder on the whole than did the respondents who attempted all of the questions.
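The grouping described above can be checked directly against the published values. The following is a minimal sketch that transcribes the two numeric columns of Table 6.1 and sorts the questions by the sign of the difference; the names `TABLE_6_1` and `compare` are hypothetical. Because it works on values rounded to two decimal places, it can only reproduce the comparison as reported, not the underlying unrounded statistics.

```python
# (dynamic difficulty, difficulty) for Q1..Q33, transcribed from Table 6.1.
TABLE_6_1 = [
    (0.75, 0.76), (0.52, 0.52), (0.95, 0.97), (0.70, 0.70), (0.41, 0.43),
    (0.36, 0.39), (0.42, 0.45), (0.85, 0.88), (0.72, 0.74), (0.70, 0.71),
    (0.61, 0.64), (0.75, 0.77), (0.39, 0.41), (0.83, 0.84), (0.56, 0.57),
    (0.63, 0.63), (0.38, 0.37), (0.74, 0.74), (0.68, 0.69), (0.82, 0.82),
    (0.56, 0.57), (0.55, 0.57), (0.69, 0.70), (0.53, 0.54), (0.75, 0.74),
    (0.58, 0.57), (0.80, 0.80), (0.57, 0.58), (0.81, 0.81), (0.82, 0.82),
    (0.61, 0.61), (0.77, 0.77), (0.49, 0.49),
]


def compare(table):
    """Group question labels by how the dynamic difficulty compares."""
    groups = {"larger": [], "equal": [], "smaller": []}
    for number, (dynamic, static) in enumerate(table, start=1):
        if dynamic > static:
            key = "larger"
        elif dynamic == static:
            key = "equal"
        else:
            key = "smaller"
        groups[key].append(f"Q{number}")
    return groups


groups = compare(TABLE_6_1)
for key, questions in groups.items():
    print(key, len(questions), questions)
```

The "larger" group contains Q17, Q25 and Q26, the "equal" group contains the eleven questions listed in the text, and the remaining nineteen questions fall into the "smaller" group.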

In general, the dynamic difficulty is expected to be less than or equal to the difficulty. This is because respondents of lower ability may be more likely to give up on the test at some point during it, and thus not submit a complete attempt. This trend was observed in the majority of the questions. In the cases where the opposite trend was observed, further explanation was required.

Cases where the dynamic difficulty was greater than the difficulty

In the cases of Q25 and Q26, the number of test-takers who answered each of these questions was 255, whereas the number of test-takers who answered all of the questions was 254. As a result, it would be expected that the values of the dynamic difficulty and difficulty in these two questions would be slightly different, because the extra test-taker’s score contributes a small amount to the dynamic difficulty through a stochastic effect. In contrast, the number of test-takers who answered Q17 was 276, which was 22 test-takers more than those who answered all of the questions. As a result, the above explanation based upon a stochastic effect could not be applied to explain why the dynamic difficulty was greater than the difficulty on Q17.
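The size of the effect that additional test-takers can have may be made explicit with a short worked bound. This is a sketch under assumptions rather than part of the original analysis: it assumes that every score lies between 0 and 1, that the difficulty is the mean score $p$ of the $n$ test-takers with complete attempts (total score $k$), and that the dynamic difficulty adds $m$ further test-takers whose total score on the question is $j$. The symbols $n$, $m$, $k$, $j$, $p$ and $p_{\mathrm{dyn}}$ are introduced here for illustration only.

```latex
\begin{align*}
p &= \frac{k}{n}, \qquad p_{\mathrm{dyn}} = \frac{k + j}{n + m}, \qquad 0 \le j \le m, \\
p_{\mathrm{dyn}} - p &= \frac{nj - mk}{n(n + m)} = \frac{m}{n + m}\left(\frac{j}{m} - p\right), \\
\left| p_{\mathrm{dyn}} - p \right| &\le \frac{m}{n + m}.
\end{align*}
```

With $n = 254$ and $m = 1$ (Q25 and Q26), the bound is $1/255 \approx 0.004$, which can change a value rounded to two decimal places by at most one unit in the last place. With $m = 22$ (Q17), the bound is $22/276 \approx 0.08$, and under these assumptions the shift is positive exactly when the 22 additional test-takers scored, on average, above $p$ on that question.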

Q17 of Version 1 of the AMS was adapted from Q15 of the original FCI, and it tested understanding of Newton’s Third Law. The AMS version of the question is shown in Figure 6.6 below. Q15 of the FCI is known in the literature to be a difficult question (Poutot and Blandin, 2015), and this difficulty could have transferred to the AMS version of the question. As a result, even the most able test-takers may not be expected to get Q17 of Version 1 of the AMS right. For Q17, the values of difficulty and dynamic difficulty remain within the acceptable range of values, but the small fluctuation between the two might indicate a more frequent resort to guesswork than

Figure 6.6: Q17 of Version 1 of the AMS, which was adapted from Q15 of the FCI.

In each of the cases where the dynamic difficulty was larger than the difficulty, the difference was never greater than 0.01; this meant that the effect was small, and within what might be expected from random fluctuations where there is some guesswork in the responses, as was explained above. Other cases to consider are those where the question had a difficulty that was out of the acceptable range of values, or close to the boundaries of this acceptable range. As mentioned previously, the acceptable range of values for difficulty is [0.3, 0.9]. One question on the AMS had a difficulty value above 0.9, and five other questions had values that were close to the cut-offs. These are discussed below.
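The range check can be reproduced from the values in Table 6.1. The sketch below flags any question whose statistic falls outside [0.3, 0.9] and also computes the mean of each column; the list and function names are hypothetical, and the means are taken over the rounded published values rather than the underlying data. No threshold is applied for values "close to" the cut-offs, since that criterion is not quantified in the text.

```python
LOWER, UPPER = 0.3, 0.9  # acceptable range for difficulty, as stated in the text

# Values for Q1..Q33, transcribed from Table 6.1.
DYNAMIC = [
    0.75, 0.52, 0.95, 0.70, 0.41, 0.36, 0.42, 0.85, 0.72, 0.70, 0.61,
    0.75, 0.39, 0.83, 0.56, 0.63, 0.38, 0.74, 0.68, 0.82, 0.56, 0.55,
    0.69, 0.53, 0.75, 0.58, 0.80, 0.57, 0.81, 0.82, 0.61, 0.77, 0.49,
]
DIFFICULTY = [
    0.76, 0.52, 0.97, 0.70, 0.43, 0.39, 0.45, 0.88, 0.74, 0.71, 0.64,
    0.77, 0.41, 0.84, 0.57, 0.63, 0.37, 0.74, 0.69, 0.82, 0.57, 0.57,
    0.70, 0.54, 0.74, 0.57, 0.80, 0.58, 0.81, 0.82, 0.61, 0.77, 0.49,
]


def out_of_range(values, lower=LOWER, upper=UPPER):
    """Return the labels of questions whose value lies outside [lower, upper]."""
    return [
        f"Q{number}"
        for number, value in enumerate(values, start=1)
        if not lower <= value <= upper
    ]


def mean(values):
    """Arithmetic mean of a non-empty list of values."""
    return sum(values) / len(values)


print(out_of_range(DIFFICULTY), out_of_range(DYNAMIC))
print(round(mean(DIFFICULTY), 2), round(mean(DYNAMIC), 2))
```

Only Q3 is flagged in either column, and both means round to 0.65, in agreement with the values marked in Table 6.1 and reported in the summary of this subsection.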

Cases where the difficulty values were high

Q3 had difficulty and dynamic difficulty values that were above the 0.9 cut-off, meaning that almost every test-taker who attempted the question got it right. This was an essentially new question based on the situation from Q3 of the original FCI, although Q4 of the AMS bears more resemblance to Q3 in the original FCI. Q3 of the AMS is a free-response question; it asks the test-taker to identify the force or forces acting on a stone after it is dropped from the roof of a building, and explicitly instructs test-takers to ignore the effects of air resistance. Examination of the answers showed no flaw in the marking scheme; most test-takers simply answered this question correctly. The very high difficulty value singles Q3 out as a possibly problematic item, for which revision or removal may be necessary.

Q8 and Q14 both had difficulty and dynamic difficulty values that were above 0.8, which meant that these questions were two of the easier questions on the AMS. Q8 was adapted from Q6 of the original FCI. It is a multiple-choice question, and it asks the test-taker to identify the trajectory of a marble once it exits a curved channel. Q14 was adapted from Q12 of the original FCI. It is also a multiple-choice question, and it requires the test-taker to identify the trajectory of a cannon ball after it has been fired out of a cannon at the top of a cliff. In both questions, most of the distractor options were rarely selected by the students, with the most frequently selected answers being either the correct answer or one specific incorrect distractor answer. Q8 and Q14 were taken from the original FCI, meaning that they have previously been tested and validated. However, the findings here indicate that the limited effectiveness of some of the distractors is a potential weakness in these two questions.

As noted above, most of the distractors on Q8 of the AMS were found to be ineffective, as the majority of the test-takers selected either the correct answer or one other incorrect option. Yasuda et al. (2018) found that students gave correct answers to the FCI version of this question (Q7 of the FCI) by using incorrect lines of reasoning. Furthermore, Traxler et al. (2018) found that the question was biased in favour of males, and even suggested removing it from the original FCI. Similar patterns were identified on Q14 of Version 1 of the AMS, which is another trajectory-based question. This raises questions about what is required to develop effective distractors, particularly in questions that are based on trajectories, since there may only be one viable misconception to base a distractor trajectory path on. However, it is difficult to develop free-response versions of these questions because of the level of description required to specify a path in words. An alternative approach could be to allow students to sketch a trajectory, and to mark the answer based on how close the sketched path is to the desired correct path. Others are investigating the automatic marking of sketches, and it has been suggested that this might be incorporated into a version of the FCI (Martinez and Perez, 2010; Martinez, 2020). Combining this approach with the AMS is a possible avenue for future work, but it is beyond the scope of the present study.

Cases where the difficulty values were low

Q6 was the hardest question on the AMS in terms of the dynamic difficulty statistic. Q6 was adapted from Q5 of the original FCI; it is a multiple-response question, and asks the test-taker to identify the forces acting on a marble while it is travelling inside a curved track. In the usability laboratory testing covered in Chapter 5, this question caused problems for some of the participants, as they misinterpreted the diagram by failing to recognize that the track is flat on the table. Because the subject backgrounds of the usability testing participants were similar to those of the Version 1 cohort, it is likely that test-takers also had this issue in the large-scale administration of the AMS, leading to the low value for the question difficulty. Similarly to Q8 and Q14, Q6 is taken from the original FCI, so it has previously been tested and validated, but this alone does not mean that it should not be revised. However, rewording the question to encourage students to interpret it in the desired way may be ineffective or counterproductive, since a diagram already accompanies the question to aid its interpretation.

Q17 was the hardest question on the AMS in terms of the difficulty statistic. It is a free-response question adapted from Q15 of the FCI, and it requires the test-taker to apply Newton’s Third Law to identify that two forces acting on a car and a truck are equal. As previously noted, Q15 of the FCI is known in the literature as being a difficult question (Poutot and Blandin, 2015), and Newton’s Third Law is known to be a difficult concept for students to master (as discussed in Section 3.4). This question could simply be conceptually demanding for the test-takers, leading to the low values of difficulty. The question itself probably does not need any revisions, since its difficulty is not below the cut-off, and it is useful to have more challenging questions as well as more straightforward questions in order to balance the AMS.

More correct answers were given to Q18 of Version 1 of the AMS than to Q17 of Version 1 of the AMS. Q17 of the AMS corresponds to Q15 of the FCI, and it asks students to compare the force that a car exerts on a truck, while the car is speeding up and pushing the truck. Q18 of the AMS corresponds to Q16 of the FCI, and it asks students to compare the force that the same car exerts on the same truck, when the car is pushing the truck at a constant speed. For Version 1 of the AMS, Q17 was a free-response question, whereas Q18 was a multiple-choice question, and it is possible that students in general found the multiple-choice variant of the question to be easier. However, it is important to note that while the questions are based on the same situation, they are not identical, and it is possible that other factors cause Q18 to be answered better than Q17.

Rebello and Zollman (2004) noted that FCI Q16 makes use of the wording “constant cruising speed”, which is not present in the preceding FCI Q15. It is possible that this wording guides students to the correct answer through a faulty line of reasoning, and this idea is supported by Yasuda et al. (2018) and Galloway (2019), who found that students had linked the idea of moving at constant speed to the forces being equal. It is possible that students in the Version 1 cohort used such reasoning to answer Q18 correctly, although it is not possible to verify this with Version 1 student responses, since Q18 was a multiple-choice question in Version 1 of the AMS. However, Q18 was a free-response question in Version 2 of the AMS (the Version 2 study is covered in Chapter 7), and going through these responses revealed that a small number of students did explicitly make use of the incorrect constant speed line of reasoning, invoking Newton’s First Law, to give a correct answer to this question.

Summary

Overall, 32 out of the 33 questions on the AMS had a difficulty value and a dynamic difficulty value that were in the acceptable range of [0.3, 0.9]. The mean value of the difficulties of the individual questions and the mean value of the dynamic difficulties of the individual questions were both 0.65. Both of these values were within the acceptable range for difficulty, which implied that the AMS was functioning in the

In document Establishing Physics Concept Inventories Using Free-Response Questions (Page 115-122)