4.5 Comprehension test score reliability
4.5.3 Summary score descriptive statistics and reliability
Each participant completed a summary as one of the comprehension tasks. Summaries were expected to be between 50 and 150 words. Each summary was rated by at least two raters using a rubric developed by the researcher (Appendix F). Raters scored each summary on each of the four rubric constructs (Accuracy, Modeling, Task Completion, and Language) on a scale from 0 to 4. If the first two raters for a summary differed by more than 1 point on any of the four constructs, a third rater (and rarely a fourth rater) provided an additional total rating. The average of the two closest scores was used as the final score. Table 4.5.4 presents descriptive statistics for the summary scores for each topic, across the subscores. The final column is a total score that sums the Accuracy, Modeling, and Task Completion scores and excludes the Language score, giving a single comprehension-based summary score out of 12. The cells in Table 4.5.4 show the mean score, with the standard deviation in parentheses. Topic 5 received slightly lower scores overall than the other topics, with Accuracy in particular rated lower, and Topic 6 received slightly higher scores overall than the other topics, with Modeling in particular rated higher than average.
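The adjudication rule described above can be sketched as follows. This is a minimal illustration, not the procedure as implemented in the study: the function names are hypothetical, the rule is assumed to be applied per construct, and ties between equally close pairs are resolved by taking the first such pair, a detail the text does not specify.

```python
from itertools import combinations

CONSTRUCTS = ("Accuracy", "Modeling", "Task Completion", "Language")

def needs_additional_rating(first, second, threshold=1):
    """Return True if two raters differ by more than `threshold` on any construct."""
    return any(abs(first[c] - second[c]) > threshold for c in CONSTRUCTS)

def final_score(ratings):
    """Average the two closest ratings for one construct.

    With exactly two ratings this is simply their mean. With three or more,
    the pair with the smallest absolute difference is used (assumed tie-break:
    the first such pair in input order).
    """
    if len(ratings) < 2:
        raise ValueError("at least two ratings are required")
    a, b = min(combinations(ratings, 2), key=lambda pair: abs(pair[0] - pair[1]))
    return (a + b) / 2

# Hypothetical ratings for one summary (not data from this study).
rater_1 = {"Accuracy": 3, "Modeling": 2, "Task Completion": 3, "Language": 2}
rater_2 = {"Accuracy": 1, "Modeling": 2, "Task Completion": 2, "Language": 2}
flagged = needs_additional_rating(rater_1, rater_2)  # Accuracy differs by 2
accuracy_final = final_score([3, 1, 2])              # closest pair is (3, 2)
```

In this sketch a third rating of 2 on Accuracy pairs with the first rater's 3, so the outlying score of 1 does not enter the final score.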
The total comprehension score and each summary component score were compared pairwise with each other. All comprehension components were strongly correlated with each other (r > .7) and with the total score (r ≥ .90). As such, only the total summary score is used as a dependent variable in subsequent analysis, since it captures the overall comprehension construct. Correlations were weaker between the comprehension constructs and Language, indicating that raters were able to separate, to some degree, the language construct from comprehension. These correlations are shown in Table 4.5.5.
Table 4.5.4 Mean score and standard deviation (sd) for summary scores for each topic.
Text                 Accuracy     Modeling     Task Completion  Language     Total Comprehension
Biotechnology        2.74 (.81)   2.53 (.82)   2.62 (.76)       2.38 (.63)   7.88 (2.08)
Compound Microscope  2.94 (.98)   2.29 (.90)   2.53 (.99)       2.65 (1.00)  7.76 (2.74)
Water                2.81 (.84)   2.31 (.97)   2.39 (.99)       2.33 (.71)   7.50 (2.68)
Hunger               2.61 (.85)   2.42 (.73)   2.53 (.92)       2.47 (.74)   7.56 (2.28)
Choices              2.47 (.78)   2.47 (1.01)  2.50 (.95)       2.35 (.79)   7.44 (2.59)
Attitudes            2.76 (.75)   2.65 (.86)   2.50 (.71)       2.59 (.80)   7.91 (2.15)
Note: Total Comprehension is the sum of the Accuracy, Modeling, and Task Completion scores (maximum 12); each cell shows the mean of this sum across participants.
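As a consistency check on Table 4.5.4, the mean of a sum equals the sum of the means, so the three comprehension subscore means for each topic should add up to the reported Total Comprehension mean within rounding error. The sketch below uses only the means printed in the table; the variable names and the rounding tolerance are choices made for illustration.

```python
# Mean scores from Table 4.5.4: (Accuracy, Modeling, Task Completion, reported Total).
TABLE_4_5_4 = {
    "Biotechnology":       (2.74, 2.53, 2.62, 7.88),
    "Compound Microscope": (2.94, 2.29, 2.53, 7.76),
    "Water":               (2.81, 2.31, 2.39, 7.50),
    "Hunger":              (2.61, 2.42, 2.53, 7.56),
    "Choices":             (2.47, 2.47, 2.50, 7.44),
    "Attitudes":           (2.76, 2.65, 2.50, 7.91),
}

# Each of the three subscore means is rounded to two decimals, so the sum of
# the rounded means can drift from the rounded total by up to about 0.015.
TOLERANCE = 0.015

discrepancies = {}
for topic, (accuracy, modeling, task_completion, reported_total) in TABLE_4_5_4.items():
    recomputed_total = accuracy + modeling + task_completion
    discrepancies[topic] = abs(recomputed_total - reported_total)

all_consistent = all(gap <= TOLERANCE + 1e-9 for gap in discrepancies.values())
```

The Language mean is deliberately left out of the sum, matching the definition of the total score.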
Table 4.5.5 Correlations between summary rubric construct scores.
                 Accuracy  Modeling  Task Completion  Language
Modeling         0.736
Task Completion  0.771     0.841
Language         0.575     0.649     0.641
Total            0.900     0.930     0.944            0.673
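Each cell in Table 4.5.5 is a correlation between two vectors of per-summary scores. The text reports r but does not name the coefficient; the sketch below assumes Pearson's product-moment correlation. The function name and the example ratings are hypothetical and are not data from this study.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [value - mean_x for value in x]
    dev_y = [value - mean_y for value in y]
    co_deviation = sum(a * b for a, b in zip(dev_x, dev_y))
    spread = sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))
    # A constant list has zero spread, which makes the coefficient undefined.
    if spread == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return co_deviation / spread

# Hypothetical final scores for five summaries (illustration only).
accuracy = [2.0, 3.0, 2.5, 4.0, 1.5]
modeling = [2.0, 2.5, 2.5, 3.5, 1.0]
r_accuracy_modeling = pearson_r(accuracy, modeling)
```

Applying the same function to every pair of rubric constructs, and to each construct paired with the total, would fill a lower-triangular table of the kind shown above.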
To investigate the reliability of the rubric constructs, the rating scale, and the raters for scores on the summary forms, a Multi-faceted Rasch Analysis (MFRA) was performed using the program Facets, version 3.83 (Linacre, 2020). This analysis presents a score model for the entire test, which gives information about how well the rubric constructs fit the test model and how well each point on the rating scale differentiated test-takers at different ability levels. It also evaluates the degree to which the raters exhibited self-consistency, or internal reliability. To further investigate the reliability of raters, inter-rater reliability was calculated using Cohen’s Kappa, though the number of pairwise ratings between raters was low. Each aspect of reliability on the summary test is detailed below.
Regarding the overall reliability of the summary task to separate examinees at different ability levels, the MFRA had a reported weighted likelihood estimate reliability of .902, indicating high person separation reliability and accuracy of scoring. Infit measures were calculated for each construct on the rubric. For a rubric construct to be reliable, infit measurements should lie between .5 and 1.5 (Linacre, 2002) or ideally within a narrower range of .8 to 1.3. Infit that is too low indicates that a construct was too narrowly defined and exhibited limited score variance, and infit that is too high indicates that a construct was poorly defined and its ratings were erratic, or else did not model well with the other constructs. Fit statistics for each construct on the rubric, presented in Table 4.5.6, showed that each construct exhibited sufficient fit. The higher infit for Language indicates that raters treated it in a way inconsistent with the other constructs, meaning it constituted a construct separate from the comprehension constructs. The table also presents the fair average and facility for each construct, indicating that Accuracy was rated highest (the easiest), followed by Modeling and Task Completion, with Language being the lowest rated, or most difficult.
Table 4.5.6 Statistics for rubric constructs
Construct Fair Average Facility S.E. Infit
Accuracy 2.71 -0.59 0.12 1.02
Modeling 2.61 0.01 0.12 0.81
Task Completion 2.48 0.20 0.12 0.85
Language 2.43 0.38 0.13 1.30
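The infit criteria stated above can be applied mechanically to the values in Table 4.5.6. The sketch below is illustrative only: the labels "ideal", "acceptable", and "misfit" are hypothetical names, and both ranges are assumed to include their endpoints, which the text does not specify.

```python
# Infit values as reported in Table 4.5.6.
INFIT = {
    "Accuracy": 1.02,
    "Modeling": 0.81,
    "Task Completion": 0.85,
    "Language": 1.30,
}

IDEAL_RANGE = (0.8, 1.3)       # narrower range mentioned in the text
ACCEPTABLE_RANGE = (0.5, 1.5)  # Linacre (2002)

def classify_infit(value):
    """Label an infit value against the two ranges (endpoints assumed inclusive)."""
    if IDEAL_RANGE[0] <= value <= IDEAL_RANGE[1]:
        return "ideal"
    if ACCEPTABLE_RANGE[0] <= value <= ACCEPTABLE_RANGE[1]:
        return "acceptable"
    return "misfit"

infit_labels = {construct: classify_infit(value) for construct, value in INFIT.items()}
```

Under these assumptions all four constructs fall within the narrower range, with Language sitting exactly on its upper bound.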
Fit statistics were likewise calculated for the rating scale employed by the rubric for each construct. Fit statistics for each scale point on the rubric showed that each point exhibited sufficient fit; these are presented in Table 4.5.7. The score probability curve for each construct is further provided in Figure 4.5.1. The charts in Figure 4.5.1 show the probability of assignment of a score given an individual’s ability level. Each scale point should be represented by a distinct peak, and these peaks should be ordered along the person ability scale in the expected numerical order. Both of these conditions are satisfied by the distributions of score assignment probabilities.
Table 4.5.7 Fit statistics for rubric scale
Scale point Accuracy Fit Modeling Fit Task Completion Fit Language Fit
4 1.0 1.0 0.9 1.3
3 0.9 0.9 0.8 1.2
2 1.1 0.7 0.7 1.1
1 1.2 0.7 0.8 1.5
Figure 4.5.1 Summary score probability curves with respect to person ability
Reliability statistics for raters were produced as well. Severity and fit were calculated for each rater to ascertain the degree to which raters differed in overall ratings and exhibited self-consistency. There were seven raters, and their number of summaries rated, severity, fair average, and fit statistics are presented in Table 4.5.8. The rater separation index was 2.19, with rater separation reliability of .83, indicating that raters exhibited 2.19 distinct levels of severity, and this distinction was significant. This is indicated by two raters being noticeably more lenient than average (Raters G and M) and one rater being noticeably more severe than average (Rater R). Rater G was the most lenient rater (-0.52) and Rater R was the most severe (0.58). Raters exhibited different levels of severity, but no rater’s average rating was more than one standard deviation from the mean, indicating raters were neither too severe nor too lenient overall. Tolerable fit has been variously defined as between .5 and 1.5 (Linacre, 2002), .75 to 1.3 (McNamara, Knoch, & Fan, 2019), and .6 to 1.4 (Wright et al., 1994) for rating scales. Taking these bounds into consideration, all raters exhibited satisfactory model fit, indicating self-consistent rating patterns.
Table 4.5.8 Rater statistics
Rater Code           N    Severity  S.E.  Fair average (Total score)  Infit  Point biserial  Exact Agreement
G                    24   -0.52     0.18  2.77                        0.90   0.79            48.1%
M                    24   -0.47     0.17  2.75                        0.74   0.80            44.2%
N                    44   -0.15     0.13  2.61                        1.04   0.80            45.8%
W                    18    0.15     0.19  2.48                        0.96   0.59            38.2%
I                    44    0.19     0.13  2.47                        1.10   0.75            47.9%
E                    18    0.21     0.20  2.46                        0.93   0.76            43.1%
R                    34    0.58     0.15  2.32                        1.00   0.77            46.2%
Overall inter-rater                                                          0.75            45.5%
Separation Index: 2.19
Separation Reliability: 0.83
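The separation index and separation reliability reported for raters are two expressions of the same quantity. In Rasch measurement, reliability R and separation G are related by R = G² / (1 + G²), so one can be recovered from the other. The sketch below applies this standard relation to the reported index of 2.19; the function names are hypothetical, and the text does not state how Facets rounds intermediate values.

```python
from math import sqrt

def reliability_from_separation(separation):
    """Rasch separation reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)

def separation_from_reliability(reliability):
    """Inverse relation G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    if not 0 <= reliability < 1:
        raise ValueError("reliability must be in the interval [0, 1)")
    return sqrt(reliability / (1 - reliability))

# Rater separation index as reported in Table 4.5.8.
rater_separation = 2.19
rater_reliability = reliability_from_separation(rater_separation)
```

Rounded to two decimals, the recomputed reliability agrees with the value of .83 given in the table.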
Interrater reliability was further calculated using Cohen’s Kappa. This statistic shows the degree to which pairs of raters showed similar trends in rating, and Kappa values closer to 1 are desirable, with values closer to 0, or negative values, indicating poor interrater reliability. Table 4.5.9 presents Cohen’s Kappa values for each pair of raters and the number of ratings for each pair. As the number of ratings for a given pair can be quite small, these results must be interpreted with caution. Small sample sizes can influence the accuracy of Kappa values. Two rating pairs, G-I and R-W, showed the lowest interrater reliability, but the raters otherwise exhibited sufficient internal consistency, and low sampling may be the source of the lower Kappa values. Considering these results alongside the process for adjudicating disagreement, there is evidence to suppose that summary scoring functioned reliably.
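For reference, Cohen’s Kappa compares the observed proportion of exact agreement between two raters with the agreement expected by chance from each rater’s own score distribution. The sketch below implements the unweighted form; whether the study used unweighted or weighted Kappa is not stated, so this choice is an assumption, and the function name is hypothetical.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty, equal-length rating lists")
    n = len(ratings_a)
    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions per score.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(
        (counts_a[score] / n) * (counts_b[score] / n)
        for score in set(counts_a) | set(counts_b)
    )
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)

# Hypothetical scores on a 0-4 scale (illustration only, not study data).
identical = cohens_kappa([2, 3, 3, 4], [2, 3, 3, 4])  # perfect agreement
chance_level = cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0])  # agreement at chance
```

With very few paired ratings, a single disagreement shifts both the observed and expected terms substantially, which is one reason small-N Kappa values are unstable.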
Table 4.5.9 Cohen’s Kappa interrater reliability for summary raters.
Rater Pair  N    Cohen’s Kappa
E-R         7    .64
E-W         11   .41
G-I         8    .22
G-R         15   .57
I-M         8    .78
I-N         27   .70
M-N         15   .67
R-W         5    .29
Overall     96*  .583
*Six summaries were set aside as benchmarks for rater training, so the total number of summaries was 102.
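The text does not say how the overall value in Table 4.5.9 was obtained. One reading that reproduces it is the mean of the pairwise Kappa values weighted by the number of ratings in each pair; this is an assumption offered as a check on the table, not a statement of the method used. The values below are copied from the table.

```python
# (number of ratings, Cohen's Kappa) for each rater pair, from Table 4.5.9.
PAIRWISE_KAPPA = {
    "E-R": (7, 0.64),
    "E-W": (11, 0.41),
    "G-I": (8, 0.22),
    "G-R": (15, 0.57),
    "I-M": (8, 0.78),
    "I-N": (27, 0.70),
    "M-N": (15, 0.67),
    "R-W": (5, 0.29),
}

total_ratings = sum(n for n, _ in PAIRWISE_KAPPA.values())

# Mean of the pairwise values, each weighted by its number of ratings.
n_weighted_mean_kappa = (
    sum(n * kappa for n, kappa in PAIRWISE_KAPPA.values()) / total_ratings
)
```

The pair counts sum to the 96 reported in the table, and the weighted mean rounds to .583, consistent with the overall figure.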