Study 2: Detecting Drift using Rater Models

5.1 Summary

The use of CR items to evaluate examinee ability has increased over the years, which can be attributed to their role in supporting validity. There are important skills that cannot be fully measured when only MC items are used (Livingston, 2009). CR items ask test takers to construct their own answers, which requires the use of raters. This introduces a subjective layer into scoring, because scores given by the same rater can differ across scoring occasions. Yet scores generated from CR items must be reliable and valid, regardless of when an individual takes the test.

Differences in rater scores between testing administrations raise the issue of rater drift, which occurs when raters change their scoring behavior over different scoring occasions. Studies have found evidence of rater drift in real-world data (e.g., Congdon & McQueen, 2000) and have suggested the use of rater models (e.g., IRT models and the latent class SDT model) to adjust for rater effects such as rater severity when scoring CR items. However, the effect of rater drift on model-based classifications of essays into latent classes defined by the scoring rubric has not been studied comprehensively. To address these issues, this study had two main goals: (1) to examine how changes in rater behavior – rater drift – affect model-based classification and (2) to investigate the ability of different rater models to detect rater drift. These objectives were addressed using an analysis of real-world data and simulation studies.

Empirical study 1: Teacher certification test. In the empirical study, a teacher certification test and a high school writing test were used to identify patterns of rater drift using the latent class SDT model and IRT models. Parameter estimates from the rater models were used to detect patterns of rater drift. The teacher certification test was scored by 32 raters over 7 testing administrations on a 1 to 6 scale.

Plots of rater parameters showed minor individual variation in drift. These changes in rater behavior reflected variations in rater severity and in rater discrimination. Regression was used to summarize trends in rater severity and showed no significant linear or nonlinear trends; there were also no significant trends for rater discrimination. Measures of classification (i.e., proportion correctly classified and lambda) showed stable estimates of classification accuracy across the seven testing administrations. Although there was evidence of rater drift in rater severity and in rater discrimination, these variations had a minimal effect on classification accuracy.
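The regression summary of severity trends can be illustrated with a minimal sketch. It fits an ordinary least squares line to one rater's severity estimates across testing administrations and returns the slope, its standard error, and the t statistic. The function name `linear_trend` and the severity values are hypothetical and are not taken from the study; the study's own regression specification (including the nonlinear terms) is not reproduced here.

```python
import math

def linear_trend(occasions, estimates):
    """Ordinary least squares fit of estimates on occasions.

    Returns (slope, intercept, standard error of slope, t statistic).
    """
    n = len(occasions)
    mean_x = sum(occasions) / n
    mean_y = sum(estimates) / n
    sxx = sum((x - mean_x) ** 2 for x in occasions)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(occasions, estimates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(occasions, estimates)
    )
    se = math.sqrt(residual_ss / (n - 2) / sxx)
    # Guard against a perfect fit, where the standard error is zero.
    t = slope / se if se > 0 else math.inf
    return slope, intercept, se, t

# Hypothetical severity estimates for one rater over 7 administrations.
occasions = [1, 2, 3, 4, 5, 6, 7]
severity = [0.12, 0.05, 0.18, 0.09, 0.15, 0.07, 0.11]
slope, intercept, se, t = linear_trend(occasions, severity)
print(f"slope={slope:.4f}, se={se:.4f}, t={t:.2f}")
```

The t statistic would be compared against a t distribution with n - 2 degrees of freedom; a value near zero is consistent with no linear drift in severity for that rater.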

Empirical study 2: High school writing test. In the second phase of the empirical analysis, the high school writing test was used to examine the effect of rater drift on classification accuracy and to investigate patterns of rater drift using different rater models. These data differed from the teacher certification test in that 18 raters scored over 12 months on a 1 to 4 scale.

This study produced unexpected results. Most notably, the discrimination parameters from the latent class SDT model showed a significant increase in parameter estimates, whereas the IRT models showed stable estimates across the scoring occasions. The estimated latent class sizes showed a non-normal distribution, with greater class sizes in the middle scoring categories (i.e., 2 and 3). Estimates of classification accuracy showed minor changes over the 12 scoring occasions. Unlike the teacher certification test, results from the high school writing test showed that the latent class SDT model and the IRT models disagreed with respect to measures of rater discrimination.

Simulation study 1: Effect of rater drift on classification accuracy. Two simulation studies were conducted. In the first study, the effect of rater drift on classification accuracy was investigated. Using the latent class SDT model, data were generated over two scoring occasions to reflect raters becoming stricter, raters becoming more lenient, and a combination of raters who became stricter and raters who became more lenient. A separate condition reflected an increase in rater discrimination between the two scoring occasions. Results showed that changes in rater severity had a minimal effect on classification accuracy. Changes in rater discrimination had a greater effect: for an average increase in rater discrimination of two units, classification accuracy increased by about 20%.
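The logic of this simulation can be sketched in a few lines. The sketch below assumes a logistic latent class SDT model in which the cumulative probability of a score is F(c_k - d * eta), places the criteria midway between adjacent latent classes, uses equal latent class sizes, and classifies each essay by its posterior mode using the true generating parameters rather than fitted estimates. All of these choices, along with the function names, sample sizes, and parameter values, are illustrative assumptions and not the study's design.

```python
import math
import random

def category_probs(eta, d, criteria):
    """P(Y = k | eta) for each score category under a logistic SDT model."""
    cum = [1.0 / (1.0 + math.exp(-(c - d * eta))) for c in criteria] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

def simulate_accuracy(d, shift=0.0, n_essays=2000, n_raters=3, n_classes=4, seed=1):
    """Proportion of essays whose latent class is recovered by the posterior mode.

    d     : rater discrimination (assumed equal for all raters)
    shift : added to every criterion; positive values mimic stricter raters
    """
    rng = random.Random(seed)
    classes = range(n_classes)
    class_sizes = [1.0 / n_classes] * n_classes
    # Criteria midway between adjacent latent classes, then shifted.
    criteria = [d * (k + 0.5) + shift for k in range(n_classes - 1)]
    probs = [category_probs(eta, d, criteria) for eta in classes]
    correct = 0
    for _ in range(n_essays):
        eta = rng.choices(classes, weights=class_sizes)[0]
        scores = [rng.choices(classes, weights=probs[eta])[0] for _ in range(n_raters)]
        # Unnormalized posterior over latent classes given the observed scores.
        post = [class_sizes[m] * math.prod(probs[m][s] for s in scores) for m in classes]
        predicted = max(classes, key=post.__getitem__)
        correct += predicted == eta
    return correct / n_essays

# Compare a baseline with a severity shift and with a two-unit increase in d.
print("baseline     :", simulate_accuracy(d=2.0))
print("stricter     :", simulate_accuracy(d=2.0, shift=0.5))
print("higher d (+2):", simulate_accuracy(d=4.0))
```

Because the classifier here knows the generating parameters, the sketch isolates how severity and discrimination change the information in the scores; it does not capture estimation error from fitting the model to each scoring occasion.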

Simulation study 2: Effect of rater drift on parameters of rater models. In the second simulation study, the effect of rater drift on parameter estimates of the GR model was examined using data generated from the latent class SDT model. Results showed that the GR model detected changes in both rater severity and rater discrimination, indicating that it is sensitive to both forms of drift even when the data are generated from the latent class SDT model.
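For orientation, the two model families compared here can be written side by side. The forms below are the standard textbook statements, assuming that "GR model" refers to the graded response model with raters treated as items; the symbols (c_{jk} and d_j for rater j's criteria and discrimination, a_j and b_{jk} for the GR discrimination and location parameters, F for a cumulative distribution function such as the logistic) are notation chosen for this illustration and may differ from the notation used elsewhere in the document.

```latex
% Latent class SDT model: rater j, essay in latent class \eta, score category k
P(Y_{j} \le k \mid \eta) = F(c_{jk} - d_{j}\,\eta)

% Graded response (GR) model: rater j treated as an item, continuous trait \theta
P(Y_{j} \ge k \mid \theta) = \frac{\exp[a_{j}(\theta - b_{jk})]}{1 + \exp[a_{j}(\theta - b_{jk})]}
```

Under these forms, a shift in rater severity appears as a shift in the criteria c_{jk} or the locations b_{jk}, while a change in rater discrimination appears in d_j or a_j.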

The effect of different latent class sizes on parameter estimates of the GR model was also examined, again using data generated from the latent class SDT model. In general, when the distribution of latent class sizes was non-normal, with a greater concentration of class size in the middle scoring categories, the GR model underestimated rater discrimination.

Finally, the effect of shifting latent class sizes on parameter estimates of rater models was examined. In this condition, scores were more concentrated in the higher scoring categories during the second scoring occasion than during the first, thereby creating a shift in the latent class sizes. This condition caused estimates of the criteria parameters of the latent class SDT model and the location parameters of the GR model to shift down, whereas estimates of rater discrimination remained stable. This effect was consistent with the interpretation of the latent class sizes, where greater proportions of scores in the higher scoring categories reflected leniency among raters.

In document Rater Drift in Constructed Response Scoring via Latent Class Signal Detection Theory and Item Response Theory (Page 97-100)