CHAPTER 4: VALIDATION OF THE GUESSING FROM CONTEXT TEST
4.3 Procedure for Item Analysis
Data were collected in October and November 2010. The six test forms were randomly distributed to the participants. The data were entered into a single Microsoft Office Excel 2007 (12.0.6545) spreadsheet and exported to WINSTEPS 3.71.0 (Linacre, 2010b) for Rasch analysis.
Rasch analysis was used because the purpose of the research was “to develop fundamental measures that can be used across similar appropriate measurement situations, not merely to describe the data produced by administering Test a to Sample b on Day c” (Bond & Fox, 2007, p. 143). Rasch analysis, which examines the fit of the data to the requirements for objective measurement, contrasts with Item Response Theory, which primarily focuses on maximising the fit of the model to the data by adding parameters such as item discrimination and guessing (Embretson & Hershberger, 1999). The key principle of the Rasch model is straightforward:
a person having a greater ability than another person should have the greater probability of solving any item of the type in question, and similarly, one item being more difficult than another means that for any person the probability of solving the second item is the greater one (Rasch, 1960, p. 117).
The model is mathematically represented by the following formula:
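$$P_{ni} = \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)}$$
where Pni = the probability of person n succeeding on item i, Bn = the ability of person n in logits (log odds of success), and Di = the difficulty of item i in logits. Rasch analysis examines how well the empirical data fit the model, and not vice versa.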
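As a minimal illustration of the dichotomous Rasch model, the sketch below computes the probability of success from a person ability and an item difficulty, both in logits. The function name `rasch_probability` and the example values are assumptions made for illustration; they do not come from the study or from WINSTEPS.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits. This is algebraically the same as
    exp(B - D) / (1 + exp(B - D)).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    # Illustrative values only (not taken from the study).
    print(rasch_probability(0.0, 0.0))  # ability equals difficulty: 0.5
    print(rasch_probability(1.0, 0.0))  # person 1 logit above the item: about 0.73
    print(rasch_probability(1.0, 2.0))  # person 1 logit below the item: about 0.27
```

The probability depends only on the difference between ability and difficulty, which is what allows persons and items to be placed on a single logit scale.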
Rasch analysis was performed to identify poorly written items, that is, items that do not fit the Rasch model. First, the point-measure correlation (the correlation between the observations on an item and the corresponding person ability estimates) was examined to see whether the items were aligned in the same direction on the latent variable. The point-measure correlation indicates the degree to which more able persons scored higher (or less difficult items were scored higher). Its values range between -1 and 1, and items with negative or low positive values (less than .10) need to be inspected. The point-measure correlation, rather than the point-biserial correlation, was used because the former is more robust with missing data than the latter (Linacre, 2010a).
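The point-measure correlation for one item can be sketched as a Pearson correlation between the scored responses and the person measures, using only the persons who answered the item. This is an illustrative sketch, not the WINSTEPS implementation; the function names, the use of `None` for missing responses, and the example data are assumptions.

```python
import math
from typing import Optional, Sequence


def point_measure_correlation(
    responses: Sequence[Optional[int]], measures: Sequence[float]
) -> float:
    """Pearson correlation between item responses (0/1) and person measures.

    Missing responses are coded as None and skipped. The sketch assumes that
    the observed responses and measures each show some variation.
    """
    pairs = [(x, b) for x, b in zip(responses, measures) if x is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((x - mean_x) * (b - mean_b) for x, b in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / math.sqrt(var_x * var_b)


def needs_inspection(correlation: float, cutoff: float = 0.10) -> bool:
    """Flag items with negative or low positive correlations (below .10)."""
    return correlation < cutoff


if __name__ == "__main__":
    # Hypothetical person measures (logits) and responses to one item.
    measures = [-1.5, -0.5, 0.0, 0.5, 1.5]
    responses = [0, 0, None, 1, 1]  # the third person did not see this item
    r = point_measure_correlation(responses, measures)
    print(round(r, 3), needs_inspection(r))
```

An item answered correctly mainly by less able persons would yield a negative value and be flagged for inspection.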
Next, the degree of fit to the model was investigated. There are two fit statistics for examining the match between the model and the data: outfit (outlier-sensitive fit) and infit (inlier-sensitive or information-weighted fit). Outfit is an unweighted estimate sensitive to unexpected responses by low-ability persons on difficult items or high-ability persons on easy items; infit, on the other hand, is a weighted estimate sensitive to unexpected responses to items targeted on the person (Linacre, 2002). Both outfit and infit statistics are expressed in two forms: unstandardised mean square and standardised t. The mean square is a chi-square statistic divided by its degrees of freedom, with an expected value of 1.0. Reasonable mean-square values should range between 0.5 and 1.5 for productive measurement (Linacre, 2002) or between 0.7 and 1.3 for run-of-the-mill multiple-choice tests (Bond & Fox, 2007). It has been pointed out that mean-square statistics have the weaknesses of failing to detect a significant number of misfit items and having varying Type I error rates according to sample size (Smith, 2000; Smith, Schumacker, & Bush, 1998; Smith & Suh, 2003). The t statistics are derived by converting mean squares to normally distributed z-standardised statistics using the Wilson-Hilferty cube root transformation, with an expected value of 0 (Linacre, 2002). Reasonable t values should range between -2.0 and 2.0 (Bond & Fox, 2007; Linacre, 2002). It has been demonstrated that standardised fit statistics are highly susceptible to sample size: with a large sample, even a mean square close to 1.0 can be identified as misfitting (Karabatsos, 2000; Linacre, 2003; Smith, Rush, Fallowfield, Velikova, & Sharpe, 2008). For example, Linacre (2003) calculates that an item with a mean square of 1.2 is detected as misfitting if observed in a sample of more than 200 persons. The present research used outfit and infit t statistics as the primary criterion for detecting misfit items, because the t statistics may identify a greater number of misfit items than mean-square statistics. However, each misfit item was carefully inspected to see whether it was really a bad item, because with a large sample of more than 400 persons the t statistics might identify good items as misfitting.
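The fit statistics described above can be sketched for a single dichotomous item as follows. The sketch uses the commonly cited formulas for the mean squares and for the Wilson-Hilferty cube root transformation, t = (MS^(1/3) - 1)(3/q) + q/3, where q is the model standard deviation of the mean square. It is not the WINSTEPS code, and the function names, the dictionary keys, and the example probabilities are assumptions. In practice the model probabilities would be computed from the estimated person and item measures.

```python
import math
from typing import Dict, Sequence


def wilson_hilferty(mean_square: float, q: float) -> float:
    """Convert a mean square to an approximately unit-normal t value.

    q is the model standard deviation of the mean square (assumed > 0).
    """
    return (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0


def item_fit(responses: Sequence[int], probabilities: Sequence[float]) -> Dict[str, float]:
    """Outfit and infit mean squares and t values for one dichotomous item.

    responses: observed scores (0/1), one per person, no missing data.
    probabilities: Rasch model probabilities of success for the same persons.
    """
    n = len(responses)
    # Model variance and fourth central moment of each dichotomous response.
    variances = [p * (1.0 - p) for p in probabilities]
    fourth = [p * (1.0 - p) * ((1.0 - p) ** 3 + p ** 3) for p in probabilities]
    sq_resid = [(x - p) ** 2 for x, p in zip(responses, probabilities)]

    # Outfit: unweighted mean of squared standardised residuals.
    outfit_ms = sum(r / w for r, w in zip(sq_resid, variances)) / n
    # Infit: squared residuals weighted by model variance (information).
    infit_ms = sum(sq_resid) / sum(variances)

    # Model standard deviations of the two mean squares.
    outfit_q = math.sqrt(sum(c / w ** 2 for c, w in zip(fourth, variances)) / n ** 2 - 1.0 / n)
    infit_q = math.sqrt(sum(c - w ** 2 for c, w in zip(fourth, variances))) / sum(variances)

    return {
        "outfit_ms": outfit_ms,
        "infit_ms": infit_ms,
        "outfit_t": wilson_hilferty(outfit_ms, outfit_q),
        "infit_t": wilson_hilferty(infit_ms, infit_q),
    }


if __name__ == "__main__":
    # Hypothetical model probabilities for eight persons, most able first.
    probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    print(item_fit([1, 1, 1, 1, 0, 0, 0, 0], probs))  # highly predictable pattern
    print(item_fit([0, 0, 0, 0, 1, 1, 1, 1], probs))  # reversed, unexpected pattern
```

In this toy example the predictable pattern gives mean squares below 1.0 and negative t values, whereas the reversed pattern gives mean squares well above 1.0 and t values above 2.0.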
Misfit items are classified into two types, which have different implications for measurement: underfit and overfit. Underfit (or noisy) items indicate that the quality of the items is degraded by many unexpected responses that do not conform to the Rasch model. Underfit is usually taken as mean squares greater than a particular value (e.g., 1.3 or 1.5) or t values greater than 2.0. Overfit (or muted) items do not pose the same threat to measurement quality as underfit items. Overfit indicates that the data show less variability than the model expects and thus approach a Guttman pattern, so that reliability might be overestimated. Overfit is usually taken as mean squares less than a particular value (e.g., 0.7 or 0.5) or t values less than -2.0. Care is needed in the treatment of overfit items, because “omitting the overfitting items […] could rob the test of its best items” (Bond & Fox, 2007, p. 241).
A major criticism of the use of the Rasch model for the analysis of the multiple-choice format is that there is no parameter accounting for lucky guessing (unexpected success by low-ability respondents) (Weitzman, 1996). However, Rasch analysis can detect lucky guessing through item and person outfit statistics, and a simple strategy is to remove the lucky guesses from the data set (Wright, 1992, 1995). The subsequent section looks at whether lucky guessing was detected and how it was treated if it occurred.
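One way to picture the strategy of removing lucky guesses is to treat a correct response as missing when the model probability of success is very low. The sketch below is an illustration under stated assumptions, not the procedure used in this research: the function names are hypothetical, `None` stands for a missing response, and the cut-off passed as `threshold` is an arbitrary example value, since no particular cut-off is specified here.

```python
import math
from typing import List, Optional, Sequence


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of success under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def remove_lucky_guesses(
    responses: Sequence[Sequence[Optional[int]]],
    abilities: Sequence[float],
    difficulties: Sequence[float],
    threshold: float,
) -> List[List[Optional[int]]]:
    """Recode highly improbable correct responses as missing (None).

    responses[n][i] is the score of person n on item i. The threshold is an
    analyst's choice; the value used below is purely illustrative.
    """
    cleaned: List[List[Optional[int]]] = []
    for row, ability in zip(responses, abilities):
        new_row: List[Optional[int]] = []
        for score, difficulty in zip(row, difficulties):
            unexpected = score == 1 and rasch_probability(ability, difficulty) < threshold
            new_row.append(None if unexpected else score)
        cleaned.append(new_row)
    return cleaned


if __name__ == "__main__":
    # Hypothetical measures: two persons, three items.
    abilities = [-2.0, 1.0]
    difficulties = [-1.0, 0.0, 2.5]
    responses = [[1, 0, 1], [1, 1, 0]]
    # The low-ability person's success on the hardest item is recoded as missing.
    print(remove_lucky_guesses(responses, abilities, difficulties, threshold=0.1))
```

After such recoding, the measures would be re-estimated from the cleaned data, so that improbable successes no longer inflate the ability estimates of low-ability persons.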