2.4 CHECKING IRT MODEL-FIT USING PPMC
2.4.3 Previous Research
Previous research using PPMC methods with IRT models has focused mainly on unidimensional dichotomous models. Sinharay (2005) applied the PPMC method to several real applications of unidimensional dichotomous IRT models. The first application assessed which model, a simple 3PL model or a more complicated hierarchical model, better fit operational CAT data. The discrepancy measure used was the standard deviation (SD) of the
results showed that the hierarchical model explained the SDs satisfactorily. A second application examined speededness in a basic skills test using two pairwise discrepancy measures (OR and MH) with PPMC. The last example used the PPMC method to check whether a 3PL model fit real data from NAEP. Several measures were employed to evaluate different aspects of misfit, including the observed score distribution, biserial correlation, OR, and MH. The results suggested that the 3PL model performed extremely well. Overall, through several real applications, this study showed that the PPMC method provides a straightforward way to evaluate different aspects of model misfit.
Subsequently, Sinharay, Johnson, and Stern (2006) conducted several simulation studies to show the ability of PPMC to detect a range of misfitting conditions, using discrepancy measures similar to those in Sinharay (2005): the observed score distribution, the biserial correlation coefficient, and the OR and MH statistics. The results showed that the biserial correlation and OR measures can detect the inadequacy of the Rasch model when the data are generated under a 2PL or 3PL model, and that the observed score distribution measure can identify the lack of fit of a 2PL model to 3PL data. Moreover, the OR and MH statistics successfully detected misfit whenever the local independence assumption was violated (e.g., for a multidimensional or a speeded test), and the observed score distribution was very useful for detecting misfit when the assumed ability distribution was incorrect. The authors presented the PPMC results in graphical displays, which provide visual evidence of misfit.
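To make the pairwise OR measure concrete, the following is a minimal Python sketch of how a PPP-value for the odds ratio of one item pair might be computed from replicated data sets. It is an illustration under stated assumptions, not the procedure of any of the studies cited: the function names (`odds_ratio`, `ppp_value`, `simulate_2pl`), the 0.5 continuity correction, and the use of the generating parameters as a stand-in for MCMC posterior draws are all hypothetical choices made only so that the example is self-contained.

```python
import numpy as np

def odds_ratio(responses, i, j):
    """Sample odds ratio for items i and j in a persons x items 0/1 matrix."""
    a, b = responses[:, i], responses[:, j]
    n11 = np.sum((a == 1) & (b == 1))
    n00 = np.sum((a == 0) & (b == 0))
    n10 = np.sum((a == 1) & (b == 0))
    n01 = np.sum((a == 0) & (b == 1))
    # Assumption: add 0.5 to every cell so that empty cells do not divide by zero.
    return ((n11 + 0.5) * (n00 + 0.5)) / ((n10 + 0.5) * (n01 + 0.5))

def ppp_value(observed, replicated, i, j):
    """Proportion of replicated data sets whose OR is at least the observed OR."""
    obs = odds_ratio(observed, i, j)
    reps = np.array([odds_ratio(rep, i, j) for rep in replicated])
    return float(np.mean(reps >= obs))

def simulate_2pl(theta, a, b, rng):
    """Draw 0/1 responses (persons x items) from a 2PL model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))
    return (rng.random(p.shape) < p).astype(int)

rng = np.random.default_rng(0)
n_persons, n_items = 500, 5
a = np.ones(n_items)                 # hypothetical discriminations
b = np.linspace(-1.0, 1.0, n_items)  # hypothetical difficulties

# "Observed" data in which item 1 duplicates item 0 (a local dependence violation).
observed = simulate_2pl(rng.normal(size=n_persons), a, b, rng)
observed[:, 1] = observed[:, 0]

# Stand-in for posterior predictive replicates: in a real analysis each data set
# would be generated from a separate posterior draw of the model parameters.
replicated = [simulate_2pl(rng.normal(size=n_persons), a, b, rng)
              for _ in range(200)]

ppp_dependent = ppp_value(observed, replicated, 0, 1)  # locally dependent pair
ppp_clean = ppp_value(observed, replicated, 2, 3)      # conditionally independent pair
print(ppp_dependent, ppp_clean)
```

In this sketch an extreme PPP-value (near 0 or 1) for a pair signals that the model does not reproduce the observed association between the two items, which is the kind of evidence the OR measure is reported to provide for violations of local independence.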
Sinharay (2006) also used PPMC to assess item fit with simulated and real data, using item-fit plots and discrepancy measures based on Orlando and Thissen's (2000) item-fit statistics S-X² and S-G². These Bayesian item-fit measures were found to have reasonable Type I error rates, false alarm rates, and acceptable power, even for a short test and/or a small sample size.
Hoijtink (2001) developed two fit statistics for evaluating conditional independence (CI) and differential item functioning (DIF), and then applied PPMC to evaluate the effectiveness of these fit statistics. The results showed that the PPMC method with these fit statistics was powerful in detecting violations of CI and the presence of DIF for 2PL models.
Fu, Bolt, and Li (2005) used PPMC to evaluate item fit for a polytomous fusion model using a number of univariate and bivariate discrepancy measures. The univariate measures, referred to as "item-level" discrepancy measures in section 2.4.2, check item fit through the responses to a single item. They included Orlando and Thissen's (2000) item-fit statistics and the item score distribution. The bivariate measures, called "pairwise measures" in the present study, are based on the joint responses to an item pair. Two bivariate measures were included in their study: the "absolute item residual covariance" and the "bivariate item response discrepancy," which is a polytomous extension of Chen and Thissen's (1997) chi-square LD index. The bivariate statistics were found to have more power for detecting misfitting items than the univariate statistics, and the absolute item residual covariance measure performed best.
In the context of person fit, the nominal Type I error rates of most statistics for 2PL and 3PL models are not consistent with the empirical rates because estimated abilities are used in place of true abilities. Since PPMC takes into account the uncertainty in the estimation of model parameters, Glas and Meijer (2003) applied it to assess person fit for 3PL models using several discrepancy measures. They found that this Bayesian analysis of person fit produced reasonable Type I error rates, even for a short test and a small sample size.
Levy (2006) conducted a simulation study to explore the effectiveness of PPMC for assessing the dimensionality of responses to dichotomous items. In his study, several factors that would influence dimensionality were systematically manipulated: the correlations between dimensions, the data-generating model, the proportion of multidimensional items, the strength of dependence, and the sample size. A number of univariate (item-level) and bivariate (pairwise) discrepancy measures were investigated. The univariate measures included the proportion correct and the item score distribution. The bivariate measures included Chen and Thissen's chi-square LD index, Yen's Q3 statistic, the model-based item covariance, the absolute item covariance residual, log(OR), and the standardized log(OR). The univariate measures were found to be wholly ineffective for detecting multidimensionality, and the most effective measures were two bivariate measures: the model-based covariance and Q3. Furthermore, for all discrepancy measures the empirical proportions of extreme PPP-values were below nominal levels, but for the model-based covariance and Q3 these proportions were quite close to nominal levels. The performance of these discrepancy measures was also found to be related to the manipulated factors.
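Unlike the OR, a measure such as Q3 depends on the model parameters as well as on the data, so in PPMC it is evaluated for both the observed and the replicated data at each posterior draw. The Python sketch below illustrates this idea for one item pair under a 2PL model. It is a hedged illustration, not Levy's (2006) implementation: the function names (`p_2pl`, `q3`, `ppp_q3`) are hypothetical, and the list of posterior draws is replaced by repeated copies of the generating parameters purely to keep the example self-contained.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Persons x items matrix of 2PL correct-response probabilities."""
    return 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))

def q3(responses, probs, i, j):
    """Correlation between the residuals (response minus probability) of items i and j."""
    resid = responses - probs
    return float(np.corrcoef(resid[:, i], resid[:, j])[0, 1])

def ppp_q3(observed, draws, i, j, rng):
    """Proportion of draws in which the replicated Q3 is at least the realized Q3."""
    exceed = 0
    for theta, a, b in draws:
        probs = p_2pl(theta, a, b)
        # One replicated data set per posterior draw.
        y_rep = (rng.random(probs.shape) < probs).astype(int)
        exceed += q3(y_rep, probs, i, j) >= q3(observed, probs, i, j)
    return exceed / len(draws)

rng = np.random.default_rng(1)
n_persons, n_items = 500, 5
theta = rng.normal(size=n_persons)
a = np.ones(n_items)                 # hypothetical discriminations
b = np.linspace(-1.0, 1.0, n_items)  # hypothetical difficulties

probs_true = p_2pl(theta, a, b)
observed = (rng.random(probs_true.shape) < probs_true).astype(int)
observed[:, 1] = observed[:, 0]      # item 1 duplicates item 0 (local dependence)

# Stand-in for MCMC output: a real analysis would supply distinct posterior draws.
draws = [(theta, a, b)] * 100

ppp_dependent = ppp_q3(observed, draws, 0, 1, rng)
print(ppp_dependent)
```

Here the realized Q3 for the dependent pair is large and positive, while the replicated values scatter around zero, so the PPP-value for that pair is extreme.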
The studies presented so far have focused on using PPMC to check the fit of a single model. Some researchers have also used PPMC for model comparison. For example, Béguin and Glas (2001) compared the fit of one- and two-dimensional 3PL models by comparing the observed and posterior predictive score distributions for a data set, and found that the two models were comparable with regard to the reproduction of the observed score distribution. Li, Bolt, and Fu (2006) applied several Bayesian model comparison methods, including PPMC, to compare different testlet models. PPMC using the OR measure was found to be effective in identifying the data-generating testlet model as the best model.