
3.1 SIMULATION STUDY 1

3.1.5 Conduct PPMC

As reviewed in Chapter 2, conducting PPMC involves simulating replicated data under a presumed model and comparing the discrepancy measures for observed data against the distribution of discrepancy measures across the replicated data sets using graphical displays or PPP-values to evaluate model fit.

PPP-values provide a quantitative measure of the degree to which observed data would be expected under the model. PPP-values near 0.5 indicate that the realized (i.e., observed) discrepancies fall in the middle of the distribution of discrepancy measures based on the posterior predictive response data (i.e., replicated data). Such values provide evidence of model fit. In contrast, extreme PPP-values near 0 or 1 suggest that the observed discrepancies are inconsistent with the posterior predictive discrepancies and hence indicate model misfit. More specifically, PPP-values near 0 indicate that the predictive discrepancy values under the model are smaller than the realized values most of the time, meaning that the model under-predicts this discrepancy measure. By the same logic, PPP-values near 1 indicate that the predictive discrepancy values are larger than the realized values most of the time, meaning that the model over-predicts the measure. In the current study, extreme PPP-values were defined as those below 0.05 or above 0.95, corresponding to a two-tailed test with α = 0.10 in a hypothesis testing framework.

In addition to PPP-values, several types of graphical plots were used in the current study to provide visual evidence about model fit. As discussed in Chapter 2, PPMC is more appropriately used as a diagnostic tool for model fit than as a hypothesis test, because PPP-values are not necessarily uniformly distributed under the null condition. It is therefore preferable to also use graphical plots when interpreting the difference between observed and predicted discrepancy measures in PPMC.
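The interpretation of PPP-values described above can be illustrated with a minimal Python sketch. It assumes the PPP-value is estimated as the proportion of posterior draws in which the predictive discrepancy is at least as large as the realized discrepancy, which matches the direction of interpretation given here. The function names `ppp_value` and `is_extreme` are hypothetical and are not taken from the study's WinBUGS or SAS code.

```python
import numpy as np

def ppp_value(realized, predictive):
    """Estimate a PPP-value as the share of posterior draws in which the
    predictive discrepancy is >= the realized discrepancy (assumed definition).

    `realized` may be a scalar (discrepancy depends on the data only) or an
    array with one value per draw (discrepancy also depends on parameters).
    """
    realized = np.asarray(realized, dtype=float)
    predictive = np.asarray(predictive, dtype=float)
    return float(np.mean(predictive >= realized))

def is_extreme(ppp, lower=0.05, upper=0.95):
    """Flag PPP-values below 0.05 or above 0.95, the cutoffs used in the study."""
    return ppp < lower or ppp > upper
```

Under this definition, a PPP-value near 0 means the predictive discrepancies rarely reach the realized value (under-prediction), and a value near 1 means they almost always exceed it (over-prediction).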

Within each condition of Study 1, the generated data served as the “observed data”, and the posterior predictive (i.e., replicated) data sets under the unidimensional GR model were simulated within WinBUGS in the process of estimating the model parameters. The values of the proposed discrepancy measures were calculated for the observed data as well as for each of the replicated data sets, and then compared using graphical plots and PPP-values. Of the eight discrepancy measures investigated in this study, four (“item score distribution”, “Yen’s Q3”, “absolute item covariance residual”, and “global OR”) and their corresponding PPP-values were computed within WinBUGS. The remaining four discrepancy measures (“test score distribution”, “item-total score correlation”, “Yen’s Q1”, and “Stone’s fit statistic”) were calculated by reading the replicated response data and the parameter estimates for all iterations (CODA output) from WinBUGS into SAS. If the first set of four measures is labeled PPMC1 and the remaining four PPMC2, the general steps to implement PPMC in Study 1 are as follows:

1) Generate a unidimensional GR data set in SAS;

2) Run WinBUGS from SAS through a batch file to fit a unidimensional GR model to the generated data, simulate replicated response data, and compute the PPP-values of the four PPMC1 measures. In addition, save the replicated response data and parameter estimates for all iterations (CODA files) into text files for the subsequent PPMC analysis based on the four PPMC2 measures. Also save the CODA files for the realized and predictive discrepancies in order to compare them using graphical plots;

3) Read these CODA text files from (2) into SAS datasets;

4) Compute the realized and predictive values of the PPMC2 discrepancy measures in SAS, based on the observed data (i.e., generated data) and the CODA datasets from (3), and then obtain their PPP-values. As with the PPMC1 measures, save the realized and predictive discrepancies in order to draw graphical plots.
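Step 1 of the procedure above, generating unidimensional GR data, can be sketched in Python as follows. This is an illustration only and is not the SAS code used in the study. It assumes a logistic graded response model without a scaling constant; the function name `simulate_gr_data`, the number of score categories, the sample size, and all parameter values in the usage note are hypothetical.

```python
import numpy as np

def simulate_gr_data(theta, a, b, rng):
    """Simulate item scores under a logistic graded response (GR) model.

    theta : (N,) latent trait values
    a     : (J,) item slopes
    b     : (J, K-1) thresholds, increasing within each item
    Returns an (N, J) integer array of scores in 0..K-1.
    """
    theta = np.asarray(theta, dtype=float)[:, None, None]   # (N, 1, 1)
    a = np.asarray(a, dtype=float)[None, :, None]           # (1, J, 1)
    b = np.asarray(b, dtype=float)[None, :, :]              # (1, J, K-1)
    # Cumulative probabilities P(X >= k | theta) for k = 1..K-1.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))         # (N, J, K-1)
    # One uniform draw per person-item pair; the score is the number of
    # cumulative probabilities that the draw falls below.
    u = rng.random(p_star.shape[:2])[:, :, None]            # (N, J, 1)
    return (u < p_star).sum(axis=2)
```

For example, with `rng = np.random.default_rng(0)`, `theta = rng.standard_normal(500)`, `a = np.full(15, 1.5)`, and `b = np.tile([-1.0, 0.0, 1.0], (15, 1))`, the call `simulate_gr_data(theta, a, b, rng)` returns a 500 by 15 matrix of scores from 0 to 3. The 15 items match the test length in Study 1; the other values are placeholders.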

The preliminary study conducted in Section 3.1.3 used two chains of 4000 iterations. The results showed that each chain converged very quickly and that the item parameters were well recovered. Based on those results, and because of the intensive computation required in WinBUGS, only one chain of 4000 iterations was run for conducting PPMC. The first 3500 iterations of the chain were discarded as burn-in, and posterior estimation of model parameters and PPMC were based on the remaining 500 iterations. Item parameter recovery using the posterior sample of 500 was evaluated using the Root Mean Square Difference (RMSD) statistic. This statistic compared the true (i.e., generating) and estimated parameters across the 20 replications, as follows:

$$\mathrm{RMSD} = \sqrt{\frac{\sum_{n=1}^{20} (\mathrm{true} - \mathrm{estimate})^{2}}{20}} \qquad (3.9)$$

The results indicated that a posterior sample size of 500 was adequate for accurate recovery of item parameters for the GR model (see Chapter 4). In addition, this sample size was consistent with previous studies (Fu et al., 2005; Levy, 2006; Li et al., 2006).
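The burn-in step and the RMSD statistic in Equation 3.9 can be illustrated with a short Python sketch. The function names `retained_draws` and `rmsd` are hypothetical, and the sketch is not the code used in the study. It assumes one estimate per replication for a given parameter.

```python
import numpy as np

def retained_draws(chain, burn_in=3500):
    """Drop the burn-in iterations and keep the rest of a single chain."""
    return np.asarray(chain)[burn_in:]

def rmsd(true_value, estimates):
    """Root mean square difference between a generating parameter value and
    its estimates across replications (one estimate per replication)."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((true_value - estimates) ** 2)))
```

With a chain of 4000 iterations and a burn-in of 3500, `retained_draws` returns the 500 draws on which estimation and PPMC are based. Passing the 20 replication estimates of one item parameter to `rmsd` gives the value in Equation 3.9.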

To investigate Type-I error rates and empirical power for each proposed discrepancy measure, the PPMC analysis was replicated 20 times (once for each generated data set) within each condition. The proportion of the 20 replications with extreme PPP-values (< 0.05 or > 0.95) for each discrepancy measure provides an estimate of the Type-I error rate of that measure under the null condition (Condition 1), or an estimate of its empirical power under the misfit conditions (Conditions 2-5). It should be noted that, for each replication, different types of discrepancy measures resulted in different numbers of PPP-values. For any replication, the test-level chi-square measure was evaluated once, leading to one PPP-value; each item-level discrepancy measure was evaluated 15 times (once for each item), leading to 15 PPP-values; and each pair-wise discrepancy measure was evaluated 105 times (once for each unique pairing of items), leading to 105 PPP-values. In order to summarize results, PPP-values for item-level and pair-wise measures were pooled based on data structure. Type-I error rates and empirical power rates were based on these pooled PPP-values. The details are discussed in the results chapter.
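The rate calculation described above can be sketched in Python as follows. The sketch assumes the simplest form of pooling, in which all PPP-values for a measure across the 20 replications are treated as one set; the pooling rule actually used is described in the results chapter and may differ. The function name `extreme_rate` is hypothetical.

```python
from math import comb

import numpy as np

N_ITEMS = 15
N_PAIRS = comb(N_ITEMS, 2)  # number of unique item pairs

def extreme_rate(ppp_values, lower=0.05, upper=0.95):
    """Proportion of PPP-values below `lower` or above `upper`.

    Under the null condition this estimates a Type-I error rate; under a
    misfit condition it estimates empirical power.
    """
    ppp = np.asarray(ppp_values, dtype=float)
    return float(np.mean((ppp < lower) | (ppp > upper)))
```

For an item-level measure, `ppp_values` could be an array of shape (20, 15), one row per replication and one column per item; for a pair-wise measure it would have shape (20, `N_PAIRS`).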

Appendix C provides the WinBUGS code used to implement PPMC based on the four PPMC1 measures, including estimating the unidimensional GR model, calculating the four discrepancy measures and their PPP-values, and simulating replicated response data. In addition, the SAS code used to create a batch file for running PPMC in WinBUGS from SAS is given in Appendix D. The SAS code for conducting PPMC using the four PPMC2 measures is available from the author upon request.

In document Assessing Fit of Item Response Models for Performance Assessments using Bayesian Analysis (Page 123-127)