Conduct Model Comparison - SIMULATION STUDY 2

3.2 SIMULATION STUDY 2

3.2.4 Conduct Model Comparison

For each of the 20 generated data sets in each condition, the competing models were estimated in WinBUGS, and three Bayesian model comparison indices (DIC, CPO, and PPMC) were obtained for each model during estimation. The values for the different models were then compared to determine which model was preferred.

The estimates of the DIC index for the different models were requested within WinBUGS. In the batch file (see Appendix D), the line “dic.set()” was used to set up the DIC index, and the line “dic.stats()” was used to request the value of DIC. The smaller the value of DIC, the better the model fit for the overall test. Because DIC is a test-level index, it cannot indicate which model is preferred for a specific item.
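As an illustration of how DIC values might be compared across models, the following minimal sketch uses the standard definition that WinBUGS reports, DIC = Dbar + pD with pD = Dbar - Dhat, where Dbar is the posterior mean deviance and Dhat is the deviance at the posterior means of the parameters. The function names and the deviance values are hypothetical and are not taken from the study.

```python
import numpy as np

def dic_from_deviance(deviance_samples, deviance_at_posterior_mean):
    """Return (DIC, pD, Dbar) from posterior deviance draws.

    Dbar is the posterior mean deviance, pD = Dbar - Dhat is the
    effective number of parameters, and DIC = Dbar + pD.
    """
    dbar = float(np.mean(deviance_samples))
    p_d = dbar - float(deviance_at_posterior_mean)
    return dbar + p_d, p_d, dbar

def preferred_by_dic(dic_by_model):
    """Return the name of the model with the smallest DIC."""
    return min(dic_by_model, key=dic_by_model.get)

# Hypothetical deviance draws for two models (illustrative numbers only).
dev_a = np.array([1010.0, 1012.0, 1008.0, 1010.0])
dev_b = np.array([1030.0, 1028.0, 1032.0, 1030.0])

dic_a, pd_a, dbar_a = dic_from_deviance(dev_a, deviance_at_posterior_mean=1000.0)
dic_b, pd_b, dbar_b = dic_from_deviance(dev_b, deviance_at_posterior_mean=1025.0)

best = preferred_by_dic({"model_A": dic_a, "model_B": dic_b})
print(dic_a, dic_b, best)
```

In the study itself these quantities came directly from the WinBUGS output; the sketch only makes the "smaller is better" comparison explicit.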

The computation of the CPO index was implemented by first computing the CPO at the level of an individual item response. A command line “inprob[i, j] <- pow(p[i, j, y[i, j]], -1)” was added to the WinBUGS code (see Appendix C) to compute the inverse likelihood of the observed item response based on the posterior model parameter values at a specific iteration. The mean value of the node “inprob[i, j]” across the posterior sample is reported in the WinBUGS statistics output and is used to estimate the CPO value for the response of student i to item j (CPOij). Once the CPOij estimates were known, the CPO value for each item (CPOj) was computed in SAS by reading in the CPOij estimates and taking the log of the product of the CPOij across all examinees (see Equation 2.36). In addition, a CPO index for the overall test was obtained by taking the log of the product of the item-level CPOj across all items. Two levels of the CPO index were therefore used in the current study: the test-level CPO was used to compare the models for the overall test, and the item-level CPOj were used to choose a preferred model for each item. The larger the value of the test-level CPO, the better the model fit for the overall test. The larger the value of the item-level CPOj, the better the model fit for a specific item j.
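The aggregation from response-level to item-level and test-level CPO can be sketched as follows. This is a minimal sketch and not the SAS code used in the study. It assumes the usual Monte Carlo estimator in which CPOij is the reciprocal of the posterior mean of the inverse likelihood; Equation 2.36 is not reproduced in this section, so that step should be checked against it. The array layout and function name are hypothetical.

```python
import numpy as np

def cpo_summaries(inprob_samples):
    """Summarize CPO at the response, item, and test level.

    inprob_samples: array of shape (n_draws, n_examinees, n_items) holding
    the inverse likelihood 1 / p(y_ij | parameters at draw t).
    """
    # Assumption: CPO_ij = 1 / (posterior mean of the inverse likelihood).
    cpo_ij = 1.0 / inprob_samples.mean(axis=0)
    log_cpo_ij = np.log(cpo_ij)
    # Log of the product across examinees = sum of logs, one value per item.
    item_log_cpo = log_cpo_ij.sum(axis=0)
    # Log of the product across items = sum of the item-level values.
    test_log_cpo = float(item_log_cpo.sum())
    return cpo_ij, item_log_cpo, test_log_cpo

# Toy input: 2 draws, 2 examinees, 2 items, every likelihood equal to 0.5,
# so every inverse likelihood equals 2.0 (illustrative values only).
toy_inprob = np.full((2, 2, 2), 2.0)
cpo_ij, item_log_cpo, test_log_cpo = cpo_summaries(toy_inprob)
print(item_log_cpo, test_log_cpo)
```

Summing logs instead of multiplying the CPOij and then taking the log gives the same quantity while avoiding numerical underflow when there are many examinees.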

The different models in each condition were also compared using PPMC. The details of conducting PPMC were introduced in Section 3.1.5. Recall that eight discrepancy measures were used with PPMC in Study 1. In contrast to Study 1, the discrepancy measures employed in Study 2 included only the effective measures identified in Study 1. From the results presented in Chapter 4 for Study 1, two discrepancy measures, “Yen’s Q3” and “global OR”, were found to be the most effective of the eight measures for detecting violations of unidimensionality and local independence. Therefore, for Conditions 2-4 in Study 2, only these two measures were used with PPMC for model comparison. However, for Condition 1, in which the GR, 1-parameter GR, and RS models were compared, all eight discrepancy measures were employed with PPMC, since the use of discrepancy measures with these models had not been investigated and their performance was therefore unknown.

To compare different models using PPMC, the frequency of extreme PPP-values was computed for each model. For item-level discrepancy measures, there were 15 PPP-values, one per item, for each replication. The number of items out of the 15 with extreme PPP-values (< 0.05 or > 0.95) was used as the criterion for comparing the models. For pair-wise measures, there were 105 PPP-values, one per item pair, for each replication. The number of item pairs out of the 105 with extreme PPP-values (< 0.05 or > 0.95) was used as the criterion for comparing the models. When the true model was estimated, no or few extreme PPP-values were expected. In contrast, when an alternative model was estimated, more extreme PPP-values were expected. In addition to PPP-values, graphical plots based on the different models were also compared.
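The counting criterion described above can be sketched as follows. The cutoffs (< 0.05 or > 0.95) and the counts of 15 items and 105 item pairs come from the text; the function name and the PPP-values are hypothetical and serve only as an illustration.

```python
from math import comb

import numpy as np

def count_extreme(ppp_values, lower=0.05, upper=0.95):
    """Count PPP-values that are strictly below `lower` or strictly above `upper`."""
    ppp = np.asarray(ppp_values, dtype=float)
    return int(np.sum((ppp < lower) | (ppp > upper)))

n_items = 15
n_pairs = comb(n_items, 2)  # number of distinct item pairs among 15 items

# Hypothetical item-level PPP-values for one replication of one model.
item_ppp = [0.50] * 12 + [0.01, 0.97, 0.05]
n_extreme = count_extreme(item_ppp)
print(n_pairs, n_extreme)
```

Note that a PPP-value of exactly 0.05 is not counted here because the inequalities are strict, which matches the cutoffs as written in the text.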

The relative performance of these three indices was compared with respect to the number of times each index selected the correct model across 20 replications. An effective index should be able to identify the generating model as the preferred model a large proportion of the time.
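The summary across replications amounts to a simple proportion, which might be computed as in the sketch below. The model labels and the selection outcomes are invented for illustration and are not results from the study.

```python
def correct_selection_rate(selected_models, generating_model):
    """Proportion of replications in which an index preferred the generating model."""
    hits = sum(1 for model in selected_models if model == generating_model)
    return hits / len(selected_models)

# Hypothetical outcome for one index over 20 replications:
# the generating model ("GR") is preferred 17 times, "RS" 3 times.
selections = ["GR"] * 17 + ["RS"] * 3
rate = correct_selection_rate(selections, generating_model="GR")
print(rate)
```

The same function would be applied separately to the selections made by DIC, CPO, and PPMC so that the three rates can be compared.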

The preliminary study conducted in Section 3.2.3 used two chains of different length to estimate different models. One exception was the RS model, for which one long chain (10,000 iterations) was run. The results indicated that each chain converged very quickly and item parameters were well recovered. Due to the intensive computation in WinBUGS, only one chain was run to estimate the different models and compute the model comparison indices. The length of the chain for each model depended on the model as well as the results from the preliminary study.

Condition 1: GR vs. one-par GR vs. RS models

For each of these three models, one chain of 5000 iterations was run. The first 4000 iterations were discarded as the burn-in phase, and the remaining 1000 iterations were thinned by taking every other iteration to obtain a posterior sample of size 500. The computation of the three model comparison indices was based on these 500 iterations.

Condition 2: unidimensional GR model vs. 2-dim simple-structure GR model

For the 2-dim simple-structure GR model, one chain of 8000 iterations was run. The first 5000 iterations were discarded as the burn-in phase, and the remaining 3000 iterations were thinned by taking every third iteration to obtain a posterior sample of size 1000. For the unidimensional GR model in this condition, one chain of 5000 iterations was run. The first 3000 iterations were discarded as the burn-in phase, and the remaining 2000 iterations were thinned by taking every other iteration to obtain a posterior sample of size 1000. The computation of the three model comparison indices was based on these 1000 iterations.

Condition 3

The length of the chain, the thinning, and the size of the posterior sample for the 2-dim complex-structure GR model were the same as for the 2-dim simple-structure GR model in Condition 2. Note that more thinning was conducted than in the preliminary study in order to further reduce the autocorrelation among parameters. For the unidimensional GR model, one chain of 5000 iterations was run. The first 3000 iterations were discarded as the burn-in phase, and the remaining 2000 iterations were thinned by taking every other iteration to obtain a posterior sample of size 1000.

Condition 4: unidimensional GR model vs. testlet GR model

For both models, one chain of 5000 iterations was run. The first 3000 iterations were discarded as the burn-in phase, and the remaining 2000 iterations were thinned by taking every other iteration to obtain a posterior sample of size 1000.
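The posterior sample sizes reported for the four conditions follow from the chain length, burn-in, and thinning interval. The sketch below checks that arithmetic. The chain settings are taken from the text; the dictionary labels and the helper function are hypothetical, and the choice of which iteration within each thinning interval is retained is an assumption that does not affect the counts.

```python
def posterior_sample_size(chain_length, burn_in, thin):
    """Number of draws kept after dropping `burn_in` iterations and keeping every `thin`-th."""
    # Zero-based indices burn_in, burn_in + thin, ... below chain_length are retained.
    return len(range(burn_in, chain_length, thin))

# (chain length, burn-in, thinning interval) as described for each condition.
chain_settings = {
    "Condition 1, GR / one-par GR / RS": (5000, 4000, 2),
    "Conditions 2-3, 2-dim GR": (8000, 5000, 3),
    "Conditions 2-4, unidimensional GR": (5000, 3000, 2),
    "Condition 4, testlet GR": (5000, 3000, 2),
}

sample_sizes = {
    label: posterior_sample_size(*settings)
    for label, settings in chain_settings.items()
}
for label, size in sample_sizes.items():
    print(f"{label}: {size}")
```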

In document Assessing Fit of Item Response Models for Performance Assessments using Bayesian Analysis (Page 146-150)