2.5 ITEM RESPONSE THEORY
2.5.5 Bifactor Model Applications
Bifactor models have been applied to empirical data in diverse contexts, such as achievement tests and survey instruments in health and psychological measurement (Gibbons & Hedeker, 1992; Reise, Morizot, & Hays, 2007). Their performance relative to competing models has typically been assessed in terms of model fit. For example, Gibbons and Hedeker (1992) fit a full-information bifactor model with four group factors to a dichotomously scored ACT science assessment. A sample of 1000 examinees was tested on 20 items. The bifactor model outperformed a four-factor model with promax rotation in terms of model fit. In addition, substantial factor loadings were observed on the general latent construct, whereas factor loadings on the subdomains varied over a wider range. Reise et al. (2007) compared the fit of the bifactor model to unidimensional and multidimensional IRT models. A sample of 1000 examinees completed a five-domain health outcome survey instrument consisting of 16 items. The bifactor model was superior to both the unidimensional and the orthogonal multidimensional IRT models in terms of model fit.
Similar results in terms of model fit were obtained when a bifactor model was fit to graded response data from a survey instrument consisting of seven subdomains and 34 items (Gibbons et al., 2007). The fit of the bifactor model was superior to that of the unidimensional IRT model. In a more recent application, Li and Rupp (2011) conducted a simulation study to examine the performance of the extension of the multidimensional S-χ² statistic under various conditions such as sample size, test length, and levels of the item discriminations (factor slopes). Data were generated using either a simple-structure MIRT model or a full-information bifactor model, and then unidimensional, multidimensional, and full-information bifactor models were fit to the data. Results indicated that the power of the S-χ² statistic for detecting model misfit was low for all models under investigation, regardless of which model was used as the generating model and which as the fitting model.
The predominant application of the bifactor model within the field of educational measurement has been to testlet-based assessments, such as reading comprehension examinations. Several researchers have investigated the relationship between bifactor, testlet, and second-order MIRT models. Li, Bolt, and Fu (2006) demonstrated that the testlet model can be expressed as a constrained version of the bifactor model in which the testlet item discriminations are proportional to the item discriminations on the general latent construct. Rijmen (2010) established the equivalence of the testlet model to a second-order MIRT model and concluded that both the testlet and second-order MIRT models can be thought of as constrained bifactor models. Rijmen (2010) used data from an international English assessment test to assess the fit of the bifactor, testlet, and second-order MIRT models. A sample of 13,508 examinees took a subset of 20 reading comprehension items organized into four testlets of five items each. Results indicated superior model fit of the bifactor model.
DeMars (2006) also fit a bifactor model to testlet-based assessments and concluded that the latent trait and item parameter estimates were recovered adequately for both simulated and real data. However, latent trait recovery appeared to be less influenced by model choice than item parameter recovery. In particular, choice of model had the most impact on the recovery of the item discrimination parameters. Other psychometric issues have also been successfully addressed through the application of a bifactor model to testlet-based assessments. These issues include vertical scaling (Li & Rijmen, 2009), extension of the bifactor model to a multi-group bifactor model (Cai, 2010; Cai, Yang, & Hansen, 2011), and differential item functioning (DIF) (Fukuhara & Kamata, 2011; Jeon & Rijmen, 2010).
The bifactor model was also used to address construct shift within an IRT vertical scaling framework (Li, 2011). Model fit and recovery of parameter estimates were examined in terms of systematic and random error under various conditions such as sample size, length of the common-item set, and variance of the grade-specific subdomains. The bifactor model showed superior model fit compared to a unidimensional 2PL IRT model. Parameter estimation accuracy was greatly affected by sample size, with larger samples leading to more accurate parameter estimates. The variance of the grade-specific subdomains also affected the accuracy of item parameter estimates: with a larger degree of construct shift, the accuracy of the parameter estimates for the general dimension decreased, whereas the stability of the parameter estimates for the grade-specific subdomains increased. The length of the common-item set did not impact the results significantly.
In general, the bifactor model has proven to be a valuable tool for tackling various psychometric issues such as vertical scaling, differential item functioning (DIF), and multi-group modeling. However, most of these applications were in the context of either dichotomously or polytomously scored instruments, not mixed-format assessments. In addition, the majority of the reviewed studies utilized a two-parameter bifactor model. In an attempt to account for a guessing effect on the MC items, a three-parameter bifactor model was chosen to generate the data for the dichotomously scored items. A bifactor graded response model was utilized to generate data for the polytomously scored test items.
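The two generating models named above can be sketched as follows. This is a minimal illustration, assuming a logistic slope-intercept parameterization in which each item loads on the general factor and exactly one group factor, with a lower asymptote c for the three-parameter items and decreasing category intercepts for the graded response items. The function names, sample size, and all parameter values are hypothetical and are not the ones used in the study.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def bifactor_predictor(theta_g, theta_s, group, a_g, a_s):
    """Slope part of the logit: general factor plus the item's own group factor."""
    return theta_g[:, None] * a_g + theta_s[:, group] * a_s

def simulate_3pl_bifactor(theta_g, theta_s, group, a_g, a_s, d, c, rng):
    """Dichotomous responses; c is the assumed guessing (lower asymptote) parameter."""
    z = bifactor_predictor(theta_g, theta_s, group, a_g, a_s) + d
    p = c + (1.0 - c) * logistic(z)
    return (rng.random(p.shape) < p).astype(int)

def simulate_grm_bifactor(theta_g, theta_s, group, a_g, a_s, d_k, rng):
    """Graded responses 0..K-1; d_k has shape (items, K-1), decreasing within each row."""
    z = bifactor_predictor(theta_g, theta_s, group, a_g, a_s)
    p_star = logistic(z[:, :, None] + d_k[None, :, :])  # P(X >= k), k = 1..K-1
    u = rng.random(z.shape)[:, :, None]                 # one uniform draw per response
    return (u < p_star).sum(axis=2)                     # number of thresholds passed

rng = np.random.default_rng(2024)
n = 500                                   # hypothetical number of examinees
theta_g = rng.standard_normal(n)          # general factor
theta_s = rng.standard_normal((n, 2))     # two orthogonal group factors

# Four hypothetical MC items (two per group factor)
mc = simulate_3pl_bifactor(
    theta_g, theta_s, group=np.array([0, 0, 1, 1]),
    a_g=np.array([1.2, 0.9, 1.5, 1.0]), a_s=np.array([0.6, 0.4, 0.7, 0.5]),
    d=np.array([0.5, 0.0, -0.5, 1.0]), c=np.array([0.2, 0.25, 0.2, 0.25]), rng=rng)

# Two hypothetical polytomous items with four score categories each
cr = simulate_grm_bifactor(
    theta_g, theta_s, group=np.array([0, 1]),
    a_g=np.array([1.1, 1.3]), a_s=np.array([0.5, 0.6]),
    d_k=np.array([[1.0, 0.0, -1.0], [1.5, 0.2, -0.8]]), rng=rng)
```

Stacking `mc` and `cr` column-wise would give a mixed-format response matrix. Drawing a single uniform value per response and counting the cumulative thresholds it falls below yields category probabilities equal to the differences between adjacent cumulative probabilities, which is how the graded response model defines them.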