To conclude, we briefly review the results of the simulation study and outline their implications for practical research. In our simulation study, we investigated the appropriateness of the Rasch trees for detecting non-uniform DIF in the entire set of items. We presented two extensions of the logistic regression: one that uses a correction for the multiple testing problem (i-logit) and one that applies a global likelihood ratio test (v-logit) to allow for a direct comparison with the Rasch trees.
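The idea behind correcting item-wise tests for multiple testing can be sketched as follows. This is only an illustration: the text here does not state which correction the i-logit method uses, so a Bonferroni adjustment is assumed, and the p-values and function names are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1 (assumed correction)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def any_item_flagged(p_values, alpha=0.05):
    """Decision for the whole item set: DIF is signalled if any adjusted p-value < alpha."""
    return any(p < alpha for p in bonferroni_adjust(p_values))


# Hypothetical item-wise p-values for a 10-item test (illustration only).
p_items = [0.20, 0.03, 0.64, 0.004, 0.51, 0.11, 0.87, 0.33, 0.049, 0.72]

adjusted = bonferroni_adjust(p_items)
flagged = [j for j, p in enumerate(adjusted) if p < 0.05]
print("flagged item indices:", flagged)
print("DIF signalled in the item set:", any_item_flagged(p_items))
```

Without the adjustment, three of these hypothetical items would fall below the 5% level. After the adjustment only one remains, which is how the correction keeps the test-wide false alarm rate in check.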
In the condition of the null hypothesis (no DIF), the Rasch trees always showed a well-controlled false alarm rate, while both extensions of the logistic regression displayed inflated false alarm rates when ability differences were present. Similar findings occurred for the item-wise logistic regression in the study of Narayanan and Swaminathan (1996). When using the extensions of the logistic regression in situations where ability differences are present, inflated false alarm rates are to be expected that jeopardize the results of the DIF analysis as well as the associated investigation of the causes of DIF (Jodoin and Gierl, 2001).
In the uniform DIF condition, where DIF in the difficulty parameters was simulated between two groups, the performance of the methods depended on the proportion of DIF-items. With 10% DIF-items, the i-logit method displayed the highest hit rate, whereas 30% DIF-items were most frequently detected by the Rasch trees. Compared with the hit rate, the classification accuracy of the DIF methods was lower. In all situations of the uniform DIF condition, the Rasch trees reached far higher classification rates than the extensions of the logistic regression.
Non-uniform DIF was first investigated in the non-uniform DIF jump condition, where the focal group consisted of binary group members with θ_foc > −0.5 who obtained different item difficulty parameters. In case of 10% non-uniform DIF-items, the i-logit method displayed the highest hit rate in regions of medium sample sizes. Otherwise, the Rasch trees again outperformed the extensions of the logistic regression by yielding a high hit rate and holding the alpha level. When the sample sizes were in medium or large regions, the Rasch trees also yielded the highest classification rates.
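A data-generating process of the jump type can be sketched as below. This is a minimal illustration, not the simulation design of the study: the DIF magnitude `delta`, the standard normal ability distribution, the equal group probabilities and all function names are assumptions. Only the cutoff of −0.5 is taken from the text.

```python
import math
import random


def rasch_prob(theta, beta):
    """Rasch model probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def simulate_jump_condition(n_persons, betas, dif_items, delta=0.6,
                            cutoff=-0.5, seed=1):
    """Simulate responses where only members of the binary group with
    theta > cutoff receive shifted difficulties on the DIF items."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        group = rng.randint(0, 1)        # binary covariate (assumed 50/50)
        theta = rng.gauss(0.0, 1.0)      # ability (assumed standard normal)
        affected = group == 1 and theta > cutoff
        responses = []
        for j, beta in enumerate(betas):
            # Hypothetical difficulty shift for affected persons on DIF items.
            b = beta + delta if (affected and j in dif_items) else beta
            responses.append(1 if rng.random() < rasch_prob(theta, b) else 0)
        data.append((group, theta, responses))
    return data


sample = simulate_jump_condition(200, betas=[-1.0, 0.0, 1.0], dif_items={1})
print(sample[0])
```

Because the shift applies only above the cutoff, the group difference on a DIF item is not constant over the latent continuum, which is what makes this condition non-uniform even though only difficulty parameters change.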
In the non-uniform DIF discrimination condition, different discrimination parameters were assigned to the group members and the extensions of the logistic regression clearly outperformed the Rasch trees. In case of 10% DIF-items, the i-logit method reached the highest hit and classification rate, whereas with 30% DIF-items the v-logit method was superior. Nevertheless, in case of large sample sizes, the Rasch trees also yielded a high hit and classification rate and were, thus, in principle able to also detect this type of non-uniform DIF.
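Why differing discrimination parameters produce non-uniform DIF can be illustrated with a small sketch based on the 2pl model. The parameter values and function names below are hypothetical and chosen only for illustration.

```python
import math


def icc_2pl(theta, a, b):
    """2pl item characteristic curve with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def logit(p):
    return math.log(p / (1.0 - p))


# Hypothetical parameters: same difficulty, different discrimination per group.
A_REF, A_FOC, B = 1.0, 1.5, 0.0


def logit_difference(theta):
    """Focal minus reference logit; analytically (A_FOC - A_REF) * (theta - B)."""
    return logit(icc_2pl(theta, A_FOC, B)) - logit(icc_2pl(theta, A_REF, B))


for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  logit difference = {logit_difference(theta):+.3f}")
```

The difference changes sign at the item difficulty, so neither group is favoured over the entire latent continuum. This is the pattern that a single interaction coefficient in the logistic regression is designed to capture.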
In summary, the study showed that Rasch trees provide a flexible approach for the detection of uniform DIF and non-uniform DIF. In case of differing item difficulty parameters (in the uniform DIF and non-uniform jump condition), Rasch trees were often found to be superior while the alpha level was well-controlled. In the case of DIF in the discrimination parameters between two groups (non-uniform discrimination condition), the i-logit and v-logit method clearly outperformed the Rasch trees by using a single model coefficient.
Additionally, we investigated scenarios where the true focal group consisted of an interaction between two binary variables or of the middle range of a numeric covariate. Rasch trees then displayed a comparable or higher hit rate than the (misspecified) binary versions of the logistic regression methods, even though the discrimination condition was considered. Thus, Rasch trees may prove useful for detecting DIF in non-standard patterns, such as interactions of more than one grouping variable or u-shaped patterns, that would usually not be explicitly specified (and thus would be missed) in a logistic regression DIF analysis. However, one drawback is that the classification rate may be rather low if DIF occurs only in the discrimination parameters.
For practical research, we recommend using both methods, the logistic regression methods as well as the Rasch trees, and comparing the results. Rasch trees make it possible to assess non-uniform DIF in a flexible way when the data set is large enough. Large sample sizes may often be found in psychological or educational testing situations, especially in large-scale assessments. Given prior information that the proportion of DIF-items is very small (e.g., one out of ten items displays DIF) and that no ability differences are present, the multiple comparison i-logit procedure is expected to yield satisfactory results. But, again, these methods require prior knowledge of the groups, whereas the Rasch trees automatically search for the DIF groups in the set of available covariates and are also applicable when the DIF groups are not defined prior to data analysis.
The simulation results also showed that the post-hoc inspections are important. In various situations, the true classification rate was far behind the rate of detecting DIF. This highlights that careful consideration of the type of DIF after the global test for (non-)uniform DIF is important to correctly assess the type of DIF. If the type of DIF is misclassified by the methods, wrong steps could be taken or the psychological source of DIF might be missed. Thus, the development and evaluation of post-hoc classifications of the type of DIF remain important tasks.
This study was limited to one type of DIF in the data set. In practical testing situations, both types of DIF can overlap. Therefore, the descriptive analysis of the type of DIF on the item level is important. Methods that do not rely on the post-hoc tests described here include graphical exploration and the comparison of R² values (Gelin and Zumbo, 2003; Zumbo, 1999). Future research might compare those methods to the post-hoc Wald tests. Furthermore, the study was restricted to situations where the different data generating processes relied on the Rasch or the 2pl model. Future research may address further data generating processes such as the 3pl model or non-parametric ICCs. Moreover, we will try to address the generalization of the Rasch tree method to the 2pl or Birnbaum model (Birnbaum, 1968), which may prove helpful for the analysis of non-uniform DIF, and the generalization to a 3pl model including a location and a guessing parameter.
5 A framework for anchor methods
Summary: In the analysis of differential item functioning (DIF) using item response theory (IRT), a common metric for the item parameters is necessary to compare item parameters between groups of test-takers. In the Rasch model, the same restriction is placed on the item parameters in each group in order to define a common metric for the item parameters. Several methods have previously been suggested to determine which items should be included in the restriction. These items are termed the anchor items. This chapter proposes a conceptual framework for categorizing the anchor methods: The anchor class to describe characteristics of the anchor methods and the anchor selection strategy to guide how the anchor items are determined.
Keywords: Rasch model, anchoring, anchor selection, contamination, item response theory (IRT), differential item functioning (DIF), DIF analysis, item bias
5.1 Introduction
The analysis of differential item functioning (DIF) in item response theory (IRT) research investigates the violation of the invariant measurement property among subgroups of examinees, such as male and female test-takers. Invariant item parameters are necessary to assess ability differences between groups in an objective, fair way. If the invariance assumption is violated, different item characteristic curves occur in subgroups. In this chapter, we focus on uniform DIF where one group has a higher probability of solving an item (given the latent trait) over the entire latent continuum and the group differences in the logit remain constant (Mellenbergh, 1982; Swaminathan and Rogers, 1990).
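For the Rasch model, the constancy of the logit difference under uniform DIF can be written out in one line. The notation is an assumption made for illustration: $\theta$ denotes the latent trait and $\beta_j^{\mathrm{ref}}$, $\beta_j^{\mathrm{foc}}$ the difficulty of item $j$ in the reference and focal group.

```latex
\begin{align*}
\operatorname{logit} P(X_j = 1 \mid \theta, \mathrm{ref})
  - \operatorname{logit} P(X_j = 1 \mid \theta, \mathrm{foc})
  &= (\theta - \beta_j^{\mathrm{ref}}) - (\theta - \beta_j^{\mathrm{foc}}) \\
  &= \beta_j^{\mathrm{foc}} - \beta_j^{\mathrm{ref}} .
\end{align*}
```

The right-hand side does not depend on $\theta$, so the group with the lower item difficulty has the higher solving probability at every point of the latent continuum.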
A variety of testing procedures for DIF on the item level is available (as was discussed in Section 3.3; see, e.g., Lord 1980; Mellenbergh 1982; Holland and Thayer 1988; Thissen et al. 1988; Swaminathan and Rogers 1990; Shealy and Stout 1993a; for an overview, see Millsap and Everson 1993). In the analysis of DIF using IRT, item parameters are to be compared across groups. Mostly, research focuses on the comparison of two pre-defined groups, the reference and the focal group. Thus, a common scale for the item parameters of both groups is required to assess meaningful differences in the item parameters. The minimum (necessary but not sufficient) requirement for the construction of a common scale in the Rasch model is to place the same restriction on the item parameters in both groups (Glas and Verhelst, 1995). The items included in the restriction are termed anchor items.
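What placing the same restriction in both groups achieves can be sketched numerically. The sketch assumes a sum-zero restriction over the anchor items, which is one common choice and is not confirmed by the text at this point. The item parameter values and the function name are hypothetical.

```python
def apply_anchor_restriction(betas, anchor):
    """Shift item parameters so that those of the anchor items sum to zero
    (assumed form of the restriction)."""
    shift = sum(betas[j] for j in anchor) / len(anchor)
    return [b - shift for b in betas]


# Hypothetical estimates: no item has DIF, but the focal-group estimates sit
# on an arbitrarily shifted scale, because Rasch item parameters are only
# identified up to an additive constant.
beta_ref = [-1.0, 0.0, 1.0, 0.4]
beta_foc = [b + 0.7 for b in beta_ref]

anchor = [0, 1, 2]
ref_anchored = apply_anchor_restriction(beta_ref, anchor)
foc_anchored = apply_anchor_restriction(beta_foc, anchor)

differences = [f - r for f, r in zip(foc_anchored, ref_anchored)]
print("raw differences:     ", [round(f - r, 3) for f, r in zip(beta_foc, beta_ref)])
print("anchored differences:", [round(d, 3) for d in differences])
```

Before the restriction is applied, every item seems to differ between groups merely because of the arbitrary scale shift. Afterwards the differences vanish, so the remaining differences between groups can be interpreted on a common scale.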
An anchor method determines how many items are used as anchor items and how they are located. Consistent with the literature, we use the term "locate" as a synonym for selecting anchor items. The choice of the anchor items has a high impact on the results of the DIF analysis: If the anchor includes one or more items with DIF, the anchor is referred to as "contaminated". In this case, the scales may be biased and items that are truly free of DIF may appear to have DIF. Therefore, the false alarm rate may be seriously inflated (in the worst case, all DIF-free items seem to display DIF; Wang, 2004), and the results of the DIF analysis are doubtful, as various examples demonstrate (see Section 6.2).
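The effect of a contaminated anchor can be made concrete with a small numerical sketch. It assumes a sum-zero restriction over the anchor items, under which the between-group difference of each item equals its raw difference minus the mean raw difference of the anchor items. The parameter values and the function name are hypothetical.

```python
def anchored_differences(beta_ref, beta_foc, anchor):
    """Focal minus reference item parameters after both groups are restricted
    so that the anchor items sum to zero (assumed form of the restriction)."""
    raw = [f - r for f, r in zip(beta_foc, beta_ref)]
    anchor_mean = sum(raw[j] for j in anchor) / len(anchor)
    return [d - anchor_mean for d in raw]


# Hypothetical true parameters: only item 3 has DIF (0.8 harder in the focal group).
beta_ref = [-1.0, 0.0, 1.0, 0.4]
beta_foc = [-1.0, 0.0, 1.0, 1.2]

clean = anchored_differences(beta_ref, beta_foc, anchor=[0, 1])
contaminated = anchored_differences(beta_ref, beta_foc, anchor=[2, 3])

print("clean anchor:       ", [round(d, 3) for d in clean])
print("contaminated anchor:", [round(d, 3) for d in contaminated])
```

With the clean anchor, only item 3 shows a difference. With the contaminated anchor, the DIF of item 3 is partly absorbed into the scale shift and all three DIF-free items show an artificial difference of equal size, which is the mechanism behind the inflated false alarm rate.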
In the next section, technical details of the anchor process for the Rasch model are explained and illustrated by means of an instructive example, and the framework of anchor classes, anchor selection strategies and anchor methods is introduced in detail.