
Future research questions arise from the limitations of this work. First, future research is intended that applies the model-based recursive partitioning algorithm to other IRT models that include a location and a guessing parameter or that allow for more than two response categories (Strobl et al., 2013).

Similar to the situation presented here for the Rasch model, additional steps that answer the question of which items display DIF, and between which groups, then also become relevant.

Item-wise DIF tests for other IRT models that include discrimination or guessing parameters or that allow for more response categories may be investigated.

It remains to be shown how the anchor methods developed in this thesis perform when other IRT models are the underlying data generating process. Furthermore, modifications of the anchor selection strategies might be necessary: González-Betanzos and Abad (2012) found that items with high discrimination parameters were better suited as anchor items. Future research may combine the strategies introduced in this thesis with new requirements. The ranking order of candidate anchor items may, for example, be modified such that items with low discriminatory power are excluded from the anchor candidates. The anchor methods may also be combined with other statistical tests for DIF.

Furthermore, it is yet to be explained how the anchor selection strategies perform in locating a suitable anchor for paired multiple group comparisons. First, the strategy to select a common set of anchor items needs to be compared with the results that occur if different anchor sets are selected for each paired comparison. The paired comparisons are similar to the two-group case discussed in Chapter 6 and in Chapter 7.

Second, a systematic evaluation of the aggregation rules presented in Chapter 8 is necessary. Third, it is important to compare different anchor selection strategies including those suggested in Chapter 7 under different conditions when more focal groups are present. Moreover, the methods from the iterative forward or backward anchor class and the quasi-Wald tests can be investigated for paired multiple group comparisons.

To address the problems related to statistical significance testing with the null hypothesis of no DIF, alternative approaches might be considered. First, instead of a final test decision, a DIF ranking order might be reported that no longer states whether an item displays significant DIF but reports which items display the lowest magnitudes of DIF and which items display the largest magnitudes of DIF.

Second, so-called equivalence tests offer a way to overcome the problem that the tests presented in this thesis (see, again, Chapter 3) decide about the absence of DIF and not about the presence of DIF. Ongoing research considers these tests for the DIF analysis using the Mantel-Haenszel statistic (Casabianca and Lewis, 2012).

This development has the attractive property that the so-called burden of proof (e.g. Walker and Nowacki, 2011) is shifted. While classical DIF tests decide about the absence of DIF, equivalence tests allow one to decide about the presence of DIF. The burden of proof in the alternative hypothesis is then to show that the studied item does not display DIF, which matches exactly the research hypothesis. An advantage is that DIF is then less likely to be missed due to small sample sizes compared to the classical DIF tests. In contrast to this, the burden of proof requires large sample sizes to underpin that the studied item does not display DIF. Thus, equivalence testing may prove helpful to set objective quality standards in test and questionnaire development. One strategy to conduct an equivalence test that is referred to as “[t]he simplest and most widely

(Westlake, 1981; Schuirmann, 1987). Generally, the null hypothesis states non-equivalence by means of two predefined equivalence margins ($\Lambda_1$ and $\Lambda_2$), $H_0: \mu_T - \mu_R \le \Lambda_1$ or $\mu_T - \mu_R \ge \Lambda_2$, for example between the mean bioavailability of a test and reference product (Schuirmann, 1987) or the efficacy of a new and a current therapy (Walker and Nowacki, 2011). The alternative represents equivalence within the predefined margins, $H_1: \Lambda_1 < \mu_T - \mu_R < \Lambda_2$.

Schuirmann (1987) points out that the null hypothesis of non-equivalence can be separated into two parts, $H_{01}: \mu_T - \mu_R \le \Lambda_1$ and $H_{02}: \mu_T - \mu_R \ge \Lambda_2$, and that the alternative $H_1$ is empirically supported if both null hypotheses are rejected at the $\alpha$ level (or, equivalently, if the $(1 - 2\alpha) \cdot 100\%$ confidence interval for the difference is located between the equivalence margins).

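Applied to the Wald test used in this thesis, the null hypothesis (of non-equivalence) reflects DIF in item $j$ (instead of no DIF in item $j$) and can be written as

$$H_0: \beta_j^{\text{ref}} - \beta_j^{\text{foc}} \le -\Lambda_{\text{DIF}} \quad \text{or} \quad \beta_j^{\text{ref}} - \beta_j^{\text{foc}} \ge \Lambda_{\text{DIF}},$$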

assuming the symmetry property that DIF in either direction, in favor of the reference or of the focal group, is treated in the same way and that an equivalence margin $\Lambda_{\text{DIF}}$ is given. The corresponding alternative hypothesis is

$$H_1: -\Lambda_{\text{DIF}} < \beta_j^{\text{ref}} - \beta_j^{\text{foc}} < \Lambda_{\text{DIF}}.$$

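For DIF analysis, the null hypothesis can also be written in two parts, namely $H_{01}: \beta_j^{\text{ref}} - \beta_j^{\text{foc}} \le -\Lambda_{\text{DIF}}$ and $H_{02}: \beta_j^{\text{ref}} - \beta_j^{\text{foc}} \ge \Lambda_{\text{DIF}}$.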

Before the equivalence test can be carried out, research on how to choose the equivalence margin $\Lambda_{\text{DIF}}$ is necessary. While Casabianca and Lewis (2012) discuss an effect size classification employed by ETS, one idea to obtain a substantively meaningful equivalence margin for the Wald test employed here could be to limit the maximum difference in the probability of solving the item between both groups to a certain value (e.g. to the values 5% or 15% that were discussed in the court proceedings, see Chapter 3).
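As an illustrative sketch of this idea (not a result of this thesis), a limit on the probability difference can be translated into a margin on the item parameter scale. The sketch assumes the Rasch model and compares a reference group person and a focal group person with the same ability. Under these assumptions the difference in the probability of solving item $j$ is largest at the ability halfway between both item parameters, where it equals $\tanh(|\Delta|/4)$ with $\Delta = \beta_j^{\text{ref}} - \beta_j^{\text{foc}}$. Setting this maximum equal to a limit $p_{\max}$ gives a margin of $4 \cdot \operatorname{artanh}(p_{\max})$. All function names are hypothetical.

```python
import math


def rasch_probability(theta, beta):
    """Probability of solving an item with parameter beta at ability theta (Rasch model)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def max_probability_difference(delta):
    """Largest difference in solving probability over all ability levels
    for an item parameter difference delta between the two groups.
    Assumption: persons of equal ability are compared across groups."""
    return math.tanh(abs(delta) / 4.0)


def margin_from_probability_limit(p_max):
    """Equivalence margin on the item parameter scale that keeps the
    maximum probability difference at p_max (0 < p_max < 1)."""
    if not 0.0 < p_max < 1.0:
        raise ValueError("p_max must lie strictly between 0 and 1")
    return 4.0 * math.atanh(p_max)


if __name__ == "__main__":
    # Hypothetical limits taken from the 5% and 15% values mentioned in the text.
    for p_max in (0.05, 0.15):
        print(p_max, round(margin_from_probability_limit(p_max), 3))
```

Under these assumptions, limits of 5% and 15% correspond to margins of approximately 0.20 and 0.60 on the item parameter scale. Whether the maximum over all abilities is the appropriate criterion remains an open choice.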

To test both null hypotheses, two item-wise one-sided DIF tests, such as two one-sided Wald tests, can be used. As discussed throughout this thesis, the DIF test results depend on the anchor method, which also holds for one-sided test procedures. Thus, the anchor methods developed in this thesis are applicable not only to the classical DIF tests or to descriptive and effect size measures but also to DIF analysis using equivalence tests such as the two one-sided tests (TOST) approach.
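The two one-sided Wald tests can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis. It assumes that the anchored item parameter estimates of both groups and the standard error of their difference are already available, and that the estimated difference is approximately normally distributed. The function name, the argument names and the numerical inputs are hypothetical.

```python
from statistics import NormalDist


def tost_wald_dif(beta_ref, beta_foc, se_diff, margin, alpha=0.05):
    """Two one-sided Wald tests for the item parameter difference of one item.

    H01: beta_ref - beta_foc <= -margin
    H02: beta_ref - beta_foc >=  margin
    The absence of relevant DIF is concluded only if both are rejected.
    """
    if se_diff <= 0 or margin <= 0:
        raise ValueError("se_diff and margin must be positive")
    diff = beta_ref - beta_foc
    std_normal = NormalDist()
    # Large positive values of z_lower speak against H01.
    z_lower = (diff + margin) / se_diff
    p_lower = 1.0 - std_normal.cdf(z_lower)
    # Large negative values of z_upper speak against H02.
    z_upper = (diff - margin) / se_diff
    p_upper = std_normal.cdf(z_upper)
    # Both one-sided tests must reject, so the larger p-value decides.
    p_tost = max(p_lower, p_upper)
    return {
        "diff": diff,
        "p_lower": p_lower,
        "p_upper": p_upper,
        "p_tost": p_tost,
        "equivalent": p_tost < alpha,
    }


if __name__ == "__main__":
    # Hypothetical estimates for one item after anchoring.
    print(tost_wald_dif(beta_ref=0.10, beta_foc=0.05, se_diff=0.10, margin=0.60))
```

Rejecting both one-sided hypotheses at level $\alpha$ coincides with the $(1 - 2\alpha) \cdot 100\%$ confidence interval for the difference lying between $-\Lambda_{\text{DIF}}$ and $\Lambda_{\text{DIF}}$, in line with the formulation of Schuirmann (1987) given above.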

Literature

Ackerman TA (1992). “A Didactic Explanation of Item Bias, Item Impact, and Item Validity from a Multidimensional Perspective.” Journal of Educational Measurement, 29(1), 67–91.

Aigner DJ, Cain GG (1977). “Statistical Theories of Discrimination in Labor Markets.” Industrial and Labor Relations Review, 30(2), 175–187.

Allalouf A, Hambleton RK, Sireci SG (1999). “Identifying the Causes of DIF in Translated Verbal Items.” Journal of Educational Measurement, 36(3), 185–198.

Anderson EB (1973). “A Goodness of Fit Test for the Rasch Model.” Psychometrika, 38(1), 123–140.

Andrews DWK (1993). “Tests for Parameter Instability and Structural Change with Unknown Change Point.” Econometrica, 61(4), 821–856.

Andrich D, Hagquist C (2012). “Real and Artificial Differential Item Functioning.” Journal of Educational and Behavioral Statistics, 37(3), 387–416.

Angoff WH (1993). “Perspectives on Differential Item Functioning Methodology.” In PW Holland, H Wainer (eds.), Differential Item Functioning, chapter 1. Lawrence Erlbaum, Hillsdale, New Jersey.

Benjamini Y, Hochberg Y (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 289–300.

Berk R, Brown L, Zhao L (2010). “Statistical Inference After Model Selection.” Journal of Quantitative Criminology, 26(2), 217–236.

Berk RA (2006). “An Introduction to Ensemble Methods for Data Analysis.” Sociological Methods & Research, 34(3), 263–295.

Birnbaum A (1968). “Some Latent Trait Models and Their Use in Inferring an Examinee’s Ability.” In F Lord, M Novick (eds.), Statistical Theories of Mental Test Scores, pp. 397–479. Addison-Wesley, Reading.

Björklund A, Kjellström C (2002). “Estimating the Return to Investments in Education: How Useful Is the Standard Mincer Equation?” Economics of Education Review, 21(3), 195–210.

Boeck PD, Bakker M, Zwitser R, Nivard M, Hofman A, Tuerlinckx F, Partchev I (2011). “The Estimation of Item Response Models with the lmer Function from the lme4 Package in R.” Journal of Statistical Software, 39(12), 1–28.

Bolt DM, Hare RD, Vitale JE, Newman JP (2004). “A Multigroup Item Response Theory Analysis of the Psychopathy Checklist – Revised.” Psychological Assessment, 16(2), 155–168.

Boulesteix AL (2006). “Maximally Selected Chi-Square Statistics and Binary Splits of Nominal Variables.” Biometrical Journal, 48(5), 838–848.

Breiman L, Friedman JH, Olshen RA, Stone CJ (1984). Classification and Regression Trees. Chapman and Hall, New York.

Bretz F, Hothorn T, Westfall P (2011). Multiple Comparisons Using R. Chapman & Hall/CRC Press, Boca Raton, Florida, USA.

Candell GL, Drasgow F (1988). “An Iterative Procedure for Linking Metrics and Assessing Item Bias in Item Response Theory.” Applied Psychological Measurement, 12(3), 253–260.

Casabianca J, Lewis C (2012). “Equivalence Testing for Differential Item Functioning: Standard and Bayesian Approaches.” Carnegie Mellon University Working Paper.

Chaudhuri P, Lo WD, Loh WY, Yang CC (1995). “Generalized Regression Trees.” Statistica Sinica, 5(2), 641–666.

Cheung GW, Rensvold RB (1999). “Testing Factorial Invariance across Groups: A Reconceptualization and Proposed New Method.” Journal of Management, 25(1), 1–27.

Cohen A, Bolt D (2005). “A Mixture Model Analysis of Differential Item Functioning.” Journal of Educational Measurement, 42(3), 133–148.

Cohen AS, Kim SH, Wollack JA (1996). “An Investigation of the Likelihood Ratio Test for Detection of Differential Item Functioning.” Applied Psychological Measurement, 20(1), 15–26.

DeMars CE (2010). “Type I Error Inflation for Detecting DIF in the Presence of Impact.” Educational and Psychological Measurement, 70(6), 961–972.

Diekmann A (2007). Empirische Sozialforschung. Grundlagen Methoden Anwendungen. 18. edition. Rowohlt Taschenbuch Verlag, Reinbek.

Dobra A, Gehrke J (2001). “Bias Correction in Classification Tree Construction.” In CE Brodley, AP Danyluk (eds.), Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, pp. 90–97. Morgan Kaufmann.

Dorans NJ (1989). “Two New Approaches to Assessing Differential Item Functioning: Standardization and the Mantel-Haenszel Method.” Applied Measurement in Education, 2(3), 217–233.

Drasgow F (1987). “Study of the Measurement Bias of Two Standardized Psychological Tests.”

Edelen MO, Thissen D, Teresi JA, Kleinman M, Ocepek-Welikson K (2006). “Identification of Differential Item Functioning Using Item Response Theory and the Likelihood-based Model Comparison Approach. Application to the Mini-Mental State Examination.” Medical Care, 44(22), 134–142.

Eggen T, Verhelst N (2006). “Loss of Information in Estimating Item Parameters in Incomplete Designs.” Psychometrika, 71(2), 303–322.

Ellis BB (1989). “Differential Item Functioning: Implications for Test Translations.” Journal of Applied Psychology, 74(6), 912–921.

Fahrmeir L, Tutz G (2001). Multivariate Statistical Modelling Based on Generalized Linear Models. Springer, New York.

Finch H (2005). “The MIMIC Model As a Method for Detecting DIF: Comparison with Mantel-Haenszel, SIBTEST, and the IRT Likelihood Ratio.” Applied Psychological Measurement, 29(4), 278–295.

Finch WH, French BF (2007). “Detection of Crossing Differential Item Functioning: A Comparison of Four Methods.” Educational and Psychological Measurement, 67(4), 565–582.

Firth D (2003). “Overcoming the Reference Category Problem in the Presentation of Statistical Models.” Sociological Methodology, 33(1), 1–18.

Firth D (2012). qvcalc: Quasi Variances for Factor Effects in Statistical Models. R package version 0.8-8.

Firth D, De Menezes RX (2004). “Quasi-variances.” Biometrika, 91(1), 65–80.

Fischer G (1981). “On the Existence and Uniqueness of Maximum-Likelihood Estimates in the Rasch Model.” Psychometrika, 46(1), 59–77.

Fischer G, Molenaar I (eds.) (1995). Rasch Models: Foundations, Recent Developments and Applications. Springer-Verlag, New York.

Fischer GH (1974). Einführung in die Theorie Psychologischer Tests. Grundlagen und Anwendungen. Verlag Hans Huber, Bern, Stuttgart, Wien.

Fischer GH (1995). “Derivations of the Rasch Model.” In GH Fischer, IW Molenaar (eds.), Rasch Models – Foundations, Recent Developments, and Applications, chapter 2. Springer, New York.

Flier HVD, Mellenbergh GJ, Adèr HJ, Wijn M (1984). “An Iterative Item Bias Detection Method.” Journal of Educational Measurement, 21(2), 131–145.

Frederickx S, Tuerlinckx F, De Boeck P, Magis D (2010). “RIM: A Random Item Mixture Model to Detect Differential Item Functioning.” Journal of Educational Measurement, 47(4), 432–457.

French BF, Maller SJ (2007). “Iterative Purification and Effect Size Use with Logistic Regression for Differential Item Functioning Detection.” Educational and Psychological Measurement, 67(3), 373–393.

Gelin M, Carleton B, Smith M, Zumbo B (2004). “The Dimensionality and Gender Differential Item Functioning of the Mini Asthma Quality of Life Questionnaire (MiniAQLQ).” Social Indicators Research, 68(1), 91–105.

Gelin MN, Zumbo BD (2003). “Differential Item Functioning Results May Change Depending on How an Item Is Scored: An Illustration with the Center For Epidemiologic Studies Depression Scale.” Educational and Psychological Measurement, 63(1), 65–74.

Glas CAW (1998). “Detection of Differential Item Functioning using Lagrange Multiplier Tests.” Statistica Sinica, 8(3), 647–667.

Glas CAW, Verhelst ND (1995). “Testing the Rasch Model.” In GH Fischer, IW Molenaar (eds.), Rasch Models – Foundations, Recent Developments, and Applications, chapter 5. Springer, New York.

Gómez-Benito J, Hidalgo MD, Guilera G (2010). “Bias in Measurement Instruments. Fair Tests.” Papeles del Psicólogo, 31(1), 75–84.

González-Betanzos F, Abad FJ (2012). “The Effects of Purification and the Evaluation of Differential Item Functioning with the Likelihood Ratio Test.” Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8(4), 134–145.

Gujarati DN (2003). Basic Econometrics. 4. edition. McGraw-Hill, Boston.

Gustafsson J (1980). “Testing and Obtaining Fit of Data in the Rasch Model.” British Journal of Mathematical and Statistical Psychology, 33(2), 205–233.

Hambleton RK, Rogers HJ (1989). “Detecting Potentially Biased Test Items: Comparison of IRT Area and Mantel-Haenszel Methods.” Applied Measurement in Education, 2(4), 313–334.

Hambleton RK, Swaminathan H, Rogers HJ (1991). Fundamentals of Item Response Theory. Sage Publications, Newbury Park.

Hancock G, Samuelsen K (eds.) (2007). Advances in Latent Variable Mixture Models. Information Age Publishing, Charlotte.

Hannöver W, Richard M, Hansen NB, Martinovich Z, Kordy H (2002). “A Classification Tree Model for Decision-Making in Clinical Practice: An Application Based on the Data of the German Multicenter Study on Eating Disorders, Project TR-EAT.” Psychotherapy Research, 12(4), 445–461.

Hastie T, Tibshirani R, Friedman JH (2008). The Elements of Statistical Learning. Data Mining,

Hidalgo-Montesinos MD, Lopez-Pina JA (2002). “Two-Stage Equating in Differential Item Functioning Detection under the Graded Response Model with the Raju Area Measures and the Lord Statistic.” Educational and Psychological Measurement, 62(1), 32–44.

Hochberg Y, Tamhane A (eds.) (1987). Multiple Comparison Procedures. John Wiley & Sons, New York.

Holland PW, Thayer DT (1988). “Differential Item Performance and the Mantel-Haenszel Procedure.” In H Wainer, HI Braun (eds.), Test Validity, chapter 9. Lawrence Erlbaum, Hillsdale, New Jersey.

Holland PW, Wainer H (eds.) (1993). Differential Item Functioning. Lawrence Erlbaum, Hillsdale, New Jersey.

Hothorn T, Hornik K, van de Wiel M, Zeileis A (2006a). “A Lego System for Conditional Inference.” The American Statistician, 60(3), 257–263.

Hothorn T, Hornik K, Zeileis A (2006b). “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics, 15(3), 651–674.

Hothorn T, Lausen B (2003). “On the Exact Distribution of Maximally Selected Rank Statistics.” Computational Statistics & Data Analysis, 43(2), 121–137.

Jodoin MG, Gierl MJ (2001). “Evaluating Type I Error and Power Rates Using an Effect Size Measure with the Logistic Regression Procedure for DIF Detection.” Applied Measurement in Education, 14(4), 329–349.

Kelderman H, MacReady G (1990). “The Use of Loglinear Models for Assessing Differential Item Functioning across Manifest and Latent Examinee Groups.” Journal of Educational Measurement, 27(4), 307–327.

Kim J, Oshima TC (2013). “Effect of Multiple Testing Adjustment in Differential Item Functioning Detection.” Educational and Psychological Measurement, 73(3), 458–470.

Kim SH, Cohen AS (1995). “A Comparison of Lord’s Chi-Square, Raju’s Area Measures, and the Likelihood Ratio Test on Detection of Differential Item Functioning.” Applied Measurement in Education, 8(4), 291–312.

Kim SH, Cohen AS (1998). “Detection of Differential Item Functioning under the Graded Response Model with the Likelihood Ratio Test.” Applied Psychological Measurement, 22(4), 345–355.

Kim SH, Cohen AS, Kim HO (1994). “An Investigation of Lord’s Procedure for the Detection of Differential Item Functioning.” Applied Psychological Measurement, 18(3), 217–228.

Kim SH, Cohen AS, Park TH (1995). “Detection of Differential Item Functioning in Multiple

Kitsantas P, Moore T, Sly D (2007). “Using Classification Trees to Profile Adolescent Smoking Behaviors.” Addictive Behaviors, 32(1), 9–23.

Kopf J, Augustin T, Strobl C (2013a). “The Potential of Model-Based Recursive Partitioning in the Social Sciences: Revisiting Ockam’s Razor.” In J McArdle, G Ritschard (eds.), Contemporary Issues in Exploratory Data Mining, chapter 3. Routledge. To appear in 2013.

Kopf J, Strobl C (2013). “Detecting Non-uniform DIF with Rasch Trees.” To be submitted.

Kopf J, Zeileis A, Strobl C (2013b). “Anchor Methods for DIF Detection: A Comparison of the Iterative Forward, Backward, Constant and All-Other Anchor Class.” Technical Report 141, Department of Statistics, LMU Munich.

Kopf J, Zeileis A, Strobl C (2013c). “Anchor Selection Strategies for DIF Analysis: Review, Assessment, and New Approaches.” Technical Report 150, Department of Statistics, LMU Munich.

Koziol J (1991). “On Maximally Selected Chi-Square Statistics.” Biometrics, 47(4), 1557–1561.

Kuckulenz A, Zwick T (2005). “Heterogene Einkommenseffekte Betrieblicher Weiterbildung.” Die Betriebswirtschaft, 65(3), 258–275.

Leeb H, Pötscher BM (2005). “Model Selection and Inference: Facts and Fiction.” Econometric Theory, 21(1), 21–59.

Leisch F (2004). “FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R.” Journal of Statistical Software, 11(8), 1–18.

Lesnoff M, Lancelot R (2012). aod: Analysis of Overdispersed Data. R package version 1.3.

Li HH, Stout W (1996). “A New Procedure for Detection of Crossing DIF.” Psychometrika, 61(4), 647–677.

Li KC, Lue HH, Chen CH (2000). “Interactive Tree-Structured Regression via Principal Hessian Directions.” Journal of the American Statistical Association, 95(450), 547–560.

Lim RG, Drasgow F (1990). “Evaluation of Two Methods for Estimating Item Response Theory Parameters When Assessing Differential Item Functioning.” Journal of Applied Psychology, 75(2), 164–174.

Linn RL, Levine MV, Hastings CN, Wardrop JL (1981). “Item Bias in a Test of Reading Comprehension.” Applied Psychological Measurement, 5(2), 159–173.

Liou M (1994). “More on the Computation of Higher-Order Derivatives on the Elementary Symmetric Functions in the Rasch Model.” Applied Psychological Measurement, 18(1), 53–62.

Loh WY (2002). “Regression Trees with Unbiased Variable Selection and Interaction Detection.” Statistica Sinica, 12(2), 361–386.

Lopez Rivas GE, Stark S, Chernyshenko OS (2009). “The Effects of Referent Item Parameters on Differential Item Functioning Detection Using the Free Baseline Likelihood Ratio Test.” Applied Psychological Measurement, 33(4), 251–265.

Lord F (1980). Applications of Item Response Theory to Practical Testing Problems. Lawrence Erlbaum, Hillsdale, New Jersey.

Magis D, De Boeck P (2011). “Identification of Differential Item Functioning in Multiple-Group Settings: A Multivariate Outlier Detection Approach.” Multivariate Behavioral Research, 46(5), 733–755.

Magis D, Raîche G, Béland S, Gérard P (2011). “A Generalized Logistic Regression Procedure to Detect Differential Item Functioning Among Multiple Groups.” International Journal of Testing, 11(4), 365–386.

Maij-de Meij A, Kelderman H, Van der Flier H (2008). “Fitting a Mixture Item Response Theory Model to Personality Questionnaire Data: Characterizing Latent Classes and Investigating Possibilities for Improving Prediction.” Applied Psychological Measurement, 32(8), 611–631.

Mair P, Hatzinger R (2007). “Extended Rasch Modeling: The eRm Package for the Application of IRT Models in R.” Journal of Statistical Software, 20(9), 1–20.

Mair P, Hatzinger R, Maier MJ (2012). eRm: Extended Rasch Modeling. R package version 0.15-0.

Maller SJ (2001). “Differential Item Functioning in the Wisc-III: Item Parameters for Boys and