Saad Chahine & Ruth A. Childs, Ontario Institute for Studies in Education, University of Toronto

This study illustrates the use of differential item functioning (DIF) and differential step functioning (DSF) analyses to detect differences in item difficulty that are related to experiences of examinees, such as their teachers' instructional practices, that are relevant to the knowledge, skill, or ability the test is intended to measure. This analysis is in contrast to the typical use of DIF or DSF to detect differences related to characteristics of examinees, such as gender, language, or cultural knowledge, that should be irrelevant. Using data from two forms of Ontario's Grade 9 Assessment of Mathematics, analyses were performed comparing groups of students defined by their teachers' instructional practices. All constructed-response items were tested for DIF using the Mantel chi-square, the standardized Liu-Agresti cumulative common log-odds ratio, and the standardized Cox's noncentrality parameter. Items exhibiting moderate to large DIF were subsequently tested for DSF. In contrast to typical DIF or DSF analyses, which inform item development, these analyses have the potential to inform instructional practice.
items favor the female group, function equally for both groups, or favor the male group. For differential test functioning (DTF), eight tests exhibited a small level of DTF.
Fairness is a very important characteristic of test items. Unfair items are those that allow groups of students at the same level of competency to answer correctly at different rates, differences that may result from language, gender, religion, race, culture, domicile, or level of intelligence. This behavior of items is called differential item functioning (DIF). DIF is a matter on which test developers, educational assessors, and statisticians place importance, putting effort into developing guidelines and statistical techniques to identify items that function differentially among different groups of students, in order to make the test results as fair as possible. A test includes items whose questions require students to express what they know within the measurement framework. A source of difficulty therefore lies in the construction of the items, which may contain elements impertinent to some group of students, such as vocabulary that each group of students may interpret differently. Scores obtained from such tests thus partly reflect the limitations of word meaning or of the language used in the tests (Camilli & Shepard, 1994). The purpose of work on item bias is to investigate and support the validity of the test.
She and I spent the better part of an afternoon devising elaborate and ostensibly convincing theories about why six particular items on the Metropolitan Achievement Test were behaving differentially for Black examinees, only to discover that, because of a programming error, we had been examining the wrong items. What was especially painful was the realization that in subsequent theorizing about the correct set of items showing differential item functioning (DIF), we found ourselves making arguments that were diametrically opposed to our earlier theorizing. (p.
Reliability and differential item functioning (DIF) analyses were conducted on testlets displaying local item dependence in this study. The data set employed in the research was obtained from the answers given by 1,500 students to the 20 items included in six testlets on the English Proficiency Exam administered by the School of Foreign Languages of a state university in Turkey. One purpose of this study was to determine the influence of tests composed of testlets on reliability, so the reliability coefficients obtained when the testlet effects were considered were compared with those obtained when the testlet effects were not considered. The G theory analyses conducted in this context found that the G and Phi coefficients estimated without considering the testlet effects were higher than those estimated when considering them. It was concluded that reliability was estimated to be relatively higher when the influence of the testlets was not considered. Two methods were also used to determine the effects of testlets on differential item functioning, and the results were compared. In the DIF-detection method that considered the testlet effect, the number of items displaying DIF, at both the significance and the estimated DIF-magnitude levels, was found to be higher than in the method that did not consider the testlet effect.
A psychological test should be developed to accurately estimate the measured construct, remaining insensitive to extraneous factors that could affect it. The way in which various external factors jeopardize item responses is known as test bias, and minimizing these influences is a primary concern for test developers. A relatively recent category of psychometric techniques aims to identify items that function differently across groups of test-takers, a phenomenon called differential item functioning (DIF). The purpose of this paper is to present, in a non-technical manner, the conceptual foundation of DIF and the most common applied techniques. We will discuss methods based on logistic regression, those using Item Response Theory models, as well as the Mantel-Haenszel method. We will then discuss the strengths and limitations of each method, conditions of application, and the main software packages used to conduct such analyses.
Regardless of the type of items constituting a test, the most important psychometric property that it should have is validity. Significant threats to validity are item bias and test bias (Clauser & Mazor, 1998). Neutrality toward those taking a test is a property that tests should satisfy. The differing performances displayed on an item by individuals at equal levels of ability but in different groups can be accounted for by differential item functioning (DIF). This can also be defined as a difference in the probability of a correct answer between individuals who are similar in the measured property but belong to different sub-groups, such as gender or socio-economic level (Hambleton, Swaminathan, & Rogers, 1991). The increase in the use of testlets in standardized educational tests over the last 20 years has led to an increase in studies concerning how to score and analyze them. Traditionally, the effects of testlets were ignored and each item constituting a testlet was scored as if it were an independent item (Bradlow, Wainer, & Wang, 1999; Lee, Kolen, Frisbie, & Ankenmann, 2001; Sireci et al., 1991; Wainer, 1995; Wainer & Wang, 2000; Yen, 1993). The DIF analyses performed in these studies are considered at the item level (Sireci et al., 1991; Wainer & Thissen, 1996). Other studies in the literature calculate the sums or averages of the items constituting each testlet, obtaining scores at the testlet level (Lee, Dunbar, & Frisbie, 2001; Lee & Frisbie, 1999; Sireci et al., 1991). Another approach is testlet scoring, where each testlet is considered a single item and scored polytomously; polytomous item response models, especially Bock's nominal response model, are used in this process (Sireci et al., 1991; Thissen et al., 1989; Wainer, 1995; Wainer & Thissen, 1996; Yen, 1993).
However, the DIF derived from this method is at the testlet level rather than the item level. In other words, this method yields differential testlet functioning (DTF). The specific items responsible for the DIF therefore cannot be determined (Fukuhara & Kamata, 2011). Since constructing a testlet is demanding, time-consuming work, removing a testlet displaying DTF from the item pool is undesirable. Instead, identifying and revising the problematic items within the testlet makes it possible to re-use the testlet.
Keywords: logistic regression; differential item functioning; recursive partitioning;
In recent years, differential item functioning (DIF) and DIF identification methods have been areas of intensive research. DIF occurs if the probability of a correct response among persons with the same value of the underlying trait differs across subgroups, for example, if the difficulty of an item depends on membership in a racial, ethnic, or gender subgroup. If a test contains DIF items, it may be unfair, that is, favor specific groups. When developing and using tests that measure latent abilities, one should be aware of the phenomenon of DIF. Ideally, tests should not contain suspicious items.
They suggest that after identifying differentially functioning items, the next concern must be to test whether eliminating these items contributes to the quality of the test, because differential item functioning is an item-level analysis whereas individuals are usually compared at the test level. Also, biased items may point to cross-cultural differences that require further investigation; if such items are eliminated, this potential source of information is lost. In addition, elimination may deform the content validity of the instruments.
The key concepts of item bias and differential item functioning
Metric or measurement equivalence in cross-cultural comparisons is a key concern in cross-cultural comparative research. One cannot determine equivalence using unequal comparisons (Foxcroft & Roodt, 2009) because the essence of what one is measuring will be questioned. Constructs are equivalent for different cultural groups when one obtains the same or similar scores when using the same or different language versions of the items or measures. If not, the items are culturally biased or the measures are not equivalent.
Thus, these scholars suggested that studies should be conducted on various aspects of item structure and arrangement with the aim of eliminating or reducing any gender bias. In support of this, Siti Rahayah et al. (2008a; 2008b) revealed findings on students' achievement based on gender. Their studies have become local references across different fields, confirming that DIF analysis of item functioning is essential for establishing reliability. In other related DIF studies, Sheppard et al. (2006) investigated the Hogan Personality Inventory across gender and two racial groups (Caucasian and Black) and revealed 38.4% (53 out of 138 items) gender-based DIF and 37.7% (52 out of 138 items) race-based DIF, indicating potential bias, with more items displaying DIF for Caucasians than for Blacks. Cauffman & MacIntosh (2006) examined the Massachusetts Youth Screening Instrument by identifying race and gender differential item functioning among juvenile offenders. An item is the basic unit of an instrument. In creating an item, it is essential to ascertain stability and equality for all participants. Thus, DIF is used to detect items that function differently within a construct. This can be applied to various demographic groups provided that they are of similar capabilities (Tennant
Differential item functioning (DIF) refers to a difference in the way a test item functions for comparable groups of test takers. Formally defined, an item displays DIF if subjects of equal proficiency, or equal ability level, on the construct intended to be measured by a test, but from separate subgroups of the population, differ in their expected score on this item (). In DIF analysis, the population is typically divided into two subgroups named the reference and focal groups; the reference group provides a baseline for performance and the focal group is the focus of fairness concerns. Two types of DIF can be identified, denoted uniform and nonuniform DIF (). Uniform DIF (UDIF) occurs when the relative advantage of one group over another on a test item is uniform, favoring one group consistently across the entire ability scale. Nonuniform DIF (NUDIF) exists when the conditional dependence of group membership and item performance changes in size but not in direction across the entire ability continuum (unidirectional DIF) or when the conditional dependence changes in direction across the ability continuum (crossing DIF).
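The uniform/nonuniform distinction can be operationalized with a logistic-regression approach in the style of Swaminathan and Rogers (1990): three nested models of the probability of a correct response are fitted, and the likelihood-ratio statistic for adding a group main effect tests uniform DIF, while the statistic for adding a score-by-group interaction tests nonuniform DIF. The sketch below is a minimal numpy-only illustration, assuming a numeric matching score and 0/1 codings for group and item; the helper names are illustrative, not from the source:

```python
import numpy as np

def _logit_loglik(X, y, iters=50):
    """Fit logistic regression by Newton-Raphson; return the
    maximized log-likelihood."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        W = p * (1.0 - p)
        # Tiny ridge keeps the Hessian invertible in degenerate strata
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(X.shape[1])
        b += np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ b))
    eps = 1e-12
    return np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))

def lr_dif(score, group, item):
    """Logistic-regression DIF screen for one dichotomous item.
    Compares three nested models of P(correct):
      M0: intercept + matching score
      M1: M0 + group              -> uniform DIF
      M2: M1 + score x group      -> nonuniform DIF
    Returns the two likelihood-ratio chi-squares (1 df each)."""
    s = (score - score.mean()) / score.std()  # center to reduce collinearity
    one = np.ones_like(s)
    ll0 = _logit_loglik(np.column_stack([one, s]), item)
    ll1 = _logit_loglik(np.column_stack([one, s, group]), item)
    ll2 = _logit_loglik(np.column_stack([one, s, group, s * group]), item)
    return 2.0 * (ll1 - ll0), 2.0 * (ll2 - ll1)
```

On simulated data in which the item is uniformly harder for the focal group, the first statistic should be large while the second behaves like a null chi-square; in practice the observed total test score, rather than the latent trait, serves as the matching variable.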
The prime objective of this study is to obtain empirical data comparing the sensitivity of Scheuneman's chi-square method, the Mantel-Haenszel method, and the Item Response Theory (IRT) Rasch model for differential item functioning (DIF) detection. The study used an experimental method with a 1 x 3 design. The independent variables were Scheuneman's chi-square method, the Mantel-Haenszel method, and the IRT Rasch model. This research aims to reveal: (1) the characteristics of test items based on classical test theory, (2) the standard error of measurement under classical test theory, and (3) the items detected as containing DIF based on gender. The analysis was based on examinees' responses to the Mathematics test of the National Examination at senior high schools in Tangerang, in the 2008/2009 academic year. The data source was 5,000 computer answer sheets, consisting of 2,500 male and 2,500 female students' responses, selected using simple random sampling. Descriptive analysis using classical test theory indicated that 28 out of 40 mathematics test items were good, with a reliability index of 0.827. The results indicated that all methods were good for the detection of DIF, but IRT
The purpose of this article was to introduce practitioners and researchers to the topic of missing data in the context of differential item functioning (DIF), review the current literature on the issue, discuss implications of the review, and offer suggestions for future research. DIF occurs when two or more distinct groups with equal ability differ in their probabilities of answering test items correctly (Holland & Wainer, 1993). Test takers produce missing data on educational assessments by omitting or not reaching one or more of the items. An omit happens when a test taker accidentally skips an item or, after reading it, fails to respond to it. Given that the individual responds to subsequent items, omitted responses occur earlier in a test. A test taker may not reach an item because of lack of time. Since the individual does not respond to subsequent items, not-reached responses occur at the end of a timed test (Ludlow & O'Leary, 1999).
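The omit versus not-reached distinction described above is purely positional: not-reached responses are the trailing run of missing items after the last answered item, and omits are missing items before it. A minimal sketch, assuming responses are given in presentation order with `None` marking a missing response (the function name and encoding are illustrative assumptions):

```python
def classify_missing(responses):
    """Split one examinee's missing responses into omits and
    not-reached items.

    responses: item responses in presentation order; None = missing
    Returns (omitted_indices, not_reached_indices)."""
    # Position of the last item the examinee actually answered
    answered = [i for i, r in enumerate(responses) if r is not None]
    last = answered[-1] if answered else -1
    # Missing before (or at) the last answered item = omitted
    omitted = [i for i, r in enumerate(responses) if r is None and i <= last]
    # Everything after the last answered item = not reached
    not_reached = list(range(last + 1, len(responses)))
    return omitted, not_reached
```

For example, the response pattern 1, missing, 0, 1, missing, missing yields one omit (the second item) and two not-reached items (the last two), matching the timed-test interpretation in Ludlow & O'Leary (1999).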
regression technique to match on all the possible traits simultaneously. More extensive research by Mazor, Hambleton, and Clauser (1998) compared the results of the MH procedure and logistic regression for differential item functioning analysis under matching based on total test score, matching based on subtest scores, and multivariate matching based on multiple subtest scores. Their simulation study varied the dimensional structure, the item discrimination parameter, and the correlation between traits. When identical matching criteria were used, the MH procedure and logistic regression produced similar outcomes. Logistic regression had a potential advantage over the Mantel-Haenszel statistic for multivariate matching because it avoided the sparseness problems that arise when examinees are matched on multiple variables in the Mantel-Haenszel procedure. In other cases, logistic regression produced results extremely similar to those of the Mantel-Haenszel procedure (Swaminathan & Rogers, 1990). Among the three matching criteria, total test score was much less accurate than the other two, and matching on multiple subtest scores simultaneously was superior to matching on total test scores or on individual relevant subtest scores. It is the intent of this thesis to explore a matching criterion created from profile scores of mastered/nonmastered skills as
The main purpose of this study was to obtain empirical data on the index of crossing differential item functioning (CDIF) and on the difference in sensitivity among IRT-based CDIF detection methods. The study used a quasi-experimental method with a 1 x 3 design. The independent variables were Raju's Area measure, Lord's chi-square, the likelihood ratio test, and sample size, while the dependent variable was the CDIF index. Many methods have been developed for detecting uniform DIF, whereas few have been developed for detecting CDIF. The three methods compared were Raju's Area measure, Lord's chi-square, and the likelihood ratio test. The manipulated factors were sample size, the ability difference between the two groups of respondents, the percentage of CDIF, and the respondents' test results. The results showed that all three methods can be used to detect CDIF, but Raju's Area method was much more sensitive in detecting CDIF than the other two methods. In addition, no DIF attributable to gender differences was found at the high ability level.
the same test, which threatens the validity of the measurements. Also, test fairness is violated if tests lead to different conclusions for distinct groups of people. Differential item functioning (DIF) means that measurement invariance is violated at the item level. More precisely, DIF is present if one or more items are significantly more difficult for one group than for the other after controlling for the underlying ability or trait. One can distinguish between uniform and nonuniform DIF. Uniform DIF means that the difference between the groups is constant across levels of the latent continuum; if the difference depends on the ability or trait of the person, nonuniform DIF is present. DIF detection procedures can also be classified into item response theory (IRT) methods and non-IRT methods. The IRT methods, also called parametric methods, are those in which an IRT model is used for the detection of DIF. For an overview of IRT and non-IRT methods, see Holland and Wainer (1993), Magis, Béland, Tuerlinckx, and De Boeck (2010), and Penfield and Camilli (2006).
Chapter 3: Methodology
This research was multi-faceted. First, the case needed to be made for using a latent class, rather than a manifest group, approach for studying differential item functioning. To highlight the inadequacies of manifest approaches to DIF detection, the Mantel-Haenszel procedure was used as a representative approach and simulated data with varying amounts of DIF were analyzed. Since the characteristics of these data were known, there existed a standard against which to judge the efficacy of the currently employed procedures. Second, the mixed Rasch model was applied to a subset of these simulated data to investigate the strengths and weaknesses of that approach. Next, a strategy for applying a latent class approach needed to be delineated, including the development of a series of protocols and recommendations for use. Finally, this strategy was applied to real data from an assessment of English Language proficiency that was being field-tested at the time the data were collected. This test was chosen because a relatively large number of covariates were available along with the item responses.
3. What is the minimum number of experts who must give their opinion on the existence of DIF?
1.1. Purpose and Importance of the Research
In this research, analyses are performed using both the objective methods frequently employed in the literature to determine differential item functioning with regard to gender for the items constituting the test, and the subjective method based on expert opinions, with the aim of revealing the relation between the methods. An appreciable number of national and international studies determining DIF using objective methods can be found in the literature. However, studies determining DIF by the subjective method, based solely on the content of the test items, are far fewer. Therefore, this research will be one of the few studies [15, 16] to determine gender-based DIF using the subjective method, with all items of a large-scale exam in Turkey examined by experts.
The Mantel-Haenszel (M-H) procedure was originally used to match subjects retrospectively on cancer risk factors in order to study current cancer rates (Mantel & Haenszel, 1959). The procedure has since been adapted to study differential item functioning and is now the primary DIF detection device used at the Educational Testing Service (ETS; Dorans & Holland, 1993). The M-H method works by first
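As a sketch of the computations the M-H procedure builds on: examinees are stratified by a matching score, a 2x2 table of group by correctness is formed in each stratum, and the tables are combined into a common odds ratio and a continuity-corrected chi-square; ETS rescales the log-odds ratio onto its delta metric. The function name and interface below are illustrative assumptions; the formulas are the standard M-H statistics:

```python
from collections import defaultdict
import math

def mantel_haenszel_dif(scores, groups, item):
    """Mantel-Haenszel DIF statistics for one dichotomous item.

    scores: matching variable (e.g., total test score) per examinee
    groups: 'ref' or 'foc' per examinee
    item:   0/1 response to the studied item per examinee
    Returns (alpha_MH, MH chi-square, ETS delta).
    alpha_MH > 1 (delta < 0) indicates the item favors the reference group.
    """
    # One 2x2 table per score stratum:
    # [A, B, C, D] = ref-correct, ref-incorrect, foc-correct, foc-incorrect
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for s, g, y in zip(scores, groups, item):
        idx = (0 if y == 1 else 1) if g == 'ref' else (2 if y == 1 else 3)
        tables[s][idx] += 1

    num = den = a_sum = e_sum = v_sum = 0.0
    for A, B, C, D in tables.values():
        N = A + B + C + D
        if N < 2:
            continue  # stratum too sparse to contribute
        num += A * D / N
        den += B * C / N
        nR, nF = A + B, C + D          # group margins
        m1, m0 = A + C, B + D          # correct/incorrect margins
        a_sum += A
        e_sum += nR * m1 / N           # expected A under no DIF
        v_sum += nR * nF * m1 * m0 / (N * N * (N - 1))
    alpha = num / den
    chi2 = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum  # continuity-corrected
    delta = -2.35 * math.log(alpha)                 # ETS delta scale
    return alpha, chi2, delta
```

The chi-square is referred to a 1-df chi-square distribution, and |delta| is then used to sort items into the ETS A/B/C severity categories.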
This article briefly describes the main procedures for performing differential item functioning (DIF) analyses and points out some of the statistical and extra-statistical implications of these methods. Research findings on the sources of DIF, including those associated with translated tests, are reviewed. As DIF analyses are oblivious to correlations between a test and relevant criteria, the elimination of differentially functioning items does not necessarily improve predictive validity or reduce predictive bias. The implications of past DIF research for test development in the multi-lingual and multi-cultural South African society are considered.