Using Statistical and Judgmental Reviews to Identify and Interpret Translation Differential Item Functioning

(1)

The Alberta Journal of Educational Research Vol. XLV, No. 4, Winter 1999,353-376

Mark J. Gierl

W. Todd Rogers

a n d

Don A. Klinger

University of Alberta

Using Statistical and Judgmental Reviews to

Identify and Interpret Translation

Differential Item Functioning

The purpose of this study was to evaluate the equivalence of two translated tests using statistical and judgmental methods. Performance differences for a large random sample of English- and French-speaking examinees were compared on a grade 6 mathematics and social studies provincial achievement test. Items displaying differential item functioning (DIF) were flagged using three popular statistical methods—ManteTHaenszel, Simultaneous Item Bias Test, and logistic regression—and the substantive meaning of these items was studied by comparing the back-translated form with the original English version. The items flagged by the three statistical procedures were relatively consistent, but not identical across the two tests. The correlation between the DIF effect size measures were also strong, but far from perfect, suggesting that two procedures should be used to screen items for translation DIF. To identify the DIF items with translation differences, the French items were back-translated into English and compared with the original English items by three reviewers. Two of seven and six of 26 DIF items in mathematics and social studies respectively were judged to be nonequivalent across language forms due to differences introduced in the translation process. There were no apparent translation differences for the remaining items, revealing the neces-sity for further research on the sources of translation differential item functioning. Results from this study provide researchers and practitioners with a better understanding of how

three popular DIF statistical methods compare and contrast. The results also demonstrate how statistical methods inform substantive reviews intended to identify items with transla-tion differences.

Le but de cette étude était d'évaluer l'équivalence de deux examens traduits avec des méthodes basées sur les statistiques et d'autres reposant sur le jugement. On a comparé les différences dans la performance d'un grand échantillon aléatoire de sujets anglophones et francophones qui avaient complété des examen provinciaux de sixième année en mathématiques et en études sociales. Les items démontrant une divergence par rapport aux autres (differential item functioning - DIF) ont été marqués d'un indicateur dans le contexte de trois méthodes statistiques bien connues - Mantel-Haenszel, Simultaneous Item Bias Test et la régression

Mark Gierl is an assistant professor of educational psychology. His research interests include educational measurement and evaluation, with an emphasis on assessment and cognition, differential item functioning, and item response theory.

Todd Rogers is a professor in the Department of Educational Psychology and Director of the Centre for Research in Applied Measurement and Evaluation. He specializes in measurement and evaluation.

(2)

M . Gierl, W.T. Rogers, and D. Klinger

logistique. La signification de fond de ces items a été étudiée en comparant la version traduite de l'examen avec l'original en anglais. Les items marqués par les trois procédures statistiques étaient relativement constants mais pas identiques d'une version à l'autre. Alors que la corrélation entre les mesures de l'effet DIF étaient aussi forte, elle était loin d'être parfaite, ce qui suggère que l'on devrait avoir recours à deux procédures dans le dépistage du DIF en traduction. Pour identifier les items DIF présentant des différences en traduction, trois réviseurs ont comparé les items français retraduits en anglais avec les originaux en anglais. Ceux-ci ont jugé que deux sur sept items en mathématiques et six sur vingt-six items en études sociales n'étaient pas équivalents d'une langue à l'autre à cause des différences introduites par le processus de traduction. Les autres items ne présentaient pas de différences apparentes de traduction, ce qui révèle le besoin de poursuivre la recherche sur les sources du DIF en traduction. Les résultats de cette étude aideront les chercheurs et les praticiens à mieux comprendre les similarités et les différences entres trois méthodes statistiques DIF souvent employées. De plus, ils démontrent comment les méthodes statistiques contribuent aux études de signification dont le but est l'identification des items présentant des différences de traduction.

Item bias is a serious concern for test developers a n d test users. Bias results i n systematic errors that distort the inferences m a d e f r o m a test for members of a particular g r o u p s u c h as female, N a t i v e , o r French-speaking examinees. Bias occurs w h e n items are w o r d e d i n s u c h a w a y that examinees f r o m a specific g r o u p w h o are k n o w l e d g e a b l e about the construct of interest are prevented f r o m d e m o n s t r a t i n g their k n o w l e d g e . I n most cases, test items are biased because they contain sources of difficulty that are irrelevant or extraneous to the construct b e i n g tested, a n d this difficulty factor adversely affects test per-formance ( C a m i l l i & S h e p a r d , 1994).

Bias is also a concern w h e n a test is translated or adapted f r o m one language or culture to another ( A l l a l o u f & Sireci, 1998; B u d g e l l , Raju, & Quartetti, 1995; H a m b l e t o n , 1994). F o r example, the m e a n i n g of an item can change d u r i n g test translation. H a m b l e t o n p r o v i d e s o n e illustrative example. I n a S w e d i s h -E n g l i s h c o m p a r i s o n , -E n g l i s h - s p e a k i n g examinees were presented w i t h this item:

Where is a bird with webbed feet most likely to live? a. in the mountains

b. in the woods c. in the sea d. in the desert.

In the S w e d i s h translation the phrase webbed feet became swimming feet, thereby p r o v i d i n g a n o b v i o u s clue to the S w e d i s h - s p e a k i n g examinees about the cor-rect o p t i o n for this item. This type of testing p r o b l e m has a n i m p o r t a n t a n d o b v i o u s consequence: A difference i n student performance resulting f r o m s u c h a n i t e m m a y b e attributed to a g r o u p difference i n achievement that is u n -f o u n d e d because the i t e m is not equivalent across the t w o language groups. Differential Item Functioning, Item Bias, and Item Impact

(3)

-Using Statistical and judgmental Reviews

tematic because it constantly distorts performance for members of the g r o u p . W h e n a test i t e m u n f a i r l y favors one g r o u p of examinees over another, the item is biased. A l t e r n a t i v e l y , g r o u p disparity i n i t e m performance that reflects actual k n o w l e d g e a n d experience differences o n the construct of interest is called item impact. Impact is also constant for the members of a particular g r o u p , but these effects reflect performance differences that the test is intended to measure (Clauser & M a z o r , 1998).

T h e l i n k between D I F , bias, a n d i m p a c t is largely methodological: Statistical analyses are u s e d to identify items w i t h D I F , a n d j u d g m e n t a l analyses are used to determine if D I F is attributable to bias o r to impact for members of a specific g r o u p . R e v i e w s intended to evaluate the fairness of a n i t e m cannot proceed as either a statistical or a j u d g m e n t a l analysis; b o t h procedures are needed ( L i n n , 1993; v a n d e V i j v e r , 1994).

Purpose of Study

Statistical a n d substantive issues i n test translation are addressed i n this study. First, the results f r o m three statistical methods designed to identify D I F are c o m p a r e d : M a n t e l - H a e n s z e l , Simultaneous Item Bias Test, a n d logistic regres-sion. C u r r e n t l y a l l three methods are p o p u l a r , b u t few studies have c o m p a r e d the outcomes f r o m these procedures u s i n g a large sample of student response data f r o m a major testing p r o g r a m . In this s t u d y , E n g l i s h - a n d French-speaking examinees w h o wrote a grade 6 mathematics a n d social studies achievement test are c o m p a r e d . A n e w D I F effect size measure used w i t h the logistic regres-sion p r o c e d u r e , as p r o p o s e d b y Z u m b o a n d T h o m a s (1996), is also evaluated b y c o m p a r i n g the results f r o m this measure against existing effect size measures u s e d w i t h the M a n t e l - H a e n s z e l procedure a n d the Simultaneous Item Bias Test.

Second, the utility of back-translation as a j u d g m e n t a l m e t h o d for interpret-i n g D I F interpret-is evaluated. Recall that the dinterpret-istinterpret-inctinterpret-ion between D I F , interpret-item binterpret-ias, a n d item i m p a c t is i m p o r t a n t because D I F is a statistical concept a n d bias a n d i m p a c t are substantive concepts. J u d g m e n t a l reviews intended to evaluate the equivalence of a test s h o u l d rely o n both statistical a n d j u d g m e n t a l analyses. In this s t u d y , back-translation is used to identify items that differ i n m e a n i n g b e t w e e n the E n g l i s h a n d French forms of the tests. There is n o attempt to account for i t e m impact. J u d g m e n t a l reviews that y i e l d interpretable results are essential for i d e n t i f y i n g items w i t h translation differences a n d for c o n t r o l l i n g this p r o b l e m i n future forms of the test. T h e efficacy of back-translation for a c h i e v i n g this outcome is evaluated b y c o m p a r i n g the results f r o m t w o back-translations of the same test against the original test a n d against the statistical outcomes.

Overview of Statistical and Judgmental Procedures

(4)

M. Gierl, W.T. Rogers, and D. Klinger

Mantel-Haenszel

M a n t e l H a e n s z e l ( M H ) is a nonparametric a p p r o a c h for i d e n t i f y i n g D I F ( H o l -l a n d & Thayer, 1988; M a n t e -l & H a e n s z e -l , 1959). M H yie-lds a chi-square test w i t h one degree of freedom to test the n u l l hypothesis that there is n o relation b e t w e e n g r o u p m e m b e r s h i p a n d test performance o n one item after controlling for ability. T h e M H procedure is also used to estimate the constant o d d s ratio that y i e l d s a measure of effect size for evaluating the a m o u n t of D I F . M H is c o m p u t e d b y m a t c h i n g examinees i n each g r o u p o n total test score a n d then f o r m i n g a 2-by-2-by-K contingency table for each item, where the score is level o n the m a t c h i n g variable of total test score.

Research at the E d u c a t i o n a l Testing Service has resulted i n p r o p o s e d values for c l a s s i f y i n g the D I F effect size measure at the i t e m level. D I F is considered negligible w h e n the effect size measure A M H is not significantly different f r o m 0 a n d the m a g n i t u d e of the I A M H I <1. D I F is considered moderate w h e n A M H is significantly different f r o m 0 a n d 1< I A M H I <1.5. D I F is considered large w h e n

A M H is significantly greater than 0 a n d I A M H I SI.5. (Zieky, 1993; Z w i c k &

E r c i k a n , 1989). These ratings are referred to as A - , B-, a n d C - l e v e l D I F to denote negligible, moderate, a n d large amounts of D I F .

Simultaneous Item Bias Test

T h e S i m u l t a n e o u s Item Bias Test (SIBTEST) is a n alternative statistical m e t h o d for detecting D I F p r o p o s e d b y Shealy a n d Stout (1993). SIBTEST is intended to m o d e l m u l t i d i m e n s i o n a l data, a l t h o u g h it can be used for u n i d i m e n s i o n a l data as w e l l . T h e statistical hypothesis tested b y SIBTEST is:

HO:B(T) = PR(T)-PF(T) = 0 vs.

HX:B(T) = PR(T)-PF(T)*0,

w h e r e B(T) is the difference i n probability of a correct response o n the studied item for examinees i n the Reference a n d Focal groups matched o n true score; PR(T) is the p r o b a b i l i t y of a correct response o n the studied item for examinees i n the Reference g r o u p w i t h true score T; a n d Pp(T) is the probability of a correct response o n the s t u d i e d item for examinees i n the Focal g r o u p w i t h true score T. In other w o r d s B(T), the parameter representing the a m o u n t of u n i d i m e n s i o n a l D I F w h e n a single test item is evaluated, is 0 w h e n there is n o D I F a n d n o n z e r o w h e n D I F is present. W i t h the SIBTEST approach, items o n the test are d i v i d e d into t w o subsets, the suspect subtest a n d the m a t c h i n g subtest. T h e suspect subtest contains the biased item, a n d the m a t c h i n g subtest contains the rest of the items. F o r each m a t c h i n g subtest score k, the c o r r e s p o n d i n g subtest true score for the Reference a n d Focal groups, is es-timated u s i n g linear regression. The eses-timated true scores are then adjusted u s i n g a regression correction technique to ensure that the estimated true score is c o m p a r a b l e for the examinees i n the Reference a n d Focal g r o u p s o n the m a t c h i n g subtest. In the final step, B(T) is estimated u s i n g è, w h i c h is the w e i g h t e d s u m of the differences between the proportion-correct true scores o n the s t u d i e d item for examinees i n the t w o groups across all score levels.

(5)

Using Statistical and Judgmental Reviews

the n u l l hypothesis is rejected a n d I ê I <0.059, D I F is considered negligible; w h e n the n u l l hypothesis is rejected a n d 0.059< I ê I <0.088, D I F is considered moderate; a n d w h e n the n u l l hypothesis is rejected a n d I ê I >0.088, D I F is considered large. These guidelines are used to classify items i n category A , B , or C (i.e., negligible, moderate, a n d large amounts of DIF).

S I B T E S T differs f r o m M H i n a n u m b e r of w a y s . First, SIBTEST uses a regression estimate of the true score instead of a n observed score as the match-i n g varmatch-iable. A s a result, exammatch-inees are matched o n a latent rather than a n observed score. Second, SIBTEST c a n be used to evaluate D I F i n t w o o r more items s i m u l t a n e o u s l y i n the analysis. This feature allows the developer to assess D I F m o r e effectively i n testlets or item b u n d l e s o n a n test (Douglas, Roussos, & Stout, 1996). T h i r d , SIBTEST can be used to assess D I F iteratively b y initially u s i n g a l l items i n the m a t c h i n g variable a n d then systematically r e m o v i n g D I F items f r o m the m a t c h i n g test u n t i l a subtest of items w i t h o u t D I F is identified. SIBTEST performs s i m i l a r l y to M H i n i d e n t i f y i n g u n i f o r m D I F e v e n w i t h s m a l l samples of examinees ( N a r a y a n a n & S w a m i n a t h a n , 1994; Roussos & Stout, 1996).

JjOgistic Regression

A t h i r d a p p r o a c h c o m m o n l y used to identify D I F is logistic regression (LR) ( S w a m i n a t h a n & Rogers, 1990). L R can detect both uniform a n d nonuniform D I F , w h i c h p r o v i d e s a distinct advantage over M H a n d SIBTEST as these t w o procedures were designed to detect o n l y u n i f o r m D I F . U n i f o r m D I F exists w h e n there is n o interaction between ability level a n d g r o u p m e m b e r s h i p . That is, the p r o b a b i l i t y of a n s w e r i n g a n item correctly is greater for one g r o u p u n i f o r m l y over all ability levels. In item response theory terminology, u n i f o r m D I F is i n d i c a t e d b y parallel item characteristic curves. N o n u n i f o r m D I F occurs w h e n there is a n interaction between ability level a n d g r o u p m e m b e r s h i p . In this case the difference i n the probabilities of a correct response for the t w o g r o u p s is n o t the same at a l l levels of ability. S i m u l a t i o n studies have been c o n d u c t e d to demonstrate that L R is i n fact more p o w e r f u l than M H a n d SIBTEST at detecting n o n u n i f o r m D I F ( N a r a y a n a n & S w a m i n a t h a n , 1996; Rogers & S w a m i n a t h a n , 1993).

The presence of D I F i n the L R a p p r o a c h is determined b y testing the i m -p r o v e m e n t i n m o d e l fit that occurs w h e n a term for g r o u -p m e m b e r s h i -p a n d a term for the interaction between test score a n d g r o u p m e m b e r s h i p are succes-sively a d d e d to the regression m o d e l . A chi-square test is then used to evaluate the presence of u n i f o r m a n d n o n u n i f o r m D I F o n the item of interest b y testing each term i n c l u d e d i n the m o d e l . T h e general m o d e l for logistic regression takes the f o r m

P(M=1) =

(6)

is, M o d e l 1 subtracted f r o m M o d e l 2 (2 = ßo + ß i X + $iG). The presence of n o n u n i f o r m D I F is tested b y e x a m i n i n g the i m p r o v e m e n t i n chi-square m o d e l fit associated w i t h a d d i n g a term for g r o u p m e m b e r s h i p (G) and a term for the interaction between test score a n d g r o u p m e m b e r s h i p ( X G ) against M o d e l 2, i n other w o r d s , M o d e l 2 subtracted f r o m M o d e l 3 (z = ßo + ß i X + ß2G + ß3XG). N o n u n i f o r m D I F can be tested w i t h L R regardless of the outcome f r o m the u n i f o r m D I F test because each m o d e l contains different terms.

Z u m b o a n d T h o m a s (1996) developed a n index to quantify the m a g n i t u d e of D I F for the L R procedure based on partitioning a w e i g h t e d least-squares estimate of R2 that yields a n effect size measure (also see Pope, 1997; T h o m a s & Z u m b o , 1996; Z u m b o , 1999). This index is obtained first b y c o m p u t i n g the R2

measure of fit for each term i n the L R m o d e l (i.e., test score, g r o u p m e m b e r s h i p , test score-by-group m e m b e r s h i p interaction) a n d then b y p a r t i t i o n i n g the R2

for each of the terms. A D I F effect size for the g r o u p m e m b e r s h i p term is p r o d u c e d b y subtracting the R2 for the total test score term ( M o d e l 1) f r o m the

R2 for the g r o u p m e m b e r s h i p term ( M o d e l 2). The result is an effect size

measure associated w i t h g r o u p m e m b e r s h i p that quantifies the m a g n i t u d e of u n i f o r m D I F (herein called R2A-U). A second D I F effect size is p r o d u c e d for the

total score-by-group m e m b e r s h i p term b y subtracting the R2 for the g r o u p

m e m b e r s h i p term ( M o d e l 2) f r o m the R2 for the total score-by-group member-s h i p interaction term ( M o d e l 3). The remember-sult imember-s an effect member-size meamember-sure amember-smember-sociated w i t h the total scorebygroup m e m b e r s h i p interaction that quantifies the m a g -n i t u d e of -n o -n u -n i f o r m D I F (herei-n called R2A - N). As w i t h the M H a n d SIB-T E S SIB-T effect size measures, R2A can be used w i t h the L R significance test to identify items w i t h D I F . T o date, h o w e v e r , R2A has been used sparingly w i t h L R i n D I F research.

J o d o i n (1999) p r o p o s e d guidelines for interpreting R2A (also see G i e r l & M c E w e n , 1998). A n item has negligible or A - l e v e l D I F w h e n the chi-square test for m o d e l fit is not statistically significant or w h e n R2A<0.035. A n item has moderate or B-level D I F w h e n the chi-square test is statistically significant a n d w h e n 0.035 <R2A<0.070. A n item has large or C - l e v e l D I F w h e n the chi-square test is statistically significant a n d w h e n R2A>0.070. These guidelines are a p -plicable to both u n i f o r m and n o n u n i f o r m D I F a n d were used to classify D I F items i n this s t u d y .

Back-Translation

Back-translation is a p o p u l a r a n d w e l l - k n o w n j u d g m e n t a l m e t h o d for evaluat-i n g the equevaluat-ivalence of t w o language forms (van de Vevaluat-ijver & L e u n g , 1997). In the basic d e s i g n , the source language test is first translated into the target language, then back-translated into the source language b y a different translator. The equivalence of the o r i g i n a l source a n d target language forms is assessed b y a r e v i e w e r or committee of reviewers w h o compare the original a n d back-trans-lated source language forms for comparability i n m e a n i n g (Brislin, 1970,1986; H a m b l e t o n & B o l l w a r k , 1991; W e r n e r & C a m p b e l l , 1970).

(7)

Using Statistical and Judgmental Revieres

general check o n translation q u a l i t y a n d that it can detect translation differen-ces (Ellis, 1989; H a m b l e t o n , 1993; H u l i n , D r a s g o w , & K o m o c a r , 1982; v a n de V i j v e r & L e u n g , 1997). The back-translation d e s i g n also has disadvantages. For e x a m p l e , the e v a l u a t i o n of test equivalence is conducted o n l y i n one language, and there is n o assurance that the f i n d i n g s i n the source language generalize to the target language becaus*e the source-to-target language translation is not directly assessed. T h i s p r o b l e m stems f r o m the a s s u m p t i o n that errors m a d e d u r i n g the o r i g i n a l translation w i l l not be m a d e d u r i n g the back-translation. H o w e v e r , this a s s u m p t i o n m a y not h o l d i n practice w h e n , for instance, skilled translators m a k e adjustments i n the translation to ensure the items are equivalent even w h e n the original source to target language items are different (Brislin, 1970; H a m b l e t o n & B o l l w a r k , 1991; H a m b l e t o n & Kanjee, 1995). This o u t c o m e m a y also occur if the back-translator i m p r o v e s the quality of the test i n situations w h e r e the o r i g i n a l translation is p o o r ( H a m b l e t o n , 1993). F i n a l l y , v a n de V i j v e r a n d L e u n g (1997) contend that the basic back-translation d e s i g n may result i n a literal translation at the expense of connotations, naturalness, and c o m p r e h e n s i b i l i t y across languages, especially w h e n translators k n o w their w o r k w i l l be evaluated w i t h back-translation.

M o d i f i e d back-translation designs have been p r o p o s e d to overcome some of the limitations i n the basic d e s i g n (Bracken & Barona, 1991; Brislin, 1986; H a m b l e t o n , 1993). In this s t u d y , another m o d i f i c a t i o n to the traditional back-translation d e s i g n is presented a n d u s e d . The p r o p o s e d design is an attempt to o v e r c o m e the limitations i n the basic d e s i g n as o u t l i n e d above. The m o d i f i e d d e s i g n is presented i n F i g u r e 1.

In the m o d i f i e d d e s i g n , the source language test is translated into the target language b y the test developer. T h e n the target language test is i n d e p e n d e n t l y back-translated into the source language b y t w o translators. Test equivalence is assessed b y c o m p a r i n g the t w o back-translations a n d the original source guage test. Differences between the t w o back-translations a n d the source lan-guage h i g h l i g h t potential translation p r o b l e m s . W i t h t w o back-translators w h o w o r k i n d e p e n d e n t l y , i n d i v i d u a l differences between them w i l l reduce the

Substantive Analysis Using Judgmental Method (e.g., Item Review Following Back-Translation)

Back Translation A _

Source Language Test (S1) (e.g., English)

Original

Translation Target Language _{Test (T1)}

(e.g., French)

Statistical Analysis Using DIF Method (e.g., SIBTEST)

Back Translation B

Substantive Analysis Using Judgmental Method (e.g.. Item Review Following Back-Translation)

(8)

l i k e l i h o o d that each w i l l make the same adjustments to ensure item equivalence w h e n the o r i g i n a l a n d source language items are i n fact different. W i t h t w o back-translators, it is also u n l i k e l y that both w i l l make the same changes to i m p r o v e the quality of a text i n situations where the o r i g i n a l trans-lation is poor. Rather, different changes w i l l be m a d e resulting i n inconsisten-cies between the t w o forms, thereby h i g h l i g h t i n g problems i n the original translation. F i n a l l y , the m o d i f i e d design has a check; the researcher can evaluate back-translator consistency b y c o m p a r i n g their translations w i t h each other i n a d d i t i o n to c o m p a r i n g their translations w i t h the original source l a n -guage.

Method Student Sample and Achievement Tests

D a t a f r o m 4,400 grade 6 students (2,200 E n g l i s h a n d 2,200 French Immersion) w h o w r o t e the 1997 a d m i n i s t r a t i o n of a p r o v i n c i a l mathematics achievement test a n d 4400 grade 6 students (2,200 E n g l i s h a n d 2,200 French Immersion) w h o wrote the 1997 a d m i n i s t r a t i o n of a p r o v i n c i a l social studies achievement test w e r e used. Because C a n a d a has t w o official languages, different language g r o u p s can be identified i n m a n y school districts. In this s t u d y the E n g l i s h -s p e a k i n g examinee-s repre-sent the d o m i n a n t language g r o u p becau-se mo-st students receive instruction i n this language at E n g l i s h - s p e a k i n g schools. E n g l i s h s p e a k i n g students are tested i n E n g l i s h . A l t e r n a t i v e l y , the French I m -m e r s i o n students are i n p r o g r a -m s where French is the language of instruction. I m m e r s i o n p r o g r a m s are e m b e d d e d i n E n g l i s h s p e a k i n g schools. The I m m e r -sion p r o g r a m is designed for students whose first language is not French but w h o w a n t to become functionally fluent i n French a n d to develop an u n d e r -s t a n d i n g a n d appreciation of French culture i n a d d i t i o n to ma-stering E n g l i -s h . T h u s F r e n c h I m m e r s i o n students are linguistically distinct f r o m English-speak-i n g students. I m m e r s English-speak-i o n students are tested English-speak-i n French. The four samples were r a n d o m l y selected f r o m a database containing approximately 38,000 E n g l i s h -and 3,000 F r e n c h - s p e a k i n g grade 6 students for each test administration.

T h e mathematics test contained 50 multiple-choice items, a n d each item h a d f o u r options. Test items were classified into five c u r r i c u l a r content areas a n d t w o cognitive levels. The social studies test contained 49 multiple-choice items (the o r i g i n a l test contained 50 items, but one item was d r o p p e d d u e to a n o b v i o u s translation error), a n d each item h a d four options. Test items were classified into four c u r r i c u l a r content areas a n d t w o cognitive levels. For both tests items w e r e based o n concepts, topics, a n d facts f r o m the p r o v i n c e - w i d e P r o g r a m of Studies. The test score does not necessarily contribute to a student's final course grade, a l t h o u g h teachers are encouraged to m a r k the tests a n d use the results for student g r a d i n g .

(9)

developer. In this step the c o m p a r a b i l i t y of the E n g l i s h a n d French versions of the test w a s assessed b y c o m p a r i n g the t w o forms. The v a l i d a t i o n committee also referred to the P r o g r a m of Studies a n d to appropriate textbooks d u r i n g the v a l i d a t i o n step. O n c e the committee h a d r e v i e w e d the test, the translator a n d test d e v e l o p e r received comments a n d feedback o n the accuracy a n d a p p r o p r i -ateness of the translated test. T h i r d , the test developer, acting o n the recom-m e n d a t i o n s of the corecom-mrecom-mittee, decided o n the final changes. These changes w e r e m a d e b y the translator, a n d the translated test w a s finalized. F o u r t h , both the test d e v e l o p e r a n d the test d e v e l o p m e n t supervisor r e v i e w e d a n d f i n a l i z e d the translated test. The translator i n this process was a former teacher w i t h 23 years experience i n English-to-French translation.

Statistical Analysis

T w o of the three D I F statistics used i n this s t u d y — M a n t e l - H a e n s z e l a n d logis-tic regression—are based o n the a s s u m p t i o n that the test is u n i d i m e n s i o n a l . T o assess this a s s u m p t i o n a confirmatory factor analysis was conducted. The indicator variables i n the confirmatory factor analysis were created b y s u m -m i n g the ite-ms i n each curricular content area. A one-factor -m o d e l w a s fitted to b o t h the E n g l i s h a n d the French data for mathematics a n d social studies u s i n g L I S R E L 8.14 (Jöreskog & Sörbom, 1996). In a d d i t i o n , a m u l t i p l e - s a m p l e analysis w a s c o n d u c t e d to evaluate the factor structure, factor l o a d i n g , a n d error i n -variance across the E n g l i s h a n d French samples i n the mathematics a n d social studies tests (Jöreskog, 1971). Three nested m o d e l s were sequentially tested b y e q u a t i n g the n u m b e r of factors, factor loadings, a n d errors.

(10)

absolute residuals between the observed a n d the h y p o t h e s i z e d covariances. A s m a l l R M R indicates g o o d fit.

N e x t , D I F statistical analyses were conducted for each item f r o m the E n g l i s h a n d F r e n c h forms i n the mathematics a n d social studies achievement tests u s i n g M H , S I B T E S T , a n d L R . The item u n d e r consideration w a s i n c l u d e d i n f o r m i n g the score g r o u p s for M H a n d L R , a n d n o iterative p u r i f i c a t i o n w a s u s e d i n the D I F analyses. A l l test statistics were interpreted at a n alpha-level of 0.05. Items that w e r e flagged b y at least t w o of the statistical procedures w i t h fi-or C - l e v e l w e r e considered translation D I F items. This interpretation seems justified as B - a n d C - l e v e l items are typically scrutinized for potential bias i n test r e v i e w s ( Z i e k y , 1993).

Judgmental Analysis

T w o translators i n d e p e n d e n t l y back-translated the achievement tests f r o m F r e n c h to E n g l i s h . B o t h translators were certificated b y the A s s o c i a t i o n of Translators a n d Interpreters of A l b e r t a , w h i c h is a n association affiliated w i t h b o t h the C a n a d i a n Translators a n d Interpreters C o u n c i l (CTIC) a n d the Inter-national Federation of Translators. T o become certificated, translators m u s t pass the n a t i o n a l C T I C e x a m ; to r e m a i n they m u s t pass the C I T C exam once every three years. B o t h translators have been accredited since 1990, a n d b o t h have extensive experience translating business, i n d u s t r y , government, a n d educational texts f r o m F r e n c h to E n g l i s h . The translators were b l i n d to the outcomes f r o m the statistical analyses, a n d they h a d n o contact w i t h one another d u r i n g this s t u d y .

N e x t three r e v i e w e r s — o u r s e l v e s — i n d e p e n d e n t l y evaluated the c o m -parability between the E n g l i s h source language tests a n d the back-translated tests. E a c h r e v i e w e r w a s g i v e n the original E n g l i s h f o r m , the French translated f o r m , a n d the t w o back-translated forms. The reviewers were explicitly asked to evaluate the c o m p a r a b i l i t y i n m e a n i n g between the original E n g l i s h f o r m a n d the t w o back-translated forms u s i n g a three-point rating scale: 1 = N o C h a n g e i n M e a n i n g , 2 - M i n o r C h a n g e i n M e a n i n g , a n d 3 = M a j o r C h a n g e i n M e a n i n g . The scale w a s used b y the reviewers to identify items that were translated incorrectly.

(11)

a 1, then the i t e m w a s d e e m e d equivalent; if at least t w o of the three ratings were either a 2 a n d / o r 3, then the i t e m w a s d e e m e d not equivalent.

Results

Psychometric Characteristics of the Test Forms and Items

A s u m m a r y of the observed psychometric characteristics o n the mathematics a n d social studies tests for the E n g l i s h - a n d French-speaking examinees is presented i n Table 1. T y p i c a l l y , the differences reported i n Table 1 are tested for statistical significance between groups. H o w e v e r , the large samples used i n this s t u d y resulted i n m a n y differences that were statistically b u t n o t practically significant. H e n c e statistical outcomes are n o t reported. Instead, some general trends are h i g h l i g h t e d . First, the psychometric characteristics of the items were c o m p a r a b l e b e t w e e n the E n g l i s h - a n d French-speaking examinees. T h e measures of internal consistency, difficulty, a n d d i s c r i m i n a t i o n were quite s i m i l a r for b o t h language g r o u p s i n mathematics a n d social studies. Second, the m e a n f o r the French-speaking examinees w a s somewhat higher that the m e a n for the E n g l i s h - s p e a k i n g examinees i n mathematics; the order was reversed i n social studies. H o w e v e r , f o r b o t h tests the effect sizes associated w i t h these m e a n differences were relatively s m a l l . T h i r d , the standard deviations, skew-ness, a n d kurtosis were s i m i l a r between the t w o groups for each test. Fourth, the n u m b e r of w o r d s o n the F r e n c h forms w a s noticeably larger than the E n g l i s h f o r m s , especially i n social studies.

Factor Structure Within and Across Language Groups

Results f r o m the c o n f i r m a t o r y factor analysis s u p p o r t e d the u n i d i m e n s i o n a l a s s u m p t i o n . T h e one-factor m o d e l p r o v i d e d excellent fit to the E n g l i s h a n d

Table 1

Psychometric Characteristics for the English and French Forms in

Mathematics and Social Studies

Mathematics Social Studies Characteristic English French English French

No. of Examinees 2,200 2,200 2,200 2,200

No. of Items 50 50 49 49

No. of Words 2,713 3,066 3,354 4,157

Mean 35.44 37.12 32.67 31.75

SD 8.34 7.57 8.29 7.71

Skewness -.49 -.65 -.44 -.34

Kurtosis -.47 -.11 -.43 -.49

Internal Consistency3 .89 .87 .87 .84

Mean Item Difficulty .71 .74 .67 .65

SD Item Difficulty .15 .14 .12 .12

Range Item Difficulty .26-.91 .22-.94 .39-.86 .39-.87

Mean Item Discrimination15 .48 .45 .43 .38

SD Item Discrimination .12 .11 .11 .11

Range Item Discrimination .05-.66 .09-.67 .15-.63 .17-.59

aCronbach's alpha

(12)

Table 2

Fit Indices for the One-Factor Model Across Content Areas as a Function of

Language Group

df RMSEA RMR ContentArea English French English French English French English French

Mathematics One-Factor

Model 16.16" 21.75* 5 5 .032 .039 .034 .040

Social Studies One-Factor

Model 2.08 1.99 2 2 .004 .000 .024 .025

'fx 0.01

F r e n c h data o n b o t h the mathematics a n d social studies achievement tests, as s h o w n i n Table 2. A l t h o u g h the chi-square values were statistically significant for the E n g l i s h a n d F r e n c h sample o n the mathematics test, the R M S E A a n d the R M R w e r e s m a l l , i n d i c a t i n g g o o d m o d e l fit.

Results f r o m the m u l t i p l e - s a m p l e analysis also suggested that the n u m b e r of factors a n d factor loadings were invariant across language groups o n the mathematics a n d social studies tests. T h e results of the m u l t i p l e - s a m p l e analy-sis are p r o v i d e d i n Table 3. T h e one-factor m o d e l w a s fitted separately for the E n g l i s h a n d F r e n c h sample, a n d a chi-square statistic w a s c o m p u t e d to assess parameter invariance across the t w o groups. Three nested models were se-quentially tested u s i n g this a p p r o a c h b y equating the n u m b e r of factors, factor loadings, a n d errors. F o r the mathematics test, M o d e l s 1 a n d 2 were n o t statis-tically different, whereas M o d e l s 2 a n d 3 were statisstatis-tically different, i n d i c a t i n g that the observed variables were not equally reliable across the t w o language g r o u p s . Despite this difference, the R M S E A a n d R M R for M o d e l s 1, 2, a n d 3 were s m a l l i n d i c a t i n g strong m o d e l fit. A s i m i l a r pattern of results occurred w i t h the social studies test data as M o d e l s 1 a n d 2 were not statistically dif-ferent b u t M o d e l s 2 a n d 3 were difdif-ferent. A g a i n , h o w e v e r , the R M S E A a n d R M R were s m a l l for a l l three models, i n d i c a t i n g g o o d m o d e l fit. I n short, w h e n w e take into account the sensitivity of chi-square to sample size a n d examine the R M S E A a n d R M R , there is strong evidence to suggest that the n u m b e r of factors a n d the factor l o a d i n g s are invariant across the E n g l i s h a n d F r e n c h g r o u p s i n mathematics a n d social studies.

Comparison of DIF Classification Using Three Statistical Procedures

Item classification p r o d u c e d b y the three D I F p r o c e d u r e s — M a n t e l - H a e n s z e l ( M H ) , S i m u l t a n e o u s Item Bias Test (SIBTEST), a n d logistic regression ( L R ) — w a s c o m p a r e d across the mathematics a n d social studies tests. Classification consistency across the procedures is s u m m a r i z e d i n Table 4. In all comparisons that f o l l o w , items w i t h a B - or C - l e v e l rating are considered D I F items, whereas those w i t h a n A - l e v e l rating are not.

(13)

Using Statistical and judgmental Revieivs

Table 3

Tests for Invariant Models Between English and French Examinees

Content Area X2 df RMSEA RMR

Mathematics •

Model 1

Equated Number of Factors 37.91* 10 .036 .040

Model 2

Equated Factor Loadings

Model 3

Equated Factor Loadings Equated Errors

Social Studies

Model 1

Equated Number of Factors 4.07 4 .003 .025

Model 2

Equated Number of Factors 15.21 7 .023 .110

Equated Factor Loadings

Model 3

Equated Factor Loadings Equated Errors

Model Comparison X2 df

Mathematics

Model 1 vs. Model 2 .68 4

Model 2 vs. Model 3 16.55* 5

Social Studies

Model 1 vs. Model 2 11.14 3

Model 2 vs. Model 3 13.35* 4

*p<.01.

identified b y SIBTEST. A l l 6 items identified b y M H (items 6, 8,15, 40,44, a n d 47) w e r e identified b y L R as d i s p l a y i n g u n i f o r m D I F . O f the 6 items flagged by SIBTEST, 5 w e r e identified b y L R for u n i f o r m D I F (items 6, 8, 41, 44, a n d 47). T h e o n l y item w i t h n o n u n i f o r m D I F , item 16, w a s identified b y L R .

(14)

Table

4 Classification Consistency Across the Three Differential Item Functioning Procedures as a Function of Test

Mathematics Social Studies

MH SIBTEST LR-U LR-N MH SIBTEST LR-U LR-N

MH 6 19

SIBTEST 4 6 19 27

LR-U 6 5 10 19 26 27

LR-N 0 0 1 1 1 2 2 2

(15)

Table 5

Correlation Coefficients Across the Four DIF Effect Size Measures as a Function of Test

Mathematics Social Studies

A-MH è F?A-U FFA-N A-MH è hfA-U f?A-N

A- M H — —

è -.96 — -.99 —

P^A- U .91 .92 .93 .95

P?A-N -.03 • 01 -.00 — .27 29 .25 •

Note. A- M H is Delta-Mantel-Haenszel; È is the effect size measure in the Simultaneous Item Bias Test; FpA-U is P? change for the Group variable in logistic

regression associated with uniform DIF; and PpA-N is R2 change for the Total Score-by-Group Membership interaction term In logistic regression associated with

(16)

Table 6

Results of the Mathematics Item Review by Three Raters

Reviewer Rating

Item Statistical Flag (Rating) Favors Reviewer 1 Reviewer 2 Reviewer 3 Overall Rating

6 MH (C), SIBTEST (B), LR (C) English 1 1 1 Equivalent

8 MH (B), SIBTEST (B), LR (C) English 1 1 1 Equivalent

15 MH (B), LR (B) French 1 1 1 Equivalent

40 MH (B), LR (C) English 1 1 1 Equivalent

41 SIBTEST (B), LR (B) English 2 1 1 Equivalent

44 MH (C), SIBTEST (C), LR (C) French 3 3 3 Not Equivalent

47 MH (B), SIBTEST (C), LR (B) French 3 3 3 Not Equivalent

(17)

Table

7 Results of the Social Studies Item Review by Three Raters

Reviewer Rating

Item Statistical Flag (Rating) Favors Reviewer 1 Reviewer 2 Reviewer 3 Overall Rating

2 MH (C), SIBTEST (C), LR (C) French 3 3 3 Not Equivalent

3 MH (B), SIBTEST (C), LR (C) English 1 1 2 Equivalent

5 MH (C), SIBTEST (C), LR (C) French 1 1 1 Equivalent

6 MH (C), SIBTEST (C), LR (C) French 1 2 1 Equivalent

9 SIBTEST (C), LR (B) English 1 1 1 Equivalent

11 MH (B), SIBTEST (C), LR (C) English 3 Not Equivalent

13 MH (B), SIBTEST (C), LR (B) French 1 1 1 Equivalent

16 MH (A), SIBTEST (B), LR (B) French 1 1 1 Equivalent

17 MH (B), SIBTEST (C), LR (C) English 3 Not Equivalent

18 MH (B), SIBTEST (C), LR (C) French 1 1 1 Equivalent

19 SIBTEST (C), LR (C) English 1 1 1 Equivalent

22 MH (B), SIBTEST (B), LR (B) French 1 2 1 Equivalent

24 MH (C), SIBTEST (C), LR (C) English 3 1 3 Not Equivalent 8 _a

25 MH (C), SIBTEST (C), LR (C) French 3 2 2 Not Equivalent a.

c

27 SIBTEST (C), LR (B) French 3 2 3 Not Equivalent (¾"

29 MH (B), SIBTEST (B), LR (B) French 1 1 1 Equivalent ST

30 MH (C), SIBTEST (C), LR (C) English 3 1 1 Equivalent

(18)

Table

7 (continued)

Item Statistical Flag (Rating) Favors

Reviewer Rating

Reviewerl Reviewer 2 Reviewer 3 Overall Rating

33 MH (C), SIBTEST (C), LR (C) English 1 1 1 Equivalent

34 SIBTEST (B), LR (C) French 1 1 1 Equivalent

35 SIBTEST (B), LR (B) French 1 1 1 Equivalent

36 SIBTEST (B), LR (B) English 1 1 1 Equivalent

44 MH (B), SIBTEST (C), LR (B) French 1 1 1 Equivalent

45 MH (B), SIBTEST (B), LR (B) English 1 1 1 Equivalent

48 MH (B), SIBTEST (B), LR (B) English 1 1 1 Equivalent

(19)

Using Statistical and Judgmental Review's

Relations Among DIF Effect Size Measures

Classification consistency can also be evaluated b y e x a m i n i n g the correlation b e t w e e n the D I F effect size measures. A strong correlation indicates a close relationship b e t w e e n the rankings of the items. A s s h o w n i n Table 5, effect size measures were h i g h l y correlated across D I F procedures except the measure for n o n u n i f o r m D I F . F o r the mathematics test the M H effect size measure A - M H was h i g h l y correlated w i t h the SIBTEST effect size measure ê at -0.96 a n d w i t h the L R effect size measures for u n i f o r m D I F , R2A - L f , at 0.91. A - M H a n d the n o n u n i f o r m D I F measure R2 A-N were correlated at -0.03. ê a n d R2A - L i were also h i g h l y correlated at 0.92, whereas ê a n d R2A-N were correlated at -0.01.

The correlation between the R2A measures w a s zero. F o r the social studies test A - M H w a s h i g h l y correlated w i t h ê a n d R2A - U (-0.99 a n d 0.93 respectively), b u t n o t w i t h (R2A-N (r-0.27). ê a n d R2A - U were h i g h l y correlated at 0.95, whereas ê a n d R2A-N were w e a k l y correlated at 0.29. T h e R2A measures also h a d a w e a k correlation at 0.25. T h e correlations between the u n i f o r m effect size measures a n d R2A-N were larger i n social studies than i n mathematics because

the social studies test has m o r e translation D I F items.

Results from Judgmental Analysis

Results for the j u d g m e n t a l analysis are reported i n Tables 6 a n d 7 for the mathematics a n d social studies tests respectively. O n l y results for items flagged b y at least t w o of the statistical procedures w i t h B- o r C-level D I F are presented a n d d i s c u s s e d d u e to space limitations. Consistency a m o n g the three reviewers w a s h i g h for i d e n t i f y i n g the nonequivalent items. T w o items o n the mathe-matics tests a n d six items o n the social studies tests were deemed not equivalent between the t w o language forms. O n e mathematics item j u d g e d to be n o n e q u i v a l e n t i n E n g l i s h a n d French is presented i n A p p e n d i x A to illus-trate translation differences. F o r this item the E n g l i s h f o r m contained a 12-hour clock w i t h A M a n d P M , whereas the French f o r m used a 24-hour clock. Stu-dents w e r e required to interpret a n A M to P M time difference i n this item, a n d this difference w a s m o r e apparent w h e n the 24-hour clock w a s used. A l s o , w h e n the 24-hour clock is used the first t w o options o n the French f o r m are clearly incorrect. This item w a s identified b o t h statistically a n d substantively u s i n g back-translation as b e i n g different between the t w o languages. This item also demonstrates h o w a n i m p o r t a n t c u l t u r a l difference c a n influence test d e v e l o p m e n t because i n E n g l i s h the 12-hour clock is routinely used, whereas i n F r e n c h the 24-hour clock is frequently used.

(20)

Conclusions and Discussion

The p u r p o s e of this s t u d y was to evaluate the equivalence of t w o translated tests u s i n g statistical a n d j u d g m e n t a l methods. Performance differences for a large r a n d o m sample of E n g l i s h a n d Frenchspeaking examinees were c o m -p a r e d o n a mathematics a n d social studies achievement test. Items d i s -p l a y i n g D I F w e r e flagged u s i n g three different statistical m e t h o d s — M a n t e l - H a e n s z e l , S i m u l t a n e o u s Item Bias Test, a n d logistic regression—and the substantive m e a n i n g of these flags was studied b y c o m p a r i n g the back-translated f o r m w i t h the original E n g l i s h - v e r s i o n a n d w i t h the statistical outcomes for each item o n both tests.

Statistical Outcomes

T w o m a i n statistical findings were f o u n d . First, the classification results across procedures were relatively consistent, but not identical. The correlation between the effect size measures for u n i f o r m D I F were also strong, but not perfect. These results indicate that the three procedures p r o d u c e relatively consistent item classification a n d effect size rankings, but some discrepancies were present. O n e explanation for these discrepancies m a y be f o u n d w i t h the cut-points (i.e., A , B, a n d Clevels) used to identify D I F items. A r e these cutpoints c o m -parable across procedures? O u r results suggest that the M H cut-points are more conservative than either the SIBTEST or L R cut-points. In a n attempt to establish a consistent a n d defensible pattern of D I F item classification, re-searchers m a y choose to use at least t w o procedures w h e n screening items for translation D I F . Researchers and practitioners s h o u l d also expect to identify fewer D I F items w i t h M H c o m p a r e d w i t h SIBTEST or L R .

Second, the L R effect size R2A - Ü was h i g h l y correlated w i t h the effect size measures for M H a n d SIBTEST. Correlations between the n o n u n i f o r m L R effect size measure R2A-N w i t h M H a n d SIBTEST were negligible because the later

measures were not designed to flag n o n u n i f o r m D I F . T h u s R2A appears to be a reliable measure that p r o v i d e s useful information w h e n used w i t h the L R statistical analysis for q u a n t i f y i n g the m a g n i t u d e of u n i f o r m and n o n u n i f o r m D I F . N e w guidelines for interpreting the L R and R2A results for classifying items w i t h D I F were p r o p o s e d by J o d o i n (1999), but m u c h more research w i t h L R a n d R2A is needed to establish the v a l i d i t y of this procedure for i d e n t i f y i n g D I F . T h e first author is currently c o n d u c t i n g s i m u l a t i o n studies i n this area.

Interpretability of DIF

(21)

j u d g e d to be n o n e q u i v a l e n t across language forms d u e to differences intro-d u c e intro-d i n the translation process. There were n o apparent translation intro- differen-ces for the r e m a i n i n g items.

Translation differences were m o r e p r o n o u n c e d i n social studies, a language-rich content area. The sheer n u m b e r of w o r d s o n the social studies test (3,354 and 4,157 o n the E n g l i s h arid F r e n c h forms respectively) c o m p a r e d w i t h the mathematics test (2,713 a n d 3,066 o n the E n g l i s h a n d French forms respective-ly) c o u l d lead to m o r e translation differences. Student performance c o u l d also be affected. French-speaking examinees i n social studies, for example, were r e q u i r e d to read 803 m o r e w o r d s c o m p a r e d w i t h the E n g l i s h - s p e a k i n g ex-aminees, a n increase of almost 24%, a n d this increased reading l o a d c o u l d adversely affect test performance. The w o r d count difference across language forms m a y account for some of the discrepancy between the statistical analysis and the j u d g m e n t a l r e v i e w .

T h e discrepancy between statistical a n d j u d g m e n t a l results i n social studies m a y also be attributed to inflated T y p e I error resulting f r o m the use of an inadequate c o n d i t i o n i n g variable. D I F analyses require a c o n d i t i o n i n g variable that matches examinees i n b o t h language g r o u p s o n the same construct of interest. W h e n the n u m b e r of flagged items becomes large, i n d i c a t i n g m a n y potentially problematic items, total test score (or a latent version of total test score as w i t h the S I B T E S T procedure) m a y not be a v a l i d c o n d i t i o n i n g variable. C u r r e n t l y there is n o s o l u t i o n to this p r o b l e m w h e n the n u m b e r of D I F items is large (Sireci, 1997; Sireci, X i n g , & F i t z g e r a l d , 1999). A s a first step, care m u s t be taken to ensure the translation is accurate. Iterative purification can also be used b y r e m o v i n g the D I F items f r o m the total test score a n d repeating the analysis. H o w e v e r , this a p p r o a c h has been s t u d i e d w i t h o n l y a relatively s m a l l n u m b e r of D I F items (Clauser, N u n g e s t e r , M a z o r , & R i p k e y , 1996). W h e n a large n u m b e r of items are flagged o n a u n i d i m e n s i o n a l test as i n social studies, it is not clear w h i c h items to remove, as the p u r p o s e of the D I F analysis is to identify problematic items or h o w the construct a n d content representation o n the test w i l l be affected w h e n a large n u m b e r of D I F items are r e m o v e d f r o m the c o n d i t i o n i n g variable. This p r o b l e m remains u n s o l v e d a n d m u s t be a d -dressed i n future research.

(22)

p r o v i d e practitioners w i t h the i n f o r m a t i o n needed to m a k e decisions about a test (e.g., w h i c h D I F items to d r o p a n d w h i c h items to keep w h e n B- o r C - l e v e l D I F is found) o r about the psychological factors that p r o d u c e D I F that s h o u l d be considered w h e n creating a test. In short, C a m i l l i a n d Shepard (1994) cor-rectly state that it is essential to " w o r r y as m u c h about h o w to interpret D I F as h o w to c o m p u t e i t " (p. 153). Researchers s h o u l d address this concern b y s t u d y -i n g the p s y c h o l o g -i c a l factors that p r o d u c e translat-ion d-ifferent-ial -item f u n c t i o n i n g .

Acknouiedgements

This research was supported with funds awarded to the first two authors from the Social Sciences and Humanities Research Council of Canada (SSHRC), the Support for the Advancement of Scholarship Fund (SAS) at the University of Alberta, and the Social Science Research Fund (SSR), also at University of Alberta.

We would like to thank Ronald K. Hambleton, University of Massachusetts at Amherst, for his critique on an earlier version of this article.

References

Allalouf, A., & Sireci, S.G. (1998, April). Detecting sources of DIF in translated verbal items. Paper presented at the annual meeting of the American Educational Research Association, San Diego.

Angoff, W. (1993). Perspective on differential item functioning methodology. In P.W. Holland & H. Wainer (Eds.), Differential item functioning (pp. 3-24). Hillsdale, NJ: Erlbaum.

Bollen, K.A., & Long, J.S. (1993). Testing structural equation models. Newbury Park, CA: Sage. Bracken, B.A., & Barona, A. (1991). State of the art procedures for translating, validating, and

using psychoeducational tests in cross-cultural assessment. School Psychology International, 12, 119-132.

Brislin, R.W. (1970). Back-translation for cross-cultural research, journal ofCross-Cultural Psychology, 1,185-216.

Brislin, R.W. (1986). The wording and translation of research instruments. In W.J. Lonner & J.W. Berry (Eds.), Field methods in cross-cultural research (pp. 137-164). Newbury Park, C A : Sage. Browne, M.W., & Cudek, R. (1993). Alternative ways of assessing model fit. In K.A. Bollen & J.S.

Long (Eds.), Testing structural equation models (pp. 136-162). Newbury Park, CA: Sage. Budgell, G.R., Raju, N.S., & Quartetti, D.A. (1995). Analysis of differential item functioning in

translated assessment instruments. Applied Psychological Measurement, 19,309-321. Camilli, G., & Shepard, L.A. (1994). Methods for identifying biased test items. Newbury Park, CA:

Sage.

Clauser, B.E., & Mazor, K.M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17,31-44. Clauser, B.E., Nungester, R.J., Mazor, K., & Ripkey, D. (1996). A comparison of alternative

matching strategies for DIF detection in tests that are multidimensional, journal of Educational Measurement, 33,202-214.

Douglas, J., Roussos, L., & Stout, W., (1996). Item bundle DIF hypothesis testing: Identifying suspect bundles and assessing their DIF. journal of Educational Measurement, 33,465-484. Elliott, P.R. (1994, April). An overview of current practice in structural equation modeling. Paper

presented at the annual meeting of the American Educational Research Association, New Orleans.

Ellis, B.B. (1989). Differential item functioning: Implications for test translation. Journal of Applied Psychology, 74,912-921.

Frederiksen, N , Mislevy, R.J., Bejar, I.I. (1993). Test theory for a neiv generation of tests. Hillsdale, NJ: Erlbaum.

Gierl, M.J. (1997). Comparing the cognitive representations of test developers and students on a mathematics achievement test using Bloom's taxonomy. Journal of Educational Research, 91, 26-32.

(23)

Using Statistical and Judgmental Revieres Gierl, M.J., & Mulvenon, S. (1995, April). Evaluating the application of fit indices to structural equation models in educational research: A review of the literature from 1990 through 1994. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Hambleton, R.K. (1993). Translating achievement tests for use in cross-cultural studies. European Journal of Psychological Assessment, 9,57-68.

Hambleton, R.K. (1994). Guidelines for adapting educational and psychological tests: A progress report. European Journal of Psychological Assessment, 10,229-244.

Hambleton, R.K., & Bollwark, J. (1991). Adapting tests for use in different cultures: Technical issues and methods. Bulletin of the International Testing Commission, 18,3-32.

Hambleton, R.K., & Jones, R.W. (1994). Comparison of empirical and judgmental procedures for detecting differential item functioning. Educational Research Quarterly, 18,23-36.

Hambleton, R.K., & Kanjee, A. (1995). Increasing the validity of cross-cultural assessments: Use of improved methods for test adaptations. European Journal of Psychological Assessment, 11, 147-157.

Holland, P.W., & Thayer, D.T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H . Wainer & H.I. Braun (Eds.), Tesf validity (pp. 129-145). Hillsdale, NJ: Erlbaum.

Hulin, C.L., Drasgow, F., & Komocar, J. (1982). Applications of item response theory to analysis of attitude scale translations. Journal of Applied Psychology, 67,818-825.

Jodoin, M . (1999). Reducing type I error rates using an effect size measure with the logistic regression DIF procedure. Unpublished master's thesis, University of Alberta.

Jöreskog, K.G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36, 409-426.

Jöreskog, K.G., & Sörbom, D. (1996). LISREL 8: User's reference guide. Chicago, IL: Scientific Software.

Linn, R.L. (1993). The use of differential item functioning statistics: A discussion of current practice and future implications. In P.W. Holland & H . Wainer (Eds.), Differential item functioning (pp. 349-366). Hillsdale, NJ: Erlbaum.

Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22,719-748.

McDonald, R.P., & Marsh, H.W. (1990). Choosing a multivariate model: Noncentrality and goodness of fit. Psychological Bulletin, 107,247-255.

Mislevy, R.J. (1996). Test theory reconceived. Journal of Educational Measurement, 33,379-416. Mulaik, S.A., James, L.R., Van Alstine, J., Bennett, N , Lind, S., & Stilwell, C D . (1989). Evaluation

of goodness-of-fit indices for structural equation models. Psychological Bulletin, 105,430-445. Narayanan, P., & Swaminathan, H. (1994). Performance of Mantel-Haenszel and simultaneous

item bias procedure for detecting differential item functioning. Applied Psychological Measurement, 18,315-328.

Narayanan, P., & Swaminathan, H . (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20,257-274.

Nichols, P. (1994). A framework of developing cognitively diagnostic assessments. Review of Educational Research, 64,575-603.

Nichols, P.D., Chipman, S.F., Brennan, R.L. (1995). Cognitively diagnostic assessment. Hillsdale, NJ: Erlbaum.

O'Neill, K.A., & McPeek, W.M. (1993). Item and test characteristics that are associated with differential item functioning. In P.W. Holland & H . Wainer (Eds.), Differential item functioning (pp. 255-276). Hillsdale, NJ: Erlbaum.

Pope, G.A. (1997). Nonparametric item response modeling and gender differential item functioning analysis of the Eysenck personality questionnaire. Unpublished master's thesis, University of Northern British Columbia.

Rogers, H.J., & Swaminathan, H . (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.

Roussos, L.A., & Stout, W.F. (1996). Simulation studies of the effects of small sample size and studied item parameters on SIBTEST and Mantel-Haenszel type I error performance. Journal of Educational Measurement, 33,215-230.

Scheuneman, J.D. (1987). An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement, 24,97-118.

(24)

M. Gl'eri, W.T. Rogers, and D. Klinger

Shepard, L.A., Camilli, G., & Averiii, M . (1981). Comparison of six procedures for detecting test item bias using both internal and external ability criteria. Journal of Educational Statistics, 6, 317-375.

Sireci, S.G. (1997). Problems and issues in linking tests across languages. Educational Measurement: Issues and Practice, 16,12-19.

Sireci, S.G., Xing, D., & Fitzgerald, C. (1999, April). Evaluating adapted tests across multiple language groups: Lessons learned from the TT industry. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.

Snow, R.E., & Lohman, D.F. (1989). Implications of cognitive psychology for educational

measurement. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-331). New York: American Council on Educational, Macmillan.

Swaminathan, H., & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27,361-370.

Thomas, D.R., & Zumbo, B.D. (1996, July). Variable importance in regression and related analyses. Paper presented at the annual meeting of the Psychometric Society, Banff,

van de Vijver, F.J.R. (1994). Item bias: Where psychology and methodology meet. In A. Bouvy, F.J.R. van de Vijver, P Boski, & P. Schmitz (Eds.), Journeys into cross-cultural psychology (pp. 111-126). Lisse, Netherlands: Swets & Zeitlinger.

van de Vijver, F, & Leung, K. (1997). Methods and data-analysis for cross-cultural research. Thousand Oaks, C A : Sage.

Werner, O., & Campbell, D.T. (1970). Translating, working through interpreters, and the problem of decentering. In R. Naroll & R. Cohen (Eds.), A handbook of method in cultural anthropology (pp. 398-420). New York: Columbia University Press.

Zieky, M . (1993). Practical questions in the use of DIF statistics in test development. In P.W. Holland & H . Wainer (Eds.) Differential item functioning (pp. 337-347). Hillsdale, NJ: Erlbaum. Zwick, R., & Ercikan, K. (1989). Analysis of differential item functioning in the NAEP history

assessment. Journal of Educational Measurement, 26,55-66.

Zumbo, B.D. (1999). A handbook on the theory and methods of differential item functioning: Logistic regression modeling as a unitary framework for binary and Likert-type item scores. Ottawa: Directorate of Human Resources Research and Evaluation, Department of National Defense. Zumbo, B.D., & Thomas, D.R. (1996, October). A measure of DIT effect size using logistic regression

procedures. Paper presented at the National Board of Medical Examiners, Philadelphia. Appendix A

Items 47 on the English and French form of the Grade 6 Mathematics Achievement Test respectively

47. On the first day of filming, the crew arrived on the set at 5:20 A.M. They left the set at 8:15 P.M. How long did the crew spend on the set that day?

A. 3 h 5 min B. 5 h 5 min C. 13 h 35 min D. 14 h 55 min

47. Le premier jour du tournage, l'équipe arrive au plateau de projection à 5 h 20 du matin. Elle quitte le plateau à 20 h 15. Combien de temps l'équipe est-ce que l'équipe passe sur le plateau le premier jour?