Efficiency of Item Selection Method in Variable-length Computerized Adaptive Testing for the Testlet Response Model: Constraint-weighted A-stratification Method


Procedia - Social and Behavioral Sciences 116 ( 2014 ) 1890 – 1895

1877-0428 © 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and/or peer-review under responsibility of Academic World Education and Research Center. doi: 10.1016/j.sbspro.2014.01.490


5th World Conference on Educational Sciences - WCES 2013


Anusorn Koedsri a,*, Nuttaporn Lawthong b, Sungworn Ngudgratoke c

a Ph.D. Student, b Assistant Professor at Faculty of Education, Chulalongkorn University, Bangkok 10330, Thailand
c School of Education, Sukhothai Thammathirat Open University, Nonthaburi 11120, Thailand

Abstract

Most computerized adaptive tests (CATs) are built on standard item response theory (IRT) models. IRT is a powerful psychometric paradigm and has therefore been adopted by several testing programs. However, by assumption, standard IRT models ignore the dependence structure among items, even though many tests are made up of testlets (sets of items). When traditional IRT models are applied to tests composed of testlets, the violation of the local independence assumption results in imprecise ability estimates. Hence, the testlet response theory (TRT) model is recommended for estimating examinees' proficiency when testlet-based tests are used. The purpose of this study is to explore the accuracy of the TRT model when it is applied to variable-length CATs. The study employed a variable-length item selection method, adapted from the constraint-weighted a-stratification method, to control item exposure and content balancing during item selection. A Monte Carlo simulation was used to compare the Pearson product-moment correlation, bias, and mean squared error (MSE) of the TRT model in estimating examinees' abilities. Three testlet selection methods were compared: the maximization information method, the randomization method, and the constraint-weighted a-stratification method. The results indicate that the constraint-weighted a-stratification method outperforms the other methods adapted for variable-length CAT based on the TRT model.

Keywords: computerized adaptive testing, testlet response theory, variable-length test, constraint-weighted a-stratification method

1. Introduction

Computerized adaptive testing (CAT), based on item response theory (IRT), was formally proposed by Lord in 1980. An ideal adaptive test provides each examinee with a tailored test whose length may differ from that of other examinees (Eignor, 2005; Thissen & Mislevy, 2000; Weiss, 1985). Multiple-choice items are the most frequently used item format in CATs to date. Items in this format tend to meet the assumptions of IRT, such as local independence and a one-dimensional latent trait (Hambleton & Swaminathan, 1985). In practice, these assumptions are likely to be violated when items are grouped together and based on a single stimulus, as when the several items of a testlet refer to a single reading passage (Wainer & Kiely, 1987).

*Corresponding author. Tel.: +668-9452-4466. E-mail address: [email protected]

The measurement models based on testlet response theory proposed by Wainer, Bradlow, and Du (2002) can be employed to handle testlet data within a CAT. The precision of the ability estimates yielded by a testlet-based CAT system depends not only on the measurement model on which it is based, but also on the item exposure control method that is selected (Boyd, 2003; Davis & Dodd, 2003). Previous research on variable-length CAT found that the major advantages of the content weighted balancing method over the variable-length MMM method are relatively more effective item exposure control and higher efficiency (Huo, 2009).

This study sought to extend Huo's (2009) research on variable-length CAT systems modeled with IRT to CATs within the TRT framework. The goal of the study is to investigate the efficiency of variable-length CATs when they are applied to the TRT model.

Literature Review

The exposure control method

The a-stratified multistage method was proposed by Chang and Ying (1999) to control item exposure rates. In this method, all items in the item pool are stratified into a number of levels (K strata) based on the value of the discrimination parameter. At the early stages of a test, when little is known about the examinee's proficiency, items with low discrimination are administered and the items with high discrimination are saved for the later stages. Simulation results showed that this method can effectively control overexposure rates and thus significantly improve item pool utilization. The method can also incorporate some non-psychometric constraints, such as content balancing (Chang, Qian, & Ying, 2001; Chang & Ying, 1999).
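The stratification step described above can be illustrated with a minimal sketch. It assumes only that items are ranked by their discrimination parameter and cut into K groups of nearly equal size; the function name `stratify_by_a` and the eight example values are hypothetical and do not come from the study.

```python
def stratify_by_a(a_values, n_strata):
    """Split item indices into n_strata groups of ascending discrimination."""
    # Rank item indices from lowest to highest a-parameter.
    order = sorted(range(len(a_values)), key=lambda j: a_values[j])
    base, extra = divmod(len(order), n_strata)
    strata, start = [], 0
    for k in range(n_strata):
        # The first `extra` strata absorb one leftover item each.
        size = base + (1 if k < extra else 0)
        strata.append(order[start:start + size])
        start += size
    return strata


# Hypothetical pool of eight discrimination values, split into K = 4 strata.
a_pool = [1.2, 0.6, 0.9, 1.5, 0.7, 1.1, 0.8, 1.4]
strata = stratify_by_a(a_pool, n_strata=4)
# strata[0] holds the least discriminating items (used early in the test),
# strata[-1] the most discriminating items (saved for the final stage).
```

In this sketch the first stage of a test would draw from `strata[0]` and the last stage from `strata[-1]`, which mirrors the low-to-high ordering described by Chang and Ying (1999).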

Content balancing method

The constraint-weighted method developed by Cheng and Chang (2009) can accommodate various non-statistical constraints simultaneously, such as content balancing, exposure control, and answer key balancing. Like the weighted deviation modeling method, it can be easily implemented in current CAT programs. At the same time, it does not require any adjustment of the relative weight between constraints and information (Cheng & Chang, 2009; Cheng, Chang, Douglas, & Guo, 2009).

The 3PL testlet response model

The three-parameter logistic testlet response theory model (3PL-TRT) was proposed by Wainer et al. (2002). This model differs from the 3PL-IRT model in that it adds a random effect to the logit of the IRT model equation; this effect is the interaction of the examinee with testlet d(j), the testlet that contains item j. The model (Wainer, Bradlow, & Wang, 2007) is defined as

$$P(y_{ij} = 1 \mid \theta_i) = c_j + (1 - c_j)\,\frac{\exp\left[a_j(\theta_i - b_j - \gamma_{id(j)})\right]}{1 + \exp\left[a_j(\theta_i - b_j - \gamma_{id(j)})\right]} \qquad (1)$$

where $P(y_{ij} = 1 \mid \theta_i)$ denotes the probability that examinee i receives a score of 1 (i.e., answers correctly) on binary item j, $a_j$ represents the slope of item j, $b_j$ represents the difficulty of item j, $c_j$ is the lower asymptote or "pseudo-guessing" parameter for item j, and $\gamma_{id(j)}$ is the testlet effect of item j for person i, where item j is nested in testlet d (i.e., d(j)).
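As a hedged illustration of the 3PL-TRT response function, the sketch below evaluates the probability of a correct response from the logit $a_j(\theta_i - b_j - \gamma_{id(j)})$ and the lower asymptote $c_j$. The function name and the numeric inputs are hypothetical; only the functional form follows the model as described in the text.

```python
import math


def trt_3pl_probability(theta, a, b, c, gamma):
    """Probability of a correct response under the 3PL testlet model."""
    # Logit with the person-specific testlet effect subtracted from theta.
    t = a * (theta - b - gamma)
    # Lower asymptote c plus the remaining mass times the logistic curve.
    return c + (1.0 - c) / (1.0 + math.exp(-t))


# Hypothetical item: a = 1.0, b = 0.0, c = 0.2, examinee with theta = 0.0.
p_no_effect = trt_3pl_probability(theta=0.0, a=1.0, b=0.0, c=0.2, gamma=0.0)
p_with_effect = trt_3pl_probability(theta=0.0, a=1.0, b=0.0, c=0.2, gamma=0.5)
```

With a testlet effect of zero the function reduces to the ordinary 3PL model, and a positive testlet effect lowers the success probability in the same way as a harder item would.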


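The item information, $I_i(\theta)$, for the three-parameter logistic testlet response theory model is given by (Wainer et al., 2007)

$$I_i(\theta) = a_i^2 \left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2 \frac{1 - P_i(\theta)}{P_i(\theta)} \qquad (2)$$

where, in equations (2) to (4), i indexes the items within a testlet and $P_i(\theta)$ is the probability of a correct response to item i from equation (1).

Testlet information, $IT(\theta)$, is the sum of the item information within a testlet:

$$IT(\theta) = \sum_{i=1}^{I} I_i(\theta) \qquad (3)$$

Test information, $TI(\theta)$, is the sum of the testlet information:

$$TI(\theta) = \sum_{t=1}^{T} IT_t(\theta) \qquad (4)$$

Information can be used in CAT systems both to select an item for administration and to quantify measurement precision. Measurement precision for a test can be evaluated through the standard error associated with a given ability, which is the square root of the reciprocal of the test information:

$$SE(\theta) = \frac{1}{\sqrt{TI(\theta)}} \qquad (5)$$

The standard error associated with a given ability is not necessarily constant across the ability continuum.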

2. Method

2.1. Testlet pool

The simulated item pool contains 1,000 items, where item responses were obtained from probability (p) values generated from equation (1). When p was less than 0.5 the response was coded as 0, otherwise as 1. To generate the probability of answering an item correctly, we followed Nudgratoke and Yon's study (2006): discrimination parameters (a) were drawn from a lognormal distribution on (0.5, 1.5), difficulty parameters (b) from a standard normal distribution, and guessing parameters (c) from a beta distribution with parameters (2, 10). The generated item pool consists of 250 reading passages (4 items per testlet). Additionally, the passages were classified into three content areas, and items were randomly assigned to these areas with equal probability. The number of items needed from the first content area for the simulation had a lower bound of 4 and an upper bound of 12; the bounds were 8 and 16 for the second content area and 8 and 20 for the third. Testlet selection was also contingent on the constraint-weighted method, which handles severely constrained testing situations. This method selects an item based on the content-weighted minimum discrepancy between the ability estimate and the item difficulty parameter b.

2.2. Data generations

The true ability (θ) values for 1,000 examinees were simulated from the standard normal distribution. Person-specific testlet effect parameters ($\gamma_{id(j)}$) were randomly drawn from a normal distribution with mean zero and variance $\sigma^2_{\gamma_{id(j)}}$ for each of the 100 testlets. Each of the three testlet-based CAT study conditions was then run on the same ten samples of 1,000 examinees.

The testlet datasets were generated with a program based on the SAS program created by Zhang (2007), which was adapted into R to simulate the response strings in this study.
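A minimal sketch of the data-generation step for one examinee is given below. It draws θ from a standard normal distribution, draws one testlet effect per testlet from a zero-mean normal distribution, and codes each response by comparing the model probability with 0.5, as stated in Section 2.1. The tiny pool, the testlet-effect standard deviation, the seed, and all names are hypothetical; the study's own generator was written in R.

```python
import math
import random


def simulate_examinee(items, testlet_of, n_testlets, gamma_sd, rng):
    """Draw theta, per-testlet effects, and coded responses for one examinee."""
    theta = rng.gauss(0.0, 1.0)
    gammas = [rng.gauss(0.0, gamma_sd) for _ in range(n_testlets)]
    responses = []
    for (a, b, c), d in zip(items, testlet_of):
        t = a * (theta - b - gammas[d])
        p = c + (1.0 - c) / (1.0 + math.exp(-t))
        # Coding rule from Section 2.1: p below 0.5 is scored 0, otherwise 1.
        responses.append(1 if p >= 0.5 else 0)
    return theta, gammas, responses


# Hypothetical pool: two testlets with four items each, given as (a, b, c).
items = [(1.0, -1.0, 0.2), (0.8, -0.5, 0.2), (1.2, 0.0, 0.2), (0.9, 0.5, 0.2),
         (1.1, -0.3, 0.2), (0.7, 0.2, 0.2), (1.3, 0.8, 0.2), (1.0, 1.2, 0.2)]
testlet_of = [0, 0, 0, 0, 1, 1, 1, 1]   # testlet index d(j) for each item j
rng = random.Random(2013)               # fixed seed, chosen arbitrarily
theta, gammas, responses = simulate_examinee(items, testlet_of, 2, 0.5, rng)
```

A common alternative, not stated in the source, is to compare p with a uniform random draw so that responses are stochastic given θ; the thresholding shown here follows the coding rule as the paper describes it.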

2.3. CAT Simulations

R code for the CAT simulation was generated with Firestar-D, the Computerized Adaptive Testing Simulation Program for Dichotomous Item Response Theory Models (Choi, Podrabsky, & McKinney, 2012), and was then modified to simulate the three testlet-based CATs. Each CAT consisted of testlet selection based on maximum information, contingent on content balancing and exposure control procedures. The ability and the person-specific testlet effects were estimated using expected a posteriori (EAP) estimation after each testlet was administered. The test stopped when the standard error of theta was equal to or less than 0.3, or when the number of items reached the pre-specified maximum (upper limit = 48). The simulations also incorporated the practical consideration that an examinee cannot be administered too short a test; the minimum test length was set at a lower limit of 16 items.
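The variable-length stopping rule can be expressed as a small decision function. This is a sketch under the thresholds stated above (standard error of 0.3, minimum of 16 items, maximum of 48 items); the function name and argument names are hypothetical and are not taken from the Firestar-D code.

```python
def should_stop(n_items_administered, se_theta,
                se_target=0.3, min_items=16, max_items=48):
    """Return True when the variable-length CAT should end."""
    # Never stop before the minimum test length has been reached.
    if n_items_administered < min_items:
        return False
    # Always stop once the maximum test length has been reached.
    if n_items_administered >= max_items:
        return True
    # In between, stop as soon as the precision target is met.
    return se_theta <= se_target
```

Because selection operates on testlets of four items, the rule would be checked after every fourth item, for example `should_stop(20, 0.28)` after the fifth testlet.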

Figure 1. Distribution of a, b and c parameters in the testlet pool

2.4. Data Analysis

Assessment of the CAT systems was based on the recovery of the simulees' known ability values and on the effectiveness of the exposure control procedures. The degree to which the CAT systems recovered the known theta values was evaluated through descriptive statistics: the Pearson product-moment correlation, bias, and mean squared error (MSE).

Evaluation of testlet exposure was based on descriptive statistics, including the percentage of testlets never exposed, the percentage overexposed, and the maximum exposure rate. The testlet exposure rate is the number of times a testlet was administered to examinees divided by the total number of examinees. The percentage of testlets not administered during any of the CAT administrations represents pool utilization.
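The three recovery statistics can be computed as in the sketch below. The paper does not spell out its formulas, so the sketch assumes the usual definitions: bias as the mean of (estimate minus true value), MSE as the mean squared difference, and the Pearson correlation between true and estimated abilities. The function name and the three example values are hypothetical.

```python
import math


def recovery_statistics(true_theta, est_theta):
    """Return (bias, MSE, Pearson r) for true versus estimated abilities."""
    n = len(true_theta)
    errors = [e - t for t, e in zip(true_theta, est_theta)]
    bias = sum(errors) / n                      # mean signed error
    mse = sum(err ** 2 for err in errors) / n   # mean squared error
    mean_t = sum(true_theta) / n
    mean_e = sum(est_theta) / n
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_theta, est_theta))
    var_t = sum((t - mean_t) ** 2 for t in true_theta)
    var_e = sum((e - mean_e) ** 2 for e in est_theta)
    r = cov / math.sqrt(var_t * var_e)          # Pearson product-moment r
    return bias, mse, r


# Hypothetical true and estimated abilities for three simulees.
bias, mse, r = recovery_statistics([-1.0, 0.0, 1.0], [-0.8, 0.1, 1.3])
```

Under these definitions a negative bias, as reported for all three methods in the results, means that abilities were on average slightly underestimated.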
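The exposure indices can be computed from the list of testlets each simulee received, as in the following sketch. The cut-offs of 0.2 (overexposed) and 0.02 (never exposed) are taken from the column headers of Table 1; the function name, the dictionary keys, and the toy administration records are hypothetical.

```python
from collections import Counter


def exposure_summary(administered, n_testlets, over_cut=0.2, never_cut=0.02):
    """Summarise testlet exposure rates across all simulated examinees."""
    n_examinees = len(administered)
    # Count each testlet at most once per examinee.
    counts = Counter(d for testlets in administered for d in set(testlets))
    rates = [counts.get(d, 0) / n_examinees for d in range(n_testlets)]
    return {
        "max_rate": max(rates),
        "pct_overexposed": 100 * sum(r > over_cut for r in rates) / n_testlets,
        "pct_never_exposed": 100 * sum(r < never_cut for r in rates) / n_testlets,
    }


# Hypothetical records: four examinees, pool of five testlets (indices 0-4).
records = [[0, 1], [0, 2], [0, 1], [0, 3]]
summary = exposure_summary(records, n_testlets=5)
```

In this toy example testlet 0 is given to every examinee (rate 1.0) and testlet 4 to none, which is the kind of uneven pool usage the exposure control procedures are meant to prevent.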

3. Results

The results of the simulation study are summarized in Table 1. The measurement precision indices (bias, MSE, and the correlation between the true and the estimated ability) show that the maximization information method (MI) was the most effective and the randomization method (RAN) the least preferable in terms of ability estimation. It is clear that the MI method and the constraint-weighted a-stratification method (CW-ASTR) are considerably more accurate than the RAN method in ability estimation. However, the exposure control indices indicate that using the MI method alone is not appropriate for the variable-length CAT based on the TRT model because of its weak item exposure control.

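Table 1. Measurement precision indices (bias, MSE, correlation) and exposure control indices (maximum exposure rate, overexposed, never exposed) by method

| Method | Bias | MSE | Correlation | Max | Overexposed (>0.2) | Never exposed (<.02) |
|---|---|---|---|---|---|---|
| CW-ASTR | -0.021 | 0.215 | 0.916 | 0.312 | 24% | 0% |
| MI | -0.005 | 0.194 | 0.951 | 0.576 | 29.6% | 59.6% |
| RAN | -0.093 | 1.645 | 0.723 | 0.1.93 | 0% | 0% |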

The constraint-weighted a-stratification method produces very precise ability estimates, close to those of the MI method. At the same time, it provides much better item exposure control than the MI method in the variable-length CAT based on the TRT model. Thus, in terms of overall performance, the constraint-weighted a-stratification method outperforms the other methods adapted for variable-length CAT based on the TRT model (Figures 2, 3 and 4).

Figure 2. Testlet exposure rates for the maximum information method.

Figure 3. Testlet exposure rates for the randomized method.

Figure 4. Testlet exposure rates for the constraint-weighted a-stratification method.


4. Discussion and Conclusion

This research shows the effectiveness of constraint-weighted a-stratification in controlling item and testlet exposure rates while ensuring good measurement precision in a variable-length CAT. The findings are consistent with the results of Cheng et al. (2009) for the constraint-weighted a-stratification method in the traditional fixed-length CAT under the IRT model. The findings also suggest that the original a-stratification method (Chang & Ying, 1999) is ineffective in controlling item exposure rates for the variable-length CAT, because it was designed for the fixed-length CAT, where the test length is determined in advance and the number of items selected from each stratum is evenly distributed.

A limitation of this study is that selection was carried out at the testlet level rather than at the level of items within a testlet. Some CATs based on TRT models allow items to be selected adaptively within a testlet. Future studies should therefore examine the effect of item-level selection in CATs with adaptive testlets.

Acknowledgements

We would like to thank Chulalongkorn University for providing the 90th Anniversary of Chulalongkorn University Fund (Ratchadaphiseksomphot Endowment Fund) for this research.

References

Boyd, A. M. (2003). Strategies for controlling testlet exposure rates in computerized adaptive testing systems. (doctoral dissertation), The University of Texas at Austin.

Chang, H.-H., Qian, J., & Ying, Z. (2001). a-Stratified Multistage Computerized Adaptive Testing with b Blocking. Applied Psychological Measurement, 25(4), 333-341. doi: 10.1177/01466210122032181

Chang, H.-H., & Ying, Z. (1999). a-Stratified Multistage Computerized Adaptive Testing. Applied Psychological Measurement, 23(3), 211-222. doi: 10.1177/01466219922031338

Cheng, Y., & Chang, H.-H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62, 369-383. doi: 10.1348/000711008X304376

Cheng, Y., Chang, H.-H., Douglas, J., & Guo, F. (2009). Constraint-Weighted a-Stratification for Computerized Adaptive Testing With Nonstatistical Constraints. Educational and Psychological Measurement, 69(1), 35-49. doi: 10.1177/0013164408322030

Choi, S. W., Podrabsky, T., & McKinney, N. (2012). Firestar-D : Computerized Adaptive Testing Simulation Program for Dichotomous Item Response Theory Models. Applied Psychological Measurement, 36(1), 67-68. doi: 10.1177/0146621611406107

Davis, L. L., & Dodd, B. G. (2003). Item Exposure Constraints for Testlets in the Verbal Reasoning Section of the MCAT. Applied Psychological Measurement, 27(5), 335-356. doi: 10.1177/0146621603256804

Eignor, D. R. (2005). Education, Tests and Measures in. In K. Kempf-Leonard (Ed.), Encyclopedia of Social Measurement (pp. 765-772). New York: Elsevier.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff Publishing.

Huo, Y. (2009). Variable-Length Computerized Adaptive Testing: Adaptation of the A-Stratified Strategy in Item Selection with Content Balancing. (doctoral dissertation), University of Illinois at Urbana-Champaign.

Thissen, D., & Mislevy, R. J. (2000). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 101-133). Mahwah, NJ: Lawrence Erlbaum.

Wainer, H., Bradlow, E., & Du, Z. (2002). Testlet Response Theory: An Analog for the 3PL Model Useful in Testlet-Based Adaptive Testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized Adaptive Testing: Theory and Practice (pp. 245-269). Springer Netherlands.

Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet Response Theory and Its Applications: Cambridge University Press.

Wainer, H., & Kiely, G. L. (1987). Item Clusters and Computerized Adaptive Testing: A Case for Testlets. Journal of Educational Measurement, 24(3), 185-201. doi: 10.1111/j.1745-3984.1987.tb00274.x

Weiss, D. J. (1985). Adaptive Testing by Computer. Journal of Consulting and Clinical Psychology, 53(6), 774-789. doi: 10.1037/0022-006X.53.6.774

Zhang, J. (2007). Dichotomous Or Polytomous Model? Equating of Testlet-based Tests in Light of Conditional Item Pair Correlations. (doctoral dissertation), The University of Iowa. Retrieved from http://ir.uiowa.edu/etd/139/