
LIMITATIONS AND FUTURE STUDIES

Summary of Findings

This study compares four item selection methods (D-optimality, KI, MI, and CEM) under three test formats (POLYTYPE, DPMIX, and PDMIX). In general, D-optimality yields the best estimation performance, followed by MI, CEM, and KI. The KI method, however, shows the smallest estimation error for the theta pattern of ₁ ₂, although it shows larger estimation error than the other three methods under the other theta patterns and all explored test formats. D-optimality, MI, and CEM have similar estimation and item selection patterns, while KI differs from them.

Polytomous items provide more information than dichotomous items, so the POLYTYPE test format yields the best conditional estimation accuracy, and two-dimensional theta points with relatively extreme values are better probed under the POLYTYPE format. This finding applies to all four item selection methods.

Which item type, dichotomous or polytomous, should be administered first depends on the test design and settings. When the test length is typical (around 25 items) and the proportion of polytomous to dichotomous items is similar to the design of this study, the order in which the item types are administered does not affect overall or conditional estimation precision. If the test is very long or very short, or if the proportion of dichotomous to polytomous items differs from the current design, one mixed test format may be better than the other; further investigation is needed before drawing a conclusion.

Another finding is that item bank size affects estimation precision. When the item bank shrinks, estimation errors become larger for all item selection methods. CEM is noticeably affected by the change in item bank size.
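As an illustration of the D-optimality criterion summarized above, the following is a minimal sketch, not the implementation used in this study. It assumes a dichotomous two-dimensional 2PL model rather than the MGPCM, purely to keep the information matrix short, and it assumes an identity matrix as starting (prior) information so that the determinant is nonzero from the first item. The item bank, the function names, and the parameter values are hypothetical.

```python
import math

def item_information(a, d, theta):
    """2x2 Fisher information matrix of one two-dimensional 2PL item at theta."""
    z = a[0] * theta[0] + a[1] * theta[1] + d
    p = 1.0 / (1.0 + math.exp(-z))
    w = p * (1.0 - p)  # variance of the binary response
    return [[w * a[0] * a[0], w * a[0] * a[1]],
            [w * a[1] * a[0], w * a[1] * a[1]]]

def add2(m, n):
    """Element-wise sum of two 2x2 matrices."""
    return [[m[i][j] + n[i][j] for j in range(2)] for i in range(2)]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def select_d_optimal(bank, administered, info_so_far, theta_hat):
    """Index of the unused item maximizing det(info_so_far + item information)."""
    best_idx, best_det = None, -math.inf
    for idx, (a, d) in enumerate(bank):
        if idx in administered:
            continue
        candidate = det2(add2(info_so_far, item_information(a, d, theta_hat)))
        if candidate > best_det:
            best_idx, best_det = idx, candidate
    return best_idx

# Hypothetical bank: (discrimination vector, intercept) per item.
bank = [((1.6, 0.1), 0.0), ((0.1, 1.5), 0.0), ((0.8, 0.8), 0.0)]
theta_hat = (0.0, 0.0)
prior_info = [[1.0, 0.0], [0.0, 1.0]]  # assumed starting information

first = select_d_optimal(bank, set(), prior_info, theta_hat)
info = add2(prior_info, item_information(*bank[first], theta_hat))
second = select_d_optimal(bank, {first}, info, theta_hat)
```

In this toy bank the criterion first picks the item that loads mainly on dimension 1 and then the item that loads mainly on dimension 2, because the determinant rewards information that is spread across both dimensions.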


Limitations and Future Studies

MCAT has not yet been studied extensively, and a variety of further studies could be done in the field.

The first direction is to study MCAT in which items and examinees have a complex structure. In the current two-dimensional study, items are assumed to load on both dimensions, and the two dimensions are assumed to be uncorrelated. In addition, examinees' abilities on the two dimensions are assumed to be uncorrelated. In practice, however, both the item dimensions and the examinees' abilities might be correlated. It is therefore important to investigate MCAT using real data or a realistic test setting.
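One way a follow-up simulation could relax the uncorrelated-abilities assumption is to draw examinee thetas from a bivariate normal distribution with a chosen correlation. The sketch below is only an assumption about how such a design might be set up: the function names, the standard-normal margins, and the value of `rho` are hypothetical and are not taken from this study.

```python
import math
import random

def draw_correlated_thetas(n, rho, seed=0):
    """Draw n pairs (theta1, theta2) from a standard bivariate normal with correlation rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # theta2 mixes z1 and an independent z2 so that corr(theta1, theta2) = rho
        pairs.append((z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2))
    return pairs

def sample_correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

thetas = draw_correlated_thetas(20000, rho=0.6)
```

Setting `rho=0.0` reproduces the uncorrelated design assumed in the current study, so the same generator could serve both conditions.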

Second, item exposure control, content constraints, and other non-statistical factors are not explored in this study. For MCAT to be applicable in the real world, it must be studied with such factors taken into account.

In addition, item selection methods should be explored in test formats that contain several types of polytomous items, either with different numbers of response categories or from different MIRT models. In real-world practice, a variety of polytomous items are used to examine student abilities in different fields. It is common for one test to consist of several item types in order to assess students' comprehensive abilities. These polytomous items might fit different MIRT models or have different numbers of response categories. Besides the MGPCM, other polytomous MIRT models need to be explored. Therefore, future studies should examine the POLYTYPE test format or mixed-type test formats containing polytomous items with mixed numbers of categories or from various MIRT models.

Moreover, item selection in variable-length MCAT is an interesting direction. Because both MCAT delivery and polytomous items provide more information than UCAT delivery and dichotomous items, students' abilities could possibly be diagnosed with a test shorter than a UCAT using dichotomous items. Furthermore, when the purpose is to diagnose or to improve instruction, a variable-length adaptive test has advantages, provided that the adopted item selection method supplies sufficient information. One advantage is that test length can be shortened, which in turn limits item exposure rates. Items could therefore stay in the pool longer before they are retired, and students would face a lower testing burden through shorter testing time and fewer items.
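A variable-length test needs a stopping rule. A common choice in adaptive testing, shown here only as a hedged sketch, is to stop once the standard error of every ability estimate falls below a target, with a maximum length as a safeguard. The threshold `se_target`, the cap `max_items` (set to 25 to echo the fixed length discussed earlier), and the function names are assumptions for illustration; this study did not evaluate any stopping rule.

```python
import math

def standard_errors(info):
    """Standard errors of (theta1, theta2) from a 2x2 information matrix."""
    det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
    if det <= 0.0:
        # Singular information matrix: the two abilities are not yet separately estimable.
        return (math.inf, math.inf)
    # Diagonal of the inverse matrix gives the asymptotic error variances.
    return (math.sqrt(info[1][1] / det), math.sqrt(info[0][0] / det))

def should_stop(info, n_administered, se_target=0.3, max_items=25):
    """Stop at the length cap, or when the larger of the two SEs meets the target."""
    if n_administered >= max_items:
        return True
    return max(standard_errors(info)) <= se_target
```

For example, `should_stop([[16.0, 0.0], [0.0, 16.0]], 10)` returns `True` because both standard errors equal 0.25, whereas the same call with an information matrix of `[[4.0, 0.0], [0.0, 4.0]]` returns `False` because both standard errors equal 0.5.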

Another promising direction is to study item selection methods in MCAT with intentional and nuisance abilities and with one composite ability defined as an explicit linear combination of the thetas. This study finds that an item selection method may excel under one theta pattern but produce larger estimation error under another. Therefore, for different configurations of multidimensional thetas, item selection methods might show a variety of performance patterns and accuracy levels. Further exploration is needed.
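To make the idea of a composite ability concrete, the sketch below computes a composite as a weighted sum of the theta estimates and its standard error from the error covariance matrix of those estimates. The weights, the covariance values, and the function names are hypothetical; the sketch only restates the standard variance formula for a linear combination and is not a procedure from this study.

```python
import math

def composite_score(weights, theta_hat):
    """Composite ability: the weighted sum of the theta estimates."""
    return sum(w * t for w, t in zip(weights, theta_hat))

def composite_se(weights, cov):
    """Standard error of the composite: sqrt(w' * cov * w)."""
    k = len(weights)
    var = sum(weights[i] * cov[i][j] * weights[j]
              for i in range(k) for j in range(k))
    return math.sqrt(var)

# Hypothetical equal weights, estimates, and error covariance matrix.
weights = (0.5, 0.5)
theta_hat = (1.0, -0.2)
cov = [[0.09, 0.0], [0.0, 0.16]]

score = composite_score(weights, theta_hat)
se = composite_se(weights, cov)
```

Setting a weight to zero treats the corresponding dimension as a nuisance ability, in the sense that it no longer contributes to the reported composite.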

Item pool structure is one of the factors affecting estimation accuracy. For example, MI and CEM provide similar performance under the different test formats, with MI slightly better than CEM. Under different item bank structures and conditions, MI and CEM might perform differently. More research on item pool structure needs to be conducted.

Conclusion

MCAT will become one of the main test delivery approaches in future testing thanks to its diagnostic feature. To facilitate the development of formative assessments and diagnostic testing, studies of MCAT are becoming increasingly important. Similarly, polytomous items should be incorporated into tests so that tests can take full advantage of polytomous items'


element of an MCAT test, item selection methods play an important role in improving ability estimation precision and fulfilling the diagnostic purpose. D-optimality is found to provide the best overall performance in this study. The other three methods, however, have their own advantages in a variety of fields. Moreover, the performance of item selection methods varies with item pool structure and test design. MCAT with polytomous items should be investigated thoroughly, both from a theoretical perspective and for practical application.



In document Item selection methods in multidimensional computerized adaptive testing adopting polytomously-scored items under multidimensional generalized partial credit model (Page 113-122)