Chapter 2. Linking Standardized Tests and an Online Item Bank for Formative

2.5 Concept for Implementing a Common Vertical Scale for Mathematics

2.5.4 Step 3: Linking the Standardized Tests for Sixth, Eighth, and Ninth Grades

For the third calibration step, we propose linking the standardized tests for the sixth, eighth, and ninth grades to the vertical scale developed from the calibration assessments. Because the standardized sixth-grade test is very similar to the standardized third-grade test in terms of assessment type and purpose, content specifications, and measurement conditions, we suggest a design for the sixth-grade test that closely resembles that of the standardized third-grade test. In particular, we recommend developing a common-item nonequivalent group design with eight linked parallel test forms and assigning these test forms randomly across all sixth-grade classes. This design takes advantage of the large sample of sixth-grade students by administering and calibrating a large number of items that can be used to develop future sixth-grade tests. Furthermore, we can include items shared with the calibration assessments in these test forms to link the standardized sixth-grade test to the vertical scale. These linking items already have empirical item difficulties from step 2 of the calibration process. Consequently, we advise fixing the difficulty parameters of these items while concurrently calibrating the additional sixth-grade items included in the eight parallel test forms. Furthermore, we recommend checking parameter invariance between the standardized test and the calibration assessments and performing an in-depth item analysis. Items with variant parameters need to be excluded from linking, and items that do not fit the underlying Rasch model need to be excluded completely prior to ability estimation to ensure the quality of student reports.
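The invariance check described above could be sketched as follows. This is only an illustration under stated assumptions: the function name, the item identifiers, the mean-shift alignment, and the 0.5-logit threshold are hypothetical and do not come from the proposed procedure, which does not prescribe a specific criterion.

```python
def flag_variant_items(anchor_b, new_b, threshold=0.5):
    """Flag linking items whose difficulty drifts between two calibrations.

    anchor_b: item id -> difficulty from the calibration assessments (logits)
    new_b:    item id -> difficulty freely estimated on the standardized test
    threshold: hypothetical cutoff in logits (an assumption, not from the source)
    """
    common = sorted(set(anchor_b) & set(new_b))
    if not common:
        return []
    # Remove the average difference so that both sets share one origin.
    shift = sum(new_b[i] - anchor_b[i] for i in common) / len(common)
    return [i for i in common
            if abs(new_b[i] - shift - anchor_b[i]) > threshold]


# Hypothetical difficulties for four linking items.
calibration = {"A": -1.0, "B": 0.0, "C": 1.0, "D": 0.5}
standardized = {"A": -0.9, "B": 0.1, "C": 1.1, "D": 1.6}
variant = flag_variant_items(calibration, standardized)
```

Items returned by such a check would be dropped from the set of fixed linking items; whether they remain in the test as freely calibrated items is a separate decision that depends on their fit to the Rasch model.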

In contrast to the paper-based linear third- and sixth-grade tests, the standardized tests for the eighth and ninth grades are conceptualized as computer-based MSTs. Both standardized MSTs include four test parts (i.e., stages). The first stage contains a starting module of intermediate difficulty as a basis for preliminary ability estimation, and each of the three subsequent stages includes five modules that differ in difficulty. In stages 2 to 4, students are routed to one of the five modules based on their preliminary performance. In total, each of the two MSTs includes 20 different modules, and each participating student sees four of them. Because the two standardized tests for secondary school differ only in their target population (i.e., eighth vs. ninth grade), we propose identical designs and calibration approaches for both tests.
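To make the routing logic concrete, the following minimal sketch routes a student through the four stages using raw-score cut points. The cut scores, the module labels, and the function names are hypothetical placeholders; they only illustrate the mechanism, not actual routing rules.

```python
from bisect import bisect_right

# Hypothetical cut scores: four cuts split raw scores into five bands,
# one band per module (an assumption for illustration only).
HYPOTHETICAL_CUTS = [3, 5, 7, 9]


def route(raw_score, cut_scores=HYPOTHETICAL_CUTS):
    """Return the index of the next module (0 = easiest, 4 = hardest)."""
    return bisect_right(cut_scores, raw_score)


def mst_path(stage_scores, cut_scores=HYPOTHETICAL_CUTS):
    """Build the four-module path of one student.

    stage_scores: raw scores obtained in stages 1, 2, and 3; each score
    determines the module administered in the following stage.
    """
    path = ["S1-start"]
    for next_stage, score in enumerate(stage_scores, start=2):
        path.append(f"S{next_stage}-M{route(score, cut_scores) + 1}")
    return path


example_path = mst_path([4, 8, 2])
```

With these placeholder cuts, a student scoring 4, 8, and 2 in the first three stages would see the starting module, then the second, fourth, and first module of stages 2, 3, and 4, respectively.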

Given that the MST design itself is already rather complex, we refrain from suggesting parallel forms for the MSTs. However, to balance the number of observations per item between the first and subsequent stages, and to prevent students from copying answers from their neighbors, we foresee developing five parallel versions of the starting module for each test and assigning these versions randomly across students. As an alternative to parallel starting modules, it would also be possible to develop targeted starting modules for each of the three performance-related school types distinguished at the secondary school level in Northwestern Switzerland. Each student is assigned to one of these three school types, and this information is available before test administration. Such targeted starting modules could optimize the efficiency of the MST designs for item calibration based on the first administration, as well as for ability estimation in general. However, several studies have found heavy overlap between the abilities of students assigned to different secondary school types (e.g., Angelone, Keller, & Moser, 2013; Baumert, Stanat, & Watermann, 2006b). Based on this research, we cannot rule out that students whose abilities differ from their group’s mean ability could be disadvantaged by such a design. Furthermore, we could not find any research on the expected efficiency gain through a targeted assignment of starting modules in an MST design for item calibration and ability estimation. Consequently, we propose refraining from distinguishing different difficulty levels during the first stage of the standardized eighth- and ninth-grade tests.
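The random assignment of the five parallel starting-module versions could look like the sketch below. The function name, the seed, and the student identifiers are hypothetical. Shuffling the students and then dealing out the versions in turn is one possible way to keep the versions balanced; it is an assumption, not a prescribed procedure.

```python
import random
from collections import Counter


def assign_starting_versions(student_ids, n_versions=5, seed=0):
    """Randomly assign each student one of the parallel starting versions.

    Students are shuffled first and versions are then dealt out in turn,
    so version counts differ by at most one student.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    return {sid: position % n_versions + 1
            for position, sid in enumerate(shuffled)}


# Hypothetical student identifiers.
assignment = assign_starting_versions(range(100))
version_counts = Counter(assignment.values())
```

Because the assignment ignores seating order and school type, neighboring students are unlikely to see the same starting items, and each version collects a similar number of observations.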

Usually, a calibrated item pool is available for distributing items within an MST design and for defining the related routing rules. However, due to limited time and financial resources, as well as the resulting limited options for running calibration studies, we propose using the first administration of the standardized MSTs for the eighth and ninth grades to calibrate the items. In contrast to the calibration assessments, for which we refrain from using an MST design due to limited knowledge about item difficulty and students’ ability distribution within the target school grades, the MSTs for the eighth and ninth grades can build on the outcomes of the calibration assessments (i.e., step 2). Namely, these outcomes will provide an indication of the eighth- and ninth-grade populations’ ability distributions, and the calibrated eighth- and ninth-grade items from the calibration assessments can serve as references for adjusting content experts’ difficulty ratings for the eighth- and ninth-grade items. Moreover, we will have empirical item difficulty estimates for the linking items between the standardized tests and the calibration assessments. Finally, the expected sample size is much larger for the compulsory standardized eighth- and ninth-grade tests than for the calibration assessments.

Against this backdrop, we suggest developing the two standardized MSTs for the eighth and ninth grades based on content experts’ expertise. For each of the two tests, experts need to develop five parallel test forms for the first stage and five test forms targeted to five different difficulty levels for each of stages 2 to 4. Linking items between the different modules are not required for item calibration. Instead, we can link the different modules within one MST through overlapping paths (i.e., various combinations of different modules from the four stages). The link between the two MSTs can be established by linking them to the vertical scale of the calibration assessments, specifically through the linking items between each standardized test and the calibration assessments. We also suggest asking content experts to determine rules for routing students from one stage to the next based on their raw scores. In doing so, it is important to define routing rules that guide a comparable number of students to all five modules within a stage, because a low number of observations would impair the calibration of the related items. Thanks to the large sample size (i.e., 13,000 students per test), we rate the risk of an insufficient number of observations as very small. Ideally, approximately 2,600 students would reach each module. Even 10 percent of this sample would be sufficient for accurately estimating item difficulty parameters based on the Rasch model (Wright, 1977).
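The sample-size reasoning above can be written out as a short calculation, together with a check that flags modules receiving too few students under a given routing outcome. The sketch reads "10 percent of this sample" as 10 percent of the 2,600 students per module, which is an interpretation; the function name and the routed counts are hypothetical.

```python
from collections import Counter

N_STUDENTS = 13_000          # expected students per test (from the text)
N_MODULES_PER_STAGE = 5      # modules in each of stages 2 to 4

ideal_per_module = N_STUDENTS // N_MODULES_PER_STAGE   # 2,600 students
minimum_per_module = ideal_per_module // 10            # 10 percent: 260


def underexposed_modules(routed_modules, min_n=minimum_per_module,
                         n_modules=N_MODULES_PER_STAGE):
    """Return the modules of one stage reached by fewer than min_n students."""
    counts = Counter(routed_modules)
    return [m for m in range(1, n_modules + 1) if counts[m] < min_n]


# Hypothetical, deliberately unbalanced routing outcome for one stage.
routed = [1] * 6000 + [2] * 4000 + [3] * 2000 + [4] * 800 + [5] * 200
too_few = underexposed_modules(routed)
```

In this made-up outcome only the fifth module falls below the 260-student mark, which illustrates why routing rules that spread students evenly matter for calibration even when the total sample is large.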

For calibrating the items, we again recommend a procedure similar to that of the standardized third- and sixth-grade tests. Specifically, for each of the two standardized tests, we propose calibrating all modules of the MST concurrently while fixing the difficulty parameters of the linking items to the outcomes of the calibration assessments. Furthermore, we advise investigating parameter invariance between the standardized tests and the calibration assessments, taking a close look at item fit statistics during item analysis, and excluding misfitting items prior to ability estimation and reporting. Even though we expect some efficiency loss due to limited knowledge about true item difficulty parameters compared with MSTs that are developed based on calibrated item pools, we still expect an efficiency gain for item calibration and ability estimation compared with linear tests. Furthermore, measurement efficiency is limited only during the first administration of the test. In subsequent years, when we have a calibrated item pool at hand, we can improve item-difficulty consistency within the MST modules and adjust the routing rules if needed. However, the detailed interactions between limited knowledge about item difficulty parameters during test construction, consequential specification errors in test modules, and measurement efficiency remain a subject for further research.
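As a rough illustration of concurrent calibration with fixed linking items, the sketch below estimates Rasch difficulties for all items in one run while holding the anchor difficulties constant. It uses joint maximum likelihood with Newton steps purely for brevity; the text does not specify an estimation method, and joint maximum likelihood is known to be biased for short tests, so operational work would rely on dedicated IRT software. All names, the step limits, and the clipping bounds are assumptions.

```python
import numpy as np


def rasch_prob(theta, b):
    """Rasch probability of a correct response for every person-item pair."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))


def calibrate_with_anchors(responses, anchors, n_iter=100):
    """Concurrent calibration with fixed anchor difficulties (JML sketch).

    responses: persons x items array of 0/1, np.nan where an item was not
               administered (as happens along different MST paths)
    anchors:   dict mapping item column index -> fixed difficulty in logits
    """
    seen = ~np.isnan(responses)
    x = np.where(seen, responses, 0.0)
    n_persons, n_items = responses.shape
    theta = np.zeros(n_persons)
    b = np.zeros(n_items)
    fixed = np.zeros(n_items, dtype=bool)
    for item, value in anchors.items():
        b[item] = value
        fixed[item] = True

    for _ in range(n_iter):
        # Newton step for person abilities, given current difficulties.
        p = rasch_prob(theta, b)
        resid = np.where(seen, x - p, 0.0)
        info = np.where(seen, p * (1.0 - p), 0.0)
        step = resid.sum(axis=1) / np.maximum(info.sum(axis=1), 1e-6)
        theta = np.clip(theta + np.clip(step, -1.0, 1.0), -6.0, 6.0)

        # Newton step for the free item difficulties only; anchors stay put.
        p = rasch_prob(theta, b)
        resid = np.where(seen, x - p, 0.0)
        info = np.where(seen, p * (1.0 - p), 0.0)
        step = -resid.sum(axis=0) / np.maximum(info.sum(axis=0), 1e-6)
        free = ~fixed
        b[free] = np.clip(b[free] + np.clip(step[free], -1.0, 1.0), -6.0, 6.0)
    return theta, b
```

Because the anchors keep their values from the calibration assessments, the freely estimated items land on the same scale without a separate transformation step. The clipping of abilities is a crude guard against perfect or zero scores, for which the likelihood has no finite maximum.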

In document Implementation and validation of an item response theory scale for formative assessment (Page 46-48)