CHAPTER II. LITERATURE REVIEW

2.3 Design Components of VL-CCT

2.4.3 DAL and COM Studies

A series of studies by Frick, Plew, and Welch represents a major step in classical VL-CCT: the application of item-level parameter estimates to achieve additional efficiency gains. Frick (1989) found that the SPRT was a viable option for making mastery decisions efficiently and accurately despite variations in item difficulty and discrimination power and, consequently, could be leveraged to individualize learning experiences. Frick also found that the SPRT had high predictive validity.

Frick (1989) conducted computer simulations based on historical test data from two tests: a Digital Authoring Language (DAL) test and a test of knowledge of how computers functionally work (COM test). Both tests were delivered via computer, with items randomly selected from the item bank without replacement until all items were used. Both tests also contained a variety of item types whose difficulty and discrimination power varied considerably.

The DAL test (97 items) was administered 53 times. Most DAL test examinees were graduate students taking a class with Frick that covered DAL programming. The remaining DAL test examinees were professional staff at Indiana University who either claimed or did not claim to be experienced DAL programmers. Knowledge of each examinee's previous experience with DAL enabled master and nonmaster groups to be defined independently of the examinee's score on the DAL test. The DAL test (mean score 63%, SD = 24.6) was considerably harder than the COM test (mean score 79%, SD = 13.6).

The COM test (85 items) was administered 104 times. Note that the original publication reporting on the COM test indicates that “There were 105 administrations of the COM test” (Frick, 1989, p. 102), but subsequent publications and the available historical data files report the number to be 104 (Frick, 1992, p. 203). Current or former graduate students, representing two thirds of the COM test examinees, took the test twice at different points in a course; undergraduate students, representing the rest of the COM test examinees, took the test only once.

SPRT parameters for both tests were set to P(C|M) = .85, P(C|N) = .60, and false master = false nonmaster = .025. The choice of these particular SPRT parameters was based on obtaining a grade of B or higher (> .85) or D or lower (< .60). Simulated SPRT tests were conducted post hoc based on historical data from the DAL and COM tests. SPRT mastery decisions were compared to mastery decisions based on total test scores. Total test scores were converted into mastery decisions using the midpoint between the mastery and nonmastery probabilities (72.5%) as the cut score.
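The decision rule described above can be illustrated with a minimal sketch of Wald's sequential probability ratio test using the reported parameters. The function names, the log-space formulation, and the treatment of a score exactly at the cut are assumptions made for illustration; this is not Frick's original program.

```python
import math


def sprt_decision(responses, p_master=0.85, p_nonmaster=0.60,
                  alpha=0.025, beta=0.025):
    """Apply Wald's SPRT to scored responses (True = correct).

    alpha is the tolerated false-master rate and beta the tolerated
    false-nonmaster rate. Returns (decision, items_used), where decision
    is "master", "nonmaster", or "undecided" if the responses run out.
    """
    upper = math.log((1 - beta) / alpha)   # at or above: declare master
    lower = math.log(beta / (1 - alpha))   # at or below: declare nonmaster
    log_lr = 0.0
    for n, correct in enumerate(responses, start=1):
        if correct:
            log_lr += math.log(p_master / p_nonmaster)
        else:
            log_lr += math.log((1 - p_master) / (1 - p_nonmaster))
        if log_lr >= upper:
            return "master", n
        if log_lr <= lower:
            return "nonmaster", n
    return "undecided", len(responses)


def total_score_decision(proportion_correct, p_master=0.85, p_nonmaster=0.60):
    """Classify a total-test score against the midpoint cut score.

    Assumption: a score exactly at the cut counts as mastery; the
    source does not say how ties were handled.
    """
    cut = (p_master + p_nonmaster) / 2     # 72.5% for .85 and .60
    return "master" if proportion_correct >= cut else "nonmaster"
```

Working in log space avoids multiplying many small probabilities; the decisions are the same as comparing the raw likelihood ratio against (1 - beta) / alpha and beta / (1 - alpha).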

In the 1989 study, Frick found a high level of agreement between SPRT decisions and total score decisions: 96% agreement for DAL, 99% agreement for COM, and 98% agreement across both tests. Fewer classification errors occurred than were expected (<5%). The SPRT reached its mastery decisions with mean test lengths of less than one fourth of the total test length. On average, fewer items were required to make nonmastery decisions than mastery decisions. All SPRT classification errors were cases in which a master was classified as a nonmaster.
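One possible reason nonmastery decisions needed fewer items is an asymmetry in the SPRT parameters themselves: with P(C|M) = .85 and P(C|N) = .60, an incorrect answer shifts the likelihood ratio much further than a correct one. The sketch below computes the shortest all-correct and all-incorrect runs that would end a test. The function name is hypothetical, and the calculation rests only on the standard Wald thresholds, not on the historical data.

```python
import math


def shortest_decisive_runs(p_master, p_nonmaster, alpha, beta):
    """Return (run of correct answers needed for a mastery decision,
    run of incorrect answers needed for a nonmastery decision)."""
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    step_correct = math.log(p_master / p_nonmaster)                # positive
    step_incorrect = math.log((1 - p_master) / (1 - p_nonmaster))  # negative
    to_master = math.ceil(upper / step_correct)
    to_nonmaster = math.ceil(lower / step_incorrect)
    return to_master, to_nonmaster


# Parameters reported for the 1989 study.
print(shortest_decisive_runs(0.85, 0.60, 0.025, 0.025))  # (11, 4)
```

Under these settings, an examinee who answers every item incorrectly is classified after 4 items, while one who answers every item correctly needs 11.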

In a follow-up study using test re-enactments with the same historical data from the COM and DAL tests, Frick (1992) examined the efficiency and accuracy of several classical VL-CCT approaches and one IRT-based approach, with items calibrated using different sample sizes. Frick introduced the EXSPRT-R, which uses item-level parameter estimates and random item selection to make classification decisions about an examinee. Also introduced was the EXSPRT-I, which uses item-level parameter estimates but applies intelligent item selection to make classification decisions. EXSPRT-I was jointly developed by Frick and Plew (1989) and bases item selection on item discrimination, item/examinee incompatibility, and the utility of the item.
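To make the idea of item-level parameter estimates concrete, the following sketch accumulates a log likelihood ratio in which every item contributes its own pair of probabilities rather than the two test-wide values used by the SPRT. It assumes equal prior probabilities and responses that are independent given mastery status. The function name, data layout, and example values are hypothetical, and the intelligent item selection used by EXSPRT-I is not modeled.

```python
import math


def item_level_log_ratio(responses, item_params):
    """Accumulate log[P(responses | master) / P(responses | nonmaster)].

    responses:   list of (item_id, correct) pairs in administration order
    item_params: item_id -> (P(correct | master), P(correct | nonmaster))
    """
    log_lr = 0.0
    for item_id, correct in responses:
        p_m, p_n = item_params[item_id]
        if correct:
            log_lr += math.log(p_m / p_n)
        else:
            log_lr += math.log((1 - p_m) / (1 - p_n))
    return log_lr


# Hypothetical parameters: item "a" discriminates strongly, item "b" weakly.
params = {"a": (0.90, 0.30), "b": (0.80, 0.70)}
strong = item_level_log_ratio([("a", True)], params)
weak = item_level_log_ratio([("b", True)], params)
```

Here `strong` is about 1.10 and `weak` about 0.13, so a correct answer on the discriminating item moves the examinee much closer to a decision threshold. This is the source of the efficiency gain over a rule that treats all items alike.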

While the focus of the study was on examining the accuracy and efficiency of the various VL-CCT approaches, an additional factor was also examined: the consequences of calibration sample size. For both the DAL and the COM tests, item parameter estimates were established using two different sample sizes, 25 and 50 randomly selected examinees, with the latter sample including the former. The number of examinees who took the COM test also enabled calibration samples of 75 and 100 examinees to be used.
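The nested sampling design can be sketched as follows: shuffle the examinees once and take prefixes, so that each larger calibration sample contains every smaller one. The second function shows one way item parameters might be estimated from such a sample. The add-one smoothing it uses is an assumption chosen to keep estimates away from 0 and 1 in small samples, not a formula confirmed by this section, and all names are hypothetical.

```python
import random


def nested_samples(examinee_ids, sizes, seed=0):
    """Draw nested random samples: each larger sample contains the smaller."""
    rng = random.Random(seed)
    shuffled = list(examinee_ids)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes)}


def estimate_item_params(records):
    """Estimate (P(correct | master), P(correct | nonmaster)) for one item.

    records: iterable of (is_master, correct) pairs from the calibration
    sample. Uses (correct + 1) / (total + 2), an assumed smoothing rule.
    """
    counts = {True: [0, 0], False: [0, 0]}   # group -> [n_correct, n_total]
    for is_master, correct in records:
        counts[is_master][0] += int(correct)
        counts[is_master][1] += 1
    return tuple((c + 1) / (n + 2) for c, n in (counts[True], counts[False]))


# 53 DAL administrations, with calibration samples of 25 and 50.
samples = nested_samples(range(53), [25, 50])
```

With only 25 examinees split across two groups, each estimate rests on a handful of responses per item, which is consistent with the accuracy problems reported for the smallest calibration sample.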

Results from the 1992 study showed that calibration sample size did not substantially impact test efficiency but did impact accuracy. When only 25 examinees were used to calibrate items, the decisions reached by both EXSPRT-I and the EXSPRT departed significantly from decisions made using the total test, based on a chi-squared goodness-of-fit test (p < .05), and were less accurate than expected. With 50 examinees in the calibration sample, all the approaches except AMT and EXSPRT-I with the COM test had classification accuracies within expected error rates that did not significantly differ from decisions made using the total test. Calibration samples of 75 and 100 examinees enabled all but the AMT approach to make classification decisions within a priori error rates.

The percent agreement figures used in the study, unlike Proportion Reduction in Error (see Rudner, 2009, p. 7), do not explicitly address agreement due to chance. In addition, it is not clear whether or how problematic items were handled. Welch (1997) points out that the SPRT tests in the Frick study were simulated rather than used to control the test in real time, which provided the motivation for subsequent studies by Welch and Frick.
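The limitation of raw percent agreement can be shown with a small sketch. It uses Cohen's kappa as a familiar chance-corrected statistic; this is a stand-in chosen for illustration, not the Proportion Reduction in Error measure discussed by Rudner. The data are hypothetical.

```python
def percent_agreement(a, b):
    """Proportion of cases in which two sets of decisions match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the level expected by chance alone.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(a)
    p_observed = percent_agreement(a, b)
    labels = set(a) | set(b)
    p_chance = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: 90 masters and 10 nonmasters by total score, and a
# degenerate procedure that classifies every examinee as a master.
total_score = ["master"] * 90 + ["nonmaster"] * 10
always_master = ["master"] * 100
```

On these data `percent_agreement` is 0.90 while `cohens_kappa` is 0, because a procedure that ignores the responses entirely already agrees with the total score 90% of the time. High percent agreement is therefore easier to reach when one classification dominates the sample.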

Welch and Frick (1993) showed that SPRT and EXSPRT-R testing approaches can make accurate and efficient mastery decisions in real-time testing situations and are viable and practical alternatives to IRT-based methods. Thirty-eight students from a graduate course on the use of computers in education were randomly assigned to two groups (20 given EXSPRT-R/SPRT and 18 given EXSPRT-I). Tests drew from an item bank of 85 items that represented a variety of item types.

Examinees were told they would be taking two tests (adaptive and fixed length), but only one was actually given. Decisions about examinee mastery were made at various points using different algorithms, but all examinees ended up taking all 85 items. Item parameters for SPRT, EXSPRT, and Rasch estimates were based on historical data from 185 administrations from past studies (Frick, 1989; Powell, 1992). For SPRT, the probability of a correct response from a master was set at .90 and the probability of a correct response from a nonmaster was set at .63. Equal prior probabilities of mastery and nonmastery were assumed, and the acceptable rates of false mastery and false nonmastery were both set to .01. The Adaptive Mastery Testing (AMT) method (Weiss and Kingsbury, 1984) was used for the IRT approach.

Results again showed that EXSPRT-I tests were significantly shorter than EXSPRT-R tests (half as long), but no significant differences were found among the other tests. The conventional proportion correct with a confidence interval based on a standard error of measurement and the IRT theta estimation with a standard error of measurement based on test information at the given theta level made identical mastery decisions, with both being unable to make decisions in nearly 40% of cases. EXSPRT-R procedures applied to the total test, on the other hand, made decisions in all but 13% of cases (one third of 40%). When compared to classification decisions made by applying EXSPRT-R procedures to the total test, EXSPRT-I disagreed in over 20% of cases and SPRT disagreed 10% of the time. Decisions made with AMT disagreed with EXSPRT-R procedures applied to the total test in over 20% of cases. SPRT performed about as well as the other methods.

One critique of the Welch and Frick (1993) study is that SPRT is presented as requiring no historical data for the probabilities of a correct response from a classification group. However, this is not necessarily true. Decision makers can set these values, but without empirical data to support their decision the accuracy of their estimates cannot be determined. Furthermore, it is not clear why “a conventional proportion correct metric with a .85 cut-off score” (Welch & Frick, 1993, p. 58) was used rather than the halfway point between the mastery and nonmastery SPRT probabilities of a correct answer, as was done in earlier studies (e.g., Frick, 1989).

Only two studies could be found that specifically focus on how the size of the calibration sample impacts subsequent classical VL-CCT efficiency and accuracy – Frick’s 1992 study already described and a study from Rudner (2009) that will be reviewed next.

In document Facilitating Variable-Length Computerized Classification Testing Via Automatic Racing Calibration Heuristics (Page 55-60)