High Error Rate with Empirically Established Item-bank Level Probabilities

CHAPTER V. RESULTS

5.1 ARCH Calibration Sufficiency (RQ1) Results

5.1.2 High Error Rate with Empirically Established Item-bank Level Probabilities

empirically established item-bank level probabilities with SPRT instead of the set item-bank level probability values used in previous studies. It was found that the use of empirically established item-bank level probabilities with SPRT resulted in false nonmaster error rates higher than those established a priori. This was a problem for two reasons. First, the SPRT algorithm is a key component of ARCH, and issues with SPRT would also represent issues with ARCH. Second, RQ2 involves the comparison of ARCH to SPRT and problems with SPRT would limit the value of this comparison. The SPRT challenge was overcome by setting item-bank level probabilities to values used in previous COM test data-based studies rather than applying an empirical approach.

Item-bank level probabilities of a correct response from each classification group (equations 2 and 4) were established empirically using the total number of correct and incorrect responses from nonmasters (#C|NM and #¬C|NM) and masters (#C|M and #¬C|M). A probability of a correct response given nonmastery, P(C|NM), of 0.56 was established empirically by using all available response data from true nonmasters. A probability of a correct response given mastery, P(C|M), of 0.88 was also established empirically through response data from all true masters. The index of discrimination (equation 15) for the item-bank level probabilities established empirically was 0.32.
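As a minimal sketch of this calculation, the Python below estimates both probabilities and their difference. It assumes that equations 2 and 4, which are not reproduced in this section, reduce to the simple proportion of correct responses pooled within each group. The response counts are hypothetical and were chosen only so that the proportions match the reported 0.56 and 0.88; the function name is illustrative. The index of discrimination is taken to be P(C|M) minus P(C|NM), which agrees with the .88 - .56 = .32 arithmetic reported later in this section.

```python
def proportion_correct(n_correct: int, n_incorrect: int) -> float:
    """Share of correct responses among all responses pooled from one group."""
    total = n_correct + n_incorrect
    if total == 0:
        raise ValueError("the group has no responses")
    return n_correct / total


# Hypothetical pooled counts (NOT the study's data); they only reproduce
# the reported proportions of 0.56 and 0.88.
p_c_given_nm = proportion_correct(n_correct=1400, n_incorrect=1100)
p_c_given_m = proportion_correct(n_correct=2200, n_incorrect=300)

# Assumed form of the index of discrimination: P(C|M) - P(C|NM).
discrimination = p_c_given_m - p_c_given_nm

print(f"P(C|NM) = {p_c_given_nm:.2f}")
print(f"P(C|M)  = {p_c_given_m:.2f}")
print(f"index of discrimination = {discrimination:.2f}")
```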

Item-bank level probabilities used with SPRT have not typically been established empirically in previous studies, with the exception of Welch and Frick’s (1993, p. 57) use of empirically derived values. In the original research that used COM test data, “the SPRT parameters were set a priori as follows: mastery level = .85, non-mastery level = .60, α = β = .025” (Frick, 1989, p. 102) instead of being established empirically.

The values .85 and .60 were selected to reflect widely used letter grade cutoffs, with .725 representing the midpoint between these two cutoffs. Using the set values from the Frick (1989) study, the index of discrimination for the item-bank level probabilities (equation 15) was 0.25, which is 0.07 (21.9%) smaller than the index of discrimination calculated using empirically established probabilities. In other words, the set SPRT parameters were substantially less discriminating than the SPRT parameters established empirically.
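The comparison above can be checked with a few lines of Python. This is only an arithmetic sketch: the dictionary keys and the helper name are illustrative, and the index of discrimination is again assumed to be the difference between the two item-bank level probabilities.

```python
empirical = {"p_m": 0.88, "p_nm": 0.56}  # values established from response data
manual = {"p_m": 0.85, "p_nm": 0.60}     # set values from Frick (1989)


def index_of_discrimination(params: dict) -> float:
    """Assumed form: P(C|M) - P(C|NM)."""
    return params["p_m"] - params["p_nm"]


d_empirical = index_of_discrimination(empirical)
d_manual = index_of_discrimination(manual)

absolute_gap = d_empirical - d_manual      # how much smaller the set index is
relative_gap = absolute_gap / d_empirical  # relative to the empirical index
midpoint = (manual["p_m"] + manual["p_nm"]) / 2

print(f"empirical index = {d_empirical:.2f}, manual index = {d_manual:.2f}")
print(f"gap = {absolute_gap:.2f} ({relative_gap:.1%} of the empirical index)")
print(f"midpoint of the set values = {midpoint:.3f}")
```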

I initially thought that empirically based item-bank level probabilities would be the most appropriate to use in simulations involving both SPRT and ARCH. Since these values are based on actual response data, I expected them to lead to optimal SPRT performance. However, repeated simulations of SPRT with COM test data using P(C|M) = 0.88, P(C|NM) = 0.56, α = β = .025 consistently yielded false nonmastery error rates that exceeded the .025 rate established a priori. Recall that a false nonmastery error occurs when an examinee is classified as a nonmaster when they are, in fact, a master (according to their total test score).
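For readers who want to see the decision rule in concrete form, the following is a minimal sketch of a two-hypothesis SPRT in the standard Wald log-likelihood-ratio form, not the code used in this study. It assumes that α is the false mastery rate and β the false nonmastery rate (with α = β the two thresholds are symmetric), and that every item shares the same item-bank level probabilities. The function name and return format are illustrative.

```python
import math


def sprt_classify(responses, p_m=0.88, p_nm=0.56, alpha=0.025, beta=0.025):
    """Classify one examinee from a sequence of True/False item responses.

    Returns (decision, items_used), where decision is 'master',
    'nonmaster', or 'no decision' if the responses run out first.
    """
    upper = math.log((1 - beta) / alpha)   # cross this: decide mastery
    lower = math.log(beta / (1 - alpha))   # cross this: decide nonmastery
    log_lr = 0.0                           # log of P(data|M) / P(data|NM)
    for items_used, correct in enumerate(responses, start=1):
        if correct:
            log_lr += math.log(p_m / p_nm)
        else:
            log_lr += math.log((1 - p_m) / (1 - p_nm))
        if log_lr >= upper:
            return "master", items_used
        if log_lr <= lower:
            return "nonmaster", items_used
    return "no decision", len(responses)


print(sprt_classify([True] * 20))    # an unbroken run of correct answers
print(sprt_classify([False] * 20))   # an unbroken run of incorrect answers
print(sprt_classify([True, False]))  # too few responses to decide
```

A false nonmastery error in this sketch corresponds to the function returning 'nonmaster' for an examinee whose total test score marks them as a master.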

For example, response data from 104 examinees were used to simulate 2,080 SPRT tests calibrated empirically, using all the available response data to establish item-bank level probabilities of P(C|M) = 0.88, P(C|NM) = 0.56, α = β = .025. In the simulation, each examinee was administered an SPRT-based test twenty times. Since SPRT randomly selects items, it is unlikely that any two SPRT test administrations would be identical.
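The simulation design described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the study's code: the response matrix is synthetic, the bank size of 60 items and the ability levels are hypothetical, and the 0.725 total-score cutoff used to label true masters is an assumption made only for this example. Items are drawn in random order without replacement for each administration.

```python
import math
import random


def sprt_decision(responses, p_m, p_nm, alpha, beta):
    """Return 'master', 'nonmaster', or None if the items run out first."""
    upper = math.log((1 - beta) / alpha)
    lower = math.log(beta / (1 - alpha))
    log_lr = 0.0
    for correct in responses:
        if correct:
            log_lr += math.log(p_m / p_nm)
        else:
            log_lr += math.log((1 - p_m) / (1 - p_nm))
        if log_lr >= upper:
            return "master"
        if log_lr <= lower:
            return "nonmaster"
    return None


def simulate(response_matrix, is_true_master, reps=20, seed=0,
             p_m=0.88, p_nm=0.56, alpha=0.025, beta=0.025):
    """Administer `reps` randomly ordered SPRT tests to every examinee."""
    rng = random.Random(seed)
    tally = {"administrations": 0, "decisions": 0,
             "false_nonmastery": 0, "false_mastery": 0}
    for responses, truly_master in zip(response_matrix, is_true_master):
        for _ in range(reps):
            # Random item order without replacement, so repeated
            # administrations for one examinee rarely coincide.
            order = rng.sample(range(len(responses)), len(responses))
            ordered = [responses[i] for i in order]
            decision = sprt_decision(ordered, p_m, p_nm, alpha, beta)
            tally["administrations"] += 1
            if decision is None:
                continue
            tally["decisions"] += 1
            if decision == "nonmaster" and truly_master:
                tally["false_nonmastery"] += 1
            elif decision == "master" and not truly_master:
                tally["false_mastery"] += 1
    return tally


# Synthetic stand-in for the response data (hypothetical values throughout).
data_rng = random.Random(42)
N_EXAMINEES, N_ITEMS, CUTOFF = 104, 60, 0.725
abilities = [data_rng.choice([0.5, 0.7, 0.9]) for _ in range(N_EXAMINEES)]
response_matrix = [[data_rng.random() < a for _ in range(N_ITEMS)]
                   for a in abilities]
is_true_master = [sum(r) / N_ITEMS >= CUTOFF for r in response_matrix]

tally = simulate(response_matrix, is_true_master)
print(tally)
```

With 104 examinees and 20 repetitions the tally always records 2,080 administrations; the error counts depend entirely on the synthetic data and should not be read as a reproduction of the reported results.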

Table 15 provides the error rates by algorithm and includes both the empirically calibrated SPRT and the manually calibrated SPRT using set parameters to match earlier SPRT studies based on COM test data. Out of the 2,080 simulated test administrations, SPRT calibrated empirically made 2,073 decisions, with 67 of those decisions (3.23%) being false nonmastery decisions. A false nonmastery error rate of 3.23% exceeds the a priori false nonmastery rate of 2.5% by 0.73 percentage points. The mean test length of tests that applied SPRT calibrated empirically was 16.61 items (SD = 12.28).

Table 15. SPRT Decision Error Rates by Method of Setting Item-bank Level Probabilities

Item-bank Probability   False Nonmastery   False Mastery      Total              Total Nonmastery &
Approach                Errors             Errors             Errors             Mastery Decisions
                        n      %           n      %           n      %           N
Empirical               67     3.23%       29     1.40%       96     4.63%       2,073
Manual                  42     2.11%       28     1.41%       70     3.52%       1,990
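The percentages in Table 15 can be re-derived from the reported counts, as the short check below illustrates. Only the counts and the 2,080 administrations come from the text; the dictionary layout and function name are illustrative, and each percentage is computed relative to the number of decisions made rather than the number of administrations.

```python
ADMINISTRATIONS = 2080   # 104 examinees x 20 simulated tests each
A_PRIORI_RATE = 2.5      # percent, from alpha = beta = .025

table_15 = {
    "Empirical": {"false_nonmastery": 67, "false_mastery": 29, "decisions": 2073},
    "Manual": {"false_nonmastery": 42, "false_mastery": 28, "decisions": 1990},
}


def error_rates(row: dict) -> dict:
    """Percentages relative to the number of decisions actually made."""
    n = row["decisions"]
    total_errors = row["false_nonmastery"] + row["false_mastery"]
    return {
        "false_nonmastery_pct": 100 * row["false_nonmastery"] / n,
        "false_mastery_pct": 100 * row["false_mastery"] / n,
        "total_errors": total_errors,
        "total_pct": 100 * total_errors / n,
        "no_decision": ADMINISTRATIONS - n,  # tests that ended undecided
    }


rates = {name: error_rates(row) for name, row in table_15.items()}
for name, r in rates.items():
    excess = r["false_nonmastery_pct"] - A_PRIORI_RATE
    print(f"{name}: false nonmastery {r['false_nonmastery_pct']:.2f}% "
          f"({excess:+.2f} points vs. a priori), "
          f"false mastery {r['false_mastery_pct']:.2f}%, "
          f"total {r['total_pct']:.2f}%, no decision {r['no_decision']}")
```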

Using the same approach, response data from 104 examinees were used to simulate 2,080 SPRT tests using item-bank level probabilities manually set to P(C|M) = 0.85, P(C|NM) = 0.60, α = β = .025, consistent with earlier studies (Frick, 1989). Out of the 2,080 simulated test administrations, SPRT made 1,990 decisions (83 fewer than the empirically calibrated SPRT), with 42 of those decisions (2.11%) being false nonmastery decisions, which is below the a priori false nonmastery rate of 2.5%. The mean test length of the tests that applied SPRT with item-bank probabilities set manually was 21.97 items (SD = 16.37), which is 5.36 items (32.3%) longer than the mean obtained with SPRT using item-bank probabilities set empirically. In hindsight, this should not be surprising, since the SPRT requires more items to reach a decision when Wald’s zone of uncertainty is smaller ([.85 - .60 = .25] is less than [.88 - .56 = .32]) and the a priori error rates are the same (see Frick, 1989).
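One way to see why a narrower zone of uncertainty lengthens the test is to compare how far a single response moves the log-likelihood ratio under each calibration. The sketch below assumes the standard Wald thresholds and uses illustrative names; it reports the evidence contributed by one correct or incorrect response and the shortest unbroken run of each that would trigger a decision. These run lengths are lower bounds on test length, not the mean test lengths reported above.

```python
import math


def sprt_step_summary(p_m, p_nm, alpha=0.025, beta=0.025):
    """Per-response evidence and shortest decisive runs for one calibration."""
    upper = math.log((1 - beta) / alpha)   # mastery threshold (positive)
    lower = math.log(beta / (1 - alpha))   # nonmastery threshold (negative)
    step_correct = math.log(p_m / p_nm)                # positive increment
    step_incorrect = math.log((1 - p_m) / (1 - p_nm))  # negative increment
    return {
        "step_correct": step_correct,
        "step_incorrect": step_incorrect,
        # Both lower and step_incorrect are negative, so the ratio is positive.
        "min_correct_run": math.ceil(upper / step_correct),
        "min_incorrect_run": math.ceil(lower / step_incorrect),
    }


empirical = sprt_step_summary(p_m=0.88, p_nm=0.56)
manual = sprt_step_summary(p_m=0.85, p_nm=0.60)

for name, s in (("Empirical", empirical), ("Manual", manual)):
    print(f"{name}: correct {s['step_correct']:+.3f}, "
          f"incorrect {s['step_incorrect']:+.3f}, "
          f"shortest mastery run {s['min_correct_run']}, "
          f"shortest nonmastery run {s['min_incorrect_run']}")
```

Under these assumptions each response carries less evidence with the set values (.85 and .60) than with the empirical values (.88 and .56), so more responses are needed to cross either threshold.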

The differences in results between SPRT calibrated empirically and SPRT using manually set values are outside the scope of this study but warrant further investigation. Nevertheless, the analysis above shows that: (1) it cannot be assumed that SPRT will always make classification decisions within error rates established a priori; (2) the choice of item-bank level probabilities affects SPRT error rates; and (3) the empirically established item-bank level probabilities for the COM test data had a higher index of discrimination, shorter average SPRT test lengths, and higher false nonmastery error rates than the associated manually set item-bank level probabilities.

Given that SPRT with the set values from earlier studies produced decision error rates within the rates established a priori, I decided to proceed with SPRT using the manually set values and to abandon empirically established item-bank level probabilities for use with SPRT for the remainder of the Monte Carlo studies.

In document Facilitating Variable-Length Computerized Classification Testing Via Automatic Racing Calibration Heuristics (Page 135-138)