

2.3 Unidimensional Item Selection Algorithms

2.3.2 Kullback-Leibler Methods

Chang and Ying (1996) suggested basing item selection on Kullback-Leibler (KL) divergence rather than FI as the information metric. KL divergence (e.g., Kullback, 1959; Kullback & Leibler, 1951) relates to the expected loss when choosing an approximate model rather than the correct model. Let $f$ be the true probability distribution of a univariate random variable $X$, and let $g$ be an alternative/approximate distribution of $X$. Then the KL divergence between $f$ and $g$ is defined to be

$$
KL(f \| g) = E_f\left[\log \frac{f(X)}{g(X)}\right] = \int_{-\infty}^{\infty} f(x) \log\left[\frac{f(x)}{g(x)}\right] dx, \tag{2.23}
$$

where $\|$ stands for “distance” between distributions, $KL(f \| g) \geq 0$, and the expectation is taken with respect to the true distribution, $f$. To derive a KL-based index for computerized adaptive tests, let $\theta_i$ be the true ability of examinee $i$. Then KL divergence for the $j$th item can be defined as

$$
KL_j(\theta_i \| \theta) = E_{\theta_i} \log\left[\frac{L(\theta_i \mid Y_{ij})}{L(\theta \mid Y_{ij})}\right] = p_j(\theta_i) \log\left[\frac{p_j(\theta_i)}{p_j(\theta)}\right] + [1 - p_j(\theta_i)] \log\left[\frac{1 - p_j(\theta_i)}{1 - p_j(\theta)}\right]. \tag{2.24}
$$

As shown by Chang and Ying (1996), the curvature of the KL divergence function at a point (its second derivative with respect to $\theta$, evaluated at $\theta = \theta_i$) is equal to Fisher information at that point. Therefore, KL divergence effectively reduces to Fisher information when $\theta$ is chosen close to $\theta_i$. Chang and Ying (1996) recommended using KL divergence because “there is no requirement that $\theta_i$ be close to $\theta$” (p. 218), unlike the more local character of Fisher information. Moreover, by using a likelihood ratio statistic, KL divergence is similar to the decision-making process of the SPRT. Several different KL divergence indices have been proposed for use in adaptive testing. Originally, the KL information index was defined as average KL divergence along a small interval,

$$
KL_j(\hat{\theta}_i) = \int_{\hat{\theta}_i - \delta_{ij}}^{\hat{\theta}_i + \delta_{ij}} KL_j(\hat{\theta}_i \| \theta) \, d\theta, \tag{2.25}
$$

where $\hat{\theta}_i$ is the MLE of $\theta_i$ before administering item $j$, and $\delta_{ij}$ is a function of the precision of the MLE. Chen, Ankenmann, and Chang (2000) also noted that, as in Fisher information, weight functions can be applied to KL divergence indices, resulting in

$$
KL_j(\hat{\theta}_i \mid w_{ij}) = \int_{\hat{\theta}_i - \delta_{ij}}^{\hat{\theta}_i + \delta_{ij}} w_{ij} \, KL_j(\hat{\theta}_i \| \theta) \, d\theta. \tag{2.26}
$$

They further compared bias, MSE, and item overlap for various FI and KL criteria across CATs designed to estimate $\theta_i$ for each person. Only for extreme ability levels did KL information or weighted Fisher information improve over standard Fisher information in terms of bias, MSE, and item overlap early in a test. Moreover, as they wrote, “differences among all [item selection algorithms] with respect to BIAS, RMSE, SE, and item overlap were negligible for tests of more than 10 items” (p. 253; see also Cheng and Lio, 2000, for a partial replication of this study with nearly identical results).
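The point-wise divergence in Equation 2.24 and the interval index in Equation 2.25 can be sketched numerically. The sketch below is illustrative only: it assumes a three-parameter logistic (3PL) response function, and the names `p3pl`, `fisher_info`, `kl_pointwise`, and `kl_index`, the item parameters `a`, `b`, `c`, and the values of `theta_hat` and `delta` are all hypothetical rather than taken from the cited studies. It also checks, by finite differences, that the curvature of the KL divergence at $\theta = \hat{\theta}_i$ is close to Fisher information there.

```python
import numpy as np


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (assumed here)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


def fisher_info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p3pl(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def kl_pointwise(theta_true, theta_alt, a, b, c):
    """Equation 2.24: KL divergence of one item, theta_true against theta_alt."""
    p_true = p3pl(theta_true, a, b, c)
    p_alt = p3pl(theta_alt, a, b, c)
    return (p_true * np.log(p_true / p_alt)
            + (1.0 - p_true) * np.log((1.0 - p_true) / (1.0 - p_alt)))


def kl_index(theta_hat, delta, a, b, c, n_grid=201):
    """Equation 2.25: KL divergence integrated over theta_hat +/- delta."""
    grid = np.linspace(theta_hat - delta, theta_hat + delta, n_grid)
    values = kl_pointwise(theta_hat, grid, a, b, c)
    # Trapezoid rule written out by hand.
    return float(np.sum(np.diff(grid) * (values[:-1] + values[1:]) / 2.0))


# Hypothetical item and interim ability estimate, for illustration only.
a, b, c = 1.2, 0.3, 0.2
theta_hat, delta = 0.0, 0.5

# Central second difference of KL in theta, evaluated at theta = theta_hat.
h = 1e-3
curvature = (kl_pointwise(theta_hat, theta_hat + h, a, b, c)
             - 2.0 * kl_pointwise(theta_hat, theta_hat, a, b, c)
             + kl_pointwise(theta_hat, theta_hat - h, a, b, c)) / h**2

print("curvature of KL at theta_hat:", curvature)
print("Fisher information at theta_hat:", fisher_info(theta_hat, a, b, c))
print("KL index over the interval:", kl_index(theta_hat, delta, a, b, c))
```

In an adaptive test, `kl_index` would be evaluated for every remaining item and the item with the largest value administered next. A weight function as in Equation 2.26 could be applied by multiplying `values` by the weights before integrating.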

All of the item selection algorithms heretofore discussed were derived to pinpoint an examinee’s true ability. None of the algorithms as presented can be used to efficiently decide whether an examinee is in one of two broadly defined categories. In Chapter 3, I show why each of the aforementioned algorithms results in inefficient CMTs. However, before explaining reasons for inefficiencies, I first describe alternatives to the typical item selection algorithms appropriate for unidimensional adaptive mastery tests.

2.3.3 Mastery Testing Methods

Many researchers have suggested modifications of the above algorithms for use in mastery testing. For instance, Eggen (1999) promoted selecting items by maximizing Fisher information at the cut-point $\theta_0$ rather than at $\hat{\theta}_i$, or by maximizing point-wise KL divergence (Equation 2.24) evaluated as $KL_j(\theta_u \| \theta_l)$. He found that maximizing Fisher information at the cut-point resulted in the shortest and most accurate tests, but selecting items to maximize Fisher information at $\hat{\theta}_i$ or KL divergence using $KL_j(\theta_u \| \theta_l)$ did not result in much performance decrement (although see Eggen, 2010, for a replication of Eggen, 1999, with slightly different results).
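The three selection rules compared above can be sketched side by side. This is a minimal illustration, not Eggen's procedure: it assumes a 3PL response function, and the five-item bank, the cut-point `theta_0`, the half-width `delta` of the indifference region, and the interim estimate `theta_hat` are hypothetical values chosen only to show that the rules can pick different items.

```python
import numpy as np


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (assumed here)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


def fisher_info(theta, a, b, c):
    """Fisher information of 3PL items at theta."""
    p = p3pl(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def kl_pointwise(theta_true, theta_alt, a, b, c):
    """Equation 2.24 evaluated for each item in the bank."""
    p_true = p3pl(theta_true, a, b, c)
    p_alt = p3pl(theta_alt, a, b, c)
    return (p_true * np.log(p_true / p_alt)
            + (1.0 - p_true) * np.log((1.0 - p_true) / (1.0 - p_alt)))


# Hypothetical five-item bank (discrimination, difficulty, lower asymptote).
a = np.array([1.0, 1.2, 0.8, 1.5, 1.1])
b = np.array([-1.0, 0.0, 0.5, 1.5, 2.5])
c = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

theta_0 = 0.0      # cut-point (hypothetical)
delta = 0.25       # half-width of the indifference region (hypothetical)
theta_hat = 1.4    # interim ability estimate (hypothetical)
theta_u, theta_l = theta_0 + delta, theta_0 - delta

pick_fi_cut = int(np.argmax(fisher_info(theta_0, a, b, c)))
pick_fi_est = int(np.argmax(fisher_info(theta_hat, a, b, c)))
pick_kl = int(np.argmax(kl_pointwise(theta_u, theta_l, a, b, c)))

print("max FI at the cut-point selects item", pick_fi_cut)
print("max FI at the estimate selects item", pick_fi_est)
print("max KL between the bounds selects item", pick_kl)
```

The cut-point rule ignores `theta_hat` entirely, so it selects the same item for every examinee who has seen the same items, whereas the estimate-based rule follows the examinee.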

A common complaint in using point-wise KL divergence in mastery testing is the lack of symmetry between $KL_j(\theta_u \| \theta_l)$ and $KL_j(\theta_l \| \theta_u)$. Recall that KL divergence is defined as the expected log-likelihood ratio comparing the true model to an alternative model with respect to the true model. Therefore, when choosing items by maximizing $KL_j(\theta_u \| \theta_l)$, one implicitly assumes that every examinee is a master. Alternative mastery testing item selection algorithms have been developed that better consider the actual location of an examinee when selecting items. These alternative algorithms include the weighted log-odds ratio (LO; Lin & Spray, 2000) and mutual information (MI; Weissman, 2007). The weighted log-odds ratio selects items that maximize the expected log-odds at the ends of the indifference region,

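$$
\begin{aligned}
LO_j(\theta_u \| \theta_l) &= \sum_{y} E \log\left[\left(\frac{p_j(\theta_u)}{p_j(\theta_l)}\right)^{Y} \div \left(\frac{1 - p_j(\theta_u)}{1 - p_j(\theta_l)}\right)^{1 - Y}\right] && (2.27) \\
&= E(Y = 1) \log\left[\frac{p_j(\theta_u)}{p_j(\theta_l)}\right] - [1 - E(Y = 1)] \log\left[\frac{1 - p_j(\theta_u)}{1 - p_j(\theta_l)}\right], && (2.28)
\end{aligned}
$$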

where $E(Y = 1)$ is the classical difficulty of an item and can be calculated by integrating the probability of a correct response given $\theta$, weighted by the density of $\theta$, over the examinee distribution.⁶
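Equation 2.28 can be sketched as follows. The sketch assumes a 3PL response function and, for the marginal version of $E(Y = 1)$, a standard normal examinee distribution; both are assumptions made for illustration, as are the function names and parameter values. The second call replaces the marginal expectation with the response probability at a single examinee's interim estimate, the alternative described in the footnote.

```python
import numpy as np


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (assumed here)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


def classical_difficulty(a, b, c, n_grid=401):
    """E(Y = 1) under an assumed N(0, 1) examinee distribution."""
    grid = np.linspace(-6.0, 6.0, n_grid)
    density = np.exp(-0.5 * grid**2) / np.sqrt(2.0 * np.pi)
    values = p3pl(grid, a, b, c) * density
    # Trapezoid rule written out by hand.
    return float(np.sum(np.diff(grid) * (values[:-1] + values[1:]) / 2.0))


def weighted_log_odds(theta_u, theta_l, a, b, c, p_correct):
    """Equation 2.28, with E(Y = 1) supplied by the caller as p_correct."""
    p_u = p3pl(theta_u, a, b, c)
    p_l = p3pl(theta_l, a, b, c)
    return (p_correct * np.log(p_u / p_l)
            - (1.0 - p_correct) * np.log((1.0 - p_u) / (1.0 - p_l)))


# Hypothetical item and indifference region.
a, b, c = 1.2, 0.3, 0.2
theta_u, theta_l = 0.25, -0.25

# Marginal expectation over the examinee distribution.
lo_marginal = weighted_log_odds(theta_u, theta_l, a, b, c,
                                classical_difficulty(a, b, c))

# Expectation at one examinee's interim estimate (hypothetical value).
theta_hat = 1.0
lo_examinee = weighted_log_odds(theta_u, theta_l, a, b, c,
                                p3pl(theta_hat, a, b, c))

print("LO with marginal E(Y = 1):", lo_marginal)
print("LO with E(Y = 1) at theta_hat:", lo_examinee)
```

Because $\theta_u > \theta_l$, the first logarithm is positive and the second is negative, so both terms contribute positively to the index for any item with positive discrimination.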

Mutual information generalizes log-likelihood-based information criteria across multiple cut-points. Weissman (2007) proposed MI as a symmetric version of KL divergence. Let $\Theta_B$ be a discrete set describing the classification bound(s). In our case, $\Theta_B = \{\theta_l, \theta_u\}$. Then mutual information can be defined as

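$$
MI_j(\Theta_B) = \sum_{y} \sum_{\theta \in \Theta_B} f(y, \theta) \log\left[\frac{f(y, \theta)}{f(y) f(\theta)}\right] = \sum_{y} \sum_{\theta \in \Theta_B} \Pr\nolimits_j(Y = y \mid \theta) \, \pi(\theta) \log\left[\frac{\Pr_j(Y = y \mid \theta)}{f(y)}\right], \tag{2.29}
$$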

where $\Pr_j(Y = y \mid \theta)$ is the probability of $Y = y$ given a particular $\theta$, $\pi(\theta)$ is the prior probability of $\theta$, and $f(y)$ is the marginal probability of $Y = y$. Lin (2011) tested FI, KL, LO, and MI in several SPRT-based CMTs. He found that the weighted log-odds ratio resulted in the fewest items administered, and mutual information resulted in the most. All of the algorithms had similar classification accuracies. In Chapter 4, I discuss generalizations of the item selection and stopping rules to multidimensional adaptive tests. But first, I explain limitations of using certain item selection rules in adaptive mastery tests as a partial justification for deriving particular multidimensional mastery testing item selection algorithms.

⁶ Lin and Spray (2000) and Lin (2011) take the expectation in Equation (2.27) with respect to the marginal distribution of $\theta$ to arrive at Equation (2.28). However, I found taking the expectation with respect to a single examinee’s $\hat{\theta}_i$ to better reflect the associated SPRT stopping rule. The latter item
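Equation 2.29 can be sketched for a dichotomous item as below. The sketch assumes a 3PL response function and a prior that places equal mass on the two bounds; the function names, item parameters, and prior are hypothetical and serve only to illustrate the computation.

```python
import numpy as np


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (assumed here)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


def mutual_information(theta_points, prior, a, b, c):
    """Equation 2.29 for one dichotomous item over a discrete set of theta."""
    theta_points = np.asarray(theta_points, dtype=float)
    prior = np.asarray(prior, dtype=float)
    p_correct = p3pl(theta_points, a, b, c)
    # Rows index y = 0, 1; columns index the points in the bound set.
    conditional = np.vstack([1.0 - p_correct, p_correct])
    marginal = conditional @ prior          # f(y)
    joint = conditional * prior             # f(y, theta)
    return float(np.sum(joint * np.log(conditional / marginal[:, None])))


# Hypothetical item, bounds, and prior over the bounds.
a, b, c = 1.2, 0.3, 0.2
theta_l, theta_u = -0.25, 0.25

mi = mutual_information([theta_l, theta_u], [0.5, 0.5], a, b, c)
mi_swapped = mutual_information([theta_u, theta_l], [0.5, 0.5], a, b, c)
mi_degenerate = mutual_information([theta_u, theta_u], [0.5, 0.5], a, b, c)

print("MI over the bounds:", mi)
print("MI with the bounds swapped:", mi_swapped)
print("MI when both bounds coincide:", mi_degenerate)
```

Swapping the order of the bounds leaves the value unchanged, which is the sense in which the index treats the two bounds symmetrically, and the index is zero when the bounds coincide because the response then carries no information about which bound holds.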

Chapter 3

SPRT and Binary Response Models

A potential limitation of using the SPRT as a decision rule in unidimensional classification tests is due to the non-zero lower asymptote of the three-parameter logistic (3PL) model. Spray and Reckase (1994) noticed that when using the 3PL, “selecting items to have maximum information at the examinee’s true ability results in longer average test lengths” (p. 9) than selecting items to have maximum information at the cut-points, and “this result is quite dramatic for the lower [classification bound] and examinees above θ of .5” (p. 9). In other words, the SPRT is inefficient for high ability simulees when using the 3PL and selecting items based on the maximum likelihood estimate. Spray and Reckase (1994) proposed a simple method of reducing the number of administered items in SPRT-based classification tests: select items to maximize information at the cut-point separating categories. However, the ideal item selection point depends on the true item and person parameters as well as the classification bound. Selecting items to maximize information at the classification bound is only a coarse approximation of the most efficient item selection algorithm. By examining properties of IRT log-likelihood ratios, one can shed light on optimal methods of designing item banks, choosing item selection algorithms, and selecting classification criteria for adaptive tests. Because multidimensional IRT models are generalizations of the unidimensional functional form, many of these results should also apply to multidimensional adaptive tests. In the following sections, I present the effect of item and person parameters on the magnitude of the SPRT test statistic in two parts: first with mathematical evidence, and then by supporting the mathematical conclusions with a small set of simulations.
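As a rough numerical illustration of the lower-asymptote effect described above (a sketch under stated assumptions, not the derivation developed in the following sections), one can compare the expected per-item increment of the SPRT log-likelihood ratio for a high-ability examinee when the item is matched to the examinee versus matched to the cut-point. The function names and all parameter values below are hypothetical.

```python
import numpy as np


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (assumed here)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


def log_lr(y, theta_u, theta_l, a, b, c):
    """Log-likelihood ratio of one response y (0 or 1), theta_u against theta_l."""
    p_u = p3pl(theta_u, a, b, c)
    p_l = p3pl(theta_l, a, b, c)
    return y * np.log(p_u / p_l) + (1 - y) * np.log((1.0 - p_u) / (1.0 - p_l))


def expected_log_lr(theta_true, theta_u, theta_l, a, b, c):
    """Expected increment of the SPRT statistic for an examinee at theta_true."""
    p_true = p3pl(theta_true, a, b, c)
    return (p_true * log_lr(1, theta_u, theta_l, a, b, c)
            + (1.0 - p_true) * log_lr(0, theta_u, theta_l, a, b, c))


# Hypothetical setting: cut-point at 0, indifference region of +/- 0.25,
# and a high-ability examinee at theta = 2.
theta_true, theta_u, theta_l, a = 2.0, 0.25, -0.25, 1.0

for c in (0.0, 0.2):
    at_ability = expected_log_lr(theta_true, theta_u, theta_l, a, b=2.0, c=c)
    at_cut = expected_log_lr(theta_true, theta_u, theta_l, a, b=0.0, c=c)
    print(f"c = {c}: item at ability -> {at_ability:.4f}, "
          f"item at cut-point -> {at_cut:.4f}")

drift_3pl_ability = expected_log_lr(theta_true, theta_u, theta_l, a, 2.0, 0.2)
drift_3pl_cut = expected_log_lr(theta_true, theta_u, theta_l, a, 0.0, 0.2)
drift_2pl_ability = expected_log_lr(theta_true, theta_u, theta_l, a, 2.0, 0.0)
drift_2pl_cut = expected_log_lr(theta_true, theta_u, theta_l, a, 0.0, 0.0)
```

For these hypothetical values, the two expected increments are nearly equal when c = 0, whereas with c = 0.2 the item located at the cut-point yields roughly twice the expected increment of the item matched to the examinee. A smaller expected increment means the statistic approaches the SPRT decision threshold more slowly on average, which is consistent with the longer test lengths reported by Spray and Reckase (1994).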