4 TUTOR MODEL
4.3 Adaptive assessment and feedback
4.3.1 Computerized Adaptive Testing and Item Response Theory
In the CAL context, Computerized Adaptive Testing (CAT) differs from the static approach of traditional tests because the test is constructed dynamically and the number of questions is not predefined. The idea behind a CAT is quite straightforward: to administer to each examinee only those items that are useful for determining his/her proficiency level. As a consequence, a CAT is usually more efficient than a conventional, i.e. fixed-item, test, providing more precise measurements for tests of the same length or shorter tests for the same measurement precision (Ponsoda, 2000).
From the examinee's perspective, the difficulty of the generated test seems to tailor itself to his/her knowledge level (which is why early systems called it ‘tailored testing’). For example, if an examinee performs well on an item of intermediate difficulty, there should be a high probability that the next question is more difficult. Conversely, if he/she performs poorly, a simpler question would be the more suitable next step. This does not mean that the intention of CAT is either to make assessments easier for students with a low knowledge level by presenting them with easier questions, or to make them harder for those who answer correctly because they have mastered the topics. What CAT really seeks is to avoid the boredom of students who must repeat material they have already shown they know, as well as the frustration of those who block mentally when facing a difficult test.
In order to achieve this aim, the general CAT procedure consists of an iterative algorithm with the following steps (Thissen & Mislevy, 2000):
1. The most suitable assessment item is selected from the item bank, based on the current estimate of the examinee's ability.
2. The chosen item is presented to the examinee, who then answers it correctly or incorrectly.
3. The ability estimate is updated, based upon all prior answers.
4. Steps 1 to 3 are repeated until a termination criterion is met.
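The loop described in these steps can be sketched in a few lines of Python. This is only an illustrative skeleton under stated assumptions, not the procedure of any of the cited systems: the `Item` class, the `run_cat` function, the "closest difficulty" selection rule and the halving-step update are all hypothetical placeholders. In a real CAT the selection rule and the ability estimate would come from a psychometric model, and the estimate would be recomputed from all prior answers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical item-bank entry; only a difficulty label is assumed."""
    name: str
    difficulty: float


def run_cat(
    bank: List[Item],
    answer: Callable[[Item], bool],
    start_ability: float = 0.0,
    max_items: int = 10,
    min_step: float = 0.1,
) -> Tuple[float, List[Tuple[Item, bool]]]:
    """Run the iterative loop; return the final estimate and the answer log."""
    ability = start_ability
    step = 1.0
    remaining = list(bank)
    log: List[Tuple[Item, bool]] = []

    # Step 4: repeat until a termination criterion is met. Here (assumption):
    # bank exhausted, item limit reached, or the update step became small.
    while remaining and len(log) < max_items and step >= min_step:
        # Step 1: select the unused item whose difficulty is closest to the
        # current ability estimate (placeholder selection criterion).
        item = min(remaining, key=lambda it: abs(it.difficulty - ability))
        remaining.remove(item)

        # Step 2: present the item and record a correct/incorrect answer.
        correct = answer(item)
        log.append((item, correct))

        # Step 3: update the estimate. Toy rule: move up after a correct
        # answer, down after an incorrect one, halving the step each time.
        ability += step if correct else -step
        step /= 2.0

    return ability, log


if __name__ == "__main__":
    # Hypothetical bank with difficulties from -3.0 to 3.0 in steps of 0.5.
    bank = [Item(f"q{i}", -3.0 + 0.5 * i) for i in range(13)]
    # Simulated examinee who answers correctly up to difficulty 1.0.
    estimate, log = run_cat(bank, lambda it: it.difficulty <= 1.0)
    print(estimate, [(it.name, ok) for it, ok in log])
```

Passing the examinee as a callable keeps the loop independent of how items are actually presented, so the same skeleton can be driven by a simulation, as here, or by a user interface.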
According to this procedure, the four fundamental elements of a CAT are: a) an assessment item bank, b) a criterion to select items, c) a procedure to estimate the student’s knowledge level, and d) a stopping criterion. A good item bank must contain a large number of correctly described items; in general, the more items available, the better the test performs. The stopping criterion may take different forms, such as the estimate reaching a certain threshold or a time limit being reached.
Now, with regard to elements b) and c), several applications from AEHS, ITS and AIES (Inspire (Papanikolaou et al., 2003), SIETTE (Conejo et al., 2004), AHA 3.0 (de Bra et al., 2007), CIA (Jiménez et al., 2008), Flip (Barla et al., 2010)) define them based on an approach known as Item Response Theory (IRT). Formerly known as ‘Latent Trait Theory’, IRT tries to provide a probabilistic basis for the problem of measuring traits that are not directly observable (latent traits). Its name derives from considering the item or question as the fundamental unit of the test, instead of the total score, as was common in traditional testing approaches.
According to this theory, the relationship between the trait θ (which may be understood as the examinee's ability or knowledge level) and the subject's answer to each item (question) may be explained through a monotonically increasing function, known as the Item Characteristic Curve (ICC), which gives the probability of a right answer. Depending on the nature and parameters of this function, several models may be used. Some of the most popular are (Traub & Wolfe, 1981):
The Rasch model, also known as 1PL because it has a logistic shape and just one item parameter: difficulty.
The normal ogive or logistic model with two item parameters: difficulty and discrimination. Its logistic version is the more common and is known as 2PL.
The normal ogive or logistic model with three item parameters: difficulty, discrimination, and guessing. Its logistic version is the more common and is known as 3PL.
The formulas for the ICC in the 1PL, 2PL and 3PL models are presented in equations 4.4, 4.5 and 4.6 respectively.
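$$P(\theta) = \frac{1}{1 + e^{-(\theta - b)}} \qquad \text{(Equation 4.4)}$$
$$P(\theta) = \frac{1}{1 + e^{-a(\theta - b)}} \qquad \text{(Equation 4.5)}$$
$$P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}} \qquad \text{(Equation 4.6)}$$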
In order to illustrate how these functions work as well as the meaning of the parameters involved, figure 4.6 presents a 3PL curve. As may be seen in the previous equations, the 3PL is the most general of the three, so its explanation carries over to the other two. The range of this function is the open interval (c,1), both values being its asymptotic limits. The domain is (-∞,∞), but for practical purposes only the interval [-3,3] is considered.
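As a minimal sketch of how such curves can be evaluated, the following Python functions compute the three logistic ICCs. They assume the standard logistic form without a scaling constant (some texts multiply the exponent by D ≈ 1.7). The function names `icc_1pl`, `icc_2pl` and `icc_3pl` are hypothetical and chosen only for illustration.

```python
import math


def icc_3pl(theta: float, a: float = 1.0, b: float = 0.0, c: float = 0.0) -> float:
    """Probability of a right answer under the 3PL model.

    theta: examinee ability; a: discrimination; b: difficulty; c: guessing.
    Assumes the plain logistic form (no scaling constant).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def icc_2pl(theta: float, a: float, b: float) -> float:
    """2PL as the special case of 3PL with no guessing (c = 0)."""
    return icc_3pl(theta, a=a, b=b, c=0.0)


def icc_1pl(theta: float, b: float) -> float:
    """1PL (Rasch) as the special case with a = 1 and c = 0."""
    return icc_3pl(theta, a=1.0, b=b, c=0.0)


if __name__ == "__main__":
    # Tabulate one illustrative 3PL curve over the practical interval [-3, 3].
    for theta in range(-3, 4):
        print(theta, round(icc_3pl(theta, a=1.5, b=0.5, c=0.2), 3))
```

Writing the 1PL and 2PL models as special cases of `icc_3pl` mirrors the observation that the 3PL is the most general of the three. Note that `math.exp` can overflow for extreme arguments, so this sketch is intended for abilities within the practical interval mentioned above.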
Figure 4.6: 3PL typical ICC
In the IRT context, the guessing parameter c defines the probability of a right answer regardless of the examinee’s ability. In other words, this parameter is inherent to the nature of the item; for example, in a true-or-false question all students have a 0.5 probability of success if they just guess. The difficulty b defines how well the item suits the examinee's ability. In graphical terms, it defines how far to the right the curve lies, where the item suits high-ability examinees, or, reciprocally, how far to the left, where it suits low-ability ones (it defines the location of the curve's inflection point along the θ scale). The discrimination a defines how well an item can differentiate between examinees whose abilities are below the item difficulty and those whose abilities are above it. This parameter essentially reflects the steepness of the ICC in its middle section: the steeper the curve, the better the item can discriminate, whereas the flatter the curve, the less the item is able to discriminate between two examinees whose abilities are close (it defines the slope of the curve at its inflection point).
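These interpretations can be checked with a short worked calculation. It assumes the standard logistic 3PL form without a scaling constant, which is an assumption about the exact form used here. Evaluating the curve and its derivative at θ = b gives the height and the slope at the inflection point:

```latex
\begin{align*}
P(\theta) &= c + \frac{1 - c}{1 + e^{-a(\theta - b)}} \\
P(b) &= c + \frac{1 - c}{2} = \frac{1 + c}{2} \\
P'(\theta) &= \frac{a\,(1 - c)\,e^{-a(\theta - b)}}{\left(1 + e^{-a(\theta - b)}\right)^{2}} \\
P'(b) &= \frac{a\,(1 - c)}{4}
\end{align*}
```

For instance, under this assumption a true-or-false item with c = 0.5 and a = 1 gives P(b) = 0.75 and a slope of 0.125 at θ = b. With c = 0 the inflection point lies at probability 0.5 and the slope there is a/4, so b only shifts the curve along the θ axis while a controls its steepness.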
To further clarify the impact of these parameters on the ICC, specifically a and b, which can be harder to interpret, figure 4.7 shows curves in which each of them is varied while the other parameters remain fixed (note that c is zero in these examples).
Figure 4.7: 3PL ICC varying parameters b (left) and a (right) (Baker, 2001)