
2.4 Test methods

2.4.1 CTT and sample dependency

Linden and Hambleton (1997:2) define CTT as a theory that

[s]tarts from the assumption that systematic effects between responses of examinees are due only to variation in the ability (i.e. true score) of interest. All other potential sources of variation existing in the testing materials, external conditions, or internal to the examinees are assumed either to be constant through rigorous standardization or to have an effect that is nonsystematic or “random by nature”.

In a broad sense, the main distinction between CTT and MTT is that CTT statistics such as item difficulty (i.e. proportion correct), item discrimination (i.e. point-biserial correlations) and internal reliability (i.e. the degree to which one particular test would yield identical results if applied to the same candidates in repeated iterations) are sample dependent (Hambleton and Jones, 1993:38; McNamara, 1996:151-152), while MTT statistics correct for such dependency (Linden and Hambleton, 1997:2) through the use of probability models applied to data matrices.
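All three CTT statistics can be computed directly from a scored response matrix. The following Python sketch illustrates this with invented data (the matrix, like every name in it, is ours and purely illustrative): item difficulty as the proportion of correct answers, item discrimination as the point-biserial correlation between item scores and total scores, and internal reliability estimated here through Cronbach's alpha, one common index.

    import numpy as np

    # Scored responses: rows = candidates, columns = items (1 = correct, 0 = wrong).
    # The matrix is invented purely for illustration.
    X = np.array([
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ])

    totals = X.sum(axis=1)  # each candidate's overall score

    # Item difficulty (facility value): proportion of correct answers per item.
    difficulty = X.mean(axis=0)

    # Item discrimination: point-biserial correlation between the scores on one
    # item and the candidates' total scores.
    discrimination = [np.corrcoef(X[:, j], totals)[0, 1] for j in range(X.shape[1])]

    # Internal reliability: Cronbach's alpha over the k items.
    k = X.shape[1]
    alpha = k / (k - 1) * (1 - X.var(axis=0, ddof=1).sum() / totals.var(ddof=1))

    print(difficulty)       # [0.8 0.6 0.4 0.2]
    print(discrimination)
    print(alpha)            # 0.8 for this invented matrix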

[Figure: the branches of psychometrics. CTT comprises statistics such as the discrimination index, the facility value, internal reliability, etc.; MTT comprises the one-parameter logistic models (the basic Rasch model, the rating scale model, the partial credit model and the multi-faceted model), the two-parameter logistic model and the three-parameter logistic model.]

Sample dependency is not considered a positive attribute of CTT because it amounts to saying that the psychometric characteristics of an item depend on the population it was tested on. If the inferences about the reliability of our results depend on the test-takers used during the trials, we will never be completely sure about an item's properties, because these will change if the item is trialed with a different group of candidates. This is why CTT inferences are sometimes criticized. Let us look at an example of what sample dependency means through an analysis of what happens to discrimination indexes when candidates vary.

Roughly speaking, the discrimination index of an item is a figure that tells us how well that particular item differentiates stronger from weaker performers. If the item in question is answered correctly by those we identify as stronger candidates (i.e. those with higher overall scores) but not by weaker candidates (i.e. those with lower overall scores), then we can say that this particular item helps us discriminate between stronger and weaker candidates because it accurately predicts higher scorers. However, there are times when a “difficult” item is answered correctly by weaker candidates, and, conversely, stronger candidates sometimes give wrong answers to “easy” items.

To calculate discrimination indexes we are going to use an imaginary item, Item A, which was answered by 300 candidates (n). To calculate the discrimination index of Item A we need two figures from these 300 candidates: 1) the facility value, and 2) the Pemberton index (Martínez, 2011:68). By comparing the facility value and the result of the Pemberton formula we will know the discrimination index of the item, i.e. whether it discriminates well between strong and weak candidates. Then we will analyze what happens when both figures are obtained from a different sample of candidates, to exemplify sample dependency. The whole calculation is summarized in table 2.4.1.a below.

First, let us start by calculating the facility value of Item A. Since 75% of the candidates answered Item A correctly (i.e. 225 candidates), it is said to have a facility value of 75%.

Second, let us calculate the Pemberton index. To apply the Pemberton formula (see the formula below) we are going to divide the 300 candidates who answered Item A into 3 groups of equal size (100 candidates each). In these 3 groups we will include candidates according to their overall score. The 100 candidates with the best overall marks in the exam will go to group 1 (G1), the 100 candidates with the lowest overall score will go to group 3 (G3) and group 2 (G2) will contain candidates with intermediate overall scores. The numbers on the right of the groups in the table below (100, 95 and 30) indicate how many candidates answered Item A correctly in each group. Thus, from the table we learn that 100 candidates out of the 100 candidates of G1 answered item A correctly, that 95 candidates out of the 100 candidates of G2 answered item A correctly and that only 30 candidates out of the 100 candidates of G3 answered item A correctly.

Item A (n = 300)                        Correct answers    (%)

Facility value of Item A                225 of 300         75%

Discrimination index:
  G1: 100 strongest candidates          100
  G2: 100 intermediate candidates       95
  G3: 100 weakest candidates            30

Table 2.4.1.a. Data for discrimination index through the Pemberton formula

The Pemberton formula being (Martínez, 2011:68):

$\frac{G1 - G3}{n/3}$

by substitution we find that:

$\frac{100 - 30}{100} = \frac{70}{100} = 0.7$
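For readers who prefer to see the arithmetic operationalized, the calculation can be expressed as a short Python sketch. The function name pemberton_index is ours, and the figures simply reproduce the invented data for Item A:

    def pemberton_index(g1_correct, g3_correct, n):
        # Pemberton formula (Martínez, 2011:68): (G1 - G3) / (n / 3), where G1
        # and G3 are the numbers of correct answers in the strongest and weakest
        # thirds of the n candidates.
        return (g1_correct - g3_correct) / (n / 3)

    # Item A: 300 candidates, 225 correct answers overall.
    facility_value = 225 / 300                                     # 0.75, i.e. 75%
    index = pemberton_index(g1_correct=100, g3_correct=30, n=300)  # (100 - 30) / 100
    print(facility_value, index)                                   # 0.75 0.7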

Now, the facility value of Item A (75%) and the result from the Pemberton formula (0.7) must be correlated: a correlation is expected between the theoretical difficulty of Item A (75%) and its discrimination index (0.7). The more difficult an item is, the closer its facility value approaches 0% (because 0% of candidates will be able to answer it correctly). The easier an item is, the closer its facility value approaches 100% (because, provided it is very easy, 100% of candidates will be able to answer it correctly). This correlation is analyzed through the table below, which is adapted from Martínez (2011:68).

Facility value (%)    Pemberton index

100                   0.0
96                    0.1
93                    0.2
90                    0.3
86                    0.4
83                    0.5
80                    0.6
76                    0.7
73                    0.8
70                    0.9
66                    1.0
50                    1.0

Table 2.4.1.b. Expected Pemberton index per facility value (adapted from Martínez, 2011:68)

As we can see in the table, a Pemberton index of 0.7 is expected for an item with a facility value of 76%. Since the facility value of our item is 75% (very close to 76%), we can say that Item A is a good item because it displays the expected relation between facility value and Pemberton index.
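The comparison against the table can likewise be mechanized. The Python sketch below hard-codes the tabulated pairs and looks up the expected index for the nearest tabulated facility value; the nearest-value lookup is our simplification, not a procedure prescribed by Martínez (2011):

    # Expected Pemberton index per facility value (table 2.4.1.b).
    EXPECTED = {100: 0.0, 96: 0.1, 93: 0.2, 90: 0.3, 86: 0.4, 83: 0.5,
                80: 0.6, 76: 0.7, 73: 0.8, 70: 0.9, 66: 1.0, 50: 1.0}

    def expected_index(facility_pct):
        # Nearest tabulated facility value (our simplification).
        nearest = min(EXPECTED, key=lambda fv: abs(fv - facility_pct))
        return EXPECTED[nearest]

    print(expected_index(75))   # 0.7, which matches Item A's observed index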

The problem with this statistic (and the source of most criticism directed at CTT) is that the results are sample dependent. This means that if, for example, the exam is taken by poorly motivated students, the outcome and the conclusions drawn will vary. Not all groups of candidates will necessarily show the same degree of regularity in their answers, and this will also affect the final calculations.

To exemplify this inconsistency, let us imagine now that the same item is trialed with another group of 300 candidates who are slightly less motivated. In this case, only 65% of them (195 candidates) answer it correctly: 90 candidates give the correct answer to Item A in G1, 65 in G2 and 30 in G3. By applying the formula we obtain a Pemberton index of 0.6 which, as we can see in table 2.4.1.b, is not a good result, because a Pemberton index close to 1 is expected for an item with a facility value of 65%.
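Reusing the pemberton_index and expected_index sketches from above, the second trial takes only a few lines, and the gap between the observed index (0.6) and the expected one (close to 1.0) becomes explicit:

    # Second, less motivated sample: 195 of 300 correct; G1 = 90, G3 = 30.
    facility_value_2 = 195 / 300                                    # 0.65, i.e. 65%
    index_2 = pemberton_index(g1_correct=90, g3_correct=30, n=300)  # (90 - 30) / 100 = 0.6
    print(index_2, expected_index(65))                              # observed 0.6 vs expected 1.0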

The item remained the same in the first and the second calculation, but the sampled candidates changed, and so did the statistical properties of the item. This is what is meant by sample dependency in CTT. Sample dependency affects the conclusions drawn from items insofar as the same item trialed with different candidates yields different results. Rita Green (personal communication) claims that reliability indexes may vary by up to 14% depending on whether the exam is being taken by real candidates or by candidates who are trialing it.

Bachman (1991:203) also criticizes that CTT “does not provide a very satisfactory basis for predicting how a given individual will perform on a given item”, chiefly because “it makes no assumptions about how an individual’s level of ability affects the way he performs on a test” and because “the only information that is available for predicting an individual’s performance on a given item is the index of difficulty”, that is to say, “the proportion of individuals in a group that responded correctly to the item”.

Set against this, MTT, as we will see in 2.4.2 and 2.4.3, uses mathematical models that compensate for this sample dependency and provide accurate predictions of how individual candidates will perform on different items. Bachman (1991:203) points out the following:

These models are based on the fundamental theorem that an individual’s expected performance on a particular test question, or item, is a function of both the level of difficulty of the item and the individual’s level of ability.
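In its simplest form, the basic Rasch (one-parameter logistic) model expresses this theorem as a logistic function of the difference between the candidate's ability and the item's difficulty. The Python sketch below is a generic illustration of that idea, not a reconstruction of Bachman's own formulation:

    import math

    def rasch_probability(ability, difficulty):
        # Probability of a correct response under the basic Rasch (one-parameter
        # logistic) model: it depends only on the gap between the candidate's
        # ability and the item's difficulty, both expressed in logits.
        return 1 / (1 + math.exp(-(ability - difficulty)))

    print(rasch_probability(ability=1.0, difficulty=1.0))   # 0.5: ability matches difficulty
    print(rasch_probability(ability=2.0, difficulty=1.0))   # ~0.73: stronger candidate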