MTT and statistical models

2.4.2 MTT and statistical models

MTT is also frequently referred to as Item Response Theory or Latent Trait Theory.

However, in the present dissertation we will use the MTT (modern) label, since the CTT (classical) vs. MTT (modern) distinction is easier to remember, as already stated in the introduction to section 2.4.

Defining what MTT is “sometimes verges on the nonsensical, and certainly on the irascible, because protagonists are using the term in very different senses” (Linacre, 2003:926). A precise definition of MTT draws on quite complex statistical models and obviously this is not the place to discuss such matters.

However, most (but not all) definitions of MTT agree that it relates the probability of an examinee’s response to a test item to an underlying ability (Linden and Hambleton, 1997:v; Green, 2013:xii) and that “it encompasses any mathematical model which attempts to predict observations on a latent variable” (Linacre, 2003:926) (hence its alternative name of “Latent Trait Theory”). These 2 assumptions do not always go together, but they help us to understand that MTT has an eminently predictive nature. In other words, MTT can help us to predict how our language tests (or rubrics) will behave starting from a reduced data set. MTT tries to identify patterns in data which researchers or test designers can use to draw conclusions, even if such data sets are reduced in size,

can still be derived from a ‘holey’ data matrix, although the more information available to the analysis the better these estimates will be”.

Once the basic difference between CTT and MTT has been stated (item-dependency vs. estimation of probability), it is necessary to delve into MTT, which is by far more complex than CTT.

What characterizes MTT internally is the mathematical model used to estimate the aforementioned probabilities, that is, the model used to find patterns in data sets. McNamara (1996:257-258) disentangles the origin of MTT and the rise of its main mathematical currents as follows:

Item Response Theory (IRT) is a powerful general measurement theory which was developed in the 1950s and 1960s independently, it seems, in two different locations: by Alan Birnbaum in the United States and by the Danish mathematician Georg Rasch in Denmark. Rasch’s work was promoted and extended by an American, Ben Wright, who attended a series of invitational lectures given by Rasch in Chicago in 1960 and became his pupil and the advocate of his ideas in North America […]. Two main branches of Item Response Theory (or Latent Trait Theory as it is sometimes still known), stemming from these two developmental traditions, are recognized […]. They differ theoretically and practically. The essential feature of both is that they attempt to model statistically patterns in data from performances by candidates on test items, in order to draw conclusions about the underlying difficulty of items and the underlying ability of candidates. They differ mainly in the number of item parameters (characteristics of the interaction between a test taker and a test item) being estimated in the analysis: Rasch analysis considers one item parameter (item difficulty), while other models consider one or more further parameters (item discrimination, and a guessing factor).

Generally speaking, it is the number of parameters considered that establishes the current forms of MTT. Thus we find the one-parameter logistic model (also known as Rasch), the two-parameter logistic model and the three-parameter logistic model. For a deeper mathematical analysis of these models see Linden and Hambleton (1997), McNamara (1996) and Harris (1989), and for a close-up of their historical evolution see McNamara and Knoch (2012). We will be using the one-parameter logistic model in our analysis of rubrics, although we will also use 1 example of a three-parameter logistic model to provide the big picture of MTT.

In general, there is a series of reasons why researchers opt for MTT. Reise et al. (2005:100) claim that MTT methods are used because:

[R]esearchers want to (a) more rigorously study how items function differently in different groups; (b) place individuals from different groups onto a common scale, even if they have responded to different items; (c) use individual scores that have good psychometric properties, so that statistical techniques (such as growth model) can be applied with greater accuracy and spurious results or invalid findings can be avoided; (d) thoroughly understand the psychometric properties of their instruments; (e) create more order in their fields by having a common metric for a construct, rather than many competing fixed-length instruments; and (f) develop CAT (computerized adaptive testing)³ systems for more efficient assessment of individual differences.

Most scholars agree that MTT has 4 intrinsic properties, namely sufficiency, separability, specific objectivity and latent additivity. Among these, the most interesting one is specific objectivity, which matches reason (b) in the excerpt above. The property of specific objectivity in MTT allows, for example, for the comparison of persons without reference to the particular items taken and the comparison of items without reference to the particular persons providing the responses, which compensates for CTT’s sample-dependency referred to in 2.4.1. Thus MTT models “place individuals from different groups onto a common scale, even if they have responded to different items”, as Reise et al. (2005:100) pointed out. In practical terms this means that if we have collated data properly and if these fit the parameters, MTT models create a suitable scale on which all the measurements will be distributed accurately. This scale is our logit scale, which is explained below.

The implications of the specific objectivity of MTT models are particularly relevant in a study like ours, in which we are trying to validate a measuring instrument. As Thurstone (1928:547) puts it, a scale must transcend the group measured:

A measuring instrument must not be seriously affected in its measuring function by the object of measurement. To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired. Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement.

As a consequence of the underlying property of specific objectivity, if we can prove that our rubrics fit MTT models, by analyzing the data of a reduced number of trialed candidates, we will be able to ascertain whether our rubrics are valid to rate any prospective test-taker. The corresponding fitting analyses are shown in section 4.2.4.

The one-parameter MTT model is the one that we will be using for validation purposes, but there are others. The existing mathematical models can basically be broken down into the one-parameter logistic model (or Rasch model), the two-parameter logistic model and the three-parameter logistic model (Harris, 1989:35). Although we will be using one-parameter logistic models in the validation of our rubrics, we will first define what makes the one-parameter model different from the two-parameter and the three-parameter logistic models, in order to provide a general view of this type of statistics.

When establishing the probability of a test taker answering one item correctly, each of the 3 models considers, as their names suggest, a different number of parameters. Parameters can be defined as the characteristics of an item which, according to the model being used, may or may not be taken into account. The 3 characteristics that models may consider are item difficulty, item discriminability and the effect of guessing (Davies et al. 2006:140). As is obvious from their names, the one-parameter model takes into account 1 parameter, the two-parameter model takes into account 2 and the three-parameter model 3. These parameters are taken into account by the mathematical models that define item characteristic curves (ICC), which are the cornerstone of MTT models (Bachman, 1991:203).
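The relation between the 3 models can be sketched in a few lines of code. The sketch below assumes the logistic formulation commonly given in the psychometric literature, which the sources cited here describe but which is not printed in this section; the function name `icc`, the symbols `theta`, `b`, `a` and `c`, and the item values are ours and purely illustrative.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic ICC (assumed formulation).

    theta: candidate ability in logits
    b:     item difficulty in logits
    a:     item discrimination (fixed at 1.0 in the one-parameter model)
    c:     pseudo-chance (guessing) lower asymptote (0.0 unless three-parameter)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical candidate and item, chosen only for illustration.
one_pl = icc(theta=0.5, b=-1.0)                  # difficulty only
two_pl = icc(theta=0.5, b=-1.0, a=1.7)           # difficulty + discrimination
three_pl = icc(theta=0.5, b=-1.0, a=1.7, c=0.2)  # + guessing
```

Under this assumption, the one-parameter model is simply the special case in which discrimination is held constant and guessing is ruled out, so each additional parameter relaxes one restriction of the simpler model.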

ICC and many other statistics coming from MTT analyses are measured in logits (Green, 2013:151), as we can see in figure 2.4.2 along the X axis. It is very important at this point to become acquainted with this concept. Bearing in mind that one of the main characteristics of MTT is that it allows us to make predictions on candidates’ answers based on probability theory (Green, 2013:xii), let us consider what McNamara (1996:165) writes about the probabilities or odds of a particular response:

The odds are expressed as a logarithm (‘log’ for short) of the naturally occurring constant e. We thus speak of the ‘log odds’ of a response, rather than the odds of a response, and the units of measurement scale constructed in this way are called ‘log odds units’ or logits (pronounced ‘LOH-jits’; stress on the first syllable). The logit scale has the advantage that it is an interval scale – that is, it can tell us not only that one item is more difficult than another, but also how much more difficult it is. The interval nature of the ability measurements means that growth in ability over time can be plotted on the scale; this has attractive implications for the evaluation of the effectiveness of teaching […]. By convention, the average difficulty of items in a test is set at zero logits. Items of above-average difficulty will thus be positive in sign, those of below-average difficulty negative in sign.

Ability estimates in turn are related to item difficulty estimates, so that a person of an ability expressed as 0 logits would have a 50 per cent chance of getting right an item of average difficulty.

The most important thing about logits is that they will be our yardstick from now onwards. As we will see later in chapter 4 during the validation of the rubrics, logits will allow us to relate different aspects (or facets) of our measurements to the same scale in the so-called vertical rulers, which will be very visual and convenient.

It is also important at this point to remark that 0 in a logit scale marks an average point and that this average point will vary from data set to data set depending on, for example, the average difficulty of items, the average ability of candidates, etc.
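The conversion between probabilities and logits described by McNamara can be illustrated with a minimal sketch. It assumes the usual one-parameter (Rasch) formulation, in which the log odds of success equal ability minus difficulty; the function names and the example values are ours.

```python
import math

def logit(p):
    """Convert a probability (0 < p < 1) into log odds, i.e. logits."""
    return math.log(p / (1.0 - p))

def rasch_probability(ability, difficulty):
    """Probability of success under the one-parameter model (assumed formulation).

    Both arguments are expressed in logits.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# A person of ability 0 logits facing an item of average difficulty (0 logits).
p_average = rasch_probability(0.0, 0.0)

# The same person facing a hypothetical item 1 logit above average difficulty.
p_harder = rasch_probability(0.0, 1.0)
```

Here `p_average` is .50, in line with the 50 per cent chance mentioned in the excerpt, while the harder item yields a probability below .50. Applying `logit` to either probability returns the difference between ability and difficulty, which is what makes the scale an interval one.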

Once we know how our results will be scaled, it is time to go back to ICC. Since they are core to our analyses, let us see how they work through one example adapted from Bachman (1991:204-205), displayed as figure 2.4.2. For this example we will consider a three-parameter logistic graph, that is to say, a mathematical model which considers item difficulty, item discriminability and guessing (the 3 parameters) to tell us how likely one candidate is to answer a given item correctly. The graph below displays 3 different curves for 3 different items. The probability of one candidate answering one item correctly is displayed on the Y axis. The ability of candidates is displayed on the X axis, which is already expressed in logits.

Figure 2.4.2. ICC curve of a three-parameter logistic model

[Figure: 3 curves (Item 1, Item 2, Item 3). X axis: candidate’s ability scale, measured in logits, from -3.0 to +3.0. Y axis: probability, from .00 to 1.]

The Y axis tells us how likely candidates are to answer one item correctly in relation to their ability (1 means 100% probability and 0 means 0% probability). The higher the ability of candidates (+1, +2, +3, etc.), the closer they will be to answering any item correctly, which is an intuitive idea. For example, a candidate with an ability of -2.0 logits will have a 60% probability of answering item 1 correctly. Item 3 is clearly more difficult than item 1 because test takers must have an ability of +2.0 logits to reach a 60% probability of a correct answer. Likewise, a candidate with an ability of +3.0 logits will have a 90% chance of answering item 3 correctly. Similarly, a candidate with an ability of -1.0 logits will have a 40% probability of answering item 2 correctly and a 90% probability of answering item 1 correctly, etc.

Since this is a three-parameter logistic graph, it provides information regarding the 3 parameters mentioned above, which we will now present in a different order: guessing, item difficulty and discrimination.

The most interesting aspect here is the parameter of chance, since this model (the three-parameter model) is the only one that accounts for it. The Rasch model, which we will be using later, assumes that there is no chance in answers, which is as much as saying that there is no guessing. In contrast, in the graph above we see that the lower bounds of the curves for the 3 items are asymptotic to .20. Being asymptotic, the lines will approach .20 but will never reach that value. This is the point at which the so-called pseudo-chance parameter is set (Bachman, 1991:205), which means that there is approximately a 20% probability of candidates of very low levels answering the 3 items correctly as a result of (wild) guessing.

The ability level at which the probability of a correct answer lies midway between the pseudo-chance parameter (.20) and 1 is called the difficulty parameter (Bachman, 1991:205); this midpoint probability is .60 for the 3 items considered in the graph.
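The .60 value can be checked with a short worked equation. We assume here the common three-parameter logistic formulation, which is not printed in the sources quoted in this section, where θ is the candidate’s ability, b the difficulty parameter, a the discrimination parameter and c the pseudo-chance parameter (all symbols are ours). When ability equals difficulty, the exponential term equals 1 and the probability falls exactly halfway between c and 1:

```latex
P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}
\qquad\Longrightarrow\qquad
P(b) = c + \frac{1 - c}{2} = \frac{1 + c}{2} = \frac{1 + 0.20}{2} = 0.60
```

With no guessing (c = 0), as in the Rasch model, the same expression gives .50.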

Finally, the discrimination of items is proportional to the slope of the curve at the point of the difficulty parameter. The steeper the slope is, the greater the discrimination. Item 2, with the flattest slope, will discriminate the least. Items 1 and 3, with much steeper slopes, will discriminate much more effectively between individuals at different ability levels (Bachman, 1991:205).
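The proportionality between discrimination and slope can be made explicit. Assuming again the common three-parameter logistic formulation (our notation: θ for ability, b for difficulty, a for discrimination, c for the pseudo-chance parameter), differentiating the curve with respect to ability and evaluating it at θ = b gives:

```latex
\frac{dP}{d\theta} = a\,(1 - c)\,\frac{e^{-a(\theta - b)}}{\left(1 + e^{-a(\theta - b)}\right)^{2}}
\qquad\Longrightarrow\qquad
\left.\frac{dP}{d\theta}\right|_{\theta = b} = \frac{a\,(1 - c)}{4}
```

Under this assumption, and with c fixed at .20 as in the graph, the slope at the difficulty point is 0.2a, so an item with twice the discrimination value has a curve that is twice as steep at that point.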

If we used other probability models, the curves for these items would be different as well, describing different probabilities.

In document Protocol to design a CEFR-linked proficiency rating scale for oral production and app implementation (Page 108-115)