4.6.3 Construct characterisation based on task analysis

Upshur and Turner’s, Fulcher’s and Chalhoub-Deville’s proposals for the use of data to derive constructs work in contexts where learners produce an extended response which can be analysed and/or rated in detail. However, tests with structured tasks and limited responses, such as those of reading or listening, do not offer such data. Instead, what can be analysed in these settings is the task material. Freedle and Kostin (1993a) report on a study in which they analysed the construct assessed in the TOEFL reading test by analysing the format of the texts and items and seeing how these influenced item difficulty. McNamara (1996), from a slightly different perspective, describes a number of studies which used task content for the mapping of the abilities (i.e. the construct) assessed in a test. Both of these methods are post hoc explanations in that the dependent variable is item difficulty. However, the advantage is that the task analysis yields a content description of the ability assessed.

Freedle and Kostin (1993a) conducted an analysis of TOEFL reading tests to explain item difficulties and through them the nature of the information yielded by the scores. They investigated three categories of TOEFL reading items: main idea, inference, and supporting idea items (1993a:145). There were 213 of them altogether and they were related to a total of one hundred reading passages.

Freedle and Kostin (1993a) reviewed existing research to assemble a set of variables which had been found to be related to item difficulty in reading comprehension and investigated how well these variables helped predict the difficulty of sampled TOEFL reading comprehension items. The variables characterised the reading passages, reading items, and passage-item overlap. They included features such as number of words, number of negations, location of focal information within passage, subject matter of text, type of rhetorical organization, and frequency of fronted text structures such as cleft sentences. A total of 65 variables were included in the analyses as well as 6 to 11 text-by-item interactions depending on item category (Freedle and Kostin 1993a:146-154). The categorisation into text-related, item-related, and text/item overlap-related features was made because criticisms had been presented that multiple choice tests of reading assess reasoning skills related to understanding the item stem and options rather than assessing passage comprehension. Since the TOEFL test is based on multiple choice questions, this would be a considerable criticism of the test.

Using stepwise linear regression, Freedle and Kostin (1993a:161-162) found that eight of the variables could be considered significant predictors of the difficulty of the sample of TOEFL reading items that they investigated. Six of these variables were related to the reading passages, two to passage-item overlap, and none to the textual characteristics of the items alone. The portion of variance explained by the variables was 33 percent. The researchers also conducted separate analyses for a non-nested sample of items, that is, a sample of items where only one item per reading passage was included. In this analysis, eleven variables accounted for 58 percent of the variance of the scores. Ten of the 11 variables were related to the reading passages or to passage-item overlap, and one (number of negations in correct answer) to item-related variables.

Freedle and Kostin (1993a:166) concluded that their results supported the construct validity of the reading test, because they were able to show that candidate scores were significantly related to features indicating text comprehension rather than to technical or linguistic features in the items. In a related ETS publication the researchers report that they also found a tendency in the data for the proportion of variance explained to be higher for the two lower-scoring ability groups than for the higher-scoring candidates (Freedle and Kostin 1993b:24-25). They suggested (Freedle and Kostin 1993b:27) that think-aloud protocols might be used to clarify the strategies employed by high-scoring candidates, so that item difficulty could be better predicted for them as well. They did not speculate what such variables might be. Their use of the word “strategies” and the method of think-alouds may indicate a suspicion that reader-related variables which concern the operations that the items make readers perform could explain further portions of item difficulty for high-scoring candidates. Such reader-related processing variables were not investigated in Freedle and Kostin’s study. If these kinds of variables are considered important for a comprehensive picture of the construct assessed, as they might be in an interactionalist definition of test-based reading, the degree of variation not explained might be a good result (cf. Buck and Tatsuoka’s results discussed below). However, the operationalization of such variables would require careful work before assessments could be made of whether they explain score variation in a systematic way.

Conceptual issues are only one possible explanation for why Freedle and Kostin (1993a) were able to explain item difficulty better for the non-nested sample of items and for lower ability levels. Boldt and Freedle (1995) re-analysed Freedle and Kostin’s (1993) data with the help of a neural net, originally in the hope that this more flexible prediction system would improve the degree of prediction achieved (Boldt and Freedle 1995:1). They found that the degree of prediction did improve in some samples of items, but the variables that the neural net used for the successful predictions were different from those in the linear regression. Only two of the variables, “number of words in key text sentence containing relevant supporting idea information” and “number of lexically related words in key text sentence containing relevant supporting idea information”, were the same (Boldt and Freedle 1995:15). The researchers also found that the highest improvement in prediction of difficulty concerned the non-nested sample of items. However, they studied an alternative explanation for this. They drew another sample of 98 items from the 213 and used the same 11 variables that they had used with the non-nested sample to study the degree of prediction. They found that the percentage of variation in difficulty explained for this set was almost as high as for the non-nested sample, even though the predictor variables had not been formed specifically for the new sample. This argued for the alternative explanation that the difference in degree of explanation in the original Freedle and Kostin (1993) study was not due to the independence of the items but to the smaller sample of item difficulties that had to be explained, which introduced randomness in the nature of the sample and allowed capitalization on chance (Boldt and Freedle 1995:14). Similarly, although the Boldt and Freedle study repeated the Freedle and Kostin (1993) finding that item difficulty was best explained for the lowest ability groups, it was possible that this was because the number of predictors used for that group was the highest (Boldt and Freedle 1995:14). The authors continued that this alternative explanation was supported by the fact that the accuracy of prediction for all the ability levels in their study reflected the number of predictors. The effects of skill level and sample size were confounded, and Boldt and Freedle proposed that a design developed to investigate the cause should use fewer predictors and more items so that the issue could be resolved (Boldt and Freedle 1995:15).
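The idea of capitalization on chance can be illustrated with a small simulation. The sketch below is not a reconstruction of either study's analysis: it uses a simple greedy forward selection (a simplified stand-in for stepwise regression), purely random "predictors" and "difficulties", and hypothetical function names. The pool of 65 candidates, the 11 selected predictors and the sample sizes of 213 and 98 items are borrowed from the text only to make the scale comparable. Under these assumptions, selecting predictors from a large pool should yield a noticeably higher apparent proportion of variance explained in the smaller sample, even though no predictor has any real relation to difficulty.

```python
import random


def center(v):
    """Subtract the mean from every element of v."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_select(X, y, k):
    """Greedily pick k predictor columns; return (chosen indices, R^2).

    X is a list of predictor columns, y the values to be explained
    (y must not be constant). Remaining candidates are kept orthogonal
    to the predictors already chosen, so the gain computed at each step
    equals the increase in R^2 of the least-squares fit.
    """
    resid = center(y)
    total = dot(resid, resid)
    cols = {j: center(col) for j, col in enumerate(X)}
    chosen = []
    for _ in range(k):
        best_j, best_gain = None, 0.0
        for j, col in cols.items():
            norm = dot(col, col)
            if norm < 1e-12:  # skip (near-)collinear candidates
                continue
            gain = dot(col, resid) ** 2 / norm
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break
        x = cols.pop(best_j)
        xx = dot(x, x)
        b = dot(x, resid) / xx
        resid = [r - b * a for r, a in zip(resid, x)]
        for j in list(cols):
            c = dot(cols[j], x) / xx
            cols[j] = [u - c * a for u, a in zip(cols[j], x)]
        chosen.append(best_j)
    return chosen, 1.0 - dot(resid, resid) / total


def noise_r2(n_items, n_candidates=65, k=11, reps=3, seed=0):
    """Mean R^2 when k of n_candidates pure-noise predictors are selected."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(reps):
        X = [[rng.gauss(0, 1) for _ in range(n_items)]
             for _ in range(n_candidates)]
        y = [rng.gauss(0, 1) for _ in range(n_items)]
        acc += forward_select(X, y, k)[1]
    return acc / reps


large_r2 = noise_r2(213)  # all items
small_r2 = noise_r2(98)   # smaller sample, same number of predictors
print(f"noise R^2 with 213 items: {large_r2:.2f}; with 98 items: {small_r2:.2f}")
```

Because every variable here is noise, any variance "explained" is an artefact of selection, which is why a design with fewer predictors and more items is better suited to separating real effects from chance.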

This example illustrates that findings in empirical studies may be explained by the methods used. Boldt and Freedle (1995) seem keen to find a small number of generalizable constructs, since they say that they would like to find few predictors that work across a large sample of items. Another way of pursuing research on this would be to make parallel small samples and investigate how the variables that explain difficulty vary and possibly discover contextual or content-based explanations for why they vary. For both types of research, Boldt and Freedle’s (1995:15) observation that ideally such studies would be informed by theory holds true. The nature of the theory would inform the content of the variables studied, or vice versa if theory construction were sought from identifying item properties and using them in prediction.

McNamara (1996:199-213) describes three projects which used what he terms skill-ability maps to characterise the skills assessed in reading and listening tasks. The approach begins from the output of a Rasch item analysis. This locates examinees and items on the latent measurement scale. The researcher then attempts to identify the skills assessed by items at a specific region of the latent ability scale. The logic is that “if the knowledge or skills involved in the items found at a given level of achievement can be reliably identified, then we have a basis for characterising descriptively that level of achievement. If successive achievement levels can be defined in this way, we have succeeded in describing a continuum of achievement in terms of which individual performances can be characterized” (McNamara 1996:200). A researcher who uses this approach thus hopes to be able to say, for instance, that items testing “ability to understand and recount narrative sequence” cluster at one region of item difficulty while items testing “ability to understand metaphorical meaning” would be found in another.
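The mapping logic can be sketched in a few lines. The example below assumes that item difficulties in logits are already available from a Rasch analysis and that an analyst has assigned one skill label to each item. The item identifiers, difficulty values, cut points and the "connecting ideas" label are hypothetical and serve only to show the mechanics; they are not taken from any of the tests McNamara discusses.

```python
from collections import defaultdict

# Hypothetical items: (item id, Rasch difficulty in logits, analyst's skill label)
ITEMS = [
    ("r01", -1.4, "narrative sequence"),
    ("r02", -1.1, "narrative sequence"),
    ("r03", -0.3, "narrative sequence"),
    ("r04", 0.2, "connecting ideas"),
    ("r05", 0.5, "connecting ideas"),
    ("r06", 1.3, "metaphorical meaning"),
    ("r07", 1.8, "metaphorical meaning"),
]


def band_of(difficulty, cut_points):
    """Return the band index: the number of cut points at or below the difficulty."""
    return sum(difficulty >= c for c in cut_points)


def skill_ability_map(items, cut_points):
    """Count, for each difficulty band, how many items carry each skill label."""
    bands = defaultdict(lambda: defaultdict(int))
    for _item_id, difficulty, skill in items:
        bands[band_of(difficulty, cut_points)][skill] += 1
    return {band: dict(skills) for band, skills in sorted(bands.items())}


# Two hypothetical cut points divide the scale into three achievement levels.
level_map = skill_ability_map(ITEMS, cut_points=[-0.5, 1.0])
for band, skills in level_map.items():
    print(band, skills)
```

If the skill labels cluster cleanly by band, the bands can be described in terms of those skills; where a label is spread over several bands, as with the third item here, the description of the level becomes less clear-cut.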

McNamara (1996:201-202) discusses a first language reading test (Mossenson et al. 1987, in McNamara 1996) in which the reading ability scale was developed through the method described above. The scale proceeds in thirteen steps from the identification of the topic of the story through the connecting of ideas separated in the text to inference of emotion from scattered clues. McNamara (1996:205) reports that the validity of the scale has been called into question, both on the grounds that the status of sub-skills in reading is questionable and especially because the methodology used to characterise the content of the items is not reported in the test manual. Further exploration of this method with carefully reported procedures may nevertheless produce interesting results for construct characterisation. The nature of the properties that are ascribed to the items in the reading test is strongly related to the theoretical views of the analyst about what would explain correct or incorrect responses to the reading item analysed.

McNamara (1996:203-204) also discusses an individual learner map, where a similar mapping methodology was used in a university test of English as a second language, but this time to detail the answer pattern of an individual learner. The basic grid of the map is defined by the latent ability/difficulty scale on the one hand and the examinee’s answer pattern on the other. The logic follows item response theory, which expects that if a set of items is suitable for the examinee, he or she would tend to get items below his/her ability level correct, items at his/her ability level either correct or incorrect, and items above his/her ability level mostly incorrect. Accordingly, the individual learner map includes four regions: easy items which the learner answered correctly, difficult items which the learner answered incorrectly, difficult items which the learner somewhat unexpectedly answered correctly, and easy items which the learner unexpectedly answered incorrectly. McNamara suggests that the last category in particular is useful in educational contexts, because it may indicate where remedial teaching is needed.
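The four regions can be sketched as a simple classification. The example assumes the standard Rasch model, in which the probability of a correct response depends on the difference between learner ability and item difficulty, and it uses a crude rule that treats an item as "easy" when its difficulty lies below the learner's ability estimate. The function names, region labels and response data are hypothetical; an operational map would also allow for a band of items near the learner's ability where either outcome is expected.

```python
import math


def rasch_p(theta, b):
    """Rasch probability of a correct answer for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def learner_map(theta, responses):
    """Sort (item_id, difficulty, correct) triples into the four map regions."""
    regions = {
        "easy_correct": [],               # expected
        "hard_incorrect": [],             # expected
        "hard_correct_unexpected": [],    # harder items answered correctly
        "easy_incorrect_unexpected": [],  # possible targets for remedial teaching
    }
    for item_id, b, correct in responses:
        easy = b < theta  # simplification: no "at ability level" band
        if easy and correct:
            regions["easy_correct"].append(item_id)
        elif easy:
            regions["easy_incorrect_unexpected"].append(item_id)
        elif correct:
            regions["hard_correct_unexpected"].append(item_id)
        else:
            regions["hard_incorrect"].append(item_id)
    return regions


# Hypothetical learner with an ability estimate of 0.0 logits.
answers = [("a", -1.5, True), ("b", -1.0, False), ("c", 0.8, True), ("d", 1.6, False)]
regions = learner_map(0.0, answers)
for name, items in regions.items():
    print(name, items)
```

In this made-up pattern, item "b" falls in the region of easy items answered incorrectly, which is the region McNamara singles out as potentially indicating where remedial teaching is needed.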

McNamara suggests that information from such learner maps might be used in two ways. It could be reported to learners as it is, and learners could draw their own conclusions about their ability based on their examination of the items. It could also be combined with content analysis of the items to express learner abilities in terms of more general underlying abilities. As with McNamara’s earlier reading example, the nature of such abilities would require validation. However, if ability constructs are thought to underlie examinee performance on tests, skill-ability mapping might offer a way to identify and describe them.

As a third example, McNamara (1996:206-210) reports on another Australian test of reading and listening in which skill-ability mapping is used to create ability level descriptors used in certificates. He describes an independent validation study by McQueen (1992, in McNamara 1996) in which the researcher derived from existing research a set of characteristics which could be considered to affect the difficulty of items in reading Chinese. He used the criteria to analyse a test which had already been administered and the scores reported. There was considerable coherence between the factors that McQueen derived and the ones used by the examination. McNamara (1996:210) reports that McQueen’s results largely supported the validity of ability mapping in general and at least in the context of the examination. In addition, McNamara proposes that the ability mapping approach could be followed by performance analysis, both of which should operationalize constructs mentioned in the test specifications. This would provide more powerful evidence for the validity of the maps. Such an approach would also strengthen the role of construct definition as a rationale for the development and validation of tests. The approach would allow a wide range of theoretical approaches to construct definition in terms of Chapelle’s (1998) model, and since the empirical logic of the mapping system is based on the IRT ability dimension underlying the scoring system, the connections between the construct tested and the scoring structure would be a natural part of the investigation.

4.6.4 Construct characterisation based on task and ability analysis