
3 Validation in language test development

3.6 Issues relevant to the present study

3.6.1 The status of the test in test validity

Cronbach and Messick, and other measurement experts in their wake, strongly emphasize that one does not validate a test but interpretations of test scores. Cronbach, for instance, proclaims:

Only as a form of shorthand is it legitimate to speak of “the validity of a test”; a test relevant to one decision may have no value for another. So users must ask, “How valid is this test for the decision to be made?” or “How valid are the several interpretations I am making?” (Cronbach 1990:150)

Messick (1989a:13) similarly states that “what is to be validated is not the test or observation device as such but the inferences derived from test scores”. These formulations shift the focus from the test to score use, and at the same time, they imply several actors. The people responsible for the validation of score-based inferences are both the test developers and the score users.

The mainstream version of current validity theory provides a coherent context for these statements, but it is not easy to implement the statements in one coherent line of validation practice. This has led several measurement experts to call for clear guidelines specifically focusing on the responsibilities of the test developers (e.g. Maguire, Hattie and Haig 1994, Shepard 1993, Wiley 1991, Yalow and Popham 1983). The writers make different cases, but they are united in the concern that current theory and standards on validity do not give sufficient guidance to individual testing boards. Evidence for the quality of a test is only one strand in cases which concern the quality of score-based inferences. For individual testing boards, however, their test is the main concern throughout its development and use. There need not be a conflict, but some clarification of responsibilities is needed.

Yalow and Popham (1983) want the content of a test to be a clear and legitimate focus of validity inquiry. They especially take issue with Messick’s (1980:1015) characterisation of content validity as an aspect of test construction and “not validity at all”. Messick argued this because content validity is a stable property of the test rather than scores, and does not concern the nature of the skills represented in test responses as validity should. Yalow and Popham see content validity as a necessary precursor to drawing reasonable inferences from the test scores. Messick (1989a:36-42) discusses the contributions and the limitations of content investigations to validity arguments in detail and concludes with an emphatic statement to the effect that content relevance and representativeness do contribute an important perspective to validity investigations. They just cannot be the only

developers is that content-related evidence is important, but it must be complemented by other evidence from the test development process to construct a solid validity case.

Wiley (1991) argues for a return to test validity, which he sees as focused on the social and psychological processes which the test performances are supposed to reflect. The difference between his case and Yalow and Popham’s is that Wiley does not consider task or content characterisation only; he focuses on both tasks and test taker processing, which are combined into a particular kind of construct definition for the test. Wiley proposes that test validation should be an “engineering” task which investigates the faithfulness with which the test reflects a detailed model of the intended construct. According to him, this should be kept separate from the “scientific” task of validating the construct model. He presents an approach to modelling constructs as complex combinations of skills and tasks and provides an example of how test validation could be conducted without reference to the scientific validation of the construct. The case is appealing, but it clearly builds on the presupposition that test developers can draw up a detailed model of the intended construct. A particularly attractive feature in Wiley’s case is the limitation of the test developers’ responsibilities. Shepard (1993:444) and Moss (1995:7), though from a different viewpoint, make a similar case for separating test-related validation from the validation of theoretical constructs.

Maguire, Hattie and Haig (1994) read Messick’s (1989a) emphasis on score use to mean that he considers the use to which people put a score as an indication of a construct to be more important than an understanding of what the construct is, and they disagree. They consider investigations of the nature of educational constructs to be the most important, and they promote qualitative, processing-oriented studies for inquiring into them. They point out that too much interpretation and theorising about constructs in testing is based on scores. Such investigations, they argue, conflate the nature of the construct and the properties of the scoring model which is used in the test. Tests can help to build construct theory, but primarily through opportunities for qualitative evidence about test taker processing. Once processing-oriented studies have resulted in a detailed construct, they suggest that educational measurement experts should probably consider whether such constructs can be measured along a scale as current tests do, or whether it might be better to assign test takers to nominal categories which are not necessarily ordered on a single dimension.

Maguire, Hattie and Haig’s (1994) proposal holds merit if constructs are to focus primarily on cognitive processing. However, the question can also be raised whether processing is a sensible primary basis for defining constructs, given the contextual nature of human processing. Furthermore, we know very little about possible variation in processing when one person takes one task on one test occasion versus taking it on another occasion, let alone the differences in processing between individuals – on one test occasion or across different test occasions. In fact, at least some of the evidence available from think-aloud studies, e.g. Alderson’s (1990:470-478) analysis of the processes that two learners went through when answering a reading test, suggests that variation in processing can be considerable. It is also possible that cognitive processing is significant in some tasks, but that in others, such as in the ways in which readers achieve an understanding of a text, the specific processes employed are neither a sensible nor perhaps a useful way of analysing their skills.

Maguire et al.’s (1994) contribution to the construct validity discussion is nevertheless thought-provoking. The research they promote seems to belong to the theoretical side of test-related and theory-related construct validation, as discussed above. Yet the separation they make between skill-constructs and their quantified indicators is important, as is the question that they raise about whether different score categories justifiably indicate higher or lower levels of “ability”. Such a questioning approach would probably be welcomed by Messick and those who continue his work, because it directs attention to the values and practices of current educational tests. Such studies undoubtedly have implications for individual tests, but it might be more justified to see this line of inquiry as a “more scientific” pursuit in the first instance and the domain of an individual testing board’s validation activities only after some research basis exists to which they can tie their investigations.

Neither Cronbach nor Messick explicitly discusses the status of the test instrument in his theory of validity. Both theorists, instead, centre validation on the construct which the test is intended to assess. The test tasks, the scoring system, and the score interpretation are referenced to evidence about the construct. But the construct is abstract and related to other constructs and construct theories as well, and construct-related inquiries may end up questioning the construct as well as the test. Test developers may be happy to agree in theory, but in practice they face the question of how to implement a focus on the construct in their validation work while keeping the scope of their task within manageable proportions and continuing to develop and implement their test.

3.6.2 Construct theory and construct definition in validation inquiry

In document UNIVERSITY OF JYVÄSKYLÄ Centre for Applied Language Studies (Page 103-106)