3.5.4 Research techniques employed in validation

All the frameworks presented above offer guidelines on what to investigate in the process of validation inquiry and how to organise the inquiry. They also mention research techniques for validation, which are discussed extensively in several theoretical presentations of validity. Cronbach (e.g. 1990) and Messick (e.g. 1989a, 1989b, 1995a), for instance, provide comprehensive discussions of possible techniques. A very similar range of techniques in the context of language testing is presented in several articles in Clapham & Corson (eds.) (1997), which give concrete examples of research with language tests where the techniques have been used.

Because of the history of the validity concept and the current focus in validity on score interpretations, many of the validation techniques are numerical and use test scores as the primary data. The most obvious are correlations: internal correlations to discover relationships among items, and external correlations to investigate the relationships between the scores and other indicators of interest. Correlation, however, encompasses only part of the analyses and indices involved. Connections to underlying dimensions which might stand for ability constructs are most commonly made through factor analysis. Score-based investigations also include generalizability studies of interpretations across items, populations, and raters. Such studies take into consideration the test, the interpretation, and the population of test takers. Additional test-taker related investigations include the stability of scores over time and across different subgroups of test takers. Insights into the construct measured can be gained by keeping the group of test takers constant and altering the testing conditions. Alternatively, the testing conditions can be kept constant and the test taker group manipulated, for instance by providing extra training in the skills measured or in test-taking skills; the latter in order to see whether the scores can be influenced by coaching. Recent advances in quantitative validation techniques are reviewed in several chapters in Clapham & Corson (eds.) (1997), e.g. those by Bachman, Bachman and Eignor, McNamara, and Pollitt.
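The two most basic score-based checks mentioned above, internal and external correlation, can be illustrated with a minimal sketch. The data, the variable names, and the choice of a corrected item-total correlation as the internal index are assumptions made for illustration only; they are not taken from any of the studies cited.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def corrected_item_total(responses):
    """Internal check: correlate each item with the sum of the other items."""
    n_items = len(responses[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in responses]
        rest = [sum(row) - row[i] for row in responses]
        result.append(pearson(item, rest))
    return result


# Invented data: six test takers, four dichotomously scored items.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
# Hypothetical external indicator, e.g. a teacher rating of each test taker.
teacher_ratings = [4, 3, 2, 5, 1, 4]

totals = [sum(row) for row in responses]
item_rest = corrected_item_total(responses)    # one coefficient per item
external_r = pearson(totals, teacher_ratings)  # total score vs. indicator

for number, r in enumerate(item_rest, start=1):
    print(f"item {number}: item-rest r = {r:.2f}")
print(f"total score vs. teacher rating: r = {external_r:.2f}")
```

A low or negative item-rest coefficient would flag an item that behaves differently from the rest of the test, while the external coefficient indicates how closely the total scores follow the other indicator. Neither figure is validity evidence by itself; both need to be interpreted against the intended score interpretation.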

Proposed techniques which use something other than scores as their primary data can be divided into three main groups: investigations of test content, of processing, and of the test discourse that is assessed to arrive at the scores. The sole technique suggested for the validation of the content of a test is expert judgement. To elicit the judgements, test developers must produce a domain specification and a test specification against which the actual test forms can be judged. The judgements should concern both the relevance and the representativeness of the content of the test.

Techniques recommended for the analysis of processing are more varied, including think-alouds, retrospective interviews, questionnaires, computer modelling, and experimental control of sub-processes. Studies employing such techniques focus on the nature of the construct and the actions through which the assessment is realised, i.e. Messick’s (1995) substantive aspect of validity. These studies often concentrate on test taker processing, which is understandable, because the score is assigned to the test taker and should say something about the test taker’s skills. As Banerjee and Luoma (1997) note, however, assessor processing has also begun to be investigated in tests which rely on human assessors.

Moreover, the language samples which the raters rate are also beginning to be analysed to provide an additional perspective on what it is that is being assessed. In language testing, such studies concern the assessment of writing (e.g. Bardovi-Harlig and Bofman 1989, Ginther and Grant 1997) and speaking (e.g. Lazaraton 1992, 1996, O’Loughlin 1995, Ross 1992, Young 1995). The researchers usually approach their data from a conversation analysis or discourse analysis perspective, and often count and describe interesting instances of language use. Apart from characterising test discourse and facilitating construct-related inquiries into how it compares with non-test discourse, these studies offer useful material for testing boards because they allow the boards to assess the quality of their test and its implementation. They can investigate, for instance, whether the scale descriptors that they use actually correspond to the features of discourse found in the performance of examinees who are awarded each of the scores. They can also study whether the interlocutors act as instructed and in a manner comparable with each other. The results may lead to minor or major revisions in the testing procedures.

A research technique which combines qualitative and quantitative analyses for validation is the range of judgmental methods employed in standard setting. If a language test uses reporting scales to assist the interpretation of test scores (e.g. if it reports the scores of a reading test on a five-band scale), a very important aspect of its validation is the setting of the cut points that divide the distribution of scores into categories. This is usually done by having experts give judgements on items or learners. The experts should be well qualified for their work and the procedures should be well enough specified to enable them to “apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and interpretations” (AERA 1999:54). One such procedure was specified in the context of the DIALANG assessment system by Kaftandjieva, Verhelst and Takala (1999). In it, qualified experts were trained in the use of the Council of Europe descriptive scale (Council of Europe forthcoming) and, following a highly specified procedure, judged the difficulty of each item that had been pretested. The judgement information was combined with empirical difficulty information from piloting to set cut scores. Such procedures provide empirical evidence for meaningful score conversion from a measurement scale to a conceptual reporting scale and support the validity of score interpretations in terms of the descriptive scale.
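The general idea of combining judged item levels with empirical difficulty estimates can be illustrated with a deliberately simple sketch. It is not the procedure of Kaftandjieva, Verhelst and Takala (1999), whose details are not given here. The item identifiers, the data, the use of the median as the judges' consensus, and the midpoint rule for locating a cut point are all assumptions made for illustration.

```python
from statistics import mean, median_low


def cut_points(judged_levels, empirical_difficulty):
    """Place one cut point between each pair of adjacent reporting bands.

    judged_levels: item id -> list of band levels assigned by the judges
    empirical_difficulty: item id -> difficulty estimate from piloting
    Returns a dict mapping (lower_band, upper_band) to a cut point on the
    empirical difficulty scale.
    """
    # Assumption: the consensus level of an item is the (low) median rating.
    difficulties_by_band = {}
    for item, ratings in judged_levels.items():
        band = median_low(ratings)
        difficulties_by_band.setdefault(band, []).append(
            empirical_difficulty[item]
        )

    bands = sorted(difficulties_by_band)
    cuts = {}
    for lower, upper in zip(bands, bands[1:]):
        # Assumption: the cut lies midway between the mean empirical
        # difficulties of the items judged to belong to adjacent bands.
        lower_mean = mean(difficulties_by_band[lower])
        upper_mean = mean(difficulties_by_band[upper])
        cuts[(lower, upper)] = (lower_mean + upper_mean) / 2
    return cuts


# Invented data: six items, each judged by three experts on a band scale,
# with hypothetical difficulty estimates from piloting.
judged = {
    "r01": [1, 1, 2], "r02": [1, 2, 1],
    "r03": [2, 2, 3], "r04": [2, 3, 2],
    "r05": [3, 3, 3], "r06": [3, 2, 3],
}
difficulty = {
    "r01": -1.2, "r02": -0.8,
    "r03": -0.1, "r04": 0.3,
    "r05": 1.0, "r06": 1.4,
}

cuts = cut_points(judged, difficulty)
for (lower, upper), value in cuts.items():
    print(f"cut between band {lower} and band {upper}: {value:.2f}")
```

In a sketch like this, the judgements supply the meaning of each band and the piloting data supply the location of the boundaries on the measurement scale. An operational procedure would also need to check agreement among the judges and whether judged and empirical difficulty orderings are consistent.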