3 Validation in language test development
3.4 The current concept of construct validity
3.4.4 The complexity of unified validity
Since the division of validity into three types with quantified indicators was abandoned and construct validity began to be seen as the main concern, validity has become a complex concept. One of the fathers of the current concept, Samuel Messick, would argue that this is a well-motivated complexity. He sees tests as instruments which are used in society, and his point is that the meanings of test scores cannot and should not be investigated without reference to the way they are going to be used. Instead, investigations of validity should always entail inquiry into the values and consequences involved in the interpretation and use of test scores. He has promoted a unified view of validity throughout his writings (e.g. 1975, 1980, 1982, 1984, 1989a, 1989b, 1994, 1995), but to make it more comprehensible, he has also proposed a new model for the concept.
Messick calls his faceted conception of unified validity the progressive matrix (see Table 1). He distinguishes two main facets in testing: the source of justification for the testing, which can be either evidence for score meaning or consequences of score use, and the function or outcome of testing, which can be either test interpretation or test use. The heart of this formulation is score meaning, and its four conceptual categories express Messick’s three main theses about the nature of validity: (1) that values form an integral part of score meaning, (2) that both the theoretical meaning arising from the measure and the applied meaning which is connected to particular contexts of test use need to be considered in construct validity, and (3) that consequences of test use form an essential aspect of score meaning.
Table 1. Facets of Validity as a Progressive Matrix (Messick 1989b:10)
                      Test Interpretation             Test Use
Evidential Basis      Construct Validity (CV)         CV + Relevance/Utility (R/U)
Consequential Basis   CV + Value Implications (VI)    CV + R/U + VI + Social Consequences
The first cell, the evidential basis for test interpretation, calls for evidence for score meaning, which is the core meaning of construct validity. The kinds of evidence that belong here are content relevance and representativeness, theoretical comprehensiveness of representation, correspondence between theoretical structure and scoring structure, relationships between items within the test, and relationships between scores or sub-scores and other measures (Messick 1989a:34-57). The second cell, the evidential basis for test use, requires additional evidence for the relevance and utility of the scores for a particular applied purpose. Construct validity belongs to this cell because the relevance and utility of score meaning for any purpose are dependent on the evidential meaning of the score. The consequential basis of test interpretation calls for considerations of the value implications of score interpretation, including the values of construct labels, theories, and ideological bases of the test. The consequential basis of test use requires the evaluation of the potential and actual consequences of score use.
Messick calls the matrix progressive because construct validity appears in each of its cells. In the previous version of the matrix, construct validity had appeared only in the first cell, although in explaining the figure, Messick (e.g. 1980:1019-1023) stressed that the other cells illustrated specific aspects of score meaning. By including construct validity in all the cells in 1989b, Messick resolved a disjunction between the figure and the explanation. The inclusion of construct validity in all four cells emphasizes the centrality of construct meaning in Messick’s conception of validity.
While agreeing that the social dimensions that Messick introduces to validity are important, Shepard (1993, 1997) criticizes the matrix formulation as conceptually difficult to understand: construct validity appears in every cell, yet the whole matrix also depicts construct validity. Moreover, she argues that the progressive nature of the matrix allows investigators to begin with “simple” construct validity concerns in the first cell, and they may never get to the fourth cell where consequences of measurement use are addressed. She says that this is unfortunate because it is not at all what Messick intended, but this is the way the matrix is sometimes used. Moss (1995:7) similarly agrees with the importance of the social meaning of scores, but suggests that the progressive matrix cannot replace the traditional categories of content, criterion, and construct-related evidence because it does not distinguish categories within the concept of construct validity. Rather, it locates construct validity in a larger notion of validity which includes values and consequences.
Chapelle (1994) applied Messick’s concept of construct validity to evaluate validity when c-tests are used in research on second language (L2) vocabulary. Her evaluation covered all the cells of Messick’s matrix, i.e. the four concerns of construct validity, relevance and utility, value implications, and social consequences. Following Messick’s theory, she began the investigation by defining the construct of interest, vocabulary ability. Throughout her analysis, she referred to this construct definition, using it as a criterion in the evaluation. She discussed the first cell, construct validity, through six types of evidence and analyses: content evidence, item analysis, task analysis, internal test structure, correlational research, and experimental research identifying performance differences under different theoretical conditions (Chapelle 1994:168-178). Although Chapelle did not investigate a single test but a test method, and the immediate context of reference was second language acquisition (SLA) theory rather than the use of examinations for decision-making purposes in social life, her faithful application of Messick’s concept of validity showed that the theory can be understood and operationalized.
Two observations from Chapelle’s study are particularly relevant for the present thesis: first, that the guiding force in her study was the detailed construct definition, and second, that the article actually defined a research programme for a thorough evaluation of the validity of using c-tests in research on L2 vocabulary. The programme is similarly based on the construct definition. The implication of the first observation is that a detailed construct definition can provide an elegant design principle for a coherent study. The implication of the second observation is that a construct-driven rationale can lead to a very broad research agenda. This is by no means a demerit; however, it is too big a challenge for an individual test development board. Lines will thus need to be drawn between what is immediately relevant for testing boards and what is part of a broader discussion.