4 Approaches to defining constructs for language tests
4.6 Test-based approaches to construct characterisation
4.6.2 Construct characterisation based on examinee performances
operationalized definitions of the constructs assessed. His study concentrates on verbally defined assessment scales as these are used in examinations. He suggests that if scale descriptors are detailed and clearly relatable to actual language test performances, a validation study of the scale provides evidence for score interpretation which is related to the construct presumed to be assessed (Fulcher 1996b:225). He reports on a study in which such a concrete, detailed scale for perceived fluency was developed and validated.
Fulcher (1996b) developed a data-based scale of perceived fluency on the basis of coded transcripts of recorded oral interviews. This was an ELTS oral interview, and the operational ratings provided a criterion against which Fulcher (1996b:212) could judge the ratings from the experimental scale that he developed. To construct the scale, Fulcher initially distinguished six categories of fluency-related features of learner speech. These were consistent with existing research literature on fluency which discusses “surface aspects of performance which interrupt fluency” (Fulcher 1996b:215), covering pausing, hesitation, and repetition/reformulation. However, Fulcher did not deal with surface features of performance descriptively, but coded the instances of the surface features in the transcripts for assumed rater interpretations of the surface phenomena. Fulcher used his own intuition as a rater to derive explanatory categories for the (dis)fluency phenomena and arrived at the following eight explanatory categories: end-of-turn pauses; content planning hesitation; grammatical planning hesitation; addition of examples, counterexamples or reasons to support a point of view; expressing lexical uncertainty (searching for words or expressions); grammatical and/or lexical repair; expressing propositional uncertainty; and misunderstanding or breakdown in communication (Fulcher 1996b:216-217).
Fulcher notes that some of the explanatory categories do not reflect a linear relationship between phenomena, interpretation, and ability. End-of-turn pausing, for instance, is fairly frequent in the performances of both low-ability and high-ability examinees, but not in performances at the intermediate proficiency ranges. However, the pausing occurs in different contexts, and raters interpret it differently in the two cases. Low-ability examinees pause to ask the interlocutor to take over before the proposition they are expressing is complete because they do not know how to continue, while high-ability examinees pause after completing a proposition to indicate that their turn is complete and the examiner can take over (Fulcher 1996b:220-221). The implication for the development of level descriptors for assessment scales is that a unidimensional increase of fluency phenomena and decrease of disfluency phenomena is a simplification which does not tally well with learner performances. Closer and more realistic description of pausing in rating scales, for instance, would take the nature and motivation for the learner’s pausing into account.
Having taken the multidimensionality of surface phenomena such as pausing into account when the transcripts were coded for the interpretive categories, Fulcher used discriminant analysis to investigate how well tallies of occurrences accounted for the operational ELTS ratings of the population. Only one person of 21 would have been given a different rating if the experimental scale had been used instead of the operational one.
Fulcher concluded that his results supported the usefulness of the explanatory categories and proceeded to use the categories and the transcribed interviews to construct a data-based fluency rating scale with categories 1-5 described and additional undefined categories of 0 (below 1) and 6 (above 5) added. In addition to the surface features and explanatory categories discussed above, he added descriptors of backchanneling to the final scale, because his review and re-review of the tapes in the course of the study indicated that backchanneling increased with higher-ability students, and he hypothesized that frequency of backchanneling would influence ratings (1996b:224).
Fulcher derived a fluency rating scale from the data and investigated its validity and functionality by asking five raters to use it in the rating of three oral tasks (two one-to-one interviews and a group discussion, as described in Fulcher 1996a). The students rated were different from the group that provided the performance data for the first part of the study, but belonged to the same population. Fulcher (1996b:214) used a G-study to calculate rater reliability and assessed the validity of the scale by investigating group differences and conducting a Rasch partial credit analysis on the scores awarded. The reliabilities and inter-rater and inter-task generalizability coefficients were very high, .9 or above (Fulcher 1996b:226), and this led Fulcher to conclude that the scale was able to discriminate between three teacher-assigned levels of general ability. The Rasch partial credit analysis indicated that the cut points for different skill levels were fairly comparable across the three tasks (p. 227). The researcher concluded that the scale was relatively stable across task types. Fulcher (1996b:228) reports that an examination of the scale in the context of a different examinee population is under way, which indicates that he considers it an open issue whether the concrete descriptions of fluency phenomena are generalizable across different groups of learners.
Similarly to Chalhoub-Deville (1995, 1997), Fulcher (1996b) focused on the relationship between test task characteristics and performance consistency. In terms of Chapelle’s (1998) figure (see Figure 1), then, he also worked with concepts in the middle, but unlike Chalhoub-Deville, his intention was to support the establishment of performance consistency in tests of speaking. He used analysis of learner performance as material and ascribed the assessments to the learners’ fluency, which combined the features of test discourse and learner factors. His conclusions concerned the notion of fluency in context, which he sought to describe empirically. His contribution to the construct description issue for language testers is a data-based way to develop rating scales, and he argued that through these means testers could provide construct validity data for the examination at the same time. This is done by describing the construct actually assessed in the examination in a rating scale with detailed level descriptors. The contrast is with existing rating scales, which may not be based on any direct observation of learner performances or systematic collation of rater perceptions but on armchair theorising which is not supported by critical conceptual analysis or by investigation against empirical data (Fulcher 1996b:211-212). Fulcher (1996b:217, 221) notes that the explanatory categories he used for rater interpretations of the features of examinee speech are inferences which require validation, but the statistical evidence of the usefulness of the categories for the prediction of the overall proficiency ratings lends some support to the plausibility of the explanations.
Furthermore, this approach offers the possibility of creating links between language tests and applied linguistic theory by using theory to suggest descriptive and explanatory categories to be used in rating scales and by using the assumptions of links in rating scales to inform studies of language ability or language learning to see if the links are plausible.
The rating scales that Fulcher developed are long and complex compared with the scales that assessment developers are used to seeing. Each level descriptor is more than 200 words long. If test developers choose to use this method to construct their scales, they would have to make sure that their assessors are willing to work with them. This detail in the scales might provide a useful means for a group of raters to agree on ratings, but whether this is so should be investigated. The strength of the scales is their direct basis on learner data. A weakness might be that scales from different systems may not be compatible, so that new analyses would always be needed if a new test were developed. Considering the number of stages needed, this approach to scale development is time-consuming, but if the information from the scale can be used to provide learner feedback and it proves that learners find this useful, an important gain might be made. Further study is needed to verify the case.
Similarly to Fulcher, Turner and Upshur (1996) used examinee performances to construct assessment scales. The researchers worked together with 12 elementary school teachers and built assessment scales for their ESL speaking test tasks with the help of the teachers’ perceptions of salient differences in learner performances. The project took the view that operationalized constructs are task-scale units, and since the project used two speaking tasks, they also developed two assessment scales. Upshur and Turner (1999) discuss the implications of their project for language testers’ understanding of the processes of test taking and scoring.
Upshur and Turner (1999:101-102) describe their scale-making activities as analysis of test discourse. Both the process that they used for deriving the scales and the nature of the resulting scales were different from standard test development procedures and also different from techniques used in discourse analysis. The scale-making procedure began with the participants agreeing in broad terms on the ability or construct they wanted to measure. The process itself consisted of iterative rounds of three steps. First, each member of a scale construction group individually divides a group of performances into two piles, top half and bottom half. Second, as a whole group, they discuss their divisions and reconcile differences. And third, they find some characteristic which distinguishes the two groups of performances from one another and state it in the form of a yes-no question. The same procedure was applied to successive sub-samples of the original sample so that six levels of performance were identified. The resulting scale took the form of five hierarchical yes-no questions which characterised salient differences in the sample of performances used in scale-making. The two scales, one for each task in the project, were then applied to the performances of 255 students.
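The way five hierarchical yes-no questions can yield six levels may be easier to see in a small sketch. The Python code below is only an illustration under stated assumptions, not Upshur and Turner's instrument: the tree layout (one question splitting the sample into halves, then two further questions within each half), the question wordings, and the dictionary keys such as `criterion_b` are hypothetical placeholders. The only details taken from the text are that there are five questions and six levels, that fluency separates the two highest levels, and that use of the mother tongue separates the two lowest.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# A performance is represented as a dict of yes/no rater judgements.
# All keys used below are hypothetical placeholders.
Performance = Dict[str, bool]


@dataclass
class Question:
    """One yes-no boundary question; each branch is a level or a further question."""
    text: str
    ask: Callable[[Performance], bool]
    if_yes: Union["Question", int]
    if_no: Union["Question", int]


def rate(node: Union[Question, int], performance: Performance) -> int:
    """Follow the answers down the question tree until a level (an int) is reached."""
    while isinstance(node, Question):
        node = node.if_yes if node.ask(performance) else node.if_no
    return node


def build_example_scale() -> Question:
    """Build an assumed five-question, six-level scale (illustrative wording only)."""
    # Highest vs next highest level: fluency (as reported for both scales).
    top = Question("Is the speech fluent?", lambda p: p["fluent"], 6, 5)
    # Intermediate boundary in the upper half: placeholder criterion.
    upper = Question("Is placeholder criterion B met?",
                     lambda p: p["criterion_b"], top, 4)
    # Lowest vs next lowest level: reliance on the mother tongue.
    bottom = Question("Does the speaker avoid relying on the mother tongue?",
                      lambda p: p["avoids_l1"], 2, 1)
    # Intermediate boundary in the lower half: placeholder criterion.
    lower = Question("Is placeholder criterion C met?",
                     lambda p: p["criterion_c"], 3, bottom)
    # First split: top half vs bottom half of the sample.
    return Question("Does the performance belong in the top half?",
                    lambda p: p["top_half"], upper, lower)


if __name__ == "__main__":
    scale = build_example_scale()
    example = {"top_half": True, "criterion_b": True, "fluent": False}
    print(rate(scale, example))  # prints 5
```

In this layout a rater answers at most three questions per performance, and the intermediate questions are exactly the place where, according to the authors, the two task-specific scales diverged.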
Upshur and Turner (1999) used many-facet Rasch measurement with the program FACETS (Linacre 1994) to analyse the two task-scale units on a common latent measurement scale. The analysis was performed on 805 ratings given by 12 raters to 297 speech performances produced by 255 children. It showed that the tasks were not of equal difficulty, that there were differences of severity between the raters, and that the score boundaries were also different, so that for instance it was easier to earn a 6 on one of the two tasks than the other (p. 95).
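For orientation, a many-facet Rasch model with task-specific score boundaries is commonly written in the form below. The notation is supplied here as an assumption and is not quoted from Upshur and Turner (1999): $P_{nijk}$ is the probability that examinee $n$ receives score $k$ from rater $j$ on task $i$, $B_n$ is the ability of the examinee, $D_i$ the difficulty of the task, $C_j$ the severity of the rater, and $F_{ik}$ the difficulty of the step from score $k-1$ to score $k$ on the scale for task $i$.

```latex
\[
  \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_{ik}
\]
```

Read against this form, the three findings reported above correspond to differences in the $D_i$ (task difficulty), the $C_j$ (rater severity), and the $F_{ik}$ (score boundaries that differ between the two tasks).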
Upshur and Turner (1999) also discussed the scales in terms of the features of language they focused on. Both scales employed fluency to distinguish between the highest level of achievement and the next highest level. Both scales also based the distinction between the lowest and next lowest levels on use of the mother tongue. The intermediate levels, however, proved to be distinguished by different features in the two scales, which according to Upshur and Turner’s analysis were related to task requirements and possibly the rating processes. On a story retell, where the raters knew the content that the students were trying to express, the levels were distinguished on the basis of the content of the retell performances. On an audio letter to an exchange student, where raters were not able to make such content assessments, they focused on the phonology and grammar of the students’ speech (Upshur and Turner 1999:103-104). The authors suggested that rating scales should be task-specific rather than generic, since effective rating scales reflect task demands and discourse differences. They also speculated that such task-specific application may happen even if raters are ostensibly applying a single standard scale to rate performances on different tasks (p. 105).
Upshur and Turner (1999:103) noted that the discourse analysis of performances produced by their scales was not exhaustive. It only identified features of the performances which were the most salient for the main purpose of the exercise, which was to enable raters to distinguish between ability levels. The features identified were also dependent on the nature of the performances used when the scales were created. There probably were other features which also distinguished between levels of achievement but which were not equally salient to the group of raters, and other performances might have included other salient features. The authors also pointed out (1999:105, 107) that the resulting task-specific assessments of achievement pose a problem of how to generalize from task-based assessment scores to any more generic ability estimates. However, the advantage is that task-specific assessments allow the assessors to give expression to the process of assessment, reflected in their project in the way that task-specific assessment strategies featured in the scales.
In terms of the distinctions of theoretical approach into trait theorists, behaviorists and interactionalists that Chapelle (1998) used, Upshur and Turner’s conclusions certainly show that they cannot be counted as trait theorists. The argument for strong task dependency may be interactionalist, but their reluctance to draw conclusions about individuals puts them on a socio-constructivist dimension of it, which is not separated as a clear category in Chapelle’s model. The researchers emphasize that the assessment process influences the scores given and that assessments are task-specific at least to a degree, whether assessment scales recognise this or not. Their method of employing rater perceptions to construct a scale provides yet another strategy for test developers to arrive at concrete formulations for explanations of scores. The method is so empirically grounded, however, that it may not suit all formal assessment contexts, especially if generalizations in terms of individual abilities are needed. If scales are constructed in such a strongly data-driven way, they really are task specific. This makes score interpretation in typical assessment contexts difficult, since the purpose of educational assessment is surely not only to categorise learners into six groups on the basis of a one-off task. There may be a useful purpose for such task-based assessments but, being new, it calls for detailed definition.
If individually based score interpretations are going to be made on the basis of this type of assessment, the meaning of the scores must be investigated, for instance by analysing transcripts of learner performances and examining whether they reflect the features of performance named in the scale-defining questions. Another question which might be asked is whether the scales were task-specific because they were developed to be so. This could be studied by employing a more generic six-level rating scale on the same performances and analysing whether there are differences between the ratings. It would be difficult to prove which scale was “more right”, but the data might show whether ratings are scale-specific or task-specific.
A related call to investigate test takers’ and assessors’ models in action is also made by Alderson (1997). He contrasts explicit and formal models of language, as embodied in theories of language ability, with implicit models that teachers, testers, and learners enact when they engage in language learning, teaching, and assessment. Extending this logic towards test development, this call would also encompass taking account of the models of language which test developers work by when they develop a test. One possible way of gathering data to study the usefulness of such an approach would be to keep account of decisions made in test development. The advantage of such analyses, as Alderson (1997) argues, is that they throw light on the perceptions actually involved in a concrete assessment instrument and its socially used products, the scores and their interpretation. Once such data exists, judgements could be made about whether perceptions vary and whether it matters. If data is not gathered, the assumption is automatically that it does not.