4 Approaches to defining constructs for language tests
4.6 Test-based approaches to construct characterisation
4.6.1 Construct characterisation based on score analysis
Chalhoub-Deville (1995, 1997) studied the construct of “proficiency”. She proposed that an important point in the testing of speaking proficiency in learning contexts is that the scores should reflect generic perceptions of proficiency, not only the teacher’s. She considered this important because “N[ative] S[peaker] teachers, who usually evaluate learners’ L2 oral proficiency, are acting as surrogates for the nonteaching NSs, it is necessary to validate these teachers’ criteria with those of nonteaching NSs” (Chalhoub-Deville 1995:258). She proposed that the “end user’s” perceptions of proficiency can be investigated empirically by asking groups of naïve raters to rate some speaking performances and then analysing the scores which they give. Since native speakers are not a unified group but differ from each other in terms of cultural background and experience of learner speech, it might be relevant to sample several subgroups among native speaker judges.
Chalhoub-Deville makes a distinction between theoretical models, such as Bachman’s CLA, and operational assessment frameworks which, in tests of speaking at least, are embodied in scales and scores. She maintains (Chalhoub-Deville 1997:11) that “when the purpose for which the model is to be used is clearly delimited, a [scale-based] parsimonious model, which relates to a theoretical model, but only includes the contextually salient components, is more appropriate”. Therefore, when test developers have properly defined the purpose of their test, the language ability that they want to assess, the proficiency level that the test is intended for, and the tasks they are going to use, Chalhoub-Deville (1997:11) suggests that they should empirically derive a specific, contextually appropriate assessment framework for the instrument rather than assume that a generic framework is appropriate. A more specific model would enable them to state more precisely which variables best explain the test scores and the differences between them. This consideration is important if the scores from a test do indeed vary by tasks and rater groups. Chalhoub-Deville conducted a study to investigate this.
Chalhoub-Deville (1995, 1997) studied “the components employed by native speakers of Arabic when assessing the proficiency of intermediate-level students of Arabic on three oral tasks: an interview, a narration, and a read-aloud” (1997:12). From the performances of six learners, she extracted two-minute samples on each task and played them to three assessor groups: 15 teachers of Arabic in the United States, 31 nonteachers resident in the United States, and 36 nonteachers living in Lebanon. The assessors used a rating instrument which included the overall impression and “specific scales, encompassing intelligibility, linguistic, and personality variables. Some of these scales, such as grammar and confidence, were common across all three tasks and some were task-specific, such as temporal shift in the narration and melodizing the script in the read-aloud” (Chalhoub-Deville 1995:261). The researcher had arrived at the list of criteria through an analysis of previous research and a pilot run with a working version of the scales. The judges gave their ratings on a 9-point scale from 1 = lowest performance level to 9 = educated native speaker.
Chalhoub-Deville (1995, 1997) used multidimensional scaling and linear regression to analyse the ratings and interpreted the derived dimensions in terms of the names of the criteria which seemed to belong to the same factor, backed up by an analysis of the features of performance which seemed to have caused the ratings. The results indicated that the proficiency ratings given on each of the tasks were influenced by two main factors, but that the nature and the weightings of the factors varied across tasks and rater groups. Teachers in the United States emphasized appropriate vocabulary usage in an interview performance, creativity in presenting information in narration, and pronunciation with a minor emphasis on confidence when they rated read-aloud. Nonteachers resident in the United States emphasized grammar-pronunciation and appropriate vocabulary use in the interview, creativity in presenting information when they rated narration, and confidence on the read-aloud task. Nonteachers resident in Lebanon emphasized grammar-pronunciation in the interview, grammar-pronunciation with a minor emphasis on creativity in presenting information on narration, and confidence in read-aloud. Chalhoub-Deville did not express the size of the differences in terms of learner scores, i.e. she did not report whether learners scored differently on different tasks and whether it was possible to combine the information from the different tasks to provide an overall score. Instead, she concluded that oral ratings are context-specific and influenced by both tasks and rater groups. She stated that the implication for researchers investigating oral proficiency was to take care to employ a range of tasks and rater groups, as this would lead to a better understanding of the proficiency construct (1995:275). The implication of her results for test developers, Chalhoub-Deville suggested (1997:17), was that empirical investigation of end-user constructs is prudent especially if the scores are used for making high-stakes decisions. She stated that an advantage of her approach is that it can be employed during the test construction stage, before scores are actually used to make decisions (Chalhoub-Deville 1997:17).
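To make the general logic of this kind of analysis concrete, the following is a minimal sketch in Python of how multidimensional scaling can be combined with linear regression to interpret derived dimensions. It is not a reconstruction of Chalhoub-Deville's analysis: the ratings are randomly generated, the function name classical_mds and the variable grammar are hypothetical, and classical (Torgerson) scaling is used here only because it can be written with NumPy alone. The specific scaling procedure used in the original study is not described above.

```python
import numpy as np


def classical_mds(dist, n_dims=2):
    """Classical (Torgerson) MDS: place items in n_dims from a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain a Gram matrix
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Synthetic data only: 6 speech samples rated by 15 raters on a 1-9 scale
rng = np.random.default_rng(0)
n_samples, n_raters = 6, 15
ratings = rng.integers(1, 10, size=(n_samples, n_raters)).astype(float)

# Euclidean distance between the rating profiles of each pair of samples
diff = ratings[:, None, :] - ratings[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# Two-dimensional configuration of the samples
coords = classical_mds(dist, n_dims=2)

# Hypothetical criterion scale, e.g. a mean "grammar" rating per sample
grammar = rng.uniform(1, 9, size=n_samples)

# Regress the criterion on the two dimensions (with an intercept) to see
# how well the dimensions account for it
design = np.column_stack([np.ones(n_samples), coords])
coef, *_ = np.linalg.lstsq(design, grammar, rcond=None)
fitted = design @ coef
ss_res = ((grammar - fitted) ** 2).sum()
ss_tot = ((grammar - grammar.mean()) ** 2).sum()
r_squared = 1.0 - ss_res / ss_tot

print("coordinates:\n", coords)
print("R^2 for the hypothetical grammar scale:", r_squared)
```

In an analysis of this type, a criterion scale that is well predicted by the coordinates on one dimension would be a candidate label for that dimension; with random data such as these, no substantive interpretation is possible.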
In terms of theoretical approaches, Chalhoub-Deville’s construct is interactionalist in that it connects the proficiency of the individual with varied task demands and rater perceptions. The researcher addresses performance consistency through the argument that proficiency ratings are context-specific to both tasks and rater groups but she does not provide numerical data on the size of the differences in terms of contextualised proficiency. The analysis is focused on ratings rather than language, so that in the context of Chapelle’s model (see Figure 1) she can be considered to analyse some features in the middle bar, namely those of performance (in)consistency and the ways in which they reflect learner factors and contextual features in the assessment of speaking.
The value of Chalhoub-Deville’s approach is its considered attention to score user perceptions and the empirical grounding of the constructs derived. However, the study was clearly research-oriented and not aimed at developing a test. While the data were assessments (as made by naïve raters without training), the study did not yield assessment scales complete with level descriptors. Chalhoub-Deville did not specify what type of scale the components should inform, though she made the point that scales should be task- and context-specific. Because she was not actually building a test, she did not need to decide which end user group or combination of groups was the most relevant for the assessment context and how operational raters could be made to assess the features which were salient to them. The approach provides interesting, empirically grounded information about audience perceptions of task-related proficiency, but for scale construction and score explanation, test developers need to combine this method with others. Moreover, the author says nothing about the relationship between the different task-specific ratings for individual examinees. Nevertheless, the study serves as a reminder that assessment constructs may indeed be task-specific, and if detailed feedback is needed in an assessment context, this approach to defining task-specific scales might be able to provide such detailed assessment information.
4.6.2 Construct characterisation based on examinee performances