2.4 Task development for teachers: Weir (1993)

Weir’s (1993) Understanding and Developing Language Tests concentrates on the development and revision of test tasks. The book is organised according to tests of different skills: spoken interaction, reading, listening, and writing. A first chapter discusses general issues in test construction and evaluation, and a concluding chapter summarises the thrust of the book and looks into the future of test development. Each of the skill-specific chapters first summarises research on the nature of the skill discussed, and then considers the nature of the test situation and the criteria relevant for assessing the skill. This is followed by a presentation of a wealth of task types, with discussion of what particular skills are being tested and how they are tested through each task type. Furthermore, Weir discusses the pros and cons of task types from the point of view of the teacher who needs to prepare all the task materials, implement and assess the tests, and produce results which are as valid and reliable as possible.

2.4.1 View of test development

Weir (1993) regards test development as one of the classroom teacher’s professional activities, aimed at supporting and enhancing learning. He argues that, to achieve this aim, testing should be done well and negative impact, such as adverse reactions to test tasks and harmful influence on teaching and learning habits, should be avoided. He proposes that good testing can ensue if teachers think about what they want to test, know the alternative ways in which the different skills can be assessed, and work together by commenting on each other’s draft tests and assessment criteria. He provides a framework for the planning and evaluation of test tasks. The framework covers three dimensions: the operations, i.e. activities or skills, to be tested; the conditions of performance while the learners are taking the test; and the quality of student output, which refers to the assessment criteria to be used and the ideas about levels of language ability which underlie them.

The basic unit in testing on which Weir concentrates is the task. He wants teachers to plan their test tasks carefully, so that each task serves a purpose motivated by the teacher’s understanding of language skills. This begins with teachers thinking and talking about what they want the tasks to test, and then making reflective decisions about how these skills should be tested and how the performances are to be assessed. The idea is to create links from intention to implementation, so that the skills originally envisioned actually get tested. The means that Weir proposes is careful planning. In addition to discussing concrete examples of task types for each of the four skills, Weir provides generic guidelines for good test development under the headings of moderating tasks, moderating the mark scheme, and standardising marking.

Weir proposes that when moderating their own or other people’s tasks, teachers should pay attention to the level of difficulty of both individual tasks and the test as a whole, making sure that each test covers a range of levels of difficulty. Furthermore, they should ensure that the tasks elicit an appropriate sample of the students’ skills and avoid excessive overlap, which arises when too many tasks address a narrow range of skills and other skills are omitted altogether. In terms of technical accuracy, task reviewers should make sure that the tasks are easy to understand and that the questions are linguistically easier to comprehend than the actual task material. Moreover, they should assess the appropriacy of the total test time and the test layout. Finally, Weir points out that the task review process should help guard against bias arising from one-sided test techniques or cultural unfamiliarity of content. (Weir 1993:22-25.)

Weir suggests that when moderating the mark scheme, evaluators should check that the assessment guidelines define all the acceptable responses and their variations and that subjectivity is reduced as far as possible where assessment of spoken or written performances is concerned. Evaluators should also check that item weighting is justified on content grounds if weighting is used. He recommends that test developers should leave as little as possible of any calculation or summing activities to raters, because this is a potential extra source of error in sum scores. Reviewers should also check that the marking scheme is intelligible enough, so that a group of markers can be guaranteed to mark different sets of performances in the same way. His final recommendation for moderating the mark scheme concerns conceptual coherence in the assessment system: reviewers should check whether the skills required by the scoring operations, for instance spelling in open-ended reading comprehension tasks, are also what the scores are interpreted to mean. If the scores are only interpreted to convey information about reading, reviewers should perhaps recommend changes in the scoring procedures. (Weir 1993:25-26.)

As for the actual marking work, Weir states that standardisation is required to ensure uniformity of marking, so that any individual’s score does not depend on who marked his or her performance. For Weir, standardisation means that the marking criteria are communicated to markers in such a way that they understand them, that trial assessments are conducted, that assessment procedures are reviewed, and that follow-up checks are conducted during each successive round of marking (1993:26-27).

2.4.2 Principles and quality criteria

The principles of good language testing for Weir (1993) are validity, reliability, and practicality. A test is valid if it tests “what the writer wants it to test” (Weir 1993:19). This presupposes that the test writer can be explicit about the nature of the desired ability. Weir argues for the development of theory-driven tests and supports this by always discussing existing theories at the beginning of his chapters on the assessment of language skills. His motivation for concentrating on test techniques is that flawed tasks may threaten the validity of the test. He also briefly discusses authenticity under the heading of validity, making the point that although full replication of real-life language use cannot be achieved in language tests, an attempt should be made to make language use in test tasks as life-like as possible within the constraints of reliability and practicality. The case he makes, albeit concisely, is very similar to that of Bachman and Palmer (1996).

Weir (1993:20) defines reliability as score dependability. This means that the test should give roughly the same results if it were given again, and more or less the same result whether the performances are assessed by rater A or rater B. Moreover, reliability is connected with the number of samples of student work that the test covers: if the test contains only one task, it is difficult to judge whether the result applies just to that task type or whether it says something meaningful about the skill more broadly. Weir (1993:20-21) states that validity and reliability are interdependent in that known degrees of consistency of measurement are required for test scores to make any sense, but also that consistency without knowing what the test is testing is pointless. Furthermore, Weir connects reliability with the quality of the test items. Unclear instructions and poor items can make the test different for different candidates, thus affecting reliability. Similarly, sloppy administration can introduce variation which influences the test scores, which makes consistency of administrative procedures partly a reliability concern.
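
Weir (1993) does not prescribe any particular statistic for this. Purely as an illustration of what “more or less the same result” from rater A and rater B might mean in practice, the following Python sketch compares two raters’ marks for the same set of performances, using the proportion of identical marks and the Pearson correlation. The function names, the 1-5 scale, and the scores are all hypothetical and are not taken from Weir.

```python
from math import sqrt


def exact_agreement(scores_a, scores_b):
    """Proportion of performances that both raters gave the identical mark."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equally long, non-empty score lists")
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of marks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 marks")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a rater gives constant marks")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical marks (scale 1-5) given by two raters to the same six performances.
rater_a = [3, 4, 2, 5, 4, 3]
rater_b = [3, 4, 3, 5, 4, 2]

print(f"exact agreement: {exact_agreement(rater_a, rater_b):.2f}")
print(f"Pearson r:       {pearson_r(rater_a, rater_b):.2f}")
```

The two figures answer different questions: exact agreement shows how often a learner would receive the same mark from either rater, whereas the correlation shows whether the raters rank the performances similarly even when their marks differ. Neither says anything about what the test measures, which is the validity side of Weir’s interdependence argument.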

For Weir, practicality is connected with cost effectiveness. In classroom contexts this concerns the teacher’s and the students’ time and effort in particular, but it also relates to practical resources such as paper or tape recorders, number of teaching hours that can be reserved for testing purposes, and availability of collegial support to comment on draft tasks (1993:21-22). Weir makes a strong plea that practicality should not outweigh validity of the authenticity-directness type. He states that although some task types may be easier to administer and score, if the skills that they are measuring cannot be specified, the tests are not worth a great deal.

2.4.3 View of validation

Apart from a brief discussion of validity as a principle for test development, Weir (1993) does not discuss validation as an activity. However, he states that “validity is the starting point in test task design” (Weir 1993:20). Moreover, his whole approach to task design and revision is built on testing skills which the test developers can name and trying to guarantee that this is what actually gets tested during the assessment process. He does not discuss ways of providing proof that this is happening.

Weir (1990), which Weir (1993) refers to as “the companion volume to this book” (1993:28), discusses construct, content, face, washback, and criterion-related validity. He makes a distinction between a priori and a posteriori construct validation. A priori construct validation involves a description of the theoretical construct that the test is intended to measure, and a posteriori validation entails statistical studies to investigate whether this is happening (Weir 1990:24). As for content validity, he argues that in classroom testing, given the restrictions on time and resources, a priori consideration of the content of test tasks is the most feasible validation procedure. He stresses the acceptability side of face validity and considers this important for the test to be effective, but joins others in warning that content and construct validities should not be sacrificed to acceptability. He defines “washback validity” or simply “washback” as the influence of the test on the teaching that precedes it. Finally, Weir sees criterion-related validity as “a quantitative and a posteriori concept” to determine the extent to which a test correlates with appropriate external criteria (Weir 1990:27). He argues against blind faith in criterion-related evidence, because the validity of the criterion can be questionable, and because it is possible that scores from a test correlate well with an external criterion while the test’s authors cannot say what the test is measuring. Weir (1990:29) suggests that an appropriate mix of validity evidence depends on the purpose of the particular test that is being validated. He also makes a case for a possible new combination of evaluation criteria that might be applied to communicative tests (Weir 1990:27). He suggests that in addition to content, construct, and washback work, systematic judgements could be gathered from students, teachers, and other users of the test on its perceived validity before the test ever gets administered. Only if the test passes this hurdle should “confirmatory a posteriori statistical analysis” be conducted, presumably against the posited factor structure of the test. The proposal is a reiteration of Weir’s emphasis on a combination of a priori and a posteriori work, and similar proposals, though perhaps with less emphasis on the theoretical/empirical division and the order in which studies should be conducted, are made e.g. by Alderson et al. (1995), Bachman and Palmer (1996), and McNamara (1996).

2.4.4 Distinctive characteristics of the text

While Weir (1993) promotes principles very similar to those brought up in Bachman and Palmer (1996), and presents principles of task revision which largely cohere with Alderson et al. (1995), the distinctive characteristic of his book is his approach to test development through test tasks. Moreover, he is perhaps the most emphatic among the writers on test development about the need for test developers to specify in advance what skills their tasks are supposed to be testing, and to try to make sure that this is what actually happens. However, he does not provide means for performing such checks.
