
Over fifty of the T-14 android as he recalled had made their way by one means or another to Earth, and had not been detected for a period in some cases up to an entire year. But then the Voigt Empathy Test had been devised by the Pavlov Institute working in the Soviet Union. And no T-14 android – insofar, at least, as was known – had managed to pass that particular test.

Dick (1996:29)

CHAPTER 4. DESIGN OF A NEW SET OF RUBRICS

No test is perfect. Even the Voigt Empathy Test (Dick, 1996) had its flaws. A test serves one particular purpose and is unlikely to be useful in a context different from the one it was designed for. Language tests are no exception. The test used to check the English of a group of prospective pilots should have specifications different from those of university entry tests, as discussed in section 2.1.

Tests designed for broad audiences lose specificity. Tests designed for very specific purposes will not be valid for broad audiences. Tests must be designed taking into account what needs to be measured, and so must the assessment criteria linked to them. In this chapter we will explore the difficulties encountered when designing our measurement tools, the rubrics, conceived to be used in 9 different public universities of Andalusia under different budgetary and political constraints.

In section 4.1 we will focus on the genesis of the project, on how the needs were detected, and on how a first attempt to create a common model proved inoperative due to the variety of factors to be taken into account. In section 4.2 we will describe a scalable protocol for designing rubrics which has already proved to be valid. As mentioned in the introduction to this dissertation, this protocol is one of the most relevant contributions of the present work. The protocol provides test experts with a straightforward and usable tool to build rating scales based on the CEFR (Council of Europe, 2001). Within the protocol, we will describe how 11 different raters validated a newly created set of rubrics in 2 stages. Section 4.2.4 is particularly relevant because it provides an in-depth statistical validation of the outcomes of the protocol.

4.1 Revision of previous sets of Andalusian rubrics

In the context described in section 3.3, all Andalusian universities had been developing their own rubrics for proficiency tests. The tests served a common purpose across Andalusia and had the same specifications, but they were marked through different rubrics. This was an evident problem which had to be solved. In 2011 each Andalusian university had created its own rubrics for oral and written production. The contradiction was obvious: these universities had already started to share tasks but were marking them with different rubrics, none of which had previously been validated analytically.

This jeopardized not only fairness but also the general validity of the tests (see sections 2.2.1 and 2.2.3). At that point, the Technical Advisory Committee (see section 3.3) decided to unify rating criteria across universities. Since most of the exams designed by the universities were B1-level exams, this was the level chosen for the first set of rubrics. If it worked, the design process could then be extended to levels below and above B1. The Committee also decided to start by producing rubrics for the oral component of the exams since, due to lack of resources and expertise, it was not possible to tackle the creation of rubrics for speaking and writing at once. Again, if the design proved successful for speaking rubrics, it could be extended, mutatis mutandis, to writing rubrics. Since the majority of tests designed in Andalusia were English proficiency tests, English was the language chosen for the description of the rubrics.

The most challenging aspect in the design and validation of the intended common set of rubrics was, without any doubt, creating a final tool that professionals from different universities would accept as their own. If a set of rubrics is designed without consulting those who will be using it in real tests, the rubrics, which are intended to generate consensus, may become a major source of dissent. After the decision to create a new set of rubrics, we were commissioned by the Technical Advisory Committee for the task.

To begin our work, we listened to the opinions of experts from the different universities. Rubrics are often a direct way to operationalize the construct of a test. This meant that if the new rubrics incorporated the proposals of fellow colleagues, and even parts of the rubrics they had formerly used, the new yardstick would not be alien to them and would not generate negative reactions. After listening to our colleagues, we decided to compile and contrast all the existing rubrics to identify the points they shared as well as their strengths and weaknesses. The assumption was that, after this initial analysis, it would be easy to develop a new set of rubrics taking into account everything learnt from the 9 Andalusian universities. Unfortunately, it was not that simple.

When we compiled the existing rubrics we noticed a high degree of difference among them. Though most of the pre-existing rubrics from Andalusian universities were analytic, some were holistic; the number of band descriptors ranged from 5 to 10; the linguistic features assessed (adequacy, task achievement, language, pronunciation, etc.) were not always the same; and, in some cases, 2 rubrics which shared a feature defined it in different ways. On top of all these differences, none of the pre-existing rubrics was explicitly linked to the CEFR (Council of Europe, 2001), although all of them had been developed with it in mind. Finally, and most importantly, none of them had been validated through analytic methods.
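
The kind of comparison described above can be illustrated with a minimal Python sketch that tabulates the characteristics of each compiled rubric (scale type, number of band descriptors, features assessed) and extracts which features the analytic rubrics share and where they diverge. The band counts and feature sets below are hypothetical placeholders used only for illustration, not the actual contents of the compiled Andalusian rubrics.

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    university: str
    scale_type: str                              # "analytic" or "holistic"
    bands: int                                   # number of band descriptors
    features: set = field(default_factory=set)   # linguistic features assessed

rubrics = [
    # Hypothetical band counts and feature sets, for illustration only.
    Rubric("Almería", "holistic", 6),
    Rubric("Córdoba", "holistic", 5),
    Rubric("Cádiz", "analytic", 10, {"task achievement", "language", "pronunciation"}),
    Rubric("Jaén", "analytic", 7, {"adequacy", "language", "pronunciation"}),
]

# Compare only the analytic rubrics: which features do they all assess,
# and which appear in some rubrics but not others?
analytic = [r for r in rubrics if r.scale_type == "analytic"]
shared = set.intersection(*(r.features for r in analytic))
divergent = set.union(*(r.features for r in analytic)) - shared

print("Features shared by all analytic rubrics:", shared)
print("Features used only by some rubrics:", divergent)
print("Band descriptors range:",
      min(r.bands for r in rubrics), "to", max(r.bands for r in rubrics))
```

A tabulation of this kind makes the divergences visible at a glance: the shared features are candidates for the core criteria of a common rubric, while the divergent ones signal where negotiation between universities would be needed.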

It was a major challenge to create a set of rubrics that could compensate for these gaps and also be perceived as the common denominator of all the pre-existing ones. To respond to this challenge we envisaged an 8-stage process that would yield (we hoped) the desired result:

Table 4.1.a. Stages of the first attempt to design a common set of rubrics

For stage 1 we compiled the B1 rubrics for oral proficiency tests of 8 of the 9 Andalusian universities: the universities of Almería, Cádiz, Córdoba, Granada, Jaén, Málaga, Pablo de Olavide and Seville. All the analyzed rubrics are available for download at <https://goo.gl/UIsLO7>. The University of Huelva had not developed any set of rubrics thus far and used external exams for the certification of its students. Out of the 8 compiled sets, only 7 were usable because the rubrics of the University of Granada were specific to multi-level exams.

The rubrics compiled were generically used across different languages (i.e. to assess French, Italian, English, etc.) except in the case of the University of Seville, which used different rubrics for different languages; in that case, the rubrics used at stage 1 were the ones corresponding to the English area.

Most of the scales were analytic, i.e., rubrics in which the rater assigns a score to each of the linguistic features being assessed in the task (Jonsson and Svingby, 2007:131-132). Two of the sets were holistic (Almería and Córdoba), i.e., clearly aimed at producing overall judgments about the quality of the