Stages of the first attempt to design a common set of rubrics


Stage 1 Comparison of pre-existing sets of rubrics from Andalusian universities
Stage 2 Definition of the type of scale: analytic vs. holistic
Stage 3 Identification of how many and which linguistic features should be included
Stage 4 Identification of the number of bands and scores to be included
Stage 5 Design of descriptors departing from the pre-existing ones
Stage 6 Qualitative validation
Stage 7 Quantitative validation
Stage 8 Implementation

Two of the sets included a separate table for holistic marks despite being analytic (Cádiz and Pablo de Olavide) and 1 included holistic appraisals as an individual linguistic feature (Málaga). Only the set from Córdoba was task-specific, although not consistently. The set from Granada was specific in the design of its outcome but general in the definition of its dimensions.

The most frequent number of bands was 6, which meant that most sets of rubrics distinguished 6 different scores for each linguistic feature measured. The distribution of the number of bands was irregular and ranged from 3 (Córdoba) to 11 (Pablo de Olavide, which used half points for bands). In addition, 6 of the 8 sets used numeric labels for the bands (generally from 0 to 6), while 1 used lexical labels (Córdoba: pass, merit, distinction) and another one (Seville) used a mix of numeric and lexical labels. The case of the set from Granada was particularly different because its descriptors were not graded in bands. Instead, it displayed dichotomous achievement descriptors (i.e. test takers either reached the level of the descriptor or not). In some cases, some bands were described as sharing characteristics of upper and lower levels (Cádiz, Jaén, Málaga).

The set from Granada deserves particular attention. Besides the aforementioned lack of proper bands, it was the only set designed to cover 2 CEFR (Council of Europe, 2001) levels (B1 and B2). It was built to match the requirements of the exam administered at the University of Granada, which contains 3 different tasks: the first aimed at B1 level, the second at an intermediate B1/B2 level and the third at B2 level. Table 4.1.b reflects the heterogeneity of the sets of rubrics analyzed.

Table 4.1.b. Analysis of pre-existing rubrics
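The comparison summarized in table 4.1.b can be thought of as a small structured dataset. The following sketch is purely illustrative: the field names are ours, and the 2 example entries are reconstructed from the prose rather than copied from the project's records.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RubricProfile:
    """Attributes on which the pre-existing sets of rubrics were compared."""
    university: str
    level_coverage: str      # "unilevel" or "multilevel" (only Granada covered 2 CEFR levels)
    task_specific: bool      # task-specific vs. general design
    scale_type: str          # "analytic" or "holistic"
    bands: Optional[int]     # number of bands per feature (None for dichotomous descriptors)
    band_labels: str         # "numeric", "lexical", "mixed" or "none"

# 2 example entries reconstructed from the prose; the remaining sets would follow suit.
cordoba = RubricProfile("Córdoba", "unilevel", True, "analytic", 3, "lexical")
pablo_de_olavide = RubricProfile("Pablo de Olavide", "unilevel", False, "analytic", 11, "numeric")
```

Profiling all 8 sets in this way makes the heterogeneity described in the text easy to inspect side by side.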

The challenge was obvious because, although some of the sets analyzed were quite similar, let us say, in the dimensions measured (Cádiz, Jaén and Málaga), others were clearly in a league of their own (Córdoba or Granada). It was very difficult to find a common denominator that the participating universities and their test administrators were willing to share. Most of the test administrators consulted were used to their pre-existing rubrics, and early on they proved reluctant to adopt a new set that they did not perceive as operational or view as a sound and useful tool which, at the same time, retained part of their previous system. They were pessimistic and thought that the new rubrics would require a time-consuming familiarization process.

Following the analysis displayed in table 4.1.b, there were several obvious decisions to make in the design of the new rubrics, and some others which were not so obvious. At stage 2, for example, it was clear that our rubrics would be analytic, since most of the pre-existing ones were. Analytic scoring is somewhat less practical than holistic scoring because it takes more time to apply the criteria to each performance to be marked (see East, 2009:91), but analytic scales also provide more detailed information about a test taker’s performance in different aspects and are for this reason preferred over holistic schemes by many writing specialists (see Weigle, 2002:115). Most raters would agree that the task of rating is approached in a more consistent way when using a shared set of analytical criteria (East, 2009:91).

Another easy decision was whether to make the rubrics general or task-specific. Luckily, the only task-specific sets of rubrics were those of Córdoba and Granada, which meant that we would design a general type of rubric. When rubrics are designed with a general character, they can be used in a wider range of tasks. Rubrics which are too specific, designed for one particular task or test type, are not useful outside the task and test they were designed for.

The not-so-clear decisions were chiefly related to linguistic features, and they arose at stage 3. As table 4.1.b above shows, the rubrics did not always break down the same linguistic features. Finding the common denominator in the existing sets of rubrics was very complex because some of the universities merged into 1 category features that others analyzed separately. Discourse management in the sets of Almería, Jaén and Pablo de Olavide, for instance, included aspects of both Coherence and cohesion and Fluency. In turn, Fluency and coherence in the rubrics of Cádiz was linked both to Fluency and to Coherence. At this point, it was important to simplify the variety of linguistic features in order to create operational categories.

Besides the mix of features, the names given to some of them were misleading in many cases, and there was no consistency across rubrics in this respect. For example, the linguistic features of Grammar (Granada), Grammar and Vocabulary (Almería, Cádiz, Jaén, Málaga and Pablo de Olavide) and Range of Vocabulary and structures (Seville) referred to much the same concepts. To clarify how many linguistic features the rubrics actually contained, we analyzed the frequency with which categories of linguistic features appeared across the scales, as displayed in table 4.1.c below.
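As a minimal illustration of the kind of count behind table 4.1.c, the sketch below computes the prevalence of each category as the share of the 8 sets in which it appears. The feature-to-set mapping is only a partial reconstruction from the prose; the complete picture is given in the table itself.

```python
# Which of the 8 sets mention each category (partial reconstruction from the prose;
# the full mapping is displayed in table 4.1.c).
feature_sets = {
    "Grammar": {"Almería", "Cádiz", "Granada", "Jaén", "Málaga", "Pablo de Olavide", "Seville"},
    "Pronunciation": {"Almería", "Cádiz", "Granada", "Jaén", "Málaga", "Pablo de Olavide", "Seville"},
    "Interaction": {"Almería", "Cádiz", "Jaén", "Málaga", "Pablo de Olavide", "Seville"},
    "Task fulfillment": {"Córdoba", "Seville"},
}

TOTAL_SETS = 8  # the 8 Andalusian universities

for feature, sets in feature_sets.items():
    prevalence = 100 * len(sets) / TOTAL_SETS
    print(f"{feature}: present in {len(sets)} sets ({prevalence:.1f}%)")
# Grammar 87.5%, Pronunciation 87.5%, Interaction 75.0%, Task fulfillment 25.0%
```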

The grammar of candidates’ speeches is still perceived by professionals as an accurate reflection of the knowledge that test takers have of the language, as we learn from its 87.5% prevalence. In the rubrics analyzed, grammar is normally used to assess candidates’ control over linguistic units such as words and phrases, sentences and clauses. As a result of this view, it goes hand in glove with vocabulary and syntactic concerns; in fact, grammar and vocabulary are only differentiated in 1 set of rubrics (Granada). Range of vocabulary and structures (Seville) clearly referred to grammar too.

Pronunciation is on a par with Grammar in terms of importance (87.5%): it was present in 7 different sets, on its own on 5 occasions (Almería, Cádiz, Jaén, Pablo de Olavide and Seville) and together with Fluency on 2 (Granada and Málaga).

Table 4.1.c. Categories, linguistic features and frequency of pre-existing rubrics

Interaction was found in as many as 6 sets (Almería, Cádiz, Jaén, Málaga, Pablo de Olavide and Seville), which equals 75% of the rubrics analyzed. Interaction is, undoubtedly, 1 of the aspects perceived as most important in oral production insofar as it is an inherent part of most forms of oral communication. In all universities except Seville, candidates take oral interviews in pairs.

Coherence and cohesion was a particularly complex dimension. Globally speaking, it was present in 75% of the rubrics analyzed. The concepts of Coherence and Cohesion only appeared together explicitly in 1 set of rubrics (Granada), and yet many others made reference to either Coherence or Cohesion under different headings. In this sense, Communication (Málaga) and Discourse management (Almería, Jaén and Pablo de Olavide) were, despite their names, clearly referring to Coherence, Cohesion and Fluency alike. This mix blurred the boundary between Coherence and cohesion and other dimensions, which led us to consider the possibility of unifying this category with the next feature, Fluency.

Fluency was also present in many other sets (75%), disguised under different names. In some cases the references were explicit, as in Fluency and coherence (Cádiz); in others, as we have just discussed, Discourse management (Almería and Jaén) as well as Communication (Málaga) included references to Fluency. Málaga thus doubled its references to Fluency, since this dimension was also explicitly referred to in the individual linguistic feature of Pronunciation and fluency.

Cohesion and coherence and Fluency were at this point serious candidates to be merged into 1 category. The justification for this was based on 2 aspects. First, theoretically speaking, pragmatic competence can be understood as a function of discourse competences (cohesion and coherence among others), functional competences (where fluency is key) and design competences. This means that Cohesion and coherence and Fluency are rooted in common ground (Council of Europe, 2001:123-130; North and Schneider, 1998:227, 235). Second, in practical terms, merging them would make the rubrics more manageable thanks to a reduction of the linguistic features and descriptors to be borne in mind while rating. This approach is followed in tests such as IELTS (UCLES 2012:18) or the Cambridge suite of exams, where “[f]luency and coherence are captured under the Discourse Management criterion” (Khalifa and Ffrench, 2009:13). It will also be the approach followed in the final version of the rubrics presented in this dissertation, as shown in section 4.2.

Despite most sets being analytic, a number of them (37.5%) included Holistic appraisals, as we have already mentioned. In 2 cases (Cádiz and Pablo de Olavide) the Holistic approach was included as a separate grid of descriptors and only in 1 case (Málaga) was it embedded in the main set of rubrics. However, we felt that a Holistic assessment could not be interpreted as a linguistic feature since it is an intrinsic characteristic of the rubric itself, not one of its elements. As a consequence, using a holistic appraisal as a linguistic feature was easily ruled out.

Task fulfillment was only present in 2 sets (Córdoba and Seville) and thus shows one of the lowest frequencies, 25%. This is consistent with the fact that most sets were general and not task-specific. It was ruled out not only because of its low prevalence but also because a task-specific design that could account for all types of oral tasks found across the 8 universities would have been impossible.

At the bottom of the classification we find the cases of Vocabulary and Accuracy, each present in only 1 set (Granada and Seville respectively). The descriptors of Accuracy could be matched to Interaction or to Coherence since they focused on the effectiveness of communication (i.e. “A few minor errors which do not impede communication”, “Fairly frequent errors which do not prevent communication of the essential message” or “Large number of errors make utterances unintelligible”). This was a conceptual problem since the set which included Accuracy as a separate dimension (Seville) already included specific dimensions for Interaction and Task completion. There was no explicit indication as to whether the errors mentioned in the descriptors were grammatical or errors of any other type. Due to these interpretation problems, Accuracy was seen as a complementary feature of other categories.

At this point of stage 3, we decided to limit the number of linguistic features to 4, the most frequent number of features in the analyzed sets. Two of the sets clearly displayed 5 linguistic features (Málaga and Seville) and 2 (Cádiz and Pablo de Olavide) had 4 features plus an additional holistic appraisal, but there were 3 sets that clearly displayed 4 dimensions (Almería, Granada and Jaén). Thus 4 dimensions was the logical choice.

As to which features these would be, the 5 most frequent ones, as shown in table 4.1.c above, were Grammar, Pronunciation, Interaction, Coherence and cohesion and Fluency. Since Fluency could easily be merged with Coherence and cohesion, the final decision was straightforward: by merging 2 of the 5 most frequent features we were able to come up with 4 categories, namely Grammar, Pronunciation, Interaction and a mix of Fluency and Coherence and cohesion. This closed stage 3 of our first attempt (see table 4.1.a above).
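The closing step of stage 3 amounts to a simple reduction over the frequency ranking. The toy sketch below illustrates it; the merged label is a placeholder of ours, not the final name adopted in section 4.2.

```python
# 5 most frequent categories from table 4.1.c, in descending order of prevalence
top_five = ["Grammar", "Pronunciation", "Interaction", "Coherence and cohesion", "Fluency"]

# Merge Fluency into Coherence and cohesion to arrive at the 4 analytic dimensions
final_features = [f for f in top_five if f != "Fluency"]
final_features[final_features.index("Coherence and cohesion")] = "Coherence, cohesion and fluency"

print(final_features)
# ['Grammar', 'Pronunciation', 'Interaction', 'Coherence, cohesion and fluency']
```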

In stage 4 the objective was to define the number of bands. Although the frequency analysis (table 4.1.c above) yielded 6 as the most frequent number of bands in previous rubrics (followed by 5), a final decision was made to limit the number of bands to 5 to enhance rater cognition. The assumption was that in a 5-band scale it would be easy to interpret band 3 as the minimum required level for a given linguistic feature. With band 3 as the pass, set in the middle of the scale, the rubric is divided into 2 equal halves. In a 6-band set (as the initial frequency analysis suggested), the pass level would fall somewhere between bands 3 and 4, which is less intuitive and thus more confusing in terms of rater cognition. This is not to say that a 6-band scheme would have been less reliable, but a reduced number of bands also reduces the number of descriptors and the cognitive load placed on raters. In short, the fewer bands, the easier it is for raters to use the rubrics.
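The arithmetic behind this choice can be sketched as follows; the function names are ours, and the equal weighting of the 4 features when averaging is an assumption made purely for illustration.

```python
PASS_BAND = 3  # with 5 bands, band 3 is both the midpoint and the minimum required level

def feature_passes(band):
    """A linguistic feature is passed when it reaches the middle band of the 5-band scale."""
    return band >= PASS_BAND

def overall_score(bands_by_feature):
    """Average of the analytic bands (equal weighting is assumed here for illustration)."""
    return sum(bands_by_feature.values()) / len(bands_by_feature)

sample = {"Grammar": 3, "Pronunciation": 4,
          "Interaction": 2, "Coherence, cohesion and fluency": 3}

print(overall_score(sample))                            # 3.0
print(all(feature_passes(b) for b in sample.values()))  # False: Interaction falls below band 3
```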

Having decided at stage 4 the number of bands to use, we were ready to move on to stage 5. Stage 5 was at the core of the process because in it we would have to fine-tune the pre-existing descriptors, eliminate redundant ones and fill in the existing gaps. Stage 5 was dedicated to the creation of the very essence of the rubrics: intelligible, operational and valid descriptors that raters could use swiftly and reliably while rating oral performances.

The first analyses of pre-existing descriptors for stage 5 were far from promising. Unfortunately, we noticed that our analysis had been flawed from the beginning. The pre-existing descriptors did not allow a 1-to-1 comparison. Not only were the descriptors from the different universities completely different from one another, they had also never been validated quantitatively, nor had they been linked to the CEFR (Council of Europe, 2001). Because of this, retaining them at all costs did not seem a good idea. Retaining them would have made the new rubric more familiar to raters (our original intention), but at a tremendous cost: the cost of reliability.

To make the decision we opened an opinion poll and asked 10 of the experts in the Technical Advisory Committee. They, together with other colleagues, would at some point have to use the new rubrics and, as team leaders, they would have to introduce the new tool to the rest of the raters in their universities. It was paramount to reach consensus at this point. After the survey, 3 experts voted to refurbish the pre-existing descriptors and 7 voted to bin them and start from scratch.

Thus, after 5 months of work, we failed to proceed to stage 5 and decided to make a fresh start. Sometimes a timely retreat is a victory. We had learnt our lesson, and our time had not been completely wasted. We were now more conscious than ever of the variegation and limitations of our rubrics. We had also gained awareness of which linguistic features were more important for the teams of raters involved, we had tentatively defined some of these linguistic features and we knew which was the most operational number of bands to include. Perhaps, we hoped, we would be able to use this knowledge in the future.

4.2 Development of a protocol to design rubrics

Indeed, we had learnt our lesson. When we decided to continue with the design of the rubrics, we thought it was a good idea to revise the literature again to find out whether any protocol existed for such a purpose. This time we were not only looking for generic directions on how to create rubrics; we were looking for literature that described how the problem of variegation in descriptors could be solved. We found nothing on this because, generally speaking, rubrics are analyzed in the literature either as improvements of previous sets or as sets of descriptors created from scratch. We opted for a fresh start and began to think that it would be interesting to see how the CEFR (Council of Europe, 2001) descriptors could be used for the development of rubrics. Again, the literature is not conclusive in this respect, and that was the moment in which we envisaged the creation of a protocol that would not only establish the different steps to follow in the design of rubrics but would also use the CEFR (ibid.) for that purpose.

One might expect that, due to the importance that rubrics have in contemporary testing, there would be extensive literature about the way in which they must be designed, validated and implemented. Nevertheless, “there is surprisingly little information on how commonly used rating scales are constructed” (Knoch, 2009:42). Brindley (1998:117) points out that “it is often difficult to find explicit information on how the descriptors used in some high profile rating scales were arrived at”. McNamara (1996:182) wrote the following about the descriptors that make up rubrics:

Given the crucial role that they play in performance assessments, one would expect that this subject would have been intensively researched and discussed. In fact, certainly in the field of language assessment, we find that this is not so; we are frequently simply presented with rating scales as products for consumption

Turner (2000:556) makes the same point:

With the important role that rating scales play in performance evaluation, one would think that the literature would abound with descriptions and procedures for scale construction. But, as we quickly learn, this is not the case.

Generally speaking, this is still true, which is striking since, for the sake of clarity and fairness, many rating scales of high-stakes tests are unveiled to candidates, yet with no reference whatsoever to the way in which they were developed. We wanted our protocol to bridge this gap.

The design of rubrics for writing is more frequently addressed in the literature but, with minimal changes, its insights can be extrapolated to the design of speaking rubrics. To build our protocol we basically took into account Weigle (2002), Knoch (2009), Dean (2012) and, indirectly, Turner (2000), Ffrench (2003), Council of Europe (2009) and section 5.1.3 of the CEFR (Council of Europe, 2001).

Weigle (2002:109) points out that the first decision to make when designing rubrics is to establish the type of rubric that we want to build. She gives the choice of primary trait, holistic or analytic rubrics (ibid.) although, as we saw in section 2.1.3, the choice is slightly more complex than this. We chose to develop an analytic scale because these are more reliable (Davies et al., 1999:75) and because this was the tendency in most Andalusian universities, as we saw in table 4.1.b above.

After this decision, Weigle (2002:122-125) proposes to consider the following decisions (illustrated in the sketch after this list):

• Who is going to use the rubric (test designers, raters, stakeholders or a combination of them)

• What linguistic features will be assessed by the rubric

• How many bands it will have

• How scores will be reported.
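These decision points map naturally onto a small configuration object. The sketch below is ours rather than Weigle’s: the field names are invented for illustration, and the values reflect the choices reported in this chapter.

```python
from dataclasses import dataclass

@dataclass
class RubricSpec:
    """Design decisions for a rating scale, following the checklist above."""
    scale_type: str             # "analytic", "holistic" or "primary trait"
    intended_users: tuple       # test designers, raters, stakeholders or a combination
    linguistic_features: tuple  # the dimensions the scale breaks performance into
    bands: int                  # number of bands per feature
    score_reporting: str        # how scores are reported

spec = RubricSpec(
    scale_type="analytic",
    intended_users=("raters", "test designers"),
    linguistic_features=("Grammar", "Pronunciation", "Interaction",
                         "Coherence, cohesion and fluency"),  # placeholder label, see section 4.2
    bands=5,
    score_reporting="band per feature",  # illustrative value only
)
```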

Knoch (2009:38) extends this list by pointing to the need to describe the linguistic features included and by underlining the importance of quantitative validation (ibid.:79-100; 193-287). Weigle (2002:134-136) suggests inter- and intra-rater reliability checks to validate the rubrics, while Dean (2012:18) only mentions qualitative methods of validation. In terms of validation procedures, the most comprehensive method is that proposed by Knoch (2009), since it uses both CTT and MTT. Knoch’s (2009) is, indeed, a very comprehensive method of validation but, in our opinion, it is so comprehensive and complex that it is not practical at early stages of development. There is an additional aspect which is not mentioned by Weigle (2002), Knoch (2009) or Dean (2012) but which is paramount in the European context of these rubrics. As mentioned above, this aspect is the linkage of the rubrics to the CEFR (Council of Europe, 2001), which constitutes the most important contribution of the design protocol that we propose:
