Determining the scoring validity of a co-constructed CEFR-based rating scale

Abstract:

Considering scoring validity as encompassing both reliable rating scale use and valid descriptor interpretation, this study reports on the validation of a CEFR-based scale that was co-constructed and used by novice raters. The paper asks whether it is possible to construct, together with novice raters, a CEFR-based rating scale that (a) yields reliable ratings and (b) allows for a uniform interpretation of the descriptors.

Additionally, this study asks whether co-constructing a rating scale with novice raters helps to stimulate a shared interpretation of the descriptors over time. For this study, six novice raters employed a CEFR-based scale that they and 14 peers had co-constructed to rate 200 spoken and written performances in a missing data design. The quantitative data were analysed using item response theory, classical test theory and principal component analysis. The focus group data, collected after the rating process, were transcribed and coded using both a priori and inductive coding. The results indicate that novice raters can reliably use the CEFR-based rating scale, but that the interpretations of the descriptors, in spite of training and co-construction, are not as homogeneous as the inter-rater reliability would suggest.

Keywords: scoring validity, novice rater, rating scale construction, CEFR, co-construction

I Introduction

The Certificate of Dutch as a Foreign Language (CNaVT) is the largest test of Dutch as a foreign language. Currently there are two academic tests in the CNaVT suite: one at B2 level and one at C1 level of the Common European Framework of Reference for Languages (CEFR). Combined, they are taken by some 1000 candidates annually. Following a decision by the funding organisation, the existing academic tests will be merged into one test that has two cut scores, one at B2 and one at C1. Additionally, the funding body requires that the CEFR form the basis of the new rating scale. The main reason for this is transparency: the link with the CEFR is to be visible at every level of the test, from task design to rating scale. This study reports on the trial of the fourth version of a CEFR-based rating scale that was first constructed in 2010 together with subject specialists. Since then, novice raters have worked together with test developers to rewrite the descriptors. For administrative and financial reasons, the CNaVT can only recruit raters among master’s students of linguistics, meaning that all CNaVT raters are novices. The aim of this study is to determine whether co-construction with novice raters may lead to a rating process that is both reliable in terms of rater agreement and valid in terms of descriptor interpretation: a rating process, in other words, that has scoring validity (Weir, 2005a).

II Rating Scales: a Long History of Development and Validation

By the end of the 19th century, statisticians such as Edgeworth (1888) were devising ways of increasing the reliability of written examinations. It was a known problem that different judges rated the same essays differently (Cattell, 1905), so instruments were created to streamline the judgments of individual judges. By 1910, Thorndike had developed what probably was the first standardized rating scale, based on thousands of real-life performances (Thorndike, 1910). In the following decades, concerns about rating reliability would lead to the development of other performance-based rating scales, such as those designed by the Alpha Test (Henmon, 1929) and proposed by Thorndike in 1938 (Spolsky, 1995).

Fundamentally, the issues that were high on the agenda then are not so different from those we grapple with today: standardizing what is prone to variation and measuring what is abstruse. Then and now, rating scales are tools used to facilitate measurement and reduce unwanted variation by streamlining individual raters’ judgments with each other and with the test’s construct (Cooper & Odell, 1977; Lumley, 2005). Rating based exclusively on the intuition of individual raters leads to excessive variability, as shown by Diederich et al. in 1961.

From Edgeworth to Diederich, rating reliability has been a long-standing concern in language testing, continuing until today. Rating reliability studies often compare empirically developed scales to intuitively developed ones, experienced raters to novices or single-score holistic scales to their multiple-criteria analytic counterparts. To date, the results of these studies are inconclusive (Harsch & Martin, 2013). Studies that focus on the developmental background of rating scales find that intuitively developed scales may cause problems for raters because they may invite subjectivity (Fulcher, 2012; Galaczi et al., 2011). On the other hand,
empirically developed scales may be too detailed for operational use (Fulcher et al., 2011). Of course, this dichotomy is not always as strict and crossover formats are possible. Knoch (2009a) actually argues in favour of intuition-based scales supplemented with empirical data. Studies that investigate the reliability of holistic and analytic rating scales have led to mixed results, but trends are visible (Cumming et al., 2002; Barkaoui, 2011). Holistic scales are most reliable when used by experienced raters, who have a richer framework to fall back on (Weigle, 2002; Barkaoui, 2010). Novice raters on the other hand often have a dissimilar conception of language proficiency (Isaacs & Thomson, 2013) and may be more strongly influenced by the communicative and argumentative quality of a performance than by its form (Barkaoui, 2010). Consequently, an analytic scale may offer novice raters the explicit guidance they need (Harsch & Martin, 2013) and when using analytic scales, novice raters are not necessarily less reliable than experienced ones, although they may be less harmonious in their interpretation of the scale (Derwing et al., 2004). Coming to the same score in a different way may appear unproblematic from a reliability perspective, but there is more to rating than score consistency and rater agreement (Weigle, 2002; Lumley, 2005). In fact, a high overall rater agreement may mask an underlying dissonance concerning the interpretation of the rating criteria (Carlsen, 2003; Weigle, 2002; Harsch & Martin, 2012). Consequently, when determining how valid a score is, it is vital to gain an understanding into how a rater interpreted the criteria (Harsch & Martin, 2012; 2013). According to Weir (2005a) and Messick (1989) before him, it even makes little sense to rigidly separate reliability from validity. They propose to regard reliability not as opposed to validity, but as one type of validity evidence (Weir, 2005a) contributing to overall scoring validity. 
In this study too, scoring validity is seen as a combination of the uniform interpretation of a scale across raters, combined with indicators of reliability, such as inter-rater agreement (Messick, 1989; Weir, 2005a).

One aspect of scoring validity is rater reliability, i.e. the extent to which raters are consistent with their own and with other raters’ ratings. In achieving this consistency, rater training is of considerable importance, but it does not always reduce rater variability (Elder et al., 2005). Spolsky (LTest-L communication, May 2013) proposes to achieve parallel rating scale interpretations by developing a scale in conjunction with the raters. From the discussion this proposition generated, it is clear that the role of the rater in the rating scale development process is still under debate, even though raters have been part of rating scale construction for nearly a century (see the 1929 Alpha test in Spolsky, 1995: 47). More recent rating scale construction and validation studies that take into account the voice of the rater include Galaczi et al. (2011) and Harsch & Martin (2012). These studies differ somewhat from earlier co-constructed scales, however, since their foundation lies partly in empirical data and partly in a predefined, general description of language proficiency, known as the Common European Framework of Reference for Languages (CEFR).

The CEFR’s goal is to provide “a common basis for the elaboration of language syllabuses, curriculum guidelines, examinations, textbooks, etc.” (Council of Europe, 2001: 1). However, as Little (2007) points out, the CEFR’s impact in the field of language testing surpasses its influence in the other domains of language education it wishes to provide a common basis for. Perhaps because of this, the CEFR has received ample attention from language testers who have identified not only its beneficial consequences, but also its shortcomings as a language testing tool. These inadequacies include having a limited basis in both second language acquisition theory (Fulcher, 2012) and empirical research (Alderson, 2007), and being too focused on production instead of reception (Weir, 2005b). The CEFR’s level descriptors have been criticized for their generic nature and for containing impressionistic terms and inconsistencies (Alderson, 2007; Fulcher, 2012). Countering the criticism, North (2014) refers to the CEFR as a point of departure in rating scale development and encourages the practice of redeveloping the descriptive scales so they better fit the context and requirements of a test.

In spite of the CEFR’s reported flaws, it has become all but necessary for a language test in Europe to be linked to the framework (Fulcher, 2004). In line with this evolution, some tests have created a CEFR-based rating scale, which brings about a specific set of concerns. Galaczi et al. (2011) found that the CEFR descriptors were unusable as ready-made rating instruments and that some of the CEFR’s scales were too concise to guide the rating scale construction process. In their multimodal rating scale construction process they used qualitative techniques and quantitative methods to streamline the raters’ interpretation of the scales. In a similar study, Harsch & Martin (2012) propose a data-driven approach to co-constructing a CEFR-based rating scale with raters, since this allows for exploring the causes of dissonance among raters in their interpretation and application of the scale. The authors acknowledge that their study required a substantial amount of time and resources, and suggest an approach adapted to contexts with more limited means. They suggest “using already established rating scale descriptors […] as starting point. In a next step, the use of these descriptors could then be monitored during a relatively short trial period to ensure they can be applied reliably in the given context” (Harsch & Martin, 2012: 244).

Set in the context of a moderately sized test, this paper discusses the effect of rating scale co-construction on scoring validity and investigates whether a trial period suffices for novice raters to reliably and validly use a rating scale that was co-constructed by themselves and their peers. By involving novice raters in the refinement of a CEFR-based rating scale, this study combines Harsch and Martin’s (2012) suggestion with Spolsky’s recommendation for rating scale co-construction.

III Rating Scale Construction

This section provides a brief overview of the rating scale construction prior to this study. A full account of the process is given in a previous publication (Author, 2013). The first step in the rating scale design was determining the rating criteria, based on a literature review, which had identified the main characteristics of academic language proficiency, such as the ability to deal with abstract information (Hulstijn, 2011; Taylor & Geranpayeh, 2011), the importance of argumentation and knowing how to combine different sources and skills (Cho & Bridgeman, 2012; Cumming, 2013; Sawaki et al., 2013). Based on this review, a preliminary set of rating criteria was put to two independent focus groups of domain experts in order to shed light on their “indigenous criteria” (Jacoby & McNamara, 1999). After adding “sociolinguistics” to the criteria, the domain experts agreed on a final set of criteria, which was ratified by a questionnaire administered among 188 academics (see Author, 2013 for a more detailed overview). Table 1 displays these criteria and the performance aspects they relate to.

[Table 1]

In the first draft of the four-band rating scale, the level descriptors were based on the Dutch translation of the CEFR and were adapted to the context of use (Galaczi et al., 2011). The upper level A of the rating scale corresponds to C1 or above, B to B2, and C and D to B1 and A2 or below respectively. In line with their source, all level descriptors in the first version of the rating scale were positive and devoid of context. For such criteria as “grammar”, “vocabulary” and “pronunciation” little was adapted in the first draft. For more academic skills, such as “argumentation” and “summarizing”, however, no directly corresponding scales were at hand, and elements from different CEFR descriptors were combined into one. This was necessary because, even though the educated language user (Jones, 2011) and the academic context are central to the CEFR, the framework insufficiently describes essential LAP features such as argumentation, summarizing, and the use of source material (Cumming, 2013; Sawaki et al., 2013).

In the first pilot of the rating scale in July 2011, four trained novice raters judged 125 performances (ρ = .720**, K = .35). In August 2011, six trained novice raters used the rating scale to assess 60 oral performances (ρ = .876**, K = .45) and in July 2012, four trained novice raters judged 55 written and spoken performances (ρ = .876**, K = .27). After each pilot, the raters participated in a focus group to discuss possible improvements to the rating scale in order to enhance a uniform interpretation of the descriptors. Consequently, the rating scale used for the trial reported on in the current study was based on the CEFR, but enriched with performance-based data and rewritten with novice raters of a comparable background every step of the way.

IV Research Questions

This study examines to what extent a trial period allows novice raters to validly and reliably use a CEFR-based rating scale that was co-constructed by themselves and by their peers. It reports on the following research questions:

Scoring validity:

1. Does an iterative co-construction process followed by a trial period allow for the reliable use of a CEFR-based rating scale by novice raters?

2. If reliable, to what extent do the ratings harbour a uniform interpretation of the descriptors?

Rating scale co-construction:

3. To what extent does co-constructing a rating scale help to stimulate a shared interpretation among novice raters over time?

V Method & procedure

In order to answer the first research question, six trained novice raters judged 200 performances using a CEFR-based rating scale (see “Participants” below). For the second and third question, the quantitative data were supplemented with an analysis of the focus group conducted immediately after rating.

Two hundred written and spoken samples were selected from the May 2013 test administration. The samples contained performances on B2 and C1 versions of four open task types: a written summary, a written and a spoken argumentation and a spoken presentation. In the first task type, test takers summarized one 800-word text about language acquisition (B2) or three texts with a combined length of 1500 words about wolf packs (C1). In the written and spoken argumentation task, candidates composed an argument relating to one of three study-related topics (B2) or to one of three socio-political themes (C1). In the presentation task, candidates gave a 7-slide presentation about social media use (B2) or an 11-slide research-based presentation (C1).

The samples were stratified based on L1, country of origin and original test score. In the 2013 ratings, 14% of the samples were placed below B2, 56% at B2 and 30% at C1. The samples were anonymized and distributed among six raters.

2 Participants

Contractual constraints and financial restrictions prevent the CNaVT from working with a fixed pool of raters. Consequently, this study only includes novice raters who had not worked with the CNaVT before and took part in a two-day familiarisation. During the first day, the raters received information about the test’s construct, purpose and rating scale. They also scored eight performances for each task type and discussed any confusion or vagueness caused by the rating scale, the construction process of which is discussed above. If any confusion arose during the rater training, an alternative formulation was discussed in a process of co-construction. During the second day, the raters reconvened to rate additional samples. Only when all raters had assigned the same score to identical samples of all task types, were they considered ready to participate in the study.

The participants of this study were representative of the actual rater population in terms of age, gender and educational background: they were between 22 and 24 years old; five were female and one was male; all were undergraduate students of Dutch completing their master’s degree. The raters involved in rewriting previous versions of the rating scale also fit this profile.

Throughout this study, the raters will be referred to as participants or respondents. In the analysis all names are replaced by rater codes, i.e. R1 through R6.

3 Mixed-method approach

Since in rating scale research, a strict separation between reliability and validity is hard to maintain (Messick, 1989; Weir, 2005a; Harsch & Martin, 2012; 2013), rating scale validation studies often adopt a mixed-method approach (Galaczi et al., 2011; Harsch & Martin, 2012; 2013). Since the late 1990s, multi-faceted Rasch analyses have become a widely accepted way of determining rater characteristics such as severity and consistency (McNamara & Knoch, 2012), while qualitative methods serve to identify raters’ interpretations of the descriptors.

a Quantitative

Each participant scored 50 written and 50 spoken performances, spread across the four task types. Since there were two versions of each task type, one at B2 and one at C1 level, there were eight tasks and 200 performances in total. Each performance was rated three times, which amounts to 600 ratings or 100 per participant. Because not all participants judged all performances, a missing data design was used.
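
The arithmetic of such a design can be checked with a short sketch. The allocation rule below is purely an assumption made for illustration, since the way performances were distributed among raters is not described here: every possible triad of the six raters receives the same number of performances. The split between written and spoken performances is not modelled.

```python
from collections import Counter
from itertools import combinations

RATERS = ["R1", "R2", "R3", "R4", "R5", "R6"]  # rater codes used in the study
N_PERFORMANCES = 200                            # from the study
RATINGS_PER_PERFORMANCE = 3                     # from the study


def build_design(raters, n_performances, k):
    """Hypothetical allocation: cycle through all k-rater combinations."""
    triads = list(combinations(raters, k))  # 20 triads for 6 raters and k = 3
    return {p: triads[p % len(triads)] for p in range(n_performances)}


design = build_design(RATERS, N_PERFORMANCES, RATINGS_PER_PERFORMANCE)

# Count how many performances each rater receives under this allocation.
load = Counter(rater for triad in design.values() for rater in triad)
total_ratings = sum(load.values())

print(total_ratings)  # total number of ratings in the design
print(dict(load))     # ratings per rater
```

Under this assumed allocation the workload is balanced and every pair of raters shares some performances, which is what allows rater severity to be estimated on a common scale despite the missing data.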

After an exploration of the descriptive data, a dimensionality analysis was conducted in order to determine whether the speaking and writing tasks could be used within the same IRT model. Based on these results, also discussed below, three multi-faceted Rasch analyses were undertaken using the Facets program (Linacre, 2012). In one analysis, both skills were used in the same model and in the other two, speaking and writing were analysed separately. The Rasch analysis allows for modelling the probability of interacting ordinal parameters or 'facets' in a way that best explains the observed data. These parameters are candidate ability, task difficulty, rater severity, and criterion difficulty. The difficulty and the ability parameters are estimated on the same scale, as visualized in the Wright map below (see Figure 1). Based on the Rasch results, this study includes three indicators of rating scale robustness: discriminatory potential, rater uniformity, and rating variability.

Discriminatory potential: A rating scale’s discriminatory potential indicates whether a rating scale distinguishes proficient candidates from less proficient ones. Candidate separation is a good indicator of discriminatory potential.

Rater uniformity: Three measures were used to determine whether different raters use the same criteria in a uniform way: rater separation, weighted kappa and intraclass correlation (ICC). A small rater separation in the IRT analysis indicates that raters have a similar understanding of the criteria. Because this study employs a missing data design in which not all participants assessed the same performances, weighted kappa was used as a measure of rater agreement. Quite a robust measure (Vieira et al., 2010), weighted kappa quantifies the level of agreement between multiple ratings that use ordinal scales (Sim & Wright, 2005). The ICC coefficients between overlapping rater triads provided a final check for rater uniformity.

Rating variability: In a robust rating scale raters and rating criteria fit the Rasch model. The Infit Mean Square (Infit MnSq) value is a good indicator of such a model fit. The closer Infit MnSq approaches 1, the better a rater or a criterion fits the Rasch model. An Infit MnSq within the .5 to 1.5 range fits the Rasch model and is productive for measurement (Linacre, 2012: 257).
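
For reference, a many-facet Rasch model with the four facets named above is commonly written in the rating scale form shown below, followed by the usual definition of the information-weighted (infit) mean square. The symbols are labels chosen here for illustration, not notation taken from the study, and the exact model specification entered into Facets is not reported, so this is a sketch of the standard formulation rather than of the authors' set-up.

```latex
% Assumed notation: n = candidate, t = task, r = rater, c = criterion, k = score category
\[
\ln\!\left(\frac{P_{ntrck}}{P_{ntrc(k-1)}}\right)
  = \theta_n - \delta_t - \alpha_r - \beta_c - \tau_k
\]
% Infit mean square for one element (e.g. one rater), summed over its observations i
\[
\mathrm{Infit\ MnSq} = \frac{\sum_{i} (x_i - E_i)^2}{\sum_{i} W_i}
\]
```

Here \(P_{ntrck}\) is the probability that candidate \(n\) receives category \(k\) from rater \(r\) on criterion \(c\) for task \(t\); \(\theta_n\) is candidate ability, \(\delta_t\) task difficulty, \(\alpha_r\) rater severity, \(\beta_c\) criterion difficulty and \(\tau_k\) the threshold between categories \(k-1\) and \(k\). In the fit statistic, \(x_i\) is an observed rating, \(E_i\) its model expectation and \(W_i\) its model variance, so a value near 1 means the observed residual variation matches what the model predicts.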

Additionally, a principal component analysis was used to determine which criteria account for most variance in each task type. In such an analysis, each standardized variable contributes a variance of 1. Hence, components with an eigenvalue of less than 1 explain less variance than a single variable contributes (Child, 2006).

b Qualitative

The day after the participants had finished rating, they took part in a semi-structured focus group to discuss their perception and use of the rating scale. Group speak (Belzile & Öberg, 2012) was avoided by asking the respondents to individually write down their thoughts pertaining to the main discussion topics before the focus group started (Kahneman, 2011). During the focus group, participants were asked to elaborate on their stated viewpoints in order to facilitate discussion or contrast opinions (Humphreys et al., 2012). The main discussion topics included the respondents’ use and perceptions of the rating criteria, the wording of the descriptors and the operationalization of the levels.

The focus group data were transcribed verbatim and coded using the NVivo10 software. Coding was both a priori and inductive (Dey, 1993; Miles & Huberman, 1994). The inductive coding used to analyse the discussions fit into a larger a priori coding scheme (Mortelmans, 2011) based on known salient issues in the context of CEFR-based rating and working with novice raters as discussed in the literature review above (see Appendix 1 for the coding scheme).

The quotes used in this paper are translated from the original Dutch transcriptions.

VI Results

1 Quantitative analysis

a Frequency analysis

The frequency analysis below (Table 2) shows that the level corresponding to B2 was selected most frequently, followed by B1, C1 and A2.

[Table 2]

b Dimensionality analysis

The dimensionality analysis indicates that the speaking and writing tasks belong to different dimensions for criteria that recur in all tasks (see Appendix 2). The test for significance using Fisher’s r-to-z transformation indicates a presumption against the null hypothesis for Vocabulary (p = .045), but not for the other overlapping criteria (Structure & cohesion p = .166, Grammar p = .212). In other words: the criterion “vocabulary” most likely functions differently in speaking and writing tasks. Because of the probability of dimensionality, speaking and writing tasks were analysed separately for the facets “criteria” and “tasks”, but for the facet “rater”, the combined data were used.
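
A minimal sketch of such a comparison is given below. It assumes the two correlations come from independent samples and uses a two-tailed test; whether this matches the exact procedure followed in the study is not stated. The function names and the example values are hypothetical, as the underlying correlations are reported only in the appendix.

```python
import math
from statistics import NormalDist


def fisher_z(r):
    """Fisher's r-to-z transformation of a correlation coefficient."""
    return math.atanh(r)


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-tailed z-test of H0: the two population correlations are equal.

    Assumes the correlations were computed on independent samples.
    """
    z_difference = fisher_z(r1) - fisher_z(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = z_difference / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical example values, not taken from the study.
z, p = compare_independent_correlations(r1=0.80, n1=100, r2=0.60, n2=100)
print(round(z, 3), round(p, 4))
```

A small p-value would indicate that the criterion correlates differently in the two sets of tasks, which is the kind of evidence used above to analyse speaking and writing separately.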

c Discriminatory potential

The Wright map below offers a visual representation of the analysis of the combined speaking and writing data - one of the three Rasch analyses conducted in this study. The first column shows the ratings, logistically transformed so they compose a linear logit scale (Knoch, 2009b). In this case, the logit values range from -3 to 3. The next columns display the candidates, the raters, the tasks and the rating criteria. A more proficient candidate will appear higher on the graph, as will a stricter rater, a more difficult task and a more difficult criterion. The candidate separation is distributed over five logits (see Figure 1, Column 2), indicating a substantial spread.

[Figure 1]

d Rater uniformity

The rater separation ranges from -.7 to .91 (Table 3) with all raters staying within the .5 to 1.5 Infit MnSq range. The weighted kappa coefficients (Table 4) of the two dyads rating the same performances indicate a very good inter-rater agreement (K = .802, K = .797) and the ICCs of the comparable triads show high coefficients for the total scores (r ≥ .832), but not so for criteria such as “mechanics” (r ≥ .580), “sociolinguistics” (r ≥ .573) and “initiative” (r ≥ .401) – possibly indicating a disharmonious interpretation.

[Table 3] [Table 4]
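
As an illustration of the agreement statistic reported above, the sketch below computes a weighted kappa for one rater dyad on the four-band scale. It is not the procedure used in the study: the weighting scheme (linear or quadratic) is not reported, so both are offered as options, and the function name and example ratings are hypothetical.

```python
def weighted_kappa(ratings_a, ratings_b, categories, weights="linear"):
    """Weighted kappa for two raters scoring the same items on an ordinal scale.

    `categories` lists the scale bands in order; `weights` is "linear" or
    "quadratic" (an assumption, as the study does not name the scheme).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must score the same, non-empty set of items.")
    k = len(categories)
    index = {category: i for i, category in enumerate(categories)}
    n = len(ratings_a)

    # Observed proportions for every pair of bands.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance ** 2

    observed_d = sum(disagreement(i, j) * observed[i][j]
                     for i in range(k) for j in range(k))
    expected_d = sum(disagreement(i, j) * row[i] * col[j]
                     for i in range(k) for j in range(k))
    if expected_d == 0:
        raise ValueError("Kappa is undefined when both raters use a single band.")
    return 1.0 - observed_d / expected_d


# Hypothetical ratings on the A-D bands, best band first.
BANDS = ["A", "B", "C", "D"]
rater_1 = ["A", "B", "C", "D", "B", "B"]
rater_2 = ["B", "B", "C", "D", "B", "C"]
print(round(weighted_kappa(rater_1, rater_2, BANDS), 3))
```

Because disagreements are weighted by their distance on the scale, two raters who differ by one band are penalized less than raters who differ by two or three bands, which is why the statistic suits ordinal rating scales.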

The measures of the criteria (Table 5) range from -.70 to .69, with “Structure and cohesion” and “Initiative” at the upper and at the lower end of the spectrum respectively. All criteria fall within the acceptable Infit MnSq range, but some criteria do account for more score variance than others. The principal component analysis (Table 6) – conducted separately for spoken and written tasks – showed that irrespective of task type “vocabulary” and “grammar” account for 20% to 30% of the score variance. The only criterion to surpass “vocabulary” in terms of explained variance is “argumentation” in argumentation tasks (33.06% and 27.40% in written and spoken tasks respectively). In spoken argumentation tasks, “argumentation” correlates (Spearman’s rho) highly and significantly with all criteria, but mostly so with “vocabulary” (r = .804**).

[Table 5] [Table 6]

Finally, the IRT scaling for the facet “task” (Table 7) indicates little spread between the B2 and C1 tasks. The tasks used in this study had been formally linked to the B2 and C1 level by an expert panel and had proven robust throughout two previous administrations. The most logical explanation for the outcome of the task facet lies with the use of the rating scale. During the focus group, discussed below, it became clear that the raters used the relative, CEFR-based descriptors in an absolute way, expecting perfection at C1 level. The frequency analysis above confirms this, as the C1 level was assigned overall in 18% of the cases, even though 30% of the samples were placed at that level by previous raters.

[Table 7]

2 Qualitative analysis

In the focus group that followed the rating process, the participants were asked to define the rating criteria from memory and to clarify the differences between the performance levels. The purpose of this recall task was to determine whether their memory of a criterion corresponded to the original wording and whether different participants held similar perceptions about the same criteria. This recall task allowed the researchers to determine what characteristics of a rating scale caused problems without probing for these problems directly. The issues emerging from the focus group can be bundled into three themes: level width, broad criteria and specificity.

The first major theme that emerged during the analysis relates to the width of the levels. Some respondents reported difficulties when scoring performances that were within the same band, but not at the exact same proficiency level.

R1 Sometimes I thought, gosh, this is a “B” and that is a “B” and yet they are so different, but not different enough to put them in different levels. So sometimes you find yourself assigning the same level to very different performances.

Secondly, the participants felt that some criteria were rather broad or multifaceted. Because of the broadness of criteria such as “vocabulary” and “grammar”, some participants struggled to assign a final score. In general, the participants preferred single-faceted criteria to broader ones, which require a rater to judge various elements simultaneously. Furthermore, because broad criteria incorporate different facets of language, they may cause different raters to focus their attention differently (Lumley, 2005). The focus group did indeed show that not all participants paid equal attention to the same aspects of the same criterion.

R2 For me, layout was important when deciding on “Mechanics”. If the punctuation wasn’t ok and there was no layout, I’d often assign “C”.

R4 I really didn’t take that into account.

R5 I didn’t count that either, but it did bother me sometimes.

Because of the broadness of certain criteria, all respondents felt that they had interpreted some criteria differently when assessing oral or written performances. The respondents assumed that they had been stricter in applying the rating scale for written tasks, even though the Facets analysis (Table 5) shows that, taking account of the standard error, the measures for “grammar” in spoken and written tasks fall within the same range. For “vocabulary”, however, the raters were significantly more severe in spoken tasks (p = .045), which contradicts their intuition.

R1 When you are rating oral performances, you tend to focus more on structure and fluency and less on vocabulary and grammar, simply because there is nothing tangible to hold on to […]

R5 I did notice that I was more severe for the written tasks. In the spoken part I wrote some errors down when I heard them, but I was kinder. I was more easily satisfied; because when you read written texts you can really focus on what you see. You can look at it again and again.

[others nod in agreement]

Specificity was a third recurring theme. All participants preferred specific over generic wording and all respondents felt that certain criteria would benefit from an exhaustive list of error categories. The description of “grammar” for example, did not contain references to every mistake that could be encountered during the rating process.

R5: The tricky thing about “grammar” was that a lot of attention is spent on sentence structure, but not a lot on grammatical mistakes.

Int: So what are grammatical mistakes?

R4: These things are mentioned in the general description of the criterion, but the individual level descriptors don’t always cover them.

Some participants attempted to overcome vagueness by creating their own objective criteria, such as the number rather than the type of errors, which is what the criterion actually demands. The respondents relied on concrete, specific markers when making their judgment, which was difficult for broad categories such as “grammar”. In those cases, raters adopted two strategies: they either reread the criteria every time, or they created a simplified version of it, as R4 did: “For me, D means that there’s an error in every sentence”. Evidently, this is quite a leap from the original descriptor at A2 level: “The performance relies mainly on basic syntactic patterns (such as the main clause word order) which may contain mistakes that obscure the meaning of the sentence”.

R4’s treatment of the D-level is representative of the way other respondents treated the scale. Instead of treating the levels as the criterion-based descriptors that they were, they translated them into a norm-based scale, where A corresponds to the best possible performance and D to the worst.

R3: “A” meant that the performance was completely clear and contained all the important information. “B” meant that the performance contained some insignificant information, but not too much. ”C” meant that the summary mentioned more unimportant than important elements. “D” was just off the grid.

Finally, the research focused on the effects of co-construction on stimulating a shared rating scale interpretation. The participants involved in this study used a scale that had been rewritten three times with other groups of novice raters with a comparable background. In spite of this, the respondents did not consider the rating scale self-explanatory or intuitively interpretable.

R2 If we had used the rating scales without further explanation the problems we encountered at the beginning of the training would have persisted. […]

INT Did the rating scale offer enough of a foothold?

R5 It did, but mainly because we were able to make adjustments during the training. That has been really important.

Even though not every descriptor was considered readily interpretable without rater training, the respondents found it difficult to make improvements or suggest clarifications. When given the opportunity to co-construct the scale, they chose to make some adaptations at word level, but proposed no major changes.

VII Discussion

1. Does an iterative co-construction process followed by a trial period allow for the reliable use of a CEFR-based rating scale by novice raters?

The scale used in this study is based on the CEFR, but has been modified according to rater comments and empirical data (Author, 2013). The IRT analyses show that all raters fit the model. The differing measures for the raters imply that not all raters interpreted the criteria with equal severity, but the high weighted kappa and ICC coefficients signify a high level of inter-rater consistency. This parallels Eckes’ (2008) observation that even though raters may differ in terms of leniency and severity, they do rate consistently.
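The distinction between severity and consistency can be made concrete with a small computation. The sketch below is illustrative only: the two rating vectors are invented, and quadratic disagreement weights are an assumption, since the weighting scheme behind the reported weighted kappa is not specified here. It shows that a rater who is systematically harsher than a colleague can still obtain a high weighted kappa with that colleague.

```python
LEVELS = ["A", "B", "C", "D"]  # scale levels, from best (A) to worst (D)


def weighted_kappa(ratings_1, ratings_2, levels=LEVELS):
    """Cohen's kappa with quadratic disagreement weights (assumed weighting)."""
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(ratings_1)

    # Observed joint proportions for every pair of levels.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_1, ratings_2):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions per rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row[i] * col[j]

    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical ratings of ten performances by a lenient and a severe rater.
lenient = ["A", "A", "B", "B", "B", "C", "C", "D", "B", "A"]
severe = ["A", "B", "B", "B", "C", "C", "D", "D", "B", "B"]

kappa = weighted_kappa(lenient, severe)
print(f"weighted kappa = {kappa:.3f}")
```

With these invented ratings the second rater assigns a lower level in four of the ten cases and never a higher one, yet the coefficient remains high because every disagreement is between adjacent levels.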

The IRT analysis shows the CEFR-based criteria to be rather robust. All criteria fit the IRT model, but their measures differ. In line with previous research (Cumming, Kantor & Powers, 2002), “mechanics” is an easy criterion (Measure = -.69) and it explains the least score variance in tasks where it was operationalized. “Argumentation”, on the other hand, explains most variance in the argumentative speaking and writing tasks. The influence of “argumentation” on the total score variance and its high correlations with the other criteria seem to confirm its influence on the overall judgment of novice raters (Cumming, Kantor & Powers, 2002; Barkaoui, 2010). In non-argumentative tasks, “vocabulary” and “grammar” explain most score variance. These three criteria all have high average ICC coefficients and could be considered quite robust in terms of reliability.
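As a worked illustration of how the share of explained variance relates to the eigenvalues, the sketch below divides each eigenvalue by the sum of all eigenvalues. The eigenvalues are those reported for the written argumentation task in Table 6; the function name is introduced here for illustration only, and small deviations from the published percentages are to be expected because the eigenvalues are rounded.

```python
def variance_share(eigenvalues):
    """Return the percentage of total variance attributed to each component."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


# Eigenvalues for the written argumentation task, copied from Table 6.
written_argumentation = [1.65, 1.21, 0.91, 0.704, 0.52]

shares = variance_share(written_argumentation)
for value, share in zip(written_argumentation, shares):
    print(f"eigenvalue {value:.3f} -> {share:.1f}% of variance")
```

The five eigenvalues sum to roughly 5, the number of criteria scored in this task, so the first eigenvalue of 1.65 corresponds to about a third of the total variance.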

2. If reliable, to what extent do the ratings harbour a uniform interpretation of the descriptors?

This study shows that it is possible to develop a CEFR-based rating scale that allows for reliable rating in terms of score consistency (Isaacs & Thomson, 2013). The data also indicate, however, that after the trial period, the respondents did not use or interpret all criteria in the same way, potentially compromising rating scale use and thus scoring validity (Harsch & Martin, 2013).

First of all, the participants spontaneously referred to the varying widths of the performance levels as a confusing factor. Raters did not always feel comfortable assigning the same score to performances of differing quality but belonging to the same band. Furthermore, the participants considered certain criteria too multifaceted. Some raters even developed simplified or crude versions of these criteria, which recalls previous research warning about raters interpreting the same criteria differently (Weigle, 2002; Lumley, 2005). The participants included in this study needed concrete and exhaustive criteria and disliked working with scales that leave room for subjectivity. They had few problems recalling more concrete descriptors, but did face difficulties when reconstructing descriptors that invited subjectivity. Perceived vagueness was the main reason why participants considered certain criteria unreliable. Interestingly, the criteria that were perceived as unreliable turned out to be quite robust in the quantitative analyses. The opposite is true for the criteria that were considered reliable. In the focus groups the respondents stated that multifaceted criteria such as “grammar” and “vocabulary” appeared less tangible than more homogeneous criteria, such as “initiative”. Possibly, the perceived unreliability of broad criteria could be remedied by asking raters to score different aspects of multifaceted criteria separately.

3. To what extent does co-constructing a rating scale help to stimulate a shared interpretation among novice raters over time?

Given the varying uses and interpretations of the criteria, one may wonder whether rating scale co-construction has an effect on the way the descriptors were interpreted. This study found that co-construction does not in itself lead to a scale that is readily understandable by raters with similar backgrounds, nor does it eliminate problems associated with vagueness or generalisation. What is clear to one group of raters might be vague to another, even if they share the same background. This study offers little data to support the hypothesis that collaborative rating scale development creates a shared understanding of a rating scale over time. Co-construction is a rather time-consuming endeavour, and its main advantage lies not in future interpretations of a scale, but in offering a method that stimulates a specific and focused rater training in which every aspect of a scale is discussed and clarified (Carlsen, 2003; Eckes, 2008). Given the rich discussion it generates, co-construction can act as a catalyst for focused participation during rater training, but it does not in itself eliminate a disharmonious interpretation of the criteria.

VIII Conclusion and limitations

Naturally, this study has its limitations, some stemming from the idiosyncrasies of the CNaVT. First of all, the data were collected in the context of a comparatively small test with a limited candidature and limited means. The raters received feedback on their performance during their training, but not during the data collection, which was not feasible given the design of the study and given the paper-based way of rating at the CNaVT. Furthermore, this study focuses exclusively on novice raters. Since experienced raters have a richer framework to rely on, the results of a similar undertaking with experienced raters could be quite different. Lastly, this study involved six respondents and 200 performances. Every measure was taken to control the variables, but including fewer raters may increase the impact of one deviant rater. In spite of these limitations, the study does offer some insights into rater variability and scoring validity. First, this study indicates that it is possible to achieve a high level of agreement among novice raters employing a co-constructed CEFR-based scale. But reliability in the sense of overall rater agreement is only one aspect of scoring validity (Weir, 2005a). In this respect, Harsch and Martin (2012) argue in favour of analysing subordinate scores and of determining why a certain score was assigned. In their study, 14 experts co-constructed a CEFR-based rating scale with the authors. In their conclusions, the authors pointed out that external users were able to interpret the descriptors as intended by the developers. The current study, employing 14 novice raters in the iterative co-construction process prior to the trial reported here, can come to no such conclusion. The descriptors that had been iteratively co-constructed by their peers were not readily interpretable to the six participants of this study.


Secondly, this study has little evidence to support the suggestion that developing rating scales with raters is the optimal way of achieving a shared understanding of the criteria (Spolsky, LTest-L communication, May 2013). The participants in this study used a scale that had been co-constructed by their peers in three previous pilots. During rater training, the respondents had the chance to suggest alterations to the rating scale, but they proposed few, except at word level. In spite of this, they did not consider the criteria self-evident, nor the wording crystal clear. In fact, the participants unanimously agreed that without training, they would not have been able to use the scale adequately. Even though the co-construction process did appear to lead to greater involvement during rater standardisation, it seems unlikely that co-construction with novice raters would yield a more uniform interpretation or greater inter-rater reliability than other focused and specific rater training sessions.

The word “specific” is of importance here, since all participants – mirroring Knoch’s (2009a) advice to supplement intuitive scales with empirical data – expressed a great need for descriptors that are as exhaustive and concrete as possible. In his 2008 article, Eckes points out that little is known about the influence of raters’ perceptions of rating criteria on their operational behaviour. This study shows that even though novice raters may consider certain criteria to be unreliable, this does not translate into unreliable rating for those criteria. In fact, the criteria that were considered the most concrete were the least robust statistically. As such, these results fit into a long line of research showing the unreliability of human judgement when estimating difficulty and reliability (Kahneman, 2011).

Finally, this study has researched the use of the CEFR as the foundation for a rating scale to be used by novice raters. With its apparent universality and clear-cut structure, the CEFR has a number of characteristics that make it an attractive source for rating scale development (Fulcher, 2004). Test developers using the CEFR will adapt it to fit their context and their raters, thereby using it as a starting point and as an adaptable heuristic (Weir, 2005b; North, 2014). This study indicates, however, that even though it is possible to create a CEFR-based rating scale that is statistically robust in terms of reliability, achieving a uniform interpretation of CEFR-based descriptors remains a challenge.

References

Alderson, C. (2007). The CEFR and the need for more research. The Modern Language Journal, 91(4) 659–663.

Barkaoui, K. (2010). Explaining ESL essay holistic scores: A multilevel modeling approach. Language Testing, 27(4) 515–535.

Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and rater performance. Assessment in Education: Principles, Policy & Practice, 18(3) 279–93.

Belzile, J., & Öberg, G. (2012). Where to begin? Grappling with how to use participant interaction in focus group design. Qualitative Research, 12(4) 459–472.

Carlsen, C. (2003). Guarding the guardians. Rating scale and rater training effects on reliability and validity of scores of an oral test of Norwegian as a second language. Bergen: Nordisk institutt Universitetet i Bergen.

Cattell, J. (1905). Examinations, grades and credits. Popular Science Monthly, 66 367–378.

Child, D. (2006). The essentials of factor analysis. London: Bloomsbury Academic.

Cooper, C., & Odell, L. (1977). Evaluating writing: describing, measuring, judging. Urbana, Illinois: National Council of Teachers of English.

Council of Europe. (2001). Common European framework of reference for languages: learning, teaching, assessment. Strasbourg: Council of Europe.

Cumming, A., Kantor, R. & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86 67–96.


Derwing, T., Rossiter, M., Munro, M., & Thomson, R. (2004). Second language fluency: judgments on different tasks. Language Learning, 54(4) 655–679.

Dey, I. (1993). Qualitative Data Analysis. London: Routledge.

Diederich, P., French, J., & Carlton, S. (1961). Factors in judgments of writing ability. In ETS Research Bulletin Series. Princeton, NJ: Educational Testing Service.

Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25(2) 155–185.

Edgeworth, F. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51(3) 599–635.

Elder, C., Knoch, U., Barkhuizen, G. & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3) 175–196.

Fulcher, G. (2004). Deluded by Artifices? The Common European Framework and harmonization. Language Assessment Quarterly, 1(4) 253–266.

Fulcher, G. (2012). Scoring performance tests. In G. Fulcher & F. Davidson, editors, The Routledge Handbook of Language Testing (pp. 378–392). London and New York: Routledge.

Fulcher, G., Davidson, F., & Kemp, J. (2011). Effective rating scale development for speaking tests: Performance decision trees. Language Testing, 28(1) 5–29.

Galaczi, E., French, A., Hubbard, C., & Green, A. (2011). Developing assessment scales for large-scale speaking tests: a multiple-method approach. Assessment in Education: Principles, Policy & Practice, 18(3) 217–237.

Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17(4) 228–250.

Harsch, C., & Martin, G. (2013). Comparing holistic and analytic scoring methods: issues of validity and reliability. Assessment in Education: Principles, Policy & Practice, 20(3) 281–307.

Henmon, V. (1929). Achievement Tests in the Modern Foreign Languages. New York: Macmillan.

Humphreys, P., Haugh, M., Fenton-Smith, B., Lobo, A., Rowan, M., & Walkinshaw, I. (2012). Tracking international students’ English proficiency over the first semester of undergraduate study. IELTS Research Report Series, 1 1–41.

Isaacs, T., & Thomson, R. (2013). Rater experience, rating scale length, and judgments of L2 pronunciation: revisiting research conventions. Language Assessment Quarterly, 10(2) 135–159.

Jones, N. (2011, June). Defining an inclusive framework for languages. Paper presented at the ALTE 4th International Conference, Kraków.

Kahneman, D. (2011). Thinking, fast and slow. New York: Farrar, Straus and Giroux.

Knoch, U. (2009a). Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, 26(2) 275–304.

Knoch, U. (2009b). Diagnostic writing assessment: the development and validation of a rating scale. Frankfurt am Main: Peter Lang.

Linacre, J. (2012). A user’s guide to FACETS Rasch-model computer programs. Retrieved November 20, 2013, from www.winsteps.com/a/facets-manual.pdf.

Lumley, T. (2005). Assessing second language writing: The rater’s perspective. Frankfurt: Peter Lang.

McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4) 555–577.

Messick, S. (1989). Validity. In R. Linn, editor, Educational measurement (pp. 13–103). New York: Macmillan.

Miles, M., & Huberman, A. (1994). Qualitative Data Analysis. Beverly Hills: Sage.

Mortelmans, D. (2011). Kwalitatieve analyse met Nvivo. Leuven: Acco.

North, B. (2014). The CEFR in Practice. English Profile Studies, 4. Cambridge: Cambridge University Press.


Sawaki, Y., Quinlan, T., & Lee, Y. (2013). Understanding learner strengths and weaknesses: assessing performance on an integrated writing task. Language Assessment Quarterly, 10(1) 73–95.

Sim, J., & Wright, C. C. (2005). The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical Therapy, 85(3) 257–268.

Spolsky, B. (1995). Measured Words: The Development of Objective Language Testing. Oxford: Oxford University Press.

Thorndike, E. (1910). Educational measurements of fifty years ago. Journal of Educational Psychology, 4 551–552.

Vieira, S., Kaymak, U., & Sousa, J. (2010). Cohen’s kappa coefficient as a performance measure for feature selection. In 2010 IEEE International Conference on Fuzzy Systems (FUZZ) (pp. 1–8). Red Hook: Curran Associates.

Weigle, S. (2002). Assessing Writing. Cambridge: Cambridge University Press.

Weir, C. (1990). Communicative language testing. New York: Prentice Hall.

Weir, C. (2005a). Language Testing and Validation. New York: Palgrave Macmillan.

Weir, C. (2005b). Limitations of the Common European Framework for developing comparable examinations and tests. Language Testing, 22(3) 281–300.


Table 1. Overview of LAP rating criteria

Criterion               Task type        Content of criterion

Argumentation           WA, SA           Clarity and logic, use of source material
Vocabulary              WS, WA, SP, SA   Width of lexicon, word choice, idioms
Grammar                 WS, WA, SP, SA   Congruence, morphology, conjugation, word order
Summarizing             WS               Conciseness, identifying salient content
Structure and cohesion  WS, WA, SP, SA   Sequencing, conjunctions, paragraphing
Mechanics               WS, WA           Spelling, punctuation and layout
Sociolinguistics        SP, SA           Register, salutations
Initiative              SP, SA           Active attitude
Pronunciation           SP, SA           Accent, rhythm, stress and intonation

WA: written argumentation, SA: spoken argumentation, WS: written summary, SP: spoken presentation

Table 2. Frequency analysis

              Arg   Voc   Gram  Sum   Str   Mech  Soling  Ini   Pron  TOT
Level A (C1)  27.2  16.1  10.4  15.3  14.7  27.7  17.0    27.4  20.2  18%
Level B (B2)  42.9  54.4  44.4  46.7  35.0  37.7  69.5    62.9  49.6  47%
Level C (B1)  24.4  24.1  35.0  24.7  41.5  26.6  12.8     9.0  25.2  28%
Level D (A2)   5.5   5.4  10.2  13.3   8.8   8.0    .7      .7   5.0   7%


Table 3. Facets output: Raters

Facets output          Measure  S.E.  Infit MnSq

Rater (combined data)
R1                       .91    .08   1.12
R4                       .65    .08   1.09
R5                      -.11    .08   1.07
R6                      -.11    .08    .93
R2                      -.65    .08    .90
R3                      -.70    .08    .93

Table 4. Rater agreement (Weighted Kappa & ICC)

Weighted Kappa

          K      SE     95% Confidence interval
R2 & R4   0.802  0.018  0.767 - 0.837
R3 & R5   0.797  0.020  0.757 - 0.837

Intraclass Correlation (ICC)

V G Su St M A So I P T

R1, R2, R4 .720 .560 .805 .787 .580 .864

R3, R5, R6 .733 .805 .781 .683 .810 .906

R1, R3, R5 .600 .737 .749 .573 .674 .639 .832

R2, R4, R6 .716 .651 .764 .825 .401 .849

V: vocabulary, G: grammar, Su: summary, St: structure & cohesion, M: mechanics, A: argumentation, So: sociolinguistics, I: initiative, P: pronunciation, T: total ICC

Table 5. Facets output: Criteria

Facets output          Measure  S.E.  Infit MnSq

Criteria (written)
Structure & cohesion     .62    .11   1.11
Grammar                  .66    .11    .87
Vocabulary              -.20    .10    .73
Summarizing             -.26    .14    .88
Argumentation           -.14    .14   1.12
Mechanics               -.69    .10   1.31

Criteria (spoken)
Structure & cohesion     .69    .12   1.15
Grammar                  .59    .12    .94
Pronunciation            .31    .12   1.13
Vocabulary               .04    .12    .97
Sociolinguistics        -.25    .16    .84
Argumentation           -.69    .17   1.13
Initiative              -.70    .12    .79


Table 6: Linear principal component analysis for the four task types

Initial Eigenvalues

                      Written Argumentation    Written Summary
                      Tot    % of variance     Tot    % of variance
Argumentation         1.65   33.06
Vocabulary            1.21   24.30             1.71   34.11
Grammar                .91   18.12             1.19   23.86
Summarizing                                     .82   16.50
Structure & cohesion   .704  14.08              .80   16.08
Mechanics              .52   10.44              .47    9.44

                      Spoken Argumentation     Spoken Presentation
                      Tot    % of variance     Tot    % of variance
Argumentation         1.64   27.40
Vocabulary            1.24   20.60             1.49   24.79
Grammar               1.01   16.76             1.12   18.77
Structure & cohesion   .88   14.68             1.02   17.04
Sociolinguistics                                .93   15.47
Initiative             .66   11.02              .87   14.55
Pronunciation          .57    9.53              .56    9.37

Table 7. Facets output: Tasks

Facets output             Measure  S.E.  Infit MnSq

Tasks (spoken)
C1 Spoken Argumentation     .27    .10    .92
C1 Spoken Presentation     -.16    .10   1.01
B2 Spoken Presentation     -.12    .09    .97
B2 Spoken Argumentation    -.60    .10   1.08

Tasks (written)
C1 Written Summary         -.77    .09    .96
B2 Written Argumentation   -.78    .10    .94
C1 Written Argumentation   -.93    .09   1.14
B2 Written Summary        -1.23    .10    .99


Appendix 1. NVivo coding tree

A priori codes, each followed by its inductive codes:

Criteria
  - Argumentation
  - Content
  - Grammar
  - Initiative
  - Mechanics
  - Pronunciation
  - Register
  - Structure
  - Summarizing
  - Vocabulary
  - Qualities of a bad criterion
  - Broadness
  - Incompleteness
  - Overlap
  - Vagueness
  - Subjectivity
  - Qualities of a good criterion
  - Specificity
  - Discriminatory potential

Level descriptors
  - A – B – C – D
  - Positive / negative wording
  - Problems with assigning level
  - Doubt
  - Level width

Rater severity
  - Contributing factors
  - Severity & task type

Rater training
  - Effect of
  - and intuition

Rating scale use
  - Oral vs written tasks
  - Scale vs intuition
  - Actual use
  - Focus

Appendix 2. Dimensionality analysis