
2.7 The scoring element of validity

2.7.3 Rating scales

2.7.3.3 EBB scales: a solution?

An approach to scale design that has not featured in the pre-task planning literature is Turner and Upshur’s (1996) EBB method. This method is an example of a data-based approach to rating scale construction that attempts to resolve some of the issues relating to traditional rating scales identified in the previous section. ‘The scale is empirically derived, requires binary choices by raters, and defines the boundaries between score levels (EBB)’ (Turner and Upshur, 1996, pp. 60-61). In the EBB method, the features of language performance on a specific task that are most salient to the raters are identified and used as the basis for the scale content. EBB scales are assessor-oriented: the rater’s rationale for scoring test samples is at the center of the rating process. Rating criteria are presented as a series of binary distinctions that represent boundaries between the levels of ability in the test-taking population. Turner and Upshur (1996) describe the procedure for creating EBB scales as follows:

• A series of task samples representing the range of ability is selected and presented to a group of raters who are familiar with the student profile and task.

• The group then rank-orders the samples and decides how many levels of proficiency are present in the samples.

• The samples are divided into two groups: high-level proficiency and low-level proficiency. A feature that is common to the performances in one half of the sample is identified, e.g., ‘Variety of structures (2+ sentences patterns) with expansions’ (Turner and Upshur, 1996, p. 67). This is then formulated as a binary yes/no question.

• This process is repeated until each level has been distinguished with similar binary questions.

• A descriptive summary of language ability is composed for each level on the scale for stakeholder feedback purposes.
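The binary structure that results from this procedure can be pictured as a small decision tree. The following is a minimal sketch only, not Turner and Upshur’s instrument: the four-level layout, the `Question` class, the `score` function and the judgement keys are assumptions made for illustration, and only the first question is quoted from the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# Hypothetical representation: a rater's yes/no judgements about one performance.
Judgements = Dict[str, bool]


@dataclass
class Question:
    """A binary boundary question; each answer leads to another question or a level."""
    text: str
    ask: Callable[[Judgements], bool]
    if_yes: Union["Question", int]
    if_no: Union["Question", int]


def score(node: Union[Question, int], judgements: Judgements) -> int:
    """Follow yes/no answers down the tree until a score level (an int) is reached."""
    while isinstance(node, Question):
        node = node.if_yes if node.ask(judgements) else node.if_no
    return node


# Illustrative four-level scale. Only the first question is quoted from
# Turner and Upshur (1996, p. 67); the other two are invented for this sketch.
upper = Question("Is the message easy to follow throughout?",
                 lambda j: j["easy_to_follow"], if_yes=4, if_no=3)
lower = Question("Can the listener identify the main topic?",
                 lambda j: j["topic_clear"], if_yes=2, if_no=1)
scale = Question("Variety of structures (2+ sentences patterns) with expansions?",
                 lambda j: j["variety_of_structures"], if_yes=upper, if_no=lower)

sample = {"variety_of_structures": True, "easy_to_follow": False, "topic_clear": True}
print(score(scale, sample))  # prints 3
```

In this sketch the rater never sees all criteria at once: each decision involves a single yes/no question, and the path through the questions determines the level.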

Research has shown that the use of EBB scales leads to high levels of test reliability in terms of inter-rater agreement and high discrimination between test takers’ levels of speaking proficiency (Hirai and Koizumi, 2013, Turner and Upshur, 1996, Upshur and Turner, 1995). In an EBB scale validation study conducted in Japan, raters were asked to grade a series of spoken samples using both an EBB scale and an analytic scale containing the same descriptors for five levels of proficiency.

The EBB format was shown to foster higher levels of rater agreement and rater consistency than the analytic format (Hirai and Koizumi, 2013). Discussing rater evaluations and comparisons of the scales, the researchers write that the analytic scale exposed raters to all of the scale criteria at once and may thus have ‘created too much of a cognitive demand on the raters, which may have led to fluctuating ratings across the five levels’ (2013, p. 409).
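To make the notion of rater agreement concrete, the sketch below computes two common indices for a pair of raters: the proportion of exact agreement and Cohen’s kappa, which adjusts that proportion for chance. This is an illustration under stated assumptions: the ratings are invented, and the studies cited above may have used other statistics to establish agreement and consistency.

```python
from collections import Counter


def exact_agreement(rater_a, rater_b):
    """Proportion of samples to which both raters gave the same level."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    levels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in levels) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented ratings on a five-level scale for eight spoken samples.
rater_a = [1, 2, 3, 4, 5, 3, 2, 4]
rater_b = [1, 2, 3, 4, 4, 3, 2, 5]

print(exact_agreement(rater_a, rater_b))  # 0.75
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.68
```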

In sum, EBB rating scales have the advantage of being referenced to a specific population and task. EBB scales are assessor-oriented and reflect the raters’ criteria for making proficiency-related decisions. In contrast, the most frequently used scale in the pre-task planning literature, the Iwashita et al. (2001) scale, is general-purpose and seeks to describe a broad range of language proficiency. Iwashita et al.’s approach to scale design is ambitious in the range of proficiency it seeks to describe, but it compromises the precision of measurement and, in turn, the validity of the test scores. In order to measure the impact of pre-task planning on test scores, the measurement tool must be precise. The contrast between the EBB scale and the analytic scale of Iwashita et al. (2001) is therefore an important one for this study. The characteristics of each scale are summarised in Table 4.

Table 4 Features of EBB scales and the analytic scale

| EBB scale (Turner and Upshur, 1996) | Analytic scale (Iwashita et al., 2001) |
| --- | --- |
| Defines boundaries between performance levels as a binary distinction. | Grades performance levels on a five-point scale. |
| Designed to reflect language use within a specific context by a specific test taking population. | Intended as a general-purpose scale for all contexts and users. |
| Empirical: raters provide rating criteria. | Intuitive: rating criteria are informed by theory. |

2.7.4 Summary

This section has described approaches to measurement in the pre-task planning literature. To sum up, the results of research that involves rating scales have been inconsistent with regard to the impact of pre-task planning. This may be due to the rating scale content, which may not provide sufficient description of language performance within the test-taking population or adequately reflect the criteria that the raters regard as salient to their decision-making process. In contrast, CAF measures have recorded consistent impacts of planning on test performance. These impacts are most evident in the complexity and fluency of the speech, although increases in accuracy have also been reported. However, there is a series of limitations in the use of CAF. Firstly, the relationship between CAF measures and test performance has been questioned on the basis that CAF does not adequately represent language use in context. Secondly, increases in CAF have not been shown to correspond to increases in test scores when trained raters make judgments about language proficiency. Thirdly, the absence of alpha correction in the analysis of multiple CAF results is a shortcoming that detracts from the researchers’ conclusions.
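The point about alpha correction can be illustrated with a short sketch. When several CAF measures are tested on the same data, the chance of at least one false positive grows with the number of tests unless the significance threshold is adjusted. The example below assumes a Bonferroni adjustment, which is only one of several possible corrections and is not named in the literature discussed here; the p-values and the function name `bonferroni` are invented for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold (alpha divided by the number of tests)
    and, for each measure, whether its p-value falls at or below it."""
    threshold = alpha / len(p_values)
    decisions = {name: p <= threshold for name, p in p_values.items()}
    return threshold, decisions


# Invented p-values for three CAF measures from a single hypothetical study.
p_values = {"complexity": 0.030, "accuracy": 0.200, "fluency": 0.010}

threshold, decisions = bonferroni(p_values)
print(round(threshold, 4))  # 0.0167
print(decisions)  # {'complexity': False, 'accuracy': False, 'fluency': True}
```

In this invented case the complexity result would count as significant at an uncorrected alpha of 0.05 but not after correction, which is the kind of discrepancy that uncorrected analyses leave unexamined.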

3 Research questions

This chapter begins by summarising the key issues relating to pre-task planning that were discussed in the literature review and identifies gaps in the literature relating to the measurement of test performance, task type, test taker proficiency, and different amounts of planning time. Following this, the research questions are stated.

The literature review indicates that conflicting accounts of pre-task planning may be attributable to the measurement of speech that is adopted in the research. There are broadly two approaches to the measurement of speech in the pre-task planning literature. The first approach involves measures of complexity, accuracy and fluency (CAF; see Section 2.7.2). Planning has consistently been shown to affect these measures, although the absence of alpha correction is an important limitation in this research (see Section 2.7.2.4). The second approach to measurement involves rating scales (see Section 2.7.3). This second approach has not provided consistent evidence of a pre-task planning effect. However, the rating scales that have been investigated so far have not been created to describe a specific population of test takers to a specific group of raters. Research findings indicate that when rating scale content does not represent the contextualised variations in spoken proficiency that raters regard as salient, the potential for test scores to uncover a planning impact is limited (see Section 2.7.3.2). Turner and Upshur’s (1996) EBB method of rating scale development is an alternative approach that may successfully discriminate between performances after different levels of planning (see Section 2.7.3.3). To investigate speech planning, this study utilises three analytical approaches to language measurement: measures of CAF, an EBB rating scale and an analytic rating scale (Iwashita et al., 2001). The analytic scale was selected to enhance the comparability between the current study and research that has used the same scale to investigate pre-task planning in language tests (Iwashita et al., 2001, Nitta and Nakatsuhara, 2014).

Research in task-based language teaching (TBLT) has shown that the impact of planning on task completion varies substantially between different task types. In short, the more challenging the language learner finds a language task, the larger the planning impact (see Section 2.5). Positive findings have generally been recorded for picture-based narrative tasks. Picture-based narrative tasks may be regarded as more challenging than non-picture-based tasks if they involve obligatory content that test takers do not have adequate language resources to describe. However, the ability to generate and communicate content independently is an important skill for assessment. Therefore, this study investigates the effect of planning on two task types: picture-based narratives and non-picture-based description tasks.

This study is designed to assess the impact of pre-task planning with learners who are limited in second language proficiency (see Section 1.1). The research literature presents mixed results for the relationship between planning and proficiency (see Section 2.6.2). One confounding factor is that consistent methods for reporting language proficiency, such as the reference level descriptors in the Common European Framework (Council of Europe, 2001), were not used in the studies. It is difficult to understand what terms like ‘low-intermediate’ (Genc, 2012, p. 72) and ‘limited proficiency’ (Sasayama and Izumi, 2012, p. 29) actually refer to without recourse to a common scale. The present study systematically investigates proficiency as a potential variable in the result of planning for a language test by reporting participants’ L2 proficiency in terms of the Common European Framework reference level descriptors (Council of Europe, 2001) and comparing planning results between different levels.

Research in TBLT most frequently investigates the impact of providing language learners with ten minutes to plan their speech (see Section 2.5.2). This amount of planning time has generally resulted in positive impacts on CAF. However, the language testing literature generally investigates the impact of providing much shorter amounts of planning time (most typically one minute). These shorter amounts have not had the effect that has typically been observed after ten minutes of planning. There is a clear gap in the literature in relation to planning time. At present it is unclear how increasing planning time in exam conditions influences test scores. This study therefore investigates the amount of planning time that most substantially impacts CAF and test scores.

The research questions to be answered in this study are:

1. Does variation in planning time, operationalized as 30 seconds, one minute, five minutes and ten minutes, impact the results of a language test when assessed with

a) an EBB scale

b) an analytic scale

c) measures of complexity, accuracy, and fluency (CAF)?

Evidence of pre-task planning effects has primarily been reported in measures of CAF (e.g. Foster and Skehan, 1996). Rating scales have proved less effective in demonstrating an effect of planning on test scores (e.g. Iwashita et al., 2001). Wigglesworth (1997) found increases in CAF after planning but no corresponding effect on test scores. The comparison of CAF scores and rating scale scores after variation in planning time is an important focus of this study.

If the answer to research question 1 is affirmative,

1.1 Which amount of planning time (30 seconds, one minute, five minutes, ten minutes) most substantially impacts test scores and CAF results?

Studies in TBLT consistently report increases in CAF after a planning period of ten minutes (e.g. Foster and Skehan, 1996) and five minutes (e.g. Sasayama and Izumi, 2012). In contrast, studies with a language testing focus indicate that planning for one minute or 30 seconds (e.g. Iwashita et al., 2001) has little impact on CAF and test scores.

1.2 Does the impact of the four planning conditions on test scores vary between the analytic scale and the EBB scale?

Research findings consistently demonstrate that variation in pre-task planning time makes little difference to scores on the analytic scale (Elder et al., 2002, Elder and Iwashita, 2005, Iwashita et al., 2001, Nitta and Nakatsuhara, 2014), whereas research has yet to investigate the impact of planning on EBB scale scores.

1.3 Does the impact of the four planning conditions on test scores and CAF results vary between groups of test takers who have different levels of language proficiency?

Proficiency may be a key variable in the effect of variation in pre-task planning time (Mochizuki and Ortega, 2008, Kawauchi, 2005). However, this is difficult to establish given the absence of systematic methods in the literature to measure participant proficiency in the L2 (see Section 2.1).

2. Does the impact of the four planning conditions on test scores and CAF results vary between picture-based narrative tasks and non-picture-based description tasks?

Skehan (2009) proposes that the extent to which a task obliges test takers to use specific language forms is a key indication of task difficulty. Picture-based narratives have a constraining effect, which may pose problems when test takers lack the requisite language to complete the task. For this reason, the impact of planning may vary between the two task types.

If the answer to research question 2 is affirmative,

2.1 Which task type and planning condition has the largest impact on test scores and CAF results?

4 Pilot studies

4.1 Introduction

This chapter reports the data collection, analytical procedures, and results of two pilot studies. The chapter includes information about the development of two EBB rating scales (‘The scale is empirically derived, requires binary choices by raters, and defines the boundaries between score levels (EBB)’, Turner and Upshur, 1996, pp. 60-61), rater training on the EBB scale and the analytic scale (Iwashita et al., 2001), and score analysis. It provides information about the choice of complexity, accuracy and fluency (CAF) measures and the statistical procedures adopted in the analysis. The results are discussed and the implications of the findings for the main study are set out.

In document The impact of pre-task planning on speaking test performance for English-medium university study (Page 97-107)