Chapter 7: Discussion – Analysis of Writing Scripts 7.1 Introduction

7.2.1 Trait scale: accuracy

The rating scale for accuracy was designed so that raters did not have to count each error-free t-unit. Instead, they were required to estimate the proportion of error-free t-units while reading a script. It was further decided that raters did not need to be trained to identify t-units in these data, because a brief analysis of t-unit borders showed that they coincided with sentence breaks in over 90% of cases.
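The proportion that raters are asked to estimate can be illustrated with a minimal sketch. It assumes, as the paragraph above suggests, that sentence breaks are an acceptable stand-in for t-unit borders. The function names, the sample script and the per-unit error counts are hypothetical; in practice the error judgements come from a human reader, not from code.

```python
import re

def split_sentences(text):
    """Split on sentence-final punctuation; a rough stand-in for t-unit borders."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def error_free_proportion(error_counts):
    """Proportion of units that carry zero annotated errors."""
    if not error_counts:
        raise ValueError("no units to rate")
    return sum(1 for n in error_counts if n == 0) / len(error_counts)

# Hypothetical script with four sentence-delimited units.
script = ("The results was clear. We collected data in May. "
          "It show a trend. The trend is small.")
units = split_sentences(script)

# Hypothetical error counts assigned by a reader, one per unit, in order.
errors_per_unit = [1, 0, 1, 0]
assert len(units) == len(errors_per_unit)

proportion = error_free_proportion(errors_per_unit)
print(f"{proportion:.0%} of units are error-free")
```

A rater working with the scale would arrive at this kind of figure by impression rather than by counting, which is the point of the design.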

Although the analysis of the scripts only showed five distinct levels of accuracy (because no scripts at level 9 were included in the analysis), a sixth level was added to the trait scale of accuracy to acknowledge completely error-free scripts.

The rating scale for accuracy is shown in Table 61.

Table 61: Rating scale - Accuracy

7.3 Fluency

7.3.1 Temporal fluency

Two types of fluency were investigated in both the pilot study and the main analysis of the writing scripts: temporal and repair fluency. The measure chosen for temporal fluency was the number of words produced within the time limit of 30 minutes. Although some doubt existed about this measure because of the varying findings of other research studies and the fact that not all students used the 30 minutes they were entitled to, it produced some promising results in the pilot study. Therefore, the number of words was analysed in the main study.

Although the histogram showed large variation among the scripts in terms of the number of words produced, this measure was not successful in distinguishing between the different proficiency levels.

Wolfe-Quintero et al. (1998), in their review of the literature, also found varying results for this measure. Although ten studies found significant differences among proficiency levels, seven did not. As in this study, Larsen-Freeman (1978; 1983) and Henry (1996) found a ceiling effect around the higher levels or even a decrease at the advanced level. The findings of this study are also in line with Cumming et al.’s (2005) investigation of TOEFL essays, in which the authors also failed to find a significant difference between the two higher levels (levels 4 and 5). A similar study looking at IELTS essays (Kennedy & Thorp, 2002) did not differentiate between immediately adjacent levels, but looked only at differences between essays at levels 4, 6 and 8. Although the authors fail to report means for each level, the minimum and maximum number of words at each level indicate a large amount of overlap, even though the levels were not adjacent. Essays at level 4 ranged from 111 to 370 words, essays at level 6 from 184 to 485 words and essays at level 8 from 239 to 457 words. These ranges suggest that there was probably no statistical difference between the essays at higher levels. It could therefore be argued that the number of words is more successful in distinguishing between lower level writers, but is not a measure that can be expected to differentiate successfully between students who have already been admitted to university.
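The overlap between the IELTS word-count ranges quoted above can be made explicit with a short sketch. Only the three minimum and maximum pairs come from the text; the function names are illustrative. Range overlap is a descriptive check only and does not replace a significance test, which would require the means and variances that the original study did not report.

```python
# Minimum and maximum word counts per IELTS level, as quoted above
# from Kennedy & Thorp (2002).
ranges = {4: (111, 370), 6: (184, 485), 8: (239, 457)}

def overlap(a, b):
    """Return the shared interval of two (min, max) ranges, or None."""
    lo = max(a[0], b[0])
    hi = min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def overlap_share(a, b):
    """Fraction of range b's width that also lies inside range a."""
    shared = overlap(a, b)
    if shared is None:
        return 0.0
    return (shared[1] - shared[0]) / (b[1] - b[0])

overlap_4_6 = overlap(ranges[4], ranges[6])
overlap_4_8 = overlap(ranges[4], ranges[8])
overlap_6_8 = overlap(ranges[6], ranges[8])
share_8_in_6 = overlap_share(ranges[6], ranges[8])

print(overlap_4_6, overlap_4_8, overlap_6_8, share_8_in_6)
```

The level 8 range (239 to 457 words) falls entirely inside the level 6 range (184 to 485 words), which is consistent with the argument that word count does not separate the higher levels.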

7.3.1.1 Trait scale: temporal fluency

It was decided not to include temporal fluency in the rating scale because there was little evidence from the analysis of the scripts that there are differences between the levels of writing in terms of the number of words that writers produce.

7.3.2 Repair fluency

The second measure of fluency was repair fluency, operationalised as the number of self-corrections. This measure has not been applied to writing before but was ‘borrowed’ from research on speaking. The variable distinguished successfully between the different proficiency levels, but the differences between levels were much less pronounced than for accuracy. The measure was included in the rating scale, but there is some doubt regarding its usefulness in the context of writing.

This will be discussed in more detail in the context of Research Question 2, in light of the feedback from the raters.

7.3.2.1 Trait scale: repair fluency

On the basis of these findings, the rating scale for fluency was based only on the variable ‘the number of self-corrections’. The scale largely followed the findings from the analysis, but the levels were slightly adjusted to allow for better distinctions between bands. For example, band level 8 was designed to include no more than five self-corrections, although the analysis of band 8 resulted in a mean of nearly 7; the other bands were adjusted in a similar way. As with accuracy, a sixth level (level 9) was added to the scale to acknowledge scripts with no self-corrections. The rating scale for repair fluency is shown in Table 62 below.

Table 62: Rating scale - Repair fluency

Band levels: 9, 8, 7, 6, 5, 4

Overall, it can be said that the area of fluency in writing is under-researched. If more appropriate and successful rating scale descriptors for fluency are to be developed, then this area needs more attention in the future. The area of second language acquisition could contribute greatly to this endeavour.

7.4 Complexity

7.4.1 Grammatical complexity

Two different types of complexity were investigated in both the pilot study and the main analysis of the writing scripts: grammatical and lexical complexity.

Grammatical complexity was operationalised as clauses per t-unit. Although the measure was successful in the pilot study, it failed to distinguish between the five levels of writing in the main analysis. Interestingly, Wolfe-Quintero et al. (1998) found that although a number of studies in their review returned non-significant results for this measure, it seemed generally to increase, at least with overall proficiency level. However, a more recent study, also undertaken in an assessment context, in this case TOEFL (Cumming et al., 2005), likewise returned non-significant results for this measure. The authors report very little difference between the proficiency levels, with the means ranging from 1.5 to 1.8. These are slightly higher and more varied than those found in this study (which found means ranging from 1.4 to 1.5), but present a similar picture to the current findings. It is possible that the context in which the data are collected plays a role in this measure. The students taking the DELNA assessment were aware that their writing was going to be assessed. It is possible, therefore, that when students are in an assessment situation, they play it safe and focus more on the accuracy (and lexical complexity) of their writing at the expense of grammatical complexity. It is interesting, though, that the complexity of sentences is regularly included in rating scales of writing. If this and other studies show that writers do not differ greatly from each other in terms of the complexity of their sentence structure when in an assessment context, this measure should perhaps not be included in rating scales in the future. It might be important to make raters and rating scale designers aware of the limitation of this measure.

It can further be argued that not having a successful measure for grammatical complexity is a limitation of this study. If time had allowed, it would have been useful to pursue other measures of grammatical complexity. Possibilities for further research include the number of passives per t-unit or complex nominals per t-unit. However, Wolfe-Quintero et al.’s review shows that in previous research not many measures of grammatical complexity have been successful.
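The ratio measures discussed in this section (clauses, passives or complex nominals per t-unit) share the same arithmetic, which the following minimal sketch illustrates. It assumes the script has already been segmented and annotated by hand; the counts below are hypothetical and are not taken from the study's data.

```python
def per_t_unit(feature_counts):
    """Mean count of a feature per t-unit, given one count per t-unit."""
    if not feature_counts:
        raise ValueError("script has no t-units")
    return sum(feature_counts) / len(feature_counts)

# Hypothetical hand annotation of one script: one entry per t-unit.
clauses = [1, 2, 1, 1, 2, 1, 2, 1, 1, 2]
passives = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]

clauses_per_t_unit = per_t_unit(clauses)
passives_per_t_unit = per_t_unit(passives)

print(clauses_per_t_unit, passives_per_t_unit)
```

Because the level means reported above differ only in the first decimal place (1.4 to 1.5), a handful of additional subordinate clauses in a script would be enough to shift it from one level's mean to another's.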

7.4.1.1 Trait scale: grammatical complexity

Based on these findings, the decision was made not to include this variable in the rating scale.

7.4.2 Lexical complexity

The second type of complexity pursued was lexical complexity. Several measures were examined in the pilot study and the three most promising measures were examined in the main analysis. These were the sophisticated words over total lexical words, the average word length and the number of AWL words. All three measures were successful in distinguishing between the different levels of writing. The measure of sophisticated lexical words over total words was used in a longitudinal study by Laufer (1994). Like Laufer’s, this study was able to show that this measure differentiates between proficiency levels. The concern, however, was that it would be difficult for raters to use in the rating process. The average word length was also successful in distinguishing between the different proficiency levels, as it has been in other studies (e.g. Grant & Ginther, 2000). However, there was a concern about whether raters would be able to judge the average word length when rating. Differences between the proficiency levels were not pronounced enough to be detected by human raters examining a hand-written writing product. For this reason, the measure, although promising, was not included in the rating scale. The only measure incorporated in the scale was the number of AWL words in a text. Although the original measure was the percentage of AWL words, a brief investigation of the scripts in the sample showed that controlling for text length in this manner made no difference to the result. It was therefore thought that it might be easier for raters to look for the number of AWL words. No prior research could be located for this measure, although it parallels the variable of sophisticated lexical words over total lexical words. There are, of course, several problems with this measure. It does not control for students reusing the same word on several occasions. It would possibly have been better to measure the number of different AWL words. However, overall, the number of AWL words seems a promising measure which might be usefully applied in other contexts.
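The differences between the AWL-based variants mentioned above (raw count, percentage of all words, and number of different AWL words) and the average word length can be illustrated with a short sketch. The word set below is a tiny illustrative subset, not the full Academic Word List, and the tokeniser, function names and sample text are assumptions made for the example. A real analysis would load the complete list, including all word-family members.

```python
import re

# Tiny illustrative subset of Academic Word List items (assumption: a real
# analysis would use the full list with all family members).
AWL_SAMPLE = {"analyse", "data", "research", "significant", "method"}

def tokenize(text):
    """Lower-case the text and extract alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def awl_measures(text, awl=AWL_SAMPLE):
    tokens = tokenize(text)
    hits = [t for t in tokens if t in awl]
    total = len(tokens)
    return {
        "awl_tokens": len(hits),       # raw count, repeats included
        "awl_types": len(set(hits)),   # different AWL words only
        "awl_percent": 100 * len(hits) / total if total else 0.0,
        "mean_word_length": sum(map(len, tokens)) / total if total else 0.0,
    }

sample = ("The data show that the data are significant. "
          "More research on the data is needed.")
result = awl_measures(sample)
print(result)
```

In this sample the word "data" is counted three times by the raw count but only once by the count of different AWL words, which is the reuse problem noted above.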

7.4.2.1 Trait scale: lexical complexity

Two main considerations went into the design of the descriptors for lexical complexity. Firstly, the variable used needed to be usable by raters in a testing situation. This excluded the average word length and sophisticated lexical words over total lexical words because measuring these would be time-consuming. The number of AWL words was seen to be usable in a rating situation. Secondly, it was decided that six levels of AWL words would be difficult for raters to distinguish. Therefore, only four levels were created, joining levels 4 with 5 and 8 with 9. The rating scale for lexical complexity can be seen in Table 63 below.

Table 63: Rating scale – Lexical complexity

Three measures were used for the analysis of mechanics: spelling, punctuation and paragraphing. The number of spelling mistakes was a promising measure, but the only really worthwhile difference was found between level 4 (with eight mistakes on average) and level 5 (with just under four errors on average). The other four proficiency levels were very similar and probably not distinguishable by raters. No prior research was found that investigated spelling mistakes over different proficiency levels. However, it was interesting to see that although there were slight differences between the levels, this measure was not very successful in differentiating them. One explanation why the number of spelling errors did not differentiate between the different writing levels is that lower level writers know fewer words and, although they produce many mistakes when spelling these words, in relative terms they can make only a certain number of mistakes. The words they know are often just very simple, easily spelt words. Writers at higher levels have access to a larger vocabulary and therefore the chance of misspelling words also increases. For these reasons, higher level writers produce the same number of mistakes as lower level writers.

It is therefore not surprising that the measure of the number of spelling mistakes was not successful in identifying differences between writers at different levels (except between levels 4 and 5). However, spelling is regularly included in rating scales of writing. If there are indeed very few differences between writers in the number of mistakes they produce, then it might be necessary to bring this fact to the attention of rating scale developers. Further research in this area is clearly necessary.

The number of punctuation errors also did not distinguish between the different levels of writing. Very little research was identified on this measure. Mugharbil (1999) was able to show in his study that the full stop (the only punctuation mark that was examined in this study) is acquired first by learners. Therefore, it is possible that there were very few differences between the learners in this study because all had reached post-beginner level as they were already at university. It might have been better to include comma errors in the analysis, as Mugharbil was able to show that the correct usage of the comma is what differentiates higher

Source: Ute Knoch, Diagnostic Writing Assessment: The Development and Validation of a Rating Scale (Peter Lang, LTE 17), pages 172-177.