Alternatives to raters: Grading using the computer

In the following section, research about automated ratings using the computer is reviewed and the relevance of these programs to diagnostic assessment is discussed.

The difficulty of obtaining consistently high reliability in the ratings of human judges has resulted in research in the field of automated essay scoring. This research began as early as the 1960s (Shermis & Burstein, 2003). Several computer programs have been designed to help with the automated scoring of essays.

The first program, called ‘Exrater’ (Corbel, 1995), is a knowledge-based system which was designed with the sole purpose of assisting raters in the rating process.

Exrater does not attempt to identify a candidate’s level by computer-mediated questions and answers, but rather presents the categories of the rating scale, so that the rater can choose the most appropriate one. It also does not present the full description, but rather shows only the most important statements and keywords, which are underlined. The aim is to avoid distracting raters by having them focus on only one category at a time and not the whole rating scale. Corbel identifies a number of potential problems with the program. Firstly, he predicts a halo effect because raters might select most descriptors at the same level without checking the more detailed descriptions, which are also accessible at the click of a button. Secondly, he argues that there might be a lack of uptake due to the unavailability of computers when rating. Overall it can be argued that Exrater is a helpful tool to assist raters, but it still requires the rater to perform the entire rating process and make all decisions. Because of the risk of a halo effect, Exrater is probably not suitable for diagnostic assessment purposes.

In the past few years, a number of computer programs have become available which completely replace human raters. This advance has been made possible by developments in Natural Language Processing (NLP). NLP uses tools such as syntactic parsers, which analyse discourse structure and organisation, and lexical similarity measures, which analyse the word use of a text. There are some general advantages to automated assessment. It is generally understood to be cost-effective, highly consistent, objective and impartial. However, sceptics of NLP argue that these computer techniques are not able to evaluate communicative writing ability. Shaw (2004) reviews four automated essay assessment programs: Project Essay Grader, the E-rater model, the Latent Semantic Analysis model and the text categorisation model.

Project Essay Grader (Page, 1994) examines the linguistic features of an essay. It makes use of multiple linear regression to ascertain an optimal combination of weighted features that most accurately predicts human markers’ ratings. Development of this program started in the 1960s. It was only a partial success, as it addressed only indirect measures of writing and could not capture rhetorical, organisational and stylistic features of writing.
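The regression step described above can be illustrated with a minimal sketch. It is not Page's implementation: the two surface features (essay length in hundreds of words and average word length), the function names and the toy ratings are all assumptions made for illustration. The weights are estimated by ordinary least squares through the normal equations.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_weights(features, ratings):
    """Return [intercept, w1, w2, ...] minimising squared error against the human ratings."""
    X = [[1.0] + list(row) for row in features]  # prepend an intercept column
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, ratings)) for i in range(p)]
    return solve(xtx, xty)


def predict(weights, row):
    """Predict a rating for one essay from its feature values."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Hypothetical training essays: (length in hundreds of words, average word length).
FEATURES = [(1.0, 4.0), (2.0, 4.5), (3.0, 4.2), (4.0, 5.0), (2.5, 4.8)]
# Invented human ratings, constructed here to be exactly linear in the two features.
HUMAN_RATINGS = [4.0, 5.25, 6.1, 7.5, 5.9]

WEIGHTS = fit_weights(FEATURES, HUMAN_RATINGS)
```

Because both features are surface proxies, a model of this kind rewards what correlates with high ratings rather than what raters actually attend to, which is the limitation noted above.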

The second program evaluated by Shaw (2004) is Latent Semantic Analysis (LSA). LSA is based on word co-occurrence statistics represented as a matrix, which is “decomposed and then subjected to a dimensionality technique” (p.14).

This system looks beneath surface lexical content to quantify deeper content by mapping words onto a matrix and then rates the essay on the basis of this matrix and the relations in it. The LSA model is the basis of the Intelligent Essay Assessor (Foltz, Laham, & Landauer, 2003). LSA has been found to be almost as reliable as human assessors, but as it does not account for syntactic information, it can be tricked. It also cannot cope with certain features that are difficult for NLP (e.g. negation).
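The decomposition that LSA relies on can be sketched in a few lines. This is an illustration only, not the Intelligent Essay Assessor: the four toy documents, the choice of two latent dimensions and the use of NumPy's singular value decomposition are all assumptions. The sketch shows how two texts that share only one word ("engine") can become near-identical in the reduced space because "car" and "automobile" occur in the same context.

```python
import numpy as np

# Hypothetical mini-corpus; each string stands in for an essay.
DOCS = ["car engine", "automobile engine", "flower petal", "flower petal garden"]

vocab = sorted({word for doc in DOCS for word in doc.split()})
# Term-by-document count matrix: one row per word, one column per document.
A = np.array([[doc.split().count(term) for doc in DOCS] for term in vocab], dtype=float)

# Decompose the matrix, then keep only the k largest singular values
# (the dimensionality reduction step).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


raw_similarity = cosine(A[:, 0], A[:, 1])                # surface word overlap only
latent_similarity = cosine(doc_vecs[0], doc_vecs[1])     # after reduction
unrelated_similarity = cosine(doc_vecs[0], doc_vecs[2])  # different topic
```

Each document is treated as an unordered bag of words, so word order and syntax never enter the matrix. This is why a system of this kind can be tricked.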

The third program, the Text Categorisation Technique Model (Larkey, 1998), uses a combination of key words and linguistic features. In this model a text document is grouped into one or more pre-existing categories based on its content. This model has been shown to match the ratings of human examiners about 65% of the time. Almost all ratings were within one grade point of the human ratings.
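The two agreement figures reported for this model correspond to what are commonly called exact agreement (identical scores) and adjacent agreement (scores within one grade point). A minimal sketch of how such figures can be computed follows. The score lists are invented for illustration and are not Larkey's data.

```python
def agreement(machine_scores, human_scores):
    """Return (exact, adjacent) agreement as proportions of all rated essays."""
    pairs = list(zip(machine_scores, human_scores))
    exact = sum(m == h for m, h in pairs) / len(pairs)
    adjacent = sum(abs(m - h) <= 1 for m, h in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical scores for ten essays on a five-point scale.
MACHINE = [3, 4, 2, 5, 3, 4, 1, 2, 4, 3]
HUMAN = [3, 4, 3, 5, 2, 4, 1, 4, 4, 3]

EXACT, ADJACENT = agreement(MACHINE, HUMAN)
```

In this toy set the two raters give the same score for seven of the ten essays and differ by more than one point on a single essay.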

Finally, e-rater was developed by the Educational Testing Service (ETS) (Burstein et al., 1998). It uses a combination of statistical and NLP techniques to extract linguistic features. The programme compares essays at different levels in its database with features (e.g. sentence structure, organisation and vocabulary) found in the current essay. Essays earning high scores are those with characteristics most similar to the high-scoring essays in the database and vice versa. The system draws on over one hundred automatically extractable essay features, and computerized algorithms extract values for every feature from each essay. Then, stepwise linear regression is used to group features in order to optimize rating models. The content of an essay is checked by vectors of weighted content words. An essay that remains focussed is coherent, as evidenced by its use of discourse structures, good lexical resource and varied syntactic structure. E-rater has been evaluated by Burstein et al. (1998) and has been found to have levels of agreement with human raters of 87 to 94 percent. E-rater is used operationally in GMAT (Graduate Management Admission Test) as one of two raters and research is underway to establish the feasibility of using e-rater operationally as second rater for the TOEFL iBT independent writing samples (Jamieson, 2005; Weigle, Lu, & Baker, 2007).
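The content check by vectors of weighted content words can be illustrated with a small sketch. It is not ETS's algorithm: the stop-word list, the use of raw word counts as weights, the function names and the example essays are all assumptions. Each score level is represented by the pooled word vector of its previously scored essays, and a new essay receives the level whose vector it most resembles.

```python
import math
from collections import Counter

STOP_WORDS = {"is", "are", "of", "the", "a"}  # tiny illustrative stop list


def content_vector(text):
    """Count content words; the raw count serves as the weight (an assumption)."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def score_by_content(essay, scored_essays):
    """Return the score level whose pooled essays are most similar to the new essay."""
    essay_vec = content_vector(essay)
    level_vecs = {}
    for level, texts in scored_essays.items():
        pooled = Counter()
        for text in texts:
            pooled.update(content_vector(text))
        level_vecs[level] = pooled
    return max(level_vecs, key=lambda level: cosine(essay_vec, level_vecs[level]))


# Hypothetical database of essays already scored by human raters.
SCORED = {
    2: ["pollution is bad", "cars are bad"],
    5: ["industrial emissions degrade urban air quality",
        "emissions regulation improves air quality"],
}

HIGH = score_by_content("stricter regulation of emissions improves urban air", SCORED)
LOW = score_by_content("cars are bad", SCORED)
```

A comparison of this kind rewards topical vocabulary only. In the description above it is one component alongside the regression over the other extracted features.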

Based on the E-rater technology, ETS has developed a programme called Criterion. This programme is able to provide students with immediate feedback on their writing ability, in the form of a holistic score, trait level scores and detailed feedback.

There are several reasons why computerized rating of performance essays might be useful for diagnostic assessment. The main advantage of computer grading might be the immediate feedback that this scoring method can provide (Weigle et al., 2007). Alderson (2005) stressed that for diagnostic tests to be effective, the feedback should be immediate, a feature which his indirect test of writing in the context of DIALANG is able to achieve. Performance assessment of writing rated by human raters will inevitably mean a delay in score reporting. The second advantage might be the internal consistency of such computer programs (see for example the feedback provided by the Criterion programme developed by ETS). However, research comparing human raters and the e-rater technology has shown (1) that e-rater was not as sensitive to some aspects of writing as human raters were when length was removed as a variable (Chodorow & Burstein, 2004), (2) that human/human correlations were generally higher than human/e-rater correlations (Weigle et al., 2007), and (3) that human raters fared better than automated scoring systems when correlations of writing scores with grades, instructor assessment of writing ability, independent rater assessment on discipline-specific writing tasks and student self-assessment of writing were investigated (Powers, Burstein, Chodorow, Fowles, & Kukich, 2000; Weigle et al., 2007).

There are also a number of concerns about using computerized essay rating. Firstly, these ratings might not be practical in contexts where computers are not readily available. Furthermore, it could be argued that writing is essentially a social act and that writing to a computer violates the social nature of writing. Similarly, what counts as an error might vary across different sociolinguistic contexts and therefore human raters might be more suitable to evaluate writing (Cheville, 2004). In addition, diagnostic tests should provide feedback on a wide variety of features of a learner’s performance, yet current rating programs are unable to measure as many features as human raters. This means that automated scoring programs might under-represent the writing construct. For example, the programs reviewed above were not able to evaluate communicative writing ability or more advanced features of syntactic complexity. Taking all the above into account, it can be argued that human raters should be able to provide more useful information for diagnostic assessment.

2.6 Conclusion

This chapter has attempted to situate diagnostic assessment within the literature on performance assessment of writing, and research regarding the influences of a number of variables on performance assessment was reported. Because the focus of this study is the rating scale, research relating to rating scales and rating scale development, as well as considerations regarding the design of a rating scale for diagnostic assessment, are considered in the following chapter.

--- Notes:

1 For a more detailed discussion of this and later models refer to Chapter 3.

2 Most research cited in this section is based on studies conducted in the context of oral assessment. This research is equally relevant to writing assessment.

Chapter 3: RATING SCALES
