“… it has come to this. The essay, the great literary art form that Montaigne conceived and Virginia Woolf carried on … has sunk to a state where someone thinks it is a bright idea
to ask a computer if an essay is any good.” (Scott, 1999)
Automated essay evaluation, especially automated essay scoring, has been subject to considerable controversy. On the one hand, there is significant support for AES, as “automated essay scoring and evaluation becomes more widely accepted as an educational supplement for both assessment and classroom instruction” (preface in Shermis & Burstein, 2003). Several studies show that AES systems work well and report high agreement rates between AES systems and human markers (Bridgeman, Trapani, & Attali, 2012; Burstein & Chodorow, 1999, 2010; Burstein et al., 2003; Landauer, Laham, & Foltz, 2003; Powers, Burstein, Chodorow, Fowles, & Kukich, 2001).
On the other hand, there has been, and still is, significant opposition to AES, particularly to the idea, originated by Page, that ‘it might replace human scoring’ (Ericsson & Haswell, 2006; Herrington & Moran, 2012; Perelman, 2012). The harshest criticism comes from the community of writing researchers. The Conference on College Composition and Communication, the major organisation for writing researchers, has actively opposed AES during the last decade. Writing professionals claim that such systems prepare their students to write for machines, as writers writing to computers (Herrington & Moran, 2001), and they therefore state: “Because all writing is social, all writing should have human readers, regardless of the purpose of the writing … We oppose the use of machine-scored writing in the assessment of writing” (Deane, 2013, p. 8). The organisation has not yet revised this statement, although AES has been widely deployed over the last 15 years. Critics argue that replacing human markers with a machine would not only threaten the jobs of tutors, but also change students’ sense of what it means to write in school and university (Herrington & Moran, 2001).
Common criticisms of AES (based on Cheville, 2004; Ericsson & Haswell, 2006) focus on the capability of such systems to interpret meaning and to evaluate the factual correctness of the content and the quality of the argumentation. Machines cannot truly read or understand an essay, or interpret its meaning (Attali, 2013). Such systems can therefore potentially be gamed, as they can be insensitive to features of student writing that human markers might detect and penalise, such as repetition and lack of coherence (Deane, 2013). There is little research on the impact of AES on writers’ behaviour, or on whether writers come to view it as a barrier to be gamed and manipulated by tricks rather than as a person to communicate with (Deane, 2013). The strongest opposition to AES concerns its deployment as a replacement for the human scorer, that is, as the sole scorer. However, such an extreme situation is rare, as even the widely known ETS systems use AES as a complement to the human marker.
It is true that current AES systems do not mimic human markers’ ability to measure conceptual reasoning; AES therefore measures a narrower range of skills than human markers do (Deane, 2012), although it can measure much that human markers do not pay attention to. Such systems are therefore criticised for failing to measure higher-order writing skills, such as strong, high-quality argumentation, because of their limited nature (Attali, 2013). For example, E-rater measures efficiency in ‘knowledge-telling’ writing and cannot score ‘knowledge-transforming’ writing well enough. Bennett (2011) reports on the use of AES for persuasive writing. He concludes that although the overall correlation between human and machine scores is high, AES systems are better at scoring essays marked against a text-production rubric, which values fluency, effective word choice, and accuracy of text production, than at scoring essays marked against a critical-thinking rubric, which values effective argumentation and attention to the audience. When the focus of assessment is on students who need practice to improve their fluency and to control their text production processes with less cognitive load, the capacity of AES is relatively strong (Kellogg & Raulerson, 2007); but if the focus is on the quality of argumentation, AES is relatively weak (Deane, 2013). Therefore, it is not reasonable to deploy AES as the sole scorer; instead, it can be deployed in combination with human markers.
Attali (2013) pointed out that there is a lack of understanding of what human markers do in their evaluation. He noted that the primary goal of AES is to ensure that human markers think similarly about what constitutes high- or low-quality student writing, so that machine scores measure the same elements as human markers do. However, there is evidence of discrepancies in the way human markers interpret the quality of the same essay (Attali, Lewis, & Steier, 2012). For instance, ‘rater severity/leniency’, the systematic assignment of lower or higher ratings than the average of the ratings assigned by other markers, is one of the main discrepancies between markers (Engelhard & Myford, 2003). Moreover, even ‘rater calibration methods’, that is, extensive training prior to marking and the use of marking rubrics to build consensus, cannot alter ‘rater severity’ (Engelhard & Myford, 2003).
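The definition of rater severity/leniency given above can be illustrated with a minimal sketch. It is a purely descriptive calculation, not the statistical modelling used in the cited studies; the function name `rater_severity`, the rater labels, and the scores are all hypothetical and chosen only for illustration.

```python
from statistics import mean

def rater_severity(scores):
    """Mean deviation of each rater from the other raters on the same essays.

    scores: dict mapping rater -> dict mapping essay id -> score.
    A negative value suggests a severe rater, a positive value a lenient one.
    (Hypothetical helper; a descriptive index only.)
    """
    result = {}
    for rater, own in scores.items():
        deviations = []
        for essay, score in own.items():
            # Scores given to the same essay by every other rater
            others = [s[essay] for r, s in scores.items()
                      if r != rater and essay in s]
            if others:
                deviations.append(score - mean(others))
        result[rater] = mean(deviations) if deviations else 0.0
    return result

# Invented ratings: rater A is consistently one point below B and C
ratings = {
    "A": {"e1": 3, "e2": 4, "e3": 2},
    "B": {"e1": 4, "e2": 5, "e3": 3},
    "C": {"e1": 4, "e2": 5, "e3": 3},
}
severity = rater_severity(ratings)
print(severity)
```

In this invented example, rater A scores every essay one point lower than the other two markers, so A's index is negative (severe) while B and C come out slightly positive (lenient relative to the rest).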
If human markers are inconsistent and unreliable, then the machine cannot be trained effectively (Bridgeman, 2013). Mimicking human markers is therefore a difficult aim to achieve. Bridgeman (2013) discusses how to assess rater reliability so that machines can be trained better. However, in order to deploy an AES system by
well. It does not understand the essay and is therefore limited to measuring a subset of the writing construct; AES should thus currently be considered a “complement to human scoring” (Attali, 2013, p. 194). A “division of labour” approach (Attali, 2013, p. 194) between human markers and machines can be used to overcome such issues. Contrary to Page’s initial intention, AES should be used as a “complement to (instead of replacement for) human scoring, limited in its ability to measure a subset of the writing construct” (Attali, 2013, p. 182). “No assessment technology should be applied blindly; but neither should any method be rejected a priori, without considering how it can be used to support effective learning and teaching” (Deane, 2013, p. 18).
Although, in general, AES systems mimic human markers well enough that various studies show high correlations, the fact that the approach works well on average does not guarantee that it will work well in all population subgroups (Bridgeman, 2013). There are several studies (Bridgeman et al., 2012) of how well such systems work with essays written by students of different genders, races, ethnicities, and language backgrounds. However, no studies are available on how automated essay evaluation performs across different disciplines and student levels, possibly because the systems that have been evaluated are mass-market ETS systems that only work on student essays for entrance exams, which do not differ in level or discipline.
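The point that good average agreement can mask subgroup differences can be made concrete with a small sketch. The function `agreement_by_group`, the group labels, and all scores below are hypothetical, invented purely for illustration; the two statistics shown (exact agreement rate and mean machine-minus-human difference) are simple descriptive choices, not necessarily those used in the cited studies.

```python
from collections import defaultdict

def agreement_by_group(records):
    """Summarise human-machine agreement per subgroup.

    records: iterable of (group, human_score, machine_score).
    Returns group -> (exact agreement rate, mean machine-minus-human difference).
    (Hypothetical helper for illustration.)
    """
    grouped = defaultdict(list)
    for group, human, machine in records:
        grouped[group].append((human, machine))
    summary = {}
    for group, pairs in grouped.items():
        n = len(pairs)
        exact = sum(1 for h, m in pairs if h == m) / n
        mean_diff = sum(m - h for h, m in pairs) / n
        summary[group] = (exact, mean_diff)
    return summary

# Invented data: (subgroup, human score, machine score)
records = [
    ("group_1", 4, 4), ("group_1", 3, 3), ("group_1", 5, 5), ("group_1", 2, 3),
    ("group_2", 4, 3), ("group_2", 3, 2), ("group_2", 5, 5), ("group_2", 2, 1),
]
summary = agreement_by_group(records)
# Pool every record into one group to obtain the overall figures
overall = agreement_by_group(("all", h, m) for _, h, m in records)["all"]
print(overall, summary)
```

With these invented scores, the pooled figures look moderate, while the per-group breakdown shows that the machine slightly over-scores one group and under-scores the other, which is the kind of pattern a subgroup analysis is meant to reveal.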