CHAPTER 2 LITERATURE REVIEW
2.4.2 Comprehensive models of the rating process as a sequence
2.4.2.2 Wolfe’s (1997, 2006) framework of the scoring process
According to Wolfe (1997), the main issue raised in studies concerned with the sequence of steps raters go through is disagreement over whether different raters adopt one basic method or procedure when making scoring decisions. Wolfe (1997) was among those (e.g. Milanovic et al., 1996, above) who suggest that there are important variations in the rating approaches used by different raters, in contrast to those who consider that raters adopt a single rating method (Freedman and Calfee, 1983; Homburg, 1983). This is then a discussion about a different form of inter-rater agreement from that based on the scores awarded: one concerned with whether the same criteria etc. are used, and especially whether the same sequence of steps is followed in the rating process (Wolfe, 1997). In order to cope with a range of potential individual variation in sequences, Wolfe therefore designed a model with little linearity specified, in contrast with that of Milanovic et al. (1996).
Wolfe’s proposed model of the rating process was arrived at not only from his own study of holistic rating, but also by summarizing the findings of a series of studies which attempted to document cognitive differences between raters who rate essays in psychometric, large-scale direct writing assessment settings (Wolfe, 2006, p. 37). He also made use of the information-processing model of holistic scoring proposed by Freedman and Calfee (1983), in which they identified three main steps that underlie the rating of a composition: 1) read and comprehend the text to create a ‘text image’, 2) evaluate the text image and store impressions, and 3) articulate the evaluation (Freedman & Calfee, 1983, p. 91). This model supposes that the rater is the one who guides the process at all stages, so the process may vary from one rater to another (Wolfe, 1997) (cf. 2.2.4). Rater variables are not, however, included in the model itself in the way that writer variables are included, for example, in the Flower and Hayes (1981) cognitive model of the writing process.
Wolfe's model of the rating process claims to cover rater thinking/cognition in general (see Figure 2-3). The model differentiates between what it calls two cognitive frameworks, a framework of writing and a framework of scoring, alongside the text and the text image (as shown in Figure 2-3). The framework of scoring is in fact the model of the rating process itself, based on Freedman and Calfee as just described, with just three broad stages of 'processing action' from top to bottom, compared with the seven stages in Milanovic et al. The framework of writing is analogous to the bottom boxes in Milanovic et al., dealing with what we call factor 2, the criteria. Although it is subtitled 'content focus', it covers all types of criteria, whether focused on content, language, etc. The reference to text and text image on the left captures our factor 1. As with Milanovic et al., there is no explicit reference in the diagram to factor 3, the rater characteristics, although these researchers are clearly aware of their impact on the processing itself.
Wolfe claims that the rater first reads the text written by the student and creates a mental image of the text (left side of the diagram). Of course, the created text images may differ from one rater to another due to environmental and experiential differences among raters (Pula & Huot, 1993; Wolfe, 2006). Next, a scoring decision is made through the performance of a series of processing actions that constitute the framework of scoring (middle of the diagram). That is, the framework of scoring is “a mental script of a series of procedures that can be performed while creating a mental image of the text and evaluating the quality of that mental image” (Wolfe, 2006, p. 40). For example, after reading the text in order to begin formulating the text image and commenting on the text, the rater proceeds to evaluation, which consists of monitoring specific characteristics of the text (prompted by the framework of writing), reviewing the features that seemed most noteworthy and then making a decision about the score to assign. The justification actions which follow are diagnosing, coming up with a rationale, and comparing texts. It is noticeable that in the diagram the framework of scoring box commits itself to far less detail than Milanovic et al.'s model does in terms of number of steps, and has no arrows within it. This presumably implies that the sequence may be gone through repeatedly, with omission of steps, as much as required, i.e. that it is fully recursive. We will be interested to see if this fits our data better than Milanovic et al.'s more detailed model.
Figure 2-3 Model of scorer cognition (Wolfe, 1997, p. 89)
Besides proposing this model, Wolfe's study further tried to identify differences and similarities between raters of different rating proficiency (defined by him in terms of ability to come to agreement with other raters, rather than training etc.). He found that differences might appear as a failure to identify the connections between ideas contained in the writing, as a result of the text image not adequately capturing the essence of the writing. Furthermore, personal comments might distract the rater from the rating process.
With respect to the content focus shown in Figure 2-3 above, Wolfe found that the quality of the mechanics, the organization of the student’s ideas, the degree to which the student adopted storytelling devices to communicate the sequence of events, and the degree to which the student developed a unique style for presenting his or her ideas all influenced his raters’ decisions (Wolfe, 2006). There is no guarantee, of course, that our raters will focus on the same features.
In addition, Wolfe found that adoption of different content focus categories (i.e., in our terms, criteria and their weighting, Factor 2) may reveal significant cognitive differences among raters. For instance, different conclusions may be reached if raters have different areas of emphasis during evaluation, such as a focus on the writer’s style versus a focus on storytelling devices (Wolfe, 2006, p. 41). Importantly, Wolfe points out that rater differences may not be limited to the number and nature of the content focus categories used while making rating decisions, but may extend to other components of the framework of writing. For example, raters may differ in the frequency with which they shift their focus and jump between content focus categories (Wolfe, 2006). In other words, individual rater styles may be characterised by the sequences of steps they follow, not just the criteria they choose to rely on.
Wolfe used think-aloud protocols to examine the jumps between categories that raters made, and concluded that less proficient raters made more jumps, which suggests that they have trouble conceptualizing their decision-making process. The protocol analysis revealed that less proficient raters tended initially to read a short section of the essay and begin to formulate a decision, which then developed as they read on. Proficient raters, on the other hand, tended to withhold judgment until the entire essay had been read. This was evidenced by the fact that the less proficient raters in this study employed more early decisions and monitoring behaviours, whereas the more proficient raters employed more review behaviours. This is something we will be interested in checking on in our study.
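By way of illustration only, the notion of a 'jump' between content focus categories could be operationalised as a change of category between two consecutive coded segments of a think-aloud protocol. The short Python sketch below is our own and is not Wolfe's procedure: the function names, the definition of a jump and the two coded sequences are hypothetical, and only the category labels are taken from the content focus features mentioned above.

```python
from itertools import groupby


def count_jumps(coded_segments):
    """Count changes of content focus category between consecutive segments.

    Assumption: a 'jump' is any point where a segment's code differs from
    the code of the segment immediately before it.
    """
    # Collapse runs of identical codes; the jumps are the boundaries between runs.
    runs = [code for code, _ in groupby(coded_segments)]
    return max(len(runs) - 1, 0)


def jump_rate(coded_segments):
    """Proportion of consecutive segment pairs that involve a jump."""
    if len(coded_segments) < 2:
        return 0.0
    return count_jumps(coded_segments) / (len(coded_segments) - 1)


# Hypothetical coded protocols (invented for illustration, not real data).
rater_a = ["mechanics", "mechanics", "organization", "organization", "style"]
rater_b = ["mechanics", "style", "mechanics", "storytelling", "organization"]

for name, protocol in [("rater_a", rater_a), ("rater_b", rater_b)]:
    print(name, count_jumps(protocol), jump_rate(protocol))
```

Under this assumed definition, a rater who stays with one category over several segments before moving on yields a lower jump rate than one who changes category at every segment. A rate rather than a raw count is used so that protocols of different lengths remain comparable.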
Proficient raters also made fewer personal comments (for which, incidentally, his model does not seem to have a place), and Wolfe interpreted this to mean that rating is a cognitively demanding task which, if done properly, leaves no space for such comments. Conversely, less proficient raters found it difficult to cope with the task and thus often deviated from the rating process. They also tended to focus on surface features or to break the evaluation down into chunks, which ran contrary to the holistic marking scheme provided in his study. Although we will not be using Wolfe's definition of rater proficiency, we will be interested to see if such rater differences in style are also found in our study.