5 DATA ANALYSIS
5.3.1 Rating drafts with a rubric
This section describes how a scoring rubric was chosen, modified, and then used to rate students' first and second drafts. Because there were twenty-six pairs of first and second drafts, fifty-two drafts were rated in this phase of the data analysis.
First, a rubric was chosen and modified. The rubric used for rating drafts was adapted from Paulus’ (1999) measure, which was developed for ESL writers and has been used in peer response studies (e.g., Tang and Tithecott, 1999). Paulus’ analytic rubric used point values from one to ten for each score category, but for the purposes of this study, the point values were compressed to five. The bottom two categories were removed because the language proficiency they describe is below that of freshman ESL composition students, and the highest category was omitted because it describes an error-free, native-speaker standard that is also inappropriate for this study. After these three categories were omitted, the remaining seven were compressed into five.
Paulus’ original rubric comprises six analytical categories: organization/unity, development, structure, vocabulary, cohesion/coherence, and mechanics. For this study, cohesion and coherence were subsumed into organization and unity, because there seemed to be a fair amount of overlap between these two categories in the original rubric. Also, mechanics was seen as unnecessary for typewritten essays that are grammar- and spell-checked automatically, so this category was deleted. After these revisions, the modified rubric included four analytical categories with five possible points for each one, such that each essay could be given a maximum score of twenty points. Table 5.3 displays the revised rubric:
Table 5.3 Scoring rubric for student drafts (adapted from Paulus, 1999)
Score 1
Organization / Unity: Some organization. Relationship between ideas not evident. Absent or unclear thesis.
Development: Lacks content. Few examples and details.
Structure: Almost all simple sentences. Attempts at complicated sentences impede meaning. No embedding.
Vocabulary: Meaning inhibited by limited range of vocabulary.

Score 2
Organization / Unity: Organization present. Ideas show grouping. May have general thesis.
Development: Underdeveloped. Examples may be inappropriate/ineffective. May use main points as support for each other.
Structure: Mainly simple sentences. Attempts at embedding may be present in simple structures with inconsistent success.
Vocabulary: Somewhat limited command of word usage. Frequent use of circumlocution. Often uses informal language.

Score 3
Organization / Unity: Clear introduction, body, and conclusion. Topic sentences present but may lack focus. Narrowed thesis. Relationship between ideas present.
Development: Partially underdeveloped. Logic flaws may be evident. Some areas under-supported and generalized. Repetitive.
Structure: Some variety of complex structures. Clause construction and placement somewhat under control. Errors may occasionally impede meaning.
Vocabulary: Meaning seldom inhibited. Adequate range and variety. Little use of circumlocution. Infrequent errors.

Score 4
Organization / Unity: Appropriate paragraphing and focused topic sentences. Narrowed thesis, but essay may digress from it. Hierarchy of ideas generally present and effective.
Development: Acceptable level of development. Logic evident. Mostly adequate supporting ideas. May be repetitive.
Structure: Sentence variety evident. Frequent successful attempts at complex structures. Meaning generally not impeded by errors.
Vocabulary: Meaning not inhibited. Adequate range and variety. Mistakes almost never distracting. Appropriately academic.

Score 5
Organization / Unity: Definite control of organization. Uses transitions between parts of essay. Focused thesis that directs organization of essay.
Development: Each point clearly developed with variety of convincing types of evidence. Ideas supported effectively. Clear and logical progression of ideas.
Structure: Successful variety of sentences and complex structures. Manipulates syntax with attention to style. No errors that impede meaning.
Vocabulary: Meaning totally clear. Sophisticated range and variety. Attempts at original, appropriate word choices.
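To illustrate how a draft's total is formed from the rubric in Table 5.3, the following minimal sketch sums four category ratings, each from one to five, into a total out of twenty. The names `CATEGORIES` and `total_score` and the example ratings are hypothetical and were introduced only for illustration; they are not part of the study's materials.

```python
# Hypothetical helper; the category names follow Table 5.3.
CATEGORIES = ("organization_unity", "development", "structure", "vocabulary")
MIN_POINTS, MAX_POINTS = 1, 5


def total_score(ratings: dict[str, int]) -> int:
    """Sum the four category ratings into a total out of twenty."""
    missing = set(CATEGORIES) - set(ratings)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    for name in CATEGORIES:
        if not MIN_POINTS <= ratings[name] <= MAX_POINTS:
            raise ValueError(f"{name} must be between {MIN_POINTS} and {MAX_POINTS}")
    return sum(ratings[name] for name in CATEGORIES)


# Invented example ratings, not data from the study.
example = {"organization_unity": 4, "development": 3, "structure": 4, "vocabulary": 5}
print(total_score(example))  # 16
```

Under this scheme the lowest possible total is four points and the highest is twenty.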
Using the rubric presented above, I assigned first and second drafts a score out of twenty total points, and then calculated the gain in score for that participant. For example, a writer who scored sixteen points on his first draft and eighteen points on his second one would have a gain in score of two points. I completed this process for all drafts: ten participants with two drafts each across three writing assignments (with four participants missing one pair of drafts), or fifty-two drafts.
After I assigned a score for each draft, second raters were recruited and trained. These were MA students in Applied Linguistics who had experience either teaching ESL composition or rating the university’s ESL placement exam. Rater training is important for sound measurement because it eliminates extreme differences in rater interpretation of the scoring rubric, increases the self-consistency of raters, and reduces individual biases displayed by raters (Knoch, 2007).
During the rater training session, I presented an overview of the study, explained how ratings would be used to answer the research question, distributed and discussed the writing prompts (summary-response and persuasive research paper), and asked raters to assign scores to a first and second draft of summary-response papers that were not included in the study. Raters were encouraged to compare the second draft to the first when assigning ratings, to take into account how revisions may have improved the composition.
Next, raters were invited to share their scores for each category, and any disputes were discussed as a group. Disputes were resolved by referencing the language in the rubric and the assignment sheet for the paper, and considering how it applied to the draft in question. For example, some raters thought that summary-response papers without a thesis statement should lose points in the organization/unity category, but I pointed them to the assignment sheet for that paper, which did not require a thesis statement. The procedure was repeated for the persuasive research paper. Appendix K provides the training packet that was used for raters.
Three raters completed rating sessions over five consecutive days, and each session lasted from three and a half to five hours. At the beginning of each rating session, the raters and I completed a norming activity in which we rated a sample paper not used in the current study and discussed our scores. For the first three days, these norming sessions focused on summary-response papers, and we used a persuasive research paper on the fourth and fifth days, corresponding to the paper type that they were rating each day. After each draft was double-rated, the average gain in score for each pair of drafts was calculated. For example, rater one scored Ivana’s first draft at fourteen points, and rater two at fifteen points, so her average score for draft one was 14.5 points. For her second draft, Ivana scored sixteen points from rater one and sixteen from rater two, for an average of sixteen. Ivana’s average score gain, then, is 1.5 (the difference between 14.5 on her first draft and sixteen on her second). Inter-rater reliability for all drafts was also calculated.
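The average gain calculation described above can be sketched as follows. The function name `average_gain` is hypothetical; the scores are the ones reported for Ivana in the preceding paragraph.

```python
from statistics import mean


def average_gain(first_draft_scores, second_draft_scores):
    """Return the mean second-draft score minus the mean first-draft score."""
    return mean(second_draft_scores) - mean(first_draft_scores)


# Ivana's scores as reported above: [rater one, rater two].
ivana_first = [14, 15]
ivana_second = [16, 16]
print(average_gain(ivana_first, ivana_second))  # 1.5
```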
For this study, it is appropriate to use consensus estimates of reliability, which are applied when raters use rubrics that represent a linear continuum of progress along a construct of writing ability (Brown et al., 2004), as do the four categories of this rubric. In percent exact agreement, which is one type of consensus estimate, agreement levels of 70% are considered indications of reliable scoring (Stemler, 2004). Because the maximum score on this rubric is twenty points, a 70% agreement level can be considered as two raters having no more than six points of difference between their final ratings. Percent exact agreement between the second raters and me for this study was 94%. As this high level of agreement shows, a third rating was not necessary for any paper.
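One possible reading of the six-point threshold is that agreement on a single draft is expressed as the share of the twenty-point scale on which the two raters do not differ, so that a six-point difference corresponds to 70%. The sketch below works through that arithmetic. It is an assumption about how the figures relate, not a confirmed description of how the 94% value was computed; the function names and the rater totals are invented for illustration.

```python
from statistics import mean

MAX_TOTAL = 20  # maximum total score on the rubric


def draft_agreement(score_a, score_b, max_total=MAX_TOTAL):
    """Agreement on one draft as a percentage of the scale (assumed reading)."""
    return 100 * (max_total - abs(score_a - score_b)) / max_total


def mean_agreement(scores_a, scores_b, max_total=MAX_TOTAL):
    """Average the per-draft agreement over all double-rated drafts."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and the same length")
    return mean(draft_agreement(a, b, max_total) for a, b in zip(scores_a, scores_b))


# A six-point difference on a twenty-point scale corresponds to 70%.
print(draft_agreement(14, 20))  # 70.0

# Invented totals for four drafts; not data from the study.
rater_one = [14, 16, 12, 18]
rater_two = [15, 16, 14, 17]
print(mean_agreement(rater_one, rater_two))  # 95.0
```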