
Chapter 5: METHODOLOGY – ANALYSIS OF WRITING SCRIPTS

5.5 Main study Phase 1

The following section will briefly describe the writing scripts collected as part of the 2004 administration of the DELNA assessment. Of the just over two thousand scripts, 601 were randomly chosen for the main analysis.

5.5.1 Instruments:

5.5.1.1 Writing scripts:

Five prompts were used in the administration of DELNA in 2004. Table 30 below shows the distribution of prompts across the scripts in the sample. As mentioned previously, scripts on prompt five were excluded based on a FACETS analysis (in which prompt was specified as a facet), which showed that it was marginally more difficult than the others.

Table 30: Percentages of different prompts used in sample

Task Frequency Percentage

1 176 29.3%

2 93 15.5%

3 171 28.4%

4 161 26.7%

TOTAL 601 100%

The length of the scripts ranged from 47 to 628 words, with a mean of 270 words.

Deletions were not part of the word count. All scripts were originally written by hand and then typed for the analysis.

Table 31 below shows the distribution of final scores awarded to the writing scripts. This is based on the averaged final score from both raters. It can be seen that no scripts were awarded a nine overall by both raters.

Table 31: Final marks awarded to scripts in sample

Final Mark Frequency Percentage

4 12 2%

5 115 19%

6 276 46%

7 172 29%

8 26 4%

9 0 0%

TOTAL 601 100%

5.5.2 Participants

5.5.2.1 The writers:

Several background variables were available for the participants, because DELNA students routinely fill in a background information sheet when booking their assessment. Here, gender, age group and L1 of the students in the sample are reported.

Table 32 below shows that there were somewhat more females in the sample overall.

Table 32: Gender distribution in sample

Gender Frequency Percentage

Female 329 55%

Male 247 41%

Not specified 25 4%

Table 33 below shows that most students fell into the under-20 category. Very few writing scripts in the sample were produced by writers aged 41 or above.

The L1 of the students was also noted as part of the self-report questionnaire. Table 34 below shows that the two largest L1 groups were students speaking an East Asian language as L1 (41%), closely followed by students with English as their first language (36%). Other L1s included in the sample were European languages other than English (9%), Pacific Island languages (4%), languages from Pakistan/India and Sri Lanka (4%) and others (3%). A further 3% of students did not specify their L1.

Table 33: Age distribution in sample

Age group Frequency Percentage

Under 20 340 57%

20 – 40 225 37%

41 or above 14 2%

Not specified 22 4%

TOTAL 601 100%

Table 34: L1 of students in sample

L1 Frequency Percentage

English 217 36%

East Asian language 248 41%

European language 52 9%

Pacific Island language 26 4%

Language from Pakistan/India/Sri Lanka 21 4%

Other 19 3%

Not specified 18 3%

Total 601 100%

As part of the information above, the distribution of the final average writing mark in relation to the test takers’ L1 was calculated. Table 35 shows that almost all students scoring an eight overall were native speakers of English, while the largest number scoring lower marks (fours or fives) were from Asian backgrounds. Test takers who did not specify their language background were not included in this table.

Table 35: Marks awarded to different L1 groups in sample

L1 \ Final Writing Mark 4 5 6 7 8 Total

English - 13 86 95 23 217

East Asian language 11 79 124 32 2 248

European language - 8 21 23 - 52

Pacific Island language - 6 15 5 - 26

Language from India/Sri Lanka/Pakistan - 3 9 8 1 21

Other - 4 10 5 - 19

5.5.2.2 The raters:

Very little specific information was available about the raters of the 601 scripts during the 2004 administration. However, as mentioned earlier, all DELNA raters are experienced teachers of either ESOL or English, a large number have rating experience outside the context of DELNA (for example in the context of IELTS) and all have postgraduate qualifications. More background details on the participating raters in Phase 2 of the study will be reported in Chapter 8.

5.5.3 Procedures

5.5.3.1 Data collection:

The 601 writing scripts randomly selected for the purpose of this study were collected as part of the normal administration of the DELNA writing component over the course of the academic year 2004. All scripts were rated by two raters and, in case of discrepancies of more than two band scores, a third rater was consulted.

As part of the DELNA administration, a background information sheet is routinely collected from each student. Several categories on this background information sheet were entered into a database (see section on data entry).

5.5.3.2 Data entry:

Data were entered into a Microsoft Access database which included a random ID number for each script, the student’s ID number to identify the script, the task (prompt) number, the scores awarded to the script by the two raters on the three categories of the analytic scale (fluency, content, form), as well as any relevant background information about the student. The variables entered from the background information sheet were as follows: country of birth, gender, age group, L1, home language, time in NZ, time in other English-speaking country, marks on other relevant English exams and enrolment at the University of Auckland at time of sitting the assessment (i.e. first, second or third year). The scores awarded on each category of the analytic scale (i.e. fluency, content, form) by the two (or three) raters were then averaged (in the case of uneven scores arising, the score was rounded down) to arrive at a final score for each script in each category.

An overall writing score was also calculated for each script. This was based on the average of the mean scores for each of the three categories of fluency, content and form. The overall score was rounded down if the fractional part of this average was .333 and rounded up if it was .667.
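The two rounding rules can be sketched as follows. This is a minimal illustration, not the procedure actually run in the Access database: the function names are hypothetical, and it assumes that "rounded down" means taking the floor of the raters' average.

```python
import math
from statistics import mean

def category_score(rater_scores):
    """Average the raters' band scores for one category, rounding down."""
    # Assumption: uneven averages (e.g. 6.5) are floored to the lower band
    return math.floor(mean(rater_scores))

def overall_score(fluency, content, form):
    """Average the three category scores and round to the nearest band."""
    avg = (fluency + content + form) / 3
    # With whole-number inputs the fractional part is 0, .333 or .667,
    # so adding 0.5 and flooring rounds .333 down and .667 up.
    return math.floor(avg + 0.5)

# Hypothetical ratings from two raters on one script
fluency = category_score([6, 7])   # average 6.5, floored to 6
content = category_score([6, 6])   # 6
form = category_score([7, 7])      # 7
print(overall_score(fluency, content, form))  # average 6.333, prints 6
```

Because each category score is a whole band, the average of three categories can never end in exactly .5, which is why only the .333 and .667 cases need a rule.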

5.5.3.3 Data analysis:

5.5.3.3.1 Accuracy:

As mentioned in the pilot study, the measure chosen for accuracy was the percentage of error-free t-units. This therefore involved identifying both t-unit boundaries and errors. As these variables cannot be coded with the aid of computer programs (Sylvianne Granger, personal communication), both had to be coded manually. To save time, t-units were coded in combination with clause boundaries (see grammatical complexity) and errors were coded in combination with spelling mistakes and punctuation mistakes (see mechanics).

After coding t-unit boundaries and errors, all error-free t-units were recorded in an SPSS (Statistical Package for the Social Sciences) spreadsheet. To make the variable more meaningful, the percentage of error-free t-units was calculated by dividing the number of error-free t-units by the total number of t-units. A second coder was then involved to ensure inter-rater reliability by double-coding a subset of the whole sample (50 scripts). A Pearson correlation coefficient was calculated using SPSS.
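Both calculations are simple enough to sketch in a few lines. The study itself used SPSS; the Python below is only an illustration, with hypothetical function names and invented coder counts.

```python
import math

def percent_error_free(error_free_t_units, total_t_units):
    """Percentage of t-units in a script that contain no errors."""
    if total_t_units == 0:
        raise ValueError("script has no t-units")
    return 100.0 * error_free_t_units / total_t_units

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical error-free t-unit counts from two coders on five scripts
coder_a = [12, 8, 15, 5, 10]
coder_b = [11, 8, 14, 6, 10]
print(percent_error_free(12, 20))  # 60.0
print(round(pearson_r(coder_a, coder_b), 3))
```

A coefficient close to 1 would indicate that the two coders ranked the double-coded scripts in nearly the same way.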

5.5.3.3.2 Temporal Fluency:

Temporal fluency was operationalised as the number of words written. This was established using a Perl program specifically produced for this task. The output of the Perl program consists of the script number from 0 to 601 in one column and the number of words in the script in the adjacent column. The output is displayed in TextPad (a free downloadable program for Windows) and can then easily be transferred into Excel or SPSS spreadsheets. Perl was chosen for this task because, instead of the laborious task of checking the number of words in each individual script through the Microsoft Word Tools menu, it performs the analysis within seconds. Because this variable was analysed by a computer program, double rating was unnecessary. However, it should be mentioned that as part of the design process of the Perl program, a number of spot checks were carried out to ensure that the program was working as required.
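The original Perl program is not reproduced in this chapter. The following Python sketch shows one way such a two-column word count could be produced; it assumes that a word is any whitespace-separated token, and the script texts and function names are hypothetical.

```python
def count_words(text):
    """Count whitespace-separated tokens in a typed script."""
    return len(text.split())

def word_count_table(scripts):
    """Return (script_id, word_count) rows sorted by script id."""
    return [(script_id, count_words(text))
            for script_id, text in sorted(scripts.items())]

# Hypothetical typed scripts keyed by script number
scripts = {
    1: "The graph shows the number of departures.",
    2: "Overall, departures rose after 2001.",
}
# Tab-separated output pastes directly into a spreadsheet
for script_id, n_words in word_count_table(scripts):
    print(f"{script_id}\t{n_words}")
```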

5.5.3.3.3 Repair Fluency:

The variable chosen to analyse repair fluency was the number of self-corrections.

The self-corrections were operationalised as described in the pilot study. To ensure inter-rater reliability, this variable was double rated in 50 scripts and a Pearson correlation coefficient was calculated using SPSS.

5.5.3.3.4 Grammatical complexity:

Grammatical complexity, as mentioned in the previous section, was operationalised as the number of clauses per total number of t-units. Both clauses and t-units were coded manually. A clause boundary can occur between an independent clause (a clause that can stand by itself) and a dependent clause (a clause which cannot stand by itself), or between two dependent clauses. However, as with the coding of t-units described above, the clause boundaries were sometimes hard to define because some of the writers had not achieved a high level of accuracy in their writing. As with the t-units, the decision was made that a clause needed to contain a subject and a main verb to count as a clause. Therefore, the sentence ‘the graph shows that the amount of departures after 2001 big’ was counted as just one t-unit with no clause attached because the verb was missing in the second part. Again, a second coder was used to code a subset of the whole sample (50 scripts) to ensure inter-coder reliability. Then, a Pearson correlation coefficient was calculated using SPSS.

5.5.3.3.5 Lexical complexity:

Lexical complexity was coded into three variables: firstly, sophisticated lexical words per total lexical words; secondly, the average length of words; and finally, the number of AWL words. The variable sophisticated lexical words per total lexical words was analysed with the help of the computer program Web VocabProfile (Cobb, 2002), which is an adaptation of Heatley and Nation’s Range (1994).

Before the data were entered into VocabProfile, all spelling mistakes were corrected. This was done because the program would not be able to recognise misspelled words and would therefore move them into the Off-List wordlist. The rationale behind including these words in the analysis was that the writer had attempted the items, but was just not able to spell them correctly. Items of vocabulary that were too unclear to be corrected were excluded from the analysis.

The sophisticated lexical words were taken from the tokens of the AWL (Academic Word List) and the Off-List word tokens. However, as the Off-List words also included abbreviations and words like ‘Zealander’ from New Zealander, this list was first scanned and then only the ‘real’ Off-List words were included in the analysis. The Off-List words could be investigated easily because lower down the screen, each token of the Off-List words was given. The number of sophisticated lexical words was then divided by the total number of content words. As the number of content words was not stated in the output of VocabProfile, the value for lexical density had to be used. Lexical density is defined as the number of content words divided by the total number of words. Therefore, it was quite straightforward to arrive at the number of content words (i.e. by multiplying the value of lexical density by the total number of words). Because the variable sophisticated lexical words over total lexical words was analysed with the aid of the computer program VocabProfile, no inter-rater reliability check was deemed necessary.
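The arithmetic behind this variable can be illustrated with a short sketch. The function names and the figures are hypothetical; the inputs stand for values read off the VocabProfile output after the Off-List tokens have been screened by hand.

```python
def content_word_count(lexical_density, total_words):
    """Recover the number of content words from lexical density."""
    # lexical density = content words / total words
    return lexical_density * total_words

def sophisticated_ratio(awl_tokens, off_list_tokens, lexical_density, total_words):
    """Sophisticated lexical words per content word for one script."""
    content_words = content_word_count(lexical_density, total_words)
    if content_words == 0:
        raise ValueError("script has no content words")
    return (awl_tokens + off_list_tokens) / content_words

# Hypothetical values for one script
ratio = sophisticated_ratio(awl_tokens=12, off_list_tokens=8,
                            lexical_density=0.5, total_words=200)
print(ratio)  # 0.2
```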

The second variable that was investigated for lexical complexity was the average length of words. This was done completely automatically, again using a Perl script specifically designed for the task. The Perl program was written so that it identified the number of characters in each script, as well as the number of spaces between characters. Before this count, the Perl script disregarded all punctuation marks (so that they were not added into the final count where they might inflate the length of words). To arrive at the final average word length for each script, the number of characters was divided by the number of spaces between words. As this was done completely automatically, no inter-rater reliability check was deemed necessary. The Perl program was, however, thoroughly checked for any mistakes before it was used.
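A comparable calculation can be sketched in Python. This is not the Perl script used in the study: the function name is hypothetical, only ASCII punctuation is stripped, and the character total is divided by the number of words rather than by a literal count of spaces (a script of n words contains n - 1 spaces between words, so the two divisors differ by one).

```python
import string

def average_word_length(text):
    """Mean number of characters per word, ignoring punctuation."""
    # Remove ASCII punctuation so it does not inflate word length.
    # Note: this also removes hyphens and apostrophes inside words.
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    words = stripped.split()
    if not words:
        raise ValueError("script contains no words")
    n_chars = sum(len(word) for word in words)
    return n_chars / len(words)

print(average_word_length("The graph shows a rise."))  # 3.6
```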

Finally, the number of words from the Academic Word List was recorded in the spreadsheet. This was also taken from the output of VocabProfile.

5.5.3.3.6 Mechanics:

The first group of variables examined for mechanics was the number of errors in each script for spelling and punctuation. They were coded at the same time as the rest of the errors (i.e. the types of errors analysed for accuracy). Each of these was defined as described in the methodology section of the pilot study. A second rater rated a subset of the data (50 scripts) and Pearson correlation coefficients were calculated for each of the variables using SPSS.

Paragraphing was coded as described in the pilot study. Double-coding of a subset of 50 scripts was undertaken and a Pearson correlation coefficient was calculated to ensure inter-rater reliability.

5.5.3.3.7 Coherence:

Using the categories established in the pilot study, the scripts were coded manually. The same t-unit breaks as for accuracy were used. Inter-rater reliability was established by having a second coder rate a subset of 50 scripts and then calculating a Pearson correlation coefficient in SPSS.

5.5.3.3.8 Cohesion:

The variable chosen to investigate cohesion was the number of anaphoric pronominals (e.g. this, that, these) used by the writer. The pronominals used in the main analysis are listed in Appendix 1. The decision was made that instead of hand-coding these in the 601 writing scripts, with the risk of missing some due to human error, a concordancing program would be used to search for each of these pronominals individually. The concordancer chosen for this task was MonoConc Pro Concordance Software Version 2.2 (Barlow, 2002).

MonoConc not only displays the concordancing lines, but also displays as much context as is requested. This proved invaluable, because many of the words identified were not anaphoric pronominals and thus were not acting as cohesive devices as described by Halliday and Hasan (1976). Although this method of data analysis has the advantage that it saves time compared to the manual method, it still proved time-consuming in the sense that all instances of the words in the concordance needed to be checked in the top window, to eliminate all occasions where the word was not used as a cohesive device. For example, when counting the use of those, all instances of those as in those of us needed to be discarded, as well as the those used in the sense of those people that I am familiar with. After pronominals that were not used as cohesive devices were discarded, the next step was to assess whether the referent referred to by the pronominal was in fact over the clause boundary, in accordance with the definition adopted for cohesive devices. This excluded a number of possessive pronominals occurring in the same clause as the referent, as for example the use of its in ... the motor vehicle crashes declined to half its number....
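The kind of keyword-in-context display that a concordancer provides can be approximated with a regular expression search. The sketch below is not MonoConc and makes no claim about how it works internally; the function name, the context width and the sample text are all hypothetical. It only retrieves candidate hits, so the manual judgement described above (is the word really acting as a cohesive device?) still has to follow.

```python
import re

def concordance(text, word, width=30):
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0), right))
    return lines

# Hypothetical script text
text = ("Crashes declined after 2001. This suggests that the campaign worked. "
        "Those of us who drive noticed this change.")
for left, hit, right in concordance(text, "this"):
    print(f"{left:>30} [{hit}] {right}")
```

Matching on word boundaries keeps a search for his from returning hits inside this, and the right-hand context makes non-cohesive uses such as those of us easy to spot and discard.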

Following this procedure, each pronominal was recorded and entered into an SPSS spreadsheet next to the relevant script number. The next step was to exclude all pronouns that occurred fewer than 50 times in all scripts. This was done because it was not deemed useful to include very rare items in a rating scale. Therefore the following words were excluded from any further analysis: here, its, those, his, her, she and he. Then the results for each pronoun were correlated with the final score awarded by the DELNA raters. Finally, an inter-rater reliability check was undertaken by double-rating 50 scripts and calculating a Pearson correlation coefficient.
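The frequency threshold can be expressed as a short filter over the per-script counts. The sketch below is illustrative only: the function name and the counts are invented, and the study itself worked from an SPSS spreadsheet.

```python
from collections import Counter

def frequent_items(counts_per_script, min_total=50):
    """Keep items whose total frequency across all scripts reaches min_total."""
    totals = Counter()
    for counts in counts_per_script.values():
        totals.update(counts)
    # Items occurring fewer than min_total times overall are dropped
    return {item for item, total in totals.items() if total >= min_total}

# Hypothetical per-script counts of anaphoric pronominals
counts_per_script = {
    1: {"this": 30, "these": 25, "its": 2},
    2: {"this": 40, "these": 26, "its": 1},
}
kept = frequent_items(counts_per_script)
print(sorted(kept))  # ['these', 'this']
```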

5.5.3.3.9 Reader-Writer Interaction:

Reader-Writer interaction was investigated by using MonoConc (Barlow, 2002), which was described in the previous section on cohesion. The structures investigated in this category were allocated to four groups: hedges, boosters, markers of writer identity and the passive voice. The complete list of items investigated was established based on a review of the literature and can be found in Appendix 1. Each lexical item was investigated individually using MonoConc. Here special care needed to be taken, so that lexical items that did not function as hedges or boosters were excluded from the analysis. For example, in the case of the booster certain, all uses of certain + noun needed to be excluded as this structure does not act as a boosting device. In the case of the lexical item major, all uses of the word in conjunction with cities or axial routes, for example, needed to be excluded because these were also not used as boosters. So for each lexical item in Appendix 1, the whole concordancing list produced in MonoConc needed to be thoroughly examined before each instance of that item could be entered into a spreadsheet. Finally, all items were added together, so that a final frequency count for each script was found for hedges, boosters and markers of writer identity. The passive voice was initially also investigated using MonoConc. However, because it was impossible to search for erroneous instances of the passive (i.e. unsuccessful attempts), this analysis was later refined by a manual search.

Finally, all four variables investigated in the category of reader-writer interaction underwent an inter-rater reliability check. Fifty scripts were coded by a second rater and a Pearson correlation coefficient was calculated.

5.5.3.3.10 Content:

Using the scoring scheme described in the pilot study, the scripts were manually coded. A second rater was used to ensure inter-rater reliability by scoring a subset of 50 scripts. A Pearson correlation coefficient was calculated using SPSS.

5.5.3.4 Data analysis: Inferential statistics

To ascertain that any differences found between different DELNA writing levels did not occur purely due to sampling variation, each measure in the analysis was subjected to an Analysis of Variance (ANOVA). A number of assumptions underlie an ANOVA (A. Field, 2000; Wild & Seber, 2000). The first assumption relates to independence of samples. This assumption is satisfied in this situation, as no writing script is repeated in any of the groups (DELNA band levels) compared.

The second assumption stipulates that the sample should be normally distributed. However, according to Wild & Seber (2000, p. 452), ANOVA is robust enough to cope with departures from this assumption. Furthermore, because most groups in this analysis were very large, we can rely on the central limit theorem, which stipulates that the means of large samples are approximately normally distributed.

The third assumption stipulates that the groups compared should have equal variances. This is the most important assumption relating to ANOVA. Wild & Seber (2000) suggest that this can be tested by ensuring that the largest standard deviation is no more than twice as large as the smallest standard deviation. If the variances were found to be unequal following this analysis, a Welch test (Welch’s variance-weighted ANOVA) was used. This test is robust enough to cope with departures from the assumption of equality of variances and performs well in situations where group sizes are unequal. The post hoc test used for all analyses was the Games-Howell procedure. This test is appropriate when variances are unequal or when variances and group sizes are unequal (A. Field, 2000, p. 276).

This was found to be the most appropriate test of pair-wise comparisons because in all cases the groups were unequal (with DELNA band levels 4 and 8 having fewer cases than band levels 5, 6 and 7).
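The decision rule and the Welch statistic can be sketched as follows. The analyses in this study were run in SPSS; the Python below is a hedged illustration only, using the standard textbook formula for Welch's variance-weighted ANOVA. The function names and band-level values are hypothetical, no p-value is computed (that requires the F distribution), and the Games-Howell post hoc procedure is not implemented.

```python
from statistics import mean, stdev, variance

def variances_roughly_equal(groups, max_ratio=2.0):
    """Rule of thumb: largest SD no more than max_ratio times the smallest."""
    sds = [stdev(g) for g in groups]
    return max(sds) <= max_ratio * min(sds)

def welch_anova(groups):
    """Return (F, df1, df2) for Welch's variance-weighted one-way ANOVA."""
    k = len(groups)
    # Each group is weighted by n / s^2, so noisy groups count for less
    weights = [len(g) / variance(g) for g in groups]
    total_weight = sum(weights)
    means = [mean(g) for g in groups]
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight
    between = sum(w * (m - weighted_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (len(g) - 1)
              for w, g in zip(weights, groups))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2

# Hypothetical values of one measure at three band levels
band_5 = [40.0, 45.0, 50.0, 55.0]
band_6 = [50.0, 60.0, 62.0, 70.0, 75.0]
band_7 = [65.0, 66.0, 90.0]
groups = [band_5, band_6, band_7]
if variances_roughly_equal(groups):
    print("SD ratio within 2:1, a standard ANOVA is defensible")
else:
    f_stat, df1, df2 = welch_anova(groups)
    print(f"Welch F({df1}, {df2:.2f}) = {f_stat:.2f}")
```

With only two groups the Welch F reduces to the square of Welch's t statistic, which gives a convenient check on the implementation.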

Whilst pair-wise post hoc comparisons were performed for each measure, it was not deemed important for each measure to achieve statistical significance between each adjacent level. Pair-wise comparisons between adjacent levels are, however, briefly mentioned in the results chapter.

After the ANOVAs and pair-wise post hoc comparisons had been computed, it came to my attention that a MANOVA analysis would be more suitable for this

Source: Ute Knoch, Diagnostic Writing Assessment: The Development and Validation of a Rating Scale (LTE 17). Peter Lang, pp. 129-139.