Existing Error Classification Schemes - A Statistical Model of Error Correction for Computer Assisted Language Learning Systems

The data that I collected will be referred to as the learner corpus. The learner corpus consists of grammatical and ungrammatical sentences written by language learners. Error analysis can be performed quantitatively by investigating the learners’ errors in the corpus data. Jie (2008) lists three reasons why error analysis is important in SLA. Firstly, learners’ errors tell teachers how far towards the goal the learners have progressed and what remains for them to learn. Secondly, learners’ errors provide researchers with evidence of how language is learned or acquired. Lastly, errors are a means by which learners test hypotheses about their interlanguage. In my study, I analysed my corpus data firstly to identify the most frequent errors committed by students, and secondly to investigate the performance of the students across form levels. This relates to the first aspect indicated by Jie. As a first step in the analysis, I went through each sentence in the corpus and corrected any syntax errors I found. To do this, an error classification scheme was required as a reference for annotating all the located errors.

The creation of error classification schemes relies on error taxonomies, which contain the categories used for error classification. According to James (1998), error taxonomies should include two dimensions:

• a linguistic category classification, which represents the linguistic features of a learner error, for example tense, grammar, lexis, etc., and

• a target modification taxonomy, which accounts for the actions needed to correct learners’ errors, for instance insertion, deletion, replacement, reordering, etc.

In the following subsections, I will explain three existing error classification schemes used to annotate learners’ errors in corpora: those of the Cambridge Learner Corpus, the National Institute of Information and Communications Technology Japanese Learner of English corpus, and the FreeText project. I use the terms “error classification scheme” and “error coding scheme” interchangeably, and likewise the terms “error codes” and “error tags”. In the last subsection, I will discuss a spelling correction technique described by Kukich (1992), which I adopted in my error classification scheme.

3.3.1 The Cambridge Learner Corpus

The Cambridge Learner Corpus (CLC) is a collection of essay examination scripts written by learners for whom English is a second or foreign language (Nicholls, 2003). The scripts are transcribed and span 8 EFL examinations covering both general and business English. According to Nicholls (2003), the CLC is growing, but at the time of reporting it consisted of 16 million words, of which only a 6-million-word component had been error coded. The coding of errors is based on an error classification scheme developed at Cambridge University Press.

The error classification scheme covers 80 types of error and has 31 error codes. Each error code is represented using an eXtensible Markup Language (XML) style convention, as shown below:

<#CODE>wrong word|corrected word</#CODE>

In most of the error codes, the CODE in <#CODE> is based on a two-letter system: the first letter represents the general type of error, and the second identifies the word category of the required word. The general types of error are a wrong form used (F), a missing word (M), a word that needs replacing (R), an unnecessary word (U) and an incorrectly derived word (D). Besides the general types, other types of error included are countability (C), false friend (FF), and agreement (AG) errors. Some additional errors, such as spelling error (S), American spelling (AS), wrong tense of verb (TV) and incorrect verb inflection (IV), are also included.

There are 9 word categories: pronoun (A), conjunction (C), determiner (D), adjective (J), noun (N), quantifier (Q), preposition (T), verb (V), and adverb (Y). Punctuation errors are also included and are represented by P as the second letter of the error code, following one of the general types M, R or U as the first letter. Below is an example of a sentence with a correction using the CLC error classification scheme (Nicholls, 2003, p. 576):

So later in the evening I felt <#RY>hardly|seriously</#RY> ill.

The above error code annotation means “Replace (R) the adverb (Y) “hardly” with a more appropriate adverb, “seriously””. The CLC has only one punctuation category, which caters for all types of punctuation. The two-letter error code system is a flat representation, which means the CLC does not allow errors to be identified at different levels of specificity. Flat annotation is unsuitable for including additional interpretations of errors: once annotations have been added alongside the errors, additional layers of annotation cannot be inserted (Díaz-Negrillo and Fernández-Domínguez, 2006).
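To illustrate how such annotations can be processed, the following is a minimal Python sketch that extracts CLC-style error codes from a sentence and produces the corrected sentence. It is not part of the CLC tooling: the names CLC_TAG, describe and parse_clc are hypothetical, and the word category table deliberately covers only a subset of the categories listed above.

```python
import re

# General error types, as listed in the text above.
GENERAL_TYPES = {
    "F": "wrong form",
    "M": "missing word",
    "R": "replace word",
    "U": "unnecessary word",
    "D": "incorrectly derived word",
}
# Illustrative subset of the word categories (second letter of the code).
WORD_CATEGORIES = {
    "C": "conjunction", "D": "determiner", "N": "noun",
    "Q": "quantifier", "V": "verb", "Y": "adverb",
}
# Codes that do not follow the two-letter system.
SPECIAL_CODES = {
    "C": "countability", "FF": "false friend", "AG": "agreement",
    "S": "spelling", "AS": "American spelling",
    "TV": "wrong tense of verb", "IV": "incorrect verb inflection",
}

# Matches <#CODE>wrong|corrected</#CODE>; the closing code must equal the opening one.
CLC_TAG = re.compile(
    r"<#(?P<code>[A-Z]+)>(?P<wrong>[^|<]*)\|(?P<right>[^<]*)</#(?P=code)>"
)


def describe(code):
    """Spell out an error code using the tables above."""
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    if len(code) == 2 and code[0] in GENERAL_TYPES:
        category = WORD_CATEGORIES.get(code[1], "unlisted category")
        return f"{GENERAL_TYPES[code[0]]} ({category})"
    return "unknown code"


def parse_clc(sentence):
    """Return the list of (code, wrong, corrected) triples and the corrected sentence."""
    errors = [(m["code"], m["wrong"], m["right"]) for m in CLC_TAG.finditer(sentence)]
    corrected = CLC_TAG.sub(lambda m: m["right"], sentence)
    return errors, corrected
```

Because the representation is flat, each annotation yields exactly one code, and there is no place in the parsed result for a second, more specific layer of interpretation.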

3.3.2 The National Institute of Information and Communications Technology Japanese Learner of English

The National Institute of Information and Communications Technology Japanese Learner English (NICT JLE) corpus is a two-million-word speech corpus of Japanese learners of English (Izumi, Uchimoto, and Isahara, 2005). It is drawn from 1,281 audio-recorded speech samples of an English oral proficiency interview test, the ACTFL-ALC Standard Speaking Test (SST). The NICT JLE error classification scheme has 46 error tags, each of which carries three pieces of information: a POS, a morphological/grammatical/lexical (MGL) rule, and a corrected form (Izumi et al., 2005, p. 75). As in the CLC, error codes in the NICT JLE are represented in an XML form. The general form of an error tag is:

<P_G crr="corrected word">wrong word</P_G>

The P symbol stands for a POS (e.g. n for noun) and the G symbol for an MGL rule (e.g. num for number, which belongs to the grammatical system). There are 11 POS categories in the NICT JLE: noun, verb, modal verb, adjective, adverb, preposition, article, pronoun, conjunction, relative pronoun, and interrogative. In addition, there is one more error category, named Others. This category covers errors such as Japanese English, collocation, misordering of words, unknown type errors, and unintelligible utterances.

Below is an example of a sentence with a correction using the NICT JLE error classification scheme (Izumi et al., 2005, p. 75):

I belong to two baseball <n_num crr="teams">team</n_num>.

The NICT JLE does not cater for any punctuation errors. As suggested by James (1998), an error classification scheme must include two dimensions of error taxonomy. However, the NICT JLE includes only one, the linguistic category classification. The excluded dimension is the target modification taxonomy, with the exception of one error code, Misordering of words. The NICT JLE is also considered L2-biased because it has a relative pronoun tagset which only occurs in the Japanese language. Like the CLC, the NICT JLE error code representation is flat and does not allow errors to be identified at different levels of detail.
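The three pieces of information carried by a NICT JLE tag can be recovered mechanically. The following Python sketch is an illustration only and is not taken from the NICT JLE tooling; the names NICT_TAG and parse_nict are hypothetical. As an assumption, the pattern accepts either an underscore or a space between the POS and rule parts of the tag name, and either straight or typographic quotation marks around the corrected form, so that it is tolerant of how the tag is typeset. It handles only tags that have both a POS and a rule part.

```python
import re

# <POS_RULE crr="corrected">wrong</POS_RULE>, with some tolerance for spacing.
NICT_TAG = re.compile(
    r'<\s*(?P<pos>[a-z]+)[_ ](?P<rule>[a-z0-9]+)\s+crr\s*=\s*["“”](?P<crr>[^"“”]*)["“”]\s*>'
    r'(?P<wrong>[^<]*)'
    r'<\s*/\s*(?P=pos)[_ ](?P=rule)\s*>'
)


def parse_nict(sentence):
    """Return the annotated errors and the sentence with corrections applied."""
    errors = [
        {
            "pos": m["pos"],                # e.g. n for noun
            "rule": m["rule"],              # e.g. num for number
            "wrong": m["wrong"].strip(),    # the learner's form
            "correction": m["crr"],         # value of the crr attribute
        }
        for m in NICT_TAG.finditer(sentence)
    ]
    corrected = NICT_TAG.sub(lambda m: m["crr"], sentence)
    return errors, corrected
```

Note that the parsed result contains a linguistic category (POS and rule) but no field saying which operation (insertion, deletion, replacement) is needed, which mirrors the missing target modification dimension discussed above.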

3.3.3 The FreeText System

The FreeText system is an error annotation system used to annotate the French Interlanguage Database (FRIDA) corpus. The FRIDA corpus contains a large collection of intermediate to advanced L2 French writing. It contains 450,000 words, of which only two-thirds had been completely error annotated at the time of Granger’s (2003) report. FreeText consists of three levels of annotation: error domain, error category, and word category. The error domain specifies whether the error is formal, grammatical, lexical, and so forth. There are nine error domains: form (<F>), morphology (<M>), grammar (<G>), lexis (<L>), syntax (<X>), register (<R>), style (<Y>), punctuation (<Q>), and typo (<Z>). At the second level, each error domain has its own error categories; for example, the morphology domain has 6. The number of error categories per domain ranges from 2 to 10, with a total of 36 categories. The exception is the typo (<Z>) domain, which has no error categories.

The word category consists of a POS type drawn from 11 major categories: adjective, adverb, article, conjunction, determiner, noun, preposition, pronoun, verb, punctuation, and sequence. The number of POS sub-categories per major category ranges from 1 to 12, with a total of 54 sub-categories.

For each error, an annotator has to select 3 tags, one from each of the three groups (the error domain, the error category and the word category). There are 9 tags from the error domain, 36 tags from the error categories and 55 POS tags. Therefore, in total there are about 100 error tags. Below is an example of a sentence with a correction using FreeText (Granger, 2003, p. 470):

L’héritage du passé est très <G><GEN><ADJ>#fort$forte </ADJ></GEN></G> et le sexisme est toujours présent.
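The three nested annotation levels in the example above can be unpacked as in the following Python sketch. It is an illustration rather than part of FreeText; the names FREETEXT_TAG and parse_freetext are hypothetical. It assumes that the form after # is the correction and the form after $ is the learner's original form, which is consistent with héritage being a masculine noun (so that fort, not forte, is the correct adjective). Only the error domain is spelled out, since the category and word category labels are not enumerated in the text.

```python
import re

# Error domains, as listed in the text above.
ERROR_DOMAINS = {
    "F": "form", "M": "morphology", "G": "grammar", "L": "lexis",
    "X": "syntax", "R": "register", "Y": "style",
    "Q": "punctuation", "Z": "typo",
}

# <DOMAIN><CATEGORY><POS>#correction$learner form</POS></CATEGORY></DOMAIN>
# Only full three-level annotations are matched.
FREETEXT_TAG = re.compile(
    r"<(?P<domain>[A-Z])><(?P<category>[A-Z]+)><(?P<pos>[A-Z]+)>"
    r"\s*#(?P<correction>[^$<]*)\$(?P<learner>[^<]*)"
    r"</(?P=pos)></(?P=category)></(?P=domain)>"
)


def parse_freetext(sentence):
    """Return the annotated errors and the sentence with corrections applied."""
    errors = [
        {
            "domain": ERROR_DOMAINS.get(m["domain"], "unknown domain"),
            "category": m["category"],
            "pos": m["pos"],
            "learner_form": m["learner"].strip(),
            "correction": m["correction"].strip(),
        }
        for m in FREETEXT_TAG.finditer(sentence)
    ]
    corrected = FREETEXT_TAG.sub(lambda m: m["correction"].strip(), sentence)
    return errors, corrected
```

Unlike the flat CLC and NICT JLE codes, each parsed error here carries three separate fields, one per annotation level.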

3.3.4 Spelling Correction Techniques

In this section, I will explain a technique from automatic spelling correction which I adopted in the creation of my error classification scheme. In her paper, Kukich (1992) discusses the state of the art of techniques for correcting spelling errors in three areas of research: nonword error detection, isolated-word error correction and context-dependent word correction.

For nonword error detection, efficient pattern matching and n-gram analysis techniques have been developed for detecting strings that do not appear in a given word list. Context-dependent word correction uses NLP tools. Isolated-word correction focuses on detailed studies of spelling error patterns. Kukich identifies four common error types in isolated-word correction: insertion of a character, deletion of a character, substitution of one character for another, and transposition of two adjacent characters. These four error correction types are similar to the target modification taxonomy, which is discussed in the next section on the creation of my error classification scheme (§3.4). I therefore decided to adopt them as error codes in my error classification scheme. See Table 3.5 for a comparison between word level and syntax level error correction.

Table 3.5: Common error types identified by Kukich (1992)

Correction type | Word Level | Syntax Level
Insert correction | speling → spelling | I from Malaysia. → I am from Malaysia.
Delete correction | scarry → scary | My parents they are kind. → My parents are kind.
Substitution correction | sorri → sorry | My city is peace. → My city is peaceful.
Transposition correction | taht → that | I like that car blue. → I like that blue car.
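The four correction types in Table 3.5 can be told apart with a simple comparison of the erroneous and the corrected form. The following Python sketch is my own illustration and not an algorithm from Kukich (1992); the function name classify_single_edit is hypothetical. It assumes that exactly one correction is needed and returns None otherwise.

```python
def classify_single_edit(error, target):
    """Name the single correction that turns `error` into `target`.

    Works on any sequence that supports slicing and concatenation, so it
    accepts strings (characters) as well as lists of words (tokens).
    Returns None if the two forms are equal or differ by more than one edit.
    """
    if error == target:
        return None

    # Insert correction: the target has exactly one extra element.
    if len(target) == len(error) + 1:
        for i in range(len(target)):
            if target[:i] + target[i + 1:] == error:
                return "insertion"

    # Delete correction: the error has exactly one extra element.
    if len(target) == len(error) - 1:
        for i in range(len(error)):
            if error[:i] + error[i + 1:] == target:
                return "deletion"

    if len(target) == len(error):
        diffs = [i for i in range(len(error)) if error[i] != target[i]]
        # Substitution correction: one position differs.
        if len(diffs) == 1:
            return "substitution"
        # Transposition correction: two adjacent positions are swapped.
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and error[diffs[0]] == target[diffs[1]]
            and error[diffs[1]] == target[diffs[0]]
        ):
            return "transposition"

    return None
```

Applied to the word level examples in Table 3.5, classify_single_edit("speling", "spelling") gives "insertion" and classify_single_edit("taht", "that") gives "transposition". The same function can be applied at the syntax level by passing lists of words instead of strings, for example "I from Malaysia".split() and "I am from Malaysia".split(), provided punctuation is separated from the words first.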

In document A Statistical Model of Error Correction for Computer Assisted Language Learning Systems (Page 91-96)