4.6 An Evaluation of the Proposed Model
4.6.3 Parameters
4.6.3.3 Building separate error models for different sentence types
I then create several error models for specific types of sentence. The errors found in a sentence may depend on the type of sentence it is, for instance whether it is in the present or the past tense. While these specific models have less data, the data should be of a higher quality. This may result in an improved model overall.
Several error models are built to represent different types of sentence. Whereas PN-label-backoff is a model which suffers less from the sparse data problem, each error model built here is restricted to sentences of one type, which represents a specific type of context. The contexts are distinguished by tense form, type of pronoun, and type of question posed.
Error model 3: Tense I built three error models, by dividing the Comprehensive model into three subsets: Present-tense, Past-tense and Future-tense.
Error model 4: Pronoun I built four error models, by dividing Comprehensive into four subsets. These four error models are categorised by person and grammatical number. The person is either 1st person or 3rd person, and the grammatical number is either Singular or Plural. Table 4.3 lists the four error models together with the pronouns used in the sentences of each sub-model.
Table 4.3: Pronoun types
              Plural              Singular
1st person    we                  I
3rd person    they, my parents    he, she, my best friend
Table 4.4: Question types
Wh-questions                                  Open-ended questions
Which city are you from?                      Tell me about your city.
What did your best friend do last weekend?    Describe your parents.
What will your parents do this evening?       Describe what your father does in his job.
Error model 5: Question Types I built two error models, by dividing Comprehensive into two subsets. The first subset consists of responses to Wh-questions (Wh-q) and the second consists of responses to open-ended questions (Open-q). Table 4.4 shows some sample questions of both types.
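The division of Comprehensive into subsets can be pictured as a simple grouping step. The sketch below is illustrative only: the function `split_by`, the record layout and the tag names are hypothetical, and it assumes that each item has already been labelled with its tense, pronoun category and question type. The four prompts are taken from Table 4.4; the tags attached to them are my own reading of those prompts and of Table 4.3, not labels given in the corpus.

```python
from collections import defaultdict

def split_by(records, key):
    """Group records into subsets according to the value stored under `key`."""
    subsets = defaultdict(list)
    for record in records:
        subsets[record[key]].append(record)
    return dict(subsets)

# Hypothetical tagged items; "pronoun" is the category expected in the response.
comprehensive = [
    {"prompt": "Which city are you from?",
     "tense": "present", "pronoun": "1st-singular", "question": "wh-q"},
    {"prompt": "What did your best friend do last weekend?",
     "tense": "past", "pronoun": "3rd-singular", "question": "wh-q"},
    {"prompt": "What will your parents do this evening?",
     "tense": "future", "pronoun": "3rd-plural", "question": "wh-q"},
    {"prompt": "Describe your parents.",
     "tense": "present", "pronoun": "3rd-plural", "question": "open-q"},
]

tense_subsets = split_by(comprehensive, "tense")        # basis for Error model 3
pronoun_subsets = split_by(comprehensive, "pronoun")    # basis for Error model 4
question_subsets = split_by(comprehensive, "question")  # basis for Error model 5
```

Each resulting subset would then be used to build one sub-model in the same way as the Comprehensive model.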
4.6.3.4 Mining errors from the learner corpus: Training data sets
Types of training data are differentiated by how many n-gram perturbations are generated. There are three types of training data set, as listed below.
Type 1 This training data set consists of n-gram perturbations generated from the original sentences of sentence perturbations which contain exactly one error. These are likely to be the most accurately identified error n-gram perturbations, as argued in §4.4.1.
Type 2 This training data set consists of n-gram perturbations generated from the original sentences of sentence perturbations which have one or more errors, excluding those whose errors are adjacent to each other. These are likely to be a little less accurate, but the training set will be larger.
Type 3 This training data set consists of n-gram perturbations generated from the original sentences of sentence perturbations which have one or more errors, including those whose errors are adjacent to each other. These are likely to be the least accurately identified error n-gram perturbations. See §4.4.2 for how I generate trigram perturbations from sentence perturbations with multiple errors in which the errors are adjacent to each other.
Let me give an example of the contents of Type 1, Type 2 and Type 3 respectively. Suppose a sentence perturbation corpus consists of the following sentence perturbations:
(60) a. (“I from Malaysia”, “I am from Malaysia”)
     b. (“I Malaysia”, “Malaysia”)
     c. (“He sales many fishes at the market”, “He sells many fish at the market”)
     d. (“I to go school”, “I go to school”)
     e. (“He like to reading”, “He likes reading”)
     f. (“He like to reading”, “He likes to read”)
For each sentence perturbation in (60), the number of errors is determined by calculating the difference between the original and the corrected sentence. Here, I adapt the Levenshtein distance algorithm (Levenshtein, 1966). The algorithm measures the amount of difference between two strings or sentences. In (60), only (60a) and (60b) contain exactly one error. The remaining items have more than one error. Therefore, only (60a) and (60b) are chosen for generating n-gram perturbations for the Type 1 training data set.
The sentence perturbations (60c), (60e) and (60f) each contain more than one error, but in the last two the errors are adjacent to each other. As such, (60a) to (60d) are used for the Type 2 training data. (60d) is included in Type 2 because it can be corrected by a single transposition of two adjacent words. For Type 3, all the sentence perturbations in (60) are used.
Type 1 is a subset of Type 2, and Type 2 is a subset of Type 3. Consequently, Type 1 contains the fewest n-gram perturbations and Type 3 the most.
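The error counts used to select the Type 1 data can be illustrated with a word-level edit distance. The following is a minimal sketch rather than the adapted algorithm itself, whose details are not reproduced here. It assumes that sentences are split into words on whitespace and that inserting, deleting or substituting one word each costs 1. The function name `word_edit_distance` is hypothetical.

```python
def word_edit_distance(original: str, corrected: str) -> int:
    """Levenshtein distance computed over whitespace-separated words."""
    src, tgt = original.split(), corrected.split()
    # previous[j]: distance between the source words seen so far (minus the
    # current one) and the first j target words.
    previous = list(range(len(tgt) + 1))
    for i, s_word in enumerate(src, start=1):
        current = [i]
        for j, t_word in enumerate(tgt, start=1):
            cost = 0 if s_word == t_word else 1
            current.append(min(previous[j] + 1,          # delete s_word
                               current[j - 1] + 1,       # insert t_word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

# The sentence perturbations of example (60): (original, corrected).
corpus = {
    "60a": ("I from Malaysia", "I am from Malaysia"),
    "60b": ("I Malaysia", "Malaysia"),
    "60c": ("He sales many fishes at the market",
            "He sells many fish at the market"),
    "60d": ("I to go school", "I go to school"),
    "60e": ("He like to reading", "He likes reading"),
    "60f": ("He like to reading", "He likes to read"),
}

distances = {label: word_edit_distance(o, c) for label, (o, c) in corpus.items()}
# Type 1 keeps only the sentence perturbations with exactly one error.
type1 = [label for label, d in distances.items() if d == 1]
```

Under these assumptions only (60a) and (60b) have a distance of 1, which agrees with the Type 1 selection described above. Note that a plain edit distance counts the swap in (60d) as two edits, and that deciding whether errors are adjacent requires the alignment itself, not just the distance, so this sketch does not separate Type 2 from Type 3.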
4.6.4 Evaluation Procedure
I evaluate the statistical model of error correction using an n-fold cross-validation technique. This technique can be used to evaluate how effectively the statistical model can propose appropriate and grammatical corrections for an ungrammatical input sentence, based on empirically collected data. In this technique, each segment of the data serves both as training data and as test data (Manning and Schütze, 1999). The n indicates how many segments the data is partitioned into. The model is then evaluated over n rounds. Each round involves three processes: data partitioning, data analysis and data validation. The data is partitioned into two subsets: a test data set and a training data set. Data analysis is performed on the training data set. Lastly, an error model is validated based on the outcome of the analysis.
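The partitioning step of each round can be sketched as follows. This is an illustration under assumptions of my own: the function name `n_fold_splits` is hypothetical, and items are assigned to segments in an interleaved, deterministic order, whereas the actual assignment of sentences to segments may differ (for example, it may be randomised).

```python
def n_fold_splits(items, n):
    """Yield one (training, test) pair per round of n-fold cross-validation."""
    if not 2 <= n <= len(items):
        raise ValueError("n must be between 2 and the number of items")
    # Segment k receives items k, k + n, k + 2n, ...
    folds = [items[k::n] for k in range(n)]
    for k in range(n):
        test = folds[k]
        # All remaining segments together form the training data set.
        training = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield training, test

# Ten placeholder items standing in for sentence perturbations.
data = list(range(10))
rounds = list(n_fold_splits(data, 5))
```

In each round the training portion would be analysed to build an error model, and the held-out test portion would be used to validate it. Every item appears in the test set of exactly one round and in the training set of all the others.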