The learner corpora data consist of essays, which are subdivided into paragraphs. For each paragraph the original text is given, as well as error annotations. An error annotation usually consists of the position of the word or phrase in the original text that should be replaced, the type of error, and the suggested correction. We extract the following from a corpus, for each sentence in the original text:
- The tokenized original (possibly incorrect) sentence. We refer to this as the incorrect sentence.
- The tokenized correct sentence.
- A word alignment between the tokens of the correct and incorrect sentences.
5.2.1 Tokenization
An important preprocessing step is tokenization. Tokenization is the process of segmenting running text into words and sentences (Jurafsky and Martin, 2009, chap. 3). The tokenization is performed with NLTK (Bird et al., 2009), following the conventions used in the CoNLL-2013 shared task.
The first tokenization step is to split the paragraphs into sentences. Sentence splitting is based on sentence-terminating punctuation (“.”, “?” and “!”). However, in some cases periods are ambiguous, as they are also used in abbreviations, for example “Ms.” or “ex.”. Tokenizers use heuristics or machine learning algorithms to classify sentence boundaries, using surrounding words and punctuation as features. In our implementation sentence splitting is performed with NLTK punkt. This tokenizer uses heuristics that, though not error-free, give good tokenizations.
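The ambiguity of periods can be illustrated with a minimal sketch. This is not the punkt algorithm used in our implementation; it is a simplified stand-in that assumes a small, hand-written abbreviation list (the set ABBREVIATIONS and the function name split_sentences are hypothetical).

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for
# illustration only; it is not taken from NLTK punkt).
ABBREVIATIONS = {"ms.", "mr.", "mrs.", "dr.", "ex.", "e.g.", "i.e."}


def split_sentences(paragraph):
    """Split after '.', '?' or '!' followed by whitespace, unless the
    token ending at that point is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.?!]+\s+", paragraph):
        candidate = paragraph[start:match.end()].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to the abbreviation
        sentences.append(candidate)
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sentences = split_sentences("Ms. Lee arrived late. Was the meeting over? Yes!")
```

A fixed abbreviation list misses unseen abbreviations and abbreviations that happen to end a sentence, which is why practical tokenizers rely on richer context.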
Because of punctuation errors, the correct and incorrect sentences may not correspond exactly. In such cases we follow the sentence boundaries of the incorrect text. As a result, the corresponding correct “sentence” may in some cases consist of more than one sentence, or may not be a complete sentence.
Next, word tokenization is performed on each sentence. The primary goal of word tokenization (in English) is to separate punctuation from words. Tokenization conventions differ slightly between tokenizers. For example, in handling apostrophes, shouldn’t can be tokenized as either should n’t or shouldn ’t. During tokenization, quotation marks may also be normalized (there are different textual representations for opening and closing quotation marks). As the phrase structure in sentence parses is usually indicated with nested parentheses, parentheses in the text are replaced by other symbols during tokenization. For example, in PTB trees ( and ) are replaced with -LRB- and -RRB-, respectively, the acronyms denoting Left and Right Round Bracket. Our implementation uses NLTK word_tokenize. This tokenizer uses relatively simple heuristics. It does make some mistakes, but not enough to affect the performance of the system significantly. As an example, in some contexts quotation marks are not split from the words they precede or follow.
CHAPTER 5. EXPERIMENTAL SETUP 64
A large number of URLs occur in the NUCLE training data, as citations are included in some of the essays. We replace these with <url> symbols to reduce noise in the vocabulary.
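A minimal sketch of such a replacement is given below. The regular expression is an assumption made for illustration (it treats anything starting with http://, https:// or www. as a URL) and is not necessarily the pattern used in our implementation.

```python
import re

# Assumed URL pattern: a scheme or "www." prefix followed by non-space
# characters. Trailing punctuation attached to a URL is swallowed too.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def replace_urls(text, placeholder="<url>"):
    """Replace every URL-like substring with a single placeholder symbol."""
    return URL_PATTERN.sub(placeholder, text)


cleaned = replace_urls("See http://example.com/a?b=1 and www.example.org for details.")
```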
Several symbols that occur in the text data are used by Tiburon as reserved symbols. These include #, @, % and >. Therefore, these symbols should be replaced by placeholder symbols such as -HSH-, -AT-, -PRC- and -GT-.
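One possible way to perform this replacement on a tokenized sentence is sketched below. The mapping follows the placeholders named above; the function name escape_reserved is hypothetical, and replacing the symbols inside longer tokens as well is an assumption.

```python
# Placeholder symbols for the reserved characters listed in the text.
RESERVED = str.maketrans({"#": "-HSH-", "@": "-AT-", "%": "-PRC-", ">": "-GT-"})


def escape_reserved(tokens):
    """Replace reserved characters in every token by their placeholders."""
    return [token.translate(RESERVED) for token in tokens]


escaped = escape_reserved(["50", "%", "of", "users", "@", "home"])
```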
The last sentence preprocessing step is the normalization of capitalization. We convert all words to lowercase to reduce data sparsity in the constructed models. We store both the original and the lowercased versions of the sentences, so that the original capitalization can be restored on the system output after decoding. An alternative way to normalize capitalization is to use truecasing. In this method, the case of the first word of each sentence (which is always capitalized in English) is restored to its most frequent capitalization. We did perform some experiments using truecasing, but found that it does not eliminate the occurrence of both capitalized and uncapitalized versions of some words in the text.
Example 5.2.1 The tokenization and lowercasing of a sentence are given below. (1) is the original sentence, (2) the word-tokenized sentence (where all tokens are separated by spaces) and (3) the lowercased form of the tokenized sentence.
(1) Most Chinese patents are “Appearance Patents”, not “Innovation Patents”.
(2) Most Chinese patents are " Appearance Patents " , not " Innovation Patents " .
(3) most chinese patents are " appearance patents " , not " innovation patents " .
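The three forms in Example 5.2.1 can be reproduced with the sketch below. It uses a simple regular expression as a stand-in for NLTK's word tokenizer, so its behaviour differs from NLTK on other inputs (for instance, it keeps shouldn't as one token). The names QUOTES and simple_tokenize are hypothetical.

```python
import re

# Normalize curly quotation marks to their plain ASCII counterparts.
QUOTES = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})


def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    normalized = sentence.translate(QUOTES)
    # A word (optionally with internal apostrophes), or any single
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", normalized)


original = "Most Chinese patents are “Appearance Patents”, not “Innovation Patents”."
tokens = simple_tokenize(original)
lowercased = [token.lower() for token in tokens]

# Both versions are kept, so the original casing remains available.
tokenized_form = " ".join(tokens)
lowercased_form = " ".join(lowercased)
```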
5.2.2 Applying corrections
As described in Section 2.2.1, the learner corpora contain annotations for many kinds of errors. In our experiments we construct models to correct only subsets of these errors.
For the FCE corpus we use a set of nine error types classified according to word classes: pronoun, conjunction, determiner, adjective, noun, quantifier, preposition, verb and adverb errors. We exclude errors such as spelling, punctuation, word order, idiom and inappropriate register errors.
For the NUCLE corpus we consider the set of five error types used in the CoNLL-2013 shared task: article or determiner, preposition, noun number, verb form, and subject-verb agreement errors.
For both corpora we construct models to correct these error types. To construct the correct version of the training sentences, we only apply corrections for the error types that we want to correct in a specific model. Spelling and punctuation errors are corrected on the correct and incorrect sides of the training data to reduce noise that these errors may introduce into the model. Annotated errors of other error types are left uncorrected.
An alternative approach is to apply the corrections of the excluded error types to the correct and incorrect versions of the sentences. However, we decided against this in order to keep the training data realistically close to the test data, which also contain these other errors.
A disadvantage of performing the correction task for only a subset of error types is that multiple error annotations in a sentence may interact with each other, and the correction task may only involve performing some of these corrections. As a result some of the gold standard edits will not actually make a sentence more grammatical, as other edits should have been performed as well to make them sensible.
Example 5.2.2 Below we give an incorrect sentence, its error annotations, and the corresponding correct sentence. All the error annotations except the collocation error are applied. This example also shows the disadvantage of correcting only some of the annotated errors, as the phrase amounts in the billions is still incorrect.
Incorrect sentence: In countries like China and India, their population amounts to billions.
- Determiner error: their population → the population
- Collocation error: amounts → numbers
- Preposition error: to → in
- Determiner error: billions → the billions
Correct sentence: In countries like China and India, the population amounts in the billions.
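Selective application of corrections can be sketched as follows, using the sentence from Example 5.2.2. The Edit structure, the token-span representation and the error type labels ("Det", "Prep", "Colloc") are assumptions made for this illustration and do not reflect the annotation format of either corpus. Edits are assumed not to overlap.

```python
from typing import List, NamedTuple, Set


class Edit(NamedTuple):
    start: int             # index of the first token to replace
    end: int               # index one past the last token to replace
    error_type: str        # hypothetical error type label
    correction: List[str]  # replacement tokens


def apply_edits(tokens: List[str], edits: List[Edit], selected: Set[str]) -> List[str]:
    """Apply only the edits whose error type is selected.

    Edits are applied from right to left, so that the token indices of
    edits further to the left remain valid.
    """
    corrected = list(tokens)
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        if edit.error_type in selected:
            corrected[edit.start:edit.end] = edit.correction
    return corrected


incorrect = "In countries like China and India , their population amounts to billions .".split()
edits = [
    Edit(7, 9, "Det", ["the", "population"]),
    Edit(9, 10, "Colloc", ["numbers"]),
    Edit(10, 11, "Prep", ["in"]),
    Edit(11, 12, "Det", ["the", "billions"]),
]
correct = apply_edits(incorrect, edits, selected={"Det", "Prep"})
```

Adding "Colloc" to the selected set would also apply the collocation edit and yield numbers in the billions.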
5.2.3 Word alignment
The methods that we use to extract transitions or rules for our transducer models are based on word alignments. The concept of word alignments was originally developed to align words in sentence pairs used as training data for statistical machine translation models (Brown et al., 1993). We consider alignments between words in the correct and incorrect versions of sentences. The word alignment a of a sentence pair (s, t) is a set of pairs such that (i, j) ∈ a if and only if the ith word in s is aligned with the jth word in t. In contrast to SMT alignment, here most of the words will be aligned to identical words. The edits that transform one sentence into the other are given by the training data, so we use them to extract the alignments.
In our method, the first step is to align all words that do not occur in any edits one-to-one between the correct and incorrect sentences. Then each edit annotation’s incorrect and correct phrases are considered. Words that occur in both the correct and incorrect edit phrases are aligned one-to-one. This is done with a simple left-to-right search through the phrases, with the restriction that alignments may not overlap. Adding these alignments may split an edit phrase into pairs of subphrases without alignments, which may be empty on either side. If such a subphrase is empty on one side, then the words on the other side are left unaligned. But if the subphrases are non-empty on both sides, then the words on the incorrect side are all aligned to each of the words on the correct side of the subphrase. There are relatively few cases where phrases with multiple words on both the correct and incorrect sides are aligned in this way. A further refinement to this alignment procedure would be to align words with the same lexeme or POS tag.
In countries like China and India , the population amounts in the billions .
In countries like China and India , their population amounts to billions .
Figure 5.1: Word alignment between a correct sentence (top) and an incorrect sentence (bottom).
Example 5.2.3 The word alignment for Example 5.2.2 is shown visually in Figure 5.1. No edits are made to the first part of the sentence, In countries like China and India. For the edit their population → the population, the word population is aligned first, as it appears on both sides. Next the words the and their are aligned. The prepositions in the edit to → in are aligned. For the final edit, billions on both sides are aligned, and the is left unaligned.
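The alignment procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: edits are represented as non-overlapping (start, end, correction) token spans over the incorrect sentence, alignment pairs are written as (correct index, incorrect index) with indices starting at 0, and the function names are hypothetical.

```python
def align_phrase(inc, cor):
    """Align the two phrases of one edit; returns local (i, j) pairs,
    where i indexes the correct phrase and j the incorrect phrase."""
    # Left-to-right search for identical words, without overlaps.
    anchors = []
    next_cor = 0
    for j, word in enumerate(inc):
        for i in range(next_cor, len(cor)):
            if cor[i] == word:
                anchors.append((i, j))
                next_cor = i + 1
                break
    pairs = list(anchors)
    # The anchors split the edit into subphrase pairs. A subphrase pair
    # that is non-empty on both sides is aligned all-to-all; if one side
    # is empty, the product below is empty and the words stay unaligned.
    prev_i, prev_j = -1, -1
    for i, j in anchors + [(len(cor), len(inc))]:
        pairs.extend((gi, gj)
                     for gi in range(prev_i + 1, i)
                     for gj in range(prev_j + 1, j))
        prev_i, prev_j = i, j
    return pairs


def align_sentence(incorrect, edits):
    """Build the correct sentence and the word alignment from the edits."""
    correct, alignment, pos = [], set(), 0
    for start, end, correction in sorted(edits, key=lambda e: e[0]):
        for j in range(pos, start):  # unedited words align one-to-one
            alignment.add((len(correct), j))
            correct.append(incorrect[j])
        offset = len(correct)
        for i, j in align_phrase(incorrect[start:end], correction):
            alignment.add((offset + i, start + j))
        correct.extend(correction)
        pos = end
    for j in range(pos, len(incorrect)):  # unedited words after the last edit
        alignment.add((len(correct), j))
        correct.append(incorrect[j])
    return correct, alignment


incorrect = "In countries like China and India , their population amounts to billions .".split()
edits = [(7, 9, ["the", "population"]), (10, 11, ["in"]), (11, 12, ["the", "billions"])]
correct, alignment = align_sentence(incorrect, edits)
```

For this sentence, the second the in the correct sentence (index 11) does not occur in any alignment pair, in agreement with Example 5.2.3.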