MLETagger - Initial Steps for an RBMT-APE Approach

Chapter 6. APE Automatic Post-Editing System: Background

6.10 Initial Steps for an RBMT-APE Approach

6.10.1 MLETagger

This component is an implementation of maximum likelihood estimation (MLE). Evaluation has shown that this approach yields promising results for tagging Persian (Raja, Amiri et al., 2007). The component includes several classes; the function of each is explained in detail below.

6.10.2 Tagger class

The Tagger class is used to tag the input text. Its only method, Tag(), takes three parameters that are essential to running MLETagger:

Train set: the path (name and location) of the training file for the tagger. Each line of the training file contains one word and its tag, separated by a tab.

Test set: the path of the input file containing the tokens that require tagging.

Result: the path of the file to which the tagged text is saved.

When this method runs, the tagger is first trained on the data in the train set file. The test data is then tagged.
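The train-then-tag sequence of Tag() can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function name tag_files, the default tag "N", the whitespace tokenisation and the one-token-per-line output format are all assumptions, and the sample words are made up.

```python
import tempfile
from collections import Counter, defaultdict
from pathlib import Path


def tag_files(train_path, test_path, result_path, default_tag="N"):
    """Train on 'word<TAB>tag' lines, tag the test file, write the result.

    Hypothetical helper: the name, the default tag and the output layout
    are assumptions made for illustration only.
    """
    counts = defaultdict(Counter)
    for line in Path(train_path).read_text(encoding="utf-8").splitlines():
        word, sep, tag = line.partition("\t")
        if sep and word and tag:          # skip blank or malformed lines
            counts[word][tag] += 1
    # Keep the most frequent tag per word (ties broken alphabetically).
    model = {w: max(sorted(c), key=c.get) for w, c in counts.items()}

    # Assumption: the input is tokenised on whitespace only.
    tokens = Path(test_path).read_text(encoding="utf-8").split()
    lines = [f"{tok}\t{model.get(tok, default_tag)}" for tok in tokens]
    Path(result_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


# Made-up data, written to temporary files for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp, "train.txt")
    test = Path(tmp, "test.txt")
    result = Path(tmp, "result.txt")
    train.write_text("cat\tN\nruns\tV\ncat\tN\n", encoding="utf-8")
    test.write_text("cat runs fast", encoding="utf-8")
    n_tagged = tag_files(train, test, result)
    written = result.read_text(encoding="utf-8")

print(n_tagged)
print(written)
```

The three path arguments correspond to the train set, test set and result parameters described above; the unseen word "fast" receives the default tag.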

6.10.3 MLETagger class

This class has two parts: training the tagger and the tagging process.

6.10.3.1 Training tagger:

The learning() method is used to train the tagger. It loads the training data file and reads it line by line. For each line, the token and its part of speech are identified, and a key is built from the combination of the two; the value stored against that key is the number of times the token occurs with that part of speech in the training file. This information is kept in a collection named htNewStat. In the next stage, htNewStat is processed to find, for each token, the set of parts of speech it occurs with and the number of times each one occurs in the training set. The result is kept in a collection called ht. For example, if the word “باز” were tagged as a noun 20 times and as an adjective 15 times, ht would contain the entry shown below:

باز N^20 AJ^15

Then, from this collection, the part of speech with the highest count is chosen for each token. This tag is taken to have the maximum likelihood probability for that token and is the one used in the tagging process. In the previous example, noun (20 occurrences, compared with 15 for adjective) would be chosen and added to the MLiklihood set.
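The two counting stages and the final selection can be sketched in Python as below. The collections pair_counts, per_word and most_likely are hypothetical stand-ins for htNewStat, ht and MLiklihood; the alphabetical tie-break is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter, defaultdict


def learn(lines):
    """Build the three collections from 'word<TAB>tag' training lines."""
    # Stage 1: count each (word, tag) combination (role of htNewStat).
    pair_counts = Counter()
    for line in lines:
        word, sep, tag = line.rstrip("\n").partition("\t")
        if sep and word and tag:
            pair_counts[(word, tag)] += 1

    # Stage 2: group the counts by word (role of ht).
    per_word = defaultdict(Counter)
    for (word, tag), n in pair_counts.items():
        per_word[word][tag] = n

    # Stage 3: keep the most frequent tag per word (role of MLiklihood).
    # Assumption: ties go to the alphabetically first tag.
    most_likely = {w: max(sorted(tags), key=tags.get)
                   for w, tags in per_word.items()}
    return pair_counts, per_word, most_likely


# A token seen 20 times as a noun and 15 times as an adjective.
sample = ["باز\tN"] * 20 + ["باز\tAJ"] * 15
pair_counts, per_word, most_likely = learn(sample)
print(per_word["باز"]["N"], per_word["باز"]["AJ"])  # 20 15
print(most_likely["باز"])                            # N
```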

6.10.3.2 Tagging process:

The tagging process is run by the tagging() method of the MLETagger class. This method loads the input file (an ordinary text file containing words, punctuation, numbers, etc.) and tokenises it. Each token is then looked up in the MLiklihood collection. If it is found, the tag stored for that word is assigned to it. If not, noun is assigned as the default part of speech for that token.

Some characters (such as “ی” and “ک”) have different Unicode code points in Persian and Arabic. This difference arises because the training data for the tagger and the input data were generated on different machines. To account for it, words containing these characters are considered in both their Persian and their Arabic Unicode forms.

Next, the probability of the tag for that word is evaluated in the training set.
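The lookup with a default tag and the handling of Persian and Arabic character variants might look like the sketch below. It assumes that the affected characters are yeh (U+06CC and U+064A) and kaf (U+06A9 and U+0643), that tokens are split by a simple regular expression, and that the default noun tag is written "N"; none of these details is confirmed by the text.

```python
import re

# Assumed character pairs: Arabic yeh/kaf and their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})
PERSIAN_TO_ARABIC = str.maketrans({"\u06CC": "\u064A", "\u06A9": "\u0643"})


def variants(token):
    """Return the token as typed, then its Persian and Arabic spellings."""
    return [token,
            token.translate(ARABIC_TO_PERSIAN),
            token.translate(PERSIAN_TO_ARABIC)]


def tokenize(text):
    """Hypothetical tokeniser: runs of word characters, or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(text, most_likely, default_tag="N"):
    """Assign the stored tag to each token, or the default if unseen."""
    tagged = []
    for token in tokenize(text):
        for form in variants(token):
            if form in most_likely:
                tagged.append((token, most_likely[form]))
                break
        else:  # no spelling variant was found in the model
            tagged.append((token, default_tag))
    return tagged


# The model stores the word with Persian kaf; the input uses Arabic kaf.
model = {"\u06A9\u0647": "CON", ".": "DELM"}
result = tag("\u0643\u0647 xyz .", model)
print(result)
```

Trying both spellings at lookup time leaves the input text unchanged, which matches the description of considering words in both forms rather than rewriting them.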

6.10.4 CoNLL class

The code in this file, through the method PrepareTrainData, generates the training data set for the tagger from the Persian Dependency Treebank. The treebank is in CoNLL-2005 format, in which each line holds several fields for one token; two of these fields are the token itself and its part of speech. The treebank was used as the training set because the tag set generated from it is compatible with the tagger training set.
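A conversion of this kind can be sketched as follows. The function name prepare_train_data and the column positions (token in the second field, part of speech in the fourth) are assumptions, as the text does not give the field layout; the sample rows are invented for illustration.

```python
def prepare_train_data(conll_lines, form_col=1, pos_col=3):
    """Turn tab-separated CoNLL rows into 'word<TAB>tag' training lines.

    form_col and pos_col are zero-based column indices and are
    assumptions; adjust them to the treebank's actual layout.
    """
    out = []
    for line in conll_lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue                      # blank lines separate sentences
        fields = line.split("\t")
        if len(fields) <= max(form_col, pos_col):
            continue                      # skip rows that are too short
        out.append(f"{fields[form_col]}\t{fields[pos_col]}")
    return out


# Invented rows in a CoNLL-like layout (not taken from the treebank).
sample = [
    "1\tزیبایی\tزیبایی\tN\tSING\t_\t2\tSBJ\t_\t_",
    "2\tاست\tاست\tV\tPRE\t_\t0\tROOT\t_\t_",
    "",
    "1\tهنر\tهنر\tN\tSING\t_\t0\tROOT\t_\t_",
]
train_lines = prepare_train_data(sample)
print(train_lines)
```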

6.11 POS-Tagger

For our RBMT-based APE algorithm to work with our SMT system, both the output and the reference text had to be parsed. Parsing makes it possible to extract rules that map the output to the reference, and so to improve the quality of the output through revision tasks such as replacing OOV words with their equivalents in the target language, correcting grammar, and modifying the word order.

To accomplish this, the Persian output must be parsed. The first stage of parsing is POS-tagging, or annotating each word for its part of speech (grammatical type) in a given sentence. Examples of POS-tagging Persian output are shown in Table 6-2:

Table 6-2: Examples of pos-tagging Persian output

Token  Tag
زیبایی  N_SING
پدیده  N_SING
ای  OH
است  V_PRE
که  CON
از  P
مدتها  N_PL
قبل  N_SING
مورد  N_SING
توجه  N_SING
بوده  ADJ_INO
است  V_PRE
.  DELM

Generally, POS-tagging supports parsing and helps to resolve pronunciation and semantic ambiguities.

POS-tagging is a useful task for many applications such as word sense disambiguation, parsing, and language modelling. Tagging techniques can also be used for a variety of tasks such as semantic tagging, dialogue tagging and information retrieval.

Not all POS-taggers follow the same tagging standard. Some use coarse classes, such as N, V, A, Aux… (Amiri, Raja et al., 2007).

Others, such as those that use the Penn Treebank tag set, prefer finer distinctions:

• PRP: personal pronouns (you, me, she, he, them, him, …)
• PRP$: possessive pronouns (my, our, her, his, …)
• NN: singular common nouns (leg, plate, calculator, …)
• NNS: plural common nouns (legs, plates, calculators, …)
• NNP: singular proper names (Microsoft, Europe, London, …)
• NNPS: plural proper names (Americas, Carolinas, …)

Data is tagged for POS in the same way that humans tag a corpus: a POS-tagger attempts to model human performance and to match it. To build the model, corpora are hand-tagged for POS by more than one annotator and then checked for reliability. The corpus used for the tagger in this research is the Bijankhan corpus.

The Bijankhan corpus is a collection of articles from daily news and common texts. The articles and documents are categorized by domain and subject (literature, politics, culture, science, etc.), giving about 4300 separate subjects in total. The corpus is tagged: it contains about 2.6 million manually tagged words, annotated with a tag set of 40 Persian POS tags. It is used by researchers in natural language processing, and is distributed by a database research group at the University of Tehran.

As shown in Figure 6-9, there are a number of different approaches to POS tagging:

Figure 6-9: POS-Tagging Approaches

For this APE, an MLE parser, which is stochastic, was used. A probabilistic POS-tagger makes automatic training possible, so that rule revision, which is tedious and time-consuming, can be avoided. Automatic training also makes adaptation to new text domains possible.

The chosen approach to stochastic parsing was Maximum Likelihood Estimation (MLE). In the first stage, MLE calculates the maximum likelihood probability of each tag assigned to each word in the training set. In the second stage, the tag with the greatest maximum likelihood probability is fixed as the tag for that word. In the evaluation stage, the words of the test set are analysed, and each word is assigned the tag that was fixed for it during training.
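Stated as a formula, and assuming the usual relative-frequency estimate (the notation c(w, t) for the number of times word w carries tag t in the training set is introduced here for illustration), the two stages amount to:

```latex
\[
\hat{P}(t \mid w) = \frac{c(w, t)}{\sum_{t'} c(w, t')},
\qquad
\hat{t}(w) = \arg\max_{t} \hat{P}(t \mid w)
\]

% Worked example: a word seen 20 times as N and 15 times as AJ
\[
\hat{P}(\mathrm{N} \mid w) = \frac{20}{35} \approx 0.571,
\qquad
\hat{P}(\mathrm{AJ} \mid w) = \frac{15}{35} \approx 0.429,
\qquad
\hat{t}(w) = \mathrm{N}
\]
```

Because the denominator is the same for every tag of a given word, choosing the tag with the highest probability is equivalent to choosing the tag with the highest count.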

An MLE parser can provide accurate parsing when it is trained on a large corpus. Unigram statistics (the most common part of speech for each word) can achieve up to 90% accuracy. Higher accuracy is achievable by using more information about adjacent words.

In a statistical model, the probabilities are extracted from the tagged corpus on which the MLE tagger has been trained. A corpus embedded too deeply in a particular domain may not be transferable to other domains; yet, if it is too generic, it may be unable to benefit from domain-specific probabilities.

A tagging model is typically tested by splitting the corpus into a training set and a test set, with the test set held out from training. The tagger learns from the training set the tag assignments that maximize the probability under the model, and is then tested on the test set. Although the tagger should not be trained on the test data (as the result would be unreliable), the test data may be very similar to the training data.
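A held-out evaluation of a unigram tagger can be sketched as below. The function names, the default tag "N", the 10% test fraction and the token-level shuffle are assumptions made for illustration; a real experiment would more likely split by sentence or document, and the sample pairs are made up.

```python
import random
from collections import Counter, defaultdict


def train(pairs):
    """Return the most frequent tag for each word in (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {w: max(sorted(c), key=c.get) for w, c in counts.items()}


def accuracy(model, pairs, default_tag="N"):
    """Fraction of held-out tokens whose predicted tag matches the gold tag."""
    if not pairs:
        return 0.0
    correct = sum(1 for w, t in pairs if model.get(w, default_tag) == t)
    return correct / len(pairs)


def split(pairs, test_fraction=0.1, seed=0):
    """Shuffle and split so that the test set is held out from training."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Made-up data with an explicit split, so the result is easy to check.
train_pairs = [("a", "N"), ("a", "N"), ("a", "V"), ("b", "V")]
test_pairs = [("a", "N"), ("b", "V"), ("c", "V"), ("c", "N")]
model = train(train_pairs)
score = accuracy(model, test_pairs)
print(score)  # 0.75: the unseen word "c" gets the default tag, right once
```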

The MLE tagger is run on both output and reference texts from the SMT system. Details of this can be found in Appendix III, section 4. The results are as follows:

Output text:

زیبایی  N_SING
پدیده ای  N_SING
است  V_PRE
که  CON
از  P
مدتها  N_PL
قبل  N_SING
مورد  N_SING
توجه  N_SING
بوده  ADJ_INO
است  V_PRE
هنر  N_SING
.principle_OOV  N_SING
نشان  N_SING
دهد  V_SUB
که  CON
زیبایی  N_SING
در  P
پدیده  N_SING
ذاتی  ADJ_SIM
است  V_PRE

Reference text:

زیبایی  N_SING
پدیده  N_SING
ای  OH
است  V_PRE
که  CON
از  P
مدتها  N_PL
قبل  N_SING
مورد  N_SING
توجه  N_SING
بوده  ADJ_INO
است  V_PRE
.  DELM
هنر  N_SING
اصلی  ADJ_SIM
نشان  N_SING
می  N_SING
دهد  V_SUB
که  CON
زیبایی  N_SING
یک  N_SING
پدیده  N_SING
ذاتی  ADJ_SIM
است.  N_SING

6.12 Summary

In summary, this chapter has presented the motivation behind the development of an automatic post-editing approach and given an overview of related work, showing the different architectures of various hybrid systems. In particular, it has shown that rule-based automatic post-editing has not been explored extensively, specifically with respect to correcting the output of an SMT system. The chapter has also described the preparation necessary for a rule-based APE approach, such as POS-tagging and parsing, and the particular POS-tagging and parsing approaches used for this system.

In document English Persian phrase based statistical machine translation : enhanced models, search and training : a thesis presented in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Engineering at Massey University, Albany (Aucklan (Page 135-142)