Chapter 6: Statistical Post-editing
6.1.1 Building and Modifying an SPE Module
There are four steps involved in building and modifying a basic SPE module. To keep the explanation clear, we define a set of notations at each step. The notations for the corpora and translations are listed in Table 6.2.
Notation    Meaning
TrainEN     The English training corpus
TrainZH     The Chinese reference translation for the English training corpus
TrainMT     The Systran translation for the English training corpus
TuneEN      The English tuning corpus
TuneZH      The Chinese reference translation for the English tuning corpus
TuneMT      The Systran translation for the English tuning corpus
TestEN      The English test sample
ZH
Step 1 Train and tune an SPE system
The SPE system required TrainMT (the “source” language) and TrainZH (the “target” language) for training and TuneMT and TuneZH for fine-tuning. The phrase table in the obtained SPE system is monolingual (Chinese) and contains raw RBMT output on one side and the reference translation on the other. Phrase tables are of vital importance to an SPE module (and to an SMT system) (cf. Chapter 2). An SPE module attempts to select the translations with the highest probability using its phrase table (which determines the accuracy of translations) together with a pre-extracted target language model (which determines the fluency of translations). Thus, the more precise and correct the phrase table, the higher the quality of the SPE output (cf. Chapter 2). The notation for the phrase table of this pre-trained SPE system is presented in Table 6.3.
Notation       Meaning
PhraseMTREF    Monolingual phrase table containing phrases learnt from the raw RBMT output and the reference translations
Table 6.3: Notation for the monolingual phrase table of the SPE module
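The interplay between the phrase table and the language model described in Step 1 can be summarised with the classical noisy-channel formulation of phrase-based SMT. The symbols are illustrative shorthand rather than notation defined in this chapter: m stands for a raw RBMT sentence and r for a candidate post-edited sentence. Moses in practice combines these components in a weighted log-linear model, so this is a simplified view and not the exact scoring function of the system.

```latex
% Assumed shorthand: m = raw RBMT output, r = candidate post-edited output
\hat{r} = \arg\max_{r} P(r \mid m)
        = \arg\max_{r} \underbrace{P(m \mid r)}_{\text{phrase table (accuracy)}}
          \, \underbrace{P(r)}_{\text{language model (fluency)}}
```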
Step 2 Translate the test sample with the SPE module
To use the pre-trained SPE module, the test sample was first translated by Systran into Chinese. This is the Baseline translation, to which the other translation versions are compared. Next, the pre-trained SPE module was applied to post-edit the raw Baseline translation, producing a second version of the translation. This variant is called the default output of the SPE module because no modification was applied to the SPE system. The notations are shown in Table 6.4 below.
Notation    Meaning
Baseline    The Systran output of the English test sample
SPED        The default translation of the SPE module, which is obtained by post-editing the Baseline translation using the basic SPE system
Table 6.4: Notations for Baseline and SPED
The first two steps are depicted in Figure 6.1.
Figure 6.1: Flowchart of the first two steps in the process of modifying SPE
Step 3 Modify the SPE system
This step involves modifying the core component (i.e. the phrase table) of this SPE module by removing phrases not containing prepositions. However, as mentioned, the phrase table (PhraseMTREF) in the unmodified SPE module is monolingual, with raw RBMT Chinese output on one side and the reference Chinese translation on the other. It is therefore necessary to find out which of the raw RBMT output strings were translated from English phrases with prepositions. In other words, we need the phrase correspondences between the source English and the raw RBMT output.
Step 4 Generate a bilingual phrase table
[Figure 6.1: flowchart. Box labels: TrainMT and TrainZH; build an SPE module; obtain a monolingual phrase table PhraseMTREF; translate the test sample using Systran; post-edit the raw Systran translation of the sample using the trained SPE module; obtain SPED (translation of the default SPE module)]
Using TrainEN as the source language and TrainMT as the target language, we obtained a bilingual phrase table with the phrase-extraction tools of the Moses toolkit. The resulting phrase table, PhraseENMT, contains phrase pairs with English phrases on one side and raw RBMT output on the other. Next, any phrase pairs whose English side contained no preposition were removed. The resulting preposition phrase table, PhraseENprepMT, is used in what follows to modify the default SPE system. The notations used at this step are shown in Table 6.5. PhraseENprepMT was used to remove phrases that do not relate to prepositions from PhraseMTREF, which was obtained in Step 1.
Notation          Meaning
PhraseENMT        Bilingual phrase table containing all possible corresponding translation sequences learnt from the English training data and the RBMT translation
PhraseENprepMT    Phrase table with English phrases containing prepositions and their corresponding RBMT translation
Table 6.5: Notations for bilingual and preposition phrase table
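The preposition filter described in Step 4 can be sketched as follows. This is a minimal illustration, assuming the phrase table is plain text with fields separated by "|||" and the English phrase in the first field. The preposition list, the function names and the second toy phrase pair are hypothetical; the list actually used in the study is not reproduced here. Matching whole tokens is a crude test (for instance, it cannot tell prepositional "to" from infinitival "to").

```python
# Hypothetical, illustrative preposition list (not the list used in the study).
PREPOSITIONS = {"in", "on", "at", "of", "for", "with", "from", "to", "by"}


def split_fields(line):
    """Split a phrase-table line on the '|||' delimiter and trim each field."""
    return [field.strip() for field in line.split("|||")]


def has_preposition(english_phrase, prepositions=PREPOSITIONS):
    """Return True if any whitespace-separated token is a listed preposition."""
    return any(token.lower() in prepositions for token in english_phrase.split())


def filter_preposition_pairs(lines, prepositions=PREPOSITIONS):
    """Keep only phrase pairs whose English (first) field contains a preposition."""
    kept = []
    for line in lines:
        fields = split_fields(line)
        if len(fields) >= 2 and has_preposition(fields[0], prepositions):
            kept.append(line)
    return kept


# Toy stand-in for PhraseENMT: the first pair is the chapter's example,
# the second pair is invented for illustration only.
phrase_en_mt = [
    "In the Requirement tab ||| 要求 表 里",
    "the Requirement tab ||| 要求 表",
]

# Toy stand-in for PhraseENprepMT.
phrase_en_prep_mt = filter_preposition_pairs(phrase_en_mt)
print(phrase_en_prep_mt)
```

Only the first toy pair survives, because its English side contains "In"; the second pair has no preposition on the English side and is dropped.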
Comparing PhraseENprepMT (from Step 4) and PhraseMTREF (from Step 1), we can see that the two phrase tables share the raw RBMT output. PhraseENprepMT contains English phrases with prepositions and their corresponding raw RBMT Chinese translations. PhraseMTREF contains raw RBMT Chinese translations and the corresponding reference Chinese translations. For example, the following phrase pair is present in PhraseENprepMT:
English phrase ||| Raw RBMT translation
In the Requirement tab ||| 要求 表 里 [gloss: Requirement tab in]
and the following phrase pair is present in PhraseMTREF:
Raw RBMT translation ||| Reference translation
要求 表 里 ||| “ 要求 ” 表 中
[gloss: Requirement tab in] ||| [gloss: “Requirement” tab in]
Therefore, the two phrase tables can be connected through the raw RBMT translation as follows:
PhraseENprepMT + PhraseMTREF
English phrase ||| Raw RBMT translation ||| Reference translation
In the Requirement tab ||| 要求 表 里 ||| “ 要求 ” 表 中
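The connection through the raw RBMT translation amounts to a join of the two phrase tables on their shared MT side. The sketch below assumes each table is a list of "left ||| right" strings; the function names are hypothetical and only the two phrase pairs from the example above are used as data.

```python
def parse_pair(line):
    """Return the first two '|||'-separated fields of a phrase-table line."""
    left, right = [field.strip() for field in line.split("|||")[:2]]
    return left, right


def join_on_mt(phrase_en_prep_mt, phrase_mt_ref):
    """Join (English ||| MT) pairs with (MT ||| reference) pairs on the MT side."""
    # Index the monolingual table by its raw RBMT side.
    refs_by_mt = {}
    for line in phrase_mt_ref:
        mt, ref = parse_pair(line)
        refs_by_mt.setdefault(mt, []).append(ref)

    # Look up each bilingual pair's MT side in that index.
    joined = []
    for line in phrase_en_prep_mt:
        en, mt = parse_pair(line)
        for ref in refs_by_mt.get(mt, []):
            joined.append((en, mt, ref))
    return joined


joined = join_on_mt(
    ["In the Requirement tab ||| 要求 表 里"],
    ["要求 表 里 ||| “ 要求 ” 表 中"],
)
for en, mt, ref in joined:
    print(" ||| ".join((en, mt, ref)))
```

A dictionary keyed on the MT side keeps the lookup linear in the size of the two tables, which matters for phrase tables with many entries; this is a design choice of the sketch, not a description of how the comparison was implemented in the study.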
We compared PhraseMTREF to PhraseENprepMT and retained those phrase pairs in PhraseMTREF whose raw RBMT side could be matched to the raw RBMT side of a phrase pair in PhraseENprepMT. However, not all the raw RBMT phrases in PhraseENprepMT can be found in PhraseMTREF. Moreover, when a phrase is found in both phrase tables, the match can be of two types. Let us continue with the same example to illustrate this point. Suppose the following phrase pair is present in PhraseENprepMT:
English phrase ||| Raw RBMT translation
In the Requirement tab ||| 要求 表 里 [gloss: Requirement tab in]
In PhraseMTREF, we may then find two matching phrase pairs:
1) Raw RBMT translation ||| Reference translation
要求 表 里 ||| “ 要求 ” 表 中
2) Raw RBMT translation ||| Reference translation
将 名称 填 在 要求 表 里 ||| 在 “ 要求 ” 表 中 填入 名字
[gloss: name fill in Requirement tab in] ||| [gloss: In “Requirement” tab in fill name]
In the first match, the whole phrase from PhraseENprepMT is matched exactly by the raw RBMT side of a phrase pair in PhraseMTREF. In the second match, the phrase from PhraseENprepMT is contained as part of a longer phrase in PhraseMTREF. We call the first a Full Match, i.e. a phrase in PhraseENprepMT is matched exactly and in its entirety in PhraseMTREF, and the second a Partial Match, i.e. a phrase in PhraseENprepMT is matched to part of a phrase in PhraseMTREF. The difference between the two is that the latter (case 2) contains extra information that is not necessarily related to prepositions. Keeping only Full Matches minimises this unrelated information, which may degrade the translation of prepositions; the drawback is that phrase pairs like the one in case 2 are discarded even though they do contain translations of prepositions. Also keeping Partial Matches ensures that all phrase pairs related to prepositions are included, but at the cost of including context not related to prepositions.
Based on these two types of match, we filtered PhraseMTREF in two ways. The first way keeps only phrase pairs that have a full and exact match between PhraseMTREF and PhraseENprepMT (match 1). This removed 76.9% of the phrase pairs from PhraseMTREF. The second way keeps a phrase pair in PhraseMTREF if its raw RBMT side contains or exactly matches a phrase in PhraseENprepMT (both match 1 and match 2). In contrast to the first filtering, this removed just 2.6% of the phrase pairs in PhraseMTREF.
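The two filtering strategies can be sketched as below. This is an illustration under stated assumptions, not the procedure used in the study: containment is tested on contiguous whitespace-separated tokens (the text does not say whether matching was done on tokens or on raw strings), the raw RBMT side is taken to be the first "|||" field of PhraseMTREF, and the function names and the third toy phrase pair are hypothetical.

```python
def mt_side(line):
    """Raw RBMT side of a PhraseMTREF line (assumed to be the first field)."""
    return line.split("|||")[0].strip()


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run inside tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def filter_full(phrase_mt_ref, prep_mt_phrases):
    """Full Match: keep pairs whose MT side equals a preposition MT phrase."""
    prep = set(prep_mt_phrases)
    return [line for line in phrase_mt_ref if mt_side(line) in prep]


def filter_partial(phrase_mt_ref, prep_mt_phrases):
    """Partial Match: keep pairs whose MT side equals or contains such a phrase."""
    prep_tokens = [phrase.split() for phrase in set(prep_mt_phrases)]
    kept = []
    for line in phrase_mt_ref:
        tokens = mt_side(line).split()
        if any(contains_phrase(tokens, p) for p in prep_tokens):
            kept.append(line)
    return kept


# MT sides taken from a toy PhraseENprepMT (the chapter's example phrase).
prep_mt_phrases = ["要求 表 里"]

# Toy PhraseMTREF: the first two pairs are the chapter's examples,
# the third pair is invented to show a phrase that both filters remove.
phrase_mt_ref = [
    "要求 表 里 ||| “ 要求 ” 表 中",
    "将 名称 填 在 要求 表 里 ||| 在 “ 要求 ” 表 中 填入 名字",
    "名称 ||| 名字",
]

full_matches = filter_full(phrase_mt_ref, prep_mt_phrases)
partial_matches = filter_partial(phrase_mt_ref, prep_mt_phrases)
print(len(full_matches), len(partial_matches))
```

On this toy data the Full Match filter keeps only the first pair, while the Partial Match filter keeps the first two, mirroring the observation that full matching removes far more of the phrase table than partial matching.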
After filtering the phrase table of the general SPE system, two new SPE systems were generated. The Baseline translation was then post-edited again by each of the two new SPE systems and two new translations were obtained (Table 6.6).
Notation    Meaning
SPEP        Translation from the modified SPE module with the phrase table that was filtered based on Partial Matches
SPEF        Translation from the modified SPE module with the phrase table that was filtered based on Full Matches
Table 6.6: Notations for SPEP and SPEF
Figure 6.2 illustrates the above steps.
[Figure 6.2: flowchart. Box labels: TrainEN and TrainMT; obtain a bilingual phrase table PhraseENMT; preposition list; filter; obtain a bilingual preposition phrase table PhraseENprepMT; PhraseMTREF (from Figure 5.1); compare the MT side between PhraseENprepMT and PhraseMTREF; Full Match; Partial Match; SPEP; SPED]
To reiterate the purpose of this test, we wanted to compare the Baseline, SPED, SPEP and SPEF, while focusing on analysing the gains and losses of modified SPE modules in particular for the translation of prepositions. To ascertain the level of gains or losses, a comparison and evaluation of these translations was conducted manually and automatically.