Building and Modifying an SPE Module - Statistical Post-editing

Chapter 6: Statistical Post-editing

6.1.1 Building and Modifying an SPE Module

There are four steps involved in building and modifying a basic SPE module. To keep the explanation clear, the notation used at each step is listed as it is introduced. The notations for the corpora and translations are listed in Table 6.2.

Notation    Meaning
Train_EN    The English training corpus
Train_ZH    The Chinese reference translation for the English training corpus
Train_MT    The Systran translation for the English training corpus
Tune_EN     The English tuning corpus
Tune_ZH     The Chinese reference translation for the English tuning corpus
Tune_MT     The Systran translation for the English tuning corpus
Test        The English test sample

 Step 1 Train and tune an SPE system

The SPE system was trained on Train_MT (the “source” language) and Train_ZH (the “target” language) and tuned on Tune_MT and Tune_ZH. The phrase table of the resulting SPE system is monolingual (Chinese): it contains raw RBMT output on one side and the reference translation on the other. Phrase tables are of vital importance to an SPE module, as they are to an SMT system (cf. Chapter 2). An SPE module attempts to select the translation with the highest probability using its phrase table (which determines the accuracy of translations) together with a pre-extracted target language model (which determines the fluency of translations). Thus, the more precise and correct the phrase table, the higher the quality of the SPE output (cf. Chapter 2). The notation for the phrase table of this pre-trained SPE system is presented in Table 6.3.

Notation          Meaning
Phrase_MT__REF    Monolingual phrase table containing phrases learnt from the raw RBMT output and the reference translations.

Table 6.3: Notation for the monolingual phrase table of the SPE module
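The interplay between the phrase table (accuracy) and the language model (fluency) described in Step 1 can be illustrated with a deliberately simplified sketch. It is not how a Moses decoder works internally: a real decoder combines several weighted features and searches over segmentations. The names `PHRASE_TABLE`, `LANGUAGE_MODEL` and `best_post_edit`, and all probability values, are hypothetical and invented for illustration only.

```python
import math

# Hypothetical toy models. The probabilities are invented for illustration;
# they are not taken from the trained SPE system.
PHRASE_TABLE = {
    # raw RBMT phrase -> {candidate post-edit: P(candidate | RBMT phrase)}
    "要求表里": {"“ 要求 ” 表中": 0.7, "要求表里": 0.3},
}
LANGUAGE_MODEL = {
    # candidate -> P_LM(candidate), standing in for a target language model
    "“ 要求 ” 表中": 0.02,
    "要求表里": 0.001,
}


def best_post_edit(mt_phrase, phrase_table, language_model):
    """Pick the candidate with the highest combined log-probability."""
    candidates = phrase_table.get(mt_phrase, {})
    if not candidates:
        # Nothing was learnt for this phrase: leave the RBMT output unchanged.
        return mt_phrase

    def score(candidate):
        translation_score = math.log(candidates[candidate])          # accuracy
        fluency_score = math.log(language_model.get(candidate, 1e-9))  # fluency
        return translation_score + fluency_score

    return max(candidates, key=score)
```

In this toy setting, an imprecise phrase table entry changes the winning candidate directly, which is why the quality of the phrase table matters so much for the SPE output.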

 Step 2 Translate the test sample with the SPE module

To use the pre-trained SPE module, the test sample was first translated into Chinese by Systran. This is the Baseline translation, against which the other translation versions are compared. Next, the pre-trained SPE module was used to post-edit the raw Baseline translation, producing a second version of the translation. This variant is called the default output of the SPE module, as no modification was applied to the SPE system. The notations are shown in Table 6.4 below.

Notation    Meaning
Baseline    The Systran output of the English test sample
SPED        The default translation of the SPE module which is obtained by post-editing the Baseline translation using the basic SPE system

Table 6.4: Notations for Baseline and SPED

The first two steps are depicted in Figure 6.1.

Figure 6.1: Flowchart of the first two steps in the process of modifying SPE

 Step 3 Modify the SPE system

This step involves modifying the core component of the SPE module, i.e. the phrase table, by removing phrases that do not contain prepositions. However, as mentioned, the phrase table (Phrase_MT__REF) in the unmodified SPE module is monolingual, with raw RBMT Chinese output on one side and the reference Chinese translation on the other. It is therefore necessary to find out which of the raw RBMT output strings were translated from English phrases containing prepositions. In other words, we need phrase pairs that link the English source to the raw RBMT output.

 Step 4 Generate a bilingual phrase table

[Figure 6.1 flowchart, box labels: Obtain a monolingual phrase table Phrase_MT__REF (from Train_MT and Train_ZH); Build an SPE module; Translate the test sample using Systran; Post-edit the raw Systran translation of the sample using the trained SPE module; Obtain SPED (translation of the default SPE module)]

Using Train_EN as the source language and Train_MT as the target language, we obtained a bilingual phrase table with the phrase-extraction tools of the Moses toolkit. The resulting phrase table (Phrase_EN__MT) contains pairs with English phrases on one side and raw RBMT output on the other. Next, any phrase pairs whose English side contained no preposition were removed. The resulting preposition phrase table (Phrase_ENprep__MT) was then used to remove from Phrase_MT__REF (obtained in step 1) the phrases that do not relate to prepositions, as described below. The notations used at this step are shown in Table 6.5.

Notation             Meaning
Phrase_EN__MT        Bilingual phrase table containing all possible corresponding translation sequences learnt from the English training data and the RBMT translation
Phrase_ENprep__MT    Phrase table with English phrases containing prepositions and their corresponding RBMT translation

Table 6.5: Notations for bilingual and preposition phrase table
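The preposition filtering in Step 4 can be sketched as follows. This is a minimal illustration, not the script used in the study. It assumes phrase-table lines whose first two fields are separated by `|||` (the Moses convention) and a short, hypothetical preposition list; the actual list is not reproduced here. The function names are likewise assumptions.

```python
# Assumed, deliberately short preposition list for illustration only.
PREPOSITIONS = {"in", "on", "at", "for", "with", "from", "to", "of", "by"}


def parse_pair(line):
    """Split a 'source ||| target ||| ...' line into (source, target)."""
    fields = [field.strip() for field in line.split("|||")]
    return fields[0], fields[1]


def has_preposition(english_phrase, prepositions=PREPOSITIONS):
    """True if any whitespace-separated token is a listed preposition."""
    return any(token.lower() in prepositions for token in english_phrase.split())


def filter_preposition_pairs(lines, prepositions=PREPOSITIONS):
    """Keep only phrase pairs whose English side contains a preposition."""
    kept = []
    for line in lines:
        english, mt = parse_pair(line)
        if has_preposition(english, prepositions):
            kept.append((english, mt))
    return kept
```

A purely lexical check like this also keeps the infinitive marker "to"; a part-of-speech tagger would be needed to exclude such cases.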

Comparing Phrase_ENprep__MT (from step 4) and Phrase_MT__REF (from step 1), we can see that the common part between these two phrase tables is the raw RBMT output. Phrase_ENprep__MT contains English phrases with prepositions and their corresponding raw RBMT Chinese translations. Phrase_MT__REF contains raw

RBMT Chinese translations and the corresponding reference Chinese

translations. For example, InPhrase_ENprep__MT, the following phrases are present: English phrase Raw RBMT translation

In the Requirement tab ||| 要求表里 [gloss: Requirement tab in]

Raw RBMT translation Reference translation

要求表里 ||| “ 要求 ” 表中

[gloss: Requirement tab in] [gloss: “Requirement” tab in]

Therefore, the two phrase tables can be connected through the raw RBMT translation as follows:

Phrase_ENprep__MT and Phrase_MT__REF

English phrase ||| Raw RBMT translation ||| Reference translation
In the Requirement tab ||| 要求表里 ||| “ 要求 ” 表中
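The connection through the shared raw RBMT side amounts to a join on that column. The sketch below assumes both tables have already been read into lists of string pairs; `join_on_mt` is a hypothetical helper name, and only exact string equality on the RBMT side is handled.

```python
from collections import defaultdict


def join_on_mt(en_mt_pairs, mt_ref_pairs):
    """Chain (English, MT) and (MT, reference) pairs on an identical MT string."""
    refs_by_mt = defaultdict(list)
    for mt, ref in mt_ref_pairs:
        refs_by_mt[mt].append(ref)

    triples = []
    for english, mt in en_mt_pairs:
        # An MT phrase with no counterpart in the second table yields nothing.
        for ref in refs_by_mt.get(mt, []):
            triples.append((english, mt, ref))
    return triples
```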

We comparedPhrase_MT__REF toPhrase_ENprep__MT and retained those phrase pairs inPhrase_MT__REF where the raw RBMT side inPhrase_ENprep__MT could be matched toPhrase_MT__REF. However, we are aware that not all the raw RBMT phrases in

prep MT EN

Phrase _ can be found inPhrase_MT__REF. Even if a phrase is found in both phrase tables, there are two types of matches. Let us continue with the same example to illustrate this point. Suppose the following phrase is present in thePhrase_ENprep__MT:

English phrase ||| Raw RBMT translation
In the Requirement tab ||| 要求表里
[gloss: Requirement tab in]

In Phrase_MT__REF, we may find two matching phrase pairs:

1) Raw RBMT translation ||| Reference translation
要求表里 ||| “ 要求 ” 表中

2) Raw RBMT translation ||| Reference translation
将名称填在要求表里 ||| 在 “ 要求 ” 表中填入名字
[gloss: name fill in Requirement tab in] ||| [gloss: In “Requirement” tab in fill name]

The first match indicates that the whole phrase from Phrase_ENprep__MT is matched in full in Phrase_MT__REF. The second match indicates that the phrase from Phrase_ENprep__MT is contained as part of a longer phrase in Phrase_MT__REF. We called the first a Full Match, i.e. a phrase in Phrase_ENprep__MT is matched exactly by a phrase in Phrase_MT__REF, and the second a Partial Match, i.e. a phrase in Phrase_ENprep__MT is matched to part of a phrase in Phrase_MT__REF. The difference is that a Partial Match (case 2) contains extra information that is not necessarily related to prepositions. Keeping only Full Matches minimises this unrelated information, which may degrade the translation of prepositions, but phrases like the one in case 2 are then missed even though they do contain translations of prepositions. Keeping Partial Matches as well ensures that all phrases related to prepositions are included, at the cost of including context not related to prepositions.

Based on these two types of match, we filtered Phrase_MT__REF in two ways. The first way keeps only the phrase pairs that have a full and exact match between Phrase_MT__REF and Phrase_ENprep__MT (match 1). This removed 76.9% of the phrase pairs from Phrase_MT__REF. The second way keeps a phrase pair in Phrase_MT__REF if it contains or exactly matches a phrase in Phrase_ENprep__MT (both match 1 and match 2). In contrast to the first filtering, just 2.6% of the phrase pairs in Phrase_MT__REF were removed.

After filtering the phrase table of the general SPE system, two new SPE systems were generated. The Baseline translation was then post-edited again by each of the two new SPE systems and two new translations were obtained (Table 6.6).

Notation    Meaning
SPEP        Translation from the modified SPE module with the phrase table that was filtered based on Partial Matches
SPEF        Translation from the modified SPE module with the phrase table that was filtered based on Full Matches

Table 6.6: Notations for SPEP and SPEF

Figure 6.2 illustrates the above steps.

[Figure 6.2 flowchart, box labels: Obtain a bilingual phrase table Phrase_EN__MT (from Train_EN and Train_MT); filter with the preposition list; Obtain a bilingual preposition phrase table Phrase_ENprep__MT; Compare the MT side between Phrase_ENprep__MT and Phrase_MT__REF (from Figure 5.1); Full Match; Partial Match; SPEP; SPED]

To reiterate, the purpose of this test was to compare the Baseline, SPED, SPEP and SPEF translations, focusing in particular on the gains and losses that the modified SPE modules bring to the translation of prepositions. To ascertain the level of gains or losses, the translations were compared and evaluated both manually and automatically.

In document An Investigation into Automatic Translation of Prepositions in IT Technical Documentation from English to Chinese (Page 124-131)