
Chapter 6: Statistical Post-editing

6.1.1 Building and Modifying an SPE Module

Building and modifying a basic SPE module involves four steps. To keep the explanation clear, we define a set of notations at each step. The notations for the corpora and translations are listed in Table 6.2.

Notation   Meaning
TrainEN    The English training corpus
TrainZH    The Chinese reference translation for the English training corpus
TrainMT    The Systran translation for the English training corpus
TuneEN     The English tuning corpus
TuneZH     The Chinese reference translation for the English tuning corpus
TuneMT     The Systran translation for the English tuning corpus
TestEN     The English test sample

ZH

 Step 1 Train and tune an SPE system

The SPE system required TrainMT (the “source” language) and TrainZH (the “target” language) for training, and TuneMT and TuneZH for tuning. The phrase table of the resulting SPE system is monolingual (Chinese): it contains raw RBMT output on one side and the reference translation on the other. Phrase tables are vital to an SPE module, as they are to an SMT system (cf. Chapter 2). An SPE module attempts to select the translation with the highest probability using its phrase table (which determines the accuracy of translations) together with a pre-built target language model (which determines the fluency of translations). Thus, the more precise and correct the phrase table, the higher the quality of the SPE output. The notation for the phrase table of this pre-trained SPE system is presented in Table 6.3.

Notation      Meaning
PhraseMTREF   Monolingual phrase table containing phrases learnt from the raw RBMT output and the reference translations

Table 6.3: Notation for the monolingual phrase table of the SPE module
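The selection described in Step 1 can be illustrated with a deliberately simplified sketch. It scores each candidate reference phrase for a single raw RBMT phrase by adding a phrase-table log-probability to a language-model log-probability. A real decoder searches over segmentations and uses more features; the probabilities, the function names and the fallback score below are all invented for illustration and do not come from the trained system.

```python
import math

# Toy data: every number here is invented for illustration only.
# Maps a raw RBMT phrase to candidate post-edits and their probabilities.
PHRASE_TABLE = {
    "要求 表 里": {"“ 要求 ” 表 中": 0.6, "要求 表 里": 0.4},
}
# Toy target language model: log-probabilities of candidate phrases.
LM_LOGPROB = {"“ 要求 ” 表 中": -2.0, "要求 表 里": -3.5}


def lm_logprob(phrase):
    """Return a toy language-model score; unseen phrases get a low default."""
    return LM_LOGPROB.get(phrase, -10.0)


def best_post_edit(mt_phrase, phrase_table=PHRASE_TABLE):
    """Pick the candidate maximising phrase-table score plus LM score."""
    candidates = phrase_table.get(mt_phrase)
    if not candidates:
        # No entry in the phrase table: leave the RBMT output unchanged.
        return mt_phrase
    return max(
        candidates,
        key=lambda c: math.log(candidates[c]) + lm_logprob(c),
    )


print(best_post_edit("要求 表 里"))
```

The sketch makes the dependency explicit: if the phrase table lacks a good candidate, the language model can only choose among poor options, which is why the quality of the phrase table matters so much.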

 Step 2 Translate the test sample with the SPE module

To use the pre-trained SPE module, the test sample was first translated into Chinese by Systran. This is the Baseline translation, against which the other translation versions are compared. Next, the pre-trained SPE module was run to post-edit the raw Baseline translation, producing a second version of the translation. This variant is called the default output of the SPE module, as no modification was applied to the SPE system. The notations are shown in Table 6.4 below.

Notation   Meaning
Baseline   The Systran output of the English test sample
SPED       The default translation of the SPE module, which is obtained by post-editing the Baseline translation using the basic SPE system

Table 6.4: Notations for Baseline and SPED

The first two steps are depicted in Figure 6.1.

Figure 6.1: Flowchart of the first two steps in the process of modifying SPE

 Step 3 Modify the SPE system

This step involves modifying the core component of this SPE module (i.e. the phrase table) by removing phrases that do not contain prepositions. However, as mentioned, the phrase table (PhraseMTREF) in the unmodified SPE module is monolingual, with raw RBMT Chinese output on one side and the reference Chinese translation on the other. It is therefore necessary to find out which of the raw RBMT output strings were translated from English phrases containing prepositions. In other words, we need the phrase correspondences between the English source and the raw RBMT output.

 Step 4 Generate a bilingual phrase table

[Figure 6.1 box labels: TrainMT; TrainZH; Build an SPE module; Obtain a monolingual phrase table PhraseMTREF; Translate the test sample using Systran; Post-edit the raw Systran translation of the sample using the trained SPE module; Obtain SPED (translation of the default SPE module)]

Using TrainEN as the source language and TrainMT as the target language, we obtained a bilingual phrase table with the phrase extraction tools of Moses. The resulting phrase table contains pairs with English phrases on one side and raw RBMT output on the other. Next, any phrase pair whose English side contained no preposition was removed. The resulting phrase table (more specifically, a preposition phrase table) is what allows us to modify the default SPE system. The notations used at this step are shown in Table 6.5. PhraseENprepMT was then used to remove from PhraseMTREF (obtained in step 1) the phrases that do not relate to prepositions.

Notation         Meaning
PhraseENMT       Bilingual phrase table containing all possible corresponding translation sequences learnt from the English training data and the RBMT translation
PhraseENprepMT   Phrase table with English phrases containing prepositions and their corresponding RBMT translation

Table 6.5: Notations for bilingual and preposition phrase table
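The preposition filter in Step 4 can be sketched as follows. This is a minimal illustration, assuming a plain-text phrase table in which fields are separated by `|||` with the English phrase first and the raw RBMT phrase second. The preposition list, the scores and the function names are hypothetical; the list actually used in the experiment is not reproduced here.

```python
# Hypothetical, deliberately short preposition list for illustration.
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "from", "by"}


def split_fields(line):
    """Split one phrase-table line into its '|||'-separated fields."""
    return [field.strip() for field in line.rstrip("\n").split("|||")]


def has_preposition(english_phrase, prepositions=PREPOSITIONS):
    """True if any token of the English phrase is in the preposition list."""
    return any(tok.lower() in prepositions for tok in english_phrase.split())


def filter_preposition_pairs(lines, prepositions=PREPOSITIONS):
    """Keep only phrase pairs whose English side contains a preposition."""
    kept = []
    for line in lines:
        fields = split_fields(line)
        # Skip malformed lines that lack both an English and an MT side.
        if len(fields) >= 2 and has_preposition(fields[0], prepositions):
            kept.append(line)
    return kept


# Toy PhraseENMT entries; the scores are invented.
phrase_en_mt = [
    "In the Requirement tab ||| 要求 表 里 ||| 0.5 0.5 0.5 0.5",
    "the name ||| 名称 ||| 0.8 0.8 0.8 0.8",
]
phrase_en_prep_mt = filter_preposition_pairs(phrase_en_mt)
print(phrase_en_prep_mt)
```

Matching on surface tokens is a simplification: it would, for instance, also keep phrases in which "to" is an infinitive marker rather than a preposition.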

Comparing PhraseENprepMT (from step 4) and PhraseMTREF (from step 1), we can see that the common part of these two phrase tables is the raw RBMT output. PhraseENprepMT contains English phrases with prepositions and their corresponding raw RBMT Chinese translations. PhraseMTREF contains raw RBMT Chinese translations and the corresponding reference Chinese translations. For example, the following phrase pair is present in PhraseENprepMT:

English phrase ||| Raw RBMT translation
In the Requirement tab ||| 要求 表 里
[gloss: Requirement tab in]

and the following phrase pair is present in PhraseMTREF:

Raw RBMT translation ||| Reference translation
要求 表 里 ||| “ 要求 ” 表 中
[gloss: Requirement tab in] ||| [gloss: “Requirement” tab in]

Therefore, the two phrase tables can be connected through the raw RBMT translation as follows:

PhraseENprepMT: English phrase ||| Raw RBMT translation
PhraseMTREF: Raw RBMT translation ||| Reference translation

English phrase ||| Raw RBMT translation ||| Reference translation
In the Requirement tab ||| 要求 表 里 ||| “ 要求 ” 表 中
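Connecting the two tables through the raw RBMT side amounts to a join on that shared column. The sketch below is one possible way to express it, using in-memory lists of pairs; the function name and the data layout are assumptions made for illustration, not the tooling used in the experiment.

```python
from collections import defaultdict


def link_tables(en_prep_mt, mt_ref):
    """Join (English, MT) pairs with (MT, reference) pairs on the MT side."""
    refs_by_mt = defaultdict(list)
    for mt, ref in mt_ref:
        refs_by_mt[mt].append(ref)
    # One output triple per reference translation sharing the same MT phrase.
    return [
        (en, mt, ref)
        for en, mt in en_prep_mt
        for ref in refs_by_mt.get(mt, [])
    ]


# The example phrase pairs from the text.
phrase_en_prep_mt = [("In the Requirement tab", "要求 表 里")]
phrase_mt_ref = [("要求 表 里", "“ 要求 ” 表 中")]

for triple in link_tables(phrase_en_prep_mt, phrase_mt_ref):
    print(" ||| ".join(triple))
```

English phrases whose raw RBMT translation has no entry in the second table simply produce no triple.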

We compared PhraseMTREF to PhraseENprepMT and retained those phrase pairs in PhraseMTREF whose raw RBMT side could be matched to the raw RBMT side of a pair in PhraseENprepMT. However, we are aware that not all the raw RBMT phrases in PhraseENprepMT can be found in PhraseMTREF. Moreover, even when a phrase is found in both phrase tables, there are two types of match. Let us continue with the same example to illustrate this point. Suppose the following phrase pair is present in PhraseENprepMT:

English phrase ||| Raw RBMT translation
In the Requirement tab ||| 要求 表 里
[gloss: Requirement tab in]

In PhraseMTREF, we may find two matching phrases:

1) Raw RBMT translation ||| Reference translation
要求 表 里 ||| “ 要求 ” 表 中

2) Raw RBMT translation ||| Reference translation
将 名称 填 在 要求 表 里 ||| 在 “ 要求 ” 表 中 填入 名字
[gloss: name fill in Requirement tab in] ||| [gloss: In “Requirement” tab in fill name]

The first match indicates that the whole phrase from PhraseENprepMT can be fully matched in PhraseMTREF. The second match indicates that the phrase from PhraseENprepMT may be contained as part of a phrase in PhraseMTREF. We call the first a Full Match, i.e. a phrase in PhraseENprepMT is matched exactly and in its entirety in PhraseMTREF, and the second a Partial Match, i.e. a phrase in PhraseENprepMT matches part of a phrase in PhraseMTREF. The difference between the two is that a Partial Match (case 2) contains extra information that is not necessarily related to prepositions. Using only Full Matches minimises this unrelated information, which may degrade the translation of prepositions. The drawback is that phrases like the one in case 2 are missed even though they do contain translations of prepositions. Also using Partial Matches ensures that all phrases related to prepositions are included, at the cost of including context not related to prepositions.

Based on these two match types, we filtered PhraseMTREF in two ways. The first keeps only phrase pairs that have a full and exact match between PhraseMTREF and PhraseENprepMT (match 1). This removed 76.9% of the phrase pairs from PhraseMTREF. The second keeps a phrase pair in PhraseMTREF if its raw RBMT side contains, or exactly matches, a phrase in PhraseENprepMT (both match 1 and match 2). In contrast to the first filtering, just 2.6% of the phrase pairs in PhraseMTREF were removed.
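The two filtering modes can be sketched as below. This is an illustrative reconstruction rather than the procedure actually used: it assumes whitespace-tokenised phrases and treats a Partial Match as the preposition phrase occurring as a contiguous token sequence inside the raw RBMT side. The function names and the third toy entry are hypothetical.

```python
def contains_phrase(tokens, sub):
    """True if `sub` occurs as a contiguous token sequence in `tokens`."""
    n = len(sub)
    return any(tokens[i:i + n] == sub for i in range(len(tokens) - n + 1))


def filter_phrase_table(mt_ref_pairs, prep_mt_phrases, mode="full"):
    """Filter (MT, reference) pairs against preposition-related MT phrases.

    mode="full":    keep pairs whose MT side exactly equals a phrase.
    mode="partial": keep pairs whose MT side equals or contains a phrase.
    """
    if mode not in ("full", "partial"):
        raise ValueError("mode must be 'full' or 'partial'")
    # Tokenise once and drop empty phrases, which would match everything.
    prep_tokens = [p.split() for p in prep_mt_phrases if p.split()]
    prep_exact = {" ".join(toks) for toks in prep_tokens}

    kept = []
    for mt, ref in mt_ref_pairs:
        mt_tokens = mt.split()
        if mode == "full":
            match = " ".join(mt_tokens) in prep_exact
        else:
            match = any(contains_phrase(mt_tokens, p) for p in prep_tokens)
        if match:
            kept.append((mt, ref))
    return kept


# MT sides of PhraseENprepMT and toy PhraseMTREF entries from the text,
# plus one invented entry unrelated to prepositions.
prep_mt = ["要求 表 里"]
phrase_mt_ref = [
    ("要求 表 里", "“ 要求 ” 表 中"),
    ("将 名称 填 在 要求 表 里", "在 “ 要求 ” 表 中 填入 名字"),
    ("名称", "名字"),
]
full_kept = filter_phrase_table(phrase_mt_ref, prep_mt, mode="full")
partial_kept = filter_phrase_table(phrase_mt_ref, prep_mt, mode="partial")
print(len(full_kept), len(partial_kept))
```

Because every Full Match is also a Partial Match under this definition, the partial filter always keeps at least as many phrase pairs as the full filter, which is consistent with the much smaller share of pairs removed by the second filtering.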

After filtering the phrase table of the general SPE system, two new SPE systems were generated. The Baseline translation was then post-edited again by each of the two new SPE systems and two new translations were obtained (Table 6.6).

Notation   Meaning
SPEP       Translation from the modified SPE module with the phrase table that was filtered based on Partial Matches
SPEF       Translation from the modified SPE module with the phrase table that was filtered based on Full Matches

Table 6.6: Notations for SPEP and SPEF

Figure 6.2 illustrates the above steps.

[Figure 6.2 box labels: TrainEN; TrainMT; Obtain a bilingual phrase table PhraseENMT; Preposition list; filter; Obtain a bilingual preposition phrase table PhraseENprepMT; PhraseMTREF from Figure 5.1; Compare the MT side between PhraseENprepMT and PhraseMTREF; Full Match; Partial Match; SPEP; SPED]

To reiterate, the purpose of this test was to compare the Baseline, SPED, SPEP and SPEF, focusing on the gains and losses of the modified SPE modules, in particular for the translation of prepositions. To ascertain the level of gains or losses, these translations were compared and evaluated both manually and automatically.