Experimental Setup - Preference Learning for Machine Translation

For the experiments in this chapter we use a range of data sets for the German-English and Russian-English language pairs, which we describe here for clarity. We also describe the MT system and the basic setup shared by all experiments presented.

3.4.1 Data


To provide reliable empirical results, we use a total of four data sets to evaluate the different aspects of the proposed algorithms, covering two diverse translation directions: German-to-English and Russian-to-English. For German-to-English we use three data sets for small, medium and large scale experimentation, namely Nc, Ep and Wmt13, and for Russian-to-English a single data set, Wmt15. All numbers in the tables are rounded to the nearest thousand, ten thousand, hundred thousand, or million, and abbreviated with K (thousand) and M (million).

The small scale data set Nc is useful for quick experimentation and verification of hypotheses with a fast turn-around time. The corpus is well studied [Koehn and Monz, 2006a] and has been used, inter alia, to show the domain adaptation behaviour of various MT systems [Koehn and Schroeder, 2007]. The data is furthermore attractive for experimentation, since a minimal system trained only on the roughly 130K parallel segments of training data already produces sensible outputs on the in-domain test sets. The basic statistics of this data are depicted in Table 3.1. The training set is the parallel data distributed for the WMT’11 (Workshop on Machine Translation, 2011) translation task [Callison-Burch et al., 2011b]. Tuning, development test and test data are nc-dev2007, nc-devtest2007 and nc-test2007 respectively, from the data provided for the WMT’07 translation task [Callison-Burch et al., 2007]. Three different versions of the grammar extractor were used for the experiments with the Nc data set, which is why there are different baselines. Result tables with different versions are annotated with ∗, ∗∗ or @.¹¹ Note that the results reported on different versions are not comparable.

Data Set     # Segments
Training     1.7M
Tuning       2K
Dev. Test    2K
Test1        2K
Test2        2K

Table 3.2: Statistics for Europarl (Ep) German-to-English data.

The medium sized data set for German-to-English is denoted as Ep, a collection of speeches given in the European parliament as described by Koehn [2005]. The corpus is also well studied and widely used, and its training data is particularly interesting since it is a large body of very homogeneous textual data. The training data is an order of magnitude larger than that available for the small data set Nc, as depicted in Table 3.2. We also have two test sets available. The training set is the parallel data distributed for the WMT’11 translation task [Callison-Burch et al., 2011b]. Tuning, development test, test1 and test2 data are dev2006, devtest2006, test2006 and test2007 respectively, from the data provided for the WMT’07 translation task [Callison-Burch et al., 2007]. Two different versions of the grammar extractor were used for the experiments with the Ep data set, which is why the baselines differ. Result tables with different versions are annotated with ∗ or @. Also note that the results reported on different versions are not comparable.

In addition to the in-domain data, we also use another collection of development and test sets with Ep, extracted from the Common Crawl data [Smith et al., 2013]. We use a development set of about 2K segments for tuning, and two test sets, one containing about 2.5K and the other about 3K segments. This data is referred to as Crawl.

The largest data set we use is denoted as Wmt13. The training data is a concatenation of the Nc, Ep and Common Crawl training data sets. Additional monolingual English data from the English Gigaword corpus [Parker et al., 2011] is added for language modeling. The data was originally distributed for the news translation task described in Bojar et al. [2013]. For tuning, development test, and test, we have two distinct data sets each. Parallel and monolingual training data are a concatenation of all data distributed for the WMT’13 translation task [Bojar et al., 2013], as noted above. The monolingual data used for training a language model includes the English side of the parallel data. TuningS data is the first half of the newstest2008 data. TuningL is the concatenation of the newstest2008, newstest2009, newstest2010 and newstest2011 data. Development test1 and development test2 are the newssyscomb2009 and newstest2013 sets respectively. The test sets are newstest2012 and newstest2014. All tuning, development test and test sets are distributed for the WMT’14 translation task [Bojar et al., 2014b].

Data Set       # Segments
Training       4.5M
Train. Mono.   120M
TuningS        1K
TuningL        10K
Dev. Test1     0.5K
Dev. Test2     3K
Test1          3K
Test2          3K

Table 3.3: Statistics for WMT’13 (Wmt13) German-to-English data.

For Russian-to-English, we use the data distributed for the WMT’15 news translation task [Bojar et al., 2015]. Parallel and monolingual training data comprise all data provided for this task. For tuning we use newstest2012, for development test newstest2013 and for test newstest2014.

Data Set     # Segments
Training     2M
Tuning       3K
Dev. Test    3K
Test         3K

Table 3.4: Statistics for WMT’15 (Wmt15) Russian-to-English data.
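The construction of the two Wmt13 tuning sets described above can be sketched as follows. This is only an illustration: the function name, the in-memory dictionary of segments, and the choice to round the "first half" down for an odd number of segments are assumptions, not details given in the text.

```python
from typing import Dict, List, Tuple

# The newstest sets that make up TuningL, in the order named in the text.
TUNING_L_SETS = ("newstest2008", "newstest2009", "newstest2010", "newstest2011")


def build_tuning_sets(newstest: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Return (TuningS, TuningL) from a mapping of set name to segments."""
    first = newstest["newstest2008"]
    # Assumption: for an odd number of segments the half is rounded down.
    tuning_s = first[: len(first) // 2]

    tuning_l: List[str] = []
    for name in TUNING_L_SETS:
        tuning_l.extend(newstest[name])
    return tuning_s, tuning_l


# Toy data standing in for the real newstest files.
toy = {name: [f"{name}-{i}" for i in range(4)] for name in TUNING_L_SETS}
tuning_s, tuning_l = build_tuning_sets(toy)
```

In practice the same slicing would have to be applied identically to the source and reference sides so that the segments stay aligned.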

All data sets presented in this chapter have only a single reference translation per source segment. The data is mostly used as is, without any filtering¹², with the exception of the Common Crawl data for the Wmt13 German-to-English data, which was filtered by length (maximum of 200 for both source and target) when used as tuning data. For training, data is lowercased and tokenized using the scripts distributed with the Moses SMT toolkit [Koehn et al., 2007]. When German is the source language, we apply compound splitting using the methods of either Koehn and Knight [2003] or Dyer [2009]. For all experiments, 3-gram or 4-gram language models are used, depending on the size of the data (N = 3 for the small data set, N = 4 for all others). Language models are estimated with either the lmplz [Heafield et al., 2013] or the SRILM [Stolcke, 2002] toolkit using modified Kneser-Ney smoothing [Kneser and Ney, 1995; Chen and Goodman, 1996], with pruning of singleton N-grams and backoff interpolation. All language models are binarized using the kenlm library [Heafield, 2011].
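The length filter mentioned above (a maximum of 200 for both source and target) could look roughly like the following sketch. The text does not state the unit of length; counting whitespace-separated tokens is an assumption here, as are the function and variable names.

```python
from typing import Iterable, Iterator, Tuple

MAX_LEN = 200  # maximum length per side, as stated in the text


def filter_by_length(
    pairs: Iterable[Tuple[str, str]], max_len: int = MAX_LEN
) -> Iterator[Tuple[str, str]]:
    """Yield only the segment pairs whose source and target both fit the limit."""
    for source, target in pairs:
        # Assumption: length is the number of whitespace-separated tokens.
        if len(source.split()) <= max_len and len(target.split()) <= max_len:
            yield source, target


# Toy segment pairs: the second one exceeds the limit on the source side.
pairs = [
    ("das ist ein test", "this is a test"),
    (" ".join(["wort"] * 201), "too long on the source side"),
    ("kurz", " ".join(["word"] * 200)),
]
kept = list(filter_by_length(pairs))
```

A pair is dropped as soon as either side exceeds the limit, so source and target files remain parallel after filtering.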

Tuning is performed on the respective tuning set of the data unless noted otherwise.

3.4.2 Machine Translation Systems

Throughout all experiments we use the implementation of the Hiero approach to SMT [Chiang, 2007] provided within the cdec framework [Dyer et al., 2010]. To prevent excess computation while translating, we use a fixed cube pruning setting of 200 for rescoring with the language models. Viterbi word alignments in both directions, as required for the extraction of grammars [Lopez, 2007], are estimated with the GIZA++ toolkit [Och and Ney, 2003] using the wrapper provided with the Moses toolkit. Alignment symmetrization [Liang et al., 2006b] is also performed with the Moses toolkit using the grow-diag-final-and heuristic. Per-segment grammars¹³ and associated features are extracted according to the algorithm proposed by Lopez [2007] with either the original implementation, the implementation described in Baltescu and Blunsom [2014], or the one distributed with cdec. When extracting grammars for the training data, we apply a leave-one-out technique [Zollmann and Sima’an, 2005; Wuebker et al., 2012], excluding the current segment from rule extraction and feature estimation. For each source token, a pass-through rule is added to the per-sentence grammar to enable the translation of unknown source tokens. Finally, a number of glue rules are added to the grammars by default, providing re-combination facilities for all possible rule configurations. Two non-terminals, X1 and X2, are allowed, following Chiang [2007].
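The idea behind the leave-one-out technique can be illustrated with a strongly simplified sketch: when estimating features for one training segment, the rules extracted from that segment are left out of the counts. This is not the sampling-based extraction of Lopez [2007] or the cdec implementation. The rule representation, the function name, and the single relative-frequency feature are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (source side, target side), hypothetical representation


def leave_one_out_estimates(
    rules_per_segment: List[List[Rule]], held_out: int
) -> Dict[Rule, float]:
    """Estimate p(target | source) from all segments except `held_out`."""
    rule_counts: Counter = Counter()
    for index, rules in enumerate(rules_per_segment):
        if index == held_out:
            continue  # exclude the current segment from rule extraction
        rule_counts.update(rules)

    # Marginal counts of each source side over the remaining segments.
    source_counts: Counter = Counter()
    for (source, _target), count in rule_counts.items():
        source_counts[source] += count

    return {
        rule: count / source_counts[rule[0]]
        for rule, count in rule_counts.items()
    }


# Toy corpus: the rule ("taube", "pigeon") occurs only in segment 0.
corpus_rules = [
    [("haus", "house"), ("das", "the"), ("taube", "pigeon")],
    [("haus", "home"), ("das", "the")],
    [("haus", "house")],
]
estimates = leave_one_out_estimates(corpus_rules, held_out=0)
```

With segment 0 held out, the rule seen only in that segment is absent from its grammar, so the segment cannot be translated by rules that were extracted from it alone.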

The maximal span size for each non-terminal is set to 15 in grammar extraction and in decoding, while the minimum span size is one. Further grammar extraction parameters are: adjacent non-terminals are disallowed; there may be at most five terminal symbols on the left- and right-hand side of each rule, and five symbols in total; and 300 samples are taken into account for each source phrase.
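The extraction constraints listed above can be read as a simple validity check on a candidate rule, sketched below. Several points are assumptions: the bracketed non-terminal notation, applying the symbol limits to each side separately, and checking adjacency on the source side only. The sampling limit of 300 per source phrase is not modelled.

```python
from typing import List

MAX_SPAN = 15          # maximal span size per non-terminal
MIN_SPAN = 1           # minimal span size per non-terminal
MAX_TERMINALS = 5      # terminal symbols per rule side
MAX_SYMBOLS = 5        # symbols in total (assumed to apply per rule side)
MAX_NONTERMINALS = 2   # X1 and X2


def is_nonterminal(symbol: str) -> bool:
    # Hypothetical notation: non-terminals are written like "[X,1]".
    return symbol.startswith("[X") and symbol.endswith("]")


def has_adjacent_nonterminals(symbols: List[str]) -> bool:
    flags = [is_nonterminal(s) for s in symbols]
    return any(a and b for a, b in zip(flags, flags[1:]))


def side_within_limits(symbols: List[str]) -> bool:
    nonterminals = sum(is_nonterminal(s) for s in symbols)
    terminals = len(symbols) - nonterminals
    return (
        0 < len(symbols) <= MAX_SYMBOLS
        and terminals <= MAX_TERMINALS
        and nonterminals <= MAX_NONTERMINALS
    )


def rule_is_valid(source: List[str], target: List[str], nt_spans: List[int]) -> bool:
    """Check a candidate rule; nt_spans holds the span size of each non-terminal."""
    if not all(MIN_SPAN <= span <= MAX_SPAN for span in nt_spans):
        return False
    # Assumption: adjacency is only checked on the source side.
    if has_adjacent_nonterminals(source):
        return False
    return side_within_limits(source) and side_within_limits(target)
```

For example, `rule_is_valid(["das", "[X,1]", "haus"], ["the", "[X,1]", "house"], [3])` passes all checks, while a rule whose source side is `["[X,1]", "[X,2]"]` is rejected because its non-terminals are adjacent.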

¹³ A per-segment grammar compiles all rules applicable to a sentence into a single file, in contrast to a global grammar that includes all possible rules.

All settings as described here apply to all following experiments unless noted otherwise.
