Setups for Tuning Methods - Preference Learning for Machine Translation

In addition to our own work, we compare against the Mert and Mira algorithms, using the implementations described below. We refer to our online pairwise ranking tuning method as Dtrain.

Nc@

System               Dev. Test   Test     # Features

Dense                25.9        28.0     12
Rule-Id              25.5        †27.6    140K
Rule-Bigram          25.8        †27.4    30K
Rule-Shape           25.9        28.1     51
Sparse               25.7        28.2     180K

Dense, Bitext        26.1        †27.9    12
Rule-Id, Bitext      26.1        †28.0    3.4M
Rule-Bigram, Bitext  26.3        28.3     330K
Rule-Shape, Bitext   26.4        28.3     51
Sparse, Bitext       26.4        28.6     4.7M

Table 3.5: Comparing Sparse and Dense feature sets on small-scale Nc@ data, tuning on a small development set (baseline algorithm) as well as the full bitext (IterMixSGD algorithm, cf. Section 3.10). Significance is assessed with an approximate randomization test between experiments in the same group, and significant differences, with p < 0.05, to the best result (in bold) are denoted by †. Table adapted from [Simianer et al., 2012].

Nc∗

System   Dev. Test

Dense    25.2
Sparse   25.5

Table 3.6: Comparing Sparse and Dense feature sets on small-scale Nc∗ data, tuning on a single, small development set.

Ep∗

System   Dev. Test   Test1   Test2

Dense    27.8        27.9    28.1
Sparse   29.4        29.1    29.5

Table 3.7: Comparing Sparse and Dense feature sets on medium-scale Ep∗ data, tuning on a single, small development set.

Wmt13

System   Dev. Test1   Test1   Dev. Test2   Test2

Dense    18.3         16.7    19.1         19.2
Sparse   19.6         18.1    20.3         20.6

Table 3.8: Comparing Sparse and Dense feature sets on large-scale Wmt13 data, tuning on a single, small development set (TuningS).

Wmt15

System           Dev. Test   Test

Dense            17.0        20.5
Sparse           19.1        22.3

Dense, Bitext    20.7        25.0
Sparse, Bitext   21.9        26.1

Table 3.9: Comparing Sparse and Dense feature sets on Wmt15 data, tuning on a single, small development set as well as the full bitext.

All methods described in this work rely on k-best lists for approximating the true search space. Throughout this work we use k = 100 unique entries. Uniqueness is applied on the string level, always including only the derivation with maximum score.
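The string-level uniqueness rule above can be sketched as a post-filter over a list of derivations. This is only an illustration: the `Derivation` type and the `unique_kbest` function are hypothetical names, and a real decoder would more likely enforce uniqueness while extracting the list rather than afterwards.

```python
from typing import Dict, List, NamedTuple


class Derivation(NamedTuple):
    """Hypothetical container: a translation string and its model score."""
    string: str
    score: float


def unique_kbest(derivations: List[Derivation], k: int = 100) -> List[Derivation]:
    """Keep one derivation per distinct string (the one with maximum score),
    then return the k highest-scoring survivors."""
    best: Dict[str, Derivation] = {}
    for d in derivations:
        current = best.get(d.string)
        if current is None or d.score > current.score:
            best[d.string] = d
    ranked = sorted(best.values(), key=lambda d: d.score, reverse=True)
    return ranked[:k]
```

With this filter, two derivations that yield the same string collapse into the higher-scoring one before the list is truncated to k entries.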

For all data sets a development test set is used to adjust the various hyperparameters of different methods, reporting scores on test with settings that maximized the score on the respective development test set.

If applicable, all methods use the same number of epochs.

3.6.1 Minimum Error Rate Training

We use an implementation of hypergraph Mert, as described by Kumar et al. [2009], which is an adaptation of the lattice-based Mert algorithm [Macherey et al., 2008], but using hypergraphs instead of k-best lists.

An implementation of hypergraph Mert is provided within the cdec framework. This version requires non-zero initial weights for each feature to be optimized. For our experiments we use the weights described in Section 3.5.1.

Since Mert uses random initializations, we can account for optimizer instability by repeating the tuning process at least three times, following Clark et al. [2011]. We report the mean scores along with standard deviations if applicable.
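Aggregating repeated tuning runs into a mean and a standard deviation could look like the following sketch. The function name `summarize_runs` is hypothetical, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator is used.

```python
import statistics
from typing import List, Tuple


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over repeated tuning runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    # Assumption: sample standard deviation, not the population variant.
    return statistics.mean(scores), statistics.stdev(scores)
```

Each entry of `scores` would be the %BLEU of one complete tuning run evaluated on the same test set.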

3.6.2 Margin-infused Relaxed Algorithm

For the experiments with the Mira algorithm we use the implementation distributed with cdec, which supports a wide range of hyperparameters. This implementation of Mira uses k-best lists as a surrogate for the true search space to select hope and fear derivations. Hope and fear can be selected by a number of criteria, which we exhaustively explore in our experiments. In one variant we calculate hope and fear as proposed by Chiang [2012] by:

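\begin{align}
\text{hope} &= \operatorname*{arg\,max}_{\cdot} \; m(\cdot) + g(\cdot) \notag \\
\text{fear} &= \operatorname*{arg\,max}_{\cdot} \; m(\cdot) - g(\cdot), \tag{3.18}
\end{align}

where m(·) is the model score and g(·) is the gold-standard score (the higher the better).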

In the other variant, hope and fear derivations are simply selected by maximizing g(·) and −g(·), respectively, similar to the local update of Liang et al. [2006a]. The gold-standard function is a smoothed per-sentence BLEU [Chiang, 2012].
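The two selection variants described above can be sketched over a k-best list as follows. This is an illustration, not the cdec implementation: the function name, the variant labels, and the representation of a hypothesis as a `(model_score, gold_score)` pair are all assumptions made for the example.

```python
from typing import List, Tuple

Hypothesis = Tuple[float, float]  # (model score m, gold-standard score g)


def select_hope_fear(kbest: List[Hypothesis], variant: str = "model+gold") -> Tuple[int, int]:
    """Return the indices (hope, fear) of the selected hypotheses in kbest."""
    indices = range(len(kbest))
    if variant == "model+gold":
        # hope maximizes m + g, fear maximizes m - g (cf. Equation 3.18)
        hope = max(indices, key=lambda i: kbest[i][0] + kbest[i][1])
        fear = max(indices, key=lambda i: kbest[i][0] - kbest[i][1])
    elif variant == "gold-only":
        # hope maximizes g, fear maximizes -g (the model score is ignored)
        hope = max(indices, key=lambda i: kbest[i][1])
        fear = max(indices, key=lambda i: -kbest[i][1])
    else:
        raise ValueError(f"unknown variant: {variant}")
    return hope, fear
```

The variants can disagree on the hope derivation: a hypothesis with a high model score and a mediocre gold score may win under m + g, while the gold-only criterion picks the hypothesis with the highest gold score regardless of what the model prefers.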

For Mira we use the default of k = 500 unique translation hypotheses for all experiments. Sentence-level BLEU scores are calculated using a pseudo corpus, as proposed by Chiang [2012], using a decay rate of 0.95.

The implementation supports parallelization similar to the downpour scheme [Dean et al., 2012], but we use only a single process to obtain deterministic results, since the parallelization resulted in large variance in the results [Simianer et al., 2012]. The algorithm is run for the default of 20 epochs, and final weights are generated by averaging the final weights of each epoch.
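Averaging the per-epoch weight vectors can be sketched as below. The sparse dictionary representation and the function name `average_weights` are assumptions for illustration; a feature that is absent from an epoch's vector is treated as having weight 0 in that epoch.

```python
from collections import defaultdict
from typing import Dict, List


def average_weights(epoch_weights: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the final weight vector of each epoch, feature by feature."""
    if not epoch_weights:
        raise ValueError("no epochs to average")
    totals: Dict[str, float] = defaultdict(float)
    for weights in epoch_weights:
        for name, value in weights.items():
            totals[name] += value
    n = len(epoch_weights)
    # Missing features contribute 0, so every sum is divided by all n epochs.
    return {name: total / n for name, total in totals.items()}
```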

We try five different optimizers: SGD, passive-aggressive Mira with selection from cutting plane, cutting-plane Mira, passive-aggressive Mira, and full Mira with k-best hope, fear, and model-best constraints. Optimization is always started from the same initial weights as used for Mert. The weights for the Sparse feature set are initialized to 0.

We always perform a full sweep over step sizes/learning rates over the range $10^{-10} \ldots 1.0$ with a granularity of $10^{-1}$.
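A sweep of this kind might be organized as in the sketch below. It assumes that the stated granularity means consecutive values differ by a factor of 10, which gives eleven candidate rates; the `evaluate` callback is a hypothetical stand-in for a full tuning run followed by scoring on the development test set.

```python
from typing import Callable, List


def learning_rate_grid(min_exp: int = -10, max_exp: int = 0) -> List[float]:
    """Candidate rates 10^min_exp, ..., 10^max_exp, a factor of 10 apart."""
    return [10.0 ** e for e in range(min_exp, max_exp + 1)]


def sweep(evaluate: Callable[[float], float]) -> float:
    """Return the rate with the highest development test score.

    `evaluate` is hypothetical: it should tune with the given rate and
    return the resulting score on the development test set.
    """
    return max(learning_rate_grid(), key=evaluate)
```

Selecting by development test score mirrors the hyperparameter selection described earlier in this section; the test set itself is only scored with the selected setting.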

Optimization is carried out online, with k-best lists of translations re-generated once per epoch.

3.6.3 Online Discriminative Training with Pairwise Ranking

For all experiments with Dtrain, the optimization starts from the 0 vector. When using a margin, it is fixed at 1.0, and a coarse grid search for an optimal learning rate is performed over the range $10^{-10} \ldots 1.0$ with a step size of $10^{-1}$. Without a margin the learning rate is fixed to 1.0. We always use a k-best size of 100 for experiments with Dtrain, and optimize a smoothed per-sentence BLEU according to Nakov et al. [2012] unless noted otherwise. Examples from all k-best lists are generated by first sorting according to the gold-standard score, then applying a pair extraction algorithm as described in Section 3.8. By default we use the algorithm depicted in Algorithm 5.

The implementation of the algorithm is online: each segment (excluding the very first segment) is translated with an updated weight vector, unless the data for the previous segment were all correctly classified. The individual updates, however, consist of the sum of all gradients for a single k-best list. Thus the updates can be considered mini-batches, since the learning rate applies to a (non-normalized) sum of gradients.
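The mini-batch character of a single update can be sketched as follows. This is a simplified illustration under loud assumptions: it enumerates all pairs of the sorted k-best list instead of the pair extraction of Section 3.8, uses a perceptron-style hinge criterion, and the names `dot` and `update_on_kbest` as well as the sparse dictionary representation are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

Features = Dict[str, float]


def dot(weights: Features, features: Features) -> float:
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def update_on_kbest(weights: Features, kbest: List[Tuple[Features, float]],
                    learning_rate: float = 1.0, margin: float = 0.0) -> bool:
    """One update from a single k-best list of (features, gold score) entries.

    Returns False, leaving the weights untouched, if every pair is ranked
    correctly (with the required margin).
    """
    ranked = sorted(kbest, key=lambda h: h[1], reverse=True)
    gradient_sum: Features = defaultdict(float)
    violated = False
    # Assumption: all pairs are used, standing in for the pair extraction step.
    for (f_hi, g_hi), (f_lo, g_lo) in combinations(ranked, 2):
        if g_hi == g_lo:
            continue  # no preference between equally good hypotheses
        if dot(weights, f_hi) - dot(weights, f_lo) <= margin:
            violated = True
            for name, value in f_hi.items():
                gradient_sum[name] += value
            for name, value in f_lo.items():
                gradient_sum[name] -= value
    if not violated:
        return False
    # The learning rate scales the non-normalized sum of all pair gradients.
    for name, value in gradient_sum.items():
        weights[name] = weights.get(name, 0.0) + learning_rate * value
    return True
```

Because the sum is not divided by the number of violated pairs, the effective step grows with the number of misranked pairs in the list, which is one reason the learning rate interacts with the k-best size.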

The settings for parallelization are given in the descriptions of the specific experiments.

For evaluation we report %BLEU-4 scores on development test and test data sets. We do not report scores on training or tuning sets, since these scores are calculated on a per-sentence basis. All scores are calculated on lowercased and tokenized data.

            Full                      Multipartite
Training    Test Acc.   Train Err.    Test Acc.   Train Err.

all         97%         0%            100%        0%
100K        90%         0%            96%         0%
10K         74%         0%            85%         0%
1K          64%         0%            71%         0%

Table 3.10: Synthetic experiments using the Sparse feature set for both full and multipartite settings, reporting accuracies on test data and error rates on training data.
