Direct Error Minimization - Optimization in Machine Translation

2.5 Optimization in Machine Translation

2.5.2 Direct Error Minimization

In direct error minimization methods, the goal is to directly optimize the gold-standard evaluation metric of interest, e.g. TER or BLEU scores, on a given set of training examples. The respective objective for optimizing the log-linear model can be formulated as follows:

\[ w^{*} = \arg\max_{w} \left\{ \sum_{i=1}^{n} g\left( e^{*}_{i}, \max_{j \in Y(f_i)} p_{w}(e_{i,j} \mid f_i) \right) \right\}, \tag{2.53} \]

where g(·) is the evaluation metric, i iterates over the indexes of the training data with n examples, Y(f) is the (indexed) set of translation hypotheses for a given source segment f, and $e^{*}_{i}$ is the reference translation for example i.

Formulated as a loss function to be minimized, this yields the true error:

\[ L_{\text{true}} = - \sum_{i=1}^{n} g\left( e^{*}_{i}, \arg\max_{(e,h) \in Y(f_i)} m(f_i, e, h) \right), \tag{2.54} \]

where m(·) represents the model score of a translation hypothesis, i.e. ⟨w, φ(f, e, h)⟩, for a feature representation derived from source f and target e with derivation h. The derivation h corresponds in word-based models to the word alignment, in phrase-based models to a phrase segmentation, and in hierarchical phrase-based models to the co-aligned source and target parse trees.
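As an illustration of Equation 2.54, the following minimal sketch evaluates the true loss on fixed hypothesis lists. It rests on several assumptions that are not part of the original text: hypotheses are given as (translation, feature vector) pairs with the derivation h folded into the features, the names `model_score`, `true_loss` and `exact_match` are hypothetical, and a toy exact-match function stands in for a real sentence-level metric g.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption of this sketch: a hypothesis is (translation, feature vector);
# the derivation h is not modelled separately.
Hypothesis = Tuple[str, Sequence[float]]


def model_score(w: Sequence[float], features: Sequence[float]) -> float:
    """Linear model score <w, phi(f, e, h)>."""
    return sum(wi * xi for wi, xi in zip(w, features))


def true_loss(
    w: Sequence[float],
    hypothesis_lists: List[List[Hypothesis]],
    references: List[str],
    metric: Callable[[str, str], float],
) -> float:
    """Negative sum of metric values of the model-best hypotheses (Eq. 2.54)."""
    total = 0.0
    for hyps, ref in zip(hypothesis_lists, references):
        # arg max over the hypothesis set Y(f_i) under the current weights
        best_translation, _ = max(hyps, key=lambda h: model_score(w, h[1]))
        total += metric(ref, best_translation)
    return -total


def exact_match(reference: str, hypothesis: str) -> float:
    """Toy stand-in for g: 1 if the hypothesis equals the reference, else 0."""
    return 1.0 if reference == hypothesis else 0.0


lists = [
    [("a b", [1.0, 0.0]), ("a c", [0.0, 1.0])],
    [("d", [1.0, 0.0]), ("e", [0.0, 1.0])],
]
refs = ["a b", "d"]
loss_first = true_loss([1.0, 0.0], lists, refs, exact_match)
loss_second = true_loss([0.0, 1.0], lists, refs, exact_match)
```

Because the arg max changes only when the ranking of hypotheses changes, the loss computed this way is a step function of w, which is what makes it hard to optimize with gradient-based methods.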

While TER can be directly optimized in this framework, the BLEU score, being defined on the full corpus, is not evaluable for isolated hypotheses, which calls either for a sentence-wise approximation or for specialized optimization procedures. The predominant approach for directly optimizing BLEU is described by Och [2003]. Other direct approaches include work by Chung and Galley [2012] and Erdmann and Gwinnup [2015].
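The corpus-level nature of BLEU can be made concrete with a small sketch. It assumes a single reference per segment, pre-tokenized input, n-gram orders up to 4 and no smoothing; the function names are hypothetical and the code is a simplified illustration, not a reference implementation of the metric. The score is computed from per-segment statistics (lengths, clipped n-gram matches, n-gram totals) that are summed over the corpus before precisions and brevity penalty are formed.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of all n-grams of the given order."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_stats(hyp: List[str], ref: List[str], max_n: int = 4) -> List[int]:
    """Per-segment statistics: lengths, then (clipped matches, total) per order."""
    stats = [len(hyp), len(ref)]
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        stats.append(sum((h & r).values()))     # Counter '&' clips to the minimum count
        stats.append(max(len(hyp) - n + 1, 0))  # number of hypothesis n-grams
    return stats


def bleu_from_stats(stats: List[int], max_n: int = 4) -> float:
    """Unsmoothed BLEU from (possibly summed) statistics."""
    hyp_len, ref_len = stats[0], stats[1]
    log_precision = 0.0
    for n in range(max_n):
        matches, total = stats[2 + 2 * n], stats[3 + 2 * n]
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precision += math.log(matches / total) / max_n
    log_brevity_penalty = min(0.0, 1.0 - ref_len / hyp_len)
    return math.exp(log_precision + log_brevity_penalty)


def corpus_bleu(hyps: List[List[str]], refs: List[List[str]], max_n: int = 4) -> float:
    """Sum the statistics over all segments, then compute one score."""
    per_segment = [bleu_stats(h, r, max_n) for h, r in zip(hyps, refs)]
    totals = [sum(column) for column in zip(*per_segment)]
    return bleu_from_stats(totals, max_n)


hyps = [["a", "b", "c"], ["d", "e", "f", "g", "h"]]
refs = [["a", "b", "c"], ["d", "e", "f", "g", "h"]]
short_segment_score = bleu_from_stats(bleu_stats(hyps[0], refs[0]))
corpus_score = corpus_bleu(hyps, refs)
```

In this toy example the three-token segment contains no 4-grams, so its isolated unsmoothed score is zero even though it matches its reference exactly, while the corpus-level score over both segments is well defined. This is the reason a sentence-wise approximation or a corpus-aware procedure is needed.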

2.5.2.1 Minimum Error Rate Training

Minimum error rate training (Mert), as described by Och [2003], aims to directly optimize the true loss (Equation 2.54) on a set of translation hypotheses, e.g. a set of k-best lists of a training set (see footnote 51). Given an initial or previous model, a single iteration of the Mert procedure first produces a list of k-best translations for each input of the training data, annotated with model scores and feature values.

Then, Mert, in its simplest form, optimizes the weight of each feature in a linear model w separately, as a variant of Powell's method [Powell, 1964]. A naïve modus operandi for this single-weight optimization would utilize a grid search, reranking all lists with each candidate weight and observing the global evaluation score. This is generally infeasible due to the large number of 1-best allocations. Instead, Mert utilizes an efficient way of finding the optimal weight for a given feature for a fixed set of k-best lists: while the other parameters are fixed, the model score of every entry of every k-best list can be represented as follows (assuming only unique entries in the k-best lists):

\[ \langle \lambda, \varphi'(f, e) \rangle + \gamma \, \phi''(f, e), \tag{2.55} \]

where φ′ is the feature representation using only the fixed features, λ are the weights for the fixed features, γ is the weight for the currently active feature, and ϕ′′ is the respective feature value. Each hypothesis can thus be represented as a line in γ, with slope ϕ′′(f, e) and intercept ⟨λ, φ′(f, e)⟩. This formulation makes it possible to efficiently determine an optimal global weight for a given feature by generating the upper envelope (UE) of the linear model for each k-best list [Macherey et al., 2008], providing an exhaustive representation for all γ ∈ R:

\[ \mathrm{UE}(Y(f)) = \max_{e \in Y(f)} \left\{ \langle \lambda, \varphi'(f, e) \rangle + \gamma \, \phi''(f, e) : \gamma \in \mathbb{R} \right\}. \tag{2.56} \]

Since the upper envelope is piecewise linear and convex, it is possible to efficiently determine the finite set of values of γ at which the global evaluation metric actually changes its value. With this insight, the weight that yields the globally optimal score on the fixed k-best lists can be determined efficiently for each feature. The process is iterated for a number of epochs, re-decoding the training data each time with the new parameters. While the algorithm is capable of optimizing the weights of a typical SMT system, i.e. fewer than 30 features, it does not scale to larger feature sets, e.g. when using sparse, lexicalized features, see e.g. [Hopkins and May, 2011].
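The single-weight search over upper envelopes can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of Och [2003] or Macherey et al. [2008] verbatim: each hypothesis is a hypothetical triple (slope, intercept, gain), the corpus score is taken to be the sum of per-hypothesis gains (a decomposable stand-in; BLEU would require summing sufficient statistics instead), and every interval between breakpoints is re-scored from scratch rather than updated incrementally.

```python
import bisect
import math
from typing import List, Tuple

Line = Tuple[float, float, float]  # (slope, intercept, gain); names are illustrative


def upper_envelope(lines: List[Line]) -> List[Tuple[float, float]]:
    """Return [(gamma_from, gain)] segments of the upper envelope, left to right."""
    hull: List[Tuple[float, float, float, float]] = []  # (gamma_from, slope, intercept, gain)
    for slope, intercept, gain in sorted(lines, key=lambda l: (l[0], l[1])):
        start = -math.inf
        while hull:
            top_start, top_slope, top_intercept, _ = hull[-1]
            if top_slope == slope:
                hull.pop()  # parallel line with lower or equal intercept is dominated
                continue
            # gamma at which the new (steeper) line overtakes the current top
            crossing = (top_intercept - intercept) / (slope - top_slope)
            if crossing <= top_start:
                hull.pop()  # the top line is never maximal
                continue
            start = crossing
            break
        hull.append((start, slope, intercept, gain))
    return [(start, gain) for start, _, _, gain in hull]


def line_search(kbest_lists: List[List[Line]]) -> Tuple[float, float]:
    """Probe one gamma per interval between breakpoints; return (gamma, total gain)."""
    envelopes = [upper_envelope(lines) for lines in kbest_lists]
    points = sorted({g for env in envelopes for g, _ in env if g != -math.inf})
    if points:
        probes = [points[0] - 1.0]
        probes += [(a + b) / 2.0 for a, b in zip(points, points[1:])]
        probes.append(points[-1] + 1.0)
    else:
        probes = [0.0]
    best_gamma, best_total = probes[0], -math.inf
    for gamma in probes:
        total = 0.0
        for env in envelopes:
            # last segment that starts at or before gamma is the 1-best there
            index = bisect.bisect_right([g for g, _ in env], gamma) - 1
            total += env[index][1]
        if total > best_total:
            best_gamma, best_total = gamma, total
    return best_gamma, best_total


sentence_1 = [(-1.0, 0.0, 0.0), (0.0, 1.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
sentence_2 = [(0.0, 0.0, 0.0), (1.0, -0.5, 1.0)]
envelope_1 = upper_envelope(sentence_1)
gamma, total = line_search([sentence_1, sentence_2])
```

In the toy data the 1-best hypotheses change only at γ = −1, 0.5 and 1, so four probes suffice to cover all of R, whereas a grid search would have to guess a resolution and a range.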

Mert has been studied extensively: the algorithm can be extended to use a larger portion of the search space, i.e. lattices encoding source segmentation in phrase-based MT [Macherey et al., 2008; Galley and Quirk, 2011], or hypergraphs encoding tree-structured derivations in syntax-based MT [Kumar et al., 2009], including more efficient approaches to compute the upper envelope [Sokolov and Yvon, 2011; Dyer, 2013]. Regularization can also be employed [Cer et al., 2008]. Other aspects of the algorithm, considering random restarts [Moore and Quirk, 2008], multi-dimensional optimization [Galley et al., 2013], or stability of the resulting weights [Foster and Kuhn, 2009; Clark et al., 2011] have also been thoroughly explored.

Footnote 51: Even with a small data set and reduced search space, finding optimal weights that maximize the global BLEU score on this sample is still a daunting task, as there are $N^k$ 1-best allocations.

Note that, since BLEU is non-differentiable and piecewise constant in the model weights [Och, 2003; Papineni et al., 2002], it cannot be optimized directly in the way some metrics in IR can (cf. Section 2.6.1.2). While Mert can optimize the gold-standard score for a given set of k-best lists, there is no guarantee whatsoever that the result is a global optimum, since the k-best lists themselves depend on the weights.

In document Preference Learning for Machine Translation (Page 64-66)