2.5 Optimization in Machine Translation
2.5.3 Structured Prediction
In general ML, when output spaces are complex, which is commonplace in MT, different types of learning algorithms can be used than those designed for simpler tasks such as (binary) classification. Structured prediction can be characterized by an antagonism [Daumé, 2006]: while each output y ∈ Y decomposes into an ordered set of states encoded by variable-sized vectors, the losses or error metrics do not decompose in the same way, but are evaluated on the full structure. Structured prediction problems are abundant in NLP: besides MT, typical applications include tagging, parsing, and more general sequence learning tasks (see footnote 52).
The structured prediction framework enables the use of expressive features defined on structured inputs and outputs, and allows for efficient online learning algorithms. A fundamental algorithm in structured prediction is the structured perceptron [Collins and Duffy, 2002; Collins, 2002], a linear model parametrized by a weight vector w, as depicted in Algorithm 2.
Algorithm 2 Structured perceptron. Inputs: learning rate η, set of N training examples (x_i, y_i^*). Algorithm adapted from [Collins and Duffy, 2002].
    for i ← 1 … N do
        ŷ ← arg max_y ⟨w_i, φ(x_i, y)⟩
        if ŷ ≠ y_i^* then
            w_{i+1} ← w_i + η (φ(x_i, y_i^*) − φ(x_i, ŷ))
        else
            w_{i+1} ← w_i
        end if
    end for
    return w_{N+1}
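As a rough illustration, Algorithm 2 could be sketched in Python as follows. The toy tagging data, the tag set TAGS, the emission/transition feature map phi, the exhaustive arg max in predict, and the epochs parameter (several passes over the data instead of the single pass of Algorithm 2) are assumptions made for this sketch only; they are not part of the algorithm as described by Collins and Duffy [2002].

```python
from collections import Counter
from itertools import product

TAGS = ("D", "N", "V")  # hypothetical toy tag set

# Hypothetical toy training data: (input sequence, gold-standard structure).
DATA = [
    (("the", "dog", "barks"), ("D", "N", "V")),
    (("dogs", "bark"), ("N", "V")),
]

def phi(x, y):
    """Sparse joint feature map: emission and tag-bigram counts (assumed)."""
    feats = Counter()
    prev = "<s>"
    for word, tag in zip(x, y):
        feats[("emit", word, tag)] += 1
        feats[("trans", prev, tag)] += 1
        prev = tag
    return feats

def score(w, feats):
    """Inner product <w, phi> for sparse dictionaries."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def predict(w, x):
    """Exhaustive arg max over all tag sequences; feasible for toy inputs only."""
    return max(product(TAGS, repeat=len(x)), key=lambda y: score(w, phi(x, y)))

def structured_perceptron(data, eta=1.0, epochs=1):
    """Update only if the 1-best structure differs from the gold standard."""
    w = {}
    for _ in range(epochs):
        for x, y_gold in data:
            y_hat = predict(w, x)
            if y_hat != y_gold:
                for k, v in phi(x, y_gold).items():
                    w[k] = w.get(k, 0.0) + eta * v
                for k, v in phi(x, y_hat).items():
                    w[k] = w.get(k, 0.0) - eta * v
    return w
```

Calling structured_perceptron(DATA, epochs=5) returns a sparse weight dictionary; on this toy data, predict then reproduces both gold-standard tag sequences.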
The algorithm closely resembles the original perceptron algorithm for classification and has the same convergence properties, but the condition for an update is different: an update is performed iff the 1-best output structure under the current weights w is not correct, i.e. not identical to the gold-standard structure.
Footnote 52: There are also notable exceptions to this classification, such as POS tagging, where the natural loss is actually decomposable.
The algorithm is not directly applicable to traditional SMT: since evaluation relies on one or more reference translations, producing an exact match is either unlikely or impossible, which is further complicated by the large search spaces. Additionally, since SMT relies on a Viterbi (maximum) approximation in decoding, scoring a single derivation of a string instead of all possible derivations, the arg max operation adopted in the structured perceptron is not exactly computable. The Viterbi approximation can be handled in a straightforward way by having the feature mapping φ(·) include the hidden variable h, which encodes the derivation, see [Liang et al., 2006a]. Liang et al. [2006a] further propose to use an approximate reference for the update condition: a sentence-wise surrogate metric makes it possible to pick an oracle translation from each k-best list [Och and Ney, 2002].
The loss for the structured perceptron in SMT can be compactly formulated following Gimpel and Smith [2012b]:
\[
L_{\text{struct}} = \sum_{i=1}^{n} -m\Big(f_i,\ \underset{(e,h)\in K(f_i)}{\arg\max}\ g(e, e^*)\Big) + m(f_i, \hat{e}, \hat{h}), \tag{2.57}
\]
where g(·) is again a gold-standard evaluation metric, e.g. per-sentence BLEU (slightly abusing notation by letting the function also return an (e, h) pair), m(f, e, h) = ⟨w, φ(f, e, h)⟩, K(f) returns a k-best list for input f, and:
\[
(\hat{e}, \hat{h}) \approx \underset{(e,h)}{\arg\max}\ \langle w, \phi(f, e, h)\rangle, \tag{2.58}
\]
as returned by an SMT decoding algorithm. This loss is also closely related to a ramp loss objective [Chapelle et al., 2009; Collobert et al., 2006], as shown by Gimpel and Smith [2012b] for SMT:
\[
L_{\text{ramp}} = \sum_{i=1}^{n} -\max_{(e,h)\in K(f_i)} \big[ m(f_i, e, h) + g(e, e^*) \big] + m(f_i, \hat{e}, \hat{h}). \tag{2.59}
\]
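To make the difference between Equations 2.57 and 2.59 concrete, the per-sentence terms of both losses can be computed on a toy k-best list, as in the following sketch. The dense feature vectors, the gain values standing in for g(e, e*), the weights, and the assumption that the decoder's 1-best derivation is contained in the k-best list are all invented for illustration.

```python
def dot(w, phi):
    """Model score m(f, e, h) = <w, phi(f, e, h)> for dense vectors."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def perceptron_loss(w, kbest):
    """Per-sentence term of Eq. 2.57; kbest is a list of (phi, gain) pairs."""
    # Assumption: the decoder's 1-best is the best-scoring k-best entry.
    model_best = max(dot(w, phi) for phi, _ in kbest)
    # Oracle: the k-best entry with the highest gain g(e, e*).
    oracle_phi, _ = max(kbest, key=lambda item: item[1])
    return -dot(w, oracle_phi) + model_best

def ramp_loss(w, kbest):
    """Per-sentence term of Eq. 2.59: the oracle is replaced by max(m + g)."""
    model_best = max(dot(w, phi) for phi, _ in kbest)
    hope_value = max(dot(w, phi) + gain for phi, gain in kbest)
    return -hope_value + model_best

# Hypothetical k-best list: (feature vector, per-sentence gain).
KBEST = [([2.0, 0.0], 0.25), ([1.0, 1.0], 1.0), ([0.0, 2.0], 0.5)]
W = [1.0, 0.5]

if __name__ == "__main__":
    print(perceptron_loss(W, KBEST), ramp_loss(W, KBEST))
```

The corpus-level losses are obtained by summing these terms over all training sentences, for example sum(ramp_loss(W, kb) for kb in corpus) for a list corpus of k-best lists.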
In contrast to the hinge loss of the structured perceptron, the ramp loss is non-convex, but it is a tighter upper bound on the true loss in Equation 2.54.
2.5.3.1 Margin-Infused Relaxed Algorithm
For SMT, variants of the margin-infused relaxed algorithm (MIRA) [Crammer and Singer, 2003], more precisely of its application to structured prediction [Crammer et al., 2006], have been proposed [Watanabe et al., 2007b; Chiang et al., 2008; Chiang, 2012; Gimpel and Smith, 2012b]. The key idea of the algorithm is to maintain a large margin (margin-infused) between better and worse hypotheses (in terms of a gold-standard evaluation function), which is at least as large as the difference in their evaluation scores, while at the same time keeping the weight updates as small as possible (conservative). The algorithm is commonly formulated as a quadratic program, see e.g. [Chiang, 2012]:
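\[
\begin{aligned}
\text{minimize}\quad & \frac{1}{2\eta}\lVert w' - w \rVert^2 + \xi \\
\text{subject to}\quad & \langle w', \phi(f, e^+, h^+)\rangle - \langle w', \phi(f, e, h)\rangle + \xi \ \ge\ g(e^+, e^*) - g(e, e^*) \quad \forall (e, h) \in Y(f),
\end{aligned} \tag{2.60}
\]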
where (e+, h+) is a reachable, high-scoring derivation in terms of the evaluation function (referred to as the hope derivation):
\[
(e^+, h^+) = \underset{(e,h)\in Y(f)}{\arg\max}\ \langle w, \phi(f, e, h)\rangle + g(e, e^*), \tag{2.61}
\]
and ξ ≥ 0 is a slack variable. This optimization problem can be approached in numerous ways, for example with the cutting plane algorithm proposed by Tsochantaridis et al. [2004], as Chiang [2012] explicates.
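The hope derivation of Equation 2.61 and the margin requirement stated above (the model-score margin between the hope derivation and any other derivation should be at least the difference in their evaluation scores) can be illustrated on a toy example. In this sketch the search space Y(f) is replaced by a small k-best list, and the feature vectors, gains, and weights are made up; all of these are assumptions for illustration.

```python
def dot(w, phi):
    """Model score <w, phi(f, e, h)> for dense vectors."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def hope_index(w, kbest):
    """Index of the hope derivation (Eq. 2.61): arg max of model score + gain."""
    return max(range(len(kbest)),
               key=lambda k: dot(w, kbest[k][0]) + kbest[k][1])

def margin_violations(w, kbest):
    """For each entry: required margin minus actual margin w.r.t. the hope.

    A positive value means the margin requirement is not met for that entry.
    """
    hope_phi, hope_gain = kbest[hope_index(w, kbest)]
    violations = []
    for phi, gain in kbest:
        actual = dot(w, hope_phi) - dot(w, phi)   # model-score margin
        required = hope_gain - gain               # difference in gains
        violations.append(required - actual)
    return violations

# Hypothetical k-best list: (feature vector, per-sentence gain).
KBEST = [([2.0, 0.0], 0.25), ([1.0, 1.0], 1.0), ([0.0, 2.0], 0.5)]
W = [1.0, 0.5]

if __name__ == "__main__":
    print(hope_index(W, KBEST), margin_violations(W, KBEST))
```

In this toy setting only the first entry violates the requirement: it has the highest model score but the lowest gain.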
Instead of a range of derivations as suggested by Watanabe et al. [2007b] and Chiang et al. [2008], Chiang [2012] considers using only a single fear derivation (e−, h−), which represents the most-violated constraint:
\[
(e^-, h^-) = \underset{(e,h)\in Y(f)}{\arg\max}\ \langle w, \phi(f, e, h)\rangle - g(e, e^*). \tag{2.62}
\]
Following Gimpel and Smith [2012b], the optimization problem can be formulated as a ramp loss function:
\[
\begin{aligned}
L_{\text{mira}} &= \sum_{i=1}^{n} -\Big[ \max_{(e,h)\in Y(f_i)} m(f_i, e, h) + g(e, e_i^*) \Big] + \Big[ \max_{(e,h)\in Y(f_i)} m(f_i, e, h) - g(e, e_i^*) \Big] \\
&= \sum_{i=1}^{n} -\big[ m(f_i, e_i^+, h_i^+) + g(e_i^+, e_i^*) \big] + \big[ m(f_i, e_i^-, h_i^-) - g(e_i^-, e_i^*) \big].
\end{aligned} \tag{2.63}
\]
Since MIRA is an online learning algorithm, the problem can also be approached in a simpler variant by stochastic gradient descent (SGD) [Martins et al., 2010; Crammer et al., 2006; Eidelman et al., 2013b], when using a single constraint, omitting slack variables, and without imposing limits on the magnitudes of the updates that the weight vector should receive:
\[
w' \leftarrow w + \eta \big( \phi(f, e^+, h^+) - \phi(f, e^-, h^-) \big), \tag{2.64}
\]
which resembles the update of the structured perceptron, as shown in Algorithm 2, but using the notion of hope and fear derivations.
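A single SGD step of Equation 2.64 can be sketched as follows. As before, the search space Y(f) is approximated by a toy k-best list, and the feature vectors, gains, weights, and learning rate are hypothetical values chosen for illustration; ties in the arg max are broken in favor of the first entry.

```python
def dot(w, phi):
    """Model score m(f, e, h) = <w, phi(f, e, h)> for dense vectors."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def hope_fear_step(w, kbest, eta):
    """One update w' = w + eta * (phi(hope) - phi(fear)), cf. Eq. 2.64."""
    hope_phi, _ = max(kbest, key=lambda c: dot(w, c[0]) + c[1])  # Eq. 2.61
    fear_phi, _ = max(kbest, key=lambda c: dot(w, c[0]) - c[1])  # Eq. 2.62
    return [wi + eta * (hp - fp) for wi, hp, fp in zip(w, hope_phi, fear_phi)]

def mira_ramp_loss(w, kbest):
    """Per-sentence term of Eq. 2.63, restricted to the k-best list."""
    hope_value = max(dot(w, phi) + gain for phi, gain in kbest)
    fear_value = max(dot(w, phi) - gain for phi, gain in kbest)
    return -hope_value + fear_value

# Hypothetical k-best list: (feature vector, per-sentence gain).
KBEST = [([2.0, 0.0], 0.25), ([1.0, 1.0], 1.0), ([0.0, 2.0], 0.5)]
W = [1.0, 0.5]

if __name__ == "__main__":
    w_new = hope_fear_step(W, KBEST, eta=0.5)
    print(w_new, mira_ramp_loss(W, KBEST), mira_ramp_loss(w_new, KBEST))
```

On this toy list the update moves the weights toward the features of the hope derivation and away from those of the fear derivation, and the per-sentence loss decreases after the step.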
MIRA was first applied to a structured NLP problem by McDonald et al. [2005]. In SMT, the algorithm has received most interest due to its appeal as an online learning algorithm [Arun and Koehn, 2007; Watanabe, 2012; Watanabe et al., 2007b,a], and for enabling the use of sparse features [Chiang et al., 2009, 2008; Hasler et al., 2011; Eidelman, 2012] due to its efficiency. Since MIRA can be implemented as an online algorithm, it also allows for parallelization [Eidelman et al., 2013c,b]. Batch variants of the MIRA algorithm have also been explored for SMT [Zhao and Huang, 2013; Cherry and Foster, 2012].
As we have shown in our presentation of MIRA for SMT, hope and fear derivations are a way of defining effective constraints. However, by using k-best lists as a stand-in for the full search space, some fidelity is lost, which is why Chiang [2012] proposes a cost-augmented inference approach to search for constraints in a larger space. Wisniewski and Yvon [2013] present another variant for the constraints, and Eidelman et al. [2013a] propose a variant for the margin definition in MIRA. Tan et al. [2013] propose an algorithm which, similar to MERT, optimizes the exact corpus-level BLEU score.
In general structured prediction for SMT, approaches that include the search procedure in learning have been explored thoroughly: Zhang et al. [2008] present an application of search-based structured prediction [Daumé et al., 2009] to SMT, and in another line of work, violation-fixing approaches are presented, which try to counteract incorrect updates that are due to search errors [Huang et al., 2012; Yu et al., 2013; Liu and Huang, 2014; Zhang et al., 2013; Zhao et al., 2014].