3.3 Globalization Strategies
3.3.1 Merit Function
The most straightforward idea for measuring progress is to combine the two goals, reduction of the objective function and reduction of the constraint violation, into a so-called merit function $\Psi : \mathbb{R}^{n_x} \times \mathbb{R} \to \mathbb{R}$, e.g., defined by
$$\Psi(x; \tau) := f(x) + \tau\,\theta(x) \tag{3.21}$$
with a penalty parameter $\tau \in (0, \infty)$ that balances these two goals.⁵ The measure of constraint violation $\theta(x)$ does not need to be defined as in (3.20), but a merit function has to fulfill two necessary conditions:
i. An optimal solution of $\min_{x \in \mathbb{R}^{n_x}} \Psi(x; \tau)$ for $\tau \to \infty$ must be an optimal solution of (NLP) and vice versa.
ii. A step $\Delta x^k$ must produce a reduction in the merit function and therefore be a descent direction for it, i.e., $\nabla_x \Psi(x^k; \tau)^\top \Delta x^k < 0$.
These two necessary conditions ensure that the merit function decreases monotonically with every iteration $k$ until an optimal solution $x^*$ is reached. Optimizing (NLP) thus becomes equivalent to the unconstrained minimization of (3.21), with the difference that the step calculation does not rely on the merit function directly.
Popular examples of merit functions are:
i. The $\ell_p$ merit functions (cf., Han [113]):
$$\Psi(x; \tau) = f(x) + \tau \left\| \big(g(x), \max\{h(x), 0\}\big) \right\|_p, \quad p \in \{1, 2, \infty\} \tag{3.22}$$
ii. The differentiable $\ell_2$ merit function for equality constrained problems (cf., Fiacco and McCormick [64, Chapter 4]):
$$\Psi(x; \tau) = f(x) + \tfrac{1}{2}\tau \|g(x)\|_2^2 \tag{3.23}$$
⁴ Other globalization strategies like Gould and Toint [101] or Liu and Yuan [136] depend on different step calculations and are therefore not considered here.
⁵ It is also possible to position the penalty parameter in front of the objective function, but (3.21) is the common
iii. The augmented Lagrangian merit function for equality constrained problems (cf., Hestenes [116] and Powell [164]):
$$\Psi(x; \tau) = f(x) + \lambda^\top g(x) + \tfrac{1}{2}\tau \|g(x)\|_2^2 \tag{3.24}$$
iv. The augmented Lagrangian merit function for inequality constrained problems (cf., Arrow et al. [8] and Rockafellar [169]):
$$\Psi(x; \tau) = f(x) + \frac{1}{4\tau} \sum_{i=1}^{n_h} \left[ \big(\max\{\nu_i + \tau h_i(x), 0\}\big)^2 - \nu_i^2 \right] \tag{3.25}$$
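The merit functions (3.22) to (3.24) can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that (NLP) has equality constraints g(x) = 0 and inequality constraints h(x) ≤ 0, as suggested by the term max{h(x), 0}, and the toy functions `f`, `g`, `h` and the point `x` are hypothetical and not taken from the text.

```python
import math

def lp_merit(f, g, h, x, tau, p=1):
    """l_p merit function (3.22), assuming constraints g(x) = 0 and h(x) <= 0."""
    # Constraint violation: |g_i(x)| for equalities, max{h_i(x), 0} for inequalities.
    viol = [abs(gi) for gi in g(x)] + [max(hi, 0.0) for hi in h(x)]
    if not viol:
        norm = 0.0
    elif p == math.inf:
        norm = max(viol)
    else:
        norm = sum(v ** p for v in viol) ** (1.0 / p)
    return f(x) + tau * norm

def quadratic_penalty_merit(f, g, x, tau):
    """Differentiable l_2 merit function (3.23) for equality constraints."""
    return f(x) + 0.5 * tau * sum(gi ** 2 for gi in g(x))

def augmented_lagrangian_merit(f, g, x, lam, tau):
    """Augmented Lagrangian merit function (3.24) for equality constraints."""
    gx = g(x)
    linear = sum(li * gi for li, gi in zip(lam, gx))
    return f(x) + linear + 0.5 * tau * sum(gi ** 2 for gi in gx)

# Hypothetical toy problem (an assumption for illustration only).
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: [x[0] + x[1] - 1.0]   # equality constraint g(x) = 0
h = lambda x: [-x[0]]               # inequality constraint h(x) <= 0
x = [-1.0, 1.0]                     # infeasible point: g(x) = -1, h(x) = 1

psi_l1 = lp_merit(f, g, h, x, tau=10.0, p=1)
psi_linf = lp_merit(f, g, h, x, tau=10.0, p=math.inf)
psi_quad = quadratic_penalty_merit(f, g, x, tau=10.0)
psi_al = augmented_lagrangian_merit(f, g, x, lam=[3.0], tau=10.0)
```

At the same infeasible point the ℓ1 variant penalizes the sum of the violations and the ℓ∞ variant only the largest one, so the choice of p changes how strongly several violated constraints are weighted against the objective.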
Exact Merit Functions
It can be impractical that the penalty parameter $\tau$ has to go to infinity in order to satisfy the first necessary condition on merit functions stated above. Instead, one wishes that there exists a finite penalty parameter $\bar{\tau} > 0$ such that this condition holds for all $\tau > \bar{\tau}$. For an implementation it would then be sufficient to choose a parameter larger than $\bar{\tau}$ once and never increase it. Merit functions having this additional property are called exact merit functions.
Definition 3.8 (Exact Merit Functions). A merit function $\Psi(x; \tau)$ defined by (3.21) is called exact at an optimal solution $x^*$, if there exists a fixed parameter $\bar{\tau} > 0$ such that for all $\tau > \bar{\tau}$ the point $x^*$ is also an optimal solution of $\min_{x \in \mathbb{R}^{n_x}} \Psi(x; \tau)$.
It turns out that the $\ell_p$ and augmented Lagrangian merit functions are exact, as stated in the following theorems, but unfortunately the differentiable $\ell_2$ merit function is not.
Theorem 3.9. Let $x^*$ be an optimal solution of (NLP) satisfying the MFCQ and SOSC. Then, the merit function $\Psi(x; \tau) = f(x) + \tau \left\| \big(g(x), \max\{h(x), 0\}\big) \right\|_p$ with $p \in [1, \infty]$ is exact.
Proof. See Han and Mangasarian [114, Corollary 4.7].
Theorem 3.10. Let $x^*$ be an optimal solution of (NLP) satisfying the MFCQ and SOSC. Then, the merit function $\Psi(x; \tau) = f(x) + \lambda^\top g(x) + \tfrac{1}{2}\tau \|g(x)\|_2^2$ is exact.
Proof. See Hestenes [116, Theorem 2.1].
The drawback of exact merit functions, however, is that the penalty parameter $\bar{\tau}$ is unknown a priori. This requires a strategy to update the penalty parameter during the optimization. Unfortunately, choosing a very large value from the beginning and hoping that it exceeds $\bar{\tau}$ is not a good option, as it can lead to very slow convergence. A very small penalty, on the other hand, can attract the iterates to unbounded infeasible points if the objective function decreases much faster than the constraint violation increases. A survey on exact merit functions is given by Di Pillo [52], which also proposes to use penalty parameters that depend on the constraint violation to overcome the latter drawback.
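The difference between an exact and a non-exact merit function, as well as the danger of a too small penalty parameter, can be seen on a one-dimensional toy problem. The problem below is an assumption chosen purely for illustration and does not appear in the cited references; the symbols $\Psi_1$ and $\Psi_2$ are introduced here only to distinguish the two merit functions.

```latex
% Illustrative toy problem (an assumption, not from the text):
%   minimize f(x) = x  subject to  g(x) = x - 1 = 0,  with solution x^* = 1.
\begin{align*}
  \text{$\ell_1$ merit function (3.22):}\quad
    \Psi_1(x;\tau) &= x + \tau\,|x - 1|, &
    \Psi_1'(x;\tau) &=
      \begin{cases} 1 - \tau, & x < 1,\\ 1 + \tau, & x > 1, \end{cases} \\
  \text{differentiable $\ell_2$ merit function (3.23):}\quad
    \Psi_2(x;\tau) &= x + \tfrac{1}{2}\tau\,(x - 1)^2, &
    \Psi_2'(x;\tau) &= 1 + \tau\,(x - 1) = 0
      \;\Longleftrightarrow\; x = 1 - \tfrac{1}{\tau}.
\end{align*}
```

For $\tau > 1$ the function $\Psi_1$ decreases to the left of $x^* = 1$ and increases to the right of it, so $x^*$ is its minimizer and $\Psi_1$ is exact with $\bar{\tau} = 1$. For $\tau < 1$ it is strictly increasing and therefore unbounded below as $x \to -\infty$, which is the attraction of unbounded infeasible points mentioned above. The minimizer $1 - 1/\tau$ of $\Psi_2$ differs from $x^*$ for every finite $\tau$ and approaches it only as $\tau \to \infty$, so $\Psi_2$ is not exact for this problem.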
Sufficient Decrease Condition
So far it has been neglected that the descent direction property, i.e., $\nabla_x \Psi(x^k; \tau)^\top \Delta x^k < 0$, does not guarantee a sufficient reduction of the merit function $\Psi(x; \tau)$ for nonlinear programming, since (similarly to the beginning of Section 3.3) this property is based on local information only. This is the point where the line-search method comes into play and the step $\Delta x^k$ may have to be shortened. In the following, the merit function is assumed to be differentiable⁶ and the actual reduction
$$\Psi(x^k + \alpha_k \Delta x^k; \tau) - \Psi(x^k; \tau) \tag{3.26}$$
is compared with the predicted reduction based on a linear or quadratic Taylor approximation,
$$\Psi(x^k; \tau) + \alpha_k \nabla_x \Psi(x^k; \tau)^\top \Delta x^k + \tfrac{1}{2}\alpha_k^2 (\Delta x^k)^\top \nabla_{xx}^2 \Psi(x^k; \tau)\, \Delta x^k - \Psi(x^k; \tau) = \alpha_k \nabla_x \Psi(x^k; \tau)^\top \Delta x^k + \tfrac{1}{2}\alpha_k^2 (\Delta x^k)^\top \nabla_{xx}^2 \Psi(x^k; \tau)\, \Delta x^k, \tag{3.27}$$
where the quadratic term is dropped for the linear model.
If the actual reduction is at least a fraction of the predicted reduction, the step is said to be acceptable. In case of a linear model of reduction this yields the Armijo [7] condition
$$\Psi(x^k + \alpha_k \Delta x^k; \tau) - \Psi(x^k; \tau) \le \sigma \alpha_k \nabla_x \Psi(x^k; \tau)^\top \Delta x^k \le 0 \tag{3.28}$$
with a parameter $\sigma \in (0, 1)$, which is illustrated in Figure 3.1 (left). Wolfe [194, 195] proposes to extend the Armijo condition by
$$\nabla_x \Psi(x^k + \alpha_k \Delta x^k; \tau)^\top \Delta x^k \ge \eta\, \nabla_x \Psi(x^k; \tau)^\top \Delta x^k, \tag{3.29}$$
$\eta \in (\sigma, 1)$, to avoid arbitrarily small step sizes. In practice, however, this further condition is often neglected and instead a value $\alpha_k \in (0, 1]$ that satisfies the Armijo condition and is as large as possible is selected. Note that finding the optimal step size, e.g., by solving $\min_{\alpha_k > 0} \Psi(x^k + \alpha_k \Delta x^k; \tau)$, is not a practical option since it involves the solution of a (nonsmooth) nonlinear program.
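A common way to obtain a large step size satisfying the Armijo condition (3.28) is backtracking: start with the full step and shrink it by a fixed factor until the condition holds. The following sketch assumes this backtracking scheme, which the text does not prescribe; the function name, the reduction factor `beta`, the default values and the quadratic stand-in for the merit function are all hypothetical. It returns the largest acceptable value on the grid 1, beta, beta², ..., which is not necessarily the largest acceptable value in (0, 1].

```python
def armijo_backtracking(merit, slope, x, dx, sigma=1e-4, beta=0.5, max_steps=50):
    """Backtrack from alpha = 1 until the Armijo condition (3.28) holds.

    merit: callable returning Psi(x; tau) for a fixed penalty parameter tau
    slope: directional derivative grad Psi(x; tau)^T dx, must be negative
    """
    if slope >= 0.0:
        raise ValueError("dx is not a descent direction for the merit function")
    psi0 = merit(x)
    alpha = 1.0
    for _ in range(max_steps):
        trial = [xi + alpha * di for xi, di in zip(x, dx)]
        # Actual reduction versus a fraction sigma of the linear prediction.
        if merit(trial) - psi0 <= sigma * alpha * slope:
            return alpha
        alpha *= beta
    raise RuntimeError("no acceptable step size found")

# Hypothetical stand-in for a differentiable merit function.
merit = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
x = [1.0, 1.0]
grad = [2.0 * x[0], 20.0 * x[1]]
dx = [-gi for gi in grad]                          # steepest descent step
slope = sum(gi * di for gi, di in zip(grad, dx))   # equals -404 here
alpha = armijo_backtracking(merit, slope, x, dx)
```

In this example the full step and the next three trial steps increase the merit function and are rejected; the step is accepted after four halvings.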
As an example for the SQP method, Algorithm D presents a globally convergent version of Algorithm B under rather strong assumptions.
Theorem 3.11 (Global Convergence of SQP Method with a Merit Function). Let $\{(x^k, \lambda^k, \nu^k)\}_k$ be a sequence generated by Algorithm D such that the tuple $(x^k, \lambda^k, \nu^k)$ lies in some compact set for all $k$, $x^k$ satisfies the LICQ and, for all $d \in \mathbb{R}^{n_x}$,
$$c_1 \|d\|^2 \le d^\top \nabla_{xx}^2 L(x^k, \lambda^k, \nu^k)\, d \le c_2 \|d\|^2 \tag{3.30}$$
with $c_1 > 0$ and $c_2 > 0$. Then, $\{(x^k, \lambda^k, \nu^k)\}_k$ converges to a first-order optimal point of (NLP).
Proof. See Boggs and Tolle [19, Theorem 4.3].
⁶ If the merit function is not differentiable, then the linear or quadratic model for the predicted reduction has
[Figure 3.1: Monotone merit function (left) and non-monotone merit function (right). The non-monotonicity level on the right is l = 2. Both panels have the axes θ and f and mark the forbidden and the acceptable region relative to the point (θ(x^k), f(x^k)). The left panel shows the Armijo condition with slope −ρ; the right panel additionally shows the points (θ(x^{k−1}), f(x^{k−1})) and (θ(x^{k−2}), f(x^{k−2})).]
Non-Monotone Merit Functions
Although Theorem 3.11 proves global convergence for an optimization algorithm, the introduction of the merit function (the main extension made in Algorithm D) requires a new study of local convergence, since Theorem 3.6 is based on the full step ($\alpha_k = 1$). One could think that the same properties would hold, but this is actually not true. There exist examples (cf., Powell [165, Section 3]) of search directions $\Delta x^k$ that yield local q-quadratic convergence but increase both the objective function and the constraint violation, and would thus be rejected by the merit function. This is known as the Maratos effect [139]. Even in the unconstrained case, the step size can be reduced unnecessarily, for example when the step direction tries to follow a curved valley. Possibilities to avoid this are the modification of the step $\Delta x^k$, in particular second-order correction steps (cf., Conn et al. [45, Section 15.3.2.3] or Section 3.6.2), or the relaxation of the merit function acceptance criterion (3.28) to allow a non-monotone decrease. Examples include Chamberlain et al. [39], Panier and Tits [157] and Toint [182], which basically exchange (3.28) for
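$$\Psi(x^k + \alpha_k \Delta x^k; \tau) - \max_{i = 0, \dots, l_m} \Psi\big(x^{(k-i)_+}; \tau\big) \le \sigma \alpha_k \nabla_x \Psi(x^k; \tau)^\top \Delta x^k \tag{3.31}$$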
and thus only force a decrease with respect to the largest of the most recent merit function values, where $l_m \in \mathbb{N}$ is the non-monotonicity level; see Figure 3.1 (right). While non-monotone merit function techniques usually complicate the global convergence theory, overall efficiency gains can be reported (cf., Grippo et al. [112]).
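The relaxed acceptance test (3.31) only needs the merit function values of the current and the previous $l_m$ iterates. A minimal sketch of this bookkeeping, assuming a fixed-length buffer for the recent values, is given below; the function name and all numbers are hypothetical. In the first iterations the buffer simply holds fewer values, which has the same effect as the index $(k-i)_+$ in (3.31).

```python
from collections import deque

def nonmonotone_armijo_accepts(psi_trial, psi_recent, slope, alpha, sigma=1e-4):
    """Acceptance test (3.31): compare against the largest recent merit value."""
    reference = max(psi_recent)
    return psi_trial - reference <= sigma * alpha * slope

l_m = 2
# Merit values at x^{k-2}, x^{k-1}, x^k (oldest first); hypothetical numbers.
psi_recent = deque([5.0, 3.0, 4.0], maxlen=l_m + 1)
psi_trial, slope, alpha, sigma = 4.5, -1.0, 1.0, 0.1

# Monotone test (3.28): compares against the current value only.
monotone_ok = psi_trial - psi_recent[-1] <= sigma * alpha * slope
# Non-monotone test (3.31): compares against the maximum over the buffer.
nonmonotone_ok = nonmonotone_armijo_accepts(psi_trial, psi_recent, slope, alpha, sigma)

if nonmonotone_ok:
    # Accept the step; the oldest value drops out of the buffer automatically.
    psi_recent.append(psi_trial)
```

Here the trial point increases the merit function relative to the current iterate and fails the monotone test, but it is accepted by the non-monotone test because it still lies sufficiently below the value from two iterations earlier.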