1. Unconstrained Optimization.
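Given $f : \mathbb{R}^n \to \mathbb{R}$.
Minimize $f(x)$ over $x \in \mathbb{R}^n$.
$f$ has a local minimum at a point $\bar{x}$ if $f(\bar{x}) \le f(x)$ for all $x$ near $\bar{x}$, i.e.
$\exists\, \varepsilon > 0$ s.t. $f(\bar{x}) \le f(x)$ $\forall\, x : \|x - \bar{x}\| < \varepsilon$.
$f$ has a global minimum at $\bar{x}$ if $f(\bar{x}) \le f(x)$ for all $x \in \mathbb{R}^n$.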
2. Optimality Conditions.
• First order necessary conditions:
Suppose that $f$ has a local minimum at $\bar{x}$ and that $f$ is continuously differentiable in an open neighbourhood of $\bar{x}$. Then $\nabla f(\bar{x}) = 0$. ($\bar{x}$ is called a stationary point.)
• Second order sufficient conditions:
Suppose that $f$ is twice continuously differentiable in an open neighbourhood of $\bar{x}$ and that $\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x})$ is positive definite. Then $\bar{x}$ is a strict local minimizer of $f$.
Example: Show that $f = (2x_1^2 - x_2)(x_1^2 - 2x_2)$ has a minimum at $(0, 0)$ along any straight line passing through the origin, but $f$ has no minimum at $(0, 0)$.
Exercise: Find the minimum solution of
$f(x_1, x_2) = 2x_1^2 + x_1 x_2 + x_2^2 - x_1 - 3x_2$. (4)
Sufficient Condition.
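Taylor gives for any $d \in \mathbb{R}^n$:
$f(\bar{x} + d) = f(\bar{x}) + \nabla f(\bar{x})^T d + \tfrac{1}{2}\, d^T \nabla^2 f(\bar{x} + \lambda d)\, d$, $\lambda \in (0, 1)$.
If $\bar{x}$ is not a strict local minimizer, then
$\exists\, \{x_k\} \subset \mathbb{R}^n \setminus \{\bar{x}\} : x_k \to \bar{x}$ s.t. $f(x_k) \le f(\bar{x})$.
Define $d_k := \frac{x_k - \bar{x}}{\|x_k - \bar{x}\|}$. Then $\|d_k\| = 1$ and there exists a subsequence $\{d_{k_j}\}$ such that $d_{k_j} \to d_\star$ as $j \to \infty$ and $\|d_\star\| = 1$. W.l.o.g. we assume $d_k \to d_\star$ as $k \to \infty$. Using $\nabla f(\bar{x}) = 0$ we obtain
$f(\bar{x}) \ge f(x_k) = f(\bar{x} + \|x_k - \bar{x}\|\, d_k)$
$= f(\bar{x}) + \|x_k - \bar{x}\|\, \nabla f(\bar{x})^T d_k + \tfrac{1}{2} \|x_k - \bar{x}\|^2\, d_k^T \nabla^2 f(\bar{x} + \lambda_k \|x_k - \bar{x}\|\, d_k)\, d_k$
$= f(\bar{x}) + \tfrac{1}{2} \|x_k - \bar{x}\|^2\, d_k^T \nabla^2 f(\bar{x} + \lambda_k \|x_k - \bar{x}\|\, d_k)\, d_k$.
Hence $d_k^T \nabla^2 f(\bar{x} + \lambda_k \|x_k - \bar{x}\|\, d_k)\, d_k \le 0$, and on letting $k \to \infty$, $d_\star^T \nabla^2 f(\bar{x})\, d_\star \le 0$. This contradicts the positive definiteness of $\nabla^2 f(\bar{x})$.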
Example 6.2. Show that $f = (2x_1^2 - x_2)(x_1^2 - 2x_2)$ has a minimum at $(0, 0)$ along any straight line passing through the origin, but $f$ has no minimum at $(0, 0)$.
Answer.
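Straight line through $(0, 0)$: $x_2 = \alpha x_1$, $\alpha \in \mathbb{R}$ fixed.
$g(r) := f(r, \alpha r) = (2r^2 - \alpha r)(r^2 - 2\alpha r)$
$g'(r) = 8r^3 - 15\alpha r^2 + 4\alpha^2 r$, $g''(r) = 24r^2 - 30\alpha r + 4\alpha^2$
$\Rightarrow g'(0) = 0$ and $g''(0) = 4\alpha^2 > 0$ for $\alpha \ne 0$.
(For $\alpha = 0$ we have $g(r) = 2r^4$, and on the vertical line $x_1 = 0$ we have $f(0, x_2) = 2x_2^2$; both are minimized at the origin.)
Hence $r = 0$ is a minimizer for $g$, i.e. $(0, 0)$ is a minimizer for $f$ along any straight line through the origin.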
Now let $(x_1^k, x_2^k) = \left(\tfrac{1}{k}, \tfrac{1}{k^2}\right) \to (0, 0)$ as $k \to \infty$. Then $f(x_1^k, x_2^k) = -\tfrac{1}{k^2} \cdot \tfrac{1}{k^2} < 0 = f(0, 0)$ $\forall\, k$.
Hence (0,0) is not a minimizer for f.
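[Note: $\nabla f(0,0) = 0$, but $\nabla^2 f(0,0) = \begin{pmatrix} 0 & 0 \\ 0 & 4 \end{pmatrix}$ is only positive semidefinite, not positive definite.]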
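The two claims of Example 6.2 can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper `min_along_line`, the sampling radius and the chosen slopes are illustrative assumptions, and a sampled check is of course no proof.

```python
def f(x1, x2):
    """Objective from Example 6.2."""
    return (2 * x1**2 - x2) * (x1**2 - 2 * x2)


def min_along_line(alpha, radius=1e-3, samples=201):
    """True if f(r, alpha*r) >= f(0, 0) = 0 for all sampled |r| <= radius."""
    rs = [radius * (2 * i / (samples - 1) - 1) for i in range(samples)]
    return all(f(r, alpha * r) >= 0.0 for r in rs)


# Along a few lines x2 = alpha * x1 the origin looks like a minimizer ...
line_checks = {alpha: min_along_line(alpha) for alpha in (-2.0, -0.5, 0.5, 1.0, 3.0)}

# ... but along the curve (1/k, 1/k^2) the function is strictly negative.
curve_values = [f(1 / k, 1 / k**2) for k in range(1, 6)]
```

The curve values equal $-1/k^4$, so they approach $f(0,0) = 0$ from below.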
3. Convex Optimization.
Exercise. When $f$ is convex, any local minimizer $\bar{x}$ is a global minimizer of $f$. If in addition $f$ is differentiable, then any stationary point $\bar{x}$ is a global minimizer of $f$. (Hint. Use a contradiction argument.)
Exercise 6.3.
When $f$ is convex, any local minimizer $\bar{x}$ is a global minimizer of $f$.
Proof.
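Suppose $\bar{x}$ is a local minimizer, but not a global minimizer. Then
$\exists\, \tilde{x}$ s.t. $f(\tilde{x}) < f(\bar{x})$. Since $f$ is convex, we have that
$f(\lambda \tilde{x} + (1 - \lambda)\bar{x}) \le \lambda f(\tilde{x}) + (1 - \lambda) f(\bar{x})$
$< \lambda f(\bar{x}) + (1 - \lambda) f(\bar{x}) = f(\bar{x})$ $\forall\, \lambda \in (0, 1]$.
Let $x_\lambda := \lambda \tilde{x} + (1 - \lambda)\bar{x}$. Then $x_\lambda \to \bar{x}$ as $\lambda \to 0$, while $f(x_\lambda) < f(\bar{x})$ for every $\lambda \in (0, 1]$. This contradicts $\bar{x}$ being a local minimizer.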
4. Line Search.
The basic procedure for numerically solving an unconstrained problem (minimize $f(x)$ over $x \in \mathbb{R}^n$) is as follows.
(i) Choose an initial point $x^0 \in \mathbb{R}^n$ and an initial search direction $d^0 \in \mathbb{R}^n$ and set $k = 0$.
(ii) Choose a step size $\alpha_k$ and define a new point $x^{k+1} = x^k + \alpha_k d^k$. Check if the stopping criterion is satisfied ($\|\nabla f(x^{k+1})\| < \varepsilon$?). If yes, accept $x^{k+1}$ as the (approximate) optimal solution and stop. If no, go to (iii).
(iii) Choose a new search direction $d^{k+1}$ (descent direction) and set $k = k + 1$. Go to (ii).
The essential and most difficult part of any search algorithm is to choose a descent direction $d^k$ and a step size $\alpha_k$ with good convergence and stability properties.
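Steps (i) to (iii) can be sketched as a generic loop in which the direction rule and the step-size rule are plugged in as functions. This is only an illustration under stated assumptions: the function name `descent`, the fixed step size 0.2 and the negative-gradient direction are hypothetical choices, applied here to the quadratic (4).

```python
import math


def descent(grad, x0, direction, step, eps=1e-8, max_iter=10_000):
    """Generic loop: x^{k+1} = x^k + alpha_k d^k until ||grad f(x^{k+1})|| < eps."""
    x = list(x0)
    d = direction(x, grad(x))                  # step (i): initial direction
    for k in range(max_iter):
        alpha = step(x, d)                     # step (ii): choose a step size
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < eps:
            return x, k + 1                    # stopping criterion met
        d = direction(x, g)                    # step (iii): new descent direction
    raise RuntimeError("no convergence within max_iter iterations")


def grad_f(x):
    """Gradient of f from (4)."""
    return [4 * x[0] + x[1] - 1, x[0] + 2 * x[1] - 3]


x_min, iterations = descent(
    grad_f,
    x0=[1.0, 1.0],
    direction=lambda x, g: [-gi for gi in g],  # negative gradient (assumption)
    step=lambda x, d: 0.2,                     # fixed step size (assumption)
)
```

A fixed step only works here because it is small relative to the curvature of (4); in general the step size has to be chosen adaptively.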
5. Steepest Descent Method.
f is differentiable.
Choose $d^k = -g^k$, where $g^k = \nabla f(x^k)$, and choose $\alpha_k$ s.t. $f(x^k + \alpha_k d^k) = \min_{\alpha \in \mathbb{R}} f(x^k + \alpha d^k)$.
Note that successive descent directions are orthogonal to each other, i.e. $(g^k)^T g^{k+1} = 0$. As a result, convergence may be very slow for some functions, a behaviour called zigzagging.
Exercise.
Use the steepest descent (SD) method to solve (4) with the initial point $x^0 = (1, 1)$. (Answer. First three iterations give $x^1 = (0, 1)$, $x^2 = (0, \tfrac{3}{2})$, and $x^3 = (-\tfrac{1}{8}, \tfrac{3}{2})$.)
Steepest Descent.
Taylor gives:
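$f(x^k + \alpha d^k) = f(x^k) + \alpha \nabla f(x^k)^T d^k + O(\alpha^2)$. As
$\nabla f(x^k)^T d^k = \|\nabla f(x^k)\|\, \|d^k\| \cos\theta_k$,
with $\theta_k$ the angle between $d^k$ and $\nabla f(x^k)$, we see that $d^k$ is a descent direction if $\cos\theta_k < 0$. The descent is steepest when $\theta_k = \pi \iff \cos\theta_k = -1$.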
Zigzagging.
$\alpha_k$ is the minimizer of $\varphi(\alpha) := f(x^k + \alpha d^k)$ with $d^k = -g^k$. Hence
$0 = \varphi'(\alpha_k) = \nabla f(x^k + \alpha_k d^k)^T d^k = \nabla f(x^{k+1})^T (-g^k) = -(g^{k+1})^T g^k$. Hence $d^{k+1} \perp d^k$, which leads to zigzagging.
Exercise 6.5.
Use the SD method to solve (4) with the initial point $x^0 = (1, 1)$. [min: $\tfrac{1}{7}(-1, 11)$.]
Answer. $\nabla f = (4x_1 + x_2 - 1,\; 2x_2 + x_1 - 3)$.
Iteration 0: $d^0 = -\nabla f(x^0) = -(4, 0) \ne (0, 0)$. $\varphi(\alpha) = f(x^0 + \alpha d^0) = f(1 - 4\alpha, 1) = 2(1 - 4\alpha)^2 - 2$ has its minimum point at $\alpha_0 = \tfrac{1}{4}$ $\Rightarrow x^1 = x^0 + \alpha_0 d^0 = (0, 1)$, $d^1 = -\nabla f(x^1) = -(0, -1) = (0, 1) \ne (0, 0)$.
Iteration 1: $x^2 = (0, \tfrac{3}{2})$, $d^2 = (-\tfrac{1}{2}, 0)$.
Iteration 2: $x^3 = (-\tfrac{1}{8}, \tfrac{3}{2})$, $d^3 = (0, \tfrac{1}{8})$.
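The iterates above can be reproduced in exact arithmetic. The sketch below assumes that, for the quadratic (4) with constant Hessian $H$, the exact line search along $d = -g$ has the closed form $\alpha = g^T g / (d^T H d)$; the helper name `sd_step` is hypothetical.

```python
from fractions import Fraction

# Constant Hessian of the quadratic f in (4).
H = [[Fraction(4), Fraction(1)], [Fraction(1), Fraction(2)]]


def grad_f(x):
    """Gradient of f from (4)."""
    return [4 * x[0] + x[1] - 1, x[0] + 2 * x[1] - 3]


def sd_step(x):
    """One steepest descent step with exact line search (quadratic f only)."""
    g = grad_f(x)
    d = [-gi for gi in g]
    Hd = [sum(H[i][j] * d[j] for j in range(2)) for i in range(2)]
    alpha = sum(gi * gi for gi in g) / sum(di * hi for di, hi in zip(d, Hd))
    return [xi + alpha * di for xi, di in zip(x, d)], g


iterates = [[Fraction(1), Fraction(1)]]
grads = []
for _ in range(3):
    x_next, g = sd_step(iterates[-1])
    iterates.append(x_next)
    grads.append(g)
```

The stored gradients also make the zigzagging visible: consecutive entries of `grads` have zero inner product.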
6. Newton Method.
f is twice differentiable.
Choose $d^k = -[H^k]^{-1} g^k$, where $H^k = \nabla^2 f(x^k)$. Set $x^{k+1} = x^k + d^k$.
If $H^k$ is positive definite then $d^k$ is a descent direction.
The main drawback of the Newton method is that it requires the computation of $\nabla^2 f(x^k)$ and its inverse, which can be difficult and time-consuming.
Exercise.
Use the Newton method to solve (4) with $x^0 = (1, 1)$. (Answer. First iteration gives $x^1 = \tfrac{1}{7}(-1, 11)$.)
Newton Method.
Taylor gives
$f(x^k + d) \approx f(x^k) + d^T \nabla f(x^k) + \tfrac{1}{2}\, d^T \nabla^2 f(x^k)\, d =: m(d)$.
$\min_d m(d) \Rightarrow \nabla m(d) = 0 \Rightarrow \nabla f(x^k) + \nabla^2 f(x^k)\, d = 0$.
Hence choose $d^k = -[\nabla^2 f(x^k)]^{-1} \nabla f(x^k) = -[H^k]^{-1} g^k$. If $H^k$ is positive definite, then so is $(H^k)^{-1}$, and for $g^k \ne 0$ we get
$(d^k)^T g^k = -(g^k)^T (H^k)^{-1} g^k \le -\sigma_k \|g^k\|^2 < 0$ for some $\sigma_k > 0$.
Hence $d^k$ is a descent direction.
[Aside: The Newton method for $\min_x f(x)$ is equivalent to the Newton method for finding a root of the system of nonlinear equations $\nabla f(x) = 0$.]
Exercise 6.6.
Use the Newton method to minimize
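$f(x_1, x_2) = 2x_1^2 + x_1 x_2 + x_2^2 - x_1 - 3x_2$ with $x^0 = (1, 1)^T$.
Answer.
$\nabla f = \begin{pmatrix} 4x_1 + x_2 - 1 \\ 2x_2 + x_1 - 3 \end{pmatrix}$, $H := \nabla^2 f = \begin{pmatrix} 4 & 1 \\ 1 & 2 \end{pmatrix}$.
$H^{-1} = \frac{1}{\det H} \begin{pmatrix} 2 & -1 \\ -1 & 4 \end{pmatrix} = \frac{1}{7} \begin{pmatrix} 2 & -1 \\ -1 & 4 \end{pmatrix}$.
Iteration 0: $x^0 = (1, 1)^T$, $\nabla f(x^0) = (4, 0)^T$.
$x^1 = x^0 - [H^0]^{-1} \nabla f(x^0) = \begin{pmatrix} 1 \\ 1 \end{pmatrix} - \frac{1}{7} \begin{pmatrix} 2 & -1 \\ -1 & 4 \end{pmatrix} \begin{pmatrix} 4 \\ 0 \end{pmatrix} = \frac{1}{7} \begin{pmatrix} -1 \\ 11 \end{pmatrix}$.
$\Rightarrow \nabla f(x^1) = (0, 0)^T$ and $H$ is positive definite, so $x^1$ is the minimizer.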
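The single Newton step can be verified in exact arithmetic. This is a small sketch rather than a general implementation: the helper names `solve_2x2` and `newton_step` are hypothetical, and Cramer's rule is used only because the system is 2×2.

```python
from fractions import Fraction


def grad_f(x):
    """Gradient of f from (4)."""
    return [4 * x[0] + x[1] - 1, x[0] + 2 * x[1] - 3]


# Constant Hessian of f.
H = [[Fraction(4), Fraction(1)], [Fraction(1), Fraction(2)]]


def solve_2x2(A, b):
    """Solve A z = b for a 2x2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def newton_step(x):
    """x + d with d solving H d = -grad f(x); avoids forming H^{-1} explicitly."""
    g = grad_f(x)
    d = solve_2x2(H, [-gi for gi in g])
    return [xi + di for xi, di in zip(x, d)]


x0 = [Fraction(1), Fraction(1)]
x1 = newton_step(x0)

# Positive definiteness of H via its leading principal minors.
h_is_pd = H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0
```

Because $f$ is quadratic, the model $m(d)$ is exact and one step lands on the stationary point.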
7. Choice of Stepsize.
In computing the step size $\alpha_k$ we face a tradeoff. We would like to choose $\alpha_k$ to give a substantial reduction of $f$, but at the same time we do not want to spend too much time making the choice. The ideal choice would be the global minimizer of the univariate function $\varphi : \mathbb{R} \to \mathbb{R}$ defined by
$\varphi(\alpha) = f(x^k + \alpha d^k)$, $\alpha > 0$, but in general it is too expensive to identify this value.
A common strategy is to perform an inexact line search to identify a step size that achieves adequate reductions in f with minimum cost.
$\alpha_k$ is normally chosen to satisfy the Wolfe conditions:
$f(x^k + \alpha_k d^k) \le f(x^k) + c_1 \alpha_k (g^k)^T d^k$ (5)
$\nabla f(x^k + \alpha_k d^k)^T d^k \ge c_2 (g^k)^T d^k$, (6)
with $0 < c_1 < c_2 < 1$. (5) is called the sufficient decrease condition, and (6) is the curvature condition.
Choice of Stepsize.
The simple condition
$f(x^k + \alpha_k d^k) < f(x^k)$ (†)
is not appropriate, as it may not lead to a sufficient reduction.
Example: $f(x) = (x - 1)^2 - 1$. So $\min f(x) = -1$, but we can choose $x^k$ satisfying (†) such that $f(x^k) = \tfrac{1}{k} \to 0$.
Note that the sufficient decrease condition (5)
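$\varphi(\alpha) = f(x^k + \alpha d^k) \le \ell(\alpha) := f(x^k) + c_1 \alpha (g^k)^T d^k$
yields acceptable regions for $\alpha$. Here $\varphi(\alpha) < \ell(\alpha)$ for small $\alpha > 0$, as $(g^k)^T d^k < 0$ for descent directions.
The curvature condition (6) is equivalent to
$\varphi'(\alpha) \ge c_2\, \varphi'(0)$ $[\, > \varphi'(0)\,]$.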
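To see how the two Wolfe conditions rule out different step sizes, the sketch below tests a few values of $\alpha$ for the one-dimensional example $f(x) = (x-1)^2 - 1$ at $x = 0$ with $d = -f'(0) = 2$. The function name `wolfe` and the constants $c_1 = 10^{-4}$, $c_2 = 0.9$ are assumptions made for illustration; the notes only require $0 < c_1 < c_2 < 1$.

```python
def wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Return (sufficient_decrease, curvature) for step alpha along d (1-D case)."""
    slope0 = grad(x) * d                                   # phi'(0) = g^T d
    x_new = x + alpha * d
    sufficient = f(x_new) <= f(x) + c1 * alpha * slope0    # condition (5)
    curvature = grad(x_new) * d >= c2 * slope0             # condition (6)
    return sufficient, curvature


def f(x):
    return (x - 1) ** 2 - 1


def df(x):
    return 2 * (x - 1)


x, d = 0.0, 2.0                      # d = -f'(0), a descent direction
too_long = wolfe(f, df, x, d, 1.0)   # overshoots: no decrease at all
too_short = wolfe(f, df, x, d, 1e-3) # decreases f, but the slope is still steep
exact = wolfe(f, df, x, d, 0.5)      # lands on the minimizer x = 1
```

The overly long step fails (5), the overly short step fails (6), and the exact minimizer of $\varphi$ satisfies both.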
8. Convergence of Line Search Methods.
An algorithm is said to be globally convergent if $\lim_{k \to \infty} \|g^k\| = 0$.
It can be shown that if the step sizes satisfy the Wolfe conditions
• then the steepest descent method is globally convergent,
• so is the Newton method provided the Hessian matrices $\nabla^2 f(x^k)$ have a bounded condition number and are positive definite.
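Exercise. Show that the steepest descent method is globally convergent if the following conditions hold:
(a) $\alpha_k$ satisfies the Wolfe conditions,
(b) $f(x) \ge M$ $\forall\, x \in \mathbb{R}^n$,
(c) $f \in C^1$ and $\nabla f$ is Lipschitz.
[Hint: Show that $\sum_{k=0}^{\infty} \|g^k\|^2 < \infty$.]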
Exercise 6.8.
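Assume that $d^k$ is a descent direction, i.e. $(g^k)^T d^k < 0$, where $g^k := \nabla f(x^k)$. Then if
1. $\alpha_k$ satisfies the Wolfe conditions,
2. $f(x) \ge M$ $\forall\, x \in \mathbb{R}^n$,
3. $f \in C^1$ and $\nabla f$ is Lipschitz, i.e. $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ $\forall\, x, y \in \mathbb{R}^n$,
it holds that
$\sum_{k=0}^{\infty} \cos^2\theta_k\, \|g^k\|^2 < \infty$, where $\cos\theta_k := \frac{(g^k)^T d^k}{\|g^k\|\, \|d^k\|}$.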
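[Note: The SD method is the special case with $\cos^2\theta_k = 1$, so that $\sum_k \|g^k\|^2 < \infty \Rightarrow \lim_{k \to \infty} \|g^k\| = 0$.]
Proof. Wolfe condition (6) $\Rightarrow (g^{k+1})^T d^k \ge c_2 (g^k)^T d^k$
$\Rightarrow (g^{k+1} - g^k)^T d^k \ge (c_2 - 1)\, (g^k)^T d^k$. (†)
The Lipschitz condition yields that
$(g^{k+1} - g^k)^T d^k \le \|g^{k+1} - g^k\|\, \|d^k\| = \|\nabla f(x^{k+1}) - \nabla f(x^k)\|\, \|d^k\|$
$\le L \|x^{k+1} - x^k\|\, \|d^k\| = \alpha_k L \|d^k\|^2$. (‡)
Combining (†) and (‡) gives
$\alpha_k \ge \frac{c_2 - 1}{L}\, \frac{(g^k)^T d^k}{\|d^k\|^2}$,
and hence, since $(g^k)^T d^k < 0$,
$\alpha_k (g^k)^T d^k \le \frac{c_2 - 1}{L}\, \frac{[(g^k)^T d^k]^2}{\|d^k\|^2}$.
Together with Wolfe condition (5) we get
$f(x^{k+1}) \le f(x^k) + c_1 \frac{c_2 - 1}{L}\, \frac{[(g^k)^T d^k]^2}{\|d^k\|^2} = f(x^k) - c \cos^2\theta_k\, \|g^k\|^2$,
where $c := c_1 \frac{1 - c_2}{L} > 0$. Applying this bound repeatedly gives
$f(x^{k+1}) \le f(x^k) - c \cos^2\theta_k\, \|g^k\|^2 \le f(x^0) - c \sum_{j=0}^{k} \cos^2\theta_j\, \|g^j\|^2$
$\Rightarrow \sum_{j=0}^{k} \cos^2\theta_j\, \|g^j\|^2 \le \frac{1}{c} \left( f(x^0) - f(x^{k+1}) \right) \le \frac{1}{c} \left( f(x^0) - M \right)$ $\forall\, k$
$\Rightarrow \sum_{j=0}^{\infty} \cos^2\theta_j\, \|g^j\|^2 < \infty$.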
9. Popular Search Methods.
In practice the steepest descent method and the Newton method are rarely used due to the slow convergence rate and the difficulty in computing Hessian matrices, respectively.
The popular search methods are
• the conjugate gradient method (variation of SD method with superlinear convergence) and
• the quasi-Newton method (variation of Newton method without computation of Hessian matrices).
There are some efficient algorithms based on the trust-region approach. See Fletcher (2000) for details.
10. Constrained Optimization.
Minimize $f(x)$ over $x \in \mathbb{R}^n$ subject to the equality constraints
$h_i(x) = 0$, $i = 1, \dots, l$,
and the inequality constraints
$g_j(x) \le 0$, $j = 1, \dots, m$.
Assume that all functions involved are differentiable.
11. Linear Programming.
The problem is to minimize
$z = c_1 x_1 + \cdots + c_n x_n$ subject to
$a_{i1} x_1 + \cdots + a_{in} x_n \ge b_i$, $i = 1, \dots, m$, and
$x_1, \dots, x_n \ge 0$.
LPs can be easily and efficiently solved with the simplex algorithm or the interior point method.
MS-Excel has a good built-in LP solver capable of solving problems with up to 200 variables. MATLAB with the Optimization Toolbox also provides a good LP solver.
12. Graphic Method.
If an LP problem has only two decision variables (x1, x2), then it can be solved by
the graphic method as follows:
• First draw the feasible region from the given constraints and a contour line of the objective function,
• then, on establishing the increasing direction perpendicular to the contour line, find the optimal point on the boundary of the feasible region,
• then find two linear equations which define that point,
• and finally solve the two equations to obtain the optimal point.
Exercise. Use the graphic method to solve the LP: minimize $z = -3x_1 - 2x_2$ subject to $x_1 + x_2 \le 80$, $2x_1 + x_2 \le 100$, $x_1 \le 40$, and $x_1, x_2 \ge 0$. (Answer. $x_1 = 20$, $x_2 = 60$.)
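The graphic method amounts to comparing the objective at the corner points of the feasible region. As a rough sketch of that idea (not a general LP solver, and not the simplex algorithm), the code below intersects every pair of constraint boundaries, keeps the feasible intersection points and picks the one with the smallest $z$. The helper names `vertices` and `objective` are hypothetical.

```python
from itertools import combinations

# Constraints written as a1*x1 + a2*x2 <= b, including x1 >= 0 and x2 >= 0.
constraints = [
    (1, 1, 80),    # x1 + x2 <= 80
    (2, 1, 100),   # 2*x1 + x2 <= 100
    (1, 0, 40),    # x1 <= 40
    (-1, 0, 0),    # -x1 <= 0
    (0, -1, 0),    # -x2 <= 0
]


def objective(x1, x2):
    return -3 * x1 - 2 * x2


def vertices(cons, tol=1e-9):
    """Intersect every pair of boundary lines and keep the feasible points."""
    points = []
    for (a1, a2, b), (c1, c2, e) in combinations(cons, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < tol:
            continue                           # parallel boundary lines
        x1 = (b * c2 - a2 * e) / det           # Cramer's rule
        x2 = (a1 * e - c1 * b) / det
        if all(p * x1 + q * x2 <= r + tol for p, q, r in cons):
            points.append((x1, x2))
    return points


best = min(vertices(constraints), key=lambda p: objective(*p))
z_best = objective(*best)
```

The optimal corner is defined by the two equations $x_1 + x_2 = 80$ and $2x_1 + x_2 = 100$, matching the last two steps of the graphic method.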
13. Quadratic Programming.
Minimize
$x^T Q x + c^T x$ subject to
$A x \le b$ and $x \ge 0$,
where $Q$ is an $n \times n$ symmetric positive definite matrix, $A$ is an $m \times n$ matrix, $x, c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$.
To solve a QP problem, one
• first derives a set of equations from the Kuhn–Tucker conditions, and
• then applies the Wolfe algorithm or the Lemke algorithm to find the optimal solution.
The MS-Excel solver is capable of solving reasonably sized QP problems, similarly for MATLAB.
14. Kuhn–Tucker Conditions.
$\min f(x)$ over $x \in \mathbb{R}^n$ s.t. $h_i(x) = 0$, $i = 1, \dots, l$; $g_j(x) \le 0$, $j = 1, \dots, m$. Assume that $\bar{x}$ is an optimal solution.
Under some regularity conditions, called the constraint qualifications, there exist two vectors $\bar{u} = (\bar{u}_1, \dots, \bar{u}_l)$ and $\bar{v} = (\bar{v}_1, \dots, \bar{v}_m)$, called the Lagrange multipliers, such that the following set of conditions is satisfied:
$L_{x_k}(\bar{x}, \bar{u}, \bar{v}) = 0$, $k = 1, \dots, n$
$h_i(\bar{x}) = 0$, $i = 1, \dots, l$
$g_j(\bar{x}) \le 0$, $j = 1, \dots, m$
$\bar{v}_j\, g_j(\bar{x}) = 0$, $\bar{v}_j \ge 0$, $j = 1, \dots, m$
where
$L(x, u, v) = f(x) + \sum_{i=1}^{l} u_i h_i(x) + \sum_{j=1}^{m} v_j g_j(x)$
is called the Lagrange function or Lagrangian.
Furthermore, if $f : \mathbb{R}^n \to \mathbb{R}$ and $g_j : \mathbb{R}^n \to \mathbb{R}$ are convex and the $h_i : \mathbb{R}^n \to \mathbb{R}$ are affine, then $\bar{x}$ is an optimal solution if and only if $(\bar{x}, \bar{u}, \bar{v})$ satisfies the Kuhn–Tucker conditions.
This holds in particular when $f$ is convex and $h_i$, $g_j$ are linear.
Example.
Find the minimum solution to the function $x_2 - x_1$ subject to $x_1^2 + x_2^2 \le 1$.
Exercise.
Find the minimum solution to the function $x_1^2 + x_2^2 - 2x_1 - 4x_2$ subject to $x_1 + 2x_2 \le 2$ and $x_2 \ge 0$.
Interpretation of Kuhn–Tucker conditions
Assume that no equality constraints are present.
If $\bar{x}$ is an interior point, i.e. no constraints are active, then we recover the usual optimality condition: $\nabla f(\bar{x}) = 0$.
Now assume that $\bar{x}$ lies on the boundary of the feasible set and let $g_{j_k}$ be the active constraints at $\bar{x}$. Then a necessary condition for optimality is that we cannot find a descent direction for $f$ at $\bar{x}$ that is also a feasible direction. Such a vector cannot exist if
$-\nabla f(\bar{x}) = \sum_k \bar{v}_{j_k} \nabla g_{j_k}(\bar{x})$ with $\bar{v}_{j_k} \ge 0$. (†)
This is because, if $d \in \mathbb{R}^n$ is a descent direction, then $\nabla f(\bar{x})^T d < 0$ and hence, by (†), $\sum_k \bar{v}_{j_k} \nabla g_{j_k}(\bar{x})^T d > 0$.
So there must exist a $j_k$ such that $\nabla g_{j_k}(\bar{x})^T d > 0$. But that means that $d$ is an ascent direction for $g_{j_k}$, and as $g_{j_k}$ is active at $\bar{x}$, it is not a feasible direction.
Application of Kuhn–Tucker: LP Duality
Let $b \in \mathbb{R}^m$, $c \in \mathbb{R}^n$ and $A \in \mathbb{R}^{m \times n}$.
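$\min c^T x$ s.t. $A x \ge b$, $x \ge 0$. (P)
Equivalent to $\min c^T x$ s.t. $b - A x \le 0$, $-x \le 0$.
Lagrangian: $L = c^T x + v^T (b - A x) + y^T (-x)$.
Hence, $\bar{x}$ is the solution if there exist $\bar{v}$ and $\bar{y}$ such that
$\nabla L = c - A^T \bar{v} - \bar{y} = 0 \Rightarrow \bar{y} = c - A^T \bar{v}$,
KT conditions: $\bar{v}^T (b - A\bar{x}) = 0$, $\bar{y}^T (-\bar{x}) = 0$, $\bar{v} \ge 0$, $\bar{y} \ge 0$.
Eliminate $\bar{y}$ to find $\bar{v}$, $\bar{x}$:
$A\bar{x} \ge b$, $\bar{x} \ge 0$ (feasible region: primal)
$A^T \bar{v} \le c$, $\bar{v} \ge 0$ (feasible region: dual)
$\bar{v}^T (b - A\bar{x}) = 0$ and $\bar{x}^T (c - A^T \bar{v}) = 0$ $\Rightarrow \bar{x}^T c = \bar{x}^T A^T \bar{v} = \bar{v}^T b$.
Hence $\bar{v} \in \mathbb{R}^m$ solves the dual:
$\max b^T v$ s.t. $A^T v \le c$, $v \ge 0$.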
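Here we have used that
$c^T \bar{x} = \min_{x \ge 0,\, A x \ge b} c^T x = \min_{x \ge 0} \max_{v \ge 0} \left[ c^T x + v^T (b - A x) \right]$
$\ge \max_{v \ge 0} \min_{x \ge 0} \left[ c^T x + v^T (b - A x) \right] = \max_{v \ge 0} \min_{x \ge 0} \left[ v^T b + x^T (c - A^T v) \right]$
$\ge \max_{v \ge 0,\, A^T v \le c} v^T b \ge \bar{v}^T b = c^T \bar{x}$,
so that equality holds throughout.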
Example 6.14.
Find the minimum solution to the function $x_2 - x_1$ subject to $x_1^2 + x_2^2 \le 1$.
Answer.
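$L = x_2 - x_1 + v\,(x_1^2 + x_2^2 - 1)$, so the KT conditions become
$L_{x_1} = -1 + 2v x_1 = 0$ (1)
$L_{x_2} = 1 + 2v x_2 = 0$ (2)
$x_1^2 + x_2^2 \le 1$ (3)
$v\,(x_1^2 + x_2^2 - 1) = 0$, $v \ge 0$ (4)
(1) $\Rightarrow v > 0$ and hence $x_1 = \frac{1}{2v}$, $x_2 = -\frac{1}{2v}$.
Plugging this into (4) yields $\frac{2}{4v^2} = 1$ and hence $\bar{v} = \frac{1}{\sqrt{2}} \Rightarrow \bar{x}_1 = \frac{1}{\sqrt{2}}$, $\bar{x}_2 = -\frac{1}{\sqrt{2}}$, with the optimal value being $\bar{z} = -\sqrt{2}$.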
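The computed point can be checked against conditions (1) to (4) numerically. The sketch below is only a sanity check under illustrative assumptions: the helper `kt_residuals` is hypothetical, and the comparison against sampled boundary points is a coarse plausibility test, not a proof of optimality.

```python
import math


def kt_residuals(x1, x2, v):
    """Residuals of the KT conditions (1)-(4) for min x2 - x1 s.t. x1^2 + x2^2 <= 1."""
    g = x1**2 + x2**2 - 1                  # constraint function, must be <= 0
    return {
        "Lx1": -1 + 2 * v * x1,            # (1)
        "Lx2": 1 + 2 * v * x2,             # (2)
        "g": g,                            # (3)
        "complementarity": v * g,          # (4)
    }


v_bar = 1 / math.sqrt(2)
x_bar = (1 / math.sqrt(2), -1 / math.sqrt(2))
res = kt_residuals(*x_bar, v_bar)
z_bar = x_bar[1] - x_bar[0]

# Objective sampled on the unit circle (x1, x2) = (cos t, sin t).
boundary_min = min(
    math.sin(t) - math.cos(t)
    for t in (2 * math.pi * i / 3600 for i in range(3600))
)
```

All residuals vanish up to rounding, and no sampled boundary point gives a value below $\bar{z} = -\sqrt{2}$.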
Example.
$\min x_1$ s.t. $x_2 - x_1^3 \le 0$, $x_1 \le 1$, $x_2 \ge 0$.
Since $x_1^3 \ge x_2 \ge 0$ we have $x_1 \ge 0$ and hence $\bar{x} = (0, 0)$ is the unique minimizer.
Lagrangian: $L = x_1 + v_1 (x_2 - x_1^3) + v_2 (x_1 - 1) + v_3 (-x_2)$.
KT conditions for a feasible point $x$:
$\nabla L = \begin{pmatrix} 1 - 3 v_1 x_1^2 + v_2 \\ v_1 - v_3 \end{pmatrix} = 0$ (1)
$v_1 (x_2 - x_1^3) = 0$, $v_2 (x_1 - 1) = 0$, $v_3 (-x_2) = 0$ (2)
$v_1, v_2, v_3 \ge 0$ (3)
Check KT conditions at $\bar{x} = (0, 0)$: (1) $\Rightarrow v_2 = -1 < 0$, which is impossible!
The KT conditions are not satisfied, since the constraint qualifications do not hold.
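Here $g_1 = x_2 - x_1^3$ and $g_3 = -x_2$ are active at $(0, 0)$, and $\nabla g_1 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$, $\nabla g_3 = \begin{pmatrix} 0 \\ -1 \end{pmatrix}$. Hence the gradients of the active constraints are linearly dependent.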
Exercise 6.14
Find the minimum solution to the function $x_1^2 + x_2^2 - 2x_1 - 4x_2$ subject to $x_1 + 2x_2 \le 2$ and $x_2 \ge 0$.
Answer.
Lagrangian: $L = x_1^2 + x_2^2 - 2x_1 - 4x_2 + v_1 (x_1 + 2x_2 - 2) + v_2 (-x_2)$
KT conditions:
$\nabla L = \begin{pmatrix} 2x_1 - 2 + v_1 \\ 2x_2 - 4 + 2v_1 - v_2 \end{pmatrix} = 0$ (1)
$v_1 (x_1 + 2x_2 - 2) = 0$, $v_2 x_2 = 0$, $v_1, v_2 \ge 0$ (2)
$x_1 + 2x_2 \le 2$, $x_2 \ge 0$ (3)
If $x_1 + 2x_2 - 2 < 0$ then $v_1 = 0 \Rightarrow x_1 = 1$, $x_2 = 2 + \tfrac{1}{2} v_2 \ge 2$. Hence from (2), $v_2 = 0$, and so $x_1 = 1$, $x_2 = 2$. But that contradicts (3), and so it must hold that $x_1 + 2x_2 - 2 = 0$.
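One way to carry the case analysis further is sketched below, under an explicit assumption: with the first constraint active, try $v_2 = 0$ (i.e. $x_2 > 0$), solve (1) together with $x_1 + 2x_2 = 2$, and then confirm that the result satisfies (1) to (3). The variable names and the trial choice $v_2 = 0$ are illustrative; if the check failed, the case $x_2 = 0$ would have to be examined instead.

```python
from fractions import Fraction as F

v2 = F(0)                        # assumption: x2 > 0, so v2 = 0 by complementarity

# From (1): x1 = 1 - v1/2 and x2 = 2 - v1 + v2/2.
# Inserting into x1 + 2*x2 = 2 gives 5 + v2 - (5/2)*v1 = 2.
v1 = (F(5) + v2 - F(2)) / F(5, 2)
x1 = 1 - v1 / 2
x2 = 2 - v1 + v2 / 2

# Check the KT conditions (1)-(3) at the candidate point.
stationarity = (2 * x1 - 2 + v1, 2 * x2 - 4 + 2 * v1 - v2)
complementarity = (v1 * (x1 + 2 * x2 - 2), v2 * x2)
feasible = x1 + 2 * x2 <= 2 and x2 >= 0 and v1 >= 0 and v2 >= 0

f_value = x1**2 + x2**2 - 2 * x1 - 4 * x2
```

Since the objective is convex and the constraints are linear, a point satisfying the Kuhn–Tucker conditions is an optimal solution, so this candidate is the minimizer.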