Unconstrained optimization - Nonlinear programming: theory and algorithms

Nonlinear programming: theory and algorithms

5.4 Unconstrained optimization

We now move on to nonlinear optimization problems with multiple variables. First, we will focus on problems that have no constraints. Typical examples of unconstrained nonlinear optimization problems arise in model fitting and regression. The study of unconstrained problems is also important for constrained optimization, as one often solves a sequence of unconstrained problems as subproblems in various algorithms for the solution of constrained problems.

We use the following generic format for the unconstrained nonlinear programs we consider in this section:

min f(x), where x = (x₁, …, xₙ).

For simplicity, we will restrict our discussion to minimization problems. These ideas can be trivially adapted for maximization problems.

5.4.1 Steepest descent

The simplest numerical method for finding a minimizing solution is based on the idea of going downhill on the graph of the function f. When the function f is differentiable, its gradient always points in the direction of fastest initial increase and the negative gradient is the direction of fastest decrease. This suggests that, if our current estimate of the minimizing point is x^∗, moving in the direction of −∇f(x^∗) is desirable. Once we choose a direction, how far we should move along it is determined using a line search. The line search problem is a univariate problem that can be solved, perhaps in an approximate fashion, using the methods of the previous section. This will provide a new estimate of the minimizing point and the procedure can be repeated.

We illustrate this approach with the following example:

\[
\min f(x) = (x_1 - 2)^4 + \exp(x_1 - 2) + (x_1 - 2x_2)^2.
\]

The first step is to compute the gradient of the function, namely the vector of the partial derivatives of the function with respect to each variable:

\[
\nabla f(x) = \begin{bmatrix} 4(x_1 - 2)^3 + \exp(x_1 - 2) + 2(x_1 - 2x_2) \\ -4(x_1 - 2x_2) \end{bmatrix}. \tag{5.4}
\]

Next, we need to choose a starting point. We arbitrarily select the point x⁰= (0, 3).
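As a quick sanity check on the gradient formula, the following minimal Python sketch evaluates the objective and its analytic gradient at x⁰ = (0, 3) and compares the result with a central-difference approximation. The function names and the difference step h are assumptions made for illustration; they are not part of the text.

```python
import math

def f(x):
    """Objective of the running example."""
    x1, x2 = x
    return (x1 - 2) ** 4 + math.exp(x1 - 2) + (x1 - 2 * x2) ** 2

def grad_f(x):
    """Analytic gradient of f, term by term as in (5.4)."""
    x1, x2 = x
    return (4 * (x1 - 2) ** 3 + math.exp(x1 - 2) + 2 * (x1 - 2 * x2),
            -4 * (x1 - 2 * x2))

def numerical_grad(func, x, h=1e-6):
    """Central-difference gradient; h is an assumed step, not from the text."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((func(xp) - func(xm)) / (2 * h))
    return tuple(g)

x0 = (0.0, 3.0)
print("analytic :", grad_f(x0))          # first entry equals -44 + exp(-2)
print("numerical:", numerical_grad(f, x0))
```

The two printed vectors should agree to several digits; a mismatch would usually point to a sign or coefficient error in the hand-derived gradient.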

Now we are ready to compute the steepest descent direction at point x⁰. It is the direction opposite to the gradient vector computed at x⁰, namely

\[
d^0 = -\nabla f(x^0) = \begin{bmatrix} 44 - e^{-2} \\ -24 \end{bmatrix}.
\]

If we move from x⁰ in the direction d⁰, using a step size α, we get a new point x⁰ + αd⁰ (α = 0 corresponds to staying at x⁰). Since our goal is to minimize f, we will try to move to a point x¹ = x⁰ + αd⁰, where α is chosen to approximately minimize the function along this direction. For this purpose, we evaluate the function f along the steepest descent direction as a function of the step size α:

\[
\varphi(\alpha) := f(x^0 + \alpha d^0) = \left[(44 - e^{-2})\alpha - 2\right]^4 + \exp\left[(44 - e^{-2})\alpha - 2\right] + \left[(44 - e^{-2})\alpha - 2(3 - 24\alpha)\right]^2.
\]

Now, the optimal value of α can be found by solving the one-dimensional minimization problem min φ(α).

This minimization can be performed through one of the numerical line search procedures of the previous section. Here we use the approximate line search approach with the sufficient decrease condition discussed in Section 5.3.3. We want to choose a step size α satisfying

\[
\varphi(\alpha) \le \varphi(0) + \mu \alpha \varphi'(0),
\]

where μ ∈ (0, 1) is the desired fraction for the sufficient decrease condition. We observe that the derivative of the function φ at 0 can be expressed as

\[
\varphi'(0) = \nabla f(x^0)^T d^0.
\]

This is the directional derivative of the function f at the point x⁰ in the direction d⁰. Using this identity, the sufficient decrease condition on the function φ can be written in terms of the original function f as follows:

\[
f(x^0 + \alpha d^0) \le f(x^0) + \mu \alpha \nabla f(x^0)^T d^0. \tag{5.5}
\]

The condition (5.5) is the multivariate version of the Armijo–Goldstein condition (5.3).
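The expression for the derivative of φ at 0 used above follows from the chain rule. As a short worked step, using only the definitions already introduced:

\[
\varphi'(\alpha) = \frac{d}{d\alpha} f(x^0 + \alpha d^0) = \nabla f(x^0 + \alpha d^0)^T d^0,
\qquad\text{so}\qquad
\varphi'(0) = \nabla f(x^0)^T d^0 .
\]

For the steepest descent direction d⁰ = −∇f(x⁰), this gives

\[
\varphi'(0) = -\|\nabla f(x^0)\|^2 < 0 \quad \text{whenever } \nabla f(x^0) \ne 0,
\]

so the right-hand side of (5.5) lies strictly below f(x⁰) for every α > 0, and the condition indeed demands a decrease proportional to the step size.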

As discussed in Section 5.3.3, the sufficient decrease condition (5.5) can be combined with a backtracking strategy. For this example, we use μ = 0.3 for the sufficient decrease condition and apply backtracking with an initial trial step size of 1 and a backtracking factor of β = 0.8. Namely, we try step sizes 1, 0.8, 0.64, 0.512, and so on, until we find a step size of the form 0.8^k that satisfies the Armijo–Goldstein condition. The first five iterates of this approach as well as the 20th iterate are given in Table 5.5. For completeness, one also has to specify a termination criterion for the approach. Since the gradient of the function must be the zero vector at an unconstrained minimizer, most implementations will use a termination criterion of the form ‖∇f(x)‖ ≤ ε, where ε > 0 is an appropriately chosen tolerance parameter. Alternatively, one might stop when successive iterates are getting very close to each other, that is, when ‖x^{k+1} − x^k‖ ≤ ε for some ε > 0. This last condition indicates that progress has stalled. While this may be because the iterates have approached the optimizer and cannot progress any more, there are instances where the stalling is due to the high degree of nonlinearity in f.
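The procedure described so far (steepest descent direction, backtracking with the sufficient decrease condition, and a gradient-norm stopping test) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the iteration cap max_iter, and the choice ε = 10⁻⁵ are assumptions, while μ = 0.3, β = 0.8 and x⁰ = (0, 3) are taken from the example.

```python
import math

def f(x):
    x1, x2 = x
    return (x1 - 2) ** 4 + math.exp(x1 - 2) + (x1 - 2 * x2) ** 2

def grad_f(x):
    x1, x2 = x
    return [4 * (x1 - 2) ** 3 + math.exp(x1 - 2) + 2 * (x1 - 2 * x2),
            -4 * (x1 - 2 * x2)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def backtracking_step(x, d, g, mu=0.3, beta=0.8):
    """Return the first step size in 1, beta, beta^2, ... that satisfies (5.5)."""
    alpha = 1.0
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(g, d))  # directional derivative at x along d
    while f([xi + alpha * di for xi, di in zip(x, d)]) > fx + mu * alpha * slope:
        alpha *= beta
    return alpha

def steepest_descent(x0, tol=1e-5, max_iter=1000):
    """Iterate x <- x + alpha * (-grad f(x)) until the gradient norm is at most tol."""
    x = list(x0)
    history = []  # one entry (k, x^k, d^k, alpha^k) per iteration performed
    for k in range(max_iter):
        g = grad_f(x)
        if norm(g) <= tol:
            break
        d = [-gi for gi in g]
        alpha = backtracking_step(x, d, g)
        history.append((k, tuple(x), tuple(d), alpha))
        x = [xi + alpha * di for xi, di in zip(x, d)]
    return x, history

x_final, history = steepest_descent([0.0, 3.0])
for k, xk, dk, alpha in history[:6]:
    print(k, xk, dk, round(alpha, 3))
print("iterations:", len(history), "final point:", x_final)
```

On the first iteration the backtracking loop rejects the step sizes 1, 0.8, ..., 0.8¹² and accepts 0.8¹³ ≈ 0.055, which agrees with the first row of Table 5.5.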

A quick examination of Table 5.5 reveals that the sign of the second coordinate of the steepest descent direction changes from one iteration to the next in most cases. What we are observing is the zigzagging phenomenon, a typical feature of steepest descent approaches that explains their slow convergence behavior for most problems. When we pursue the steepest descent algorithm for more iterations, the zigzagging phenomenon becomes even more pronounced and the method is slow to converge to the optimal solution x^∗ ≈ (1.472, 0.736). Figure 5.3 shows the steepest descent iterates for our example superimposed on the contour lines of the objective function. Steepest descent directions are perpendicular to the contour lines and zigzag between the two sides of the contour lines, especially when these lines create long and narrow corridors. It takes more than 30 steepest descent iterations in this small example to achieve ‖∇f(x)‖ ≤ 10⁻⁵. In summary, while the steepest descent approach is easy to implement and intuitive, and has relatively cheap iterations, it can also be quite slow to converge to solutions.

Table 5.5 Steepest descent iterations

  k    (x₁^k, x₂^k)      (d₁^k, d₂^k)         α^k     ‖∇f(x^{k+1})‖
  0    (0.000, 3.000)    (43.864, −24.000)    0.055   3.800
  1    (2.412, 1.681)    (0.112, −3.799)      0.168   2.891
  2    (2.430, 1.043)    (−2.544, 1.375)      0.134   1.511
  3    (2.089, 1.228)    (−0.362, −1.467)     0.210   1.523
  4    (2.013, 0.920)    (−1.358, 0.690)      0.168   1.163
  5    (1.785, 1.036)    (−0.193, −1.148)     0.210   1.188
  ...  ...               ...                  ...     ...
  20   (1.472, 0.736)    (−0.001, 0.000)      0.134   0.001

Figure 5.3 Zigzagging behavior in the steepest descent approach

Exercise 5.10 Consider a differentiable multivariate function f(x) that we wish to minimize. Let x^k be a given estimate of the solution, and consider the first-order Taylor series expansion of the function around x^k:

\[
\hat{f}(\delta) = f(x^k) + \nabla f(x^k)^T \delta.
\]

The quickest decrease in f̂ starting from x^k is obtained in the direction that solves

\[
\min \hat{f}(\delta) \quad \text{subject to} \quad \|\delta\| \le 1.
\]

Show that the solution is δ^∗ = α∇f(x^k) with some α < 0, i.e., the opposite direction to the gradient is the direction of steepest descent.

Exercise 5.11 Recall the maximum likelihood estimation problem we considered in Exercise 5.4. While we maintain the assumption that the observed samples come from a normal distribution, we will no longer assume that we know the mean of the distribution to be zero. In this case, we have a two-parameter (mean μ and standard deviation σ) maximum likelihood estimation problem. Solve this problem using the steepest descent method.

5.4.2 Newton’s method

There are several numerical techniques for modifying the method of steepest descent that reduce the propensity of this approach to zigzag, and thereby speed up convergence. The steepest descent method uses only the gradient of the objective function, that is, only first-order information on the function. Improvements can be expected by employing second-order information on the function, that is, by considering its curvature. Methods using curvature information include Newton’s method, which we have already discussed in the univariate setting. Here, we describe the generalization of this method to multivariate problems.

Once again, we begin with the version of the method for solving equations. We will look at the case where there are several equations involving several variables:

\[
\begin{aligned}
f_1(x_1, x_2, \ldots, x_n) &= 0 \\
f_2(x_1, x_2, \ldots, x_n) &= 0 \\
&\;\;\vdots \\
f_n(x_1, x_2, \ldots, x_n) &= 0.
\end{aligned}
\tag{5.6}
\]

Let us represent this system as

F(x)= 0,

where x is a vector of n variables and F(x) is an ℝⁿ-valued function with components f₁(x), …, fₙ(x). We repeat the procedure in Section 5.3.2: first, we write the first-order Taylor series approximation to the function F around the current estimate x^k:

\[
F(x^k + \delta) \approx \hat{F}(\delta) := F(x^k) + \nabla F(x^k)\delta. \tag{5.7}
\]

Above, ∇F(x) denotes the Jacobian matrix of the function F, i.e., ∇F(x) has rows (∇f₁(x))ᵀ, …, (∇fₙ(x))ᵀ, the transposed gradients of the functions f₁ through fₙ. We denote the components of the n-dimensional vector x using subscripts, i.e., x = (x₁, …, xₙ).

As before, F̂(δ) is the linear approximation to the function F by the hyperplane that is tangent to it at the current point x^k. The next step is to find the value of δ that would make the approximation equal to zero, i.e., the value that satisfies

\[
F(x^k) + \nabla F(x^k)\delta = 0.
\]

Notice that what we have on the right-hand side is a vector of zeros and the equation above represents a system of linear equations. If ∇F(x^k) is nonsingular, the equality above has a unique solution given by

\[
\delta = -\nabla F(x^k)^{-1} F(x^k),
\]

and the formula for the Newton update in this case is

\[
x^{k+1} = x^k + \delta = x^k - \nabla F(x^k)^{-1} F(x^k).
\]
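The update formula can be illustrated with a small Python sketch. Rather than forming the inverse Jacobian, it solves the linear system ∇F(x^k)δ = −F(x^k) by Gaussian elimination, which is mathematically equivalent when the Jacobian is nonsingular. The two-equation test system (x₁² + x₂² = 4, x₁ = x₂), the starting point, the tolerance and all function names are hypothetical choices made for this illustration and do not come from the text.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-14:
            raise ValueError("Jacobian is numerically singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def newton_system(F, J, x0, tol=1e-10, max_iter=50):
    """Newton's method for F(x) = 0, where J(x) returns the Jacobian of F."""
    x = list(x0)
    for _ in range(max_iter):
        Fx = F(x)
        if max(abs(v) for v in Fx) <= tol:
            break
        delta = solve_linear(J(x), [-v for v in Fx])  # Jacobian * delta = -F(x)
        x = [xi + di for xi, di in zip(x, delta)]
    return x

# Hypothetical test system: a circle of radius 2 intersected with the line x1 = x2.
def F(x):
    return [x[0] ** 2 + x[1] ** 2 - 4.0, x[0] - x[1]]

def J(x):
    return [[2 * x[0], 2 * x[1]], [1.0, -1.0]]

root = newton_system(F, J, [1.0, 2.0])
print(root)  # both coordinates approach sqrt(2)
```

Solving the linear system instead of inverting the Jacobian is the usual practical choice, since it is cheaper and numerically better behaved.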

Example 5.5 Consider the following problem:

F(x)= F(x1, x2)=

First we calculate the Jacobian:

\[
\nabla F(x_1, x_2) = \begin{bmatrix} x_2 - 2 & x_1 + 1 \\ 2x_1 + 2 & 2x_2 - 7 \end{bmatrix}.
\]

If our initial estimate of the solution is x⁰= (0, 0), then the next point generated by Newton’s method will be:

x₁¹, x₂¹

Optimization version

When we use Newton’s method for unconstrained optimization of a twice-differentiable function f(x), the nonlinear equality system that we want to solve is the first-order necessary optimality condition ∇f(x) = 0. In this case, the functions fᵢ(x) in (5.6) are the partial derivatives of the function f. That is, fᵢ(x) = ∂f/∂xᵢ(x), and the Jacobian matrix of F = ∇f is the Hessian matrix of the function f:

\[
\nabla F(x_1, x_2, \ldots, x_n) = \nabla^2 f(x) = \left[ \frac{\partial^2 f}{\partial x_i \, \partial x_j}(x) \right]_{i,j = 1, \ldots, n}.
\]

Therefore, the Newton direction at iterate x^k is given by

\[
\delta = -\nabla^2 f(x^k)^{-1} \nabla f(x^k) \tag{5.8}
\]

and the Newton update formula is

\[
x^{k+1} = x^k + \delta = x^k - \nabla^2 f(x^k)^{-1} \nabla f(x^k).
\]

For illustration and comparison purposes, we apply this technique to the example problem of Section 5.4.1. Recall that the problem was to find

\[
\min f(x) = (x_1 - 2)^4 + \exp(x_1 - 2) + (x_1 - 2x_2)^2
\]

starting from x⁰ = (0, 3).

Table 5.6

  k    (x₁^k, x₂^k)      (d₁^k, d₂^k)        α^k     ‖∇f(x^{k+1})‖
  0    (0.000, 3.000)    (0.662, −2.669)     1.000   9.319
  1    (0.662, 0.331)    (0.429, 0.214)      1.000   2.606
  2    (1.091, 0.545)    (0.252, 0.126)      1.000   0.617
  3    (1.343, 0.671)    (0.108, 0.054)      1.000   0.084
  4    (1.451, 0.726)    (0.020, 0.010)      1.000   0.002
  5    (1.471, 0.735)    (0.001, 0.000)      1.000   0.000

The gradient of f was given in (5.4) and the Hessian matrix is given below:

\[
\nabla^2 f(x) = \begin{bmatrix} 12(x_1 - 2)^2 + \exp(x_1 - 2) + 2 & -4 \\ -4 & 8 \end{bmatrix}. \tag{5.9}
\]

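Thus, we calculate the Newton direction at x⁰ = (0, 3) as follows:

\[
\delta = -\nabla^2 f(x^0)^{-1} \nabla f(x^0)
= -\begin{bmatrix} 50 + e^{-2} & -4 \\ -4 & 8 \end{bmatrix}^{-1} \begin{bmatrix} -44 + e^{-2} \\ 24 \end{bmatrix}
\approx \begin{bmatrix} 0.662 \\ -2.669 \end{bmatrix}.
\]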

We list the first five iterates in Table 5.6 and illustrate the rapid progress of the algorithm towards the optimal solution in Figure 5.4. Note that the ideal step size for Newton’s method is almost always 1. In our example, this step size always satisfied the sufficient decrease condition and was chosen in each iteration. Newton’s method identifies a point with ‖∇f(x)‖ ≤ 10⁻⁵ after seven iterations.
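The Newton iteration for this example can be sketched in Python as follows. The gradient and Hessian are those of (5.4) and (5.9); the 2×2 Newton system is solved by Cramer's rule, and the same backtracking test with μ = 0.3 and β = 0.8 as in Section 5.4.1 is applied (using these values here is an assumption, as are the function names and the iteration cap).

```python
import math

def f(x):
    x1, x2 = x
    return (x1 - 2) ** 4 + math.exp(x1 - 2) + (x1 - 2 * x2) ** 2

def grad_f(x):
    x1, x2 = x
    return (4 * (x1 - 2) ** 3 + math.exp(x1 - 2) + 2 * (x1 - 2 * x2),
            -4 * (x1 - 2 * x2))

def hess_f(x):
    x1, _ = x
    return ((12 * (x1 - 2) ** 2 + math.exp(x1 - 2) + 2, -4.0),
            (-4.0, 8.0))

def newton_direction(x):
    """Solve hess_f(x) * delta = -grad_f(x) for the 2x2 case by Cramer's rule."""
    (a, b), (c, d) = hess_f(x)
    g1, g2 = grad_f(x)
    det = a * d - b * c
    return ((-g1 * d + b * g2) / det, (-a * g2 + c * g1) / det)

def newton_minimize(x0, tol=1e-5, mu=0.3, beta=0.8, max_iter=50):
    x = tuple(x0)
    steps = []  # one entry (x^k, delta^k, alpha^k) per iteration performed
    for _ in range(max_iter):
        g = grad_f(x)
        if math.hypot(*g) <= tol:
            break
        delta = newton_direction(x)
        slope = g[0] * delta[0] + g[1] * delta[1]  # directional derivative along delta
        alpha = 1.0  # always try the full Newton step first
        while f((x[0] + alpha * delta[0], x[1] + alpha * delta[1])) > f(x) + mu * alpha * slope:
            alpha *= beta
        steps.append((x, delta, alpha))
        x = (x[0] + alpha * delta[0], x[1] + alpha * delta[1])
    return x, steps

x_star, steps = newton_minimize((0.0, 3.0))
for k, (xk, dk, alpha) in enumerate(steps):
    print(k, xk, dk, alpha)
print("final point:", x_star)
```

The first direction printed is approximately (0.662, −2.669), in agreement with Table 5.6, and the iterates settle near (1.472, 0.736) within a handful of steps.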

Despite its excellent convergence behavior close to a solution, Newton’s method is not always the best option, especially for large-scale optimization. Often the Hessian matrix is expensive to compute at each iteration. In such cases, it may be preferable to use an approximation of the Hessian matrix instead. These approximations are usually chosen in such a way that the solution of the linear system in (5.8) is much cheaper than it would be with the exact Hessian. Such approaches are known as quasi-Newton methods. The most popular variants of quasi-Newton methods are the BFGS and DFP methods. These acronyms are formed from the names of the developers of these algorithms in the late 1960s and early 1970s. Detailed information on quasi-Newton approaches can be found in, for example, [61].
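To give a flavor of the quasi-Newton idea, here is a minimal Python sketch of a BFGS-type iteration applied to the example of Section 5.4.1. It is an illustration under stated assumptions, not the method as presented in [61]: it maintains an approximation H of the inverse Hessian (so no linear system is solved), starts from the identity matrix, uses the standard BFGS inverse update, and reuses the backtracking rule with μ = 0.3 and β = 0.8. All names, the safeguard threshold and the iteration cap are hypothetical.

```python
import math

def f(x):
    x1, x2 = x
    return (x1 - 2) ** 4 + math.exp(x1 - 2) + (x1 - 2 * x2) ** 2

def grad_f(x):
    x1, x2 = x
    return [4 * (x1 - 2) ** 3 + math.exp(x1 - 2) + 2 * (x1 - 2 * x2),
            -4 * (x1 - 2 * x2)]

def bfgs(x0, tol=1e-5, mu=0.3, beta=0.8, max_iter=200):
    n = len(x0)
    x = list(x0)
    # H approximates the inverse Hessian; the identity start is an assumption.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad_f(x)
    for k in range(max_iter):
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return x, k
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]  # d = -H g
        slope = sum(gi * di for gi, di in zip(g, d))
        alpha, fx = 1.0, f(x)
        while f([xi + alpha * di for xi, di in zip(x, d)]) > fx + mu * alpha * slope:
            alpha *= beta
        x_new = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]   # change in the iterate
        y = [a - b for a, b in zip(g_new, g)]   # change in the gradient
        ys = sum(yi * si for yi, si in zip(y, s))
        if ys > 1e-12:  # update only when the curvature y^T s is safely positive
            rho = 1.0 / ys
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hi for yi, hi in zip(y, Hy))
            # Expanded form of H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + rho * yHy) * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        x, g = x_new, g_new
    return x, max_iter

x_bfgs, iterations = bfgs([0.0, 3.0])
print(iterations, x_bfgs)
```

Each iteration needs only gradient evaluations and a few vector operations, which is the trade-off the paragraph above describes: no Hessian is computed, at the price of working with an approximation of the curvature.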

Exercise 5.12 Repeat Exercise 5.11, this time using the optimization version of Newton’s method. Use line search with μ = 1/2 in the Armijo–Goldstein condition and a backtracking ratio of β = 1/2.

Figure 5.4 Rapid convergence of Newton’s method

In document Optimization Methods in Finance (Page 106-114)