The Frank-Wolfe Method - Stochastic Programming Algorithms


3.1.1 The Frank-Wolfe Method

This chapter introduces the Frank-Wolfe method and its more sophisticated generalisations, and develops theory related to these algorithms for reference in subsequent chapters. Section 3.2 demonstrates how the Frank-Wolfe method and its variants can be applied to solving the convex hull relaxation of an integer program. Section 3.3 explores a generalisation of the Frank-Wolfe method to non-smooth optimisation.

The Frank-Wolfe method (sometimes referred to as the conditional gradient method) was initially proposed by Frank and Wolfe in [41] for quadratic programming problems, and was generalised by Holloway [61] for general convex programming problems.

Given a problem of the form

$$\zeta^{CP} := \min_{x} \{ f(x) \mid x \in X \}, \tag{3.1}$$

where f is a convex, continuous and differentiable function whose gradient ∇f is known, and the feasible set X is closed and convex, the Frank-Wolfe method consists of the following steps:

Initialise Find a feasible solution $\hat{x}^0 \in X$ for Equation 3.1. Set $k = 1$.

Step 1 Set $\xi^k \in \arg\min_{x \in X} \{ \nabla f(\hat{x}^{k-1})(x - \hat{x}^{k-1}) \}$. If $\nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1}) \geq 0$, the algorithm terminates; we cannot find a better point than $\hat{x}^{k-1}$.

Step 2 Set $t^k \in \arg\min_{0 \leq \tau \leq 1} \{ f((1 - \tau)\hat{x}^{k-1} + \tau \xi^k) \}$.

Step 3 Set $\hat{x}^k = (1 - t^k)\hat{x}^{k-1} + t^k \xi^k$.

Step 4 Set $k = k + 1$ and return to Step 1.
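The steps above can be sketched in Python on a small instance. This is only an illustrative sketch under assumptions that are not part of the text: the objective is taken to be the quadratic $f(x) = \tfrac{1}{2}\|x - c\|^2$, the feasible set $X$ is the unit simplex, and the names `objective`, `frank_wolfe`, `c`, `x0`, `max_iter` and `tol` are hypothetical. For this particular $f$ the Step 2 line search has a closed form, and over the simplex the Step 1 linear subproblem is always solved by a vertex. The tolerance in the termination test is a floating-point safeguard added for the sketch.

```python
def objective(x, c):
    """Assumed toy objective f(x) = 0.5 * ||x - c||^2."""
    return 0.5 * sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def frank_wolfe(c, x0, max_iter=2000, tol=1e-10):
    """Frank-Wolfe sketch for min f(x) over the unit simplex.

    Returns the final iterate and the number of iterations started.
    """
    x = list(x0)
    n = len(x)
    iterations = 0
    for k in range(1, max_iter + 1):
        iterations = k
        grad = [xj - cj for xj, cj in zip(x, c)]  # gradient of the toy f
        # Step 1: minimise the linearisation over the simplex; a vertex
        # with the smallest gradient component is a minimiser.
        i = min(range(n), key=lambda j: grad[j])
        xi = [1.0 if j == i else 0.0 for j in range(n)]
        d = [vj - xj for vj, xj in zip(xi, x)]
        slope = sum(gj * dj for gj, dj in zip(grad, d))
        if slope >= -tol:
            break  # no feasible descent direction (up to the tolerance)
        # Step 2: exact line search; for this quadratic the minimiser of
        # f(x + tau * d) over [0, 1] is available in closed form.
        t = min(1.0, -slope / sum(dj * dj for dj in d))
        # Step 3: move to the convex combination of x and xi.
        x = [xj + t * dj for xj, dj in zip(x, d)]
    return x, iterations


c = [0.2, 0.5, 0.9]
x, iterations = frank_wolfe(c, [1.0 / 3.0] * 3)
print(x, objective(x, c), iterations)
```

Because every iterate is a convex combination of feasible points, the sketch never needs a projection onto $X$.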

The termination condition in Step 1 may be informally interpreted as “there exists no feasible point in a direction of descent from the current point”. Formally, by reference to the Karush-Kuhn-Tucker conditions in Theorem 2.17, $\hat{x}^{k-1}$ is an optimal solution of (3.1) if and only if

$$-\nabla f(\hat{x}^{k-1}) \in N_X(\hat{x}^{k-1}) \tag{3.2}$$

i.e. $\langle x - \hat{x}^{k-1}, -\nabla f(\hat{x}^{k-1}) \rangle \leq 0$ for all $x \in X$.

Furthermore, since $f$ is a convex function, the hyperplane with gradient $\nabla f(\hat{x}^{k-1})$ which intersects the graph of $f$ at $\hat{x}^{k-1}$ minorises $f$; therefore,

$$f(\hat{x}^{k-1}) + \nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1})$$

(i.e. the minimum of this hyperplane over $X$) is a lower bound on $\zeta^{CP}$ for all $k$.
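The lower bound comes at no extra cost, since the Step 1 solution is already available. The sketch below pairs it with the upper bound $f(\hat{x}^{k-1})$ given by any feasible point. The helper names (`dot`, `fw_bounds`, `simplex_lmo`) and the toy problem (a quadratic objective over the unit simplex) are assumptions made for illustration only.

```python
def dot(u, v):
    return sum(uj * vj for uj, vj in zip(u, v))


def fw_bounds(f, grad_f, lmo, x):
    """Return (lower, upper) bounds on the optimal value at a feasible x.

    lmo(g) is assumed to return a minimiser of <g, y> over the feasible set.
    """
    g = grad_f(x)
    xi = lmo(g)
    # Minimum over X of the supporting hyperplane of f at x.
    lower = f(x) + dot(g, [a - b for a, b in zip(xi, x)])
    return lower, f(x)


# Assumed toy problem: f(x) = 0.5 * ||x - c||^2 over the unit simplex.
c = [0.2, 0.5, 0.9]


def f(x):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, c))


def grad_f(x):
    return [a - b for a, b in zip(x, c)]


def simplex_lmo(g):
    # A linear function over the simplex is minimised at a vertex.
    i = min(range(len(g)), key=lambda j: g[j])
    return [1.0 if j == i else 0.0 for j in range(len(g))]


lower, upper = fw_bounds(f, grad_f, simplex_lmo, [1.0 / 3.0] * 3)
print(lower, upper)
```

The difference `upper - lower` equals $-\nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1})$, so it can serve as a stopping criterion with a guaranteed optimality gap.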

The Frank-Wolfe method has a worst-case convergence rate of $O(1/k)$ (e.g. [42]).

To utilise the Frank-Wolfe method we need to be able to solve the minimisation problems in Steps 1 and 2. Since $\nabla f(\hat{x})(x - \hat{x})$ is affine with respect to $x$ for a fixed $\hat{x}$, and $X$ is convex, the Step 1 update is typically not very difficult. However, $f((1 - \tau)\hat{x} + \tau\xi)$ is merely convex with respect to $\tau$ for fixed $\hat{x}$ and $\xi$ (it need not be affine or smooth), so even though its feasible set $[0, 1]$ is very simple in structure, the Step 2 update may not be easy to solve exactly, depending on the structure of $f$.
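To make the contrast concrete, the sketch below shows one case where Step 1 has a closed form and one derivative-free way of approximating Step 2. Both are illustrative assumptions rather than the method prescribed by the text: `box_lmo` assumes $X$ is a box (for a general polyhedral $X$ a linear programming solver would be needed instead), and `golden_section` is a standard golden-section search, which only requires the one-dimensional function to be unimodal on $[0, 1]$, as a convex function is.

```python
import math


def box_lmo(grad, lower, upper):
    """Step 1 over the box {x : lower <= x <= upper}: minimise <grad, x>.

    The objective is separable, so each coordinate goes to the bound that
    makes grad[j] * x[j] smallest (ties are sent to the lower bound).
    """
    return [u if g < 0 else l for g, l, u in zip(grad, lower, upper)]


def golden_section(phi, lo=0.0, hi=1.0, tol=1e-8):
    """Approximate a minimiser of a convex phi on [lo, hi] without derivatives."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc <= fd:
            # A minimiser lies in [a, d]; reuse c as the new right probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = phi(c)
        else:
            # A minimiser lies in [c, b]; reuse d as the new left probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)


# A convex but non-smooth function of tau, to which the search still applies.
tau = golden_section(lambda t: abs(t - 0.75))
print(box_lmo([1.0, -2.0, 0.0], [0, 0, 0], [1, 1, 1]), tau)
```

Each pass of the search costs one new function evaluation, which is why cheaper inexact rules become attractive when $f$ is expensive to evaluate.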

In early iterations of the algorithm our $t^k$ updates are based on gradient information at points which are not necessarily close to optimal. This gradient information is therefore only an approximate guide to the location of the optimal point, so it is unsurprising that the line search in Step 2 may also be performed approximately without compromising the convergence properties of the Frank-Wolfe method. The rules used for these approximations generally make use of the gradient information at $\hat{x}^{k-1}$ and global properties of the objective function. The approximation schemes used in ordinary gradient descent algorithms are generally applicable to the Frank-Wolfe method as well.

An example of such an approximation scheme is the Armijo rule [6]. The Armijo rule chooses a step length based on the accuracy of the gradient information. If the gradient information remains reliable over a long step length then a long step will be chosen. Conversely, if the gradient information becomes inaccurate over a long step length then a shorter step will be chosen. This process is formalised in the context of the Frank-Wolfe method as follows. At Step 2 in iteration $k$ of the Frank-Wolfe method we wish to choose a step length $t^k$ from $\hat{x}^{k-1}$ in the direction $d^k = \xi^k - \hat{x}^{k-1}$ based on the gradient $\nabla f(\hat{x}^{k-1})$. Replace the minimisation over $\tau$ with the following procedure:

Initialise Set an initial step length $0 < s \leq 1$, a step-size multiplier $0 < \beta < 1$, and a parameter $0 < \gamma < 1$ which determines the required accuracy of the gradient projection. Set $m = 0$.

Step 1 Set $\tau = \beta^m s$ (this is the trial step length).

Step 2 If $f(\hat{x}^{k-1}) - f(\hat{x}^{k-1} + \tau d^k) \geq -\gamma\tau\langle \nabla f(\hat{x}^{k-1}), d^k \rangle$ then set $t^k = \tau$ and terminate.

Step 3 Set $m = m + 1$ and return to Step 1.
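The backtracking procedure above translates almost line for line into code. The following is a minimal sketch, assuming vectors are plain Python lists; the function name `armijo_step`, the default parameter values, and the `max_backtracks` cap (a safeguard against an endless loop in floating-point arithmetic) are assumptions that do not appear in the text.

```python
def armijo_step(f, grad_x, x, d, s=1.0, beta=0.5, gamma=0.1, max_backtracks=50):
    """Armijo backtracking along d from x.

    grad_x is the gradient of f at x, and d is assumed to be a descent
    direction, so the directional derivative <grad_x, d> is negative.
    """
    slope = sum(gj * dj for gj, dj in zip(grad_x, d))
    fx = f(x)
    tau = s
    for m in range(max_backtracks):
        tau = (beta ** m) * s  # Step 1: trial step length
        trial = [xj + tau * dj for xj, dj in zip(x, d)]
        # Step 2: accept if the actual decrease is at least the fraction
        # gamma of the decrease predicted by the linearisation.
        if fx - f(trial) >= -gamma * tau * slope:
            return tau
        # Step 3: otherwise shrink the step and try again.
    return tau  # hypothetical fallback: smallest trial step reached


# Assumed toy data: f(x) = 0.5 * ||x - c||^2, moving from one simplex
# vertex towards another.
c = [0.2, 0.5, 0.9]


def f(x):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, c))


x = [1.0, 0.0, 0.0]
grad_x = [a - b for a, b in zip(x, c)]
d = [-1.0, 0.0, 1.0]  # xi - x with xi = (0, 0, 1)

print(armijo_step(f, grad_x, x, d, gamma=0.1))  # full step is accepted
print(armijo_step(f, grad_x, x, d, gamma=0.5))  # one backtrack is needed
```

A larger `gamma` demands closer agreement between the linear prediction and the actual decrease, so it tends to produce shorter steps.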

In fact, it is not necessary to use any information about the objective function or the progress of the algorithm when choosing $t^k$. For example, if we skip Step 2 entirely and set the step sizes using the rule

$$t^k = \frac{2}{k + 2}$$

for all $k$, the Frank-Wolfe method converges with the same worst-case rate of $O(1/k)$ [42].
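A sketch of this predetermined step-size variant is given below. The toy problem (a quadratic objective over the unit simplex) and the names `frank_wolfe_open_loop`, `c`, `x0` and `num_iter` are assumptions for illustration. For simplicity the sketch runs a fixed number of iterations rather than applying the Step 1 termination test.

```python
def frank_wolfe_open_loop(c, x0, num_iter):
    """Frank-Wolfe with t^k = 2 / (k + 2) for min 0.5 * ||x - c||^2
    over the unit simplex. Returns the final iterate and the objective
    value recorded after each iteration.
    """
    x = list(x0)
    n = len(x)
    history = []
    for k in range(1, num_iter + 1):
        grad = [xj - cj for xj, cj in zip(x, c)]
        # Step 1: a vertex with the smallest gradient component.
        i = min(range(n), key=lambda j: grad[j])
        # Step 2 is skipped: the step size depends only on k.
        t = 2.0 / (k + 2.0)
        # Step 3: convex combination of x and the chosen vertex.
        x = [(1.0 - t) * xj + (t if j == i else 0.0) for j, xj in enumerate(x)]
        history.append(0.5 * sum((xj - cj) ** 2 for xj, cj in zip(x, c)))
    return x, history


c = [0.2, 0.5, 0.9]
x, history = frank_wolfe_open_loop(c, [1.0 / 3.0] * 3, 2000)
print(x, history[-1])
```

Unlike a line search, this rule does not guarantee that the objective decreases at every iteration; only the overall rate is controlled.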

In document Decomposition and duality based approaches to stochastic integer programming (Page 53-55)