2.5 Stochastic Programming Algorithms
3.1.1 The Frank-Wolfe Method
This chapter introduces the Frank-Wolfe method and its more sophisticated generalisations, and develops theory related to these algorithms for reference in subsequent chapters. Section 3.2 demonstrates how the Frank-Wolfe method and its variants can be applied to solving the convex hull relaxation of an integer program. Section 3.3 explores a generalisation of the Frank-Wolfe method to non-smooth optimisation.
The Frank-Wolfe method (sometimes referred to as the conditional gradient method) was initially proposed by Frank and Wolfe in [41] for quadratic programming problems, and was generalised by Holloway [61] for general convex programming problems.
Given a problem of the form
$$\zeta^{CP} := \min_{x} \{ f(x) \mid x \in X \}, \qquad (3.1)$$
where f is a convex, continuous and differentiable function whose gradient ∇f is known, and the feasible set X is closed and convex, the Frank-Wolfe method consists of the following steps:
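Initialise: Find a feasible solution $\hat{x}^0 \in X$ for Equation 3.1. Set $k = 1$.
Step 1: Set $\xi^k \in \arg\min_{x \in X} \{ \nabla f(\hat{x}^{k-1})(x - \hat{x}^{k-1}) \}$. If $\nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1}) \ge 0$, the algorithm terminates; we cannot find a better point than $\hat{x}^{k-1}$.
Step 2: Set $t_k \in \arg\min_{0 \le \tau \le 1} \{ f((1 - \tau)\hat{x}^{k-1} + \tau \xi^k) \}$.
Step 3: Set $\hat{x}^k = (1 - t_k)\hat{x}^{k-1} + t_k \xi^k$.
Step 4: Set $k = k + 1$ and return to Step 1.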
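The steps above can be sketched in code for one concrete instance. This is a minimal illustration under stated assumptions, not an implementation taken from the text: the objective $f(x) = \tfrac{1}{2}\|x - c\|^2$, the choice of the unit simplex as $X$, the tolerance `tol`, and all function names are hypothetical. For this quadratic the Step 2 line search has a closed form, which is not true of a general convex $f$.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def simplex_lmo(g: Sequence[float]) -> List[float]:
    """Step 1 over the unit simplex: a linear function is minimised at the
    vertex whose gradient entry is smallest."""
    j = min(range(len(g)), key=lambda i: g[i])
    return [1.0 if i == j else 0.0 for i in range(len(g))]


def frank_wolfe_simplex(c: Sequence[float], x0: Sequence[float],
                        max_iter: int = 1000, tol: float = 1e-10) -> List[float]:
    """Minimise f(x) = 0.5 * ||x - c||^2 over the unit simplex (assumed example)."""
    x = list(x0)
    for _ in range(max_iter):
        grad = [xi - ci for xi, ci in zip(x, c)]      # gradient of f at x
        xi_k = simplex_lmo(grad)                       # Step 1: linearised subproblem
        d = [v - xi for v, xi in zip(xi_k, x)]         # direction xi^k - x
        slope = dot(grad, d)
        # Termination test of Step 1, relaxed by a small tolerance (an assumption;
        # the text tests slope >= 0 exactly).
        if slope >= -tol:
            break
        # Step 2: exact line search. Along d, f is the parabola
        # f(x) + tau * slope + 0.5 * tau^2 * ||d||^2, minimised on [0, 1].
        t = min(1.0, -slope / dot(d, d))
        x = [xi + t * di for xi, di in zip(x, d)]      # Step 3: convex combination
    return x


x_star = frank_wolfe_simplex([0.2, 0.3, 0.5], [1.0, 0.0, 0.0])
```

Because `c` lies inside the simplex in this example, the iterates should approach `c` itself; each iterate remains feasible since it is a convex combination of feasible points.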
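The termination condition in Step 1 may be informally interpreted as “there exists no feasible point in a direction of descent from the current point”. Formally, by reference to the Karush-Kuhn-Tucker conditions in Theorem 2.17, $\hat{x}^{k-1}$ is an optimal solution of (3.1) if and only if
$$-\nabla f(\hat{x}^{k-1}) \in N_X(\hat{x}^{k-1}), \qquad (3.2)$$
i.e. $\langle x - \hat{x}^{k-1}, -\nabla f(\hat{x}^{k-1}) \rangle \le 0$ for all $x \in X$. Since $\xi^k$ minimises $\nabla f(\hat{x}^{k-1})(x - \hat{x}^{k-1})$ over $X$, this holds exactly when $\nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1}) \ge 0$.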
Furthermore, since $f$ is a convex function, the hyperplane with gradient $\nabla f(\hat{x}^{k-1})$ which intersects the graph of $f$ at $\hat{x}^{k-1}$ minorises $f$; therefore,
$$f(\hat{x}^{k-1}) + \nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1})$$
(i.e. the minimum of this hyperplane over $X$) is a lower bound on $\zeta^{CP}$ for all $k$.
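The lower bound can be written out as a short chain of inequalities. The derivation below uses only the convexity of $f$ and the definition of $\xi^k$ in Step 1; the final line, which bounds the optimality gap of the current iterate, is a direct consequence added here for illustration.

```latex
% For every x in X: the first inequality is the gradient inequality for convex f,
% the second holds because \xi^k minimises the linearisation over X (Step 1).
\begin{align*}
f(x) &\ge f(\hat{x}^{k-1}) + \nabla f(\hat{x}^{k-1})(x - \hat{x}^{k-1}) \\
     &\ge f(\hat{x}^{k-1}) + \nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1})
     \qquad \text{for all } x \in X.
\end{align*}
% Minimising the left-hand side over x in X gives the bound on \zeta^{CP},
% and rearranging bounds the gap between the current value and the optimum.
\[
\zeta^{CP} \ge f(\hat{x}^{k-1}) + \nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1}),
\qquad\text{so}\qquad
0 \le f(\hat{x}^{k-1}) - \zeta^{CP} \le -\nabla f(\hat{x}^{k-1})(\xi^k - \hat{x}^{k-1}).
\]
```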
The Frank-Wolfe method has a worst-case convergence rate of $O(1/k)$ (e.g. [42]).
To utilise the Frank-Wolfe method we need to be able to solve the minimisation problems in Steps 1 and 2. Since $\nabla f(\hat{x})(x - \hat{x})$ is affine with respect to $x$ for a fixed $\hat{x}$, and $X$ is convex, the Step 1 update is typically not very difficult. However, $f((1 - \tau)\hat{x} + \tau\xi)$ is merely convex with respect to $\tau$ for fixed $\hat{x}$ and $\xi$ (it need not be affine or smooth), so even though its feasible set $[0, 1]$ is very simple in structure, the Step 2 update may not be easy to solve exactly, depending on the structure of $f$.
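When the Step 2 problem has no closed form, one common derivative-free option is golden-section search, which only needs function values and relies on the fact that a convex function of $\tau$ is unimodal on $[0, 1]$. The sketch below is an illustration of that idea, not a method prescribed by the text; the function name `golden_section_min` and the tolerance `tol` are assumptions.

```python
import math
from typing import Callable


def golden_section_min(phi: Callable[[float], float],
                       lo: float = 0.0, hi: float = 1.0,
                       tol: float = 1e-8) -> float:
    """Approximate a minimiser of a unimodal function phi on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0   # 1 / golden ratio, about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)                # left interior point
    d = a + inv_phi * (b - a)                # right interior point
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc <= fd:
            # A minimiser lies in [a, d]; the old c becomes the new right point.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = phi(c)
        else:
            # A minimiser lies in [c, b]; the old d becomes the new left point.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = phi(d)
    return 0.5 * (a + b)


# Usage: phi(tau) plays the role of f((1 - tau) * x_hat + tau * xi) in Step 2.
tau_smooth = golden_section_min(lambda tau: (tau - 0.3) ** 2)
tau_kinked = golden_section_min(lambda tau: abs(tau - 0.7))
```

Each iteration reuses one of the two interior evaluations, so only one new evaluation of `phi` is needed per step. The second usage line shows that differentiability of `phi` is not required.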
In early iterations of the algorithm, the step lengths $t_k$ are computed from gradient information at points which are not necessarily close to optimal. This gradient information is therefore only an approximate guide to the location of the optimal point, so it is unsurprising that the line search in Step 2 may also be performed approximately without compromising the convergence properties of the Frank-Wolfe method. The rules used for these approximations generally make use of the gradient information at $\hat{x}^{k-1}$ and global properties of the objective function. The approximation schemes used in ordinary gradient descent algorithms are generally applicable to the Frank-Wolfe method as well.
An example of such an approximation scheme is the Armijo rule [6]. The Armijo rule chooses a step length based on the accuracy of the gradient information. If the gradient information remains reliable over a long step length then a long step will be chosen. Conversely, if the gradient information becomes inaccurate over a long step length then a shorter step will be chosen. This process is formalised in the context of the Frank-Wolfe method as follows. At Step 2 in iteration $k$ of the Frank-Wolfe method we wish to choose a step length $t_k$ from $\hat{x}^{k-1}$ in the direction $d^k = \xi^k - \hat{x}^{k-1}$ based on the gradient $\nabla f(\hat{x}^{k-1})$. Replace the minimisation over $\tau$ with the following procedure:
Initialise: Set an initial step length $0 < s \le 1$, a step-size multiplier $0 < \beta < 1$, and a parameter $0 < \gamma < 1$ which determines the required accuracy of the gradient projection. Set $m = 0$.
Step 1: Set $\tau = \beta^m s$ (this is the trial step length).
Step 2: If $f(\hat{x}^{k-1}) - f(\hat{x}^{k-1} + \tau d^k) \ge -\gamma \tau \langle \nabla f(\hat{x}^{k-1}), d^k \rangle$ then set $t_k = \tau$ and terminate.
Step 3: Set $m = m + 1$ and return to Step 1.
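The backtracking procedure above can be sketched as follows. This is an illustrative sketch only: the function name `armijo_step`, the default parameter values, and the `max_backtracks` safeguard are assumptions that do not come from the text, and `d` is assumed to be a descent direction (negative inner product with the gradient), as it is whenever the Frank-Wolfe method has not terminated in Step 1.

```python
from typing import Callable, List, Sequence


def armijo_step(f: Callable[[Sequence[float]], float],
                x: Sequence[float], grad: Sequence[float], d: Sequence[float],
                s: float = 1.0, beta: float = 0.5, gamma: float = 0.1,
                max_backtracks: int = 50) -> float:
    """Return a step length tau = beta**m * s that satisfies the Armijo test."""
    slope = sum(gi * di for gi, di in zip(grad, d))   # <grad f(x), d>, assumed < 0
    fx = f(x)
    tau = s                                            # Step 1 with m = 0
    for _ in range(max_backtracks):                    # safeguard (an assumption)
        trial: List[float] = [xi + tau * di for xi, di in zip(x, d)]
        # Step 2: accept if the actual decrease is at least a fraction gamma
        # of the decrease predicted by the linearisation.
        if fx - f(trial) >= -gamma * tau * slope:
            return tau
        tau *= beta                                    # Step 3: shrink and retry
    return tau


def half_sq_norm(x: Sequence[float]) -> float:
    """Example objective f(x) = 0.5 * ||x||^2, whose gradient at x is x."""
    return 0.5 * sum(xi * xi for xi in x)


# From x = (1, 0) along d = (-1, 1): the full step gives no decrease and is
# rejected, so the rule backtracks to a shorter step.
t = armijo_step(half_sq_norm, [1.0, 0.0], [1.0, 0.0], [-1.0, 1.0])
```

In the Frank-Wolfe setting, `x` would be $\hat{x}^{k-1}$ and `d` would be $\xi^k - \hat{x}^{k-1}$; since $s \le 1$, every trial point stays inside $X$.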
In fact, it is not necessary to use any information about the objective function or the progress of the algorithm when choosing $t_k$. For example, if we skip Step 2 entirely and set the step sizes using the rule
$$t_k = \frac{2}{k + 2}$$
for all $k$, the Frank-Wolfe method converges with the same worst-case rate of $O(1/k)$ [42].
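A minimal sketch of this line-search-free variant is given below. The function names, the box-shaped feasible set, and the quadratic test objective are all assumptions chosen for illustration; for brevity the sketch runs a fixed number of iterations and omits the termination test of Step 1.

```python
from typing import Callable, List, Sequence


def box_lmo(g: Sequence[float], lower: Sequence[float],
            upper: Sequence[float]) -> List[float]:
    """Step 1 over a box: each coordinate goes to the bound that makes
    its term g_i * x_i smallest."""
    return [lo if gi >= 0.0 else up for gi, lo, up in zip(g, lower, upper)]


def frank_wolfe_open_loop(grad_f: Callable[[Sequence[float]], List[float]],
                          lmo: Callable[[Sequence[float]], List[float]],
                          x0: Sequence[float], num_iter: int) -> List[float]:
    """Frank-Wolfe with the predetermined step sizes t_k = 2 / (k + 2)."""
    x = list(x0)
    for k in range(1, num_iter + 1):
        xi_k = lmo(grad_f(x))                 # Step 1: linearised subproblem
        t = 2.0 / (k + 2.0)                   # replaces the Step 2 line search
        x = [(1.0 - t) * xi + t * vi for xi, vi in zip(x, xi_k)]   # Step 3
    return x


# Assumed example: minimise 0.5 * ||x - c||^2 over the unit square [0, 1]^2.
c = [0.25, 1.5]
x_final = frank_wolfe_open_loop(
    grad_f=lambda x: [xi - ci for xi, ci in zip(x, c)],
    lmo=lambda g: box_lmo(g, [0.0, 0.0], [1.0, 1.0]),
    x0=[0.0, 0.0],
    num_iter=2000,
)
```

In this example the minimiser is the projection of `c` onto the square, namely $(0.25, 1)$, and the iterates should drift towards it without any evaluation of the objective itself.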