Momentum and Variable Step Size Based Acceleration Methods

The most general form of these acceleration methods is

$$x^{(n)} = x^{(n-1)} + \alpha^{(n-1)} \gamma_{g,\hat{x}^{(n-1)}} + \beta^{(n-1)} \left(x^{(n-1)} - x^{(n-2)}\right) \qquad (6.8)$$

where $\gamma_{g,\hat{x}^{(n-1)}}$ is the additive update resulting from minimization of the surrogate function $g$ parameterized by $x^{(n-1)}$, and $\beta^{(n-1)}$ is the "momentum" term, chosen in $[0, 1)$ [9]. The Lipschitz surrogate version of this is known as the heavy ball method [87]. Arguably the most historically important work on momentum and variable step size methods is by Nesterov [79]. For smooth functions that are not strongly convex, this method improves the convergence rate from $O(1/n)$ to $O(1/n^2)$, while for strongly convex smooth functions the linear rate of convergence is improved from $O((1 - \mu/L)^{2n})$ to $O((1 - \sqrt{\mu/L})^{n})$ [80]. We start with the definition and explanation of a key concept that leads to accelerated methods, called estimate sequences. Here we mostly follow [3, 80]; a more general treatment of estimate sequences is given in [3].
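As a rough illustration of update (6.8), the following sketch applies it to a small quadratic. It assumes the Lipschitz (quadratic) surrogate, so that the additive update is $\gamma = -\nabla f(x^{(n-1)})/L$, and it uses constant $\alpha$ and $\beta$. The test problem, the function name `momentum_iteration`, and the parameter values are hypothetical choices made only for this example.

```python
import numpy as np

def momentum_iteration(grad, x0, L, alpha=1.0, beta=0.5, n_iter=50):
    """Iterate x(n) = x(n-1) + alpha*gamma + beta*(x(n-1) - x(n-2)).

    gamma = -grad(x(n-1)) / L is the shift to the minimizer of a Lipschitz
    surrogate (an assumption of this sketch, not the general surrogate g).
    """
    x_prev = x0.copy()
    x = x0.copy()
    for _ in range(n_iter):
        gamma = -grad(x) / L
        x_next = x + alpha * gamma + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x with L = 10, mu = 1.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
x_star = np.linalg.solve(A, b)

x_hb = momentum_iteration(grad, np.zeros(2), L=10.0, beta=0.5)  # with momentum
x_gd = momentum_iteration(grad, np.zeros(2), L=10.0, beta=0.0)  # no momentum
print(np.linalg.norm(x_hb - x_star), np.linalg.norm(x_gd - x_star))
```

Setting `beta=0.0` recovers the plain surrogate minimization step, which makes the effect of the momentum term easy to compare on the same problem.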

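Definition 6.3.1. ([80], Definition 2.2.1) A pair of sequences $\{\phi^{(n)}(x)\}_{n=0}^{\infty}$ and $\{\lambda^{(n)}\}_{n=0}^{\infty}$, $\lambda^{(n)} \ge 0$, is called an estimate sequence of function $f(x)$ if

• $\lambda^{(n)} \to 0$,

• for any $x \in \mathbb{R}^N$ and all $n \ge 0$,

$$\phi^{(n)}(x) \le (1 - \lambda^{(n)}) f(x) + \lambda^{(n)} \phi^{(0)}(x). \qquad (6.9)$$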

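Lemma 6.3.2. ([80], Lemma 2.2.1) If a sequence $\{x^{(n)}\}$ satisfies

$$f(x^{(n)}) \le \phi^{(n)}_{*} = \min_{x \in \mathbb{R}^N} \phi^{(n)}(x), \qquad (6.10)$$

then $f(x^{(n)}) - f(x^*) \le \lambda^{(n)} \left(\phi^{(0)}(x^*) - f(x^*)\right) \to 0$.

Proof. By the definition of an estimate sequence,

$$f(x^{(n)}) \le \phi^{(n)}_{*} \le \min_{x \in \mathbb{R}^N} \left[(1 - \lambda^{(n)}) f(x) + \lambda^{(n)} \phi^{(0)}(x)\right] \le (1 - \lambda^{(n)}) f(x^*) + \lambda^{(n)} \phi^{(0)}(x^*). \qquad (6.11)$$

Subtracting $f(x^*)$ from both sides gives the claim.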

This lemma is crucial: in contrast to unaccelerated convergent first-order methods, where monotone decrease of the function value is the tool used to establish the rate of convergence, we can now obtain the rate from the rate at which $\lambda^{(n)}$ converges to zero.

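Lemma 6.3.3. ([3], Proposition 2.2) Assume

• $f \in C^{1,1}_{\mu,L}(\mathbb{R}^N)$.

• $\phi^{(0)}(x)$ is a convex function on $\mathbb{R}^N$ such that $\min_x \phi^{(0)}(x) \ge f(x^*)$.

• We have a sequence of functions $\{f^{(n)}\}_{n=0}^{\infty}$ that underestimates $f$. In other words, $f^{(n)}(x) \le f(x)$ for all $x$ and $n \ge 0$.

• $\{\alpha^{(n)}\}_{n=0}^{\infty}$: $\alpha^{(n)} \in (0, 1)$, $\sum_{n=0}^{\infty} \alpha^{(n)} = \infty$.

• $\lambda^{(0)} = 1$.

Then the pair of sequences $\{\phi^{(n)}(x)\}_{n=0}^{\infty}$ and $\{\lambda^{(n)}\}_{n=0}^{\infty}$ defined recursively by

• $\lambda^{(n+1)} = (1 - \alpha^{(n)}) \lambda^{(n)}$

• $\phi^{(n+1)}(x) = (1 - \alpha^{(n)}) \phi^{(n)}(x) + \alpha^{(n)} f^{(n)}(x)$

is an estimate sequence for function $f$.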

Proof. See [3].
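The recursion in Lemma 6.3.3 can be checked numerically on a toy problem. The sketch below is only an illustration under assumed choices: a scalar $f(x) = x^2/2$, tangent lines of $f$ as the underestimating functions $f^{(n)}$, an arbitrary convex $\phi^{(0)}$, and the weights $\alpha^{(n)} = 2/(n+3)$, for which the product telescopes to $\lambda^{(n)} = 2/((n+1)(n+2))$. None of these choices come from [3] or [80].

```python
import numpy as np

f = lambda x: 0.5 * x**2
df = lambda x: x
phi0 = lambda x: 0.5 * (x - 3.0)**2 + 1.0   # convex, min phi0 = 1 >= f(x*) = 0

xs = np.linspace(-5.0, 5.0, 201)            # grid on which functions are sampled
phi = phi0(xs)                              # phi^(0) on the grid
lam = 1.0                                   # lambda^(0) = 1
lams = [lam]
max_violation = -np.inf

for n in range(50):
    alpha = 2.0 / (n + 3)                   # in (0, 1), sum diverges (assumed choice)
    y = 3.0 / (n + 1)                       # arbitrary tangent point (hypothetical)
    f_n = f(y) + df(y) * (xs - y)           # tangent line underestimates convex f
    phi = (1 - alpha) * phi + alpha * f_n   # phi^(n+1)
    lam = (1 - alpha) * lam                 # lambda^(n+1)
    lams.append(lam)
    # Estimate-sequence inequality: phi^(n) <= (1 - lam) f + lam phi^(0)
    bound = (1 - lam) * f(xs) + lam * phi0(xs)
    max_violation = max(max_violation, np.max(phi - bound))

print(max_violation, lams[-1])
```

With this particular choice of $\alpha^{(n)}$ the sequence $\lambda^{(n)}$ decays like $1/n^2$, which is the kind of decay that Lemma 6.3.2 turns into a rate on $f(x^{(n)}) - f(x^*)$.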

The key parts of these methods are finding such functions and parameters. For the case of gradient descent, Nesterov [80] provided a general path of possible accelerated algorithms. For the more general case of surrogate functions that are strongly-convex, Mairal [68] uses another sequence of functions, which will be useful for a fast variant of Jensen surrogate optimization that will be discussed in the next chapter.

Chapter 7 Acceleration Methods for Convex Optimization Using Jensen Surrogates

In this chapter, we investigate acceleration methods using Jensen surrogates for convex problems. Many of the acceleration schemes used with the well-known gradient descent method are also applicable to Jensen surrogates, but they require careful analysis. Here, we look at different techniques that can be used with Jensen surrogates.

7.1 Range Based Acceleration Methods with Jensen Surrogates

Recalling the definitions we made in the previous chapter, we attempt to numerically solve

$$\min_{x \in X} \sum_{k=1}^{B_r} f_{B_k^r}(x), \qquad (7.1)$$

where $f_{B_k^r}(x) = \sum_{i \in B_k^r} f_i(x)$ and each $B_k^r$ represents the set of indices of the corresponding batch. We define the Jensen surrogate function that was formed using the forward projected estimate of $\tilde{x}$ around $\hat{x}$ with data terms from mini-batch $B_k^r$ as $g_{B_k^r,r}(x; \tilde{x}, \hat{x})$. This is formally defined as

$$g_{B_k^r,r}(x; \tilde{x}, \hat{x}) = \sum_{i \in B_k^r} \sum_{j} r_{ij}\, \tilde{f}_i\!\left(\frac{h_{ij}}{r_{ij}} (x_j - \hat{x}_j) + (H\tilde{x})_i\right) + \sum_{i \in B_k^r} r_{i0}\, \tilde{f}_i\!\left(\frac{h_{i0}}{r_{i0}} (x_0 - \hat{x}_0) + (H\tilde{x})_i\right), \qquad (7.2)$$

where $r$ satisfies (4.11) as before. Also, assume that each batch of functions has a Lipschitz gradient constant and that these constants are bounded by $L_f$. Similarly, we denote the Jensen surrogate counterpart as $L_g$, while the strong-convexity parameters are denoted as $\mu_f$ and $\mu_g$, respectively.
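To make definition (7.2) concrete, the sketch below evaluates the surrogate for one batch and checks numerically that it majorizes the batch objective. It rests on several assumptions that are not taken from the text: the data terms are least squares, $\tilde{f}_i(t) = (t - y_i)^2/2$; condition (4.11), which is not reproduced in this section, is assumed to amount to nonnegative weights with $\sum_j r_{ij} = 1$; the weights are chosen as $r_{ij} = |h_{ij}| / \sum_{j'} |h_{ij'}|$; and the extra index-0 term of (7.2) is omitted. The function names are hypothetical.

```python
import numpy as np

def jensen_surrogate(x, x_tilde, x_hat, H, r, y, batch):
    """Evaluate g_{B_k,r}(x; x_tilde, x_hat) for f_i(t) = 0.5 * (t - y_i)**2.

    Assumes r > 0 entrywise with rows summing to one (stand-in for (4.11));
    the index-0 term of (7.2) is left out of this sketch.
    """
    Hb, rb, yb = H[batch], r[batch], y[batch]
    proj = Hb @ x_tilde                               # forward projections (H x_tilde)_i
    arg = (Hb / rb) * (x - x_hat) + proj[:, None]     # argument of f_i, shape (|B_k|, N)
    return np.sum(rb * 0.5 * (arg - yb[:, None])**2)

def batch_objective(x, H, y, batch):
    """f_{B_k}(x) = sum over i in the batch of 0.5 * ((H x)_i - y_i)**2."""
    return 0.5 * np.sum((H[batch] @ x - y[batch])**2)

rng = np.random.default_rng(0)
M, N = 12, 4
H = rng.normal(size=(M, N))                           # dense, so every r_ij > 0
y = rng.normal(size=M)
r = np.abs(H) / np.abs(H).sum(axis=1, keepdims=True)  # assumed weight choice
batch = np.arange(0, M, 3)                            # one hypothetical mini-batch
x_hat = rng.normal(size=N)

x_probe = rng.normal(size=N)
print(jensen_surrogate(x_probe, x_hat, x_hat, H, r, y, batch),
      batch_objective(x_probe, H, y, batch))
```

When $\tilde{x} = \hat{x}$, Jensen's inequality gives $g \ge f_{B_k}$ everywhere with equality at $x = \hat{x}$, which is the property the later algorithms rely on.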

Using this notation, we will present several different algorithms with comments on their rates of convergence. The most well-known range based acceleration technique with Jensen surrogates is Cyclic Incremental Convex Optimization, presented in Algorithm 6. This variant is deterministic and needs no additional storage. The per-iteration computational cost is of order $O(MN/B_r)$. Algorithm 7 presents its stochastic counterpart.

Algorithm 6 Cyclic Convex Optimization Using Jensen Surrogates
Input: $x^{(0)} \in \mathbb{R}^N$, $H \in \mathbb{R}^{M \times N}$, $r \in \mathbb{R}^N_+$, $B_k^r$ for $k = 0, 1, \dots, (B_r - 1)$
for $n = 0, 1, 2, \dots$ do
    $k = \mathrm{mod}(n, B_r)$
    $x^{(n+1)} = \arg\min_{x \in X} g_{B_k^r,r}(x; x^{(n)}, x^{(n)})$
end

Algorithm 7 Stochastic Convex Optimization Using Jensen Surrogates
Input: $x^{(0)} \in \mathbb{R}^N$, $H \in \mathbb{R}^{M \times N}$, $r \in \mathbb{R}^N_+$, $B_k^r$ for $k = 0, 1, \dots, (B_r - 1)$
for $n = 0, 1, 2, \dots$ do
    Choose $k$ from $\{0, 1, \dots, (B_r - 1)\}$ randomly.
    $x^{(n+1)} = \arg\min_{x \in X} g_{B_k^r,r}(x; x^{(n)}, x^{(n)})$
end
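Algorithms 6 and 7 differ only in how the batch index is chosen, so both can be sketched with one loop. The code below is a minimal illustration under assumptions that are not part of the text: least-squares data terms $\tilde{f}_i(t) = (t - y_i)^2/2$, the unconstrained case $X = \mathbb{R}^N$, weights $r_{ij} = |h_{ij}| / \sum_{j'} |h_{ij'}|$, and no index-0 term. Under these assumptions the surrogate is a separable quadratic and its minimizer has the closed form used in `surrogate_step`; the function names are hypothetical.

```python
import numpy as np

def surrogate_step(x, H, r, y, batch):
    """Closed-form argmin over R^N of g_{B_k,r}(.; x, x) for least-squares terms."""
    Hb, rb, yb = H[batch], r[batch], y[batch]
    grad = Hb.T @ (Hb @ x - yb)            # gradient of the batch objective at x
    curv = np.sum(Hb**2 / rb, axis=0)      # per-coordinate curvature of the surrogate
    return x - grad / curv

def run(H, y, r, batches, n_iter, stochastic=False, seed=0):
    """Cyclic batch order (Algorithm 6) or uniformly random order (Algorithm 7)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(H.shape[1])
    for n in range(n_iter):
        k = rng.integers(len(batches)) if stochastic else n % len(batches)
        x = surrogate_step(x, H, r, y, batches[k])
    return x

rng = np.random.default_rng(0)
M, N, B_r = 40, 5, 4
H = rng.normal(size=(M, N))
y = H @ rng.normal(size=N) + 0.1 * rng.normal(size=M)   # hypothetical noisy data
r = np.abs(H) / np.abs(H).sum(axis=1, keepdims=True)    # assumed weight choice
batches = [np.arange(k, M, B_r) for k in range(B_r)]

x_cyclic = run(H, y, r, batches, n_iter=400)
x_random = run(H, y, r, batches, n_iter=400, stochastic=True)
f = lambda x: 0.5 * np.sum((H @ x - y)**2)
print(f(x_cyclic), f(x_random))
```

Each step touches only the rows of one batch, which is where the per-iteration saving over a full-data surrogate step comes from. Each step is guaranteed not to increase the objective of the batch it uses, but it may increase the objectives of the other batches.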

Proposition 7.1.1. For Algorithms 6 and 7, assume that $r$ is fixed for all iterations. Further assume that the gradients arising in the iterations are upper bounded by a constant $c$ in the Euclidean norm. Also, we assume that the function value attained at the minimum is finite. When $\mu_g \ge L_h$, where $L_h$ is defined in Lemma 5.0.4, we have

$$\liminf_{n \to \infty} f(x^{(n)}) \le f(x^*) + \frac{\beta (B_r)^2 c^2}{2 L_h}, \qquad (7.3)$$

where $\beta = 1/B_r + 4$.

Proof. This proposition relies heavily on [9]. One key part of the proof outlined there is Proposition 2.1(b). For Jensen surrogates with $\mu_g \ge L_h$, using Lemma 5.0.5, we have

$$f(\tilde{x}) + \frac{L_h}{2} \|x - \tilde{x}\|_2^2 \le f(\tilde{x}) + \frac{\mu_g}{2} \|x - \tilde{x}\|_2^2 \le f(x) + \frac{L_h}{2} \|x - \hat{x}\|_2^2, \qquad (7.4)$$

which can be written as

$$\|x - \tilde{x}\|_2^2 \le \|x - \hat{x}\|_2^2 - \frac{2}{L_h} \left(f(\tilde{x}) - f(x)\right), \qquad (7.5)$$

which is a special case of Proposition 2.1(b) in [9] with $\alpha = 1/L_h$. Then the bounds shown in [9] also hold for the Jensen surrogate type algorithm we presented here. Finally, the proposition is a direct consequence of Proposition 3.2 in [9].

This proposition is important because it only guarantees that, regardless of how many iterations we run, the function value comes within a positive constant of the minimum function value; it does not guarantee convergence to the minimum. This motivates forming incremental type algorithms that converge.

Now, we propose a new type of algorithm for Jensen surrogates which we call Stochastic Incremental Convex Optimization. This is a Jensen surrogate extension of the general algorithm proposed in [68]. The cyclic variant was proposed in [11]. Algorithm 8 presents Stochastic Incremental Convex Optimization Using Jensen Surrogates.

Algorithm 8 Stochastic Incremental Convex Optimization Using Jensen Surrogates
Input: $x^{(0)} \in \mathbb{R}^N$, $H \in \mathbb{R}^{M \times N}$, $r \in \mathbb{R}^N_+$, $x^{(0,k)} \in \mathbb{R}^N$, $B_k^r$ for $k = 0, 1, \dots, (B_r - 1)$
for $n = 0, 1, 2, \dots$ do
    Choose $k$ from $\{0, 1, \dots, (B_r - 1)\}$ randomly.
    $x^{(n,k)} = \arg\min_{x \in X} \sum_{k} g_{B_k^r,r}(x; x^{(n,k)}, x^{(n,k)})$
    $x^{(n+1,\tilde{k})} = x^{(n,\tilde{k})}$ for all $\tilde{k} \in \{0, 1, \dots, (B_r - 1)\}$
end

In an iteration, we update the parameters of the chosen surrogate function with the latest iterate $x^{(n)}$, and then minimize the sum of surrogate functions around their own estimates.
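The iteration just described can be sketched as follows: one anchor point is stored per batch, the anchor of the randomly chosen batch is refreshed with the latest iterate, and the sum of all surrogates is then minimized. This is only a sketch of that description under assumed conditions: least-squares data terms $\tilde{f}_i(t) = (t - y_i)^2/2$, $X = \mathbb{R}^N$, weights $r_{ij} = |h_{ij}| / \sum_{j'} |h_{ij'}|$, and no index-0 term, so that the sum of surrogates is a separable quadratic with a closed-form minimizer. The name `incremental_jensen` and the storage layout are hypothetical.

```python
import numpy as np

def incremental_jensen(H, y, r, batches, n_iter, seed=0):
    """Stochastic incremental surrogate minimization for least-squares terms."""
    rng = np.random.default_rng(seed)
    K, N = len(batches), H.shape[1]
    x = np.zeros(N)
    anchors = np.zeros((K, N))     # point around which each batch surrogate is built
    # Per-coordinate curvature of each batch surrogate (fixed, since r is fixed).
    curv = np.stack([np.sum(H[b]**2 / r[b], axis=0) for b in batches])
    # Batch gradients evaluated at the anchors.
    grads = np.stack([H[b].T @ (H[b] @ anchors[k] - y[b])
                      for k, b in enumerate(batches)])
    for _ in range(n_iter):
        k = rng.integers(K)
        b = batches[k]
        anchors[k] = x                            # refresh the chosen surrogate
        grads[k] = H[b].T @ (H[b] @ x - y[b])
        # Minimizer of sum_k [grads_k . (x - a_k) + 0.5 * curv_k * (x - a_k)^2].
        x = np.sum(curv * anchors - grads, axis=0) / np.sum(curv, axis=0)
    return x

rng = np.random.default_rng(1)
M, N, B_r = 40, 5, 4
H = rng.normal(size=(M, N))
y = H @ rng.normal(size=N) + 0.1 * rng.normal(size=M)   # hypothetical noisy data
r = np.abs(H) / np.abs(H).sum(axis=1, keepdims=True)    # assumed weight choice
batches = [np.arange(k, M, B_r) for k in range(B_r)]

x_inc = incremental_jensen(H, y, r, batches, n_iter=5000)
x_ls = np.linalg.lstsq(H, y, rcond=None)[0]
print(np.linalg.norm(x_inc - x_ls))
```

Compared with the cyclic and stochastic variants, the price is the extra storage of one anchor and one gradient per batch; in exchange, the stale surrogates keep the iterate from being pulled toward the minimizer of a single batch.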

Now, let us present the convergence analysis for this algorithm.

Proposition 7.1.2. (See Proposition 6.2 in [68]) Denoting $\delta = 1/B_r$ and assuming that $\mu_g \ge L_h$,

• If $f$ is convex, Algorithm 8 almost surely converges to the minimum with rate

$$E[f(x^{(n)}) - f(x^*)] \le \frac{L_h \|x^{(0)} - x^*\|_2^2}{2 \delta n}. \qquad (7.6)$$

• If $f$ is strongly-convex with constant $\mu_f$, then the algorithm surely converges to the minimum with rate

$$E[f(x^{(n)}) - f(x^*)] \le \left(1 - \delta + \delta \frac{L_h}{\mu_g + \mu_f}\right)^{n-1} \frac{L_h \|x^{(0)} - x^*\|_2^2}{2} \qquad (7.7)$$

for all $n \ge 1$.

Proof. See the proof of Proposition 6.2 in [68].

Algorithm 9 Stochastic Averaging Convex Optimization Using Jensen Surrogates
Input: $x^{(0)} \in \mathbb{R}^N$, $H \in \mathbb{R}^{M \times N}$, $r \in \mathbb{R}^N_+$, $x^{(0,k)} \in \mathbb{R}^N$, $B_k^r$ for $k = 0, 1, \dots, (B_r - 1)$
for $n = 0, 1, 2, \dots$ do
    Choose $k$ from $\{0, 1, \dots, (B_r - 1)\}$ randomly.
    $x^{(n,k)} = x^{(n)}$
    $x^{(n+1)} = \arg\min_{x \in X} \sum_{k} g_{B_k^r,r}(x; x^{(n,k)}, x^{(n)})$
    $x^{(n+1,\tilde{k})} = x^{(n,\tilde{k})}$ for all $\tilde{k} \in \{0, 1, \dots, (B_r - 1)\}$
end

Comparing Algorithm 9 with Algorithm 8, the only difference is that the surrogate functions have forward projected estimates for their own range estimates but are minimized as if they are around the last iterate. Roux et al. [94] proposed a rate of convergence analysis when the weighted sum of gradients is used as an update where the step size is equal to $1/(16 L_f)$. We do not have a rate of convergence proof for this algorithm, and this is left as future work. In the results section, it will be shown that the proposed method performs well for several applications.

In document A General Framework of Large-Scale Convex Optimization Using Jensen Surrogates and Acceleration Techniques (Page 79-86)