Testing and non-linear preconditioning of the proximal point method

(1)

arXiv:1703.05705v1 [math.OC] 16 Mar 2017

testing and non-linear preconditioning of the

proximal point method

Tuomo Valkonen∗

2017-03-16

Abstract Employing the ideas of non-linear preconditioning and testing of the classical proximal point method, we formalise common arguments in convergence rate and con-vergence proofs of optimisation methods to the verification of a simple iteration-wise in-equality. When applied to fixed point operators, the latter can be seen as a generalisation of firm non-expansivity or theα-averaged property. The main purpose of this work is to pro-vide the abstract background theory for our companion paper “Block-proximal methods with spatially adapted acceleration”. In the present account we demonstrate the effective-ness of the general approach on several classical algorithms, as well as their stochastic variants. Besides, of course, the proximal point method, these method include the gradi-ent descgradi-ent, forward–backward splitting, Douglas–Rachford splitting, Newton’s method, as well as several methods for saddle-point problems, such as the Alternating Directions Method of Multipliers, and the Chambolle–Pock method.

Get the version fromhttp://tuomov.iki.fi/publications, citations broken in this one due broken arXiv biblatex support.

1 introduction

The proximal point method for monotone operators [17,22], while infrequently used by itself, can be found as a building block of many popular optimisation algorithms. Indeed, many im-portant application problems can be written in the form

(P) min

x G(x)+F(Kx)

for convex non-smoothGandF, and a linear operatorK. Examples abound in image processing and data science. The problem (P) can often be solved by methods such as forward–backward splitting, ADMM (alternating directions method of multipliers) and their variants [2,16,11,6]. They all involve a proximal point step.

The equivalent saddle point form of (P) is

(S) min

x maxy G(x)+hKx,yi −F ∗₍_y₎_.

In particular within mathematical image processing and computer vision, a popular algorithm for solving (S) is the primal–dual method of Chambolle and Pock [6]. As discovered in [12], the

(2)

method can most concisely be written as apreconditioned proximal point method, solving on each iteration forui+1

=(xi+1,yi+1)the variational inclusion

(PP0) 0∈H(ui+1)+Mi+1(ui+1−ui),

where the monotone operator

H₍u₎:=

∂G₍x₎+K∗y

∂F∗₍y_{) −}Kx

encodes the optimality condition 0∈H(bu)for (S). In the standard proximal point method [22], one would takeMi+1 = I the identity. With this choice, (PP0) is generally difficult to solve. In the Chambolle–Pock method thepreconditioning operator is given for suitable step length parametersτi,σi₊1,θi >0 by

(1.1) Mi+1 :=

τ_i−1I −K∗ −θiK σi−+11I

.

This choice ofMi₊1decouples the primalxand dualyupdates, making the solution of (PP0) fea-sible in a wide range of problems. IfGis strongly convex, the step length parametersτi,σi+1,θi

can be chosen to yieldO(1/N2)convergence rates of an ergodic duality gap and the quadratic distance_kxi₋x_b_k2.

In our earlier work [25], we have modifiedMi+1 as well as the condition (PP0) to still allow a level of mixed-rate acceleration whenGis strongly convex only on sub-spaces. Our conver-gence proofs were based ontesting the abstract proximal point method by a suitable operator, which encodes the desired and achievable convergence rates on relevant subspaces.

In the present paper, we extend this theoretical approach to linear preconditioning, non-invertible step-length operators, and arbitrary monotone operatorsH. Our main purpose is to provide the abstract background theory for our companion paper [24]. Here, within these pages, we demonstrate that several classical optimisation methods—including the second-order New-ton’s method—can also be seen as variants of the proximal point method, and that their com-mon convergence rate and convergence proofs reduce to the verification of a simple iteration-wise inequality. Through application of our theory to Browder’s fixed point theorem [4] in Section 2.5, we see that our inequality generalises the concepts of firm non-expansivity or the

α-averaged property. Our theory also covers stochastic variants of the considered algorithms. InSection 2, we start by developing our theory for general monotone operatorsH. This ex-tends, simplifies, and clarifies the more disconnected results from [25] that concentrated on saddle-point problems with preconditioners derived from (1.1). We demonstrate our results on the basic proximal point method, gradient descent, forward–backward splitting, Douglas– Rachford splitting, and Newton’s method. The proximal step in forward–backward splitting and proximal Newton’s method can be introduced completely “free”, without any additional proof effort, in our approach.

(3)

setting, which allows our results to be used to study various stochastic block-coordinate de-scent methods. We refer to [26] for a review of this class of methods. In the companion paper [24], we will apply our results to stochastic primal-dual methods with coordinate-wise adapted step lengths.

Besides already cited works, other previous work related to ours includes that on generalised proximal point methods, such as [5,8], as well inertial methods for variational inclusions [15].

2 an abstract preconditioned proximal point iteration

2.1 notation and general setup

We useC(X)to denote the space of convex, proper, lower semicontinuous functions fromX to the extended realsR:=[−∞,∞], andL(X;Y)to denote the space of bounded linear operators

between Hilbert spacesX andY. We denote the identity operator byI. ForT,S _{∈ L(}X;X₎, we writeT _≥ S whenT ₋S is positive semidefinite. Also for possibly non-self-adjointT, we introduce the inner product and norm-like notations

(2.1) hx,z_iT := hT x,zi, and kxkT := p

hx,x_iT.

For a setA_⊂ _R, we writeA_≥ 0if every elementt _∈Asatisfiest _≥0.

Our overall wish is to find some_bu_∈U, on a Hilbert spaceU, solving for a given set-valued mapH :U ⇒U the variational inclusion

(2.2) 0∈H(ub).

In the presentSection 2,H will be arbitrary, but in Section 3, where we specialise the results, and inSection 4, where we consider gap estimates, we concentrate onHarising from the saddle point problem (S).

Our strategy towards finding a solution_buis to introduce an arbitrary non-linear iteration-dependentpreconditionerVi+1 :U →U and astep length operatorWi+1 ∈ L(U;U). With these,

we define the generalised proximal point method, which on each iterationi_∈_Nsolves forui+1

from

(PP) 0_∈Wi+1H(ui+1)+Vi+1(ui+1),

We assume thatVi₊1splits intoMi+1 ∈ L(U;U), andVi′+1 :U →U as

(2.3) Vi+1(u)=Vi′+1(u)+Mi+1(u−ui).

More generally, to rigorously extend our approach to cases that would otherwise involve set-valuedVi₊1, we also consider forHie+1:U ⇒U the iteration

(4)

2.2 basic estimates

We analyse (PP) and (PP∼) by applying atesting operatorZi+1 ∈ L(U;U), following the ideas

introduced in [25]. The productZi+1Mi+1with the linear part of the preconditioner, will, as we

soon demonstrate, be an indicator of convergence rates.

Theorem 2.1. On a Hilbert spaceU, let Hie+1 : U ⇒ U, andMi+1,Zi+1 ∈ L(U;U) fori ∈ N.

Suppose(PP∼₎_{is solvable, and denote the iterates by}_{_ui

}i∈N. IfZi+1Mi+1is self-adjoint, and

(CI∼₎ 1

2ku

i+1

−uikZ2i₊₁Mi₊₁ +

1 2ku

i+1

−buk2Zi₊₁Mi₊₁−Zi₊₂Mi₊₂

+hHei+1(ui+1),ui+1−buiZi+1 ≥ −∆i+1(ub)

for alli ∈Nand someu_b∈U, then

(DI) 1

2ku

N

−ub_k_ZN2

+1MN+1 ≤

1 2ku

0

−bu_k2_Z 1M1+

N−1 Õ

i=0 ∆_i

+1(ub) (N ≥1).

Corollary 2.2. On a Hilbert spaceU, letH : U ⇒ U. Also letZi₊1,Wi+1,Mi+1 ∈ L(U;U), and

V_i′

+1 :U →U fori∈N. Suppose(PP)is solvable forVi+1as in(2.3). Denote the iterates by{u

i }i∈N.

Let_bu ∈H−1(0). IfZi+1Mi+1 is self-adjoint, and

(CI) 1 2ku

i+1

−ui_k_Zi2 +1Mi+1+

1 2ku

i+1

−ub_k_Zi2

+1Mi+1−Zi+2Mi+2

+hWi+1[H(ui+1) −H(ub)]+Vi′+1(u

i+1₎_,_ui+1₋_b_u_i

Zi₊₁ ≥ −∆i₊1(ub),

for alli ∈N, then(CI∼₎_and _(DI)_{hold for}_Hi_e

+1(u):=Wi+1H(u)+Vi′+1(u).

For (PP), the condition (CI) is often more practical to verify than (CI∼) thanks to the additional structure introduced byH(bu) ∋0. Indeed, in many of our examples, we can eliminateHthrough monotonicity. To derive gap estimates inSection 4, we will however need (CI∼_).

Proof ofTheorem 2.1. Inserting (PP∼) into (CI∼), we obtain

(2.4) 1 2ku

i+1

−uikZi2₊1Mi+1 +

1 2ku

i+1

−bukZi2₊1Mi+1−Zi+2Mi+2

− hui+1₋_ui_,_ui+1₋_b_u_i

Zi₊₁Mi₊₁ ≥ −∆i₊1(bu).

We recall for general self-adjointM the three-point formula

(2.5) hui+1

−ui,ui+1

−buiM = 1

2ku

i+1

−uik2M−

1 2ku

i

−ubk2M+

1 2ku

i+1

−bukM2 .

Using this withM =Zi+1Mi+1, we rewrite (2.4) as

1 2ku

i

−bukZ2i₊₁Mi₊₁ −

1 2ku

i+1

−bukZ2i₊₂Mi₊₂ ≥ −∆i+1(bu).

(5)

Remark 2.3 (Bregman distances). The three-point formula(2.5)generalises to Bregman distances [8]. IfZi₊1=ϕi+1Ifor some scalarϕi+1, it is then easy to generaliseTheorem 2.1from

1

2k · kMi₊1 to

more general Bregman distances. While we do occasionally work withVi+1 arising as the gradient

of a more general Bregman distance, we will, however, not benefit from more generalMi+1.

The next two results demonstrate how the estimate of Theorem 2.1 can be used to prove convergence with or without rates.

Proposition 2.4 (Convergence with a rate). Suppose (DI) holds with ∆_i

+1(ub) ≤ 0, and that

ZN+1MN+1 ≥ µ(N)I. ThenkuN −buk2→0at the rateO(1/µ(N)).

Proof. Immediate from (DI).

Proposition 2.5 (Weak convergence). SupposeZiMi = Z0M0 ≥ 0is self-adjoint, and that the

iterates of (PP∼₎_satisfy_(CI∼₎_with∆_i

+1(ub) ≤ −δ2ku

i+1₋_ui_k2

Zi₊1Mi+1for allub∈

ˆ

U :={u ∈U |0∈

H₍u_)}and someδ >0. If (CL) Zi₊1Mi+1(u

i+1

−ui_{) →}0anduik ⇀u =⇒ lim sup

k→∞ e

Hi₊1(uik) ⊂W∗H(u)

for some non-singularW_∗∈ L(U;U), thenZ0M0(ui−u∗)⇀0weakly inU for someu∗satisfying

0_∈H₍u∗₎.

Thelim supdenotes the (strong) outer limit [?, see, e.g.,]]rockafellar-wets-va. For the proof, we use the next lemma. Its earliest version is contained in the proof of [18, Theorem 1].

Lemma 2.6 ([4, Lemma 6]). On a Hilbert spaceX, letXˆ _⊂X be closed and convex, and{xi_}i∈N ⊂

X. If the following conditions hold, thenxi ⇀x∗weakly inX for somex∗ ∈Xˆ: (i) i7→ kxi −x∗kis non-increasing for allx∗ ∈Xˆ.

(ii) All weak limit points of{xi}i∈N belong toXˆ.

Proof ofProposition 2.5. SinceZi₊1Mi+1−Zi+2Mi+2 ≤ 0, it is easy to see that (CI∼) and conse-quently (DI) holds for all_bu _∈ U′ := cl conv ˆU. We applyTheorem 2.1on anybu ∈ U′. Using

∆_i

+1(ub) ≤ −δ₂kui+1−uik2_Zi

+1Mi+1, we haveZi+1Mi+1(u

i+1₋_ui_{) →}_{0. By (PP}∼_{) and (CL), any weak} limit pointu∗of the sequence_{ui_}i∈Nthen satisfiesu∗∈Uˆ ⊂U′. SinceA:=Z0M0=Zi+1Mi+1,

this verifies condition(ii)of the lemma forxi _:

=A1/2uiandX′:=A1/2UˆonX :=A1/2U ⊂U.

Ap-plied withN =1anduiin place ofu0, (DI) shows condition(i)of the lemma. Thusxi ⇀x∗∈Xˆ.

Butx∗ = A1/2u∗ for someu∗ = Uˆ. ThusA(ui −u∗) ⇀ 0. This impliesZ0M0(ui −u∗) ⇀ 0

weakly.

2.3 examples of first-order methods

We now look at several concrete examples.

Example 2.1 (The proximal point method). TakeMi = I,Vi′ = 0, andWi+1 = τiI for some

(6)

maximal monotone,_{ui_}_i∈_Nconverges weakly to someu∗_∈H−1₍0₎.

Proof of convergence. We takeZi+1 =ϕiIfor someϕi > 0. As long asϕi ≥ϕi+1, the

monotonic-ity ofH clearly shows (CI) with∆_i

+1(bu) = −ϕi₂ kui+1 −uik2. Using the maximal monotonicity,

Minty’s theorem [?, e.g.,]Theorem21.1]bauschke2011convex guarantees the solvability of (PP). Thus the conditions ofCorollary 2.2are satisfied. Maximal monotonicity also guarantees thatH is weak-to-strong outer semicontinuous; seeLemmaa.1. This establishes (CL). Takingϕi ≡ϕ0 for constantϕ0> 0, so thatZi+1Mi+1 =Z0M0=ϕ0I, it remains to refer toProposition 2.5.

Example 2.2 (Accelerated proximal point method). Continuing fromExample 2.1, suppose

H is strongly monotone. ThenhH(ui+1_{) −}_H₍_b_u₎_,ui+1₋_u_b_{i ≥}_γ_k_ui+1₋_u_b_k2_{for some}_γ _>₀_{, so} (CI) continues to hold with∆_i

+1(ub)=−ϕ i 2 ku

i+1

−ui_k2ifϕi(1+2γτi) ≥ϕi+1. This is the case forτi+1 := τi/√1+2γτi, andϕi+1 := 1/τi2+1. The testing variableϕN is of the orderΘ(N

2

)

[6,25], so we get convergence ofkuN ₋_u_b_k2_{to zero at the rate}_O₍₁_/_N2₎_from_{Corollary 2.2} andProposition 2.4.

To facilitate the analysis algorithms with a proximal step, we introduce the following strength-ened version of (CI), assumed to hold for some∆_i

+1(u∗;u)at allu ∈Ui+1 ⊂U andu∗∈U:

(CI∗) 1

2ku−u

i

k2Zi₊1Mi+1+

1 2ku−u

∗_k2

Zi₊1Mi+1−Zi+2Mi+2

+hWi+1(H(u) −H(u∗))+Vi′+1(u),u−u

∗_i

Zi₊₁ ≥ −∆i₊1(u∗;u). Note that only the choiceu=ui+1 andu∗=ubimplies (CI∼) and thus convergence. The role of

the subsetUi+1 is to model a compatible range ofui+1 betweenH = AandH = A+Bin the next lemma. TypicallyUi+1 =U, but for the stochastic examples ofSection 4.5, we will need to make restrictions.

Lemma 2.7. LetA,B :U ⇒_U_{. Suppose}₍_CI∗₎_{holds for}_H ₌_A_{, and that}

(2.6) hB₍u_{) −}B₍u∗₎,u₋u∗_iZi+1Wi+1 ≥ 0, (u ∈Ui+1,u∗∈U).

Then(CI∗₎_{holds for}_H ₌_A₊_B_with_W

i+1,Mi+1,Zi+1,Vi′+1and∆i+1(u,u

∗₎_{unchanged. Moreover,}

ifvi+1 _solves₍_PP₎_for_H

=A, thenui+1 :=(I+Wi+1B)−1(vi+1)solves(PP)forH =A+B.

Proof. Using (2.6),Bis easily eliminated from (CI∗). The result is (CI∗) forH =A. The

relation-ship betweenvi+1 _and_ui+1 _{is immediate from expansion of (}_PP_).

The next lemma starts our analysis of gradient descent:

Lemma 2.8. LetH =∇GforG ∈ C(X)such that∇GisL-Lipschitz. TakeMi+1 ≡IandVi′+1(u):=

τi(∇G(ui) − ∇G(u))withWi+1 = τiI as well asZi+1 ≡ ϕiI for someτi,ϕi > 0. Then(CI∗)holds

withUi₊1 =U if

(7)

IfGis strongly convex with factorγ >0, alternatively:

(ii) τ0L2 <γ,ϕi+1 :=ϕi+ϕiτi(γ −τiL2),τi :=ϕi−1/2, and∆i+1(u∗;u)=0.

Moreover,Vi+1satisfies(CL)under the above constraints onτi.

Proof. The satisfaction (CL) is immediate from the continuity of∇Gand the boundedness ofτi. For the rest, we start by expanding the condition (CI∗_{) as}

(2.7) ϕi

2 ku−u

i

k2+ϕi−ϕi+1

2 ku−u

∗_k2

+ϕiτih∇G(ui) − ∇G(u∗),u−u∗i ≥ −∆i+1(u∗;u). (i)Lipschitz gradient impliesL−1-co-coercivity ([1], see alsoAppendixb)

(2.8) _h∇G₍u′_{) − ∇}G₍u₎,u′₋u_{i ≥}L−1_k∇G₍u′_{) − ∇}G₍u_)k2 for all u,u′. Now (2.7) follows after we use (2.8) and Cauchy’s inequality to estimate

(2.9) h∇G₍ui_{) − ∇}G₍u∗₎,u₋u∗_i₌ _h∇G₍ui_{) − ∇}G₍u∗₎,ui₋u∗_i

+h∇G(ui) − ∇G(u∗),u−uii ≥ −L

4ku−u

i k2.

(ii)We estimate

h∇G₍ui_{) − ∇}G₍u∗₎,u₋u∗_i= h∇G(u) − ∇G(u∗),u−u∗i+h∇G(ui) − ∇G(u),u−u∗i

≥ γ

2ku−u

∗_k2

− 1

2τiku−u i

k2−τiL

2

2 ku−u

∗_k2_.

Inserting this into (2.7), we see that (CI∗_{) holds with}∆_i

+1(u∗;u)=0if (2.10) ϕi+ϕiτi(γ −τiL2) ≥ϕi+1.

Clearly our choice of{τi}i∈Nis non-increasing. Therefore, (2.10) follows from the initialisation conditionτ0L2<γ and the update ruleϕi+1:=ϕi+ϕiτi(γ −τiL2).

Example 2.3 (Gradient descent). Takingτi =τ constant inLemma 2.8, (PP) reads

0=τ∇G(ui)+ui+1−ui.

This is the gradient descent method. Direct application ofLemma 2.8(i)withu =ui+1 and

u∗ =butogether withCorollary 2.2andProposition 2.5now verifies the well-known weak

convergence of the method whenτ L< 2. Observe thatVi₊1 =∇Qi+1for

Qi+1(u):=

1 2ku−u

i

k2+τ G(ui)+h∇G(ui),u−uii −G(u).

Each step of (PP) therefore minimises thesurrogate objective[9]

(8)

The functionQi+1 on one hand penalises long steps, and on the other hand allows longer steps when the local linearisation error is large. In this example,Qi+1 is, in fact, a Breg-man distance. Proximal point methods based on general BregBreg-man distances in place of the squared norm are studied in, e.g., [5,8,13,14].

Example 2.4 (Acceleration of gradient descent). Continuing fromExample 2.3, ifGis strongly

convex, we may use the acceleration scheme inLemma 2.8(ii). Similarly toExample 2.1,ϕN is of the orderΘ₍_N2₎_{. Therefore,}_{Corollary 2.2}_and_{Proposition 2.4}_{show the convergence} of kuN ₋_bu_k2to zero at the rateO₍1_/N2₎.

Example 2.5 (Forward–backward splitting). LetH = ∇G +∂F forG,F ∈ C(X) with∇G

Lipschitz. TakingMi+1,Wi+1, andVi′+1as inExample 2.3, (PP) becomes

0_∈τi∂F(ui+1)+τi∇G(ui)+ui+1−ui.

This is the forward–backward splitting method

ui+1 _:

=(I +τi∂F)−1(ui−τi∇G(ui)).

ByLemma 2.7, convergence and acceleration work exactly as for gradient descent inExamples 2.3 and2.4. IfF is strongly convex with factorγF, we can introduce the additional termγF₂ kui+1₋ b

u_k2to (2.7). This will improve (2.10) to allowϕi+1 :=ϕi+ϕiτi(γ +γF −τiL). Alternatively, it would be possible to chooseϕiandτi to yield FISTA-style acceleration [2].

Example 2.6 (Douglas–Rachford splitting). LetA,B:U ⇒_U _{be monotone operators.}

Con-sider the problem of findingu_bwith0_∈A₍_bu₎+B(ub). Forλ >0, let

H₍u,v₎:=

λB₍u₎+u−v

λA₍u₎+v−u

, Mi+1 :=

0 0 0 I

, and

e

Hi+1(u,v):=

λB(ui+1₎

+ui+1−vi

λA₍ui+1

+vi+1−vi)+vi−ui+1

. (2.12)

Then0 _∈A₍_bu₎+B(ub)if and only if0 ∈H(u,bvb), wherevb∈ (ub−λA(bu)) ∩ (bu+λB(ub)). The

algorithm (PP∼) becomes the Douglas–Rachford splitting [10]

ui+1 _:

=(I+λB)−1(vi),

vi+1 _:

=vi+(I+λA)−1(2ui+1−vi) −ui+1.

We work with (PP∼_{) since in (}_PP_),_V′

(9)

acceleration schemes under strong monotonicity [?, see, e.g.,]]bredies2016accelerated.

Proof of convergence. Writeu¯i _:

=(ui,vi₎_and_b_u_¯ _:

=(u,bvb). Observe that

ui+1₋_vi+1

=:qi+1 ∈λA(ui+1−vi+1−vi) and ub−vb=:bq ∈λA(bu).

Using the monotonicity ofAandB, withZi₊1:=I, we have

hHie+1(u¯i+1),Zi∗+1(u¯i+1−bu¯)i ⊂ hHie+1(u¯i+1) −H(bu¯),Zi∗+1(u¯i+1−bu¯)i

=λhB(ui+1) −B(ub),ui+1−ubi+λhqi+1−q,vb i+1−vbi +hui+1−vi,(ui+1−vi+1) − (bu−vb)i

=λhB(ui+1) −B(ub),ui+1−ubi+λhqi+1−q,bui+1+vi+1−vi −vbi ≥0.

Thus (CI∼) holds with∆_i₊₁₍ub¯) := −21ku¯i+1 −u¯ikZi2₊1Mi+1. Using (2.12) and the weak-to-strong

outer semicontinuity ofA andB (seeLemmaa.1), we easily verify (CL). Weak convergence

now follows fromTheorem 2.1andProposition 2.5.

2.4 examples of second-order methods

Lemma 2.9. LetH =∇GforG ∈C2(U). Take

Vi+1(u):=∇2G(ui)(u−ui)+∇G(ui) − ∇G(u), and Wi+1:=I

If∇2_G₍_u∗₎ _> ₀_{, then}₍_CI∗₎_{holds for}_ui _{close enough to}_u∗ _with∆_i

+1(u,u∗) = 0andZNMN =

κN_∇2G₍u∗₎for someκ >1.

Proof. We setMi+1 := ∇2G(u∗)andZi+1 :=ϕiI for someϕi > 0. ThenG ∈C2(X) implies that Zi+1Mi+1 =ϕi∇2G(u∗)is self-adjoint. The condition (CI∗) reads

(2.13) 1

2ku−u

i

kϕi2 ∇2_G₍_u∗₎+

1 2ku−u

∗_k2

(ϕi−ϕi₊1)∇2G(u∗)+ϕiDi+1 ≥ −

∆_i

+1(u∗;u), where

Di+1 :=h∇G(ui) − ∇G(u∗)+(∇2G(ui) − ∇2G(u∗))(u−ui),u−u∗i. By the fundamental theorem of calculus, there existsζibetweenui andu∗with

Di+1 = h∇2G(ζi)(ui−u∗),u−u∗i+h(∇2G(ui) − ∇2G(u∗))(u−ui),u−u∗i.

Using the three-point formula (2.5) and Cauchy’s inequality we therefore obtain

Di₊1= 1

2ku−u

∗_k2 2∇2_G₍_ui

)−∇2_G₍_u∗₎−

1 2ku−u

i

k2_∇2_G₍_u∗₎+

1 2ku

i

−u∗_k2_∇2_G₍_u∗₎

+h[∇2G(ζi) − ∇2G(ui)](ui−u∗),u−u∗i

≥ 1

2ku−u

∗_k2 2∇2_G(ui

)−∇2_G(u∗_)−Ai −

1 2ku−u

i

k2_∇2_G(u∗₎

for

(10)

Inserting this estimate into (2.13), we deduce that we can take∆_i

+1(u∗;u)=0if

2ϕi∇2G(ui) −ϕi+1∇2G(u∗) ≥ϕiAi. SinceG _∈ C2₍U₎, and _∇2_G

(u∗₎ > 0, locally nearu∗, we can ensure Ai _≤ ϵ_∇2G₍ζi+1

) and

∇2G₍ui_{) ≥ [}κ_/2+ϵ/2]∇2G(u∗)for someκ >1andϵ >0. Thus it remains to satisfy

(1+ϵ)κϕi−ϕi+1 ≥ϕiϵκ.

This holds whenϕi+1 =κϕi. Takingϕ0=1, thusZNMN ≥κN∇2G(u∗).

Example 2.7 (Newton’s method). SupposeH =∇GforG ∈C2(U). TakeVi+1andWi+1as in

Lemma 2.9. Then (PP) reads

0₌_∇G₍ui₎₊_∇2G₍ui₎₍ui+1

−ui₎.

This is Newton’s method. ByLemma 2.9,Corollary 2.2, andProposition 2.4, we obtain linear convergence if∇2_G₍_u_b₎_> ₀_.

Observe that nowVi₊1(u)is the gradient of

Qi₊1(u):=G(ui)+h∇G(ui),u−uii+ 1

2ku−u

i k2_∇2_G₍_ui

)−G(u).

In the surrogate objective (2.11), this allows longer steps when the second-order Taylor ex-pansion under-approximates, and forces shorter steps when it over-approximates.

Example 2.8 (Proximal Newton’s method). Similarly toExample 2.5, letH = ∇G +∂F for

G∈C2(X), andF ∈ C(X). TakingMi₊1,Wi₊1, andV_i′₊₁ as inLemma 2.9, (PP) becomes

0_∈∂F₍ui+1

)+∇G(ui)+∇2G(ui)(ui+1−ui).

This is the proximal Newton’s method [?, see, e.g.,]]lee2014proximal

ui+1 _:

=(I+[∇2G(ui)]−1∂F)−1(ui− [∇2G(ui)]−1∇G(ui)),

where(I +A−1∂F)−1(v) solvesminu 1₂ku −vk_A2 + F(u). By Lemma 2.7, convergence and

acceleration work exactly as for Newton’s method inExample 2.7.

2.5 connections to fixed point theorems

We demonstrate connections of our approach to established fixed point theorems.

Example 2.9 (Browder’s fixed point theorem [4]). LetT : U _→U beα-averaged, that is

(11)

b

u=T(bu). Thenui⇀u∗for some fixed pointu∗ofT.

Proof of Browder’s fixed point theorem. Let us setH(u) :=T(u) −u, as well asZi+1 := Wi+1 :=

Mi₊1 :=IandVi′+1(u):=T(ui)+ui−T(u) −u. We have

(2.14) Hei+1(ui+1):=Wi+1H(ui+1)+Vi′+1(ui+1)=T(ui)+ui −2ui+1 =ui−ui+1,

where the last step follows by observing from the previous steps that (PP) saysui+1

= T(ui).

The expression (2.14) easily gives (CL), and reduces (CI∼) to

1 2ku

i+1

−ui_k2+hui−ui+1,ui+1−bui ≥ −∆i+1(bu).

Usingui+1

=T(ui)andbu=T(bu), and takingβ >0, (CI∼) therefore holds for

(2.15) ∆_i

+1(bu)=

α+2β−1

2₍1₋α₎ ku

i+1₋_ui_k2

provided

0_≤D:= β

1₋αkT(u i

) −ui_k2+hui−bu− (T(ui) −T(ub)),T(ui) −T(bu)i.

Using theα-averaged property and_bu= J(ub), we expand

D

1₋α =βkJ(u i

) −ui_k2+hui−ub−J(ui)+J(ub),(1−α)(J(ui) −J(ub))+α(ui −bu)i

=(α+β)kui−ubk2+(β+α−1)kJ(ui) −J(ub)k2− (2α+2β−1)hJ(ui) −J(ub),ui−ubi.

We takeβ :=max{0,1/2−α}. Then2α+2β ≥ 1. Cauchy’s inequality and non-expansivity of

J thus give

D

1₋α ≥

1 2ku

i

−buk2− 1

2kJ(u

i

) −J(bu)k2 ≥0.

This verifies (CI∼). From (2.15),∆_i

+1(bu) ≤ −1₂min{1,α/(1−α)}kui+1−uik2. We now obtain the

claimed convergence fromCorollary 2.2andProposition 2.5.

Remark 2.10. The preconditionerVi+1(u) = T(ui) −T(u) is aT-based “distance”, which is not

obviously a Bregman distance.

3 saddle point problems

WithK ∈ L(X;Y),G∈ C(X)andF∗ ∈ C(Y)on Hilbert spacesX andY, we now wish to solve (S). The first-order necessary optimality conditions can be written

(12)

SettingU :₌ X _×Y and introducing the variable splitting notationu ₌ ₍x,y₎,u_b₌ ₍_bx,_by₎, etc., this succinctly be written as0∈H(bu)in terms of the operator

(3.1) H₍u₎ :₌

∂G₍x₎+K∗y

∂F∗(y) −Kx

.

In this section, concentrating on this specificH, we specialise the theory ofSection 2.2to saddle point problems. Throughout, for some primal and dual step length and testing operatorsTi,Φ_i _∈

L(X;X₎, andΣ_i

+1,Ψi+1 ∈ L(Y;Y), we take

(3.2) Wi+1 :=

Ti 0

0 Σ_i +1

, and Zi+1 :=

Φ_i ₀

0 Ψ_i +1

.

To work with arbitrary step length operators, which will be necessary for stochastic algo-rithms inSection 4.5, as well as the partially accelerated algorithms of [25], we will need abstract forms of partial strong monotonicity ofGandF∗. As a first step, we take subsets of operators

T ⊂ L(X;X₎, and _{S ⊂ L(}Y;Y₎.

We suppose that∂Gispartially (strongly)_T-monotone, which we take to mean

(G-PM) h∂G₍x′_{) −}∂G₍x₎,x′₋x_i_T_e _{≥ k}x′₋x_k_e2

TΓ, (x,x

′ _∈_X_;_T_e_{∈ T )}

for some linear operator0 _≤ Γ _{∈ L(}_X_;_X₎_{. The operator}_Te _{∈ T} _{acts as a testing operator.} Similarly, we assume that∂F∗isS-monotonein the sense

(F∗-PM) h∂F∗₍y′_{) −}∂F∗₍y₎,y′₋y_i_eΣ≥ 0 (y,y′∈Y;eΣ∈ S).

AssumingGto satisfy (G-PM) forΓ_and_F∗_{to satisfy (}_F∗_-PM_{), we also introduce}

Ξ_i₊₁₍Γ₎_:₌

2TiΓ 2TiK∗ −2Σ_i

+1K 0

,

which is an operator measure of strong monotonicity ofH.

Example 3.1 (Block-separable structure, monotonicity). LetP1, . . . ,Pm be projection

opera-tors inX withÍm_j₌₁Pj = IandPjPi = 0ifi, j. SupposeG1, . . . ,Gm ∈ C(X)are (strongly) convex with factorsγ1, . . . ,γm ≥0. Then (G-PM) holds withΓ=Ímj=1γjPj for

G₍x₎₌ m Õ

j=1

Gj(Pjx), and T =

T :₌Õ

j∈S tjPj

tj >0,S ⊂ {1, . . . ,m}

. (3.3)

3.1 estimates

(13)

Theorem 3.1. Let us be givenK _{∈ L(}X;Y₎,G _{∈ C(}X₎, andF∗ _{∈ C(}Y₎on Hilbert spacesX and

Y. SupposeGsatisfies (G-PM)for some0 ≤ Γ _{∈ L(}_X_;_X₎_{. For each}_i _∈ _N_{, let}_Ti,Φ_i _{∈ L(}_X_;_X₎

andΣ_i

+1,Ψi+1 ∈ L(Y;Y)be such thatΦiTi ∈ T. Also takeVi′+1 :X ×Y →X ×Y, andMi+1 ∈

L(X _×Y;X _×Y₎. Let H given by (3.1),Zi+1 andWi+1 by (3.2), andVi+1 by (2.3). Suppose(PP)

is solvable, and denote the iterates byui = (xi,yi). Then(CI),(CI∼)and (DI)hold ifZi+1Mi+1 is self-adjoint, and foreΓ₌Γ_{we have}

(CI-Γ₎ 1 2ku

i+1₋_ui_k2

Zi+1Mi+1 | {z }

step length in local metric + 1

2ku

i+1₋_b_u_k2

Zi₊1(Ξi₊₁(eΓ)+Mi₊1)−Zi+2Mi+2 | {z }

linear preconditioner update discrepancy +h∂F∗(yi+1) −∂F∗(by),yi+1

−by_iΨi₊₁Σi₊₁ | {z }

variably useful remainder fromH

+hVi′+1(ui+1),ui+1−ubiZi₊1 | {z }

from non-linear preconditioner

≥ −∆_i +1(ub).

Proof. First of all, we observe that (CI-Γ_{) implies}

(3.4) 1

2ku

i+1

−ui_k+ 1

2ku

i+1

−ub_k_Zi2

+1(Ξi₊₁(0)+Mi+1)−Zi+2Mi+2

+h∂G(xi+1) −∂G(bx),xi+1−bxiΦ_iTi ₊h∂F∗(yi+1) −∂F∗(by),yi+1−byiΨi₊₁Σi₊₁

+hVi′+1(ui+1),ui+1−ubiZi₊1 ≥ −∆i+1(ub). Here pay attention to the fact that (3.4) employsΞ_i

+1(0) while (CI-Γ) employsΞi+1(eΓ). If we show that (CI) follows from (3.4), then (CI∼_{) and (}_DI_{) follow from}_{Corollary 2.2}_{. Indeed, using}

the expansion

Zi+1Wi+1=

Φ_iTi ₀

0 Ψ_i +1Σi+1

,

we expand for any_eu=(xe,ey)that

hZi+1Wi+1(H(ui+1) −H(eu)),ui+1−eui

=h∂G(xi+1) −∂G(ex),xi+1−exiΦiTi +h∂F∗(y i+1

) −∂F∗₍_ey₎,yi+1

−ey_iΨi₊₁Σi₊₁

+hΦiTiK∗(yi+1−ye),xi+1−exi − hΨi+1Σi+1K(x

i+1

−ex),yi+1

−eyi. With the help ofΞ_i

+1(0)we then obtain

hH₍ui+1_{) −}_H₍_u_e₎_,_ui+1₋_u_e_i

Zi₊₁Wi₊₁ ≥

1 2ku

i+1₋_u_e_k

Zi+1Ξi₊₁(0)

+h∂G(xi+1) −∂G(ex),xi+1−exiΦ_iTi ₊h∂F∗(yi+1) −∂F∗(ye),yi+1−eyiΨi₊₁Σi₊₁.

Inserting this into (3.4), we obtain (CI).

3.2 examples of primal–dual methods

(14)

Example 3.2 (The primal–dual method of Chambolle and Pock [6]). This method consists of iterating the system

xi+1 _:

=(I+τi∂G)−1(xi−τiK∗yi), (3.5a)

¯

xi+1 _:

=ωi(xi+1−xi)+xi+1, (3.5b)

yi+1 _:

=(I+σi+1∂F∗)−

1

(yi+σi+1Kx¯

i+1

). (3.5c)

In the basic version of the algorithm,ωi = 1,τi ≡τ0 > 0, andσi ≡ σ0 > 0, assuming the step length parameters to satisfy

(3.6) τ0σ0kKk2<1.

The iterates convergence weakly, and the method hasO₍1_/N₎rate for the ergodic duality gap, to which we will return inSection 4. IfGis strongly convex with factorγ, we may take e

γ _{∈ (}0,γ_], and accelerate

(3.7) ωi :=1/p1+2eγτi, τi+1 :=τiωi, and σi+1 :=σi/ωi.

This yieldsO₍1_/N2₎convergence ofkxN ₋_bx_k2to zero.

Proof of convergence of iterates. We formulate the method in our proximal point framework fol-lowing [25,12] by taking as the preconditioner

Mi+1 =

I ₋τiK∗ −σiK I

and V_i′₊₁ =0.

As the step length and testing operators we takeTi =τiI,Σi+1 = σi+1I,Φi =ϕiI,Ψi+1 =ψi+1I. We also writeeΓ_:₌_e_{γ I}_{. Taking}∆_i₊₁₍_u_b₎_:₌₋1

2kui+1−uikZi2+1Mi+1, we reduce (CI-Γ) to

(3.8) 1

2ku

i+1₋_u_b_k2

Di+2 ≥0 for Di+2:=Zi+1(Ξi+1(eΓ)+Mi+1) −Zi+2Mi+2.

We may expand

Zi+1Mi+1 =

ϕiI −ϕiτiK∗ −ψi+1σiK ψi+1I

, and

(3.9a)

Di+2=

(ϕi(1+2γτie ) −ϕi+1)I (ϕiτi+ϕi+1τi+1)K∗

(ψi+2σi+1−2ψi+1σi+1−ψi+1σi)K (ψi+1−ψi+2)I

. (3.9b)

We have_k · kDi₊₂ =0(but notDi₊2=0, as the former depends on the off-diagonals cancelling out), andZi+1Mi+1is self-adjoint, if for some constantψ we take

(3.10) ϕi+1 :=ϕi(1+2γτei), τi :=ϕi−1/2, σi :=ϕiτi/ψ, and ψi+1 :=ψ. This gives the acceleration scheme (3.7). Moreover, for anyδ _{∈ (}0,1₎holds

(3.11) Zi+1Mi+1 ≥

δϕiI 0

0 ψ I− (1−δ)−1_KK∗

(15)

ThusZi+1Mi+1 ≥ 0ifψ ≥ (1−δ)−1kKk2. By (3.10),σiτi =1/ψ. Since this fixes the ratio ofσi to τi, we need to takeψ :=1/(σ0τ0)as well asδ := 1−σ0τ0kKk2. Through the positivity ofδ, we recover the initialisation condition (3.6).

Theorem 3.1andProposition 2.5show weak convergence of the iterates without a rate. IfG is strongly convex with factorγ _≥ 0, so that also_eγ > 0, the results in [6,25] show thatτN is of the orderO₍1_/N₎, and consequentlyϕN is of the orderΘ(N2). ByProposition 2.4,kxN−bxk2

converges to zero at the rateO₍1_/N2₎.

Example 3.3 (Alternating Directions Method of Multipliers, briefly). The classical ADMM

[11] and Douglas–Rachford splitting [10] are known to be related to the Chambolle–Pock method; in fact the Chambolle–Pock method is a preconditioned ADMM [6]. From [3, Sec-tion 5], we can deduce that compared to the Chambolle–Pock method, the ADMM merely has the sign ofK reversed in

Mi+1=

I τiK σiK I

.

Takingτi = τ0 andσi = σ0 constant and satisfying (3.6), the iterates converge weakly. Acceleration can provideO(1/N)convergence ofkxN −bxk2_.

Proof of convergence. FollowingExample 3.2, we now expand

Di+2=

(ϕi(1+2γτei) −ϕi+1)I (3ϕiτi−ϕi+1τi+1)K∗

(ψi₊1σi−2ψi+1σi+1−ψi+2σi+1)K (ψi+1−ψi+2)I

.

This time k · kDi+2 =0andZi+1Mi+1is self-adjoint if we take

(3.12) ϕi+1 :=ϕi(1+2eγ τi), τi+1 :=τiϕi/ϕi+1, σi:=ϕiτi/ψ, and ψi+1 :=ψ.

Ifγ_e=0, which corresponds to the standard ADMM with fixed step lengths, it is easy to retrace

the steps ofExample 3.2to prove weak convergence (without a rate). If_eγ ,0, we obtainϕN₊1 =

ϕN +2γ τeN−1ϕN−1 = ϕN +2γτe0ϕ0 = ϕ0+2Neγτ0ϕ0. Therefore, the acceleration scheme (3.12)

only gives the rateO₍1_/N₎.

Example 3.4 (Chambolle–Pock with a forward step). SupposeG=G0+JwithG(strongly)

convex with factorγ ≥ 0, and∇J Lipschitz with factorL. (J does not have to be convex.) In [7], the Chambolle–Pock method was extended to take forward steps with respect to J. With everything else as inExample 3.2, takeV_i′

+1(u) := (τi(∇J(x

i

) − ∇J₍x₎₎,0₎. Then (PP) can be rearranged as

xi+1 _:

=(I+τi∂G0)−1(xi−τi∇J(xi) −τiK∗yi), (3.13)

¯

xi+1 _:

=ωi(xi+1−xi)+xi+1, (3.14)

yi+1 _:

=(I+σi+1∂F∗)−

1

(yi+σi+1Kx¯

i+1

). (3.15)

(16)

update rules (3.7), and initialiseτ0,σ0>0subject to (3.6), and

(3.16) 0<θ :=1−Lτ0/(1−τ0σ0kKk2).

Proof of convergence. WithDi+2as in (3.8), the condition (CI-Γ) becomes

(3.17) 1

2ku

i+1

−ui_k_Zi2 +1Mi+1+

1 2ku

i+1

−ub_k_Di2

+2 +τiϕih∇J(x i

) − ∇J₍_bx₎,xi+1

−bx_{i ≥ −}∆_i +1(bu). The rules (3.10) force_k · kDi₊2 =0. Applying the estimate (2.9) to J, (3.17) becomes

1 2ku

i+1

−uikZi2₊1Mi+1− τiϕiL

4 kx

i+1

−xik2≥ −∆_i₊₁₍_u_b₎.

We take∆_i

+1(ub)=−θ₂kui+1−uik_Zi2

+1Mi+1for someθ > 0, and deduce using Cauchy’s inequality

that this condition holds if

(1₋θ₎Zi+1Mi+1 ≥τiϕiL

I 0 0 0

.

Recalling (3.11), this is true if(1₋θ₎δϕi ≥ τiϕiLandψ ≥ (1−δ)−1ϕiτi2kKk2. Further recalling (3.10), and observing that{τi}is non-increasing, we only have to satisfy(1−θ)(1−τ0σ0kKk2) ≥

Lτ0. Otherwise put, we obtain (3.16).

Example 3.5 (GIST). SupposeG₍x₎₌ ₂1_kf ₋Ax_k2_,_k_A_k_< √₂_{, and}_k_K_{k ≤}₁_{. Take}

V_i′₊₁(u):=

∇G₍xi_{) − ∇}G₍x₎

0

, and Mi₊1 :=

I 0 0 I−KK∗

.

WithTi :=IandΣi+1 :=I, we then obtain the Generalised Iterative Soft Thresholding (GIST)

algorithm of [16]

yi+1 _:

=(I+∂F∗)−1((I−KK∗)yi+K(xi − ∇G(xi))),

xi+1 _:

=xi− ∇G(xi) −K∗yi+1.

The iterates_{xi_}_i∈_Nconverge weakly to_bx.

Proof of convergence. ClearlyZi+1Mi+1 is positive semi-definite self-adjoint. AlsoG satisfies (G-PM) withΓ₌_A∗_A_{. If we take}Φ_i ₌_I _andΨ_i

+1=I, then

Di+2 :=Zi+1(Ξi+1(eΓ)+Mi+1) −Zi+2Mi+2=

2A∗A 2K∗ −2K 0

.

Thus 1 2kuk

2

Di₊₂ = kxk 2

A∗_A. Eliminating∂F∗by monotonicity, (CI-Γ) thus holds if

1 2ku

i+1

−ui_k_Zi2

+1Mi+1 +kx i+1

(17)

Expanding and usingkK_k<1, we see this to hold when

1 2kx

i+1

−xi_k2+kxi+1−bxkA2∗A+hA∗A(xi −xi+1),xi+1−bxi ≥ −∆i+1(ub).

Our assumption kA_k < √2guarantees ₂1(A∗A₎2 < A∗A. Cauchy’s inequality therefore shows that we can take∆_i₊₁ ₌₋c

2kxi+1−xik2for somec >0. UsingTheorem 3.1andProposition 2.4,

we obtain weak convergence.

4 the ergodic duality gap and stochastic methods

We now study the extension of the testing approach ofSection 2.2to produce the convergence of an ergodic duality gap. Throughout this section, we are in the saddle point setup ofSection 3. In particular,H is as in (3.1), and the step length and testing operatorsWi+1andZi+1as in (3.2).

4.1 preliminary gap estimates

Our first lemma demonstrates how to obtain a “preliminary” gap_G′

i+1(u) fromH. If the step

lengths and tests are scalar,Ti = τiI, andΦi =ϕiI, etc., and satisfyτiϕi = σiψi+1, it is easy to

bound this preliminary gap from below byτiϕi times the conventional duality gap

(4.1) G(x,y₎:= G(x)+hyb,Kx_{i −}F₍y_b₎₋ G₍_bx₎+hy,K_bx_{i −}F∗₍y₎.

To do the same for more general step length operators, we will inSection 4.2introduce abstract notions of convexity that incorporate ergodicity and stochasticity.

Lemma 4.1. Let us be givenK _{∈ L(}X;Y₎,G _{∈ C(}X₎, andF∗ _{∈ C(}Y₎on Hilbert spacesX andY.

For eachi ∈N, letTi,Φ_i _{∈ L(}_X_;_X₎_andΣ_i

+1,Ψi+1 ∈ L(Y;Y). Then for anyeΓ ∈ L(X;X),

(4.2) hH(ui+1

),ui+1

−ubiZi₊1Wi+1 =Gi′+1(ui+1;eΓ)+

1 2ku

i+1

−ubk_Zi2 +1Ξi₊₁(eΓ)

,

where the “preliminary gap”

Gi′+1(u;eΓ):=h∂G(x),x−bxiΦiTi − kx−bxk 2

Φ_iTieΓ+h∂F

∗₍_y₎_,_y₋_b_y_i Ψi₊₁Σi₊₁

− hby,₍KT_i∗Φ∗_i ₋Ψ_i

+1Σi+1K)xbi − hy,Ψi+1Σi+1Kbxi+hyb,KTi∗Φ∗ixi.

Proof. Similarly to the proof ofTheorem 3.1, we have

hH₍ui+1

),ui+1

−bu_iZi₊1Wi+1 = h∂G(x i+1

),xi+1

−bx_iΦiTi +hΦiTiK∗yi+1,xi+1−bxi +h∂F∗(yi+1),yi+1−byiΨi₊₁Σi₊₁− hΨi₊1Σi₊1Kx

i+1_,_yi+1

(18)

A little bit of reorganisation gives (4.2). Indeed

hH₍ui+1

),ui+1

−bu_iZi₊1Wi+1 = h∂G(x i+1

),xi+1

−bx_iΦ_iTi − kxi+1−bxk2 ΦiTieΓ

+h∂F∗(yi+1),yi+1−byiΨi₊₁Σi₊₁+kx i+1

−xbkΦ2iTieΓ

+hyi+1−y,b(KTi∗Φ∗i −Ψi+1Σi+1K)(xi+1−bx)i

− hby,(KT_i∗Φ∗_i ₋Ψ_i₊₁Σ_i₊₁_K₎_b_x_i

− hyi+1_,_Ψ

i+1Σi+1Kbxi+hby,KTi∗Φ∗ixi+1i

= Gi′+1(ui+1;eΓ)+

1 2ku

i+1

−buk_Zi2 +1Ξi₊₁(eΓ)

.

The next lemma extendsTheorem 3.1to estimate the preliminary gap.

Lemma 4.2. Let us be givenK _{∈ L(}X;Y₎,G _{∈ C(}X₎, andF∗_{∈ C(}Y₎on Hilbert spacesX andY.

For eachi_∈_N, letTi,Φi ∈ R(L(X;X))andΣi+1,Ψi+1∈ R(L(Y;Y)), as well asV_i′₊₁ ∈ R(X×Y → X ×Y)andMi+1 ∈ R(L(X×Y;X ×Y)). LetH given by(3.1),Zi+1andWi+1by(3.2), andVi+1by (2.3). Suppose(PP)is solvable, and denote the iterates byui =(xi,yi). IfZi+1Mi+1 is self-adjoint,

and

(CI-_G) 1

2ku

i+1

−ui_k_Zi2 +1Mi+1+

1 2ku

i+1

−bu_k2

Zi₊1(Ξi₊₁(eΓ)+Mi₊1)−Zi+2Mi+2

+hVi′+1(ui+1),ui+1−buiZi₊1 ≥ −e∆i+1(ub)

for someeΓ _{∈ L(}_X_;_X₎_{, then}

(4.3) 1

2ku

N

−bu_k_ZN2

+1MN+1 + N−1 Õ

i=0

Gi′+1(u

i+1_;e_Γ_{) ≤} 1 2ku

0

−bu_k_Z2 1M1 +

N−1 Õ

i=0 e

∆_i

+1(bu) (N ≥1).

Proof. Inserting (4.2) into (CI-G) proves (CI∼) for∆_i

+1(ub) := e∆i+1(ub) − G_i′₊₁(ui+1;eΓ). Now we

useTheorem 2.1.

The problem with the aboveLemma 4.2 is that it loses∂F from the condition (CI-G) com-pared to (CI-Γ_{). Thus (}_CI-_G_{) can be more difficult to satisfy for particular preconditioners that} are related to∂F, such as the forward–backward splitting inExample 2.5. Fortunately, there is a remedy: to study a one-sided gap that provides no indication of the convergence of the dual variable.

Lemma 4.3. Let us be givenK ∈ L(X;Y),G ∈ C(X), andF∗∈ C(Y)on Hilbert spacesX andY.

For eachi_∈_N, letTi,Φi ∈ R(L(X;X))andΣi+1,Ψi+1∈ R(L(Y;Y)), as well asVi′+1 ∈ R(X×Y → X _×Y₎andMi+1 ∈ R(L(X×Y;X ×Y)). LetH given by(3.1),Zi+1andWi+1by(3.2), andVi+1by (2.3). Suppose(PP)is solvable, and denote the iterates byui =(xi,yi). IfZi+1Mi+1 is self-adjoint,

and(CI-Γ₎_{holds for some}eΓ _{∈ L(}_X_;_X₎_{, then}

(4.4) 1

2ku

N

−ub_k_ZN2

+1MN+1 + N_Õ−1

i=0

Gi′+1(x

i+1_,_b_y_;_e_Γ

) ≤ ₂1ku0₋_bu_k_Z2 1M1 +

N−1 Õ

i=0 ∆_i

(19)

Proof. Let us write(Hx(u),Hy(u)):=H(u). Then0∈Hy(by). We may thus expand

Gi′+1(ui+1;eΓ)= Gi′+1(ui+1;eΓ) − hHy(yb),yi+1−ybiΨi₊₁Σi₊₁

= Gi′+1(ui+1;eΓ) − h∂F∗(by),yi+1−byiΨi₊₁Σi₊₁ +hΨi+1Σi+1K∗xb,yi+1−byi

= Gi′+1(x

i+1_,_b_y_;e_Γ₎

+h∂F∗(yi+1) −∂F∗(by),yi+1−byiΨi₊₁Σi₊₁. Inserting this into (4.2), we obtain

hH(ui+1

),ui+1

−ubiZi₊1Wi+1 =Gi′+1(xi+1,by;eΓ)+

1 2ku

i+1

−ubk_Zi2 +1Ξi₊₁(eΓ)

+h∂F∗(yi+1) −∂F∗(by),yi+1−byiΨi₊₁Σi₊₁. (4.5)

We writee∆_i

+1 for the ∆i+1 for which (CI-Γ) holds. Inserting (4.5) into (CI-Γ) proves (CI∼) for ∆_i

+1(ub):=e∆i+1(ub) − Gi′+1(x

i+1_,_b_y_;eΓ₎_{. The rest follows from}_{Theorem 2.1}_.

4.2 conversion of preliminary gaps to ergodic gaps

The “preliminary gaps” are not as such very useful. To go further, the abstract monotonicity assumptions (G-PM) and (F∗_-PM_{) are not enough, and we need analogous convexity}

formula-tions. We formulate these conditions directly in the stochastic setting. Towards this end we introduce the following notation:

Definition 4.1. We writex _{∈ R(}X₎ifT is an_T-valued random variable:x : Ω _→_X _{for some}

(in the present work fixed) probability space(Ω,_O), whereOis aσ-algebra onΩ_{. We denote} byEthe expectation with respect to a probability measurePon Ω_{. As is common, we abuse} notation and writex =x(ω)for the unknown random realisationω ∈Ω.

We refer to [23] for more details on measure-theoretic probability. From now on, we assume for allN _≥1that wheneverTie₍:=ΦiTi) ∈ R(T )andxi+1 ∈ R(X)for eachi=0, . . . ,N−1with

ÍN−1

i=0 E[Tei]=I, then for some0 ≤Γ ∈ L(X;X)holds

(G-EC) G

N−1 Õ

i=0

E[Te_i∗xi+1

] !

−G(bx) ≥ N−1 Õ

i=0

Eh∂G(xi+1

),xi+1

−bxiTie +

1 2kx

i+1

−bxk_Ti2_eΓ

.

Analogously, we assume foreΣ_i

+1(:=Ψi+1Σi+1) ∈ R(S)andyi+1 ∈ R(Y)for eachi=0, . . . ,N−1 withÍN_i₌−₀1E[eΣ_i

+1]=I that

(F∗_-EC) _F∗ N−1 Õ

i=0 E[eΣ∗_i

+1y

i+1

] !

−F∗₍_by_{) ≥} N−1 Õ

i=0

Eh∂F∗₍yi+1

),yi+1

−yb_i_eΣi₊₁

.

Example 4.1 (Block-separable structure, ergodic convexity). LetG andT have the

separa-ble structure of Example 3.1. We claim that (G-EC) holds. Indeed, let us introduceTie :=

Ím

j=1eτj,iPj ≥0, satisfying ÍN−1

i=0 E[eτj,i]=1for eachj =1, . . . ,m. Splitting (G-EC) into

(20)

to be true if for allj =1, . . . ,mholds

(4.6) Gj

N−1 Õ

i=0

E[eτj,iPjx i+1

] !

−Gj(Pj_bx) ≥ N−1 Õ

i=0

Eeτi Gj(Pjxi+1

) −Gj(Pj_bx).

The right hand side can also be written as∫ΩNGj(Pjx i

(ω_{)) −}Gj(Pjxb)dµN(i,ω)for the mea-sureµN :=eτjÍNi=−01δi×Pon the domainΩN :={0, . . . ,N−1} ×Ω. Using our assumption ÍN₋1

i=0 E[eτj,i]=1, we deduceµ N₍_ΩN₎

=1. An application of Jensen’s inequality now shows

(4.6). Therefore (G-EC) is satisfied.

We also assume that either

E[Φ_iTi_]₌ηiI,¯ and E[Ψ_i₊₁Σ_i₊₁_]₌ηiI,¯ (i ≥1), (CG)

or

E[Φ_i_T_i_]₌_η_¯_i_I, _and _E_[Ψ_iΣ_i_]₌_η_¯_i_I, ₍_i _≥₁₎, (CG∗)

As will see in Example 4.2, (CG∗) is satisfied by the accelerated Chambolle–Pock method of

Example 3.2. In our companion paper [24], we will however see that (C_G) is required to develop doubly-stochastic methods.

With these, and the gap functional G from (4.1), we derive the next two lemmas that are meant to be used in combination with eitherLemma 4.2orLemma 4.3, to estimate the sum of the preliminary gaps therein. For this, the expectation needs to be taken in the estimates of the latter. All of these different combinations will be summarised inTheorem 4.6after the lemmas.

Lemma 4.4. Suppose(G-EC),(F∗_-EC₎_{, and} ₍_C_G₎_{hold. Set}

(4.7) ζN :=

N−1 Õ

i=0 ¯

ηi,

and for{(xi,yi_)}_iN

=1 ⊂X ×Y, define the ergodic sequences (4.8) _exN :=ζN−1E

"_N₋₁ Õ

i=0

T_i∗Φ∗_i_xi+1 #

, and _eyN :=ζN−1E

"_N−₁ Õ

i=0 Σ∗_i

+1Ψi∗+1yi+1 #

.

Then

N−1 Õ

i=0

E[Gi′+1(x

i+1_,_yi+1_;_Γ_/₂_{)] ≥}_ζ

NG(exN,eyN).

Proof. Using (CG), (G-EC), and (F∗_-EC_{), we compute}

N−1 Õ

i=0

E[Gi′+1(xi+1,yi+1;Γ/2)]= N−1 Õ

i=0

Ehh∂G₍xi+1

),xi+1

−bx_iΦ_iTi

−kxi+1

−bxkΦ2_iTiΓ_/₂+h∂F∗(y i+1

),yi+1

−byiΨi₊₁Σi₊₁ i

−ζNheyN,Kbxi+ζNhy,bKexNi ≥ζNG(exN,eyN).

(21)

Lemma 4.5. Suppose(G-PM),(F∗-PM),(G-EC),(F∗-EC), and(CG∗)hold. Set

(4.9) ζ_∗,N :=

N−1 Õ

i=1

¯

ηi,

and for{(xi,yi)}N

i=1 ⊂X ×Y, define the ergodic sequences

(4.10) ex_∗,N :=ζ −1 ∗,NE

"_N₋₁ Õ

i=1 T_i∗Φ∗

ixi+1 #

, and ey_∗,N :=ζ

−1 ∗,NE

"_N₋₁ Õ

i=1 Σ∗

iΨi∗yi #

.

Then

N−1 Õ

i=0

E[Gi′+1(x

i+1_,_yi+1_;_Γ_/₂_{)] ≥}_ζ

NG(ex∗,N,ey∗,N).

Proof. Using (G-PM) and (OC), we deduce

G1′(x1,y1;Γ/2) ≥ h∂F∗(y1),y1−byiΨ₁Σ₁ ₊hby,Ψ₁Σ₁Kxbi − hy1,Ψ₁Σ₁Kbxi.

Likewise (F∗-PM) and (OC) give

GN′ (x

N_,_yN_;_Γ

/2_{) ≥ h}∂G₍xN₎,xN ₋_bx_iΦN₋₁TN₋₁− kx N

−bx_kΦ2N₋₁TN₋₁Γ/2 − hby,KT_N∗₋₁Φ∗

N−1bxi+hby,KTN∗−1Φ∗N−1xNi. Shifting indices ofyi by one compared toG′

i+1, we define

G∗′,i+1 :=h∂G(x

i+1₎_,_xi+1₋_b_x_i

ΦiTi − kx

i+1₋_b_x_k2

Φ_iTiΓ_/₂+h∂F∗(y i

),Σ∗_iΨ_i∗₍_yi₋_b_y_)i

− hby,₍KT_i∗Φ∗

i −ΨiΣiK)xbi − hyi,ΨiΣiKbxi+hyb,KTi∗Φ∗ixi+1i. Correspondingly reorganising terms, we observe

N−1 Õ

i=0

Gi′+1(xi+1,yi+1;Γ/2)=G1′(x1,y1)+GN′(xN,yN;Γ/2)

+

N_Õ−2

i=1

Gi′+1(xi+1,yi+1;Γ/2) ≥ N−1 Õ

i=1

G_∗′,i+1 .

We now estimateÍN−_i₌₁1E[G_∗′,i+1]analogously to the proof ofLemma 4.4.

The next theorem is our main result for saddle point problems.

Theorem 4.6. Let us be givenK ∈ L(X;Y),G ∈ C(X), andF∗∈ C(Y)on Hilbert spacesX andY,

satisfying(G-PM)and(F∗-PM)for some0_≤ Γ_{∈ L(}_X_;_X₎_{. For each}_i _∈_N_{, let}_T_i,Φ_i _{∈ R(L(}_X_;_X₎₎

andΣ_i

+1,Ψi+1 ∈ R(L(Y;Y))be such thatΦiTi ∈ R(T )andΨi+1Σi+1 ∈ R(S). Also takeV_i′₊₁ ∈

(22)

andVi+1by(2.3). Suppose(PP)is solvable, and denote the iterates byui = (xi,yi). Letbu= (bx,by)

be a solution to(OC). Assuming one of the following cases to hold, let

e дN :=

      

0, eΓ₌ Γ, (CI-Γ₎_holds,

ζNG(exN,by), eΓ= Γ/2;(CI-Γ),(G-EC),(F∗-EC)and(CG)hold, ζ_∗,NG(ex∗,N,by), e

Γ₌ Γ_/_2;₍_CI-Γ₎_,₍_G-EC₎_,₍_F∗_-EC₎_and₍_C_G_∗₎_hold_, ζNG(exN,_eyN), eΓ₌ Γ_/_2;₍_CI-_G₎_,₍_G-EC₎_,₍_F∗_-EC₎_and ₍_C_G₎_hold, ζ_∗,NG(ex∗,N,ey∗,N), e

Γ₌ Γ_/_2;₍_CI-_G₎_,₍_G-EC₎_,₍_F∗_-EC₎_and ₍_C_G_∗₎_hold.

IfZi₊1Mi+1 is self-adjoint, then

(DI-_G) 1

2E

h

kuN ₋_bu_k_ZN2 +1MN+1

i

+дeN ≤ ku0−bukZ21M1+ N−1 Õ

i=0

E[∆_i +1(bu)].

Proof. The caseeдN =0is simply the result of taking the expectation in the claim ofTheorem 3.1. The remaining cases follow by taking the expectation in different combinations ofLemma 4.2 or4.3withLemma 4.4or4.5.

As an easy corollary, we obtain convergence of function values for the basic minimisation problemH = ∂G.

Corollary 4.7. Let us be givenG ∈ C(X), satisfying (G-PM)and (G-EC) for Γ ₌ ₀_{. For each}

i _∈ _N, letWi,Mi,Zi ∈ R(L(X;X)) as well asVi′ ∈ R(X → X). SupposeZiWi ∈ R(T ), that ZiMiis self-adjoint, that(CI)holds, and(PP)is solvable withH =∂GandVi+1as in(2.3). Suppose

E[ZiWi]=ηiI¯ for someηi¯ >0. Letub∈ [∂G]−1(0). Then the iterates{ui}i∈Nof (PP)satisfy(DI-G)

with

e

дN :=ζN(G(euN) −G(bu)),

where

e

uN :=ζN−1E "_N−₁

Õ

i=0

W_i∗Z_i∗ui+1 #

, ζN =

N−1 Õ

i=0

¯

ηi.

Proof. IntroducingK :=0andF∗ ≡0(orF∗ ≡δ{0}), we can write the original problem in the saddle point form (S). Then the gap_G(x,y₎ =G(x)−G(bx)measures the convergence of function

values. We can also extend the method for (PP) withH = ∂G to the saddle point problem by

choosing

¯

V_i′₊₁(u)=(Vi′+1(x),0), and Mi¯ +1:=

Mi+1 0

0 0

,

as well asTi := Wi,Φi := Zi. We also denote byW¯i+1 andZ¯i+1 the step length and testing operators for the saddle point problem. NowTiΦi = η¯iI, so we can chooseΨi+1 = ψi+1I and Σ_i₊₁ ₌ _σi₊₁_I _{such that (}_C_G_{) holds and indeed}_KT∗

iΦ∗i = Ψi+1Σi+1K. The latter causes the off-diagonal components ofZi¯+1Ξi+1(eΓ)to cancel. Consequently (CI-Γ) holds foreΓ = 0by virtue

(23)

4.3 primal–dual examples revisited

We now study gap estimates for several of the examples fromSection 3.

Lemma 4.8. SupposeG _{∈ C(}X₎is (strongly) convex with factorγ _≥0,Ti =τiIandΦi =ϕiI, and

T =[0,_∞)I. Then both(G-PM)and (G-EC)hold withΓ₌_{γ I}_.

Proof. This follows fromExample 4.1withm=1.

Suppose we have a method for (S) that satisfies the conditions of the earlierTheorem 3.1 witheΓ ₌ Γ ₌ _{γ I}_,_T_i ₌ _τ_i_I_,Φ_i ₌ _ϕ_i_I_,Σ_i

+1 = σi+1, andΨi+1 = ψi+1I. This includes the exam-ples ofSection 3.2. ThenLemma 4.8proves all of (G-EC), (F∗_-EC_{), (}_G-PM_{) and (}_F∗_-PM_{). To use}

Theorem 4.6, it remains to prove either (CI-Γ_{) or (}_CI-_G_{) with}eΓ₌₍_γ_/₂₎_I _{instead of}eΓ₌_{γ I}_{, and} either (CG) or (CG∗). The conditions (CG) or (CG∗) we reduce to

(4.11) either ϕiτi =ψi+1σi+1 or ϕiτi =ψiσi.

If these conditions are satisfied, and∆_i

+1 ≤ 0, we get fromTheorem 4.6the convergence of

G(xeN,by)orG(ex∗,N,by)to zero at the respective rateO(1/ζN)or(1/ζ∗,N).

Let us now return to the primal–dual examples ofSection 3.2. In the accelerated variants, we took arbitrary_eγ ∈ [0,γ], and proved (CI-Γ_{) for}eΓ₌_e_{γ I}_{. Therefore, it now suffices to restrict}_e_γ _∈

[0,γ_/2_]to satisfy (CI-Γ_{) for}_{Theorem 4.6}_{. We can also eliminate}_F∗_{from (}_CI-Γ_{) by monotonicity,} so (CI-G) also holds in that case.

Example 4.2 (Chambolle–Pock gap). The Chambolle–Pock method ofExample 3.2satisfies

the second part of (4.11), and we haveζ_∗,N = ÍN−1

i=1 ϕ 1/2

i as well as∆i+1 ≤ 0. In the unac-celerated case (γ_e = 0), we getζ∗,N = N ϕ

1/2

0 . Therefore, according to the remarks in the previous paragraph, we getO₍1_/N₎convergence ofG(ex_∗,N,ey∗,N)to zero. In the accelerated

case_eγ ∈ (0,γ/2],ϕiis of the orderΘ₍_i2₎_{. Therefore also}_ζ_∗

,N is of the orderΘ(N

2₎_{, so we get}

O₍1_/N2₎convergence of_G(_ex_∗,N,ye∗,N)to zero. The convergence ofG(xe∗,N,by)is analogous.

Example 4.3 (ADMM gap). The ADMM ofExample 3.3also satisfies the second part of (4.11).

We recall thatϕiτi = constant. ThereforeζN is always of the orderΘ(N). We now get the convergence ofG(exN,yeN)to zero at the rateO(1/N)with or without the step length update scheme (3.12).

Example 4.4 (GIST gap). The GIST ofExample 3.5satisfies either of the conditions in (4.11),

(24)

4.4 basic examples revisited

LetH = ∂G forG ∈ C(U), and consider a method satisfying the conditions of Theorem 2.1

withWi =τiI,Zi =ϕiI. This includes many of our examples inSection 2.3.Lemma 4.8proves

(G-EC), so the conditions ofCorollary 4.7are satisfied withζN =ÍNi=−11τiϕi. Therefore,G(exN) converges toG(bx)at the rateO(1/ζN).

Example 4.5 (Gradient descent function value). For the gradient descent method ofExample 2.3,

we haveτi = τ andϕi = ϕ constants, so we obtainO(1/N) rate. Similarly we can obtain

O₍1_/N2₎ convergence for the accelerated variant from Example 2.4as long as we choose e

γ _{∈ (}0,γ_/2_].

Example 4.6 (Forward–backward splitting function value). As we recall fromExample 2.5,

forward–backward splitting has the same convergence properties as gradient descent. There-foreExample 4.5characterises convergence of the function values.

Example 4.7 (Newton’s method function value). For Newton’s method inExample 2.7, we

haveτi =1andϕN := (2κ)Nϕ0forκ ∈ (1/2,1). We therefore obtain linear convergence of the function values.

4.5 stochastic examples

We now exploit the fact that the step lengthWi₊1can be a non-invertible operator. We observe that in a stochastic setting, we only need the expectation_E_[∆_i

+1]inCorollary 4.7andTheorem 4.6. Therefore, we can relax the relevant condition (CI∼), (CI), (CI-Γ_{), or (}_CI-_G_{) to the expectation.} This may produce more lenient step length and other conditions. Here we demonstrate the flexibility of our techniques with a few basic examples. We refer to the review article [26] for an introduction and further references to stochastic coordinate descent, and to our companion paper [24] for primal–dual methods based on the work here.

Definition 4.2. We write(P1, . . . ,Pm) ∈ P(U)ifP1, . . . ,Pm are projection operators inU with

Ím

j=1Pj =I, andPjPi =0fori,j. For randomS(i) ⊂ {1, . . . ,m}, we then set PS₍i):=

Õ

j∈S(i)

Pj, and Π_S₍_i₎_:₌ Õ

j∈S(i)

π_j−,i1Pj, where πj,i :=P[j ∈S(i)]> 0.

For smoothG _{∈ C(}U₎, we letLS(i)> 0be theΠS(i)-relative smoothness factor (seeLemmab.1), satisfying

(4.12) L−_S(i)1 _k∇G₍u_{) − ∇}G₍v_)kΠ2S₍i₎ ≤ h∇G(u) − ∇G(v),u−vi, (u,v ∈U).

(25)

Example 4.8 (Stochastic gradient descent). LetG _{∈ C(}U₎ have Lipschitz gradient, and

(P1, . . . ,Pm) ∈ P(U). For eachi ∈N, take randomS(i) ⊂ {1, . . . ,n}, and set

(4.13) Wi+1 :=τiΠS(i), Mi+1 :=I, and Vi′+1(u):=Wi+1[∇G(ui) − ∇G(u)].

Then (PP) says that we take gradient step on the random subspacerange₍Π_S₍_i₎₎_:

(4.14) ui+1

=ui−τiΠS(i)∇G(ui).

If the step lengths are deterministic and satisfy ϵ ≤ τiLS(i) < 2πj,i for all j ∈ S(i) for

someϵ > 0, we have_E_[G₍u_eN)] →G(ub)at the rateO(1/N). Through the use of the “local” smoothness factorsLS(i), the method may be able to take larger stepsτi than those allowed by the global factorLinExample 2.3.

Proof of convergence. TakingZi+1 := I,Lemma 4.8shows thatGsatisfies (G-EC) (withΓ = 0).

We can also simply defineη¯i := E[ZiWi]. ThenζN = ÍNi=−01E[ZiWi] ≥

ÍN−1

i=0 E[Wi]ϵ ≥ ϵI.

ThereforeCorollary 4.7andProposition 2.4show the desired convergence provided we verify (CI∼_{). We do this through (}_CI∗_{), which with}_Ui

+1=U now reads 1

2ku−u

i

k2+ϕτih∇G(ui) − ∇G(u∗),u−u∗iΠS₍i₎ ≥ −∆i+1(u∗;u).

We have

E[h∇G₍ui_{) − ∇}G₍u∗₎,ui₋u∗_iΠS₍i₎]=E[h∇G(u i

) − ∇G₍u∗₎,ui₋u∗_i_E_[Π_S (i₎|i−1]]

=E[h∇G(ui) − ∇G(u∗),ui−u∗i].

Similarly to (2.9), we may thus estimate

E[h∇G(ui) − ∇G(u∗),u−u∗iΠ_S (i₎]

=E[h∇G(ui) − ∇G(u∗),ui −u∗i]+E[h∇G(ui) − ∇G(u∗),u−uiiΠS₍i₎] ≥E

h

h∇G₍ui_{) − ∇}G₍u∗₎,ui₋u∗_{i −}L−_S(i)1 _k∇G₍ui_{) − ∇}G₍u∗_)k2ΠS₍i₎

−LS₄(i)ku₋ui_k2ΠS₍i₎ i

.

Using (4.12), we see that (CI∗_{) is verified with}

(4.15) E[∆_i₊₁₍_u∗_;_u_)]₌₋_E

"_m Õ

j=1

1−τiπ_j−1 ,iLS(i)/2

2 kPj(u−u

i )k2

# .

This satisfies∆_i

+1(u∗;u) ≤0under our step length assumptions.