• No results found

Testing and non-linear preconditioning of the proximal point method

N/A
N/A
Protected

Academic year: 2019

Share "Testing and non-linear preconditioning of the proximal point method"

Copied!
31
0
0

Loading.... (view fulltext now)

Full text

(1)

arXiv:1703.05705v1 [math.OC] 16 Mar 2017

testing and non-linear preconditioning of the

proximal point method

Tuomo Valkonen∗

2017-03-16

Abstract Employing the ideas of non-linear preconditioning and testing of the classical proximal point method, we formalise common arguments in convergence rate and con-vergence proofs of optimisation methods to the verification of a simple iteration-wise in-equality. When applied to fixed point operators, the latter can be seen as a generalisation of firm non-expansivity or theα-averaged property. The main purpose of this work is to pro-vide the abstract background theory for our companion paper “Block-proximal methods with spatially adapted acceleration”. In the present account we demonstrate the effective-ness of the general approach on several classical algorithms, as well as their stochastic variants. Besides, of course, the proximal point method, these method include the gradi-ent descgradi-ent, forward–backward splitting, Douglas–Rachford splitting, Newton’s method, as well as several methods for saddle-point problems, such as the Alternating Directions Method of Multipliers, and the Chambolle–Pock method.

Get the version fromhttp://tuomov.iki.fi/publications, citations broken in this one due broken arXiv biblatex support.

1

introduction

The proximal point method for monotone operators [17,22], while infrequently used by itself, can be found as a building block of many popular optimisation algorithms. Indeed, many im-portant application problems can be written in the form

(P) min

x G(x)+F(Kx)

for convex non-smoothGandF, and a linear operatorK. Examples abound in image processing and data science. The problem (P) can often be solved by methods such as forward–backward splitting, ADMM (alternating directions method of multipliers) and their variants [2,16,11,6]. They all involve a proximal point step.

The equivalent saddle point form of (P) is

(S) min

x maxy G(x)+hKx,yi −F ∗(y).

In particular within mathematical image processing and computer vision, a popular algorithm for solving (S) is the primal–dual method of Chambolle and Pock [6]. As discovered in [12], the

(2)

method can most concisely be written as apreconditioned proximal point method, solving on each iteration forui+1

=(xi+1,yi+1)the variational inclusion

(PP0) 0∈H(ui+1)+Mi+1(ui+1−ui),

where the monotone operator

H(u):=

∂G(x)+K∗y

∂F∗(y) −Kx

encodes the optimality condition 0∈H(bu)for (S). In the standard proximal point method [22], one would takeMi+1 = I the identity. With this choice, (PP0) is generally difficult to solve. In the Chambolle–Pock method thepreconditioning operator is given for suitable step length parametersτi,σi+1,θi >0 by

(1.1) Mi+1 :=

τi−1I −K∗ −θiK σi−+11I

.

This choice ofMi+1decouples the primalxand dualyupdates, making the solution of (PP0) fea-sible in a wide range of problems. IfGis strongly convex, the step length parametersτi,σi+1,θi

can be chosen to yieldO(1/N2)convergence rates of an ergodic duality gap and the quadratic distancekxixbk2.

In our earlier work [25], we have modifiedMi+1 as well as the condition (PP0) to still allow a level of mixed-rate acceleration whenGis strongly convex only on sub-spaces. Our conver-gence proofs were based ontesting the abstract proximal point method by a suitable operator, which encodes the desired and achievable convergence rates on relevant subspaces.

In the present paper, we extend this theoretical approach to linear preconditioning, non-invertible step-length operators, and arbitrary monotone operatorsH. Our main purpose is to provide the abstract background theory for our companion paper [24]. Here, within these pages, we demonstrate that several classical optimisation methods—including the second-order New-ton’s method—can also be seen as variants of the proximal point method, and that their com-mon convergence rate and convergence proofs reduce to the verification of a simple iteration-wise inequality. Through application of our theory to Browder’s fixed point theorem [4] in Section 2.5, we see that our inequality generalises the concepts of firm non-expansivity or the

α-averaged property. Our theory also covers stochastic variants of the considered algorithms. InSection 2, we start by developing our theory for general monotone operatorsH. This ex-tends, simplifies, and clarifies the more disconnected results from [25] that concentrated on saddle-point problems with preconditioners derived from (1.1). We demonstrate our results on the basic proximal point method, gradient descent, forward–backward splitting, Douglas– Rachford splitting, and Newton’s method. The proximal step in forward–backward splitting and proximal Newton’s method can be introduced completely “free”, without any additional proof effort, in our approach.

(3)

setting, which allows our results to be used to study various stochastic block-coordinate de-scent methods. We refer to [26] for a review of this class of methods. In the companion paper [24], we will apply our results to stochastic primal-dual methods with coordinate-wise adapted step lengths.

Besides already cited works, other previous work related to ours includes that on generalised proximal point methods, such as [5,8], as well inertial methods for variational inclusions [15].

2 an abstract preconditioned proximal point iteration

2.1 notation and general setup

We useC(X)to denote the space of convex, proper, lower semicontinuous functions fromX to the extended realsR:=[−∞,∞], andL(X;Y)to denote the space of bounded linear operators

between Hilbert spacesX andY. We denote the identity operator byI. ForT,S ∈ L(X;X), we writeT S whenT S is positive semidefinite. Also for possibly non-self-adjointT, we introduce the inner product and norm-like notations

(2.1) hx,ziT := hT x,zi, and kxkT := p

hx,xiT.

For a setA R, we writeA 0if every elementt Asatisfiest 0.

Our overall wish is to find somebuU, on a Hilbert spaceU, solving for a given set-valued mapH :U ⇒U the variational inclusion

(2.2) 0∈H(ub).

In the presentSection 2,H will be arbitrary, but in Section 3, where we specialise the results, and inSection 4, where we consider gap estimates, we concentrate onHarising from the saddle point problem (S).

Our strategy towards finding a solutionbuis to introduce an arbitrary non-linear iteration-dependentpreconditionerVi+1 :U →U and astep length operatorWi+1 ∈ L(U;U). With these,

we define the generalised proximal point method, which on each iterationiNsolves forui+1

from

(PP) 0Wi+1H(ui+1)+Vi+1(ui+1),

We assume thatVi+1splits intoMi+1 ∈ L(U;U), andVi′+1 :U →U as

(2.3) Vi+1(u)=Vi′+1(u)+Mi+1(u−ui).

More generally, to rigorously extend our approach to cases that would otherwise involve set-valuedVi+1, we also consider forHie+1:U ⇒U the iteration

(4)

2.2 basic estimates

We analyse (PP) and (PP∼) by applying atesting operatorZi+1 ∈ L(U;U), following the ideas

introduced in [25]. The productZi+1Mi+1with the linear part of the preconditioner, will, as we

soon demonstrate, be an indicator of convergence rates.

Theorem 2.1. On a Hilbert spaceU, let Hie+1 : U ⇒ U, andMi+1,Zi+1 ∈ L(U;U) fori ∈ N.

Suppose(PP∼)is solvable, and denote the iterates by{ui

}i∈N. IfZi+1Mi+1is self-adjoint, and

(CI∼) 1

2ku

i+1

−uikZ2i+1Mi+1 +

1 2ku

i+1

−buk2Zi+1Mi+1−Zi+2Mi+2

+hHei+1(ui+1),ui+1−buiZi+1 ≥ −∆i+1(ub)

for alli ∈Nand someub∈U, then

(DI) 1

2ku

N

−ubkZN2

+1MN+1 ≤

1 2ku

0

−buk2Z 1M1+

N−1 Õ

i=0 ∆i

+1(ub) (N ≥1).

Corollary 2.2. On a Hilbert spaceU, letH : U ⇒ U. Also letZi+1,Wi+1,Mi+1 ∈ L(U;U), and

Vi

+1 :U →U fori∈N. Suppose(PP)is solvable forVi+1as in(2.3). Denote the iterates by{u

i }i∈N.

Letbu ∈H−1(0). IfZi+1Mi+1 is self-adjoint, and

(CI) 1 2ku

i+1

−uikZi2 +1Mi+1+

1 2ku

i+1

−ubkZi2

+1Mi+1−Zi+2Mi+2

+hWi+1[H(ui+1) −H(ub)]+Vi′+1(u

i+1),ui+1bui

Zi+1 ≥ −∆i+1(ub),

for alli ∈N, then(CI∼)and (DI)hold forHie

+1(u):=Wi+1H(u)+Vi′+1(u).

For (PP), the condition (CI) is often more practical to verify than (CI∼) thanks to the additional structure introduced byH(bu) ∋0. Indeed, in many of our examples, we can eliminateHthrough monotonicity. To derive gap estimates inSection 4, we will however need (CI∼).

Proof ofTheorem 2.1. Inserting (PP∼) into (CI∼), we obtain

(2.4) 1 2ku

i+1

−uikZi2+1Mi+1 +

1 2ku

i+1

−bukZi2+1Mi+1−Zi+2Mi+2

− hui+1ui,ui+1bui

Zi+1Mi+1 ≥ −∆i+1(bu).

We recall for general self-adjointM the three-point formula

(2.5) hui+1

−ui,ui+1

−buiM = 1

2ku

i+1

−uik2M−

1 2ku

i

−ubk2M+

1 2ku

i+1

−bukM2 .

Using this withM =Zi+1Mi+1, we rewrite (2.4) as

1 2ku

i

−bukZ2i+1Mi+1

1 2ku

i+1

−bukZ2i+2Mi+2 ≥ −∆i+1(bu).

(5)

Remark 2.3 (Bregman distances). The three-point formula(2.5)generalises to Bregman distances [8]. IfZi+1=ϕi+1Ifor some scalarϕi+1, it is then easy to generaliseTheorem 2.1from

1

2k · kMi+1 to

more general Bregman distances. While we do occasionally work withVi+1 arising as the gradient

of a more general Bregman distance, we will, however, not benefit from more generalMi+1.

The next two results demonstrate how the estimate of Theorem 2.1 can be used to prove convergence with or without rates.

Proposition 2.4 (Convergence with a rate). Suppose (DI) holds with ∆i

+1(ub) ≤ 0, and that

ZN+1MN+1 ≥ µ(N)I. ThenkuN −buk2→0at the rateO(1/µ(N)).

Proof. Immediate from (DI).

Proposition 2.5 (Weak convergence). SupposeZiMi = Z0M0 ≥ 0is self-adjoint, and that the

iterates of (PP∼)satisfy(CI)withi

+1(ub) ≤ −δ2ku

i+1uik2

Zi+1Mi+1for allub∈

ˆ

U :={u ∈U |0∈

H(u)}and someδ >0. If (CL) Zi+1Mi+1(u

i+1

−ui) →0anduik ⇀u =⇒ lim sup

k→∞ e

Hi+1(uik) ⊂W∗H(u)

for some non-singularW∈ L(U;U), thenZ0M0(ui−u∗)⇀0weakly inU for someu∗satisfying

0H(u∗).

Thelim supdenotes the (strong) outer limit [?, see, e.g.,]]rockafellar-wets-va. For the proof, we use the next lemma. Its earliest version is contained in the proof of [18, Theorem 1].

Lemma 2.6 ([4, Lemma 6]). On a Hilbert spaceX, letXˆ X be closed and convex, and{xi}i∈N ⊂

X. If the following conditions hold, thenxi ⇀x∗weakly inX for somex∗ ∈Xˆ: (i) i7→ kxi −x∗kis non-increasing for allx∗ ∈Xˆ.

(ii) All weak limit points of{xi}i∈N belong toXˆ.

Proof ofProposition 2.5. SinceZi+1Mi+1−Zi+2Mi+2 ≤ 0, it is easy to see that (CI∼) and conse-quently (DI) holds for allbu U′ := cl conv ˆU. We applyTheorem 2.1on anybu ∈ U′. Using

i

+1(ub) ≤ −δ2kui+1−uik2Zi

+1Mi+1, we haveZi+1Mi+1(u

i+1ui) →0. By (PP) and (CL), any weak limit pointu∗of the sequence{ui}i∈Nthen satisfiesu∗∈Uˆ ⊂U′. SinceA:=Z0M0=Zi+1Mi+1,

this verifies condition(ii)of the lemma forxi :

=A1/2uiandX′:=A1/2UˆonX :=A1/2U ⊂U.

Ap-plied withN =1anduiin place ofu0, (DI) shows condition(i)of the lemma. Thusxi ⇀x∗∈Xˆ.

Butx∗ = A1/2u∗ for someu∗ = Uˆ. ThusA(ui −u∗) ⇀ 0. This impliesZ0M0(ui −u∗) ⇀ 0

weakly.

2.3 examples of first-order methods

We now look at several concrete examples.

Example 2.1 (The proximal point method). TakeMi = I,Vi′ = 0, andWi+1 = τiI for some

(6)

maximal monotone,{ui}i∈Nconverges weakly to someu∗H−1(0).

Proof of convergence. We takeZi+1 =ϕiIfor someϕi > 0. As long asϕi ≥ϕi+1, the

monotonic-ity ofH clearly shows (CI) with∆i

+1(bu) = −ϕi2 kui+1 −uik2. Using the maximal monotonicity,

Minty’s theorem [?, e.g.,]Theorem21.1]bauschke2011convex guarantees the solvability of (PP). Thus the conditions ofCorollary 2.2are satisfied. Maximal monotonicity also guarantees thatH is weak-to-strong outer semicontinuous; seeLemmaa.1. This establishes (CL). Takingϕi ≡ϕ0 for constantϕ0> 0, so thatZi+1Mi+1 =Z0M0=ϕ0I, it remains to refer toProposition 2.5.

Example 2.2 (Accelerated proximal point method). Continuing fromExample 2.1, suppose

H is strongly monotone. ThenhH(ui+1) −H(bu),ui+1ubi ≥γkui+1ubk2for someγ >0, so (CI) continues to hold with∆i

+1(ub)=−ϕ i 2 ku

i+1

−uik2ifϕi(1+2γτi) ≥ϕi+1. This is the case forτi+1 := τi/√1+2γτi, andϕi+1 := 1/τi2+1. The testing variableϕN is of the orderΘ(N

2

)

[6,25], so we get convergence ofkuN ubk2to zero at the rateO(1/N2)fromCorollary 2.2 andProposition 2.4.

To facilitate the analysis algorithms with a proximal step, we introduce the following strength-ened version of (CI), assumed to hold for some∆i

+1(u∗;u)at allu ∈Ui+1 ⊂U andu∗∈U:

(CI∗) 1

2ku−u

i

k2Zi+1Mi+1+

1 2ku−u

k2

Zi+1Mi+1−Zi+2Mi+2

+hWi+1(H(u) −H(u∗))+Vi′+1(u),u−u

i

Zi+1 ≥ −∆i+1(u∗;u). Note that only the choiceu=ui+1 andu∗=ubimplies (CI∼) and thus convergence. The role of

the subsetUi+1 is to model a compatible range ofui+1 betweenH = AandH = A+Bin the next lemma. TypicallyUi+1 =U, but for the stochastic examples ofSection 4.5, we will need to make restrictions.

Lemma 2.7. LetA,B :U ⇒U. Suppose(CI)holds forH =A, and that

(2.6) hB(u) −B(u∗),uu∗iZi+1Wi+1 ≥ 0, (u ∈Ui+1,u∗∈U).

Then(CI∗)holds forH =A+BwithW

i+1,Mi+1,Zi+1,Vi′+1and∆i+1(u,u

)unchanged. Moreover,

ifvi+1 solves(PP)forH

=A, thenui+1 :=(I+Wi+1B)−1(vi+1)solves(PP)forH =A+B.

Proof. Using (2.6),Bis easily eliminated from (CI∗). The result is (CI∗) forH =A. The

relation-ship betweenvi+1 andui+1 is immediate from expansion of (PP).

The next lemma starts our analysis of gradient descent:

Lemma 2.8. LetH =∇GforG ∈ C(X)such that∇GisL-Lipschitz. TakeMi+1 ≡IandVi′+1(u):=

τi(∇G(ui) − ∇G(u))withWi+1 = τiI as well asZi+1 ≡ ϕiI for someτi,ϕi > 0. Then(CI∗)holds

withUi+1 =U if

(7)

IfGis strongly convex with factorγ >0, alternatively:

(ii) τ0L2 <γ,ϕi+1 :=ϕi+ϕiτi(γ −τiL2),τi :=ϕi−1/2, and∆i+1(u∗;u)=0.

Moreover,Vi+1satisfies(CL)under the above constraints onτi.

Proof. The satisfaction (CL) is immediate from the continuity of∇Gand the boundedness ofτi. For the rest, we start by expanding the condition (CI∗) as

(2.7) ϕi

2 ku−u

i

k2+ϕi−ϕi+1

2 ku−u

k2

+ϕiτih∇G(ui) − ∇G(u∗),u−u∗i ≥ −∆i+1(u∗;u). (i)Lipschitz gradient impliesL−1-co-coercivity ([1], see alsoAppendixb)

(2.8) h∇G(u′) − ∇G(u),u′ui ≥L−1k∇G(u′) − ∇G(u)k2 for all u,u′. Now (2.7) follows after we use (2.8) and Cauchy’s inequality to estimate

(2.9) h∇G(ui) − ∇G(u∗),uu∗i= h∇G(ui) − ∇G(u∗),uiu∗i

+h∇G(ui) − ∇G(u∗),u−uii ≥ −L

4ku−u

i k2.

(ii)We estimate

h∇G(ui) − ∇G(u∗),uu∗i= h∇G(u) − ∇G(u∗),u−u∗i+h∇G(ui) − ∇G(u),u−u∗i

≥ γ

2ku−u

k2

− 1

2τiku−u i

k2−τiL

2

2 ku−u

k2.

Inserting this into (2.7), we see that (CI∗) holds withi

+1(u∗;u)=0if (2.10) ϕi+ϕiτi(γ −τiL2) ≥ϕi+1.

Clearly our choice of{τi}i∈Nis non-increasing. Therefore, (2.10) follows from the initialisation conditionτ0L2<γ and the update ruleϕi+1:=ϕi+ϕiτi(γ −τiL2).

Example 2.3 (Gradient descent). Takingτi =τ constant inLemma 2.8, (PP) reads

0=τ∇G(ui)+ui+1−ui.

This is the gradient descent method. Direct application ofLemma 2.8(i)withu =ui+1 and

u∗ =butogether withCorollary 2.2andProposition 2.5now verifies the well-known weak

convergence of the method whenτ L< 2. Observe thatVi+1 =∇Qi+1for

Qi+1(u):=

1 2ku−u

i

k2+τ G(ui)+h∇G(ui),u−uii −G(u).

Each step of (PP) therefore minimises thesurrogate objective[9]

(8)

The functionQi+1 on one hand penalises long steps, and on the other hand allows longer steps when the local linearisation error is large. In this example,Qi+1 is, in fact, a Breg-man distance. Proximal point methods based on general BregBreg-man distances in place of the squared norm are studied in, e.g., [5,8,13,14].

Example 2.4 (Acceleration of gradient descent). Continuing fromExample 2.3, ifGis strongly

convex, we may use the acceleration scheme inLemma 2.8(ii). Similarly toExample 2.1,ϕN is of the orderΘ(N2). Therefore,Corollary 2.2andProposition 2.4show the convergence of kuN buk2to zero at the rateO(1/N2).

Example 2.5 (Forward–backward splitting). LetH = ∇G +∂F forG,F ∈ C(X) with∇G

Lipschitz. TakingMi+1,Wi+1, andVi′+1as inExample 2.3, (PP) becomes

0τi∂F(ui+1)+τi∇G(ui)+ui+1−ui.

This is the forward–backward splitting method

ui+1 :

=(I +τi∂F)−1(ui−τi∇G(ui)).

ByLemma 2.7, convergence and acceleration work exactly as for gradient descent inExamples 2.3 and2.4. IfF is strongly convex with factorγF, we can introduce the additional termγF2 kui+1 b

uk2to (2.7). This will improve (2.10) to allowϕi+1 :=ϕi+ϕiτi(γ +γF −τiL). Alternatively, it would be possible to chooseϕiandτi to yield FISTA-style acceleration [2].

Example 2.6 (Douglas–Rachford splitting). LetA,B:U ⇒U be monotone operators.

Con-sider the problem of findingubwith0A(bu)+B(ub). Forλ >0, let

H(u,v):=

λB(u)+u−v

λA(u)+v−u

, Mi+1 :=

0 0 0 I

, and

e

Hi+1(u,v):=

λB(ui+1)

+ui+1−vi

λA(ui+1

+vi+1−vi)+vi−ui+1

. (2.12)

Then0 A(bu)+B(ub)if and only if0 ∈H(u,bvb), wherevb∈ (ub−λA(bu)) ∩ (bu+λB(ub)). The

algorithm (PP∼) becomes the Douglas–Rachford splitting [10]

ui+1 :

=(I+λB)−1(vi),

vi+1 :

=vi+(I+λA)−1(2ui+1−vi) −ui+1.

We work with (PP∼) since in (PP),V

(9)

acceleration schemes under strong monotonicity [?, see, e.g.,]]bredies2016accelerated.

Proof of convergence. Writeu¯i :

=(ui,vi)andbu¯ :

=(u,bvb). Observe that

ui+1vi+1

=:qi+1 ∈λA(ui+1−vi+1−vi) and ub−vb=:bq ∈λA(bu).

Using the monotonicity ofAandB, withZi+1:=I, we have

hHie+1(u¯i+1),Zi∗+1(u¯i+1−bu¯)i ⊂ hHie+1(u¯i+1) −H(bu¯),Zi∗+1(u¯i+1−bu¯)i

=λhB(ui+1) −B(ub),ui+1−ubi+λhqi+1−q,vb i+1−vbi +hui+1−vi,(ui+1−vi+1) − (bu−vb)i

=λhB(ui+1) −B(ub),ui+1−ubi+λhqi+1−q,bui+1+vi+1−vi −vbi ≥0.

Thus (CI∼) holds with∆i+1(ub¯) := −21ku¯i+1 −u¯ikZi2+1Mi+1. Using (2.12) and the weak-to-strong

outer semicontinuity ofA andB (seeLemmaa.1), we easily verify (CL). Weak convergence

now follows fromTheorem 2.1andProposition 2.5.

2.4 examples of second-order methods

Lemma 2.9. LetH =∇GforG ∈C2(U). Take

Vi+1(u):=∇2G(ui)(u−ui)+∇G(ui) − ∇G(u), and Wi+1:=I

If∇2G(u) > 0, then(CI)holds forui close enough touwithi

+1(u,u∗) = 0andZNMN =

κN2G(u∗)for someκ >1.

Proof. We setMi+1 := ∇2G(u∗)andZi+1 :=ϕiI for someϕi > 0. ThenG ∈C2(X) implies that Zi+1Mi+1 =ϕi∇2G(u∗)is self-adjoint. The condition (CI∗) reads

(2.13) 1

2ku−u

i

kϕi2 ∇2G(u)+

1 2ku−u

k2

(ϕi−ϕi+1)∇2G(u∗)+ϕiDi+1 ≥ −

i

+1(u∗;u), where

Di+1 :=h∇G(ui) − ∇G(u∗)+(∇2G(ui) − ∇2G(u∗))(u−ui),u−u∗i. By the fundamental theorem of calculus, there existsζibetweenui andu∗with

Di+1 = h∇2G(ζi)(ui−u∗),u−u∗i+h(∇2G(ui) − ∇2G(u∗))(u−ui),u−u∗i.

Using the three-point formula (2.5) and Cauchy’s inequality we therefore obtain

Di+1= 1

2ku−u

k2 2∇2G(ui

)−∇2G(u)

1 2ku−u

i

k22G(u)+

1 2ku

i

−u∗k22G(u)

+h[∇2G(ζi) − ∇2G(ui)](ui−u∗),u−u∗i

≥ 1

2ku−u

k2 2∇2G(ui

)−∇2G(u)−Ai −

1 2ku−u

i

k22G(u)

for

(10)

Inserting this estimate into (2.13), we deduce that we can take∆i

+1(u∗;u)=0if

2ϕi∇2G(ui) −ϕi+1∇2G(u∗) ≥ϕiAi. SinceG C2(U), and 2G

(u∗) > 0, locally nearu∗, we can ensure Ai ϵ2G(ζi+1

) and

∇2G(ui) ≥ [κ/2+ϵ/2]∇2G(u∗)for someκ >1andϵ >0. Thus it remains to satisfy

(1+ϵ)κϕi−ϕi+1 ≥ϕiϵκ.

This holds whenϕi+1 =κϕi. Takingϕ0=1, thusZNMN ≥κN∇2G(u∗).

Example 2.7 (Newton’s method). SupposeH =∇GforG ∈C2(U). TakeVi+1andWi+1as in

Lemma 2.9. Then (PP) reads

0=G(ui)+2G(ui)(ui+1

−ui).

This is Newton’s method. ByLemma 2.9,Corollary 2.2, andProposition 2.4, we obtain linear convergence if∇2G(ub)> 0.

Observe that nowVi+1(u)is the gradient of

Qi+1(u):=G(ui)+h∇G(ui),u−uii+ 1

2ku−u

i k22G(ui

)−G(u).

In the surrogate objective (2.11), this allows longer steps when the second-order Taylor ex-pansion under-approximates, and forces shorter steps when it over-approximates.

Example 2.8 (Proximal Newton’s method). Similarly toExample 2.5, letH = ∇G +∂F for

G∈C2(X), andF ∈ C(X). TakingMi+1,Wi+1, andVi+1 as inLemma 2.9, (PP) becomes

0∂F(ui+1

)+∇G(ui)+∇2G(ui)(ui+1−ui).

This is the proximal Newton’s method [?, see, e.g.,]]lee2014proximal

ui+1 :

=(I+[∇2G(ui)]−1∂F)−1(ui− [∇2G(ui)]−1∇G(ui)),

where(I +A−1∂F)−1(v) solvesminu 12ku −vkA2 + F(u). By Lemma 2.7, convergence and

acceleration work exactly as for Newton’s method inExample 2.7.

2.5 connections to fixed point theorems

We demonstrate connections of our approach to established fixed point theorems.

Example 2.9 (Browder’s fixed point theorem [4]). LetT : U U beα-averaged, that is

(11)

b

u=T(bu). Thenui⇀u∗for some fixed pointu∗ofT.

Proof of Browder’s fixed point theorem. Let us setH(u) :=T(u) −u, as well asZi+1 := Wi+1 :=

Mi+1 :=IandVi′+1(u):=T(ui)+ui−T(u) −u. We have

(2.14) Hei+1(ui+1):=Wi+1H(ui+1)+Vi′+1(ui+1)=T(ui)+ui −2ui+1 =ui−ui+1,

where the last step follows by observing from the previous steps that (PP) saysui+1

= T(ui).

The expression (2.14) easily gives (CL), and reduces (CI∼) to

1 2ku

i+1

−uik2+hui−ui+1,ui+1−bui ≥ −∆i+1(bu).

Usingui+1

=T(ui)andbu=T(bu), and takingβ >0, (CI∼) therefore holds for

(2.15) ∆i

+1(bu)=

α+2β−1

2(1α) ku

i+1uik2

provided

0D:= β

1αkT(u i

) −uik2+hui−bu− (T(ui) −T(ub)),T(ui) −T(bu)i.

Using theα-averaged property andbu= J(ub), we expand

D

1α =βkJ(u i

) −uik2+hui−ub−J(ui)+J(ub),(1−α)(J(ui) −J(ub))+α(ui −bu)i

=(α+β)kui−ubk2+(β+α−1)kJ(ui) −J(ub)k2− (2α+2β−1)hJ(ui) −J(ub),ui−ubi.

We takeβ :=max{0,1/2−α}. Then2α+2β ≥ 1. Cauchy’s inequality and non-expansivity of

J thus give

D

1α ≥

1 2ku

i

−buk2− 1

2kJ(u

i

) −J(bu)k2 ≥0.

This verifies (CI∼). From (2.15),∆i

+1(bu) ≤ −12min{1,α/(1−α)}kui+1−uik2. We now obtain the

claimed convergence fromCorollary 2.2andProposition 2.5.

Remark 2.10. The preconditionerVi+1(u) = T(ui) −T(u) is aT-based “distance”, which is not

obviously a Bregman distance.

3 saddle point problems

WithK ∈ L(X;Y),G∈ C(X)andF∗ ∈ C(Y)on Hilbert spacesX andY, we now wish to solve (S). The first-order necessary optimality conditions can be written

(12)

SettingU := X ×Y and introducing the variable splitting notationu = (x,y),ub= (bx,by), etc., this succinctly be written as0∈H(bu)in terms of the operator

(3.1) H(u) :=

∂G(x)+K∗y

∂F∗(y) −Kx

.

In this section, concentrating on this specificH, we specialise the theory ofSection 2.2to saddle point problems. Throughout, for some primal and dual step length and testing operatorsTi,Φi

L(X;X), andΣi

+1,Ψi+1 ∈ L(Y;Y), we take

(3.2) Wi+1 :=

Ti 0

0 Σi +1

, and Zi+1 :=

Φi 0

0 Ψi +1

.

To work with arbitrary step length operators, which will be necessary for stochastic algo-rithms inSection 4.5, as well as the partially accelerated algorithms of [25], we will need abstract forms of partial strong monotonicity ofGandF∗. As a first step, we take subsets of operators

T ⊂ L(X;X), and S ⊂ L(Y;Y).

We suppose that∂Gispartially (strongly)T-monotone, which we take to mean

(G-PM) h∂G(x′) −∂G(x),x′xiTe ≥ kx′xke2

TΓ, (x,x

X;Te∈ T )

for some linear operator0 Γ ∈ L(X;X). The operatorTe ∈ T acts as a testing operator. Similarly, we assume that∂F∗isS-monotonein the sense

(F∗-PM) h∂F∗(y′) −∂F∗(y),y′yieΣ≥ 0 (y,y′∈Y;eΣ∈ S).

AssumingGto satisfy (G-PM) forΓandFto satisfy (F-PM), we also introduce

Ξi+1(Γ):=

2TiΓ 2TiK∗ −2Σi

+1K 0

,

which is an operator measure of strong monotonicity ofH.

Example 3.1 (Block-separable structure, monotonicity). LetP1, . . . ,Pm be projection

opera-tors inX withÍmj=1Pj = IandPjPi = 0ifi, j. SupposeG1, . . . ,Gm ∈ C(X)are (strongly) convex with factorsγ1, . . . ,γm ≥0. Then (G-PM) holds withΓ=Ímj=1γjPj for

G(x)= m Õ

j=1

Gj(Pjx), and T =

T :=Õ

j∈S tjPj

tj >0,S ⊂ {1, . . . ,m}

. (3.3)

3.1 estimates

(13)

Theorem 3.1. Let us be givenK ∈ L(X;Y),G ∈ C(X), andF∗ ∈ C(Y)on Hilbert spacesX and

Y. SupposeGsatisfies (G-PM)for some0 ≤ Γ ∈ L(X;X). For eachi N, letTii ∈ L(X;X)

andΣi

+1,Ψi+1 ∈ L(Y;Y)be such thatΦiTi ∈ T. Also takeVi′+1 :X ×Y →X ×Y, andMi+1 ∈

L(X ×Y;X ×Y). Let H given by (3.1),Zi+1 andWi+1 by (3.2), andVi+1 by (2.3). Suppose(PP)

is solvable, and denote the iterates byui = (xi,yi). Then(CI),(CI∼)and (DI)hold ifZi+1Mi+1 is self-adjoint, and foreΓ=Γwe have

(CI-Γ) 1 2ku

i+1uik2

Zi+1Mi+1 | {z }

step length in local metric + 1

2ku

i+1buk2

Zi+1(Ξi+1(eΓ)+Mi+1)−Zi+2Mi+2 | {z }

linear preconditioner update discrepancy +h∂F∗(yi+1) −∂F∗(by),yi+1

−byiΨi+1Σi+1 | {z }

variably useful remainder fromH

+hVi′+1(ui+1),ui+1−ubiZi+1 | {z }

from non-linear preconditioner

≥ −∆i +1(ub).

Proof. First of all, we observe that (CI-Γ) implies

(3.4) 1

2ku

i+1

−uik+ 1

2ku

i+1

−ubkZi2

+1(Ξi+1(0)+Mi+1)−Zi+2Mi+2

+h∂G(xi+1) −∂G(bx),xi+1−bxiΦiTi +h∂F∗(yi+1) −∂F∗(by),yi+1−byiΨi+1Σi+1

+hVi′+1(ui+1),ui+1−ubiZi+1 ≥ −∆i+1(ub). Here pay attention to the fact that (3.4) employsΞi

+1(0) while (CI-Γ) employsΞi+1(eΓ). If we show that (CI) follows from (3.4), then (CI∼) and (DI) follow fromCorollary 2.2. Indeed, using

the expansion

Zi+1Wi+1=

ΦiTi 0

0 Ψi +1Σi+1

,

we expand for anyeu=(xe,ey)that

hZi+1Wi+1(H(ui+1) −H(eu)),ui+1−eui

=h∂G(xi+1) −∂G(ex),xi+1−exiΦiTi +h∂F∗(y i+1

) −∂F∗(ey),yi+1

−eyiΨi+1Σi+1

+hΦiTiK∗(yi+1−ye),xi+1−exi − hΨi+1Σi+1K(x

i+1

−ex),yi+1

−eyi. With the help ofΞi

+1(0)we then obtain

hH(ui+1) −H(ue),ui+1uei

Zi+1Wi+1

1 2ku

i+1uek

Zi+1Ξi+1(0)

+h∂G(xi+1) −∂G(ex),xi+1−exiΦiTi +h∂F∗(yi+1) −∂F∗(ye),yi+1−eyiΨi+1Σi+1.

Inserting this into (3.4), we obtain (CI).

3.2 examples of primal–dual methods

(14)

Example 3.2 (The primal–dual method of Chambolle and Pock [6]). This method consists of iterating the system

xi+1 :

=(I+τi∂G)−1(xi−τiK∗yi), (3.5a)

¯

xi+1 :

=ωi(xi+1−xi)+xi+1, (3.5b)

yi+1 :

=(I+σi+1∂F∗)−

1

(yi+σi+1Kx¯

i+1

). (3.5c)

In the basic version of the algorithm,ωi = 1,τi ≡τ0 > 0, andσi ≡ σ0 > 0, assuming the step length parameters to satisfy

(3.6) τ0σ0kKk2<1.

The iterates convergence weakly, and the method hasO(1/N)rate for the ergodic duality gap, to which we will return inSection 4. IfGis strongly convex with factorγ, we may take e

γ ∈ (0,γ], and accelerate

(3.7) ωi :=1/p1+2eγτi, τi+1 :=τiωi, and σi+1 :=σi/ωi.

This yieldsO(1/N2)convergence ofkxN bxk2to zero.

Proof of convergence of iterates. We formulate the method in our proximal point framework fol-lowing [25,12] by taking as the preconditioner

Mi+1 =

I τiK∗ −σiK I

and Vi+1 =0.

As the step length and testing operators we takeTi =τiI,Σi+1 = σi+1I,Φi =ϕiI,Ψi+1 =ψi+1I. We also writeeΓ:=eγ I. Takingi+1(ub):=1

2kui+1−uikZi2+1Mi+1, we reduce (CI-Γ) to

(3.8) 1

2ku

i+1ubk2

Di+2 ≥0 for Di+2:=Zi+1(Ξi+1(eΓ)+Mi+1) −Zi+2Mi+2.

We may expand

Zi+1Mi+1 =

ϕiI −ϕiτiK∗ −ψi+1σiK ψi+1I

, and

(3.9a)

Di+2=

(ϕi(1+2γτie ) −ϕi+1)I (ϕiτi+ϕi+1τi+1)K∗

(ψi+2σi+1−2ψi+1σi+1−ψi+1σi)K (ψi+1−ψi+2)I

. (3.9b)

We havek · kDi+2 =0(but notDi+2=0, as the former depends on the off-diagonals cancelling out), andZi+1Mi+1is self-adjoint, if for some constantψ we take

(3.10) ϕi+1 :=ϕi(1+2γτei), τi :=ϕi−1/2, σi :=ϕiτi/ψ, and ψi+1 :=ψ. This gives the acceleration scheme (3.7). Moreover, for anyδ ∈ (0,1)holds

(3.11) Zi+1Mi+1 ≥

δϕiI 0

0 ψ I− (1−δ)−1KK

(15)

ThusZi+1Mi+1 ≥ 0ifψ ≥ (1−δ)−1kKk2. By (3.10),σiτi =1/ψ. Since this fixes the ratio ofσi to τi, we need to takeψ :=1/(σ0τ0)as well asδ := 1−σ0τ0kKk2. Through the positivity ofδ, we recover the initialisation condition (3.6).

Theorem 3.1andProposition 2.5show weak convergence of the iterates without a rate. IfG is strongly convex with factorγ 0, so that alsoeγ > 0, the results in [6,25] show thatτN is of the orderO(1/N), and consequentlyϕN is of the orderΘ(N2). ByProposition 2.4,kxN−bxk2

converges to zero at the rateO(1/N2).

Example 3.3 (Alternating Directions Method of Multipliers, briefly). The classical ADMM

[11] and Douglas–Rachford splitting [10] are known to be related to the Chambolle–Pock method; in fact the Chambolle–Pock method is a preconditioned ADMM [6]. From [3, Sec-tion 5], we can deduce that compared to the Chambolle–Pock method, the ADMM merely has the sign ofK reversed in

Mi+1=

I τiK σiK I

.

Takingτi = τ0 andσi = σ0 constant and satisfying (3.6), the iterates converge weakly. Acceleration can provideO(1/N)convergence ofkxN −bxk2.

Proof of convergence. FollowingExample 3.2, we now expand

Di+2=

(ϕi(1+2γτei) −ϕi+1)I (3ϕiτi−ϕi+1τi+1)K∗

(ψi+1σi−2ψi+1σi+1−ψi+2σi+1)K (ψi+1−ψi+2)I

.

This time k · kDi+2 =0andZi+1Mi+1is self-adjoint if we take

(3.12) ϕi+1 :=ϕi(1+2eγ τi), τi+1 :=τiϕi/ϕi+1, σi:=ϕiτi/ψ, and ψi+1 :=ψ.

Ifγe=0, which corresponds to the standard ADMM with fixed step lengths, it is easy to retrace

the steps ofExample 3.2to prove weak convergence (without a rate). Ifeγ ,0, we obtainϕN+1 =

ϕN +2γ τeN−1ϕN−1 = ϕN +2γτe0ϕ0 = ϕ0+2Neγτ0ϕ0. Therefore, the acceleration scheme (3.12)

only gives the rateO(1/N).

Example 3.4 (Chambolle–Pock with a forward step). SupposeG=G0+JwithG(strongly)

convex with factorγ ≥ 0, and∇J Lipschitz with factorL. (J does not have to be convex.) In [7], the Chambolle–Pock method was extended to take forward steps with respect to J. With everything else as inExample 3.2, takeVi

+1(u) := (τi(∇J(x

i

) − ∇J(x)),0). Then (PP) can be rearranged as

xi+1 :

=(I+τi∂G0)−1(xi−τi∇J(xi) −τiK∗yi), (3.13)

¯

xi+1 :

=ωi(xi+1−xi)+xi+1, (3.14)

yi+1 :

=(I+σi+1∂F∗)−

1

(yi+σi+1Kx¯

i+1

). (3.15)

(16)

update rules (3.7), and initialiseτ0,σ0>0subject to (3.6), and

(3.16) 0<θ :=1−Lτ0/(1−τ0σ0kKk2).

Proof of convergence. WithDi+2as in (3.8), the condition (CI-Γ) becomes

(3.17) 1

2ku

i+1

−uikZi2 +1Mi+1+

1 2ku

i+1

−ubkDi2

+2 +τiϕih∇J(x i

) − ∇J(bx),xi+1

−bxi ≥ −i +1(bu). The rules (3.10) forcek · kDi+2 =0. Applying the estimate (2.9) to J, (3.17) becomes

1 2ku

i+1

−uikZi2+1Mi+1− τiϕiL

4 kx

i+1

−xik2≥ −∆i+1(ub).

We take∆i

+1(ub)=−θ2kui+1−uikZi2

+1Mi+1for someθ > 0, and deduce using Cauchy’s inequality

that this condition holds if

(1θ)Zi+1Mi+1 ≥τiϕiL

I 0 0 0

.

Recalling (3.11), this is true if(1θ)δϕi ≥ τiϕiLandψ ≥ (1−δ)−1ϕiτi2kKk2. Further recalling (3.10), and observing that{τi}is non-increasing, we only have to satisfy(1−θ)(1−τ0σ0kKk2) ≥

Lτ0. Otherwise put, we obtain (3.16).

Example 3.5 (GIST). SupposeG(x)= 21kf Axk2,kAk<2, andkKk ≤1. Take

Vi+1(u):=

∇G(xi) − ∇G(x)

0

, and Mi+1 :=

I 0 0 I−KK∗

.

WithTi :=IandΣi+1 :=I, we then obtain the Generalised Iterative Soft Thresholding (GIST)

algorithm of [16]

yi+1 :

=(I+∂F∗)−1((I−KK∗)yi+K(xi − ∇G(xi))),

xi+1 :

=xi− ∇G(xi) −K∗yi+1.

The iterates{xi}i∈Nconverge weakly tobx.

Proof of convergence. ClearlyZi+1Mi+1 is positive semi-definite self-adjoint. AlsoG satisfies (G-PM) withΓ=AA. If we takeΦi =I andΨi

+1=I, then

Di+2 :=Zi+1(Ξi+1(eΓ)+Mi+1) −Zi+2Mi+2=

2A∗A 2K∗ −2K 0

.

Thus 1 2kuk

2

Di+2 = kxk 2

A∗A. Eliminating∂F∗by monotonicity, (CI-Γ) thus holds if

1 2ku

i+1

−uikZi2

+1Mi+1 +kx i+1

(17)

Expanding and usingkKk<1, we see this to hold when

1 2kx

i+1

−xik2+kxi+1−bxkA2∗A+hA∗A(xi −xi+1),xi+1−bxi ≥ −∆i+1(ub).

Our assumption kAk < √2guarantees 21(A∗A)2 < A∗A. Cauchy’s inequality therefore shows that we can take∆i+1 =c

2kxi+1−xik2for somec >0. UsingTheorem 3.1andProposition 2.4,

we obtain weak convergence.

4 the ergodic duality gap and stochastic methods

We now study the extension of the testing approach ofSection 2.2to produce the convergence of an ergodic duality gap. Throughout this section, we are in the saddle point setup ofSection 3. In particular,H is as in (3.1), and the step length and testing operatorsWi+1andZi+1as in (3.2).

4.1 preliminary gap estimates

Our first lemma demonstrates how to obtain a “preliminary” gapG

i+1(u) fromH. If the step

lengths and tests are scalar,Ti = τiI, andΦi =ϕiI, etc., and satisfyτiϕi = σiψi+1, it is easy to

bound this preliminary gap from below byτiϕi times the conventional duality gap

(4.1) G(x,y):= G(x)+hyb,Kxi −F(yb) G(bx)+hy,Kbxi −F∗(y).

To do the same for more general step length operators, we will inSection 4.2introduce abstract notions of convexity that incorporate ergodicity and stochasticity.

Lemma 4.1. Let us be givenK ∈ L(X;Y),G ∈ C(X), andF∗ ∈ C(Y)on Hilbert spacesX andY.

For eachi ∈N, letTi,Φi ∈ L(X;X)andΣi

+1,Ψi+1 ∈ L(Y;Y). Then for anyeΓ ∈ L(X;X),

(4.2) hH(ui+1

),ui+1

−ubiZi+1Wi+1 =Gi′+1(ui+1;eΓ)+

1 2ku

i+1

−ubkZi2 +1Ξi+1(eΓ)

,

where the “preliminary gap”

Gi′+1(u;eΓ):=h∂G(x),x−bxiΦiTi − kx−bxk 2

ΦiTieΓ+h∂F

(y),ybyi Ψi+1Σi+1

− hby,(KTi∗Φ∗i Ψi

+1Σi+1K)xbi − hy,Ψi+1Σi+1Kbxi+hyb,KTi∗Φ∗ixi.

Proof. Similarly to the proof ofTheorem 3.1, we have

hH(ui+1

),ui+1

−buiZi+1Wi+1 = h∂G(x i+1

),xi+1

−bxiΦiTi +hΦiTiK∗yi+1,xi+1−bxi +h∂F∗(yi+1),yi+1−byiΨi+1Σi+1− hΨi+1Σi+1Kx

i+1,yi+1

(18)

A little bit of reorganisation gives (4.2). Indeed

hH(ui+1

),ui+1

−buiZi+1Wi+1 = h∂G(x i+1

),xi+1

−bxiΦiTi − kxi+1−bxk2 ΦiTieΓ

+h∂F∗(yi+1),yi+1−byiΨi+1Σi+1+kx i+1

−xbkΦ2iTieΓ

+hyi+1−y,b(KTi∗Φ∗i −Ψi+1Σi+1K)(xi+1−bx)i

− hby,(KTi∗Φ∗i Ψi+1Σi+1K)bxi

− hyi+1,Ψ

i+1Σi+1Kbxi+hby,KTi∗Φ∗ixi+1i

= Gi′+1(ui+1;eΓ)+

1 2ku

i+1

−bukZi2 +1Ξi+1(eΓ)

.

The next lemma extendsTheorem 3.1to estimate the preliminary gap.

Lemma 4.2. Let us be givenK ∈ L(X;Y),G ∈ C(X), andF∗∈ C(Y)on Hilbert spacesX andY.

For eachiN, letTi,Φi ∈ R(L(X;X))andΣi+1,Ψi+1∈ R(L(Y;Y)), as well asVi+1 ∈ R(X×Y → X ×Y)andMi+1 ∈ R(L(X×Y;X ×Y)). LetH given by(3.1),Zi+1andWi+1by(3.2), andVi+1by (2.3). Suppose(PP)is solvable, and denote the iterates byui =(xi,yi). IfZi+1Mi+1 is self-adjoint,

and

(CI-G) 1

2ku

i+1

−uikZi2 +1Mi+1+

1 2ku

i+1

−buk2

Zi+1(Ξi+1(eΓ)+Mi+1)−Zi+2Mi+2

+hVi′+1(ui+1),ui+1−buiZi+1 ≥ −e∆i+1(ub)

for someeΓ ∈ L(X;X), then

(4.3) 1

2ku

N

−bukZN2

+1MN+1 + N−1 Õ

i=0

Gi′+1(u

i+1;eΓ) ≤ 1 2ku

0

−bukZ2 1M1 +

N−1 Õ

i=0 e

i

+1(bu) (N ≥1).

Proof. Inserting (4.2) into (CI-G) proves (CI∼) for∆i

+1(ub) := e∆i+1(ub) − Gi+1(ui+1;eΓ). Now we

useTheorem 2.1.

The problem with the aboveLemma 4.2 is that it loses∂F from the condition (CI-G) com-pared to (CI-Γ). Thus (CI-G) can be more difficult to satisfy for particular preconditioners that are related to∂F, such as the forward–backward splitting inExample 2.5. Fortunately, there is a remedy: to study a one-sided gap that provides no indication of the convergence of the dual variable.

Lemma 4.3. Let us be givenK ∈ L(X;Y),G ∈ C(X), andF∗∈ C(Y)on Hilbert spacesX andY.

For eachiN, letTi,Φi ∈ R(L(X;X))andΣi+1,Ψi+1∈ R(L(Y;Y)), as well asVi′+1 ∈ R(X×Y → X ×Y)andMi+1 ∈ R(L(X×Y;X ×Y)). LetH given by(3.1),Zi+1andWi+1by(3.2), andVi+1by (2.3). Suppose(PP)is solvable, and denote the iterates byui =(xi,yi). IfZi+1Mi+1 is self-adjoint,

and(CI-Γ)holds for some∈ L(X;X), then

(4.4) 1

2ku

N

−ubkZN2

+1MN+1 + NÕ−1

i=0

Gi′+1(x

i+1,by;eΓ

) ≤ 21ku0bukZ2 1M1 +

N−1 Õ

i=0 ∆i

(19)

Proof. Let us write(Hx(u),Hy(u)):=H(u). Then0∈Hy(by). We may thus expand

Gi′+1(ui+1;eΓ)= Gi′+1(ui+1;eΓ) − hHy(yb),yi+1−ybiΨi+1Σi+1

= Gi′+1(ui+1;eΓ) − h∂F∗(by),yi+1−byiΨi+1Σi+1 +hΨi+1Σi+1K∗xb,yi+1−byi

= Gi′+1(x

i+1,by;eΓ)

+h∂F∗(yi+1) −∂F∗(by),yi+1−byiΨi+1Σi+1. Inserting this into (4.2), we obtain

hH(ui+1

),ui+1

−ubiZi+1Wi+1 =Gi′+1(xi+1,by;eΓ)+

1 2ku

i+1

−ubkZi2 +1Ξi+1(eΓ)

+h∂F∗(yi+1) −∂F∗(by),yi+1−byiΨi+1Σi+1. (4.5)

We writee∆i

+1 for the ∆i+1 for which (CI-Γ) holds. Inserting (4.5) into (CI-Γ) proves (CI∼) for ∆i

+1(ub):=e∆i+1(ub) − Gi′+1(x

i+1,by;). The rest follows fromTheorem 2.1.

4.2 conversion of preliminary gaps to ergodic gaps

The “preliminary gaps” are not as such very useful. To go further, the abstract monotonicity assumptions (G-PM) and (F∗-PM) are not enough, and we need analogous convexity

formula-tions. We formulate these conditions directly in the stochastic setting. Towards this end we introduce the following notation:

Definition 4.1. We writex ∈ R(X)ifT is anT-valued random variable:x : Ω X for some

(in the present work fixed) probability space(Ω,O), whereOis aσ-algebra onΩ. We denote byEthe expectation with respect to a probability measurePon Ω. As is common, we abuse notation and writex =x(ω)for the unknown random realisationω ∈Ω.

We refer to [23] for more details on measure-theoretic probability. From now on, we assume for allN 1that wheneverTie(:=ΦiTi) ∈ R(T )andxi+1 ∈ R(X)for eachi=0, . . . ,N−1with

ÍN−1

i=0 E[Tei]=I, then for some0 ≤Γ ∈ L(X;X)holds

(G-EC) G

N−1 Õ

i=0

E[Tei∗xi+1

] !

−G(bx) ≥ N−1 Õ

i=0

Eh∂G(xi+1

),xi+1

−bxiTie +

1 2kx

i+1

−bxkTi2eΓ

.

Analogously, we assume foreΣi

+1(:=Ψi+1Σi+1) ∈ R(S)andyi+1 ∈ R(Y)for eachi=0, . . . ,N−1 withÍNi=01E[eΣi

+1]=I that

(F∗-EC) F∗ N−1 Õ

i=0 E[eΣ∗i

+1y

i+1

] !

−F∗(by) ≥ N−1 Õ

i=0

Eh∂F∗(yi+1

),yi+1

−ybieΣi+1

.

Example 4.1 (Block-separable structure, ergodic convexity). LetG andT have the

separa-ble structure of Example 3.1. We claim that (G-EC) holds. Indeed, let us introduceTie :=

Ím

j=1eτj,iPj ≥0, satisfying ÍN−1

i=0 E[eτj,i]=1for eachj =1, . . . ,m. Splitting (G-EC) into

(20)

to be true if for allj =1, . . . ,mholds

(4.6) Gj

N−1 Õ

i=0

E[eτj,iPjx i+1

] !

−Gj(Pjbx) ≥ N−1 Õ

i=0

Eeτi Gj(Pjxi+1

) −Gj(Pjbx).

The right hand side can also be written as∫ΩNGj(Pjx i

)) −Gj(Pjxb)dµN(i,ω)for the mea-sureµN :=eτjÍNi=−01δi×Pon the domainΩN :={0, . . . ,N−1} ×Ω. Using our assumption ÍN1

i=0 E[eτj,i]=1, we deduceµ N(N)

=1. An application of Jensen’s inequality now shows

(4.6). Therefore (G-EC) is satisfied.

We also assume that either

E[ΦiTi]=ηiI,¯ and E[Ψi+1Σi+1]=ηiI,¯ (i ≥1), (CG)

or

E[ΦiTi]=η¯iI, and E[ΨiΣi]=η¯iI, (i 1), (CG∗)

As will see in Example 4.2, (CG∗) is satisfied by the accelerated Chambolle–Pock method of

Example 3.2. In our companion paper [24], we will however see that (CG) is required to develop doubly-stochastic methods.

With these, and the gap functional G from (4.1), we derive the next two lemmas that are meant to be used in combination with eitherLemma 4.2orLemma 4.3, to estimate the sum of the preliminary gaps therein. For this, the expectation needs to be taken in the estimates of the latter. All of these different combinations will be summarised inTheorem 4.6after the lemmas.

Lemma 4.4. Suppose(G-EC),(F∗-EC), and (CG)hold. Set

(4.7) ζN :=

N−1 Õ

i=0 ¯

ηi,

and for{(xi,yi)}iN

=1 ⊂X ×Y, define the ergodic sequences (4.8) exN :=ζN−1E

"N1 Õ

i=0

Ti∗Φ∗ixi+1 #

, and eyN :=ζN−1E

"N−1 Õ

i=0 Σ∗i

+1Ψi∗+1yi+1 #

.

Then

N−1 Õ

i=0

E[Gi′+1(x

i+1,yi+1;Γ/2)] ≥ζ

NG(exN,eyN).

Proof. Using (CG), (G-EC), and (F∗-EC), we compute

N−1 Õ

i=0

E[Gi′+1(xi+1,yi+1;Γ/2)]= N−1 Õ

i=0

Ehh∂G(xi+1

),xi+1

−bxiΦiTi

−kxi+1

−bxkΦ2iTiΓ/2+h∂F∗(y i+1

),yi+1

−byiΨi+1Σi+1 i

−ζNheyN,Kbxi+ζNhy,bKexNi ≥ζNG(exN,eyN).

(21)

Lemma 4.5. Suppose(G-PM),(F∗-PM),(G-EC),(F∗-EC), and(CG∗)hold. Set

(4.9) ζ,N :=

N−1 Õ

i=1

¯

ηi,

and for{(xi,yi)}N

i=1 ⊂X ×Y, define the ergodic sequences

(4.10) ex,N :=ζ −1 ∗,NE

"N1 Õ

i=1 Ti∗Φ∗

ixi+1 #

, and ey,N :=ζ

−1 ∗,NE

"N1 Õ

i=1 Σ∗

iΨi∗yi #

.

Then

N−1 Õ

i=0

E[Gi′+1(x

i+1,yi+1;Γ/2)] ≥ζ

NG(ex∗,N,ey∗,N).

Proof. Using (G-PM) and (OC), we deduce

G1′(x1,y1;Γ/2) ≥ h∂F∗(y1),y1−byiΨ1Σ1 +hby,Ψ1Σ1Kxbi − hy1,Ψ1Σ1Kbxi.

Likewise (F∗-PM) and (OC) give

GN′ (x

N,yN;Γ

/2) ≥ h∂G(xN),xN bxiΦN1TN1− kx N

−bxkΦ2N1TN1Γ/2 − hby,KTN1Φ∗

N−1bxi+hby,KTN∗−1Φ∗N−1xNi. Shifting indices ofyi by one compared toG′

i+1, we define

G∗′,i+1 :=h∂G(x

i+1),xi+1bxi

ΦiTi − kx

i+1bxk2

ΦiTiΓ/2+h∂F∗(y i

),Σ∗iΨi(yiby)i

− hby,(KTi∗Φ∗

i −ΨiΣiK)xbi − hyi,ΨiΣiKbxi+hyb,KTi∗Φ∗ixi+1i. Correspondingly reorganising terms, we observe

N−1 Õ

i=0

Gi′+1(xi+1,yi+1;Γ/2)=G1′(x1,y1)+GN′(xN,yN;Γ/2)

+

NÕ−2

i=1

Gi′+1(xi+1,yi+1;Γ/2) ≥ N−1 Õ

i=1

G′,i+1 .

We now estimateÍN−i=11E[G′,i+1]analogously to the proof ofLemma 4.4.

The next theorem is our main result for saddle point problems.

Theorem 4.6. Let us be givenK ∈ L(X;Y),G ∈ C(X), andF∗∈ C(Y)on Hilbert spacesX andY,

satisfying(G-PM)and(F∗-PM)for some0 Γ∈ L(X;X). For eachi N, letTii ∈ R(L(X;X))

andΣi

+1,Ψi+1 ∈ R(L(Y;Y))be such thatΦiTi ∈ R(T )andΨi+1Σi+1 ∈ R(S). Also takeVi+1

(22)

andVi+1by(2.3). Suppose(PP)is solvable, and denote the iterates byui = (xi,yi). Letbu= (bx,by)

be a solution to(OC). Assuming one of the following cases to hold, let

e дN :=

      

0, eΓ= Γ, (CI-Γ)holds,

ζNG(exN,by), eΓ= Γ/2;(CI-Γ),(G-EC),(F∗-EC)and(CG)hold, ζ,NG(ex∗,N,by), e

Γ= Γ/2;(CI-Γ),(G-EC),(F-EC)and(CG)hold, ζNG(exN,eyN), eΓ= Γ/2;(CI-G),(G-EC),(F-EC)and (CG)hold, ζ,NG(ex∗,N,ey∗,N), e

Γ= Γ/2;(CI-G),(G-EC),(F-EC)and (CG)hold.

IfZi+1Mi+1 is self-adjoint, then

(DI-G) 1

2E

h

kuN bukZN2 +1MN+1

i

+дeN ≤ ku0−bukZ21M1+ N−1 Õ

i=0

E[∆i +1(bu)].

Proof. The caseeдN =0is simply the result of taking the expectation in the claim ofTheorem 3.1. The remaining cases follow by taking the expectation in different combinations ofLemma 4.2 or4.3withLemma 4.4or4.5.

As an easy corollary, we obtain convergence of function values for the basic minimisation problemH = ∂G.

Corollary 4.7. Let us be givenG ∈ C(X), satisfying (G-PM)and (G-EC) for Γ = 0. For each

i N, letWi,Mi,Zi ∈ R(L(X;X)) as well asVi′ ∈ R(X → X). SupposeZiWi ∈ R(T ), that ZiMiis self-adjoint, that(CI)holds, and(PP)is solvable withH =∂GandVi+1as in(2.3). Suppose

E[ZiWi]=ηiI¯ for someηi¯ >0. Letub∈ [∂G]−1(0). Then the iterates{ui}i∈Nof (PP)satisfy(DI-G)

with

e

дN :=ζN(G(euN) −G(bu)),

where

e

uN :=ζN−1E "N−1

Õ

i=0

Wi∗Zi∗ui+1 #

, ζN =

N−1 Õ

i=0

¯

ηi.

Proof. IntroducingK :=0andF∗ ≡0(orF∗ ≡δ{0}), we can write the original problem in the saddle point form (S). Then the gapG(x,y) =G(x)−G(bx)measures the convergence of function

values. We can also extend the method for (PP) withH = ∂G to the saddle point problem by

choosing

¯

Vi+1(u)=(Vi′+1(x),0), and Mi¯ +1:=

Mi+1 0

0 0

,

as well asTi := Wi,Φi := Zi. We also denote byW¯i+1 andZ¯i+1 the step length and testing operators for the saddle point problem. NowTiΦi = η¯iI, so we can chooseΨi+1 = ψi+1I and Σi+1 = σi+1I such that (CG) holds and indeedKT

iΦ∗i = Ψi+1Σi+1K. The latter causes the off-diagonal components ofZi¯+1Ξi+1(eΓ)to cancel. Consequently (CI-Γ) holds foreΓ = 0by virtue

(23)

4.3 primal–dual examples revisited

We now study gap estimates for several of the examples fromSection 3.

Lemma 4.8. SupposeG ∈ C(X)is (strongly) convex with factorγ 0,Ti =τiIandΦi =ϕiI, and

T =[0,∞)I. Then both(G-PM)and (G-EC)hold withΓ=γ I.

Proof. This follows fromExample 4.1withm=1.

Suppose we have a method for (S) that satisfies the conditions of the earlierTheorem 3.1 witheΓ = Γ = γ I,Ti = τiI,Φi = ϕiI,Σi

+1 = σi+1, andΨi+1 = ψi+1I. This includes the exam-ples ofSection 3.2. ThenLemma 4.8proves all of (G-EC), (F∗-EC), (G-PM) and (F-PM). To use

Theorem 4.6, it remains to prove either (CI-Γ) or (CI-G) with=(γ/2)I instead of=γ I, and either (CG) or (CG∗). The conditions (CG) or (CG∗) we reduce to

(4.11) either ϕiτi =ψi+1σi+1 or ϕiτi =ψiσi.

If these conditions are satisfied, and∆i

+1 ≤ 0, we get fromTheorem 4.6the convergence of

G(xeN,by)orG(ex∗,N,by)to zero at the respective rateO(1/ζN)or(1/ζ∗,N).

Let us now return to the primal–dual examples ofSection 3.2. In the accelerated variants, we took arbitraryeγ ∈ [0,γ], and proved (CI-Γ) for=eγ I. Therefore, it now suffices to restricteγ

[0,γ/2]to satisfy (CI-Γ) forTheorem 4.6. We can also eliminateFfrom (CI-Γ) by monotonicity, so (CI-G) also holds in that case.

Example 4.2 (Chambolle–Pock gap). The Chambolle–Pock method ofExample 3.2satisfies

the second part of (4.11), and we haveζ,N = ÍN−1

i=1 ϕ 1/2

i as well as∆i+1 ≤ 0. In the unac-celerated case (γe = 0), we getζ∗,N = N ϕ

1/2

0 . Therefore, according to the remarks in the previous paragraph, we getO(1/N)convergence ofG(ex,N,ey∗,N)to zero. In the accelerated

caseeγ ∈ (0,γ/2],ϕiis of the orderΘ(i2). Therefore alsoζ

,N is of the orderΘ(N

2), so we get

O(1/N2)convergence ofG(ex,N,ye∗,N)to zero. The convergence ofG(xe∗,N,by)is analogous.

Example 4.3 (ADMM gap). The ADMM ofExample 3.3also satisfies the second part of (4.11).

We recall thatϕiτi = constant. ThereforeζN is always of the orderΘ(N). We now get the convergence ofG(exN,yeN)to zero at the rateO(1/N)with or without the step length update scheme (3.12).

Example 4.4 (GIST gap). The GIST ofExample 3.5satisfies either of the conditions in (4.11),

(24)

4.4 basic examples revisited

LetH = ∂G forG ∈ C(U), and consider a method satisfying the conditions of Theorem 2.1

withWi =τiI,Zi =ϕiI. This includes many of our examples inSection 2.3.Lemma 4.8proves

(G-EC), so the conditions ofCorollary 4.7are satisfied withζN =ÍNi=−11τiϕi. Therefore,G(exN) converges toG(bx)at the rateO(1/ζN).

Example 4.5 (Gradient descent function value). For the gradient descent method ofExample 2.3,

we haveτi = τ andϕi = ϕ constants, so we obtainO(1/N) rate. Similarly we can obtain

O(1/N2) convergence for the accelerated variant from Example 2.4as long as we choose e

γ ∈ (0,γ/2].

Example 4.6 (Forward–backward splitting function value). As we recall fromExample 2.5,

forward–backward splitting has the same convergence properties as gradient descent. There-foreExample 4.5characterises convergence of the function values.

Example 4.7 (Newton’s method function value). For Newton’s method inExample 2.7, we

haveτi =1andϕN := (2κ)Nϕ0forκ ∈ (1/2,1). We therefore obtain linear convergence of the function values.

4.5 stochastic examples

We now exploit the fact that the step lengthWi+1can be a non-invertible operator. We observe that in a stochastic setting, we only need the expectationE[i

+1]inCorollary 4.7andTheorem 4.6. Therefore, we can relax the relevant condition (CI∼), (CI), (CI-Γ), or (CI-G) to the expectation. This may produce more lenient step length and other conditions. Here we demonstrate the flexibility of our techniques with a few basic examples. We refer to the review article [26] for an introduction and further references to stochastic coordinate descent, and to our companion paper [24] for primal–dual methods based on the work here.

Definition 4.2. We write(P1, . . . ,Pm) ∈ P(U)ifP1, . . . ,Pm are projection operators inU with

Ím

j=1Pj =I, andPjPi =0fori,j. For randomS(i) ⊂ {1, . . . ,m}, we then set PS(i):=

Õ

j∈S(i)

Pj, and ΠS(i):= Õ

j∈S(i)

πj−,i1Pj, where πj,i :=P[j ∈S(i)]> 0.

For smoothG ∈ C(U), we letLS(i)> 0be theΠS(i)-relative smoothness factor (seeLemmab.1), satisfying

(4.12) L−S(i)1 k∇G(u) − ∇G(v)kΠ2S(i) ≤ h∇G(u) − ∇G(v),u−vi, (u,v ∈U).

(25)

Example 4.8 (Stochastic gradient descent). LetG ∈ C(U) have Lipschitz gradient, and

(P1, . . . ,Pm) ∈ P(U). For eachi ∈N, take randomS(i) ⊂ {1, . . . ,n}, and set

(4.13) Wi+1 :=τiΠS(i), Mi+1 :=I, and Vi′+1(u):=Wi+1[∇G(ui) − ∇G(u)].

Then (PP) says that we take gradient step on the random subspacerange(ΠS(i)):

(4.14) ui+1

=ui−τiΠS(i)∇G(ui).

If the step lengths are deterministic and satisfy ϵ ≤ τiLS(i) < 2πj,i for all j ∈ S(i) for

someϵ > 0, we haveE[G(ueN)] →G(ub)at the rateO(1/N). Through the use of the “local” smoothness factorsLS(i), the method may be able to take larger stepsτi than those allowed by the global factorLinExample 2.3.

Proof of convergence. TakingZi+1 := I,Lemma 4.8shows thatGsatisfies (G-EC) (withΓ = 0).

We can also simply defineη¯i := E[ZiWi]. ThenζN = ÍNi=−01E[ZiWi] ≥

ÍN−1

i=0 E[Wi]ϵ ≥ ϵI.

ThereforeCorollary 4.7andProposition 2.4show the desired convergence provided we verify (CI∼). We do this through (CI), which withUi

+1=U now reads 1

2ku−u

i

k2+ϕτih∇G(ui) − ∇G(u∗),u−u∗iΠS(i) ≥ −∆i+1(u∗;u).

We have

E[h∇G(ui) − ∇G(u∗),uiu∗iΠS(i)]=E[h∇G(u i

) − ∇G(u∗),uiu∗iE[ΠS (i)|i−1]]

=E[h∇G(ui) − ∇G(u∗),ui−u∗i].

Similarly to (2.9), we may thus estimate

E[h∇G(ui) − ∇G(u∗),u−u∗iΠS (i)]

=E[h∇G(ui) − ∇G(u∗),ui −u∗i]+E[h∇G(ui) − ∇G(u∗),u−uiiΠS(i)] ≥E

h

h∇G(ui) − ∇G(u∗),uiu∗i −L−S(i)1 k∇G(ui) − ∇G(u∗)k2ΠS(i)

−LS4(i)kuuik2ΠS(i) i

.

Using (4.12), we see that (CI∗) is verified with

(4.15) E[∆i+1(u;u)]=E

"m Õ

j=1

1−τiπj−1 ,iLS(i)/2

2 kPj(u−u

i )k2

# .

This satisfies∆i

+1(u∗;u) ≤0under our step length assumptions.

References

Related documents

decision in Texas, as interpreted by the Texas asbestos multidistrict litigation judge (Mark Davidson), requires defendants to establish the proximity, frequency, and dura- tion

The objective of this work is to evaluate the effect of the modernization work carried out at Rio Grande Harbor access channel on a specific sand spit called Pontal Sul as

One of the pyroclastic unilS of the Alban Hills, the «Tufo di Villa Senni» has been dated with extremely good accuracy lit 338 ± 8 ka and it is 10 be considered as the best

We introduce a new class of coupled sequential fractional differential equations involving Riemann–Liouville and Caputo derivatives, integral and nonintegral type nonlinearities

Case study 1: Due to the facts that: (1) the hot regions of PPIs are pre-organized on the protein surface and complementarily packed between two proteins; (2) the hot spots in

Spain’s Podemos, Greece’s Syriza, France’s Parti de Gauche and Portugal’s Left Bloc and Communist Party, together with like-minded parties across Europe have been inspired in

Main stages of development and approbation of online course on elementary mathematics knowledge actualization for hearing-impaired and deaf students on the first

It is a conservative credit fund and there is no suggestion that the manager is not managing the fund in accordance with its investment parameters.. Accordingly, we do not see a