Generalized Conditional Gradient with Augmented Lagrangian for Composite Minimization

(1)

HAL Id: hal-02307114

https://hal.archives-ouvertes.fr/hal-02307114

Submitted on 7 Oct 2019

HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Generalized Conditional Gradient with Augmented

Lagrangian for Composite Minimization

Antonio Silveti-Falls, Cesare Molinari, Jalal M. Fadili

To cite this version:

Antonio Silveti-Falls, Cesare Molinari, Jalal M. Fadili. Generalized Conditional Gradient with Aug-mented Lagrangian for Composite Minimization. 2019. �hal-02307114�

(2)

Generalized Conditional Gradient with Augmented Lagrangian for

Composite Minimization

Antonio Silveti-Falls∗ Cesare Molinari∗ Jalal Fadili∗

Abstract.In this paper we propose a splitting scheme which hybridizes generalized conditional gradient with a prox-imal step which we call CGALP algorithm, for minimizing the sum of three proper convex and lower-semicontinuous functions in real Hilbert spaces. The minimization is subject to an affine constraint, that allows in particular to deal with composite problems (sum of more than three functions) in a separate way by the usual product space technique. While classical conditional gradient methods require Lipschitz-continuity of the gradient of the differentiable part of the objective, CGALP needs only differentiability (on an appropriate subset), hence circumventing the intricate ques-tion of Lipschitz continuity of gradients. For the two remaining funcques-tions in the objective, we do not require any additional regularity assumption. The second function, possibly nonsmooth, is assumed simple, i.e., the associated proximal mapping is easily computable. For the third function, again nonsmooth, we just assume that its domain is weakly compact and that a linearly perturbed minimization oracle is accessible. In particular, this last function can be chosen to be the indicator of a nonempty bounded closed convex set, in order to deal with additional constraints. Finally, the affine constraint is addressed by the augmented Lagrangian approach. Our analysis is carried out for a wide choice of algorithm parameters satisfying so called "open loop" rules. As main results, under mild conditions, we show asymptotic feasibility with respect to the affine constraint, boundedness of the dual multipliers, and convergence of the Lagrangian values to the saddle-point optimal value. We also provide (subsequential) rates of convergence for both the feasibility gap and the Lagrangian values.

Key words.Conditional gradient; Augmented Lagrangian; Composite minimization; Proximal mapping; Moreau envelope.

AMS subject classifications.49J52, 65K05, 65K10.

1 Introduction

1.1 Problem Statement

In this work, we consider the composite optimization problem, min

x∈Hp{

f(x) +g(T x) +h(x) : Ax=b}, (P₎

where_Hp,Hd,Hv are real Hilbert spaces (the subindicesp, dandvdenoting the “primal”, the “dual” and

an auxiliary space - respectively), endowed with the associated scalar products and norms (to be understood

∗

Normandie Université, ENSICAEN, UNICAEN, CNRS, GREYC, France. E-mail: [email protected], [email protected], [email protected].

(3)

from the context), A : _Hp → Hd and T : Hp → Hv are bounded linear operators, b ∈ Hd andf, g, h

are proper, convex, and lower semi-continuous functions with_C def= dom (h)being a weakly compact subset ofHp. We allow for someasymmetryin regularity between the functions involved in the objective. While g is assumed to be prox-friendly, for h we assume that it is easy to compute a linearly-perturbed oracle (see (1.2)). On the other hand, f is assumed to be differentiable and satisfies a condition that generalizes Lipschitz-continuity of the gradient (see Definition2.6).

Problem (P_{) can be seen as a generalization of the classical Frank-Wolfe problem in [}₁₅_{] of minimizing}

a Lipschitz-smooth functionf on a convex closed bounded subsetC ⊂ Hp,

min

x∈Hp{

f(x) : x_{∈ C}} _(1.1)

In fact, ifA_≡0,b_≡0,g_≡0, andh_≡ι_Cis the indicator function of_Cthen we recover exactly (1.1) from (P_).

1.2 Contribution

We develop and analyze a novel algorithm to solve (P_{) which combines penalization for the nonsmooth}

functiong with the augmented Lagrangian method for the affine constraintAx = b. In turn, this achieves full splitting of all the parts in the composite problem (P_{) by using the proximal mapping of} _g_(assumed

prox-friendly) and a linear oracle for hof the form (1.2). Our analysis shows that the sequence of iterates is asymptotically feasible for the affine constraint, that the sequence of dual variables converges weakly to a solution of the dual problem, that the associated Lagrangian converges to optimality, and establishes con-vergence rates for a family of sequences of step sizes and sequences of smoothing/penalization parameters which satisfy so-called "open loop" rules in the sense of [31] and [13]. This means that the allowable se-quences of parameters do not depend on the iterates, in contrast to a "closed loop" rule, e.g. line search or other adaptive step sizes. Our analysis also shows, in the case where (P_{) admits a unique minimizer, weak}

convergence of the whole sequence of primal iterates to the solution.

The structure of (P_{) generalizes (}_1.1_{) in several ways. First, we allow for a possibly nonsmooth term}_g_.

Second, we considerhbeyond the case of an indicator function where the linear oracle of the form

min

s∈Hh(s) +hx, si (1.2)

can be easily solved. Observe that (1.2) has a solution overdom(h)since the latter is weakly compact. This oracle is reminiscent of that in the generalized conditional gradient method [7,8,5,3]. Third, the regularity assumptions onf are also greatly weakened to go far beyond the standard Lipschitz gradient case. Finally, handling an affine constraint in our problem means that our framework can be applied to the splitting of a wide range of composite optimization problems, through a product space technique, including those involving finitely many functionshiandgi, and, in particular, intersection of finitely many nonempty bounded closed

convex sets; see Section5.

1.3 Relation to prior work

In the 1950’s Frank and Wolfe developed the so-called Frank-Wolfe algorithm in [15], also commonly referred to as the conditional gradient algorithm [24,12,13], for solving problems of the form (1.1). The main idea is to replace the objective function f with a linear model at each iteration and solve the resulting linear optimization problem; the solution to the linear model is used as a step direction and the next iterate is

(4)

computed as a convex combination of the current iterate and the step direction. We generalize this setting to include composite optimization problems involving both smooth and nonsmooth terms, intersection of multiple constraint sets, and also affine constraints.

Frank-Wolfe algorithms have received a lot of attention in the modern era due to their effectiveness in fields with high-dimensional problems like machine learning and signal processing (without being exhaustive, see, e.g., [20, 6, 22, 17, 39, 26, 10]). In the past, composite, constrained problems like (P_{) have been}

approached using proximal splitting methods, e.g. generalized forward-backward as developed in [32] or forward-douglas-rachford [25]. Such approaches require one to compute the proximal mapping associated to the functionh. Alternatively, when the objective function satisfies some regularity conditions and when the constraint set is well behaved, one can forgo computing a proximal mapping, instead computing a linear minimization oracle. The computation of the proximal step can be prohibitively expensive; for example, whenhis the indicator function of the nuclear norm ball, computing the proximal operator ofhrequires a full singular value decomposition while the linear minimization oracle over the nuclear norm ball requires only the leading singular vector to be computed ([21], [38]). Unfortunately, the regularity assumptions required by classical Frank-Wolfe style algorithms are too restrictive to apply to general problems like (P_).

While finalizing this work, we became aware of the recent work of [37], who independently developed a conditional gradient-based framework which allows one to solve composite optimization problems involving a Lipschitz-smooth functionf and a nonsmooth functiong,

min

x∈C {f(x) +g(T x)}. (1.3) The main idea is to replacegwith its Moreau envelope of indexβkat each iterationk, with the index

param-eterβkgoing to0. This is equivalent to partial minimization with a quadratic penalization term, as in our

algorithm. Like our algorithm, that of [37] is able to handle problems involving both smooth and nonsmooth terms, intersection of multiple constraint sets and affine constraints, however their algorithms employ differ-ent methods for these situations. Our algorithm uses an augmdiffer-ented Lagrangian to handle the affine constraint while the conditional gradient framework treats the affine constraint as a nonsmooth termgand uses penal-ization to smooth the indicator function corresponding to the affine constraint. In particular circumstances, outlined in more detail in Section6, our algorithms agree completely.

Another recent and parallel work to ours is that of [16], where the Frank-Wolfe via Augmented Lagrangian (FW-AL) is developed to approach the problem of minimizing a Lipschitz-smooth function over a convex, compact set with a linear constraint,

min

x∈C {f(x) : Ax= 0}. (1.4)

The main idea of FW-AL is to use the augmented Lagrangian to handle the linear constraint and then apply the classical augmented Lagrangian algorithm, except that the marginal minimization on the primal variable that is usually performed is replaced by an inner loop of Frank-Wolfe. It turns out that the problem they consider is a particular case of (P_{), discussed in Section}₆_.

1.4 Organization of the paper

In Section2we introduce the notation and review some necessary material from convex and real analysis. In Section3we present theConditionalGradient withAugmentedLagrangian andProximal-step(CGALP )

algorithm and the underlying assumptions. In Section4, we first state our main convergence results and then turn to their proof. The latter is divided in three main parts. First we show the asymptotic feasibility, then the

(5)

boundedness of the dual multiplier in the augmented Lagrangian and finally the optimality guarantees, i.e. weak convergence of the sequence(µk)_k_∈Nto a solution of the dual problem, weak subsequential convergence

of the sequence(xk)k∈Nto a solution of the primal problem, and convergence of the Lagrangian values, and

with convergence rates. In Section5we describe how our framework can be instantiated to solve a variety of composite optimization problems. In Section6we provide a more detailed discussion comparing CGALP to prior work. Some numerical results are reported in Section7.

For readers who are primarily interested in the practical perspective, we suggest skipping directly to Sec-tion3for the algorithms and its assumptions or Section4for the main convergence results.

2 Notation and Preliminaries

We first recall some important definitions and results from convex analysis. For a more comprehensive coverage we refer the interested reader to [4,30] and [33] in the finite dimensional case. Throughout, we let

Hdenote an arbitrary real Hilbert space andgan arbitrary function fromHto the real extended line, namely

g : _{H →} R_{∪ {}₊_∞}_{. The function}_g_{is said to belong to}_Γ₀₍_H₎_{if it is proper, convex, and lower}

semi-continuous. Thedomainofgis defined to bedom (g) def= _{x_{∈ H}:g(x)<+_∞}. TheLegendre-Fenchel conjugateofgis the functiong∗ :H →R_{∪ {}₊_∞}_{such that, for every}_u_{∈ H}_,

g∗(u)= supdef

x∈H{h

u, x_{i −}g(x)_}.

Notice that

g1≤g2 =⇒ g∗2 ≤g∗1. (2.1)

Moreau proximal mapping and envelope Theproximal operatorfor the functiongis defined to be

prox_g(x)= argmindef

y∈H

g(y) +1

2kx−yk 2

and itsMoreau envelope with parameterβas

gβ(x)= infdef y∈H g(y) + 1 2β kx−yk 2_. _(2.2) Denotingx+_{= prox}

g(x), we have the following classical inequality (see, for instance, [30, Chapter 6.2.1]):

for everyy∈ H,

2

g(x+)₋g(y)

+_kx+₋y_k2_{− k}x₋y_k2+_kx+₋x_k2_≤0. (2.3) We recall that thesubdifferential of the functiongis defined as the set-valued operator∂g :H → 2Hsuch that, for everyxin_H,

∂g(x) =

u_{∈ H}: g(y)_≥g(x) +_hu, y₋x_i _∀y_{∈ H} . (2.4) We denote dom(∂g) =def

x ∈ H : ∂g(x) 6= ∅ . When g belongs toΓ0(H), it is well-known that the

subdifferential is a maximal monotone operator. If, moreover, the function is Gâteaux differentiable atx_{∈ H}, then ∂g(x) = _{∇g(x)_}. For x _∈ dom(∂g), the minimal norm selection of ∂g(x) is defined to be the unique element

n

[∂g(x)]0odef= Argmin

y∈∂g(x) k

y_k. Then we have the following fundamental result about Moreau envelopes.

(6)

Proposition 2.1. Given a functiong_∈Γ0(H), we have the following:

(i) The Moreau envelope is convex, real-valued, and continuous.

(ii) Lax-Hopf formula: the Moreau envelope is the viscosity solution to the following Hamilton Jacobi equation: ( ∂ ∂βgβ(x) =−12 ∇_xgβ(x) 2 (x, β)_{∈ H ×}(0,+_∞) g0(x) =g(x) x_{∈ H}. (2.5)

(iii) The gradient of the Moreau envelope is _β1-Lipschitz continuous and is given by the expression

∇xgβ(x) = x₋prox_βg(x) β . (iv) _∀x_∈dom(∂g),∇gβ(x) ր [∂g(x)] 0 asβց0.

(v) _∀x_{∈ H},gβ₍_x₎_ր_g₍_x₎_as_β _ց₀_{. In addition, given two positive real numbers}_β′ _{< β}_{, for all}_x_{∈ H} we have 0_≤gβ′(x)₋gβ(x)_≤ β−β′ 2 ∇xg β′₍_x₎ 2 ; 0≤g(x)−gβ(x)≤ β 2 [∂g(x)] 0 2 .

Proof. (i): see [4, Proposition 12.15]. The proof for(ii)can be found in [2, Lemma 3.27 and Remark 3.32] (see also [19] or [1, Section 3.1]). The proof for claim(iii) can be found in [4, Proposition 12.29] and the proof for claim(iv)can be found in [4, Corollary 23.46]. For the first part in(v), see [4, Proposition 12.32(i)]. To show the first inequality in(v), combine(ii)and convexity of the functionβ 7→gβ(x)for everyx∈ H. The second inequality follows from the first one and(iv), taking the limit asβ′_→0.

Remark 2.2.

(i) While the regularity claim in Proposition2.1(iii)of the Moreau envelopegβ(x)w.r.t.xis well-known, a less known result is theC1_{-regularity w.r.t.}_β_{for any}_x_{∈ H}_(Proposition_2.1(ii)_{). To our knowledge,}

the proof goes back, at least, to the book of [2]. Though it has been rediscovered in the recent literature in less general settings.

(ii) For given functions H :_{H →} R_and_g₀ _:_{H →} R_{, a natural generalization of the Hamilton-Jacobi}

equation in (2.5) is

(

∂

∂βg(x, β) +H(∇xg(x, β)) = 0 (x, β)∈ H ×(0,+∞) g(x,0) =g(x) x∈ H.

Supposing thatH is convex and that lim

kpk→+∞H(p)/kpk = +∞, the solution of the above system is given by the Lax-Hopf formula (see [14, Theorem 5, Section 3.3.2]1):

g(x, t)def= inf y∈H g0(y) +tH∗ y₋x t . IfH(p) = 1₂_kp_k2, thenH∗₍_p_{) =} 1 2kpk

2 _{and we recover the result in Proposition}_2.1_.

1

(7)

Regularity of differentiable functions In what follows, we introduce some definitions related with regu-larity of differentiable functions. They will provide useful upper-bounds and descent properties. Notice that the the notions and results of this part are independent from convexity.

Definition 2.3. (ω-smoothness) Consider a functionω:R₊_→R₊_{such that}_ω_{(0) = 0}_and

ξ(s)def=

Z 1

0

ω(st)dt (2.6)

is non-decreasing. A differentiable functiong :_{H →}R_{is said to belong to}_C1,ω₍_H₎_{or to be}_ω_{-smooth if}

the following inequality is satisfied for everyx, y_{∈ H}:

k∇g(x)_{− ∇}g(y)_{k ≤}ω(_kx₋y_k).

Lemma 2.4. (ω-smooth Descent Lemma) Given a functiong _∈C1,ω₍_H₎_{we have the following inequality:}

for everyxandyinH,

g(y)−g(x)≤ h∇g(x), y−xi+ky−xkξ(ky−xk),

whereξis defined in(2.6).

Proof. We recall here the simple proof for completeness:

g(y)−g(x) = Z 1 0 d dtg(x+t(y−x))dt = Z 1 0 h∇ g(x), y−xidt+ Z 1 0 h∇ g(x+t(y−x))− ∇g(x), y−xidt ≤ h∇g(x), y−xi+ky−xk Z 1 0 k∇ g(x+t(y−x))− ∇g(x)kdt ≤ h∇g(x), y−xi+ky−xk Z 1 0 ω(tky−xk)dt,

where in the first inequality we used Cauchy-Schwartz and in the second Definition2.3. We conclude using the definition ofξ.

For L > 0 and ω(t) = Ltν, ν ∈]0,1], C1,ω(H) is the space of differentiable functions with Hölder continuous gradients, in which caseξ(s) =Lsν/(1 +ν)and the Descent Lemma reads

g(y)₋g(x)_{≤ h∇}g(x), y₋x_i+ L

1 +ν ky−xk

1+ν_, _(2.7)

see e.g., [27,28]. Whenν = 1, we have thatC1,ω(H)is the class of differentiable functions withL-Lipschitz continuous gradient, and one recovers the classical Descent Lemma.

Now, following [18], we introduce some notions that allow one to further generalize (2.7). Given a function

G:H →R_{∪ {}₊_∞}_{, differentiable on the open set}_C₀ _⊂_{int (dom (}_G₎₎_{, define the}_{Bregman divergence}_of

Gas the functionDG: dom (G)× C0 →R,

DG(x, y) =G(x)−G(y)− h∇G(y), x−yi. (2.8)

(8)

Lemma 2.5. (Generalized Descent Lemma, [18, Lemma 1]) LetGandgbe differentiable on_C0, whereC0

is an open subset ofint (dom (G)). Assume thatG₋gis convex on_C0. Then, for everyxandyinC0, g(y)≤g(x) +h∇g(x), y−xi+DG(y, x).

Proof. For our purpose, we intentionally weakened the hypothesis needed in the original result of [18, Lemma 1]. We repeat their argument but show the result is still valid under our weaker assumption. Let

xandybe inC0, where, by hypothesis,C0 is open and contained inint (dom (G)). AsG−gis convex and

differentiable on_C0, from the gradient inequality (2.4) we have, for ally∈ C0, (G₋g) (y)_≥(G₋g) (x) +_h∇(G₋g) (x), y₋x_i.

Rearranging the terms and using the definition ofDGin (2.8), we obtain the claim.

The previous lemma suggests the introduction of the following definition, which extends Definition2.3.

Definition 2.6. ((G, ζ)-smoothness) LetG:_{H →}R_{∪ {}₊_∞}_and_ζ _:]0_,_1]_→R₊_{. The pair}₍_g,_C₎_{, where}

g :_{H →}R_{∪ {}₊_∞}_and_{C ⊂} _dom(_g₎_{, is said to be}₍_{G, ζ}₎_{-smooth if there exists an open set}_C₀_{such that}

C ⊂ C0 ⊂int (dom (G))and

(i) Gandgare differentiable on_C0;

(ii) G−gis convex onC0; (iii) it holds K₍_G,ζ,_C₎ =def sup x,s∈C;γ∈]0,1] z=x+γ(s−x) DG(z, x) ζ(γ) < +∞. (2.9)

K₍_G,ζ,_C₎is a far-reaching generalization of the standard curvature constant widely used in the literature of conditional gradient.

Remark 2.7. Assume that(g,C) is(G, ζ)-smooth. Using first Lemma2.5and then the definition in (2.9), we have the following descent property: for everyx, s_{∈ C}and for everyγ _∈]0,1],

g(x+γ(s−x))≤g(x) +γh∇g(x), s−xi+DG(x+γ(s−x), x)

≤g(x) +γh∇g(x), s−xi+K(G,ζ,C)ζ(γ).

Notice that, as in the previous definition, we do not require Cto be convex. So, in general, the pointz =

x+γ(s₋x)may not lie in_C.

Lemma 2.8. Suppose that the setCis bounded and denote byd_C def= sup_x,y_∈Ckx−ykitsdiameter. Moreover, assume that the functiongisω-smooth on some open and convex subset_C0containingC. Setζ(γ)

def

=ξ(d_Cγ), whereξis given in(2.6). Then the pair(g,C)is(g, ζ)-smooth withK(g,ζ,C)≤dC.

Proof. WithG =gandgbeingω-smooth on_C0, bothGandgare differentiable onC0 andG−g ≡0is

convex on_C0. Thus, all conditions required in Definition2.6hold true. It then remains to show (2.9) with the

boundK(g,ζ,C)≤dC. First notice that, for everyx, s∈ Cand for everyγ∈]0,1], the pointz=x+γ(s−x) belongs to_C0. Indeed, C ⊂ C0 andC0 is convex by hypothesis. In particular, asg isω-smooth onC0, the

(9)

Descent Lemma2.4holds between the pointsxandz. Then K(g,ζ,C)= sup x,s∈C;γ∈]0,1] z=x+γ(s−x) Dg(z, x) ζ(γ) = sup x,s∈C;γ∈]0,1] z=x+γ(s−x) g(z)₋g(x)_{− h∇}g(x), z₋x_i ξ(d_Cγ) ≤ sup x,s∈C;γ∈]0,1] z=x+γ(s−x) kz₋x_kξ(_kz₋x_k) ξ(d_Cγ) = sup x,s∈C;γ∈]0,1] γ_ks₋x_kξ(γ_ks₋x_k) ξ(d_Cγ) ≤ sup γ∈]0,1] γd_Cξ(d_Cγ) ξ(d_Cγ) =dC.

In the first inequality we used Lemma2.4, while in the second we used that_ks₋x_{k ≤} d_C (bothx ands

belong to_C, that is bounded by hypothesis) and the monotonicity of the functionξ(see Definition2.3).

Indicator and support functions Given a subset_{C ⊂ H}, we define itsindicator functionasι_C(x) = 0 ifx belongs toC and ι_C(x) = +∞otherwise. Recall that, ifC is nonempty, closed, and convex, thenι_C

belongs to Γ0(H). Remember also the definition of the support function of C, σC def

= ι∗_C. Equivalently,

σ_C(x) = supdef _{hz, x_i : z_{∈ C}}. We denote byri (_C)therelative interior of the set _C(in finite dimension, it is the interior for the topology relative to its affine full). We denotepar(C)as the subspace parallel toC

which, in finite dimension, takes the formR₍_C₋_C₎_.

We have the following characterization of the support function from the relative interior in finite dimen-sion.

Proposition 2.9. ([35, Lemma 1]) LetHbe finite-dimensional andC ⊂ Ha nonempty, closed bounded and convex subset. If0_∈ri(_C), thenσ_C _∈Γ0(Rn)is sublinear, non-negative and finite-valued, and

σ_C(x) = 0 _⇐⇒ x_∈(par(_C))⊥.

Coercivity We recall that a functiongiscoerciveiflim_k_x_k→₊_∞g(x) = +_∞and that coercivity is equiv-alent to the boundedness of the sublevel-sets [4, Proposition 11.11]. We have the following result, that relates coercivity to properties of the Fenchel conjugate.

Proposition 2.10. ([4, Theorem 14.17]) GivenginΓ0(H),g∗is coercive if and only if0∈int (dom(g)).

Therecession function(sometimes referred to as the horizon function) ofg at a given pointd ∈ Rn_is

defined to begd,∞:Rn_→R_{∪ {}₊_∞}_{such that, for every}_x_∈Rn_,

gd,∞(x)def= lim

α→∞

g(d+αx)−g(d)

α .

Recall that, ifgis convex, the recession function is independent from the selection of the pointd_∈Rn_and

can be then simply denoted asg∞. In finite dimension, the following result relates coercivity to properties of the recession function.

(10)

Proposition 2.11. Letg_∈Γ0(Rn)andA:Rm→Rnbe a linear operator. Then,

(i) gcoercive _⇐⇒ g∞₍_x₎_>₀ _∀_x₆_{= 0}_. (ii) g∞≡σ_dom(_g∗₎.

(iii) (g_◦A)∞_≡g∞_◦A.

In particular, we deduce thatg_◦Ais coercive if and only ifσ_dom(_g∗₎(Ax)>0for everyx6= 0.

Proof. The proofs can be found in [34, Theorem 3.26], [34, Theorem 11.5] and [23, Corollary 3.2] respec-tively.

Real sequences We close this section with some definitions and lemmas for real sequences that will be used to prove the convergence properties of the algorithm. We denote ℓ+ as the set of all sequences in [0,+_∞[. Givenp _∈ [1,+_∞[,ℓp _{is the space of real sequences}₍_r

k)k∈Nsuch that _∞ P k=1| rk|p 1/p < +_∞. For p = +_∞, we denote byℓ∞ the space of bounded sequences. Furthermore, we will use the notation

ℓp₊=defℓp_∩ℓ+. In the next, we recall some key results about real sequences.

Lemma 2.12. ([11, Lemma 3.1]) Consider three sequences(rk)k∈N∈ℓ+,(ak)k∈N∈ℓ+, and(zk)k∈N∈ℓ1₊,

such that

rk+1 ≤rk−ak+zk, ∀k∈N.

Then(rk)k∈Nis convergent and(ak)_k_∈N∈ℓ1₊.

Lemma 2.13. ([36, Theorem 2] and [36, Proposition 2(ii)]) Consider two sequences (pk)k∈N ∈ ℓ+ and

(wk)k∈N∈ℓ+such that(pkwk)_k_∈N∈ℓ1₊and(pk)_k_∈N∈/ ℓ1. Then the following holds:

(i) there exists a subsequence wkj

j∈Nsuch that

wkj ≤P− 1

kj ,

wherePn=Pnk=1pk. In particular,lim inf

k wk= 0.

(ii) If moreover there exists a constantα >0such thatwk−wk+1 ≤αpkfor everyk∈N, then

lim

k wk= 0.

Lemma 2.14. Consider the sequences(rk)k∈N ∈ ℓ+,(pk)_k_∈N ∈ ℓ+,(wk)_k_∈N ∈ ℓ+, and(zk)_k_∈N ∈ ℓ+.

Suppose that(zk)k∈N∈ℓ1₊,(pk)_k_∈N∈/ ℓ1, and that, for someα >0, the following inequalities are satisfied

for everyk_∈N_:

rk+1≤rk−pkwk+zk; wk−wk+1 ≤αpk.

(2.10) Then,

(i) (rk)k∈Nis convergent and(pkwk)_k_∈N∈ℓ1+.

(ii) lim

k wk= 0.

(iii) For everyk_∈N_,_inf₁_≤_i_≤_k_w_i _≤₍_r₀₊_E₎_/P_k_{, where, again,}_P_n₌Pn

k=1pkandE=P+k=1∞zk.

(iv) There exists a subsequence wkj

j∈Nsuch that, for allj∈N,wkj ≤P− 1

kj . Proof. (i) See Lemma2.12.

(11)

(ii) Claim(ii)follows by combining(i)and Lemma2.13(ii).

(iii) Sum (2.10) using a telescoping property and summability of(zk)_k_∈N.

(iv) Claim(iv)follows by combining(i)and Lemma2.13(i).

Notice that the conclusions of Lemma2.14remain true if non-negativity of the sequence(rk)k∈Nis

re-placed with the assumption that it is bounded from below by a trivial translation argument. Observe also that Lemma2.14guarantees the convergence of the whole sequence to zero, but it gives a convergence rate only on a subsequence.

3 Algorithm and assumptions

3.1 Algorithm

As described in the introduction, we combine penalization with the augmented Lagrangian approach to form the following functional

Jk(x, y, µ) =f(x) +g(y) +h(x) +hµ, Ax−bi+ ρk 2 kAx−bk 2₊ 1 2βk k y₋T x_k2, (3.1) whereµis the dual multiplier, andρk andβkare non-negative parameters. The steps of our scheme, then,

are summarized in Algorithm1.

Algorithm 1:Conditional Gradient with Augmented Lagrangian and Proximal-step (CGALP )

Input: x0 ∈ C = dom (h);µ0 ∈ran(A);(γk)k∈N,(βk)_k_∈N,(θk)_k_∈N,(ρk)_k_∈N∈ℓ+.

k= 0 repeat yk= proxβkg(T xk) zk=∇f(xk) +T∗(T xk−yk)/βk+A∗µk+ρkA∗(Axk−b) sk∈Argmins∈Hp{h(s) +hzk, si} xk+1=xk−γk(xk−sk) µk+1 =µk+θk(Axk+1−b) k_←k+ 1 untilconvergence; Output: xk+1.

For the interpretation of the algorithm, notice that the first step is equivalent to

{yk}= Argmin y∈Hv

Jk(xk, y, µk).

Now define the functional _Ek(x, µ)

def

= f(x) +gβk₍_{T x}_{) +}_h_{µ, Ax}₋_b_i₊ρk

2 kAx−bk

2_._{By convexity of}

the setCand the definition ofxk+1as a convex combination ofxkandsk, the sequence(xk)k∈Nremains in

(12)

Lagrangian, where we do not consider the non-differentiable function h and we replace g by its Moreau envelope. Notice that

∇xEk(x, µk) =∇f(x) +T∗[∇gβk](T x) +A∗µk+ρkA∗(Ax−b) =∇f(x) + 1 βk T∗ T x−prox_β_k_g(T x) +A∗µk+ρkA∗(Ax−b). (3.2)

where in the second equality we used2.1(iii). Thenzkis just∇xEk(xk, µk)and the first three steps of the

algorithm can be condensed in

sk ∈Argmin s∈Hp

{h(s) +_h∇xEk(xk, µk), si}. (3.3)

Thus the primal variable update of each step of our algorithm boils down to conditional gradient applied to the function_Ek(·, µk), where the next iterate is a convex combination between the previous one and the new

directionsk. A standard update of the Lagrange multiplierµkfollows.

3.2 Assumptions

3.2.1 Assumptions on the functions

In order to help the reading, we recall in a compact form the following notation that we will use to refer to various functionals throughout the paper:

Φ (x)=deff(x) +g(T x) +h(x) ; Φk(x) def =f(x) +gβk₍_{T x}_{) +}_h₍_x_{) +}ρk 2 kAx−bk 2_; ¯ Φ (x)= Φ (def x) + (ρ/2)kAx−bk2; ¯ ϕ(µ)= ¯defΦ∗(−A∗µ) +hb, µi; L(x, µ)=deff(x) +g(T x) +h(x) +hµ, Ax−bi; Lk(x, µ) def =f(x) +gβk₍_{T x}_{) +}_h₍_x_{) +}_h_{µ, Ax}₋_b_i₊ρk 2 kAx−bk 2_; Ek(x, µ) def =f(x) +gβk₍_{T x}_{) +}_h_{µ, Ax}₋_b_i₊ρk 2 kAx−bk 2_, (3.4)

whereρis defined in Assumption(P.4)to beρ= sup

k ρk.

In the list (3.4), we can recognize Φas the objective, Φk as the smoothed objective augmented with a

quadratic penalization of the constraint, and_Lkas a smoothed augmented Lagrangian. Ldenotes the classical

Lagrangian. Recall that(x⋆, µ⋆) ∈ Hp× Hdis a saddle-point for the LagrangianLif for every(x, µ) ∈

Hp× Hd,

L(x⋆, µ)_{≤ L}(x⋆, µ⋆)_{≤ L}(x, µ⋆). (3.5) It is well-known from standard Lagrange duality, see e.g. [4, Proposition 19.19] or [30, Theorem 3.68], that the existence of a saddle point (x⋆, µ⋆) ensures strong duality, that x⋆ solves (P_{) and} _µ⋆ _{solves the dual}

problem,

min

µ∈Hd

(f +g_◦T+h)∗(₋A∗µ) +_hµ, b_i. (D₎

The following assumptions on the problem will be used throughout the convergence analysis (for some results only a subset of these assumptions will be needed):

(13)

(A.1) f, g_◦T, andhbelong toΓ0(Hp).

(A.2) The pair(f,_C)is(F, ζ)-smooth (see Definition2.6), where we recall_C= dom (def h). (A.3) Cis weakly compact (and thus contained in a ball of radiusR >0).

(A.4) TC ⊂dom(∂g)andsup

x∈C [∂g(T x)] 0 <∞.

(A.5) his Lipschitz continuous relative to its domain_Cwith constantLh ≥ 0, i.e.,∀(x, z) ∈ C2,|h(x)− h(z)| ≤Lhkx−zk.

(A.6) There exists a saddle-point(x⋆, µ⋆)∈ Hp× Hdfor the LagrangianL.

(A.7) ran(A)is closed.

(A.8) One of the following holds:

(a) A−1(b)_∩int (dom (g_◦T))_∩int (_C)₆=_∅, whereA−1(b)is the pre-image ofbunderA. (b) _HpandHdare finite dimensional and

     A−1(b)∩ri (dom (g◦T))∩ri (C)6=∅ and

ran (A∗)∩par (dom (g◦T)∩ C)⊥={0}.

(3.6)

At this stage, a few remarks are in order.

Remark 3.1.

(i) By Assumption(A.1),_Cis also closed and convex. This together with Assumption(A.3)entail, upon using [4, Lemma 3.29 and Theorem 3.32], thatCis weakly compact.

(ii) Since the sequence of iterates(xk)k∈Ngenerated by Algorithm1is guaranteed to belong toC under

(P.1), we have from(A.4)

sup k [∂g(T xk)] 0 ≤M. (3.7)

whereM is a positive constant.

(iii) Assumption(A.5)will only be needed in the proof of convergence to optimality (Theorem4.2). It is not needed to show asymptotic feasibility (Theorem4.1).

(iv) Assume that A−1₍_b₎_∩_dom(_g_◦_T₎_{∩ C 6}₌ _∅_{, which entails that the set of minimizers of (}_P_{) is a}

non-empty convex closed bounded set under(A.1)-(A.3). Then there are various domain qualification conditions, e.g., one of the conditions in [4, Proposition 15.24 and Fact 15.25], that ensure the existence of a saddle-point for the Lagrangian_L(see [4, Theorem 19.1 and Proposition 9.19(v)]).

(v) Observe that under the inclusion assumption of Lemma3.2,(A.8)(a)is equivalent toA−1(b)∩int (C)6=

∅.

(vi) Assumption(A.8)will be crucial to show thatϕ¯is coercive onker(A∗₎⊥ _{= ran(}_A₎_{(the last equality} follows from (A.7)), and hence boundedness of the dual multiplier sequence (µk)k∈N provided by

Algorithm1(see Lemma4.10and Lemma4.11).

The uniform boundedness of the minimal norm selection on_C, as required in Assumption(A.4), is impor-tant when we will invoke Proposition2.1(v)in our proofs to get meaningful estimates. The following result gives some sufficient conditions under which(A.4)holds (in fact an even stronger claim).

Lemma 3.2. LetCbe a nonempty bounded subset ofHp,g ∈ Γ0(Hv)andT :Hp → Hv be a bounded

(14)

Proof. Sinceg_∈Γ0(Hp), it follows from [4, Proposition 16.21] that T_{C ⊂}int (dom (g))_⊂dom(∂g).

Moreover, by [4, Corollary 8.30(ii) and Proposition 16.14], we have that∂g is locally weakly compact on int (dom (g)). In particular, as we assume thatC is bounded, so isTC, and since TC ⊂ int (dom (g)), it means that for eachz _∈ T_Cthere exists an open neighborhood of z, denoted by Uz, such that∂g(Uz)is

bounded. Since(Uz)z∈Cis an open cover ofTCandTCis bounded, there exists a finite subcover(Uzk) n k=1. Then, [ x∈C ∂g(T x)_⊂ n [ k=1 ∂g(Uzk).

Since the right-hand-side is bounded (as it is a finite union of bounded sets), sup

x∈C, u∈∂g(T x)k

u_k <+_∞,

whence the desired conclusion trivially follows.

3.2.2 Assumptions on the parameters

We also use the following assumptions on the parameters of Algorithm1(recall the functionζin Definition

2.6):

(P.1) (γk)k∈N⊂]0,1]and the sequences(ζ(γk))_k_∈N, γ_k2/βk

k∈Nand(γkβk)k∈Nbelong toℓ1+.

(P.2) (γk)k∈N∈/ ℓ1.

(P.3) (βk)_k_∈N∈ℓ+is non-increasing and converges to0.

(P.4) (ρk)k∈N∈ℓ+is non-decreasing with0< ρ= infkρk ≤supkρk =ρ <+∞.

(P.5) For some positive constantsM andM,M _≤infk(γk/γk+1)≤supk(γk/γk+1)≤M.

(P.6) (θk)k∈Nsatisfiesθk =

γk

c for allk∈Nfor somec >0such that Mc − ρ

2 <0.

(P.7) (γk)_k_∈Nand(ρk)_k_∈Nsatisfyρk+1−ρk−γk+1ρk+1+ 2_cγk−

γk2

c ≤γk+1for all k∈ Nand for cin

(P.6).

Remark 3.3.

(i) One can recognize that the update of the dual multiplier µkin Algorithm1has a flavour of gradient

ascent applied to the augmented dual with step-sizeθk. However, unlike the standard method of

mul-tipliers with the augmented Lagrangian, Assumption(P.6)requiresθk to vanish in our setting. The

underlying reason is that our update can be seen as an inexact dual ascent (i.e., exactness stems from the conditional gradient-based update onxkwhich is not a minimization of overxof the augmented

Lagrangian_Lk). Thusθkmust annihilate this error asymptotically.

(ii) A sufficient condition for (P.7)to hold consists of taking ρk ≡ ρ > 0 and γk+1 ≥ _c₍₁₊2_ρ₎γk. In

particular, if(γk)k∈Nsatisfies(P.5), then, for(P.7)to hold, it is sufficient to takeρk ≡ρ >2M /cas

supposed in(P.6).

(iii) The relevance of having ρk vary is that it allows for more general and less stringent choice of the

step-sizeγk. It is, however, possible (and easier in practice), to simply pickρk ≡ ρfor allk∈ Nas

described above.

(15)

Example 3.4. Take2, fork_∈N_, ρk ≡ρ >0, γk= (log(k+ 2))a (k+ 1)1−b , βk= 1 (k+ 1)1−δ, with a_≥0,0_≤2b < δ <1, δ <1₋b, ρ >22−b/c, c >0.

In this case, one can take the crude boundsM = (log(2)/log(3))a_and_M _{= 2}1−b_{, and choose}_{ρ >}₂_{M /c}

as devised in Remark3.3(ii). In turn,(P.4)-(P.7)hold. In addition, suppose thatf has aν-Hölder continuous gradient (see (2.7)). Thus for(P.1)-(P.2)to hold, simple algebra shows that the allowable choice ofbis in

h

0,min1/3,₁₊ν_νh.

4 Convergence analysis

4.1 Main results

We state here our main results.

Theorem 4.1 (Asymptotic feasibility). Suppose that Assumptions(A.1)-(A.4)and(A.6)hold. Consider the sequence of iterates(xk)k∈Nfrom Algorithm1with parameters satisfying Assumptions(P.1)-(P.6). Then,

(i) Axk converges strongly tobask→ ∞, i.e., the sequence(xk)k∈Nis asymptotically feasible for(P)

in the strong topology. (ii) Pointwise rate:

inf 0≤i≤kkAxi−bk=O 1 √ Γk and _∃a subsequence xkj j∈Ns.t. for allj ∈N,kAxkj −bk ≤ 1 p Γkj , (4.1) where, for allk_∈N_,_Γ_k₌defPk

i=0γi.

(iii) Ergodic rate: for eachk∈N_{, let}_x_¯_k₌defPk

i=0γixi/Γk. Then kAx¯k−bk=O 1 √ Γk . (4.2)

Theorem4.1will be proved in Section4.3.

Theorem 4.2 (Convergence to optimality). Suppose that assumptions(A.1)-(A.8)and(P.1)-(P.7)hold, with

M ≥1. Let(xk)k∈Nbe the sequence of primal iterates generated byAlgorithm1and(x⋆, µ⋆)a saddle-point

pair for the Lagrangian. Then, in addition to the results of Theorem4.1, the following holds (i) Convergence of the Lagrangian:

lim

k→∞L(xk, µ

⋆_{) =}_L₍_x⋆_{, µ}⋆₎_.

(4.3) (ii) Every weak cluster pointx¯of(xk)k∈Nis a solution of the primal problem(P), and(µk)_k_∈Nconverges

weakly toµ¯a solution of the dual problem(D_{), i.e.,}_(¯_x,_µ_¯₎_{is a saddle point of}_L_.

2

Of course, one can add a scaling factor in the choice of the parameters which would allow for more practical flexibility. But this does not change anything to our discussion nor to the bahviour of the CGALP algorithm fork_{large enough.}

(16)

(iii) Pointwise rate: inf 0≤i≤kL(xi, µ ⋆₎_{− L}₍_x⋆_{, µ}⋆_{) =}_O 1 Γk and ∃a subsequence xkj j∈Ns.t. for eachj∈N,L xkj+1, µ ⋆ − L(x⋆, µ⋆)≤ 1 Γkj . (4.4)

(iv) Ergodic rate: for eachk∈N_{, let}_x_¯_k₌defPk

i=0γixi+1/Γk. Then L(¯xk, µ⋆)− L(x⋆, µ⋆) =O 1 Γk . (4.5)

An important observation is that Theorem4.2, which will be proved in Section4.5, actually shows that lim k→∞ h L(xk, µ⋆)− L(x⋆, µ⋆) + ρk 2 kAxk−bk 2i_{= 0}_,

and subsequentially, for eachj ∈N_,

L xkj, µ ⋆ − L(x⋆, µ⋆) +ρkj 2 kAxkj −bk 2 _≤ 1 Γkj . (4.6)

This means, in particular, that the pointwise rate for feasibility and optimality hold simulatenously for the same subsequence.

The following corollary is immediate.

Corollary 4.3. Under the assumptions ofTheorem4.2, if the problem(P₎_{admits a unique solution}_x⋆_{, then}

the primal-dual pair sequence(xk, µk)k∈Nconverges weakly to a saddle point(x⋆, µ⋆).

Proof. By uniqueness, it follows from Theorem4.2(ii) that(xk)k∈Nhas exactly one weak sequential

clus-ter point which is the solution to (P_{). Weak convergence of the sequence} ₍_x_k₎

k∈N then follows from [4,

Lemma 2.38].

Example 4.4. Suppose that the sequences of parameters are chosen according to Example3.4. Let the func-tionσ:t_∈R+_7→_(log(_t_{+ 2))}a_/₍_t_{+ 1)}1−b_{. We obviously have}_σ₍_k_{) =}_γ_k_for_k_∈N_{. Moreover, it is easy}

to see that∃k′≥0(depending onaandb), such thatσis decreasing fort≥k′. Thus,∀k≥k′, we have

Γk ≥ k X i=k′ γi ≥ Z k+1 k′ σ(t)dt_≥ Z k+2 k′₊₁ (log(t))atb−1dt= Z log(k+2) log(k′₊₁₎ taebtdt.

It is easy to show, using integration by parts for the first case, that

Γ−_k1 =          o₍_k₊₂₎1 b a= 1, b >0, O₍_k₊₂₎1 b a= 0, b >0, O_log(_k1₊₂₎ a= 0, b= 0.

This result reveals that picking aand b as large as possible results in a faster convergence rate, with the proviso thatbsatisfy some conditions for(P.1)-(P.7)to hold, see the discussion in Example3.4for the largest possible choice ofb.

(17)

4.2 Preparatory results

The next result is a direct application of the Descent Lemma2.7and the generalized one in Lemma2.5to the specific case of Algorithm1. It allows to obtain a descent property for the functionEk(·, µk)between

the previous iteratexkand next onexk+1.

Lemma 4.5. Suppose Assumptions(A.1),(A.2)and(P.1)hold. For eachk_∈N_{, define the quantity}

Lk def = kTk 2 βk +kAk2ρk. (4.7)

Then, for eachk_∈N_{, we have the following inequality:}

Ek(xk+1, µk)≤ Ek(xk, µk) +h∇xEk(xk, µk), xk+1−xki+K(F,ζ,C)ζ(γk)

+Lk

2 kxk+1−xkk 2_.

Proof. Define for eachk∈N_,

˜ Ek(x, µ) def =gβk₍_{T x}_{) +}_h_{µ, Ax}₋_b_i₊ρk 2 kAx−bk 2 , so thatEk(x, µ) =f(x) + ˜Ek(x, µ). Compute ∇xE˜k(x, µ) =T∗∇gβk(T x) +A∗µ+ρkA∗(Ax−b),

which is Lipschitz-continuous with constantLk = kTk

2

βk +kAk 2_ρ

kby virtue of(A.1)and Proposition2.1(iii).

Then we can use the Descent Lemma (2.7) withν = 1on_E˜k(·, µk)between the pointsxkandxk+1, to obtain,

for eachk∈N_, ˜ Ek(xk+1, µk)≤E˜k(xk, µk) +h∇E˜k(xk, µk), xk+1−xki+ Lk 2 kxk+1−xkk 2 . (4.8)

From Assumption(A.2), Lemma2.5and Remark2.7, we have, for eachk∈N_,

f(xk+1)≤f(xk) +h∇f(xk), xk+1−xki+DF(xk+1, xk)

≤f(xk) +h∇f(xk), xk+1−xki+K(F,ζ,C)ζ(γk),

where we used that both xk and sk lie in C, that γk belongs to ]0,1] by (P.1) and thus xk+1 = xk + γk(sk−xk) ∈ C. Summing (4.8) with the latter and recalling that Ek(x, µk) = f(x) + ˜Ek(x, µk), we

obtain the claim.

Again for the function_Ek(·, µk), we also have a lower-bound, presented in the next lemma.

Lemma 4.6. Suppose Assumptions(A.1)and(A.2)hold. Then, for allk∈N_{, for all}_{x, x}′ _{∈ H}_p_{and for all}

µ_{∈ H}d, Ek(x, µ)≥ Ek x′, µ +h∇xEk x′, µ , x−x′i+ρk 2 kA(x−x ′₎_k2_.

(18)

Proof. First, split the function_Ek(·, µ)asEk(x, µ) =Ek0(x, µ) +ρ2kkAx−bk2for an opportune definition

of_E_k0(_·, µ). For the first term, simply by convexity, we have

Ek0(x, µ)≥ Ek0 x′, µ

+h∇xEk0 x′, µ

, x−x′i. (4.9)

Now use the strong convexity of the term(ρk/2)k · −bk2between pointsAxandAx′, to affirm that ρk 2 kAx−bk 2 _≥ ρk 2 kAx′−bk 2₊_h∇ρk 2 k · −bk 2 _Ax′ , Ax₋Ax′_i+ρk 2 kA(x−x′)k 2_. _(4.10) Compute h∇ρk 2 k · −bk 2 _Ax′ , Ax₋Ax′_i=ρkhA∗ Ax′−b , x₋x′_i =_h∇ρk 2 kA· −bk 2 _x′ , x₋x′_i.

Summing (4.9) and (4.10) and invoking the gradient computation above, we obtain the claim.

Lemma 4.7. Suppose that assumptions (A.1)-(A.8)and(P.1)-(P.7)hold, withM _≥ 1. Let(xk)k∈Nbe the

sequence of primal iterates generated byAlgorithm1andµ⋆ _{a solution of the dual problem}₍_D_{). Then we}

have the following estimate,

L(xk, µ⋆)− L(xk+1, µ⋆)≤γkdC(MkTk+D+Lh+kAk kµ⋆k)

Proof. First defineuk

def

= [∂g(T xk)]0and recall that, by(A.4)and its consequence in (3.7),kukk ≤ M for

everyk_∈N_{. Then,}

L(xk, µ⋆)− L(xk+1, µ⋆) = Φ(xk)−Φ(xk+1) +hµ⋆, A(xk−xk+1)i

≤ huk, T(xk−xk+1)i+h∇f(xk), xk−xk+1i +Lhkxk−xk+1k+kµ⋆k kAk kxk−xk+1k,

where we used the subdifferential inequality (2.4) ong, the gradient inequality onf, theLh-Lipschitz

con-tinuity of h relative to C (see(A.5)), and the Cauchy-Schwartz inequality on the scalar product. Since

xk+1 =xk+γk(xk−sk), we obtain L(xk, µ⋆)− L(xk+1, µ⋆)≤γk huk, T(xk−sk)i+h∇f(xk), xk−ski+Lhkxk−skk +_kµ⋆_{k k}A_{k k}xk−skk ≤γkdC(MkTk+D+Lh+kµ⋆k kAk),

where we have denoted byDthe constantDdef= sup_x_∈C_k∇f(x)_k<+_∞(see (4.32)).

Lemma 4.8. Suppose that assumptions(A.3)and(P.4)hold. Let(xk)k∈Nbe the sequence of primal iterates

generated byAlgorithm1. Then we have the following estimate,

ρk 2 kAxk−bk 2 −ρk+1 2 kAxk+1−bk 2 ≤ρd_C_kA_k(_kA_kR+_kb_k)γk,

(19)

Proof. By(P.4)and convexity of the function ρk+1 2 kA· −bk2, we have ρk 2 kAxk−bk 2₋ρk+1 2 kAxk+1−bk 2 _≤ ρk+1 2 kAxk−bk 2₋ρk+1 2 kAxk+1−bk 2 ≤ h∇ρk₂+1kA_{· −}b_k2(xk), xk−xk+1i.

Now compute the gradient and use the definition ofxk+1, to obtain ρk 2 kAxk−bk 2₋ρk+1 2 kAxk+1−bk 2 _≤_ρ k+1γkhAxk−b, A(xk−sk)i ≤ρd_C_kA_k(_kA_kR+_kb_k)γk.

In the last inequality, we used Cauchy-Schwartz inequality, triangle inequality, the fact that

kxk−skk ≤dC, and assumptions(A.3)and(P.4)(respectively,supx∈Ckxk ≤Randρk+1 ≤ρ).

4.3 Asymptotic feasibility

We begin with an intermediary lemma establishing the main feasibility estimation and some summability results that will also be used in the proof of optimality.

Lemma 4.9. Suppose that Assumptions(A.1)-(A.4)and(A.6)hold. Consider the sequence of iterates(xk)k∈N

from Algorithm1with parameters satisfying Assumptions(P.1)-(P.6). Define the two quantities∆p_kand∆d k

in the following way,

∆p_k=def_Lk(xk+1, µk)−L˜k(µk), ∆dk

def

= ˜_{L −}_L˜k(µk),

where we have denoted_L˜k(µk)

def

= minxLk(x, µk)andL˜

def

=_L(x⋆_{, µ}⋆₎_{. Denote the sum}_∆ k

def

= ∆p_k+ ∆d k.

Then we have the following estimation, ∆k+1≤∆k−γk+1 M c kAx˜k+1−bk 2₊_δ_k_A₍_x k+1−x˜k+1)k2 +Lk+1 2 γ 2 k+1d2C +K₍_F,ζ,_C₎ζ(γk+1) + βk−βk+1 2 M+ ρk+1−ρk 2 kAxk+1−bk2, and, moreover, γkkAx˜k−bk2 k∈N∈ℓ 1 +, γkkA(xk−x˜k)k2 k∈N∈ℓ 1 +, and γkkAxk−bk2 k∈N∈ℓ 1 +. Proof. First notice that the quantity∆p_k≥0and can be seen as a primal gap at iterationkwhile∆d_kmay be negative but is bounded from below by our assumptions. Indeed, in view of(A.1),(A.6)and Remark3.1(iv),

˜

Lk(µk)is bounded from above since

˜ Lk(µk)≤ Lk(x⋆, µk) =f(x⋆) +gβk₍_{T x}⋆_{) +}_h₍_x⋆_{) +}_h_µ k, Ax⋆−bi+ ρk 2 kAx ⋆₋_b_k2 =f(x⋆) +gβk₍_{T x}⋆_{) +}_h₍_x⋆₎ ≤f(x⋆) +g(T x⋆) +h(x⋆)<+∞,

(20)

We denote a minimizer of _Lk(x, µk) by x˜k ∈ Argmin x∈Hp

Lk(x, µk), which exists and belongs to C by

(A.1)-(A.3). Then, we have

∆d_k₊₁₋∆d_k=_Lk(˜xk, µk)− Lk+1(˜xk+1, µk+1). (4.11)

Sincex˜kis a minimizer ofLk(x, µk)we have thatLk(˜xk, µk)≤ Lk(˜xk+1, µk)which leads to,

Lk(˜xk+1, µk) =Lk+1(˜xk+1, µk) +gβk(Tx˜k+1)−gβk+1(Tx˜k+1) + ρk−₂ρk+1 kAx˜k+1−bk2

≤ Lk+1(˜xk+1, µk),

where the last inequality comes from Proposition2.1(v)and the assumptions(P.3)and(P.4). Combining this with (4.11),

∆d_k₊₁₋∆d_k_{≤ L}k+1(˜xk+1, µk)− Lk+1(˜xk+1, µk+1) =_hµk−µk+1, Ax˜k+1−bi

=₋θkhAxk+1−b, Ax˜k+1−bi,

(4.12)

where in the last equality we used the definition ofµk+1. Meanwhile, for the primal gap we have

∆p_k₊₁−∆p_k= (Lk+1(xk+2, µk+1)− Lk(xk+1, µk)) + (Lk(˜xk, µk)− Lk+1(˜xk+1, µk+1)).

Note that

Lk(xk+1, µk) =Lk(xk+1, µk+1)−θkkAxk+1−bk2

and estimate_Lk(˜xk, µk)− Lk+1(˜xk+1, µk+1)as in (4.12), to get

∆p_k₊₁₋∆p_k_{≤ L}k+1(xk+2, µk+1)− Lk(xk+1, µk+1) +θkkAxk+1−bk2

−θkhAxk+1−b, Ax˜k+1−bi. (4.13)

Using (4.12) and (4.13), we then have

∆k+1−∆k≤ Lk+1(xk+2, µk+1)− Lk(xk+1, µk+1) +θkkAxk+1−bk2 −2θkhAxk+1−b, Ax˜k+1−bi. Note that Lk(xk+1, µk+1) =Lk+1(xk+1, µk+1)− h gβk+1₋_gβk i (T xk+1)− ρk+1−ρk 2 kAxk+1−bk2. Then ∆k+1−∆k≤ Lk+1(xk+2, µk+1)− Lk+1(xk+1, µk+1) +gβk+1(T xk+1)−gβk(T xk+1) + ρk+1−ρk 2 kAxk+1−bk2+θkkAxk+1−bk2−2θkhAxk+1−b, Ax˜k+1−bi.

(21)

We denote byT1=_Lk+1(xk+2, µk+1)− Lk+1(xk+1, µk+1)and the remaining part of the right-hand side

byT2. For the moment, we focus our attention onT1. Recall that_Lk(x, µk) =Ek(x, µk) +h(x)and apply

Lemma4.5between pointsxk+2andxk+1, to get

T1≤h(xk+2)−h(xk+1) +h∇xEk+1(xk+1, µk+1), xk+2−xk+1i

+K₍_F,ζ,_C₎ζ(γk+1) + Lk+1

2 kxk+2−xk+1k 2_.

By(A.1)we have thathis convex and thus, sincexk+2is a convex combination ofxk+1andsk+1, we get

T1_{≤ −}γk+1(h(xk+1)−h(sk+1) +h∇xEk+1(xk+1, µk+1), xk+1−sk+1i) +Lk+1

2 kxk+2−xk+1k 2₊_K

(F,ζ,C)ζ(γk+1).

Applying the definition ofskas the minimizer of the linear minimization oracle and Lemma4.6at the points

˜ xk+1,xk+1, andµk+1 gives, T1_{≤ −}γk+1(h(xk+1)−h(˜xk+1) +h∇xEk+1(xk+1, µk+1), xk+1−x˜k+1i) +Lk+1 2 kxk+2−xk+1k 2₊_K (F,ζ,C)ζ(γk+1) ≤ −γk+1 h(xk+1)−h(˜xk+1) +Ek+1(xk+1, µk+1)− Ek+1(˜xk+1, µk+1) +ρk+1 2 kA(xk+1−x˜k+1)k 2₊Lk+1 2 kxk+2−xk+1k 2₊_K (F,ζ,C)ζ(γk+1) =₋γk+1 Lk+1(xk+1, µk+1)− Lk+1(˜xk+1, µk+1) + ρk+1 2 kA(xk+1−x˜k+1)k 2 +Lk+1 2 kxk+2−xk+1k 2₊_K (F,ζ,C)ζ(γk+1) ≤ −γk+1₂ρk+1kA(xk+1−x˜k+1)k2+ Lk+1 2 kxk+2−xk+1k 2₊_K (F,ζ,C)ζ(γk+1),

where we used thatx˜k+1is a minimizer ofLk+1(·, µk+1)in the last inequality. Now, combiningT1andT2

and using the Pythagoras identity we have ∆k+1−∆k≤ −θkkAx˜k+1−bk2+ θk−γk+1 ρk+1 2 kA(xk+1−x˜k+1)k2 +Lk+1 2 kxk+2−xk+1k 2₊_K (F,ζ,C)ζ(γk+1) + h gβk+1₋_gβk i (T xk+1) +ρk+1−ρk 2 kAxk+1−bk 2 . (4.14)

Under(P.6)we haveθk = γ_ck for somec >0such that

∃δ >0, M c −

ρ

2 =−δ <0,

whereM is the constant such that γk ≤ M γk+1(see Assumption (P.5)). Then, using (P.5)and the above

inequality, θk−γk+1 ρk+1 2 ≤ M c − ρk+1 2 γk+1 ≤ M c − ρ 2 γk+1 =−δγk+1andθk≥ M γ_ck+1. (4.15)

(22)

Now use the fact thatxk+2 =xk+1+γk+1(sk+1−xk+1)to estimate

kxk+2−xk+1k2≤γk2+1d2C. (4.16) Moreover, by the two assumptions(P.3),(A.4)and Proposition2.1(v), (3.7) holds with a constantM >0, and thus with Proposition2.1(iv)we obtain

h gβk+1 ₋_gβk i (T xk+1)≤ βk−βk+1 2 ∂g(T xk+1) 0 2 ≤ βk−₂βk+1M. (4.17) Plugging (4.15), (4.16) and (4.17) into (4.14), we get

∆k+1−∆k≤ − M c γk+1kAx˜k+1−bk 2 −δγk+1kA(xk+1−x˜k+1)k2+ Lk+1 2 γ 2 k+1d2C +K₍_F,ζ,_C₎ζ(γk+1) + βk−βk+1 2 M+ ρk+1−ρk 2 kAxk+1−bk2. (4.18)

Because of the assumptions(P.1)and(P.4), and in view of the definition ofLkin (4.7), we have the following, Lk 2 γ 2 kd2C = 1 2 kT_k2 βk +kAk2ρk ! γ_k2d2_C ∈ℓ1₊.

For the telescopic terms from the right hand side of (4.18) we have

βk−βk+1 2 ∈ℓ 1 +and ρk+1−ρk 2 kAxk+1−bk2≤(ρk+1−ρk) kAk2R2+kbk2∈ℓ1₊,

whereRis the constant arising from(A.3). Under(P.1)we also have that

K(F,ζ,C)ζ(γk+1)∈ℓ1+.

Using the notation of Lemma2.14, we set

rk = ∆k, pk=γk+1, wk = M c kAx˜k+1−bk 2₊_δ kA(xk+1−x˜k+1)k2 , zk = Lk+1 2 γ 2 k+1d2C+K(F,ζ,C)ζ(γk+1) +βk−βk+1 2 M + ρk+1−ρk 2 kAxk+1−bk2.

We have shown above that

rk+1≤rk−pkwk+zk,

where(zk)k∈N ∈ℓ1+, andrkis bounded from below. We then deduce using Lemma2.14(i)that(rk)k∈Nis

convergent and γkkAx˜k−bk2 k∈N∈ℓ 1 +, γkkA(xk−x˜k)k2 k∈N∈ℓ 1 +. (4.19) Consequently, γkkAxk−bk2 k∈N∈ℓ 1 +, (4.20)

since, by Jensen’s inequality, ∞ X k=1 γkkAxk−bk2 ≤2 ∞ X k=1 γk kA(xk−x˜k)k2+kAx˜k−bk2 <+∞.

(23)

We are now ready to prove Theorem4.1, i.e., to show that the sequence of iterates(xk)k∈Nis

asymptoti-cally feasible.

Proof. (i) By Lemma4.8withρk ≡ρk+1 ≡2, we have

kAxk−bk2− kAxk+1−bk2 ≤2γkdCkAk(kAkR+kbk).

Using this together with Lemma4.9and Assumption(P.2), we are in position to apply Lemma2.14(ii)

to conclude thatlimk→∞kAxk−bk= 0.

(ii) The rates in (4.1) follow respectively from Lemma2.14(iii)and Lemma2.14(iv). (iii) We have, by Jensen’s inequality and Lemma4.9, that

kAx¯k−bk2 ≤ 1 Γk k X i=0 γikAxi−bk2 ≤ 1 Γk +∞ X i=0 γikAxi−bk2 =O 1 Γk .

4.4 Dual multiplier boundedness

In this part we provide a lemma that shows the sequence of dual variables(µk)k∈Ngenerated by Algorithm

1is bounded.

We start by studying coercivity ofϕ¯.

Lemma 4.10. Suppose that Assumptions(A.1)-(A.3)and(A.6)-(A.8)hold. Thenϕ¯is coercive onran (A).

Proof. From (3.4), we have, for anyc_∈A−1(b), that ¯

ϕ(µ) = ¯Φ∗+_h−c,_·i

(₋A∗µ).

Moreover, Assumptions(A.1)and (A.7)entail thatΦ¯ ∈ Γ0(Hp). We now consider separately the two

as-sumptions.

(a) Case of(A.8)(a): If follows from the Fenchel-Moreau theorem ([4, Theorem 13.32]) that ¯

Φ∗− hc,·i∗

= ¯Φ∗∗(·+c) = ¯Φ (·+c).

Using this, together with Proposition2.10and(A.2), we can assert thatΦ¯∗− hc,·iis coercive if and only if

0_∈int dom ¯Φ (_·+c)

= int (dom (Φ))₋c= int (dom (g_◦T)_{∩ C})₋c

= int (dom (g_◦T))_∩int (_C)₋c.

But this is precisely what(A.8)(a)guarantees. In turn, using [4, Proposition 14.15],(A.8)(a)is equiv-alent to

∃(a >0, β _∈R), Φ¯∗_{− h}c,_{·i ≥}a_k·k+β.

Using standard results on linear operators in Hilbert spaces [4, Facts 2.18 and 2.19], we have

(A.7) _⇐⇒ (_∃α >0),(_∀µ_∈ran(A)), _kA∗µ_{k ≥}α_kµ_k_.

Combining the last two inequalities, we deduce that under(A.8)(a),

∃(a >0, α >0, β ∈R₎_,₍_∀_µ_∈_ran(_A₎₎_, _ϕ_¯₍_µ₎_≥_a_k_A∗_µ_k₊_β_≥_aα_k_µ_k₊_β,

(24)

(b) Case of(A.8)(b): Since_Hdis finite dimensional, We have,∀u∈ Hd,

¯

ϕ∞(u) = Φ¯∗+_h−c,_·i

◦(₋A∗)∞

(u) (Proposition2.11(iii)) = ¯Φ∗+_h−c,_·i∞

(₋A∗u) (Proposition2.11(ii)) =σ_dom₍_Φ¯∗₊_h−_c,_·i₎∗(−A∗u)

=σ_dom₍_Φ(¯ _·₊_c₎_{) (}−A∗u) =σ_dom₍_Φ¯₎₋_c(−A∗u) (by(A.2)) =σdom(g◦T)∩C−c(−A∗u).

Notice that, by Assumption(A.4), we havedom (g◦T)∩ C=C. Thus, using Proposition2.11(i), we have the following chain of equivalences

¯

ϕis coercive onran (A) ⇐⇒ ϕ¯∞(u)>0, ∀u∈ran (A)\ {0} ⇐⇒ σ_C−c(−A∗u)>0, ∀u∈ran (A)\ {0}.

For this to hold, and sinceran (A) = ker (A∗)⊥, a sufficient condition is that

σ_C−c(x)>0, ∀x∈ran (A∗)\ {0}. (4.21)

It remains to check that the latter condition holds under(A.8)(b). First, observe thatCis a nonempty bounded convex set thanks to (A.1)and (A.3). The first condition in (A.8)(b)is equivalent to0 _∈ ri(_{C −}c)for somec_∈A−1₍_b₎_{. It then follows from Proposition}_2.9_that

σ_C−c(x)>0,∀x6∈par(C −c)⊥ = par(C)⊥,

which then implies (4.21) thanks to the second condition in(A.8)(b).

Lemma 4.11. Suppose that assumptions(A.1)-(A.3)and(A.6)-(A.8)and(P.1)-(P.6)hold. Then the sequence of dual iterates(µk)k∈Ngenerated byAlgorithm1is bounded.

Proof. Using the notation in (3.4), the primal problem: min x∈Hp{ Φ(x) : Ax=b}= min x∈Hp sup µ∈Hd L(x, µ), is obviously equivalent to min x∈Hp n Φ(x) +ρk 2 kAx−bk 2 _: _Ax₌_bo_{= min} x∈Hp sup µ∈Hd n L(x, µ) +ρk 2 kAx−bk 2o_.

We associate to the previous the following regularized primal problem: min x∈Hp{ Φk(x) : Ax=b}= min x∈Hp sup µ∈Hd Lk(x, µ)

and its Lagrangian dual, namely: sup µ∈Hd inf x∈Hp Lk (x, µ) =₋ inf µ∈Hd sup x∈Hp −Lk(x, µ).

(25)

Now consider the dual function in the latter, namelyϕk(µ)

def

=−infx∈Hp Lk(x, µ). Observe that the

min-imum is actually attained owing to(A.1)and (A.3). Now we claim that ϕk is continuously differentiable

withL_∇ϕk-Lipschitz gradient, and 1/ρ(see(P.4)) is an upper-bound for (L∇ϕk)k∈N. In order to show it,

introduce the notation

Fk(x) def = f(x) +gβk₍_{T x}_{) +}_h₍_x_); Gk(v) def = ρk 2 kv−bk 2_. By definition, we have ϕk(µ) =−min x∈Hp n f(x) +gβk₍_{T x}_{) +}_h₍_x_{) +}_h_{µ, Ax}₋_b_i₊ρk 2 kAx−bk 2o =₋min x∈Hp{ Fk(x) +hA∗µ, xi+Gk(Ax)}+hµ, bi. (4.22)

Using Fenchel-Rockafellar duality and strong duality, which holds by(P.4)and continuity ofGk (see, for

instance, [30, Theorem 3.51]), we have the following equality, min x∈Hp{ Fk(x) +hA∗µ, xi+Gk(Ax)}=−min v∈Hd{ (Fk(·) +hA∗µ,·i)∗(−A∗v) +G∗k(v)} =₋min v∈Hd{ F_k∗(₋A∗v₋A∗µ) +G∗_k(v)_}

where we have used the fact that the conjugate of a linear perturbation is the translation of the conjugate in the last line. Substituting the above into (4.22) we find

ϕk(µ) = min v∈Hd F_k∗(−A∗(v+µ)) + 1 2ρkk vk2+hv, bi +hµ, bi = min v∈Hd F_k∗(−A∗(v+µ)) + 1 2ρkk v+ρkbk2 +hµ, bi −ρk 2 kbk 2

Moreover, from the primal-dual extremality relationships [30, Theorem 3.51(i)], we have

−v˜=_∇Gk(Ax˜) =ρk(Ax˜−b), (4.23)

wherex˜is a minimizer (which exists and belongs to_C) of the primal objective_Lk(·, µ)and˜vis the unique

minimizer to the associated dual objective. Now, using the change of variableu=v+µ, we get

ϕk(µ) = inf u∈Hd F_k∗(₋A∗u) + 1 2ρkk u₋µ+ρkbk2 +_hµ, b_{i −}ρk 2 kbk 2 = [F_k∗_◦(₋A∗)]ρk₍_µ −ρkb) +hµ, bi − ρk 2 kbk 2_,

where the notation[_·]ρk _{denotes the Moreau envelope with parameter}_ρ

kas defined in (2.2). It follows from

Proposition2.1(i)and(iii), thatϕkis convex, real-valued and its gradient, given by

∇ϕk(µ) =ρ−_k1(µ−ρkb−u˜) +b=ρ−_k1(µ−u˜), where u˜= proxρkFk∗◦(−A∗)(µ−ρkb), (4.24)

is1/ρk-Lipschitz continuous since the gradient of a Moreau envelope with parameterρkis1/ρk-Lipschitz

(26)

(_∇ϕk)k∈N is uniformly Lipschitz-continuous with constant1/ρ. In addition, combining (4.23) and (4.24),

and recalling the change of variableu˜= ˜v+µ, we get that

∇ϕk(µ) =ρ−k1(µ−u˜) =−ρ−k1v˜=Ax˜−b. (4.25)

As in Lemma4.9, we are going to denotex˜ka minimizer ofLk(x, µk). Then, from the Descent Lemma (see

Proposition2.4and inequality (2.7)), we have

ϕk(µk+1)≤ϕk(µk) +h∇ϕk(µk), µk+1−µki+

1

2ρkµk+1−µkk 2_.

Now substitute in the right-hand-side the expression_∇ϕk(µk) =Ax˜k−bin (4.25) and the updateµk+1 = µk+θk(Axk+1−b)from the algorithm, to obtain

ϕk(µk+1)≤ϕk(µk) +θkhAx˜k−b, Axk+1−bi+ θ2_k 2ρkAxk+1−bk 2 ≤ϕk(µk) + θk 2kAx˜k−bk 2₊θk 2 θk ρ + 1 kAxk+1−bk2, (4.26)

where we estimated the scalar product by Cauchy-Schwartz and Young inequality. Moreover, by definition,

ϕk+1(µk+1) =− inf x∈Hp n f(x) +gβk+1₍_{T x}_{) +}_h₍_x_{) +}_h_µ k+1, Ax−bi+ ρk+1 2 kAx−bk 2o = sup x∈Hp −Lk(x, µk+1) + h gβk₋_gβk+1i₍_{T x}_{) +}1 2(ρk−ρk+1)kAx−bk 2 . (4.27)

Now recall assumptions(P.3)and (P.4): for βknon-increasing,

gβk −_gβk+1₍_{T x}₎ _≤ ₀_{for every}_x _{∈ H}

p

by Proposition2.1(v)and, forρknon-decreasing,ρk−ρk+1 ≤0. Then we can estimate the right-hand-side

of (4.27) to obtain

ϕk+1(µk+1)≤ sup

x∈Hp

−Lk(x, µk+1) =ϕk(µk+1).

Sum (4.26) with the latter, to obtain

ϕk+1(µk+1)−ϕk(µk)≤ θk 2 kAx˜k−bk 2₊ θk 2 θk ρ + 1 kAxk+1−bk2.

By Assumption(P.6),θk=γk/cwhereγk ≤1. Moreover, by assumption(P.5),γk ≤M γk+1. Then, ϕk+1(µk+1)−ϕk(µk)≤ γk 2ckAx˜k−bk 2₊ M 2c 1 ρc+ 1 γk+1kAxk+1−bk2. (4.28)

Notice that the right-hand-side is in ℓ1

+, because both γkkAxk−bk2 k∈N and γkkAx˜k−bk2 k∈Nare inℓ 1

+by Lemma4.9. Additionally,(ϕk(µk))k∈Nis bounded from below. Indeed,

by virtue of(A.6)and Remark3.1(iv), we have

ϕk(µk)≥ −Lk(x⋆, µk)

(27)

Then we can use Lemma2.14(i) on inequality (4.28) to conclude that(ϕk(µk))k∈Nis convergent and, in

particular, bounded. Now recallΦk,Φ¯ andϕ¯from (3.4). Notice that

ϕk(µ) = sup x∈Hp {hµ, b₋Ax_{i −}Φk(x)} = sup x∈Hp {h−A∗µ, x_{i −}Φk(x)}+hb, µi = Φ∗_k(−A∗µ) +hb, µi.

It then follows that

gβk _≤_g ₌_⇒ _Φ

k≤Φ¯ ⇐⇒ Φ¯∗≤Φ∗k =⇒ ϕ¯≤ϕk, (4.29)

where we used Proposition2.1(v)and the fact in (2.1). We are now in position to invoke Lemma4.10which shows thatϕ¯is coercive onran(A), and thus, by (4.29), (ϕk)k∈Nis coercive uniformly inkonran(A). In

turn, sinceran(A) is closed and(µk)k∈N ⊂ ran(A) = ker(A∗)⊥, we have from (4.29) and the proof of

Lemma4.10that

∃(a >0, α >0, β _∈R₎_,₍_∀_k_∈N₎_, _ϕ_k₍_µ_k₎_≥_ϕ_¯₍_µ_k₎_≥_a_k_A∗_µ_k_k₊_β _≥_aα_k_µ_k_k₊_β,

which shows that(µk)k∈Nis indeed bounded by boundedness of(ϕk(µk))_k_∈N.

4.5 Optimality

In this section we prove Theorem4.2by establishing convergence of the Lagrangian values to the optimum (i.e., the value at the saddle-point).

We start by showing some boundedness claims that will be important in our proof.

Lemma 4.12. Under assumptions(A.1)-(A.8)and(P.1)-(P.6), the objectiveΦis bounded on_C, and thus ˜ M = supdef x∈C| Φ(x)_|+ sup k∈Nk µkk(kAkR+kbk)<+∞, (4.30)

where we recall the radiusRfrom assumption(A.3).

Proof. By assumption(A.4),gis subdifferentiable atT xfor anyx_{∈ C}. Thus convexity ofgimplies that for anyx_{∈ C} g(T x)≤g(T x⋆) +D[∂g(T x)]0, T x−T x⋆E≤g(T x⋆) + [∂g(T x)] 0 kTkdC g(T x)_≥g(T x⋆) +D[∂g(T x⋆)]0, T x₋T x⋆E _≥g(T x⋆)₋ [∂g(T x ⋆_)]0 kTkdC. (4.31)

From assumptions(A.1)and (A.2),f belongs toΓ0(Hp)and is differentiable on an open set C0 that

con-tainsC ⊂ dom(f) (see Definition 2.6). Thus the continuity set of f contains C, and it follows from [4, Corollary 8.30(ii)] that_{C ⊂}int (dom (f)). Consequently, arguing as in the proof of Lemma3.2, we deduce that

sup

x∈Ck∇