Convergence Proof - Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

The basic convergence result given in §3.2 can be found in several references, such as [81, 63]. Many of these give more sophisticated results, with more general penalties or inexact minimization. For completeness, we give a proof here.

We will show that if $f$ and $g$ are closed, proper, and convex, and the Lagrangian $L_0$ has a saddle point, then we have primal residual convergence, meaning that $r^k \to 0$, and objective convergence, meaning that $p^k \to p^\star$, where $p^k = f(x^k) + g(z^k)$. We will also see that the dual residual $s^k = \rho A^T B(z^k - z^{k-1})$ converges to zero.

Let $(x^\star, z^\star, y^\star)$ be a saddle point for $L_0$, and define

$$V^k = (1/\rho)\|y^k - y^\star\|_2^2 + \rho\|B(z^k - z^\star)\|_2^2.$$

We will see that $V^k$ is a Lyapunov function for the algorithm, i.e., a nonnegative quantity that decreases in each iteration. (Note that $V^k$ is unknown while the algorithm runs, since it depends on the unknown values $z^\star$ and $y^\star$.)

We first outline the main idea. The proof relies on three key inequalities, which we will prove below using basic results from convex analysis along with simple algebra. The first inequality is

$$V^{k+1} \le V^k - \rho\|r^{k+1}\|_2^2 - \rho\|B(z^{k+1} - z^k)\|_2^2. \tag{A.1}$$

This states that $V^k$ decreases in each iteration by an amount that depends on the norm of the residual and on the change in $z$ over one iteration. Because $V^k \le V^0$, it follows that $y^k$ and $Bz^k$ are bounded.

Iterating the inequality above gives

$$\rho \sum_{k=0}^{\infty} \left( \|r^{k+1}\|_2^2 + \|B(z^{k+1} - z^k)\|_2^2 \right) \le V^0,$$

which implies that $r^k \to 0$ and $B(z^{k+1} - z^k) \to 0$ as $k \to \infty$. Multiplying the second expression by $\rho A^T$ shows that the dual residual $s^{k+1} = \rho A^T B(z^{k+1} - z^k)$ converges to zero. (This shows that the stopping criterion (3.12), which requires the primal and dual residuals to be small, will eventually hold.)
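As a numerical sanity check of (A.1), the following minimal sketch runs ADMM on a toy problem chosen here purely for illustration (it does not appear in the text): minimize (1/2)||x - a||^2 + (1/2)||z - b||^2 subject to x - z = 0, so that A = I, B = -I, c = 0, with the closed-form saddle point x* = z* = (a + b)/2 and y* = (a - b)/2. The function name `admm_lyapunov_trace` and all variable names are assumptions made for this example.

```python
def sqnorm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(t * t for t in v)


def admm_lyapunov_trace(a, b, rho=1.0, iters=50):
    """Run ADMM on the toy problem (an assumption, not from the source)
        minimize 0.5*||x - a||^2 + 0.5*||z - b||^2  subject to  x - z = 0
    and record (V^k, V^{k+1}, rho*||r^{k+1}||^2 + rho*||B(z^{k+1} - z^k)||^2)
    for each iteration. Here A = I, B = -I, c = 0."""
    n = len(a)
    # Closed-form saddle point of the unaugmented Lagrangian L_0.
    z_star = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    y_star = [(ai - bi) / 2 for ai, bi in zip(a, b)]

    def lyapunov(y, z):
        # V = (1/rho)||y - y*||^2 + rho||B(z - z*)||^2, and ||B v|| = ||v|| for B = -I.
        dy = [yi - ys for yi, ys in zip(y, y_star)]
        dz = [zi - zs for zi, zs in zip(z, z_star)]
        return sqnorm(dy) / rho + rho * sqnorm(dz)

    z = [0.0] * n
    y = [0.0] * n
    trace = []
    for _ in range(iters):
        v_old = lyapunov(y, z)
        # x-update: minimize 0.5||x - a||^2 + y^T x + (rho/2)||x - z||^2.
        x = [(ai - yi + rho * zi) / (1 + rho) for ai, yi, zi in zip(a, y, z)]
        # z-update: minimize 0.5||z - b||^2 - y^T z + (rho/2)||x - z||^2.
        z_new = [(bi + yi + rho * xi) / (1 + rho) for bi, yi, xi in zip(b, y, x)]
        # Primal residual r^{k+1} = A x^{k+1} + B z^{k+1} - c = x - z_new.
        r = [xi - zi for xi, zi in zip(x, z_new)]
        # Dual update y^{k+1} = y^k + rho * r^{k+1}.
        y = [yi + rho * ri for yi, ri in zip(y, r)]
        step = [zn - zo for zn, zo in zip(z_new, z)]
        z = z_new
        decrease = rho * sqnorm(r) + rho * sqnorm(step)
        trace.append((v_old, lyapunov(y, z), decrease))
    return trace
```

Under these assumptions, every recorded triple should satisfy V^{k+1} <= V^k - decrease up to rounding error, which is exactly the statement of (A.1) for this instance.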

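The second key inequality is

$$p^{k+1} - p^\star \le -(y^{k+1})^T r^{k+1} - \rho\,(B(z^{k+1} - z^k))^T\bigl(-r^{k+1} + B(z^{k+1} - z^\star)\bigr), \tag{A.2}$$

and the third inequality is

$$p^\star - p^{k+1} \le y^{\star T} r^{k+1}. \tag{A.3}$$

The righthand side in (A.2) goes to zero as $k \to \infty$, because $y^{k+1}$ and $B(z^{k+1} - z^\star)$ are bounded and both $r^{k+1}$ and $B(z^{k+1} - z^k)$ go to zero. The righthand side in (A.3) goes to zero as $k \to \infty$, since $r^k$ goes to zero. Thus we have $\lim_{k \to \infty} p^k = p^\star$, i.e., objective convergence.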
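The two-sided bound on the objective gap can be observed numerically. The sketch below uses a scalar toy problem assumed here for illustration (minimize (1/2)(x - a)^2 + (1/2)(z - b)^2 subject to x - z = 0, so A = 1, B = -1, c = 0); the function name `objective_gap_bounds` is hypothetical. For each iteration it records the lower bound implied by (A.3), the gap p^{k+1} - p*, and the upper bound from (A.2).

```python
def objective_gap_bounds(a, b, rho=1.0, iters=30):
    """Scalar toy ADMM run (an assumption, not from the source).
    Returns a list of (lower, gap, upper) with
      lower = -y* r^{k+1}                  (rearranged (A.3)),
      gap   = p^{k+1} - p*,
      upper = righthand side of (A.2)."""
    B = -1.0
    # Closed-form saddle point and optimal value for this toy problem.
    x_star = z_star = (a + b) / 2
    y_star = (a - b) / 2
    p_star = 0.5 * (x_star - a) ** 2 + 0.5 * (z_star - b) ** 2

    z, y = 0.0, 0.0
    rows = []
    for _ in range(iters):
        x = (a - y + rho * z) / (1 + rho)       # x-update
        z_new = (b + y + rho * x) / (1 + rho)   # z-update
        r = x - z_new                           # primal residual r^{k+1}
        y = y + rho * r                         # dual update, now y^{k+1}
        gap = 0.5 * (x - a) ** 2 + 0.5 * (z_new - b) ** 2 - p_star
        upper = -y * r - rho * (B * (z_new - z)) * (-r + B * (z_new - z_star))
        lower = -y_star * r
        rows.append((lower, gap, upper))
        z = z_new
    return rows
```

In this instance both bounds shrink toward zero as the residual does, squeezing the gap between them, which mirrors the argument for objective convergence.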

Before giving the proofs of the three key inequalities, we derive the inequality (3.11), mentioned in our discussion of stopping criteria, from the inequality (A.2). We simply observe that $-r^{k+1} + B(z^{k+1} - z^\star) = -A(x^{k+1} - x^\star)$; substituting this into (A.2) yields (3.11),

$$p^{k+1} - p^\star \le -(y^{k+1})^T r^{k+1} + (x^{k+1} - x^\star)^T s^{k+1}.$$
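For readers who want the substitution spelled out, here is a short sketch of the algebra. It assumes the residual definition $r^{k+1} = Ax^{k+1} + Bz^{k+1} - c$ and the dual residual $s^{k+1} = \rho A^T B(z^{k+1} - z^k)$ from the main text, and writes $z^\star$, $x^\star$ for the saddle-point values.

```latex
\begin{align*}
-r^{k+1} + B(z^{k+1} - z^\star)
  &= -(Ax^{k+1} + Bz^{k+1} - c) + Bz^{k+1} - Bz^\star \\
  &= -Ax^{k+1} + (c - Bz^\star) \\
  &= -A(x^{k+1} - x^\star)
     \qquad \text{since } Ax^\star + Bz^\star = c, \\[4pt]
-\rho\,(B(z^{k+1} - z^k))^T \bigl(-A(x^{k+1} - x^\star)\bigr)
  &= (x^{k+1} - x^\star)^T \rho A^T B(z^{k+1} - z^k) \\
  &= (x^{k+1} - x^\star)^T s^{k+1}.
\end{align*}
```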

Proof of inequality (A.3)

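Since $(x^\star, z^\star, y^\star)$ is a saddle point for $L_0$, we have

$$L_0(x^\star, z^\star, y^\star) \le L_0(x^{k+1}, z^{k+1}, y^\star).$$

Using $Ax^\star + Bz^\star = c$, the lefthand side is $p^\star$. With $p^{k+1} = f(x^{k+1}) + g(z^{k+1})$, this can be written as

$$p^\star \le p^{k+1} + y^{\star T} r^{k+1},$$

which gives (A.3).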

Proof of inequality (A.2)

By definition, $x^{k+1}$ minimizes $L_\rho(x, z^k, y^k)$. Since $f$ is closed, proper, and convex it is subdifferentiable, and so is $L_\rho$. The (necessary and sufficient) optimality condition is

$$0 \in \partial L_\rho(x^{k+1}, z^k, y^k) = \partial f(x^{k+1}) + A^T y^k + \rho A^T(Ax^{k+1} + Bz^k - c).$$

(Here we use the basic fact that the subdifferential of the sum of a subdifferentiable function and a differentiable function with domain $\mathbf{R}^n$ is the sum of the subdifferential and the gradient; see, e.g., [140, §23].)

Since $y^{k+1} = y^k + \rho r^{k+1}$, we can plug in $y^k = y^{k+1} - \rho r^{k+1}$ and rearrange to obtain

$$0 \in \partial f(x^{k+1}) + A^T\bigl(y^{k+1} - \rho B(z^{k+1} - z^k)\bigr).$$
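The rearrangement can be sketched as follows, assuming the residual definition $r^{k+1} = Ax^{k+1} + Bz^{k+1} - c$ from the main text, so that $Ax^{k+1} + Bz^k - c = r^{k+1} - B(z^{k+1} - z^k)$.

```latex
\begin{align*}
A^T y^k + \rho A^T(Ax^{k+1} + Bz^k - c)
  &= A^T\bigl(y^{k+1} - \rho r^{k+1}\bigr)
     + \rho A^T\bigl(r^{k+1} - B(z^{k+1} - z^k)\bigr) \\
  &= A^T\bigl(y^{k+1} - \rho B(z^{k+1} - z^k)\bigr).
\end{align*}
```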

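This implies that $x^{k+1}$ minimizes

$$f(x) + \bigl(y^{k+1} - \rho B(z^{k+1} - z^k)\bigr)^T A x.$$

A similar argument shows that $z^{k+1}$ minimizes $g(z) + y^{(k+1)T} B z$. It follows that

$$f(x^{k+1}) + \bigl(y^{k+1} - \rho B(z^{k+1} - z^k)\bigr)^T A x^{k+1} \le f(x^\star) + \bigl(y^{k+1} - \rho B(z^{k+1} - z^k)\bigr)^T A x^\star$$

and that

$$g(z^{k+1}) + y^{(k+1)T} B z^{k+1} \le g(z^\star) + y^{(k+1)T} B z^\star.$$

Adding the two inequalities above, using $Ax^\star + Bz^\star = c$, and rearranging, we obtain (A.2).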
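The two steps left implicit here, the "similar argument" for the $z$-update and the final rearrangement, can be sketched as below. This is an illustrative expansion, assuming that $z^{k+1}$ minimizes $L_\rho(x^{k+1}, z, y^k)$ and that $r^{k+1} = Ax^{k+1} + Bz^{k+1} - c$, as in the main text.

```latex
% Optimality condition for the z-update.
\begin{align*}
0 &\in \partial g(z^{k+1}) + B^T y^k + \rho B^T(Ax^{k+1} + Bz^{k+1} - c) \\
  &= \partial g(z^{k+1}) + B^T\bigl(y^k + \rho r^{k+1}\bigr)
   = \partial g(z^{k+1}) + B^T y^{k+1}.
\end{align*}

% Sum of the two inequalities, with p^{k+1} = f(x^{k+1}) + g(z^{k+1})
% and p^\star = f(x^\star) + g(z^\star).
\begin{align*}
p^{k+1} - p^\star
  &\le \bigl(y^{k+1} - \rho B(z^{k+1} - z^k)\bigr)^T A(x^\star - x^{k+1})
       + (y^{k+1})^T B(z^\star - z^{k+1}) \\
  &= (y^{k+1})^T\bigl(c - Ax^{k+1} - Bz^{k+1}\bigr)
     - \rho\,(B(z^{k+1} - z^k))^T A(x^\star - x^{k+1}) \\
  &= -(y^{k+1})^T r^{k+1}
     - \rho\,(B(z^{k+1} - z^k))^T\bigl(-r^{k+1} + B(z^{k+1} - z^\star)\bigr).
\end{align*}
```

The last line uses $A(x^\star - x^{k+1}) = -r^{k+1} + B(z^{k+1} - z^\star)$, which follows from $Ax^\star + Bz^\star = c$.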

Proof of inequality (A.1)

Adding (A.2) and (A.3), regrouping terms, and multiplying through by 2 gives

$$2(y^{k+1} - y^\star)^T r^{k+1} - 2\rho\,(B(z^{k+1} - z^k))^T r^{k+1} + 2\rho\,(B(z^{k+1} - z^k))^T\bigl(B(z^{k+1} - z^\star)\bigr) \le 0. \tag{A.4}$$

The result (A.1) will follow from this inequality after some manipulation and rewriting.

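We begin by rewriting the first term. Substituting $y^{k+1} = y^k + \rho r^{k+1}$ gives

$$2(y^k - y^\star)^T r^{k+1} + \rho\|r^{k+1}\|_2^2 + \rho\|r^{k+1}\|_2^2,$$

and substituting $r^{k+1} = (1/\rho)(y^{k+1} - y^k)$ in the first two terms gives

$$(2/\rho)(y^k - y^\star)^T(y^{k+1} - y^k) + (1/\rho)\|y^{k+1} - y^k\|_2^2 + \rho\|r^{k+1}\|_2^2.$$

Since $y^{k+1} - y^k = (y^{k+1} - y^\star) - (y^k - y^\star)$, this can be written as

$$(1/\rho)\left(\|y^{k+1} - y^\star\|_2^2 - \|y^k - y^\star\|_2^2\right) + \rho\|r^{k+1}\|_2^2. \tag{A.5}$$

We now rewrite the remaining terms, i.e.,

$$\rho\|r^{k+1}\|_2^2 - 2\rho\,(B(z^{k+1} - z^k))^T r^{k+1} + 2\rho\,(B(z^{k+1} - z^k))^T\bigl(B(z^{k+1} - z^\star)\bigr),$$

where $\rho\|r^{k+1}\|_2^2$ is taken from (A.5). Substituting $z^{k+1} - z^\star = (z^{k+1} - z^k) + (z^k - z^\star)$ in the last term gives

$$\rho\|r^{k+1} - B(z^{k+1} - z^k)\|_2^2 + \rho\|B(z^{k+1} - z^k)\|_2^2 + 2\rho\,(B(z^{k+1} - z^k))^T\bigl(B(z^k - z^\star)\bigr),$$

and substituting $z^{k+1} - z^k = (z^{k+1} - z^\star) - (z^k - z^\star)$ in the last two terms, we get

$$\rho\|r^{k+1} - B(z^{k+1} - z^k)\|_2^2 + \rho\left(\|B(z^{k+1} - z^\star)\|_2^2 - \|B(z^k - z^\star)\|_2^2\right).$$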
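Both rewriting steps rest on the same elementary identity, sketched here for arbitrary vectors $a$ and $b$ (the names are introduced only for this illustration).

```latex
\begin{align*}
\|b - a\|_2^2 + 2a^T(b - a)
  &= \|b\|_2^2 - 2a^T b + \|a\|_2^2 + 2a^T b - 2\|a\|_2^2 \\
  &= \|b\|_2^2 - \|a\|_2^2 .
\end{align*}
% Dual terms:   a = y^k - y^\star,       b = y^{k+1} - y^\star,       scaled by 1/\rho.
% Primal terms: a = B(z^k - z^\star),    b = B(z^{k+1} - z^\star),    scaled by \rho.
```

Taking $a = y^k - y^\star$, $b = y^{k+1} - y^\star$ and scaling by $1/\rho$ gives the difference of dual terms; taking $a = B(z^k - z^\star)$, $b = B(z^{k+1} - z^\star)$ and scaling by $\rho$ gives the difference of $z$ terms.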

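Combining (A.5) with this rewriting of the remaining terms, (A.4) can be written as

$$V^k - V^{k+1} \ge \rho\|r^{k+1} - B(z^{k+1} - z^k)\|_2^2. \tag{A.6}$$

To show (A.1), it now suffices to show that the middle term $-2\rho\, r^{(k+1)T}\bigl(B(z^{k+1} - z^k)\bigr)$ of the expanded righthand side of (A.6) is nonnegative. To see this, recall that $z^{k+1}$ minimizes $g(z) + y^{(k+1)T} B z$ and $z^k$ minimizes $g(z) + y^{kT} B z$, so we can add

$$g(z^{k+1}) + y^{(k+1)T} B z^{k+1} \le g(z^k) + y^{(k+1)T} B z^k$$

and

$$g(z^k) + y^{kT} B z^k \le g(z^{k+1}) + y^{kT} B z^{k+1}$$

to get that

$$(y^{k+1} - y^k)^T\bigl(B(z^{k+1} - z^k)\bigr) \le 0.$$

Substituting $y^{k+1} - y^k = \rho r^{k+1}$ gives the result, since $\rho > 0$.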
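To make the last step explicit, the chain from (A.6) to (A.1) can be sketched as follows, using $y^{k+1} - y^k = \rho r^{k+1}$ and the inequality just derived.

```latex
\begin{align*}
V^k - V^{k+1}
  &\ge \rho\|r^{k+1} - B(z^{k+1} - z^k)\|_2^2 \\
  &= \rho\|r^{k+1}\|_2^2
     - 2\rho\,(r^{k+1})^T B(z^{k+1} - z^k)
     + \rho\|B(z^{k+1} - z^k)\|_2^2 \\
  &= \rho\|r^{k+1}\|_2^2
     - 2\,(y^{k+1} - y^k)^T B(z^{k+1} - z^k)
     + \rho\|B(z^{k+1} - z^k)\|_2^2 \\
  &\ge \rho\|r^{k+1}\|_2^2 + \rho\|B(z^{k+1} - z^k)\|_2^2 .
\end{align*}
```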
