Distributed gradient descent - Efficient Methods For Large-Scale Empirical Risk Minimization

The network that connects the $V$ agents is assumed connected and symmetric, and is specified by the neighborhoods $\mathcal{N}_v$ that contain the list of nodes that can communicate with $v$, for $v = 1, \ldots, V$. In problem (5.1) agent $v$ has access to the local cost $f_v(\mathbf{w})$ and agents cooperate to minimize the global cost $f(\mathbf{w})$. This specification is more naturally formulated by an alternative representation of (5.1) in which node $v$ selects a local decision vector $\mathbf{w}_v \in \mathbb{R}^p$. Nodes then try to achieve the minimum of their local objective functions $f_v(\mathbf{w}_v)$, while keeping their variables equal to the variables $\mathbf{w}_u$ of neighbors $u \in \mathcal{N}_v$. This alternative formulation can be written as

$$\{\mathbf{w}_v^*\}_{v=1}^{V} := \operatorname*{argmin}_{\{\mathbf{w}_v\}_{v=1}^{V}} \ \sum_{v=1}^{V} f_v(\mathbf{w}_v), \quad \text{s.t. } \mathbf{w}_v = \mathbf{w}_u \ \text{ for all } v,\ u \in \mathcal{N}_v. \tag{5.2}$$

Since the network is connected, the constraints $\mathbf{w}_v = \mathbf{w}_u$ for all $v$ and $u \in \mathcal{N}_v$ imply that (5.1) and (5.2) are equivalent and we have $\mathbf{w}_v^* = \mathbf{w}^*$ for all $v$. This must be the case because for a connected network the constraints $\mathbf{w}_v = \mathbf{w}_u$ for all $v$ and $u \in \mathcal{N}_v$ collapse the feasible space of (5.2) to a hyperplane in which all local variables are equal. When all variables are equal, the objectives in (5.1) and (5.2) coincide and so do their optima.

DGD is an established distributed method to solve (5.2) which relies on the introduction of nonnegative weights $w_{vu} \geq 0$ that are null if and only if $u \notin \mathcal{N}_v \cup \{v\}$. The use of time-varying weights $w_{vu}$ is common in DGD implementations but is not done here; see, e.g., [80]. Letting $t \in \mathbb{N}$ be a discrete time index and $\alpha$ a given stepsize, DGD is defined by the recursion

$$\mathbf{w}_{v,t+1} = \sum_{u=1}^{V} w_{vu}\,\mathbf{w}_{u,t} - \alpha \nabla f_v(\mathbf{w}_{v,t}), \qquad v = 1, \ldots, V. \tag{5.3}$$

Since $w_{vu} = 0$ when $u \neq v$ and $u \notin \mathcal{N}_v$, it follows from (5.3) that each agent $v$ updates its variable $\mathbf{w}_v$ by performing an average over the estimates $\mathbf{w}_{u,t}$ of its neighbors $u \in \mathcal{N}_v$ and its own estimate $\mathbf{w}_{v,t}$, and descending along the negative local gradient $-\nabla f_v(\mathbf{w}_{v,t})$.
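The recursion in (5.3) can be sketched in a few lines. The example below is only an illustration under assumptions that are not part of the text: scalar decisions ($p = 1$), a three-node path graph, hand-picked weights, and quadratic local costs $f_v(w) = (1/2)(w - c_v)^2$. The names `dgd`, `grads`, and `c` are hypothetical.

```python
# Minimal DGD sketch for recursion (5.3) with scalar decisions (p = 1).
# Assumptions (not from the text): a 3-node path graph, hand-picked weights,
# and quadratic local costs f_v(w) = 0.5 * (w - c[v])**2.

def dgd(W, grads, w0, alpha, num_iters):
    """Iterate w_{v,t+1} = sum_u W[v][u] * w_{u,t} - alpha * grad f_v(w_{v,t})."""
    w = list(w0)
    V = len(w)
    for _ in range(num_iters):
        # Every agent averages its own iterate with those of its neighbors,
        # then steps along its own negative local gradient. All agents use
        # the iterates from time t, so the update is synchronous.
        w = [
            sum(W[v][u] * w[u] for u in range(V)) - alpha * grads[v](w[v])
            for v in range(V)
        ]
    return w

# Symmetric weights with unit row sums; W[v][u] = 0 for non-neighbors.
W = [
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1 / 3, 2 / 3],
]
c = [1.0, 2.0, 6.0]  # local minimizers; the global minimizer is their mean
grads = [lambda w, cv=cv: w - cv for cv in c]

w_final = dgd(W, grads, w0=[0.0, 0.0, 0.0], alpha=0.01, num_iters=5000)
w_star = sum(c) / len(c)
print(w_final, w_star)
```

With a fixed stepsize the local iterates in this toy problem end up close to the global minimizer but do not coincide with it exactly.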

The weights in (5.3) cannot be arbitrary. To express conditions on the set of allowable weights define the matrix $\mathbf{W} \in \mathbb{R}^{V \times V}$ with entries $w_{vu}$. We require the weights to be symmetric, i.e., $w_{vu} = w_{uv}$ for all $v, u$, and such that the weights of a given node sum up to 1, i.e., $\sum_{u=1}^{V} w_{vu} = 1$ for all $v$. If the weights sum up to 1 we must have $\mathbf{W}\mathbf{1} = \mathbf{1}$, which implies that $\mathbf{I} - \mathbf{W}$ is rank deficient. It is also customary to require the rank of $\mathbf{I} - \mathbf{W}$ to be exactly equal to $V - 1$ so that the null space of $\mathbf{I} - \mathbf{W}$ is $\text{null}(\mathbf{I} - \mathbf{W}) = \text{span}(\mathbf{1})$. We therefore have the following three restrictions on the matrix $\mathbf{W}$,

$$\mathbf{W}^T = \mathbf{W}, \qquad \mathbf{W}\mathbf{1} = \mathbf{1}, \qquad \text{null}(\mathbf{I} - \mathbf{W}) = \text{span}(\mathbf{1}). \tag{5.4}$$

If the conditions in (5.4) are true, it is possible to show that (5.3) approaches the solution of (5.1) in the sense that $\mathbf{w}_{v,t} \approx \mathbf{w}^*$ for all $v$ and large $t$ [80]. The accepted interpretation of why (5.3) converges is that nodes are gradient descending towards their local minima because of the term $-\alpha \nabla f_v(\mathbf{w}_{v,t})$ but also perform an average of neighboring variables $\sum_{u=1}^{V} w_{vu}\,\mathbf{w}_{u,t}$. This latter consensus operation drives the agents to agreement. In the following section we show that (5.3) can be alternatively interpreted as a penalty method.
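The three conditions in (5.4) can be checked numerically for a candidate weight matrix. The sketch below assumes Metropolis-style weights, $w_{vu} = 1/(1 + \max(d_v, d_u))$ on edges, which is one common choice in the consensus literature and is not prescribed by the text. The helper names `metropolis_weights` and `rank`, and the four-node path graph, are likewise hypothetical.

```python
# Sketch: build a weight matrix for a small graph and check conditions (5.4).
# Assumption (not from the text): Metropolis-style weights
# w_vu = 1 / (1 + max(deg_v, deg_u)) on edges, with the self-weight chosen
# so that every row sums to one.

def metropolis_weights(neighbors):
    V = len(neighbors)
    deg = [len(neighbors[v]) for v in range(V)]
    W = [[0.0] * V for _ in range(V)]
    for v in range(V):
        for u in neighbors[v]:
            W[v][u] = 1.0 / (1 + max(deg[v], deg[u]))
        W[v][v] = 1.0 - sum(W[v])  # diagonal is still 0.0 when summed
    return W

def rank(A, tol=1e-10):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in A]
    rows, cols = len(M), len(M[0])
    r = 0
    for col in range(cols):
        pivot = max(range(r, rows), key=lambda i: abs(M[i][col]), default=None)
        if pivot is None or abs(M[pivot][col]) < tol:
            continue  # no usable pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            factor = M[i][col] / M[r][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

# Connected, symmetric path graph on four nodes: 0 - 1 - 2 - 3.
neighbors = [[1], [0, 2], [1, 3], [2]]
W = metropolis_weights(neighbors)
V = len(W)

is_symmetric = all(
    abs(W[v][u] - W[u][v]) < 1e-12 for v in range(V) for u in range(V)
)
rows_sum_to_one = all(abs(sum(row) - 1.0) < 1e-12 for row in W)
I_minus_W = [
    [(1.0 if v == u else 0.0) - W[v][u] for u in range(V)] for v in range(V)
]
# rank(I - W) = V - 1 together with W1 = 1 gives null(I - W) = span(1).
rank_deficiency_is_one = rank(I_minus_W) == V - 1
print(is_symmetric, rows_sum_to_one, rank_deficiency_is_one)
```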

5.2.1 Penalty method interpretation

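It is illuminating to define matrices and vectors so as to rewrite (5.3) as a single equation. To do so define the vectors $\mathbf{y} := [\mathbf{w}_1; \ldots; \mathbf{w}_V]$ and $\mathbf{h}(\mathbf{y}) := [\nabla f_1(\mathbf{w}_1); \ldots; \nabla f_V(\mathbf{w}_V)]$. Vector $\mathbf{y} \in \mathbb{R}^{Vp}$ concatenates the local vectors $\mathbf{w}_v$, and the vector $\mathbf{h}(\mathbf{y}) \in \mathbb{R}^{Vp}$ concatenates the gradients of the local functions $f_v$ taken with respect to the local variable $\mathbf{w}_v$. Notice that $\mathbf{h}(\mathbf{y})$ is not the gradient of $f(\mathbf{w})$ and that a vector $\mathbf{y}$ with $\mathbf{h}(\mathbf{y}) = \mathbf{0}$ does not necessarily solve (5.1). To solve (5.1) we need to have $\mathbf{w}_v = \mathbf{w}_u$ for all $v$ and $u$ with $\sum_{v=1}^{V} \nabla f_v(\mathbf{w}_v) = \mathbf{0}$. In any event, to rewrite (5.3) we also define the matrix $\mathbf{Z} := \mathbf{W} \otimes \mathbf{I} \in \mathbb{R}^{Vp \times Vp}$ as the Kronecker product of the weight matrix $\mathbf{W} \in \mathbb{R}^{V \times V}$ and the identity matrix $\mathbf{I} \in \mathbb{R}^{p \times p}$. It is then easy to see that (5.3) is equivalent to

$$\mathbf{y}_{t+1} = \mathbf{Z}\mathbf{y}_t - \alpha \mathbf{h}(\mathbf{y}_t) = \mathbf{y}_t - \big[(\mathbf{I} - \mathbf{Z})\mathbf{y}_t + \alpha \mathbf{h}(\mathbf{y}_t)\big], \tag{5.5}$$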

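where in the second equality we added and subtracted $\mathbf{y}_t$ and regrouped terms. Inspection of (5.5) reveals that the DGD update formula at step $t$ is equivalent to a (regular) gradient descent algorithm being used to solve the program

$$\mathbf{y}^* := \operatorname*{argmin}_{\mathbf{y}} F(\mathbf{y}) := \operatorname*{argmin}_{\mathbf{y}} \ \frac{1}{2}\,\mathbf{y}^T(\mathbf{I} - \mathbf{Z})\,\mathbf{y} + \alpha \sum_{v=1}^{V} f_v(\mathbf{w}_v). \tag{5.6}$$

This interpretation has been previously used in [42, 44] to design a Nesterov-type acceleration of DGD. Indeed, given the definition of the function $F(\mathbf{y}) := (1/2)\,\mathbf{y}^T(\mathbf{I} - \mathbf{Z})\,\mathbf{y} + \alpha \sum_{v=1}^{V} f_v(\mathbf{w}_v)$ it follows that the gradient $\mathbf{g}_t := \nabla F(\mathbf{y}_t)$ is given by

$$\mathbf{g}_t = \nabla F(\mathbf{y}_t) = (\mathbf{I} - \mathbf{Z})\,\mathbf{y}_t + \alpha \mathbf{h}(\mathbf{y}_t). \tag{5.7}$$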

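Using (5.7) we rewrite (5.5) as $\mathbf{y}_{t+1} = \mathbf{y}_t - \mathbf{g}_t$ and conclude that DGD descends along the negative gradient of $F(\mathbf{y})$ with unit stepsize. The expression in (5.3) is just a distributed implementation of gradient descent that uses the gradient in (5.7). To confirm that this is true, observe that the $v$th element of the gradient $\mathbf{g}_t = [\mathbf{g}_{1,t}; \ldots; \mathbf{g}_{V,t}]$ is given by

$$\mathbf{g}_{v,t} = (1 - w_{vv})\,\mathbf{w}_{v,t} - \sum_{u \in \mathcal{N}_v} w_{vu}\,\mathbf{w}_{u,t} + \alpha \nabla f_v(\mathbf{w}_{v,t}). \tag{5.8}$$

The gradient descent iteration $\mathbf{y}_{t+1} = \mathbf{y}_t - \mathbf{g}_t$ is then equivalent to (5.3) if we entrust node $v$ with the implementation of the descent $\mathbf{w}_{v,t+1} = \mathbf{w}_{v,t} - \mathbf{g}_{v,t}$, where, we recall, $\mathbf{w}_{v,t}$ and $\mathbf{w}_{v,t+1}$ are the $v$th components of the vectors $\mathbf{y}_t$ and $\mathbf{y}_{t+1}$. Observe that the local gradient component $\mathbf{g}_{v,t}$ can be computed using local information and the iterates $\mathbf{w}_{u,t}$ of the neighbors $u \in \mathcal{N}_v$ of node $v$. This is as it should be, because the descent $\mathbf{w}_{v,t+1} = \mathbf{w}_{v,t} - \mathbf{g}_{v,t}$ is equivalent to (5.3).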
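The equivalence between the stacked update and the per-node recursion can also be checked numerically. The sketch below takes one step in both forms and compares the results. Everything specific in it is an assumption made for illustration: $V = 3$ agents, $p = 2$, quadratic local costs $f_v(\mathbf{w}) = (1/2)\|\mathbf{w} - \mathbf{c}_v\|^2$, and hand-picked values for the weights and current iterates.

```python
# Sketch: one step of the stacked update y - g_t, with
# g_t = (I - Z) y_t + alpha * h(y_t) and Z = W kron I, compared against one
# step of the per-node recursion (5.3).
# Assumptions (not from the text): V = 3, p = 2, quadratic local costs
# f_v(w) = 0.5 * ||w - c_v||^2, hand-picked weights and iterates.

V, p, alpha = 3, 2, 0.1
W = [
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1 / 3, 2 / 3],
]
c = [[1.0, 0.0], [2.0, -1.0], [6.0, 4.0]]
w = [[0.5, 0.5], [-1.0, 2.0], [3.0, 0.0]]  # current local iterates w_{v,t}

def local_grad(v, wv):
    """Gradient of f_v(w) = 0.5 * ||w - c_v||^2 at wv."""
    return [wv[i] - c[v][i] for i in range(p)]

# Per-node form: each agent mixes neighbor iterates, then descends locally.
per_node = []
for v in range(V):
    mixed = [sum(W[v][u] * w[u][i] for u in range(V)) for i in range(p)]
    grad = local_grad(v, w[v])
    per_node.append([mixed[i] - alpha * grad[i] for i in range(p)])

# Stacked form: y concatenates the local vectors, h the local gradients.
n = V * p
y = [x for wv in w for x in wv]
h = [x for v in range(V) for x in local_grad(v, w[v])]
# Kronecker product W kron I_p: block (v, u) equals W[v][u] times identity.
Z = [
    [W[r // p][s // p] if r % p == s % p else 0.0 for s in range(n)]
    for r in range(n)
]
g = [
    y[r] - sum(Z[r][s] * y[s] for s in range(n)) + alpha * h[r]
    for r in range(n)
]
y_next = [y[r] - g[r] for r in range(n)]

flat_per_node = [x for wv in per_node for x in wv]
max_gap = max(abs(a - b) for a, b in zip(y_next, flat_per_node))
print(max_gap)
```

The two results should agree up to floating-point rounding.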

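Is it a good idea to descend on $F(\mathbf{y})$ to solve (5.1)? To some extent. Since we know that the null space of $\mathbf{I} - \mathbf{W}$ is $\text{null}(\mathbf{I} - \mathbf{W}) = \text{span}(\mathbf{1})$ and that $\mathbf{Z} = \mathbf{W} \otimes \mathbf{I}$, we know that the null space of $\mathbf{I} - \mathbf{Z}$ is the set of consensus vectors, i.e., $\text{null}(\mathbf{I} - \mathbf{Z}) = \{\mathbf{y} = [\mathbf{w}_1; \ldots; \mathbf{w}_V] \mid \mathbf{w}_1 = \cdots = \mathbf{w}_V\}$. Thus, $(\mathbf{I} - \mathbf{Z})\mathbf{y} = \mathbf{0}$ holds if and only if $\mathbf{w}_1 = \cdots = \mathbf{w}_V$. Since the matrix $\mathbf{I} - \mathbf{Z}$ is positive semidefinite and symmetric, the same is true of the square root matrix $(\mathbf{I} - \mathbf{Z})^{1/2}$, which has the same null space. Therefore, the optimization problem in (5.2) is equivalent to the optimization problem

$$\tilde{\mathbf{y}}^* := \operatorname*{argmin}_{\mathbf{y}} \ \sum_{v=1}^{V} f_v(\mathbf{w}_v), \quad \text{s.t. } (\mathbf{I} - \mathbf{Z})^{1/2}\mathbf{y} = \mathbf{0}. \tag{5.9}$$

Indeed, for $\mathbf{y} = [\mathbf{w}_1; \ldots; \mathbf{w}_V]$ to be feasible in (5.9) we must have $\mathbf{w}_1 = \cdots = \mathbf{w}_V$. This is the same constraint imposed in (5.2), from which it follows that we must have $\tilde{\mathbf{y}}^* = [\mathbf{w}_1^*; \ldots; \mathbf{w}_V^*]$ with $\mathbf{w}_v^* = \mathbf{w}^*$ for all $v$. The unconstrained minimization in (5.6) is a penalty version of (5.9). The penalty function associated with the constraint $(\mathbf{I} - \mathbf{Z})^{1/2}\mathbf{y} = \mathbf{0}$ is the squared norm $(1/2)\|(\mathbf{I} - \mathbf{Z})^{1/2}\mathbf{y}\|^2$ and the corresponding penalty coefficient is $1/\alpha$. Provided that the penalty coefficient $1/\alpha$ is sufficiently large, the optimal arguments $\mathbf{y}^*$ and $\tilde{\mathbf{y}}^*$ are not too far apart.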
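To make the origin of the coefficient $1/\alpha$ explicit, one can divide the objective in (5.6) by $\alpha > 0$, which does not change its minimizer. The short derivation below is a worked restatement of the definitions above and introduces no new quantities; it uses the symmetry of $(\mathbf{I} - \mathbf{Z})^{1/2}$.

```latex
\begin{align*}
\frac{1}{\alpha} F(\mathbf{y})
  &= \sum_{v=1}^{V} f_v(\mathbf{w}_v)
     + \frac{1}{\alpha} \cdot \frac{1}{2}\, \mathbf{y}^T (\mathbf{I} - \mathbf{Z})\, \mathbf{y} \\
  &= \sum_{v=1}^{V} f_v(\mathbf{w}_v)
     + \frac{1}{\alpha} \cdot \frac{1}{2}\,
       \big\| (\mathbf{I} - \mathbf{Z})^{1/2} \mathbf{y} \big\|^2 .
\end{align*}
```

The first term is the objective of (5.9) and the second is the squared constraint violation weighted by $1/\alpha$, so a smaller stepsize corresponds to a heavier penalty on disagreement.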

The reinterpretation of (5.3) as a penalty method demonstrates that DGD is an algorithm that finds the optimal solution of (5.6), not (5.9) or its equivalent original formulations in (5.1) and (5.2). Using a fixed $\alpha$ the distance between $\mathbf{y}^*$ and $\tilde{\mathbf{y}}^*$ is of order $O(\alpha)$ [126]. To solve (5.9) we need to introduce a rule to progressively decrease $\alpha$. In the following section we exploit the reinterpretation of (5.5) as a method to minimize (5.6) to propose an approximate Newton algorithm that can be implemented in a distributed manner.
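The dependence of the gap on the stepsize can be observed on a toy problem. The sketch below runs fixed-stepsize DGD to (numerical) convergence for two values of $\alpha$ and compares the largest deviation from the consensus optimum. It is an illustration only, under assumptions that are not from the text: scalar decisions, a three-node path graph with hand-picked weights, and quadratic costs $f_v(w) = (1/2)(w - c_v)^2$. The names `dgd_limit` and `consensus_error` are hypothetical.

```python
# Sketch: with a fixed stepsize, DGD settles at the minimizer of F in (5.6);
# its distance to the consensus optimum shrinks roughly in proportion to alpha.
# Assumptions (not from the text): scalar decisions, a 3-node path graph with
# hand-picked weights, quadratic costs f_v(w) = 0.5 * (w - c[v])**2.

W = [
    [2 / 3, 1 / 3, 0.0],
    [1 / 3, 1 / 3, 1 / 3],
    [0.0, 1 / 3, 2 / 3],
]
c = [1.0, 2.0, 6.0]
V = len(c)
w_star = sum(c) / V  # minimizer of the sum of the quadratic costs

def dgd_limit(alpha, num_iters=5000):
    """Run recursion (5.3) long enough to reach its fixed point numerically."""
    w = [0.0] * V
    for _ in range(num_iters):
        w = [
            sum(W[v][u] * w[u] for u in range(V)) - alpha * (w[v] - c[v])
            for v in range(V)
        ]
    return w

def consensus_error(alpha):
    """Largest deviation of any local iterate from the consensus optimum."""
    return max(abs(wv - w_star) for wv in dgd_limit(alpha))

err_large = consensus_error(0.02)
err_small = consensus_error(0.01)
ratio = err_large / err_small
print(err_large, err_small, ratio)
```

In this example halving the stepsize roughly halves the error, which is consistent with a gap of order $O(\alpha)$ but is of course not a proof of it.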
