Global Variable Consensus with Regularization

Consensus and Sharing

7.1 Global Variable Consensus Optimization

7.1.1 Global Variable Consensus with Regularization

In a simple variation on the global variable consensus problem, an objective term $g$, often representing a simple constraint or regularization, is handled by the central collector:

$$
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^N f_i(x_i) + g(z) \\
\text{subject to} & x_i - z = 0, \quad i = 1, \ldots, N.
\end{array}
$$

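The resulting ADMM algorithm is

$$
\begin{aligned}
x_i^{k+1} &:= \operatorname*{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT}(x_i - z^k) + (\rho/2)\|x_i - z^k\|_2^2 \right) && (7.3) \\
z^{k+1} &:= \operatorname*{argmin}_{z} \left( g(z) + \sum_{i=1}^N \left( -y_i^{kT} z + (\rho/2)\|x_i^{k+1} - z\|_2^2 \right) \right) && (7.4) \\
y_i^{k+1} &:= y_i^k + \rho\,(x_i^{k+1} - z^{k+1}). && (7.5)
\end{aligned}
$$

By collecting the linear and quadratic terms, we can express the $z$-update as an averaging step, as in consensus ADMM, followed by a proximal step involving $g$:

$$
z^{k+1} := \operatorname*{argmin}_{z} \left( g(z) + (N\rho/2)\left\| z - \bar{x}^{k+1} - (1/\rho)\bar{y}^k \right\|_2^2 \right),
$$

where $\bar{x}^{k+1} = (1/N)\sum_{i=1}^N x_i^{k+1}$ and $\bar{y}^k = (1/N)\sum_{i=1}^N y_i^k$ denote averages over the $N$ subsystems.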
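To spell out the "collecting terms" step, the following short derivation completes the square in the sum appearing in (7.4). It is a sketch in which $\bar{x}^{k+1}$ and $\bar{y}^k$ denote the averages of $x_i^{k+1}$ and $y_i^k$ over $i$, and "const" collects terms that do not depend on $z$:

```latex
\begin{aligned}
\sum_{i=1}^N \left( -y_i^{kT} z + (\rho/2)\|x_i^{k+1} - z\|_2^2 \right)
&= (N\rho/2)\|z\|_2^2
   - N\rho\, z^T \left( \bar{x}^{k+1} + (1/\rho)\bar{y}^k \right)
   + \text{const} \\
&= (N\rho/2)\left\| z - \bar{x}^{k+1} - (1/\rho)\bar{y}^k \right\|_2^2
   + \text{const}.
\end{aligned}
```

The quadratic is therefore centered at the point $\bar{x}^{k+1} + (1/\rho)\bar{y}^k$, which is the argument passed to the proximal step for $g$.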

In the case with nonzero $g$, we do not in general have $\bar{y}^k = 0$, so we cannot drop the $y_i$ terms from the $z$-update as in consensus ADMM.

As an example, for $g(z) = \lambda\|z\|_1$, with $\lambda > 0$, the second step of the $z$-update is a soft threshold operation:

$$
z^{k+1} := S_{\lambda/(N\rho)}\left( \bar{x}^{k+1} + (1/\rho)\bar{y}^k \right).
$$

As another simple example, suppose $g$ is the indicator function of $\mathbf{R}^n_+$, which means that the $g$ term enforces nonnegativity of the variable. In this case, the update is

$$
z^{k+1} := \left( \bar{x}^{k+1} + (1/\rho)\bar{y}^k \right)_+.
$$
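The two $z$-updates above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names and the convention of stacking the local variables and duals as rows of `(N, n)` arrays are assumptions made here. Both functions first average, forming the point $\bar{x}^{k+1} + (1/\rho)\bar{y}^k$ at which the quadratic in the averaged $z$-update is centered, and then apply the proximal step.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Elementwise soft thresholding: sign(v) * max(|v| - kappa, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def z_update_l1(x_new, y, rho, lam):
    """z-update for g(z) = lam * ||z||_1.

    x_new, y: arrays of shape (N, n) holding x_i^{k+1} and y_i^k as rows
    (a storage layout assumed for this sketch).
    """
    N = x_new.shape[0]
    v = x_new.mean(axis=0) + y.mean(axis=0) / rho  # averaging step
    return soft_threshold(v, lam / (N * rho))      # proximal step


def z_update_nonneg(x_new, y, rho):
    """z-update for g = indicator of the nonnegative orthant."""
    v = x_new.mean(axis=0) + y.mean(axis=0) / rho  # averaging step
    return np.maximum(v, 0.0)                      # projection onto R^n_+


x_new = np.array([[1.0, -2.0], [3.0, 0.0]])
y = np.ones((2, 2))
z_l1 = z_update_l1(x_new, y, rho=2.0, lam=1.0)
z_pos = z_update_nonneg(x_new, y, rho=2.0)
```

Note that the threshold level scales as $\lambda/(N\rho)$, so the same $\lambda$ produces less shrinkage per iteration as the number of subsystems grows.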

The scaled form of ADMM for this problem also has an appealing form, which we record here for convenience:

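$$
\begin{aligned}
x_i^{k+1} &:= \operatorname*{argmin}_{x_i} \left( f_i(x_i) + (\rho/2)\|x_i - z^k + u_i^k\|_2^2 \right) && (7.6) \\
z^{k+1} &:= \operatorname*{argmin}_{z} \left( g(z) + (N\rho/2)\|z - \bar{x}^{k+1} - \bar{u}^k\|_2^2 \right) && (7.7) \\
u_i^{k+1} &:= u_i^k + x_i^{k+1} - z^{k+1}. && (7.8)
\end{aligned}
$$

In many cases, this version is simpler and easier to work with than the unscaled form.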
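As a concrete illustration of the scaled iterations (7.6) to (7.8), the sketch below applies them to a lasso-type problem. The choices $f_i(x_i) = (1/2)\|A_i x_i - b_i\|_2^2$ and $g(z) = \lambda\|z\|_1$, the function name `consensus_lasso`, and the fixed iteration count (no stopping criterion) are all assumptions made for the example rather than anything prescribed by the text.

```python
import numpy as np


def soft_threshold(v, kappa):
    """Elementwise soft thresholding: sign(v) * max(|v| - kappa, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)


def consensus_lasso(A_blocks, b_blocks, lam, rho=1.0, iters=500):
    """Scaled-form consensus ADMM with an l1 term handled by the collector.

    Assumed local objectives: f_i(x) = 0.5 * ||A_i x - b_i||^2.
    Returns the local variables x (N, n), the global z (n,), and u (N, n).
    """
    N = len(A_blocks)
    n = A_blocks[0].shape[1]
    x = np.zeros((N, n))
    u = np.zeros((N, n))
    z = np.zeros(n)
    # Quantities reused in every x-update.
    lhs = [A.T @ A + rho * np.eye(n) for A in A_blocks]
    Atb = [A.T @ b for A, b in zip(A_blocks, b_blocks)]
    for _ in range(iters):
        # (7.6): each x_i solves (A_i^T A_i + rho I) x = A_i^T b_i + rho (z - u_i).
        for i in range(N):
            x[i] = np.linalg.solve(lhs[i], Atb[i] + rho * (z - u[i]))
        # (7.7): average, then soft threshold at level lam / (N rho).
        z = soft_threshold(x.mean(axis=0) + u.mean(axis=0), lam / (N * rho))
        # (7.8): scaled dual update; z broadcasts across the N rows.
        u += x - z
    return x, z, u


rng = np.random.default_rng(0)
b_blocks = [rng.normal(size=4) for _ in range(3)]
A_blocks = [np.eye(4) for _ in range(3)]
x, z, u = consensus_lasso(A_blocks, b_blocks, lam=0.6)
```

In the usage lines, each $A_i$ is the identity, so the problem has the closed-form solution $z^\star = S_{\lambda/N}(\bar{b})$, which gives a simple check on the iterates. The loop over `i` in the $x$-update is the part that would run in parallel across processors.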

7.2 General Form Consensus Optimization

We now consider a more general form of the consensus minimization problem, in which we have local variables $x_i \in \mathbf{R}^{n_i}$, $i = 1, \ldots, N$, with the objective $f_1(x_1) + \cdots + f_N(x_N)$ separable in the $x_i$. Each of these local variables consists of a selection of the components of the global variable $z \in \mathbf{R}^n$; that is, each component of each local variable corresponds to some global variable component $z_g$. The mapping from local variable indices into global variable index can be written as $g = G(i, j)$, which means that local variable component $(x_i)_j$ corresponds to global variable component $z_g$.

Achieving consensus between the local variables and the global variable means that

$$
(x_i)_j = z_{G(i,j)}, \quad i = 1, \ldots, N, \quad j = 1, \ldots, n_i.
$$

If $G(i, j) = j$ for all $i$, then each local variable is just a copy of the global variable, and consensus reduces to global variable consensus, $x_i = z$. General consensus is of interest in cases where $n_i \ll n$, so each local vector only contains a small number of the global variables.

In the context of model fitting, the following is one way that general form consensus naturally arises. The global variable $z$ is the full feature vector (i.e., vector of model parameters or independent variables in the data), and different subsets of the data are spread out among $N$ processors. Then $x_i$ can be viewed as the subvector of $z$ corresponding to (nonzero) features that appear in the $i$th block of data. In other words, each processor handles only its block of data and only the subset of model coefficients that are relevant for that block of data. If in each block of data all regressors appear with nonzero values, then this reduces to global consensus.

For example, if each training example is a document, then the features may include words or combinations of words in the document; it will often be the case that some words are only used in a small subset of the documents, in which case each processor can just deal with the words that appear in its local corpus. In general, datasets that are high-dimensional but sparse will benefit from this approach.

Fig. 7.1. General form consensus optimization. Local objective terms are on the left; global variable components are on the right. Each edge in the bipartite graph is a consistency constraint, linking a local variable and a global variable component.

For ease of notation, let $\tilde{z}_i \in \mathbf{R}^{n_i}$ be defined by $(\tilde{z}_i)_j = z_{G(i,j)}$. Intuitively, $\tilde{z}_i$ is the global variable's idea of what the local variable $x_i$ should be; the consensus constraint can then be written very simply as $x_i - \tilde{z}_i = 0$, $i = 1, \ldots, N$.
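In code, forming $\tilde{z}_i$ amounts to a gather operation on $z$. The sketch below assumes the map $G$ is stored as one integer index array per subsystem (0-based); the particular index sets are hypothetical and chosen only for illustration, with $N = 3$ and $n = 4$.

```python
import numpy as np

# Hypothetical index map: G[i][j] is the global index of local component (x_i)_j.
G = [
    np.array([0, 1, 2, 3]),  # x_1 sees every global component
    np.array([0, 2]),        # x_2 sees two of them
    np.array([1, 2, 3]),     # x_3 sees three of them
]


def z_tilde(z, G):
    """Return [z~_1, ..., z~_N], where (z~_i)_j = z[G[i][j]]."""
    return [z[idx] for idx in G]


z = np.array([10.0, 20.0, 30.0, 40.0])
zt = z_tilde(z, G)
```

Because each $\tilde{z}_i$ is obtained by selecting entries of $z$, it is a linear function of $z$, which is what keeps the consensus constraint $x_i - \tilde{z}_i = 0$ linear.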

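The general form consensus problem is

$$
\begin{array}{ll}
\text{minimize} & \sum_{i=1}^N f_i(x_i) \\
\text{subject to} & x_i - \tilde{z}_i = 0, \quad i = 1, \ldots, N,
\end{array}
\qquad (7.9)
$$

with variables $x_1, \ldots, x_N$ and $z$ ($\tilde{z}_i$ are linear functions of $z$).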

A simple example is shown in Figure 7.1. In this example, we have $N = 3$ subsystems, global variable dimension $n = 4$, and local variable dimensions $n_1 = 4$, $n_2 = 2$, and $n_3 = 3$. The objective terms and global variables form a bipartite graph, with each edge representing a consensus constraint between a local variable component and a global variable.

The augmented Lagrangian for (7.9) is

$$
L_\rho(x, z, y) = \sum_{i=1}^N \left( f_i(x_i) + y_i^T(x_i - \tilde{z}_i) + (\rho/2)\|x_i - \tilde{z}_i\|_2^2 \right),
$$

with dual variable $y_i \in \mathbf{R}^{n_i}$. Then ADMM consists of the iterations

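$$
\begin{aligned}
x_i^{k+1} &:= \operatorname*{argmin}_{x_i} \left( f_i(x_i) + y_i^{kT} x_i + (\rho/2)\|x_i - \tilde{z}_i^k\|_2^2 \right) \\
z^{k+1} &:= \operatorname*{argmin}_{z} \left( \sum_{i=1}^N \left( -y_i^{kT} \tilde{z}_i + (\rho/2)\|x_i^{k+1} - \tilde{z}_i\|_2^2 \right) \right) \\
y_i^{k+1} &:= y_i^k + \rho\,(x_i^{k+1} - \tilde{z}_i^{k+1}),
\end{aligned}
$$

where the $x_i$- and $y_i$-updates can be carried out independently in parallel for each $i$.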

The $z$-update step decouples across the components of $z$, since $L_\rho$ is fully separable in its components:

$$
z_g^{k+1} := \frac{\sum_{G(i,j)=g} \left( (x_i^{k+1})_j + (1/\rho)(y_i^k)_j \right)}{\sum_{G(i,j)=g} 1},
$$

so $z_g$ is found by averaging all entries of $x_i^{k+1} + (1/\rho) y_i^k$ that correspond to the global index $g$. Applying the same type of argument as in the global variable consensus case, we can show that after the first iteration,

$$
\sum_{G(i,j)=g} (y_i^k)_j = 0,
$$

i.e., the sum of the dual variable entries that correspond to any given global index g is zero. The z-update step can thus be written in the simpler form

$$
z_g^{k+1} := (1/k_g) \sum_{G(i,j)=g} (x_i^{k+1})_j,
$$

where $k_g$ is the number of local variable entries that correspond to global variable entry $z_g$. In other words, the $z$-update is local averaging for each component $z_g$ rather than global averaging; in the language of collaborative filtering, we could say that only the processing elements that have an opinion on a feature $z_g$ will vote on $z_g$.
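The general form iterations can be sketched as follows. This is a minimal illustration under stated assumptions: the local objectives are taken to be $f_i(x_i) = (1/2)\|x_i - a_i\|_2^2$ so that the $x$-update has a closed form, the map $G$ is stored as one 0-based index array per subsystem, every global component is assumed to appear in at least one local variable, and the function name and index sets are hypothetical. The $z$-update uses the full averaging formula; with duals initialized to zero, the dual entries for each global index sum to zero throughout, so it coincides with the simpler local average of the $x$ entries.

```python
import numpy as np


def general_consensus_admm(a, G, n, rho=1.0, iters=200):
    """General form consensus ADMM for f_i(x) = 0.5 * ||x - a_i||^2 (assumed).

    a: list of local data vectors a_i; G: list of index arrays with
    G[i][j] the global index of (x_i)_j; n: dimension of z.
    """
    x = [np.zeros(len(idx)) for idx in G]
    y = [np.zeros(len(idx)) for idx in G]
    z = np.zeros(n)
    # k[g] = number of local entries mapped to global index g.
    k = np.zeros(n)
    for idx in G:
        np.add.at(k, idx, 1.0)
    for _ in range(iters):
        # x-update (parallel over i): stationarity of
        # 0.5||x - a_i||^2 + y_i^T x + (rho/2)||x - z~_i||^2.
        for i, idx in enumerate(G):
            x[i] = (a[i] - y[i] + rho * z[idx]) / (1.0 + rho)
        # z-update: per-component average of x_i + y_i / rho.
        num = np.zeros(n)
        for i, idx in enumerate(G):
            np.add.at(num, idx, x[i] + y[i] / rho)
        z = num / k
        # y-update (parallel over i).
        for i, idx in enumerate(G):
            y[i] = y[i] + rho * (x[i] - z[idx])
    return x, z, y


G = [np.array([0, 1, 2, 3]), np.array([0, 2]), np.array([1, 2, 3])]
a = [np.array([1.0, 2.0, 3.0, 4.0]), np.array([3.0, 5.0]), np.array([4.0, 7.0, 8.0])]
x, z, y = general_consensus_admm(a, G, n=4)

# Sum of dual entries attached to each global index.
dual_sums = np.zeros(4)
for i, idx in enumerate(G):
    np.add.at(dual_sums, idx, y[i])
```

For this choice of $f_i$, the optimal $z_g$ is the average of the entries of the $a_i$ that map to $g$, so only the subsystems that hold a given component influence its value.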

In document Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers (Page 54-58)