AIDE:Fast and Communication Efficient Distributed Optimization

(1)

Edinburgh Research Explorer

AIDE

Citation for published version:

Reddi, SJ, Konený, J, Richtárik, P, Póczós, B & Smola, A 2016 'AIDE: Fast and Communication Efficient Distributed Optimization' ArXiv.

Link:

Link to publication record in Edinburgh Research Explorer

General rights

Copyright for the publications made accessible via the Edinburgh Research Explorer is retained by the author(s) and / or other copyright owners and it is a condition of accessing these publications that users recognise and abide by the legal requirements associated with these rights.

Take down policy

The University of Edinburgh has made every reasonable effort to ensure that Edinburgh Research Explorer content complies with UK legislation. If you believe that the public display of this file breaches copyright please contact [email protected] providing details, and we will remove access to the work immediately and investigate your claim.

(2)

AIDE: Fast and Communication Efficient

Distributed Optimization

Sashank J. Reddi

Carnegie Mellon University

University of Edinburgh

Jakub Koneˇcný

University of Edinburgh

Peter Richtárik

Barnabás Póczós

Carnegie Mellon University

Alex Smola

Abstract

In this paper, we present two new communication-efficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANEalgorithm [20] that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANEsignificantly. In fact, our approach can be viewed as a robustification strategy since the method is substantially better behaved than DANE on data partition arising in practice.

It is well known that DANE algorithm does not match the communication complexity lower bounds. To bridge this gap, we propose an accelerated variant of the first method, called AIDE, that not only matches the

communication lower bounds but can also be implemented using a purely first-order oracle. Our empirical results show that AIDEis superior to other communication efficient algorithms in settings that naturally arise

in machine learning applications.

1 Introduction

With the advent of large scale datasets, distributed machine learning has garnered significant attention recently.

For example, large scale distributed machine learning systems such as theParameter server [9],GraphLab

[22] and TensorFlow[1] work with datasets sizes in the order of hundreds of terabytes. When dealing with

datasets of such scale in distributed systems, computational and communication workloads need to be designed carefully. This in turn places primary importance to computational and communication resource constraints on the algorithms used in these machine learning systems.

In this paper, we study the problem of distributed optimization for solving empirical risk minimization

problems, where one seeks to minimize average ofN functionsminw_∈_Rdf(w) := 1

N PN

i=1fi(w). Many

problems in machine learning, such as logistic regression or deep learning, fall into this setting. Formally, we

assume a distributed system consisting ofKmachines, where machinekhas access to a (nonempty) subset of

the data indexed byPk ⊂[N] :={1, . . . , N}. We assume that{Pk}forms a partition of[N]. Our goal can

then thus be reformulated as min w∈Rdf(w) = 1 K K X k=1 Fk(w), whereFk(w) := K N X i∈Pk fi(w). (1)

We assume that (1) has an optimal solution. Bywˆwe denote an arbitrary optimal solution:wˆ∈arg minwf(w).

The fundamental constraint is that machinekcan directly access only (the data describing) functionFk(w),

i.e., functionsfi(w)fori ∈ Pk. Typically,fi(w) = φi(w>xi), wherexiis theithdata point, andφi(·)is a

loss function (dependence oniindicates potential dependence on a label). The functionfiis assumed to be

continuous butnotnecessarily convex — the structure also encompasses problems in deep learning.

(3)

The basic benchmark algorithm for solving the optimization problem is gradient descent (GD). Each

iter-ation of GD performs the update stepw _←w₋ h

K PK

k=1∇Fk(w), whereh >0is a stepsize parameter. In

a distributed system, this amounts to calculation of the gradient_∇Fk(w)by each machinekand then sending

an update to a central node to compute the full update direction 1

K PK

k=1∇Fk(w). Although such an update

sequence is simple, it is often communication intensive due to the communication involved at each iteration of the algorithm. Furthermore, because of the communication delay, computational resources at each machine are often under-utilized. While the latter problem can be addressed by using asynchronous and stochastic variants

of gradient descent [15,16], these algorithms still incur high communication costs at each iteration.

Recently, there have been considerable theoretical advances in our understanding of the communication

overheads in solving (1). In particular, three different methods address this issue by designing a procedure

that uses significant amount of computation locally, between rounds of communication. These methods are

DANE[20], DISCO[23] and COCOA+ [11,5,12,21]. Both DANEand DISCOshow that when the functions

{Fk_}K

k=1in (1) are similar (see Definition1) or exhibit special structure, one could obtain the optimal solution

in considerably fewer rounds of communication.

At thetthiteration of DANE, the following general subproblem has to be solvedexactlyby machinek, with

η _≥0andµ_≥0being parameters of the method:

min

w∈RdFk(w)−

∇Fk(wt−1)₋η_∇f(wt−1), w+µ₂_kw₋wt−1_k2,

whereh·,·iis the standard inner product andk · k :=h·,·i1/2 _{is the standard Euclidean norm. The gradient}

∇f(wt−1₎ _{is computed in a distributed fashion at the start, followed by aggregation of individual updates.}

DISCOuses aninexactdamped Newton method for solving the problem (1) [23]; and hence is asecondorder

method. Thus, both DANEand DISCO require strong oracle access — either in the form of oracle to solve

another optimization problem exactly, or second order information. This is in sharp contrast with COCOA+,

where (similarly to DANE) a general local problem is formulated, but is needed to be solved only approximately. This enables use of essentially any algorithm to be used locally for a number of iterations, making the method much more versatile.

A natural idea is to solve DANE subproblem approximately using a first-order method. However, from

the point of analysis, this leads to an additional error. Hence, it is not clear whether such method enjoys any theoretical guarantees, as such a generalization is non-trivial. Based on this intuition, we develop an inexact

version of DANE(refereed to as INEXACTDANE) and provide its convergence analysis. We demonstrate that

our proposed approach is significantly more robust to ‘bad’ data partitioning, and also highlight a connection

to a distributed version of SVRGalgorithm [6,8]; thus, yielding partial convergence guarantees for a method

that has been observed to perform well in practice [7], but has not been successfully analyzed.

While INEXACTDANEis practically appealing in comparison to DISCOdue to its simplicity and possible

reliance only on first-order information, DISCOistheoreticallysuperior to INEXACTDANEin terms of

commu-nication complexity, as it nearly matches the lower bounds derived in [2]. To address this issue, we build upon

INEXACTDANEand propose an approach, referred to as AIDE, that matches (up to logarithmic factors) these

communication complexity lower bounds and can be implemented using a first-order oracle. To our knowl-edge, ours is the first work that can be implemented using a first-order oracle, but at the same time achieves near optimal communication complexity for the setting considered in this paper.

1.1 Related Work

The literature on distributed optimization is vast and hence, we restrict our attention to the works most relevant to our paper. As mentioned earlier, the most straightforward approach for solving the distributed optimization is

using gradient descent or its accelerated version. When the functionf isL-smooth andλ-strongly convex, the

communication complexity of accelerated gradient descent isO(pL/λlog(1/))for achieving-suboptimal

solution [13]. Another possible approach, often referred to as “one shot averaging”, is solve the local

optimiza-tion problem parallelly at each machine and then compute average of the soluoptimiza-tions [25,24]. However, it can be

(4)

ADMM (alternating direction method of multipliers) is another popular approach for solving distribution optimization problems. Under certain conditions, ADMM is shown to achieve communication complexity of

O(pL/λlog(1/))forL-smooth andλ-strongly convex function. More recently, the above mentioned DANE,

DISCO and COCOA+ algorithms have been proposed to tackle the problem of reducing the communication

complexity in solving problems of form (1). We defer a detailed comparison of our algorithms with DANE

and COCOA+ to Section7and AppendixE. The lower bounds on the communication complexity for solving

problems of form (1) have been obtained in [2] (see Section8for more details). We refer the readers to [20,2]

for a more comprehensive coverage of the related work.

2 Algorithm: I

NEXACT

D

ANE

The DANE algorithm [20] proceeds as follows. At iterationt, nodek exactlysolves the subproblemwˆt

k =

arg minwgk,µt (w), where

gk,µt (w) :=Fk(w)−

∇Fk(wt−1)₋η_∇f(wt−1), w+µ₂_kw₋wt−1_k2, (2)

andµ, η _≥ 0are parameters. After this, the method computeswt ₌ PK

k=1wˆkt/K, which involves

commu-nication, and the process is repeated. Our first method, INEXACTDANE, modifies this algorithm by allowing

the subproblems to be solved inexactly. In particular, letwt

kdenote the point obtained by minimizinggk,µt only

approximately (for example, using Quartz [14] or SVRG[6,8]). The pseudocode of this method is formalized

in Algorithm1. The parameterγrefers to thelevel of inexactnessallowed (smallγ means higher accuracy).

Note that each iteration of INEXACTDANEcan be solved in an embarrassingly parallel fashion.

The communication complexity of an algorithm is defined as the number of communication rounds required

by the algorithm to reach a solutionwt_{satisfying specific convergence criteria such as}_f_(wt₎₋_f_{( ˆ}_w)_≤_{. In}

each communication round, the machines can only send information linear in size of the dimensiond. This is

also the notion of communication round used in [2].

We start with the analysis ofquadratic case(Section3), after which we deal with strongly convex, weakly

convex and nonconvex cases (Section4).

3 Analysis of I

NEXACT

D

ANE

: Quadratic Case

In this section we assume thatFk(w) = 1₂w>_Hkw₊_l>

kw, whereHk ∈ Rd×d andlk ∈Rd (this is the case

when the functions{fi_}N

i=1 in (1) are quadratic). We writeH := K1

PK

k=1Hk andl := K1

PK

k=1lk. For a

square matrixA, bykA_kwe denote its spectral norm. Our first result plays a central role in the analysis of

quadratic functions. Proofs are provided in the Appendix.

Theorem 1. AssumeH 0. Letµ_≥0and defineH˜−1_:= 1 K

PK

k=1(Hk+µI)−1. Further, choose0≤γ <1,

η >0and define

ρ:=_kηH˜−1_H

−I_k+ηγ_K PK_k=1_k(Hk+µI)−1_H

k.

Let{wt_}_{be the iterates of}_INEXACTDANE_(Algorithm₁_{applied with Option I). Then for all}_t _≥₁_{we have}

kwt₋_w_ˆ_{k ≤}_ρ_k_wt−1₋_w_ˆ_k_.

Suppose,0 < λ Hk Lfor allk ∈ [K], then withη = 1, sufficiently largeµand smallγ, one can

ensure thatρ <1. We are interested in the setting where the functions{Fk}are “similar”. Following [2], we

shall measure similarity via the notion ofδ-relatedness, defined next.

Definition 1. We say that quadratic functions{Fk_}areδ-related, if for allk, k0_∈_[K]_{we have}

(5)

Algorithm 1:INEXACTDANE(f,w0_,_s,_γ,_µ)

Data:f(w) = _K1 PK_k=1Fk(w), initial pointw0

∈Rd_{, inexactness parameter}₀

≤γ <1

fort= 1tosdo

fork= 1toKdo in parallel

Find an approximate solutionwt

k ≈wˆtk := arg minw∈Rdg_k,µt (w) Option I:kwt k−wˆtkk ≤γkwt−1−wˆtkk Option II:k∇gt k,µ(wtk)k ≤γk∇gtk,µ(wt−1)kandkwtk−wˆktk ≤γkwt−1−wˆtkk end wt₌ 1 K PK k=1wtk end returnwt

The following theorem specifies the convergence rate in the case ofδ-related objectives. We are

partic-ularly interested in the setting whereδis small, which is the case that should allow stronger communication

complexity guarantees [20,2].

Theorem 2. Let0 < λ_≤Lbe such thatλI H LI, and assume that{Fk_}areδ-related, withδ _≥0. Chooseµ= max_{0,(8δ2_/λ) −λ_},η= 1and let γ= 1/8 if2√2δ≤λ λ2_/(192δ2₎ _otherwise ρ˜:= 2/3 if2√2δ≤λ 1₋(λ2_/24δ2₎ _otherwise The iterates_{wt

}ofINEXACTDANE(Algorithm1with Option I) satisfykwt

−wˆ_{k ≤}ρ˜_kwt−1

−wˆ_k.

It is interesting to note that solving the local subproblems beyond the accuracyγstated in the above result

doesnotlead to a better overall complexity result. That is, the required accuracy at the nodes of the distributed

system should not be perfect, but should instead depend inδandλ. We have the following corollary, translating

the result into a bound on the number of iterations (i.e., number of communication rounds) sufficient to obtain

an-solution.

Corollary 1. Let the assumptions of Theorem2be satisfied, and pick0< < L_kw0

−wˆ_k2_{. If} t= ˜Oδ_λ22log _L kw0 −wˆk2 , thenf(wt₎ ≤f( ˆw) +.

The_O(˜ _·₎_{notation hides few logarithmic factors. In the situation when for all}_k,_Hk _{arises as the average}

of Hessians of ni.i.d. quadratic functions with eigenvalues upper-bounded byL, it can be shown that δ =

O(L/√n), hencet = ˜O((L/λ)2_/n_log(1/))_{(see Corollary} ₄ _{in the Appendix). This arises for instance}

in a linear regression setting where the samples at each machine are i.i.d — a scenario typically assumed in

distributed machine learning systems. Note that, in this case, the communication complexitydecreasesas n

increases.

4 Analysis of I

NEXACT

D

ANE

: General Case

In this section, we present the results for general strongly convex case, weakly convex case, and non-convex case. Proofs are provided in the Appendix.

A functionψ : Rd → R is calledL-smooth if it is differentiable, and if for all x, y ∈ Rd _{we have}

k∇ψ(x)_{− ∇}ψ(y)_{k ≤}L_kx₋y_k.It is calledλ-strongly convexifψ(x)_≥ψ(y) +_h∇ψ(y), x₋y_i+λ

2kx−yk 2_, for allx, y_∈_Rd_(if_ψ_{is twice differentiable, this is equivalent to requiring that}

∇2_ψ(w)

(6)

Ifλ >0, we say thatψis strongly convex. In this case,κ=L/λdenotes the condition number. We say thatψ isweakly convexif it is0-strongly convex.

4.1 Strongly convex case

Our analysis of the strongly convex case follows along the lines of [20] and incorporates the inexactness in

solving the subproblem in (2) at each iteration.

Theorem 3. Assume that for allk∈[K],FkisL-smooth andλ-strongly convex, withλ >0. Let ˜ ρ:= ₍₁ −γ)2 η(L+µ)− 2L (λ+µ)2 − 2γ(L+µ) η(λ+µ)2 η2λ. (3)

Supposeη >0,0_≤γ <1andµ >0are chosen such that0<ρ <˜ 1. Then for the iterates of INEXACTDANE

(Algorithm1with Option II) we have:f(wt₎

−f( ˆw)_≤(1₋ρ)˜ f(wt−1₎

−f( ˆw).

By settingγ= 0, we roughly recover the corresponding result in [20] covering the exact case. Note thatγ

only has a weak effect on the convergence rate. For instance, withγ= 1/8,η = 1andµ= 6L−λ, we require

O(L_λlog(1/))iterations to achieve a solutionwt_{such that}_f_(wt₎₋_f_{( ˆ}_w)_≤_.

4.2 Weakly convex case

We analyze the weakly convex case by using a perturbation argument. This allows us to use the analysis of strongly convex case for proving the convergence analysis. In particular, we consider the following function

f(w) = f(w) +

2kw−w

0

k2_{, which is essentially a perturbation of the function}_f_{. First, note that if}_f _is

L-smooth thenfis(L+)-smooth and-strongly convex. Applying INEXACTDANE(Algorithm1) on the

functionfand using Theorem3, we get the following:

f(wt)₋min w f(w)≤(1−ρ) sh_f(w0₎ −min w f(w) i (4)

whereρis defined as in (3) withλandLreplaced byandL+respectively. Using the above relationship,

the corollary below follows immediately.

Corollary 2. Suppose the functionf is weakly convex. Then the iterates of INEXACTDANEin Algorithm1

(Option II) withη = 1,γ= 1/8andµ= 6L+ 5applied tofafters= ˜O(Llog(1/)/)iterations satisfy

f(wt₎

≤f( ˆw) +O().

The result essentially shows sublinear convergence rate (up to logarithmic factor) of INEXACTDANEfor

weakly convex functions and weak dependence on the inexact parameterγ.

4.3 Nonconvex case

Finally, we look at the case where function f can be nonconvex. The key observation is that by choosing

large enoughµone can have a strongly convexgt

k,µ in (2). The existence of such aµis due to the Lipschitz

continuous nature of gradient off. This allows us to solve the subproblem efficiently and makes it amenable to

analysis. For the purpose of theoretical analysis, in the nonconvex case, we selectwt_{arbitrarily from}_{_wt

k}Kk=1

instead of averaging as in Algorithm1.

Theorem 4. Assume that for allk∈[K],FkisL-smooth. Suppose the iterateswtkof Algorithm1(with Option

II) for some0_≤γ <1. Furthermore, let

θ:= ₍₁ −γ)2 η(L+µ)− 2L (µ₋L)2 − 2γ(L+µ) η(µ₋L)2 η2,

whereη >0,0_≤γ <1andµ > Lare such thatθ >0. Then, we have

min 0≤t0_≤t−1k∇f(w t0₎ k2 ≤ f(w 0₎₋_f_{( ˆ}_w) θt .

(7)

Algorithm 2:AIDE(f,w0_,_λ,_τ,_s,_γ,_µ,₎

Data:f(w) = _K1 PK_k=1Fk(w), Initial pointy0₌_w0

∈Rd_{, INEXACTDANE}_iterations_{s, inexactness}

parameter0≤γ <1,τ≥0 Letq=λ/(λ+τ) whilef(wt−1₎₋_f_{( ˆ}_w)_≤_do Defineft_{(w) :=} 1 K PK k=1(Fk(w) + τ 2kw−yt−1k2) wt₌_{INEXACTDANE(f}t_{, w}t−1_{, s, γ, µ)}

Findζt∈(0,1)such thatζ2

t = (1−ζt)ζt2−1+qζt Computeyt₌_wt₊_βt(wt −wt−1₎_where_βt₌ζt−1(1−ζt−1) ζ2 t−1+ζt end returnwt

Withη = 1, one could chooseµ= 10Land constant inexactness ofγ= 1/8to getθ _≥1/100L. Again,

similar to the strongly convex case, we see a very weak effect ofγon the convergence rate. In other words,

with constant approximation at each local node, we can obtainO(1/t)convergence rate to a stationary point

for nonconvex functions.

5 Accelerated Distributed Optimization

In this section, we study an accelerated version of INEXACTDANEalgorithm. While INEXACTDANEalgorithm

reduces the communication complexity in distributed settings, it is known that it is not optimal i.e., there

is a gap between upper bounds of INEXACTDANEand the lower bounds for the communication complexity

proved in [2]. To this end, we propose an accelerated variant of INEXACTDANEalgorithm and prove that it

matches the lower bounds (upto logarithmic factors) in specific settings. We refer to the Accelerated Inexact

DanE as “AIDE” ( Algorithm 2). The key idea is to apply the generic acceleration scheme—catalyst—in

[10] to INEXACTDANE. We show that by a careful selection ofτ in Algorithm2, one can achieve optimal

communication complexity in interesting and important settings.

5.1 Quadratic case

Here we show that by appropriate selection ofτ, for quadratic case, one can achieve faster convergence rates

that match the lower bounds presented in [2].

Theorem 5. Assume thatλI H LI and further assume that{Fk_}are δ-related. Then the iterates of

AIDE(Algorithm2) withη = 1,µ= 0,τ = max_{0,2√2δ₋λ_},s= ˜O(1)andγ = 1/8₋(δ2_/2(τ₊_λ)2₎

aftert= ˜O(pδ/λlog(1/))satisfyf(wt₎

≤f( ˆw) +and the total number of iterations inINEXACTDANE

(Option I) is_O(˜ p_δ/λ_log(1/))_.

Note that the communication cost of AIDEis proportional to the number of iterations of INEXACTDANE

ex-ecuted as part of AIDEalgorithm. Hence, the total communication complexity of AIDEisO(˜ pδ/λlog(1/)).

It is interesting to note that it is enough to solve each subproblem in INEXACTDANEup to a constant accuracy

ofγ= 1/8in order to achieve the convergence rate stated above. In other words, by using AIDE, one need not

solve the subproblems with high accuracy and still achieve faster convergence than INEXACTDANE.

5.2 Convex case

For the general strongly convex case, we prove the following key result concerning the number of iterations of

(8)

Theorem 6. Assume that for allk_∈[K], the functionFkisL-smooth andλ-strongly convex. Then the iterates ofAIDEwith parametersλ,η= 1,µ= 12L,γ= 1/8,s= ˜O(1)andτ =L₋λaftert= ˜O(pL/λlog(1/))

iterations satisfy f(wt₎

≤ f( ˆw) + and the total number of iterations in INEXACTDANE (Option II) is ˜

O(pL/λlog(1/)).

The upper bound on the communication complexity matches the lower bounds proved in [2] for unrelated

strongly convex functions. It is an interesting open problem to obtain better communication complexity for a subclass of strongly convex functions other than the quadratic case explored above.

As in the case of INEXACTDANE, we apply AIDEto a perturbed problem whenf is weakly convex. In

this manner, we invoke catalyst (acceleration) for strongly convex functions. Recall the functionfis defined

asf(w) = f(w) + ₂kw−w0k2_{. Using AIDE} _on_{f, Theorem}₆_{shows that one can achieve}_{-accuracy on}

functionfint= ˜O(p(L+)/log(1/))iterations. This follows from the fact thatfisL+-smooth and

-strongly convex. Again, using similar argument as in Equation (17), we have the following corollary.

Corollary 3. Suppose the functionf is weakly convex. Then iterates ofAIDEin Algorithm2applied onf

withλ=,η = 1,γ= 1/8andµ= 12(L+),s= ˜O(1)andτ =Laftert= ˜O(pL/log(1/))iterations

satisfyf(wt₎

≤f( ˆw) +O().

6 Connection to a Practical Distributed Version of SVRG

In this section, we discuss the connection between SGD, INEXACTDANEand a version of distributed SVRG

that was observed to perform well in practice [7], but has not yet been successfully analyzed. Algorithm1uses

an approximate solution to the minimization of functiongt

k,µ(w)at each iteration; however, it does not specify

the algorithm (local solver) used to obtain it. While one could use any of recent fast algorithms for finite sums

[17,4,19,3] with comparable results, we highlight the consequences of applying the SVRGalgorithm [6,8] as

a local solver in INEXACTDANE.

The SVRGalgorithm for minimizing average of functionsf(w) = 1

N PN

i=1fi(w)runs in two nested loops.

In thetthouter iteration, SVRGcomputes the gradient of the entire function∇f(wt−1₎_{followed by inner loop,}

where an update direction atwis computed as∇fi(w)_−∇fi(wt−1_{) +}

∇f(wt−1₎_{for a randomly chosen index}

i _∈ [N]. For its distributed version, the inner loop is executed parallelly over all the machines but the index

i for machinekis sampled only fromPk (the data available locally) instead of[N]. For completeness, the

pseudocode is provided in AppendixD.

We can obtain an identical sequence1_{of iterates}_wt_{by looking at a variation of INEXACTDANE}_algorithm.

It is easy to verify that if we apply SVRGto the INEXACTDANEsubproblemgt

k,µin (2) withµ= 0andη = 1,

running SVRGfor a single outer iteration, the resulting procedure is equivalent to running the Algorithm3

(in the Appendix). Pointing out this connection gives partial justification for the performance of Algorithm3.

However, a direct analysis for Algorithm3still remains an open question as the above reasoning applies only

whenµ= 0andη = 1. Note that our results apply to this special setting only under specific conditions; for

example when quadratic functions{Fk}that areδ-related with sufficiently smallδ.

7 Experiments

In this section, we demonstrate the empirical performance of the INEXACTDANE and AIDE algorithms on

binary classification task using different loss functions. We compare the performance of our algorithms to that

of COCOA+, a popular distributed algorithm [11]. COCOA+ is particularly relevant to our setting since, similar

to INEXACTDANEand AIDE, it provides flexibility in dealing with communication/computation tradeoff by

choosing the local computation effort spent between rounds of communication. We use SVRGas the local

solver for INEXACTDANEand AIDE, and SDCAas the local solver for COCOA+ in all our experiments. Unless

stated otherwise, at each iteration of the algorithms, we run the corresponding local solvers for a single pass over

1_{Subject to the same sequence on sampled indices}_i_{∈ P}

(9)

data available locally. Note that for all the algorithms considered here, the iteration complexity is proportional to their communication complexity.

For our experiments, we use standard binary classification datasets rcv1, covtype, real-sim and url2_{. As}

part of preprocessing, we normalize the data and add a bias term. In our experiments, the data is randomly partitioned across individual nodes; thus, mimicking the i.i.d data distribution. We are interested in examining the effect of varying amount of local computation and number of nodes on the performance of the algorithms. For our results, we present the best performance obtained by selected from a range of stepsize parameters

for SVRGand aggregation parameters for COCOA+. For COCOA+, this amounts to searching for the optimal

choice of parameterσ0. In the plots, we use DANEas the label for INEXACTDANE.

0 20 40 60 80 100 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 100 200 300 400 500 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 200 400 600 800 1000 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 10 20 30 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 5 10 15 20 25 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 10 20 30 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64

Figure 1: Top: rcv1 dataset; Smoothed hinge loss; λset to1/(cN), forc _{∈ {}1,10,100_}. Bottom: Logistic

loss;left-hand two: Varying number of local data passes per iteration. rcv1 and url datasets. right-hand two:

Varying number of nodes; rcv1 and url datasets.

In the top row of Figure1, we compare INEXACTDANE, AIDEand COCOA+ for classification task on the

rcv1 dataset with smoothed hinge loss on8nodes. The three plots are with regularization parameterλset to

1/(cN), forc ∈ {1,10,100}. It can be seen that AIDEoutperforms both INEXACTDANEand COCOA+. We

see similar behavior on other datasets (see SectionE.1of the Appendix). The benefits of AIDEare particularly

pronounced in settings with large condition numberκ=L/λ > N. This can also be explained theoretically, as

convergence rate of fast accelerated stochastic methods has a dependence ofO(√N κ)onκandN, as opposed

toO(κ)without acceleration [18,10].

In the two left-hand side plots of the bottom row of Figure1, we demonstrate the effect of local

compu-tation effort at each iteration, to solve the local subproblem (2), on the overall convergence of the algorithm.

The different lines signify{1/6,1/3,1,2,4_}passes through data available locally per iteration of INEXACT

-DANE. From the plots, it can be seen that running SVRGbeyond4 passes through local data only provides

little improvement; thus, suggesting a natural computational setting for INEXACTDANEand demonstrating its

advantage over DANE. Finally, in the two right-hand side plots of the bottom row, we show performance of

INEXACTDANEwith increasing number of nodes. In particular, we partition the data to_{4,8,16,32,64_}nodes,

and keep the number of local iterations of SVRGconstant for all settings. The results suggest that in practice,

INEXACTDANEcan scale gracefully with the number of nodes. We only observe drop in the performance with

rcv1 dataset on64nodes, where the optimal stepsize of SVRGwas significantly smaller than in other cases.

Due to space constraints, we relegate experiments on other datasets and loss functions, and demonstration of

the robustness of the INEXACTDANEmethod over DANEto SectionEof the Appendix.

8 Discussion

In this section, we give a brief comparison of the results we presented. In particular, we compare the key

aspects of DANE, DISCO, COCOA+, INEXACTDANEand AIDEalgorithms.

(10)

Communication Complexity: We showed that AIDEnearly achieves the lower bounds proved in [2] in

spe-cific cases. When the functions are quadraticδ-related, AIDEand DISCOmatch the communication complexity

lower bound of_O(˜ p_δ/λ_{log(1/)). However, DANE}_{and INEXACTDANE}_{have a sub-optimal communication}

complexity of_O((δ˜ 2_/λ2_{) log(1/))}_{in this case. Similarly, for the general unrelated strongly convex function}

with condition numberκ, AIDEand DISCOenjoy communication complexity of_O(˜ √_κ_log(1/))_[₂_{]. Again,}

this complexity is superior to the convergence rate ofO(κlog(1/))of COCOA+ , INEXACTDANEand DANE.

Nature of oracle access: Both DANEand DISCOrequire access to a strong oracle. While DANEtechnically

requires an oracle that solves the subproblem (2), DISCOrequires a second-order oracle for its execution. On

the other hand, both INEXACTDANEand AIDEcan be executed using a simple first-order oracle, matching the

major advantage of COCOA+.

Parallelism and Implementation: One of the appealing aspect of the DANE, INEXACTDANE, AIDEand COCOA+ is the simplicity of implementation and its suitability for large-scale distributed environments. Each iteration of these algorithms is embarrassingly parallelizable since it involves solving a local objective function

at each node. The same cannot be said about DISCOdue to the asymmetric workload on the master node at

each iteration of the inexact damped Newton iteration.

Distributed SVRG: As a by-product of our analysis, we obtain partial convergence guarantees of a

dis-tributed version of popular SVRGalgorithm, that was observed to perform well in practice [7].

In conclusion, AIDEadopts practical advantages of DANE, but at the same time also achieves the optimal

communication complexity in [2]. Furthermore, similar to COCOA+, AIDE and INEXACTDANEprovide an

efficient way of balancing communication and computation complexity; thereby, providing a powerful frame-work for solving large-scale distributed optimization in a communication-efficient manner.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Cor-rado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on

heterogeneous distributed systems.arXiv:1603.04467, 2016.

[2] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and

opti-mization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,NIPS 28, pages

1747–1755. 2015.

[3] Aaron Defazio. A simple practical accelerated method for finite sums. arXiv:1602.02442, 2016.

[4] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method

with support for non-strongly convex composite objectives. InNIPS 27, pages 1646–1654. 2014.

[5] Martin Jaggi, Virginia Smith, Martin Takáˇc, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and

Michael I Jordan. Communication-Efficient Distributed Dual Coordinate Ascent. In NIPS 27, pages

3068–3076. 2014.

[6] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance

reduc-tion. InNIPS 26, pages 315–323. 2013.

[7] Jakub Koneˇcný, Brendan McMahan, and Daniel Ramage. Federated optimization: distributed

optimiza-tion beyond the datacenter.arXiv:1511.03575, 2015.

[8] Jakub Koneˇcný and Peter Richtárik. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.

[9] Mu Li, David G Andersen, Alex J Smola, and Kai Yu. Communication efficient distributed machine learning with the parameter server. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q.

(11)

[10] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In

C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,NIPS 28, pages 3366–3374.

2015.

[11] Chenxin Ma, Jakub Koneˇcný, Martin Jaggi, Virginia Smith, Michael I Jordan, Peter Richtárik, and Martin

Takáˇc. Distributed optimization with arbitrary local solvers.arXiv:1512.04039, 2015.

[12] Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I Jordan, Peter Richtárik, and Martin Takáˇc. Adding

vs. averaging in distributed primal-dual optimization.arXiv:1502.03508, 2015.

[13] Yurii Nesterov.Introductory Lectures On Convex Optimization: A Basic Course. Springer, 2003.

[14] Zheng Qu, Peter Richtárik, and Tong Zhang. Quartz: Randomized dual coordinate ascent with arbitrary

sampling. InNIPS 28. 2015.

[15] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild!: A Lock-Free Approach to

Parallelizing Stochastic Gradient Descent. InNIPS 24, pages 693–701, 2011.

[16] Sashank Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex J Smola. On variance reduction in

stochastic gradient descent and its asynchronous variants. InNIPS 28, pages 2629–2637, 2015.

[17] Mark W. Schmidt, Nicolas Le Roux, and Francis R. Bach. Minimizing finite sums with the stochastic

average gradient. arXiv:1309.2388, 2013.

[18] Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity.arXiv:1602.01582,

2016.

[19] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss.

The Journal of Machine Learning Research, 14(1):567–599, 2013.

[20] Ohad Shamir, Nathan Srebro, and Tong Zhang. Communication efficient distributed optimization using

an approximate Newton-type method. CoRR, abs/1312.7853, 2013.

[21] Tianbao Yang. Trading computation for communication: Distributed stochastic dual coordinate ascent. InNIPS 27, pages 629–637, 2013.

[22] Aapo Kyrola Danny Bickson Carlos Guestrin Yucheng Low, Joseph Gonzalez and Joseph M. Hellerstein.

Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB,

2012.

[23] Yuchen Zhang and Xiao Lin. Disco: Distributed optimization for self-concordant empirical loss. InICML

2015, pages 362–370, 2015.

[24] Yuchen Zhang, Martin J Wainwright, and John C Duchi. Communication-efficient algorithms for

statisti-cal optimization. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,NIPS 25, pages

1502–1510. Curran Associates, Inc., 2012.

[25] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized stochastic gradient descent. InNIPS 24, pages 2595–2603, 2010.

(12)

Appendix

A Analysis of I

NEXACT

D

ANE

Quadratic case

Before proving the main result, we will need to establish two lemmas.

Lemma 1. Let assumptions of Theorem1be satisfied. Then

w t −_K1 K X k=1 ˆ wkt ≤ ηγ K K X k=1 k(Hk+µI)−1H_kkwt−1₋wˆ_k Proof. Sincewˆt

k = arg minwgk,µt (w), we have∇gtk,µ( ˆwtk) = 0. That is,0 = ∇gk,µt ( ˆwtk) = ∇Fk( ˆwtk) +

η∇f(wt−1₎_{− ∇}_Fk_(wt−1_{) +}_{µ( ˆ}_wt

k−wt−1).By rearranging the terms in the last expression, we get

ˆ

wtk=wt−1−η (Hk+µI)−1Hwt−1+ (Hk+µI)−1l

. (5)

Taking norms on both sides of (5), and using the fact thatwˆ=−H−1_{l, we obtain}

kwt−1−wˆtkk=ηk(Hk+µI)−1H(wt−1−w)ˆ k. (6)

We proceed to prove the desired result by using the following set of inequalities: w t −_K1 K X k=1 ˆ wt k ≤ 1 K K X k=1 wt k−wˆtk ≤_K1 K X k=1 γwt−1₋wˆtk =ηγ K K X k=1 k(Hk+µI)−1H(wt−1₋w)ˆ _k ≤ηγ_K K X k=1 k(Hk+µI)−1Hkkwt−1−wˆk. (7)

The first inequality follows from Jensen’s inequality and the fact thatwt₌PK

k=1wtk/K. The second

inequal-ity follows from the approximation condition on wt

k. The equality is obtained from Equation (6). The last

inequality follows from properties of spectral norm of a matrix.

Lemma 2. Let assumptions of Theorem1be satisfied. Then

kwt₋wˆ_{k −}ηH˜−1H₋I_kwt−1₋wˆ_{k ≤} w t −_K1 K X k=1 ˆ wkt .

(13)

Proof. w t −_K1 K X k=1 ˆ wkt = w t −wt−1+ η K K X k=1 (Hk+µI)−1H(wt−1₋w)ˆ =(wt−w)ˆ −(wt−1−w) +ˆ ηH˜−1H(wt−1−w)ˆ =(wt−w)ˆ −(I−ηH˜−1H)(wt−1−w)ˆ ≥ kwt −wˆ_{k −}(I₋ηH˜−1_H)(wt−1 −w)ˆ ≥ kwt₋wˆ_{k −}I₋ηH˜−1H_kwt−1₋wˆ_k. (8)

The first equality follows from the optimality property ofwˆt

k (see Equation (5)). The first inequality follows

from the triangle inequality. The second inequality follows from the properties of the matrix spectral norm.

A.1 Proof of Theorem

1

Proof. Lemmas1and2imply the inequality wt −wˆ₋ηH˜−1H₋Iwt−1₋wˆ_≤ηγ K K X k=1 (Hk+µI)−1Hw t−1₋wˆ.

It suffices to rearrange the terms.

A.2 Proof of Theorem

2

Proof. Follows by using Lemma4(see AppendixC) together with Theorem1.

A.3 I

NEXACT

D

ANE

for Stochastic Quadratic Setting

In the interesting case of stochastic quadratic setting (see [20]), we can provide a more precise result. More

specifically, we have the following key result in the stochastic quadratic setting.

Corollary 4. Suppose eachHk is the average ofni.i.d. symmetric positive definite matrices with largest

eigenvalue bounded above byL. Then for any0< α≤1, with probability1−αwe have:

i Functions{Fk}areδ-related withδ= q

32L2_log(Kd/α)

n ,

ii The iterates ofINEXACTDANE(Algorithm1) (with parameters defined in Theorem2) after

t= ˜O(L/λ)_n 2log Kd_α logLkw0−wˆk2

satisfyf(wt₎

≤f( ˆw) +.

Proof. The first part of the proof is directly from Lemma 2 of [20]. The second part of the proof follows from

(14)

Strongly Convex Case

3

Proof. We first observe that

k∇gt

k,µ(wtk)− ∇gtk,µ(wt−1)k ≥ k∇gtk,µ(wt−1)k − k∇gtk,µ(wtk)k

≥(1₋γ)_k∇gtk,µ(wt−1)k. (9)

The first inequality follows from triangle inequality. The second inequality follows from the condition that

k∇gt

k,µ(wkt)k ≤γk∇gk,µt (wt−1)k. We now define functionhtk:Rd→Ras

htk(w) :=Fk(w) +

µ

2kw−w

t−1

k2.

We define the virtual iteratewˆt

k = arg minwgk,µt (w). Using the optimality conditions ofwˆtk, we have the

following:

∇htk( ˆwtk)− ∇htk(wt−1) =−η∇f(wt−1). (10)

Also, define Bregman divergence of a smooth functionφasDφ(w, w0_{) =}_φ(w)₋_φ(w0₎_{− h∇}_φ(w0_{), w}₋_w0_i_.

We observe the following:

f(wtk) =f(wt−1) +h∇f(wt−1), wtk−wt−1i+Df(wtk, wt−1) =f(wt−1₎ −1_ηh∇ht k( ˆwkt)− ∇htk(wt−1), wkt −wt−1i+Df(wtk, wt−1) =f(wt−1)₋1 ηh∇h t k( ˆwtk)− ∇htk(wkt), wtk−wt−1i | {z } T1 −_η1h∇htk(wkt)− ∇htk(wt−1), wtk−wt−1i+Df(wtk, wt−1). (11)

The second equality is due to optimality condition in (10). We bound the termT1in the following manner:

|T1| ≤ k∇htk( ˆwtk)− ∇hkt(wkt)kkwkt−wt−1k

≤(L+µ)_kwˆtk−wktkkwkt−wt−1k. (12)

The first inequality is due to Cauchy-Schwarz inequality. The second inequality follows fromL+µlipschitz

continuous nature of the gradient of ht

k. In order to proceed further, a bound on the termkwkt −wt−1k is

obtained in the following fashion:

kwt−1−wktk ≤ kwt−1−wˆktk+kwˆkt−wtkk

≤ kwt−1−wˆktk+γkwˆtk−wt−1k= (1 +γ)kwt−1−wˆtkk. (13)

The first inequality is due to triangle inequality. The second inequality is due to the inexactness condition in

Algorithm1. Substituting the above bound in Equation (12), we have the following:

|T1| ≤(L+µ)(1 +γ)kwˆkt−wtkkkwt−1−wˆtkk ≤(L+µ)(1 +γ)γkwt−1−wˆktk2. (14)

The second inequality is again due to inexactness condition in Algorithm1. In order to boundkwt−1

−wˆt

kk2, we observe the following:

(µ+λ)_kwt−1

−wˆt

kk ≤ k∇gt

(15)

The first inequality follows from Lemma5. Using this relationship in Equation (14), we have the following: |T1_{| ≤} 2γη 2_(L₊_µ) (λ+µ)2 k∇f(w t−1₎ k2.

Substituting the bound in Equation (11), we have:

f(wtk)≤f(wt−1) + 2γη(L+µ) (λ+µ)2 k∇f(w t−1₎ k2 −_η(L1₊_µ)k∇ht k(wkt)− ∇htk(wt−1)k2+Df(wtk, wt−1) ≤f(wt−1) +2γη(L+µ) (λ+µ)2 k∇f(w t−1₎ k2 −_η(L1₊_µ)k∇htk(wkt)− ∇htk(wt−1)k2+ L 2kw t−1 −wt_kk2 ≤f(wt−1) +2γη(L+µ) (λ+µ)2 k∇f(w t−1₎ k2 −_η(L1₊_µ)k∇gtk,µ(wtk)− ∇gtk,µ(wt−1)k2+ 2Lη2 (λ+µ)2k∇f(w t−1₎ k2 ≤f(wt−1)− (1−γ)2 η(L+µ)− 2L (λ+µ)2 − 2γ(L+µ) η(λ+µ)2 η2k∇f(wt−1)k2 ≤f(wt−1)₋ ₍₁ −γ)2 η(L+µ)− 2L (λ+µ)2 − 2γ(L+µ) η(λ+µ)2 η2λ(f(wt−1)₋f( ˆw)). (16)

The first step is due to Lemma5. The second step follows theL-smoothness off. The third step follows from

the fact thatk∇gt

k,µ(wtk)− ∇gtk,µ(wt−1)k =k∇htk(wkt)− ∇htk(wt−1)kand bound onkwt−1−wtkk2from

Equation (13) and (15). The fourth inequality follows from Equation (9) and the fact that ∇gt

k,µ(wt−1) =

∇f(wt−1_{). The last inequality is due to strong convexity of function}_f_{. Finally, we observe:}

(f(wt)−f( ˆw)≤ _K1 K X k=1

(f(wtk)−f( ˆw))≤(1−ρ)(f(w˜ t−1)−f( ˆw)).

The first inequality is due to convexity. The second inequality follows from Equation (16). Therefore, we have

the required result.

Weakly Convex Case

Proof of Corollary

2

We observe the following:

f(wt₎ ≤f(wt₎ ≤min w f(w) + (1−ρ) s_[f(w0₎ −min w f(w)] ≤f( ˆw) + 2kwˆ−w 0 k2_{+ (1} −ρ)s_[f(w0₎ −min w f(w)] ≤f( ˆw) +(1/2)kwˆ−w0k2_{+ (f}_(w0₎ −f( ˆw)). (17)

The second inequality is a consequence of Equation (4). The last inequality follows from the facts that (i)

(16)

Nonconvex Case

4

Proof. Using essentially the same argument as in Theorem3, we have the following:

f(wtk)≤f(wt−1)− ₍₁ −γ)2 η(L+µ)− 2L (µ₋L)2 − 2γ(L+µ) η(µ₋L)2 η2_k∇f(wt−1)_k2.

The argument uses the fact thatgt

k,µis(µ+L)-smooth and(µ−L)-strongly convex. This is in turn due to the

lipschitz continuous nature of the gradient. Note that this is possible only whenµ > L(as mentioned in the

theorem statement). Using the definition ofwt_{in the nonconvex case, we have}

f(wt)≤f(wt−1)− (1−γ)2 η(L+µ)− 2L (µ₋L)2 − 2γ(L+µ) η(µ₋L)2 η2k∇f(wt−1)k2_. ₍₁₈₎

Rearranging the above inequality and using the definition ofθ, we get

k∇f(wt−1)_k2_≤f(w

t−1₎

−f(wt₎

θ .

Using a telescopic sum on the above inequality and the fact thatwˆis an optimal solution of (1), we have the

desired result.

B Analysis of A

IDE

Quadratic Case

B.1 Proof of Theorem

5

Proof. First consider the case where2√2δ _≤ λ. In this case, we haveτ = 0and1/16 _≤ γ _≤ 1/8. The

required result trivially follows from Theorem2. We turn our attention to the interesting case of2√2δ > λ.

For this case, we chooseτ = 2√2δ₋λ. We useFt

k(w)to denote the functionFk(w) + (τ /2)kw−yt−1k2.

As mentioned in pseudocode of Algorithm2, we use INEXACTDANEfor the set of functionsFt

k. Letuˆt =

arg minwft_(w)_{(refer to Algorithm}₂_{). By solving each subproblem inexactly for}_s_{iterations, from Theorem}₂_,

we have the following:

ft_(wt₎ −ft_(ˆ_ut₎ ≤L+τ 2 kw t −uˆt k2 ≤ L+τ 2 ρ˜ s kwt−1 −uˆt k2_, ₍₁₉₎

whereρ˜= 2/3. The first inequality follows from theL+τLipschitz continuous gradient offt_{. The second}

inequality follows from Theorem2. The value ofρ˜is obtained from Theorem2noting the following two facts.

(i) The inexactness parameterγ ≤1/8. (ii)ρ˜= 2/3whenγ≤1/8and2√2δ≤(λ+τ). Here, we used the

fact that INEXACTDANEis applied on the functionft_and_∇2_ft _(λ₊_{τ)I. Using Equation (}₁₉_{), we have}

the following: ft(wt)₋ft(ˆut)_≤L+τ 2 ρ˜ s kwt−1₋uˆt_k2_≤ L+τ λ+τρ˜ s_[ft_(wt−1₎ −ft(ˆut)].

This simply follows from the fact that_∇2_ft

(λ+τ)I. Using Proposition 3.2 of [10], we have the total

number of iterations of INEXACTDANEiterations one has to execute to achieve-accuracy is

t= ˜O 1 1−ρ˜ q λ+τ λ log(1/) = ˜O q δ λlog(1/) .

(17)

B.2 A

IDE

in the Stochastic Quadratic Setting

The following result follows as a Corollary of Theorem5.

Corollary 5. Suppose eachHkis the average ofni.i.d positive semidefinite matrices with eigenvalues at most L. Then with probability at least1₋α:

i Functions{Fk_}areδ-related withδ= q

32L2_log(Kd/α)

n ,

ii The iterates ofINEXACTDANE(Algorithm1) (with parameters in Theorem5) after

t= ˜O √L n1/4√_λlog 1/4 Kd α logLkw0−wˆk2 satisfyf(wt₎ ≤f( ˆw) +.

Proof. The first part of the proof is directly from Lemma 2 of [20]. The second part of the proof follows from

Theorem5.

Strongly convex case

B.3 Proof of Theorem

6

Proof. Letuˆt_{= arg minw}_ft_{(w). We first observe that after}_s_{iterations of INEXACTDANE}_{in the}_tth_iteration

of AIDE(applied onft_{), we have the following:}

ft(wt)₋ft(ˆut)_≤(1₋ρ)˜s(ft(wt−1)₋ft(ˆut)), (20) where ˜ ρ= ₄₉ 64(L+µ+τ)− 2(L+τ) (λ+τ+µ)2 − (L+τ+µ) 4(λ+τ+µ)2 L

This follows directly from Theorem3and the fact thatft_has_(L₊_{τ)-Lipschitz continuous gradient and is}

(λ+τ)- strongly convex. Note that with the given value ofτ,µandγ, we have following:

˜ ρ=L ₄₉ 64(2L+µ−λ)− 2(2L₋λ) (L+µ)2 − 2L+µ₋λ 4(L+µ)2 (21) ≥L ₄₉ 64(14L)− 4L (13L)2− 14L 4(13L)2 ≥0.01. (22)

The above bound follows from simple algebra. Now again using Proposition 3.2 of [10], the total number of

INEXACTDANEiterations required to achieve an-accurate solution is

t= ˜O 1 ˜ ρ r λ+τ λ log 1 ! = ˜O r L λlog 1 ! .

This is obtained from the bound in Equation (21). This gives us the desired result. Note that the constants in

this result are not optimized.

C Auxiliary Results

Here we establish two lemmas which are used in the proofs of the main results.

(18)

Lemma 3. Let µ, δ, ξ be positive constants satisfying max_{µ, δ_} < ξ. Further, let Aand B1, . . . , BK be

d_×dsymmetric matrices satisfyingξI A,P_kBk = 0and_kBk_{k ≤}δfor allk_∈[K]. Then, we have the

following: (A+Bk)−1₋A−1(A₋µI)_≤ 2δ ξ₋δ, k∈[K], 1 K K X k=1 (A+Bk)−1 −A−1 ! (A₋µI) ≤ 2δ2 ξ(ξ−δ).

Proof. First, we observe that

(A+Bk)−1= (A(I+A−1Bk))−1= (I+A−1Bk)−1A−1=A−1+

∞ X j=1

(−1)j(A−1Bk)jA−1.

The above equality follows form series expansion of (I +A−1Bk)−1, which is possible as kA−1Bkk ≤

kA−1

kkBk_k= (δ/ξ)<1. From the above equality, we obtain the following bound:

(A+Bk)−1₋A−1 A₋µI)_k= ∞ X j=1 (₋1)j(A−1Bk)jA−1(A₋µI) ≤ ∞ X j=1 (₋1)j(A−1Bk)jA−1(A₋µI) ≤ ∞ X j=1 A−1j kBk_kjI₋µA−1 ≤ ∞ X j=1 δj ξj 1 + µ ξ ≤ _ξ2δ −δ.

The first inequality follows from the triangle inequality. The second inequality follows from the

Cauchy-Schwarz inequality. The last inequality follows from the fact thatµ/ξ <1.

The second part of the claim was established as Lemma 4 in [20].

Lemma 4. Assume thatλI H LI, whereH := 1 K

PK

k=1Hk andλ >0. Further, assume thatkHk−

H_{k ≤}δfor allk_∈[K]and let_H˜−1_:= 1 K

PK

k=1(Hk+µI)−1, whereµ= max n 0,8δ2 λ −λ o . If we let γ:= 1 8 if 2 √ 2δ_≤λ, λ2 192δ2 otherwise, then ρ=kH˜−1H−Ik+ γ K K X k=1 k(Hk+µI)−1Hk ≤ 2 3 if 2 √ 2δ≤λ, 1₋ λ2 24δ2 otherwise.

Proof. Using Lemma3withA=H+µIandBk=Hk−H, we have the following inequality: (Hk+µI)−1−(H+µI)−1H≤_λ₊2δ_µ −δ, (23) H˜−1₋(H+µI)−1H_≤ 2δ 2 (λ+µ)(λ+µ−δ), (24)

(19)

for allk_∈[K]. From the above inequalities, we have

k(Hk+µI)−1Hk ≤ k(Hk+µI)−1H−Ik+kIk

≤ k(Hk+µI)−1H−(H+µI)−1Hk+k(H+µI)−1H−Ik+kIk

≤ _λ₊2δ_µ

−δ+

µ

λ+µ+ 1. (25)

The first and second inequality follow from triangle inequality. The third inequality follows from (23) and

Lemma 3 in [20], which says thatk(H+µI)−1_H₋_I_k₌_µ/(λ₊_{µ). Furthermore, we also have the following}

bound: kH˜−1H₋I_{k ≤ k}H˜−1H₋(H+µI)−1H_k+_k(H+µI)−1H₋I_k ≤ 2δ 2 (λ+µ)(λ+µ₋δ)+ µ λ+µ. (26)

The first and second inequality follow from triangle inequality and Equation (24), respectively. Inequality in

Equation (26) was earlier shown in [20]. Using inequalities in Equation (25) and Equation (26), we obtain the

following bound: ρ=_kH˜−1_H −I_k+ γ K K X k=1 k(Hk+µI)−1_H k ≤ _2δ2 (λ+µ)(λ+µ₋δ)+ µ λ+µ +γ _2δ λ+µ₋δ + µ λ+µ + 1

Let us first consider the case ofλ≥2√2δ. In this case, we setµ= 0,γ= 1/8and hence, we have

ρ≤ 2δ 2 λ(λ₋δ)+γ 2δ λ₋δ+ 1 <0.53< 2 3.

Now, consider the case of λ < 2√2δ. We useψ to denote8δ/λ. Note thatψ > 2. In this case, we set

µ= (8δ2_/λ)

−λandγ=λ2_/(192δ2₎_{and hence, the following holds:}

ρ≤_ψ(ψ2 −1)+ (1 +γ) 1−_ψδλ +γ 1 + 2 ψ₋1 . (27)

We observe thatψ₋1> ψ/2(sinceψ >2). Further, the following inequality holds:

(1 +γ)

1−_ψδλ

≤1−_ψδλ +γ.

Substituting the above inequalities in Equation (27), we have the following:

ρ_≤ 4 ψ2 + 1− λ ψδ +γ+γ 1 + 2 ψ−1 ≤1₋ λ 2 16δ2 + 4γ≤1− λ2 24δ2.

The second inequality is due to that fact thatψ−1>1and the fact thatγ=λ2_/(192δ2_{). Hence, we have the}

desired result.

Lemma 5. Suppose a functionh : Rd _→ _R_is_L_{-smooth and}_λ_{-strongly convex. Let}_w∗ _{= arg min}_h(w)_.

Then, we have the following:

h∇h(w0)−h(w), w0−wi ≥ _L1k∇h(w0)− ∇hi(w)k2_,

k∇h(w)_{k ≥}λ_kw₋w∗_k,

(20)

Algorithm 3:ADISTRIBUTED VERSION OFSVRG(f,w0_,_s,_r,_h)

Data:f(w) = _K1 PK_k=1Fk(w), initial pointw0

∈Rd_{, stepsize}_{h >}₀

fort= 1tosdo

Compute_∇f(wt₎_{and distribute to all machines}

fork= 1toKdo in parallel

wt

k=wt

forj = 1tordo

Samplei∈ Pk(e.g., uniformly at random)

wt k=wtk−h ∇fi(wkt)− ∇fi(wt−1) +∇f(wt−1) end end wt₌ 1 K PK k=1wtk end returnwt

D A Practical Distributed Version of SVRG

In Section6we pointed out how running SVRGas a local solver in INEXACTDANEin certain setting is

equiv-alent to the Algorithm3below. It is a distributed version of SVRG, that has been observed to perform well in

practice [7], but has not been directly analyzed. Note that there exist another way to obtain exactly the same

algorithm. That is to rewrite the local subproblem (2) as follows

gk,µt (w) = P i∈Pk fi(w)−∇fi(wt−1₎_{− ∇}_f_(wt−1_{), w}₊µ 2kw−wt−1k2 ,

and directly apply SGDlocally on this reformulation within INEXACTDANEand setµ= 0.

E Additional Experiments

In this section, we provide extended version of what was already presented in Section7, along with results

of different types. In particular, we extend the comparison of INEXACTDANE, AIDEand COCOA+ to various

settings and with different datasets. We also study of the effect of varying local iterations between rounds of communication on the performance of the algorithms. We present further results showing performance under varying number of nodes onto which the dataset is distributed, and highlight a case of non-random

data distribution, in which DANEdiverges, while the performance of INEXACTDANEdegrades only slightly,

compared to benchmark with random data distribution.

Again, the following default statements are true, unless stated otherwise. We use SVRGas local solver for

INEXACTDANEand AIDE, and we use SDCAin COCOA+. We run all presented methods for a single pass

over data available locally in every iteration. We partition data randomly (mimicking the iid data distribution)

across8nodes. We present the best performance selected from a range of stepsize parameters for SVRGand

aggregation parameters for COCOA+. In the plots, we use DANEas label for INEXACTDANE.

We omit the experiments for quadratic objectives as the results were very similar to the ones presented here and hence, did not provide any additional insights, compared to the experiments with classification on publicly available datasets.

E.1 I

NEXACT

D

ANE

, A

IDE

and C

OCOA

+

In Figures2and3, we present a comparison of INEXACTDANE, AIDEand COCOA+ on a binary classification

task on the rcv1 dataset. Plots in top row are for logistic loss, and bottom is for smoothed hinge loss. We apply

L2-regularization withλset to1/(cN)forc_{∈ {}1,10,100_}, which correspond to left, middle and right column

(21)

0 5 10 15 20 25 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 50 100 150 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 200 400 600 800 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 20 40 60 80 100 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 100 200 300 400 500 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 200 400 600 800 1000 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA

Figure 2: rcv1 dataset, 8 nodes, regularization parameter λ set to 1/(cN), for c _{∈ {}1,10,100_} (in

left/middle/right columns respectively). Top row: logistic loss. Bottom row: smoothed hinge loss.

0 20 40 60 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 100 200 300 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 200 400 600 800 1000 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 100 200 300 400 500 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 200 400 600 800 1000 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 1000 2000 3000 4000 5000 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA

Figure 3: rcv1 dataset, 64 nodes, regularization parameter λ set to 1/(cN), for c ∈ {1,10,100} (in

64 nodes. To strengthen our claims, in Figures4and5, we provide experiment with settings analogous to the

ones in Figure2, but on covtype and realsim datasets respectively.

In this experiment, we can see a common pattern arising in different settings. The benefit of acceleration

in AIDE is present only when the condition number κ := L/µ > N, and the larger it is, the larger is the

gap in performance. This is to be expected, as the acceleration of the fast stochastic methods changesκin

convergence rates to√N κ[18]. Both INEXACTDANEand AIDEoutperform COCOA+, with AIDEperforming

much better in all studied settings. The behaviour of COCOA+, where if one looks only at suboptimality of

primal function value, the algorithm quickly converges to modest accuracy, and then converges very slowly to higher accuracies, has been confirmed as correct by its authors.

E.2 Varying amount of local computation

In Figure 6we demonstrate how spending more local computational resources in each iteration to solve the

local subproblem (2) leads to faster convergence in terms of number of iterations. Note that in the settings

presented, it is not possible to get close form solution to the subproblems, and hence we can only approximate

(22)

0 5 10 15 20 25 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 50 100 150 200 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 200 400 600 800 1000 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 50 100 150 10−15 10−10 10−5 100 Iterations

Function suboptimality DANE

AIDE COCOA 0 100 200 300 400 500 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 500 1000 1500 2000 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA

Figure 4: covtype dataset, 8 nodes, regularization parameter λ set to 1/(cN), for c _{∈ {}1,10,100_} (in

0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 50 100 150 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 200 400 600 800 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 20 40 60 80 100 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 100 200 300 400 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA 0 500 1000 1500 10−15 10−10 10−5 100 Iterations Function suboptimality DANE AIDE COCOA

Figure 5: realsim dataset, 8 nodes, regularization parameter λ set to 1/(cN), for c ∈ {1,10,100} (in

(23)

SVRGfor4passes through local data already provides little to none improvement. The only dataset on which

significant improvement is visible is covtype. This behaviour likely is due to the fact thatN d, and hence

under random data distribution, the local problems can be seen asδ-related with very smallδ.

The labels in the figure mean represent the following: The labels DANE/6 and DANE/3 correspond to

running the local SVRGalgorithm for one-sixth and one-third of pass through local data in every iteration.

The labels DANE*2 and DANE*4 correspond to running the local SVRGalgorithm for two and four passes

through the local data.

0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 10 20 30 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 20 40 60 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 10 20 30 40 50 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4 0 20 40 60 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2DANE*4 0 50 100 150 10−15 10−10 10−5 100 Iterations Function suboptimality DANE/1 DANE/3 DANE/6 DANE*2 DANE*4

Figure 6: Varying number of passes through local data per iteration in range[1/6,4]. Top row: logistic loss,

Bottom row: smoothed hinge loss. Datasets in columns: rcv1, covtype, realsim, url

E.3 Node scaling

The plots in top row in Figure7show how the performance changes, as we partition the data across different

number of nodes, but keep the local work equal to one pass through local data. The performance degrades as we increase number of nodes, because we do less relative work on any single computer. However, this is a

positive result, as the algorithmalwaysconverges.

In the bottom row, we double the number of passes over local data when we double the number of nodes.

This is motivated to compare how the algorithm performs, when equal number of iteration of local SVRGis

used. In most cases, the algorithms perform very similarly, demonstrating the robustness of INEXACTDANE. For rcv1 and realsim datasets partitioned on 64 nodes, the convergence slows down. In this case, the optimal

stepsize for local SVRGwas significantly smaller than in the other cases. This was likely caused by getting

into region where the aggregation starts to be unstable, and thus higher accuracy on local subproblems leads to worse overall performance.

E.4 Inconvenient data partitioning

In the last experiment, we depart from theory, and observe that INEXACTDANEis, compared to DANE, much

more robust to arbitrary partitions of the data. In particular, in Figure8we compare INEXACTDANEin two

settings: random, which corresponds to random data partition to 2 nodes, and output, where the data are

partitioned according to their output label — positive examples on one node, negative examples on the other. In

this setting, DANEdiverges, while the performance of INEXACTDANEdrops down only slightly. We observed

that the optimal stepsize for local SVRGis about5−10%smaller in theoutputcase, compared to therandom

case.

This observation is particularly appealing for practitioners, as the data in huge scale applications are often partitioned “as is”, i.e., it is given and one does not have the opportunity to randomly reshuffle the data between nodes.

(24)

0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 10 20 30 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 5 10 15 20 25 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 10 20 30 40 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 5 10 15 20 25 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64 0 10 20 30 10−15 10−10 10−5 100 Iterations Function suboptimality DANE4 DANE8 DANE16 DANE32 DANE64

Figure 7: Performance comparison as data is partitioned across4–64nodes, with logistic loss. Top row: Single

pass through local data per iteration. Bottom row: Fixed number of updates of local SVRG per iteration.

Datasets in columns: rcv1, covtype, realsim, url

0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality random output 0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality random output 0 5 10 15 20 10−15 10−10 10−5 100 Iterations Function suboptimality random output

Figure 8: Data partitioned randomly, vs. data partitioned according to their output label. Datasets in columns: rcv1, covtype and realsim.

Practical considerations: Although the experiments in SectionE.1suggest superiority of AIDE, it may

not always be the case in practice. AIDEcomes with the additional requirement to set the catalyst acceleration

parameter τ. At the moment, there is not any simple rule guiding its choice, as the optimal τ depends on

properties of the algorithm being accelerated, which are usually not known — see Section 3.1 of [10] for

further details. In contrast, COCOA+ is usually a slightly easier to tune in practice, since with SDCAone can use

data-independent aggregation parameter equal to1/K, and effectively have hyper-parameter-free algorithm. In

the case of INEXACTDANE, natural local solvers would require a choice of hyper-parameter such as stepsize.

Link to publication record in Edinburgh Research Explorer