1.4 Notation
2.1.1 Related Work
Research in distributed optimization dates back several decades (see, e.g., [70] and the references therein). In recent years, various centralized optimization methods, such as (sub)gradient descent, proximal gradient descent, (quasi-)Newton methods, dual averaging, the alternating direction method of multipliers (ADMM), and many other primal-dual methods, have been extended to the distributed setting. In this section, we review several classes of distributed algorithms that can be used to solve problem (1.1).
2.1.1.1 Distributed Primal Methods
In the primal domain, methods based on gradient descent are effective and easy to implement. There are at least two prominent variants in this class: the consensus strategy [5–13] and the diffusion strategy [1, 4, 34, 36, 71]. A brief description of these two primal strategies is given in Appendix 2.A. There is a subtle but critical difference in the order in which computations are performed under these two strategies. In the consensus implementation, each agent runs a gradient-descent type iteration, albeit one where the starting point for the recursion and the point at which the gradient is approximated are not identical. This construction introduces an asymmetry into the update relation, which has some undesirable instability consequences (described, for example, in Secs. 7.2–7.3, Example 8.4, and Theorem 9.3 of [1], and in Sec. V.B and Example 20 of [4]). The diffusion strategy, in comparison, employs a symmetric update where the starting point for the iteration and the point at which the gradient is approximated coincide. This property results in a wider stability range for diffusion strategies [1, 4]. Still, when sufficiently small step-sizes are employed to drive the optimization process, both types of strategies (consensus and diffusion) are able to converge exponentially fast, albeit only to an approximate solution [1, 9]. Specifically, it is proved in [1, 9, 14] that both the consensus and diffusion iterates under constant step-size learning converge towards a neighborhood of square-error size $O(\mu^2)$ around the true optimizer, $w^\star$, i.e., $\|w^\star - w_{k,i}\|^2 = O(\mu^2)$ as $i \to \infty$, where $\mu$ denotes the step-size and $w_{k,i}$ denotes the local iterate at agent $k$ and iteration $i$. This limiting $O(\mu^2)$ bias is not due to any gradient noise arising from stochastic approximations; it is instead due to the inherent structure of the consensus and diffusion updates, as clarified in the sequel.
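The difference in the order of computations, and the resulting limiting bias, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the cited works: the scalar quadratic costs $J_k(w) = \frac{1}{2} h_k (w - c_k)^2$, the four-agent ring, the combination weights, and all function names are hypothetical. The two recursions follow the commonly used consensus and adapt-then-combine diffusion forms, which may differ in detail from the descriptions in Appendix 2.A.

```python
# Hypothetical setup: K agents on a ring, scalar costs J_k(w) = 0.5*h_k*(w - c_k)^2.
K = 4
H = [1.0, 2.0, 3.0, 4.0]      # assumed curvatures h_k
C = [1.0, -1.0, 2.0, 0.5]     # assumed local minimizers c_k
# Minimizer of the aggregate cost sum_k J_k(w).
W_STAR = sum(h * c for h, c in zip(H, C)) / sum(H)


def ring_weights(num_agents):
    """Symmetric, doubly stochastic combination matrix for a ring (needs >= 3 agents)."""
    A = [[0.0] * num_agents for _ in range(num_agents)]
    for k in range(num_agents):
        A[k][k] = 0.5
        A[k][(k - 1) % num_agents] = 0.25
        A[k][(k + 1) % num_agents] = 0.25
    return A


A = ring_weights(K)


def combine(weights, x):
    """Each agent k forms the weighted average sum_l a_{lk} x_l over its neighbors."""
    n = len(x)
    return [sum(weights[l][k] * x[l] for l in range(n)) for k in range(n)]


def grad(k, w):
    """Gradient of the local cost J_k at w."""
    return H[k] * (w - C[k])


def consensus(mu, iters=5000):
    """Start from the combined iterate, but evaluate the gradient at the old local one."""
    w = [0.0] * K
    for _ in range(iters):
        mixed = combine(A, w)
        w = [mixed[k] - mu * grad(k, w[k]) for k in range(K)]
    return w


def diffusion(mu, iters=5000):
    """Adapt-then-combine: gradient step from w_k evaluated at w_k, then combination."""
    w = [0.0] * K
    for _ in range(iters):
        psi = [w[k] - mu * grad(k, w[k]) for k in range(K)]
        w = combine(A, psi)
    return w


def worst_sq_error(w):
    """Largest squared distance between a local iterate and the true optimizer."""
    return max((W_STAR - wk) ** 2 for wk in w)


if __name__ == "__main__":
    for name, run in (("consensus", consensus), ("diffusion", diffusion)):
        for mu in (0.05, 0.005):
            print(name, mu, worst_sq_error(run(mu)))
```

In this toy problem no gradient noise is present, yet both recursions settle at a nonzero distance from $w^\star$, and that distance shrinks when the step-size is reduced, which is consistent with a bias caused by the structure of the updates.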
Second-order information such as the Hessian matrix can also be incorporated into primal methods; see the distributed Newton method [78, 79], the quasi-Newton method [80], and the references therein. While the Hessian matrix helps accelerate convergence, these second-order algorithms still suffer from the inherent $O(\mu^2)$ limiting bias. Another type of method employs a multi-consensus inner loop [81–83] and thus improves the consensus among the variables at each outer iteration. While these two-time-scale methods can reduce the limiting bias, the inner consensus loop requires more communication rounds between agents and hence slows down the processing of new data received in the outer loop. For this reason, they are not well suited to adaptation and online learning problems.
2.1.1.2 Distributed Primal-Dual Methods
Another important class of distributed algorithms is based on primal-dual strategies. A brief analytical derivation of various popular primal-dual methods is given in Sec. 2.B. A well-known family of distributed primal-dual methods consists of those based on the alternating direction method of multipliers (ADMM) [74, 84–86] and its variants [87–90]. In particular, [74] proves that distributed ADMM with constant parameters converges exponentially fast to the exact global solution $w^\star$, in contrast to the purely primal methods discussed in Sec. 2.1.1.1, which converge only to an approximate solution close to $w^\star$ under constant step-sizes. However, distributed ADMM solutions are computationally more expensive since they require solving an optimization sub-problem at each iteration. Some useful variations of distributed ADMM [87–89] may alleviate the computational burden, but their recursions are still more difficult to implement than consensus or diffusion due to their primal-dual structure.
In more recent work [75, 91], a modified implementation of consensus iterations, referred to as EXTRA, is proposed and shown to converge to the exact minimizer $w^\star$ rather than to an $O(\mu^2)$-neighborhood around $w^\star$. The modification has a computational burden similar to that of traditional consensus and is based on adding a step that combines two prior iterates to remove the bias. While EXTRA does not explicitly employ a dual variable, it is essentially a primal-dual saddle-point algorithm [77]. Motivated by [75], other variations with similar properties were proposed in [92–98]. These variations rely instead on combining inexact gradient evaluations with a gradient tracking technique. The resulting algorithms require two information combinations per recursion, which doubles the number of communicated variables compared to EXTRA and can become a burden when communication resources are limited. Distributed primal-dual second-order methods are also studied in [89, 99] to reduce the number of communication rounds, but they suffer from the expensive construction of the Hessian matrix. Owing to their ease of implementation and fast convergence, EXTRA and tracking methods have been extended to other important scenarios, including directed [93, 97, 98, 100, 101] and asynchronous [102] networks. There is also another family of primal-dual methods that are related to EXTRA and exploit the network structure to further accelerate convergence and reduce the number of communication rounds [103–105].
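The contrast between the two bias-correction approaches can be sketched as follows. This is an illustrative sketch under stated assumptions and not the implementation of any cited work: the scalar quadratic costs, the ring weights, and all names are hypothetical. The EXTRA recursion is written in its commonly stated form with the second weight matrix chosen as $(I + A)/2$, and the tracking recursion is one common variant that may differ from the specific schemes in [92–98].

```python
# Hypothetical setup: K agents on a ring, scalar costs J_k(w) = 0.5*h_k*(w - c_k)^2.
K = 4
H = [1.0, 2.0, 3.0, 4.0]      # assumed curvatures h_k
C = [1.0, -1.0, 2.0, 0.5]     # assumed local minimizers c_k
W_STAR = sum(h * c for h, c in zip(H, C)) / sum(H)

# Symmetric, doubly stochastic ring weights (self 1/2, each neighbor 1/4).
A = [[0.0] * K for _ in range(K)]
for k in range(K):
    A[k][k] = 0.5
    A[k][(k - 1) % K] = 0.25
    A[k][(k + 1) % K] = 0.25


def combine(x):
    """One communication round: agent k averages the values of its neighbors."""
    return [sum(A[l][k] * x[l] for l in range(K)) for k in range(K)]


def grad(k, w):
    """Gradient of the local cost J_k at w."""
    return H[k] * (w - C[k])


def extra(mu, iters=3000):
    """EXTRA-style recursion: a consensus step corrected with the previous iterate."""
    w_prev = [0.0] * K
    mixed_prev = combine(w_prev)
    # The first iteration is a plain consensus step.
    w = [mixed_prev[k] - mu * grad(k, w_prev[k]) for k in range(K)]
    for _ in range(iters):
        mixed = combine(w)  # the only combination in this iteration
        w_next = [
            w[k] + mixed[k]                          # (I + A) w_i
            - 0.5 * (w_prev[k] + mixed_prev[k])      # ((I + A)/2) w_{i-1}, reused
            - mu * (grad(k, w[k]) - grad(k, w_prev[k]))
            for k in range(K)
        ]
        w_prev, mixed_prev, w = w, mixed, w_next
    return w


def gradient_tracking(mu, iters=3000):
    """Tracking recursion: y_k tracks the network-average gradient."""
    w = [0.0] * K
    g = [grad(k, w[k]) for k in range(K)]
    y = list(g)  # tracker initialized at the local gradients
    for _ in range(iters):
        mixed_w = combine(w)  # first combination: iterates
        w_next = [mixed_w[k] - mu * y[k] for k in range(K)]
        g_next = [grad(k, w_next[k]) for k in range(K)]
        mixed_y = combine(y)  # second combination: gradient trackers
        y = [mixed_y[k] + g_next[k] - g[k] for k in range(K)]
        w, g = w_next, g_next
    return w


if __name__ == "__main__":
    for name, run in (("EXTRA", extra), ("tracking", gradient_tracking)):
        print(name, max(abs(W_STAR - wk) for wk in run(0.02)))
```

In this sketch the EXTRA-style loop calls `combine` once per iteration, whereas the tracking loop calls it twice, once for the iterates and once for the gradient trackers. With a constant step-size, both reach the exact minimizer of the toy problem up to numerical precision.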
When the local cost function $J_k(w)$ is not smooth and has the structure $J_k(w) = s_k(w) + r_k(w)$, where $s_k(w)$ is smooth with Lipschitz continuous gradients and $r_k(w)$ is a possibly non-smooth regularization term, one can integrate proximal gradient descent with the above primal-dual methods; see [88, 91, 106–109]. In particular, [109] proposes a distributed proximal gradient method that enjoys exponential convergence to $w^\star$ when all agents share the same regularization term, i.e., $r_1(w) = \cdots = r_K(w) = r(w)$.
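To make the role of the proximal step concrete, the following is a minimal centralized sketch and not the distributed method of [109]. It assumes, purely for illustration, scalar smooth terms $s_k(w) = \frac{1}{2} h_k (w - c_k)^2$ and a single shared regularizer $r(w) = \lambda |w|$ added once to the aggregate cost. Its proximal operator is the soft-thresholding function. All names and parameter values are hypothetical.

```python
# Hypothetical data: smooth terms s_k(w) = 0.5*h_k*(w - c_k)^2, shared r(w) = LAM*|w|.
H = [1.0, 2.0, 3.0, 4.0]      # assumed curvatures h_k
C = [1.0, -1.0, 2.0, 0.5]     # assumed minimizers c_k of the smooth terms
LAM = 2.0                     # assumed regularization weight


def soft_threshold(x, tau):
    """Proximal operator of tau*|.|: shrink x toward zero by tau."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def smooth_grad(w):
    """Gradient of the aggregate smooth part sum_k s_k(w)."""
    return sum(h * (w - c) for h, c in zip(H, C))


def proximal_gradient(mu, iters=200):
    """Gradient step on the smooth part followed by the proximal step on r."""
    w = 0.0
    for _ in range(iters):
        w = soft_threshold(w - mu * smooth_grad(w), mu * LAM)
    return w


if __name__ == "__main__":
    # For this scalar quadratic the minimizer is available in closed form.
    w_bar = sum(h * c for h, c in zip(H, C)) / sum(H)
    print(proximal_gradient(0.05), soft_threshold(w_bar, LAM / sum(H)))
```

The non-smooth term is never differentiated. It enters only through its proximal operator, which is why a regularizer such as $\lambda |w|$ can be handled even though it has no gradient at the origin.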
2.1.1.3 Distributed Dual Methods
A third class of distributed algorithms consists of purely dual methods; see [110–113]. A short description of dual methods is provided in Sec. 2.C. These methods first derive the unconstrained dual problem of problem (2.1) and then solve it by gradient descent. In particular, the algorithms of [110, 111, 113] can attain the optimal convergence rate by introducing Nesterov’s acceleration into their recursions.