

2.1.2 Related works on distributed optimization

Many recent applications in statistical data processing and machine learning can be handled within the framework of distributed optimization. Examples include network control and coordination (e.g. target or trajectory tracking [132], [123], power and resource allocation [108], [23]), big data processing (e.g. classifier training [151], [160]), and environmental monitoring in sensor networks (e.g. parameter estimation [135], [144]).

The algorithm (2.1)-(2.2) under study is not new. The idea behind the algorithm traces back to [155,156], where a network of processors seeks to optimize some objective function known by all agents (possibly up to some additive noise). More recently, numerous works have extended this kind of algorithm to more involved multi-agent scenarios; see [97,103,117,87,144,43,19,21,23,114] for a non-exhaustive list. In this context, one seeks to minimize a sum of local private cost functions $f_i$ of the agents:

$$\min_{\theta} \sum_{i=1}^{N} f_i(\theta), \tag{2.3}$$

where, for all $i$, the function $f_i$ is assumed to be unknown to any other agent $j$, $j \neq i$. To address this question, it is assumed that

$$Y_{n,i} = -\nabla f_i(\theta_{n-1,i}) + \xi_{n,i} \tag{2.4}$$

where $\nabla$ is the gradient operator and $\xi_{n,i}$ represents a random perturbation that may occur when observing the gradient. Hence, the distributed algorithm (2.1)-(2.2) is a distributed stochastic gradient algorithm. In this paper, we handle the case where the functions $f_i$ are not necessarily convex. Of course, in that case, there is generally no hope of ensuring convergence to a minimizer of (2.3). Instead, a more realistic objective is to reach critical points of the objective function, i.e., points $\theta$ such that $\sum_i \nabla f_i(\theta) = 0$.
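The following minimal sketch illustrates a distributed stochastic gradient iteration driven by noisy observations of the form (2.4). It rests on several assumptions that are not taken from the text: the iteration is assumed to consist of a local step followed by a linear combining step, the step size is taken as $\gamma_n = 1/n$, the perturbations $\xi_{n,i}$ are Gaussian, the combining matrix `W` is fixed rather than random, and the local costs are the toy quadratics $f_i(\theta) = \frac{1}{2}\|\theta - c_i\|^2$. The names `distributed_sgd`, `centers` and `noise_std` are hypothetical.

```python
import numpy as np


def distributed_sgd(grads, W, theta0, n_iter=2000, noise_std=0.1, seed=0):
    """Illustrative distributed stochastic gradient iteration (assumed form).

    grads  : list of N callables; grads[i](theta) returns the gradient of f_i
    W      : (N, N) combining matrix, assumed row-stochastic and fixed over time
    theta0 : (N, d) array, one initial estimate per agent
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    N, d = theta.shape
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n  # decreasing step size (an assumption of this sketch)
        # Noisy observation Y_{n,i} = -grad f_i(theta_{n-1,i}) + xi_{n,i}, as in (2.4)
        Y = np.stack([-grads[i](theta[i]) for i in range(N)])
        Y = Y + noise_std * rng.standard_normal((N, d))
        # Local step: each agent moves along its own noisy observation
        theta_tilde = theta + gamma * Y
        # Combining step: each agent averages the temporary estimates with weights W
        theta = W @ theta_tilde
    return theta


# Toy example: f_i(theta) = 0.5 * ||theta - c_i||^2, so grad f_i(theta) = theta - c_i.
# The unique critical point of sum_i f_i is the mean of the centers c_i.
centers = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, -2.0]])
grads = [lambda th, c=c: th - c for c in centers]

# A fixed doubly stochastic combining matrix on 3 agents (illustrative choice)
W = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

theta_final = distributed_sgd(grads, W, theta0=np.zeros((3, 2)))
print(theta_final)            # one row per agent
print(centers.mean(axis=0))   # critical point of the sum of the f_i
```

In this toy setting every row of `theta_final` should be close to the mean of the centers, which is the point where $\sum_i \nabla f_i(\theta) = 0$. For non-convex $f_i$, no such global statement is available in general.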

In a machine learning context, $f_i$ is typically the risk of a classifier indexed by $\theta$ (for more details we refer to [107, 65, 32, 6]). The problem of finding the optimal vector quantizer is addressed in [125] by minimizing a non-convex cost function called distortion. [125] proposes a distributed, on-line implementation of k-means, named the competitive learning vector quantization (CLVQ) algorithm and based on stochastic approximation. The consistency of the algorithm is proved under suitable assumptions, such as row-stochastic matrices and asynchronous weights: the trajectories of the agents reach an asymptotic consensus a.s., and the corresponding agreement vector converges a.s. towards a random connected component of the set of critical points. [43] and [144] restrict their analysis to a linear regression model for the observations and to the case of common quadratic functions for the agents. [43] studies the mean square error performance of a distributed stochastic approximation algorithm based on a deterministic diffusion scheme; it is shown that the error variance is bounded and that convergence is achieved in the noise-free case. In [144], these results are obtained when i.i.d. random noise is considered in addition. In the field of stochastic cooperative games, [13] focuses on the a.s. convergence of bargaining processes when they are allocated in a distributed manner. The proposed algorithm iteratively generates a sequence in two steps: a combining step, in which agents communicate and which involves a doubly stochastic, time-varying random matrix, and a local projection step onto a closed convex set. The results are the a.s. convergence to zero of the nonlinear error due to the projection and the a.s. convergence of the network towards the sought allocation.

Regarding works on statistical data inference, there is a rich literature on distributed estimation and optimization algorithms; see [26], [103], [87], [38], [117], [144] for a non-exhaustive list. Among the first gossip algorithms are those considered in the treatise [18] and in [156]. The case where the gossip matrices are random and the observations are noiseless is considered in [31]. The authors of [117] solve a constrained optimization problem, also using noiseless estimates. The contributions [38] and [144] consider the framework of linear regression models. In [134], stochastic gradient algorithms are considered in the case where the gossip matrices $(W_n)_n$ are doubly stochastic, i.e. $W_n \mathbf{1} = W_n^T \mathbf{1} = \mathbf{1}$. This contribution assumes in addition that the gradients are bounded and imposes rather stringent assumptions on the conditional variances of the observation noises. Convergence to a global minimizer is shown in [116] assuming convex utility functions and bounded (sub)gradients. The results of [116] are extended in [134] to the stochastic descent case, i.e., when the observation of the utility functions is perturbed by random noise. More recently, [19] investigated distributed stochastic approximation at large, providing stability conditions for the algorithm (2.1)-(2.2) while relaxing the bounded gradient assumption and including the case of random communication links. In [19], it is also proved under some hypotheses that the estimation error is asymptotically normal: the convergence rate and the asymptotic covariance matrix are characterized. An enhanced averaging algorithm à la Polyak is also proposed to recover the optimal convergence rate. Note that none of the works cited above takes into account, in its convergence analysis, the case where the matrices $(W_n)_n$ depend on the observations $(Y_n)_n$.

Doubly and non-doubly stochastic matrices. In most works (see for instance [116,134]), the matrices $(W_n)_{n \geq 1}$ are assumed to be doubly stochastic, meaning that $W_n^T \mathbf{1} = W_n \mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ is the $N \times 1$ vector whose components are all equal to one and where $^T$ denotes transposition. Although row-stochasticity ($W_n \mathbf{1} = \mathbf{1}$) is rather easy to ensure in practice, column-stochasticity ($W_n^T \mathbf{1} = \mathbf{1}$) implies more stringent restrictions on the communication protocol. For instance, in [31], each one-way transmission from an agent $i$ to another agent $j$ requires at the same time a feedback link from $j$ to $i$. As a matter of fact, double stochasticity prevents the use of natural broadcast schemes, in which a given node may transmit its local estimate to all its neighbors without expecting any immediate feedback.
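To make the distinction concrete, the sketch below builds two combining matrices and checks their row and column sums. Both constructions are illustrative assumptions rather than the exact protocols of the cited works: a standard pairwise averaging matrix (which needs a two-way exchange) and a broadcast matrix in which each receiving neighbor mixes its own value with the broadcaster's value using a hypothetical weight `beta`.

```python
import numpy as np


def pairwise_gossip_matrix(N, i, j):
    """Agents i and j exchange their values and both replace them by the average.

    This requires a transmission i -> j and a feedback j -> i.
    """
    W = np.eye(N)
    W[i, i] = W[j, j] = 0.5
    W[i, j] = W[j, i] = 0.5
    return W


def broadcast_gossip_matrix(N, i, neighbors, beta=0.5):
    """Agent i broadcasts its value; each neighbor k mixes it with its own value.

    No feedback is sent to i, whose own value is left unchanged.
    beta is a hypothetical mixing weight in (0, 1).
    """
    W = np.eye(N)
    for k in neighbors:
        W[k, k] = 1.0 - beta
        W[k, i] = beta
    return W


def is_row_stochastic(W):
    """Check W >= 0 entrywise and W 1 = 1."""
    return bool(np.all(W >= 0) and np.allclose(W @ np.ones(W.shape[0]), 1.0))


def is_column_stochastic(W):
    """Check W >= 0 entrywise and W^T 1 = 1."""
    return bool(np.all(W >= 0) and np.allclose(W.T @ np.ones(W.shape[0]), 1.0))


W_pair = pairwise_gossip_matrix(4, 0, 2)
W_bcast = broadcast_gossip_matrix(4, 0, neighbors=[1, 3])

print(is_row_stochastic(W_pair), is_column_stochastic(W_pair))
print(is_row_stochastic(W_bcast), is_column_stochastic(W_bcast))

# Column-stochasticity is what preserves the network-wide average of the estimates.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(x.mean(), (W_pair @ x).mean(), (W_bcast @ x).mean())
```

The pairwise matrix is doubly stochastic and leaves the average of `x` unchanged, whereas the broadcast matrix is only row-stochastic and shifts the average towards the broadcaster's value.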

Remarkably, although generally assumed, double stochasticity of the matrices $W_n$ is in fact not mandatory. A couple of works (see e.g. [112, 19]) get rid of the column-stochasticity condition, but at the price of assumptions that may not always be satisfied in practice. Other works ([114, 153]) manage to circumvent the use of feedback links by coupling the gradient descent with the so-called push-sum protocol [89]. The latter, however, introduces an additional communication of weights in the network in order to keep track of some summary of the past transmissions. As a consequence, we address the following questions:

What conditions on the sequence $(W_n)_{n \geq 1}$ are needed to ensure that Algorithm (2.1)-(2.2) drives all agents to a common critical point of $\sum_i f_i$? What happens if these conditions are not satisfied? How is the convergence rate influenced by the communication protocol?