1.3 Position of the thesis
1.3.2 Distributed optimization
Distributed optimization is present in many of the applications mentioned above related to sensor networks and machine learning. The goal of the network is to optimize a global function that is defined as a sum of local private functions. The minimization problem under study is described as follows:
$$\min_{\theta \in \mathbb{R}} F(\theta), \qquad F(\theta) = \sum_{i=1}^{N} f_i(\theta) \tag{1.6}$$
where $f_i$ is the private cost function of agent $i$. We assume in this thesis that the functions $f_i$ are differentiable but not necessarily convex. This thesis also focuses on first-order methods, i.e., algorithms relying merely on gradient computations. Let us provide an illustrative example in the context of sensor networks.
Example 1.1. In WSN contexts, it is often the case that one should estimate a parameter $\theta$ based on a set of random observations $X_1, \dots, X_N$ collected by independent sensors and whose marginal probability density functions $p_{\theta,1}, \dots, p_{\theta,N}$ are indexed by $\theta$. Provided that the $X_i$'s are independent, the maximum likelihood estimate of $\theta$ can be written as the minimizer of (1.6) where $f_i(\theta) = -\log p_{\theta,i}(X_i)$. In a centralized setting, random observations are collected at a central unit. All functions are assumed to be available at a single place, and a standard gradient descent on $F$ can be used to obtain a minimizer. This thesis focuses on the distributed setting: the functions $f_i$ are only locally known by the agents, and the function $F$ is not available at any single node.
In the literature, there are mainly two kinds of distributed first-order algorithms for solving this problem. The first one is known as the incremental approach (see [113], [131], [133]). A single iterate travels through the network from node to node. Each node updates the estimate by adding to the iterate a scaled version of its negative local gradient evaluated at the current point. The approach, although conceptually simple, has some drawbacks. Incremental algorithms generally require the message to go through a Hamiltonian cycle in the network. Finding such a cycle is known to be an NP-complete problem. Relaxations of the Hamiltonian cycle requirement have been proposed: for instance, [113] only requires that an agent communicates with another agent randomly selected in the network (not necessarily in its neighborhood) according to the uniform distribution. However, substantial routing is still needed.
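The incremental update described above can be sketched in a few lines. This is only an illustration under assumptions that are not in the text: scalar quadratic local costs $f_i(\theta) = \frac{1}{2}(\theta - a_i)^2$, a fixed cyclic visiting order, and a step size $1/k$ during cycle $k$. The function name `incremental_gradient` is hypothetical.

```python
def incremental_gradient(targets, n_cycles=2000):
    """Incremental gradient sketch for F(theta) = sum_i 0.5 * (theta - a_i)**2.

    Assumptions (not from the text): quadratic local costs, a fixed cyclic
    visiting order, and step size 1/k during cycle k.
    """
    theta = 0.0  # the single iterate that travels through the network
    for k in range(1, n_cycles + 1):
        step = 1.0 / k
        for a_i in targets:  # one pass along the cycle: node 0, 1, ..., N-1
            local_gradient = theta - a_i  # gradient of f_i at the current iterate
            theta -= step * local_gradient  # node updates, then forwards theta
    return theta


targets = [1.0, 2.0, 3.0, 4.0]
print(incremental_gradient(targets))  # close to the mean of the targets
```

For these quadratic costs the minimizer of $F$ is the average of the $a_i$, so the returned value can be compared against it directly. Note that the sketch sidesteps the routing issue: the visiting order is simply given.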
This thesis focuses on another cooperative approach of the adapt-then-combine form (following a terminology introduced by [103] in [38]), also known as adaptation-diffusion algorithms. The idea, which traces back to [155], consists in coupling a local gradient descent step at each node with a gossip communication step that merges the iterates, as explained in the previous subsection. In contrast to incremental approaches, each node $i$ has its own estimate $\theta_{n,i}$. At each iteration, the following update is performed:
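$$\tilde{\theta}_{n,i} = \theta_{n-1,i} - \gamma_n \nabla f_i(\theta_{n-1,i}), \tag{1.7}$$
$$\theta_{n,i} = \sum_{j=1}^{N} w_n(i,j)\, \tilde{\theta}_{n,j}, \tag{1.8}$$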
where $\gamma_n$ is a deterministic positive step size, $\nabla$ denotes the gradient, and $W_n = (w_n(i,j))_{i,j \in V}$ is a gossip matrix similar to the ones described previously (see Section C.1 in Appendix C for detailed examples). In addition, it is sometimes the case that the gradient is observed only up to some random perturbation, which might depend on the history of the algorithm. In that case, equation (1.7) must be replaced by
$$\tilde{\theta}_{n,i} = \theta_{n-1,i} - \gamma_n \nabla f_i(\theta_{n-1,i}) + \gamma_n \xi_{n,i} \tag{1.9}$$
where $\xi_{n,i}$ is a perturbation due to the fact that the gradient is not perfectly observed at node $i$.
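A minimal sketch of the adapt-then-combine iteration may help fix ideas: a local gradient step at every agent, followed by a weighted average with gossip weights $w_n(i,j)$. The choices below are assumptions made for illustration and are not taken from the text: scalar quadratic costs $f_i(\theta) = \frac{1}{2}(\theta - a_i)^2$, step size $\gamma_n = 1/n$, noiseless gradients, and a pairwise gossip on a complete graph in which two randomly chosen agents average their values. The function name `adapt_then_combine` is hypothetical.

```python
import random


def adapt_then_combine(targets, n_iter=5000, seed=0):
    """Adapt-then-combine sketch with f_i(theta) = 0.5 * (theta - a_i)**2.

    Assumptions (not from the text): quadratic local costs, step size
    gamma_n = 1/n, and pairwise gossip on a complete graph.
    """
    rng = random.Random(seed)
    n_agents = len(targets)
    theta = [0.0] * n_agents  # each agent keeps its own estimate
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n
        # Adaptation: local gradient step at every agent.
        tmp = [theta[i] - gamma * (theta[i] - targets[i]) for i in range(n_agents)]
        # Combination: two randomly chosen agents average their values,
        # which corresponds to a doubly stochastic gossip matrix.
        i, j = rng.sample(range(n_agents), 2)
        tmp[i] = tmp[j] = 0.5 * (tmp[i] + tmp[j])
        theta = tmp
    return theta


estimates = adapt_then_combine([1.0, 2.0, 3.0, 4.0])
print(estimates)  # all entries close to 2.5, the minimizer of the sum
```

In this toy setting the agents should both agree with each other and approach the minimizer of the sum, here the average of the $a_i$. The perturbed variant (1.9) would add `gamma * xi` to the adaptation step, with `xi` drawn by the user.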
Example 1.2. To illustrate this point, consider again the WSN example given previously, where agents seek to estimate an unknown parameter $\theta$ in the maximum likelihood sense based on random observations. Consider the case where each sensor $i$ gathers a sequence of random observations $(X_{n,i})_{n=1,2,\dots}$ instead of a single observation $X_i$. Assume also that the sequence is formed by independent copies of $X_i$. Then, an on-line distributed estimation of the parameter $\theta$ using the above algorithm would read as
$$\tilde{\theta}_{n,i} = \theta_{n-1,i} + \gamma_n \nabla \log p_{\theta_{n-1,i}}(X_{n,i}).$$
Under some regularity conditions, it can be shown that the above update coincides with (1.9) by letting $f_i(\theta) = -\mathbb{E}[\log p_{\theta,i}(X_i)]$, where $\mathbb{E}$ stands for the expectation and the perturbation $\xi_{n,i}$ is a martingale increment.
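The identification with (1.9) can be checked in two lines. The following is a sketch, assuming that the gradient and the expectation can be interchanged. It uses notation introduced here for convenience and not found in the text: $\mathcal{F}_{n-1}$ denotes the history of the algorithm up to iteration $n-1$, $f_i(\theta) = -\mathbb{E}[\log p_{\theta,i}(X_i)]$, and $\nabla \log p_{\theta,i}$ is the gradient with respect to $\theta$.

```latex
\begin{align*}
\xi_{n,i}
  &= \nabla \log p_{\theta,i}(X_{n,i})\big|_{\theta = \theta_{n-1,i}}
     + \nabla f_i(\theta_{n-1,i}), \\
\mathbb{E}\left[\xi_{n,i} \mid \mathcal{F}_{n-1}\right]
  &= \mathbb{E}\left[\nabla \log p_{\theta,i}(X_i)\right]\big|_{\theta = \theta_{n-1,i}}
     - \mathbb{E}\left[\nabla \log p_{\theta,i}(X_i)\right]\big|_{\theta = \theta_{n-1,i}}
   = 0 .
\end{align*}
```

The first line is obtained by adding and subtracting $\gamma_n \nabla f_i(\theta_{n-1,i})$ in the on-line update. The second uses the fact that $X_{n,i}$ is a copy of $X_i$ independent of the past, while $\theta_{n-1,i}$ is determined by the past.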
It is expected that, under some assumptions,
$$\forall i \in V, \quad \lim_{n \to \infty} \theta_{n,i} = \theta^\star \tag{1.10}$$
where $\theta^\star$ is some minimizer of $F$ (assumed to exist). We refer to Chapter 2 for a more detailed state of the art on these techniques. However, we mention that convergence is generally proved under some strong assumptions on the matrices $(W_n)_n$ describing the consensus protocol. In general, the sought consensus is achieved under the double-stochasticity assumption ([116], [134]), i.e., the matrices $(W_n)_n$ are row and column stochastic, meaning that $W_n \mathbf{1} = \mathbf{1}$ and $\mathbf{1}^T W_n = \mathbf{1}^T$. In [19], [112] the column-stochasticity condition is relaxed and is only assumed to hold in expectation. This allows, for instance, the use of the broadcast gossip model of [10]. Similarly, the authors of [43] introduce a diffusion model that only requires the row-stochasticity condition, at the expense of its synchronous nature.
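The difference between the two stochasticity conditions can be made concrete on small matrices. The sketch below builds a pairwise-averaging matrix and a broadcast-type matrix and tests $W \mathbf{1} = \mathbf{1}$ and $\mathbf{1}^T W = \mathbf{1}^T$ numerically. The helper names, the mixing weight `mix`, and the specific form of the broadcast update (each receiving neighbor mixes its value with the sender's, the sender keeps its own) are illustrative assumptions and are not claimed to reproduce the exact model of [10].

```python
def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def pairwise_gossip_matrix(n, i, j):
    """Agents i and j replace their values by their pairwise average."""
    w = identity(n)
    w[i][i] = w[j][j] = 0.5
    w[i][j] = w[j][i] = 0.5
    return w


def broadcast_gossip_matrix(n, sender, neighbors, mix=0.5):
    """Each neighbor mixes its value with the sender's; others are unchanged."""
    w = identity(n)
    for j in neighbors:
        w[j][j] = 1.0 - mix
        w[j][sender] = mix
    return w


def is_row_stochastic(w, tol=1e-12):
    """Non-negative entries and every row summing to one (W 1 = 1)."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in w)


def is_column_stochastic(w, tol=1e-12):
    """Non-negative entries and every column summing to one (1^T W = 1^T)."""
    return is_row_stochastic([list(col) for col in zip(*w)], tol)


pairwise = pairwise_gossip_matrix(4, 0, 2)
broadcast = broadcast_gossip_matrix(4, sender=0, neighbors=[1, 2])
print(is_row_stochastic(pairwise), is_column_stochastic(pairwise))
print(is_row_stochastic(broadcast), is_column_stochastic(broadcast))
```

The pairwise matrix satisfies both conditions, so it preserves the network average. The broadcast-type matrix is row-stochastic only: the sender's column sums to more than one, which is why the average is not preserved at each step.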
The objective of this thesis is to derive convergence results such as (1.10) for the sequence generated by Algorithm (1.7)-(1.8) under milder conditions on $(W_n)_n$. We investigate the results of [19] when the matrices $(W_n)_n$ are only row-stochastic. We extend them to a broader communication setting, in which $(W_n)_n$ may depend on the observations or the last estimates. In addition, we consider a more general stochastic approximation framework by letting Algorithm (1.7)-(1.8) take the following form:
$$\theta_n = W_n(\theta_{n-1} + \gamma_n Y_n). \tag{1.11}$$
Recursion (1.11) extends the distributed optimization problem (1.6) to a more general framework. Indeed, Algorithm (1.11) can be viewed as a distributed version of the so-called
Robbins-Monro algorithm [139]. For that purpose, $Y_n$ may be related to an unbiased estimate of a given mean field function $h(\theta)$ whose roots are sought, i.e., the points $\theta$ such that $h(\theta) = 0$. We also study the convergence rate of this algorithm along with its asymptotic normality. Finally, an objective of the thesis is to investigate the use of the above algorithm for statistical inference tasks in sensor networks. We propose a distributed Expectation-Maximization algorithm inspired by the adaptation-diffusion approach. We also apply our method to the sensor self-localization problem.
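As an illustration of recursion (1.11) in its vector form, the sketch below runs a toy distributed Robbins-Monro iteration. Everything specific in it is an assumption made for the example and not taken from the text: agent $i$ observes Gaussian samples with mean $\mu_i$ and unit variance, its component of $Y_n$ is the sample minus its current estimate (a noisy observation of $\mu_i - \theta$), the step size is $\gamma_n = 1/n$, and $W_n$ is a random pairwise-averaging matrix, hence doubly stochastic. The names `matvec` and `distributed_robbins_monro` are hypothetical.

```python
import random


def matvec(w, v):
    """Plain matrix-vector product for lists of lists."""
    return [sum(w_rc * v_c for w_rc, v_c in zip(row, v)) for row in w]


def distributed_robbins_monro(means, n_iter=5000, seed=1):
    """Toy instance of theta_n = W_n (theta_{n-1} + gamma_n * Y_n).

    Assumptions (not from the text): Gaussian observations with unit
    variance, gamma_n = 1/n, and pairwise-averaging gossip matrices.
    """
    rng = random.Random(seed)
    n_agents = len(means)
    theta = [0.0] * n_agents
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n
        # Y_n: noisy observation of the local mean field mu_i - theta_i.
        y = [rng.gauss(means[i], 1.0) - theta[i] for i in range(n_agents)]
        # W_n: identity except for one randomly chosen averaging pair.
        i, j = rng.sample(range(n_agents), 2)
        w = [[1.0 if r == c else 0.0 for c in range(n_agents)]
             for r in range(n_agents)]
        w[i][i] = w[j][j] = w[i][j] = w[j][i] = 0.5
        theta = matvec(w, [t + gamma * y_i for t, y_i in zip(theta, y)])
    return theta


print(distributed_robbins_monro([1.0, 2.0, 3.0, 4.0]))
```

In this toy case the averaged mean field vanishes at the average of the $\mu_i$, so the agents' estimates are expected to cluster around that value. The sketch says nothing about the row-stochastic-only setting studied in the thesis, where the limit need not be the same.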