Convergence of distributed asynchronous learning
vector quantization algorithms
Outline.
1 Introduction.
2 Vector quantization, convergence of the CLVQ.
3 General distributed asynchronous algorithm.
4 Distributed Asynchronous Learning Vector Quantization (DALVQ).
Distributed computing.
Distributed algorithms arise in a wide range of applications, including telecommunications, scientific computing...
Parallelization: the most promising way to access more computing resources. Building faster serial computers is increasingly expensive and runs into physical limits (transmission speed, miniaturization).
Distributed large-scale algorithms encounter problems: communication delays (latency, bandwidth) and the lack of efficient shared memory.
Clustering algorithms.
One of the primary tools of unsupervised learning.
Outstanding role in data mining: scientific data exploration, information retrieval, marketing, text mining, computational biology...
Clustering: division of data into groups of similar objects.
Representing data by clusters: loses certain fine details but achieves simplification.
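Distortion.
Data has a distribution $\mu$: a Borel probability measure on $\mathbb{R}^d$ (with a second-order moment).
Model this distribution by $\kappa$ vectors of $\mathbb{R}^d$, the prototypes: $\kappa$ is the number of prototypes.
Objective: minimization of the criterion $C$: find $w^\circ$ s.t.
$$w^\circ \in \operatorname*{argmin}_{w \in (\mathbb{R}^d)^\kappa} C(w),$$
where the distortion $C$ is defined, for $w = (w_1, \ldots, w_\kappa) \in (\mathbb{R}^d)^\kappa$, by
$$C(w) = \frac{1}{2} \int_G \min_{1 \le \ell \le \kappa} \|z - w_\ell\|^2 \, d\mu(z).$$
The factor $1/2$ is the normalization under which $\nabla_\ell C(w) = \int_{W_\ell(w)} (w_\ell - z) \, d\mu(z)$.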
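The distortion can be estimated by Monte Carlo from a sample of $\mu$. The sketch below is only an illustration under stated assumptions: $\mu$ is taken uniform on the unit square, the normalization $1/2$ is used, and the name `empirical_distortion` is hypothetical.

```python
import random

def empirical_distortion(samples, prototypes):
    """Monte Carlo estimate of C(w) = 1/2 * E[min_l ||Z - w_l||^2]."""
    total = 0.0
    for z in samples:
        # squared distance from z to its nearest prototype
        total += min(sum((zi - wi) ** 2 for zi, wi in zip(z, w)) for w in prototypes)
    return 0.5 * total / len(samples)

rng = random.Random(0)
# assumption: mu is the uniform law on [0, 1)^2
samples = [(rng.random(), rng.random()) for _ in range(5000)]
one_center = [(0.5, 0.5)]
four_centers = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
c1 = empirical_distortion(samples, one_center)
c4 = empirical_distortion(samples, four_centers)
print(c1, c4)
```

With a single prototype at the center of the square the exact value is $1/12$; using four prototypes lowers the distortion, which is the simplification/precision trade-off of quantization.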
Assumptions on the distribution.
Strongly continuous.
$\mu$ is strongly continuous if $\mu(H) = 0$ for every hyperplane $H$ of $\mathbb{R}^d$.
If $\mu$ is strongly continuous, then $\operatorname{supp}(\mu)$ is infinite.
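Voronoï tessellations.
Notation: the set of all $\kappa$-tuples of $G$ is denoted $G^\kappa$.
$$D_*^\kappa = \left\{ w \in (\mathbb{R}^d)^\kappa \mid w_\ell \ne w_k \text{ if and only if } \ell \ne k \right\}.$$
Definition. For $w \in D_*^\kappa$,
$$W_\ell(w) = \left\{ v \in G \mid \|w_\ell - v\| < \|w_k - v\| \text{ for all } k \ne \ell \right\}.$$
$\{W_\ell(w)\}_{\ell=1}^{\kappa}$ is called the Voronoï tessellation of $G$ related to $w$.
If $\mu$ is strongly continuous: $\forall w \in D_*^\kappa$, the points of $G$ that belong to no cell lie on finitely many bisecting hyperplanes, hence form a $\mu$-null set.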
[Figure: Voronoï tessellation in 2D.]
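Membership in a Voronoï cell reduces to a nearest-prototype search with strict inequalities. The following is a minimal sketch; the function name `voronoi_cell` is hypothetical and indices are 0-based, unlike the 1-based $\ell$ of the slides.

```python
def voronoi_cell(v, prototypes):
    """Return the index l such that v is in W_l(w), or None if v is on a cell boundary."""
    d2 = [sum((a - b) ** 2 for a, b in zip(v, w)) for w in prototypes]
    best = min(d2)
    winners = [l for l, d in enumerate(d2) if d == best]
    # the cells are open: a point equidistant from two prototypes belongs to none
    return winners[0] if len(winners) == 1 else None

w = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(voronoi_cell((0.2, 0.1), w))  # 0
print(voronoi_cell((0.5, 0.0), w))  # None: equidistant from the first two prototypes
```

Boundary points are exactly the ties; for a strongly continuous $\mu$ they are drawn with probability zero.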
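Regularity of the distortion.
Theorem (Pagès)
$C$ is continuously differentiable at every $w = (w_1, \ldots, w_\kappa) \in D_*^\kappa$ s.t. $\mu\left(\bigcup_{\ell=1}^{\kappa} \partial W_\ell(w)\right) = 0$. Moreover, $\forall\, 1 \le \ell \le \kappa$,
$$\nabla_\ell C(w) = \int_{W_\ell(w)} (w_\ell - z) \, d\mu(z).$$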
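Local observation of the gradient.
Definition
For any $z \in \mathbb{R}^d$ and $w \in D_*^\kappa$, define the function $H$ by its $\ell$-th component,
$$H_\ell(z, w) = \begin{cases} w_\ell - z & \text{if } z \in W_\ell(w), \\ 0 & \text{otherwise.} \end{cases}$$
If the random variable $Z \sim \mu$, the next equality holds for all $w \in D_*^\kappa$,
$$\mathbb{E}\{H(Z, w)\} = \nabla C(w).$$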
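The link between $H$ and the gradient can be checked numerically. The sketch below assumes the sign convention $H_\ell(z, w) = w_\ell - z$ on the winning cell (the one for which the expectation of $H$ equals $\int_{W_\ell(w)} (w_\ell - z)\, d\mu(z)$), a uniform law on $[0, 1)$, and two prototypes; all names are illustrative.

```python
import random

def H(z, w):
    """Local observation: component l is w_l - z if z is in cell l, else 0."""
    d2 = [sum((a - b) ** 2 for a, b in zip(z, wl)) for wl in w]
    winner = d2.index(min(d2))  # ties have probability 0 for a continuous law
    zero = tuple(0.0 for _ in z)
    return [tuple(a - b for a, b in zip(wl, z)) if l == winner else zero
            for l, wl in enumerate(w)]

rng = random.Random(1)
w = [(0.2,), (0.6,)]            # cells are [0, 0.4) and (0.4, 1)
n = 20000
acc = [0.0, 0.0]
for _ in range(n):
    obs = H((rng.random(),), w)
    for l in range(2):
        acc[l] += obs[l][0]
h_estimate = [a / n for a in acc]
print(h_estimate)
```

For this example the integrals can be computed by hand: $\int_0^{0.4} (0.2 - z)\,dz = 0$ and $\int_{0.4}^{1} (0.6 - z)\,dz = -0.06$, so the estimate should be close to $(0, -0.06)$.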
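Extension on $\complement D_*^\kappa$.
Definition. For $w \in (\mathbb{R}^d)^\kappa$,
$$W_\ell(w) = \begin{cases} \left\{ v \in G \mid \|w_\ell - v\| < \min_{w_k \ne w_\ell} \|w_k - v\| \right\} & \text{if } (*), \\ \emptyset & \text{otherwise,} \end{cases}$$
where $(*) \equiv \ell = \min\{k \mid w_k = w_\ell\}$.
If the random variable $Z \sim \mu$, define, for all $w \in (\mathbb{R}^d)^\kappa$,
$$h(w) \triangleq \mathbb{E}\{H(Z, w)\}.$$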
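Proposition: If $\mu$ is strongly continuous,
1 $\operatorname*{argmin}_{w \in (\mathbb{R}^d)^\kappa} C(w) \cap \mathring{G}^\kappa \ne \emptyset$ (and $\operatorname*{argmin}_{w \in (\mathbb{R}^d)^\kappa} C(w) \subset \{\nabla C = 0\} \cap D_*^\kappa$),
2 and if $\operatorname{supp}(\mu) = G$ then
$$\operatorname*{argmin}_{w \in (\mathbb{R}^d)^\kappa} C(w) \subset \operatorname*{argminloc}_{w \in G^\kappa} C(w) \subset \mathring{G}^\kappa \cap \{\nabla C = 0\} \cap D_*^\kappa.$$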
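Stochastic gradient optimization.
Minimize $C$: gradient descent procedure $w := w - \nabla C(w)$.
$\nabla C(w)$ is unknown, use $H(z, w)$ instead,
$$w(t+1) = w(t) - \varepsilon_t H(z_{t+1}, w(t)),$$
where $w(0) \in \mathring{G}^\kappa \cap D_*^\kappa$ and $\{z_t\}_{t=0}^{+\infty}$ is an i.i.d. sequence of law $\mu$.
Usual assumptions: $\{\varepsilon_t\}_{t=0}^{+\infty} \subset (0, 1)$,
1 $\sum_{t=0}^{\infty} \varepsilon_t = +\infty$.
2 $\sum_{t=0}^{\infty} \varepsilon_t^2 < +\infty$.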
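A deeper look.
Zooming:
If $z_{t+1} \in W_{\ell_0}(w(t))$,
$$w_{\ell_0}(t+1) = (1 - \varepsilon_t) w_{\ell_0}(t) + \varepsilon_t z_{t+1}.$$
Convex combination between the old centroid and the new input.
$z_{t+1}$-centered homothety with ratio $(1 - \varepsilon_t)$.
As $w(0) \in \mathring{G}^\kappa \cap D_*^\kappa$, we have $w(t) \in \mathring{G}^\kappa \cap D_*^\kappa$.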
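The CLVQ iteration, seen as a convex combination of the winning prototype and the new input, can be sketched as follows. This is an illustration only: the step $\varepsilon_t = 1/(t+2)$, the uniform law on the unit square and the function name `clvq` are assumptions, not choices made in the slides.

```python
import random

def clvq(samples, w0):
    """Run w(t+1) = w(t) - eps_t * H(z_{t+1}, w(t)) with eps_t = 1 / (t + 2)."""
    w = [list(p) for p in w0]
    for t, z in enumerate(samples):
        eps = 1.0 / (t + 2)  # in (0, 1), sum diverges, sum of squares converges
        d2 = [sum((a - b) ** 2 for a, b in zip(z, p)) for p in w]
        l0 = d2.index(min(d2))  # winning prototype: z lies in its Voronoi cell
        # convex combination of the old prototype and the new input
        w[l0] = [(1 - eps) * a + eps * b for a, b in zip(w[l0], z)]
    return w

rng = random.Random(2)
samples = [(rng.random(), rng.random()) for _ in range(20000)]
w_final = clvq(samples, [(0.1, 0.2), (0.8, 0.3), (0.4, 0.9)])
print(w_final)
```

Only the winning prototype moves at each step, and since every update is a convex combination, prototypes started inside the convex set supporting the data never leave it.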
Troubles.
On the distortion:
$C$ is not a convex function.
$\|C(w)\| \nrightarrow +\infty$ as $\|w\| \to +\infty$.
On its gradient:
$\nabla C$ is singular at $\complement D_*^\kappa$.
What can be expected?
$$w(t) \nrightarrow w^* = \operatorname*{argmin}_{w \in G^\kappa} C(w).$$
Popular techniques.
Robbins-Monro:
Robbins-Monro techniques do not work here because $C$ is not convex.
Kushner-Clark:
Based on the study of the trajectories of the ODE
$$w' = -h(w), \quad w(0) \in G^\kappa.$$
$h$ is singular here!
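Assumption: [Compact supported Density]
$\mu$ has a bounded density whose support is the compact convex set $G$.
Theorem (G-Lemma, Fort and Pagès)
If
1 $\{w(t)\}_{t \ge 0}$ is $H^\infty$-valued, for some compact $H^\infty$,
2 $\sum_{s \ge 1} \varepsilon_s \left( H(z_{s+1}, w(s)) - h(w(s)) \right)$ converges in $(\mathbb{R}^d)^\kappa$,
3 there exists a l.s.c. nonnegative function $G : (\mathbb{R}^d)^\kappa \to \mathbb{R}_+$ (a function here, not the set $G$) s.t. $\sum_{s \ge 1} \varepsilon_{s+1} G(w(s)) < +\infty$,
then $w(t)$ converges to some connected component of $\{G = 0\} \cap H^\infty$.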
A suitable $G$: for every $w \in G^\kappa$,
$$\widehat{G}(w) \triangleq \liminf_{v \in G^\kappa \cap D_*^\kappa,\ v \to w} \|\nabla C(v)\|^2.$$
$\widehat{G}$ is a nonnegative l.s.c. function on $G^\kappa$.
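Theorem (Pagès)
Under assumption [Compact supp. Density], on the event
$$\left\{ \liminf_{t \to +\infty} \operatorname{dist}\left(w(t), \complement D_*^\kappa\right) > 0 \right\},$$
$$\operatorname{dist}(w(t), \Xi_\infty) \to 0 \text{ a.s. as } t \to \infty,$$
where $\Xi_\infty$ is some connected component of $\{\nabla C = 0\}$.
Remarks:
Asymptotically parted components: on this event the prototypes stay asymptotically distinct.
No satisfactory convergence result is provided without this assumption.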
Introduction.
Why?
The algorithm is too slow on large data sets or with high-dimensional data.
We need to process several inputs quickly. Need for scalability.
Distributed algorithm.
[Figure: four processors (Processor 1 to Processor 4), each computing on its own data.]
A generic descent term:
$$w(t+1) = w(t) + s(t), \quad s(t) \triangleq -\varepsilon_t H(z_{t+1}, w(t)).$$
Parallelization.
Basic parallelization.
For all $1 \le i \le M$, where $M$ is the number of processors,
$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t) w^j(t) + s^i(t),$$
where the weights $\{a^{i,j}(t)\}_{j=1}^{M}$ define a convex combination.
Synchronization effects:
Some synchronization is required in this model. We should take into account communication delays.
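One step of the basic (synchronous) parallelization is a weighted average of all versions followed by a local descent term. The sketch below is a minimal illustration; the function name `parallel_step` and the numerical values are hypothetical.

```python
def parallel_step(w, a, s):
    """w^i(t+1) = sum_j a[i][j] * w^j(t) + s^i(t), with vectors stored as lists."""
    M, dim = len(w), len(w[0])
    return [[sum(a[i][j] * w[j][k] for j in range(M)) + s[i][k] for k in range(dim)]
            for i in range(M)]

w = [[0.0, 0.0], [1.0, 2.0]]      # versions held by two processors
a = [[0.5, 0.5], [0.5, 0.5]]      # each row is a convex combination
s = [[0.1, 0.0], [0.0, -0.1]]     # descent terms
out = parallel_step(w, a, s)
print(out)
```

Every processor needs the current version of every other processor before it can compute its update, which is the synchronization cost that the asynchronous model removes.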
Tsitsiklis's asynchronous model.
Tsitsiklis's system:
$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t) w^j(\tau^{i,j}(t)) + s^i(t).$$
Notation:
$0 \le \tau^{i,j}(t) \le t$.
$w^i_\ell$, for $1 \le \ell \le \kappa$, is an $\mathbb{R}^d$-valued sequence.
$s^i(t)$: descent terms (any vector-valued sequence for now).
$\{a^{i,j}(t)\}_{j=1}^{M}$: weights of a convex combination.
[Figure: global time reference.]
Model agreement.
The agreement algorithm.
For all $1 \le i \le M$, $x^i(0) \in (\mathbb{R}^d)^\kappa$,
$$x^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t) x^j(\tau^{i,j}(t)).$$
Remark:
Similar to Tsitsiklis's system but with $s^i(t) = 0$ for all $t, i$.
Consensus:
Is there (or at least what are the conditions to ensure) an asymptotic consensus between the processors/workers?
Assumptions 1
Assumption (Bounded communication delays.)
There exists a positive integer constant $B_1$ s.t.
$$0 \le t - B_1 < \tau^{i,j}(t) \le t,$$
for all $i$ and $j$ and all $t \ge 0$.
Assumption (Strongly stoch. matrices.)
There exists a positive constant $\alpha > 0$ such that:
1 $a^{i,i}(t) \ge \alpha$, for all $i, t$.
2 $a^{i,j}(t) \in \{0\} \cup [\alpha, 1]$, for all $i, j, t$.
3 $\sum_{j=1}^{M} a^{i,j}(t) = 1$, for all $i, t$.
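Both assumptions are easy to test on a given weight matrix and delay. The checks below are a sketch: the function names are hypothetical, the row-sum condition is the convex combination property of the weights, and a small tolerance is assumed for floating-point sums.

```python
def is_strongly_stochastic(a, alpha, tol=1e-12):
    """Check a[i][i] >= alpha, a[i][j] in {0} U [alpha, 1], and rows summing to 1."""
    for i, row in enumerate(a):
        if row[i] < alpha:
            return False
        if any(x != 0 and not (alpha <= x <= 1) for x in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True

def delay_is_bounded(tau, t, B1):
    """Check t - B1 < tau <= t together with tau >= 0."""
    return 0 <= tau <= t and t - B1 < tau

print(is_strongly_stochastic([[0.5, 0.5], [0.25, 0.75]], alpha=0.25))
print(is_strongly_stochastic([[0.9, 0.1], [0.5, 0.5]], alpha=0.25))
print(delay_is_bounded(tau=8, t=10, B1=3), delay_is_bounded(tau=7, t=10, B1=3))
```

The second matrix fails because the weight 0.1 is positive but smaller than $\alpha$: a link is either absent or carries a weight bounded away from zero.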
Assumptions 2
Define a graph: the vertices, $V = \{1, \ldots, M\}$ (processors), and the set of edges, $E(t)$, where $(j, i) \in E(t)$ if and only if $a^{i,j}(t) > 0$.
Assumption (Graph connectivity.)
The graph $(V, \cup_{s \ge t} E(s))$ is strongly connected for all $t \ge 0$.
[Figure: communication graph between processors.]
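Strong connectivity of a communication graph can be tested with two reachability searches. The sketch below builds the edges from a single weight matrix (0-based processor indices); for the assumption itself the edge sets $E(s)$, $s \ge t$, would first be merged. Function names are hypothetical.

```python
def edges_from_weights(a):
    """(j, i) is an edge if and only if a[i][j] > 0: j sends its value to i."""
    M = len(a)
    return {(j, i) for i in range(M) for j in range(M) if a[i][j] > 0}

def strongly_connected(M, edges):
    """True if vertex 0 reaches every vertex and every vertex reaches vertex 0."""
    def reachable(start, adj):
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    fwd = {u: [] for u in range(M)}
    bwd = {u: [] for u in range(M)}
    for (u, v) in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    return len(reachable(0, fwd)) == M and len(reachable(0, bwd)) == M

ring = [[0.5, 0.0, 0.5], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
split = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
print(strongly_connected(3, edges_from_weights(ring)))   # True
print(strongly_connected(3, edges_from_weights(split)))  # False
```

In the second example the first processor never listens to anyone, so information cannot flow back to it and no consensus can be expected.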
Assumptions 3
Assumption (Bounded communication intervals.)
If $i$ communicates to $j$ an infinite number of times, then there is some $B_2$ such that, for all $t$,
$$(i, j) \in E(t) \cup E(t+1) \cup \ldots \cup E(t + B_2 - 1).$$
Assumption (Symmetry)
There exists some $B_2 > 0$ such that whenever $(i, j) \in E(t)$, then there exists some $\tau$ satisfying $|t - \tau| < B_2$ and $(j, i) \in E(\tau)$.
$(AsY)_1 \equiv$
Assumption [Bounded communication delays]
Assumption [Bounded communication intervals]
Assumption [Graph connectivity]
Assumption [Strongly stoch. matrices]
$(AsY)_2 \equiv$
Assumption [Bounded communication intervals]
Assumption [Symmetry]
Assumption [Graph connectivity]
Assumption [Strongly stoch. matrices]
Agreement theorem.
Theorem (Blondel et al.)
Under assumptions $(AsY)_1$ or $(AsY)_2$ there is a vector $c \in (\mathbb{R}^d)^\kappa$ (independent of $i$) s.t.
$$\lim_{t \to +\infty} \|x^i(t) - c\| = 0.$$
Even more, there exist $\rho \in [0, 1)$ and $L > 0$ s.t.
$$\|x^i(t) - x^i(\tau)\| \le L \rho^{t - \tau}, \quad \text{for every } t \ge \tau \ge 0.$$
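The agreement iteration with delays can be simulated to observe the consensus numerically. This is a sketch under assumptions that are not in the slides: scalar values, uniform weights $a^{i,j}(t) = 1/M$ (complete graph), delays drawn at random in $\{0, \ldots, B_1 - 1\}$, and no delay on a processor's own value.

```python
import random

def agreement(x0, steps, B1, rng):
    """x^i(t+1) = sum_j a^{i,j}(t) x^j(tau^{i,j}(t)); history[t][i] stores x^i(t)."""
    M = len(x0)
    history = [list(x0)]
    for t in range(steps):
        new = []
        for i in range(M):
            total = 0.0
            for j in range(M):
                # own value is read without delay, the others with a delay below B1
                tau = t if j == i else max(0, t - rng.randrange(B1))
                total += history[tau][j] / M  # a^{i,j}(t) = 1 / M
            new.append(total)
        history.append(new)
    return history

hist = agreement([0.0, 1.0, 4.0, 10.0], steps=200, B1=3, rng=random.Random(3))
spread = [max(x) - min(x) for x in hist]
print(spread[0], spread[50], spread[200])
```

The spread between processors shrinks quickly, in line with the geometric rate of the theorem, and the limit stays inside the interval spanned by the initial values.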
Lemma (Multiple Gradient Decomposition.)
For all $t \ge 0$ and all processors $1 \le i \le M$ there are scalars $\{\phi^{i,j}(t, \tau)\}_{\tau=0}^{t-1}$, $1 \le j \le M$, s.t. Tsitsiklis's system writes
$$w^i(t) = \sum_{j=1}^{M} \phi^{i,j}(t, 0) w^j(0) + \sum_{\tau=1}^{t-1} \sum_{j=1}^{M} \phi^{i,j}(t, \tau) s^j(\tau).$$
Remark:
The sequences $\{\phi^{i,j}(t, \tau)\}_{\tau=0}^{t-1}$ do not depend on the values taken by the descent terms $s^i(t)$.
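In the special case without delays the coefficients of the decomposition are products of the weight matrices, which makes the lemma easy to verify numerically. The sketch below is an illustration under that assumption, with scalar values; its time indexing (initial values weighted by `phi(T, -1)`, descent terms summed from $\tau = 0$) is a convention chosen for the code and differs slightly from the indexing of the slides.

```python
import random

M, T = 3, 6
rng = random.Random(4)

def random_stochastic(rng, n):
    """Random n x n matrix with positive entries and rows summing to 1."""
    rows = []
    for _ in range(n):
        r = [rng.random() + 0.1 for _ in range(n)]
        total = sum(r)
        rows.append([x / total for x in r])
    return rows

a = [random_stochastic(rng, M) for _ in range(T)]
s = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(T)]
w0 = [rng.uniform(-1, 1) for _ in range(M)]

# direct iteration of w(t+1) = a(t) w(t) + s(t)
w = list(w0)
for t in range(T):
    w = [sum(a[t][i][j] * w[j] for j in range(M)) + s[t][i] for i in range(M)]

def phi(t_end, tau):
    """Matrix a(t_end-1) ... a(tau+1); the identity when tau = t_end - 1."""
    P = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]
    for u in range(tau + 1, t_end):
        P = [[sum(a[u][i][k] * P[k][j] for k in range(M)) for j in range(M)]
             for i in range(M)]
    return P

init = phi(T, -1)
decomposed = [sum(init[i][j] * w0[j] for j in range(M)) for i in range(M)]
for tau in range(T):
    P = phi(T, tau)
    for i in range(M):
        decomposed[i] += sum(P[i][j] * s[tau][j] for j in range(M))
print(max(abs(x - y) for x, y in zip(w, decomposed)))
```

The coefficients are built from the weights only, never from the descent terms, which is the content of the remark.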
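$$w^i(t) = \sum_{j=1}^{M} \underbrace{\phi^{i,j}(t, 0)}_{\xrightarrow[t \to +\infty]{} \phi^j(0)} w^j(0) + \sum_{\tau=1}^{t-1} \sum_{j=1}^{M} \underbrace{\phi^{i,j}(t, \tau)}_{\xrightarrow[t \to +\infty]{} \phi^j(\tau)} s^j(\tau).$$
Moreover,
$$\left| \phi^{i,j}(t, \tau) - \phi^j(\tau) \right| \le A \rho^{t - \tau}, \quad \forall t \ge \tau \ge 0.$$
We define a sequence called the agreement vector sequence $\{w^*(t)\}_{t=0}^{+\infty}$,
$$w^*(t) \triangleq \sum_{j=1}^{M} \phi^j(0) w^j(0) + \sum_{\tau=1}^{t-1} \sum_{j=1}^{M} \phi^j(\tau) s^j(\tau).$$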
Agreement vector.
Remark:
The agreement vector $w^*$ satisfies, for all $t \ge 0$,
$$w^*(t+1) = w^*(t) + \sum_{j=1}^{M} \phi^j(t) s^j(t).$$
Interpretation:
$w^*(t_0)$ corresponds to the common value asymptotically achieved by all processors if computations with descent terms have stopped after $t_0$, i.e., $s^j(t) = 0$ for all $t \ge t_0$.
[Figure: global time reference, with a phase of computation with descent terms followed by a phase of communication only (model agreement).]
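Definition of descent terms.
Tsitsiklis's system with the descent terms $s^i$,
$$s^i(t) = \begin{cases} -\varepsilon^i_t H\left(z^i_{t+1}, w^i(t)\right) & \text{if } t \in T^i, \\ 0 & \text{otherwise.} \end{cases}$$
Notation:
$T^i$: set of times at which the vector $w^i$ is updated with a descent term.
$\{z^i_{t+1}\}$: i.i.d. sequences of random variables of law $\mu$.
$\mathcal{F}_t \triangleq \sigma\left(z^i_s, \text{ for all } s \le t \text{ and } 1 \le i \le M\right)$.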
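To be continued.
We assume that the sequences $\varepsilon^i_t$ satisfy the following conditions:
Assumption (Decreasing steps.)
There exist two constants $K_1, K_2$ such that, for all $i$ and all $t \ge 0$,
$$\frac{K_1}{t \vee 1} \le \varepsilon^i_t \le \frac{K_2}{t \vee 1}.$$
Assumption (Non idle.)
$$\sum_{j=1}^{M} \mathbf{1}_{t \in T^j} \ge 1, \quad \text{for all } t \ge 0.$$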
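The whole scheme (local descent steps at times in $T^i$, averaging of delayed versions otherwise) can be simulated in a few lines. Everything specific in this sketch is an assumption made for illustration: uniform data on $[0, 1)$, uniform weights $1/M$, random delays below $B_1$, times $T^i$ drawn at random with one processor forced to compute at each time, and steps $\varepsilon^i_t = 1/(t+2)$. The name `dalvq` is hypothetical.

```python
import random

def nearest(z, w):
    """Index of the prototype of w closest to z (the cell containing z)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(z, p)) for p in w]
    return d2.index(min(d2))

def dalvq(M, w0, steps, B1, p_descent, rng):
    """Simulate M processors; history[t][i] is the version w^i(t)."""
    kappa, dim = len(w0), len(w0[0])
    history = [[[list(p) for p in w0] for _ in range(M)]]
    for t in range(steps):
        eps = 1.0 / (t + 2)            # decreasing step, of order 1/t
        forced = rng.randrange(M)      # at least one processor computes at time t
        new = []
        for i in range(M):
            if i == forced or rng.random() < p_descent:
                # t in T^i: descent step on a fresh sample, no averaging
                z = [rng.random() for _ in range(dim)]
                wi = [list(p) for p in history[t][i]]
                l0 = nearest(z, wi)
                wi[l0] = [(1 - eps) * a + eps * b for a, b in zip(wi[l0], z)]
            else:
                # t not in T^i: average possibly delayed versions, a^{i,j}(t) = 1/M
                wi = [[0.0] * dim for _ in range(kappa)]
                for j in range(M):
                    tau = t if j == i else max(0, t - rng.randrange(B1))
                    for l in range(kappa):
                        for k in range(dim):
                            wi[l][k] += history[tau][j][l][k] / M
            new.append(wi)
        history.append(new)
    return history[-1]

final = dalvq(M=4, w0=[(0.2,), (0.7,)], steps=3000, B1=3, p_descent=0.5,
              rng=random.Random(5))
gap = max(abs(final[i][l][0] - final[0][l][0]) for i in range(4) for l in range(2))
print(final[0], gap)
```

All processors start from the same prototypes, so their components stay aligned and can be averaged index by index; `gap` measures the remaining disagreement between processors at the end of the run.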
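The recursion formula of the agreement vector writes
$$w^*(t+1) = w^*(t) + \sum_{j=1}^{M} \phi^j(t) s^j(t).$$
Using the function $h$,
$$h(w^*(t)) = \mathbb{E}\left\{ H(z_{t+1}, w^*(t)) \mid \mathcal{F}_t \right\}, \quad \text{and} \quad h(w^j(t)) = \mathbb{E}\left\{ H(z_{t+1}, w^j(t)) \mid \mathcal{F}_t \right\} \text{ for all } j.$$
Set
$$\varepsilon^*_t \triangleq \frac{1}{2} \sum_{j=1}^{M} \mathbf{1}_{t \in T^j} \phi^j(t) \varepsilon^j_t,$$
$$\Delta M^1_t \triangleq \frac{1}{2} \sum_{j=1}^{M} \mathbf{1}_{t \in T^j} \phi^j(t) \varepsilon^j_t \left( h(w^*(t)) - h(w^j(t)) \right),$$
and
$$\Delta M^2_t \triangleq \frac{1}{2} \sum_{j=1}^{M} \phi^j(t) \mathbf{1}_{t \in T^j} \varepsilon^j_t \left( h(w^j(t)) - H\left(z^j_{t+1}, w^j(t)\right) \right).$$
These quantities are used to rewrite the recursion of $w^*(t)$ in terms of $h(w^*(t))$.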
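Theorem (Asynchronous G-Lemma)
If
1 $\varepsilon^*_t \xrightarrow[t \to +\infty]{} 0$ and $\sum_{t \ge 0} \varepsilon^*_t = +\infty$,
2 the sequences $\{w^*(t)\}_{t=0}^{+\infty}$ and $\{h(w^*(t))\}_{t=0}^{+\infty}$ are bounded,
3 $\sum_{s=0}^{+\infty} \Delta M^1_s$ and $\sum_{s=0}^{+\infty} \Delta M^2_s$ converge in $(\mathbb{R}^d)^\kappa$,
4 there exists a l.s.c. map $G : (\mathbb{R}^d)^\kappa \to \mathbb{R}_+$ s.t. $\sum_{t=0}^{+\infty} \varepsilon^*_{t+1} G(w^*(t)) < +\infty$,
then
$$\lim_{t \to +\infty} \operatorname{dist}(w^*(t), \Xi_\infty) = 0,$$
where $\Xi_\infty$ is some connected component of $\{G = 0\}$.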
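Assumption (Trajectories in $G^\kappa$.)
$$\mathbb{P}\left\{ w^j(t) \in G^\kappa \right\} = 1, \quad \text{for all } j \text{ and all } t \ge 0.$$
Remark:
Assumption [Trajectories in $G^\kappa$] is not restrictive: take $w^j(0) \in G^\kappa$ for all $j$ and recall that $G$ is convex.
If $t \in T^i$, we have $a^{i,j}(t) = \delta^{i,j}$ and $\tau^{i,i}(t) = t$, so that, for the prototype whose cell contains $z^i_{t+1}$,
$$w^i(t+1) = w^i(t) - \varepsilon^i_t \left( w^i(t) - z^i_{t+1} \right) = \left(1 - \varepsilon^i_t\right) w^i(t) + \varepsilon^i_t z^i_{t+1}.$$
If $t \notin T^i$, then the equation writes
$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t) w^j(\tau^{i,j}(t)).$$
Both updates are convex combinations of elements of $G^\kappa$.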
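Lemma
Under Assumption [Trajectories in $G^\kappa$], for all $t \ge 1$,
$$\|w^*(t) - w^i(t)\| \le \sqrt{\kappa}\, M \operatorname{diam}(G)\, A\, (K_2 \vee 1) \sum_{\tau=-1}^{t-1} \frac{1}{\tau \vee 1} \rho^{t - \tau}, \quad A > 0.$$
Remark
Under Assumption [Trajectories in $G^\kappa$],
1 $w^*(t) - w^i(t) \to 0$ as $t \to +\infty$, a.s.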
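The sum in the bound of the lemma mixes a geometric weight with steps of order $1/\tau$, and it can be evaluated directly to see that it vanishes. The sketch below is illustrative only: the value $\rho = 0.8$ and the function name `delay_sum` are arbitrary assumptions.

```python
def delay_sum(t, rho):
    """sum_{tau=-1}^{t-1} rho ** (t - tau) / max(tau, 1)."""
    return sum(rho ** (t - tau) / max(tau, 1) for tau in range(-1, t))

values = [delay_sum(t, 0.8) for t in (10, 100, 1000, 10000)]
print(values)
```

Only the most recent terms matter because of the factor $\rho^{t - \tau}$, so for large $t$ the sum behaves like $\frac{1}{t} \cdot \frac{\rho}{1 - \rho}$, which tends to 0.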
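Proposition
If Assumption [Trajectories in $G^\kappa$] holds, then
$$\varepsilon^*_t \xrightarrow[t \to +\infty]{} 0 \quad \text{and} \quad \sum_{t \ge 0} \varepsilon^*_t = +\infty.$$
Proposition
If Assumption [Trajectories in $G^\kappa$] holds, then the series $\sum_{s=0}^{+\infty} \Delta M^1_s$ and $\sum_{s=0}^{+\infty} \Delta M^2_s$ converge in $(\mathbb{R}^d)^\kappa$.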
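Assumption (Parted component assumption.)
1 $\mathbb{P}\{w^*(t) \in D_*^\kappa\} = 1$, for all $t \ge 0$.
2 $\mathbb{P}\left\{ \liminf_{t \to +\infty} \operatorname{dist}\left(w^*(t), \complement D_*^\kappa\right) > 0 \right\} = 1$.
Proposition
Under assumptions [Trajectories in $G^\kappa$] and [Parted component assumption], we have
1 $C(w^*(t)) \xrightarrow[t \to +\infty]{\text{a.s.}} C_\infty$, where $C_\infty$ is some random variable.
2 $\sum_{t=0}^{+\infty} \varepsilon^*_{t+1} \|\nabla C(w^*(t))\|^2 < +\infty$ a.s.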
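The Asynchronous Theorem.
Theorem (Asynchronous Theorem)
If assumptions [Trajectories in $G^\kappa$] and [Parted component assumption] hold, then
$$w^*(t) \xrightarrow[t \to +\infty]{\text{a.s.}} \Xi_\infty,$$
and
$$w^i(t) \xrightarrow[t \to +\infty]{\text{a.s.}} \Xi_\infty, \quad \text{for all processors } i.$$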
Bibliography
Dimitri P. Bertsekas and John N. Tsitsiklis.
Parallel and distributed computation: numerical methods.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.
V.D. Blondel, J.M. Hendrickx, A. Olshevsky, and J.N. Tsitsiklis.
Convergence in multiagent coordination, consensus, and flocking.
Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC '05. 44th IEEE Conference on, pages 2996–3000, Dec. 2005.
Jean-Claude Fort and Gilles Pagès.
Convergence of stochastic algorithms: From the Kushner-Clark theorem to the Lyapounov functional method.
Advances in Applied Probability, 28(4):1072–1094, 1996.
Gilles Pagès.
A space vector quantization for numerical integration.
Journal of Applied and Computational Mathematics, 89:1–38, 1997.
J. Tsitsiklis, D. Bertsekas, and M. Athans.
Distributed asynchronous deterministic and stochastic gradient optimization algorithms.