• No results found

Convergence of distributed asynchronous learning vector quantization algorithms

N/A
N/A
Protected

Academic year: 2021

Share "Convergence of distributed asynchronous learning vector quantization algorithms"

Copied!
100
0
0

Loading.... (view fulltext now)

Full text

(1)

Convergence of distributed asynchronous learning

vector quantization algorithms

(2)

Outline.

1 Introduction.

2 Vector quantization, convergence of the CLVQ.

3 General distributed asynchronous algorithm.

4 Distributed Asynchronous Learning Vector Quantization (DALVQ).

(3)

Distributed computing.

Distributed algorithmsarise in a wide range of applications: including telecommunications, scientific computing...

Parallelization: most promising way to allow more computing

resources. Building fasterserial computers: increasingly expensive + strikes physical limits (transmission speed, miniaturization).

Distributedlarge scale algorithmsencounter problems:

communication delays(latency, bandwidth), the lack ofefficient shared memory.

(4)

Distributed computing.

Distributed algorithmsarise in a wide range of applications: including telecommunications, scientific computing...

Parallelization: most promising way to allow more computing

resources. Building fasterserial computers: increasingly

expensive + strikes physical limits (transmission speed, miniaturization).

Distributedlarge scale algorithmsencounter problems:

communication delays(latency, bandwidth), the lack ofefficient shared memory.

(5)

Distributed computing.

Distributed algorithmsarise in a wide range of applications: including telecommunications, scientific computing...

Parallelization: most promising way to allow more computing

resources. Building fasterserial computers: increasingly

expensive + strikes physical limits (transmission speed, miniaturization).

Distributedlarge scale algorithmsencounter problems:

communication delays(latency, bandwidth), the lack ofefficient shared memory.

(6)

Clustering algorithms.

One of the primary tools ofunsupervised learning.

Outstanding role indatamining: scientific data exploration, information retrieval, marketing, text mining, computational biology...

Clustering: division of data intogroupsof similar objects.

Representing data by clusters: loses certain fine details but achievessimplification.

(7)

Clustering algorithms.

One of the primary tools ofunsupervised learning.

Outstanding role indatamining: scientific data exploration,

information retrieval, marketing, text mining, computational biology...

Clustering: division of data intogroupsof similar objects.

Representing data by clusters: loses certain fine details but achievessimplification.

(8)

Clustering algorithms.

One of the primary tools ofunsupervised learning.

Outstanding role indatamining: scientific data exploration,

information retrieval, marketing, text mining, computational biology...

Clustering: division of data intogroupsof similar objects.

Representing data by clusters: loses certain fine details but achievessimplification.

(9)

Clustering algorithms.

One of the primary tools ofunsupervised learning.

Outstanding role indatamining: scientific data exploration,

information retrieval, marketing, text mining, computational biology...

Clustering: division of data intogroupsof similar objects. Representing data by clusters: loses certain fine details but

achievessimplification.

(10)

Clustering algorithms.

One of the primary tools ofunsupervised learning.

Outstanding role indatamining: scientific data exploration,

information retrieval, marketing, text mining, computational biology...

Clustering: division of data intogroupsof similar objects. Representing data by clusters: loses certain fine details but

achievessimplification.

(11)
(12)

Distortion.

Datahas a distribution µ: Borel probability measure on Rd (with a

second order moment).

Model this distribution by κ vectors of Rd: the number of

prototypes.

Objective:minimizationof the criterion C: find w0s.t.

w0∈ argmin w∈(Rd)κ

C(w),

where thedistortionC is defined by, forw = (w1, . . . ,wκ) ∈ Rd κ , C(w) = Z G min 1≤`≤κkz−w`k 2d µ(z ).

(13)

Distortion.

Datahas a distribution µ: Borel probability measure on Rd (with a

second order moment).

Model this distribution by κ vectors of Rd: the number of

prototypes.

Objective:minimizationof the criterion C: find w0s.t.

w0∈ argmin w∈(Rd)κ

C(w),

where thedistortionC is defined by, forw = (w1, . . . ,wκ) ∈ Rd κ , C(w) = Z G min 1≤`≤κkz−w`k 2d µ(z ).

(14)

Distortion.

Datahas a distribution µ: Borel probability measure on Rd (with a

second order moment).

Model this distribution by κ vectors of Rd: the number of

prototypes.

Objective:minimizationof the criterion C: find w0s.t.

w0∈ argmin

w∈(Rd)κ

C(w),

where thedistortionC is defined by, forw = (w1, . . . ,wκ) ∈ Rd κ , C(w) = Z G min 1≤`≤κkz−w`k 2d µ(z ).

(15)

Distortion.

Datahas a distribution µ: Borel probability measure on Rd (with a

second order moment).

Model this distribution by κ vectors of Rd: the number of

prototypes.

Objective:minimizationof the criterion C: find w0s.t.

w0∈ argmin

w∈(Rd)κ

C(w),

where thedistortionC is defined by, forw = (w1, . . . ,wκ) ∈ Rd

κ , C(w) = Z G min 1≤`≤κkz−w`k 2d µ(z ).

(16)

Assumptions on the distribution.

Strongly continuous.

µisstrongly continuousif for all hyperplane H, of Rd, µ(H) = 0.

If µ is strongly continuous, then supp (µ) is infinite.

(17)

Assumptions on the distribution.

Strongly continuous.

µisstrongly continuousif for all hyperplane H, of Rd, µ(H) = 0.

If µ is strongly continuous, then supp (µ) is infinite.

(18)

Voronoï tesselations.

Notation:

The set of all κ-tuples of G is denoted Gκ.

Dκ ∗ =  w ∈ Rdκ |w` 6=wk if and only if ` 6= k . Definition Forw ∈ Dκ ∗, W`(w) = {v ∈ G| kw`− v k < kwk − v k , if k 6= `} . {W`(w)}κ`=1is called theVoronoï tessellationof G related to w .

If µ is strongly continuous: ∀w ∈ Dκ ∗,

(19)

Voronoï tesselations.

Notation:

The set of all κ-tuples of G is denoted Gκ.

Dκ ∗ =  w ∈ Rdκ |w` 6=wk if and only if ` 6= k . Definition Forw ∈ Dκ ∗, W`(w) = {v ∈ G| kw`− v k < kwk − v k , if k 6= `} . {W`(w)}κ`=1is called theVoronoï tessellationof G related to w .

If µ is strongly continuous: ∀w ∈ Dκ ∗,

(20)

Voronoï tesselations.

Notation:

The set of all κ-tuples of G is denoted Gκ.

Dκ ∗ =  w ∈ Rdκ |w` 6=wk if and only if ` 6= k . Definition Forw ∈ Dκ ∗, W`(w) = {v ∈ G| kw`− v k < kwk − v k , if k 6= `} .

{W`(w)}κ`=1is called theVoronoï tessellationof G related to w .

If µ is strongly continuous: ∀w ∈ Dκ ∗,

(21)

Voronoï tesselations.

Notation:

The set of all κ-tuples of G is denoted Gκ.

Dκ ∗ =  w ∈ Rdκ |w` 6=wk if and only if ` 6= k . Definition Forw ∈ Dκ ∗, W`(w) = {v ∈ G| kw`− v k < kwk − v k , if k 6= `} .

{W`(w)}κ`=1is called theVoronoï tessellationof G related to w .

If µ is strongly continuous: ∀w ∈ Dκ

∗,

(22)

Voronoï tesselations 2D.

(23)
(24)

Regularity of the distortion.

Theorem (Pagès)

C is continuously differentiable at everyw = (w1, . . . ,wκ) ∈ D∗κs.t.

µ (Sκ `=1∂W`(w)). ∀1 ≤ ` ≤ κ, ∇`C(w) = Z W`(w) (w`−z)d µ(z).

(25)

Regularity of the distortion.

Theorem (Pagès)

C is continuously differentiable at everyw = (w1, . . . ,wκ) ∈ D∗κs.t.

µ (Sκ `=1∂W`(w)). ∀1 ≤ ` ≤ κ, ∇`C(w) = Z W`(w) (w`−z)d µ(z).

(26)

Local observation of the gradient.

Definition

For anyz∈ Rd andw ∈ Dκ

∗, define function H by its `-th component,

H`(z,w) =

(

z−w` ifz∈ W`(w)

0 otherwise.

If random variable Z ∼ µ, the next equality holds for allw ∈ Dκ ∗,

(27)

Local observation of the gradient.

Definition

For anyz∈ Rd andw ∈ Dκ

∗, define function H by its `-th component,

H`(z,w) =

(

z−w` ifz∈ W`(w)

0 otherwise.

If random variable Z ∼ µ, the next equality holds for allw ∈ Dκ

∗,

(28)

Extension on {D

∗κ

.

Definition w ∈ Rdκ , W`(w) = ( {v ∈ G| kw`− v k < minw`6=wkkwk − v k} , if(∗), ∅ otherwise. (∗) ≡ ` , min {k|wk =w`}.

If random variable Z ∼ µ, the next equality holds for allw ∈ Rdκ , h(w) , E {H(Z ,w)} .

(29)

Extension on {D

∗κ

.

Definition w ∈ Rdκ , W`(w) = ( {v ∈ G| kw`− v k < minw`6=wkkwk − v k} , if(∗), ∅ otherwise. (∗) ≡ ` , min {k|wk =w`}.

If random variable Z ∼ µ, the next equality holds for allw ∈ Rdκ

,

(30)

Proposition: If µ is strongly continuous, 1 argmin w∈(Rd)κC(w)∩ ◦ Gκ6= ∅ (and argminw∈( Rd) κC(w) ⊂ {∇C = 0} ∩ Dκ),

2 and if supp(µ) = G then,

argmin w∈(Rd)κ C(w) ⊂argminloc w∈Gκ C(w) ⊂ ◦ Gκ ∩ {∇C = 0} ∩ Dκ.

(31)

Proposition: If µ is strongly continuous, 1 argmin w∈(Rd)κC(w)∩ ◦ Gκ6= ∅ (and argminw∈( Rd) κC(w) ⊂ {∇C = 0} ∩ Dκ),

2 and if supp(µ) = G then,

argmin w∈(Rd)κ C(w) ⊂argminloc w∈Gκ C(w) ⊂ ◦ Gκ ∩ {∇C = 0} ∩ Dκ.

(32)

Stochastic gradient optimization.

Minimize C: gradient descent procedurew :=w− ∇C(w).

∇C(w)is unknown, use H(z,w)instead,

w (t + 1)=w (t)− εtH (zt+1,w (t)) ,

w (0)∈G◦κ∩Dκ

∗ and {zt}+∞t=0 is a iid sequence of law µ.

Usual assumptions: {εt}+∞t=0 ∈ (0, 1),

1 P∞

t=0εt = +∞.

2 P∞

(33)

Stochastic gradient optimization.

Minimize C: gradient descent procedurew :=w− ∇C(w).

∇C(w)is unknown, use H(z,w)instead,

w (t + 1)=w (t)− εtH (zt+1,w (t)) ,

w (0)∈G◦κ∩Dκ

∗ and {zt}+∞t=0 is a iid sequence of law µ. Usual assumptions: {εt}+∞t=0 ∈ (0, 1),

1 P∞

t=0εt = +∞.

2 P∞

(34)

Stochastic gradient optimization.

Minimize C: gradient descent procedurew :=w− ∇C(w).

∇C(w)is unknown, use H(z,w)instead,

w (t + 1)=w (t)− εtH (zt+1,w (t)) ,

w (0)∈G◦κ∩Dκ

∗ and {zt}+∞t=0 is a iid sequence of law µ.

Usual assumptions: {εt}+∞t=0 ∈ (0, 1), 1 P∞

t=0εt = +∞.

2 P∞

(35)

A deeper look.

Zooming:

Ifzt+1∈ W`0(w (t)),

w`0(t + 1)= (1 − εt)w`0(t)+ εtzt+1.

convex combination between old centroid and new input. zt+1centered homotethy with ratio (1 − εt).

Asw (0)∈G◦κ ∩Dκ

∗, we havew (t)∈ ◦

∩Dκ

(36)

A deeper look.

Zooming:

Ifzt+1∈ W`0(w (t)),

w`0(t + 1)= (1 − εt)w`0(t)+ εtzt+1.

convex combination between old centroid and new input.

zt+1centered homotethy with ratio (1 − εt).

Asw (0)∈G◦κ ∩Dκ

∗, we havew (t)∈

∩Dκ

(37)
(38)
(39)
(40)
(41)
(42)

Troubles.

On the distortion:

C is not a convex function.

kC(w)k 9 +∞ as kwk → +∞.

On its gradient:

∇C is singular at {Dκ ∗.

(43)

Troubles.

On the distortion:

C is not a convex function.

kC(w)k 9 +∞ as kwk → +∞.

On its gradient:

∇C is singular at {Dκ

∗.

(44)

What can be expected ?

w (t)9w∗ =argmin

w∈Gκ

C(w ).

(45)

What can be expected ?

w (t)9w∗ =argmin

w∈Gκ

C(w ).

(46)

Popular techniques.

Robbins Monroe:

Robbins Monroe techniques do not work here because C is not convex.

Kushner-Clark:

Based on the study of the trajectories of the ODE, ODE ≡w0= −h(w), w (0)∈ Gκ, h is singular here!

(47)

Popular techniques.

Robbins Monroe:

Robbins Monroe techniques do not work here because C is not convex.

Kushner-Clark:

Based on the study of the trajectories of the ODE,

ODE ≡w0= −h(w), w (0)∈ Gκ,

(48)

Assumption: [Compact supported Density]

µhas a bounded density whose support is the compact convex set G.

Theorem (G-Lemma Fort and Pagès)

If,

1 {w (t)}

t≥0is H∞valued, for some compact H∞.

2 P

s≥1εs(H(zs+1,w (s)) −h(w (s)))converge in Rd

κ . 3 There exist a l.s.c nonnegative function G : Rd → R

+s.t

P

s≥1εs+1G(w (s)) < +∞

Then,w (t)converges to some connected component of {G = 0} ∩ H∞.

(49)

Assumption: [Compact supported Density]

µhas a bounded density whose support is the compact convex set G.

Theorem (G-Lemma Fort and Pagès)

If,

1 {w (t)}

t≥0is H∞valued, for some compact H∞.

2 P

s≥1εs(H(zs+1,w (s)) −h(w (s)))converge in Rd

κ .

3 There exist a l.s.c nonnegative function G : Rd → R

+s.t

P

s≥1εs+1G(w (s)) < +∞

Then,w (t)converges to some connected component of

(50)

A suitable G: For everyw ∈ Gκ, b G(w) , lim inf v ∈Gκ∩Dκ ∗,v →w k∇C(v )k2. b G is a nonnegative l.s.c. function on Gκ.

(51)

Theorem (Pagès)

Under assumption[Compact supp. Density], on the event

 lim inf t→+∞dist w (t), {D κ ∗ > 0  , dist (w (t), Ξ∞) =0 a.s as t → ∞,

where Ξ∞is some connected component of {∇C = 0}.

Remarks:

Asymptotically parted component.

No satisfactory convergence result is provided without this assumption.

(52)

Theorem (Pagès)

Under assumption[Compact supp. Density], on the event

 lim inf t→+∞dist w (t), {D κ ∗ > 0  , dist (w (t), Ξ∞) =0 a.s as t → ∞,

where Ξ∞is some connected component of {∇C = 0}.

Remarks:

Asymptotically parted component.

No satisfactory convergence result is provided without this assumption.

(53)

Introduction.

Why?

The algorithm is too slow on large data set or with high dimension data.

We need to process several inputs quickly. Need for scalability.

(54)

Distributed algorithm.

Processor 1 Processor 3 Processor 2 Processor 4 Data Computes Data Data Data

(55)
(56)
(57)

A generic descent term:

w (t + 1)=w (t)+ −εtH (zt+1,w (t))

| {z }

,s(t)

(58)

Parallelization.

Basic parallelization.

For all 1 ≤ i ≤ M, where M is the number of processors.

wi(t + 1)=

M

X

j=1

ai,j(t)wj(t)+si(t).

Where theai,j(t) M

j=1is a convex combination.

Synchronization effects:

Some synchronization effects are required in this model. We should take into account communication delays.

(59)

Parallelization.

Basic parallelization.

For all 1 ≤ i ≤ M, where M is the number of processors.

wi(t + 1)=

M

X

j=1

ai,j(t)wj(t)+si(t).

Where theai,j(t) M

j=1is a convex combination.

Synchronization effects:

Some synchronization effects are required in this model. We should take into account communication delays.

(60)

The Tsitsikils’s asynchronous model.

The Tsitsiklis’s system:

wi(t + 1) = M X j=1 ai,j(t)wj(τi,j(t))+si(t). Notation: 0 ≤ τi,j(t) ≤ t.

w`i, for 1 ≤ ` ≤ κ is a Rd valued sequence.

si(t)descent terms (any vector valued sequence for now). ai,j(t) M

(61)

The Tsitsikils’s asynchronous model.

The Tsitsiklis’s system:

wi(t + 1) = M X j=1 ai,j(t)wj(τi,j(t))+si(t). Notation: 0 ≤ τi,j(t) ≤ t.

w`i, for 1 ≤ ` ≤ κ is a Rd valued sequence.

si(t)descent terms (any vector valued sequence for now).

ai,j(t) M

(62)

Global time reference

(63)

Model agreement.

The agreement algorithm.

For all, 1 ≤ i ≤ M,xi(0)∈ Rdκ , xi(t + 1)= M X j=1 ai,j(t)xj(τi,j(t)). Remark:

Similar toTsitsiklis’s systembut withsi(t)=0 for all t, i. Consensus:

Is there (or at least what are the conditions to ensure) an asymptotical consensus between the processors/workers?

(64)

Model agreement.

The agreement algorithm.

For all, 1 ≤ i ≤ M,xi(0)∈ Rdκ , xi(t + 1)= M X j=1 ai,j(t)xj(τi,j(t)). Remark:

Similar toTsitsiklis’s systembut withsi(t)=0 for all t, i.

Consensus:

Is there (or at least what are the conditions to ensure) an asymptotical consensus between the processors/workers?

(65)

Model agreement.

The agreement algorithm.

For all, 1 ≤ i ≤ M,xi(0)∈ Rdκ , xi(t + 1)= M X j=1 ai,j(t)xj(τi,j(t)). Remark:

Similar toTsitsiklis’s systembut withsi(t)=0 for all t, i.

Consensus:

Is there (or at least what are the conditions to ensure) an asymptotical consensus between the processors/workers?

(66)

Assumptions 1

Assumption (Bounded communication delays.)

There exist a positive integer constant B1s.t.,

0 ≤ t − B1< τi,j(t) ≤ t,

for all i and j and all t ≥ 0.

Assumption (Strongly stoch. matrices.)

There exists a positive constant α > 0 such that: 1 ai,i(t) ≥ α, for all i, t.

2 ai,j(t) ∈ {0} ∪ [α, 1], for all i, j, t. 3 PM

(67)

Assumptions 1

Assumption (Bounded communication delays.)

There exist a positive integer constant B1s.t.,

0 ≤ t − B1< τi,j(t) ≤ t,

for all i and j and all t ≥ 0.

Assumption (Strongly stoch. matrices.)

There exists a positive constant α > 0 such that:

1 ai,i(t) ≥ α, for all i, t.

2 ai,j(t) ∈ {0} ∪ [α, 1], for all i, j, t. 3 PM

(68)

Assumptions 2

Define a graph: thevertices, V = {1, . . . , M} (processors) and the set

ofedges, E (t) where (j, i) ∈ E (t) if and only if ai,j(t) > 0. Assumption (Graph connectivity.)

The graph (V, ∪s≥tE (s)) is strongly connected for all t ≥ 0.

Processor 1

Processor 2 Processor 3

Processor 5

(69)

Assumptions 2

Define a graph: thevertices, V = {1, . . . , M} (processors) and the set

ofedges, E (t) where (j, i) ∈ E (t) if and only if ai,j(t) > 0.

Assumption (Graph connectivity.)

The graph (V, ∪s≥tE (s)) is strongly connected for all t ≥ 0.

Processor 1

Processor 2 Processor 3

Processor 5

(70)

Assumptions 3

Assumption (Bounded communication intervals.)

If i communicates to j an infinite number of time, then there is some B2

such that, for all t, (i, j) = E (t) ∪ E (t + 1) ∪ . . . ∪ E (t + B − 1) Assumption (Symmetry)

There exists some B2>0 such that whenever (i, j) ∈ E (t), then there

(71)

Assumptions 3

Assumption (Bounded communication intervals.)

If i communicates to j an infinite number of time, then there is some B2

such that, for all t, (i, j) = E (t) ∪ E (t + 1) ∪ . . . ∪ E (t + B − 1)

Assumption (Symmetry)

There exists some B2>0 such that whenever (i, j) ∈ E (t), then there

(72)

(AsY )1≡           

Assumption [Bounded communication delays] Assumption [Bounded communication intervals.] Assumption [Graph connectivity.]

Assumption [Strongly stoch. matrices]

(AsY )2≡           

Assumption [Bounded communication intervals.] Assumption [Symmetry]

Assumption [Graph connectivity.] Assumption [Strongly stoch. matrices]

(73)

Agreement theorem.

Theorem (Blondel et al.)

Under assumptions (AsY )1or (AsY )2there is a vector c ∈ Rdκ

(independent of i) s. t., lim t→+∞ x i(t)− c =0.

Even more, there exists ρ ∈ [0, 1) and L > 0, s.t., x i(t)xi(τ ) ≤ Lρ t−τ, for every t ≥ τ ≥ 0.

(74)

Lemma (Multiple Gradient Decomposition.)

For all t ≥ 0, and all processors 1 ≤ i ≤ M there are scalars, ∀1 ≤ j ≤ M,φi,j(t, τ ) t−1

τ =0, s.t., Tsitsikli’s system writes,

wi(t)= M X j=1 φi,j(t, 0)wj(0)+ t−1 X τ =1 M X j=1 φi,j(t, τ )sj(τ ). Remark:

The sequencesφi,j(t, τ ) t−1τ =0do not depend on the value taken by the descent termssi(t).

(75)

Lemma (Multiple Gradient Decomposition.)

For all t ≥ 0, and all processors 1 ≤ i ≤ M there are scalars, ∀1 ≤ j ≤ M,φi,j(t, τ ) t−1

τ =0, s.t., Tsitsikli’s system writes,

wi(t)= M X j=1 φi,j(t, 0)wj(0)+ t−1 X τ =1 M X j=1 φi,j(t, τ )sj(τ ). Remark:

The sequencesφi,j(t, τ ) t−1τ =0do not depend on the value taken by

(76)

wi(t)= M X j=1 φi,j(t, 0) | {z } −−−−→ t→+∞ φ j(0) wj(0)+ t−1 X τ =1 M X j=1 φi,j(t, τ ) | {z } −−−−→ t→+∞ φ j(τ ) sj(τ ). Moreover, φ i,j(t, τ ) − φi(τ ) ≤ Aρ t−τ, ∀t ≥ τ ≥ 0.

We define a sequence calledagreement vector sequence{w∗(t)}+∞ t=0, w∗(t), M X j=1 φj(0)wj(0)+ t−1 X τ =1 M X j=1 φj(τ )sj(τ ).

(77)

wi(t)= M X j=1 φi,j(t, 0) | {z } −−−−→ t→+∞ φ j(0) wj(0)+ t−1 X τ =1 M X j=1 φi,j(t, τ ) | {z } −−−−→ t→+∞ φ j(τ ) sj(τ ). Moreover, φ i,j(t, τ ) − φi(τ ) ≤ Aρ t−τ, ∀t ≥ τ ≥ 0.

We define a sequence calledagreement vector sequence{w∗(t)}+∞ t=0, w∗(t), M X j=1 φj(0)wj(0)+ t−1 X τ =1 M X j=1 φj(τ )sj(τ ).

(78)

wi(t)= M X j=1 φi,j(t, 0) | {z } −−−−→ t→+∞ φ j(0) wj(0)+ t−1 X τ =1 M X j=1 φi,j(t, τ ) | {z } −−−−→ t→+∞ φ j(τ ) sj(τ ). Moreover, φ i,j(t, τ ) − φi(τ ) ≤ Aρ t−τ, ∀t ≥ τ ≥ 0.

We define a sequence calledagreement vector sequence{w∗(t)}+∞t=0,

w∗(t), M X j=1 φj(0)wj(0)+ t−1 X τ =1 M X j=1 φj(τ )sj(τ ).

(79)

Agreement vector.

Remark:

The agreement vectorw∗satisfies, for all t ≥ 0,

w∗(t + 1)=w∗(t)+ M X j=1 φj(t)sj(t). Interpretation:

w∗(t0)corresponds to thecommonvalueasymptoticallyachieved by all processors if computations with descent termshave stoppedafter t0, i.e,sj(t)=0 for all t ≥ t0..

(80)

Agreement vector.

Remark:

The agreement vectorw∗satisfies, for all t ≥ 0,

w∗(t + 1)=w∗(t)+ M X j=1 φj(t)sj(t). Interpretation:

w∗(t0)corresponds to thecommonvalueasymptoticallyachieved by all

processors if computations with descent termshave stoppedafter t0,

(81)

Global time reference

Computation with descent terms

Only communication (model agreement)

(82)

Definition of descent terms.

The Tsitsiklis’s system with the descent termssi,

si(t)= ( −εi tH zit+1,wi(t)  if t ∈ Ti 0 otherwise. Notation:

Ti: time where the vectorwi is updated with descent term. 

zit+1 iid sequences of r.v of law µ. Ft , σ zis, for all s ≤ t and 1 ≤ i ≤ M .

(83)

Definition of descent terms.

The Tsitsiklis’s system with the descent termssi,

si(t)= ( −εi tH zit+1,wi(t)  if t ∈ Ti 0 otherwise. Notation:

Ti: time where the vectorwi is updated with descent term.



zit+1 iid sequences of r.v of law µ. Ft , σ zis, for all s ≤ t and 1 ≤ i ≤ M .

(84)
(85)
(86)

To be continued.

We assume that the sequences εit satisfy the following conditions:

Assumption (Decreasing steps.)

There exist two constants K1, K2such that, for all i and all t ≥ 0,

K1 t ∨ 1 ≤ ε i t ≤ K2 t ∨ 1. Assumption (Non idle.)

PM

(87)

To be continued.

We assume that the sequences εit satisfy the following conditions:

Assumption (Decreasing steps.)

There exist two constants K1, K2such that, for all i and all t ≥ 0,

K1 t ∨ 1 ≤ ε i t ≤ K2 t ∨ 1.

Assumption (Non idle.)

PM

(88)

The recursion formula ofagreement vectorwrites, w∗(t + 1)=w∗(t)+ M X j=1 φj(t)sj(t).

Using the function h,

h(w∗(t)) = E {H (zt+1,w∗(t)) |Ft} , and h(wj(t)) = E n H  zt+1,wj(t)  |Ft o for all j.

(89)

Set, ε∗t , 1 2 M X j=1 1t∈Tjφj(t)εjt, Mt1, 1 2 M X j=1 1t∈Tjφj(t)εjt  h(w∗(t)) −h(wj(t)), and, Mt2, 1 2 M X j=1 φj(t)1t∈Tjεjt  h(wj(t)) −Hzjt+1,wj(t). Therefore, the recursion writes,

(90)

Theorem (Asynchronous G-Lemma) If, 1 ε∗ t −−−−→t→+∞ 0 and P t≥0ε∗t = +∞, 2 the sequences {w∗(t)}+∞ t=0 and {h(w∗(t))} +∞ t=0 are bounded, 3 P+∞ s=0∆Ms1and P+∞ s=0∆Ms2converge in Rd κ ,

4 there exist a l.s.c. map G : Rdκ−→ R

+, s.t., +∞ X t=0 ε∗t+1G (w∗(t)) < +∞, then, lim t→+∞dist (w ∗(t), Ξ ∞) =0.

(91)

Assumption (Trajectories in Gκ.)

P n

(92)

Remark:

Assumption[Trajectories in Gκ]is not restrictive:

ifwj(0)∈ Gκ, and,

if t ∈ Ti, we have a

i,j(t) = δi,j, and τi,i(t) = t,

wi(t + 1)=wi(t) − εit



wi(t)−zit+1

=1 − εitwi(t)+ εitzit+1. If t /∈ Ti, then equation writes,

wi(t + 1)=

M

X

j=1

ai,j(t)wj(τi,j(t)).

(93)

Lemma

Under Assumption[Trajectories in Gκ], then, for all t ≥ 1

kw∗(t) − wi(t)k ≤√κM diam(G)A(K2∨ 1) t−1 X τ =−1 1 τ ∨1ρ t−τ. A > 0. Remark

Under assumptions[Trajectories in Gκ], 1 w∗(t)−wi(t)−−→ 0 as t → +∞,a.s

(94)

Lemma

Under Assumption[Trajectories in Gκ], then, for all t ≥ 1

kw∗(t) − wi(t)k ≤√κM diam(G)A(K2∨ 1) t−1 X τ =−1 1 τ ∨1ρ t−τ. A > 0. Remark

Under assumptions[Trajectories in Gκ],

1 w∗(t)−wi(t)−−→ 0 as t → +∞,a.s

(95)

Proposition

If Assumption[Trajectories in Gκ]hold then,

ε∗t −−−−→ t→+∞ 0 and P t≥0ε ∗ t = +∞. Proposition

If Assumptions[Trajectories in Gκ]hold then the seriesP+∞

t=0∆Ms1and

P+∞

t=0∆Ms2converge in Rd

κ .

(96)

Proposition

If Assumption[Trajectories in Gκ]hold then,

ε∗t −−−−→ t→+∞ 0 and P t≥0ε ∗ t = +∞. Proposition

If Assumptions[Trajectories in Gκ]hold then the seriesP+∞

t=0∆Ms1and

P+∞

t=0∆Ms2converge in Rd

κ .

(97)

Assumption (Parted component assumption.)

1 P {w(t)∈ Dκ

∗} = 1, for all t ≥ 0.

2 Plim inft→+∞dist w(t), {Dκ

∗ = 1.

Proposition

Under assumptions[Trajectories in Gκ], [Parted component assumption] we have,

1 C (w∗(t))−−−−→as

t→+∞ C

∞,where C∞is some random variable.

2

+∞

X

t=0

(98)

Assumption (Parted component assumption.)

1 P {w(t)∈ Dκ

∗} = 1, for all t ≥ 0.

2 Plim inft→+∞dist w(t), {Dκ

∗ = 1.

Proposition

Under assumptions[Trajectories in Gκ], [Parted component

assumption] we have,

1 C (w∗(t))−−−−→as t→+∞ C

∞,where C∞is some random variable.

2

+∞

X

t=0

(99)

The Asynchronous Theorem.

Theorem (Asynchronous Theorem)

If assumptions[Trajectories in Gκ], [Parted component assumption]

hold then,

w∗(t)−−−−→a.s

t→+∞ Ξ∞,

and,

wi(t)−−−−→a.s

t→+∞ Ξ∞, for all processors i.

(100)

Bibliography

Dimitri P. Bertsekas and John N. Tsitsiklis.

Parallel and distributed computation: numerical methods.

Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989. V.D. Blondel, J.M. Hendrickx, A. Olshevsky, and J.N. Tsitsiklis.

Convergence in multiagent coordination, consensus, and flocking.

Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC ’05. 44th IEEE Conference on, pages 2996–3000, Dec. 2005.

Jean-Claude Fort and Gilles Pagès.

Convergence of stochastic algorithms: From the kushner-clark theorem to the lyapounov functional method.

Advances in Applied Probability, 28(4):1072–1094, 1996. Gilles Pagès.

A space vector quantization for numerical integration.

Journal of Applied and Computational Mathematics, 89:1–38, 1997. J. Tsitsiklis, D. Bertsekas, and M. Athans.

Distributed asynchronous deterministic and stochastic gradient optimization algorithms.

References

Related documents