Event driven stochastic approximation


EVENT-DRIVEN STOCHASTIC APPROXIMATION¹

Vivek S. Borkar, Neeraja Sahasrabudhe and M. Ashok Vardhan²

Department of Electrical Engineering, Indian Institute of Technology,

Powai, Mumbai 400 076, India

e-mails:{borkar.vs, neeraja.sigma, ashokevardhan}@gmail.com

(Received 8 June 2015; accepted 26 September 2015)

We consider a Robbins-Monro type iteration wherein noisy measurements are event-driven and therefore arrive asynchronously. We propose a modification of step-sizes that ensures desired asymptotic behaviour regardless of this aspect. This generalizes earlier results on asynchronous stochastic approximation wherein the asynchronous behaviour is across different components, but not along the same component of the vector iteration, as is the case considered here.

Key words : Stochastic approximation; asynchronous computation; distributed algorithms; o.d.e. limit; sampling bias

1. INTRODUCTION

The classical Robbins-Monro stochastic approximation scheme [11] finds zeros of a nonlinear function $h : \mathbb{R}^d \mapsto \mathbb{R}^d$ given its noisy observations. It is given by the $d$-dimensional iteration

$$x(n+1) = x(n) + a(n)\,[h(x(n)) + M(n+1)], \quad n \ge 0, \tag{1}$$

where $\{a(n)\}$ is a positive stepsize sequence satisfying

$$\sum_n a(n) = \infty, \qquad \sum_n a(n)^2 < \infty. \tag{2}$$

¹ Research of VSB supported in part by a J. C. Bose Fellowship and a grant for ‘Approximation of High Dimensional Optimization and Control Problems’ from Department of Science and Technology, Government of India.

While the original analysis of [11] and others used probabilistic tools such as ‘almost super-martingales’, an alternative approach due to [7, 10] that has gained currency subsequently treats (1) as a noisy Euler scheme for the ordinary differential equation (‘o.d.e.’ for short)

$$\dot{x}(t) = h(x(t)), \quad t \ge 0. \tag{3}$$

Thus $\{a(n)\}$ are interpreted as slowly decreasing time steps that cover the entire time axis (because of the first condition in (2)), while going to zero at an appropriate rate (because of the second condition in (2)) so as to suppress the errors due to discretization and noise. One can then show that under reasonable technical conditions, the iterates (1) track the asymptotic behaviour of (3) with probability one. See [4] for a detailed treatment of this approach.
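As a concrete illustration of iteration (1), the following is a minimal Python sketch, not taken from the paper. It assumes a scalar problem, the stepsize $a(n) = 1/(n+1)$, i.i.d. Gaussian noise for $M(n+1)$, and a hypothetical test function `h_example` whose only zero is at $x = 2$.

```python
import random


def robbins_monro(h, x0, num_steps, noise_std=1.0, seed=0):
    """Run the scalar iteration x(n+1) = x(n) + a(n) [h(x(n)) + M(n+1)].

    Assumptions (not from the paper): a(n) = 1/(n+1) and i.i.d. zero-mean
    Gaussian noise M(n+1), one simple martingale-difference choice.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(num_steps):
        a_n = 1.0 / (n + 1)              # sum a(n) = inf, sum a(n)^2 < inf
        noise = rng.gauss(0.0, noise_std)
        x = x + a_n * (h(x) + noise)
    return x


# Hypothetical test function with a single zero at x = 2, which is also the
# globally asymptotically stable equilibrium of dx/dt = h(x).
def h_example(x):
    return -(x - 2.0)


x_final = robbins_monro(h_example, x0=10.0, num_steps=100_000)
print(f"final iterate: {x_final:.3f}")
```

With this particular choice of $h$ and $a(n)$ the iterate reduces to a running average of noisy observations of the zero, which is why the noise is averaged out as $n$ grows.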

In engineering applications, one often has a situation where each component of (1) is computed by a possibly different processor. These processors have their own clocks and communicate with each other with possible delays. In particular, not all components may be updated at each time. Let $Y_n \subset \{1, 2, \dots, d\}$ denote the set of components updated at time $n$. For $1 \le i \le d$, $n \ge 0$, define

$$\nu(i,n) = \sum_{m=0}^{n} I\{i \in Y_m\}, \tag{4}$$

where $I\{\cdots\} = 1$ if ‘$\cdots$’ is true and $= 0$ if not. This is the ‘local clock’ at processor $i$ that counts the number of iterates performed by $i$ till time $n$. Then $i$ knows $\nu(i,n)$, but need not know the ‘global clock’ $n$. In fact the global clock may be a complete artifice, not a constant multiple of a given unit as in a physical clock, as long as causal relationships are respected. Then the $i$th component iterate may be written as

$$x_i(n+1) = x_i(n) + a(\nu(i,n))\, I\{i \in Y_n\} \Big[ h_i\big(x_1(n-\tau_{1i}(n)), \dots, x_d(n-\tau_{di}(n))\big) + M_i(n+1) \Big]. \tag{5}$$

Here $\tau_{ij}(n)$ is the random delay with which $i$’s output was received by $j$ at time $n$. That is, at time $n$, $j$ has access to $x_i(n-\tau_{ij}(n))$ but not $x_i(m)$ for $m > n-\tau_{ij}(n)$. Additional conditions imposed on $\{a(n)\}$ in [3] were:

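• (A1) $a(n+1) \le a(n)$ from some $n$ on.

• (A3) For $\alpha \in (0,1)$, $\sup_n \frac{a(\lceil \alpha n \rceil)}{a(n)} < \infty$.

• (A4) For $x \in (0,1)$ and $A(n) := \sum_{m=0}^{n} a(m)$, $\frac{A(\lceil yn \rceil)}{A(n)} \to 1$ uniformly in $y \in [x,1]$.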
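Conditions (A3) and (A4) can be probed numerically for a candidate stepsize. The sketch below is illustrative only: it assumes $a(n) = 1/n$ with the convention $a(0) = 1$, and the helper names `a3_ratio_sup`, `partial_sums` and `a4_ratio` are hypothetical. For this stepsize $A(n)$ grows like $\log n$, so the ratio in (A4) approaches $1$ only logarithmically slowly.

```python
import math


def a(n):
    """Candidate step size a(n) = 1/n, with a(0) set to 1 by convention."""
    return 1.0 / max(n, 1)


def a3_ratio_sup(alpha, n_max):
    """Largest value of a(ceil(alpha*n)) / a(n) over 1 <= n <= n_max."""
    return max(a(math.ceil(alpha * n)) / a(n) for n in range(1, n_max + 1))


def partial_sums(n_max):
    """Return the list A with A[n] = a(0) + ... + a(n)."""
    sums, total = [], 0.0
    for n in range(n_max + 1):
        total += a(n)
        sums.append(total)
    return sums


A = partial_sums(10**6)


def a4_ratio(y, n):
    """A(ceil(y*n)) / A(n), which (A4) requires to approach 1."""
    return A[math.ceil(y * n)] / A[n]


sup_ratio = a3_ratio_sup(alpha=0.5, n_max=10**4)
ratios = [a4_ratio(0.1, n) for n in (10**2, 10**4, 10**6)]
print(sup_ratio, ratios)
```

A finite scan of this kind cannot prove either condition; it only shows the bounded ratio in (A3) and the slow upward drift of the ratio in (A4) for this one stepsize.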

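Step-sizes that satisfy these conditions include $a(n) = \frac{1}{n}, \frac{1}{n \log n}, \frac{\log n}{n}$, etc., with suitable modification for $n = 0, 1$ as needed. The delays $\tau_{ij}(n) \in \{0, 1, \dots, n\}$ were assumed to satisfy

$$E\Big[ \tau_{ij}(n)^b \,\Big|\, X(m), Y(m), M(m), \tau_{ij}(m),\ i, j \in \{1, \dots, d\},\ m \le n \Big] < \infty \tag{6}$$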

for some $b > \frac{r}{1-r}$ with $r$ as in (A1), and all $n \ge 0$. Suppose

$$\sup_n \|x(n)\| < \infty \quad \text{a.s.} \tag{7}$$

(i.e., the iterates remain bounded with probability one) and (3) has a globally asymptotically stable equilibrium $x^*$. Furthermore, assume that

$$\liminf_{n \to \infty} \frac{\nu(i,n)}{n} > 0 \quad \text{a.s.} \tag{8}$$

That is, all components are updated ‘comparably often’. Then we have:

Theorem 1.1 — $x(n) \to x^*$ a.s.

PROOF: This is a special case of Theorem 3.2 of [3]. As in ibid., by unrolling the $n$th iterate in $|Y_n|$ separate iterates in each of which only one component is updated, we may suppose without loss of generality that $Y_n$ is a singleton $\forall n$. To prove the claim, first note that global asymptotic stability of $x^*$ implies by the converse Lyapunov theorem [9] the existence of a continuously differentiable Lyapunov function $V : \mathbb{R}^d \mapsto \mathbb{R}^+$ such that $\lim_{\|x\| \uparrow \infty} V(x) = \infty$ and $\langle \nabla V(x), h(x) \rangle < 0$ for $x \ne x^*$. Furthermore, in the notation of [3], $a(n,j) = a(n)$ for all $1 \le j \le d$ and thus $\bar{\beta}(i)$ (in the notation of [3]) is $\frac{1}{d}$. Therefore by Theorem 3.2 of [3], $x(n) \to x^*$ a.s. □
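The role of the local clocks in (5) can be seen in a small simulation. The following Python sketch is an illustration under stated assumptions, not the setting of [3]: it takes $d = 2$, no delays, $a(k) = 1/k$, a hypothetical linear $h$ with equilibrium $x^* = (1, -1)$, and a singleton $Y_n$ drawn i.i.d. with probabilities $(0.7, 0.3)$.

```python
import random


def async_iteration(num_steps, update_probs=(0.7, 0.3), noise_std=0.1, seed=1):
    """Iteration (5) without delays, one component updated per global tick.

    Assumptions for illustration only: d = 2, a(k) = 1/k, a linear h with
    equilibrium x* = (1, -1), and Y_n drawn i.i.d. with the given probabilities.
    """
    rng = random.Random(seed)
    x_star = (1.0, -1.0)

    def h(x):
        e0, e1 = x[0] - x_star[0], x[1] - x_star[1]
        return (-e0 + 0.5 * e1, -0.5 * e0 - e1)

    x = [5.0, 5.0]
    local_clock = [0, 0]                      # nu(i, n) for i = 0, 1
    for _ in range(num_steps):
        i = 0 if rng.random() < update_probs[0] else 1   # Y_n = {i}
        local_clock[i] += 1
        step = 1.0 / local_clock[i]           # a(nu(i, n)), not a(n)
        x[i] += step * (h(x)[i] + rng.gauss(0.0, noise_std))
    return x, local_clock


x_async, clocks = async_iteration(200_000)
print(x_async, clocks)
```

Each processor here only needs its own update count, which matches the remark that $i$ knows $\nu(i,n)$ but not the global clock $n$.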

As already observed, this in fact is a very special case of Theorem 3.2 of [3], stated here to motivate the subsequent development. Our main point of departure is to assume that $h(\cdot) = [h_1(\cdot), \dots, h_d(\cdot)]^T$ is of the form $h_i = \sum_{j \in N(i)} h_{ij}$, where $N(i)$ are prescribed nonempty subsets of a prescribed finite set $S$ with $|S| = r$ and the $h_{ij}$’s are Lipschitz. The intuition behind this is: the set $S$ corresponds to ‘sources’ or ‘measurement devices’ that supply processor $i$ with the relevant data in an episodic manner, with possible communication delays. The algorithm then is:

$$x_i(n+1) = x_i(n) + \sum_{j \in N(i)} a(\nu(i,j,n))\, \xi_{ij}(n+1) \Big[ h_{ij}\big(x_1(n-\tau_{1i}(n)), \dots, x_d(n-\tau_{di}(n))\big) + M_i(n+1) \Big], \tag{9}$$

where $\{\xi_{ij}(n)\}$ are independent $\{0,1\}$-valued random variables that are also identically distributed for each choice of $i, j$, with $P(\xi_{ij}(n) = 1) = p_{ij} > 0$. The idea is that the $i$th component receives ‘inputs’ from agents $j \in N(i)$. The latter may not always be active. The random variable $\xi_{ij}(n)$ is $1$ if $i$ did receive an input from $j$ at time $n$, $0$ if not. This justifies the terminology ‘event-driven’: the measurements are episodic. The ‘clocks’ $\{\nu(i,j,n)\}$ are now defined as $\nu(i,j,n) := \sum_{m=0}^{n} \xi_{ij}(m)$ and are seen to satisfy

$$\lim_{n \uparrow \infty} \frac{\nu(i,j,n)}{n} = \lim_{n \uparrow \infty} \frac{\sum_{m=0}^{n} \xi_{ij}(m)}{n} = p_{ij} > 0 \quad \text{a.s.}, \tag{10}$$

which replaces (8). The conditions on the delays $\{\tau_{ij}(n)\}$ will be specified later.
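The strong-law behaviour of the clocks in (10) is easy to check empirically. This Python sketch is a minimal illustration: the dictionary `p` of activation probabilities and the function name `clock_fractions` are hypothetical choices, not values from the paper.

```python
import random


def clock_fractions(p, num_steps, seed=2):
    """Return nu(i, j, n) / n for each source after num_steps ticks.

    `p` maps a hypothetical (i, j) pair to its activation probability p_ij.
    """
    rng = random.Random(seed)
    counts = {pair: 0 for pair in p}
    for _ in range(num_steps):
        for pair, p_ij in p.items():
            if rng.random() < p_ij:       # xi_ij = 1: an input arrived
                counts[pair] += 1
    return {pair: counts[pair] / num_steps for pair in p}


p = {(1, 1): 0.9, (1, 2): 0.1, (2, 2): 0.5}
fractions = clock_fractions(p, num_steps=200_000)
print(fractions)
```

Each empirical fraction settles near the corresponding $p_{ij}$, so every clock grows linearly in $n$, but at a source-dependent rate.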

An example of such a situation is the scheme discussed in [8] for rating ‘experts’ whose opinions are sought on certain outcomes of interest. The algorithm is a stochastic approximation scheme which incrementally updates their ratings based on observed performance. As noted in [8], if one were to use a common step-size, the scheme will put more weight on and therefore favor the experts who opine more frequently. With this in mind, [8] uses the step-size schedule analyzed here. More examples from crowdsourcing and network economics applications are conceivable.
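The effect of the step-size schedule on sampling bias can be sketched in a toy scalar case. The code below is an illustration under assumptions that are not from [8] or from this paper: $d = 1$, two sources with $h_j(x) = c_j - x$, activation probabilities $(0.9, 0.1)$, no delays, $a(k) = 1/k$, and Gaussian noise. With per-source clocks both sources should carry equal weight, so the iterate should approach $(c_1 + c_2)/2$. With a common stepsize $a(n)$ the frequent source dominates and the iterate should approach the frequency-weighted average $0.9\,c_1 + 0.1\,c_2$.

```python
import random


def run_scheme(num_steps, use_local_clocks, p=(0.9, 0.1), c=(0.0, 1.0),
               noise_std=0.1, seed=3):
    """Scalar instance of (9) with two sources and h_j(x) = c[j] - x.

    Illustrative assumptions: d = 1, no delays, a(k) = 1/k, Gaussian noise.
    With use_local_clocks=False every source uses the common step a(n).
    """
    rng = random.Random(seed)
    x = 0.0
    nu = [0, 0]                               # per-source clocks nu(j, n)
    for n in range(1, num_steps + 1):
        noise = rng.gauss(0.0, noise_std)     # M(n+1), shared across sources
        increment = 0.0
        for j in range(2):
            if rng.random() < p[j]:           # xi_j(n+1) = 1
                nu[j] += 1
                step = 1.0 / nu[j] if use_local_clocks else 1.0 / n
                increment += step * (c[j] - x + noise)
        x += increment
    return x


x_event = run_scheme(200_000, use_local_clocks=True)
x_common = run_scheme(200_000, use_local_clocks=False)
print(f"event-driven steps: {x_event:.3f}, common step: {x_common:.3f}")
```

The only difference between the two runs is which counter feeds the stepsize, which is the design choice the scheme above turns on.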

We analyze this scheme next.

2. CONVERGENCE ANALYSIS

For simplicity, we first consider the case $\tau_{ij}(n) \equiv 0$, i.e., there are no communication delays. Assume that (7) holds and define

$$\bar{a}(n) := \max_{i,j} \Big( a(\nu(i,j,n))\, \xi_{ij}(n+1) \Big), \quad n \ge 0,$$

which is $\ge 0$ and is $> 0$ if and only if at least one $\xi_{ij}(n+1)$ equals $1$. Then

$$\sum_n \bar{a}(n) \ge \sum_n a(\nu(i,j,n))\, \xi_{ij}(n+1) \quad \forall\, i, j.$$

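By Theorem 5.28, p. 96, [6], $\sum_n a(\nu(i,j,n))\, \xi_{ij}(n+1)$ and $\sum_n a(\nu(i,j,n))\, p_{ij}$ converge or diverge together a.s. But by (10),

$$\sum_n a(\nu(i,j,n))\, p_{ij} = p_{ij} \sum_n a(\nu(i,j,n)) = \infty \quad \text{a.s.},$$

implying that $\sum_n \bar{a}(n) = \infty$ a.s. Now,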

$$\sum_n \bar{a}(n)^2 \le \sum_n \Big( \sum_{i,j} a(\nu(i,j,n))\, \xi_{ij}(n+1) \Big)^2 \le K \sum_n a(n)^2 < \infty$$

for a suitable $K \in (0, \infty)$. Rewrite (9) as

$$x_i(n+1) = x_i(n) + \bar{a}(n) \Big[ \sum_{j \in N(i)} q(i,j,n)\, [h_{ij}(x(n)) + M_i(n+1)] \Big],$$

where

$$q(i,j,n) := \frac{a(\nu(i,j,n))}{\bar{a}(n)}\, \xi_{ij}(n+1) \in [0,1] \quad \forall n.$$

Next define:

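1. $t(0) = 0$, $t(n) = \sum_{m=0}^{n} \bar{a}(m)$, $n \ge 1$ (the algorithm’s time scale),

2. $\bar{x}(\cdot) : [0,\infty) \mapsto \mathbb{R}^d$ by $\bar{x}(t(n)) = x(n)$ with linear interpolation on $[t(n), t(n+1)]$ $\forall n$ (which makes it continuous and piecewise linear),

3. $\lambda_{ij}(\cdot) : [0,\infty) \mapsto [0,1]$ by $\lambda_{ij}(t) = q(i,j,n)$ $\forall\, i, j, n$, $t \in [t(n), t(n+1))$,

4. $\Lambda(\cdot) := [[\lambda_{ij}(\cdot)]] : [0,\infty) \mapsto \mathbb{R}^{d \times r}$, $\Lambda_i(\cdot) :=$ the $i$th row of $\Lambda(\cdot)$,

5. $H_i(\cdot) := [h_{i1}(\cdot), \dots, h_{ir}(\cdot)]^T : \mathbb{R}^d \mapsto \mathbb{R}^r$ (column vector) where $h_{ij}(\cdot) \equiv 0$ for $j \notin N(i)$,

6. $J_i = [z_{i1}, \dots, z_{ir}] \in \mathbb{R}^r$ (row vector) where $z_{ij} = 1$ if $j \in N(i)$, $0$ otherwise,

7. $x^s(t)$, $t \ge s \ge 0$, the solution to the o.d.e.

$$\dot{x}^s_i(t) = \Lambda_i(t)\, H_i(x^s(t)), \quad t \ge s, \qquad x^s(s) = \bar{x}(s).$$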

Then by standard arguments based on the Gronwall inequality as in, e.g., Chapter 7 of [4], we have:

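Lemma 2.1 — For any $T > 0$, $\lim_{s \uparrow \infty} \sup_{t \in [s, s+T]} \|\bar{x}(t) - x^s(t)\| = 0$ a.s.

This leads to the key result:

Lemma 2.2 — Almost surely, any limit point of $\bar{x}(s + \cdot)$ in $C([0,\infty); \mathbb{R}^d)$ as $s \uparrow \infty$ is a solution of the o.d.e.

$$\dot{x}(t) = \alpha(t)\, h(x(t)), \tag{11}$$

where $\alpha(\cdot) : \mathbb{R}^+ \mapsto \mathbb{R}^+$ satisfies $\Delta \ge \alpha(t) \ge \delta$ $\forall\, t \ge 0$ for suitable $\infty > \Delta > \delta > 0$.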

PROOF: View $\Lambda(\cdot)$ as an element of the space $U$ of measurable maps $u(\cdot) = [[u_{ij}(\cdot)]] : [0,\infty) \mapsto [0,1]^{d \times r}$ with the coarsest topology that renders continuous the maps $u(\cdot) \mapsto \int_0^T g(t)\, u_{ij}(t)\, dt$ $\forall\, T > 0$, $1 \le i \le d$, $1 \le j \le r$, $g(\cdot) \in L_2[0,T]$. Using the Banach-Alaoglu theorem, one can prove that $U$ is compact and metrizable, therefore Polish. From (7) and the Lipschitz condition on $h(\cdot)$, it follows that $\bar{x}(t + \cdot)$, $t \ge 0$, are pointwise bounded and equicontinuous, hence relatively compact in $C([0,\infty); \mathbb{R}^d)$. Let $(\Lambda^*(\cdot) = [[\lambda^*_{ij}(\cdot)]], x^*(\cdot))$ denote a limit point of $(\Lambda(t + \cdot), \bar{x}(t + \cdot))$ in $U \times C([0,\infty); \mathbb{R}^d)$ as $t \uparrow \infty$, along a subsequence, say, $t_n \uparrow \infty$. By Lemma 2.1, for $s > s_0 > 0$,

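$$\bar{x}_i(t_n + s) - \bar{x}_i(t_n + s_0) = \int_{s_0}^{s} \sum_{j \in N(i)} \lambda_{ij}(t_n + y)\, h_{ij}(\bar{x}(t_n + y))\, dy + o(1).$$

Letting $n \uparrow \infty$, we have

$$x^*_i(s) - x^*_i(s_0) = \int_{s_0}^{s} \sum_{j \in N(i)} \lambda^*_{ij}(y)\, h_{ij}(x^*(y))\, dy.$$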

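Now for $j \notin N(i)$, $\lambda_{ij}(\cdot) \equiv 0 \Longrightarrow \lambda^*_{ij}(\cdot) \equiv 0$. Define

$$N(n,s) := \min\Big\{ m > n : \sum_{k=n+1}^{m} \bar{a}(k) > s \Big\}.$$

Let $[t]$ denote the unique integer such that $[t] \le t < [t] + 1$. Then for $j \in N(i)$, $\ell \in N(k)$ and $t, s > 0$,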

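$$\frac{\int_t^{t+s} \lambda^*_{ij}(y)\, dy}{\int_t^{t+s} \lambda^*_{k\ell}(y)\, dy} = \lim_{n \uparrow \infty} \frac{\int_t^{t+s} \lambda_{ij}(t_n + y)\, dy}{\int_t^{t+s} \lambda_{k\ell}(t_n + y)\, dy} = \lim_{n \uparrow \infty} \frac{\sum_{m=[t_n+t]}^{N([t_n+t],s)} \frac{a(\nu(i,j,m))\, \xi_{ij}(m)}{\bar{a}(m)}\, \bar{a}(m)}{\sum_{m=[t_n+t]}^{N([t_n+t],s)} \frac{a(\nu(k,\ell,m))\, \xi_{k\ell}(m)}{\bar{a}(m)}\, \bar{a}(m)} = \lim_{n \uparrow \infty} \frac{\sum_{m=\nu(i,j,[t_n+t])}^{\nu(i,j,N([t_n+t],s))} a(m)}{\sum_{m=\nu(k,\ell,[t_n+t])}^{\nu(k,\ell,N([t_n+t],s))} a(m)} = 1 \tag{12}$$

a.s. by (A4) and (8). Thus for $x > 0$,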

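$$\lim_{t \uparrow \infty} \frac{\int_0^x \int_0^t \lambda^*_{ij}(s+y)\, ds\, dy}{\int_0^x \int_0^t \lambda^*_{k\ell}(s+y)\, ds\, dy} = \lim_{t \uparrow \infty} \frac{\int_0^t \int_0^x \lambda^*_{ij}(s+y)\, dy\, ds}{\int_0^t \int_0^x \lambda^*_{k\ell}(s+y)\, dy\, ds} = \lim_{t \uparrow \infty} \frac{\int_0^t \left[ \left( \frac{\int_0^x \lambda^*_{ij}(s+y)\, dy}{\int_0^x \lambda^*_{k\ell}(s+y)\, dy} \right) \int_0^x \lambda^*_{k\ell}(s+y)\, dy \right] ds}{\int_0^t \int_0^x \lambda^*_{k\ell}(s+y)\, dy\, ds}.$$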

By l’Hôpital’s rule,

$$\lim_{t \uparrow \infty} \frac{\int_0^x \lambda^*_{ij}(t+y)\, dy}{\int_0^x \lambda^*_{k\ell}(t+y)\, dy} = 1 \quad \text{a.s.}$$

Every limit point satisfies the above equation for $t \to \infty$. Since $x > 0$ was arbitrary, we conclude that for $t \ge 0$,

$$\int_t^{t+s} \lambda^*_{ij}(y)\, dy = \int_t^{t+s} \lambda^*_{k\ell}(y)\, dy \quad \forall\, i, j, k, \ell,$$

for all $t, s > 0$, implying by Lebesgue’s theorem that $\lambda^*_{ij}(\cdot) = \alpha(\cdot)$ (say) a.e. for some $\alpha(\cdot) \ge 0$ independent of $i, j$. We drop the qualification ‘a.e.’ by choosing a suitable version. It is easily verified from the definition of $\Lambda(\cdot)$ that

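$$\Delta := \sum_{i=1}^{d} |N(i)| \ge \sum_{i,\, j \in N(i)} \lambda_{ij}(t) \ge 1,$$

leading to

$$\Delta \ge \sum_{i,\, j \in N(i)} \lambda^*_{ij}(t) = \Big( \sum_{i=1}^{d} |N(i)| \Big)\, \alpha(t) \ge 1.$$

Hence $\infty > \Delta \ge \alpha(t) \ge \delta$ for a suitable $\delta > 0$. This completes the proof. □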
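Lemma 2.2 yields an o.d.e. whose right-hand side is $h$ multiplied by a scalar $\alpha(t)$ bounded between $\delta$ and $\Delta$. A quick numerical sketch suggests why such a bounded scalar factor only changes the speed, not the destination, of the trajectory. The choices below are assumptions made for illustration: a scalar $h$ with equilibrium $2$, the factor $\alpha(t) = 1 + 0.5 \sin t$, and a forward-Euler integrator.

```python
import math


def euler(rhs, x0, t_end, dt=1e-3):
    """Forward-Euler integration of dx/dt = rhs(t, x) on [0, t_end]."""
    x, t = x0, 0.0
    for _ in range(int(round(t_end / dt))):
        x += dt * rhs(t, x)
        t += dt
    return x


def h(x):
    return -(x - 2.0)                # hypothetical h with equilibrium x* = 2


def alpha(t):
    return 1.0 + 0.5 * math.sin(t)   # bounded: 0.5 <= alpha(t) <= 1.5


x_plain = euler(lambda t, x: h(x), x0=10.0, t_end=30.0)
x_scaled = euler(lambda t, x: alpha(t) * h(x), x0=10.0, t_end=30.0)
print(x_plain, x_scaled)
```

Both integrations end at essentially the same point, the equilibrium of $\dot{x} = h(x)$.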

This brings us to our main result. Say that a set $A$ is an internally chain transitive invariant set for (3) if for every $x \in A$, the trajectory $x(t)$ of (3) with $x(0) = x$ remains in $A$ for all $t \in \mathbb{R}$ and for any $x, y \in A$ and $T, \epsilon > 0$, there exist $n \ge 1$, $x_0 = x, x_1, \dots, x_{n-1}, x_n = y$ such that the trajectory of (3) initiated at $x_i$, $0 \le i < n$, intersects with the open $\epsilon$-ball centered at $x_{i+1}$ at some $t \ge T$. We then have the following extension of the celebrated result of Benaim for the classical Robbins-Monro scheme.

Theorem 2.1 — Almost surely, $x(n) \to$ a nonempty compact connected internally chain transitive invariant set of (3). In particular, if (3) has a unique globally asymptotically stable attractor $C$, then $C$ is the only such set and $x(n) \to C$ a.s.

PROOF: Let $\gamma(t) := \int_0^t \alpha(s)\, ds$ and $\tilde{x}(t) := x(\gamma(t))$ where $x(\cdot)$ satisfies (3). Then

$$\dot{\tilde{x}}(t) = \alpha(t)\, h(\tilde{x}(t)).$$
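The time change can be checked with the chain rule. The following short derivation is a sketch which assumes that $\alpha(\cdot)$ is measurable and bounded, so that $\gamma$ is absolutely continuous with $\dot{\gamma}(t) = \alpha(t)$ for almost every $t$:

```latex
\[
\begin{aligned}
\dot{\tilde{x}}(t)
  &= \frac{d}{dt}\, x(\gamma(t))
   = \dot{x}(\gamma(t))\,\dot{\gamma}(t) \\
  &= h\bigl(x(\gamma(t))\bigr)\,\alpha(t)
   = \alpha(t)\, h(\tilde{x}(t)).
\end{aligned}
\]
```

Since $\alpha(t) \ge \delta > 0$, the map $\gamma$ is strictly increasing with $\gamma(t) \uparrow \infty$, so $\tilde{x}(\cdot)$ traces the same path as $x(\cdot)$ at a different speed.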

It is easy to see that if the right hand side of (12) were something other than $1$, say $\kappa_{ijk\ell} > 0$, then we would have to replace (3) in the above statement by

$$\dot{x}_i(t) = \sum_{j \in N(i)} \alpha_{ij}(t)\, h_{ij}(x(t)),$$

where the $\alpha_{ij}(\cdot)$’s reflect the sampling bias. This in general would have a different asymptotic behavior than (3).

Recall also that we have ignored delays. We shall replace the condition (6) of [3] by the following conditions from [4], Chapter 7. These are more intuitive and allow for a much easier analysis:

• (A5) $\frac{\tau_{ij}(n)}{n} \to 0$ a.s.

• (A6) $\{\tau_{ij}(n)\}$ are stochastically dominated by a random variable $\tau$ satisfying $E\big[\tau^{1/\eta}\big] < \infty$ for some $\eta > 0$ such that $a(n) = o(n^{-\eta})$.

Both are very reasonable assumptions. The first does not allow for arbitrarily large delays whereas the second ensures that the delay distributions have uniformly well-behaved tails in a certain sense. In fact, if $\{\tau_{ij}(n), n \ge 0\}$ are identically distributed for some $i, j$, and $\eta = 1$, then $E[\tau_{ij}(k)] = \sum_n P(\tau_{ij}(k) \ge n) = \sum_n P(\tau_{ij}(n) \ge n) < \infty$, and a simple application of the Borel-Cantelli lemma shows that (A6) $\Longrightarrow$ (A5).
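The Borel-Cantelli step can be illustrated by simulation. The sketch below assumes, purely for illustration, i.i.d. geometric delays on $\{0, 1, \dots\}$ with mean $5$ (so that $E[\tau] < \infty$, the $\eta = 1$ case), and it does not impose the constraint $\tau(n) \le n$. It records the times at which $\tau(n) \ge n$ and the largest value of $\tau(n)/n$ over the second half of the run.

```python
import random


def simulate_delays(num_steps, mean_delay=5.0, seed=4):
    """Draw i.i.d. geometric delays and track tau(n)/n and {tau(n) >= n}.

    Assumption for illustration: tau(n) is geometric on {0, 1, ...} with the
    given mean, so E[tau] < infinity (the eta = 1 case discussed above).
    """
    rng = random.Random(seed)
    success = 1.0 / (1.0 + mean_delay)
    large_delay_times = []          # times n >= 1 at which tau(n) >= n
    max_tail_ratio = 0.0            # max of tau(n)/n over the second half
    for n in range(1, num_steps + 1):
        tau = 0
        while rng.random() >= success:    # count failures before a success
            tau += 1
        if tau >= n:
            large_delay_times.append(n)
        if n > num_steps // 2:
            max_tail_ratio = max(max_tail_ratio, tau / n)
    return large_delay_times, max_tail_ratio


events, tail_ratio = simulate_delays(100_000)
print(len(events), max(events, default=0), tail_ratio)
```

The events $\{\tau(n) \ge n\}$ occur only at small $n$, and the ratio $\tau(n)/n$ is already tiny late in the run, consistent with (A5).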

Theorem 2.1 — Theorem 2.1 continues to hold for random delays satisfying (A5)-(A6).

PROOF: This follows by the arguments of pp. 82-84, [4]. □

Finally, we shall state without proof a sufficient condition for (7) along the lines of [5], [2] (see also Theorem 7, pp. 26-27, [4]). The proof goes along the same lines as Theorem 7, pp. 26-27, [4].

Theorem 2.2 — Suppose $h_\infty(x) := \lim_{c \uparrow \infty} \frac{h(cx)}{c}$ is well defined and the o.d.e.

$$\dot{x}(t) = h_\infty(x(t))$$

has the origin as its unique globally asymptotically stable equilibrium. Then (7) holds.
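The scaling limit $h_\infty$ can be approximated numerically for a given $h$. The sketch below uses a hypothetical scalar drift, a linear term plus a bounded perturbation, for which the limit works out to $h_\infty(x) = -x$. The function names and the single large value of $c$ are illustrative choices.

```python
import math


def h(x):
    """Hypothetical scalar drift: linear part plus a bounded perturbation."""
    return -x + 3.0 * math.sin(x) + 1.0


def h_infinity(x, c=1e8):
    """Approximate h_inf(x) = lim_{c -> inf} h(c x) / c with one large c."""
    return h(c * x) / c


for x in (-2.0, 0.5, 3.0):
    print(x, h_infinity(x))
```

For this example the o.d.e. $\dot{x} = -x$ has the origin as its globally asymptotically stable equilibrium, so the stability condition of the theorem is met, although $h$ itself has a different equilibrium.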

REFERENCES

2. S. Bhatnagar, The Borkar-Meyn theorem for asynchronous stochastic approximation, Systems & Control Letters, 60 (2011), 472-478.

3. V. S. Borkar, Asynchronous stochastic approximation, SIAM Journal on Control and Optimization, 36 (1998), 840-851 (Correction note in: ‘Erratum: Asynchronous Stochastic Approximation’, SIAM Journal on Control and Optimization, 38 (2000), 662-663).

4. V. S. Borkar, Stochastic approximation: A dynamical systems viewpoint, Hindustan Publishing Agency, New Delhi, and Cambridge University Press, Cambridge, UK, 2008.

5. V. S. Borkar and S. P. Meyn, The O.D.E. method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), 447-469.

6. L. Breiman, Probability, Addison-Wesley, Reading, Mass., 1968.

7. D. P. Derevitskii and A. I. Fradkov, Two models for analysing the dynamics of adaptation algorithms, Automation and Remote Control, 35 (1974), 59-67.

8. R. Dwivedi and V. S. Borkar, Removing sampling bias in networked stochastic approximation, Proceedings of International Conference on Signal Processing and Communications (SPCOM), July 22-24, 2014, Bangalore.

9. N. N. Krasovskii, Stability of motion, Stanford University Press, Stanford, CA, 1963.

11. H. Robbins and J. Monro, A stochastic approximation method, Annals of Mathematical Statistics, 22 (1951), 400-407.