EVENT-DRIVEN STOCHASTIC APPROXIMATION$^{1}$
Vivek S. Borkar, Neeraja Sahasrabudhe and M. Ashok Vardhan$^{2}$
Department of Electrical Engineering, Indian Institute of Technology,
Powai, Mumbai 400 076, India
e-mails:{borkar.vs, neeraja.sigma, ashokevardhan}@gmail.com
(Received 8 June 2015; accepted 26 September 2015)
We consider a Robbins-Monro type iteration wherein noisy measurements are event-driven and therefore arrive asynchronously. We propose a modification of step-sizes that ensures desired asymptotic behaviour regardless of this aspect. This generalizes earlier results on asynchronous stochastic approximation wherein the asynchronous behaviour is across different components, but not along the same component of the vector iteration, as is the case considered here.
Key words : Stochastic approximation; asynchronous computation; distributed algorithms; o.d.e. limit; sampling bias
1. INTRODUCTION
The classical Robbins-Monro stochastic approximation scheme [11] finds zeros of a nonlinear function $h : \mathbb{R}^d \to \mathbb{R}^d$ given its noisy observations. It is given by the $d$-dimensional iteration
$$x(n+1) = x(n) + a(n)\,[h(x(n)) + M(n+1)], \quad n \geq 0, \tag{1}$$
where $\{a(n)\}$ is a positive stepsize sequence satisfying
$$\sum_n a(n) = \infty, \qquad \sum_n a(n)^2 < \infty. \tag{2}$$
$^{1}$ Research of VSB supported in part by a J. C. Bose Fellowship and a grant for ‘Approximation of High Dimensional Optimization and Control Problems’ from Department of Science and Technology, Government of India.
While the original analysis of [11] and others used probabilistic tools such as ‘almost super-martingales’, an alternative approach due to [7, 10] that has gained currency subsequently treats (1) as a noisy Euler scheme for the ordinary differential equation (‘o.d.e.’ for short)
$$\dot{x}(t) = h(x(t)), \quad t \geq 0. \tag{3}$$
Thus $\{a(n)\}$ are interpreted as slowly decreasing time steps that cover the entire time axis (because of the first condition in (2)), while going to zero at an appropriate rate (because of the second condition in (2)) so as to suppress the errors due to discretization and noise. One can then show that under reasonable technical conditions, the iterates (1) track the asymptotic behaviour of (3) with probability one. See [4] for a detailed treatment of this approach.
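The following is a minimal sketch of iteration (1), not taken from the source. It assumes the step-size $a(n) = 1/(n+1)$, i.i.d. Gaussian noise standing in for $M(n+1)$, and a hypothetical linear field $h(x) = x^* - x$ whose o.d.e. (3) has $x^*$ as its globally asymptotically stable equilibrium. The function name `robbins_monro` and all parameters are illustrative.

```python
import numpy as np

def robbins_monro(h, x0, n_steps, noise_std=1.0, seed=0):
    """Run x(n+1) = x(n) + a(n) [h(x(n)) + M(n+1)] with a(n) = 1/(n+1)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for n in range(n_steps):
        a_n = 1.0 / (n + 1)                               # satisfies both conditions in (2)
        noise = noise_std * rng.standard_normal(x.shape)  # stands in for M(n+1)
        x = x + a_n * (h(x) + noise)
    return x

# Hypothetical example: h(x) = x_star - x, so (3) is dx/dt = x_star - x.
x_star = np.array([1.0, -2.0])
x_final = robbins_monro(lambda x: x_star - x, x0=np.zeros(2), n_steps=20000)
print(x_final)
```

For this particular $h$ and step-size the iterate is just a running average of the noisy observations $x^* + M(n+1)$, which makes the averaging effect of the decreasing steps easy to see.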
In engineering applications, one often has a situation where each component of (1) is computed by a possibly different processor. These processors have their own clocks and communicate with each other with possible delays. In particular, not all components may be updated at each time. Let $Y_n \subset \{1, 2, \cdots, d\}$ denote the set of components updated at time $n$. For $1 \leq i \leq d$, $n \geq 0$, define
$$\nu(i, n) = \sum_{m=0}^{n} I\{i \in Y_m\}, \tag{4}$$
for $I\{\cdots\} = 1$ if ‘$\cdots$’ is true, $= 0$ if not. This is the ‘local clock’ at processor $i$ that counts the number of iterates performed by $i$ till time $n$. Then $i$ knows $\nu(i, n)$, but need not know the ‘global clock’ $n$. In fact the global clock may be a complete artifice, not a constant multiple of a given unit as in a physical clock, as long as causal relationships are respected. Then the $i$th component iterate may be written as
$$x_i(n+1) = x_i(n) + a(\nu(i, n))\, I\{i \in Y_n\} \Big[ h_i\big(x_1(n - \tau_{1i}(n)), \cdots, x_d(n - \tau_{di}(n))\big) + M_i(n+1) \Big]. \tag{5}$$
Here $\tau_{ij}(n)$ is the random delay with which $i$’s output was received by $j$ at time $n$. That is, at time $n$, $j$ has access to $x_i(n - \tau_{ij}(n))$ but not $x_i(m)$ for $m > n - \tau_{ij}(n)$. Additional conditions imposed on $\{a(n)\}$ in [3] were:
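• (A1) $a(n+1) \leq a(n)$ from some $n$ on.
• (A3) For $\alpha \in (0, 1)$, $\sup_n \frac{a(\lceil \alpha n \rceil)}{a(n)} < \infty$.
• (A4) For $x \in (0, 1)$ and $A(n) := \sum_{m=0}^{n} a(m)$, $\frac{A(\lceil yn \rceil)}{A(n)} \to 1$ uniformly in $y \in [x, 1]$.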
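As a numerical sanity check on (A3) and (A4), the sketch below evaluates both ratios for the harmonic step-size $a(n) = 1/(n+1)$. The step-size, the values $\alpha = y = 0.5$ and all variable names are illustrative assumptions; a finite-$n$ computation can only suggest the limiting behaviour, not prove it.

```python
import numpy as np

def a(n):
    """Illustrative step-size a(n) = 1/(n+1), defined for n >= 0."""
    return 1.0 / (np.asarray(n, dtype=float) + 1.0)

N = 10**6
n = np.arange(1, N + 1)
A = np.cumsum(a(np.arange(0, N + 1)))        # A[k] = sum_{m=0}^{k} a(m)

# (A3): a(ceil(alpha n)) / a(n) should stay bounded in n.
alpha = 0.5
ratio_A3 = a(np.ceil(alpha * n)) / a(n)
print("sup of (A3) ratio:", ratio_A3.max())

# (A4): A(ceil(y n)) / A(n) should approach 1 (slowly, since A(n) grows like log n).
y = 0.5
idx = np.ceil(y * n).astype(int)
ratio_A4 = A[idx] / A[n]
samples = ratio_A4[[99, 9999, N - 1]]        # n = 10^2, 10^4, 10^6
print("(A4) ratio at n = 1e2, 1e4, 1e6:", samples)
```

The (A3) ratio stays below $1/\alpha$ here, while the (A4) ratio creeps towards 1 only logarithmically, which is why it is still visibly below 1 at $n = 10^6$.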
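Step-sizes that satisfy these conditions are $a(n) = \frac{1}{n}, \frac{1}{n \log n}, \frac{\log n}{n}$, etc., with suitable modification for $n = 0, 1$ as needed. The delays $\tau_{ij}(n) \in \{0, 1, \cdots, n\}$ were assumed to satisfy
$$E\Big[ \tau_{ij}(n)^b \,\Big|\, X(m), Y(m), M(m), \tau_{ij}(m),\ i, j \in \{1, \cdots, d\},\ m \leq n \Big] < \infty \tag{6}$$
for some $b > \frac{r}{1-r}$ with $r$ as in (A1), and all $n \geq 0$. Suppose
$$\sup_n \|x(n)\| < \infty \quad \text{a.s.} \tag{7}$$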
(i.e., the iterates remain bounded with probability one) and (3) has a globally asymptotically stable equilibrium $x^*$. Furthermore, assume that
$$\liminf_{n \to \infty} \frac{\nu(i, n)}{n} > 0 \quad \text{a.s.} \tag{8}$$
That is, all components are updated ‘comparably often’. Then we have:
Theorem 1.1 — $x(n) \to x^*$ a.s.
PROOF: This is a special case of Theorem 3.2 of [3]. As in ibid., by unrolling the $n$th iterate into $|Y_n|$ separate iterates in each of which only one component is updated, we may suppose without loss of generality that $Y_n$ is a singleton $\forall n$. To prove the claim, first note that global asymptotic stability of $x^*$ implies by the converse Lyapunov theorem [9] the existence of a continuously differentiable Lyapunov function $V : \mathbb{R}^d \to \mathbb{R}^+$ such that $\lim_{\|x\| \uparrow \infty} V(x) = \infty$ and $\langle \nabla V(x), h(x) \rangle < 0$ for $x \neq x^*$. Furthermore, in the notation of [3], $a(n, j) = a(n)$ for all $1 \leq j \leq d$ and thus $\bar{\beta}(i)$ (in the notation of [3]) is $\frac{1}{d}$. Therefore by Theorem 3.2 of [3], $x(n) \to x^*$ a.s. □
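A minimal simulation of the asynchronous iteration (5) without delays may help fix ideas. It is a sketch under stated assumptions, not the construction of [3]: exactly one component is updated at each time ($Y_n$ a singleton chosen i.i.d. with hypothetical probabilities `probs`), the step-size is $a(k) = 1/k$ evaluated at the local clock, the noise is Gaussian, and $h(x) = B(x^* - x)$ is an illustrative linear field for which $x^*$ is globally asymptotically stable.

```python
import numpy as np

def async_sa(h, d, probs, n_steps, noise_std=0.5, seed=0):
    """Asynchronous update (5) with zero delays and one component per step."""
    rng = np.random.default_rng(seed)
    updates = rng.choice(d, size=n_steps, p=probs)   # Y_n = {updates[n]}
    noise = noise_std * rng.standard_normal(n_steps)
    x = np.zeros(d)
    nu = np.zeros(d, dtype=int)                      # local clocks nu(i, n) as in (4)
    for n in range(n_steps):
        i = updates[n]
        nu[i] += 1                                   # count this update of component i
        x[i] += (1.0 / nu[i]) * (h(x)[i] + noise[n]) # step a(nu(i, n)) with a(k) = 1/k
    return x, nu

# Hypothetical field: the eigenvalues of -B have negative real parts.
B = np.array([[1.0, 0.5], [-0.5, 1.0]])
x_star = np.array([1.0, -1.0])
h = lambda x: B @ (x_star - x)

n_steps = 100000
x_final, nu = async_sa(h, d=2, probs=[0.8, 0.2], n_steps=n_steps)
print(x_final, nu / n_steps)
```

Component 2 is updated four times less often than component 1, yet both settle near $x^*$, and the empirical update frequencies `nu / n_steps` stay bounded away from zero as (8) requires.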
As already observed, this in fact is a very special case of Theorem 3.2 of [3], stated here to motivate the subsequent development. Our main point of departure is to assume that $h(\cdot) = [h_1(\cdot), \cdots, h_d(\cdot)]^T$ is of the form $h_i = \sum_{j \in \mathcal{N}(i)} h_{ij}$, where $\mathcal{N}(i)$ are prescribed nonempty subsets of a prescribed finite set $S$ with $|S| = r$ and the $h_{ij}$’s are Lipschitz. The intuition behind this is: the set $S$ corresponds to ‘sources’ or ‘measurement devices’ that supply processor $i$ with the relevant data in an episodic manner, with possible communication delays. The algorithm then is:
$$x_i(n+1) = x_i(n) + \sum_{j \in \mathcal{N}(i)} a(\nu(i, j, n))\, \xi_{ij}(n+1) \Big[ h_{ij}\big(x_1(n - \tau_{1i}(n)), \cdots, x_d(n - \tau_{di}(n))\big) + M_i(n+1) \Big], \tag{9}$$
where $\{\xi_{ij}(n)\}$ are independent $\{0, 1\}$-valued random variables that are also identically distributed for each choice of $i, j$, with $P(\xi_{ij}(n) = 1) = p_{ij} > 0$. The idea is that the $i$th component receives ‘inputs’ from agents $j \in \mathcal{N}(i)$. The latter may not always be active. The random variable $\xi_{ij}(n)$ is 1 if $i$ did receive an input from $j$ at time $n$, 0 if not. This justifies the terminology ‘event-driven’: the measurements are episodic. The ‘clocks’ $\{\nu(i, j, n)\}$ are now defined as $\nu(i, j, n) := \sum_{m=0}^{n} \xi_{ij}(m)$ and are seen to satisfy
$$\lim_{n \uparrow \infty} \frac{\nu(i, j, n)}{n} = \lim_{n \uparrow \infty} \frac{\sum_{m=0}^{n} \xi_{ij}(m)}{n} = p_{ij} > 0 \quad \text{a.s.}, \tag{10}$$
which replaces (8). The conditions on the delays $\{\tau_{ij}(n)\}$ will be specified later.
An example of such a situation is the scheme discussed in [8] for rating ‘experts’ whose opinions are sought on certain outcomes of interest. The algorithm is a stochastic approximation scheme which incrementally updates their ratings based on observed performance. As noted in [8], if one were to use a common step-size, the scheme will put more weight on and therefore favor the experts who opine more frequently. With this in mind, [8] uses the step-size schedule analyzed here. More examples from crowdsourcing and network economics applications are conceivable.
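The effect of the step-size schedule can be seen in a toy simulation. The sketch below is an illustration under stated assumptions, not the scheme of [8]: a single component ($d = 1$) receives inputs from two hypothetical sources with $h_{11}(x) = 1 - x$ and $h_{12}(x) = -1 - x$, so that $h = h_{11} + h_{12}$ vanishes at $x = 0$. The sources report with probabilities $p = (0.9, 0.1)$, the step-size is $a(k) = 1/k$, and each clock is incremented before use so that $a$ is never evaluated at 0. The function name `run_scheme` is illustrative.

```python
import numpy as np

def run_scheme(p, n_steps, use_local_clocks, noise_std=0.1, seed=0):
    """Event-driven update for d = 1 with two sources and zero delays."""
    rng = np.random.default_rng(seed)
    targets = np.array([1.0, -1.0])                  # h_{1j}(x) = targets[j] - x
    xi = (rng.random((n_steps, 2)) < np.asarray(p)).astype(float)  # arrival indicators
    noise = noise_std * rng.standard_normal(n_steps)
    x, nu = 0.0, np.zeros(2)
    for n in range(n_steps):
        nu += xi[n]                                  # per-source clocks
        if use_local_clocks:
            steps = 1.0 / np.maximum(nu, 1.0)        # step indexed by the source's own clock
        else:
            steps = np.full(2, 1.0 / (n + 1))        # common step indexed by global time
        x += float(np.sum(steps * xi[n] * (targets - x + noise[n])))
    return x

p = [0.9, 0.1]
x_event = run_scheme(p, 50000, use_local_clocks=True)
x_common = run_scheme(p, 50000, use_local_clocks=False)
print(x_event, x_common)
```

With the common step-size the mean drift is $p_1(1 - x) + p_2(-1 - x)$, which vanishes at $(p_1 - p_2)/(p_1 + p_2) = 0.8$, so the frequent source dominates; with per-source clocks the iterate settles near the zero of $h$ instead.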
We analyze this scheme next.
2. CONVERGENCE ANALYSIS
For simplicity, we first consider the case $\tau_{ij}(n) \equiv 0$, i.e., there are no communication delays. Assume that (7) holds and define
$$\bar{a}(n) := \max_{i,j} \big( a(\nu(i, j, n))\, \xi_{ij}(n+1) \big), \quad n \geq 0,$$
which is $\geq 0$ and is $> 0$ if and only if at least one $\xi_{ij}(n+1)$ equals 1. Then
$$\sum_n \bar{a}(n) \geq \sum_n a(\nu(i, j, n))\, \xi_{ij}(n+1) \quad \forall i, j.$$
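By Theorem 5.28, p. 96, [6], $\sum_n a(\nu(i, j, n))\, \xi_{ij}(n+1)$ and $\sum_n a(\nu(i, j, n))\, p_{ij}$ converge or diverge together a.s. But by (10),
$$\sum_n a(\nu(i, j, n))\, p_{ij} = p_{ij} \sum_n a(\nu(i, j, n)) = \infty \quad \text{a.s.},$$
implying that $\sum_n \bar{a}(n) = \infty$ a.s. Now,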
$$\sum_n \bar{a}(n)^2 \leq \sum_n \Big( \sum_{i,j} a(\nu(i, j, n))\, \xi_{ij}(n+1) \Big)^2 \leq K \sum_n a(n)^2 < \infty$$
for a suitable $K \in (0, \infty)$. Rewrite (9) as
$$x_i(n+1) = x_i(n) + \bar{a}(n) \sum_{j \in \mathcal{N}(i)} q(i, j, n)\, [h_{ij}(x(n)) + M_i(n+1)],$$
where
$$q(i, j, n) := \frac{a(\nu(i, j, n))}{\bar{a}(n)}\, \xi_{ij}(n+1) \in [0, 1] \quad \forall n.$$
Next define:
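1. $t(0) = 0$, $t(n) = \sum_{m=0}^{n} \bar{a}(m)$, $n \geq 1$ (the algorithm’s time scale),
2. $\bar{x}(\cdot) : [0, \infty) \to \mathbb{R}^d$ by $\bar{x}(t(n)) = x(n)$ with linear interpolation on $[t(n), t(n+1)]$ $\forall n$ (which makes it continuous and piecewise linear),
3. $\lambda_{ij}(\cdot) : [0, \infty) \to [0, 1]$ by $\lambda_{ij}(t) = q(i, j, n)$ $\forall i, j, n$, $t \in [t(n), t(n+1))$,
4. $\Lambda(\cdot) := [[\lambda_{ij}(\cdot)]] : [0, \infty) \to \mathbb{R}^{d \times r}$, $\Lambda_i(\cdot) :=$ the $i$th row of $\Lambda(\cdot)$,
5. $H_i(\cdot) := [h_{i1}(\cdot), \cdots, h_{ir}(\cdot)]^T : \mathbb{R}^d \to \mathbb{R}^r$ (column vector) where $h_{ij}(\cdot) \equiv 0$ for $j \notin \mathcal{N}(i)$,
6. $J_i = [z_{i1}, \cdots, z_{ir}] \in \mathbb{R}^r$ (row vector) where $z_{ij} = 1$ if $j \in \mathcal{N}(i)$, 0 otherwise,
7. $x^s(t)$, $t \geq s \geq 0$, the solution to the o.d.e.
$$\dot{x}^s_i(t) = \Lambda_i(t) H_i(x^s(t)), \quad t \geq s, \qquad x^s(s) = \bar{x}(s).$$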
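The quantities $\bar{a}(n)$ and $q(i, j, n)$ can be computed directly from simulated arrival indicators. The sketch below does this for a single component with two hypothetical sources, assuming $a(k) = 1/k$ with the modification $a(0) := 1$. It also compares the step-size mass $\sum_n a(\nu(i, j, n))\, \xi_{ij}(n+1)$ that each source accumulates over a late window of iterations; the arrival probabilities, window and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2])                                # hypothetical arrival probabilities
n_steps = 200000
xi = (rng.random((n_steps + 1, 2)) < p).astype(float)   # xi[m, j] plays the role of xi_{1j}(m)
nu = np.cumsum(xi, axis=0)                              # nu[n, j] = sum_{m=0}^{n} xi[m, j]

def a(k):
    """Illustrative step-size a(k) = 1/k, with a(0) := 1."""
    return 1.0 / np.maximum(k, 1.0)

w = a(nu[:-1]) * xi[1:]                                 # a(nu(1, j, n)) * xi_{1j}(n+1)
a_bar = w.max(axis=1)                                   # max over sources
# q = w / a_bar where a_bar > 0, and 0 where no source reported.
q = np.divide(w, a_bar[:, None], out=np.zeros_like(w), where=a_bar[:, None] > 0)

window = slice(n_steps // 2, n_steps)                   # late iterations only
mass = w[window].sum(axis=0)                            # accumulated step-size per source
print(mass, mass[0] / mass[1])
```

Although the first source reports 3.5 times as often, both sources accumulate nearly the same mass over the window (about $\log 2$ each in this setup), because each one steps along its own clock.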
Then by standard arguments based on the Gronwall inequality as in, e.g., Chapter 7 of [4], we have:
Lemma 2.1 — For any $T > 0$, $\lim_{s \uparrow \infty} \sup_{t \in [s, s+T]} \|\bar{x}(t) - x^s(t)\| = 0$ a.s.
This leads to the key result:
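Lemma 2.2 — Almost surely, any limit point of $\bar{x}(s + \cdot)$ in $C([0, \infty); \mathbb{R}^d)$ as $s \uparrow \infty$ is a solution of the o.d.e.
$$\dot{x}(t) = \alpha(t)\, h(x(t)), \tag{11}$$
where $\alpha(\cdot) : \mathbb{R}^+ \to \mathbb{R}^+$ satisfies $\Delta \geq \alpha(t) \geq \delta$ $\forall t \geq 0$ for suitable $\infty > \Delta > \delta > 0$.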
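PROOF: View $\Lambda(\cdot)$ as an element of the space $U$ of measurable maps $u(\cdot) = [[u_{ij}(\cdot)]] : [0, \infty) \to [0, 1]^{d \times r}$ with the coarsest topology that renders continuous the maps $u(\cdot) \mapsto \int_0^T g(t)\, u_{ij}(t)\, dt$ $\forall\, T > 0$, $1 \leq i \leq d$, $1 \leq j \leq r$, $g(\cdot) \in L_2[0, T]$. Using the Banach-Alaoglu theorem, one can prove that $U$ is compact and metrizable, therefore Polish. From (7) and the Lipschitz condition on $h(\cdot)$, it follows that $\bar{x}(t + \cdot)$, $t \geq 0$, are pointwise bounded and equicontinuous, hence relatively compact in $C([0, \infty); \mathbb{R}^d)$. Let $(\Lambda^*(\cdot) = [[\lambda^*_{ij}(\cdot)]], x^*(\cdot))$ denote a limit point of $(\Lambda(t + \cdot), \bar{x}(t + \cdot))$ in $U \times C([0, \infty); \mathbb{R}^d)$ as $t \uparrow \infty$, along a subsequence, say, $t_n \uparrow \infty$. By Lemma 2.1, for $s > s_0 > 0$,
$$\bar{x}_i(t_n + s) - \bar{x}_i(t_n + s_0) = \int_{s_0}^{s} \sum_{j \in \mathcal{N}(i)} \lambda_{ij}(t_n + y)\, h_{ij}(\bar{x}(t_n + y))\, dy + o(1).$$
Letting $n \uparrow \infty$, we have
$$x^*_i(s) - x^*_i(s_0) = \int_{s_0}^{s} \sum_{j \in \mathcal{N}(i)} \lambda^*_{ij}(y)\, h_{ij}(x^*(y))\, dy.$$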
Now for $j \notin \mathcal{N}(i)$, $\lambda_{ij}(\cdot) \equiv 0 \implies \lambda^*_{ij}(\cdot) \equiv 0$. Define
$$N(n, s) := \min\Big\{ m > n : \sum_{k=n+1}^{m} \bar{a}(k) > s \Big\}.$$
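Let $[t]$ denote the unique integer such that $[t] \leq t < [t] + 1$. Then for $j \in \mathcal{N}(i)$, $\ell \in \mathcal{N}(k)$ and $t, s > 0$,
$$\begin{aligned}
\frac{\int_t^{t+s} \lambda^*_{ij}(y)\, dy}{\int_t^{t+s} \lambda^*_{k\ell}(y)\, dy}
&= \lim_{n \uparrow \infty} \frac{\int_t^{t+s} \lambda_{ij}(t_n + y)\, dy}{\int_t^{t+s} \lambda_{k\ell}(t_n + y)\, dy} \\
&= \lim_{n \uparrow \infty} \frac{\sum_{m=[t_n+t]}^{N([t_n+t],\, s)} \frac{a(\nu(i,j,m))\, \xi_{ij}(m)}{\bar{a}(m)}\, \bar{a}(m)}{\sum_{m=[t_n+t]}^{N([t_n+t],\, s)} \frac{a(\nu(k,\ell,m))\, \xi_{k\ell}(m)}{\bar{a}(m)}\, \bar{a}(m)} \\
&= \lim_{n \uparrow \infty} \frac{\sum_{m=\nu(i,j,[t_n+t])}^{\nu(i,j,N([t_n+t],\, s))} a(m)}{\sum_{m=\nu(k,\ell,[t_n+t])}^{\nu(k,\ell,N([t_n+t],\, s))} a(m)} \\
&= 1
\end{aligned} \tag{12}$$
a.s. by (A4) and (8). Thus for $x > 0$,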
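$$\begin{aligned}
\lim_{t \uparrow \infty} \frac{\int_0^x \int_0^t \lambda^*_{ij}(s + y)\, ds\, dy}{\int_0^x \int_0^t \lambda^*_{k\ell}(s + y)\, ds\, dy}
&= \lim_{t \uparrow \infty} \frac{\int_0^t \int_0^x \lambda^*_{ij}(s + y)\, dy\, ds}{\int_0^t \int_0^x \lambda^*_{k\ell}(s + y)\, dy\, ds} \\
&= \lim_{t \uparrow \infty} \frac{\int_0^t \left[ \left( \frac{\int_0^x \lambda^*_{ij}(s + y)\, dy}{\int_0^x \lambda^*_{k\ell}(s + y)\, dy} \right) \int_0^x \lambda^*_{k\ell}(s + y)\, dy \right] ds}{\int_0^t \int_0^x \lambda^*_{k\ell}(s + y)\, dy\, ds}.
\end{aligned}$$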
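By l’Hôpital’s rule,
$$\lim_{t \uparrow \infty} \frac{\int_0^x \lambda^*_{ij}(t + y)\, dy}{\int_0^x \lambda^*_{k\ell}(t + y)\, dy} = 1 \quad \text{a.s.}$$
Every limit point satisfies the above equation for $t \to \infty$. Since $x > 0$ was arbitrary, we conclude that for $t \geq 0$,
$$\int_t^{t+s} \lambda^*_{ij}(y)\, dy = \int_t^{t+s} \lambda^*_{k\ell}(y)\, dy \quad \forall i, j, k, \ell,$$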
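for all $t, s > 0$, implying by Lebesgue’s theorem that $\lambda^*_{ij}(\cdot) = \alpha(\cdot)$ (say) a.e. for some $\alpha(\cdot) \geq 0$ independent of $i, j$. We drop the qualification ‘a.e.’ by choosing a suitable version. It is easily verified from the definition of $\Lambda(\cdot)$ that
$$\Delta := \sum_{i=1}^{d} |\mathcal{N}(i)| \geq \sum_{i,\, j \in \mathcal{N}(i)} \lambda_{ij}(t) \geq 1,$$
leading to
$$\Delta \geq \sum_{i,\, j \in \mathcal{N}(i)} \lambda^*_{ij}(t) = \Big( \sum_{i=1}^{d} |\mathcal{N}(i)| \Big) \alpha(t) \geq 1.$$
Hence $\infty > \Delta \geq \alpha(t) \geq \delta$ for a suitable $\delta > 0$. This completes the proof. □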
This brings us to our main result. Say that a set $A$ is an internally chain transitive invariant set for (3) if for every $x \in A$, the trajectory $x(t)$ of (3) with $x(0) = x$ remains in $A$ for all $t \in \mathbb{R}$ and for any $x, y \in A$ and $T, \epsilon > 0$, there exist $n \geq 1$, $x_0 = x, x_1, \cdots, x_{n-1}, x_n = y$ such that the trajectory of (3) initiated at $x_i$, $0 \leq i < n$, intersects with the open $\epsilon$-ball centered at $x_{i+1}$ at some $t \geq T$. We then have the following extension of the celebrated result of Benaim for the classical Robbins-Monro scheme.
Theorem 2.1 — Almost surely, $x(n) \to$ a nonempty compact connected internally chain transitive invariant set of (3). In particular, if (3) has a unique globally asymptotically stable attractor $C$, then $C$ is the only such set and $x(n) \to C$ a.s.
PROOF: Let $\gamma(t) := \int_0^t \alpha(s)\, ds$ and $\tilde{x}(t) := x(\gamma(t))$ where $x(\cdot)$ satisfies (3). Then
$$\dot{\tilde{x}}(t) = \alpha(t)\, h(\tilde{x}(t)).$$
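To spell out this time-change step (a routine chain-rule computation, included here only as a sketch): since $\gamma'(t) = \alpha(t)$ for a.e. $t$ and $x(\cdot)$ satisfies (3),
$$\dot{\tilde{x}}(t) = \frac{d}{dt}\, x(\gamma(t)) = \gamma'(t)\, \dot{x}(\gamma(t)) = \alpha(t)\, h(x(\gamma(t))) = \alpha(t)\, h(\tilde{x}(t)).$$
Moreover, the bounds $\delta \leq \alpha(t) \leq \Delta$ of Lemma 2.2 give $\delta t \leq \gamma(t) \leq \Delta t$, so $\gamma$ is a strictly increasing map of $[0, \infty)$ onto itself, and $\tilde{x}(\cdot)$ traces the same orbit as $x(\cdot)$, only at a different speed.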
It is easy to see that if the right hand side of (12) were something other than 1, say $\kappa_{ijk\ell} > 0$, then we would have to replace (3) in the above statement by
$$\dot{x}_i(t) = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}(t)\, h_{ij}(x(t)),$$
where the $\alpha_{ij}(\cdot)$’s reflect the sampling bias. This in general would have a different asymptotic behavior than (3).
Recall also that we have ignored delays. We shall replace the condition (6) of [3] by the following conditions from [4], Chapter 7. These are more intuitive and allow for a much easier analysis:
• (A5) $\frac{\tau_{ij}(n)}{n} \to 0$ a.s.
• (A6) $\{\tau_{ij}(n)\}$ are stochastically dominated by a random variable $\tau$ satisfying $E\big[\tau^{1/\eta}\big] < \infty$ for some $\eta > 0$ such that $a(n) = o(n^{-\eta})$.
Both are very reasonable assumptions. The first does not allow for arbitrarily large delays whereas the second ensures that the delay distributions have uniformly well-behaved tails in a certain sense. In fact, if $\{\tau_{ij}(n), n \geq 0\}$ are identically distributed for some $i, j$, and $\eta = 1$, then $E[\tau_{ij}(k)] = \sum_n P(\tau_{ij}(k) \geq n) = \sum_n P(\tau_{ij}(n) \geq n) < \infty$, and a simple application of the Borel-Cantelli lemma shows that (A6) $\implies$ (A5).
Theorem 2.1 — Theorem 2.1 continues to hold for random delays satisfying (A5)-(A6).
PROOF: This follows by the arguments of pp. 82-84, [4]. □
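A small simulation illustrates how mild delays of this kind leave the limit unchanged. The sketch below is illustrative only: it uses a scalar iteration with the hypothetical field $h(x) = x^* - x$ for $x^* = 2$, step-size $a(n) = 1/(n+1)$, Gaussian noise, and i.i.d. geometric delays truncated to $\{0, \cdots, n\}$. Geometric delays have all moments finite, so they are one simple choice compatible with (A5)-(A6). The function name `delayed_sa` and its parameters are assumptions.

```python
import numpy as np

def delayed_sa(n_steps, mean_delay=5.0, noise_std=0.5, seed=0):
    """Scalar iteration x(n+1) = x(n) + a(n) [h(x(n - tau(n))) + M(n+1)]."""
    rng = np.random.default_rng(seed)
    x_star = 2.0                                   # zero of the hypothetical h(x) = x_star - x
    # Geometric delays on {0, 1, 2, ...} with the given mean, truncated so tau(n) <= n.
    raw = rng.geometric(1.0 / (mean_delay + 1.0), size=n_steps) - 1
    tau = np.minimum(raw, np.arange(n_steps))
    noise = noise_std * rng.standard_normal(n_steps)
    x = np.zeros(n_steps + 1)                      # full history, needed to read stale iterates
    for n in range(n_steps):
        stale = x[n - tau[n]]                      # delayed iterate x(n - tau(n))
        x[n + 1] = x[n] + (x_star - stale + noise[n]) / (n + 1)
    return x, tau

n_steps = 50000
x_path, tau = delayed_sa(n_steps)
print(x_path[-1], tau[-1000:].max() / n_steps)
```

The second printed number is the largest recent delay relative to $n$, which is small here in line with (A5); the final iterate sits close to $x^* = 2$ despite the stale evaluations.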
Finally, we state without proof a sufficient condition for (7) along the lines of [5], [2]. The proof goes along the same lines as that of Theorem 7, pp. 26-27, [4].
Theorem 2.2 — Suppose $h_\infty(x) := \lim_{c \uparrow \infty} \frac{h(cx)}{c}$ is well defined and the o.d.e.
$$\dot{x}(t) = h_\infty(x(t))$$
has the origin as its unique globally asymptotically stable equilibrium. Then (7) holds.
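As a hypothetical illustration of this condition (not taken from the source), suppose $h(x) = Ax + g(x)$ with $A$ a $d \times d$ matrix and $g$ bounded and Lipschitz. Then
$$h_\infty(x) = \lim_{c \uparrow \infty} \frac{A(cx) + g(cx)}{c} = Ax + \lim_{c \uparrow \infty} \frac{g(cx)}{c} = Ax,$$
so the scaled o.d.e. is $\dot{x}(t) = A x(t)$, for which the origin is the unique globally asymptotically stable equilibrium precisely when every eigenvalue of $A$ has strictly negative real part. In this example the bounded part $g$ plays no role in the stability test.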
REFERENCES
2. S. Bhatnagar, The Borkar-Meyn theorem for asynchronous stochastic approximation, Systems & Control Letters, 60 (2011), 472-478.
3. V. S. Borkar, Asynchronous stochastic approximation, SIAM Journal on Control and Optimization, 36 (1998), 840-851 (Correction note in: ‘Erratum: Asynchronous Stochastic Approximation’, SIAM Journal on Control and Optimization, 38 (2000), 662-663).
4. V. S. Borkar, Stochastic approximation: A dynamical systems viewpoint, Hindustan Publishing Agency, New Delhi, and Cambridge University Press, Cambridge, UK, 2008.
5. V. S. Borkar and S. P. Meyn, The O.D.E. method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), 447-469.
6. L. Breiman, Probability, Addison-Wesley, Reading, Mass., 1968.
7. D. P. Derevitskii and A. I. Fradkov, Two models for analysing the dynamics of adaptation algorithms, Automation and Remote Control, 35 (1974), 59-67.
8. R. Dwivedi and V. S. Borkar, Removing sampling bias in networked stochastic approximation, Proceedings of International Conference on Signal Processing and Communications (SPCOM), July 22-24, 2014, Bangalore.
9. N. N. Krasovskii, Stability of motion, Stanford Uni. Press, Stanford, CA, 1963.
11. H. Robbins and J. Monro, A stochastic approximation method, Annals of Mathematical Statistics, 22 (1951), 400-407.