© Indian National Science Academy. DOI: 10.1007/s13226-016-0184-5

MULTI-ARMED BANDITS BASED ON A VARIANT OF SIMULATED ANNEALING

Mohammed Shahid Abdulla∗ and Shalabh Bhatnagar∗∗

∗IT and Systems Area, Indian Institute of Management, Kozhikode, India

∗∗Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India

e-mails: [email protected], [email protected]

(Received 28 May 2015; accepted 16 September 2015)

A variant of Simulated Annealing termed Simulated Annealing with Multiplicative Weights (SAMW) has been proposed in the literature. However, convergence was dependent on a parameter $\beta(T)$, which was calculated a priori based on the total number of iterations $T$ the algorithm would run for. We first show the convergence of SAMW even when a diminishing stepsize $\beta_k \to 1$ is used, where $k$ is the index of iteration. Using this SAMW as a kernel, a stochastic multi-armed bandit (SMAB) algorithm called SOFTMIX can be improved to obtain the minimum-possible $\log$ regret, as compared to the $\log^2$ regret of the original. Another modification of SOFTMIX is proposed which avoids the need for a parameter that is dependent on the reward distribution of the arms. Further, a variant of SOFTMIX that uses a comparison term drawn from another popular SMAB algorithm called UCB1 is then described. It is also shown why the proposed scheme is computationally more efficient than UCB1, and an alternative to this algorithm with simpler stepsizes is also proposed. Numerical simulations for all the proposed algorithms are then presented.

Key words : Stochastic processes; applied probability; statistics; discrete optimization.

1. INTRODUCTION

An algorithm called Simulated Annealing with Multiplicative Weights (SAMW) was introduced in [10]. Though the objective in [10] was to compute optimal policies for Finite-Horizon Markov Decision Processes, a simplified description of this algorithm can be given as follows. In the setting of SAMW, a random reward $X^i_k$ is obtained at each step $k$, where the action $a_i$ belongs to a finite set $A$. An empirical mean of the reward for each action $a_i$, defined as $\mu^i_k = \frac{1}{k}\sum_{s=1}^{k} X^i_s$, is updated after step $k$. The $|A|$-sized probability vector $\phi_k$ is then updated to produce a $\phi_{k+1}$ (so that, eventually, the weight of the best action $a^*$ tends to $1$ and $\phi^j_k \to 0$ for actions $a_j \in A\setminus\{a^*\}$). We assume that a unique best action $a^*$ exists. The improvement step of $\phi_k$ is performed as:

$$\phi^i_{k+1} := \frac{\phi^i_k\,\beta^{\mu^i_k}}{\sum_{a_j \in A}\phi^j_k\,\beta^{\mu^j_k}}, \tag{1}$$

where $\beta > 0$ is a small constant which [10] requires to be $\beta(T)$: a constant calculated a priori after the maximum iteration number $T$, i.e., $k \le T$, is set. For a finite $T$, a constant $\beta$ would only result in approximate performance, as the proofs in [10] employed bounds that were 'asymptotically efficient', i.e., they held tightly only if $T \to \infty$. In §2 below, we eliminate this dependence of $\beta$ on $T$. Note that in the setting that SAMW dealt with, each action is to be employed to obtain a random reward (that follows an unknown, but fixed, distribution) at each step $k$. It is this protocol that is changed in the following section, to introduce our second problem.
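As a rough illustration of the improvement step (1), the following Python sketch performs one multiplicative-weights update. The function name `samw_update` and the example numbers are assumptions made here for illustration and do not come from [10].

```python
def samw_update(phi, mu, beta):
    """One step of (1): phi_i <- phi_i * beta**mu_i, then renormalise.

    phi  : current probability vector over actions
    mu   : empirical mean reward of each action
    beta : the constant base of the multiplicative weights
    """
    weights = [p * beta ** m for p, m in zip(phi, mu)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-action example: the action with the larger mean gains weight.
phi = samw_update([0.5, 0.5], mu=[1.0, 0.0], beta=2.0)
```

With these illustrative inputs the unnormalised weights are 1.0 and 0.5, so the updated vector is (2/3, 1/3).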

1.1 Stochastic Multi-Armed Bandit

The Stochastic Multi-Armed Bandit (SMAB) problem has the goal of detecting the action $a^*$ that has the highest expected reward as early as possible, or with as few applications of the sub-optimal actions in $A\setminus\{a^*\}$ as possible. We assume in the following that there is a unique best action $a^*$. To explain this problem in brief: a finite set of actions, called arms, $A$ is available. Each arm $a_j$ in $A$ yields a reward drawn from a fixed distribution, whose mean is $\mu_j$. The player who pulls the arms maintains an empirical mean for each arm $j$, but has to discover as soon as possible the best arm $a^* \in A$ (the arm which corresponds to the highest mean $\mu^*$), so that this arm can continue to be pulled for maximum expected profit. We maintain a probability vector $\phi_k$ over $A$ that is updated iteratively to yield convergence (to a $\phi^*$), so that as $k \to \infty$, $\phi_k(a^*) \to 1$ and $\phi_k(a_j) \to 0$ for all $a_j \in A\setminus\{a^*\}$. One may consider this as a randomized stationary policy for a one-state MDP.

The expected regret of the algorithm is defined as the expected total loss incurred due to not playing the best arm $a^*$ in each iteration $k$. The arm played at each iteration $k$, inferred according to one's algorithm, is $\hat a_k$ and the reward obtained by pulling this arm is $\hat X_k$. Thus

$$k\cdot\mu^* - \sum_{p=1}^{k} E\left(\hat X_p \mid \mathcal{F}_{p-1}\right)$$

is the expected regret of this algorithm, which samples arms $\hat a_k$ according to the probability iterate $\phi_k$. Here, $\mathcal{F}_{p-1}$ is a $\sigma$-algebra that contains all information available to the algorithm at iteration $p$. In §3, we employ the SAMW kernel inside a SMAB algorithm called SOFTMIX [7], and propose a new […] of SOFTMIX was yet to be found.
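The expected-regret expression above can be evaluated numerically when the true means are known. The sketch below assumes that the arm at iteration $p$ is drawn from a probability vector that is already fixed given the past, so that the conditional expected reward is the probability-weighted mean; the helper name `expected_regret` and the inputs are hypothetical.

```python
def expected_regret(mu, phis):
    """Return k*mu_star - sum_p E(reward at p | past), with k = len(phis).

    mu   : true mean reward of each arm
    phis : one probability vector per iteration, used to sample the arm
    """
    mu_star = max(mu)
    regret = 0.0
    for phi in phis:
        expected_reward = sum(p * m for p, m in zip(phi, mu))
        regret += mu_star - expected_reward
    return regret


# Hypothetical run: uniform play in the first round, best arm only in the second.
r = expected_regret([0.9, 0.5], [[0.5, 0.5], [1.0, 0.0]])
```

Only the first round contributes here, since the second round plays the best arm with probability one.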

Both SOFTMIX and SAMWMIX have the disadvantage that a parameter $d$ has to be provided a priori as input to the algorithm, satisfying the property that $d < \min_{a_i \in A\setminus\{a^*\}}\{\mu^* - \mu_i\}$. We eliminate this disadvantage in §4 by proposing a blind-SAMWMIX. While it behaves quite well in practice, blind-SAMWMIX has the disadvantage that it can only achieve $O(\log^{1+2\alpha}(k))$ regret, where $0 < \alpha < 0.5$ (with better properties for $\alpha$ closer to $0.5$). A popular log-regret algorithm for SMAB, called UCB1 (Upper Confidence Bound 1), is of a different type. Here, at step $k$, a confidence bound $b^i_k$ is calculated for each arm $i$, and the arm corresponding to $\arg\max_{a_i \in A}\{\mu^i_k + b^i_k\}$ is played as the 'winner arm' $\hat a_{k+1}$. Consequently, the empirical mean $\mu^j_{k+1}$ for $a_j = \hat a_{k+1}$ is re-computed to include the new sample $\hat X_{k+1}$, whereas all other empirical means are carried over, i.e. $\mu^j_{k+1} = \mu^j_k$ for $a_j \neq \hat a_{k+1}$. Thus it is not a randomized policy, and it performs an implicit explore-exploit trade-off using the bound $b^i_k$. Just as the SAMW kernel was used in SOFTMIX to produce SAMWMIX, we employ the UCB1 mechanism for our algorithm UCB1MIX in §5. The advantage lies in avoiding the $O(|A|)$-complexity operation needed to find $\arg\max_{a_i \in A}\{\mu^i_k + b^i_k\}$.
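The UCB1 selection rule described above can be sketched as follows. The confidence term $\sqrt{2\log k / c^i_k}$ is the standard UCB1 choice from [2], with $c^i_k$ the number of pulls of arm $i$; the function names and the assumption that every arm has already been pulled once are illustrative.

```python
import math


def ucb1_choose(means, counts, k):
    """Return the index maximising mean + confidence bound (an O(|A|) scan)."""
    scores = [m + math.sqrt(2.0 * math.log(k) / c) for m, c in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def update_mean(mean, count, reward):
    """Fold one new sample into the running mean of the pulled arm only."""
    count += 1
    return mean + (reward - mean) / count, count


# A rarely pulled arm wins on its wide confidence bound despite a lower mean.
explored = ucb1_choose([0.5, 0.4], counts=[100, 1], k=101)
exploited = ucb1_choose([0.5, 0.4], counts=[10, 10], k=20)
```

The means of all arms other than the winner are left untouched, which matches the carry-over of $\mu^j_k$ described in the text.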

1.2 Survey of SAMW and SMAB

The algorithm in (1) as described in [10] is drawn from the work in [12], where an optimal strategy for non-cooperative repeated bi-matrix zero-sum games is the objective. In [12] too, the quantity $\beta$ in (1) is constant and dependent on the total number of iterations $T$. Our choice of $\beta_k$ is a diminishing parameter, with the further difference that $\beta_k > 1\ \forall k$. SAMW differs from Simulated Annealing in that it does not perform any local search in $A$ but updates the probability distribution $\phi_k$ over $A$. It also has a simpler tuning process.

As we explain in §6, there are computation as well as precision-related advantages over UCB1 for algorithms like SOFTMIX that use a Boltzmann exploration structure. The work in [14] summarizes the advantages that a stochastic policy-based algorithm (that additionally uses 'Importance Sampling', the device in (9) below) has over deterministic algorithms like UCB1. However, UCB1 remains popular in both extensions of SMABs as well as applications. Among extensions, recent work on SMAB has focused on using a norm in the Kullback-Leibler neighbourhood to produce UCB1's confidence bounds, bringing it into the nearly constant regime as in the KL-UCB of [13]. However, KL-UCB is optimal for Bernoulli arms (i.e. arms having reward $1$ with a Bernoulli bias probability), and also requires an input parameter $\alpha > 1$. For general distributions, there is the […] is an input parameter too, this time $\alpha > 2$. We make specific comparisons with two other algorithms (EXP3 and Thompson Sampling) on the basis of their regret bounds in §3 below. A useful summary of results on stochastic multi-armed bandit algorithms can be found in [6, Chapters 1-2].

Of note among applications are the 'Mortal' bandits in [8], where there is the possibility that certain arms will not be available after an index $k$ or that new arms may be made available. This work also proposes a variant of UCB1 known as UCB1-KC ($K$ and $C$ are parameters that are chosen a priori). The work in [8] applies the algorithm to a web-advertisement model, where pulling an arm corresponds to choosing a particular ad to be delivered on a given webpage, the reward being obtained when the viewer clicks this ad. A problem similar to [8] is the 'Irrevocable' Multi-Armed Bandit problem in [11], representing fashion retailers' procurement scenarios.

1.2.1 Contributions and Outline: Our contributions in this paper, and the sections that cover them, are:

1. An algorithm with a decreasing stepsize $\beta_k$ used in (1) above, and a proof of asymptotic convergence for the same (§2). This is different from the asymptotically efficient algorithm proposed in [10].

2. Modification of the proposed asymptotically convergent algorithm in §3 to obtain the log-regret algorithm SAMWMIX, on the lines of the log-squared regret SOFTMIX in [7].

3. A 'blind' algorithm that does not require the input parameter $d$ which both SOFTMIX and SAMWMIX need (§4).

4. An algorithm that adapts the existing UCB1 algorithm to a Boltzmann exploration scheme and is numerically observed to require less computation and to have better 'precision', i.e., a higher probability of pulling the best arm a larger number of times (§5).

We also provide numerical results for all these algorithms (§6).

2. SAMW WITH DIMINISHING β

We propose the update step in SAMW to be:

$$\phi^i_{k+1} := \frac{\phi^i_k\,\beta_k^{\mu^i_k}}{\sum_{j=1}^{|A|}\phi^j_k\,\beta_k^{\mu^j_k}} \tag{2}$$

where $\mu^j_k$ is the $k$-sample mean corresponding to arm $a_j$, obtained from rewards $X^j_k \in [0,1]$ obtained by pulling arm $a_j$. In [10], the $\beta_k$ are held constant at a small $\beta > 1$ (the result shown, [10, Lemma 3.2], is that for a given number of iterations $T$, $\beta := \psi(T)$ and $\psi(T) \to 1$ as $T \to \infty$).

Using the optimum policy $\phi^*$, define the Kullback-Leibler entropy term as $D^*_l := \sum_{j=1}^{|A|}\phi^*(j)\cdot\log\frac{\phi^*(j)}{\phi^j_l}$ which, since $\phi^*(a^*) = 1$ with $\phi^*(j) = 0$ for $j \neq a^*$, equals $-\log\phi^{a^*}_l$.

Theorem 1. For stepsize $\beta_k := 1 + \beta'_k = 1 + \frac{1}{\log k}$, $k \ge 2$, and update step (2), the average reward from the current policy $\phi_k$ is s.t. $\sum_{j=1}^{|A|}\phi^j_k\,\mu^j_k \to \mu^*$ as $k \to \infty$.

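Proof:

$$\begin{aligned}
D^*_{l+1} - D^*_l &= \sum_{j=1}^{|A|}\phi^*(j)\log\frac{\phi^j_l}{\phi^j_{l+1}} = \log\left(\frac{\sum_{j=1}^{|A|}\phi^j_l\,\beta_l^{X^j_l}}{\beta_l^{X^*_l}}\right)\\
&= -X^*_l(\log\beta_l) + \log\sum_{j=1}^{|A|}\phi^j_l\,\beta_l^{X^j_l}\\
&\le -X^*_l(\log\beta_l) + \log\sum_{j=1}^{|A|}\phi^j_l\left(1 + (\beta_l - 1)X^j_l\right)\\
&\qquad\text{using } \beta^a \le 1 + (\beta - 1)a \text{ for } \beta > 1,\ a < 1\\
&\le -X^*_l(\log\beta_k) + \log\Big(1 + (\beta_l - 1)\sum_{j=1}^{|A|}X^j_l\cdot\phi^j_l\Big)\\
&\qquad\text{for a large } k \text{ s.t. } l < k, \text{ since } 1 < \beta_k < \beta_l.
\end{aligned}$$

With $\bar X_l := \sum_{j=1}^{|A|}\phi^j_l X^j_l$ and using $\log(1 + a) \le a$ for $a < 1$, this gives

$$X^*_l \le \frac{D^*_l - D^*_{l+1}}{\log\beta_k} + \frac{\beta'_l\,\bar X_l}{\log\beta_k},$$

$$\frac{1}{k}\sum_{l=1}^{k}X^*_l \le \frac{D^*_1 - D^*_{k+1}}{k\log\beta_k} + \frac{1}{k\log\beta_k}\sum_{l=1}^{k}\beta'_l\,\bar X_l, \tag{3}$$

obtained by adding over all $l$ till $k$ and dividing by $k$.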

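Note that $(D^*_1 - D^*_{k+1}) \le D^*_1 = \log|A|$ due to the initialization $\phi^j_1 = \frac{1}{|A|}$, and that $k\log\beta_k \to \infty$ as $k \to \infty$. Hence, $\frac{D^*_1 - D^*_{k+1}}{k\log\beta_k} \to 0$ as $k \to \infty$. Since $\beta_k \to 1$, it is the case that $\phi_k \to \bar\phi$ for some probability mass function $\bar\phi$ over all actions. The term $\hat X_k := \frac{1}{k\log\beta_k}\sum_{l=1}^{k}\beta'_l\,\bar X_l$, i.e. the second term on the right-hand side of (3), satisfies the recursion

$$\hat X_{k+1} := \hat X_k + \frac{\beta'_{k+1}}{(k+1)\log\beta_{k+1}}\left(\bar X_{k+1} - \frac{\big((k+1)\log\beta_{k+1} - k\log\beta_k\big)}{\beta'_{k+1}}\,\hat X_k\right).$$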

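Let $E(\bar X)$ denote $\sum_{i=1}^{|A|}\bar\phi^i\mu^i$ and let the martingale difference $M_{k+1}$ be such that $M_{k+1} = \bar X_{k+1} - E[\bar X_{k+1} \mid \mathcal{F}_k]$. The sequence $M_{k+1}$ above is defined w.r.t. the sigma algebra $\mathcal{F}_k = \sigma(\phi^i_l, \bar X_l, 1 \le l \le k, 1 \le i \le |A|)$. One can then rewrite the above recursion as:

$$\hat X_{k+1} := \hat X_k + \frac{\beta'_{k+1}}{(k+1)\log\beta_{k+1}}\cdot\left(E[\bar X_{k+1} \mid \mathcal{F}_k] - \frac{\big((k+1)\log\beta_{k+1} - k\log\beta_k\big)}{\beta'_{k+1}}\,\hat X_k + M_{k+1}\right) \tag{4}$$

$$\phantom{\hat X_{k+1}} := \hat X_k + \hat\beta_{k+1}\left(E[\bar X_{k+1} \mid \mathcal{F}_k] - \bar\beta_{k+1}\hat X_k + M_{k+1}\right), \tag{5}$$

where $\hat\beta_{k+1} := \frac{\beta'_{k+1}}{(k+1)\log\beta_{k+1}}$ and $\bar\beta_{k+1} := \frac{(k+1)\log\beta_{k+1} - k\log\beta_k}{\beta'_{k+1}}$, respectively. Now observe that (i) $\sum_{l=1}^{k}\hat\beta_l \to \infty$ as $k \to \infty$ and (ii) $\sum_{l=1}^{\infty}\hat\beta_l^2 < \infty$. By letting $\bar\beta_0 = \bar\beta_1 = 0.1$, it is easy to see that $0.1 \le \bar\beta_k < 1,\ \forall k$, and that $\bar\beta_k \le \bar\beta_{k+1},\ \forall k$. Further, $\bar\beta_k \to 1$ as $k \to \infty$.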
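The two derived stepsizes can be inspected numerically. The sketch below evaluates $\hat\beta_k$ and $\bar\beta_k$ from their definitions with $\beta_k = 1 + 1/\log k$; the function names are hypothetical, and the check covers only a finite range of $k$, so it illustrates rather than proves the stated properties.

```python
import math


def beta_prime(k):
    """beta'_k = 1 / log k, defined for k >= 2."""
    return 1.0 / math.log(k)


def log_beta(k):
    """log(beta_k) with beta_k = 1 + beta'_k."""
    return math.log1p(beta_prime(k))


def beta_hat(k):
    """beta_hat_k = beta'_k / (k log beta_k); behaves roughly like 1/k."""
    return beta_prime(k) / (k * log_beta(k))


def beta_bar(k):
    """beta_bar_k = (k log beta_k - (k-1) log beta_{k-1}) / beta'_k, for k >= 3."""
    return (k * log_beta(k) - (k - 1) * log_beta(k - 1)) / beta_prime(k)


# k * beta_hat_k stays bounded, so beta_hat_k is square-summable but not summable.
scaled = [k * beta_hat(k) for k in range(2, 5000)]
bars = [beta_bar(k) for k in range(3, 5000)]
```

Over this range $k\hat\beta_k$ stays between 1 and 2, which is consistent with conditions (i) and (ii), and $\bar\beta_k$ stays inside $[0.1, 1)$.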

Define a sequence $\{t(n)\} \subset [0, \infty)$ according to $t(0) = 0$ and $t(n) = \sum_{m=0}^{n-1}\hat\beta_m$, $n \ge 1$. Let $\bar\beta(t)$, $t \ge 0$, be defined by $\bar\beta(t(k)) = \bar\beta_{k+1}$, $k \ge 0$, with linear interpolation (between end points) for $t \in [t(k), t(k+1)]$, $k \ge 0$. Note from the construction that $\bar\beta(t) \in [0.1, 1]\ \forall t \ge 0$. Consider now the following ODE associated with (5):

$$\dot{\hat X}(t) = h_t(\hat X(t)) := \left(E[\bar X] - \bar\beta(t)\hat X(t)\right). \tag{6}$$

We shall now apply a key result from [4] (cf. Theorems 2.1-2.2) (alternatively, Theorem 7, Chapter 3.2 of [5]) that is, however, for the case of a time-homogeneous objective function $h(\hat X(t))$, unlike ours (which depends explicitly on $t$). As can be seen, the result of [4] carries over easily to our case as well. Note that $\forall X, Y \in \mathbb{R}$,

$$|h_t(X) - h_t(Y)| \le \bar\beta(t)|X - Y| \le |X - Y|.$$

Thus, $h_t(X)$ is Lipschitz continuous in $X$, uniformly in $t$. Note also that $M_{k+1} = \bar X_{k+1} - E[\bar X_{k+1} \mid \mathcal{F}_k]$, $k \ge 0$, is a mean-zero process w.r.t. $\mathcal{F}_k$ and that $M^2_{k+1}$, $k \ge 0$, is bounded above by $1$.

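Further, the step-sizes $\hat\beta_l$, $l \ge 1$, satisfy (i) and (ii) above. Now, let

$$h_{t,r}(X) := \frac{h_t(rX)}{r} = \frac{E[\bar X]}{r} - \bar\beta(t)X.$$

Thus, $h_{t,\infty}(X) := \lim_{r \to \infty}h_{t,r}(X) = -\bar\beta(t)X$. Since $\bar\beta(t) \ge 0.1 > 0$ (uniformly over $t$), it follows that the origin is the unique globally asymptotically stable equilibrium for the ODE $\dot{\hat X}(t) = h_{t,\infty}(\hat X(t))$. Note also that for any $0 < T < \infty$, $\int_0^T\bar\beta(\tau)\,d\tau \le \bar\beta(T)\,T < \infty$.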

In the light of the observations made in the previous paragraph, it is easy to verify that the sequence of results in Chapter 3.2 of [5], specifically Lemmas 1-2, Corollary 3, Lemmas 4-6 and also Theorem 7 there, continue to hold in our setting. Thus, $\sup_k\|\hat X_k\| < \infty$ almost surely (from Theorem 7, Chapter 3.2 of [5]), thereby ensuring almost sure boundedness of the iterates in (5).

Consider now the following ODE in place of (6):

$$\dot{\hat X}(t) = E[\bar X] - \hat X(t), \tag{7}$$

that has $\hat X = E[\bar X]$ as its unique globally asymptotically stable equilibrium. Now rewrite (5) as follows:

$$\hat X_{k+1} = \hat X_k + \hat\beta_{k+1}\left(E[\bar X_{k+1} \mid \mathcal{F}_k] - \hat X_k + M_{k+1} + \epsilon(k)\right),$$

where $\epsilon(k) = \hat X_k - \bar\beta_{k+1}\hat X_k$ is uniformly bounded by the foregoing and, further, $\epsilon(k) \to 0$ as $k \to \infty$ almost surely (since $\bar\beta_{k+1} \to 1$ as $k \to \infty$). It now follows as a consequence of the third extension in Chapter 2.2 of [5] that $\hat X_k \to \hat X = E[\bar X]$ almost surely as $k \to \infty$. However, from (3), this implies that the empirical mean $\mu^*_k$ obtained from the best arm is s.t. $\mu^*_k \le E(\bar X)$ as $k \to \infty$. This cannot be true unless equality holds, and therefore $\phi_k \to \phi^*$. □
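A small simulation can illustrate the behaviour that Theorem 1 describes. The sketch below applies update (2) with $\beta_k = 1 + 1/\log k$, sampling every arm once per iteration; the uniform reward intervals, the seed and the function name `d_samw` are assumptions chosen for illustration and are not the experimental setup of §6.

```python
import math
import random


def d_samw(intervals, iterations, seed=0):
    """Run update (2) with the diminishing stepsize beta_k = 1 + 1/log k.

    intervals : one (low, high) pair per arm; rewards are uniform on it
    Returns the final probability vector phi.
    """
    rng = random.Random(seed)
    n = len(intervals)
    phi = [1.0 / n] * n            # uniform initialisation, phi^j_1 = 1/|A|
    sums = [0.0] * n               # running reward totals, one per arm
    for k in range(1, iterations + 1):
        for j, (low, high) in enumerate(intervals):
            sums[j] += rng.uniform(low, high)   # every arm is sampled at each k
        if k < 2:
            continue               # beta_k is only defined for k >= 2
        beta_k = 1.0 + 1.0 / math.log(k)
        weights = [phi[j] * beta_k ** (sums[j] / k) for j in range(n)]
        total = sum(weights)
        phi = [w / total for w in weights]
    return phi


phi = d_samw([(0.7, 0.9), (0.2, 0.4), (0.4, 0.6)], iterations=500)
```

In this illustrative run the weight of the first arm, whose mean is the largest, ends up close to one.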

Each arm $a_j$ in $A$ is sampled once per iteration $k$, and this contributes to the high sampling budget of the algorithm in (2) (as well as the original in [10]). This high sampling budget comes about because all sub-optimal actions $a_j$ are taken at each iteration $k$ before a high level of confidence to infer the best action $a^*$ is achieved at some iteration $k \ll K$. To mitigate this problem, we employ the structure of SAMW in the SOFTMIX algorithm of [7]. In SOFTMIX, $k$-sample means $\mu^i_k$ are used at the $k$-th step of the iteration (by a suitable construction), even though only a 'winner' action (call it $\hat a_k$) is actually taken at each iteration $k$. Such behaviour is typical of Stochastic Multi-Armed Bandit (SMAB) algorithms, where only one arm is pulled in each iteration $k$ of the algorithm.

3. LOG-REGRET SOFTMIX

We wish to obtain the effect of using $k$-sample means for each action $j$ while performing update $k$ in (2). This is achieved in a separate algorithm named SOFTMIX in [7] by considering the reward $X^j_k = \frac{\hat X_k}{\phi^j_k}$ if arm $a_j$ is the winner arm $\hat a_k$ (i.e. the arm pulled after $\phi_{k-1}$ is sampled), and $0$ otherwise. Here $\hat X_k$ is the reward obtained by pulling the winner arm $\hat a_k$. A change in the SOFTMIX algorithm, giving it an SAMW-like kernel (see (2) above), helps us to obtain logarithmic regret. Thus any sub-optimal action $a_i \in A\setminus\{a^*\}$ will be taken only $O(\log k)$ times in $k$ trials of the system. This has been proved to be the best possible performance of any algorithm designed for the SMAB problem, with SOFTMIX capable of only $O(\log^2 k)$ regret. Also note the remark in [15] that a diminishing step-size algorithm (termed 'decreasing $\epsilon$-greedy') like SOFTMIX that achieves logarithmic regret was yet to be found.

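We summarize the SOFTMIX algorithm in the following. Define an indicator function $I^j_{k+1}$ which takes value $1$ for $a_j$ if $j$ is selected by sampling iterate $\phi_k$, with $I^j_{k+1} = 0$ for all other actions in $A\setminus\{a_j\}$. For each arm $i \in \{1, 2, \ldots, |A|\}$, perform the following updates:

$$\phi^i_{k+1} = (1 - \gamma_k)\frac{e^{\eta_k\hat s^i_k}}{\sum_{j=1}^{|A|}e^{\eta_k\hat s^j_k}} + \frac{\gamma_k}{|A|}, \tag{8}$$

$$\hat s^j_{k+1} = \hat s^j_k + \frac{\hat X_k}{\phi^j_k}I^j_k. \tag{9}$$

The step-sizes $\gamma_k$ and $\eta_k$ are calculated as: $\gamma_k = 1$ for $k = 1, 2$,

$$\gamma_k = \min\left(1, \frac{5|A|\log(k-1)}{d^2\cdot(k-1)}\right) \text{ for } k > 2,$$

$$\eta_k = \frac{1}{\frac{|A|}{\gamma_k} + 1}\log\left(1 + \frac{d\left(\frac{|A|}{\gamma_k} + 1\right)}{\frac{2|A|}{\gamma_k} - d^2}\right).$$

In the above, the quantity $d$ is a heuristic value input by the programmer and satisfies the condition that $0 < d < \min\{\Delta_i,\ a_i \neq a^*\}$. Here, $\Delta_i := \mu^* - \mu_i$ is the mean loss incurred upon taking action $a_i$.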
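The SOFTMIX step-sizes above translate directly into code. The sketch below is a plain transcription of the two formulas; the function names and the sample arguments are hypothetical.

```python
import math


def softmix_gamma(k, n_arms, d):
    """Exploration rate: 1 for k = 1, 2, else min(1, 5|A| log(k-1) / (d^2 (k-1)))."""
    if k <= 2:
        return 1.0
    return min(1.0, 5.0 * n_arms * math.log(k - 1) / (d * d * (k - 1)))


def softmix_eta(gamma, n_arms, d):
    """Learning rate eta = (1/c) log(1 + d c / sigma^2) for a given gamma."""
    c = n_arms / gamma + 1.0                 # c = |A|/gamma + 1
    sigma2 = 2.0 * n_arms / gamma - d * d    # sigma^2 = 2|A|/gamma - d^2
    return math.log(1.0 + d * c / sigma2) / c


early = softmix_gamma(2, n_arms=5, d=0.5)
late = softmix_gamma(10**6, n_arms=5, d=0.5)
eta = softmix_eta(1.0, n_arms=2, d=0.5)
```

For two arms with $\gamma = 1$ and $d = 0.5$, the argument of the logarithm is $1 + 0.5\cdot 3/3.75 = 1.4$, so $\eta = \log(1.4)/3$.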

The proposed algorithm SAMWMIX is:

$$\phi^i_{k+1} = (1 - \gamma_k)\frac{e^{\sum_{s=1}^{k}\eta_s\hat X^i_s}}{\sum_{j=1}^{|A|}e^{\sum_{s=1}^{k}\eta_s\hat X^j_s}} + \frac{\gamma_k}{|A|}. \tag{10}$$

Here, $\hat X^i_k$ is obtained using the reward-modification scheme of SOFTMIX, i.e. $\hat X^i_k := \frac{\hat X_k}{\phi^i_k}I^i_k$. The stepsizes $\gamma_k$ and $\eta_k$ are computed as follows:

$$\gamma_k = \min\left(1, \frac{5|A|}{d^2\cdot k}\right). \tag{11}$$

The stepsize $\eta_k$ is the same as in SOFTMIX. Note the absence of $\log(k-1)$ from the numerator in the expression for $\gamma_k$ above. Also observe the term $\frac{\gamma_k}{|A|}$ in (10), which is essentially an 'explore' component that was missing in the original SAMW algorithm in (2).
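To make the SAMWMIX recursion concrete, here is a minimal simulation sketch of update (10). The uniform reward intervals, the seed, the function name `samwmix`, and the subtraction of the largest exponent before exponentiating (a numerical safeguard that leaves the ratio in (10) unchanged) are all assumptions added for illustration.

```python
import math
import random


def samwmix(arm_intervals, d, rounds, seed=0):
    """Simulate (10) with gamma_k = min(1, 5|A|/(d^2 k)) and the SOFTMIX eta_k."""
    rng = random.Random(seed)
    n = len(arm_intervals)
    phi = [1.0 / n] * n
    score = [0.0] * n      # score[i] = sum_s eta_s * Xhat^i_s
    pulls = [0] * n
    for k in range(1, rounds + 1):
        gamma = min(1.0, 5.0 * n / (d * d * k))
        c = n / gamma + 1.0
        sigma2 = 2.0 * n / gamma - d * d
        eta = math.log(1.0 + d * c / sigma2) / c
        arm = rng.choices(range(n), weights=phi)[0]   # winner arm drawn from phi_k
        low, high = arm_intervals[arm]
        reward = rng.uniform(low, high)
        pulls[arm] += 1
        score[arm] += eta * reward / phi[arm]         # importance-weighted reward
        top = max(score)                              # stabilise the exponentials
        expw = [math.exp(s - top) for s in score]
        total = sum(expw)
        phi = [(1.0 - gamma) * w / total + gamma / n for w in expw]
    return phi, pulls


phi, pulls = samwmix([(0.8, 1.0), (0.0, 0.2)], d=0.5, rounds=2000)
```

Each component of the returned vector is at least $\gamma_k/|A|$, reflecting the 'explore' term, and in this well-separated example most pulls go to the first arm.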

The proposed algorithm (10) performs a gradual weighting of samples $\{\hat X^i_s\}_{s=1}^{k}$ with all scale-factors $\{\eta_s\}_{s=1}^{k}$, whereas SOFTMIX performs a multiplication of the entire sum $\sum_{s=1}^{k}\hat X^i_s$ with $\eta_k$. Due to the similarities, a comparison of the time-varying algorithm EXP3 in [14] (based on the original EXP3 in [3]) with the proposed SAMWMIX is also in order:

• The high-probability result in [14] bounds the regret as $O(\sqrt{k})$ in $k$ plays of the bandit.

• The companion algorithm EXP3ELM (read as 'EXP3 with Action Elimination') in [14] does achieve a logarithmic regret bound. However, in all the typical experiments no action elimination takes place and the algorithm is treated to be on par with EXP3 itself.

A brief comparison with the stochastic policy-based method in [1], which uses Thompson Sampling and is the first to show logarithmic expected regret, is also due. Using $d$ from (11) above, for the general $N$-armed bandit problem, the expected regret in [1] is proportional to $\frac{N}{d^4}$ whereas ours is proportional to $\frac{N}{d^2}$ (see (15) below).

We now give a proof of logarithmic regret in SAMWMIX, and obtain an upper bound of $O(\frac{1}{k})$ on $E\{\phi^i_k\}$ for any suboptimal arm $a_i$. This implies logarithmic regret, since $\sum_{p=1}^{k}\frac{\Delta_i}{p} = O(\log k)$. Define the $\sigma$-algebra $\mathcal{F}^i_k$ as $\sigma(\hat X^i_p, 1 \le p \le k)$, $k \ge 1$. In the following theorems, we let $\phi^i_k = P\{I^i_{k+1} = 1 \mid \mathcal{F}_k\}$.

Theorem 2. The probability of selecting action $a_i \in A\setminus\{a^*\}$ at step $k$ of (10) is s.t. $E(\phi^i_k) = O\left(\frac{1}{k}\right)$.

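PROOF: The early part of our treatment is similar to [7, Proof of Theorem 3.1, eq. (10)]:

$$\phi^i_k \le (1 - \gamma_k)\exp\Big(\sum_{p=1}^{k-1}\eta_p(\hat X^i_p - \hat X^*_p)\Big) + \frac{\gamma_k}{|A|}. \tag{12}$$

Now consider $Z^i_p := \hat X^i_p - \hat X^*_p + \Delta_i$ and note that $E\{Z^i_p \mid \mathcal{F}_{p-1}\} = 0$. Also, due to the form of $\hat X^i_p = \frac{\hat X_p}{\phi^i_p}I^i_p$, we have for $c_p = \frac{|A|}{\gamma_p} + 1$ the inequality $Z^i_p \le c_p$. Using the same form of $\hat X^i_p$, we obtain that $E\{(Z^i_p)^2 \mid \mathcal{F}_{p-1}\} \le \frac{2|A|}{\gamma_p} - \Delta_i^2 \le \sigma_p^2 := \frac{2|A|}{\gamma_p} - d^2$. Since $c_p > 0$, for the function […]

Thus, we also have for each $p$:

$$E\{e^{\eta_p Z^i_p} \mid \mathcal{F}_{p-1}\} \le E\{1 + \eta_p Z^i_p + (Z^i_p)^2\,\xi_{c_p}(\eta_p) \mid \mathcal{F}_{p-1}\} \le 1 + \sigma_p^2\,\xi_{c_p}(\eta_p) \le e^{\sigma_p^2\,\xi_{c_p}(\eta_p)}.$$

Thus, we have that:

$$E(\phi^i_k) \le \frac{\gamma_k}{|A|} + (1 - \gamma_k)\exp\Big(-\sum_{p=1}^{k-1}\big(\eta_p d - \xi_{c_p}(\eta_p)\sigma_p^2\big)\Big). \tag{13}$$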

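Consider $K_p := d$ and $\sigma_p^2 = \frac{2|A|}{\gamma_p} - d^2$, making each term in the sum on the RHS above $-K_p\eta_p + \xi_{c_p}(\eta_p)\sigma_p^2$. Now replace, as per the definition in [7, Fact 5.1], $\xi_{c_p}(\eta_p) = \frac{e^{c_p\eta_p} - 1 - c_p\eta_p}{c_p^2}$. From [7, Proof of Theorem 3.1], applying $\log(1 + x) \ge \frac{2x}{2 + x}$ to $\eta_p$ (which can be represented as $\frac{1}{c_p}\log(1 + \frac{K_p c_p}{\sigma_p^2})$), we get that $\eta_p \ge \frac{2K_p}{2\sigma_p^2 + c_p K_p}$. Consequently, $\xi_{c_p}(\eta_p) \le \frac{e^{c_p\eta_p} - 1}{c_p^2} - \frac{2K_p}{c_p(2\sigma_p^2 + c_p K_p)}$, where $e^{c_p\eta_p} - 1 = \frac{K_p c_p}{\sigma_p^2}$ due to the representation of $\eta_p$ as $\frac{1}{c_p}\log(1 + \frac{K_p c_p}{\sigma_p^2})$. Subtraction among the two fractions above yields $\xi_{c_p}(\eta_p) \le \frac{K_p^2}{\sigma_p^2(2\sigma_p^2 + c_p K_p)}$.

For each term in the sum on the RHS of (13), we have: $-K_p\eta_p + \xi_{c_p}(\eta_p)\sigma_p^2 \le \frac{-K_p^2}{2\sigma_p^2 + c_p K_p}$. Rewrite (13) as $E(\phi^i_k) \le \exp\Big(-\sum_{p=1}^{k-1}\frac{K_p^2}{2\sigma_p^2 + c_p K_p}\Big) + \frac{\gamma_k}{|A|}$.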

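After substituting for $c_p$, $K_p$, $\sigma_p^2$ and choosing an upper bound of $5|A|$, with $p \ge P$ for a finite $P$, for the resulting denominator $4|A| + d|A| + \gamma_p(d - 2d^2)$, we have:

$$\frac{K_p^2}{2\sigma_p^2 + c_p K_p} \ge \frac{d^2\gamma_p}{5|A|}, \text{ and,} \tag{14}$$

$$E(\phi^i_k) \le c\cdot\exp(-\log(k-1)) + \frac{5|A|}{d^2 k}. \tag{15}$$

Constant $c$ represents regret accumulated for indices $p < P$, and step-size $\gamma_p$ is inferred using (11) above. Thus, $E(\phi^i_k) = O(\frac{1}{k})$. □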

This result gives expected value of regret for any iterationk (and not only ask → ∞), and is achieved without use of any concentration inequalities.
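The step from (14) to (15) can be spelled out as follows. This is a sketch under the assumption that $P$ is large enough for the minimum in (11) to be inactive, so that $\gamma_p = \frac{5|A|}{d^2 p}$ for $p \ge P$; the integral comparison is a standard device and is not quoted from the source.

$$\frac{d^2\gamma_p}{5|A|} = \frac{1}{p}
\quad\Longrightarrow\quad
\sum_{p=P}^{k-1}\frac{K_p^2}{2\sigma_p^2 + c_p K_p} \;\ge\; \sum_{p=P}^{k-1}\frac{1}{p} \;\ge\; \int_P^{k}\frac{dx}{x} = \log k - \log P,$$

$$\exp\Big(-\sum_{p=P}^{k-1}\frac{K_p^2}{2\sigma_p^2 + c_p K_p}\Big) \;\le\; \frac{P}{k} \;\le\; P\cdot\exp(-\log(k-1)).$$

The terms with $p < P$ contribute a factor of at most one, since every summand is non-negative, which is consistent with collecting them into the constant $c$ of (15). Summing $\Delta_i\,E(\phi^i_p) = O(\frac{1}{p})$ over $p \le k$ then gives the $O(\log k)$ regret stated before Theorem 2.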

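Modify the stepsize to $\eta_k = \frac{1}{c_k}\log\left(1 + \frac{d\,c_k}{\sigma_k^2}\right)$ such that $c_k = \frac{2|A|}{\gamma_k} + 1$ and $\sigma_k^2 = c_k - 1$. Log-regret holds, as the inequalities $Z^i_k \le c_k$ and $E\{(Z^i_k)^2 \mid \mathcal{F}_{k-1}\} \le \sigma_k^2$ in the proof above continue to hold.

Theorem 3. For $k \ge P$, using $\eta_k = \frac{1}{c_k}\log\left(1 + \frac{d\,c_k}{\sigma_k^2}\right)$ with $c_k = \frac{2|A|}{\gamma_k} + 1$, $\sigma_k^2 = \frac{2|A|}{\gamma_k}$ and $\gamma_k = \min\left(1, \frac{5|A|}{d^2 k}\right)$, the quantity $\phi^i_k$ for $a_i \in A\setminus\{a^*\}$ in (10) is such that

$$\frac{\gamma_k}{|A|} \le \phi^i_k \le (1 - \gamma_k)\cdot\frac{1}{k} + \frac{\gamma_k}{|A|},$$

with probability $1 - k(\alpha_P)^k$ for an $\alpha_P$ s.t. $0 < \alpha_P < 1$.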

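PROOF: The LHS of the inequality is easy to observe: it is the exploration probability for action $a_i$. Observe from [7, eq. (9)] that $\phi^i_k \le (1 - \gamma_k)\exp\big(\sum_{s=1}^{k-1}\eta_s(\hat X^i_s - \hat X^*_s)\big) + \frac{\gamma_k}{|A|}$, where we define $\hat Z^i_s := \hat X^i_s - \hat X^*_s$. Using the Markov inequality, we have $P\{\exp(\sum_{s=1}^{k-1}\eta_s\hat Z^i_s) > \frac{1}{k}\} \le k\,E\{\exp(\sum_{s=1}^{k-1}\eta_s\hat Z^i_s)\}$. Consider another random variable $Z^i_{k-1} = \prod_{s=1}^{k-1}\exp(\eta_s\hat Z^i_s)$ and note that $Z^i_{k-1} = \exp(\eta_{k-1}\hat Z^i_{k-1})\cdot Z^i_{k-2}$. Also, $E(Z^i_{k-1}) := E(Z^i_{k-1} \mid \mathcal{F}_{k-2})$, where $\mathcal{F}_{k-2}$ is $\sigma(\hat X^i_p, 1 \le p \le k-2)$, $k \ge 3$.

From the proof of Theorem 2, we have that $E\{\hat Z^i_{k-1} \mid \mathcal{F}_{k-2}\} = -\Delta_i$, $E\{(\hat Z^i_{k-1})^2 \mid \mathcal{F}_{k-2}\} \le \frac{2|A|}{\gamma_{k-1}}$ and $\hat Z^i_{k-1} \le \frac{|A|}{\gamma_{k-1}}$, hence $\frac{1}{c_{k-1}}\hat Z^i_{k-1} \le \frac{1}{2}$. Further, since $\log\left(1 + \frac{d\,c_{k-1}}{\sigma^2_{k-1}}\right) = \log\left(1 + \left(1 + \frac{\gamma_{k-1}}{2|A|}\right)d\right) < 1$ for a small $d$, we have that $|\eta_{k-1}\hat Z^i_{k-1}| \le 1$. Use the inequality $e^a \le 1 + a + a^2$ for $|a| \le 1$ to obtain:

$$E(Z^i_k) \le E\big(1 + \eta_{k-1}\hat Z^i_{k-1} + \eta^2_{k-1}(\hat Z^i_{k-1})^2\big)\cdot Z^i_{k-1}$$

$$E(Z^i_k) \le \Big(1 - \eta_{k-1}\Delta_i + \eta^2_{k-1}\frac{2|A|}{\gamma_{k-1}}\Big)\cdot Z^i_{k-1} \tag{16}$$

In the above, $\eta_{k-1}\big(\Delta_i - \eta_{k-1}\frac{2|A|}{\gamma_{k-1}}\big) \ge \eta_{k-1}\Big(d - \frac{d}{\sqrt{d/\xi_{k-1} + 1}}\Big) \ge \bar\alpha_P$, where $\xi_{k-1} = \frac{\sigma^2_{k-1}}{c_{k-1}}$ and $\bar\alpha_P > 0$. To see this, notice that $\eta_{k-1}\frac{2|A|}{\gamma_{k-1}} = \xi_{k-1}\log\left(1 + \frac{d}{\xi_{k-1}}\right)$ where $\xi_{k-1} < 1$, and then employ the inequality $\log(1 + a) \le \frac{a}{\sqrt{1 + a}}$ for $a > 0$. Thus, with $\alpha_P = 1 - \bar\alpha_P$, the above (16) becomes $E(Z^i_k) \le \alpha_P\cdot Z^i_{k-1}$. Thus the adverse probability $P\{\exp(\sum_{s=1}^{k-1}\eta_s\hat Z^i_s) > \frac{1}{k}\}$ is less than $k\,\alpha^k_P$. □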

4. BLIND SAMWMIX

In SAMWMIX, an intelligent guess of $d$ was still needed as input to the stepsize $\gamma_k$. Let us suppose we have a simple technique to avoid guessing $d$ and hence we choose $\gamma_k = \frac{5|A|\log k}{k}$. Then, using (14), for all $k > e^{\frac{1}{d^2}}$, we would have that $\frac{d^2\log k}{k} \ge \frac{1}{k}$. However, this would result in log-squared regret due to a trailing term $\frac{5|A|\log k}{k}$ corresponding to $\frac{5|A|}{d^2 k}$ in (15). We propose an alternative where $d$ is not needed as input, and nearly log-regret is achieved. We retain the update step (10), and define $c_p = 1 + \frac{|A|}{\gamma_p}$ and $\sigma_p^2 = \frac{2|A|}{\gamma_p}$. With $\alpha \in (0, 0.5)$, we use modified step-sizes $\gamma_p$ and $\eta_p$:

$$\gamma_p = \min\left(1, 5|A|\cdot\frac{\log^{2\alpha}p}{p}\right), \qquad \eta_p = \frac{1}{c_p}\log\left(1 + \frac{1}{\log^{\alpha}p}\cdot\frac{c_p}{\sigma_p^2}\right).$$
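The blind step-sizes depend only on $p$, $|A|$ and $\alpha$, which the following sketch makes explicit. The function names are hypothetical, and the restriction to $p \ge 2$ is an assumption added here because $\log p$ vanishes at $p = 1$.

```python
import math


def blind_gamma(p, n_arms, alpha):
    """gamma_p = min(1, 5|A| log^(2 alpha)(p) / p), for p >= 2."""
    if p < 2:
        raise ValueError("p must be at least 2 so that log(p) > 0")
    return min(1.0, 5.0 * n_arms * math.log(p) ** (2.0 * alpha) / p)


def blind_eta(p, n_arms, alpha):
    """eta_p = (1/c_p) log(1 + K_p c_p / sigma_p^2) with K_p = log^(-alpha)(p)."""
    gamma = blind_gamma(p, n_arms, alpha)
    c = 1.0 + n_arms / gamma
    sigma2 = 2.0 * n_arms / gamma
    k_p = math.log(p) ** (-alpha)    # takes the place of the unknown gap d
    return math.log(1.0 + k_p * c / sigma2) / c


g_small = blind_gamma(2, n_arms=5, alpha=0.25)
g_large = blind_gamma(10**6, n_arms=5, alpha=0.25)
e_large = blind_eta(10**6, n_arms=5, alpha=0.25)
```

No parameter $d$ appears in either function; the slowly decaying quantity $\log^{-\alpha}p$ replaces it.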

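Theorem 4. Using stepsizes $\gamma_p$ and $\eta_p$, the probability of selecting an action $a_i \in A\setminus\{a^*\}$ at step $k$ of (10) is such that $E(\phi^i_k) = O\left(\frac{\log^{2\alpha}k}{k}\right)$.

PROOF: We define a diminishing stepsize $K_p := \frac{1}{\log^{\alpha}p}$. Thus we note, as regards the term within the summation on the RHS of (13) above, that for $p \ge P$:

$$-\eta_p d + \xi_{c_p}(\eta_p)\sigma_p^2 - \xi_{c_p}(\eta_p)d^2 \le -\eta_p K_p + \xi_{c_p}(\eta_p)\sigma_p^2,$$

where we use the fact that $\xi_{c_p}(\eta_p) \ge 0$ and $P = e^{d^{-1/\alpha}}$. As in the proof of Theorem 2, we can bound the quantity on the RHS by $\frac{-K_p^2}{2\sigma_p^2 + c_p K_p}$ and further by $\frac{-\gamma_p\cdot\log^{-2\alpha}p}{5|A|}$. Now substitute for $\gamma_p$ to obtain $-\frac{1}{p}$ as the upper bound for this term. However, the 'explore' term $\frac{\gamma_k}{|A|}$ in (10) results in $E(\phi^i_k) = O\left(\frac{\log^{2\alpha}k}{k}\right)$, as in the statement of this theorem. □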

Thus the regret accumulated is $O(\log^{1+2\alpha}(k))$, which can be made arbitrarily close to log-regret by choosing a small $\alpha$. However, the smaller $\alpha$ is, the larger will be the iterate index $P$ (the bounds derived would hold only for $p > P$).
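The passage from the per-step bound of Theorem 4 to the accumulated regret can be sketched with an integral comparison. The argument below is a standard one added for illustration; it uses that $x \mapsto \frac{\log^{2\alpha}x}{x}$ is decreasing for $x > e^{2\alpha}$, and that $e^{2\alpha} < e$ when $\alpha < 0.5$.

$$\sum_{p=3}^{k}\frac{\log^{2\alpha}p}{p} \;\le\; \int_{2}^{k}\frac{\log^{2\alpha}x}{x}\,dx \;=\; \frac{\log^{1+2\alpha}k - \log^{1+2\alpha}2}{1 + 2\alpha} \;=\; O\big(\log^{1+2\alpha}k\big).$$

Multiplying by the gap $\Delta_i$ and summing over the sub-optimal arms leaves the order unchanged.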

5. UCB1-LIKE SOFTMIX

The log-regret algorithm UCB1 in [2] has been applied in the past to finite-horizon MDPs (cf. [9]), in order to reduce the number of times simulated transitions are made to arrive at an optimal policy. We now propose a way to use the UCB1 of [2] within the SOFTMIX algorithm, thereby achieving log-regret while also avoiding the search for a maximum among $|A|$ values that is needed in each iteration $k$ of UCB1. Finding a maximum requires $|A|$ comparisons, although there exist optimizations. Define the $c^i_k$-sample mean for arm $i$ as $\mu^i_k = \frac{1}{c^i_k}\sum_{p=1}^{k}\hat X_p I^i_p$, and a term similar to the UCB1 'confidence' term $b^i_k := \sqrt{\frac{2\log k}{c^i_k}}$, where $c^i_k = \sum_{p=1}^{k}I^i_p$ is the number of times arm $i$ has been played. The method UCB1 adopts is to find the maximum among $\{\mu^i_k + b^i_k\}$, $\forall i \in A$, at each step $k$ and to play this action as $\hat a_{k+1}$.

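For the proposed algorithm UCB1MIX, we will use an 'explore' component similar to (10). Also note that the result obtained is a bound on the instantaneous regret, unlike UCB1 which bounds total regret. Consider step-sizes:

$$\delta_k = 2\log^{1+2\alpha}k, \quad b^i_k = \sqrt{\frac{1.5\log^{1+\alpha}k}{c^i_k}}, \quad S_k = \log^{-\alpha}k,$$

$$T_k = \frac{1}{k^2}, \quad \eta_k = \frac{1}{T_k}\log(1 + S_k T_k), \quad \gamma_k = \frac{(1 + \beta)\log^{\beta}k}{k}.$$

Using the condition $0 < \alpha < \beta < 0.5$, our algorithm is:

$$\phi^i_{k+1} := (1 - \gamma_k)\frac{e^{\eta_k\delta_k(\mu^i_k + b^i_k)}}{\sum_{j=1}^{|A|}e^{\eta_k\delta_k(\mu^j_k + b^j_k)}} + \frac{\gamma_k}{|A|}. \tag{17}$$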
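Update (17) can be computed from the current means and pull counts alone, as the sketch below shows. The defaults $\alpha = 0.1$ and $\beta = 0.2$ are the values reported in §6; the function name, the clamp of $\gamma_k$ at one, the requirement $k \ge 2$ with all counts positive, and the subtraction of the largest index before exponentiating are assumptions added here.

```python
import math


def ucb1mix_probs(means, counts, k, alpha=0.1, beta=0.2):
    """Return phi_{k+1} from (17) for k >= 2 and strictly positive counts."""
    n = len(means)
    log_k = math.log(k)
    delta = 2.0 * log_k ** (1.0 + 2.0 * alpha)
    s_k = log_k ** (-alpha)
    t_k = 1.0 / (k * k)
    eta = math.log1p(s_k * t_k) / t_k                    # close to s_k for large k
    gamma = min(1.0, (1.0 + beta) * log_k ** beta / k)   # clamp added as a safeguard
    index = [m + math.sqrt(1.5 * log_k ** (1.0 + alpha) / c)
             for m, c in zip(means, counts)]
    top = max(index)                                     # keeps exponents <= 0
    weights = [math.exp(eta * delta * (v - top)) for v in index]
    total = sum(weights)
    return [(1.0 - gamma) * w / total + gamma / n for w in weights]


probs = ucb1mix_probs([0.8, 0.3], counts=[50, 50], k=100)
```

The arm is then drawn at random from the returned vector, so no explicit maximisation over the indices is needed to pick the next action.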

Theorem 5. Assuming that $\mu^i_k \ge \delta > 0$, the probability of selecting an action $a_i \in A\setminus\{a^*\}$ at step $k$ of (17) is s.t. $E(\phi^i_k) = O\left(\frac{\log^{\beta}k}{k}\right)$.

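PROOF: Define $Z^i_k$ as $(\mu^i_k + b^i_k) - (\mu^*_k + b^*_k)$ and note that $\eta_k \to S_k$ due to $S_k T_k \to 0$. Note that $E(c^i_k) \ge \log^{1+\beta}k$ for all $a_i \in A$, due to the $\frac{\gamma_k}{|A|}$ 'explore' term in (17). Consequently, $Z^i_k \le \mu^i_k + b^i_k \le 1$ once $c^i_k \ge \frac{1.5\log^{1+\alpha}k}{(1 - \mu^i_k)^2}$. If we assume that $\mu^i_k \ge \delta > 0$, $\forall k, i$, then $Z^i_k \le \mu^i_k + b^i_k \le 1$ is achieved within finite $k$ since $\beta > \alpha$.

Further, conditioning on $Z^i_k > 0$,

$$e^{\eta_k\delta_k Z^i_k} \le e^{2\log^{1+\alpha}k} \text{ since } \eta_k \approx S_k,$$

$$P(Z^i_k \ge 0) \le e^{-3\log^{1+\alpha}k} \text{ for } k > \frac{1.5\log^{1+\alpha}s}{(\Delta_i)^2}.$$

In the above, we have drawn from the analysis preceding [2, (6)]. Hence, after a finite $k$,

$$E(Z^i_k \mid Z^i_k > 0)\cdot P(Z^i_k > 0) \le e^{-\log^{1+\alpha}k} \le \frac{1}{k}.$$

We condition next on $Z^i_k \le 0$. For each action $a_i$, there exists a $\tilde d_i > 0$ s.t. $\tilde d_i\Delta_i \le 1$ and $(\tilde d_i + 1)\Delta_j \ge 1$. Now, note that $E(Z^i_k) \le -d_i$ for some $d_i > 0$ once $c^i_k \ge \frac{1.5\log^{1+\alpha}k}{(1 - \tilde d_i\Delta_i)^2}$. Thus, $e^{\eta_k\delta_k Z^i_k} \le \prod_{r=1}^{\lfloor\delta_k\rfloor}e^{\eta_k\cdot Z^i_k}$ and therefore, using [7, Fact 5.1], $e^{\eta_k Z^i_k} \le 1 + \eta_k Z^i_k + \xi_{T_k}(\eta_k)\cdot(Z^i_k)^2$.

Applying expectations, we have that $E(e^{\eta_k Z^i_k}) \le 1 - \eta_k d_i + \xi_{T_k}(\eta_k)$ since $E((Z^i_k)^2 \mid \mathcal{F}_{k-1}) \le 1$ for $k > k_0$. Thus, $E(e^{\eta_k Z^i_k}) \le e^{(-\eta_k S_k + \xi_{T_k}(\eta_k)\sigma_k^2)}$ and hence, $E(e^{\eta_k Z^i_k}) \le \exp\left(\frac{-S_k^2}{2 + S_k T_k}\right)$. Therefore, $E(e^{\eta_k\delta_k\cdot Z^i_k}) \le \exp\left(\frac{-\lfloor\delta_k\rfloor S_k^2}{2 + S_k T_k}\right)$. The condition $E((Z^i_k)^2 \mid \mathcal{F}_{k-1}) \le 1$ is achieved for suitable $k > k_0$, i.e., $k > \max\left(\frac{1.5\log^{1+\alpha}k}{(1 - \mu^*_k)^2}, \frac{1.5\log^{1+\alpha}k}{(1 - \mu^i_k)^2}\right)$. We have used again the previous assumption that $\mu^i_k > \delta > 0$, $\forall i, k$. Now, the fact that $S_k T_k \to 0$ and $-\delta_k S_k^2 = -2\log k$ mean that $E(Z^i_k \mid Z^i_k \le 0)\cdot P(Z^i_k \le 0) \le \frac{1}{k}$. However, the 'explore' term in (17), $\frac{\gamma_k}{|A|} = \frac{(1 + \beta)\log^{\beta}k}{|A|\cdot k}$, results in the statement of the theorem. □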

5.1 Alternative UCB1-like scheme

An alternative algorithm to obtain logarithmic regret in the manner of UCB1, but without the explicit maximization, is given below. The advantage of this algorithm is the use of a single, simple stepsize $\gamma_k$, as well as the achievement of exact log-regret (although there is a scale-factor exponential in the number of arms $|A|$).

Consider $\gamma_k = \frac{1}{k}$ and confidence terms $b^i_k = \sqrt{\frac{\log k}{c^i_k}}$ to define this update step:

$$\phi^i_{k+1} := (1 - \gamma_k)\frac{e^{\mu^i_k + b^i_k}}{\sum_{j=1}^{|A|}e^{\mu^j_k + b^j_k}} + \frac{\gamma_k}{|A|}. \tag{18}$$
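Update (18) needs only $\gamma_k = 1/k$ and the confidence terms, as the short sketch below illustrates. The function name and the inputs are hypothetical, and strictly positive pull counts are assumed.

```python
import math


def alt_ucb_probs(means, counts, k):
    """Return phi_{k+1} from (18) with gamma_k = 1/k and b = sqrt(log k / count)."""
    n = len(means)
    gamma = 1.0 / k
    index = [m + math.sqrt(math.log(k) / c) for m, c in zip(means, counts)]
    top = max(index)                         # shift so that exponents are <= 0
    weights = [math.exp(v - top) for v in index]
    total = sum(weights)
    return [(1.0 - gamma) * w / total + gamma / n for w in weights]


tied = alt_ucb_probs([0.5, 0.5], counts=[3, 3], k=10)
skewed = alt_ucb_probs([0.9, 0.1], counts=[5, 5], k=10)
```

Arms with identical means and counts receive identical probabilities, and every arm keeps at least $\frac{1}{k|A|}$ through the 'explore' term.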

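Theorem 6. Using algorithm (18), the probability $\phi^i_k$ of playing an action $a_i$ for $a_i \in A\setminus\{a^*\}$ is $O(\frac{1}{k})$.

PROOF: Define difference terms $Z^i_k$ as used above: $Z^i_k = \mu^i_k + b^i_k - \mu^*_k - b^*_k$. Also note that $E(\phi^i_{k+1}) \le (1 - \frac{1}{k})E(e^{Z^i_k}) + \frac{1}{k\cdot|A|}$. Now condition on $Z_k \ge 0$ to obtain a bound on $\phi^i_{k+1}$ when $a_i \in A\setminus\{a^*\}$. Using [2, (7)-(9), Theorem 1] for $k \ge \frac{8\log k}{\Delta_i}$ (where, as before, $\Delta_i = \mu^* - \mu_i$), we have that $P(\mu^i_k + b^i_k \ge \mu^*_k + b^*_k) \le \frac{1}{k}$. To obtain a bound on the random variable $\exp\left(\mu^i_k + b^i_k - \mu^*_k - b^*_k\right)$, note that $\mu^i_k - \mu^*_k \le 1$ and that $E(c^i_k) \ge \frac{\log k}{|A|}$ due to the 'explore' term, i.e. $\phi^i_p \ge \frac{1}{p\cdot|A|}$ for $p \in \mathbb{Z}^+$.

Also, for the same reason, $E\left(\sqrt{\frac{\log k}{c^i_k}}\right) \le \sqrt{|A|}$. Thus, we have that $e^{\mu^i_k + b^i_k - \mu^*_k - b^*_k} \le e^{1 + \sqrt{|A|}}$. Hence, $E(\phi^i_{k+1}) \le (1 - \frac{1}{k})\left(e^{1 + \sqrt{|A|}}\cdot\frac{1}{k}\right) + \frac{1}{k\cdot|A|}$ and therefore, $E(\phi^i_{k+1}) \le \frac{e^{|A|}}{k}$. This proves the statement. □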

This confirms $(|A| - 1)\,e^{|A|}\log k$ as the upper bound for total expected regret till step $k$.

6. NUMERICAL RESULTS

We performed numerical experiments on all proposed algorithms using the computational software package SciLab. We assumed all rewards $\hat X_k \in (0,1)$ and that the rewards were being drawn uniformly from intervals $(A_i, B_i)$ s.t. $0 < A_i < B_i < 1$. We used the rule that $|A| = 5$, and $|\mu_i - \mu_j| < 0.3$, $\forall i, j \in \{1, \ldots, |A|\}$, thus placing all means $\mu_i$ close together and inducing a high […] require that $\beta_k = 1 + \frac{1}{\log k}$. We plot the average value of $\phi^*_T$; note in Figure 1 that using the diminishing stepsize $\beta_k$ results in up to 15% higher $\phi^*_T$.

To maintain well-posedness of the SMAB algorithms, we used the condition $|\mu_i - \mu^*| > 0.1$, $\forall i \in \{1, \ldots, |A|\}$. For each SMAB algorithm, we considered 1000 cases each of 5-armed bandits with $A_i$, $B_i$ chosen randomly and with proximity conditions on $\mu_i$ as given above. In each case, we calculated the number of pulls of the best arm and the number of pulls of the second-best arm within a total of 2000 pulls. For better averaging effects in the SAMWMIX and blind-SAMWMIX algorithms, we applied the update rule

$$\phi^i_{k+1} := (1 - \gamma_k)\frac{e^{\sum_{p=1}^{k}\eta_p\frac{X^i_p}{\phi^i_p}}}{\sum_{j=1}^{|A|}e^{\sum_{p=1}^{k}\eta_p\frac{X^j_p}{\phi^j_p}}} + \frac{\gamma_k}{|A|}.$$

This does not change the analyses presented in Theorems 2 and 3 above.
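One way to generate test bandits that respect the stated conditions is sketched below. This is not the authors' SciLab procedure: the direct construction, the numeric margins, and the function name `make_arms` are assumptions chosen so that every generated instance satisfies $0 < A_i < B_i < 1$, a gap above $0.1$ to the best mean, and pairwise mean differences below $0.3$.

```python
import random


def make_arms(n_arms=5, seed=0):
    """Return (A_i, B_i) reward intervals; arm 0 has the largest mean."""
    rng = random.Random(seed)
    mu_star = rng.uniform(0.45, 0.85)
    # Sub-optimal means sit between 0.11 and 0.29 below the best mean.
    means = [mu_star] + [rng.uniform(mu_star - 0.29, mu_star - 0.11)
                         for _ in range(n_arms - 1)]
    arms = []
    for mu in means:
        # The half-width is kept small enough that the interval stays inside (0, 1).
        half = rng.uniform(0.01, min(mu, 1.0 - mu) - 0.01)
        arms.append((mu - half, mu + half))
    return arms


arms = make_arms()
means = [(a + b) / 2.0 for a, b in arms]   # mean of a uniform reward on (a, b)
```

Because rewards are uniform on $(A_i, B_i)$, each arm's mean is the midpoint of its interval.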

We used an optimization module available in SciLab to compute the initial index $k_0$ for SOFTMIX as well as blind-SAMWMIX. We observed better stability of the algorithm when $\alpha = 0.5$ for blind-SAMWMIX. Also, $\beta = 0.2$ and $\alpha = 0.1$ were used for UCB1MIX. We averaged the number of pulls of the best arm as well as the second-best arm over the 1000 cases. Note the superior performance of SAMWMIX (Figure 2) and, even more so, of blind-SAMWMIX (Figure 3). Notice that only the first UCB1-like algorithm, UCB1MIX, performs better than SAMWMIX (Figure 4), yet slightly worse than blind-SAMWMIX.

[Figure: weight assigned to best action (vertical axis) against Iteration No. (horizontal axis), with curves for D-SAMW and SAMW.]

As regards computational complexity, we ran an optimized version of UCB1MIX where the sum $\mu^i_k + b^i_k$, as in (17) above, is not re-calculated at every $k$; rather, $\mu^i_{k_i} + b^i_{k_i}$ is used, where index $k_i$ is the […] the fraction

$$\frac{e^{\eta_{k_i}\delta_{k_i}(\mu^i_{k_i} + b^i_{k_i})}}{\sum_{j=1}^{|A|}e^{\eta_{k_j}\delta_{k_j}(\mu^j_{k_j} + b^j_{k_j})}}$$

is easier, as only one of the $|A|$ possible numerators has changed from iteration $k-1$. There are only $O(\log|A|)$ comparisons, due to the binary search employed in generating an action $\hat a_k$ from iterate $\phi_{k-1}$. In contrast, the UCB1 algorithm requires, in each iteration $k$, $O(|A|)$ comparisons to determine the maximum $\mu^i_{k-1} + b^i_{k-1}$ term. For $|A| > 50$, the total number of arithmetic/relational operations in UCB1 overtook UCB1MIX. We also compared the number of 'bad runs' for UCB1 vis-a-vis UCB1MIX, analogous to the comparison in Figure 1. A bad run was declared if $c^*_{5000} < \frac{5000}{|A|}$ or if $c^*_{5000} < 1.1\cdot\left(c^{2,*}_{5000}\right)$, where $c^{2,*}_{5000}$ stands for the number of pulls of the second-best arm in a total of 5000 pulls. In UCB1, 154 experiments out of a total of 1000 (around 15%) were bad runs, whereas only 1 out of 1000 experiments for UCB1MIX displayed this characteristic. However, whenever a good run was observed, UCB1 was much better in regret terms: in a good run (resp. bad run) it showed an average of 4641 (resp. 2) pulls of the best arm, compared to just 254 (resp. 200) in UCB1MIX.
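The binary-search draw mentioned above can be sketched with the standard-library `bisect` module. The function name `sample_arm` is hypothetical, and the cumulative sums are rebuilt on every call purely for clarity, which is itself an $O(|A|)$ step; only the lookup is logarithmic, and this sketch does not attempt the incremental bookkeeping of the optimized version described in the text.

```python
import bisect
import itertools


def sample_arm(phi, u):
    """Return the arm whose cumulative-probability interval contains u in [0, 1)."""
    cdf = list(itertools.accumulate(phi))    # running sums of phi
    arm = bisect.bisect_right(cdf, u)        # binary search, O(log |A|) comparisons
    return min(arm, len(phi) - 1)            # guards against rounding in the last sum


phi = [0.2, 0.3, 0.5]
draws = [sample_arm(phi, u) for u in (0.1, 0.25, 0.9)]
```

In practice `u` would come from a uniform random number generator; fixed values are used here so that the mapping from `u` to an arm is easy to follow.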

7. CONCLUSIONS AND FUTURE DIRECTIONS

In this paper we have proposed a horizon-independent version of Simulated Annealing with Multiplicative Weights (SAMW) by modifying the learning rate. We have also modified the existing SOFTMIX algorithm, a stochastic policy-based stochastic multi-armed bandit (SMAB) algorithm, to obtain the lowest possible logarithmic expected regret in the proposed SAMWMIX (the original SOFTMIX has log-squared regret). An inconvenience with both SOFTMIX and SAMWMIX was the need to specify an input parameter $d$, which we eliminated with Blind-SAMWMIX, although it obtains slightly worse than logarithmic regret as a result. Finally, we proposed UCB1MIX, a stochastic policy-based algorithm adapting the existing UCB1 to a Boltzmann exploration scheme like SAMWMIX. We have also given a description of the numerical experiments with each algorithm, comparing them with predecessor algorithms such as SAMW, SOFTMIX and UCB1. As part of future work, an algorithm that uses a tighter version of the inequality in (12) above is under development. Also, the SAMWMIX kernel appears to be of use for 'Contextual Bandits' (cf. [6, Chapter 4]), a category of bandit problems different from SMABs, and an algorithm for the same is also under development.

REFERENCES

1. S. Agrawal and N. Goyal, Analysis of Thompson sampling for the multi-armed bandit problem, in: Proc. Intl. Conf. on Learning Theory (COLT), (2012).

2. P. Auer, N. Cesa-Bianchi and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, 47 (2002a), 235-256.

3. P. Auer, N. Cesa-Bianchi, Y. Freund and R. E. Schapire, The non-stochastic multiarmed bandit problem, SIAM Journal of Computing, 32 (2002b), 48-77.

4. V. Borkar and S. Meyn, The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), 447-469.

5. V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint, Cambridge University Press and Hindustan Book Agency (Jointly Published) (2008).

6. S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and non-stochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, 5 (2012), 1-122.

7. N. Cesa-Bianchi and P. Fischer, Finite-time regret bounds for the multi-armed bandit problem, in: Proc. 15th International Conf. on Machine Learning (ICML) (1998).

8. D. Chakrabarti, R. Kumar, F. Radlinski and E. Upfal, Mortal multi-armed bandits, in: Proc. 25th International Conference on Machine Learning (ICML) (2008).

9. H. S. Chang, M. Fu, J. Hu and S. I. Marcus, An adaptive sampling algorithm for solving Markov decision processes, Operations Research, 53 (2005), 126-139.

10. H. S. Chang, M. C. Fu and S. I. Marcus, An asymptotically efficient algorithm for finite horizon stochastic dynamic programming problems, IEEE Transactions on Automatic Control, 52 (2007), 89-94.

11. V. Farias and R. Madan, The irrevocable multiarmed bandit problem, Operations Research, 59 (2011), 383-399.

13. A. Garivier and O. Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, in: Proc. Intl. Conf. on Learning Theory (COLT) (2011).

14. Y. Seldin, C. Szepesvari, P. Auer and Y. Abbasi-Yadkori, Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments, JMLR Workshop and Conference Proceedings, 24 (2012), 103-116.
