Multi armed bandits based on a variant of simulated annealing

© Indian National Science Academy. DOI: 10.1007/s13226-016-0184-5

MULTI-ARMED BANDITS BASED ON A VARIANT OF SIMULATED ANNEALING

Mohammed Shahid Abdulla∗ and Shalabh Bhatnagar∗∗

∗ IT and Systems Area, Indian Institute of Management, Kozhikode, India

∗∗ Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India

e-mails: [email protected], [email protected]

(Received 28 May 2015; accepted 16 September 2015)

A variant of Simulated Annealing termed Simulated Annealing with Multiplicative Weights (SAMW) has been proposed in the literature. However, convergence was dependent on a parameter $\beta(T)$, which was calculated a priori based on the total number of iterations $T$ the algorithm would run for. We first show the convergence of SAMW even when a diminishing stepsize $\beta_k \to 1$ is used, where $k$ is the index of iteration. Using this SAMW as a kernel, a stochastic multi-armed bandit (SMAB) algorithm called SOFTMIX can be improved to obtain the minimum-possible $\log$ regret, as compared to the $\log^2$ regret of the original. Another modification of SOFTMIX is proposed which avoids the need for a parameter that is dependent on the reward distribution of the arms. Further, a variant of SOFTMIX that uses a comparison term drawn from another popular SMAB algorithm called UCB1 is then described. It is also shown why the proposed scheme is computationally more efficient than UCB1, and an alternative to this algorithm with simpler stepsizes is also proposed. Numerical simulations for all the proposed algorithms are then presented.

Key words : Stochastic processes; applied probability; statistics; discrete optimization.

1. INTRODUCTION

An algorithm called Simulated Annealing with Multiplicative Weights (SAMW) was introduced in [10]. Though the objective in [10] was to compute optimal policies for Finite-Horizon Markov Decision Processes, a simplified description of this algorithm can be given as follows. In the setting of SAMW, a random reward $X_k^i$ is obtained at each step $k$ for each action $a_i$ belonging to a finite set $A$. An empirical mean of the reward for each action $a_i$, defined as $\mu_k^i = \frac{1}{k}\sum_{s=1}^{k} X_s^i$, is updated after step $k$. The $|A|$-size probability vector $\phi_k$ is now updated to produce a $\phi_{k+1}$,

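$\phi_k^j \to 0$ for actions $a_j \in A \setminus \{a^*\}$). We assume that a unique best action $a^*$ exists. The improvement step of $\phi_k$ is performed as:

$$\phi_{k+1}^i := \frac{\phi_k^i\, \beta^{\mu_k^i}}{\sum_{a_j \in A} \phi_k^j\, \beta^{\mu_k^j}}, \tag{1}$$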

where $\beta > 0$ is a small constant which [10] requires to be $\beta(T)$, a constant calculated a priori once the maximum iteration number $T$ (i.e., $k \le T$) is set. For a finite $T$, a constant $\beta$ would only result in approximate performance, as the proofs in [10] employed bounds that were 'asymptotically efficient', i.e., they held tightly only if $T \to \infty$. In §2 below, we eliminate this dependence of $\beta$ on $T$. Note that in the setting that SAMW dealt with, each action is to be employed to obtain a random reward (that follows an unknown, but fixed, distribution) at each step $k$. It is this protocol that is changed in the following section, to introduce our second problem.

1.1 Stochastic Multi-Armed Bandit

The Stochastic Multi-Armed Bandit (SMAB) problem has the goal of detecting the action $a^*$ that has the highest expected reward, as early as possible, or with as few applications as possible of the sub-optimal actions $b \in A \setminus \{a^*\}$. We assume in the following that there is a unique best action $a^*$. To explain this problem in brief: a finite set of actions, called arms, $A$ is available. Each arm $a_j$ in $A$ yields a reward drawn from a fixed distribution, whose mean is $\mu_j$. The player who pulls the arms maintains an empirical mean for each arm $j$, but has to discover as soon as possible the best arm $a^* \in A$ (the arm which corresponds to the highest mean $\mu^*$), so that this arm can continue to be pulled for maximum expected profit. We maintain a probability vector $\phi_k$ over $A$ that is updated iteratively to yield convergence (to a $\phi^*$) so that as $k \to \infty$, $\phi_k(a^*) \to 1$ and $\phi_k(a_j) \to 0$ for all $a_j \in A \setminus \{a^*\}$. One may consider this as a randomized stationary policy for a one-state MDP.

The expected regret of the algorithm is defined as the expected total loss incurred due to not playing the best arm $a^*$ in each iteration $k$. The arm played at iteration $k$, inferred according to one's algorithm, is $\hat a_k$ and the reward obtained by pulling this arm is $\hat X_k$. Thus $k \cdot \mu^* - \sum_{p=1}^{k} E\left(\hat X_p \mid F_{p-1}\right)$ is the expected regret of this algorithm, which samples arms $\hat a_k$ according to the probability iterate $\phi_k$. Here, $F_{p-1}$ is a $\sigma$-algebra, and contains all information available to the algorithm at iteration $p$. In §3, we employ the SAMW kernel inside a SMAB algorithm called SOFTMIX [7], and propose a new

of SOFTMIX was yet to be found.

Both SOFTMIX and SAMWMIX have the disadvantage that a parameter $d$ has to be provided a priori as input to the algorithm, satisfying the property that $d < \min_{a_i \in A \setminus \{a^*\}} \{\mu^* - \mu_i\}$. We eliminate this disadvantage in §4 by proposing a blind-SAMWMIX. While it behaves quite well in practice, blind-SAMWMIX has the disadvantage that it can only achieve $O(\log^{1+2\alpha}(k))$ regret, where $0 < \alpha < 0.5$ (with better properties for $\alpha$ closer to $0.5$). A popular log-regret algorithm for SMAB, called UCB1 (Upper Confidence Bound 1), is of a different type. Here, at step $k$, a confidence bound $b_k^i$ is calculated for each arm $i$ and the arm corresponding to $\arg\max_{a_i \in A} \{\mu_k^i + b_k^i\}$ is played as the 'winner arm' $\hat a_{k+1}$. Consequently, the empirical mean $\mu_{k+1}^j$ for $a_j = \hat a_{k+1}$ is re-computed to include the new sample $\hat X_{k+1}$, whereas all other empirical means are carried over, i.e. $\mu_{k+1}^j = \mu_k^j$ for $a_j \ne \hat a_{k+1}$. Thus it is not a randomized policy and performs an implicit explore-exploit trade-off using the bound $b_k^i$. Just as the SAMW kernel was used in SOFTMIX to produce SAMWMIX, we employ the UCB1 mechanism for our algorithm UCB1MIX in §5. The advantage lies in avoidance of the $O(|A|)$ complexity operation to find $\arg\max_{a_i \in A} \{\mu_k^i + b_k^i\}$.

1.2 Survey of SAMW and SMAB

The algorithm in (1) as described in [10] is drawn from the work in [12], where an optimal strategy for non-cooperative repeated bi-matrix zero-sum games is the objective. In [12] too, the quantity $\beta$ in (1) is constant and dependent on the total number of iterations $T$. Our choice of $\beta_k$ is a diminishing parameter with the further difference that $\beta_k > 1\ \forall k$. SAMW differs from Simulated Annealing in that it does not perform any local search in $A$ but updates the probability distribution $\phi_k$ over $A$. It also has a simpler tuning process.

As we explain in §6, there are computation as well as precision-related advantages over UCB1 for algorithms like SOFTMIX that use a Boltzmann exploration structure. The work in [14] summarizes the advantages that a stochastic policy-based algorithm (that additionally uses 'Importance Sampling', the device in (9) below) has over deterministic algorithms like UCB1. However, UCB1 remains popular in both extensions of SMABs as well as applications. Among extensions, recent work on SMAB has focused on using a norm in the Kullback-Leibler neighbourhood to produce UCB1's confidence bounds, bringing it into the nearly constant regime as in the KL-UCB of [13]. However, KL-UCB is optimal for Bernoulli arms (i.e. arms having reward $1$ with a Bernoulli bias probability), and also requires an input parameter $\alpha > 1$. For general distributions, there is the

is an input parameter too, this time $\alpha > 2$. We make specific comparisons with two other algorithms (EXP3 and Thompson Sampling) on the basis of their regret bounds in §3 below. A useful summary of stochastic multi-armed bandit algorithms' results can be found in [6, Chapters 1-2].

Of note among applications are the 'Mortal' bandits in [8], where there is the possibility that certain arms will not be available after an index $k$ or that new arms may be made available. This work also proposes a variant of UCB1 known as UCB1-K/C ($K$ and $C$ are parameters that are chosen a priori). The work in [8] applies the algorithm to a web-advertisement model, where pulling an arm corresponds to choosing a particular ad to be delivered on a given webpage, the reward being obtained when the viewer clicks this ad. A problem similar to [8] is the 'Irrevocable' Multi-Armed Bandit problem in [11], representing fashion retailers' procurement scenarios.

1.2.1 Contributions and Outline: Our contributions in this paper and the relevant sections that cover these are:

1. An algorithm with a decreasing stepsize $\beta_k$ used in (1) above, and a proof of asymptotic convergence for the same (§2). This is different from the asymptotically efficient algorithm proposed in [10].

2. Modification of the proposed asymptotically convergent algorithm in §3 to obtain the log-regret algorithm SAMWMIX, on the lines of the log-squared regret SOFTMIX in [7].

3. A 'blind' algorithm that does not require the input parameter $d$ which both SOFTMIX and SAMWMIX need (§4).

4. An algorithm that adapts the existing UCB1 algorithm to a Boltzmann exploration scheme and is numerically observed to require less computation and to have better 'precision', i.e., a higher probability of pulling the best arm a larger number of times (§5).

We also provide numerical results for all these algorithms (§6).

2. SAMW WITH DIMINISHING β

We propose the update step in SAMW to be:

$$\phi_{k+1}^i := \frac{\phi_k^i\, \beta_k^{\mu_k^i}}{\sum_{j=1}^{|A|} \phi_k^j\, \beta_k^{\mu_k^j}} \tag{2}$$

where $\mu_k^j$ is the $k$-sample mean corresponding to arm $a_j$, and is obtained from rewards $X_k^j \in [0,1]$ obtained by pulling arm $a_j$. In [10], the $\beta_k$ are held constant at a small $\beta > 1$ (the result shown, [10, Lemma 3.2], is that for a given number of iterations $T$, $\beta := \psi(T)$ and $\psi(T) \to 1$ as $T \to \infty$).
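The update (2) can be illustrated with a minimal sketch. The function names (`beta`, `samw_step`, `run_samw`) and the use of Bernoulli rewards are assumptions made for this illustration only; the text requires only rewards in $[0,1]$. Every arm is sampled once per iteration, as in the SAMW protocol, and the stepsize $\beta_k = 1 + 1/\log k$ is the one used in Theorem 1.

```python
import math
import random


def beta(k):
    """Diminishing base beta_k = 1 + 1/log(k); defined for k >= 2."""
    return 1.0 + 1.0 / math.log(k)


def samw_step(phi, mu, k):
    """One application of update (2): reweight phi by beta_k ** mu, then renormalise."""
    b = beta(k)
    weights = [p * b ** m for p, m in zip(phi, mu)]
    total = sum(weights)
    return [w / total for w in weights]


def run_samw(true_means, iterations, seed=0):
    """Run SAMW with diminishing beta_k on Bernoulli arms (illustrative assumption)."""
    rng = random.Random(seed)
    n = len(true_means)
    phi = [1.0 / n] * n          # uniform initialisation phi_1^j = 1/|A|
    sums = [0.0] * n
    for k in range(1, iterations + 1):
        # SAMW protocol: every arm is sampled once per iteration.
        for j, m in enumerate(true_means):
            sums[j] += 1.0 if rng.random() < m else 0.0
        mu = [s / k for s in sums]   # k-sample means
        if k >= 2:                   # beta_k is only defined for k >= 2
            phi = samw_step(phi, mu, k)
    return phi
```

Calling `run_samw([0.2, 0.5, 0.8], 2000)` returns a probability vector whose mass sits almost entirely on the third arm, which is the behaviour Theorem 1 describes asymptotically.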

Using the optimum policy $\phi^*$, define the Kullback-Leibler entropy term as $D_l^* \triangleq \sum_{j=1}^{|A|} \phi^*(j) \cdot \log\frac{\phi^*(j)}{\phi_l^j}$ which, since $\phi^*(a^*) = 1$ with $\phi^*(j) = 0$ for $j \ne a^*$, equals $-\log \phi_l^{a^*}$.

Theorem 1. For stepsize $\beta_k \triangleq 1 + \beta'_k = 1 + \frac{1}{\log k}$, $k \ge 2$, and update step (2), the average reward from the current policy $\phi_k$ is s.t. $\sum_{j=1}^{|A|} \phi_k^j \mu_k^j \to \mu^*$ as $k \to \infty$.

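Proof:

$$\begin{aligned}
D^*_{l+1} - D^*_l &= \sum_{j=1}^{|A|} \phi^*(j) \log\frac{\phi_l^j}{\phi_{l+1}^j} = \log\left(\frac{\sum_{j=1}^{|A|} \phi_l^j\, \beta_l^{X_l^j}}{\beta_l^{X_l^*}}\right) \\
&= X_l^*(-\log\beta_l) + \log \sum_{j=1}^{|A|} \phi_l^j\, \beta_l^{X_l^j} \\
&\le X_l^*(-\log\beta_l) + \log \sum_{j=1}^{|A|} \phi_l^j \left(1 + (\beta_l - 1) X_l^j\right) \\
&\le X_l^*(-\log\beta_k) + \log\Big(1 + (\beta_l - 1) \sum_{j=1}^{|A|} X_l^j \cdot \phi_l^j\Big),
\end{aligned}$$

where the first inequality uses $\beta^a \le 1 + (\beta - 1)a$, for $\beta > 1$, $a < 1$, and the second holds for a large $k$ s.t. $l < k$, since $1 < \beta_k < \beta_l$. Hence

$$X_l^* \le \frac{D_l^* - D_{l+1}^*}{\log\beta_k} + \frac{\beta'_l \bar X_l}{\log\beta_k},$$

with $\bar X_l \triangleq \sum_{j=1}^{|A|} \phi_l^j X_l^j$ and using $\log(1+a) \le a$, for $a < 1$. Then

$$\frac{1}{k}\sum_{l=1}^{k} X_l^* \le \frac{D_1^* - D_{k+1}^*}{k\log\beta_k} + \frac{1}{k\log\beta_k}\sum_{l=1}^{k} \beta'_l \bar X_l \tag{3}$$

is obtained by adding over all $l$ till $k$ and dividing by $k$.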

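Note that $(D_1^* - D_{k+1}^*) \le D_1^* = \log|A|$ due to the initialization $\phi_1^j = \frac{1}{|A|}$, and that $k\log\beta_k \to \infty$ as $k \to \infty$. Hence, $\frac{D_1^* - D_{k+1}^*}{k\log\beta_k} \to 0$ as $k \to \infty$. Since $\beta_k \to 1$, it is the case that $\phi_k \to \bar\phi$ for some probability mass function $\bar\phi$ over all actions. The term $\hat X_k \triangleq \frac{1}{k\log\beta_k}\sum_{l=1}^{k} \beta'_l \bar X_l$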

$$\hat X_{k+1} := \hat X_k + \frac{\beta'_{k+1}}{(k+1)\log\beta_{k+1}} \left( \bar X_{k+1} - \frac{(k+1)\log\beta_{k+1} - k\log\beta_k}{\beta'_{k+1}}\, \hat X_k \right).$$

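Let $E(\bar X)$ denote $\sum_{i=1}^{|A|} \bar\phi_i \mu_i$ and the martingale difference $M_{k+1}$ be such that $M_{k+1} = \bar X_{k+1} - E[\bar X_{k+1} \mid F_k]$. The sequence $M_{k+1}$ above is defined w.r.t. the sigma algebra $F_k = \sigma(\phi_l^i, \bar X_l, 1 \le l \le k, 1 \le i \le |A|)$. One can then rewrite the above recursion as:

$$\hat X_{k+1} := \hat X_k + \frac{\beta'_{k+1}}{(k+1)\log\beta_{k+1}} \cdot \left( E[\bar X_{k+1} \mid F_k] - \frac{(k+1)\log\beta_{k+1} - k\log\beta_k}{\beta'_{k+1}}\, \hat X_k + M_{k+1} \right) \tag{4}$$

$$:= \hat X_k + \hat\beta_{k+1} \left( E[\bar X_{k+1} \mid F_k] - \bar\beta_{k+1} \hat X_k + M_{k+1} \right), \tag{5}$$

where $\hat\beta_{k+1} \triangleq \frac{\beta'_{k+1}}{(k+1)\log\beta_{k+1}}$ and $\bar\beta_{k+1} \triangleq \frac{(k+1)\log\beta_{k+1} - k\log\beta_k}{\beta'_{k+1}}$, respectively. Now observe that (i) $\sum_{l=1}^{k} \hat\beta_l \to \infty$ as $k \to \infty$ and (ii) $\sum_{l=1}^{\infty} \hat\beta_l^2 < \infty$. By letting $\bar\beta_0 = \bar\beta_1 = 0.1$, it is easy to see that $0.1 \le \bar\beta_k < 1,\ \forall k$, and that $\bar\beta_k \le \bar\beta_{k+1},\ \forall k$. Further, $\bar\beta_k \to 1$ as $k \to \infty$.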

Define a sequence $\{t(n)\} \subset [0,\infty)$ according to $t(0) = 0$ and $t(n) = \sum_{m=0}^{n-1} \hat\beta_m$, $n \ge 1$. Let $\bar\beta(t)$, $t \ge 0$, be defined by $\bar\beta(t(k)) = \bar\beta_{k+1}$, $k \ge 0$, with linear interpolation (between end points) for $t \in [t(k), t(k+1)]$, $k \ge 0$. Note from the construction that $\bar\beta(t) \in [0.1, 1]\ \forall t \ge 0$. Consider now the following ODE associated with (5):

$$\dot{\hat X}(t) = h_t(\hat X(t)) \triangleq \left(E[\bar X] - \bar\beta(t)\hat X(t)\right). \tag{6}$$

We shall now apply a key result from [4] (cf. Theorems 2.1-2.2) (alternatively, Theorem 7, Chapter 3.2 of [5]) that is however for the case of a time-homogeneous objective function $h(\hat X(t))$, unlike ours (which depends explicitly on $t$). As can be seen, the result of [4] carries over easily to our case as well. Note that $\forall X, Y \in \mathbb{R}$,

$$|h_t(X) - h_t(Y)| \le \bar\beta(t)|X - Y| \le |X - Y|.$$

Thus, $h_t(X)$ is Lipschitz continuous in $X$, uniformly in $t$. Note also that $M_{k+1} = \bar X_{k+1} - E[\bar X_{k+1} \mid F_k]$, $k \ge 0$, is a mean-zero process w.r.t. $F_k$ and that $M_{k+1}^2$, $k \ge 0$, is bounded above by $1$.

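Further, the step-sizes $\hat\beta_l$, $l \ge 1$, satisfy (i) and (ii) above. Now, let

$$h_{t,r}(X) \triangleq \frac{h_t(rX)}{r} = \frac{E[\bar X]}{r} - \bar\beta(t)X.$$

Thus, $h_{t,\infty}(X) \triangleq \lim_{r\to\infty} h_{t,r}(X) = -\bar\beta(t)X$. Since $\bar\beta(t) \ge 0.1 > 0$ (uniformly over $t$), it follows that the origin is the unique globally asymptotically stable equilibrium for the ODE $\dot{\hat X}(t) = h_{t,\infty}(\hat X(t))$. Note also that for any $0 < T < \infty$, $\int_0^T \beta(\tau)\,d\tau \le \beta(T)T < \infty$.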

In the light of the observations made in the previous paragraph, it is easy to verify that the sequence of results in Chapter 3.2 of [5], specifically those given in Lemmas 1-2, Corollary 3, Lemmas 4-6 and also Theorem 7 there, continue to hold in our setting. Thus, $\sup_k \|\hat X_k\| < \infty$ almost surely (from Theorem 7, Chapter 3.2 of [5]), thereby ensuring almost sure boundedness of the iterates in (5).

Consider now the following ODE in place of (6):

$$\dot{\hat X}(t) = E[\bar X] - \hat X(t), \tag{7}$$

that has $\hat X^* = E[\bar X]$ as its unique globally asymptotically stable equilibrium. Now rewrite (5) as follows:

$$\hat X_{k+1} = \hat X_k + \hat\beta_{k+1}\left(E[\bar X_{k+1} \mid F_k] - \hat X_k + M_{k+1} + \epsilon(k)\right),$$

where $\epsilon(k) = \hat X_k - \bar\beta_{k+1}\hat X_k$ is uniformly bounded by the foregoing and further, $\epsilon(k) \to 0$ as $k \to \infty$ almost surely (since $\bar\beta_{k+1} \to 1$ as $k \to \infty$). It now follows as a consequence of the third extension in Chapter 2.2 of [5] that $\hat X_k \to \hat X^*$ almost surely as $k \to \infty$. However, from (3), this implies that the empirical mean $\mu_k^*$ obtained from the best arm is s.t. $\mu_k^* \le E(\bar X)$ as $k \to \infty$. This cannot be true unless equality holds and therefore $\phi_k \to \phi^*$. □

Each arm $a_j$ in $A$ is sampled once per iteration $k$ and this contributes to the high sampling budget of the algorithm in (2) (as well as the original in [10]). This high sampling budget comes about because all sub-optimal actions $a_j$ are taken at each iteration $k$ before a high level of confidence to infer the best action $a^*$ is achieved at some iteration $k \ll K$. To mitigate this problem, we employ the structure of SAMW in the SOFTMIX algorithm of [7]. In SOFTMIX, $k$-sample means $\mu_k^i$ are used at the $k$-th step of the iteration (by a suitable construction), even though only a 'winner' action (call it $\hat a_k$) is actually taken at each iteration $k$. Such behaviour is typical of Stochastic Multi-Armed Bandit (SMAB) algorithms, where only one arm is pulled in each iteration $k$ of the algorithm.

3. LOG-REGRET SOFTMIX

We wish to obtain the effect of using $k$-sample means for each action $j$ while performing update $k$ in (2). This is achieved in a separate algorithm named SOFTMIX in [7] by considering reward $X_k^j = \frac{\hat X_k}{\phi_k^j}$ if arm $a_j$ is the winner arm $\hat a_k$ (i.e. the arm pulled after $\phi_{k-1}$ is sampled), and $0$ otherwise. Here $\hat X_k$ is the reward obtained by pulling the winner arm $\hat a_k$. A change in the SOFTMIX algorithm, by giving it an SAMW-like kernel (see (2) above), helps us to obtain logarithmic regret. Thus any sub-optimal action $a_i \in A \setminus \{a^*\}$ will be taken only $O(\log k)$ times in $k$ trials of the system. This has been proved to be the best possible performance of any algorithm designed for the SMAB problem, with SOFTMIX capable of only $O(\log^2 k)$ regret. Also note the remark in [15] that a diminishing step-size algorithm (termed 'decreasing $\epsilon$-greedy') like SOFTMIX that achieves logarithmic regret was yet to be found.

We summarize the SOFTMIX algorithm in the following. Define an indicator function $I_{k+1}^j$ which takes value $1$ for $a_j$ if $j$ is selected by sampling iterate $\phi_k$, with $I_{k+1}^j = 0$ for all other actions in $A \setminus \{a_j\}$. For each arm $i \in \{1, 2, \ldots, |A|\}$, perform the following updates:

$$\phi_{k+1}^i = (1 - \gamma_k)\frac{e^{\eta_k \hat s_k^i}}{\sum_{j=1}^{|A|} e^{\eta_k \hat s_k^j}} + \frac{\gamma_k}{|A|}, \tag{8}$$

$$\hat s_{k+1}^j = \hat s_k^j + \frac{\hat X_k}{\phi_k^j} I_k^j. \tag{9}$$

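The step-sizes $\gamma_k$ and $\eta_k$ are calculated as: $\gamma_k = 1$ for $k = 1, 2$,

$$\gamma_k = \min\left(1, \frac{5|A|\log(k-1)}{d^2 \cdot (k-1)}\right) \text{ for } k > 2,$$

$$\eta_k = \frac{1}{\frac{|A|}{\gamma_k} + 1}\log\left(1 + \frac{d\left(\frac{|A|}{\gamma_k} + 1\right)}{\frac{2|A|}{\gamma_k} - d^2}\right).$$

In the above, the quantity $d$ is a heuristic value input by the programmer and satisfies the condition that $0 < d < \min\{\Delta_i : a_i \ne a^*\}$. Here, $\Delta_i \triangleq \mu^* - \mu_i$ is the mean loss incurred upon taking action $a_i$.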
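The SOFTMIX step-sizes and the mixing rule (8) can be written out as a short sketch. The function names are hypothetical, and subtracting the largest score before exponentiating is a numerical-stability choice of this sketch that leaves the ratio in (8) unchanged; it is not something stated in the text.

```python
import math


def softmix_gamma(k, n_arms, d):
    """Exploration rate: 1 for k = 1, 2, then min(1, 5|A| log(k-1) / (d^2 (k-1)))."""
    if k <= 2:
        return 1.0
    return min(1.0, 5.0 * n_arms * math.log(k - 1) / (d * d * (k - 1)))


def softmix_eta(gamma, n_arms, d):
    """Learning rate eta_k = (1/c) log(1 + d c / sigma2), c = |A|/gamma + 1."""
    c = n_arms / gamma + 1.0
    sigma2 = 2.0 * n_arms / gamma - d * d
    return math.log(1.0 + d * c / sigma2) / c


def softmix_probs(s_hat, k, d):
    """Mixing rule (8) applied to the cumulative importance-weighted scores s_hat."""
    n = len(s_hat)
    gamma = softmix_gamma(k, n, d)
    eta = softmix_eta(gamma, n, d)
    top = max(s_hat)  # shift for numerical stability; the softmax ratio is unchanged
    weights = [math.exp(eta * (s - top)) for s in s_hat]
    z = sum(weights)
    return [(1.0 - gamma) * w / z + gamma / n for w in weights]
```

Every arm keeps probability at least $\gamma_k / |A|$, which is what keeps the importance weights $\hat X_k / \phi_k^j$ in (9) bounded.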

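The proposed algorithm SAMWMIX is:

$$\phi_{k+1}^i = (1 - \gamma_k)\frac{e^{\sum_{s=1}^{k} \eta_s \hat X_s^i}}{\sum_{j=1}^{|A|} e^{\sum_{s=1}^{k} \eta_s \hat X_s^j}} + \frac{\gamma_k}{|A|}. \tag{10}$$

Here, $\hat X_k^i$ is obtained using the reward-modification scheme of SOFTMIX, i.e. $\hat X_k^i \triangleq \frac{\hat X_k}{\phi_k^i} I_k^i$. The stepsizes $\gamma_k$ and $\eta_k$ are computed as follows:

$$\gamma_k = \min\left(1, \frac{5|A|}{d^2 \cdot k}\right). \tag{11}$$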

The stepsize $\eta_k$ is the same as in SOFTMIX. Note the absence of $\log(k-1)$ from the numerator in the expression for $\gamma_k$ above. Also observe the term $\frac{\gamma_k}{|A|}$ in (10), which is essentially an 'explore' component that was missing in the original SAMW algorithm in (2).
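A minimal simulation sketch of SAMWMIX follows. It assumes Bernoulli arms, and it samples the arm at step $k$ from the current iterate before applying $\eta_k$ and $\gamma_k$; both are simplifications made for this illustration, and the function name `samwmix` is hypothetical. The running `score` holds $\sum_s \eta_s \hat X_s^i$, so each past sample keeps the scale factor it was given when it arrived.

```python
import math
import random


def samwmix(true_means, d, pulls, seed=0):
    """Sketch of update (10) with gamma_k from (11); returns final phi and pull counts."""
    rng = random.Random(seed)
    n = len(true_means)
    score = [0.0] * n      # running sum over s of eta_s * Xhat_s^i
    counts = [0] * n
    phi = [1.0 / n] * n
    for k in range(1, pulls + 1):
        gamma = min(1.0, 5.0 * n / (d * d * k))
        c = n / gamma + 1.0
        sigma2 = 2.0 * n / gamma - d * d
        eta = math.log(1.0 + d * c / sigma2) / c
        # Only one arm is pulled per step, sampled from the current iterate.
        arm = rng.choices(range(n), weights=phi)[0]
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Importance-weighted reward: Xhat / phi for the pulled arm, 0 for the rest.
        score[arm] += eta * reward / phi[arm]
        top = max(score)   # shift for numerical stability only
        weights = [math.exp(s - top) for s in score]
        z = sum(weights)
        phi = [(1.0 - gamma) * w / z + gamma / n for w in weights]
    return phi, counts
```

For example, `samwmix([0.2, 0.5, 0.9], d=0.3, pulls=5000)` uses a value of $d$ below the smallest gap $0.4$, as the condition on $d$ requires.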

The proposed algorithm (10) performs a gradual weighting of samples $\{\hat X_s^i\}_{s=1}^{k}$ with all scale-factors $\{\eta_s\}_{s=1}^{k}$, whereas SOFTMIX performs a multiplication of the entire sum $\sum_{s=1}^{k} \hat X_s^i$ with $\eta_k$. Due to the similarities, a comparison of the time-varying algorithm EXP3 in [14] (based on the original EXP3 in [3]) with the proposed SAMWMIX is also in order:

• The high-probability result in [14] bounds the regret as $O(\sqrt{k})$ in $k$ plays of the bandit.

• The companion algorithm EXP3ELM (read as 'EXP3 with Action Elimination') in [14] does achieve a logarithmic regret bound. However, in all the typical experiments no action elimination takes place and the algorithm is treated as being on par with EXP3 itself.

A brief comparison with the stochastic policy-based method in [1], which uses Thompson Sampling and is the first to show logarithmic expected regret, is also due. Using $d$ from (11) above, for the general $N$-armed bandit problem, the expected regret in [1] is proportional to $\frac{N}{d^4}$ whereas ours is proportional to $\frac{N}{d^2}$ (see (15) below).

We now give a proof of logarithmic regret in SAMWMIX, and obtain an upper bound of $O(\frac{1}{k})$ on $E\{\phi_k^i\}$ for any suboptimal arm $a_i$. This implies logarithmic regret, since $\sum_{p=1}^{k} \frac{\Delta_i}{p} = O(\log k)$. Define the $\sigma$-algebra $F_k^i$ as $\sigma(\hat X_p^i, 1 \le p \le k)$, $k \ge 1$. In the following theorems, we let $\phi_k^i = P\{I_{k+1}^i = 1 \mid F_k\}$.

Theorem 2. The probability of selecting action $a_i \in A \setminus \{a^*\}$ at step $k$ of (10) is s.t. $E(\phi_k^i) = O\left(\frac{1}{k}\right)$.

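PROOF: The early part of our treatment is similar to [7, Proof of Theorem 3.1, eq. (10)]:

$$\phi_k^i \le (1 - \gamma_k)\exp\Big(\sum_{p=1}^{k-1} \eta_p(\hat X_p^i - \hat X_p^*)\Big) + \frac{\gamma_k}{|A|}. \tag{12}$$

Now consider $Z_p^i \triangleq \hat X_p^i - \hat X_p^* + \Delta_i$ and note that $E\{Z_p^i \mid F_{p-1}\} = 0$. Also, due to the form of $\hat X_p^i = \frac{\hat X_p}{\phi_p^i} I_p^i$, we have for $c_p = \frac{|A|}{\gamma_p} + 1$ the inequality $Z_p^i \le c_p$. Using the same form of $\hat X_p^i$, we obtain that $E\{(Z_p^i)^2 \mid F_{p-1}\} \le \frac{2|A|}{\gamma_p} - \Delta_i^2 \le \sigma_p^2 \triangleq \frac{2|A|}{\gamma_p} - d^2$. Since $c_p > 0$, for the function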

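Thus, we also have for each $p$:

$$\begin{aligned}
E\{e^{\eta_p Z_p^i} \mid F_{p-1}\} &\le E\{1 + \eta_p Z_p^i + (Z_p^i)^2 \phi_{c_p}(\eta_p) \mid F_{p-1}\} \\
&\le 1 + \sigma_p^2 \phi_{c_p}(\eta_p) \le e^{\sigma_p^2 \phi_{c_p}(\eta_p)}.
\end{aligned}$$

Thus, we have that:

$$E(\phi_k^i) \le \frac{\gamma_k}{|A|} + (1 - \gamma_k)\exp\Big(-\sum_{p=1}^{k-1}\left(\eta_p d - \xi_{c_p}(\eta_p)\sigma_p^2\right)\Big). \tag{13}$$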

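Consider $K_p \triangleq d$ and $\sigma_p^2 \triangleq \frac{2|A|}{\gamma_p} - d^2$, making each term in the sum on the RHS above equal to $-K_p\eta_p + \xi_{c_p}(\eta_p)\sigma_p^2$. Now replace, as per the definition in [7, Fact 5.1], $\xi_{c_p}(\eta_p) = \frac{e^{c_p\eta_p} - 1 - c_p\eta_p}{c_p^2}$. From [7, Proof of Theorem 3.1], applying $\log(1 + x) \ge \frac{2x}{2 + x}$ to $\eta_p$ (which can be represented as $\frac{1}{c_p}\log\left(1 + \frac{K_p c_p}{\sigma_p^2}\right)$), we get that $\eta_p \ge \frac{2K_p}{2\sigma_p^2 + c_p K_p}$. Consequently, $\xi_{c_p}(\eta_p) \le \frac{e^{c_p\eta_p} - 1}{c_p^2} - \frac{2K_p}{c_p(2\sigma_p^2 + c_p K_p)}$, where $e^{c_p\eta_p} - 1 = \frac{K_p c_p}{\sigma_p^2}$ due to the representation of $\eta_p$ as $\frac{1}{c_p}\log\left(1 + \frac{K_p c_p}{\sigma_p^2}\right)$. Subtraction among the two fractions above yields $\xi_{c_p}(\eta_p) \le \frac{K_p^2}{\sigma_p^2(2\sigma_p^2 + c_p K_p)}$.

For each term in the sum on the RHS of (13), we have: $-K_p\eta_p + \xi_{c_p}(\eta_p)\sigma_p^2 \le \frac{-K_p^2}{2\sigma_p^2 + c_p K_p}$. Rewrite (13) as $E(\phi_k^i) \le \exp\left(-\sum_{p=1}^{k-1} \frac{K_p^2}{2\sigma_p^2 + c_p K_p}\right) + \frac{\gamma_k}{|A|}$.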

After substituting for $c_p$, $K_p$, $\sigma_p^2$ and choosing an upper bound of $5|A|$, with $p \ge P$ for a finite $P$, for the resulting denominator $4|A| + d|A| + \gamma_p(d - 2d^2)$, we have:

$$\frac{K_p^2}{2\sigma_p^2 + c_p K_p} \ge \frac{d^2\gamma_p}{5|A|}, \text{ and,} \tag{14}$$

$$E(\phi_k^i) \le c \cdot \exp(-\log(k-1)) + \frac{5|A|}{d^2 k}. \tag{15}$$

Constant $c$ represents regret accumulated for indices $p < P$, and step-size $\gamma_p$ is inferred using (11) above. Thus, $E(\phi_k^i) = O(\frac{1}{k})$. □

This result gives the expected value of regret for any iteration $k$ (and not only as $k \to \infty$), and is achieved without use of any concentration inequalities.
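The chain of inequalities leading to (14) can be checked numerically. The sketch below, with hypothetical function names, evaluates the per-step exponent $-K_p\eta_p + \xi_{c_p}(\eta_p)\sigma_p^2$, the intermediate bound $-K_p^2/(2\sigma_p^2 + c_pK_p)$ and the final bound $-d^2\gamma_p/(5|A|)$ for the SAMWMIX step-sizes, so that the ordering of the three quantities can be inspected for chosen values of $|A|$, $d$ and $p$. It is a sanity check under the stated definitions, not a substitute for the proof.

```python
import math


def xi(c, eta):
    """xi_c(eta) = (exp(c*eta) - 1 - c*eta) / c^2, as in [7, Fact 5.1]."""
    return (math.exp(c * eta) - 1.0 - c * eta) / (c * c)


def exponent_bounds(n_arms, d, p):
    """Return (term, mid, low): the exponent in (13) and its two successive upper bounds."""
    gamma = min(1.0, 5.0 * n_arms / (d * d * p))   # gamma_p from (11)
    c = n_arms / gamma + 1.0
    sigma2 = 2.0 * n_arms / gamma - d * d
    K = d
    eta = math.log(1.0 + K * c / sigma2) / c
    term = -K * eta + xi(c, eta) * sigma2
    mid = -K * K / (2.0 * sigma2 + c * K)
    low = -d * d * gamma / (5.0 * n_arms)
    return term, mid, low
```

Once $\gamma_p = 5|A|/(d^2 p)$ is below $1$, the last quantity equals $-1/p$, which is where the $\exp(-\log(k-1))$ factor in (15) comes from.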

Modify the stepsize $\eta_k = \frac{1}{c_k}\log\left(1 + \frac{d c_k}{\sigma_k^2}\right)$ such that $c_k = 2|A|/\gamma_k + 1$ and $\sigma_k^2 = c_k - 1$. Log-regret holds as the inequalities $Z_k^i \le c_k$ and $E\{(Z_k^i)^2 \mid F_{k-1}\} \le \sigma_k^2$ in the proof above continue to

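Theorem 3. For $k \le P$, using $\eta_k = \frac{1}{c_k}\log\left(1 + \frac{d c_k}{\sigma_k^2}\right)$ with $c_k = \frac{2|A|}{\gamma_k} + 1$, $\sigma_k^2 = \frac{2|A|}{\gamma_k}$ and $\gamma_k = \min\left(1, \frac{5|A|}{d^2 k}\right)$, the quantity $\phi_k^i$ for $a_i \in A \setminus \{a^*\}$ in (10) is such that

$$\frac{\gamma_k}{|A|} \le \phi_k^i \le (1 - \gamma_k) \cdot \frac{1}{k} + \frac{\gamma_k}{|A|},$$

with probability $1 - k\,\alpha_P^k$ for an $\alpha_P$ s.t. $0 < \alpha_P < 1$.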

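PROOF: The LHS of the inequality is easy to observe: it is the exploration probability for action $a_i$. Observe from [7, eq. (9)] that $\phi_k^i \le (1 - \gamma_k)\exp\left(\sum_{s=1}^{k-1} \eta_s(\hat X_s^i - \hat X_s^*)\right) + \frac{\gamma_k}{|A|}$, where we define $\hat Z_s^i \triangleq \hat X_s^i - \hat X_s^*$. Using the Markov inequality, we have $P\{\exp(\sum_{s=1}^{k-1} \eta_s \hat Z_s^i) > \frac{1}{k}\} \le k\,E\{\exp(\sum_{s=1}^{k-1} \eta_s \hat Z_s^i)\}$. Consider another random variable $Z_{k-1}^i = \prod_{s=1}^{k-1} \exp(\eta_s \hat Z_s^i)$ and note that $Z_{k-1}^i = \hat Z_{k-1}^i \cdot Z_{k-2}^i$. Also, $E(Z_{k-1}^i) \triangleq E(Z_{k-1}^i \mid F_{k-2})$ where $F_{k-1}$ is $\sigma(\hat X_p^i, 1 \le p \le k-2)$, $k \ge 3$.

From the proof of Theorem 2, we have that $E\{\hat Z_{k-1}^i \mid F_{k-2}\} = -\Delta_i$, $E\{(\hat Z_{k-1}^i)^2 \mid F_{k-2}\} \le \frac{2|A|}{\gamma_{k-1}}$ and $\hat Z_{k-1}^i \le \frac{|A|}{\gamma_{k-1}}$, hence $\frac{1}{c_{k-1}}\hat Z_{k-1}^i \le \frac{1}{2}$. Further, since $\log\left(1 + \frac{d c_{k-1}}{\sigma_{k-1}^2}\right) = \log\left(1 + \left(1 + \frac{\gamma_{k-1}}{2|A|}\right)d\right) < 1$ for a small $d$, we have that $|\eta_{k-1}\hat Z_{k-1}^i| \le 1$. Use the inequality $e^a \le 1 + a + a^2$ for $|a| \le 1$ to obtain:

$$E(Z_k^i) \le E\left(1 + \eta_{k-1}\hat Z_{k-1}^i + \eta_{k-1}^2(\hat Z_{k-1}^i)^2\right) \cdot Z_{k-1}^i$$

$$E(Z_k^i) \le \left(1 - \eta_{k-1}\Delta_i + \eta_{k-1}^2\frac{2|A|}{\gamma_{k-1}}\right) \cdot Z_{k-1}^i \tag{16}$$

In the above, $\eta_{k-1}\left(\Delta_i - \eta_{k-1}\frac{2|A|}{\gamma_{k-1}}\right) \ge \eta_{k-1}\left(d - \frac{d}{\sqrt{d/\xi_{k-1} + 1}}\right) \ge \bar\alpha_P$, where $\xi_{k-1} = \frac{\sigma_{k-1}^2}{c_{k-1}}$ and $\bar\alpha_P > 0$. To see this, notice that $\eta_{k-1}\frac{2|A|}{\gamma_{k-1}} = \xi_{k-1}\log\left(1 + \frac{d}{\xi_{k-1}}\right)$ where $\xi_{k-1} < 1$, and then employ the inequality $\log(1 + a) \le \frac{a}{\sqrt{1 + a}}$ for $a > 0$. Thus, with $\alpha_P = 1 - \bar\alpha_P$, the above (16) becomes $E(Z_k^i) \le \alpha_P \cdot Z_{k-1}^i$. Thus the adverse probability $P\{\exp(\sum_{s=1}^{k-1} \eta_s \hat Z_s^i) > \frac{1}{k}\}$ is less than $k\,\alpha_P^k$. □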

4. BLIND SAMWMIX

In SAMWMIX, an intelligent guess of $d$ was still needed as input to the stepsize $\gamma_k$. Let us suppose we have a simple technique to avoid guessing $d$ and hence we choose $\gamma_k = \frac{5|A|\log k}{k}$. Then, using (14), for all $k > e^{1/d^2}$, we would have that $\frac{d^2\log k}{k} \ge \frac{1}{k}$. However, this would result in log-squared regret due to a trailing term $\frac{5|A|\log k}{k}$ corresponding to $\frac{5|A|}{d^2 k}$ in (15). We propose an alternative where $d$ is not needed as input, and nearly log-regret is achieved. We retain the update step (10), and define $c_p = 1 + \frac{|A|}{\gamma_p}$ and $\sigma_p^2 = \frac{2|A|}{\gamma_p}$. With $\alpha \in (0, 0.5)$, we use modified step-sizes $\gamma_p$ and $\eta_p$:

$$\gamma_p = \min\left(1, 5|A| \cdot \frac{\log^{2\alpha} p}{p}\right), \qquad \eta_p = \frac{1}{c_p}\log\left(1 + \frac{1}{\log^{\alpha} p} \cdot \frac{c_p}{\sigma_p^2}\right).$$

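Theorem 4. Using stepsizes $\gamma_p$ and $\eta_p$, the probability of selecting an action $a_i \in A \setminus \{a^*\}$ at step $k$ of (10) is such that $E(\phi_k^i) = O\left(\frac{\log^{2\alpha} k}{k}\right)$.

PROOF: We define a diminishing stepsize $K_p \triangleq \frac{1}{\log^{\alpha} p}$. Thus we note, as regards the term within the summation on the RHS of (13) above, that for $p \ge P$:

$$-\eta_p d + \xi_{c_p}(\eta_p)\sigma_p^2 - \xi_{c_p}(\eta_p)d^2 \le -\eta_p K_p + \xi_{c_p}(\eta_p)\sigma_p^2,$$

where we use the fact that $\xi_{c_p}(\eta_p) \ge 0$ and $P = e^{d^{-1/\alpha}}$. As in the proof of Theorem 2, we can bound the quantity on the RHS by $\frac{-K_p^2}{2\sigma_p^2 + c_p K_p}$ and further by $\frac{\gamma_p \cdot \log^{-2\alpha} p}{5|A|}$. Now substitute for $\gamma_p$ to obtain $\frac{1}{p}$ as the upper-bound for this term. However, the 'explore' term $\frac{\gamma_k}{|A|}$ in (10) results in $E(\phi_k^i) = O\left(\frac{\log^{2\alpha} k}{k}\right)$ as in the statement of this theorem. □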

Thus the regret accumulated is $O(\log(k)^{1+2\alpha})$, which can be made arbitrarily close to log-regret by choosing a small $\alpha$. However, the smaller $\alpha$ is, the larger the iterate index $P$ will be (the bounds derived hold only for $p > P$).
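The trade-off in the choice of $\alpha$ can be made concrete with a small sketch of the blind step-sizes and of the threshold $P = e^{d^{-1/\alpha}}$ from the proof of Theorem 4. The function names are hypothetical, and the restriction to $p \ge 2$ is an assumption of this sketch, made because $\log^{-\alpha} p$ is undefined at $p = 1$.

```python
import math


def blind_gamma(p, n_arms, alpha):
    """gamma_p = min(1, 5|A| log^{2 alpha}(p) / p); no knowledge of d is needed."""
    if p < 2:
        raise ValueError("this sketch defines the step-sizes for p >= 2 only")
    return min(1.0, 5.0 * n_arms * math.log(p) ** (2.0 * alpha) / p)


def blind_eta(p, n_arms, alpha):
    """eta_p = (1/c_p) log(1 + K_p c_p / sigma_p^2) with K_p = log^{-alpha}(p)."""
    gamma = blind_gamma(p, n_arms, alpha)
    c = 1.0 + n_arms / gamma
    sigma2 = 2.0 * n_arms / gamma
    K = math.log(p) ** (-alpha)
    return math.log(1.0 + K * c / sigma2) / c


def threshold_index(d, alpha):
    """P = exp(d^(-1/alpha)): from this index on, K_p = log^{-alpha}(p) <= d."""
    return math.exp(d ** (-1.0 / alpha))
```

With $d = 0.5$, moving from $\alpha = 0.25$ to $\alpha = 0.2$ raises the threshold from $e^{16}$ to $e^{32}$; for much smaller $\alpha$ the threshold exceeds the floating-point range, which illustrates how quickly $P$ grows as $\alpha$ shrinks.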

5. UCB1-LIKE SOFTMIX

The log-regret algorithm UCB1 in [2] has been applied in the past to finite-horizon MDPs (cf. [9]), in order to reduce the number of times simulated transitions are made to arrive at an optimal policy. We now propose a way to use the UCB1 of [2] within the SOFTMIX algorithm, thereby achieving log-regret and also avoiding the search for a maximum among $|A|$ values needed in each iteration $k$ of UCB1. Finding a maximum requires $|A|$ comparisons, although there exist optimizations. Define the $c_k^i$-sample mean for arm $i$ as $\mu_k^i = \frac{1}{c_k^i}\sum_{p=1}^{k} \hat X_p I_p^i$, and a term similar to the UCB1 'confidence' term $b_k^i \triangleq \sqrt{\frac{2\log k}{c_k^i}}$, where $c_k^i = \sum_{p=1}^{k} I_p^i$ is the number of times arm $i$ has been played. The method UCB1 adopts is to find the maximum among $\{\mu_k^i + b_k^i\}$, $\forall i \in A$, at each step $k$ and to play this action as $\hat a_{k+1}$.
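For reference, the UCB1 rule just described can be sketched as follows. The Bernoulli reward model, the initial round in which every arm is pulled once, and the function name `ucb1` are assumptions of this illustration. The `max` over all arms is the $O(|A|)$ search per iteration that the proposed scheme avoids.

```python
import math
import random


def ucb1(true_means, pulls, seed=0):
    """Play the arm maximising mu_k^i + sqrt(2 log k / c_k^i); return pull counts."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n     # c_k^i: number of times arm i has been played
    sums = [0.0] * n     # cumulative reward of arm i
    for k in range(1, pulls + 1):
        if k <= n:
            arm = k - 1  # initial round: play each arm once so every c^i >= 1
        else:
            def index(i):
                return sums[i] / counts[i] + math.sqrt(2.0 * math.log(k) / counts[i])
            arm = max(range(n), key=index)   # O(|A|) comparisons per step
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

Unlike the randomized iterates $\phi_k$ used elsewhere in this paper, the choice of arm here is deterministic given the history.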

For the proposed algorithm UCB1MIX, we will use an 'explore' component similar to (10). Also note that the result obtained is a bound on the instantaneous regret, unlike UCB1 which bounds total regret. Consider step-sizes:

$$\delta_k = 2\log^{1+2\alpha} k, \quad b_k^i = \sqrt{\frac{1.5\log^{1+\alpha} k}{c_k^i}}, \quad S_k = \log^{-\alpha} k,$$

$$T_k = \frac{1}{k^2}, \quad \eta_k = \frac{1}{T_k}\log(1 + S_k T_k), \quad \gamma_k = \frac{(1+\beta)\log^{\beta} k}{k}.$$

Using the condition $0 < \alpha < \beta < 0.5$, our algorithm is:

$$\phi_{k+1}^i := (1 - \gamma_k)\frac{e^{\eta_k\delta_k(\mu_k^i + b_k^i)}}{\sum_{j=1}^{|A|} e^{\eta_k\delta_k(\mu_k^j + b_k^j)}} + \frac{\gamma_k}{|A|}. \tag{17}$$
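A sketch of how the iterate in (17) could be computed from the current sample means and pull counts is given below. The function name is hypothetical; the sketch assumes $k \ge 2$ (so that $\log^{-\alpha} k$ is defined) and that every arm has been pulled at least once. Subtracting the largest index before exponentiating is a numerical-stability choice that does not change the ratio in (17).

```python
import math


def ucb1mix_probs(mu, counts, k, alpha, beta):
    """Evaluate (17) for sample means mu and pull counts at step k (requires k >= 2)."""
    n = len(mu)
    log_k = math.log(k)
    delta = 2.0 * log_k ** (1.0 + 2.0 * alpha)
    S = log_k ** (-alpha)
    T = 1.0 / (k * k)
    eta = math.log1p(S * T) / T              # eta_k = (1/T_k) log(1 + S_k T_k)
    gamma = (1.0 + beta) * log_k ** beta / k
    index = [m + math.sqrt(1.5 * log_k ** (1.0 + alpha) / c)
             for m, c in zip(mu, counts)]
    top = max(index)                         # shift for numerical stability only
    weights = [math.exp(eta * delta * (v - top)) for v in index]
    z = sum(weights)
    return [(1.0 - gamma) * w / z + gamma / n for w in weights]
```

An arm would then be drawn from the returned vector rather than chosen by an explicit maximisation over the indices.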

Theorem 5. Assuming that $\mu_k^i \ge \delta > 0$, the probability of selecting an action $a_i \in A \setminus \{a^*\}$ at step $k$ of (17) is s.t. $E(\phi_k^i) = O\left(\frac{\log^{\beta} k}{k}\right)$.

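PROOF: Define $Z_k^i$ as $(\mu_k^i + b_k^i) - (\mu_k^* + b_k^*)$ and note that $\eta_k \approx S_k$ due to $S_k T_k \to 0$. Note that $E(c_k^i) \ge \log^{1+\beta} k$ for all $a_i \in A$, due to the $\frac{\gamma_k}{|A|}$ 'explore' term in (17). Consequently, $Z_k^i \le \mu_k^i + b_k^i \le 1$ once $c_k^i \ge \frac{1.5\log^{1+\alpha} k}{(1 - \mu_k^i)^2}$. If we assume that $\mu_k^i \ge \delta > 0$, $\forall k, i$, then $Z_k^i \le \mu_k^i + b_k^i \le 1$ is achieved within finite $k$ since $\beta > \alpha$.

Further, conditioning on $Z_k^i > 0$,

$$e^{\eta_k\delta_k Z_k^i} \le e^{2\log^{1+\alpha} k} \text{ since } \eta_k \approx S_k,$$

$$P(Z_k^i \ge 0) \le e^{-3\log^{1+\alpha} k} \text{ for } k > \frac{1.5\log^{1+\alpha} s}{(\Delta_i)^2}.$$

In the above, we have drawn from the analysis preceding [2, (6)]. Hence, after a finite $k$,

$$E(Z_k^i \mid Z_k^i > 0) \cdot P(Z_k^i > 0) \le e^{-\log^{1+\alpha} k} \le \frac{1}{k}.$$

We condition next on $Z_k^i \le 0$. For each action $a_i$, there exists a $\tilde d_i > 0$ s.t. $\tilde d_i\Delta_i \le 1$ and $(\tilde d_i + 1)\Delta_j \ge 1$. Now, note that $E(Z_k^i) \le -d_i$ for some $d_i > 0$ once $c_k^i \ge \frac{1.5\log^{1+\alpha} k}{(1 - \tilde d_i\Delta_i)^2}$. Thus, $e^{\eta_k\delta_k Z_k^i} \le \prod_{r=1}^{\lfloor\delta_k\rfloor} e^{\eta_k \cdot Z_k^i}$ and therefore, using [7, Fact 5.1], $e^{\eta_k Z_k^i} \le 1 + \eta_k Z_k^i + \xi_{T_k}(\eta_k) \cdot (Z_k^i)^2$. Applying expectations, we have that $E(e^{\eta_k Z_k^i}) \le 1 - \eta_k d_i + \xi_{T_k}(\eta_k)$ since $E((Z_k^i)^2 \mid F_{k-1}) \le 1$ for $k > k_0$. Thus, $E(e^{\eta_k Z_k^i}) \le e^{(-\eta_k S_k + \xi_{T_k}(\eta_k)\sigma_k^2)}$ and hence, $E(e^{\eta_k Z_k^i}) \le \exp\left(\frac{-S_k^2}{2 + S_k T_k}\right)$. Therefore, $E(e^{\eta_k\delta_k \cdot Z_k^i}) \le \exp\left(\frac{-\lfloor\delta_k\rfloor S_k^2}{2 + S_k T_k}\right)$. The condition $E((Z_k^i)^2 \mid F_{k-1}) \le 1$ is achieved for suitable $k > k_0$, i.e., $k > \max\left(\frac{1.5\log^{1+\alpha} k}{(1 - \mu_k^*)^2}, \frac{1.5\log^{1+\alpha} k}{(1 - \mu_k^i)^2}\right)$. We have used again the previous assumption that $\mu_k^i > \delta > 0$, $\forall i, k$. Now, the fact that $S_k T_k \to 0$ and $-\delta_k S_k^2 = -2\log k$ mean that $E(Z_k^i \mid Z_k^i \le 0) \cdot P(Z_k^i \le 0) \le \frac{1}{k}$. However, the 'explore' term in (17), $\frac{\gamma_k}{|A|} = \frac{(1+\beta)\log^{\beta} k}{|A| \cdot k}$, results in the statement of the theorem. □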

5.1 Alternative UCB1-like scheme

An alternative algorithm to obtain logarithmic regret in the manner of UCB1 but without the explicit maximization is given below. The advantage of this algorithm is the use of a single, simple stepsize $\gamma_k$, as well as the achievement of exact log-regret (although there is a scale-factor exponential in the number of arms $|A|$).

Consider $\gamma_k = \frac{1}{k}$ and confidence terms $b_k^i = \sqrt{\frac{\log k}{c_k^i}}$ to define this update step:

$$\phi_{k+1}^i := (1 - \gamma_k)\frac{e^{\mu_k^i + b_k^i}}{\sum_{j=1}^{|A|} e^{\mu_k^j + b_k^j}} + \frac{\gamma_k}{|A|}. \tag{18}$$

Theorem 6. Using algorithm (18), the probability $\phi_k^i$ of playing an action $a_i$, for $a_i \in A \setminus \{a^*\}$, is $O(\frac{1}{k})$.

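PROOF: Define difference terms $Z_k^i$ as used above: $Z_k^i = \mu_k^i + b_k^i - \mu_k^* - b_k^*$. Also note that $E(\phi_{k+1}^i) \le (1 - \frac{1}{k})E(e^{Z_k^i}) + \frac{1}{k \cdot |A|}$. Now condition on $Z_k \le 0$ to obtain a bound on $\phi_{k+1}^i$ when $a_i \in A \setminus \{a^*\}$. Using [2, (7)-(9), Theorem 1] for $k \ge \frac{8\log k}{\Delta_i}$ (where, as before, $\Delta_i = \mu^* - \mu_i$), we have that $P(\mu_k^i + b_k^i \ge \mu_k^* + b_k^*) \le \frac{1}{k}$. To obtain a bound on the random variable $\exp\left(\mu_k^i + b_k^i - \mu_k^* - b_k^*\right)$, note that $\mu_k^i - \mu_k^* \le 1$ and that $E(c_k^i) \ge \frac{\log k}{|A|}$ due to the 'explore' term, i.e. $\phi_p^i \ge \frac{1}{p \cdot |A|}$ for $p \in \mathbb{Z}^+$.

Also, for the same reason, $E\left(\sqrt{\frac{\log k}{c_k^i}}\right) \le \sqrt{|A|}$. Thus, we have that $e^{\mu_k^i + b_k^i - \mu_k^* - b_k^*} \le e^{1 + \sqrt{|A|}}$. Hence, $E(\phi_{k+1}^i) \le (1 - \frac{1}{k})\left(e^{1 + \sqrt{|A|}} \cdot \frac{1}{k}\right) + \frac{1}{k \cdot |A|}$ and therefore, $E(\phi_{k+1}^i) \le \frac{e^{|A|}}{k}$. This proves the statement. □

This confirms $(|A| - 1)e^{|A|}\log k$ as the upper-bound for total expected regret till step $k$.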

6. NUMERICAL RESULTS

We performed numerical experiments on all proposed algorithms using the computational software package SciLab. We assumed all rewards $\hat X_k \in (0,1)$ and that the rewards were being drawn uniformly from intervals $(A_i, B_i)$ s.t. $0 < A_i < B_i < 1$. We used the rule that $|A| = 5$, and $|\mu_i - \mu_j| < 0.3$, $\forall i, j \in \{1, \ldots, |A|\}$, thus placing all means $\mu_i$ close together and inducing a high

require that $\beta_k = 1 + \frac{1}{\log k}$. We plot the average value of $\phi_T^*$; note in Figure 1 that using the diminishing stepsize $\beta_k$ results in up to 15% higher $\phi_T^*$.

To maintain well-posedness of the SMAB algorithms, we used the condition $|\mu_i - \mu^*| > 0.1$, $\forall i \in \{1, \ldots, |A|\}$. For each SMAB algorithm, we considered 1000 cases each of 5-armed bandits with $A_i$, $B_i$ chosen randomly and with proximity conditions on $\mu_i$ as given above. In each case, we calculated the number of pulls of the best arm and the number of pulls of the second-best arm within a total of 2000 pulls. For better averaging effects in the SAMWMIX and blind-SAMWMIX algorithms, we applied the update rule

$$\phi_{k+1}^i := (1 - \gamma_k)\frac{e^{\sum_{p=1}^{k} \eta_p \frac{X_p^i}{\phi_p^i}}}{\sum_{j=1}^{|A|} e^{\sum_{p=1}^{k} \eta_p \frac{X_p^j}{\phi_p^j}}} + \frac{\gamma_k}{|A|}.$$

This does not change the analyses presented in Theorems 2 and 3 above.

We used an optimization module available in SciLab to compute the initial index $k_0$ for SOFTMIX as well as blind-SAMWMIX. We observed better stability of the algorithm when $\alpha = 0.5$ for blind-SAMWMIX. Also, $\beta = 0.2$ and $\alpha = 0.1$ were used for UCB1MIX. We averaged the number of pulls of the best arm as well as the second-best arm over the 1000 cases. Note the superior performance of SAMWMIX (Figure 2) and, even more so, of blind-SAMWMIX (Figure 3). Notice that only the first UCB1-like algorithm, UCB1MIX, performs better than SAMWMIX (Figure 4), yet slightly worse than blind-SAMWMIX.

[Figure: weight assigned to best action (vertical axis) versus iteration number (horizontal axis); curves: D-SAMW and SAMW]

As regards computational complexity, we ran an optimized version of UCB1MIX where the sum $\mu_k^i + b_k^i$, as in (17) above, is not re-calculated at every $k$; rather, $\mu_{k_i}^i + b_{k_i}^i$ is used, where index $k_i$ is the

the fraction $\frac{e^{\eta_{k_i}\delta_{k_i}(\mu_{k_i}^i + b_{k_i}^i)}}{\sum_{j=1}^{|A|} e^{\eta_{k_j}\delta_{k_j}(\mu_{k_j}^j + b_{k_j}^j)}}$ is easier as only one of the $|A|$ possible numerators has changed from iteration $k-1$. There are only $O(\log|A|)$ comparisons due to the binary search employed in generating an action $\hat a_k$ from iterate $\phi_{k-1}$. In contrast, the UCB1 algorithm requires, in each iteration $k$, $O(|A|)$ comparisons to determine the maximum $\mu_{k-1}^i + b_{k-1}^i$ term. For $|A| > 50$, the total number of arithmetic/relational operations in UCB1 overtook UCB1MIX. We also compared the number of 'bad runs' for UCB1 vis-a-vis UCB1MIX, analogous to the comparison in Figure 1. A bad run was declared if $c_{5000}^* < \frac{5000}{|A|}$ or if $c_{5000}^* < 1.1 \cdot \left(c_{5000}^{2,*}\right)$, where $c_{5000}^{2,*}$ stands for the number of pulls of the second-best arm in a total of 5000 pulls. In UCB1, 154 experiments out of a total of 1000 (around 15%) were bad runs, whereas only 1 out of 1000 experiments for UCB1MIX displayed this characteristic. However, whenever a good run is observed, UCB1 was much better in regret terms: in a good run (resp. bad run) it showed an average of 4641 (resp. 2) pulls of the best arm compared to just 254 (resp. 200) in UCB1MIX.
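One way to realise the combination described above, where a single numerator changes per iteration and an action is generated by binary search, is a Fenwick (binary indexed) tree over the unnormalised weights. This data structure is an assumption of this sketch; the text does not state which structure was used in the experiments, and the class and method names are hypothetical. Both the point update and the search take $O(\log|A|)$ operations.

```python
class FenwickSampler:
    """Prefix-sum tree over non-negative weights with O(log n) update and search."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        self.w = [0.0] * self.n
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, new_weight):
        """Replace the weight of arm i; only O(log n) tree nodes are touched."""
        delta = new_weight - self.w[i]
        self.w[i] = new_weight
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def total(self):
        """Sum of all weights (the normalising denominator)."""
        s, j = 0.0, self.n
        while j > 0:
            s += self.tree[j]
            j -= j & -j
        return s

    def find(self, target):
        """Return the arm whose cumulative-weight interval contains target."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return min(pos, self.n - 1)
```

To draw an action, one would call `find(u * sampler.total())` with `u` uniform on $[0,1)$, and after the pull call `update` for the single arm whose weight changed.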

7. CONCLUSIONS AND FUTURE DIRECTIONS

In this paper we have proposed a horizon-independent version of Simulated Annealing with Multiplicative Weights (SAMW) by modifying the learning rate. We have also modified the existing SOFTMIX algorithm, a stochastic policy-based stochastic multi-armed bandit (SMAB) algorithm, to obtain the lowest possible logarithmic expected regret in the proposed SAMWMIX (the original SOFTMIX has log-squared regret). An inconvenience with both SOFTMIX and SAMWMIX was the need to specify an input parameter $d$, which we eliminated with Blind-SAMWMIX, although it obtains slightly worse than logarithmic regret as a result. Finally, we proposed UCB1MIX, a stochastic policy-based algorithm adapting the existing UCB1 to a Boltzmann exploration scheme like SAMWMIX. We have also given a description of the numerical experiments with each algorithm, comparing them with predecessor algorithms such as SAMW, SOFTMIX and UCB1. As part of future work, an algorithm that uses a tighter version of the inequality in (12) above is under development. Also, the SAMWMIX kernel appears to be of use for 'Contextual Bandits' (cf. [6, Chapter 4]), a category of bandit problems different from SMABs, and an algorithm for the same is also under development.

REFERENCES

1. S. Agrawal and N. Goyal, Analysis of Thompson sampling for the multi-armed bandit problem, in: Proc. Intl. Conf. on Learning Theory (COLT) (2012).

2. P. Auer, N. Cesa-Bianchi and P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning, 47 (2002a), 235-256.

3. P. Auer, N. Cesa-Bianchi, Y. Freund and R. E. Schapire, The non-stochastic multiarmed bandit problem, SIAM Journal on Computing, 32 (2002b), 48-77.

4. V. Borkar and S. Meyn, The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM Journal on Control and Optimization, 38 (2000), 447-469.

5. V. S. Borkar, Stochastic approximation: a dynamical systems viewpoint, Cambridge University Press and Hindustan Book Agency (Jointly Published) (2008).

6. S. Bubeck and N. Cesa-Bianchi, Regret analysis of stochastic and non-stochastic multi-armed bandit problems, Foundations and Trends in Machine Learning, 5 (2012), 1-122.

7. N. Cesa-Bianchi and P. Fischer, Finite-time regret bounds for the multi-armed bandit problem, in: Proc. 15th International Conf. on Machine Learning (ICML) (1998).

8. D. Chakrabarti, R. Kumar, F. Radlinski and E. Upfal, Mortal multi-armed bandits, in: Proc. 25th International Conference on Machine Learning (ICML) (2008).

9. H. S. Chang, M. Fu, J. Hu and S. I. Marcus, An adaptive sampling algorithm for solving Markov decision processes, Operations Research, 53 (2005), 126-139.

10. H. S. Chang, M. C. Fu and S. I. Marcus, An asymptotically efficient algorithm for finite horizon stochastic dynamic programming problems, IEEE Transactions on Automatic Control, 52 (2007), 89-94.

11. V. Farias and R. Madan, The irrevocable multiarmed bandit problem, Operations Research, 59 (2011), 383-399.

13. A. Garivier and O. Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, in: Proc. Intl. Conf. on Learning Theory (COLT) (2011).

14. Y. Seldin, C. Szepesvari, P. Auer and Y. Abbasi-Yadkori, Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments, JMLR Workshop and Conference Proceedings, 24 (2012), 103-116.