Making a decision in multi-armed bandit problems involves considering the outcome of all future possible decisions which can quickly become unmanageable. An alternative is to use stochastic dynamic programming (Bellman, 1954) approaches which break the problem down into smaller, more manageable problems. Using dynamic programming, the optimal
CHAPTER 3. SEQUENTIAL DECISION PROBLEMS 42
solution to the multi-armed bandit problem reduces to an index problem. An index is a numerical value independently assigned to each of the arms and the policy is chosen by selecting the arm with the highest index. Gittins and Jones (1979) proved that the optimal policy for the infinite horizon multi-armed bandit problem, with discounted rewards where the rewards are conditionally independent given the action, is an index policy. They termed the index the dynamic allocation index but it is now commonly referred to as the Gittins index. Easier to follow proofs have since been given to show that Gittins index is optimal for multi-armed bandit problems (Tsitsiklis, 1994). One way the Gittins index can be derived is by considering a smaller independent problems for each arm where the arm is pulled and it must be decided adaptively when to stop pulling that arm. Upon stopping a fixed retirement cost is given. The Gittins index is the arm with the largest fixed retirement cost required to stop pulling the arm now (Frazier, 2011). The solution is reduced from solving a k-armed bandit problem to solving the Gittins indices for k one-armed bandits. For many problems these optimal policies are hard to compute, and for finite horizon problems Gittins indices are no longer optimal; the proof is fragile to changes in the assumptions. As a result, research has turned to heuristic methods which are not guaranteed to select the optimal policy at each time step to solve multi-armed bandit problems.
An alternative way to measure how well a decision policy performs is to minimise the cumulative regret (Robbins, 1952). Ifµ∗ is the maximum expected reward over theK arms then instead of maximising the cumulative reward, the multi-armed bandit problem can be thought of as minimising the cumulative regret:
Rn(θ) = n
X
t=1
(µ∗−µat). (3.2.1)
A basic objective given by Robbins (1952) is thatRn(θ)/n→0 as n→ ∞. Policies which
achieve this objective are said to be consistent (Lai and Robbins, 1985). Stronger conditions consider the worst case of regret. Lai and Robbins (1985) show that regret must grow at least logarithmically with time and provide a lower bound on the number of suboptimal draws in a good strategy for a single armed bandit. These are extended to multi-armed bandits
CHAPTER 3. SEQUENTIAL DECISION PROBLEMS 43
by Burnetas and Katehakis (1996). As well as showing that policies asymptotically reach this lower bound, the current standard in evaluating policies is to give an upper bound of the regret in finite time (Auer et al., 2002). The aim is to show that the regret is of
O(log(n)) but large constant multipliers mean this is quite a loose condition. Approaches to finding policies include upper confidence bound methods, interval estimation and Thompson sampling. In the following sections, different approaches to heuristic policies are discussed. These policies are heuristic as they do not guarantee selecting the optimal policy at each time step, unlike the Gittins index. However, they can be applied to a far wider range of problems, for which no optimal policy exists, and have been shown in many cases to reach the lower bound on the number of sub optimal draws. There are also many more general heuristic methods which can be applied to a broader class of optimal learning algorithms (Vermorel and Mohri, 2005; Luce, 2005).
3.2.1 Semi-uniform Methods
Multi-armed bandit problems are an exploration-exploitation problem. Methods which dis- tinguish between exploration steps and exploitation steps are called semi-uniform methods. The simplest and most widely used semi-uniform method is the-greedy policy (Watkins, 1989). The-greedy policy randomly chooses an arm with probabilityand otherwise makes a greedy choice, selecting the arm with the highest expected reward:
at=
maxj=1,...,Kµj,t with probability 1−
random choice with probability
. (3.2.2)
The choice of∈(0,1) is left to the user; a smaller value ofresults in less exploration. The
-greedy policy is a sub-optimal policy because the constant factor prevents the strategy from getting arbitrarily close to the optimal action (Vermorel and Mohri, 2005). Variants of the-greedy policy have been developed, such as the-decreasing strategy which consists of using a decreasing getting arbitrarily close to the optimal strategy asymptotically. Auer et al. (2002) shows that, under certain constraints on how decreases, this achieves an
CHAPTER 3. SEQUENTIAL DECISION PROBLEMS 44
O(log(T)) regret.
Semi-uniform strategies are simple heuristic methods. Methods which assign a probability to selecting each possible arm may perform better. These methods give arms with higher expected rewards a higher probability of being chosen but incorporate exploration by still giving small probabilities to those arms with lower expected reward. One such method is the Softmax algorithm (Vermorel and Mohri, 2005) which assigns a Boltzmann distribution to the probability of selecting each arm:
eµk,tτ
PK
i=1eµi,tτ
(3.2.3)
where τ is the user defined tuning parameter where a larger value of τ results in more exploration.
3.2.2 Optimistic Methods
A different approach to incorporating exploration is used in optimistic methods which take an optimistic guess of the mean reward in the face of uncertainty. The reward with the highest optimistic estimate is then chosen. The methods prefer actions which have a high amount of uncertainty. For two arms with similar expected rewards, the arm which has a higher amount of uncertainty about that expectation will have a larger optimistic reward. If the optimistic view of the expected reward is wrong then after choosing that action a few times, the value will quickly decrease and it will no longer be chosen. These methods are intuitive as they are an approximation to the type of behaviour on Gittins indices which can be seen as the mean reward plus some type of uncertainty bonus.
One optimistic approach is called Interval Estimation (Kaelbling, 1993). Associated with each arm is a 100(1−α)% upper bound on the mean reward, whereαis a parameter in the range (0,1). The value ofαis left to the user; a smaller value ofαleads to more exploration. At each round, the action of highest upper bound is chosen. For example, if a reward on arm a at time t is normally distributed with posterior mean, µa,t and posterior standard
CHAPTER 3. SEQUENTIAL DECISION PROBLEMS 45
deviationσa,t. Interval estimation chooses the policy:
at= argmaxj=1,...,Kµi,t+zασi,t (3.2.4)
wherezα is the is the 1−α quantile of the standard normal distribution.
Alternatively, Upper Confidence Bound (UCB) methods have become popular optimistic heuristic methods for solving the multi-armed bandit problem. They are based on an upper confidence bound to the mean reward on each arm and are designed to meet the regret optimality criteria. Many of the methods do not make any assumptions on the probabilistic model and rely only on the sample mean of rewards on the arms. For each arm, an index is calculated based on an upper bound on the expected reward, which is a function of the number of times the arm has been pulled up to the current time. Many different UCB policies have been suggested and designed for specific problems (Auer et al., 2002; Garivier and Capp´e, 2011). A general Bayes-UCB policy has been proposed by Kaufmann et al. (2012a) which is inspired by a Bayesian perspective to the standard (frequentist) UCB policies. The Bayes-UCB policy chooses the arm which has the highest value of some user- chosen quantile,αt, of the posterior distribution of the rewards. More specifically, denote the
quantile function associated to the distributionν asQ(λ, ν), such thatP(X≤Q(λ, ν)) = 1 and the posterior distribution for armaat time tasνj,t. Bayes-UCB chooses the policy:
at= argmaxj=1,...,KQ(1−αt, νj,t) (3.2.5)
In many cases, the method can reach the lower bound on regret discussed in Section 3.2.
3.2.3 Randomised Probability Matching
A naturally Bayesian approach to defining a policy is to use randomised probability match- ing. The algorithm works by sampling from the posterior distribution on each arms and choosing the arm with the highest value. A sample, ˜θa, is drawn from the posterior distri-
CHAPTER 3. SEQUENTIAL DECISION PROBLEMS 46
butionνa,t(θa| ·) on each arm and the arm selected is:
at= argmaxj=1,...,Kθ˜a. (3.2.6)
Thompson sampling (Thompson, 1933) was the first example, and was proposed as a method for solving stochastic bandits with Bernoulli rewards. Recently, the method has been shown to be asymptotically optimal for cumulative regret minimisation (Kaufmann et al., 2012b). Although proposed for Bernoulli bandits, the method can easily be extended to general stochastic bandit problems (Agrawal and Goyal, 2012). An optimistic version of this has been developed which instead takes the maximum of the mean posterior reward and the sample (May et al., 2012):
at= argmaxj=1,...,Kmax(˜θa, µa,t). (3.2.7)