Regret-regression and Multi-armed Bandit Problem
6.9 Linear Programming
The previous section explained how difficult it is to calculate the Gittins index, even in the simple case of a multi-armed bandit problem. This section introduces an easier alternative approach to calculating the Gittins index. The linear programming approach to the multi-armed bandit problem can be found in Chen and Katehakis (1986). They considered a finite-state bandit process and demonstrated that the Gittins index for a state M can be obtained by solving a linear program. The problem to be considered involves T variables M1, M2, · · · , MT, linear functions
\[
U(M, R) = \max \sum_{j \ge 0} \sum_{t=1}^{T} R_j \, I(M_j(t)),
\]
and T constraint equations
\[
\sum_{t=1}^{T} I(M_j(t)) = K, \quad \text{for } j = 1, 2, \cdots, K,
\]
where I(Mj(t)) is an indicator and Rj is a reward of Mj(t). The following conventions will be observed throughout the remainder of this section:
(i) The constraint equations have at least one non-negative real solution and are linearly independent.
(ii) The objective function cannot be expressed as a linear combination of the constraint functions.
6.9.1 Linear programming with a finite horizon
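Suppose {V∗, Wi∗, i ∈ Ω} are the optimal solution to the linear program
\[
\min_{U} \; U = V + \sum_{i \in \Omega} W_i
\]
subject to
\[
\begin{aligned}
V + W_i &\ge R(i) + \lambda \sum_{j} P(i, j) W_j, && i, j \in \Omega - \{m\}, \\
V &\ge R(m) + \lambda \sum_{j} P(m, j) W_j, && j \in \Omega - \{m\}, \\
W_i &\ge 0, \quad V \in \mathbb{R}, && i \in \Omega,
\end{aligned}
\]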
where m is the specific state of interest, i ranges over the other states, and Ω − {m} denotes all states except m.
Then it follows that G(m) = V∗ and Wm∗ = 0. Therefore, if the above linear program can be solved to obtain G(m), it is possible to construct an efficient procedure to calculate the Gittins indexes G(m) for m = 1, 2, · · · , |Ω|. However, Kallenberg (1986) observed that the number of pivot steps (a pivot step chooses the element corresponding to the location of the largest rewards and the smallest costs) depends heavily on the chosen permutation of the states in Ω. When |Ω| = 2, only two constraints in the linear program have to be replaced to calculate G(m1) and G(m2). The objective function then has only two variables, so the optimal solution can be obtained by a simple graphical method. If the problem contains more than two variables, we need other solution methods such as the simplex method (explained later). In our previous example, we found that the Gittins index is 6 for arm 1, state 1 and 8 for arm 2, state 1. To calculate the Gittins index for arm 1, state 2, the linear program is
\[
\min_{U} \; U = V + W_1
\]
subject to
\[
V + W_1 \ge R(1) + \lambda P(1, 1) W_1, \qquad V \ge R(2) + \lambda P(2, 1) W_1, \qquad W_1 \ge 0.
\]
Moving the W1 terms to the left-hand side gives V + (1 − λP(1, 1))W1 ≥ R(1) and V − λP(2, 1)W1 ≥ R(2). Using the specific values in the transition matrix and the rewards vector, these two constraints become V + 0.8W1 ≥ 6 and V − 0.3W1 ≥ 4. The candidate points of the graphical solution are (W1, V) = (7.5, 0) and (1.82, 4.54). Hence V∗ = 4.54. Similarly, the Gittins index for arm 2, state 2 can be found by solving the following linear program
\[
\min_{U} \; U = V + W_1
\]
subject to
\[
V + 0.6 W_1 \ge 8, \qquad V - 0.5 W_1 \ge 3, \qquad W_1 \ge 0.
\]
The candidate points are (W1, V) = (13.33, 0) and (4.54, 5.27). Thus V∗ = 5.27 for state 2 in arm 2. These are the same results as seen earlier.
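The candidate points quoted for the two graphical solutions can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original derivation: it treats each constraint as an equality and intersects the two lines by Cramer's rule. The function name `line_intersection` is our own, and we assume the second pair of constraints reads V + 0.6W1 ≥ 8 and V − 0.5W1 ≥ 3.

```python
def line_intersection(a1, b1, c1, a2, b2, c2):
    """Solve a1*W + b1*V = c1 and a2*W + b2*V = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the two constraint lines are parallel")
    w = (c1 * b2 - c2 * b1) / det
    v = (a1 * c2 - a2 * c1) / det
    return w, v


# Arm 1, state 2: V + 0.8*W1 = 6 and V - 0.3*W1 = 4.
arm1_vertex = line_intersection(0.8, 1.0, 6.0, -0.3, 1.0, 4.0)
# Arm 2, state 2 (assumed reading): V + 0.6*W1 = 8 and V - 0.5*W1 = 3.
arm2_vertex = line_intersection(0.6, 1.0, 8.0, -0.5, 1.0, 3.0)

# Intercepts of the first constraint of each pair with the W1 axis (V = 0).
arm1_intercept = 6.0 / 0.8
arm2_intercept = 8.0 / 0.6

print("arm 1, state 2: (W1, V) =", arm1_vertex, " W1-intercept =", arm1_intercept)
print("arm 2, state 2: (W1, V) =", arm2_vertex, " W1-intercept =", arm2_intercept)
```

The V coordinates of the two intersection points agree with the reported values 4.54 and 5.27 up to rounding in the last digit.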
Figure 6.3: The Gittins index by LP
6.9.2 Results for K = 5
We use the same setup as in the last example in Section 3, but with K = 5.
Regret
State Mj    j = 1    j = 2    j = 3    j = 4    j = 5
1           0.309    0.309    0.305    0.781    2
2           0.309    0.308    0.312    0.393    3
3           0.855    0.856    0.851    1.562    4
4           0.855    0.855    0.859    0.396    1
Table 6.4: Regrets for two-arm bandit problem when K = 5.
The dynamic regret-regression policy for the initial states (1, 2, 3, 4) is (1, 1, 1, 1) at all time points from j = 1 to j = K − 2, then (1, 0, 1, 1) at j = K − 1 and (1, 0, 1, 0) at j = K. Choosing the wrong action incurs the regrets given in Table 6.4.
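To make the time dependence of this policy concrete, the following minimal sketch returns the action assigned to each state at a given time point. The function name `dynamic_policy` is hypothetical, and the action codes 0 and 1 are simply copied from the policy vectors above without further interpretation.

```python
def dynamic_policy(j, K):
    """Return {state: action} for states 1-4 at time point j of a K-step horizon."""
    if not 1 <= j <= K:
        raise ValueError("j must lie between 1 and K")
    states = (1, 2, 3, 4)
    if j <= K - 2:
        actions = (1, 1, 1, 1)   # all time points up to j = K - 2
    elif j == K - 1:
        actions = (1, 0, 1, 1)   # second-to-last time point
    else:
        actions = (1, 0, 1, 0)   # final time point, j = K
    return dict(zip(states, actions))


# Print the policy for every time point when K = 5.
for j in range(1, 6):
    print(j, dynamic_policy(j, 5))
```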
The optimal expected rewards, estimated by simulation from a dataset of size 1000, are shown in Table 6.5.
Initial State M1            1         2         3         4
Rewards                     6 or 8    6 or 3    4 or 8    4 or 3
Optimal expected rewards    30.59     26.07     29.27     24.75
Overall mean                27.67
Table 6.5: Regret-regression optimal mean expected rewards for two-arm bandit problem when K = 5.
Now we compare the regret-regression results with other methods: a random policy (playing each arm with probability 0.5), play-the-winner (playing the arm that gives the maximum reward) and the Gittins index policy. The last two methods have static policies, (1, 0, 1, 0) and (1, 0, 1, 1) respectively, at all time points. For reference we give the mean reward under the four decision regimes:
Policy Expected Reward
Random, p = 0.5 25.2
Play-the-winner 26.0
Gittins Index (or linear programming) 27.1
Optimal Dynamic 27.67
Table 6.6: Mean rewards under different strategies when K = 5.
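The arithmetic linking Tables 6.5 and 6.6 can be reproduced directly. The short sketch below assumes that the overall mean weights the four initial states equally (this matches the reported 27.67) and computes how far each policy in Table 6.6 falls short of the optimal dynamic policy. The variable names are our own.

```python
# Optimal expected rewards by initial state (Table 6.5).
optimal_by_state = {1: 30.59, 2: 26.07, 3: 29.27, 4: 24.75}

# Equal-weight average over the four initial states (an assumption).
overall_mean = sum(optimal_by_state.values()) / len(optimal_by_state)

# Mean rewards under each policy (Table 6.6).
mean_reward = {
    "Random, p = 0.5": 25.2,
    "Play-the-winner": 26.0,
    "Gittins index (or linear programming)": 27.1,
    "Optimal dynamic": 27.67,
}

# Shortfall of each policy relative to the optimal dynamic policy.
best = mean_reward["Optimal dynamic"]
shortfall = {name: best - value for name, value in mean_reward.items()}

print(f"overall mean = {overall_mean:.2f}")
for name, gap in shortfall.items():
    print(f"{name:<40s} shortfall = {gap:.2f}")
```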
6.9.3 Linear programming policy with an infinite horizon
In the previous section we showed how the Gittins index (linear programming) can be calculated using a finite horizon with K time points. In this section we compare the results of linear programming with regret-regression policies when there is an infinite number of decision time points. First we describe how to find these policies, then we illustrate the comparison with a numerical example.
Suppose a bandit has four states and two arms. Arm 1 has two possible states, denoted 1 and 2. Arm 2 also has two possible states, denoted 3 and 4. Initially, arm 1 is in one of its states and arm 2 is in one of its states. Arms may change state only after completing service, according to Markovian transition probabilities. When arm 1 is in state 1, it either remains in state 1, with probability P (1, 1), or moves to state 2, with probability P (1, 2) = 1 − P (1, 1); when arm 1 is in state 2, it enters state 1 with probability P (2, 1), or remains in state 2, with probability P (2, 2) = 1 − P (2, 1). Likewise, when arm 2 is in state 3, it either remains in state 3, with probability P (3, 3), or moves to state 4, with probability P (3, 4) = 1 − P (3, 3); when arm 2 is in state 4, it enters state 3 with probability P (4, 3), or remains in state 4, with probability P (4, 4) = 1 − P (4, 3). Each time a state completes its service, a reward Rk, for k = 1, 2, 3 and 4, is earned, discounted in time by a discount factor 0 < λ < 1. The objective is to find a policy d ∈ D that maximises the total expected discounted reward earned over an infinite horizon. Defining Mj(t) as an indicator variable that takes value 1 if state M (t) is in service at time j and 0 otherwise, we can write the stochastic optimization problem of interest as follows,
V = max
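To illustrate the model just described, the sketch below simulates one path of the two-arm, four-state bandit and accumulates the discounted reward under a given policy. It is a rough illustration under stated assumptions rather than the method used in this section: the rewards (6, 4, 8, 3) are borrowed from the K = 5 example, the transition probabilities in `STAY` are hypothetical, the infinite horizon is truncated at a finite number of steps, and the `myopic` policy is only a placeholder.

```python
import random

# Rewards borrowed from the earlier example: arm 1 pays 6 or 4, arm 2 pays 8 or 3.
REWARDS = {1: 6.0, 2: 4.0, 3: 8.0, 4: 3.0}

# HYPOTHETICAL probabilities of remaining in the same state, STAY[s] = P(s, s).
STAY = {1: 0.7, 2: 0.6, 3: 0.5, 4: 0.8}

# The only other state of the same arm.
OTHER = {1: 2, 2: 1, 3: 4, 4: 3}


def myopic(arm1_state, arm2_state):
    """Placeholder policy: serve the arm whose current state pays more."""
    return 1 if REWARDS[arm1_state] >= REWARDS[arm2_state] else 2


def simulate(policy, start=(1, 3), discount=0.9, horizon=200, seed=0):
    """Discounted reward along one simulated path, truncated at `horizon` steps."""
    rng = random.Random(seed)
    arm1_state, arm2_state = start
    total = 0.0
    for t in range(horizon):
        choice = policy(arm1_state, arm2_state)   # 1 = serve arm 1, 2 = serve arm 2
        state = arm1_state if choice == 1 else arm2_state
        total += discount ** t * REWARDS[state]
        # Only the arm that was served may change state.
        nxt = state if rng.random() < STAY[state] else OTHER[state]
        if choice == 1:
            arm1_state = nxt
        else:
            arm2_state = nxt
    return total


value = simulate(myopic)
print("discounted reward of one simulated path:", value)
```

Averaging `simulate` over many seeds would give a Monte Carlo estimate of the expected discounted reward of a policy, which is the quantity that the optimization problem maximises over d ∈ D.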
Returning to the linear programming method, we define:
1- m1, m2, m3, m4 are the numbers of service completions of states 1, 2 in arm 1 and 3, 4 in arm 2 respectively. That is, mi = Σ I[Mj(i)] is the total number of service completions of state i.
2- R1, R2, R3, R4 are the rewards of states 1, 2, 3 and 4 respectively.
3- The transition matrices will be,
P1 =