Adding a Baseline - Policy Gradient Methods: Variance Reduction and Stochastic Convergence

In Chapters 3 and 4 we saw that using a baseline may reduce the variance of our gradient estimates. Recall from Equation (2.13) that we have

$$\nabla_\beta \eta(\theta) = \mathbb{E}\left[ L(X, U; \theta)\, J_\beta(W; \theta) \right], \tag{8.5}$$

where the random variable $(X, U, W)$ has probability mass function $\Pr(X = i, U = u, W = j) = \pi_i \mu_u(i; \theta) p_{ij}(u)$. The idea behind using a baseline was seen in Note 2.4, which showed that we may shift $L(X, U; \theta) J_\beta(W; \theta)$ in Equation (8.5) by $L(X, U; \theta) B(X; \theta)$, for any almost surely finite $B(\theta) : \mathcal{S} \to \mathbb{R}$, without changing the expectation. Ideally we would choose the baseline that minimizes the variance of any $\nabla_\beta \eta$ estimates that we calculate; however, it is unlikely we will have enough information to be able to do this. In this chapter we consider using a (stochastic) sequence of functions $(B_t : \mathcal{S} \to \mathbb{R})_{t=0}^{\infty}$, with $B_t$ used as a baseline at time $t$, and consider ways of selecting such a sequence, with the aim of obtaining a better baseline over time.
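The fact that a baseline leaves the expectation unchanged can be checked exactly for a single state. The following is a minimal sketch under assumptions that are not in the text: a softmax policy over three actions with hypothetical parameters `theta`, hypothetical stand-in values `j_values` for the conditional expectation of the discounted value given each action, and a baseline value that does not depend on the action.

```python
import math

def softmax(theta):
    """Action probabilities mu_u for a softmax policy, one parameter per action."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]

def score(theta, u):
    """L(u; theta) = gradient of log mu_u(theta) for the softmax policy."""
    mu = softmax(theta)
    return [(1.0 if k == u else 0.0) - mu[k] for k in range(len(theta))]

def expected_score_times(theta, values):
    """Exact E_u[ L(u; theta) * values[u] ], summing over every action."""
    mu = softmax(theta)
    out = [0.0] * len(theta)
    for u, p in enumerate(mu):
        s = score(theta, u)
        for k in range(len(theta)):
            out[k] += p * s[k] * values[u]
    return out

theta = [0.3, -1.2, 0.8]       # hypothetical policy parameters
j_values = [1.0, 4.0, -2.0]    # hypothetical stand-ins for E[J_beta(W) | X, U = u]
baseline = 1.7                 # any finite B(X); it does not depend on u

plain = expected_score_times(theta, j_values)
shifted = expected_score_times(theta, [j - baseline for j in j_values])
```

Here `plain` and `shifted` agree up to rounding, because the expected score under the policy is zero. The variance of a sampled estimate generally does depend on the choice of `baseline`.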

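In the COLMDP algorithm, rather than taking steps in $L(X_t, U_t; \theta_t) J_\beta(X_{t+1}; \theta)$-like quantities, we look at the reward multiplied by a discounted sum of $L(X_t, U_t; \theta_t)$ into the past. For some baseline $B : \mathcal{S} \to \mathbb{R}$ we might consider the baseline to act on reward terms in $J_\beta(X_{t+1}; \theta)$ as follows:

$$\begin{aligned}
\lim_{T \to \infty} \sum_{s=t+1}^{T} \beta_t^{s-t-1} r(X_s) - B(X_t)
&= \lim_{T \to \infty} \left( \sum_{s=t+1}^{T} \beta_t^{s-t-1} r(X_s) + \sum_{s=t+1}^{T} \beta_t^{s-t-1} \left( \beta_t B(X_s) - B(X_{s-1}) \right) - \beta_t^{T-t} B(X_T) \right) \\
&= \sum_{s=t+1}^{\infty} \beta_t^{s-t-1} \left( r(X_s) + \beta_t B(X_s) - B(X_{s-1}) \right),
\end{aligned}$$

provided $B$ is almost surely finite, so that the remainder term vanishes as $T \to \infty$ (recall that $\beta_t < 1$). This would suggest replacing the update $\theta_{t+1} = \theta_t + \gamma_t r(X_t) z_t$ in the COLMDP algorithm with the update $\theta_{t+1} = \theta_t + \gamma_t \tilde{r}_t z_t$, where

$$\tilde{r}_t \stackrel{\text{def}}{=} r(X_t) + \beta_t B_t(X_t) - B_t(X_{t-1}).$$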
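The telescoping step behind this reward shift can be checked numerically at a finite horizon. The sketch below takes $t = 0$ and uses made-up reward and baseline values along a trajectory; the function names are illustrative and not part of the COLMDP algorithm.

```python
import random

def discounted_minus_baseline(rewards, b_values, beta):
    """Sum_{s=1}^{T} beta^(s-1) r(X_s) - B(X_0), with rewards[s] = r(X_s)."""
    T = len(rewards) - 1
    total = sum(beta ** (s - 1) * rewards[s] for s in range(1, T + 1))
    return total - b_values[0]

def shifted_rewards_sum(rewards, b_values, beta):
    """Discounted sum of r(X_s) + beta B(X_s) - B(X_{s-1}), less the remainder beta^T B(X_T)."""
    T = len(rewards) - 1
    total = sum(
        beta ** (s - 1) * (rewards[s] + beta * b_values[s] - b_values[s - 1])
        for s in range(1, T + 1)
    )
    return total - beta ** T * b_values[T]

rng = random.Random(1)
rewards = [rng.random() for _ in range(8)]          # hypothetical r(X_0), ..., r(X_7)
b_values = [rng.uniform(-3, 3) for _ in range(8)]   # hypothetical B(X_0), ..., B(X_7)
lhs = discounted_minus_baseline(rewards, b_values, 0.9)
rhs = shifted_rewards_sum(rewards, b_values, 0.9)
```

The two quantities `lhs` and `rhs` agree up to rounding. The remainder `beta ** T * b_values[T]` shrinks geometrically in `T` whenever the baseline values are finite, which is why only the shifted rewards survive in the limit.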

In this section we will consider a number of baseline sequences, which we will form from a sequence of parameterised functions $(B_t(\varpi) : \mathcal{S} \to \mathbb{R})_{t=0}^{\infty}$ and a sequence of random variables $(\varpi_t)_{t=0}^{\infty}$. At each time $t$ we form a baseline $B_t(x) = B_t(x; \varpi_t)$ and update the random variable $\varpi_{t+1} = \mathcal{B}(\varpi_t, \ldots)$, where $\mathcal{B}$ is the baseline update. Algorithm 8.3 shows the COLMDP(baseline) algorithm, a variant of the COLMDP algorithm that uses the sequence of baselines generated in this way.

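The first choice of baseline we consider stems from the idea of using the expected discounted value function as a baseline. It is formed from the parameterised sequence $(B_t(\tilde{\eta}))_{t=0}^{\infty}$ with

$$B_t(i; \tilde{\eta}) = \frac{\tilde{\eta}}{1 - \beta_t}, \quad \forall i \in \mathcal{S}, \tag{8.6}$$

along with the update

$$\tilde{\eta}_{t+1} = \tilde{\eta}_t + \lambda \gamma_t \left( r(X_t) - \tilde{\eta}_t \right), \tag{8.7}$$

where $\lambda$ is an arbitrary constant. Because this baseline takes the same value in every state, the shifted reward reduces to $\tilde{r}_t = r(X_t) + (\beta_t - 1)\tilde{\eta}_t / (1 - \beta_t) = r(X_t) - \tilde{\eta}_t$, which gives us the $\theta$ update rule

$$\theta_{t+1} = \theta_t + \gamma_t \left( r(X_t) - \tilde{\eta}_t \right) z_t.$$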
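As a small illustration of baseline (8.6) and update (8.7), the sketch below computes the shifted reward for the constant baseline and performs one update of the running reward estimate. The function names and the numeric values in the usage lines are hypothetical.

```python
def constant_baseline(eta_tilde, beta):
    """Baseline (8.6): eta_tilde / (1 - beta_t), the same for every state."""
    return eta_tilde / (1.0 - beta)

def shifted_reward(reward, eta_tilde, beta):
    """r(X_t) + beta_t B_t(X_t) - B_t(X_{t-1}) for the constant baseline."""
    b = constant_baseline(eta_tilde, beta)
    return reward + beta * b - b

def update_eta(eta_tilde, reward, gamma, lam):
    """Update (8.7): move eta_tilde toward the observed reward."""
    return eta_tilde + lam * gamma * (reward - eta_tilde)

# Hypothetical values: reward 2.0, current estimate 0.5, beta_t = 0.9.
r_tilde = shifted_reward(2.0, 0.5, 0.9)      # equals 2.0 - 0.5 up to rounding
eta_next = update_eta(0.5, 2.0, 0.1, 1.0)    # 0.5 + 0.1 * (2.0 - 0.5)
```

The discount factor cancels out of the shifted reward, so the parameter update only ever sees the reward minus the current estimate of the average reward.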

With this baseline and baseline update the COLMDP(baseline) algorithm has the same shift in the reward, and approximation to the average reward, as the algorithm of Marbach (1998) and Marbach and Tsitsiklis (2003). Of course, in the COLMDP(baseline) algorithm the value of $\beta$ changes at each iteration (also, $z_t$ is not refreshed at occurrences of a special state $i^*$), allowing us to give the result of Theorem 8.4 below, a result that is stronger than that of Marbach (1998).

Algorithm 8.3 COLMDP(baseline)

given:

- a µMDP $D = (\mathcal{S}, \mathcal{U}, P, r, M)$ satisfying Assumption 4;

- initial parameter value $\theta_0 \in \mathbb{R}^K$;

- distribution over starting states $\rho_0 \in \mathcal{P}_{\mathcal{S}}$;

- a sequence of positive step sizes $(\gamma_t)_{t=0}^{\infty}$, and a sequence $(\beta_t)_{t=0}^{\infty}$ in $[0, 1)$, satisfying

  a. $\sum_{t=0}^{\infty} \gamma_t = \infty$,

  b. $\gamma_t$, $(1 - \beta_t)$, and $\gamma_t (1 - \beta_t)^{-4}$ are non-increasing,

  c. there exists $0 < p < 1$ such that $\sum_{t=0}^{\infty} \gamma_t (1 - \beta_t)^{p} < \infty$,

  d. there exists $0 < q < 1$ such that $\sum_{t=0}^{\infty} \gamma_t^{1+q} (1 - \beta_t)^{-8} < \infty$, and

  e. there is a constant $c_9$ such that $(1 - \beta_t) - (1 - \beta_{t+1}) \le c_9 (1 - \beta_t)^5 (1 - \beta_{t+1})$;

- sequence of functions $(B_t(x; \varpi))_{t=0}^{\infty}$, initial baseline parameter $\varpi_0$, and baseline update $\mathcal{B}(\varpi, \ldots)$, generating the stochastic sequence $(B_t : \mathcal{S} \to \mathbb{R})_{t=0}^{\infty}$ by $B_t(x) = B_t(x; \varpi_t)$.

$z_0 = 0$.

Generate $X_0$ according to $\rho_0$.

Generate $U_0$ according to $\mu(X_0; \theta_0)$.

for $t = 0, 1, 2, \ldots$ do

  $\theta_{t+1} = \theta_t + \gamma_t \tilde{r}_t z_t$.

  $\varpi_{t+1} = \mathcal{B}(\varpi_t, \ldots)$.

  $z_{t+1} = \beta_t z_t + L(X_t, U_t; \theta_t)$.

  Generate $X_{t+1}$ according to $P_{X_t}(U_t)$.

  Generate $U_{t+1}$ according to $\mu(X_{t+1}; \theta_{t+1})$.

end for
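The loop of Algorithm 8.3 with baseline (8.6) and baseline update (8.7) can be sketched as follows. Everything problem-specific here is an assumption made for illustration: the two-state, two-action transition probabilities `P`, the rewards `R`, the tabular softmax policy, the uniform starting distribution, and the schedules $\gamma_t = 1/(t+1)$ and $1 - \beta_t = (t+1)^{-0.1}$. The schedules were picked with conditions a to e in mind, but this has not been verified formally here.

```python
import math
import random

# Hypothetical toy problem (not from the text): 2 states, 2 actions.
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[i][u][j] = Pr(X_{t+1} = j | X_t = i, U_t = u)
     [[0.7, 0.3], [0.1, 0.9]]]
R = [0.0, 1.0]                   # r(i): reward received in state i

def policy(theta, i):
    """Softmax action probabilities in state i, one parameter per (state, action)."""
    m = max(theta[i])
    e = [math.exp(v - m) for v in theta[i]]
    s = sum(e)
    return [v / s for v in e]

def score(theta, i, u):
    """L(i, u; theta): gradient of log mu_u(i; theta), same shape as theta."""
    mu = policy(theta, i)
    g = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(2):
        g[i][k] = (1.0 if k == u else 0.0) - mu[k]
    return g

def sample(probs, rng):
    """Draw an index from a finite distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

def colmdp_baseline(steps, lam=1.0, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]
    z = [[0.0, 0.0], [0.0, 0.0]]            # z_0 = 0
    eta = 0.0                               # initial baseline parameter (assumed)
    x = sample([0.5, 0.5], rng)             # X_0 from a uniform rho_0 (assumed)
    u = sample(policy(theta, x), rng)
    for t in range(steps):
        gamma = 1.0 / (t + 1)               # assumed step size schedule
        beta = 1.0 - (t + 1) ** -0.1        # assumed discount schedule, beta_0 = 0
        grad = score(theta, x, u)           # L(X_t, U_t; theta_t), taken before theta changes
        b = eta / (1.0 - beta)              # baseline (8.6), constant over states
        r_tilde = R[x] + beta * b - b       # shifted reward
        for i in range(2):
            for k in range(2):
                theta[i][k] += gamma * r_tilde * z[i][k]   # uses z_t
                z[i][k] = beta * z[i][k] + grad[i][k]      # then forms z_{t+1}
        eta += lam * gamma * (R[x] - eta)   # baseline update (8.7)
        x = sample(P[x][u], rng)
        u = sample(policy(theta, x), rng)
    return theta, eta

theta_final, eta_final = colmdp_baseline(2000)
```

With `lam * gamma <= 1`, each update of `eta` is a convex combination of the previous estimate and an observed reward, so the estimate stays within the range of the rewards. The sketch makes no claim about how quickly the parameters approach a stationary point under these schedules.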

Theorem 8.4. For the sequence $(\theta_t)_{t=0}^{\infty}$ generated by the COLMDP(baseline) algorithm with baseline (8.6) and baseline update (8.7):

a. $\eta(\theta_t)$ converges to a finite value with $\lim_{t \to \infty} \nabla \eta(\theta_t) = 0$;

b. every limit point of $\theta_t$ is a stationary point of $\eta$;

c. $\tilde{\eta}_t$ converges to a finite value with $\lim_{t \to \infty} (\tilde{\eta}_t - \eta(\theta_t)) = 0$.

The proof of Theorem 8.4 can be found in Section 8.6, along with the rest of the proofs for this section. Theorem 8.4a and Theorem 8.4b give the same convergence result as Theorem 8.3. Additionally, from Theorem 8.4c, we find that the error of our average reward approximation $\tilde{\eta}_t$ approaches zero.

The analysis in Chapter 4 suggests choices of baseline. The baselines suggested cannot be calculated without access to the dynamics of the Markov decision process (such as the set of transition matrices $P$). However, Section 4.3 suggests a gradient method for updating the baseline. It shows how the gradient (with respect to the baseline parameters) of the variance of an estimate of $\nabla_\beta \eta$ can itself be estimated from a sequence $(X_t, U_t)_{t=0}^{\infty}$ generated by the µMDP. We consider a variant of the corresponding update, where we look to take steps in the direction of the gradient of the variance but attenuated by $(1 - \beta_t)$. This gives the following method for updating the baseline:

$$\begin{aligned}
\varpi_{t+1} &= \varpi_t + \lambda \gamma_t \tilde{r}_t \tilde{z}_t \\
\tilde{z}_{t+1} &= \beta_t \tilde{z}_t + \| L(X_t, U_t; \theta_t) \|^2 (1 - \beta_t) \nabla_{\varpi} B_t(X_t; \varpi_t),
\end{aligned} \tag{8.8}$$

where $\lambda$ is again some positive constant, and the gradient of $B_t$ is with respect to the baseline parameters. We will consider this update for two particular choices of the sequence of functions $(B_t)_{t=0}^{\infty}$.

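The second choice of baseline, and baseline update, we consider is when the update of Equation (8.8) is applied to the baseline

$$B_t(i; b) = \frac{b}{1 - \beta_t}, \quad \forall i \in \mathcal{S}. \tag{8.9}$$

The update is then

$$\begin{aligned}
b_{t+1} &= b_t + \lambda \gamma_t \left( r(X_t) - b_t \right) \tilde{z}_t \\
\tilde{z}_{t+1} &= \beta_t \tilde{z}_t + \| L(X_t, U_t; \theta_t) \|^2.
\end{aligned} \tag{8.10}$$

Similar to before, the resultant $\theta$ update rule is then

$$\theta_{t+1} = \theta_t + \gamma_t \left( r(X_t) - b_t \right) z_t.$$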
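One step of update (8.10) might be written as below. This is a sketch only: the function name, the representation of the score as a plain list of floats, and the numeric inputs in the usage lines are assumptions. Both the baseline parameter and its trace are scalars in this case.

```python
def baseline_step(b, z_tilde, reward, score_vec, gamma, beta, lam):
    """One step of update (8.10).

    b_next = b + lam * gamma * (reward - b) * z_tilde
    z_next = beta * z_tilde + squared Euclidean norm of the score
    """
    sq_norm = sum(g * g for g in score_vec)
    b_next = b + lam * gamma * (reward - b) * z_tilde
    z_next = beta * z_tilde + sq_norm
    return b_next, z_next

# Two steps from b = 0 with an empty trace (hypothetical inputs).
b1, z1 = baseline_step(0.0, 0.0, 1.0, [0.6, -0.8], gamma=0.1, beta=0.5, lam=1.0)
b2, z2 = baseline_step(b1, z1, 1.0, [0.6, -0.8], gamma=0.1, beta=0.5, lam=1.0)
```

The first step leaves `b` unchanged because the trace starts at zero. From then on the trace weights each correction by a discounted sum of squared score norms, so time steps where the score is large move the baseline more.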

As before, we find that COLMDP(baseline) with baseline (8.9) and baseline update (8.10) converges. We have the following result.

Theorem 8.5. For the sequence $(\theta_t)_{t=0}^{\infty}$ generated by the COLMDP(baseline) algorithm with baseline (8.9) and baseline update (8.10), $\eta(\theta_t)$ converges to a finite value with $\lim_{t \to \infty} \nabla \eta(\theta_t) = 0$, with probability one. Furthermore, every limit point of $\theta_t$ is a stationary point of $\eta$.

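Lastly, the third choice of baseline, and baseline update, we consider is when the sequence of baseline functions $(B_t)_{t=0}^{\infty}$ used in the update of Equation (8.8) remains fixed, that is,

$$B_t(i; \varpi) = \tilde{B}(i; \varpi), \quad \forall t, \tag{8.11}$$

for some suitable choice of function $\tilde{B}$. The update then becomes

$$\begin{aligned}
\varpi_{t+1} &= \varpi_t + \lambda \gamma_t \left( r(X_t) + \beta_t \tilde{B}(X_t; \varpi_t) - \tilde{B}(X_{t-1}; \varpi_t) \right) \tilde{z}_t \\
\tilde{z}_{t+1} &= \beta_t \tilde{z}_t + (1 - \beta_t) \| L(X_t, U_t; \theta_t) \|^2 \nabla_{\varpi} \tilde{B}(X_t; \varpi_t).
\end{aligned} \tag{8.12}$$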

Provided that $\tilde{B}$ is smooth in the parameter $\varpi$ we can again show convergence.

Theorem 8.6. Let $\tilde{B} : \mathcal{S} \times \mathbb{R}^m \to \mathbb{R}$ be such that: for all $i \in \mathcal{S}$ the function $\tilde{B}(i; \cdot) : \mathbb{R}^m \to \mathbb{R}$ is differentiable; and the Euclidean norm of $\nabla_{\varpi} \tilde{B}(i; \varpi)$ is bounded uniformly over all $(i, \varpi) \in \mathcal{S} \times \mathbb{R}^m$. For the sequence $(\theta_t)_{t=0}^{\infty}$ generated by the COLMDP(baseline) algorithm with baseline (8.11) and baseline update (8.12), $\eta(\theta_t)$ converges to a finite value with $\lim_{t \to \infty} \nabla \eta(\theta_t) = 0$, with probability one. Furthermore, every limit point of $\theta_t$ is a stationary point of $\eta$.
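As one concrete instance of a fixed baseline function, consider a tabular baseline that stores one value per state, so that the baseline at state `i` is simply entry `i` of the baseline parameter vector (called `w` in the code). This choice is an assumption made for illustration. It is linear in the parameters, and its gradient is a one-hot vector of norm one, so it meets the differentiability and bounded-gradient conditions of Theorem 8.6. A sketch of one step of update (8.12) for it:

```python
def fixed_baseline_step(w, z_tilde, x_prev, x, reward, score_vec, gamma, beta, lam):
    """One step of update (8.12) for the tabular baseline with value w[i] in state i."""
    # Shifted reward: r(X_t) + beta_t * B(X_t; w_t) - B(X_{t-1}; w_t).
    r_tilde = reward + beta * w[x] - w[x_prev]
    # The parameter step uses the current trace.
    w_next = [wi + lam * gamma * r_tilde * zi for wi, zi in zip(w, z_tilde)]
    # The trace decays, then gains (1 - beta_t) * ||L||^2 times the gradient of the
    # baseline at X_t, which for a tabular baseline is the one-hot vector for state x.
    sq_norm = sum(g * g for g in score_vec)
    z_next = [beta * zi for zi in z_tilde]
    z_next[x] += (1.0 - beta) * sq_norm
    return w_next, z_next

# Two steps on a two-state chain (hypothetical inputs).
w1, z1 = fixed_baseline_step([0.0, 0.0], [0.0, 0.0], 0, 1, 1.0, [1.0, 0.0], 0.1, 0.5, 1.0)
w2, z2 = fixed_baseline_step(w1, z1, 1, 0, 2.0, [1.0, 0.0], 0.1, 0.5, 1.0)
```

A parametric baseline with fewer parameters than states would follow the same pattern, with the one-hot vector replaced by the gradient of the baseline with respect to its parameters.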
