
5.3 Algorithms for Learning Value Functions

5.3.2 Minimizing Sample Error

A more restrictive approach is to minimize the error seen at each sample,

$$R = \mathrm{E}\left(L(X, U)\, G_\beta(W; \omega)\right)^2,$$

where the random variable $(X, U, W)$ has probability mass function $\Pr(X = i, U = u, W = j) = \pi_i \mu_u(i) p_{ij}(u)$. This approach directly drives $V$ towards $J_\beta$ and as such does not aim for a reduction in estimation variance. It produces an algorithm that is very similar to TD(1) (Sutton 1988), but has the benefit that the relative magnitude of the gradient with respect to the policy parameters is taken into account. In this way, more attention is devoted to accuracy in regions of the state space where the gradient is relatively large.

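For the parameterized class of value functions, $\{V(\cdot\,; \omega) : \mathcal{S} \to \mathbb{R} \mid \omega \in \mathbb{R}^m\}$, we can determine the gradient of this quantity:

$$\nabla_\omega \tfrac{1}{2} R = \nabla_\omega \tfrac{1}{2} \mathrm{E}\left(L(X, U)\, G_\beta(W; \omega)\right)^2 = -\mathrm{E}\left[\left(L(X, U) \left(\nabla_\omega V(W; \omega)\right)'\right)' \left(L(X, U)\, G_\beta(W; \omega)\right)\right] = -\mathrm{E}\left[\left(L(X, U)\right)^2 \nabla_\omega V(W; \omega)\, G_\beta(W; \omega)\right].$$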

If the value function satisfies Assumption 3, that is, if the value function is bounded and has a bounded first derivative, the gradient $\nabla_\omega R/2$ may be estimated from a sequence generated by a µMDP; we can use the estimate

$$\Delta^R_S = \frac{1}{S} \sum_{s=1}^{S} \left(r(X_s) + \beta V(X_{s+1}; \omega) - V(X_s; \omega)\right) z_s,$$

where $z_0 = 0$, and $z_{s+1} = \beta z_s + \left(L(X_s, U_s)\right)^2 \nabla_\omega V(X_{s+1}; \omega)$. The ergodicity and truncation argument, showing that $\Delta^R_S$ almost surely converges to $\nabla_\omega R/2$, is the same
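The trace recursion and the averaged sum above can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis's implementation: the function name and its arguments are hypothetical, $L(X, U)$ is passed in as a callable because its definition lies outside this section, and $(L(X, U))^2$ is assumed to mean the squared Euclidean norm $L'L$, as the gradient derivation suggests.

```python
import numpy as np

def sample_error_gradient_estimate(xs, us, r, V, grad_V, L, beta):
    """Trajectory estimate (1/S) * sum_{s=1..S} d_s * z_s (hypothetical helper).

    xs     : states X_0, ..., X_{S+1}
    us     : actions U_0, ..., U_{S-1}
    r      : callable, r(x) -> reward
    V      : callable, V(x) -> value at the current parameters
    grad_V : callable, grad_V(x) -> gradient of V(x) w.r.t. the parameters (1-D array)
    L      : callable, L(x, u) -> 1-D array (the weight L(X, U) of the text)
    """
    S = len(xs) - 2
    z = np.zeros_like(grad_V(xs[0]), dtype=float)  # z_0 = 0
    total = np.zeros_like(z)
    for s in range(1, S + 1):
        # z_s = beta * z_{s-1} + |L(X_{s-1}, U_{s-1})|^2 * grad V(X_s)
        # (assumption: the square of L is its squared Euclidean norm)
        l = np.asarray(L(xs[s - 1], us[s - 1]), dtype=float)
        z = beta * z + float(l @ l) * grad_V(xs[s])
        # temporal-difference term r(X_s) + beta * V(X_{s+1}) - V(X_s)
        d = r(xs[s]) + beta * V(xs[s + 1]) - V(xs[s])
        total += d * z
    return total / S

# Toy usage with a two-state tabular value function (all values made up).
omega = np.array([0.2, 0.4])
V = lambda x: omega[x]
grad_V = lambda x: np.eye(2)[x]          # tabular: gradient is a one-hot vector
r = lambda x: 1.0 if x == 1 else 0.0
L = lambda x, u: np.array([1.0])         # constant weight, for illustration only
estimate = sample_error_gradient_estimate(
    [0, 1, 0, 1], [0, 0], r, V, grad_V, L, beta=0.5
)
```

The trace `z` is updated before it is used at step `s`, which matches the indexing $z_{s+1} = \beta z_s + \ldots$ with $z_0 = 0$; the trajectory therefore needs one more state than the number of terms in the sum.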

Experiments

This chapter describes some simulation experiments performed whilst investigating the selection of baselines and value functions.

Section 6.1 describes an experiment on a simple three state MDP where many quantities, such as the optimum baseline, can be calculated. This allows a direct comparison of the performance when using the optimal baselines discussed in Chapter 4 with the performance when using the expected value function as a baseline, and with the performance of the GPOMDP algorithm. The algorithms were altered to use a precalculated value function $J_\beta$ rather than an estimate $J_t$. This allows us to more clearly see the benefit of learning a value function that aims to reduce estimate variance in addition to estimate bias.

Section 6.2 describes an experiment using the same simple three state MDP as Section 6.1. This experiment shows the performance of the estimate when using a baseline and the estimate when using a value function when the baseline, and value function, are learnt whilst calculating the respective gradient estimates.

Section 6.3 describes an experiment on a larger target tracking problem. The performance of gradient estimates using a number of different baselines is compared at various stages of learning the target tracking problem.

6.1 Three State MDP

This section describes an experiment comparing choices of gradient estimates for a simple three state µMDP. We use the three state µMDP as described in Baxter et al. (2001). There are three states $\mathcal{S} = \{1, 2, 3\}$, two actions $\mathcal{U} = \{a_1, a_2\}$, and four parameters $\theta \in \mathbb{R}^4$. The transition matrices are

$$P(a_1) = \begin{pmatrix} 0.0 & 0.8 & 0.2 \\ 0.8 & 0.0 & 0.2 \\ 0.0 & 0.8 & 0.2 \end{pmatrix}, \qquad P(a_2) = \begin{pmatrix} 0.0 & 0.2 & 0.8 \\ 0.2 & 0.0 & 0.8 \\ 0.0 & 0.2 & 0.8 \end{pmatrix},$$

and the reward function is

$$r(i) = \begin{cases} 1 & \text{if } i = 3, \\ 0 & \text{otherwise.} \end{cases}$$

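The policy is constructed as follows: there are two functions $\phi_1, \phi_2 : \mathcal{S} \to \mathbb{R}$ with

$$\phi_1(1) = \tfrac{12}{18}, \quad \phi_1(2) = \tfrac{6}{18}, \quad \phi_1(3) = \tfrac{5}{18}, \qquad \phi_2(1) = \tfrac{6}{18}, \quad \phi_2(2) = \tfrac{12}{18}, \quad \phi_2(3) = \tfrac{5}{18},$$

and functions $s_1, s_2 : \mathcal{S} \times \mathbb{R}^n \to \mathbb{R}$ defined by

$$s_1(i; \theta) \stackrel{\text{def}}{=} \theta_1 \phi_1(i) + \theta_2 \phi_2(i), \qquad s_2(i; \theta) \stackrel{\text{def}}{=} \theta_3 \phi_1(i) + \theta_4 \phi_2(i);$$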

the policy is then given by

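$$\mu_{a_1}(i; \theta) = \frac{e^{s_1(i; \theta)}}{e^{s_1(i; \theta)} + e^{s_2(i; \theta)}}, \qquad \mu_{a_2}(i; \theta) = 1 - \mu_{a_1}(i; \theta) = \frac{e^{s_2(i; \theta)}}{e^{s_1(i; \theta)} + e^{s_2(i; \theta)}}.$$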

The experiment looked at gradient estimates for the policy at the parameter setting $\theta = (1, 1, -1, -1)'$.
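The three state problem and its policy are small enough to write down directly. The following NumPy sketch encodes the transition matrices, reward, features and softmax policy given above; the variable and function names (`policy`, `induced_chain`) are illustrative choices rather than anything from the thesis, and states 1 to 3 are mapped to indices 0 to 2.

```python
import numpy as np

# P[a, i, j] = Pr(next state j | state i, action a); index 0 is a1, index 1 is a2.
P = np.array([
    [[0.0, 0.8, 0.2], [0.8, 0.0, 0.2], [0.0, 0.8, 0.2]],
    [[0.0, 0.2, 0.8], [0.2, 0.0, 0.8], [0.0, 0.2, 0.8]],
])
r = np.array([0.0, 0.0, 1.0])                               # reward 1 only in state 3
phi = np.array([[12.0, 6.0], [6.0, 12.0], [5.0, 5.0]]) / 18.0  # row i: (phi1(i), phi2(i))
theta = np.array([1.0, 1.0, -1.0, -1.0])

def policy(theta):
    """Return mu[i, a], the probability of action a in state i."""
    s = np.stack([phi @ theta[:2], phi @ theta[2:]], axis=1)  # columns: s1(i), s2(i)
    e = np.exp(s - s.max(axis=1, keepdims=True))              # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def induced_chain(theta):
    """Transition matrix of the Markov chain obtained by following the policy."""
    return np.einsum("ia,aij->ij", policy(theta), P)

mu = policy(theta)
P_theta = induced_chain(theta)
```

Subtracting the row maximum before exponentiating does not change the softmax probabilities; it only guards against overflow for large parameter values.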

In the experiment the gradient $\nabla\eta$ was compared to the gradient estimates produced with a variety of schemes:

- the GPOMDP algorithm;
- the estimate $\Delta^{(+0)}_T(b)$ with a constant baseline set to $\mathrm{E} J_\beta(X)$, where $X$ is a random state variable distributed according to $\pi$, the stationary distribution of the Markov chain formed by the sequence $(X_t)_0^\infty$ generated by the µMDP;
- the estimate $\Delta^{(+0)}_T(b)$ with the optimum constant baseline, described in Theorem 4.4;
- the estimate $\Delta^{(+0)}_T(b_Y)$ with the optimum baseline function, described in Theorem 4.1; and
- the estimate $\Delta^V_T$ with a value function that was trained using Algorithm 5.1 with the free parameter $\lambda$ set to 0.001.

The value function had a distinct parameter for each state, all initially set to zero.

Because of the µMDP's simplicity, a number of quantities can be computed explicitly, including the true gradient $\nabla\eta$, the discounted value function $J_\beta$, the expectation of the discounted value function $\mathrm{E} J_\beta(X)$, the optimal baseline $b^*_Y$, and the optimal constant baseline $b^*$. All algorithms (estimates) used in the experiments were altered such that $J_\beta$ estimates (that is, the estimates formed from a discounted sum of future rewards) were replaced by the precomputed discounted value function; this change has no effect on the estimate $\Delta^V_T$. The data was produced using 500 independent runs, and the approximation parameter $\beta$ was set to a value of 0.95.
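Several of the quantities listed above follow from small linear solves. The sketch below computes the stationary distribution $\pi$, the average reward $\eta$, the discounted value function $J_\beta$, its expectation $\mathrm{E} J_\beta(X)$, and the gradient $\nabla\eta$. It is an illustration under stated assumptions: the helper names are hypothetical, $J_\beta$ is taken to solve $J_\beta = r + \beta P J_\beta$, and the closed form $\nabla\eta = \pi' \nabla P\,[I - P + e\pi']^{-1} r$ is the standard expression for the derivative of the average reward, not something stated in this section. The optimal baselines of Chapter 4 are not computed here.

```python
import numpy as np

P = np.array([
    [[0.0, 0.8, 0.2], [0.8, 0.0, 0.2], [0.0, 0.8, 0.2]],   # action a1
    [[0.0, 0.2, 0.8], [0.2, 0.0, 0.8], [0.0, 0.2, 0.8]],   # action a2
])
r = np.array([0.0, 0.0, 1.0])
phi = np.array([[12.0, 6.0], [6.0, 12.0], [5.0, 5.0]]) / 18.0
beta = 0.95
theta0 = np.array([1.0, 1.0, -1.0, -1.0])

def policy(theta):
    s = np.stack([phi @ theta[:2], phi @ theta[2:]], axis=1)
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def chain(theta):
    return np.einsum("ia,aij->ij", policy(theta), P)

def stationary(P_theta):
    """Solve pi' P = pi' together with sum(pi) = 1."""
    n = P_theta.shape[0]
    A = np.vstack([(np.eye(n) - P_theta).T[:-1], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def average_reward(theta):
    return stationary(chain(theta)) @ r

def exact_gradient(theta):
    """grad eta = pi' (grad P) [I - P + e pi']^{-1} r (assumed closed form)."""
    mu, P_theta = policy(theta), chain(theta)
    pi = stationary(P_theta)
    n = len(r)
    # d mu_a1(i) / d theta = mu_a1 mu_a2 (phi1, phi2, -phi1, -phi2); a2 is the negative.
    dmu_a1 = (mu[:, 0] * mu[:, 1])[:, None] * np.hstack([phi, -phi])
    dP = np.einsum("ik,ij->kij", dmu_a1, P[0] - P[1])
    h = np.linalg.solve(np.eye(n) - P_theta + np.outer(np.ones(n), pi), r)
    return np.einsum("i,kij,j->k", pi, dP, h)

P_theta = chain(theta0)
pi = stationary(P_theta)
eta = pi @ r
J_beta = np.linalg.solve(np.eye(3) - beta * P_theta, r)
expected_J = pi @ J_beta
grad_eta = exact_gradient(theta0)
```

A useful sanity check is that $\mathrm{E} J_\beta(X) = \eta / (1 - \beta)$ when $X$ is drawn from $\pi$, since $\pi' P = \pi'$.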

Figures 6.1 and 6.2 plot the mean and standard deviation (respectively) of the relative norm difference of the gradient estimate from $\nabla\eta$, as a function of the number of time steps. The relative norm difference of a gradient estimate $\Delta$ from the true gradient $\nabla\eta$ is given by

$$\frac{\|\Delta - \nabla\eta\|}{\|\nabla\eta\|}.$$
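As a small illustration, the relative norm difference can be computed as follows. The function name is hypothetical, and the Euclidean norm is assumed because the text does not say which norm is used.

```python
import numpy as np

def relative_norm_difference(estimate, true_gradient):
    """||estimate - true_gradient|| / ||true_gradient|| (Euclidean norm assumed)."""
    estimate = np.asarray(estimate, dtype=float)
    true_gradient = np.asarray(true_gradient, dtype=float)
    return np.linalg.norm(estimate - true_gradient) / np.linalg.norm(true_gradient)

# Made-up vectors: an estimate of zero has relative norm difference 1.
example = relative_norm_difference([0.0, 0.0], [3.0, 4.0])
```

Averaging this quantity, and taking its standard deviation, over independent runs at each time step would give curves of the kind the two figures describe.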

From Figures 6.1 and 6.2 we see that the variants of the GPOMDP algorithm give significant variance reductions over the GPOMDP algorithm. We also see that the optimum baseline gives better performance than the use of the expectation of the discounted value function as a baseline. For this µMDP, the performance difference between the optimum baseline and the optimum constant baseline is small; the optimum baseline of this system, $b^*_Y = (6.35254, 6.35254, 6.26938)'$, is close to a constant function. The optimum constant baseline is $b^* = 6.33837$.
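The effect of a constant baseline on variance can be illustrated with a short simulation. This sketch rests on several assumptions and does not reproduce the thesis's experiment: the baseline estimate is assumed to take the form $\frac{1}{T}\sum_t \frac{\nabla\mu_{U_t}(X_t)}{\mu_{U_t}(X_t)}\,(J_\beta(X_{t+1}) - b)$ with the precomputed $J_\beta$, the case $b = 0$ stands in for the unbaselined estimate, the baseline is $\mathrm{E} J_\beta(X)$ rather than the optimum baseline, and the run count, horizon, and all names are illustrative.

```python
import numpy as np

P = np.array([
    [[0.0, 0.8, 0.2], [0.8, 0.0, 0.2], [0.0, 0.8, 0.2]],   # action a1
    [[0.0, 0.2, 0.8], [0.2, 0.0, 0.8], [0.0, 0.2, 0.8]],   # action a2
])
r = np.array([0.0, 0.0, 1.0])
phi = np.array([[12.0, 6.0], [6.0, 12.0], [5.0, 5.0]]) / 18.0
theta = np.array([1.0, 1.0, -1.0, -1.0])
beta = 0.95

# Softmax policy mu[i, a] and the Markov chain it induces.
s = np.stack([phi @ theta[:2], phi @ theta[2:]], axis=1)
mu = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)
P_theta = np.einsum("ia,aij->ij", mu, P)

# Precomputed discounted value function and its stationary expectation.
J_beta = np.linalg.solve(np.eye(3) - beta * P_theta, r)
A = np.vstack([(np.eye(3) - P_theta).T[:-1], np.ones(3)])
pi = np.linalg.solve(A, np.array([0.0, 0.0, 1.0]))
b_const = pi @ J_beta

def grad_log_policy(i, a):
    """Gradient of log mu_a(i) with respect to theta for the two-action softmax."""
    sign = 1.0 if a == 0 else -1.0
    return sign * mu[i, 1 - a] * np.concatenate([phi[i], -phi[i]])

def baseline_estimate(T, b, rng):
    """Average of grad log mu * (J_beta(next state) - b) along one trajectory."""
    x = 0
    total = np.zeros(4)
    for _ in range(T):
        a = rng.choice(2, p=mu[x])
        x_next = rng.choice(3, p=P[a, x])
        total += grad_log_policy(x, a) * (J_beta[x_next] - b)
        x = x_next
    return total / T

rng = np.random.default_rng(0)
runs, T = 100, 100
plain = np.array([baseline_estimate(T, 0.0, rng) for _ in range(runs)])
with_b = np.array([baseline_estimate(T, b_const, rng) for _ in range(runs)])
var_plain = plain.var(axis=0).sum()      # total variance across the four components
var_with_b = with_b.var(axis=0).sum()
```

Because $J_\beta$ takes similar values in all three states, subtracting a constant close to those values shrinks each summand considerably, which is why `var_with_b` comes out well below `var_plain` in this setup.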

The asymptotic error of the gradient estimate $\Delta^V_T$ is non-zero, as Figure 6.1 shows, which is due to the value of $\lambda$ being fixed when training the value function. However, the expected error of the estimate $\Delta^V_T$ remains smaller than that of the estimate given by the GPOMDP algorithm for all but very large values of $T$, and the standard deviation is always smaller.

In document Policy Gradient Methods: Variance Reduction and Stochastic Convergence (Page 111-115)