5.3 Algorithms for Learning Value Functions
5.3.2 Minimizing Sample Error
A more restrictive approach is to minimize the error seen at each sample,
\[
R = \mathrm{E}\bigl(L(X, U)\,G_\beta(W;\varpi)\bigr)^2,
\]
where the random variable $(X, U, W)$ has probability mass function $\Pr(X = i, U = u, W = j) = \pi_i\,\mu_u(i)\,p_{ij}(u)$. This approach directly drives $V$ towards $J_\beta$ and as such does not aim for a reduction in estimation variance. It produces an algorithm that is very similar to TD(1) (Sutton 1988), but has the benefit that the relative magnitude of the gradient with respect to the policy parameters is taken into account. In this way, more attention is devoted to accuracy in regions of the state space where the gradient is relatively large.
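For the parameterized class of value functions, $\{V(\varpi) : S \to \mathbb{R} \mid \varpi \in \mathbb{R}^m\}$, we can determine the gradient of this quantity:
\[
\begin{aligned}
\nabla_\varpi \tfrac{1}{2} R
&= \nabla_\varpi \tfrac{1}{2}\,\mathrm{E}\bigl(L(X, U)\,G_\beta(W;\varpi)\bigr)^2 \\
&= -\mathrm{E}\Bigl[\bigl(L(X, U)\,(\nabla_\varpi V(W;\varpi))'\bigr)'\,\bigl(L(X, U)\,G_\beta(W;\varpi)\bigr)\Bigr] \\
&= -\mathrm{E}\Bigl[(L(X, U))^2\,\nabla_\varpi V(W;\varpi)\,G_\beta(W;\varpi)\Bigr].
\end{aligned}
\]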
If the value function satisfies Assumption 3, that is, if the value function is bounded and has a bounded first derivative, the gradient $\nabla_\varpi R/2$ may be estimated from a sequence generated by a µMDP; we can use the estimate
\[
\Delta^{R}_{S} = \frac{1}{S}\sum_{s=1}^{S}\bigl(r(X_s) + \beta V(X_{s+1};\varpi) - V(X_s;\varpi)\bigr)\,z_s,
\]
where $z_0 = 0$, and $z_{s+1} = \beta z_s + (L(X_s, U_s))^2\,\nabla_\varpi V(X_{s+1};\varpi)$. The ergodicity and truncation argument, showing that $\Delta^{R}_{S}$ almost surely converges to $\nabla_\varpi R/2$, is the same
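The recursion for the sample-error estimate can be illustrated with a minimal Python sketch. It assumes a tabular value function (one parameter per state), so that the gradient of $V$ at a state is the indicator vector of that state; the function name, the argument layout, and the indexing convention ($X_0, \dots, X_{S+1}$ supplied as a list) are hypothetical choices made for this illustration and are not taken from the text.

```python
def sample_error_gradient_estimate(states, rewards, l_squared, v, beta):
    """Accumulate (1/S) * sum_{s=1}^{S} td_s * z_s for a tabular V (assumed).

    states    -- X_0, ..., X_{S+1} as 0-based state indices (length S + 2)
    rewards   -- rewards[i] = r(i) for each state i
    l_squared -- (L(X_s, U_s))^2 for s = 0, ..., S (length S + 1)
    v         -- tabular value-function parameters, v[i] = V(i)
    beta      -- discount factor in [0, 1)
    """
    S = len(states) - 2
    if S < 1 or len(l_squared) != S + 1:
        raise ValueError("need S >= 1 samples and S + 1 squared-L values")
    z = [0.0] * len(v)        # z_0 = 0
    total = [0.0] * len(v)
    for s in range(S + 1):
        if s >= 1:
            # Temporal-difference term r(X_s) + beta V(X_{s+1}) - V(X_s).
            td = rewards[states[s]] + beta * v[states[s + 1]] - v[states[s]]
            total = [acc + td * zk for acc, zk in zip(total, z)]
        # z_{s+1} = beta * z_s + L_s^2 * grad V(X_{s+1}); for a tabular V the
        # gradient at X_{s+1} is the indicator vector of that state.
        z = [beta * zk for zk in z]
        z[states[s + 1]] += l_squared[s]
    return [acc / S for acc in total]
```

In use, `l_squared` would be computed from the policy along the sampled trajectory; the sketch leaves that step to the caller.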
Experiments
This chapter describes some simulation experiments performed whilst investigating the selection of baselines and value functions.
Section 6.1 describes an experiment on a simple three state MDP where many quantities, such as the optimum baseline, can be calculated. This allows a direct comparison of the performance when using the optimal baselines discussed in Chapter 4 with the performance when using the expected value function as a baseline, and with the performance of the GPOMDP algorithm. The algorithms were altered to use a precalculated value function $J_\beta$ rather than an estimate $J_t$. This allows us to more clearly see the benefit of learning a value function that aims to reduce estimate variance in addition to estimate bias.
Section 6.2 describes an experiment using the same simple three state MDP as Section 6.1. This experiment shows the performance of the baseline estimate and of the value function estimate when the baseline and the value function are each learnt whilst calculating the respective gradient estimates.
Section 6.3 describes an experiment on a larger, target tracking problem. The performance of gradient estimates when using a number of different baselines is compared at various stages of learning the target tracking problem.
6.1 Three State MDP
This section describes an experiment comparing choices of gradient estimates for a simple three state µMDP. We use the three state µMDP as described in Baxter et al. (2001). There are three states $S = \{1, 2, 3\}$, two actions $U = \{a_1, a_2\}$, and four parameters $\theta \in \mathbb{R}^4$. The transition matrices are
\[
P(a_1) = \begin{bmatrix} 0.0 & 0.8 & 0.2 \\ 0.8 & 0.0 & 0.2 \\ 0.0 & 0.8 & 0.2 \end{bmatrix},
\qquad
P(a_2) = \begin{bmatrix} 0.0 & 0.2 & 0.8 \\ 0.2 & 0.0 & 0.8 \\ 0.0 & 0.2 & 0.8 \end{bmatrix},
\]
and the reward function is
\[
r(i) = \begin{cases} 1 & \text{if } i = 3, \\ 0 & \text{otherwise.} \end{cases}
\]
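The transition matrices and reward function above translate directly into data. The following Python sketch is one possible encoding; the variable names and the 0-based state indexing are choices made for this illustration, not part of the original experiment code.

```python
# Transition matrices of the three state MDP. P[u][i][j] is the probability
# of moving from state i + 1 to state j + 1 under action u (0-based indices).
P = {
    "a1": [[0.0, 0.8, 0.2],
           [0.8, 0.0, 0.2],
           [0.0, 0.8, 0.2]],
    "a2": [[0.0, 0.2, 0.8],
           [0.2, 0.0, 0.8],
           [0.0, 0.2, 0.8]],
}


def reward(i):
    """r(i) = 1 if i is state 3 (index 2), and 0 otherwise."""
    return 1.0 if i == 2 else 0.0


# Sanity check: every row of each transition matrix is a distribution.
for action, matrix in P.items():
    for row in matrix:
        assert abs(sum(row) - 1.0) < 1e-12, (action, row)
```

Note that under either action the probability of reaching state 3 does not depend on the current state (0.2 under $a_1$, 0.8 under $a_2$), so the expected next reward is determined by the action alone.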
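The policy is constructed as follows: there are two functions $\phi_1, \phi_2 : S \to \mathbb{R}$ with
\[
\phi_1(1) = \tfrac{12}{18},\quad \phi_1(2) = \tfrac{6}{18},\quad \phi_1(3) = \tfrac{5}{18},\qquad
\phi_2(1) = \tfrac{6}{18},\quad \phi_2(2) = \tfrac{12}{18},\quad \phi_2(3) = \tfrac{5}{18},
\]
and functions $s_1, s_2 : S \times \mathbb{R}^n \to \mathbb{R}$ defined by
\[
s_1(i;\theta) \stackrel{\text{def}}{=} \theta_1\phi_1(i) + \theta_2\phi_2(i),\qquad
s_2(i;\theta) \stackrel{\text{def}}{=} \theta_3\phi_1(i) + \theta_4\phi_2(i);
\]
the policy is then given by
\[
\mu_{a_1}(i;\theta) = \frac{e^{s_1(i;\theta)}}{e^{s_1(i;\theta)} + e^{s_2(i;\theta)}},\qquad
\mu_{a_2}(i;\theta) = 1 - \mu_{a_1}(i;\theta) = \frac{e^{s_2(i;\theta)}}{e^{s_1(i;\theta)} + e^{s_2(i;\theta)}}.
\]
The experiment looked at gradient estimates for the policy at the parameter setting $\theta = (1, 1, -1, -1)'$.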
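As a concrete check of the policy definition, the sketch below evaluates the two action probabilities in each state at the parameter setting used in the experiment. The function and constant names are hypothetical, and subtracting the maximum before exponentiating is a numerical-stability choice of this sketch that leaves the probabilities unchanged.

```python
import math

# Feature functions phi_1, phi_2 on states 1, 2, 3 (stored at indices 0, 1, 2).
PHI1 = [12 / 18, 6 / 18, 5 / 18]
PHI2 = [6 / 18, 12 / 18, 5 / 18]


def policy(i, theta):
    """Return (mu_a1(i; theta), mu_a2(i; theta)) for 0-based state index i."""
    s1 = theta[0] * PHI1[i] + theta[1] * PHI2[i]
    s2 = theta[2] * PHI1[i] + theta[3] * PHI2[i]
    m = max(s1, s2)  # subtract the max before exponentiating, for stability
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)


theta = (1.0, 1.0, -1.0, -1.0)
for i in range(3):
    mu_a1, mu_a2 = policy(i, theta)
    print(f"state {i + 1}: mu_a1 = {mu_a1:.4f}, mu_a2 = {mu_a2:.4f}")
```

At this $\theta$ we have $s_2 = -s_1$, so $\mu_{a_1}(i;\theta) = 1/(1 + e^{-2 s_1(i;\theta)})$. States 1 and 2 both have $s_1 = 1$ and therefore choose $a_1$ with probability about 0.88, while state 3 has $s_1 = 10/18$ and chooses $a_1$ with probability about 0.75.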
In the experiment the gradient $\nabla\eta$ was compared to the gradient estimates produced with a variety of schemes:
- the GPOMDP algorithm;
- the estimate $\Delta^{(+0)}_T(b)$ with a constant baseline set to $\mathrm{E}J_\beta(X)$, where $X$ is a random state variable distributed according to $\pi$, the stationary distribution of the Markov chain formed by the sequence $(X_t)_0^\infty$ generated by the µMDP;
- the estimate $\Delta^{(+0)}_T(b)$ with the optimum constant baseline, described in Theorem 4.4;
- the estimate $\Delta^{(+0)}_T(b_Y)$ with the optimum baseline function, described in Theorem 4.1; and
- the estimate $\Delta^{V}_T$ with a value function that was trained using Algorithm 5.1 with the free parameter $\lambda$ set to $0.001$.
The value function had a distinct parameter for each state, all initially set to zero.
Because of the µMDP's simplicity, a number of quantities can be computed explicitly, including the true gradient $\nabla\eta$, the discounted value function $J_\beta$, the expectation of the discounted value function $\mathrm{E}J_\beta(X)$, the optimal baseline $b^*_Y$, and the optimal constant baseline $b^*$. All algorithms (estimates) used in the experiments were altered such that $J_\beta$ estimates (that is, the estimates formed from a discounted sum of future rewards) were replaced by the precomputed discounted value function; such a change having no effect on the estimate $\Delta^{V}_T$. The data was produced using 500 independent runs, and the approximation parameter $\beta$ was set to a value of $0.95$.
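Some of the explicitly computable quantities can be reproduced in a few lines. The sketch below computes the stationary distribution $\pi$, the discounted value function $J_\beta$ and its expectation $\mathrm{E}J_\beta(X)$ for the policy at $\theta = (1, 1, -1, -1)'$. It assumes the usual definition $J_\beta(i) = \mathrm{E}\bigl[\sum_{t \ge 0} \beta^t r(X_t) \mid X_0 = i\bigr]$, which is not restated in this section, and it uses plain fixed-point iteration rather than a linear solve; all names are hypothetical.

```python
import math

P_A = [[[0.0, 0.8, 0.2], [0.8, 0.0, 0.2], [0.0, 0.8, 0.2]],   # P(a1)
       [[0.0, 0.2, 0.8], [0.2, 0.0, 0.8], [0.0, 0.2, 0.8]]]   # P(a2)
R = [0.0, 0.0, 1.0]
PHI1 = [12 / 18, 6 / 18, 5 / 18]
PHI2 = [6 / 18, 12 / 18, 5 / 18]
THETA = (1.0, 1.0, -1.0, -1.0)
BETA = 0.95


def mu_a1(i, theta):
    """Probability of action a1 in state i (two-action softmax)."""
    s1 = theta[0] * PHI1[i] + theta[1] * PHI2[i]
    s2 = theta[2] * PHI1[i] + theta[3] * PHI2[i]
    return 1.0 / (1.0 + math.exp(s2 - s1))


def chain_matrix(theta):
    """Transition matrix of the Markov chain induced by the policy."""
    rows = []
    for i in range(3):
        m1 = mu_a1(i, theta)
        rows.append([m1 * P_A[0][i][j] + (1.0 - m1) * P_A[1][i][j]
                     for j in range(3)])
    return rows


def stationary_distribution(P, iters=1000):
    """Power iteration pi <- pi P, started from the uniform distribution."""
    pi = [1.0 / 3.0] * 3
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
    return pi


def discounted_value(P, r, beta, iters=1000):
    """Fixed-point iteration J <- r + beta P J (assumed definition of J_beta)."""
    J = [0.0] * 3
    for _ in range(iters):
        J = [r[i] + beta * sum(P[i][j] * J[j] for j in range(3))
             for i in range(3)]
    return J


chain = chain_matrix(THETA)
pi = stationary_distribution(chain)
j_beta = discounted_value(chain, R, BETA)
expected_j = sum(p * v for p, v in zip(pi, j_beta))
print("pi =", pi)
print("J_beta =", j_beta)
print("E J_beta(X) =", expected_j)
```

Under the assumed definition, stationarity of $\pi$ gives $\mathrm{E}J_\beta(X) = \eta/(1-\beta)$, where $\eta = \sum_i \pi_i r(i)$ is the average reward; this identity is a convenient consistency check on the output.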
Figures 6.1 and 6.2 plot the mean and standard deviation (respectively) of the relative norm difference of the gradient estimate from $\nabla\eta$, as a function of the number of time steps. The relative norm difference of a gradient estimate $\Delta$ from the true gradient $\nabla\eta$ is given by
\[
\frac{\|\Delta - \nabla\eta\|}{\|\nabla\eta\|}.
\]
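The error measure is simple to compute once an estimate and the true gradient are available as vectors. The following sketch assumes the Euclidean norm, which the text does not state explicitly; the function name is hypothetical.

```python
import math


def relative_norm_difference(estimate, true_gradient):
    """Return ||estimate - true_gradient|| / ||true_gradient|| (Euclidean norm assumed)."""
    diff = math.sqrt(sum((d - g) ** 2 for d, g in zip(estimate, true_gradient)))
    norm = math.sqrt(sum(g ** 2 for g in true_gradient))
    if norm == 0.0:
        raise ValueError("true gradient has zero norm")
    return diff / norm
```

A value of 0 means the estimate equals the true gradient, and a value of 1 is obtained, for example, by the all-zero estimate. The curves in the figures would then be the mean and standard deviation of this quantity over the independent runs at each number of time steps.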
From Figures 6.1 and 6.2 we see that the variants of the GPOMDP algorithm give significant variance reductions over the GPOMDP algorithm. We also see that the optimum baseline gives better performance than the use of the expectation of the discounted value function as a baseline. For this µMDP, the performance difference between the optimum baseline and the optimum constant baseline is small; the optimum baseline of this system, $b^*_Y = (6.35254, 6.35254, 6.26938)'$, is close to a constant function. The optimum constant baseline is $b^* = 6.33837$.
The asymptotic error of the gradient estimate $\Delta^{V}_T$ is non-zero, as Figure 6.1 shows, which is due to the value of $\lambda$ being fixed when training the value function. However, the expected error of the estimate $\Delta^{V}_T$ remains smaller than that of the estimate given by the GPOMDP algorithm for all but very large values of $T$, and the standard deviation is always smaller.
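For reference, the plain GPOMDP estimate that serves as the comparison point can be sketched for this three state problem as follows. This is the standard eligibility-trace form of the algorithm, not the altered version with a precomputed $J_\beta$ used in the experiments; the start state, the function names, and the convention that the reward is collected on arrival in the next state are assumptions of this sketch.

```python
import math
import random

P = {0: [[0.0, 0.8, 0.2], [0.8, 0.0, 0.2], [0.0, 0.8, 0.2]],   # action a1
     1: [[0.0, 0.2, 0.8], [0.2, 0.0, 0.8], [0.0, 0.2, 0.8]]}   # action a2
PHI1 = [12 / 18, 6 / 18, 5 / 18]
PHI2 = [6 / 18, 12 / 18, 5 / 18]


def policy_and_score(i, theta):
    """Return action probabilities and grad(mu_u) / mu_u for both actions."""
    s1 = theta[0] * PHI1[i] + theta[1] * PHI2[i]
    s2 = theta[2] * PHI1[i] + theta[3] * PHI2[i]
    mu1 = 1.0 / (1.0 + math.exp(s2 - s1))
    d = [PHI1[i], PHI2[i], -PHI1[i], -PHI2[i]]    # gradient of s1 - s2
    score_a1 = [(1.0 - mu1) * x for x in d]       # grad log mu_a1
    score_a2 = [-mu1 * x for x in d]              # grad log mu_a2
    return (mu1, 1.0 - mu1), (score_a1, score_a2)


def gpomdp(theta, beta, T, rng):
    """Run T steps from state 1 (assumed) and return the gradient estimate."""
    x = 0
    z = [0.0] * 4          # eligibility trace
    delta = [0.0] * 4      # running average of r * z
    for t in range(T):
        mu, score = policy_and_score(x, theta)
        u = 0 if rng.random() < mu[0] else 1
        x = rng.choices(range(3), weights=P[u][x])[0]
        r = 1.0 if x == 2 else 0.0
        z = [beta * zk + sk for zk, sk in zip(z, score[u])]
        delta = [dk + (r * zk - dk) / (t + 1) for dk, zk in zip(delta, z)]
    return delta


estimate = gpomdp((1.0, 1.0, -1.0, -1.0), 0.95, 10_000, random.Random(0))
print(estimate)
```

Because the policy depends on $\theta$ only through $s_1 - s_2$, every score vector is a multiple of $(\phi_1(i), \phi_2(i), -\phi_1(i), -\phi_2(i))$, so the last two components of the estimate are the negatives of the first two.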