
6.2 The Exp-GPOMDP Algorithm

6.2.2 An Alternative Exp-GPOMDP Algorithm

There may be alternative ways to Rao-Blackwellise Exp-GPOMDP that result in new algorithms. For example, consider an algorithm that tracks the I-state belief but, instead of generating actions from the expectation over all I-states (6.3), samples an I-state g_{t+1} from α_{t+1}(·|φ, ȳ_t) and then samples the action from µ(u_t|θ, g_{t+1}, y_t) only. Using such a scheme, the execution of a single step at time t might be:

1. update the I-state belief using (6.2);

2. sample g_{t+1} from α_{t+1}(·|φ, ȳ_t);

3. sample u_t from µ(·|θ, g_{t+1}, y_t);

4. compute the gradient contributions for the eligibility trace.

Suppose at time t we sample I-state g_{t+1} from α_{t+1}; then the algorithm would compute the expectation over all I-state trajectories that lead to I-state g_{t+1}, instead of over all I-state trajectories. Such an algorithm would be useful due to its lower per-step complexity than Exp-GPOMDP, while hopefully still having lower variance than IState-GPOMDP. Because the alternative algorithm makes partial use of α_t, we plotted it halfway up the I-state axis of Figure 3.2. The derivation and convergence properties of such algorithms need to be investigated.
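Steps 1 to 3 of the scheme above can be sketched as follows. This is a minimal illustration under stated assumptions, not a derived algorithm: it assumes ω and µ are soft-max lookup tables indexed as `phi[g, y, h]` and `theta[h, y, u]`, and it assumes the belief update (6.2) takes the usual forward-recursion form α_{t+1}(h) = Σ_g α_t(g) ω(h|φ, g, y_t). The function name `alternative_step` is hypothetical. Step 4 is omitted because the gradient contributions for this variant have not been derived.

```python
import numpy as np

def softmax(x):
    # Numerically stable soft-max over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alternative_step(alpha, phi, theta, y, rng):
    """One step of the sampled-I-state variant (steps 1 to 3 only).

    alpha          : current I-state belief, shape (n_g,)
    phi[g, y, h]   : logits of omega(h | phi, g, y)   (assumed lookup table)
    theta[h, y, u] : logits of mu(u | theta, h, y)    (assumed lookup table)
    y              : index of the current observation
    """
    omega_y = softmax(phi[:, y, :])            # omega_y[g, h]
    alpha_next = alpha @ omega_y               # step 1: belief update, O(|G|^2)
    g_next = rng.choice(len(alpha_next), p=alpha_next)  # step 2: sample I-state
    mu = softmax(theta[g_next, y, :])          # action distribution for g_next only
    u = rng.choice(len(mu), p=mu)              # step 3: sample action
    return alpha_next, g_next, u

rng = np.random.default_rng(0)
n_g, n_y, n_u = 3, 2, 2
phi = rng.normal(size=(n_g, n_y, n_g))
theta = rng.normal(size=(n_g, n_y, n_u))
alpha = np.full(n_g, 1.0 / n_g)
alpha, g_next, u = alternative_step(alpha, phi, theta, y=1, rng=rng)
```

In this sketch the belief update is still a matrix-vector product over all I-states; only the action distribution is evaluated for the single sampled I-state.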

6.2.3 Related Work

Using a recursively updated forward probability variable α to compute an expectation over all possible paths through a state lattice is an important aspect of hidden Markov model (HMM) training [Rabiner, 1989]. By viewing state transitions as being driven by the observations, and viewing actions as symbols generated by the HMM, Exp-GPOMDP becomes a method for training Input/Output HMMs without using the backward probability component of HMM training. An important difference compared to HMM training is that Exp-GPOMDP does not seek to maximise the conditional likelihood of any particular sequence.

Exp-GPOMDP is similar to the finite-horizon algorithm presented by Shelton [2001b], which is a gradient-ascent HMM algorithm where emissions are actions. In that paper the HMM backward probability is used as well as the forward probability α. In the infinite-horizon setting there is no natural point to begin the backward probability calculation, so our algorithm uses only α.

⁴ This assumption is probably automatically satisfied under the existing assumption that the process {g_t, i_t} is ergodic (see Assumption 1, Section 3.2), but this is yet to be proven.

6.3 Summary

Key Points

I IOHMM-GPOMDP uses IOHMMs to predict rewards, attempting to reveal the hidden state relevant to predicting rewards. Memory-less IState-GPOMDP then learns a policy based on the IOHMM-state belief and the current observations.

II IOHMM-GPOMDP is not guaranteed to converge to a local maximum of the long-term reward η, and the IOHMM may not reveal enough hidden state to allow the best policy to be learnt.

III The I-state model is given by the known function ω(h|φ, g, y), thus we can compute the expectation over I-state trajectories.

IV Exp-GPOMDP computes this expectation, reducing the variance of the gradient estimate.

V Exp-GPOMDP complexity scales quadratically with the number of I-states, preventing large numbers of I-states being used.

Future Work

Further analysis is required to determine the POMDPs for which the IOHMM-GPOMDP method will converge and work better than Exp-GPOMDP. The further-work discussion for IState-GPOMDP also applies to Exp-GPOMDP; that is, many variance reduction methods from the literature can be applied. The alternative Exp-GPOMDP algorithm needs to be implemented and tested on the scenarios described in Chapter 8.

We compute µ̄(·|φ, θ, ȳ_t) by taking the expectation of µ(u|θ, h, y) over all I-states for each action. We could instead implement µ̄(·|φ, θ, ȳ_t) directly, using function approximation. For example, we could build a neural network implementing µ̄(·|φ, θ, ȳ_t), propagating gradients back to the α inputs to compute derivatives with respect to φ. This could result in improved policies because the action choice can take into account features such as the relative probability of each I-state. We apply a more direct implementation of µ̄(u|φ, θ, ȳ_t) for the speech recognition experiments of Section 10.3.
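The expectation described at the start of the paragraph above, µ̄(u) = Σ_h α(h) µ(u|θ, h, y), can be illustrated with a short sketch. It assumes a soft-max lookup table `theta[h, y, u]` for µ, and the function name `expected_action_distribution` is hypothetical. It does not show the function-approximation alternative.

```python
import numpy as np

def softmax(x):
    # Numerically stable soft-max over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def expected_action_distribution(alpha, theta, y):
    """Belief-weighted action distribution mu_bar(u) = sum_h alpha(h) mu(u | theta, h, y).

    alpha          : I-state belief, shape (n_g,)
    theta[h, y, u] : logits of mu(u | theta, h, y)  (assumed lookup table)
    """
    mu = softmax(theta[:, y, :])   # mu[h, u], one row per I-state
    return alpha @ mu              # expectation over I-states, shape (n_u,)

rng = np.random.default_rng(0)
n_g, n_y, n_u = 3, 2, 4
theta = rng.normal(size=(n_g, n_y, n_u))
alpha = np.array([0.7, 0.2, 0.1])
mu_bar = expected_action_distribution(alpha, theta, y=0)

# With all belief mass on one I-state, mu_bar reduces to that I-state's mu.
one_hot = np.array([0.0, 1.0, 0.0])
mu_bar_one_hot = expected_action_distribution(one_hot, theta, y=0)
```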

Finally, we have claimed without proof that Exp-GPOMDP has lower variance than IState-GPOMDP. Casella and Robert [1996] prove that Rao-Blackwellisation applied to Monte-Carlo methods has a variance-reducing effect. We hope to use similar proof methods to provide theoretical guarantees about the degree of variance reduction achieved by using Exp-GPOMDP instead of IState-GPOMDP. Chapter 8 provides empirical evidence for the variance reduction.

Small FSC Gradients

If an algorithm is going to fail, it should have the decency to quit soon.

—Gene Golub

Previous advocates of direct search in the space of FSCs [Meuleau et al., 1999a,b, Peshkin et al., 1999, Lanzi, 2000] report success, but only on POMDPs with a few tens of states that need only a few bits of memory to solve. The Load/Unload problem (see Figure 2.2(a)) is a common example. Meuleau et al. [1999a] and Lanzi [2000] comment briefly on the difficulties experienced with larger POMDPs. In this chapter we analyse one reason for these difficulties in the policy-gradient setting. Our analysis suggests a trick that allows us to scale FSCs to more interesting POMDPs, as demonstrated in Chapter 8.

7.1 Zero Gradient Regions of FSCs

In our early experiments we observed that policy-gradient FSC methods initialised with small random parameter values failed to learn to use the I-states for non-trivial scenarios. This was because the gradient of the average reward η with respect to the I-state transition parameters φ was too small. Increasing max_l |φ_l| (the range of the random number generator) helps somewhat, but increasing this value too much results in the agent starting near a local maximum because the soft-max function is saturated. The fundamental cause of this problem is that, with small random parameters, the I-state transition probabilities are close to uniform. This means the I-states are essentially undifferentiated. If, in addition to the I-state transitions being close to uniform, the action probabilities are similar for each I-state g, then varying the trajectory through I-states will not substantially affect the reward. The net result is that the gradient of the average reward with respect to the I-state transition parameters will be close to 0. Hence, policies whose starting distributions are close to uniform will be close to a point of 0 gradient with respect to the internal-state parameters, and will tend to exhibit poor behaviour in gradient-based optimisation. The following theorem formalises this argument.

Theorem 6. If we choose θ and φ such that ω(h|φ, g, y) = ω(h|φ, g′, y) ∀ h, g, g′, y and µ(u|θ, h, y) = µ(u|θ, h′, y) ∀ u, h, h′, y, then ∇_φ η = [0].

This theorem is proved in Appendix A.3. Even if the conditions of the theorem are met when we begin training, they may be violated by updates to the parameters θ during training, allowing a useful finite state controller to be learnt. Appendix A.3 also analyses this situation, establishing an additional condition for perpetual failure to learn a finite state controller, despite changes to θ during learning. The additional condition for lookup tables is ω(h|φ, g, y) = 1/|G| ∀ g, h, y, which is satisfied by a table initialised to a constant value, such as 0.
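The statement of Theorem 6 can be checked numerically on a toy problem. The sketch below is an illustration only, not the proof of Appendix A.3. It builds a small random POMDP (all sizes, the names `average_reward` and `fd_gradient_phi`, and the soft-max lookup-table parameterisation are assumptions made for this example), computes η exactly from the stationary distribution of the joint chain over world states and I-states, and estimates ∇_φ η by central finite differences. The reward is assumed to depend only on the world state.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def average_reward(phi, theta, q, nu, r):
    """Exact long-term average reward eta of an FSC agent on a small POMDP.

    phi[g, y, h]   : logits of omega(h | phi, g, y)
    theta[h, y, u] : logits of mu(u | theta, h, y)
    q[i, u, j]     : world-state transition probability q(j | i, u)
    nu[i, y]       : observation probability nu(y | i)
    r[i]           : reward received in world state i
    """
    omega, mu = softmax(phi, axis=2), softmax(theta, axis=2)
    n_s, n_g = q.shape[0], phi.shape[0]
    # Joint transition (i, g) -> (j, h), summing over observations y and actions u.
    P = np.einsum("iy,gyh,hyu,iuj->igjh", nu, omega, mu, q)
    P = P.reshape(n_s * n_g, n_s * n_g)
    # Stationary distribution: solve pi P = pi subject to sum(pi) = 1.
    A = P.T - np.eye(n_s * n_g)
    A[-1, :] = 1.0
    b = np.zeros(n_s * n_g)
    b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    return float(pi.reshape(n_s, n_g).sum(axis=1) @ r)

def fd_gradient_phi(phi, theta, q, nu, r, eps=1e-5):
    # Central finite-difference estimate of d eta / d phi, one entry at a time.
    grad = np.zeros_like(phi)
    for idx in np.ndindex(*phi.shape):
        d = np.zeros_like(phi)
        d[idx] = eps
        grad[idx] = (average_reward(phi + d, theta, q, nu, r)
                     - average_reward(phi - d, theta, q, nu, r)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n_s, n_y, n_u, n_g = 3, 2, 2, 2
q = rng.dirichlet(np.ones(n_s), size=(n_s, n_u))   # shape (n_s, n_u, n_s)
nu = rng.dirichlet(np.ones(n_y), size=n_s)         # shape (n_s, n_y)
r = rng.normal(size=n_s)

phi = np.zeros((n_g, n_y, n_g))                    # uniform omega, same for every g
theta_shared = np.repeat(rng.normal(size=(1, n_y, n_u)), n_g, axis=0)  # mu independent of h
theta_distinct = rng.normal(size=(n_g, n_y, n_u))  # mu differs across I-states

grad_shared = fd_gradient_phi(phi, theta_shared, q, nu, r)
grad_distinct = fd_gradient_phi(phi, theta_distinct, q, nu, r)
print("max |d eta / d phi|, mu shared across I-states:", np.abs(grad_shared).max())
print("max |d eta / d phi|, mu distinct per I-state:  ", np.abs(grad_distinct).max())
```

When both conditions of the theorem hold, the finite-difference gradient should vanish up to numerical error. The second print shows the case where µ differs across I-states, for comparison.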

The same problem has been observed in the setting of learning value functions when the policy can set memory bits [Lanzi, 2000]. Multiple trajectories through the memory states have the same reward, which Lanzi calls aliasing on the payoffs. A solution was hinted at by Meuleau et al. [1999a] where it was noted that finite state controllers were difficult to learn using Monte-Carlo approaches unless strong structural constraints were imposed. Imposing structural constraints without using domain knowledge is the essence of our proposed solution to small FSC gradients.

Recurrent neural networks (RNNs) provide an alternative memory mechanism that is described in Section 2.6.2.3. Like FSC agents, the RNN can be decomposed into a component that makes internal-state transitions and a component that chooses actions. However, the internal state is a vector of real numbers rather than an element from a finite set. This introduces its own problems, such as internal-state gradients that vanish or explode [Hochreiter and Schmidhuber, 1997], resulting in implementations that can handle no more I-states than FSC agents [Lin and Mitchell, 1992].¹

One interesting exception to the poor performance of FSC methods is Glickman and Sycara [2001], where an Elman network (an RNN where all outputs are fed back to the hidden layer) parameterises an agent for the New York Driving scenario. This POMDP has over 21,000 states and requires multiple bits of memory. Surprisingly, Elman networks trained using an evolutionary algorithm outperformed the UTREE algorithm (see Section 2.6.2.2). The best performance was achieved using hidden units that emitted 1 or 0 stochastically. The output distribution for each hidden unit is generated by a sigmoid function of the weighted sum of that unit's inputs. Because the hidden units output 1 or 0, the set of possible feedback vectors into the Elman network is finite and the transition from one feedback vector to another is stochastic. For this reason we can consider Glickman’s stochastic Elman networks to be an FSC algorithm. Performance dropped when non-stochastic RNN hidden units were used.

¹ Lin and Mitchell [1992] does apply RNN methods to the reasonably hard Pole Balancing scenario, but it is not clear that memory is necessary to solve this task. Lin’s results also show that using finite histories worked better than RNNs for this scenario.

These results show that FSC methods can perform better than other RNNs and finite history methods, and that FSCs can scale to interesting problems. The results also raise the following question: why is policy evolution immune to the undifferentiated I-state problem seen in policy gradient and value function approaches? One explanation is that evolutionary methods do not get stuck in local maxima. Evolutionary systems keep generating random agents until one works.

Normally, gradient methods rely on converging to a local maximum from some set of initial parameters that makes as few assumptions as possible about the scenario. In the case of policy-gradient algorithms this is the set of parameters that generates uniform ω(·|φ, g, y) and µ(·|θ, g, y) distributions. However, Theorem 6 tells us that this initialisation will result in ∇_φ η = 0. We tried to apply the trick used in neural networks for similar situations: initialising the weights with small random values. However, to obtain a reasonable gradient we had to start with large weights that begin to saturate the soft-max function. Consequently, the system again starts near a poor maximum.

Thus, we are caught between two competing sets of maxima: (1) FSC transition gradients are small for small weights, (2) FSC transition gradients are small for large weights. Empirically, we found the best middle point was to initialise weights randomly in the range [−0.5, 0.5]; however, convergence was unreliable. For example, the trivial Load/Unload scenario failed to converge around 50% of the time. The next section describes a simple trick that allows us to obtain reliable convergence on the Load/Unload scenario and allows us to scale FSC methods to more complex scenarios.
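The two regimes of the soft-max function referred to above can be seen in a small sketch. It scales one fixed random logit vector (an arbitrary choice for this example) and reports the entropy of the resulting distribution and the Frobenius norm of the soft-max Jacobian diag(p) − p pᵀ. It only illustrates near-uniform output for small weights and saturation for large weights. It does not compute ∇_φ η, whose smallness for small weights comes from the argument of Theorem 6 rather than from the soft-max Jacobian.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; log(n) for a uniform distribution over n outcomes.
    return float(-(p * np.log(p)).sum())

def softmax_jacobian_norm(p):
    # Jacobian of the soft-max output with respect to its logits: diag(p) - p p^T.
    return float(np.linalg.norm(np.diag(p) - np.outer(p, p)))

rng = np.random.default_rng(0)
base = rng.normal(size=4)          # fixed direction in logit space
entropies = {}
for scale in (0.01, 0.5, 10.0):
    p = softmax(scale * base)
    entropies[scale] = entropy(p)
    print(f"scale={scale:5.2f}  p={np.round(p, 3)}  "
          f"entropy={entropies[scale]:.4f}  |J|={softmax_jacobian_norm(p):.4f}")
```

For the smallest scale the distribution is almost uniform (entropy close to log 4), so the I-states are nearly undifferentiated. As the scale grows the entropy falls and the distribution concentrates on one outcome.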

In document Policy-Gradient Algorithms for Partially Observable Markov Decision Processes (Page 105-111)