Summary - Policy-Gradient Algorithms for Partially Observable Markov Decision Processes

Table 2.1 summarises the model-based algorithms described in this chapter, and Table 2.2 summarises the model-free algorithms. Particular attention is paid to comparing how the ω(h|φ, g, y) and µ(u|θ, h, y) processes are parameterised for each algorithm.

Key Points

I POMDP methods can be split along several axes:

• model-based learning or model-free learning;

• exact or approximate agents;

• policies inferred from values or learnt directly;

• policies computed analytically or via simulation.

II Solving POMDPs exactly is PSPACE-hard, so we need tractable approximate methods.

III POMDP agents need memory to act optimally.

IV Policy-gradient methods do not waste effort learning and representing values that are discarded after the policy is formed.

V Policy-gradient methods guarantee convergence to at least a local maximum, but they may take longer to do so than value methods due to high variance in the gradient estimates.

Some fields of POMDP research have been left out of this discussion because they are not pertinent to this thesis. They include:

• Frameworks that sit between MDPs and POMDPs [Pendrith and McGarity, 1998, Zubeck and Dietterich, 2000];

• Hierarchical POMDPs [Dietterich, 2000, Theocharous et al., 2000, Hernandez-Gardio and Mahadevan, 2001, Thiébaux et al., 2002];

• Multiple time scale and continuous time POMDPs [Precup, 2000, Ghavamzadeh and Mahadevan, 2001].

The remainder of this thesis is only concerned with policy-gradient methods where memory is provided by FSCs or HMMs.

Table 2.1: A summary of the model-based algorithms described in this chapter. The ω column contains d if the internal state update is deterministic and s if the update is stochastic. Similarly, the µ column indicates if the choice of action is deterministic or stochastic. Uppercase D indicates that the function is fixed instead of learnt. The φ parameters and θ parameters entries describe how φ parameterises ω(h|φ, g, y) and θ parameterises µ(u|θ, h, y).

| Method | ω | µ | φ parameters | θ parameters |
|---|---|---|---|---|
| MDP [§2.4], e.g. Value Iteration | d | d | Fully observable, therefore no memory required | θ stores long-term value of each state |
| Exact [§2.5.1], e.g. Incremental Pruning | D | d | I-state is $b_t$, φ represented by $q(j \mid i, u)$ and $\nu(y \mid i)$ | Piecewise linear hyperplanes or policy graph |
| Heuristic [§2.5.2.1] | D | d | As for exact | Approximation to piecewise linear hyperplanes |
| Grid [§2.5.2.2] | D | d | As for exact | $\mu(u \mid \theta, h, y)$ interpolates grid point values stored by θ |
| Factored belief [§2.5.3.1] | d | d | φ encodes Bayesian network or algebraic decision tree | Could be any of the previous parameterisations |
| Factored value [§2.5.3.2] | d | d | Any of previous, including exact and factored | Linear combinations of basis functions or ADDs |
| Planning [§2.5.4] | d | D | Any belief state tracking method | $\mu(u \mid b, y)$ searches space of future trajectories |
| RTDP-Bel [§2.5.5] | d | d | Any method to track sampled beliefs $b_t$ | θ stores value of all visited belief states |
| SPOVA & Linear-Q [§2.5.5] | d | d | Any method to track sampled beliefs $b_t$ | Smooth approx. of exact $\bar{J}_\beta$ learnt by gradient ascent |
| Particle Filters [§2.5.6] | d | d | φ represents tracked $b_t$ as n particles | KNN to infer value for $b_t$ represented by particles |
| Policy Iteration [§2.5.7] | d | d | φ is a policy graph (PG) converted to $\bar{J}_\beta$ for learning steps | θ maps policy graph nodes to actions, learnt during DP |
| Depth first PG search [§2.5.7] | d | d | φ chosen by search of a tree of constrained PGs | θ maps PG nodes to actions |
| Gradient ascent of PGs [§2.5.7] | s | s | φ is PG transition probs. learnt by gradient ascent | θ stochastically maps PG nodes to actions |
| Approx. Gradient PG [§2.5.7] | s | s | PG transition probabilities controlled by ANN | ANN maps PG nodes to action probabilities |
| Actor-Critic & VAPS [§2.5.8] | D | s | Usually no internal state | |

Table 2.2: A summary of the model-free algorithms described in this chapter. Each column has the same meaning as Table 2.1. The tables are not a complete summary of all POMDP algorithms.

| Method | ω | µ | φ parameters | θ parameters |
|---|---|---|---|---|
| JSJ Q-learning [§2.6.1] | D | s | Assume $y_t = i_t$, so no internal state | θ stores long-term values of each observation |
| HMM methods [§2.6.2.1] | d | s | φ is HMM transition probabilities | θ is action probs. or value of HMM belief states |
| Window-Q [§2.6.2.2] | D | d | φ deterministically records last n observations $\bar{y}$ | θ is ANN weights mapping $\bar{y}$ to values |
| UTREE [§2.6.2.2] | D | d | φ deterministically records last n observations $\bar{y}$ | θ represents tree; follow $\bar{y}$ branch to get $u_t$ |
| RNNs [§2.6.2.3] | d | d | RNN maps $y_t$ & RNN state output to new state output | RNN maps $y_t$ & RNN state output to actions or values |
| Evolutionary [§2.6.2.4] | s | D | φ is RNN trained using EAs & stochastic sigmoid outputs | θ weights sigmoid outputs to select actions |
| FSCs [§2.6.2.5] | s | s | φ is prob. of I-state transition $g \to h$ | θ is prob. of $u_t$ given I-state h |
| Williams' REINFORCE [§2.6.3] | s | s | I-states may be changed by memory setting actions | Grad ascent of θ that maps $y_t \to u_t$ |
| GPOMDP [§2.6.3] | s | s | Learning φ is the subject of Chapter 5 | |

Stochastic Gradient Ascent of FSCs

He had bought a large map representing the sea,
Without the least vestige of land:
And the crew were much pleased when they found it to be
A map they could all understand.

—Charles Lutwidge Dodgson

Our aim is to maximise η(φ, θ, i, g), the long-term average reward (2.1), by adjusting the parameters of the agent in the direction of the gradient ∇η(φ, θ, i, g). Before Chapters 4–6 describe several algorithms for doing this, we use this chapter to state the key quantities and assumptions we rely on to ensure the existence of ∇η(φ, θ, i, g). First, we show how to construct a single Markov chain from the world-state, the I-state in the form of a finite state controller (FSC), and the policy. Then we show how to construct the functions ω(h|φ, g, y) and µ(u|θ, h, y) to represent an FSC and policy such that the necessary assumptions are satisfied. This is achieved using the soft-max function, which generates distributions from the real-valued output of a function approximator. The soft-max function is used for all the experiments documented in this thesis. We also briefly describe our conjugate gradient ascent procedure, with details deferred to Appendix B.
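As a rough illustration of how a soft-max turns real-valued outputs into a distribution, the following minimal sketch may help. The function name `soft_max` and the example scores are assumptions made here for illustration; the parameterisation actually used for ω(h|φ, g, y) and µ(u|θ, h, y) is the one this chapter goes on to define.

```python
import math

def soft_max(outputs):
    """Map a list of real-valued outputs to a probability distribution."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    largest = max(outputs)
    exps = [math.exp(o - largest) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of a function approximator for three actions.
probs = soft_max([2.0, 1.0, -1.0])
```

Every entry of `probs` is strictly positive and varies smoothly with the inputs, which is the sort of property a gradient-based method depends on.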

3.1 The Global-State Markov Chain

Recall from Section 2.1 that the transition probabilities governing the world-state Markov process are described by q(j|i, u). Similarly, the transition probabilities between I-states are described by the FSC transition probabilities ω(h|φ, g, y). From Meuleau et al. [1999a, Thm. 1], the evolution of global states (world-state and I-state pairs (i, g)) is also Markov, with an |S||G| × |S||G| transition probability matrix P(φ, θ). The entry in row (i, g) and column (j, h) is given by

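\begin{equation}
p(j, h \mid \phi, \theta, i, g) = \sum_{y \in Y} \sum_{u \in U} \nu(y \mid i)\, \omega(h \mid \phi, g, y)\, \mu(u \mid \theta, h, y)\, q(j \mid i, u). \tag{3.1}
\end{equation}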

This equation computes the expectation, over all observations and actions, of the global-state transition (i, g) → (j, h). The model of the world given by ν(y|i) and q(j|i, u) must be known before P(φ, θ) can be computed explicitly.
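When the model is known, the global-state matrix can be assembled directly by summing ν(y|i) ω(h|φ, g, y) µ(u|θ, h, y) q(j|i, u) over observations y and actions u. The sketch below is one possible way to do this. The nested-list layout (`nu[i][y]`, `omega[g][y][h]`, `mu[h][y][u]`, `q[i][u][j]`), the row index `i * |G| + g`, and the numerical values are all assumptions chosen for illustration.

```python
def global_transition_matrix(nu, omega, mu, q):
    """Build P with P[(i,g)][(j,h)] = sum_{y,u} nu[i][y] * omega[g][y][h] * mu[h][y][u] * q[i][u][j]."""
    n_s, n_y = len(nu), len(nu[0])   # world states, observations
    n_g, n_u = len(omega), len(q[0])  # I-states, actions
    size = n_s * n_g
    P = [[0.0] * size for _ in range(size)]
    for i in range(n_s):
        for g in range(n_g):
            for j in range(n_s):
                for h in range(n_g):
                    # Global state (i, g) is stored at index i * n_g + g.
                    P[i * n_g + g][j * n_g + h] = sum(
                        nu[i][y] * omega[g][y][h] * mu[h][y][u] * q[i][u][j]
                        for y in range(n_y)
                        for u in range(n_u)
                    )
    return P

# Hypothetical model: 2 world states, 2 observations, 2 I-states, 2 actions.
nu = [[0.9, 0.1], [0.2, 0.8]]                                 # nu[i][y]
omega = [[[0.7, 0.3], [0.4, 0.6]], [[0.5, 0.5], [0.1, 0.9]]]  # omega[g][y][h]
mu = [[[0.6, 0.4], [0.3, 0.7]], [[0.2, 0.8], [0.5, 0.5]]]     # mu[h][y][u]
q = [[[0.8, 0.2], [0.1, 0.9]], [[0.3, 0.7], [0.6, 0.4]]]      # q[i][u][j]

P = global_transition_matrix(nu, omega, mu, q)
```

Because each factor is a conditional distribution, every row of `P` sums to one, which is a convenient sanity check on the indexing.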

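A step in computing ∇η(φ, θ, i, g) needs the gradient of the global-state transition matrix

\begin{align}
\nabla P &= [\nabla_\phi P, \nabla_\theta P] \tag{3.2} \\
&= \left[ \frac{\partial P}{\partial \phi_1}, \dots, \frac{\partial P}{\partial \phi_{n_\phi}}, \frac{\partial P}{\partial \theta_1}, \dots, \frac{\partial P}{\partial \theta_{n_\theta}} \right]. \tag{3.3}
\end{align}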

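The partial derivative of the matrix P with respect to parameter φ_l, where l ∈ {1, . . . , n_φ}, is the element-wise derivative

\[
\frac{\partial P}{\partial \phi_l} =
\begin{bmatrix}
\frac{\partial p(1, 1 \mid \phi, \theta, 1, 1)}{\partial \phi_l} & \cdots & \frac{\partial p(|S|, |G| \mid \phi, \theta, 1, 1)}{\partial \phi_l} \\
\vdots & \ddots & \vdots \\
\frac{\partial p(1, 1 \mid \phi, \theta, |S|, |G|)}{\partial \phi_l} & \cdots & \frac{\partial p(|S|, |G| \mid \phi, \theta, |S|, |G|)}{\partial \phi_l}
\end{bmatrix}.
\]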

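Each element of each ∂P/∂φ_l is given by

\begin{equation}
\frac{\partial p(j, h \mid \phi, \theta, i, g)}{\partial \phi_l} = \sum_{y, u} \nu(y \mid i)\, \frac{\partial \omega(h \mid \phi, g, y)}{\partial \phi_l}\, \mu(u \mid \theta, h, y)\, q(j \mid i, u). \tag{3.4}
\end{equation}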

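The corresponding entries for ∂P/∂θ_c, where c ∈ {1, . . . , n_θ}, are

\begin{equation}
\frac{\partial p(j, h \mid \phi, \theta, i, g)}{\partial \theta_c} = \sum_{y, u} \nu(y \mid i)\, \omega(h \mid \phi, g, y)\, \frac{\partial \mu(u \mid \theta, h, y)}{\partial \theta_c}\, q(j \mid i, u). \tag{3.5}
\end{equation}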
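The derivative entries have the same structure as the entries of P itself, with ω (or µ) replaced by its partial derivative. The sketch below checks the φ case numerically against a central finite difference. It assumes a hypothetical lookup-table parameterisation in which ω(·|φ, g, y) is the soft-max of a score vector `phi[g][y]`; the function names, array layouts, and numbers are illustrative assumptions rather than the thesis's implementation.

```python
import copy
import math

def soft_max(xs):
    """Turn real-valued scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def omega_table(phi):
    """Assumed parameterisation: omega[g][y][h] is the soft-max over h of phi[g][y]."""
    return [[soft_max(scores) for scores in per_g] for per_g in phi]

def d_omega_table(phi, g0, y0, h0):
    """Derivative of omega_table(phi) with respect to the scalar phi[g0][y0][h0]."""
    om = omega_table(phi)[g0][y0]
    d = [[[0.0] * len(scores) for scores in per_g] for per_g in phi]
    for h in range(len(om)):
        # Soft-max derivative: om[h] * (1{h == h0} - om[h0]).
        d[g0][y0][h] = om[h] * ((1.0 if h == h0 else 0.0) - om[h0])
    return d

def chain_sum(nu, a, b, q):
    """M[(i,g)][(j,h)] = sum_{y,u} nu[i][y] * a[g][y][h] * b[h][y][u] * q[i][u][j]."""
    n_s, n_y, n_g, n_u = len(nu), len(nu[0]), len(a), len(q[0])
    M = [[0.0] * (n_s * n_g) for _ in range(n_s * n_g)]
    for i in range(n_s):
        for g in range(n_g):
            for j in range(n_s):
                for h in range(n_g):
                    M[i * n_g + g][j * n_g + h] = sum(
                        nu[i][y] * a[g][y][h] * b[h][y][u] * q[i][u][j]
                        for y in range(n_y) for u in range(n_u))
    return M

# Hypothetical model and parameters (2 states, observations, I-states, actions).
nu = [[0.9, 0.1], [0.2, 0.8]]                               # nu[i][y]
mu = [[[0.6, 0.4], [0.3, 0.7]], [[0.2, 0.8], [0.5, 0.5]]]   # mu[h][y][u]
q = [[[0.8, 0.2], [0.1, 0.9]], [[0.3, 0.7], [0.6, 0.4]]]    # q[i][u][j]
phi = [[[0.2, -0.5], [1.0, 0.3]], [[0.0, 0.0], [-0.7, 0.4]]]

# Analytic derivative of P with respect to phi[0][1][1]: omega replaced by its derivative.
dP = chain_sum(nu, d_omega_table(phi, 0, 1, 1), mu, q)

# Central finite-difference estimate of the same derivative.
eps = 1e-6
phi_plus, phi_minus = copy.deepcopy(phi), copy.deepcopy(phi)
phi_plus[0][1][1] += eps
phi_minus[0][1][1] -= eps
P_plus = chain_sum(nu, omega_table(phi_plus), mu, q)
P_minus = chain_sum(nu, omega_table(phi_minus), mu, q)
max_err = max(abs((P_plus[r][c] - P_minus[r][c]) / (2 * eps) - dP[r][c])
              for r in range(4) for c in range(4))
```

The θ case follows the same pattern under these assumptions: pass ω unchanged and substitute the derivative of µ for the third argument of `chain_sum`. Each row of `dP` sums to zero because each row of P sums to one for every parameter value.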
