2.2 Solving POMDPs
2.2.2 Point-based methods
A natural idea for approximating the optimal value function is to determine it only over a finite subset B_R ⊂ B, so that value iteration is performed on B_R instead of the complete belief space. A method for generalising the value function to belief states in B \ B_R may then be applied, based e.g. on interpolation. Such methods are collectively termed point-based POMDP solution methods. In the following, we review some of the point-based POMDP solvers proposed in the literature. A recent, thorough survey on the topic is provided by Shani et al. (2013).
Some point-based approaches apply a discretisation of the belief space. For instance, Lovejoy (1991b) generated a fixed grid over B via triangulation and approximated the value function on the grid. A variable resolution grid that is denser in some parts of B was considered by Zhou and Hansen (2001). If an α-vector representation is applied, Equation (2.15) provides a recipe for generalising to beliefs not in B_R. Grid-based approaches that rely purely on interpolation-extrapolation rules for generalisation and do not make use of the α-vector representation may be applied even to problems where the expected reward is non-linear in the belief state.
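As an illustration of a fixed-resolution discretisation, the following minimal sketch enumerates a regular grid on the belief simplex. It only generates the grid points; the triangulation and interpolation rule of Lovejoy (1991b) are not reproduced. The function name `simplex_grid` and its parameters are assumptions made for this example.

```python
from itertools import combinations


def simplex_grid(n_states, resolution):
    """All beliefs whose entries are multiples of 1/resolution and sum to 1."""
    points = []
    # Stars and bars: place n_states - 1 bars among the available slots;
    # the gaps between consecutive bars give the integer counts per state.
    slots = resolution + n_states - 1
    for bars in combinations(range(slots), n_states - 1):
        edges = (-1, *bars, slots)
        counts = [edges[i + 1] - edges[i] - 1 for i in range(n_states)]
        points.append(tuple(c / resolution for c in counts))
    return points


grid = simplex_grid(n_states=3, resolution=4)
print(len(grid))  # 15 grid points for three states at resolution 4
```

The number of grid points grows combinatorially with the number of states, which is one motivation for variable-resolution grids.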
In approaches that do apply the α-vector representation of the value function, one α-vector is maximal at each belief state in B_R, so only that vector needs to be stored. Thus, the value function is represented by at most |B_R| vectors, reducing computational demands. The value function is exact for b ∈ B_R, and by (2.15), an approximation of the value function for belief states in B \ B_R may be found.
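The generalisation to beliefs outside B_R can be sketched as follows, assuming that Equation (2.15) has the usual form in which the value at a belief is the maximum inner product between the belief and the stored α-vectors. The helper names `dot`, `best_alpha` and `value`, and the example vectors, are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))


def best_alpha(belief, alphas):
    """The alpha-vector that is maximal at the given belief."""
    return max(alphas, key=lambda alpha: dot(alpha, belief))


def value(belief, alphas):
    """Value at any belief, also one that is not in the point set."""
    return dot(best_alpha(belief, alphas), belief)


# Illustrative vectors for a two-state problem (made-up numbers).
alphas = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]
print(value((0.5, 0.5), alphas))
print(best_alpha((0.9, 0.1), alphas))
```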
Several point-based methods are based on the idea of setting B_R equal to the set of reachable belief states in the POMDP. Initial information about the state in a POMDP is summarised by an initial belief state b_0. Given b_0, reachable belief states are those that can be reached by executing an admissible policy.
Figure 2.1: A partial belief tree. Belief states are depicted as triangular nodes, and actions as circular nodes. Each edge from a belief node to an action node is labelled with the expected reward of executing the action in the belief state, and each edge from an action node to a belief node is labelled with the probability of reaching that successor belief state.
The concept is further illustrated by considering a tree graph representation of reachable belief states. Consider a POMDP with action space A = {a_0, a_1} and observation space Z = {z_0, z_1}. An example of a partial tree graph of reachable belief states over 2 decisions for such a POMDP is shown in Figure 2.1. Belief states in the tree are depicted by triangular nodes. The initial belief state is b_0, shown at the root of the tree. A belief node has child nodes labelled by actions. An edge from a belief node to an action node is labelled with the expected reward of the action in the parent belief state, for example ρ(b_0, a_0) at the upper left hand side of the figure. The child nodes of action nodes are again belief nodes. There are |Z| out-edges from an action node, one for each possible observation. Each edge from an action node to a belief node is labelled with the prior probability of perceiving that observation, for instance p(z_0 | b_0, a_0) between the nodes labelled a_0 and b_1. A pair of edges from a belief node via an action node to a child belief node determines a belief state. For example, the leftmost belief node in the third layer of the tree corresponds to action a_0 at belief b_0 and observation z_0. Hence, b_1 = τ(b_0, a_0, z_0) as defined in (2.4). Based on the tree view presented, we conclude that if the agent executes d actions, there are up to (|A||Z|)^d possible belief states it may have reached.
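The edge labels and child nodes of the tree can be computed with a Bayes filter. The sketch below assumes that (2.4) is the standard discrete belief update; the two-state model, its numbers, and the names `T`, `O`, `predict`, `obs_prob` and `tau` are hypothetical and serve only as an illustration.

```python
# Hypothetical two-state model; all numbers are illustrative assumptions.
# T[a][s][s2] = p(s2 | s, a),  O[a][s2][z] = p(z | s2, a)
T = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]],
     [[0.5, 0.5], [0.5, 0.5]]]


def predict(b, a):
    """Belief over the next state before the observation is perceived."""
    n = len(b)
    return [sum(b[s] * T[a][s][s2] for s in range(n)) for s2 in range(n)]


def obs_prob(b, a, z):
    """Prior probability p(z | b, a): the label on an action-to-belief edge."""
    return sum(p * O[a][s2][z] for s2, p in enumerate(predict(b, a)))


def tau(b, a, z):
    """Successor belief tau(b, a, z) obtained by Bayes' rule."""
    unnorm = [p * O[a][s2][z] for s2, p in enumerate(predict(b, a))]
    total = sum(unnorm)
    if total == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [u / total for u in unnorm]


b0 = [0.5, 0.5]
b1 = tau(b0, a=0, z=0)  # plays the role of node b_1 in the tree view
print(obs_prob(b0, 0, 0), b1)
```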
Beyond the third layer of the tree, only the children of b_1 have been expanded and drawn in the tree. In a fully expanded belief tree, the belief states corresponding to the (d + 1) topmost layers of belief nodes form the set of belief states that can be reached by any combination of up to d decisions and observations starting from the initial belief state b_0. For example, for d = 1 this set in the case of Figure 2.1 is {b_0, b_1, b_2, b_3, b_4}.
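The growth of the reachable set can be made concrete by expanding the tree layer by layer. This is a minimal sketch on a hypothetical two-state, two-action, two-observation model; the model numbers and the names `successor` and `belief_layers` are assumptions. Observations with zero probability are skipped, and coinciding beliefs are not merged, so the layer sizes count tree nodes.

```python
# Hypothetical model (illustrative numbers only):
# T[a][s][s2] = p(s2 | s, a),  O[a][s2][z] = p(z | s2, a)
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
O = [[[0.8, 0.2], [0.3, 0.7]], [[0.6, 0.4], [0.1, 0.9]]]
N_ACTIONS, N_OBS, N_STATES = 2, 2, 2


def successor(b, a, z):
    """Return (p(z | b, a), tau(b, a, z)); the belief is None if p is zero."""
    pred = [sum(b[s] * T[a][s][s2] for s in range(N_STATES))
            for s2 in range(N_STATES)]
    unnorm = [pred[s2] * O[a][s2][z] for s2 in range(N_STATES)]
    p = sum(unnorm)
    return p, (tuple(u / p for u in unnorm) if p > 0 else None)


def belief_layers(b0, depth):
    """Layers 0..depth of belief nodes in the fully expanded tree."""
    layers = [[tuple(b0)]]
    for _ in range(depth):
        nxt = []
        for b in layers[-1]:
            for a in range(N_ACTIONS):
                for z in range(N_OBS):
                    _, b_next = successor(b, a, z)
                    if b_next is not None:
                        nxt.append(b_next)
        layers.append(nxt)
    return layers


layers = belief_layers((0.5, 0.5), depth=2)
print([len(layer) for layer in layers])  # [1, 4, 16]
```

With two actions and two observations, layer d holds (2 · 2)^d nodes, and the first two layers together hold the five beliefs reachable with d = 1.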
As it is rarely feasible to consider all of the reachable belief states, alternative ways to select a suitable subset of them as B_R have been proposed.
Pineau et al. (2006) suggest interleaving point-based value iteration steps with inclusion of more belief states into B_R. Their point-based value iteration (PBVI) algorithm starts with an initial set B_R = {b_0} of beliefs, performs a finite number of value iteration steps, then inserts new belief states into B_R and repeats the procedure. They define a density for the belief set B_R as
\[
\Delta_{B_R} = \max_{b' \in B} \min_{b \in B_R} \lVert b - b' \rVert_1. \tag{2.17}
\]
This density measures how well B_R covers B. When selecting which belief point τ(b, a, z') to insert into B_R between value iteration stages, Pineau et al. take into account 1) how likely it is to reach the belief state τ(b, a, z') from b ∈ B_R, i.e. p(z' | b, a), 2) how far the belief state τ(b, a, z') is from the beliefs already in B_R, with the aim of reducing the density in Equation (2.17), and 3) the current approximate value at τ(b, a, z').
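The distance criterion 2) can be sketched in isolation as follows. The sketch takes a list of candidate successor beliefs as given and picks the one farthest, in the 1-norm, from the current belief set; the reachability and value criteria 1) and 3) are deliberately left out. The function names and the example beliefs are hypothetical.

```python
def l1_distance(b, c):
    """1-norm distance between two beliefs."""
    return sum(abs(x - y) for x, y in zip(b, c))


def distance_to_set(b, belief_set):
    """Distance from b to its nearest neighbour in the belief set."""
    return min(l1_distance(b, c) for c in belief_set)


def farthest_candidate(candidates, belief_set):
    """Candidate whose inclusion extends the coverage of the set the most."""
    return max(candidates, key=lambda b: distance_to_set(b, belief_set))


belief_set = [(0.5, 0.5)]
candidates = [(0.6, 0.4), (0.9, 0.1), (0.5, 0.5)]  # e.g. successor beliefs
new_point = farthest_candidate(candidates, belief_set)
belief_set.append(new_point)
print(new_point)  # (0.9, 0.1)
```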
A related method based on randomising the value iteration stages was suggested by Spaan and Vlassis (2005). The set B_R is generated by sampling random trajectories of reachable beliefs. The algorithm is given an initial value function V_1 in terms of α-vectors. At iteration k, instead of computing the value iteration step for each belief state in B_R, the algorithm randomly chooses a belief state b ∈ B_R and implements the value iteration step for that belief. The related α-vector is added to the value function estimate V_{k+1}. The new α-vector may improve the value function also at belief points other than b. Every b ∈ B_R is checked for whether its value was improved, obtaining a set B̃ = {b ∈ B_R | V_{k+1}(b) < V_k(b)} ⊂ B_R of beliefs whose value under V_{k+1} is still lower than under V_k. A new belief is chosen randomly from B̃, the value iteration step is computed for it, the α-vector is inserted into V_{k+1}, and the beliefs are checked again. This process is repeated until B̃ = ∅, that is, until no belief state in B_R has a lower value under V_{k+1} than under V_k.
The algorithm presented above, called Perseus, features an asynchronous value iteration stage since the beliefs in B_R are processed in a random order, and not necessarily an equal number of times. As a result, the concept of a fixed planning horizon is obscured: performing the Perseus value iteration stage k times will not consider policies k steps into the future, but somewhat fewer (Spaan and Vlassis, 2005). The algorithm nevertheless eventually converges to the optimal value function V^{π*}, and is thus suited for discounted infinite horizon problems.
Further examples of point-based methods include SARSOP (Kurniawati et al., 2008), which attempts to further reduce the computation of the value function to the set of belief states reachable under an optimal policy instead of an arbitrary policy, or a random policy as in Perseus. Heuristic search value iteration (HSVI) (Smith and Simmons, 2004, 2005) follows the simple rule of updating the successors of a belief state, i.e. the children of any belief state in the tree of Figure 2.1, before the belief itself, accelerating the convergence of value iteration. Additionally, HSVI applies a heuristic function to select the most relevant belief points to be included in B_R. The heuristic is computed by maintaining lower and upper bounds for the value function, and new beliefs to include in B_R are selected based on maximising the distance between the bounds. In other words, belief points are included in areas of B where the uncertainty about the value function is the greatest.
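The idea of selecting beliefs where the bounds are farthest apart can be sketched in a strongly simplified form. Here the lower bound is a set of α-vectors and the upper bound is a linear interpolation of values at the corners of the belief simplex; both representations, the numbers, and the function names are assumptions for illustration. The actual HSVI algorithm uses a tighter upper bound and weights the gap by observation probabilities, which is not shown.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def lower_bound(b, alphas):
    """Lower bound on the value: maximum over alpha-vectors."""
    return max(dot(alpha, b) for alpha in alphas)


def upper_bound(b, corner_values):
    """Upper bound on the value: interpolation of corner-point values."""
    return dot(corner_values, b)


def widest_gap(candidates, alphas, corner_values):
    """Candidate belief with the largest gap between the two bounds."""
    return max(
        candidates,
        key=lambda b: upper_bound(b, corner_values) - lower_bound(b, alphas),
    )


alphas = [(1.0, 0.0), (0.0, 1.0)]  # made-up lower bound
corner_values = (1.0, 1.0)         # made-up upper bound at the corners
candidates = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.8)]
print(widest_gap(candidates, alphas, corner_values))  # (0.5, 0.5)
```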
Araya-López et al. (2010) suggested a variant of the POMDP, called ρPOMDP, with a reward function that is convex in the belief state. The convex reward function may be approximated by a PWLC function, and regular α-vector updates or point-based value iteration steps may be applied to obtain a bounded-error approximation of the value function. Ji et al. (2007) derived a policy iteration algorithm replacing Hansen's policy improvement step by point-based value iteration.
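The PWLC approximation of a convex reward can be illustrated with tangent hyperplanes. The sketch below uses the negative entropy of the belief as the convex reward, which is an assumption chosen for illustration. The tangent of the negative entropy at a strictly positive belief b_0 reduces to the vector with entries log b_0(s), and the maximum over such tangents lower-bounds the reward. The function names are hypothetical.

```python
from math import log


def neg_entropy(b):
    """Convex reward: negative entropy of the belief (0 log 0 taken as 0)."""
    return sum(p * log(p) for p in b if p > 0)


def tangent_alpha(b0):
    """Tangent hyperplane of neg_entropy at b0; b0 must be strictly positive."""
    return [log(p) for p in b0]


def pwlc_reward(b, alphas):
    """PWLC lower approximation: maximum over the tangent hyperplanes."""
    return max(sum(a * p for a, p in zip(alpha, b)) for alpha in alphas)


tangent_points = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2)]
alphas = [tangent_alpha(b0) for b0 in tangent_points]
for b in [(0.3, 0.7), (0.5, 0.5), (0.95, 0.05)]:
    print(b, pwlc_reward(b, alphas), neg_entropy(b))
```

The approximation is exact at the tangent points and becomes tighter elsewhere as more tangent points are added.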