2.2 Solving POMDPs
2.2.5 Other approaches
The approaches to finding exactly or approximately optimal policies for POMDPs presented in the previous subsections are by no means an exhaustive treatment of the topic. For instance, Monahan (1982) and Lovejoy (1991a) survey exact solution methods. More comprehensive surveys of approximate value function (Hauskrecht, 2000), point-based (Shani et al., 2013) and online methods (Ross et al., 2008) are also available. We next present a brief overview of POMDP solution methods that do not fit under the topics covered in the preceding subsections.
2.2.5.1 Special cases of POMDPs
Other special cases of POMDPs can be distinguished e.g. by the properties of the state transition function.
In his thesis, Littman (1996, Sect. 6.3.2) studied the subclass of deterministic POMDPs. A deterministic POMDP is characterised by the fact that both state transitions and observations are deterministic, as seen in the following definition.
Definition 2.8 (Deterministic POMDP). A POMDP $\langle T, S, A_b, Z, T, O, R \rangle$ is deterministic if for all $s, s' \in S$, $a \in A$, $z \in Z$
\[
T(s', a, s) = \begin{cases} 1 & \text{if } s' = f(s, a) \\ 0 & \text{otherwise} \end{cases}, \tag{2.24}
\]
where $f : S \times A \to S$ is a deterministic state transition function, and
\[
O(z', s', a) = \begin{cases} 1 & \text{if } z' = h(s', a) \\ 0 & \text{otherwise} \end{cases}, \tag{2.25}
\]
where $h : S \times A \to Z$ is a deterministic observation function.
Littman determined that as there is a single successor state to any action, and each state-action pair emits a single observation, the reachable belief states in a finite horizon problem can be enumerated. Hence, a conversion to a finite MDP over the set of reachable belief states exists, and e.g. value or policy iteration may be applied to solve the deterministic POMDP. Littman showed that the finite horizon deterministic POMDP problem is NP-complete. Bonet (2009) notes a relation between policies and so-called AND/OR graphs, allowing for potentially more effective solutions in practice than via the MDP mapping suggested by Littman.
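The enumeration of reachable belief states can be illustrated with a minimal Python sketch. It is not Littman's construction as such: the function names `belief_update` and `reachable_beliefs`, the four-state toy problem, and the use of exact fractions to make beliefs hashable are all assumptions made for illustration. The functions `f` and `h` play the roles of the deterministic transition and observation functions of Definition 2.8.

```python
from fractions import Fraction


def belief_update(belief, action, f, h):
    """Return {observation: (probability, next_belief)} for one action.

    `belief` is a sorted tuple of (state, probability) pairs.
    """
    # Each state has exactly one successor f(s, a), and each successor emits
    # exactly one observation h(s', a), so probability mass is only regrouped.
    grouped = {}
    for state, prob in belief:
        nxt = f(state, action)
        obs = h(nxt, action)
        bucket = grouped.setdefault(obs, {})
        bucket[nxt] = bucket.get(nxt, Fraction(0)) + prob
    successors = {}
    for obs, masses in grouped.items():
        total = sum(masses.values())
        normalised = tuple(sorted((s, p / total) for s, p in masses.items()))
        successors[obs] = (total, normalised)
    return successors


def reachable_beliefs(initial_belief, actions, f, h, horizon):
    """Enumerate, breadth-first, all beliefs reachable within `horizon` steps."""
    start = tuple(sorted(initial_belief.items()))
    seen = {start}
    frontier = [start]
    for _ in range(horizon):
        next_frontier = []
        for belief in frontier:
            for action in actions:
                updates = belief_update(belief, action, f, h)
                for _prob, nxt in updates.values():
                    if nxt not in seen:
                        seen.add(nxt)
                        next_frontier.append(nxt)
        frontier = next_frontier
    return seen


# Hypothetical toy problem: four states on a ring, a "home" detector at state 0.
ACTIONS = ("stay", "step")


def f(state, action):
    return (state + 1) % 4 if action == "step" else state


def h(next_state, action):
    return 1 if next_state == 0 else 0


uniform = {s: Fraction(1, 4) for s in range(4)}
beliefs = reachable_beliefs(uniform, ACTIONS, f, h, horizon=2)
print(len(beliefs), "reachable belief states")
```

The resulting set of beliefs is finite, so it could serve as the state space of a belief MDP to which value or policy iteration is applied.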
Besse and Chaib-draa (2009) considered quasi-deterministic POMDPs, where state transitions are deterministic and, for any state and action, one of the observations is perceived with probability at least one half. They show that these quasi-deterministic problems are easier than general problems by bounding the length of the history sequence needed to identify the underlying state almost surely.
Warnquist et al. (2013) studied the case where some actions have deterministic effects. The deterministic actions are abstracted into macro actions, which improves the practical convergence speed of a solver.
In some cases, POMDP problems have mixed observability. This means that some components of the state space in a factored representation may be fully observable. Such cases arise in particular in robotics domains, for which specialised POMDP solvers exist (Ong et al., 2010) that leverage the lower-dimensional representation of the belief space together with a point-based solver to compute approximate solutions. Computational savings are achieved as it suffices to maintain a collection of sets of low-dimensional α-vectors, one for each value of the fully observable part of the state space, to represent the value function.
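As a rough illustration of this value function representation, the sketch below evaluates a mixed observability value function as a maximum over the α-vectors associated with the current fully observable component. The names `alpha_sets`, `x` and `belief_y`, as well as the numbers, are hypothetical and chosen only for this example; they are not taken from Ong et al. (2010).

```python
def momdp_value(alpha_sets, x, belief_y):
    """Value at fully observable component `x` and belief `belief_y`.

    `alpha_sets[x]` is the set of alpha-vectors kept for the observable value
    `x`; each vector has one entry per value of the hidden state component.
    """
    return max(
        sum(a * b for a, b in zip(alpha, belief_y))
        for alpha in alpha_sets[x]
    )


# Hypothetical example: an observable door status and a two-valued hidden
# component (assumed names and numbers).
alpha_sets = {
    "door_open": [(1.0, 0.0), (0.0, 2.0)],
    "door_closed": [(0.5, 0.5)],
}
belief_y = (0.25, 0.75)

value_open = momdp_value(alpha_sets, "door_open", belief_y)
value_closed = momdp_value(alpha_sets, "door_closed", belief_y)
print(value_open, value_closed)
```

Each α-vector only has as many entries as the hidden component has values, which is where the savings over a belief on the full state space come from.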
2.2.5.2 Submodularity
A submodular function (Fujishige, 2005) is a set function which satisfies a “diminishing returns” property.
Definition 2.9 (Submodular function). Let $E$ denote a nonempty finite set, and $2^E$ the power set of $E$. A function $f : 2^E \to \mathbb{R}$ is submodular if for any two subsets $X, Y \subset E$ such that $X \subseteq Y$ and any $x \in E \setminus Y$
\[
f(X \cup \{x\}) - f(X) \geq f(Y \cup \{x\}) - f(Y). \tag{2.26}
\]
Submodularity indicates that adding an element $x$ to a smaller set $X$ increases the value of $f$ at least as much as adding the same element to the larger set $Y$. Additionally, a submodular function is monotone if $f(X \cup \{x\}) \geq f(X)$ for all $X \subseteq E$ and $x \in E$.
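The diminishing returns condition (2.26) can be checked by brute force on small ground sets. The following sketch does so for a set coverage function, which is a standard example of a monotone submodular function, and for a squared-cardinality function that violates the condition. The helper names and the data in `COVER` are assumptions made for this example.

```python
from itertools import combinations


def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = sorted(ground)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_submodular(f, ground):
    """Check inequality (2.26) for all X within Y and all x outside Y."""
    for Y in subsets(ground):
        for X in subsets(Y):
            for x in ground - Y:
                if f(X | {x}) - f(X) < f(Y | {x}) - f(Y):
                    return False
    return True


def is_monotone(f, ground):
    """Check that adding any element never decreases f."""
    return all(f(X | {x}) >= f(X) for X in subsets(ground) for x in ground)


# Hypothetical data: each element covers a set of numbered targets.
COVER = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}
GROUND = frozenset(COVER)


def coverage(X):
    """Number of targets covered by the elements in X."""
    return len(set().union(*(COVER[e] for e in X)))


def squared_size(X):
    """Monotone, but with increasing rather than diminishing returns."""
    return len(X) ** 2


print(is_submodular(coverage, GROUND), is_monotone(coverage, GROUND))
print(is_submodular(squared_size, GROUND), is_monotone(squared_size, GROUND))
```

The check is exponential in the size of the ground set and is only meant to make the definition concrete.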
Krause et al. (2008) considered monotone submodular objective functions in a sensor placement problem. The problem was one of optimising prediction quality in a Gaussian process by selecting sampling locations. Concretely, a finite number $k$ of sensors was to be placed at a set of locations $X$ of interest. The sensor locations were chosen from the set $E \supset X$ of all possible locations, and the objective was to maximise a submodular function quantifying the informativeness of sensor locations and prediction quality over the whole of $E$. According to Nemhauser et al. (1978), in this case a greedy algorithm that sequentially adds elements to $X$ such that they maximise the expected immediate improvement of the objective function finds a solution whose value is at least a factor $(1 - 1/\exp(1))$ of the value of the optimal solution.
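A minimal sketch of such a greedy selection is given below, using a set coverage objective as a stand-in for the informativeness measure of Krause et al. (2008). The function names and the data in `COVER` are hypothetical. On this small instance the greedy solution is suboptimal, yet within the $(1 - 1/\exp(1))$ guarantee when compared against a brute-force optimum.

```python
import math
from itertools import combinations

# Hypothetical data: each candidate location covers a set of numbered targets.
COVER = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {3, 4, 6}}
GROUND = frozenset(COVER)


def coverage(X):
    """Monotone submodular objective: number of targets covered by X."""
    return len(set().union(*(COVER[e] for e in X)))


def greedy_select(f, ground, k):
    """Add, k times, the element with the largest immediate improvement."""
    selected = frozenset()
    for _ in range(k):
        candidates = sorted(ground - selected)
        if not candidates:
            break
        best = max(candidates, key=lambda e: f(selected | {e}) - f(selected))
        selected = selected | {best}
    return selected


def brute_force_optimum(f, ground, k):
    """Best value over all subsets of size k (feasible for tiny instances only)."""
    return max(f(frozenset(c)) for c in combinations(sorted(ground), k))


greedy_set = greedy_select(coverage, GROUND, k=2)
greedy_value = coverage(greedy_set)
optimal_value = brute_force_optimum(coverage, GROUND, k=2)
print(sorted(greedy_set), greedy_value, optimal_value)
```

Greedy first picks the location covering the most targets, which here blocks the better combination of the two smaller locations.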
Krause et al. (2008) considered a stationary case where the process was fixed and merely a set of sampling locations was selected to maximise a performance measure. As such, the approach is not directly suited to a case where contingencies arise: for example, as a result of deploying a sensor, feedback is received indicating whether the deployment succeeded or not and further actions are taken contingent on this information. POMDPs are examples of contingent planning as well; a different course of action may be taken at a subsequent decision epoch conditional on the observations perceived.
Golovin and Krause (2011) extend the idea to contingent planning by introducing the concept of adaptive submodularity. The extension introduces a realisation function $\phi : E \to O$ mapping each possible sensor location to a state in $O$. The problems considered proceed sequentially: a location $e \in E$ is selected, its state $\phi(e)$ is (perfectly) observed, the next location is selected, its state is observed, and so on. The state may e.g. correspond to the event of a failure or success in sensor deployment. A partial realisation $\psi$ is a function from a subset of $E$ to their states. Thus, $\psi$ can be a description of the locations selected thus far and whether the deployment in each location failed or not. The policies in Golovin and Krause's approach are mappings from a set of partial realisations to $E$, specifying which location to select next given a particular history of deployments and their successes or failures. As such, the partial realisations are equivalent to belief states of a POMDP. Adaptive submodular monotone objective functions are considered, defining submodularity now via the expected marginal improvement of selecting an item given the current partial realisation (belief state). A greedy policy is shown to find a constant factor approximation to the optimal solution.
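The following sketch illustrates a greedy policy of this kind under strong simplifying assumptions that are not taken from the source: deployments fail independently with known success probabilities `P_SUCCESS`, and the objective is the number of targets covered by successfully deployed sensors. All names and numbers are hypothetical. The policy maps the partial realisation `psi` to the location with the largest expected marginal gain.

```python
# Hypothetical data: coverage of each location and its success probability.
COVER = {"a": {1, 2, 3}, "b": {1, 2, 3, 4}, "c": {5, 6}}
P_SUCCESS = {"a": 0.9, "b": 0.8, "c": 0.9}


def covered(psi):
    """Targets covered by the locations observed to be working in psi."""
    working = (COVER[e] for e, state in psi.items() if state == "ok")
    return set().union(*working)


def expected_gain(e, psi):
    """Expected marginal coverage of selecting e given partial realisation psi."""
    return P_SUCCESS[e] * len(COVER[e] - covered(psi))


def adaptive_greedy(phi, k):
    """Run the greedy policy for k steps against the realisation phi."""
    psi = {}  # partial realisation: selected location -> observed state
    for _ in range(k):
        candidates = sorted(set(COVER) - set(psi))
        if not candidates:
            break
        e = max(candidates, key=lambda loc: expected_gain(loc, psi))
        psi[e] = phi[e]  # the state of the selected location is observed
    return psi


phi_all_ok = {"a": "ok", "b": "ok", "c": "ok"}
phi_b_fails = {"a": "ok", "b": "failed", "c": "ok"}
run_ok = adaptive_greedy(phi_all_ok, k=2)
run_fail = adaptive_greedy(phi_b_fails, k=2)
print(list(run_ok), list(run_fail))
```

The second selection differs between the two realisations, which is exactly the contingency that a fixed, non-adaptive selection cannot express.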
Golovin and Krause's decision-making model is equivalent to the deterministic POMDP (Definition 2.8). If $k$ locations are to be selected, the problem is finite horizon with $k$ decisions. Let $c = \{c_e\}_{e \in E}$ indicate for each location $e \in E$ whether it has been selected ($c_e = 1$) or not ($c_e = 0$). Furthermore, let $d = \{d_e\}_{e \in E}$ where $d_e \in O$ denote the (a priori unknown) state of location $e \in E$. The state is thus described by a pair $s = (c, d) \in S$. The set of applicable actions is recoverable from the belief state since $c_e$ are fully observable; the applicable actions $A_b$ are simply equal to the set of hitherto unselected locations. The observation space is $Z = O$. The conditions of Definition 2.8 are now fulfilled by making the following two definitions. Define $f(s, a) = (c', d)$, where $c'_a = 1$ and $c'_i = c_i$ for all $i \neq a$ and $d$ remains unchanged. Finally, define $h(s', a) = d_a$.
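The two definitions can be written out directly. In the sketch below, locations are indexed by integers and the state is a pair of tuples `(c, d)`; this encoding, the helper `applicable_actions`, and the example state labels are assumptions made for illustration.

```python
def f(state, a):
    """Deterministic transition: mark location a as selected, leave d unchanged."""
    c, d = state
    c_next = tuple(1 if i == a else c_i for i, c_i in enumerate(c))
    return (c_next, d)


def h(next_state, a):
    """Deterministic observation: the state d_a of the location just selected."""
    _, d = next_state
    return d[a]


def applicable_actions(c):
    """Locations not yet selected; recoverable since c is fully observable."""
    return [i for i, c_i in enumerate(c) if c_i == 0]


# Hypothetical example with three locations; d is unknown to the decision
# maker a priori and never changes.
s0 = ((0, 0, 0), ("ok", "failed", "ok"))
s1 = f(s0, 1)
z1 = h(s1, 1)
print(s1, z1, applicable_actions(s1[0]))
```

Only the selection indicator `c` changes over time, while the hidden part `d` is revealed one entry at a time through the observations.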
The key insight to seeing the equivalence is to note that in the sensor selection problem, whether a sensor deployment at a particular location will succeed or fail remains fixed; it is only our information regarding the relative probability of these two events that may vary as a consequence of deploying sensors at other locations. Even the information regarding failure probabilities remains fixed if the failures are independent, as suggested by Golovin and Krause. As the event of failure or success is perfectly observed, the conditions for a deterministic observation model are fulfilled.
Both Krause et al. (2008) and Golovin and Krause (2011) considered submodular reward functions that are non-linear in the belief state. The conditions under which these reward functions are (adaptive) submodular and monotone are somewhat restrictive; for instance, a general stochastic state transition model or a non-perfect observation model cannot be applied. As the problems considered fit into the general framework of POMDPs, they nevertheless expand the class of POMDPs that can be efficiently approximated. In recent work by Satsangi et al. (2015), submodularity of value functions is shown for a reward function equal to the negative belief entropy, while extending the aforementioned results to the full POMDP setting that allows non-deterministic state transitions.