
2.2 Solving POMDPs

2.2.6 General POMDPs

A feature common to most of the methods reviewed above was that the state, action and observation spaces were assumed to be finite. As continuous spaces are realistic models for real-world problems, such POMDP representations have also been studied. We already briefly covered some algorithms able to deal with continuous spaces, for instance the SMC-OLFC algorithm presented in Subsection 2.2.4.2. This subsection gives a more detailed overview of the work on general POMDPs.

Arguably the most famous special case of a POMDP is the linear-quadratic-Gaussian (LQG) control problem. In the LQG case, the state space is the n-dimensional real space R^n, while the action space is the m-dimensional real space R^m. The state transition model and observation model are linear with additive Gaussian noise, and the reward (cost) function is a quadratic function of the state and action. If the initial belief state is Gaussian over R^n, all subsequent beliefs are Gaussian as well. In this case, it is well known that the optimal control policy can be obtained in closed form, see e.g. Åström (1970); Athans (1971). In LQG control, the principle of separation of estimation and control holds. An optimal state estimator, in this case a Kalman filter, can be designed separately for the system. The optimal control is a deterministic function of the state estimate from the optimal estimator.
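The separation principle can be illustrated with a minimal sketch for a scalar system. The model x' = a·x + b·u + w, z = c·x + v, all numerical parameter values, and the function names `lqr_gains`, `kalman_step` and `simulate` are assumptions chosen for illustration only. The feedback gains are computed without reference to the noise statistics, the Kalman filter is run without reference to the cost, and the control is the gain applied to the filter mean.

```python
import random


def lqr_gains(a, b, state_cost, action_cost, horizon):
    """Backward Riccati recursion for x' = a*x + b*u with stage cost
    state_cost*x**2 + action_cost*u**2; returns gains K_0..K_{T-1}."""
    p = state_cost  # cost-to-go coefficient at the final step
    gains = []
    for _ in range(horizon):
        k = a * b * p / (action_cost + b * b * p)
        p = state_cost + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()  # the recursion runs backwards in time
    return gains


def kalman_step(mean, var, u, z, a, b, c, q, r):
    """One predict/update cycle of a scalar Kalman filter."""
    mean_pred = a * mean + b * u
    var_pred = a * a * var + q
    gain = var_pred * c / (c * c * var_pred + r)
    mean_new = mean_pred + gain * (z - c * mean_pred)
    var_new = (1.0 - gain * c) * var_pred
    return mean_new, var_new


def simulate(seed=0, horizon=20):
    """Closed-loop run; all parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    a, b, c, q, r = 1.0, 0.5, 1.0, 0.01, 0.04
    gains = lqr_gains(a, b, state_cost=1.0, action_cost=0.1, horizon=horizon)
    mean, var = 2.0, 1.0          # Gaussian initial belief
    x = rng.gauss(mean, var ** 0.5)  # true (hidden) state
    variances = []
    for k in gains:
        u = -k * mean             # control depends only on the estimate
        x = a * x + b * u + rng.gauss(0.0, q ** 0.5)
        z = c * x + rng.gauss(0.0, r ** 0.5)
        mean, var = kalman_step(mean, var, u, z, a, b, c, q, r)
        variances.append(var)
    return x, mean, variances


if __name__ == "__main__":
    x, mean, variances = simulate()
    print("final state", x, "estimate", mean, "variance", variances[-1])
```

In this sketch the sequence of posterior variances is the same for every random seed, since the Kalman filter variance recursion does not involve the measured values.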

Meier et al. (1967) studied the case where one additionally has a choice between a finite number of measurement channels in an LQG control problem, i.e. a sensor management problem is considered alongside the control problem. They showed that the separation principle allows one to solve separately the control problem and the problem of sequentially selecting the most useful measurement channels. The optimal policy for operating the sensing subsystem was found to be independent of the actual measurement data obtained; such a result does not hold for non-linear cases.

The optimal control in the LQG case may be stated as a function of a parametric representation of the belief state. A similar idea has been applied to POMDPs where the state transition and observation models are non-linear or where the noise is non-Gaussian. In general, the belief states can be arbitrary PDFs over the state space. Some approximate methods constrain belief states to some parametric family of PDFs represented by a finite-dimensional vector of sufficient statistics. For instance, Brooks et al. (2006) apply a Gaussian parametrisation of the belief state. They then apply value iteration over the finite-dimensional parameter space. A weakness of the approach is that a single-Gaussian representation is not sufficient to represent multi-modal beliefs. An extension projecting belief states to a family of parametrised PDFs instead of just Gaussians was introduced by Zhou et al. (2010). Particle filtering was applied to track belief states in the projected parameter space. Platt et al. (2010) apply LQG control in the belief space while assuming maximum likelihood observations to simplify the belief state dynamics.
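The weakness of a single-Gaussian parametrisation for multi-modal beliefs can be seen in a small numerical sketch. Here a bimodal one-dimensional belief is projected onto a single Gaussian by moment matching; moment matching is used only as a convenient example of a projection and is not claimed to be the method of any of the cited works. The mixture parameters and function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Density of a one-dimensional Gaussian mixture."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


def moment_match(weights, means, variances):
    """Mean and variance of the single Gaussian with the same first two moments."""
    mean = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (v + (m - mean) ** 2)
              for w, m, v in zip(weights, means, variances))
    return mean, var


# Hypothetical bimodal belief: the state is near -3 or near +3.
weights, means, variances = [0.5, 0.5], [-3.0, 3.0], [0.25, 0.25]
mean, var = moment_match(weights, means, variances)

print("projected mean and variance:", mean, var)
print("true density at projected mean:", mixture_pdf(mean, weights, means, variances))
print("Gaussian density at its mean:  ", gaussian_pdf(mean, mean, var))
```

The projected Gaussian places its highest density midway between the two modes, where the original belief assigns almost no probability.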

Porta et al. (2006) generalise the notion of α-vectors to continuous state spaces via α-functions. They show that for expected rewards linear in the belief state the value function is convex, and for discrete action and observation sets it is PWLC. They tackle the problem of belief state representation by using Gaussian mixture models (GMMs) to represent both the beliefs and the state transition and observation models. They apply point-based value iteration over this parametric representation of beliefs to solve the POMDP, representing the value function as the supremum of a set of α-functions. However, the number of components in the GMMs representing posterior belief states and value functions increases exponentially, so the number of components is reduced by approximating each GMM with a mixture with fewer components.
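The growth in the number of mixture components can be illustrated with a one-dimensional sketch. It assumes that the likelihood of a fixed observation, viewed as a function of the state, is itself a Gaussian mixture, and it uses plain weight-based pruning as a simple stand-in for component reduction. The pruning rule, the function names and all numbers are assumptions for illustration and are not taken from Porta et al. (2006).

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def multiply_mixtures(belief, likelihood):
    """Normalised product of two mixtures given as lists of (weight, mean, var).

    The product of two Gaussians is a scaled Gaussian, so n belief components
    and m likelihood components yield n * m posterior components.
    """
    product = []
    for w1, m1, v1 in belief:
        for w2, m2, v2 in likelihood:
            var = 1.0 / (1.0 / v1 + 1.0 / v2)
            mean = var * (m1 / v1 + m2 / v2)
            scale = gaussian_pdf(m1, m2, v1 + v2)
            product.append((w1 * w2 * scale, mean, var))
    total = sum(w for w, _, _ in product)
    return [(w / total, m, v) for w, m, v in product]


def prune(mixture, max_components):
    """Keep the heaviest components and renormalise (a crude reduction)."""
    kept = sorted(mixture, key=lambda comp: comp[0], reverse=True)[:max_components]
    total = sum(w for w, _, _ in kept)
    return [(w / total, m, v) for w, m, v in kept]


belief = [(0.5, -1.0, 0.5), (0.5, 1.0, 0.5)]
likelihood = [(0.3, -2.0, 1.0), (0.4, 0.0, 1.0), (0.3, 2.0, 1.0)]

for step in range(3):
    belief = multiply_mixtures(belief, likelihood)
    print("step", step, "components before pruning:", len(belief))

belief = prune(belief, max_components=4)
print("components after pruning:", len(belief))
```

Without pruning, the component count is multiplied by the size of the likelihood mixture at every update, which is the exponential growth referred to above.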

An alternative way to handle continuous spaces is via discretisation. The discretised problem can then be handled by any solver suitable for a finite POMDP. However, as noted e.g. by Brooks et al. (2006), finite POMDP methods run into problems when the discretisation is fine: the dimensionality of the belief space increases with the number of states, leading to rapidly increasing computational demands.
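The effect of grid resolution on the size of the discretised problem can be made concrete with a short calculation. The sketch assumes a uniform grid with the same number of cells along every state dimension, which is only one of many possible discretisation schemes; the function names are hypothetical.

```python
def num_grid_states(cells_per_dim: int, state_dim: int) -> int:
    """Number of discrete states in a uniform grid over the state space."""
    return cells_per_dim ** state_dim


def belief_dimension(num_states: int) -> int:
    """Dimension of the belief simplex over a finite state set.

    Probabilities sum to one, so one degree of freedom is lost.
    """
    return num_states - 1


for cells in (10, 20, 40):
    n = num_grid_states(cells, state_dim=3)
    print(f"{cells} cells per dimension: {n} states, "
          f"belief dimension {belief_dimension(n)}")
```

Doubling the resolution of a three-dimensional grid multiplies the number of states, and hence the belief dimension, by roughly eight.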

Sampling-based methods have also been widely applied to general POMDPs. Thrun (2000) represents a belief state over a continuous state space by a particle approximation, and applies a function approximation based on k-nearest-neighbours to evaluate the value function. A policy search method was applied by Martinez-Cantin et al. (2009) in a continuous-state, continuous-action POMDP. Monte Carlo sampling was applied to evaluate the quality of policies, iteratively improving the policies through Bayesian optimisation. Likewise, the PEGASUS solver (Ng and Jordan, 2000) applied sampled simulation trajectories to evaluate policies, and can handle continuous action spaces. Kantas et al. (2009) present an SMC method combined with the receding horizon control principle that is able to handle continuous state, action and observation spaces.
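A particle approximation of the belief state can be sketched as a bootstrap particle filter update. The sketch below is generic and is not the algorithm of any of the works cited above. The additive dynamics x' = x + a + noise, the Gaussian observation model z = x' + noise, the noise levels, and the function name `particle_belief_update` are all assumptions made for illustration.

```python
import math
import random


def particle_belief_update(particles, action, observation, rng,
                           process_std=0.1, obs_std=0.2):
    """One bootstrap particle filter step for a one-dimensional state.

    Assumed (hypothetical) models: x' = x + action + process noise,
    and observation = x' + measurement noise, both Gaussian.
    """
    # Propagate every particle through the assumed transition model.
    predicted = [x + action + rng.gauss(0.0, process_std) for x in particles]
    # Weight each particle by the assumed Gaussian observation likelihood.
    weights = [math.exp(-0.5 * ((observation - x) / obs_std) ** 2)
               for x in predicted]
    if sum(weights) == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0] * len(predicted)
    # Resample with replacement in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


rng = random.Random(0)
prior = [rng.uniform(-2.0, 2.0) for _ in range(1000)]
posterior = particle_belief_update(prior, action=0.5, observation=1.0, rng=rng)
print("prior mean:    ", sum(prior) / len(prior))
print("posterior mean:", sum(posterior) / len(posterior))
```

The same update applies unchanged to beliefs of arbitrary shape, which is what makes particle representations attractive when no parametric family fits the belief well.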

Hoey and Poupart (2005) studied continuous observation spaces, and noted that from the decision maker’s point of view only observations that ultimately lead to a different action must be distinguished. Regions of the observation space leading to the same optimal action are aggregated into a single meta-observation, ultimately leading to a finite representation of the observation space.

Some online planning methods presented in Subsection 2.2.4 can be applied to POMDPs with a continuous state space and finite action and observation spaces. Specifically, the online methods that build a search tree over the reachable belief states (see Figure 2.1) require finite action and observation spaces for the tree to have a finite branching factor. If beliefs over the continuous state space can be propagated via the belief update equation (Equation (2.4)), no additional difficulties arise from the uncountable state space. Furthermore, sampling-based approaches remain applicable in such cases.