3.4 Deterministic robot motion and mixed observability
3.4.3 Path planning
Consider a robot traversing an undirected graph G “ pX , Eq. The robot is navigating towards a goal vertex g P X . A random environmental variable
Yx is related to each vertex x P X in the graph.
The state space in the path planning problem is the cross product of possible robot locations X and possible environment variable values Y “ Ś
xPX Yx. States are pairs s “ px, yq of robot location x P X and environment variables
y P Y. As robot locations are fully observable, the belief states are pairs b “ px, ppyqq consisting of the robot location and a PDF ppyq P PpYq over
environment variables.
At belief state b “ px, ppyqq, the robot chooses a movement action ap P N pxq from the set N pxq of vertices neighbouring x. Let x1
P N pxq denote the successor vertex after the robot executes the movement. In addition, the robot chooses a sensing action am P N px1q to target a vertex neighbouring the successor. Let Ab ” Ax denote the set of allowed actions at belief state
b “ px, ppyqq, defined as the set of all possible combinations of movement
and sensing actions when the robot is at x P X . Let A “ Ť xPX
Ax. The actions in the problem are ordered pairs a “ pap, amq P A of movement and sensing actions. A special case is the goal vertex g, where the action space is a singleton Ag “ tg, gu to indicate termination of the task.
The state transition model T “ Tx ˆ Ty is factorised into a robot state transition model Tx and the environmental variable transition model Ty. Robot motion is deterministic, with Txdefined as in (3.2). The environmental variables evolve independently from the robot’s actions, i.e. the robot cannot affect the environment state. The state transition model Ty is defined by an appropriate DBN model.
After executing an action, the robot obtains a noisy observation z1 P Z about
the environmental variable in the targeted node, according to a stochastic observation model O.
The reward function is state and action dependent and reflects the cost of robot movement and sensing activities. We assume that all sensing actions have the same cost, but note that varying costs are easily incorporated by adding a term dependent on the sensing action to the reward function. When the robot has reached the goal vertex g the reward is equal to zero to indicate termination of the task, i.e. for all states s “ px, yq where x “ g, we have Rps, aq “ 0 @a P Ag. There may be a positive reward for entering the goal vertex for the first time. In other states, the reward is strictly negative and dependent on the value of the environmental variable yx1 in
the successor vertex x1 after the movement action.
In the path planning problem considered here, the state, action, and obser- vation spaces S, A and Z, respectively, are finite.
Given the undirected graph G “ pX , Eq and a goal vertex g P X , the path planning problem is stated as follows.
Problem 3.4 (Path planning). The path planning problem is a POMDP xT , X ˆ Y , tAbu, Z, Txˆ Ty, O, Ry, where T is the set of decision epochs,
and X ˆ Y is the finite state space. A state is an ordered pair px, yq P X ˆ Y. The part x P X is fully observable and y P Y is partially observable. Thus belief states are ordered pairs of the form b “ px, ppyqq P B, where ppyq is a PDF over Y. At belief state b “ px, ppyqq the applicable actions are
Ab “ Ax Ă X ˆ X , corresponding to possible combinations of movement to
x1 P N pxq and sensing x2 P N px1q. In the goal vertex g P X , A
g “ tg, gu. The
state transition model is factorised T “ Txˆ Ty with Tx : X ˆ A ˆ X Ñ r0, 1s
deterministic as in (3.2) and Ty : Y ˆ A ˆ Y Ñ r0, 1s is independent of
a P A, i.e. Typy1, i, yq “ Typy1, j, yq @i, j P A. The observation set is Z, and
the observation model is O : Z ˆ X ˆ Y ˆ A Ñ r0, 1s. The real-valued reward function is R : S ˆ A Ñ R, such that for s “ px, yq where x “ g, Rps, aq “ 0 @a P A.
The optimal solution of the problem specifies a policy for how to traverse the graph and concurrently operate the sensing subsystem in order to best assist in reaching the goal. Knowledge regarding the goal vertex g is incorporated into the action space and reward function and does not need to be explicitly stated in the problem definition.
The factorised state transition model leads to factorised belief update function
τ : B ˆ A ˆ Z Ñ B as well. Given a belief state b “ px, ppyqq, an action a
and the resulting observation z1, the updated belief is τ pb, a, z1q “ px1, ppy1 |
b, z1, aqq, where x1 is determined through Eq. (3.2), and ppy1 | b, z1, aq is
obtained by Bayesian filtering from ppyq applying Ty and O. The Bayesian filtering operation is as in Equations (2.4)-(2.6), replacing s,s1 by y,y1, bpsq by
ppyq and T by Ty along with the appropriate changes in function arguments. The path planning problem as stated here is closely related to the Canadian traveller’s problem (CTP) (Papadimitriou and Yannakakis, 1991). In the CTP, the cost of traversing an edge is revealed only when a vertex incident to the edge is visited. The cost assumes either a finite or infinite value, denoting traversable and non-traversable edges, respectively. The prior probability of each edge being traversable is known, and the objective is to minimise the expected total cost of traversal.
The remote sensing variant where the agent may sense the cost of any edge regardless whether it is adjacent to the current vertex or not was studied by Bnaya et al. (2009). Psaraftis and Tsitsiklis (1993) considered the case where traversal costs are described by a known function dependent on an environmental variable. The environmental variables are fully observable and independent and evolve according to finite state Markov chains. Problem 3.4 combines characteristics of these variants to yield a case where edge costs can be observed by a noisy sensor and depend on dynamic, partially observable environment variables.
We remark that the formulation of path planning presented here is specific to the sensor management context. An arguably more typical definition of a path planning problem in mobile robotics considers finding paths with
metric map representations satisfying kinematic and dynamic constraints of robot motion (LaValle, 2006).