
Partially Observable Markov Decision Process

4.2 Background

4.2.2 Partially Observable Markov Decision Process

In an MDP the state is always known. In our problem, however, the person’s position is part of the state and is not always known. To handle this, a Partially Observable Markov Decision Process (POMDP) [Thrun et al., 2005, White, 1991] can be used. The POMDP differs from the fully observable MDP in two main ways: first, the current state is not known, and instead a belief (a probability map) represents the probability of being in each state; second, the belief is updated through observations.

A POMDP is a tuple:

$$\langle S, A, O, T, Z, R, b_0 \rangle \tag{4.5}$$

that contains, like an MDP (discussed in subsection 4.2.1), a set of states ($S$), actions ($A$), rewards ($R$), and a state transition function $T$, which defines the probability of going to $s'$ from $s$ with action $a$: $T(s, a, s') = P(s' \mid s, a)$. Instead of knowing the current state, a POMDP uses observations ($O$) and an observation probability function $Z$, which defines the probability of observing $o$ from the new state $s'$: $Z(o, s', a) = P(o \mid s', a)$. An initial belief $b_0$ also has to be given, since the state is not known initially.

The belief ($B$) is the probability of being in each possible state; Figure 4.1 shows the belief in two situations: the person is hidden (top) and visible (bottom). To update the belief, the observation $o$ and action $a$ are used:

$$b_a^o(s') = \frac{Z(o, s', a) \sum_{s \in S} T(s, a, s')\, b(s)}{\Omega(o \mid b, a)} \tag{4.6}$$

where $b_a^o(s')$ is the probability of being in state $s'$ after doing action $a$ and receiving observation $o$, and $b$ is the previous belief. The belief is normalised by dividing by $\Omega(o \mid b, a)$, the probability of the observation given the belief and action:

$$\Omega(o \mid b, a) = \sum_{s' \in S} Z(o, s', a) \sum_{s \in S} T(s, a, s')\, b(s) \tag{4.7}$$

The initial belief $b_0$ has to be given in advance; for example, a uniform probability distribution over all locations where the person might be hidden (as shown in Figure 4.1(b)). Thereafter, the belief is updated using the observation and the probability functions.
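The belief update of Eq. 4.6 can be illustrated with a minimal sketch. The nested-list layout (`T[s][a][s2]`, `Z[o][s2][a]`), the function name `belief_update`, and the two-state example with its 80% sensor accuracy are assumptions made for illustration only; they are not taken from the model used in this work.

```python
def belief_update(b, a, o, T, Z):
    """Return the updated belief (Eq. 4.6) and the normaliser Omega(o|b,a).

    b          : list with one probability per state
    T[s][a][s2]: P(s2 | s, a)   (assumed layout)
    Z[o][s2][a]: P(o | s2, a)   (assumed layout)
    """
    n = len(b)
    # Numerator of Eq. 4.6 for every next state s2.
    unnorm = [
        Z[o][s2][a] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    omega = sum(unnorm)  # probability of observing o given b and a
    if omega == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [p / omega for p in unnorm], omega


# Hypothetical example: state 0 = "person left", state 1 = "person right".
# A single action (0) that never changes the state.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
# Two observations (0 = "seen left", 1 = "seen right"), correct 80% of the time.
Z = [[[0.8], [0.2]], [[0.2], [0.8]]]

b0 = [0.5, 0.5]  # uniform initial belief
b1, omega = belief_update(b0, a=0, o=0, T=T, Z=Z)
print(b1, omega)
```

Starting from the uniform belief, a single "seen left" observation shifts most of the probability mass to state 0, and the returned belief again sums to one because of the division by the normaliser.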


Figure 4.1: The simulation of two sequential positions. The left images show the maps, with the blue circle being the robot (R) and the red circle the person (P). Black squares are obstacles, and dark grey squares indicate locations that are not visible to the robot. The right images show the belief distribution (without noise), where red indicates a high, white a low, and light blue zero probability.

The reward $R(s, a)$ is given for a state and action, and is used to calculate the best action to take in each (belief) state using the Value Function per action and belief:

$$Q(b, a) = \rho(b, a) + \gamma \sum_{o \in O} \Omega(o \mid b, a)\, V(b_a^o) \tag{4.8}$$

where $\rho(b, a) = \sum_{s \in S} b(s) R(s, a)$ is the reward for the belief $b \in B$ and action $a$; $\gamma$ is the discount factor; and $b_a^o$ is the next belief state, defined in Eq. 4.6. Now we can define the Value Function (like Eq. 4.2):

$$V(b) = \max_{a \in A} Q(b, a) \tag{4.9}$$

The policy $\pi$, the best action to take in each belief state, is:

$$\pi(b) = \arg\max_{a \in A} Q(b, a) \tag{4.10}$$
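As an illustration of Eqs. 4.8 to 4.10, the sketch below evaluates $Q(b, a)$ by a depth-limited lookahead and picks the maximising action. Cutting the recursion off after `depth` steps (treating the value beyond the horizon as zero) is a simplifying assumption of this sketch, as are the array layouts, the function names, and the two-state example in which the reward is 1 when the action index matches the state index.

```python
def rho(b, a, R):
    """Expected immediate reward of action a under belief b."""
    return sum(b[s] * R[s][a] for s in range(len(b)))


def q_value(b, a, depth, T, Z, R, gamma):
    """Q(b, a) of Eq. 4.8, with V cut off after `depth` further steps."""
    value = rho(b, a, R)
    if depth == 0:
        return value  # assumption: V is taken as 0 beyond the horizon
    n_states, n_actions, n_obs = len(b), len(R[0]), len(Z)
    for o in range(n_obs):
        # Unnormalised next belief; its sum is Omega(o | b, a).
        unnorm = [
            Z[o][s2][a] * sum(T[s][a][s2] * b[s] for s in range(n_states))
            for s2 in range(n_states)
        ]
        omega = sum(unnorm)
        if omega == 0.0:
            continue  # this observation cannot occur
        b_next = [p / omega for p in unnorm]
        # V(b_next) = max over actions of Q(b_next, a2), as in Eq. 4.9.
        v_next = max(
            q_value(b_next, a2, depth - 1, T, Z, R, gamma)
            for a2 in range(n_actions)
        )
        value += gamma * omega * v_next
    return value


def policy(b, depth, T, Z, R, gamma):
    """Greedy action of Eq. 4.10 under the depth-limited Q."""
    return max(range(len(R[0])),
               key=lambda a: q_value(b, a, depth, T, Z, R, gamma))


# Hypothetical model: 2 states, 2 actions, 2 observations.
T = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]  # T[s][a][s2], static state
Z = [[[0.8, 0.8], [0.2, 0.2]], [[0.2, 0.2], [0.8, 0.8]]]  # Z[o][s2][a]
R = [[1.0, 0.0], [0.0, 1.0]]                              # R[s][a]
GAMMA = 0.9

print(policy([0.7, 0.3], 2, T, Z, R, GAMMA))
print(policy([0.2, 0.8], 2, T, Z, R, GAMMA))
```

The recursion enumerates every action-observation branch, so its cost grows exponentially with `depth`; it is meant only to make the equations concrete, not as a practical solver.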


4.2.2.1 Policies

The Value Function for a POMDP can be represented as a piecewise-linear convex function [Pineau et al., 2003], described by a list of α-vectors: $V_n = \{\alpha_0, \alpha_1, \dots, \alpha_n\}$, where an α-vector is an $|S|$-dimensional hyperplane containing the values for a certain action $a$.

Figure 4.2: The value function of the POMDP is represented by a list of α-vectors; in this example there are three: $V = \{\alpha_0, \alpha_1, \alpha_2\}$. The horizontal axis shows the belief, with $b(s_0)$ along the bottom and $b(s_1)$ along the top. The vertical axis indicates the expected reward. The red dashed line represents the maximum value for each belief.

Figure 4.2 shows an example of a value function $V$ with three α-vectors. In this example there are only two states, $S = \{s_0, s_1\}$; their belief is shown on the horizontal axis, with $b(s_0)$ along the bottom and $b(s_1)$ along the top. Note that $b(s_1) = 1.0 - b(s_0)$, since there are only two states. The vertical axis shows the expected reward, and each of the α-vectors represents an action: $A = \{a_0, a_1, a_2\}$. The dashed red line indicates the best value for each belief, which would result in the following policy:

$$\pi = \begin{cases} a_0, & \text{if } 0 \le b(s_0) < p_1 \\ a_2, & \text{if } p_1 \le b(s_0) < p_2 \\ a_1, & \text{if } p_2 \le b(s_0) \le 1.0 \end{cases} \tag{4.11}$$


Another way of representing a policy is with a tree; Figure 4.3 shows a partial tree of depth one (a complete policy tree would be much larger). The root of the policy tree represents the current situation, which is a state for the MDP and a belief for the POMDP. The trees in Figure 4.3 show that the POMDP’s policy tree is wider than the MDP’s tree, because it also contains the observations.

Figure 4.3: The policy tree of an MDP and a POMDP model; note that the figures only show a depth of one, and that in (a), the MDP, the state is known, whereas in (b), the POMDP, the belief contains a probability for each state.

Executing the policy is done by finding the maximum approximated expected reward:

$$V(b) = \max_{\alpha \in \Gamma} (\alpha \cdot b) \tag{4.12}$$

where $\Gamma$ is the list of α-vectors and $b$ is the belief point.
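Policy execution according to Eq. 4.12 amounts to a dot product per α-vector followed by a maximum. The sketch below assumes that each α-vector is stored together with its action index; the three vectors are hypothetical values chosen to mimic the shape of Figure 4.2, not values read from it.

```python
def execute_policy(b, gamma_set):
    """Return (V(b), best action) for belief b, following Eq. 4.12.

    gamma_set is a list of (alpha_vector, action) pairs (assumed layout).
    """
    def value(alpha):
        return sum(x * y for x, y in zip(alpha, b))

    alpha, action = max(gamma_set, key=lambda pair: value(pair[0]))
    return value(alpha), action


# Hypothetical alpha-vectors over two states (s0, s1), one per action.
GAMMA_SET = [
    ([0.0, 10.0], 0),  # a0: good when the state is probably s1
    ([10.0, 0.0], 1),  # a1: good when the state is probably s0
    ([6.0, 6.0], 2),   # a2: safe choice when the belief is uncertain
]

for b_s0 in (0.1, 0.5, 0.9):
    b = [b_s0, 1.0 - b_s0]
    print(b, execute_policy(b, GAMMA_SET))
```

With these example vectors the selected action switches from a0 to a2 and then to a1 as $b(s_0)$ increases, which is the same ordering as in the policy of Eq. 4.11; the breakpoints here lie at $b(s_0) = 0.4$ and $b(s_0) = 0.6$.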

4.2.2.2 Policy Calculation

Finding POMDP policies is computationally complex: computing the exact policy is intractable (PSPACE-hard) [Papadimitriou and Tsitsiklis, 1987]. Furthermore, the problem is known to suffer from the curse of dimensionality and the curse of history [Pineau et al., 2003, Silver and Veness, 2010], because the belief space is continuous and therefore infinite.

Exact Value Iteration: The calculation of the Value Function $V$ is also written as $V = HV'$ [Pineau et al., 2003], where $V'$ is the previous version of the Value Function and $H$ is the backup operator. Algorithm 4.3 shows the Exact Value Iteration (see for example [Pineau et al., 2003]), in which the Value Function (Eq. 4.8) is calculated in several steps for all the actions and observations. Firstly, in line 1 the direct reward is calculated for all actions $a$ and stored in the set $\Gamma^{a,*}$; secondly, in line 2 the discounted future reward is calculated for each action $a$, each observation $o$, and every existing α-vector in $V'$; thirdly, the values are summed based on the actions and observations (line 3); finally, all sets of α-vectors are joined in line 4.

The set $\Gamma$ grows exponentially if no pruning is done: in the worst case the new set $V$ has $|A||V'|^{|O|}$ α-vectors, and the time complexity is $|S|^2 |A| |V'|^{|O|}$ [Pineau et al., 2003].

Algorithm 4.3 The steps of the Exact Value Iteration for a POMDP [Pineau et al., 2003].

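1: $\Gamma^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$

2: $\Gamma^{a,o} \leftarrow \alpha^{a,o}(s) = \gamma \sum_{s' \in S} \left[ T(s, a, s')\, Z(o, s', a)\, \alpha'_i(s') \right], \ \forall \alpha'_i \in V'$

3: $\Gamma^{a} = \Gamma^{a,*} \bigoplus_{o \in O} \Gamma^{a,o}$

4: $V = \bigcup_{a \in A} \Gamma^{a}$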
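The four steps of Algorithm 4.3 can be sketched as follows, mainly to make the growth of the vector set visible. The cross-sum of line 3 is implemented with `itertools.product`, no pruning is performed, and duplicate vectors are not removed; the data layout, the function name `exact_backup`, and the small example model are assumptions of this sketch.

```python
from itertools import product


def exact_backup(V_prev, T, Z, R, gamma):
    """One unpruned exact backup; returns a list of (alpha_vector, action)."""
    n_states, n_actions, n_obs = len(T), len(T[0]), len(Z)
    V = []
    for a in range(n_actions):
        # Line 1: direct reward vector for action a.
        alpha_star = [R[s][a] for s in range(n_states)]
        # Line 2: one projected vector per observation and previous alpha.
        gamma_ao = [
            [
                [
                    gamma * sum(T[s][a][s2] * Z[o][s2][a] * alpha[s2]
                                for s2 in range(n_states))
                    for s in range(n_states)
                ]
                for alpha in V_prev
            ]
            for o in range(n_obs)
        ]
        # Line 3: cross-sum, choosing one projected vector per observation.
        for choice in product(*gamma_ao):
            vec = [alpha_star[s] + sum(p[s] for p in choice)
                   for s in range(n_states)]
            V.append((vec, a))  # Line 4: union over all actions.
    return V


# Hypothetical model: 2 states, 2 actions, 2 observations.
T = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]  # T[s][a][s2]
Z = [[[0.8, 0.8], [0.2, 0.2]], [[0.2, 0.2], [0.8, 0.8]]]  # Z[o][s2][a]
R = [[1.0, 0.0], [0.0, 1.0]]                              # R[s][a]

V = [[0.0, 0.0]]  # start from a single zero vector
sizes = []
for _ in range(3):
    backed_up = exact_backup(V, T, Z, R, gamma=0.9)
    V = [alpha for alpha, _ in backed_up]
    sizes.append(len(V))
print(sizes)
```

Each backup produces $|A||V'|^{|O|}$ vectors, so with two actions and two observations the set sizes in this example are 2, 8, and 128 after one, two, and three backups, which shows why pruning or approximation is needed in practice.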

Curse of Dimensionality: The policy of a POMDP does not define which action to take in each state, but rather which action to take in each belief state. Since the belief is a probability distribution, this space is continuous (and thus infinite) with $|S| - 1$ dimensions, $S$ being the set of states. Each added state adds a new dimension to the belief; this is called the Curse of Dimensionality [Pineau et al., 2003]: the problem scales exponentially with the number of states.

Curse of History: Finding an optimal policy starts from an initial belief, from which all action-observation combinations have to be traced. Their number grows exponentially with the planning horizon (search depth), see Figure 4.3. This growth affects POMDP value iteration far more than the Curse of Dimensionality [Pineau et al., 2003].
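A rough count makes the Curse of History concrete: every step of the policy tree branches on one action and one observation, so the number of action-observation histories of length $d$ is at most $(|A||O|)^d$. The function name and the example sizes below are hypothetical.

```python
def count_histories(num_actions, num_observations, depth):
    """Upper bound on the number of action-observation histories of a given depth."""
    return (num_actions * num_observations) ** depth


# Hypothetical problem with 5 actions and 10 observations.
for d in range(1, 5):
    print(d, count_histories(5, 10, d))
```

Even for this small example the bound reaches 6,250,000 histories at depth four, and each history can lead to a different belief point.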


Algorithm 4.4 The backup function of the PBVI algorithm.

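1: $\Gamma^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$

2: $\Gamma^{a,o} \leftarrow \alpha^{a,o}(s) = \gamma \sum_{s' \in S} \left[ T(s, a, s')\, Z(o, s', a)\, \alpha'_i(s') \right], \ \forall \alpha'_i \in V'$

3: $\Gamma^{a}_{b} = \Gamma^{a,*} + \sum_{o \in O} \arg\max_{\alpha \in \Gamma^{a,o}} (\alpha \cdot b)$

4: $V = \arg\max_{\Gamma^{a}_{b},\, \forall a \in A} (\Gamma^{a}_{b} \cdot b), \ \forall b \in B$

4.2.2.3 POMDP Solvers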

Since finding the exact policy is intractable, approximation methods are used. They sample the belief space in a smart way to find a policy [Kurniawati et al., 2008, Pineau et al., 2003].

PBVI: Instead of exploring the whole belief space, it is more practical to explore only representative points of that space. This is done by the Point-Based Value Iteration (PBVI) solver [Pineau et al., 2003, Spaan and Vlassis, 2004], which explores only reachable belief points. Figure 4.3 shows that, for the POMDP, new belief points are reached by doing an action and receiving an observation, both of which come from finite sets.

PBVI extends the set of belief points $B = \{b_0, b_1, \dots, b_n\}$ by searching for reachable belief points that improve the coverage of the belief space. For each of those belief points an α-vector is calculated, which has one entry per state and represents the expected reward for each state. The policy is stored as the collection of α-vectors, and for each α-vector the best action is stored.

Algorithm 4.4 shows the PBVI backup function. PBVI expands the set of belief points greedily, choosing the points that improve the worst-case density as fast as possible. It is an anytime algorithm, i.e. it can return a policy even though it has not yet converged. Here the final solution contains only $|B|$ points, and the time complexity is $|S||A||V'||O||B|$ [Pineau et al., 2003].
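The point-based backup of Algorithm 4.4 can be sketched as below. Unlike the exact backup, it keeps only one α-vector per belief point, so the result never has more than $|B|$ vectors. The belief-point expansion step of PBVI is not shown, and the data layout, function names, and example model are assumptions made for illustration.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def pbvi_backup(B, V_prev, T, Z, R, gamma):
    """Point-based backup; returns one (alpha_vector, action) per belief in B."""
    n_states, n_actions, n_obs = len(T), len(T[0]), len(Z)
    V = []
    for b in B:
        best = None  # (value at b, alpha-vector, action)
        for a in range(n_actions):
            vec = [R[s][a] for s in range(n_states)]  # line 1
            for o in range(n_obs):
                projected = [  # line 2
                    [
                        gamma * sum(T[s][a][s2] * Z[o][s2][a] * alpha[s2]
                                    for s2 in range(n_states))
                        for s in range(n_states)
                    ]
                    for alpha in V_prev
                ]
                # Line 3: keep only the projection that is best at b.
                best_proj = max(projected, key=lambda p: dot(p, b))
                vec = [x + y for x, y in zip(vec, best_proj)]
            value = dot(vec, b)
            if best is None or value > best[0]:  # line 4
                best = (value, vec, a)
        V.append((best[1], best[2]))
    return V


# Hypothetical model: 2 states, 2 actions, 2 observations.
T = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]  # T[s][a][s2]
Z = [[[0.8, 0.8], [0.2, 0.2]], [[0.2, 0.2], [0.8, 0.8]]]  # Z[o][s2][a]
R = [[1.0, 0.0], [0.0, 1.0]]                              # R[s][a]
B = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]                  # fixed belief points

V1 = pbvi_backup(B, [[0.0, 0.0]], T, Z, R, gamma=0.9)
V2 = pbvi_backup(B, [alpha for alpha, _ in V1], T, Z, R, gamma=0.9)
print(len(V1), len(V2))
```

Repeating the backup leaves the number of vectors at $|B|$, in contrast to the exponential growth of the exact backup.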

SARSOP: An improvement on PBVI is the anytime algorithm Successive Approximations of the Reachable Space under Optimal Policies (SARSOP) [Kurniawati et al., 2008], which limits the searched belief space to only optimally reachable belief points. These are the points that are reached by an optimal sequence of actions, starting from the initial belief $b_0$. The algorithm keeps track of a lower and an upper bound; the first is represented by the set of α-vectors $\Gamma$, and the second by a sawtooth approximation [Kurniawati et al., 2008].

Figure 4.4: The figure shows the dependencies of the states and observations for the MDP, POMDP, and MOMDP models.