Partially observable Markov decision processes

2.1.2 Partially observable Markov decision processes

We now consider the case where the state of the system $S$ is not directly observable by the agent. After a state transition, the agent now perceives an observation $z_{t+1} \in Z$ instead of learning the value of the state $s_{t+1} \in S$. We assume that the observations are conditionally independent given the current state and previous action, i.e. $Z_{t+1} \sim p(z_{t+1} \mid s_{t+1}, a_t)$. This probabilistic observation model is assumed independent of the decision epoch. The observation model is represented as a PDF $O : Z \times S \times A \to \mathbb{R}^{+}$, such that $O(z', s', a)$ is the value of the PDF for observation $z'$ when the system is in state $s'$ after action $a$ was executed at the previous decision epoch. A valid observation model must satisfy $\int_Z O(z', s', a)\,dz' = 1$ for all $s' \in S$ and $a \in A$.
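For finite $Z$, $S$ and $A$, the normalisation condition becomes a sum over observations. The following is a minimal sketch of such a check, assuming a nested-list layout `O[z][s_next][a]` that mirrors the argument order of $O(z', s', a)$; the function name and the two-state example model are illustrative assumptions, not part of the original text.

```python
def is_valid_observation_model(O, tol=1e-9):
    """Check that sum_z O(z, s', a) = 1 and O >= 0 for every (s', a).

    O is indexed as O[z][s_next][a] (an assumed layout for finite Z, S, A).
    """
    n_obs, n_states, n_actions = len(O), len(O[0]), len(O[0][0])
    for s_next in range(n_states):
        for a in range(n_actions):
            column = [O[z][s_next][a] for z in range(n_obs)]
            if any(p < 0.0 for p in column):
                return False
            if abs(sum(column) - 1.0) > tol:
                return False
    return True


# Hypothetical model: 2 observations, 2 states, 1 action.
# The observation matches the state with probability 0.8.
O_good = [[[0.8], [0.2]],
          [[0.2], [0.8]]]
# Same shape, but the column for s' = 0 sums to 1.1.
O_bad = [[[0.9], [0.2]],
         [[0.2], [0.8]]]

good_ok = is_valid_observation_model(O_good)
bad_ok = is_valid_observation_model(O_bad)
```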

As the state is not directly observed, the agent’s knowledge about the state at the start of the process is modelled by a PDF $p(s_0) \in \mathcal{P}(S)$. If no prior information exists, $p(s_0)$ is defined to be a uniform distribution over $S$. Knowledge that the initial state is $s_0$ can be modelled by setting $p(s_0)$ to a degenerate distribution at $s_0$.
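The two choices of prior can be illustrated for a finite state space. This is a sketch only: states are assumed to be indexed $0, \ldots, n-1$, and the helper names are hypothetical.

```python
def uniform_belief(n_states):
    """Uniform prior: no prior information about the initial state."""
    return [1.0 / n_states] * n_states


def degenerate_belief(n_states, s0):
    """Point-mass prior: the initial state is known to be s0."""
    return [1.0 if s == s0 else 0.0 for s in range(n_states)]


b_uniform = uniform_belief(4)      # equal mass on each of the 4 states
b_known = degenerate_belief(4, 2)  # all mass on state index 2
```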

At decision epoch $t$, the prior $p(s_0)$ and past actions and observations contain all information that the agent has available about the current and past states of the system. Let $h_0 = p(s_0) \in H_0$, and for $t \ge 1$, let $h_t = (p(s_0), a_0, z_1, a_1, z_2, \ldots, a_{t-1}, z_t) \in H_t$ denote the history at decision epoch $t$. We have $H_0 = \mathcal{P}(S)$, and for any $t \ge 1$ the recursive relationships $H_t = H_{t-1} \times A \times Z$ and $h_t = (h_{t-1}, a_{t-1}, z_t)$.
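The recursive structure of the history can be mirrored in a small data structure. The sketch below assumes the history is stored as a flat tuple whose first element is the prior; the action and observation labels are made up for illustration.

```python
def extend_history(h_prev, a_prev, z):
    """Build h_t = (h_{t-1}, a_{t-1}, z_t), stored flat as (p(s_0), a_0, z_1, ...)."""
    return h_prev + (a_prev, z)


h0 = ((0.5, 0.5),)                            # h_0 = p(s_0): uniform prior, two states
h1 = extend_history(h0, "left", "beep")       # after action a_0 and observation z_1
h2 = extend_history(h1, "right", "silence")   # after action a_1 and observation z_2
```

In this layout a history at decision epoch $t$ always has $1 + 2t$ entries, which grows without bound as the process progresses.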

State estimation is a procedure by which a history $h_t$ is mapped into a PDF over the state. In a general decision process, one may be interested in the PDF over all past states of the system, as they can all affect future states and rewards. In a partially observable Markov decision process (POMDP), the Markov property implies that a PDF over the current system state summarizes all relevant knowledge. This conditional PDF $p(s_t \mid h_t)$ is called the belief state. We adopt the notation $b_t(s_t) = p(s_t \mid h_t)$³, and denote $\mathcal{P}(S) = B$, and call this set the belief space.

State estimation is carried out recursively as the process progresses. For brevity of notation, we refer in the following to $s_t$, $a_t$, $b_t$, $z_{t+1}$, and $b_{t+1}$ as $s$, $a$, $b$, $z'$, and $b'$, respectively. Suppose we are given $b$, and the agent then executes an action $a \in A$ and perceives $z' \in Z$. The posterior belief state $b' \equiv p(s' \mid b, z', a)$ is given by the belief update equation $\tau : B \times A \times Z \to B$, defined via Bayes’ rule as

$$b' = \tau(b, a, z') = \frac{O(z', s', a)\, p(s' \mid b, a)}{p(z' \mid b, a)}, \tag{2.4}$$

where $p(s' \mid b, a)$ is the predictive PDF of the state at the next decision epoch given the current belief state $b$ and action $a$. This PDF is obtained from the Chapman-Kolmogorov equation (Brzeźniak and Zastawniak, 1999) as

$$p(s' \mid b, a) = \int_S T(s', a, s)\, b(s)\, ds. \tag{2.5}$$

For finite $S$, the integration is replaced by summation. The term $p(z' \mid b, a)$ in (2.4) is a normalisation term equal to the prior probability of observing $z'$, obtained by

$$p(z' \mid b, a) = \int_S O(z', s', a)\, p(s' \mid b, a)\, ds', \tag{2.6}$$

replacing integration by summation for finite $S$.
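For finite $S$ and $Z$, equations (2.4) to (2.6) reduce to a few sums. The following is a minimal sketch under stated assumptions: the models are stored as nested lists `T[s_next][a][s]` and `O[z][s_next][a]`, matching the argument order of $T(s', a, s)$ and $O(z', s', a)$, and the two-state example model is hypothetical.

```python
def predict(b, a, T):
    """Eq. (2.5) for finite S: p(s' | b, a) = sum_s T(s', a, s) b(s)."""
    n = len(b)
    return [sum(T[s_next][a][s] * b[s] for s in range(n))
            for s_next in range(n)]


def belief_update(b, a, z, T, O):
    """Eq. (2.4): return the posterior belief b' and the normaliser p(z' | b, a)."""
    pred = predict(b, a, T)
    # Numerator of (2.4) for every next state s'.
    unnorm = [O[z][s_next][a] * pred[s_next] for s_next in range(len(b))]
    p_z = sum(unnorm)  # Eq. (2.6) with the integral replaced by a sum
    if p_z == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [u / p_z for u in unnorm], p_z


# Hypothetical model: 2 states, 1 action, 2 observations.
# The state persists with probability 0.9; the observation
# matches the new state with probability 0.8.
T = [[[0.9, 0.1]],
     [[0.1, 0.9]]]
O = [[[0.8], [0.2]],
     [[0.2], [0.8]]]

b_next, p_z = belief_update([0.5, 0.5], 0, 0, T, O)
```

Starting from a uniform belief, the predictive PDF stays uniform by symmetry, so observing $z' = 0$ shifts the posterior towards state 0 in proportion to the observation likelihoods alone.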

The steps presented above outline a recursive procedure by which the agent’s belief state may be tracked over histories of past actions and observations. A belief state bt is a sufficient statistic for the history ht. The belief state and history may be used interchangeably as representations of the agent’s knowledge.

With the addition of observations, the MDP is a partially observable MDP, or a POMDP. As the state is not observable, the sets of allowed actions must instead depend on the history, or equivalently the belief state instead of the state. Furthermore, with the introduction of belief states we can allow reward functions that are also dependent on the belief state as opposed to the true underlying state of the system. With these modifications, the following definition of a POMDP is given.

³ For $t = 0$, $b$

Definition 2.5 (Partially observable Markov decision process (POMDP)). A partially observable Markov decision process (POMDP) is a tuple $\langle \mathcal{T}, S, \{A_b\}, Z, T, O, R \rangle$, where $\mathcal{T}$ is the set of decision epochs, $S$ is the state space, $A_b$ is the set of actions allowed in belief state $b \in B$ such that $A = \bigcup_{b \in B} A_b$, $Z$ is the observation space, $T : S \times A \times S \to \mathbb{R}^{+}$ is the state transition model, $O : Z \times S \times A \to \mathbb{R}^{+}$ is the observation model, and $R : B \times S \times A \to \mathbb{R}$ is a real-valued reward function.

Since the belief states of the POMDP are fully observed by the agent, a POMDP is equivalent to an MDP over belief states.

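Lemma 2.6 (Belief MDP). A POMDP $\langle \mathcal{T}, S, \{A_b\}, Z, T, O, R \rangle$ is equivalent to an MDP $\langle \mathcal{T}, B, \{A_b\}, T_b, \rho \rangle$, known as the belief MDP, where $T_b : B \times A \times B \to \mathbb{R}^{+}$ is a state transition model for belief states defined as

$$T_b(b', a, b) = \begin{cases} p(z' \mid b, a), & \text{for } b' = \tau(b, a, z') \\ 0, & \text{otherwise} \end{cases} \tag{2.7}$$

and $\rho : B \times A \to \mathbb{R}$ is the reward function defined as the expectation $\rho(b, a) = \mathbb{E}_{s \sim b}[R(b, s, a)]$.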
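For finite $S$ and $Z$, the belief MDP of Lemma 2.6 can be explored by enumerating, for a given belief and action, the successor beliefs together with their probabilities, and by computing the expected reward. The sketch below assumes nested-list models `T[s_next][a][s]` and `O[z][s_next][a]` and a reward passed as a Python callable `R(b, s, a)`; all names and the example model are illustrative assumptions.

```python
def successors(b, a, T, O):
    """List (p(z' | b, a), b') for every observation z' with non-zero probability.

    Each pair is one non-zero entry of T_b(b', a, b) in Eq. (2.7).
    Distinct observations that happen to yield the same b' are not merged here.
    """
    n, n_obs = len(b), len(O)
    # Predictive PDF p(s' | b, a), Eq. (2.5) with a sum over s.
    pred = [sum(T[s2][a][s] * b[s] for s in range(n)) for s2 in range(n)]
    out = []
    for z in range(n_obs):
        unnorm = [O[z][s2][a] * pred[s2] for s2 in range(n)]
        p_z = sum(unnorm)  # p(z' | b, a), Eq. (2.6)
        if p_z > 0.0:
            out.append((p_z, [u / p_z for u in unnorm]))
    return out


def rho(b, a, R):
    """Belief-MDP reward: rho(b, a) = E_{s ~ b}[R(b, s, a)]."""
    return sum(b[s] * R(b, s, a) for s in range(len(b)))


# Hypothetical model: 2 states, 1 action, 2 observations.
T = [[[0.9, 0.1]],
     [[0.1, 0.9]]]
O = [[[0.8], [0.2]],
     [[0.2], [0.8]]]


def reward(b, s, a):
    """Illustrative reward: 1 in state 0, 0 otherwise (ignores b and a)."""
    return 1.0 if s == 0 else 0.0


b = [0.5, 0.5]
next_beliefs = successors(b, 0, T, O)
expected_reward = rho(b, 0, reward)
```

The probabilities returned by `successors` sum to one, which reflects that $T_b(\cdot, a, b)$ places all of its mass on the at most $|Z|$ beliefs reachable from $b$ under action $a$.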

The state space in the belief MDP is the belief space $B$ of the POMDP. Recall that $B = \mathcal{P}(S)$. If the state space is finite, e.g. $|S| = n \in \mathbb{N}$, the belief space is the $(n-1)$-dimensional unit simplex $B = \left\{ v \in \mathbb{R}^n \;\middle|\; \sum_{i=1}^{n} v_i = 1,\ v_i \ge 0 \right\} \subset \mathbb{R}^n$, which contains all PDFs over $S$ (Lovejoy, 1991a). In case the state space is uncountable, e.g. an interval of $\mathbb{R}$, the belief space is instead a function space. Detailed discussion and proofs for Lemma 2.6 may be found e.g. in Bertsekas (1995, Ch. 5) for the case of finite $S$, and in Bertsekas and Shreve (1996, Ch. 10) for the case of general $S$.