I. THE DECOUPLING PRINCIPLE

2. GENERAL BACKGROUND

In this chapter, we present the general background on the stochastic optimal control of a single-agent system. We consider only the problem with imperfect state information, which is more general than the problem with perfect state information. In the next chapter, we define the specific problems that are tackled in this research.

2.1 Single-Agent Model

Probability space (notation): Let $(\Omega, \mathcal{F}, P)$ be a probability space, with random variables taking values in some measurable space $(X, \mathcal{B})$, where $X$ is generally a Euclidean space of dimension $n_x$ or a smooth manifold in this space, and $\mathcal{B}$ is the corresponding σ-algebra of Borel sets.

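Notations: Let $x \in X \subset \mathbb{R}^{n_x}$, $u \in U \subset \mathbb{R}^{n_u}$, and $z \in Z \subset \mathbb{R}^{n_z}$ denote the state, control and observation vectors, respectively, and let $f : X \times U \times \mathbb{R}^{n_x} \to X$ and $h : X \times \mathbb{R}^{n_z} \to Z$ denote the process and measurement models, respectively.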

Discrete-time system equations: We consider the general discrete-time system equations:

$$x_{t+1} = f(x_t, u_t, \omega_t), \tag{2.1a}$$

$$z_t = h(x_t, \nu_t), \tag{2.1b}$$

where the $n_x$- and $n_z$-dimensional random sequences $\{\omega_t, t \ge 0\}$ and $\{\nu_t, t \ge 0\}$ are mutually independent, zero-mean, and i.i.d. (independent and identically distributed), and $x_0 \sim p_0(\cdot)$.
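As an illustration of how equations (2.1a) and (2.1b) generate a trajectory, the following is a minimal sketch. The scalar models `f` and `h`, the Gaussian noise, and the parameter names `sigma_w` and `sigma_v` are assumptions chosen for the example; they are not part of the model above, which leaves both functions and the noise distributions general.

```python
import random

def f(x, u, w):
    """Hypothetical process model: scalar integrator with additive noise."""
    return x + u + w

def h(x, v):
    """Hypothetical measurement model: noisy direct reading of the state."""
    return x + v

def simulate(x0, controls, sigma_w=0.1, sigma_v=0.2, seed=0):
    """Roll out the system for a given control sequence u_0, ..., u_{K-1}.

    Returns the states x_0, ..., x_K and the observations z_1, ..., z_K.
    """
    rng = random.Random(seed)
    xs, zs = [x0], []
    for u in controls:
        # x_{t+1} = f(x_t, u_t, omega_t), with zero-mean i.i.d. process noise
        x_next = f(xs[-1], u, rng.gauss(0.0, sigma_w))
        # z_{t+1} = h(x_{t+1}, nu_{t+1}), with zero-mean i.i.d. measurement noise
        zs.append(h(x_next, rng.gauss(0.0, sigma_v)))
        xs.append(x_next)
    return xs, zs

xs, zs = simulate(0.0, [1.0, 1.0, 1.0])
```

Fixing the seed makes a rollout reproducible, which is convenient when comparing different control sequences on the same noise realization.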

Data history: Let us define the data history of observations and actions for $1 \le t \le K$ as $D_t := \{z_{0:t}, u_{0:t-1}\}$, where $u_{0:t-1}$ and $z_{0:t}$ denote the actions and observations from the beginning up to time step $t$. Note that there is no observation at time 0; $z_0$ is defined artificially only to model the initial distribution. This will be useful later in the definition of the control policy.

The conditional distribution: The conditional distribution of $\theta_t := x_t | D_t$, $1 \le t \le K$, denoted by $p_t$, is the conditional distribution of the original system. It is a sufficient statistic for the estimation and control of the original system. The evolution of $p_t$ is based on the Bayesian update equation, which can be summarized as a function $\tau_t : \mathbb{R} \times \mathcal{I} \times U \times Z \to \mathcal{I}$ [2, 12, 15], where $p_{t+1} = \tau_t(p_t, u_t, z_{t+1})$, $p_0$ is given, and $\mathcal{I}$ denotes the space of conditional distributions. We also define $\theta_0 := x_0$. We will denote $p_t(x_t = x, D_t = D)$ by $p_t(x, D)$ throughout the text.

Next, we revisit some of the concepts related to the conditional distribution and derive $\tau_t$.

2.1.1 Features of the Conditional Distribution

Sufficient statistic: A statistic is a function of the observations $z_{0:t}$. A statistic $g(z_{0:t})$ is said to be "sufficient" for the parameter set $\Theta$ if the conditional density of $z_{0:t}$ given $g(z_{0:t})$ does not depend on $\theta$. That is, $p(z_{0:t} | g(z_{0:t}), \theta)$ does not depend on $\theta$. It is proved in [2] that $g(z_{0:t})$ is a sufficient statistic for $\Theta$ if and only if there are functions $q_1$, $q_2$ such that:

$$p(z_{0:t} | \theta) = q_1(g(z_{0:t}), \theta)\, q_2(z_{0:t}), \quad \theta \in \Theta.$$

That is, if $p(z_{0:t} | \theta)$ depends on $\theta$ only through $g(z_{0:t})$.
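A small numerical check can make the factorization concrete. The sketch below assumes i.i.d. Bernoulli observations with success probability θ, an example that is not taken from the text; in that case the number of ones is a sufficient statistic, with $q_1(g, \theta) = \theta^{g}(1-\theta)^{n-g}$ for a sequence of known length $n$ and $q_2 \equiv 1$. The function names are hypothetical.

```python
def bernoulli_likelihood(zs, theta):
    """p(z_{0:t} | theta) for i.i.d. Bernoulli(theta) observations (assumed model)."""
    like = 1.0
    for z in zs:
        like *= theta if z == 1 else 1.0 - theta
    return like

def q1(g, n, theta):
    """Factor that depends on the data only through g = sum(zs), for known length n."""
    return theta ** g * (1.0 - theta) ** (n - g)

# Two different observation sequences with the same value of the statistic.
zs_a = [1, 0, 1, 0]
zs_b = [0, 0, 1, 1]

for theta in (0.2, 0.5, 0.9):
    la = bernoulli_likelihood(zs_a, theta)
    lb = bernoulli_likelihood(zs_b, theta)
    # Same statistic, same likelihood, for every theta (up to rounding).
    assert abs(la - lb) < 1e-12
    # The likelihood equals q1(g, theta) * q2(z) with q2 = 1.
    assert abs(la - q1(sum(zs_a), len(zs_a), theta)) < 1e-12
```

The two sequences are indistinguishable as far as inference about θ is concerned, which is exactly what sufficiency of the statistic means.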

Conditional distribution as a sufficient statistic: In a system where the state is only partially observed, the controller needs to keep track of its knowledge about the current state of the system given the data history. The conditional distribution of the state given the data history is a sufficient statistic for the given history. That means that it contains all the necessary information for decision making at time $t$.

Here, $g(z_{0:t}) = p_{X_t|Z_{0:t};U_{0:t-1}}(x | z_{0:t}; u_{0:t-1}; p_0)$. Therefore,

$$\begin{aligned}
p_{Z_{0:t}|U_{0:t-1},X_t}(z_{0:t} | u_{0:t-1}, x) &= \frac{p_{X_t|Z_{0:t},U_{0:t-1}}(x | z_{0:t}, u_{0:t-1})\, p_{Z_{0:t}|U_{0:t-1}}(z_{0:t} | u_{0:t-1})}{p_{X_t|U_{0:t-1}}(x | u_{0:t-1})} \\
&= q_1\big(p_{X_t|Z_{0:t},U_{0:t-1}}(x | z_{0:t}, u_{0:t-1}), x\big)\, q_2(z_{0:t}),
\end{aligned}$$

where

$$q_1\big(p_{X_t|Z_{0:t},U_{0:t-1}}(x | z_{0:t}, u_{0:t-1}), x\big) = \frac{p_{X_t|Z_{0:t},U_{0:t-1}}(x | z_{0:t}, u_{0:t-1})}{p_{X_t|U_{0:t-1}}(x | u_{0:t-1})},$$

and $q_2(z_{0:t}) = p_{Z_{0:t}|U_{0:t-1}}(z_{0:t} | u_{0:t-1})$. Therefore, the conditional distribution over the augmented state is indeed a sufficient statistic for the parameter.

Transition function: $T_t : X \times U \times X \to \mathbb{R}$ is the transition function describing the probability of transitioning from state $x'$ to state $x$ after taking action $u$ at time step $t$, where $T_t(x, u, x') := p_{X_{t+1}|U_t,X_t}(x | u, x')$. Note that this function, which is an equivalent representation of the process model $x_{t+1} = f(x_t, u_t, \omega_t)$, describes the uncertainty in the effect of the action, i.e., the process uncertainty.

Likelihood function: $\Omega_t : Z \times X \to \mathbb{R}$ is the likelihood function describing the probability of observing $z$ at state $x$ at time step $t$, where $\Omega_t(z, x) := p_{Z_t|X_t}(z | x)$. Similarly, this function, which equivalently describes the observation model $z_t = h(x_t, \nu_t)$, is needed to describe the uncertainty in perception, i.e., the measurement uncertainty.

Bayesian update: Since the system is only partially observable, an estimation module is needed to update the conditional distribution after taking an action and perceiving an observation. The well-known Bayesian update equation [12, 2, 14] gives us the general mechanism to update the conditional distribution over the state after taking an action and perceiving an observation:

$$p_{t+1}(x, D) = \eta\, \Omega_{t+1}(z, x) \int_{x' \in X} T_t(x, u, x')\, p_t(x', D)\, dx', \tag{2.2}$$

where $\eta$ is a normalizing constant. This equation is summarized as $p_{t+1} = \tau_t(p_t, u_t, z_{t+1})$.
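To make the update (2.2) concrete, here is a minimal sketch on a finite state space, where the integral becomes a sum. The list-based layout (`T[u][x_prev][x]`, `Omega[x][z]`), the function name `bayes_update`, and the two-state numbers are assumptions for illustration only; the text itself works with general densities.

```python
def bayes_update(p, u, z, T, Omega):
    """One step of the Bayesian update on a finite state space.

    p[x_prev]        : current belief p_t
    T[u][x_prev][x]  : probability of reaching x from x_prev under action u
    Omega[x][z]      : probability of observing z when the state is x
    Raises ZeroDivisionError if the observation has zero probability under p.
    """
    n = len(p)
    # Sum over x' of T_t(x, u, x') p_t(x'): belief after the action, before the observation.
    predicted = [sum(T[u][xp][x] * p[xp] for xp in range(n)) for x in range(n)]
    # Weight each state by the likelihood of the new observation.
    unnormalized = [Omega[x][z] * predicted[x] for x in range(n)]
    eta = 1.0 / sum(unnormalized)  # normalizing constant
    return [eta * w for w in unnormalized]

# Hypothetical two-state example with a single action (index 0).
T = [[[0.9, 0.1], [0.1, 0.9]]]
Omega = [[0.8, 0.2], [0.2, 0.8]]
p_next = bayes_update([0.5, 0.5], 0, 0, T, Omega)
# Prediction leaves the uniform belief unchanged; observing z = 0 then shifts it to [0.8, 0.2].
```

With a uniform prior and a symmetric transition model, the posterior is determined by the likelihood alone, which makes this case easy to verify by hand.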

Information state: $\Upsilon_t$ is an information state for the stochastic system (3.1) if it is a function of $D_t$ and $\Upsilon_{t+1}$ can be determined from $\Upsilon_t$, $z_{t+1}$ and $u_t$ [2]. We show that the conditional distribution over the state is an information state. Moreover, it is a sufficient statistic for the stochastic control problem.

Conditional distribution is an information state: We now derive the Bayesian recursion formula for the conditional distribution:

$$p_{X_t|Z_{0:t},U_{0:t-1}}(x | z_{0:t}, u_{0:t-1}) = \frac{p_{Z_t|X_t}(z_t | x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x | z_{0:t-1}, u_{0:t-1})}{p_{Z_{0:t},U_{0:t-1}}(z_{0:t}, u_{0:t-1})}.$$

We have:

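$$\begin{aligned}
p_{X_t|Z_{0:t-1},U_{0:t-1}}&(x | z_{0:t-1}, u_{0:t-1}) \\
&= \int_{x' \in X} p_{X_t|X_{t-1},Z_{0:t-1},U_{0:t-1}}(x | x', z_{0:t-1}, u_{0:t-1})\, p_{X_{t-1}|Z_{0:t-1},U_{0:t-1}}(x' | z_{0:t-1}, u_{0:t-1})\, dx' \\
&= \int_{x' \in X} p_{X_t|U_{t-1},X_{t-1}}(x | u_{t-1}, x')\, p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(x' | z_{0:t-1}, u_{0:t-2})\, dx' \\
&= \int_{x' \in X} T_{t-1}(x, u, x')\, p_{t-1}(x', z_{0:t-1}, u_{0:t-2}, p_0)\, dx' \\
&:= \Psi_t\big(p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(\cdot | z_{0:t-1}, u_{0:t-2}), u_{t-1}\big)(x) \\
&= \Psi_t\big(p_{t-1}(\cdot, z_{0:t-1}, u_{0:t-2}, p_0), u_{t-1}\big)(x),
\end{aligned} \tag{2.3}$$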

where $p_{t-1}(x', z_{0:t-1}, u_{0:t-2}, p_0) = p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(x' | z_{0:t-1}, u_{0:t-2})$. Moreover,

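$$\begin{aligned}
p_{X_t|Z_{0:t},U_{0:t-1}}(x | z_{0:t}, u_{0:t-1}) &= \frac{p_{Z_t|X_t}(z_t | x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x | z_{0:t-1}, u_{0:t-1})}{p_{Z_{0:t},U_{0:t-1}}(z_{0:t}, u_{0:t-1})} \\
&= \frac{p_{Z_t|X_t}(z_t | x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x | z_{0:t-1}, u_{0:t-1})}{\int_{x \in X} p_{Z_t|X_t}(z_t | x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x | z_{0:t-1}, u_{0:t-1})\, dx} \\
&= \frac{\Omega_t(z, x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x | z_{0:t-1}, u_{0:t-1})}{\int_{x \in X} \Omega_t(z, x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x | z_{0:t-1}, u_{0:t-1})\, dx} \\
&:= \Phi_t\big(p_{X_t|Z_{0:t-1},U_{0:t-1}}(\cdot | z_{0:t-1}, u_{0:t-1}), z_t\big)(x).
\end{aligned} \tag{2.4}$$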

Hence, we have:

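$$\begin{aligned}
p_{X_t|Z_{0:t},U_{0:t-1}}(x | z_{0:t}, u_{0:t-1}) &= \Phi_t\big[\Psi_t\big(p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(\cdot | z_{0:t-1}, u_{0:t-2}), u_{t-1}\big), z_t\big] \\
&:= \tau_t\big(p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(\cdot | z_{0:t-1}, u_{0:t-2}), u_{t-1}, z_t\big),
\end{aligned} \tag{2.5}$$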

which is the same formula obtained in (2.2). Therefore, we can compute the conditional distribution at time $t$ from the conditional distribution at time $t-1$, using $z_t$ and $u_{t-1}$. This also proves that the conditional distribution over the state is an information state. Note that in order to solve the above recursion, we need the initial condition:

$$p_{X_0|Z_0,U_{-1}}(\cdot | z_0, u_{-1}) := p_{X_0}(\cdot). \tag{2.6}$$
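The information-state property can be checked numerically on a small example: running the recursion from the initial condition should give the same result as conditioning on the whole data history at once. The sketch below assumes a finite state space and a time-invariant model; the names `psi`, `phi`, `tau`, `filter_recursive`, `filter_batch` and all numbers are hypothetical and serve only to mirror the prediction step, the correction step, and their composition.

```python
from itertools import product

def psi(p_prev, u, T):
    """Prediction step: push the previous belief through the transition model."""
    n = len(p_prev)
    return [sum(T[u][xp][x] * p_prev[xp] for xp in range(n)) for x in range(n)]

def phi(p_pred, z, Omega):
    """Correction step: reweight by the likelihood of z and normalize."""
    w = [Omega[x][z] * p_pred[x] for x in range(len(p_pred))]
    total = sum(w)
    return [wi / total for wi in w]

def tau(p_prev, u, z, T, Omega):
    """Composition of prediction and correction."""
    return phi(psi(p_prev, u, T), z, Omega)

def filter_recursive(p0, us, zs, T, Omega):
    """Apply tau step by step, starting from the initial condition p0."""
    p = list(p0)
    for u, z in zip(us, zs):
        p = tau(p, u, z, T, Omega)
    return p

def filter_batch(p0, us, zs, T, Omega):
    """Posterior of the final state by brute-force enumeration of all state paths."""
    n, steps = len(p0), len(us)
    post = [0.0] * n
    for path in product(range(n), repeat=steps + 1):
        w = p0[path[0]]
        for k in range(steps):
            w *= T[us[k]][path[k]][path[k + 1]] * Omega[path[k + 1]][zs[k]]
        post[path[-1]] += w
    total = sum(post)
    return [v / total for v in post]

# Hypothetical two-state, two-action, two-observation model.
T = [
    [[0.9, 0.1], [0.1, 0.9]],  # action 0: mostly stay
    [[0.2, 0.8], [0.8, 0.2]],  # action 1: mostly switch
]
Omega = [[0.8, 0.2], [0.3, 0.7]]  # Omega[x][z]
p0 = [0.6, 0.4]
us = [0, 1, 0]
zs = [0, 1, 1]  # zs[k] is the observation received after applying us[k]

recursive = filter_recursive(p0, us, zs, T, Omega)
batch = filter_batch(p0, us, zs, T, Omega)
```

The two results agree up to floating-point rounding, while the recursive version stores only the current belief and costs a fixed amount per step instead of growing exponentially with the horizon.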

2.2 Elements of the Stochastic Control Problem

Incremental cost function: Assuming that the time horizon is finite, $K < \infty$, $c_t(x_t, u) : X \times U \to \mathbb{R}$ denotes the one-step or immediate cost incurred by executing action $u$ at state $x_t$. Moreover, $c_K(x_K)$ denotes the terminal cost.

Policy function: The feedback policy (planner, or feedback control law) is a sequence of functions $\pi = \{\pi_0, \pi_1, \cdots\}$, where $\pi_t : Z^{t+1} \to U$ specifies the action given the output (i.e., the observations). In a problem with perfect state measurements, the output of the system is a direct function of the state and, therefore, the policy is state-dependent. Thus, $u_t = \pi_t(z_{0:t})$, where $\pi = \{\pi_0, \cdots, \pi_{K-1}\}$ is a policy denoted by a finite sequence (since $K < \infty$). A policy is feasible if $u_t = \pi_t(z_{0:t}) \in U$. We denote the space of feasible policies by $\Pi$.

Cost associated with the policy: Let $\pi \in \Pi$, and let $\{x^\pi_t\}$, $\{u^\pi_t\}$ and $\{z^\pi_t\}$ be the random processes associated with (and dependent on) that policy. We can define the cost function $J^\pi : X^{K+1} \times U^K \to \mathbb{R}$ associated with $\pi$ as:

$$J^\pi := \sum_{t=0}^{K-1} c_t(x^\pi_t, u^\pi_t) + c_K(x^\pi_K).$$

For notational simplicity, we denote the cost associated with the policy $\pi$ by $\sum_{t=0}^{K-1} c^\pi_t(x_t, u_t) + c^\pi_K(x_K)$. A proper choice of this cost function is an important aspect of the overall modeling of the problem.
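For a single realized trajectory, the cost is a plain sum of stage costs plus the terminal cost. The following minimal sketch evaluates it; the quadratic stage and terminal costs and the function name `trajectory_cost` are assumptions made for the example, since the text leaves the cost functions unspecified.

```python
def trajectory_cost(xs, us, stage_cost, terminal_cost):
    """Cost of one realized trajectory: sum of c_t(x_t, u_t) for t < K, plus c_K(x_K).

    xs holds x_0, ..., x_K and us holds u_0, ..., u_{K-1}.
    """
    if len(xs) != len(us) + 1:
        raise ValueError("expected K + 1 states and K controls")
    # zip stops after K pairs, so the terminal state x_K is not charged a stage cost.
    running = sum(stage_cost(t, x, u) for t, (x, u) in enumerate(zip(xs, us)))
    return running + terminal_cost(xs[-1])

# Hypothetical quadratic costs.
def stage_cost(t, x, u):
    return x ** 2 + u ** 2

def terminal_cost(x):
    return 10.0 * x ** 2

# Stage costs (4 + 1) + (1 + 1) = 7, terminal cost 0.
J = trajectory_cost([2.0, 1.0, 0.0], [-1.0, -1.0], stage_cost, terminal_cost)
```

Because the states and controls are random under a given policy, a value computed this way is one sample of the cost, not its expectation.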

Cost-to-go function: Due to the randomness of the processes $\{x^\pi_t\}$ and $\{u^\pi_t\}$, $J^\pi$ is a random variable. Therefore, we define the cost-to-go as the expected cost $E[J^\pi]$, which is deterministic, with the expectation taken over all randomness. This expectation can be written as:

$$\begin{aligned}
E[J^\pi(x_{0:K}, u_{0:K-1})] &= E\Big[\sum_{t=0}^{K-1} c^\pi_t(x_t, u_t) + c^\pi_K(x_K)\Big] \\
&= E\Big[\sum_{t=0}^{K-1} E[c^\pi_t(x_t, u_t) | D_t] + E[c^\pi_K(x_K) | D_K]\Big] \\
&= E\Big[\sum_{t=0}^{K-1} \int_X c^\pi_t(x_t, u_t)\, p_t(x_t | z_{0:t}, u_{0:t-1})\, dx_t + \int_X c^\pi_K(x_K)\, p_K(x_K | z_{0:K}, u_{0:K-1})\, dx_K\Big] \\
&=: E\Big[\sum_{t=0}^{K-1} c^{\pi,p}_t(p_t, u_t) + c^{\pi,p}_K(p_K)\Big] \\
&=: E[J^{\pi\prime}(p_{0:K}, u_{0:K-1})],
\end{aligned}$$

where $c^{\pi,p}_t$, $c^{\pi,p}_K$, and $J'$ are defined using the above equations with respect to the conditional distribution, and the last expectation is taken over all possible conditional distributions.
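The belief-space stage cost is the state-space cost averaged under the conditional distribution. On a finite state space the integral reduces to a weighted sum, as in the sketch below; the state grid, the quadratic cost, and the name `belief_stage_cost` are assumptions used only for illustration.

```python
def belief_stage_cost(p, u, c, states):
    """Stage cost of a belief: sum over x of c(x, u) p(x) on a finite state space."""
    if len(p) != len(states):
        raise ValueError("belief and state grid must have the same length")
    return sum(c(x, u) * px for x, px in zip(states, p))

# Hypothetical three-state example with a quadratic cost.
states = [0.0, 1.0, 2.0]
belief = [0.5, 0.25, 0.25]

def c(x, u):
    return x ** 2 + u ** 2

# 0.5 * 1 + 0.25 * 2 + 0.25 * 5 = 2.25
cost = belief_stage_cost(belief, 1.0, c, states)
```

A belief concentrated on a single state recovers the ordinary stage cost at that state, so the belief-space cost extends the state-space one.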

Problem ingredients: The stochastic control problem can be represented by an n-tuple: $\{X, U, Z, p_0, T_t, \Omega_t, c_t, K\}$.

Problem 1 General stochastic control problem The objective in our stochastic