GENERAL BACKGROUND - THE DECOUPLING PRINCIPLE

I. THE DECOUPLING PRINCIPLE

2. GENERAL BACKGROUND

In this chapter, we present the general background on stochastic optimal control of a single-agent system. We consider only the problem with imperfect state information, which is more general than the problem with perfect state information. In the next chapter, we define the specific problems tackled in this research.

2.1 Single-Agent Model

Probability space (notation): Let (Ω, F, P) be a probability space, with random variables taking values in a measurable space (X, B), where X is generally a Euclidean space of dimension n_x or a smooth manifold in this space, and B is the corresponding σ-algebra of Borel sets.

Notations: Let x ∈ X ⊂ R^{n_x}, u ∈ U ⊂ R^{n_u}, and z ∈ Z ⊂ R^{n_z} denote the state, control, and observation vectors, respectively, and let f : X × U × R^{n_x} → X and h : X × R^{n_z} → Z denote the process and measurement models, respectively.

Discrete-time system equations: We consider the general discrete-time system equations:

$$x_{t+1} = f(x_t, u_t, \omega_t), \tag{2.1a}$$

$$z_t = h(x_t, \nu_t), \tag{2.1b}$$

where the n_x- and n_z-dimensional random sequences {ω_t, t ≥ 0} and {ν_t, t ≥ 0} are mutually independent, zero-mean, and i.i.d. (independent and identically distributed), and x₀ ∼ p₀(·).
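To make the roles of f, h, and the two noise sequences concrete, the following minimal sketch simulates (2.1) for a scalar system. The linear forms of f and h, the Gaussian noise, and all numerical values are assumptions chosen for illustration; they are not part of the general model. The initial state is passed in directly rather than sampled from p₀.

```python
import random


def f(x, u, w):
    """Hypothetical process model (assumed scalar, linear): x_{t+1} = 0.9 x_t + u_t + w_t."""
    return 0.9 * x + u + w


def h(x, v):
    """Hypothetical measurement model (assumed): z_t = x_t + v_t."""
    return x + v


def simulate(x0, controls, sigma_w=0.1, sigma_v=0.2, seed=0):
    """Roll out the system for K = len(controls) steps.

    Returns the states x_0..x_K and the observations z_1..z_K.
    The noise samples are zero-mean i.i.d. Gaussians (an assumption).
    """
    rng = random.Random(seed)
    xs, zs = [x0], []
    for u in controls:
        x_next = f(xs[-1], u, rng.gauss(0.0, sigma_w))  # process step (2.1a)
        zs.append(h(x_next, rng.gauss(0.0, sigma_v)))   # measurement (2.1b)
        xs.append(x_next)
    return xs, zs


xs, zs = simulate(x0=1.0, controls=[0.0, 0.0, 0.0])
print(len(xs), len(zs))  # 4 states, 3 observations
```

Setting both standard deviations to zero recovers the deterministic trajectory, which is a convenient check that the noise enters only through the third argument of f and the second argument of h.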

Data history: Let us define the data history of observations and actions for 1 ≤ t ≤ K as D_t := {z_0:t, u_0:t−1}, where u_0:t−1 and z_0:t denote the actions and observations from the beginning up to time step t. Note that there is no observation at time 0; z₀ is defined artificially only to model the initial distribution. This will be useful later in the definition of the control policy.

The conditional distribution: The conditional distribution of θ_t := x_t | D_t, 1 ≤ t ≤ K, denoted by p_t, is the conditional distribution of the state of the original system given the data history. It is a sufficient statistic for the estimation and control of the original system. The evolution of p_t follows the Bayesian update equation, which can be summarized as a function τ_t : I × U × Z → I [2, 12, 15], where p_{t+1} = τ_t(p_t, u_t, z_{t+1}), p₀ is given, and I denotes the space of conditional distributions. We also define θ₀ := x₀. We denote p_t(x_t = x, D_t = D) by p_t(x, D) throughout the text.

Next, we revisit some of the concepts related to the conditional distribution and derive τ_t.

2.1.1 Features of the Conditional Distribution

Sufficient statistic: A statistic is a function of the observations z_0:t. A statistic g(z_0:t) is said to be “sufficient” for the parameter set Θ if the conditional density of z_0:t given g(z_0:t) does not depend on θ. That is, p(z_0:t | g(z_0:t), θ) does not depend on θ. It is proved in [2] that g(z_0:t) is a sufficient statistic for Θ if and only if there are functions q₁, q₂ such that:

$$p(z_{0:t} \mid \theta) = q_1(g(z_{0:t}), \theta)\, q_2(z_{0:t}), \quad \theta \in \Theta.$$

That is, if p(z_0:t | θ) depends on θ only through g(z_0:t).
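As a concrete illustration of the factorization criterion, consider the textbook case (not taken from this text) of i.i.d. Bernoulli observations with unknown success probability θ, for which the number of ones is a sufficient statistic. The sketch below checks numerically that two observation sequences with the same value of g have the same likelihood for every tested θ, with q₂ ≡ 1. All function names are illustrative.

```python
def likelihood(z, theta):
    """p(z_{0:t} | theta) for i.i.d. Bernoulli(theta) observations z_i in {0, 1}."""
    p = 1.0
    for zi in z:
        p *= theta if zi == 1 else 1.0 - theta
    return p


def g(z):
    """Candidate sufficient statistic: the number of ones."""
    return sum(z)


def q1(s, n, theta):
    """Factor that depends on z only through s = g(z); here q2(z) = 1."""
    return theta ** s * (1.0 - theta) ** (n - s)


z_a = [1, 0, 1, 0]
z_b = [0, 0, 1, 1]  # a different sequence with the same number of ones
for theta in (0.2, 0.5, 0.7):
    la, lb = likelihood(z_a, theta), likelihood(z_b, theta)
    assert abs(la - lb) < 1e-12                       # same g, same likelihood
    assert abs(la - q1(g(z_a), len(z_a), theta)) < 1e-12  # factorization holds
```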

Conditional distribution as a sufficient statistic: In a system where the state is only partially observed, the controller needs to keep track of its knowledge about the current state of the system given the data history. The conditional distribution of the state given the data history is a sufficient statistic for the given history. That is, it contains all the information necessary for decision making at time t.

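Here, g(z_0:t) = p_{X_t|Z_{0:t},U_{0:t−1}}(x | z_0:t, u_0:t−1; p₀). Therefore, by Bayes' rule,

$$
\begin{aligned}
p_{Z_{0:t}|X_t,U_{0:t-1}}(z_{0:t} \mid x, u_{0:t-1})
&= \frac{p_{X_t|Z_{0:t},U_{0:t-1}}(x \mid z_{0:t}, u_{0:t-1})\, p_{Z_{0:t}|U_{0:t-1}}(z_{0:t} \mid u_{0:t-1})}{p_{X_t|U_{0:t-1}}(x \mid u_{0:t-1})} \\
&= q_1\big(p_{X_t|Z_{0:t},U_{0:t-1}}(x \mid z_{0:t}, u_{0:t-1}), x\big)\, q_2(z_{0:t}),
\end{aligned}
$$

where

$$q_1\big(p_{X_t|Z_{0:t},U_{0:t-1}}(x \mid z_{0:t}, u_{0:t-1}), x\big) = \frac{p_{X_t|Z_{0:t},U_{0:t-1}}(x \mid z_{0:t}, u_{0:t-1})}{p_{X_t|U_{0:t-1}}(x \mid u_{0:t-1})}$$

and q₂(z_0:t) = p_{Z_{0:t}|U_{0:t−1}}(z_0:t | u_0:t−1). Therefore, the conditional distribution over the augmented state is indeed a sufficient statistic for the parameter.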

Transition function: T_t : X × U × X → R is the transition function describing the probability of transitioning from state x′ to state x after taking action u at time step t, where T_t(x, u, x′) := p_{X_{t+1}|U_t,X_t}(x | u, x′). Note that this function, which is an equivalent representation of the process model x_{t+1} = f(x_t, u_t, ω_t), describes the uncertainty in the effect of the action, i.e., the process uncertainty.

Likelihood function: Ω_t : Z × X → R is the likelihood function describing the probability of observing z at state x at time step t, where Ω_t(z, x) := p_{Z_t|X_t}(z | x). Similarly, this function, which equivalently describes the observation model z_t = h(x_t, ν_t), describes the uncertainty in perception, i.e., the measurement uncertainty.
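When the noise enters additively and is Gaussian, T_t and Ω_t can be written in closed form. The following sketch assumes the hypothetical scalar models f(x, u, ω) = a x + b u + ω and h(x, ν) = x + ν, with ω ∼ N(0, σ_w²) and ν ∼ N(0, σ_v²); these models, the parameter values, and the function names are illustrative assumptions.

```python
import math


def gaussian_pdf(y, mean, sigma):
    """Density of N(mean, sigma^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def transition(x, u, x_prev, a=0.9, b=1.0, sigma_w=0.1):
    """T(x, u, x') = p(x_{t+1} = x | u_t = u, x_t = x') for x_{t+1} = a x' + b u + w."""
    return gaussian_pdf(x, a * x_prev + b * u, sigma_w)


def likelihood(z, x, sigma_v=0.2):
    """Omega(z, x) = p(z_t = z | x_t = x) for z_t = x_t + v."""
    return gaussian_pdf(z, x, sigma_v)


# Sanity check: for fixed (u, x'), T(., u, x') is a density in x,
# so a Riemann sum over a wide enough grid should be close to one.
dx = 0.001
mass = sum(transition(-1.0 + i * dx, 0.0, 1.0) * dx for i in range(4000))
print(round(mass, 6))
```

The same pattern applies to any noise model with a known density; only gaussian_pdf would change.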

Bayesian update: Since the system is only partially observable, the estimation module must update the conditional distribution after taking an action and perceiving an observation. The well-known Bayesian update equation [12, 2, 14] gives the general mechanism for this update of the conditional distribution over the state:

$$p_{t+1}(x, D) = \eta\, \Omega_{t+1}(z, x) \int_{x' \in X} T_t(x, u, x')\, p_t(x', D)\, dx', \tag{2.2}$$

where η is a normalizing constant. This equation is summarized as p_{t+1} = τ_t(p_t, u_t, z_{t+1}).
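On a finite state space the integral in (2.2) becomes a sum, and the update can be sketched in a few lines. The two-state model below (a single "stay" action that keeps the state with probability 0.9, and a sensor that reports the true state with probability 0.8) is a made-up example; the function name bayes_update and its argument layout are likewise illustrative.

```python
def bayes_update(p, u, z, T, Omega):
    """One step of the Bayesian update (2.2) on a finite state space.

    p[j]        : current belief p_t(x' = j)
    T(i, u, j)  : probability of reaching state i from state j under action u
    Omega(z, i) : probability of observing z in state i
    Returns the updated belief p_{t+1} as a list.
    """
    n = len(p)
    # Prediction: sum over previous states (the integral in (2.2)).
    predicted = [sum(T(i, u, j) * p[j] for j in range(n)) for i in range(n)]
    # Correction: weight by the likelihood of the new observation.
    unnormalized = [Omega(z, i) * predicted[i] for i in range(n)]
    eta = 1.0 / sum(unnormalized)  # normalizing constant
    return [eta * w for w in unnormalized]


def T(i, u, j):
    return 0.9 if i == j else 0.1  # hypothetical dynamics for the action "stay"


def Omega(z, i):
    return 0.8 if z == i else 0.2  # hypothetical sensor model


posterior = bayes_update([0.5, 0.5], "stay", 0, T, Omega)
print([round(q, 3) for q in posterior])  # [0.8, 0.2]
```

Starting from a uniform belief, the prediction leaves the belief uniform and the observation z = 0 shifts the mass toward state 0 in proportion to the sensor accuracy.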

Information state: Υ_t is an information state for the stochastic system (2.1) if it is a function of D_t and Υ_{t+1} can be determined from Υ_t, z_{t+1}, and u_t [2]. We show that the conditional distribution over the state is an information state. Moreover, it is a sufficient statistic for the stochastic control problem.

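Conditional distribution is an information state: We now derive the Bayesian recursion formula for the conditional distribution.

We have:

$$
\begin{aligned}
p_{X_t|Z_{0:t-1},U_{0:t-1}}&(x \mid z_{0:t-1}, u_{0:t-1}) \\
&= \int_{x' \in X} p_{X_t|X_{t-1},Z_{0:t-1},U_{0:t-1}}(x \mid x', z_{0:t-1}, u_{0:t-1})\, p_{X_{t-1}|Z_{0:t-1},U_{0:t-1}}(x' \mid z_{0:t-1}, u_{0:t-1})\, dx' \\
&= \int_{x' \in X} p_{X_t|U_{t-1},X_{t-1}}(x \mid u_{t-1}, x')\, p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(x' \mid z_{0:t-1}, u_{0:t-2})\, dx' \\
&= \int_{x' \in X} T_{t-1}(x, u_{t-1}, x')\, p_{t-1}(x', z_{0:t-1}, u_{0:t-2}, p_0)\, dx' \\
&=: \Psi_t\big(p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(\cdot \mid z_{0:t-1}, u_{0:t-2}), u_{t-1}\big)(x) \\
&= \Psi_t\big(p_{t-1}(\cdot, z_{0:t-1}, u_{0:t-2}, p_0), u_{t-1}\big)(x),
\end{aligned}
\tag{2.3}
$$

where p_{t−1}(x′, z_0:t−1, u_0:t−2, p₀) = p_{X_{t−1}|Z_{0:t−1},U_{0:t−2}}(x′ | z_0:t−1, u_0:t−2). Moreover,

$$
\begin{aligned}
p_{X_t|Z_{0:t},U_{0:t-1}}(x \mid z_{0:t}, u_{0:t-1})
&= \frac{p_{Z_t|X_t}(z_t \mid x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x \mid z_{0:t-1}, u_{0:t-1})}{\int_{x \in X} p_{Z_t|X_t}(z_t \mid x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x \mid z_{0:t-1}, u_{0:t-1})\, dx} \\
&= \frac{\Omega_t(z_t, x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x \mid z_{0:t-1}, u_{0:t-1})}{\int_{x \in X} \Omega_t(z_t, x)\, p_{X_t|Z_{0:t-1},U_{0:t-1}}(x \mid z_{0:t-1}, u_{0:t-1})\, dx} \\
&=: \Phi_t\big(p_{X_t|Z_{0:t-1},U_{0:t-1}}(\cdot \mid z_{0:t-1}, u_{0:t-1}), z_t\big)(x).
\end{aligned}
\tag{2.4}
$$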

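Hence, we have:

$$p_{X_t|Z_{0:t},U_{0:t-1}}(x \mid z_{0:t}, u_{0:t-1}) = \Phi_t\Big(\Psi_t\big(p_{X_{t-1}|Z_{0:t-1},U_{0:t-2}}(\cdot \mid z_{0:t-1}, u_{0:t-2}), u_{t-1}\big), z_t\Big)(x), \tag{2.5}$$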

which is the same formula obtained in (2.2). Therefore, we can compute the conditional distribution at time t from the conditional distribution at time t − 1, using z_t and u_{t−1}. This also proves that the conditional distribution over the state is an information state. Note that in order to solve the above recursion, we need the initial condition:

$$p_{X_0|Z_0,U_{-1}}(\cdot \mid z_0, u_{-1}) := p_{X_0}(\cdot). \tag{2.6}$$
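In the linear-Gaussian special case, the maps Ψ_t and Φ_t act on the mean and variance of a Gaussian conditional distribution and reduce to the prediction and correction steps of the Kalman filter. The sketch below assumes the scalar model x_t = a x_{t−1} + b u_{t−1} + ω, z_t = x_t + ν with noise variances q and r; this model and all names are assumptions introduced for illustration and are not part of the general derivation.

```python
def psi(belief, u, a=1.0, b=1.0, q=0.0):
    """Prediction step (role of Psi_t): push the belief through the process model."""
    mean, var = belief
    return (a * mean + b * u, a * a * var + q)


def phi(belief, z, r=1.0):
    """Correction step (role of Phi_t): condition the predicted belief on z_t."""
    mean, var = belief
    gain = var / (var + r)
    return (mean + gain * (z - mean), (1.0 - gain) * var)


def filter_history(p0, controls, observations):
    """Apply p_t = Phi_t(Psi_t(p_{t-1}, u_{t-1}), z_t), starting from the initial belief p0."""
    belief = p0
    for u, z in zip(controls, observations):
        belief = phi(psi(belief, u), z)
    return belief


# Belief is (mean, variance); one step with u_0 = 1 and z_1 = 3.
print(filter_history((0.0, 1.0), controls=[1.0], observations=[3.0]))  # (2.0, 0.5)
```

The loop makes the information-state property explicit: each new belief is computed from the previous belief, the last control, and the newest observation, and the full data history is never stored.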

2.2 Elements of the Stochastic Control Problem

Incremental cost function: Assuming that the time horizon is finite, K < ∞, c_t : X × U → R denotes the one-step or immediate cost, where c_t(x_t, u) is the cost incurred by executing action u at state x_t. Moreover, c_K(x_K) denotes the terminal cost.

Policy function: The feedback policy (planner or feedback control law) is a sequence of functions π = {π₀, π₁, · · · }, where π_t : Z^{t+1} → U specifies the action given the output (i.e., the observations). In a problem with perfect state measurements, the output of the system is a direct function of the state and, therefore, the policy is state-dependent. Thus, u_t = π_t(z_0:t), where π = {π₀, · · · , π_{K−1}} is a finite sequence (since K < ∞). A policy is feasible if u_t = π_t(z_0:t) ∈ U. We denote the space of feasible policies by Π.
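As an illustration of a feasible output-feedback policy, the sketch below maps the observation history z_0:t to a scalar control and clips it to an interval U = [u_min, u_max]. The proportional form of the law, the gain, and the bounds are hypothetical choices made only for this example.

```python
def make_policy(gain=0.5, u_min=-1.0, u_max=1.0):
    """Return a map pi_t : Z^{t+1} -> U for a scalar system (assumed form)."""
    def policy(z_history):
        u = -gain * z_history[-1]         # react to the most recent observation
        return max(u_min, min(u_max, u))  # clipping keeps u_t inside U (feasibility)
    return policy


pi = make_policy()
print(pi([0.3, 0.2]), pi([0.3, 4.0]))  # -0.1 -1.0
```

This particular law uses only the latest observation, but its signature accepts the whole history, so any other function of z_0:t could be substituted without changing the calling code.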

Cost associated with the policy: Let π ∈ Π, and let {x^π_t}, {u^π_t}, and {z^π_t} be the random processes associated with (and dependent on) that policy. We define the cost function J_π : X^{K+1} × U^K → R associated with π as:

$$J_\pi := \sum_{t=0}^{K-1} c_t(x^\pi_t, u^\pi_t) + c_K(x^\pi_K).$$

For notational simplicity, we denote the cost associated with the policy π by Σ_{t=0}^{K−1} c^π_t(x_t, u_t) + c^π_K(x_K). A proper choice of this cost function is an important aspect of the overall modeling of the problem.
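For a single realization of the state and control processes, J_π is simply a sum along the trajectory. The sketch below evaluates it for a quadratic stage and terminal cost; the quadratic form and the sample trajectory are assumptions for illustration.

```python
def trajectory_cost(xs, us, stage_cost, terminal_cost):
    """J = sum_{t=0}^{K-1} c_t(x_t, u_t) + c_K(x_K) for one realized trajectory."""
    assert len(xs) == len(us) + 1, "need states x_0..x_K and controls u_0..u_{K-1}"
    running = sum(stage_cost(t, x, u) for t, (x, u) in enumerate(zip(xs, us)))
    return running + terminal_cost(xs[-1])


def stage_cost(t, x, u):
    return x * x + u * u  # hypothetical quadratic cost, time-invariant here


def terminal_cost(x):
    return x * x  # hypothetical terminal cost


J = trajectory_cost([1.0, 0.5, 0.0], [-0.5, -0.5], stage_cost, terminal_cost)
print(J)  # 1.75
```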

Cost-to-go function: Due to the randomness of the processes {x^π_t} and {u^π_t}, J_π is a random variable. Therefore, we define the cost-to-go as the expected cost E[J_π], which is deterministic, with the expectation taken over all randomness. This expectation can be written as:

$$
\begin{aligned}
E[J^\pi(x_{0:K}, u_{0:K-1})]
&= E\Big[\sum_{t=0}^{K-1} c^\pi_t(x_t, u_t) + c^\pi_K(x_K)\Big] \\
&= E\Big[\sum_{t=0}^{K-1} E[c^\pi_t(x_t, u_t) \mid D_t] + E[c^\pi_K(x_K) \mid D_K]\Big] \\
&= E\Big[\sum_{t=0}^{K-1} \int c^\pi_t(x_t, u_t)\, p_t(x_t \mid z_{0:t}, u_{0:t-1})\, dx_t + \int c^\pi_K(x_K)\, p_K(x_K \mid z_{0:K}, u_{0:K-1})\, dx_K\Big] \\
&=: E\Big[\sum_{t=0}^{K-1} c^{\pi,p}_t(p_t, u_t) + c^{\pi,p}_K(p_K)\Big] \\
&=: E[J'^{\pi}(p_{0:K}, u_{0:K-1})],
\end{aligned}
$$

where c^{π,p}_t, c^{π,p}_K, and J′ are defined using the above equations with respect to the conditional distribution, and the last expectation is taken over all possible conditional distributions.
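On a finite state space, the integral defining c^{π,p}_t(p_t, u_t) is a weighted sum of the stage cost under the conditional distribution. A minimal sketch, assuming a made-up three-point state space and a quadratic cost (both hypothetical):

```python
def belief_stage_cost(p, states, u, stage_cost):
    """Expected stage cost under the belief p: sum over x of c(x, u) * p(x)."""
    return sum(stage_cost(x, u) * px for x, px in zip(states, p))


def stage_cost(x, u):
    return x * x + u * u  # hypothetical quadratic cost


states = [-1.0, 0.0, 1.0]
p = [0.25, 0.5, 0.25]  # a belief over the three states
print(belief_stage_cost(p, states, 1.0, stage_cost))  # 1.5
```

The result depends on the state only through p, which is why the cost-to-go can be rewritten as a function of the conditional distributions p_0:K and the controls.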

Problem ingredients: The stochastic control problem can be represented by the tuple {X, U, Z, p₀, T_t, Ω_t, c_t, K}.

Problem 1 General stochastic control problem The objective in our stochastic

In document A Decoupling Principle for Simultaneous Localization and Planning Under Uncertainty in Multi-Agent Dynamic Environments (Page 37-43)