
2.2 Separation Principle and Optimal Centralized Solution

When the models are linear Gaussian and the target prior is Gaussian, the optimal target estimator is the Kalman filter (Appendix C.1). The target distribution remains Gaussian over time, and its covariance evolves according to the Riccati map. Importantly, the covariance is independent of the particular realization of the measurement sequence. As a result, the differential entropy (Appendix A) of the Gaussian target distribution conditioned on the Gaussian measurements equals the log-determinant of the target covariance matrix, up to a positive scale factor and an additive constant. In other words, the cost function in (2.1) is independent of the particular measurement realization and, consequently, open-loop planning does just as well as closed-loop planning. The following theorem formalizes this intuition.
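For reference, the entropy claim above rests on the standard closed form for the differential entropy of a Gaussian. The block below is a short worked equation, with $n$ denoting the target dimension and $\mu$, $\Sigma$ the conditional mean and covariance (symbols introduced here for illustration only).

```latex
% Differential entropy of y ~ N(mu, Sigma) in R^n:
h(y) = \tfrac{1}{2} \log\det\left(2\pi e\, \Sigma\right)
     = \tfrac{n}{2} \log(2\pi e) + \tfrac{1}{2} \log\det(\Sigma)
% The mean mu does not appear, and for the Kalman filter Sigma does not
% depend on the measured values, so neither does the entropy.
```

Since the first term is a constant, minimizing the entropy and minimizing $\log\det(\Sigma)$ select the same control sequences.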

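Theorem 2.1. If the distribution of $y_0$ is Gaussian with covariance $\Sigma_0 \succeq 0$, there exists an open-loop control sequence $\sigma \in \mathcal{U}^T$ that is optimal in (2.1). Moreover, (2.1) is equivalent to the deterministic optimal control problem:

$$
\begin{aligned}
\min_{\sigma \in \mathcal{U}^T} \;& J_T^{(n)}(\sigma) := \sum_{t=1}^{T} c_t(\Sigma_t, x_t) \\
\text{s.t.} \;& x_{t+1} = f(x_t, \sigma_t), && t = 0, \ldots, T-1, \\
& \Sigma_{t+1} = \rho^e_{t+1}\bigl(\rho^p_t(\Sigma_t, x_t, \sigma_t),\, x_{t+1}\bigr), && t = 0, \ldots, T-1,
\end{aligned}
\tag{2.3}
$$

where $c_T(\Sigma_T, x_T) := \log\det(\Sigma_T)$ and, for $t = 1, \ldots, T-1$:

$$
c_t(\Sigma_t, x_t) :=
\begin{cases}
0 & \text{for the terminal-stage-only cost in (2.1)} \\
\log\det(\Sigma_t) & \text{for the additive stage cost in (2.1)}
\end{cases}
$$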

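Further, $\rho^p_t(\Sigma, x, u)$ is the Kalman filter covariance prediction:

$$
\rho^p_t(\Sigma, x, u) := A_t(x, u)\,\Sigma\,A_t^T(x, u) + W_t(x, u) \tag{2.4}
$$

and $\rho^e_t(\Sigma, x)$ is the Kalman filter covariance update:

$$
\rho^e_t(\Sigma, x) := \Sigma - \Sigma H_t^T(x)\bigl(H_t(x)\,\Sigma\,H_t^T(x) + V_t(x)\bigr)^{-1} H_t(x)\,\Sigma = \bigl(\Sigma^{-1} + M_t(x)\bigr)^{-1} \tag{2.5}
$$

where $M_t(x) := H_t^T(x)\,V_t^{-1}(x)\,H_t(x)$ is called the sensor matrix.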
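The two expressions for the covariance update in (2.5) can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated matrices. The function names `rho_p`, `rho_e`, and `rho_e_information` and all numerical values are illustrative choices, not part of the original text.

```python
import numpy as np


def rho_p(Sigma, A, W):
    """Covariance prediction (2.4): A Sigma A^T + W."""
    return A @ Sigma @ A.T + W


def rho_e(Sigma, H, V):
    """Covariance update (2.5), written with the innovation covariance."""
    innovation_cov = H @ Sigma @ H.T + V
    return Sigma - Sigma @ H.T @ np.linalg.solve(innovation_cov, H @ Sigma)


def rho_e_information(Sigma, H, V):
    """Covariance update (2.5), information form (Sigma^{-1} + M)^{-1}."""
    M = H.T @ np.linalg.solve(V, H)  # sensor matrix H^T V^{-1} H
    return np.linalg.inv(np.linalg.inv(Sigma) + M)


# Arbitrary test matrices (assumptions made only for this check).
rng = np.random.default_rng(0)
G = rng.standard_normal((3, 3))
Sigma = G @ G.T + np.eye(3)   # positive definite prior covariance
A = rng.standard_normal((3, 3))
W = 0.1 * np.eye(3)
H = rng.standard_normal((2, 3))
V = 0.5 * np.eye(2)

predicted = rho_p(Sigma, A, W)
gain_form = rho_e(predicted, H, V)
info_form = rho_e_information(predicted, H, V)
forms_agree = np.allclose(gain_form, info_form)
```

Note that none of the three functions takes a measurement value as input, which is the property that makes open-loop planning possible. The information form requires an invertible covariance, whereas the first form does not.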

Notation: In the remainder, we suppress the dependence on $x$ and $u$ of $A_t$, $W_t$, $H_t$, $V_t$, $\rho^p_t(\Sigma)$, $\rho^e_t(\Sigma)$, and $M_t$ when it is clear from context, in order to simplify the notation. We also define $R_t(\Sigma) := I - K_t(\Sigma) H_t$ and the Kalman gain $K_t(\Sigma) := \Sigma H_t^T (H_t \Sigma H_t^T + V_t)^{-1}$.

Further, given a control sequence $\sigma := u_\tau, \ldots, u_{t-1} \in \mathcal{U}^{t-\tau}$ at time $\tau$, the corresponding

One computational bottleneck in problem (2.3) is the large dimension of the state $(x_t, \Sigma_t)$. The significance of the separation principle¹ (Thm. 2.1) is that the search space can be explored forward in time by building a set of reachable states, rather than by discretizing the high-dimensional space of all possible covariances as required by a closed-loop approach such as backward value iteration (Bertsekas 1995). As shown by Le Ny and Pappas (2009), the optimal (nonmyopic) open-loop control sequence $\sigma^*$ in (2.3) can be obtained via forward value iteration (FVI, Alg. 1). FVI constructs a search tree, $\mathcal{T}_t$, with nodes at stage $t \in [0, T]$ corresponding to the reachable states $(x_t, \Sigma_t, J_t)$. Each node has one outgoing edge per control in $\mathcal{U}$, leading to a node $(x_{t+1}, \Sigma_{t+1}, J_{t+1})$ obtained from the dynamics in (2.3). Unfortunately, FVI has exponential complexity, $O(|\mathcal{U}_1 \times \cdots \times \mathcal{U}_n|^T)$, in both the time horizon $T$ and the number of sensors $n$, since the number of nodes in $\mathcal{T}_t$ equals the number of sensor trajectories of length $t$.

Algorithm 1 Forward Value Iteration
1: $J_0 \leftarrow 0$, $S_0 \leftarrow \{(x_0, \Sigma_0, J_0)\}$, $S_t \leftarrow \emptyset$ for $t = 1, \ldots, T$
2: for $t = 1 : T$ do
3:   for all $(x, \Sigma, J) \in S_{t-1}$ do
4:     for all $u \in \mathcal{U}$ do
5:       $x_t \leftarrow f(x, u)$, $\Sigma_t \leftarrow \rho^e_t\bigl(\rho^p_{t-1}(\Sigma, x, u),\, x_t\bigr)$
6:       $J_t \leftarrow J + c_t(\Sigma_t, x_t)$
7:       $S_t \leftarrow S_t \cup \{(x_t, \Sigma_t, J_t)\}$
8: return $\min\{J \mid (x, \Sigma, J) \in S_T\}$
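Forward value iteration can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The callback `step`, the flag `additive`, and the toy model in `toy_step` (a scalar random-walk target observed by a sensor on a line whose noise is smallest at position 3) are all hypothetical and introduced here only for the example.

```python
import numpy as np


def forward_value_iteration(x0, Sigma0, controls, step, horizon, additive=False):
    """Exhaustively expand the FVI search tree and return its cheapest leaf.

    `step(x, Sigma, u)` is a hypothetical callback that returns the next
    sensor state and the covariance after one prediction and one update.
    """
    # A node stores (sensor state, covariance, accumulated cost, controls so far).
    level = [(x0, Sigma0, 0.0, ())]
    for t in range(1, horizon + 1):
        next_level = []
        for x, Sigma, cost, sequence in level:
            for u in controls:
                x_next, Sigma_next = step(x, Sigma, u)
                # Stage cost c_t: log det at the final stage, and at every
                # stage when the additive variant is requested.
                if additive or t == horizon:
                    stage_cost = np.linalg.slogdet(Sigma_next)[1]
                else:
                    stage_cost = 0.0
                next_level.append(
                    (x_next, Sigma_next, cost + stage_cost, sequence + (u,))
                )
        level = next_level
    best = min(level, key=lambda node: node[2])
    return best, len(level)


def toy_step(x, Sigma, u):
    """Illustrative model (an assumption, not from the text)."""
    x_next = x + u                              # sensor moves along a line
    predicted = Sigma + 0.1 * np.eye(1)         # A = 1, W = 0.1
    noise_var = 1.0 + (x_next - 3) ** 2         # V(x): most accurate at x = 3
    Sigma_next = np.linalg.inv(np.linalg.inv(predicted) + np.eye(1) / noise_var)
    return x_next, Sigma_next


best, num_leaves = forward_value_iteration(
    0, np.eye(1), (-1, 0, 1), toy_step, horizon=3
)
best_sequence = best[3]
```

On this toy model the final level holds $3^3 = 27$ nodes, and the cheapest leaf corresponds to moving the sensor straight toward position 3. The list of nodes grows by a factor of $|\mathcal{U}|$ per stage, which is the exponential growth discussed above.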

The other extreme is greedy open-loop planning, which discards all but the best (lowest-cost) node at level $t$ of the tree $\mathcal{T}_t$. The greedy policy $\sigma^g$ is

$$
\sigma^g_t \in \arg\min_{u \in \mathcal{U}} \log\det\Bigl(\rho^e_{t+1}\bigl(\rho^p_t(\Sigma_t, x_t, u),\, f(x_t, u)\bigr)\Bigr), \quad t \in [0, T-1] \tag{2.6}
$$

and has linear complexity in the length of the planning horizon, $O(|\mathcal{U}_1 \times \cdots \times \mathcal{U}_n|\,T)$, but no guarantees exist for its performance. Fig. 2.1 shows a graphical comparison between greedy and nonmyopic planning in the context of problem (2.3). A natural question is whether a solution can be found that does less work than FVI but still has suboptimality guarantees. Our idea is to approximate FVI by discarding some nodes from the search tree. Unlike the greedy approach, which discards everything except the currently best node, we should discard nodes more intelligently in order to retain some performance guarantees. Before we proceed, however, we verify that the greedy policy is indeed not optimal in general.

¹ Note that Thm. 2.1 guarantees the ability to plan open-loop, which in general is stronger than the separation principle. For example, linear quadratic Gaussian regulation satisfies the separation principle, but open-loop regulation cannot be achieved. Here, the cost function (2.3) is crucial in guaranteeing independence from the particular measurement realizations.

Figure 2.1: Forward value iteration (FVI) is a nonmyopic open-loop planning approach which constructs a search tree (right) with branching factor $|\mathcal{U}|$ and depth $T$. It is guaranteed to find the optimal control sequence $\sigma^*$ in (2.3), but its complexity is exponential in $T$ (and $n$). Greedy open-loop planning, on the other hand, keeps only the best node per stage (left) and, hence, has linear complexity in $T$ (and exponential in $n$) but provides no performance guarantees.

Figure 2.2: Example showing that greedy planning is worse than nonmyopic planning even for static independent targets and a planning horizon of $T = 2$. The control sequence chosen by greedy planning is indicated in red, while the optimal two-step sequences are shown in green.

Example 2.1. Consider the following single-sensor linear Gaussian active information acquisition problem with planning horizon $T = 2$:

$$
\begin{aligned}
\min_{\sigma \in \{1,2,3\}^2} \;& h(y_2 \mid z_{1:2}) \\
\text{s.t.} \;& x_{t+1} = \sigma_t, && t = 0, 1, \\
& y_{t+1} = y_t, \quad y_0 \sim \mathcal{N}(0, I_2), && t = 0, 1, \\
& z_t = H(x_t)\, y_t + v_t, \quad v_t \sim \mathcal{N}(0, V(x_t)), && t = 1, 2,
\end{aligned}
$$

where the sensor observation model is defined by the sensor matrix $M(x) := H^T(x)\, V^{-1}(x)\, H(x)$ as follows:

$$
M(1) = \begin{bmatrix} 0.45 & 0 \\ 0 & 0.45 \end{bmatrix}, \qquad
M(2) = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \qquad
M(3) = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}.
$$

Let $\Omega_t := \Sigma_t^{-1}$ be the target information matrix at time $t$. Due to the separation principle (Thm. 2.1), the problem is equivalent to:

$$
\begin{aligned}
\max_{\sigma \in \{1,2,3\}^2} \;& \log\det(\Omega_2) \\
\text{s.t.} \;& \Omega_{t+1} = \Omega_t + M(\sigma_t), \quad t = 0, 1.
\end{aligned}
$$

Fig. 2.2 shows the search tree of depth 2 needed to compute the optimal open-loop policy, together with the control inputs chosen by the greedy policy. The greedy policy commits to a suboptimal input at the first stage and hence achieves a worse reward than the optimal control sequence. It is noteworthy that greedy planning is not optimal even in such a simple setting with static independent targets. Interestingly, this example also shows that the optimal control policy is not unique.
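The arithmetic behind Example 2.1 is small enough to verify by brute force. The sketch below assumes NumPy and uses the sensor matrices and the prior $\Omega_0 = I_2$ stated in the example. The helper names `reward` and `greedy` are illustrative and not from the original text.

```python
import itertools

import numpy as np

# Sensor matrices M(x) from Example 2.1, keyed by the control value.
M = {
    1: np.diag([0.45, 0.45]),
    2: np.diag([1.0, 0.0]),
    3: np.diag([0.0, 1.0]),
}
OMEGA_0 = np.eye(2)  # prior information matrix, since y_0 ~ N(0, I_2)


def log_det(omega):
    """Log-determinant of a positive definite matrix."""
    return np.linalg.slogdet(omega)[1]


def reward(sequence):
    """log det of the final information matrix for a control sequence."""
    omega = OMEGA_0.copy()
    for u in sequence:
        omega = omega + M[u]
    return log_det(omega)


def greedy(horizon=2):
    """Pick, at each stage, the control with the best one-step reward."""
    omega, chosen = OMEGA_0.copy(), []
    for _ in range(horizon):
        u = max(M, key=lambda c: log_det(omega + M[c]))
        chosen.append(u)
        omega = omega + M[u]
    return tuple(chosen)


# Exhaustive search over all 3^2 = 9 two-step control sequences.
all_sequences = list(itertools.product(M, repeat=2))
best_value = max(reward(s) for s in all_sequences)
optimal = [s for s in all_sequences if np.isclose(reward(s), best_value)]
greedy_sequence = greedy()
```

Greedy selects control 1 twice, giving $\det(\Omega_2) = 1.9^2 = 3.61$, because $\det(1.45\, I_2) = 2.1025$ beats the determinant of 2 offered by controls 2 and 3 at the first stage. The sequences $(2, 3)$ and $(3, 2)$ both reach $\Omega_2 = 2 I_2$ with $\det(\Omega_2) = 4$, which confirms both the suboptimality of greedy planning and the non-uniqueness of the optimum.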