The Backward Recursion and the Two-Filter Formula


5.2.5 The Backward Recursion and the Two-Filter Formula

Notice that up to now, we have not considered the backward functions $\beta_{k|n}$ in the case of Gaussian linear state-space models. In particular, and although the details of both approaches differ, the smoothing recursions discussed in Sections 5.2.1 and 5.2.4 are clearly related to the general principle of backward Markovian smoothing discussed in Section 3.3.2 and do not rely on the forward-backward decomposition discussed in Section 3.2.

A first terminological remark is that although major sources on Gaussian linear models never mention the forward-backward decomposition, it is indeed known under the name of two-filter formula (Fraser and Potter, 1969; Kitagawa, 1996; Kailath et al., 2000, Section 10.4). A problem however is that, as noted in Chapter 3, the backward function $\beta_{k|n}$ is not directly interpretable as a probability distribution (recall for instance that the initialization of the backward recursion is $\beta_{n|n}(x) = 1$ for all $x \in X$). A first approach consists in introducing some additional assumptions on the model that ensure that $\beta_{k|n}(x)$, suitably normalized, can indeed be interpreted as a probability density function. The backward recursion can then be interpreted as the Kalman prediction algorithm, applied backwards in time, starting from the end of the data record (Kailath et al., 2000, Section 10.4).

A different option, originally due to Mayne (1966) and Fraser and Potter (1969), consists in deriving the backward recursion using a reparameterization of the backward functions $\beta_{k|n}$, which is robust to the fact that $\beta_{k|n}(x)$ may not be integrable over $X$. This solution has the advantage of being generic in that it does not require any additional assumptions on the model, other than $S_k S_k^t$ being invertible. The drawback is that we cannot simply invoke a variant of Algorithm 5.2.3 but need to derive a specific form of the backward recursion using a different parameterization. This implementation of the backward recursion (which could also be used, with some minor modifications, for usual forward prediction) is referred to as the information form of the Kalman filtering and prediction recursions (Anderson and Moore, 1979, Section 6.3; Kailath et al., 2000, Section 9.5.2). In the time series literature, this method is also sometimes used as a tool to compute the smoothed estimates when using so-called diffuse priors (usually for $X_0$), which correspond to the notion of improper flat distributions to be discussed below.

5.2.5.1 The Information Parameterization

The main ingredient of what follows consists in revisiting the calculation of the posterior distribution of the unobserved component X in the basic Gaussian linear model

Y = BX + V .

Indeed, in order to prove Proposition 5.2.2, we could have followed a very different route: assuming that both $\Sigma_V$ and $\operatorname{Cov}(Y) = B \Sigma_X B^t + \Sigma_V$ are full rank matrices, the posterior probability density function of $X$ given $Y$, which we denote by $p(x|y)$, is known by Bayes' rule to be proportional to the product of the prior $p(x)$ on $X$ and the conditional probability density function $p(y|x)$ of $Y$ given $X$, that is,

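$$ p(x|y) \propto \exp\left\{ -\tfrac12 \left[ (y - Bx)^t \Sigma_V^{-1} (y - Bx) + (x - \mu_X)^t \Sigma_X^{-1} (x - \mu_X) \right] \right\} , \tag{5.101} $$

where the symbol $\propto$ indicates proportionality up to a constant that does not depend on the variable $x$. Note that this normalizing constant could easily be determined in the current case because we know that $p(x|y)$ corresponds to a multivariate Gaussian probability density function. Hence, to fully determine $p(x|y)$, we just need to rewrite (5.101) as a quadratic form in $x$:

$$ p(x|y) \propto \exp\left\{ -\tfrac12 \left[ x^t (B^t \Sigma_V^{-1} B + \Sigma_X^{-1}) x - x^t (B^t \Sigma_V^{-1} y + \Sigma_X^{-1} \mu_X) - (B^t \Sigma_V^{-1} y + \Sigma_X^{-1} \mu_X)^t x \right] \right\} , \tag{5.102} $$

that is,

$$ p(x|y) \propto \exp\left\{ -\tfrac12 (x - \mu_{X|Y})^t \Sigma_{X|Y}^{-1} (x - \mu_{X|Y}) \right\} , \tag{5.103} $$

where

$$ \mu_{X|Y} = \Sigma_{X|Y} \left( B^t \Sigma_V^{-1} y + \Sigma_X^{-1} \mu_X \right) , \tag{5.104} $$

$$ \Sigma_{X|Y} = \left( B^t \Sigma_V^{-1} B + \Sigma_X^{-1} \right)^{-1} . \tag{5.105} $$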

Note that in going from (5.102) to (5.103), we have used once again the fact that $p(x|y)$ only needs to be determined up to a normalization factor, whence terms that do not depend on $x$ can safely be ignored.

As a first consequence, (5.105) and (5.104) are alternate forms of equations (5.17) and (5.16), respectively, which we first met in Proposition 5.2.2. The fact that (5.17) and (5.105) coincide is a well-known result from matrix theory known as the matrix inversion lemma that we could have invoked directly to obtain (5.104) and (5.105) from Proposition 5.2.2. This simple rewriting of the conditional mean and covariance in the Gaussian linear model is however not the only lesson that can be learned from (5.104) and (5.105). In particular, a very natural parameterization of the Gaussian distribution in this context consists in considering the inverse of the covariance matrix $\Pi = \Sigma^{-1}$ and the vector $\kappa = \Pi\mu$ rather than the covariance $\Sigma$ and the mean vector $\mu$. Both of these parameterizations are of course fully equivalent when the covariance matrix $\Sigma$ is invertible. In some contexts, the inverse covariance matrix $\Pi$ is referred to as the precision matrix, but in the filtering context the use of this parameterization is generally associated with the word information (in reference to the fact that in a Gaussian experiment, the inverse of the covariance matrix is precisely the Fisher information matrix associated with the estimation of the mean). We shall adopt this terminology and refer to the use of $\kappa$ and $\Pi$ as parameters of the Gaussian distribution as the information parameterization. Note that because a Gaussian probability density function $p(x)$ with mean $\mu$ and covariance $\Sigma$ may be written

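$$ p(x) \propto \exp\left\{ -\tfrac12 \left[ x^t \Sigma^{-1} x - 2 x^t \Sigma^{-1} \mu \right] \right\} = \exp\left\{ -\tfrac12 \left[ \operatorname{trace}\left( x x^t \Sigma^{-1} \right) - 2 x^t \Sigma^{-1} \mu \right] \right\} , $$

$\Pi = \Sigma^{-1}$ and $\kappa = \Pi\mu$ also form the natural parameterization of the multivariate normal, considered as a member of the exponential family of distributions (Lehmann and Casella, 1998).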
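The agreement between the information-form expressions (5.104)–(5.105) and the covariance-form expressions can be checked numerically. The sketch below is only an illustration: the dimensions and random matrices are arbitrary, and the covariance-form formulas used for comparison are the usual gain-based ones, which we assume to be what (5.16)–(5.17) state.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, dy = 3, 2

# Hypothetical model quantities, chosen only for illustration.
B = rng.standard_normal((dy, dx))
mu_x = rng.standard_normal(dx)
L_x = rng.standard_normal((dx, dx))
Sigma_x = L_x @ L_x.T + np.eye(dx)        # positive definite prior covariance
L_v = rng.standard_normal((dy, dy))
Sigma_v = L_v @ L_v.T + np.eye(dy)        # positive definite noise covariance
y = rng.standard_normal(dy)

# Information form, Eqs. (5.104)-(5.105).
Pi_x = np.linalg.inv(Sigma_x)
Pi_v = np.linalg.inv(Sigma_v)
Sigma_post_info = np.linalg.inv(B.T @ Pi_v @ B + Pi_x)
mu_post_info = Sigma_post_info @ (B.T @ Pi_v @ y + Pi_x @ mu_x)

# Covariance form (assumed to match (5.16)-(5.17)).
gain = Sigma_x @ B.T @ np.linalg.inv(B @ Sigma_x @ B.T + Sigma_v)
mu_post_cov = mu_x + gain @ (y - B @ mu_x)
Sigma_post_cov = Sigma_x - gain @ B @ Sigma_x

print(np.allclose(mu_post_info, mu_post_cov))
print(np.allclose(Sigma_post_info, Sigma_post_cov))
```

The information form inverts a matrix of the dimension of $X$, whereas the covariance form inverts one of the dimension of $Y$; which of the two is cheaper depends on the problem.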

5.2.5.2 The Gaussian Linear Model (Again!)

We summarize our previous findings, Eqs. (5.104) and (5.105), in the form of the following alternative version of Proposition 5.2.2.

Proposition 5.2.18 (Conditioning in Information Parameterization). Let

Y = BX + V ,

where $X$ and $V$ are two independent Gaussian random vectors such that, in information parameterization, $\kappa_X = \operatorname{Cov}(X)^{-1} \operatorname{E}(X)$, $\Pi_X = \operatorname{Cov}(X)^{-1}$, $\Pi_V = \operatorname{Cov}(V)^{-1}$ and $\kappa_V = \operatorname{E}(V) = 0$, $B$ being a deterministic matrix. Then

$$ \kappa_{X|Y} = \kappa_X + B^t \Pi_V Y , \tag{5.106} $$

$$ \Pi_{X|Y} = \Pi_X + B^t \Pi_V B , \tag{5.107} $$

where $\kappa_{X|Y} = \operatorname{Cov}(X|Y)^{-1} \operatorname{E}(X|Y)$ and $\Pi_{X|Y} = \operatorname{Cov}(X|Y)^{-1}$.

If the matrices $\Pi_X$, $\Pi_V$, or $\Pi_{X|Y}$ are not full rank matrices, (5.106) and (5.107) can still be interpreted in a consistent way using the concept of improper (flat) distributions.
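As an illustration of (5.106)–(5.107), the following sketch implements the update as a small function. The function name and the numerical values are hypothetical. Because both updates are additive, two observations with independent noises can be processed one after the other or stacked into a single observation, with the same result.

```python
import numpy as np

def condition_information(kappa_x, Pi_x, B, Pi_v, y):
    """Return (kappa_post, Pi_post) following (5.106)-(5.107)."""
    kappa_post = kappa_x + B.T @ Pi_v @ y
    Pi_post = Pi_x + B.T @ Pi_v @ B
    return kappa_post, Pi_post

# Prior on a bivariate X with mean (1, 0) and identity covariance,
# so that kappa_X = (1, 0) and Pi_X = I.
kappa, Pi = np.array([1.0, 0.0]), np.eye(2)

# Two scalar observations with independent noises of variance 0.25.
B1, B2 = np.array([[1.0, 0.0]]), np.array([[1.0, 1.0]])
Pi_v = np.array([[4.0]])
y1, y2 = np.array([0.8]), np.array([1.5])

# Sequential processing: the posterior after y1 is the prior for y2.
kappa, Pi = condition_information(kappa, Pi, B1, Pi_v, y1)
kappa, Pi = condition_information(kappa, Pi, B2, Pi_v, y2)

# Mean and covariance are recovered only at the end, if needed.
mean = np.linalg.solve(Pi, kappa)
cov = np.linalg.inv(Pi)
```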

Equations (5.106) and (5.107) deserve no special comment as they just correspond to a restatement of (5.104) and (5.105), respectively. The last sentence of Proposition 5.2.18 is a new element, however. To understand the point, consider (5.101) again and imagine what would happen if p(x), for instance, was assumed to be constant. Then (5.102) would reduce to

$$ p(x|y) \propto \exp\left\{ -\tfrac12 \left[ x^t (B^t \Sigma_V^{-1} B) x - x^t (B^t \Sigma_V^{-1} y) - (B^t \Sigma_V^{-1} y)^t x \right] \right\} , \tag{5.108} $$

which corresponds to a perfectly valid Gaussian distribution, when viewed as a function of $x$, at least when $B^t \Sigma_V^{-1} B$ has full rank. The only restriction is that there is of course no valid probability density function $p(x)$ that is constant on $X$. This practice is however well established in Bayesian estimation (to be discussed in Section 13.1.1), where such a choice of $p(x)$ is referred to as using an improper flat prior. The interpretation of (5.108) is then that under an (improper) flat prior on $X$, the posterior mean of $X$ given $Y$ is

$$ \left( B^t \Sigma_V^{-1} B \right)^{-1} B^t \Sigma_V^{-1} Y , \tag{5.109} $$

which is easily recognized as the (deterministic) optimally weighted least-squares estimate of $x$ in the linear regression model $Y = Bx + V$. The important message here is that (5.109) can be obtained directly from (5.106) and (5.107) by assuming that $\Pi_X$ is the null matrix and $\kappa_X$ the null vector. Hence Proposition 5.2.18 also covers the case where $X$ has an improper flat distribution, which is handled simply by setting the precision matrix $\Pi_X$ and the vector $\kappa_X$ equal to 0. A more complicated situation is illustrated by the following example.

Example 5.2.19. Assume that the linear model is such that X is bivariate Gaussian and the observation Y is scalar with

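$$ B = \begin{pmatrix} 1 & 0 \end{pmatrix} \quad \text{and} \quad \operatorname{Cov}(V) = \sigma^2 . $$

Proposition 5.2.18 asserts that the posterior parameters are then given by

$$ \kappa_{X|Y} = \kappa_X + \begin{pmatrix} \sigma^{-2} Y \\ 0 \end{pmatrix} , \tag{5.110} $$

$$ \Pi_{X|Y} = \Pi_X + \begin{pmatrix} \sigma^{-2} & 0 \\ 0 & 0 \end{pmatrix} . \tag{5.111} $$

In particular, if the prior on $X$ is improper flat, then (5.110) and (5.111) simply mean that the posterior distribution of the first component of $X$ given $Y$ is Gaussian with mean $Y$ and variance $\sigma^2$, whereas the posterior on the second component is also improper flat.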

In the above example, what is remarkable is not the result itself, which is obvious, but the fact that it can be obtained by application of a single set of formulas that are valid irrespective of the fact that some distributions are improper. In more general situations, directions that are in the null space of $\Pi_{X|Y}$ form a subspace where the resulting posterior is improper flat, whereas the posterior distribution of $X$ projected on the image of $\Pi_{X|Y}$ is a valid Gaussian distribution.
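The two situations discussed above can be reproduced numerically. In the sketch below (values are hypothetical), the first part treats Example 5.2.19 with an improper flat prior and separates the image of $\Pi_{X|Y}$ from its null space through an eigendecomposition, which is one possible way of doing so. The second part uses a matrix $B$ of full column rank, for which the flat-prior posterior mean is the weighted least-squares estimate (5.109).

```python
import numpy as np

# Part 1: Example 5.2.19 with an improper flat prior on X.
sigma2 = 0.5
B = np.array([[1.0, 0.0]])
Pi_v = np.array([[1.0 / sigma2]])
y = np.array([2.0])

kappa_x, Pi_x = np.zeros(2), np.zeros((2, 2))   # flat prior: null parameters
kappa_post = kappa_x + B.T @ Pi_v @ y           # (5.110)
Pi_post = Pi_x + B.T @ Pi_v @ B                 # (5.111)

# Pi_post is singular: split the space into its image and its null space.
eigval, eigvec = np.linalg.eigh(Pi_post)
proper = eigval > 1e-12
U = eigvec[:, proper]                           # basis of the image of Pi_post

# Proper Gaussian part, expressed in the coordinates U.T @ x.
cov_proper = np.linalg.inv(U.T @ Pi_post @ U)
mean_proper = cov_proper @ (U.T @ kappa_post)
mean_in_image = U @ mean_proper                 # mapped back to the original axes

# Part 2: flat prior and B of full column rank give weighted least squares.
B2 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
Pi_v2 = np.diag([1.0, 4.0, 2.0])
y2 = np.array([0.9, 2.1, 1.8])
wls = np.linalg.solve(B2.T @ Pi_v2 @ B2, B2.T @ Pi_v2 @ y2)   # (5.109)
```

In the first part, the proper component has mean `y` and variance `sigma2` along the first coordinate, and the second coordinate remains flat.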

The information parameterization is ambivalent because it can be used to represent both a Gaussian prior density function, as in Proposition 5.2.18, and an observed likelihood. There is nothing magic here but simply the observation that once we (i) allow for improper distributions and (ii) omit the normalization factors, Gaussian priors and likelihoods are equivalent. The following lemma is a complement to Proposition 5.2.18, which will be needed below.

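Lemma 5.2.20. Up to terms that do not depend on $x$,

$$ \int \exp\left\{ -\tfrac12 (y - Bx)^t \Sigma^{-1} (y - Bx) \right\} \exp\left\{ -\tfrac12 \left[ y^t \Pi y - 2 y^t \kappa \right] \right\} dy \propto \exp\left\{ -\tfrac12 \left[ x^t B^t (I + \Pi\Sigma)^{-1} \Pi B x - 2 x^t B^t (I + \Pi\Sigma)^{-1} \kappa \right] \right\} , \tag{5.112} $$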

where I denotes the identity matrix of suitable dimension.

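Proof. The left-hand side of (5.112), which we denote by $p(x)$, may be rewritten as

$$ p(x) = \exp\left\{ -\tfrac12 x^t B^t \Sigma^{-1} B x \right\} \times \int \exp\left\{ -\tfrac12 \left[ y^t (\Pi + \Sigma^{-1}) y - 2 y^t (\kappa + \Sigma^{-1} B x) \right] \right\} dy . \tag{5.113} $$

Completing the square, the bracketed term in the integrand of (5.113) may be written

$$ \begin{aligned} & \left[ y - (\Pi + \Sigma^{-1})^{-1} (\kappa + \Sigma^{-1} B x) \right]^t (\Pi + \Sigma^{-1}) \\ & \quad \times \left[ y - (\Pi + \Sigma^{-1})^{-1} (\kappa + \Sigma^{-1} B x) \right] \\ & \quad - (\kappa + \Sigma^{-1} B x)^t (\Pi + \Sigma^{-1})^{-1} (\kappa + \Sigma^{-1} B x) . \end{aligned} \tag{5.114} $$

The exponential of $-1/2$ times the first two lines of (5.114) integrates to a constant (or, rather, a number not depending on $x$), as it is recognized as a Gaussian probability density function in $y$, up to a factor that does not depend on $x$. Thus

$$ p(x) \propto \exp\left\{ -\tfrac12 \left[ -2 x^t B^t \Sigma^{-1} (\Pi + \Sigma^{-1})^{-1} \kappa + x^t B^t \left( \Sigma^{-1} - \Sigma^{-1} (\Pi + \Sigma^{-1})^{-1} \Sigma^{-1} \right) B x \right] \right\} , \tag{5.115} $$

where terms that do not depend on $x$ have been dropped. Equation (5.112) follows from the equalities $\Sigma^{-1} (\Pi + \Sigma^{-1})^{-1} = (I + \Pi\Sigma)^{-1}$ and

$$ \Sigma^{-1} - \Sigma^{-1} (\Pi + \Sigma^{-1})^{-1} \Sigma^{-1} = \Sigma^{-1} (\Pi + \Sigma^{-1})^{-1} \left[ (\Pi + \Sigma^{-1}) - \Sigma^{-1} \right] = (I + \Pi\Sigma)^{-1} \Pi . $$

Note that the last identity is the matrix inversion lemma that we already met, as $(I + \Pi\Sigma)^{-1} \Pi = (\Pi^{-1} + \Sigma)^{-1}$. Using this last form is however not a good idea in general, as it obviously does not apply in cases where $\Pi$ is not invertible. □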
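Lemma 5.2.20 can be checked numerically in the scalar case, where the integral over $y$ is easy to approximate on a grid. The sketch below uses hypothetical values and a plain Riemann sum; it verifies that the ratio of the two sides of (5.112) does not depend on $x$. The last lines illustrate, on an assumed 2×2 example, that $(I + \Pi\Sigma)^{-1}\Pi$ remains computable when $\Pi$ is singular.

```python
import numpy as np

# Scalar illustration with hypothetical values.
B, Sigma, Pi, kappa = 0.7, 1.3, 0.4, -0.5

y = np.linspace(-60.0, 60.0, 240001)
dy = y[1] - y[0]

def lhs(x):
    """Left-hand side of (5.112), approximated by a Riemann sum over y."""
    integrand = np.exp(-0.5 * (y - B * x) ** 2 / Sigma
                       - 0.5 * (Pi * y ** 2 - 2.0 * y * kappa))
    return integrand.sum() * dy

def rhs(x):
    """Right-hand side of (5.112), without the proportionality constant."""
    c = 1.0 / (1.0 + Pi * Sigma)              # scalar (I + Pi Sigma)^{-1}
    return np.exp(-0.5 * (x * B * c * Pi * B * x - 2.0 * x * B * c * kappa))

xs = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
ratios = np.array([lhs(x) / rhs(x) for x in xs])   # should all be equal

# Matrix case with a singular Pi: no inverse of Pi is ever formed.
Pi_sing = np.diag([2.0, 0.0])
Sigma_m = np.array([[1.0, 0.3], [0.3, 2.0]])
M = np.linalg.solve(np.eye(2) + Pi_sing @ Sigma_m, Pi_sing)
```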

5.2.5.3 The Backward Recursion

The question now is, what is the link between our original problem, which consists in implementing the backward recursion in Gaussian linear state-space models, and the information parameterization discussed in the previous section? The connection is the fact that the backward functions defined by (3.16) do not correspond to probability measures. More precisely, $\beta_{k|n}(X_k)$ defined by (3.16) is the conditional density of the "future" observations $Y_{k+1}, \dots, Y_n$ given $X_k$. For Gaussian linear models, we know from Proposition 5.2.18 that this density is Gaussian and hence that $\beta_{k|n}(x)$ has the form of a Gaussian likelihood,

$$ p(y|x) \propto \exp\left\{ -\tfrac12 (y - Mx)^t \Sigma^{-1} (y - Mx) \right\} , $$

for some $M$ and $\Sigma$ given by (5.16) and (5.17). Proceeding as previously, this equation can be put in the same form as (5.108) (replacing $B$ and $\Sigma_V$ by $M$ and $\Sigma$, respectively). Hence, a possible interpretation of $\beta_{k|n}(x)$ is that it corresponds to the posterior distribution of $X_k$ given $Y_{k+1}, \dots, Y_n$ in the pseudo-model where $X_k$ is assumed to have an improper flat prior distribution. According to the previous discussion, $\beta_{k|n}(x)$ itself may not correspond to a valid Gaussian distribution unless one can guarantee that $M^t \Sigma^{-1} M$ is a full rank matrix. In particular, recall from Section 3.2.1 that the backward recursion is initialized by setting $\beta_{n|n}(x) = 1$, and hence $\beta_{n|n}$ never is a valid Gaussian distribution.

The route from now on is clear: in order to implement the backward recursion, one needs to define a set of information parameters corresponding to $\beta_{k|n}$ and derive (backward) recursions for these parameters based on Proposition 5.2.18. We will denote by $\kappa_{k|n}$ and $\Pi_{k|n}$ the information parameters (precision matrix times mean and precision matrix) corresponding to $\beta_{k|n}$ for $k = n$ down to 0 where, by definition, $\kappa_{n|n} = 0$ and $\Pi_{n|n} = 0$. It is important to keep in mind that $\kappa_{k|n}$ and $\Pi_{k|n}$ define the backward function $\beta_{k|n}$ only up to an unknown constant. The best we can hope to determine is

$$ \frac{\beta_{k|n}(x)}{\int \beta_{k|n}(x) \, dx} $$

by computing the Gaussian normalization factor in situations where $\Pi_{k|n}$ is a full rank matrix. But this normalization is not more legitimate or practical than other ones, and it is preferable to consider that $\beta_{k|n}$ will be determined up to a constant only. In most situations, this will be a minor concern, as formulas that take into account this possible lack of normalization, such as (3.21), are available.

Proposition 5.2.21 (Backward Information Recursion). Consider the Gaussian linear state-space model (5.11)–(5.12) and assume that $S_k S_k^t$ has full rank for all $k \geq 0$. The information parameters $\kappa_{k|n}$ and $\Pi_{k|n}$, which

Initialization: Set $\kappa_{n|n} = 0$ and $\Pi_{n|n} = 0$.

Proof. The initialization of Proposition 5.2.21 has already been discussed and we just need to check that (5.116)–(5.119) correspond to an implementation of the general backward recursion (Proposition 3.2.1).

We split this update in two parts and first consider computing

$$ \tilde\beta_{k+1|n}(x) \propto g_{k+1}(x) \, \beta_{k+1|n}(x) \tag{5.120} $$

from $\beta_{k+1|n}$. Equation (5.120) may be interpreted as the posterior distribution of $X$ in the pseudo-model in which $X$ has a (possibly improper) prior distribution $\beta_{k+1|n}$ (with information parameters $\kappa_{k+1|n}$ and $\Pi_{k+1|n}$) and

$$ Y = B_{k+1} X + S_{k+1} V $$

is observed, where $V$ is independent of $X$. Equations (5.116)–(5.117) thus correspond to the information parameterization of $\tilde\beta_{k+1|n}$ by application of Proposition 5.2.18.

From (3.19) we then have

$$ \beta_{k|n}(x) = \int Q_k(x, dx') \, \tilde\beta_{k+1|n}(x') , \tag{5.121} $$

where we use the notation $Q_k$ rather than $Q$ to emphasize that we are dealing with possibly non-homogeneous models. Given that $Q_k$ is a Gaussian transition density function corresponding to (5.12), (5.121) may be computed explicitly by application of Lemma 5.2.20, which gives (5.118) and (5.119). □

While carrying out the backward recursion according to Proposition 5.2.21, it is also possible to simultaneously compute the marginal smoothing distribution by use of (3.21).
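The two steps used in the proof can be sketched in code as follows. This is only a sketch under stated assumptions: the model is taken to be time-homogeneous, of the form $X_{k+1} = A X_k + R U_k$ and $Y_k = B X_k + S V_k$ with independent standard Gaussian $U_k$ and $V_k$, and the names `A`, `R` and `backward_information` are our own. The first step applies Proposition 5.2.18 to (5.120); the second applies Lemma 5.2.20 to (5.121) with $A$ and $R R^t$ in the roles of $B$ and $\Sigma$.

```python
import numpy as np

def backward_information(ys, A, R, B, S):
    """Backward information recursion for a time-homogeneous model.

    Assumed model (illustrative notation):
        X_{k+1} = A X_k + R U_k,    Y_k = B X_k + S V_k,
    with U_k, V_k independent standard Gaussian vectors.
    ys[k] is the observation at time k, for k = 0, ..., n.
    Returns lists kappa, Pi holding the parameters of beta_{k|n}.
    """
    n = len(ys) - 1
    d = A.shape[0]
    Pi_obs = np.linalg.inv(S @ S.T)           # requires S S^t to be invertible
    kappa = [None] * (n + 1)
    Pi = [None] * (n + 1)
    kappa[n], Pi[n] = np.zeros(d), np.zeros((d, d))   # beta_{n|n} = 1
    for k in range(n - 1, -1, -1):
        # Step 1, (5.120): combine beta_{k+1|n} with the likelihood of Y_{k+1}.
        kappa_t = kappa[k + 1] + B.T @ Pi_obs @ ys[k + 1]
        Pi_t = Pi[k + 1] + B.T @ Pi_obs @ B
        # Step 2, (5.121): integrate against the Gaussian transition density.
        G = np.linalg.inv(np.eye(d) + Pi_t @ R @ R.T)
        kappa[k] = A.T @ G @ kappa_t
        Pi[k] = A.T @ G @ Pi_t @ A
    return kappa, Pi

# Hypothetical two-dimensional state observed through its first coordinate.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
R = np.array([[0.5, 0.0], [0.1, 0.3]])
B = np.array([[1.0, 0.0]])
S = np.array([[0.7]])
rng = np.random.default_rng(1)
ys = [rng.standard_normal(1) for _ in range(4)]
kappa, Pi = backward_information(ys, A, R, B, S)
```

A simple sanity check is available at $k = n - 1$: given $X_{n-1} = x$, the observation $Y_n$ is Gaussian with mean $B A x$ and covariance $B R R^t B^t + S S^t$, so the parameters returned for that index should coincide with those of this one-step likelihood.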

Algorithm 5.2.22 (Forward-Backward Smoothing).

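Forward Recursion: Perform Kalman filtering according to Algorithm 5.2.13 and store the values of $\hat X_{k|k}$ and $\Sigma_{k|k}$.

Backward Recursion: Compute the backward recursion, obtaining for each $k$ the mean and covariance matrix of the smoothed estimate as

$$ \hat X_{k|n} = \hat X_{k|k} + \Sigma_{k|k} \left( I + \Pi_{k|n} \Sigma_{k|k} \right)^{-1} \left( \kappa_{k|n} - \Pi_{k|n} \hat X_{k|k} \right) , \tag{5.122} $$

$$ \Sigma_{k|n} = \Sigma_{k|k} - \Sigma_{k|k} \left( I + \Pi_{k|n} \Sigma_{k|k} \right)^{-1} \Pi_{k|n} \Sigma_{k|k} . \tag{5.123} $$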


Proof. These two equations can be obtained exactly as in the proof of Lemma 5.2.20, replacing $(y - Bx)^t \Sigma^{-1} (y - Bx)$ by $(x - \mu)^t \Sigma^{-1} (x - \mu)$ and applying the result with $\mu = \hat X_{k|k}$, $\Sigma = \Sigma_{k|k}$, $\kappa = \kappa_{k|n}$ and $\Pi = \Pi_{k|n}$. If $\Pi_{k|n}$ is invertible, (5.122) and (5.123) are easily recognized as the application of Proposition 5.2.2 with $B = I$, $\operatorname{Cov}(V) = \Pi_{k|n}^{-1}$, and an equivalent observed value of $Y = \Pi_{k|n}^{-1} \kappa_{k|n}$. □
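The combination step (5.122)–(5.123) is short enough to be written out directly. The sketch below uses a hypothetical function name and hypothetical values, with a singular backward precision matrix to stress that no inverse of $\Pi_{k|n}$ is required. Since the smoothing distribution is proportional to the product of the filtering density and the backward function, the result can be compared with the precision $\Sigma_{k|k}^{-1} + \Pi_{k|n}$ obtained by multiplying the two Gaussian forms.

```python
import numpy as np

def smooth_step(x_filt, Sigma_filt, kappa_b, Pi_b):
    """Combine filtering moments with backward information, (5.122)-(5.123)."""
    d = len(x_filt)
    G = np.linalg.inv(np.eye(d) + Pi_b @ Sigma_filt)
    x_smooth = x_filt + Sigma_filt @ G @ (kappa_b - Pi_b @ x_filt)
    Sigma_smooth = Sigma_filt - Sigma_filt @ G @ Pi_b @ Sigma_filt
    return x_smooth, Sigma_smooth

# Hypothetical filtering output at some index k.
x_filt = np.array([0.5, -1.0])
Sigma_filt = np.array([[1.0, 0.2], [0.2, 0.5]])

# Singular backward precision: the future observations are informative
# about the first coordinate only.
kappa_b = np.array([3.0, 0.0])
Pi_b = np.diag([2.0, 0.0])
x_s, Sigma_s = smooth_step(x_filt, Sigma_filt, kappa_b, Pi_b)

# At k = n the backward parameters are null and the filter output is unchanged.
x_n, Sigma_n = smooth_step(x_filt, Sigma_filt, np.zeros(2), np.zeros((2, 2)))
```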

Remark 5.2.23. In the original work by Mayne (1966), the backward information recursion is carried out on the parameters of $\tilde\beta_{k|n}$, as defined by (5.120), rather than on $\beta_{k|n}$. It is easily checked using (5.116)–(5.119) that, except for this difference of focus, Proposition 5.2.21 is equivalent to the Mayne (1966) formulas; see also Kailath et al. (2000, Section 10.4) on this point. Of course, in the work of Mayne (1966), $\tilde\beta_{k|n}$ has to be combined with the predictive distribution $\phi_{k|k-1}$ rather than with the filtering distribution $\phi_k$, as $\tilde\beta_{k|n}$ already incorporates the knowledge of the observation $Y_k$. Proposition 5.2.21 and Algorithm 5.2.22 are here stated in a form that is compatible with our general definition of the forward-backward decomposition in Section 3.2.

5.2.6 Application to Marginal Filtering and Smoothing in
