Filtering and Normalized Recursion

3.2 Forward-Backward

3.2.2 Filtering and Normalized Recursion

The forward and backward quantities αν,k and βk|n, as defined in previous

sections, are unnormalized in the sense that their scales are largely unknown. On the other hand, we know that αν,k(βk|n) is equal to Lν,n, the likelihood of

the observations up to index n under Pν.

The long-term behavior of the likelihood Lν,n, or rather its logarithm, is

a result known as the asymptotic equipartition property, or AEP (Cover and Thomas, 1991) in the information theoretic literature and as the Shannon- McMillan-Breiman theorem in the statistical literature. For HMMs, Proposi- tion 12.3.3 (Chapter 12) shows that under suitable mixing conditions on the underlying unobservable chain {Xk}k≥0, the AEP holds in that n−1log Lν,n

converges Pν-a.s. to a limit as n tends to infinity. The likelihood Lν,n will

thus either grow to infinity or shrink to zero, depending on the sign of the limit, exponentially fast in n. This has the practical implication that in all cases where the recursions of Proposition 3.2.1 are effectively computable (like in the case of finite state space to be discussed in Chapter 5), the dynamics of the numerical values needed to represent αν,k and βk|n is so large that it

62 3 Filtering and Smoothing Recursions

rapidly exceeds the available machine representation possibilities (even with high accuracy floating-point representations). The famous tutorial by Rabiner (1989) coined the term scaling to describe a practical solution to this problem. Interestingly, scaling also partly answers the question of the probabilistic interpretation of the forward and backward quantities.

Scaling as described by Rabiner (1989) amounts to normalizing αν,k and

βk|nby positive real numbers to keep the numeric values needed to represent

αν,kand βk|nwithin reasonable bounds. There are clearly a variety of options

available, especially if one replaces (3.14) by the equivalent auto-normalized form

φν,k|n(f ) = [αν,k(βk|n)]−1

αν,k(f βk|n) , (3.21)

assuming that αν,k(βk|n) is indeed finite and non-zero.

In our view, the most natural scaling scheme (developed below) consists in replacing the measure αν,k and the function βk|n by scaled versions ¯αν,k

and ¯βk|nof these quantities, satisfying both

(i) ¯αν,k(1) = 1, and

(ii) ¯αν,k( ¯βk|n) = 1.

Item (i) implies that the normalized forward measures ¯αν,k are probabil-

ity measures that have a probabilistic interpretation given below. Item (ii) implies that the normalized backward functions are such that φν,k|n(f ) =

R f (x) ¯βk|n(x) ¯αν,k(dx) for all f ∈ Fb(X), without the need for a further

renormalization. We note that this scaling scheme differs slightly from the one described by Rabiner (1989). The reason for this difference, which only affects the scaling of the backward functions, is non-essential and will be discussed in Section 3.4.

To derive the probabilistic interpretation of ¯αν,k, observe that (3.14) and

Proposition 3.2.3, instantiated for the final index k = n, imply that the filtering distribution φν,n at index n (recall that φν,n is used as a simplified

notation for φν,n|n) may be written [αν,n(1)]−1αν,n. This finding is of course

not specific to the choice of the index n as already discussed when proving the second statement of Proposition 3.2.3. Thus, the normalized version ¯αν,k

of the forward measure αν,k coincides with the filtering distribution φν,k in-

troduced in Definition 3.1.3. This observation together with Proposition 3.2.3 implies that there is a unique choice of scaling scheme that satisfies the two requirements of the previous paragraph, as

Z f (x) φν,k|n(dx) = L−1ν,n Z f (x) αν,k(dx)βk|n(x) = Z f (x) L−1_ν,kαν,k(dx) | {z } ¯ αν,k(dx) L−1_ν,nLν,kβk|n(x) | {z } ¯ βk|n(x)

3.2 Forward-Backward 63

must hold for any f ∈ Fb(X). The following definition summarizes these

conclusions, using the notation φν,krather than ¯αν,k, as these two definitions

refer to the same object—the filtering distribution at index k.

Definition 3.2.4 (Normalized Forward-Backward Variables). For k ∈ {0, . . . , n}, the normalized forward measure ¯αν,k coincides with the filtering

distribution φν,k and satisfies

φν,k = [αν,k(1)]−1αν,k= L−1ν,kαν,k.

The normalized backward functions ¯βk|n are defined by

¯ βk|n= αν,k(1) αν,k(βk|n) βk|n= Lν,k Lν,n βk|n.

The above definition would be pointless if computing αν,kand βk|nwas in-

deed necessary to obtain the normalized variables φν,kand ¯βk|n. The following

result shows that this is not the case.

Proposition 3.2.5 (Normalized Forward-Backward Recursions). Forward Filtering Recursion The filtering measures may be obtained, for all f ∈ Fb(X), recursively for k = 1, . . . , n according to

cν,k= Z Z φν,k−1(dx)Q(x, dx0)gk(x0) , φν,k(f ) = c−1ν,k Z f (x) Z φν,k−1(dx)Q(x, dx0)gk(x0) , (3.22)

with initial condition

cν,0= Z g0(x)ν(dx) , φν,0(f ) = c−1ν,0 Z f (x)g0(x) ν(dx) .

Normalized Backward Recursion The normalized backward functions may be obtained, for all x ∈ X, by the recursion

βk|n(x) = c−1ν,k+1

Q(x, dx0)gk+1(x0) ¯βk+1|n(x0) (3.23)

operating on decreasing indices k = n − 1 down to 0; the initial condition is ¯

βn|n(x) = 1.

Once the two recursions above have been carried out, the smoothing distribution at any given index k ∈ {0, . . . , n} is available via

φν,k|n(f ) =

f (x) ¯βk|n(x)φν,k(dx) (3.24)

64 3 Filtering and Smoothing Recursions

Proof. Proceeding by forward induction for φν,k and backward induction for

βk|n, it is easily checked from (3.22) and (3.23) that

φν,k= k Y l=0 cν,l !−1 αν,k and β¯k|n= n Y l=k+1 cν,l !−1 βk|n. (3.25) Because φν,k is normalized, φν,k(1) def = 1 = k Y l=0 cν,l !−1 αν,k(1) .

Proposition 3.2.3 then implies that for any integer k, Lν,k=

l=0

cν,l. (3.26)

In other words, cν,0 = Lν,0 and for subsequent indices k ≥ 1, cν,k =

Lν,k/Lν,k−1. Hence (3.25) coincides with the normalized forward and back-

ward variables as specified by Definition 3.2.4. ut

We now pause to state a series of remarkable consequences of Proposi- tion 3.2.5.

Remark 3.2.6. The forward recursion in (3.22) may also be rewritten to highlight a two-step procedure involving both the predictive and filtering measures. Recall our convention that φν,0|−1 refers to the predictive distribution

of X0when no observation is available and is thus an alias for ν, the distribu-

tion of X0. For k ∈ {0, 1, . . . , n} and f ∈ Fb(X), (3.22) may be decomposed

cν,k= φν,k|k−1(gk) ,

φν,k(f ) = c−1ν,kφν,k|k−1(f gk) ,

φν,k+1|k = φν,kQ . (3.27)

The equivalence of (3.27) with (3.22) is straightforward and is a direct consequence of the remark that φk+1|k = φν,kQ, which follows from Proposi-

tion 3.1.4 in Section 3.1.2. In addition, each of the two steps in (3.27) has a very transparent interpretation.

Predictor to Filter : The first two equations in (3.27) may be summarized as φν,k(f ) ∝

f (x) g(x, Yk) φν,k|k−1(dx) , (3.28)

where the symbol ∝ means “up to a normalization constant” (such that φν,k(1) = 1) and the full notation g(x, Yk) is used in place of gk(x) to

highlight the dependence on the current observation Yk. Equation (3.28)

is recognized as Bayes’ rule applied to a very simple equivalent Bayesian pseudo-model in which

3.2 Forward-Backward 65

• Xk is distributed a priori according to the predictive distribution

φν,k|k−1,

• g is the conditional probability density function of Yk given Xk.

The filter φν,kis then interpreted as the posterior distribution of Xkgiven

Yk in this simple equivalent Bayesian pseudo-model.

Filter to Predictor : The last equation in (3.27) simply means that the updated predicting distribution φν,k+1|kis obtained by applying the transition ker-

nel Q to the current filtering distribution φν,k. We are thus left with the

very basic problem of determining the one-step distribution of a Markov chain given its initial distribution.

Remark 3.2.7. In many situations, using (3.27) to determine φν,k is indeed

the goal rather than simply a first step in computing smoothed distributions. In particular, for sequentially observed data, one may need to take actions based on the observations gathered so far. In such cases, filtering (or predic- tion) is the method of choice for inference about the unobserved states, a topic

that will be developed further in Chapter 7.

Remark 3.2.8. Another remarkable fact about the filtering recursion is that (3.26) together with (3.27) provides a method for evaluating the likelihood Lν,k of the observations up to index k recursively in the index k. In

addition, as cν,k = Lν,k/Lν,k−1 from (3.26), cν,k may be interpreted as the

conditional likelihood of Yk given the previous observations Y0:k−1. However,

as discussed at the beginning of Section 3.2.2, using (3.26) directly is gener- ally impracticable for numerical reasons. In order to avoid numerical under- or overflow, one can equivalently compute the log-likelihood `ν,k. Combin-

ing (3.26) and (3.27) gives the important formula

`ν,k def = log Lν,k= k X l=0 log φν,l|l−1(gl) , (3.29)

where φν,l|l−1 is the one-step predictive distribution computed according

to (3.27) (recalling that by convention, φν,0|−1is used as an alternative nota-

tion for ν).

Remark 3.2.9. The normalized backward function ¯βk|ndoes not have a sim-

ple probabilistic interpretation when isolated from the corresponding filtering measure. However, (3.24) shows that the marginal smoothing distribution, φν,k|n, is dominated by the corresponding filtering distribution φν,k and that

βk|nis by definition the Radon-Nikodym derivative of φν,k|n with respect to

φν,k, ¯ βk|n= dφν,k|n dφν,k As a consequence,

66 3 Filtering and Smoothing Recursions

inf_{M ∈ R : φ}ν,k({ ¯βk|n≥ M }) = 0 ≥ 1

and

sup_{M ∈ R : φ}ν,k({ ¯βk|n≤ M }) = 0 ≤ 1 ,

with the conventions inf ∅ = ∞ and sup ∅ = −∞. As a consequence, all values of ¯βk|n cannot get simultaneously large or close to zero as was the case for

βk|n, although one cannot exclude the possibility that ¯βk|nstill has important

dynamics without some further assumptions on the model. The normalizing factorQn

l=k+1cν,l= Lν,n/Lν,kby which ¯βk|ndiffers from

the corresponding unnormalized backward function βk|n may be interpreted

as the conditional likelihood of the future observations Yk+1:n given the ob-

servations up to index k, Y0:k.

In document (Ebook) Hidden Markov Models (Theory & Methods) Markov Chains Particle Filter Monte Carlo Hmm - Cappemoulinesryden (Page 77-82)