3.2 Forward-Backward
3.2.2 Filtering and Normalized Recursion
The forward and backward quantities αν,k and βk|n, as defined in previous
sections, are unnormalized in the sense that their scales are largely unknown. On the other hand, we know that αν,k(βk|n) is equal to Lν,n, the likelihood of
the observations up to index n under Pν.
The long-term behavior of the likelihood Lν,n, or rather its logarithm, is
a result known as the asymptotic equipartition property, or AEP (Cover and Thomas, 1991) in the information theoretic literature and as the Shannon- McMillan-Breiman theorem in the statistical literature. For HMMs, Proposi- tion 12.3.3 (Chapter 12) shows that under suitable mixing conditions on the underlying unobservable chain {Xk}k≥0, the AEP holds in that n−1log Lν,n
converges Pν-a.s. to a limit as n tends to infinity. The likelihood Lν,n will
thus either grow to infinity or shrink to zero, depending on the sign of the limit, exponentially fast in n. This has the practical implication that in all cases where the recursions of Proposition 3.2.1 are effectively computable (like in the case of finite state space to be discussed in Chapter 5), the dynamics of the numerical values needed to represent αν,k and βk|n is so large that it
62 3 Filtering and Smoothing Recursions
rapidly exceeds the available machine representation possibilities (even with high accuracy floating-point representations). The famous tutorial by Rabiner (1989) coined the term scaling to describe a practical solution to this prob- lem. Interestingly, scaling also partly answers the question of the probabilistic interpretation of the forward and backward quantities.
Scaling as described by Rabiner (1989) amounts to normalizing αν,k and
βk|nby positive real numbers to keep the numeric values needed to represent
αν,kand βk|nwithin reasonable bounds. There are clearly a variety of options
available, especially if one replaces (3.14) by the equivalent auto-normalized form
φν,k|n(f ) = [αν,k(βk|n)]−1
Z
αν,k(f βk|n) , (3.21)
assuming that αν,k(βk|n) is indeed finite and non-zero.
In our view, the most natural scaling scheme (developed below) consists in replacing the measure αν,k and the function βk|n by scaled versions ¯αν,k
and ¯βk|nof these quantities, satisfying both
(i) ¯αν,k(1) = 1, and
(ii) ¯αν,k( ¯βk|n) = 1.
Item (i) implies that the normalized forward measures ¯αν,k are probabil-
ity measures that have a probabilistic interpretation given below. Item (ii) implies that the normalized backward functions are such that φν,k|n(f ) =
R f (x) ¯βk|n(x) ¯αν,k(dx) for all f ∈ Fb(X), without the need for a further
renormalization. We note that this scaling scheme differs slightly from the one described by Rabiner (1989). The reason for this difference, which only affects the scaling of the backward functions, is non-essential and will be dis- cussed in Section 3.4.
To derive the probabilistic interpretation of ¯αν,k, observe that (3.14) and
Proposition 3.2.3, instantiated for the final index k = n, imply that the fil- tering distribution φν,n at index n (recall that φν,n is used as a simplified
notation for φν,n|n) may be written [αν,n(1)]−1αν,n. This finding is of course
not specific to the choice of the index n as already discussed when proving the second statement of Proposition 3.2.3. Thus, the normalized version ¯αν,k
of the forward measure αν,k coincides with the filtering distribution φν,k in-
troduced in Definition 3.1.3. This observation together with Proposition 3.2.3 implies that there is a unique choice of scaling scheme that satisfies the two requirements of the previous paragraph, as
Z f (x) φν,k|n(dx) = L−1ν,n Z f (x) αν,k(dx)βk|n(x) = Z f (x) L−1ν,kαν,k(dx) | {z } ¯ αν,k(dx) L−1ν,nLν,kβk|n(x) | {z } ¯ βk|n(x)
3.2 Forward-Backward 63
must hold for any f ∈ Fb(X). The following definition summarizes these
conclusions, using the notation φν,krather than ¯αν,k, as these two definitions
refer to the same object—the filtering distribution at index k.
Definition 3.2.4 (Normalized Forward-Backward Variables). For k ∈ {0, . . . , n}, the normalized forward measure ¯αν,k coincides with the filtering
distribution φν,k and satisfies
φν,k = [αν,k(1)]−1αν,k= L−1ν,kαν,k.
The normalized backward functions ¯βk|n are defined by
¯ βk|n= αν,k(1) αν,k(βk|n) βk|n= Lν,k Lν,n βk|n.
The above definition would be pointless if computing αν,kand βk|nwas in-
deed necessary to obtain the normalized variables φν,kand ¯βk|n. The following
result shows that this is not the case.
Proposition 3.2.5 (Normalized Forward-Backward Recursions). Forward Filtering Recursion The filtering measures may be obtained, for all f ∈ Fb(X), recursively for k = 1, . . . , n according to
cν,k= Z Z φν,k−1(dx)Q(x, dx0)gk(x0) , φν,k(f ) = c−1ν,k Z f (x) Z φν,k−1(dx)Q(x, dx0)gk(x0) , (3.22)
with initial condition
cν,0= Z g0(x)ν(dx) , φν,0(f ) = c−1ν,0 Z f (x)g0(x) ν(dx) .
Normalized Backward Recursion The normalized backward functions may be obtained, for all x ∈ X, by the recursion
¯
βk|n(x) = c−1ν,k+1
Z
Q(x, dx0)gk+1(x0) ¯βk+1|n(x0) (3.23)
operating on decreasing indices k = n − 1 down to 0; the initial condition is ¯
βn|n(x) = 1.
Once the two recursions above have been carried out, the smoothing distri- bution at any given index k ∈ {0, . . . , n} is available via
φν,k|n(f ) =
Z
f (x) ¯βk|n(x)φν,k(dx) (3.24)
64 3 Filtering and Smoothing Recursions
Proof. Proceeding by forward induction for φν,k and backward induction for
βk|n, it is easily checked from (3.22) and (3.23) that
φν,k= k Y l=0 cν,l !−1 αν,k and β¯k|n= n Y l=k+1 cν,l !−1 βk|n. (3.25) Because φν,k is normalized, φν,k(1) def = 1 = k Y l=0 cν,l !−1 αν,k(1) .
Proposition 3.2.3 then implies that for any integer k, Lν,k=
k
Y
l=0
cν,l. (3.26)
In other words, cν,0 = Lν,0 and for subsequent indices k ≥ 1, cν,k =
Lν,k/Lν,k−1. Hence (3.25) coincides with the normalized forward and back-
ward variables as specified by Definition 3.2.4. ut
We now pause to state a series of remarkable consequences of Proposi- tion 3.2.5.
Remark 3.2.6. The forward recursion in (3.22) may also be rewritten to highlight a two-step procedure involving both the predictive and filtering mea- sures. Recall our convention that φν,0|−1 refers to the predictive distribution
of X0when no observation is available and is thus an alias for ν, the distribu-
tion of X0. For k ∈ {0, 1, . . . , n} and f ∈ Fb(X), (3.22) may be decomposed
as
cν,k= φν,k|k−1(gk) ,
φν,k(f ) = c−1ν,kφν,k|k−1(f gk) ,
φν,k+1|k = φν,kQ . (3.27)
The equivalence of (3.27) with (3.22) is straightforward and is a direct con- sequence of the remark that φk+1|k = φν,kQ, which follows from Proposi-
tion 3.1.4 in Section 3.1.2. In addition, each of the two steps in (3.27) has a very transparent interpretation.
Predictor to Filter : The first two equations in (3.27) may be summarized as φν,k(f ) ∝
Z
f (x) g(x, Yk) φν,k|k−1(dx) , (3.28)
where the symbol ∝ means “up to a normalization constant” (such that φν,k(1) = 1) and the full notation g(x, Yk) is used in place of gk(x) to
highlight the dependence on the current observation Yk. Equation (3.28)
is recognized as Bayes’ rule applied to a very simple equivalent Bayesian pseudo-model in which
3.2 Forward-Backward 65
• Xk is distributed a priori according to the predictive distribution
φν,k|k−1,
• g is the conditional probability density function of Yk given Xk.
The filter φν,kis then interpreted as the posterior distribution of Xkgiven
Yk in this simple equivalent Bayesian pseudo-model.
Filter to Predictor : The last equation in (3.27) simply means that the updated predicting distribution φν,k+1|kis obtained by applying the transition ker-
nel Q to the current filtering distribution φν,k. We are thus left with the
very basic problem of determining the one-step distribution of a Markov chain given its initial distribution.
Remark 3.2.7. In many situations, using (3.27) to determine φν,k is indeed
the goal rather than simply a first step in computing smoothed distributions. In particular, for sequentially observed data, one may need to take actions based on the observations gathered so far. In such cases, filtering (or predic- tion) is the method of choice for inference about the unobserved states, a topic
that will be developed further in Chapter 7.
Remark 3.2.8. Another remarkable fact about the filtering recursion is that (3.26) together with (3.27) provides a method for evaluating the like- lihood Lν,k of the observations up to index k recursively in the index k. In
addition, as cν,k = Lν,k/Lν,k−1 from (3.26), cν,k may be interpreted as the
conditional likelihood of Yk given the previous observations Y0:k−1. However,
as discussed at the beginning of Section 3.2.2, using (3.26) directly is gener- ally impracticable for numerical reasons. In order to avoid numerical under- or overflow, one can equivalently compute the log-likelihood `ν,k. Combin-
ing (3.26) and (3.27) gives the important formula
`ν,k def = log Lν,k= k X l=0 log φν,l|l−1(gl) , (3.29)
where φν,l|l−1 is the one-step predictive distribution computed according
to (3.27) (recalling that by convention, φν,0|−1is used as an alternative nota-
tion for ν).
Remark 3.2.9. The normalized backward function ¯βk|ndoes not have a sim-
ple probabilistic interpretation when isolated from the corresponding filtering measure. However, (3.24) shows that the marginal smoothing distribution, φν,k|n, is dominated by the corresponding filtering distribution φν,k and that
¯
βk|nis by definition the Radon-Nikodym derivative of φν,k|n with respect to
φν,k, ¯ βk|n= dφν,k|n dφν,k As a consequence,
66 3 Filtering and Smoothing Recursions
infM ∈ R : φν,k({ ¯βk|n≥ M }) = 0 ≥ 1
and
supM ∈ R : φν,k({ ¯βk|n≤ M }) = 0 ≤ 1 ,
with the conventions inf ∅ = ∞ and sup ∅ = −∞. As a consequence, all values of ¯βk|n cannot get simultaneously large or close to zero as was the case for
βk|n, although one cannot exclude the possibility that ¯βk|nstill has important
dynamics without some further assumptions on the model. The normalizing factorQn
l=k+1cν,l= Lν,n/Lν,kby which ¯βk|ndiffers from
the corresponding unnormalized backward function βk|n may be interpreted
as the conditional likelihood of the future observations Yk+1:n given the ob-
servations up to index k, Y0:k.