Although we will not go into a complete description of simulation methods in this book, the reader must be aware that recent developments of these methods have offered new opportunities for inference in complex models like hidden Markov models and their generalizations. For more in-depth coverage of these simulation methods and their implications see, for instance, the books by Chen and Shao (2000), Evans and Swartz (2000), Liu (2001), and Robert and Casella (2004).
162 6 Monte Carlo Methods
6.1.1 Monte Carlo Integration
Integration, in general, is most useful for computing probabilities and expectations. Of course, when given an expectation to compute, the first thing is to try to compute the integral analytically. When analytic evaluation is impossible, numerical integration is an option. However, especially when the dimension of the space is large, numerical integration can become numerically involved: the number of function evaluations required to achieve some degree of approximation increases exponentially in the dimension of the problem (this is often called the curse of dimensionality).
Thus it is useful to consider other methods for evaluating integrals. Fortunately, there are methods that do not suffer so directly from the curse of dimensionality, and Monte Carlo methods belong to this group. In particular, recall that, by the strong law of large numbers, if $\xi_1, \xi_2, \dots$ is a sequence of i.i.d. $\mathsf{X}$-valued random variables with common probability distribution $\pi$, then the estimator
$$\hat\pi_N^{\mathrm{MC}}(f) = N^{-1} \sum_{i=1}^{N} f(\xi_i)$$
converges almost surely to $\pi(f)$ for all $\pi$-integrable functions $f$. Obviously this Monte Carlo estimate of the expectation is not exact, but generating a sufficiently large number of random variables can render this approximation error arbitrarily small, in a suitable probabilistic sense. It is even possible to assess the size of this error. If
$$\pi(|f|^2) = \int |f(x)|^2 \, \pi(dx) < \infty ,$$
the central limit theorem shows that $\sqrt{N}\,\bigl[\hat\pi_N^{\mathrm{MC}}(f) - \pi(f)\bigr]$ has an asymptotic normal distribution, which can be used to construct asymptotic confidence regions for $\pi(f)$. For instance, if $f$ is real-valued, a confidence interval with asymptotic probability of coverage $\alpha$ is given by
$$\Bigl[\, \hat\pi_N^{\mathrm{MC}}(f) - c_\alpha N^{-1/2} \sigma_N(\pi, f),\ \hat\pi_N^{\mathrm{MC}}(f) + c_\alpha N^{-1/2} \sigma_N(\pi, f) \,\Bigr] , \tag{6.1}$$
where
$$\sigma_N^2(\pi, f) \stackrel{\mathrm{def}}{=} N^{-1} \sum_{i=1}^{N} \bigl[ f(\xi_i) - \hat\pi_N^{\mathrm{MC}}(f) \bigr]^2$$
and $c_\alpha$ is the $(1 + \alpha)/2$ quantile of the standard Gaussian distribution (for instance, $c_\alpha \approx 1.96$ for $\alpha = 0.95$). If generating a sequence of i.i.d. samples from $\pi$ is practicable, one can make the confidence interval as small as desired by increasing the sample size $N$. When compared to univariate numerical integration and quasi-Monte Carlo methods (Niederreiter, 1992), the convergence rate is not fast. In practical terms, (6.1) implies that an extra digit of accuracy on the approximation requires 100 times as many replications, and the rate $1/\sqrt{N}$ cannot be improved. On the other
6.1 Basic Monte Carlo Methods 163
hand, it is possible to derive methods to reduce the asymptotic variance of the Monte Carlo estimate by allowing a certain amount of dependence among the random variables $\xi_1, \xi_2, \dots$ Such methods include antithetic variables, control variates, stratified sampling, etc. These techniques are not discussed here (see for instance Robert and Casella, 2004, Chapter 4). A remarkable fact, however, is that the rate of convergence of $1/\sqrt{N}$ in (6.1) remains the same whatever the dimension of the space $\mathsf{X}$ is, which leaves some hope of effectively using the Monte Carlo approach in large-dimensional settings.
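As a small illustration of the estimator and of the interval (6.1), the following sketch estimates $\pi(f)$ for $f(x) = x^2$ under a standard Gaussian $\pi$, for which the exact value is 1. The function name `mc_estimate`, its arguments and the choice of test integrand are ours and serve only as an example; the quantile is taken as the $(1+\alpha)/2$ quantile of the standard Gaussian distribution, which is what a two-sided interval with coverage $\alpha$ requires.

```python
import math
import random
from statistics import NormalDist


def mc_estimate(f, sampler, n, coverage=0.95, rng=None):
    """Return (estimate, (lower, upper)) for pi(f) from n i.i.d. draws.

    `sampler(rng)` is assumed to return one draw from pi.
    """
    rng = rng or random.Random()
    values = [f(sampler(rng)) for _ in range(n)]
    estimate = sum(values) / n
    # Empirical variance with the 1/N normalisation used in (6.1).
    sigma2 = sum((v - estimate) ** 2 for v in values) / n
    # Two-sided Gaussian quantile for the requested coverage.
    c = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    half_width = c * math.sqrt(sigma2 / n)
    return estimate, (estimate - half_width, estimate + half_width)


rng = random.Random(0)
est, (lo, hi) = mc_estimate(
    lambda x: x * x,             # f(x) = x^2
    lambda r: r.gauss(0.0, 1.0),  # pi = N(0, 1), so pi(f) = 1
    10_000,
    rng=rng,
)
print(est, lo, hi)
```

Because the half-width scales as $N^{-1/2}$, dividing it by 10 in this sketch would require about 100 times as many draws.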
6.1.2 Monte Carlo Simulation for HMM State Inference
6.1.2.1 General Markovian Simulation Principle
We now turn to the specific task of simulating the unobserved sequence of states in a hidden Markov model, given some observations. The main result has already been discussed in Section 3.3: given some observations, the unobserved sequence of states constitutes a non-homogeneous Markov chain whose transition kernels may be evaluated, either from the backward functions for the forward chain (with indices increasing as usual) or from the forward measures (or equivalently filtering distributions) for the backward chain (with indices in reverse order). Schematically, both available options are rather straightforward to implement.
Backward Recursion/Forward Sampling: First compute (and store) the backward functions $\beta_{k|n}$ by backward recursion, for $k = n, n-1$ down to $0$ (Proposition 3.2.1). Then, simulate $X_{k+1}$ given $X_k$ from the forward transition kernels $\mathrm{F}_{k|n}$ specified in Definition 3.3.1.
Forward Recursion/Backward Sampling: First compute and store the forward measures $\alpha_{\nu,k}$ by forward recursion, according to Proposition 3.2.1. As an alternative, one may evaluate the normalized versions of the forward measures, which coincide with the filtering distributions $\phi_{\nu,k}$, following Proposition 3.2.5. Then $X_k$ is simulated conditionally on $X_{k+1}$ (starting from $X_n$) according to the backward transition kernel $\mathrm{B}_{\nu,k}$ defined by (3.38).
Despite its beautiful simplicity, the method above will obviously be of no help in cases where an exact implementation of the forward-backward recursion is not available.
6.1.2.2 Models with Finite State Space
In the case where the state space $\mathsf{X}$ is finite, the implementation of the forward-backward recursions is feasible and has been fully described in Section 5.1. The second method described above is a by-product of Algorithm 5.1.3.
Algorithm 6.1.1 (Markovian Backward Sampling). Given the stored values of $\phi_0, \dots, \phi_n$ computed by forward recursion according to Algorithm 5.1.1, do the following.
Final State: Simulate $X_n$ from $\phi_n$.
Backward Simulation: For $k = n-1$ down to $0$, compute the backward transition kernel according to (5.7) and simulate $X_k$ from $\mathrm{B}_k(X_{k+1}, \cdot)$.
The numerical complexity of this sampling algorithm is thus equivalent to that of Algorithm 5.1.3, whose computational cost depends most importantly on the cardinality $r$ of $\mathsf{X}$ and on the difficulty of evaluating the function $g(x, Y_k)$ for all $x \in \mathsf{X}$ and $k = 0, \dots, n$ (see Section 5.1). The backward simulation pass in Algorithm 6.1.1 is simpler than its smoothing counterpart in Algorithm 5.1.3, as one only needs to evaluate $\mathrm{B}_k(X_{k+1}, \cdot)$ for the simulated value of $X_{k+1}$ rather than $\mathrm{B}_k(i, j)$ for all $(i, j) \in \{1, \dots, r\}^2$.
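The forward filtering and backward sampling passes can be sketched as follows for a finite state space. This is only an illustration under stated assumptions: the names `forward_filter` and `backward_sample`, the toy transition matrix `Q` and the likelihood table `g` are ours, and the backward kernel is taken in the form $\mathrm{B}_k(j, i) \propto \phi_k(i)\, q(i, j)$, which is what we assume (5.7) to state since that equation is not reproduced in this section.

```python
import random


def forward_filter(nu, Q, g):
    """Filtering distributions phi_0, ..., phi_n.

    nu[i]   : initial distribution
    Q[i][j] : transition probability from state i to state j
    g[k][i] : likelihood g(i, Y_k) of observation k in state i
    """
    r = len(nu)
    phis = []
    pred = list(nu)
    for gk in g:
        unnorm = [pred[i] * gk[i] for i in range(r)]
        total = sum(unnorm)
        phi = [u / total for u in unnorm]
        phis.append(phi)
        # One-step prediction for the next time index.
        pred = [sum(phi[i] * Q[i][j] for i in range(r)) for j in range(r)]
    return phis


def backward_sample(phis, Q, rng):
    """Draw one state path X_0, ..., X_n given the stored filters."""
    r = len(Q)
    n = len(phis) - 1
    x = [0] * (n + 1)
    # Final state: X_n ~ phi_n.
    x[n] = rng.choices(range(r), weights=phis[n])[0]
    for k in range(n - 1, -1, -1):
        j = x[k + 1]
        # Only the row of the backward kernel for the simulated X_{k+1}
        # is needed; rng.choices normalises the weights itself.
        weights = [phis[k][i] * Q[i][j] for i in range(r)]
        x[k] = rng.choices(range(r), weights=weights)[0]
    return x


nu = [0.5, 0.5]
Q = [[0.9, 0.1], [0.2, 0.8]]
g = [[0.7, 0.1], [0.6, 0.3], [0.1, 0.8]]  # hypothetical likelihood values
phis = forward_filter(nu, Q, g)
path = backward_sample(phis, Q, random.Random(1))
print(path)
```

Repeated calls to `backward_sample` with the same stored `phis` give independent draws of the state sequence, so the forward pass needs to be run only once.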
6.1.2.3 Gaussian Linear State-Space Models
As discussed in Section 5.2, Rauch-Tung-Striebel smoothing (Algorithm 5.2.4) is the exact counterpart of Algorithm 5.1.3 in the case of Gaussian linear state-space models. Not surprisingly, to obtain the smoothing means and covariance matrices in Algorithm 5.2.4, we explicitly constructed the backward Gaussian transition density, whose mean and covariance are given by (5.23) and (5.24), respectively. We simply reformulate this observation in the form of an algorithm as follows.
Algorithm 6.1.2 (Gaussian Backward Markovian State Sampling). Assume that the filtering moments $\hat X_{k|k}$ and $\Sigma_{k|k}$ have been computed using Proposition 5.2.3. Then do the following.
Final State: Simulate $X_n \sim \mathrm{N}(\hat X_{n|n}, \Sigma_{n|n})$.
Backward Simulation: For $k = n-1$ down to $0$, simulate $X_k$ from a Gaussian distribution with mean and covariance matrix given by (5.23) and (5.24), respectively.
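To make the two passes concrete, here is a sketch for a scalar model, which we assume to be of the form $X_{k+1} = a X_k + U_k$, $Y_k = b X_k + V_k$ with $U_k \sim \mathrm{N}(0, q)$, $V_k \sim \mathrm{N}(0, r)$ and $X_0 \sim \mathrm{N}(0, p_0)$. The parameter names, the function names and the numerical values are hypothetical. The backward mean and variance used in the code are the standard Gaussian conditioning formulas, which we take to be the scalar analogue of (5.23) and (5.24) since those equations are not reproduced here.

```python
import math
import random


def kalman_filter(ys, a, b, q, r, p0):
    """Filtering means and variances for the assumed scalar model."""
    means, variances = [], []
    pred_mean, pred_var = 0.0, p0
    for y in ys:
        gain = pred_var * b / (b * b * pred_var + r)
        mean = pred_mean + gain * (y - b * pred_mean)
        var = (1.0 - gain * b) * pred_var
        means.append(mean)
        variances.append(var)
        # One-step prediction for the next observation.
        pred_mean, pred_var = a * mean, a * a * var + q
    return means, variances


def backward_sample_gaussian(means, variances, a, q, rng):
    """Draw X_0, ..., X_n from the joint smoothing distribution."""
    n = len(means) - 1
    x = [0.0] * (n + 1)
    # Final state: X_n ~ N(filtered mean, filtered variance).
    x[n] = rng.gauss(means[n], math.sqrt(variances[n]))
    for k in range(n - 1, -1, -1):
        # Backward gain, mean and variance of X_k given X_{k+1} and Y_{0:k}.
        j = variances[k] * a / (a * a * variances[k] + q)
        cond_mean = means[k] + j * (x[k + 1] - a * means[k])
        cond_var = max(variances[k] - j * a * variances[k], 0.0)
        x[k] = rng.gauss(cond_mean, math.sqrt(cond_var))
    return x


ys = [0.3, -1.2, 0.8, 2.0]  # hypothetical observations
means, variances = kalman_filter(ys, a=0.9, b=1.0, q=1.0, r=0.5, p0=1.0)
draw = backward_sample_gaussian(means, variances, a=0.9, q=1.0, rng=random.Random(2))
print(draw)
```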
The limitations discussed in the beginning of Section 5.2.2 concerning RTS smoothing (Algorithm 5.2.4) also apply here. In some models, Algorithm 6.1.2 is far from being computationally efficient (Frühwirth-Schnatter, 1994; Carter and Kohn, 1994). With these limitations in mind, De Jong and Shephard (1995) described a sampling algorithm inspired by disturbance (or Bryson-Frazier) smoothing (Algorithm 5.2.15) rather than by RTS smoothing. The method of De Jong and Shephard (1995) is very close to Algorithm 5.2.15 and proceeds by sampling the disturbance vectors $U_k$ backwards (for $k = n-1, \dots, 0$) and then the initial state $X_0$, from which the complete sequence $X_{0:n}$ may be obtained by repeated applications of the dynamic equation (5.11). Because the sequence of disturbance vectors $\{U_k\}_{k=n-1,\dots,0}$ does not, however, have a backward Markovian structure, the method of De Jong and Shephard (1995) is not a simple by-product of disturbance smoothing (as was the case
for Algorithms 5.2.4 and 6.1.2). Durbin and Koopman (2002) described an approach that is conceptually simpler and usually about as efficient as the disturbance sampling method of De Jong and Shephard (1995).
The basic remark is that if $X$ and $Y$ are jointly Gaussian variables, the conditional distribution of $X$ given $Y$ is Gaussian with mean vector $\operatorname{E}[X \mid Y]$ and covariance matrix $\operatorname{Cov}(X \mid Y)$, where $\operatorname{Cov}(X \mid Y)$ equals $\operatorname{Cov}(X - \operatorname{E}[X \mid Y])$ and, in addition, does not depend on $Y$ (Proposition 5.2.2). In particular, if $(X^*, Y^*)$ is another independent pair of Gaussian distributed random vectors with the same (joint) distribution, $X - \operatorname{E}[X \mid Y]$ and $X^* - \operatorname{E}[X^* \mid Y^*]$ are independent and both are $\mathrm{N}(0, \operatorname{Cov}(X \mid Y))$ distributed. In summary, to simulate $\xi$ from the distribution of $X$ given $Y$, one may
1. Simulate an independent pair of Gaussian variables $(X^*, Y^*)$ with the same distribution as $(X, Y)$ and compute $X^* - \operatorname{E}[X^* \mid Y^*]$;
2. Given $Y$, compute $\operatorname{E}[X \mid Y]$, and set
$$\xi = \operatorname{E}[X \mid Y] + X^* - \operatorname{E}[X^* \mid Y^*] .$$
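The two steps above can be tried on the simplest jointly Gaussian pair. In the sketch below we assume $X \sim \mathrm{N}(0, \sigma_X^2)$ and $Y = cX + \varepsilon$ with independent noise $\varepsilon \sim \mathrm{N}(0, \sigma_\varepsilon^2)$, so that $\operatorname{E}[X \mid Y] = \kappa Y$ with $\kappa = c\sigma_X^2 / (c^2\sigma_X^2 + \sigma_\varepsilon^2)$. The function name `conditional_draw` and all parameter names are hypothetical and chosen for this illustration only.

```python
import math
import random


def conditional_draw(y, c, var_x, var_noise, rng):
    """Draw from the law of X given Y = y without using Cov(X | Y)."""
    # Coefficient of the conditional expectation E[X | Y] = gain * Y.
    gain = c * var_x / (c * c * var_x + var_noise)
    # Step 1: simulate an independent copy (X*, Y*) from the joint law.
    x_star = rng.gauss(0.0, math.sqrt(var_x))
    y_star = c * x_star + rng.gauss(0.0, math.sqrt(var_noise))
    # Step 2: recentre the fictitious residual at E[X | Y = y].
    return gain * y + (x_star - gain * y_star)


rng = random.Random(0)
sample = [conditional_draw(2.0, 1.0, 1.0, 1.0, rng) for _ in range(5)]
print(sample)
```

With $c = \sigma_X^2 = \sigma_\varepsilon^2 = 1$ and $y = 2$, the target conditional distribution is $\mathrm{N}(1, 1/2)$, which the empirical mean and variance of a long run of draws should approach.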
This simulation approach only requires the ability to compute conditional expectations and to simulate from the prior joint distribution of X and Y . When applied to the particular case of Gaussian linear state-space models, this general principle yields the following algorithm.
Algorithm 6.1.3 (Sampling with Dual Smoothing). Given a Gaussian linear state-space model following (5.11)–(5.12) and observations $Y_0, \dots, Y_n$, do the following.
1. Simulate a fictitious independent sequence $\{X_k^*, Y_k^*\}_{k=0,\dots,n}$ of both states and observations using the model equations.
2. Compute $\{\hat X_{k|n}\}_{k=0,\dots,n}$ and $\{\hat X_{k|n}^*\}_{k=0,\dots,n}$ using Algorithm 5.2.15 for the two sequences $\{Y_k\}_{k=0,\dots,n}$ and $\{Y_k^*\}_{k=0,\dots,n}$.
Then $\{\hat X_{k|n} + X_k^* - \hat X_{k|n}^*\}_{k=0,\dots,n}$ is distributed according to the posterior distribution of the states given $Y_0, \dots, Y_n$.
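A scalar sketch of the algorithm is given below, again under the assumption that the model reads $X_{k+1} = a X_k + U_k$, $Y_k = b X_k + V_k$ with $U_k \sim \mathrm{N}(0, q)$, $V_k \sim \mathrm{N}(0, r)$ and $X_0 \sim \mathrm{N}(0, p_0)$. Note that, for brevity, the smoothed means are obtained here with the RTS mean recursion rather than with the disturbance smoother of Algorithm 5.2.15; both return the same $\hat X_{k|n}$, but the code is therefore not an implementation of Algorithm 5.2.15 itself. All function and parameter names are hypothetical.

```python
import math
import random


def smoothed_means(ys, a, b, q, r, p0):
    """Smoothed means E[X_k | Y_0:n] (Kalman filter, then RTS pass)."""
    filt_m, filt_p, pred_m, pred_p = [], [], [], []
    mp, pp = 0.0, p0
    for y in ys:
        gain = pp * b / (b * b * pp + r)
        m = mp + gain * (y - b * mp)
        p = (1.0 - gain * b) * pp
        pred_m.append(mp)
        pred_p.append(pp)
        filt_m.append(m)
        filt_p.append(p)
        mp, pp = a * m, a * a * p + q
    smooth = list(filt_m)
    for k in range(len(ys) - 2, -1, -1):
        j = filt_p[k] * a / pred_p[k + 1]
        smooth[k] = filt_m[k] + j * (smooth[k + 1] - pred_m[k + 1])
    return smooth


def simulate_model(n, a, b, q, r, p0, rng):
    """Simulate states X_0:n and observations Y_0:n from the model."""
    xs, ys = [], []
    x = rng.gauss(0.0, math.sqrt(p0))
    for _ in range(n + 1):
        xs.append(x)
        ys.append(b * x + rng.gauss(0.0, math.sqrt(r)))
        x = a * x + rng.gauss(0.0, math.sqrt(q))
    return xs, ys


def dual_smoothing_draw(ys, a, b, q, r, p0, rng):
    """One draw of X_0:n given Y_0:n by the dual smoothing construction."""
    # Step 1: fictitious states and observations.
    xs_star, ys_star = simulate_model(len(ys) - 1, a, b, q, r, p0, rng)
    # Step 2: smooth both the actual and the fictitious observations.
    sm = smoothed_means(ys, a, b, q, r, p0)
    sm_star = smoothed_means(ys_star, a, b, q, r, p0)
    return [s + x - t for s, x, t in zip(sm, xs_star, sm_star)]


ys = [0.3, -1.2, 0.8, 2.0]  # hypothetical observations
draw = dual_smoothing_draw(ys, 0.9, 1.0, 1.0, 0.5, 1.0, random.Random(4))
print(draw)
```

In this sketch the variance recursion is recomputed for each sequence; since these variances depend only on the model, an efficient version would compute them once and reuse them for both sequences and for all draws.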
Durbin and Koopman (2002) list a number of computational simplifications that are needed to make the above algorithm competitive with the disturbance sampling approach. As already noted in Remark 5.2.16, the backward recursion of Algorithm 5.2.15 may be greatly simplified when only the best linear estimates (and not their covariances) are to be computed. During the forward Kalman prediction recursion, it is also possible to save on computations by noting that all covariance matrices (state prediction error, innovation) will be common for the two sequences $\{Y_k\}$ and $\{Y_k^*\}$, as these matrices do not depend on the observations but only on the model. The same remark should be used when the purpose is not only to simulate one sequence but $N$ sequences of states conditional on the same observations, which will be the standard situation in a Monte Carlo approach. Further improvement can be
gained by carrying out simultaneously the simulation and Kalman prediction tasks, as both of them are implemented recursively (Durbin and Koopman, 2002).