1.3 Examples
1.3.4 Conditionally Gaussian Linear State-Space Models
We gradually move toward more complicated models for which the state space $\mathsf{X}$ of the hidden chain is no longer finite. The previous example is, as we shall see in Chapter 5, a singular case because of the unique properties of the multivariate Gaussian distribution with respect to linear transformations. We now describe a related, although more complicated, situation in which the state $X_k$ is composed of two components $C_k$ and $W_k$, where the former is finite-valued whereas the latter is a continuous, possibly vector-valued, variable. The term “conditionally Gaussian linear state-space models”, or CGLSSMs for short, corresponds to structures in which the model, when conditioned on the finite-valued process $\{C_k\}_{k \ge 0}$, reduces to the form studied in the previous section.
Conditionally Gaussian linear state-space models belong to a class of models that we will refer to as hierarchical hidden Markov models, whose dependence structure is depicted in Figure 1.6. In such models the variable $C_k$, which is the highest in the hierarchy, influences both the transition from $W_{k-1}$ to $W_k$ and the observation $Y_k$. When $\{C_k\}$ takes its values in a finite set, it is also common to refer to such models as jump Markov models, where the jumps correspond to the instants $k$ at which the value of $C_k$ differs from that of $C_{k-1}$.
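[Diagram: three rows of nodes. Top row: $\cdots \to C_k \to C_{k+1} \to \cdots$; middle row: $\cdots \to W_k \to W_{k+1} \to \cdots$; bottom row: $Y_k$, $Y_{k+1}$. Arrows also run from each $C_k$ to $W_k$ and to $Y_k$, and from each $W_k$ to $Y_k$.]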
Fig. 1.6. Graphical representation of the dependence structure of a hierarchical HMM.
Such a model is still an HMM in the usual sense, as is seen by considering the composite state $X_k = (C_k, W_k)$. But for hierarchical HMMs in general and CGLSSMs in particular, it is often advantageous to consider the intermediate state sequence $\{W_k\}_{k \ge 0}$ as a nuisance parameter and to focus on the $\{C_k\}$ component that stands at the top of the hierarchy in Figure 1.6. To do so, one needs to integrate out the influence of $\{W_k\}$, conditioning on $\{C_k\}$ only. This principle can only be made effective in situations where the model belongs to a simple class (such as Gaussian linear state-space models) once conditioned on $\{C_k\}$. Below we give several simple examples that illustrate the potential of this important class of models.
Example 1.3.9 (Rayleigh-fading Channel). We will now follow up on Example 1.3.1 and again consider a model of interest in digital communication. The point is that for wireless transmissions it is possible, and desirable, to model the physical processes that cause errors during transmission more explicitly than in Example 1.3.1. As in Example 1.3.1, we shall assume that the signal to be transmitted forms an i.i.d. sequence of fair Bernoulli draws. Here the sequence is denoted by $\{C_k\}_{k \ge 0}$ and we assume that it takes its values in the set $\{-1, 1\}$ rather than in $\{0, 1\}$. This sequence is transmitted through a suitable modulation (Proakis, 1995) that is not of direct interest to us.
At the receiving side, the signal is first demodulated and the simplest model, known as the additive white Gaussian noise (AWGN) channel, postulates that the demodulated signal $\{Y_k\}_{k \ge 0}$ may be written

$$Y_k = h C_k + V_k , \tag{1.15}$$

where $h$ is a (real) channel gain, also known as a fading coefficient, and $\{V_k\}_{k \ge 0}$ is an i.i.d. sequence of Gaussian observation noise with zero mean and variance $\sigma^2$. For reasons that are inessential for the discussion that follows, the actual model features complex channel gain and noise (Proakis, 1995), a fact that we will ignore in the following.
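The AWGN model (1.15) is simple enough to simulate in a few lines. The following Python sketch draws fair bits in $\{-1, 1\}$, passes them through the channel, and decides each bit from the sign of $Y_k$, which is a natural symbol-by-symbol rule when the real gain $h$ is known. The function names and the numerical values of $h$ and $\sigma$ are illustrative assumptions and do not come from the text.

```python
import random

def simulate_awgn(n, h, sigma, rng):
    """Draw n fair bits in {-1, +1} and the outputs Y_k = h * C_k + V_k."""
    bits = [rng.choice((-1, 1)) for _ in range(n)]
    obs = [h * c + rng.gauss(0.0, sigma) for c in bits]
    return bits, obs

def decode_by_sign(obs, h):
    """Symbol-by-symbol decision: sign of Y_k, flipped if the gain is negative."""
    s = 1 if h >= 0 else -1
    return [s if y >= 0 else -s for y in obs]

# Illustrative values (assumed): h = 1 and sigma = 0.5.
rng = random.Random(0)
bits, obs = simulate_awgn(10_000, h=1.0, sigma=0.5, rng=rng)
decoded = decode_by_sign(obs, h=1.0)
error_rate = sum(b != d for b, d in zip(bits, decoded)) / len(bits)
print(f"empirical bit error rate: {error_rate:.4f}")
```

Under these assumed values a decision error occurs when the noise exceeds $h$ in the unfavorable direction, so the empirical error rate should be close to the Gaussian tail probability at $h/\sigma = 2$, roughly 0.023.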
The AWGN channel model ignores inter-symbol interference in the sense that under (1.15) the observations $\{Y_k\}$ are i.i.d. In many practical situations, it is necessary to account for channel memory to obtain a reasonable model of the received signal. Another issue is that, particularly in wireless communication, the physical characteristics of the propagation path or channel are continuously changing over time. As a result, the fading coefficient $h$ will typically not stay constant but vary with time. A very simple model consists in assuming that the fading coefficient follows a (complex) autoregressive model of order 1, giving the model

$$\begin{aligned} W_{k+1} &= \rho W_k + U_k , \\ Y_k &= W_k C_k + V_k , \end{aligned}$$

where the time-varying $h$ is denoted by $W_k$, and $\{U_k\}_{k \ge 0}$ is white Gaussian noise (an i.i.d. sequence of zero-mean Gaussian random variables). With this model, it is easily checked that if we assume that $W_0$ is a Gaussian random variable independent of both the observation noise $\{V_k\}$ and the state noise $\{U_k\}$, then $\{Y_k\}$ is the observation sequence corresponding to an HMM with hidden state $X_k = (C_k, W_k)$ (the emitted bit and the fading coefficient). This is a general state-space HMM, as $W_k$ is a real random variable. In this application, the aim is to estimate the sequence $\{C_k\}$ of bits, which is thus a component of the unobservable state sequence, given the observations $\{Y_k\}$. The fading coefficients $\{W_k\}$ are of no direct interest and constitute nuisance variables.
This model, however, has a unique feature among general state-space HMMs in that, conditionally on the sequence $\{C_k\}$ of bits, it reduces to a Gaussian linear state-space model with state variables $\{W_k\}$. The only difference from Section 1.3.3 is that the observation equation becomes non-homogeneous in time,

$$Y_k = W_k c_k + V_k ,$$

where $\{C_k = c_k\}$ is the event on which we are conditioning. As a striking consequence, we shall see in Chapters 4 and 5 that the distribution of $W_k$ given the observations $Y_0, Y_1, \ldots, Y_k$ is a mixture of $2^{k+1}$ Gaussian distributions, one for each possible bit sequence $(c_0, \ldots, c_k)$. Because this is clearly not a tractable form when $k$ is a two-digit number, the challenge consists in finding practical approaches to approximate the exact distributions.
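To make the $2^{k+1}$-component mixture concrete, the following Python sketch treats the real, scalar version of the fading model. For a fixed bit sequence it runs a standard scalar Kalman filter with the time-varying observation coefficient $c_k$; it then enumerates all bit sequences and weights each Gaussian component by the likelihood of the observations (the uniform prior on the bits cancels in the normalization). The function names, the noise variances, the prior on $W_0$ and the observations are all made-up assumptions for illustration; this is a brute-force sketch, not the approach developed in later chapters.

```python
import itertools
import math

def kalman_given_bits(ys, cs, rho, var_u, var_v, m0, p0):
    """Scalar Kalman filter for W_k given Y_0..Y_k on the event {C_j = c_j}.

    Returns the filtered mean and variance of W_k and the log-likelihood
    of the observations given the bit sequence.
    """
    m, p, loglik = m0, p0, 0.0
    for i, (y, c) in enumerate(zip(ys, cs)):
        if i > 0:
            # Prediction step: W_k = rho * W_{k-1} + U_{k-1}.
            m, p = rho * m, rho * rho * p + var_u
        # Update step with the time-varying observation coefficient c_k.
        s = c * c * p + var_v          # innovation variance
        gain = p * c / s
        err = y - c * m                # innovation
        loglik += -0.5 * (math.log(2.0 * math.pi * s) + err * err / s)
        m, p = m + gain * err, (1.0 - gain * c) * p
    return m, p, loglik

def filtering_mixture(ys, **params):
    """Enumerate all bit sequences; return (weight, mean, variance) triples."""
    comps = []
    for cs in itertools.product((-1, 1), repeat=len(ys)):
        m, p, ll = kalman_given_bits(ys, cs, **params)
        comps.append((ll, m, p))
    top = max(ll for ll, _, _ in comps)            # for numerical stability
    total = sum(math.exp(ll - top) for ll, _, _ in comps)
    return [(math.exp(ll - top) / total, m, p) for ll, m, p in comps]

ys = [0.9, -1.1, 1.3, 0.7]   # made-up observations Y_0..Y_3
mixture = filtering_mixture(ys, rho=0.95, var_u=0.1, var_v=0.2, m0=1.0, p0=0.5)
print(len(mixture))           # number of Gaussian components: 2**4 = 16
```

The number of components doubles with each new observation, which is why the exact enumeration is only feasible for very short sequences.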
Conditionally Gaussian models related to the previous example are also commonly used to approximate non-Gaussian state-space models. Imagine that we are interested in the linear model given by Eqs. (1.7)–(1.8) with both noise sequences still being i.i.d. but at least one of them with a non-Gaussian distribution. Assuming a very general form of the noise distribution would directly lead us into the world of (general) continuous state-space HMMs. As a middle ground, we may however assume that the distribution of the noise is a finite mixture of Gaussian distributions.
Let $\{C_k\}_{k \ge 0}$ denote an i.i.d. sequence of random variables taking values in a set $\mathsf{C}$, which can be finite or infinite. We refer to these variables as the indicator variables when $\mathsf{C}$ is finite and latent variables otherwise. To model non-Gaussian system dynamics we will typically replace the evolution equation (1.7) by

$$W_{k+1} = \mu_W(C_{k+1}) + A(C_{k+1}) W_k + R(C_{k+1}) U_k , \qquad U_k \sim \mathrm{N}(0, I) ,$$

where $\mu_W$ is a vector-valued function and $A$ and $R$ are matrix-valued functions of suitable dimensions on $\mathsf{C}$. When $\mathsf{C} = \{1, \ldots, r\}$ is finite, the distribution of the noise $\mu_W(C_{k+1}) + R(C_{k+1}) U_k$ driving the state equation is a finite mixture of multivariate Gaussian distributions,

$$\sum_{i=1}^{r} m_i \, \mathrm{N}\bigl(\mu_W(i), R(i) R^t(i)\bigr) \quad \text{with } m_i = \mathrm{P}(C_0 = i) .$$
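As a quick check of this mixture representation, the following Python sketch samples the scalar version of the driving noise $\mu_W(C) + R(C) U$ and compares the empirical mean and variance with those of the corresponding two-component Gaussian mixture. The weights, means and scales are arbitrary illustrative values, the function names are hypothetical, and the indicator is indexed from 0 rather than from 1 for convenience.

```python
import random

def sample_mixture_noise(weights, means, scales, rng):
    """Draw mu_W(C) + R(C) * U with P(C = i) = weights[i] and U ~ N(0, 1)."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return means[i] + scales[i] * rng.gauss(0.0, 1.0)

def mixture_moments(weights, means, scales):
    """Mean and variance of the finite Gaussian mixture (scalar case)."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, scales))
    return mean, second - mean * mean

# Illustrative values (assumed): a dominant narrow component and a rare wide one.
weights, means, scales = [0.9, 0.1], [0.0, 2.0], [1.0, 3.0]
rng = random.Random(1)
draws = [sample_mixture_noise(weights, means, scales, rng) for _ in range(50_000)]
emp_mean = sum(draws) / len(draws)
emp_var = sum((x - emp_mean) ** 2 for x in draws) / len(draws)
print(mixture_moments(weights, means, scales), (emp_mean, emp_var))
```

With these values the mixture has mean 0.2 and variance 2.16, and the empirical moments should agree with them up to Monte Carlo error.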
Another option consists in using the same modeling to represent non-Gaussian observation noise by replacing the observation equation (1.8) by

$$Y_k = \mu_Y(C_k) + B(C_k) W_k + S(C_k) V_k , \qquad V_k \sim \mathrm{N}(0, I) ,$$

where $\mu_Y$ is a vector-valued function and $B$ and $S$ are matrix-valued functions of suitable dimensions on $\mathsf{C}$. Of course, by doing this the state of the HMM has to be extended to the joint process $\{X_k\}_{k \ge 0}$, where $X_k = (W_k, C_k)$, taking values in the product set $\mathsf{X} \times \mathsf{C}$. At first sight, it is not obvious that anything has been gained at all by introducing additional mixture indices with respect to our basic objective, which is to allow for linear state-space models with non-Gaussian noises. We shall see however in Chapter 8 that the availability of computational procedures that evaluate quantities such as $\mathrm{E}[W_k \mid Y_0, \ldots, Y_k, C_0, \ldots, C_k]$ is a distinct advantage of conditionally linear state-space models over more general (unstructured) continuous state-space HMMs. Conditionally Gaussian linear state-space models (CGLSSMs) have found an exceptionally broad range of applications.
Example 1.3.10 (Change Point Detection). A simple yet useful example of CGLSSMs appears in change point detection problems (Shumway and Stoffer, 1991; Fearnhead, 1998). In a Gaussian linear state-space model, the dynamics of the state depend on the state transition matrix and on the state noise covariance. These quantities may change over time, and if the changes, when they occur, do so unannounced and at unknown time points, then the associated inferential problem is referred to as a change point problem. Various important application areas of statistics involve change detection in a central way (for instance, environmental monitoring, quality assurance, biology). In the simplest change point problem, the state variable is the level of a quantity of interest, which is modeled as a step function; the time instants at which the step function jumps are the change points. An example of this situation is provided by the well-log data considered in Chapter 5 of the book by Ó Ruanaidh and Fitzgerald (1996) and analyzed, among others, by Fearnhead (1998) and Fearnhead and Clifford (2003).

[Figure 1.7: two panels plotting nuclear response against time, for time indices 0 to 4000.]
Fig. 1.7. Left: well-log data waveform with a median smoothing estimate of the state. Right: median smoothing residual.
In this example, the data, which is plotted in Figure 1.7, consists of measurements of the nuclear magnetic response of underground rocks that are obtained whilst drilling for oil. The data contains information about the rock structure that is being drilled through. In particular, it contains information about boundaries between rock strata; jumps in the step function relate to the rock strata boundaries. As can be seen from the data, the underlying state is a step function, which is corrupted by a fairly large amount of noise. It is the position of these jumps that one needs to estimate. To model this situation, we put $\mathsf{C} = \{0, 1\}$, where $C_k = 0$ means that there is no change point at time index $k$, whereas $C_k = 1$ means that a change point has occurred. The state-space model is

$$\begin{aligned} W_{k+1} &= A(C_{k+1}) W_k + R(C_{k+1}) U_k , \\ Y_k &= W_k + V_k , \end{aligned}$$
where $A(0) = I$, $R(0) = 0$, $A(1) = 0$ and $R(1) = R$. The simplest model consists in taking for $\{C_k\}_{k \ge 0}$ an i.i.d. sequence of Bernoulli random variables with probability of success $p$, that is, $\mathrm{P}(C_k = 1) = p$. The time between two change points (the period of time during which the state variable is constant) is then distributed as a geometric random variable with mean $1/p$, and

$$W_{k+1} = \begin{cases} W_k & \text{with probability } 1 - p , \\ R U_k & \text{with probability } p . \end{cases}$$

It is possible to allow a more general form for the prior distribution of the durations of the periods by introducing dependence among the indicator variables.
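The simplest change point model is easy to simulate, which also gives a direct check of the geometric duration claim. The Python sketch below treats the scalar case and takes $p$ to be the probability that $C_k = 1$ (a change point), so that the level is kept with probability $1 - p$ and redrawn as $R U_k$ with probability $p$. The initial level, the convention $C_0 = 0$, the function names and the numerical values are assumptions made for illustration only.

```python
import random

def simulate_change_points(n, p, r, sigma_v, rng):
    """Simulate W_{k+1} = A(C_{k+1}) W_k + R(C_{k+1}) U_k and Y_k = W_k + V_k.

    A(0) = 1, R(0) = 0 (no change); A(1) = 0, R(1) = r (a new level is drawn).
    """
    w = r * rng.gauss(0.0, 1.0)          # assumed initial level (not specified in the text)
    indicators, levels, obs = [], [], []
    for k in range(n):
        # Assumed convention: no change point is declared at time 0.
        c = 1 if (k > 0 and rng.random() < p) else 0
        if c == 1:
            w = r * rng.gauss(0.0, 1.0)  # change point: the level is redrawn
        indicators.append(c)
        levels.append(w)
        obs.append(w + sigma_v * rng.gauss(0.0, 1.0))
    return indicators, levels, obs

def durations(indicators):
    """Lengths of the constant segments between successive change points."""
    times = [k for k, c in enumerate(indicators) if c == 1]
    return [b - a for a, b in zip(times, times[1:])]

rng = random.Random(2)
indicators, levels, obs = simulate_change_points(
    200_000, p=0.02, r=1.0, sigma_v=0.3, rng=rng
)
gaps = durations(indicators)
print(sum(gaps) / len(gaps))   # should be close to 1/p = 50
```

Replacing the i.i.d. draw of the indicator by a draw that depends on the time elapsed since the last change point would give the more general duration priors mentioned above.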
Note that it is also possible to consider such multiple change point models under the different, although strictly equivalent, perspective of a Bayesian model with an unknown number of parameters. In this alternative representation, the hidden state trajectory is parameterized by the succession of its levels (between two change points), which thus form a variable dimension set of parameters (Green, 1995; Lavielle and Lebarbier, 2001). Bayesian inference about such parameters, equipped with a suitable prior distribution, is then carried out using simulation-based techniques to be discussed further in Chapter 13.
Example 1.3.11 (Linear State-Space Model with Observational Outliers and Heavy-Tailed Noise). Another interesting application of conditionally Gaussian linear state-space models pertains to the field of robust statistics (Schick and Mitter, 1994). In the course of model building and validation, statisticians are often confronted with the problem of dealing with outliers. Routinely ignoring unusual observations is neither wise nor statistically sound, as such observations may contain valuable information about unmodeled system characteristics, model degradation and breakdown, measurement errors and so forth.
The well-log data considered in the previous example illustrates this situation. A visual inspection of the nuclear response reveals the presence of outliers, which tend to clump together in bursts (or clusters). This is confirmed by the quantile-quantile regression plot (see Figure 1.8) of the residuals of the well-log data obtained from a crude moving median estimate of the state variable (the median filter applies a sliding window to a sequence and outputs the median value of all points in the window as a smoothed estimate at the window center). It can be seen that the normal distribution does not fit the measurement noise well in the tails. Following Fearnhead and Clifford (2003), we model the measurement noise as a mixture of two Gaussian distributions. The model can be written
$$\begin{aligned} W_{k+1} &= A(C_{k+1,1}) W_k + R(C_{k+1,1}) U_k , & U_k &\sim \mathrm{N}(0, 1) , \\ Y_k &= \mu(C_{k,2}) + B(C_{k,2}) W_k + S(C_{k,2}) V_k , & V_k &\sim \mathrm{N}(0, 1) , \end{aligned}$$

where $C_{k,1} \in \{0, 1\}$ and $C_{k,2} \in \{0, 1\}$ are indicators of a change point and of the presence of an outlier, respectively. As above, the level is assumed to be constant between two change points. Therefore we put $A(0) = 1$, $R(0) = 0$, $A(1) = 0$, and $R(1) = \sigma_U$. When there is no outlier, that is, $C_{k,2} = 0$, we assume that the level is observed in additive Gaussian noise. Therefore $(\mu(0), B(0), S(0)) = (0, 1, \sigma_{V,0})$. In the presence of an outlier, the measurement no longer carries information about the current value of the level, that is, $B(1) = 0$, and the measurement noise is assumed to follow a Gaussian distribution with negative mean $\mu$ and (large) variance $\sigma_{V,1}^2$. Therefore $(\mu(1), B(1), S(1)) = (\mu, 0, \sigma_{V,1})$. One possible model for $\{C_{k,2}\}$ would be a Bernoulli model in which we could include information about the ratio of outliers/non-outliers in the success probability. However, this does not incorporate any information about the way samples of outliers cluster together, as samples are assumed independent in such a model. A better model might be a two-state Markov chain in which the state transition probabilities allow a preference for “cohesion” within outlier bursts and non-outlier sections. Similar models have been used for audio signal restoration, where an outlier is a local degradation of the signal (click, scratch, etc.).

[Figure 1.8: quantile-quantile plot. Horizontal axis: Standard Normal Quantiles; vertical axis: Quantiles of Input Sample.]
Fig. 1.8. Quantile-quantile regression of empirical quantiles of the well-log data residuals with respect to quantiles of the standard normal distribution.
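The two ingredients of this example, a two-state Markov chain that produces bursts of outliers and a moving median used to form residuals, can be sketched as follows in Python. The transition probabilities, the constant level, the values of $\mu$, $\sigma_{V,0}$ and $\sigma_{V,1}$, the window width and the truncation of the window at the two ends of the sequence are all assumptions chosen for illustration; they are not the values used for the well-log data.

```python
import random
from statistics import median

def simulate_outlier_indicators(n, stay_clean, stay_outlier, rng):
    """Two-state Markov chain for C_{k,2}: 0 = clean sample, 1 = outlier."""
    state, path = 0, []
    for _ in range(n):
        path.append(state)
        stay = stay_outlier if state == 1 else stay_clean
        if rng.random() >= stay:
            state = 1 - state            # leave the current regime
    return path

def observe(levels, outliers, mu, sigma_clean, sigma_outlier, rng):
    """Y_k = mu(C) + B(C) W_k + S(C) V_k with (0, 1, sigma_clean) or (mu, 0, sigma_outlier)."""
    ys = []
    for w, c in zip(levels, outliers):
        if c == 0:
            ys.append(w + sigma_clean * rng.gauss(0.0, 1.0))
        else:
            ys.append(mu + sigma_outlier * rng.gauss(0.0, 1.0))
    return ys

def median_filter(xs, half_width):
    """Sliding-window median; the window is truncated at both ends (assumed)."""
    n = len(xs)
    return [median(xs[max(0, k - half_width):min(n, k + half_width + 1)])
            for k in range(n)]

def burst_lengths(path):
    """Lengths of maximal runs of consecutive outliers."""
    runs, current = [], 0
    for c in path:
        if c == 1:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

rng = random.Random(3)
n = 50_000
outliers = simulate_outlier_indicators(n, stay_clean=0.99, stay_outlier=0.8, rng=rng)
levels = [10.0] * n                      # constant level, for illustration only
ys = observe(levels, outliers, mu=-5.0, sigma_clean=1.0, sigma_outlier=4.0, rng=rng)
residuals = [y - s for y, s in zip(ys, median_filter(ys, half_width=10))]
runs = burst_lengths(outliers)
print(sum(outliers) / n, sum(runs) / len(runs))
```

With a probability 0.8 of remaining in the outlier state, bursts last five samples on average, whereas an i.i.d. Bernoulli indicator with the same overall outlier fraction (about 5%) would produce mostly isolated outliers.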
There are of course, in the framework of CGLSSMs, many additional degrees of freedom. For example, Ó Ruanaidh and Fitzgerald (1996) claimed that the distribution of the measurement noise in the “clean” segments (segments free from outliers) of the nuclear response measurement has tails heavier than those of the Gaussian distribution, and they advocated a Laplacian additive noise model. The use of heavy-tailed distributions to model either the observation noise or the measurement noise, which finds its roots in the field of robust statistics, is very popular and has been worked out in many different fields. One can of course consider using Laplace, Weibull, or Student t-distributions, depending on the expected “size” of the tails, but if one is willing to exploit the full strength of conditionally Gaussian linear systems, it is wiser to consider using Gaussian scale mixtures. A random vector $V$ is a Gaussian scale mixture if it can be expressed as the product of a Gaussian vector $W$ with zero mean and identity covariance matrix and an independent positive scalar random variable $\sqrt{C}$: $V = \sqrt{C}\, W$ (Andrews and Mallows, 1974). The variable $C$ is the multiplier or the scale. If $C$ has finite support, then $V$ is a finite mixture of Gaussian vectors, whereas if $C$ has a density with respect to Lebesgue measure on $\mathbb{R}$, then $V$ is a continuous mixture of Gaussian vectors. Gaussian scale mixtures are symmetric, zero mean, and have leptokurtic marginal den-