Hidden Markov models - Maximum likelihood parameter estimation in time series models using sequ

We begin with the definition of a HMM. Let {Xn}n≥1 be a homogenous Markov chain

defined on (X , EX). Suppose that this process is observed as another process {Yn}n≥1

defined on (Y, EY) such that the conditional distribution on Yngiven all the other random

variables depends only on Xn. Then the bivariate process {Xn, Yn}n≥1 is called a HMM.

We give below a more formal definition which is taken from Capp´e et al. [2005]; we additionally assume that the HMM is parametrised by a vector valued static parameter. Definition 3.1 (HMM). Let (_{X , E}X) and (Y, EY) be two measurable spaces, dθ > 0, and

Θ is a compact subset of Rdθ_{. For any θ} ∈ Θ, let µ

θ, Fθ, and Gθ denote, respectively, a

probability measure on (X , EX), a Markov transition kernel on (X , EX), and a transitional

kernel from (X , EX) to (Y, EY). Consider the Markov transition kernel Hθ defined on the

product space (X × Y, EX ⊗ EY) such that for all (x, y)∈ X × Y, C ∈ EX ⊗ EY

Hθ[(x, y), C] =

Fθ(x, dx′)Gθ(x′, dy′).

Then, the Markov chain {Xn, Yn}n≥1 with initial distribution µθ ⊗ Gθ, and with transi-

tional kernel Hθ is called a hidden Markov model (HMM) parametrised by θ.

Although this definition concerns the joint process _{Xn, Yn}n≥1, the term ‘hidden’

is justified when only _{Yn}n≥1 is observable. We call {Xn}n≥1 the hidden process and

its states the hidden states, and _{Yn}n≥1 is called the observed process, containing the

observed states. We will deal with real valued vector processes, that is why we always take _{X ∈ R}dx _and _{Y ∈ R}dy_{. Note, also, that it is Definition 3.1 from which it follows}

that {Xn}n≥1 is Markov(µθ, Fθ) and observations {Yn}_n≥1 conditioned upon {Xn}_n≥1

and B _{∈ E}Y we have Pθ(X1 ∈ A) = µθ(A), Pθ(Xn∈ A |X1:n−1= x1:n−1) = Fθ(xn−1, A), (3.1) Pθ Yn∈ B {Xt}_t≥1 ={xt}_t≥1,{Yt}_t6=n={yt}_t6=n = Gθ(xn, B). (3.2)

In the time series literature, the term HMM has been widely associated with the case of _{X being finite [Rabiner, 1989] and those models with continuous X are often referred} to as state-space models. Again, in some works the term ‘state space models’ refers to the case of linear Gaussian systems [Anderson and Moore, 1979]. We emphasise at this point that in this thesis we shall keep the framework as general as possible. We consider the general case of measurable spaces and we avoid making any restrictive assumptions on µθ, Fθ, and Gθ that impose a certain structure on the dynamics of the HMM. Also, we

clarify that in contrast to previous restrictive use of terminology, we will use both terms ‘HMM’ and ‘general state space model’ to describe exactly the same thing as defined by Definition 3.1.

For the rest of the thesis, we will be dealing with fully dominated HMMs, where µθ,

Fθ(x,·) and Gθ(x,·) have densities with respect to some dominating measures. We give

a formal definition of a fully dominated HMM here.

Definition 3.2 (fully dominated HMM). Consider the HMM in Definition 3.1. Sup-

pose that there exists probability measures λ on (X , EX) and ν on (Y, EY) such that (i)

µθ is absolutely continuous with respect to λ (ii) for all x∈ X , Fθ(x,·) is absolutely con-

tinuous with respect to λ with transition density function fθ(·|x) and (iii) for all x ∈ X ,

Gθ(x,·) is absolutely continuous with respect to ν with transition density function gθ(·|x).

Then the HMM is said to be fully dominated and the joint Markov transition kernel Hθ

is dominated by the product measure λ_{⊗ ν and admits the transition density function}

hθ(x′, y′|x, y) = fθ(x′|x)gθ(y′|x′).

Therefore, for a fully dominated HMM as in Definition 3.2, the joint probability density of (X1:n, Y1:n) exists and it is given by

pθ(x1:n, y1:n) = µθ(x1)gθ(y1|x1) n

t=2

fθ(xt|xt−1)gθ(yt|xt) (3.3)

where, with abuse of notation, we have used µ also to denote the density of the probability measure µ. Note, the joint law of all the variables of the HMM up to time n is summarised in (3.3) from which we derive several probability densities of interest. One example is the

likelihood of the observations up to time n which can be derived as

pθ(y1:n) =

pθ(x1:n, y1:n)λ(dx1:n). (3.4)

Maximisation of this quantity with respect to θ is the main interest of this thesis. Another important probability density, which will be pursued in detail, is the density of the posterior distribution of X1:n given Y1:n = y1:n, which is obtained by using the Bayes’

theorem pθ(x1:n|y1:n) = pθ(x1:n, y1:n) pθ(y1:n) (3.5)

3.2.1 Extensions to HMMs

Although HMMs are the most common class of time series models in the literature, there are also many time series models which are not a HMM and are still of great importance. These models differ from HMMs mostly because they do not possess the conditional independency of observations. Here, we give two examples that we will also use in this thesis.

• In the first example of such models, the process {Xn}n≥1 is still a Markov chain;

however the conditional distribution of Yn, given all past variables X1:n and Y1:n−1,

depends not only on the value of Xn but also on the values of past observations

i.e. Y1:n−1. If we denote the probability density of this conditional distribution

gθ,n(yn|xn, y1:n−1), the joint probability density of (X1:n, Y1:n) is

pθ(x1:n, y1:n) = µθ(x1)gθ(y1|x1) n

t=2

fθ(xt|xt−1)gθ,t(yt|xt, y1:t−1).

If Yn given Xn is independent of the past values of the observations prior to time

n_{− k, then we can define a g}θ such that gθ,n(yn|xn, y1:n−1) = gθ(yn|xn, yn−k:n−1) for

all n. One example of such models is a changepoint model e.g. see Fearnhead and Liu [2007]. We will encounter changepoint models in Chapter 4 of this thesis. The terminology regarding the type of models where we have gθ(yn|xn, yn−k:n−1) is

not fully standardised. One term that is used is Markov switching models; Markov

jump systems is also used at least in cases where the hidden state space is finite

[Capp´e et al., 2005]. These models have much in common with basic HMMs in the sense that virtually identical computational tools may be used for both models. In the particular context of SMC, the similarity between these to types of models is more clearly exposed in Del Moral [2004] via the Feynman-Kac representation of SMC methods, where the conditional density of observation at time n is treated generally as a potential function of xn.

• In another type of time series models that are not HMM the latent process {Xn}n≥1

is, again, still a Markov chain; however observation at current time depends on all the past values, i.e. Yn conditional on (X1:n, Y1:n−1) depends on all of these

conditioned random variables. Actually, these models are usually the result of marginalising an extended HMM. Consider the HMM{(Xn, Zn), Yn}n≥1, where the

joint process {Xn, Zn}n≥1 is a Markov chain such that its transitional law admits

the density fθ with respect to the product measure λ1⊗ λ2 which can be factorized

fθ(xn, zn|xn−1, zn−1) = fθ,1(xn|xn−1)fθ,2(zn|xn, zn−1).

and the observation Yn depends only on Xn and Zn given all the past random vari-

ables and admits the probability density gθ(yn|xn, zn). Now, the marginal bivariate

process _{Xn, Yn}n≥1 is not a HMM and we express the joint density of (X1:n, Y1:n)

as pθ(x1:n, y1:n) = µθ(x1)pθ,1(y1|x1) n Y t=2 fθ,1(xt|xt−1)pθ,t(yt|x1:t, y1:t−1)

where the density pθ,n(yn|x1:n, y1:n−1) is given by

pθ,n(yn|x1:n, y1:n−1) =

pθ(z1:n−1|x1:n−1, y1:n−1)fθ,2(zn|xn, zn−1)gθ(yn|xn, zn)λ2(dz1:n).

(3.6) The reason{Xn, Yn}n≥1might be of interest is that the conditional laws of Z1:nmay

be available in close form and exact evaluation of the integral in (3.6) is available. In that case, it can be more effective to perform Monte Carlo approximation for the law of X1:n given observations Y1:n, which leads to the so called Rao-Blackwellised

particle filters in the literature [Doucet et al., 2000a].

The integration is indeed available in close form for some time series models. One example is the linear Gaussian switching state space models [Chen and Liu, 2000; Doucet et al., 2000a; Fearnhead and Clifford, 2003], where Xn takes values on a

finite set whose elements are often called ‘labels’, and conditioned on _{Xn}n≥1,

{Zn, Yn}n≥1 is a linear Gaussian state-space model. A more sophisticated time

series model of the same nature is linear Gaussian multiple target tracking models, which we will investigate in detail in Chapter 5.

Having stated that the interest of this thesis is on more general time series models than HMMs, we note that the computational tools developed for HMMs are generally applicable to a more general class of time series models with some suitable modifications. For this reason we carry on this chapter with review of SMC and parameter estimation methods for HMMs.

In document Maximum likelihood parameter estimation in time series models using sequential Monte Carlo (Page 61-65)