Evaluating hidden Markov model parameters

CHAPTER 2 : THEORY

2.1 Dependent mixture model

2.1.4 Evaluating hidden Markov model parameters

For a complete description of an HMM, two specifications, N and U, and three parameters, A, E and τ, are required. This section discusses how the parameters of an HMM can be evaluated. For convenience, the compact symbol φ will be used to represent the model parameters. That is, φ = {A, E, τ}.

Figure 2.5. Graphical illustration of sequence generation by HMM

To model a sequence of observations by an HMM, there are three fundamental problems that must be solved:

• Problem 1: Given an HMM φ = {A, E, τ}, how to compute the probability of a given observation sequence Y = {y₁, y₂, y₃, . . . y_T}.

• Problem 2: Given an HMM φ = {A, E, τ}, how to choose the state sequence {r₁, r₂, r₃, . . . r_T} that will maximize the likelihood of a given observation sequence Y = {y₁, y₂, y₃, . . . y_T}.

• Problem 3: How to adjust the model parameters φ = {A, E, τ} so as to maximize the conditional probability Pr(Y|φ).

Problem 1 is known as the Evaluation problem of HMM. It is mainly associated with the computation of the probability of a given sequence according to the given model.

Problem 2, which is known as the Decoding problem of HMM, attempts to reveal the hidden part of the model. It tries to find the state sequence that is optimal according to some criterion.

Problem 3 is the Learning problem of HMM. It adjusts the model parameters φ = {A, E, τ} iteratively, improving the likelihood of the observations until a (locally) optimal solution is reached.

In the following, the mathematical formulation of these three fundamental problems and the current state of the art in solving them are discussed.

Solution to Problem 1: The Evaluation problem deals with the computation of the probability of a given observation sequence Y = {y₁, y₂, y₃, . . . y_T} for a given model φ = {A, E, τ}. In mathematical notation, Pr(Y|φ) is to be evaluated. From the preceding discussion, it is clear that this probability arises from a joint distribution with two components: first, the probability of the emissions; second, the probability of the states. Consider an explicit state sequence R = {r₁, r₂, r₃, . . . r_T}. The probability of the observation sequence given this explicit state sequence is:

$$\Pr(Y \mid R, \varphi) = \prod_{t=1}^{T} \Pr(y_t \mid r_t, \varphi) \tag{2.8}$$

Equation (2.8) can be written in terms of emission probabilities as follows:

$$\Pr(Y \mid R, \varphi) = e_{r_1}(y_1)\, e_{r_2}(y_2) \cdots e_{r_T}(y_T) \tag{2.9}$$

The probability of the explicit state sequence is given by:

$$\Pr(R \mid \varphi) = \tau_{r_1}\, a_{r_1 r_2}\, a_{r_2 r_3} \cdots a_{r_{T-1} r_T} \tag{2.10}$$

The joint probability of the observation and the explicit state occurring simultaneously is simply the product of the two expressions:

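$$\Pr(Y, R \mid \varphi) = \Pr(Y \mid R, \varphi) \times \Pr(R \mid \varphi) \tag{2.11a}$$

or,

$$\Pr(Y, R \mid \varphi) = \tau_{r_1}\, e_{r_1}(y_1)\, a_{r_1 r_2}\, e_{r_2}(y_2)\, a_{r_2 r_3}\, e_{r_3}(y_3) \cdots a_{r_{T-1} r_T}\, e_{r_T}(y_T) \tag{2.11b}$$

or,

$$\Pr(Y, R \mid \varphi) = \tau_{r_1}\, e_{r_T}(y_T) \prod_{t=1}^{T-1} e_{r_t}(y_t)\, a_{r_t r_{t+1}} \tag{2.11c}$$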

Equation (2.11) gives the joint probability of the observation sequence and one explicit state sequence. The objective here is to find the probability of the observation sequence irrespective of any particular state sequence. Summing over all possible state sequences gives the required probability:

$$\Pr(Y \mid \varphi) = \sum_{R} \Pr(Y \mid R, \varphi)\, \Pr(R \mid \varphi) \tag{2.12}$$
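The direct summation over state sequences can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: it assumes discrete observation symbols coded as integers, and the toy values of `A`, `E` and `tau` as well as the function names are hypothetical.

```python
from itertools import product

# Hypothetical toy model: N = 2 states, 2 observation symbols {0, 1}.
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = a_ij, transition probability
E = [[0.9, 0.1], [0.2, 0.8]]   # E[i][k] = e_i(k), emission probability
tau = [0.6, 0.4]               # tau[i] = initial probability of state i


def joint_probability(Y, R, A, E, tau):
    """Pr(Y, R | phi) for one explicit state sequence R, as in (2.11b)."""
    p = tau[R[0]] * E[R[0]][Y[0]]
    for t in range(1, len(Y)):
        p *= A[R[t - 1]][R[t]] * E[R[t]][Y[t]]
    return p


def likelihood_brute_force(Y, A, E, tau):
    """Pr(Y | phi) obtained by summing (2.11) over all N**T state sequences."""
    N = len(tau)
    return sum(joint_probability(Y, R, A, E, tau)
               for R in product(range(N), repeat=len(Y)))


Y = [0, 1, 0]
print(likelihood_brute_force(Y, A, E, tau))
```

The loop over `product(range(N), repeat=len(Y))` visits N^T sequences, so this approach is only usable for very short sequences.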

Although equation (2.12) gives the required probability in a straightforward manner, it is computationally expensive. The summation has to be performed over all possible state sequences, of which there are N^T, a very large number even for moderate values of N and T. For example, N = 5 and T = 100 already give 5^100 ≈ 7.9 × 10^69 sequences. This dimensionality problem is handled with the forward-backward procedure. First, the forward and backward terms α and β are defined, respectively:

$$\alpha_t(i) = \Pr(y_1, y_2, \ldots, y_t,\ r_t = S_i \mid \varphi) \tag{2.13}$$

$$\beta_t(i) = \Pr(y_{t+1}, y_{t+2}, \ldots, y_T \mid r_t = S_i, \varphi) \tag{2.14}$$

The forward term is the joint probability of the observations up to time t and of being in state S_i at that time. It can be calculated inductively by:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] e_j(y_{t+1}), \quad 1 \le t \le T-1,\ 1 \le j \le N \tag{2.15}$$

Equation (2.15) is interpreted as follows. At time t + 1, state S_j can be reached from any of the N possible states. Since the forward term α_t(i) is the joint probability of observing y₁, y₂, . . . y_t and being in state S_i at time t, the product α_t(i) a_{ij} is the joint probability of observing y₁, y₂, . . . y_t and reaching state S_j at time t + 1 via state S_i. Summing this product over all the states S_i gives the probability of observing y₁, y₂, . . . y_t and reaching state S_j at time t + 1, which accounts for all the possible state scenarios in the preceding time frames. Multiplying by e_j(y_{t+1}) then includes the observation at time t + 1. The forward term is initialized according to the initial state probabilities by α₁(i) = τ_i e_i(y₁) for 1 ≤ i ≤ N. The terminal forward term α_T(i) is the probability of observing Y with S_i as the final state. Since the final states are mutually exclusive, summing the terminal forward terms over all the states gives the required probability:

$$\Pr(Y \mid \varphi) = \sum_{i=1}^{N} \alpha_T(i) \tag{2.16}$$
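The forward recursion (2.15) and the termination step (2.16) might be sketched as follows. The sketch assumes discrete observation symbols coded as integers and initializes with the initial state probabilities τ; the toy parameter values and function names are hypothetical.

```python
# Hypothetical toy model: N = 2 states, 2 observation symbols {0, 1}.
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = a_ij
E = [[0.9, 0.1], [0.2, 0.8]]   # E[i][k] = e_i(k)
tau = [0.6, 0.4]               # initial state probabilities


def forward(Y, A, E, tau):
    """Return alpha as a list of rows, with alpha[t][i] for time index t = 0..T-1."""
    N = len(tau)
    # Initialization: alpha_1(i) = tau_i * e_i(y_1)
    alpha = [[tau[i] * E[i][Y[0]] for i in range(N)]]
    # Induction (2.15): sum over predecessor states, then emit y_{t+1}
    for y in Y[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * E[j][y]
                      for j in range(N)])
    return alpha


def likelihood_forward(Y, A, E, tau):
    """Pr(Y | phi) from the terminal forward terms, equation (2.16)."""
    return sum(forward(Y, A, E, tau)[-1])


Y = [0, 1, 0]
print(likelihood_forward(Y, A, E, tau))
```

Each time step costs on the order of N² multiplications, so the whole computation scales as N²T rather than N^T. For long sequences the forward terms become very small, and practical implementations usually rescale them or work with logarithms.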

The backward term can similarly be utilized to calculate the observation probability.

Rather than calculating the partial probability by a forward induction, the backward term is used to calculate the partial probability through backward induction. The backward induction is given by:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, e_j(y_{t+1})\, \beta_{t+1}(j) \tag{2.17}$$

The backward term is initialized arbitrarily by defining β_T(i) = 1 for all i. In the induction procedure, which involves calculating β_t(i) from β_{t+1}(j), all the possible states at time t + 1 are considered together with the associated transitions. Either the forward term or the backward term can be used to solve the first fundamental problem, and both are used extensively in solving the remaining two.
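A sketch of the backward induction (2.17) is given here under the same assumptions as before (discrete symbols, hypothetical toy values and function names). The termination step, Pr(Y|φ) = Σ_i τ_i e_i(y₁) β₁(i), is not stated in the text above; it is the standard way of recovering the observation probability from the backward terms.

```python
# Hypothetical toy model: N = 2 states, 2 observation symbols {0, 1}.
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = a_ij
E = [[0.9, 0.1], [0.2, 0.8]]   # E[i][k] = e_i(k)
tau = [0.6, 0.4]               # initial state probabilities


def backward(Y, A, E):
    """Return beta as a list of rows, with beta[t][i] for time index t = 0..T-1."""
    N, T = len(A), len(Y)
    # Initialization: beta_T(i) = 1 for all i
    beta = [[1.0] * N for _ in range(T)]
    # Induction (2.17), running from t = T-1 down to t = 1 (0-based: T-2 .. 0)
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j]
                       for j in range(N))
                   for i in range(N)]
    return beta


def likelihood_backward(Y, A, E, tau):
    """Pr(Y | phi) = sum_i tau_i * e_i(y_1) * beta_1(i) (standard termination step)."""
    beta = backward(Y, A, E)
    return sum(tau[i] * E[i][Y[0]] * beta[0][i] for i in range(len(tau)))


Y = [0, 1, 0]
print(likelihood_backward(Y, A, E, tau))
```

For the same model and observation sequence, this value should agree with the one obtained from the forward terms up to rounding.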

Solution to Problem 2: The Decoding problem deals with finding the optimal state sequence, that is, the one that maximizes some score. Unlike Evaluation, where an exact solution exists, Decoding admits multiple solutions depending on the criterion of optimality. One choice is to pick, at each time, the state that maximizes the individual likelihood, i.e. Pr(r_t|y_t, φ). But this criterion does not establish any link between consecutive states, so the resulting sequence may contain a state with zero or low probability, or a transition between individually most likely states that has zero or low probability. Although optimizing over pairs or triplets of states could be a reasonable remedy, the most widely used technique is to choose the single best path in terms of likelihood, that is, the path R for which Pr(R|Y, φ) is maximized. Since Pr(Y|φ) does not depend on R, this is equivalent to maximizing Pr(R, Y|φ). This path is found with the Viterbi algorithm [32].

The Viterbi algorithm is based on dynamic programming. It relies on the principle that if policy A is a sub-policy of policy B and A has a best solution s, then the best solution of B must contain s. A score variable χ_t(i) is defined as the highest probability along a single path that accounts for the first t observations and ends at state S_i:

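$$\chi_t(i) = \max_{r_1, r_2, \ldots, r_{t-1}} \Pr(r_1, r_2, \ldots, r_t = S_i,\ y_1, y_2, \ldots, y_t \mid \varphi) \tag{2.18}$$

The variable is initialized by:

$$\chi_1(i) = \tau_i\, e_i(y_1), \quad 1 \le i \le N \tag{2.19}$$

and calculated inductively by:

$$\chi_{t+1}(j) = \left[ \max_{i} \chi_t(i)\, a_{ij} \right] e_j(y_{t+1}), \quad 1 \le j \le N \tag{2.20}$$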

It should be noted that the Viterbi algorithm works in almost the same way as the forward procedure, with the summation over previous states replaced by a maximization. At any time t, there are N "best" score variables, one per state. For the next time step, each of the N states can be reached via N best paths, and only the path with the best score is retained. When the procedure terminates at t = T, the best single path is retrieved by backtracking through the states.
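The recursion (2.19)-(2.20) together with the backtracking step might be sketched as below. The sketch assumes discrete observation symbols; the back-pointer table `back`, the function name and the toy parameter values are illustrative choices, not taken from this work.

```python
# Hypothetical toy model: N = 2 states, 2 observation symbols {0, 1}.
A = [[0.7, 0.3], [0.4, 0.6]]   # A[i][j] = a_ij
E = [[0.9, 0.1], [0.2, 0.8]]   # E[i][k] = e_i(k)
tau = [0.6, 0.4]               # initial state probabilities


def viterbi(Y, A, E, tau):
    """Return (best state path, its joint probability Pr(R, Y | phi))."""
    N = len(tau)
    # Initialization (2.19)
    chi = [tau[i] * E[i][Y[0]] for i in range(N)]
    back = []  # back[t-1][j] = best predecessor of state j at time index t
    for y in Y[1:]:
        # For every target state j, keep only the best incoming path (2.20)
        best_prev = [max(range(N), key=lambda i: chi[i] * A[i][j])
                     for j in range(N)]
        chi = [chi[best_prev[j]] * A[best_prev[j]][j] * E[j][y]
               for j in range(N)]
        back.append(best_prev)
    # Termination: best final state, then backtrack through the stored pointers
    last = max(range(N), key=lambda i: chi[i])
    path = [last]
    for best_prev in reversed(back):
        path.append(best_prev[path[-1]])
    path.reverse()
    return path, chi[last]


path, score = viterbi([0, 1, 0], A, E, tau)
print(path, score)
```

Only the scores of the current time step are kept, while the back-pointers are stored for every step, since they are what allows the path to be recovered at the end.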

Solution to Problem 3: The third problem associated with HMM deals with finding the model parameters φ = {A, E, τ} that maximize the likelihood of the observations.

Although there is no analytical way to find the model parameters that maximize the likelihood [31], a local maximum can be reached using the Expectation Maximization (EM) algorithm, an iterative procedure classically developed by Baum et al. [33].

To begin with, for adjusting the model parameters, a new term ω_t(i, j) is introduced to define the joint probability of being in state S_i at time t and in state S_j at time t + 1, for the given observation sequence and model parameters:

$$\omega_t(i, j) = \Pr(r_t = S_i,\ r_{t+1} = S_j \mid Y, \varphi) \tag{2.21}$$

Now ω_t(i, j) can be written in terms of forward and backward terms:

$$\omega_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, e_j(y_{t+1})\, \beta_{t+1}(j)}{\Pr(Y \mid \varphi)} \tag{2.22}$$

The numerator of equation (2.22) gives the joint probability of being in S_i at time t, being in S_j at time t + 1, and observing the given sequence under the model. The denominator normalizes this quantity into the required probability measure. It can be written as a double summation of the numerator term over all the states, ∑_{i=1}^{N} ∑_{j=1}^{N} α_t(i) a_{ij} e_j(y_{t+1}) β_{t+1}(j). A second term, µ_t(i), is defined as the probability of being in state S_i at time t for the given observation sequence and model parameters:

$$\mu_t(i) = \Pr(r_t = S_i \mid Y, \varphi) \tag{2.23}$$

Now the two conditional probability terms, ω and µ, can be related. The former is the joint probability of being in two given states at two consecutive times. Summing ω_t(i, j) over all the states at time t + 1 gives the probability of being in state S_i at the earlier time t. Mathematically,

$$\mu_t(i) = \sum_{j=1}^{N} \omega_t(i, j) \tag{2.24}$$
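That summing ω_t(i, j) over j recovers µ_t(i) can be checked from equations (2.22) and (2.17). The short derivation below uses the same symbols as the text; the closed form of µ_t(i) in terms of α and β in the last step is a standard identity that is not stated explicitly above, and it holds because α_t(i) β_t(i) = Pr(Y, r_t = S_i | φ).

```latex
\begin{align*}
\sum_{j=1}^{N} \omega_t(i, j)
  &= \frac{\alpha_t(i) \sum_{j=1}^{N} a_{ij}\, e_j(y_{t+1})\, \beta_{t+1}(j)}
          {\Pr(Y \mid \varphi)}
     && \text{by (2.22)} \\
  &= \frac{\alpha_t(i)\, \beta_t(i)}{\Pr(Y \mid \varphi)}
     && \text{by (2.17)} \\
  &= \frac{\Pr(Y,\ r_t = S_i \mid \varphi)}{\Pr(Y \mid \varphi)}
   = \mu_t(i),
     \qquad 1 \le t \le T - 1 .
\end{align*}
```

The form α_t(i) β_t(i) / Pr(Y|φ) is also valid at t = T, where no ω term exists.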

Both probability terms, ω and µ, have very useful interpretations. Summing µ_t(i) over the entire time interval, that is t = 1 to t = T, gives the expected number of times that state S_i is visited. Excluding t = T from the time interval, the summation can be interpreted as the expected number of transitions from state S_i. Likewise, summing ω_t(i, j) over the time interval t = 1 to t = T − 1 gives the expected number of transitions from state S_i to S_j. That is,

$$\sum_{t=1}^{T-1} \mu_t(i) = \text{expected number of transitions from } S_i \tag{2.25}$$

$$\sum_{t=1}^{T-1} \omega_t(i, j) = \text{expected number of transitions from } S_i \text{ to } S_j \tag{2.26}$$

Using the above formulas and the concept of counting, a method can be developed for readjusting the HMM parameters. The set of formulas for re-estimating the model parameters is as follows:

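$$\bar{\tau}_i = \text{expected number of times in state } S_i \text{ at time } t = 1 = \mu_1(i) \tag{2.27a}$$

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to } S_j}{\text{expected number of transitions from state } S_i} = \frac{\sum_{t=1}^{T-1} \omega_t(i, j)}{\sum_{t=1}^{T-1} \mu_t(i)} \tag{2.27b}$$

$$\bar{e}_j(k) = \frac{\text{expected number of times in state } S_j \text{ and observing the } k\text{-th symbol}}{\text{expected number of times in state } S_j} = \frac{\sum_{t=1,\ y_t = k}^{T} \mu_t(j)}{\sum_{t=1}^{T} \mu_t(j)} \tag{2.27c}$$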

Based on equations (2.27a)-(2.27c), the model parameters can be readjusted iteratively. It has been shown by Baum et al. that if a new set of model parameters φ̄ = {Ā, Ē, τ̄} is computed through equations (2.27a)-(2.27c), then it will either 1) be equal to the existing model parameters φ = {A, E, τ}, or 2) give a better likelihood [34], that is, Pr(Y|φ̄) > Pr(Y|φ). The model thus converges toward better parameters through successive iterations.
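One re-estimation pass can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from this work: observations are discrete symbols coded as integers, a single observation sequence is used, µ_t(i) is computed as α_t(i) β_t(i) / Pr(Y|φ), no scaling is applied, and the toy values and function names are hypothetical.

```python
def forward(Y, A, E, tau):
    """Forward terms alpha[t][i], initialized with tau_i * e_i(y_1)."""
    N = len(tau)
    alpha = [[tau[i] * E[i][Y[0]] for i in range(N)]]
    for y in Y[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * E[j][y]
                      for j in range(N)])
    return alpha


def backward(Y, A, E):
    """Backward terms beta[t][i], initialized with beta_T(i) = 1."""
    N, T = len(A), len(Y)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    return beta


def likelihood(Y, A, E, tau):
    """Pr(Y | phi) from the terminal forward terms."""
    return sum(forward(Y, A, E, tau)[-1])


def reestimate(Y, A, E, tau):
    """One pass of the re-estimation formulas; returns (new_A, new_E, new_tau)."""
    N, T, U = len(tau), len(Y), len(E[0])
    alpha, beta = forward(Y, A, E, tau), backward(Y, A, E)
    pY = sum(alpha[-1])
    # omega[t][i][j]: probability of S_i at t and S_j at t+1, given Y
    omega = [[[alpha[t][i] * A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j] / pY
               for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # mu[t][i]: probability of S_i at t, given Y
    mu = [[alpha[t][i] * beta[t][i] / pY for i in range(N)] for t in range(T)]
    new_tau = mu[0][:]
    new_A = [[sum(omega[t][i][j] for t in range(T - 1)) /
              sum(mu[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_E = [[sum(mu[t][j] for t in range(T) if Y[t] == k) /
              sum(mu[t][j] for t in range(T))
              for k in range(U)] for j in range(N)]
    return new_A, new_E, new_tau


# Hypothetical starting parameters and observation sequence.
A = [[0.7, 0.3], [0.4, 0.6]]
E = [[0.9, 0.1], [0.2, 0.8]]
tau = [0.6, 0.4]
Y = [0, 1, 0, 0, 1, 1, 0]

before = likelihood(Y, A, E, tau)
for _ in range(5):
    A, E, tau = reestimate(Y, A, E, tau)
after = likelihood(Y, A, E, tau)
print(before, after)
```

Printing the likelihood before and after a few passes gives a quick check that it does not decrease. With a single short sequence the procedure tends to overfit, and for long sequences the unscaled forward and backward terms underflow, so practical implementations rescale them or work in the log domain.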

In document Data Center Load Forecast Using Dependent Mixture Model (Page 29-38)