2.1.4 Evaluating hidden Markov model parameters

For a complete description of an HMM, two specifications, N and U, and three parameters, A, E and τ, are required. This section discusses how the parameters of an HMM can be evaluated. For convenience, the compact symbol φ is used to represent the model parameters, that is, φ = {A, E, τ}.

Figure 2.5. Graphical illustration of sequence generation by HMM

To model a sequence of observations by an HMM, there are three fundamental problems that must be solved:

• Problem 1: Given an HMM φ = {A, E, τ}, how to compute the probability of a given observation sequence Y = {y1, y2, y3, . . . , yT}.

• Problem 2: Given an HMM φ = {A, E, τ}, how to choose the state sequence R = {r1, r2, r3, . . . , rT} that maximizes the likelihood of a given observation sequence Y = {y1, y2, y3, . . . , yT}.

• Problem 3: How to adjust the model parameters φ = {A, E, τ} so as to maximize the conditional probability Pr(Y|φ).

Problem 1 is known as the Evaluation problem of HMM. It is mainly associated with the computation of the probability of a given sequence according to the given model.

Problem 2, which is known as the Decoding problem of HMM, attempts to reveal the hidden part of the model. It tries to find the state sequence that is optimal with respect to some criterion. Problem 3 is the Learning problem of HMM. It adjusts the model parameters φ = {A, E, τ} so that an optimal solution emerges through iteration.

In the following section, the mathematical formulation of these three fundamental problems and the current state of the art in solving them are discussed.

Solution to Problem 1: The Evaluation problem deals with the computation of the probability of a given observation sequence Y = {y1, y2, y3, . . . , yT} for a given model φ = {A, E, τ}. In mathematical notation, Pr(Y|φ) is to be evaluated. From the preceding discussion, it is clear that this conditional probability arises from a joint distribution with two components: first, the probability of the emissions; second, the probability of the states. Consider an explicit state sequence R = {r1, r2, r3, . . . , rT}. The probability of the observation sequence for this explicit state sequence is given by:

$$\Pr(Y \mid R, \phi) = \prod_{t=1}^{T} \Pr(y_t \mid r_t, \phi) \tag{2.8}$$

Equation (2.8) can be written in terms of emission probabilities as follows:

$$\Pr(Y \mid R, \phi) = e_{r_1}(y_1)\, e_{r_2}(y_2) \cdots e_{r_T}(y_T) \tag{2.9}$$

The probability of the explicit state sequence is given by:

$$\Pr(R \mid \phi) = \tau_{r_1}\, a_{r_1 r_2}\, a_{r_2 r_3} \cdots a_{r_{T-1} r_T} \tag{2.10}$$

The joint probability of the observation and the explicit state occurring simultaneously is simply the product of the two expressions:

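$$\Pr(Y, R \mid \phi) = \Pr(Y \mid R, \phi) \times \Pr(R \mid \phi) \tag{2.11a}$$

or,

$$\Pr(Y, R \mid \phi) = \tau_{r_1}\, e_{r_1}(y_1)\, a_{r_1 r_2}\, e_{r_2}(y_2)\, a_{r_2 r_3}\, e_{r_3}(y_3) \cdots a_{r_{T-1} r_T}\, e_{r_T}(y_T) \tag{2.11b}$$

or,

$$\Pr(Y, R \mid \phi) = \tau_{r_1}\, e_{r_T}(y_T) \prod_{t=1}^{T-1} e_{r_t}(y_t)\, a_{r_t r_{t+1}} \tag{2.11c}$$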
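Equation (2.11b) can be evaluated directly for any explicit state sequence. The following is a minimal sketch, assuming discrete observations indexed from 0 and an emission table `E[i][k]` standing for ei(k); the function name and the two-state toy numbers are illustrative assumptions and do not come from the text.

```python
def joint_probability(tau, A, E, R, Y):
    """Pr(Y, R | phi) as the product in equation (2.11b).

    tau[i]  : initial probability of state i
    A[i][j] : transition probability from state i to state j
    E[i][k] : probability of emitting symbol k in state i (assumed discrete)
    R, Y    : state sequence and observation sequence of equal length
    """
    p = tau[R[0]] * E[R[0]][Y[0]]
    for t in range(1, len(Y)):
        # one transition factor followed by one emission factor per time step
        p *= A[R[t - 1]][R[t]] * E[R[t]][Y[t]]
    return p


# Hypothetical two-state model with two observation symbols.
tau = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
E = [[0.9, 0.1], [0.2, 0.8]]
Y = [0, 1, 0]
R = [0, 1, 0]

p = joint_probability(tau, A, E, R, Y)
print(p)
```

The loop multiplies T − 1 transition and emission pairs onto the initial factor, so the cost for one state sequence is linear in T.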

Equation (2.11) gives the joint probability of the observation sequence and an explicit state sequence. The objective here is to find the probability of the observation sequence irrespective of any particular state sequence. Summing over all possible state sequences gives the required probability:

$$\Pr(Y \mid \phi) = \sum_{R} \Pr(Y \mid R, \phi)\, \Pr(R \mid \phi) \tag{2.12}$$

Although equation (2.12) gives the required probability in a straightforward manner, it is computationally expensive. The summation has to be performed over all possible state sequences, of which there are N^T, a very large number even for small values of N and T. This dimensionality problem is handled with the forward-backward procedure. First the forward and backward terms α and β are defined, respectively:

$$\alpha_t(i) = \Pr(y_1, y_2, \ldots, y_t,\; r_t = S_i \mid \phi) \tag{2.13}$$

$$\beta_t(i) = \Pr(y_{t+1}, y_{t+2}, \ldots, y_T \mid r_t = S_i,\; \phi) \tag{2.14}$$

The forward term is the partial probability of the observations up to time t together with the state Si at that time. It can be calculated inductively by:

$$\alpha_{t+1}(j) = \Big[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \Big]\, e_j(y_{t+1}), \qquad 1 \le t \le T-1,\; 1 \le j \le N \tag{2.15}$$

Equation (2.15) is interpreted as follows. At time t + 1, state Sj can be reached from any of the N possible states. Since the forward term is the joint probability of observing y1, y2, . . . yt and being in state Si at time t, the product αt(i)aij is the joint probability of observing y1, y2, . . . yt and reaching state Sj at time t + 1 via state Si. Summing this product over all the states gives the probability of observing y1, y2, . . . yt and reaching state Sj at time t + 1, which accounts for all the possible state scenarios in the preceding time frames. Multiplying the sum by the emission probability ej(yt+1) then includes the observation at time t + 1. The forward term is initialized according to the initial state probabilities by α1(i) = τi.ei(y1) for 1 ≤ i ≤ N. The terminal forward term αT(i) is the probability of observing Y with Si as the final state. Since the states are disjoint, summing the terminal forward terms over all the states gives the required probability:

$$\Pr(Y \mid \phi) = \sum_{i=1}^{N} \alpha_T(i) \tag{2.16}$$
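The forward recursion of equations (2.15) and (2.16) can be sketched as below and checked against a direct sum over all N^T state sequences. This is an illustrative sketch under stated assumptions: observations are discrete symbols indexed from 0, `E[i][k]` stands for ei(k), and the function names and toy numbers are hypothetical. No scaling is applied, so long sequences would underflow.

```python
from itertools import product


def forward(tau, A, E, Y):
    """Return the list of forward vectors alpha[t][i] for t = 0 .. T-1."""
    N = len(tau)
    # initialization: alpha_1(i) = tau_i * e_i(y_1)
    alpha = [[tau[i] * E[i][Y[0]] for i in range(N)]]
    for t in range(1, len(Y)):
        prev = alpha[-1]
        # induction (2.15): sum over predecessor states, then emit y_{t+1}
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(N)) * E[j][Y[t]]
            for j in range(N)
        ])
    return alpha


def brute_force(tau, A, E, Y):
    """Sum the joint probability over every possible state sequence."""
    N = len(tau)
    total = 0.0
    for R in product(range(N), repeat=len(Y)):
        p = tau[R[0]] * E[R[0]][Y[0]]
        for t in range(1, len(Y)):
            p *= A[R[t - 1]][R[t]] * E[R[t]][Y[t]]
        total += p
    return total


# Hypothetical two-state model with two observation symbols.
tau = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
E = [[0.9, 0.1], [0.2, 0.8]]
Y = [0, 1, 0]

alpha = forward(tau, A, E, Y)
likelihood = sum(alpha[-1])          # termination (2.16)
reference = brute_force(tau, A, E, Y)
print(likelihood, reference)
```

The forward pass needs on the order of N²T multiplications, whereas the enumeration visits N^T sequences; the two agree on this small example only because the sequence is short enough to enumerate.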

The backward term can similarly be used to calculate the observation probability. Rather than building up the partial probability by forward induction, the backward term is computed by backward induction, which is given by:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, e_j(y_{t+1})\, \beta_{t+1}(j) \tag{2.17}$$

The backward term is initialized by convention by defining βT(i) = 1 for all i. The induction procedure, which calculates βt(i) from βt+1(j), considers all the possible next states together with the associated transitions. Either the forward term or the backward term can be used to solve the first fundamental problem, and both are used extensively to solve the remaining two.
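The backward induction can be sketched in the same style. The termination step used here, Pr(Y|φ) = Σi τi ei(y1) β1(i), is the standard way to recover the observation probability from the backward term; it is not written out in the text and is included as an assumption of this sketch. Function names and the toy model are hypothetical, and `E[i][k]` is assumed to be a discrete emission table.

```python
def backward(A, E, Y):
    """Return the list of backward vectors beta[t][i] for t = 0 .. T-1."""
    N, T = len(A), len(Y)
    # initialization: beta_T(i) = 1 for every state
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            # induction (2.17): transition to j, emit y_{t+1}, continue with beta_{t+1}(j)
            beta[t][i] = sum(
                A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j] for j in range(N)
            )
    return beta


# Hypothetical two-state model with two observation symbols.
tau = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
E = [[0.9, 0.1], [0.2, 0.8]]
Y = [0, 1, 0]

beta = backward(A, E, Y)
# termination (assumed, standard): combine beta_1 with the initial step
likelihood = sum(tau[i] * E[i][Y[0]] * beta[0][i] for i in range(len(tau)))
print(likelihood)
```

On this toy model the value matches the one obtained from the forward term, which is the consistency the text relies on when it says either term can solve the first problem.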

Solution to Problem 2: The Decoding problem deals with finding the optimal states, which maximize some score. This differs from Evaluation, which has a single exact solution; in Decoding, different solutions are possible depending on the criterion of optimality. One choice is to select the states that maximize the individual likelihood, i.e. Pr(rt|yt, φ), of the observations. But this criterion does not establish any link between consecutive states. A problem can therefore arise when an individual state has zero or low probability, or when the transition to the next individually most likely state has zero or low probability. Although considering pairs or triplets of states could be a reasonable remedy, the most widely used technique is to choose the single best path in terms of likelihood. In this case, one path is chosen so that Pr(R|Y, φ) is maximized. The procedure that finds this path is known as the Viterbi algorithm [32].

The Viterbi algorithm is based on dynamic programming. It relies on the following fact: if policy A is a sub-policy of policy B and A has a best solution s, then the best solution of B must contain s. A score variable χ is defined as the highest probability along a single path for the first t observations, ending at state Si:

$$\chi_t(i) = \max_{r_1, r_2, \ldots, r_{t-1}} \Pr(r_1, r_2, \ldots, r_t = S_i,\; y_1, y_2, \ldots, y_t \mid \phi) \tag{2.18}$$

The variable is initialized by:

$$\chi_1(i) = \tau_i\, e_i(y_1), \qquad 1 \le i \le N \tag{2.19}$$

and calculated inductively by:

$$\chi_{t+1}(j) = \Big[ \max_{i}\; \chi_t(i)\, a_{ij} \Big]\, e_j(y_{t+1}), \qquad 1 \le j \le N \tag{2.20}$$

It should be noted that the Viterbi algorithm works in almost the same way as the forward procedure; the summation over previous states in equation (2.15) is replaced by a maximization in equation (2.20). At any time t, there are N "best" score variables, one per state. For the next time step, each of the N states can be reached via the N best paths, and only the path with the best score is retained. When the procedure terminates at t = T, the single best path is retrieved by backtracking through the states.
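The recursion (2.19) and (2.20), together with the backtracking step described above, can be sketched as follows. The back-pointer table `back` is a bookkeeping device introduced for this sketch and is not named in the text; the function name, the discrete emission table `E[i][k]` and the toy numbers are likewise assumptions. Ties are broken in favour of the lowest state index.

```python
def viterbi(tau, A, E, Y):
    """Return (best state path, its joint probability with Y)."""
    N, T = len(tau), len(Y)
    # initialization (2.19)
    chi = [[tau[i] * E[i][Y[0]] for i in range(N)]]
    back = []  # back[t-1][j] = best predecessor of state j at time t (hypothetical name)
    for t in range(1, T):
        row, ptr = [], []
        for j in range(N):
            # induction (2.20): keep only the best incoming path for each state j
            best_i = max(range(N), key=lambda i: chi[-1][i] * A[i][j])
            ptr.append(best_i)
            row.append(chi[-1][best_i] * A[best_i][j] * E[j][Y[t]])
        chi.append(row)
        back.append(ptr)
    # termination: best final state, then follow the back-pointers
    last = max(range(N), key=lambda i: chi[-1][i])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, chi[-1][last]


# Hypothetical two-state model with two observation symbols.
tau = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
E = [[0.9, 0.1], [0.2, 0.8]]
Y = [0, 1, 0]

path, score = viterbi(tau, A, E, Y)
print(path, score)
```

Only the best predecessor of each state is stored per time step, so memory grows as N·T rather than with the number of candidate paths.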

Solution to Problem 3: The third problem associated with HMM deals with finding the model parameters φ = {A, E, τ} that maximize the likelihood of the observations. Although there is no analytical way to find the maximum-likelihood model parameters [31], a local maximum can be reached using the Expectation Maximization (EM) algorithm, an iterative procedure classically developed by Baum et al. [33].

To adjust the model parameters, a new term ωt(i, j) is first introduced. It is defined as the joint probability of being in state Si at time t and in state Sj at time t + 1, given the observation sequence and the model parameters:

$$\omega_t(i, j) = \Pr(r_t = S_i,\; r_{t+1} = S_j \mid Y, \phi) \tag{2.21}$$

Now ωt(i, j) can be written in terms of forward and backward terms:

$$\omega_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, e_j(y_{t+1})\, \beta_{t+1}(j)}{\Pr(Y \mid \phi)} \tag{2.22}$$

The numerator of equation (2.22) gives the joint probability of the observation sequence and of being in Si at time t and in Sj at time t + 1, for the given model. The denominator normalizes this into the required conditional probability. It can be written as a double summation of the numerator term over all the states, as ∑i ∑j αt(i) aij ej(yt+1) βt+1(j), with i and j each running from 1 to N. A second term, µt(i), is defined as the probability of being in state Si at time t for the given observation sequence and model parameters:

$$\mu_t(i) = \Pr(r_t = S_i \mid Y, \phi) \tag{2.23}$$

The two conditional probability terms, ω and µ, can now be related. The former evaluates the joint probability of being in two given states at two consecutive times. Summing ωt(i, j) over all the states Sj at time t + 1 therefore gives the probability of being in state Si at time t. Mathematically,

$$\mu_t(i) = \sum_{j=1}^{N} \omega_t(i, j), \qquad 1 \le t \le T-1 \tag{2.24}$$

Both probability terms, ω and µ, give useful insights. Summing µt(i) over the entire time interval, that is t = 1 to t = T, gives the expected number of times that state Si is visited. Excluding t = T from the time interval, the summation can be interpreted as the expected number of transitions from state Si. Likewise, summing ωt(i, j) over the time interval t = 1 to t = T − 1 gives the expected number of transitions from state Si to Sj. That is,

$$\sum_{t=1}^{T-1} \mu_t(i) = \text{expected number of transitions from } S_i \tag{2.25}$$

$$\sum_{t=1}^{T-1} \omega_t(i, j) = \text{expected number of transitions from } S_i \text{ to } S_j \tag{2.26}$$
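The quantities ω and µ and the expected counts of equations (2.25) and (2.26) can be computed from the forward and backward terms, as in the sketch below. Here µt(i) is obtained as αt(i)βt(i)/Pr(Y|φ), a standard identity that the text does not spell out and that is used here as an assumption; the function names, the discrete emission table `E[i][k]` and the toy model are also hypothetical.

```python
def forward(tau, A, E, Y):
    N = len(tau)
    alpha = [[tau[i] * E[i][Y[0]] for i in range(N)]]
    for t in range(1, len(Y)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * E[j][Y[t]]
                      for j in range(N)])
    return alpha


def backward(A, E, Y):
    N, T = len(A), len(Y)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta


def posteriors(tau, A, E, Y):
    """Return omega[t][i][j] for t < T-1 and mu[t][i] for all t (0-based t)."""
    N, T = len(tau), len(Y)
    alpha, beta = forward(tau, A, E, Y), backward(A, E, Y)
    pY = sum(alpha[-1])
    # equation (2.22)
    omega = [[[alpha[t][i] * A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j] / pY
               for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # assumed identity: mu_t(i) = alpha_t(i) * beta_t(i) / Pr(Y | phi)
    mu = [[alpha[t][i] * beta[t][i] / pY for i in range(N)] for t in range(T)]
    return omega, mu


# Hypothetical two-state model with two observation symbols.
tau = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
E = [[0.9, 0.1], [0.2, 0.8]]
Y = [0, 1, 0]
N, T = len(tau), len(Y)

omega, mu = posteriors(tau, A, E, Y)
# expected counts, equations (2.25) and (2.26)
expected_from = [sum(mu[t][i] for t in range(T - 1)) for i in range(N)]
expected_trans = [[sum(omega[t][i][j] for t in range(T - 1)) for j in range(N)]
                  for i in range(N)]
print(expected_from, expected_trans)
```

Summing `expected_from` over all states gives T − 1, since exactly one transition leaves some state at each of the first T − 1 time steps.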

Using the above formulas and the concept of counting, a method can be developed for readjusting the HMM parameters. The model parameters are re-estimated with the following set of formulas:

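$$\bar{\tau}_i = \text{expected number of times state } S_i \text{ is visited at time } t = 1 \;=\; \mu_1(i) \tag{2.27a}$$

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to } S_j}{\text{expected number of transitions from state } S_i} = \frac{\sum_{t=1}^{T-1} \omega_t(i, j)}{\sum_{t=1}^{T-1} \mu_t(i)} \tag{2.27b}$$

$$\bar{e}_j(v) = \frac{\text{expected number of times in state } S_j \text{ while observing } v}{\text{expected number of times in state } S_j} = \frac{\sum_{t=1,\; y_t = v}^{T} \mu_t(j)}{\sum_{t=1}^{T} \mu_t(j)} \tag{2.27c}$$

where v denotes a possible observation value.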

Based on equations (2.27a)-(2.27c), the model parameters can be readjusted iteratively. It has been shown by Baum et al. that if a new set of model parameters φ̄ = {Ā, Ē, τ̄} is computed through equations (2.27a)-(2.27c), then either 1) it is equal to the existing model parameters φ = {A, E, τ}, or 2) it gives a better likelihood [34], that is, Pr(Y|φ̄) is greater than Pr(Y|φ). The model therefore converges to better parameters through successive iterations.
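One re-estimation pass can be sketched as below, repeated a few times to observe that the likelihood does not decrease. The sketch assumes discrete observation symbols, so the emission update counts, for each state, how often each symbol is observed; it also uses the standard identity µt(i) = αt(i)βt(i)/Pr(Y|φ). The function names, the starting parameters and the observation sequence are hypothetical, and no scaling is applied, so the code is suitable only for short sequences.

```python
def forward(tau, A, E, Y):
    N = len(tau)
    alpha = [[tau[i] * E[i][Y[0]] for i in range(N)]]
    for t in range(1, len(Y)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * E[j][Y[t]]
                      for j in range(N)])
    return alpha


def backward(A, E, Y):
    N, T = len(A), len(Y)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta


def reestimate(tau, A, E, Y):
    """One pass of the re-estimation formulas; returns (tau, A, E) updated."""
    N, T, U = len(tau), len(Y), len(E[0])
    alpha, beta = forward(tau, A, E, Y), backward(A, E, Y)
    pY = sum(alpha[-1])
    mu = [[alpha[t][i] * beta[t][i] / pY for i in range(N)] for t in range(T)]
    omega = [[[alpha[t][i] * A[i][j] * E[j][Y[t + 1]] * beta[t + 1][j] / pY
               for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_tau = list(mu[0])
    # expected transitions i -> j divided by expected transitions out of i
    new_A = [[sum(omega[t][i][j] for t in range(T - 1)) /
              sum(mu[t][i] for t in range(T - 1)) for j in range(N)]
             for i in range(N)]
    # discrete-emission assumption: expected visits to j while symbol k is seen
    new_E = [[sum(mu[t][j] for t in range(T) if Y[t] == k) /
              sum(mu[t][j] for t in range(T)) for k in range(U)]
             for j in range(N)]
    return new_tau, new_A, new_E


# Hypothetical starting model and observation sequence.
tau = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
E = [[0.9, 0.1], [0.2, 0.8]]
Y = [0, 1, 0, 0, 1, 1, 0]

likelihoods = [sum(forward(tau, A, E, Y)[-1])]
for _ in range(5):
    tau, A, E = reestimate(tau, A, E, Y)
    likelihoods.append(sum(forward(tau, A, E, Y)[-1]))
print(likelihoods)
```

Each pass keeps τ, the rows of A and the rows of E normalized, because every update is a ratio of expected counts with a matching denominator.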
