Lecture
Lecture Notes:
Notes: Introduction
Introduction to
to Hidden
Hidden
Markov Models
Markov Models
Introduction
Introduction
A Hidden Markov Model (HMM), as the name suggests, is a Markov model in which the A Hidden Markov Model (HMM), as the name suggests, is a Markov model in which the states cannot be observed but symbols that are consumed or produced by transition are states cannot be observed but symbols that are consumed or produced by transition are observable.
observable. A speech generation system mA speech generation system might, for example, be implemented ight, for example, be implemented as a as a HMMHMM and speak
and speak a word a word as it as it transitions from one transitions from one state to state to another. another. Similarly a Similarly a speechspeech understanding system might
understanding system might “recognize” a word “recognize” a word on a on a transition. transition. In this In this sense HMM’ssense HMM’s can be thought of as
can be thought of as generative or as interpretative. generative or as interpretative. The HMM is the The HMM is the same but in same but in oneone case the transitions emit symbols (such as words) and in the other case consumes case the transitions emit symbols (such as words) and in the other case consumes symbols.
symbols. We We will therefore will therefore treat observations treat observations and and actions interchangeably in actions interchangeably in thethe foregoing.
foregoing.
Hidden Markov Models and simple extensions of them are very popular in a variety of Hidden Markov Models and simple extensions of them are very popular in a variety of fields including computer vision, natural language understanding, and speech recognition fields including computer vision, natural language understanding, and speech recognition and synthesis (to name a
and synthesis (to name a few). few). Often HMM’s are a natural way Often HMM’s are a natural way of modeling a of modeling a systemsystem and in other cases they
and in other cases they are “force-fit” to a problem to are “force-fit” to a problem to which they are not which they are not quite ideal. quite ideal. TheThe immense popularity of HMM’s is that very fast, linear time, algorithms exist for some of immense popularity of HMM’s is that very fast, linear time, algorithms exist for some of the
the most most important HMM important HMM problems. problems. This This allows, for allows, for example, speech example, speech recognitionrecognition systems to operate in real-time.
systems to operate in real-time.
A HMM is defined as the four tuple <s
A HMM is defined as the four tuple <s11,S,W,E> where s,S,W,E> where s11 is the start state, S is the set of is the start state, S is the set of
states, W is the set of
states, W is the set of observation symbols, observation symbols, and E is and E is the set of transitions. A transition isthe set of transitions. A transition is also a four tuple such as <s
also a four tuple such as <s22,”had”, s,”had”, s33, 0.3>. , 0.3>. This example This example described a described a transition fromtransition from
state s
state s22 to sto s33 in which the word “had” is either emitted or consumed and the probability of in which the word “had” is either emitted or consumed and the probability of
taking the trans
taking the transition is 0.3. ition is 0.3. We will usually write a tranWe will usually write a transition as T(ssition as T(s22,”had”, s,”had”, s33, 0.3) or as:, 0.3) or as:
3 3 .. 0 0 )) (( 33 "" "" 2 2
→
→
ss==
ss P P had hadSometimes we do not know the
Sometimes we do not know the starting state but we have a pdf for the starting state but we have a pdf for the starting state. starting state. WeWe can therefore, with improved generality replace s
can therefore, with improved generality replace s11 with a with a starting starting state state pdf. pdf. We We willwill
continue to assume that we know the starting state for the remainder of this discussion in continue to assume that we know the starting state for the remainder of this discussion in order to simplify examples but generalizing the starting condition to a pdf adds no order to simplify examples but generalizing the starting condition to a pdf adds no additional complexity to the algorithms presented.
aa
b
b
“1” 0.48
“1” 0.48
“0” 0.48
“0” 0.48
“0” 0.04
“0” 0.04
“1” 1.0
“1” 1.0
aa
b
b
“1” 0.48
“1” 0.48
“0” 0.48
“0” 0.48
“0” 0.04
“0” 0.04
“1” 1.0
“1” 1.0
Consider the, very simple, example below: Consider the, very simple, example below:
The transitions are depicted as arcs that indicate the symbol that is consumed or emitted, The transitions are depicted as arcs that indicate the symbol that is consumed or emitted, in quotes, as the transition is taken and a number that indicates the probability that a in quotes, as the transition is taken and a number that indicates the probability that a transition is taken.
transition is taken. A couple of
A couple of points are worth noting points are worth noting at this point. at this point. First, the probabilities of all transitionsFirst, the probabilities of all transitions from a state must sum to 1.0 and second, multiple transitions out of a state can occur with from a state must sum to 1.0 and second, multiple transitions out of a state can occur with the same symbol.
the same symbol. Looking at state “a” in the Looking at state “a” in the figure above shows that the symbol “0” figure above shows that the symbol “0” cancan be consumed by
be consumed by two different transitionstwo different transitions. . One of them changes One of them changes the state to “b” while thethe state to “b” while the other leaves the system
other leaves the system in state “a”. in state “a”. It is because It is because of this that of this that the states cannot be the states cannot be knownknown deterministical
deterministically. ly. If If transitions could transitions could be be uniquely identified uniquely identified by by the the symbol symbol theythey produce/consume we would track, with certainty, the current state of the system.
produce/consume we would track, with certainty, the current state of the system. The above example can be described as
The above example can be described as <s<s11, S, W, E>, S, W, E>
Where: Where: S S11= a= a S = { a, b } S = { a, b } W = { 0, 1 } W = { 0, 1 } E =
A sentence parsing example
A sentence parsing example
Let’s look at a more interesting example. Let’s look at a more interesting example.
0.5 0.5
ss11 ““MMaarryy”” ss22 ““HHaadd”” ss33 “A”“A” ss44 ““LLiittttllee”” ss55 ““LLaammbb”” ss66
“A” “A” ss88 “.” “.” ss77 ““CCuurrrryy”” ““..”” “And” “And” ““BBiigg”” ““DDoogg”” “And” “And” “Hot” “Hot” 0 0..44 00..44 “Roger” “Roger” 0.3 0.3 “Ordered” “Ordered” 0.3 0.3 0.5 0.5 0.5 0.5 0.4 0.4 0.1 0.1 0.5 0.5 0.5 0.5 0.5 0.5 0.3 0.3 0.4 0.4 0.3 0.3 0.5 0.5 “Cooked” “Cooked” 0.3 0.3 “John” “John” 0.3 0.3 0.5 0.5
ss11 ““MMaarryy”” ss22 ““HHaadd”” ss33 “A”“A” ss44 ““LLiittttllee”” ss55 ““LLaammbb”” ss66
“A” “A” ss88 “.” “.” ss77 ““CCuurrrryy”” ““..”” “And” “And” ““BBiigg”” ““DDoogg”” “And” “And” “Hot” “Hot” 0 0..44 00..44 “Roger” “Roger” 0.3 0.3 “Ordered” “Ordered” 0.3 0.3 0.5 0.5 0.5 0.5 0.4 0.4 0.1 0.1 0.5 0.5 0.5 0.5 0.5 0.5 0.3 0.3 0.4 0.4 0.3 0.3 0.5 0.5 “Cooked” “Cooked” 0.3 0.3 “John” “John” 0.3 0.3 The HMM
The HMM depicted above defines depicted above defines a very a very simple sentence generator. simple sentence generator. Notice that fromNotice that from state s
state s33 there are two transitions that emit the there are two transitions that emit the word “A”. word “A”. Imagine that, starting from stateImagine that, starting from state
ss11 the words “Roger Ordered A” the words “Roger Ordered A” had been emitted. had been emitted. We know that the We know that the first state was sfirst state was s11,,
the second state was s
the second state was s22, and the third state was s, and the third state was s33 because given the emitted words there isbecause given the emitted words there is
no ambiguity.
no ambiguity. After the word “A” After the word “A” has been emhas been emitted however we do itted however we do not know whethernot know whether the system is in state s
the system is in state s44 or or ss55 and, given the transition probabilities, would rationallyand, given the transition probabilities, would rationally
assign equal likelihood to each of
assign equal likelihood to each of those states. those states. As it happens the As it happens the ambiguity is resolved asambiguity is resolved as soon as the next word is
soon as the next word is emitted. emitted. If the next word If the next word is “Hot” we know that the is “Hot” we know that the current statecurrent state is S
is S55 and the preceding state that had formerly been ambiguous must have been Sand the preceding state that had formerly been ambiguous must have been S 44..
Examples of some of the sentences that can be produced by this model are: Examples of some of the sentences that can be produced by this model are: S
S11: Mary had a little Lamb and a big dog.: Mary had a little Lamb and a big dog.
S
S22: Roger ordered a lamb curry and a hot dog.: Roger ordered a lamb curry and a hot dog.
S
S33: John cooked a hot dog curry.: John cooked a hot dog curry.
We can calculate the probability that a word sequence, such as S
We can calculate the probability that a word sequence, such as S 33, be emitted by this, be emitted by this
HMM by multiplying the probabilities of the transitions taken in order to produce the HMM by multiplying the probabilities of the transitions taken in order to produce the sentence.
sentence. P(S
P(S33)=0.3*0.3*0.5*0.5*0.3*0.5=0.003375)=0.3*0.3*0.5*0.5*0.3*0.5=0.003375
We can
We can do this because we do this because we know that the transitions are know that the transitions are conditionally independentconditionally independent. . ThisThis is
is the the Markovian assumption. Markovian assumption. SS33 is a six word sequence and we will often refer to anis a six word sequence and we will often refer to an nn
symbol sequence as w symbol sequence as w1,n1,n..11
1
Finding out the probability of a sequence of transitions is sometimes a useful thing to do, Finding out the probability of a sequence of transitions is sometimes a useful thing to do, as we will soon see, but it is by no m
as we will soon see, but it is by no means the only thing that we wish to do. eans the only thing that we wish to do. Finding theFinding the probability of a sequence of observations or actions is referred to as
probability of a sequence of observations or actions is referred to as EvaluationEvaluation and canand can be computed in both the forward and the reverse direction (i.e. working forward from the be computed in both the forward and the reverse direction (i.e. working forward from the first observation, or working back
first observation, or working back from the from the last observation). last observation). We will We will shortly introduceshortly introduce algorithms for doing this but first let consider some other things that we wish to compute. algorithms for doing this but first let consider some other things that we wish to compute. We have already alluded to the problem of inferring the state sequence from the We have already alluded to the problem of inferring the state sequence from the observation.
observation. In general we In general we cannot know for cannot know for sure what sequence sure what sequence of states will of states will be takenbe taken for a given sequence of observations although sometimes, such as in the above example, for a given sequence of observations although sometimes, such as in the above example, we can.
we can. In general the best that we In general the best that we can do is to can do is to estimate the most likely state trajectoryestimate the most likely state trajectory for a given
for a given sequence of observations. sequence of observations. This problem is often This problem is often referred to asreferred to as DecodingDecoding.. Another common requirement is to learn the probabilities associated with transitions in Another common requirement is to learn the probabilities associated with transitions in the system by being given a representative training sequence rather than being given the the system by being given a representative training sequence rather than being given the transition probabilities directly. The idea is to find the set of transition probabilities that transition probabilities directly. The idea is to find the set of transition probabilities that maximizes the likelihood of the training sequence. Not surprisingly this is the
maximizes the likelihood of the training sequence. Not surprisingly this is the LearningLearning
problem. problem.
HMM Decoding -
HMM Decoding -
Finding the most likely path
Finding the most likely path
We beginWe begin with the decoding problem by introducing the Viterbi algorithm for finding the most with the decoding problem by introducing the Viterbi algorithm for finding the most likely sequence of states given a set of observations.likely sequence of states given a set of observations.
))
||
((
max
max
arg
arg
))
((
11,, 11,, 11 ,, 1 1 −−==
t t t t ssw
w
ss
P
P
t
t
t tσ
σ
A sequence ofA sequence of t-1t-1 observations will result in a sequence of observations will result in a sequence of tt states. states. In thIn the fore foregoing egoing wewe represent the most likely sequence of
represent the most likely sequence of tt states given a sequence of observations wstates given a sequence of observations w1,t-11,t-1 asas
σ
σ
(t). (t). We We will will use use the the operatoroperator oo to concatenate new states onto an existing stateto concatenate new states onto an existing state sequence.sequence. So So for for example, example, if if
σ
σ
(3) is the state sequence s(3) is the state sequence s11 ss22 ss33,,σ
σ
(3) o s(3) o s44 is the sequenceis the sequencess11 ss22 ss33 ss44..
To illustrate the Viterbi algorithm we will use the HMM depicted in the figure below. To illustrate the Viterbi algorithm we will use the HMM depicted in the figure below.
aa
b
b
“0” 0.3 “0” 0.3 “1” 0.1 “1” 0.1 “0” 0.2 “0” 0.2 “1” 0.1 “1” 0.1 “0” 0.2 “0” 0.2 “1” 0.5 “1” 0.5 “0” 0.4 “0” 0.4 “1” 0.2 “1” 0.2 aab
b
“0” 0.3 “0” 0.3 “1” 0.1 “1” 0.1 “0” 0.2 “0” 0.2 “1” 0.1 “1” 0.1 “0” 0.2 “0” 0.2 “1” 0.5 “1” 0.5 “0” 0.4 “0” 0.4 “1” 0.2 “1” 0.2Given the known starting state
Given the known starting state aa we know that before any observations are made thatwe know that before any observations are made that P(
either
either aaaa oror abab. . At each state a At each state a state might transition to one of state might transition to one of many different states onemany different states one might therefore naively assume that the algorithm would be exponential because there is might therefore naively assume that the algorithm would be exponential because there is a branching factor
a branching factor involved at each involved at each observation. observation. Surprisingly and delightfully this is Surprisingly and delightfully this is notnot the case this is not the case and the algorithm is, as we advertised earlier, linear! The the case this is not the case and the algorithm is, as we advertised earlier, linear! The reasoning is as follows:
reasoning is as follows:
We are not interested in finding all possible state sequences—just the most likely state We are not interested in finding all possible state sequences—just the most likely state sequence.
sequence. At an arbitrary point before all At an arbitrary point before all of the observations have of the observations have been seen we been seen we cannotcannot know what is the
know what is the most likely sequence. most likely sequence. What is currently the mWhat is currently the most likely sequence mightost likely sequence might be eliminated completely when the next observations comes along if there is no transition be eliminated completely when the next observations comes along if there is no transition that extends the
that extends the previously most likely previously most likely sequence with the sequence with the observation. observation. Consequently theConsequently the currently most promising sequence might seem very unlikely if the latest transition has a currently most promising sequence might seem very unlikely if the latest transition has a very low probability or
very low probability or impossible if there is impossible if there is no such no such transition. transition. Imaging the case Imaging the case wherewhere the only transition that could possibly account for the latest observation is from one state. the only transition that could possibly account for the latest observation is from one state. We will need to have maintained, in our algorithm, a sequence that ends in that state so We will need to have maintained, in our algorithm, a sequence that ends in that state so that we can
that we can extend it with the latest observation. extend it with the latest observation. We must We must therefore maintain for everytherefore maintain for every state the most likely sequence that ends
state the most likely sequence that ends in that state. in that state. There may of There may of course be many course be many otherother ways of ending in that state but since we are looking for the most likely sequence there is ways of ending in that state but since we are looking for the most likely sequence there is nothing to be gained by maintaining anything but the one that has the highest probability. nothing to be gained by maintaining anything but the one that has the highest probability. The Viterbi algorithm therefore works by maintaining for each state:
The Viterbi algorithm therefore works by maintaining for each state: 1.
1. The most The most likely sequence of likely sequence of states that ends states that ends in that in that state, andstate, and 2.
2. The probability The probability of that of that sequence. Sometimes sequence. Sometimes the probability the probability will be will be zero.zero.
The table below shows the steps of the Viterbi algorithm as it processes the observations The table below shows the steps of the Viterbi algorithm as it processes the observations (starting from
(starting from
εε
the empty sequence of the empty sequence of observations). observations). At each At each step, for step, for each state, weeach state, we calculate the probability of extending the previous state sequences so as to end in the calculate the probability of extending the previous state sequences so as to end in the state in question.state in question. So we can get from So we can get from state b to state b state b to state b with an observation of “1” usingwith an observation of “1” using the transition T(b,”1”,b,0.5) but since the probability of the state sequence
the transition T(b,”1”,b,0.5) but since the probability of the state sequence bb was 0.0 thewas 0.0 the probability of the sequence
probability of the sequence bbbb is also 0.0 whereas the probability of the sequenceis also 0.0 whereas the probability of the sequence abab isis 1.0*0.1=0.
1.0*0.1=0.1. 1. As you As you can see can see the table grows the table grows linearly with the linearly with the number of number of observationsobservations and only the previous time step needs to be stored between iterations so the space utilized and only the previous time step needs to be stored between iterations so the space utilized
is constant and the time grows linearly with observations. is constant and the time grows linearly with observations.
0.005
0.005
0.025
0.025
0.05
0.05
0.1
0.1
0.0
0.0
Probability
Probability
abbbb
abbbb
abbb
abbb
abb
abb
ab
ab
b
b
Sequence
Sequence
b
b
0.005
0.005
0.008
0.008
0.04
0.04
0.2
0.2
1.0
1.0
Probability
Probability
abbba
abbba
aaaa
aaaa
aaa
aaa
aa
aa
aa
Sequence
Sequence
aa
1110
1110
111
111
11
11
1
1
εε
States
States
0.005
0.005
0.025
0.025
0.05
0.05
0.1
0.1
0.0
0.0
Probability
Probability
abbbb
abbbb
abbb
abbb
abb
abb
ab
ab
b
b
Sequence
Sequence
b
b
0.005
0.005
0.008
0.008
0.04
0.04
0.2
0.2
1.0
1.0
Probability
Probability
abbba
abbba
aaaa
aaaa
aaa
aaa
aa
aa
aa
Sequence
Sequence
aa
1110
1110
111
111
11
11
1
1
εε
States
States
Formally, we can describe the Viterbi algorithm as follows: Formally, we can describe the Viterbi algorithm as follows: For a sequence of T visible actions W
For a sequence of T visible actions WTT and a HMM with c hidden states.and a HMM with c hidden states.
Let
Let
σ
σ
ιι(t) be the maximum likelihood path that accounts for the observation/action(t) be the maximum likelihood path that accounts for the observation/action sequence Wsequence WTT and which ends in state sand which ends in state sii.Below is the pseudo code for the Viterbi.Below is the pseudo code for the Viterbi
algorithm. algorithm.
))
((
))
))
((
((
max
max
arg
arg
,,
))
((
))
1
1
((
))
1
1
((
1 1 ii w w k k k k cc k k ii j j ii ii iiss
ss
P
P
t
t
P
P
j
j
ss
t
t
t
t
ss
t t→
→
==
==
++
==
==σ
σ
σ
σ
σ
σ
σ
σ
o oHMM Evaluation -
HMM Evaluation -
HMM forward probabilities
HMM forward probabilities
Using the same HMM as before, repeated below for convenience, consider the Using the same HMM as before, repeated below for convenience, consider the observation sequence “1110”.
observation sequence “1110”. We wish We wish to calculate the to calculate the probability of that probability of that sequencesequence being generated by the HMM.
being generated by the HMM.
We begin by considering the observations in the order that they are observed—the We begin by considering the observations in the order that they are observed—the so-called forward probabilities.
called forward probabilities. In this algorithm we In this algorithm we are not concerned are not concerned with a sequence with a sequence of of states.
states. Instead, we want to know Instead, we want to know the sum of the sum of the probabilities of all paths that the probabilities of all paths that account foraccount for the sequence of
the sequence of observations. observations. In principle the In principle the final state can final state can be any be any of the of the states. states. LetLet
α
α
ιι (t)(t) be the probability P(wbe the probability P(w1,t-11,t-1,s,stt=s=sii. . Clearly Clearly if if we we knowknow
α
α
ιι(t) for all states(t) for all states ii we canwe cancalculate the probabilities of
calculate the probabilities of
α
α
ιι (t+1) for all states(t+1) for all states ii by simply extending computing andby simply extending computing and summing allsumming all extensions of the extensions of the previous states. previous states. This is This is precisely what we precisely what we do for do for thethe forward-probabi
forward-probabilities lities algorithm. algorithm. To To get get P(wP(w1,t1,t) therefore we simply sum the probabilities) therefore we simply sum the probabilities of all sequences ending in each of the states.
of all sequences ending in each of the states.
The table below shows the steps of the forward-probabilities algorithm for the sequence The table below shows the steps of the forward-probabilities algorithm for the sequence of observations “1110”.
of observations “1110”. Consider the entry for
Consider the entry for
α
α
ii (3)=0.07. (3)=0.07. This is calculated by This is calculated by extending the path extending the path that ends that ends inin statepath that ends in state
path that ends in state bb with the with the transition transition “1” with “1” with probability 0.5 probability 0.5 hence 0.1*0.5=0.05.hence 0.1*0.5=0.05. Summing these two routes yields 0.02+0.05=0.07.
Summing these two routes yields 0.02+0.05=0.07.
More formally we can describe this algorithm as follows: More formally we can describe this algorithm as follows: Let
Let
α
α
ii(t)(t) be the probability P(wbe the probability P(w1,t-11,t-1,s,stt=s=sii).).∑
∑
====
==
++ cc ii iiss
S
S
w
w
P
P
w
w
P
P
T T T T T T 1 1))
,,
((
))
((
11,, 11,, 11The base case for our recursive definition is the starting state before any observations are The base case for our recursive definition is the starting state before any observations are
seen.
seen. We must We must start in the start in the start state so:start state so:
∑
∑
==++
==
cc ii ii T TT
T
w
w
P
P
1 1 ,, 1 1))
1
1
((
))
((
α
α
⎩⎩
⎨⎨
⎧⎧
→
→
→
→
==
==
==
==
0 0 0 0 .. 1 1 1 1 )) ,, (( )) 1 1 (( 11,,00 11 otherwise otherwise ii ss ss w w P P ii ii α αand the recursive step extends it and the recursive step extends it
∑
∑
== −− ++==
==
==
→
→
==
++
cc ii j j w w ii ii t t t t j j t t t t j j ss ss P P ss ss w w P P ss ss w w P P t t t t 1 1 1 1 ,, 1 1 1 1 ,, 1 1 )) (( )) ,, (( )) ,, (( )) 1 1 (( α αThe pseudo code for the forward probabilities is very given below:For a sequence of T The pseudo code for the forward probabilities is very given below:For a sequence of T visible actions V
visible actions VTT and a HMM with c hidden states where tprob(a,b,c) is the transitionand a HMM with c hidden states where tprob(a,b,c) is the transition
probability for transitioning from state a to state b with observation/action c probability for transitioning from state a to state b with observation/action c
Backward Probabilities
Backward Probabilities
Backward probabilities work in exactly the same way as forward probabilities but we Backward probabilities work in exactly the same way as forward probabilities but we start from
would ever want to do such a
would ever want to do such a thing. thing. It turns out that it is It turns out that it is useful for solving the traininguseful for solving the training problem, which we will describe next.
problem, which we will describe next. We can define
We can define
ββ
ii(t)(t) just as we did just as we did forforα
α
ii(t) but starting from the end.(t) but starting from the end.)) || (( )) (( ,, ii t t T T t t ii ss ss w w P P t t
≡≡
==
β βNotice that it is the starting state that is constrained to be s
Notice that it is the starting state that is constrained to be sii and not, as it was forand not, as it was for
α
α
ii(t), the(t), theending state. ending state.
It is interesting to note that It is interesting to note that
)) (( )) || (( )) 1 1 (( ,, 1 1 1 1 1 1 ,, 1 1 1 1 T T T T w w P P ss ss w w P P
==
==
==
β βwhich follows because a sequence of one state must be the start state. which follows because a sequence of one state must be the start state.
The base case of our recursive definition is similar to before but starts, as we would The base case of our recursive definition is similar to before but starts, as we would expect, from the end.
expect, from the end.
1 1 )) || (( )) 1 1 ((
++
==
T T ++11==
ii==
ii ss ss P P T T ε ε β βThe starting state of the
The starting state of the entire sequence is, by definition, the start state. entire sequence is, by definition, the start state. W can W can work back work back with our recursive definition just as we did
with our recursive definition just as we did before. before. Notice that we are Notice that we are using state susing state s j j, the, the
target state of the transitions, rather than s
target state of the transitions, rather than sii in this definition.in this definition.
)) || (( )) (( )) (( )) (( )) 1 1 (( ,, 1 1 1 1 1 1 1 1 j j t t T T t t cc j j j j w w ii j j cc j j j j w w ii j j ss ss w w P P ss ss P P t t ss ss P P t t t t t t
==
→
→
==
→
→
==
−−
∑
∑
∑
∑
== == −− −− β β β βThe pseudo code for backward-probabilities is very similar to that of The pseudo code for backward-probabilities is very similar to that of forward-probabilities so we will leave it as an exercise for the reader.
HMM Training -
HMM Training -
(Baum-Welch Algorithm)
(Baum-Welch Algorithm)
We now
We now have enough apparatus have enough apparatus to attack the to attack the learning problem. learning problem. In the In the learning problemlearning problem we start out with the HMM definition withou
we start out with the HMM definition without the transition probabt the transition probabilities. ilities. Our goal is toOur goal is to find the set of transition probabilities that maximize the likelihood of the training find the set of transition probabilities that maximize the likelihood of the training sequence provided.
sequence provided. It should be It should be noted from noted from the outset that the outset that the algorithm described, thethe algorithm described, the Baum-Welch algorithm also known as the forward-backward algorithm does not Baum-Welch algorithm also known as the forward-backward algorithm does not guarantee to find the
guarantee to find the global maximum. global maximum. It finds the It finds the local maximum local maximum and as and as such itssuch its usefulness depends upon the HMM being trained.
usefulness depends upon the HMM being trained. Consider a very simple
Consider a very simple case of an case of an HMM with a single state and HMM with a single state and three transitions. three transitions. Such anSuch an HMM is
HMM is depicted below left without transition probabilities.depicted below left without transition probabilities.
aa
“0”
“0”
“1”
“1”
“2”
“2”
Training Sequence: 01010210
Training Sequence: 01010210
a 8
a 8
“0” 4
“0” 4
“1” 3
“1” 3
“2” 1
“2” 1
aa
“0” 0.5
“0” 0.5 “1” 0.375
“1” 0.375
“2” 0.125
“2” 0.125
aa
“0”
“0”
“1”
“1”
“2”
“2”
aa
“0”
“0”
“1”
“1”
“2”
“2”
Training Sequence: 01010210
Training Sequence: 01010210
a 8
a 8
“0” 4
“0” 4
“1” 3
“1” 3
“2” 1
“2” 1
a 8
a 8
“0” 4
“0” 4
“1” 3
“1” 3
“2” 1
“2” 1
aa
“0” 0.5
“0” 0.5 “1” 0.375
“1” 0.375
“2” 0.125
“2” 0.125
aa
“0” 0.5
“0” 0.5 “1” 0.375
“1” 0.375
“2” 0.125
“2” 0.125
Given a training sequence such as “01010210” we can play the observations through the Given a training sequence such as “01010210” we can play the observations through the model counting the number of times each transition is taken and the number of times model counting the number of times each transition is taken and the number of times each state takes a
each state takes a transition. transition. The result of this The result of this that with our trivial example is that with our trivial example is that state athat state a takes 8 transitions, the transition “0” is taken four times, the transition “1” is taken three takes 8 transitions, the transition “0” is taken four times, the transition “1” is taken three times, and the transition “2” is
times, and the transition “2” is taken just once (left). taken just once (left). We can We can obtain from these counts obtain from these counts aa consistent set of transition probabilities by dividing the number of times a transition is consistent set of transition probabilities by dividing the number of times a transition is taken by the number of transitions taken by the state from which the transition transitions taken by the number of transitions taken by the state from which the transition transitions (right).
(right).
More formally, If C is a function that counts the number of times that a transition is taken More formally, If C is a function that counts the number of times that a transition is taken when the training sequence is run through the model, we can estimate the transition when the training sequence is run through the model, we can estimate the transition probabiliti
probabilities as es as follows:follows:
∑
∑∑
∑
= = ==→
→
→
→
==
→
→
cc l l mm ll w w ii j j w w ii j j w w ii ee ss ss C C ss ss C C ss ss P P m m k k k k 1 1 11))
((
))
((
))
((
ω ωIt is fun to show prove that the transition probabilities resulting from the above algorithm It is fun to show prove that the transition probabilities resulting from the above algorithm do in fact
do in fact maximize the likelihood of maximize the likelihood of the training sequence. the training sequence. That proof is That proof is left as anleft as an exercise for the reader.
If we could deterministically follow transitions in this way we would be done but we If we could deterministically follow transitions in this way we would be done but we cannot because there may be, as we discussed earlier, multiple transitions out of a state cannot because there may be, as we discussed earlier, multiple transitions out of a state that have the same observation/action.
that have the same observation/action.
aa
“0”
“0”
“0”
“0”
cc
b
b
0.7
0.7
0.3
0.3
aa
“0”
“0”
“0”
“0”
cc
b
b
0.7
0.7
0.3
0.3
Consider the exampleConsider the example above. above. There are There are two transitions out of two transitions out of statestate aa for the observationfor the observation “0”.
“0”. One One has has probability 0.7 probability 0.7 and and the the other other 0.3. 0.3. This This suggests suggests a a solution. solution. Instead Instead of of counting the number of times a transition is taken count instead the prorated amount of counting the number of times a transition is taken count instead the prorated amount of transitions taken. So in this case we would count 0.7 for the transition to state b and 0.3 transitions taken. So in this case we would count 0.7 for the transition to state b and 0.3 for the transition to
for the transition to state c. state c. We would We would count 0.7+0.3=1.0 for the count 0.7+0.3=1.0 for the number of number of transitionstransitions taken out of
taken out of aa (instead of 2).(instead of 2).
This solves our problem of ambiguous transitions but one small problem remains: This solves our problem of ambiguous transitions but one small problem remains: WeWe don’t know the transition probabilities because they are what we are trying to learn!
don’t know the transition probabilities because they are what we are trying to learn!
It turns out that if we guess a set of transition values and then count the transitions using It turns out that if we guess a set of transition values and then count the transitions using the prorated scheme described above the resulting transition probabilities will have the prorated scheme described above the resulting transition probabilities will have improved somewhat towards a local maximum and so by iterating as long as an improved somewhat towards a local maximum and so by iterating as long as an improvement occurs we can find the local maximum.
improvement occurs we can find the local maximum. 1.Guess a set
1.Guess a set of transition probabilities.of transition probabilities. 2.while (improving) { 2.while (improving) { 3. propagate-training-sequences 3. propagate-training-sequences 4.} 4.}
There are two
There are two obvious ways of calculating “improving” in obvious ways of calculating “improving” in the pseudo code the pseudo code above. above. TheThe conceptually easiest way is to scan through the old and new transition probabilities to see conceptually easiest way is to scan through the old and new transition probabilities to see which has changed the most.
which has changed the most. When the maximum When the maximum change of any transition drops below achange of any transition drops below a predefined accuracy
predefined accuracy
θθ
the training loop can be the training loop can be exited. exited. The problem with The problem with this approach isthis approach is that it involves scanning through all transition probabilities—of which there may be a that it involves scanning through all transition probabilities—of which there may be a very large number. An alternative, and computationally less expensive approach, is to very large number. An alternative, and computationally less expensive approach, is to calculate the cross-entropy after each iteration. When the cross-entropy decreases by less calculate the cross-entropy after each iteration. When the cross-entropy decreases by less thanthan
θθ
we are done.we are done. Cross entropy is: Cross entropy is:∑
∑
−−−−
T T w w T T M M T T M M w w P P w w P P n n 11,, )) (( log log )) (( 1 1 11,, 2 2 ,, 1 1 1 1Where P
Where PM-1M-1is the probability as estimated by the previous iteration’s model and Pis the probability as estimated by the previous iteration’s model and PMMis theis the probability as estimated by the current iteration’s model. Cross-entropy is a common probability as estimated by the current iteration’s model. Cross-entropy is a common measure used to compare models and avoids the cost of scanning through all of the measure used to compare models and avoids the cost of scanning through all of the transitions.
transitions.
All that remains then is to formalize the calculation of the transition counts for the All that remains then is to formalize the calculation of the transition counts for the training sequence.
training sequence. Consider a
Consider a training sequence consisting of training sequence consisting of T observations. T observations. There will be There will be T+1 statesT+1 states chained together by T
chained together by T transitions. transitions. The prorated count The prorated count of a of a transition is the number transition is the number of of times that the
times that the transition was taken during transition was taken during the training sequence. the training sequence. We can We can calculate thecalculate the probability of the entire sequence P(w
probability of the entire sequence P(w1,T1,T) as the backward-probability) as the backward-probability
ββ
ii(1). (1). In In order order toto consider the transitions imagine that we cut the chain of transitions at some point t at consider the transitions imagine that we cut the chain of transitions at some point t at which a transitions takes the system from state swhich a transitions takes the system from state sii to to ss j j. . The The probability of probability of the the sequencesequence
can now be written as the product of the probability of the part preceding the transition can now be written as the product of the probability of the part preceding the transition
α
α
ii(t), the probability of the transition itself, and the probability of the rest of the sequence (t), the probability of the transition itself, and the probability of the rest of the sequence
ββ
j j(t+1).
(t+1). This probability is This probability is not the not the same assame as
ββ
ii(1) because it picks a particular transition at(1) because it picks a particular transition at time t.time t. If we summed If we summed over all possible transitions at time t we over all possible transitions at time t we would get P(wwould get P(w1,T1,T)=)=
ββ
ii(1).(1). We can divide by P(wWe can divide by P(w1,T1,T) to get the probability of that particular transition happening at) to get the probability of that particular transition happening at time step t and we can get the prorated transition count for that transition by summing time step t and we can get the prorated transition count for that transition by summing over all possible places in the sequence where that transition had an opportunity to over all possible places in the sequence where that transition had an opportunity to happen. happen.
∑
∑
==++
→
→
==
→
→
T T t t j j j j w w ii ii ii j j w w iiss
t
t
P
P
ss
ss
t
t
ss
C
C
k k k k 1 1))
1
1
((
))
((
))
((
))
1
1
((
1
1
))
((
α
α
β
β
β
β
We can use this simple equation to calculate the counts used in the equation We can use this simple equation to calculate the counts used in the equation
∑
∑∑
∑
= = ==→
→
→
→
==
→
→
cc l l mm ll w w ii j j w w ii j j w w ii ee ss ss C C ss ss C C ss ss P P m m k k k k 1 1 11))
((
))
((
))
((
ω ω (repeated from(repeated from above above for for convenience). convenience). Notice that Notice that the the 1/ 1/
ββ
ii(1) occurs in both the(1) occurs in both the numerator and the denominator and can thus be dropped from the count calculation.Baum-Welch Pseudo Code
Baum-Welch Pseudo Code
The pseudo code for this algorithm is shown below followed by a hand worked example. The pseudo code for this algorithm is shown below followed by a hand worked example.
Baum-Welch example
Baum-Welch example
Consider the following, very simple, example consisting of two states and four transitions Consider the following, very simple, example consisting of two states and four transitions with the
with the training sequence “01011”. training sequence “01011”. The probabilities on The probabilities on the transitions are the transitions are our “startingour “starting guess”. guess”.
aa
b
b
“1” 0.48
“1” 0.48
“0” 0.48
“0” 0.48
“0” 0.04
“0” 0.04
“1” 1.0
“1” 1.0
aa
b
b
“1” 0.48
“1” 0.48
“0” 0.48
“0” 0.48
“0” 0.04
“0” 0.04
“1” 1.0
“1” 1.0
First we calculateFirst we calculate
α
α
for all of our states (for all of our states (aa andand bb) for all of the steps of our training) for all of the steps of our training sequence as follows (all hand calculated numbers have been rounded for readability): sequence as follows (all hand calculated numbers have been rounded for readability):0
0
0
0
0.01
0.01
0
0
0.04
0.04
0
0
α
α
bb(t)
(t)
0.035
0.035
0.072
0.072
0.13
0.13
0.27
0.27
0.48
0.48
1
1
α
α
aa(t)
(t)
1
1
1
1
0
0
1
1
0
0
εε
6
6
5
5
4
4
3
3
2
2
1
1
0
0
0
0
0.01
0.01
0
0
0.04
0.04
0
0
α
α
bb(t)
(t)
0.035
0.035
0.072
0.072
0.13
0.13
0.27
0.27
0.48
0.48
1
1
α
α
aa(t)
(t)
1
1
1
1
0
0
1
1
0
0
εε
6
6
5
5
4
4
3
3
2
2
1
1
Next we calculateNext we calculate
ββ
for all of our states (for all of our states (aa andand bb) for all of the steps of our training) for all of the steps of our training sequence as follows:sequence as follows:
ow, for all of our (four) transitions, we estimate the transition probabilities at each time ow, for all of our (four) transitions, we estimate the transition probabilities at each time
he numerator is the count for the specific transition whose probability is being estimated he numerator is the count for the specific transition whose probability is being estimated
1
1
1
1
0.28
0.28
0
0
0.13
0.13
0
0
ββ
bb(t)
(t)
1
1
0.48
0.48
0.23
0.23
0.13
0.13
0.062
0.062
0.035
0.035
ββ
aa(t)
(t)
1
1
1
1
0
0
1
1
0
0
εε
6
6
5
5
4
4
3
3
2
2
1
1
1
1
1
1
0.28
0.28
0
0
0.13
0.13
0
0
ββ
bb(t)
(t)
1
1
0.48
0.48
0.23
0.23
0.13
0.13
0.062
0.062
0.035
0.035
ββ
aa(t)
(t)
1
1
1
1
0
0
1
1
0
0
εε
6
6
5
5
4
4
3
3
2
2
1
1
N Nstep and then sum them
step and then sum them to produce the total (see the to produce the total (see the total column in the table below). total column in the table below). WeWe can avoid the step of normalizing the transition counts (total column) because the can avoid the step of normalizing the transition counts (total column) because the normalization factor cancels out.
normalization factor cancels out. The new The new probability P(T(a,0,b)probability P(T(a,0,b)) is ) is calculates as:calculates as:
T T
and the denominator is the sum of all transitions out of the starting state for that and the denominator is the sum of all transitions out of the starting state for that transition. transition. 0.58 0.58 0.095 0.095 0.035 0.035 0.03 0.03 0 0 0.03 0.03 0 0 T(a,1,a) T(a,1,a) 0.36 0.36 0.06 0.06 0 0 0 0 0.03 0.03 0 0 0.030 0.030 T(a,0,a) T(a,0,a) 1 1 0.01 0.01 0 0 0.0048 0.0048 0 0 0.0052 0.0052 0 0 T(b,1,a) T(b,1,a) 0.06 0.06 0.01 0.01 0 0 0 0 0.0052 0.0052 0 0 0.0052 0.0052 T(a,0,b) T(a,0,b) New P New P Total Total 1 1 1 1 0 0 1 1 0 0 0.58 0.58 0.095 0.095 0.035 0.035 0.03 0.03 0 0 0.03 0.03 0 0 T(a,1,a) T(a,1,a) 0.36 0.36 0.06 0.06 0 0 0 0 0.03 0.03 0 0 0.030 0.030 T(a,0,a) T(a,0,a) 1 1 0.01 0.01 0 0 0.0048 0.0048 0 0 0.0052 0.0052 0 0 T(b,1,a) T(b,1,a) 0.06 0.06 0.01 0.01 0 0 0 0 0.0052 0.0052 0 0 0.0052 0.0052 T(a,0,b) T(a,0,b) New P New P Total Total 1 1 1 1 0 0 1 1 0 0 06 06 .. 0 0 165 165 .. 0 0 01 01 .. 0 0 095 095 .. 0 0 06 06 .. 0 0 01 01 .. 0 0 01 01 .. 0 0 )) ,, 1 1 ,, (( )) ,, 0 0 ,, (( )) ,, 0 0 ,, (( )) ,, 0 0 ,, ((