HHMMs for Automatic speech recognition (ASR)

2.3 Representing HMMs and their variants as DBNs

2.3.10 HHMMs for Automatic speech recognition (ASR)

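[Figure 2.16 diagram: states 1, 2, 3 emit “b”, “a”, “n”; the top fork (states 7, 8, 9) emits “a”, “n”, “a”; the bottom fork (states 4, 5, 6) emits “ah”, “n”, “er”.]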

Figure 2.16: HMM state transition diagram for a model encoding two possible pronunciations of “banana”, the British version (bottom fork: middle “a” is long, and final “a” has “r” appended), and the American version (top fork: all “a”s are equal). Dotted lines from numbers to letters represent deterministic emissions. The states may have self-loops (not shown) to model duration.

Consider modelling the pronunciation of a single word. In the simplest case, the word can be described as a sequence of phones, e.g., “dog” is “d - o - g”. This can be represented as a (left-to-right) finite state automaton. But consider a word like “yamaha”: the first “a” is followed by “m”, but the second “a” is followed by “h”. Hence some automaton states need to share the same phone labels. (Put another way, given a phone label, we cannot always uniquely identify our position within the automaton/word.) This suggests that we use an HMM to model word pronunciation. Such a model can also cope with multiple pronunciations, e.g., the British or American pronunciation of “banana” (see Figure 2.16).

Let $Q^h_t$ be the hidden automaton state at time $t$, and $Q_t$ be the observed phone. For the yamaha model, we have $Q^h_t \in \{1, \ldots, 9\}$ and $Q_t \in \{/y/, /aa/, /m/, /a/, /h/\}$. Note that $P(Q_t = k \mid Q^h_t = q) = \delta(B(q), k)$ is a delta function (e.g., state 2 only emits “aa”). Also, the transition matrix, $P(Q^h_t = q' \mid Q^h_{t-1} = q) = A(q, q')$, can always be made upper diagonal by renumbering states, and will typically be very sparse.
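As a minimal sketch of these two CPDs, the following builds the automaton of Figure 2.16 (the “banana” model, whose nine states can be read off the figure). The self-loop probability of 0.5, the even split over out-arcs, and the absorbing final states are assumptions made only so that the matrix is fully specified; the names `build_transition_matrix` and `emission_prob` are hypothetical.

```python
import numpy as np

# Phone emitted by each automaton state of Figure 2.16 (states 1..9 -> indices 0..8).
B = ["b", "a", "n", "ah", "n", "er", "a", "n", "a"]

# Arcs read off the figure: 1->2->3, then 3->4->5->6 (British) or 3->7->8->9 (American).
ARCS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (2, 6), (6, 7), (7, 8)]


def build_transition_matrix(n_states, arcs, self_loop=0.5):
    """Return A with A[q, q2] = P(Qh_t = q2 | Qh_{t-1} = q).

    self_loop is a made-up duration parameter; the remaining mass is split
    evenly over the outgoing arcs. States without outgoing arcs are made
    absorbing so that every row sums to one (an assumption of this sketch).
    """
    A = np.zeros((n_states, n_states))
    for q in range(n_states):
        successors = [dst for (src, dst) in arcs if src == q]
        if not successors:
            A[q, q] = 1.0
            continue
        A[q, q] = self_loop
        for dst in successors:
            A[q, dst] = (1.0 - self_loop) / len(successors)
    return A


def emission_prob(q, k):
    """P(Q_t = k | Qh_t = q) = delta(B(q), k)."""
    return 1.0 if B[q] == k else 0.0


A = build_transition_matrix(len(B), ARCS)
# 1-based numbers of the states that share the phone label "a".
states_emitting_a = [q + 1 for q, phone in enumerate(B) if phone == "a"]
print(states_emitting_a)
print(np.allclose(A, np.triu(A)))         # is A upper triangular under this numbering?
print(np.count_nonzero(A), "of", A.size)  # how sparse is A?
```

Several states emit the same phone, so the phone label alone does not identify the automaton state, which is the reason for keeping $Q^h_t$ hidden.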

Now suppose that we do not observe the phones directly, but that each phone can generate a sequence of acoustic vectors; we will model the acoustic “appearance” and duration of each phone with a phone HMM. The standard approach is to embed the appropriate phone HMMs into each state of the word HMM. However, we then need to tie the transition and observation parameters between different states.

[Figure 2.17 diagram: three time slices with nodes $Q^h_t$, $Q_t$, $F^S_t$, $S_t$ and $Y_t$, for $t = 1, 2, 3$.]

Figure 2.17: A DBN for modelling the pronunciation of a single word. $Q^h$ is the state (position) in the word HMM; $Q$ is the phone; $S$ is the state (position) within the phone HMM, i.e., the subphone; $Y$ is the acoustic vector. $F^S$ is a binary indicator variable that turns on when the phone HMM has finished.

[Figure 2.18 diagram: three time slices with nodes $Q^h_t$, $F^S_t$, $Q_t$ and $Y_t$, for $t = 1, 2, 3$.]

Figure 2.18: A DBN for modelling the pronunciation of a single word, where each phone is modelled by a single state. This is the simplification of Figure 2.17 if $S$ has only one possible value. This corresponds to the model used in [Zwe98].

An alternative is to make the parameter tying graphically explicit by using HHMMs; this allows the tying pattern to be learned. Specifically, let each state of the word model, $Q^h_t$, emit $Q_t$, which then “calls” the phone HMM. The state of the phone HMM at time $t$ is $S_t$; this is called the subphone. Since we don’t know how long each phone lasts, we introduce $F^S$, which only turns on, allowing $Q^h_t$ to make a transition, when the phone HMM has finished. This can be accomplished as shown in Figure 2.17. $F^S$ is conditioned on $Q_t$, not $Q^h_t$, which represents the fact that the duration depends on the phone, not the automaton state (i.e., duration information is tied). Similarly, $Y_t$ is conditioned on $Q_t$ but not $Q^h_t$.
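The interaction between $Q^h$, $Q$, $S$ and $F^S$ can be illustrated by sampling from a toy version of the model. This is only a sketch under stated assumptions: the word HMM is strictly left-to-right, every phone HMM is a left-to-right chain described by hypothetical self-loop probabilities, and the phone sequence used for “yamaha” is a guess at the pronunciation. The function name `sample_word` is likewise hypothetical.

```python
import random


def sample_word(phones, phone_stay, max_frames, seed=0):
    """Sample (Qh_t, Q_t, S_t, FS_t) frames from a strict left-to-right word model.

    phones[qh] is the phone emitted by word-HMM state qh.
    phone_stay[k] lists the self-loop probability of each subphone of phone k
    (hypothetical numbers); leaving the last subphone sets FS_t = 1.
    """
    rng = random.Random(seed)
    qh, s = 0, 0
    frames = []
    for _ in range(max_frames):
        q = phones[qh]            # Q_t is a deterministic function of Qh_t
        stay = phone_stay[q]      # duration parameters depend on Q_t only (tied)
        leave = rng.random() >= stay[s]
        fs = 1 if leave and s == len(stay) - 1 else 0
        frames.append((qh, q, s, fs))
        if fs == 1:
            if qh == len(phones) - 1:
                break             # last phone finished: the word is over
            qh, s = qh + 1, 0     # Qh advances and the phone HMM restarts
        elif leave:
            s += 1                # move on to the next subphone
    return frames


phones = ["y", "aa", "m", "a", "h", "a"]                 # assumed pronunciation
phone_stay = {k: [0.6, 0.7, 0.6] for k in set(phones)}   # assumed 3 subphones each
frames = sample_word(phones, phone_stay, max_frames=200)
for frame in frames[:5]:
    print(frame)
```

Because `phone_stay` is indexed by the phone and not by the word state, the two states that emit “a” share the same duration parameters, which is the tying described above.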

Normally the phone HMMs are 3-state left-to-right HMMs. However, if we use a single state HMM, we can simplify the model. In particular, if $S_t$ can have only one possible value, it can be removed from the graph; the duration of the phone will be determined by the self-loop probability on the corresponding hidden state in the word HMM. The resulting DBN is shown in Figure 2.18, and corresponds to the model used in [Zwe98]. If the word HMM has a strict left-to-right topology, $Q^h_t$ deterministically increases by 1 iff $F^S_t = 1$; hence $P(F^S_t = 1 \mid Q_t = k) = 1 - A(i, i)$, where $i$ is any state that (deterministically) emits the phone $k$. If the word HMM is not left-to-right, then $F^S$ can specify which out-arc to take from state $Q^h_{t-1}$, so $P(Q^h_t \mid Q^h_{t-1}, F^S_{t-1})$ becomes deterministic [ZR97].
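The relation $P(F^S_t = 1 \mid Q_t = k) = 1 - A(i, i)$ only makes sense if all states emitting phone $k$ have the same self-loop probability. The sketch below computes the finish probabilities from a word HMM and checks this condition. The helper names and the numbers in the example are hypothetical, and the mean duration uses the standard fact that a constant finish probability gives a geometric duration.

```python
def finish_prob_per_phone(A, B):
    """Return {phone k: P(FS_t = 1 | Q_t = k)}, using 1 - A[i][i] for states i emitting k.

    Raises ValueError if two states emitting the same phone disagree on their
    self-loop probability, since the formula is then not well defined.
    """
    finish = {}
    for i, phone in enumerate(B):
        p = 1.0 - A[i][i]
        if phone in finish and abs(finish[phone] - p) > 1e-12:
            raise ValueError(f"self-loops for phone {phone!r} are not tied")
        finish[phone] = p
    return finish


def expected_duration(finish_prob):
    """Mean number of frames of a geometric duration with this finish probability."""
    return 1.0 / finish_prob


# Toy word HMM for "dog"; the last row's missing mass goes to an implicit end state.
A = [[0.8, 0.2, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 0.75]]
B = ["d", "o", "g"]
finish = finish_prob_per_phone(A, B)
print(finish["o"], expected_duration(finish["o"]))
```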

hierarchy, and condition the variables inside the word HMM (i.e., $Q^h$ and $Q$) on the (hidden) identity of the word: see Figure 2.19. The fact that the duration and appearance are not conditioned on the word represents the fact that the phones are shared across words: see Figure 2.20. Sharing the phone models dramatically reduces the size of the state space. Nevertheless, with, say, 6000 words, 60 phones, and 3 subphones, there are still about 1 million unique states, so a lot of training data is required to learn such models. (In reality, things are even worse because of the need to use triphone contexts, although these are clustered, to avoid having $60^3$ combinations.)
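The figures quoted above can be checked with a few lines of arithmetic, assuming (as the text appears to) that the state count is the product of the number of words, phones and subphones:

```python
n_words, n_phones, n_subphones = 6000, 60, 3

# Joint (word, phone, subphone) combinations: the "about 1 million" in the text.
joint_states = n_words * n_phones * n_subphones

# Number of triphone contexts if they were not clustered.
unclustered_triphones = n_phones ** 3

print(joint_states)           # 1080000
print(unclustered_triphones)  # 216000
```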

Note that the number of legal values for $Q^h_t$ can vary depending on the value of $W_t$, since each word has a different-sized HMM pronunciation model. Also, note that silence (a gap between words) is different from silent (non-emitting) states: silence has a detectable acoustic signature, and can be treated as just another word (whose duration is of course unknown), whereas silent states do not generate any observations.

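We now explicitly define all the CPDs for $t > 1$ in the model shown in Figure 2.19.

$$P(W_t = w' \mid W_{t-1} = w, F^W_{t-1} = f) = \begin{cases} \delta(w, w') & \text{if } f = 0 \\ A(w, w') & \text{if } f = 1 \end{cases}$$

$$P(F^W_t = f \mid Q^h_t = q, W_t = w, F^S_t = b) = \begin{cases} \delta(f, 0) & \text{if } b = 0 \\ 1 - A_w(q, \mathrm{end}) & \text{if } b = 1 \text{ and } f = 0 \\ A_w(q, \mathrm{end}) & \text{if } b = 1 \text{ and } f = 1 \end{cases}$$

$$P(Q^h_t = q' \mid Q^h_{t-1} = q, W_t = w, F^W_{t-1} = f, F^S_{t-1} = b) = \begin{cases} \delta(q, q') & \text{if } b = 0 \\ A_w(q, q') & \text{if } b = 1 \text{ and } f = 0 \\ \pi_w(q') & \text{if } b = 1 \text{ and } f = 1 \end{cases}$$

$$P(Q_t = k \mid Q^h_t = q, W_t = w) = \delta(B_w(q), k)$$

$$P(F^S_t = 1 \mid S_t = j, Q_t = k) = A_k(j, \mathrm{end})$$

$$P(S_t = j \mid S_{t-1} = i, Q_t = k, F^S_{t-1} = f) = \begin{cases} A_k(i, j) & \text{if } f = 0 \\ \pi_k(j) & \text{if } f = 1 \end{cases}$$

where $A(w, w')$ is the word bigram probability; $A_w(q, q')$ is the transition probability between states in word model $w$, $\pi_w(q)$ is the initial state distribution, and $B_w(q)$ is the phone deterministically emitted by state $q$ in the HMM for word $w$, and similarly, $A_k(i, j)$, $\pi_k(j)$ and $B_k(j, y_t)$ are the parameters for the $k$’th phone HMM. We could model $P(Y_t \mid S_t = j, Q_t = k)$ using a mixture of Gaussians, for example.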
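The CPDs above translate directly into small functions. The following is a sketch, not a reference implementation: it assumes that each HMM is stored as a transition matrix whose last column is the “end” state, and that $A_w(q, q')$ and $A_k(i, j)$ in the CPDs denote the transition probabilities given that the model does not end (so that each CPD normalises). All function names are hypothetical.

```python
import numpy as np


def split_end(A_full):
    """Split a transition matrix whose last column is the 'end' state.

    Returns (A, a_end) where a_end[q] = A(q, end) and A[q] is the distribution
    over next states given that the model does not end. Assumes a_end[q] < 1.
    """
    A_full = np.asarray(A_full, dtype=float)
    a_end = A_full[:, -1]
    A = A_full[:, :-1] / (1.0 - a_end)[:, None]
    return A, a_end


def delta(i, n):
    """One-hot distribution over n values concentrated on i."""
    d = np.zeros(n)
    d[i] = 1.0
    return d


def p_word(w_prev, fw_prev, bigram):
    """P(W_t = . | W_{t-1} = w, FW_{t-1} = f)."""
    return bigram[w_prev] if fw_prev == 1 else delta(w_prev, len(bigram))


def p_word_finished(q, fs, a_end_w):
    """P(FW_t = . | Qh_t = q, W_t = w, FS_t = b), returned as [P(f=0), P(f=1)]."""
    p1 = a_end_w[q] if fs == 1 else 0.0
    return np.array([1.0 - p1, p1])


def p_word_state(q_prev, fw_prev, fs_prev, A_w, pi_w):
    """P(Qh_t = . | Qh_{t-1} = q, W_t = w, FW_{t-1} = f, FS_{t-1} = b)."""
    if fs_prev == 0:
        return delta(q_prev, len(pi_w))   # phone not finished: stay put
    return pi_w if fw_prev == 1 else A_w[q_prev]


def p_subphone(s_prev, fs_prev, A_k, pi_k):
    """P(S_t = . | S_{t-1} = i, Q_t = k, FS_{t-1} = f)."""
    return pi_k if fs_prev == 1 else A_k[s_prev]


def p_phone_finished(s, a_end_k):
    """P(FS_t = 1 | S_t = j, Q_t = k) = A_k(j, end)."""
    return a_end_k[s]


# A made-up 3-state left-to-right phone HMM; the fourth column is 'end'.
A_k, a_end_k = split_end([[0.6, 0.4, 0.0, 0.0],
                          [0.0, 0.7, 0.3, 0.0],
                          [0.0, 0.0, 0.5, 0.5]])
pi_k = np.array([1.0, 0.0, 0.0])
print(p_subphone(1, 0, A_k, pi_k), p_phone_finished(2, a_end_k))
```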

Training HHMMs for ASR

When training an HMM from a known word sequence, we can use the model in Figure 2.21. We specify the sequence of words, $\vec{w}$, but not when they occur. If $W^h_t = k$, it means we should use the $k$’th word at time $t$; hence $P(W_t = w \mid W^h_t = k, \vec{w}) = \delta(w, w_k)$ is a delta function, cf. the relationship between $Q^h$ and $Q$. Also, $W^h$ increases by 1 iff $F^W = 1$, to indicate the next word in the sequence.

In practice, we can “compile out” the $\vec{w}$ node, by changing the CPDs for the $W$ nodes for each training sequence. If the mapping from words to phones is fixed, we can also compile out $W^h$, $W$, and $F^W$, resulting in the model shown in Figure 2.17 (or Figure 2.18 if we don’t use subphones). Now $Q^h$ acts as an index into the known (sub)phone sequence corresponding to $\vec{w}$, so its CPD becomes deterministic (increase $Q^h$ by 1 iff $F^S_{t-1} = 1$, which occurs iff the phone HMM has finished), and the CPD for the phone becomes $P(Q_t = k \mid Q^h_t = i) = B_w(q, k)$, if $i$ corresponds to state $q$ in word HMM $w$. This is equivalent to combining all the word HMMs together, and giving all their states unique numbers, which is the standard way of training HMMs from a known phonetic sequence.
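The “compiling out” step can be sketched as follows for the simple case where every word has a single, strictly left-to-right pronunciation. The lexicon entries are illustrative only (loosely based on the words in Figure 2.20), and `compile_state_sequence` and `next_index` are hypothetical helper names.

```python
def compile_state_sequence(words, lexicon):
    """Concatenate the word HMMs for a known word sequence.

    Returns one entry per state of the combined model:
    (global index i, word position k, word w, within-word state q, phone).
    Assumes each word has a single, strictly left-to-right pronunciation.
    """
    states = []
    for k, w in enumerate(words):
        for q, phone in enumerate(lexicon[w]):
            states.append((len(states), k, w, q, phone))
    return states


def next_index(i, fs_prev, n_states):
    """Deterministic CPD for Qh: advance by one iff the phone HMM has finished.

    Capping at the last state is a choice made for this sketch.
    """
    return min(i + 1, n_states - 1) if fs_prev == 1 else i


lexicon = {"need": ["n", "iy", "d"], "on": ["aa", "n"], "the": ["dh", "ax"]}
states = compile_state_sequence(["need", "the"], lexicon)
for entry in states:
    print(entry)
```

After this step the index `i` plays the role of $Q^h$, and the phone attached to each entry gives the deterministic CPD for $Q_t$.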

Note that the number of possible values for $W^h$ in Figure 2.21 is now $T_{in}$, the number of words in the input training sequence. One concern is that this might make inference intractable. The simplest approach to inference is to combine all the interface variables (see Section 3.4) into one “mega” variable, and then use a (modified) forwards-backwards procedure, which has complexity $O(T_{out} N^2)$, where $N$ is the number of states of the mega variable, and $T_{out}$ is the length (number of frames) of the output (observed) sequence. The interface variables (the ones with outgoing temporal arcs) in Figure 2.21 are as follows: $W^h_t \in \{1, \ldots, T_{in}\}$, $Q^h_t \in \{1, \ldots, P\}$, $S_t \in \{1, \ldots, K\}$, $F^W_t \in \{0, 1\}$, $F^S_t \in \{0, 1\}$, where $P$ is the max number of states in any word HMM, and $K$ is the max number of states in any phone HMM.

[Figure 2.19 diagram: three time slices with nodes $W_t$, $F^W_t$, $Q^h_t$, $Q_t$, $F^S_t$, $S_t$ and $Y_t$, for $t = 1, 2, 3$.]

Figure 2.19: A DBN for continuous speech recognition. Note that the observations, $Y_t$, the transition probabilities, $S_t$, and the termination probabilities, $F^S_t$, are conditioned on the phone $Q_t$, not the position within the word HMM $Q^h_t$, and not on the word, $W_t$.

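[Figure 2.20 diagram: a three-level hierarchy with words (“need”, “on”, “the”) at the top, phones (including n, iy, d, aa, dh, ax, and “end” states) in the middle, and subphones at the bottom.]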

Figure 2.20: An example HHMM for an ASR system which can recognize 3 words (adapted from [JM00]). The phone models (bottom level) are shared (tied) amongst different words; only some of them are shown.

[Figure 2.21 diagram: the node $\vec{w}$ plus three time slices with nodes $W^h_t$, $W_t$, $F^W_t$, $Q^h_t$, $Q_t$, $F^S_t$, $S_t$ and $Y_t$, for $t = 1, 2, 3$.]

Figure 2.21: A DBN for training a continuous speech recognition system from a known word sequence $\vec{w}$. In practice, the $\vec{w}$ node can be absorbed into the CPDs for $W$. Also, if the mapping from words to phones is fixed, we can also compile out $W^h$, $W$, and $F^W$, resulting in the much simpler model shown in Figure 2.17.

It would appear that exact smoothing requires $O(T_{out} T_{in}^2 P^2 K^2)$ time. However, because of the left-to-right nature of the transition matrices for $W^h$, $Q^h$ and $S$, this can be reduced to $O(T_{out} T_{in} P K)$ time. This is the same as for a standard HMM (up to constant factors), since there are $N = T_{in} P K$ states created by concatenating the word HMMs for the whole sequence.
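To illustrate why a left-to-right structure reduces the cost from quadratic to linear in the number of states, the sketch below compares a dense forwards pass with one that only touches the two non-zero entries in each row of the transition matrix. It uses a plain chain in which state $i$ either stays or moves to $i + 1$, which is a simplification of the concatenated model and not the algorithm of the source; the likelihood table is random stand-in data and the function names are hypothetical.

```python
import numpy as np


def forward_dense(pi, A, lik):
    """Scaled forwards pass with a dense transition matrix: O(T N^2)."""
    alpha = pi * lik[0]
    loglik = 0.0
    for t in range(len(lik)):
        if t > 0:
            alpha = (alpha @ A) * lik[t]
        norm = alpha.sum()
        loglik += np.log(norm)
        alpha = alpha / norm
    return loglik


def forward_left_to_right(stay, lik):
    """Same quantity for a chain where state i either stays or moves to i + 1.

    Only the two non-zero entries per row are used, so the cost is O(T N).
    The chain starts in state 0 and stay[-1] is assumed to be 1.
    """
    alpha = np.zeros(len(stay))
    alpha[0] = lik[0, 0]
    loglik = 0.0
    for t in range(len(lik)):
        if t > 0:
            moved = np.zeros_like(alpha)
            moved[1:] = alpha[:-1] * (1.0 - stay[:-1])
            alpha = (alpha * stay + moved) * lik[t]
        norm = alpha.sum()
        loglik += np.log(norm)
        alpha = alpha / norm
    return loglik


rng = np.random.default_rng(0)
N, T = 6, 20
stay = np.append(rng.uniform(0.3, 0.9, size=N - 1), 1.0)
A = np.diag(stay) + np.diag(1.0 - stay[:-1], k=1)
pi = np.zeros(N)
pi[0] = 1.0
lik = rng.uniform(0.1, 1.0, size=(T, N))   # stand-in for P(y_t | state)

print(forward_dense(pi, A, lik), forward_left_to_right(stay, lik))
```

Both functions return the same log-likelihood; the second simply avoids multiplying by entries of the transition matrix that are known to be zero.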

When training, we know the length of the sequence, so we only want to consider segmentations that “use up” all the phones. We can do this by introducing an extra “end of sequence” (eos) variable, that is clamped to some observed value, say 1, and whose parents are $Q^h_T$ and $F^S_T$; we define its CPD to be $P(\mathrm{eos} = 1 \mid Q^h_T = i, F^S_T = f) = 1$ iff $i$ is the last phone in the training sequence, and $f = 1$. (The reason we require $f = 1$ is that this represents the event that we are just about to transition to a (silent) accepting state.) This is the analog of only examining paths that reach the “top right” part of the HMM trellis, where the top row represents the final accepting state, and the right column represents the final time slice. This trick was first suggested in [Zwe98]. A simpler alternative is simply to set $Q^h_T = i$ and $F^S_T = 1$ as evidence; this has the advantage of not needing any extra variables, although it is only possible if there is only one state (namely $i$) immediately preceding the final state (which will be true for a counting automaton of the kind considered here).
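The effect of clamping the final state and finish flag can be sketched for a strict left-to-right chain with one state per phone (as in Figure 2.18). The function below is a hypothetical helper and the numbers are made up; it omits scaling, so it is only suitable for short sequences.

```python
import numpy as np


def loglik_with_eos(stay, lik):
    """log P(y_1..T, eos = 1) for a strict left-to-right chain.

    stay[i] is the self-loop probability of state i; with probability
    1 - stay[i] the state finishes (F = 1) and the chain moves on. Only paths
    that are in the last state at time T with F_T = 1 receive non-zero weight.
    Returns -inf (with a NumPy warning) if T is too short to reach the last state.
    """
    n = len(stay)
    alpha = np.zeros(n)
    alpha[0] = lik[0, 0]
    for t in range(1, len(lik)):
        moved = np.zeros(n)
        moved[1:] = alpha[:-1] * (1.0 - stay[:-1])
        alpha = (alpha * stay + moved) * lik[t]
    # Clamp the evidence: Qh_T = last state and F_T = 1.
    return np.log(alpha[-1] * (1.0 - stay[-1]))


stay = np.array([0.5, 0.6, 0.7])
rng = np.random.default_rng(1)
lik = rng.uniform(0.1, 1.0, size=(4, 3))   # stand-in for P(y_t | phone state)
print(loglik_with_eos(stay, lik))
```

With four frames and three states, exactly three segmentations survive the constraint (one for each choice of which state is held for an extra frame), and the function sums their probabilities.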

General DBNs for ASR

So far, we have focussed on HHMMs for ASR. However, it is easy to create more flexible models. For example, the observation nodes can be represented in factored form, instead of as a homogeneous vector-valued node. (Note that this only affects the computation of the conditional likelihood terms, $B_t(i, i) = P(y_t \mid Q_t = i)$, so the standard forwards-backwards algorithm can be used for inference.) More general hidden nodes can also be used, representing the state of the articulators, e.g., position of the tongue, size of mouth opening, etc. [Zwe98]. If the training set just consists of standard speech corpora, there is nothing to force the hidden variables to have this “meaningful” interpretation. [Row99, RBD00] trained on data where they were able to observe the articulators directly; the resulting models are much easier to interpret. See [Bil01] for a more general review of how graphical models can be used for ASR.
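As an illustration of why a factored observation model only changes the likelihood computation, the sketch below evaluates $\log P(y_t \mid Q_t = i)$ as a sum over observation factors. It assumes the simplest possible factorisation, in which the factors are conditionally independent given the state and each is a diagonal-covariance Gaussian; the source does not commit to this form, and the parameter layout and function names are hypothetical.

```python
import numpy as np


def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at y."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)


def factored_loglik(y_streams, params, i):
    """log P(y_t | Q_t = i) as a sum over observation factors.

    y_streams: list of 1-D arrays, one per factor of the observation.
    params[m][i] = (mean, var) for factor m in state i (hypothetical layout).
    Assumes the factors are conditionally independent given the state.
    """
    return sum(
        log_gauss_diag(y, *params[m][i]) for m, y in enumerate(y_streams)
    )


# Two factors (dimensions 2 and 1) and a single state with unit-variance Gaussians.
params = [
    [(np.zeros(2), np.ones(2))],
    [(np.zeros(1), np.ones(1))],
]
y_streams = [np.zeros(2), np.zeros(1)]
print(factored_loglik(y_streams, params, i=0))
```

The resulting number is used exactly like the likelihood of an unfactored observation node, so the rest of the forwards-backwards recursion is unchanged.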

In document Dynamic Bayesian Networks Representation, Inference And Learning Kevin Patrick Murphy pdf (Page 38-43)