• No results found

10.3 Phoneme level LVCSR

10.3.1 Training Procedure

10.3.1.2 Learning Transition Probabilities

We consider HMM-states to be equivalent to I-states except that HMM-state transition probabilities are not driven by observations. Thus, we use the notation ω(h|φ, g) to refer to the probability of HMM-state transitiong→h.

Previously studied options for learning the state transition probabilities ω(h|φ, g) include the popular Baum-Welch Algorithm, using an ANN to model both transitions and emissions (as in Appendix E.5.1.1), or estimating phoneme transitions directly from the data.3 We introduce an alternative approach, using a variant of Exp-GPOMDPto estimate the phoneme transition probabilities.

The Viterbi Algorithm

At this juncture it is worth reviewing the Viterbi algorithm which is more com- pletely described in Section D.1.2. The aim of Viterbi decoding is to find the maximum likelihoodsequenceof I-states. The algorithm can be visualised as finding the best path through an I-state lattice, depicted for 3 I-states in Figure 10.8.

The Viterbi algorithm is useful because it identifies the most likely I-state for each step given the I-state trajectory up to the current time. This equates to outputting a stream of phonemes. By comparison, the full forward probability update tells us how likely the states are assuming all trajectories through the I-state lattice are possible. We use the Viterbi algorithm because a speaker describes only a single trajectory through the I-state lattice. Put another way, the Viterbi algorithm specifies an I-state for each step in time.

One step of the Viterbi algorithm iterates over all I-states. For each I-state h the algorithm chooses the most likely I-state g that leads to I-state h. This is performed by evaluating an approximate form of the forward probability update (D.1)

ˆ

αt+1(h|φ,y¯t) = max

g [ˆαt(g|φ,y¯t−1)ω(h|φ, g)]ξ(yt|h). (10.4)

The approximation arises because of the use of the max function. We take the max- imum because we want to identify the single most likely predecessor I-state. The algorithm stores the predecessorg of every I-statehfor the lastl time steps, giving us the ability to track the most likely sequence back from any I-state hforl steps.

To produce the optimal I-state sequence the lattice should extend back to the beginning of the sequence, however, due to memory and time restrictions, it is typical to assume that limiting the decoding to l steps introduces negligible errors [Proakis,

3This option is only possible when the data is labelled with all transitions. Most speech systems

use HMMs with multiple states per phoneme and data that is only labelled at the phoneme level (at best), so they cannot use this simple option.

Pr=0.1 Pr=0.2 Pr=0.7 Pr=0.6 Pr=0.3 Pr=0.1 t−3 t−2 t−1 t−5 t−4

Reward issued at timet t

g h

Figure 10.8: For each I-state h, select the most likely predecessor I-state g based on the probability of being in I-state g and the transition probability ω(h|φ, g). At time t−1 we see that the most likely last l I-states were {ht−4 = 1, ht−3 = 2, ht−2 = 3, ht−1 = 3} with normalised likelihood 0.6, but at time tthe Viterbi procedure makes a correction that makes the last 4 most likely I-states{ht−3= 3, ht−2= 2, ht−1= 1, ht= 2}with normalised likelihood

0.7.

1995, §8.2.2]. Thus, after l steps, the most likely I-state at the end of the lattice is locked in as the I-state for time step t−l. For example, in Figure 10.8, l= 3. As we update the lattice to incorporate the observation for stept, I-state 3 is locked in as the correct I-state for t−3 because it is at the end of the most likely sequence of I-states: the sequence{ht−3= 3, ht−2 = 2, ht−1 = 1, ht = 2} with a normalised likelihood of 0.7.

The following sections describe howExp-GPOMDPis used to train the speech recog- nition agent. We start by describing the job of the agent and what its parameters are. Then we introduce the key modification toExp-GPOMDPthat allows it to fit naturally into the context of the Viterbi algorithm for speech processing. Subsequent sections delve into the details of estimating ∇η using the modified Exp-GPOMDP algorithm and our chosen parameterisation. After a detailed discussion of the reward structure a dot-point summary of the training process is given.

Exp-GPOMDP using Viterbi Probabilities

For the purposes of the speech recognition agent, each I-state in the Viterbi lattice is defined to represent a single phoneme. Our speech agent is responsible for running the Viterbi decoder that emits a phoneme classification for each frame of speech. If the Viterbi lattice is length l then the agent emits the classification for framet−l at timet. As illustrated in Figure 10.8, this means the reward the agent receives at time t is a result of the classification decision for frame t−l. The implications of this for temporal credit assignment are discussed in Section 10.3.1.3. The agent parameters φ represent the phoneme transition probabilities ω(h|φ, g). The agent is illustrated by Figure 10.9.

?

{1,3,3,3} ω(h|φ,y¯t) t t−1 t−2 t−3 ht ut ¯ µ(·|φ,y¯t) ˆ α(1) = 0.1 ˆ α(2) = 0.7 ˆ α(3) = 0.2 h= 1 h= 2 h= 3 ξ(yt|h)

Figure 10.9: The LVCSR agent, showing the Viterbi lattice and the ¯µ(·|φ,y¯t) distribution

from which an action is chosen. In this instance the action consists of the sequence of labels leading back through the Viterbi lattice fromht = 3, that is,ut={ht−3= 1, ht−2= 3, ht−1= 3, ht= 3}.

with respect to the φ parameters. In other words, Exp-GPOMDP adjusts the Viterbi lattice transition probabilities to maximise the long-term average reward. The key modification to Exp-GPOMDP is a natural consequence of using a Viterbi decoder to classify speech frames: Exp-GPOMDP uses the Viterbi approximation (10.4) of the I- state belief instead of computing the full forward belief (given by Equation (6.2)). Just like the forward belief, the Viterbi approximation ˆαt depends on φ, and on all past

acoustic feature vectors yt, so we write the Viterbi I-state belief for I-state (phoneme)

h as ˆαt(h|φ,y¯t−1).

Recall from Section 6.2 thatExp-GPOMDPdoes not sample I-state transitions, but it does sample actions. However, the Viterbi algorithm normally operates completely deterministically, choosing the most likely current I-state

ht = arg max

h αˆt(h|φ,y¯t−1),

and tracking backlsteps through the Viterbi lattice to find the best phoneme to emit as a classification action. To make our speech processing agent fit into theExp-GPOMDP

framework we must allow it to emit stochastic classification actions for frame t. This allows the agent to occasionally try an apparently sub-optimal classification that, if it results in a positive reward, will result in parameter adjustments that make the classification more likely.

Viterbi phoneme probability distribution ˆαt+1(·|φ,y¯t). The action conceptually consists

of the entirelstep sequence back through the Viterbi lattice to the optimum phoneme for frame t−l given the stochastic choice of ht. The reason for emitting the entire

sequence instead of just the phoneme for frame t−l is to ensure that all possible choices of ht emit a unique action. The stochastic choice of ht is forgotten during the

next step and has no impact on the Viterbi probability.

Now we describe more precisely how this scheme fits into the Exp-GPOMDP algo- rithm. Recall from Equation (6.3) thatExp-GPOMDPevaluates the action distribution µ(u|θ, h, y) for each I-state h, and then samples action ut from the weighted sum of

these distributions with probability ¯µ(ut|φ, θ,y¯t). For our speech agent the action dis-

tribution for each I-state h is

µ(u|h) :=χu(h),

which is a deterministic mapping from I-states to actions. This definition reflects the operation of choosing an I-stateh, and from this choice, deterministically tracking back through the Viterbi lattice to find the best sequence of phonemes to emit asu.

Our choice of deterministic µ(·|h) may seem strange after just stating that actions have to be chosen stochastically, however recall that Exp-GPOMDP samples the sum of weighted distributions ¯µ(·|φ, θ,y¯t), whichis stochastic, even though µ(·|h) is not

¯ µ(u|φ,y¯t) = X h∈G ˆ αt+1(h|φ,y¯t)µ(u|h) =X h∈G ˆ αt+1(h|φ,y¯t)χu(h).

Thus, the probability of action u, consisting of a classification for the stochastically chosen current frame ht plus the last l frames, is equal to the probability of frame t

being classified as ht, and is given by ˆαt+1(ht|φ,y¯t). Thus, sampling an action from

¯

µ(·|φ, θ,y¯t) is equivalent to sampling an I-state ht from the same distribution.

There are|G|l total possible actions that the system can emit, but at any one time step at most |G| actions have non-zero probabilities because the choice of the I-state ht ∈ G completely determines the rest of the action.

After an action has been emitted the gradient of the action is computed and used to update theExp-GPOMDPeligibility tracez. A reward is received from the environment and it is used along with zto update the gradient estimate.

Computing the Gradient

To this point we have given a general overview of how theExp-GPOMDPalgorithm can be applied to speech. What has been left unspecified is exactly how ω(h|φ, g) is

parameterised and how to compute the gradient accordingly. There are several sensible ways of doing this. The next paragraphs outline the method used in our experiments.

We have already described the deterministic µ(u|h) function, which simplifies the problem of choosing an actionut from ¯µ(·|φ,y¯t) to one of choosing an I-state, from the

distribution ˆαt+1(·|φ,y¯t).

What remains to describe is how to parameterise the update of ˆα. In Section 6.2 we did this using Equation (6.2), which for each I-state g evaluates the probability of making a transition from I-state g to I-state h using ω(h|φ, g), then multiplies by the current forward probabilityαt(g|φ,y¯t−1). In this section we take a different approach by including the current forward probabilities in the evaluation of the soft-max function. Assuming that the steptphoneme is labelledg, and the stept+ 1 phoneme is labelled h, the likelihood of phoneme h using the Viterbi criterion is

ˆ αt+1(h|φ,y¯t) :=ξ(yt|h) maxgexp(φgh)ˆα(g|φ,y¯t−1) P g0∈Gexp(φg0h)ˆα(g0|φ,y¯t1) . (10.5)

For every transition from phoneme g to phoneme h there is one parameter φgh ∈ R.

The scaled likelihoods ξ(yt|h) are provided by the ANN, trained during Phase 1. In

previous experiments we used the soft-max function to evaluate the probability of making a transition from a given I-state g to some I-state h, thus we normalised the distribution by summing over potential next I-states h0 (see Section 3.4.1). When

implementing the Viterbi algorithm we are given the next I-state h, so we normalise over potential previous I-states g0. The Viterbi algorithm dictates that the previous

I-state we choose, denoted ˜g, must be the one that maximises ˆαt+1(h|φ,y¯t).

Before deriving the gradient we introduce an approximation that greatly reduces the amount of computation needed per step of the algorithm. We make the assumption that

ˆ

αt(g) does not depend on φ. This reduces the complexity of the gradient calculation

from being square in |G| to linear in |G|. The computation of ˆαt+1(h|φ,y¯t) for all h

is still square in |G|. The approximation also allows normalisation of ˆαt(g), which

would otherwise vanish over time, without needing to further complicate the gradient computation. For comparison, we implemented the exact calculation of ∇µ¯µ¯ and found that with 48 phonemes/I-states, each step was about 100 times slower than using the approximation described below, with no strong evidence that it produced better results.

As discussed earlier in this section, the choice of ut from ¯µ(·|φ,y¯t), is equivalent the

choice of a phoneme ht from ˆαt+1(·|φ,y¯t). Thus ∇µ¯µ¯, required byExp-GPOMDP, is the

same as ∇αˆt+1

ˆ

αt+1 and follows the form of the soft-max derivative given by Equation (3.9).

of ˆα(g|φ,y) (which is assumed to equal ˆ¯ α(g))

Pr(g|φ,y¯t, h) =

exp(φgh)ˆα(g)

P

g0∈Gexp(φg0h)ˆα(g0),

so that ˜g= arg maxgPr(g|φ,y¯t, h). Then ∂µ¯(ut=h|φ,y¯t) ∂φgh ¯ µ(ut =h|φ,y¯t) = ∂αˆ(h|φ,y¯t) ∂φgh ˆ α(h|φ,y¯t) =χ˜g(g)−Pr(g|φ,y¯t, h). (10.6)

There is no dependency on the emission probabilities ξ(yt|h) in Equation (10.6)

because it vanishes when we compute the ratio ∇µ¯µ¯. This would not be the case if we implemented global training of the emission probabilities and transition probabilities. Thenξ(yt|h) would depend explicitly on a subset of the parametersφrepresenting the

weights of the ANN that estimates ξ(yt|h). We would need to compute ∇φξ(yt|φ, h)

by back propagating the gradient through the layers of the ANN. This idea is similar to the global ANN/HMM training scheme implemented by Bengio et al. [1992].

Reward Schedule

At each time step an action ut is chosen from ¯µ(·|φ,y¯t), which according to our

definition of ¯µ is equivalent to choosing a phoneme ht from ˆα(·|φ,y¯t). Actions are

generated by following the path back through the Viterbi lattice defined by the choice ofht. If the Viterbi lattice is length l, then the path back gives us the previous lmost

likely phonemes given our choice of current phoneme ht. The final decision for the

classification of speech frame t−l is the phoneme at the end of the path traced back fromht. Because the Viterbi procedure corrects errors made at timet, we expect that

there is a high probability that the correct classification is made for frame t−l, even when Exp-GPOMDPchooses a low probability phoneme ht.

A reward of 1 is received for correctly identifying a change in phoneme at time t. A reward of -1 is received for incorrectly identifying a change in phoneme. Because the Viterbi filter delays the final classification of frames untilExp-GPOMDPhas processed another l frames, the reward is delayed byl steps.

If the phoneme classification for frame tis the same as the previous phoneme clas- sification, then no reward is issued. We only reward changes in phoneme classification because the phoneme sequence is what we really care about. Rewarding the accuracy of every frame would cause the system to try and accurately match the length of each phoneme as well as the sequence of phonemes, hence wasting training effort. For ex- ample, suppose the true frame-by-frame labels are {b, b, eh, eh, eh, d, d} and ourExp-GPOMDP trained classifier emits {b, eh, eh, eh, d, d, d}; then both se-

quences are interpreted as the sequence of phonemesb-eh-d, and our classifier should be given the maximum reward. However, if we were to base rewards on the frame-by- frame labels then Exp-GPOMDP would receive penalties for the apparent mistakes at the 2nd and 5th frames. The focusing of training effort to maximise an arbitrary high- level criteria is one of the key benefits of the POMDP model. HMM training would penalise the 2nd and 5th frames by interpreting the emission of those phonemes as zero probability events.4 Off-by-one frame errors such as we have described are fairly common because the hand labelling of the data set is only accurate to a few ms.

Furthermore, penalties are received only for errors not corrected by Viterbi decod- ing. The system is free to artificially raise or lower transition probabilities from their maximum likelihood values, provided doing so decreases errors after Viterbi decoding. Training automatically adjusts for the ability of the Viterbi decoder to correct errors. We do not rely on Viterbi decoding as a post-training method for reducing errors.

The final operation of the system is the same as an ANN/HMM hybrid. The training procedure is not the same because we avoid using the maximum likelihood criterion.

Summary of Transition Probability Training

Combining all the elements of this section we end up with the following training procedure for each frame of input speech:

1. A frame of the speech waveform is pre-processed to form a vector of features such as power and mel-cepstral values. These features form the observation vectoryt

for thet’th frame of speech.

2. We feed yt forward through the ANN, trained during Phase 1, which outputs an

estimate of Ξ(m|yt) for each phonemem.

3. Ξ(m|yt) is converted into a scaled likelihood ξ(yt|m) by dividing the posterior

probabilities by the phoneme probabilities, which are estimated from the training set.

4. For each phoneme, represented by an I-stateh in the Viterbi lattice, we evaluate ˆ

αt+1(h|φ,y¯t) using Equation (10.5).

5. A path back through the Viterbi lattice is chosen by sampling a phonemeht from

ˆ

αt+1(·|φ,y¯t).

6. We compute ∇µφ¯µ¯ for the chosen path by evaluating Equation (10.6). This log gradient vector is added to the Exp-GPOMDPtrace zt.

7. Follow the path back through the Viterbi lattice from phoneme ht to construct

the actionut. Emit the phoneme at the end of the Viterbi lattice as the correct

phoneme for frame t−l.

8. Issue rt+1 = 1 if the phoneme fort−l correctly identifies a change in phoneme, rt+1 = 0 for no change, and rt+1 = −1 for an incorrect (or missing) change in phoneme. Incrementt, goto step 1.