3.1. Introduction
From the perspective of segmental modelling for automatic speech recognition, the term "segment" will be taken as referring to any sequence of frames representing some linguistically meaningful speech unit. Typically, these units correspond to phones or to sub-components of phones, but other units such as diphones have been used (e.g. Ghitza and Sondhi, 1993). The choice of unit does not affect the probabilistic formalism, but is important when considering the most appropriate way of describing the acoustic feature dynamics. In addition, longer units have the advantage of providing greater constraints on the pattern of the features over time, but have computational disadvantages due to the greater variability in length that must be accommodated and the larger number of units needed to describe a language.
3.2. General segment-modelling framework
The task of recognising a word sequence with any statistical model involves finding the sequence of labels $\hat{u}_1^N$ that is most likely given a sequence of acoustic feature vectors $y_1^T$. This task can be formulated as follows:
$$\hat{u}_1^N = \arg\max_{N,\,u_1^N} p(u_1^N \mid y_1^T) = \arg\max_{N,\,u_1^N} p(u_1^N)\, p(y_1^T \mid u_1^N).$$
Any one label $u_n$ can correspond to a word or to any other unit (such as a phone or sub-component of a phone) that can be mapped to a word sequence. The above equation uses Bayes' theorem to express the probability of a label sequence given the acoustic observations in terms of a language model probability $p(u_1^N)$ and an acoustic model probability $p(y_1^T \mid u_1^N)$; the denominator $p(y_1^T)$ does not depend on the labels and so is omitted from the maximisation.
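The decision rule above can be illustrated with a minimal Python sketch. It assumes a small, explicitly enumerated set of candidate label sequences and works with log probabilities; the function name `decode` and the toy scores are hypothetical and chosen only for illustration.

```python
import math

def decode(candidates, language_logprob, acoustic_logprob):
    """Return the label sequence maximising log p(u) + log p(y | u)."""
    return max(candidates,
               key=lambda u: language_logprob[u] + acoustic_logprob[u])

# Hypothetical toy scores for two candidate label sequences.
language_logprob = {("a", "b"): math.log(0.6), ("a", "c"): math.log(0.4)}
acoustic_logprob = {("a", "b"): math.log(0.01), ("a", "c"): math.log(0.05)}

best = decode(list(language_logprob), language_logprob, acoustic_logprob)
```

In this toy case the acoustic evidence outweighs the language-model preference, so the second candidate is selected. A practical recogniser would not enumerate candidates but search over them with dynamic programming.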
HMMs provide one way of calculating acoustic model probabilities (of observations given models), and the focus here is on segment-based approaches to computing these probabilities¹.
HMM probability calculations are performed on a frame-by-frame basis, and an HMM state is thus associated with a single acoustic feature vector $y_t$. In contrast, a segment model describes a variable-length segment $y = \{y_\tau, \ldots, y_s\}$ ($s \geq t \geq \tau$, $y_t \in$ multi-dimensional space $F$). This difference between the two models is illustrated in Figure 3.1 from the perspective of a generative model of speech. In HMMs, the only sequence constraints are provided by the possible state sequences in the underlying Markov chain. On the other hand, segmental models provide a framework for explicitly incorporating within-segment dependencies and segment duration constraints. It is usual for possible sequences of segments to be controlled with a Markov chain, but for segments to be otherwise treated as independent.
[Figure 3.1. Panel (a): Three frame-based states emitting three observations. Panel (b): Three segment-based states emitting 10 observations.]
Figure 3.1: Generation of sequences of observations by HMM states associated with (a) individual frames and (b) variable-duration sequences ("segments") of frames, using the same model topology with transitions on to the next state and back to the same state. In both cases, the illustrated process starts in state 1, makes a transition to state 2 and then makes a self-loop transition back to state 2.
¹ Most segment models use this concept of computing the probabilities of observations given labels ($P(y \mid u)$). One exception is the use of a posteriori distribution modelling (see Section 4.2), whereby $P(u \mid y)$ is decomposed differently into a segmentation probability and a probability of the labels given the segmentation.
A general segment model specifies the probability for a sequence of $l$ observations $y_1^l = y_1, \ldots, y_l$ being generated by a model unit $i$, according to the density
$$P(y_1, \ldots, y_l, l \mid i) = P(y_1, \ldots, y_l \mid l, i) \cdot P(l \mid i) = b_{i,l}(y_1^l) \cdot P(l \mid i).$$
The model for segment i thus has two components:
• A duration distribution $P(l \mid i)$ which specifies the likelihood for a segment duration of $l$ frames, where $l \in \mathcal{L}$, the set of all possible durations.
• A family of output densities $\{b_{i,l}(y_1^l);\ l \in \mathcal{L}\}$ which represents likelihoods of all possible observation sequences, where these sequences are of variable duration. The model can optionally include an explicit representation of the effect of variation in segment length $l$ on the trajectory realisation.
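The two components can be combined as in the following Python sketch. It is only one possible instance of a segment model and rests on assumptions that are not part of the text: features are one-dimensional, frames are independent Gaussians around a linear trajectory stretched over the segment (so that the mean at each frame depends on the duration $l$), and the duration distribution is a lookup table. All names and parameters are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def segment_logprob(frames, duration_pmf, start, end, var):
    """log P(y_1..y_l, l | i) = log b_{i,l}(y_1^l) + log P(l | i)."""
    l = len(frames)
    p_dur = duration_pmf.get(l, 0.0)
    if p_dur == 0.0:
        return -math.inf  # duration outside the set of possible durations
    log_b = 0.0
    for k, y in enumerate(frames):
        # Assumed linear trajectory from `start` to `end`, stretched to length l.
        pos = k / (l - 1) if l > 1 else 0.5
        mean = start + (end - start) * pos
        log_b += gaussian_logpdf(y, mean, var)
    return log_b + math.log(p_dur)

# Hypothetical unit with durations of 2 or 3 frames.
score = segment_logprob([0.0, 0.5, 1.0], {2: 0.5, 3: 0.5}, 0.0, 1.0, 1.0)
```

Stretching the trajectory over the segment is one simple way of letting the segment length influence the trajectory realisation; other choices of output density fit the same interface.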
3.2.1 Segment-model notation and probability calculations
A segment model can be expressed as an extended, more general form of HMM, whereby a model unit $M$ is represented by a sequence of $N$ segments corresponding to states $\sigma_i$ ($i = 1, \ldots, N$) in a Markov chain. The generative process can be visualised as follows: at some time $t_1$, the process enters state $\sigma_i$ and randomly selects a duration $l$ according to the duration distribution $P(l \mid i)$. It then produces a sequence of $l$ acoustic vectors according to the pdf $b_{i,l}$. Then at time $t_1 + l$ it randomly moves from state $\sigma_i$ to state $\sigma_j$ according to a state transition probability matrix $A$.
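The generative process just described can be sketched in Python as follows. This is an illustrative sketch rather than a definitive implementation: the dictionary-based representation of the transition matrix and duration distributions, the `emit` callback standing in for the output pdf, and the toy parameters are all assumptions made here.

```python
import random

def generate(transitions, duration_pmfs, emit, start_state, n_segments, rng):
    """Sample frames from a segment model; returns (frames, segmentation)."""
    state, frames, segmentation = start_state, [], []
    for _ in range(n_segments):
        # Select a duration l according to the duration distribution P(l | state).
        durations = list(duration_pmfs[state])
        weights = [duration_pmfs[state][d] for d in durations]
        l = rng.choices(durations, weights=weights)[0]
        # Produce l acoustic vectors from the segment's output density.
        frames.extend(emit(state, l, rng))
        segmentation.append((state, l))
        # Move to the next state according to the transition probabilities.
        next_states = list(transitions[state])
        next_weights = [transitions[state][s] for s in next_states]
        state = rng.choices(next_states, weights=next_weights)[0]
    return frames, segmentation

def emit(state, l, rng):
    """Hypothetical output density: l scalar frames scattered around the state index."""
    return [rng.gauss(float(state), 0.1) for _ in range(l)]

rng = random.Random(0)
transitions = {1: {2: 1.0}, 2: {2: 1.0}}              # state 1 -> 2, then self-loop
duration_pmfs = {1: {2: 0.5, 3: 0.5}, 2: {3: 0.5, 5: 0.5}}
frames, segmentation = generate(transitions, duration_pmfs, emit, 1, 3, rng)
```

With these toy parameters the state sequence is 1, 2, 2, matching the topology illustrated in Figure 3.1(b), while the number of frames emitted depends on the sampled durations.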
Consider an observation sequence $y = y_1, \ldots, y_T$ and a particular state sequence $x = x_1, \ldots, x_S$ ($x_s \in \{\sigma_1, \ldots, \sigma_N\}$) with $S$ distinct state occupancies. Each state occupancy $x_s$ is associated with an entry time $t_s$ ($t_1 = 1$; $1 < t_s \leq T$ for $s = 2, \ldots, S$) and the sum of the durations of the individual state occupancies must be equal to $T$. Assuming that there is only one possible starting state $x_1$, the joint probability of $y$ and $x$ given model $M$ is:
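$$P(y, x \mid M) = b_{x_1,(t_2 - t_1)}(y_{t_1}, \ldots, y_{t_2 - 1}) \cdot P((t_2 - t_1) \mid x_1) \prod_{s=2}^{S} a_{x_{s-1} x_s} \cdot b_{x_s,(t_{s+1} - t_s)}(y_{t_s}, \ldots, y_{t_{s+1} - 1}) \cdot P((t_{s+1} - t_s) \mid x_s).$$
3.2.2 Recognition and training with segment models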
Although the details depend on the characteristics of individual segment models, the principles of training and recognition extend quite straightforwardly from HMMs to segment-based models (Ostendorf, Digalakis and Kimball, 1996). As already explained, the recognition task for a statistical model involves finding the most likely sequence of labels given a sequence of acoustic feature vectors, the probability of which can be separated into a probability for the language model and a probability for the acoustic model. The usual approach to computing the acoustic-model probability for a given model, $P(y \mid M)$, is to use Viterbi decoding to compute the joint probability $P(y, x \mid M)$ for the most likely state sequence, thus:
$$P(y \mid M) = \max_{x} P(y, x \mid M).$$
The above probability can be computed efficiently by dynamic programming, thus: first define $\phi_t(j)$ to be the probability of the first $t$ observations $y_1, \ldots, y_t$ for the most likely state sequence (with associated segmentation) ending with the $j$th state at time $t$ such that a transition occurs between times $t$ and $t+1$. The quantity $\phi_t(j)$ can be calculated recursively as follows,
$$\phi_t(j) = \max_{t - D \,\leq\, \delta \,<\, t} \left[ \max_{i} \left\{ \phi_\delta(i) \cdot a_{ij} \cdot b_{j,(t-\delta)}(y_{\delta+1}, \ldots, y_t) \cdot P((t - \delta) \mid j) \right\} \right] \qquad (3.1)$$
where $D$ represents the maximum segment duration (which may be state-dependent). The value of $P(y \mid M)$ is then given by the highest value of $\phi_T(j)$, considered for all possible ending states in the model $M$. The dynamic programming operation can also be applied across model units (with the inclusion of language-model probabilities), so enabling the most likely unit sequence to be identified.
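A minimal Python sketch of recursion (3.1) in the log domain is given below. Several details are assumptions made for illustration and are not taken from the text: the segment score (output density times duration probability) is supplied by a hypothetical callback `seg_logprob(j, delta, t)` covering frames $\delta+1, \ldots, t$, a single maximum duration `D` is shared by all states, and the first segment is scored with a hypothetical initial log-probability table `log_init` rather than a single fixed starting state.

```python
import math

def segmental_viterbi(T, states, log_a, log_init, seg_logprob, D):
    """Return (best log P(y, x | M), phi) using recursion (3.1) in the log domain.

    phi[t][j] is the best log-probability of y_1..y_t for a segmentation whose
    last segment is in state j and ends at time t.
    """
    neg_inf = -math.inf
    phi = [{j: neg_inf for j in states} for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in states:
            best = neg_inf
            # delta is the end time of the previous segment; duration is t - delta.
            for delta in range(max(0, t - D), t):
                seg = seg_logprob(j, delta, t)
                if delta == 0:
                    cand = log_init[j] + seg          # first segment
                else:
                    cand = max(phi[delta][i] + log_a[i][j] for i in states) + seg
                best = max(best, cand)
            phi[t][j] = best
    return max(phi[T].values()), phi

# Toy example: one state with a self-loop, every segment scoring log 0.5.
states = [0]
log_a = {0: {0: 0.0}}
log_init = {0: 0.0}
flat_score = lambda j, delta, t: math.log(0.5)
best_long, _ = segmental_viterbi(2, states, log_a, log_init, flat_score, D=2)
best_short, _ = segmental_viterbi(3, states, log_a, log_init, flat_score, D=1)
```

In this sketch the extra loop over `delta` multiplies the number of recursion steps by up to `D` relative to a frame-based Viterbi pass, before counting the cost of evaluating each segment score.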
The Viterbi recognition algorithm is more computationally intensive for segment models than for HMMs, as can be seen by comparing the above expression (3.1) for $\phi_t(j)$ with the corresponding HMM-based expression, which is as follows: