
Approaches to modelling feature dynamics

4.8. Models based on multiple trajectories

4.8.1 Segment-level mixture of trajectories

Gaussian mixture distributions have been shown to be highly successful for HMMs (e.g. Young, Odell and Woodland, 1994), and the analogous segmental model is a Gaussian mixture at the level of the parameters specifying the trajectory, with each component representing one trajectory. The components are combined with mixture weights that correspond to the probability of observing a particular trajectory in a segment. The probability of an observed segment $y$ given duration $l$ and segment model $i$ with $K$ mixture components can thus be expressed as

$$
b_{i,l}(y) = \sum_{k=1}^{K} p(c_k \mid i)\, p(y \mid c_k, i, l)
           = \sum_{k=1}^{K} p(c_k \mid i)\, b_{i,l}(y \mid c_k).
$$

For each component $c_k$, the probability of the complete segment is given by $b_{i,l}(y \mid c_k)$ and the mixture weight is expressed as the probability $p(c_k \mid i)$. Assuming that the benefits of incorporating Gaussian mixture distributions at the frame level result from systematic variation in speech, there could be a greater advantage from incorporating a mixture at the segment level, as the mixture component is then constrained to be constant across a segment. In contrast, the frame-level model allows the mixture component to change randomly at each time frame.
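The difference between the two kinds of mixture can be illustrated with a minimal sketch. It assumes, purely for illustration, that frames are independent given the component, that each component trajectory is supplied as a sequence of per-frame means already sampled at the segment duration, and that all components share one diagonal variance. The function names are hypothetical and do not come from any of the cited systems.

```python
import numpy as np

def frame_loglik(y, mean, var):
    """Log density of each frame under a diagonal-covariance Gaussian; shape (T,)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var, axis=-1)

def logsumexp(a, axis=0):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def segment_mixture_loglik(y, log_weights, means, var):
    """Segment-level mixture: one component is held fixed for the whole segment.

    y: (T, D) observed frames; log_weights: (K,); means: (K, T, D); var: (D,).
    Computes log sum_k p(c_k | i) * prod_t p(y_t | c_k).
    """
    per_component = np.array([frame_loglik(y, mu, var).sum() for mu in means])  # (K,)
    return logsumexp(log_weights + per_component)

def frame_mixture_loglik(y, log_weights, means, var):
    """Frame-level mixture: the component may differ at every frame.

    Computes log prod_t sum_k p(c_k | i) * p(y_t | c_k).
    """
    per_frame = np.array([frame_loglik(y, mu, var) for mu in means])  # (K, T)
    return logsumexp(log_weights[:, None] + per_frame).sum()
```

For a segment whose first frame matches one component and whose second frame matches another, the frame-level mixture scores the segment far more highly than the segment-level mixture, because only the frame-level model may switch component within the segment.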

Mixtures of parametric trajectories

Both Gish and Ng (1996) and Fukada et al. (1997) have demonstrated advantages to including multiple mixture components in parametric constrained mean trajectory segment models, based on TIMIT vowel classification tasks. Gish and Ng (1993) have also applied a mixture model to a secondary processing task, for rescoring the output of an HMM word spotter to improve detection performance. Each keyword was represented by two Gaussian-mixture segment models, with one model representing segments from true keywords and the other representing segments from false alarms. In this application, the mixture components were thus used to represent completely different phonetic events within words, rather than to represent different realisations of some single phonetic event. Liu, Wang, Soong and Huang (1995) and Liu and Wang (1996) applied a similar approach to speaker verification, whereby each speaker was represented by a model describing a segment of speech with a mixture of polynomial trajectories (without explicit association with phonetic identity). On a text-independent speaker verification task using isolated digit data, the segment-based approach was shown to outperform a conventional frame-based probabilistic model. A third-order polynomial was found to give the best performance.
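As an illustration of what a polynomial trajectory involves, the following sketch fits a single polynomial mean trajectory to one segment by least squares. It assumes that time is normalised to the interval [0, 1] within the segment and that each feature dimension is fitted independently; it shows only the trajectory fit, not the mixture training used in the cited work, and the function name is hypothetical.

```python
import numpy as np

def fit_polynomial_trajectory(segment, order=3):
    """Least-squares polynomial fit to a (T, D) segment over normalised time.

    Returns the coefficients, shape (order + 1, D), lowest power first,
    and the fitted trajectory, shape (T, D).
    """
    segment = np.asarray(segment, dtype=float)
    n_frames = segment.shape[0]
    # Assumption: frames are equally spaced on the normalised interval [0, 1].
    t = np.linspace(0.0, 1.0, n_frames)
    # Design matrix with columns 1, t, t^2, ..., t^order.
    design = np.vander(t, order + 1, increasing=True)
    coeffs, *_ = np.linalg.lstsq(design, segment, rcond=None)
    return coeffs, design @ coeffs
```

Because time is normalised, segments of different durations are described by coefficient vectors of the same size, which is what allows a mixture to be defined over the trajectory parameters.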


Mixtures of non-parametric trajectories

Models using a mixture of non-parametric trajectories have been proposed by Gong and Haton (1994), and as an extension to the independent-frame stochastic segment model (SSM) (Kimball and Ostendorf, 1993; Kimball, 1994). These two proposals are discussed below.

The approach suggested by Gong and Haton (1994) is based on the concept of using mixtures of trajectories to accommodate variation due to phonetic context effects within context-independent phoneme-sized models. As with the original version of Boston University’s stochastic segment model, Gong and Haton computed probabilities in terms of a fixed-length observation sequence, with a time-warping transformation applied first to transform a variable-length observed segment into the fixed-length sequence. The original work used a linear sampling technique (Gong and Haton, 1994), and the incorporation of a nonlinear warping function (Afify, Gong and Haton, 1994) resulted in only minor improvements in performance. The modelling framework differs from most segment-based approaches in the nature of the dynamic-programming search algorithm used in connected-speech recognition. For each frame, a measure of phoneme plausibility (using a log probability measure) was computed for the fixed-length observation sequence, maximised over all possible actual segment durations centred on the current frame. Sentence recognition was then performed using dynamic programming, with the plausibility of a particular phoneme occurring between two specified frames computed from the sum of the individual frame-based plausibilities and a duration probability measure. Because the recognition stage was based on a measure of plausibility for each frame, this approach does not require a heuristic weighting to compensate for the modelling of fixed-length observation sequences. However, the separate computation of a segment-based probability measure for every frame involves modelling every frame as occurring in the central position within a phoneme (and generally in other positions as well), which seems inappropriate from a speech-modelling viewpoint.
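The mapping from a variable-length segment to a fixed-length sequence can be sketched as follows. This is one plausible reading of a linear time warping, assuming equally spaced sample positions with linear interpolation between neighbouring frames; the cited papers may instead have selected or averaged frames, and the function name is hypothetical.

```python
import numpy as np

def linear_resample(segment, m):
    """Map a (T, D) segment onto m equally spaced points in time.

    Assumption: values between frames are obtained by linear interpolation.
    """
    segment = np.asarray(segment, dtype=float)
    n_frames = segment.shape[0]
    # Fractional frame positions spanning the first to the last frame.
    pos = np.linspace(0.0, n_frames - 1, m)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)  # clamp at the final frame
    frac = (pos - lo)[:, None]
    return (1.0 - frac) * segment[lo] + frac * segment[hi]
```

Once every segment has been mapped to `m` points in this way, each of the `m` positions can be given its own distribution regardless of the original duration, which is why duration then has to be modelled separately.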

Gong and Haton (1994) performed speaker-dependent recognition experiments on an alphabet recognition task, modelling each phoneme with a five-state model using up to four trajectory components to describe mel cepstrum features. Recognition performance of the context-independent stochastic trajectory model was better than that of a discrete whole-word HMM, and comparable with that of a continuous-density whole-word HMM. Afify, Gong and Haton (1996) extended the model to explicitly represent time correlation with a dynamical system model. This extended model is similar to the dynamical system approach of Digalakis (1992), with the extension to a mixture of underlying trajectories. However, Afify et al. retained the search strategy and fixed-length sequence modelling approach used by Gong and Haton (1994). In a speaker-dependent continuous speech recognition task, representing each phone by a Gaussian mixture of up to eight five-state trajectories, overall recognition performance was improved by incorporating the model of correlation in the stochastic trajectory model. However, neither these nor the earlier experiments by Gong and Haton include comparisons with context-independent phoneme-based Gaussian-mixture HMMs, or with using time derivatives in the HMMs.

A mixture version of the SSM has been shown to outperform both single constrained mean and frame-level mixture models for context-independent phone modelling (Kimball and Ostendorf, 1993; Kimball, 1994). For context-dependent modelling, however, a frame-level mixture model gave the best performance, presumably because there were problems with training the trajectory mixture models. One way of reducing the number of free parameters in segmental mixture models is to represent the mixtures at the sub-segment level, which allows for sharing of common sub-components within different trajectories for a phone or even across different phones. The use of sub-segment trajectory mixtures in the independent-frame SSM was studied by Digalakis (1992), and further explored by Kannan and Ostendorf (1993). The segment probability calculation involved finding the most likely sub-segment sequence for a given phone, so in effect only the dominant mixture components contributed for any one acoustic match. When tested on context-independent phone modelling with cepstrum features, using multiple mixture components at the sub-segment level was shown to improve recognition performance, and to provide better performance than frame-level mixtures. However, Digalakis also demonstrated that his dynamical system model performed even better, for comparable numbers of parameters. This finding suggests that, at least for a limited number of available parameters, a model of short-term dependencies is more beneficial than a general representation of longer-term dependencies. On the other hand, when derivative features were included in the feature set, the sub-segment mixture model performed better than the dynamical system model. This finding suggests that it is useful to model long-term dependencies in addition to the short-term correlations. However, attempts at incorporating sub-segment trajectory mixtures in a context-dependent version of the SSM were not successful (Kannan and Ostendorf, 1993), which could be further evidence of difficulties in training these models.
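The idea of scoring a segment by its most likely sub-segment sequence can be sketched as below. This is a deliberately simplified illustration and not the procedure of Digalakis (1992) or Kannan and Ostendorf (1993): it assumes that the sub-segment boundaries are given, that each sub-segment component has a single constant mean with a shared diagonal variance, and that the component for each sub-segment is chosen independently of the others. All names are hypothetical.

```python
import numpy as np

def frame_loglik(y, mean, var):
    """Log density of each frame under a diagonal-covariance Gaussian; shape (T,)."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var, axis=-1)

def best_subsegment_sequence(y, bounds, log_weights, means, var):
    """Score a segment using only the best component in each sub-segment.

    y: (T, D) frames; bounds: list of (start, end) frame index pairs;
    log_weights: (S, K); means: (S, K, D); var: (D,).
    Returns the total log score and the chosen component per sub-segment.
    """
    total = 0.0
    choices = []
    for s, (start, end) in enumerate(bounds):
        frames = y[start:end]
        # Log weight plus summed frame log likelihood for every component.
        scores = np.array([
            log_weights[s, k] + frame_loglik(frames, means[s, k], var).sum()
            for k in range(means.shape[1])
        ])
        k_best = int(np.argmax(scores))  # only the dominant component contributes
        choices.append(k_best)
        total += scores[k_best]
    return total, choices
```

Replacing the sum over components by a maximum in this way means that a component shared across several trajectories, or across phones, is stored once, which is the source of the parameter saving described above.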


4.8.2 Alternative state sequences for representing different trajectories

In document Modelling Segmental Variability for Automatic Speech Recognition (Page 66-70)