Models using conditional observation probabilities

Approaches to modelling feature dynamics

4.5. Models using conditional observation probabilities

The continuous dynamic nature of speech is such that the highest correlations exist between observations which are close together in time. It is thus reasonable to restrict the representation of correlation to a specific time-span, which overcomes the problems of training and duration variability which are associated with full-correlation models.

4.5.1 HMMs incorporating dependence on previous observations

A number of authors have described extensions to HMMs for incorporating direct (linear) dependencies on one or more previous observations. Paliwal (1993) augmented discrete HMMs with state-conditioned transition probabilities between successive vector-quantized observations. For continuous HMMs using Gaussian distributions, Ostendorf et al. (1996) suggested the term “conditionally Gaussian HMMs”. Early work on these linear predictive models has been described by Brown (1987) and by Wellekens (1987), who both presented models in which the emission probability takes into account the correlation with the observation vector at the previous frame time. Brown tested this model in recognition experiments on the E-set, but with disappointing results. Kenny, Lennig and Mermelstein (1990) studied a linear-predictive model which incorporated correlations with previous observations at a number of different time lags. When using only static features in the front-end analysis, they found advantages from incorporating multiple time lags of up to 80 ms. However, the best linear predictive model did not perform as well as a conventional HMM which incorporated differential features in the front-end representation, with these features calculated by computing the difference in the relevant feature value over an interval of 40 ms. Woodland (1992) reports experiments using a linear predictive model which was similar to the one adopted by Kenny et al. Using both static and dynamic features, recognition performance was improved by incorporating a predictor with an offset of 30 ms. This improvement was presumably due to the incorporation of longer time dependencies than those represented in the difference features, in addition to the modelling of correlations between the difference features. A somewhat different approach was suggested by Takahashi, Matsuoka, Minami and Shikano (1993), by modifying HMM output probabilities according to separately-trained bigrams.

These bigrams were conditional probability distributions representing phoneme-dependent correlations between adjacent frames. Incorporating correlation constraints improved the performance of a speaker-independent recognition system based on cepstrum features and their time derivatives.
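The linear-predictive emission probabilities discussed above can be illustrated with a minimal scalar sketch. It assumes a single Gaussian per state whose mean is shifted by a weighted sum of the deviations of lagged observations from the state mean. The function names, the `lag_coeffs` dictionary and the scalar simplification are illustrative assumptions, not the formulation of any of the cited authors.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a scalar Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def conditional_emission_logprob(history, y_t, state_mean, lag_coeffs, var):
    """Log p(y_t | earlier frames, state) for a scalar linear-predictive emission.

    history    : earlier observations, most recent last
    lag_coeffs : hypothetical dict mapping a lag (in frames) to a coefficient
    Lags that reach back before the start of the history are skipped.
    """
    predicted = state_mean
    for lag, coeff in lag_coeffs.items():
        if lag <= len(history):
            # Shift the mean by the weighted deviation of the lagged frame.
            predicted += coeff * (history[-lag] - state_mean)
    return gaussian_logpdf(y_t, predicted, var)


# With no lag coefficients the score is that of a standard
# independent-frame Gaussian emission.
baseline = conditional_emission_logprob([1.0, 1.2], 1.4, 0.0, {}, 1.0)
predictive = conditional_emission_logprob([1.0, 1.2], 1.4, 0.0, {1: 0.8, 2: 0.1}, 1.0)
```

In this sketch a smoothly rising sequence is scored more favourably by the predictive emission than by the baseline, because the predicted mean follows the recent frames.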

The models described above are all based on the concept of predicting a current frame from one or more previous frames covering a fixed time interval, and are therefore unable to accommodate variations in speaking rate. Ming and Smith (1996) described a model which overcomes this problem, by using a weighted mixture distribution to represent dependence on multiple previous frames, with the chosen time lags for any one observation sequence being optimised as part of both training and recognition. In a speaker-independent recognition task for an E-set vocabulary, this new model was shown to outperform both a linear-predictive HMM and a standard independent-frame HMM.
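The idea of a weighted mixture over several candidate time lags can be sketched as follows. This is only an illustration of the general principle under simplifying assumptions (scalar observations, one linear predictor per lag, fixed weights). The component layout and the names `lag_mixture_density` and `dominant_lag` are hypothetical and do not reproduce the published model of Ming and Smith.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def component_terms(history, y_t, mean, components):
    """Weighted density contributed by each (weight, lag, coeff, var) component.

    Every lag is assumed to be no longer than the available history.
    """
    terms = []
    for weight, lag, coeff, var in components:
        predicted = mean + coeff * (history[-lag] - mean)
        terms.append((lag, weight * gaussian_pdf(y_t, predicted, var)))
    return terms


def lag_mixture_density(history, y_t, mean, components):
    """Mixture density p(y_t | previous frames), summed over candidate lags."""
    return sum(term for _, term in component_terms(history, y_t, mean, components))


def dominant_lag(history, y_t, mean, components):
    """Lag whose component contributes most to the mixture for this frame."""
    return max(component_terms(history, y_t, mean, components), key=lambda p: p[1])[0]


components = [(0.5, 1, 1.0, 1.0), (0.5, 2, 1.0, 1.0)]
history = [0.0, 1.0, 2.0]
lag_for_repeat_of_last = dominant_lag(history, 2.0, 0.0, components)
lag_for_repeat_of_earlier = dominant_lag(history, 1.0, 0.0, components)
```

Because the best-matching lag can differ from frame to frame, a mixture of this kind is not tied to one fixed prediction interval.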

An approach which is related to predictive HMMs is the theory developed by Saerens (1993a, 1993b, 1995) to extend HMMs to represent a continuous dynamic process within a state. Motivated by the inherently dynamic and continuous nature of speech production, he suggested a model which was formulated in the continuous-time domain (Saerens, 1993a), rather than on the basis of sampled time as in the majority of models. In further developments (Saerens, 1993b, 1995), he extended the model to represent the process of generating acoustic vectors by a first-order linear stochastic differential equation. The approach involves assigning a probability density to a continuous path of the acoustic vector, where the conditional probability at any particular point in time is determined entirely by the most recent observation. It can thus be regarded as a continuous-time version of linear-predictive (or autoregressive) modelling. The modelling of speech in a continuous-time domain is an interesting and well-motivated idea, but it is very difficult to make judgements about whether this model is beneficial in practice as no experimental work appears to have been reported.
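The link between a first-order linear stochastic differential equation and autoregressive modelling can be made concrete with a scalar example. The specific form below (a mean-reverting process with rate $\alpha > 0$, level $\mu$, noise scale $\sigma$ and Wiener process $W$) is an assumed illustration, and these symbols are not taken from Saerens. Sampling the process at a frame interval $\Delta$ gives a conditionally Gaussian first-order autoregressive model.

```latex
% Assumed scalar first-order linear SDE (mean-reverting form)
dy(t) = -\alpha \,\bigl(y(t) - \mu\bigr)\,dt + \sigma\, dW(t)

% Conditional distribution of the process one frame interval later
y(t+\Delta) \mid y(t) \;\sim\; \mathcal{N}\!\Bigl(
    \mu + e^{-\alpha\Delta}\bigl(y(t) - \mu\bigr),\;
    \frac{\sigma^{2}}{2\alpha}\bigl(1 - e^{-2\alpha\Delta}\bigr)
\Bigr)
```

Under these assumptions the discrete-time prediction coefficient is $e^{-\alpha\Delta}$, so the dependence on the previous frame weakens as the frame interval grows.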

Overall, results with HMMs incorporating dependence on previous observations suggest there are some advantages in modelling correlations, provided that the correlation is measured over an appropriate time interval to give a reliable indication of local dynamics. However, there are disadvantages associated with the increased number of free parameters in the models. From the viewpoint of developing an appropriate model for speech, these extended HMMs are not powerful enough, as the approach retains the basic HMM framework and only considers previous observations without considering previous state occupancies (and duration of the occupancies). The model thus cannot fully represent the changing pattern over several frames which could reasonably correspond to a single state occupancy.

4.5.2 Segment models involving dependence on previous observations

Within a segment modelling framework, the simplest model of correlation is to make the probability of each observation depend on the identity of the immediately-preceding observation, as follows:

$$\prod_{t=1}^{T} P(y_t \mid y_{t-1})$$

By conditioning the observation probabilities on the observation at the previous frame time, this model thus imposes a Markov restriction on the temporal covariance matrix, and has been referred to as a Gauss-Markov segment model (e.g. Ostendorf et al., 1996). Digalakis, Ostendorf and Rohlicek (1989) investigated a Gauss-Markov version of the SSM. This model was regarded as a compromise between the full-covariance and the independent-frame SSM, by incorporating some time correlation but with a manageable number of parameters. Digalakis et al. compared the performance of different versions of the SSM on a TIMIT classification task using varying numbers of features obtained by linear discriminant analysis. For very small numbers of features, the full covariance model provided the best performance. For intermediate numbers of features, performance was best for the Gauss-Markov assumption. However, the best overall performance was obtained by using a large number of features and the independent-frame model. These results suggest that the extra features provided useful additional information for discrimination, but that it was difficult to train robust models for even quite limited correlation assumptions when the number of parameters became large.
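The difference between the independent-frame and Gauss-Markov assumptions can be sketched by scoring one segment under each. The sketch assumes scalar observations, a single mean and variance for the whole segment, and a stationary first-order dependence with coefficient `coeff` of magnitude below one. These are simplifications for illustration. In particular, scoring the first frame with the marginal Gaussian is an assumption, and the SSM itself is not restricted to one distribution per segment.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def independent_frame_loglik(segment, mean, var):
    """Segment log-likelihood when every frame is scored independently."""
    return sum(gaussian_logpdf(y, mean, var) for y in segment)


def gauss_markov_loglik(segment, mean, var, coeff):
    """Segment log-likelihood when each frame is conditioned on its predecessor.

    The first frame uses the marginal Gaussian (an assumption of this sketch).
    For a stationary process with marginal variance `var`, the conditional
    variance is var * (1 - coeff**2), which requires abs(coeff) < 1.
    """
    total = gaussian_logpdf(segment[0], mean, var)
    innovation_var = var * (1.0 - coeff ** 2)
    for prev, cur in zip(segment, segment[1:]):
        predicted = mean + coeff * (prev - mean)
        total += gaussian_logpdf(cur, predicted, innovation_var)
    return total


smooth_segment = [1.0, 1.1, 1.2, 1.3]
independent_score = independent_frame_loglik(smooth_segment, 0.0, 1.0)
markov_score = gauss_markov_loglik(smooth_segment, 0.0, 1.0, 0.9)
```

For a slowly varying segment such as `smooth_segment`, the Gauss-Markov score is higher because each frame is well predicted by its predecessor. Setting `coeff` to zero recovers the independent-frame score.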

In experiments comparing classification rates as a function of number of cepstral coefficients, Digalakis (1992) reported that the Gauss-Markov model outperformed the independent-frame model when only static features were used. However, when derivatives (computed by linear regression over 5 frames) of the cepstra were included in the feature set, the independent-frame model gave better performance than the Gauss-Markov model either with or without derivatives, even for small numbers of features. As pointed out by Digalakis, these results suggest that there are problems with the Gauss-Markov segment model which are not associated with training set size. There are likely to be problems with accommodating non-linearities in the correlations near the segment boundaries, particularly for derivative features which will be computed using features from neighbouring phones. There may also be lack of robustness in the model of dependencies, so that it does not generalise very well to any mismatches between the training and test data, which may cause the exact nature of the correlations to be different.
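Derivative features computed by linear regression over 5 frames can be sketched with the standard regression-slope formula over a window of two frames on either side. The function name and the handling of the ends of the sequence (repeating the first and last frames) are assumptions of this sketch, not details given in the cited work.

```python
def regression_deltas(frames, half_window=2):
    """Regression-slope (delta) features for a sequence of scalar feature values.

    The window covers 2 * half_window + 1 frames, so the default is 5 frames.
    Frames beyond either end are replaced by the nearest end frame
    (an assumption of this sketch).
    """
    n = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(n):
        num = 0.0
        for k in range(1, half_window + 1):
            ahead = frames[min(t + k, n - 1)]
            behind = frames[max(t - k, 0)]
            num += k * (ahead - behind)
        deltas.append(num / denom)
    return deltas


# Two idealised, perfectly flat "phones" with an abrupt boundary between them.
two_phones = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0]
boundary_deltas = regression_deltas(two_phones)
```

In this example the last two frames of the first phone receive non-zero deltas even though that phone is flat, because the regression window reaches into the following phone. This is the boundary effect referred to above.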

An extension to a parametric constrained mean trajectory model to incorporate dependence on previous observations has been reported by Deng and his colleagues (Deng, 1993; Deng and Rathinavelu, 1995). Deng and Rathinavelu include the results of experiments on a speaker-dependent isolated CVC word small-vocabulary recognition task using whole-word models. The experiments used mel-cepstrum coefficients and their time differences and provide the first positive results for a Gauss-Markov assumption with delta features. The combination of a linear trajectory mean with a Gauss-Markov assumption gave better results than using either assumption alone, which in turn performed better than a conventional HMM with the same number of states. Results were not however reported for more than four states per word, and it seems likely that the conventional HMM in particular would continue to show improvements as the number of states is increased. It should also be noted that this was a highly simplified experiment, and the findings may not extrapolate to typical sub-word modelling tasks where a variety of different contexts necessarily have to be represented within a single model.
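One way to picture the combination of a linear trajectory mean with a Gauss-Markov assumption is to let the deviation from a linear trend follow a first-order autoregressive process. The scalar form below is an assumed illustration: the symbols $b_0$, $b_1$, $a$ and $\sigma^2$ are introduced here, and the published formulation may differ in detail.

```latex
% Assumed linear trajectory mean within a state, with t counted from state entry
\mu_t = b_0 + b_1 t

% Assumed first-order dependence of the deviation from the trajectory
y_t = \mu_t + a\,\bigl(y_{t-1} - \mu_{t-1}\bigr) + e_t,
\qquad e_t \sim \mathcal{N}(0, \sigma^{2})
```

Setting $a = 0$ leaves only the linear trajectory mean, setting $b_1 = 0$ leaves only the first-order dependence, and setting both to zero gives a conventional Gaussian emission with constant mean $b_0$.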

4.5.3 Discussion

Overall, it is evidently possible to obtain recognition performance advantages by modelling direct dependence of a current observation on one or more specific previous observations. However, the varying levels of success which have been achieved with this class of model indicate that there are problems with robustness. For any one model unit, the underlying assumption is that a single function can describe the relationship between frames in a sequence. Although this may be a reasonable assumption for underlying trends in a sequence, specific observed sequences will in general show considerable variation from the trend, due to approximations in the modelling assumptions, differences between speakers and recording conditions, and other random variations.
