
Approaches to modelling feature dynamics

4.8 Models based on multiple trajectories

4.8.4 Interpolation-based models

Another way of representing a number of trajectories within a single model is to describe feature dynamics by some form of interpolation between the means of successive HMM states. A simple form of state interpolation was used by Deng, Kenny, Lennig and Mermelstein (1992), who modelled the transition region between a vowel and an adjacent consonant by a sequence of two states with means obtained by interpolation between the mean vector for the vowel steady-state and a vector representing the “locus” (target spectrum) of the consonant. Linear interpolation was used, with the consonant locus assigned weights of 1/3 and 2/3 for the first and second state respectively. The locus vectors were separate from the consonant and vowel models used for the recognition, but were automatically trained by an extension to the Baum-Welch re-estimation algorithm. This approach provides a method for using additional states to construct a rather crude representation of a mean trajectory, which can be applied to model any consonant-vowel or vowel-consonant transition. Experiments on very large vocabulary isolated-word recognition demonstrated some small recognition improvements over conventional context-independent HMMs. When compared with context-dependent HMMs, the relative performance depended on the size of the training set: the state-interpolation HMMs were better for small amounts of training data but, as training set size increased, the context-dependent models gave considerably better performance.
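The state-interpolation scheme can be illustrated with a short numerical sketch. The function name `transition_state_means` and the two-dimensional example vectors are hypothetical and are not taken from Deng et al.; only the locus weights of 1/3 and 2/3 come from the description above. The sketch covers the means only and does not attempt the Baum-Welch training of the locus vectors.

```python
import numpy as np


def transition_state_means(vowel_mean, locus, locus_weights=(1 / 3, 2 / 3)):
    """Return one interpolated mean vector per transition state.

    Each mean is (1 - w) * vowel_mean + w * locus, where w is the weight
    assigned to the consonant locus for that state.
    """
    vowel_mean = np.asarray(vowel_mean, dtype=float)
    locus = np.asarray(locus, dtype=float)
    return [(1.0 - w) * vowel_mean + w * locus for w in locus_weights]


# Hypothetical feature vectors, chosen only to make the arithmetic easy.
vowel_mean = np.array([3.0, 0.0])
locus = np.array([0.0, 6.0])

means = transition_state_means(vowel_mean, locus)
# means[0] lies one third of the way from the vowel mean to the locus,
# means[1] lies two thirds of the way.
for i, m in enumerate(means, start=1):
    print(f"transition state {i}: {m}")
```

With these inputs the two transition-state means are (2, 2) and (1, 4), so the pair of states gives a two-step approximation to a straight line between the vowel steady-state and the consonant locus.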

Wiewiorka and Brookes (1996) have proposed a model which makes greater use of interpolation for trajectory modelling by applying it to describe observation sequences within a state occupancy as an alternative to simply introducing additional states. The mean vector within a state was modelled as following a negative exponential function defined for any one frame time as being a weighted combination of the mean at the previous frame time and the “target” mean associated with the current state. The negative exponential modelling was optionally also applied to the inverse covariance matrix. The advantage of this approach is that trajectories are defined across as well as within segment boundaries. However, a complication is that the trajectory corresponding to any one state occupancy depends on the sequence of previous states, and thus the optimal path cannot in general be identified until the end of the utterance. Various approximate solutions to this problem were tried, all involving making a decision on state exit after looking ahead for a fixed number of frames and only considering the most likely future paths. A different model was also tried, which avoided making approximations by using fixed state-dependent exponentials defined only in terms of the previous and current state means, rather than calculated means for any particular sequence of state occupancies. This assumption simplifies the modelling but introduces some discontinuities at state boundaries.
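The dependence of the trajectory on the preceding state sequence can be seen in a minimal sketch of the recursion. The exact parameterisation used by Wiewiorka and Brookes is not given here, so the form mean[t] = alpha * mean[t-1] + (1 - alpha) * target, the single shared coefficient `alpha`, the function name `exponential_mean_trajectory` and the scalar targets are all assumptions made for illustration.

```python
import numpy as np


def exponential_mean_trajectory(state_sequence, targets, alpha, initial_mean):
    """Compute the frame-by-frame mean for a given state sequence.

    Assumed recursion (illustrative only):
        mean[t] = alpha * mean[t-1] + (1 - alpha) * targets[state_sequence[t]]
    """
    mean = np.asarray(initial_mean, dtype=float)
    trajectory = []
    for state in state_sequence:
        target = np.asarray(targets[state], dtype=float)
        # A new array is created at every frame, so stored frames are not aliased.
        mean = alpha * mean + (1.0 - alpha) * target
        trajectory.append(mean)
    return np.stack(trajectory)


# Hypothetical one-dimensional targets for two states.
targets = {0: [0.0], 1: [8.0]}

# Two candidate paths that both end in state 1 but enter it at different times.
traj_a = exponential_mean_trajectory([0, 0, 1, 1], targets, alpha=0.5, initial_mean=[0.0])
traj_b = exponential_mean_trajectory([0, 1, 1, 1], targets, alpha=0.5, initial_mean=[0.0])

print(traj_a[:, 0])  # means along the first path
print(traj_b[:, 0])  # means along the second path
```

Both paths occupy state 1 in the final frame, yet the means there differ (6 on the first path, 7 on the second) because the first path entered state 1 one frame later. This is the property that prevents a standard Viterbi-style decision from being made before the end of the utterance.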

In experiments on recognition of an E-set vocabulary with five-state models, Wiewiorka and Brookes demonstrated that all the exponential models could outperform a conventional HMM when using cepstrum features. The fixed exponential model did not perform as well as the models using occupancy-dependent trajectories. However, all the models using only static features gave worse performance than the conventional HMM incorporating delta-cepstrum features. This finding is disappointing, especially for a model which includes a representation of dynamics across state boundaries. When delta features were used with the exponential HMMs, the performance of these models improved, but only the model with fixed exponentials gave some improvement over the conventional HMMs with delta features. Wiewiorka and Brookes commented that, as the delta-cepstrum features could show quite different dynamic behaviour from that of the cepstrum features themselves, the exponential HMMs may have performed better if a different exponential coefficient had been used for these features. In fact, the use of a single exponential coefficient per state is probably a rather simplified description of segment dynamics, as it assumes that all features have the same dynamic characteristics.

Bakis (1991) described a model in which a phoneme was associated with a “target” in some vector space related to articulation, with a phonetic model to determine the smooth trajectory path for any one sequence of phonemes. A separate acoustic model was used to represent the relationship between the trajectories and observed acoustic feature vectors. This approach is thus similar to the dynamical system model, with a somewhat different specification of the underlying trajectory, which has the advantage of being context-sensitive but is dependent on an accurate representation of coarticulation within the phonetic model component. The model is particularly interesting as it is a much closer representation of the speech production process than most other models, but it appears that no experimental results have been published.

[Figure 4.1: diagram not reproduced. Labels recoverable from the diagram: standard HMM; geometric duration model; hidden semi-Markov model; conditionally Gaussian HMM; constrained mean trajectory segment model (single trajectory per model, independent observations); Gauss-Markov segment model; dynamical system segment model; discrete mixture of segment-level trajectories; “trajectory” does not vary with time; p(y_t | s_t, y_{t-1}) = p(y_t | s_t); unobserved state x_t = 0; 1 mixture component; observation noise v_t = 0.]

Figure 4.1: Family tree relating some different (Gaussian) stochastic models for a variable-length frame-based observation sequence (adapted from Ostendorf et al. (1996)). Arrows indicate model simplifications, and the symbol s_t is used to represent a state (or “region”) in both HMMs and segment models.


4.8.5 Discussion

Using alternative trajectories for a single model unit is a useful way of reducing the variability that needs to be modelled for any one trajectory. However, a discrete mixture is unlikely to provide a sufficient range of different trajectories, and it is also unable to model associations between the mixture components and systematic influences of factors such as phonetic context. Modelling trajectory variation is a potentially more powerful approach, which includes the possibility of representing dependencies across segments by taking account of previous context in the trajectory specification.

In document Modelling Segmental Variability for Automatic Speech Recognition (Page 71-74)