Approaches to modelling feature dynamics

4.4 Modelling temporal correlations

4.4.1 Representing full correlation structure

The continuous nature of the articulation process causes successive acoustic feature vectors to be highly correlated, especially for frames within a segment. Given that a segment is represented by a mean trajectory, it is natural and in principle also straightforward (especially for non-parametric trajectory models) to represent temporal correlations within the associated covariance matrix. However, there do not appear to be any successful attempts at incorporating a complete representation of temporal correlation over the duration of a segment. Using the SSM, Roucos et al. (1988) experimented with modelling correlations within a feature over time, but this did not improve performance over that obtained when assuming complete independence. The poor performance may be partly due to problems associated with representing correlations within a fixed-length sequence. Such a model is likely to represent correlations between observations in variable-length sequences accurately only if there is an accurate mapping between the two and the fixed-length sequence is at least as long as most observed sequences. In addition, a major problem in modelling temporal covariance is the difficulty of robustly estimating the greatly increased number of parameters, especially if covariance between features is also to be included.
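To give a feel for the growth in the number of parameters, the following minimal sketch counts the free covariance parameters for a fixed-length segment representation under three assumptions about correlation. The function name, the dictionary keys and the example dimensions (10 frames, 12 features) are illustrative assumptions and are not taken from any of the cited systems.

```python
def covariance_parameter_counts(num_frames: int, num_features: int) -> dict:
    """Count free covariance parameters for a fixed-length segment.

    The segment is treated as num_frames feature vectors, each of
    dimension num_features (both values are illustrative inputs).
    """
    t, d = num_frames, num_features
    return {
        # Complete independence: one variance per (frame, feature) pair.
        "independent": t * d,
        # Correlation within each feature over time only:
        # one symmetric T x T matrix per feature.
        "within_feature_temporal": d * t * (t + 1) // 2,
        # Full covariance over the stacked (T * D)-dimensional vector,
        # covering both temporal and between-feature covariance.
        "full": (t * d) * (t * d + 1) // 2,
    }


counts = covariance_parameter_counts(num_frames=10, num_features=12)
for name, value in counts.items():
    print(f"{name}: {value}")
```

Under these assumed dimensions, moving from complete independence to a full covariance structure multiplies the number of parameters by a factor of about sixty, which illustrates why robust estimation becomes difficult.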

4.4.2 Describing general correlations

Goldenthal (1994) adopted a rather different approach, in order to capture the general nature of correlations over time with a manageable number of covariance parameters. A feature vector trajectory was represented non-parametrically by a template, which could be matched against variable-length sequences of observations by first applying a duration-dependent generation function to produce a “synthetic segment” of the same length as the observations. The error between the observed and synthetic segments was computed on a per-frame basis, and a measure of the error probability was used to provide a likelihood score for each phonetic model. The statistical element of the model was thus in terms of the expected error associated with modelling multiple examples by a single track as represented by the template. In order to represent covariance robustly, the error sequence was divided into a small (fixed) number of equal-duration sub-segments, and the error was averaged for each sub-segment. Goldenthal points out that the averaging into a fixed number of sub-segments implicitly assumes that the signal is piece-wise constant over each sub-segment, but that this assumption is not unreasonable if it is applied to the error signal (after taking dynamics into account separately by means of the track). The stochastic element of the segment model is thus of fixed length, and so the model requires the use of a weighting factor to achieve appropriate segmentation. In context-independent classification experiments on TIMIT using mel-cepstrum features (Goldenthal and Glass, 1993; Goldenthal, 1994), the optimum number of sub-segments was found to be three or four, and using the model of temporal correlation was shown to improve performance.
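The sequence of operations described above (generate a synthetic segment, compute the per-frame error, average over a fixed number of sub-segments) can be sketched as follows. This is only an illustration under stated assumptions: the generation function is taken here to be simple linear interpolation of the template, which is not necessarily the function used by Goldenthal, and all function and variable names are hypothetical.

```python
from typing import List

Frame = List[float]


def generate_synthetic_segment(template: List[Frame], length: int) -> List[Frame]:
    """Resample a template track to `length` frames.

    Assumption: linear interpolation stands in for the
    duration-dependent generation function.
    """
    last = len(template) - 1
    if length == 1:
        return [list(template[last // 2])]
    synthetic = []
    for n in range(length):
        pos = n * last / (length - 1)   # fractional position in the template
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        synthetic.append(
            [(1 - frac) * a + frac * b for a, b in zip(template[lo], template[hi])]
        )
    return synthetic


def averaged_error(observed: List[Frame], template: List[Frame],
                   num_subsegments: int) -> List[float]:
    """Return a fixed-length vector of sub-segment-averaged errors."""
    synthetic = generate_synthetic_segment(template, len(observed))
    # Per-frame error between observed and synthetic segments.
    errors = [[o - s for o, s in zip(obs, syn)]
              for obs, syn in zip(observed, synthetic)]
    n, dim = len(errors), len(errors[0])
    vector: List[float] = []
    for q in range(num_subsegments):
        start = q * n // num_subsegments
        end = (q + 1) * n // num_subsegments
        # Very short segments can leave a sub-segment empty; reuse one frame.
        chunk = errors[start:end] or [errors[min(start, n - 1)]]
        for k in range(dim):
            vector.append(sum(frame[k] for frame in chunk) / len(chunk))
    return vector


template = [[0.0], [1.0], [2.0]]
observed = [[0.5], [1.0], [1.5], [2.0], [2.5]]   # track plus a constant offset
error_vector = averaged_error(observed, template, num_subsegments=3)
```

Whatever the duration of the observed segment, the resulting vector has `num_subsegments` times the feature dimension entries, so a covariance matrix of fixed, modest size can be estimated over it.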

Goldenthal investigated the effect of modelling triphone context within the tracks representing phone templates. In order to make best use of limited context coverage in the training data, triphone tracks were created by merging trained biphone tracks with appropriate weighting. The error modelling was considered to be independent of the templates and the same error model was therefore used for all contexts, so greatly reducing any possible difficulties in estimating the statistical parameters with context-dependent modelling. Using these context-dependent templates improved phone accuracy on the TIMIT core test set (Goldenthal, 1994; Goldenthal and Glass, 1994). Further experiments investigated the importance of the phonetic transitions themselves, as these are the most dynamic regions of speech. Using bottom-up clustering techniques, a small set of transition models was created to represent the major types of transitions which can occur. Each transition model was created using a fixed number of frames either side of a phonetic boundary, with the data pooled from all examples in the relevant cluster. These transition models were used to reduce the search space by identifying likely segment boundaries, and were also incorporated into the scoring framework to help determine phonetic identity. The result was increased context-dependent phone accuracy on TIMIT, with the best result being very similar to state-of-the-art HMM phonetic recognition performance reported by Lamel and Gauvain (1993). When the transition models are included in the recognition scoring, some parts of the segments are in effect contributing twice. It would, however, be more elegant, from the points of view both of modelling and of computation, for the importance of the transitions to be represented within a single model.
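The data selection for a transition model, a fixed number of frames either side of a phonetic boundary pooled over the examples in a cluster, can be sketched as below. The function names, the convention that `boundary` indexes the first frame after the boundary, and the choice to skip boundaries lying too close to an utterance edge are all assumptions made for illustration; the form of the model trained on the pooled data is not specified here.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def transition_window(frames: Sequence[Frame], boundary: int,
                      half_width: int) -> List[Frame]:
    """Return `half_width` frames on each side of a phonetic boundary.

    Assumption: `boundary` is the index of the first frame after the
    boundary. Raises ValueError if the window does not fit.
    """
    start, end = boundary - half_width, boundary + half_width
    if start < 0 or end > len(frames):
        raise ValueError("not enough frames around the boundary")
    return [list(frame) for frame in frames[start:end]]


def pool_windows(examples: Sequence[Tuple[Sequence[Frame], int]],
                 half_width: int) -> List[List[Frame]]:
    """Pool windows from all (frames, boundary) examples in one cluster."""
    pooled = []
    for frames, boundary in examples:
        try:
            pooled.append(transition_window(frames, boundary, half_width))
        except ValueError:
            continue  # assumed policy: skip boundaries too close to an edge
    return pooled


utterance = [[float(i)] for i in range(10)]
window = transition_window(utterance, boundary=5, half_width=2)
cluster_data = pool_windows([(utterance, 5), (utterance, 1)], half_width=2)
```

Because every window has the same length (twice `half_width`), the pooled examples form a fixed-length representation of the transition region, which also makes clear how frames near a boundary can contribute both to a transition model and to the adjacent segment models.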

4.4.3 Discussion

Modelling the full covariance structure of the observation sequence representing a segment is not a practical option, due to problems associated with duration variability and difficulties in robustly estimating the large number of parameters. The approach adopted by Goldenthal has been quite successful at overcoming this problem by representing correlations across regions of an error sequence. However, correlations can be described only in a very general way, and this description does not strongly represent the continuously evolving dynamic nature of the speech signal itself. In addition, this model suffers from the same difficulties as approaches using segmental features, in adopting a fixed-length representation of a segment and in applying the statistical modelling to features (matching errors in this case) which are extracted separately.