Relation to previous work - Using Auxiliary Sources of Knowledge for Automatic Speech Recognition

The auxiliary feature can be continuous valued or discrete valued. In the past, much of the focus has been on auxiliary features that take only a few values. For example, (Konig et al., 1991) uses an ANN to determine the gender of the speaker. The inputs to the ANN are cepstral features and the outputs are the posterior probabilities of the gender given the data. The outputs of the ANN are treated in two different ways:

• They are used as additional features.

• They are used to select the gender-dependent acoustic model or to weight the outputs of the two acoustic models (hiding the gender information) for decoding.

The latter approach yielded a significant improvement in ASR performance.
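The two variants of the second use (hard selection of a gender-dependent model versus weighting both models by the gender posterior) can be illustrated with a minimal sketch. It is not the implementation of (Konig et al., 1991): the univariate Gaussian emission models, their parameter values, the function names and the posterior values are all assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical gender-dependent emission models for one HMM state,
# stored as (mean, variance). The values are toy numbers.
MODELS = {"female": (1.0, 0.5), "male": (-1.0, 0.5)}

def select_model_likelihood(x, gender_posterior):
    """Hard decision: score x with the model of the most probable gender."""
    best = max(gender_posterior, key=gender_posterior.get)
    mean, var = MODELS[best]
    return gaussian_pdf(x, mean, var)

def weighted_likelihood(x, gender_posterior):
    """Soft decision: weight both models by the gender posterior,
    so that the gender itself stays hidden during decoding."""
    return sum(p * gaussian_pdf(x, *MODELS[g]) for g, p in gender_posterior.items())

# In the setting described above, this posterior would be the ANN output.
posterior = {"female": 0.7, "male": 0.3}
hard = select_model_likelihood(0.8, posterior)
soft = weighted_likelihood(0.8, posterior)
```

With a confident gender classifier the two scores are close; they differ most when the posterior is near 0.5, where the weighted form avoids committing to a possibly wrong model.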

In (Vergin et al., 1996), the locations of the first two formant frequencies were used for gender classification. Using this gender classifier, the training data was split into male and female parts, and separate acoustic models were trained. During recognition, the gender classifier was used to select the acoustic model for decoding. This led to a relative improvement of 14% over a system with a single acoustic model.

In (Siegler, 1995), a measure of speaking rate was used to adapt the acoustic models, the HMM state transition probabilities, the language model weight, the dictionary and the phoneme set to compensate for the effect of fast speech. Adapting the language model weight, the HMM state transition probabilities and the dictionary led to an improvement in ASR performance.

In (Martinez et al., 1998), two speaking-rate-dependent HMMs were trained, one corresponding to slow speech and the other to fast speech. These two models, along with a speaking rate classifier, yielded a 32% reduction in average error rate. While (Martinez et al., 1998) used a discrete-valued estimate of speaking rate, (Morgan et al., 1997) used a continuous estimate of speaking rate to divide the test set into four bins corresponding to different speaking rates, and optimized the exit state transition probabilities for each bin. This led to a 14% reduction in word error rate.

In all of the work described above, the auxiliary feature was discretized in some way. (Singer and Sagayama, 1992) studied the correlation between a continuous-valued auxiliary feature (pitch frequency) and cepstral features. To exploit the correlation between pitch frequency and spectral parameters, they proposed a pitch-spectrum normalization approach in which the cepstral vector was normalized by a linear regression over the auxiliary feature. This approach yielded models with lower variance and improved the separability between the phoneme classes in the acoustic space, which in turn improved both phoneme recognition and ASR performance. More recently, this approach has been extended to condition the acoustic model (as in (4.4)) instead of applying a feature-level normalization (Fujinaga et al., 2001). Here, the auxiliary feature was used in a regression to better model the Gaussian distribution of the standard features, i.e., the standard features were conditioned on the auxiliary feature by shifting their first-order moment. Their study showed the advantage of using auxiliary features as conditioning variables as opposed to the conventional approach in which the auxiliary feature is appended to the standard feature (as in (4.3)). If the auxiliary features are assumed to be correlated with the standard features, then conditioning the standard features on the auxiliary feature results in Gaussians with reduced variance during training. The auxiliary features investigated were pitch frequency and energy. Conditioning the emission distribution on the auxiliary feature led to an improvement in phoneme recognition and isolated word recognition tasks. In addition, they observed that appending the auxiliary feature to the standard acoustic feature degrades ASR performance.
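The variance reduction obtained by conditioning can be illustrated with a small numerical sketch. It assumes a scalar standard feature x and a scalar auxiliary feature a for a single phoneme class, with a linear dependence of the mean of x on a; the synthetic data, the coefficient values and the variable names are assumptions for illustration and are not taken from the cited studies.

```python
import random
import statistics

random.seed(0)

# Toy data for one phoneme class: a scalar auxiliary feature a (e.g. pitch)
# and a scalar standard feature x that is partly explained by a.
aux = [random.gauss(0.0, 1.0) for _ in range(2000)]
feat = [2.0 + 0.8 * a + random.gauss(0.0, 0.5) for a in aux]

# Least-squares fit of x ~= intercept + slope * a.
mean_a = statistics.fmean(aux)
mean_x = statistics.fmean(feat)
cov_ax = sum((a - mean_a) * (x - mean_x) for a, x in zip(aux, feat)) / len(aux)
var_a = statistics.pvariance(aux)
slope = cov_ax / var_a
intercept = mean_x - slope * mean_a

# Unconditioned Gaussian: one fixed mean, so its variance is that of x itself.
var_unconditioned = statistics.pvariance(feat)

# Conditioned Gaussian: the mean shifts with a, so the variance to be
# modelled is only that of the regression residual.
residuals = [x - (intercept + slope * a) for a, x in zip(aux, feat)]
var_conditioned = statistics.pvariance(residuals)
```

Under these assumptions the residual variance is never larger than the variance of x, and the gap grows with the strength of the correlation between x and a. When the two are uncorrelated, the fitted slope is near zero and conditioning brings essentially no benefit.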

In (Zweig, 1998), the notion of the auxiliary variable was introduced within the framework of DBNs, where it was referred to as a “context” variable. The idea behind using an auxiliary variable was to model contextual information, i.e., features in relation to the features at the previous time frame, and also to model the correlation between the features at the present time frame. The auxiliary variable was a latent variable, i.e., hidden during both training and recognition (except in certain experiments where it was initialized to reflect voicing). (Zweig, 1998) gave theoretical justifications for using an auxiliary variable, but reported no experiments with real data.

In (Stephenson, 2003), this notion of an auxiliary variable was taken further using real data within the framework of DBNs. The auxiliary features studied were articulatory features, pitch frequency, short-term energy and rate of speech. Stephenson investigated different ways of introducing auxiliary features into state-of-the-art ASR systems, such as appending them to the standard features or using them to condition the standard features. His work revealed the need for a time-dependent auxiliary feature that conditions the standard features, i.e., the auxiliary feature shifts the Gaussians that model the standard features in order to estimate better acoustic models that are robust to noisy conditions. In (Fujinaga et al., 2001) and the related work described earlier in this section, the auxiliary feature was observed during both training and recognition. Stephenson investigated in detail the idea of observing the auxiliary feature during training and hiding it (i.e., integrating over all its possible values) during recognition. It was found that hiding the auxiliary feature during recognition sometimes makes the acoustic models more robust, especially in noisy conditions.
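Hiding the auxiliary feature during recognition amounts to marginalizing it out of the emission likelihood. The sketch below assumes a discrete auxiliary variable with three values and a univariate Gaussian emission per value, so that the integral reduces to a sum; the parameter values and names are hypothetical and only illustrate the idea, not the DBN implementation of (Stephenson, 2003).

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical parameters for one HMM state, assumed to have been estimated
# with the auxiliary variable observed during training.
AUX_PRIOR = {"low": 0.25, "mid": 0.5, "high": 0.25}   # P(a | state)
COND_MEAN = {"low": -1.0, "mid": 0.0, "high": 1.0}    # mean of x given a
VAR = 0.4                                             # shared variance

def observed_likelihood(x, a):
    """Emission likelihood p(x | a, state) when a is observed."""
    return gaussian_pdf(x, COND_MEAN[a], VAR)

def hidden_likelihood(x):
    """Emission likelihood p(x | state) when a is hidden:
    sum over all values of a, weighted by P(a | state)."""
    return sum(AUX_PRIOR[a] * observed_likelihood(x, a) for a in AUX_PRIOR)

score_observed = observed_likelihood(0.3, "mid")
score_hidden = hidden_likelihood(0.3)
```

One reading of the robustness result is that the hidden form does not depend on an estimate of the auxiliary feature at recognition time, which matters when that estimate is unreliable, as it can be in noise.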

The initial part of the studies reported in this chapter was carried out with Todd Stephenson. While (Stephenson, 2003) focusses more on the use of DBNs for modelling auxiliary features, the present work focusses on modelling auxiliary features in the framework of hybrid HMM/ANN systems and on extending them to state-of-the-art TANDEM systems.

In document Using Auxiliary Sources of Knowledge for Automatic Speech Recognition (Page 56-58)