HMM/GMM-based ASR systems and hybrid HMM/ANN-based ASR systems have been widely studied (Rabiner and Juang, 1993; Bourlard and Morgan, 1994). HMM/GMM models are trained to maximize the likelihood of the data X, whereas an HMM/ANN model is trained to discriminate between the states so as to yield the posterior probability of state q_n.
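The contrast between the two training criteria can be written compactly. This is only a sketch: the parameter symbol \Theta, the class index i, and the name y_i for the i-th ANN output are notation assumed here for illustration, while X, q_n and x_n follow the text.

```latex
% HMM/GMM: maximum-likelihood training over the data X
\hat{\Theta} = \arg\max_{\Theta} \; p(X \mid \Theta)

% HMM/ANN: the i-th ANN output for frame x_n estimates a state posterior
y_i(x_n) \approx P(q_n = i \mid x_n)
```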
A TANDEM system combines the discriminative property of an ANN with Gaussian mixture modelling by using the processed posterior probabilities obtained from the output of the ANN (referred to as tandem features) as the input feature for the HMM/GMM-based system. Figure 3.1 illustrates the TANDEM system. This approach has been shown to yield significant improvement over a conventional HMM/GMM ASR system using cepstral features in both clean and noisy conditions (Hermansky et al., 2000).
The TANDEM system is similar in spirit to an approach proposed earlier for speech recognition (Bengio et al., 1992), in which the outputs of ANNs were used as observations for an HMM/GMM system. That system had three levels: (a) the first level consisted of ANNs trained to recognize broad phonetic classes; (b) the second level consisted of an ANN integrating the outputs of the first-level ANNs, and this ANN was trained to principal components of the lower levels; (c) at the third level, the output of the second-level ANN was modelled by an HMM/GMM system. The Gaussians of the GMMs had diagonal covariance matrices. This system yielded better phoneme recognition performance than the standard HMM/GMM system and the hybrid HMM/ANN system. Furthermore, the phoneme recognition performance improved when the parameters at all the levels were jointly optimized. As we will see later in this section, in the TANDEM system the parameters of the ANN and of the HMM/GMM system are optimized separately, and the ANN output is decorrelated in a different way before being fed into the HMM/GMM system.
[Figure 3.1 labels: X → MLP → P(q_n = i | x_n) → Log (or MLP outputs prior to final nonlinearity) → KL Transform → Tandem Features → HMM/GMM → W]
Figure 3.1. Block diagram of TANDEM system.
The TANDEM system is trained in the following manner (Hermansky et al., 2000).
1. An ANN is trained to discriminate between a set of class labels, such as phonemes. The ANN can be trained with the training data of the intended ASR task (task-dependent training data) or with the training data of any other ASR task (task-independent training data) (Hermansky et al., 2000). In our studies, the ANN is always trained with task-dependent data.
2. After training the ANN, the task-dependent training data is passed through the ANN to estimate the phoneme posterior probabilities.
3. Since the posterior probabilities obtained from the output of the ANN are skewed, their logarithms are taken. An alternative is to take the output of the ANN prior to the output-layer nonlinearity.
4. Principal component analysis (PCA) is performed on the features obtained in the previous step. The features are then decorrelated by projecting them along the eigenvectors. We refer to the resulting features as tandem-features.
5. An HMM/GMM ASR system with diagonal covariance matrices for the Gaussians is then trained with the tandem-features.
During recognition, as in training, the test data is passed through the ANN. The log posterior probabilities obtained are decorrelated by the Karhunen-Loève transform (KLT), using the PCA statistics collected during training, to obtain the tandem-features. The tandem-features are then fed to the trained HMM/GMM ASR system and decoding is performed.
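Steps 3 and 4, together with the recognition-time projection, can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited work: the function names, the probability floor, and the randomly generated stand-in posteriors are all hypothetical, and no dimensionality reduction is applied after the projection.

```python
import numpy as np


def log_posteriors(posteriors, floor=1e-10):
    """Step 3: take logs of the ANN posteriors (the floor is an assumed guard against log(0))."""
    return np.log(np.maximum(posteriors, floor))


def fit_pca(features):
    """Step 4, training side: collect the mean and the covariance eigenvectors."""
    mean = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
    return mean, eigvecs[:, order]


def apply_klt(features, mean, eigvecs):
    """Project centred features along the eigenvectors estimated on training data."""
    return (features - mean) @ eigvecs


def fake_ann_posteriors(n_frames, n_classes, rng):
    """Hypothetical stand-in for a trained ANN: softmax over random scores."""
    scores = rng.normal(size=(n_frames, n_classes))
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
train_post = fake_ann_posteriors(2000, 5, rng)
test_post = fake_ann_posteriors(300, 5, rng)

# PCA statistics are estimated on the training data only ...
mean, eigvecs = fit_pca(log_posteriors(train_post))
train_tandem = apply_klt(log_posteriors(train_post), mean, eigvecs)
# ... and reused unchanged at recognition time.
test_tandem = apply_klt(log_posteriors(test_post), mean, eigvecs)

# The training tandem-features should have a (numerically) diagonal covariance.
cov = np.cov(train_tandem, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print("largest off-diagonal covariance:", np.abs(off_diag).max())
```

Decorrelating the features in this way is what makes the diagonal-covariance Gaussians of step 5 a reasonable modelling choice.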
TANDEM systems have several advantages, such as:
• They make better use of the different probabilistic bases of the two systems and of the approaches developed for them.
• They provide a framework in which data from different databases can be used together. For instance, if there is not sufficient task-dependent training data to train the ANN, then an ANN well trained on a different database can be used for tandem-feature extraction (Hermansky et al., 2000; Sivadas and Hermansky, 2004).
• TANDEM systems can be used to combine different features or streams of information efficiently (Hermansky and Sharma, 1998; Zhu et al., 2004; Ikbal et al., 2004b), similar to the hybrid HMM/ANN system. For instance, in (Ikbal et al., 2004b) two ANNs corresponding to the features MFCCs and PAC-MFCCs were trained to classify phonemes. The phoneme posterior estimates from the two ANNs were combined through the entropy combination approach (Misra et al., 2003), yielding a new estimate of the phoneme posterior probabilities. The new estimate of the phoneme posterior probabilities was used to extract tandem-features. The resulting tandem-features were used as the input feature to the HMM/GMM system. This approach led to an improvement in the performance of the ASR system, mainly in noisy conditions.
• The tandem-features exhibit less speaker variability (Zhu et al., 2004). This is due to the ability of the ANN to project the standard acoustic feature onto dimensions carrying more speech information. For example, due to speaker variability, the standard acoustic feature vectors corresponding to the same phoneme class may be located at different points in the feature space, but may have similar phoneme class probabilities (output of the ANN). Thus, we can expect the acoustic feature vectors of the same phoneme from different speakers to be mapped to nearly the same point in the trained ANN's output space.
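The multi-stream combination mentioned in the list above can be illustrated with a small sketch. It assumes an inverse-entropy weighting, in which the stream whose posterior distribution has lower entropy for a frame receives the larger weight; whether this matches the exact rule used in (Ikbal et al., 2004b) is an assumption, and the function names and toy posteriors are hypothetical.

```python
import numpy as np


def frame_entropy(posteriors, eps=1e-12):
    """Entropy of each frame's posterior distribution (rows sum to one)."""
    p = np.clip(posteriors, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def inverse_entropy_combine(stream_a, stream_b, eps=1e-12):
    """Per-frame weighted sum of two posterior streams, weights proportional to 1/entropy."""
    inv_a = 1.0 / (frame_entropy(stream_a) + eps)
    inv_b = 1.0 / (frame_entropy(stream_b) + eps)
    w_a = inv_a / (inv_a + inv_b)
    w_b = 1.0 - w_a
    return w_a[:, None] * stream_a + w_b[:, None] * stream_b


# Toy example: one frame, three classes; the first stream is far more confident.
confident = np.array([[0.90, 0.05, 0.05]])
uncertain = np.array([[0.34, 0.33, 0.33]])
combined = inverse_entropy_combine(confident, uncertain)
print(combined)
```

Because the weights for each frame sum to one, the combined vector is again a valid posterior distribution and can be passed through the log and KLT steps like the output of a single ANN.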