Robustness of Features - Feature Representation Learning in Deep Neural Networks

Feature Representation Learning in Deep Neural Networks

9.4 Robustness of Features

A key property of a good feature is its robustness to variations. There are two main types of variation in speech signals: speaker variation and environment variation. In conventional GMM-HMM systems, both types need to be handled explicitly.

9.4.1 Robust to Speaker Variations

To deal with speaker variability, vocal tract length normalization (VTLN) [1] and feature-space maximum likelihood linear regression (fMLLR) [5] are critical in GMM-HMM systems.

Table 9.2 Comparison of feature-transform-based speaker-adaptation techniques for GMM-HMMs, a shallow, and a deep NN

| Adaptation technique | CD-GMM-HMM (40-mixture) | CD-MLP-HMM (1×2,048) | CD-DNN-HMM (7×2,048) |
|---|---|---|---|
| Speaker independent | 23.6% | 24.2% | 17.1% |
| + VTLN | 21.5% (−9%) | 22.5% (−7%) | 16.8% (−2%) |
| + fMLLR/fDLR×4 | 20.4% (−5%) | 21.5% (−4%) | 16.4% (−2%) |

Word-error rates (WER) for Hub5’00-SWB (relative change in parentheses). (Summarized from Seide et al. [27])

VTLN warps the frequency axis of the filter-bank analysis to account for the fact that the locations of vocal-tract resonances vary roughly monotonically with the vocal tract length of the speaker. This is done in both training and testing with 20 quantized warping factors from 0.8 to 1.18. During training, the optimal warping factor can be found using the expectation–maximization (EM) algorithm by repeatedly selecting the best factor given the current model and then updating the model using the selected factor. During testing, the system can pick the best factor by running recognition for all factors and choosing the one with the highest cumulative log probability.
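The test-time selection described above amounts to a grid search over the quantized warping factors. The following is a minimal sketch, not the implementation used in [27]. It assumes the 20 factors are evenly spaced (a step of 0.02), and `score_fn` is a hypothetical stand-in for a recognizer that returns the cumulative log probability of an utterance decoded with a given factor; `toy_score` exists only to make the sketch runnable.

```python
def warp_factors(lo=0.80, hi=1.18, count=20):
    """Evenly spaced, quantized warping factors (even spacing is an assumption)."""
    step = (hi - lo) / (count - 1)
    return [round(lo + i * step, 2) for i in range(count)]


def pick_warp_factor(utterance, score_fn, factors=None):
    """Return the factor whose decoding has the highest cumulative log probability."""
    factors = warp_factors() if factors is None else factors
    return max(factors, key=lambda alpha: score_fn(utterance, alpha))


# Hypothetical stand-in for a recognizer: the score peaks at a factor of 0.94.
def toy_score(utterance, alpha):
    return -(alpha - 0.94) ** 2


factors = warp_factors()
best = pick_warp_factor(None, toy_score)
```

During training, the same selection step would alternate with a model update using the selected factors, as described in the text.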

On the other hand, fMLLR applies an affine transform to the feature vector so that the transformed feature better matches the model. It is typically applied to the testing utterance by first generating recognition results using the raw feature and then re-recognizing the speech with the transformed feature. This process can be iterated several times. For GMM-HMMs, fMLLR transforms are estimated to maximize the likelihood of the adaptation data given the model. For DNNs, they are optimized to minimize the cross entropy (with backpropagation), which is a discriminative criterion. This procedure is thus referred to as feature-space discriminative linear regression (fDLR) [27]. The transformation may be applied to each input vector (which is typically a concatenation of multiple frames of features) in the DNN or applied to individual frames, prior to concatenation.
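To make the second option concrete, the sketch below applies one shared affine transform to each frame before the frames are spliced into the DNN input vector. It only illustrates how a given transform is applied; estimating `A` and `b` (by maximum likelihood for fMLLR or by backpropagation for fDLR) is not shown. The function names and the toy values are assumptions for illustration.

```python
def affine(A, b, x):
    """Return A x + b for a matrix A (list of rows) and vectors b and x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]


def transform_then_concat(A, b, frames):
    """Apply one shared per-frame transform, then splice frames into one vector."""
    out = []
    for frame in frames:
        out.extend(affine(A, b, frame))
    return out


# Toy 2-dimensional features and a 3-frame context window (illustrative values).
A = [[1.0, 0.5], [0.0, 2.0]]
b = [0.1, -0.2]
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
dnn_input = transform_then_concat(A, b, frames)
```

Applying the transform to the already concatenated vector instead would require a transform whose dimension matches the full context window, which means more parameters to estimate from the same adaptation data.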

Table 9.2, extracted from [27], compares the effectiveness of VTLN and fMLLR/fDLR on GMMs, shallow multilayer perceptrons (MLPs), and DNNs. It can be observed that both VTLN and fMLLR are important for GMMs to reduce speaker variability: they provide 9% and 5% relative error rate reduction, respectively. These techniques are also important for shallow MLPs, with 7% and 4% relative WER reduction. However, they are less important for the DNN system, where each provides only about 2% relative error reduction (about 4% combined over the speaker-independent baseline DNN system). This observation indicates that DNNs are more robust to speaker variations than GMMs and shallow MLPs.
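The relative changes quoted above can be recomputed from the absolute WERs in Table 9.2. The short sketch below does so; note that each figure in parentheses in the table is relative to the row directly above it, so the combined effect of both techniques is also computed. The function name is an assumption for illustration.

```python
def relative_change(previous_wer, current_wer):
    """Relative WER change in percent; negative values are reductions."""
    return 100.0 * (current_wer - previous_wer) / previous_wer


# WERs (%) from Table 9.2: speaker independent, + VTLN, + fMLLR/fDLR.
wer = {
    "CD-GMM-HMM": [23.6, 21.5, 20.4],
    "CD-MLP-HMM": [24.2, 22.5, 21.5],
    "CD-DNN-HMM": [17.1, 16.8, 16.4],
}

for system, (si, vtln, fmllr) in wer.items():
    print(system,
          round(relative_change(si, vtln)),     # VTLN vs. speaker independent
          round(relative_change(vtln, fmllr)),  # fMLLR/fDLR vs. VTLN
          round(relative_change(si, fmllr)))    # both vs. speaker independent
```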


9.4.2 Robust to Environment Variations

Similarly, GMM-based acoustic models are highly sensitive to environmental mismatch. To deal with this issue, several techniques that normalize the input features or adapt the model parameters have been developed, such as vector Taylor series (VTS) adaptation [12, 14, 15, 19] and maximum likelihood linear regression (MLLR) [4]. In contrast, the analysis in the previous sections suggests that DNNs have the ability to generate internal representations that are robust to environmental variability seen in the training data.

In methods such as VTS adaptation, an estimated noise model is used to adapt the Gaussian parameters of the recognizer based on a physical model that defines how noise corrupts clean speech. The relationship between the clean speech x, corrupted (or noisy) speech y, and noise n in the log spectral domain can be approximated as

$$ y = x + \log\left(1 + \exp(n - x)\right). \tag{9.4} $$
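As a quick numerical illustration of Eq. 9.4, the sketch below evaluates the mismatch function for a single log-spectral bin. The function name and the sample values are assumptions chosen to show the limiting behavior. Note that Eq. 9.4 is algebraically the same as y = log(exp(x) + exp(n)).

```python
import math


def noisy_log_spectrum(x, n):
    """Eq. 9.4: y = x + log(1 + exp(n - x)) for one log-spectral bin."""
    return x + math.log1p(math.exp(n - x))


# Equal speech and noise energy: y exceeds x by log(2).
y_equal = noisy_log_spectrum(2.0, 2.0)
# Noise far below speech: y is essentially x.
y_clean = noisy_log_spectrum(5.0, -20.0)
# Noise far above speech: y is essentially n.
y_noisy = noisy_log_spectrum(-20.0, 5.0)
```

The relationship is close to linear in x when speech dominates and close to constant in x when noise dominates, which is why a single first-order expansion around one operating point is only an approximation.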

In GMMs, this nonlinear relationship is often approximated with a first-order VTS expansion. DNNs, however, with many layers of nonlinear transformation, can directly model arbitrary nonlinear relationships, including that described by Eq. 9.4. Since we are interested in the nonlinear mapping from the noisy speech y and the noise n to the clean speech x, we may augment each observation input (noisy speech) to the network with an estimate of the noise $\hat{n}_t$ present in the signal, i.e.,

$$ v_t^0 = [y_{t-\tau}, \ldots, y_{t-1}, y_t, y_{t+1}, \ldots, y_{t+\tau}, \hat{n}_t], \tag{9.5} $$

where a window of 2τ + 1 frames of noisy speech and one frame of noise estimate is used as the input to the network. This is done in both training and decoding and is thus analogous to noise adaptive training (NAT) [11], but without an explicit mismatch function. Because the DNN is given the noise estimate so that it can automatically learn the mapping from the noisy speech and noise to the senone labels, implicitly through an estimate of the clean speech, this technique is referred to as noise-aware training (NaT) [28, 33].
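The construction of the augmented input in Eq. 9.5 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of [28, 33]: the noise estimate is taken as a given vector (how it is obtained is not specified here), and frames that fall outside the utterance are replaced by the nearest edge frame. The function name and the toy values are hypothetical.

```python
def noise_aware_input(frames, noise_estimate, t, tau):
    """Build v_t^0 = [y_{t-tau}, ..., y_t, ..., y_{t+tau}, n_hat] as a flat list.

    Frames outside the utterance are replaced by the nearest edge frame,
    which is an assumption made for this sketch.
    """
    last = len(frames) - 1
    vector = []
    for k in range(t - tau, t + tau + 1):
        vector.extend(frames[min(max(k, 0), last)])
    vector.extend(noise_estimate)
    return vector


# Toy utterance: 4 frames of 2-dimensional features (illustrative values).
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
n_hat = [9.0, 9.1]
v = noise_aware_input(frames, n_hat, t=0, tau=1)
```

With feature dimension d, the resulting vector has (2τ + 1)·d entries for the speech window plus the entries of the noise estimate.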

The robustness of DNNs to environmental distortions can be clearly observed in the experiments conducted on the Aurora 4 corpus [20], a 5,000-word vocabulary task based on the Wall Street Journal (WSJ0) corpus. The models were trained with the 16 kHz multi-condition training set consisting of 7,137 utterances from 83 speakers. One half of the utterances was recorded with a high-quality close-talking microphone and the other half was recorded using one of 18 different secondary microphones. Both halves include a combination of clean speech and speech corrupted by one of six different types of noise (street traffic, train station, car, babble, restaurant, airport) at signal-to-noise ratios (SNRs) between 10 and 20 dB.

The evaluation was conducted on the test set consisting of 330 utterances from 8 speakers. This test set was recorded by the primary microphone and a number of secondary microphones. These two sets were then each corrupted by the same six noises used in the training set at SNRs between 5 and 15 dB, creating a total of 14 test sets. These 14 test sets can then be grouped into 4 subsets, based on the type of distortion: none (clean speech), additive noise only, channel distortion only, and noise + channel. Notice that the types of noise are common across training and test sets but the SNRs of the data are not.

Table 9.3 A comparison of several GMM systems in the literature to a DNN system on the Aurora 4 task

| Systems | None (clean) (%) | Noise (%) | Channel (%) | Noise + channel (%) | AVG (%) |
|---|---|---|---|---|---|
| GMM baseline | 14.3 | 17.9 | 20.2 | 31.3 | 23.6 |
| MPE + NAT + VTS | 7.2 | 12.8 | 11.5 | 19.7 | 15.3 |
| NAT + Derivative kernels | 7.4 | 12.6 | 10.7 | 19.0 | 14.8 |
| NAT + Joint MLLR/VTS | 5.6 | 11.0 | 8.8 | 17.8 | 13.4 |
| DNN (7×2,048) | 5.6 | 8.8 | 8.9 | 20.0 | 13.4 |
| DNN + NaT + dropout | 5.4 | 8.3 | 7.6 | 18.5 | 12.4 |

Word-error rates (%) by distortion type. Summarized from [28, 33]
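The AVG column of Table 9.3 is consistent with an average over the 14 individual test sets rather than over the 4 distortion groups. The sketch below checks this under the assumption, inferred from the description of the test data, that the groups contain 1, 6, 1, and 6 test sets, respectively. The names are hypothetical.

```python
# Number of test sets per distortion group, inferred from the test-set description.
SETS_PER_GROUP = {"none": 1, "noise": 6, "channel": 1, "noise+channel": 6}


def average_wer(group_wer):
    """Average WER over the 14 test sets, weighting each group by its size."""
    total_sets = sum(SETS_PER_GROUP.values())
    weighted = sum(SETS_PER_GROUP[g] * w for g, w in group_wer.items())
    return weighted / total_sets


# Per-group WERs (%) taken from Table 9.3.
dnn = {"none": 5.6, "noise": 8.8, "channel": 8.9, "noise+channel": 20.0}
gmm = {"none": 14.3, "noise": 17.9, "channel": 20.2, "noise+channel": 31.3}
dnn_avg = average_wer(dnn)
gmm_avg = average_wer(gmm)
```

Under this weighting the two averages agree with the reported 13.4% and 23.6% to within rounding, which also shows that the noise and noise + channel conditions dominate the overall figure.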

The DNN was trained using 24-dimensional log mel-filter-bank features with utterance-level mean normalization. The first- and second-order derivative features were appended to the static feature vectors. The input layer was formed from a context window of 11 frames, creating an input layer of 792 input neurons. The DNN had 7 hidden layers, each with 2,048 neurons, and the softmax output layer had 3,206 neurons, corresponding to the senones of the baseline HMM system. The network was initialized using layer-by-layer generative pretraining and then discriminatively trained using backpropagation. To reduce overfitting, dropout [7], discussed in Sect. 4.3.4, was used in one of the DNN setups.
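The layer sizes quoted above follow from simple arithmetic, which the sketch below spells out. The parameter count is an illustration that assumes plain fully connected layers with one bias per unit; it is not a figure reported in the source.

```python
static_dim = 24          # log mel-filter-bank coefficients per frame
streams = 3              # static + first- and second-order derivatives
context_frames = 11      # frames in the input context window
input_dim = static_dim * streams * context_frames

hidden_layers, hidden_dim, senones = 7, 2048, 3206
layer_sizes = [input_dim] + [hidden_dim] * hidden_layers + [senones]

# Weights plus biases of each fully connected layer (assumes plain dense layers).
num_params = sum((fan_in + 1) * fan_out
                 for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))
```

Here `input_dim` evaluates to 792, matching the input layer size given in the text.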

In Table 9.3, summarized from [28, 33], the performance obtained by the DNN acoustic model is compared to that obtained by several GMM systems. The first system is a baseline GMM-HMM system, while the remaining systems are representative of the state-of-the-art GMM systems in acoustic modeling and noise and speaker adaptation. All used the same training set.

The “MPE+NAT+VTS” system combines minimum phone error (MPE) discriminative training [23] and noise adaptive training (NAT) using VTS adaptation to compensate for noise and channel mismatch [3]. The “NAT+Derivative Kernels” system uses a multi-pass hybrid generative/discriminative classifier [24]. It first uses an adaptively trained HMM with VTS adaptation to generate features based on state likelihoods and their derivatives. These features are then input to a discriminative log-linear model to obtain the final hypothesis. The “NAT+Joint MLLR/VTS” system uses an HMM trained with NAT and combines VTS adaptation for environment compensation and MLLR for speaker adaptation [30]. The last two rows of the table show the performance of the two DNN-HMM systems. The “DNN (7×2,048)” system is simply a direct application of the CD-DNN-HMM with 7 hidden layers, each with 2,048 neurons. Nevertheless, it outperforms all but the “NAT+Joint MLLR/VTS” system, whose 13.4% average WER it matches. Finally, the “DNN+NaT+dropout” system, which uses noise-aware training and dropout, has the best performance. In addition, all the DNN-HMM results were obtained in the first pass, while the three state-of-the-art GMM systems required two or more recognition passes for noise, channel, or speaker adaptation. These results clearly demonstrate the inherent robustness of the DNN to unwanted variability from noise and channel mismatch.

In document Automatic Speech Recognition (Page 177-181)