Digit string recognition on aurora-2 - Exploiting properties of the human auditory system and c

2.3 Experiments

2.3.2 Digit string recognition on aurora-2

We first applied the MPPCA imputation method to the aurora-2 task with the default GMM-HMM recognition system described in Hirsch and Pearce (2000).

1 The amount of training data in aurora-2 is too small for training effective DNNs.

Chapter 2. Missing Feature Imputation Using Manifold-based Compressive Sensing

Figure 2.3: Diagram of the complete ASR system. Missing data imputation in combination with a recognition back-end. The GMM-HMM back-end is used in both experiments. The DNN-HMM back-end is only applied in the large vocabulary task.

The data set consists of connected digit utterances with lengths of one to seven digits per utterance. There are clean and multi-condition training sets available.

The multi-condition training set contains the clean training data, plus the same signals noisified by four different noise types at SNR levels of 20, 15, 10, and 5 dB. The clean training set, consisting of 8,440 utterances, was used for training the MPPCA models as well as the GMMs and transition probabilities in the recognizer back-end.

The multi-condition set was used to determine the block size (K), the number of MPPCA sub-models (M), and the dimension of the manifold representations (q). We also used the multi-condition training set to tune the mask thresholds. We report results for test sets A and B before and after the application of MPPCA-based imputation. Both test sets contain 4004 utterances, in the clean version and noisified by four different noise types at SNR levels of −5, 0, 5, 10, 15, and 20 dB. The noise types in set A are the same as in the multi-condition training data; the noise types in set B are not represented in the training corpus.

For training and testing the GMM-HMM, the 23-band log-Mel-spectra were converted to Mel-frequency cepstral coefficients (MFCC). MPPCA-based imputation was used to reconstruct log-Mel-spectra that contained missing features before the conversion to MFCCs. Figure 2.4 contains an example of the impact of MPPCA-based imputation on the log-Mel-spectrogram of a noisified utterance ‘eight’. The SNR is 10 dB; the noise is suburban train noise. Panel (a) shows the spectrogram of the clean speech; panel (b) contains the spectrogram of the noisified signal. Panel (c) only contains the reliable spectro-temporal features, and panel (d) shows the MPPCA-based reconstruction. The application-dependent parameters of the MPPCA model used for the imputation had the values {M = 30, K = 5, q = 20}. While the reconstructed spectrogram is clearly different from the clean spectrogram, it can be seen that the (visually) most salient features have been recovered. As in all lossy coding, successful imputation yields spectrograms that tend towards the prototypical spectrogram of the digit words.

Figure 2.4: An example of an MPPCA-imputed short noisy utterance. (a): the original clean spectrogram, (b): the spectrogram of the noisy utterance at SNR = 10 dB, (c): the reliable components, (d): the MPPCA-imputed spectrogram. Imputation was performed with an MPPCA model with {K = 5, M = 30, q = 20}.
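The conversion from 23-band log-Mel-spectra to MFCCs mentioned above is, in most front-ends, a truncated DCT-II over the band axis. The following is a minimal sketch of that step only. The number of retained coefficients (13), the unnormalized DCT, and the function name `logmel_to_mfcc` are assumptions for illustration and are not taken from the source.

```python
import math


def logmel_to_mfcc(logmel_frame, num_ceps=13):
    """Apply an unnormalized DCT-II to one log-Mel frame (a list of floats).

    num_ceps=13 is an assumed value; the source does not state how many
    cepstral coefficients were kept.
    """
    n = len(logmel_frame)
    return [
        sum(x * math.cos(math.pi * k * (j + 0.5) / n)
            for j, x in enumerate(logmel_frame))
        for k in range(num_ceps)
    ]


# A spectrally flat 23-band frame: all of its energy ends up in coefficient 0.
flat_frame = [1.0] * 23
mfcc = logmel_to_mfcc(flat_frame)
```

Because imputation operates on the log-Mel-spectra, any reconstruction error in a single band is spread over all cepstral coefficients by this transform, which is one reason to impute before the conversion rather than after it.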

First, we determined a good value for the dimension of the PPCA sub-models. For that purpose we trained MPPCA models with q = d, M = 11 (motivated by the fact that there are eleven digit words), and K = 5, 10, ..., 30. We observed a sharp decrease of the values of 1/α around q = 20. Subsequent recognition experiments confirmed that q = 20 suffices to represent the relevant variation in the acoustic space. With q = 20 we then performed a grid search over MPPCA models with {K, M} pairs, with K = 5, 10, ..., 30 and M = 1, 15, ..., 40 on the multi-condition training data. The maximum block size K = 30 was imposed by the duration of the shortest utterance. The block shift was always fixed at 3 frames.
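The block size K and the fixed block shift of 3 frames determine which frames each block covers and how many blocks contribute to a given frame. The sketch below illustrates this under two assumptions that the source does not spell out here: an extra block is placed flush with the end of the utterance so that every frame is covered, and overlapping block reconstructions are combined by a plain per-frame average. The function names are hypothetical.

```python
def block_starts(num_frames, block_size, shift=3):
    """Start indices of blocks of `block_size` frames, advanced by `shift` frames."""
    if num_frames < block_size:
        raise ValueError("utterance is shorter than one block")
    starts = list(range(0, num_frames - block_size + 1, shift))
    # Assumption: add a final block flush with the end so every frame is covered.
    if starts[-1] != num_frames - block_size:
        starts.append(num_frames - block_size)
    return starts


def overlap_average(blocks, starts, num_frames):
    """Average per-frame estimates coming from overlapping blocks.

    blocks[i][t] is the estimate (list of band values) for frame starts[i] + t.
    """
    num_bands = len(blocks[0][0])
    sums = [[0.0] * num_bands for _ in range(num_frames)]
    counts = [0] * num_frames
    for block, start in zip(blocks, starts):
        for t, frame in enumerate(block):
            counts[start + t] += 1
            for b, value in enumerate(frame):
                sums[start + t][b] += value
    return [[s / counts[i] for s in sums[i]] for i in range(num_frames)]


# Toy example: 11 frames, 2 bands, K = 5. Each "reconstruction" is just the
# original block, so averaging the overlaps must return the input unchanged.
spec = [[float(t)] * 2 for t in range(11)]
starts = block_starts(len(spec), block_size=5)
blocks = [spec[s:s + 5] for s in starts]
rec = overlap_average(blocks, starts, len(spec))
```

With K = 5 an 11-frame utterance yields blocks starting at frames 0, 3, and 6. The constraint that K cannot exceed the shortest utterance shows up here as the `ValueError`.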

Unsurprisingly, there was no {M, K} pair that yielded maximum recognition accuracy on all 5 (SNR levels) × 4 (noise types) conditions in the multi-condition training data. It appeared that accuracy reached a shallow plateau for values of M around 30, for both oracle and estimated masks. Therefore, we only report results for M = 30. With the oracle mask the recognition accuracy decreased monotonically with increasing block size K. However, the reverse was true for the estimated mask: recognition accuracy kept increasing with increasing block size.

Table 2.1 summarizes the recognition results on test set A and test set B using an MPPCA model with q = 20 and M = 30, with block sizes K = 5 and K = 30. Intermediate values of K did not yield additional insights. The column labeled ‘Baseline’ contains the recognition results without any form of imputation or feature enhancement. The results are shown for both the oracle and the estimated masks.

Test set A

                         5 Frames                      30 Frames
SNR (dB)  Baseline       Oracle         Estimated      Oracle         Estimated
20        94.09 (±0.41)  97.78 (±0.25)  90.24 (±0.51)  97.16 (±0.28)  95.06 (±0.37)
15        85.37 (±0.61)  97.30 (±0.28)  86.66 (±0.59)  95.85 (±0.34)  91.76 (±0.48)
10        65.35 (±0.83)  95.66 (±0.35)  78.20 (±0.72)  91.46 (±0.48)  81.40 (±0.67)
5         36.96 (±0.84)  91.46 (±0.48)  60.03 (±0.85)  80.01 (±0.69)  58.14 (±0.86)
0         14.39 (±0.61)  84.51 (±0.63)  34.96 (±0.83)  56.27 (±0.86)  29.38 (±0.79)
-5         7.52 (±0.46)  76.14 (±0.74)  15.97 (±0.63)  33.91 (±0.82)  13.08 (±0.58)
Average   50.61 (±0.35)  90.48 (±0.20)  61.01 (±0.34)  75.78 (±0.30)  61.47 (±0.34)

Test set B

                         5 Frames                      30 Frames
SNR (dB)  Baseline       Oracle         Estimated      Oracle         Estimated
20        94.25 (±0.40)  98.65 (±0.20)  91.96 (±0.47)  98.19 (±0.23)  96.64 (±0.31)
15        84.32 (±0.63)  98.33 (±0.22)  88.91 (±0.54)  97.32 (±0.28)  93.98 (±0.41)
10        61.22 (±0.85)  97.88 (±0.25)  80.65 (±0.68)  94.69 (±0.39)  85.08 (±0.62)
5         32.45 (±0.81)  95.87 (±0.34)  63.01 (±0.84)  85.74 (±0.61)  63.37 (±0.84)
0         13.02 (±0.58)  91.80 (±0.47)  36.22 (±0.83)  64.02 (±0.83)  35.16 (±0.83)
-5         7.27 (±0.45)  85.72 (±0.61)  16.23 (±0.64)  39.14 (±0.85)  15.45 (±0.63)
Average   48.75 (±0.35)  94.71 (±0.16)  62.83 (±0.34)  79.85 (±0.28)  64.94 (±0.34)

Table 2.1: Average ASR accuracies (%) for MPPCA imputation on aurora-2. 95% confidence intervals in parentheses.
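The source does not state how the 95% confidence intervals in Table 2.1 were computed. One common choice for accuracy figures is the normal-approximation binomial interval, sketched below. Both the formula and the count of 13,000 scored units per condition are assumptions made for illustration; neither is given in the source.

```python
import math


def ci95_half_width(accuracy_pct, num_trials):
    """Half-width (in percentage points) of a normal-approximation 95% interval.

    Assumes independent trials; `num_trials` is a hypothetical count of
    scored units, not a figure taken from the source.
    """
    p = accuracy_pct / 100.0
    return 100.0 * 1.96 * math.sqrt(p * (1.0 - p) / num_trials)


# Baseline accuracy at 20 dB on test set A, with an assumed 13,000 trials.
half = ci95_half_width(94.09, 13000)
```

Under these assumptions the half-width comes out at roughly 0.4 percentage points, the same order of magnitude as the intervals reported in the table. The formula also explains why the intervals are widest for accuracies near 50% and narrowest near 0% or 100%.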

2.3.2.1 Interpretation

Because the MPPCA method is noise-ignorant, the recognition accuracies for test sets A and B are similar. In a previous application of CS to noise-robust ASR (Gemmeke et al., 2011b), the performance for test set B was clearly inferior. If anything, in our data the accuracy in test set B is higher than in test set A. The only parameter that could be (and indeed is) sensitive to the differences between the noise types is the threshold used in determining whether a spectro-temporal measurement is reliable.

From Table 2.1 it is clear that MPPCA-based imputation improves the recognition accuracy in almost all SNR conditions; the exception is the estimated mask with 5-frame blocks at SNR = 20 dB, where accuracy falls below the baseline in both test sets. Also, imputation with the oracle mask always outperforms the estimated mask. The relative loss with the estimated mask increases sharply with decreasing SNR level. The results with the oracle mask are always better for 5-frame blocks than for 30-frame blocks. The accuracies of 76.14% in test set A and 85.72% in test set B at SNR = −5 dB with five-frame blocks suggest that the MPPCA imputation yields excellent reconstructions, even if the number of reliable features is very small. The oracle mask guarantees that there are no false positives. The performance with the oracle mask with the 30-frame blocks suggests that there is indeed an effect of context smearing. That effect is already visible at SNR = 20 dB and becomes worse with decreasing SNR level. At the lowest SNR levels the number of reliable features is small. If the reliable features happen to be at a large distance from the frame under analysis, the average of the 30 estimates is biased towards the spectra of the remote frames. Even in the absence of false accepts it is difficult to reconstruct spectral vectors on the basis of reliable features scattered randomly over 30 by 23 pixel patches if these patches are sparsely filled.

For the estimated mask the impact of block size is more complex. Although the mask threshold was set to a level that minimizes false accepts (at the cost of a substantial proportion of false rejects, which are less harmful), it is not possible to completely prevent false accepts with the estimated mask. The relative proportion of false accepts increases with decreasing SNR. With the 5-frame blocks the effect of a decreasing number of reliable features, in combination with an increasing proportion of false accepts as SNR decreases, is evident from the highest SNR levels. For SNR > 5 dB the 30-frame blocks outperform the 5-frame blocks, so much so that, averaged over the six SNR levels, the 30-frame blocks come out as the winner. Most probably, the availability of a larger number of correctly identified reliable features, along with a relatively small number of false accepts, overcomes the detrimental effect of context smearing. In the conditions with SNR ≤ 5 dB the smearing effect becomes dominant.
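The notions of false accepts and false rejects used above can be made concrete by comparing an estimated mask against the oracle mask. The sketch below is an illustration only. The normalization is an assumption: false accepts are counted relative to all features the estimated mask accepts, and false rejects relative to all features the oracle marks as reliable. The function name is hypothetical.

```python
def mask_error_rates(oracle, estimated):
    """Return (false_accept_rate, false_reject_rate) for two binary masks.

    Masks are equally sized lists of rows; a truthy entry marks a
    spectro-temporal feature as reliable.
    """
    false_accepts = false_rejects = accepted = reliable = 0
    for oracle_row, estimated_row in zip(oracle, estimated):
        for o, e in zip(oracle_row, estimated_row):
            accepted += bool(e)
            reliable += bool(o)
            false_accepts += bool(e) and not o  # accepted, but truly unreliable
            false_rejects += bool(o) and not e  # rejected, but truly reliable
    fa_rate = false_accepts / accepted if accepted else 0.0
    fr_rate = false_rejects / reliable if reliable else 0.0
    return fa_rate, fr_rate


# Toy 2 x 3 masks (rows are frames, columns are bands).
oracle_mask = [[1, 1, 0], [0, 1, 0]]
estimated_mask = [[1, 0, 0], [1, 1, 0]]
fa_rate, fr_rate = mask_error_rates(oracle_mask, estimated_mask)
```

Raising the mask threshold trades one error type for the other: fewer features are accepted, so false accepts drop while false rejects rise, which is the trade-off described in the paragraph above.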

In summary, the experiment on aurora-2 confirms the efficacy of MPPCA-based imputation. However, in conditions with SNR ≤ 5 dB the effect of a growing proportion of false accepts, in combination with a decreasing number of features that are considered reliable, is increasingly difficult to overcome.
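The comparison between block sizes for the estimated mask can be re-derived directly from the accuracies in Table 2.1. The short sketch below copies the estimated-mask columns from the table and lists the SNR levels at which 30-frame blocks beat 5-frame blocks. The variable and function names are hypothetical.

```python
# Estimated-mask accuracies (%) from Table 2.1, ordered by SNR level.
SNR_LEVELS = [20, 15, 10, 5, 0, -5]
EST_K5 = {
    "A": [90.24, 86.66, 78.20, 60.03, 34.96, 15.97],
    "B": [91.96, 88.91, 80.65, 63.01, 36.22, 16.23],
}
EST_K30 = {
    "A": [95.06, 91.76, 81.40, 58.14, 29.38, 13.08],
    "B": [96.64, 93.98, 85.08, 63.37, 35.16, 15.45],
}


def snr_where_k30_wins(test_set):
    """SNR levels (dB) at which 30-frame blocks outperform 5-frame blocks."""
    return [
        snr
        for snr, acc5, acc30 in zip(SNR_LEVELS, EST_K5[test_set], EST_K30[test_set])
        if acc30 > acc5
    ]


wins_a = snr_where_k30_wins("A")
wins_b = snr_where_k30_wins("B")
```

For test set A the 30-frame blocks win at 20, 15, and 10 dB only. For test set B they are still marginally ahead at 5 dB (63.37 vs 63.01), a difference well inside the reported confidence intervals, so the crossover near 5 dB holds for both sets.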
