6.4 Quantile Equalization: Alternative Setups
6.4.6 Utterance Wise, Two Pass, and Online Processing
The characteristics of the data to be recognized determine whether an utterance wise implementation of joint quantile equalization and mean normalization performs better than a moving window online implementation. If the SNR is constant over the utterance, using more data to estimate the transformation parameters is likely to yield better results because the estimates are more reliable. As soon as the SNR changes within an utterance, the situation is different.
As already pointed out in the database description, the Car Navigation database was collected in real driving conditions. The objective was to record realistic data without explicitly waiting for stationary conditions, so many recordings were made during acceleration, deceleration, gear shifts, and changes of the road surface. Even though the isolated word utterances themselves are short, there is an obvious change of the background noise level in many of them (figure 6.10). Under these circumstances the online implementation of mean normalization with 500 ms delay and a short 1 s window performs significantly better than the utterance wise version (10th FMN in table 6.42). Quantile equalization can reduce the difference between the utterance wise and the moving window implementations, but the online implementation still yields better results on the noisy test data sets (10th QEF2 FMN).
Figure 6.10: Output of the 4th Mel scaled filter after 10th root (Y4) over time [s] for an utterance from the Car Navigation test set. The level of the background noise changes during the recording.
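The difference between utterance wise and moving window mean normalization can be illustrated with a minimal sketch. It assumes a single filter channel stored as a Python list, window length and delay given in frames (for example 100 and 50 frames if a 10 ms frame shift is assumed, which is not stated in this section), and a window that looks `delay` frames ahead of the current frame. The function names are hypothetical; this is not the implementation used for the experiments.

```python
def utterance_mean_normalize(frames):
    """Subtract the mean computed over the whole utterance."""
    mean = sum(frames) / len(frames)
    return [x - mean for x in frames]


def moving_window_mean_normalize(frames, window_len, delay):
    """Subtract a moving-window mean (hypothetical helper, lengths in frames).

    For frame t the window covers frames t + delay - window_len + 1 ... t + delay,
    clipped to the utterance boundaries, so the output for frame t becomes
    available `delay` frames after that frame was observed.
    """
    if not 0 <= delay < window_len:
        raise ValueError("expected 0 <= delay < window_len")
    n = len(frames)
    # Prefix sums make each window mean an O(1) lookup.
    prefix = [0.0]
    for x in frames:
        prefix.append(prefix[-1] + x)
    normalized = []
    for t in range(n):
        hi = min(n, t + delay + 1)               # exclusive upper bound
        lo = max(0, t + delay + 1 - window_len)  # inclusive lower bound
        mean = (prefix[hi] - prefix[lo]) / (hi - lo)
        normalized.append(frames[t] - mean)
    return normalized


# Toy signal: the background level steps from 1.0 to 2.0 mid-utterance.
signal = [1.0] * 200 + [2.0] * 200
utt = utterance_mean_normalize(signal)
online = moving_window_mean_normalize(signal, window_len=100, delay=50)
print(utt[0], utt[-1])
print(online[0], online[-1])
```

In this toy case the utterance wise version leaves offsets of -0.5 and 0.5 before and after the level change, while the moving window version returns 0.0 at both ends because each window only sees the local level.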
If online processing is not required, a two pass approach can be used. The percentage of silence is determined in a first recognition pass; the appropriate target quantiles can then be calculated by combining the training quantiles estimated on the speech and silence portions of the signal respectively. In the case of full histogram normalization this approach significantly improves the recognition performance (table 6.19 on page 89). When using quantile equalization there is no consistent improvement (10th QE(F) sil FMN in table 6.42). The approximation of the cumulative distributions with four quantiles is rough, and only the quantiles 1 to 3 are taken into account when determining the transformation function (table 6.41). These quantiles do not change to an extent that significantly influences the transformation function and consistently reduces the resulting error rates.
Table 6.41: Target quantiles for different amounts of silence (Car Navigation database).
target quantiles
100   0.98   1.04   1.07   1.13   1.32
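How the speech and silence training quantiles are combined is not detailed in this section. One plausible reading, sketched here purely as an assumption, is to treat the target distribution as a mixture of the speech and silence training distributions, weighted by the silence fraction found in the first pass, and to read the target quantiles off that mixture. The function name, the sample based representation, and the simple "smallest value whose cumulative weight reaches q" quantile rule are all hypothetical choices.

```python
def mixture_target_quantiles(speech, silence, silence_fraction, num_quantiles=4):
    """Quantiles 0..num_quantiles of a speech/silence mixture (assumed scheme).

    speech, silence:   non-empty lists of training values for one filter channel
    silence_fraction:  share of silence frames from the first pass, in [0, 1]
    """
    if not 0.0 <= silence_fraction <= 1.0:
        raise ValueError("silence_fraction must lie in [0, 1]")
    # Give every sample a weight so that each class sums to its mixture share.
    weighted = []
    if silence_fraction < 1.0:
        w = (1.0 - silence_fraction) / len(speech)
        weighted += [(x, w) for x in speech]
    if silence_fraction > 0.0:
        w = silence_fraction / len(silence)
        weighted += [(x, w) for x in silence]
    weighted.sort()

    targets = []
    for i in range(num_quantiles + 1):
        q = i / num_quantiles
        cumulative = 0.0
        for value, w in weighted:
            cumulative += w
            if cumulative >= q - 1e-12:  # small tolerance for rounding
                break
        targets.append(value)  # falls through to the maximum if never reached
    return targets


speech_values = [2.0, 3.0, 4.0, 5.0]
silence_values = [0.0, 1.0]
for fraction in (0.0, 0.5, 1.0):
    print(fraction, mixture_target_quantiles(speech_values, silence_values, fraction))
```

With only five quantiles the targets move in coarse steps as the silence fraction changes, which is in line with the observation above that the four-quantile approximation of the cumulative distribution is rough.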
The Aurora 4 database does not consist of recordings in realistic background noise conditions; it was created by adding noise to existing clean studio recordings. Some of the added noises are non–stationary, but the SNR remains constant over the utterances, which leads to a different tendency in the results: the lowest error rates are obtained with utterance wise estimation of the transformation functions, and online processing leads to an increase of the word error rates (table 6.43). For quantile equalization with filter combination in the 1 s delay, 5 s window setup the error rate rises from 25.9% to 27.3% for clean training and from 17.1% to 17.8% for multicondition training, but this is still better than the utterance wise baseline.
The 5 s window means that for many utterances in the test set the processing is incremental: the end of the sentence is reached before the first frame is dropped from the end of the window. With a 2 s window instead, the system should be able to react faster to changing noise conditions, but since the average SNRs are constant over the utterances in the Aurora 4 database, the 2 s window does not perform better than the 5 s window.
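The incremental behavior described above can be sketched with a bounded buffer: as long as the utterance is shorter than the window, every new frame only extends the data used for the quantile estimate, and nothing is dropped. The class and function names are hypothetical, the window length is given in frames, and the quantiles are picked by a simple rounded index into the sorted window, which is an assumption rather than the estimator used in the experiments.

```python
from collections import deque


def window_quantiles(values, num_quantiles=4):
    """Quantiles 0..num_quantiles of the values currently in the window."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / num_quantiles)]
            for i in range(num_quantiles + 1)]


class MovingWindowQuantiles:
    """Keeps the most recent `window_len` frames of one filter channel."""

    def __init__(self, window_len):
        self.window = deque(maxlen=window_len)
        self.dropped = 0  # frames that have left the window so far

    def push(self, value):
        # A full deque with maxlen discards its oldest entry on append.
        if len(self.window) == self.window.maxlen:
            self.dropped += 1
        self.window.append(value)
        return window_quantiles(self.window)


estimator = MovingWindowQuantiles(window_len=5)
for frame_value in [0.0, 1.0, 2.0]:
    quantiles = estimator.push(frame_value)
print(estimator.dropped, quantiles)  # short utterance: nothing dropped yet
```

Once more frames than `window_len` have been pushed, `dropped` starts counting and the estimate follows only the most recent frames, which is what allows a shorter window to react faster to a changing noise level.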
Conclusions:
• Quantile equalization can be implemented using moving windows with a short delay, if the application requires real–time online processing.
• In real world conditions with changing SNR, a moving window implementation that can adapt to these changes is advisable, even if real–time response is not needed.
• A two pass approach that considers the amount of silence is not able to improve quantile equalization.
• If the SNR of the utterances to be recognized can be expected to be constant and real–time processing is not required, utterance wise processing yields the best results.
Table 6.42: Car Navigation database: utterance wise (UTTERANCE) quantile equalization compared to an online implementation (delay : window length). 10th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF(2): quantile equalization with filter combination (2 neighbors), QE sil: target quantiles dependent on the amount of silence.
Word Error Rates [%]
                                         test set (SNR [dB])
                                         office (21)   city (9)   highway (6)
UTTERANCE          10th FMN                  2.9         29.8        60.1
                   10th QE FMN               3.0         12.0        19.4
                   10th QEF FMN              3.7         11.1        17.5
                   10th QEF2 FMN             3.4         11.3        18.2
UTTERANCE 2 PASS   10th QE sil FMN           3.0         11.8        19.7
                   10th QEF sil FMN          3.4         10.5        18.1
0.5s : 1s          10th FMN                  2.8         19.9        40.1
                   10th QE FMN               3.2         11.7        20.1
                   10th QEF FMN              3.6         10.3        17.1
                   10th QEF2 FMN             3.6          9.6        17.1
Table 6.43: Comparison of utterance wise (UTTERANCE) and online implementations (delay : window length) of quantile equalization. Average recognition results on the Aurora 4 noisy WSJ 5k database. 10th: root instead of logarithm, FMN: filter mean normalization, QE: quantile equalization, QEF: quantile equalization with filter combination.
Word Error Rates [%]
                              clean training   multi. training
UTTERANCE   10th FMN               29.7             17.8
            10th QE FMN            25.9             17.1
            10th QEF FMN           25.9             17.1
1s : 5s     10th FMN               31.0             18.3
            10th QE FMN            27.5             17.9
            10th QEF FMN           27.2             17.8
1s : 2s     10th FMN               31.1             18.4
            10th QE FMN            28.0             18.2
            10th QEF FMN           27.8             18.2