Online Implementation - Quantile based histogram equalization for noise robust speech recogniti

5.4 Online Implementation

In the previous section the assumption was that the quantiles are determined on an entire utterance and that the transformation parameters are calculated once and remain constant for that utterance. This restriction can be dropped, but a real online application with a moving window requires some further considerations [Hilger et al. 2002], which are discussed in the following.

For online applications it is standard to implement mean and variance normalization using a moving window. If the delay and window length are chosen appropriately the recognition performance will not suffer significantly.
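As an illustration of such a moving-window normalization, the following is a minimal sketch and not the implementation used in this work. The function name `moving_window_mvn`, the frame-based parameters `t_win` and `t_del`, and the truncation of the window at the utterance boundaries are assumptions made for the example.

```python
import numpy as np

def moving_window_mvn(features, t_win, t_del, eps=1e-8):
    """Mean and variance normalization over a moving window.

    features: array of shape (T, D), one feature vector per time frame.
    t_win:    total window length in frames.
    t_del:    delay in frames, i.e. how many future frames the window may use.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    out = np.empty_like(features)
    for t in range(num_frames):
        # The window ends t_del frames after the current frame and spans
        # at most t_win frames; it is truncated at the utterance boundaries.
        stop = min(num_frames, t + t_del + 1)
        start = max(0, stop - t_win)
        window = features[start:stop]
        out[t] = (features[t] - window.mean(axis=0)) / (window.std(axis=0) + eps)
    return out
```

With a window and delay longer than the utterance this reduces to the usual utterance-wise normalization; shorter values trade normalization quality for latency.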

Quantile equalization can also be implemented using a window instead of the whole utterance, but when the two techniques are simply applied successively their individual delays add up, as shown in Figure 5.13. An initial delay has to elapse before the quantile equalization passes the first vector to the mean normalization; then the delay of the mean normalization has to elapse before the first vector is actually output and the feature extraction can continue with the calculation of the cepstral coefficients.

Figure 5.14 illustrates an alternative that combines the two steps without adding the delays [Hilger et al. 2002]. Assuming that quantile equalization and mean normalization have the same delay, the resulting delay is halved with this procedure, at the cost of increased computational complexity.

For each time frame t:

1. Calculate the signal’s quantiles Qki for each filter channel k in a window around the current time frame. The window length t_win should be a few seconds, and the window does not have to be symmetrical. The delay t_del can be short (a few time frames) if the application only allows short delays, or longer (seconds) if the recognition performance is more important.

2. If Qki < Q^train_i then Qki = Q^train_i

3. Determine the optimal transformation parameters αk and γk and apply the transformation to all vectors in the window.

4. Calculate the mean values of the resulting vectors within the window.

5. Subtract the mean to get the final vector of filter bank coefficients.

After that step the feature extraction can be continued as usual with the calculation of the cepstral coefficients.
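The steps listed above can be sketched for a single time frame as follows. This is a hedged illustration under stated assumptions, not the original implementation: the transformation `qe_transform` is assumed to be a power function scaled by the highest quantile, the parameters of step 3 are passed in rather than optimized, and the names `combined_frame`, `fbank` and `q_train` are hypothetical.

```python
import numpy as np

def qe_transform(y, q_max, alpha, gamma):
    # Assumed form of the transformation: a power function scaled by the
    # highest quantile. With alpha = 0 it is the identity.
    r = y / q_max
    return q_max * (alpha * r**gamma + (1.0 - alpha) * r)

def combined_frame(fbank, t, q_train, alpha, gamma, t_win, t_del):
    """Steps 1, 2, 4 and 5 for time frame t; step 3 only applies given parameters.

    fbank:        (T, K) non-negative filter bank outputs.
    q_train:      (N_Q + 1,) training quantiles, lowest to highest.
    alpha, gamma: (K,) transformation parameters per filter channel.
    Requires t_win > t_del so that frame t lies inside the window.
    """
    stop = min(fbank.shape[0], t + t_del + 1)
    start = max(0, stop - t_win)
    window = fbank[start:stop]                       # shape (W, K)

    # Step 1: quantiles of each filter channel inside the window.
    probs = np.linspace(0.0, 1.0, len(q_train))
    q = np.quantile(window, probs, axis=0)           # shape (N_Q + 1, K)

    # Step 2: a quantile never falls below its training counterpart.
    q = np.maximum(q, q_train[:, None])

    # Step 3 (application only): transform all vectors in the window.
    transformed = qe_transform(window, q[-1], alpha, gamma)

    # Steps 4 and 5: mean of the transformed window, then subtraction.
    mean = transformed.mean(axis=0)
    return transformed[t - start] - mean, q
```

Because the mean is computed on the same window that was just transformed, only one delay t_del elapses before the final vector is available, which is the point of the combined scheme.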

Figure 5.13: Application of quantile equalization and mean normalization using two successive moving windows; both delays add up. (Stages shown for a time frame t: calculate quantiles on the original vectors, apply transformation, calculate mean on the transformed vectors, subtract mean, final vector.)

Figure 5.14: Combined normalizing scheme with shorter delay. (Stages shown for each time frame t: calculate quantiles on the original vectors, apply transformation, calculate mean on the transformed vectors, subtract mean, final vector; the delay t_del and the window length t_win are marked.)

In the online implementation the expression in step 2 does not only make sure that a noise level with lower amplitude than in training is not scaled up, it also provides an important initialization of the quantiles at the beginning of the utterance: if the moving window only contains non-speech frames at the beginning of the utterance, even the high quantiles will be determined by these silence frames. A transformation that is simply based on this estimate would then transform the background noise level to the speech level observed in training. Initializing the quantiles according to the expression in step 2 prevents this. As long as the SNR is not too low the background noise level will not exceed the amplitude of the speech peaks, so the higher training quantiles can take the role of the speech estimate while that is not available yet.
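The effect of this initialization can be illustrated with a small numerical sketch for one filter channel. All numbers below are hypothetical and chosen only for illustration: a window of silence frames around 0.7 and five training quantiles that reach up to the speech peaks.

```python
import numpy as np

# Hypothetical window that contains only silence frames (values 0.68 ... 0.72).
silence = np.full(30, 0.7) + np.linspace(-0.02, 0.02, 30)
# Hypothetical training quantiles, lowest to highest.
q_train = np.array([0.65, 0.9, 1.2, 1.5, 1.8])

probs = np.linspace(0.0, 1.0, len(q_train))
q_window = np.quantile(silence, probs)     # all quantiles sit at the noise level
q_init = np.maximum(q_window, q_train)     # expression from step 2

print(q_window)
print(q_init)
```

In this example the lowest quantile keeps the value observed in the window, because the noise level is above the lowest training quantile, while the higher quantiles are taken from training until speech frames enter the window.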

The update of the parameters αk and γk in step 3 requires some modifications to make it practically applicable in an online system. Running a full grid search as described in Section 5.3 in every time frame requires a lot of computation and, more importantly, allows the transformation parameters to change significantly within a few time frames. Especially at the beginning of the utterance, when the first speech frames come in after the initial silence, the quantiles can change suddenly. If the update of the transformation parameters is not restricted this causes distortions, because the transformation then changes the signal faster than the signal itself changes. In that case many insertion errors usually occur and the error rates are higher than the baseline.

The temporal change of the transformation parameters has to be slow compared to the temporal behavior of the signal. This can be achieved by searching the updated parameter values within a small range around the previous ones, αk[t − 1] ± δ and γk[t − 1] ± δ, with δ in the order of 0.005 . . . 0.01. The changes that the transformation induces in the signal will then be slower than the signal’s own changes, yielding better recognition results. As a positive side effect the computational load is reduced significantly. If the step size for the grid search is also set to δ, only 9 combinations of αk and γk (3 values each) have to be evaluated, instead of the roughly 20000 that a full grid search over αk ∈ [0, 1] and γk ∈ [1, 3] would require.
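A restricted update of this kind might look like the following sketch for one filter channel. It is an illustration under assumptions, not the original code: the power-function form of `qe_transform` and the squared distance between transformed and training quantiles as the search criterion are assumed here, and the name `local_update` is hypothetical.

```python
import itertools
import numpy as np

def qe_transform(y, q_max, alpha, gamma):
    # Assumed power-function transformation; alpha = 0 gives the identity.
    r = y / q_max
    return q_max * (alpha * r**gamma + (1.0 - alpha) * r)

def local_update(q, q_train, alpha_prev, gamma_prev, delta=0.01):
    """Evaluate the 3 x 3 candidates around the previous parameters.

    q, q_train: (N_Q + 1,) current and training quantiles of one channel.
    Returns the (alpha, gamma) pair with the smallest squared distance.
    """
    best = (np.inf, alpha_prev, gamma_prev)
    for da, dg in itertools.product((-delta, 0.0, delta), repeat=2):
        # Keep the candidates inside the ranges of the full grid search.
        alpha = min(1.0, max(0.0, alpha_prev + da))
        gamma = min(3.0, max(1.0, gamma_prev + dg))
        err = np.sum((qe_transform(q, q[-1], alpha, gamma) - q_train) ** 2)
        if err < best[0]:
            best = (err, alpha, gamma)
    return best[1], best[2]
```

Calling `local_update` once per time frame, starting from alpha = 0 and gamma = 1, lets the parameters drift by at most δ per frame, so the transformation cannot change faster than the bound set by δ.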

If no prior information about the sentence to come is available, the initial values in the first time frame should be unbiased, i.e. αk = 0 and γk = 1, which corresponds to no transformation. As the sentence proceeds the transformation adapts to the current noise conditions, as the example in Figures 5.15 and 5.16 shows. There a delay of 1 second and a total window length of 5 seconds were used.

Figure 5.15: Example: output of the 6th Mel scaled filter over time for a sentence from the Aurora 4 test set before and after applying online quantile equalization with 1s delay and 5s window length. (Horizontal axis: time [s]; curves: clean, noisy, noisy QE.)

Figure 5.16: Cumulative distributions of the signals shown in Figure 5.11. (Horizontal axis: Y6; curves: clean, noisy, noisy QE.)
