• No results found

3.4 Discussion

3.4.1 The features

In designing the modulation spectrum analysis system, a number of decisions had to be made about implementation details. Although we are confident that all our decisions were reasonable (and supported by data from the literature), we cannot claim that they were optimal. Most data in the literature on modulation spectra are based on perception experiments with human subjects, but more often than not these experiments use auditory stimuli that are very different from speech. While the results of those experiments surely provide guidance for ASR, it may well be that the automatic processing aimed at extracting the discriminative information is so different from what humans do that some of our decisions are sub-optimal. Our gammatone filterbank contains 15 one-third octave filters, which have a higher resolution in the frequencies < 500 Hz than the Mel filterbank that is used in

Chapter 3. Sparse Coding of the Modulation Spectrum for Noise-Robust Automatic

Speech Recognition 69

most ASR systems. However, initial experiments in which we compared our one-third octave filterbank with a filterbank consisting of 23 Mel-spaced gammatone filters, spanning the frequency range of 64 to 3,340 Hz did not show a significant advantage of the latter over the former. From the speech technology’s point of view, this may seem surprising because the narrow-band filters of the one-third octave filterbank in the low frequencies may cause interactions with fundamental frequency, while the relatively broad filters in the higher frequencies cannot resolve formants. But from an auditory system’s point of view, there is no such surprise, since one-third octave filters are compatible with most, if not all, outcomes of psycho-acoustic experiments. This is also true for experiments that focused on speech intelligibility (Drullman et al., 1994).

For the modulation filterbank, it also holds that the design is partly based on the results of perception experiments (Jørgensen and Dau, 2011). Our modulation frequency analyser contained filters with centre frequencies ranging from 0 to 16 Hz. From Kanedera et al. (1999) it appears that the modulation frequency range of interest for ASR is limited to the 2- to 16-Hz region. Therefore, here too we must ask whether our design is optimal for ASR. It might be that the spacing of the modulation filters in the frequency band that is most important for human speech intelligibility is not optimal for automatic processing. However, as with the gammatone filters, it is not evident why a different spacing should be preferred. It might be necessary to treat modulation frequencies ≤1 Hz, which are more likely to correspond to the characteristics of the transmission channel, different than modulation frequencies that might be related to articulation. One might think that the very low modulation frequencies would best be discarded completely in the aurora-2 task, where transmission channel characteristics do not play a role.

However, experiments in which we did just that yielded substantially worse results.

Arguably, the lowest modulation frequencies help in distinguishing time intervals that contain speech from time intervals that contain only silence or background noise. We decided to not include modulation filters with centre frequencies>16 Hz.

This implies that we ignore almost all information related to the periodicity that characterizes many speech sounds. However, it is well known that the presence of periodicity is a powerful indicator of the presence of speech in noisy signals and also, in case the background noise consists of speech from one or more interfering speakers, a powerful means to separate the target speech from the background

Chapter 3. Sparse Coding of the Modulation Spectrum for Noise-Robust Automatic

Speech Recognition 70

speech. In future experiments we will investigate the possibility of adding explicit information about the harmonicity of the signals to the feature set.

The experiments with the MLP classifiers for obtaining state posterior proba-bilities from the EMS features confirm that the EMS features capture most of the information that is relevant for speech decoding. Still, the WERs obtained with the MLPs were always inferior to the results obtained with stacks of nine con-ventional PLP features that include ∆ and ∆∆ features, especially in the cleanest SNR conditions. Although the EMS features are performing quite well in noisy conditions, in cleaner conditions their performance is worse than the classical PLP features. Adding ∆s and ∆∆s, computed as linear regressions over 90 ms windows, to the EMS features does not improve performance nearly as much as adding speed and acceleration to MFCC or PLP features. This suggests that our EMS features are suboptimal with respect to describing the medium-term dynamics of the speech signal. The time windows associated with the modulation frequency filters with the lowest centre frequencies is larger than 500 ms. As a consequence, time derivatives computed over a window of 90 ms for these slowly varying fil-ter outputs is not likely to carry much additional information. We suspect that the features in the lowest modulation bands play too heavy a role. If we want to optimally exploit the redundancy in the different modulation frequency chan-nels when part of them gets obscured by noise, information about relevant speech events (such as word or syllable onsets and offsets) should ideally be represented equally well by their temporal dynamics in all channels.

Perhaps the most striking difference between the auditory model used in this study and the model proposed in Dau et al. (1996) is the absence of the adap-tation/compression network between the gammatone filters and the modulation frequency filters. Preliminary experiments in which we applied tenth root com-pression to the output of the modulation filters (rather than the gammatone filters) already showed a substantial beneficial effect. The additional high-pass filtering that is performed in the compression/adaptation network (which should only be applied to the output of the gammatone filters) is expected to have a further ben-eficial effect in that it implements the medium-term dynamics that we seem to be missing at the moment. Including the adaptation stage is also expected to enhance the different dynamic characteristics of speech and many noise types in the modulation frequency bands. If this expectation holds, the absence of a proper adaptation network might explain the failure of the nine band fusion system.

Chapter 3. Sparse Coding of the Modulation Spectrum for Noise-Robust Automatic

Speech Recognition 71