In a common NMF framework, speech enhancement is essentially a special case of generic separation. A speech spectrogram estimate Ψs is computed from the atoms As and activations Xs belonging to the target speaker using equations like (2.2) and (4.2). It is compared to the overall estimate Ψ, producing the bin-wise spectrogram filter weight matrix Ψs/Ψ, which is then used to filter the original spectrogram.
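The filtering step can be illustrated with a minimal NumPy sketch. It assumes single-frame atoms, so that each estimate is a plain matrix product, and that the overall estimate Ψ is the sum of a speech and a noise estimate. The function name `nmf_filter`, the noise variables `A_n` and `X_n`, and the small constant `eps` are illustrative choices, not taken from the thesis.

```python
import numpy as np

def nmf_filter(A_s, X_s, A_n, X_n, Y, eps=1e-12):
    """Filter the noisy spectrogram Y with bin-wise weights psi_s / psi.

    A_s, A_n: (B, K_s) and (B, K_n) non-negative speech and noise atoms.
    X_s, X_n: (K_s, T) and (K_n, T) non-negative activations.
    Y:        (B, T) original noisy spectrogram.
    """
    psi_s = A_s @ X_s          # speech spectrogram estimate
    psi_n = A_n @ X_n          # noise spectrogram estimate
    psi = psi_s + psi_n        # overall estimate (assumed additive)
    W = psi_s / (psi + eps)    # filter weights, each between 0 and 1
    return W * Y, W            # enhanced spectrogram and the filter itself

# Toy example: speech occupies band 0, noise occupies band 1.
A_s = np.array([[1.0], [0.0]])
A_n = np.array([[0.0], [1.0]])
X_s = np.array([[2.0, 2.0]])
X_n = np.array([[1.0, 3.0]])
Y = np.array([[5.0, 6.0], [7.0, 8.0]])
enhanced, W = nmf_filter(A_s, X_s, A_n, X_n, Y)
```

In this toy case the filter passes band 0 unchanged and suppresses band 1, because the speech estimate accounts for all of the modelled energy in the former and none in the latter.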
A schematic diagram of the process with a sample utterance is shown in Figure 5.1. Speech and noise spectrogram estimates are computed with NMF. They form the previously described filter matrix, which is finally applied to the original noisy spectrogram to obtain the enhanced speech spectrogram. Optionally, post-processing operations such as spectro-temporal smoothing may be applied to reduce artifacts from the plain filter, thus improving perceived quality or signal behaviour in further processing. Post-processing has been observed to improve output quality measured by computational metrics in speech-music separation [54, 56]. However, none was used in the work included in this thesis.
In a consistently designed spectrogram enhancement and ASR system, there is no need to return all the way to the signal level, because typical back-end features can be derived from the DFT spectrum or even from the compressed filter bank coefficients used in factorisation. Nevertheless, signal level synthesis was used in the included publications, both because several of the back-end feature extractors employed were not directly compatible with the internal NMF feature space, and in order to produce wave files for computational and subjective quality evaluations.
In robust ASR, the benefits of properly functioning speech enhancement are obvious. Any noise features in the input will make it less speech-like, and thus a worse match to back-end models representing speech. Not even multi-condition training can truly compensate for the mismatch, especially if the noise is non-stationary and unpredictable in its behaviour. Whenever the gains from noise removal outweigh the possible loss of actual speech features, improvements in back-end recognition accuracy can be expected. The proposed NMF algorithms have yielded consistent increases in recognition rates even using semi-supervised enhancement without a trained noise model [P4, P6]. As usual, re-training the back-end with similarly enhanced speech will reduce the mismatch further [P8]. Apart from ASR, speech enhancement has applications in recording, transmission and storage for better intelligibility, more efficient compression via reduction of noise-like information, and in any speech processing that benefits from a cleaner input signal.
Figure 5.1: Steps taken by a factorisation and enhancement system to compute spectrogram estimates, a time-varying filter, enhanced speech features, and sparse classification output. [Diagram: the noisy speech spectrogram (mel band against time in seconds) is factorised with speech and noise bases into speech and noise activations; the speech and noise estimates form a spectro-temporal filter, which is applied to the original noisy spectrogram to give enhanced speech for back-end recognition, while the activations also feed SC.]
5.3 Sparse Classification
Sparse classification (SC), briefly introduced in Section 2.4.3, is an alternative approach to ASR that exploits the factorisation output without converting it back into a spectral or waveform domain. The nomenclature arises from sparse representations, where enforcing sparsity on the model has been found beneficial for the discovery of key features from noisy or mixed data [26]. Although sparsity constraints regularly appear in separation and enhancement tasks, in SC they can be considered almost essential [48], unlike in separation, where the bias introduced by sparsity objectives may even be detrimental to quality metrics [189].
As seen in the diagram of Figure 5.1, sparse classification output is derived from the activation weights of factorisation with no need to construct the spectrogram estimates. In simple tasks with few discrete output classes, such as keyword spotting or speaker recognition, it suffices to observe which atoms were activated in factorisation. Assuming that activations can be represented as a fixed-length vector, common classification algorithms are applicable for training and evaluating the class boundaries [143]. For longer observations, where the temporal structure of words or other events is important, it becomes necessary to model the temporal dimension as well.
Early sparse classifiers used histogram modelling for multi-digit recognition tasks, assigning state likelihoods uniformly over the duration of atoms and windows [42, 178]. Notable temporal blurring was consequently present in the likelihood estimates over utterances. It was soon found that assigning state transcriptions at the frame level to multi-frame atoms improved the decoding accuracy significantly [190]. For more general speech recognition tasks with a large vocabulary, explicit state modelling over time becomes effectively essential.
In the work of this thesis, sparse classification is based on assigning label matrices to speech atoms. Let us assume that the language model employed in recognition contains Q states, which may denote e.g. phonetic or sub-word models. Each speech atom, whose spectral feature content is a B × T matrix, is also given a Q × T label matrix. The label matrix may be binary, i.e. only one state entry per frame is active at weight 1, or fuzzy so that several state candidates may be active at variable weights. After the speech activations have been determined, the same reconstruction formulae (2.2) and (4.2) that produce a B × Tutt spectral feature estimate in spectrogram estimation are applied to the label matrices, yielding a Q × Tutt matrix of state weights over observation frames. The matrix thus conveys similar information to the likelihood estimates obtained from conventional evaluation of GMMs for the frames of an utterance.
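The label reconstruction can be sketched as follows. The sketch assumes a convolutive, overlap-add form of reconstruction in which an atom activated at window position w contributes its T frames to utterance frames w to w + T - 1, so that Tutt = W + T - 1 for W window positions. This form, the array layout, and the name `state_weights` are assumptions made for illustration; any normalisation over overlapping windows that the actual formulae may include is omitted.

```python
import numpy as np

def state_weights(labels, X):
    """Overlap-add activated label matrices into utterance-level state weights.

    labels: (K, Q, T) array, one Q x T label matrix per speech atom.
    X:      (K, W) non-negative speech activations over W window positions.
    Returns a (Q, W + T - 1) matrix of state weights.
    """
    K, Q, T = labels.shape
    W = X.shape[1]
    S = np.zeros((Q, W + T - 1))
    for t in range(T):
        # Frame t of every atom, weighted by the activations,
        # lands on utterance frames t .. t + W - 1.
        S[:, t:t + W] += labels[:, :, t].T @ X
    return S

# Two atoms of two frames each, with binary labels:
# atom 0 is labelled with state 0, atom 1 with state 1.
labels = np.array([[[1.0, 1.0], [0.0, 0.0]],
                   [[0.0, 0.0], [1.0, 1.0]]])
# Atom 0 is active at window 0, atom 1 at window 2 with weight 2.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])
S = state_weights(labels, X)
```

With binary labels, each column of the result simply accumulates the activation weight of every atom frame that carries the given state at that point in the utterance; fuzzy labels spread the same weight over several states.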
Although the distribution and magnitude scaling of SC state weight estimates will be different from the output of GMM evaluation, they can be decoded using HMMs trained with a conventional back-end as long as the correct path is found by the Viterbi algorithm. For AURORA-2 data, 96–97% digit recognition accuracy was achieved for clean speech even with early versions of the SC framework [P1, 44]. Keyword accuracy for 1st CHiME clean development data was approximately 93% [P3]. In both cases the error rate is slightly higher than for baseline GMM evaluation. A likely reason is that the presented SC systems operate on plain mel spectra, which are not as accurate as mel-cepstral features for classifying phonetically close keywords like ‘five’/‘nine’ or ‘b’/‘v’.
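As a rough illustration of the decoding step, the sketch below runs a log-domain Viterbi search over a Q × Tutt state weight matrix, treating the weights as unnormalised observation scores. The transition and initial log-probabilities would come from a trained HMM; here they are hypothetical uniform values, and the function name `viterbi` and the flooring constant `eps` are likewise illustrative rather than taken from the thesis.

```python
import numpy as np

def viterbi(S, log_trans, log_init, eps=1e-12):
    """Find the best state path through a (Q, T_utt) state weight matrix.

    S:         non-negative state weights used as observation scores.
    log_trans: (Q, Q) log transition scores, log_trans[i, j] for i -> j.
    log_init:  (Q,) log initial state scores.
    """
    Q, T = S.shape
    log_obs = np.log(S + eps)              # floor zero weights before the log
    delta = log_init + log_obs[:, 0]       # best score ending in each state
    back = np.zeros((Q, T), dtype=int)     # best predecessor per state and frame
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # scores[i, j]: from i to j
        back[:, t] = np.argmax(scores, axis=0)
        delta = scores[back[:, t], np.arange(Q)] + log_obs[:, t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):          # trace the predecessors backwards
        path.append(int(back[path[-1], t]))
    return path[::-1]

# State weights favouring state 0 in the first two frames and state 1 afterwards.
S = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 2.0]])
log_trans = np.log(np.full((2, 2), 0.5))   # hypothetical uniform transitions
log_init = np.log(np.full(2, 0.5))         # hypothetical uniform initial scores
path = viterbi(S, log_trans, log_init)
```

Because only the relative ordering of path scores matters in the search, the differing scale of SC weights and GMM likelihoods does not in itself prevent decoding, which is consistent with the remark above.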
In noisy conditions, the SC approach has repeatedly surpassed conventional GMM recognisers due to its superior robustness via explicit noise modelling [P6, 44]. However, compared to a GMM back-end with NMF feature enhancement and model re-training, there have been results favouring either FE or SC, depending on the NMF model and back-end parameters [P6, 44]. Another significant factor is the method of assigning the label matrices. The first SC systems used binary matrices acquired simply by assigning the single state determined by forced alignment as the only active entry of each atom frame. Thereafter, more advanced algorithms such as ordinary and partial least squares regression (OLS, PLS) and NMD learning have been proposed, with solid improvements in recognition accuracy obtained just by better translation of activations into state estimates [70, 106].
Although the direct SC approach described here is unlikely to provide a complete replacement for GMM evaluation, the information available in speech activation weights is clearly meaningful and should not be ignored in decoding. Similarly to other exemplar and template systems [17, 21, 145, 146, 164], the information could be exploited even more efficiently through integration with other recognition methods. For example, FE and SC streams have been found complementary in multi-stream recognition [204]. Other combinations of methods, both previously proposed and emerging, are discussed in the next section.