1.3 State of the art of speech enhancement algorithms
1.3.2 Sound source separation
1.3.2.1 Single-channel source separation
CASA methods are psychoacoustically motivated techniques that try to separate sound sources in the same way that the human auditory system does. The basis of CASA has already been described; its application to single-channel SSS consists of identifying and grouping the time-frequency regions that belong to each source in order to generate a time-frequency mask for each original speech source. Two types of grouping can be distinguished: simultaneous grouping, which aims to group sounds that overlap in time (i.e. frequency components), and sequential grouping, which aims to put together successive speech sections from the same speaker that are separated in time. Single-channel CASA techniques can be roughly divided into feature-based and model-based.
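As a minimal sketch of the last step, assume that the grouping stage has already assigned every time-frequency unit to one source. The label map `labels`, the random `mixture` magnitudes and the function name are hypothetical placeholders, not part of any cited system. Only the construction of one binary mask per source is shown.

```python
import numpy as np

def masks_from_labels(labels, n_sources):
    """Build one binary time-frequency mask per source from a label map."""
    return [(labels == s).astype(float) for s in range(n_sources)]

rng = np.random.default_rng(0)
# Hypothetical mixture magnitudes (frequency bins x time frames).
mixture = rng.random((64, 100))
# Hypothetical grouping result: each unit labelled with its source index.
labels = rng.integers(0, 2, size=mixture.shape)

masks = masks_from_labels(labels, n_sources=2)
# Applying each mask keeps only the units assigned to that source.
estimates = [mask * mixture for mask in masks]
```

Because every unit is assigned to exactly one source, the masks are complementary and the masked estimates add up to the mixture.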
Feature-based methods make use of intrinsic sound properties such as proximity in frequency and time, periodicity (harmonicity), amplitude modulation (AM) and frequency modulation (FM), temporal continuity, and onset/offset events. Many relatively simple algorithms have been proposed for extracting these features from a single speech signal. Unfortunately, extracting them from a mixture of speech signals is more complex, and more advanced feature extraction algorithms have been proposed for that purpose. The most relevant approaches to feature extraction, and the speech separation algorithms that involve one or several of these features, are described below.
Pitch estimation: A large number of algorithms that estimate the pitch of isolated speech signals can be found in the literature. However, estimating multiple pitches in speech mixtures is a harder task, mainly because the mutual overlap between voices weakens the pitch cues. The problem is circular: algorithms that use pitch estimates to separate speech sources need to estimate multiple pitches, but algorithms that estimate multiple pitches need to separate the speech sources first. Single pitch estimation methods exist in both the time domain and the spectral domain. There are many variants of both approaches, but most rely on the same ideas. Most spectral techniques are based on pattern matching, finding periodicity in the sequence of peaks in the power spectrum; hence they are highly dependent on frequency resolution. This idea was first applied in [Schroeder, 1968], and many variants have been proposed, for instance, [Duifhuis et al., 1982]. Time domain methods try to find periodic patterns in the time signal, normally using the autocorrelation function. This idea was introduced in [Rabiner, 1977] and variations are found, for instance, in [Klapuri, 2005].
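The time domain idea can be illustrated with a minimal sketch that picks the strongest autocorrelation peak within a plausible pitch range. It is a generic illustration, not the exact method of any cited work; the function name, the search range (60 to 400 Hz) and the synthetic test frame are assumptions.

```python
import numpy as np

def estimate_pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Estimate the pitch (Hz) of one voiced frame from its autocorrelation peak."""
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep the non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the search to lags that correspond to the assumed pitch range.
    lag_min = int(fs / fmax)
    lag_max = min(int(fs / fmin), len(acf) - 1)
    lag = lag_min + np.argmax(acf[lag_min:lag_max + 1])
    return fs / lag

fs = 16000                                  # assumed sampling rate in Hz
t = np.arange(int(0.04 * fs)) / fs          # one 40 ms frame
# Synthetic voiced frame: three harmonics of a 200 Hz fundamental.
frame = sum(np.sin(2 * np.pi * k * 200.0 * t) / k for k in (1, 2, 3))
f0 = estimate_pitch_autocorr(frame, fs)
```

On an isolated periodic frame the first strong peak falls at the pitch period. On a mixture, the peaks of the different voices interfere, which is the difficulty described above.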
In the case of multiple pitch estimation, there are two strategies. The first is iterative: a single pitch is estimated from the mixture, the speech source corresponding to that pitch is removed, and the pitch is estimated again from the remaining mixture, repeating the procedure until all sources are extracted. The earlier estimates can be refined in further iterations. The second and more elegant alternative is to estimate all the pitches jointly. Again, methods exist in the spectral and temporal domains, but the former are more common for multiple pitch estimation. Spectral approaches are based on the work in [Parsons, 1976], which seeks the harmonic series that best matches the spectrum. Another work based on it is [Vincent et al., 2010]. Some works in the time domain are [De Cheveigné, 1993; De Cheveigné and Kawahara, 1999]. The main limitation of spectral domain methods is frequency resolution, which is limited by the size of the analysis window and directly affects the accuracy of the pitch estimate. Time domain methods, on the other hand, are limited by the sampling resolution, and their computational efficiency is lower than that of spectral techniques.
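The two resolution limits can be made concrete with a short calculation. The sampling rate, window length and 200 Hz reference pitch below are assumed values chosen for illustration, and no zero-padding or interpolation is considered.

```python
fs = 16000          # sampling rate in Hz (assumed)
n_window = 512      # analysis window length in samples (assumed)

# Spectral domain: spacing between DFT bins for this window length.
bin_spacing_hz = fs / n_window

# Time domain: candidate periods are whole numbers of samples, so
# neighbouring pitch candidates around 200 Hz differ by one lag.
lag = round(fs / 200.0)
pitch_at_lag = fs / lag
pitch_at_next_lag = fs / (lag + 1)
lag_step_hz = pitch_at_lag - pitch_at_next_lag
```

Under these assumptions the spectral bins are 31.25 Hz apart, whereas adjacent lags around 200 Hz differ by roughly 2.5 Hz. The lag step grows with pitch, since it is approximately the squared pitch divided by the sampling rate.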
Multiple pitch estimates are useful for the simultaneous grouping of voiced speech segments, since they allow harmonics that are scattered across the frequency spectrum to be grouped. In [Parsons, 1976], voiced segments are separated using a comb filter with large responses at the fundamental frequency and its harmonics. This solution has been adopted in many subsequent works, for instance, in the model proposed in [Brown and Cooke, 1994], which combines pitch, FM and onset/offset detection. Early CASA models perform relatively well at low frequencies, where the harmonics are resolved, but their performance degrades at high frequencies, where the harmonics are unresolved. The work in [Hu and Wang, 2004] groups resolved and unresolved harmonics differently: resolved harmonics are grouped according to their periodicity, and unresolved harmonics are grouped according to AM rates.
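A minimal sketch of comb filtering is given below, assuming the pitch period of the target voice is already known and constant. It is a generic feed-forward comb filter (an average of copies delayed by multiples of the pitch period), not the filter of [Parsons, 1976]; the function names, the number of taps and the two synthetic voices are assumptions.

```python
import numpy as np

def comb_filter(x, period, n_taps=4):
    """Average n_taps copies of x delayed by multiples of `period` samples."""
    y = np.zeros(len(x))
    for m in range(n_taps):
        delay = m * period
        y[delay:] += x[:len(x) - delay]
    return y / n_taps

def harmonic_tone(f0, t, n_harmonics=3):
    """Toy voiced source: a few harmonics with 1/k amplitudes."""
    return sum(np.sin(2 * np.pi * k * f0 * t) / k
               for k in range(1, n_harmonics + 1))

fs = 16000
t = np.arange(int(0.1 * fs)) / fs
target = harmonic_tone(200.0, t)       # pitch period of 80 samples
interferer = harmonic_tone(140.0, t)   # competing voice
period = round(fs / 200.0)

# The filter is linear, so each source can be filtered on its own
# to see what happens to it inside the mixture.
skip = 3 * period                      # discard the start-up transient
target_out = comb_filter(target, period)[skip:]
interferer_out = comb_filter(interferer, period)[skip:]
interferer_gain = np.mean(interferer_out ** 2) / np.mean(interferer[skip:] ** 2)
```

The target passes unchanged because all of its harmonics fall on the peaks of the filter response, while the power of the interfering voice is reduced. Real speech has a time-varying pitch, so the period would have to be tracked frame by frame.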
Onset and offset detection: Sudden intensity changes (increases or decreases) are easily identified by finding the peaks and valleys of the first-order derivative of the intensity function with respect to time. However, many peaks and valleys can be caused by background noise; hence the intensity function should first be smoothed with a low-pass filter. An example of speech segmentation based on onset and offset analysis is [Hu and Wang, 2007]. Onset/offset detection is usually combined with pitch estimation to separate voiced segments, but it is also very useful for segregating unvoiced speech, which lacks periodicity. A representative example is [Hu and Wang, 2008].
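The procedure can be sketched as follows, assuming a frame-wise intensity in dB, a short Hann kernel as the low-pass filter and a fixed threshold on the derivative. These choices, the function name and the synthetic tone burst are illustrative assumptions and do not reproduce the multiscale analysis of [Hu and Wang, 2007].

```python
import numpy as np

def detect_onsets_offsets(x, fs, frame_ms=10.0, threshold_db=3.0):
    """Return onset and offset times (s) from the smoothed intensity derivative."""
    hop = int(fs * frame_ms / 1000)
    n_frames = len(x) // hop
    frames = x[:n_frames * hop].reshape(n_frames, hop)
    # Frame-wise intensity in dB; the small constant avoids log(0).
    intensity = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)

    # Low-pass smoothing with a normalised Hann kernel (edge-padded).
    kernel = np.hanning(7)
    kernel /= kernel.sum()
    padded = np.pad(intensity, len(kernel) // 2, mode="edge")
    smoothed = np.convolve(padded, kernel, mode="valid")

    # First-order difference; d[i] is the change between frames i and i + 1.
    d = np.diff(smoothed)
    mid, left, right = d[1:-1], d[:-2], d[2:]
    peaks = np.where((mid > left) & (mid >= right) & (mid > threshold_db))[0] + 1
    valleys = np.where((mid < left) & (mid <= right) & (mid < -threshold_db))[0] + 1
    return (peaks + 1) * hop / fs, (valleys + 1) * hop / fs

fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
# Tone burst between 0.3 s and 0.6 s on top of weak background noise.
burst = np.where((t >= 0.3) & (t < 0.6), np.sin(2 * np.pi * 440 * t), 0.0)
x = burst + 0.001 * rng.standard_normal(len(t))
onsets, offsets = detect_onsets_offsets(x, fs)
```

The threshold discards the small peaks and valleys that remain in the noise-only regions after smoothing.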
Amplitude modulation extraction: AM detection is a common problem in signal processing. It corresponds to extracting the envelope of the signal, whose variations are assumed to be much slower than the carrier frequency (i.e. the fundamental frequency in the case of speech). Common methods are the Hilbert transform method and half-wave rectification followed by low-pass filtering. The amplitude modulation spectrum (AMS), which is the spectral representation of the signal envelope, is a very useful feature for speech separation. Some examples of algorithms that use the AMS for separation are [Kim and Loizou, 2010; Kim et al., 2009].
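Both envelope extraction methods can be sketched in a few lines. The analytic signal is computed here with a plain FFT, and the low-pass filter is a moving average; the function names, the 5 ms averaging window and the synthetic test signal (a 1 kHz carrier modulated at 10 Hz) are assumptions made for illustration.

```python
import numpy as np

def hilbert_envelope(x):
    """Envelope as the magnitude of the FFT-based analytic signal."""
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    # Zeroing the negative frequencies yields the analytic signal.
    return np.abs(np.fft.ifft(np.fft.fft(x) * gain))

def rectified_envelope(x, fs, window_s=0.005):
    """Envelope by half-wave rectification and moving-average low-pass filtering."""
    rectified = np.maximum(x, 0.0)
    n_win = int(round(window_s * fs))
    smoothed = np.convolve(rectified, np.ones(n_win) / n_win, mode="same")
    # A half-wave rectified sinusoid of amplitude A averages to A / pi.
    return np.pi * smoothed

fs = 16000
t = np.arange(fs) / fs
true_envelope = 1.0 + 0.5 * np.cos(2 * np.pi * 10 * t)
x = true_envelope * np.cos(2 * np.pi * 1000 * t)

env_hilbert = hilbert_envelope(x)
env_rectified = rectified_envelope(x, fs)
```

For this narrowband signal the Hilbert envelope is essentially exact, while the rectification method gives a close approximation away from the signal edges. Taking the spectrum of either envelope would give an amplitude modulation spectrum in the sense described above.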
Frequency modulation extraction: FM corresponds to variations of the carrier frequency, which occur at rates much slower than the carrier frequency itself. The FM feature used in CASA refers to a change in the frequency of a sound component, and it can be detected either from a two-dimensional cochleagram [Brown and Cooke, 1994] or from the response of a band-pass filter [Kumaresan and Rao, 1999].
Model-based methods treat source separation as an inference problem, in which some constraints must be included in order to recover an approximation of the original signals. The CASA approach uses a parametric model of the sources whose parameters are estimated from the mixture. The most common approach is the use of hidden Markov models (HMM) for the sources. The constraints included in the model represent the prior knowledge about the expected sources, and they can be either explicit or implicit. Explicit signal models typically use a dictionary containing the possible signals, for instance, in [Roweis, 2001], or they assume that the signals are contained in a subspace [Jang and Lee, 2002]. Feature-based models based on periodicity can be considered implicit signal models.
Unsupervised learning algorithms have also been applied to single-channel source separation. These methods usually apply a simple non-parametric model and use less prior information about the sources, learning the information directly from the data. One of the most popular approaches is based on non-negative matrix factorization (NMF). The work in [Virtanen, 2007] combines NMF with a sparseness constraint for single-channel SSS, minimizing a cost function that is a weighted sum of three terms: a reconstruction error term, a temporal continuity term, and a sparseness term.
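As a minimal sketch of the factorization itself, the code below applies the standard multiplicative updates for the Euclidean reconstruction error only. It does not implement the temporal continuity or sparseness terms of [Virtanen, 2007]. The function name, the iteration count and the random rank-2 matrix standing in for a magnitude spectrogram are assumptions.

```python
import numpy as np

def nmf(V, n_components, n_iter=500, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (bins x frames) as W @ H."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    W = rng.random((n_bins, n_components)) + eps      # spectral bases
    H = rng.random((n_components, n_frames)) + eps    # time-varying gains
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
# Hypothetical stand-in for a magnitude spectrogram built from two components.
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = nmf(V, n_components=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In a separation setting, each column of `W` paired with the corresponding row of `H` gives the magnitude spectrogram of one component, and components still have to be assigned to sources.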