1.3 State of the art of speech enhancement algorithms
1.3.2 Sound source separation
1.3.2.2 Multichannel source separation
Multichannel SSS can be divided into two main approaches: one is inspired by independent component analysis (ICA), and the other relies on sparse representations of speech in which only a small number of the source components differ significantly from zero.
The first BSS techniques applied to SSS were based on ICA [Comon, 1994]. The main assumption of ICA is that the sources are statistically independent and non-Gaussian, and the separation problem is formulated as a mixing matrix estimation problem. Further assumptions about the number of microphones and the mixing process are required. ICA tries to find the independent components of the mixture by maximizing the statistical independence of the estimated components, either by minimizing the mutual information or by maximizing the non-Gaussianity. The main limitations of ICA are: the original formulation is not valid for underdetermined mixtures, the mixing matrix needs to be stationary during a period of time (i.e. the sources cannot move), the sources should come from different spatial directions, and the number of sources must be known in advance. The algorithms based on ICA work very well when the signals are mixed instantaneously, but they do not perform as well in a reverberant environment. Many efforts have been made to adapt the original ICA to underdetermined and reverberant mixtures.
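The non-Gaussianity criterion can be illustrated numerically. The following is a minimal sketch in plain Python, not taken from the cited works: the uniform sources, the mixing weights and the helper name `excess_kurtosis` are illustrative assumptions. It compares the excess kurtosis of one source with that of an instantaneous mixture of two independent sources; the mixture is expected to be closer to Gaussian, which is why maximizing non-Gaussianity pushes the estimates back towards the individual sources.

```python
import random


def excess_kurtosis(samples):
    """Sample excess kurtosis: 0 for a Gaussian, about -1.2 for a uniform variable."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [v - mean for v in samples]
    var = sum(c * c for c in centered) / n
    fourth = sum(c ** 4 for c in centered) / n
    return fourth / (var * var) - 3.0


rng = random.Random(0)
n_samples = 20000

# Two independent, non-Gaussian (uniform) sources: an assumption for illustration.
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
s2 = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

# One row of an instantaneous mixing matrix (hypothetical weights).
mix = [0.6 * a + 0.8 * b for a, b in zip(s1, s2)]

k_source = excess_kurtosis(s1)
k_mix = excess_kurtosis(mix)
print(f"source: {k_source:.2f}, mixture: {k_mix:.2f}")
```

Excess kurtosis is only one possible measure of non-Gaussianity; it is used here because it needs nothing beyond sample moments.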
The work in [Hyvärinen and Oja, 1997] describes a fast implementation of the ICA algorithm, called FastICA. The algorithm finds all non-Gaussian independent components, one at a time, regardless of their probability distributions. The convergence of the algorithm is guaranteed, and the algorithm is 10 to 100 times faster than gradient-based ICA algorithms. The work in [Parra and Spence, 2000] exploits the non-stationarity of speech to estimate the multiple channels of echoic speech mixtures. The multi-path channels are identified using an LS optimization to estimate a forward model. An efficient FastICA (EFICA) algorithm is described in [Koldovsky et al., 2006], where the accuracy given by the residual error variance attains the Cramér-Rao lower bound. The algorithm assumes that the PDFs of the independent signals are generalized Gaussian distributions. Its computation time is only three times higher than that of the standard FastICA.
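The one-at-a-time fixed-point idea can be sketched for two instantaneous mixtures. The code below is a simplified illustration under stated assumptions, not the authors' implementation: it uses a kurtosis-based update, Cholesky whitening, uniform toy sources and a hypothetical mixing matrix, and the helper names `whiten_2d`, `fastica_one_unit` and `abs_corr` are introduced here only for the example.

```python
import math
import random


def whiten_2d(x1, x2):
    """Centre two mixtures and whiten them with a Cholesky factor of their covariance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    a = sum(u * u for u in c1) / n
    b = sum(u * v for u, v in zip(c1, c2)) / n
    c = sum(v * v for v in c2) / n
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    z1 = [u / l11 for u in c1]
    z2 = [(v - l21 * u) / l22 for u, v in zip(z1, c2)]
    return z1, z2


def fastica_one_unit(z1, z2, n_iter=30):
    """Kurtosis-based fixed-point update: w <- E[z (w.z)^3] - 3 w, then normalise."""
    n = len(z1)
    w = (1.0, 0.0)
    for _ in range(n_iter):
        g1 = g2 = 0.0
        for u, v in zip(z1, z2):
            y3 = (w[0] * u + w[1] * v) ** 3
            g1 += u * y3
            g2 += v * y3
        g1, g2 = g1 / n - 3.0 * w[0], g2 / n - 3.0 * w[1]
        norm = math.hypot(g1, g2)
        w = (g1 / norm, g2 / norm)
    return w


def abs_corr(p, q):
    """Absolute Pearson correlation, used only to check the toy result."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    vp = sum((a - mp) ** 2 for a in p)
    vq = sum((b - mq) ** 2 for b in q)
    return abs(cov) / math.sqrt(vp * vq)


rng = random.Random(1)
n = 5000
s1 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
s2 = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Hypothetical instantaneous mixing matrix [[1.0, 0.5], [0.3, 1.0]].
x1 = [1.0 * a + 0.5 * b for a, b in zip(s1, s2)]
x2 = [0.3 * a + 1.0 * b for a, b in zip(s1, s2)]

z1, z2 = whiten_2d(x1, x2)
w = fastica_one_unit(z1, z2)
y = [w[0] * u + w[1] * v for u, v in zip(z1, z2)]
print(abs_corr(y, s1), abs_corr(y, s2))
```

After whitening, the second component of a two-source problem is simply the direction orthogonal to `w`, so a single unit is enough in this toy case. The estimate is recovered only up to sign and scale, which is why the check uses an absolute correlation.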
A more recent approach for SSS is based on the assumption that the sources are sparse and do not overlap in the time-frequency domain. The sparsity-based approach solves the underdetermined separation problem, and the algorithms can be further divided into two categories: the first type of algorithm is based on MAP estimation of the sources, usually performed by l1-norm minimization, after estimating the mixing matrix either by clustering or
by using the ML criterion; the second type of algorithm is based on extracting the signals by means of time-frequency masking, where the masks can be calculated using different criteria. A relevant example of the first type is the algorithm described in [Bofill and Zibulevsky, 2001]. It exploits the sparsity of speech and music signals when they are represented in the STFT domain. The authors propose the use of a clustering algorithm to estimate the mixing matrix from only two sensors, and a shortest-path separation procedure based on the l1-norm
to recover the sparsest original signals from the mixtures. The algorithm also identifies the number of sources in the mixture. Tests with speech and music mixtures show good separation performance even when separating 6 sources from only 2 mixtures. Another interesting algorithm is the line orientation separation technique (LOST) described in [O’Grady and Pearlmutter, 2008]. The algorithm treats audio source separation as equivalent to the separation of linear subspaces in a mixture of oriented lines, and separates any number of sources from any number of instantaneous mixtures by identifying lines in a scatter plot. The orientation of each line is estimated using an EM procedure. In the case of underdetermined mixtures, the demixing is performed using l1-norm minimization.
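The l1-norm demixing step can be illustrated for the two-mixture case once the mixing directions are known. The sketch below is an assumption-laden toy, not the procedure of either cited paper: it relies on the linear-programming fact that a minimum-l1 solution of two equations can be found among solutions with at most two active sources, so it enumerates every pair of mixing directions and keeps the cheapest one. The function name `min_l1_decomposition` and the example angles are hypothetical.

```python
import math
from itertools import combinations


def min_l1_decomposition(x, directions):
    """Minimum-l1 source vector s with A s = x, where the columns of the
    2 x N matrix A are unit vectors at the given angles (in radians)."""
    cols = [(math.cos(t), math.sin(t)) for t in directions]
    best = None
    best_cost = math.inf
    for i, j in combinations(range(len(cols)), 2):
        (a, c), (b, d) = cols[i], cols[j]
        det = a * d - b * c
        if abs(det) < 1e-12:
            continue  # parallel directions cannot span the plane
        # Cramer's rule for the 2 x 2 system built from columns i and j.
        si = (x[0] * d - b * x[1]) / det
        sj = (a * x[1] - c * x[0]) / det
        cost = abs(si) + abs(sj)
        if cost < best_cost:
            best_cost = cost
            best = [0.0] * len(cols)
            best[i], best[j] = si, sj
    return best


# Three sources observed through two mixtures (underdetermined toy case).
angles = [0.2, 0.8, 1.4]
true_s = [1.0, 0.5, 0.0]
x = (
    sum(s * math.cos(t) for s, t in zip(true_s, angles)),
    sum(s * math.sin(t) for s, t in zip(true_s, angles)),
)
estimate = min_l1_decomposition(x, angles)
print(estimate)
```

In this toy case the selected pair is the one whose directions enclose the observation most closely, which matches the geometric intuition behind a shortest-path decomposition. Exhaustive enumeration is quadratic in the number of sources and is used here only for clarity.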
The best-known algorithm for SSS based on sparsity and time-frequency masking is the degenerate unmixing estimation technique (DUET) [Rickard and Yilmaz, 2002; Yilmaz and Rickard, 2004]. In these works, the authors introduce the concept of approximate W-disjoint orthogonality (WDO) to measure the orthogonality of speech signals in the STFT domain. The experiments demonstrate that there exists a time-frequency binary mask that allows each speech source to be separated from the mixtures, similar to the mask used in CASA, but the problem of estimating the IBM from the observations remains. Unlike traditional CASA approaches, which use a single mixture, the DUET algorithm uses two mixtures to estimate the IBM. The algorithm constructs a weighted two-dimensional histogram from estimates of the delay and level differences between the two microphones. The weighted histogram shows one peak for each source. Unsupervised clustering is applied to identify these peaks in the smoothed histogram, and the peaks are used to estimate the mixing parameters of each source. The demixing is performed via time-frequency masking, generating binary masks based on a proximity criterion. Listening experiments show that the WDO measure is fairly well correlated with subjective separation performance. Hence, the WDO measure is proposed as a good indicator of the separation performance for this type of SSS method. The DUET algorithm has three main limitations: the number of sources must be known in advance, its performance is reduced in echoic mixtures, and the use of time-frequency binary masks introduces residual musical noise. Many methods based on DUET have been proposed in the last decade, and some of the most relevant are listed below.
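Before turning to those variants, the feature extraction and masking steps of DUET can be sketched for a handful of time-frequency bins. This is a simplified illustration under stated assumptions: the histogram and clustering stage is replaced by peak locations that are assumed to be known, the proximity criterion is taken to be a plain Euclidean distance, and the toy bins each contain exactly one source. The function names `duet_features` and `binary_masks` are hypothetical.

```python
import cmath


def duet_features(x1, x2, omega):
    """Symmetric attenuation and relative delay of one time-frequency bin.

    x1 and x2 are the complex STFT values of the two mixtures and omega is
    the bin frequency in radians per sample (assumed non-zero).
    """
    ratio = x2 / x1
    a = abs(ratio)
    alpha = a - 1.0 / a                 # symmetric attenuation
    delta = -cmath.phase(ratio) / omega  # delay, valid while |omega * delay| < pi
    return alpha, delta


def binary_masks(bins, peaks):
    """Assign each bin to the nearest (alpha, delta) peak and build one mask per peak."""
    labels = []
    for x1, x2, omega in bins:
        alpha, delta = duet_features(x1, x2, omega)
        dists = [(alpha - pa) ** 2 + (delta - pd) ** 2 for pa, pd in peaks]
        labels.append(dists.index(min(dists)))
    return [[1 if lab == k else 0 for lab in labels] for k in range(len(peaks))]


# Toy mixing parameters (attenuation, delay in samples) for two sources.
params = [(0.5, 1.0), (2.0, -1.0)]
# Peak centres in (symmetric attenuation, delay); assumed already estimated.
peaks = [(a - 1.0 / a, d) for a, d in params]

# Four bins, each containing a single active source (ideal WDO).
active = [0, 1, 0, 1]
omegas = [0.5, 1.0, 1.5, 2.0]
spectra = [1 + 2j, -0.5 + 1j, 2 - 1j, 0.3 + 0.7j]
bins = []
for src, omega, s in zip(active, omegas, spectra):
    a, d = params[src]
    bins.append((s, a * cmath.exp(-1j * omega * d) * s, omega))

masks = binary_masks(bins, peaks)
print(masks)
```

Applying each mask to the STFT of one mixture and inverting the transform would give the corresponding source estimate; that step is omitted here.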
A multichannel DUET algorithm is described in [Melia and Rickard, 2006], combining the sparsity assumption with the estimation of signal parameters via rotational invariance technique (ESPRIT). The method, called DESPRIT, is limited to linear arrays.

A new algorithm for SSS, called time-frequency ratio of mixtures (TIFROM), is presented in [Abrard and Deville, 2005]. Using two microphones, it allows speech sources to be separated from instantaneous linear mixtures even if the original signals almost fully overlap in the time-frequency domain. The only condition required is the existence of slight differences in the time-frequency distributions of the original signals, i.e. each source only needs to occur alone in a small time-frequency area. The algorithm calculates time-frequency ratios of the mixed signals to identify those small time-frequency areas and estimates the mixing matrix. This approach is much less restrictive than ICA and sparsity-based approaches.
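The ratio-of-mixtures idea can be illustrated on a toy example. The sketch below is not the published algorithm: the mixing matrix, the three hand-built zones and the helper name `ratio_stats` are assumptions made for illustration. In a zone where a single source is active, the ratio of the two mixtures is constant, so its variance over the zone is essentially zero and its mean gives the ratio of the two mixing coefficients of that source.

```python
def ratio_stats(x1_zone, x2_zone):
    """Mean and variance of the ratios X1/X2 over one time-frequency zone."""
    ratios = [u / v for u, v in zip(x1_zone, x2_zone)]
    mean = sum(ratios) / len(ratios)
    var = sum(abs(r - mean) ** 2 for r in ratios) / len(ratios)
    return mean, var


# Hypothetical instantaneous mixing matrix: x = A s.
A = [[1.0, 1.0], [0.5, 2.0]]

# Each zone lists the (S1, S2) values of adjacent time-frequency points.
zones = [
    [(1 + 2j, 0j), (-0.5 + 1j, 0j), (2 - 1j, 0j)],                # only source 1
    [(1 + 1j, 0.5 - 1j), (-1 + 0.5j, 2 + 1j), (0.3j, -1 + 0j)],   # both sources
    [(0j, 1 - 1j), (0j, -2 + 0.5j), (0j, 0.5 + 0.5j)],            # only source 2
]

stats = []
for zone in zones:
    x1 = [A[0][0] * s1 + A[0][1] * s2 for s1, s2 in zone]
    x2 = [A[1][0] * s1 + A[1][1] * s2 for s1, s2 in zone]
    stats.append(ratio_stats(x1, x2))

# The two lowest-variance zones are taken as single-source zones.
single_source = sorted(stats, key=lambda mv: mv[1])[:2]
for mean, var in single_source:
    print(f"ratio {mean.real:.2f}, variance {var:.2e}")
```

Each estimated ratio fixes one column of the mixing matrix up to a scale factor, for example `(ratio, 1)`; with real recordings the variance is never exactly zero, so a threshold or ranking over many zones would be needed.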
Music signals do not satisfy the WDO assumption as well as pure speech signals do, due to their harmonic structure. The DUET algorithm is combined with CASA techniques to perform stereo music source separation in [Woodruff and Pardo, 2006]. The algorithm has three steps: a cross-channel histogram is computed using spatial cues (as in the DUET algorithm), the pitches of the original signals are estimated from this histogram to generate harmonic masks, and the harmonic amplitude envelopes are obtained from the pitch estimates.
The method in [Araki et al., 2007] proposes a generalized multichannel DUET algorithm that is valid for any number of sensors and any array geometry. The method performs k-means clustering using normalized amplitude and time differences between sensors. The phase differences are weighted to obtain a variance comparable to that of the level differences.

The two main approaches, ICA and sparseness, have also been combined in order to overcome their individual limitations. In [Araki et al., 2004] the authors combine both approaches with the aim of reducing the distortions associated with time-frequency binary masking. The algorithm estimates the time-frequency points where only one source is active, removes that source from the observations and applies ICA to the remaining mixtures. The time-frequency source estimation is inspired by DUET. Furthermore, the authors propose to reduce the distortions associated with binary masking by using a continuous mask based on a directivity pattern instead. The mask is generated with a null beamformer. The use of a soft mask reduces distortions even in reverberant rooms. ICA and time-frequency masking can also be combined the other way around: a time-frequency mask can be applied to the ICA outputs as a post-processing technique. For instance, in [Kolossa and Orglmeister, 2004] time-frequency masking is applied to the output of two frequency-domain ICA methods. The time-frequency masks are determined from the ratio of the demixed signal energies. The approach notably increases the output SNR.