4.3.3.4 Comparison with NMF-based SCBSS methods

In this evaluation, the following NMF-based SCBSS methods are used for comparison:

• NMF with Temporal Continuity and Sparseness Criteria (NMF-TCS) [37] based SCBSS method, as described in Chapter 3.

• Automatic Relevance Determination NMF (NMF-ARD) [97] based SCBSS method, as described in Chapter 3.

Currently, there is no reliable NMF method for automatically estimating the number of components (e.g. the number of basis vectors in D), and normally this has to be set manually. As discussed in Section 4.2, each IMF is separated into a number of components that corresponds exactly to the number of sources. In this evaluation, however, more components than the number of sources are also used in order to assess the efficiency of the proposed method. To obtain a baseline comparison for each method, all NMF algorithms are tested by factorizing the mixture signal into I_s = 2, 4, ..., 10 components. In the case of NMF-ARD, the threshold has been modified so that it accepts all the initialized components. Since more than two components are used and the tested methods are blind, there is no information to indicate which component belongs to which source. Thus, the clustering method proposed in [57] is utilized, where the original sources are used as references to create component clusters for each source. However, a large number of components, i.e. I_s = 10, may not necessarily produce better results since more sub-sources need to be classified. If the recovered sub-sources are incorrectly clustered, they become interference to the otherwise correctly estimated source.

We have carried out additional analysis to compare the KLd-based k-means clustering method [57] with the supervised clustering method in [37]. The findings show that if the sub-sources are too sparse, both methods introduce errors during the clustering process. For example, beyond the 7th stage of decomposition by the EMD, the TF sub-sources are too sparse to be assigned to the correct sources. If wrongly clustered, such a sub-source becomes interference to the intended source. To mitigate this, a power threshold is set, as described in Section 4.2, to judge whether an IMF is of acceptable quality. The results based on the KLd k-means clustering method are identical to those of the supervised clustering method in [37], except in special circumstances where the sub-sources are overly sparse in the TF domain. Figure 4.15 compares the ISNR performance of the proposed method with that of the NMF-TCS and NMF-ARD methods under different mixture types and for an increasing number of components, I_s = 2, 4, 6, 8, 10.
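The baseline protocol described above (factorize the mixture into I_s components, then group the components per source using the original sources as references) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a plain KL-divergence NMF with multiplicative updates stands in for NMF-TCS and NMF-ARD, a normalized-correlation score stands in for the clustering methods of [57] and [37], the magnitude spectrograms are synthetic and assumed additive, and the names kl_nmf, assign_components and masked_estimates are hypothetical.

```python
import numpy as np

EPS = 1e-12  # guards against division by zero

def kl_nmf(V, n_components, n_iter=200, seed=0):
    """Plain KL-divergence NMF, V ~ D @ H, via multiplicative updates."""
    rng = np.random.default_rng(seed)
    D = rng.random((V.shape[0], n_components)) + EPS
    H = rng.random((n_components, V.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (D.T @ (V / (D @ H + EPS))) / (D.sum(axis=0)[:, None] + EPS)
        D *= ((V / (D @ H + EPS)) @ H.T) / (H.sum(axis=1)[None, :] + EPS)
    return D, H

def assign_components(D, H, references):
    """Label each component with the index of the best-matching reference."""
    labels = np.empty(D.shape[1], dtype=int)
    for i in range(D.shape[1]):
        comp = np.outer(D[:, i], H[i])  # rank-1 "sub-source" in the TF domain
        scores = [
            np.sum(comp * ref) / (np.linalg.norm(comp) * np.linalg.norm(ref) + EPS)
            for ref in references
        ]
        labels[i] = int(np.argmax(scores))
    return labels

def masked_estimates(V, D, H, labels, n_sources):
    """Soft-mask the mixture by each cluster's share of the full model."""
    model = D @ H + EPS
    return [
        V * ((D[:, labels == k] @ H[labels == k]) / model)
        for k in range(n_sources)
    ]

# Synthetic two-source example (assumption: magnitudes add in the mixture).
rng = np.random.default_rng(1)
S1 = np.outer(rng.random(32), rng.random(40))
S2 = np.outer(rng.random(32), rng.random(40))
V = S1 + S2

for n_comp in (2, 4, 6, 8, 10):  # the sweep over I_s used in the evaluation
    D, H = kl_nmf(V, n_comp)
    labels = assign_components(D, H, [S1, S2])
    est = masked_estimates(V, D, H, labels, 2)
    err = [np.linalg.norm(e - s) / np.linalg.norm(s) for e, s in zip(est, [S1, S2])]
    print(n_comp, labels, np.round(err, 3))
```

A component assigned to the wrong cluster contributes its energy to the wrong mask, which is the interference effect discussed above. The sketch only shows the mechanics and makes no claim about the relative performance of the methods compared in this section.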

Figure 4.15: Average ISNR using different number of components.

In Figure 4.15, the ISNR improvement of the proposed method over NMF-TCS and NMF-ARD can be summarised as follows: (i) for mixtures of music signals, the average improvement is 4.3 dB per source; (ii) for mixtures of speech and music signals, the average improvement is 3.1 dB per source; and (iii) for mixtures of speech signals, the average improvement is 3.3 dB per source. Of the two baselines, NMF-ARD gives the poorer separation results, while NMF-TCS performs slightly better. A common feature of these two methods is that neither incorporates a preprocessing step that benefits the nonnegative matrix factorization. This renders them less effective, especially when separating mixtures that contain speech sources. The results indicate that without the EMD preprocessing, it becomes difficult to obtain a unique spectral basis D, especially when the spectral overlap between the sources in the TF domain is large, since each column of D may then contain a combination of the spectral information of both sources. In this case, direct application of NMF methods no longer separates the sources efficiently.
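For readers who wish to reproduce figures of this kind, the sketch below computes an SNR improvement in dB. It assumes one common definition of ISNR, namely the SNR of the estimated source minus the SNR obtained when the unprocessed mixture is taken as the estimate. The exact definition used for Figure 4.15 is not restated in this section, so this form, and the function names snr_db and isnr_db, are assumptions.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB of an estimate with respect to a reference signal."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def isnr_db(source, estimate, mixture):
    """SNR gain of the estimate over using the raw mixture as the estimate."""
    return snr_db(source, estimate) - snr_db(source, mixture)

# Toy check: the interferer is attenuated by a factor of 10 in amplitude,
# which corresponds to an improvement of 20 dB.
rng = np.random.default_rng(0)
source = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = source + interferer
estimate = source + 0.1 * interferer
print(isnr_db(source, estimate, mixture))
```

Under this definition, an average improvement of a few dB per source means that the residual interference energy in each estimate is correspondingly lower than in the baseline estimate.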

4.4 Summary

This chapter has presented a novel framework that combines EMD with v-SNMF2D for single channel source separation. It has been shown that the IMFs have several properties that are desirable for the single channel source separation problem: (i) the degree of mixing in each IMF is less ambiguous than in the mixed signal; (ii) the IMFs have simpler and sparser spectral and temporal patterns, which allows the proposed v-SNMF2D algorithm to track them efficiently; and (iii) the IMFs serve as orthogonal temporal bases for signal separation, hence errors arising from any one IMF are averaged over all the IMFs, leading to smaller errors at the signal reconstruction stage. In the proposed v-SNMF2D algorithm, the sparsity parameters are individually optimized and adaptively tuned using the variational Bayesian approach to yield the optimal sparse codes. The proposed framework enjoys at least two significant advantages. Firstly, it separates the sources blindly across all the types of audio mixture considered, without the strong constraint of requiring training knowledge. Secondly, the v-SNMF2D algorithm gives a robust sparse decomposition and, under the non-negativity condition, the decomposition is unique, making it unnecessary to impose constraints in the form of statistical independence of the sources.

CHAPTER 5

SINGLE CHANNEL BLIND SOURCE SEPARATION USING GAMMATONE FILTERBANK AND ITAKURA-SAITO MATRIX FACTORIZATION

In this chapter, a novel framework for solving SCBSS is proposed, based on the cochleagram TF representation and a family of novel two-dimensional nonnegative matrix factorization algorithms based on the IS divergence. The proposed solution separates audio sources from a single channel without relying on training information about the original sources. The uniqueness of the proposed work can be summarised as follows:

(i) The gammatone filterbank is used to construct the TF representation of the audio signal. It produces a non-uniform TF domain, termed the cochleagram, in which each TF unit has a different resolution, unlike the classic spectrogram, which has uniform resolution only.

(ii) The separability theory has been derived in the TF domain, and a quantitative performance measure has been developed to evaluate how separable the sources are in the monaural mixed signal. In particular, the ideal condition under which the sources are perfectly separable has been identified. A separation framework using the gammatone filterbank is also proposed, and it is shown that the mixed signal is significantly more separable in the cochleagram than in the classic spectrogram and the log-frequency spectrogram (constant-Q transform).

(iii) A family of novel two-dimensional nonnegative matrix factorization algorithms based on the IS divergence has been developed to extract the spectral and temporal features of the sources. The proposed factorizations are scale invariant, whereby the lower energy components in the cochleagram are treated with equal importance as the higher energy components. Within the context of SCBSS, this property is highly desirable as it enables the spectral-temporal features of the sources, which are usually characterized by a large dynamic range of energy, to be estimated with significantly higher accuracy. This is to be contrasted with matrix factorization based on the LS distance and the KL divergence, both of which favor the high-energy components but neglect the low-energy components.
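The scale-invariance property in (iii) can be checked numerically. The sketch below uses the standard element-wise forms of the three costs (squared LS distance, generalized KL divergence and IS divergence), not the two-dimensional factorization models of this chapter, and the test data and function names are illustrative assumptions. Scaling both arguments by a factor lam scales the LS cost by lam squared and the KL cost by lam, but leaves the IS cost unchanged.

```python
import numpy as np

def ls_distance(x, y):
    """Squared Euclidean (least squares) distance."""
    return np.sum((x - y) ** 2)

def kl_divergence(x, y):
    """Generalized Kullback-Leibler divergence."""
    return np.sum(x * np.log(x / y) - x + y)

def is_divergence(x, y):
    """Itakura-Saito divergence."""
    r = x / y
    return np.sum(r - np.log(r) - 1.0)

rng = np.random.default_rng(0)
x = rng.random(1000) + 0.1  # strictly positive "energies"
y = rng.random(1000) + 0.1
lam = 1e-3                  # a low-energy copy of the same pattern

costs = [("LS", ls_distance, 2), ("KL", kl_divergence, 1), ("IS", is_divergence, 0)]
for name, cost, power in costs:
    ratio = cost(lam * x, lam * y) / cost(x, y)
    print(f"{name}: cost ratio = {ratio:.3e}, lam**{power} = {lam ** power:.3e}")
```

A TF unit that is a thousand times weaker therefore contributes a million times less to the LS cost and a thousand times less to the KL cost for the same relative mismatch, whereas its contribution to the IS cost is unchanged.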

This chapter is organized as follows. Section 5.1 introduces the different TF matrix representations and develops the separability theory. In Section 5.2, the family of IS divergence based NMF2D and regularised NMF2D algorithms is derived, and the proposed source separation framework is fully developed. Experimental results and a series of performance comparisons with other matrix factorization methods are presented in Section 5.3. Finally, Section 5.4 concludes the work of this chapter.