Effects on separation based on different TF representation

CHAPTER 3 SINGLE CHANNEL BLIND SOURCE SEPARATION

5.3 Experimental Results and Analysis

5.3.1 Effects on separation based on different TF representation

In this section, the performance of our proposed Quasi-EM IS-NMF2D algorithm is evaluated by using three types of TF representation: (i) spectrogram (STFT with a 1024-point Hamming windowed FFT and 50% overlap), (ii) log-frequency spectrogram (as described in section 5.1 with 1024-point Hamming window). and (iii) cochleagram based on Gammatone filterbank of 128 channels, filter order of 4 (i.e.h4in Eqn.(5.2)), and the output is divided into 20-ms time frame with 50% overlap between consecutive frames. Speech signals and music are used to generate the monoaural mixture. In the following, we show that the separation results based on the cochleagram is significantly more effective than other TF domain. Table 5.7 shows the comparison of the proposed algorithm (quasi-EM IS-NMF2D) based on the spectrogram, log-frequency spectrogram and cochleagram under various audio mixtures.

Table 5.7: Separation results based on different TF representation

Mixtures TF methods SDR SAR SIR

spectrogram 3.47 6.57 4.86 log-frequency spectrogram 6.54 9.51 10.53 jazz music and male speech

cochleagram 8.87 10.31 12.62

spectrogram -1.41 5.87 0.14 log-frequency spectrogram 3.97 9.48 6.17 jazz music and female speech

cochleagram 9.34 9.77 14.37

spectrogram 2.10 4.34 6.23 log-frequency spectrogram 2.31 5.42 6.64 piano music and male speech

cochleagram 7.16 8.56 12.08

spectrogram -1.01 5.15 1.13 log-frequency spectrogram 0.27 8.01 2.25 piano music and female speech

cochleagram 7.44 9.18 11.38

spectrogram -0.59 6.34 0.97 log-frequency spectrogram 1.21 6.42 4.89 jazz music and piano music

cochleagram 7.21 13.07 8.68

The separation results for all mixture types based on the spectrogram gives an average SDR of 0.51dB while the log-frequency spectrogram gives an average SDR of 2.8dB. However, a significantly higher performance is attained by the cochleagram with an average SDR of 8dB which leads to a substantial gain improvement of 7.5dB and 5.2dB, respectively. The major reason for the large discrepancy between them is in the mixing ambiguity between X1.2 and

2 . 2

X in the TF domain. The larger the mixing ambiguity between X₁.2 and X₂ .2 , the more numerous TF units will be ambiguous which subsequently decreases the possibility of correct assignment of each unit to the sources. This inadvertently results in poorer performance of source separation. Figure 5.4 shows the spectrogram of the original sources, the mixed signal, and the estimated sources using the proposed Quasi-EM IS-NMF2D algorithm. The spectral overlapping between the two sources has resulted in mixing ambiguity in the time-domain as highlighted with red box

marked area in the last two panels of Figure 5.5.

Figure 5.4: Separation results in spectrogram. Top panel: Spectrogram of the original sources. Middle panel: Spectrogram of the mixture. Bottom panel: Spectrogram of the estimated sources.

Figure 5.5: Time-domain separated results based on spectrogram.

information about a particular TF unit and therefore, the resulting spectrogram fails to infer the dominating source. This leads to high degree of ambiguity in TF domain and causes lack of uniqueness in extracting the spectral-temporal features of the sources. Figures 5.6 and 5.7 show the separation results based on log-frequency spectrogram. Comparing with spectrogram, the separation performance is better since log-frequency spectrogram has the prosperity of non-uniform time frequency resolution. However, according to the analysis of separability in Section 5.1.4, the transform used by the log-frequency spectrogram is still not be an optimal option for audio source separation. The spectral overlapping based on log-frequency spectrogram between the two sources has resulted in mixing ambiguity in the time-domain as highlighted with red box marked area in the last two panels of Figure 5.8.

Figure 5.6: Separation results in log- frequency spectrogram. Top panel: Log-frequency spectrogram of the original sources. Middle panel: Log-frequency spectrogram of the mixture. Bottom panel: Log-frequency spectrogram of the estimated sources.

Figure 5.7: Time-domain separated results based on log-frequency spectrogram.

On the other hand, the results of separation in the cochleagram have led to significant SDR improvement. The cochleagram enables the mixed signal to be more separable and thereby reduces the mixing ambiguity between X1.2 and

2 . 2

X . This explains the

average performance of separating mixture jazz music and female utterance is highest among all the mixtures because both sources have very distinguishable TF patterns in the cochleagram. Figure 5.8 further shows the separation results in the cochleagram. The plot clearly shows the spectral energy of the two audio sources is clustered at different frequencies in the cochleagram due to their different fundamental frequencies. These prominent features have been separated using the proposed Quasi-EM IS-NMF2D algorithm. Figure 5.9 shows the final recovered time-domain sources.

Figure 5.8: Separation results in cochleagram. Top panel: Cochleagram of the original sources. Middle panel: Cochleagram of the mixture. Bottom panel: Cochleagram of the estimated sources.

Figure 5.9: Time-domain separated results using the proposed algorithm.

In the proposed method, the performance of source separation depends to an extent on how distinguishable the two spectral bases D₁ and D₂ are from each other. When D₁

and D₂ are distinguishable from each other and since

 

i _i





H are sparse, it follows that the mixing ambiguity between X1.2 and

2 . 2

X which constitutes the magnitude of

interference in the TF domain will be small. Thus, by exploiting the sparse property of

 

2 1

i i





H , it is possible to determine X1.2 and 2 . 2

X from Y.2 provided that D₁ and



D are sufficiently distinguishable. Figure 5.10 shows the results of D_i and H_i for the above mixture (mixing between female utterance and jazz music) when the factorization is obtained in the cochleagram. In Figure 5.10, panels (A)-(B) refer to D1 and



D which

are the estimated spectral bases of jazz music and female utterance, respectively. Panels (C)-(D) refer to H1 and



H which correspond to the estimated temporal code (i.e. time pitch signature) of jazz music and female utterance, respectively. In comparison, the results of D_i and H_i have also been included when factorizing the same mixture in the spectrogram and log-frequency spectrogram. These are shown in Figure 5.11 and 5.12, respectively.

Figure 5.10: Estimated D_i and H_i using the proposed algorithm based on cochleagram.

 



Figure 5.11: Estimated D_i and H_i using the proposed algorithm based on spectrogram.

Figure 5.12: Estimated D_i and H_i using the proposed algorithm based on log-frequency spectrogram.

In sharp contrast with Figure 5.10, Figures 5.11 amd 5.12 show overlap in the spectral bases between D₁ and D₂. The cochleagram based spectral bases estimation shows less overlap among the all. Hence, the recovered sources are much better as noted by the very high values of SDR in Table 5.7.

       

In document Single channel blind source separation (Page 138-146)