Experimental work - The weighted-Gaussian kernel mean shift (WG-MS) algorithm for SSS

3.3 The weighted-Gaussian kernel mean shift (WG-MS) algorithm for SSS

3.3.5 Experimental work

Figure 3.4: Probability density function estimated from an anechoic mixture of three speech sources by the weighted-Gaussian kernel estimator.

Several tests have been carried out to assess the proposed WG-MS algorithm. Since the algorithm performs time-frequency masking, the quality of the separated sources is measured by means of the SIR defined in expression (2.15) and the WDO factor defined in expression (2.17). The performance achieved by the proposed algorithm (labeled as WG-MS) is compared with that of an implementation of the original DUET algorithm described in [Rickard, 2007] (labeled as DUET) and of a modification thereof that replaces the clustering step with the k-means technique [MacQueen, 1967] (labeled as DUET-KM). The time-frequency decomposition is performed by an STFT with a 256-point DFT, a Hamming window and 50% overlap. All sound signals are sampled at 16 kHz, normalized, and mixed with the same power.
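The analysis settings stated above (16 kHz sampling, 256-point DFT, Hamming window, 50% overlap) can be illustrated with a minimal NumPy sketch. The function name `stft`, the one-sided spectrum and the absence of padding at the signal edges are assumptions made for illustration; the implementation used in the experiments is not described at this level of detail.

```python
import numpy as np

FS = 16_000        # sampling rate stated in the text (Hz)
N_FFT = 256        # DFT length stated in the text
HOP = N_FFT // 2   # 50% overlap between consecutive frames


def stft(x, n_fft=N_FFT, hop=HOP):
    """One-sided STFT with a Hamming window; returns an array (frames, bins).

    Hypothetical helper: no edge padding, trailing samples that do not
    fill a whole frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < n_fft:
        raise ValueError("signal is shorter than one analysis frame")
    window = np.hamming(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


# One second of a 1250 Hz tone: 62.5 Hz per bin, so the peak falls in bin 20.
t = np.arange(FS) / FS
X = stft(np.sin(2 * np.pi * 1250.0 * t))
print(X.shape)  # (124, 129)
```

With these settings each frame spans 16 ms and the bins are spaced 62.5 Hz apart.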

In order to generalize the results, the algorithm is evaluated with several types of mixtures:

- Linear anechoic mixtures of 2, 3 and 4 speech sources. The time and level differences introduced in the mixtures for the three cases are summarized in table 3.1.

- Binaural anechoic mixtures of 2, 3 and 4 speech sources. The binaural signals are generated in the time domain, filtering the original signals with the HRTFs from the CIPIC database. The DOA of each source, which is described by the azimuth and elevation angles, is randomly selected among the available directions in the database for each mixture.

- Linear mixtures of 2 sources, mixing speech with noise and speech with music. The position of the sources is the one shown in table 3.1 for N = 2.

- Echoic mixtures of 2, 3 and 4 speech sources, generated with the RIRG, varying the reflection coefficient from 0 to 0.5. The microphones are placed in the center of the room and the sources are randomly located around the microphones.
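As an illustration of the linear anechoic case, the following sketch builds a two-microphone mixture from the LD and TD values of table 3.1. It assumes the usual DUET-style model, in which the first microphone receives the plain sum of the sources and the second receives each source scaled by its LD and delayed by its TD, and it reads the TD values as integer numbers of samples. Both readings are assumptions, and the names `shift` and `mix_linear_anechoic` are hypothetical.

```python
import numpy as np


def shift(x, d):
    """Delay x by d samples (negative d advances), zero-filling the gap."""
    y = np.zeros_like(x)
    if d > 0:
        y[d:] = x[:-d]
    elif d < 0:
        y[:d] = x[-d:]
    else:
        y[:] = x
    return y


def mix_linear_anechoic(sources, level_diffs, time_diffs):
    """Return (x1, x2) for equal-length, non-silent source signals.

    Sources are first normalized to unit power so that they are mixed
    with the same power, as stated in the text.
    """
    sources = [np.asarray(s, dtype=float) for s in sources]
    sources = [s / np.sqrt(np.mean(s ** 2)) for s in sources]
    x1 = np.sum(sources, axis=0)
    x2 = np.sum(
        [a * shift(s, d) for s, a, d in zip(sources, level_diffs, time_diffs)],
        axis=0,
    )
    return x1, x2


# N = 2 row of table 3.1, applied to two placeholder noise signals.
rng = np.random.default_rng(0)
s1, s2 = rng.standard_normal(16_000), rng.standard_normal(16_000)
x1, x2 = mix_linear_anechoic([s1, s2], [1.1, 0.9], [-2, 2])
```

The placeholder noise signals stand in for the TIMIT utterances only to keep the sketch self-contained.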

All the speech signals have been randomly selected from the TIMIT database. The noise and music signals have been randomly selected from a database that contains a wide variety of different types of noise and both vocal and instrumental music signals.

Table 3.1: Level differences (LD) and time differences (TD) between microphones introduced in linear mixtures of 2, 3 and 4 sources.

| Sources | LD | TD |
|---|---|---|
| N = 2 | [1.1, 0.9] | [-2, 2] |
| N = 3 | [1.1, 1, 0.9] | [-2, 0, 2] |
| N = 4 | [1.1, 0.9, 1, 1.1] | [-2, -1, 0, 2] |

Table 3.2 contains a comparison in terms of SIR, and table 3.3 in terms of WDO, between the three separation methods for linear and binaural mixtures of 2, 3 and 4 speech sources. The SIR and WDO values have been averaged over 400 mixtures in the case of 2 sources, 200 mixtures in the case of 3 sources, and 100 mixtures in the case of 4 sources. Additionally, both tables show the SIR and WDO averaged over all the sources of each case. Considering linear mixtures, the proposed WG-MS method obtains, on average, a slightly lower SIR than DUET but a slightly higher SIR than DUET-KM. Nevertheless, the WDO value is increased by 1.91% on average when compared to the DUET algorithm, and by 3.78% on average when compared to the DUET-KM algorithm. The DUET and WG-MS results are very similar in the case of 2 sources, where the WDO values are very high, which means that both algorithms separate 2 sources very well. In the case of 3 and 4 sources, the WG-MS algorithm obtains a higher WDO than the other two, with an average increase of 3.63% and 1.98% respectively when compared to DUET, and of 2.1% and 6.66% respectively when compared to DUET-KM.

Concerning binaural mixtures, the SIR and WDO values obtained are, in general, notably lower than in the linear case, due to the higher complexity entailed by the use of binaural mixtures. The proposed WG-MS algorithm increases the SIR by 20.9% on average when compared to the DUET algorithm, and by 7.29% on average when compared to the DUET-KM algorithm. In addition, WG-MS obtains an average increase of the WDO of 6.16%, 12.74% and 25.43% for the 2, 3 and 4 sources cases respectively when compared to the DUET algorithm, and of 3.25%, 10.95% and 16.93% respectively when compared to the DUET-KM algorithm.
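The WDO percentages quoted above are relative gains over the baseline, computed per source count from the "Average" rows of table 3.3 and then, where a single figure is given, averaged over the three source counts. The short sketch below reproduces that arithmetic; the dictionary layout and the function names are illustrative only.

```python
# "Average" rows of table 3.3, keyed by (mixture type, number of sources).
WDO_AVG = {
    ("linear", 2): {"DUET": 0.917, "DUET-KM": 0.895, "WG-MS": 0.918},
    ("linear", 3): {"DUET": 0.798, "DUET-KM": 0.810, "WG-MS": 0.827},
    ("linear", 4): {"DUET": 0.707, "DUET-KM": 0.676, "WG-MS": 0.721},
    ("binaural", 2): {"DUET": 0.779, "DUET-KM": 0.801, "WG-MS": 0.827},
    ("binaural", 3): {"DUET": 0.620, "DUET-KM": 0.630, "WG-MS": 0.699},
    ("binaural", 4): {"DUET": 0.523, "DUET-KM": 0.561, "WG-MS": 0.656},
}


def relative_gain(new, ref):
    """Relative increase of `new` over `ref`, in percent."""
    return 100.0 * (new - ref) / ref


def mean_gain(kind, baseline):
    """WG-MS gain over `baseline`, averaged over the 2, 3 and 4 source cases."""
    gains = [
        relative_gain(row["WG-MS"], row[baseline])
        for (mixture, _), row in WDO_AVG.items()
        if mixture == kind
    ]
    return sum(gains) / len(gains)


print(round(mean_gain("linear", "DUET"), 2))     # 1.91
print(round(mean_gain("linear", "DUET-KM"), 2))  # 3.78
for n in (2, 3, 4):
    row = WDO_AVG[("binaural", n)]
    print(n, round(relative_gain(row["WG-MS"], row["DUET"]), 2))
# 2 6.16
# 3 12.74
# 4 25.43
```

Note that the single averaged figures are means of the per-case percentages, not the gain of the pooled averages.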

Moreover, table 3.4 and table 3.5 contain the SIR and WDO values, respectively, obtained by the three methods in the case of mixtures of speech and noise and of speech and music. In both cases the SIR and WDO values have been averaged over 500 mixtures. Furthermore, the tables show the average values of the two sources. In the case of mixing speech with noise, the proposed WG-MS method improves the SIR by 11.16% on average when compared to DUET, and by 0.37% when compared to DUET-KM, and improves the WDO by 7.81% on average when compared to DUET, and by 7.17% when compared to DUET-KM. In the case of speech and music mixtures, the WG-MS method improves the SIR by 14.21% and 5.45%, on average, when compared to DUET and DUET-KM respectively, and the WDO by 7.43% and 2.97%, on average, respectively.

Table 3.2: Averaged SIR (dB) values obtained in the separation of linear and binaural mixtures of 2, 3 and 4 speech sources with the DUET, DUET-KM and WG-MS algorithms.

| Sources | Source | Linear DUET | Linear DUET-KM | Linear WG-MS | Binaural DUET | Binaural DUET-KM | Binaural WG-MS |
|---|---|---|---|---|---|---|---|
| N = 2 | S1 | 14.77 | 12.49 | 14.12 | 8.87 | 9.03 | 10.32 |
| N = 2 | S2 | 12.59 | 12.18 | 12.68 | 9.10 | 9.19 | 9.61 |
| N = 2 | Average | 13.68 | 12.34 | 13.40 | 8.98 | 9.11 | 9.97 |
| N = 3 | S1 | 10.17 | 8.20 | 9.15 | 5.26 | 5.39 | 5.04 |
| N = 3 | S2 | 10.03 | 9.08 | 10.36 | 4.49 | 4.64 | 4.93 |
| N = 3 | S3 | 8.52 | 8.47 | 8.70 | 5.79 | 5.23 | 6.04 |
| N = 3 | Average | 9.57 | 8.58 | 9.40 | 5.18 | 5.09 | 5.34 |
| N = 4 | S1 | 7.50 | 5.78 | 6.43 | 3.18 | 6.03 | 6.21 |
| N = 4 | S2 | 7.05 | 6.96 | 7.32 | 4.15 | 6.02 | 5.94 |
| N = 4 | S3 | 6.53 | 7.69 | 6.62 | 3.37 | 4.23 | 4.95 |
| N = 4 | S4 | 6.15 | 6.50 | 5.06 | 3.94 | 3.97 | 4.65 |
| N = 4 | Average | 6.81 | 6.73 | 6.36 | 3.66 | 5.06 | 5.44 |

Table 3.3: Averaged WDO values obtained in the separation of linear and binaural mixtures of 2, 3 and 4 speech sources with the DUET, DUET-KM and WG-MS algorithms.

| Sources | Source | Linear DUET | Linear DUET-KM | Linear WG-MS | Binaural DUET | Binaural DUET-KM | Binaural WG-MS |
|---|---|---|---|---|---|---|---|
| N = 2 | S1 | 0.939 | 0.896 | 0.933 | 0.776 | 0.803 | 0.824 |
| N = 2 | S2 | 0.895 | 0.894 | 0.903 | 0.782 | 0.799 | 0.830 |
| N = 2 | Average | 0.917 | 0.895 | 0.918 | 0.779 | 0.801 | 0.827 |
| N = 3 | S1 | 0.837 | 0.797 | 0.840 | 0.612 | 0.609 | 0.649 |
| N = 3 | S2 | 0.820 | 0.833 | 0.845 | 0.611 | 0.626 | 0.697 |
| N = 3 | S3 | 0.737 | 0.801 | 0.795 | 0.636 | 0.656 | 0.751 |
| N = 3 | Average | 0.798 | 0.810 | 0.827 | 0.620 | 0.630 | 0.699 |
| N = 4 | S1 | 0.740 | 0.671 | 0.746 | 0.509 | 0.583 | 0.672 |
| N = 4 | S2 | 0.724 | 0.671 | 0.738 | 0.536 | 0.583 | 0.679 |
| N = 4 | S3 | 0.689 | 0.702 | 0.705 | 0.535 | 0.529 | 0.633 |
| N = 4 | S4 | 0.675 | 0.658 | 0.693 | 0.514 | 0.550 | 0.640 |
| N = 4 | Average | 0.707 | 0.676 | 0.721 | 0.523 | 0.561 | 0.656 |

Finally, table 3.6 shows the SIR and the WDO values obtained in the separation of echoic mixtures of 2, 3 and 4 speech sources with the DUET, DUET-KM and WG-MS algorithms, for reflection coefficient values of 0, 0.1, 0.3 and 0.5. The values shown are the average over the N sources of the mixture. The WG-MS algorithm increases the SIR obtained by the DUET algorithm by 40%, 24.7% and 329% on average, for the 2, 3 and 4 sources cases respectively. Compared with the DUET-KM algorithm, WG-MS increases the SIR by 39%, 22.7% and 93% on average, for 2, 3 and 4 sources respectively. The SIR increments are so large in the 4 sources case because of the extremely low SIR values obtained there by the DUET and DUET-KM algorithms. Concerning the WDO values, the WG-MS algorithm obtains an average increase of 11.7%, 10.2% and 19.7% for the 2, 3 and 4 sources cases respectively, in comparison to the DUET algorithm, and an average increase of 2.3%, 6.7% and 14.8% for the 2, 3 and 4 sources cases respectively, in comparison to the DUET-KM algorithm.
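As a reminder of what the two reported figures measure, the sketch below computes the SIR and the WDO of a time-frequency mask for one source. Expressions (2.15) and (2.17) are not reproduced in this section, so the code assumes the definitions commonly used in the DUET literature: the SIR is the ratio of target energy to interference energy after masking, and the WDO is the target energy kept minus the interference energy kept, divided by the total target energy. The function names and the toy spectra are hypothetical, and the ideal binary mask is used only for the toy example; in the experiments the mask comes from the clustering step of each algorithm.

```python
import numpy as np


def ideal_binary_mask(S_target, S_interf):
    """1 where the target dominates the interference, 0 elsewhere."""
    return (np.abs(S_target) > np.abs(S_interf)).astype(float)


def mask_metrics(S_target, S_interf, mask):
    """Return (SIR in dB, WDO) of `mask` for one source.

    S_target: time-frequency representation of the source of interest.
    S_interf: time-frequency representation of the sum of the other sources.
    Assumes the mask keeps some interference energy, otherwise the SIR
    is infinite.
    """
    kept_target = np.sum(np.abs(mask * S_target) ** 2)
    kept_interf = np.sum(np.abs(mask * S_interf) ** 2)
    total_target = np.sum(np.abs(S_target) ** 2)
    sir_db = 10.0 * np.log10(kept_target / kept_interf)
    wdo = (kept_target - kept_interf) / total_target
    return sir_db, wdo


# Toy 2x2 "spectrograms": the target lives in column 0, the interferer
# mostly in column 1, with a small overlap in the bottom-left point.
S_t = np.array([[2.0, 0.0], [1.0, 0.0]])
S_i = np.array([[0.0, 1.0], [0.5, 3.0]])
sir_db, wdo = mask_metrics(S_t, S_i, ideal_binary_mask(S_t, S_i))
print(round(sir_db, 2), round(wdo, 2))  # 13.01 0.95
```

Under these definitions a WDO of 1 means the mask keeps all of the target energy and none of the interference, which is why values above 0.9 in the 2 sources case indicate near-perfect separation.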

In document Speech enhancement algorithms for audiological applications (Page 90-93)