Reducing Binary Masking Artefacts in Blind Audio Source Separation


Convention Paper

Presented at the 134th Convention

2013 May 4–7 Rome, Italy

This Convention paper was selected based on a submitted abstract and 750-word precis that have been peer reviewed by at least two qualified anonymous reviewers. The complete manuscript was not peer reviewed. This convention paper has been reproduced from the author’s advance manuscript without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.


Toby Stokes¹, Chris Hummersone¹, and Tim Brookes¹

¹ Institute of Sound Recording, The University of Surrey, Guildford, Surrey, GU2 7XH, UK

Correspondence should be addressed to Toby Stokes ([email protected])

ABSTRACT

Binary masking is a common technique for separating target audio from an interferer. Its use is often justified by the high signal-to-noise ratio achieved. The mask can introduce musical noise artefacts, limiting its perceptual performance and that of techniques that use it. Three mask-processing techniques, involving adding noise or cepstral smoothing, are tested and the processed masks are compared to the ideal binary mask using the perceptual evaluation for audio source separation (PEASS) toolkit. Each processing technique’s parameters are optimised before the comparison is made. Each technique is found to improve the overall perceptual score of the separation. Results show a trade-off between interferer suppression and artefact reduction.

1. INTRODUCTION

Separation of a mixture of audio sources is a continuing goal in audio research. Potential applications of this work include: cleaning up speech recorded in noisy conditions; removing audio material for which rights cannot be obtained from a programme; and rebalancing a mixture. This paper looks specifically at time-frequency (TF) masking and aims to find ways of improving the quality of audio separated using this process. Specifically, the aim is to determine whether post-processing can improve the overall perceptual score of the ideal binary mask (IBM). A secondary aim is to quantify the improvement in terms of artefacts and any loss of interferer suppression.

This paper compares three experimental masks with the widely used IBM. The experiment is conducted using synthetic mixtures, and the known targets and interferers are used in the calculation of the TF masks. Each technique is optimised before the comparison is made. Results are collated using the PEASS toolkit.


The remainder of this paper is structured as follows: Section 2 gives an overview of TF masking; Section 3 describes the experimental masks which will be used in this paper; Section 4 details the experimental method; Section 5 lists results; Section 6 discusses the work conducted and its implications; and, Section 7 provides a summary and concludes the paper.

2. TIME-FREQUENCY MASKING

In the TF domain, a mixture, $Z$, containing a target, $X$, and an interferer, $Y$, can be expressed as

$$Z_{ij} = X_{ij} + Y_{ij} \tag{1}$$

where $i$ and $j$ are the time and frequency indices. The TF transform can be obtained in a number of ways, most commonly either the gammatone filterbank or the short-time Fourier transform [1]. The separation problem requires recovering $X$, without $Y$, from knowledge of only $Z$. TF masking seeks to achieve this by apportioning the energy in each cell in $Z$ between $X$ and $Y$ [2].

For a TF representation, $Z$, the goal is to calculate a mask, $M^{TF}$, where each element is a weighting according to the proportion of the corresponding unit in $Z$ that should be retained. The TF representation of the target signal, $X$, can then be recovered using

$$X_{ij} = M^{TF}_{ij} Z_{ij} \tag{2}$$

2.1. The Ideal Binary Mask

The IBM was proposed as the computational goal of computational auditory scene analysis (CASA) in [3]. The binary mask is also used with independent component analysis [4] and non-negative matrix factorisation [5] approaches to audio separation. The IBM of a speech signal in noise is shown in Figure 1. The binary mask weights each TF cell as either one or zero, depending on whether it is primarily target signal energy or primarily interferer signal energy. The IBM is defined as

$$M^{IBM}_{ij} = \begin{cases} 1 & \text{if } X_{ij} > Y_{ij} \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

Fig. 1: The IBM of a speech signal in noise. White cells are set to one and black cells to zero. Axes: Time (s), 0 to 2; Centre Frequency (Hz), 50 to 12000.
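As a minimal sketch of (1) to (3), the following pure-Python example builds an IBM from a toy target and interferer and applies it to their mixture. The nested-list layout (rows as time frames, columns as frequency channels), the function names and the toy values are assumptions made for illustration; they are not taken from the experiment.

```python
def ideal_binary_mask(target, interferer):
    """Eq. (3): 1 where the target exceeds the interferer, otherwise 0."""
    return [[1.0 if x > y else 0.0 for x, y in zip(x_row, y_row)]
            for x_row, y_row in zip(target, interferer)]


def apply_mask(mask, mixture):
    """Eq. (2): weight each cell of the mixture by the matching mask cell."""
    return [[m * z for m, z in zip(m_row, z_row)]
            for m_row, z_row in zip(mask, mixture)]


# Toy 2x3 TF representations (hypothetical values, not experimental data).
X = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]  # target
Y = [[0.3, 0.6, 0.5], [0.7, 0.1, 0.2]]  # interferer
Z = [[x + y for x, y in zip(x_row, y_row)]  # Eq. (1)
     for x_row, y_row in zip(X, Y)]

ibm = ideal_binary_mask(X, Y)
estimate = apply_mask(ibm, Z)
print(ibm)  # [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
```

Note that a retained cell passes the whole mixture value, interferer energy included, and that a tie ($X_{ij} = Y_{ij}$) is assigned zero under the strict inequality in (3).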

In many cases the binary mask provides the optimal signal-to-noise ratio (SNR) [1], which is often cited as a reason to support its use for audio source separation. However, the SNR does not provide a complete picture of the quality of a separation [6] and hence is not used in the experiment detailed in this work.

Binary masking has been noted to introduce artefacts into separated audio [7]. As a result of these artefacts, the perceptual quality of audio separated by binary masks is seen to be lower than that of audio separated by other methods [8].

3. EXPERIMENTAL MASKS

A series of experimental masks has been created with the aim of improving on the artefact performance of the IBM. These are: the dithered binary mask, the noisy binary mask and the cepstrally-smoothed binary mask. This section describes each mask in turn, detailing its calculation, its relationship to the IBM and how it may improve artefact performance.

3.1. The Dithered Binary Mask

The dithered binary mask (DBM) is calculated by adding noise to the target signal and then comparing the noisy target signal with the interferer, as with the IBM. This gives the DBM as

$$M^{DBM}_{ij} = \begin{cases} 1 & \text{if } X_{ij} + \Delta > Y_{ij} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where ∆ is the noise term. The noise used in this experiment was triangularly distributed. Applying this process will result in a binary mask but some of the elements will be inverted due to the addition of noise. This will primarily affect cells with a smaller ratio of target to interferer. This may reduce artefacts because simultaneous switching across cells, as would be expected at onsets, becomes less frequent.
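The DBM in (4) can be sketched as follows. The paper states only that the noise is triangularly distributed; this sketch assumes a symmetric, zero-centred distribution whose total width is the "noise range", drawn independently for every cell. The function names and toy values are hypothetical.

```python
import random


def triangular_noise(width, rng):
    """Zero-centred triangular noise on [-width/2, +width/2] (assumed shape)."""
    half = width / 2.0
    return rng.triangular(-half, half, 0.0)


def dithered_binary_mask(target, interferer, width, rng):
    """Eq. (4): compare the noise-perturbed target with the interferer."""
    return [[1.0 if x + triangular_noise(width, rng) > y else 0.0
             for x, y in zip(x_row, y_row)]
            for x_row, y_row in zip(target, interferer)]


rng = random.Random(0)  # seeded so that a run is repeatable
X = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]  # toy target (hypothetical values)
Y = [[0.3, 0.6, 0.5], [0.7, 0.1, 0.2]]  # toy interferer
dbm = dithered_binary_mask(X, Y, width=0.6, rng=rng)
```

With a width of 0.6 the noise never exceeds 0.3 in magnitude, so only cells where target and interferer differ by less than 0.3 can be inverted; in the toy data these are the third cell of each frame.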

3.2. The Noisy Binary Mask

The noisy binary mask (NBM) is similar to the DBM in that it involves adding noise. The difference lies in the noise being added after the calculation of the IBM has been made. The NBM is formulated as

$$M^{NBM}_{ij} = \begin{cases} 1 + \Delta & \text{if } X_{ij} > Y_{ij} \\ 0 + \Delta & \text{otherwise} \end{cases} \tag{5}$$

with ∆ again representing triangularly distributed noise. Changing the point at which the noise is added means the NBM can take a continuous range of values, whereas the DBM is still a binary mask. The continuous mask will reduce the severity of the switching at each transition. This might be expected to lessen the audibility of the artefacts.
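A corresponding sketch of the NBM in (5) is given below. As with the DBM sketch, the symmetric zero-centred triangular noise and its interpretation as a total width are assumptions, as is the absence of any clipping of the result to the range 0 to 1 (the paper does not mention clipping).

```python
import random


def noisy_binary_mask(target, interferer, width, rng):
    """Eq. (5): add triangular noise to every element of the binary mask."""
    half = width / 2.0
    return [[(1.0 if x > y else 0.0) + rng.triangular(-half, half, 0.0)
             for x, y in zip(x_row, y_row)]
            for x_row, y_row in zip(target, interferer)]


rng = random.Random(0)  # seeded so that a run is repeatable
X = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]  # toy target (hypothetical values)
Y = [[0.3, 0.6, 0.5], [0.7, 0.1, 0.2]]  # toy interferer
nbm = noisy_binary_mask(X, Y, width=0.5, rng=rng)
```

Unlike the DBM, no cell is inverted; each weight is displaced from 0 or 1 by at most half the noise width, so weights slightly below 0 or above 1 are possible in this sketch.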

3.3. The Cepstrally-Smoothed Binary Mask

Transforming the mask for smoothing in the cepstral domain is an idea originating from Madhu et al. [9]. The binary mask is transformed into the cepstral domain and then smoothed in three regions. The quefrency bins are indexed by $l$ up to a maximum of $L$. Each quefrency is assigned a smoothing parameter $\gamma_l$, which takes its value according to

$$\gamma_l = \begin{cases} \gamma_{env} & \text{if } l \in \{0, \ldots, l_{env}\} \\ \gamma_{pitch} & \text{if } l = l_{pitch} \\ \gamma_{peak} & \text{if } l \in \{(l_{env}+1), \ldots, L\} \setminus \{l_{pitch}\} \end{cases} \tag{6}$$

This has the effect of allowing the cepstrum to be smoothed in three separate regions: envelope, pitch and peak. The $\setminus$ is used to omit $l_{pitch}$ from the final grouping. The bins with the lowest $l$ values represent the spectral envelope of the signal. To avoid distortion of the spectral envelope, little or no smoothing is applied to these bins. Bins above the spectral envelope are referred to as peak values and, as the most likely to contain artefacts, these are smoothed the most. Quefrency bins containing pitch information are given their own smoothing value. This is generally higher than the envelope smoothing but lower than the peak smoothing.
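The assignment in (6) can be sketched as a simple lookup over the quefrency index. The function name and the example values of $L$, $l_{env}$, $l_{pitch}$ and the three smoothing constants below are hypothetical and chosen only to make the three regions visible.

```python
def smoothing_parameters(L, l_env, l_pitch, g_env, g_pitch, g_peak):
    """Eq. (6): one smoothing constant per quefrency bin l = 0, ..., L."""
    gammas = []
    for l in range(L + 1):
        if l <= l_env:        # envelope region
            gammas.append(g_env)
        elif l == l_pitch:    # single pitch bin
            gammas.append(g_pitch)
        else:                 # remaining (peak) bins
            gammas.append(g_peak)
    return gammas


# Hypothetical example: 8 bins, envelope in bins 0-1, pitch at bin 4.
print(smoothing_parameters(7, 1, 4, 0.0, 0.2, 0.5))
# [0.0, 0.0, 0.5, 0.5, 0.2, 0.5, 0.5, 0.5]
```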

The smoothed mask is calculated using

$$M^{s}_{i,l} = \gamma_l M^{s}_{i-1,l} + (1 - \gamma_l) M^{c}_{i,l} \tag{7}$$

where $M^{s}$ is the smoothed mask and $M^{c}$ is the cepstral mask defined as

$$M^{c} = \mathrm{DFT}^{-1}\{\ln(M^{IBM})\} \tag{8}$$

$\mathrm{DFT}^{-1}\{\cdot\}$ represents the inverse discrete Fourier transform. In order to allow this calculation to compute without taking the natural logarithm of zero, the zeros in the binary mask must be replaced with near-zero values. This experiment uses 0.1, the value also used in [9]. Once this substitution has been made the transform can be performed and $M^{s}$ calculated as in (7).

To recover the processed TF mask, $M^{CBM}$, the cepstral transform in (8) is reversed

$$M^{CBM} = \exp(\mathrm{DFT}\{M^{s}\}) \tag{9}$$
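The chain from (7) to (9) can be sketched in pure Python as below. Several details are assumptions rather than statements from the paper: the transform is taken across the frequency channels of each time frame, the first frame is left unsmoothed, one smoothing constant is supplied per DFT bin, and a naive DFT is used for clarity rather than speed. If the constants are not mirror-symmetric across the bins, the result of (9) acquires an imaginary part, which this sketch discards.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Naive O(N^2) DFT; the inverse transform is scaled by 1/N."""
    n = len(seq)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m, v in enumerate(seq))
        out.append(acc / n if inverse else acc)
    return out


def cepstrally_smoothed_mask(ibm, gammas, floor=0.1):
    """Apply Eqs. (7)-(9) frame by frame to ibm[i][j] using gammas[l]."""
    previous = None  # smoothed cepstrum of the previous frame
    result = []
    for frame in ibm:
        # Replace zeros with the near-zero floor before taking the log.
        log_mask = [math.log(max(m, floor)) for m in frame]
        cepstrum = dft(log_mask, inverse=True)               # Eq. (8)
        if previous is None:
            smoothed = cepstrum  # assumption: no smoothing on frame 0
        else:
            smoothed = [g * p + (1.0 - g) * c                # Eq. (7)
                        for g, p, c in zip(gammas, previous, cepstrum)]
        previous = smoothed
        result.append([math.exp(v.real) for v in dft(smoothed)])  # Eq. (9)
    return result


ibm = [[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # toy mask
unsmoothed = cepstrally_smoothed_mask(ibm, [0.0] * 4)
smoothed = cepstrally_smoothed_mask(ibm, [0.5] * 4)
```

With every constant set to zero the output is the input mask with its zeros replaced by 0.1. With a uniform constant of 0.5, each cell of the second frame becomes the geometric mean of its floored values in the two frames, because the smoothing acts on the logarithm of the mask.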

4. METHOD

To compare the experimental masks detailed in the previous section, thirty-six synthetic mixtures were created for separation by the experimental masks and rated using the PEASS toolkit [6]. Multiple variants of each experimental mask were used in order to find their optimum parameters. After the optimum overall perceptual score for a given mask had been found, the masks were compared with each other.

4.1. The Audio Mixtures

To create the mixtures, six target speech signals were selected from a radio broadcast. Three of the segments were spoken by males and three by females. The six interferer signals used were two speech and two music signals from SQAM [10], and two noise signals from the CHiME database [11].

Fig. 2: The mean optimisation results for the DBM. Perceptual score (APS, IPS and OPS) is plotted against noise range (0 to 2).

Each signal was decimated to a sampling rate of 24 kHz and edited to 240,000 samples (10 s) in length. This allowed speech signals long enough to contain a whole phrase while also maintaining a workable computational load.

4.2. The Separation

The separation was performed in four stages: firstly, 128-channel cochleagrams were taken of the target, interferer and mixture [2]. This decomposition uses fourth-order gammatone filters, spaced on the equivalent rectangular bandwidth (ERB) scale between 50 Hz and 12 kHz, and a window size of 320 samples. Secondly, the experimental TF masks were calculated from the target and interferer cochleagrams. Thirdly, the mask was applied to the mixture as in (2). Finally, separated audio was re-synthesised from the masked TF representation.

4.3. The Metrics

The PEASS toolkit was chosen for the evaluation of the separated audio because it provides perceptually relevant objective metrics which discriminate between different sources of error in a separation. The metrics of interest in this experiment are: the artefact perceptual score (APS), the interferer perceptual score (IPS) and the overall perceptual score (OPS). The APS will indicate whether the experimental masks have been successful in reducing artefacts, the IPS will quantify any cost of this on interferer suppression and the OPS will provide a summary metric. The fourth metric, the target perceptual score, will not be reported as it does not directly relate to the aims of this work. These calculations are made by comparison of the pre-mixture target and interferer with the separated audio.

Fig. 3: The mean optimisation results for the NBM. Perceptual score (APS, IPS and OPS) is plotted against noise range (0 to 2).
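For reference, the ERB-scale spacing of the 128 filter centre frequencies described in Section 4.2 can be sketched as follows. The paper does not state which ERB-rate formula was used; this sketch assumes the commonly used Glasberg and Moore expression, and the function names are hypothetical. Under that assumption, channels 32, 64 and 96 land close to the 473 Hz, 1590 Hz and 4488 Hz tick labels of Fig. 1.

```python
import math


def hz_to_erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore form, assumed here)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)


def erb_rate_to_hz(erb):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


def erb_spaced_centre_frequencies(f_low, f_high, num_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    lo, hi = hz_to_erb_rate(f_low), hz_to_erb_rate(f_high)
    step = (hi - lo) / (num_channels - 1)
    return [erb_rate_to_hz(lo + k * step) for k in range(num_channels)]


cfs = erb_spaced_centre_frequencies(50.0, 12000.0, 128)
for channel in (1, 32, 64, 96, 128):
    print(channel, round(cfs[channel - 1], 1))
```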

5. RESULTS

This section will present results in two stages: firstly, the optimisations of each experimental mask are presented; and secondly, a comparison between experimental masks, in their optimised forms, is given.

5.1. Optimisation

Each of the experimental masks was optimised in terms of at least one parameter to ensure best performance was used in the comparison stage. The OPS was chosen as the metric to be optimised as this reflects changes in both APS and IPS.

5.1.1. The Dithered Binary Mask

The DBM was optimised in terms of the range of the noise being added to the target signal. The width of the triangular distribution was varied and the change in the target metrics observed. The optimum OPS value was found when the noise range was equal to 0.6. This value equates to 0.8 standard deviations of the TF target signals. Increasing the noise was also observed to reduce the IPS and increase the APS. Full results are shown in Figure 2.

5.1.2. The Noisy Binary Mask

The NBM was optimised in terms of the range of the noise being added to the mask. The width of the triangular distribution was varied and the change in the target metrics observed. The optimum OPS value was found when the noise range was equal to 0.5. As with the DBM, increasing the noise was also observed to reduce the IPS and increase the APS. Full results are shown in Figure 3.

5.1.3. The Cepstrally-Smoothed Binary Mask

The CBM was optimised for the smoothing parameters detailed in (6). This found the optimum $\gamma$ for each smoothing region. To reduce the computational load, the constraint $\gamma_{env} \le \gamma_{pitch} \le \gamma_{peak}$ was applied. The results of the optimisations for the APS, IPS and OPS are shown in Figures 4 to 6 respectively.
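The saving from the ordering constraint can be illustrated by enumerating the search grid. The step of 0.1 between 0 and 1 matches the $\gamma_{peak}$ slices shown in the figures, but applying the same step to $\gamma_{env}$ and $\gamma_{pitch}$ is an assumption of this sketch.

```python
from itertools import product


def constrained_grid(values):
    """All (g_env, g_pitch, g_peak) triples with g_env <= g_pitch <= g_peak."""
    return [(g_env, g_pitch, g_peak)
            for g_env, g_pitch, g_peak in product(values, repeat=3)
            if g_env <= g_pitch <= g_peak]


values = [k / 10 for k in range(11)]  # 0.0, 0.1, ..., 1.0 (assumed step)
grid = constrained_grid(values)
print(len(grid), "of", len(values) ** 3)  # 286 of 1331
```

Under these assumptions the constraint reduces the number of parameter combinations to be separated and scored to roughly a fifth of the full grid.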

For each region it is found that increasing the smoothing parameter has a deleterious effect on the OPS while the optimum lies at zero smoothing. While the APS is shown to be improved by smoothing, the effect on the IPS is so severe that the OPS is reduced by any amount of smoothing. The magnitude of this effect varied with the amount of the cepstrum which was represented by each region; the smoothing of the pitch caused least variation and the peak region caused the most.

5.2. Comparison

The techniques can now be compared with each other and the IBM using the optimal parameters discovered in Section 5.1. Figure 7 shows this comparison for all four techniques and each of the three metrics.

6. DISCUSSION

The three experimental masks have been compared to the IBM and each provides an improved OPS. The greatest improvement was demonstrated by the NBM and the CBM, which both raised the IBM’s mean OPS of 18 to a mean of 49. The DBM provided a lesser improvement of 18 points, to a mean of 36. This section discusses the results and how further improvements may be sought.

Metric               IBM  DBM  NBM  CBM
Artefacts (APS)       12   29   48   53
Interference (IPS)    76   62   66   61
Overall (OPS)         18   36   49   49

Fig. 7: Mean results for each experimental mask.
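The changes relative to the IBM can be tabulated directly from the mean scores in Fig. 7. The values below are read from the bar labels of that figure, assuming the labels follow the legend order (IBM, DBM, NBM, CBM) within each metric; the dictionary layout and function name are illustrative.

```python
# Mean perceptual scores read from Fig. 7.
scores = {
    "IBM": {"APS": 12, "IPS": 76, "OPS": 18},
    "DBM": {"APS": 29, "IPS": 62, "OPS": 36},
    "NBM": {"APS": 48, "IPS": 66, "OPS": 49},
    "CBM": {"APS": 53, "IPS": 61, "OPS": 49},
}


def change_relative_to_ibm(all_scores):
    """Score difference of each experimental mask with respect to the IBM."""
    base = all_scores["IBM"]
    return {name: {metric: vals[metric] - base[metric] for metric in base}
            for name, vals in all_scores.items() if name != "IBM"}


deltas = change_relative_to_ibm(scores)
for name, delta in deltas.items():
    print(name, delta)
# DBM {'APS': 17, 'IPS': -14, 'OPS': 18}
# NBM {'APS': 36, 'IPS': -10, 'OPS': 31}
# CBM {'APS': 41, 'IPS': -15, 'OPS': 31}
```

The NBM reaches the same OPS as the CBM with a smaller loss of IPS (10 points against 15), while the CBM gives the larger APS gain.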

The NBM and CBM, which both provided the greatest improvement observed in this experiment, returned masks with weaker switching than the binary mask. This suggests further investigation of the perceptual quality of non-binary masks is required. The DBM showed some improvement in APS, possibly due to the randomising of switchings leading to fewer channels switching simultaneously in the mask. The masks which provided greatest improvement appear to have softened these switchings giving a smoother onset. Improvement of the APS has not been demonstrated without a reduction in IPS. However, the improvement in APS does not appear to be directly related to the reduction in IPS. Clearly, an ideal solution would provide an improved APS with an IPS that is still comparable to that of the IBM. This is unlikely to be achieved by solely adjusting the TF mask.

Fig. 4: Results of the optimisation of the CBM on the APS. Each set of axes shows a slice through the $\gamma_{peak}$ variable and shows the mean APS at each value of $\gamma_{pitch}$ and $\gamma_{env}$ tested at that value. Panels cover $\gamma_{peak}$ = 0 to 1 in steps of 0.1; both axes run from 0 to 1.

The optimisation of the CBM found an unexpected result; the optimum smoothing parameters were zero. This leaves the processing as merely taking and reversing a cepstral transform; the only change to the mask is that cells which were 0 are now 0.1. This reduction in the severity of the switching of the mask has reduced artefacts. To verify this a TF mask formulated as

$$M^{TF}_{ij} = \begin{cases} 1 & \text{if } X_{ij} > Y_{ij} \\ 0.1 & \text{otherwise} \end{cases} \tag{10}$$

has been used on the experimental audio and returned the same results.
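The equivalence follows directly from (7) to (9) when every smoothing parameter is zero. In the short derivation below, $\tilde{M}^{IBM}$ is notation introduced here (not used elsewhere in the paper) for the IBM with its zeros replaced by 0.1.

```latex
\begin{align*}
\gamma_l = 0 \;\Rightarrow\; M^{s} &= M^{c} && \text{from (7)} \\
M^{CBM} &= \exp\left(\mathrm{DFT}\{M^{c}\}\right) && \text{from (9)} \\
        &= \exp\left(\mathrm{DFT}\{\mathrm{DFT}^{-1}\{\ln \tilde{M}^{IBM}\}\}\right) && \text{from (8)} \\
        &= \tilde{M}^{IBM}
\end{align*}
```

Each cell of $\tilde{M}^{IBM}$ is 1 where $X_{ij} > Y_{ij}$ and 0.1 otherwise, which is the mask in (10).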

7. SUMMARY AND CONCLUSIONS

TF masking is a method of separating sound from a mixture. Binary masking is a widely used variant but introduces artefacts to the separated audio, which reduces its perceptual quality. The work in this paper aimed to determine whether post-processing could improve the OPS of audio separated by the IBM and, if so, to quantify the improvement in terms of APS and any loss in IPS. Three mask post-processors were created. The NBM and DBM add noise to the mask creation process. The CBM takes a cepstral transform of the mask and smooths each quefrency channel. Optimally processed masks were used to separate 36 mixtures and PEASS results were taken for each separation. It was found that post-processing the IBM can lead to a mean OPS improvement of up to 31 points. APS can be improved by up to 41 points but the cost is a small IPS reduction of up to 15 points. Masks with non-binary switching can perform better than binary masks. Indeed, a simple replacement of all zeros in the IBM with near-zeros can offer significant performance improvements (the largest observed improvements in the study reported in this paper).

Fig. 5: Results of the optimisation of the CBM on the IPS. Each set of axes shows a slice through the $\gamma_{peak}$ variable and shows the mean IPS at each value of $\gamma_{pitch}$ and $\gamma_{env}$ tested at that value. Panels cover $\gamma_{peak}$ values between 0 and 1; both axes run from 0 to 1.

ACKNOWLEDGMENTS

This research is funded by the EPSRC and the BBC. Thank you to BBC R&D for help with acquisition and processing of the target signals.

8. REFERENCES

[1] Y. Li and D. Wang, “On the optimality of ideal binary time-frequency masks,” Speech Commun., vol. 51, pp. 230–239, Mar. 2009.

[2] D. Wang and G. Brown, Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Wiley-Interscience, 2006.

[3] D. Wang, “On Ideal Binary Mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, pp. 181–197, Kluwer, 2005.

[4] M. S. Pedersen, D. Wang, J. Larsen, and U. Kjems, “Overcomplete blind source separation by combining ICA and binary time-frequency masking,” in IEEE International Workshop on Machine Learning for Signal Processing (V. Calhoun, T. Adali, J. Larsen, D. Miller, and S. Douglas, eds.), pp. 15–20, Sep. 2005.

[5] E. Grais and H. Erdogan, “Single channel speech music separation using nonnegative matrix factorization and spectral masks,” in Digital Signal Processing (DSP), 2011 17th International Conference on, pp. 1–6, July 2011.

[6] V. Emiya, E. Vincent, N. Harlander, and V. Hohmann, “Subjective and objective quality assessment of audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2046–2057, Sep. 2011.

[7] S. Araki, S. Makino, H. Sawada, and R. Mukai, “Reducing musical noise by a fine-shift overlap-add method applied to source separation using a time-frequency mask,” in Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP ’05). IEEE International Conference on, vol. 3, pp. iii/81–iii/84, Mar. 2005.

[8] S. Araki, F. Nesta, E. Vincent, Z. Koldovsky, G. Nolte, A. Ziehe, and A. Benichoux, “The 2011 Signal Separation Evaluation Campaign (SiSEC2011): Audio source separation,” in 10th Int. Conf. on Latent Variable Analysis and Signal Separation (LVA/ICA), (Tel Aviv, Israel), pp. 414–422, Mar. 2012.

[9] N. Madhu, C. Breithaupt, and R. Martin, “Temporal smoothing of spectral masks in the cepstral domain for speech separation,” in ICASSP, pp. 45–48, 2008.

[10] European Broadcasting Union, “Tech 3253-E Sound Quality Assessment Material.”

[11] H. Christensen, J. Barker, N. Ma, and P. D. Green, “The CHiME corpus: a resource and a challenge for computational hearing in multisource environments,” in INTERSPEECH, pp. 1918–1921, 2010.

Fig. 6: Results of the optimisation of the CBM on the OPS. Each set of axes shows a slice through the $\gamma_{peak}$ variable and shows the mean OPS at each value of $\gamma_{pitch}$ and $\gamma_{env}$ tested at that value. Panels cover $\gamma_{peak}$ values between 0 and 1; both axes run from 0 to 1.