Under-determined reverberant audio source separation using Bayesian Non-negative Matrix Factorization


Speech Communication 81 (2016) 129–137

www.elsevier.com/locate/specom


Sayeh Mirzaei a,∗, Hugo Van Hamme a, Yaser Norouzi b

a Department of Electrical Engineering-ESAT, KU Leuven, Kasteelpark Arenberg 10, Bus 2441, B-3001 Leuven, Belgium
b Department of Electrical Engineering, Amirkabir University of Technology, 424 Hafez Ave, Tehran, Iran

Received 28 May 2015; received in revised form 11 January 2016; accepted 13 January 2016
Available online 3 February 2016

Abstract

In this paper, we address the task of audio source separation for a stereo reverberant mixture of audio signals. We use a full-rank model for the spatial covariance matrix. Bayesian Non-negative Matrix Factorization (NMF) frameworks are introduced for factorizing the time-frequency variance matrix of each source into basis components and time activations. We also propose to incorporate the temporal dependencies in the Bayesian model through (1) recursively updating the prior hyperparameters or (2) applying a prior with Markov chain structure to favor the smoothness of the solution, and we compare the performance of these two schemes. The EM algorithm is applied to derive the update relations of the unknown parameters. The separation performance improvement over the non-Bayesian standard NMF method as well as the conventional full-rank unconstrained method is investigated by calculating objective separation evaluation metrics.

Keywords: Blind Source Separation (BSS); Bayesian Non-negative Matrix Factorization (BNMF); Spatial covariance model; Temporal dependencies; Reverberant mixture.

1. Introduction

We often deal with a mixture of sounds coming from different acoustic sources. Separating these audio signals and extracting the individual source signals is required in many applications, including speaker diarization, meeting transcription systems, hearing aids, polyphonic music transcription, etc.

∗ Corresponding author. Tel.: +989126850714. E-mail address: [email protected] (S. Mirzaei).

When no prior information about the sources or the channel mixing system is available, the task is called Blind Source Separation (BSS). The multichannel mixture signal $x(t) \in \mathbb{R}^M$ can be expressed as

$$x(t) = \sum_{n=1}^{N} y_n(t) \qquad (1)$$

where $y_n(t)$, $n = 1 \ldots N$, is the $n$th source spatial image vector over $M$ channels. The mixing process consists of a linear time-invariant filtering of the source signals:

$$y_n(t) = \sum_{l=0}^{L-1} h_n(l)\, s_n(t-l) \qquad (2)$$

where $s_n(t)$ is the $n$th source signal, $h_n(l) \in \mathbb{R}^M$ is the mixing filter vector, which describes the acoustic path from source $n$ to the $M$ microphones, and $L$ is the filter length. Most of the proposed BSS methods are based on the assumption that the mixing process at each frequency bin can be approximated by a complex-valued multiplication:

$$Y_n(f,t) \approx H_n(f)\, S_n(f,t) \qquad (3)$$

where $Y_n(f,t)$ is the spatial image of source $n$ in the Short-Time Fourier Transform (STFT) domain, $S_n(f,t)$ denotes the source STFT and $H_n(f)$ is the Fourier transform of the mixing filter $h_n(l)$.

If $S_n(f,t)$ is a zero-mean variable with variance $v_n(f,t)$, the covariance of $Y_n(f,t)$ can be written as:

$$R_{Y_n}(f,t) = v_n(f,t)\, R_n(f) \qquad (4)$$
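As a time-domain illustration of Eqs. (1) and (2), the following minimal Python sketch builds the spatial images of three toy sources and sums them into a stereo mixture. The function names `convolve_image` and `mix`, and all filter and signal values, are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def convolve_image(h_n, s_n):
    """Spatial image of one source, Eq. (2): y_n(t) = sum_l h_n(l) * s_n(t - l).

    h_n: list of L filter taps, each a list of M channel gains.
    s_n: list of source samples. Returns one M-vector per time sample.
    """
    M = len(h_n[0])
    y = [[0.0] * M for _ in s_n]
    for t in range(len(s_n)):
        for l, tap in enumerate(h_n):
            if t - l >= 0:  # samples before t = 0 are taken as zero
                for m in range(M):
                    y[t][m] += tap[m] * s_n[t - l]
    return y


def mix(images):
    """Mixture, Eq. (1): x(t) = sum_n y_n(t)."""
    T, M = len(images[0]), len(images[0][0])
    return [[sum(img[t][m] for img in images) for m in range(M)]
            for t in range(T)]


# Toy under-determined case: N = 3 sources, M = 2 microphones, L = 2 taps.
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
filters = [
    [[1.0, 0.5], [0.25, 0.125]],
    [[0.5, 1.0], [0.0, 0.5]],
    [[1.0, 1.0], [0.5, 0.5]],
]
images = [convolve_image(h, s) for h, s in zip(filters, sources)]
x = mix(images)
```

With more sources than microphones, as here, the mixture cannot be inverted by a simple linear demixing matrix, which is why the statistical source models discussed in this paper are needed.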

http://dx.doi.org/10.1016/j.specom.2016.01.003


The assumption in (3) implies that the spatial covariance matrix of each source, $R_n(f)$, has rank 1. Assuming the rank-1 model, BSS can be achieved using time-frequency (TF) masking techniques (Yilmaz and Rickard, 2004), MAP estimation assuming sparse prior distributions (Winter et al., 2007), or modeling the source variances with Non-negative Matrix Factorization (NMF) (Févotte et al., 2009; Ozerov and Févotte, 2010). The rank-1 assumption is only valid when the filter length $L$ is sufficiently small with respect to the STFT window length. This is violated in most realistic scenarios where reverberation exists. A full-rank spatial covariance matrix model is proposed in Duong et al. (2009) to provide a better approximation in reverberant environments. The Maximum Likelihood (ML) solution is then found in an oracle context, where both the spatial covariance matrix $R_n(f)$ and the scalar variance of the sources $v_n(f,t)$ are known, and also in a semi-blind context, where the spatial covariance matrix is estimated from single-source training data. In Duong et al. (2010a), the EM algorithm was used for blindly estimating both of the above parameters. The source permutation problem, which arises when the unknown parameters are independently estimated at each frequency bin, has also been solved in Duong et al. (2010a).

In Arberet et al. (2010), the source variances $v_n(f,t)$ are modeled by NMF and the EM algorithm is used for blindly estimating the parameters, similar to what is done in Duong et al. (2010a). In Duong et al. (2010c), the use of a non-uniform TF representation on the auditory-motivated equivalent rectangular bandwidth (ERB) scale is investigated. It has been shown that this representation is beneficial for multichannel convolutive source separation provided that the full-rank covariance model is used. This has also been investigated in Burred and Sikora (2006) for instantaneous mixtures and in Vincent (2006) for convolutive mixtures.

In Duong et al. (2010b), four specific covariance models are considered: the rank-1 anechoic model, the rank-1 convolutive model, the full-rank direct+diffuse model and the full-rank unconstrained model. A hierarchical clustering-based method is used to initialize the parameters. Also, a Direction of Arrival (DoA) based approach is proposed to align the order of the estimated sources across all frequency bins.

In Duong et al. (2013), spatial location prior distributions consistent with the theory of statistical room acoustics are proposed for the spatial covariance matrices, and EM algorithms are derived for Maximum a Posteriori (MAP) estimation. In Nikunen and Virtanen (2014), a spatial covariance matrix model is proposed which consists of a weighted sum of Direction of Arrival kernels. This covariance model is combined with the Complex NMF (CNMF) framework proposed in Sawada et al. (2013), and the update relations for finding the unknown parameters are subsequently derived.

In Arberet et al. (2010), the $n$th source variance matrix $V_n$ ($F \times T$), consisting of the above variance elements $v_n(f,t)$, is approximated as a product of two non-negative matrices $W_n$ ($F \times K$) and $H_n$ ($K \times T$), which specify the basis components and the time activations respectively. It is assumed that the number of components $K$ required for modeling each source is known in advance. However, this may not be a suitable presumption when the goal is to blindly separate the individual source signals and there is no prior information about the source types. Here, we propose a Bayesian NMF framework to automatically infer the number of basis vectors for each source. In our first approach, we develop a Bayesian framework assuming that the elements of the time activation matrix $H_n$ are random variables with a Gamma prior distribution. An EM algorithm is developed for deriving the update equations. The update relations given in Arberet et al. (2010) are replaced with newly derived relations for the factors of the source variance matrices, which are obtained through MAP estimation. We have also modeled the temporal dependencies by imposing constraints on the prior distribution of the temporal activations. A procedure inspired by Mohammadiha et al. (2012) is used for updating the scale parameters of the prior distributions of the time activations. In the second approach, we favor the smoothness of the results by applying an inverse-Gamma chain prior distribution inspired by Févotte et al. (2009).

In Smaragdis et al. (2014), a comprehensive study of the NMF methods which model temporal statistics is presented. One flexible approach for taking the actual temporal dependencies into account is to impose constraints on the model activations (Essid and Févotte, 2013; Févotte, 2011; Févotte et al., 2009; Mohammadiha et al., 2013; 2012; Virtanen, 2007; Wilson et al., 2008). These approaches are called dynamic or smooth NMF. They differ in the penalty term used in non-probabilistic settings, or in the choice of the observation model and prior structure in the Bayesian frameworks. In Virtanen (2007), temporal continuity and sparseness constraints are applied to the activation coefficients. Temporal continuity is favored by using a cost term which is the sum of squared differences between the activations in adjacent frames, and sparseness is favored by penalizing nonzero activations. A Non-negative Dynamical System (NDS) is introduced in Févotte et al. (2013) for modeling speech spectra. It can be regarded as an extension of NMF to support Markovian dynamics. Non-negativity preserving Gamma or inverse-Gamma Markov chain priors are considered in Févotte (2011), Févotte et al. (2009) and Mohammadiha et al. (2013, 2012), and Markov random fields in Kim and Smaragdis (2013). In Nakano et al. (2010), the spectrogram of music signals is modeled as the combination of Markov-chained spectral patterns.

The approaches proposed in this paper can be regarded as Bayesian extensions of the method proposed in Arberet et al. (2010) that accentuate the smoothness of the estimates. The Gamma prior model has been chosen for its effectiveness in modeling sparse parameters. Moreover, Gamma and inverse-Gamma prior distributions are preferred because we model the non-negative elements of the activation matrix; other sparse prior distributions such as the Laplace distribution are therefore not suitable here. The novel aspects of our proposed approaches can be summarized as follows:

• Bayesian NMF frameworks are proposed to factorize the source variance matrix in the full-rank model, with the purpose of providing a more powerful model through applying suitable prior structures and avoiding over-fitted or under-fitted models.

• Temporal dependencies are taken into account via (1) updating the scale hyperparameters of the time activation Gamma prior distributions; (2) applying an inverse-Gamma Markov chain prior distribution.

The rest of the paper is structured as follows. Section 2 introduces our first proposed Bayesian approach, which imposes a temporal continuity constraint by updating the scale hyperparameters of the Gamma prior. Section 3 explains the second Bayesian approach, which uses an inverse-Gamma chain prior structure. Various experimental settings and the performance results are presented in Section 4. Finally, we conclude in Section 5.

2. Bayesian NMF with Gamma prior model configuration

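We adopt the following generative model for the $n$th source spatial image $Y_n(f,t)$ (Arberet et al., 2010):

$$Y_n(f,t) \sim \mathcal{N}_c\left(0, R_{Y_n}(f,t)\right) \qquad (5)$$

where $\mathcal{N}_c(\mu, \Sigma)$ is a proper complex Gaussian distribution:

$$\mathcal{N}_c(y; \mu, \Sigma) = \left|\pi\Sigma\right|^{-1} \exp\left(-(y-\mu)^H \Sigma^{-1} (y-\mu)\right) \qquad (6)$$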

The covariance matrix $R_{Y_n}(f,t)$ is given by (4), where the spatial covariance matrix $R_n(f)$ is assumed to follow the full-rank unconstrained covariance model (Duong et al., 2010b) and $v_n(f,t)$ is the time-varying source variance, which we approximate with the following factorized form:

$$v_n(f,t) = \sum_{k=1}^{K} w^{(n)}_{f,k}\, h^{(n)}_{k,t} \qquad (7)$$

where $w^{(n)}_{f,k}, h^{(n)}_{k,t} \in \mathbb{R}^{+}$. Therefore, we can express the matrix $V_n$ with entries $[V_n]_{f,t} = v_n(f,t)$ as a product of two non-negative matrices $W_n$ and $H_n$ with entries $[W_n]_{f,k} = w^{(n)}_{f,k}$ and $[H_n]_{k,t} = h^{(n)}_{k,t}$:

$$V_n = W_n H_n \qquad (8)$$

Consequently, the source spatial image $Y_n(f,t)$ can be written as the combination of $K$ components:

$$Y_n(f,t) = \sum_{k=1}^{K} Y_{n,k}(f,t) \qquad (9)$$

with the covariance matrix given by:

$$R_{Y_{n,k}}(f,t) = w^{(n)}_{f,k}\, h^{(n)}_{k,t}\, R_n(f) \qquad (10)$$

Here, we assume that the elements of the matrix $H_n$, which represent the time activations of the basis components in $W_n$, are random variables with Gamma prior distribution $\mathrm{Gamma}(h^{(n)}_{k,t} \mid a^{(n)}_{k,t}, b^{(n)}_{k,t})$:

$$\mathrm{Gamma}(h \mid a, b) = \frac{h^{a-1}}{b^{a}\,\Gamma(a)} \exp\left(-\frac{h}{b}\right) \qquad (11)$$

The elements of $W_n$ are assumed deterministic.

According to the above generative model, and assuming that the source signals are independent, the mixture signal STFT, $X(f,t)$, is a zero-mean complex Gaussian vector with the following covariance matrix:

$$R_X(f,t) = \sum_{n=1}^{N} R_{Y_n}(f,t) \qquad (12)$$

The aim is to estimate the spatial covariance matrices $R_n(f)$ and the source variances $v_n(f,t)$ under the above generative model by maximizing the likelihood function. We opt for the EM algorithm, similar to Arberet et al. (2010) and Duong et al. (2010b), with the difference that we maximize the posterior probability when updating the elements of $H_n$. Therefore, the E-step of the EM algorithm remains unchanged w.r.t. what is given in Arberet et al. (2010). The following M-step update relations are obtained:

$$w^{(n)}_{f,k} = \frac{1}{T} \sum_{t=1}^{T} \frac{\hat{v}_{n,k}(f,t)}{h^{(n)}_{k,t}} \qquad (13)$$

$$h^{(n)}_{k,t} = \frac{a^{(n)}_{k,t} - (F+1) + \sqrt{\left(a^{(n)}_{k,t} - (F+1)\right)^2 + 4 \sum_{f=1}^{F} \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}\, b^{(n)}_{k,t}}}}{2 / b^{(n)}_{k,t}} \qquad (14)$$

with $\hat{v}_{n,k}(f,t) = \frac{1}{M}\,\mathrm{tr}\left(R_n^{-1}(f)\,\hat{R}_{y_{n,k}}(f,t)\right)$. $F$ denotes the total number of frequency bins in the TF representation. $\hat{R}_{y_{n,k}}(f,t)$ is updated according to Eq. (11) of the E-step given in Arberet et al. (2010), which is restated in (A.2). $R_n(f)$ is updated as (Arberet et al., 2010):

$$R_n(f) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{v_n(f,t)}\, \hat{R}_{Y_n}(f,t) \qquad (15)$$

More explanation on the derivation of (14) can be found in Appendix A.1.
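The M-step updates (13) and (14) for a single basis component can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the toy values are hypothetical, and the quantities $\hat{v}_{n,k}(f,t)$ are assumed to have been computed already in the E-step. The helper `gradient_h` evaluates the numerator of the gradient in (A.4), which should vanish at the updated activation.

```python
import math


def update_w(v_hat_row, h_row):
    """Eq. (13): w_{f,k} = (1/T) * sum_t v_hat_{n,k}(f,t) / h_{k,t}."""
    T = len(h_row)
    return sum(v / h for v, h in zip(v_hat_row, h_row)) / T


def update_h(v_hat_col, w_col, a, b):
    """Eq. (14): positive root of h^2/b + (F - a + 1)*h - sum_f v_hat/w = 0."""
    F = len(w_col)
    s = sum(v / w for v, w in zip(v_hat_col, w_col))
    c = a - (F + 1)
    return (c + math.sqrt(c * c + 4.0 * s / b)) / (2.0 / b)


def gradient_h(h, v_hat_col, w_col, a, b):
    """Numerator of Eq. (A.4); zero at the MAP estimate of h_{k,t}."""
    F = len(w_col)
    s = sum(v / w for v, w in zip(v_hat_col, w_col))
    return s - (F - a + 1) * h - h * h / b


# Toy values for one component k and one frame t (F = 3 bins), all made up.
v_hat_col = [2.0, 1.0, 4.0]   # v_hat_{n,k}(f, t) over f
w_col = [1.0, 0.5, 2.0]       # w_{f,k} over f
h_new = update_h(v_hat_col, w_col, a=2.0, b=1.5)
w_new = update_w([2.0, 4.0], [1.0, 2.0])
```

Because the square root in (14) is never smaller than $|a - (F+1)|$ and the sum is positive, the selected root is always positive, so non-negativity of the activations is preserved without any explicit projection.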

After the convergence of the EM algorithm, the STFT of the source spatial image is estimated using the Wiener estimator:

$$\hat{Y}_n(f,t) = R_{Y_n}(f,t)\,\left(R_X(f,t)\right)^{-1} X(f,t) \qquad (16)$$

The time-domain signals are then obtained through the inverse STFT operation.
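For the stereo case ($M = 2$), the Wiener estimator (16) at a single time-frequency bin can be sketched with plain 2×2 complex matrix arithmetic. The helper names and all numerical values below are hypothetical and serve only as an illustration; the variances and spatial covariance matrices are assumed to be already estimated.

```python
def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]


def mat_scale(c, A):
    return [[c * A[i][j] for j in range(2)] for i in range(2)]


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_inv(A):
    """Closed-form inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]


def mat_vec(A, x):
    return [A[i][0] * x[0] + A[i][1] * x[1] for i in range(2)]


def wiener_images(X, v, R):
    """Eq. (16) at one TF bin for all sources."""
    R_Y = [mat_scale(v_n, R_n) for v_n, R_n in zip(v, R)]  # Eq. (4)
    R_X = R_Y[0]
    for R_Yn in R_Y[1:]:
        R_X = mat_add(R_X, R_Yn)                           # Eq. (12)
    R_X_inv = mat_inv(R_X)
    return [mat_vec(mat_mul(R_Yn, R_X_inv), X) for R_Yn in R_Y]


# Toy bin with N = 3 sources; R_n are Hermitian positive definite (made up).
X = [1 + 2j, -0.5 + 1j]
v = [2.0, 0.5, 1.0]
R = [
    [[1.0, 0.3 + 0.1j], [0.3 - 0.1j, 1.0]],
    [[2.0, -0.5j], [0.5j, 1.0]],
    [[1.0, 0.0], [0.0, 3.0]],
]
Y_hat = wiener_images(X, v, R)
recon = [sum(Y[m] for Y in Y_hat) for m in range(2)]
```

Since the gains $R_{Y_n} R_X^{-1}$ sum to the identity, the estimated images add up to the observed mixture, which is a convenient sanity check for an implementation.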

The proposed Bayesian framework has the advantage of avoiding overfitted or underfitted models. Moreover, inspired by Mohammadiha et al. (2012), we recursively update the prior hyperparameters over subsequent time frames, as described in Section 2.1, to impose the temporal continuity constraint. This continuity normally exists for many audio signals and for speech in particular.

2.1. Imposing the temporal continuity constraint

To make the model fit better to the data, we take the temporal dependencies into account: the scale hyperparameters $b^{(n)}_{k,t}$ of the prior are recursively updated at each time frame based on the following smoothing relation (Mohammadiha et al., 2012):

$$b^{(n)}_{k,t} = \lambda\,\frac{h^{(n)}_{k,t-1}}{a^{(n)}_{k,t}} + (1-\lambda)\, b^{(n)}_{k,t-1} \qquad (17)$$

where the parameter $\lambda$ controls the smoothing level. The shape hyperparameter $a^{(n)}_{k,t}$ is kept fixed and time-invariant in our model.
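The recursion (17) is a one-line update. The minimal sketch below (function name and numbers are hypothetical) also shows how it ties the prior to the previous frame: the Gamma density in (11) has mean $a \cdot b$, so for $\lambda = 1$ the prior mean of $h_{k,t}$ equals $h_{k,t-1}$, whereas for $\lambda = 0$ the scale is simply carried over and the previous activation has no influence.

```python
def update_scale(h_prev, b_prev, a, lam):
    """Eq. (17): b_{k,t} = lam * h_{k,t-1} / a + (1 - lam) * b_{k,t-1}."""
    return lam * h_prev / a + (1.0 - lam) * b_prev


a = 2.0                      # fixed shape hyperparameter
h_prev, b_prev = 3.0, 1.0    # toy previous activation and previous scale

b_half = update_scale(h_prev, b_prev, a, lam=0.5)  # partial smoothing
b_full = update_scale(h_prev, b_prev, a, lam=1.0)  # prior mean a*b = h_prev
b_none = update_scale(h_prev, b_prev, a, lam=0.0)  # no temporal coupling
```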

2.2. Parameter initialization

The EM algorithm is very sensitive to initialization. Thus, similar to Arberet et al. (2010), we use the perturbed oracle initialization, where the parameters $R_n(f)$ and $v_n(f,t)$ are estimated from the original source spatial images as in Duong et al. (2009) and then perturbed with high-level additive noise (SNR of 5 dB). The parameters $w^{(n)}_{f,k}$, $h^{(n)}_{k,t}$ are then initialized according to the IS-NMF Bayesian generative model with Gamma prior. The update relations are derived in Appendix A.2. We have chosen the Itakura-Saito (IS) divergence because it has been shown to perform more effectively for factorizing the power spectrogram (Févotte et al., 2009). Furthermore, the IS divergence is scale-invariant, which means that the same relative weight is given to small and large values of the source variance matrix coefficients $V_n$ ($F \times T$) in the cost function. This property is a benefit for the decomposition of audio spectra, in the sense that a bad fit of the factorization for a low-power coefficient costs as much as a bad fit for a higher-power coefficient.
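The scale invariance of the IS divergence can be checked numerically. The sketch below uses the standard definition $d_{IS}(x \mid y) = x/y - \log(x/y) - 1$; the squared Euclidean cost is included only as a contrast, and the function names and test values are illustrative.

```python
import math


def is_divergence(x, y):
    """Itakura-Saito divergence d_IS(x | y) = x/y - log(x/y) - 1, for x, y > 0."""
    r = x / y
    return r - math.log(r) - 1.0


def euclidean_cost(x, y):
    """Squared Euclidean cost, which is not scale-invariant."""
    return (x - y) ** 2


# The same relative misfit (factor 2) at a low-power and a high-power coefficient.
low = is_divergence(1e-4, 2e-4)
high = is_divergence(1e4, 2e4)
low_euc = euclidean_cost(1e-4, 2e-4)
high_euc = euclidean_cost(1e4, 2e4)
```

The two IS values coincide because the divergence depends only on the ratio $x/y$, while the Euclidean cost of the high-power misfit is larger by sixteen orders of magnitude and would dominate the fit.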

3. Bayesian NMF with inverse-Gamma chain prior model configuration

In the second Bayesian framework, we assume an inverse-Gamma prior distribution for the $h^{(n)}_{k,t}$ parameters as follows:

$$p\left(h^{(n)}_{k,t} \mid h^{(n)}_{k,t-1}\right) = \mathcal{IG}\left(h^{(n)}_{k,t} \mid \alpha, (\alpha+1)\,h^{(n)}_{k,t-1}\right), \qquad \mathcal{IG}(u \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, u^{-(\alpha+1)} \exp\left(-\frac{\beta}{u}\right), \quad u \geq 0 \qquad (18)$$

The prior is constructed so that its mode is obtained for $h^{(n)}_{k,t} = h^{(n)}_{k,t-1}$. $\alpha$ is a shape parameter that controls the sharpness of the prior around its mode. A high value of $\alpha$ will increase the sharpness and thus accentuate the smoothness of $h^{(n)}_{k}$, while a low value of $\alpha$ will render the prior more diffuse and thus less constraining (Févotte et al., 2009). In the following, $h^{(n)}_{k,1}$ is assigned the scale-invariant Jeffreys noninformative prior $p(h^{(n)}_{k,1}) \propto 1/h^{(n)}_{k,1}$. In this case, the update relations are the same as in the previous section except for the $h^{(n)}_{k,t}$ parameters, which are obtained by maximizing the posterior distribution. The update relation is obtained as:

$$h^{(n)}_{k,t} = \frac{-\sqrt{p_1^2 - 4 p_2 p_0} - p_1}{2 p_2} \qquad (19)$$

where the values of $p_0$, $p_1$ and $p_2$ are given in Table 1. The derivation of the above expression can be found in Appendix B.

Table 1
Coefficients of the order-2 polynomial to solve in order to update $h^{(n)}_{k,t}$ in Bayesian IS-NMF with an inverse-Gamma chain prior.

| | $p_2$ | $p_1$ | $p_0$ |
|---|---|---|---|
| $h^{(n)}_{k,1}$ | $-\frac{\alpha+1}{h^{(n)}_{k,2}}$ | $-(F-\alpha+1)$ | $\sum_f \frac{\hat{v}_{n,k}(f,1)}{w^{(n)}_{f,k}}$ |
| $h^{(n)}_{k,t}$, $1<t<T$ | $-\frac{\alpha+1}{h^{(n)}_{k,t+1}}$ | $-(1+F)$ | $\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha+1)\,h^{(n)}_{k,t-1}$ |
| $h^{(n)}_{k,T}$ | $0$ | $-(\alpha+1+F)$ | $\sum_f \frac{\hat{v}_{n,k}(f,T)}{w^{(n)}_{f,k}} + (\alpha+1)\,h^{(n)}_{k,T-1}$ |

Table 2
Coefficients of the order-2 polynomial to solve in order to initialize $h^{(n)}_{k,t}$ in Bayesian IS-NMF with an inverse-Gamma chain prior.

| | $p_2$ | $p_1$ | $p_0$ |
|---|---|---|---|
| $h^{(n)}_{k,1}$ | $\frac{\alpha+1}{h^{(n)}_{k,2}}$ | $F-\alpha+1$ | $-F\,\hat{h}^{(n)}_{k,1}$ |
| $h^{(n)}_{k,t}$, $1<t<T$ | $\frac{\alpha+1}{h^{(n)}_{k,t+1}}$ | $1+F$ | $-F\,\hat{h}^{(n)}_{k,t} - (\alpha+1)\,h^{(n)}_{k,t-1}$ |
| $h^{(n)}_{k,T}$ | $0$ | $\alpha+1+F$ | $-F\,\hat{h}^{(n)}_{k,T} - (\alpha+1)\,h^{(n)}_{k,T-1}$ |
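The activation update (19) with the coefficients of Table 1 can be sketched for one component as follows. The function names and toy values are hypothetical. One point is an assumption on our side rather than something stated in the text: for the last frame $p_2 = 0$, so the quadratic formula is undefined and the sketch falls back to the root of the remaining linear equation, $h = -p_0 / p_1$.

```python
import math


def chain_coefficients(t, T, h, s, alpha, F):
    """Coefficients (p2, p1, p0) of Table 1 for frame t (1-based).

    h: current activations h_{k,1..T} of one component (0-based list).
    s: sum_f v_hat_{n,k}(f,t) / w_{f,k} for this frame.
    """
    if t == 1:
        return -(alpha + 1) / h[1], -(F - alpha + 1), s
    if t < T:
        return -(alpha + 1) / h[t], -(1 + F), s + (alpha + 1) * h[t - 2]
    return 0.0, -(alpha + 1 + F), s + (alpha + 1) * h[T - 2]


def solve_activation(p2, p1, p0):
    """Eq. (19); linear fallback for p2 = 0 (assumption, last frame only)."""
    if p2 == 0.0:
        return -p0 / p1
    return (-math.sqrt(p1 * p1 - 4.0 * p2 * p0) - p1) / (2.0 * p2)


# Toy chain with T = 3 frames and F = 4 bins; all numbers are made up.
alpha, F, T = 5.0, 4, 3
h = [1.0, 2.0, 1.5]
s_per_frame = [3.0, 6.0, 2.0]

coeffs = [chain_coefficients(t, T, h, s_per_frame[t - 1], alpha, F)
          for t in range(1, T + 1)]
h_new = [solve_activation(*c) for c in coeffs]
residuals = [p2 * x * x + p1 * x + p0 for (p2, p1, p0), x in zip(coeffs, h_new)]
```

For $t < T$ the coefficient $p_2$ is negative and $p_0$ is positive, so the root selected in (19) is the positive one. In this sketch every frame is updated from the same previous activations for simplicity; an in-place sweep over $t$ is an equally plausible reading of the update.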

3.1. Parameter initialization

Again, we use the perturbed oracle to initialize the parameters $R_n(f)$ and $v_n(f,t)$. The parameters $w^{(n)}_{f,k}$ and $h^{(n)}_{k,t}$ are then initialized according to the IS-NMF generative model with inverse-Gamma chain prior. The update equations can again be obtained by finding the MAP estimates of the parameters. The $w_{f,k}$ update relation is the same as (A.8), and the $h_{k,t}$ coefficient is updated according to Equation 5.10 in Févotte et al. (2009):

$$h^{(n)}_{k,t} = \frac{\sqrt{p_1^2 - 4 p_2 p_0} - p_1}{2 p_2} \qquad (20)$$

where

$$v_{k,f,t,n} = \left(\frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}}\right)^{2} v^{(n)}_{f,t} + \frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}} \sum_{l \neq k} w^{(n)}_{f,l} h^{(n)}_{l,t}, \qquad \hat{h}^{(n)}_{k,t} = \frac{1}{F} \sum_{f=1}^{F} \frac{v_{k,f,t,n}}{w^{(n)}_{f,k}} \qquad (21)$$

The values of $p_0$, $p_1$ and $p_2$ are given in Table 2.
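The two quantities in (21) can be sketched for one source and one frame as below. The function names and the toy factorization are hypothetical. The example uses a variance that matches the model exactly, in which case the posterior power of component $k$ reduces to $w_{f,k} h_{k,t}$ and $\hat{h}_{k,t}$ reproduces the current activation.

```python
def posterior_power(k, w_f, h_t, v_ft):
    """v_{k,f,t} of Eq. (21) at one TF bin.

    w_f[l] = w_{f,l}, h_t[l] = h_{l,t}, v_ft = variance v_{f,t} to be factorized.
    """
    total = sum(w * h for w, h in zip(w_f, h_t))
    gain = w_f[k] * h_t[k] / total
    others = total - w_f[k] * h_t[k]      # sum over l != k of w_{f,l} h_{l,t}
    return gain ** 2 * v_ft + gain * others


def h_hat(k, W, h_t, v_t):
    """h_hat_{k,t} of Eq. (21): (1/F) * sum_f v_{k,f,t} / w_{f,k}."""
    F = len(W)
    return sum(posterior_power(k, W[f], h_t, v_t[f]) / W[f][k]
               for f in range(F)) / F


# Toy factorization with F = 3 bins and K = 2 components (made-up numbers).
W = [[1.0, 2.0], [0.5, 1.0], [2.0, 0.5]]
h_t = [2.0, 1.0]
v_t = [sum(w * h for w, h in zip(W[f], h_t)) for f in range(3)]  # exact model

p00 = posterior_power(0, W[0], h_t, v_t[0])
h0 = h_hat(0, W, h_t, v_t)
h1 = h_hat(1, W, h_t, v_t)
```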

4. Experiments

Here, we consider the stereo case ($M = 2$). Constant values are chosen for the shape parameter of the Gamma prior, $a^{(n)}_{k,t} = 2$, and for the parameter $\alpha = 5$ of the inverse-Gamma distributions. In a non-blind setting where some training data is available, these parameters can be learned instead of taking fixed values. The initial value for $b^{(n)}_{k,t}$ is chosen equal to the mean value of the $v^{(n)}_{f,t}$ matrix.

Fig. 1. Average blind source separation performance over stereo mixtures of three sources as a function of the reverberation time, measured in terms of (a) SDR, (b) SIR, (c) ISR, and (d) SAR.

The male and female speech signals and the music signals are taken from the dev2 dataset of the SiSEC’08 “underdetermined speech and music mixtures” task (Vincent et al., 2009). Each source signal has a duration of 10 s. The sampling rate is 16 kHz. The STFT frame size is 1024 samples with a frame shift of 512 samples. A sine window function is applied to each frame to obtain the STFT coefficients. The number of components is set to 30 for the NMF-based algorithms. 20 iterations of the EM algorithm are executed.
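The STFT settings above determine the size of the variance matrices being factorized. The sketch below works this out under explicit assumptions that are not stated in the text: the sine window is taken as $w[n] = \sin(\pi (n + 0.5)/N)$, only the non-redundant half of the spectrum is counted as frequency bins, and the signal is not padded at its ends.

```python
import math

FS = 16000        # sampling rate (Hz)
FRAME = 1024      # STFT frame size (samples)
HOP = 512         # frame shift (samples)
DURATION_S = 10   # source duration (s)
K = 30            # NMF components per source


def sine_window(n_fft):
    """Sine window w[n] = sin(pi * (n + 0.5) / n_fft) (assumed definition)."""
    return [math.sin(math.pi * (n + 0.5) / n_fft) for n in range(n_fft)]


def num_frames(n_samples, n_fft, hop):
    """Number of full frames without padding (an assumption)."""
    return 1 + (n_samples - n_fft) // hop


window = sine_window(FRAME)
n_bins = FRAME // 2 + 1                              # F
n_frames = num_frames(FS * DURATION_S, FRAME, HOP)   # T

# Parameters per source: full F x T variance matrix versus the rank-K factors.
full_size = n_bins * n_frames
nmf_size = K * (n_bins + n_frames)
```

With 50% overlap this window satisfies $w[n]^2 + w[n + N/2]^2 = 1$, so applying it at both analysis and synthesis gives perfect reconstruction by overlap-add. Under these assumptions the NMF factors need far fewer parameters per source than the unconstrained variance matrix.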

The separation quality is measured by calculating the BSS evaluation metrics. We use the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR) and source image-to-spatial distortion ratio (ISR) criteria, expressed in decibels (dB), as defined in Vincent et al. (2007).

We have performed the experiments on both synthetically mixed and real-world data, as explained in Sections 4.1 and 4.2 respectively.

4.1. Experiments on synthetic mixtures

The mixture signal is synthetically generated using the Roomsim Toolbox (Campbell et al., 2005) for a rectangular room of dimensions 6.25 × 3.75 × 2.5 m and omni-directional microphones. Three sources are chosen for each mixture. There are four different mixture types: (1) three male speech sources; (2) three female speech sources; (3) three nodrums music sources; and (4) three wdrums music sources. The nodrums data consists of three non-percussive sources, while the wdrums data includes drums. The source directions w.r.t. the microphone axis are 30°, 70° and 120°. The source and microphone heights are set to 1.1 m. The center of the microphone pair lies at (1.56 m, 1.87 m) in (x, y) coordinates. The microphone spacing is set to d = 5 cm. We evaluate our proposed methods under different reverberant conditions

Table 3
Average performance metrics obtained over speech mixtures.

| RT60 | Algorithm | SDR (dB) | SIR (dB) | ISR (dB) | SAR (dB) |
|---|---|---|---|---|---|
| 130 ms | Gamma Bayesian model | 9.4 | 14 | 16.8 | 12.1 |
| | Inverse-Gamma Bayesian model | 7.8 | 12.8 | 16.1 | 11.8 |
| 250 ms | Gamma Bayesian model | 7.3 | 12 | 13.6 | 10.6 |
| | Inverse-Gamma Bayesian model | 7.1 | 11.6 | 13.1 | 9.5 |

Table 4
Average performance metrics obtained over music mixtures.

| RT60 | Algorithm | SDR (dB) | SIR (dB) | ISR (dB) | SAR (dB) |
|---|---|---|---|---|---|
| 130 ms | Gamma Bayesian model | 8.8 | 14.4 | 13.8 | 9.7 |
| | Inverse-Gamma Bayesian model | 10.2 | 15.4 | 14.1 | 13.2 |
| 250 ms | Gamma Bayesian model | 6.1 | 8.8 | 11.2 | 9.2 |
| | Inverse-Gamma Bayesian model | 8.1 | 14 | 14.1 | 11.5 |

averaged over all sources and all mixtures are depicted against the RT60 values in Fig. 1. For comparison, the results of the standard NMF method (Arberet et al., 2010) and the full-rank unconstrained model without NMF (Duong et al., 2010b) are also evaluated, as indicated in the figure. It can be observed that our proposed Bayesian approaches outperform both state-of-the-art methods in all evaluation metrics. In the full-rank model of Duong et al. (2010b), since the model parameters are independently estimated at different frequencies, the well-known source permutation problem must be addressed. The DoA-based permutation alignment scheme was used for this purpose. However, in the case of NMF and Bayesian NMF decomposition, the permutation problem is avoided due to the joint estimation of the parameters.

The value chosen for the smoothing parameter ($\lambda = 0.5$) led to the best performance on average for all signal types. To provide a quantitative analysis, we evaluated the SDR metric for 60 values of $\lambda$ uniformly spaced between 0.3 and 0.9, with the reverberation time set to 250 ms. We observed that the standard deviation of the SDR is 0.13 dB and that the best value is obtained for $\lambda = 0.5$.

To assess how well each of the two Bayesian frameworks models the two audio mixture categories, music and speech, we calculated the average separation performance for speech and music mixtures separately, as listed in Tables 3 and 4 respectively, for RT60 values of 130 ms and 250 ms. We observe that the Gamma Bayesian framework performs better for speech mixtures and that the inverse-Gamma chain Bayesian model is better matched to the music signals. These results can be attributed to the stricter temporal continuity constraint imposed by the Gamma framework relative to the inverse-Gamma model. Hence, we may infer that the strict temporal dependency constraint is better suited to speech signals in general. We have also observed that inserting the temporal dependencies into the model may not improve the separation quality for percussive music sources because of their discontinuous nature. To illustrate this, we have analyzed the effect of reducing the temporal smoothness constraint in both Bayesian models and reported the obtained SDR for the wdrums mixture, which contains percussive sources, in Tables 5 and 6. RT60 is set to 250 ms. Reducing or eliminating the temporal continuity constraint in both models can even lead to better performance for the percussive sources “Drums” and “Hi-hat”. However, applying the temporal continuity constraint improves the SDR of the “Bass” source.

Table 5
SDR metric (dB) for wdrums sources using the Gamma Bayesian model.

| Smoothing parameter | Drums | Hi-hat | Bass |
|---|---|---|---|
| λ = 0 | 4.5 | 5.6 | 10.4 |
| λ = 0.5 | 3.4 | 4.8 | 11.5 |

Table 6
SDR metric (dB) for wdrums sources using the Inverse-Gamma Bayesian model.

| Smoothing parameter | Drums | Hi-hat | Bass |
|---|---|---|---|
| α = 0.1 | 5.1 | 7.7 | 11.0 |
| α = 5 | 4.4 | 7.2 | 12.2 |

The EM convergence is demonstrated in Fig. 2a and b for the two Bayesian frameworks, showing the MAP criteria ($Q_1^{MAP}$ and $Q_3^{MAP}$ in Appendices A.1 and B respectively) at each iteration. The MAP criterion is averaged over all mixtures.

4.2. Experiments on real-world mixtures

We perform this final experiment to compare the proposed Bayesian algorithms with state-of-the-art BSS algorithms submitted for evaluation to SiSEC 2008 over real-world mixtures of three or four speech sources. Two mixtures were recorded for each given number of sources, using either male or female speech signals. The room reverberation time was either 130 or 250 ms and the microphone spacing was 5 cm (Vincent et al., 2009). The average SDR achieved by each algorithm is listed in Table 7 for comparison. The SDR results of the competing algorithms were taken from Table III in Duong et al. (2010b). The Bayesian NMF-based methods outperform the other state-of-the-art methods. The better performance achieved using the proposed Bayesian approaches can be attributed to the choice of the prior structure as well as to modeling the temporal smoothness. We have analyzed this by removing the smoothness constraint from both Bayesian models and comparing the obtained average SDR values for real-world mixtures with the case where smoothness is considered.

For the Gamma model, we set $\lambda$ to 0 to eliminate the smoothness. The obtained results are reported in Table 8. It can be inferred that the temporal continuity constraint applied through the smoothing parameter $\lambda$ leads to improved performance in terms of the average SDR metric.

For the second, Inverse-Gamma model, we set the $\alpha$ parameter to 0.1 to reduce the sharpness of the prior. The obtained results are reported in Table 9. Again, it can be observed that the smoothing constraint incorporated into the model through the inverse-Gamma prior hyperparameter $\alpha$ leads to better average separation performance in terms of SDR.

Fig. 2. EM convergence graphs for (a) Gamma Bayesian framework, (b) Inverse-Gamma chain Bayesian framework.

Table 7
Average SDR (dB) over the real-world test data of SiSEC 2008 with 5-cm microphone spacing.

| RT60 | Algorithm | 3 sources | 4 sources |
|---|---|---|---|
| 130 ms | Gamma Bayesian model | 4.4 | 3.9 |
| | Inverse-Gamma Bayesian model | 4.6 | 3.4 |
| | Full-rank unconstrained model (Duong et al., 2010b) | 3.3 | 2.8 |
| | M. Cobos (Cobos and López, 2009) | 2.3 | 2.1 |
| | M. Mandel (Mandel and Ellis, 2007) | 0.1 | −3.7 |
| | R. Weiss (Weiss and Ellis, 2010) | 2.9 | 2.3 |
| | S. Araki (Araki et al., 2009) | 2.9 | – |
| | Z. El Chami (El Chami et al., 2008) | 2.3 | 2.1 |
| 250 ms | Gamma Bayesian model | 5.0 | 3.1 |
| | Inverse-Gamma Bayesian model | 4.9 | 3.5 |
| | Full-rank unconstrained model (Duong et al., 2010b) | 3.8 | 2.0 |
| | M. Cobos (Cobos and López, 2009) | 2.2 | 1.0 |
| | M. Mandel (Mandel and Ellis, 2007) | 0.8 | 1.0 |
| | R. Weiss (Weiss and Ellis, 2010) | 2.3 | 1.5 |
| | S. Araki (Araki et al., 2009) | 3.7 | – |
| | Z. El Chami (El Chami et al., 2008) | 3.1 | 1.4 |

Table 8
Average SDR (dB) over the real-world test data for the Gamma Bayesian model.

| RT60 | Algorithm | 3 sources | 4 sources |
|---|---|---|---|
| 130 ms | Gamma Bayesian model with λ = 0 | 3.8 | 3.0 |
| | Gamma Bayesian model with λ = 0.5 | 4.4 | 3.9 |
| 250 ms | Gamma Bayesian model with λ = 0 | 4.1 | 2.4 |
| | Gamma Bayesian model with λ = 0.5 | 5.0 | 3.1 |

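Table 9
Average SDR (dB) over the real-world test data for the Inverse-Gamma Bayesian model.

| RT60 | Algorithm | 3 sources | 4 sources |
|---|---|---|---|
| 130 ms | Inverse-Gamma Bayesian model with α = 0.1 | 3.6 | 2.9 |
| | Inverse-Gamma Bayesian model with α = 5 | 4.6 | 3.4 |
| 250 ms | Inverse-Gamma Bayesian model with α = 0.1 | 4 | 3.0 |
| | Inverse-Gamma Bayesian model with α = 5 | 4.9 | 3.5 |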

5. Conclusion

In this work, we proposed and developed two Bayesian NMF frameworks for separating stereo audio mixtures in a reverberant environment using a factorization of the source variance matrices in a full-rank model. The temporal dependencies of the audio signal are taken into account by incorporating two different prior distribution structures for the activation coefficients: (1) a Gamma prior distribution whose scale parameters are updated at each time frame and (2) an inverse-Gamma Markov chain prior distribution. The performance of the developed Bayesian methods is compared with each other as well as with the standard NMF and the standard full-rank unconstrained modeling schemes by calculating the BSS evaluation metrics. It has been shown that the proposed Bayesian approaches outperform the previous state-of-the-art methods. It can also be concluded that the inverse-Gamma chain prior structure performs better for music source separation and that the Gamma prior structure with recursive updates is better suited to speech mixtures. Bayesian models offer both a strong theoretical framework and the possibility to manage constraints through models and

Acknowledgment

This research was funded by the KU Leuven research grant GOA/14/005 (CAMETRON).

Appendix A

A.1. Derivation of the M-step update relations under the Bayesian NMF framework with Gamma prior

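MAP estimation of the temporal weights $h^{(n)}_{k,t}$ is equivalent to maximizing the cost function below:

$$Q_1^{MAP} = -\sum_{f,t,k,n} D_{KL}\left(\hat{R}_{y_{n,k}}(f,t) \mid R_{y_{n,k}}(f,t)\right) + \sum_{t,k,n}\left[\left(a^{(n)}_{k,t} - 1\right) \log h^{(n)}_{k,t} - \frac{h^{(n)}_{k,t}}{b^{(n)}_{k,t}}\right], \quad f = 1{:}F,\ t = 1{:}T,\ k = 1{:}K,\ n = 1{:}N \qquad (A.1)$$

where

$$\begin{aligned}
\hat{R}_{y_{n,k}}(f,t) &= \hat{Y}_{n,k}(f,t)\,\hat{Y}^{H}_{n,k}(f,t) + \left(I - G_{n,k}(f,t)\right) R_{y_{n,k}}(f,t) \\
\hat{Y}_{n,k}(f,t) &= G_{n,k}(f,t)\, X(f,t) \\
G_{n,k}(f,t) &= R_{y_{n,k}}(f,t)\,\left(R_X(f,t)\right)^{-1} \\
R_{y_{n,k}}(f,t) &= w^{(n)}_{f,k}\, h^{(n)}_{k,t}\, R_n(f)
\end{aligned} \qquad (A.2)$$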

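The first term in (A.1) is equivalent to the Maximum Likelihood (ML) criterion as stated in Arberet et al. (2010). So $Q_1^{MAP}$ can be written as:

$$Q_1^{MAP} = \sum_{f,t,k,n}\left[-\frac{1}{2}\,\mathrm{trace}\left(\frac{\hat{R}_{y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}}\right) + \frac{1}{2}\log\det\left(\frac{\hat{R}_{y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}}\right) + 1\right] + \sum_{t,k,n}\left[\left(a^{(n)}_{k,t} - 1\right)\log h^{(n)}_{k,t} - \frac{h^{(n)}_{k,t}}{b^{(n)}_{k,t}} - a^{(n)}_{k,t} \log b^{(n)}_{k,t}\right] \qquad (A.3)$$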

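Then, the derivative of $Q_1^{MAP}$ w.r.t. $h^{(n)}_{k,t}$ is obtained as:

$$\frac{dQ_1^{MAP}}{dh^{(n)}_{k,t}} = \frac{\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} - \left(F - a^{(n)}_{k,t} + 1\right) h^{(n)}_{k,t} - \frac{\left(h^{(n)}_{k,t}\right)^2}{b^{(n)}_{k,t}}}{\left(h^{(n)}_{k,t}\right)^2} \qquad (A.4)$$

Setting (A.4) to zero gives the update relation expressed in (14).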

A.2. Initializing the parameters under the Bayesian NMF framework with Gamma prior

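We assume the IS-NMF model with Gamma prior for the initialization step. Thus, we obtain the MAP estimate by minimizing the function $Q_2^{MAP}$:

$$Q_2^{MAP} = \sum_{f,t,k,n} D_{IS}\left(v_{k,f,t,n} \mid w^{(n)}_{f,k} h^{(n)}_{k,t}\right) - \sum_{t,k,n} \log P\left(h^{(n)}_{k,t} \mid a^{(n)}_{k,t}, b^{(n)}_{k,t}\right) \qquad (A.5)$$

where

$$v_{k,f,t,n} = \left|\frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}}\right|^{2} v^{(n)}_{f,t} + \frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}} \sum_{l \neq k} w^{(n)}_{f,l} h^{(n)}_{l,t}.$$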

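Therefore, the partial derivatives of (A.5) are obtained as:

$$\frac{dQ_2^{MAP}}{dw^{(n)}_{f,k}} = \frac{T}{w^{(n)}_{f,k}} - \frac{1}{\left(w^{(n)}_{f,k}\right)^2} \sum_t \frac{v_{k,f,t,n}}{h^{(n)}_{k,t}} \qquad (A.6)$$

$$\frac{dQ_2^{MAP}}{dh^{(n)}_{k,t}} = \frac{F}{h^{(n)}_{k,t}} - \frac{1}{\left(h^{(n)}_{k,t}\right)^2} \sum_f \frac{v_{k,f,t,n}}{w^{(n)}_{f,k}} - \frac{a^{(n)}_{k,t} - 1}{h^{(n)}_{k,t}} + \frac{1}{b^{(n)}_{k,t}} \qquad (A.7)$$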

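where the first two terms on the right-hand side of each equation are equal to the gradients of the ML criterion ($Q^{ML}$). Therefore, the update relations are obtained as below:

$$w^{(n)}_{f,k} = \frac{1}{T} \sum_t \frac{v_{k,f,t}}{h^{(n)}_{k,t}} \qquad (A.8)$$

$$h^{(n)}_{k,t} = \frac{\sqrt{p_1^2 - 4 p_2 p_0} - p_1}{2 p_2}, \qquad p_2 = \frac{1}{b^{(n)}_{k,t}}, \quad p_1 = F - a^{(n)}_{k,t} + 1, \quad p_0 = -\sum_f \frac{v_{k,f,t}}{w^{(n)}_{f,k}} \qquad (A.9)$$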

The scale parameters $b^{(n)}_{k,t}$ of the prior are recursively updated at each time frame using (17).

Appendix B. Derivation of the M-step update relations under the Bayesian NMF framework with inverse-Gamma Markov chain prior

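In the case of the Bayesian framework with inverse-Gamma chain prior, the MAP criterion to be maximized can be written as follows, where only the prior factors that involve $h^{(n)}_{k,t}$ are written out:

$$Q_3^{MAP} = \sum_{f,t,k,n}\left[-\frac{1}{2}\,\mathrm{trace}\left(\frac{\hat{R}_{y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}}\right) + \frac{1}{2}\log\det\left(\frac{\hat{R}_{y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}}\right) + 1\right] + \log p\left(h^{(n)}_{k,t} \mid h^{(n)}_{k,t-1}\right) + \log p\left(h^{(n)}_{k,t+1} \mid h^{(n)}_{k,t}\right) \qquad (B.1)$$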

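So the gradient of (B.1) is obtained as follows:

$$\frac{dQ_3^{MAP}}{dh^{(n)}_{k,t}} = \begin{cases}
\dfrac{\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha - 1 - F)\, h^{(n)}_{k,t} - \frac{(\alpha+1)\left(h^{(n)}_{k,t}\right)^2}{h^{(n)}_{k,t+1}}}{\left(h^{(n)}_{k,t}\right)^2}, & t = 1 \\[3ex]
\dfrac{\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha+1)\, h^{(n)}_{k,t-1} - (F+1)\, h^{(n)}_{k,t} - \frac{(\alpha+1)\left(h^{(n)}_{k,t}\right)^2}{h^{(n)}_{k,t+1}}}{\left(h^{(n)}_{k,t}\right)^2}, & 1 < t < T \\[3ex]
\dfrac{\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha+1)\, h^{(n)}_{k,t-1} - (\alpha+1+F)\, h^{(n)}_{k,t}}{\left(h^{(n)}_{k,t}\right)^2}, & t = T
\end{cases} \qquad (B.2)$$

Setting the above gradient to zero leads to the update equation stated in (19).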

References

Araki, S., Nakatani, T., Sawada, H., Makino, S., 2009. Stereo source separation and source counting with MAP estimation with Dirichlet prior considering spatial aliasing problem. In: Independent Component Analysis and Signal Separation. Springer, pp. 742–750.

Arberet, S., Ozerov, A., Duong, N.Q., Vincent, E., Gribonval, R., Bimbot, F., Vandergheynst, P., 2010. Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation. In: Proceedings of the 10th International Conference on Information Sciences Signal Processing and their Applications (ISSPA), 2010. IEEE, pp. 1–4.

Burred, J.J., Sikora, T., 2006. Comparison of frequency-warped representations for source separation of stereo mixtures. In: Audio Engineering Society Convention 121. Audio Engineering Society.

Campbell, D., Palomaki, K., Brown, G., 2005. A MATLAB simulation of “shoebox” room acoustics for use in research and teaching. Comput. Inf. Syst ICASSP 2010, 9–12.

Cobos, M., López, J., 2009. Blind separation of underdetermined speech mixtures based on DOA segmentation. IEEE Trans. Audio, Speech, Lang. Process.

Duong, N.Q., Vincent, E., Gribonval, R., 2009. Spatial covariance models for under-determined reverberant audio source separation. In: Applications of Signal Processing to Audio and Acoustics. IEEE, pp. 129–132.

Duong, N.Q., Vincent, E., Gribonval, R., 2010a. Under-determined convolutive blind source separation using spatial covariance models. In: Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2010. IEEE, pp. 9–12.

Duong, N.Q., Vincent, E., Gribonval, R., 2010b. Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio, Speech, Lang. Process., 18 (7), 1830–1840.

Duong, N.Q., Vincent, E., Gribonval, R., 2010c. Under-determined reverberant audio source separation using local observed covariance and auditory-motivated time-frequency representation. In: Latent Variable Analysis and Signal Separation. Springer, pp. 73–80.

Duong, N.Q., Vincent, E., Gribonval, R., 2013. Spatial location priors for Gaussian model based reverberant audio source separation. EURASIP J. Adv. Signal Process. 2013, 1–11.

El Chami, Z., Pham, D., Serviere, C., Guerin, A., 2008. A new model based underdetermined source separation. In: Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC), p. 147.

Essid, S., Févotte, C., 2013. Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring. IEEE Trans. Multimedia 415–425.

Févotte, C., 2011. Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. IEEE, pp. 1980–1983.

Févotte, C., Bertin, N., Durrieu, J.-L., 2009. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 793–830.

Févotte,C.,LeRoux,J.,Hershey,J.R.,2013.Non-negativedynamicalsystem withapplicationtospeechandaudio.In:ProceedingsoftheIEEE Interna-tionalConferenceonAcoustics,SpeechandSignalProcessing(ICASSP), 2013.IEEE,pp.3158–3162.

Kim,M.,Smaragdis,P.,2013.Singlechannelsourceseparationusingsmooth nonnegativematrixfactorizationwithMarkovrandomfields.In: Proceed-ingsoftheIEEEInternationalWorkshoponMachineLearningforSignal Processing(MLSP),2013.IEEE,pp.1–6.

Mandel,M.I.,Ellis,D.P.,2007.EMlocalizationandseparationusing inter-auralleveland phase cues. In:Proceedings of theIEEEWorkshop on ApplicationsofSignalProcessingtoAudioandAcoustics,2007.IEEE, pp.275–278.

Mohammadiha,N.,Smaragdis,P.,Leijon,A.,2013.Supervisedand unsuper-visedspeechenhancementusingnonnegativematrix factorization.IEEE Trans.Audio,Speech,Lang.Process. 2140–2151.

Mohammadiha, N., Taghia, J., Leijon, A., 2012. Single channel speech enhancement usingBayesian NMF with recursive temporalupdates of prior distributions. In: Proceedings of the IEEE International Confer-enceonAcoustics,SpeechandSignalProcessing(ICASSP),2012.IEEE, pp.4561–4564.

Nakano,M.,LeRoux,J.,Kameoka,H.,Kitano,Y.,Ono,N.,Sagayama,S., 2010.Nonnegative matrix factorization with Markov-chained bases for modelingtime-varyingpatternsinmusicspectrograms.In:LatentVariable AnalysisandSignalSeparation.Springer,pp.149–156.

Nikunen, J., Virtanen, T., 2014. Direction of arrival based spatial covari-ancemodelforblindsoundsourceseparation.IEEE/ACMTrans.Audio, SpeechLang.Process.22(3),727–739.

Ozerov,A.,Févotte,C.,2010.Multichannelnonnegativematrixfactorization inconvolutivemixturesforaudiosourceseparation.IEEETrans.Audio, Speech,Lang.Process.18(3),550–563.

Sawada, H., Kameoka, H., Araki, S., Ueda, N., 2013. Multichannel ex-tensionsofnon-negativematrixfactorization withcomplex-valueddata. IEEETrans.Audio,SpeechLang.Process.21(5),971–982.

Smaragdis, P., Fevotte, C., Mysore,G., Mohammadiha, N.,Hoffman, M., 2014.Staticanddynamicsourceseparationusingnonnegative factoriza-tions:aunifiedview.SignalProcess.Mag.,IEEE,66–75.

Vincent, E., 2006. Musicalsource separation using time-frequencysource priors.Audio,Speech,Lang.Process.,IEEETrans.91–98.

Vincent, E., Araki, S., Bofill, P., 2009. The2008signal separation evalu-ationcampaign:a community-basedapproachtolarge-scaleevaluation. In: Independent Component Analysis and Signal Separation. Springer, pp.734–741.

Vincent,E.,Sawada,H.,Bofill,P.,Makino,S.,Rosca,J.P.,2007.Firststereo audiosourceseparationevaluationcampaign:data,algorithmsandresults. In: Independent Component Analysis and Signal Separation. Springer, pp.552–559.

Virtanen,T.,2007.Monauralsoundsourceseparationbynonnegativematrix factorizationwithtemporalcontinuityandsparsenesscriteria.IEEETrans. Audio,Speech,Lang.Process.15(3),1066–1074.

Weiss,R.J.,Ellis,D.P.,2010.Speechseparationusingspeaker-adapted eigen-voicespeechmodels.Comput.SpeechLang.16–29.

Wilson,K.W.,Raj, B.,Smaragdis, P., 2008.Regularizednon-negative ma-trixfactorization with temporaldependencies forspeechdenoising. In: Interspeech,pp.411–414.

Winter,S.,Kellermann,W.,Sawada,H.,Makino,S.,2007.Map-based un-derdeterminedblindsource separationof convolutive mixturesby hier-archicalclusteringandl1-normminimization.EURASIPJ.Appl.Signal Process2007(1),81–92.

Yilmaz,O.,Rickard,S.,2004.Blindseparationofspeechmixturesvia time-frequencymasking.IEEETrans.SignalProcess.52(7),1830–1847.