Speech Communication 81 (2016) 129–137
www.elsevier.com/locate/specom

Under-determined reverberant audio source separation using Bayesian Non-negative Matrix Factorization

Sayeh Mirzaei a,∗, Hugo Van Hamme a, Yaser Norouzi b

a Department of Electrical Engineering-ESAT, KU Leuven, Kasteelpark Arenberg 10, Bus 2441, B-3001 Leuven, Belgium
b Department of Electrical Engineering, Amirkabir University of Technology, 424 Hafez Ave, Tehran, Iran

Received 28 May 2015; received in revised form 11 January 2016; accepted 13 January 2016
Available online 3 February 2016
Abstract
In this paper, we address the task of audio source separation for a stereo reverberant mixture of audio signals. We use a full-rank model for the spatial covariance matrix. Bayesian Non-negative Matrix Factorization (NMF) frameworks are introduced for factorizing the time-frequency variance matrix of each source into basis components and time activations. We also propose to incorporate the temporal dependencies in the Bayesian model through (1) recursively updating the prior hyperparameters or (2) applying a prior with Markov chain structure to favor the smoothness of the solution, and we compare the performance of these two schemes. The EM algorithm is applied to derive the update relations of the unknown parameters. The separation performance improvement over the non-Bayesian standard NMF method as well as the conventional full-rank unconstrained method is investigated by calculating objective separation evaluation metrics.
© 2016 Elsevier B.V. All rights reserved.
Keywords: Blind Source Separation (BSS); Bayesian Non-negative Matrix Factorization (BNMF); Spatial covariance model; Temporal dependencies; Reverberant mixture.
1. Introduction
We often deal with a mixture of sounds coming from different acoustic sources. Separating these audio signals and extracting the individual source signals is required in many applications, including speaker diarization, meeting transcription systems, hearing aids and polyphonic music transcription.
∗ Corresponding author. Tel.: +98 9126850714. E-mail address: [email protected] (S. Mirzaei).

When no prior information about the sources or the channel mixing system is available, the task is called Blind Source Separation (BSS). The multichannel mixture signal $x(t) \in \mathbb{R}^M$ can be expressed as
$$x(t) = \sum_{n=1}^{N} y_n(t) \tag{1}$$
where $y_n(t)$, $n = 1 \ldots N$, is the spatial image vector of the $n$th source over $M$ channels. The mixing process consists of a linear time-invariant filtering of the source signals:
$$y_n(t) = \sum_{l=0}^{L-1} h_n(l)\, s_n(t-l) \tag{2}$$
where $s_n(t)$ is the $n$th source signal, $h_n(l) \in \mathbb{R}^M$ is the mixing filter vector, which describes the acoustic path from source $n$ to the $M$ microphones, and $L$ is the filter length. Most of the proposed BSS methods are based on the assumption that the mixing process at each frequency bin can be approximated by a complex-valued multiplication:
$$Y_n(f,t) \approx H_n(f)\, S_n(f,t) \tag{3}$$
where $Y_n(f,t)$ is the spatial image of source $n$ in the Short-Time Fourier Transform (STFT) domain, $S_n(f,t)$ denotes the source STFT and $H_n(f)$ is the Fourier transform of the mixing filter $h_n(l)$.
If $S_n(f,t)$ is a zero-mean variable with variance $v_n(f,t)$, the covariance of $Y_n(f,t)$ can be written as:
$$R_{Y_n}(f,t) = v_n(f,t)\, R_n(f) \tag{4}$$
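The time-domain mixing model of (1) and (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `spatial_image` and `mix`, the random filters and the toy sizes are hypothetical and are not taken from the paper, which uses simulated and recorded room responses.

```python
import numpy as np

def spatial_image(source, filters):
    """Eq. (2): filter one source signal to every microphone.

    source  : (T,) source signal s_n(t)
    filters : (M, L) mixing filters h_n(l), one row per microphone
    returns : (M, T) spatial image y_n(t)
    """
    T = source.shape[0]
    # np.convolve returns T + L - 1 samples; keep the first T (causal part).
    return np.stack([np.convolve(source, h)[:T] for h in filters])

def mix(sources, all_filters):
    """Eq. (1): the mixture is the sum of the N spatial images."""
    return sum(spatial_image(s, h) for s, h in zip(sources, all_filters))

# Toy under-determined example: three sources, two microphones.
rng = np.random.default_rng(0)
N, M, L, T = 3, 2, 8, 1000
sources = rng.standard_normal((N, T))          # made-up source signals
all_filters = rng.standard_normal((N, M, L))   # made-up mixing filters
x = mix(sources, all_filters)                  # mixture, shape (M, T)
```

Because there are more sources than channels, the mixture cannot be inverted by a linear demixing matrix, which is why the variance and covariance models that follow are needed.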
http://dx.doi.org/10.1016/j.specom.2016.01.003
The assumption in (3) implies that the spatial covariance matrix of each source, $R_n(f)$, has rank 1. Under the rank-1 model, BSS can be achieved using time-frequency (TF) masking techniques (Yilmaz and Rickard, 2004), MAP estimation assuming sparse prior distributions (Winter et al., 2007), or modeling of the source variances with Non-negative Matrix Factorization (NMF) (Févotte et al., 2009; Ozerov and Févotte, 2010). The rank-1 assumption is only valid when the filter length $L$ is sufficiently small with respect to the STFT window length. This is violated in most realistic scenarios where reverberation exists. A full-rank spatial covariance matrix model is proposed in Duong et al. (2009) to provide a better approximation in reverberant environments. The Maximum Likelihood (ML) solution is then found in an oracle context, where both the spatial covariance matrix $R_n(f)$ and the scalar source variances $v_n(f,t)$ are known, and also in a semi-blind context, where the spatial covariance matrix is estimated from single-source training data. In Duong et al. (2010a), the EM algorithm was used for blindly estimating both of the above parameters. The source permutation problem, which arises when the unknown parameters are independently estimated at each frequency bin, has also been solved in Duong et al. (2010a).
In Arberet et al. (2010), the source variances $v_n(f,t)$ are modeled by NMF and the EM algorithm is used for blindly estimating the parameters, similar to what is done in Duong et al. (2010a). In Duong et al. (2010c), the use of a non-uniform TF representation on the auditory-motivated equivalent rectangular bandwidth (ERB) scale is investigated. It has been shown that this representation is beneficial for multichannel convolutive source separation provided that the full-rank covariance model is used. This has also been investigated in Burred and Sikora (2006) for instantaneous mixtures and in Vincent (2006) for convolutive mixtures.
In Duong et al. (2010b), four specific covariance models are considered: the rank-1 anechoic model, the rank-1 convolutive model, the full-rank direct+diffuse model and the full-rank unconstrained model. A hierarchical clustering-based method is used to initialize the parameters. In addition, a Direction of Arrival (DoA) based approach is proposed to align the order of the estimated sources across all frequency bins.
In Duong et al. (2013), spatial location prior distributions consistent with the theory of statistical room acoustics are proposed for the spatial covariance matrices, and EM algorithms are derived for Maximum a Posteriori (MAP) estimation. In Nikunen and Virtanen (2014), a spatial covariance matrix model is proposed which consists of a weighted sum of Direction of Arrival kernels. This covariance model is combined with the Complex NMF (CNMF) framework proposed in Sawada et al. (2013), and the update relations for finding the unknown parameters are subsequently derived.
In Arberet et al. (2010), the $n$th source variance matrix $V_n$ ($F \times T$), consisting of the above variance elements $v_n(f,t)$, is approximated as a product of two non-negative matrices $W_n$ ($F \times K$) and $H_n$ ($K \times T$), which specify the basis components and the time activations, respectively. It is assumed that the number of components $K$ required for modeling each source is known in advance. However, this may not be a suitable assumption when the goal is to blindly separate the individual source signals and there is no prior information about the source types. Here, we propose a Bayesian NMF framework to automatically infer the number of basis vectors for each source. In our first approach, we develop a Bayesian framework assuming that the elements of the time activation matrix $H_n$ are random variables with a Gamma prior distribution. An EM algorithm is developed for deriving the update equations. The update relations given in Arberet et al. (2010) are replaced with newly derived relations for the factors of the source variance matrices, which are obtained through MAP estimation. We have also modeled the temporal dependencies by imposing constraints on the prior distribution of the temporal activations. A procedure inspired by Mohammadiha et al. (2012) is used for updating the scale parameters of the prior distributions of the time activations. In the second approach, we favor the smoothness of the results by applying an inverse-Gamma chain prior distribution inspired by Févotte et al. (2009).
In Smaragdis et al. (2014), a comprehensive study of the NMF methods which model temporal statistics is presented. One flexible approach for considering the actual temporal dependencies is to impose constraints on the model activations (Essid and Févotte, 2013; Févotte, 2011; Févotte et al., 2009; Mohammadiha et al., 2013; 2012; Virtanen, 2007; Wilson et al., 2008). These approaches are called dynamic or smooth NMF. They differ in the penalty term used in non-probabilistic settings, or in the choice of the observation model and prior structure in Bayesian frameworks. In Virtanen (2007), temporal continuity and sparseness constraints are applied to the activation coefficients. Temporal continuity is favored by using a cost term which is the sum of squared differences between the activations in adjacent frames, and sparseness is favored by penalizing nonzero activations. A Non-negative Dynamical System (NDS) is introduced in Févotte et al. (2013) for modeling speech spectra. It can be regarded as an extension of NMF to support Markovian dynamics. Non-negativity preserving Gamma or inverse-Gamma Markov chain priors are considered in Févotte (2011), Févotte et al. (2009) and Mohammadiha et al. (2013, 2012), and Markov random fields in Kim and Smaragdis (2013). In Nakano et al. (2010), the spectrogram of music signals is modeled as the combination of Markov-chained spectral patterns.
The approaches proposed in this paper can be regarded as Bayesian extensions of the method proposed in Arberet et al. (2010) that accentuate the smoothness of the estimates. The Gamma prior model has been chosen for its effectiveness in modeling sparse parameters. Moreover, Gamma and inverse-Gamma prior distributions are preferred because we model the non-negative elements of the activation matrix; other sparse prior distributions such as the Laplace distribution, whose support includes negative values, are therefore not suitable here. The novel aspects of our proposed approaches can be summarized as follows:
• Bayesian NMF frameworks are proposed to factorize the source variance matrix in the full-rank model, with the purpose of providing a more powerful model by applying suitable prior structures and avoiding over-fitted or under-fitted models.
• Temporal dependencies are taken into account via (1) updating the scale hyperparameters of the time activation Gamma prior distributions; (2) applying an inverse-Gamma Markov chain prior distribution.
The rest of the paper is structured as follows. In Section 2, we introduce our first proposed Bayesian approach, which imposes a temporal continuity constraint by updating the scale hyperparameters of the Gamma prior. In Section 3, we explain the second Bayesian approach, which uses an inverse-Gamma chain prior structure. Various experimental settings and the performance results are presented in Section 4. Finally, we conclude in Section 5.
2. Bayesian NMF with Gamma prior model configuration
We assume the following generative model for the $n$th source spatial image $Y_n(f,t)$ (Arberet et al., 2010):
$$Y_n(f,t) \sim \mathcal{N}_c\big(0, R_{Y_n}(f,t)\big) \tag{5}$$
where $\mathcal{N}_c(\mu, \Sigma)$ is a proper complex Gaussian distribution:
$$\mathcal{N}_c(y; \mu, \Sigma) = |\pi\Sigma|^{-1} \exp\big(-(y-\mu)^H \Sigma^{-1} (y-\mu)\big) \tag{6}$$
The covariance matrix $R_{Y_n}(f,t)$ is given by (4), where the spatial covariance matrix $R_n(f)$ is taken as a full-rank unconstrained covariance model (Duong et al., 2010b) and $v_n(f,t)$ is the time-varying source variance, which we approximate with the following factorized form:
$$v_n(f,t) = \sum_{k=1}^{K} w^{(n)}_{f,k}\, h^{(n)}_{k,t} \tag{7}$$
where $w^{(n)}_{f,k}, h^{(n)}_{k,t} \in \mathbb{R}_+$. Therefore, we can express the matrix $V_n$ with entries $[V_n]_{f,t} = v_n(f,t)$ as a product of two non-negative matrices $W_n$ and $H_n$ with entries $[W_n]_{f,k} = w^{(n)}_{f,k}$ and $[H_n]_{k,t} = h^{(n)}_{k,t}$:
$$V_n = W_n H_n \tag{8}$$
Consequently, the source spatial image $Y_n(f,t)$ can be written as the combination of $K$ components:
$$Y_n(f,t) = \sum_{k=1}^{K} Y_{n,k}(f,t) \tag{9}$$
with the covariance matrix given by:
$$R_{Y_{n,k}}(f,t) = w^{(n)}_{f,k}\, h^{(n)}_{k,t}\, R_n(f) \tag{10}$$
Here, we assume that the elements of the $H_n$ matrix, which represent the time activations of the basis components in $W_n$, are random variables with Gamma prior distribution $\mathrm{Gamma}(h^{(n)}_{k,t} \mid a^{(n)}_{k,t}, b^{(n)}_{k,t})$:
$$\mathrm{Gamma}(h \mid a, b) = \frac{h^{a-1}}{b^{a}\,\Gamma(a)} \exp\left(-\frac{h}{b}\right) \tag{11}$$
The elements of $W_n$ are assumed deterministic.
According to the above generative model and assuming that the source signals are independent, the mixture signal STFT, $X(f,t)$, is a zero-mean complex Gaussian vector with the following covariance matrix:
$$R_X(f,t) = \sum_{n=1}^{N} R_{Y_n}(f,t) \tag{12}$$
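The covariance model defined by (4), (8) and (12) can be sketched with array broadcasting as follows. The sizes and the randomly generated factors are placeholders chosen only for illustration; in particular, the random Hermitian positive-definite matrices merely stand in for full-rank spatial covariance matrices and are not estimated from any data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, F, T, K = 3, 2, 4, 5, 2      # toy sizes, chosen only for illustration

# Non-negative factors W_n (F x K) and H_n (K x T) for every source n.
W = rng.random((N, F, K)) + 0.1
H = rng.random((N, K, T)) + 0.1
V = W @ H                           # Eq. (8), batched over n: shape (N, F, T)

# Random Hermitian positive-definite R_n(f), standing in for a full-rank
# spatial covariance matrix (assumption made for this sketch only).
A = rng.standard_normal((N, F, M, M)) + 1j * rng.standard_normal((N, F, M, M))
R = A @ A.conj().transpose(0, 1, 3, 2) + 1e-3 * np.eye(M)

# Eq. (4): R_Yn(f,t) = v_n(f,t) R_n(f), shape (N, F, T, M, M).
R_Y = V[:, :, :, None, None] * R[:, :, None, :, :]
# Eq. (12): mixture covariance, shape (F, T, M, M).
R_X = R_Y.sum(axis=0)
```

Each `R_X[f, t]` is Hermitian positive definite by construction, so it can be inverted in the Wiener filter used later for reconstruction.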
The aim is to estimate the spatial covariance matrices $R_n(f)$ and the source variances $v_n(f,t)$ under the above generative model by maximizing the likelihood function. We opt for the EM algorithm, similar to Arberet et al. (2010) and Duong et al. (2010b), with the difference that we maximize the posterior probability when updating the elements of $H_n$. Therefore, the E-step of the EM algorithm remains unchanged with respect to what is given in Arberet et al. (2010). The following M-step update relations are obtained:
$$w^{(n)}_{f,k} = \frac{1}{T} \sum_{t=1}^{T} \frac{\hat{v}_{n,k}(f,t)}{h^{(n)}_{k,t}} \tag{13}$$
$$h^{(n)}_{k,t} = \frac{a^{(n)}_{k,t} - (F+1) + \sqrt{\big(a^{(n)}_{k,t} - (F+1)\big)^2 + \dfrac{4}{b^{(n)}_{k,t}} \displaystyle\sum_{f=1}^{F} \dfrac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}}}}{2 / b^{(n)}_{k,t}} \tag{14}$$
with $\hat{v}_{n,k}(f,t) = \frac{1}{M} \mathrm{tr}\big(R_n^{-1}(f)\, \hat{R}_{Y_{n,k}}(f,t)\big)$. $F$ denotes the total number of frequency bins in the TF representation. $\hat{R}_{Y_{n,k}}(f,t)$ is updated according to Eq. (11) of the E-step given in Arberet et al. (2010), which is restated in (A.2). $R_n(f)$ is updated as (Arberet et al., 2010):
$$R_n(f) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{v_n(f,t)}\, \hat{R}_{Y_n}(f,t) \tag{15}$$
More explanation on the derivation of (14) can be found in Appendix A.1.
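As a minimal sketch, the M-step updates (13) and (14) for a single source can be written in NumPy as below. The function name `m_step_gamma` and the array layout are hypothetical. The E-step statistics `v_hat` are taken as given, and using the freshly updated $W_n$ inside the update of $H_n$ is an assumption of this sketch, since the text does not state the update order.

```python
import numpy as np

def m_step_gamma(v_hat, H, a, b):
    """One M-step for a single source n under the Gamma prior.

    v_hat : (F, T, K) E-step statistics v_hat_{n,k}(f, t)
    H     : (K, T) current activations
    a, b  : (K, T) shape and scale hyperparameters of the Gamma prior
    """
    F = v_hat.shape[0]
    # Eq. (13): average of v_hat / h over the time frames -> (F, K).
    W_new = np.mean(v_hat / H.T[None, :, :], axis=1)
    # S[k, t] = sum_f v_hat(f, t) / w_{f,k}; using the fresh W is an assumption.
    S = np.sum(v_hat / W_new[:, None, :], axis=0).T
    # Eq. (14): positive root of h^2 / b + (F - a + 1) h - S = 0.
    c = a - (F + 1)
    H_new = (c + np.sqrt(c**2 + 4.0 * S / b)) / (2.0 / b)
    return W_new, H_new
```

Since `S` and `b` are positive, the square root always exceeds `|c|`, so the returned activations stay strictly positive without any explicit projection.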
After the convergence of the EM algorithm, the STFT of the source spatial image is estimated using the Wiener estimator:
$$\hat{Y}_n(f,t) = R_{Y_n}(f,t)\, \big(R_X(f,t)\big)^{-1} X(f,t) \tag{16}$$
The time-domain signals are then obtained through the inverse STFT.
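The Wiener estimator (16) can be applied to all sources and TF bins at once. The following is a sketch under the assumption that the covariances are stored as a NumPy array of shape `(N, F, T, M, M)`; the function name `wiener_estimate` is hypothetical.

```python
import numpy as np

def wiener_estimate(R_Y, X):
    """Eq. (16) for all sources and TF bins.

    R_Y : (N, F, T, M, M) source image covariances R_Yn(f, t)
    X   : (F, T, M) mixture STFT
    returns (N, F, T, M) estimated spatial images
    """
    R_X = R_Y.sum(axis=0)                      # Eq. (12)
    # Solve R_X z = X instead of forming the explicit inverse.
    z = np.linalg.solve(R_X, X[..., None])     # (F, T, M, 1)
    return (R_Y @ z)[..., 0]
```

A useful sanity check is that the estimated images add up to the mixture, because the Wiener gains $R_{Y_n} R_X^{-1}$ sum to the identity matrix over $n$.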
The proposed Bayesian framework has the advantage of avoiding overfitted or underfitted models. In addition, inspired by Mohammadiha et al. (2012), we recursively update the prior hyperparameters over subsequent time frames, as described in Section 2.1, to impose a temporal continuity constraint. Such continuity normally exists for many audio signals and for speech in particular.
2.1. Imposing the temporal continuity constraint
To make the model fit the data better, we take the temporal continuity of the activations into account: the scale hyperparameters $b^{(n)}_{k,t}$ of the prior are recursively updated at each time frame based on the following smoothing relation (Mohammadiha et al., 2012):
$$b^{(n)}_{k,t} = \lambda\, \frac{h^{(n)}_{k,t-1}}{a^{(n)}_{k,t}} + (1-\lambda)\, b^{(n)}_{k,t-1} \tag{17}$$
where the parameter $\lambda$ controls the smoothing level. The shape hyperparameter $a^{(n)}_{k,t}$ is kept fixed and time-invariant in our model.
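The recursion (17) is simple enough to trace by hand. In the sketch below, the activation track `h` and the initial scale `b[0]` are made up for illustration, while $a = 2$ and $\lambda = 0.5$ are the values reported in the experiments; the function name `update_scale` is hypothetical.

```python
import numpy as np

def update_scale(h_prev, b_prev, a, lam):
    """Eq. (17): scale of the Gamma prior at frame t from frame t - 1."""
    return lam * h_prev / a + (1.0 - lam) * b_prev

a, lam = 2.0, 0.5
h = np.array([1.0, 1.2, 0.9, 4.0, 3.8])   # made-up activation track
b = np.empty_like(h)
b[0] = 1.0                                 # arbitrary initial scale (assumption)
for t in range(1, len(h)):
    b[t] = update_scale(h[t - 1], b[t - 1], a, lam)
```

With $\lambda = 1$ the prior mean $a\,b$ of the Gamma distribution (11) equals the previous activation exactly, whereas $\lambda = 0$ leaves the scale unchanged, so $\lambda$ interpolates between tracking the last frame and ignoring it.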
2.2. Parameter initialization
The EM algorithm is very sensitive to initialization. Thus, similar to Arberet et al. (2010), we use the perturbed oracle initialization, where the parameters $R_n(f)$ and $v_n(f,t)$ are estimated from the original source spatial images as in Duong et al. (2009) and then perturbed with high-level additive noise (SNR of 5 dB). The parameters $w^{(n)}_{f,k}$ and $h^{(n)}_{k,t}$ are then initialized according to the IS-NMF Bayesian generative model with Gamma prior. The update relations are derived in Appendix A.2. We have chosen the Itakura-Saito (IS) divergence because it has been shown to perform more effectively for factorizing the power spectrogram (Févotte et al., 2009). Furthermore, the IS divergence is scale-invariant, which means that the same relative weight is given to small and large values of the source variance matrix coefficients $V_n$ ($F \times T$) in the cost function. This property is a benefit for the decomposition of audio spectra, in the sense that a bad fit of the factorization for a low-power coefficient will cost as much as a bad fit for a higher-power coefficient.
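The scale invariance of the IS divergence, $d_{IS}(x \mid y) = x/y - \log(x/y) - 1$, is easy to check numerically. The sketch below compares it with the squared error on two made-up low-power coefficients and on the same coefficients scaled by $10^4$; the function names are hypothetical.

```python
import numpy as np

def is_divergence(x, y):
    """Itakura-Saito divergence summed over all entries of x and y."""
    r = np.asarray(x, dtype=float) / np.asarray(y, dtype=float)
    return float(np.sum(r - np.log(r) - 1.0))

def squared_error(x, y):
    """Euclidean cost, shown only for contrast."""
    return float(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

x = np.array([1e-4, 2e-4])     # made-up low-power coefficients
y = np.array([2e-4, 1e-4])     # a poor fit of x
scale = 1e4                    # same misfit at a much higher power level
```

Here `is_divergence(x, y)` and `is_divergence(scale * x, scale * y)` coincide, while the squared error grows by a factor of `scale**2`, so a Euclidean cost would essentially ignore misfits in low-power TF bins.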
3. Bayesian NMF with inverse-Gamma chain prior model configuration
In the second Bayesian framework, we assume an inverse-Gamma prior distribution for the $h^{(n)}_{k,t}$ parameters as follows:
$$p\big(h^{(n)}_{k,t} \mid h^{(n)}_{k,t-1}\big) = \mathrm{IG}\big(h^{(n)}_{k,t} \mid \alpha, (\alpha+1)\, h^{(n)}_{k,t-1}\big), \qquad \mathrm{IG}(u \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, u^{-(\alpha+1)} \exp\left(-\frac{\beta}{u}\right), \quad u \geq 0 \tag{18}$$
The prior is constructed so that its mode is obtained for $h^{(n)}_{k,t} = h^{(n)}_{k,t-1}$. $\alpha$ is a shape parameter that controls the sharpness of the prior around its mode. A high value of $\alpha$ increases the sharpness and thus accentuates the smoothness of $h^{(n)}_{k}$, while a low value of $\alpha$ renders the prior more diffuse and thus less constraining (Févotte et al., 2009). In the following, $h^{(n)}_{k,1}$ is assigned the scale-invariant Jeffreys noninformative prior $p(h^{(n)}_{k,1}) \propto 1/h^{(n)}_{k,1}$. In this case, the update relations are the same as in the previous section except for the $h^{(n)}_{k,t}$ parameters, which are obtained by maximizing the posterior distribution. The update relation is obtained as:
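$$h^{(n)}_{k,t} = \frac{-\sqrt{p_1^2 - 4 p_2 p_0} - p_1}{2 p_2} \tag{19}$$
where the values of $p_0$, $p_1$ and $p_2$ are given in Table 1. The derivation of the above expression can be found in Appendix B.

Table 1
Coefficients of the order-2 polynomial to solve in order to update $h^{(n)}_{k,t}$ in Bayesian IS-NMF with an inverse-Gamma chain prior.
$$\begin{array}{l|ccc}
 & p_2 & p_1 & p_0 \\ \hline
h^{(n)}_{k,1} & -\dfrac{\alpha+1}{h^{(n)}_{k,2}} & -(F-\alpha+1) & \displaystyle\sum_f \dfrac{\hat{v}_{n,k}(f,1)}{w^{(n)}_{f,k}} \\
h^{(n)}_{k,t},\ t = 2, \ldots, T-1 & -\dfrac{\alpha+1}{h^{(n)}_{k,t+1}} & -(1+F) & \displaystyle\sum_f \dfrac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha+1)\, h^{(n)}_{k,t-1} \\
h^{(n)}_{k,T} & 0 & -(\alpha+1+F) & \displaystyle\sum_f \dfrac{\hat{v}_{n,k}(f,T)}{w^{(n)}_{f,k}} + (\alpha+1)\, h^{(n)}_{k,T-1}
\end{array}$$

Table 2
Coefficients of the order-2 polynomial to solve in order to initialize $h^{(n)}_{k,t}$ in Bayesian IS-NMF with an inverse-Gamma chain prior.
$$\begin{array}{l|ccc}
 & p_2 & p_1 & p_0 \\ \hline
h^{(n)}_{k,1} & \dfrac{\alpha+1}{h^{(n)}_{k,2}} & F-\alpha+1 & -F\, \hat{h}^{(n)}_{k,1} \\
h^{(n)}_{k,t},\ t = 2, \ldots, T-1 & \dfrac{\alpha+1}{h^{(n)}_{k,t+1}} & 1+F & -F\, \hat{h}^{(n)}_{k,t} - (\alpha+1)\, h^{(n)}_{k,t-1} \\
h^{(n)}_{k,T} & 0 & \alpha+1+F & -F\, \hat{h}^{(n)}_{k,T} - (\alpha+1)\, h^{(n)}_{k,T-1}
\end{array}$$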
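A minimal sketch of the update (19) with the Table 1 coefficients, for one component $k$ of one source, is given below. Several points are assumptions of this sketch rather than statements of the paper: the function name `update_activation_chain`, the sequential in-place sweep over frames (each frame uses the already updated left neighbour and the old right neighbour), and the treatment of the last frame, where $p_2 = 0$ makes (19) undefined and the linear equation $p_1 h + p_0 = 0$ is solved instead.

```python
import numpy as np

def update_activation_chain(S, h, alpha, F):
    """One sweep of Eq. (19) over the T frames of a single component k.

    S     : (T,) with S[t] = sum_f v_hat_{n,k}(f, t) / w_{f,k}
    h     : (T,) current activations h_{k,t}, with T of at least 2
    alpha : shape parameter of the inverse-Gamma chain prior
    F     : number of frequency bins
    """
    h = np.array(h, dtype=float)          # work on a copy
    T = h.shape[0]

    def root(p2, p1, p0):
        # Root of p2 h^2 + p1 h + p0 = 0 selected by Eq. (19).
        return (-np.sqrt(p1**2 - 4.0 * p2 * p0) - p1) / (2.0 * p2)

    # First frame (Jeffreys prior on h_{k,1}).
    h[0] = root(-(alpha + 1) / h[1], -(F - alpha + 1), S[0])
    # Interior frames: updated left neighbour, old right neighbour.
    for t in range(1, T - 1):
        h[t] = root(-(alpha + 1) / h[t + 1], -(1 + F),
                    S[t] + (alpha + 1) * h[t - 1])
    # Last frame: p2 = 0, so the polynomial is linear (assumption: h = -p0 / p1).
    p1 = -(alpha + 1 + F)
    p0 = S[T - 1] + (alpha + 1) * h[T - 2]
    h[T - 1] = -p0 / p1
    return h
```

With $p_2 < 0$ and $p_0 > 0$ the discriminant exceeds $p_1^2$, so the root chosen in (19) is the positive one and the activations remain non-negative.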
3.1. Parameter initialization
Again, we use the perturbed oracle to initialize the parameters $R_n(f)$ and $v_n(f,t)$. The parameters $w^{(n)}_{f,k}$ and $h^{(n)}_{k,t}$ are then initialized according to the IS-NMF generative model with inverse-Gamma chain prior. The update equations can again be obtained by finding the MAP estimates of the parameters. The $w_{f,k}$ update relation is the same as (A.8) and the $h_{k,t}$ coefficient is updated according to Equation 5.10 in Févotte et al. (2009):
$$h^{(n)}_{k,t} = \frac{\sqrt{p_1^2 - 4 p_2 p_0} - p_1}{2 p_2} \tag{20}$$
where
$$v_{k,f,t,n} = \left| \frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}} \right|^2 v^{(n)}_{f,t} + \frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}} \sum_{l \neq k} w^{(n)}_{f,l} h^{(n)}_{l,t}, \qquad \hat{h}^{(n)}_{k,t} = \frac{1}{F} \sum_{f=1}^{F} \frac{v_{k,f,t,n}}{w^{(n)}_{f,k}} \tag{21}$$
The values of $p_0$, $p_1$ and $p_2$ are given in Table 2.
4. Experiments
Here, we consider the stereo case ($M = 2$). Constant values are chosen for the parameters $a^{(n)}_{k,t} = 2$ of the Gamma prior and $\alpha = 5$ of the inverse-Gamma distributions. In a non-blind setting where some training data is available, these parameters can be learned instead of being fixed. The initial value for $b^{(n)}_{k,t}$ is chosen equal to the mean value of the $v^{(n)}_{f,t}$ matrix.

Fig. 1. Average blind source separation performance over stereo mixtures of three sources as a function of the reverberation time, measured in terms of (a) SDR, (b) SIR, (c) ISR, and (d) SAR.
The male and female speech signals and the music signals are taken from the dev2 dataset of the SiSEC'08 "underdetermined speech and music mixtures" task (Vincent et al., 2009). Each source signal has a duration of 10 s. The sampling rate is 16 kHz. The STFT frame size is 1024 samples with a frame shift of 512 samples. A sine window function is applied to each frame to obtain the STFT coefficients. The number of components is set to 30 for the NMF-based algorithms. Twenty iterations of the EM algorithm are run.
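The analysis settings above (1024-sample frames, 512-sample shift, sine window) can be reproduced with a short NumPy sketch. The exact window definition and the handling of the signal edges are not specified in the text, so the half-sample-offset sine window and the absence of padding below are assumptions, as are the function names.

```python
import numpy as np

def sine_window(frame):
    """Sine window with a half-sample offset (one common definition)."""
    return np.sin(np.pi * (np.arange(frame) + 0.5) / frame)

def stft_sine(x, frame=1024, hop=512):
    """STFT of a 1-D signal; returns an array of shape (bins, frames)."""
    window = sine_window(frame)
    n_frames = 1 + (len(x) - frame) // hop     # no padding (assumption)
    frames = np.stack([x[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    # Real input: keep the non-negative frequencies only.
    return np.fft.rfft(frames, axis=1).T

# A 10 s signal at 16 kHz, as in the experiments (white noise placeholder).
x = np.random.default_rng(0).standard_normal(160000)
X = stft_sine(x)
```

With this window, the squared values of two frames overlapping by 50% sum to one, so applying the same window again at synthesis gives perfect reconstruction by overlap-add away from the signal edges.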
The separation quality is measured by calculating the BSS evaluation metrics. We use the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), signal-to-artifact ratio (SAR) and source image-to-spatial distortion ratio (ISR) criteria expressed in decibels (dB), as defined in Vincent et al. (2007).
We have performed the experiments on both synthetically mixed and real-world data, as explained in Sections 4.1 and 4.2, respectively.
4.1. Experiments on synthetic mixtures
The mixture signal is synthetically generated using the Roomsim Toolbox (Campbell et al., 2005) for a rectangular room of dimensions 6.25 m × 3.75 m × 2.5 m and omnidirectional microphones. Three sources are chosen for each mixture. There are four different mixture types: (1) three male speech sources; (2) three female speech sources; (3) three nodrums music sources and (4) three wdrums music sources. The nodrums data consists of three non-percussive sources, while the wdrums data includes the drums. The source directions with respect to the microphone axis are 30°, 70° and 120°. The source and microphone heights are set to 1.1 m. The center of the microphone pair lies at (1.56 m, 1.87 m) in (x, y) coordinates. The microphone spacing is set to d = 5 cm. We evaluate our proposed methods under different reverberant conditions. The performance metrics averaged over all sources and all mixtures are plotted against the RT60 values in Fig. 1. For comparison, the results of the standard NMF (Arberet et al., 2010) and of the full-rank unconstrained model without NMF (Duong et al., 2010b) are also evaluated, as indicated in the figure. It can be observed that our proposed Bayesian approaches outperform both state-of-the-art methods in all evaluation metrics. In the full-rank model of Duong et al. (2010b), since the model parameters are independently estimated at different frequencies, the well-known source permutation problem must be addressed. The DoA-based permutation alignment scheme was used for this purpose. However, in the case of NMF and Bayesian NMF decomposition, the permutation problem is avoided thanks to the joint estimation of the parameters.

Table 3
Average performance metrics obtained over speech mixtures.
RT60     Algorithm                      SDR (dB)  SIR (dB)  ISR (dB)  SAR (dB)
130 ms   Gamma Bayesian model           9.4       14        16.8      12.1
130 ms   Inverse-Gamma Bayesian model   7.8       12.8      16.1      11.8
250 ms   Gamma Bayesian model           7.3       12        13.6      10.6
250 ms   Inverse-Gamma Bayesian model   7.1       11.6      13.1      9.5

Table 4
Average performance metrics obtained over music mixtures.
RT60     Algorithm                      SDR (dB)  SIR (dB)  ISR (dB)  SAR (dB)
130 ms   Gamma Bayesian model           8.8       14.4      13.8      9.7
130 ms   Inverse-Gamma Bayesian model   10.2      15.4      14.1      13.2
250 ms   Gamma Bayesian model           6.1       8.8       11.2      9.2
250 ms   Inverse-Gamma Bayesian model   8.1       14        14.1      11.5
The value chosen for the smoothing parameter ($\lambda = 0.5$) led to the best performance on average over all signal types. To quantify the sensitivity to this choice, we evaluated the SDR metric for 60 values of $\lambda$ uniformly spaced between 0.3 and 0.9, with the reverberation time set to 250 ms. We observed that the standard deviation of the SDR is equal to 0.13 dB and that the best value is obtained for $\lambda = 0.5$.
To determine where the superiority of each of the two Bayesian frameworks stems from for the two audio mixture categories, music and speech, we calculated the average separation performance for speech and music mixtures separately, as listed in Tables 3 and 4, respectively, for RT60 values of 130 ms and 250 ms. We observe that the Gamma Bayesian framework performs better for speech mixtures and that the inverse-Gamma chain Bayesian model is better matched to music signals. These results can be attributed to the stricter temporal continuity constraint imposed by the Gamma framework relative to the inverse-Gamma model. Hence, we may interpret that a strict temporal dependency constraint is better suited to speech signals in general. We have also observed that inserting temporal dependencies into the model may not improve the separation quality for percussive music sources because of their discontinuous nature. To illustrate this, we analyzed the effect of reducing the temporal smoothness constraint in both Bayesian models and report the obtained SDR for the wdrums mixture, which contains percussive sources, in Tables 5 and 6. RT60 is set to 250 ms. Reducing or eliminating the temporal continuity constraint in both models can even lead to better performance for the percussive sources "Drums" and "Hi-hat". However, applying the temporal continuity constraint improves the SDR of the "Bass" source.

Table 5
SDR metric (dB) for wdrums sources using the Gamma Bayesian model.
Smoothing parameter   Drums   Hi-hat   Bass
λ = 0                 4.5     5.6      10.4
λ = 0.5               3.4     4.8      11.5

Table 6
SDR metric (dB) for wdrums sources using the Inverse-Gamma Bayesian model.
Smoothing parameter   Drums   Hi-hat   Bass
α = 0.1               5.1     7.7      11.0
α = 5                 4.4     7.2      12.2
The EM convergence is demonstrated in Fig. 2a and b for the two Bayesian frameworks, showing the MAP criteria ($Q_1^{MAP}$ and $Q_3^{MAP}$ in Appendices A.1 and B, respectively) at each iteration. The MAP criterion is averaged over all mixtures.
4.2. Experiments on real-world mixtures
We perform this final experiment to compare the proposed Bayesian algorithms with the state-of-the-art BSS algorithms submitted for evaluation to SiSEC 2008 over real-world mixtures of three or four speech sources. Two mixtures were recorded for each given number of sources, using either male or female speech signals. The room reverberation time was either 130 or 250 ms and the microphone spacing was 5 cm (Vincent et al., 2009). The average SDR achieved by each algorithm is listed in Table 7 for comparison. The SDR results of all algorithms were taken from Table III in Duong et al. (2010b). The Bayesian NMF-based methods outperform the other state-of-the-art methods. The better performance achieved using the proposed Bayesian approaches can be attributed to the choice of the prior structure as well as to the modeling of the temporal smoothness. We analyzed this by weakening the smoothness constraint in both Bayesian models and comparing the obtained average SDR values for real-world mixtures with the case where smoothness is enforced.
For the Gamma model, we set $\lambda$ to 0 to eliminate the smoothness. The obtained results are reported in Table 8. They indicate that the temporal continuity constraint applied through the smoothing parameter $\lambda$ leads to improved performance in terms of the average SDR metric.
For the second, inverse-Gamma model, we set the $\alpha$ parameter to 0.1 to reduce the sharpness of the prior. The obtained results are reported in Table 9. Again, it can be observed that the smoothing constraint incorporated into the model through the inverse-Gamma prior hyperparameter $\alpha$ leads to better average separation performance in terms of SDR.

Fig. 2. EM convergence graphs for (a) Gamma Bayesian framework, (b) Inverse-Gamma chain Bayesian framework.

Table 7
Average SDR (dB) over the real-world test data of SiSEC 2008 with 5-cm microphone spacing.
RT60     Algorithm                                            3 sources   4 sources
130 ms   Gamma Bayesian model                                 4.4         3.9
130 ms   Inverse-Gamma Bayesian model                         4.6         3.4
130 ms   full-rank unconstrained model (Duong et al., 2010b)  3.3         2.8
130 ms   M. Cobos (Cobos and López, 2009)                     2.3         2.1
130 ms   M. Mandel (Mandel and Ellis, 2007)                   0.1         −3.7
130 ms   R. Weiss (Weiss and Ellis, 2010)                     2.9         2.3
130 ms   S. Araki (Araki et al., 2009)                        2.9         –
130 ms   Z. El Chami (El Chami et al., 2008)                  2.3         2.1
250 ms   Gamma Bayesian model                                 5.0         3.1
250 ms   Inverse-Gamma Bayesian model                         4.9         3.5
250 ms   full-rank unconstrained model (Duong et al., 2010b)  3.8         2.0
250 ms   M. Cobos (Cobos and López, 2009)                     2.2         1.0
250 ms   M. Mandel (Mandel and Ellis, 2007)                   0.8         1.0
250 ms   R. Weiss (Weiss and Ellis, 2010)                     2.3         1.5
250 ms   S. Araki (Araki et al., 2009)                        3.7         –
250 ms   Z. El Chami (El Chami et al., 2008)                  3.1         1.4

Table 8
Average SDR (dB) over the real-world test data for the Gamma Bayesian model.
RT60     Algorithm                             3 sources   4 sources
130 ms   Gamma Bayesian model with λ = 0       3.8         3.0
130 ms   Gamma Bayesian model with λ = 0.5     4.4         3.9
250 ms   Gamma Bayesian model with λ = 0       4.1         2.4
250 ms   Gamma Bayesian model with λ = 0.5     5.0         3.1

Table 9
Average SDR (dB) over the real-world test data for the Inverse-Gamma Bayesian model.
RT60     Algorithm                                     3 sources   4 sources
130 ms   Inverse-Gamma Bayesian model with α = 0.1     3.6         2.9
130 ms   Inverse-Gamma Bayesian model with α = 5       4.6         3.4
250 ms   Inverse-Gamma Bayesian model with α = 0.1     4.0         3.0
250 ms   Inverse-Gamma Bayesian model with α = 5       4.9         3.5
5. Conclusion
In this work, we proposed and developed two Bayesian NMF frameworks for separating stereo audio mixtures in a reverberant environment using a factorization of the source variance matrices in a full-rank model. The temporal dependencies of the audio signal are taken into account by incorporating two different prior distribution structures for the activation coefficients: (1) a Gamma prior distribution whose scale parameters are updated at each time frame and (2) an inverse-Gamma Markov chain prior distribution. The developed Bayesian methods are compared with each other as well as with the standard NMF and the standard full-rank unconstrained modeling schemes by calculating the BSS evaluation metrics. It has been shown that the proposed Bayesian approaches outperform the previous state-of-the-art methods. It can also be concluded that the inverse-Gamma chain prior structure performs better for music source separation, and that the Gamma prior structure with recursive updates is better suited to speech mixtures. Bayesian models offer both a strong theoretical framework and the possibility to manage constraints through models and
Acknowledgment
This research was funded by the KU Leuven research grant GOA/14/005 (CAMETRON).
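Appendix A
A.1. Derivation of the M-step update relations under Bayesian NMF framework with Gamma prior
MAP estimation of the temporal weights $h^{(n)}_{k,t}$ is equivalent to maximizing the cost function below:
$$Q_1^{MAP} = -\sum_{f,t,k,n} D_{KL}\big(\hat{R}_{Y_{n,k}}(f,t) \mid R_{Y_{n,k}}(f,t)\big) + \sum_{t,k,n} \left[ \big(a^{(n)}_{k,t} - 1\big) \log h^{(n)}_{k,t} - \frac{h^{(n)}_{k,t}}{b^{(n)}_{k,t}} \right], \quad f = 1{:}F,\ t = 1{:}T,\ k = 1{:}K,\ n = 1{:}N \tag{A.1}$$
where
$$\begin{aligned}
\hat{R}_{Y_{n,k}}(f,t) &= \hat{Y}_{n,k}(f,t)\, \hat{Y}^H_{n,k}(f,t) + \big(I - G_{n,k}(f,t)\big) R_{Y_{n,k}}(f,t) \\
\hat{Y}_{n,k}(f,t) &= G_{n,k}(f,t)\, X(f,t) \\
G_{n,k}(f,t) &= R_{Y_{n,k}}(f,t)\, \big(R_X(f,t)\big)^{-1} \\
R_{Y_{n,k}}(f,t) &= w^{(n)}_{f,k}\, h^{(n)}_{k,t}\, R_n(f)
\end{aligned} \tag{A.2}$$
The first term in (A.1) is equivalent to the Maximum Likelihood (ML) criterion as stated in Arberet et al. (2010). So $Q_1^{MAP}$ can be written as:
$$Q_1^{MAP} = \sum_{f,t,k,n} \left[ -\frac{1}{2}\, \mathrm{trace}\left( \frac{\hat{R}_{Y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}} \right) + \frac{1}{2} \log\det\left( \frac{\hat{R}_{Y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}} \right) + 1 \right] + \sum_{t,k,n} \left[ \big(a^{(n)}_{k,t} - 1\big) \log h^{(n)}_{k,t} - \frac{h^{(n)}_{k,t}}{b^{(n)}_{k,t}} - a^{(n)}_{k,t} \log b^{(n)}_{k,t} \right] \tag{A.3}$$
Then, the derivative of $Q_1^{MAP}$ with respect to $h^{(n)}_{k,t}$ is obtained as:
$$\frac{dQ_1^{MAP}}{dh^{(n)}_{k,t}} = \frac{\displaystyle\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} - \big(F - a^{(n)}_{k,t} + 1\big)\, h^{(n)}_{k,t} - \frac{\big(h^{(n)}_{k,t}\big)^2}{b^{(n)}_{k,t}}}{\big(h^{(n)}_{k,t}\big)^2} \tag{A.4}$$
Setting (A.4) to zero gives the update relation expressed in (14).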
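The step from (A.4) to (14) can be spelled out as follows. The shorthand $S_{k,t}$ is introduced here only for readability and does not appear in the original derivation; source superscripts $(n)$ and the indices on $a$, $b$ and $h$ are dropped for brevity.

```latex
% Shorthand (introduced for this note only):
S_{k,t} = \sum_{f=1}^{F} \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}}

% Setting the numerator of (A.4) to zero gives a quadratic in h:
\frac{1}{b}\, h^2 + (F - a + 1)\, h - S_{k,t} = 0

% Since S_{k,t} \ge 0 and b > 0, the discriminant is at least (F - a + 1)^2,
% so exactly one root is non-negative:
h = \frac{-(F - a + 1) + \sqrt{(F - a + 1)^2 + 4 S_{k,t} / b}}{2 / b}
  = \frac{a - (F + 1) + \sqrt{\big(a - (F + 1)\big)^2 + 4 S_{k,t} / b}}{2 / b}
```

The last expression is (14); the other root of the quadratic is non-positive and is discarded because the activations must be non-negative.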
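A.2. Initializing the parameters under Bayesian NMF framework with Gamma prior
We assume the IS-NMF model with Gamma prior for the initialization step. Thus, we obtain the MAP estimate by minimizing the $Q_2^{MAP}$ function:
$$Q_2^{MAP} = \sum_{f,t,k,n} D_{IS}\big(v_{k,f,t,n} \mid w^{(n)}_{f,k} h^{(n)}_{k,t}\big) - \sum_{t,k,n} \log P\big(h^{(n)}_{k,t} \mid a^{(n)}_{k,t}, b^{(n)}_{k,t}\big) \tag{A.5}$$
where
$$v_{k,f,t,n} = \left| \frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}} \right|^2 v^{(n)}_{f,t} + \frac{w^{(n)}_{f,k} h^{(n)}_{k,t}}{\sum_l w^{(n)}_{f,l} h^{(n)}_{l,t}} \sum_{l \neq k} w^{(n)}_{f,l} h^{(n)}_{l,t}.$$
Therefore, the partial derivatives of (A.5) are obtained as:
$$\frac{dQ_2^{MAP}}{dw^{(n)}_{f,k}} = \frac{T}{w^{(n)}_{f,k}} - \frac{1}{\big(w^{(n)}_{f,k}\big)^2} \sum_t \frac{v_{k,f,t,n}}{h^{(n)}_{k,t}} \tag{A.6}$$
$$\frac{dQ_2^{MAP}}{dh^{(n)}_{k,t}} = \frac{F}{h^{(n)}_{k,t}} - \frac{1}{\big(h^{(n)}_{k,t}\big)^2} \sum_f \frac{v_{k,f,t,n}}{w^{(n)}_{f,k}} - \frac{a^{(n)}_{k,t} - 1}{h^{(n)}_{k,t}} + \frac{1}{b^{(n)}_{k,t}} \tag{A.7}$$
where the first two terms on the right-hand side of each equation are equal to the gradients of the ML criterion ($Q_{ML}$). Therefore, the update relations are obtained as below:
$$w^{(n)}_{f,k} = \frac{1}{T} \sum_t \frac{v_{k,f,t}}{h^{(n)}_{k,t}} \tag{A.8}$$
$$h^{(n)}_{k,t} = \frac{\sqrt{p_1^2 - 4 p_2 p_0} - p_1}{2 p_2}, \qquad p_2 = \frac{1}{b^{(n)}_{k,t}}, \quad p_1 = F - a^{(n)}_{k,t} + 1, \quad p_0 = -\sum_f \frac{v_{k,f,t}}{w^{(n)}_{f,k}} \tag{A.9}$$
The scale parameters $b^{(n)}_{k,t}$ of the prior are recursively updated at each time frame using (17).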
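One iteration of this initialization for a single source can be sketched as follows. The function names, the array layout, and the order of the updates (first $W_n$ with the old $H_n$, then $H_n$ with the new $W_n$) are assumptions of this sketch; the recursive update of $b$ by (17) is left out to keep the example short.

```python
import numpy as np

def posterior_component_power(V, W, H):
    """v_{k,f,t} as defined under (A.5) for one source.

    V : (F, T) source variances v_{f,t};  W : (F, K);  H : (K, T)
    returns (K, F, T)
    """
    comp = W.T[:, :, None] * H[:, None, :]       # w_{f,k} h_{k,t}, (K, F, T)
    total = comp.sum(axis=0, keepdims=True)      # sum over components l
    gain = comp / total
    # total - comp is the sum over all components l different from k.
    return gain**2 * V[None, :, :] + gain * (total - comp)

def init_step_gamma(V, W, H, a, b):
    """One pass of (A.8) and (A.9); a and b have shape (K, T)."""
    F = V.shape[0]
    v = posterior_component_power(V, W, H)
    W_new = np.mean(v / H[:, None, :], axis=2).T             # (A.8), (F, K)
    p2 = 1.0 / b
    p1 = F - a + 1.0
    p0 = -np.sum(v / W_new.T[:, :, None], axis=1)            # (K, T)
    H_new = (np.sqrt(p1**2 - 4.0 * p2 * p0) - p1) / (2.0 * p2)   # (A.9)
    return W_new, H_new
```

When `V` equals `W @ H` exactly, `posterior_component_power` returns the individual terms $w_{f,k} h_{k,t}$ and the update of `W` leaves it unchanged, which gives a convenient consistency check for an implementation.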
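Appendix B. Derivation of the M-step update relations under Bayesian NMF framework with inverse-Gamma Markov chain prior
In the case of the Bayesian framework with inverse-Gamma chain prior, the MAP criterion to be maximized with respect to $h^{(n)}_{k,t}$ can be written as:
$$Q_3^{MAP} = \sum_{f,t,k,n} \left[ -\frac{1}{2}\, \mathrm{trace}\left( \frac{\hat{R}_{Y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}} \right) + \frac{1}{2} \log\det\left( \frac{\hat{R}_{Y_{n,k}}(f,t)\, R_n^{-1}(f)}{w^{(n)}_{f,k} h^{(n)}_{k,t}} \right) + 1 \right] + \log p\big(h^{(n)}_{k,t} \mid h^{(n)}_{k,t-1}\big) + \log p\big(h^{(n)}_{k,t+1} \mid h^{(n)}_{k,t}\big) \tag{B.1}$$
where the last two terms are the only prior factors that depend on $h^{(n)}_{k,t}$. So the gradient of (B.1) is obtained as follows:
$$\frac{dQ_3^{MAP}}{dh^{(n)}_{k,t}} = \begin{cases}
\dfrac{\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha - 1 - F)\, h^{(n)}_{k,t} - \frac{(\alpha+1) \big(h^{(n)}_{k,t}\big)^2}{h^{(n)}_{k,t+1}}}{\big(h^{(n)}_{k,t}\big)^2}, & t = 1 \\[3ex]
\dfrac{\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha+1)\, h^{(n)}_{k,t-1} - (F+1)\, h^{(n)}_{k,t} - \frac{(\alpha+1) \big(h^{(n)}_{k,t}\big)^2}{h^{(n)}_{k,t+1}}}{\big(h^{(n)}_{k,t}\big)^2}, & t = 2, \ldots, T-1 \\[3ex]
\dfrac{\sum_f \frac{\hat{v}_{n,k}(f,t)}{w^{(n)}_{f,k}} + (\alpha+1)\, h^{(n)}_{k,t-1} - (\alpha+1+F)\, h^{(n)}_{k,t}}{\big(h^{(n)}_{k,t}\big)^2}, & t = T
\end{cases} \tag{B.2}$$
Setting the above gradient to zero leads to the update equation stated in (19).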
References
Araki, S., Nakatani, T., Sawada, H., Makino, S., 2009. Stereo source separation and source counting with MAP estimation with Dirichlet prior considering spatial aliasing problem. In: Independent Component Analysis and Signal Separation. Springer, pp. 742–750.
Arberet, S., Ozerov, A., Duong, N.Q., Vincent, E., Gribonval, R., Bimbot, F., Vandergheynst, P., 2010. Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation. In: Proceedings of the 10th International Conference on Information Sciences Signal Processing and their Applications (ISSPA), 2010. IEEE, pp. 1–4.
Burred, J.J., Sikora, T., 2006. Comparison of frequency-warped representations for source separation of stereo mixtures. In: Audio Engineering Society Convention 121. Audio Engineering Society.
Campbell, D., Palomaki, K., Brown, G., 2005. A MATLAB simulation of "shoebox" room acoustics for use in research and teaching. Comput. Inf. Syst. ICASSP 2010, 9–12.
Cobos, M., López, J., 2009. Blind separation of underdetermined speech mixtures based on DOA segmentation. IEEE Trans. Audio, Speech, Lang. Process.
Duong, N.Q., Vincent, E., Gribonval, R., 2009. Spatial covariance models for under-determined reverberant audio source separation. In: Applications of Signal Processing to Audio and Acoustics. IEEE, pp. 129–132.
Duong, N.Q., Vincent, E., Gribonval, R., 2010a. Under-determined convolutive blind source separation using spatial covariance models. In: Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2010. IEEE, pp. 9–12.
Duong, N.Q., Vincent, E., Gribonval, R., 2010b. Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Trans. Audio, Speech, Lang. Process., 18 (7), 1830–1840.
Duong, N.Q., Vincent, E., Gribonval, R., 2010c. Under-determined reverberant audio source separation using local observed covariance and auditory-motivated time-frequency representation. In: Latent Variable Analysis and Signal Separation. Springer, pp. 73–80.
Duong, N.Q., Vincent, E., Gribonval, R., 2013. Spatial location priors for Gaussian model based reverberant audio source separation. EURASIP J. Adv. Signal Process. 2013, 1–11.
El Chami, Z., Pham, D., Serviere, C., Guerin, A., 2008. A new model based underdetermined source separation. In: Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC), p. 147.
Essid, S., Févotte, C., 2013. Smooth nonnegative matrix factorization for unsupervised audiovisual document structuring. IEEE Trans. Multimedia 415–425.
Févotte, C., 2011. Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. IEEE, pp. 1980–1983.
Févotte, C., Bertin, N., Durrieu, J.-L., 2009. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 793–830.
Févotte, C., Le Roux, J., Hershey, J.R., 2013. Non-negative dynamical system with application to speech and audio. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013. IEEE, pp. 3158–3162.
Kim, M., Smaragdis, P., 2013. Single channel source separation using smooth nonnegative matrix factorization with Markov random fields. In: Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2013. IEEE, pp. 1–6.
Mandel, M.I., Ellis, D.P., 2007. EM localization and separation using interaural level and phase cues. In: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2007. IEEE, pp. 275–278.
Mohammadiha, N., Smaragdis, P., Leijon, A., 2013. Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Trans. Audio, Speech, Lang. Process. 2140–2151.
Mohammadiha, N., Taghia, J., Leijon, A., 2012. Single channel speech enhancement using Bayesian NMF with recursive temporal updates of prior distributions. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012. IEEE, pp. 4561–4564.
Nakano,M.,LeRoux,J.,Kameoka,H.,Kitano,Y.,Ono,N.,Sagayama,S., 2010.Nonnegative matrix factorization with Markov-chained bases for modelingtime-varyingpatternsinmusicspectrograms.In:LatentVariable AnalysisandSignalSeparation.Springer,pp.149–156.
Nikunen, J., Virtanen, T., 2014. Direction of arrival based spatial covari-ancemodelforblindsoundsourceseparation.IEEE/ACMTrans.Audio, SpeechLang.Process.22(3),727–739.
Ozerov,A.,Févotte,C.,2010.Multichannelnonnegativematrixfactorization inconvolutivemixturesforaudiosourceseparation.IEEETrans.Audio, Speech,Lang.Process.18(3),550–563.
Sawada, H., Kameoka, H., Araki, S., Ueda, N., 2013. Multichannel ex-tensionsofnon-negativematrixfactorization withcomplex-valueddata. IEEETrans.Audio,SpeechLang.Process.21(5),971–982.
Smaragdis, P., Fevotte, C., Mysore,G., Mohammadiha, N.,Hoffman, M., 2014.Staticanddynamicsourceseparationusingnonnegative factoriza-tions:aunifiedview.SignalProcess.Mag.,IEEE,66–75.
Vincent, E., 2006. Musicalsource separation using time-frequencysource priors.Audio,Speech,Lang.Process.,IEEETrans.91–98.
Vincent, E., Araki, S., Bofill, P., 2009. The2008signal separation evalu-ationcampaign:a community-basedapproachtolarge-scaleevaluation. In: Independent Component Analysis and Signal Separation. Springer, pp.734–741.
Vincent,E.,Sawada,H.,Bofill,P.,Makino,S.,Rosca,J.P.,2007.Firststereo audiosourceseparationevaluationcampaign:data,algorithmsandresults. In: Independent Component Analysis and Signal Separation. Springer, pp.552–559.
Virtanen,T.,2007.Monauralsoundsourceseparationbynonnegativematrix factorizationwithtemporalcontinuityandsparsenesscriteria.IEEETrans. Audio,Speech,Lang.Process.15(3),1066–1074.
Weiss,R.J.,Ellis,D.P.,2010.Speechseparationusingspeaker-adapted eigen-voicespeechmodels.Comput.SpeechLang.16–29.
Wilson,K.W.,Raj, B.,Smaragdis, P., 2008.Regularizednon-negative ma-trixfactorization with temporaldependencies forspeechdenoising. In: Interspeech,pp.411–414.
Winter,S.,Kellermann,W.,Sawada,H.,Makino,S.,2007.Map-based un-derdeterminedblindsource separationof convolutive mixturesby hier-archicalclusteringandl1-normminimization.EURASIPJ.Appl.Signal Process2007(1),81–92.
Yilmaz,O.,Rickard,S.,2004.Blindseparationofspeechmixturesvia time-frequencymasking.IEEETrans.SignalProcess.52(7),1830–1847.