1.5 DSP methods for Source Separation
1.5.6 Shifted Non-negative Matrix Factorisation
Shifted Non-negative Matrix Factorisation (SNMF) was proposed as a means of avoiding the clustering problem, provided that the frequency basis functions have a log-frequency resolution. The SNMF algorithm [44] assumes that the timbre of an instrument does not change across the pitches it produces. The underlying principle is well motivated by the fact that, in Western music, the fundamental frequencies of adjacent semitones are geometrically spaced by a factor of $\sqrt[12]{2}$. Therefore, translated versions of a frequency basis function D of a particular instrument can be used to approximately cover the entire range of notes played by that instrument. In particular, if the frequency bins are a semitone apart, a shift up or down by one bin will approximate the frequency basis function of the note a semitone higher or lower, respectively. However, a log-frequency resolution of the frequency basis functions is required to exploit this shift-invariant property. A constant Q transform (CQT) can be used to obtain the log-frequency resolution.
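The shift-invariance argument can be illustrated with a minimal sketch. It assumes an idealised log-frequency axis with exactly one bin per semitone and a toy harmonic template with 1/h amplitude decay; the names `F_MIN`, `N_BINS` and `harmonic_template` are hypothetical and are not taken from [44].

```python
import numpy as np

BINS_PER_OCTAVE = 12   # one bin per semitone (assumption of this sketch)
N_BINS = 60            # five octaves of log-spaced bins
F_MIN = 55.0           # centre frequency of bin 0 in Hz (hypothetical choice)

# Geometrically spaced centre frequencies: each bin is 2**(1/12) above the last.
centre_freqs = F_MIN * 2.0 ** (np.arange(N_BINS) / BINS_PER_OCTAVE)

def harmonic_template(f0, n_harmonics=4):
    """Place decaying harmonics of f0 on the nearest log-frequency bins."""
    template = np.zeros(N_BINS)
    for h in range(1, n_harmonics + 1):
        bin_index = int(round(BINS_PER_OCTAVE * np.log2(h * f0 / F_MIN)))
        if 0 <= bin_index < N_BINS:
            template[bin_index] += 1.0 / h
    return template

note = harmonic_template(110.0)                         # A2
semitone_up = harmonic_template(110.0 * 2 ** (1 / 12))  # one semitone above A2

# Shift the A2 template up by one bin and compare it with the higher note.
shifted = np.roll(note, 1)
shifted[0] = 0.0  # discard the value that np.roll wrapped around

print(np.allclose(shifted, semitone_up))
```

On a linear frequency axis the harmonics of the two notes would be spaced differently, so no single shift would map one template onto the other; this is why the log-frequency resolution is needed.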
Notations
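We now define the parameters and notation used in the SNMF model. The tensor notation used to define the SNMF model [44] follows the conventions described in [90]. Calligraphic upper-case letters (e.g. $\mathcal{R}$) denote tensors of any given dimension. Tensor elements are indexed using lower-case letters, such as $i$ and $j$, and are denoted by $\mathcal{R}(i, j)$. A contracted tensor product of two tensors of finite dimension is defined as follows. Let $\mathcal{R}$ be a tensor of dimension $I_1 \times \cdots \times I_S \times L_1 \times \cdots \times L_P$ and $\mathcal{D}$ a tensor of dimension $I_1 \times \cdots \times I_S \times J_1 \times \cdots \times J_N$. Equation 1.74 then denotes the contracted tensor multiplication of $\mathcal{R}$ and $\mathcal{D}$ along their first $S$ modes:
$$\langle \mathcal{R}\mathcal{D} \rangle_{\{1:S,\,1:S\}}(l_1, \ldots, l_P, j_1, \ldots, j_N) = \sum_{i_1=1}^{I_1} \cdots \sum_{i_S=1}^{I_S} \mathcal{R}(i_1, \ldots, i_S, l_1, \ldots, l_P)\, \mathcal{D}(i_1, \ldots, i_S, j_1, \ldots, j_N) = \mathcal{Z}(l_1, \ldots, l_P, j_1, \ldots, j_N) \tag{1.74}$$
The modes along which the tensors $\mathcal{R}$ and $\mathcal{D}$ are to be multiplied are specified in curly brackets. The resultant tensor $\mathcal{Z}$ will be of dimension $L_1 \times \cdots \times L_P \times J_1 \times \cdots \times J_N$.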
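As an illustration of the contracted tensor product, the following sketch uses NumPy's `np.tensordot`, which sums over the listed pairs of modes. The tensor sizes are arbitrary toy values chosen for this example, and the explicit loop is included only to show that the two computations agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# R has shape (I1, I2, L1) and D has shape (I1, I2, J1, J2): two shared modes.
R = rng.random((3, 4, 5))
D = rng.random((3, 4, 6, 2))

# Contract the first two modes of R with the first two modes of D.
Z = np.tensordot(R, D, axes=([0, 1], [0, 1]))

# The same product written as explicit sums over the shared indices.
Z_loops = np.zeros((5, 6, 2))
for i1 in range(3):
    for i2 in range(4):
        Z_loops += R[i1, i2][:, None, None] * D[i1, i2][None, :, :]

print(Z.shape)                  # modes left over: (L1, J1, J2)
print(np.allclose(Z, Z_loops))
```

The result keeps only the modes that were not contracted, ordered with the remaining modes of the first tensor followed by those of the second.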
SNMF Algorithm (SNMFcqt)
As noted previously, the Shifted NMF requires a log-frequency resolution of the frequency basis functions. Here, the CQT is used to obtain this resolution. A CQT spectrogram $C$ can be obtained by multiplying the transform matrix $Y$ (see equation 1.14) by $X$, where $X$ is the linear-domain magnitude spectrogram:
$$C = YX \tag{1.75}$$
Having obtained a constant Q spectrogram $C$ of size $n \times m$, where $n$ is the number of frequency bins and $m$ is the number of time frames, SNMF can be used to separate the instrument basis functions. For a given number of sources $p$, the spectrogram $C$ can be decomposed using the SNMF model into tensors as shown in equation 1.76:
$$C \approx \langle \langle \mathcal{R}\mathcal{D} \rangle_{\{3,1\}} \mathcal{H} \rangle_{\{2:3,\,1:2\}} \tag{1.76}$$
where $\mathcal{R}$ is a translation tensor of dimension $n \times k \times n$ for $k$ possible translations. $\mathcal{R}$ translates the instrument basis functions in $\mathcal{D}$ up or down in frequency to approximate the various notes played by the instrument in question. Tensor $\mathcal{D}$, of size $n \times p$, contains one frequency (instrument) basis function for each source. $\mathcal{H}$ is a tensor of size $k \times p \times m$ such that $\mathcal{H}(i, s, :)$ represents the time envelope for the $i$th translation of the $s$th source, which indicates when a given note is played by a given source.
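The tensor shapes in the SNMF model can be checked with a small sketch. It is an illustration under stated assumptions, not the implementation of [44]: the translation tensor is built here from shifted identity matrices, only upward shifts of 0 to k-1 bins are used, and all sizes and random values are hypothetical toy choices.

```python
import numpy as np

n_bins, n_shifts, n_sources, n_frames = 24, 5, 2, 8   # n, k, p, m (toy sizes)
rng = np.random.default_rng(1)

# Translation tensor R of shape (n, k, n): slice R[:, i, :] moves a spectrum
# up by i bins (upward-only shifts are an assumption of this sketch).
R = np.zeros((n_bins, n_shifts, n_bins))
for i in range(n_shifts):
    R[:, i, :] = np.eye(n_bins, k=-i)   # ones on the i-th sub-diagonal

D = rng.random((n_bins, n_sources))             # one basis function per source
H = rng.random((n_shifts, n_sources, n_frames))  # activations per shift and source

# <R D>_{3,1}: contract mode 3 of R with mode 1 of D, giving shape (n, k, p).
RD = np.tensordot(R, D, axes=([2], [0]))

# <<R D> H>_{2:3,1:2}: contract the (k, p) modes, giving shape (n, m).
C_hat = np.tensordot(RD, H, axes=([1, 2], [0, 1]))

print(RD.shape, C_hat.shape)
```

Each slice `RD[:, i, s]` is the basis function of source `s` shifted by `i` bins, so the model approximates every note of a source with shifted copies of a single column of `D`.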
For a given number of sources $p$, SNMF decomposes the constant Q spectrogram $C$ into instrument basis functions and sets of associated time activations that can be used to approximately represent $C$. The cost function used to estimate the tensors $\mathcal{D}$ and $\mathcal{H}$ is the same as that used for NMF. The number of translations $k$ is chosen empirically so that the translated (frequency-shifted) versions of an instrument basis function approximately cover all the notes played by that instrument in the mixture. The need to cluster NMF basis functions is thus avoided, as each instrument is now represented by a single instrument basis function. As noted above, this relies on a log-frequency spectrogram, which in music processing is typically obtained with a CQT.
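As a rough illustration of how the fit between $C$ and its approximation can be measured, the sketch below evaluates the generalised Kullback-Leibler divergence, one of the cost functions commonly used with NMF. Which NMF cost function is meant here is defined elsewhere in the text, so this choice, the function name `generalised_kl` and the small constant `eps` are all assumptions of the example.

```python
import numpy as np

def generalised_kl(C, C_hat, eps=1e-12):
    """Generalised KL divergence D(C || C_hat), summed over all entries."""
    C = np.asarray(C, dtype=float)
    C_hat = np.asarray(C_hat, dtype=float)
    # eps guards the logarithm against zero-valued entries.
    return float(np.sum(C * np.log((C + eps) / (C_hat + eps)) - C + C_hat))

C = np.array([[1.0, 2.0], [0.5, 0.0]])

perfect = generalised_kl(C, C)        # identical matrices give zero cost
worse = generalised_kl(C, C + 0.3)    # a biased approximation gives a positive cost

print(perfect, worse)
```

In a factorisation, `C_hat` would be the model output, and the updates of the basis functions and activations would be chosen to reduce this value.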
The SNMF algorithm has two notable drawbacks. Firstly, the spectral envelope of the notes played by an instrument changes with pitch; the assumption that the timbre of an instrument remains unchanged regardless of pitch is therefore not true in general. However, this approximation holds reasonably well over a limited pitch range.
Secondly, the lack of an inverse CQT degrades the separation quality of the reconstructed signal. Nevertheless, the shift-invariant property of the instrument basis functions can be exploited to capture all the notes played by pitched instruments in the audio mixture. We attempt to address these limitations and develop improved SNMF algorithms for monaural sound source separation in Chapters 2 and 3.