1.4 Time Frequency Representations (TFR)
1.4.2 The Constant Q Spectrogram
As discussed in section 1.3.4, sounds are composed of harmonic frequency components. The positions of these components in the spectral domain play an important role in the analysis of a piece of music. Consider the harmonics $kF_o$, i.e. $1F_o, 2F_o, 3F_o, \ldots$, of a fundamental frequency $F_o$. The absolute positions of the harmonics depend on the fundamental frequency $F_o$. However, the relative positions of the harmonics are independent of the fundamental frequency when plotted on a logarithmic scale. This can be summarised by the following equation.
$$D_{nm} = \log(nF_o) - \log(mF_o) = \log\frac{nF_o}{mF_o} = \log\frac{n}{m} = \text{constant} \tag{1.9}$$
where $F_o$ denotes the fundamental frequency, $D_{nm}$ gives the logarithmic distance between the $n$th and $m$th harmonics, and $nF_o$ and $mF_o$ are the $n$th and $m$th harmonics of $F_o$, respectively. Equation 1.9 shows that the logarithmic distance between any two harmonics is independent of the fundamental frequency. Thus, the harmonics in sound, and specifically in music, form a pattern that can be investigated using frequency analysis.
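Equation 1.9 can be checked numerically with a short sketch. The function name `log_distance` and the example fundamentals are chosen here purely for illustration and do not come from the text.

```python
import math

def log_distance(n, m, f0):
    """Logarithmic distance D_nm between the n-th and m-th harmonics of f0 (eq. 1.9)."""
    return math.log(n * f0) - math.log(m * f0)

# Example fundamentals in Hz (arbitrary values chosen for illustration).
for f0 in (110.0, 220.0, 261.63):
    d = log_distance(3, 2, f0)
    print(f"F0 = {f0:7.2f} Hz  D_32 = {d:.6f}  log(3/2) = {math.log(3 / 2):.6f}")
```

Up to floating-point rounding, every row reports the same distance, which is the shift invariance on a logarithmic axis that the equation expresses.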
However, the linear, uniform frequency spacing of the DFT does not clearly show this shift-invariant property of harmonics. Suppose the DFT is computed with a sampling frequency of 44.1 kHz and a window size of 2048 samples, which gives a constant frequency resolution of about 21.5 Hz. With a bin spacing of 21.5 Hz, many notes at lower frequencies, e.g. around 150 Hz, cannot be resolved, because adjacent notes there lie closer together than one bin. On the other hand, for notes around 3 kHz, far more frequency components are evaluated than are needed to represent the notes. Thus, for musical analysis, a time-frequency representation based on the DFT or STFT is not always suitable. Instead, we need a TFR in which the resolution of the frequency bins is geometrically related to frequency. In addition, the TFR should give a constant pattern of frequency components (harmonics) across notes for analysis and musical signal processing. This can be achieved by maintaining a constant ratio (Q) of the fundamental frequency to the frequency resolution.
$$\frac{f}{\delta f} = Q \tag{1.10}$$
where, δf denotes the frequency resolution or the bandwidth of the frequency bin and f represents the corresponding fundamental frequency.
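The mismatch between a uniform DFT grid and musical pitch can be made concrete with a small calculation. The sketch below uses the sampling frequency and window size quoted in the text; the helper `semitone_gap` and the equal-tempered semitone ratio of 2^(1/12) are assumptions introduced here for illustration.

```python
import math

fs = 44100.0          # sampling frequency in Hz (from the text)
n_fft = 2048          # DFT window size in samples (from the text)
delta_f = fs / n_fft  # uniform DFT bin spacing, about 21.5 Hz

def semitone_gap(f):
    """Distance in Hz from f to the note one equal-tempered semitone above it."""
    return f * (2 ** (1 / 12) - 1)

for f in (150.0, 3000.0):
    gap = semitone_gap(f)
    print(f"f = {f:6.0f} Hz: semitone gap = {gap:6.1f} Hz, "
          f"DFT bins per semitone = {gap / delta_f:.2f}, "
          f"f / delta_f = {f / delta_f:.1f}")
```

Under these assumptions a semitone near 150 Hz spans less than half a DFT bin, while a semitone near 3 kHz spans roughly eight bins, so the ratio f/δf is far from constant across the musical range.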
To obtain this logarithmic resolution in TFR, a Constant Q transform (CQT) is typically used. The constant Q transform of a discrete-time signal x[n] can be calculated by using the following equation:
$$X_{cq}[k] = \sum_{n=0}^{N[k]-1} W[n,k]\, x[n]\, e^{-j\omega_k n} \tag{1.11}$$
where $X_{cq}[k]$ is the $k$th component of the constant Q transform of the input signal $x[n]$, $W[n,k]$ is a window function of length $N[k]$ for each value of $k$, and $k = 1, 2, \ldots, K$ indexes the frequency bins in the constant Q domain. The CQT was first proposed by J. C. Brown [45], inspired by earlier works including [10, 11, 12].
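A direct evaluation of equation 1.11 for a single frame might look like the following sketch. Several details are assumptions rather than choices taken from the text: the window length is set to N[k] = ceil(Q fs / f_k), the window is a Hamming window scaled by 1/N[k], the digital frequency is ω_k = 2π f_k / fs, and the function name `direct_cqt` and the example bin frequencies are hypothetical.

```python
import numpy as np

def direct_cqt(x, fs, freqs, q):
    """Evaluate eq. 1.11 directly for one frame starting at sample 0.

    Assumptions (not stated in the text): N[k] = ceil(q * fs / f_k),
    a Hamming window scaled by 1/N[k], and omega_k = 2*pi*f_k/fs.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros(len(freqs), dtype=complex)
    for k, fk in enumerate(freqs):
        n_k = int(np.ceil(q * fs / fk))          # window length N[k]
        if n_k > len(x):
            raise ValueError("signal is shorter than the longest window")
        n = np.arange(n_k)
        window = np.hamming(n_k) / n_k           # W[n, k]
        omega_k = 2.0 * np.pi * fk / fs          # digital frequency of bin k
        out[k] = np.sum(window * x[:n_k] * np.exp(-1j * omega_k * n))
    return out

# Usage: a 440 Hz sine analysed at three hypothetical bin frequencies.
fs = 8000.0
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * 440.0 * t)
freqs = [220.0, 440.0, 880.0]
mags = np.abs(direct_cqt(x, fs, freqs, q=17.0))
print(dict(zip(freqs, np.round(mags, 4))))
```

Because the window length shrinks as the bin frequency rises, each bin sees the same number of cycles of its own frequency, which is what keeps Q constant. The bin at 440 Hz should carry by far the largest magnitude for this test tone.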
Figure 1.4: Constant Q Spectrogram of an audio mixture signal
Figure 1.4 shows the constant Q magnitude spectrogram of a test signal containing music signals of two pitched instruments.
According to the equal-tempered chromatic scale [52], the fundamental frequencies of adjacent notes are geometrically spaced by a factor of $\sqrt[12]{2}$. Thus, frequency bins spaced by a ratio of $\sqrt[12]{2}$ cover all the notes for musical analysis. Therefore, the frequency of the $k$th spectral component can be calculated using
$$f_k = \left(\sqrt[12]{2}\right)^k f_{min} \tag{1.12}$$
where $f_{min}$ is the lowest frequency, chosen manually. For our research, we have chosen $f_{min}$ to be 55 Hz. The Q factor of a filter is calculated using equation 1.10. For semitone spacing, the Q factor evaluates to approximately 17, as in [84]. Direct evaluation of equation 1.11 is computationally inefficient, as detailed in [84]. Here, we make use of Parseval's equation to calculate the CQT coefficients.
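The bin frequencies of equation 1.12 and the Q factor for semitone spacing can be reproduced with a few lines. The sketch assumes that δf in equation 1.10 is taken as the gap between a bin and the next one up, which gives Q = 1/(2^(1/12) − 1); the function name `bin_frequency` is illustrative.

```python
import math

f_min = 55.0                      # lowest analysis frequency in Hz (from the text)
ratio = 2 ** (1 / 12)             # equal-tempered semitone ratio

def bin_frequency(k, f_min=f_min):
    """Centre frequency of the k-th CQT bin, eq. 1.12."""
    return (ratio ** k) * f_min

# Assumption: delta_f is the gap to the next bin, so Q = f / delta_f = 1 / (2^(1/12) - 1).
q = 1.0 / (ratio - 1.0)

print([round(bin_frequency(k), 2) for k in (0, 12, 24, 36)])
print(round(q, 2))
```

Every twelfth bin doubles the frequency, and under the stated assumption Q comes out at about 16.8, which rounds to the value of 17 quoted above.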
Let $x[n]$ and $w[n]$ be discrete-time functions, and let $X(f)$ and $W(f)$ denote the DFTs of $x[n]$ and $w[n]$, respectively. Then, according to Parseval's theorem,
$$\sum_{n=0}^{N-1} x[n]\, w^*[n] = \frac{1}{N} \sum_{f=0}^{N-1} X(f)\, W^*(f) \tag{1.13}$$
where $W^*(f)$ denotes the complex conjugate of $W(f)$. Thus, the CQT can be efficiently calculated in the Fourier domain by applying Parseval's equation to the DFT coefficients $X(f)$ and the spectral kernels $Y(f)$ (as denoted in [84]). Here, $Y(f)$ contains the DFT coefficients of the time-domain complex exponentials $y[n]$ corresponding to the fundamental frequencies of the (geometrically spaced) notes present in music. These complex exponentials are used to modulate the time-domain signal to obtain the logarithmically scaled frequency basis functions. The CQT can then be obtained using the following equation:
$$X_{cq}[k] = \sum_{f=0}^{N-1} X(f)\, Y^*(f) \tag{1.14}$$
where $Y^*(f)$ is the complex conjugate of $Y(f)$. For simplicity, we denote the spectral kernels $Y(f)$ as the transform matrix $Y$ and the linear spectral coefficients $X(f)$ as $X$; the constant Q transform can then be formulated as
$$X_{cq} = Y^* X \tag{1.15}$$
where $Y^*$ is the complex conjugate of $Y$. However, a drawback of the CQT is that it has no true inverse, so perfect reconstruction of the original signal is generally not possible. Another drawback is that the CQT is computationally more intensive and more complex than the DFT or the STFT. Despite these limitations, time-frequency representations based on the CQT give a far better understanding of musical signals and can potentially be used for musical signal processing.
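The Fourier-domain computation of equations 1.13 to 1.15 can be checked against a time-domain inner product with the sketch below. It is an illustration under stated assumptions, not the implementation of [84]: the kernels use a Hamming window scaled by 1/N[k] with N[k] = ceil(Q fs / f_k), they are aligned to the start of the frame, the 1/N factor of equation 1.13 is applied explicitly, no thresholding is applied to make the kernel matrix sparse, and all function names are hypothetical.

```python
import numpy as np

def temporal_kernel(fk, fs, q, n_fft):
    """Windowed complex exponential y[n] for bin frequency fk, zero-padded to n_fft.

    Assumptions: N[k] = ceil(q * fs / fk), Hamming window scaled by 1/N[k].
    """
    n_k = int(np.ceil(q * fs / fk))
    if n_k > n_fft:
        raise ValueError("n_fft is shorter than the kernel window")
    n = np.arange(n_k)
    y = np.zeros(n_fft, dtype=complex)
    y[:n_k] = (np.hamming(n_k) / n_k) * np.exp(2j * np.pi * fk * n / fs)
    return y

def spectral_kernels(freqs, fs, q, n_fft):
    """Matrix Y whose rows are the DFTs of the temporal kernels."""
    return np.array([np.fft.fft(temporal_kernel(fk, fs, q, n_fft)) for fk in freqs])

def cqt_via_parseval(x_frame, kernels):
    """CQT of one frame as an inner product in the DFT domain."""
    n_fft = kernels.shape[1]
    spectrum = np.fft.fft(x_frame, n_fft)      # X(f)
    return kernels.conj() @ spectrum / n_fft   # (1/N) * sum over f of X(f) Y*(f)

# Usage: compare the DFT-domain result with the time-domain sum of x[n] y*[n].
fs, n_fft, q = 8000.0, 1024, 17.0
freqs = [220.0, 440.0, 880.0]
x = np.sin(2 * np.pi * 440.0 * np.arange(n_fft) / fs)
Y = spectral_kernels(freqs, fs, q, n_fft)
via_dft = cqt_via_parseval(x, Y)
direct = np.array([np.sum(np.conj(temporal_kernel(fk, fs, q, n_fft)) * x) for fk in freqs])
print(np.allclose(via_dft, direct))
```

The kernel matrix depends only on the analysis parameters, so it can be computed once and reused for every frame; only one DFT of the signal frame is needed per frame.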
An approximate inverse transform was proposed by Fitzgerald [88] under the assumption that music signals can be sparsely represented in the linear frequency domain. However, this assumption does not hold for all audio signals, and the algorithm was extremely slow in calculating the inverse CQT. More recently, Schörkhuber and Klapuri [85] proposed an extension to the method discussed in [45, 84] that calculates the CQT in a manner which allows a high-quality inverse CQT to be calculated. The algorithm processes the signal one octave at a time, from the highest to the lowest, to calculate the CQT coefficients. The algorithm in [85] improves computational efficiency by addressing two problems. Firstly, when a wide range of frequencies is considered, e.g. 60 Hz to 16 kHz, the DFT blocks are very long, and hence the transform matrix is no longer very sparse. Secondly, when calculating the CQT coefficients of the highest frequency bins, successive DFT blocks should be at most $N/2$ samples apart, where $N$ is the window length of the highest CQT bin. Both problems reduce computational efficiency.
The improvement in computational efficiency is obtained as follows. Firstly, the transform matrix Y, which contains the spectral kernels for the highest octave, remains the same for all octaves. After the CQT coefficients of the highest octave are calculated, the entire input signal is passed through a lowpass filter and downsampled by a factor of two. The CQT coefficients of the next octave are then calculated using the same transform matrix. The process is repeated until the lowest desired octave has been processed. Since the transform matrix Y represents frequency bins that span at most one octave, Y remains sparse even for the highest frequency bins.
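The octave-by-octave procedure described above can be sketched as follows. This is a simplified illustration and not the algorithm of [85]: it analyses a single frame per octave, uses dense time-domain kernels rather than sparse spectral kernels, and uses a hand-rolled Hamming-windowed sinc lowpass filter. The window-length rule N[k] = ceil(Q fs / f_k), the filter length, and all function names are assumptions.

```python
import numpy as np

def halfband_lowpass(num_taps=101):
    """Hamming-windowed sinc FIR lowpass with cutoff at a quarter of the sampling rate."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    return h / np.sum(h)  # unit gain at DC

def top_octave_kernels(fs, f_max, bins_per_octave, q):
    """Conjugated temporal kernels for the highest octave, one row per bin.

    Assumptions: N[k] = ceil(q * fs / f_k) and a Hamming window scaled by 1/N[k].
    """
    freqs = f_max * 2.0 ** (-np.arange(bins_per_octave)[::-1] / bins_per_octave)
    lengths = np.ceil(q * fs / freqs).astype(int)
    kernels = np.zeros((bins_per_octave, lengths.max()), dtype=complex)
    for row, (fk, n_k) in enumerate(zip(freqs, lengths)):
        n = np.arange(n_k)
        kernels[row, :n_k] = (np.hamming(n_k) / n_k) * np.exp(-2j * np.pi * fk * n / fs)
    return freqs, kernels

def octave_wise_cqt(x, fs, f_max, bins_per_octave, n_octaves, q):
    """CQT of one frame per octave, from the highest octave down to the lowest."""
    x = np.asarray(x, dtype=float)
    freqs, kernels = top_octave_kernels(fs, f_max, bins_per_octave, q)
    h = halfband_lowpass()
    frame_len = kernels.shape[1]
    out_freqs, out_coeffs = [], []
    for octave in range(n_octaves):
        if len(x) < frame_len:
            raise ValueError("signal too short for the requested number of octaves")
        start = (len(x) - frame_len) // 2            # middle frame, away from filter edges
        out_coeffs.append(kernels @ x[start:start + frame_len])  # same matrix every octave
        out_freqs.append(freqs / 2 ** octave)        # halving the rate halves the bin frequencies
        if octave < n_octaves - 1:
            x = np.convolve(x, h, mode="same")[::2]  # lowpass, then downsample by two
    return np.concatenate(out_freqs), np.concatenate(out_coeffs)

# Usage: a 440 Hz sine, four octaves below a hypothetical top bin of 3520 Hz.
fs = 16000.0
x = np.sin(2 * np.pi * 440.0 * np.arange(4096) / fs)
bin_freqs, coeffs = octave_wise_cqt(x, fs, f_max=3520.0, bins_per_octave=12, n_octaves=4, q=17.0)
print(bin_freqs[np.argmax(np.abs(coeffs))])
```

The kernel matrix is built once for the top octave. Each halving of the sampling rate maps the next octave down onto the same normalised frequencies, which is why the matrix can be reused unchanged.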
Secondly, many of the translated versions of $y[n]$ within the transform matrix Y are shifted temporally to different positions. This reduces the number of DFT calculations for $x[n]$ in equation 1.14. The use of this algorithm and its effect on the separation of sound sources are detailed in chapter 3. In the following section, we give a brief overview of previous techniques used to separate sound sources from a given mixture.