The Constant Q Spectrogram - Time Frequency Representations (TFR)

1.4 Time Frequency Representations (TFR)

1.4.2 The Constant Q Spectrogram

As discussed in section 1.3.4, sounds are composed of harmonic frequency components. The positions of these components in the spectral domain play an important role in the analysis of a given piece of music. Consider the harmonics $kF_0$, i.e. $1F_0, 2F_0, 3F_0, \ldots$, of a fundamental frequency $F_0$. The absolute positions of the harmonics depend on the position of the fundamental frequency $F_0$. However, the relative positions of the harmonics are independent of the fundamental frequency when plotted on a logarithmic scale. This can be summarised by the following equation.

$$D_{nm} = \log(nF_0) - \log(mF_0) = \log\frac{nF_0}{mF_0} = \log\frac{n}{m} = \text{constant} \tag{1.9}$$

where $F_0$ denotes a fundamental frequency and $D_{nm}$ gives the logarithmic distance between the $n$th and $m$th harmonics; $nF_0$ and $mF_0$ represent the $n$th and $m$th harmonics of the fundamental frequency $F_0$, respectively. It can be seen from equation 1.9 that the logarithmic distance between the corresponding harmonics is independent of the fundamental frequency. Thus, the harmonics in sound, and specifically in music, form a pattern that can be investigated using frequency analysis.
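The shift invariance stated by equation 1.9 can be checked numerically. The following is a minimal sketch, not part of the original work: the function name `log_distance` and the three example fundamentals are chosen here purely for illustration.

```python
import math

def log_distance(n, m, f0):
    """Logarithmic distance D_nm between the n-th and m-th harmonics of f0."""
    return math.log(n * f0) - math.log(m * f0)

# The same pair of harmonics (3rd and 2nd) for three illustrative fundamentals in Hz.
fundamentals = [55.0, 220.0, 1760.0]
distances = [log_distance(3, 2, f0) for f0 in fundamentals]

# On a linear axis the spacing between the two harmonics grows with f0.
linear_spacings = [3 * f0 - 2 * f0 for f0 in fundamentals]

print(distances)        # every entry is approximately log(3/2)
print(linear_spacings)  # one entry per fundamental, growing with f0
```

The logarithmic distances agree to within floating-point error, whereas the linear spacings differ by the same factor as the fundamentals.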

However, the linear, uniform frequency spacing of the DFT does not clearly show this shift-invariant property of harmonics. This can be explained as follows. Suppose the DFT is calculated with a constant frequency resolution of 21.5 Hz, i.e. a sampling frequency of 44.1 kHz and a window size of 2048 samples. With a frequency spacing of 21.5 Hz, many notes at lower frequencies, e.g. in the range of 150 Hz, cannot be resolved, because adjacent semitones there are only about 9 Hz apart. On the other hand, for notes in the range of 3 kHz, where adjacent semitones are about 178 Hz apart, far more frequency components are evaluated than are needed to represent the notes. Thus, for musical analysis, a time-frequency representation based on the DFT or STFT is not always suitable. Instead, we need a TFR in which the resolution of the frequency bins is geometrically related to frequency. In addition, the TFR should give a constant pattern of frequency components (harmonics) for each note, to support analysis and musical signal processing. This can be achieved by maintaining a constant ratio (Q) of the fundamental frequency to the frequency resolution.

$$\frac{f}{\delta f} = Q \tag{1.10}$$

where $\delta f$ denotes the frequency resolution, or bandwidth, of the frequency bin and $f$ represents the corresponding fundamental frequency.
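The mismatch between a fixed DFT resolution and musical note spacing can be made concrete with a short calculation. This is an illustrative sketch only: it assumes even-tempered semitone spacing (a factor of the twelfth root of two between adjacent notes), and the helper names `semitone_gap` and `bins_per_semitone` are hypothetical.

```python
SEMITONE = 2 ** (1 / 12)   # assumed even-tempered semitone ratio

fs = 44100.0               # sampling frequency in Hz (from the text)
n_window = 2048            # DFT window length in samples (from the text)
bin_spacing = fs / n_window  # constant DFT frequency resolution, about 21.5 Hz

def semitone_gap(f):
    """Distance in Hz from a note at frequency f to the note one semitone above."""
    return f * (SEMITONE - 1)

def bins_per_semitone(f):
    """Number of DFT bins that fit between a note at f and the next semitone."""
    return semitone_gap(f) / bin_spacing

for f in (150.0, 3000.0):
    print(f, round(semitone_gap(f), 1), round(bins_per_semitone(f), 2))
```

Around 150 Hz there is less than one DFT bin per semitone, so adjacent notes share a bin, while around 3 kHz each semitone is covered by roughly eight bins.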

To obtain this logarithmic resolution in TFR, a Constant Q transform (CQT) is typically used. The constant Q transform of a discrete-time signal x[n] can be calculated by using the following equation:

$$X^{cq}[k] = \sum_{n=0}^{N[k]-1} W[n,k]\, x[n]\, e^{-j\omega_k n} \tag{1.11}$$

where $X^{cq}[k]$ is the $k$th component of the constant Q transform of the input signal $x[n]$, and $W[n,k]$ is a window function of length $N[k]$ for each value of $k$. The index $k = 1, 2, \ldots, K$ runs over the frequency bins in the constant Q domain. The CQT was first proposed by J. C. Brown [45], inspired by many earlier works including [10, 11, 12].
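Equation 1.11 can be evaluated directly, one bin at a time. The sketch below is illustrative and makes several assumptions that the text does not state: a Hamming window for $W[n,k]$, a window length $N[k] = \lceil Q f_s / f_k \rceil$, a digital frequency $\omega_k = 2\pi f_k / f_s$, semitone-spaced bins indexed from $k = 0$, and a division by $N[k]$ so that bins with different window lengths are comparable. Only the first analysis frame is computed.

```python
import cmath
import math

def hamming(length):
    """Hamming window, an assumed choice for W[n, k]."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def direct_cqt(x, fs, f_min, n_bins, q):
    """Evaluate equation 1.11 bin by bin for the first analysis frame of x."""
    coeffs = []
    for k in range(n_bins):
        f_k = f_min * 2 ** (k / 12)       # assumed semitone-spaced bin frequency
        n_k = math.ceil(q * fs / f_k)     # assumed window length N[k]
        omega_k = 2 * math.pi * f_k / fs  # assumed digital frequency of bin k
        window = hamming(n_k)
        total = sum(window[n] * x[n] * cmath.exp(-1j * omega_k * n)
                    for n in range(n_k))
        coeffs.append(total / n_k)        # added normalisation, not in equation 1.11
    return coeffs

# Illustrative test signal: a cosine placed exactly on bin 7.
fs = 8000.0
f_min = 220.0
tone_bin = 7
tone_freq = f_min * 2 ** (tone_bin / 12)
x = [math.cos(2 * math.pi * tone_freq * n / fs) for n in range(700)]

magnitudes = [abs(c) for c in direct_cqt(x, fs, f_min, n_bins=12, q=17.0)]
peak_bin = max(range(len(magnitudes)), key=lambda k: magnitudes[k])
print(peak_bin)
```

Each bin needs its own window and its own sum over $N[k]$ samples, which is why the direct evaluation becomes expensive when many bins and frames are required.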

Figure 1.4: Constant Q Spectrogram of an audio mixture signal

Figure 1.4 shows the constant Q magnitude spectrogram of a test signal containing music signals of two pitched instruments.

According to the even-tempered chromatic scale [52], the fundamental frequencies of adjacent notes are geometrically spaced by a factor of $\sqrt[12]{2}$. Thus, spacing successive frequency bins by a factor of $\sqrt[12]{2}$ covers all the notes for musical analysis. Therefore, the frequency of the $k$th spectral component can be calculated using

$$f_k = \left(\sqrt[12]{2}\right)^{k} f_{min} \tag{1.12}$$

where $f_{min}$ is the lowest frequency, chosen manually. For our research, we have chosen $f_{min}$ to be 55 Hz. The Q factor of a filter is calculated using equation 1.10. For semitone spacing, the Q factor evaluates to approximately 17, as done in [84]; this is consistent with taking $\delta f = (\sqrt[12]{2} - 1)f$, which gives $Q = 1/(\sqrt[12]{2} - 1) \approx 16.8$. The direct evaluation of equation 1.11 is computationally inefficient, as detailed in [84]. Here, we will make use of Parseval's equation to calculate the CQT coefficients.
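The bin frequencies of equation 1.12 and the Q factor of equation 1.10 can be computed in a few lines. This is a small sketch under one stated assumption: the bandwidth $\delta f$ is taken to be the gap between a bin and the next semitone. The names `centre_frequency` and `q_factor` are illustrative.

```python
F_MIN = 55.0            # lowest analysis frequency in Hz (from the text)
RATIO = 2 ** (1 / 12)   # semitone ratio of the even-tempered scale

def centre_frequency(k):
    """Frequency of the k-th bin according to equation 1.12."""
    return RATIO ** k * F_MIN

# Assumption: delta f = (RATIO - 1) * f, so that Q = f / delta f.
q_factor = 1 / (RATIO - 1)

print(round(q_factor, 2))
print([round(centre_frequency(k), 2) for k in (0, 12, 24)])
```

Every twelve bins the frequency doubles, so bins 0, 12 and 24 lie one octave apart, and the computed Q rounds to the value of 17 quoted above.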

Let $x[n]$ and $w[n]$ be discrete-time functions, and let $X(f)$ and $W(f)$ represent the DFTs of $x[n]$ and $w[n]$, respectively. Then, according to Parseval's theorem,

$$\sum_{n=0}^{N-1} x[n]\, w^{*}[n] = \frac{1}{N} \sum_{f=0}^{N-1} X(f)\, W^{*}(f) \tag{1.13}$$

where $W^{*}(f)$ denotes the complex conjugate of $W(f)$. Thus, the CQT can be efficiently calculated in the Fourier domain by using Parseval's equation with the DFT coefficients in $X(f)$ and the spectral kernels (as denoted in [84]) in $Y(f)$. Here, $Y(f)$ contains the coefficients of the DFT of the time-domain complex exponentials $y[n]$ corresponding to the fundamental frequencies of the (geometrically spaced) notes present in music. These complex exponentials are used to modulate the time-domain signal to obtain the logarithmically scaled frequency basis functions. The CQT can then be obtained, up to the constant factor $1/N$ of equation 1.13, by using the following equation:

$$X^{cq}[k] = \sum_{f=0}^{N-1} X(f)\, Y^{*}(f) \tag{1.14}$$

where $Y^{*}(f)$ is the complex conjugate of $Y(f)$. For simplicity, we denote the spectral kernels in $Y(f)$ as the transform matrix $Y$ and the linear spectral coefficients in $X(f)$ as $X$; the constant Q transform can then be formulated as

$$X^{cq}[k] = Y^{*} X \tag{1.15}$$

where $Y^{*}$ is the complex conjugate of $Y$. However, a drawback of the CQT is that it has no true inverse. Therefore, it is typically impossible to obtain a perfect reconstruction of the original signal. Another drawback is that the constant Q transform is computationally more intensive and more complex than the simple DFT or the STFT. Despite these limitations, time-frequency representations based on the CQT give a far better understanding of musical signals and can potentially be used for musical signal processing.
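The equivalence between the time-domain sum of equation 1.11 and the kernel-based form of equations 1.13 to 1.15 can be verified numerically. The sketch below is illustrative rather than the implementation of [84]: it assumes a Hamming-windowed complex exponential for $y[n]$, zero-padded to the DFT length, uses a naive DFT for self-containment, and keeps the factor $1/N$ of equation 1.13 explicit. All sizes and names are chosen here for the example.

```python
import cmath
import math

def dft(x):
    """Naive DFT, adequate for the short illustrative signals used here."""
    n_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * ((f * n) % n_len) / n_len)
                for n in range(n_len))
            for f in range(n_len)]

def temporal_kernel(f_k, fs, q, n_fft):
    """Windowed complex exponential y[n] for one bin, zero-padded to n_fft."""
    n_k = math.ceil(q * fs / f_k)          # assumed window length N[k]
    omega_k = 2 * math.pi * f_k / fs       # assumed digital frequency
    y = [0j] * n_fft
    for n in range(n_k):
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_k - 1))
        y[n] = window * cmath.exp(1j * omega_k * n)
    return y

fs, q, n_fft = 2000.0, 17.0, 256
centre_freqs = [250.0 * 2 ** (k / 12) for k in range(4)]
x = [math.sin(2 * math.pi * centre_freqs[2] * n / fs) for n in range(n_fft)]

kernels = [temporal_kernel(f_k, fs, q, n_fft) for f_k in centre_freqs]
Y = [dft(y) for y in kernels]   # spectral kernels, one row per CQT bin
X = dft(x)                      # linear spectral coefficients

# Frequency-domain evaluation (equation 1.14 with the 1/N of equation 1.13).
cqt_freq = [sum(a * b.conjugate() for a, b in zip(X, Y_k)) / n_fft for Y_k in Y]
# Time-domain evaluation (equation 1.11) for comparison.
cqt_time = [sum(a * b.conjugate() for a, b in zip(x, y)) for y in kernels]

peak_bin = max(range(len(cqt_freq)), key=lambda k: abs(cqt_freq[k]))
print(peak_bin)
```

The practical gain comes from the spectral kernels being nearly sparse, so that most products in the frequency-domain sum can be skipped; this sketch does not threshold the kernels and therefore only demonstrates the equivalence.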

An approximate inverse transform was proposed by Fitzgerald [88], under the assumption that music signals can be sparsely represented in the linear frequency domain. However, this assumption does not hold for all audio signals, and the algorithm was extremely slow in calculating the inverse CQT. Recently, Schörkhuber and Klapuri [85] have proposed an extension to the method discussed in [45, 84] that calculates the CQT in a manner which allows a high-quality inverse CQT to be calculated. The algorithm processes the signal one octave at a time, from the highest octave to the lowest, to calculate the CQT coefficients. In [85], the algorithm improves computational efficiency by addressing two problems. Firstly, when a wide range of frequencies is considered, e.g. 60 Hz to 16 kHz, the DFT blocks are very long, and hence the transform matrix is no longer very sparse. Secondly, when calculating the CQT coefficients of the highest frequency bins, the width between the frequency bins should be at least $N/2$, where $N$ is the window length of the highest CQT bin. Addressing these two problems improves computational efficiency.

The improvement in computational efficiency is obtained as follows. Firstly, the transform matrix Y, which contains the spectral kernels for the highest octave, remains the same for all octaves. Then, the entire audio input signal is passed through a lowpass filter and downsampled by a factor of two. Thereafter, the CQT coefficients are calculated using the same transform matrix. The process is repeated until the desired lowest octave has been processed. Since the transform matrix Y represents frequency bins that are separated by at most one octave, Y remains sparse for the highest frequency bins.
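The octave-by-octave scheme can be sketched as follows. This is a simplified illustration under several assumptions and is not the implementation of [85]: the kernel is applied in the time domain rather than as a sparse spectral matrix, only the first frame is analysed, the lowpass filter is a Hamming-windowed sinc chosen here for brevity, and all function names are hypothetical.

```python
import cmath
import math

def hamming(length):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def make_kernel(f_low, fs, bins_per_octave, q):
    """Windowed complex exponentials for the single octave starting at f_low."""
    kernel = []
    for b in range(bins_per_octave):
        f_b = f_low * 2 ** (b / bins_per_octave)
        n_b = math.ceil(q * fs / f_b)      # assumed window length
        omega = 2 * math.pi * f_b / fs     # assumed digital frequency
        window = hamming(n_b)
        kernel.append([window[n] * cmath.exp(1j * omega * n) / n_b
                       for n in range(n_b)])
    return kernel

def apply_kernel(x, kernel):
    """Inner product of the first frame of x with each kernel atom."""
    return [sum(x[n] * atom[n].conjugate() for n in range(len(atom)))
            for atom in kernel]

def lowpass_and_halve(x, n_taps=63):
    """Windowed-sinc lowpass at a quarter of the sampling rate, keeping every
    second output sample (downsampling by two)."""
    mid = n_taps // 2
    window = hamming(n_taps)
    taps = []
    for i in range(n_taps):
        m = i - mid
        ideal = 0.5 if m == 0 else math.sin(math.pi * m / 2) / (math.pi * m)
        taps.append(ideal * window[i])
    halved = []
    for n in range(0, len(x), 2):          # compute only the retained samples
        acc = 0.0
        for i in range(n_taps):
            j = n + mid - i                # centred taps compensate the delay
            if 0 <= j < len(x):
                acc += taps[i] * x[j]
        halved.append(acc)
    return halved

def cqt_by_octave(x, fs, f_top, n_octaves, bins_per_octave=12, q=17.0):
    """Reuse one kernel for every octave, halving the sampling rate each time."""
    kernel = make_kernel(f_top, fs, bins_per_octave, q)  # built once
    octaves = []
    for _ in range(n_octaves):
        octaves.append(apply_kernel(x, kernel))
        x = lowpass_and_halve(x)
    return octaves[::-1]                   # lowest octave first

# Illustrative signal: a tone five semitones above 220 Hz, two octaves below
# the octave that the kernel was designed for.
fs, f_top = 8000.0, 880.0
tone = 220.0 * 2 ** (5 / 12)
x = [math.cos(2 * math.pi * tone * n / fs) for n in range(1024)]
octaves = cqt_by_octave(x, fs, f_top, n_octaves=3)
peak = max(((o, b) for o in range(3) for b in range(12)),
           key=lambda ob: abs(octaves[ob[0]][ob[1]]))
print(peak)
```

Because each downsampling step halves the sampling rate, the same kernel analyses frequencies one octave lower on every pass, so the kernel windows never grow longer than those of the highest octave.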

Secondly, many of the translated versions of $y[n]$ within the transform matrix Y are shifted temporally to different positions. This reduces the number of DFT calculations for $x[n]$ in equation 1.14. The use of this algorithm and its effect on the separation of sound sources are detailed in chapter 3. In the following section, we give a brief overview of previous techniques used for the separation of sound sources from a given mixture.

In document Non-Negative Matrix Factorization Based Algorithms to Cluster Frequency Basis Functions for Monaural Sound Source Separation. (Page 43-49)