2.4 The MFCC Feature
2.4.1 Extracting MFCCs
MFCCs are extracted via a series of transformations and distortions, which will now be described within the perceptual framework from the previous sections. An overview of the
MFCC feature extraction process is provided in figure 2.2 and a comparison of the Mel
frequency scale to the equal temperament scale (as represented by MIDI note numbers) is
provided in figure 2.3.
The analog signal is sampled into the digital domain such that the subsequent steps can be ‘mechanised’. The complex spectrum is then taken, following the ‘bank-of-filters’ speech
recognition system front-end from [99, p. 81-84], wherein it is stated that ‘probably the
most important parametric representation of speech is the short time spectral envelope’. Following this, the logarithm of the magnitude spectrum is taken, motivated by the widely accepted view that the perception of loudness is logarithmic, as reflected in the dB scale. This is another impact of the human perceptual apparatus on the design of the feature vector. Next, the frequency scale of the magnitude spectrum is warped using the Mel frequency scale.
Figure 2.3: Comparison of the Mel frequency scale, approximated using equation 2.1, to MIDI notes, based on equal temperament tuning. (Axes: Hz, mels, MIDI note.)
On the Mel scale, values increase in step with the perceived increase in pitch. The scale is roughly linear below 1000 Hz and logarithmic above 1000 Hz. The conversion from Hz to Mels is
typically approximated using equation 2.1 [38] or 2.2 [38], where m is the value in Mels
and h is the value in Hz. These two equations will be revisited later in the comparison of open source MFCC extractors.
\[ m = 1127.0 \, \ln\left(1.0 + \frac{h}{700}\right) \tag{2.1} \]
\[ m = 2595 \, \log_{10}\left(1.0 + \frac{h}{700}\right) \tag{2.2} \]
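As a minimal sketch, the two conversions can be written directly in Python to check that they agree closely. The function names are illustrative, and the inverse mapping `mel_to_hz` is derived here by rearranging equation 2.1 rather than taken from the text.

```python
import math


def hz_to_mel_ln(h):
    """Equation 2.1: natural-logarithm form of the Hz to Mel conversion."""
    return 1127.0 * math.log(1.0 + h / 700.0)


def hz_to_mel_log10(h):
    """Equation 2.2: base-10 logarithm form of the Hz to Mel conversion."""
    return 2595.0 * math.log10(1.0 + h / 700.0)


def mel_to_hz(m):
    """Inverse of equation 2.1 (obtained by rearranging it, not given in the text)."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


# Print both conversions side by side for a few frequencies in Hz.
for h in (100.0, 440.0, 1000.0, 4000.0, 10000.0):
    print(h, round(hz_to_mel_ln(h), 2), round(hz_to_mel_log10(h), 2))
```

The two forms differ only in the rounding of their constants (2595 / ln 10 is approximately 1127), so any discrepancy between them is small relative to the values involved.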
The warping is carried out in combination with a data reduction step, where the spectrum is passed through a filter bank to reduce the number of coefficients from windowsize/2
to the number of filters. The filter bank is made from band pass filters with triangular frequency responses centred at equidistant points along the Mel scale. The filters overlap such that the start of the triangular response of filter n + 1 is the centre of the response of filter n. The factors controlling the placement of the filters are the sampling rate of the original signal, the window length for the Fourier transform and the number of filters required. The
lowest frequency, $f_{min}$, is defined in terms of the sampling rate $SR$ and the window length
$N$ as $f_{min} = SR/N$, in Hz. The highest frequency, $f_{max}$, is the Nyquist frequency $SR/2$. If
we convert $f_{min}$ and $f_{max}$ to Mels, $m_{min}$ and $m_{max}$, with equation 2.1, we can define the
centre frequencies of the filters in Mels using equation 2.3, where there are $C$ filters, noting
that the triangular responses for the filters start at $m_{min}$ and end at $m_{max}$ so that the filters fit exactly
into the total available bandwidth $(m_{max} - m_{min})$.
\[ \sum_{n=1}^{C} \left( m_{max} - m_{min} - \frac{m_{max} - m_{min}}{C} \right) \tag{2.3} \]
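The filter placement described above can be sketched as follows. This is one possible reading of the prose, not necessarily the exact construction used by any particular extractor: it assumes C + 2 equally spaced points on the Mel scale between the lowest and highest frequencies, so that filter n starts at point n - 1, peaks at point n and ends at point n + 1. That spacing of (m_max - m_min) / (C + 1) is an assumption chosen to satisfy the stated overlap and bandwidth constraints. All function names are illustrative.

```python
import math


def hz_to_mel(h):
    """Equation 2.1."""
    return 1127.0 * math.log(1.0 + h / 700.0)


def mel_to_hz(m):
    """Inverse of equation 2.1."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


def mel_filter_edges(sample_rate, window_length, num_filters):
    """Return a (start, centre, end) tuple in Hz for each triangular filter."""
    f_min = sample_rate / window_length   # lowest frequency, SR / N
    f_max = sample_rate / 2.0             # Nyquist frequency, SR / 2
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # Assumption: C + 2 equidistant Mel points, so neighbouring filters
    # overlap by half and the bank exactly spans m_min to m_max.
    step = (m_max - m_min) / (num_filters + 1)
    points = [mel_to_hz(m_min + i * step) for i in range(num_filters + 2)]
    return [tuple(points[n - 1:n + 2]) for n in range(1, num_filters + 1)]


def triangular_weight(f, start, centre, end):
    """Gain of one triangular filter at frequency f (Hz), peaking at 1.0."""
    if start < f <= centre:
        return (f - start) / (centre - start)
    if centre < f < end:
        return (end - f) / (end - centre)
    return 0.0


# Example values (hypothetical): 44.1 kHz audio, 1024-sample window, 42 filters.
edges = mel_filter_edges(44100.0, 1024, 42)
print(edges[0])
print(edges[-1])
```

Because the points are equidistant in Mels rather than Hz, the filters are narrow at low frequencies and progressively wider towards the Nyquist frequency.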
To summarise at this point, the output of the Mel filter bank can be described as a low dimensional, perceptually weighted log magnitude spectrum, containing typically up to 42 coefficients. The final stage of the MFCC extraction is to perform a further transform on the outputs of the filter bank. The results of this are cepstral coefficients, so called because they are the transform of a transform; ‘spec’ becomes ‘ceps’. Different types of transforms could be used to create cepstral coefficients from the output of the filter bank but the favoured transform is the Discrete Cosine Transform (DCT). The DCT is commonly used in digital signal processing as a stage in encoding signals for efficient storage or transmission, e.g. the JPEG image compression format. A tutorial can be
found in [65]. The role of the DCT in the MFCC extraction is to enable source and filter
separation as well as further compression of the representation of the signal. It has certain interesting properties which make it appropriate for these tasks: it decorrelates the signal and it redistributes the information towards the lower coefficients.
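The compaction property can be illustrated with a small sketch. The code below applies an unnormalised DCT-II to a set of made-up log filter bank outputs (a smooth, decaying envelope invented purely for illustration) and measures how much of the squared coefficient magnitude falls in the lowest coefficients. The choice of 42 inputs follows the figure quoted above; keeping 13 coefficients is an assumption made for the example, not a value taken from the text.

```python
import math


def dct_ii(x):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 0.5) * k)."""
    n_points = len(x)
    return [
        sum(x[n] * math.cos(math.pi / n_points * (n + 0.5) * k)
            for n in range(n_points))
        for k in range(n_points)
    ]


# Hypothetical log filter bank outputs: a smooth spectral envelope.
log_energies = [math.log(1.0 + 10.0 * math.exp(-0.1 * i)) for i in range(42)]

cepstrum = dct_ii(log_energies)

# Keep only the lowest coefficients (13 is an illustrative choice).
num_kept = 13
mfcc = cepstrum[:num_kept]

total = sum(c * c for c in cepstrum)
low_share = sum(c * c for c in mfcc) / total
print(f"share of squared magnitude in first {num_kept} coefficients: {low_share:.4f}")
```

For a smooth input such as this one, almost all of the squared magnitude sits in the first few coefficients, which is why the higher ones can be dropped with little loss.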
Decorrelating the signal in this case means reducing the linkage or covariance between separate values in the original signal. From the speech synthesis perspective, this is said to separate the source and channel, i.e. the larynx and the ‘band pass filters’ of the mouth. This leads to a speaker-independent characteristic, where the focus is on the positions of the band pass filters, not the signal flowing through them. In terms of musical instruments, this makes it useful for recognising the distinctive timbre of a note regardless of what was used to make it, as the variations caused by pitch should be removed. It is well established that the MFCC is useful for representing the salient aspects of the spectrum for speech
recognition, but in [66] the researchers report that the MFCC is ineffective for speaker
recognition. In terms of musical instruments, this could be interpreted as being unable to guess the source (violin or cello string) but being able to tell the difference between the filters (the bodies of the instruments).
A further aspect of the decorrelation of the signal is that it becomes less meaningful to compare different coefficients with each other, e.g. coefficient 1 from sound 1 with coefficient 2 from sound 2. This makes it possible to use a simple distance metric which compares only like-numbered coefficients.
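A minimal sketch of such a metric is given below. The text does not name a specific metric; Euclidean distance is assumed here as a common simple choice, and the function name and example vectors are hypothetical.

```python
import math


def mfcc_distance(a, b):
    """Euclidean distance between two MFCC vectors of equal length.

    Each coefficient is compared only with the same-numbered coefficient
    in the other vector; no cross-coefficient terms are involved.
    """
    if len(a) != len(b):
        raise ValueError("MFCC vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical three-coefficient vectors, for illustration only.
sound_1 = [12.0, -3.5, 0.8]
sound_2 = [11.0, -2.5, 1.3]
print(mfcc_distance(sound_1, sound_2))
```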
Redistributing the information towards the lower coefficients means the higher coefficients can be discarded while a close representation of the signal is still retained, which makes the representation compact. In this respect the DCT is similar to the more complex transform used in Principal Component Analysis, the Karhunen-Loeve transform, which is perhaps
less popular due to its complexity, despite superior performance [112, p. 496]. In other
For a more mathematical description of the MFCC, the reader is referred to Rabiner
and Juang [99, p. 163-190].