
2.4 The MFCC Feature

2.4.1 Extracting MFCCs

MFCCs are extracted via a series of transformations and distortions which will now be described within the perceptual framework from the previous sections. An overview of the MFCC feature extraction process is provided in figure 2.2 and a comparison of the Mel frequency scale to the equal temperament scale (as represented by MIDI note numbers) is provided in figure 2.3.

The analog signal is sampled into the digital domain such that the subsequent steps can be ‘mechanised’. The complex spectrum is taken, after the ‘bank-of-filters’ speech recognition system front-end from [99, p. 81-84], wherein it is stated that ‘probably the most important parametric representation of speech is the short time spectral envelope’. Following this, the logarithm of the magnitude spectrum is taken, motivated by the widely accepted fact that the perception of loudness is logarithmic, e.g. the dB scale. This is another impact of the human perceptual apparatus on the design of the feature vector. Next, the frequency scale of the magnitude spectrum is warped using the Mel frequency scale.

Figure 2.3: Comparison of the Mel frequency scale, approximated using equation 2.1, to MIDI notes, based on equal temperament tuning. (Axes: Hz, mels, MIDI note.)
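The first two stages described above, taking the spectrum of a frame and then the logarithm of its magnitude, can be sketched as follows. This is a minimal illustration only: the function name, the direct O(N^2) DFT and the `floor` guard against log(0) are assumptions made here for clarity, and a practical extractor would use an FFT and a window function.

```python
import cmath
import math


def log_magnitude_spectrum(frame, floor=1e-10):
    """Return the log magnitude of DFT bins 0..N/2 of a real-valued frame.

    A direct DFT is used for readability. `floor` is an assumed guard
    that avoids taking log(0) for bins with no energy.
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # Complex DFT coefficient for bin k.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(frame)
        )
        # Magnitude first, then the (natural) logarithm.
        spectrum.append(math.log(max(abs(coeff), floor)))
    return spectrum


# A sinusoid with exactly 4 cycles in a 32-sample frame.
frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
log_mag = log_magnitude_spectrum(frame)
peak_bin = max(range(len(log_mag)), key=lambda k: log_mag[k])
```

For this test frame the largest value falls in the bin matching the sinusoid's frequency, which is the behaviour the later filter bank stage relies on.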

The frequency in Mels increases according to the perceived increase in pitch. It is roughly linear below 1000 Hz and logarithmic above 1000 Hz. The conversion from Hz to Mels is typically approximated using equations 2.1 [38] or 2.2 [38], where m is the value in Mels and h is the value in Hz. These two equations will be revisited later in the comparison of open source MFCC extractors.

\[ m = 1127.0 \ln\left(1.0 + \frac{h}{700}\right) \tag{2.1} \]

\[ m = 2595 \log_{10}\left(1.0 + \frac{h}{700}\right) \tag{2.2} \]
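As a quick check on how the two approximations relate, the sketch below implements both, reading the logarithm in equation 2.1 as the natural logarithm. The function names are illustrative, and the inverse mapping is derived here from equation 2.1 rather than taken from the text.

```python
import math


def hz_to_mel_ln(h):
    """Equation 2.1, with log read as the natural logarithm."""
    return 1127.0 * math.log(1.0 + h / 700.0)


def hz_to_mel_log10(h):
    """Equation 2.2, the base-10 form."""
    return 2595.0 * math.log10(1.0 + h / 700.0)


def mel_to_hz(m):
    """Inverse of equation 2.1 (derived for this sketch)."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


# Print both approximations side by side for a few frequencies.
for h in (100.0, 1000.0, 8000.0):
    print(h, hz_to_mel_ln(h), hz_to_mel_log10(h))
```

Since 2595 / ln(10) is approximately 1127, the two forms are the same curve up to rounding of the constant, and both place 1000 Hz very close to 1000 Mels.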

The warping is carried out in combination with a data reduction step, where the spectrum is passed through a filter bank to reduce the number of coefficients from windowsize/2 to the number of filters. The filter bank is made from band pass filters with triangular frequency responses centred at equidistant points along the Mel scale. The filters overlap such that the start of filter n + 1’s triangular response is the centre of filter n’s response. The factors controlling the placement of the filters are the sampling rate of the original signal, the window length for the Fourier transform and the number of filters that are required. The lowest frequency, f_min, is defined in terms of the sampling rate SR and the window length N as f_min = SR/N, in Hz. The highest frequency, f_max, is the Nyquist frequency SR/2. If we convert f_min and f_max to Mels, m_min and m_max, with equation 2.1, we can define the centre frequencies of the filters in Mels using equation 2.3, where there are C filters, noting that the triangular responses for the filters start at m_min and end at m_max to perfectly fit the filters into the total available bandwidth (m_max − m_min).

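\[ \ldots \quad n = 1 \quad \ldots \quad \left( m_{max} - m_{min} - \frac{m_{max} - m_{min}}{C} \right) \quad \ldots \tag{2.3} \]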
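One way to realise the placement constraints stated above is sketched below. It assumes symmetric triangles in Mels, so that filter n runs from the centre of filter n − 1 to the centre of filter n + 1; with the first filter starting at m_min and the last ending at m_max this gives C + 2 equally spaced points with spacing (m_max − m_min)/(C + 1). This spacing is derived from the stated constraints and is not claimed to be identical to equation 2.3. All function names and the example parameters are hypothetical.

```python
import math


def hz_to_mel(h):
    """Equation 2.1."""
    return 1127.0 * math.log(1.0 + h / 700.0)


def mel_to_hz(m):
    """Inverse of equation 2.1."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)


def mel_filter_edges(sample_rate, window_length, num_filters):
    """Return num_filters + 2 equally spaced Mel points from m_min to m_max."""
    m_min = hz_to_mel(sample_rate / window_length)  # f_min = SR / N
    m_max = hz_to_mel(sample_rate / 2.0)            # f_max = SR / 2
    step = (m_max - m_min) / (num_filters + 1)      # assumed spacing
    return [m_min + i * step for i in range(num_filters + 2)]


def triangular_weight(m, start, centre, end):
    """Triangular response: 0 at start and end, 1 at centre (in Mels)."""
    if m <= start or m >= end:
        return 0.0
    if m <= centre:
        return (m - start) / (centre - start)
    return (end - m) / (end - centre)


def apply_filter_bank(spectrum, sample_rate, window_length, num_filters):
    """Reduce spectrum bins 0..N/2 to num_filters weighted sums."""
    edges = mel_filter_edges(sample_rate, window_length, num_filters)
    # Mel position of each spectrum bin.
    bin_mels = [hz_to_mel(k * sample_rate / window_length)
                for k in range(len(spectrum))]
    outputs = []
    for n in range(1, num_filters + 1):
        start, centre, end = edges[n - 1], edges[n], edges[n + 1]
        outputs.append(sum(
            value * triangular_weight(m, start, centre, end)
            for value, m in zip(spectrum, bin_mels)
        ))
    return outputs


# Example parameters (assumed): 44.1 kHz, 1024-point window, 40 filters.
edges = mel_filter_edges(sample_rate=44100, window_length=1024, num_filters=40)
centres_hz = [mel_to_hz(m) for m in edges[1:-1]]
bank_out = apply_filter_bank([1.0] * 513, 44100, 1024, 40)
```

The centres are equidistant in Mels but increasingly far apart in Hz, which is the warping described in the text; the 513 input bins are reduced to 40 outputs.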

To summarise at this point, the output of the Mel filter bank can be described as a low dimensional, perceptually weighted log magnitude spectrum, containing typically up to 42 coefficients. The final stage of the MFCC extraction is to perform a further transform on the outputs of the filter bank. The results of this are cepstral coefficients, so called because they are the transform of a transform; ‘spec’ becomes ‘ceps’. Different types of transforms could be used to create cepstral coefficients from the output of the filter bank but the favoured transform is the Discrete Cosine Transform (DCT). The DCT is commonly used in digital signal processing as a stage in encoding signals for efficient storage or transmission, e.g. the JPEG image compression format. A tutorial can be found in [65]. The role of the DCT in the MFCC extraction is to enable source and filter separation as well as further compression of the representation of the signal. It has certain interesting properties which make it appropriate for these tasks: it decorrelates the signal and it redistributes the information towards the lower coefficients.
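The redistribution of information towards the lower coefficients can be illustrated with a small sketch. It assumes the orthonormal DCT-II, which is a common choice but is not confirmed by the text as the variant used; the 40-band input is a made-up smooth envelope and the choice of 13 retained coefficients is likewise an assumption for illustration.

```python
import math


def dct_ii(values):
    """Orthonormal DCT-II of a list of floats (direct O(N^2) form)."""
    n = len(values)
    coeffs = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(values)
        ))
    return coeffs


def idct_ii(coeffs):
    """Inverse of dct_ii (an orthonormal DCT-III)."""
    n = len(coeffs)
    values = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(
            coeffs[k] * math.sqrt(2.0 / n)
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        values.append(total)
    return values


# Hypothetical smooth filter bank output: a gently sloping 40-band envelope.
bank = [10.0 - 0.2 * i + math.cos(math.pi * (i + 0.5) / 40) for i in range(40)]
coeffs = dct_ii(bank)

# Keep the first 13 coefficients, zero the rest, and reconstruct.
kept = coeffs[:13] + [0.0] * (len(coeffs) - 13)
approx = idct_ii(kept)
max_error = max(abs(a - b) for a, b in zip(bank, approx))

# Fraction of the total energy held by the retained coefficients.
low_energy = sum(c * c for c in coeffs[:13]) / sum(c * c for c in coeffs)
```

For a smooth input such as this one, nearly all of the energy sits in the retained coefficients and the reconstruction from them stays close to the original, which is why the higher coefficients can be dropped.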

Decorrelating the signal in this case means it reduces the linkage or covariance between separate values in the original signal. From the speech synthesis perspective, this is said to separate the source and channel, i.e. the larynx and the ‘band pass filters’ of the mouth. This leads to a speaker independent characteristic, where the focus is on the positions of the band pass filters, not the signal flowing through them. In terms of musical instruments, this makes it useful for recognising the distinctive timbre of a note, regardless of what was used to make it, as the variations caused by pitch should be removed. It is well established that the MFCC is useful for representing the salient aspects of the spectrum for speech recognition, but in [66] the researchers report that the MFCC is ineffective for speaker recognition. In terms of musical instruments, this could be interpreted as being unable to guess the source (violin or cello string) but being able to tell the difference between the filters (the bodies of the instruments).

A further aspect of the decorrelation of the signal is that it makes it less meaningful to compare different coefficients to each other, e.g. coefficient 1 from sound 1 to coefficient 2 from sound 2. Only like-numbered coefficients need to be compared, which makes it possible to use a simple distance metric.
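A minimal sketch of such a metric is given below. The text does not name a specific metric, so the Euclidean distance is used here as an assumption; the function name and the coefficient values are hypothetical.

```python
import math


def mfcc_distance(a, b):
    """Euclidean distance between two equal-length MFCC vectors.

    Each coefficient is compared only with the like-numbered
    coefficient of the other vector.
    """
    if len(a) != len(b):
        raise ValueError("MFCC vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical coefficient values for two sounds.
sound_1 = [12.0, -3.0, 1.5]
sound_2 = [9.0, 1.0, 1.5]
d = mfcc_distance(sound_1, sound_2)
```

No cross terms between different coefficient indices appear in the sum, which is the simplification that decorrelation makes reasonable.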

Redistributing the information towards the lower coefficients means the higher coefficients can be discarded and a close representation of the signal is still retained. This makes the representation compact, and similar to the more complex transform used in Principal Component Analysis, the Karhunen-Loève transform, which is perhaps less popular due to its complexity, despite superior performance [112, p. 496]. In other

For a more mathematical description of the MFCC, the reader is referred to Rabiner and Juang [99, p. 163-190].

In document Automatic sound synthesizer programming: techniques and applications (Page 46-49)