2.3 Pitch estimation
2.3.1 Single pitch estimation
Single pitch estimation algorithms assume that the signal contains only one active harmonic source at a time. De Cheveigné (2006) reviews related methods, dividing them according to the domain in which they work: temporal, spectral, and spectrotemporal.
The simplest time-domain approaches are based on the zero-crossing rate (how often the signal crosses the zero level). The main problem of this approach is its high sensitivity to noise. Other early approaches attempt to model the auditory system to compute the perceived pitch (Terhardt, 1979; Terhardt et al., 1982).
Gold & Rabiner (1969) propose the combination of several periodicity estimates, coming from impulse trains derived from the signal. Most commonly, time domain methods are based on the idea that harmonic sounds are periodic, and use the autocorrelation function (ACF) to find the period of the signal. For a periodic signal, the first major peak of this function after zero lag lies at the fundamental period. Some ACF-based approaches have focused on the time domain (Medan et al., 1991; Talkin, 1995) and others on the frequency domain (Klapuri, 2000). Ross et al. (1974) follow a similar approach but replace the ACF with the average magnitude difference function, based on the city-block distance between two chunks of the signal. De Cheveigné (1998) proposes the squared-difference function, which replaces the city-block distance with the Euclidean distance. De Cheveigné & Kawahara (2002) propose a normalized form of the squared-difference function for the YIN pitch estimation algorithm. This avoids spurious peaks near zero lag, and thus harmonic errors. Mauch & Dixon (2014) propose pYIN, a modification of YIN in which they replace the threshold parameter with a parameter distribution, obtaining several f0 candidates per frame. In a second stage, they use an HMM, which is decoded using the Viterbi algorithm (Forney, 1973). Results show improvements in recall and precision over YIN on a database of over 30 hours of synthesised singing.
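To make the normalized squared-difference idea concrete, the following is a minimal sketch in the spirit of YIN, not the published implementation. The function names, the threshold of 0.1 and the lower search bound `fmin` are illustrative assumptions, and refinements of the full algorithm (such as interpolation around the selected lag) are left out.

```python
import math


def difference_function(x, max_lag):
    """Squared-difference function d(tau) for lags 0..max_lag."""
    w = len(x) - max_lag  # window length shared by all lags
    return [
        sum((x[j] - x[j + tau]) ** 2 for j in range(w))
        for tau in range(max_lag + 1)
    ]


def normalized_difference(d):
    """Divide d(tau) by its running mean over lags 1..tau; the value at lag 0 is set to 1."""
    out = [1.0]
    running = 0.0
    for tau in range(1, len(d)):
        running += d[tau]
        out.append(d[tau] * tau / running if running > 0 else 1.0)
    return out


def estimate_f0(x, fs, fmin=50.0, threshold=0.1):
    """Return an f0 estimate in Hz, or None if no lag falls below the threshold.

    fmin and threshold are assumed values chosen for illustration.
    """
    max_lag = int(fs / fmin)
    dn = normalized_difference(difference_function(x, max_lag))
    tau = 2
    while tau < max_lag:
        if dn[tau] < threshold:
            # Walk down to the local minimum that follows the first dip.
            while tau + 1 <= max_lag and dn[tau + 1] < dn[tau]:
                tau += 1
            return fs / tau
        tau += 1
    return None  # treated here as unvoiced
```

Because the normalized function starts at 1 and stays high for small lags, a fixed threshold can be applied without the zero-lag region being selected, which is the property the normalization is meant to provide.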
Some spectral methods also use ACF, exploiting the fact that the partials of a harmonic sound occur at integer multiples of the fundamental frequency of that sound. The maximum of the ACF for a harmonic spectrum corresponds to the f0. Lahat et al. (1987) propose a method based on flattening the spectrum and estimating the f0 from the ACF. A previous approach by Noll (1967) uses cepstrum analysis. The (power) cepstrum is defined as the inverse Fourier transform of the logarithm of the power spectrum of the signal. Periodic signals present a strong cepstral peak at the location which corresponds to the inverse of the f0 (the fundamental period). Other approaches are based on harmonic matching, in which the peaks of the magnitude spectrum are matched against the expected locations of the harmonics of a candidate (Piszczalski & Galler, 1979; Maher & Beauchamp, 1994). Doval & Rodet (1993) propose a maximum likelihood (ML) approach, based on the representation of an input spectrum as a set of sinusoidal partials, and use HMMs for tracking. Klapuri (2000) proposes a bandwise processing algorithm, in which the final estimate is computed by combining the pitch likelihoods from different bands. This provides robustness against inharmonicity, since in narrow bands the frequency distance between partials can be considered constant. The most commonly found error is that the estimated pitch is an octave above or below the real pitch (octave errors).
The main motivation of spectro-temporal f0 estimation methods is the fact that spectral methods generally have a tendency to exhibit errors at integer multiples of the f0 (harmonic errors), while methods in the temporal domain typically exhibit errors at submultiples of the f0 (subharmonic errors) (Klapuri, 2003). The approach by Meddis & O’Mard (1997) splits the input signal using a filterbank, giving emphasis to a different range of frequencies in each channel. Then the ACF is computed for each channel in the time domain, and the results are added to create a global autocorrelation function.
Figure 2.1: Schema of salience-based melody extraction methods (pitch salience estimation, melody tracking, voicing detection)
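The cepstral definition given above for the approach of Noll (1967) can be illustrated with a short sketch. It is not a reconstruction of the original method: the search range (`fmin`, `fmax`), the Hann window, the small constant added before the logarithm and the synthetic test tone are all assumptions made for the example, and a plain radix-2 FFT is written out so that the block depends only on the standard library.

```python
import cmath
import math


def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_cepstrum(frame):
    """Inverse Fourier transform of the log power spectrum of a real frame."""
    n = len(frame)
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in fft(frame)]
    # The log power spectrum of a real frame is real and even, so its
    # inverse DFT equals the real part of its forward DFT divided by n.
    return [c.real / n for c in fft(log_power)]


def cepstral_f0(frame, fs, fmin=80.0, fmax=500.0):
    """Pick the largest cepstral peak between the periods of fmax and fmin."""
    ceps = power_cepstrum(frame)
    lo, hi = int(fs / fmax), int(fs / fmin)
    peak = max(range(lo, hi + 1), key=lambda q: ceps[q])
    return fs / peak


# Synthetic harmonic tone (illustrative values): 19 partials of 200 Hz
# with 1/h amplitudes, Hann-windowed, sampled at 8 kHz.
fs, f0_true, n = 8000, 200.0, 1024
frame = [
    (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
    * sum(math.sin(2 * math.pi * h * f0_true * i / fs) / h for h in range(1, 20))
    for i in range(n)
]
print(cepstral_f0(frame, fs))
```

The regular spacing of the partials in the log spectrum is what produces the peak, so a signal with a single sinusoidal component would not be a suitable input for this sketch.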
Monophonic pitch estimation is commonly considered to be a solved task, even though the results vary depending on the source and the input parameters. However, the estimation of the pitch corresponding to the melody, or the estimation of all pitches present in a polyphonic mixture, are still very challenging tasks.
2.3.2 Melody extraction
As previously introduced, melody extraction deals with the estimation of the melody pitch in polyphonic audio signals. The signal thus does not contain a single harmonic source, so a monophonic pitch tracking algorithm would fail in most cases.
Composers and performers use several cues to make melodies perceptually salient, including loudness, timbre, frequency variation or note onset rate. Melody extraction methods commonly use cues such as pitch continuity and pitch salience, and some of them group pitches into higher level objects (such as tones or contours), using principles from Auditory Scene Analysis (Goto, 2004; Paiva et al., 2006; Dressler, 2012b; Salamon & Gómez, 2012; Marolt, 2004b). Some approaches have also considered timbre, either within a source separation framework (Durrieu et al., 2010; Ozerov et al., 2007), with a machine learning approach (Ellis & Poliner, 2006), or in a salience-based approach (Marolt, 2005; Hsu & Jang, 2010).
There are different strategies for melody extraction. Salamon et al. (2014) divide them into salience-based and separation-based. The former start by computing a pitch salience function and then perform tracking and voicing detection (see Figure 2.1), and the latter involve the separation of the melody from the accompaniment, which is more or less explicit depending on the approach. Salience-based methods are detailed in Section 2.4.
A clear example of a separation-based approach is Tachibana et al. (2010), who propose the use of Harmonic-Percussive Sound Separation (HPSS), originally conceived to separate sources which are smooth in time (harmonic) from those which are smooth in frequency (percussive). The approach is based on modifying the window length to first separate chords (more sustained) from other content which is more variable (melody and percussion). Then, HPSS is run again, but this time using the original window size, to separate melodic and percussive elements. The melody is then estimated from the spectrogram of the separated (or at least, enhanced) melody signal using dynamic programming, to find the path which maximises the a posteriori probability of the frequency sequence. The probability of a frequency given the spectrum is proportional to the weighted sum of the energy at its harmonic multiples. Transition probabilities depend on the distance between two consecutive frequency values. Fan et al. (2016) propose another separation-based method, but only for vocals. The algorithm first performs singing voice separation based on Deep Neural Networks (DNN) in a supervised setting, and then pitch tracking based on dynamic programming, considering both periodicity and smoothness. Hsu et al. (2011) also enhance the melody signal, in an approach aimed at vocal melody estimation, with a trend estimation algorithm which dynamically adapts the pitch range for melody estimation according to the signal.
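The dynamic programming step described above for Tachibana et al. (2010) combines a per-frame score (a weighted sum of energy at harmonic multiples) with a transition score that depends on the distance between consecutive frequencies. The sketch below shows one way such a search could look. It is an assumption-laden illustration rather than the authors' implementation: the geometric harmonic weights (`decay`), the Gaussian transition penalty in semitones (`sigma_semitones`) and all function names are hypothetical.

```python
import math


def harmonic_salience(spectrum, bin_hz, candidates, n_harm=5, decay=0.8):
    """Weighted sum of spectral energy at the harmonic multiples of each candidate."""
    sal = []
    for f0 in candidates:
        s = 0.0
        for h in range(1, n_harm + 1):
            k = int(round(h * f0 / bin_hz))
            if k < len(spectrum):
                s += decay ** (h - 1) * spectrum[k]
        sal.append(s)
    return sal


def track(salience, candidates, sigma_semitones=1.0):
    """Viterbi-style search over frames.

    salience[t][j] is the score of candidates[j] at frame t. The returned
    path maximises the sum of log-saliences and transition log-scores.
    """
    n = len(candidates)
    eps = 1e-9  # avoids log(0) for empty frames

    def trans(i, j):
        # Assumed transition model: penalise jumps measured in semitones.
        d = 12.0 * math.log2(candidates[j] / candidates[i])
        return -0.5 * (d / sigma_semitones) ** 2

    score = [math.log(s + eps) for s in salience[0]]
    back = []
    for frame in salience[1:]:
        new, ptr = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + trans(i, j))
            ptr.append(best)
            new.append(score[best] + trans(best, j) + math.log(frame[j] + eps))
        score = new
        back.append(ptr)

    # Backtrack from the best final state.
    j = max(range(n), key=lambda i: score[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[j] for j in path]
```

With a transition penalty of this kind, a single frame in which the octave above is slightly more salient does not pull the path away from an otherwise continuous contour, whereas a frame-wise maximum would jump.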
Some approaches combine concepts from salience-based methods and source separation methods. Durrieu et al. (2010) use an Expectation Maximisation (EM) method to compute a salience function by modelling the melody signal with a source-filter (S/F) model, and then perform an explicit separation of the lead for voicing estimation. Other approaches such as Hsu & Jang (2010) and Yeh et al. (2012) use HPSS to attenuate the accompaniment signal, and then compute a salience function from which they extract the melody.
Finally, there is a recent tendency to perform data-driven melody extraction, motivated by the advances in machine learning (especially deep learning) and the increasing dataset availability (see Section 2.7.5). Poliner & Ellis (2005) propose a machine learning approach, based on training an SVM classifier to estimate the melody note from a feature vector derived from the power spectrum. Voicing detection is performed with a threshold on the magnitude squared energy found between 200 and 1800 Hz. Several algorithms using neural networks have been recently proposed, but they have only been tested on vocal data. Kum et al. (2016) propose a multicolumn deep neural network to predict the pitch label of singing voice, using spectrograms as input. The outputs of several Recurrent Neural Networks (RNN) with different pitch resolutions are combined. Voicing is handled separately, with a simple energy-based voice detector. Rigaud & Radenen (2016) propose an approach based on Deep Neural Networks, which estimates pitch activations (pitch salience) and singing voice detection separately. Verma & Schafer (2016) propose a neural network approach for estimating the f0 of a periodic signal directly from the time domain waveform, without pre- or post-processing. They quantize the frequency range into 24 states per octave and an additional state for silence or unvoiced speech. Even though the system is trained on speech (TIMIT database), it obtains an RPA similar to that of the method proposed by Salamon & Gómez (2012) on the MIR1k dataset⁴. However, this method is not tested on more complex and realistic data, such as MedleyDB (Bittner et al., 2014). Finally, Bittner et al. (2015) propose a data-driven method for melody pitch tracking, based on Pitch Contour Classification (PCC), further described in Section 2.5.4. Some data-driven approaches make use of data augmentation (e.g. using time stretching, random frequency filtering), which is reported to be useful for both pitch (Kum et al., 2016) and voicing estimation (Schlüter & Grill, 2015).
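The output representation described above for Verma & Schafer (2016), with 24 states per octave plus one state for silence or unvoiced frames, can be sketched as a simple mapping between f0 values and class indices. The lowest frequency (`fmin_hz`), the number of octaves and the choice to clamp out-of-range values are assumptions made for the example; the source does not specify them.

```python
import math

BINS_PER_OCTAVE = 24  # quarter-tone resolution
UNVOICED = 0          # state reserved for silence or unvoiced frames


def f0_to_state(f0_hz, fmin_hz=55.0, n_octaves=4):
    """Map an f0 in Hz to a class index: 0 if unvoiced, 1..24*n_octaves otherwise.

    fmin_hz and n_octaves are assumed values; out-of-range pitches are clamped.
    """
    if f0_hz is None or f0_hz <= 0:
        return UNVOICED
    idx = int(round(BINS_PER_OCTAVE * math.log2(f0_hz / fmin_hz)))
    idx = min(max(idx, 0), BINS_PER_OCTAVE * n_octaves - 1)
    return idx + 1


def state_to_f0(state, fmin_hz=55.0):
    """Return the centre frequency of a pitch state, or None for the unvoiced state."""
    if state == UNVOICED:
        return None
    return fmin_hz * 2.0 ** ((state - 1) / BINS_PER_OCTAVE)
```

With this resolution adjacent states are 50 cents apart, so quantizing an in-range f0 to the nearest state introduces an error of at most 25 cents.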
A summary of the melody extraction approaches relevant to this dissertation is presented in Table 2.1, where they are classified as salience-based, separation-based, or data-driven approaches. As previously introduced, some data-driven approaches also exploit a pitch salience function as an intermediate representation, but they use supervised methods in some part of the algorithm. The following methods for melody extraction and pitch tracking, as well as intermediate representations, are publicly available (‡ source code or † binary form; programming language in square brackets):
1. LabROSAmelodyextract2005 ‡⁵ [Matlab, Java]: melody extraction algorithm submitted to MIREX 2006 by Poliner & Ellis (2005).
2. FChT ‡⁶ [Matlab/C++]: Fan Chirp Transform (FChT) and salience function proposed by Cancela et al. (2010).
3. separateLeadStereo ‡⁷ [Python]: melody extraction method and lead instrument/accompaniment source separation methods submitted to MIREX 2010 by Durrieu et al. (2010).
4. IMMF0salience ‡⁸ [Vamp plugin]: pitch salience function based on a source-filter model from Durrieu et al. (2010, 2011).
5. MELODIA †⁹ [Vamp plugin]: melody extraction method by Salamon & Gómez (2012) based on pitch contour selection. An open source implementation of this method is available in Essentia¹⁰ (Bogdanov et al., 2013).
6. Fuentes2012_ICASSP ‡¹¹ [Matlab]: melody extraction method by Fuentes et al. (2012) based on PLCA.
7. contour_classification ‡¹² [Python]: pitch tracking method by Bittner et al. (2015), based on supervised pitch contour classification (and pitch contour creation from MELODIA).
8. MelodyExtraction_MCDNN ‡¹³ [Python]: melody extraction method evaluated in MIREX 2016, based on deep neural networks by Kum et al. (2016).
⁵ http://labrosa.ee.columbia.edu/projects/melody/
⁶ http://iie.fing.edu.uy/investigacion/grupos/gpa/fcht.html
⁷ https://github.com/wslihgt/separateLeadStereo
⁸ https://github.com/wslihgt/IMMF0salience
⁹ http://mtg.upf.edu/technologies/melodia
¹⁰ http://essentia.upf.edu
¹¹ http://www.benoit-fuentes.fr/articles/Fuentes2012_ICASSP/index.html
¹² https://github.com/rabitt/contour_classification
¹³ https://github.com/keums/MelodyExtraction_MCDNN
The melody extraction methods proposed in this thesis (Bosch & Gómez, 2016; Bosch et al., 2016a) are also publicly available¹⁴.