Voicing and polyphony estimation - From heuristics-based to data-driven audio melody extraction

Melody extraction algorithms have to classify frames as containing a melody pitch or not. Due to historical reasons, and the fact that most research has been conducted on vocal melodies, most melody extraction literature denotes as voiced those frames that contain a melody pitch, regardless of the instrument producing it. Unvoiced frames are therefore those which do not contain a melody pitch. In this thesis we follow the same naming convention.

The simultaneous estimation of pitch and voicing in music signals is a complex task. In the case of speech signals mixed with background noise, it is easier to discriminate between a frame of speech and one containing noise, possibly due to the presence of a pitched structure in the former. Therefore, there have been approaches which jointly estimate pitch and the presence of human voice, using for instance a single DNN (Lee & Ellis, 2012; Han & Wang, 2014). In the case of music signals, both melody and accompaniment contain harmonic pitched structures, and it is thus necessary to use some other information to distinguish between them. Due to this complexity, voicing detection is commonly performed in a separate step.

Most melody extraction approaches use static or dynamic thresholds on, e.g., energy or salience (Fuentes et al., 2012; Durrieu et al., 2010; Dressler, 2012b; Arora & Behera, 2013). Salamon & Gómez (2012) follow a different strategy, and exploit pitch contour salience distributions. Bittner et al. (2015) determine voicing by setting a threshold on the contour probabilities produced by the discriminative model.
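The difference between a static and a dynamic threshold can be illustrated with a minimal sketch. It does not reproduce any of the cited systems: the function names, the per-frame salience values, and the choice of deriving the dynamic threshold from a fraction (`ratio`) of the mean salience are assumptions made for illustration only.

```python
from statistics import mean


def voicing_static(salience, threshold):
    """Mark a frame as voiced when its salience reaches a fixed threshold."""
    return [s >= threshold for s in salience]


def voicing_dynamic(salience, ratio=0.5):
    """Mark a frame as voiced when its salience reaches a threshold derived
    from the signal itself: here an assumed fraction of the mean salience."""
    threshold = ratio * mean(salience)
    return [s >= threshold for s in salience]


# Hypothetical per-frame salience of the melody pitch candidate.
salience = [0.05, 0.80, 0.90, 0.30, 0.70]
static_decision = voicing_static(salience, threshold=0.5)
dynamic_decision = voicing_dynamic(salience, ratio=0.5)
```

In this toy input the fourth frame is rejected by the fixed threshold but accepted by the dynamic one, because the latter adapts to the overall salience level of the excerpt.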

Separation-based approaches perform voicing detection in different ways. For instance, Durrieu et al. (2010) estimate the energy of the melody signal frame by frame. Frames whose energy falls below a threshold are labelled as unvoiced, and the remaining frames as voiced. The threshold is chosen empirically, such that voiced frames represent more than 99.95% of the leading instrument energy. Fuentes et al. (2012) also use an energy threshold (of -12 dB) on a low-pass filtered separated melody signal. In the case of Tachibana et al. (2010), voicing detection is also performed with a threshold, but in this case it is applied to the (Mahalanobis) distance between the estimated melody signal and the percussive signal.
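One possible reading of the energy criterion described for Durrieu et al. (2010) is sketched below: frames are kept as voiced, from the most to the least energetic, until they account for more than a given fraction of the total energy. This is an illustrative interpretation and not the authors' implementation; the function name and the input values are hypothetical.

```python
def voiced_by_energy_fraction(frame_energies, fraction=0.9995):
    """Return one boolean per frame (True = voiced).

    Frames are accepted in order of decreasing energy until the accepted
    frames hold more than `fraction` of the total energy. The energy of the
    weakest accepted frame acts as the implicit threshold.
    """
    total = sum(frame_energies)
    voiced = [False] * len(frame_energies)
    if total <= 0:
        return voiced
    # Frame indices sorted from the most to the least energetic frame.
    order = sorted(range(len(frame_energies)),
                   key=lambda i: frame_energies[i], reverse=True)
    cumulative = 0.0
    for i in order:
        if cumulative > fraction * total:
            break
        voiced[i] = True
        cumulative += frame_energies[i]
    return voiced


# Hypothetical frame energies of a separated melody signal.
energies = [10.0, 0.001, 5.0, 0.0, 8.0]
decision = voiced_by_energy_fraction(energies)
```

With these values, the three energetic frames already hold more than 99.95% of the total energy, so the two remaining frames are labelled as unvoiced.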

Singing voice detection (SVD) is a very similar task, which aims at identifying the regions in a music recording where at least one person sings, for which timbral and temporal characteristics are commonly exploited. In comparison with melody extraction, this task is restricted to the vocals, and pitch does not need to be identified.

Mauch et al. (2011) propose the use of standard MFCCs, and three features based on the extracted melody line: pitch fluctuation, MFCCs of the re-synthesized predominant voice, and the relative harmonic amplitudes of the predominant voice. A different approach is taken by Rao et al. (2013), who use the differences in singing style and instrumentation across genres to adapt acoustic features for this task. Lehner et al. (2014) propose the Vocal Variance (VV) feature, which computes the variance of the first five MFCCs (Davis & Mermelstein, 1980; Logan et al., 2000), calculated over a window of around 800 ms around the current frame. Lehner et al. (2015) also propose a real-time approach using LSTM neural networks for SVD. Rigaud & Radenen (2016) propose a neural network approach for SVD based on a similar approach by Leglaive et al. (2015), using Bidirectional Long Short-Term Memory (BLSTM). Mel-frequency spectrograms are computed from pre-decomposed signals using Harmonic/Percussive separation, and given as input to 3 BLSTM layers of 50 units each, and a final feed-forward logistic output layer with one unit. The binary decision of voice detection is taken with a threshold (0.5) on the DNN output. They combined this SVD approach with the previously mentioned DNN for pitch salience estimation, achieving state-of-the-art overall accuracy on vocal data. A different approach is the algorithm by Ryynänen & Klapuri (2008), which incorporates a silence model into the HMM tracking part of the algorithm. Hsu & Jang (2010) also propose the use of timbre to classify frames as containing human voice or not.
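The Vocal Variance feature described above can be sketched as a sliding-window variance over precomputed MFCCs. This is a minimal illustration, not the implementation of Lehner et al. (2014): the hop size, the centring of the window, the use of the population variance, and the assumption that the "first five" coefficients are the first five columns of the input are all choices made here for the example.

```python
from statistics import pvariance


def vocal_variance(mfcc_frames, hop_seconds, window_seconds=0.8, n_coeffs=5):
    """Per-frame variance of the first `n_coeffs` MFCCs over a window of
    about `window_seconds` centred on the current frame.

    `mfcc_frames` is a list of frames, each a list of MFCC values.
    Windows are truncated at the beginning and end of the signal.
    """
    # Number of frames on each side of the current frame (assumed centring).
    half = max(1, round(window_seconds / hop_seconds / 2))
    features = []
    for t in range(len(mfcc_frames)):
        lo = max(0, t - half)
        hi = min(len(mfcc_frames), t + half + 1)
        window = mfcc_frames[lo:hi]
        features.append(
            [pvariance([frame[k] for frame in window]) for k in range(n_coeffs)]
        )
    return features


# Hypothetical input: constant MFCCs except for a fluctuating first coefficient.
mfccs = [[float(t % 2) * 2.0, 1.0, 1.0, 1.0, 1.0, 1.0] for t in range(10)]
vv = vocal_variance(mfccs, hop_seconds=0.2)
```

In this toy input only the first coefficient fluctuates over time, so only the first component of the feature is non-zero.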

Polyphony estimation aims to estimate the number of different concurrent sounds. This is a complex task even for musicians, who, for example, commonly underestimate the number of voices when listening to four-voice polyphonies with homogeneous timbre (Huron, 1989). There are different approaches to this problem. Many methods apply a threshold, commonly based on pitch salience or likelihood, which is either fixed (Benetos & Dixon, 2011) or dynamic (Dressler, 2012a). Klapuri (2003) proposes a statistical-experimental approach to control the stopping of the iterative f0 estimation and sound separation process. Yeh et al. (2010) follow an approach based on the STFT, with an adaptive noise level estimation method. Then, given a set of pitch candidates, the overlapping partials are detected and smoothed according to the spectral smoothness principle. Polyphony estimation is based on the increase of a score function using harmonicity, mean bandwidth, spectral centroid, and synchronicity features. Duan et al. (2010) also perform polyphony estimation in order to control the number of iterations of their method, by applying a threshold to the likelihood function. The likelihood function is composed of the peak region likelihood (the probability that a peak is detected in the spectrum given a pitch) and the non-peak region likelihood.
