2.7 Evaluation strategies

2.7.1 Pitch salience function evaluation

Salience functions are commonly evaluated from two perspectives: pitch estimation accuracy and salience estimation accuracy. Salamon et al. (2011) propose four metrics that use the ground truth melody pitch. First, the peaks of the salience function are computed; the peak closest to the ground truth is then selected and treated as the melody salience peak. The first metric is the frequency error of the salience function, $\Delta f_m$, computed as the difference (in cents) between the frequency of the melody salience peak and the ground truth f0. The remaining three metrics deal with salience estimation. The first ($RR_m$) is the reciprocal rank of the melody salience peak among all salience peaks (the closer to one, the better). The second ($S_1$) is the salience of the melody peak relative to the highest salience peak in that frame. The last ($S_3$) is the salience of the melody peak divided by the mean salience of the three highest peaks (the higher, the better). We consider the latter the single most important salience-related measure, since it quantifies the ability of a method to make the melody pitch more salient than the rest of the peaks, which is a key property of a salience function.
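The four metrics can be illustrated for a single frame with a minimal sketch. The function name `salience_metrics` and its arguments are hypothetical, and two choices are assumptions not stated above: the frequency error is reported as an absolute value, and when fewer than three peaks exist the mean is taken over the peaks available.

```python
import numpy as np

def salience_metrics(peak_freqs_hz, peak_saliences, f0_ref_hz):
    """Per-frame salience metrics for a frame with at least one peak.

    Returns (delta_fm in cents, RR_m, S_1, S_3).
    """
    freqs = np.asarray(peak_freqs_hz, dtype=float)
    sal = np.asarray(peak_saliences, dtype=float)

    # Distance in cents from every peak to the ground-truth f0.
    cents = 1200.0 * np.log2(freqs / f0_ref_hz)
    m = int(np.argmin(np.abs(cents)))  # index of the melody salience peak

    # Assumption: the frequency error is reported as an absolute value.
    delta_fm = float(abs(cents[m]))

    # Rank 1 is the most salient peak in the frame.
    rank = 1 + int(np.sum(sal > sal[m]))
    rr_m = 1.0 / rank

    # Salience relative to the highest peak in the frame.
    s1 = float(sal[m] / sal.max())

    # Salience relative to the mean of the (up to) three highest peaks.
    top3 = np.sort(sal)[::-1][:3]
    s3 = float(sal[m] / top3.mean())
    return delta_fm, rr_m, s1, s3
```

For example, with peaks at 220, 440 and 660 Hz whose saliences are 0.5, 1.0 and 0.25, a ground truth of 442 Hz selects the 440 Hz peak, which is also the most salient one, so both $RR_m$ and $S_1$ equal 1.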

2.7.2 Melody extraction

Melody extraction algorithms are commonly evaluated by comparing their output against a ground truth, corresponding to the sequence of pitches that the main instrument plays. Such a pitch sequence is usually created by running a monophonic pitch estimator on the solo recording of the instrument playing the melody (Bittner et al., 2014). Pitch estimation errors are then usually corrected by the annotators.

The evaluation in MIREX²⁰ focuses on both voicing detection and pitch estimation. An algorithm may report an estimated melody pitch even for a frame that it considers unvoiced, which allows voicing and pitch estimation to be evaluated separately. Voicing detection is evaluated using metrics from detection theory, such as the voicing recall (VR) and voicing false alarm (VFA) rates. We define a voicing indicator vector $v$, whose $\tau$-th element ($v_\tau$) is 1 when the frame contains a melody pitch (voiced) and 0 when it does not (unvoiced). We denote the ground truth of this vector by $v^*$. We also define $\bar{v}_\tau = 1 - v_\tau$ as an unvoicing indicator.

Voicing Recall rate is the proportion of frames labelled as melody frames in the ground truth that are estimated as melody frames by the algorithm.

$$\mathrm{VR} = \frac{\sum_\tau v_\tau \, v_\tau^*}{\sum_\tau v_\tau^*} \tag{2.7}$$

Voicing False Alarm rate is the proportion of frames labelled as non-melody in the ground truth that are mistakenly estimated as melody frames by the algorithm.

$$\mathrm{VFA} = \frac{\sum_\tau v_\tau \, \bar{v}_\tau^*}{\sum_\tau \bar{v}_\tau^*} \tag{2.8}$$
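As an illustration, Equations 2.7 and 2.8 can be computed from two binary voicing vectors. This is a minimal sketch: the function name `voicing_rates` is hypothetical, and returning 0 when a denominator is empty is an assumption rather than part of the definitions.

```python
import numpy as np

def voicing_rates(v_est, v_ref):
    """Voicing recall (Eq. 2.7) and voicing false alarm (Eq. 2.8) rates."""
    v_est = np.asarray(v_est, dtype=bool)
    v_ref = np.asarray(v_ref, dtype=bool)

    n_voiced = int(v_ref.sum())       # frames voiced in the ground truth
    n_unvoiced = int((~v_ref).sum())  # frames unvoiced in the ground truth

    # Assumption: a rate with an empty denominator is reported as 0.
    vr = (v_est & v_ref).sum() / n_voiced if n_voiced else 0.0
    vfa = (v_est & ~v_ref).sum() / n_unvoiced if n_unvoiced else 0.0
    return float(vr), float(vfa)
```

With a ground truth of `[1, 1, 1, 0, 0]` and an estimate of `[1, 0, 1, 1, 0]`, two of the three voiced frames are recovered and one of the two unvoiced frames is a false alarm.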

Pitch estimation is evaluated by comparing the estimated and ground truth pitch vectors, whose $\tau$-th elements are $f_\tau$ and $f_\tau^*$ respectively. The most commonly used accuracy metrics are raw pitch accuracy (RPA) and raw chroma accuracy (RCA). Another metric used in the literature is the concordance measure, or weighted raw pitch accuracy (WRPA), which linearly weights the score of a correctly detected pitch by its distance in cents to the ground truth pitch. Finally, the overall accuracy (OA) summarises the performance of the whole system in a single measure.

Raw Pitch Accuracy (RPA) is the proportion of melody frames in the ground truth for which the estimation is considered correct (within half a semitone of the ground truth).

$$\mathrm{RPA} = \frac{\sum_\tau v_\tau^* \, T[M(f_\tau) - M(f_\tau^*)]}{\sum_\tau v_\tau^*} \tag{2.9}$$

²⁰ http://www.music-ir.org/mirex/wiki/2014:Audio_Melody_Extraction

$T$ and $M$ are defined as:

$$T[a] = \begin{cases} 1, & \text{if } |a| < 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{2.10}$$

$$M(f) = 12 \log_2(f) \tag{2.11}$$

where f is a frequency value in Hertz.

Raw Chroma Accuracy (RCA) is a measure of pitch accuracy, in which both estimated and ground truth pitches are mapped into one octave, thus ignoring the commonly found octave errors.

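$$\mathrm{RCA} = \frac{\sum_\tau v_\tau^* \, T\left[\| M(f_\tau) - M(f_\tau^*) \|_{12}\right]}{\sum_\tau v_\tau^*} = \frac{N_{ch}}{\sum_\tau v_\tau^*} \tag{2.12}$$

where $\|a\|_{12} = a - 12 \left\lfloor \frac{a}{12} + 0.5 \right\rfloor$, and $N_{ch}$ represents the number of chroma matches.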
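A minimal sketch of how RPA and RCA could be computed from frame-wise pitch vectors follows. The function name `raw_pitch_and_chroma` is hypothetical. Treating a non-positive estimated frequency as a miss is an assumption, made because the semitone mapping $M(f)$ is undefined for $f \le 0$.

```python
import numpy as np

def raw_pitch_and_chroma(f_est, f_ref, v_ref):
    """Raw pitch accuracy (Eq. 2.9) and raw chroma accuracy (Eq. 2.12)."""
    f_est = np.asarray(f_est, dtype=float)
    f_ref = np.asarray(f_ref, dtype=float)
    v_ref = np.asarray(v_ref, dtype=bool)

    # Assumption: frames with a non-positive frequency cannot match,
    # because M(f) = 12 log2(f) is undefined there.
    ok = v_ref & (f_est > 0) & (f_ref > 0)

    # M(f_est) - M(f_ref) in semitones, only where it is defined.
    diff = np.zeros(f_est.shape)
    diff[ok] = 12.0 * np.log2(f_est[ok]) - 12.0 * np.log2(f_ref[ok])

    # Fold the difference into one octave: a - 12 * floor(a / 12 + 0.5).
    folded = diff - 12.0 * np.floor(diff / 12.0 + 0.5)

    n_voiced = int(v_ref.sum())
    if n_voiced == 0:
        return 0.0, 0.0
    rpa = (ok & (np.abs(diff) < 0.5)).sum() / n_voiced
    rca = (ok & (np.abs(folded) < 0.5)).sum() / n_voiced
    return float(rpa), float(rca)
```

An estimate of 440 Hz against a ground truth of 220 Hz is an octave error: it counts as a miss for RPA but as a match for RCA, since the folded difference is zero.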

Overall Accuracy (OA) measures the proportion of frames that were correctly labelled in terms of both pitch and voicing:

$$\mathrm{OA} = \frac{1}{N_{fr}} \sum_\tau \left( v_\tau^* \, T[M(f_\tau) - M(f_\tau^*)] + \bar{v}_\tau^* \, \bar{v}_\tau \right) \tag{2.13}$$

where $N_{fr}$ is the total number of frames.
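The overall accuracy can be sketched in the same way. The function name `overall_accuracy` is hypothetical. The sketch follows Equation 2.13 literally: a frame that is voiced in the ground truth counts as correct whenever the pitch is within half a semitone, and a frame that is unvoiced in the ground truth counts as correct when the estimate is also unvoiced. Some implementations may additionally require the estimate to be voiced in the first case; that stricter variant is not shown here.

```python
import numpy as np

def overall_accuracy(f_est, v_est, f_ref, v_ref):
    """Overall accuracy, following Eq. 2.13 term by term."""
    f_est = np.asarray(f_est, dtype=float)
    f_ref = np.asarray(f_ref, dtype=float)
    v_est = np.asarray(v_est, dtype=bool)
    v_ref = np.asarray(v_ref, dtype=bool)
    if f_ref.size == 0:
        return 0.0  # assumption: an empty excerpt scores 0

    # Semitone difference where both frequencies are positive; infinite
    # elsewhere so that those frames can never count as pitch matches.
    ok = v_ref & (f_est > 0) & (f_ref > 0)
    diff = np.full(f_est.shape, np.inf)
    diff[ok] = 12.0 * np.log2(f_est[ok] / f_ref[ok])

    pitch_hit = v_ref & (np.abs(diff) < 0.5)  # first term of Eq. 2.13
    unvoiced_hit = ~v_ref & ~v_est            # second term of Eq. 2.13
    return float((pitch_hit | unvoiced_hit).sum() / f_ref.size)
```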

2.7.3 Multiple pitch estimation

The evaluation of multiple pitch algorithms is performed at three different levels, depending on the task.

Multipitch estimation: the task is to collectively estimate the pitch values of all concurrent sources at each individual time frame, without determining their sources. In MIREX (Bay et al., 2009), systems should report the number of active pitches every 10 ms. Two commonly used metrics are Precision (the proportion of correctly retrieved pitches among all pitches retrieved for each frame) and Recall (the ratio of correct pitches to all ground truth pitches for each frame).

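$$\mathrm{Prec} = \frac{\sum_{t=1}^{T} TP(t)}{\sum_{t=1}^{T} TP(t) + \sum_{t=1}^{T} FP(t)} \tag{2.14}$$

$$\mathrm{Rec} = \frac{\sum_{t=1}^{T} TP(t)}{\sum_{t=1}^{T} TP(t) + \sum_{t=1}^{T} FN(t)} \tag{2.15}$$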

where $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives and false negatives, respectively. An estimated pitch is evaluated as correct if it is within half a semitone of a ground-truth pitch for that frame. Note that only one ground-truth pitch can be associated with each returned pitch. Accuracy (Acc) is a measure of overall performance, bounded between 0 and 1, where 1 corresponds to perfect transcription.

$$\mathrm{Acc} = \frac{TP}{TP + FP + FN} \tag{2.16}$$
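The frame-level scores can be sketched as below. The names `match_frame` and `multipitch_scores` are hypothetical. Two choices are assumptions: the one-to-one association between returned and ground-truth pitches is done greedily (nearest unmatched ground-truth pitch first), which is simple but not guaranteed to be the optimal assignment, and Acc is computed from counts accumulated over all frames.

```python
import math

def match_frame(est_hz, ref_hz, tol_semitones=0.5):
    """Return (TP, FP, FN) for one frame using greedy one-to-one matching."""
    unmatched_ref = list(ref_hz)
    tp = 0
    for f in est_hz:
        if not unmatched_ref:
            break
        # Distance in semitones to every still-unmatched ground-truth pitch.
        dists = [abs(12.0 * math.log2(f / r)) for r in unmatched_ref]
        j = dists.index(min(dists))
        if dists[j] < tol_semitones:
            tp += 1
            unmatched_ref.pop(j)  # each ground-truth pitch is used once
    fp = len(est_hz) - tp
    fn = len(ref_hz) - tp
    return tp, fp, fn

def multipitch_scores(est_frames, ref_frames):
    """Precision (Eq. 2.14), recall (Eq. 2.15) and accuracy (Eq. 2.16)."""
    tp = fp = fn = 0
    for est, ref in zip(est_frames, ref_frames):
        a, b, c = match_frame(est, ref)
        tp, fp, fn = tp + a, fp + b, fn + c
    # Assumption: a score with an empty denominator is reported as 0.
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    acc = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return prec, rec, acc
```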

To obtain more information about the kinds of errors made, other metrics have been proposed, such as the total error score ($E_{tot}$), which is computed as the sum of frame-level errors, normalised by the total number of f0 values in the ground truth. If we define $N_{ref}$ as the number of non-zero elements in the ground truth data, $N_{sys}$ as the number of active elements returned by the system and $N_{corr}$ as the number of correctly identified elements:

$$E_{tot} = \frac{\sum_{t=1}^{T} \max(N_{ref}(t), N_{sys}(t)) - N_{corr}(t)}{\sum_{t=1}^{T} N_{ref}(t)} \tag{2.17}$$

The total error score can be divided into three different kinds of errors: substitution errors, missed errors and false alarms. Substitution errors ($E_{subs}$) count the number of ground-truth f0 values in each frame that were not estimated, but for which other, incorrect f0 values were returned instead.

$$E_{subs} = \frac{\sum_{t=1}^{T} \min(N_{ref}(t), N_{sys}(t)) - N_{corr}(t)}{\sum_{t=1}^{T} N_{ref}(t)} \tag{2.18}$$

Missed errors ($E_{miss}$) count the number of ground-truth f0 values that were missed by the algorithm and for which no other f0 estimates were returned.

$$E_{miss} = \frac{\sum_{t=1}^{T} \max(0, N_{ref}(t) - N_{sys}(t))}{\sum_{t=1}^{T} N_{ref}(t)} \tag{2.19}$$

False alarms ($E_{fa}$) count the number of extra f0 estimates that are not substitutes.

$$E_{fa} = \frac{\sum_{t=1}^{T} \max(0, N_{sys}(t) - N_{ref}(t))}{\sum_{t=1}^{T} N_{ref}(t)} \tag{2.20}$$
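The four error scores can be computed from per-frame counts, as in the following minimal sketch. The function name `error_scores` is hypothetical, and the sketch assumes that the max in Equations 2.19 and 2.20 applies to the difference of the two counts. Under that reading the three partial scores sum to the total error, because max(a, b) = min(a, b) + max(0, a - b) + max(0, b - a).

```python
import numpy as np

def error_scores(n_ref, n_sys, n_corr):
    """Return (E_tot, E_subs, E_miss, E_fa) from per-frame counts.

    n_ref, n_sys and n_corr hold, for each frame, the number of ground-truth
    pitches, returned pitches and correctly identified pitches.
    """
    n_ref = np.asarray(n_ref, dtype=float)
    n_sys = np.asarray(n_sys, dtype=float)
    n_corr = np.asarray(n_corr, dtype=float)

    total_ref = n_ref.sum()
    if total_ref == 0:
        raise ValueError("the ground truth contains no pitches")

    e_tot = (np.maximum(n_ref, n_sys) - n_corr).sum() / total_ref
    e_subs = (np.minimum(n_ref, n_sys) - n_corr).sum() / total_ref
    e_miss = np.maximum(0.0, n_ref - n_sys).sum() / total_ref
    e_fa = np.maximum(0.0, n_sys - n_ref).sum() / total_ref
    return float(e_tot), float(e_subs), float(e_miss), float(e_fa)
```

For per-frame counts `n_ref = [2, 1, 0, 3]`, `n_sys = [2, 0, 1, 3]` and `n_corr = [1, 0, 0, 3]`, each of the three partial scores is 1/6 and the total error is 1/2.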

Note tracking: the task is to estimate continuous pitch segments, which would typically correspond to individual notes. In this case, the measures used in MIREX are also Precision (ratio of correctly transcribed ground truth notes to the number of transcribed notes) and Recall (ratio of correctly transcribed ground truth notes to the number of ground truth notes). A ground truth note is evaluated as correct if the system returns a note that is within half a semitone of that note, whose onset is within a 100 ms range (±50 ms) of the onset of the ground truth note, and whose offset is within a 20% range of the ground truth note's offset.
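The note-level matching rule can be sketched as follows. The names `note_matches` and `note_precision_recall` are hypothetical, and notes are represented as `(onset_s, offset_s, pitch_hz)` tuples for illustration. Several choices are assumptions: the 20% offset range is interpreted as 20% of the ground-truth note's duration, matching is greedy and one-to-one, and the scores follow the usual information-retrieval convention in which precision is normalised by the number of transcribed notes and recall by the number of ground-truth notes.

```python
import math

def note_matches(ref, est, onset_tol=0.05, pitch_tol=0.5, offset_ratio=0.2):
    """Check one (onset_s, offset_s, pitch_hz) estimate against a reference."""
    r_on, r_off, r_hz = ref
    e_on, e_off, e_hz = est
    pitch_ok = abs(12.0 * math.log2(e_hz / r_hz)) <= pitch_tol
    onset_ok = abs(e_on - r_on) <= onset_tol
    # Assumption: offset tolerance is 20% of the reference note duration.
    offset_ok = abs(e_off - r_off) <= offset_ratio * (r_off - r_on)
    return pitch_ok and onset_ok and offset_ok

def note_precision_recall(ref_notes, est_notes):
    """Greedy one-to-one note matching, then precision and recall."""
    unused = list(range(len(est_notes)))
    matched = 0
    for ref in ref_notes:
        for k in unused:
            if note_matches(ref, est_notes[k]):
                unused.remove(k)  # each transcribed note is used once
                matched += 1
                break
    precision = matched / len(est_notes) if est_notes else 0.0
    recall = matched / len(ref_notes) if ref_notes else 0.0
    return precision, recall
```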

Timbre tracking: the task is to estimate pitches and stream them into a single pitch trajectory over the musical excerpt, for each of the sources. This task has not been commonly evaluated in MIREX, due to very low participation. Duan et al. (2014) performed an evaluation in which a pitch is considered correct when it is within half a semitone of a ground-truth pitch for that frame and is assigned to the right stream.

In document From heuristics-based to data-driven audio melody extraction (Page 65-69)