4.4 A progressive alignment approach to template extraction

4.4.3 Application to melody classification

The concept of supervised melody classification was introduced in Chapter 2. In current state-of-the-art systems, an unknown melody is classified according to its similarity to a large number of annotated melodies. Here, we explore an alternative strategy based on the proposed method. The key idea is to compare the unlabelled melody to a single sequence per class that is representative of all annotated melodies belonging to the respective category.

To this end, we explore both approaches described in the previous section to estimate the similarity between a given melody and a set of performance transcriptions, and compare them to several baseline scenarios.

Experimental setup

In order to investigate the suitability of the proposed framework for melody classification tasks, we compare five different strategies to assign one of $c$ candidate labels (in our case $c = 4$) to an unlabelled melody based on its similarity to annotated instances, as shown in Figures 4.21 and 4.22. The three setups in Figure 4.21 are considered baseline methods, whereas the two scenarios shown in Figure 4.22 rely on the framework presented in this study.

The first baseline method, shown in Figure 4.21 (a) and subsequently denoted as k-NN, corresponds to the commonly used method for melody classification, which has been shown to give reliable results in the context of flamenco sub-style classification [43, 135] as well as tune family recognition [211, 210]. The unlabelled melody $x_q$ is classified in a k-nearest-neighbour [155] scheme based on pair-wise similarities with all labelled instances contained in the database. While different similarity functions have been introduced in the literature, see for example [215] and [216] (note that most of these methods are not applicable to automatic transcriptions without time quantisation), we use here the score of the NW alignment in order to provide a direct comparison to the other strategies. The major drawback of this method, and a key motivation for the proposed framework, is the large number of computationally expensive alignments required at run-time.

The second baseline method, displayed in Figure 4.21 (b), circumvents this issue by aligning $x_q$ to a single, randomly selected labelled sequence of each class. Since randomly selected instances may represent either "typical" performances or outliers, we report average classification accuracies for 100 repetitions of the experiment. This setup is subsequently referred to as random.

Finally, we compare to the exhaustive case, where a label is assigned based on the average alignment score with all instances of each class (Figure 4.21 (c)). This method is referred to as average. It is worth mentioning that, while the k-NN and average approaches are computationally expensive, the similarity calculations could potentially be parallelised.
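The three baseline decision rules can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the scoring scheme of `nw_score` (match, mismatch and gap values, pitch tolerance), the value of `k`, the helper names and the toy pitch sequences are all assumptions chosen for illustration only.

```python
import random
from collections import Counter, defaultdict
from statistics import mean


def nw_score(a, b, match=1.0, mismatch=-1.0, gap=-1.0, tol=0.5):
    """Needleman-Wunsch global alignment score of two pitch sequences.

    The scoring values are assumed for illustration; they are not the
    ones used in the experiments.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if abs(a[i - 1] - b[j - 1]) <= tol else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def group_by_class(labelled):
    """Collect the labelled sequences of each class."""
    groups = defaultdict(list)
    for seq, label in labelled:
        groups[label].append(seq)
    return groups


def classify_knn(query, labelled, k=3):
    """k-NN: majority vote among the k highest-scoring labelled instances."""
    ranked = sorted(labelled, key=lambda item: nw_score(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def classify_random(query, labelled, rng):
    """random: one alignment per class against a randomly drawn instance."""
    picks = {label: rng.choice(seqs) for label, seqs in group_by_class(labelled).items()}
    return max(picks, key=lambda label: nw_score(query, picks[label]))


def classify_average(query, labelled):
    """average: class with the highest mean alignment score over all its instances."""
    groups = group_by_class(labelled)
    return max(groups, key=lambda label: mean(nw_score(query, s) for s in groups[label]))


# Toy data (hypothetical MIDI pitch sequences and class names).
labelled = [
    ([60, 62, 64, 65], "A"),
    ([60, 62, 64, 67], "A"),
    ([72, 71, 69, 67], "B"),
    ([72, 71, 69, 65], "B"),
]
query = [60, 62, 64, 65]
print(classify_knn(query, labelled),
      classify_random(query, labelled, random.Random(0)),
      classify_average(query, labelled))
```

Note that `classify_knn` and `classify_average` call `nw_score` once per labelled instance, whereas `classify_random` needs only one call per class, which is the trade-off discussed above.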

The fourth and fifth setups, depicted in Figure 4.22, are based on the proposed template extraction framework. In (a), denoted as model, $x_q$ is classified based on the alignment scores $\gamma(x_{G_i}, x_q)$, $1 \le i \le c$, with the models $G_i$ extracted for the different classes. While the model extraction requires exhaustive pair-wise comparison, this procedure can be computed offline. At run-time, only $c$ alignments are required. Similarly, in the procedure shown in Figure 4.22 (b), denoted as template, the class is assigned based on the alignment scores $\gamma(x_{T_i}, x_q)$, $1 \le i \le c$, of the unlabelled instance with the templates $x_{T_i}$ extracted for the different classes.
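The run-time step of the template-based setup reduces to an argmax over $c$ scores, which the following sketch illustrates. The scoring function `toy_score` is a hypothetical stand-in for the alignment score $\gamma$, and the template sequences and class keys are invented for illustration; in practice, the templates would be extracted offline.

```python
from typing import Callable, Dict, Sequence

Score = Callable[[Sequence[float], Sequence[float]], float]


def classify_by_template(query: Sequence[float],
                         templates: Dict[str, Sequence[float]],
                         score: Score) -> str:
    """Return the class whose template scores highest against the query."""
    return max(templates, key=lambda label: score(templates[label], query))


def toy_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for the alignment score: negative summed
    absolute pitch difference over position-wise paired notes."""
    return -sum(abs(x - y) for x, y in zip(a, b))


# Count score evaluations to make the run-time cost explicit.
call_log = []


def counted_score(a, b):
    call_log.append(1)
    return toy_score(a, b)


# Hypothetical templates, one per class (c = 4).
templates = {
    "valverde": [60, 62, 64],
    "huelva": [65, 67, 69],
    "alosno": [70, 72, 74],
    "calana": [55, 57, 59],
}
query = [65, 66, 69]
predicted = classify_by_template(query, templates, counted_score)
print(predicted, "after", len(call_log), "score evaluations")
```

Regardless of how many labelled instances contributed to the templates, the number of score evaluations at run-time equals the number of classes.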

For each scenario, unlabelled melodies are first transformed to the time-normalised representation and shifted to the key of the reference sequence (see Section 4.4.2). Setups (c) and (d) are evaluated in a leave-one-out validation, meaning that the melody to be classified did not participate in the construction of the model. Furthermore, all experiments are conducted for both AT and MT. For each setup, we report the percentage of correctly classified instances (CCI). For the random setup, we furthermore provide the variance across repetitions. Given that the 40 instances are equally distributed among the 4 groups, the naive baseline accuracy for this scenario is 25%.

Fig. 4.21 Experimental setup of the baseline methods: (a) k-NN classification based on pair-wise alignment with all labelled instances (k-NN); (b) classification based on a single alignment score of a randomly selected instance of each class (random); (c) classification based on the average alignment scores with all instances of each class (average).
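A leave-one-out evaluation that reports the CCI can be sketched as follows. The functions `build_class_means` and `nearest_mean` are hypothetical placeholders operating on scalar toy features; they merely stand in for the model construction and the alignment-based classification, which are not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def cci(correct, total):
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total


def leave_one_out_cci(instances, build_models, classify):
    """Classify each instance with models built from all other instances."""
    correct = 0
    for i, (query, true_label) in enumerate(instances):
        training = instances[:i] + instances[i + 1:]  # held-out item excluded
        models = build_models(training)
        if classify(query, models) == true_label:
            correct += 1
    return cci(correct, len(instances))


def build_class_means(training):
    """Placeholder 'model': the mean feature value of each class."""
    groups = defaultdict(list)
    for value, label in training:
        groups[label].append(value)
    return {label: mean(values) for label, values in groups.items()}


def nearest_mean(query, models):
    """Placeholder classifier: class with the closest mean."""
    return min(models, key=lambda label: abs(query - models[label]))


# Toy data (hypothetical); the last instance is an outlier of class "B".
instances = [(1.0, "A"), (1.2, "A"), (0.9, "A"), (5.0, "B"), (5.1, "B"), (1.1, "B")]
toy_cci = leave_one_out_cci(instances, build_class_means, nearest_mean)
print(round(toy_cci, 1))

# With 40 instances in 4 equally sized classes, chance level is 100 / 4 = 25%.
naive_baseline = 100.0 / 4
```

For the corpus size used here, `cci(38, 40)` gives 95.0 and `cci(36, 40)` gives 90.0, so each misclassified melody changes the CCI by 2.5 percentage points.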

Results

The results of the melody classification experiments are shown in Table 4.3. For MT, both the model-based and the template-based classification yield the same classification accuracy (95%) as the computationally more expensive k-NN and average methods. When comparing to a single randomly selected instance, the accuracy drops to 87.8% on average. For AT, we observe a similar behaviour: both template and model obtain the same accuracy of 90% as k-NN and average, whereas the random selection results in a lower accuracy of only 76.8%. In general, these results indicate that both the template-based and the model-based approach yield competitive performance compared to the computationally more expensive k-NN and average setups.

Fig. 4.22 Experimental setup of the proposed method: (a) classification based on alignment scores with class models (model); (b) classification based on alignment with extracted class templates (template).

            accuracy CCI [%]
setup       MT                  AT
k-NN        95.0                90.0
random      87.8 (var = 0.20)   76.8 (var = 0.33)
average     95.0                90.0
model       95.0                90.0
template    95.0                90.0

Table 4.3 Classification accuracy for the five different experimental setups

Error analysis

Having assessed the numerical results, we now proceed with a manual inspection of the incorrectly classified items. A first important observation is that the instances misclassified by the k-NN, average, model and template setups are identical. For MT, two performances belonging to the fandangos valientes de Huelva style were mistakenly classified as fandangos valientes de Alosno. Analysing these two examples revealed that they are indeed rather atypical performances which deviate significantly from the underlying template. Figure 4.23 shows the estimated template for the fandangos valientes de Huelva style together with one example of a correctly classified instance and one example which was misclassified. It can be seen that the misclassified example differs strongly from the template as well as from the correctly classified instance.

For AT, two additional instances were misclassified in all four setups. For one of them, the manual and automatic transcriptions are shown in Figure 4.24. It can be seen that the overall contour of the AT is distorted by octave errors, in particular around t = 0.35, t = 0.65 and t = 0.95. It is very likely that these transcription errors caused a lower alignment score with the corresponding template, which resulted in the misclassification of this melody.

Fig. 4.23 Fandangos Valientes de Huelva: Template and examples of a correctly and an incorrectly classified melody. Panels (MIDI pitch over relative time): template; Fandango Valiente de Huelva - "El Cabrero" (correctly classified); Fandango Valiente de Huelva - J. M. Lepe (incorrectly classified).

Melody classification in a noisy corpus

In a final experiment, we explore the potential of the extracted templates in a melody retrieval task. In particular, we aim to identify further instances of the four melodies in a noisy corpus by computing alignment scores to the templates. To this end, we gathered five additional examples of each of the four fandango styles, which were not used in any of the prior experiments. These recordings are taken from video sharing platforms as well as private collections, and their quality ranges from amateur videos of informal flamenco gatherings to professional studio productions. Recordings may contain several exhibitions of the melody as well as additional melodies belonging to other styles. To this set, we add 1169 recordings from the corpus COFLA [102], a research corpus containing commercial flamenco recordings. These recordings belong to other style families and are not related to the four fandango melodies under study. For all recordings, we extract automatic transcriptions of the singing voice melody using the CANTE [104] algorithm.

Fig. 4.24 Fandangos Valientes de Alosno: MT and AT of one of the incorrectly classified melodies. Panels (MIDI pitch over relative time): Fandango Valiente de Alosno - Paco Toronjo - MT; Fandango Valiente de Alosno - Paco Toronjo - AT.

Assuming that several melody exhibitions will alternate with guitar interludes, we split each recording at non-vocal sections lasting at least 3 seconds. Then, we align the resulting segments to the four templates extracted from the database used in the previous experiment. For each template, we store the highest alignment score obtained among the segments of a recording. Then, for each of the four cases, we analyse how many of the highest-ranked recordings are relevant to the respective template melody, meaning that they contain at least one exhibition of the template melody. To this end, we compute the retrieval recall rec_R within the R highest-ranked recordings:

$$\mathrm{rec}_R = \frac{\#\,\text{relevant items within the top } R \text{ ranks}}{\#\,\text{relevant items in the search space}} \tag{4.14}$$
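The segmentation and per-recording scoring that precede the recall computation can be sketched as follows. The note representation as (onset, duration, pitch) tuples, the function names and `toy_score` are assumptions made for illustration; `toy_score` is only a placeholder for the alignment score and does not perform an actual alignment.

```python
def split_at_nonvocal_gaps(notes, min_gap=3.0):
    """Split a transcription into segments at silences of at least min_gap seconds.

    notes: iterable of (onset_seconds, duration_seconds, midi_pitch) tuples.
    """
    segments, current, prev_end = [], [], None
    for onset, duration, pitch in sorted(notes):
        if prev_end is not None and onset - prev_end >= min_gap:
            segments.append(current)  # close the segment before the gap
            current = []
        current.append((onset, duration, pitch))
        end = onset + duration
        prev_end = end if prev_end is None else max(prev_end, end)
    if current:
        segments.append(current)
    return segments


def toy_score(a, b):
    """Hypothetical placeholder for the alignment score: negative summed pitch
    difference of position-wise paired notes, minus the length difference."""
    return -sum(abs(x - y) for x, y in zip(a, b)) - abs(len(a) - len(b))


def best_segment_score(segments, template, score):
    """Highest score of any segment against the template (-inf if no segments)."""
    return max(
        (score(template, [pitch for _, _, pitch in seg]) for seg in segments),
        default=float("-inf"),
    )


# Toy transcription (hypothetical): a 3.5 s non-vocal gap between 2.5 s and 6.0 s.
notes = [(0.0, 1.0, 60), (1.5, 1.0, 62), (6.0, 0.5, 64), (6.5, 1.0, 65), (9.0, 1.0, 67)]
segments = split_at_nonvocal_gaps(notes)
best = best_segment_score(segments, [64, 65, 67], toy_score)
print(len(segments), best)
```

Recordings would then be ranked per template by this best segment score, and Equation 4.14 evaluated on the resulting ranking.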

Figure 4.25 shows the retrieval recall for the four melodies when considering the 5 (rec₅), 10 (rec₁₀) and 20 (rec₂₀) highest-ranked recordings. We furthermore distinguish whether MT or AT were used to estimate the templates. It can be seen that for Fandangos de Valverde and Fandangos Valientes de Huelva, the five highest-ranked recordings correspond to the five recordings containing the relevant melodies. For the worst case, the Fandangos de Calaña, three relevant results are located within the first five ranks and the remaining two relevant melodies are not ranked among the top 20 results. For the Fandangos Valientes de Alosno, three relevant results are located within the first five ranks, and the remaining two are located between ranks 6 and 10. These results are promising, given that the search space encompasses a total of 1189 recordings (the 20 additional fandango recordings plus the 1169 COFLA recordings).

We manually inspected the two examples which were not located within the top 20 results and found that severe vocal detection errors are most likely the cause of the low ranking. For example, in one recording, several seconds of the relevant singing voice melody are missing. In the other recording, guitar segments between the melody exhibitions were mistakenly transcribed as singing voice. In this case, the melody exhibitions are not segmented correctly and the alignment with the template yields a low score. We do not observe any differences between using the templates extracted from AT and MT.

Fig. 4.25 Retrieval recall for manual and automatic transcriptions.
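The recall values discussed above follow directly from Equation 4.14, as the following sketch shows. The concrete rank positions are hypothetical; only the counts per rank range (three relevant items in the top 5, and the remaining two either within the top 10 or beyond rank 20) are taken from the text.

```python
def recall_at_r(ranked_ids, relevant_ids, r):
    """Fraction of all relevant items that appear within the top r ranks."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined without relevant items")
    hits = sum(1 for item in ranked_ids[:r] if item in relevant)
    return hits / len(relevant)


# Search space of 1189 recordings; for simplicity each id equals its rank.
ranking = list(range(1, 1190))

# Hypothetical rank positions of the five relevant recordings.
alosno_like = {1, 3, 4, 7, 9}       # three in the top 5, two at ranks 6 to 10
calana_like = {2, 3, 5, 150, 600}   # three in the top 5, two beyond rank 20

for name, relevant in (("alosno-like", alosno_like), ("calana-like", calana_like)):
    print(name, [recall_at_r(ranking, relevant, r) for r in (5, 10, 20)])
```

Under these assumptions, the first case reaches full recall at R = 10, while the second remains at 0.6 for all three values of R.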

In document Flamenco music information retrieval. (Page 105-111)