4. Evaluation Methods
4.1.2. Further metrics
In [217], we discussed five categories of metrics that can be estimated for the evaluation of music classification: common quality-based, resource, model complexity, user interaction, and specific metrics. The measures from the first group are listed in Section 4.1.1, and most of them were used in our studies. The metrics from the last four groups are less commonly used in music classification but are, in our opinion, very promising for the multi-objective evaluation of music classification in the future.
Figure 4.1.: All solutions found during the 10 statistical repetitions of FS optimisation by SMS-EMOA and classification by RF, category Classic. Left subfigure: partition-level optimisation. Right subfigure: song-level optimisation.
Resource metrics estimate algorithm runtime and storage demands. One of the few works that provide a general categorisation of this metric group is [210]. Resource metrics can be calculated for each stage of the music classification chain discussed in Section 2.1.3, for example:
• The CT runtime is relevant if new music categories are frequently created.
• The C runtime becomes crucial if the same categorisation models are applied to different music collections, for example, if an online music shop automatically classifies new songs each day.
• The same holds for the FE runtime: although feature extraction is usually done only once for each new music track, it can be very costly. For example, it was observed in [18] that the estimation of autocorrelation, fundamental frequency, and power spectrum required more than 65% of the overall extraction time for the set of 25 common audio features. Therefore, overly long extraction times may cause problems for frequently updated music collections as well as for devices with limited resources.
• The FP reduction rate measures the number of entries in the processed feature matrix $X'$ divided by the number of all feature dimension values before any processing:
$$m_{FPRR} = \frac{F \cdot T'}{\sum_{i=1}^{F^{*}} \left( T^{**}(i) \cdot F^{**}(i) \right)}. \tag{4.22}$$
For each feature $i$, the number of extracted values is equal to the product of the number of dimensions $F^{**}(i)$ and the number of extraction windows $T^{**}(i)$, see also note 8 in Section 2.1.3. $m_{FPRR}$ provides a rough estimate of the storage demands required to index the music files.
• A modified version of $m_{FPRR}$ was used in [222] for the comparison of different time dimension processing methods (see the discussion of Fig. 2.10 in Section 2.3.3). The time windows reduction rate is the number of selected time windows relative to the number of the smallest extraction frames, which are required for the harmonisation of the feature matrix $X_H$ (see Section 2.3):
$$m_{TWRR} = \frac{T'}{T_H}, \tag{4.23}$$
where $T_H$ is the time dimensionality of the harmonised feature matrix $X_H$ before further processing.
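The two reduction rates in Eqs. (4.22) and (4.23) can be illustrated with a minimal Python sketch. The function and parameter names are hypothetical, and the example numbers are invented for illustration only; they do not come from the experiments discussed here.

```python
from typing import Sequence


def feature_processing_reduction_rate(
    n_processed_features: int,       # F: feature dimensions after processing
    n_processed_windows: int,        # T': time windows after processing
    raw_dims: Sequence[int],         # F**(i): dimensions of each raw feature i
    raw_windows: Sequence[int],      # T**(i): extraction windows of each raw feature i
) -> float:
    """m_FPRR (Eq. 4.22): entries of the processed matrix over all raw values."""
    if len(raw_dims) != len(raw_windows):
        raise ValueError("raw_dims and raw_windows must have the same length")
    # Denominator: sum over all raw features of windows times dimensions.
    raw_total = sum(t * f for t, f in zip(raw_windows, raw_dims))
    if raw_total <= 0:
        raise ValueError("the raw feature set must contain at least one value")
    return (n_processed_features * n_processed_windows) / raw_total


def time_window_reduction_rate(n_selected_windows: int,
                               n_harmonised_windows: int) -> float:
    """m_TWRR (Eq. 4.23): selected time windows T' over harmonised windows T_H."""
    if n_harmonised_windows <= 0:
        raise ValueError("n_harmonised_windows must be positive")
    return n_selected_windows / n_harmonised_windows


if __name__ == "__main__":
    # Invented example: three raw features with 1, 13, and 12 dimensions,
    # extracted from 1000, 1000, and 500 windows, respectively.
    fprr = feature_processing_reduction_rate(10, 100, [1, 13, 12], [1000, 1000, 500])
    twrr = time_window_reduction_rate(100, 1000)
    print(f"m_FPRR = {fprr:.3f}, m_TWRR = {twrr:.3f}")
```

Smaller values of both rates indicate a stronger reduction of the data that has to be stored after feature processing.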
Model complexity metrics estimate the complexities of the classification models. More complex models often have a higher tendency to overfit certain data sets, in particular the training data, so that the classification performance on other data sets deteriorates. This metric group is sometimes closely related to the resource metrics: a more complex model is often built from a larger number of features and has higher storage demands.
• A crude measure for model complexity is the selected feature rate:
$$m_{SFR} = \frac{F}{F^{*}}. \tag{4.24}$$
A larger number of input variables often leads to more complex models, and the danger increases that some noisy features are coincidentally recognised as relevant. This holds especially if the number of features is larger than the number of classification instances.
• The generalisation performance of classification models can also be evaluated according to stability criteria, such as the deviation of the classification performance on different validation sets. An example of such a measure is proposed in [113].
• The classifier-specific model complexity metrics compare models that are created by the same classifier but with different parameters. An SVM-specific complexity measure is discussed in [145]. For C4.5, the number of tree nodes measures the tree complexity.
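As a rough illustration of the first two complexity measures, the following Python sketch computes the selected feature rate of Eq. (4.24) and a simple stability value. Using the population standard deviation of the scores as the stability criterion is an assumption made here for illustration; it is not necessarily the measure proposed in [113]. All names and numbers are hypothetical.

```python
from statistics import pstdev
from typing import Sequence


def selected_feature_rate(n_selected: int, n_original: int) -> float:
    """m_SFR (Eq. 4.24): selected features F over all available features F*."""
    if n_original <= 0:
        raise ValueError("n_original must be positive")
    return n_selected / n_original


def performance_deviation(scores: Sequence[float]) -> float:
    """Assumed stability criterion: population standard deviation of the
    classification performance measured on different validation sets."""
    if len(scores) < 2:
        raise ValueError("at least two validation scores are required")
    return pstdev(scores)


if __name__ == "__main__":
    # Invented example: 25 of 500 features selected, four validation sets.
    print(f"m_SFR = {selected_feature_rate(25, 500):.3f}")
    print(f"deviation = {performance_deviation([0.80, 0.82, 0.78, 0.80]):.4f}")
```

A low deviation across validation sets suggests that a model generalises in a stable way, whereas a low selected feature rate only hints at lower complexity and does not guarantee it.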
A group of user related metrics makes sense for any classification scenario where the users are either involved in ground truth labelling or the categorisation itself aims at user satisfaction. Examples of these metrics are:
• Listener satisfaction with the music classification results.
• Feedback efforts, if the user plays a role in active learning [86].
• Efforts to create the ground truth are usually in conflict with the classification performance: the smallest number of misclassifications can be achieved when a large number of labelled songs from different genres exists. However, this requires high manual labelling effort.
• High interpretability of the classification models and the involved features helps to understand the category properties, for example, if a decision tree model is built with high-level features. Each step of the algorithm chain (Fig. 2.3) that aims to increase the classification performance may, on the other hand, reduce the interpretability: e.g., if the FE outputs a large number of complex and less comprehensible audio signal characteristics, FP applies statistical feature dimension processing,