
5. Application of Feature Selection

5.1.2. Moods

The parameters of the mood study are listed in Table 5.2.

In the following, we describe only those parameters of the mood recognition study that differ from the setup of the instrument recognition study already discussed in Section 5.1.1.

• Classification tasks: For our music database of 120 albums, listed in Appendix B, Table B.1, we labelled the songs with the corresponding AMG mood categories. These moods are defined by music experts and can be treated as personal preferences. Because some categories contained very few labelled songs in our collection, we selected the following eight moods after a preliminary analysis: Aggressive, Confident, Earnest, Energetic, PartyCelebratory, Reflective, Sentimental, and Stylish.

A problematic issue of this ground truth is that only the positive labels are available from the AMG web site. If an album is not labelled with a certain mood, it could mean either that it is a negative example or that it has not been analysed by the experts. On the other hand, subjective descriptors such as moods can hardly ever guarantee a precise ground truth.

The training sets were generated as follows: we selected all available albums with a certain mood tag and randomly drew one song per album. Then, the same number of songs was drawn randomly from the remaining albums, which were not labelled with this mood. The classification models were trained on these balanced sets. The solutions generated by SMS-EMOA were evaluated on the song set OS120 and validated independently on the song set TS120 (both sets are listed in Appendix B, Table B.3), as in previous studies [15, 217, 218]. During the random generation of the training sets, the songs from OS120 and TS120 were excluded, so that no song was shared between the training, optimisation and holdout sets.
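The sampling procedure for the balanced training sets can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the album data layout, the function name `draw_balanced_training_set` and the toy album names are hypothetical, and the sketch assumes that the songs from the albums without the mood tag are also drawn one per album, which the text does not state explicitly.

```python
import random


def draw_balanced_training_set(albums, mood, excluded_songs, seed=0):
    """Draw a balanced training set for one mood.

    albums: dict mapping album id -> {"moods": set of str, "songs": list of str}
            (hypothetical layout chosen for this sketch)
    excluded_songs: songs reserved for the optimisation and holdout sets
    """
    rng = random.Random(seed)

    def pick_song(album):
        # One random song per album, never one of the reserved songs.
        candidates = [s for s in album["songs"] if s not in excluded_songs]
        return rng.choice(candidates) if candidates else None

    tagged = [a for a in albums.values() if mood in a["moods"]]
    untagged = [a for a in albums.values() if mood not in a["moods"]]

    positives = [s for s in map(pick_song, tagged) if s is not None]

    # Assumption: the second half is also drawn with one song per album.
    rng.shuffle(untagged)
    negatives = []
    for album in untagged:
        if len(negatives) == len(positives):
            break
        song = pick_song(album)
        if song is not None:
            negatives.append(song)
    return positives, negatives


# Toy data: six albums with three songs each, two of them tagged "Aggressive".
toy_albums = {
    f"album{i}": {
        "moods": {"Aggressive"} if i < 2 else set(),
        "songs": [f"album{i}_song{j}" for j in range(3)],
    }
    for i in range(6)
}
reserved = {"album0_song0", "album3_song1"}  # stands in for OS120 and TS120
positives, negatives = draw_balanced_training_set(toy_albums, "Aggressive", reserved)
```

With 15 to 26 tagged albums per mood, the same procedure would yield the 30–52 training songs reported in Table 5.2.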

• Features and processing: For 439 low-level and high-level descriptors, which were originally extracted from short frames with length < 4 s, the mean and standard deviation were estimated for classification frames with Wc = 4 s and Sc = 2 s.

Table 5.2: Parameters of the mood recognition study.

| Parameter name | Values | No. |
|---|---|---|
| **Classification tasks** | | |
| Classification tasks | Aggressive, Confident, Earnest, Energetic, PartyCelebratory, Reflective, Sentimental, Stylish | |
| Training sets | 30–52 songs | 1 |
| Optimisation set | 120 songs | 1 |
| Holdout set | 120 songs | 1 |
| **Features and processing** | | |
| Initial features | 1,318 audio features | 1 |
| Feature processing | NaN elimination, normalisation, interonset frame selection | |
| Classification frames | Wc = 4 s and Sc = 2 s | 1 |
| Feature aggregation | Mean and std. deviation for low-level features and high-level features with extraction windows < 4 s | |
| **Classification methods** | | |
| Algorithms | C4.5, RF, NB, SVM with a linear kernel | 4 |
| **Optimisation parameters** | | |
| Optimisation metrics | mBRE and mSFR | 1 |
| Optimisation algorithm | (50+1) SMS-EMOA | 1 |
| Mutation | Asymmetric bit flip with p01 = 0.01 and γ = 32 | 1 |
| Initial feature rate | ifr ∈ {0.5; 0.2; 0.05} | 3 |
| Number of evaluations | 2,000 | 1 |
| Evaluation method | Optimisation on the optimisation set; independent validation on the holdout set | |
| Statistical repetitions | - | 10 |
| Number of experiments | | 960 |
| Number of model train. | | 1,968,000 |
| Number of model eval. | | 3,936,000 |

Estimating both statistics doubled the number of feature dimensions for these descriptors, leading to 878 features.

Another set of 70 low-level and high-level features with extraction frames larger than 4 s was processed directly without aggregation.

A further group of features was integrated according to the concept of sliding feature selection, as introduced in Section 3.3. Different instrument categorisation models described in Section 5.1.1 were applied on extraction windows around the onset events. Then, the relative share of the positive outcomes (an instrument was detected) was calculated for larger high-level feature extraction frames of 10 s. For example, a piano share of 0.8 in a frame means that 80% of the binary classification models identified a piano around the onsets in the analysed 10 s frame. Because all non-dominated instrument models for different classifiers were taken into account (see Fig. 5.1), the number of these instrument-related features was equal to 237.

5.1. Recognition of high-level features 93

Finally, the 133 structural complexity high-level characteristics listed in Table A.7 were also integrated into the complete feature set.

All features were normalised and the missing values were replaced by the medians. For features with short extraction frames, only the interonset frames were taken into account.
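The aggregation of short-frame features and the instrument share can be illustrated with a small sketch. It is not the original processing chain: the function names, the input layout and the toy data are hypothetical, windows are assumed to start at 0 s, and the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import fmean, pstdev


def aggregate_frames(frame_times, frame_values, duration, window=4.0, step=2.0):
    """Mean and standard deviation per classification window.

    frame_times: start time in seconds of each short extraction frame
    frame_values: one feature vector (list of floats) per extraction frame
    Returns one row per window: all means followed by all standard deviations.
    """
    rows = []
    n_windows = int((duration - window) // step) + 1
    for k in range(max(n_windows, 0)):
        start = k * step
        inside = [v for t, v in zip(frame_times, frame_values)
                  if start <= t < start + window]
        if not inside:
            continue
        columns = list(zip(*inside))  # transpose: one tuple per feature
        means = [fmean(c) for c in columns]
        stds = [pstdev(c) for c in columns]  # assumption: population std
        rows.append(means + stds)
    return rows


def instrument_share(detections):
    """Relative share of positive binary outcomes within one 10 s frame."""
    return sum(detections) / len(detections)


# Toy input: 20 frames of 0.5 s covering 10 s, with 439 features each.
times = [0.5 * i for i in range(20)]
values = [[t] + [0.0] * 438 for t in times]
windows = aggregate_frames(times, values, duration=10.0)

# Each window row holds 439 means and 439 standard deviations.
aggregated_dims = len(windows[0])
total_dims = aggregated_dims + 70 + 237 + 133
```

The last line adds the 70 directly processed features, the 237 instrument-related features and the 133 structural complexity features, which gives the 1,318 initial features listed in Table 5.2.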

• Classification methods: For the mood recognition study and the following experiments described in Sections 5.1.3 to 5.2, we used the same four classifiers as for instrument recognition. The only change was the increased number of trees for the RF classifier: we replaced the default value of 10 trees with 100, following the observations in [218].

• Optimisation parameters: Because the optimisation (and holdout) sets were not balanced, we used the balanced relative error mBRE (Eq. 4.12) and mSFR as optimisation criteria. The parameters of SMS-EMOA were the same as for the instrument recognition study, except for the population size (it was increased to 50) and crossover: because this operator did not provide a significant improvement across the instrument recognition tasks, we removed it from the algorithm.
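The following sketch illustrates why a balanced error is preferable on unbalanced sets. Eq. 4.12 is not reproduced in this section, so the code assumes the common definition of the balanced relative error as the mean of the error rates on the positive and the negative class; it also assumes that mSFR denotes the share of selected features. Both function names are hypothetical.

```python
def balanced_relative_error(y_true, y_pred):
    """Mean of the miss rate on positives and the false alarm rate on negatives.

    Assumed definition; the thesis defines mBRE in Eq. 4.12, not shown here.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (fn / (tp + fn) + fp / (tn + fp))


def selected_feature_rate(mask):
    """Share of selected features in a binary feature mask (assumed mSFR)."""
    return sum(mask) / len(mask)


# Unbalanced toy set: 2 positive and 8 negative songs.
y_true = [1, 1] + [0] * 8
# A trivial model that always predicts "negative".
y_pred = [0] * 10

plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
bre = balanced_relative_error(y_true, y_pred)
```

In this toy case the trivial model has a plain error of 0.2 but a balanced error of 0.5, so the balanced measure does not reward a model that ignores the minority class.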

For mood recognition, we did not use a 10-fold CV process for model training and validation, for two reasons. First, 10-fold CV requires approximately 10 times longer runtimes than a single validation, and we had to find a compromise between the number of classification tasks and other parameter settings. This load could in principle be partly reduced by using, e.g., only a 3-fold CV procedure, as was done for the GFKL2011 set recognition (Section 5.1.3). However, the second restriction was the limited number of positive mood albums (between 15 and 26), so that the balanced training sets consisted of 30–52 songs. Using only 2/3 of these sets for model training would further decrease the number of positive songs used for model creation.
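The totals in Table 5.2 and the cost argument above can be checked with a short calculation. The variable names are illustrative. The assumption that each experiment trains one model per individual of the initial population plus one per evaluation is not stated in the text; it is used here because it reproduces the reported number of model trainings.

```python
# Parameter grid from Table 5.2.
moods = 8            # classification tasks
classifiers = 4      # C4.5, RF, NB, SVM
feature_rates = 3    # ifr in {0.5, 0.2, 0.05}
repetitions = 10     # statistical repetitions

experiments = moods * classifiers * feature_rates * repetitions  # 960

# Assumption: 50 initial individuals plus 2,000 evaluations per experiment.
population = 50
evaluations = 2000
model_trainings = experiments * (population + evaluations)  # 1,968,000

# Each model is evaluated on the optimisation set and on the holdout set.
model_evaluations = 2 * model_trainings  # 3,936,000

# A 10-fold CV would train roughly ten models per candidate solution.
trainings_with_10_fold_cv = 10 * model_trainings

# Positive songs left for training if only 2/3 of a training set were used.
positives_with_3_fold_cv = [(2 * n) // 3 for n in (15, 26)]  # [10, 17]
```

Under these assumptions, a 10-fold CV would have required close to 20 million model trainings, and a 3-fold CV would have left only about 10 to 17 positive songs per training run.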

Figure 5.2 illustrates the ND fronts of the final solutions after the experiments. It can be clearly observed that mood recognition tasks are more complex than instrument identification in polyphonic mixtures. This is related to the ground truth, which is less precise than for the instrument recognition task: as discussed above, it cannot be guaranteed that negative examples are always really negative.

The share of each classification method in the overall-classifier ND front varies from category to category. NB provides the smallest mBRE for the categories Aggressive and Sentimental, RF for the five other categories, and SVM only for Reflective. C4.5 contributes only rarely to the overall ND front.
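How an overall-classifier ND front and the share of each method in it can be derived is sketched below. Both criteria are minimised. The function name is hypothetical and the numbers are invented for illustration only; they are not results from Figure 5.2.

```python
from collections import Counter


def non_dominated(points):
    """Return the non-dominated points when both criteria are minimised.

    points: list of (sfr, bre, classifier) tuples.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Invented (feature rate, balanced error, classifier) triples for one mood.
solutions = [
    (0.10, 0.30, "RF"),
    (0.30, 0.25, "RF"),
    (0.05, 0.35, "NB"),
    (0.20, 0.28, "NB"),
    (0.15, 0.32, "SVM"),   # dominated by (0.10, 0.30, "RF")
    (0.25, 0.40, "C4.5"),  # dominated by several solutions
]

overall_front = non_dominated(solutions)
share = Counter(classifier for _, _, classifier in overall_front)
best_error = min(overall_front, key=lambda s: s[1])
```

In this toy example the overall front consists of two RF and two NB solutions, and the solution with the smallest balanced error comes from RF. The fronts per classifier would be obtained by calling the same function on the solutions of one classifier at a time.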

It is important to explain why we did not use any established song database for a better comparison with other studies. Unfortunately, at least at the time when we started our studies, these databases had (and still have) several limitations:

• Several databases do not contain complete songs. For example, GTZAN consists of 30 s song excerpts, and the Music Audio Benchmark Data Set [85] of 10 s excerpts. However, a part of our studies was to examine different processing methods, starting with features from complete songs (see [221]). Another motivation for FE from complete songs is that it is not straightforward to decide which part of a song should be ‘representative’. For example, in a study of Chinese pop songs, the representative segments marked by listeners who understood Chinese differed from the representative sections marked by the remaining listeners [28]. Some genres, like progressive rock, contain parts with varying properties (orchestra, longer segments with distorted guitars without any vocals, vocal segments, etc.), and it would not be advantageous to restrict the feature extraction interval to, e.g., 30 seconds from the middle of the song.

Figure 5.2: The best ND fronts after all mood recognition experiments. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. The ND fronts for each classifier are indicated with thin lines. The ND fronts across all classifiers are indicated with thick lines, and the markers of the corresponding models are enlarged.

• Free music data sets, such as RWC Magnatune, are often biased toward several genres and do not represent the popular commercial music well.

• Databases with large lists of commercial pop songs, such as USPOP or SLAC dataset, contain only a limited number of features. It is not possible to extract self-implemented characteristics, and it is expensive to buy a large collection of songs.

In document Improving supervised music classification by means of multi-objective evolutionary feature selection (Page 95-99)