
5. Application of Feature Selection

5.1.1. Instruments

The instrument recognition study was originally conducted for the work reported in [219], where the instruments were identified from intervals (2 tones played at the same time) and chords (3 and 4 tones played simultaneously). For the genre and style recognition task in Section 5.2, we selected only the chord-based models. The study overview is summarised in Table 5.1.

In the following, we explain the relevant study details:

• Classification tasks: Binary recognition of four different instrument groups (guitar, piano, wind, and strings) from polyphonic mixtures of 3 and 4 instrument samples from McGill University MUMS collection [48], the RWC database [72], and the University of Iowa instrument samples¹. First, 3,000 chords were randomly generated. 2,000 chords were reserved for the experiment set and 1,000 for the holdout set, for the independent validation of the models after FS (see Section 4.2 for the data set descriptions).

Table 5.1.: Parameters of the instrument recognition study.

| Parameter name | Values | No. |
|---|---|---|
| Classification tasks | | |
| Classification tasks | Guitar, Piano, Wind, Strings | 4 |
| Experiment set | 2,000 chords | 1 |
| Holdout set | 1,000 chords | 1 |
| Features and processing | | |
| Initial features | 1,148 audio features | 1 |
| Feature processing | NaN elimination, frame selection from attack and release interval middles and onset events for 795 features; larger building blocks for 353 features from [51] | 1 |
| Classification frames | Wc = −1 and Sc = −1 | 1 |
| Feature aggregation | - | 1 |
| Classification methods | | |
| Algorithms | C4.5, RF, NB, SVM with a linear kernel | 4 |
| Optimisation parameters | | |
| Optimisation metrics | mRE and mSFR | 1 |
| Optimisation algorithm | (30+1) SMS-EMOA | 1 |
| Mutation | Asymmetric bit flip with p01 = 0.01 and γ = 32 | 1 |
| Crossover | No crossover, uniform or commonality-based | 3 |
| Initial feature rate | ifr ∈ {0.5; 0.2; 0.05} | 3 |
| Number of evaluations | 2,000 | 1 |
| Evaluation method | Optimisation by 10-fold CV on the experiment set; independent validation on the holdout set | 1 |
| Statistical repetitions | - | 10 |
| Number of experiments | | 1,440 |
| Number of model trainings | | 29,232,000 |
| Number of model evaluations | | 32,155,200 |

• Features and processing: 265 mostly low-level audio descriptors were estimated from the middle frames of the attack intervals, the middle frames of the release intervals, and the interonset frames, resulting in a 795-dimensional feature vector. Another 353 characteristics were extracted from larger blocks, as described in [51]. The onset events were identified by the MIR Toolbox. If no onset or more than one onset was estimated, we selected the frame with the highest RMS energy as the onset frame. Therefore, each chord provided exactly one onset event. The classification window size Wc and the step size Sc were set to −1 (this setting corresponds to the building of the classification frames from complete audio recordings). Because only one feature value could be extracted from each chord through the aforementioned procedure, no further feature aggregation was required.

5.1. Recognition of high-level features 89

• Classification methods: Decision tree C4.5, random forest, naive Bayes, and support vector machines with a linear kernel (see Section 2.4 for the description of the algorithms). The default parameters of these methods were used as provided in RapidMiner [147]. For the reasons why we selected these algorithms, see the last paragraphs of Section 2.4.

• Optimisation parameters: The relative error mRE (Equ. 4.10)² from the 10-fold cross-validation on the experiment set³ and the selected features rate mSFR (Equ. 4.24) were optimised.

SMS-EMOA with asymmetric bit flip mutation (p01 = 0.01 and γ = 32), as described in Section 3.2.4, was used as the optimisation method. These parameters were selected after a preliminary study and experiments from [217]. Three different crossover possibilities (no crossover, uniform, and commonality-based) and three different values for the initial feature rate ifr ∈ {0.5; 0.2; 0.05} were tested. The number of evaluations was set to 2,000 after the preliminary experiments.

The overall number of statistical repetitions was set to 10. This limitation was necessary because of the long experiment runtimes, which ranged from several hours to several days. The last lines in Table 5.1 illustrate the high computational load. The number of model trainings for each experiment was equal to 10 · 30 (training of models based on 10-fold CV on the experiment set for the initial population of 30 individuals) + 10 · 2,000 (the same procedure for each offspring solution during 2,000 EA loop steps) = 20,300. Since 10 statistical repetitions for 144 combinations of parameter settings were run, the overall number of model trainings is 20,300 · 10 · 144 = 29,232,000. Similarly, the number of model validations for each experiment was equal to 10 · 30 + 30 (independent validation of the initial solutions on the holdout set) + 10 · 2,000 + 2,000 (validation of the offspring solutions on the holdout set) = 22,330, which gives 22,330 · 10 · 144 = 32,155,200 model validations overall.

It should be mentioned that the number of model trainings and evaluations provides only a very rough measure of the effort, for several reasons: NB experiments were more than 10 times faster than SVM experiments, and training and classification with fewer features for ifr = 0.05 was significantly faster at the beginning of the corresponding experiments than for ifr = 0.5. Also, the number of classification instances in the data sets differs between this study and the studies described in Sections 5.1.2 to 5.2, so that it is not possible to compare the computing efforts exactly across all studies.
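The counts in the last lines of Table 5.1 can be re-derived from the study parameters. The following Python sketch only repeats this arithmetic; the constant and function names are assumptions introduced for illustration and do not stem from the study software.

```python
# Study parameters as listed in Table 5.1 (names are illustrative).
CV_FOLDS = 10            # 10-fold CV on the experiment set
POPULATION_SIZE = 30     # initial population of the (30+1) SMS-EMOA
EVALUATIONS = 2_000      # EA loop steps, one offspring solution per step
REPETITIONS = 10         # statistical repetitions
# 4 tasks x 4 classifiers x 3 crossover settings x 3 initial feature rates
PARAMETER_COMBINATIONS = 4 * 4 * 3 * 3


def trainings_per_experiment() -> int:
    """Models trained during CV for the initial population and all offspring."""
    return CV_FOLDS * POPULATION_SIZE + CV_FOLDS * EVALUATIONS


def validations_per_experiment() -> int:
    """CV validations plus one holdout validation per evaluated solution."""
    cv_validations = CV_FOLDS * POPULATION_SIZE + CV_FOLDS * EVALUATIONS
    holdout_validations = POPULATION_SIZE + EVALUATIONS
    return cv_validations + holdout_validations


experiments = PARAMETER_COMBINATIONS * REPETITIONS
total_trainings = trainings_per_experiment() * experiments
total_validations = validations_per_experiment() * experiments

print(experiments, total_trainings, total_validations)
```

As noted above, these totals count model buildings and evaluations only; they say nothing about the runtime of an individual training, which depends on the classifier and on the number of selected features.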

Figure 5.1 shows the best non-dominated fronts of the final solutions after the experiments. The holdout mRE is plotted on the horizontal axis and mSFR on the vertical axis. C4.5 solutions are marked with blue circles, RF with red squares, NB with green diamonds, and SVM with yellow triangles.

The trade-off solution fronts can be observed: the smallest mRE is achieved by the models with the largest numbers of features, and at the other end there are models built from very few features. These boundary solutions may sometimes be less promising for a decision maker. However, several models closer to the upper left side of the ND front can indeed be relevant. Consider, for example, the two SVM solutions (marked with triangles) in the upper left corner of the upper right subfigure (Strings category): it is possible to reduce the share of selected features from 44.34% to 23.26% while keeping almost the same mRE, which increases only from 0.139 to 0.14. This situation illustrates very well that the number of features can be strongly decreased while the classification quality remains almost the same. Models with fewer features and a slightly diminished classification performance can even be preferable: we already discussed in Section 3.1 that models built from smaller feature sets allow faster training and classification, reduce storage demands, and have a reduced tendency to be overfitted.

Figure 5.1.: The best ND fronts after all instrument recognition experiments. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. The ND fronts for each classifier are indicated with thin lines. The ND fronts across all classifiers are indicated with thick lines, and the markers of the corresponding models are enlarged.

² In [219], we refer to the mean squared error: the implemented metric was indeed MSE, but for this study setup it is equal to mRE.

³ For simplicity, we denote the metrics which were estimated as an average over the n CV folds by the same symbols as the metrics estimated by a single model building and evaluation (both approaches are illustrated in Fig. 4.2, Section 4.2).

The classification models built by RF and SVM have the largest share of the overall ND solutions across all classifiers (this front is marked with the thick line). But there also exist some NB and C4.5 solutions, which belong to the best ND front, and are not dominated by any RF or SVM solution. This means that it is reasonable to combine different classification methods. This observation was measured in terms of hypervolume contributions of different classification methods in [215].
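The relation between the per-classifier ND fronts and the overall ND front can be illustrated with a small Python sketch. It assumes that both mRE and mSFR are minimised and that mSFR is given as a fraction. Only the two SVM points are taken from the Strings example in the text; the NB and RF points as well as all function names are hypothetical and serve purely as an illustration.

```python
from typing import Dict, List, Tuple

Solution = Tuple[float, float]  # (holdout mRE, mSFR as a fraction), both minimised


def dominates(a: Solution, b: Solution) -> bool:
    """a dominates b if it is no worse in both metrics and better in at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def non_dominated(points: List[Solution]) -> List[Solution]:
    """ND front of one set of solutions, sorted by mRE."""
    front = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(set(front))


def overall_front(by_classifier: Dict[str, List[Solution]]) -> List[Tuple[str, Solution]]:
    """ND front across all classifiers, keeping the classifier label."""
    labelled = [(name, s) for name, sols in by_classifier.items() for s in sols]
    return [(name, s) for name, s in labelled
            if not any(dominates(t, s) for _, t in labelled)]


solutions = {
    "SVM": [(0.139, 0.4434), (0.140, 0.2326)],  # values quoted in the text
    "NB": [(0.20, 0.05), (0.25, 0.30)],         # hypothetical
    "RF": [(0.15, 0.30)],                       # hypothetical
}

for name, sols in solutions.items():
    print(name, non_dominated(sols))
print(overall_front(solutions))
```

In this toy data set, one NB solution with very few features stays on the overall front although its error is larger than that of every SVM solution, which mirrors the argument for combining several classification methods.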


It is also important to mention that the solutions plotted in Fig. 5.1 may not correspond to the very best trade-off solutions, because the validation was done on the independent holdout set. However, we measured the generalisation performance of the models in [219], and it was very similar for the optimisation and the holdout sets. Another observation is that some solutions from earlier generations provide better classification results on the holdout set than the final solutions after 2,000 evaluations. This effect may be reduced by early stopping, as discussed in [124] and mentioned in Section 3.4.1. However, this requires an additional outer validation loop and would further increase the already very high number of model training and validation steps (see the explanation of the last three lines of Table 5.1 above). These experiments were not possible within the scope of this thesis, but are a reasonable direction for future work.
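The idea of early stopping mentioned above can be sketched in Python as follows. This is a generic patience-based rule chosen for illustration, not necessarily the procedure discussed in [124]; the function name, the `patience` parameter, and the error values are assumptions. The errors would have to come from an outer validation set that is separate from the holdout set.

```python
from typing import Sequence


def early_stopping_index(val_errors: Sequence[float], patience: int) -> int:
    """Index of the best validation error seen before the error failed to
    improve for `patience` consecutive steps (or before the sequence ended)."""
    if not val_errors:
        raise ValueError("val_errors must not be empty")
    best_idx = 0
    for i in range(1, len(val_errors)):
        if val_errors[i] < val_errors[best_idx]:
            best_idx = i          # new best solution found
        elif i - best_idx >= patience:
            break                 # no improvement for `patience` steps
    return best_idx


# Hypothetical validation errors recorded at successive checkpoints.
errors = [0.30, 0.25, 0.22, 0.23, 0.24, 0.26, 0.21]
print(early_stopping_index(errors, patience=3))
print(early_stopping_index(errors, patience=10))
```

With a small patience the search stops at the third checkpoint, whereas a larger patience lets it continue to the later, slightly better solution; the choice of patience therefore trades runtime against the risk of stopping too early.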