
5. Application of Feature Selection

5.2. Recognition of genres and styles

5.2.1. Low-level feature selection

Figure 5.8 plots the non-dominated fronts of the final solutions after 3,000 SMS-EMOA generations. The identification of classical music pieces is the simplest categorisation task: the lowest ms_BRE is 0.0113 (achieved with RF), and all solutions of the overall ND front, except for one C4.5 model, have ms_BRE < 0.05. The most challenging categories are ClubDance (smallest ms_BRE = 0.1442) and Pop (smallest ms_BRE = 0.1236). In our opinion, however, these results are also promising for these hard-to-classify styles. As stated in the other studies, the all-classifier ND front contains solutions of several classification methods, and for all categories at least three of the four classifiers contribute to this front. Moreover, the non-dominated solutions with the lowest ms_BRE values are created by different classifiers across the tested categories. This strengthens the suggestion that it is reasonable to include several classification algorithms in genre and style classification. For a general evaluation of EMO-FS, it is necessary to measure the increase of the multi-objective performance between the first and last generations of SMS-EMOA.

This can be done by estimating the mean hypervolume S on the holdout set across 10 statistical repetitions before and after optimisation. An increase of the hypervolume on the holdout set means that the models built with the optimised feature subsets generalise better and also perform well on data that have been involved neither in model training nor in validation during the optimisation process.

As plotted in Fig. 5.9, the dominated hypervolume clearly increases. Its progress is measured in per cent, relative to the initial dominated hypervolume on the holdout set. We denote the mean initial dominated hypervolume on the holdout set by S^H_init, and the mean final dominated hypervolume on the holdout set by S^H_fin. The larger markers correspond to the experiments with ifr = 0.5, and the smaller markers to the runs with ifr = 0.2. C4.5 is marked with blue circles, RF with red squares, NB with green diamonds, and SVM with yellow triangles. The categorisation tasks are separated by thick vertical lines. Because the experiments with ifr = 0.2 already start with smaller feature sets than the experiments with ifr = 0.5, the increase of hypervolume is not as high. The increase of the mean dominated holdout hypervolume during the optimisation is approximately the same for all categories in spite of their different complexity.
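The relative hypervolume increase described above can be illustrated with a minimal sketch. It assumes two minimised objectives (ms_BRE and m_SFR, both in [0, 1]) and a reference point of (1, 1); the reference point, the function names and the toy fronts are assumptions made for illustration and are not taken from the experiments.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (ms_BRE, m_SFR), both assumed to be minimised


def non_dominated(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first objective."""
    front: List[Point] = []
    best_second = float("inf")
    # After sorting by the first objective, a point is non-dominated
    # only if it strictly improves the second objective seen so far.
    for p in sorted(set(points)):
        if p[1] < best_second:
            front.append(p)
            best_second = p[1]
    return front


def hypervolume_2d(points: List[Point], ref: Point = (1.0, 1.0)) -> float:
    """Area dominated by the points and bounded by the reference point."""
    front = [p for p in non_dominated(points) if p[0] < ref[0] and p[1] < ref[1]]
    area = 0.0
    prev_second = ref[1]
    # Sum horizontal strips; the front has ascending first and
    # descending second objective values.
    for first, second in front:
        area += (ref[0] - first) * (prev_second - second)
        prev_second = second
    return area


def relative_increase_percent(s_init: float, s_fin: float) -> float:
    """Hypervolume increase in per cent, relative to the initial value."""
    return 100.0 * (s_fin - s_init) / s_init


# Hypothetical holdout fronts before and after optimisation.
initial_front = [(0.25, 0.80), (0.30, 0.50)]
final_front = [(0.10, 0.30), (0.15, 0.10)]

s_init = hypervolume_2d(initial_front)
s_fin = hypervolume_2d(final_front)
print(f"S_init = {s_init:.2f}, S_fin = {s_fin:.2f}")
print(f"relative increase = {relative_increase_percent(s_init, s_fin):.1f} %")
```

In the experiments the hypervolumes are averaged over the 10 statistical repetitions before the relative increase is reported; the sketch shows a single pair of fronts only.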

The increase of the hypervolume on the holdout set after the optimisation is confirmed as being significant in all cases by the Wilcoxon signed rank test for the following test setup:


Figure 5.8.: The best ND fronts after genre and style recognition with the LL set. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. The ND fronts for each classifier are indicated with thin lines. The ND fronts across all classifiers are indicated with thick lines, and the markers of the corresponding models are enlarged.

Figure 5.9.: Increase of the relative mean holdout dominated hypervolume after the optimisation. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. Large markers: ifr = 0.5, small markers: ifr = 0.2.

• For a fixed classifier and ifr setting, denoted by the index i ∈ {1, ..., 8}, and a fixed classification task, denoted by its index j ∈ {1, ..., 6}, let u(i, j, LL) be the vector of the initial dominated hypervolumes estimated on the holdout set for the experiments with the LL feature set, so that u_k(i, j, LL) = S^H_init(i, j, k, LL) corresponds to the hypervolume value from the k-th statistical repetition, k ∈ {1, ..., 10}. Similarly, let v(i, j, LL) be the vector of the final dominated hypervolumes estimated on the holdout set, so that v_k(i, j, LL) = S^H_fin(i, j, k, LL).

• H0: u and v belong to the same probability distribution.

• H1: The distributions are not equal.

The p-value of the tests applied for each combination of a classification method and a categorisation problem is equal to 0.002, and H0 is always rejected.
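The following sketch shows how such a p-value can arise. It assumes, which the text does not state, that the exact two-sided variant of the test was used and that the ten paired differences contain no zeros or ties; the function name and the data are hypothetical. If all ten final hypervolumes exceed the initial ones, the exact test yields its smallest attainable p-value for ten pairs, 2/2^10, which rounds to 0.002.

```python
from itertools import product
from typing import Sequence


def wilcoxon_exact_p(u: Sequence[float], v: Sequence[float]) -> float:
    """Two-sided exact p-value of the Wilcoxon signed rank test.

    Assumes paired samples with no zero differences and no tied
    absolute differences.
    """
    diffs = [b - a for a, b in zip(u, v)]
    n = len(diffs)
    # Rank the absolute differences (rank 1 = smallest).
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0] * n
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w_obs = min(w_plus, w_minus)
    # Under H0 each of the 2^n sign patterns is equally likely,
    # so the null distribution can be enumerated directly.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if w <= w_obs:
            hits += 1
    return min(1.0, 2.0 * hits / 2 ** n)


# Hypothetical initial (u) and final (v) holdout hypervolumes
# from 10 statistical repetitions; every final value is larger.
u = [0.50 + 0.001 * k for k in range(10)]
v = [u_k + 0.20 + 0.01 * k for k, u_k in enumerate(u)]

p = wilcoxon_exact_p(u, v)
print(p)  # 2 / 2**10 = 0.001953125, which rounds to 0.002
```

Under these assumptions, a p-value of 0.002 for every combination indicates that the hypervolume increased in each of the ten repetitions.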

It also makes sense to evaluate the increase of the single-objective performance w.r.t. ms_BRE, because the classification quality is usually more relevant than the number of features. For this goal, we estimated ms_BRE using complete feature sets for each combination of a classification task and a classification method as a baseline method without FS. Then, the boundary solutions with the smallest ms_BRE after the optimisation were saved for comparison. Figure 5.10 shows the mean ms_BRE decrease over 10 statistical repetitions for the ND solution with the smallest ms_BRE (and the largest m_SFR), denoted by ms_BRE(Φbest), relative to the ms_BRE produced by the complete feature set, denoted by ms_BRE(Φall).

For C4.5, the ms_BRE decrease is between 22.66% and 51.94%. For RF, it is between 20.95% and 54.28%, for NB between 21.38% and 77.39%, and for SVM between 10.08% and 47.95%. This means that the optimised models are not only better with respect to the dominated hypervolume, but also achieve smaller error rates. In general, it cannot be expected that the full feature sets always perform worse with regard to a quality performance measure. However, this is often the case, because too many irrelevant features overwhelm classification methods, as discussed in Section 3.1. The benefit varies depending on the classifier and the task: for example, all error decrease rates are below 40% for the ClubDance category, and above 40% for Classic. NB and RF profit more strongly for Classic, Rap, HeavyMetal, and ProgRock, but achieve only smaller improvements for Pop and ClubDance.
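The percentages above are presumably relative decreases with respect to the error of the complete feature set. A possible formalisation is sketched below; the symbol Δ and the numbers in the worked line are assumptions for illustration only, with Φbest denoting the feature subset of the ND solution with the smallest ms_BRE and Φall the complete feature set.

```latex
\Delta = 100\,\% \cdot
  \frac{\mathrm{ms\_BRE}(\Phi_{\mathrm{all}}) - \mathrm{ms\_BRE}(\Phi_{\mathrm{best}})}
       {\mathrm{ms\_BRE}(\Phi_{\mathrm{all}})},
\qquad
\text{e.g. } 100\,\% \cdot \frac{0.20 - 0.15}{0.20} = 25\,\%
```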


Figure 5.10.: Decrease of ms_BRE for the best-ms_BRE solution after the optimisation, compared to the error using the complete feature set. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. Large markers: ifr = 0.5, small markers: ifr = 0.2.

The ms_BRE decrease was also assessed with the Wilcoxon signed rank test. For each combination of a classifier and a categorisation task, H0 is rejected, using the following test setup:

• For a fixed classifier and ifr setting, denoted by the index i ∈ {1, ..., 8}, and a fixed classification task, denoted by its index j ∈ {1, ..., 6}, let u(i, j, LL, Φbest) be the vector of the smallest ms_BRE estimated on the holdout set for the experiments with the LL feature set, so that u_k(i, j, LL, Φbest) = ms_BRE(i, j, k, LL, Φbest) corresponds to the ms_BRE-best value from the k-th statistical repetition, k ∈ {1, ..., 10}. Similarly, let v(i, j, LL, Φall) be the vector of ms_BRE estimated on the holdout set if all features are switched on, so that v_k(i, j, LL, Φall) = ms_BRE(i, j, k, LL, Φall) (in this case, v_1 = v_2 = ... = v_10).

• H0: u and v belong to the same probability distribution.

• H1: The distributions are not equal.

The p-value of the tests is in all cases 0.002, except for SVM with ifr = 0.2 for the Rap category, where p = 0.049 is only slightly below the 0.05 boundary.
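A short worked count may help to interpret the borderline value. It assumes, which the text does not state, that the exact two-sided test was applied to n = 10 differences without zeros or ties; W denotes the smaller of the two signed rank sums. Under H0 each of the 2^10 = 1024 sign patterns is equally likely, and 25 of them yield a rank sum of at most 8, whereas 33 yield a rank sum of at most 9.

```latex
\begin{align*}
p(W = 8) &= 2 \cdot \frac{25}{2^{10}} = \frac{50}{1024} \approx 0.049, \\
p(W = 9) &= 2 \cdot \frac{33}{2^{10}} = \frac{66}{1024} \approx 0.064.
\end{align*}
```

Under these assumptions, p = 0.049 corresponds to W = 8, the largest rank sum that is still significant at the 0.05 level for ten pairs.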

Figures 5.9 and 5.10 can be compared. Starting with a smaller number of features (ifr = 0.2) obviously leads to a lower relative increase of hypervolume, but we cannot observe any significant impact of the choice of ifr on the solutions with the smallest ms_BRE. A similar tendency was also observed in [223]: an initial population of feature sets with larger errors did not lead to a significantly different performance than an initialisation with feature sets that produced smaller errors. Because the classification categories are very different, and the feature selection problem is also very complex, starting with ‘better’ feature subsets may lead to two very different outcomes: the probability of getting stuck in local minima may increase, or it may indeed be possible to benefit from the initial advantage of smaller feature sets and overcome the local optima, if the mutation strength is high enough.

In document Improving supervised music classification by means of multi-objective evolutionary feature selection (Page 108-112)