5. Application of Feature Selection
5.2. Recognition of genres and styles
5.2.1. Low-level feature selection
Figure 5.8 plots the non-dominated fronts of the final solutions after 3,000 SMS-EMOA generations. The identification of classical music pieces is the simplest categorisation task: the lowest msBRE is 0.0113 (achieved with RF), and all solutions of the overall ND front, except for one C4.5 model, have msBRE < 0.05. The most challenging categories are ClubDance (smallest msBRE = 0.1442) and Pop (smallest msBRE = 0.1236). In our opinion, however, these results are still promising for such hard-to-classify styles. As already observed in the other studies, the all-classifier ND front contains solutions of several classification methods, and for all categories at least three of the four classifiers contribute to this front. Moreover, the non-dominated solutions with the lowest msBRE values are produced by different classifiers across the tested categories. This supports the suggestion that it is reasonable to include several classification algorithms in genre and style classification. For a general evaluation of EMO-FS, it is necessary to measure the increase of the multi-objective performance between the first and last generations of SMS-EMOA.
This can be done by estimating the mean hypervolume S on the holdout set across 10 statistical repetitions before and after optimisation. An increase of the hypervolume on the holdout set means that the models built with the optimised feature subsets generalise better and also perform well on data which have been involved neither in model training nor in validation during the optimisation process.
As plotted in Fig. 5.9, the dominated hypervolume clearly increases. Here, its progress is measured in per cent, relative to the initial dominated hypervolume on the holdout set. We denote the mean initial dominated hypervolume on the holdout set by $S^{init}_{H}$, and the mean final dominated hypervolume on the holdout set by $S^{fin}_{H}$. The larger markers correspond to the experiments with ifr = 0.5, and the smaller markers to the runs with ifr = 0.2. C4.5 is marked with blue circles, RF with red squares, NB with green diamonds, and SVM with yellow triangles. The categorisation tasks are separated by thick vertical lines. Because the experiments with ifr = 0.2 already start with smaller feature sets than the experiments with ifr = 0.5, their increase of hypervolume is smaller. The increase of the mean dominated holdout hypervolume during the optimisation is approximately the same for all categories in spite of their different complexity.
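The relative hypervolume increase described above can be sketched as follows. This is only a minimal illustration, not the implementation used in the study: it assumes that both objectives (msBRE and mSFR) are minimised, that the reference point is (1, 1), and that the increase is computed from the means over the repetitions. The function names and the example values are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (msBRE, mSFR); both assumed to be minimised


def non_dominated(points: Sequence[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first objective."""
    front: List[Point] = []
    best_second = float("inf")
    # After sorting by the first objective, a point is non-dominated
    # exactly when it strictly improves the best second objective so far.
    for p in sorted(set(points)):
        if p[1] < best_second:
            front.append(p)
            best_second = p[1]
    return front


def hypervolume_2d(points: Sequence[Point],
                   ref: Point = (1.0, 1.0)) -> float:
    """Area dominated by the points and bounded by the reference point.

    The reference point (1, 1) is an assumption made for this sketch.
    """
    front = [p for p in non_dominated(points)
             if p[0] < ref[0] and p[1] < ref[1]]
    volume = 0.0
    prev_second = ref[1]
    # Sum the horizontal slabs between consecutive front points.
    for first, second in front:
        volume += (ref[0] - first) * (prev_second - second)
        prev_second = second
    return volume


def relative_increase(initial: Sequence[float],
                      final: Sequence[float]) -> float:
    """Increase of the mean hypervolume in per cent of the initial mean."""
    s_init = mean(initial)
    s_fin = mean(final)
    return 100.0 * (s_fin - s_init) / s_init


# Hypothetical holdout fronts of one repetition, before and after optimisation.
initial_front = [(0.30, 0.50), (0.20, 0.80)]
final_front = [(0.10, 0.30), (0.05, 0.60)]
s_init = hypervolume_2d(initial_front)
s_fin = hypervolume_2d(final_front)
print(relative_increase([s_init], [s_fin]))
```

In an actual evaluation, the two lists passed to relative_increase would hold one holdout hypervolume per statistical repetition.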
The increase of the hypervolume on the holdout set after the optimisation is confirmed as being significant in all cases by the Wilcoxon signed rank test for the following test setup:
Figure 5.8.: The best ND fronts after genre and style recognition with the LL set. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. The ND fronts for each classifier are indicated with thin lines. The ND fronts across all classifiers are indicated with thick lines, and the markers of the corresponding models are enlarged.
Figure 5.9.: Increase of the relative mean holdout dominated hypervolume after the optimisation. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. Large markers: ifr = 0.5, small markers: ifr = 0.2.
• For a fixed classifier and ifr setting, denoted by the index i ∈ {1, ..., 8}, and a fixed classification task, denoted by its index j ∈ {1, ..., 6}, let u(i, j, LL) be the vector of the initial dominated hypervolumes estimated on the holdout set for the experiments with the LL feature set, so that $u_k(i, j, \mathrm{LL}) = S^{init}_{H}(i, j, k, \mathrm{LL})$ corresponds to the hypervolume value from the k-th statistical repetition, k ∈ {1, ..., 10}. Similarly, let v(i, j, LL) be the vector of the final dominated hypervolumes estimated on the holdout set, so that $v_k(i, j, \mathrm{LL}) = S^{fin}_{H}(i, j, k, \mathrm{LL})$.
• H0: u and v belong to the same probability distribution.
• H1: The distributions are not equal.
The p-value of the tests applied for each combination of a classification method and a categorisation problem is equal to 0.002, and H0 is always rejected.
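To illustrate the test setup above, the following sketch computes the exact two-sided p-value of the Wilcoxon signed rank test by enumerating all sign patterns of the ranks. It is a simplified illustration rather than the procedure used in the study: it assumes that there are no zero differences and no tied absolute differences, the function name is hypothetical, and the example vectors are invented.

```python
from itertools import product
from typing import Sequence


def wilcoxon_exact_p(u: Sequence[float], v: Sequence[float]) -> float:
    """Exact two-sided p-value of the Wilcoxon signed rank test.

    Assumes paired samples of equal length without zero differences
    and without ties among the absolute differences.
    """
    diffs = [b - a for a, b in zip(u, v)]
    n = len(diffs)

    # Rank the absolute differences (1 = smallest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    # Test statistic: sum of the ranks of the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under H0, each of the 2^n sign patterns is equally likely.
    sums = [sum(r for r, s in zip(range(1, n + 1), signs) if s)
            for signs in product((0, 1), repeat=n)]
    lower = sum(s <= w_plus for s in sums) / len(sums)
    upper = sum(s >= w_plus for s in sums) / len(sums)
    return min(1.0, 2.0 * min(lower, upper))


# Invented hypervolumes for 10 repetitions; v exceeds u in every repetition.
u = [0.40, 0.42, 0.39, 0.41, 0.43, 0.38, 0.40, 0.44, 0.37, 0.42]
v = [0.55, 0.58, 0.52, 0.60, 0.57, 0.51, 0.59, 0.63, 0.49, 0.62]
print(wilcoxon_exact_p(u, v))
```

With 10 repetitions and these assumptions, the smallest attainable two-sided p-value is 2/1024 ≈ 0.002, which is reached whenever all ten differences have the same sign.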
It also makes sense to evaluate the increase of the single-objective performance w.r.t. msBRE, because the classification quality is usually more relevant than the number of features. To this end, we estimated msBRE using the complete feature set for each combination of a classification task and a classification method, as a baseline without FS. Then, the boundary solutions with the smallest msBRE after the optimisation were saved for comparison. Figure 5.10 shows the mean msBRE decrease over 10 statistical repetitions for the ND solution with the smallest msBRE (and the largest mSFR), denoted by msBRE(Φbest), relative to the msBRE produced by the complete feature set, denoted by msBRE(Φall).
For C4.5, the msBRE decrease is between 22.66% and 51.94%. For RF, it is between 20.95% and 54.28%, for NB between 21.38% and 77.39%, and for SVM between 10.08% and 47.95%. This means that the optimised models are not only better with respect to the dominated hypervolume, but also achieve smaller error rates. In general, it cannot be expected that the full feature sets always perform worse with regard to a quality measure. However, this is often the case, because too many irrelevant features overwhelm classification methods, as discussed in Section 3.1. The benefit varies depending on the classifier and the task: for example, all error decrease rates are below 40% for the ClubDance category, and above 40% for Classic. NB and RF profit more strongly for Classic, Rap, HeavyMetal, and ProgRock, but achieve only smaller improvements for Pop and ClubDance.
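The relative error decrease reported above can be illustrated with a short sketch. The exact formula is not given in the text, so this assumes that the decrease is the difference between the full-set error and the mean best error, expressed in per cent of the full-set error. The function name and all numbers are hypothetical.

```python
from statistics import mean
from typing import Sequence


def relative_error_decrease(best_errors: Sequence[float],
                            full_set_error: float) -> float:
    """Mean msBRE decrease in per cent of the full-feature-set msBRE.

    best_errors holds the smallest holdout msBRE of each repetition;
    full_set_error is the holdout msBRE obtained with all features.
    """
    return 100.0 * (full_set_error - mean(best_errors)) / full_set_error


# Invented values for 10 statistical repetitions.
best = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10, 0.12, 0.12]
print(relative_error_decrease(best, full_set_error=0.20))
```

Because the baseline is a single value here, averaging the errors first or averaging the per-repetition percentages gives the same result.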
Figure 5.10.: Decrease of msBRE for the best-msBRE solution after the optimisation, compared to the error using the complete feature set. Circles: C4.5, squares: RF, diamonds: NB, triangles: SVM. Large markers: ifr = 0.5, small markers: ifr = 0.2.
The significance of the msBRE decrease was again assessed with the Wilcoxon signed rank test. For each combination of a classifier and a categorisation task, H0 is rejected, using the following test setup:
• For a fixed classifier and ifr setting, denoted by the index i ∈ {1, ..., 8}, and a fixed classification task, denoted by its index j ∈ {1, ..., 6}, let u(i, j, LL, Φbest) be the vector of the smallest msBRE estimated on the holdout set for the experiments with the LL feature set, so that $u_k(i, j, \mathrm{LL}, \Phi_{best}) = \mathrm{msBRE}(i, j, k, \mathrm{LL}, \Phi_{best})$ corresponds to the msBRE-best value from the k-th statistical repetition, k ∈ {1, ..., 10}. Similarly, let v(i, j, LL, Φall) be the vector of msBRE estimated on the holdout set if all features are switched on, so that $v_k(i, j, \mathrm{LL}, \Phi_{all}) = \mathrm{msBRE}(i, j, k, \mathrm{LL}, \Phi_{all})$ (in this case, $v_1 = v_2 = \dots = v_k$).
• H0: u and v belong to the same probability distribution.
• H1: The distributions are not equal.
The p-value of the tests is 0.002 in all cases, except for SVM with ifr = 0.2 for the Rap category, where p = 0.049 is only slightly below the 0.05 boundary.
Figures 5.9 and 5.10 can be compared. Starting with a smaller number of features (ifr = 0.2) obviously leads to a lower relative increase of hypervolume, but we cannot observe any significant impact of the choice of ifr on the solutions with the smallest msBRE. A similar tendency was also observed in [223]: an initial population of feature sets with larger errors did not lead to a significantly different performance than an initialisation with feature sets which produced smaller errors. Because the classification categories are very different, and the feature selection problem is also very complex, starting with ‘better’ feature subsets may lead to two very different outcomes: the probability of getting stuck in local optima may increase, or it may indeed be possible to benefit from the initial advantage of smaller feature sets and overcome the local optima, if the mutation strength is high enough.