
Discussion and conclusions

6.1 Concluding remarks

6.1.4 Open-source code

To increase our contribution to the scientific community, we have released the code of every algorithm and tool developed in the course of this Ph.D. thesis. All of the code is available in GitHub repositories under the GPLv3 license.

The aim of publicly providing the code of the developed algorithms is two-fold: I) to facilitate the reproducibility of results, so that anyone interested can easily reproduce the experimental results of our research papers and apply the algorithms to new datasets in their own research; and II) to allow any researcher interested in modifying or adapting any aspect of the algorithms to do so in their future research. We have made available not only the code of EME and EAGLET, but also that of two other preliminary evolutionary approaches: CCEA [C13] and G3P-kEMLC [C14]. These

We have also made publicly available the code of two tools. The first is a library to execute the Mulan algorithms from a command-line interface, which has been used to carry out the experiments of this thesis. With this code, any researcher who wants to execute these methods can simply download the library and use it, or extend it with new features such as other algorithms and evaluation measures. The second is MLDA [J2], a tool for analyzing and preprocessing multi-label datasets. This tool is described in more detail in Section 7.1.

The repositories derived from this Ph.D. thesis are listed below.

• ExecuteMulan. https://github.com/kdis-lab/ExecuteMulan
• EME. https://github.com/kdis-lab/EME
• EAGLET. https://github.com/kdis-lab/EAGLET
• MLDA. https://github.com/i02momuj/MLDA
• CCEA. https://github.com/kdis-lab/CCEA_EMLCs
• G3P-kEMLC. https://github.com/kdis-lab/G3P-kEMLC

6.2 Future work

Although two EAs to build EMLCs have been developed within the scope of this Ph.D. thesis, we consider that there is still work to do to improve both the predictive performance and the efficiency of these models, as well as to find new structures for the EMLCs or new approaches to build them. Below, we highlight some lines of future work.

Using variable k values. In both EME and EAGLET, as in RAkEL, each member of the ensemble is focused on predicting a subset of k labels. In all of them, k is a parameter of the algorithm and is fixed for all multi-label classifiers. Usually, k = 3 is used, as proposed in [53]; however, whether each label is better modeled along with a lower or higher number of other labels depends on the problem. Therefore, we propose to study the effect of considering a variable value of k in each member of the ensemble.

A first approximation has been made in [C14], which is presented in Section 7.3. In this method, a different value of k is used in each base classifier, chosen randomly from a range given as a parameter. However, these values are fixed at the beginning of the evolution for each base classifier and are not modified afterwards. Although the method is able to select the most suitable members for the ensemble, it is not able to modify the number of labels considered in each of them.
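As an illustration of labelsets with a variable k, the following minimal Python sketch draws one labelset per ensemble member, with k chosen uniformly at random from a given range. The function name, its parameters, and the uniform sampling are assumptions made for this example; they are not taken from the G3P-kEMLC code.

```python
import random


def random_labelsets(n_labels, n_members, k_min, k_max, seed=0):
    """Draw one labelset per ensemble member, each with its own size k.

    Hypothetical helper: k is sampled uniformly from [k_min, k_max]
    (inclusive) and then stays fixed, as in the first approximation above.
    """
    rng = random.Random(seed)
    labelsets = []
    for _ in range(n_members):
        k = rng.randint(k_min, k_max)  # both bounds are inclusive
        # Sample k distinct label indices without replacement.
        labelsets.append(sorted(rng.sample(range(n_labels), k)))
    return labelsets


labelsets = random_labelsets(n_labels=10, n_members=6, k_min=2, k_max=4)
for member, labels in enumerate(labelsets):
    print(f"member {member}: k={len(labels)}, labels={labels}")
```

Adapting k during the evolution would additionally require operators that add labels to or remove labels from a labelset, which this sketch does not cover.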

We consider that using a variable value of k and automatically adapting it throughout the evolution would lead to better predictive performance, since each classifier would select the most appropriate number of labels. Nevertheless, this is not straightforward to apply to the previously proposed algorithms, since many critical parts of the EAs must be modified, such as the mutation and crossover operators, as well as some ensemble-specific parameters, such as the number of classifiers in the ensemble.

New approaches to select the ensemble members. Since the construction of the EMLCs is one of the key points to obtain a high-performing method, other ways to generate the ensemble should be investigated. In EAGLET, an iterative greedy algorithm is proposed to build the ensemble, which selects in each iteration the classifier that maximizes a combination of accuracy and diversity.
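The greedy selection idea can be sketched in Python as follows. This is only an illustration under stated assumptions: candidates are encoded as binary label-membership vectors, diversity is the mean Hamming distance to the members already selected, and the two criteria are combined by a weighted sum with a hypothetical weight `beta`. The exact measures used by EAGLET may differ.

```python
def hamming(a, b):
    """Fraction of positions in which two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_select(candidates, accuracy, n_members, beta=0.5):
    """Select ensemble members one at a time.

    candidates: binary label-membership vectors (assumed encoding).
    accuracy:   one quality score in [0, 1] per candidate.
    beta:       assumed trade-off weight between accuracy and diversity.
    """
    n_members = min(n_members, len(candidates))
    # Start from the single most accurate candidate.
    selected = [max(range(len(candidates)), key=lambda i: accuracy[i])]
    while len(selected) < n_members:

        def score(i):
            # Mean distance to the members chosen so far.
            diversity = sum(
                hamming(candidates[i], candidates[j]) for j in selected
            ) / len(selected)
            return beta * accuracy[i] + (1 - beta) * diversity

        remaining = [i for i in range(len(candidates)) if i not in selected]
        selected.append(max(remaining, key=score))
    return selected


candidates = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
accuracy = [0.9, 0.85, 0.6, 0.7]
print(greedy_select(candidates, accuracy, n_members=2))  # [0, 2]
```

In this toy example the second pick is the less accurate candidate 2, because candidate 1 covers exactly the same labels as the first pick and adds no diversity.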

The aim would be to have an EA that is able not only to evolve separate members, but also to evolve the structure of the ensemble (instead of using this iterative process to build it). One idea could be to design an EA in two steps: the first step focused on obtaining a pool of good candidate individuals for the ensemble, and the second step focused on combining these individuals into an ensemble.

Using different populations. So far, we have used EAs involving a single population; a natural extension is to explore the possibility of dealing with different subpopulations at the same time. Each of these subpopulations could be focused on a different part of the problem (for instance, considering different subsets of instances or labels), optimizing the classifiers according to its own criteria. Then, to generate the ensemble, individuals from different subpopulations might be used, so that the diversity of the EMLC would improve.

A first approach based on a CCEA has already been proposed [C13], and it is presented in Section 7.2. In this method, several subpopulations are used, each focused on a different subset of the data. These subpopulations not only evolve separately, but also exchange useful information with each other every certain number of generations. However, we think that there is still work to do in this direction.

Different structures for the EMLC. Both EMLCs proposed in this dissertation have the same final structure: the predictions of n multi-label classifiers, each considering k labels, are combined, with all members having the same weight in the final prediction.
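A minimal sketch of such an equal-weight combination is given below. It assumes that each member outputs a binary prediction for the labels in its own labelset only, and that a label is predicted as relevant when the ratio of positive votes among the members covering it reaches a threshold. The function name, the dictionary encoding, and the threshold rule are assumptions made for illustration.

```python
def combine_votes(member_predictions, n_labels, threshold=0.5):
    """Combine member predictions with equal weights.

    member_predictions: one dict {label_index: 0 or 1} per member,
                        covering only the k labels of that member.
    """
    final = []
    for label in range(n_labels):
        # Only the members whose labelset contains this label vote on it.
        votes = [pred[label] for pred in member_predictions if label in pred]
        ratio = sum(votes) / len(votes) if votes else 0.0
        final.append(1 if ratio >= threshold else 0)
    return final


members = [
    {0: 1, 1: 0, 2: 1},
    {0: 1, 2: 1, 3: 0},
    {1: 0, 2: 0, 3: 0},
]
print(combine_votes(members, n_labels=4))  # [1, 0, 1, 0]
```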

We consider that other types of structures for the EMLC should be investigated. For example, in [C14] we have proposed a preliminary version of a G3P method to build EMLCs, where the EMLC has a tree structure (see Section 7.3). In this structure, the predictions of the children are gathered and combined at each node of the tree, and the final ensemble prediction is the one obtained at the root. Thus, the base classifiers of the ensemble do not all have the same weight in the final prediction; the weight of each one depends on its depth and on the number of children of each combination node.
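To see how depth and the number of children affect the weight of each base classifier, consider the following Python sketch. It assumes that every combination node averages its children with equal weights, and it encodes the tree as nested lists with strings as leaves. Both the averaging operator and the encoding are assumptions for this example and are not taken from [C14].

```python
def leaf_weights(tree, weight=1.0):
    """Return the effective weight of every leaf in the root prediction.

    tree: a leaf name (str) or a list of child subtrees (assumed encoding).
    Each combination node is assumed to average its children equally.
    """
    if isinstance(tree, str):
        return {tree: weight}
    weights = {}
    for child in tree:
        # Each child receives an equal share of the weight of its parent.
        for leaf, w in leaf_weights(child, weight / len(tree)).items():
            weights[leaf] = weights.get(leaf, 0.0) + w
    return weights


tree = ["A", ["B", "C"], ["D", ["E", "F"]]]
weights = leaf_weights(tree)
for leaf in sorted(weights):
    print(leaf, round(weights[leaf], 4))
```

Under these assumptions, classifier A (a direct child of the root) contributes 1/3 of the final prediction, whereas E and F, two levels deeper, contribute 1/12 each.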

However, in this first approach the k-labelsets for the base multi-label classifiers are randomly generated at the beginning of the evolution and do not change; they are only combined into an optimal tree-shaped structure. Therefore, we think that further research in this field would be interesting.

Finally, although to date we have based all our approaches on the use of k-labelsets for each ensemble member, in the same way as RAkEL, other structures, such as the use of CC members, should be investigated (using all or just part of the labels). The benefit of using CC-based EMLCs is that ECC is usually one of the best-performing EMLCs; however, its optimization would be more difficult, since it involves selecting not only the labels at each member but also their chaining order.
