Ensemble learning - Multi-label classification models for heterogeneous data: an ensemble-based

Introduction

1.1 Ensemble learning

When making crucial decisions, the tendency of humans is to gather information and opinions from different sources, then combining them into a final decision which is supposed to be better and more consistent than considering just one opin-

ion [1]. Based on this reasoning, ensemble learning is a machine learning tech-

nique which combines predictions of individual learners from heterogeneous or homogeneous modeling to obtain a combined learner that improves the overall generalization ability and reduces the overfitting of each [2,3].

Nowadays, ensemble methods are considered as the state-of-the-art to solve a wide range of machine learning problems, such as classification [4], regression [5], and clustering [6] problems. They have been successfully applied in fields such as finance [7], bioinformatics [8], medicine [9], and image retrieval [10].

There are three main reasons why an ensemble learner use to perform better than single learners [11]. First, when picking only one single learner we run the risk of selecting one out of several learners that get same performance on training data, but which may perform different over unseen data; therefore, selecting some of these learners and combining their outputs, the optimal one, i.e., this which per-

forms better on unseen data, would be easily reached. InFigure 1.1a, the search

spaceHof models is indicated with the outer line, while the inner line denotes the set of learners that get the same performance in training, andh∗ denotes the optimal learner. Second, many learning algorithms use local search to find a model and they may not find the optimal learner, so running several times the learning algorithm and combining the obtained models may result in a better approxima- tion to the optimal learner than any single one (Figure 1.1b). Third, since in most machine learning problems the optimal function might not be found, the optimal learner may be approximated by combining several feasible learners (Figure 1.1c).

(a) (b) (c)

Figure 1.1: Reasons why ensemble methods outperform single learners. (a) Com- bining several learners that perform similar on training data, the optimal one would be better reached. (b) Combining several learners obtained by local search could better reach the optimal learner. (c) Combining several feasible learners, the non-reachable optimal one could be better approximated.

1.1.1 Building phase

Although ensemble models tend to improve the generalization ability of base learners, those learners to be combined into an ensemble should be carefully chosen. Using accurate learners seems to be an obvious approach; however, an ensemble including learners that are very similar to each other could not perform as well

as expected, and it may perform even worse than individual ones. Therefore, diverse learners are usually preferred to be combined, although formal proof of this dependency does not exist [12,13].

The main approaches that have been proposed to build ensembles of diverse learners could be categorized in the following groups [1]:

Input manipulation Each base model is built over a slightly different subset of

the original training dataset (sampled with or without replacement). Thus, learners are diverse among them since each is focused on different input data.

Manipulated learning algorithm The way in which the learning algorithm per-

forms is modified. Usually, different hyperparameters are used for each ensemble member, such as using neural networks with different learning rates or number of layers, or support vector machines with different regularization parameters or kernels.

Partitioning The original dataset is divided in several mutually exclusive subsets, and each of them is used to train a different base learner. The original data can be split following two different approaches: a) horizontal partitioning, where each subset is composed of different instances (being non-overlapping subsets, unlike ininput manipulation); and b) vertical partitioning, where each member is built over all instances but each of them considering a different subset of the input attributes.

Output manipulation The output attribute (or attributes) is manipulated at each

of the members of the ensemble. Depending on the problem, the output attributes would be categorical classes, real-valued targets, etc.

Hybridization Usually, instead of just using one approach, two or even more ap-

proaches are hybridized in order to obtain far more diverse learners, aiming to improve the overall predictive performance of the ensemble learner.

1.1.2 Prediction phase

Given an ensemble modelecomposed bynbase learners, the final decision is made by combining individual predictions from each memberej, being_j_∈_[1_{, n}_]. Several approaches have been proposed in order to combine or integrate these predictions into the final one.

Weighting methods are the most usual techniques to combine predictions. These methods give a weight to each individual prediction, and then they are combined in any way. In particular, majority voting is the simplest and widely used weighting method, where the final prediction is the output with more votes among the base learners, or the average of predictions in real-valued outputs [5]. However, more complex approaches exist, such as giving a weight to each base learner based on their individual performance [14]; thus, the final prediction is biased by more accurate base learners while still considering all predictions.

On the other hand, instead of directly combining the predictions of individual learners, meta-learning methods build an ensemble in two-stages. In the first stage, several base learners are built and their predictions are gathered. Then, in a second phase, predictions of individual learners are used to build a meta-model, which may either extend the original input space with the predictions of previous stage, or just use these predictions as input attributes for the meta-model. Then, the output of the meta-model is used as the final ensemble prediction [15]. While weighting methods are better applicable in cases where the performance of base learners is similar, meta-learning methods are able to detect if certain base methods perform poorly in some subspaces.

In document Multi-label classification models for heterogeneous data: an ensemble-based approach. (Page 34-37)