Generating Diverse Models - Multi-criteria optimisation for complex learning prediction systems

There are a number of approaches that can be used to introduce diversity into ensemble base predictors. The following points summarise the main approaches used for this purpose:

1. Varying the initial conditions: each predictor starts from a different randomly generated position in the search space. Although this method is widely used in the literature, it is regarded as the least effective way of generating diverse predictors (Brown et al. (2005)), yielding little or no improvement in the generalisation error.

2. Varying the model architecture or model type: in this approach compatible learners are chosen and combined in the ensemble. Examples of this approach are presented in (Islam et al. (2003) and Opitz and Shavlik (1996b)). In the case of incompatible learners (hybrid ensembles), where the ensemble consists of more than one type of predictive model, a single best model is often chosen to provide the prediction for each new instance. Examples of this approach are found in (Wang et al. (2000) and Langdon et al. (2002)).

3. Varying the training data: in this approach each predictor is trained on a subset of the training data and/or a subset of the features. This approach is more likely to generate diverse models than the previous two (Xue et al. (2006)). The subsets generated are either intersecting (with overlapping instances) or disjoint (with no overlapping instances). Learners trained on disjoint sets are more diverse; however, generating disjoint sets is often impractical in real-world problems because of limited data.
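The two kinds of training subsets described in the third approach can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, and the choice of bootstrap sampling to produce overlapping subsets, are assumptions made for the example and are not taken from the cited works.

```python
import random


def overlapping_subsets(n_samples, n_subsets, subset_size, seed=0):
    """Intersecting subsets: draw indices with replacement (bootstrap style),
    so the same instance may appear in several subsets."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_samples) for _ in range(subset_size)]
        for _ in range(n_subsets)
    ]


def disjoint_subsets(n_samples, n_subsets, seed=0):
    """Disjoint subsets: shuffle the indices once, then deal them out
    round-robin so that no instance is shared between subsets."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::n_subsets] for i in range(n_subsets)]


def feature_subsets(n_features, n_subsets, subset_size, seed=0):
    """Random feature subsets: each predictor sees only some of the columns."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subset_size))
        for _ in range(n_subsets)
    ]
```

With 100 instances and 5 predictors, `disjoint_subsets(100, 5)` leaves each predictor only 20 instances, which illustrates why disjoint sets become impractical when data is limited.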

Once the models have been created, a fusion method is used to combine them into a single ensemble. There are many fusion strategies, including majority vote, Borda count, threshold vote, heuristic decision rules, weighted average, fuzzy integral and fuzzy mode, among others (Xue et al. (2006)).
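Three of the fusion strategies listed above are simple enough to sketch directly. The code below is an illustrative sketch, not the formulation used in the cited works; the function names and the tie-breaking behaviour (the first label encountered wins) are assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the most base models.
    Ties are resolved in favour of the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def weighted_average(probabilities, weights):
    """Combine per-model class-probability vectors using model weights.
    `probabilities[m][c]` is model m's probability for class c."""
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for p, w in zip(probabilities, weights)) / total
        for c in range(n_classes)
    ]


def borda_count(rankings):
    """Each ranking lists the classes from most to least preferred.
    A class at position i in a ranking of length n earns n - 1 - i points;
    the class with the highest total wins."""
    scores = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    return max(scores, key=scores.get)
```

Majority vote needs only the predicted labels, whereas the weighted average and Borda count assume that each base model can output class probabilities or a full class ranking respectively.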

predictive models are trained on subsets of the data and/or subsets of the features. The following section discusses feature extraction and feature selection methods used to generate a representative subset of the data’s features.

4.4.1 Feature extraction and feature selection

One way of generating diverse ensembles is to train the base predictors on subsets of the data or subsets of the features. When training on subsets of the data, no predictor sees all of the available data, which may be impractical in real-world scenarios where data is limited. In the second case, each predictor is trained using a subset of the features. Reducing the number of features can be particularly helpful when the task at hand has a large number of features, many of which might be irrelevant to the task or redundant with respect to the other features (Brown et al. (2012)). In such cases, using all of the available features to train a model can result in overfitting and can have a high computational cost. Reducing the number of features can be achieved by using feature extraction or feature selection methods (Xue et al. (2006)). Feature extraction methods, such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA), reduce the dimensionality of the data by creating new features that represent projections of multiple existing features. PCA mainly aims to find the components that contribute most to the variance (energy), and does not optimise for class separability (Guo and Nixon (2009) and Bishop (1995)). Furthermore, as the information contained in the original feature set is projected onto fewer principal components, there is a high risk of training the base predictors on the same or similar sets of principal components, which eventually reduces the diversity among the predictors (Tumer and Oza (2003)).
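The observation that PCA follows variance rather than class separability can be seen on a small constructed example. The sketch below estimates the first principal component by power iteration on the sample covariance matrix; the toy data set and the helper names are assumptions made purely for illustration.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[j] * c[k] for c in centred) / (n - 1) for k in range(d)]
        for j in range(d)
    ]


def first_principal_component(rows, n_iter=200):
    """Approximate the leading eigenvector of the covariance matrix
    by power iteration, starting from a uniform unit vector."""
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data: feature 0 has a large spread but is identical for both classes;
# feature 1 has a small spread but separates the classes perfectly.
class_0 = [[x, -1.0] for x in (-10.0, -4.0, 4.0, 10.0)]
class_1 = [[x, 1.0] for x in (-10.0, -4.0, 4.0, 10.0)]
component = first_principal_component(class_0 + class_1)
```

Here `component` points almost entirely along feature 0, the high-variance feature that carries no class information, so a one-component projection would discard the only feature that separates the classes.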

On the other hand, with feature selection methods a different subset of features is chosen for training each base predictor. Feature subset selection can be achieved through many approaches. Two popular choices in classification problems are correlation (Tumer and Ghosh (1996)) and Mutual Information (MI) (Cover and Thomas (2012)). An example of correlation-based feature selection in classification problems is presented in (Tumer and Oza (2003)); that work focuses on the correlation between a feature subset and the output classes (or a particular class) and aims to choose the set of features with the highest correlation. MI, in contrast, can be used to evaluate the dependencies between two features with respect to a certain class. This approach has been applied to many pattern recognition problems; examples include the use of MI to select features for gait recognition (Guo and Nixon (2009)) and for medical signal selection (Deriche and Al-Ani (2001)).
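For discrete variables, MI and its class-conditional form can be estimated directly from co-occurrence counts. The following sketch uses the standard definitions I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x)p(y))] and I(X;Y|Z) = Σ p(x,y,z) log2[p(z)p(x,y,z) / (p(x,z)p(y,z))]; the function names are assumptions, and continuous features would need to be discretised first, which is not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X;Y|Z) in bits, e.g. the dependency between
    two features xs and ys given the class labels zs."""
    n = len(xs)
    count_z = Counter(zs)
    count_xz = Counter(zip(xs, zs))
    count_yz = Counter(zip(ys, zs))
    count_xyz = Counter(zip(xs, ys, zs))
    return sum(
        (c / n)
        * math.log2(c * count_z[z] / (count_xz[(x, z)] * count_yz[(y, z)]))
        for (x, y, z), c in count_xyz.items()
    )
```

A useful sanity check is the XOR case: if the class is the exclusive-or of two independent binary features, the features share no information on their own, yet they become fully dependent once the class is known.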

The work presented in this chapter aims to divide the search space of the prediction problem into Local Regions (LRs) by selecting subsets of the features, and to train a set of local expert models on each LR. Two approaches are considered for generating the LRs: the pairwise squared correlation and the conditional mutual information. In the first approach, similar features are grouped into one region, so that the predictive models trained on the resulting subsets specialise in a particular aspect of the prediction problem. The chosen features are those with the highest correlation (without being identical). High correlation between features is taken as an indication of their similarity in defining a certain region of the search space, whereas weakly correlated or independent features are assigned to different regions.
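The idea of grouping highly correlated features into one region can be sketched with a simple greedy procedure. This is an assumption-laden illustration, not the algorithm developed in this chapter: the seeding order, the `threshold` parameter and the function names are all hypothetical, and features with zero variance are assumed to have been removed beforehand.

```python
import math


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences.
    Assumes neither sequence is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def group_features(columns, threshold=0.5):
    """Greedy grouping into local regions (hypothetical procedure).
    The first unassigned feature seeds a region and absorbs every other
    unassigned feature whose squared correlation with the seed reaches
    `threshold`; the process repeats until all features are assigned."""
    unassigned = list(range(len(columns)))
    regions = []
    while unassigned:
        seed = unassigned.pop(0)
        region = [seed]
        for j in unassigned[:]:  # iterate over a copy while removing
            if pearson(columns[seed], columns[j]) ** 2 >= threshold:
                region.append(j)
                unassigned.remove(j)
        regions.append(region)
    return regions


# Four toy features: 0 and 1 rise together, 2 and 3 move in opposition.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [1.0, -1.0, 1.0, -1.0],
    [-1.0, 1.0, -1.0, 1.2],
]
regions = group_features(columns)
```

Squaring the correlation means that strongly anti-correlated features, such as features 2 and 3 in the toy data, are treated as similar and placed in the same region.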

In the first approach a variation of Pearson's product-moment coefficient is used. Pearson's correlation can only reveal linear dependencies between features. In this study a measure of higher-order dependencies between the features is used instead, namely the correlation between the energy responses of the features. This measure was introduced in (Coates and Ng (2011)) and has proved effective in deep learning algorithms. In the second approach, a number of LR seeds are chosen using the conditional MI criterion. A modified version of this criterion is then used to measure the similarity between the features and the LR seeds, and on this basis a subset of the features is assigned to each LR. The proposed criterion encourages correlation between the features and the LR seeds within each region. The following sections consider and compare the methods used to generate local models. In addition, they provide a detailed description of the squared correlation approach and the conditional MI approach, how they are used to build MCMLPS, and the results obtained when they are applied to a number of supervised classification problems.
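The difference between Pearson's correlation and a correlation of energy responses can be seen on a constructed pair of features. The sketch below assumes that the energy of a feature is its squared, mean-centred value; any whitening or normalisation applied in the cited work or in this study is omitted, and the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length sequences.
    Assumes neither sequence is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def energy_correlation(a, b):
    """Correlation between the energies (squared, mean-centred values)
    of two features. Captures dependence between magnitudes even when
    the signs of the two features are unrelated."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    energy_a = [(x - mean_a) ** 2 for x in a]
    energy_b = [(y - mean_b) ** 2 for y in b]
    return pearson(energy_a, energy_b)


# Two features that share magnitudes but have unrelated signs.
feature_a = [1.0, -1.0, 3.0, -3.0, 1.0, -1.0, 3.0, -3.0]
feature_b = [1.0, 1.0, 3.0, 3.0, -1.0, -1.0, -3.0, -3.0]

linear = pearson(feature_a, feature_b)
energy = energy_correlation(feature_a, feature_b)
```

For this pair the linear correlation vanishes while the energy correlation is at its maximum, which is the kind of higher-order dependency that a purely linear measure cannot detect.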
