
• Proposal for a new framework for cost-based feature selection. This broadens the scope of feature selection by taking into consideration not only the relevance of the features but also their associated costs. The proposed framework adds a new term to the evaluation function of a filter method so that the cost is taken into account.

• Distributed and parallel feature selection. There are two common types of data distribution: (a) horizontal distribution, wherein data are distributed in subsets of instances; and (b) vertical distribution, wherein data are distributed in subsets of attributes. Both approaches are tested using filter and wrapper methods. Since partitioning the datasets can in some cases introduce redundancy among features, new partitioning schemes are being investigated, for example dividing the features according to some goodness measure.
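The two types of data distribution described above can be illustrated with a minimal sketch. The function names and the round-robin assignment of rows and columns to parts are assumptions made for illustration only; they are not the partitioning schemes studied in this thesis. The class label is omitted for brevity, although in a vertical distribution it would normally accompany every subset of attributes.

```python
def horizontal_partition(rows, n_parts):
    """Split by instances: each part keeps all attributes of a subset of rows."""
    # Round-robin assignment of rows to parts (an illustrative choice).
    return [rows[i::n_parts] for i in range(n_parts)]


def vertical_partition(rows, n_parts):
    """Split by attributes: each part keeps all rows, restricted to some columns."""
    n_features = len(rows[0])
    # Round-robin assignment of column indices to parts (an illustrative choice).
    column_groups = [list(range(i, n_features, n_parts)) for i in range(n_parts)]
    return [[[row[j] for j in cols] for row in rows] for cols in column_groups]


# Toy dataset: 4 instances (rows) described by 4 attributes (columns).
data = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

h_parts = horizontal_partition(data, 2)  # two parts with 2 rows x 4 columns each
v_parts = vertical_partition(data, 2)    # two parts with 4 rows x 2 columns each
```

In this sketch each horizontal part can be processed by any feature selection method on its own, because all attributes are present, whereas each vertical part sees only some of the attributes. This is one way in which related features can end up in different parts.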

1.3 Overview of this thesis

This chapter has introduced the main topics to be presented in this work. Figure 1.1 depicts the organization of the thesis. Part I comprises Chapters 2 to 6. Chapter 2 presents the foundations of feature selection, as well as a description of the feature selection methods which will be employed in this thesis. Then, Chapter 3 reviews the most popular methods in the literature and checks their performance in an artificial controlled scenario, proposing some guidelines about their appropriateness in different domains. Chapter 4 analyzes the most recent contributions of feature selection research applied to the field of DNA microarray classification, whereas Chapter 5 is devoted to demonstrating the benefits of feature selection in other real applications, such as classification of the tear film lipid layer and K-complex classification. Chapter 6 closes Part I by studying the scalability of existing feature selection methods.

Part II comprises Chapters 7 to 10. Chapter 7 presents a method which consists of a combination of discretizers, filters and classifiers. The proposed method is applied to an intrusion detection benchmark dataset, as well as to other challenging scenarios such as DNA microarray data. Chapter 8 introduces an ensemble of filters to be applied to different scenarios. The idea builds on the assumption that an ensemble of filters is better than a single method, since it is possible to take advantage of their individual strengths and overcome their weak points at the same time. Chapter 9 proposes a new framework for cost-based feature selection. The objective is to solve problems in which it is interesting not only to minimize the classification error, but also to reduce the costs that may be associated with input features. Chapter 10 presents some approaches for distributed and parallel feature selection, splitting the data both vertically and horizontally. Finally, Chapter 11 summarizes the main conclusions and contributions of this thesis. Note that Appendix I presents the materials and methods used throughout this thesis and Appendix II reports the author’s key publications and mentions.

Figure 1.1: Organization of the thesis

PART I

CHAPTER 2

Foundations of feature selection

In recent years, several datasets with high dimensionality have become publicly available on the Internet. This has posed an interesting challenge to the research community, since machine learning methods have difficulty dealing with a high number of input features. To address this problem, dimensionality reduction techniques can be applied to reduce the dimensionality of the original data and improve learning performance. These dimensionality reduction techniques usually come in two flavors: feature selection and feature extraction.

Feature selection and feature extraction each have their own merits (Z. A. Zhao & Liu, 2011). On the one hand, feature extraction techniques achieve dimensionality reduction by combining the original features. In this manner, they are able to generate a set of new features, which is usually more compact and of stronger discriminating power. Feature extraction is preferable in applications such as image analysis, signal processing, and information retrieval, where model accuracy is more important than model interpretability. On the other hand, feature selection achieves dimensionality reduction by removing the irrelevant and redundant features. It is widely used in data mining applications, such as text mining, genetic analysis, and sensor data processing. Because feature selection maintains the original features, it is especially useful for applications where the original features are important for model interpretation and knowledge extraction.
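The difference between the two flavors can be sketched in a few lines. The function names and the weight vectors below are hypothetical and chosen only for illustration; actual feature extraction methods learn their combinations from the data, and actual feature selection methods choose which features to keep according to some criterion.

```python
def select_features(rows, keep):
    """Feature selection: keep a subset of the original columns unchanged."""
    return [[row[j] for j in keep] for row in rows]


def extract_features(rows, weights):
    """Feature extraction: build each new feature as a combination of the
    original ones (here a linear combination, which is an assumption)."""
    return [
        [sum(w * x for w, x in zip(weight_vector, row)) for weight_vector in weights]
        for row in rows
    ]


# Toy dataset: 2 instances described by 3 original features.
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Selection keeps features 0 and 2 exactly as they were measured.
selected = select_features(data, keep=[0, 2])

# Extraction produces 2 new features; the first one mixes features 0 and 1.
extracted = extract_features(data, weights=[[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]])
```

Every column of `selected` is still one of the original features, which is what preserves interpretability. The first column of `extracted` is an average of two original features and no longer corresponds to any single one of them.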

This chapter will present the foundations of feature selection, as well as a description of the feature selection methods which will be employed in this thesis.

2.1 Feature selection

Feature selection can be defined as the process of detecting the relevant features and discarding the irrelevant and redundant ones, with the goal of obtaining a subset of features that properly describes the given problem with a minimum degradation of performance. It has several advantages (Guyon, 2006), such as:

• Improving the performance of the machine learning algorithms.

• Data understanding, gaining knowledge about the process and perhaps helping to visualize it.

• General data reduction, limiting storage requirements and perhaps helping in reducing costs.

• Feature set reduction, saving resources in the next round of data collection or during utilization.

• Simplicity, possibility of using simpler models and gaining speed.

2.1.1 Feature relevance

Intuitively, a feature is relevant if it contains some information about the target. More formally, Kohavi and John classified features into three disjoint categories, namely strongly relevant, weakly relevant, and irrelevant features (Kohavi & John, 1997). In their approach, the relevance of a feature X is defined in terms of an ideal Bayes classifier. A feature X is considered to be strongly relevant when the removal of X results in a deterioration of the prediction accuracy of the ideal Bayes classifier. A feature X is said to be weakly relevant if it is not strongly relevant and there exists a subset of features S such that the performance of the ideal Bayes classifier on S is worse than the performance on S ∪ {X}. A feature is defined as irrelevant if it is neither strongly nor weakly relevant.
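The three categories can be checked by brute force on a small example. The sketch below assumes that a finite table of instances is the complete, uniformly weighted population, so that the best achievable accuracy on a feature subset can stand in for the ideal Bayes classifier. The function names and the toy problem are illustrative, and the search over all subsets S grows exponentially with the number of features, so this is only practical for very small problems.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations


def bayes_accuracy(rows, subset):
    """Best achievable accuracy when only the features in `subset` are visible,
    treating `rows` as the complete population (an assumption of this sketch)."""
    groups = defaultdict(Counter)
    for x, y in rows:
        groups[tuple(x[j] for j in subset)][y] += 1
    # For each visible pattern, the optimal prediction is its majority class.
    correct = sum(max(counts.values()) for counts in groups.values())
    return Fraction(correct, len(rows))


def relevance(rows, feature):
    """Classify one feature as strongly relevant, weakly relevant or irrelevant."""
    n_features = len(rows[0][0])
    others = [j for j in range(n_features) if j != feature]
    # Strong relevance: removing the feature from the full set hurts accuracy.
    if bayes_accuracy(rows, others) < bayes_accuracy(rows, others + [feature]):
        return "strongly relevant"
    # Weak relevance: some subset S performs worse than S together with the feature.
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = list(subset)
            if bayes_accuracy(rows, s) < bayes_accuracy(rows, s + [feature]):
                return "weakly relevant"
    return "irrelevant"


# Toy problem with features (a, b, copy of a, c) and target a XOR b.
rows = [((a, b, a, c), a ^ b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [relevance(rows, j) for j in range(4)]
```

In this toy problem the feature b cannot be replaced by any other feature, so it is strongly relevant. The feature a and its copy can each stand in for the other, so both are only weakly relevant. The feature c carries no information about the target and is irrelevant.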

2.1.2 Feature redundancy

Feature redundancy is usually defined in terms of feature correlation (Yu & Liu, 2004a). It is widely accepted that two features are redundant to each other if their values are completely correlated, but it might not be so easy to determine feature redundancy when a feature is correlated with a set of features. According to Yu and Liu (2004a), a feature is redundant and hence should be removed if it is weakly relevant and