Final Remarks

This section presents the strengths and weaknesses of the unsupervised feature selection methods used in this study.

SPFS and LapFS usually produce similar results, although SPFS generally performs better. This might be because both methods attempt to preserve the data similarity of the original features; however, LapFS cannot handle feature redundancy. The MCFS method performs well when the number of features is small (RV144 Vaccine data set); however, its performance declines as the dimensionality of the data increases (peptide binding affinity data sets). In addition, MCFS is inefficient for very high dimensional data such as the GSE40279 data set, because it requires the computation of a normalised Laplacian matrix, l1-norm regularisation, and eigenvalue decomposition. LapFS also computes a Laplacian matrix and an eigenvalue decomposition; however, it is still able to perform feature selection on the GSE40279 data set.

Another interesting point is that, even though EUFS is an embedded method and is thus computationally more expensive than filter methods, it is able to perform feature selection on the ultra high dimensional GSE40279 data set.

Owing to their extremely high run time and memory consumption, InFS, MCFS, and SPFS could not be applied to the GSE40279 data set; instead, TV is used. Even though TV is a very simple unsupervised feature selection method, it produces good results on the GSE40279 data set. Furthermore, because of its simplicity, it is the most computationally efficient method compared to EUFS, LapFS, KBFS, and DKBFS.
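To make the simplicity of a variance-style filter concrete, the following is a minimal sketch. It assumes that TV ranks features by their sample variance, which this section does not state explicitly; the function names `rank_by_variance` and `select_top_k` are hypothetical and are not taken from the thesis.

```python
from statistics import pvariance


def rank_by_variance(X):
    """Return feature indices sorted by decreasing variance, plus the variances.

    X is a list of samples, each a list of feature values (rows = samples).
    Assumption: the filter scores each feature independently by its variance.
    """
    columns = list(zip(*X))  # one tuple of values per feature
    variances = [pvariance(col) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return order, variances


def select_top_k(X, k):
    """Keep the k highest-variance features, preserving their original order."""
    order, _ = rank_by_variance(X)
    keep = sorted(order[:k])
    reduced = [[row[j] for j in keep] for row in X]
    return keep, reduced


# Toy data: feature 0 is constant, feature 2 varies the most.
toy = [
    [1.0, 5.0, 0.0],
    [1.0, 7.0, 10.0],
    [1.0, 9.0, 20.0],
]
kept, toy_reduced = select_top_k(toy, 2)
```

Each feature is scored in a single pass over its own values, so the cost grows linearly with the number of features and no pairwise matrix is built. Variance depends on the scale of each feature, so a filter of this kind is usually applied to features measured on comparable scales.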

The proposed KBFS method is a simple K-means based unsupervised method; nevertheless, it produces the second best results on the GSE44763 and GSE40279 data sets. This might be because, unlike existing K-means based feature selection methods, which perform only univariate feature selection, KBFS performs multivariate feature selection by exploiting a feature-feature dissimilarity measure. These observations suggest that KBFS is well suited to selecting features from very high dimensional data.

The proposed DFSFR method achieves the best results for the RV144 Vaccine, peptide binding affinity, and GSE44763 (prediction of BMI) data sets, and it yields the second best results for the GSE44763 (prediction of chronological age) and GSE40279 data sets. Therefore, the DFSFR method can be utilised for low dimensional, high dimensional, very high dimensional, and ultra high dimensional data.

The proposed DKBFS method produces the best results for the GSE44763 (prediction of chronological age) and GSE40279 data sets. Therefore, it is concluded that the DKBFS method is useful when it is applied to extremely high dimensional data.

In summary, it is beneficial to exploit the MCFS, LapFS, and SPEC methods for low dimensional data sets. KBFS and DKBFS can be used for very high dimensional data sets. The DFSFR method can be exploited for both low dimensional and high dimensional data sets. The results of EUFS and InFS are generally not consistent across data sets; thus, the performance of these methods is highly dependent on the data set.

Conclusions and Future Works

This chapter concludes the research and presents possible future work.

7.1 Conclusions

In line with technological developments, there is almost no limit to the collection of high dimensional data in bioinformatics. These high dimensional data sets usually contain many redundant or noisy features, which need to be filtered out to find a small but biologically meaningful set of attributes. Feature selection aims at identifying a subset of the original features by eliminating redundant and noisy ones; it is an effective dimensionality reduction approach that is widely used in machine learning and data mining. In fact, feature selection enables regressors to achieve better predictive performance. There are mainly two types of feature selection methods: unsupervised and supervised. Supervised feature selection methods can identify relevant features as well as noisy ones; however, unsupervised methods tend not to identify features that can act as noise.

After conducting an intensive literature review, it is observed that the selection of features from very high dimensional data sets in the regression domain seems to have been understudied. This might be because regression problems are more difficult than classification tasks [185].

In this study, a taxonomy of feature selection methods for regression problems is provided. To the best of our knowledge, this is the first study that provides a feature selection review as well as a taxonomy of feature selection methods specifically for regression tasks.

Two novel unsupervised feature selection frameworks are provided in this study, namely KBFS and DFSFR. KBFS is a simple K-means based feature selection framework in which features are selected according to a feature-feature dissimilarity measure. In standard K-means, one centroid is used for each cluster; in KBFS, however, three centroids are exploited to determine the weights of features. Note that the centroids of K-means are not themselves features. DFSFR is a deep learning based feature selection framework that selects features at the input level of a DBN; to the best of our knowledge, it is the first deep learning based feature selection method in the regression domain. This framework is capable of handling both multi-input single-output and multi-input multi-output regression tasks. A hybrid method, which combines DFSFR and KBFS, is also proposed and named DKBFS. In DKBFS, KBFS is exploited as a pre-filtering method for the DFSFR framework. That is, KBFS prioritises features according to their importance and identifies relevant features. The previously identified relevant features are then evaluated by DFSFR, which attempts to determine an optimal feature subset. KBFS and DKBFS are proposed to deal with extremely high dimensional data.
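The remark that K-means centroids are not themselves features can be illustrated with a generic sketch. The code below is not KBFS: it uses one centroid per cluster, plain squared Euclidean distance, and a deterministic initialisation, all of which are assumptions made for illustration. The names `kmeans` and `nearest_feature_per_centroid` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd-style K-means with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def nearest_feature_per_centroid(X, k):
    """Cluster the feature columns of X and map each centroid to a real feature.

    A centroid is an average of feature vectors, so it generally does not
    coincide with any actual feature; the closest real feature is returned.
    """
    columns = [list(col) for col in zip(*X)]  # features become the points
    centroids, _ = kmeans(columns, k)
    chosen = [
        min(range(len(columns)), key=lambda j: sq_dist(columns[j], c))
        for c in centroids
    ]
    return sorted(chosen)


# Toy data: features 0 and 2 are near-duplicates, as are features 1 and 3.
toy_X = [
    [1.0, 10.0, 1.1, 10.2],
    [2.0, 9.0, 2.1, 9.1],
    [3.0, 8.0, 3.1, 8.1],
    [4.0, 7.0, 4.1, 7.1],
]
representatives = nearest_feature_per_centroid(toy_X, 2)
```

In this toy example the two centroids are averages of near-duplicate features, so one real feature from each group is returned as its representative. How KBFS derives feature weights from its three centroids is described in the thesis itself and is not reproduced here.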

To show the effectiveness of the proposed frameworks, experiments are conducted on different high dimensional biomedical data sets. Four case studies are considered. In the first case study, the proposed methods are used to reveal the associations between antibody features and their functional activities (ADCC, ADCP, NK Cell Cytokine Release) in the RV144 Vaccine data set. The purpose of this case study is to identify the most discriminative antibody features that fight against HIV.

In the second case study, the proposed methods are applied to high dimensional peptide binding affinity data sets. Three different peptide binding affinity data sets are used. Each amino acid in the peptide sequences is described by 643 physico-chemical descriptors. Tasks 1 and 3 contain nona-peptides that have a total of 5787 descriptors (= 643 x 9), whereas Task 2 consists of octa-peptides that are characterised using a total of 5144 descriptors (= 643 x 8). The goal of this case study is to predict binding affinity values for peptides from their amino acid descriptors, since affinity refers to the strength of binding.
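The descriptor counts above follow from concatenating one 643-value descriptor vector per residue. A minimal sketch of that encoding is given below; the descriptor table is a zero-filled placeholder (the actual physico-chemical values are not reproduced here), and the concatenation order and the name `encode_peptide` are assumptions made for illustration.

```python
N_DESCRIPTORS = 643  # descriptors per amino acid, as stated in the text
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# Placeholder table: every amino acid maps to a dummy descriptor vector.
dummy_table = {aa: [0.0] * N_DESCRIPTORS for aa in AMINO_ACIDS}


def encode_peptide(peptide, table):
    """Concatenate the per-residue descriptor vectors into one feature vector."""
    features = []
    for residue in peptide:
        features.extend(table[residue])
    return features


nona_features = encode_peptide("ACDEFGHIK", dummy_table)  # 9 residues
octa_features = encode_peptide("ACDEFGHI", dummy_table)   # 8 residues
```

With this layout a nona-peptide yields 643 x 9 = 5787 features and an octa-peptide yields 643 x 8 = 5144 features, which matches the dimensionalities reported for Tasks 1 and 3 and for Task 2 respectively.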

In the third case study, the very high dimensional GSE44763 data set, which consists of 27842 Cytosine-phosphate-Guanine (CpG) dinucleotides measured in the peripheral blood of 46 adult female individuals, is exploited. Of the 46 subjects, 22 are lean and the others are obese. The aim of this case study is to reveal age and obesity related CpG biomarkers from the given data.

In the fourth case study, the ultra high dimensional GSE40279 data set, which contains 473034 CpG biomarkers (features) from the whole blood of 656 donors (samples) aged 19 to 101, is used. The goal of this case study is to disclose the associations between CpG dinucleotides and aging from the given data.
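The scale of this data set can be made concrete with a short back-of-the-envelope calculation. The sketch below assumes dense storage with 8 bytes per value (double precision); which methods, if any, materialise a feature-by-feature matrix is not stated in this section, so that case is included purely as an illustration. The helper `dense_matrix_bytes` is hypothetical.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=8):
    """Memory needed to store a dense rows x cols matrix."""
    return rows * cols * bytes_per_value


N_SAMPLES = 656      # donors in GSE40279, as stated in the text
N_FEATURES = 473034  # CpG features in GSE40279, as stated in the text

# The data matrix itself (samples x features).
data_bytes = dense_matrix_bytes(N_SAMPLES, N_FEATURES)

# A sample-by-sample matrix, e.g. a graph Laplacian built over the samples.
sample_pair_bytes = dense_matrix_bytes(N_SAMPLES, N_SAMPLES)

# A feature-by-feature matrix, e.g. a dense pairwise feature dissimilarity.
feature_pair_bytes = dense_matrix_bytes(N_FEATURES, N_FEATURES)

for label, size in [
    ("data matrix", data_bytes),
    ("sample-by-sample matrix", sample_pair_bytes),
    ("feature-by-feature matrix", feature_pair_bytes),
]:
    print(f"{label}: {size / 2**30:.3f} GiB")
```

Under these assumptions the data matrix alone occupies about 2.3 GiB, a sample-by-sample matrix only a few MiB, and a dense feature-by-feature matrix about 1.6 TiB. Any computation that scales with the square of the number of features is therefore impractical at this dimensionality, whereas computations that scale with the number of samples remain cheap.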

The proposed methods obtain better or at least comparable results compared to other state-of-the-art feature selection methods in the literature, and it is shown that the proposed methods are robust and effective in identifying discriminative features from biomedical data.

In this thesis, in addition to the novel feature selection frameworks, a comprehensive overview of feature selection methods for regression problems is provided, in which the methods are listed along with their types, references, sources, and code repositories. Finally, a taxonomy of feature selection methods for regression problems is proposed to assist researchers in selecting an appropriate feature selection method for their research.

In document Development of unsupervised feature selection methods for high dimensional biomedical data in regression domain (Page 148-153)