Survey on Gene Selection Using Meta Heuristic Algorithms for Classifying Cancer Disease


Mohammed W. Al Rawi (I), Hazem M. EL-Bakry (II), Mohammed Loey (III)

(I) Dept. of Computer Science, Faculty of Computer & Information Sciences, Benha University, Benha, Egypt

(II) Dept. of Information Systems, Faculty of Computer & Information Sciences, Mansoura University, Mansoura, Egypt

(III) Dept. of Computer Science, Faculty of Computer & Information Sciences, Benha University, Benha, Egypt

I. Introduction

Medical diagnosis is the process of determining which disease or condition explains a patient’s symptoms or complaints. It can be subjective for several reasons. First, it depends on the physician’s examination. Second, and most importantly, the amount of information that must be analyzed to produce a good diagnosis is usually huge and sometimes unmanageable. The conventional diagnostic procedure for most diseases depends on human skill in recognizing a characteristic pattern. This traditional approach is prone to human error and imprecise diagnosis, is time-consuming and labor-intensive, and places an unnecessary burden on radiologists.

Moreover, by the time a diagnosis is reached, the patient may already be at a critical stage of the disease.

Recently, Computer Aided Diagnosis (CAD) and machine learning systems have been developed to support specialists in the diagnostic decision process. Medical artificial intelligence is primarily concerned with the construction of Artificial Intelligence (AI) programs that perform diagnosis and make therapy recommendations. Unlike medical applications built on other programming approaches, such as purely statistical and probabilistic approaches, medical AI programs are based on symbolic models of disease entities and their relationship to patient factors [1]. Many studies have focused on implementing an Intelligent Decision Support System (IDSS) and machine learning in medical systems [2, 3, 4]. An IDSS is an interactive computer system used to help decision-making. It uses data sets and models to solve problems and make decisions. In the medical field, IDSS are indispensable because they enable doctors and nurses to rapidly collect information and analyze it in numerous ways to support diagnosis and treatment decisions. Machine learning can also be used to automatically derive diagnostic rules from descriptions of past, successfully treated patients, helping experts and specialists make the diagnostic process more objective and more reliable [5].

Cancer is a genetic disease caused by changes in the genes that control how cells function. Cell growth becomes uncontrolled, producing a lump (tumor) of rogue cells that can invade and spread through the blood and lymph systems to other parts of the body.

There are more than 100 different types of cancer, named for the location where the cancer first developed or for the type of tissue cell in which it starts (histological type). Cancer is a complex and diverse disease; however, a set of characteristics is shared among almost all malignancies [6].

II. Background Information and Problem Statement

The medical diagnosis of most current CAD systems depends on different types of information, such as medical laboratory tests (for example, blood tests and magnetic resonance imaging (MRI)), medical indicators (finger tremors, lung signs) or symptoms, and various types of digital images (such as X-ray and ultrasound images).

Figure 2: Diagram based on the hypothesis of the multi-staged development of cancer [10]. (Labels recoverable from the figure: development of cancer at the molecular and cell level; size of cancer; normal cell; self-immune exclusion; genetic disorder; proliferation starts; premalignant lesion; malignant cancer; metastases; death; cancer cell mass of 1 kg. The genetic diagnosis level covers evaluation of the cancer risk at the premalignant stage and detection of minute cancer cells that imaging tests cannot discover, supporting early diagnosis, prevention, and control of recurrence; it is contrasted with the image test level.)

However, physical medical examination carries risks, such as the transmission of infection by instruments and the pain of scratching the skin to take a blood sample. X-rays are harmful because they expose body cells to radiation. Ultrasound depends on the accuracy and integrity of the image, which are affected by factors such as air at the skin surface and image blur. Systems based on gene expression profiles from DNA microarray datasets largely avoid these problems. The challenges in microarray classification are dimensionality and classification accuracy.

Abstract

Detecting the type of tumor is the key step in diagnosing cancer. However, traditional methods have shown several weaknesses. Thus, finding new techniques that can effectively discriminate between cancer subtypes has become a central issue. Over the last 20 years, DNA microarray technology has been applied to the study of diseases by analyzing the expression levels of thousands of genes at the same time. However, a microarray experiment involves analyzing the expression levels of thousands of genes by conducting two tests on sample cells, which costs more money and time. There is therefore a pressing need for new techniques that select the correct features, giving the best results with high classification accuracy at the lowest possible cost. This survey explores recent research, presenting the latest findings, the methodologies used, and the data sets employed in the diagnosis of cancer using gene expression profiles.

Keywords

Cancer classification; Gene expression data; Feature Selection

Figure 1: Relationship among the cell, the nucleus, a chromosome, and a gene [8]. (Labels recoverable from the figure: cell; nucleus, where chromosomes "live"; chromosomes, which contain all genetic material; DNA, the material from which chromosomes are constructed; gene, a segment of a chromosome made up of DNA.)

III. Introduction of Microarray Technology

The bodies of all mammals, including humans, consist of billions of cells. Inside each cell is a nucleus, and inside the nucleus are several separate long segments called chromosomes, which are made of deoxyribonucleic acid (DNA). DNA consists of basic units called nucleotides, each made up of a sugar-phosphate backbone and one of four bases: adenine (A), cytosine (C), guanine (G), and thymine (T). A pairs with T, while C pairs with G [7]. The hereditary information needed to make future organisms is coded in the DNA through the order of these base pairs on a double-stranded helix; Figure 1 shows the relationship among the cell, the nucleus, a chromosome, and a gene [8]. DNA consists of two kinds of segments: coding genes and non-coding segments. Genes are converted into proteins in two stages. First, DNA is transcribed into messenger ribonucleic acid (mRNA), or RNA for short. Second, the mRNA is translated into proteins. In practice, every cell of an organism carries the same genes, but these genes can be expressed differently under different conditions and at different times [9]. Cancer is a genetic disease caused by changes in the genes that control how cells function, especially how they grow and divide. These recognizable changes include mutations in the DNA that makes up genes.
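The base-pairing rule and the transcription stage described above can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical, and transcription is simplified to copying the coding strand with thymine (T) replaced by uracil (U), the base that RNA uses in place of thymine.

```python
# Watson-Crick base pairing as stated in the text: A pairs with T, C pairs with G.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(DNA_PAIR[base] for base in strand.upper())


def transcribe(coding_strand: str) -> str:
    """Return the mRNA sequence for a DNA coding strand.

    Simplification: the mRNA matches the coding strand with T replaced by U.
    """
    return coding_strand.upper().replace("T", "U")


example = "AACGTA"  # hypothetical sequence, for illustration only
print(complementary_strand(example))
print(transcribe(example))
```

The sketch ignores strand direction and the translation stage; it only shows how the order of bases carries the information that is later expressed.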

Figure 4: Gene expression from gene to protein [9]. (Steps shown: DNA is transcribed into RNA, which is translated into an amino acid chain that folds into a protein.)

Generally, cancer cells have significantly more genetic changes than normal cells, but each person’s tumor has a distinctive combination of genetic alterations. A few of these recognizable changes may be the result of cancer rather than its cause. As the cancer grows, additional changes occur [10].

IV. Gene Expression Profiling in Cancers

Microarray technology measures thousands of messenger RNA transcripts simultaneously. Because all proteins in the cell are produced by translating mRNA, mRNA expression levels provide a good approximation of protein abundance; see Figures 3 and 4.

V. Gene Selection Techniques

Cancer microarray data usually contain a few hundred samples with thousands or tens of thousands of genes as features. Classifying data in such a high-dimensional space leads to overfitting, in addition to increased processing time and energy consumption. Overfitting means that a decision function or model classifies the training data correctly but performs very poorly on test data. Gene selection is also important because biologists are interested both in accurate predictive tools and in identifying biomarkers of disease, i.e., a small set of relevant genes that leads to correct discrimination between different biological states [11]. The analysis of gene expression data obtained in microarray experiments has been of great interest in the research areas of pattern recognition, machine learning, and statistics. All kinds of data mining, machine learning, and Intelligent Decision Support System (IDSS) methods have been applied and extended for analyzing microarray data. Gene selection techniques can be divided into four types based on how the selection process is combined with the classification process: filter, wrapper, embedded, and hybrid approaches. The following subsections describe each type and some of the research that has used it.

Examples of non-parametric tests used in filter techniques are the correlation coefficient, Euclidean distance, and the cosine coefficient. A disadvantage of filter techniques is that they do not deal with redundancy between the selected genes. Moreover, they do not consider complementary genes that individually have poor predictive performance but jointly enable good classification. In addition, they are univariate techniques. Some of these problems can be addressed by combining several univariate techniques into one multiple-scoring technique: several filter techniques (both parametric and non-parametric tests) are each used to rank the genes, and the rankings are combined into a multiple-scoring system [12, 13].

A. Filter Approach

Figure 5: The transfer of genetic information from DNA, through mRNA, to proteins. (Processes shown: DNA replication (DNA -> DNA) by DNA polymerase; transcription (DNA -> RNA) by RNA polymerase; reverse transcription; RNA replication (RNA -> RNA) between (+) sense and (-) sense RNA, RNA dependent; translation (RNA -> protein) by ribosomes.)

Filter techniques are ranking techniques that assess the relevance of a gene by considering only the intrinsic properties of the data. A score is assigned to each gene based on a specific criterion, independently of the classification model.

Designers usually select the most informative genes and ignore all the rest. The chosen subset of genes is the input to the classifier. Filter techniques can be parametric or non-parametric tests. Parametric tests measure a specific property of the gene, while non-parametric tests measure the degree of relation between each gene and the class. Examples of univariate filter techniques are information gain, gain ratio, and ReliefF.
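The score-then-rank idea behind filter techniques can be sketched in a few lines of Python. This is a minimal illustration, not the method of any cited study: it assumes binary class labels coded as 0 and 1, uses the absolute Pearson correlation between each gene and the class as the scoring criterion, and runs on a small made-up expression matrix. The names `pearson` and `filter_select` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant gene carries no information about the class
    return cov / sqrt(var_x * var_y)


def filter_select(samples, labels, k):
    """Score every gene independently of any classifier and keep the top k.

    samples: list of rows (one per sample), each a list of expression values.
    labels:  list of 0/1 class labels, one per sample.
    """
    n_genes = len(samples[0])
    scores = [abs(pearson([row[g] for row in samples], labels))
              for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:k], scores


# Made-up data: 6 samples x 4 genes; only gene 0 separates the two classes.
samples = [
    [1.0, 5.0, 0.2, 3.0],
    [1.2, 4.8, 0.9, 3.1],
    [0.9, 5.1, 0.4, 2.9],
    [3.0, 5.0, 0.8, 3.0],
    [3.2, 4.9, 0.3, 3.2],
    [2.9, 5.2, 0.5, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]
top_genes, scores = filter_select(samples, labels, k=1)
print(top_genes)
```

Because each gene is scored on its own, two highly correlated genes would both rank near the top, which is the redundancy problem that filter techniques do not address.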

Figure 6: Filter feature selection model. (Stages shown: set of all features -> selecting the best subset -> learning algorithm -> performance.)

Thanh Nguyen et al. [14] apply hidden Markov models to cancer classification using gene expression profiles by designing supervised-learning hidden Markov models (HMMs). They introduce a novel gene selection method based on a modification of the analytic hierarchy process (AHP). The idea behind this approach is to incorporate prominent discriminant genes from different gene selection ranking methods through a systematic hierarchy. HMMs are used for the classification process. Four benchmark datasets are used in the experiments: diffuse large B-cell lymphoma (DLBCL), leukemia, colon, and prostate. For comparison with the HMM classification approach, a range of prevalent classifiers is implemented: k-nearest neighbors (kNN), probabilistic neural network (PNN), support vector machine (SVM), multilayer perceptron (MLP), fuzzy ARTMAP (FARTMAP), and the ensemble learning method AdaBoost. Among the seven investigated classifiers, HMM proves to be the most robust. It yields greater accuracy and AUC, and relatively stable results, compared with the other methods.

Kaberi Das et al. [15] proposed research under the title Gene Selection Using Information Theory and Statistical Approach. Initially the data set is pre-processed, and then gene selection is applied using two different approaches: an information-based approach and a statistical approach. The statistical approach includes Euclidean Distance (ED) and the Pearson Coefficient (PR), and the information-theory-based approach includes Dynamic Relevance (DR) and Information Gain (IG). The subsets of interdependent genes generated by ED and PR are then given as input to the learning machines, where each subset is divided into a training set and a test set. First the training set is fed to the classifier, followed by the test set; the classifier assigns a class label to each test sample, which is compared with the original class label to calculate the accuracy. Finally, the outputs of the two classifiers are compared. If required, missing values are imputed by the K-Nearest Neighbor method, and Min-Max normalization is used to normalize the data sets. Feature reduction is done through Principal Component Analysis (PCA). Two classifiers are used: Naïve Bayes and Support Vector Machine (SVM). Naïve Bayes is a probabilistic classifier based on Bayes’ theorem. It relies on an independence assumption: the presence or absence of one gene does not affect the contribution of another to the classification. SVM is a supervised classifier that defines boundaries between two or more classes by constructing a hyperplane or a set of hyperplanes. SVM is designed for two-class problems, so when there are three or more classes, it groups all but one of the classes together and separates them from the remaining class. This process is repeated iteratively for all the classes.
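Two of the steps mentioned above, Min-Max normalization and the reduction of a multi-class problem to a series of two-class problems, are simple enough to sketch directly. The following Python is a minimal illustration under stated assumptions: normalization rescales one gene's values to the range [0, 1], and the multi-class reduction is the common one-vs-rest scheme. The function names and class names are hypothetical and are not taken from [15].

```python
def min_max_normalize(column):
    """Rescale one gene's expression values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant gene: map every value to 0
    return [(value - lo) / (hi - lo) for value in column]


def one_vs_rest_targets(labels):
    """Build one binary labelling per class: 1 for that class, 0 for the rest.

    Each binary labelling can then be handed to a two-class classifier
    such as an SVM.
    """
    return {
        cls: [1 if label == cls else 0 for label in labels]
        for cls in sorted(set(labels))
    }


print(min_max_normalize([2.0, 4.0, 6.0]))
print(one_vs_rest_targets(["ALL", "AML", "MLL", "ALL"]))
```

At prediction time, a one-vs-rest scheme typically assigns the class whose binary classifier reports the highest confidence.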

Bouazza, Sara Haddou, et al. [16] present “Gene-expression-based cancer classification through feature selection with KNN and SVM classifiers”, a study of the effect of filter-based feature selection methods on the accuracy and error of supervised cancer classification. A comparative evaluation of four selection methods (Fisher, T-Statistics, SNR, and ReliefF) is carried out on datasets for different cancers: leukemia, prostate cancer, and colon cancer. The classification results using k-nearest neighbors (KNN) and support vector machine (SVM) classifiers show that combining the SNR method with the SVM classifier gives the highest accuracy. These datasets serve as a test bed for comparing the selection methods and classifiers. Cancer is due to the abnormal growth of cells that have the ability to invade or spread to other parts of the body.
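As an illustration of one of the filter criteria compared above, the signal-to-noise ratio (SNR) of a gene is commonly defined as the difference between the two class means divided by the sum of the two class standard deviations. The sketch below assumes that definition, sample standard deviations, and 0/1 class labels; the exact variant used in [16] may differ, and the function name is hypothetical.

```python
from statistics import mean, stdev


def snr_score(values, labels):
    """Signal-to-noise ratio of one gene for a two-class problem.

    Assumed definition: (mean_1 - mean_0) / (std_1 + std_0), where the
    subscripts refer to the samples labelled 1 and 0.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = stdev(pos) + stdev(neg)
    if spread == 0:
        return 0.0  # both classes are constant; the ratio is undefined
    return (mean(pos) - mean(neg)) / spread


# Made-up expression values for one gene across six samples.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
print(snr_score(values, labels))
```

Genes are then ranked by the absolute value of this score, so that genes strongly over-expressed in either class rise to the top.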

Ahsen, Mehmet Eren, et al. [17] provide research under the title Sparse feature selection for classification and prediction of metastasis in endometrial cancer, proposing a new algorithm for sparse feature selection in binary classification problems and applying it to predict the risk of metastasis in endometrial cancer. To detect candidate quantitative microRNA feature sets within the primary tumors that may discriminate between node-positive and node-negative disease, together with a numerical procedure for combining the measured values of the features, they turned to machine learning protocols (Support Vector Machines). To detect discriminatory features that may predict metastatic disease, 213 miRNA expression features measured in 86 samples (43 lymph node-positive and 43 lymph node-negative) were used as the training data after normalization. Applying the lone star algorithm to the training data with 80 random cross-validations at each iteration resulted in a set of 18 features. Afterwards, to compute a unique classifier, a single iteration of lone star was run with these 18 features. Fifty stage I and 50 stage IIIC frozen endometrial cancer samples were obtained from the Gynecologic Oncology Group (GOG) tumor bank. The samples were collected from patients enrolled in GOG tissue acquisition protocol 210, which established a repository of clinical specimens with detailed clinical and epidemiologic data from patients with surgically staged endometrial carcinoma. A weighted classifier using 18 microRNAs achieved 100% accuracy on the training cohort. When applied to the testing cohort, the classifier correctly predicted 90% of node-positive cases and 80% of node-negative cases (FDR = 6.25%).

Tirumala, Sreenivas S. [18] provides research under the title Attribute Selection and Classification of Prostate Cancer Gene Expression Data Using Artificial Neural Networks, proposing an extended ANN-based approach for classification and prediction of prostate cancer using gene expression data. First, four attribute selection approaches, namely Sequential Floating Forward Selection (SFFS), RELIEFF, Sequential Backward Feature Selection (SFBS), and Significant Attribute Evaluation (SAE), were used to identify the most influential attributes among 12600. Then ANNs and Naive Bayes were used for classification with the complete set of attributes as well as with the various sets obtained from the attribute selection methods. Experimental results show that ANN outperformed Naive Bayes, achieving a classification accuracy of 98.2% compared to 62.74% with the full set of attributes. Further, with the attributes selected by SFFS, ANNs achieved better classification accuracy (100%) than Naive Bayes. For prediction using ANNs, SFFS achieved the best results, with 92.31% accuracy, correctly predicting 24 out of 26 samples provided for independent testing.

The study Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification [19] aims to show the effectiveness of a proposed weight for gene selection in high-dimensional cancer classification. Penalized logistic regression using the least absolute shrinkage and selection operator (LASSO) is one of the key methods in high-dimensional cancer classification because it performs gene coefficient estimation and gene selection simultaneously. However, the LASSO has been criticized for being biased in gene selection. The adaptive LASSO (APLR) was originally proposed to overcome this selection bias by assigning a consistent weight to each gene. To demonstrate the effectiveness of the proposed initial weight, three DNA microarray datasets with different sample sizes and numbers of genes were used, as listed in Table 1: first, the colon dataset; second, the prostate dataset; and third, the diffuse large B-cell lymphoma (DLBCL) dataset. The method achieved maximum AUCs of 0.966, 0.971, and 0.953 for the colon, prostate, and DLBCL datasets, respectively.

Table 1: Details of the used datasets

Dataset type    No. of samples    No. of genes    Classes
Colon           62                2000            Tumor/Normal
Prostate        102               5966            Tumor/Non-tumor
DLBCL           77                7129            DLBCL/FL

B. Wrapper Approach

Wrapper techniques originate from the idea of wrapping the selection approach around a classifier, which is treated as a black box. In wrapper techniques, a search algorithm performs the selection step, proposing different gene subsets from the feature space. The classification model then evaluates each subset, and the subset that achieves the best classification accuracy is retained. This is carried out using the training data only. The search algorithms fall into two classes: randomized and deterministic. Simulated annealing, genetic algorithms, estimation of distribution algorithms, and randomized hill climbing are examples of randomized algorithms. Sequential forward selection and backward elimination are examples of deterministic algorithms. The drawback of wrapper techniques is their high computational cost [20]. To better understand the wrapper approach, the genetic algorithm can be taken as a representative gene selection technique and its results compared with those of a multiple-scoring filter technique when both are applied to the same classifiers. Examples of wrapper methods are various optimization algorithms such as sequential search, estimation of distribution algorithms [21], and the Genetic Algorithm (GA) [22].
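The wrapper loop described above (a search algorithm proposes gene subsets, a classifier scores them on the training data) can be sketched with the deterministic sequential forward selection strategy. This is a minimal sketch under stated assumptions, not the setup of any cited study: a nearest-centroid classifier stands in for the black-box classifier, leave-one-out cross-validation on the training data provides the accuracy estimate, and the data are made up. All function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the class whose mean vector (centroid) is closest to x."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(sorted(centroids), key=lambda label: dist2(centroids[label], x))


def loocv_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of the classifier using only the given genes."""
    reduced = [[row[g] for g in genes] for row in samples]
    correct = 0
    for i in range(len(reduced)):
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, reduced[i]) == labels[i]:
            correct += 1
    return correct / len(reduced)


def forward_selection(samples, labels, max_genes):
    """Greedily add the gene that most improves accuracy; stop when none does."""
    selected, best_acc = [], 0.0
    remaining = list(range(len(samples[0])))
    while remaining and len(selected) < max_genes:
        scored = [(loocv_accuracy(samples, labels, selected + [g]), g)
                  for g in remaining]
        acc, gene = max(scored, key=lambda pair: pair[0])
        if acc <= best_acc:
            break  # no candidate gene improves the current subset
        selected.append(gene)
        remaining.remove(gene)
        best_acc = acc
    return selected, best_acc


# Made-up training data: 6 samples x 4 genes; gene 0 separates the classes.
samples = [
    [1.0, 5.0, 0.2, 3.0],
    [1.2, 4.8, 0.9, 3.1],
    [0.9, 5.1, 0.4, 2.9],
    [3.0, 5.0, 0.8, 3.0],
    [3.2, 4.9, 0.3, 3.2],
    [2.9, 5.2, 0.5, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]
selected, accuracy = forward_selection(samples, labels, max_genes=2)
print(selected, accuracy)
```

Every candidate subset requires a full round of classifier training and evaluation, which is where the high computational cost of wrapper techniques comes from.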

Figure 7: Wrapper feature selection model. (Labels recoverable from the figure: set of all features; performance.)

The study in [23] used hybrid particle swarm optimization and genetic algorithms for gene selection and a fuzzy support vector machine (SVM) as the classifier. Fuzzy logic is used to infer the importance of each sample in the training phase and to decrease the system's sensitivity to outliers, which increases the classifier's ability to generalize. A decision-tree algorithm was applied to the most frequent genes to develop a set of rules for each type of cancer. This improved the algorithm by finding the best parameters for the classifier during the training phase without the need for trial and error by the user. The proposed approach was tested on four benchmark gene expression profiles. The reported classification accuracy is 100% for leukemia data, 96.67% for colon cancer, and 98% for breast cancer. The results show that the best kernel for training the SVM classifier is the radial basis function.

Elyasigomari, Vahid, et al. [24] developed a two-stage gene selection method that incorporates a novel hybrid approach using the cuckoo optimization algorithm and harmony search (COA-HS) for cancer classification. The data were discretized into nine states. After this pre-processing stage, the top 100 genes were selected using minimum redundancy maximum relevance (MRMR). The selected genes were fed into a wrapper setup consisting of the COA-HS algorithm and the SVM classifier to choose the minimum number of genes that provides 100% accuracy. Finally, the classification performance of the selected genes was measured in terms of accuracy via leave-one-out cross-validation (LOOCV). To validate the performance of COA-HS, the results were compared with those obtained from other evolutionary algorithms: the genetic algorithm (GA), particle swarm optimization (PSO), harmony search (HS), and the cuckoo optimization algorithm (COA). The code for this study was written in Matlab 2014a. Microarray data for four cancer types (leukemia, prostate, lymphoma, and colon) were used.

Al-Rajab et al. [25] provide Examining applying high performance genetic data feature selection and classification algorithms for colon cancer diagnosis. They applied several methods to a colon data set and found that PSO outperforms the other selection algorithms, and that IG is efficient at ranking the genes and hence at selecting the best-ranked ones. In the experiments presented in that research, the PSO/SVM method achieved the highest average accuracy (87%) in classifying colon cancer datasets compared with the other algorithms. IG/SVM (86%) and IG/DT (86%) also showed very good classification accuracy, while PSO/DT had lower classification accuracy (71%).

C. Embedded Approach

Figure 8: Embedded feature selection model. (Label recoverable from the figure: set of all features.)

Both embedded and wrapper techniques involve the classification model. In embedded techniques, however, the search for an optimal subset of features is built into the construction of the classifier, so the classifier's role is both to select the subset of informative genes and to evaluate it. In wrapper techniques, the classifier's role is only to evaluate the subsets chosen by the search algorithm. The most common embedded technique is Recursive Feature Elimination using SVM (SVM-RFE). The advantages of this approach are its interaction with the classification model and its lower computational cost compared with the wrapper approach. Its main disadvantage is that it is tied to the classifier used [26].
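To make the embedded idea concrete, the sketch below follows the usual outline of SVM-RFE: train a linear SVM, rank the genes by the squared magnitude of their weights, drop the weakest gene, and repeat. It is a minimal sketch under stated assumptions rather than a reference implementation. The linear SVM is trained with plain batch subgradient descent on the regularised hinge loss, the hyperparameters (`epochs`, `lr`, `lam`) are arbitrary illustrative choices, the targets are coded as +1 and -1, and the data are made up. The function names are hypothetical.

```python
def train_linear_svm(rows, targets, epochs=200, lr=0.1, lam=0.01):
    """Fit a linear SVM by batch subgradient descent on the hinge loss.

    rows: list of feature vectors; targets: +1 / -1 labels.
    Returns the weight vector and the bias.
    """
    n_features = len(rows[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(rows, targets):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # sample lies inside the margin or is misclassified
                for j in range(n_features):
                    grad_w[j] -= y * x[j] / len(rows)
                grad_b -= y / len(rows)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(samples, targets, n_keep):
    """Recursively eliminate the gene with the smallest squared SVM weight."""
    active = list(range(len(samples[0])))
    while len(active) > n_keep:
        rows = [[row[g] for g in active] for row in samples]
        w, _ = train_linear_svm(rows, targets)
        weakest = min(range(len(active)), key=lambda j: w[j] ** 2)
        del active[weakest]
    return active


# Made-up, roughly centred data: gene 1 is informative, genes 0 and 2 are not.
samples = [
    [0.1, -1.0, -0.2],
    [-0.1, -1.2, 0.1],
    [0.0, -0.8, -0.1],
    [0.0, 1.0, 0.2],
    [0.1, 1.2, -0.1],
    [-0.1, 0.8, 0.1],
]
targets = [-1, -1, -1, 1, 1, 1]
print(svm_rfe(samples, targets, n_keep=1))
```

Selection here is a by-product of training the classifier itself: no separate search algorithm proposes subsets, which is why the cost is lower than for a wrapper, and why the result depends on the chosen classifier.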

D. Hybrid Methods

Hybrid methods follow a sequential approach in which the first step usually applies filter approaches to reduce the number of features considered in the second stage. A wrapper method is then used to select the required number of features from this reduced set [27].

J. Bennet, C. Ganaprakasam, and N. Kumar [28] proposed a hybrid gene selection technique: an ensemble feature selection technique that combines Recursive Feature Elimination (RFE) and the Based Bayes error Filter (BBF) for gene selection with the Support Vector Machine (SVM) algorithm for classification. SVM-RFE yields good classification performance but suffers from poor separability in the presence of redundant class labels, whereas BBF avoids redundant class labels in selection. Combined, the two achieve comparable performance. Features are selected from the dataset with SVM-RFE and BBF, the selected features are passed to the classifier for training, and finally the model is evaluated on the test data.

Motieghader, Habib, et al. [29] provide A hybrid gene selection algorithm for microarray cancer classification using genetic algorithm and learning automata. They propose a hybrid meta-heuristic algorithm that integrates a Genetic Algorithm and Learning Automata (GALA). It shows acceptable accuracy and performance on six cancer datasets: Colon, ALL_AML, SRBCT, MLL, Tumors_9, and Tumors_11. The mean classification accuracy of GALA on the colon dataset with 8 genes was 99.46%; on the ALL_AML, SRBCT, and MLL datasets with 2, 4, and 3 genes it was 100%, 97.35%, and 93.96%, respectively; and on the Tumors_9 and Tumors_11 datasets with 10 genes it was 86.52% and 84.38%, respectively.

Salem, Hanaa, et al. [30] provide research under the title Classification of human cancer diseases by gene expression profiles, which aims to classify human cancer diseases. For feature selection, the study employs information gain (IG) to select the significant features from the input patterns of the gene microarray dataset. For feature reduction, it employs a genetic algorithm (GA) to reduce the features determined by the IG. Cancer type classification is done by means of genetic programming (GP). The framework is verified on seven cancer gene expression datasets (leukemia, prostate, lung cancer-Ontario, lung cancer-Michigan, DLBCL, central nervous system, and colon). The reported classification accuracies are 97.06% for leukemia, 85.48% for colon tumor, 86.67% for central nervous system, 74.4% for lung cancer-Ontario, 100% for lung cancer-Michigan, 94.8% for DLBCL, and 100% for prostate.

Salem, Hanaa, et al. [31] provide research under the title Early diagnosis of breast cancer by gene expression profiles, presenting an intelligent decision support system (IDSS) for breast cancer diagnosis using gene expression profiles. The system first extracts significant features from the input patterns by using information gain (IG) and then employs a deep genetic algorithm for feature reduction. An IG threshold of 0.7 is the optimal value for the breast cancer dataset. At this value, the features are reduced from 24,481 attributes to 45 attributes by IG and further reduced to 22 features by applying GA with a population size of 100 and 20 evaluation iterations. The reported classification accuracy is 100%.
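The information gain criterion with a threshold, as used in the two studies above, can be sketched as follows. The sketch assumes that expression values have already been discretized (here into "high" and "low", a hypothetical choice) and that information gain is the usual reduction in class entropy; the threshold of 0.7 echoes the value reported in [31], but the data and function names are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for v, y in zip(feature_values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def select_by_ig(columns, labels, threshold):
    """Keep the indices of the genes whose information gain meets the threshold."""
    return [g for g, column in enumerate(columns)
            if information_gain(column, labels) >= threshold]


# Made-up, already discretized data: one column of values per gene.
labels = ["tumor", "tumor", "normal", "normal"]
columns = [
    ["high", "high", "low", "low"],   # gene 0: splits the classes perfectly
    ["high", "low", "high", "low"],   # gene 1: carries no class information
]
print(select_by_ig(columns, labels, threshold=0.7))
```

In a pipeline like the ones described, the genes surviving this threshold would then be passed to the genetic algorithm for further reduction.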

VI. Common Cancer Classification Techniques

Various classification algorithms have been applied in the cancer classification field to predict and diagnose cancer. As mentioned before, classification algorithms can be broadly divided into two main categories: supervised learning algorithms and unsupervised learning algorithms. Some of the most commonly used supervised learning algorithms are summarized below. Most of them are ranked among the top ten most influential algorithms in the research community, as identified by the Institute of Electrical and Electronics Engineers (IEEE) [32].

A. Naïve Bayes Classifier

The Naïve Bayes classifier is a simple yet highly scalable probabilistic algorithm that relies on Bayes' theorem. It assumes that the features are independent of one another given the class. Recall that if two events A and B are independent, then P(A|B) = P(A).

Assuming independence among the features reduces the 𝑑-dimensional multivariate problem to 𝑑 univariate problems: instead of estimating the joint likelihood 𝑝(𝐴1, 𝐴2, …, 𝐴𝑑 | 𝐵𝑗), the simpler per-feature likelihoods 𝑝(𝐴1 | 𝐵𝑗), …, 𝑝(𝐴𝑑 | 𝐵𝑗) are estimated and multiplied together. However, in any biological task, such as gene-expression-based cancer classification, it is intuitive that there are dependencies between genes. Genes are likely to have meaning in a group rather than individually.

In practice, even when feature dependencies are known to exist, the Naïve Bayes classifier has performed remarkably well and has outperformed more sophisticated classifiers [33]. In fact, the assumption of feature independence, regardless of its validity, protects this classifier against the over-fitting problem.
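The factorization above can be sketched in a few lines. The example below is a minimal Gaussian Naïve Bayes classifier, assuming continuous expression values modeled by one normal distribution per feature and class. The Gaussian likelihood, the toy data, and all function names are assumptions made for illustration and are not taken from the surveyed papers.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate the class prior and a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[y] = (n / len(samples), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, x):
    """Pick the class B_j maximizing log p(B_j) + sum_i log p(A_i | B_j)."""
    scores = {}
    for y, (prior, means, variances) in model.items():
        scores[y] = math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
    return max(scores, key=scores.get)

# Hypothetical expression levels of two genes for four training samples.
X = [[5.1, 0.9], [4.8, 1.2], [1.0, 3.9], [1.3, 4.2]]
y = ["tumor", "tumor", "normal", "normal"]
model = fit_gaussian_nb(X, y)
print(predict(model, [5.0, 1.0]))
print(predict(model, [1.1, 4.0]))
```

Working with sums of log-likelihoods rather than products of probabilities avoids numerical underflow when the number of genes is large.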

B. K-Nearest Neighbors

Nearest neighbor is a non-parametric, similarity-based classifier. The idea of K-Nearest Neighbors (KNN) is to assign an unseen sample to the most common class among its K nearest neighbors. In other words, given a feature vector 𝑋 of gene expression values that needs to be classified, the label assigned to 𝑋 is determined by its K most similar training samples. Similarity can be measured by distance functions (e.g., Hamming, Euclidean, or Mahalanobis distance) or by correlation coefficients (e.g., Pearson [34] or Spearman). Classification is made by measuring the distances from the test instance to all training instances; the class that receives the most votes among the K nearest instances is assigned to the test instance.
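The voting procedure described above can be sketched as follows. This is a minimal illustration assuming Euclidean distance and a simple majority vote; the toy samples and the function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Label the query by majority vote among its k nearest training samples."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical expression levels of two genes for six training samples.
train_x = [[5.1, 0.9], [4.8, 1.2], [5.3, 1.1],
           [1.0, 3.9], [1.3, 4.2], [0.8, 4.4]]
train_y = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]

print(knn_predict(train_x, train_y, [4.9, 1.0], k=3))
print(knn_predict(train_x, train_y, [1.2, 4.0], k=3))
```

Passing a different function as the distance argument (for example, one based on a correlation coefficient) changes the similarity measure without touching the voting logic.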

C. Decision Trees: CART

Classification And Regression Tree (CART) is a binary recursive-partitioning classification algorithm. CART classifiers are considered unstable, high-variance classifiers: a small change in the inputs can change the output decision, because the hierarchical structure propagates an error at one split through the rest of the tree. This is considered a drawback for stand-alone classifiers.

However, this instability is an advantage in ensemble classifiers, which benefit from diverse base classifiers. On the other hand, decision trees easily over-fit, especially on small datasets. Many studies have applied this technique to tumor classification [35].
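A single partitioning step of such a tree can be sketched as follows. The example assumes the Gini impurity as the split criterion, which is the usual choice in CART, and searches one feature for the best binary threshold. The toy data and function names are hypothetical. The second call shows how relabeling a single sample moves the chosen threshold, which is the instability described above.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    best = (None, float("inf"))
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical expression levels of one gene for six samples.
gene = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0]
labels = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
threshold, impurity = best_split(gene, labels)
print(threshold, impurity)

# Relabeling only the third sample moves the chosen threshold.
changed_labels = ["normal", "normal", "tumor", "tumor", "tumor", "tumor"]
shifted = best_split(gene, changed_labels)
print(shifted)
```

A full CART implementation would repeat this search over every feature and recurse on the two resulting partitions until a stopping rule is met.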

D. Support Vector Machines (SVMs)

Support Vector Machine (SVM) is one of the most popular machine learning techniques. Like K-NN, it is based on similarity, except that it does not have to calculate the distance between a new unseen point and all the available data points; only the vectors that influence the decision boundary (the support vectors) matter. The idea of SVM is to maximize the margin between the classes: the larger the margin, the more confident the classifier [36].
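The margin criterion can be illustrated without training a full SVM. The sketch below assumes a linear decision boundary w · x + b = 0 and labels of +1 and -1. It computes the geometric margin of two hand-picked candidate hyperplanes and lists the samples that attain it. The data, the candidate hyperplanes, and the function names are hypothetical; an actual SVM finds the maximum-margin hyperplane by solving an optimization problem rather than by comparing candidates.

```python
import math

def geometric_margins(w, b, samples, labels):
    """Signed distance y_i * (w . x_i + b) / ||w|| of each sample to the hyperplane."""
    norm = math.sqrt(sum(c * c for c in w))
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
            for x, y in zip(samples, labels)]

def margin_and_support(w, b, samples, labels, tol=1e-9):
    """Return the smallest margin and the samples that attain it."""
    margins = geometric_margins(w, b, samples, labels)
    smallest = min(margins)
    support = [x for x, m in zip(samples, margins) if m - smallest <= tol]
    return smallest, support

# Hypothetical 2-D samples with labels +1 and -1.
samples = [[1.0, 1.0], [0.0, 1.0], [4.0, 4.0], [5.0, 5.0]]
labels = [-1, -1, 1, 1]

# Two candidate hyperplanes; both classify every sample correctly.
candidates = {
    "x1 + x2 = 5": ([1.0, 1.0], -5.0),
    "x1 = 3": ([1.0, 0.0], -3.0),
}
results = {name: margin_and_support(w, b, samples, labels)
           for name, (w, b) in candidates.items()}
for name, (margin, support) in results.items():
    print(name, round(margin, 3), support)
```

Both candidates separate the two classes, but the first leaves a wider margin, so the SVM criterion prefers it. Only the samples closest to the boundary determine that margin; moving the other samples farther away would not change the result.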

E. Artificial Neural Networks (ANNs)

Artificial Intelligence (AI) approaches for medical diagnosis and prediction of cancer are important and ever-growing areas of research. Artificial Neural Networks (ANNs) are one such approach that has been successfully applied in these areas. Various types of clinical datasets have been used in intelligent decision-making systems for medical diagnosis, especially of cancer, for over three decades. However, gene expression datasets are complex, with large numbers of attributes, which makes classification and prediction more difficult for AI approaches. The importance of using AI techniques in bioinformatics has been known for some time. ANNs have been implemented for classification and prediction of various types of cancer, such as colorectal, lung, and colon cancer. Most of these implementations use gene expression datasets.

Classification of complex non-linear data such as gene expression data is a challenging task because of its high dimensionality, especially when the number of samples is far smaller than the number of attributes in the dataset. With such datasets, analysing the importance of attributes and identifying the significance of an individual attribute become more difficult. Earlier research on using ANNs for gene expression analysis was limited mainly to single-layered ANNs because powerful hardware was not available. One such early work emphasizes the importance of identifying a set of significant genes among the thousands of genes in the dataset [37].

VII. Discussion

Reviewing the previous studies summarized in Table 3, we found that the most effective approach integrates a filter with a wrapper approach for gene selection. For example, [28] applies the SVM-RFE embedded approach followed by BBF as a filter and reaches a classification accuracy of 97.2% using SSVM as the classifier. In [16], the SNR filter gene selection approach reaches a maximum classification accuracy of 100% on the leukemia dataset with a KNN classifier. In another study on the same dataset, the CFS filter with a J48 classifier gave a classification accuracy of 91.17%, whereas the same filter on the same dataset with a Naïve Bayes (NB) classifier reached 100% [30]. This suggests that gene expression datasets respond differently to different selection and classification algorithms. On the other hand, most of the studies address the classification of breast cancer, because of its relatively high prevalence compared to other cancers. Yet all cancers affecting organs of the body can be fatal, and few studies have reached classification accuracies above 90% for other cancers, such as cancer of the central nervous system, as in [38]. Table 4 reviews some methods of selecting and classifying gene expression datasets and compares the classification accuracy before and after improvement. Most methods depend on improvements in the gene selection technique.

Table 3: Comparison between gene selection and classification methods

Table 4: Comparison of classification accuracy before and after improvement of the selection method

| Ref | Classifier | Selector | Accuracy before improvement (%) | Accuracy after improvement (%) |
|---|---|---|---|---|
| [41] | Decision Tree (DT) | t-statistics (TA) | 85.00 | 98.33 |
| [42] | Support Vector Machine (SVM) | Genetic Algorithm (GA) | 85.4839 | 98.3871 |
| [43] | Linear Kernel SVM, RBF Kernel SVM | Information Gain (IG), Mutual Information (MI) | 71.0 | 93.9 |
| [44] | --- | Binary Particle Swarm Optimization (BPSO) | 86.94 | 94.19 |
| [38] | Decision Tree (DT) | Genetic Algorithm (GA) | 88..8 | 89.24 |

VIII. Conclusion

A systematic and unbiased approach to cancer classification is of great importance to cancer treatment and drug discovery. Previous cancer classification methods were all clinically based and were limited in their diagnostic ability. Gene expressions are known to hold the keys to the fundamental problems of cancer diagnosis, cancer treatment, and drug discovery. The recent advent of microarray technology has made the production of large amounts of gene expression data possible. This has motivated researchers to propose different cancer classification algorithms using gene expression data. In this paper, we provided a comprehensive survey of the existing cancer classification methods and evaluated their performance in two aspects: classification accuracy and biological relevance.

Gene selection, as an important preprocessing step, was also presented in detail and evaluated for its relevance to cancer classification. Related issues such as statistical significance vs. biological significance, asymmetric classification errors, gene contamination, and marker genes were also introduced. Through this survey, we conclude that cancer classification using gene expression data has a promising future in providing a more systematic and unbiased approach to differentiating tumor types. However, a great amount of work still needs to be done to achieve the goal of cancer classification.

References

[1] A. B. Al-Badareen, M. H. Selamat, M. Samat, Y. Nazira and O. Akkanat, “A Review on Clinical Decision Support Systems in Healthcare,” Journal of Convergence Information Technology, vol. 9, p. 125, 2014.

[2] K. R. Foster, R. Koprowski and J. D. Skufca, “Machine learning, medical diagnosis, and biomedical engineering research-commentary,” Biomedical Engineering Online, vol. 13, p. 94, 2014.

[3] D. Raval, D. Bhatt, M. K. Kumhar, V. Parikh and D. Vyas, “Medical Diagnosis System Using Machine Learning,” 2015.

[4] S. Godara and R. Singh, “Evaluation of predictive machine learning techniques as expert systems in medical diagnosis,” Indian Journal of Science and Technology, vol. 9, 2016.

[5] D. S. Kumar, G. Sathyadevi and S. Sivanesh, “Decision support system for medical diagnosis using data mining,” International Journal of Computer Science Issues, vol. 8, pp. 147-153, 2011.

[6] D. Hanahan and R. A. Weinberg, “Hallmarks of cancer: the next generation,” Cell, vol. 144, pp. 646-674, 2011.

[7] N. Brändle, H. Bischof and H. Lapp, “Robust DNA microarray image analysis,” Machine Vision and Applications, vol. 15, pp. 11-28, 2003.

[8] B. Teacher, “Leaving Certificate Biology,” 22 11 2017. [Online]. Available: http://leavingbio.net/cell-division/. [Accessed 3 2018].

[9] “The New Genetics,” 5 2010. [Online]. Available: https://publications.nigms.nih.gov/thenewgenetics/chapter1.html. [Accessed 2018].

[10] “What Is Cancer?,” National Cancer Institute, 9 February 2015. [Online]. Available: https://www.cancer.gov/about-cancer/understanding/what-is-cancer. [Accessed 3 2018].

[11] R. Clarke, H. W. Ressom, A. Wang, J. Xuan, M. C. Liu, E. A. Gehan and Y. Wang, “The properties of high-dimensional data spaces: implications for exploring gene and protein expression data,” Nature Reviews Cancer, vol. 8, p. 37, 2008.

[12] B. Duval and J.-K. Hao, “Advances in metaheuristics for gene selection and classification of microarray data,” Briefings in Bioinformatics, vol. 11, pp. 127-141, 2010.

[13] Z. M. Hira and D. F. Gillies, “A review of feature selection and feature extraction methods applied on microarray data,” Advances in Bioinformatics, vol. 2015, 2015.

[14] T. Nguyen, A. Khosravi, D. Creighton and S. Nahavandi, “Hidden Markov models for cancer classification using gene expression profiles,” Information Sciences, vol. 316, pp. 293-307, 2015.

[15] K. a. R. J. a. M. D. Das, “Gene selection using information theory and statistical approach,” Indian Journal of Science and Technology, vol. 8, p. 695, 2015.

[16] S. H. Bouazza, N. Hamdi, A. Zeroual and K. Auhmani, “Gene-expression-based cancer classification through feature selection with KNN and SVM classifiers,” in Intelligent Systems and Computer Vision (ISCV), 2015.

[17] M. E. Ahsen, T. P. Boren, N. K. Singh, B. Misganaw, D. G. Mutch, K. N. Moore, F. J. Backes, C. K. McCourt, J. S. Lea, D. S. Miller et al., “Sparse feature selection for classification and prediction of metastasis in endometrial cancer,” BMC Genomics, vol. 18, p. 233, 2017.

[18] S. S. Tirumala and A. Narayanan, “Attribute selection and classification of prostate cancer gene expression data using artificial neural networks,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2016.

[19] Z. Y. Algamal and M. H. Lee, “Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification,” Expert Systems with Applications, vol. 42, pp. 9326-9332, 2015.

[20] V. Bolón-Canedo, N. Sánchez-Maroño and A. Alonso-Betanzos, “Feature selection for high-dimensional data,” Progress in Artificial Intelligence, vol. 5, pp. 65-75, 2016.