Abstract - Cancer classification is one of the major applications of microarray technology. When standard machine learning algorithms are applied to cancer classification, they struggle with the high dimensionality of gene expression data. Dimension reduction is therefore a critical issue in the analysis of microarray data, because the high dimensionality of gene expression data sets hurts the generalization performance of classifiers. Classification analysis of microarray gene expression data has been performed extensively to discover biological features and to differentiate closely related cell types that usually appear in the diagnosis of cancer. Many algorithms and techniques have been developed for microarray gene classification and for dimensionality reduction of the data set. These techniques accomplish microarray gene classification in three basic phases, namely dimensionality reduction, feature selection and gene classification. In our previous work, microarray gene classification by a statistical analysis approach with a Fuzzy Inference System (FIS) was proposed for precise classification of genes to their corresponding gene types. In place of the various existing dimensionality reduction techniques, that work used prescribed statistical procedures to perform the classification process efficiently. To further substantiate and analyze its performance, we conduct a comparative study in this work against the popular dimensionality reduction technique called Principal Component Analysis (PCA). PCA replaces the proposed statistical approach and is used to perform microarray gene expression data classification. Based on the obtained results, we compare the performance of the statistical approach with FIS against PCA with FIS. The study shows that the statistical approach with FIS outperforms PCA with FIS in classification performance.
Index Terms – Dimensionality Reduction, Microarray Gene, Feature Selection, Gene Classification, Principal Component Analysis, FIS.
Manuscript received 12/Nov/2012. Manuscript selected 05/Dec/2012. Tamilselvi Madeswaran, Research Scholar,
Anna University of Technology, Coimbatore, Tamilnadu, India E-mail: [email protected]
Dr.G.M.Kadhar Nawaz, Director & Professor, Department of Computer Applications,
Sona College of Technology, Salem, Tamilnadu, India E-mail: [email protected]
I. INTRODUCTION
Huge data sets of different sizes are naturally distributed over networks. This explosive growth of data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically turn the processed data into useful information and knowledge [9]. Data mining has emerged as a successful solution for identifying information concealed in databases. Data mining has been conventionally defined [25] as “the non-trivial extraction of implicit, formerly unknown and practically beneficial information from data in databases” [1] [2]. Data mining is the fundamental step of Knowledge Discovery in Databases (KDD): it applies computational techniques that, under tolerable computational efficiency restrictions, produce a particular enumeration of patterns (or models) over the data [3][27]. Discovering formerly unknown, valid patterns and relationships in huge data sets by data mining [7-9] involves the use of advanced data analysis tools.
Some examples of data mining techniques are clustering, association rule mining and classification. Among these, classification plays a decisive role in the field of microarray technology. Nowadays, the concurrent measurement of the expression levels of thousands of genes, possibly the entire set of genes in an organism, is practicable in a single experiment by means of microarrays [14][6].
Microarray technology has emerged as an imperative tool for tracking genome-wide gene expression levels [15]. Analysis of gene expression profiles in various organs using microarray technologies reveals individual genes, gene ensembles, and the metabolic pathways underlying the structural and functional organization of an organ and its physiological function [16]. The application of microarray technology can automate the diagnostic task and improve the accuracy of conventional diagnostic methods [17][18]. Several techniques have been proposed earlier to decrease the dimensionality of gene expression data [12]. Numerous machine learning methods utilizing microarray data have been effectively applied to cancer classification [13] [11] [19]. However, due to the high dimensionality and small sample size of gene expression data, classification in microarray technology is considered extremely difficult. Much research has been performed on the successful classification of gene expression data. A few recent works available in the literature are reviewed in the following section.
A Comparative Analysis of Classification of Micro Array Gene Expression Data Using PCA
II. RELATED WORK
Li-Yeh Chuang et al. [20] have discussed that the support vector machine (SVM) learning method produces results equivalent to or better than those of neural networks on certain applications. They have extended the SVM technique with strategies such as fuzzy logic and statistical theories to group multiple cancer types by gene expression profiles. The resulting FSVM (fuzzy support vector machine), using the proposed strategies and outlier detection methods, has achieved performance equivalent or superior to other methods, with a more adaptable architecture, in distinguishing SRBCT and non-SRBCT samples.
Edmundo Bonilla Huerta et al. [21] have proposed a Genetic Algorithm (GA) approach integrated with Support Vector Machines (SVM) for the classification of high dimensional microarray data. A pre-filtering technique based on fuzzy logic is associated with the approach. Gene subsets, whose fitness is computed by an SVM classifier, are evolved using the GA. The most informative genes are identified by a frequency-based technique using archive records of “good” gene subsets. Evaluated on two well-known cancer datasets, their approach obtained results competitive with six existing methods.
Hieu Trung Huynh et al. [22] have discussed that the DNA microarray is a multiplex technology used in molecular biology and biomedicine. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, known as features, whose results are analyzed by computational methods. In recent times, the use of intelligent computing methods for the analysis of microarray data has attracted the attention of numerous researchers. Many of the proposed machine learning based approaches play a significant role in biomedical research, for example in gene expression interpretation, classification and prediction for cancer diagnosis. They have presented an application of the single hidden-layer feed forward neural network (SLFN), trained by a singular value decomposition (SVD) approach, to DNA microarray classification. The activation function of the hidden units of the SLFN classifier is ‘tansig’. Experimental results revealed that both the training procedure and the network structure of the SVD-trained feed forward neural network are simple, with low computational complexity, and can yield superior results with a compact network architecture.
Pradipta Maji et al. [23] have discussed that information measures such as entropy, mutual information, and f-information have proved successful for choosing a set of relevant and non-redundant genes from a high-dimensional microarray data set. However, determining the true density functions and carrying out the integrations necessary to calculate the various information measures is extremely difficult for continuous gene expression values. Consequently, the true marginal and joint distributions of continuous gene expression values are approximated by introducing the concept of the fuzzy equivalence partition matrix. The fuzzy equivalence partition matrix is based on the theory of fuzzy-rough sets; each row of the matrix characterizes a fuzzy equivalence partition that can be automatically extracted from the given expression values. To assess its performance, the class separability index and the predictive accuracy of the support vector machine obtained with the proposed approach were compared with those of existing approaches. The effectiveness of the proposed method in identifying relevant and non-redundant continuous-valued genes from microarray data was demonstrated.
Venkatesh et al. [24] have discussed that the exhaustive study of genes and their functions is termed genomics. Microarray analysis, or gene expression profiling, has made it possible to evaluate thousands of genes in a single sample. Microarray analysis has been useful in diverse fields for obtaining beneficial information by processing huge quantities of data. Gene samples acquired from biopsy samples gathered from colon cancer patients were presented. Artifacted states and separate malignant genes were distinguished from normal genes by the introduced learning vector quantization method.
In our previous work [28], we extracted features and reduced the dimension of microarray gene expression data using a statistical approach, and classification was performed using a personalized fuzzy inference system (FIS). This work extends that work by performing a comparative analysis against a conventional and popular dimensionality reduction technique, principal component analysis (PCA). The comparison is made in terms of classification performance. The rest of the paper is organized as follows. Section III gives a brief introduction to PCA-based dimensionality reduction. Section IV details the experimental setup, Section V discusses the performance, and Section VI concludes the paper.
III. DIMENSIONALITY REDUCTION METHOD FOR MICRO ARRAY GENE EXPRESSION DATA SET
A. PCA – Based Dimensionality Reduction
Principal component analysis (PCA) has been widely documented as an effective means for analyzing high dimensional data [5]. The basic idea of PCA is to reduce the dimensionality of the data set. This is achieved by transforming the p original variables $X = [x_1, x_2, \ldots, x_p]$ to a new set of K predictor variables, $T = [t_1, t_2, \ldots, t_K]$, which are linear combinations of the original variables. In mathematical terms, PCA sequentially maximizes the variance of a linear combination of the original predictor variables,
$$u_k = \arg\max_{u'u = 1} \operatorname{Var}(Xu) \qquad (1)$$
subject to the constraint $u_i' S_X u_j = 0$ for all $1 \le i < j$. The orthogonality constraint ensures that the linear combinations are uncorrelated, i.e. $\operatorname{Cov}(Xu_i, Xu_j) = 0$, $i \ne j$. These linear combinations
$$t_i = X u_i \qquad (2)$$
are known as the principal components. Computation of the principal components reduces to the solution of an eigenvalue-eigenvector problem. The projection vectors (also called the weighting vectors) u can be obtained by eigenvalue decomposition of the covariance matrix $S_X$,
$$S_X u_i = \lambda_i u_i \qquad (3)$$
where $\lambda_i$ is the i-th eigenvalue in descending order for $i = 1, \ldots, K$, and $u_i$ is the corresponding eigenvector. The eigenvalue $\lambda_i$ measures the variance of the i-th PC and the eigenvector $u_i$ provides the weights (loadings) for the linear transformation (projection).
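The eigenvalue formulation in equations (1) to (3) can be illustrated with a minimal sketch in plain Python. The helper names (mean_center, covariance, leading_eigenpairs, project), the toy data, and the use of power iteration with deflation as the eigen-solver are all assumptions made for illustration; they are not taken from the paper's MATLAB implementation.

```python
import math
import random


def mean_center(X):
    """Return A = X - mean(X), subtracting each column (variable) mean."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]


def covariance(A):
    """Sample covariance S_X = A'A / (n - 1) of mean-centered data A."""
    n, p = len(A), len(A[0])
    return [[sum(A[r][i] * A[r][j] for r in range(n)) / (n - 1)
             for j in range(p)] for i in range(p)]


def leading_eigenpairs(S, k, iters=500, seed=0):
    """Approximate the k largest eigenpairs (lambda_i, u_i) of symmetric S.

    Power iteration with deflation is an assumed, simple solver chosen
    to keep the sketch dependency-free.
    """
    p = len(S)
    S = [row[:] for row in S]  # work on a copy
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        u = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        for _ in range(iters):
            v = [sum(S[i][j] * u[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in v))
            if norm < 1e-12:  # remaining variance is numerically zero
                break
            u = [x / norm for x in v]
        lam = sum(u[i] * sum(S[i][j] * u[j] for j in range(p)) for i in range(p))
        pairs.append((lam, u))
        # Deflate: remove the component just found so the next one dominates.
        S = [[S[i][j] - lam * u[i] * u[j] for j in range(p)] for i in range(p)]
    return pairs


def project(A, pairs):
    """Scores t_i = A u_i for each retained eigenvector, as in equation (2)."""
    return [[sum(a * w for a, w in zip(row, u)) for _, u in pairs] for row in A]


# Toy data: 4 samples, 2 variables (illustrative values only).
X = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0], [0.0, -1.0]]
A = mean_center(X)
pairs = leading_eigenpairs(covariance(A), k=2)
T = project(A, pairs)
```

In this toy case the first eigenvalue equals the sample variance of the first score column, which is the property stated for $\lambda_i$ above. For data with thousands of genes, a library eigen-solver would be the more practical choice.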
The pseudo code for dimensionality reduction using PCA is given as follows.
Input: A set of gene samples $\{X_m \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N},\ m = 1, \ldots, M\}$.
Output: Low-dimensional representations $\{Y_m \in \mathbb{R}^{P_1 \times P_2 \times \cdots \times P_N},\ m = 1, \ldots, M\}$ of the input gene samples with maximum variation captured.
Algorithm:
1. Determine $\bar{X}$, the mean of $X$
2. Compute covariance matrix and its eigenvectors
3. Sort eigenvectors in descending order of eigenvalues
4. Project eigenvectors in order
5. Determine $A = X - \bar{X}$
6. Compute covariance matrix $C$
7. If mean is centered, $C = A^T A$
8. Find eigenvectors and corresponding eigenvalues $(V, E)$ of $C$
9. Sort eigenvalues such that $e_1 \ge e_2 \ge \cdots \ge e_n$
10. Project step-by-step into the principal components, $v_1$, $v_2$, etc.
IV. EXPERIMENTAL SETUP
The proposed statistical approach based microarray gene expression classification is implemented along with PCA-based dimensionality reduction. The machine configuration used in our comparative study is as follows.
Processor: Core i5
RAM size: 4GB
Clock speed: 3.20 GHz
Operating System: Windows 7.12
Implementation Tool: MATLAB
The work uses the microarray gene datasets of Acute Lymphoblastic Leukemia (ALL) and Acute Myeloid Leukemia (AML) [29], in which there are 38 samples of ALL and 12 samples of AML patients. Each patient is tested to obtain expression samples of 7129 genes. The performance of the methods is studied by conducting four experiments, in which the dataset is restructured in various ways. The details of the restructured datasets used in the four experiments are tabulated in Table I.
Table I: Dataset description
Experiment No.   No. of ALL patients   No. of AML patients
1                8                     4
2                8                     4
3                10                    4
4                26                    12
In every experiment, the two methods are subjected to N-fold cross-validation, from which the statistical measures are obtained. For example, in the first experiment, all 8 ALL patients and 3 of the AML patients are used for training, and the data of the dropped patient is used for testing. The same process is repeated in round-robin fashion, and the following statistical measures are determined.
True Positive (TP): AML patients are identified correctly
True Negative (TN): ALL patients are identified correctly
False Positive (FP): ALL patients are identified as AML patients
False Negative (FN): AML patients are identified as ALL patients
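The round-robin procedure, in which one patient is dropped for testing and the rest are used for training, can be sketched as a leave-one-out loop that tallies the four counts defined above. The function names, the one-dimensional toy samples, and the nearest-mean classifier are hypothetical stand-ins chosen for brevity; the paper's own classifier is the FIS, which is not reproduced here.

```python
def nearest_mean(train_x, train_y, x):
    """Hypothetical stand-in classifier: pick the class whose mean is closest to x."""
    means = {}
    for label in set(train_y):
        values = [v for v, y in zip(train_x, train_y) if y == label]
        means[label] = sum(values) / len(values)
    return min(means, key=lambda label: abs(x - means[label]))


def leave_one_out_counts(samples, labels, fit_predict, positive="AML"):
    """Drop each sample in turn, train on the rest, and tally TP/TN/FP/FN."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predicted = fit_predict(train_x, train_y, samples[i])
        if labels[i] == positive:
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            counts["FP" if predicted == positive else "TN"] += 1
    return counts


# Toy one-dimensional "expression" values (illustrative only).
samples = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 1.05]
labels = ["ALL", "ALL", "ALL", "ALL", "AML", "AML", "AML"]
counts = leave_one_out_counts(samples, labels, nearest_mean)
```

Here AML is treated as the positive class, matching the definitions of TP and FN given above; any classifier with the same three-argument signature could be passed in place of nearest_mean.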
V. PERFORMANCE ANALYSIS
This section first presents the confusion matrices of the two methods obtained for the four experiments. Second, statistical measures such as accuracy, sensitivity, specificity, false positive rate (FPR), positive predictive value (PPV), negative predictive value (NPV), false discovery rate (FDR) and Matthews correlation coefficient (MCC) are derived from the confusion matrices, and the reliability of the performance is discussed after determining the average and the deviation of each measure.
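The measures listed above follow from the four confusion-matrix counts by their standard definitions. The sketch below is a minimal illustration; the function names are hypothetical, and the example counts are illustrative values for a split of 4 AML and 8 ALL patients.

```python
import math


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_measures(tp, fn, fp, tn):
    """Derive the standard measures from one confusion matrix."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "fpr": ratio(fp, fp + tn),           # 1 - specificity
        "ppv": ratio(tp, tp + fp),           # precision
        "npv": ratio(tn, tn + fn),
        "fdr": ratio(fp, tp + fp),           # 1 - PPV
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Illustrative counts: 4 AML (positive) and 8 ALL (negative) patients.
measures = classification_measures(tp=2, fn=2, fp=0, tn=8)
```

Returning NaN for an undefined ratio (for example PPV when no sample is predicted positive) is a design choice of this sketch; averaging over experiments would then need to decide how to treat such entries.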
Table II: Confusion Matrices of Proposed Classification and PCA-based Classification
           Proposed Classification     PCA-based Classification
Dataset    TP   FN   FP   TN           TP   FN   FP   TN
1          2    2    0    8            2    2    3    5
2          1    3    0    8            2    2    5    3
3          1    3    1    9            2    2    7    3
4          4    8    1    25           6    6    15   11
Table III: Averaged Performance Measures of Proposed and PCA-based Methods (Represented with Mean and the Corresponding Standard Deviation)
Sl. No.   Performance Measure   Proposed          PCA-based
2         Sensitivity (in %)    33.33 ± 11.79     37.50 ± 14.43
3         Specificity (in %)    96.54 ± 4.72      43.08 ± 13.90
4         FPR (in %)            3.46 ± 4.72       56.92 ± 13.90
5         PPV (in %)            82.50 ± 23.63     24.43 ± 12.41
6         NPV (in %)            75.87 ± 3.04      59.03 ± 10.79
7         FDR (in %)            17.50 ± 23.63     75.57 ± 12.41
8         MCC                   0.35 ± 0.26       -0.32 ± 0.11
Table II tabulates the confusion matrices of the four datasets for the proposed and PCA-based methods. Table III compares the two methods in terms of statistical measures such as accuracy, sensitivity, specificity, FPR, NPV, PPV, FDR and MCC. On average, the proposed method attains higher values on the measures to be maximized, namely accuracy, specificity, PPV, NPV and MCC, while the mean sensitivity is slightly higher for the PCA-based method (37.50% against 33.33%). The proposed method also attains much lower values on the measures to be minimized, FPR and FDR. The performance measures of the four datasets are averaged, and the standard deviation is determined and tabulated in Table III.
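The averaging step can be illustrated with a short sketch that computes the mean and sample standard deviation of a measure over the four experiments. It assumes that the Table II entries for the proposed method are ordered as (TP, FN, FP, TN) and that the reported deviation uses the n - 1 denominator; both are assumptions inferred from the tables rather than stated in the text.

```python
from statistics import mean, stdev

# (TP, FN, FP, TN) per experiment for the proposed method, read from
# Table II under the assumption that its cells appear in this order.
proposed = [(2, 2, 0, 8), (1, 3, 0, 8), (1, 3, 1, 9), (4, 8, 1, 25)]

sensitivity = [100.0 * tp / (tp + fn) for tp, fn, fp, tn in proposed]
specificity = [100.0 * tn / (tn + fp) for tp, fn, fp, tn in proposed]


def summarize(values):
    """Mean and sample standard deviation (n - 1 denominator)."""
    return mean(values), stdev(values)


sens_mean, sens_sd = summarize(sensitivity)
spec_mean, spec_sd = summarize(specificity)
print(f"Sensitivity: {sens_mean:.2f} ± {sens_sd:.2f} %")
print(f"Specificity: {spec_mean:.2f} ± {spec_sd:.2f} %")
```

Under these assumptions the computed values agree with the sensitivity and specificity entries reported for the proposed method in Table III.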
VI. CONCLUSION
This paper studied the performance of the proposed statistical approach based dimensionality reduction for microarray gene classification against a popular dimensionality reduction and feature extraction method, PCA-based dimensionality reduction. The study was carried out on the ALL-AML cancer datasets, and the experimentation was divided into four experiments. The study concentrated on classification performance measures such as accuracy, sensitivity, specificity, FPR, PPV, NPV, FDR and MCC. The PCA-based method showed poorer performance than the proposed method. Hence, it can be asserted that the proposed dimensionality reduction method is suitable for microarray gene classification, as it extracts relevant information of smaller volume from the raw expression data.
REFERENCES
[1] Osmar, “Introduction to Data Mining”, In: Principles of Knowledge Discovery in Databases, CMPUT690, University of Alberta, Canada, 1999
[2] Kantardzic and Mehmed, “Data Mining: Concepts, Models, Methods, and Algorithms”, John Wiley & Sons, 2003
[3] Umarani and Punithavalli, “A Study on Effective Mining of Association Rules from Huge Databases”, International Journal of Computer Science and Research, Vol. 1, No. 1, pp. 30-34, 2010
[4] Haiping Lu and Anastasios N. Venetsanopoulos, “MPCA: Multilinear Principal Component Analysis of Tensor Objects”, In IEEE Transactions on Neural Networks, Vol. 19, No. 1, January 2008
[5] Jian J. Dai, Linh Lieu and David Rocke, “Dimension Reduction for Classification with Gene Expression Microarray Data”, Statistical Applications in Genetics and Molecular Biology, Vol. 5, No. 1, Article 6, 2006
[6] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann, San Francisco, 2000
[7] Bigus, “Data Mining with Neural Networks”, McGraw-Hill, 1996
[8] Klaus Julisch, “Data Mining for Intrusion Detection - A Critical Review”, In Proceedings of the IBM Research on application of Data Mining in Computer security, Chapter 1, 2002
[9] Hewen Tang, Wei Fang and Yongsheng Cao, “A simple method of classification with VCL components”, In Proceedings of the 21st international CODATA Conference, 2008
[10] Umarani and Punithavalli, “A Study on Effective Mining of Association Rules From Huge Databases”, International Journal of Computer Science and Research, Vol. 1, No. 1, pp. 30-34, 2010
[11] Yendrapalli, Basnet, Mukkamala and Sung, “Gene Selection for Tumor Classification Using Microarray Gene Expression Data”, In Proceedings of the World Congress on Engineering, London, U.K., Vol. 1, 2007
[12] Sandrine Dudoit, Jane Fridlyand and Terence P. Speed, “Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data”, Journal of the American Statistical Association, Vol. 97, pp. 77-87, 2002
[13] Peterson and Ringner, “Analyzing Tumor Gene Expression Profiles”, Artificial Intelligence in Medicine, Vol. 28, No. 1, pp. 59-74, 2003
[14] Anandhavalli Gauthaman, “Analysis of DNA Microarray Data using Association Rules: A Selective Study”, World Academy of Science, Engineering and Technology, Vol. 42, pp. 12-16, 2008
[15] Chintanu K. Sarmah, Sandhya Samarasinghe, Don Kulasiri and Daniel Catchpoole, “A Simple Affymetrix Ratio-transformation Method Yields Comparable Expression Level Quantifications with cDNA Data”, World Academy of Science, Engineering and Technology, Vol. 61, pp. 78-83, 2010
[16] Khlopova, Glazko and Glazko, “Differentiation of Gene Expression Profiles Data for Liver and Kidney of Pigs”, World Academy of Science, Engineering and Technology, Vol. 55, pp. 267-270, 2009
[17] Ahmad M. Sarhan, “Cancer classification based on microarray gene expression data using DCT and ANN”, Journal of Theoretical and Applied Information Technology, Vol. 6, No. 2, pp. 207-216, 2009
[18] Ying Xu, Victor Olman and Dong Xu, “Minimum Spanning Trees for Gene Expression Data Clustering”, Genome Informatics, Vol. 12, pp. 24–33, 2001
[19] Lucila Ohno-Machado, Staal Vinterbo and Griffin Weber, “Classification of Gene Expression Data Using Fuzzy Logic”, Journal of Intelligent & Fuzzy Systems, Vol. 12, No. 1, pp. 19-24, January 2002
[20] Li-Yeh Chuang, Cheng-Hong Yang and Li-Cheng Jin, “Classification Of Multiple Cancer Types Using Fuzzy Support Vector Machines And Outlier Detection Methods”, Biomedical Engineering Applications, Basis and Communications, Vol. 17, No. 6, pp. 300-308, December 2005
[21] Edmundo Bonilla Huerta, Beatrice Duval and Jin-Kao Hao, “A hybrid GA/SVM approach for gene selection and classification of micro array data”, In Lecture Notes in Computer Science, pp. 34-44, Springer, 2006
[22] Hieu Trung Huynh, Jung-Ja Kim and Yonggwan Won, “Classification Study on DNA Micro array with Feed forward Neural Network Trained by Singular Value Decomposition”, International Journal of Bio-Science and Bio-Technology, Vol. 1, No. 1, pp. 17-24, December 2009
[23] Pradipta Maji and Sankar K. Pal, “Fuzzy–Rough Sets for Information Measures and Selection of Relevant Genes from Micro array Data”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 40, No. 3, pp. 741-752, June 2010
[24] Venkatesh and Thangaraj, “Investigation of Micro Array Gene Expression Using Linear Vector Quantization for Cancer”, International Journal on Computer Science and Engineering, Vol. 02, No. 06, pp. 2114-2116, 2010
[25] Brahmadesam Krishna and Baskaran Kaliaperumal, “Efficient Genetic-Wrapper Algorithm Based Data Mining for Feature Subset Selection in a Power Quality Pattern Recognition Application”, The International Arab Journal of Information Technology, Vol. 8, No. 4, pp. 397-405, October 2011
[26] Kamel Faraoun and Aoued Boukelif, “Genetic Programming Approach for Multi Category Pattern Classification Applied to Network Intrusions Detection”, The International Arab Journal of Information Technology, Vol. 4, No. 3, pp. 237-246, July 2007
[27] Mohammed El-Telbany, Mahmoud Warda and Mahmoud El-Borahy, “Mining the Classification Rules For Egyptian Rice Diseases”, The International Arab Journal of Information Technology, Vol. 3, No. 4, pp. 303-307, October 2006
[28] Tamilselvi, M. and G.M.Kadhar Nawaz, “Classification of Micro Array Gene Expression Data using Statistical Analysis Approach with Personalized Fuzzy Inference System”, International Journal of Computer Applications, Vol. 31, No. 1, pp. 5-12, October 2011
[29] ALL/AML datasets from http://www.broadinstitute.org/cancer/
BIOGRAPHY
Tamilselvi Madeswaran received the B.Sc. and MCA degrees from the Department of Computer Science affiliated to Madras University in 1995 and 1998, respectively. She received her Master of Philosophy degree from the Department of Computer Science, Alagappa University, in 2006. She is currently pursuing a Ph.D. degree, working closely with Prof. G.M.Kadhar Nawaz. She worked as Lecturer and Senior Lecturer in the Sona College of Technology from 2000 to 2008. She served in the TULEC Computer Education Center as Programmer and acted as Guest Lecturer in the Government College, Salem, India. She has also worked as Academic Counsellor and University Coordinator in the Indira Gandhi Open University and TamilNadu Open University. She is also a member of the Computer Society of India. She works in the field of Fuzzy Logic, Semantic Web and Networks. She is the author of more than 9 publications in national and international conferences. She received an Appreciation Award for producing the best results in the University Exam. She has guided 50 MCA projects.
Dr.G.M.Kadhar Nawaz received the B.Sc., MCA, and Ph.D. degrees from the Department of Computer Science. He worked as Assistant Professor in the KSR College of Technology from 1997 to 2005 and as Assistant Professor in the Sona College of Technology from 2005 to 2007. Since January 2008, he has been with the Department of Computer Application, Sona College of Technology, as Director & Professor. He works in the field of Image Processing, Secure Communication and Network Security. The department received the SCT Best Department award under his supervision. He is also a member of professional bodies such as ISTE and FUWAI, and a member of the board of studies of several universities, including Periyar University; Indira Gandhi Open University, New Delhi; Bharathiar University, Coimbatore; Alagappa University, Karaikudi; and Anna University, Chennai. He is author or co-author of more than 12 publications in national and international conferences and journals. Presently he is guiding fifteen PhD scholars.