Feature Selection for Classification in Medical Data Mining

Prof. K. Rajeswari 1, Dr. V. Vaithiyanathan 2 and Shailaja V. Pede 3

1 Associate Professor, PCCOE, Pune University, and Ph.D. Research Scholar, SASTRA University, Tanjore, India
2 Associate Dean, Research, School of Computing and CTS Chair Professor, SASTRA University, Tanjore, India
3 Student (ME Computer), PCCOE, Pune University, Pune, India

Abstract: Feature selection is an important step in classification and in dimensionality reduction. Because medical data carry many attributes, medical data mining differs from data mining in other domains. Diagnosis of most diseases is expensive, since many tests are required to predict the disease. Data mining techniques can reduce the cost of diagnosis by avoiding unnecessary tests and selecting only those attributes that are truly important for disease prediction. Dimensionality reduction therefore plays an important role in medicine, where datasets contain multiple attributes. In this paper we analyze feature selection for classification and present a novel approach to feature selection based on association rules and correlation. The aim is to select the correlated features or attributes of a medical dataset so that the patient need not undergo many tests; in future this work will feed a clinical decision support system that supports cheaper, more informed disease prediction. We also show that after removal of some attributes the accuracy of the classifiers improves, which supports our argument that disease can be predicted more cheaply by avoiding unneeded tests. The accuracy of the classifiers after attribute removal is discussed in this paper.

Keywords: Feature selection, Dimensionality reduction, Association rules, Correlation, Medical data mining, Classifiers.

1. INTRODUCTION

In this paper we consider three datasets for different diseases: heart disease, breast cancer and diabetes. All of these datasets were downloaded from the standard benchmark UCI repository. The probability of heart attack and diabetes is now high because of hypertension and today's lifestyle, and many tests are needed to diagnose these diseases.

Table 1: Cleveland Heart Disease Dataset

Sr. No.  Attribute  Description
1        Age        Values 1_30, 30_44, >45
2        Sex        Value 1: male, 0: female
3        Cp         1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic
4        Trestbps   0_120, 121_150, 150_max mm Hg on admission to the hospital
5        Chol       0_200, 200_239, 240_max
6        Fbs        1: >120 mg/dl; 2: <120 mg/dl
7        Restecg    0: normal; 1: having ST-T abnormality; 2: showing probable left ventricular hypertrophy
8        Thalach    0_120, 121_150, 150_max
9        Exang      Value 1: yes; 0: no
10       Oldpeak    0_1.5, 1.5_4.5, 4.5_max
11       Slope      Value 1: upsloping; 2: flat; 3: downsloping
12       Ca         Number of major vessels colored by fluoroscopy (value 0-3)
13       Thal       3: normal; 6: fixed defect; 7: reversible defect
14       Num        Class attribute (presence of heart disease)

By using data mining techniques to reduce the number of attributes, selecting only the features that are important for a disease from among all available attributes, the accuracy of a classifier may be improved. Since heart disease and heart attack are among the major diseases today, we considered the Cleveland heart disease dataset, which contains 14 attributes including the class attribute and 303 instances. The dataset was initially discretized based on some assumptions and on a study of related papers. The detailed description of the dataset is given in Table 1.

Breast cancer is another problem observed in females. This dataset contains 10 attributes in total, including the class attribute, and 286 instances. The diabetes diagnosis dataset contains 9 attributes including the class attribute and 768 instances, where the last attribute indicates whether the patient is sick or healthy.

These datasets were applied to different classifiers such as a neural network, a decision tree and an SVM, and the accuracy was checked on the actual dataset and on the selected attributes. It is observed that the accuracy with selected attributes is higher than the accuracy on the full attribute set. We also applied the genetic search attribute selection method for feature selection and tested the accuracy of the different classifiers. A further idea in this approach is to use the Apriori rule generation algorithm together with the correlation of attributes to find closely related attributes. This algorithm is explained in the methodology section.
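To make this comparison concrete, the following Python sketch (using scikit-learn rather than the WEKA workflow actually used in the paper, purely for illustration) evaluates a neural network, a decision tree and an SVM on a discretized dataset with all attributes and with a reduced attribute subset. The file name, column names and the selected subset are hypothetical placeholders, not values taken from the paper.

```python
# Illustrative comparison of three classifiers with and without attribute
# reduction; "cleveland_discretized.csv", the column names and the `selected`
# list are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

data = pd.read_csv("cleveland_discretized.csv")   # preprocessed, discretized data
X_full = data.drop(columns=["num"])               # 13 predictive attributes
y = data["num"]                                   # class attribute

selected = ["cp", "chol", "fbs", "thalach", "exang", "slope", "ca", "thal"]  # example subset
X_reduced = X_full[selected]

classifiers = {
    "Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
}

for name, clf in classifiers.items():
    acc_full = cross_val_score(clf, X_full, y, cv=10).mean()
    acc_reduced = cross_val_score(clf, X_reduced, y, cv=10).mean()
    print(f"{name}: all attributes {acc_full:.3f}, selected attributes {acc_reduced:.3f}")
```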

The main research objective is to find the set of attributes that are really important for disease prediction and to improve the performance of the classifiers. This work will further be extended to build a cost-effective clinical decision support system that predicts disease more accurately.

2. RELATED WORK

Healthcare professionals store significant amounts of patient data that could be used to extract useful knowledge. Researchers have been investigating the use of statistical analysis and data mining techniques to help healthcare professionals in the diagnosis of heart disease. Statistical analysis has identified the risk factors associated with heart disease as age, blood pressure, smoking habit [1], total cholesterol [2], diabetes [3], hypertension, family history of heart disease [4], obesity, and lack of physical activity [5]. Knowledge of these risk factors helps health care professionals to identify patients at high risk of heart disease. Researchers have applied different data mining techniques such as decision trees, naive Bayes, neural networks, bagging, kernel density estimation, and support vector machines to different heart disease datasets to help health care professionals in the diagnosis of heart disease [6]-[11]. The work in [12] reported a correctly classified accuracy of approximately 77% with logistic regression. Another model, R-C4.5, which is based on C4.5, shows better results; the rules created by R-C4.5 can give health care experts clear and useful explanations [13].

The CLASSIT conceptual clustering system [14] achieved 78.9% accuracy on the Cleveland database. In [15], a neural network based method obtained 89.01% classification accuracy in experiments on data taken from the Cleveland heart disease database. ANFIS classification [16] with PCA was used to classify the diabetes disease dataset through training and testing on the whole dataset; the obtained test classification accuracy was 89.47% using 10-fold cross validation. Up to now, several studies have been reported that focus on cardiovascular disease diagnosis. The overall literature survey shows that when these kinds of approaches are applied in medical data mining, the average classification accuracy achieved is about 77%. The most common models applied for classification, such as neural networks, naive Bayes classifiers, association rules and decision trees, are generally considered easy to understand and implement. But when the size of the data is too large, problems arise in developing the model, generating rules and constructing trees. So there is a need to identify such rules or to reduce dimensionality by selecting important features. Many successful applications based on association rule mining algorithms produce very large numbers of rules. In most decision support systems the accuracy of the classifier is measured on the basis of all attributes, but instead of using all attributes we can use only closely related attributes, which solves the dimensionality problem in medical data mining and also makes the system cost effective.

3. METHODOLOGY

Feature Selection (FS), a preprocessing technique, is used to identify the significant attributes that play a dominant role in the classification task. This leads to dimensionality reduction. Features can be reduced by applying different approaches. The reduced feature set improves the accuracy of the classification task compared with applying classification to the original dataset. The overall procedure includes the following steps, as shown in Figure 1.

1. Preprocessing of the data, which may be in any format.
2. Selection of attributes using feature selection for dimensionality reduction.
3. The dataset with the reduced set of attributes is given as input to the classifier.
4. Allocation of the class.

Let S be the system: S = {A, B, C}, where

A = set of input attributes of the medical dataset
B = feature selection approach
C = set of selected attributes or set of removed attributes

B indicates the feature selection approach, which may be the brute force approach, an existing feature selection technique, or the novel approach using correlated and associated rules.

Figure 1: Data Flow Diagram for Feature Selection
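The four steps above can be expressed as a single processing pipeline. The sketch below is an illustrative scikit-learn rendering of the data flow in Figure 1, with KBinsDiscretizer and SelectKBest(chi2) standing in for the preprocessing and feature selection step B; these components are assumptions, not the exact procedure used in the paper.

```python
# Illustrative scikit-learn pipeline mirroring the S = {A, B, C} data flow of
# Figure 1. KBinsDiscretizer and SelectKBest(chi2) are assumed stand-ins for
# the paper's discretization and feature selection step B.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

fs_pipeline = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),  # step 1: preprocessing
    ("select", SelectKBest(chi2, k=8)),       # step 2: feature selection (B), keeps 8 attributes (C)
    ("classify", DecisionTreeClassifier()),   # steps 3-4: classify on the reduced set and allocate the class
])

# Typical use (X and y are the attribute matrix and class labels):
#     fs_pipeline.fit(X_train, y_train)
#     predictions = fs_pipeline.predict(X_test)
```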

3.1 Brute Force Approach

To understand the importance of feature selection in medical data mining, we started our work with a brute force approach. Three discretized datasets (Cleveland heart disease, breast cancer and diabetes) were downloaded from the UCI repository. By trying all possible combinations of the N available features, up to 2^N candidate subsets can be formed; subsets were created by removing attributes from the actual dataset, and the accuracy of the classifier was measured on each. It was observed that the accuracy of the classifier can be better than on the full attribute set. This process finds the subsets of attributes that really affect the performance of the classifier. The major problems with this approach are that it usually takes too many trials and that there is a danger of overfitting. From this work we conclude that removal of certain attributes from the dataset improves the accuracy of the classifiers, and we further extend it to a novel approach for feature selection using correlation and association.

Given a set of N different attributes A = {a1, a2, a3, ..., aN}, 2^N subsets can be created covering all possible combinations, such as removal of one attribute, removal of two attributes, and so on.

In the first iteration of this approach, each single attribute is removed in turn and the accuracy is tested; it was observed that the accuracy of all classifiers improved after removal of a specific attribute, for all three datasets. In the second iteration, two attributes were removed, taking all possible combinations of two attributes, and the accuracy was measured again. In this way we took the maximum possible combinations of attributes and tested the accuracy of all classifiers on the different datasets. After taking all possible combinations obtained by removing one attribute, two attributes, and so on up to k attributes (k < N), each subset was given as input to the classifiers, and it was observed that the accuracy of the classifiers was better even with fewer attributes. These results are summarized in the results and discussion section. From these results we concluded that feature selection is the best way to solve the dimensionality problem in medical data mining without degrading classifier performance, and it also points toward the clinical decision support system.
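A simplified sketch of this exhaustive search is shown below. It enumerates the subsets obtained by removing up to a fixed number of attributes, scores each remaining subset with 10-fold cross-validation, and keeps the best one; the SVM scorer and pandas DataFrame input are illustrative assumptions, not the authors' exact WEKA setup.

```python
# Simplified brute force subset search: remove up to `max_removed` attributes,
# score each remaining subset with 10-fold cross-validation, keep the best.
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def brute_force_select(X, y, max_removed=3, cv=10):
    """Return (best_accuracy, kept_columns) for X given as a pandas DataFrame."""
    columns = list(X.columns)
    best_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=cv).mean()  # baseline: all attributes
    best_cols = columns
    for r in range(1, max_removed + 1):
        for removed in combinations(columns, r):
            kept = [c for c in columns if c not in removed]
            acc = cross_val_score(SVC(kernel="linear"), X[kept], y, cv=cv).mean()
            if acc > best_acc:
                best_acc, best_cols = acc, kept
    return best_acc, best_cols
```

The number of candidate subsets grows as 2^N, so an exhaustive search of this kind is only practical for small attribute counts such as the 13 predictive attributes here, and the risk of overfitting noted above still applies.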

We performed many experiments with the data mining tool WEKA 3.7 using the existing attribute selection algorithms, and then classified the data with the selected attributes only.

3.2 Correlation Based Associated Feature Selection

Association means finding interesting and closely related relationships in a large dataset. It was initially applied to market basket data to find, among all existing sales, the products that frequently occur together, which in turn helps in selecting products for selling. The development of association analysis can be traced back to the AIS algorithm [11] proposed by R. Agrawal in 1993. The AIS algorithm does not exploit the properties of frequent itemsets, so it unnecessarily generates and counts too many candidate itemsets. Subsequently, R. Agrawal and R. Srikant introduced the frequent itemset property and proposed the Apriori algorithm [12], which is a main technique widely used in commerce at present. Following the applications of Apriori, many researchers were attracted to research on mining association rules.

3.2.1 Related Terms

1. Support

Let IS = {i1, i2, ..., im} be the set of all items, and TR = {t1, t2, ..., tn} be the set of transactions. Each instance ti contains a subset of items selected from IS. In association analysis, a collection of one or more items is known as an itemset; if an itemset contains k items, it is called a k-itemset. An important property of an itemset is its support, which refers to the number of instances that contain that itemset.

Mathematically, the support of an itemset X is

support(X) = count(X) / |TR|

where count(X) is the number of transactions that contain X and |TR| = n is the total number of transactions.

2. Confidence

An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets. A rule is generated on the basis of support and confidence: support defines the applicability of a rule to the given dataset, and confidence measures how frequently the items in Y appear in transactions that contain X.

Mathematically, the confidence is

C(X → Y) = support(X ∪ Y) / support(X)

3. Correlation

Correlation shows how closely two itemsets are related to each other, which can be used for the generation of association rules. It expresses the dependence of two itemsets. If P(X ∪ Y) = P(X)P(Y), then the two itemsets X and Y are independent; otherwise X and Y are considered correlated.

Corr(X, Y) = P(X ∪ Y) / (P(X)P(Y))

If Corr(X, Y) is greater than 1, those attributes are closely related to each other; otherwise X and Y are not correlated, i.e. they are independent attributes.

Using these terms together with an association rule generation algorithm, the features can be selected.
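The following minimal Python sketch illustrates the three measures on a toy list of transactions; the attribute=value items are invented for illustration and do not come from the paper's datasets.

```python
# Toy illustration of support, confidence and correlation (Corr) over a list
# of transactions; the attribute=value items below are invented examples.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule X -> Y: support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def correlation(x, y, transactions):
    """Corr(X, Y) = P(X u Y) / (P(X) P(Y)); values > 1 indicate correlation."""
    return support(set(x) | set(y), transactions) / (
        support(x, transactions) * support(y, transactions))

transactions = [
    {"chol=high", "fbs=high", "exang=yes"},
    {"chol=high", "fbs=high"},
    {"chol=normal", "exang=no"},
    {"chol=high", "exang=yes"},
]
print(confidence({"chol=high"}, {"exang=yes"}, transactions))   # 0.667
print(correlation({"chol=high"}, {"exang=yes"}, transactions))  # 1.33 > 1, so correlated
```

Attribute pairs whose Corr value exceeds 1 (and whose rules meet minimum support and confidence thresholds) would then be kept as the selected, closely related features.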

4. RESULTS AND DISCUSSION

This paper presents an analysis of various data mining techniques that can help medical analysts or practitioners achieve accurate diagnosis by selecting closely associated features. The main methodology of our work was to examine the standard datasets downloaded from UCI and run them through WEKA 3.7 for the first two approaches.

4.1 Classification of the datasets using different classifiers

We used the widely popular data mining tool WEKA 3.7.7 to check the accuracy of the classifiers on the different datasets. The data are first preprocessed; during preprocessing, discretization is performed, and the data are then classified using a multilayer perceptron, the J48 decision tree and a support vector machine (the SMO classifier). The following table shows the results of the classifiers on all three datasets, considering all attributes.

Table 2: Comparison of different classifiers on the actual datasets

Dataset          Neural Network   Decision Tree   SVM
Breast Cancer    64.69%           75.52%          69.58%
Heart Cleveland  80.53%           77.89%          84.16%
Diabetes         75.13%           73.83%          77.34%

For analyzing the feature selection approach for classifiers in disease prediction using the brute force approach, a detailed analysis is given in the following table for the Cleveland heart disease dataset; it is observed that the accuracy of the classifiers is better after removal of attributes. For the heart disease dataset, with the neural network the accuracy is highest after removal of 5 attributes, which indicates the attributes important for disease prediction: with the subset R = {2, 5, 6, 8, 9, 11, 12, 13} the accuracy rises to 84.49%. With the decision tree the selected feature subset is R = {1, 3, 9, 11, 12, 13} with accuracy 84.16%, and with the SVM the accuracy is highest at 87.46% for the subset R = {2, 3, 6, 9, 10, 11, 12, 13} on the same dataset.

4.2 Brute force approach of feature selection using the Heart Disease dataset

Table 3 shows a detailed analysis of the brute force approach on the Cleveland heart disease dataset. From this we analyzed the accuracy of the different classifiers after removal of one, two, three and more attributes; the accuracy is higher than that obtained on the dataset with all of the actual attributes.

Table 3: Accuracy of Cleveland Heart Disease dataset using classifiers

Classifier       Removed attributes   Accuracy (correctly classified)
Neural Network   None                 80.53%
Neural Network   1                    82.51%
Neural Network   1 4                  82.80%
Neural Network   1 4 6                84.16%
Neural Network   5 6 7 10             83.17%
Neural Network   1 3 4 7 10           84.49%
Neural Network   1 4 6 9 10 11        83.17%
Decision Tree    None                 77.89%
Decision Tree    13                   79.54%
Decision Tree    10 13                81.85%
Decision Tree    5 10 13              82.84%
Decision Tree    5 10 11 13           83.83%
Decision Tree    5 6 10 11 13         83.83%
Decision Tree    1 2 4 6 7 9          84.16%
Decision Tree    2 4 5 6 7 8 10       84.16%
SVM              None                 84.16%
SVM              5                    85.15%
SVM              5 8                  86.80%
SVM              5 7 10               86.80%
SVM              1 4 5 8              87.46%
SVM              1 4 5 7 8            87.13%
SVM              1 5 6 7 8 10         86.80%

4.3 Genetic search feature selection method for the Heart Disease dataset

The genetic search attribute selection method selects 8 different features, but the accuracy of the classifiers is lower compared to the brute force approach.

Table 4: Accuracy of different classifiers using genetic search feature selection

Selected attributes: 3, 5, 8, 9, 10, 11, 12, 13

Classifier   Neural Network   Decision Tree   SVM
Accuracy     77.23%           82.17%          85.81%
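For readers who want to reproduce a genetic search outside WEKA, the sketch below implements a minimal genetic algorithm over attribute bit vectors scored by cross-validated decision tree accuracy. It is an illustrative stand-in, not WEKA's GeneticSearch (which is typically paired with a CFS-style subset evaluator), and the parameters are arbitrary assumptions.

```python
# Minimal genetic-algorithm sketch for attribute subset search (illustrative
# only). Individuals are bit vectors over attributes; fitness is 10-fold
# cross-validated decision tree accuracy on the selected columns.
import random
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def fitness(mask, X, y, cv=10):
    cols = [c for c, keep in zip(X.columns, mask) if keep]
    if not cols:
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0), X[cols], y, cv=cv).mean()

def genetic_search(X, y, pop_size=20, generations=15, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    n = X.shape[1]
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(m, X, y), reverse=True)
        parents = ranked[: pop_size // 2]                 # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)                     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]  # bit-flip mutation
            children.append(child)
        population = parents + children
    best = max(population, key=lambda m: fitness(m, X, y))
    return [c for c, keep in zip(X.columns, best) if keep]   # names of the selected attributes
```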

So far we have discussed results only for the first two approaches, i.e. the brute force approach with a classifier for feature selection and the genetic search method of attribute selection. Results of the correlated and associated feature selection are yet to be calculated.

5. CONCLUSION AND FUTURE WORK

By removing irrelevant attributes, the performance of classification can be improved and the cost of classification may be reduced. In this work, two different approaches for feature selection were analyzed, specifically for medical datasets. This work also presents a novel approach for feature selection using correlation and generated association rules. We conclude that feature selection is really helpful for dimensionality reduction and for building a cost-effective model for disease prediction. Our further work is to implement the novel approach to feature selection using association rule mining and, in the future, to develop a cost-effective clinical decision support system for more accurate disease diagnosis.



References

[1] M. C. Tu, D. Shin, and D. Shin, "Effective Diagnosis of Heart Disease through Bagging Approach", Biomedical Engineering and Informatics, IEEE, 2009.

[2] M. Benkaci, B. Jammes, and A. Doncescu, "Feature Selection for Medical Diagnosis Using Fuzzy Artmap Classification and Intersection Conflict", 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 790-795, 2010.
[3] M. R. F. Heller, "How well can we predict coronary heart disease? Findings in the United Kingdom Heart Disease Prevention Project", British Medical Journal, 1984.

[4] P. W. F. Wilson, "Prediction of Coronary Heart Disease Using Risk Factor Categories", American Heart Association Journal, 1998.

[5] H. Wang, T. M. Khoshgoftaar, and J. Van Hulse, "A Comparative Study of Threshold-Based Feature Selection Techniques", 2010 IEEE International Conference on Granular Computing, pp. 499-504, 2010.

[6] L. A. Simons et al., "Risk functions for prediction of cardiovascular disease in elderly Australians: the Dubbo Study", Medical Journal of Australia, 2003.
[7] Salahuddin and F. Rabbi, "Statistical Analysis of Risk Factors for Cardiovascular Disease in Malakand Division", Pak. J. Stat. Oper. Res., pp. 49-56, 2006.
[8] L. Shahwan-Akl, "Cardiovascular Disease Risk Factors among Adult Australian-Lebanese in Melbourne", International Journal of Research in Nursing, 2010.

[9] J. Xie, J. Wu, and Q. Qian, "Feature Selection Algorithm Based on Association Rules Mining Method", 2009 Eighth IEEE/ACIS International Conference on Computer and Information Science, pp. 357-362, 2009.

[10] P. Andreeva, "Data Modelling and Specific Rule Generation via Data Mining Techniques", International Conference on Computer Systems and Technologies (CompSysTech), 2006.

[11] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207-216, Washington D.C., May 1993.

[12] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", VLDB, Sep 12-15, 1994, Chile, pp. 487-499.

[13] C. M. Kuok, A. Fu, and M. H. Wong, "Mining Fuzzy Association Rules in Databases", ACM SIGMOD Record, Volume 27, Number 1, March 1998, pp. 41-46.

[14] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 1-12, May 15-18, 2000, Dallas, Texas, United States.

[15] V. A. Sitar-Taut, "Using machine learning algorithms in cardiovascular disease risk evaluation", Journal of Applied Computer Science and Mathematics, 2009.

[16] K. Srinivas, B. K. Rani, and A. Govrdhan, "Applications of Data Mining Techniques in Healthcare and Prediction of Heart Attacks", International Journal on Computer Science and Engineering (IJCSE), vol. 2, no. 2, pp. 250-255, 2010.

[17] M. C. Tu, D. Shin, and D. Shin, "Effective Diagnosis of Heart Disease through Bagging Approach", Biomedical Engineering and Informatics, IEEE, 2009.

[18] H. Yan, "Development of a decision support system for heart disease diagnosis using multilayer perceptron", in Proc. of the 2003 International Symposium on, vol. 5, pp. V-709-V-712.

[19] R. Detrano, W. Steinbrunn, M. Pfisterer, et al., "International application of a new probability algorithm for the diagnosis of coronary artery disease", American Journal of Cardiology, Vol. 64, No. 3, pp. 304-310.

[20] Z. Yao, L. Lei, and J. Yin, "R-C4.5 Decision tree model and its applications to health care dataset", Proceedings of the International Conference on Services Systems and Services Management, 2005, pp. 1099-1103.

[21] D. West, P. Mangiameli, R. Rampal, and V. West, "Ensemble strategies for a medical diagnostic decision support system: A breast cancer diagnosis application", European Journal of Operational Research, 162 (2005), pp. 532-551.
