ISSN: 2005-4238 IJAST 309
Copyright ⓒ 2019 SERSC
An Efficient Feature Selection Based Heart Disease Prediction Model
Pulugu Dileep¹, Kunjam Nageswara Rao², Prajna Bodapati³
¹Research Scholar, ²,³Professor, Department of CS & SE, AUCE(A) at Andhra University, Visakhapatnam, Andhra Pradesh.
Abstract
Heart disease is one of the major health concerns of humans. It has caused thousands of premature deaths. There are different kinds of heart disease, each with its own symptoms, and they are preventable or even curable if detected early. Early detection is therefore essential. Fortunately, the health data of a person is sufficient to estimate the probability of heart disease accurately. This has motivated many researchers to investigate data-driven approaches. Machine learning techniques, which are part of Artificial Intelligence (AI), play a key role in the prediction of heart disease. Existing research has shown their utility in garnering Business Intelligence (BI) for making expert decisions. However, there is scope for further research on feature selection and on improving the performance of detection mechanisms. In this paper a novel feature selection algorithm named Entropy and Gain-based Feature Selection (EGFS) is proposed. The hypothesis “feature selection improves performance of heart disease prediction models” is evaluated by applying EGFS with state-of-the-art machine learning methods, namely k-Nearest Neighbour (k-NN), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF) and Support Vector Machines (SVM). These methods are used to form heart disease prediction models. The empirical study revealed that the performance and effectiveness of the prediction models are improved with EGFS.
Keywords – Heart disease prediction, supervised learning, feature selection, k-Nearest Neighbour, Decision Tree, Naïve Bayes, Random Forest.
1. Introduction
Heart disease prediction models based on machine learning techniques have an important utility in the modern Decision Support Systems (DSS) of healthcare units. Such techniques supply the intelligence required for heart disease diagnosis. The approach is relatively simple and effective because it forms a data-driven solution that is easier to develop and use [4]. Supervised machine learning methods are widely used for heart disease prediction. They are classification algorithms such as k-NN, SVM, DT, NB and RF, to mention a few. These techniques need training data in order to predict class labels effectively. Increasing the amount of training data helps improve the quality and accuracy of prediction models. The data mining domain is rich in existing algorithms. When it comes to classification, it is essential that the quality of the training data is good [2], [3]. If the training dataset is poor, the performance of classifiers deteriorates. The cause is noise in the data in the form of redundant and irrelevant features. Therefore, it is essential to have mechanisms that measure the relevance of a feature to the class label.
Many prediction models, including optimized ones, are found in the literature. In [2] a hybrid prediction model is defined that combines classification with feature selection methods. Such a system is found to be more accurate besides reducing time and space complexity. Particle Swarm Optimization (PSO) is used for optimization of prediction models in [3], [14], while Genetic Algorithm (GA) is used for improving prediction performance in [17]. With respect to heart disease prediction models, model accuracy and error dynamics are explored with different metrics in [12], [22], [23]. Different feature selection methods are employed in [13], [14], [15], [16]. Two important observations can be made about the existing prediction models. First, they are data-driven and based on machine learning algorithms. Second, their performance is enhanced with feature selection approaches. Further study of feature selection revealed that there is room for improving the feature selection process so as to boost the performance of prediction models. Feature selection improves accuracy, decreases prediction errors and reduces time and space complexity.
In this paper we propose a heart disease prediction framework known as Data Driven Heart Disease Prediction System (DD-HDPS). This framework is realized with classification techniques along with the proposed feature selection method named Entropy and Gain-based Feature Selection (EGFS). This feature selection method finds features that have sufficient relevance to the class label. EGFS computes a score that shows how relevant a feature is to the objective at hand. A benchmark dataset collected from the UCI repository is used for the experiments. The results revealed that the prediction models perform better with feature selection. Evaluations made with EGFS show increased prediction accuracy and speed. Our contributions in this paper are as follows.
1) Heart disease prediction framework named DD-HDPS is proposed.
2) The framework is realized with many machine learning algorithms like k-NN, NB, RF, DT and SVM.
3) The prediction models are further enhanced with the proposal of a feature selection method known as EGFS.
4) A prototype application is built to show the effectiveness of the proposed feature selection mechanism on the machine learning based prediction models.
The rest of the paper is organized as follows. Section 2 reviews the literature on the machine learning methods and feature selection algorithms used for heart disease prediction models. Section 3 presents the proposed methodology including the framework and algorithm. Section 4 describes the dataset. Section 5 presents experimental results with evaluation of the hypothesis. Section 6 provides conclusions and scope for future research.
2. Related work
This section reviews the literature on data-driven heart disease prediction approaches based on machine learning. Austin et al. [1] found that modern data mining methods can detect Heart Failure (HF). HF has a subtype named HF with Preserved Ejection Fraction (HFPEF), which can be identified with the Logistic Regression (LR) method. They also found that LR predicts better than Random Forests (RF), bagged trees and boosted trees. Haq et al. [2] proposed an intelligent framework made up of machine learning methods and feature selection methods. They found that feature selection methods affect classifiers such as SVM, Random Forest, Decision Trees and other machine learning classifiers. The rationale is that feature selection improves the quality of training. Their framework uses three feature selection methods, namely Relief, MRMR and LASSO. They employed cross validation to evaluate classifiers with feature selection approaches.
Khourdifi and Bahaj [3] employed Fast Correlation Based Feature Selection (FCBF) to eliminate redundant features and identify the most relevant ones. Afterwards, further optimization is carried out with Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO). The mixed approach is employed to evaluate the classifiers k-NN, SVM, RF, NB and MLP. Ramalingam et al. [4] explored different machine learning techniques for heart disease prediction, including NB, SVM, k-NN, DT, RF and an ensemble method. The ensemble method combines two or more classifiers with different characteristics to produce better results. Nashif et al. [5] focused on cardiovascular disease prediction using ML algorithms. They employed classifiers such as SVM, RF, NB, simple logistic and MLP. SVM was found to provide the best performance.
Seh and Chaurasia [6] studied classification algorithms with feature selection techniques. They found that performance improves with fewer features. They made a comparative study of J48, REPTree, Naïve Bayes, BayesNet and simple CART, and found J48 to have better performance. Ali et al. [7] studied different mining algorithms such as RF, SVM, NB and DT. According to their results, NB showed the best performance. Parthiban et al. [8] proposed a methodology to evaluate machine learning algorithms and tested different classification algorithms. Fatima and Pasha [9] focused on Computer Aided Diagnosis (CAD) and reviewed different techniques. They cover ML techniques such as supervised learning, semi-supervised learning, unsupervised learning, reinforcement learning, evolutionary learning and deep learning. These techniques are part of Artificial Intelligence (AI). They employed machine learning for diagnosing heart disease, diabetes, liver disease, dengue and hepatitis. For heart disease prediction, they employed BN, SVM, FT, NB, J48, GA+SVM and bagging.
Shouman et al. [10] proposed a research model to evaluate machine learning algorithms with individual and hybrid approaches. They found that hybrid approaches could improve performance. In [11] data mining techniques are used for intrusion detection or anomaly detection. Galdi and Tagliaferri [12] explored different error and accuracy measures in machine learning approaches. They defined the confusion matrix, sensitivity, specificity, accuracy and F-measure. Şen et al. [13] explored feature selection algorithms and classification techniques using EEG signals. They employed feature selection methods such as the fast correlation based filter, the MRMR algorithm, the Fisher Score (FS) algorithm, Relief and t-test algorithms. They also evaluated different classification algorithms such as DT, Feed Forward Neural Network (FFNN), Radial Basis Network (RBF), SVM and RF.
Xue et al. [14] followed a multi-objective approach to improve classification performance. They employed PSO for feature selection, which could improve the performance of classification algorithms. Ganapathy et al. [15] proposed an intelligent method with both classification and feature selection methods. Lee and Kim [16] used a multivariate mutual information approach for feature selection applied to classification techniques. Tsai et al. [17] employed the evolutionary algorithm known as Genetic Algorithm (GA) for feature selection. They proposed an experimental process that includes feature selection and instance selection in order to have different classifiers.
Anooj [18] proposed a Clinical Decision Support System (CDSS) based on weighted fuzzy rules. The rules are used to predict the risk level of patients with regard to heart disease. The system involves pre-processing, mining, selection of suitable attributes and then finding the risk level. Aldallal and Al-Moosa [19] proposed a machine learning based solution for detecting heart diseases and diabetes. Pencina et al. [21] explored different metrics used in research on model performance, including AUC, sensitivity, specificity and discrimination slope. Jiang et al. [22] focused on multi-objective optimization and performance metrics, examining their consistencies as well as contradictions. They covered capacity metrics, convergence metrics and convergence-diversity metrics. Wald et al. [23] emphasized the necessity of performance metrics, focusing on the measurement of wrapper feature selection approaches. From the literature it is found that there is a need for efficient feature selection in order to improve the performance of classifiers. In this paper we propose a novel feature selection method for this purpose.
3. Proposed Heart Disease Prediction Methodology
This section provides the problem statement, the proposed framework and the underlying feature selection mechanism for realizing a system for prediction of heart diseases.
3.1 Problem Definition
The problem to be addressed is the accurate prediction of heart disease given data about a person such as age, gender, chest pain type, resting blood pressure and so on. Existing data mining or machine learning techniques are widely used for this purpose. However, the quality of the training data has an impact on the performance of detection models. Understanding this dependency is crucial to formulating the problem, which can be boiled down to the statement “unless redundant features and irrelevant features are eliminated effectively, the performance of classifiers will deteriorate significantly or even become unusable in the prediction of heart diseases”. Solving this challenging problem addresses the issues related to prediction accuracy. There is room for further research towards a feature selection based data-driven heart disease prediction system.
3.2 Proposed Framework
In the light of the above problem, a framework is proposed to realize a data-driven heart disease prediction system. It is known as Data Driven Heart Disease Prediction System (DD-HDPS). It strives to be innovative in feature selection, feature scaling, classification and regression. After taking a dataset as input, the framework employs a procedure known as component splitting, which divides the dataset into an 80% training set and a 20% testing set. The training set is pre-processed: missing values are treated, and feature selection and feature scaling are applied. Then classification and regression algorithms are employed to generate an efficient heart disease prediction model. This model has knowledge gained from training and is thus capable of differentiating heart disease instances from other instances. In other words, it can label or classify unlabelled instances.
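The 80/20 component splitting step can be illustrated with a minimal sketch. The paper does not say whether rows are shuffled or how ties in rounding are handled, so the shuffling, the fixed seed and the function name `split_dataset` are assumptions made here for illustration only.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into train and test lists.

    Shuffling and the seed are assumptions; the paper only states the
    80% / 20% proportions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for 100 patient records (hypothetical data).
records = list(range(100))
train, test = split_dataset(records)
print(len(train), len(test))
```

Every record ends up in exactly one of the two sets, so the testing set stays unseen during training.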
Figure 1: Proposed data-driven heart disease prediction framework (DD-HDPS)
As presented in Figure 1, the success of the data-driven approach depends on the quality of the data given for prediction. Therefore, it is essential to pre-process the data and subject it to feature selection and feature scaling. The classification approach classifies instances into heart disease or no heart disease. Regression, in turn, helps in analysing subtle relationships between the binary prediction variable (dependent variable) and all other variables considered for heart disease prediction. The performance of the proposed method is evaluated using the evaluation methodology described in Section 3.4.
3.3 Feature Selection Algorithm
As shown in Algorithm 1, both entropy and gain values play an important role in the feature selection process. They are measures used to assess the utility of an attribute with respect to an objective. In this case, the objective is to select features that can contribute to heart disease diagnosis.
Algorithm: Entropy and Gain based Feature Selection
Inputs: Heart Disease Dataset D
Outputs: Selected Features FS
Initialize threshold for gain gt
Initialize threshold for entropy et
Initialize vector for selected features FS
Initialize a vector for features F
Initialize a vector to hold attributes A
Initialize the concept c
Initialize data structure T
Get attributes from D into A provided c
Extract Features that can Help in Diagnosis
For each attribute a in A
    Find entropy e
    Find gain g
    A weight is associated with a
    IF g > gt and e > et THEN
        Add attribute a to feature vector F
    END IF
End For
Construct a Data Structure
For each feature f in feature vector F
    Add feature f to the data structure T
End For
Extract Final Features
For each node in T
    IF feature is found useful for diagnosis THEN
        Update FS with the feature
    END IF
End For
Return FS
Algorithm 1: Entropy and gain based feature selection algorithm
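The thresholding loop of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: it assumes that "entropy" means the Shannon entropy of a discrete attribute and that "gain" means information gain with respect to the diagnosis label. The data structure T and the final "useful for diagnosis" test are not specified in the algorithm, so they are omitted. The toy columns, thresholds and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    counts = Counter(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(columns, labels, gain_threshold, entropy_threshold):
    """Keep features whose gain and entropy both exceed their thresholds."""
    selected = []
    for name, values in columns.items():
        e = entropy(values)
        g = information_gain(values, labels)
        if g > gain_threshold and e > entropy_threshold:
            selected.append(name)
    return selected


# Toy data made up for illustration; not taken from the paper's dataset.
columns = {
    "induced_angina": [1, 1, 0, 0, 1, 0, 0, 1],
    "constant_flag": [1, 1, 1, 1, 1, 1, 1, 1],
    "coin_flip": [0, 1, 0, 1, 0, 1, 0, 1],
}
diagnosis = [1, 1, 0, 0, 1, 0, 0, 1]

selected = select_features(columns, diagnosis,
                           gain_threshold=0.1, entropy_threshold=0.5)
print(selected)
```

In this toy case only `induced_angina` passes both thresholds: the constant column has zero entropy, and the alternating column carries no information about the label. Continuous attributes such as age would need to be discretized before this kind of entropy calculation is meaningful.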
3.4 Evaluation Methodology
Table 1 presents the confusion matrix that underlies measures such as precision, recall and F-measure. When a predicted positive is an actual positive, it is a True Positive (TP). When a predicted positive is actually negative, it is a False Positive (FP). When a predicted negative is actually positive, it is a False Negative (FN). When a predicted negative is actually negative, it is a True Negative (TN).
| | Ground truth: positive | Ground truth: negative |
|---|---|---|
| Result of algorithm: positive | True Positive (TP) | False Positive (FP) |
| Result of algorithm: negative | False Negative (FN) | True Negative (TN) |

Table 1: Confusion matrix used for evaluation
Based on the details given in Table 1, four performance metrics are considered: precision, recall, accuracy and F-measure. They are defined in Eq. 1, Eq. 2, Eq. 3 and Eq. 4.
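$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP} \tag{3}$$

$$\text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$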
Precision is also known as positive predictive value, while recall is also known as sensitivity. Specificity, the true negative rate TN / (TN + FP), is a separate measure. All of these are measures used to assess the performance of prediction models. The F-measure is the harmonic mean of precision and recall.
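Eq. 1 to Eq. 4 translate directly into code. The following sketch computes the four metrics from confusion matrix counts; the counts passed in are hypothetical and the function name is chosen here for illustration. It does not guard against empty denominators.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy and F-measure (Eq. 1 to Eq. 4)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts for 100 test instances.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the F-measure is a harmonic mean, it is pulled towards the lower of precision and recall, which makes it a stricter summary than their arithmetic mean.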
4. Dataset Details
The dataset consists of data on 920 individuals. There are 15 columns in the dataset; however, the first column (name) is not a useful parameter as far as machine learning is concerned, so there are effectively 14 columns.
| Sl. No. | Attribute | Description |
|---|---|---|
| 1 | Age | Age of the individual. |
| 2 | Sex | Gender of the individual: 1 = male, 0 = female. |
| 3 | Chest pain type | Type of chest pain experienced by the individual: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic. |
| 4 | Resting Blood Pressure | Resting blood pressure of the individual in mmHg. |
| 5 | Serum Cholesterol | Serum cholesterol in mg/dl. |
| 6 | Fasting Blood Sugar | Compares the fasting blood sugar of the individual with 120 mg/dl: 1 (true) if fasting blood sugar > 120 mg/dl, else 0 (false). |
| 7 | Resting ECG | 0 = normal, 1 = having ST-T wave abnormality, 2 = left ventricular hypertrophy. |
| 8 | Max heart rate achieved | Maximum heart rate achieved by the individual. |
| 9 | Exercise induced angina | 1 = yes, 0 = no. |
| 10 | ST depression induced by exercise relative to rest | Value as integer or float. |
| 11 | Peak exercise ST segment | 1 = upsloping, 2 = flat, 3 = downsloping. |
| 12 | Number of major vessels (0-3) coloured by fluoroscopy | Value as integer or float. |
| 13 | Thal | Thalassemia: 3 = normal, 6 = fixed defect, 7 = reversible defect. |
| 14 | Diagnosis of heart disease | Whether the individual is suffering from heart disease: 0 = absence, 1 = present. |

Table 2: Details of the dataset used
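The coded attributes in Table 2 can be kept in a small lookup structure so that records are readable during analysis. The sketch below is one possible arrangement; the dictionary name `CODES` and the helper `decode` are hypothetical, and the column names follow the dataset excerpt.

```python
# Code-to-label mappings taken from Table 2 (categorical attributes only).
CODES = {
    "sex": {1: "male", 0: "female"},
    "chest_pain": {
        1: "typical angina",
        2: "atypical angina",
        3: "non-anginal pain",
        4: "asymptomatic",
    },
    "slope": {1: "upsloping", 2: "flat", 3: "downsloping"},
    "thal": {3: "normal", 6: "fixed defect", 7: "reversible defect"},
    "diagnosis": {0: "absence", 1: "present"},
}


def decode(record):
    """Replace coded values with labels; leave other values untouched."""
    return {key: CODES.get(key, {}).get(value, value)
            for key, value in record.items()}


# One complete row from the dataset excerpt (elided columns left out).
row = {"age": 54, "sex": 0, "chest_pain": 3, "blood_pressure": 135,
       "slope": 1, "no_of_vessels": 0, "thal": 3, "diagnosis": 0}
decoded = decode(row)
print(decoded)
```

Numeric attributes such as age and blood pressure pass through unchanged because they have no entry in the mapping.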
age sex chest_pain blood_pressure … slope no_of_vessels thal diagnosis
54 1 4 125 … 1
55 1 4 158 … 2 1
54 0 3 135 … 1 0 3 0
48 0 3 120 … 0
50 1 4 120 … 1 6 1
64 0 4 130 … 2 2 3 0
63 1 4 130 … 2 1
58 1 2 130 … 0
42 1 2 150 … 0
Table 3: An excerpt from the dataset
5. Experimental Results
Experiments are made with the dataset described in the previous section. The results are observed from different aspects. Finally, the results are evaluated in terms of accuracy, specificity, sensitivity and F-score, with and without the feature selection algorithm applied to various machine learning algorithms for heart disease prediction.
5.1 Results and Analysis
Experiments are made with the proposed framework. Different prediction models are built with and without the proposed feature selection method. Prior to evaluating the performance of the framework, the data features are analysed in different ways and the results are provided here.
Figure 2: Diagnosis ratio between heart disease and no heart disease instances in training set
Figure 2 presents the distribution of the target value with respect to diagnosis. Instances with heart disease present account for 45.9%, while instances with heart disease absent account for 54.1%.
Figure 3: Statistics of numeric columns
For the given numeric columns, Figure 3 shows the statistics that are significant in the diagnosis of heart disease. Extreme values such as the minimum and maximum are possible in real clinical scenarios. The relationships among the numeric features are visualized in the form of a pairplot in Figure 4.
Figure 4: Pairplot of numeric columns
Figure 4 shows the relationships among the numeric features. There is no high correlation between pairs. However, a negative correlation is found between max_heart_rate and age, while a positive correlation is observed between blood_pressure and age. To understand this better, a correlation matrix is generated as shown in Figure 5.
Figure 5: Correlation heatmap for numeric columns
The correlation matrix is presented as a correlation heatmap for the numeric columns. The two relationships mentioned above are visible here. It also shows another important fact, the dependency between ST_depression and max_heart_rate. It is thus inferred that the features max_heart_rate and age play an important role in the prediction of heart disease.
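Each cell of such a correlation matrix is a pairwise correlation coefficient. The sketch below assumes the Pearson coefficient, which the paper does not name explicitly, and uses made-up values for the two columns purely to illustrate a negative relationship like the one reported between age and max_heart_rate.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values; not taken from the paper's dataset.
age = [42, 48, 50, 54, 58, 63, 64]
max_heart_rate = [178, 170, 165, 160, 150, 145, 140]

r = pearson(age, max_heart_rate)
print(f"correlation(age, max_heart_rate) = {r:.3f}")
```

A value near -1 indicates that one column falls as the other rises, a value near +1 indicates that they rise together, and a value near 0 indicates no linear relationship.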
Figure 6: Role of age and max_heart_rate on heart disease
Figure 6 analyses the role of max_heart_rate and age in predicting heart disease. It reveals that the age distribution of healthy patients is much wider than that of ill patients. People have the highest risk of heart disease in their sixties. The risk is even higher when max_heart_rate lies between 150 and 170. For healthy patients, higher values are found to be common. Yet another perspective is found in Figure 7.
Figure 7: Age and max_heart_rate in arriving at diagnosis
The max_heart_rate and age contribute to diagnosis of heart disease. How these two attributes are used to arrive at the diagnostic result is made visible in Figure 7.
Figure 8: Diagnosis with gender, induced_angina and slope values of patient
As shown in Figure 8, features such as sex and induced_angina contribute towards disease prediction. The overall experiments lead to several important observations. 1) Men are more prone to heart disease than women. 2) A higher number of vessels leads to higher risk. 3) With respect to chest pain, a soft pain may not be a symptom of a heart problem, while a strong pain needs to be suspected. 4) The risk of heart disease is 3 times higher with exercise-induced angina. A peak exercise slope value of 2 (flat) or 3 (downsloping) indicates a high risk of the disease.
5.2 Performance Evaluation
This section evaluates the proposed framework and the underlying feature selection algorithm. The results are presented in terms of prediction accuracy, specificity (true negative rate) and sensitivity (true positive rate), besides the F-measure, which is reported here as the harmonic mean of specificity and sensitivity.
| Heart Disease Prediction Model | Accuracy (%) | Specificity (%) | Sensitivity (%) | F-Score (%) |
|---|---|---|---|---|
| K-NN | 76 | 74 | 73 | 73.4966 |
| Naïve Bayes | 83 | 87 | 78 | 82.25455 |
| Decision Tree | 74 | 76 | 68 | 71.77778 |
| Random Forest | 83 | 70 | 94 | 80.2439 |
| SVM | 86 | 88 | 78 | 82.6988 |

Table 3: Performance comparison of heart disease prediction models
As shown in Table 3, the performance of heart disease prediction models is provided in terms of accuracy, specificity, sensitivity and F-score, without using the feature selection algorithm.
Figure 9: Performance comparison of prediction models without feature selection
As presented in Figure 9, the performance metrics are provided on the horizontal axis: accuracy, specificity, sensitivity and F-score. The vertical axis shows the performance percentage for these measures. The prediction models compared are k-NN, NB, DT, RF and SVM. The results revealed that SVM has the highest performance among the models, with 86% accuracy, 88% specificity and 78% sensitivity. Its F-score is 82.6988%, which is higher than that of all other methods.
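As a quick consistency check, the F-score column of Table 3 can be reproduced as the harmonic mean of the specificity and sensitivity columns. The sketch below uses only the values printed in Table 3; the helper name `harmonic_mean` is chosen here for illustration.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


# (specificity %, sensitivity %, reported F-score %) from Table 3.
table3 = {
    "K-NN": (74, 73, 73.4966),
    "Naïve Bayes": (87, 78, 82.25455),
    "Decision Tree": (76, 68, 71.77778),
    "Random Forest": (70, 94, 80.2439),
    "SVM": (88, 78, 82.6988),
}

for model, (specificity, sensitivity, reported) in table3.items():
    computed = harmonic_mean(specificity, sensitivity)
    print(f"{model}: computed {computed:.4f}, reported {reported}")
```

The computed and reported values agree to the printed precision for all five models.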
[Figure 9 chart: "Performance of Heart Disease Prediction Models"; axes: Prediction Models (K-NN, Naïve Bayes, Decision Tree, Random Forest, SVM) and Performance (%), 0 to 100; series: Accuracy, Specificity, Sensitivity, F-Score]
| Heart Disease Prediction Model | Accuracy (%) | Specificity (%) | Sensitivity (%) | F-Score (%) |
|---|---|---|---|---|
| K-NN + EGFS | 78 | 76 | 74 | 74.98666667 |
| Naïve Bayes + EGFS | 85 | 90 | 79 | 84.14201183 |
| Decision Tree + EGFS | 74 | 85 | 66 | 74.30463576 |
| Random Forest + EGFS | 85 | 93 | 75 | 83.03571429 |
| SVM + EGFS | 87 | 95 | 78 | 85.66473988 |

Table 4: Performance comparison with feature selection
As shown in Table 4, the performance of heart disease prediction models is provided in terms of accuracy, specificity, sensitivity and F-score. It is with feature selection algorithm.
Figure 10: Performance comparison of prediction models with feature selection
As presented in Figure 10, the performance metrics are provided on the horizontal axis: accuracy, specificity, sensitivity and F-score. The vertical axis shows the performance percentage for these measures. The prediction models compared are k-NN, NB, DT, RF and SVM. The results revealed that SVM has the highest performance among the models, with 87% accuracy, 95% specificity and 78% sensitivity. Its F-score is 85.66474%, which is higher than that of all other methods. Another important observation is that the models with feature selection perform better than the corresponding models in Figure 9.
| Heart Disease Prediction Model | Processing Time (s), With All Features | Processing Time (s), With 6 Features |
|---|---|---|
| K-NN | 29.4 | 24.611 |
| Naïve Bayes | 34.101 | 34.101 |
| Decision Tree | 21.911 | 20.911 |
| Random Forest | 15.121 | 15.121 |
| SVM | 15.234 | 14.134 |

Table 5: Processing time of different prediction models
[Figure 10 chart: "Heart Disease Prediction Performance"; axes: Prediction Models (K-NN + EGFS, Naïve Bayes + EGFS, Decision Tree + EGFS, Random Forest + EGFS, SVM + EGFS) and Performance (%), 0 to 100; series: Accuracy, Specificity, Sensitivity, F-Score]
As shown in Table 5, the time taken by heart disease prediction models is provided when feature selection algorithm is used and when it is not used with prediction models.
Figure 11: Processing time of prediction models with and without feature selection
As presented in Figure 11, the prediction models are shown on the horizontal axis and the time taken in seconds by each prediction model is shown on the vertical axis. Feature selection reduces the processing time of three of the five models. The processing time of k-NN is reduced from 29.4 seconds to 24.611 seconds. The time taken by NB is the same with and without feature selection. DT reduced the time taken from 21.911 to 20.911 seconds. The time taken by RF remains the same with and without feature selection. SVM reduced it from 15.234 to 14.134 seconds.
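The relative reduction in processing time can be worked out from the values in Table 5. The sketch below computes it per model; the helper name `percent_reduction` is chosen here for illustration.

```python
def percent_reduction(before, after):
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100


# (seconds with all features, seconds with 6 features) from Table 5.
times = {
    "K-NN": (29.4, 24.611),
    "Naïve Bayes": (34.101, 34.101),
    "Decision Tree": (21.911, 20.911),
    "Random Forest": (15.121, 15.121),
    "SVM": (15.234, 14.134),
}

reductions = {model: percent_reduction(before, after)
              for model, (before, after) in times.items()}
for model, value in reductions.items():
    print(f"{model}: {value:.1f}% less processing time")
```

The largest relative reduction, about 16%, is obtained for k-NN, while Naïve Bayes and Random Forest show no change.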
6. Conclusion and Future work
In this paper a Data Driven Heart Disease Prediction System (DD-HDPS) is proposed. It is used to realize different heart disease prediction models created with supervised learning methods. In other words, machine learning methods are used as part of the heart disease prediction models, alongside the proposed feature selection method, Entropy and Gain-based Feature Selection (EGFS). This method discovers the features that contribute most to the prediction decision. In the process it can reduce the time and space complexity of the prediction models. Many prediction models are created with EGFS as the feature selection algorithm. The models are based on k-Nearest Neighbour (k-NN), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF) and Support Vector Machines (SVM). A benchmark dataset collected from UCI is used for the experiments. The performance of the prediction models has improved with feature selection. As data size increases, the improvement is expected to be more evident. With a dataset containing 303 instances, the execution time of prediction models with feature selection is reduced by up to 16%.
Similarly, the accuracy of the prediction models is increased by 5%. Thus the hypothesis “feature selection improves performance of heart disease prediction models” is validated and found to be affirmative. The empirical study also provides an encouraging insight that reads “heart diseases can be prevented or cured if they are detected early besides saving life of humans”. Our research on the proposed prediction models with EGFS has certain limitations. First, the dataset is limited, which may not allow generalized conclusions.
[Figure 11 chart: "Processing Time Comparison"; axes: Prediction Models (K-NN, Naïve Bayes, Decision Tree, Random Forest, SVM) and Processing Time (seconds), 0 to 40; series: With All Features, With 6 Features]
Second, the feature selection method focuses on a particular mathematical solution. As one size does not fit all, an important future direction is to build an ensemble of classifiers with hybrid feature selection models to improve heart disease prediction performance further.
References
[1] Austin, P. C., Tu, J. V., Ho, J. E., Levy, D., & Lee, D. S. (2013). Using methods from the data-mining and machine-learning literature for disease classification and prediction: a case study examining classification of heart failure subtypes. Journal of Clinical Epidemiology, 66(4), 398–407.
[2] Haq, A. U., Li, J. P., Memon, M. H., Nazir, S., & Sun, R. (2018). A Hybrid Intelligent System Framework for the Prediction of Heart Disease Using Machine Learning Algorithms. Mobile Information Systems, 2018, 1–21.
[3] Youness Khourdifi and Mohamed Bahaj. (2018). Heart Disease Prediction and Classification Using Machine Learning Algorithms Optimized by Particle Swarm Optimization and Ant Colony Optimization. International Journal of Intelligent Engineering and Systems. 12 (1), p1-11.
[4] V.V. Ramalingam, Ayantan Dandapath and M Karthik Raja. (2018). Heart disease prediction using machine learning techniques: a survey. International Journal of Engineering & Technology. 7 (2), p684- 687.
[5] Shadman Nashif, Md. Rakib Raihan, Md. Rasedul Islam, Mohammad Hasan Imam. (2018). Heart Disease Detection by Using Machine Learning Algorithms and a Real-Time Cardiovascular Health Monitoring System. World Journal of Engineering and Technology. 6, p854-873.
[6] Adil Hussain Seh, Dr. Pawan Kumar Chaurasia. (2019). A Review on Heart Disease Prediction Using Machine Learning Techniques. International Journal of Management, IT & Engineering. 9 (4), p1-17.
[7] Musfiq Ali. (2019). Heart Disease Prediction Using Machine Learning Algorithms. Bachelor of Science in Computer Science and Engineering, p1-45.
[8] G. Parthiban. (2012). Applying Machine Learning Methods in Diagnosing Heart Disease for Diabetic Patients. International Journal of Applied Information Systems. 3 (7), p1-6.
[9] Meherwar Fatima, Maruf Pasha. (2017). Survey of Machine Learning Algorithms for Disease Diagnostic. Journal of Intelligent Learning Systems and Applications. 9, p1-16.
[10] Shouman, M., Turner, T., & Stocker, R. (2012). Using data mining techniques in heart disease diagnosis and treatment. 2012 Japan-Egypt Conference on Electronics, Communications and Computers. P1-5.
[11] Agrawal, S., & Agrawal, J. (2015). Survey on Anomaly Detection using Data Mining Techniques. Procedia Computer Science, 60, 708–713.
[12] Galdi, P., & Tagliaferri, R. (2018). Data Mining: Accuracy and Error Measures for Classification and Prediction. Reference Module in Life Sciences. P1-6.
[13] Şen, B., Peker, M., Çavuşoğlu, A., and Çelebi, F. V. (2014). A Comparative Study on Classification of Sleep Stage Based on EEG Signals Using Feature Selection and Classification Algorithms. Journal of Medical Systems, 38(3). P1-21.
[14] Xue, B., Zhang, M., and Browne, W. N. (2013). Particle Swarm Optimization for Feature Selection in Classification: A Multi-Objective Approach. IEEE Transactions on Cybernetics, 43(6), 1656–1671.
[15] Ganapathy, S., Kulothungan, K., Muthurajkumar, S., Vijayalakshmi, M., Yogesh, P., and Kannan, A. (2013). Intelligent feature selection and classification techniques for intrusion detection in networks: a survey. EURASIP Journal on Wireless Communications and Networking, 2013(1). P1-16.
[16] Lee, J., and Kim, D.-W. (2013). Feature selection for multi-label classification using multivariate mutual information. Pattern Recognition Letters, 34(3), 349–357.
[17] Tsai, C.-F., Eberle, W., and Chu, C.-Y. (2013). Genetic algorithms in feature and instance selection. Knowledge-Based Systems, 39, 240–247.
[18] Anooj, P. K. (2012). Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules. Journal of King Saud University - Computer and Information Sciences, 24(1), 27–40.
[19] Aldallal, A., & Al-Moosa, A. A. A. (2018). Using Data Mining Techniques to Predict Diabetes and Heart Diseases. 2018 4th International Conference on Frontiers of Signal Processing (ICFSP). P1-5.
[20] Ammar Aldallal. (2018). An Analysis of Heart Disease Prediction using Different Data Mining Techniques. IEEE, p1-6.
[21] Pencina, M. J., D’Agostino, R. B., & Massaro, J. M. (2012). Understanding increments in model performance metrics. Lifetime Data Analysis, 19(2), 202–218.
[22] Siwei Jiang, Yew-Soon Ong, Jie Zhang, & Liang Feng. (2014). Consistencies and Contradictions of Performance Metrics in Multiobjective Optimization. IEEE Transactions on Cybernetics, 44(12), 2391–2404.
[23] Wald, R., Khoshgoftaar, T., & Napolitano, A. (2013). The importance of performance metrics within wrapper feature selection. 2013 IEEE 14th International Conference on Information Reuse & Integration (IRI). P1-7.