Predicting Heart Disease Using Machine Learning Classification Algorithms And Along With TPOT (AUTOML)


Killana Sowjanya, Dr. G. Krishna Mohan

Abstract: Heart disease has become a serious health problem, and a large number of people are affected by heart failure. On average, 2-3 persons die due to heart failure. The problem is seen mostly in the 30-65 age group, although people above this age can also be affected; it depends mainly on the individual. The problem is not limited to India but is found across the entire world. Machine Learning (ML) is an emerging technology that is used to analyze medical datasets and predict disease. Many researchers are currently using ML techniques, which helps the health industry considerably. Our problem statement is a classification task, and we use Logistic Regression, Decision Trees, Gaussian Naïve Bayes, KNN, Random Forest, and TPOT (AutoML) to predict heart disease.

Index Terms: Heart disease prediction, classification algorithms, Decision Trees, Logistic Regression, Random Forest, KNN, Gaussian NB, TPOT (AutoML)

————————————————————

1. INTRODUCTION

The heart is a vital organ responsible for pumping blood to the whole body. If the heart fails, the other organs stop working and the person can die within a few minutes. Heart disease is therefore a major cause of death in humans. According to a recent WHO survey, 20.5 million people die due to heart failure, and in India nearly 5 million people die because of improper functioning of the heart [1]. Medical centers all over the world have collected health-related data, and this data can be analyzed with Machine Learning techniques to find hidden patterns and insights. Moreover, such analysis is useful for predicting the disease in advance, so that patients' problems can be addressed with advanced methodologies. The algorithms used in this paper are logistic regression, decision trees, KNN, Gaussian naive Bayes, random forest, and TPOT (AutoML) [2]. A major challenge in the healthcare sector is the lack of facilities. If a disease is predicted at an early stage, an appropriate treatment can be suggested and the problem can be solved; if not, it becomes critical. Symptoms also vary from person to person [3], and multiple factors may cause heart failure. At present, every hospital maintains its own database to store patient data. If all this data is collected and its trends and insights are analyzed, it will be useful for solving most of these problems.

Multiple organizations are currently handling huge projects on heart disease with different kinds of patients [4]. Moreover, any kind of disease can be predicted if meaningful data is available. The main motive is to address heart disease problems and predict whether a patient will be readmitted or not. To make this prediction, the data preprocessing must be done effectively; the model can then attain better accuracy [5].

2. HEART DISEASE DATA SET

Features   Type       Description about data
Age        Numerical  Describes the age of person.
Sex        Numerical  Describes the gender.
CP         Numerical  Describes the information about chest pain of a person.
trestbps   Numerical  Describes the person's blood pressure.
chol       Numerical  Describes the cholesterol levels.
fbs        Numerical  Fasting blood sugar levels.
restecg    Numerical  Describes electrocardiographic measurement.
thalach    Numerical  Describes max heart rate.
exang      Numerical  Describes the information about exercise.
oldpeak    Numerical  Description about exercise relative to rest.
Slope      Numerical  Peak values of exercise.
ca         Numerical  Information about vessels.
thal       Numerical  Blood disorder levels.
target     Numerical  Heart disease (0 = no, 1 = yes).

___________________________________

• Killana Sowjanya, she has done her MTech (CSE) from KL University, Guntur, Andhra Pradesh.

3. DATA EXPLORATION AND PREPARATION

3.1 Data

The dataset was collected from Kaggle, one of the best repositories, where datasets from all kinds of domains are available. It consists of domains like Healthcare,

distribution of the data in the dataset. In this paper, the following steps were considered in order to get better accuracy. We used seaborn to visualize the data for better understanding. The steps considered in this paper are:

• Data pre-processing
• Feature Engineering
• Feature Extraction
• Feature Selection
• Normalization and standardization

3.3 Data pre-processing

In general, when developing any Machine Learning project, about 30% of the total time is spent on this stage, and it plays an important role in the whole model development. Put simply, the data pre-processing stage can be described as a data mining stage: it mainly involves transforming the data, which often comes in an unstructured form, so that it makes sense. [6]

Steps involved in Data Pre-Processing:

1. Importing the necessary libraries
2. Getting the related dataset
3. Identifying the missing values
4. Identifying the categorical values
5. Dividing the dataset into a training part and a testing part
6. Feature scaling

3.4 Importing the necessary libraries

• TPOTClassifier

3.5 Getting the related Dataset

As seen in the table above, heart disease prediction is possible when the dataset contains both the input features and the target information. [7]

3.6 Identifying the Missing values

Missing values have to be handled properly; if they are not, the accuracy can vary a lot. There are multiple ways to handle missing values:

• dropna()
• mean
• median
• mode
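The options listed above can be sketched with pandas. This is a minimal illustration, not the authors' code: the small data frame and its values are made up, and only the column names follow the dataset table.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are invented for this sketch.
df = pd.DataFrame({
    "age": [63, 45, np.nan, 58],
    "chol": [233.0, np.nan, 250.0, 199.0],
    "target": [1, 0, 1, 0],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill the gaps with a column statistic
# (mean for age, median for chol in this sketch).
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["chol"] = filled["chol"].fillna(filled["chol"].median())
```

For a categorical column, the mode can be used in the same way, for example `col.fillna(col.mode()[0])`.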

3.7 Identifying the Categorical values

In general, a model can understand only numeric input, so categorical values have to be converted into numeric format. This can be done with:

• LabelEncoder
• OneHotEncoder
• get_dummies()
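A minimal sketch of two of these encoders follows, assuming scikit-learn's `LabelEncoder` and pandas' `get_dummies`. The text labels used for `cp` are hypothetical; they only stand in for a categorical column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels for the chest-pain column.
df = pd.DataFrame({"cp": ["typical", "atypical", "typical", "non-anginal"]})

# Label encoding: each category becomes an integer
# (categories are numbered in sorted order).
encoder = LabelEncoder()
df["cp_label"] = encoder.fit_transform(df["cp"])

# One-hot encoding with pandas: one indicator column per category.
one_hot = pd.get_dummies(df["cp"], prefix="cp")
```

Label encoding implies an ordering between categories, so one-hot encoding is usually the safer choice for nominal features.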

Dividing the dataset into training part and testing part

The dataset has to be divided into two parts, training and testing. The split ratio can be chosen based on the problem statement.
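As a sketch of this step, the split can be done with scikit-learn's `train_test_split`. The 80/20 ratio, the random seed, and the synthetic data are assumptions made for illustration; the paper does not state the ratio it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 13 input features, balanced binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))
y = np.array([0, 1] * 50)

# Hold out 20% of the rows for testing; stratify keeps the class
# proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```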

3.8 Feature Scaling.

Feature scaling is used to bring the data onto a standard scale. This makes it easier for the model to fit the data, and accuracy tends to be better when feature scaling is applied.
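One common way to do this is standardization (zero mean, unit variance). The sketch below assumes scikit-learn's `StandardScaler`; the numbers are invented, and the paper does not say which scaler was used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented values for two features (for example blood pressure and cholesterol).
X_train = np.array([[120.0, 200.0], [140.0, 240.0], [160.0, 280.0]])
X_test = np.array([[150.0, 260.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation on the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics for the test data.
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler only on the training part avoids leaking information from the test part into the model.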

3.9 Feature Engineering

Feature engineering means creating new features from the existing ones in order to improve the model's accuracy. It can be defined as transforming the data so that it better represents the underlying problem to the predictive models, which results in increased model accuracy. [8]

We then select the important features for the analysis, give the data to the model, and obtain the accuracy.

3.10 Feature Extraction

In any dataset, depending on the problem statement, the number of features can grow large. To reduce the dimensionality, we use a dimensionality reduction technique, namely Principal Component Analysis (PCA).
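A minimal sketch of PCA with scikit-learn is shown below. The synthetic data and the choice of 5 components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a table with 13 input features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Project the 13 features onto 5 principal components.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance kept by the 5 components.
explained = pca.explained_variance_ratio_.sum()
```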

3.11 Feature Selection

In Machine Learning, this technique is used to select the most important features, i.e. those related to the output, and only those features are considered. Its benefits include:

• Simplification of the model
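One possible way to select features is a univariate statistical test. The sketch below assumes scikit-learn's `SelectKBest` with the ANOVA F-score; the synthetic data and `k=2` are illustrative, and the paper does not name the selection method it used.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = np.array([0, 1] * 100)

# Four noise columns plus one column that closely follows the target.
noise = rng.normal(size=(200, 4))
informative = y + rng.normal(scale=0.1, size=200)
X = np.column_stack([noise, informative])  # informative column has index 4

# Keep the 2 features with the highest F-score against the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```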


4. EXPLORATION AND PREPARATION WITH SEABORN

Using Seaborn, we can easily visualize the dataset and gain a better understanding of it. The target column consists of 1 and 0, so we can visualize it as follows.

We can also plot the target column against the sex column to see how many people are affected by the disease with respect to sex.

respective age. Using seaborn, we can plot the graph.

We can also compare the sex column with the target, i.e. no disease vs. disease.
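The counts behind such plots can be tabulated with pandas, as in the hedged sketch below. The eight rows are invented; on the real data frame, a call such as `sns.countplot(data=df, x="sex", hue="target")` would draw the same counts as bars.

```python
import pandas as pd

# Invented rows; only the column names follow the dataset table.
df = pd.DataFrame({
    "sex":    [1, 1, 0, 0, 1, 0, 1, 1],
    "target": [1, 0, 1, 1, 0, 1, 1, 0],
})

# How many rows have disease (1) and no disease (0).
target_counts = df["target"].value_counts()

# Disease vs. no disease, broken down by sex.
by_sex = pd.crosstab(df["sex"], df["target"])
```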

Heart disease according to Fasting Blood sugar

Analyzing the resting electrocardiographic measurement

Analyzing exercise-induced angina (1 = yes; 0 = no)

People with exercise_induced_angina=1 are much less likely to have heart problems.

Analyzing the slope of the exercise.

Analyzing the number of major vessels (0-3) colored by fluoroscopy

Comparison with the target.


4.5 Correlation analysis

5. MODELS TO APPLY

• Logistic Regression algorithm
• Naïve Bayes algorithm
• KNN Classifier
• Decision Tree Classifier
• Random Forest Classifier
• TPOT

5.1 Logistic Regression algorithm

Logistic Regression is a supervised learning algorithm used mainly for classification problems. It maps an input value to a probability between 0 and 1 using the logistic function:

$$\frac{1}{1 + e^{-\text{value}}}$$

• Binary Output Variable
• Remove Noise
• Gaussian Distribution
• Remove Correlated Inputs
• Fail to Converge
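The logistic function used in this section can be written in a few lines. The sketch below is illustrative; the 0.5 decision threshold and the function names are assumptions, not taken from the paper.

```python
import math


def sigmoid(value: float) -> float:
    """Logistic function 1 / (1 + e^-value), mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def predict_class(value: float, threshold: float = 0.5) -> int:
    """Turn the probability into a binary label (1 = disease, 0 = no disease)."""
    return 1 if sigmoid(value) >= threshold else 0
```

For example, `sigmoid(0)` is exactly 0.5, so an input of 0 lies on the decision boundary.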

5.2 Naïve Bayes model

Naive Bayes is based on Bayes' theorem. It uses the probabilities of the data points to classify them:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where B is the evidence and A is the hypothesis.
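A small worked example of Bayes' theorem, with A as the hypothesis and B as the evidence, is sketched below. All probabilities are invented for illustration and are not taken from the dataset.

```python
def posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded over A and not-A."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: P(disease) = 0.3, P(symptom | disease) = 0.8,
# P(symptom | no disease) = 0.2.
p = posterior(0.3, 0.8, 0.2)
```

With these numbers the evidence term is 0.38, so observing the symptom raises the probability of disease from 0.3 to about 0.63.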

5.3 KNN (K-Nearest Neighbors) Classifier

KNN is a classifier used to solve classification problems. It classifies a new data point by computing the Euclidean distance to the training points and taking a majority vote among the nearest neighbors.

Workflow:

1. Collect data.
2. Initialize the k-value.
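The rule described in this section (Euclidean distance followed by a majority vote) can be sketched from scratch with NumPy. The function name, the toy points, and `k=3` are illustrative assumptions.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters with labels 0 and 1.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

label = knn_predict(X_train, y_train, np.array([6.0, 7.0]), k=3)
```

An odd value of k avoids ties in a two-class problem.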

5.4 Decision Tree Classifier

A decision tree is an algorithm used to solve both classification and regression problems. In a decision tree, the sample is split into two or more homogeneous sets (sub-populations) based on a differentiator among the input variables. It works with both categorical and continuous input and output variables. [9]
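A minimal sketch with scikit-learn's `DecisionTreeClassifier` shows the splitting idea. The toy data is invented so that a single split on the first feature separates the two classes; `max_depth=2` is an arbitrary illustrative setting.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label is 1 exactly when the first feature exceeds 50.
X = np.array([[30, 1], [40, 0], [45, 1], [55, 0], [60, 1], [70, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Classify a new sample whose first feature is above the split point.
prediction = tree.predict([[65, 1]])
```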

5.5 Random Forest

Random Forest is a supervised classification algorithm. As the name suggests, the forest consists of multiple Decision Trees, and good accuracy can be obtained easily. Random Forest classifiers are used in many situations and can be applied to any kind of classification problem. The number of trees to use has to be decided, and it can be chosen through hyperparameter tuning. [10]

In general, the Random Forest algorithm has low bias and low variance, so it also mitigates the overfitting problem.
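Choosing the number of trees by hyperparameter tuning can be sketched with scikit-learn's `GridSearchCV`. The synthetic data, the candidate values for `n_estimators`, and the 3-fold cross-validation are assumptions for illustration, not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data with 13 features.
X, y = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=0
)

# Try two forest sizes and keep the one with the best cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
best_trees = search.best_params_["n_estimators"]
```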

5.6 TPOT

TPOT is an automated machine learning (AutoML) tool in Python. It automatically picks the most suitable algorithm and fits it to the data. TPOT is still under development. In some situations it gives the best accuracy, depending on the problem statement.

It is one of the earliest AutoML tools, and in some cases it achieves high accuracy.

6. MODEL EVALUATION

Here we fit the data to each model and compute its accuracy. The model that gives the best accuracy is selected for our problem statement.


Models               Accuracy  Precision  Recall  F1-Score
KNN                  68.8%     74.1%      67.6%   70.7%
Decision Trees       81.9%     87.0%      79.4%   83.0%
Logistic Regression  85.2%     85.7%      88.2%   86.9%
Naïve Bayes          85.4%     83.78%     91.1%   87.32%
TPOT                 86.3%     84.7%      81.4%   83.5%
Random Forests       88.7%     86.1%      91.1%   88.57%
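The F1-score is the harmonic mean of precision and recall. As a short illustration, the sketch below recomputes it from the reported precision and recall of the KNN and Logistic Regression rows; the function name is an assumption.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) taken from the results table.
knn_f1 = f1_from(74.1, 67.6)
lr_f1 = f1_from(85.7, 88.2)
```

Rounded to one decimal, these give 70.7 and 86.9, matching the F1-Score column for those two rows.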

6.1 Confusion Matrix

A confusion matrix is used to check the accuracy of a classifier. It can be computed and plotted using functions imported from sklearn.metrics.
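A minimal sketch with `sklearn.metrics` is given below. The true and predicted labels are invented for illustration and do not come from the paper's experiments.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented labels: 1 = disease, 0 = no disease.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are the true classes, columns the predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For a binary problem the flattened matrix is (TN, FP, FN, TP).
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions: (TP + TN) / total.
accuracy = accuracy_score(y_true, y_pred)
```

Precision and recall follow from the same four counts: TP / (TP + FP) and TP / (TP + FN).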

Finally, we can plot the accuracy of each model, out of

7. CONCLUSION

We have evaluated various classification algorithms and obtained their accuracies. Random Forest gave the best accuracy, 88.7%. ML plays a key role in analyzing heart disease, and it is clear that in healthcare ML is a powerful tool for solving many kinds of problems.

8. REFERENCES

[1]. A. S. Abdullah and R. R. Rajalaxmi, "A data mining model for predicting the coronary heart disease using random forest classifier," in Proc. Int. Conf. Recent Trends Comput. Methods, Commun. Controls, Apr. 2012, pp. 22-25.

[3]. A. H. Alkeshuosh, M. Z. Moghadam, I. Al Mansoori, and M. Abdar, "Using PSO algorithm for producing best rules in diagnosis of heart disease," in Proc. Int. Conf. Comput. Appl. (ICCA), Sep. 2017, pp. 306-311.

[5]. N. Al-milli, "Backpropogation neural network for prediction of heart disease," J. Theor. Appl. Inf. Technol., vol. 56, no. 1, pp. 131-135, 2013.

[6]. C. A. Devi, S. P. Rajamhoana, K. Umamaheswari, R. Kiruba, K. Karunya, and R. Deepika, "Analysis of neural networks based heart disease prediction system," in Proc. 11th Int. Conf. Hum. Syst. Interact. (HSI), Gdansk, Poland, Jul. 2018, pp. 233-239.

Univ.-

115-125, Jun. 2018. doi: 10.1016/j.eswa.2018.01.025.

[7]. C.-A. Cheng and H.-W. Chiu, "An artificial neural network model for the evaluation of carotid artery stenting prognosis using a national-wide database," in Proc. 39th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Jul. 2017, pp. 2566-2569.

[9]. H. A. Esfahani and M. Ghazanfari, "Cardiovascular disease detection using a new ensemble classifier," in Proc. IEEE 4th Int. Conf. Knowl.-Based Eng. Innov. (KBEI), Dec. 2017, pp. 1011-1014.

[10]. F. Dammak, L. Baccour, and A. M. Alimi, "The impact of criterion weights techniques in TOPSIS method of multi-criteria decision making in crisp and intuitionistic fuzzy domains," in Proc. IEEE Int. Conf. Fuzzy Syst. (FUZZ-IEEE), vol. 9, Aug. 2015, pp. 1-8.