Stroke Prediction using Distributed Machine Learning Based on Apache Spark


Vol. 28, No. 15, (2019), pp. 89-97


Hager Ahmed¹, Sara F. Abd-el ghany², Eman M.G. Younis³, Nahla F. Omran⁴, Abdelmgeid A. Ali⁵

¹,³,⁵ Faculty of Computers and Information, Minia University, Egypt

²,⁴ Department of Computer Science, Faculty of Science, South Valley University, Egypt

Abstract

Stroke is one of the leading causes of death and one of the primary causes of severe long-term disability worldwide. In this paper, we compare different distributed machine learning algorithms for stroke prediction on the Healthcare Dataset Stroke. The work is implemented on Apache Spark, one of the most popular big data platforms, which includes the MLlib library. MLlib is an API integrated with Spark that provides machine learning algorithms. Four machine learning classification algorithms (Decision Tree, Support Vector Machine, Random Forest Classifier, and Logistic Regression) were used to build the stroke prediction model.

Hyperparameter tuning and cross-validation were applied with the machine learning algorithms to enhance the results. Accuracy, Precision, Recall, and F1-measure were used to measure the performance of the models. The results showed that the Random Forest Classifier achieved the best accuracy, at 90%.

Keywords: Stroke; Stroke Prediction; Machine Learning; Big Data; Apache Spark

1. Introduction

Stroke has become one of the most significant threats to public health worldwide [1].

Stroke ranks second, after heart disease, in terms of life years lost [2, 3].

Stroke is a sudden onset of focal neurological deficits lasting more than 24 hours. It is caused by cerebral artery occlusion or atherosclerosis. Signs of stroke usually appear abruptly, but they can also develop gradually. In 2016, a dramatic rise in the number of stroke patients in the United States placed a large load on the health care system [4]. Long-term stroke disabilities impose a physical, mental, and financial burden on patients, their families, and the community, while early detection is believed to improve healing and reduce disabilities [5].

Early prediction of stroke is useful for prevention and for early treatment intervention. Machine learning and data mining play essential roles in predicting stroke; examples include the support vector machine [6], logistic regression [7], and the random forest classifier and neural network [8]. Machine learning is a type of artificial intelligence that aims to give computers human-like thinking capability. Its goal is to allow computers to perform a particular task by relying on patterns and inference, without explicit instructions [9].

Big data is a large and complex amount of data that cannot be handled using traditional analysis methods. This data can be in structured, semi-structured, and unstructured forms.

The massive flow of data has led to the need for better analytical methods, as traditional methods have become inefficient for processing big data [10, 11]. Therefore, frameworks such as Apache Hadoop [12] and Apache Spark [13] have emerged to analyze, store, and process large amounts of data.

Apache Spark [13, 14] is an open-source framework for data analytics that provides fault tolerance and processes data in real time. Spark can work with structured data such as CSV files and semi-structured data such as JSON files. It offers high-level APIs such as Spark Streaming and MLlib. MLlib is Apache Spark's scalable machine learning library. It offers different types of machine learning: classification, regression, and clustering [15].

The contribution of this research is the design of a distributed machine learning system based on Apache Spark to predict stroke disease. Four machine learning models are used to predict stroke: logistic regression, support vector machine, decision tree, and random forest. Furthermore, various performance measures such as accuracy, precision, and recall have been computed. Moreover, data preprocessing techniques were applied to the stroke dataset.

The rest of this paper is organized as follows. Section 2 presents related works. The proposed system for predicting stroke disease is described in Section 3. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

2. Related Works

Several researchers used machine learning algorithms for stroke prediction. The contributions of some research studies are described in this section.

D. Shanthi et al. [16] used Artificial Neural Networks (ANN) for the prediction of thromboembolic stroke disease. The healthcare dataset stroke data, with eight important patient attributes, was used. This research work demonstrates ANN-based prediction of stroke disease, improving the accuracy to 89% with a higher consistency rate. The ANN exhibits good performance levels for the prediction of stroke disease.

Besides, Kansadub et al. [17] have applied decision trees (DTs), naive Bayes, and ANN to predict stroke on the healthcare dataset stroke data. The researchers revealed that DT was the best classifier among the other used methods.

In the same context, Sung et al. [18] compared the performance of k-nearest neighbor (KNN), multiple linear regression (MLR), and a regression tree model for predicting stroke severity; the results showed that KNN has better accuracy than the other models.

Ahmet K. Arslan et al. [19] used Support Vector Machine (SVM), Stochastic Gradient Boosting (SGB), and penalized logistic regression (PLR) to predict stroke on a dataset collected from Turgut Ozal Medical Centre, Inonu University, Malatya, Turkey. The findings showed that SVM achieved the highest accuracy, at 98%.

Linder et al. [20] have also compared logistic regression (LR) and artificial neural networks (ANNs) for classifying acute ischemic stroke from the Database of German Stroke. The results of this study showed that LR was better than ANNs for the classification of acute ischemic stroke.

Khosla et al. [21] have applied the Cox proportional hazards model with the machine learning method for the prediction of the stroke on the dataset of the Cardiovascular Health Study. The result showed that support vector machine (SVM) achieved a higher area under the ROC curve when compared to the Cox proportional hazards model.

Adam et al. [22] have also compared two algorithms, decision tree and k-nearest neighbor (KNN), for classification of stroke on the dataset from Sugam Multispecialty Hospital, Kumbakonam, Tamil Nadu, India. The researchers concluded that the decision tree classifier performed better than the KNN algorithm.

Cheng et al. [23] have also worked on predicting ischemic stroke by using two ANN models on the dataset from Sugam Multispecialty Hospital, Kumbakonam, Tamil Nadu, India. The researchers reported that the two models achieved accuracy rates of 79.2% and 95.1%.


Previous studies of stroke disease prediction have only used traditional (non-distributed) machine learning methods. In our research, we use distributed machine learning on the Spark platform to predict stroke.

3. Materials and Methods

3.1. Database

Healthcare Dataset Stroke [24] was used to train and test models for predicting stroke disease. This dataset consists of 10 independent variables as features and one dependent variable as the class label, which is used to predict stroke disease. The feature names are gender, age, hypertension, heart_disease, ever_married, work_type, residence_type, avg_glucose_level, bmi, and smoking_status. The class label has two values: 0 represents the absence of stroke disease, and 1 represents its presence. Table 1 describes the features.

Table 1. Feature names and descriptions of the stroke dataset

#    Feature             Description
1    Age                 Age
2    Gender              Male and Female
3    Hypertension        Hypertension
4    Heart Disease       1: has heart disease; 0: does not have heart disease
5    Ever_married        1: married; 0: not married
6    Work_type           Children, Private, Never worked, Govt job, Self employed
7    Residence_type      Rural, Urban
8    Avg_glucose_level   Average glucose level
9    bmi                 Body mass index
10   smoking_status      Never smoked, Formerly smoked

3.2. The proposed system of predicting the stroke disease

Figure 1 below illustrates the architecture of the stroke disease prediction system. The proposed system includes five stages: 1) loading the stroke dataset, 2) data pre-processing, 3) cross-validation and hyperparameter tuning, 4) classifiers, and 5) evaluating classifiers.


Figure 1. The architecture of the stroke disease prediction system.

A) Data pre-processing

Data pre-processing is a primary step for adequately describing the data to the machine learning algorithm. It plays an essential role in improving the performance of machine learning models. In this stage, several steps are applied.

1. The smoking_status and bmi features have many missing values. The mean is used to fill the missing values.

2. Categorical features are converted into numerical data using LabelEncoder.

3. The dataset is imbalanced, meaning that the class labels occur in very unequal proportions. We handle the imbalance using random resampling techniques.
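The three pre-processing steps above can be sketched in plain Python. This is only an illustration, not the authors' PySpark code: the helper names and the toy records are hypothetical, and random oversampling of the minority class is an assumption, since the text says only that random resampling was used.

```python
import random
from collections import Counter


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced.

    Oversampling is an assumption; undersampling the majority class is the
    other common form of random resampling.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy records (made-up values, not taken from the dataset).
bmi = fill_missing_with_mean([22.0, None, 30.0])
gender_codes, gender_map = label_encode(["Male", "Female", "Female"])
rows, labels = random_oversample([[1], [2], [3], [4]], [0, 0, 0, 1])
```

After these calls, the missing bmi entry holds the mean of the two observed values, the gender strings are integer codes, and both classes have the same number of rows.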

B) Machine learning algorithms.

In this stage, four machine learning algorithms are used: Logistic regression (LR), Random forest classifier (RF), Decision tree (DT), and Support vector machine (SVM).

• Logistic regression is widely used in many domains, such as the biological sciences. The logistic regression algorithm is used to find the relationship between the target and predictive variables. The target variable is binary, 0 or 1. The purpose of our logistic regression algorithm is to find the best fit that is diagnostically reasonable to describe the relationship between our target variable and the predictive variables [25].

• The decision tree is a type of supervised classifier consisting of a set of rules. A decision tree has two main parts: internal nodes, which make a decision, and leaf nodes, which have no child nodes and are each associated with a label. Decision trees support various data types in classifying instances [26].

• Random forest is a popular machine learning classifier for developing prediction models in many research settings. A random forest is a collection of trees constructed using randomly selected training datasets and random subsets of predictor variables for modeling outcomes. Random forest often gives higher accuracy than a single decision tree model [27].

• A support vector machine is used for both classification and regression problems. The goal of SVM is to obtain the most suitable hyperplane that divides the dataset into two classes, which are 0 and 1 [28].
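As a rough illustration of how three of these classifiers turn a feature vector into a prediction, the sketch below shows the logistic function used by logistic regression, the side-of-hyperplane rule of a linear SVM, and majority voting as one common way to combine the trees of a random forest. The function names, weights, and inputs are hypothetical; nothing here is taken from the trained models or from MLlib internals.

```python
import math


def logistic_probability(weights, bias, features):
    """P(y = 1 | x) under a logistic regression model."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def svm_predict(weights, bias, features):
    """Class label from the side of the hyperplane w.x + b = 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0


def majority_vote(tree_predictions):
    """Most frequent label among the individual tree predictions."""
    return max(set(tree_predictions), key=tree_predictions.count)


# Made-up weights and inputs, for illustration only.
p = logistic_probability([0.0, 0.0], 0.0, [1.0, 2.0])  # zero score gives 0.5
side = svm_predict([1.0, -1.0], 0.0, [2.0, 1.0])        # positive score gives 1
vote = majority_vote([1, 0, 1])                          # two of three trees say 1
```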


C) Cross-validation and Hyperparameter Tuning

• Hyperparameter tuning is applied to the machine learning algorithms [29]. We define a set of candidate values for each hyperparameter of each classifier. Then, the grid search method tests each combination of values and selects the one that achieves the best performance.

• K-fold cross-validation: the dataset is divided into k folds of equal size. k-1 folds are used for training, and the remaining fold is used to evaluate the models. In our work, we used k = 10. In each iteration of the 10-fold CV process, 10% of the data is used to test the models, and 90% is used to train them.
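The combination of grid search and k-fold cross-validation described above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the Spark implementation used in the paper: the function names, the `max_depth` and `num_trees` grid, and the toy scoring function are all hypothetical, and the folds are taken contiguously without shuffling.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def grid_search(param_grid, score_fn, n_samples, k=10):
    """Return the parameter combination with the best mean score over k folds."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [
            score_fn(params, train_idx, test_idx)
            for train_idx, test_idx in k_fold_indices(n_samples, k)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, test_idx):
    """Stand-in for 'train on train_idx, evaluate on test_idx' (hypothetical)."""
    return 1.0 - abs(params["max_depth"] - 5) / 10


folds = list(k_fold_indices(100, 10))
best_params, best_score = grid_search(
    {"max_depth": [3, 5, 7], "num_trees": [10, 20]}, toy_score, n_samples=100
)
```

With 100 samples and k = 10, every fold holds out 10 samples and trains on the other 90, matching the 10%/90% split stated above. In a real run, `score_fn` would fit a classifier on the training indices and return its accuracy on the held-out fold.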

D) Evaluating Classifiers

For evaluating the performance of models, we have used the confusion matrix to calculate accuracy, precision, recall, and f-measure.

The confusion matrix describes the performance of a model on a set of test data. It gives two types of correct predictions and two types of incorrect predictions for the classifier [30]. Table 2 shows the confusion matrix. TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. The accuracy, precision, recall, and f-measure (f-score) are defined as follows:

Table 2. Confusion matrix

                  Predicted Class 0   Predicted Class 1
Actual Class 0    TP                  FN
Actual Class 1    FP                  TN

• Accuracy shows the overall performance of the classification system:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

• Precision is the number of correctly classified positive examples divided by the total number of examples predicted as positive [31]:

$$\text{Precision} = \frac{TP}{TP + FP}$$

• F-measure combines Precision and Recall as their harmonic mean. The F-measure is always nearer to the smaller of the two values [32]:

$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

• Recall is the number of correctly classified positive examples divided by the total number of actual positive examples:

$$\text{Recall} = \frac{TP}{TP + FN}$$
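The four measures can be computed directly from the confusion-matrix counts. The sketch below follows the formulas above; the function name and the example counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0; otherwise a division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts the precision (40/45) is higher than the recall (40/50), and the F-measure (80/95) falls between the two, closer to the recall.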


4. Results of applying machine learning algorithms and Discussion

Four supervised machine learning algorithms were applied to develop the predictive models: SVM, RF, LR, and DT. We applied 10-fold cross-validation and hyperparameter tuning with the machine learning algorithms to improve the results. In each fold of the 10-fold cross-validation, 10% of the data is used for testing and 90% for training. Four performance measures were used to evaluate the classification models: accuracy, recall, precision, and f1-score.

4.1. Experimental setup

The predictive models were developed on Apache Spark and were written in PySpark.

In addition, we used various API libraries that are integrated with Spark. Spark's MLlib is applied to perform classification algorithms. Also, we have used Python libraries to handle an unbalanced dataset.

The predictive models were executed on a Spark cluster that includes one master node and two worker nodes. The cluster was built from Ubuntu 14.04 virtual machines running the Java VM, with 16 GB of RAM, seven cores, and a 100 GB disk.

4.2. Results of applying logistic regression

Table 3 shows the precision, recall, and f1-score obtained by logistic regression for each class. For class 0, precision registered the highest percentage at 81%, while recall registered the lowest rate at 73%. For class 1, recall achieved the highest result at 82%, while precision registered the lowest at 75%.

Table 3. Results of applying logistic regression (%)

Class   precision   recall   f1-score
0       81          73       76
1       75          82       79

4.3. Results of applying random forest classifier

Table 4 shows the precision, recall, and f1-score obtained by the random forest classifier for each class. For class 0, precision registered the highest percentage at 96%, while recall registered the lowest rate at 85%. For class 1, recall achieved the highest result at 96%, while precision registered the lowest at 87%.

Table 4. Results of applying random forest classifier (%)

Class   precision   recall   f1-score
0       96          85       90
1       87          96       91

4.4. Results of applying decision tree

Table 5 shows the results of precision, recall, and f1-score of applying a decision tree for each class. For class 0, precision recorded the highest percentage at 82%, while recall registered the lowest rate at 75%. For class 1, recall achieved the highest rate at 84%.

Table 5. Results of applying decision tree (%)

Class   precision   recall   f1-score
0       82          75       79
1       77          84       81


4.5. Results of applying a linear support vector machine

Table 6 presents the precision, recall, and f1-score obtained by the linear support vector machine for each class. For class 0, precision registered the highest percentage at 81%, while recall registered the lowest percentage at 72%. For class 1, recall scored the highest rate at 83%, while precision registered the lowest at 75%.

Table 6. Results of applying a linear support vector machine (%)

Class   precision   recall   f1-score
0       81          72       76
1       75          83       79

4.6. Discussion

Figure 2 shows the accuracy of applying LR, RF, DT and SVM. The random forest recorded the highest accuracy of 90%. The decision tree registered the second-highest accuracy at 79%. The support vector machine and logistic regression techniques recorded the same accuracy at 77%.

Figure 2. Accuracy of applying machine learning algorithms
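As a quick consistency check on the reported figures: when both classes contain the same number of test examples, accuracy equals the mean of the two per-class recalls. The sketch below compares that mean, computed from the recall columns of Tables 3 to 6, with the accuracies quoted in this section. Equal class sizes in the evaluation data is an assumption made here (plausible after resampling, but not stated in the paper), and the variable names are hypothetical.

```python
# Per-class recall (%) from Tables 3-6 and accuracy (%) as quoted in Section 4.6.
reported = {
    "LR": {"recall": (73, 82), "accuracy": 77},
    "RF": {"recall": (85, 96), "accuracy": 90},
    "DT": {"recall": (75, 84), "accuracy": 79},
    "SVM": {"recall": (72, 83), "accuracy": 77},
}


def mean_recall(recalls):
    """Mean of per-class recalls; equals accuracy when classes are equally sized."""
    return sum(recalls) / len(recalls)


# Difference between the recall-based estimate and the quoted accuracy.
gaps = {
    name: mean_recall(entry["recall"]) - entry["accuracy"]
    for name, entry in reported.items()
}
```

Under that assumption, every estimate lies within one percentage point of the quoted accuracy, which suggests the tables and the accuracy figures are mutually consistent up to rounding.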

5. Conclusion

Stroke ranks second, after heart disease, in terms of life years lost. Machine learning plays an essential role in predicting stroke. The proposed stroke prediction system is developed on Apache Spark and uses distributed machine learning to train and test the models. It consists of five stages: loading the stroke dataset, data pre-processing, cross-validation and hyperparameter tuning, classifiers, and evaluating classifiers. The results showed that the random forest classifier achieved the best accuracy, at 90%.

References

1. Katan, M. and A. Luft. Global burden of stroke. in Seminars in neurology. 2018. Thieme Medical Publishers.

2. Feigin, V.L., B. Norrving, and G.A. Mensah, Global burden of stroke. Circulation research, 2017. 120(3): p. 439-448.


3. Naghavi, M., et al., Global, regional, and national age-sex specific mortality for 264 causes of death, 1980–2016: a systematic analysis for the Global Burden of Disease Study 2016. The Lancet, 2017. 390(10100): p. 1151-1210.

4. Mozaffarian, D., et al., Heart disease and stroke statistics-2016 update: a report from the American Heart Association. Circulation, 2016. 133(4): p. e38-e48.

5. Veerbeek, J.M., et al., Early prediction of outcome of activities of daily living after stroke: a systematic review. Stroke, 2011. 42(5): p. 1482-1488.

6. Swethalakshmi, H., et al. Online handwritten character recognition of Devanagari and Telugu Characters using support vector machines. 2006.

7. Al-Talqani, H.M., Dyslipidemia and Cataract in Adult Iraqi Patients. EC Ophthalmology, 2017. 5: p. 162-171.

8. McKinley, R., et al., Fully automated stroke tissue estimation using random forest classifiers (FASTER). Journal of Cerebral Blood Flow & Metabolism, 2017. 37(8): p. 2728-2741.

9. Jos Timanta Tarigan, C.L.G., Elviawaty Muisa Zamzami, A review on applying machine learning in game industry. International Journal of Advanced Science and Technology, 2019. 28(2).

10. Saiteja Myla, S.T.M., K Karthikeya, Preetham.B, SK Hasane Ahammad, The rise of "Big Data" in the field of cloud analytics. International Journal of Advanced Science and Technology, 2019. 28(8).

11. Ara, A. and A. Ara, Beyond Hadoop: The Paradigm Shift of Data From Stationary to Streaming Data for Data Analytics.

12. Hadoop, A. Apache Hadoop. [cited 2019]; Available from: https://hadoop.apache.org/.

13. Spark, A. Apache Spark. [cited 2019]; Available from: https://spark.apache.org/.

14. Ahmed, H., et al., Heart disease identification from patients’ social posts, machine learning solution on Spark. Future Generation Computer Systems, 2019.

15. Meng, X., et al., Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 2016. 17(1): p. 1235-1241.

16. Shanthi, D., G. Sahoo, and N. Saravanan, Designing an artificial neural network model for the prediction of thrombo-embolic stroke. International Journals of Biometric and Bioinformatics (IJBB), 2009. 3(1): p. 10-18.

17. Kansadub, T., et al. Stroke risk prediction model based on demographic data. in 2015 8th Biomedical Engineering International Conference (BMEiCON). 2015. IEEE.

18. Sung, S.-F., et al., Developing a stroke severity index based on administrative data was feasible using data mining techniques. Journal of clinical epidemiology, 2015. 68(11): p. 1292-1300.

19. Arslan, A.K., C. Colak, and M.E. Sarihan, Different medical data mining approaches based prediction of ischemic stroke. Computer methods and programs in biomedicine, 2016. 130: p. 87-92.

20. Linder, R., et al., Two models for outcome prediction. Methods of information in medicine, 2006. 45(05): p. 536-540.

21. Khosla, A., et al. An integrated machine learning approach to stroke prediction. in Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. 2010. ACM.

22. Adam, S.Y., A. Yousif, and M.B. Bashir, Classification of ischemic stroke using machine learning algorithms. Int J Comput Appl, 2016. 149(10): p. 26-31.

23. Cheng, C.-A., Y.-C. Lin, and H.-W. Chiu. Prediction of the prognosis of ischemic stroke patients after intravenous thrombolysis using artificial neural networks. in ICIMTH. 2014.

24. healthcare dataset stroke data. [cited 2019]; Available from: https://www.kaggle.com/asaumya/healthcare-dataset-stroke-data.

25. Zhu, C., C.U. Idemudia, and W. Feng, Improved logistic regression model for diabetes prediction by integrating PCA and K-means techniques. Informatics in Medicine Unlocked, 2019: p. 100179.

26. Witten, I.H., et al., Data Mining: Practical machine learning tools and techniques. 2016: Morgan Kaufmann.

27. Breiman, L., Random forests. Machine learning, 2001. 45(1): p. 5-32.

28. Han, J., J. Pei, and M. Kamber, Data mining: concepts and techniques. 2011: Elsevier.


29. Claesen, M., et al. Hyperparameter tuning in python using optunity. in Proceedings of the International Workshop on Technical Computing for Machine Learning and Mathematical Engineering. 2014.

30. Haq, A.U., et al., A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Information Systems, 2018. 2018.

31. Davis, J. and M. Goadrich. The relationship between Precision-Recall and ROC curves. in Proceedings of the 23rd international conference on Machine learning. 2006. ACM.

32. Chai, K.M.A. Expectation of F-measures: Tractable exact computation and some empirical observations of its properties. in SIGIR. 2005.

Authors

Hager Ahmed obtained a Master's degree in Computer Science in 2017 and a Bachelor's degree in Information Systems from the Faculty of Computers and Information, Assiut University, Egypt. Hager Ahmed is a research member of the Big Data Team in Egypt, with research interests centered on Big Data Analytics, Data Mining, Sentiment Analysis, Natural Language Processing, Machine Learning, and Streaming Data.

Sara F. Abd-el ghany works at South Valley University. Sara F. Abd-el ghany obtained a Bachelor's degree in Computer Science from the Faculty of Science, South Valley University, Egypt, and a Master's degree in Computer Science in 2016, with research interests centered on Big Data Analytics, Data Mining, and Machine Learning.

Eman Younis is currently working as an Associate Professor at Minia University, Faculty of Computers and Information, Information Systems Department. She received her B.Sc. degree from Zagazig University, Egypt, in 2002, her M.Sc. degree from Menoufia University, Egypt, in 2007, and her Ph.D. degree from Cardiff University, UK, in 2014. She spent some time as a post-doc at Nottingham Trent University, UK. Her research interests are machine learning, data mining, geo-spatial data processing, semantic web, sentiment analysis, and emotion recognition.

Dr. Nahla F. Omran is currently working as a lecturer of Computer Science, Faculty of Science, South Valley University. She has published over 15 research papers in prestigious international journals and conference proceedings. She has supervised over 15 Ph.D. and M.Sc. students. Dr. Nahla's interests are Big Data, Machine Learning, Algorithms, Cloud Computing, IoT, Data Science, Image Processing, and Data Mining.

Abdelmgeid A. Ali is a Professor in the Computer Science Department, Minia University, El Minia, Egypt. He has published over 80 research papers in prestigious international journals and conference proceedings. He has supervised over 60 Ph.D. and M.Sc. students. Prof. Ali is a member of the International Journal of Information Theories and Applications (ITA). Prof. Ali's interests are Information Retrieval, Software Engineering, Image Processing, Data Security, Metaheuristics, IoT, Digital Image Steganography, and Data Warehousing.