The Significance of Fine Tuning Parameters in Supervised Machine Learning Techniques for Diabetic Disease Prediction

364 ISSN: 2005-4238 IJAST

P. Kalaiyarasi, Dr. J. Suguna

Ph.D. Research Scholar, Associate Professor,

Department of Computer Science, Vellalar College for Women, Erode, Tamilnadu, India.

Abstract

In health care analysis, data mining plays a significant role in disease prediction. At present, a large share of deaths in our society is due to diabetic disease, and the mortality rate of diabetic patients has increased every year. Researchers are working with various machine learning approaches that help health care professionals diagnose the disease at an early stage, and various classification mechanisms for this purpose exist in the literature. In this paper, the supervised machine learning algorithms Decision Tree (DT), K Nearest Neighbor (KNN) and Support Vector Machine (SVM) are used for the prediction of diabetic disease. The purpose of this study is to emphasize the importance of hyper-parameter tuning in improving the performance of classifiers. The results obtained are compared with those of the DT, KNN and SVM classifiers before fine tuning the parameters. The results show that hyper-parameter tuning improves the performance of the model. Finally, the results are evaluated using various validation metrics.

Keywords: Classification, Diabetic disease, Hyper parameter tuning, Decision Tree, K Nearest Neighbor and Support Vector Machine

1. Introduction

Diabetes is a chronic disease that occurs when the human pancreas does not produce enough insulin, or when the body cannot effectively use the insulin it produces, which leads to a rise in blood glucose levels. Diabetic disease is divided into several distinct forms. The two major clinical types, as defined by the etiopathology of the disorder, are Type 1 diabetes (T1D) and Type 2 diabetes (T2D). T2D is the most common form (about 90% of all diabetic patients) and is essentially characterized by insulin resistance. In diabetic disease, high blood glucose damages the blood vessels and the nerves that control the heart and blood vessels.

Several risk factors are associated with Type 2 diabetes [13]:

• Family history of diabetes

• Overweight

• Unhealthy diet

• Hypertension

• Hypoglycemia (low blood glucose)

• Hyperglycemia (high blood glucose)

People with long-standing T2D have a higher chance of developing heart disease in the future [3]. Today, society faces a large number of deaths caused mainly by diabetic disease. According to a report by the World Health Organization, major stroke, heart attack and various circulatory diseases contribute to the increase in the death rate [6]. Diabetic disease can be predicted from various symptoms, but it is difficult to predict the disease in a short time. The health care industry possesses a large amount of health-related data, which is used in decision making. Machine learning approaches have therefore been introduced to deal with these issues by making effective predictions [13].

This paper focuses on predicting diabetic disease effectively by using machine learning algorithms. The data are gathered from the UCI repository. The dataset contains some missing values, which have to be handled before applying the classification algorithms [9]; KNN is used to impute the missing values. The data set is first divided into training data and testing data. A classification model is then built using the training data and evaluated using the testing data.

Once the prediction model is developed, fine tuning of the parameters is used to find the optimal classifier; the main aim is to improve the performance of the classifiers. In the first part of this work, three classifiers, namely DT, KNN and SVM, are used to predict diabetic disease [7]. In the second part, parameter fine tuning is applied to each classifier. The parameters tuned are the Complexity Parameter (CP), Maxdepth, Minisplit, the K value, Cost and Sigma.

Finally, diabetic disease is predicted by the optimal classifiers. The remainder of this paper is organized as follows. Section 2 discusses the related work on health care analytics; Section 3 provides the details of the proposed system; Section 4 presents the results and discussion; and Section 5 gives the conclusion.

2. Related Work

Jabbar MA et al. [5] performed heart disease prediction using the K-Nearest Neighbor and Particle Swarm Optimization algorithms. Feature subset selection was introduced with the aim of increasing accuracy and minimizing run time. This model helped physicians predict the disease efficiently using the predominant features. Kittipol et al. [8] predicted heart disease based on feature selection, using a multilayer perceptron with back-propagation and the k-nearest neighbor algorithm. The results showed improved prediction of heart disease, which may help doctors in making decisions.

Sanjay Kumar et al. [11] used the machine learning algorithms K Nearest Neighbor, Decision Tree, Naïve Bayes and Support Vector Machine to predict heart disease. From the results it was concluded that the Naïve Bayes classifier performed best among these algorithms. Sarwar et al. [12] proposed a Naive Bayes approach to predict Type-2 and Type-3 diabetes. Type-2 diabetes arises from the growth of insulin resistance. Accurate prediction was achieved by Naive Bayes.

Iwan Syarif et al. [4] discussed Grid Search and the Genetic Algorithm (GA) for optimizing SVM parameters. SVM parameter optimization using GA can be used to overcome the limitations of grid search. The results showed that SVM parameter optimization using grid search always finds a near-optimal parameter combination within the given ranges. However, grid search was slow, so it was reliable only for low-dimensional datasets with few parameters. GA proved to be more stable than grid search. Luis Carlos et al. [10] analyzed the effectiveness of Random Search against Estimation of Distribution Algorithms, namely the Univariate Marginal Distribution and Boltzmann Univariate Marginal Distribution Algorithms. From the results it was concluded that optimal hyper-parameters can be found without increasing the complexity of Random Search.

Ahmed et al. [1] proposed a genetic algorithm to find the best parameter values for the decision tree approach. The optimal values help to construct an accurate decision tree that achieves high-quality results. According to the results, the Genetic Algorithm Based Decision Tree (GADT) gives more efficient results than the traditional decision tree algorithm. Anbarasi et al. [2] focused on the clinical diagnosis of heart disease. Classifiers such as Naive Bayes and Decision Tree were used for prediction. A genetic algorithm was used to determine the attributes that contribute most to the diagnosis of heart ailments, which indirectly reduces the number of tests a patient is required to take.

3. Proposed Methodology

The diabetic dataset for the proposed system was collected from the UCI repository. The dataset is first pre-processed using the KNN algorithm. It is then classified using three machine learning algorithms, namely Decision Tree, K-Nearest Neighbor and Support Vector Machine. Parameter fine tuning is used to increase the performance of these classifiers. Finally, the performance of the classifiers is measured.

The architecture of the proposed system is as follows.

[Figure 1 stages: Data set -> Preprocessing (KNN) -> Classification Models (DT, KNN, SVM) -> Fine Tuning parameters (CP, Maxdepth, Minisplit, K, Cost, Sigma) -> Prediction -> Evaluation of Results]

Figure 1. Proposed System Architecture

A) Data Set

The diabetic data set used for this research work contains two classes and 9 attributes, such as pregnancies, glucose, blood pressure, insulin, BMI and diabetes pedigree function. It is summarized in Table 1.

Table 1. Dataset Information

Dataset                     Instances   Attributes
Pima Indian diabetic data   769         9

B) Pre-Processing

The dataset contains some unwanted or missing values. These values may affect the prediction of the disease, so they must be dealt with before classification. In this work, the KNN algorithm is used to impute the missing values. For a categorical attribute it predicts the most common value among the nearest neighbors; for a continuous attribute the mean of the k nearest neighbors is taken. The KNN imputation algorithm is as follows:

Step 1: Choose the value of K; the initial value is K = 5.

Step 2: Calculate the distance between the instance with the missing value and each training instance. Euclidean distance is used for calculating the distance.

Step 3: After calculating the distances, select the data values with the minimum distance. If K is 5, the five values with the smallest distances are chosen.

Step 4: Calculate the mean of the chosen values: Mean = sum of all the values / total number of values.

Step 5: Impute the mean as the output value for the missing data.

After pre-processing, the data set is passed to the classification phase.
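Steps 1 to 5 can be sketched in Python as follows. This is a minimal illustration and not the implementation used in this work (which was carried out in R): the function name `knn_impute`, the toy rows, and the restriction to a single column with missing values are assumptions made for the example.

```python
import math

def knn_impute(rows, target_col, k=5):
    """Replace None in target_col by the mean of the k nearest complete rows.

    Assumes (for illustration only) that target_col is the only column
    with missing values.
    """
    complete = [r for r in rows if r[target_col] is not None]
    other_cols = [c for c in range(len(rows[0])) if c != target_col]
    result = []
    for row in rows:
        new_row = list(row)
        if row[target_col] is None:
            # Step 2: Euclidean distance to every complete instance
            dists = sorted(
                (math.sqrt(sum((row[c] - r[c]) ** 2 for c in other_cols)), r[target_col])
                for r in complete
            )
            # Step 3: values of the k closest instances
            nearest = [value for _, value in dists[:k]]
            # Steps 4 and 5: impute their mean
            new_row[target_col] = sum(nearest) / len(nearest)
        result.append(new_row)
    return result

# Toy data (hypothetical): column 0 is fully observed, column 1 has one gap.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [10.0, 100.0], [2.5, None]]
filled = knn_impute(rows, target_col=1, k=2)
print(filled[-1])
```

With k = 2 the two closest complete rows are those at 2.0 and 3.0, so the gap is filled with the mean of their values. On real data the attributes would usually be scaled first so that no single attribute dominates the distance.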

C) Classification

In the healthcare industry, supervised machine learning approaches are mainly used to predict disease. In this work, three different classification algorithms, namely Decision Tree (DT), K-Nearest Neighbor (KNN) and Support Vector Machine (SVM), are employed to classify the presence or absence of the disease. Parameter fine tuning is used to increase the performance of each classifier.

Decision Tree (DT)

A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is the root node. The tree is constructed in a top-down, recursive, divide-and-conquer manner [14]. Most decision tree algorithms follow this top-down approach, which starts with a training set of tuples and their associated class labels. The training set is recursively divided into smaller subsets as the tree is built. In this work, information gain (IG) is used for selecting the root node of the tree. Information gain measures the expected reduction in entropy caused by partitioning the data on the values of a feature Fj. It is used to select the best feature at each step of growing a decision tree.

$$IG(S, F_j) = \mathrm{Entropy}(S) - \sum_{v_i \in V_{F_j}} \frac{|S_{v_i}|}{|S|}\, \mathrm{Entropy}(S_{v_i}) \qquad (1)$$

where $V_{F_j}$ is the set of all possible values of feature $F_j$ and $S_{v_i}$ is the subset of $S$ for which feature $F_j$ has value $v_i$. In this paper, diabetic data attributes such as glucose, blood pressure, BMI and pregnancies are randomly chosen for developing the decision tree for predicting diabetic disease. According to the values of these four attributes, a patient falls into one of the two classes. Fig. 2 shows the tree representation of the diabetic data set.

Figure 2. Decision tree representation of diabetic disease

The four attributes selected can change with variation in the training and testing data. This classifier also reveals the importance of particular attributes in predicting the risk of diabetic disease.
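The information gain criterion described above (the reduction in entropy when S is split by the values of a feature Fj) can be illustrated with a short Python sketch. The function names and the discretized toy values ("high"/"low") are assumptions for illustration; they are not taken from the diabetic data set.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy example: 1 = diabetic, 0 = non-diabetic
labels = [1, 1, 0, 0]
glucose = ["high", "high", "low", "low"]   # separates the classes perfectly
other = ["a", "b", "a", "b"]               # carries no information about the class

print(information_gain(glucose, labels))
print(information_gain(other, labels))
```

The feature with the larger gain would be chosen for the split; here the first toy feature removes all uncertainty about the class while the second removes none.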

K-Nearest Neighbor (KNN)

The KNN classifier is based on learning by analogy, that is, by comparing a given test tuple with the training tuples that are similar to it. KNN classifies objects according to the nearest training samples in the feature space. It is a basic type of instance-based learning, or lazy learning.

It assumes that all instances correspond to points in an n-dimensional space. A distance measure is required to determine the "closeness" of instances. KNN classifies an instance by finding its nearest neighbors and picking the most popular class among them [14]. Closeness is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two data points is

$$d(x, y) = \sqrt{\sum_{j} (x_j - y_j)^2} \qquad (2)$$

where x is the new point, y is an existing point, and j runs over the attributes.
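A minimal Python sketch of this procedure is given below: the Euclidean distance of Equation (2) followed by a majority vote among the k nearest training points. The function names and the toy points are hypothetical and only serve to illustrate the idea.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))

def knn_predict(train_X, train_y, query, k=5):
    """Return the most common class among the k training points nearest to query."""
    neighbours = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (2, 2), k=3))
print(knn_predict(train_X, train_y, (8.5, 8.5), k=3))
```

An odd k is used so that a two-class vote cannot end in a tie.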

Support Vector Machine (SVM)

SVM is a method for classifying both linear and nonlinear data. It uses a nonlinear mapping to transform the original training data into a higher dimension. In the new dimension, it searches for the optimal linear separating hyperplane. With a suitable nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane. The SVM finds this hyperplane using support vectors and margins. Generally, the larger the margin, the lower the generalization error of the classifier.

The hyperplane separating the classes has the form

$$w^T x + b = 0 \qquad (3)$$

where w is the weight vector, x is the input vector and b is the bias.

In a linear SVM, a data point is viewed as a p-dimensional vector (a list of p numbers), and the data points are separated using a (p-1)-dimensional hyperplane. Many hyperplanes may separate the data linearly, but the best hyperplane is considered to be the one that maximizes the margin between the two classes. For nonlinear data, the original input data are transformed into a higher-dimensional space using a nonlinear mapping [14]. Once the data have been transformed into the new, higher-dimensional space, a linear separating hyperplane is sought in that space. The maximal marginal hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the original space.
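The role of the hyperplane in Equation (3) can be illustrated with a small Python sketch. The weight vector and bias below are hand-picked for the example (they are not learned from data), and the margin formula 2/||w|| assumes the usual canonical scaling in which the support vectors satisfy |w^T x + b| = 1.

```python
import math

def decision_value(w, x, b):
    """Signed value of w^T x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, x, b):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def margin_width(w):
    """Width of the margin, 2 / ||w||, under the canonical scaling."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hand-picked (hypothetical) hyperplane: x1 + x2 - 3 = 0
w, b = (1.0, 1.0), -3.0
print(classify(w, (3.0, 2.0), b))   # point above the line
print(classify(w, (0.0, 1.0), b))   # point below the line
print(margin_width((3.0, 4.0)))
```

Training an SVM amounts to choosing w and b so that this margin is as large as possible while the training points are classified correctly.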

D) Fine Tuning Parameters

Fine tuning of parameters is a step in building a machine learning predictive model that is used to improve the accuracy of the classifier. In this paper, parameter fine tuning is applied to three classifiers, namely DT, KNN and SVM, in order to improve their accuracy.

DT Parameters Tuning

Tree pruning is performed to improve classifier accuracy. It identifies and removes the irrelevant branches of a tree. There are two types of pruning, pre-pruning and post-pruning. In pre-pruning, the tree is checked for overfitting while it is being built, and growth is stopped early. In post-pruning, the tree is first built in full, and branches and leaves are then removed.

The decision tree has hyper-parameters that require fine tuning in order to derive the best possible model and to reduce the generalization error as much as possible. In this work three parameters are used.

Complexity Parameter (CP): It controls the size of the decision tree and is used to select the optimum tree size.

Max depth: The maximum depth to which the tree is allowed to grow, that is, the largest number of splits between the root node and any leaf.

Minisplit: The minimum number of observations that a node must contain before a split is attempted.

KNN Parameter Tuning: A parameter value is used to control the learning process. In KNN, the K value is the parameter considered for tuning.

SVM Parameters Tuning: To increase the performance of the SVM classifier, the tuning parameters used are Cost, Kernel and Sigma.

Cost: It is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training data correctly.

Kernel: It selects the type of hyperplane used to separate the data. The kernel defines the distance measure between new data and the support vectors [15].

Sigma: It is a parameter for non-linear hyperplanes. The higher the sigma value, the more closely the model tries to fit the training data set.

The default values of the tuning parameters are set to kernel = 'rbf' (the kernel options being linear, radial and polynomial), C = 0.00 and sigma = 0.00. Changing the values of the tuning parameters can improve the classification performance.
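The effect of Sigma can be illustrated with a small Python sketch of a radial basis function kernel. The sketch assumes the parameterisation k(x, y) = exp(-sigma * ||x - y||^2), which is the form used by the kernlab package in R; the paper does not state which package or formula was used, so this is an assumption.

```python
import math

def rbf_kernel(x, y, sigma):
    """RBF kernel under the assumed form exp(-sigma * squared Euclidean distance)."""
    sq_dist = sum((xj - yj) ** 2 for xj, yj in zip(x, y))
    return math.exp(-sigma * sq_dist)

# Similarity between two points one unit apart, for increasing sigma
for sigma in (0.01, 0.06, 1.0):
    print(sigma, rbf_kernel((0.0, 0.0), (1.0, 0.0), sigma))
```

Under this form, a larger sigma makes the similarity fall off more quickly with distance, so each support vector influences only its close neighbourhood and the decision boundary follows the training points more closely.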

4. Result and Discussion

The research work is implemented using the R tool. The data are gathered from the UCI repository. In this work three classifiers, namely DT, KNN and SVM, are used for classifying the presence or absence of the disease. The prediction model is developed using these machine learning algorithms. Once the prediction model is developed, parameter fine tuning is applied to find the optimal classifiers.

DT Parameters optimization using Grid search

Grid search is employed for finding the best hyper-parameters of a model. Here it is used to optimize the values of the CP, Max depth and Minisplit parameters. The following table shows the DT parameter tuning accuracy.

Table 2. DT parameter tuning accuracy

CP       Max depth   Minisplit   Accuracy
0        5           100         0.761048
0.0084   6           100         0.762338
0.0090   7           100         0.7748918
0.01     8           100         0.7878788

Figure 3. Tree Vs error rate of CP values

Table 2 and Fig. 3 present the decision tree parameter tuning results. Tree pruning is performed to reduce overfitting of the tree. Accuracy was used to select the best model, taking the largest value. The final values used for the model were CP = 0.01 and max depth = 8.
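Grid search itself can be sketched as an exhaustive loop over all combinations of candidate values, keeping the combination with the best score. The Python sketch below is illustrative only: `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for the accuracy that would, in practice, be obtained by fitting and validating a decision tree for each combination.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Made-up scoring surface that peaks at cp = 0.01, maxdepth = 8 (illustration only)
    return 1.0 - abs(params["cp"] - 0.01) - 0.01 * abs(params["maxdepth"] - 8)

grid = {"cp": [0, 0.0084, 0.0090, 0.01], "maxdepth": [5, 6, 7, 8], "minsplit": [100]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The cost of this procedure grows with the product of the numbers of candidate values, which is why grid search is practical mainly when only a few parameters are tuned.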

KNN Parameter optimization using Grid search

Here a single parameter, the k value, is tuned for the classifier using grid search.

Table 3. KNN Accuracy

K    Accuracy    K    Accuracy
5    0.7191929   15   0.7407407
7    0.7305206   17   0.7468728
9    0.7302970   19   0.7546751
11   0.7323410   21   0.7500175
13   0.7360727   23   0.7488994

Figure 4. K parameter tuning accuracy

Fig. 4 shows the performance of parameter tuning for the KNN classifier. Initially the k value is set to 1, and the k values are then tuned using grid search to obtain the optimal result. As the k value increases, the classification accuracy changes until it reaches the optimal result. The final value selected for this model was k = 19.
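Tuning k in this way can be sketched in Python as follows. The sketch scores each candidate k by leave-one-out accuracy, which is an assumption made for the example (the resampling scheme used in this work is not stated), and the one-dimensional toy data with a single noisy point are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the other points."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

def tune_k(X, y, candidates):
    """Return the candidate k with the highest leave-one-out accuracy, plus all scores."""
    scores = {k: loo_accuracy(X, y, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Hypothetical toy data: two groups plus one noisy point (1.5 labelled 1)
X = [(1.0,), (1.2,), (1.4,), (1.6,), (1.5,), (5.0,), (5.2,), (5.4,), (5.6,)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
best_k, scores = tune_k(X, y, [1, 3, 5])
print(best_k, scores)
```

On this toy data k = 1 is misled by the noisy point, while larger values of k smooth it out, which mirrors the general pattern that accuracy changes with k until a suitable value is reached.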

SVM Parameters optimization using Grid search

Grid search is used to optimize the C parameter (in the linear kernel) and the C and sigma parameters (in the RBF and sigmoid kernels). The following table shows the SVM linear grid accuracy.

Table 4. SVM linear grid for various values of cost and corresponding accuracy

Cost parameter   Accuracy    Cost parameter   Accuracy
NaN              NaN         0.75             0.7852793
0.00             NaN         1.00             0.7864388
0.01             0.7790079   1.25             0.7864388
0.05             0.7818008   1.50             0.7864388
0.10             0.7829401   1.75             0.7870135
0.25             0.7858641   2.00             0.7864388
0.50             0.7864388   5.00             0.7858641

Figure 5. SVM linear grid using tuning parameter accuracy

Table 4 and Fig. 5 present the SVM linear grid accuracy for various values of the cost parameter. From the results obtained, the final value used for this model was C = 1.75, with an accuracy of 0.7870135.

Table 5. Different values of the tuning parameters Cost and Sigma

Sigma   Cost   Accuracy
0.00    0.00   Null
0.010   0.05   0.6389047
0.020   0.10   0.6782693
0.030   0.25   0.7707592
0.040   0.50   0.7667772
0.050   0.75   0.7702766
0.060   1.00   0.7749453

Figure 6. Tuning parameters Cost and Sigma at different values with corresponding accuracy

Fig. 6 shows the accuracy of the SVM radial grid for the tuning parameters. As the cost and sigma values increase, the classification accuracy changes until it reaches the optimal result. The final values used for the model were C = 1.00 and Sigma = 0.060.

Performance Metrics

A confusion matrix is a table used to show the performance of a classifier. The performance metrics are derived from the confusion matrix, which counts the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) obtained when classifying the diabetic data set.

In this work three performance metrics are used, namely Sensitivity, Specificity and F-Measure:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{F\text{-}Measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where precision = TP / (TP + FP) and recall is the same quantity as sensitivity.

Table 6. Classification accuracy

Algorithm   Accuracy (%) before tuning parameters   Accuracy (%) after tuning parameters
DT          74.78                                   78.79
KNN         69.57                                   71.03
SVM         72.40                                   73.44

Figure 7. Performance analysis of classification algorithms

Fig. 7 shows the performance of the classification algorithms. From the results it is found that DT yields better prediction accuracy than KNN and SVM.
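The sensitivity, specificity and F-measure defined under Performance Metrics can be computed from confusion-matrix counts as in the following Python sketch. The counts used here are hypothetical and are not results of this study.

```python
def sensitivity(tp, fn):
    """True positive rate (also called recall): TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

def precision(tp, fp):
    """Fraction of predicted positives that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical confusion-matrix counts
tp, fn, tn, fp = 40, 10, 30, 20
print(sensitivity(tp, fn), specificity(tn, fp), f_measure(tp, fp, fn))
```

Reporting sensitivity and specificity side by side is useful for a medical data set, because a classifier can reach a high accuracy while still missing many cases of one class.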

Table 7. Performance analysis

            Before tuning parameters                 After tuning parameters
Algorithm   Sensitivity   Specificity   F-Measure   Sensitivity   Specificity   F-Measure
DT          0.85          0.49          0.62        0.91          0.46          0.66
KNN         0.82          0.40          0.53        0.86          0.43          0.57
SVM         0.83          0.45          0.58        0.84          0.48          0.63

Figure 8. Results of Validation Metrics

Table 7 and Fig. 8 present the validation metrics of the machine learning algorithms. The results indicate that parameter tuning improves classification performance compared with the untuned classifiers: sensitivity and F-measure increase for all three classifiers, and specificity increases for KNN and SVM.

5. Conclusion and Future Work

In this work, the importance of parameter fine tuning in three classification algorithms, DT, KNN and SVM, for predicting diabetic disease was discussed and analyzed. From the results, it is found that the DT algorithm performs better than the other two algorithms, KNN and SVM; that is, the DT algorithm is best suited for predicting diabetic disease after fine tuning the parameters. In future, Particle Swarm Optimization and Genetic Algorithms can be applied to optimize the parameter values.

References

1. Ahmed I. Taloba, Adel A. Sewisy, Safaa S. I. Ismail, "Parameter Tuning in Decision Tree Based on Genetic Algorithm for Text Classification", International Journal of Scientific & Engineering Research, Vol 10, Issue 3, 2019.

2. Anbarasi M, Anupriya E, Iyengar C.N, "Enhanced Prediction of Heart Disease with Feature Subset Selection Using Genetic Algorithm", International Journal of Engineering Science and Technology, Vol 2, Issue 10, 2010.

3. Han Wu, Shengqi Yang, Zhangqin Huang, Jian He, Xiaoyi Wang, "Type 2 diabetes mellitus prediction model based on data mining", Informatics in Medicine Unlocked, Vol 10, 2018.

4. Iwan Syarif, Adam Prugel-Bennett and Gary Wills, "SVM Parameter Optimization Using Grid Search and Genetic Algorithm to Improve Classification Performance", TELKOMNIKA, Vol 14, Issue 4, pp. 1502-1509, 2016.

5. Jabbar MA, "Prediction of heart disease using k-nearest neighbor and particle swarm optimization", Biomedical Research, 2017.

6. Jerez J, I. Molina, P. Garcia-Laencina, E. Alba, N. Ribelles, M. Martín and L. Franco, "Missing data imputation using statistical and machine learning methods in a real breast cancer problem", Artif. Intell. Med., Vol 50, Issue 2, pp. 105-115, 2010.

7. Kalaiyarasi P and Suguna J, "The Effect of imbalance in Diabetic Disease Prediction by using Machine Learning from Healthcare Communities", Journal of Advanced Research in Dynamical & Control System, Vol 10, Special Issue 14, pp. 1135-1141, 2018.

8. Kittipol Wisaeng, "Predict The Diagnosis Of Heart Disease Using Feature Selection And K-Nearest Neighbor Algorithm", Applied Mathematical Sciences, Vol 8, 2014.

9. Konstantina Kourou, Themis P. Exarchos, Konstantinos P. Exarchos, Michalis V. Karamouzis, Dimitrios I. Fotiadis, "Machine learning applications in cancer prognosis and prediction", Computational and Structural Biotechnology Journal, Vol 13, 2015.

10. Luis Carlos Padierna, Martín Carpio, Alfonso Rojas, Héctor Puga, Rosario Baltazar and Héctor Fraire, "Hyper-Parameter Tuning for Support Vector Machines by Estimation of Distribution Algorithms", Springer International Publishing, Studies in Computational Intelligence 667, DOI 10.1007/978-3-319-47054-2_53, pp. 787-800, 2017.

11. Sanjay Kumar Sen, "Predicting and Diagnosing of Heart Disease Using Machine Learning Algorithms", International Journal Of Engineering And Computer Science, ISSN: 2319-7242, Vol 6, Issue 6, pp. 21623-21631, 2017.

12. Sarwar A. and Sharma V., "Intelligent Naïve Bayes Approach to Diagnose Diabetes Type-2", Special Issue of International Journal of Computer Applications (0975-8887) on Issues and Challenges in Networking, Intelligence and Computing Technologies-ICNICT 2012, Vol 3, Issue 6, 2012.

13. Seokho Kang, Pilsung Kang, Taehoon Ko, Sungzoon Cho, Su-jin Rhee, Kyung-Sang Yu, "An efficient and effective ensemble of support vector machines for anti-diabetic drug failure prediction", Expert Systems with Applications, pp. 4265-4273, 2015.

14. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2012.

15. https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-75821539476.