Thyroid Disease Classification Using Decision Tree and SVM


K. Dharmarajan1, K. Balasree2, A.S. Arunachalam3, K. Abirmai4

1Associate Professor, Department of Information Technology, 2Research Scholar, Department of Computer Science, 3Associate Professor, Department of Computer Science, VISTAS, 4Assistant Professor, Department of BCA, Guru Nanak College, Chennai, India

Abstract

Thyroid disease is one of the diseases that is increasing day by day owing to changes in lifestyle, and it is very common among humans. Thyroid disorders are conditions that affect the thyroid gland, a butterfly-shaped gland at the front of the neck. The gland is located below the Adam’s apple and is wrapped around the trachea. Thyroxine, also known as T4, is the primary hormone produced by the gland. Thyroid hormone regulates numerous metabolic mechanisms throughout the body. Females are more affected by thyroid disease than males. There are two main types of thyroid dysfunction: hyperthyroidism and hypothyroidism. Hyperthyroidism produces too much thyroid hormone in the blood, and hypothyroidism produces too little. Thyroid function is controlled by the pituitary gland and the hypothalamus, so disorders of these tissues can also affect thyroid function and cause thyroid problems. Specific types of thyroid disorders include hypothyroidism, hyperthyroidism, goiter, thyroid nodules and thyroid cancer. This paper describes the diagnosis of thyroid disorders using decision tree attribute splitting rules. The proposed method classifies thyroid nodules accurately and efficiently. In this study, a comparative thyroid disease diagnosis was performed using the machine learning techniques Support Vector Machine (SVM), Naïve Bayes and Decision Trees. The classification accuracy is 99.89%. This result is very efficient when compared to our previous work that used the decision tree.

Keywords: k-Nearest Neighbours, Support Vector Machine, Decision Tree, Naïve Bayes.

Introduction

Classification techniques play a vital role in analyzing diseases and in providing facilities that reduce costs for patients. Nowadays, people increasingly suffer from diseases such as diabetes, heart disease, typhoid, tuberculosis and kidney disease. Thyroid disease affects people worldwide and becomes a serious health problem for those affected[1]. In India, it is estimated that about 42 million people suffer from thyroid disorders. Symptoms include weight gain, tiredness, weakness and feeling cold[2]. The hormones made in the thyroid gland affect almost every organ in the body, including the heart. The disorders include hypothyroidism, hyperthyroidism, and goiter and deficiency disorders. Hypothyroidism can cause the heart to beat more slowly, whereas hyperthyroidism causes a fast heartbeat[3][4]. Elevated levels of thyroid hormones can also raise blood pressure. Symptoms include weight loss or weight gain, a swollen neck, changes in heart rate, hair loss, and others such as vision problems, diarrhea, menstrual irregularities in women, trembling hands and muscle weakness[7][8].

Goiter and iodine deficiency: Goiter is an abnormal enlargement of the thyroid gland. Iodine is an element needed for the production of thyroid hormone. Recent population studies show that about 12% of adults have a palpable goiter[9].


This study identifies recognized variables (input and output) for diagnosing thyroid disease from the published research, and develops an integrated framework model that is validated with a decision-tree model. A variety of algorithms, including decision trees, random forest, support vector machine, artificial neural network and logistic regression, have been widely used in the development of predictive models of thyroid disease.

Thyroid function testing is the most used diagnostic evaluation in endocrine practice. It is used as a screening tool, to verify the clinical diagnosis of hyper- and hypothyroidism, to assess the adequacy of medical treatment, and in the follow-up of differentiated thyroid cancer[10][11].

An integrated model may assist patients and doctors in handling thyroid disease with care, and may yield suggestions for social development. This involves classifying the possible variables (both input and output) that affect the diagnosis of thyroid disease and investigating the relationships among those variables[6].

Pre-processing: The types of pre-processing are data cleansing, data editing, data reduction and data wrangling. Pre-processing resolves several kinds of problems, including noisy data, redundant data and missing values[2][8]. High-quality data lead to high-quality results and reduce the cost of data mining. Missing data are handled during pre-processing so that the whole data set can be processed[13]. Missing and not-a-number (NaN) values are detected using a masking method. If missing or NaN values are present, they are replaced by the mean value of the column. More generally, pre-processing refers to a program that processes its input data to produce output that is used as input to another program, such as a compiler[11].
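The masking and mean-replacement step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `impute_column_means`, the sample readings and the rule for a column with no observed values are all assumptions made for illustration.

```python
import math

def impute_column_means(rows):
    """Replace missing entries (None or NaN) with the mean of their column."""
    n_cols = len(rows[0])
    # Mask: True wherever a value is missing or not a number.
    mask = [[v is None or (isinstance(v, float) and math.isnan(v)) for v in row]
            for row in rows]
    means = []
    for j in range(n_cols):
        observed = [rows[i][j] for i in range(len(rows)) if not mask[i][j]]
        # Assumption: a column with no observed values is filled with 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    # Keep observed values, substitute the column mean where the mask is True.
    return [[means[j] if mask[i][j] else rows[i][j] for j in range(n_cols)]
            for i in range(len(rows))]

# Hypothetical readings; the three columns are illustrative only.
samples = [
    [2.0, 1.0, 100.0],
    [float("nan"), 3.0, 120.0],
    [4.0, None, 80.0],
]
cleaned = impute_column_means(samples)
```

Computing the mean only over observed values keeps the missing entries from distorting the replacement value.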

Dataset Explanation:

1. The data were collected from 500 thyroid patients.

2. The data consist of blood samples.

3. There are three categories of patients.

Thyroid Classification: Classification is a machine learning task that predicts classes according to some constraints. It is a supervised learning task in data mining[2][7]. The main aim of a classification problem is to determine the class of new data.

Various classification algorithms are used to determine the classes. In this work, classification algorithms are applied to the thyroid dataset.

Specific kinds of thyroid disorders:

• Hypothyroidism

• Hyperthyroidism

• Goiter

• Thyroid nodules

• Thyroid cancer

Method and Methodology

K-Nearest Neighbor (K-NN): The k-nearest-neighbor algorithm is often abbreviated as k-NN. It is a classification method that estimates how likely a data point is to be a member of one group or another, depending on which group the data points nearest to it belong to. k-NN is also called a “lazy learner” algorithm because it does not build a model from the training set until a query on the data set is performed[9].

As a classification algorithm, it decides whether a data point belongs to group A or group B by looking at the states of the points nearest to it[2]. A sample of the patient data within a chosen range of the point is taken and analyzed. If there are many points of group A in that range, the data point is likely to be A rather than B, and vice versa.

Algorithm: A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common among its K nearest neighbors, as measured by a distance function. If K = 1, the case is simply assigned to the class of its nearest neighbor. The three distance measures are valid only for continuous variables.

For categorical variables, the Hamming distance must be used. This also raises the issue of standardizing numerical variables to the range 0 to 1 when the dataset contains a mixture of numerical and categorical variables.
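The majority-vote rule can be sketched as follows. This is an illustrative example rather than the code used in the study. The Euclidean distance is assumed as the measure for continuous variables, and the training points, already scaled to the range 0 to 1, are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance for continuous variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Distance for categorical variables: number of positions that differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Assign the query to the class most common among its k nearest neighbors."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    # On a tie, the class encountered first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]

# Hypothetical cases with two features standardized to [0, 1].
points = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
          (0.80, 0.90), (0.90, 0.80), (0.85, 0.85)]
labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(points, labels, (0.2, 0.2), k=3)
near_b = knn_predict(points, labels, (0.7, 0.8), k=1)
```

No model is built in advance; all of the work happens when a query arrives, which is why k-NN is called a lazy learner.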


Naive Bayes: The naive Bayes classifier is a probabilistic learning method based on Bayes’ theorem. The classifier assumes that the value of one attribute does not depend on the value of any other attribute, that is, the presence or absence of a particular attribute does not affect the others in the prediction process. Suppose there are m classes K1, K2, …, Km and an unidentified data sample X. The naive Bayesian classifier assigns the unknown sample X to the class Ki that has the highest probability[3][9]:

$$P(K_i \mid X) > P(K_j \mid X) \quad \text{for } 1 \le j \le m,\ j \ne i \qquad (2)$$

Applications:

Real-time prediction: Naive Bayes is an eager learning classifier and is very fast, so it can be used for making predictions in real time.

Multi-class prediction: The algorithm is also well known for its multi-class prediction capability.

Text classification, spam filtering and sentiment analysis: Naive Bayes classifiers are mostly used in text classification, where they have a higher success rate when compared to other algorithms. As a result, they are widely used in spam filtering and sentiment analysis.

Recommendation system: A naive Bayes classifier and collaborative filtering together build a recommendation system that uses machine learning and data mining techniques to filter unseen information and to predict whether a user would like a given resource[13].
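The decision rule in equation (2) can be illustrated with a small Gaussian naive Bayes sketch. The Gaussian likelihood for continuous attributes is an assumption made here; the paper does not state which likelihood was used. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-attribute mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posteriors(model, x):
    """Unnormalized log P(K | x); attributes are treated as independent."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[label] = score
    return scores

def nb_predict(model, x):
    """Pick the class Ki whose posterior exceeds that of every other class."""
    scores = log_posteriors(model, x)
    return max(scores, key=scores.get)

# Hypothetical samples with two continuous attributes and two classes.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y = ["K1", "K1", "K1", "K2", "K2", "K2"]
model = fit_gaussian_nb(X, y)
```

Summing log-probabilities instead of multiplying probabilities avoids numerical underflow when there are many attributes.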

Support Vector Machine (SVM): The Support Vector Machine is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification. Each data item is plotted as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that separates the categories[11],[12].
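The role of the hyperplane can be shown with a short sketch. It does not train an SVM: the weights `w` and offset `b` below are hand-picked for a hypothetical two-feature data set, not produced by a solver. The sketch only shows how a given hyperplane classifies points and how its margin and hinge loss are measured.

```python
import math

def decision_value(w, b, x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin boundaries w . x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def mean_hinge_loss(w, b, points, labels):
    """Average hinge loss, the penalty term used by soft-margin linear SVMs."""
    losses = [max(0.0, 1.0 - y * decision_value(w, b, x))
              for x, y in zip(points, labels)]
    return sum(losses) / len(losses)

# Hypothetical, linearly separable data with labels -1 and +1.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (4.0, 4.0), (5.0, 4.0), (4.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Assumed hyperplane 0.4*x1 + 0.4*x2 - 2.2 = 0 (chosen by hand for this example).
w, b = (0.4, 0.4), -2.2
predictions = [classify(w, b, x) for x in points]
```

A wider margin leaves more room between the classes, which is why SVM training looks for the separating hyperplane with the largest margin.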

Table 1: Percentage of people affected by thyroid disease

Gender    2014    2015    2016    2017

Male      20%     10%     12%     15%

Female    50%     60%     45%     56%

Fig 1. People affected by thyroid disease, by gender and year.

Mentally disturbed. All newborns are given a screening blood test in hospital to evaluate thyroid function.

Accuracy Metrics: The accuracy of this model is very high, at 97.3%. High accuracy alone can nevertheless be misleading, for example when:

• the positive is a sick person carrying a disease that spreads very quickly;

• the positive represents a fraud case;

• the positive represents a terrorist whom the model labels as a non-terrorist.

In such cases, the cost of a misclassified actual positive is very high.

Precision and Recall

Two further metrics are:

• Precision: the proportion of predicted positives that are actually positive.

• Recall: the proportion of actual positives that the model labels as positive.

Precision: Precision describes how precise the model is: of the cases predicted positive, how many are actually positive[11]. Precision is a good measure when the cost of a false positive is high, as in email spam detection. There, a false positive means that a non-spam email has been identified as spam. The user might lose important emails if the precision of the spam detection model is not high.

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

Recall: The same logic applies to recall. Recall is calculated as:

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} = \frac{\text{True Positive}}{\text{Total Actual Positive}}$$

Recall measures how many of the actual positives the model captures by labeling them as positive. By the same reasoning, recall should be the metric used to select the best model when a high cost is associated with false negatives.

Examples are fraud detection and sick patient detection. If a fraudulent transaction is predicted as non-fraudulent, the consequence can be very bad. In sick patient detection, if a sick patient goes through the test and is predicted as not sick, the cost associated with the false negative is extremely high if the sickness is contagious.

F1 Score: Having discussed precision and recall at length, we cannot avoid another measure, F1, which is a function of precision and recall:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The F1 score is needed when seeking a balance between precision and recall. The difference between the F1 score and accuracy is that, as seen previously, accuracy can be largely driven by a large number of true negatives, which in most business circumstances are not the focus, whereas false negatives and false positives usually carry business costs. The F1 score is therefore the better measure when a balance between precision and recall is needed[15].
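The three measures can be computed directly from confusion-matrix counts, as in the following sketch. The counts are hypothetical and are not taken from the study; they are chosen so that accuracy is high while recall is low, which is the situation the text warns about.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def accuracy(tp, tn, fp, fn):
    """Share of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts: many true negatives, few actual positives.
tp, fp, fn, tn = 8, 2, 8, 982

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
acc = accuracy(tp, tn, fp, fn)
```

With these counts the accuracy is 0.99 even though only half of the actual positives are found (recall 0.5), and the F1 score of about 0.62 reflects that imbalance.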

Fig 2: Overall percentage of people affected by thyroid disease.

Result Analysis:

Table 2: Comparison of classification algorithm accuracy

Classification Model    Naive Bayes    Decision Tree    SVM      KNN

Accuracy                91.62%         97.35%           95.3%    94.2%

The impact of certain attributes on the accuracy of the classification model was examined. The following attributes were ignored: query on thyroxine, hypothyroid and hyperthyroid. The classification model based on the decision tree obtained the best accuracy (97.35%), while Naïve Bayes obtained the weakest classification.

Accuracy of the classification model after removing three of the model attributes.


Feature selection identifies the attributes most relevant for classification, and it is broadly categorized into subset selection methods and ranking methods.

Conclusion

Various data mining and machine learning techniques are available for medical datasets, and an important aim of medical data mining is to increase the accuracy and efficiency of disease diagnosis. The main objective of this research is to show the variance of thyroid after 90 days and 60 days from the available raw medical dataset; the various splitting rules for decision tree attribute selection were analysed and compared. This helps to diagnose thyroid diseases through the extracted rules. It is clear that normalized-based splitting rules have high accuracy and sensitivity (true positive rate). The data mining technique is applied to the hypothyroid and hyperthyroid dataset to determine the positive and negative values in the entire dataset. The experimental results show that, when the male and female datasets are compared, females are more affected than males. Accuracy, precision and recall were improved by comparing the decision tree and the support vector machine. Further enhancement can be made by using various optimization algorithms or rule extraction algorithms. Future work will validate multiple disease datasets simultaneously, such as heart disease and diabetes.

Conflict of Interest: Taken from...committee

Source of Funding: Self

Ethical Clearance: Nil

Reference

1. Ahmed, Jamil, and M. Abdul Rehman Soomrani. “TDTD: Thyroid disease type diagnostics.” 2016 International Conference on Intelligent Systems Engineering (ICISE). IEEE, 2016.

2. Ammulu, K., and T. Venugopal. “Thyroid data prediction using data classification algorithm.” Int. J. Innov. Res. Sci. Technol 4.2 (2017): 208-212.

3. Begum, Amina, and A. Parkavi. “Prediction of thyroid Disease Using Data Mining Techniques.” 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS). IEEE, 2019.

4. Chang, Chuan-Yu, Ming-Feng Tsai, and Shao-Jer Chen. “Classification of the thyroid nodules using support vector machines.” 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 2008.

5. Dov, David, et al. “A Deep-Learning Algorithm for Thyroid Malignancy Prediction From Whole Slide Cytopathology Images.” arXiv preprint arXiv:1904.12739 (2019).

6. Geetha, K., and S. Santhosh Baboo. “An empirical model for thyroid disease classification using evolutionary multivariate Bayseian prediction method.” Global Journal of Computer Science and Technology (2016).

7. Ioniţa, Irina, and Liviu Ioniţa. “Prediction of thyroid disease using data mining techniques.” BRAIN. Broad Research in Artificial Intelligence and Neuroscience 7.3 (2016): 115-124.

8. Kousarrizi, MR Nazari, F. Seiti, and M. Teshnehlab. “An experimental comparative study on thyroid disease diagnosis based on feature subset selection and classification.” International Journal of Electrical & Computer Sciences IJECS-IJENS 12.01 (2012): 13-19.

9. Margret, J., B. Lakshmipathi, and S. Aswani Kumar. “Diagnosis of thyroid disorders using decision tree splitting rules.” International Journal of Computer Applications 44.8 (2012): 43-46.

10. Prasad, V., T. Srinivasa Rao, and M. Surendra Prasad Babu. “Thyroid disease diagnosis via hybrid architecture composing rough data sets theory and machine learning algorithms.” Soft Computing 20.3 (2016): 1179-1189.

11. Raisinghani, Sagar, et al. “Thyroid Prediction Using Machine Learning Techniques.” International Conference on Advances in Computing and Data Sciences. Springer, Singapore, 2019.

12. Shaik Razia, P. Swathi Prathyusha, N. Vamsi Krishna, and N. Sathya Sumana. “A Comparative study of machine learning algorithms on thyroid disease prediction.” International Journal of Engineering & Technology 7 (2.8) (2018): 315-319.

13. Shaik Razia, P. Swathi Prathyusha, N. Vamsi Krishna, and N. Sathya Sumana. “A Comparative study of machine learning algorithms on thyroid disease prediction.” International Journal of Engineering & Technology (UAE), vol 8, 7 (2.8) (2018): 315-319.

14. Visser, Theo J. “Regulation of Thyroid Function, Thyroid Diseases: Pathogenesis, Diagnosis and Treatment (2018): 1-30.

15. Yadav, Dhyan Chandra, and Saurabh Pal. “To