Comparative Analysis of Heart Risk Dataset Using Different Classification Techniques

(1)

Satyajeet Shinge, IJRIT-489 International Journal of Research in Information Technology

(IJRIT)

www.ijrit.com ISSN 2001-5569

Comparative Analysis of Heart Risk Dataset Using Different Classification Techniques

Satyajeet Shinge¹, Ravindra Zarekar², Dr. V.Vaithiyanathan³ and K.Rajeswari⁴

1ME Student, Department of Computer Engineering PCCOE, University of Pune, India.

1[email protected]

2ME Student, Department of Computer Engineering PCCOE, University of Pune, India.

2[email protected]

3Associate Dean Research, SASTRA University.India.

3[email protected]

4Associate Professor, Department of Computer Engineering PCCOE, University of Pune, India.

4[email protected].

Abstract

Association rule mining and classification are two important techniques of data mining for the knowledge discovery process. This paper focuses on the construction of class association rules and classification model. It explains the basics of class association rule mining and classification through WEKA. In the simulation on WEKA, we perform some classification techniques on the Heart Risk dataset to compare the results obtained by different classification techniques.

Keywords: Rule mining; Association rule; Classification; Data mining; Apriori; WEKA.

1. Introduction

Data mining is the process of discovering interesting, useful patterns and relationships in large volumes of data. It is the process of analyzing data from different perspectives and summarizing it into useful information. Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Association rule mining and classification rule mining are the two most popular techniques in data mining. In data mining, association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases. The classification problem is to build a model, which is based on external observations, assigns an instance to one or more labels. Data classification is two step processes. In the first step, we build a classification model based on previous data. In the second step, we determine whether accuracy of model is acceptable or not. If it is acceptable we use the model to classify new data.

Classification learner that uses association rules has to perform three major steps: Mining a set of potentially accurate rules, evaluating and pruning rules, and classifying future instances using the found rule set.

(2)

Satyajeet Shinge, IJRIT-490

2. Association Rules Mining

In data mining, association rule learning is a popular and well researched method for discovering interesting relations between variables in large databases. Association rule mining is defined as [4]: Let I= {i1, i2,…,in} be a set of n binary attributes called items. Let T={t1,t2,…,tn} be a set of transactions called the database. A transaction ID is unique ID for each transaction in T and it contains a subset of the items in I. A rule is defined in the form of X

⇒

Y where X, Y is subset of I and X ∩ Y = Φ. The sets of items (for short item sets) X and Y are called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule respectively. We use a small example to understand the concept. The set of items is I = {Pen, Pencil, Eraser, Sharpener} and a small database containing the items. An example rule for the supermarket could be

“{Pencil, Sharpener}

⇒

Eraser” …………....(Rule 1)

From this rule, we infer that whenever Pencil and Sharpener is bought then the customers also buy Eraser. Support and Confidence are two measures of Association Rule Mining. They respectively reflect the usefulness and certainty of discovered rules. A support for rule means that all the transactions show that Pencil, Sharpener and Eraser purchased together. A confidence is that who purchase Pencil and Sharpener also bought the Eraser. Typically, Association Rules are consider interesting if they satisfy both a Minimum support threshold and a Minimum Confidence threshold.

tuples total

B and A both containing tuples

B A

Support → =

(1)

A containing tuples

B and A both containing tuples

B A

Confidence → =

(2)

Table 1: Sample list of Patients for Heart Risk Data Set

Patient Id List of Attribute

P1 Weight, Thyroid, Smoking

P2 Weight, Smoking

P3 Weight, Gender

P4 Thyroid, Gender ,Cholesterol

The above Equations 1 and 2 gives the method for computing support and confidence. Suppose Heart Risk dataset contain 4 Patients as shown in Table 1, then minimum support is 50% and the minimum confidence is 50% then we have Rules,

Weight

⇒

Smoking (50%, 66.6%) Smoking

⇒

Weight (50%, 100%)

3. Classification

Classification is one of the most frequently studied problems by Data Mining and machine learning researchers. It consists of predicting the value of a (categorical) attribute (the class) based on the values of other attributes (the predicting attributes). An example would be assigning a Patient to Heart Risk Level 1,2,3,4 or 5 or assigning diagnosis to a given patient as described by observed characteristics of the patient such as age, gender, height, weight, etc. Classification is a well-known task in data mining and is used to predict the class of an unseen instance as accurately as possible. The following are some different classification methods:

(3)

Satyajeet Shinge, IJRIT-491 (1) Naive Bayes: A Naïve Bayes [3] is a simple probabilistic classification technique. It applies Bayes' theorem with strong independence assumptions. In simple terms, a Naive Bayes classifier assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of any other feature.

(2) J48 Tree: The J48 classifier implements the C4.5 algorithm. J48 classifier applies on dataset to predict the target variable of a new dataset record.

(3) REPTree: REPTree is fast decision tree learner that builds a decision or regression tree using information gain or variance reduction and prunes it using reduced-error pruning. It sorts values for numeric attributes once.

(4) Random tree: It considers K randomly chosen attributes at each node and performs no pruning. It allows estimation of class probabilities based on a hold-out set [3].

4. Results and discussion

WEKA [4] is a data mining system developed at the University of Waikato and has become very popular among the academic community working on data mining. The WEKA workbench contains a collection of visualization tools and algorithms for data analysis. It uses Graphical user interfaces to access its functionality. WEKA is an open source machine learning environment with many useful data mining and machine learning algorithms. The Experiment is performed on Heart Risk data set collected from Madras medical College. It has 712 instances and 16 attributes.

In this paper, we compare different classifier to analyze their performance on Heart Risk Dataset.

Fig. 1 Snapshot for Naïve Bayes classifier on Heart risk Dataset

(4)

Satyajeet Shinge, IJRIT-492 Fig. 2 Snapshot for J48 classifier on Heart risk Dataset

Fig. 3 Snapshot for REP TREE classifier on Heart risk Dataset

(5)

Satyajeet Shinge, IJRIT-493 Fig. 4 Snapshot for Random Tree classifier on Heart risk Dataset

The figures 1,2,3,4 gives the screen shot of the results obtained using different classification techniques in WEKA. The Figure 5 below gives a summary of the correctly classified instances and Incorrectly Classified Instances. This research is a starting attempt to use data mining functions to analyze and evaluate datasets. We classified datasets with four different classification techniques: Naïve Bayes, J48, REP Tree and Random Tree. This study compares the correctly classified instances as well as incorrectly classified instances with different algorithms namely, Naive Bayes, J48, and REP Tree and Random Tree.

Fig. 5 correctly classified instances and incorrectly classified instances.

(6)

Satyajeet Shinge, IJRIT-494 Our future work will focus on using other classification algorithms for comparative analysis and apply the same for different domains.

5. Conclusions

This paper illustrates how well different classification techniques are used as predictive tools in data mining domain after comparing their performances. It has been concluded from the results obtained that Random Tree algorithm is most appropriate for Heart Risk Dataset to classify instances. Random Tree gives 92% accuracy which is relatively higher than the other algorithms. This is an attempt to use classification algorithms to calculate performance and compare the performance of Naïve Bayes, J48, and REPTree and Random Tree.

References

[1] J. Han and M. Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, 2012.

[2] Cristóbal Romero, Sebastián Ventura, Pedro G. Espejo and César Hervás, “Data Mining Algorithms to Classify Students”, Computer Science Department, Córdoba University, Spain.

[3] C.sugandhi, P.Yasodha, M.Kannan, “Analysis of a Population of Cataract Patients Databases in WEKA Tool”, International Journal of Scientific & Engineering Research Volume 2, Issue 10, Oct-2011 1ISSN 2229-5518.

[4] MohdFauzi bin Othman, ThomasMoh Shan Yau, “ Comparison of Different Classification Techniques Using WEKA for Breast Cancer” , Universiti Teknologi Malaysia, Skudai, Malaysia.

[5] Sunita B Aher, Mr. LOBO L.M.R.J., “Data Mining in Educational System using WEKA”, International Conference on Emerging Technology Trends (ICETT) 2011.

[6] V. Vaithiyanathan1, K. Rajeswari, Kapil Tajane, Rahul Pitale , “Comparison Of Different Classification Techniques Using Different Datasets”, International Journal of Advances in Engineering & Technology, May 2013,

[7] Suchita Borkar, K.Rajeswari, “Attributes Selection for Predicting Students’ Academic Performance using Education Data Mining and Artificial Neural Network”, International Journal of Computer Applications (0975 – 8887) Volume 86 – No 10, January 2014.

[8] Dr. V. Vaithiyanathan, K. Rajeswari, Rahul Pitale, KapilTajane, “Performance Improvement Using Integration Of Association Rule Mining And Classification Techniques”, International Journal of Scientific & Engineering Research, Volume 4, Issue 5, May-2013 1891 ISSN 2229-5518.

[9] K. Rajeswari, KumariPreeti, Dr. V. Vaithiyanathan, “Selection of Significant Features using Decision Tree Classifiers”, Journal of Applied Sciences Research, 9(4): 2566-2569, 2013 ISSN 1819-544X.

[10] K. Rajeswari, RohitGarud, Dr. V. Vaithiyanathan, “ Improving Efficiency of Classification using PCA and Apriori based Attribute Selection Technique”, Research Journal of Applied Sciences, Engineering and Technology 6(24): 4681-4684, 2013 ISSN: 2040-7459; e-ISSN: 2040-7467 © Maxwell Scientific Organization.