Diagnosis of heart diseases using data mining
Poonam Pannu
Dept. of Computer Science and Engineering, CBS Group of Institutions, MDU, Rohtak, Haryana
ABSTRACT
In today's world, cardiovascular disease is the most lethal disease. It strikes so suddenly that the patient hardly gets any time to be treated. Diagnosing patients correctly and in time is therefore the most challenging task for the medical fraternity. A wrong diagnosis earns a hospital a bad name and costs it its reputation. At the same time, the cost of treating the disease is quite high and not affordable for most patients, particularly in India. The purpose of this paper is to work towards cost-effective treatment by using data mining technologies to facilitate a database decision support system. Almost all hospitals use some hospital management system to manage patient healthcare. Unfortunately, most of these systems rarely use the huge volume of clinical data in which vital information is hidden. These systems create huge amounts of data in varied forms, but this data is seldom revisited and remains untapped. Considerable effort is therefore required in this direction to support intelligent decisions.
Keywords: heart, techniques, diseases, applications, data mining, cardiovascular disease
1. INTRODUCTION
Data Mining
Data mining is the process of finding previously unknown patterns and trends in databases and using that information to build predictive models. It combines statistical analysis, machine learning and database technology to extract hidden patterns and relationships from large databases.
The World Health Statistics 2012 report highlights the fact that one in three adults worldwide has raised blood pressure, a condition that causes around half of all deaths from stroke and heart disease. Heart disease, also known as cardiovascular disease (CVD), encompasses a number of conditions that affect the heart, not just heart attacks. Heart disease is the major cause of death in various countries, including India. In the United States, heart disease kills one person every 34 seconds. Coronary heart disease, cardiomyopathy and cardiovascular disease are some categories of heart disease. The term "cardiovascular disease" covers a wide range of conditions that affect the heart and the blood vessels and the manner in which blood is pumped and circulated through the body.
Diagnosis is a complicated and important task that needs to be executed accurately and efficiently. It is often made on the basis of the doctor's experience and knowledge, which can lead to unwanted results and excessive medical costs for the treatment provided to patients. An automatic medical diagnosis system would therefore be exceedingly beneficial. Our work attempts to present a detailed study of the different data mining techniques that can be deployed in such automated systems.
Data mining is the process of analysing a number of data sets and extracting meaning from the data. It helps to predict patterns and future trends, supporting businesses in decision making. Data mining applications can answer business questions that traditionally took a long time to resolve. The large amount of data generated for disease prediction is too complicated and voluminous to be processed and analysed by traditional means. Data mining provides methods and techniques for transforming this data into useful information for decision making. These techniques can speed up the process, so that heart disease can be predicted in less time and with more accuracy. The healthcare sector collects enormous quantities of healthcare data, which are not mined to uncover hidden information for effective decision making. There is plenty of hidden information in this data that remains untapped and is not used appropriately for prediction. This matters all the more in the case of heart disease, which is considered the predominant cause of death all over the world.
SCOPE OF DATA MINING
intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviours: Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
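The retail example above (products that are often purchased together) can be illustrated with a small sketch. It counts how often pairs of items co-occur and keeps the pairs whose support and confidence pass chosen thresholds. The function name `pair_rules`, the toy baskets and the threshold values are assumptions made for illustration; they do not come from any system described in this paper.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for item pairs that pass both thresholds."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))          # ignore repeated items in one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # P(rhs in basket | lhs in basket)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Toy baskets, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
rules = pair_rules(baskets, min_support=0.5, min_confidence=0.9)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these toy baskets only the beer/diapers pair survives both thresholds. Real pattern discovery tools handle item sets larger than pairs and far larger databases.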
Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.
METHODOLOGY OF DATA MINING
Data mining is a fundamental part of knowledge discovery in databases (KDD). KDD is the process of discovering hidden knowledge from the massive amounts of data that we are technically capable of gathering and storing. The process consists of the following sequence of steps:
a. Problem definition: in this step, the problem to be solved is defined, and noise or irrelevant data is removed.
b. Data integration: multiple data sources are combined to gather and prepare information.
c. Model building: the data relevant to the analysis task is retrieved from the database and transformed, by performing certain operations, into a form appropriate for mining.
d. Data mining: different algorithms are applied to extract patterns.
e. Knowledge deployment: knowledge representation techniques are used to present the mined knowledge to the user.
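The steps above can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions: the record fields (`smoker`, `bp`, `id`), the two toy sources and the pattern counter used as a stand-in for a mining algorithm are all hypothetical.

```python
from collections import Counter


def clean(records):
    """Step a: drop records with missing values (treated here as noise)."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Step b: combine several data sources into one list of records."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def select_and_transform(records, fields):
    """Step c: keep only the relevant fields, as tuples ready for mining."""
    return [tuple(r[f] for f in fields) for r in records]


def mine(rows):
    """Step d: stand-in mining algorithm that counts how often each pattern occurs."""
    return Counter(rows)


def present(patterns, top=3):
    """Step e: format the most common patterns for the user."""
    return [f"{pattern}: {count}" for pattern, count in patterns.most_common(top)]


# Two hypothetical sources, invented for illustration only.
ward_a = [{"smoker": "yes", "bp": "high", "id": 1},
          {"smoker": "no", "bp": None, "id": 2}]
ward_b = [{"smoker": "yes", "bp": "high", "id": 3},
          {"smoker": "no", "bp": "normal", "id": 4}]

records = integrate(clean(ward_a), clean(ward_b))
rows = select_and_transform(records, ["smoker", "bp"])
patterns = mine(rows)
report = present(patterns)
print("\n".join(report))
```

Keeping each step as a separate function mirrors the sequence a to e, so any single stage (for example the mining algorithm) could be swapped without touching the others.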
APPLICATION OF DATA MINING
Data mining is used in various fields such as the retail industry, the telecommunication industry, the healthcare industry, financial data analysis, intrusion detection, sports and the analysis of student performance.
a. Retail industry: data mining has great application in the retail industry, which collects large amounts of data on the transportation, sales and consumption of goods and services. This data expands rapidly as purchases and sales increase. Data mining helps to identify customers' buying patterns and trends, which leads to improved quality of customer service and customer satisfaction.
b. Telecommunication industry: the telecommunication industry is the fastest-growing industry, as it provides various services such as fax, pager, cellular phones and e-mail. With the development of computers, telecommunication services have been integrated with communication technologies and work more effectively. Data mining helps to identify telecommunication patterns and fraudulent activities, to make better use of resources and to improve quality of service.
c. Healthcare industry: data mining is very useful in the healthcare industry for the diagnosis of heart disease, breast cancer and diabetes. It helps to identify patterns and trends in the records of patients who share the same risk factors, and it supports decision making.
d. Financial data analysis: financial data in banking is reliable and of high quality, which facilitates systematic data analysis in the financial industry. Data mining helps in loan payment prediction and customer credit policy analysis. It also helps in clustering customers for target marketing.
e. Intrusion detection: an intrusion is any action by an outside party that threatens the confidentiality or integrity of network resources. With the increased use of the internet and the availability of tools and tricks for intruding on and attacking networks, intrusion detection has become an important issue for network administration. Data mining helps in developing algorithms for intrusion detection and in analysing stream data so that intrusion threats can be avoided.
f. Sports: in sports, vast amounts of statistics are gathered for each player, team, game and season. Data mining is used to predict the performance of players, to select players and to forecast future events.
g. Student performance: data mining is used to evaluate student performance using classification techniques. Attendance, class test, seminar and assignment marks are collected from the student record to predict the performance of the student at the end of the semester.
Causes of Heart Diseases
a. High blood pressure: when the heart pumps blood, the force of the blood pushes against the walls of the arteries, causing pressure. If the pressure rises and stays high over time, it is called high blood pressure or hypertension, which can harm the body in many ways, for example by increasing the risk of heart stroke or of developing heart failure, kidney failure, etc.
b. High cholesterol: cholesterol is a waxy substance found in the fatty deposits in the blood vessels. An increase in these fatty deposits (high cholesterol) prevents sufficient blood from flowing through the arteries, causing heart attacks.
c. Unhealthy diet: eating too much fast food raises blood pressure and cholesterol levels, increasing the risk of heart attacks.
d. Smoking: smoking damages the lining of the arteries and leads to a build-up of a fatty material called atheroma, which narrows the arteries and causes heart attacks.
e. Lack of physical activity: lack of exercise raises the cholesterol level in the blood vessels, which further increases the risk of heart attacks.
f. Obesity: obese people are more likely to have high blood pressure, high cholesterol levels and diabetes (raised blood sugar level), which increase the risk of heart stroke.
Nowadays, data mining is gaining popularity in the healthcare industry, as this industry generates large amounts of complex data about hospital resources, medicines, medical devices, patients, disease diagnosis, etc. This complex data needs to be processed and analyzed for knowledge extraction, which in turn helps in decision making and is also cost effective.
HEART DISEASE PREDICTION USING DATA MINING
equal interval binning with approximate values based on medical expert advice on Pima Indian heart attack data. The significant items were calculated for all frequent patterns with the aid of the proposed approach. The frequent patterns with confidence greater than a predefined threshold were chosen and used in the design and development of the heart attack prediction system. The Pima Indian heart attack dataset used was obtained from the UCI machine learning repository. Patient characteristics such as the number of episodes of chest pain and age in years were recorded. The actions involved in preprocessing a data set are the removal of duplicate records, normalization of the values used to represent information in the database, accounting for missing data points, and removal of unneeded data fields. Moreover, it might be essential to combine the data so as to reduce the number of data sets, besides minimizing the memory and processing resources required by the data mining algorithm. In the real world, data is not always complete, and this is particularly true of medical data.
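The preprocessing actions and the equal interval binning mentioned above might look roughly as follows. This is a sketch under assumptions, not the procedure used in the cited work: the field names (`age`, `chest_pain_count`, `ward`), the choice of filling missing values with the column mean, min-max scaling as the normalization, and the number of bins are all hypothetical.

```python
def preprocess(records, drop_fields=()):
    """Remove duplicate records and unneeded fields, then fill missing values.

    Assumes every remaining field is numeric and has at least one known value.
    """
    seen, unique = set(), []
    for record in records:
        key = tuple(sorted(record.items()))      # identifies an exact duplicate
        if key in seen:
            continue
        seen.add(key)
        unique.append({k: v for k, v in record.items() if k not in drop_fields})

    for field in list(unique[0]):
        known = [r[field] for r in unique if r[field] is not None]
        mean = sum(known) / len(known)
        for r in unique:
            if r[field] is None:
                r[field] = mean                  # account for a missing data point
    return unique


def min_max_normalize(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_interval_bin(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy patient records, invented for illustration only.
patients = [
    {"age": 30, "chest_pain_count": 2, "ward": 7},
    {"age": 30, "chest_pain_count": 2, "ward": 7},      # exact duplicate
    {"age": 50, "chest_pain_count": None, "ward": 3},   # missing value
    {"age": 70, "chest_pain_count": 6, "ward": 5},
]
cleaned = preprocess(patients, drop_fields=("ward",))
ages = [r["age"] for r in cleaned]
age_scaled = min_max_normalize(ages)
age_bins = equal_interval_bin(ages, n_bins=4)
print(cleaned, age_scaled, age_bins)
```

In practice the bin boundaries would be adjusted with medical expert advice, as the text notes, rather than derived from the data range alone.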
Data preprocessing is used to remove the inconsistencies associated with the data.
K. Srinivas et al. presented "Application of Data Mining Technique in Healthcare and Prediction of Heart Attacks". They examined the potential use of classification-based data mining techniques such as rule-based classifiers, decision trees, Naïve Bayes and artificial neural networks on the massive volume of healthcare data. The Tanagra data mining tool was used for exploratory data analysis, machine learning and statistical learning algorithms. The training data set consists of 3000 instances with 14 different attributes. The instances in the dataset represent the results of different types of tests used to predict heart disease. The performance of the classifiers was evaluated and the results were analyzed. The comparison is based on 10 tenfold cross-validations. According to the attributes, the dataset is divided into two parts: 70% of the data are used for training and 30% for testing. Among the classification algorithms compared, Naïve Bayes was considered the best-performing algorithm.
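The two evaluation schemes mentioned above, a 70%/30% split and tenfold cross-validation, can be sketched as index bookkeeping. The text does not say how the two were combined, so they are shown independently here. The function names, the fixed random seed and the use of row numbers in place of real instances are assumptions for illustration.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold_indices(n_rows, k=10):
    """Yield (train_idx, test_idx) pairs so that each row is tested exactly once."""
    indices = list(range(n_rows))
    for fold in range(k):
        test_idx = indices[fold::k]              # every k-th row, offset by the fold number
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx


rows = list(range(3000))                         # stand-ins for the 3000 instances
train, test = train_test_split(rows)
folds = list(k_fold_indices(len(rows), k=10))
print(len(train), len(test), len(folds))
```

A classifier would be trained on each `train_idx` and scored on the matching `test_idx`, with the ten scores averaged to give one cross-validated estimate.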
The performance of various algorithms is listed below.
Table 1. Performance study of data mining algorithms
Algorithm used    Accuracy    Time taken
Naïve Bayes       52.33%      609 ms
Decision list     52%         719 ms
K-NN              45.67%      1000 ms
Naïve Bayes, K-NN and Decision List were used for the diagnosis of heart disease. Of these, Naïve Bayes took the least time to run and gave the most accurate result.
Sudha et al. [11] proposed classification algorithms such as Naïve Bayes, decision tree and neural network for predicting stroke disease. Decision trees, a Bayesian classifier and a back-propagation neural network were adopted in this study. Records with irrelevant data were removed from the data warehouse before the mining process. Data mining classification technology consists of a classification model and an evaluation model. The classification model uses the training data set to build a predictive classification model, and the testing data set is used to test classification efficiency. The decision tree, Naïve Bayes and neural network classifiers were then used for stroke disease prediction. Performance was evaluated for the three algorithms, the models were compared and accuracy was measured. In this comparison, the neural network performed better than the other two algorithms.
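Accuracy and run-time figures of the kind reported in Table 1 could be measured as in the following sketch, which builds a small k-NN classifier on a training set and scores it on a testing set. The toy two-feature data, the labels and the choice of k = 3 are invented for illustration; the sketch does not reproduce the figures in the table.

```python
import time
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def evaluate(train, test, k=3):
    """Return (accuracy, elapsed time in milliseconds) on the testing set."""
    start = time.perf_counter()
    correct = sum(knn_predict(train, features, k) == label for features, label in test)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return correct / len(test), elapsed_ms


# Toy data, invented for illustration: (features, label) pairs.
train_set = [
    ((1.0, 1.0), "healthy"), ((1.2, 0.8), "healthy"), ((0.9, 1.1), "healthy"),
    ((3.0, 3.2), "disease"), ((3.1, 2.9), "disease"), ((2.8, 3.0), "disease"),
]
test_set = [
    ((1.1, 1.0), "healthy"),
    ((3.0, 3.0), "disease"),
    ((2.9, 3.1), "healthy"),   # sits among "disease" points, so it is misclassified
]
accuracy, elapsed_ms = evaluate(train_set, test_set)
print(f"accuracy={accuracy:.2%}, time={elapsed_ms:.3f} ms")
```

Running the same `evaluate` call with each classifier in turn gives directly comparable accuracy and timing columns.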
M. A. Jabbar proposed association rule mining based on sequence number and clustering for heart attack prediction. The entire database is divided into partitions of equal size. A dataset with 14 attributes was used in that work, and each cluster is considered one at a time for calculating frequent item sets. This approach reduces the main memory requirement. To predict heart attack efficiently, patterns are extracted from the database with significant weight calculation. The frequent patterns with a value greater than a predefined threshold were chosen for the prediction of heart attack. Three mining goals were defined based on data exploration, and all of the models could answer complex queries in predicting heart attack.
Mai Shouman proposed k-means clustering combined with the decision tree method to predict heart disease. The work suggested several centroid selection methods for k-means clustering to increase efficiency. The 13 input attributes were collected from the Cleveland Clinic Foundation heart disease data set. Sensitivity, specificity and accuracy were calculated with different initial centroid selection methods and different numbers of clusters. For the random attribute and random row methods, ten runs were executed, and the average and best results for each method were calculated. Compared with the traditional decision tree applied previously to the same data set, integrating k-means clustering with the decision tree enhanced the accuracy of the decision tree in diagnosing heart disease patients. In addition, integrating k-means clustering and the decision tree achieved higher accuracy than the bagging algorithm in the diagnosis of heart disease patients. The highest accuracy achieved was 83.9%, by the inlier method with two clusters.
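The two ingredients discussed above, k-means clustering started from chosen initial centroids and the sensitivity, specificity and accuracy measures, can be sketched as follows. This is an illustration under assumptions: the toy points, the choice of initial centroids and the label vectors are invented, and the decision tree stage of the integrated method is not shown.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) starting from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)        # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def diagnostic_metrics(actual, predicted, positive=1):
    """Return (sensitivity, specificity, accuracy) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(pairs)


# Toy points and initial centroids, invented for illustration only.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])

# Toy diagnosis labels: 1 = heart disease present, 0 = absent.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sensitivity, specificity, accuracy = diagnostic_metrics(actual, predicted)
print(labels, sensitivity, specificity, accuracy)
```

Because the initial centroids are passed in as an argument, different centroid selection methods can be compared by changing only that argument and recomputing the three measures.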
CONCLUSION
diagnosis. Some heart disease classification systems are reviewed in this paper. From the analysis it is concluded that data mining plays a major role in heart disease classification. A neural network with offline training is good for disease prediction at an early stage, and good system performance can be obtained with a preprocessed and normalized dataset. The classification accuracy can be improved by reducing the number of features.
REFERENCES
[1]. Preventing Chronic Disease: A Vital Investment. World Health Organization Global Report, 2005.
[2]. Global Burden of Disease. 2004 update (2008). World Health Organization.
[3]. Srinivas, K., "Analysis of coronary heart disease and prediction of heart attack in coal mining regions using data mining techniques", IEEE Transaction on Computer Science and Education (ICCSE), pp. 1344-1349, 2010.
[4]. Yanwei Xing, "Combination Data Mining Methods with New Medical Data to Predicting Outcome of Coronary Heart Disease", IEEE Transactions on Convergence Information Technology, pp. 868-872, 21-23 Nov. 2007.
[5]. IBM, Data mining techniques, http://www.ibm.com/developerworks/opensource/library/ba-data-miningtechniques/index.html?ca=drs- , downloaded on 04 April 2013.
[6]. Microsoft Developer Network (MSDN). http://msdn2.microsoft.com/enus/virtuallabs/aa740409.aspx, 2007.
[7]. Han, J., Kamber, M.: "Data Mining Concepts and Techniques", Morgan Kaufmann Publishers, 2006. Ho, T. J.: "Data Mining and Data warehousing", Prentice Hall, 2005.
[8]. M. Ilayaraja, "Mining Medical Data to Identify Frequent Diseases using Apriori Algorithm", IEEE-International Conference on Pattern Recognition, Informatics and Mobile Engineering, 2013.
[9]. Nidhi Bhatla, "An Analysis of Heart Disease Prediction using Different Data Mining Techniques", International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 8, 2012.
[10]. T. John Peter, K. Somasundaram, "Study and development of novel feature selection framework for heart disease prediction", International Journal of Scientific and Research Publications, 2012.
[11]. Tsien, H.S.F. Fraser, W.J. Long and R.L. Kennedy, "Using classification trees and logistic regression methods to diagnose myocardial infarction", in Proc. 9th World Congr., Inf., vol. 52, pp. 483-497, 2001.
[12]. Vahid Khatibi and Gholam Ali Montazer, "A fuzzy-evidential hybrid inference engine for coronary heart disease risk assessment", Journal of Expert Systems with Applications, Vol. 37, pp. 8536-8542, 2010.