Survey on Data Mining


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 2, February 2012)


Vibha Maduskar1, Prof. Yashovardhan Kelkar2

1 A-4/1 Mahananda Nagar, Ujjain (M.P.)

2 M.I.T.S. Ujjain

1 [email protected], 2 [email protected]

Abstract - In the Information Technology era, information plays a vital role in every sphere of human life. It is very important to gather data from different data sources, store and maintain the data, and generate information and knowledge from it. Analyzing this vast amount of data and drawing fruitful conclusions and inferences requires special tools called data mining tools. This paper gives an overview of data mining, the research that could be done in different areas, and some of its applications.

I. INTRODUCTION

Data mining is the process of extracting hidden knowledge from large volumes of raw data. The knowledge must be new, not obvious, and one must be able to use it. Data mining has been defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [1]. It is “the science of extracting useful information from large databases”. Data mining is one of the tasks in the process of knowledge discovery from databases (KDD).

Fig.1.1 KDD process

The steps involved in knowledge discovery are –

1. Data Selection: The data relevant to the analysis is decided on and retrieved from the various data locations.

2. Data Preprocessing: In this stage data cleaning and data integration are carried out.

3. Data Cleaning: Also known as data cleansing; in this phase noisy and irrelevant data are removed from the collected data.

4. Data Integration: In this stage, multiple data sources, often heterogeneous, are combined into a common source.

5. Data Transformation: In this phase the selected data is transformed into forms appropriate for the mining procedure.

6. Data Mining: This is the crucial step in which clever techniques are applied to extract potentially useful patterns. The decision is made about the data mining technique to be used.

7. Interpretation and Evaluation: In this step, interesting patterns representing knowledge are identified based on given measures. The discovered knowledge is visually presented to the user.

This essential step uses visualization techniques to help users understand the results.
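The steps listed above can be sketched as a small pipeline. The following is only an illustrative sketch: the record fields (`age`, `chol`), the threshold of 240 and all function names are assumptions made for the example, and the "mining" step is a trivial frequency count standing in for a real technique.

```python
# Hypothetical raw records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 54, "chol": 240}, {"id": 2, "age": None, "chol": 180}]
source_b = [{"id": 3, "age": 61, "chol": 300}, {"id": 4, "age": 47, "chol": 210}]

def select(rows, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one common list."""
    return [row for src in sources for row in src]

def transform(rows, field, threshold):
    """Data transformation: bin a numeric attribute into two categories."""
    return [{**r, field: "high" if r[field] >= threshold else "normal"} for r in rows]

def mine(rows, field):
    """Data mining (toy stand-in): count how often each category occurs."""
    counts = {}
    for r in rows:
        counts[r[field]] = counts.get(r[field], 0) + 1
    return counts

def run_pipeline():
    fields = ["age", "chol"]
    selected = [select(src, fields) for src in (source_a, source_b)]
    cleaned = [clean(src) for src in selected]
    rows = integrate(*cleaned)
    rows = transform(rows, "chol", 240)
    return mine(rows, "chol")

patterns = run_pipeline()
# Interpretation and evaluation: present the discovered counts to the user.
for category, count in sorted(patterns.items()):
    print(f"{category}: {count}")
```

Each stage is a plain function so that the order of the stages stays visible; a real system would replace `mine` with an actual mining technique and the final loop with a visualization.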

II. APPLICATIONS OF DATA MINING

As data mining matures, new and increasingly innovative applications for it emerge. Although a wide variety of data mining scenarios can be described, for the purpose of this paper the applications of data mining are divided into the following categories:

• Healthcare

• Finance

• Retail industry

• Telecommunication

• Text Mining & Web Mining

• Higher Education


[2.1.1] Heart Disease Prediction -

Some hospitals use decision support systems, but these are largely limited. They can answer simple queries like “What is the average age of patients who have heart disease?”, “How many surgeries resulted in hospital stays longer than 10 days?” and “Identify the female patients who are single, above 30 years old, and who have been treated for cancer.” However, they cannot answer complex queries like “Given patient records, predict the probability of patients getting a heart disease.” [10]

[2.1.1.1] By Naïve Bayes –

The Naïve Bayes classifier technique is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods. The Naïve Bayes model identifies the characteristics of patients with heart disease. It shows the probability of each input attribute for the predictable state [10]. The Naïve Bayes implementation is preferred:

1) When the dimensionality of the data is high.

2) When the attributes are independent of each other.

3) When more efficient output is wanted, as compared to the output of other methods.

A Decision Support in Heart Disease Prediction System (DSHDPS) is developed using the Naïve Bayesian classification technique. The system extracts hidden knowledge from a historical heart disease database. The cited study reports this as the most effective model for predicting patients with heart disease. The model could answer complex queries, with its own strengths with respect to ease of model interpretation, access to detailed information and accuracy. DSHDPS can be further enhanced and expanded.
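To make the idea concrete, a categorical Naïve Bayes classifier can be sketched in a few lines. This is a minimal illustration and not the DSHDPS implementation: the patient records, the attribute names (`chest_pain`, `high_bp`) and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, target):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(r[target] for r in records)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    domains = defaultdict(set)           # attribute -> observed values
    for r in records:
        for attr, value in r.items():
            if attr == target:
                continue
            value_counts[(attr, r[target])][value] += 1
            domains[attr].add(value)
    return class_counts, value_counts, domains

def predict_proba(model, query):
    """Return P(class | query), assuming attributes are independent given the class."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / total  # prior P(class)
        for attr, value in query.items():
            count = value_counts[(attr, cls)][value]
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score *= (count + 1) / (n_cls + len(domains[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical training data, invented for illustration only.
patients = [
    {"chest_pain": "yes", "high_bp": "yes", "disease": "yes"},
    {"chest_pain": "yes", "high_bp": "no", "disease": "yes"},
    {"chest_pain": "no", "high_bp": "yes", "disease": "yes"},
    {"chest_pain": "no", "high_bp": "no", "disease": "no"},
    {"chest_pain": "no", "high_bp": "yes", "disease": "no"},
    {"chest_pain": "yes", "high_bp": "no", "disease": "no"},
]

model = train_naive_bayes(patients, "disease")
probs = predict_proba(model, {"chest_pain": "yes", "high_bp": "yes"})
print(probs)
```

The classifier answers the kind of query quoted earlier ("given patient records, predict the probability of heart disease") by multiplying the class prior with one conditional probability per attribute, which is why the independence assumption matters.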

[2.1.1.2] Heart Disease Diagnosis By Association Rules and Decision Trees –

The goal is to link perfusion measurements and risk factors to artery disease. Some rules were expected, confirming valid medical knowledge, and some rules were surprising, having the potential to enrich medical knowledge. The study reports some of the most important discovered rules. Predictive rules were grouped in two sets: (1) if there is a low perfusion measurement or no risk factor, then the arteries are healthy; (2) if there exists a risk factor or a high perfusion measurement, then the arteries are diseased. The maximum association size κ was 4 [2]. In that study, decision trees are not as powerful as association rules at exploiting a set of manually binned numeric attributes, categorical attributes and several related target attributes. Decision trees do not work well with combinations of several target variables (arteries), which require defining one class attribute for each combination of values. Decision trees fail to identify many medically relevant combinations of independent numeric variable ranges and categorical values.

The work presented three search constraints with the following objectives: producing only medically useful rules, reducing the number of discovered rules and improving running time. First, data set attributes are constrained to belong to user-specified groups to eliminate uninteresting value combinations and to reduce the combinatorial explosion of rules. Second, attributes are constrained to appear either in the antecedent or in the consequent to discover only predictive rules. Third, rules are constrained to have a threshold on the number of attributes to produce fewer and simpler rules.

Experimental results provide evidence that decision trees are less effective than constrained association rules at predicting disease with several related target attributes, due to low confidence factors (i.e. low reliability), slight overfitting, rule complexity for unrestricted trees (i.e. long rules) and data set fragmentation (i.e. small data subsets). Therefore, constrained association rules can be an alternative to other statistical and machine learning techniques applied in medical problems where there is a requirement to predict several target attributes based on subsets of independent numeric and categorical attributes [2].
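The three search constraints can be illustrated with a brute-force rule miner. This is a sketch under stated assumptions, not the algorithm of [2]: the transactions and item names are invented, the group constraint is interpreted here as "at most one item per user-defined group", and the support and confidence thresholds are arbitrary.

```python
from itertools import combinations

def mine_constrained_rules(transactions, role, group, max_size,
                           min_support, min_confidence):
    """Enumerate rules antecedent -> consequent that satisfy three constraints."""
    items = sorted({i for t in transactions for i in t})
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    # Constraint 3: limit the number of items in a rule to max_size.
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            # Constraint 1 (assumed form): at most one item per group.
            groups = [group[i] for i in combo]
            if len(set(groups)) < len(groups):
                continue
            # Constraint 2: each item may appear only on its assigned side.
            antecedent = frozenset(i for i in combo if role[i] == "antecedent")
            consequent = frozenset(i for i in combo if role[i] == "consequent")
            if not antecedent or not consequent:
                continue
            supp = support(antecedent | consequent)
            if supp == 0 or supp < min_support:
                continue
            conf = supp / support(antecedent)
            if conf >= min_confidence:
                rules.append((antecedent, consequent, supp, conf))
    return rules

# Hypothetical binned patient records, invented for illustration only.
transactions = [
    {"risk_factor", "high_perfusion", "diseased"},
    {"risk_factor", "diseased"},
    {"high_perfusion", "diseased"},
    {"low_perfusion", "healthy"},
    {"low_perfusion", "healthy"},
    {"risk_factor", "low_perfusion", "healthy"},
]
role = {"risk_factor": "antecedent", "high_perfusion": "antecedent",
        "low_perfusion": "antecedent", "diseased": "consequent",
        "healthy": "consequent"}
group = {"risk_factor": "risk", "high_perfusion": "perfusion",
         "low_perfusion": "perfusion", "diseased": "artery", "healthy": "artery"}

rules = mine_constrained_rules(transactions, role, group, max_size=3,
                               min_support=0.3, min_confidence=0.9)
for antecedent, consequent, supp, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(supp, 2), round(conf, 2))
```

Exhaustive enumeration is only practical for tiny item sets; the point of the constraints is that most candidate combinations are discarded before their support is ever counted.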

[2.2] Higher Education

[2.2.1] By Web-Based Mining


Data exploration focused on the number of attempted exercises, combined with classification, led us to identify students at risk, those who have not trained enough. Clustering and cluster visualization led us to identify a particular behavior among failing students, when students try out the logic rules of the pop-up menu of the tool. A timely and appropriate warning to students at risk could help prevent failure in the final exam [4]. Therefore it seems to us that data mining has a lot of potential for education. The way we have performed clustering may seem rough, as only a few variables, namely the number and type of mistakes and the number of exercises, have been used to cluster students into homogeneous groups. This is due to our particular data. All exercises are about formal proofs. Even if they differ in their difficulty, they do not fundamentally differ in the concepts students have to grasp. We have discovered a behavior rather than particular abilities. In a different context, clustering students to find homogeneous groups regarding skills should take into account answers to a particular set of exercises.

[2.2.2] Predicting Students' Performance -

The main objective of higher education institutions is to provide quality education to their students. One way to achieve the highest level of quality in a higher education system is by discovering knowledge for prediction regarding the enrolment of students in a particular course, alienation of the traditional classroom teaching model, detection of unfair means used in online examinations, detection of abnormal values in students' result sheets, prediction of students' performance and so on. The knowledge is hidden in the educational data set and is extractable through data mining techniques. The classification task is used to evaluate students' performance and, although there are many approaches to data classification, the decision tree method is used here.
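A central step in building a decision tree is choosing the attribute to split on. The sketch below shows one common criterion, information gain as used in ID3-style trees; the source does not say which criterion was used, so this choice, the student records and the attribute names (`attendance`, `assignments`) are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Reduction in entropy of the target after splitting on the attribute."""
    base = entropy([r[target] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

def best_split(records, attributes, target):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(records, a, target))

# Hypothetical student records, invented for illustration only.
students = [
    {"attendance": "good", "assignments": "done", "result": "pass"},
    {"attendance": "good", "assignments": "missed", "result": "pass"},
    {"attendance": "poor", "assignments": "done", "result": "fail"},
    {"attendance": "poor", "assignments": "missed", "result": "fail"},
]

root = best_split(students, ["attendance", "assignments"], "result")
print("split on:", root)
```

A full tree would apply `best_split` recursively to each resulting subset until the subsets are pure or no attributes remain.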

[2.2.3] Prediction of Student Performance

The proposed model is used to extract information about student behavior when writing code for assignments and to find statistical patterns or predictors that can be used to enhance students' performance in writing code. The results obtained suggest that aspects such as student work habits, and even code quality, have little bearing on the student's performance [9].

[2.3] PHARMACEUTICAL INDUSTRY

[2.3.1] The implication is that a simple process of merging data on drug usage and the cost of medicines (after completing the legal requirements) with the patient care records of doctors and hospitals can help firms conduct nationwide trials for their new drugs.

Other possible uses of information technology in the field of pharmaceuticals include pricing (two-tier pricing strategy) and exchange of information between vertically integrated drug companies for mutual benefit. Nevertheless, the challenge remains. Data mining, fondly called pattern analysis on large sets of data, uses tools like association, clustering, segmentation and classification to support better manipulation of the data and to help pharma firms compete on lower costs while improving the quality of drug discovery and delivery methods. Drug development in general proceeds through the following stages: finding new drugs; development, which tests and predicts drug behavior; clinical trials, which test the drug in humans; and commercialization, which takes the drug and sells it to likely consumers (doctors and patients). A simple association technique could help measure the outcomes that would greatly enhance the patient's quality of life, e.g. faster restoration of the body's normal functioning. This could be a benefit much sought after by the patient and could help the firm better position the drug vis-à-vis the competition. The paper describes how these techniques can be easily and successfully used. It presents how data mining discovers and extracts useful patterns from this large data, and demonstrates the ability of data mining to improve the quality of the decision-making process in the pharma industry.

[2.3.2] Data Mining of Market Knowledge in the Pharmaceutical Industry -


[2.4] Security -

[2.4.1] Intrusion Detection -

As network attacks have increased in number and severity over the past few years, the intrusion detection system (IDS) is increasingly becoming a critical component in securing the network. Due to large volumes of security audit data as well as the complex and dynamic properties of intrusion behaviors, optimizing the performance of IDS has become an important open problem that is receiving more and more attention from the research community. The cited work makes a quick and up-to-date literature survey of attempts at designing intrusion detection systems using the KDD dataset (classifiers, evaluation setup and performance comparison) [5].

[2.4.2] Cyber Security -

MINDS is a suite of data mining algorithms which can be used as a tool by network analysts to defend the network against attacks and emerging cyber threats. The various components of MINDS, such as the scan detector, the anomaly detector and the profiling module, detect different types of attacks and intrusions on a computer network. The scan detector aims at detecting scans, which are the precursors to any network attack [8]. The anomaly detection algorithm is very effective in detecting behavioral anomalies in the network traffic, which typically translate to malicious activities such as DoS traffic, worms, policy violations and insider abuse.
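The idea behind scan detection can be illustrated with a deliberately simple heuristic: flag any source that contacts an unusually large number of distinct destination/port pairs. This is not the MINDS scan detector, whose method is not described here; the flow tuples, addresses and threshold below are assumptions made for the example.

```python
from collections import defaultdict

def detect_scanners(flows, threshold):
    """Flag sources that touch at least `threshold` distinct (host, port) pairs."""
    targets = defaultdict(set)
    for src, dst, port in flows:
        targets[src].add((dst, port))
    return {src for src, seen in targets.items() if len(seen) >= threshold}

# Hypothetical flow records as (source, destination, destination port).
flows = [("198.51.100.7", "192.0.2.1", p) for p in (21, 22, 23, 80, 443)]
flows += [
    ("198.51.100.9", "192.0.2.1", 80),
    ("198.51.100.9", "192.0.2.1", 80),   # repeated flow, counted once
    ("198.51.100.9", "192.0.2.2", 443),
]

suspects = detect_scanners(flows, threshold=4)
print(suspects)
```

A fixed threshold is easy to evade by scanning slowly, which is one reason production systems combine such counts with learned models of normal traffic.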

[2.5] Transportation -

This work applies data-mining analysis to a typical construction database containing information about asphalt projects in Illinois. A case study was presented to test the applicability of data mining as an analysis method. A database was constructed with data collected from IDOT sources. A data mining technique was used to analyze the created dataset and generate rules. Based on the generated results and their interpretation, certain previously unknown patterns were discovered [6]. The study shows that data mining can provide information on a dataset/database beyond what statistical methods alone provide, and can be a source of valuable information (that could not have been detected otherwise) to support decision-making. If the time-consuming data collection process can be reduced, the method can extract information faster than other analysis methods. Another important extension to this research is exploring the validity of the previously unknown patterns that were discovered. This could entail mining the data using other software as well as conducting a long-term study to verify those rules. Furthermore, other applications of data mining to the construction industry could be developed.

One of the applications suggested by IDOT personnel is collusion detection among contractors, similar to the fraud detection application of data mining currently used by many insurance companies.

[2.6] Agricultural - Applying data mining technology to modern agricultural logistics decisions aims to speed up the circulation of agricultural products, lower logistics costs, promote value-added logistics services, raise international competitiveness, improve the management level of the agricultural product industry and increase farmers' income. Its notable advantage lies in raising the level, accuracy and efficiency of modern agricultural logistics decisions and in reducing the subjectivity and blindness of decisions. The paper applies a decision support system within the information system, built on an agricultural logistics data mining system, which provides strong decision support to the main agricultural logistics businesses, administrators and decision makers so that they can adapt to the development demands of modern agricultural logistics [7].

(ii) Spatial Data Mining-

This work gives an overview of a data clustering method that uses cluster analysis and thereby generates patterns/rules. “Association Rule Mining Analysis” usually sounds like something very smart, difficult to understand and useful only to researchers and professors wearing thick glasses, but the reality is just the opposite. Although we might not be aware of it, pattern analysis using association rule mining is present in many aspects of our everyday life. The paper considers only two dimensions, and on the basis of these two dimensions it clusters [11] the data objects. The method can be extended to more than two dimensions. The current method for finding the distance between a data point and a cluster is the Euclidean distance, which yields circular clusters.
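Clustering two-dimensional points by Euclidean distance to a cluster centre can be sketched with a k-means-style loop. The source does not name the exact algorithm, so the use of k-means, the sample points and the initial centroids are assumptions made for illustration.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Euclidean distance decides which cluster a point belongs to.
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional data objects and starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Because every point is assigned by plain Euclidean distance to a centre, each cluster covers a roughly circular region, which matches the limitation noted above. The same code works unchanged for more than two dimensions, since `math.dist` accepts points of any equal length.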

III. CONCLUSION


Several attempts have been made to design and develop a generic data mining system, but no system has been found to be completely generic. Thus, for every domain the assistance of a domain expert is mandatory. The domain experts should be guided by the system to apply their knowledge effectively when using data mining systems to generate the required knowledge. The domain experts are required to determine the variety of data that should be collected in the specific problem domain, select the specific data for data mining, clean and transform the data, extract patterns for knowledge generation and, finally, interpret the patterns and generate knowledge.

References

[1] K.Srinivas, B.Kavihta Rani, Dr. A.Govrdhan, “Applications of Data Mining Techniques in Healthcare and Prediction of Heart Attacks”, (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 02, 2010, 250-255.

[2] Carlos Ordonez, University of Houston, Houston, TX, USA, “Comparing Association Rules and Decision Trees for Disease Prediction”.

[3] Şenol Zafer ERDOĞAN, Mehpare TİMOR, “A Data Mining Application in a Student Database”, Journal of Aeronautics and Space Technologies, July 2005, Volume 2, Number 2 (53-57).

[4] Agathe MERCERON and Kalina YACEF, “Educational Data Mining: a Case Study”.

[5] Huy Anh Nguyen and Deokjai Choi, Chonnam National University, Computer Science Department, “Application of Data Mining to Network Intrusion Detection: Classifier Selection Model”.

[6] Khaled Nassar, Associate Professor, Department of Architectural Engineering, University of Sharjah, “Application of Data-Mining to State Transportation Agencies’ Projects Databases”.

[7] LIU Dejun, ZHANG Guangsheng, Shenyang Agricultural University, Shenyang, “Application of Data Mining Technology in Modern Agricultural Logistics Management Decision”.

[8] Varun Chandola, Eric Eilertson, Levent Ertöz, György Simon and Vipin Kumar, Department of Computer Science, University of Minnesota, “Data Mining for Cyber Security”.

[9] Qasem A. Al-Radaideh, Emad M. Al-Shawakfa, and Mustafa I. Al-Najjar, “Mining Student Data Using Decision Trees”.

[10] Mrs. G.Subbalakshmi (M.Tech), Mr. K. Ramesh M.Tech, Asst. Professor, Mr. M. Chinna Rao M.Tech, (Ph.D.) Asst. Professor, “Decision Support in Heart Disease Prediction System using Naive Bayes”.