In the past, the question was "what happened?"; with data mining, we can discover what will happen and why (Nakasima et al., 2018). Data mining integrates various technologies such as statistics, machine learning and databases (Irshad et al., 2018). It has applications in different disciplines such as medicine, finance, defence and intelligence (Sohail et al., 2019). Its tools include clustering, classification, association and detection (Muhammad et al., 2017). Over the decades, data mining techniques have developed in many directions, including association extraction, neural networks, logic programming, rough sets and decision trees (Zhu et al., 2018). Furthermore, data mining has moved beyond relational databases to text mining and multimedia data (Ristoski et al., 2018); it is also involved in information security and detection (Santis et al., 2018). Despite these developments, companies still face challenges such as scalability (Sohail et al., 2017), yet data mining already handles massive datasets and is being extended toward terabyte sizes (Najjar et al., 2018). Given the enormous growth of data across disciplines, the question arises: can this technology meet the needs of extracting knowledge from petabyte-size data? This concerns the limitations and domain of data mining technology, on which this paper focuses. As data mining relies on algorithms, it is important to understand the limitations of data mining algorithms and tools in terms of time and space complexity. For example: can these algorithms complete in reasonable time? If the problem is decidable, what is its complexity? For future predictions, we need to learn more about the complexity of markets and business (Alonso et al., 2018) and how data mining performs on financial platforms.
b). Behaviour scoring/credit rating migration analysis. Valuation of a customer's or product's probability of a change in risk level within a given time (i.e., default rate volatility). In commercial lending, risk assessment is usually an attempt to quantify the risk of loss to the lender when making a particular lending decision. Here, credit risk can be quantified by the change in value of a credit product or of a whole credit customer portfolio, based on changes in the instrument's rating, its default probability, and its recovery rate in case of default. Diversification effects further influence the result at the portfolio level. Thus a major part of implementing and maintaining a credit risk management system is a typical data mining problem: modelling the credit instrument's value through default probabilities, rating migrations, and recovery rates.
• Parallel, distributed and incremental mining algorithms: Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are then processed in parallel; the results from the partitions are finally merged. Incremental algorithms are used to keep mined results up to date as the databases change; since these databases may contain noisy data, incremental algorithms are also applied to clean it.
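The partition-then-merge idea can be sketched minimally as follows. The task here (item frequency counting) is an illustrative assumption; in a real distributed miner each partition would be processed on a separate node or process before merging:

```python
from collections import Counter

def mine_partition(partition):
    """'Mine' one partition: here, simply count item frequencies."""
    return Counter(partition)

def parallel_mine(data, n_partitions=4):
    """Divide the data into partitions, mine each, then merge the results."""
    size = max(1, len(data) // n_partitions)
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # In a real system, each call below would run on a different worker.
    partial_results = [mine_partition(p) for p in partitions]
    merged = Counter()
    for result in partial_results:
        merged += result
    return merged

counts = parallel_mine(["a", "b", "a", "c", "a", "b"] * 10)
```

Because the merge step here is a simple sum of counters, the result is identical to mining the whole dataset at once, which is what makes the partitioning safe for this kind of task.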
Outlier detection is one of the most important tasks in data analysis. An outlier is an extreme observation. Typically, points farther than, say, three or four standard deviations from the mean are considered "outliers". In regression, however, the situation is somewhat more complex in the sense that some outlying points will have more influence on the regression than others. Outlier detection has been suggested for numerous applications, such as credit and fraud detection, clinical trials, voting irregularity analysis, network intrusion, severe weather prediction, geographic information systems, and other data mining tasks (Barnett and Lewis, 1995; Fawcett and Provost, 1997; Hawkins, 1980; Penny and Jolliffe, 2001). Outliers in data may be due to recording errors or system noise of various kinds, and as such need to be cleaned during the extract, transform, clean and load (ETCL) phase of the data mining/KDD process. On the other hand, an outlier or a small group of outliers may be error-free recordings that represent the most important part of the data and deserve further careful inspection; e.g., an outlier might represent an unusually high response to a particular advertising campaign, or an unusually effective dose-response combination in a drug therapy (Ben-Gal, 2005). Either way, it is quite important in data mining to detect outliers in large amounts of highly
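The simple rule above (flagging points several standard deviations from the mean) can be sketched as follows; the threshold is the common convention mentioned in the text, not a universal rule, and the data is invented for illustration:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [v for v in values if abs(v - m) / s > threshold]

# Thirty ordinary observations plus one extreme recording.
data = [9, 10, 11] * 10 + [100]
outliers = zscore_outliers(data)  # [100]
```

Note that with very small samples a single extreme point inflates the standard deviation so much that no z-score can exceed 3, which is one reason robust variants (e.g. median-based scores) are often preferred in practice.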
It enhances the web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer. Web mining is especially exciting because it enables tasks that were previously difficult to implement. Web mining tools can be configured to monitor and gather data from a wide variety of locations and can analyze the data across one or multiple sites. For example, search engines work on the principles of data mining.
Data Mining for Intrusion Detection
collecting and managing data; it also includes analysis and prediction. Thus, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. In other words, data mining - the extraction of hidden predictive information from large databases - is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. We can understand it as follows: data mining can be performed on data represented in quantitative, textual or multimedia forms. Data mining applications can use a variety of parameters to examine the data. They include association (patterns where one event is connected to another event, such as purchasing a pen and purchasing paper), sequence or path analysis (patterns where one event leads to another event, such as the purchase of a SIM number and then recharging it), classification (identification of new patterns, such as coincidences between duct tape purchases and plastic sheeting purchases), and clustering (finding and visually documenting groups of previously unknown facts).
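The "association" parameter above can be illustrated with a minimal support/confidence computation over hypothetical transactions; the item names and baskets are invented for the example:

```python
def association_strength(transactions, antecedent, consequent):
    """Compute support and confidence for the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    ante = sum(1 for t in transactions if antecedent in t)
    support = both / n                         # how often both occur together
    confidence = both / ante if ante else 0.0  # how often the rule holds
    return support, confidence

baskets = [
    {"pen", "paper"},
    {"pen", "paper", "stapler"},
    {"pen"},
    {"paper"},
    {"stapler"},
]
support, confidence = association_strength(baskets, "pen", "paper")
```

Here the rule "pen -> paper" has support 2/5 (two of five baskets contain both) and confidence 2/3 (two of the three pen-buying baskets also contain paper), which is the pen-and-paper pattern described in the text made concrete.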
Data warehouses typically contain large amounts of historical data. Data can feed into the data warehouse from multiple databases. For example, a large company may have many plants, each with its own database for managing day-to-day oper- ations. The company may wish to look for overall trends across the data from all plants. Each plant may have a different database schema. The names of the tables and attributes can differ between source databases. A plant in the United States may have an attribute named “state,” whereas another plant in Canada may use an attribute named “province.” The values of corresponding attributes may vary from plant to plant. Maybe one plant uses “B” as an operational code specifying a “blue” widget, and another plant uses “B” to specify “black.” The pertinent data needs to be extracted from the feeder database into a staging area. The data is cleaned and transformed in the staging area. Corresponding attributes are mapped to the same data warehouse attribute. Disparate operational codes are replaced with consistent surrogate IDs. Terminology is standardized across the company. Then the data is loaded into the data warehouse. Moving data from the feeder databases into the data warehouse is often referred to as an extract, transform, and load (ETL) process. Data in the warehouse can then be explored in a variety of ways, including OLAP, data mining, report generators, and ad hoc query tools. Figure 4.7 illustrates the overall flow of data.
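A minimal sketch of the transform step described above. The attribute names, operational codes, and surrogate IDs below are illustrative assumptions, not a real warehouse schema:

```python
# Per-plant mapping of operational colour codes to shared surrogate IDs.
CODE_TO_SURROGATE = {
    ("us_plant", "B"): 1,   # "B" means "blue" at the US plant
    ("ca_plant", "B"): 2,   # "B" means "black" at the Canadian plant
}

def transform(record, plant):
    """Map one feeder-database record into the warehouse schema."""
    out = {}
    # Standardize terminology: "state" and "province" both become "region".
    out["region"] = record.get("state") or record.get("province")
    # Replace the plant-specific operational code with a surrogate ID.
    out["color_id"] = CODE_TO_SURROGATE[(plant, record["color_code"])]
    return out

row = transform({"state": "Ohio", "color_code": "B"}, "us_plant")
```

In a full ETL pipeline this transform would run in the staging area, between extraction from the feeder databases and the load into the warehouse.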
The term business refers to any activity developed in a company in the most general sense, whatever the nature and aim of such activity (commercial, governmental, educational, etc.). Data mining is one of the technologies that enable Business Intelligence solutions to be implemented ("a fairly new term that incorporates a broad variety of processes and technologies to harvest and analyze specific information to help a business make sound decisions"). In fact, any business intelligence solution should include a data mining project to extract "the intelligence" of the business that will then be deployed accordingly. However, the truth is that data mining projects are being developed more as an art than as an engineering process, and this does not properly meet real business needs. Companies need to manage projects in the most controlled way, always trying to reduce risks without increasing costs. As there is no proper methodology for data mining projects, several different practices from different areas are applied instead. This leads to project failures, or to poor results - or at least results not as good as they could be.
The main objective of the smart farming system is to help the farmer make better decisions for high yield. The system considers all three areas that affect agricultural yield: irrigation, fertilizer and pesticide. The smart farming system is a web application with a large dataset of historical data in its backend. Data mining is used to find correlations or patterns among the dozens of fields in relational databases; a clustering algorithm is used for that purpose. Clustering is the process of grouping abstract objects into classes of similar objects. In cluster analysis, we first partition the set of data into groups based on data similarity. Among the various clustering techniques available, the k-nearest-neighbour technique is used here.
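A minimal sketch of the k-nearest-neighbour idea on hypothetical historical records. The feature values, labels and distance measure are invented for illustration; the real system's dataset and similarity measure may differ:

```python
from math import dist
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# (rainfall_mm, fertilizer_kg) -> yield class, purely illustrative
history = [
    ((80, 20), "high"), ((75, 18), "high"), ((82, 22), "high"),
    ((30, 5), "low"), ((35, 8), "low"), ((28, 6), "low"),
]
prediction = knn_predict(history, (78, 19))
```

A new season's conditions are compared against past seasons, and the yield class of the most similar historical records is returned.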
Depending on the type of data being mined, the pre-processing step may consist of several sub-tasks. If the raw data is very large, we could use sampling and work with fewer instances, or use multi-resolution techniques and work with data at a coarser resolution. Next, noise in the data is removed to the extent possible, and relevant features are extracted. In some cases, where data from different sources or sensors are available, data fusion may be required to allow the miner to exploit all the data available for a problem. At the end of this first step, we have a feature vector for each data instance. Depending on the problem and the data, we may need to reduce the number of features using feature selection or dimension reduction techniques such as principal component analysis (PCA) (Jackson, 1991) or its non-linear versions. After this pre-processing, the data is ready for the detection of patterns through the use of algorithms such as classification, clustering, regression, etc. These patterns are then displayed to the user for validation. Data mining is an iterative and interactive process. The output of any step, or feedback from the domain experts, could result in an iterative refinement of any, or all, of the sub-tasks. While there is some debate about the exact definition of data mining (Kamath, 2001), most practitioners and proponents agree that data mining is a multi-disciplinary field, borrowing ideas from machine learning and artificial intelligence, statistics, high performance computing, signal and image processing, mathematical optimization, pattern recognition, etc. What is new is the confluence of the mature offshoots of these technologies at a time when we can exploit them for the analysis of massive data sets. As data mining has been applied to new problem domains, this technology mix has grown as well.
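The sampling and feature-preparation steps above can be sketched minimally: a fixed-size random sample plus min-max scaling so that no feature dominates by raw magnitude. Real pipelines would add domain-specific noise removal and feature extraction:

```python
import random

def sample_instances(data, n, seed=0):
    """Work with fewer instances: draw a fixed-size random sample."""
    rng = random.Random(seed)
    return rng.sample(data, min(n, len(data)))

def min_max_scale(vectors):
    """Scale each feature to [0, 1] across all feature vectors."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(vec, lo, hi))
        for vec in vectors
    ]

scaled = min_max_scale([(1, 100), (2, 200), (3, 300)])
```

After steps like these, each instance is a comparable feature vector, ready for the classification, clustering or regression algorithms mentioned in the text.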
For example, the growth of the Internet and the World Wide Web has resulted in tasks such as clustering text documents, multimedia searches, or mining a user’s Web surfing patterns to predict what page they are likely to visit next or to target the advertising on a Web page. This has added natural language processing and privacy issues to the technological mix that comprises data mining.
Data pre-processing is a significant part of data mining; its main concern is making the collected data suitable for analysis and appropriate for clustering. A data warehouse sometimes contains duplicate records and missing data values. Data pre-processing removes the duplicate records and supplies the missing values based on past recorded data. It also minimizes memory usage and normalizes the values used to represent information in the database.
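A minimal sketch of the cleaning step described: dropping exact duplicate records and filling a missing numeric value with the mean of the recorded values. The field names are illustrative assumptions:

```python
def clean(records):
    """Remove duplicate records and fill missing 'value' fields with the mean."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))
    # Impute missing values from the data that was recorded.
    known = [r["value"] for r in unique if r["value"] is not None]
    mean = sum(known) / len(known) if known else 0.0
    for r in unique:
        if r["value"] is None:
            r["value"] = mean
    return unique

rows = [
    {"id": 1, "value": 10.0},
    {"id": 1, "value": 10.0},   # duplicate record
    {"id": 2, "value": None},   # missing value
    {"id": 3, "value": 20.0},
]
cleaned = clean(rows)
```

Mean imputation is only one simple choice; production systems often prefer median imputation or model-based estimates from the historical data.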
Data mining is the process of analyzing hidden patterns of data according to different perspectives, for categorization into useful information. The data is collected and assembled in common areas, such as data warehouses, for efficient analysis. Data mining tools predict future trends and behaviors, thus allowing businesses to make proactive, knowledge-driven decisions. Data mining principles have been around for many years but, with the advent of big data, have become even more prevalent. Big data caused an explosion in the use of more extensive data mining techniques, partly because the size of the information is much larger and partly because the information tends to be more varied and extensive in its very nature and content.
Therefore, data mining techniques are the result of a long process of research and product development (Elder et al., 1998). This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently has generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature: • Massive data collection
The performance of the various algorithms is listed below.

Table 1. Performance study of data mining algorithms

  Algorithm       Accuracy   Time taken
  Naïve Bayes     52.33%     609 ms
  Decision list   52%        719 ms
  K-NN            45.67%     1000 ms

For the diagnosis of heart disease, Naïve Bayes, K-NN and decision list were used; Naïve Bayes took the least time to produce an accurate result compared to the other algorithms. Sudha et al. proposed classification algorithms such as Naïve Bayes, decision tree and neural network for predicting stroke diseases; decision trees, a Bayesian classifier and a back-propagation neural network were adopted in their study. Records with irrelevant data were removed from the data warehouse before the mining process. Data mining classification consists of a classification model and an evaluation model. The classification model uses the training data set to build a predictive model, while the testing data set is used to measure classification efficiency. Decision tree, naïve Bayes and neural network classifiers were then used for stroke disease prediction. Performance evaluation was carried out for the three algorithms, the models were compared, and accuracy was measured. The comparison shows that the neural network performed better than the other two algorithms.
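As a sketch of the classification-model idea discussed above, here is a tiny categorical naive Bayes with Laplace smoothing. The features, labels and data are invented for illustration; the studies cited used full clinical datasets and tool implementations:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count class priors and per-feature value frequencies from training data."""
    priors = Counter(labels)
    freq = defaultdict(Counter)   # freq[(feature_idx, label)][value] = count
    vocab = defaultdict(set)      # vocab[feature_idx] = set of seen values
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            freq[(i, y)][v] += 1
            vocab[i].add(v)
    return priors, freq, vocab

def predict_nb(model, row):
    """Pick the label with the highest naive-Bayes score for `row`."""
    priors, freq, vocab = model
    total = sum(priors.values())
    scores = {}
    for y, c in priors.items():
        p = c / total
        for i, v in enumerate(row):
            # Laplace smoothing so unseen values never zero out the product.
            p *= (freq[(i, y)][v] + 1) / (c + len(vocab[i]))
        scores[y] = p
    return max(scores, key=scores.get)

# Hypothetical (blood_pressure, smoker) records with outcome labels.
X = [("high", "yes"), ("high", "yes"), ("low", "no"), ("low", "no")]
y = ["stroke", "stroke", "healthy", "healthy"]
model = train_nb(X, y)
label = predict_nb(model, ("high", "yes"))
```

The training step corresponds to building the classification model from the training set, and calling `predict_nb` on held-out rows corresponds to the evaluation step described above.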
Knowledge representation has an exceptional place in medical informatics. The need to unambiguously describe medical knowledge within clinical environments, inherently characterized by heterogeneous terminology and data, has given rise to ontology-based models, which have been proposed for various medical tasks such as aided reporting. The increasing amounts of medical data produced annually comprise an invaluable source of knowledge to be discovered, represented and exploited. Data mining, either supervised or unsupervised, provides the methodological tools to extract this knowledge. Supervised methods usually address data based on prior knowledge gained by training, whereas unsupervised methods group data into clusters based solely on the similarity of the data, without any training. The latter can complement the supervised methods. This study also discusses data mining applications.
With the collaborative approach of data mining and neural networks, we aim to develop new-generation algorithms that are expected to handle the diverse sources and types of data and to support mixed-initiative data mining, where human experts collaborate with the computer to form hypotheses and test them. According to Razvan Andonie and Boris Kovalerchuk, the main challenges to the data mining procedure can be summarized as follows [24]:
Organizations worldwide generate huge amounts of data, most of it unorganized. This unorganized data requires processing to generate meaningful and useful information. To organize the huge amount of data, we implement database management systems such as SQL Server. Structured Query Language (SQL) is a query language used to retrieve and manipulate the data stored in relational database management systems. However, SQL alone is not always adequate to meet end users' requirements for sophisticated information from an unorganized data bank. This paper describes the concepts of data mining, its process, its techniques and some of its applications.
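A minimal runnable illustration of SQL retrieval, using Python's built-in sqlite3 module in place of a full RDBMS such as SQL Server; the table and data are invented for the example:

```python
import sqlite3

# An in-memory database stands in for a full RDBMS such as SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("pen", 2.5), ("paper", 4.0), ("pen", 3.0)],
)

# SQL readily retrieves and aggregates the stored, organized data...
rows = conn.execute(
    "SELECT item, SUM(amount) FROM sales GROUP BY item ORDER BY item"
).fetchall()
# ...but discovering hidden patterns beyond such explicit queries is
# where data mining techniques take over.
```

This is exactly the boundary the paragraph draws: SQL answers the questions we know how to ask, while data mining looks for the patterns we did not know to ask about.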
Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to the sales at your branch. You immediately set out to perform this task. You carefully inspect the company's database or data warehouse, identifying and selecting the attributes or dimensions to be included in your analysis, such as item, price, and units sold. Alas! You note that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values which deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!
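The kinds of problems described (missing values, inconsistent codes) can be surfaced with a simple audit pass over the records. The attribute names and the set of valid department codes below are assumptions for illustration:

```python
VALID_DEPT_CODES = {"ELEC", "HOME", "TOYS"}  # assumed valid codes

def audit(tuples):
    """Report tuples with missing attribute values or inconsistent dept codes."""
    problems = []
    for i, t in enumerate(tuples):
        missing = [k for k, v in t.items() if v is None]
        if missing:
            problems.append((i, "missing", missing))
        dept = t.get("dept")
        if dept is not None and dept not in VALID_DEPT_CODES:
            problems.append((i, "inconsistent", ["dept"]))
    return problems

sales = [
    {"item": "tape", "price": 1.99, "dept": "HOME"},
    {"item": "sheeting", "price": None, "dept": "HOME"},   # missing price
    {"item": "radio", "price": 24.99, "dept": "ELEC_"},    # bad dept code
]
issues = audit(sales)
```

An audit like this is typically the first pass of data cleaning: it tells you how incomplete, noisy, or inconsistent the data is before any mining begins.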
At Data Miners, the analytic marketing consultancy I founded in 1997, we firmly believe that data mining projects succeed or fail on the basis of the quality of the data mining process and the suitability of the data used for mining. The choice of particular data mining techniques, algorithms, and software is of far less importance. It follows that the most important part of a data mining project is the careful selection and preparation of the data, and one of the most important skills for would-be data miners to develop is the ability to make connections between customer behavior and the tracks and traces that behavior leaves behind in the data. A good cook can turn out gourmet meals on a wood stove with a couple of cast iron skillets or on an electric burner in the kitchenette of a vacation condo, while a bad cook will turn out mediocre dishes in a fancy kitchen equipped with the best and most expensive restaurant-quality equipment. Olivia Parr Rud understands this. Although she provides a brief introduction to some of the trendier data mining techniques, such as neural networks and genetic algorithms, the modeling examples in this book are all built in the SAS programming language using its logistic regression procedure. These tools prove to be more than adequate for the task.
Another important predictive methodology is represented by tree models, described in Section 4.5, which can be used for regression and clustering purposes. There is a fundamental difference between cluster analysis, on one hand, and logistic regression and tree models, on the other hand. In logistic regression and tree models, the clustering is supervised - it is measured against a reference variable (target or response) whose levels are known. In cluster analysis the clustering is unsupervised - there are no reference variables. The clustering analysis determines the nature and the number of groups and allocates the observations within them. Section 4.6 deals with neural networks. It examines two main types of network: the multilayer perceptron, which can be used for predictive purposes, in a supervised manner; and the Kohonen networks (also known as self-organising maps), which are clustering methods useful for unsupervised learning. Section 4.7 deals with another important predictive methodology, based on the rather flexible class of nearest-neighbour methods, sometimes called memory-based reasoning models. Section 4.8 deals with the two most important local data mining methods: association and sequence rules, which are concerned with relationships between variables, and retrieval by content, which is concerned with relationships between observations. Finally, Section 4.9 contains a brief overview of recent computational methods and gives some pointers to the literature.