– This step involves analyzing a business problem to assess whether it is suitable for data mining. If it is, an assessment must then be made of the availability of data, the data mining technique to be used (induction, association rules, etc.), and how the results of the mining will be deployed as part of the overall solution.
Initially, simple structured query languages were used to find trends and patterns in collected data, but with the growing demand for sophisticated information it became clear that such languages (like SQL) are not adequate to meet the needs of the users of these large data collections. A SQL query is generally written to retrieve data that the user has already specified: researchers applying a SQL query to a data collection know in advance what kind of output to expect, and that output is usually a subset of the database. Researchers applying data mining to a data collection, by contrast, are not exactly sure what the mining result will be, so the output of a data mining query is better described as an analysis of the contents of the database.

Data mining deals with extracting useful information from vast amounts of data in a variety of application fields, such as medicine, customer relationship management, credit card analysis, market basket analysis, and bioinformatics. The extracted information may take the form of patterns, trends, clusters, or classification models, and mining can be applied to both centralized and distributed data. Association rules over the shopping records of a supermarket are a classic example: they describe relationships among items bought together. In customer relationship management, customers can be clustered into groups according to different metrics to improve how they are served. Classification models, in turn, can be built on recorded data such as customer shopping histories and behavior to support better decision making for targeted marketing.
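As a concrete illustration of the supermarket example above, the following sketch mines simple two-item association rules from a handful of invented transactions. The items, thresholds, and data are purely illustrative and not taken from any study discussed here.

```python
# A minimal sketch of mining two-item association rules from shopping
# transactions; items and thresholds are invented for illustration.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def association_rules(transactions, min_support=0.4, min_confidence=0.6):
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        for item in t:
            item_counts[item] += 1
        for pair in combinations(sorted(t), 2):
            pair_counts[pair] += 1
    rules = []
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue
        # Consider the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, count / n, confidence))
    return rules

rules = association_rules(transactions)
```

Each rule carries its support (fraction of baskets containing both items) and its confidence (how often the consequent appears when the antecedent does), the two measures association rule miners typically report.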
Widely accepted data mining tasks include association rule mining, clustering, classification and prediction, and outlier detection. Being popular and widely available, data mining has been used in many prominent areas, including financial data analysis, credit card fraud detection, identification of unusual patterns, analysis of telecommunication data, and biomedical and DNA data analysis.
Conventional tools cannot extract this kind of information from such data, nor can they handle the continuously growing volume of data being generated. We therefore need a new kind of technology and platform that can capture significantly large amounts of incoming data so that it can be processed, analyzed, visualized, stored, and shared; moreover, connections and correlations within these data have to be found. This can be done using data mining. Data mining must accommodate different data repositories so that it can manage any form of data. Data mining techniques are used in object-relational systems to find patterns or trends in the stored objects. For example, the sales records of previous years of a large e-commerce company can be used to find buying trends over the years. Similarly, geographical databases are used for environmental and ecological planning, and astronomical data is used to predict the paths of objects moving through space.
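The e-commerce example above can be sketched in a few lines: aggregate sales records by year and check for a simple monotone buying trend. The records and product names below are made up for illustration.

```python
# A toy sketch of finding buying trends in historical sales records,
# in the spirit of the e-commerce example; the data is invented.
from collections import defaultdict

sales = [
    (2019, "laptop", 120), (2019, "phone", 300),
    (2020, "laptop", 150), (2020, "phone", 340),
    (2021, "laptop", 210), (2021, "phone", 330),
]

def yearly_totals(records):
    totals = defaultdict(int)
    for year, _product, units in records:
        totals[year] += units
    return dict(sorted(totals.items()))

def trend(totals):
    # True if total sales grew every year: a crude monotone-trend check.
    values = list(totals.values())
    return all(b > a for a, b in zip(values, values[1:]))

totals = yearly_totals(sales)
```

Real trend analysis would of course use regression or seasonal decomposition rather than this strict year-over-year comparison; the sketch only shows the aggregation step.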
Much of the data used in DIID will be non-textual: for example, satellite imagery, video surveillance footage, or photographs of the symptoms of a disease at the various stages of its lifecycle. As a result, image and video processing using content-based analysis techniques are likely to become significant features of data mining in DIID. Some examples are: automatically analysing satellite images to detect regions of crop failure based on colour; detecting body-temperature anomalies using infrared video as people move through airports or other public places; and automatic diagnosis of a condition by analysing photographs of skin lesions and comparing them with known cases in a database. Analysis of video surveillance footage has already been applied to situations relevant to DIID, including surveillance in public transport (Sun 2004), and image analysis has been used as the basis for automated condition classification (Lewis 2004).
Organizations worldwide generate huge amounts of data, most of it unorganized. This unorganized data requires processing to yield meaningful and useful information. To organize such large amounts of data, we apply database management systems such as SQL Server. Structured Query Language (SQL) is the query language used to retrieve and manipulate data stored in relational database management systems. However, SQL alone is not always adequate to meet end users' requirements for sophisticated information drawn from an unorganized data bank. This paper describes the concepts of data mining, its process, its techniques, and some of its applications.
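To make the SQL half of this contrast concrete, the following minimal sqlite3 sketch shows a query returning a specific, predictable subset of the stored records; the table and values are invented.

```python
# A minimal example of SQL retrieval: the analyst knows in advance that
# this query returns orders over 100. Table and rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 40.0), ("alice", 80.0), ("carol", 200.0)],
)

# The output is a database subset, fully determined by the query.
big_orders = conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > 100 ORDER BY amount"
).fetchall()
```

A mining query over the same table would instead return patterns (for example, which customers tend to buy together or when), something no single `SELECT ... WHERE` predicate expresses.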
There already exist several proposals for extending a query language with data mining primitives. The best-known examples are the SQL-like operator MINE RULE of Meo et al. for mining association rules, and the data mining query language DMQL of Han et al. In both studies, however, the language constructs only allow one to specify the desired output; the output is not integrated back into the database. Our proposal goes beyond this by also allowing the results to be used as input for further data mining queries, since they are treated as regular database tables.
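The closure property described above can be illustrated with a small sketch (this is not the MINE RULE or DMQL syntax itself): frequent item pairs are mined from a transaction table and written back as an ordinary table that later queries can reuse. The schema and thresholds are invented.

```python
# Sketch of mining results stored back as a regular table, so a later
# query can filter or re-mine them; schema and data are illustrative.
import sqlite3
from itertools import combinations
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany("INSERT INTO baskets VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "butter"),
    (3, "bread"), (3, "butter"),
])

# Mine frequent pairs in application code.
rows = conn.execute("SELECT tid, item FROM baskets ORDER BY tid, item").fetchall()
baskets = {}
for tid, item in rows:
    baskets.setdefault(tid, []).append(item)
counts = Counter(p for items in baskets.values()
                 for p in combinations(items, 2))

# Store the mining output as an ordinary table.
conn.execute("CREATE TABLE frequent_pairs (a TEXT, b TEXT, support INTEGER)")
conn.executemany("INSERT INTO frequent_pairs VALUES (?, ?, ?)",
                 [(a, b, c) for (a, b), c in counts.items() if c >= 2])

# The result is now queryable like any other relation.
reused = conn.execute(
    "SELECT a, b, support FROM frequent_pairs ORDER BY a, b").fetchall()
```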
Other possible uses of information technology in the field of pharmaceuticals include pricing (a two-tier pricing strategy) and the exchange of information between vertically integrated drug companies for mutual benefit. Nevertheless, the challenge of data collection remains. Data mining, often called pattern analysis on large data sets, uses tools such as association, clustering, segmentation, and classification to support better manipulation of the data, helping pharmaceutical firms compete on lower costs while improving the quality of drug discovery and delivery. There are, in general, three stages of drug development: discovery, in which new drugs are found and development tests predict drug behavior; clinical trials, which test the drug in humans; and commercialization, which takes the drug to its likely consumers (doctors and patients). A simple association technique could help measure outcomes that would greatly enhance a patient's quality of life, for example faster restoration of the body's normal functioning; such a benefit is much sought after by patients and could help a firm better position its drug against the competition. This paper describes how data mining discovers and extracts useful patterns from these large data sets and demonstrates its ability to improve the quality of decision making in the pharmaceutical industry.
Two major classes of data mining algorithms are identified in the literature [3, 5]: supervised (classification) algorithms, represented by Bayesian methods, neural networks, decision trees, genetic algorithms, fuzzy sets, and k-nearest neighbor; and unsupervised algorithms, such as association rules and clustering.
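As a concrete instance of one of the supervised algorithms listed above, here is a tiny k-nearest-neighbor classifier sketched in pure Python on made-up 2-D points.

```python
# A minimal k-nearest-neighbor classifier; training points are invented.
import math
from collections import Counter

training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

def knn_predict(point, training, k=3):
    # Vote among the k closest labeled examples.
    dists = sorted((math.dist(point, p), label) for p, label in training)
    votes = Counter(label for _d, label in dists[:k])
    return votes.most_common(1)[0][0]
```

Clustering, the unsupervised counterpart, would instead group the points without any labels being supplied.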
In computer science and data mining, Apriori is a classic algorithm for learning association rules. Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits). Frequent itemsets play an essential role in many data mining tasks that look for interesting patterns in databases, such as association rules, correlations, sequences, episodes, classifiers, and clusters, of which association rule mining is one of the most popular. In this paper, we take the classic Apriori algorithm and improve it significantly by introducing what we call a vertical sort. We then use a large data set of web documents to compare our performance against several state-of-the-art implementations and demonstrate not only equal efficiency with lower memory usage at all support thresholds, but also the ability to mine support thresholds as yet unattempted in the literature. We also indicate how we believe this work can be extended to achieve still more impressive results. Starting from the classic algorithm for this problem and introducing a conceptually simple idea, sorting, we have produced an implementation that matches the results and performance of the best state-of-the-art implementations and, in consequence, outperforms all of those available.
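A compact rendering of the classic Apriori loop may make the algorithm concrete. This sketch shows only the textbook candidate-generate-and-prune cycle; it does not reproduce the vertical-sort optimization the paper introduces, and the transactions and threshold are invented.

```python
# The classic Apriori cycle: generate candidate k-itemsets from frequent
# (k-1)-itemsets, then prune by counting support. Data is illustrative.
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return all itemsets appearing in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join pairs of frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune by counting actual support in the transactions.
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support}
        frequent |= current
        k += 1
    return frequent

txns = [{"bread", "milk"}, {"bread", "butter", "milk"},
        {"bread", "butter"}, {"milk", "butter"},
        {"bread", "milk", "butter"}]
freq = apriori(txns, min_support=3)
```

The key Apriori insight is visible in the loop: a k-itemset can only be frequent if it is built from frequent (k-1)-itemsets, so the candidate space shrinks level by level.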
Technically, text mining is the process of deriving high-quality information from text: hidden patterns and trends are extracted from textual data. Text mining has multiple uses, such as text classification, opinion mining, text clustering, and document summarization. It is an exciting research area because it tries to discover knowledge in unstructured text, and it is also known as Text Data Mining (TDM) and Knowledge Discovery in Textual Databases (KDT). KDT plays an increasingly important role in emerging applications such as text understanding. The text mining process is much the same as data mining, except that data mining tools are designed to handle structured data, whereas text mining can handle unstructured or semi-structured data sets such as emails, HTML files, and full-text documents. Text mining is employed to find new, previously unknown information in different written resources. Structured data is data that resides in a fixed field within a record or file; such data is held in relational databases and spreadsheets. Unstructured data generally refers to information that does not reside in a traditional row-column database; it is the opposite of structured data. Semi-structured data is typed data in a conventional database system. Text mining is a new area of computer science research that tries to solve problems arising in data mining, machine learning, information extraction, natural language processing, information retrieval, knowledge management, and classification.
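A minimal text-mining step can be sketched as simple term counting over unstructured text: tokenize, drop stop words, and keep the most frequent terms. The documents and the stop-word list below are illustrative; real systems use far richer linguistic processing.

```python
# Extracting the most telling terms from unstructured text by simple
# term counting; the document and stop-word list are invented.
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in", "it", "than"}

def top_terms(text, n=3):
    # Tokenize, drop stop words, and keep the most frequent terms.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(n)]

doc = ("Text mining extracts patterns from text. Mining unstructured "
       "text is harder than mining structured tables.")
terms = top_terms(doc)
```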
Data mining is the need of the hour for the knowledge discovery process, and it has benefited the market and retail industry. Marketing people can keep track of which products sold best and which sold worst, and the same information helps them run campaigns. Retailers, too, keep records of their goods and can therefore offer discounts at the right time to earn better profits. Not only the private sector but also the public sector has made good use of data mining: banks use it to keep track of the loans they have given and the rates by which they should be increased, and it has helped multinational companies build up their databases. For all this help, data mining has had its drawbacks. Users sometimes have no privacy over their data, which raises doubts about its security. Data mining, however successful a technique, is also time-consuming, because finding the required data in a large mass takes patience and time. This raises uncomfortable questions: what do we do if our data is misused, or if our database is hacked? Hacking data in a mining system is not easy, but there have been a few exceptions, and companies are working on these demerits. If you leave no loose ends, you can avoid being cheated. Every coin has two sides, and you have the right to choose yours.
in memory. In the last decade, there has been an increased focus on scalability, but research in the community has concentrated on scaling analysis over data stored in the file system. This tradition of performing data mining outside the DBMS has led to a situation where starting a data mining project requires its own separate environment. Data is dumped or sampled out of the database, and then a series of Perl, Awk, and special-purpose programs are used for data preparation. This typically leaves the familiar large trail of droppings in the file system, creating an entirely new data management problem outside the database. In particular, such "export" creates nightmares of consistency with the data in warehouses. This is ironic, because database systems were created to solve exactly such data management problems.
STEP 4: Transformation of data. Transformation in this step can be defined as decreasing the dimensionality of the data sent to the mining step. There are often cases where a particular data set has a high number of attributes; by reducing the dimensionality, we increase the efficiency of the data mining step with respect to both accuracy and running time.
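One simple way to reduce dimensionality, sketched below, is to drop near-constant attributes before mining; in practice this step may use other techniques such as projection methods. The threshold and data are illustrative.

```python
# Dropping low-variance columns as a crude dimensionality reduction;
# the variance threshold and the records are invented.

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def drop_low_variance(rows, threshold=0.01):
    """rows: equal-length numeric records; returns indexes of kept columns."""
    n_cols = len(rows[0])
    columns = [[row[i] for row in rows] for i in range(n_cols)]
    return [i for i, col in enumerate(columns) if variance(col) > threshold]

rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.0],
    [3.0, 5.0, 0.1],
]
kept = drop_low_variance(rows)
```

Columns that barely vary carry little information for most mining algorithms, so removing them shrinks the attribute space at minimal cost to accuracy.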
The air quality data were analysed using the data mining software package Clementine (originally from Integral Solutions Ltd. and now from SPSS Inc.). While this provides standard machine learning algorithms to generate models, its great virtue is the powerful visual environment it offers for data exploration. The figure shows an example of a data processing stream in Clementine's graphical interface. This ease of data exploration and modelling is crucial in allowing the domain expert to attack the problem and find
The problem that often confronts researchers new to the field is that a variety of data mining techniques are available: which one to choose? All of these tools give you answers. Some are more difficult to use than others, and they differ in other, superficial ways, but most importantly the underlying algorithms differ, and the nature of these algorithms is directly related to the quality of the results obtained and to ease of use. This paper addresses those issues.
4. Data Management Strategy: handling data access efficiently during the search/optimization. The final component in any data mining algorithm is the data management strategy: the ways in which the data are stored, indexed, and accessed. Most well-known data analysis algorithms in statistics and machine learning have been developed under the assumption that all individual data points can be accessed quickly and efficiently in random-access memory (RAM). While main memory technology has improved rapidly, there have been equally rapid improvements in secondary (disk) and tertiary (tape) storage technologies, to the extent that many massive data sets still reside largely on disk or tape and will not fit in available RAM. Thus, there will probably be a price to pay for accessing massive data sets, since not all data points can be simultaneously close to the main processor.
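When the data set does not fit in RAM, the algorithm must stream it in pieces rather than assume random access. This toy sketch computes a mean one chunk at a time; the chunking of an in-memory list stands in for reads from disk or tape.

```python
# Streaming computation over data processed one chunk at a time, as an
# out-of-core algorithm would; the list stands in for a disk-resident set.

def chunked(values, size):
    for i in range(0, len(values), size):
        yield values[i:i + size]

def streaming_mean(chunks):
    total, count = 0.0, 0
    for chunk in chunks:          # each chunk is all that is "in memory"
        total += sum(chunk)
        count += len(chunk)
    return total / count

data = list(range(1, 101))
mean = streaming_mean(chunked(data, size=10))
```

The point is that only a running total and a count survive between chunks, so memory use is independent of the size of the data set.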
2.1. General Schema. A general schema is suggested (see Figure 2.1) for finding association rules on a dataset. Taking into account that a data mining process (the search for association rules, for example) is part of a KDD process (Knowledge Discovery in Databases: cleaning, pre-processing, data mining, results validation, . . . ), and that the algorithm has been tested on a real medical database, it was decided to include in the schema a pre-processing phase that formats the data for the problem. Assume that the database is initially distributed in a vertical split (non-overlapping subsets of attributes on each workstation):
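The vertical split just described can be sketched as follows: each site holds a non-overlapping subset of the attributes, with a shared row identifier tying the fragments together. The table, attribute names, and partitioning are invented for illustration.

```python
# Vertical partitioning of a table: each site keeps some attributes plus
# the shared row id. Table contents and the split are illustrative.

table = [
    {"id": 1, "age": 34, "income": 40, "diagnosis": "A"},
    {"id": 2, "age": 51, "income": 72, "diagnosis": "B"},
]

def vertical_split(rows, partitions):
    """partitions: list of attribute-name lists; 'id' is kept everywhere."""
    return [
        [{k: row[k] for k in ["id"] + attrs} for row in rows]
        for attrs in partitions
    ]

site1, site2 = vertical_split(table, [["age", "income"], ["diagnosis"]])
```

A distributed miner then counts itemset supports per site and joins the partial results on the row identifier.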
We explore different methods of data mining in the field of aviation and their effectiveness. The field of aviation is always searching for new ways to improve safety. However, due to the large amounts of aviation data collected daily, parsing through it all by hand would be impossible. Because of this, problems are often found by investigating accidents. With the relatively new field of data mining, we are able to parse through an otherwise unmanageable amount of data to find patterns and anomalies that indicate potential incidents before they happen. The data mining methods outlined in this paper include Multiple Kernel Learning algorithms, Hidden Markov Models, Hidden Semi-Markov Models, and Natural Language Processing.
To get accurate results from the data mining process, both current and historical data should be available, but keeping historical data in a regular database has a negative effect on the database itself. Old data is usually not used for everyday transactions; it is used for data mining and reporting. Storing historical data in the everyday database causes a huge increase in its size, which leads to slower performance. A good practice is to move the old data from its different sources and integrate the whole into another repository called a data warehouse. Moving the data from operational databases to a data warehouse involves three steps:
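The excerpt does not enumerate the three steps here; a common decomposition of warehouse loading is extract, transform, load, and this toy sketch assumes that breakdown. The source rows, cleaning rule, and warehouse representation are all invented.

```python
# A toy extract-transform-load pipeline (assumed decomposition, not
# necessarily the three steps the source enumerates); data is invented.

def extract(source):
    return list(source)                     # pull rows from an operational store

def transform(rows):
    # Cleaning/integration: normalize names, drop incomplete records.
    return [
        {"customer": r["customer"].strip().lower(), "amount": r["amount"]}
        for r in rows if r.get("amount") is not None
    ]

def load(rows, warehouse):
    warehouse.extend(rows)                  # append into the warehouse store
    return warehouse

source = [
    {"customer": " Alice ", "amount": 10.0},
    {"customer": "BOB", "amount": None},
]
warehouse = load(transform(extract(source)), [])
```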
These, then, are the most suitable and most frequently used algorithms in the computing world, and the range of areas in which they can be applied is astonishing. At present, data mining is a new and important area of research, well suited to solving data problems because of its robustness, self-organization, adaptivity, parallel processing, distributed storage, and high degree of fault tolerance. Commercial, educational, and scientific applications are increasingly dependent on these methodologies.