DATA MINING AND KNOWLEDGE DISCOVERY FROM RESEARCH PROBLEMS


Kamlesh Kumar1, Bhavesh Kumar Chauhan2, J.P. Pandey3, Arvind Kumar Tomer4

Abstract—

This paper begins with the definition of data mining, then covers data mining classifications, data mining techniques, issues in data mining, and data mining applications, and ends with research problems in data mining. We discuss the research challenges in engineering sciences and medical sciences from the data mining perspective, with a focus on the following problems: (1) biological and environmental problems, (2) process-related problems, (3) dealing with non-static, unbalanced and cost-sensitive data, (4) data mining in a network setting, and (5) mining in and for computer networks.

Index Terms— Databases, Data mining, Knowledge discovery, Knowledge representation, KDD, Biological data, Network data, Network security, High-speed access.

I. INTRODUCTION

We are in an age often referred to as the information age, because we believe that information leads to power and success, and thanks to sophisticated technologies such as computers, satellites, etc., we have been collecting tremendous amounts of information. Initially, with the advent of computers and means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of computers to help sort through this amalgam of information.

Unfortunately, these massive collections of data stored on disparate structures very rapidly became overwhelming. This initial chaos led to the creation of structured databases and database management systems (DBMS). Efficient database management systems have been very important assets for managing a large corpus of data, and especially for effective and efficient retrieval of particular information from a large collection whenever needed. Nowadays, we have far more information than we can handle, from business transactions and scientific data to satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for decision-making. Confronted with huge collections of data, we now need new tools to help us make better decisions: automatic summarization of data, extraction of the essence of the information stored, and the discovery of hidden patterns in raw data. Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. While data mining and knowledge discovery in databases are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. Other similar terms referring to data mining are data dredging, knowledge extraction and pattern discovery.

The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge. The iterative process consists of the following steps:

Data cleaning: also known as data cleansing, it is a phase in which noisy data and irrelevant data are removed from the collection.

Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common source.

Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.

Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.

Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns.

Pattern evaluation: in this step, the truly interesting patterns representing knowledge are identified based on given measures.

Knowledge representation: It is the final phase in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
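
The steps listed above can be read as a chain of small functions, each consuming the output of the previous one. The following Python fragment is only an illustrative sketch: the record fields (customer, amount, note), the two sample sources, the mean-spend "pattern", and the threshold of 50.0 are all assumptions made for the example and are not part of the KDD definition.

```python
from statistics import mean

def clean(records):
    """Data cleaning: drop records with missing or clearly invalid amounts."""
    return [r for r in records if r.get("amount") is not None and r["amount"] >= 0]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    combined = []
    for source in sources:
        combined.extend(source)
    return combined

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: group amounts by customer."""
    grouped = {}
    for r in records:
        grouped.setdefault(r["customer"], []).append(r["amount"])
    return grouped

def mine(grouped):
    """Data mining: a deliberately trivial 'pattern', the mean spend per customer."""
    return {customer: mean(amounts) for customer, amounts in grouped.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep patterns that pass an interestingness measure."""
    return {c: m for c, m in patterns.items() if m >= threshold}

def present(patterns):
    """Knowledge representation: render the results for the user."""
    return [f"{c}: {m:.2f}" for c, m in sorted(patterns.items())]

# Hypothetical sample data from two sources.
store_a = [{"customer": "ann", "amount": 120.0, "note": "x"},
           {"customer": "bob", "amount": None, "note": "y"}]
store_b = [{"customer": "ann", "amount": 80.0, "note": "z"},
           {"customer": "bob", "amount": 10.0, "note": "w"}]

data = integrate(clean(store_a), clean(store_b))
data = select(data, ["customer", "amount"])
report = present(evaluate(mine(transform(data)), threshold=50.0))
print(report)
```

In practice the process is iterative: the output of pattern evaluation often sends the analyst back to cleaning or selection with a revised question.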

II. CLASSIFICATION OF DATA MINING

Data mining draws on many different types of collected information, such as business transactions, scientific data, medical and personal data, surveillance video and pictures, satellite sensing, games, digital media, CAD and software engineering data, virtual worlds, text reports and memos (e-mail messages), and World Wide Web repositories. There are many data mining systems available or being developed. Some are specialized systems dedicated to a given data source or confined to limited data mining functionalities; others are more versatile and comprehensive. Data mining systems can be categorized according to various criteria, including the following:

The type of data source mined: this classification categorizes data mining systems according to the type of data handled such as spatial data, multimedia data, time-series data, text data, World Wide Web, etc.

The data model drawn on: this classification categorizes data mining systems based on the data model involved such as relational database, object-oriented database, data warehouse, transactional, etc.

The kind of knowledge discovered: this classification categorizes data mining systems based on the kind of knowledge discovered or data mining functionalities, such as characterization, discrimination, association, classification, clustering, etc. Some systems tend to be comprehensive systems offering several data mining functionalities together.

The mining techniques used: data mining systems employ and provide different techniques. This classification categorizes data mining systems according to the data analysis approach used, such as machine learning, neural networks, genetic algorithms, statistics, visualization, database-oriented or data warehouse-oriented, etc. The classification can also take into account the degree of user interaction involved in the data mining process, such as query-driven systems, interactive exploratory systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining techniques to fit different situations and options, and offer different degrees of user interaction.

III. ISSUES IN DATA MINING

While data mining is still in its infancy, it is rapidly becoming ubiquitous, and its growth raises a number of issues. Some of these issues are:

Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

User interface issues: The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results and helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are screen real-estate, information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs, the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently.

Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information. Even more than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends on the number of dimensions in the domain space, and it usually grows exponentially as the number of dimensions increases. This is known as the curse of dimensionality. This 'curse' affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
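
The exponential growth described above can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each dimension is discretized into 10 bins and that one million data points are available; both numbers are hypothetical. It shows how quickly the number of grid cells outgrows the data.

```python
def grid_cells(bins_per_dimension, dimensions):
    """Number of cells in a regular grid over the domain space."""
    return bins_per_dimension ** dimensions

def points_per_cell(n_points, bins_per_dimension, dimensions):
    """Average number of data points available per cell."""
    return n_points / grid_cells(bins_per_dimension, dimensions)

# With a fixed dataset size, the average occupancy per cell collapses
# as dimensions are added, so most of the search space is empty.
for d in (1, 2, 5, 10):
    print(d, grid_cells(10, d), points_per_cell(1_000_000, 10, d))
```

With these assumed numbers, the average occupancy falls from 100,000 points per cell in one dimension to 10 points per cell in five dimensions, and to one point per 10,000 cells in ten dimensions.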

Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results can be merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available without having to re-analyze the complete dataset.
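
As a minimal sketch of what incremental updating and mergeable results can look like, the following Python class maintains a running count, mean, and variance. It uses the well-known Welford update for single values and the pairwise combination formula for merging two partial summaries; the class name RunningStats and the sample partitions are assumptions made for this illustration, and real mining results (rules, clusters, models) are generally much harder to merge than a mean.

```python
from dataclasses import dataclass

@dataclass
class RunningStats:
    n: int = 0          # number of values seen
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the mean

    def update(self, x):
        """Incremental update: fold one new value in without rescanning old data."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other):
        """Combine two partial summaries computed on disjoint partitions."""
        if other.n == 0:
            return RunningStats(self.n, self.mean, self.m2)
        if self.n == 0:
            return RunningStats(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else 0.0

# Two partitions summarized independently (e.g. on different workers).
left, right = RunningStats(), RunningStats()
for x in [1.0, 2.0, 3.0, 4.0]:
    left.update(x)
for x in [5.0, 6.0, 7.0, 8.0]:
    right.update(x)

total = left.merge(right)   # merged result for the whole dataset
total.update(9.0)           # new data arrives later; no re-analysis needed
print(total.n, total.mean, total.variance)
```

The design point is that each partition is scanned once and only a constant-size summary is exchanged, which is the property that makes both parallel mining and later updates cheap.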

Data source issues: There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data, since we already have more data than we can handle and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data in the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

IV. APPLICATIONS OF DATA MINING

Data mining is a relatively new technology that has not fully matured. Despite this, a number of industries already use it on a regular basis, including retail stores, hospitals, banks, and insurance companies. Many of these organizations combine data mining with statistics, pattern recognition, and other important tools. Data mining can be used to find patterns and connections that would otherwise be difficult to find. This technology is popular with many businesses because it allows them to learn more about their customers and make smart marketing decisions. There are a number of applications in business. The first is market segmentation. With market segmentation, it is possible to find behaviors that are common among customers and, subsequently, to look for patterns among customers who tend to purchase the same products at the same time. Another application is customer churn analysis, which estimates which customers are most likely to stop purchasing a company's products or services and move to a competitor. In addition, a company can use data mining to find out which purchases are most likely to be fraudulent. While many businesses use data mining to help increase their profits, many of them do not realize that it can also be used to create new businesses and industries. One industry that can be created by data mining is the automatic prediction of both behaviors and trends. Instead of simply guessing what the next big trend will be, a company can determine it based on statistics, patterns, and logic. Another example of automatic prediction is using data mining to look at past marketing strategies: Which one worked best? Why did it work best? Which customers responded most favorably to it? Data mining allows us to answer these questions, and once we have the answers, we can avoid repeating the mistakes of previous marketing campaigns.
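
One simple way to look for products that tend to be purchased together is to count how often each pair of items co-occurs in the same transaction. The sketch below is a minimal illustration of that idea, not a full association-rule miner; the function name frequent_pairs, the sample baskets, and the support threshold of 0.5 are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs whose share of baskets is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical purchase transactions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]

pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

Counting all pairs is quadratic in basket size, so on realistic data one would prune infrequent single items first, as Apriori-style algorithms do.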

Data mining can help us become better at what we do. It is also a powerful tool for those who deal with finances. A financial institution such as a bank can predict the number of defaults that will occur among its customers within a given period of time, and it can also predict the amount of fraud that will occur.

Another potential application of data mining is the automatic recognition of patterns that were not previously known. Imagine if we had a tool that could automatically search our database to look for patterns which are hidden. With access to this technology, we would be able to find relationships that could allow us to make strategic decisions. Because our decisions are based on logic, we would increase the chances of being successful. While data mining is a very valuable tool, it is important to realize that it is not a panacea. Even if an automated technology should be invented, it will not guarantee the success of an individual or a company. However, it will tip the odds in our favor.

V. SOME UNSOLVED


Data Mining for Biological and Environmental Problems: Many researchers that we surveyed believe that mining biological data continues to be an extremely important problem, both for data mining research and for biomedical sciences. An example of a research issue is how to apply data mining to HIV vaccine design. In molecular biology, many complex data mining tasks exist which cannot be handled by standard data mining algorithms. These problems involve many different aspects, such as DNA, chemical properties, 3D structures, and functional properties. There is also a need to go beyond bio-data mining. Data mining researchers should consider ecological and environmental informatics. One of the biggest concerns today, which is going to require significant data mining efforts, is the question of how we can best understand and hence utilize our natural environment and resources, since the world today is highly "resource-driven". Data mining will be able to make a high impact in the area of integrated data fusion and mining in ecological/environmental applications, especially when involving distributed/decentralized data sources, e.g. autonomous mobile sensor networks for monitoring climate and/or vegetation changes. For example, how can data mining technologies be used to study and find out contributing factors in the observed doubling of the number of hurricane occurrences over the past decades, as recently reported in Science magazine? Most of the data sources that we are dealing with today are fast evolving, e.g. those from stock markets or city traffic. There is much interesting knowledge yet to be discovered, as far as the dynamic change regularities and/or their cross-interactions are concerned. In this regard, one of the challenges today is how to deal with the problem of dynamic temporal behavioral pattern identification and prediction in: (1) very large scale systems (e.g. global climate changes and potential "bird flu" epidemics) and (2) human-centered systems (e.g. user-adapted human-computer interaction or P2P transactions). Related to these questions about important applications, there is a need to focus on "killer applications" of data mining. So far three important and challenging applications for data mining have emerged: bioinformatics, CRM/personalization and security applications. However, more explorations are needed to expand these applications and extend the list of applications.

Data Mining Process-Related Problems: Important topics exist in improving data mining tools and processes through automation, as suggested by several researchers. Specific issues include how to automate the composition of data mining operations and how to build a methodology into data mining systems to help users avoid many data mining mistakes. If we automate the different data mining process operations, it would be possible to reduce human labor as much as possible. One important issue is how to automate data cleaning. We can build models and find patterns very fast today, but 90 percent of the cost is in pre-processing (data integration, data cleaning, etc.). Reducing this cost will have a much greater payoff than further reducing the cost of model-building and pattern-finding. Another issue is how to perform systematic documentation of data cleaning. A further issue is how to combine visual interactive and automatic data mining techniques. It has been observed that in many applications, data mining goals and tasks cannot be fully specified, especially in exploratory data analysis. Visualization helps to learn more about the data and define/refine the data mining tasks. There is also a need for the development of a theory behind interactive exploration of large/complex datasets. Important questions to ask are: what are the compositional approaches for multi-step mining "queries"? What is the canonical set of data mining operators for the interactive exploration approach? For example, the data mining system Clementine has a nice user interface, but what is the theory behind its operations?

Dealing with Non-Static, Unbalanced and Cost-Sensitive Data: An important issue is that the learned models should incorporate time, because data is not static and is constantly changing in many domains. Historical actions in sampling and model building are not optimal, but they are not chosen randomly either. This gives rise to the following challenging phenomenon in the data collection process. Suppose that we use the data collected in 2000 to learn a model. We then apply this model to select inside the 2001 population. Subsequently, we use the data about the individuals selected in 2001 to learn a new model, and then apply this model in 2002. If this process continues, then each time a new model is learned, its training set has been created using a different selection bias. Thus, a challenging problem is how to correct the bias as much as possible. Another related issue is how to deal with unbalanced and cost-sensitive data, a major challenge in research. Charles Elkan made the following observations in an invited talk at the ICML 2003 Workshop on Learning from Imbalanced Data Sets. First, in previous studies, it has been observed that UCI datasets are small and not highly unbalanced. In a typical real-world dataset, there are at least 10^5 examples and 10^2.5 features, without a single well-defined target class. Interesting cases have a frequency of less than 0.01. There is much information on costs and benefits, but no overall model of profit and loss. There are different cost matrices for different examples. However, most cost matrix entries are unknown. An example of such a dataset is the direct marketing DMEF data library. Furthermore, the costs of different outcomes are dependent on the examples; for example, the false negative cost of direct marketing is directly proportional to the amount of a potential donation. Traditional methods for obtaining these costs relied on sampling methods. However, sampling methods can easily give biased results.
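
The point that costs depend on the individual example can be illustrated with a small expected-value calculation for the direct marketing case. The sketch below assumes that a response probability and a potential donation amount are already available for each prospect; the probabilities, donation amounts, and mailing cost are hypothetical values, and estimating them reliably is exactly the hard part noted above.

```python
def expected_profit(p_respond, donation, mailing_cost):
    """Expected gain from mailing one prospect, given example-specific values."""
    return p_respond * donation - mailing_cost

def should_mail(p_respond, donation, mailing_cost):
    """Mail only when the expected gain is positive."""
    return expected_profit(p_respond, donation, mailing_cost) > 0

# Hypothetical prospects: the cost of a missed donor (a false negative)
# scales with the size of the potential donation.
prospects = [
    {"id": "a", "p": 0.01, "donation": 200.0},
    {"id": "b", "p": 0.05, "donation": 10.0},
    {"id": "c", "p": 0.002, "donation": 1000.0},
]
mailing_cost = 0.68

selected = [x["id"] for x in prospects
            if should_mail(x["p"], x["donation"], mailing_cost)]
print(selected)
```

Note that prospect b has the highest response probability yet is not selected, while the rare but valuable prospect c is. A classifier tuned only for accuracy on the majority class would rank these cases the other way round.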

Data Mining in a Network Setting

Community and social networks: Today's world is interconnected through many types of links. These links include Web pages, blogs, and emails. Many respondents consider community mining and the mining of social networks as important topics. Community structures are important properties of social networks. The identification problem in itself is a challenging one. First, it is critical to have the right characterization of the notion of "community" that is to be detected. Second, the entities/nodes involved are distributed in real-life applications, and hence distributed means of identification will be desired. Third, a snapshot-based dataset may not be able to capture the real picture; what is most important lies in the local relationships (e.g. the nature and frequency of local interactions) between the entities/nodes. Under these circumstances, our challenge is to understand (1) the network's static structures (e.g. topologies and clusters) and (2) its dynamic behavior (such as growth factors, robustness, and functional efficiency). A similar challenge exists in bioinformatics, as we are currently moving our attention to the dynamic studies of regulatory networks. A question related to this issue is what local algorithms/protocols are necessary in order to detect (or form) communities in a bottom-up fashion (as in the real world). A concrete question is as follows. Email exchanges within an organization or in one's own mailbox over a long period of time can be mined to show how various networks of common practice or friendship start to emerge. How can we obtain and mine useful knowledge from them?
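
As a very rough first step toward the email question above, one can count how often each pair of people exchanges messages, keep only the pairs that interact frequently, and group people who are linked through such pairs. The sketch below does this with a plain connected-components search. It is a crude proxy for a community rather than a proper community detection algorithm, and the function name, the sample messages, and the threshold of two exchanges are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def communities(messages, min_exchanges):
    """Group people linked by at least min_exchanges messages in either direction."""
    # Frequency of interaction per unordered pair (self-mail is ignored).
    pair_counts = Counter(frozenset((s, r)) for s, r in messages if s != r)

    # Keep only the frequent pairs as undirected edges.
    graph = defaultdict(set)
    for pair, count in pair_counts.items():
        if count >= min_exchanges:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)

    # Depth-first search to collect connected components.
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(sorted(group))
    return groups

# Hypothetical (sender, recipient) log.
messages = [
    ("ann", "bob"), ("bob", "ann"),
    ("bob", "cat"), ("cat", "bob"),
    ("dan", "eve"), ("eve", "dan"),
    ("ann", "eve"),
]

groups = communities(messages, min_exchanges=2)
print(groups)
```

This works on a single snapshot of the log and therefore has the limitation described above: it says nothing about how the groups emerged or changed over time.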

Mining in and for computer networks (high-speed mining of high-speed streams): Network mining problems pose a key challenge. Network links are increasing in speed, and service providers are now deploying 1 Gigabit Ethernet and 10 Gigabit Ethernet link speeds. To be able to detect anomalies (e.g. sudden traffic spikes due to a DoS (Denial of Service) attack or catastrophic event), service providers will need to be able to capture IP packets at high link speeds and also analyze massive amounts (several hundred GB) of data each day. Highly scalable solutions are needed here. Good algorithms are, therefore, needed to determine whether or not a DoS attack is in progress. Also, once an attack has been detected, how does one discriminate between legitimate traffic and attack traffic so that it is possible to drop attack packets? We need techniques to (1) detect DoS attacks, (2) trace back to find out who the attackers are, and (3) drop those packets that belong to attack traffic.
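
To give a flavor of single-pass detection of sudden traffic spikes, the sketch below keeps an exponentially weighted moving average of per-interval packet counts and flags any interval that exceeds a multiple of that baseline. It is only a toy illustration that uses constant memory per stream; the parameter values (alpha, factor, warmup) and the sample counts are assumptions, and a flagged spike could equally be a legitimate surge rather than an attack.

```python
def detect_spikes(counts, alpha=0.2, factor=3.0, warmup=3):
    """Return indices of intervals whose count exceeds factor times the baseline."""
    baseline = None
    alerts = []
    for i, count in enumerate(counts):
        if baseline is None:
            baseline = float(count)   # first interval seeds the baseline
            continue
        if i >= warmup and count > factor * baseline:
            alerts.append(i)
            continue                  # keep the spike out of the baseline
        # Exponentially weighted moving average of normal traffic.
        baseline = alpha * count + (1 - alpha) * baseline
    return alerts

# Hypothetical packets-per-interval series with a two-interval burst.
traffic = [100, 110, 90, 105, 900, 950, 100, 95]
alerts = detect_spikes(traffic)
print(alerts)
```

Each interval is processed once and only the baseline is stored, which is the kind of constraint high-speed links impose. Distinguishing attack packets from legitimate ones within a flagged interval remains the open problem described above.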

VI. CONCLUSION

Engineering sciences and medical sciences are fertile ground for data mining. In the last three decades, engineering science and medical science have evolved to a stage where enormous amounts of data are constantly being generated and collected, so data mining and knowledge discovery have become an essential part of the scientific and technical discovery process. In this paper, we have examined a few important research challenges in engineering sciences and medical sciences from the perspective of data mining. Several interesting research issues, namely data source, performance, mining methodology, user interface, and privacy-preserving data mining issues, are discussed in this paper, which should be helpful for data mining research scholars.

References

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Second Edition, 2001

2. Arun K. Pujari, Data Mining Techniques, University Press (India) Pvt. Ltd., 2013.

3. M. S. Chen, J. Han, and P. S. Yu, "Data mining: An overview from a database perspective", IEEE Transactions on Knowledge and Data Engineering, pp. 866-883, 1996.

4. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, "Advances in Knowledge Discovery and Data Mining", AAAI/MIT Press, 1996.

5. W. J. Frawley, G. Piatetsky-Shapiro and C. J. Matheus, "Knowledge Discovery in Databases: An Overview", in G. Piatetsky-Shapiro et al., Knowledge Discovery in Databases, AAAI/MIT Press, 1991.

6. T. Imielinski and H. Mannila, "A database perspective on knowledge discovery", Communications of the ACM, vol. 39, pp. 58-64, 1996.

7. Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy, "Advances in Knowledge Discovery and Data Mining", AAAI Press/The MIT Press, 1996.

8. Michael Berry and Gordon Linoff, Data Mining Techniques for Marketing, Sales, and Customer Support, John Wiley & Sons, 1997.

9. Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, 1998.

10. Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.

11. A. K. Jain and R. C. Dubes, "Algorithms for Clustering Data", Prentice Hall, 1988.

12. V. Cherkassky and F. Mulier, Learning from Data, John Wiley & Sons, 1998.

13. Christopher Matheus, Philip Chan, and Gregory Piatetsky-Shapiro, "Systems for Knowledge Discovery in Databases", IEEE Transactions on Knowledge and Data Engineering, vol. 5, issue 6, pp. 903-913, December 1993.

14. Rakesh Agrawal and Tomasz Imielinski, "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, vol. 5, issue 6, pp. 914-925, December 1993.


15. Usama Fayyad, David Haussler, and Paul Stolorz, "Mining Scientific Data", Communications of the ACM, vol. 39, no. 11, pp. 51-57, November 1996.

16. David J. Hand, "Data Mining: Statistics and more?", The American Statistician, vol. 52, no. 2, pp. 112-118, May 1998.

17. Tom M. Mitchell, "Does machine learning really work?", AI Magazine, vol. 18, no. 3, pp. 11-20, Fall 1997.

18. Clark Glymour, David Madigan, Daryl Pregibon, and Padhraic Smyth, "Statistical Inference and Data Mining", Communications of the ACM, vol. 39, no. 11, pp. 35-41, November 1996.

19. Qiang Yang and Xindong Wu, "10 Challenging Problems in Data Mining Research", International Journal of Information Technology & Decision Making, vol. 5, no. 4, pp. 597-604, 2006.

20. B. Thuraisingham, "A Primer for Understanding and Applying Data Mining", IT Professional, IEEE, 2000.