DATA MINING: AN EXPLORATORY OVERVIEW

RIAN JOHAN FERREIRA

SHORT DISSERTATION submitted in partial fulfillment of the requirements for the degree
















in the







I would like to take this opportunity to thank, in particular, my wife Elrina for her unconditional support and inspiration, without which, surviving the M.Com experience would not have been possible for me.

I would also like to thank my study leader, Mr. C. Scheepers, for his guidance and advice.



Managers the world over complain that they are overwhelmed by the amount of data available to them, but that they are unable to make any sense of this data. The changing business environment and the fact that customers are becoming more and more demanding highlight the need for organisations to be able to adapt faster and more effectively to those changes.

Data mining developed as a direct result of the natural evolution of information technology. The increased organisational use of computer based systems has resulted in the accumulation of vast amounts of data, and the need for decision makers to have efficient access to knowledge, and not only data, has resulted in more and more organisations adopting the use of data mining.

The promise of data mining is to return the focus of large, impersonal organisations to serving their customers better and to providing more efficient business processes. Indeed, for some organisations data mining offers the potential for gaining a competitive advantage, but for others it has become a matter of survival.

The literature is filled with examples of the successful application of data mining, not only to specific business functions, but also in specific industries. Undoubtedly, certain industries, such as those dealing with huge amounts of data, and those exposed to many diverse customers, stand to benefit more from data mining than others.


The benefits of data mining for organisations, individuals and society as a whole far exceed its drawbacks, but the biggest issue facing organisations that want to employ data mining is its cost. The other drawbacks of data mining relate to the threat that it poses to privacy: any data mining effort must not only be done within the framework of the relevant laws, but must also be done in an ethical manner.

Although data mining is probably beyond the financial ability of most organisations, its main principle, that there might be value in organisational data, should not be forgotten. Organisations must endeavour to treat their data with the same respect they have for all their other corporate assets.








1.8 SUMMARY

CHAPTER 2


2.1.1 Origins of Data Mining

2.1.2 Data Mining Definitions

2.1.3 Business Context for Data Mining


2.2.1 Business Intelligence Defined

2.2.2 Business Intelligence Components
Data Warehousing
(6) Data Visualisation
The Role of Data Mining in Business Intelligence

2.2.3 The Value of Business Intelligence
The liberation of previously isolated information
Leveraging of existing investments
Improvement in the quality and timeliness of information
Enabling of information workers
Reduction in operating costs


2.4 SUMMARY

CHAPTER 3


3.1.1 Classification / Prediction

3.1.2 Estimation

3.1.3 Segmentation

3.1.4 Association

3.1.5 Clustering

3.1.6 Description and Visualisation

3.1.7 Optimisation

3.1.8 Outlier Analysis

3.1.9 Evolution Analysis


3.2.3 Neural Networks


3.3.1 Identify the business problem

3.3.2 Data collection and preparation

3.3.3 Building and evaluating the model

3.3.4 Selecting the model and delivering the results

3.4 SUMMARY

CHAPTER 4


4.1.1 General Data Mining Applications
E-commerce and E-business Functions
Business Intelligence Functions
Customer Relationship Management Functions
Marketing and Sales Functions
Enterprise Resource Management Function
Manufacturing and Planning Functions
Education and Training Functions

4.1.2 Industry Specific Data Mining Applications
Telecommunications Industry
Physical Sciences, Social Sciences and Engineering Industries
Medical and Biotechnical Industries
Law Enforcement Industry
Financial Services Industry
(8) Customer Relationship Management
Channel Management
Actuarial
Underwriting and Policy Management
Claims Management


4.2.1 Security Issues

4.2.2 Technology Related Issues
User Interface Issues
Mining Methodology Issues

4.2.3 Performance Issues

4.2.4 Data Source Issues

4.2.5 Results Issues

4.2.6 Cost Issues

4.2.7 Limitations of Data Mining Issues


4.3.1 Privacy

4.3.2 The role of data mining

4.3.3 The potential threat of data mining to privacy

4.3.4 Privacy preserving data mining


(9) Benefits of Data Mining for Society

4.4.2 Drawbacks of Data Mining
Drawbacks of Data Mining for Organisations
Drawbacks of Data Mining for Individuals
Drawbacks of Data Mining for Society

4.5 SUMMARY

CHAPTER 5
















The characteristics of the way in which organisations interact with their customers have changed dramatically over the past few years. The continuance of business relationships between organisations and their customers is no longer guaranteed, and if organisations want to survive they cannot stand back and wait for the tell-tale signs of customer dissatisfaction to surface before they take action. Today’s organisations must be proactive and have to anticipate what their current and prospective customers desire.

According to Thearling (n.d.), the following forces are working together to further increase the complexity of customer relationship management:

• Compressed marketing cycle times that occur as a result of the fact that customer loyalty is literally a thing of the past. Organisations have to reinforce the value that they can add to their customers on a continuous basis, and when customer needs are identified, they must endeavour to satisfy those needs before one of their competitors does.

• Increased marketing costs.

• Streams of new product offerings resulting from customers who demand products that satisfy all their needs exactly. They are no longer content with products that


• Niche competitors that focus on an organisation’s small, profitable market segments and attempt to steal them away by providing for their specific needs.

In the 21st century, organisations are faced with more customers, more competitors, more products, and less time to react to changes in their business environment. This means that understanding customers is becoming more difficult, and being able to keep up with changing customer demands is more vital for organisational survival than ever.

Organisations that want to be successful in today’s competitive environment must be able to react to each of the forces identified by Thearling (n.d.) in a timely manner. Management can no longer afford to postpone decision making to the next weekly or monthly management meeting. Managers at all levels have to be empowered, through information and knowledge, to be able to make better decisions faster. To facilitate this, the decision makers need to have access to reliable, accurate, relevant and up to date information pertaining to the business and especially the customers.

Nemati & Barko (2002:21) highlight a number of strengthening economic and social forces that are motivating organisations to improve the efficiency and effectiveness of enterprise decisions. These forces are:

• The shift in the balance of power away from the seller towards the buyer.

• The abundant choices that customers have and the ease with which they can transfer from one company’s products or services to another’s.


Nemati & Barko (2002:21) further state that information is quickly becoming one of the major differentiators between industry leading organisations and second rate organisations. Being able to extract the relevant knowledge from this information plays a vital role in enhancing enterprise decision making. The major difficulty faced by most organisations though is to provide the relevant decision makers with efficient access to this knowledge.

Melab (2001) indicates that the world has seen a huge explosion of data, resulting from the real revolution that has taken place in computer science over the last decade.

According to Koch (2002:17), there is no doubt that computer systems have had a positive impact on the way that business is conducted today. In the past 30 years though, technology has been used mainly to automate manual tasks and to help organisations do business faster. The focus now needs to change in order to help organisations do business better. Organisations must leverage technology in order to create value.

Managers the world over complain that they are drowning in data that are produced by computer based systems, but starving for useful information. Organisations are accumulating massive volumes of data, and Ahmed et al. (2001) estimate that the size of the world’s data is doubling every 20 months.


Goff (2003) highlights some typical comments made by senior business managers in response to a questionnaire sent out by a technology vendor cited in his article. Some of the comments received from senior managers include the following:

• “Data is buried in a sea of noise!”

• “Swamped in information!”

• “I’m drowning!”

Goff (2003) goes on to say that, partly as a result of the drop in the cost of storing data, many corporations are in danger of being swamped by information. The same survey reveals that more than half of 158 corporate executives estimate that their business now has two to three times as much data available as in the previous year of doing business.

The collected data needs to be converted into meaningful information, or knowledge, in order to acquire a better understanding of many aspects of the business and to enhance managers’ capability to make better decisions.

The ability to access and retrieve data contributes to business success by increasing the knowledge of decision makers at all levels. The more efficiently managers can access value adding data, the better their chances of gaining insights into what drives their organisational activities which enables them to devise improved business strategy (Anon., 2001).

Over the past years organisations have devoted huge efforts to ensuring that they capture and store as much data relating to their business transactions and customers as possible. Today, more and more organisations are focussing their information technology efforts on making sense of that data.

Ahmed et al. (2001) expand on the above statement by stating that data is the organisational challenge of the new millennium. They also state that competitive success will depend on the ability of companies to quickly and effectively convert their raw data into comprehensible information. It is impossible for managers to make the correct decisions if the needed information cannot be accessed or presented to them intelligibly.

The wide availability of huge amounts of data and the imminent need to turn that data into useful information or knowledge, highlighted in the above discussion, are among the major reasons for the increased attention that data mining and its related technologies have been attracting amongst information technology companies and businesses in general in recent years.

Information technology projects are often regarded as the catalysts for the re-engineering of business away from the traditional hierarchical organisational structures to the more customer-oriented, knowledge based, networked organisations of the 21st century, and data mining can play a vital role in facilitating this transformation process.




This study was conducted in cooperation with Ascent Technology.

Ascent Technology is a newly established company that specialises in providing database related services to its clients. These services include:

• Database Consulting Services such as database design, optimisation and performance tuning, stabilisation, rescue and recovery, planning and sizing, audits and health checks, as well as technical documentation.

• Database License Sales.

• Database Outsourcing including on-site support, remote monitoring, remote repairs and fixed cost service level agreements.

• Database Contracting Services, providing experienced certified database professionals on a project, full time or part time basis.

Ascent Technology has clients operating in industries ranging from mining to credit insurance. All these clients make use of extensive database technologies for storing, retrieving and reporting on business related activities. None of these clients currently employs data mining in its organisation, but many of them have raised it as a potential future information technology implementation.



The discussion in section 1.1 highlights the following issues that are being faced by organisations across the world:


• The relationship between organisations and their customers is changing, with customers becoming more and more demanding.

• Although computer systems have made vast improvements in the way that business is conducted, they have also dramatically increased the volumes of data captured and stored by organisations.

• In this current changing business environment, organisations, and in particular managers, must be in a position to make informed decisions in order to respond to the changes in their business environments and to remain competitive.

• In order for managers to be able to make those informed decisions, they need timely access to relevant, accurate and valid information relating to their organisation, its products or services, and their customers.

• Many organisations’ managers are prevented from having access to the correct information, simply because there is too much data available.

The points outlined above once again highlight the need for managers to have access to only the relevant information when making decisions. Managers simply do not have the time to sift through vast amounts of irrelevant or incorrect data when making decisions.

Ascent Technology currently has limited skills in terms of data mining, and wishes to expand its business to include it as one of its database related service offerings. All of Ascent Technology’s clients employ databases to store and retrieve business related data. Some of these clients have already expressed an interest in applying data mining in their organisations, but are unsure as to whether their businesses are



The primary objective of the study is to conduct a literature review, or secondary data analysis, of data mining with the intention of gaining a better understanding of the subject matter.

The primary objective will be pursued by dividing the study into the following secondary objectives:

• Define data mining and gain a basic understanding of what data mining is, and what its role is in other related technologies such as business intelligence.

• Provide an overview of some of the more prominent data mining tasks, techniques and algorithms.

• Identify and suggest a process for conducting data mining.

• Identify some typical uses or applications of data mining in general and in specific industries.

• Discuss typical issues facing organisations that wish to employ data mining as a tool in their businesses.

• Identify some of the potential societal issues relating to data mining and its implementation.

• Highlight the potential benefits, and drawbacks of data mining to organisations, individuals and society as a whole.




The study will be conducted by making use of qualitative research.

The objectives of the study will be pursued by using a literature review or secondary data analysis.

Articles, textbooks, research reports, dissertations, the internet and other scientific publications relevant to data mining will be used. The viewpoints of different authors will be compared and evaluated.


The study will focus mainly on data mining, but no data mining discussion will be complete without at least mentioning some of its related information technologies.

The study will not attempt to make data mining experts out of the researcher or the reader, but will focus more on providing a basic understanding of the technology, and its potential effects on how business is conducted.


The study will consist of five chapters.


The literature review will be carried out in chapters 2, 3 and 4. These chapters will define data mining and discuss it in some detail.

Chapter 2 will focus on describing the origins of data mining, defining data mining and describing the business context for data mining. The chapter will also discuss some of data mining’s related technologies such as data warehousing, and provide some perspective with regard to where data mining fits into business intelligence as a whole. The chapter will conclude with a discussion of the classification of data mining systems.

Chapter 3 will discuss data mining in more detail. It will look at some of the more prominent data mining tasks, and will provide a short discussion of the most popular data mining techniques and algorithms. The chapter will conclude with a suggested data mining process.

Some applications and uses of data mining will be considered in chapter 4. The potential benefits and drawbacks of data mining will also be highlighted. Finally, the chapter will consider some of the management issues and societal impacts of data mining, the latter relating in particular to privacy.

Finally, chapter 5 will provide a summary of the study. Any relevant conclusions and recommendations that might arise from the research will also be discussed in this chapter.



The business environment faced by all organisations is changing rapidly, and customers are becoming more demanding in terms of their needs and in terms of the products and services that they require. Organisations that wish to prosper, and indeed survive these changes, must be able to put their decision makers in a better position to respond to these changes in the environment and in customer needs.

One of the main threats posed to effective decision making in organisations is the overwhelming amount of data that decision makers have to sift through in order to make decisions. Data mining holds out the promise of enabling organisations to turn their huge amounts of data into valuable information or knowledge, that can be used to not only satisfy their customers’ current needs, but also to predict possible future needs.

This study will aim to provide a basic understanding of data mining, and will also aim to do the following:

• Identify a process for conducting data mining.

• Identify some typical business applications of data mining.

• Discuss typical issues facing organisations that wish to employ data mining as a tool in their businesses.

• Highlight the potential benefits, and drawbacks of data mining to organisations, individuals and society as a whole.


The following chapter will provide a brief introduction to data mining by looking at some of the current definitions used to describe it, classifying the different data mining systems and providing some background on business intelligence and the role that data mining plays in an organisation’s overall business intelligence strategy.




2.1.1 Origins of Data Mining

Han & Kamber (2000:1) suggest that one of the major reasons why data mining has been attracting so much attention in recent years is the wide availability of huge amounts of collected data, and the growing need to turn that data into useful information and knowledge. The information and knowledge gained from this data can then be applied to applications ranging from business management to science exploration.

Data mining is further considered to be the result of the natural evolution of information technology. According to Han & Kamber (2000:2), the following functionalities represent the major steps in this information technology evolutionary path:

• Data collection and database creation.

• Data management.

• Data analysis and data understanding.




Thearling (n.d.) also supports this idea of a natural evolutionary path of information technology that resulted in the development of data mining. According to Thearling (n.d.), the evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. The readiness of data mining for application in the business environment has been further enhanced by the support of three related technologies that are now sufficiently mature. These are:

• Massive data collection.

• Powerful multiprocessor computers.

• Data mining algorithms.



Data collection
Business question: What was my total revenue in the last five years?
Enabling technologies: Computers, tapes, disks
Characteristics: Retrospective, static data delivery

Data access (1980’s)
Business question: What were unit sales in New England last month?
Enabling technologies: Relational databases, Structured Query Language
Characteristics: Retrospective, dynamic data delivery at record level

Data warehousing and decision support (1990’s)
Business question: What were unit sales in New England last March? Drill down to sales for Boston
Enabling technologies: On-line analytic processing, multi-dimensional databases, data warehouses
Characteristics: Retrospective, dynamic data delivery at multiple levels

Data mining (emerging today)
Business question: What is likely to happen to Boston unit sales next month? Why?
Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
Characteristics: Prospective, proactive information delivery

Source: Adapted from Thearling (n.d.).

2.1.2 Data Mining Definitions

Data mining is a powerful business intelligence tool that provides knowledge about unknown patterns and events from large databases. A review of data mining literature yielded the following definitions for data mining:


• Data mining is the process of exploration and analysis by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules (Berry & Linoff, 2000:7).

• Data mining refers to extracting or mining knowledge from large amounts of data (Han & Kamber, 2000:5).

• Data mining is the extraction of hidden predictive information from large databases (Thearling, n.d.).

• Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions (Anon, 1999:1).

• Data mining is the process of applying artificial intelligence techniques (such as advanced modelling and rule induction) to a large data set in order to determine patterns in the data (Ma et al., 2000:128).

• Data mining involves the analysis of data to discover previously unknown relationships that provide useful information (Chan & Lewis, 2002:56).

• Data mining is the automated analysis of large data sets to identify previously unknown patterns or trends of information in the data that may be used to make valid predictions (Coskun et al., 2002:219).

These definitions highlight the fact that data mining is a process that is used to identify hidden, unexpected patterns or relationships in large quantities of data. Rubenking (2001) also states that data mining does not involve looking for specific information: rather than starting from a question or a hypothesis, it simply finds patterns that are already present in the data.


Data mining predicts future trends and behaviours, allowing businesses to make proactive, knowledge driven decisions. It moves beyond the analysis of past events provided by retrospective tools typical of decision support systems, answering questions that traditionally were too time consuming to resolve. Data mining scours databases for hidden patterns, finding predictive information that experts might overlook because it falls outside their expectations (Thearling, n.d.).

According to Coskun et al. (2002:219), data mining has a generic blanket definition that tends to include all the tools employed to help users analyse and understand their data. They also highlight that data mining differs from traditional statistical techniques by using the computer, rather than the analyst, to find patterns and relationships by identifying the underlying rules and features in the data. The relationships are therefore found inductively by the software employed based on the existing data rather than requiring the modeller to specify the functional form and interactions.

The unique aspect of data mining is the fact that the user does not have to know what he or she is looking for. Data mining tools assume that the user does not know exactly what he or she is looking for; they evaluate the data for statistical patterns and return any patterns or relationships that they find.
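The idea that patterns are returned without the user first stating a hypothesis can be illustrated with a minimal sketch. The example below is not taken from any of the cited authors; the transactions, the item names and the `frequent_pairs` helper are all hypothetical, and simple pair counting stands in for the far more sophisticated algorithms used by real data mining tools.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; the item names are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "crisps"},
    {"bread", "butter"},
    {"beer", "crisps", "milk"},
]


def frequent_pairs(baskets, min_support):
    """Return every item pair found together in at least min_support of the baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical ordering before it is counted.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


# No question such as "do bread and milk sell together?" is asked up front;
# the function simply reports whichever pairs occur often enough.
patterns = frequent_pairs(transactions, min_support=0.4)
for pair, support in sorted(patterns.items()):
    print(pair, support)
```

The only input the user supplies is a threshold for what counts as frequent; which items turn out to be related is left to the data.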

In essence therefore, data mining aims to add value to business organisations by applying highly rigorous statistical analysis to large masses of data in an effort to extract meaningful trends and relationships that are often hidden by the sheer volume and arrangement of the data. It further supports the transformation of data into information, knowledge and wisdom (Labovitz, 2003). This is a vital point for organisations that wish to make use of their data to establish a competitive advantage in their industries.

2.1.3 Business Context for Data Mining

Based on the discussion in the previous section, data mining is clearly of value to any discipline or organisation where there are large amounts of data, and something worth learning from that data. In the business sense though, something is only worth learning if the resulting knowledge is worth more than it costs to discover that knowledge. Something is worth knowing if the return on the investment required to learn it is greater than the return from investing the same funds in some other way (Berry & Linoff, 2000:12).
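The two tests described above can be written out as a small calculation. The sketch below is only an illustration of the reasoning: the function names and all monetary figures are hypothetical and do not come from Berry & Linoff, and return on investment is assumed to mean net gain divided by cost.

```python
def return_on_investment(cost, value):
    """Net gain expressed as a fraction of the amount invested (assumed definition)."""
    return (value - cost) / cost


def worth_learning(cost, value, alternative_return):
    """Apply both tests: the knowledge must be worth more than it costs to
    discover, and must beat the return available from investing elsewhere."""
    return value > cost and return_on_investment(cost, value) > alternative_return


# Hypothetical projects: each costs 100 000, and the same funds could earn 10% elsewhere.
project_a = worth_learning(cost=100_000, value=130_000, alternative_return=0.10)
project_b = worth_learning(cost=100_000, value=105_000, alternative_return=0.10)
print(project_a, project_b)
```

Under these assumed figures the second project yields knowledge worth more than its cost, yet still fails the test, because the same funds would earn more if invested in the alternative.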

According to Berry & Linoff (2000:12), the resulting knowledge or information gained from data mining efforts, can benefit an organisation in one of three ways. These are:

• Increasing organisational profits by lowering costs.

• Increasing organisational profits by increasing revenues.

• Increasing the organisation’s share price by holding out the promise of future increased profit via either of the above mentioned two methods.

Data mining attempts to accomplish one or more of these goals by predicting future business trends and customer behaviour patterns from large data warehouses and other forms of data resources.


According to Ma et al. (2000), data mining is critical to the enterprise that wants to exploit operational and other available data to improve the quality of decision making and gain critical competitive advantages. This is where data mining can play a crucial role, by disclosing important information in a cost effective and timely fashion (Chen & Lewis, 2002:56).

Because businesses today operate in an increasingly information or data driven economy consisting of more complex structures, data mining offers greater precision, a more accurate way of addressing and dealing with problems, better consideration of bottom line issues, and more effective decision making.

Bose & Mahapatra (2001:211–212) also highlight the following reasons for the increase in attention that data mining has been receiving from the business community over the years:

• Intense competition is forcing organisations to identify innovative ways to capture and enhance market share while reducing costs.

• A better understanding of the buying behaviour of customers can enhance the effectiveness of target marketing practices.

• Data warehousing technology has made it possible for organisations to accumulate massive amounts of data, and data mining offers an effective means of analysing that data and converting it into valuable information or knowledge.

According to Nemati & Barko (2002:21), organisational theory suggests that knowledge management and decision making are constrained by the abilities of the decision maker, and that this organisational knowledge management is a learned ability that can only be achieved through an organised and deliberate methodology. Data mining is a key part of this methodology, and its successful implementation can lead to enhanced organisational decision making.

According to Berry & Linoff (2000:483), the larger, impersonal corporations of today have in many cases lost sight of their customers, and the promise of data mining is to return the focus of these organisations to serving their customers better and to providing more efficient business processes.



2.2.1 Business Intelligence Defined

According to Kudyba & Hoptroff (2001:5), the competitive forces prevailing in the world’s markets today require that businesses operate as efficiently and productively as possible in order to maintain and enhance market share, profitability and shareholder value. One essential element in achieving success involves the continuous enhancement of knowledge and understanding of the business environment by employees at all levels in the organisation. Business intelligence involves the implementation of processes that improve the accessibility of such value added information throughout the organisation.

Business intelligence consists of all the activities related to organising and delivering information and analysis to a business and its decision makers (Hackney, 2000:39).


Wilken (1999) simply states that although there is a set of technologies behind the term business intelligence, it is really about end results. Business intelligence is about extracting value from corporate data in a way that enhances business processes, provides key insights into the business, and improves the bottom line.

2.2.2 Business Intelligence Components

Business intelligence is a term that encompasses a number of related technologies. This section will briefly discuss some of those technologies, and will highlight where and how data mining fits into the overall concept.

Data Warehousing

The fundamental starting point for any business intelligence effort in an organisation involves the collection and storage of business related information. The need for a company wide view of information, and the need of the information systems department to be able to manage this data in a better way, led to the development of data warehousing as a decision support trend.

Ma et al. (2000:125) define a data warehouse as a single, complete and consistent store of data obtained from a variety of sources and made available to end users in a way that they can understand and use in a business context. Data warehousing provides architectures and tools for business executives to systematically organise, understand and use their data to make strategic decisions (Han & Kamber, 2001:39). Williams (1999:531) simply defines a data warehouse as a repository for historical data, used to make decisions.


Simply stated, a data warehouse is a database that is maintained separately from an organisation’s operational databases. Data warehouses allow for the integration of a variety of application systems, and support information processing by providing a solid platform of consolidated historical data for analysis.

Han & Kamber (2001:40) further state that data warehousing has four key features. These features are:

• Subject-oriented

Data warehouses are organised around key subjects relevant to the organisation such as customers, suppliers or sales. These key subjects are used to provide decision makers with concise views on specific issues that might be relevant in their decision making.

• Integrated

Data warehouses are typically used to consolidate all of an organisation’s data into a single place. The data could come from such diverse sources as relational databases, flat files and online transaction records. They often also include data from legacy systems.

• Time-variant

The data in the warehouse is stored to provide information from a historical perspective, for example, sales over the past 10 years.


• Non-volatile

The data in a data warehouse is stored separately from the organisation’s operational environment. It therefore does not require transaction processing, recovery or concurrency control mechanisms and is mainly aimed at providing access to data.

A data warehouse is typically used for one of the following purposes (Williams, 1999:533):

• Reporting – General queries applied to the data warehouse with the intention of finding answers to specific questions.

• Online analytical processing (OLAP) – Used in business analysis to investigate market trends and their underlying causes.

• Data mining – Finding patterns and correlations in data for use in business decisions.

• Executive information processing – Providing key summary information to decision makers without the need for the user to work too hard to come to his or her decision.

A data warehouse is not a requirement for data mining, but many of the data cleansing and preparation functions performed in order to import data into a data warehouse, also have to be performed on data in order to prepare it for data mining (Anon, 1999:3). The existence of a data warehouse will therefore in all probability simplify the data mining process with specific regard to the preparation of the data to be analysed.

Online Analytical Processing (OLAP)

According to Ma et al. (2000:127), the OLAP Council defines online analytical processing as a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent and interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. OLAP is therefore a set of tools that facilitates multidimensional analysis, manipulates aggregated data into various categories, and allows a decision maker to look beyond the summary information in order to discover what may have caused the trend demonstrated by the summary. It provides a user with the ability to drill down into the data, as far into its minute detail as is necessary to find the answers required (Williams, 1999:534).

A data warehouse focuses on the gathering, cleansing and storing of large volumes of data. OLAP tools on the other hand provide the means needed to successfully manipulate and analyse the data (Ma et al., 2000:127).

OLAP therefore represents the next level in business intelligence, by enabling users to quickly view particular segments of data relevant to their specific decision making needs.
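The idea of viewing a summary and then drilling down into a particular segment can be illustrated with a minimal Python sketch. The sales records, the dimension names and the functions `roll_up` and `drill_down` are hypothetical and chosen purely for illustration; they do not represent the interface of any particular OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, amount).
SALES = [
    ("North", "Loans", "Q1", 120),
    ("North", "Loans", "Q2", 80),
    ("North", "Cards", "Q1", 60),
    ("South", "Loans", "Q1", 100),
    ("South", "Cards", "Q2", 90),
]

# Position of each dimension inside a fact tuple.
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, by):
    """Sum the amount for every combination of the dimensions named in `by`."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[p] for p in positions)
        totals[key] += fact[3]
    return dict(totals)


def drill_down(facts, **fixed):
    """Keep only the facts whose dimensions match the fixed values."""
    return [
        fact
        for fact in facts
        if all(fact[DIMENSIONS.index(name)] == value for name, value in fixed.items())
    ]


# Summary view: total sales per region.
by_region = roll_up(SALES, by=("region",))

# Drill down into the North region to see what lies behind its total.
north_detail = roll_up(drill_down(SALES, region="North"), by=("product", "quarter"))
```

In this sketch the summary shows only one figure per region, while the drill-down exposes the product and quarter combinations that make up the figure for a single region.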

An OLAP analyst will generate a number of hypothetical patterns or relationships. The OLAP tools will then be used to verify or disprove these patterns or relationships. OLAP can therefore be summarised as being a deductive process while data mining is an inductive process (Anon, 1999:3).

OLAP and data mining can be used to complement each other in the sense that OLAP can be used to verify the patterns or relationships discovered by data mining.

Data Visualisation

Data visualisation represents the use of graphics, such as charts, in order to represent the data used for analysis and decision making (Ma et al., 2000:129).

The typical characteristics of data visualisation include the following:

• Data visualisation forms part of the previously discussed technologies and is simply a method for displaying the results yielded by those tools.

• It is particularly helpful in assisting end users to recognise patterns or trends by representing the results in a graphical format.

• It goes beyond simply using charts and graphs to display results, and endeavours to make use of different colours, shapes and animation to highlight different dimensional values on two dimensional computer screens.

The Role of Data Mining in Business Intelligence

The above mentioned components of business intelligence tend to give users a static view on their business data. Data mining evolved as a result of the need for users and decision makers to be able to answer questions such as (Kudyba & Hoptroff, 2001:7):


• Are there statistical relationships between variables in my data and are they reliable?

• How strong are the relationships between these variables?

• If I change certain variables, what corresponding changes can I expect in the variable in question?

• What can I expect in the future?
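The first two questions above, whether a relationship exists between two variables and how strong it is, can be illustrated with a small Python sketch that computes the Pearson correlation coefficient. The choice of this particular measure, the function name `pearson` and the advertising and sales figures are assumptions made for illustration only and are not taken from Kudyba & Hoptroff.

```python
from math import sqrt


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable around its own mean.
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(variation_x * variation_y)


# Hypothetical monthly advertising spend and sales figures.
advertising = [10, 20, 30, 40, 50]
sales = [25, 41, 58, 85, 96]

r = pearson(advertising, sales)
```

A coefficient close to 1 or -1 suggests a strong linear relationship, while a value close to 0 suggests a weak one. With so few observations such a figure would not be reliable on its own, which is precisely the concern raised by the first question.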

According to Hackney (2000:42), data mining solutions are a key weapon in an organisation’s business intelligence arsenal that reveal trends and relationships and predict future outcomes.

The importance of data mining to successful business intelligence is highlighted by the increased growth in organisational data volumes. Variar (2004:34) states that information is growing faster than Moore’s law and without effective data mining, organisations cannot leverage their data to make better decisions. Moore examined the relationship between computer power and cost and found them to be inversely related. According to Moore’s law, every few years the power of computers increases exponentially with a corresponding drop in the cost to provide that power (Mattison, 1996:22).

Data mining assists organisations in creating value by providing many functions such as forecasting, modelling and support for decision making.


2.2.3 The Value of Business Intelligence

The value that business intelligence can add to organisations is closely linked and similar to the value added by data mining, but Steadman (2003) highlights the following five major values of business intelligence:

The liberation of previously isolated information

One of the major issues facing organisations is their apparent inability to link information contained in different systems together. The systems might be isolated from one another as a result of data structure or format issues.

A risk assessor working in an insurance company might, for instance, have to access several different systems in order to make an underwriting decision. A business intelligence system could enhance the risk assessor’s decision making ability by grouping the relevant information together into a common format and optimising the overall collection and presentation of the data.

Leveraging of existing investments

The competitive nature of the business environment, and increased pressure on organisations to reduce costs have forced many of them to implement expensive, but business enhancing information technology systems. These systems add tremendous value to organisations, but often function as individual systems that are totally independent of each other.

A good business intelligence solution will recognise the huge investment that an organisation has already made in its systems and will be implemented in such a way as to minimise the changes that will have to be made to those systems in order to incorporate them into the overall business intelligence offering. The system should also cater for small changes to the individual components without affecting the overall business intelligence solution.

Improvement in the quality and timeliness of information

Today’s organisations operate in an increasingly competitive environment. A key factor in being successful in such an environment is the ability to make fast and accurate decisions. Business decision makers must be empowered to make such accurate and reliable decisions, and a good business intelligence solution will provide end users with near real-time access to the relevant information.

Enabling of information workers

Employees at all levels in organisations work with information and business intelligence empowers employees by providing them with the information and knowledge needed to make accurate and reliable decisions.

The decisions can be made at the point of maximum impact and in a timely fashion, resulting in better responses to changing customer demands and needs, and improved customer service levels.

A good business intelligence solution should provide this ability by presenting information in a format that employees at all levels in the organisation can

Reduction in operating costs

Employees are often called upon to compile and present reports relating to the activities in their individual segments of the organisation. This often requires that they spend huge amounts of time on gathering the necessary data, processing it and then formatting it for presentation.

Business intelligence solutions offer the advantage of reducing operating costs by making it possible for employees to gather, compile and share information in an effective and efficient way. A good business intelligence solution will empower employees to do this with minimum effort involved from their side.


According to Han & Kamber (2000:29) the interdisciplinary nature of data mining, and the fact that it combines technology from so many different fields, makes it helpful to classify data mining systems based on a specified set of criteria. This classification can also be helpful to potential users to distinguish data mining systems and identify those that best match their specific needs.

The criteria used to categorise data mining systems are as follows:

• Classification according to the kinds of databases mined.

Several different database systems exist, such as relational, transactional and object-oriented systems, to name but a few. Data mining systems can therefore be classified accordingly.


• Classification according to the kinds of knowledge mined.

Data mining systems can be classified according to the kind of knowledge mined, that is, based on the data mining functionalities employed such as characterisation, association, classification or clustering.

• Classification according to the kinds of techniques utilised.

Data mining systems can also be classified based on the underlying data mining techniques employed or the methods of data analysis employed such as neural networks or pattern recognition.

• Classification according to the applications adapted.

Finally, data mining systems can also be classified according to the applications to which they are adapted, such as systems tailored specifically for the finance industry or the telecommunications industry.


The chapter starts off by discussing the origins of data mining, and indicating how it developed as a result of the evolutionary path of information technology. The application of data mining in the business environment has been further enhanced by organisations’ ability to collect huge amounts of information, the arrival of powerful multiprocessor computers and the development of data mining algorithms.


The business context for data mining and data mining’s ability to improve organisational decision making and add value is briefly discussed, and then the focus changes to business intelligence.

Business intelligence is regarded as all those activities related to organising and delivering information and analysis to businesses and their decision makers. Data warehousing, online analytical processing, data visualisation and data mining are then identified as the major components of a business intelligence system, and the specific role of data mining in business intelligence is discussed.

The chapter concludes with a discussion of the value of business intelligence to organisations, and finally, it provides a brief overview of how data mining systems are classified.

Chapter 3 discusses data mining in more detail. It focuses on identifying some of the more commonly used data mining tasks, techniques and algorithms, and concludes with a discussion of the data mining process.




Although the term data mining is often thrown around loosely, Berry & Linoff (2000:8) prefer to use the term for a specific set of tasks, all of which involve extracting meaningful information or knowledge from data. Han & Kamber (2001:21) state that these tasks specify the kind of patterns to be found in data mining.

The goal of any data mining effort can be divided into one of the following two types (Chan & Lewis, 2002:57):

• Using data mining to generate descriptive models to solve problems.

• Using data mining to generate predictive models to solve problems.

According to Han & Kamber (2001:21) descriptive data mining tasks characterise the general properties of the data in the database, while predictive data mining tasks perform inference on the current data in order to make predictions. Descriptive data mining focuses on finding patterns describing the data that can be interpreted by humans, and produces new, nontrivial information based on the available data set (Kantardzic, 2003:2). Predictive data mining involves using some variables or fields in the data set to predict unknown or future values of other variables of interest, and produces the model of the system described by the given data set (Kantardzic, 2003:2).


The goal of predictive data mining is to produce a model that can be used to perform tasks such as classification, prediction or estimation, while the goal of descriptive data mining is to gain an understanding of the analysed system by uncovering patterns and relationships in large data sets (Kantardzic, 2003:2).

The goal of a descriptive data mining model is therefore to discover patterns in the data and to understand the relationships between attributes represented by the data, while the goal of a predictive data mining model is to predict the future outcomes based on past records with known answers (Chan & Lewis, 2002:57).

Both Chan & Lewis (2002:57) and Berry & Linoff (2000:8) further divide the data mining task of generating models into the following two approaches:

• Supervised or directed data mining modelling.

• Unsupervised or undirected data mining modelling.

The goal in supervised or directed data mining is to use the available data to build a model that describes one particular variable of interest in terms of the rest of the available data. The task is to explain the value of some particular field. The user selects the target field and directs the computer to determine how to estimate, classify or predict its value (Chan & Lewis, 2002:57).

In unsupervised or undirected data mining however, no variable is singled out as the target. The goal is rather to establish some relationship among all the variables in the data. The user asks the computer to identify patterns in the data that may be significant (Chan & Lewis, 2002:57).


Undirected modelling is therefore used to recognise patterns and relationships in the data, while directed modelling is used to explain those patterns and relationships once they have been found.

The goals of predictive and descriptive data mining are achieved by using specific data mining techniques that fall within certain primary data mining tasks. These tasks are as follows:

3.1.1 Classification / Prediction

Classification involves the discovery of a predictive learning function that classifies a data item into one of several predefined classes (Kantardzic, 2003:2). It involves examining the features of a newly presented object and assigning to it a predefined class (Berry & Linoff, 2000:8).

Han & Kamber (2001:279) define classification as a two-step process. First a model is built describing a predetermined set of data classes or concepts and secondly, the model is used for classification.
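The two-step process can be sketched in a few lines of Python. The sketch below uses a nearest-centroid classifier, which is an assumption chosen for its brevity and not a technique prescribed by the cited authors. The training records (age and income in thousands, labelled by whether the customer responded to an offer) and the function names `build_model` and `classify` are likewise hypothetical.

```python
from math import dist

# Hypothetical training records: ((age, income in thousands), class label).
TRAINING = [
    ((25, 20), "no"), ((30, 25), "no"), ((35, 30), "no"),
    ((45, 70), "yes"), ((50, 80), "yes"), ((55, 90), "yes"),
]


def build_model(training):
    """Step 1: describe each predefined class by the mean of its examples."""
    grouped = {}
    for features, label in training:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(model, features):
    """Step 2: assign a new record to the class with the nearest mean."""
    return min(model, key=lambda label: dist(model[label], features))


model = build_model(TRAINING)
prediction = classify(model, (48, 75))
```

The model is built once from records whose classes are already known and can then be applied to any number of new, unlabelled records.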

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have (Han & Kamber, 2001:181). According to Berry & Linoff (2000:10), any of the techniques used for classification can be adapted for use in prediction by using training examples where the value of the variable to be predicted


Typical business related questions that can be answered using classification or prediction tasks are (Mena, 1999:10):

• Which customers will buy?

• Which products will customers buy?

• How much will customers buy?

Classification and prediction are grouped under predictive data mining tasks.

3.1.2 Estimation

While classification deals with discrete outcomes such as yes or no, debit card, home loan or vehicle financing, estimation deals with continuously valued outcomes. If some input data is available, estimation can be used to come up with some unknown continuous variable such as income or height (Berry & Linoff, 2000:8). In estimation, one wants to come up with a plausible value or a range of plausible values for the unknown parameters of a system (Kantardzic, 2003:92).

Typical examples of estimation and the business related questions that can be addressed by making use of it include the following (Berry & Linoff, 2000:8):

• How many children are in a family?

• Estimating a family’s total household income.


Classification and estimation are often used together, as when data mining is used to predict who is likely to respond to a credit card balance transfer offer and also to estimate the size of the balance to be transferred (Berry & Linoff, 2000:9).

Estimation is grouped under predictive data mining tasks.
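As an illustration of estimating a continuously valued outcome, the following Python sketch fits a straight line to a handful of records by ordinary least squares and uses it to estimate an unknown income. The use of a straight line, the function name `fit_line` and the figures are assumptions made for the sake of the example; the cited authors do not prescribe a particular estimation method here.

```python
def fit_line(xs, ys):
    """Return the least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical records: years employed and household income in thousands.
years_employed = [1, 3, 5, 7, 9]
income = [22, 30, 41, 48, 59]

slope, intercept = fit_line(years_employed, income)

# Estimate the income of a household whose main earner has worked for 6 years.
estimate = slope * 6 + intercept
```

Unlike a classifier, which would return one of a fixed set of labels, the fitted line returns a number on a continuous scale.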

3.1.3 Segmentation

Segmentation simply means making different offers to different market segments, that is, groups of people defined by some combination of demographic variables such as age, gender or income (Berry & Linoff, 2000:273).

Mena (1999:11) defines segmentation as a form of analysis used, for instance, to break down the visitors to a website into unique groups with individual behaviours. The groupings can then be used to make statistical projections, such as the potential amount of purchases they are likely to make.

Typical business questions that can be answered using segmentation are (Mena, 1999:11):

• What are the different types of visitors attracted to our website?

• Into which age group do the listeners of a certain radio station fall?


3.1.4 Association

The task of association is to determine which things go together (Berry & Linoff, 2000:10). Association is yet another data mining task, one that looks for hidden associations between attributes in your data, such as gender and age (Mena, 1999:13).

Han & Kamber (2001:23) define association analysis as the discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.

Typical business questions that can be answered using association are:

• Which items are generally found together in a customer’s shopping trolley?

• Which items should be placed together on store shelves because they are often bought together?

• What relationships exist between your visitors and your products (Mena, 1999:13)?

Association is grouped under descriptive data mining tasks.
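The shopping trolley question can be illustrated with a short Python sketch that computes two measures commonly used to judge association rules: support (the share of all transactions that contain the items) and confidence (the share of transactions containing the first item set that also contain the second). The baskets and the function names `support` and `confidence` are hypothetical and serve only as an illustration.

```python
# Hypothetical shopping baskets, one set of items per customer visit.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction also containing `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# How often do bread and butter appear together?
rule_support = support({"bread", "butter"}, BASKETS)

# When butter is bought, how often is bread bought as well?
rule_confidence = confidence({"butter"}, {"bread"}, BASKETS)
```

A rule with both high support and high confidence points to items that are frequently bought together, which is the kind of pattern the business questions above are looking for.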

3.1.5 Clustering

Clustering is the task of segmenting a diverse group into a number of similar subgroups or clusters (Chan & Lewis, 2002:58). According to Han & Kamber (2001:25), clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Clustering is commonly used to search for unique groupings within a data set (Mena, 1999:15).

The distinguishing factor between clustering and classification is that in clustering there are no predefined classes and no examples. The objects are grouped together based on self-similarity (Berry & Linoff, 2000:11).

Typical business questions that can be answered using clustering are:

• What are the groupings hidden in your data?

• Which customers should be grouped together for target marketing purposes?

Clustering is grouped under descriptive data mining tasks.

3.1.6 Description and Visualisation

According to Berry & Linoff (2000:11), the purpose of data mining is sometimes simply to describe what is going on in a complicated database in a way that increases our understanding of the people, products or processes that produced the data in the first place. They state that a good enough description of behaviour will often suggest an explanation for it as well.

One of the most powerful forms of descriptive data mining is data visualisation. Although visualisation is not always easy, the right picture can truly speak a thousand words since human beings are extremely practiced at extracting meaning from visual


Visualisation can be useful in providing a visual representation of the location and distribution of a company’s major clients on a map of a city or a province or even a country. Allowing the visualisation of discovered patterns in various forms can help users with different backgrounds to identify patterns of interest and to interact or guide the system in further discovery (Han & Kamber, 2001:157).

Description and visualisation are grouped under descriptive data mining tasks.

3.1.7 Optimisation

Wallace (n.d.) simply defines optimisation as crucial in evaluating trade-offs of complex systems to find the best places to operate.

A typical business related question that could be answered using optimisation is:

• How can a company maximise its online presence and sales?

3.1.8 Outlier Analysis

In terms of data mining terminology, outliers refer to those objects in data sets that do not comply with the general behaviour or model of the data (Han & Kamber, 2001:25).

Most of the data mining methods ignore and discard these outliers as being noise or exceptions to the rule, but they play an important role in some applications such as fraud detection, where the rare events can be more interesting than the more regularly occurring ones.

Han & Kamber (2001:25) point out that outlier analysis can be very useful in the following business related applications:

• Uncovering fraudulent credit card transactions.

• Analysing the purchase frequency, location and type of customers.
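One simple way of flagging objects that do not comply with the general behaviour of the data is to measure how many standard deviations each value lies from the mean. The Python sketch below applies this idea to a list of transaction amounts. The z-score approach, the threshold of 2.5, the function name `find_outliers` and the amounts are all assumptions chosen for illustration and are not presented as the method of the cited authors.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    return [value for value in values if abs(value - centre) / spread > threshold]


# Hypothetical credit card transaction amounts for one customer.
amounts = [52, 48, 50, 47, 51, 49, 53, 50, 46, 400]

suspicious = find_outliers(amounts)
```

In a fraud detection setting the flagged transaction would not be discarded as noise but passed on for further investigation.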

3.1.9 Evolution Analysis

Data evolution analysis describes and models regularities or trends from objects whose behaviour changes over time (Han & Kamber, 2001:26).

A typical example of the use of evolution analysis could be its use to attempt to find stock evolution regularities from an analysis of stock market data over past years. These discovered regularities may help the user to predict future trends in stock market prices, contributing to the user’s decision making regarding stock investments (Han & Kamber, 2001:26).
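A very simple way of exposing a regularity in data that changes over time is to smooth the series with a moving average. The Python sketch below does this for a short series of prices. The moving average, the window of three periods, the function name `moving_average` and the prices are assumptions made for illustration; evolution analysis as described by Han & Kamber covers a much wider range of methods.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    return [
        sum(values[end - window + 1:end + 1]) / window
        for end in range(window - 1, len(values))
    ]


# Hypothetical closing prices of a share over eight periods.
prices = [10, 12, 11, 13, 15, 14, 16, 18]

smoothed = moving_average(prices, window=3)

# The raw prices move up and down, but the smoothed series never falls.
upward_trend = all(later >= earlier for earlier, later in zip(smoothed, smoothed[1:]))
```

The smoothed series removes the short-term fluctuations in the raw prices and leaves the underlying direction visible.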


The purpose of this study is not to delve deeply into the technical aspects of data mining but to provide a general overview of the subject. However, in order to be able to decide when one technique is called for, or when another would be more suitable,


Berry & Linoff (2000:103) highlight three basic data mining techniques as being very important since they are implemented in the majority of commercial software applications. They also cover a wide range of data mining situations.

Section 3.1 of this study divided data mining tasks into directed data mining and undirected data mining. The most commonly used undirected data mining technique that will be discussed in the following section is:

• Automatic cluster detection.

Two of the most commonly used directed data mining techniques that will be discussed in the following sections are:

• Decision trees.

• Neural networks.

3.2.1 Automatic Cluster Detection

Clustering is a method of grouping together similar records in a data set. The most commonly used algorithm for automatic cluster detection is k-means.

According to Berry & Linoff (2000:104), this algorithm works by dividing a data set into a predetermined number of clusters. The number of clusters is represented by the letter “k” in k-means. A mean is an average and in this case it refers to the average location of all the members of a cluster.
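The description above can be turned into a minimal Python sketch of k-means. Each record is assigned to the cluster whose mean location is nearest, after which the mean of every cluster is recalculated, and the two steps are repeated. Taking the first k records as starting means, running a fixed number of iterations, the function name `k_means` and the sample points are simplifying assumptions made for illustration.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Group `points` into `k` clusters and return the cluster means and members."""
    # Naive starting point (an assumption): use the first k records as the means.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: place each record in the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each mean to the average location of its members.
        centroids = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical records described by two numeric fields.
POINTS = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]

centroids, clusters = k_means(POINTS, k=2)
```

Note that the number of clusters has to be supplied in advance, and that the result can depend on the starting means that are chosen.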


Automatic cluster detection is an undirected data mining technique. As a result of this it can be applied without prior knowledge of the structure to be discovered. This is also its weakness in that if you do not know what you are looking for, it is difficult to recognise it when you find it.

According to Berry & Linoff (2000:109), automatic cluster detection is most useful in the following circumstances:

• If it is suspected that the data set contains natural groupings that may represent customers or products that have a lot in common with each other. It may turn out that these are naturally occurring customer segments that can be singled out for customised marketing approaches.

• When there are many competing patterns in the data set making it hard to identify a single pattern. In this case automatic cluster detection can be used to create clusters of similar records thereby reducing the complexity of the data set so that other data mining techniques are more likely to succeed.

3.2.2 Decision Trees

Mena (1999:357) defines a decision tree as a graphical representation of the relationships between a dependent variable (output) and a set of independent variables (inputs), usually in the form of a tree-shaped structure that represents a set of decisions with each node representing a test or decision.


Decision trees work by allowing records from a data set to flow through a series of tests such as “is field 3 greater than 27?” until the record reaches a leaf or terminal node, where it is given a class label based on the class of the records that reached that node in the training set when the model was originally set up. Berry & Linoff (2000:111) define the following two types of decision trees:

• Classification trees that label records in a data set and assign them to the proper class.

• Regression trees that estimate the value of a target variable that takes on numeric values. An example of a regression tree algorithm analysis is to calculate the expected size of claims that will be made by an insured person.
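The way in which a record flows through a series of tests until it reaches a leaf can be sketched in Python as follows. The tree is written out by hand as nested dictionaries instead of being learned from training data, and the field names, thresholds, class labels and the function name `classify_record` are hypothetical.

```python
# A hand-built classification tree. Each internal node tests whether a field
# is greater than a threshold; each leaf carries a class label.
TREE = {
    "test": ("previous_claims", 2),
    "yes": {"label": "high risk"},
    "no": {
        "test": ("age", 25),
        "yes": {"label": "low risk"},
        "no": {"label": "medium risk"},
    },
}


def classify_record(tree, record):
    """Follow the tests from the root until a leaf is reached and return its label."""
    node = tree
    while "label" not in node:
        field, threshold = node["test"]
        node = node["yes"] if record[field] > threshold else node["no"]
    return node["label"]


applicant = {"previous_claims": 1, "age": 22}
risk_class = classify_record(TREE, applicant)
```

Each path from the root to a leaf can be read as a business rule, for example "if previous claims exceed 2, the applicant is high risk". A regression tree would have the same structure but would store a numeric value, such as an expected claim size, at each leaf instead of a class label.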

Decision tree algorithms are a good choice for use in the following circumstances (Berry & Linoff, 2000:121):

• When the data mining task involves the classification of records in a data set or the prediction of outcomes.

• When the goal is to assign each data set record to one of a few broad categories.

The main advantage of using decision trees is found in their ability to generate understandable business rules in a decision support environment and their ability to model nonlinear relationships with logical rules (Mena, 1999:103).

Unfortunately, decision trees also have some disadvantages associated with their use in data mining, and these are highlighted by Wang (2003:233 – 234) as being the following:


• The data used by decision trees must be categorical or interval, and data not received in this format will have to be recoded to this format in order to be used.

• Decision trees normally represent a finite number of classes or possibilities, and it becomes difficult for decision makers to quantify a finite number of variables. The accuracy of the results obtained will be limited to the number of classes selected.

• Decision trees are not well suited for problems involving time series data unless a lot of effort is put into presenting the data in such a way that trends are made visible.

3.2.3 Neural Networks

From a data mining perspective, neural networks are simply another way of fitting a model to observed historical data in order to be able to make classifications or predictions (Berry & Linoff, 2000:121).

Kantardzic (2003:196) defines a neural network as a massively parallel distributed processor made up of simple processing units with the ability to learn from experiential knowledge expressed through inter-unit connection strengths, and that can make such knowledge available for use. Kantardzic (2003:196) highlights the following useful properties and capabilities of neural networks for use in data mining applications:

• Nonlinearity - A neural network’s nonlinearity models the inherently nonlinear real-world mechanisms responsible for generating data for learning.


• Adaptability – A neural network is able to adapt to changes in its operating environment by changing its interconnection weights.

• Evidential response – A neural network can be designed not only to provide information about which particular class to select for a given sample, but also about confidence in the decision made.

• Fault tolerance – A neural network has the potential to be inherently fault tolerant and capable of robust computation, meaning that it can handle missing or incorrect data more effectively than other data mining techniques.

• Uniformity of analysis and design – The same principles, notation and steps in methodology are used in all domains involving the application of neural networks.
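The notion that a neural network learns by changing its interconnection weights can be illustrated with the simplest possible case, a single processing unit (a perceptron) trained with the classic perceptron learning rule. The sketch below is an assumption chosen for brevity and is far simpler than the multi-layer networks used in commercial data mining tools. The function names and the training samples, which describe the logical AND of two inputs, are hypothetical.

```python
def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs exceeds zero."""
    total = sum(weight * value for weight, value in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Adjust the weights a little every time the unit gives a wrong answer."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # No change when the answer is right; otherwise nudge each weight
            # in the direction that reduces the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Training samples: the output should be 1 only when both inputs are 1.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
```

After training, the knowledge that the unit has acquired resides entirely in the numeric values of its weights and bias, which also hints at why the reasoning of larger networks is difficult to inspect.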

According to Berry & Linoff (2000:128), neural networks should be used in the following circumstances:

• Neural networks are a good choice for most classification and prediction tasks when the results obtained from the model are more important than understanding how the model works.

• Neural networks are not a good idea if there are many hundreds or thousands of input features. Such large numbers of input parameters make it difficult for the network to find patterns and can result in very long training phases that never converge on a good solution.

Wang (2003:232 - 233) highlights the following as being the major problems with using neural networks for data mining purposes:


• Training the neural network can take up a considerable amount of time and resources, and it is therefore recommended to limit its application to small to medium datasets.

• Neural networks lack explicitness due to their black box nature, and it is therefore difficult to understand how the neural network came to its conclusions.

• The neural network will never be able to solve a problem that a human couldn’t, given enough time.

• There is a very limited tradition of experience on which to draw when choosing between the neural networks on offer, making it difficult or impossible for an organisation to confidently choose an appropriate solution.


A review of the literature yielded several different suggested data mining processes. In fact the processes are similar, but the authors seem to emphasise different steps in the processes.

Figure 3.1, figure 3.2, figure 3.3 and figure 3.4 represent four of the different data mining processes identified in the literature.




FIGURE 3.2 THE DATA MINING PROCESS ACCORDING TO MENA

Identify the business objective

Select the data

Prepare the data

Evaluate the data

Format the solution

Select the tools

Construct the models

Validate the findings

Deliver the findings

Integrate the solutions



Identify the business problem

Transform the data into actionable results

Act on the results

Measure the results

Source: Adapted from Berry & Linoff (2000:44).


State the problem and formulate the hypothesis

Collect the data

Preprocess the data

Estimate the model

Interpret the model and draw conclusions


The four data mining processes presented in the previous four figures highlight the fact that data mining is not simply the random picking and applying of computer-based tools to match the presented problem and automatically find solutions. Kantardzic (2003:6) states that data mining is in fact a carefully planned and considered process of deciding what will be most useful, promising and revealing.

Although the four outlined data mining processes detail different steps in the data mining process, they have a few very important steps in common. These will be discussed next and will also form the four steps in the data mining process suggested by this study.

3.3.1 Identify the business problem

All four of the outlined data mining processes start with defining the problem. This step involves communication between the data mining experts and the people who understand the business (Berry & Linoff, 2000:44).

This first step in the data mining process is particularly important in that it can assist in answering the following questions (Berry & Linoff, 2000:45):

• Is the data mining effort really necessary?

• Is there a particular segment or subgroup that is most interesting?

• What are the relevant business rules?

• Are there any invalid data sources or problems with the data that the business is aware of?




