Data Mining Applications

(1)

Data Mining Applications: Overview

Elena Irina Neaga

FORAC Research Consortium, Laval University

Québec City, Canada

E-mail: [email protected]

(2)

Outline

• Foundation

• Motivation: Why Data Mining?

• Definitions

• Current State-of-the-Art

• General Applications

• Examples

• Industry and Business Application Areas

• Selected Algorithms and Methods

• Distributed Data Mining using Intelligent Agents

• Commercial Software Systems

• Methodologies, Projects and Standards

• Main References and Web-Resources

(3)

Background

• The conventional model for turning data into information, then into knowledge and possibly into wisdom, is:

data ==> information ==> knowledge ==> wisdom

• Knowledge discovery (KD) and data mining (DM) are interdisciplinary areas based on statistical analysis, database approaches and artificial intelligence (AI), especially machine learning.

• KD and DM incorporate complex algorithms from statistics and AI, including imaginative and intuitive processing.

“Knowledge is Power” Francis Bacon

(4)

Data, Information, Knowledge, Wisdom

(5)

Data is a collection of unanalyzed observations of worldly events.

Information is a summary and communication of the main components and relationships contained within the data and presented within a specific context.

Knowledge is an interrelated collection of procedures for acting toward particular results in the world, with associated references for when each is applicable along with its range of effectiveness.

(6)

Motivation

The amount of data generated by applications has increased dramatically, and this data is a valuable source for the discovery of new information and knowledge.

This eruption of data has caused a comparable explosion in the need to analyze it. Growing computational power now makes feasible analyses that would once have been too computationally expensive.

”We are drowning in information, but starved for knowledge.”

(7)

Motivation (continued)

• Organizations have huge databases containing large amounts of data which could be a source of new information and knowledge;

• Business and marketing databases potentially constitute a valuable resource for business and market intelligence applications;

• Enterprises also rely on vast amounts of data and information located in large databases. The value of this information can be increased if additional knowledge can be gained from it.

“In an economy where the only certainty is uncertainty, the one sure source of lasting competitive advantage is knowledge.”

(8)

Definition

Knowledge Discovery from Databases (KDD) is the nontrivial process of identifying valid, previously unknown, potentially useful and ultimately understandable patterns in data [Fayyad et al., 1996].

The whole KDD process includes, but is not limited to, the following steps:

• data selection;

• data cleaning;

• data preprocessing, including reduction and transformation;

• data mining to identify interesting patterns in datasets;

• data interpretation and evaluation.
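The steps above can be sketched as a chain of small functions. This is only an illustration under assumed toy data: the record layout, the store filter, and the choice of co-purchase counting as the mining step are assumptions made for the example, not part of the KDD definition.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw purchase records; None marks a missing quantity.
RAW = [
    {"store": "A", "customer": 1, "product": "Bread", "qty": 2},
    {"store": "A", "customer": 1, "product": "milk ", "qty": 1},
    {"store": "A", "customer": 2, "product": "bread", "qty": None},
    {"store": "A", "customer": 2, "product": "Milk", "qty": 3},
    {"store": "A", "customer": 3, "product": "bread", "qty": 1},
    {"store": "A", "customer": 3, "product": "milk", "qty": 1},
    {"store": "B", "customer": 4, "product": "tea", "qty": 5},
]

def select(records, store):
    """Data selection: keep only the records relevant to the analysis."""
    return [r for r in records if r["store"] == store]

def clean(records):
    """Data cleaning: drop records with a missing or non-positive quantity."""
    return [r for r in records if r["qty"] is not None and r["qty"] > 0]

def preprocess(records):
    """Preprocessing: normalize product names and reduce to one basket per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["product"].strip().lower())
    return list(baskets.values())

def mine(baskets):
    """Data mining: count how often each pair of products is bought together."""
    pairs = Counter()
    for basket in baskets:
        pairs.update(combinations(sorted(basket), 2))
    return pairs

def evaluate(pairs, min_count):
    """Evaluation: keep only the pairs seen at least min_count times."""
    return {pair: n for pair, n in pairs.items() if n >= min_count}

def kdd(records, store="A", min_count=2):
    """Run the five steps in order."""
    return evaluate(mine(preprocess(clean(select(records, store)))), min_count)

print(kdd(RAW))
```

In practice each step is iterative and far richer than a single function; the point here is only the order in which the steps hand data to each other.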

(9)

Patterns in the context of knowledge discovery and data mining are defined as similar structures in a file or a database that are relevant and repetitive.

A model is an abstraction that captures the essential and global aspects of complex real-world systems and/or sub-systems. The model may include the definition of an information structure in order to store, process, analyze and use the associated data.

In the context of DM the distinction between pattern and model is arbitrary [Hand, 1998].

(10)

Discovery vs. Invention

• The 5th Discovery Science (DS) conference was held in Lübeck, Germany, in 2002.

• Knowledge is a topic which belongs to both science and philosophy.

• Francis Bacon (1610) stated that knowledge is obtained from experience, and that Nature is ruled by laws and theories which scientists have the main task of discovering and describing through models. According to him, science is an inductive process.

• On the other hand, science may be defined as a process of inventing theories which are then checked against experience. This view is illustrated by the 19th-century invention of non-Euclidean geometry and by the theories of relativity.

(11)

KDD vs. KM

KM supports knowledge creation; KDD leads to knowledge.

KM typically deals with the managerial procedures for producing and using knowledge within an organization, such as individual and collective learning and knowledge transfer.

KDD is focused on automated or semi-automated knowledge generation from raw data, based on machine learning.

The difficulty of formulating distinct definitions for KDD and KM is due to the paradox that knowledge resides in the human mind, but it may be captured, generated, stored, processed and reported using

(12)

Polanyi (1962, 1966) defines two types of knowledge that are generally accepted in the field of KM and that some KDD approaches also attempt to consider:

Tacit knowledge: implicit, mental models, and experiences of individuals.

Explicit knowledge: formal models, rules, and procedures.

An open debate concerns the relationship between human knowledge and computer-based knowledge approaches such as knowledge discovery, knowledge engineering (acquisition, knowledge-based/expert systems) and some areas of knowledge management.

(13)

Knowledge about the past which is stable, voluminous and accurate;

Knowledge about the present which is unstable, compact and may be inaccurate;

Knowledge about the future which is hypothetical.

(14)

DM vs. Operations Research

Combining OR and data mining may be very useful in decision-making because:

• A discovered pattern is interesting only to the extent to which it can be used in the decision-making process of an enterprise.

• Generally OR deals with searching for the best solutions to decision problems using mathematical techniques.

• Optimization solvers may be complemented and refined with data mining algorithms.

• Optimization algorithms are applied to data imported from a DBMS and/or the Internet, but they may also process data from a data warehouse and/or the patterns discovered in the data.

• The potential of applied DM and neural networks for OR is increasing.

• SAS/Operations Research and SAS/Enterprise Miner may be

(15)

Related Definitions

KD and DM are defined in several ways, but from the perspective of computer science the best known definitions are:

• The process of searching for, retrieving or visualizing valuable information and new knowledge in large volumes of data.

• The exploration and analysis, by automatic or semi-automatic means, of large quantities of data usually stored in databases.

• The discovery of new correlations, hidden knowledge, unexpected information, patterns and new rules from large databases.

It is also possible to consider DM more as a set of organized activities than as methods on their own because the main algorithms are employed from close areas such as statistics and/or artificial

(16)

Related Definitions (continued)

DM is the key element, or the core, of the whole process of Knowledge Discovery in Databases (KDD). It deals with several processing techniques for data, especially data held in large databases and data warehouses. A data warehouse is a central store of data extracted from operational data.

Cristofor (2002) clearly specifies that there is no restriction to the types of data that can be used as input for DM. The input data can be a relational or object-oriented database, a data warehouse, a web server log or a text file. DM is associated with large amounts of data, but for research and testing applications, the test data sets are of a limited length, and are usually flat files.

(17)

Related Definitions

(continued)

Several research projects are inter or cross-disciplinary with respect to data mining as well as to business, finance, marketing and other areas. These approaches define data mining as follows [Berry, Linoff, 2000], [Berson et al., 2000], [Helberg, 2002], [Pyle, 2003]:

• The process of utilizing the results of data exploration to adjust or enhance business strategies and performances. The information produced by DM engines requires intelligent review by human experts.

• A technique which helps uncover trends in time to make the knowledge actionable.

• Within every organization is an amount of data which can describe the past performance of the organization through KD and DM.

(18)

Related Definitions (continued)

DM finds patterns and relationships in data by using sophisticated techniques to build models.

A model is an abstract representation of reality which is useful for understanding and analyzing it in order to make decisions. There are two main kinds of models in data mining:

Predictive models can be used to forecast explicit values, based on patterns determined from known results. They could predict financial trends, market evolution, customer behaviour etc.

Descriptive models describe patterns in existing data, and are generally used to create meaningful subgroups.

(19)

General Applications

Marketing (Direct Marketing, Market Basket Analysis)

Banking and Finance

Telecommunication

Engineering

Environmental and Molecular Sciences

Medicine

(20)

Examples

• Analysis of transactional data stored in a supermarket database in order to improve the way in which products are arranged on shelves.

• Exploring a supermarket database in order to determine purchasing patterns: how people usually buy, which products they buy together, and at what time.

• Predicting customer demand for a specific product.

• Data analysis of a promotional campaign, e.g. who is most likely to reply to a direct-mail promotional campaign.

(21)

Industry and Business Application Areas

• Customer Relationship Management;

• Supply Chain Management;

• Enterprise Resource Planning;

• E-Business and E-Commerce;

• Demand Management (forecasting);

(22)

[Figure: data mining processing linking enterprise databases to enterprise objectives and strategies, with a "modify the objective" feedback step]

Data mining processing: data management and selection; data aggregation and integration; data modeling; data segmentation; data visualisation; prediction and forecasting based on new information; presentation and interpretation of the result; knowledge communication.

Enterprise objectives and strategies:

• Improving the quality of products and services;

• Improving business performance;

• Improving the position in the marketplace;

• Improving customer satisfaction and loyalty.

(23)

DM using a Data Warehouse

[Figure labels: DB1, DB2, …; Legacy System; Data Warehouse & Central Repository; M1, M2, …, Mn; Data Mining; Retrieving/Using/Visualizing New Information]

(24)

Integrated DM, DW and OLAP

[Figure labels: Extended Enterprise Databases (CRM, SCM, ERP, other data); Data Warehouse; Data Marts; OLAP Database; Data Mining Server; Data Mining tools; DSS tools; OLAP tools]

(25)

A data warehouse (DW) is built by extracting and integrating data from multiple sources and legacy systems in an effective and efficient manner.

Usually a DW is obtained from operational data, and the information in a DW is subject-oriented, non-volatile, integrated and time-dependent [Adriaans, Zantinge, 1996].

A DW contains large datasets which are organized using the concept of metadata, which describes the properties and characteristics of the data and information stored in a central repository. Metadata has become a topic in its own right, dealing with the intensive study of data and its behaviour.

Data marts (DMs) are subsets of data focused on selected subjects; e.g. a marketing data mart may include customer, product and sales information.

(26)

Data Mining vs. Statistics

”To statisticians, the data mining conveys the sense of naive hope vainly struggling against the cold realities of chance.” D.J. Hand

DM and statistics do not overlap, and the main differences are presented below [Pyle, 1999]:

• Statistics assumes a pattern and its algorithms attempt to prove it; DM describes a kind of pattern and its algorithms find instances of it;

• DM processes data which is usually given as a database or a large flat file; statistics is applied to small and clean datasets;

• The objective of DM is to find patterns, knowledge and valuable new information in data, whereas in statistical analysis data is processed according to a predefined objective of analysis;

• Statistics considers data variation, but this is not considered in DM;

• In DM residual data is useful and is processed, whereas in statistics it is removed from the original data set.

(27)

Data Mining vs. Statistics

(continued)

DM is very much an inductive process, as opposed to the hypothetico-deductive approach often seen as the paradigm for how modern science progresses [Hand, 1998].

Statistics deals with primary data analysis. DM is entirely concerned with secondary data analysis.

Classical statistics deals with numeric data. DM is applied to image data, audio data, text data, and geographical data. Mining the web has become a distinct topic.

(28)

However, DM applied to several real-world problems such as supply chain optimization and process and quality control may not provide solutions beyond those obtained with statistics, probability theory, computational intelligence (ANN, fuzzy logic) and operations research.

Several DM algorithms have their roots in statistical analysis.

DM is not new as it joins several mathematical and artificial intelligence problem solving techniques and methods usually applied to large amounts of historical data.

(29)

Selected Algorithms and Methods

”It is by intuition that we discover and by logic we prove.” Henri Poincaré

• Regression;

• Classification;

• Association Rules;

• Clustering;

• Sequential Analysis/Pattern Finding;

• Combined Methods;

• On-Line Analytical Processing (OLAP);

(30)

Regression

Linear and non-linear regression are widely used for correlating data.

Statistical regression requires the specification of a function to which the data is fitted. In order to specify the function it is necessary to know the forms of the equations governing the correlation for a data set [Wang, 1999]. Even though regression is considered to be a statistical technique, the distinction is arbitrary because DM deals with predictive modeling, and regression does exactly the same [Berry, Linoff, 2000].

There are many applications of regression, for example predicting customer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable.
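As an illustration of the two applications just mentioned, the sketch below fits a straight line by ordinary least squares. The advertising and demand figures, the series, and the function names are hypothetical; a straight line is assumed as the fitted function, which is only one of many possible choices.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(model, x):
    """Evaluate the fitted line at x."""
    a, b = model
    return a + b * x

def lagged_pairs(series, lag=1):
    """Time-lagged inputs: pair each value with the value `lag` steps later."""
    return series[:-lag], series[lag:]

# Hypothetical data: demand as a function of advertising expenditure.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
demand = [12.0, 14.0, 16.0, 18.0, 20.0]
demand_model = fit_line(advertising, demand)
print(predict(demand_model, 6.0))

# Hypothetical time series: predict the next value from the current one.
series = [1.0, 2.0, 4.0, 8.0]
inputs, targets = lagged_pairs(series)
series_model = fit_line(inputs, targets)
print(predict(series_model, series[-1]))
```

The same `fit_line` function serves both cases; only the construction of the input variable changes.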

(31)

Classification

Classification, also known as segmentation, is the process of examining known groups of data to determine which characteristics can be used to identify (or predict) group membership.

Examples of classification include the classification of trends in financial markets, grouping customers based on their past transactions and predicting their response to a particular product promotion [Fayyad et

(32)

Association Rules

Association rules were introduced by R. Agrawal, T. Imieliński and A. Swami in 1993, and the most widely used algorithm, Apriori, was introduced by R. Agrawal and R. Srikant in 1994. The basic idea of association rules is to search the data for patterns of the following form:

IF (some conditions are true) THEN (some other conditions are probably true)

Each such pattern extracted from the data is called an association rule, or simply a rule.

(33)

Association Rules (continued)

Association Rules have two main characteristics associated with them that measure their value:

• Coverage describes how much evidence is in the training data set to back up the rule. It usually ranges between 0 and 1 (0% and 100%).

• Confidence describes how likely the rule is to give a correct prediction. It is also in the range between 0 and 1(0% and 100%).

In addition, the algorithm of association rules uses the support of a rule which is the number of records or transactions which confirm the rule [Cristofor, 2002].
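A small worked example may help fix these measures. The sketch below makes explicit assumptions that the slide leaves open: coverage is taken as the fraction of transactions that satisfy the IF part, confidence as the fraction of those that also satisfy the THEN part, and support as the count of transactions that confirm the whole rule. The baskets are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, coverage, confidence) for the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions in which the IF part holds.
    matches_if = [t for t in transactions if antecedent <= set(t)]
    # Transactions in which the THEN part holds as well.
    matches_both = [t for t in matches_if if consequent <= set(t)]
    support = len(matches_both)
    coverage = len(matches_if) / len(transactions)
    confidence = support / len(matches_if) if matches_if else 0.0
    return support, coverage, confidence

# Hypothetical market baskets.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

support, coverage, confidence = rule_measures(BASKETS, {"bread"}, {"milk"})
print(support, coverage, confidence)
```

For the rule IF bread THEN milk, three of the four baskets contain bread and two of those also contain milk, so the rule is confirmed by two transactions.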

(34)

Association Rules (continued)

Let I = {i1, i2, …, im} be a set of items.

Let D be a database, usually of transactions, where each transaction T ⊆ I.

For a given itemset X ⊆ I (a non-empty set of items) and a given transaction T, if X ⊆ T then T contains X.

The support count σ(X) of an itemset X is the number of transactions in D that contain X. X is a large itemset with respect to a support s if σ(X) ≥ s × |D|, where |D| is the number of transactions in D.

An association rule is an implication of the form X ⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅.

The association rule X ⇒ Y has confidence c if σ(X ∪ Y) / σ(X) = c. The rule X ⇒ Y has support s in D if σ(X ∪ Y) = s × |D|.

Thus, if s is the given support, mining association rules amounts to finding the set L = {X | X ⊆ I ∧ σ(X) ≥ s × |D|}.
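The set L of large itemsets can be found level by level, in the spirit of Apriori. The following is a simplified sketch, not the published algorithm: it joins large itemsets of size k into candidates of size k+1 but omits Apriori's subset-pruning step, and the transactions and function names are hypothetical.

```python
def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset X with count >= s * |D|."""
    threshold = min_support * len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        # Support count: number of transactions that contain the candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= threshold}
        frequent.update(level)
        # Join step: unions of two large k-itemsets that have exactly k + 1 items.
        k = len(next(iter(level))) + 1 if level else 0
        candidates = list({a | b for a in level for b in level if len(a | b) == k})
    return frequent

def confidence(frequent, x, y):
    """Confidence of X => Y as the ratio of the two support counts."""
    x, y = frozenset(x), frozenset(y)
    return frequent[x | y] / frequent[x]

# Hypothetical transaction database D.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

L = frequent_itemsets(D, min_support=0.5)
for itemset, count in sorted(L.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
print(confidence(L, {"bread"}, {"milk"}))
```

Because every subset of a large itemset is itself large, candidates of size k+1 only need to be built from large itemsets of size k, which is what keeps the search from enumerating all subsets of I.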

(35)

Clustering

Clustering like segmentation identifies groups of similar cases, but it does not predict outcomes or target categories [Helberg, 2002].

Clustering algorithms are also called unsupervised classification; they group a set of physical or abstract objects into classes of similar objects.

Clustering analysis supports the construction of meaningful partitions of a large set of objects based on the divide-and-conquer methodology which decomposes a large-scale system into smaller components to simplify design and implementation.

An example relates to identifying customers that would make good targets for a new product marketing promotion.

(36)

Clustering (continued)

The clustering methods are divided into:

• Hierarchical clustering, which combines the cases and clusters that are most similar to each other, one pair at a time.

• K-Means clustering, which is based on the assumption that the data falls into a known number (K) of clusters. This method starts by defining initial profiles, called cluster centers, for the K clusters, sometimes using random values for the clustering characteristics and sometimes using dissimilar cases from the data set.

(37)

K-Means Clustering

In the K-Means algorithm, each object $x_i$ is assigned to a cluster $j$ according to its distance $d(x_i, m_j)$ from a value $m_j$ representing the cluster itself. $m_j$ is called the representative of the cluster.

Given a set of objects $D = \{x_1, \ldots, x_n\}$, a clustering problem is to find a partition $C = \{C_1, \ldots, C_k\}$ of $D$ such that:

1. Each $C_i$ is associated with a representative $m_i$;

2. $x_i \in C_j$ if $d(x_i, m_j) \le d(x_i, m_l)$ for $1 \le l \le k$, $j \ne l$;

3. The partition $C$ minimizes:

$$\sum_{i=1}^{k} \sum_{x_j \in C_i} d^2(x_j, m_i).$$
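A common way to search for such a partition is to alternate between assigning each object to its nearest representative and recomputing each representative as the mean of its cluster. The sketch below assumes Euclidean distance and caller-supplied initial centers; the points are hypothetical. This iteration reaches a local minimum of the objective, which is not guaranteed to be the global one.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

def objective(centers, clusters):
    """Sum of squared distances from each point to its cluster representative."""
    return sum(squared_distance(p, m) for m, c in zip(centers, clusters) for p in c)

# Hypothetical two-dimensional objects and initial centers (K = 2).
POINTS = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = k_means(POINTS, [(0.0, 0.0), (10.0, 10.0)])
print(centers, objective(centers, clusters))
```

The result depends on the initial centers, which is why the method is usually run from several starting points.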

(38)

Sequential Patterns

Sequential pattern mining is part of sequential analysis. The main goal of the algorithm is to find all sequential patterns whose support in the data sequences reaches a pre-defined minimum. The input data is a list of sequential transactions, often with an associated transaction time.
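To make the idea concrete, the sketch below counts the support of ordered pairs of items. It rests on simplifying assumptions: every transaction holds a single item, a pattern is supported by a customer when its items occur in that order (not necessarily consecutively) in the customer's time-ordered sequence, and only patterns of length two are searched. The data and names are hypothetical.

```python
def build_sequences(transactions):
    """Group (customer, time, item) records into one time-ordered sequence per customer."""
    by_customer = {}
    for customer, _time, item in sorted(transactions, key=lambda t: (t[0], t[1])):
        by_customer.setdefault(customer, []).append(item)
    return list(by_customer.values())

def contains(sequence, pattern):
    """True if the pattern's items appear in the sequence in the same order."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so order is respected.
    return all(item in remaining for item in pattern)

def frequent_pairs(sequences, min_support):
    """Ordered pairs (a, b) supported by at least min_support of the sequences."""
    items = sorted({i for s in sequences for i in s})
    result = {}
    for a in items:
        for b in items:
            count = sum(1 for s in sequences if contains(s, (a, b)))
            if count / len(sequences) >= min_support:
                result[(a, b)] = count
    return result

# Hypothetical purchase log: (customer, transaction time, item).
TRANSACTIONS = [
    (1, 1, "tv"), (1, 2, "dvd"), (1, 3, "speakers"),
    (2, 1, "tv"), (2, 2, "speakers"),
    (3, 1, "dvd"), (3, 2, "tv"),
    (4, 2, "dvd"), (4, 1, "tv"),
]

sequences = build_sequences(TRANSACTIONS)
print(frequent_pairs(sequences, min_support=0.5))
```

Unlike an association rule, the pattern (tv, dvd) is distinct from (dvd, tv): customer 3 bought both items but in the opposite order, and so does not support the first pattern.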

(39)

Combined Methods

• Combination of different algorithms for the knowledge extraction process, based on rules with neural networks (NN) and Case-Based Reasoning (CBR):

  ◦ CBR represents the process of acquiring knowledge represented by cases, using reasoning by analogy.

  ◦ NNs are computer models based on the architecture of the human brain; they consist of multiple simple processing units connected by adaptive weights.

• Combination of clustering and neural networks (NN);

(40)

Combined Methods (continued)

[Figure labels: DB; Knowledge extraction; NN model generation; Rule-based model generation; NN models; Rule base]

(41)

On-line Analytical Processing

OLAP and DM are considered to be two complementary techniques for analyzing large amounts of data in databases and/or data warehousing environments.

OLAP is a way of performing multi-dimensional analysis on relational databases.

DM is more powerful than OLAP because, beyond the multi-dimensional processing of a database, DM can extract new knowledge and hidden information.

A multi-dimensional representation related to a product family is shown in the next slide / figure.

(42)

OLAP (continued)

[Figure: dimension attributes of a multi-dimensional representation: City = London; Company = xx; Product = yy; Category = aa; Industry = Food; Year = 2002; Profit = 56%]
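Multi-dimensional analysis can be illustrated with two basic operations on a small fact table: a slice, which fixes the value of one or more dimensions, and a roll-up, which aggregates a measure over the dimensions that remain. The table, the `sales` measure (used instead of the percentage profit so that values can be summed), and the function names are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions (city, product, year) and one measure (sales).
FACTS = [
    {"city": "London", "product": "yy", "year": 2001, "sales": 100.0},
    {"city": "London", "product": "yy", "year": 2002, "sales": 150.0},
    {"city": "London", "product": "zz", "year": 2002, "sales": 50.0},
    {"city": "Paris", "product": "yy", "year": 2002, "sales": 70.0},
]

def slice_facts(facts, **fixed):
    """Slice: keep only the rows whose dimensions match the fixed values."""
    return [row for row in facts if all(row[d] == v for d, v in fixed.items())]

def roll_up(facts, dimensions, measure):
    """Roll-up: total the measure for each combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

by_city_year = roll_up(FACTS, ["city", "year"], "sales")
by_city_2002 = roll_up(slice_facts(FACTS, year=2002), ["city"], "sales")
print(by_city_year)
print(by_city_2002)
```

Both operations only summarize values that are already stored, which is the sense in which OLAP and DM are complementary: DM searches for patterns that no predefined aggregation exposes.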

(43)

Distributed Data Mining using

Intelligent Agents

Intelligent agents support distributed and collaborative KD&DM systems:

• Each agent is responsible for a different step in the KD&DM process, such as pre-processing, DM, and evaluation of the results;

• Agents specialized in a pre-determined task can use the services of other agents, e.g. a classification agent uses the services of a pre-processing agent;

• The agents usually interact through a communication language or messages;

• The cooperative DM agents run concurrently and can be driven by an agent manager;

• The mining agent systems can be flexibly integrated with other agent systems.

(44)

XLMiner™

• It is an extension of Microsoft Excel™;

• It can help to quickly start DM on spreadsheets and Excel files;

• It has extensive coverage of statistical and machine learning techniques for classification, prediction, affinity analysis,

(45)

SAS Enterprise Miner™

• It is supported by the SEMMA (sampling, exploration, modification, modeling and assessment) methodology;

• It combines data warehousing, data mining and OLAP technologies;

• It defines a comprehensive solution that addresses the whole KDD process;

• It integrates advanced models and algorithms including clustering, decision trees, neural networks, memory-based reasoning, linear and logistic regression and associations;

• It also provides powerful statistical analysis capabilities;

• It uses advanced modeling techniques;

• It generates code in the SAS internal language as well as C and Java;

(46)

SAS Enterprise Miner™

It has been successfully used for a wide range of CRM and e-commerce applications such as:

• promotion campaigns delivered by direct mail, telephone, e-mail and the Internet;

• customer profiling;

• identifying the most profitable customers and the underlying reasons for their loyalty;

• identifying fraudulent behaviour on an e-commerce site.

It is easy to use because of its GUI.

A business analyst with little statistical expertise can quickly and easily navigate through the SEMMA process, while data mining experts can examine the analytical process in depth.

(47)

SPSS Clementine™

• It is a DM workbench that enables users to quickly develop predictive models and deploy them into business operations to improve decision making;

• It delivers the maximum return on investment in the minimum amount of time;

• It supports the entire DM process to shorten time-to-solution;

• It is designed around the de facto industry standard methodology, the CRoss-Industry Standard Process for Data Mining (CRISP-DM);

• It uses Clementine Application Templates (CATs), which follow the industry-standard CRISP-DM methodology and draw on previous real-world application experience so that a new project can benefit from a proven methodology and best

(48)

SEMMA Methodology

The SEMMA (Sample, Explore, Modify, Model, Assess) methodology was developed by SAS Institute Inc. and is applied successfully with SAS Enterprise Miner™.

The steps of this methodology are as follows:

• Sample the data by extracting a portion of a large data set that contains enough significant information but is small enough to be manipulated quickly.

• Explore the data by searching for unanticipated trends and anomalies in order to gain understanding of the data set and ideas about it.

• Modify the data by creating, selecting and transforming the variables to focus the model selection process.

• Model the data by allowing the system to search automatically for a combination of data that reliably predicts a desired outcome.

• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.

(49)

Projects and Standards

Overview of Main Projects and Standards

[Figure labels: SolEuNet; KDE; CRISP; CWM; JDM; SQL/MM; PMML; OLE DB for DM]

(50)

Projects & Standards (continued)

• ISO: SQL/MM is a collection of SQL user-defined types and routines to define and apply DM models.

• DM Group: Predictive Model Markup Language (PMML) is an open, XML-based standard for exchanging DM models between applications.

• OMG: Common Warehouse Metamodel (CWM) is a Unified Modeling Language/XML specification for DM metadata.

• Microsoft: OLE DB for DM is a major step toward the standardization of DM primitives, and it defines a DM object model for relational databases.

• Oracle9i DM is an extension to Oracle9i Database Enterprise Edition that embeds DM algorithms for classification, prediction and association rules. All models and functions are accessible through Java-based

(51)

Projects & Standards (continued)

• CRISP-DM is a project which has defined and validated a standard DM process that is applicable in diverse industry sectors; it aims to make any DM project faster, cheaper, more reliable and more manageable.

• SolEuNet has the main aim of applying DM and Decision Support (DS) systems in order to enhance the efficiency, effectiveness and quality of operations in business and industry. A virtual enterprise model has been proposed as a dynamic problem-solving link between advanced DM and DS systems.

• The Kensington Enterprise DM project (Imperial College, Dept. of Computing, London, UK) has developed Kensington Discovery Edition (KDE), an enterprise-wide platform that supports entire processes of KD, including dynamic information

(52)

CRISP-DM

(53)

Defining a DM Project

"Make it as simple as possible, but no simpler.“ Albert Einstein

[Figure labels: Project Definition; Experimental Design; Data Identification & Data pre-processing; Data Mining; Evaluation of Results; Factors affecting the adoption of DW and DM]

(54)

Main References

Adriaans, P., Zantinge, D. “Data Mining”, Addison-Wesley, 1996.

Berry, M., Linoff, G.S. “Mastering Data Mining The Art and Science of Customer Relationship Management” , John Wiley & Sons Inc., 2000.

Berson et al. “Building Data Mining Applications for CRM”, McGraw-Hill, USA, 2000.

Bramer, M.A.(editor)”Knowledge Discovery and Data Mining”, IEE, 1999.

Chen, Z., “An integrated architecture for OLAP and data mining” in Knowledge Discovery and Data Mining, Bramer, M.A.(editor), IEE, 1999.

Cristofor, L. “Mining Rules in Single-table, and Multiple-table Databases”, PhD Thesis, CS Dept. of Univ. of Massachusetts, Boston, USA, 2002.

Goglin, J.F., “La construction du datawarehouse du datamart au dataweb“, 2e édition revue, Hermes Science Publication, Paris, 1998, 2001.

(55)

Main References

(continued)

Fayyad et al. (eds) “Advances in Knowledge Discovery and Data Mining”, AAAI Press/The MIT Press, 1996.

Han, J., Kamber, M. “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.

Hand, D.J., “Data Mining: Statistics and More”, The American Statistician, Vol. 52, No. 2, 1998.

Helberg, C. “Data Mining with Confidence”, 2nd edition, SPSS Inc., 2002.

Jambu, M., “ Introduction au data mining - Analyse intelligente des données“, 1999 Eyrolles, Paris.

Lange, S., Satoh, K., Smith, C.H. (eds.) “Discovery Science: 5th International Conference, DS2002, Lübeck, Germany, Proceedings”, Berlin: Springer-Verlag, 2002.

Klosgen, W., Zytkow, J.M. (editors) “Handbook of Data Mining and Knowledge Discovery “, Oxford University Press, 2002.

(56)

Web-Resources

• http://www.dmreview.com/

• http://www.andypryke.com/university/sites.html

• http://www.modelandmine.com

• http://www.kdnuggets.com/

• http://www.sas.com/technologies/analytics/datamining/miner/index.html

• http://www.sas.com/operationsresearch

• http://www.spss.com/spssbi/clementine/

(57)

Web-Resources (continued)

• http://www.thearling.com/dmintro/dmintro.htm

• http://www.crisp-dm.org/

• http://soleunet.ijs.si/website/html/euproject.html

• http://www.dmg.org

• http://kmcenter.free.fr

• http://www.megaputer.com

(58)

”Discovery consists of seeing what everybody has seen and thinking what nobody has thought.”

Albert von Szent-Györgyi

THANK YOU

MER