Data Mining
Applications
Overview
Elena Irina Neaga
Forac Research Consortium, Laval University
Québec City, Canada
E-mail: [email protected]
Outline
• Foundation
• Motivation: Why Data Mining?
• Definitions
• Current State-of-the-Art
• General Applications
• Examples
• Industry and Business Application Areas
• Selected Algorithms and Methods
• Distributed Data Mining using Intelligent Agents
• Commercial Software Systems
• Methodologies, Projects and Standards
• Main References and Web-Resources
Background
• The conventional model for turning data into information, then into knowledge and possibly wisdom, is defined as follows:
data ==> information ==> knowledge ==> wisdom
• Knowledge discovery (KD) and data mining (DM) are
interdisciplinary areas based on statistical analysis, database approaches and artificial intelligence (AI), especially machine learning.
• KD and DM incorporate complex algorithms from
statistics and AI, including imaginative and intuitive processing.
“Knowledge is power.” Francis Bacon
Data, Information, Knowledge, Wisdom
Data is a collection of unanalyzed observations of worldly events.
Information is a summary and communication of the main components and relationships contained within the data, presented within a specific context.
Knowledge is an interrelated collection of procedures for acting toward particular results in the world, with associated references for when each is applicable along with its range of effectiveness.
Motivation
The amount of data generated by applications has increased dramatically, and this data is a valuable source for the discovery of new information and knowledge.
The explosion of data has caused a comparable explosion in the need to analyze it. Increased computational power now makes feasible analyses that were once too computationally expensive.
“We are drowning in information, but starved for knowledge.”
• Organizations have huge databases containing large amounts of data which could be a source of new information and knowledge;
• Business and marketing databases potentially constitute a valuable resource for business and market intelligence applications;
• Enterprises also rely on vast amounts of data and information located in large databases. The value of this information can be increased if additional knowledge can be gained from it.
Motivation (continued)
“In an economy where the only certainty is uncertainty, the one sure source of lasting competitive advantage is knowledge.”
Definition
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, previously unknown, potentially useful and ultimately understandable patterns in data [Fayyad et al., 1996].
The whole KDD process includes, but is not limited to, the following steps:
• data selection;
• data cleaning;
• data preprocessing, including reduction and transformation;
• data mining for identifying interesting patterns in datasets;
• data interpretation and evaluation.
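The steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the record fields (`region`, `sales`, `note`), the helper names and the "mining" step (a per-group mean) are assumptions made for the example and are not taken from the KDD literature cited here.

```python
from statistics import mean

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that have missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def preprocess(records, field):
    """Preprocessing (transformation): rescale one numeric field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, group_field, value_field):
    """Data mining (toy stand-in): mean of value_field per group."""
    groups = {}
    for r in records:
        groups.setdefault(r[group_field], []).append(r[value_field])
    return {g: mean(vs) for g, vs in groups.items()}

def evaluate(patterns, threshold):
    """Interpretation and evaluation: keep only groups at or above a threshold."""
    return {g: m for g, m in patterns.items() if m >= threshold}

def kdd(records):
    """Run the steps in order on a list of hypothetical sales records."""
    data = select(records, ["region", "sales"])
    data = clean(data)
    data = preprocess(data, "sales")
    return evaluate(mine(data, "region", "sales"), threshold=0.5)

raw = [
    {"region": "north", "sales": 10, "note": "a"},
    {"region": "north", "sales": 30, "note": "b"},
    {"region": "south", "sales": 50, "note": "c"},
    {"region": "south", "sales": None, "note": "d"},  # removed by clean()
]
print(kdd(raw))  # {'south': 1.0}
```

In a real project each step would be far richer, and the process is usually iterative rather than a single pass.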
Patterns in the context of knowledge discovery and data mining are defined as similar structures in a file or a database that are relevant and repetitive.
A model is an abstraction that captures the essential and global aspects of the complex real-world systems
and/or sub-systems. The model may include the definition of an information structure in order to store, process,
analyze and use the associated data.
In the context of DM the distinction between a pattern and a model is arbitrary [Hand, 1998].
Discovery vs. Invention
• Discovery Science (DS): the 5th International Conference was held in Lübeck, Germany, in 2002.
• Knowledge is a topic which belongs to both science and philosophy.
• Francis Bacon (1610) stated that knowledge is obtained from experience, and that Nature is ruled by laws and theories which scientists have the main task of discovering and describing by models. According to him, science is an inductive process.
• On the other hand, science may be defined as a process of inventing theories which are then checked against experience. This view gained ground with the invention of non-Euclidean geometry in the 19th century and, later, the theories of relativity.
KDD vs. KM
KM supports knowledge creation; KDD leads to knowledge.
KM typically deals with the managerial procedures for producing and using knowledge within an organization, such as individual and collective learning and knowledge transfer.
KDD is focused on automated or semi-automated knowledge generation from raw data, based on machine learning.
The difficulty of formulating distinct definitions for KDD and KM is due to the paradox that knowledge resides in the human mind, but it may be captured, generated, stored, processed and reported using
Polanyi (1962, 1966) defines two types of knowledge that are generally accepted in the field of KM and that some KDD approaches also attempt to consider:
Tacit knowledge: implicit, mental models, and experiences of individuals.
Explicit knowledge: formal models, rules, and procedures.
An open debate concerns the relationship between human knowledge and computer-based knowledge approaches such as knowledge discovery, knowledge engineering (acquisition, knowledge-based/expert systems) and some areas of knowledge management.
Knowledge about the past, which is stable, voluminous and accurate;
Knowledge about the present, which is unstable, compact and may be inaccurate;
Knowledge about the future, which is hypothetical.
DM vs. Operations Research
Combining OR and data mining may be very useful in decision-making because:
• A discovered pattern is interesting only to the extent to which it can be used in the decision-making process of an enterprise.
• Generally, OR deals with searching for the best solutions to decision problems using mathematical techniques.
• Optimization solvers may be complemented and refined with data mining algorithms.
• Optimization algorithms are applied to data imported from a DBMS and/or the Internet, but they may also process a data warehouse and/or the patterns discovered in data.
• The potential of applied DM and neural networks for OR is increasing.
• SAS/Operations Research and SAS/Enterprise Miner may be
Related Definitions
KD and DM are defined in several ways, but from the perspective of computer science the best-known definitions are:
• The process of searching for, retrieving or visualizing valuable information and new knowledge in large volumes of data.
• The exploration and analysis, by automatic or semi-automatic means, of large quantities of data usually stored in databases.
• The discovery of new correlations, hidden knowledge, unexpected information, patterns and new rules from large databases.
It is also possible to consider DM more as a set of organized activities than as methods on their own, because the main algorithms are borrowed from closely related areas such as statistics and/or artificial
Related Definitions (continued)
DM is the core of the whole process of Knowledge Discovery in Databases (KDD). It covers several processing techniques for data, especially data held in large databases and data warehouses. A data warehouse is a central store of data extracted from operational data.
Cristofor (2002) clearly specifies that there is no restriction on the types of data that can be used as input for DM. The input data can be a relational or object-oriented database, a data warehouse, a web server log or a text file. DM is associated with large amounts of data, but for research and testing applications the test data sets are of limited size and are usually flat files.
Related Definitions (continued)
Several research projects are inter- or cross-disciplinary with respect to data mining as well as to business, finance, marketing and other areas. These approaches define data mining as follows [Berry, Linoff, 2000], [Berson et al., 2000], [Helberg, 2002], [Pyle, 2003]:
• The process of utilizing the results of data exploration to adjust or enhance business strategies and performances. The information produced by DM engines requires intelligent review by human experts.
• A technique which helps uncover trends in time to make the knowledge actionable.
• Within every organization is an amount of data which can describe the past performance of the organization through KD and DM.
Related Definitions (continued)
DM finds patterns and relationships in data by using sophisticated techniques to build models.
A model is an abstract representation of reality which is useful for understanding and analyzing it in order to make decisions. There are two main kinds of models in data mining:
Predictive models can be used to forecast explicit values, based on patterns determined from known results. They could predict financial trends, market evolution, customer behaviour etc.
Descriptive models describe patterns in existing data, and are generally used to create meaningful subgroups.
General Applications
Marketing (Direct Marketing, Market Basket
Analysis)
Banking and Finance
Telecommunication
Engineering
Environmental and Molecular Sciences
Medicine
Examples
• Analysis of transactional data stored in a supermarket database in order to improve the way in which the products are arranged on shelves.
• Exploring a supermarket database in order to determine how people usually shop: which products they buy together, and at what time.
• Predicting customer demand for a specific product.
• Data analysis of a promotional campaign, e.g. who is most likely to reply to a direct-mail promotional campaign.
Industry and Business Application Areas
• Customer Relationship Management;
• Supply Chain Management;
• Enterprise Resource Planning;
• E-Business and E-Commerce;
• Demand Management (forecasting);
[Figure: Data Mining Processing, linking enterprise databases to enterprise objectives and strategies, with a "modify the objective" step.]
Data mining processing activities: data aggregation and integration; data visualisation; data modeling; data segmentation; data management and selection; knowledge communication; presentation and interpretation of the results; prediction and forecasting based on new information.
Enterprise objectives and strategies:
• Improving the quality of products and services;
• Improving business performance;
• Improving the position in the marketplace;
• Improving customer satisfaction and loyalty.
DM using a Data Warehouse
[Figure: labels DB1, DB2, …, Legacy System; Data Warehouse & Central Repository; M1, M2, …, Mn; Data Mining; Retrieving/Using/Visualizing New Information.]
Integrated DM, DW and OLAP
[Figure: labels Data Mining tools, DSS tools, OLAP tools; Data Warehouse; OLAP database; data marts; data mining server; extended enterprise databases (CRM, SCM, ERP, other data).]
A data warehouse (DW) is defined as the extraction and integration of data from multiple sources and legacy systems in an effective and efficient manner. Usually a DW is obtained from operational data, and the information in a DW is subject-oriented, non-volatile, integrated and time-dependent [Adriaans, Zantinge, 1996].
A DW contains large datasets which are organized using the concept of metadata, which describes the properties and characteristics of data and information stored in a central repository. Metadata has become a topic in its own right, dealing with the intensive study of data and its behaviour.
Data marts (DMs) are subsets of data focused on selected subjects, e.g. a marketing data mart may include customer, product and sales information.
Data Mining vs. Statistics
“To statisticians, the data mining conveys the sense of naive hope vainly struggling against the cold realities of chance.” D.J. Hand
DM and statistics differ in several respects; the main differences are presented below [Pyle, 1999]:
• Statistics assumes a pattern and its algorithms attempt to prove it; DM describes a kind of pattern and its algorithms find instances of it;
• DM processes data which is usually given as a database or a large flat file; statistics is applied to small and clean datasets;
• The objective of DM is to find patterns, knowledge and valuable new information in data, whereas in statistical analysis data is processed according to a predefined objective;
• Statistics considers data variation, but this is not considered in DM;
• In DM residual data is useful and is processed, whereas in statistics it is removed from the original data set.
Data Mining vs. Statistics (continued)
• DM is very much an inductive process, as opposed to the hypothetico-deductive approach often seen as the paradigm for how modern science progresses [Hand, 1998].
• Statistics deals with primary data analysis. DM is entirely concerned with secondary data analysis.
• Classical statistics deals with numeric data. DM is applied to image data, audio data, text data and geographical data. Mining the web has become a distinct topic.
• However, DM applied to several real-world problems such as supply chain optimization and process and quality control may not provide solutions beyond the use of statistics, probability theory, soft computing (ANN, fuzzy logic, evolutionary computation) and operations research.
• Several DM algorithms have their roots in statistical analysis.
• DM is not new, as it joins several mathematical and artificial intelligence problem-solving techniques and methods usually applied to large amounts of historical data.
Selected Algorithms and Methods
“It is by intuition that we discover and by logic we prove.” Henri Poincaré
• Regression;
• Classification;
• Association Rules;
• Clustering;
• Sequential Analysis/Pattern Finding;
• Combined Methods;
• On-Line Analytical Processing (OLAP).
Regression
Linear and non-linear regression are widely used for correlating data.
Statistical regression requires the specification of a function to which the data is fitted.
In order to specify the function it is necessary to know the forms of the equations governing the correlation for a data set [Wang, 1999]. Even though regression is considered to be a statistical technique, the distinction is arbitrary because DM deals with predictive modeling, and regression does exactly the same [Berry, Linoff, 2000].
There are many applications of regression, for example predicting customer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable.
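As an illustration of the simplest case, the sketch below fits a straight line by ordinary least squares and also shows how a time series can be turned into time-lagged input/target pairs. The figures for advertising expenditure and demand are made up for the example, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return intercept + slope * x

def lagged_pairs(series, lag=1):
    """Time-lagged inputs: pair each value with the value `lag` steps earlier."""
    return [(series[t - lag], series[t]) for t in range(lag, len(series))]

# Hypothetical data: demand as a function of advertising expenditure.
ad_spend = [1.0, 2.0, 3.0, 4.0]
demand = [3.0, 5.0, 7.0, 9.0]  # lies exactly on demand = 1 + 2 * ad_spend

model = fit_line(ad_spend, demand)
print(model)                # slope close to 2, intercept close to 1
print(predict(model, 5.0))  # close to 11

# Time series: regress each value on the previous one.
series = [1.0, 2.0, 4.0, 8.0]
inputs, targets = zip(*lagged_pairs(series, lag=1))
print(fit_line(list(inputs), list(targets)))  # slope close to 2 (doubling)
```

Non-linear regression follows the same idea, but the function to be fitted must be specified in advance, as noted above.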
Classification
Classification, also known as segmentation, is the process of examining known groups of data to determine which characteristics can be used to identify (or predict) group membership.
Examples of classification include the classification of trends in financial markets, grouping customers based on their past transactions and predicting their response to a particular product promotion [Fayyad et
Association Rules
Association rules were introduced by R. Agrawal, T. Imieliński and A. Swami in 1993, and the most widely used algorithm, Apriori, was introduced in 1994 by R. Agrawal and R. Srikant. The basic idea of association rules is to search the data for patterns of the following form:
IF (some conditions are true) THEN (some other conditions are probably true)
Each such pattern extracted from data is called an association rule, or simply a rule.
Association Rules (continued)
Association Rules have two main characteristics associated with them that measure their value:
• Coverage describes how much evidence is in the training data set to back up the rule. It usually ranges between 0 and 1 (0% and 100%).
• Confidence describes how likely the rule is to give a correct prediction. It is also in the range between 0 and 1 (0% and 100%).
In addition, the algorithm of association rules uses the support of a rule which is the number of records or transactions which confirm the rule [Cristofor, 2002].
Association Rules (continued)
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items.
Let $D$ be a database, usually of transactions, where each transaction $T \subseteq I$.
For a given itemset (a non-empty set of items) $X \subseteq I$ and a given transaction $T$, if $X \subseteq T$ then $T$ contains $X$.
The support count $\sigma_X$ of an itemset $X$ is the number of transactions in $D$ that contain $X$. $X$ is a large itemset with respect to a support $s$ if $\sigma_X \ge s \times |D|$, where $|D|$ is the number of transactions in $D$.
An association rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$.
The association rule $X \Rightarrow Y$ has confidence $c$ if $\sigma_{X \cup Y} / \sigma_X = c$. The rule $X \Rightarrow Y$ has support $s$ in $D$ if $\sigma_{X \cup Y} = s \times |D|$.
Thus, if $s$ is the given support, mining association rules amounts to finding the set $L = \{X \mid X \subseteq I \wedge \sigma_X \ge s \times |D|\}$.
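The definitions above can be tried out on a toy database. The sketch below counts support, enumerates the large itemsets by brute force and computes the confidence of a rule. It is not the Apriori algorithm itself: it only borrows the observation that a superset of a non-large itemset cannot be large, in order to stop early. The transactions and function names are made up for the example.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions T in D that contain the itemset X (X is a subset of T)."""
    return sum(1 for t in transactions if itemset <= t)

def large_itemsets(transactions, min_support):
    """All itemsets X whose support count is at least min_support * |D|.

    Brute-force enumeration by increasing size. It stops at the first size
    with no large itemset, since every subset of a large itemset is large.
    """
    items = sorted(set().union(*transactions))
    threshold = min_support * len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            x = frozenset(combo)
            count = support_count(x, transactions)
            if count >= threshold:
                result[x] = count
                found = True
        if not found:
            break
    return result

def confidence(x, y, transactions):
    """Confidence of the rule X => Y: support count of X u Y over that of X."""
    return support_count(x | y, transactions) / support_count(x, transactions)

# Hypothetical market-basket database D with |D| = 4.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

L = large_itemsets(D, min_support=0.5)
for itemset, count in L.items():
    print(sorted(itemset), count)

print(confidence(frozenset({"butter"}), frozenset({"bread"}), D))  # 1.0
```

With a minimum support of 0.5 the large itemsets here are the three single items plus {bread, butter} and {bread, milk}; the rule butter ⇒ bread holds in every transaction that contains butter.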
Clustering
Clustering, like segmentation, identifies groups of similar cases, but it does not predict outcomes or target categories [Helberg, 2002].
Clustering algorithms are also called unsupervised classification, and they group a set of physical or abstract objects into classes of similar objects.
Clustering analysis supports the construction of meaningful partitions of a large set of objects based on the divide-and-conquer methodology which decomposes a large-scale system into smaller components to simplify design and implementation.
An example relates to identifying customers that would make good targets for a new product marketing promotion.
Clustering (continued)
The clustering methods are divided into:
• Hierarchical clustering, which combines cases and clusters that are similar to each other, one pair at a time.
• K-Means clustering, which is based on the assumption that the data falls into a known number (K) of clusters. This method starts by defining initial profiles, called cluster centers, for the K clusters, sometimes using random values for the clustering characteristics and sometimes using dissimilar cases from the data set.
K-Means Clustering
In the K-Means algorithm, each object $x_i$ is assigned to a cluster $j$ according to its distance $d(x_i, m_j)$ from a value $m_j$ representing the cluster itself. $m_j$ is called the representative of the cluster.
Given a set of objects $D = \{x_1, \ldots, x_n\}$, a clustering problem is to find a partition $C = \{C_1, \ldots, C_k\}$ of $D$ such that:
1. Each $C_i$ is associated to a representative $m_i$;
2. $x_i \in C_j$ if $d(x_i, m_j) \le d(x_i, m_l)$ for $1 \le l \le k$, $j \ne l$;
3. The partition $C$ minimizes:
$$\sum_{i=1}^{k} \sum_{x_j \in C_i} d^2(x_j, m_i)$$
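A minimal sketch of the usual iterative procedure (often called Lloyd's algorithm) is given below. It assumes that d is the Euclidean distance, that the representative of a cluster is the mean of its members, and that the first k points serve as initial centers; these choices and all names are assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance d^2(a, b) between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def k_means(points, k, iterations=100):
    """Alternate between assigning points and recomputing representatives.

    Assumes len(points) >= k and uses the first k points as initial centers.
    """
    centers = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest representative.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each representative becomes the mean of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters

def objective(centers, clusters):
    """Sum over clusters of squared distances to the cluster representative."""
    return sum(
        sq_dist(p, centers[j]) for j, members in enumerate(clusters) for p in members
    )

points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
centers, clusters = k_means(points, k=2)
print(centers)                       # [(0.0, 0.5), (10.0, 10.5)]
print(objective(centers, clusters))  # 1.0
```

The procedure converges to a local minimum of the objective, so the result can depend on the initial centers.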
Sequential Patterns
Sequential patterns are part of sequential analysis. The main goal of a sequential pattern algorithm is to find all sequential patterns that reach a pre-defined minimum support, i.e. that are contained in at least a minimum share of the data sequences. The input data is a list of sequential transactions, and there is often an associated transaction time.
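The notion of support for a sequential pattern can be illustrated as follows. The sketch assumes that each data sequence is a list of transactions (sets of items) already ordered by transaction time, and that a pattern is contained in a sequence when its elements are subsets of distinct transactions in the same order. The customer data and function names are hypothetical, and the sketch only measures the support of a given pattern; it does not search for all patterns.

```python
def contains(sequence, pattern):
    """True if the pattern's itemsets occur, in order, within the sequence.

    Greedy left-to-right matching: each pattern element is matched to the
    earliest remaining transaction that includes it.
    """
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next element must match a strictly later transaction
    return True

def sequence_support(pattern, data_sequences):
    """Fraction of data sequences that contain the pattern."""
    hits = sum(1 for s in data_sequences if contains(s, pattern))
    return hits / len(data_sequences)

# Hypothetical data: one time-ordered sequence of transactions per customer.
customers = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"d"}],
    [{"b"}, {"a"}],
]

print(sequence_support([{"a"}, {"d"}], customers))  # 2 of 3 customers
print(sequence_support([{"a"}, {"b"}], customers))  # 1 of 3 customers
```

A pattern would be reported only if its support reaches the pre-defined minimum, for example 0.5 in this toy data set.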
Combined Methods
• Combination of different algorithms for a rule-based knowledge extraction process with neural networks (NN) and Case-Based Reasoning (CBR):
  • CBR represents the process of acquiring knowledge represented by cases, using reasoning by analogy.
  • NNs are computer models based on the architecture of the human brain, consisting of multiple simple processing units connected by adaptive weights.
• Combination of clustering and neural networks (NN).
Combined Methods (continued)
[Figure: knowledge extraction from a DB through NN model generation and rule-based model generation, producing NN models and a rule base.]
On-line Analytical Processing
OLAP and DM are considered to be two complementary techniques for analyzing large amounts of data in databases and/or data warehousing environments.
OLAP is a way of performing multi-dimensional analysis on relational databases.
DM goes further than OLAP: OLAP performs multi-dimensional processing of a database, whereas DM can extract new knowledge and hidden information.
A multi-dimensional representation related to a product family is shown in the next slide / figure.
OLAP (continued)
[Figure: a multi-dimensional record with its dimensions and attributes: City = London, Company = xx, Product = yy, Category = aa, Industry = Food, Year = 2002, Profit = 56%.]
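To give a feel for multi-dimensional analysis, the sketch below aggregates a small fact table along chosen dimensions (a roll-up) and filters it on one dimension value (a slice). The rows, the `sales` measure and the function names are hypothetical; a real OLAP server would operate on a data warehouse rather than on an in-memory list.

```python
from collections import defaultdict

# Hypothetical fact table: dimensions (city, product, year) and one measure (sales).
facts = [
    {"city": "London", "product": "yy", "year": 2002, "sales": 100.0},
    {"city": "London", "product": "zz", "year": 2002, "sales": 50.0},
    {"city": "London", "product": "yy", "year": 2003, "sales": 70.0},
    {"city": "Paris", "product": "yy", "year": 2002, "sales": 30.0},
]

def roll_up(rows, dimensions, measure):
    """Sum the measure over all rows sharing the same values of the dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

def slice_rows(rows, dimension, value):
    """Keep only the rows where the given dimension has the given value."""
    return [row for row in rows if row[dimension] == value]

print(roll_up(facts, ("city",), "sales"))
print(roll_up(facts, ("city", "year"), "sales"))
print(roll_up(slice_rows(facts, "year", 2002), ("product",), "sales"))
```

Such views summarize data along dimensions chosen by the analyst, which is the kind of multi-dimensional processing that DM complements with the search for patterns.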
Distributed Data Mining using Intelligent Agents
Intelligent agents support distributed and collaborative KD&DM systems:
• Each agent is responsible for a different step in the KD&DM process, such as pre-processing, DM and evaluation of the results;
• Agents that specialize in a pre-determined task can use the services of other agents, e.g. a classification agent uses the services of a pre-processing agent;
• The agents usually interact through a communication language or messages;
• The cooperative DM agents run concurrently and can be driven by an agent manager;
• The mining agent systems can be flexibly integrated with other agent systems.
XLMiner™
• It is an extension of Microsoft Excel™;
• It can help to quickly start the DM on spreadsheets and Excel files;
• It has extensive coverage of statistical and
machine learning techniques for classification, prediction, affinity analysis,
SAS Enterprise Miner™
• It is supported by the SEMMA (sampling, exploration, modification, modeling and assessment) methodology;
• It combines data warehousing, data mining and OLAP technologies;
• It defines a comprehensive solution that addresses the whole KDD process;
• It integrates advanced models and algorithms including clustering, decision trees, neural networks, memory-based reasoning, linear and logistic regression and associations;
• It also provides powerful statistical analysis capabilities;
• It uses advanced modeling techniques;
• It generates code in SAS internal language as well as C and
Java;
SAS Enterprise Miner™ (continued)
It has been successfully used for a wide range of CRM and e-commerce applications such as:
• promotion campaigns delivered by direct mail, telephone, e-mail and the Internet;
• customer profiling;
• identifying the most profitable customers and the underlying reasons for their loyalty;
• identifying fraudulent behaviour on an e-commerce site.
Its GUI makes it easy to use.
A business analyst with little statistical expertise can quickly and easily navigate through the SEMMA process, while data mining experts can examine the analytical process in depth.
SPSS Clementine™
• It is a DM workbench that enables users to quickly develop predictive models and deploy them into business operations to improve decision making;
• It is intended to deliver a high return on investment in a short time;
• It supports the entire DM process to shorten time-to-solution;
• It is designed around the de facto industry standard methodology, the CRoss-Industry Standard Process for Data Mining (CRISP-DM);
• It uses Clementine Application Templates (CATs) which follow the industry standard CRISP-DM methodology and use previous real-world application experience so that a new project benefits from a proven methodology and best
SEMMA Methodology
The SEMMA (Sample, Explore, Modify, Model, Assess) methodology was developed by SAS Institute Inc. and is applied with SAS Enterprise Miner™.
The steps of this methodology are as follows:
• Sample the data by extracting a portion of a large data set that contains enough significant information, yet is small enough to be manipulated quickly.
• Explore the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas about the data set.
• Modify the data by creating, selecting and transforming the variables to focus the model selection process.
• Model the data by allowing the system to search automatically for a combination of data that reliably predicts a desired outcome.
• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
Projects and Standards
[Figure: overview of main projects and standards: SOLEUNET, KDE, CRISP, CWM, JDM, SQL/MM, PMML, OLE DB for DM.]
Projects & Standards (continued)
•ISO: SQL/MM is a collection of SQL user-defined types and routines to define and apply DM models.
•DM Group: Predictive Model Markup Language (PMML) is an
open standard based on XML specification for exchanging DM models between applications.
•OMG: Common Warehouse Metamodel (CWM) is a Unified Modeling Language/XML specification for DM metadata.
•Microsoft: OLE DB for DM is a major step toward the standardization
of DM primitives, and it defines a DM object model for relational databases.
•Oracle9i DM is an extension to Oracle9i Database Enterprise Edition that embeds DM algorithms for classifications, predictions and association rules. All models and functions are accessible through Java-based
Projects & Standards (continued)
• CRISP-DM is a project which has also defined and validated a standard DM process that is applicable in diverse industry sectors, and it attempts to make any DM project faster, cheaper, more reliable and more manageable.
• SolEuNet's main aim is to apply DM and Decision Support (DS) systems in order to enhance the efficiency, effectiveness and quality of operations in business and industry. A virtual enterprise model has been proposed as a dynamic problem-solving link between advanced DM and DS systems.
• The Kensington Enterprise DM project (Imperial College, Dept. of Computing, London, UK) has developed the Kensington Discovery Edition (KDE), an enterprise-wide platform that supports entire processes of KD, including dynamic information
CRISP-DM
Defining a DM Project
“Make it as simple as possible, but no simpler.” Albert Einstein
[Figure: project definition; experimental design; data identification and pre-processing; data mining; evaluation of results.]
Factors affecting the adoption of DW and DM
Main References
Adriaans, P., Zantinge, D. “Data Mining”, Addison-Wesley, 1996.
Berry, M., Linoff, G.S. “Mastering Data Mining: The Art and Science of Customer Relationship Management”, John Wiley & Sons Inc., 2000.
Berson et al. “Building Data Mining Applications for CRM”, McGraw-Hill, USA, 2000.
Bramer, M.A. (editor) “Knowledge Discovery and Data Mining”, IEE, 1999.
Chen, Z. “An integrated architecture for OLAP and data mining”, in Knowledge Discovery and Data Mining, Bramer, M.A. (editor), IEE, 1999.
Cristofor, L. “Mining Rules in Single-table and Multiple-table Databases”, PhD Thesis, CS Dept., Univ. of Massachusetts, Boston, USA, 2002.
Goglin, J.F. “La construction du datawarehouse du datamart au dataweb”, 2e édition revue, Hermes Science Publication, Paris, 1998, 2001.
Main References (continued)
Fayyad et al. (eds) “Advances in Knowledge Discovery and Data Mining”, AAAI Press/The MIT Press, 1996.
Han, J., Kamber, M. “Data Mining: Concepts and Techniques”, Morgan Kaufmann, 2001.
Hand, D.J. “Data Mining: Statistics and More”, The American Statistician, Vol. 52, No. 2, 1998.
Helberg, C. “Data Mining with Confidence”, 2nd edition, SPSS Inc., 2002.
Jambu, M. “Introduction au data mining - Analyse intelligente des données”, Eyrolles, Paris, 1999.
Lange, S., Satoh, K., Smith, C.H. (eds.) “Discovery Science: 5th International Conference, DS 2002, Lübeck, Germany, Proceedings”, Berlin: Springer-Verlag, 2002.
Klosgen, W., Zytkow, J.M. (editors) “Handbook of Data Mining and Knowledge Discovery”, Oxford University Press, 2002.
Web-Resources
• http://www.dmreview.com/
• http://www.andypryke.com/university/sites.html
• http://www.modelandmine.com
• http://www.kdnuggets.com/
• http://www.sas.com/technologies/analytics/datamining/miner/index.html
• http://www.sas.com/operationsresearch
• http://www.spss.com/spssbi/clementine/
Web-Resources (continued)
• http://www.thearling.com/dmintro/dmintro.htm
• http://www.crisp-dm.org/
• http://soleunet.ijs.si/website/html/euproject.html
• http://www.dmg.org
• http://kmcenter.free.fr
• http://www.megaputer.com
“Discovery consists of seeing what everybody has seen and thinking what nobody has thought.” Albert von Szent-Györgyi
THANK YOU
MER