Assessing Data Mining:
Assessing Data Mining:
The State of the The State of the
Practice Practice
©2003
©2003 Herbert A. Edelstein Herbert A. Edelstein Two Crows Corporation Two Crows Corporation 10500 Falls Road 10500 Falls Road Potomac, Maryland 20854 Potomac, Maryland 20854 www.twocrows.com www.twocrows.com
Objectives Objectives
Separate myth from realitySeparate myth from reality
Interactive session: question driven! The Interactive session: question driven! The slides are largely to ensure common
slides are largely to ensure common background.
background.
The Key to Value The Key to Value
The utility of data increases as it spans The utility of data increases as it spans the business value chain and is integrated the business value chain and is integrated
Information increases as data are relatedInformation increases as data are related
Consolidate similar databasesConsolidate similar databases
Consolidate different types of Consolidate different types of databases
databases
Without data and good analysis all you Without data and good analysis all you have are opinions.
have are opinions.
Data Mining Definitions Data Mining Definitions
What IT departments callWhat IT departments call
OLAPOLAP
Query Query
StatisticsStatistics
Data Mining Definitions Data Mining Definitions
Knowledge Discovery in Databases Knowledge Discovery in Databases is the is the non-trivial process of identifying valid, non-trivial process of identifying valid, novel, potentially useful and ultimately novel, potentially useful and ultimately
understandable patterns in data.
understandable patterns in data.
(Fayyad, Piatetsky-Shapiro, & Smyth) (Fayyad, Piatetsky-Shapiro, & Smyth)
KDD is the process, data mining is the KDD is the process, data mining is the application of algorithms
application of algorithms
Includes description and predictionIncludes description and prediction
Large databases often explicitly added Large databases often explicitly added
Data Mining Definitions Data Mining Definitions
Data miningData mining is a process that uses a is a process that uses a
variety of data analysis tools to discover variety of data analysis tools to discover
patterns and relationships in data that patterns and relationships in data that
may be used to make valid predictions may be used to make valid predictions
Exploration and description is required Exploration and description is required but not the goal
but not the goal
New Statistical Software New Statistical Software
Takes advantage of advances in Takes advantage of advances in hardware and software
hardware and software
Provides new interfaces for a wider class Provides new interfaces for a wider class of users
of users
Comes from statistics, machine learning Comes from statistics, machine learning and information systems
and information systems
Why Data Mining is Taking Off Why Data Mining is Taking Off
Demand for informationDemand for information
Availability of data Availability of data
Enormous quantity of happenstance dataEnormous quantity of happenstance data
Spread of data warehousesSpread of data warehouses
Data is easily accessible through the Web.Data is easily accessible through the Web.
Improved technologyImproved technology
Inexpensive, scalable processingInexpensive, scalable processing
Inexpensive storageInexpensive storage
High bandwidthHigh bandwidth
Why Data Mining Is Needed Why Data Mining Is Needed
Massive amounts of dataMassive amounts of data
Example:Example:
75 million customers75 million customers
3,000 columns for each customer3,000 columns for each customer
Low signal to noiseLow signal to noise
Subtle relationshipsSubtle relationships
VariationVariation
Allow domain experts to build predictive Allow domain experts to build predictive models
models
Data Mining Products Data Mining Products
Handle large volumes of dataHandle large volumes of data
Reduce dependence on the modelerReduce dependence on the modeler
Model specification Model specification
Knowing characteristics of variablesKnowing characteristics of variables
Create hypothesesCreate hypotheses
Emphasize predictionEmphasize prediction
Simplify model deploymentSimplify model deployment
Data mining is a productivity tool even Data mining is a productivity tool even for the skilled statistician
for the skilled statistician
Attribute Interactions Attribute Interactions
High 0.000
0.025 0.050 0.075
High Fee, Low Interest Low Fee, Low Interest
Low
Interest Rate Interest Rate Response
Response RateRate
Low Fee, High Interest High Fee, High Interest
Data Mining Myths Data Mining Myths
Data mining does NOTData mining does NOT
Find answers to unasked questionsFind answers to unasked questions
Explain behavior.Explain behavior.
Continuously monitor data for new patternsContinuously monitor data for new patterns
Eliminate the need to understand your businessEliminate the need to understand your business
Eliminate the need to collect good dataEliminate the need to collect good data
Commercial Applications Commercial Applications
IndustryIndustry
Retail Retail
FinancialFinancial
ManufacturingManufacturing
InsuranceInsurance
PublishingPublishing
Health careHealth care
ApplicationApplication
Marketing Marketing
Sales force managementSales force management
Fraud detectionFraud detection
Credit Risk Analysis Credit Risk Analysis
Database checksDatabase checks
Data validation: Does an address exist, is the Data validation: Does an address exist, is the social security number consistent with date social security number consistent with date
and place of birth, etc.
and place of birth, etc.
Where is Benford’s Law applicable?Where is Benford’s Law applicable?
History checks: when was the last time a History checks: when was the last time a property sold?
property sold?
Profiling good and bad credit risks – Finding Profiling good and bad credit risks – Finding good risks within bad categories.
good risks within bad categories.
Outlier detection – uni-dimensional and multi-Outlier detection – uni-dimensional and multi- dimensional
dimensional
Does not replace human follow up Does not replace human follow up
Benford’s Law Benford’s Law
If the numbers under investigation are not entirely If the numbers under investigation are not entirely random but somehow
random but somehow socially or naturally relatedsocially or naturally related, the , the distribution of the first digit is not uniform. More
distribution of the first digit is not uniform. More accurately, digit D appears as the first digit with the accurately, digit D appears as the first digit with the frequency proportional to log10(1 + 1/D). In other frequency proportional to log10(1 + 1/D). In other words, one may expect 1 to be the first digit of a words, one may expect 1 to be the first digit of a
random number in about 30% of cases, 2 will come up random number in about 30% of cases, 2 will come up in about 18% of cases, 3 in 12%, 4 in 9%, 5 in 8%,
in about 18% of cases, 3 in 12%, 4 in 9%, 5 in 8%, etc.
etc.
http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml
Data Mining
Data Mining Process Process
Define business problemDefine business problem
Prepare the data Prepare the data
Build data mining databaseBuild data mining database
Explore dataExplore data
Prepare data for modelingPrepare data for modeling
Create modelsCreate models
Build modelsBuild models
Evaluate modelsEvaluate models
Act on the results (implementation)Act on the results (implementation)
Apply models to new dataApply models to new data
Data Preparation Data Preparation
Build data mining databaseBuild data mining database
Explore dataExplore data
Prepare data for modelingPrepare data for modeling
60% to 95% of the time is spent 60% to 95% of the time is spent
preparing the data preparing the data
The Tools The Tools
Starting simply - linear regression Starting simply - linear regression
Fancier regressions Fancier regressions
Projections Projections
Smoothing based regressionsSmoothing based regressions
Survival analysis Survival analysis
Nearest neighbor methodsNearest neighbor methods
Collaborative filteringCollaborative filtering
Trees Trees
MARS MARS
Neural networks Neural networks
Decision Trees Decision Trees
Build a tree inductively that describes a set of Build a tree inductively that describes a set of datadata
Classification tree: Hierarchical set of rules Classification tree: Hierarchical set of rules which classify data
which classify data
Regression tree: Hierarchical set of rules Regression tree: Hierarchical set of rules which predicts values
which predicts values
Neural Nets Neural Nets
Don’t resemble the brainDon’t resemble the brain
A model of memoryA model of memory
A model of learningA model of learning
NN Don’t Resemble the Brain NN Don’t Resemble the Brain
Brain neurons can not only add signals, but they Brain neurons can not only add signals, but they can subtract, multiply, divide, filter, average, etc.
can subtract, multiply, divide, filter, average, etc.
““The computational toolbox of individual neurons The computational toolbox of individual neurons dwarfs the elements available to today’s electronic dwarfs the elements available to today’s electronic
circuit designers” Christof Koch, Professor, circuit designers” Christof Koch, Professor,
Linear Regression Example Linear Regression Example
Inputs (I) Inputs (I)
0.50.5
3x3x11+.7x+.7x22-.2x-.2x33+.4x+.4x44+.5= y+.5= y
Output Output
xx11 xx22 xx33 xx44 xx00
0.30.3
0.70.7 -0.2-0.2
0.40.4 yy
Weights Weights
Perceptron Perceptron
Logistic Regression Example Logistic Regression Example
Inputs (I) Inputs (I)
0.50.5
3x3x11+.7x+.7x22-.2x-.2x33+.4x+.4x44+.5= y+.5= y
Output Output
xx11 xx22 xx33 xx44 xx00
0.30.3
0.70.7 -0.2-0.2
0.40.4 ffy)y)
Weights Weights
Perceptron Perceptron
Sigmoid Activation Function Sigmoid Activation Function
S shapedS shaped
Continuous approximation of thresholdContinuous approximation of threshold
Has derivativeHas derivative
e y
y
f
1 ) 1
(
0 0.2 0.4 0.6 0.8 1 1.2
-10 -7 -4 -1 2 5 8
Logistic Regression Example Logistic Regression Example
Input (I) Input (I)
Output (y) Output (y) 0.30.3
0.70.7
-0.2-0.2 0.40.4 0.50.5
xx1 1 = +1 = +1 xx2 2 = -1 = -1 xx3 3 = +1 = +1 xx4 4 = +1 = +1 xx0 0 = +1 = +1
57 1 .
1
3
.
y e
3x3x11+.7x+.7x22-.2x-.2x33+.4x+.4x44+.5= I+.5= I
Perceptron Perceptron
Layered Architecture Layered Architecture
Input layer
Output layer
State of the Industry Report Card State of the Industry Report Card
19991999 20012001 20032003
Products Products
User interface
User interface CC B -B - B B
Data preparation
Data preparation DD CC CC
Data exploration
Data exploration C-C- CC BB
Algorithms
Algorithms CC BB B+B+
Model deployment
Model deployment DD CC BB
Robustness
Robustness CC B-B- B+B+
Adoption Adoption
Organizational readiness
Organizational readiness CC CC BB Successful applications
Successful applications C+ C+ BB AA Training
Training DD CC BB
Available consulting
Available consulting DD CC BB
State of the Product Market State of the Product Market
Good products are availableGood products are available
Feature rich Feature rich
MatureMature
Reasonably stableReasonably stable
Well supportedWell supported
Tools and Technology Tools and Technology
Match tool to application and users.Match tool to application and users.
Allow time for training and learningAllow time for training and learning
The tool is not the solution.The tool is not the solution.
Model building is the fun and easy part.Model building is the fun and easy part.
Further Reading Further Reading
Herb Edelstein,
Herb Edelstein, Introduction to Data Mining and Knowledge DiscoveryIntroduction to Data Mining and Knowledge Discovery, 2003, 2003 Herb Edelstein,
Herb Edelstein, Data Mining Technology ReportData Mining Technology Report, 2003, 2003 M. Berry and G. Linoff,
M. Berry and G. Linoff, MasteringMastering Data Mining TechniquesData Mining Techniques, John Wiley, 1999, John Wiley, 1999 William S. Cleveland,
William S. Cleveland, The Elements of Graphing DataThe Elements of Graphing Data, revised, Hobart Press, 1994, revised, Hobart Press, 1994 Howard Wainer,
Howard Wainer, Visual Revelations, Copernicus, 1997Visual Revelations, Copernicus, 1997 T. Hastie, R. Tibshirani, J. H. Friedman,
T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning : Data The Elements of Statistical Learning : Data Mining, Inference, and Prediction,
Mining, Inference, and Prediction, Springer Verlag, 2001Springer Verlag, 2001 Richard O. Duda, P. E. Hart, D. G. Stork,
Richard O. Duda, P. E. Hart, D. G. Stork, Pattern ClassificationPattern Classification, John Wiley, 2000, John Wiley, 2000 David W Hosmer Jr., S. Lemeshow,
David W Hosmer Jr., S. Lemeshow, Applied Logistic Regression Applied Logistic Regression, John Wiley, 2000, John Wiley, 2000 David W Hosmer Jr., S. Lemeshow,
David W Hosmer Jr., S. Lemeshow, Applied Survival Analysis Applied Survival Analysis, John Wiley, 1999 , John Wiley, 1999 David J. Hand, H. Mannila, P. Smyth ,
David J. Hand, H. Mannila, P. Smyth , Principles of Data Mining Principles of Data Mining , MIT Press, 2001, MIT Press, 2001 Brieman, Freidman, Olshen, and Stone,
Brieman, Freidman, Olshen, and Stone, Classification and Regression TreesClassification and Regression Trees, , Wadsworth, 1984
Wadsworth, 1984 J. R. Quinlan, C4.5:
J. R. Quinlan, C4.5: Programs for Machine LearningPrograms for Machine Learning, Morgan Kaufmann, 1992 , Morgan Kaufmann, 1992