• No results found

Assessing Data Mining: The State of the Practice

N/A
N/A
Protected

Academic year: 2022

Share "Assessing Data Mining: The State of the Practice"

Copied!
30
0
0

Loading.... (view fulltext now)

Full text

(1)

Assessing Data Mining:

Assessing Data Mining:

The State of the The State of the

Practice Practice

©2003

©2003 Herbert A. Edelstein Herbert A. Edelstein Two Crows Corporation Two Crows Corporation 10500 Falls Road 10500 Falls Road Potomac, Maryland 20854 Potomac, Maryland 20854 www.twocrows.com www.twocrows.com

(2)

Objectives Objectives

Separate myth from realitySeparate myth from reality

Interactive session: question driven! The Interactive session: question driven! The slides are largely to ensure common

slides are largely to ensure common background.

background.

(3)

The Key to Value The Key to Value

The utility of data increases as it spans The utility of data increases as it spans the business value chain and is integrated the business value chain and is integrated

Information increases as data are relatedInformation increases as data are related

Consolidate similar databasesConsolidate similar databases

Consolidate different types of Consolidate different types of databases

databases

Without data and good analysis all you Without data and good analysis all you have are opinions.

have are opinions.

(4)

Data Mining Definitions Data Mining Definitions

What IT departments callWhat IT departments call

OLAPOLAP

Query Query

StatisticsStatistics

(5)

Data Mining Definitions Data Mining Definitions

Knowledge Discovery in Databases Knowledge Discovery in Databases is the is the non-trivial process of identifying valid, non-trivial process of identifying valid, novel, potentially useful and ultimately novel, potentially useful and ultimately

understandable patterns in data.

understandable patterns in data.

(Fayyad, Piatetsky-Shapiro, & Smyth) (Fayyad, Piatetsky-Shapiro, & Smyth)

KDD is the process, data mining is the KDD is the process, data mining is the application of algorithms

application of algorithms

Includes description and predictionIncludes description and prediction

Large databases often explicitly added Large databases often explicitly added

(6)

Data Mining Definitions Data Mining Definitions

Data miningData mining is a process that uses a is a process that uses a

variety of data analysis tools to discover variety of data analysis tools to discover

patterns and relationships in data that patterns and relationships in data that

may be used to make valid predictions may be used to make valid predictions

Exploration and description is required Exploration and description is required but not the goal

but not the goal

(7)

New Statistical Software New Statistical Software

Takes advantage of advances in Takes advantage of advances in hardware and software

hardware and software

Provides new interfaces for a wider class Provides new interfaces for a wider class of users

of users

Comes from statistics, machine learning Comes from statistics, machine learning and information systems

and information systems

(8)

Why Data Mining is Taking Off Why Data Mining is Taking Off

Demand for informationDemand for information

Availability of data Availability of data

Enormous quantity of happenstance dataEnormous quantity of happenstance data

Spread of data warehousesSpread of data warehouses

Data is easily accessible through the Web.Data is easily accessible through the Web.

Improved technologyImproved technology

Inexpensive, scalable processingInexpensive, scalable processing

Inexpensive storageInexpensive storage

High bandwidthHigh bandwidth

(9)

Why Data Mining Is Needed Why Data Mining Is Needed

Massive amounts of dataMassive amounts of data

Example:Example:

75 million customers75 million customers

3,000 columns for each customer3,000 columns for each customer

Low signal to noiseLow signal to noise

Subtle relationshipsSubtle relationships

VariationVariation

Allow domain experts to build predictive Allow domain experts to build predictive models

models

(10)

Data Mining Products Data Mining Products

Handle large volumes of dataHandle large volumes of data

Reduce dependence on the modelerReduce dependence on the modeler

Model specification Model specification

Knowing characteristics of variablesKnowing characteristics of variables

Create hypothesesCreate hypotheses

Emphasize predictionEmphasize prediction

Simplify model deploymentSimplify model deployment

Data mining is a productivity tool even Data mining is a productivity tool even for the skilled statistician

for the skilled statistician

(11)

Attribute Interactions Attribute Interactions

High 0.000

0.025 0.050 0.075

High Fee, Low Interest Low Fee, Low Interest

Low

Interest Rate Interest Rate Response

Response RateRate

Low Fee, High Interest High Fee, High Interest

(12)

Data Mining Myths Data Mining Myths

Data mining does NOTData mining does NOT

Find answers to unasked questionsFind answers to unasked questions

Explain behavior.Explain behavior.

Continuously monitor data for new patternsContinuously monitor data for new patterns

Eliminate the need to understand your businessEliminate the need to understand your business

Eliminate the need to collect good dataEliminate the need to collect good data

(13)

Commercial Applications Commercial Applications

IndustryIndustry

Retail Retail

FinancialFinancial

ManufacturingManufacturing

InsuranceInsurance

PublishingPublishing

Health careHealth care

ApplicationApplication

Marketing Marketing

Sales force managementSales force management

Fraud detectionFraud detection

(14)

Credit Risk Analysis Credit Risk Analysis

Database checksDatabase checks

Data validation: Does an address exist, is the Data validation: Does an address exist, is the social security number consistent with date social security number consistent with date

and place of birth, etc.

and place of birth, etc.

Where is Benford’s Law applicable?Where is Benford’s Law applicable?

History checks: when was the last time a History checks: when was the last time a property sold?

property sold?

Profiling good and bad credit risks – Finding Profiling good and bad credit risks – Finding good risks within bad categories.

good risks within bad categories.

Outlier detection – uni-dimensional and multi-Outlier detection – uni-dimensional and multi- dimensional

dimensional

Does not replace human follow up Does not replace human follow up

(15)

Benford’s Law Benford’s Law

If the numbers under investigation are not entirely If the numbers under investigation are not entirely random but somehow

random but somehow socially or naturally relatedsocially or naturally related, the , the distribution of the first digit is not uniform. More

distribution of the first digit is not uniform. More accurately, digit D appears as the first digit with the accurately, digit D appears as the first digit with the frequency proportional to log10(1 + 1/D). In other frequency proportional to log10(1 + 1/D). In other words, one may expect 1 to be the first digit of a words, one may expect 1 to be the first digit of a

random number in about 30% of cases, 2 will come up random number in about 30% of cases, 2 will come up in about 18% of cases, 3 in 12%, 4 in 9%, 5 in 8%,

in about 18% of cases, 3 in 12%, 4 in 9%, 5 in 8%, etc.

etc.

http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml http://www.cut-the-knot.org/do_you_know/zipfLaw.shtml

(16)

Data Mining

Data Mining Process Process

Define business problemDefine business problem

Prepare the data Prepare the data

Build data mining databaseBuild data mining database

Explore dataExplore data

Prepare data for modelingPrepare data for modeling

Create modelsCreate models

Build modelsBuild models

Evaluate modelsEvaluate models

Act on the results (implementation)Act on the results (implementation)

Apply models to new dataApply models to new data

(17)

Data Preparation Data Preparation

Build data mining databaseBuild data mining database

Explore dataExplore data

Prepare data for modelingPrepare data for modeling

60% to 95% of the time is spent 60% to 95% of the time is spent

preparing the data preparing the data

(18)

The Tools The Tools

Starting simply - linear regression Starting simply - linear regression

Fancier regressions Fancier regressions

Projections Projections

Smoothing based regressionsSmoothing based regressions

Survival analysis Survival analysis

Nearest neighbor methodsNearest neighbor methods

Collaborative filteringCollaborative filtering

Trees Trees

MARS MARS

Neural networks Neural networks

(19)

Decision Trees Decision Trees

Build a tree inductively that describes a set of Build a tree inductively that describes a set of datadata

Classification tree: Hierarchical set of rules Classification tree: Hierarchical set of rules which classify data

which classify data

Regression tree: Hierarchical set of rules Regression tree: Hierarchical set of rules which predicts values

which predicts values

(20)

Neural Nets Neural Nets

Don’t resemble the brainDon’t resemble the brain

A model of memoryA model of memory

A model of learningA model of learning

(21)

NN Don’t Resemble the Brain NN Don’t Resemble the Brain

Brain neurons can not only add signals, but they Brain neurons can not only add signals, but they can subtract, multiply, divide, filter, average, etc.

can subtract, multiply, divide, filter, average, etc.

The computational toolbox of individual neurons The computational toolbox of individual neurons dwarfs the elements available to today’s electronic dwarfs the elements available to today’s electronic

circuit designers” Christof Koch, Professor, circuit designers” Christof Koch, Professor,

(22)

Linear Regression Example Linear Regression Example

Inputs (I) Inputs (I)

0.50.5

3x3x11+.7x+.7x22-.2x-.2x33+.4x+.4x44+.5= y+.5= y

Output Output

xx11 xx22 xx33 xx44 xx00

0.30.3

0.70.7 -0.2-0.2

0.40.4 yy

Weights Weights

Perceptron Perceptron

(23)

Logistic Regression Example Logistic Regression Example

Inputs (I) Inputs (I)

0.50.5

3x3x11+.7x+.7x22-.2x-.2x33+.4x+.4x44+.5= y+.5= y

Output Output

xx11 xx22 xx33 xx44 xx00

0.30.3

0.70.7 -0.2-0.2

0.40.4 ffy)y)

Weights Weights

Perceptron Perceptron

(24)

Sigmoid Activation Function Sigmoid Activation Function

S shapedS shaped

Continuous approximation of thresholdContinuous approximation of threshold

Has derivativeHas derivative

e y

y

f

1 ) 1

(

0 0.2 0.4 0.6 0.8 1 1.2

-10 -7 -4 -1 2 5 8

(25)

Logistic Regression Example Logistic Regression Example

Input (I) Input (I)

Output (y) Output (y) 0.30.3

0.70.7

-0.2-0.2 0.40.4 0.50.5

xx1 1 = +1 = +1 xx2 2 = -1 = -1 xx3 3 = +1 = +1 xx4 4 = +1 = +1 xx0 0 = +1 = +1

57 1 .

1

3

.

y e

3x3x11+.7x+.7x22-.2x-.2x33+.4x+.4x44+.5= I+.5= I

Perceptron Perceptron

(26)

Layered Architecture Layered Architecture

Input layer

Output layer

(27)

State of the Industry Report Card State of the Industry Report Card

19991999 20012001 20032003

Products Products

User interface

User interface CC B -B - B B

Data preparation

Data preparation DD CC CC

Data exploration

Data exploration C-C- CC BB

Algorithms

Algorithms CC BB B+B+

Model deployment

Model deployment DD CC BB

Robustness

Robustness CC B-B- B+B+

Adoption Adoption

Organizational readiness

Organizational readiness CC CC BB Successful applications

Successful applications C+ C+ BB AA Training

Training DD CC BB

Available consulting

Available consulting DD CC BB

(28)

State of the Product Market State of the Product Market

Good products are availableGood products are available

Feature rich Feature rich

MatureMature

Reasonably stableReasonably stable

Well supportedWell supported

(29)

Tools and Technology Tools and Technology

Match tool to application and users.Match tool to application and users.

Allow time for training and learningAllow time for training and learning

The tool is not the solution.The tool is not the solution.

Model building is the fun and easy part.Model building is the fun and easy part.

(30)

Further Reading Further Reading

Herb Edelstein,

Herb Edelstein, Introduction to Data Mining and Knowledge DiscoveryIntroduction to Data Mining and Knowledge Discovery, 2003, 2003 Herb Edelstein,

Herb Edelstein, Data Mining Technology ReportData Mining Technology Report, 2003, 2003 M. Berry and G. Linoff,

M. Berry and G. Linoff, MasteringMastering Data Mining TechniquesData Mining Techniques, John Wiley, 1999, John Wiley, 1999 William S. Cleveland,

William S. Cleveland, The Elements of Graphing DataThe Elements of Graphing Data, revised, Hobart Press, 1994, revised, Hobart Press, 1994 Howard Wainer,

Howard Wainer, Visual Revelations, Copernicus, 1997Visual Revelations, Copernicus, 1997 T. Hastie, R. Tibshirani, J. H. Friedman,

T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning : Data The Elements of Statistical Learning : Data Mining, Inference, and Prediction,

Mining, Inference, and Prediction, Springer Verlag, 2001Springer Verlag, 2001 Richard O. Duda, P. E. Hart, D. G. Stork,

Richard O. Duda, P. E. Hart, D. G. Stork, Pattern ClassificationPattern Classification, John Wiley, 2000, John Wiley, 2000 David W Hosmer Jr., S. Lemeshow,

David W Hosmer Jr., S. Lemeshow, Applied Logistic Regression Applied Logistic Regression, John Wiley, 2000, John Wiley, 2000 David W Hosmer Jr., S. Lemeshow,

David W Hosmer Jr., S. Lemeshow, Applied Survival Analysis Applied Survival Analysis, John Wiley, 1999 , John Wiley, 1999 David J. Hand, H. Mannila, P. Smyth ,

David J. Hand, H. Mannila, P. Smyth , Principles of Data Mining Principles of Data Mining , MIT Press, 2001, MIT Press, 2001 Brieman, Freidman, Olshen, and Stone,

Brieman, Freidman, Olshen, and Stone, Classification and Regression TreesClassification and Regression Trees, , Wadsworth, 1984

Wadsworth, 1984 J. R. Quinlan, C4.5:

J. R. Quinlan, C4.5: Programs for Machine LearningPrograms for Machine Learning, Morgan Kaufmann, 1992 , Morgan Kaufmann, 1992

References

Related documents