What is Data mining?


STAT 588 : DATA MINING

Javier Cabrera

Fall 2005

Business Question → Find Data → Data Processing → Extract Information → Data Analysis → Answer Business Question

Sources for Find Data: Internal Databases, Data Warehouses, Internet, Online databases, Data Collection.

Data collected in large databases:

• Relational databases, Internet, Data Warehouses: large datasets, with many variables and cases.

• Mostly noisy data: Missing Values, Zeros, Outliers.

• No random samples.

Data mining objective: To extract valuable information.

To identify nuggets: small clusters of observations in these data that contain potentially valuable information.

What counts as valuable is generally reflected by a large response value or by a specific category of a qualitative response.

The main challenge of data mining is sifting through a large volume of data that is noisy, badly behaved, and that may have many missing values or may simply be irrelevant.

Welcome to Data mining

Moore's law:

• Processing “capacity” doubles every couple of years (exponential growth).

• Hard disk storage “capacity” doubles every 18 months (it used to be every 36 months).

• Bottlenecks are no longer a matter of speed.

• Processing capacity is not growing as fast as data acquisition.

How large is large?

By number of cases:

• Small: N < 30 (No CLT)

• Moderate: 30 < N < 500 (CLT)

• Moderately large: 500 < N < 50000 (tolerable $N^2$ computations)

• Large: 50000+: No $N^2$ computations.

By the number of variables:

• Small: One variable.

• Moderate: Less than 100 Variables. Matrix inversion.

• Large: More than 100 Variables.

By database size:

• Large: Does not fit in memory.
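To make the "$N^2$ computations" thresholds concrete, the following sketch estimates the memory needed to hold all pairwise distances between N cases. It assumes 8 bytes per stored value (a double) and storage of only the N(N-1)/2 distinct pairs; the function name `dist_matrix_gb` is hypothetical and introduced only for illustration.

```r
# Memory (in GB, 1 GB = 1e9 bytes) needed to store the N(N-1)/2 pairwise
# distances between N cases, assuming 8 bytes per value.
dist_matrix_gb <- function(n, bytes_per_value = 8) {
  n * (n - 1) / 2 * bytes_per_value / 1e9
}

sizes <- c(30, 500, 50000, 1e6)
print(data.frame(N = sizes, GB = sapply(sizes, dist_matrix_gb)))
```

Under these assumptions N = 500 needs about 1 MB, while N = 50000 already needs about 10 GB, which is consistent with treating 50000 cases as the point where $N^2$ methods stop being practical.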

Data mining Software

• Fast computations.

• Economic use of memory.

• Flexible (and user friendly) Graphics Interface.

Software that will be used in class:

• Clementine from SPSS

• SAS, R

Other Software:


Methods and Techniques

• Data summarization, EDA, Basic Statistics.

• Advanced Data Visualization.

• Data Reduction: variable and case subsetting, sampling.

• Dimension Reduction: Principal Components, Covariance.

• Cluster analysis (Segmentation): k-means, hierarchical.

• Classification techniques (Pattern recognition):
  - LDA, QDA
  - Trees
  - Neural nets, Support Vector Machines
  - Nearest Neighbors

• Model-based methods: linear, nonlinear, logistic.

What is new?

• Improved Methodology and Software.

• Solve business problems:
  - Data is from regular businesses.
  - Objective: Better business decisions.

Case Study: SALES OF ORTHOPEDIC EQUIPMENT

The objective of this study is to find ways to increase sales of orthopedic material from our company to hospitals in the United States.

VARIABLES:

BEDS    : NUMBER OF HOSPITAL BEDS
RBEDS   : NUMBER OF REHAB BEDS
OUT-V   : NUMBER OF OUTPATIENT VISITS
ADM     : ADMINISTRATIVE COST (IN $1000's PER YEAR)
SIR     : REVENUE FROM INPATIENT
SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO
HIP95   : NUMBER OF HIP OPERATIONS FOR 1995
KNEE95  : NUMBER OF KNEE OPERATIONS FOR 1995
TH      : TEACHING HOSPITAL? 0, 1
TRAUMA  : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB   : DO THEY HAVE A REHAB UNIT? 0, 1
HIP96   : NUMBER OF HIP OPERATIONS FOR 1996
KNEE96  : NUMBER OF KNEE OPERATIONS FOR 1996
FEMUR96 : NUMBER OF FEMUR OPERATIONS FOR 1996

The Role of visualization

Data visualization methods are attractive tools to use for

analyzing such datasets for several reasons:

• Data visualization methods show many features

(expected and unexpected) of a dataset at once and, as

such, are well equipped to pick up subtle structures of

interest and anomalies as well as clear patterns.

• They allow (in fact, encourage) flexible interaction with

the data.

• They can be more readily understood by non-statisticians

(although their properties may not be).

• Good user-friendly graphics software is becoming more

readily available.

Data visualization methods

Large datasets create visualization challenges.

• Scatterplots: Large numbers of points may hide the underlying structure.
  - Apply data binning and use an image graph.
  - Avoid masking by duplicating plots and highlighting subgroups.
  - Sometimes it is enough to graph a subset selected at random.

Many variables at once. There are many ingenious tools for this.

• Scatterplot matrix
  - all variables
  - all descriptor variables with color coding according to one response
  - all response variables with color coding according to one descriptor

• Plot selected 2D views to highlight some feature of the data:
  - principal components analysis (spread)
  - projection pursuit (clustering)

• Look at “all” 2D views of the data via a dynamic display [rotating 3D display, grand tour]

• Conditional plots

• Multiple windows with brush and link
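As one possible illustration of "plot selected 2D views", the sketch below projects a dataset onto its first two principal components, the view with the largest spread. It uses R's built-in `iris` data as a stand-in, which is an assumption made only so the example runs on its own; the variables are standardized first, which is a common choice rather than something the slides prescribe.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset.
X <- iris[, 1:4]

# Principal components of the standardized variables.
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance captured by each component.
var_share <- pc$sdev^2 / sum(pc$sdev^2)
print(round(var_share, 3))

# 2D view with the largest spread: PC1 against PC2, colored by species.
plot(pc$x[, 1], pc$x[, 2], col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2", main = "First two principal components")
```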

Data Binning

[Figure: two panels of y against x, titled Scatter Plot and Binning Plot.]

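R example of Binning Plot

# Binned scatterplot: count points in an nr x nc grid and draw the counts with image().
# scale = "l" shows log(1 + count) instead of raw counts.
f.plot <- function(x, y, nr = 20, nc = 20, scale = "raw") {
  # Bin index of each point. The extra leading entries put at least one entry
  # in every row and column, so that table() returns a full nr x nc grid.
  zx <- c(1:nr, rep(1, nc), 1 + trunc(nr * (x - min(x)) / (max(x) - min(x))))
  zx[zx > nr] <- nr
  zy <- c(rep(1, nr), 1:nc, 1 + trunc(nc * (y - min(y)) / (max(y) - min(y))))
  zy[zy > nc] <- nc
  z <- table(zx, zy)
  # Remove the padding entries again.
  z[, 1] <- z[, 1] - 1
  z[1, ] <- z[1, ] - 1
  if (scale == "l") z <- log(1 + z)
  # image() maps the rows of z to x and the columns to y, so z is passed
  # without transposing; this also works when nr and nc differ.
  image(z = z,
        x = seq(length = nr + 1, from = min(x), to = max(x)),
        y = seq(length = nc + 1, from = min(y), to = max(y)),
        xlab = "", ylab = "", col = topo.colors(100))
}

# Run this code line by line
x = rnorm(10000) ; y = rnorm(10000)
plot(x,y)
f.plot(x,y,10,10)
f.plot(x,y,50,50)
f.plot(x,y,100,100)
f.plot(x,y,100,100,'l')
f.plot(x,y,500,500,'l')
ux = rnorm(5000)/3
uy = ux^2 -0.5
f.plot(c(x,ux),c(y,uy)+20,50,50,'l')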

Using Color

[Figure: three scatterplots of log(1 + SALES12) against log(1 + BEDS), with points colored by group (REHAB, NO REHAB).]

Masking effect

[Panels: Drawing green dots first; Drawing purple dots first.]
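The masking effect can be reproduced with a small sketch: the group drawn last covers the group drawn first, so duplicating the plot with each group on top shows both. The data below are simulated and only stand in for the REHAB / NO REHAB groups; the helper `draw_last` and the color assignment are hypothetical choices made for illustration.

```r
# Simulated stand-in data: two overlapping groups (not the hospital data).
set.seed(1)
n <- 2000
d <- data.frame(
  x = c(rnorm(n), rnorm(n, mean = 0.5)),
  y = c(rnorm(n), rnorm(n, mean = 0.5)),
  group = rep(c("REHAB", "NO REHAB"), each = n),
  stringsAsFactors = FALSE
)

# Reorder rows so that the group named in `top` is drawn last (on top).
draw_last <- function(d, top) d[order(d$group == top), ]

cols <- c("REHAB" = "green3", "NO REHAB" = "purple")

# Duplicate the plot, once with each group on top.
op <- par(mfrow = c(1, 2))
for (top in names(cols)) {
  dd <- draw_last(d, top)
  plot(dd$x, dd$y, col = cols[as.character(dd$group)], pch = 16, cex = 0.5,
       xlab = "x", ylab = "y", main = paste(top, "on top"))
}
par(op)
```

Comparing the two panels side by side shows how much of each group is hidden when it is drawn first.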

Pairwise Scatter Plot
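A pairwise scatter plot (scatterplot matrix) with color coding by one categorical variable can be sketched with base R's `pairs()`. The built-in `iris` data and the particular colors are assumptions used only to keep the example self-contained; with the hospital data the color would instead come from a variable such as REHAB.

```r
# Color each observation according to one categorical variable.
palette_by_group <- c("red", "green3", "blue")
point_cols <- palette_by_group[as.integer(iris$Species)]

# Scatterplot matrix of all numeric variables, colored by group.
pairs(iris[, 1:4], col = point_cols, pch = 16, cex = 0.6,
      main = "Pairwise scatter plot, colored by Species")
```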

Conditional Plot example

[Figure: conditioning plot of log(1 + SALES12) against sqrt(KNEE96), given BEDS and given ADM.]
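A conditioning plot of this kind can be produced with base R's `coplot()`, which draws one scatterplot per combination of intervals of the two conditioning variables. The sketch below uses the built-in `quakes` data as a stand-in, since the hospital data are not available here; the choice of three intervals per variable is arbitrary.

```r
# Overlapping intervals used to slice one conditioning variable.
depth_intervals <- co.intervals(quakes$depth, number = 3, overlap = 0.5)
print(depth_intervals)

# One panel per (depth interval, magnitude interval) combination.
# With the hospital data the analogous call would have the form
#   coplot(log(1 + SALES12) ~ sqrt(KNEE96) | BEDS * ADM, data = hospitals)
# where `hospitals` is a hypothetical data frame name.
coplot(lat ~ long | depth * mag, data = quakes, number = c(3, 3))
```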

• Variable and case selection

• Cluster analysis (unsupervised pattern recognition)
  - partitioning methods (e.g., k-means, k-medoids)
  - hierarchical methods (e.g., agglomerative nesting)
  - two-way clustering

• Classification (supervised pattern recognition, discriminant analysis)
  - trees (e.g., CART, C5, Firm, Tree, ARF)
  - model-based methods (e.g., logistic regression)
  - artificial neural networks

Role of robust methods / diagnostics

Cluster Example

[Figure: cluster example with seven observations labeled 1 to 7.]
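The two families of clustering methods mentioned in these slides, partitioning and hierarchical, can be compared on a toy dataset. The sketch below simulates two well-separated groups (an assumption made so that the expected grouping is obvious) and applies `kmeans()` and `hclust()` from base R; complete linkage and k = 2 are illustrative choices.

```r
# Simulated data: 20 points near (0, 0) and 20 points near (10, 10).
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))

# Partitioning method: k-means with several random starts.
km <- kmeans(x, centers = 2, nstart = 10)

# Hierarchical method: agglomerative clustering on Euclidean distances,
# cut into two groups.
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two groupings; labels may differ, memberships should agree.
print(table(kmeans = km$cluster, hierarchical = hc_groups))

# Dendrogram of the hierarchical solution.
plot(hc, main = "Agglomerative clustering (complete linkage)")
```

Note that `dist()` stores all pairwise distances, so the hierarchical approach is only practical for moderate numbers of cases.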

Tree methods

I. Dependent variable is categorical

• Classification trees (e.g., CART, C5, Firm, Tree, ARF)

• Decision Trees

• Decision Rules

Example: Personal loan decision

[Figure: decision tree with yes/no splits on Credit Card?, Car?, and Age<30, ending in Approve or Reject.]

[Figure: two panels, Function f(X,Y) and Tree form of f(X,Y), with splits Y<4, X<3, and Y<2.]
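The idea that a tree is just a piecewise-constant function written as nested yes/no questions can be sketched in a few lines. The arrangement of splits and leaf values below is hypothetical: it reuses thresholds of the kind shown on the slide (X<3, Y<4, Y<2) but does not claim to reproduce the figure.

```r
# A piecewise-constant function of (x, y) written in tree form.
# Hypothetical structure: the root splits on x, each branch then splits on y.
f_tree <- function(x, y) {
  ifelse(x < 3,
         ifelse(y < 4, 0, 2),   # left branch:  x < 3
         ifelse(y < 2, 3, 5))   # right branch: x >= 3
}

# Evaluate on a small grid to see the rectangular regions of constant value.
grid <- expand.grid(x = c(1, 4), y = c(1, 3, 5))
grid$f <- f_tree(grid$x, grid$y)
print(grid)
```

Each leaf corresponds to one rectangle in the (X, Y) plane, which is why tree predictions look like a step surface.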

II. Dependent variable is numerical

• Regression Tree

[Figure: Regression Tree for log(1+Sales12). Splits shown: HIP95<2.52265, HIP96<2.01527, RBEDS<2.77141, HIP95<0.5, KNEE96<1.36514, ADM<4.87542, FEMUR96<2.28992, KNEE95<2.96704, BEDS<3.8403, OUTV<15.2396, SIR<9.85983. Leaf values: 1.0900, 0.3752, 0.9898, 0.8984, 2.3880, 1.2010, 1.7840, 2.1280, 3.1080, 2.4380, 3.2130, 3.9790.]
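CART-style regression trees are usually grown by repeatedly choosing the split that most reduces the residual sum of squares. The sketch below shows that search for a single predictor and a single split; it is a simplified illustration under that assumption, not the procedure used to build the tree for log(1+Sales12), and the function name `best_split` is hypothetical.

```r
# Find the threshold on x that minimizes the total within-group sum of
# squares of y when the data are split into x < thr and x >= thr.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  if (length(xs) < 2) stop("need at least two distinct x values")
  # Candidate thresholds: midpoints between consecutive distinct x values.
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(thresholds, function(thr) {
    left <- y[x < thr]
    right <- y[x >= thr]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- which.min(sse)
  thr <- thresholds[best]
  list(threshold = thr, sse = sse[best],
       left_mean = mean(y[x < thr]), right_mean = mean(y[x >= thr]))
}

# Toy data with a jump between x = 5 and x = 6.
s <- best_split(1:10, c(rep(0, 5), rep(10, 5)))
print(s)
```

A full tree applies the same search to every predictor, takes the best one, and then recurses within each of the two resulting groups; the leaf value is the mean response in that group.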

Linear Models

Linear model: $Y = X\beta + \varepsilon$

Least Squares Estimator: $b = (X^T X)^{-1} X^T Y$

Linear Discriminant: Y = 0 or 1.

- Estimate b by L.S.

- Predict 1 if Xb > 0.5, 0 otherwise
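The least squares discriminant described here can be sketched directly from the formula: code the response as 0 or 1, solve the normal equations for b, and predict 1 whenever Xb exceeds 0.5. The data are simulated with one predictor (an assumption for illustration), and the function name `ls_discriminant` is hypothetical.

```r
# Least squares coefficients b = (X'X)^{-1} X'Y via the normal equations.
ls_discriminant <- function(X, y) {
  solve(crossprod(X), crossprod(X, y))
}

# Simulated data: class 0 centered at 0, class 1 centered at 3.
set.seed(1)
n <- 200
x1 <- c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 3))
y <- rep(c(0, 1), each = n / 2)

# Design matrix with an intercept column.
X <- cbind(intercept = 1, x1 = x1)
b <- ls_discriminant(X, y)

# Classification rule: predict 1 if Xb > 0.5, 0 otherwise.
pred <- as.numeric(X %*% b > 0.5)
accuracy <- mean(pred == y)
print(b)
print(accuracy)
```

In practice `lm(y ~ x1)` gives the same coefficients and is numerically more stable than inverting $X^T X$ explicitly.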

Linear Discriminants

Example: Pima Indians

[Figure: scatterplot of PLASMA against PEDIGREE. Counts shown on the plot: Diabetes: 63, None: 28; Diabetes: 69, None: 185.]

Example: Pima KNN with k = 1

Example: Pima KNN with k = 50

Example: Pima KNN with k = 10

[Figures: PLASMA against PEDIGREE, one panel for each value of k.]
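The effect of k in nearest-neighbor classification can be explored with a minimal base R implementation: each new point takes the majority class among its k closest training points. The sketch below uses Euclidean distance and breaks ties in favor of the first class in table order; both are assumptions, as is the tiny made-up dataset, and the function name `knn_predict` is hypothetical.

```r
# k-nearest-neighbor classification with Euclidean distance.
knn_predict <- function(train_x, train_y, test_x, k = 1) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  apply(test_x, 1, function(p) {
    # Distance from the test point p to every training point.
    d <- sqrt(rowSums(sweep(train_x, 2, p)^2))
    nearest <- order(d)[seq_len(k)]
    # Majority vote among the k nearest neighbors.
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Made-up training data: two points of class "a", two of class "b".
train_x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
train_y <- c("a", "a", "b", "b")
test_x <- rbind(c(1, 1), c(9, 9))

pred_k1 <- knn_predict(train_x, train_y, test_x, k = 1)
pred_k3 <- knn_predict(train_x, train_y, test_x, k = 3)
print(pred_k1)
print(pred_k3)
```

Small k follows the training data closely and gives irregular decision regions, while large k averages over more neighbors and gives smoother ones. When variables are on very different scales, standardizing them before computing distances is usually advisable.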