STAT 588 : DATA MINING
Javier Cabrera
Fall 2005
What is Data mining?
[Diagram: Business Question -> Find Data -> Data Processing -> Extract Information -> Data Analysis -> Answer Business Question]
Data sources: Internal Databases, Data Warehouses, Internet, Online databases, Data Collection.
Data collected in large databases:
• Relational databases, Internet, Data Warehouses:
Large Datasets, Many variables and cases.
• Mostly noisy data: Missing Values, Zeros, Outliers.
• No random samples.
Data mining objective: To extract valuable information.
To identify nuggets, small clusters of observations in these
data that contain potentially valuable information.
What counts as valuable is generally reflected by a large value of a
numerical response or by a specific category of a qualitative response.
Sifting through a large volume of data that is noisy, badly
behaved, and that may have many missing values, or that may
just be irrelevant is the main challenge of data mining.
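As a rough illustration of the "noisy data" point above, the following R sketch counts missing values, zeros, and outliers for each variable. The data frame `d`, its column names and values, and the 1.5 x IQR outlier rule are assumptions made for this example and are not part of the course material.

```r
# Hypothetical toy data: names and values are made up for illustration.
set.seed(1)
d <- data.frame(beds  = c(rpois(97, 100), NA, 0, 5000),
                sales = c(rexp(96, 1 / 50), NA, NA, 0, 0))

# Count NAs, exact zeros, and points outside the 1.5 * IQR fences.
noise.summary <- function(v) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  c(missing  = sum(is.na(v)),
    zeros    = sum(v == 0, na.rm = TRUE),
    outliers = sum(v < q[1] - fence | v > q[2] + fence, na.rm = TRUE))
}

res <- sapply(d, noise.summary)   # one column of counts per variable
print(res)
```

A table like this is only a first screen; whether a zero or an extreme value is an error or a "nugget" still has to be judged in context.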
Welcome to Data mining
Moore's law:
• Processing “capacity” doubles every couple of years (exponential growth).
• Hard disk storage “capacity” doubles every 18 months (it used to be every 36 months).
• Bottlenecks are no longer about speed.
• Processing capacity is not growing as fast as data acquisition.
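The doubling times quoted for processing and storage can be turned into growth factors with a line of arithmetic. The sketch below assumes that "every couple of years" means 24 months and uses an arbitrary 6-year horizon; both are assumptions for illustration only.

```r
# Growth factor after `years` when capacity doubles every `months.to.double` months.
growth <- function(years, months.to.double) 2^(12 * years / months.to.double)

years <- 6                          # arbitrary horizon chosen for the example
cpu      <- growth(years, 24)       # processing: "every couple of years" (assumed 24 months)
disk     <- growth(years, 18)       # storage: doubling every 18 months
old.disk <- growth(years, 36)       # storage at the earlier 36-month rate

cat("After", years, "years: processing x", cpu,
    ", storage x", disk,
    ", storage at the old rate x", old.disk, "\n")
```

Under these assumptions storage grows twice as much as processing over the same period, which is one way to read the remark that processing is not keeping up with data acquisition.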
How large is large?
By number of cases:
• Small: N < 30 (No CLT)
• Moderate: 30 < N < 500 (CLT)
• Moderately large: 500 < N < 50000 (N^2 computations are tolerable)
• Large: N > 50000 (no N^2 computations)
By the number of variables:
• Small: One variable.
• Moderate: Less than 100 Variables. Matrix inversion.
• Large: More than 100 Variables.
By database size:
• Large: Does not fit in memory.
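To see why N^2 computations stop being tolerable around N = 50000, it helps to compute the memory needed just to store all pairwise distances. This is a back-of-the-envelope sketch that assumes 8-byte doubles and the lower-triangle storage used by R's `dist()`; the chosen values of N are illustrative.

```r
# Gigabytes needed to hold the N*(N-1)/2 pairwise distances among N cases,
# assuming 8 bytes per distance.
dist.gb <- function(N) 8 * N * (N - 1) / 2 / 1e9

N <- c(500, 5000, 50000, 500000)
tab <- data.frame(N = N, pairs = N * (N - 1) / 2, gigabytes = dist.gb(N))
print(tab)
```

At N = 50000 the distances alone take roughly 10 GB, and every tenfold increase in N multiplies that by about one hundred.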
Data mining Software
• Fast computations.
• Economic use of memory.
• Flexible (and user-friendly) graphical interface.
Software that will be used in class:
• Clementine from SPSS
• SAS, R
Other Software:
Methods and Techniques
• Data summarization, EDA, Basic Statistics.
• Advanced Data Visualization.
• Data Reduction: variable and case subsetting, sampling.
• Dimension Reduction: Principal Components, Covariance.
• Cluster analysis (Segmentation): k-means, hierarchical.
• Classification techniques (Pattern recognition):
- LDA, QDA
- Trees
- Neural nets, Support Vector Machines
- Nearest Neighbors
• Model-based methods: linear, non-linear, logistic.
What is new?
• Improved Methodology and Software.
• Solve business problems:
Data is from regular businesses.
Objective: Better business decisions.
Case Study: SALES OF ORTHOPEDIC EQUIPMENT
The objective of this study is to find ways to increase sales of orthopedic material from our company to hospitals in the United States.
VARIABLES:
BEDS : NUMBER OF HOSPITAL BEDS
RBEDS : NUMBER OF REHAB BEDS
OUT-V : NUMBER OF OUTPATIENT VISITS
ADM : ADMINISTRATIVE COST (In $1000's per year)
SIR : REVENUE FROM INPATIENT
SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO
HIP95 : NUMBER OF HIP OPERATIONS FOR 1995
KNEE95 : NUMBER OF KNEE OPERATIONS FOR 1995
TH : TEACHING HOSPITAL? 0, 1
TRAUMA : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB : DO THEY HAVE A REHAB UNIT? 0, 1
HIP96 : NUMBER HIP OPERATIONS FOR 1996
KNEE96 : NUMBER KNEE OPERATIONS FOR 1996
FEMUR96 : NUMBER FEMUR OPERATIONS FOR 1996
The Role of visualization
Data visualization methods are attractive tools to use for
analyzing such datasets for several reasons:
• Data visualization methods show many features
(expected and unexpected) of a dataset at once and, as
such, are well equipped to pick up subtle structures of
interest and anomalies as well as clear patterns.
• They allow (in fact, encourage) flexible interaction with
the data.
• They can be more readily understood by non-statisticians
(although their properties may not be).
• Good user-friendly graphics software is becoming more
readily available.
Data visualization methods
Large datasets create visualization challenges.
• Scatterplots: Large numbers of points may hide the underlying structure.
- Apply Data Binning and use an image graph.
- Avoid Masking by duplicating plots and highlighting subgroups.
- Sometimes it is enough to graph a subset selected at random.
Many variables at once. There are many ingenious tools for this.
• Scatterplot matrix
- all variables
- all descriptor variables with color coding according to one response
- all response variables with color coding according to one descriptor
• Plot selected 2D views to highlight some feature of the data:
- principal components analysis (spread)
- projection pursuit (clustering)
• Look at “all” 2D views of the data via a dynamic display [rotating 3D display, grand tour]
• Conditional plots
• Multiple windows with brush and link
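Several of the tools listed above are available in base R. The sketch below graphs a random subset, draws a scatterplot matrix color-coded by one response, and plots the first two principal components as a selected 2D view. The data are synthetic and the names `x1`, `x2`, `x3`, and `resp` are placeholders, not variables from the course data.

```r
set.seed(2)
n <- 20000
# Synthetic descriptors and a binary response used only for color coding.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$resp <- as.integer(d$x1 + d$x2 + rnorm(n) > 0)

# Graph a subset selected at random instead of all n points.
idx <- sample(n, 1000)
sub <- d[idx, ]
cols <- c("purple", "green")[sub$resp + 1]

# Scatterplot matrix of the descriptors, color-coded by the response.
pairs(sub[, c("x1", "x2", "x3")], col = cols, pch = 20)

# A selected 2D view: the first two principal components (direction of spread).
pc <- prcomp(d[, c("x1", "x2", "x3")], scale. = TRUE)
plot(pc$x[idx, 1:2], col = cols, pch = 20)
```

The principal components are computed from all cases and only the plotting is restricted to the subset, so the view itself is not affected by the sampling.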
Data Binning
[Figure: Scatter Plot and Binning Plot, y versus x]
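R example of Binning Plot
# Binning plot: count the points in each cell of an nr-by-nc grid and
# display the counts as an image. scale = "l" shows log(1 + count).
f.plot <- function(x, y, nr = 20, nc = 20, scale = "raw") {
  # Bin index of each point. The values 1:nr and 1:nc are prepended as
  # padding so that every row and column of the grid appears in the table.
  zx = c(1:nr, rep(1, nc), 1 + trunc(nr * (x - min(x)) / (max(x) - min(x))))
  zx[zx > nr] = nr
  zy = c(rep(1, nr), 1:nc, 1 + trunc(nc * (y - min(y)) / (max(y) - min(y))))
  zy[zy > nc] = nc
  z = table(zx, zy)
  # Remove the padding counts again.
  z[, 1] = z[, 1] - 1
  z[1, ] = z[1, ] - 1
  if (scale == "l") z = log(1 + z)
  # image() maps the rows of z to the x axis and the columns to the y axis,
  # so z (rows = x bins) is passed without transposing.
  image(z = z,
        x = seq(length.out = nr + 1, from = min(x), to = max(x)),
        y = seq(length.out = nc + 1, from = min(y), to = max(y)),
        xlab = "", ylab = "", col = topo.colors(100))
}

# Run this code line by line
x = rnorm(10000); y = rnorm(10000)
plot(x, y)
f.plot(x, y, 10, 10)
f.plot(x, y, 50, 50)
f.plot(x, y, 100, 100)
f.plot(x, y, 100, 100, 'l')
f.plot(x, y, 500, 500, 'l')
ux = rnorm(5000) / 3
uy = ux^2 - 0.5
f.plot(c(x, ux), c(y, uy) + 20, 50, 50, 'l')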
Using Color
[Figure: log(1 + SALES12) versus log(1 + BEDS), color-coded by REHAB / NO REHAB]
Masking effect
[Figure: the same plot shown twice: "Drawing green dots first" and "Drawing purple dots first"]
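The masking effect is easy to reproduce: `plot()` draws points in the order they are given, so whichever group is drawn last sits on top. The sketch below uses synthetic data in place of the hospital data. The third panel, which draws the points in random order, is an additional option suggested here and is not taken from the slides.

```r
set.seed(3)
n <- 5000
# Two synthetic, overlapping groups; colors follow the slide's green/purple.
grp <- rep(c("green", "purple"), each = n)
x <- c(rnorm(n), rnorm(n, mean = 0.5))
y <- c(rnorm(n), rnorm(n, mean = 0.5))

op <- par(mfrow = c(1, 3))
# Green points come first in the vectors, so purple is drawn on top.
plot(x, y, col = grp, pch = 20, main = "green first")
# Reversing the order puts green on top instead.
plot(rev(x), rev(y), col = rev(grp), pch = 20, main = "purple first")
# Random drawing order: neither group systematically hides the other.
ord <- sample(2 * n)
plot(x[ord], y[ord], col = grp[ord], pch = 20, main = "random order")
par(op)
```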
Pairwise Scatter Plot
Conditional Plot example
[Figure: conditional plot of log(1 + SALES12) versus sqrt(KNEE96), given BEDS and given ADM]
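A conditional plot of this kind can be drawn with `coplot()` from base R. The sketch below is only an approximation of the figure: the data are simulated, and the column names `knee`, `beds`, `adm`, and `sales` are stand-ins for KNEE96, BEDS, ADM, and SALES12 rather than the real case-study data.

```r
set.seed(4)
n <- 500
# Simulated stand-ins for the case-study variables (assumption).
d <- data.frame(knee = rpois(n, 50),
                beds = runif(n, 0, 1400),
                adm  = runif(n, 0, 60000))
d$sales  <- rpois(n, lambda = 1 + d$knee * d$beds / 2000)
d$lsales <- log(1 + d$sales)
d$sqknee <- sqrt(d$knee)

# Overlapping intervals of a conditioning variable, as coplot() uses them.
iv <- co.intervals(d$beds, number = 3, overlap = 0.5)
print(iv)

# One panel of lsales versus sqknee for each (beds interval, adm interval) pair.
coplot(lsales ~ sqknee | beds * adm, data = d, number = c(3, 3))
```

Each panel shows the same pair of variables for cases falling in one combination of the conditioning intervals, which makes it possible to see whether the relationship changes with hospital size or administrative cost.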
• variable and case selection
• cluster analysis (unsupervised pattern recognition)
- partitioning methods (e.g., k-means, k-medoids)
- hierarchical methods (e.g., agglomerative nesting)
- two-way clustering
• classification (supervised pattern recognition, discriminant analysis)
- trees (e.g., CART, C5, Firm, Tree, ARF)
- model-based methods (e.g., logistic regression)
- artificial neural networks
• role of robust methods / diagnostics
Cluster Example
[Figure: seven observations labeled 1 to 7]
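Both families of clustering methods named above, partitioning and hierarchical, are available in base R. The sketch below runs k-means and agglomerative clustering on three synthetic, well-separated groups; the data, the choice of k = 3, and the average linkage are assumptions for illustration.

```r
set.seed(5)
# Three synthetic groups of 50 points each in two dimensions.
centers <- rbind(c(0, 0), c(6, 0), c(0, 6))
true <- rep(1:3, each = 50)
X <- centers[true, ] + matrix(rnorm(300), ncol = 2)

# Partitioning method: k-means with k = 3 and several random starts.
km <- kmeans(X, centers = 3, nstart = 10)

# Hierarchical method: agglomerative clustering, tree cut into 3 groups.
hc <- hclust(dist(X), method = "average")
hc.groups <- cutree(hc, k = 3)

# Cross-tabulate the two solutions (cluster labels are arbitrary).
print(table(kmeans = km$cluster, hierarchical = hc.groups))
```

Note that `dist(X)` is an N^2 computation, so the hierarchical route is limited to moderately large N, while k-means scales to much larger datasets.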
Tree methods
I. Dependent variable is categorical
• Classification trees (e.g., CART, C5, Firm, Tree, ARF)
• Decision Trees
• Decision Rules
Example: Personal loan decision
[Figure: decision tree with yes/no (Y/N) splits on "Credit Card?", "Car?" and "Age<30", and leaves "Approve" or "Reject"]
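A decision tree such as the personal loan example is equivalent to a set of nested if/else rules. The function below illustrates that idea only: the branch layout of the slide's tree cannot be recovered from the text, so the arrangement of the three questions here is hypothetical, as is the function name `loan.decision`.

```r
# HYPOTHETICAL rule layout: the order of the splits and the outcome attached
# to each branch are assumptions, not the tree shown on the slide.
loan.decision <- function(credit.card, car, age) {
  if (!credit.card) return("Reject")       # no credit card
  if (car) return("Approve")               # credit card and car
  if (age < 30) "Reject" else "Approve"    # credit card, no car: decide on age
}

decision <- loan.decision(credit.card = TRUE, car = FALSE, age = 45)
print(decision)
```

Each path from the root to a leaf reads as one decision rule, for example "credit card and car implies approve" under the layout assumed here.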
[Figure: Function f(X,Y) and Tree form of f(X,Y), with splits Y<4, X<3, Y<2 and values 0, 2, 3, 5]
II. Dependent variable is numerical
• Regression Tree
[Figure: Regression Tree for log(1+Sales12).
Splits: HIP95<2.52265, HIP96<2.01527, RBEDS<2.77141, HIP95<0.5, KNEE96<1.36514, ADM<4.87542, FEMUR96<2.28992, KNEE95<2.96704, BEDS<3.8403, OUTV<15.2396, SIR<9.85983.
Leaf values: 1.0900, 0.3752, 0.9898, 0.8984, 2.3880, 1.2010, 1.7840, 2.1280, 3.1080, 2.4380, 3.2130, 3.9790]
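Each split in a regression tree such as the one for log(1+Sales12) is typically chosen to minimize the residual sum of squares when both sides are predicted by their means. The sketch below shows that single step for one predictor on synthetic data. It is a simplified illustration, not the algorithm of any particular package, and the function name `best.split` is made up for the example.

```r
# Find the split "x < thr" that minimizes the residual sum of squares
# when each side is predicted by its mean (one step of growing a tree).
best.split <- function(x, y) {
  xs <- sort(unique(x))
  thrs <- (head(xs, -1) + tail(xs, -1)) / 2     # midpoints between distinct values
  rss <- sapply(thrs, function(thr) {
    left <- y[x < thr]
    right <- y[x >= thr]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(thr = thrs[which.min(rss)], rss = min(rss))
}

set.seed(6)
x <- runif(200, 0, 5)
y <- ifelse(x < 2.5, 1, 3) + rnorm(200, sd = 0.3)   # synthetic step function
s <- best.split(x, y)
print(s$thr)
```

A full tree repeats this search over all predictors and then recursively within each side, which is how splits on many different variables end up in one tree.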
Linear Models
Linear model: Y = Xβ + ε
Least Squares Estimator: b = (X^T X)^{-1} X^T Y
Linear Discriminants
Linear Discriminant: Y = 0 or 1.
- Estimate b by L.S.
- Predict 1 if Xb > 0.5, 0 otherwise
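The least squares estimator and the 0.5 threshold rule can be written directly in R. The sketch below uses synthetic two-class data with placeholder predictors `x1` and `x2`. It solves the normal equations rather than forming the inverse explicitly, which gives the same b as the formula above.

```r
set.seed(7)
n <- 200
# Synthetic predictors and a response coded 0 or 1.
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- as.integer(x1 + x2 + rnorm(n, sd = 0.5) > 0)

X <- cbind(1, x1, x2)                        # design matrix with intercept column
# b = (X^T X)^{-1} X^T Y, computed by solving (X^T X) b = X^T Y.
b <- solve(crossprod(X), crossprod(X, y))

pred <- as.integer(X %*% b > 0.5)            # predict 1 if Xb > 0.5, 0 otherwise
print(table(observed = y, predicted = pred))
acc <- mean(pred == y)                       # training accuracy
print(acc)
```

The same coefficients are returned by `lm(y ~ x1 + x2)`, which is the more usual way to fit the model in practice.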
Example: Pima Indians
[Figure: PLASMA versus PEDIGREE, split into two regions with counts "Diabetes: 63, None: 28" and "Diabetes: 69, None: 185"]
Example: Pima KNN with k = 1
[Figure: PLASMA versus PEDIGREE]
Example: Pima KNN with k = 50
[Figure: PLASMA versus PEDIGREE]
Example: Pima Indians K = 10
[Figure: PLASMA versus PEDIGREE]
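The effect of k in the nearest-neighbor examples can be explored with a few lines of base R. The sketch below is a minimal k-nearest-neighbor classifier written for illustration. It runs on synthetic data, not the Pima data, and the function name `knn.predict` is made up for the example.

```r
# Minimal k-nearest-neighbor classifier for a 0/1 response (base R only).
knn.predict <- function(train, labels, test, k) {
  apply(test, 1, function(p) {
    d2 <- colSums((t(train) - p)^2)           # squared Euclidean distances to p
    nearest <- order(d2)[1:k]                 # indices of the k closest cases
    as.integer(mean(labels[nearest]) > 0.5)   # majority vote (ties go to 0)
  })
}

set.seed(8)
n <- 300
# Synthetic stand-in for two predictors and a binary outcome.
train <- cbind(rnorm(n), rnorm(n))
labels <- as.integer(train[, 1] + train[, 2] + rnorm(n, sd = 0.7) > 0)

err <- sapply(c(1, 10, 50), function(k) {
  mean(knn.predict(train, labels, train, k) != labels)
})
names(err) <- c("k=1", "k=10", "k=50")
print(err)                                    # training error for each k
```

With k = 1 every training case is its own nearest neighbor, so the training error is zero and the decision boundary is very irregular; larger k averages over more neighbors and gives a smoother boundary. When predictors are on very different scales, they are usually standardized before distances are computed.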