
STAT 588 : DATA MINING


Javier Cabrera


Fall 2005


What is Data mining?


[Diagram: Business Question → Find Data (Internal Databases, Data Warehouses, Internet, Online databases; Data Collection) → Data Processing → Extract Information (Data Analysis) → Answer Business Question]

Data collected in large databases:

• Relational databases, Internet, Data Warehouses: Large Datasets, Many variables and cases.

• Mostly noisy data: Missing Values, Zeros, Outliers.

• No random samples.

Data mining objective: To extract valuable information.

To identify nuggets: small clusters of observations in these data that contain potentially valuable information.

What counts as valuable is generally reflected by a large response value or by a specific category of a qualitative response.

The main challenge of data mining is sifting through a large volume of data that is noisy, badly behaved, may have many missing values, or may simply be irrelevant.

Welcome to Data mining


Moore's law:

• Processing "capacity" doubles every couple of years (exponential).

• Hard disk storage "capacity" doubles every 18 months (it used to be every 36 months).

• Bottlenecks are no longer a matter of speed.

• Processing capacity is not growing as fast as data acquisition.
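The doubling times quoted on this slide can be turned into growth factors with a line of arithmetic. The sketch below is only an illustration: it assumes "a couple of years" means exactly 2 years, and the function name `growth_factor` and the 6-year horizon are made up for the example.

```r
# Growth factor after `years` when capacity doubles every `doubling_time` years
growth_factor <- function(years, doubling_time) 2^(years / doubling_time)

years <- 6  # illustrative horizon, not from the slides

processing  <- growth_factor(years, 2)    # assumption: "a couple of years" = 2 years
storage     <- growth_factor(years, 1.5)  # doubling every 18 months
storage_old <- growth_factor(years, 3)    # the older 36-month doubling time

# Under these assumptions storage grows twice as much as processing in 6 years
print(c(processing = processing, storage = storage, storage_old = storage_old))
print(storage / processing)
```

With these numbers the amount of stored data per unit of processing capacity keeps rising, which is the point of the last bullet.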

How large is large?

By number of cases:

• Small: N < 30 (No CLT)

• Moderate: 30 < N < 500 (CLT)

• Moderately large: 500 < N < 50000 (tolerable $N^2$)

• Large: 50000+: No $N^2$ computations.

By the number of variables:

• Small: One variable.

• Moderate: Less than 100 Variables. Matrix inversion.

• Large: More than 100 Variables.

By database size:

• Large: Does not fit in memory.
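One way to see why $N^2$ computations stop being tolerable around N = 50000 is to count the memory needed for a full N x N matrix, such as a matrix of pairwise distances. This is a rough sketch that assumes 8-byte double-precision entries and a dense matrix; the helper name `pairwise_gb` is hypothetical.

```r
# Memory in gigabytes for a dense N x N matrix of 8-byte doubles
# (assumption: no sparsity and no use of symmetry)
pairwise_gb <- function(n) n^2 * 8 / 1e9

n <- c(30, 500, 50000, 1e6)
sizes <- data.frame(N = n, GB = pairwise_gb(n))
print(sizes)
```

Under these assumptions N = 500 needs about 2 MB, while N = 50000 already needs about 20 GB.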

Data mining Software


• Fast computations.

• Economic use of memory.

• Flexible (and user friendly) Graphics Interface.

Software that will be used in class:

• Clementine from SPSS

• SAS, R

Other Software:


Methods and Techniques

• Data summarization, EDA, Basic Statistics.

• Advanced Data Visualization.

• Data Reduction: variable and case Subsetting, Sampling.

• Dimension Reduction: Principal Components, Covariance.

• Cluster analysis (Segmentation): k-means, hierarchical.

• Classification techniques (Pattern recognition):

- LDA, QDA

- Trees

- Neural nets, Support Vector Machines

- Nearest Neighbors.

• Model based methods: Linear, Non-Linear, logistic

What is new?

• Improved Methodology and Software.

• Solve business problems:

Data is from regular businesses.

Objective: Better business decisions.

Case Study: SALES OF ORTHOPEDIC EQUIPMENT

The objective of this study is to find ways to increase sales of orthopedic

material from our company to hospitals in the United States. VARIABLES:

BEDS : NUMBER OF HOSPITAL BEDS
RBEDS : NUMBER OF REHAB BEDS
OUT-V : NUMBER OF OUTPATIENT VISITS
ADM : ADMINISTRATIVE COST (In $1000's per year)
SIR : REVENUE FROM INPATIENT
SALES12 : SALES OF REHAB. EQUIP. FOR THE LAST 12 MO
HIP95 : NUMBER OF HIP OPERATIONS FOR 1995
KNEE95 : NUMBER OF KNEE OPERATIONS FOR 1995
TH : TEACHING HOSPITAL? 0, 1
TRAUMA : DO THEY HAVE A TRAUMA UNIT? 0, 1
REHAB : DO THEY HAVE A REHAB UNIT? 0, 1
HIP96 : NUMBER HIP OPERATIONS FOR 1996
KNEE96 : NUMBER KNEE OPERATIONS FOR 1996
FEMUR96 : NUMBER FEMUR OPERATIONS FOR 1996

The Role of visualization


Data visualization methods are attractive tools to use for

analyzing such datasets for several reasons:

Data visualization methods show many features

(expected and unexpected) of a dataset at once and, as

such, are well equipped to pick up subtle structures of

interest and anomalies as well as clear patterns.

They allow (in fact, encourage) flexible interaction with

the data.

They can be more readily understood by non-statisticians

(although their properties may not be).

Good user-friendly graphics software is becoming more

readily available.

Data visualization methods


Large datasets create visualization challenges.

• Scatterplots: Large numbers of points may hide the underlying structure.

- Apply Data Binning and use an image graph.

- Avoid Masking by duplicating plots and highlighting subgroups.

- Sometimes it is enough to graph a subset selected at random.

Many variables at once. There are many ingenious tools for this.

• Scatterplot matrix

- all variables

- all descriptor variables with color coding according to one response

- all response variables with color coding according to one descriptor

• Plot selected 2D views to highlight some feature of the data:

- principal components analysis (spread)

- projection pursuit (clustering)

• Look at "all" 2D views of the data via a dynamic display [rotating 3D display, grand tour]

• Conditional plots

• Multiple windows with brush and link

Data Binning

[Figure: two panels of the same (x, y) data, "Scatter Plot" and "Binning Plot"]


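R example of Binning Plot

# Binning plot: count the points in each cell of an nr x nc grid and
# show the counts as an image. scale = "l" plots log(1 + count).
f.plot <- function(x, y, nr = 20, nc = 20, scale = "raw") {
  # Bin index of every point. The padding c(1:nr, rep(1, nc)) makes sure
  # that every bin appears as a level in the table.
  zx <- c(1:nr, rep(1, nc), 1 + trunc(nr * (x - min(x)) / (max(x) - min(x))))
  zx[zx > nr] <- nr
  zy <- c(rep(1, nr), 1:nc, 1 + trunc(nc * (y - min(y)) / (max(y) - min(y))))
  zy[zy > nc] <- nc
  z <- table(zx, zy)
  # Remove the counts contributed by the padding
  z[, 1] <- z[, 1] - 1
  z[1, ] <- z[1, ] - 1
  if (scale == "l") z <- log(1 + z)
  # image() maps the rows of z to the x axis and the columns to the y axis,
  # so z is passed as is (not transposed)
  image(x = seq(from = min(x), to = max(x), length.out = nr + 1),
        y = seq(from = min(y), to = max(y), length.out = nc + 1),
        z = z, xlab = "", ylab = "", col = topo.colors(100))
}

# Run this code line by line
x <- rnorm(10000); y <- rnorm(10000)
plot(x, y)
f.plot(x, y, 10, 10)
f.plot(x, y, 50, 50)
f.plot(x, y, 100, 100)
f.plot(x, y, 100, 100, "l")
f.plot(x, y, 500, 500, "l")
ux <- rnorm(5000) / 3
uy <- ux^2 - 0.5
f.plot(c(x, ux), c(y, uy) + 20, 50, 50, "l")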

Using Color

[Figure: log(1 + SALES12) against log(1 + BEDS), points colored by REHAB / NO REHAB]

Masking effect

[Figure: the same plot drawn twice, "Drawing green dots first" and "Drawing purple dots first"]
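The masking effect can be reproduced with a few lines of R: points drawn last cover points drawn first, so the same data look different depending on the drawing order. The sketch below uses simulated data (not the hospital data), and the group labels and colors are only borrowed from the figure for illustration.

```r
set.seed(1)
n <- 2000
# Simulated stand-in for two overlapping subgroups; values are not real data
grp  <- rep(c("REHAB", "NO REHAB"), each = n)
x    <- c(rnorm(n), rnorm(n, mean = 0.5))
y    <- c(rnorm(n), rnorm(n, mean = 0.5))
cols <- ifelse(grp == "REHAB", "green", "purple")

# plot() draws points in vector order, so the last group ends up on top
ord <- rev(seq_along(x))  # reversed order: purple is drawn first

op <- par(mfrow = c(1, 2))
plot(x, y, col = cols, pch = 16, main = "Drawing green dots first")
plot(x[ord], y[ord], col = cols[ord], pch = 16, main = "Drawing purple dots first")
par(op)
```

Showing both orders side by side is one way of duplicating the plot so that neither subgroup stays hidden.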

Pairwise Scatter Plot
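A pairwise scatter plot with color coding by one variable can be drawn in base R with `pairs()`. The data frame below is simulated and only reuses some variable names from the case study for readability; the numbers and the transformations chosen are assumptions made for the example.

```r
set.seed(3)
n <- 300
# Simulated stand-in for the hospital data; these are not real values
hosp <- data.frame(BEDS   = rpois(n, 150),
                   KNEE96 = rpois(n, 40),
                   REHAB  = rbinom(n, 1, 0.3))
hosp$SALES12 <- rpois(n, lambda = 5 + 0.1 * hosp$BEDS + 20 * hosp$REHAB)

# Transformed numeric variables to display
num <- data.frame(logBEDS    = log1p(hosp$BEDS),
                  sqrtKNEE96 = sqrt(hosp$KNEE96),
                  logSALES12 = log1p(hosp$SALES12))

# One panel per pair of variables, colored by the 0/1 REHAB indicator
pairs(num, col = ifelse(hosp$REHAB == 1, "green", "purple"), pch = 16, cex = 0.6)
```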

Conditional Plot example

[Figure: conditional plot of log(1 + SALES12) against sqrt(KNEE96), given BEDS and given ADM]
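A conditional plot of this kind can be produced with `coplot()` from base R, which draws one panel for each combination of intervals of the conditioning variables. The sketch uses simulated data with variable names borrowed from the case study; the ranges and the relationship between the variables are invented for the example.

```r
set.seed(4)
n <- 500
# Simulated stand-in for the hospital data; these are not real values
hosp <- data.frame(BEDS   = runif(n, 0, 1400),
                   ADM    = runif(n, 0, 60000),
                   KNEE96 = rpois(n, 100))
hosp$logSALES12 <- log1p(rpois(n, lambda = 1 + hosp$BEDS / 50 + hosp$KNEE96 / 5))
hosp$sqrtKNEE96 <- sqrt(hosp$KNEE96)

# 3 x 3 grid of panels: response against sqrt(KNEE96), given BEDS and ADM
coplot(logSALES12 ~ sqrtKNEE96 | BEDS * ADM, data = hosp,
       number = 3, pch = 16, cex = 0.5)
```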

• variable and case selection

• cluster analysis (unsupervised pattern recognition)

- partitioning methods (e.g., k-means, k-medoids)

- hierarchical methods (e.g., agglomerative nesting)

- two-way clustering

• classification (supervised pattern recognition, discriminant analysis)

- trees (e.g., CART, C5, Firm, Tree, ARF)

- model-based methods (e.g., logistic regression)

- artificial neural networks

• role of robust methods / diagnostics
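The two families of clustering methods listed above, partitioning and hierarchical, are both available in base R. The following is a minimal sketch on simulated data with two well-separated groups; the choice of two clusters and of average linkage is an assumption for the example.

```r
set.seed(42)
# Two simulated groups of 50 points each in two dimensions
pts <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2))

# Partitioning method: k-means with several random starts
km <- kmeans(pts, centers = 2, nstart = 10)

# Hierarchical method: agglomerative clustering on Euclidean distances,
# then cut the tree into two groups
hc <- hclust(dist(pts), method = "average")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two partitions (cluster labels are arbitrary)
print(table(kmeans = km$cluster, hclust = hc_groups))
```

Note that `dist()` builds all pairwise distances, so the hierarchical method is only practical when $N^2$ computations are tolerable.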


Cluster Example

[Figure: cluster example with observations labeled 1 to 7]

Tree methods

I. Dependent variable is categorical

• Classification trees (e.g., CART, C5, Firm, Tree, ARF)

• Decision Trees

• Decision Rules

Example: Personal loan decision

[Figure: decision tree with Y/N splits on "Credit Card?", "Car?", and "Age<30", and leaves "Approve" / "Reject"]

[Figure: two panels, "Function f(X,Y)" shown as a piecewise-constant function over the (X, Y) plane, and "Tree form of f(X,Y)" with splits Y<4, X<3, and Y<2]

II. Dependent variable is numerical

• Regression Tree

[Figure: Regression Tree for log(1+Sales12), with splits on HIP95, HIP96, RBEDS, KNEE96, ADM, FEMUR96, KNEE95, BEDS, OUTV, and SIR]

Linear Models

Linear model: $Y = X\beta + \varepsilon$

Least Squares Estimator: $b = (X^{T}X)^{-1}X^{T}Y$

Linear Discriminants

Linear Discriminant: Y = 0 or 1.

- Estimate b by L.S.

- Predict 1 if Xb > 0.5, 0 otherwise
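The least squares estimator and the 0.5 threshold rule can be written out directly in R. This is a sketch on simulated data: the two predictors `x1` and `x2` and the way the 0/1 response is generated are assumptions for the example, and `lm()` is used only to check the matrix formula.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
# Simulated 0/1 response (not real data)
y  <- as.numeric(x1 + x2 + rnorm(n, sd = 0.5) > 0)

# Design matrix with an intercept column
X <- cbind(1, x1, x2)

# Least squares estimator b = (X^T X)^{-1} X^T Y, solved without forming the inverse
b <- solve(t(X) %*% X, t(X) %*% y)

# Linear discriminant rule: predict 1 if Xb > 0.5, 0 otherwise
pred <- as.numeric(X %*% b > 0.5)
print(mean(pred == y))  # proportion classified correctly on the training data

# Cross-check against lm(), which fits the same linear model
b_lm <- coef(lm(y ~ x1 + x2))
```

Using `solve(A, B)` for the normal equations is a common alternative to computing the inverse explicitly; it gives the same estimator.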

Example: Pima Indians

[Figure: PLASMA against PEDIGREE, with class counts in two regions: "Diabetes: 63, None: 28" and "Diabetes: 69, None: 185"]


Example: Pima KNN with k = 1

[Figure: PLASMA against PEDIGREE]

Example: Pima KNN with k = 50

[Figure: PLASMA against PEDIGREE]

Example: Pima KNN with k = 10

[Figure: PLASMA against PEDIGREE]
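The effect of k in nearest-neighbor classification can be explored with a small base-R implementation. The sketch below is not the code behind the figures: it uses simulated two-class data instead of the Pima data, and the function name `knn_predict` and the class labels are chosen for the example.

```r
# k-nearest-neighbor classifier using Euclidean distance and majority vote
knn_predict <- function(train, labels, test, k) {
  apply(test, 1, function(p) {
    d <- sqrt(colSums((t(train) - p)^2))      # distance from p to every training point
    nearest <- order(d)[1:k]                  # indices of the k closest points
    names(which.max(table(labels[nearest])))  # most frequent label; ties go to the first label
  })
}

set.seed(7)
n <- 100
# Simulated training data: two classes centered at (0, 0) and (2, 2)
train  <- rbind(cbind(rnorm(n, 0), rnorm(n, 0)),
                cbind(rnorm(n, 2), rnorm(n, 2)))
labels <- rep(c("None", "Diabetes"), each = n)
test   <- rbind(c(-1, -1), c(3, 3))

for (k in c(1, 10, 50)) {
  print(knn_predict(train, labels, test, k))
}
```

Small k follows individual training points closely, while large k averages over many neighbors and gives smoother class boundaries.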

Data Mining Techniques:

•Artificial Neural Nets

•Support Vector Machines

Objective:

Try to emulate the way the brain works (???)

Machine Learning

Pattern recognition

Hoax:

The functioning of the brain is not yet understood. Any

relation with Artificial Neural Nets is purely anecdotal.
