Lecture 5-Data Preprocessing-M

(1)

CSC479

Data Mining

Lecture # 5

Data Preprocessing

(2)

Data Preprocessing

• Why preprocess the data?
• Data cleaning
• Data integration
• Data reduction
• Data Transformation and Discretization

(3)

Why Data Preprocessing?



Data in the real world is dirty:
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names

Data quality is a major concern in Data Mining and Knowledge Discovery tasks.
• Why: almost all Data Mining algorithms induce knowledge strictly from data.
• No quality data, no quality mining results!
  - Quality decisions must be based on quality data.
• No quality data, inefficient mining process!
  - Complete, noise-free, and consistent data means faster algorithms.
• The quality of the knowledge extracted depends heavily on the quality of the data.

(4)

Effect of Noisy Data on Results Accuracy

Test data (class label to be predicted):

age     income     student  buys_computer
<=30    high       no       ?
>40     medium     yes      ?
31…40   medium     yes      ?

Training data:

age     income     student  buys_computer
<=30    high       yes      yes
<=30    high       no       yes
>40     medium     yes      no
>40     medium     no       no
>40     low        yes      yes
31…40   (missing)  no       yes
31…40   medium     yes      yes

Rules discovered by Data Mining (only rules whose support, i.e. frequency, is >= 2 are kept):

• If age <= 30 and income = ‘high’ then buys_computer = ‘yes’
• If age > 40 and income = ‘medium’ then buys_computer = ‘no’

Because of the missing value in the training dataset, no rule for age 31…40 reaches the required support. The discovered rules cover only two of the three test records, so the accuracy of prediction decreases to 66.7%.
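The example above can be reproduced with a short sketch. It counts how often each (age, income) combination occurs with each class label and keeps the combinations whose support is at least 2. The names `mine_rules`, `training`, and `test` are illustrative, the missing value is assumed to be the income of the sixth training record, and the 66.7% figure is interpreted here as the fraction of test records that some rule can classify.

```python
from collections import Counter

# Training records as (age, income, buys_computer). The student attribute is
# left out because the rules on the slide use only age and income.
# None marks the missing value (assumed here to be the income attribute).
training = [
    ("<=30", "high", "yes"),
    ("<=30", "high", "yes"),
    (">40", "medium", "no"),
    (">40", "medium", "no"),
    (">40", "low", "yes"),
    ("31…40", None, "yes"),
    ("31…40", "medium", "yes"),
]

# Test records as (age, income); their class labels are unknown.
test = [("<=30", "high"), (">40", "medium"), ("31…40", "medium")]


def mine_rules(rows, min_support=2):
    """Return {(age, income): label} for combinations seen at least min_support times."""
    # Records with a missing value cannot contribute to any rule.
    counts = Counter(row for row in rows if None not in row)
    return {
        (age, income): label
        for (age, income, label), n in counts.items()
        if n >= min_support
    }


rules = mine_rules(training)
classified = [row for row in test if row in rules]
coverage = len(classified) / len(test)

print(rules)
print(f"{coverage:.1%} of the test records can be classified")
```

If the missing income were known to be "medium", the combination (31…40, medium) would reach support 2 and the third test record could be classified as well.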

(5)



• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data reduction
  - Obtains a representation that is reduced in volume but produces the same or similar analytical results
• Data transformation
  - Normalization and aggregation
• Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

(6)

(8)

Data Cleaning



Data cleaning tasks
• Fill in missing values
• Noisy data

(9)



Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to
• equipment malfunction
• data that was inconsistent with other recorded data and was therefore deleted
• data not entered due to misunderstanding
• certain data not being considered important at the time of entry
• the history or changes of the data not being registered

Missing data may need to be inferred.

(10)

Methods of Treating Missing Data

• Ignoring and discarding data: there are two main ways to discard data with missing values.
  - Discard all records that have missing data, also called discard case analysis. This is usually done when the class label is missing.
  - Discard only those attributes that have a high level of missing data.
• Fill in the missing value manually: tedious and often infeasible.
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class.
• Imputation using mean/median or mode: one of the most frequently used methods (a statistical technique).
  - Use the attribute mean to fill in the missing value.
  - Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter.
  - Replace the missing values of numeric (continuous) attributes using the mean or median (the median is robust against noise).
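The mean/median options listed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `impute` and `impute_by_class` and the sample values are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, median


def impute(values, strategy=mean):
    """Replace None entries with a statistic (mean, median, ...) of the known values."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


def impute_by_class(values, labels, strategy=mean):
    """Replace None entries with a statistic of the known values in the same class."""
    fills = {}
    for label in set(labels):
        known = [v for v, l in zip(values, labels) if l == label and v is not None]
        if known:  # a class with no known values is left unfilled (stays None)
            fills[label] = strategy(known)
    return [fills.get(l) if v is None else v for v, l in zip(values, labels)]


# Hypothetical attribute with one missing value and one noisy value (500).
incomes = [10, 12, None, 11, 500]
print(impute(incomes, mean))    # the fill value is pulled up by the noisy 500
print(impute(incomes, median))  # the fill value stays close to the typical values

# Hypothetical attribute with class labels "a" and "b".
values = [10, None, 30, 12, None, 34]
labels = ["a", "a", "b", "a", "b", "b"]
print(impute_by_class(values, labels))
```

The first two calls show why the median is described as robust against noise: a single extreme value shifts the mean-based fill value far more than the median-based one.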

(11)



• Replace missing values using a prediction/classification model: use the most probable value to fill in the missing value, with an inference-based method such as the Bayesian formula or a decision tree.
  - Advantage: it considers the relationship between the known attribute values and the missing values, so the imputation accuracy can be very high.
  - Disadvantage: if there is no correlation between the missing attribute values and the known attribute values, the imputation cannot be performed.
  - Alternative approach: use a hybrid combination of a prediction/classification model and mean/mode.
    • First try to impute the missing value using the prediction/classification model, and then fall back to the median/mode.
  - We will study more about this topic in Association Rules Mining.

(12)



• K-Nearest Neighbor (k-NN) approach (best approach): k-NN imputes the missing attribute values on the basis of the K nearest neighbors. Neighbors are determined on the basis of a distance measure.
• Once the K neighbors are determined, the missing value is imputed by taking the mean/median or mode of the neighbors' known values for the missing attribute.

[Figure: the record with the missing value shown among the other dataset records]
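A minimal sketch of k-NN imputation for numeric attributes follows. It assumes Euclidean distance computed over the attributes other than the missing one, complete neighbor records, and the mean as the aggregation step. The function name `knn_impute` and the sample records are hypothetical.

```python
import math
from statistics import mean


def knn_impute(target, records, missing_idx, k=3):
    """Estimate target[missing_idx] as the mean over the k nearest complete records."""

    def distance(a, b):
        # Euclidean distance that skips the attribute being imputed.
        return math.sqrt(
            sum(
                (x - y) ** 2
                for i, (x, y) in enumerate(zip(a, b))
                if i != missing_idx
            )
        )

    neighbors = sorted(records, key=lambda r: distance(target, r))[:k]
    return mean(r[missing_idx] for r in neighbors)


# Hypothetical complete records; the last one lies far from the others.
records = [
    (1.0, 2.0, 10.0),
    (1.1, 2.1, 12.0),
    (0.9, 1.9, 11.0),
    (8.0, 9.0, 50.0),
]
target = (1.0, 2.0, None)  # third attribute is missing

print(knn_impute(target, records, missing_idx=2, k=3))
```

For a categorical attribute the mean would be replaced by the mode of the neighbors' values, and in practice the attributes are usually scaled first so that no single attribute dominates the distance.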

(13)

Imputation of Missing Data (Basic)



• Imputation is a term that denotes a procedure that replaces the missing values in a dataset with plausible values,
• i.e. by considering the relationships among correlated attribute values in the dataset.

Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
?            cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

If we consider only {attribute#2}, the value “cool” appears in 3 of the complete records (with Attribute 1 = 20, 20, and 30).
Probability of imputing value (20) = 2/3 = 66.7%

(14)

Imputation of Missing Data (Basic)

Attribute 1  Attribute 2  Attribute 3  Attribute 4
20           cool         high         false
?            cool         high         true
20           cool         high         true
20           mild         low          false
30           cool         normal       false
10           mild         high         true

For {attribute#4}, the value “true” appears in 2 of the complete records (with Attribute 1 = 20 and 10):
Probability of imputing value (20) = 50%
Probability of imputing value (10) = 50%

For {attribute#2, attribute#3}, the value {“cool”, “high”} appears in only 2 of the complete records, both with Attribute 1 = 20:
Probability of imputing value (20) = 100%
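The probabilities in this example can be checked with a small sketch. It selects the complete records that agree with the incomplete record on the chosen attributes and reports the relative frequency of each Attribute 1 value among them. The function name `imputation_probabilities` is hypothetical, attributes are addressed by zero-based index, and the missing value is represented as `None`.

```python
from collections import Counter

# The table from the slides: (Attribute 1, Attribute 2, Attribute 3, Attribute 4).
records = [
    (20, "cool", "high", False),
    (None, "cool", "high", True),   # Attribute 1 is missing
    (20, "cool", "high", True),
    (20, "mild", "low", False),
    (30, "cool", "normal", False),
    (10, "mild", "high", True),
]
incomplete = records[1]


def imputation_probabilities(records, incomplete, target, condition_on):
    """Relative frequency of each known target value among records that match
    the incomplete record on every attribute index in condition_on."""
    matches = [
        r[target]
        for r in records
        if r[target] is not None
        and all(r[i] == incomplete[i] for i in condition_on)
    ]
    counts = Counter(matches)
    return {value: n / len(matches) for value, n in counts.items()}


print(imputation_probabilities(records, incomplete, target=0, condition_on=[1]))
print(imputation_probabilities(records, incomplete, target=0, condition_on=[3]))
print(imputation_probabilities(records, incomplete, target=0, condition_on=[1, 2]))
```

Conditioning on more attributes narrows the set of matching records, which makes the estimate more specific but also rests it on fewer examples.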

(15)



• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
• Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

(16)

Removing Noise



• Data smoothing (rounding, averaging within a window).
• Data smoothing by binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
• Smoothing by regression
  - smooth by fitting the data to regression functions
• Clustering/merging and detecting outliers.
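As an illustration of the first item, averaging within a window, here is a minimal sketch of a centered moving average. The function name `moving_average`, the window size, and the choice to shrink the window at the two ends of the series are assumptions made for this example.

```python
def moving_average(values, window=3):
    """Replace each value with the mean of the values in a centered window.

    Near the ends of the list the window is truncated to the values that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


print(moving_average([4, 8, 9, 15, 21], window=3))
```

A larger window removes more noise but also flattens genuine variation in the data.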

(17)

Smoothing by Binning Method

• Equal-width (distance) partitioning:
  - It divides the range into k intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A) / k, where k is the number of bins.
  - The most straightforward
  - But outliers may dominate presentation
  - Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
  - It divides the range into M intervals, each containing approximately the same number of samples
  - Good data scaling
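The two partitioning schemes can be sketched as follows. The function names are hypothetical, and the equal-width version assumes intervals that include their upper boundary, so that with A = 4 and W = 10 the value 24 falls into the second bin.

```python
import math


def equal_width_bins(values, k):
    """Split values into k intervals of equal width W = (B - A) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:  # all values identical: everything goes into the first bin
        return [sorted(values)] + [[] for _ in range(k - 1)]
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # Intervals are (A, A+W], (A+W, A+2W], ...; the minimum goes to bin 0.
        idx = 0 if v == lo else min(k - 1, math.ceil((v - lo) / width) - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, k):
    """Split the sorted values into k bins with (approximately) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

With equal-width bins the bin sizes follow the data distribution (3, 4, and 5 values here), whereas equal-depth bins always hold 4 values each for this dataset.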

(18)

Binning Methods for Data Smoothing

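* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins with boundaries A+w, A+2w, … (here A = 4, B = 34, and 3 bins, so w = 10 and the boundaries are 14 and 24):
  - Bin 1: 4, 8, 9
  - Bin 2: 15, 21, 21, 24
  - Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
  - Bin 1: 7, 7, 7
  - Bin 2: 20, 20, 20, 20
  - Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
  - Bin 1: 4, 9, 9
  - Bin 2: 15, 24, 24, 24
  - Bin 3: 25, 25, 25, 25, 34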

(19)

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
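The two smoothing steps can be sketched as small functions that take the bins as lists. The function names are hypothetical, bin means are rounded to the nearest integer to match the slides, and a value that is equally close to both boundaries is assigned to the lower one, which is an assumption of this sketch.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an arbitrary choice for this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


# Equi-depth bins for the price data from the slides.
equi_depth = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(equi_depth))
print(smooth_by_boundaries(equi_depth))

# Equi-width bins for the same data.
equi_width = [[4, 8, 9], [15, 21, 21, 24], [25, 26, 28, 29, 34]]
print(smooth_by_means(equi_width))
print(smooth_by_boundaries(equi_width))
```

Smoothing by means pulls every value toward the centre of its bin, while smoothing by boundaries keeps the extreme values of each bin and snaps the rest to them.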

(20)

Regression Method for smoothing the data

• Regression is a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.

[Figure: data points on x-y axes with the fitted line y = x + 1; a point labeled X1, Y1 is marked]
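A minimal sketch of smoothing by linear regression follows. It fits a least-squares line to two attributes and replaces each observed y with the value on the fitted line. The function names and the sample points (scattered around y = x + 1) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be non-zero (xs not all equal)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations around the line y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(smooth_by_regression(xs, ys))
```

The fitted slope and intercept come out close to 1 and 1, and the smoothed values lie exactly on the fitted line, which is what removes the point-to-point noise.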

(21)

Detecting Outliers (Clustering)



• Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.

• Values that fall outside of the set of clusters may be