CSC479
Data Mining
Lecture # 5
Data Preprocessing
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data reduction
Data Transformation and Discretization
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
Data quality is a major concern in Data Mining and Knowledge Discovery tasks.
Why: Almost all Data Mining algorithms induce knowledge strictly from data.
No quality data, no quality mining results!
Quality decisions must be based on quality data
No quality data, inefficient mining process!
Complete, noise-free, and consistent data means faster algorithms
The quality of the knowledge extracted depends highly on the quality of the data
Effect of Noisy Data on Results Accuracy
age income student buys_computer
<=30 high no ?
>40 medium yes ?
31…40 medium yes ?
age income student buys_computer
<=30 high yes yes
<=30 high no yes
>40 medium yes no
>40 medium no no
>40 low yes yes
31…40 (missing) no yes
31…40 medium yes yes
Data Mining
• If ‘age <= 30’ and income = ‘high’ then buys_computer = ‘yes’
• If ‘age > 40’ and income = ‘medium’ then buys_computer = ‘no’
Discover only those rules whose support (frequency) is >= 2
Due to the missing value in the training dataset, the accuracy of prediction decreases to 66.7%
Training data
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Obtains reduced representation in volume but
produces the same or similar analytical results
Data transformation
Normalization and aggregation
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Data Cleaning
Data cleaning tasks
Fill in missing values
Noisy data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not considered important at the time of entry
history or changes of the data not registered
Missing data may need to be inferred.
Methods of Treating Missing Data
Ignoring and discarding data:- There are two main ways to discard data with missing values.
Discard all records that have missing data, also called discard case analysis. Usually done when the class label is missing
Discard only those attributes that have a high level of missing data.
Fill in the missing value manually: tedious + infeasible?
Use a global constant to fill in the missing value: e.g., “unknown”, a
new class.
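The discarding and global-constant strategies above can be sketched in a few lines of Python. This is only an illustration: the records, the attribute names (`age`, `income`, `fax`, `buys_computer`), and the 50% threshold are hypothetical and not taken from the lecture.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 25, "income": "high", "fax": None, "buys_computer": "yes"},
    {"age": None, "income": "medium", "fax": None, "buys_computer": "no"},
    {"age": 45, "income": None, "fax": "555-0100", "buys_computer": None},
    {"age": 30, "income": None, "fax": None, "buys_computer": "yes"},
]

def drop_records_missing(records, attribute):
    """Discard every record whose value for `attribute` is missing."""
    return [r for r in records if r[attribute] is not None]

def drop_sparse_attributes(records, max_missing_ratio):
    """Discard attributes whose share of missing values exceeds the ratio."""
    if not records:
        return []
    keep = []
    for name in records[0]:
        missing = sum(1 for r in records if r[name] is None)
        if missing / len(records) <= max_missing_ratio:
            keep.append(name)
    return [{name: r[name] for name in keep} for r in records]

def fill_constant(records, constant="unknown"):
    """Replace every remaining missing value with a global constant."""
    return [{k: (constant if v is None else v) for k, v in r.items()}
            for r in records]

# 1. Drop records without a class label (the third record goes).
labelled = drop_records_missing(records, "buys_computer")
# 2. Drop attributes that are missing in more than half of the records ("fax").
reduced = drop_sparse_attributes(labelled, max_missing_ratio=0.5)
# 3. Fill whatever is still missing with the constant "unknown".
filled = fill_constant(reduced)
print(filled)
```

Note that the constant "unknown" behaves like a new category, so a mining algorithm may treat it as an ordinary attribute value.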
Imputation using mean/median or mode:- One of the most frequently used methods (statistical technique).
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
Replace missing values of numeric (continuous) attributes using the mean/median. (The median is robust against noise.)
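A minimal sketch of the three variants (attribute mean, class-wise mean, median) is given below. The sample values and the `impute` helper are hypothetical; the value 400.0 is included only to show why the median is less sensitive to noise than the mean.

```python
from statistics import mean, median

# Hypothetical (income, class) pairs; None marks a missing income.
samples = [
    (30.0, "yes"), (50.0, "yes"), (None, "yes"),
    (20.0, "no"), (24.0, "no"), (400.0, "no"), (None, "no"),
]

def impute(samples, strategy):
    """Fill missing values with the mean, the median, or the class-wise mean."""
    known = [v for v, _ in samples if v is not None]
    result = []
    for value, label in samples:
        if value is None:
            if strategy == "mean":
                value = mean(known)
            elif strategy == "median":
                value = median(known)
            elif strategy == "class_mean":
                # Only samples of the same class contribute to the mean.
                value = mean(v for v, c in samples
                             if c == label and v is not None)
            else:
                raise ValueError(f"unknown strategy: {strategy}")
        result.append((value, label))
    return result

print(impute(samples, "mean"))        # one global mean, pulled up by 400.0
print(impute(samples, "median"))      # the median of the known values is 30.0
print(impute(samples, "class_mean"))  # 40.0 for class "yes", 148.0 for "no"
```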
Replace missing values using a prediction/classification model:- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
Advantage:- It considers the relationship between the known attribute values and the missing values, so the imputation accuracy is usually high.
Disadvantage:- If no correlation exists between the missing attribute values and the known attribute values, the imputation can't be performed.
(Alternative approach):- Use a hybrid combination of the prediction/classification model and mean/mode.
• First try to impute the missing value using the prediction/classification model, and then fall back to median/mode.
We will study more about this topic in Association Rules Mining.
K-Nearest Neighbor (k-NN) approach (Best approach):-
k-NN imputes the missing attribute values on the basis of the K nearest neighbors.
Neighbors are determined on the basis of a distance measure.
Once the K neighbors are determined, the missing value is imputed by taking the mean/median or mode of the neighbors' known values for the missing attribute.
[Figure: the record with the missing value shown among the other dataset records]
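The k-NN idea can be sketched as follows, assuming purely numeric attributes, Euclidean distance on the attributes that are present, and the mean as the aggregation step. The function name `knn_impute` and the records are hypothetical.

```python
from math import sqrt
from statistics import mean

def knn_impute(target, others, missing_index, k):
    """Fill target[missing_index] with the mean of that attribute over the
    k records in `others` that are nearest to `target`. Distance is
    Euclidean and ignores the missing attribute."""
    def distance(a, b):
        return sqrt(sum((a[i] - b[i]) ** 2
                        for i in range(len(a)) if i != missing_index))

    neighbours = sorted(others, key=lambda r: distance(target, r))[:k]
    filled = list(target)
    filled[missing_index] = mean(r[missing_index] for r in neighbours)
    return filled

# Hypothetical complete records and one record with a missing third value.
others = [
    [1.0, 1.0, 10.0],
    [1.2, 0.8, 12.0],
    [0.9, 1.1, 14.0],
    [8.0, 9.0, 90.0],  # far away, so it is not among the 3 nearest
]
target = [1.0, 1.0, None]

print(knn_impute(target, others, missing_index=2, k=3))  # [1.0, 1.0, 12.0]
```

For categorical attributes the mean would be replaced by the mode, and attributes on different scales would normally be normalized before distances are computed.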
Imputation of Missing Data (Basic)
Imputation is a term that denotes a procedure that replaces the missing values in a dataset by some plausible values, i.e. by considering the relationships among correlated attribute values in the dataset.
Attribute 1 Attribute 2 Attribute 3 Attribute 4
20 cool high false
(missing) cool high true
20 cool high true
20 mild low false
30 cool normal false
10 mild high true
If we consider only {attribute#2}, then the value “cool” appears in 3 other records.
Probability of imputing value (20) = 66.7%
Probability of imputing value (30) = 33.3%
Imputation of Missing Data (Basic)
Attribute 1 Attribute 2 Attribute 3 Attribute 4
20 cool high false
(missing) cool high true
20 cool high true
20 mild low false
30 cool normal false
10 mild high true
For {attribute#4} the value “true” appears in 2 other records
Probability of imputing value (20) = 50%
Probability of imputing value (10) = 50%
Attribute 1 Attribute 2 Attribute 3 Attribute 4
20 cool high false
(missing) cool high true
20 cool high true
20 mild low false
30 cool normal false
10 mild high true
For {attribute#2, attribute#3} the value {“cool”, “high”} appears in only 2 other records
Probability of imputing value (20) = 100%
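The counting behind these probabilities can be reproduced with a short sketch. The table is the one from the slides; the helper name `imputation_probabilities` and the use of column indices are assumptions made for illustration.

```python
from collections import Counter

# Table from the slides; None marks the missing Attribute 1 value.
rows = [
    (20, "cool", "high", "false"),
    (None, "cool", "high", "true"),
    (20, "cool", "high", "true"),
    (20, "mild", "low", "false"),
    (30, "cool", "normal", "false"),
    (10, "mild", "high", "true"),
]
incomplete = rows[1]

def imputation_probabilities(rows, incomplete, given):
    """Relative frequency of each Attribute 1 value among the complete
    records that agree with `incomplete` on the columns listed in `given`."""
    matches = [r[0] for r in rows
               if r[0] is not None
               and all(r[i] == incomplete[i] for i in given)]
    counts = Counter(matches)
    return {value: n / len(matches) for value, n in counts.items()}

print(imputation_probabilities(rows, incomplete, given=[1]))     # 20: 2/3, 30: 1/3
print(imputation_probabilities(rows, incomplete, given=[3]))     # 20: 1/2, 10: 1/2
print(imputation_probabilities(rows, incomplete, given=[1, 2]))  # 20: 1.0
```

Conditioning on more attributes narrows the set of matching records, which is why the pair {attribute#2, attribute#3} leaves 20 as the only candidate.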
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
Removing Noise
Data Smoothing (rounding, averaging
within a window).
Data smoothing by Binning method:
• first sort data and partition into (equi-depth) bins
• then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Smoothing by Regression
• smooth by fitting the data into regression functions
Clustering/merging and Detecting outliers.
Smoothing by Binning Method
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size: uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B - A)/N, where N is the number of bins.
The most straightforward method, but outliers may dominate the presentation
Skewed data is not handled well.
Equal-depth (frequency) partitioning:
It divides the range into M intervals, each containing approximately the same number of samples
Good data scaling
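Both partitioning schemes can be sketched as below, using the sorted price list from the lecture's binning example. Two details are assumptions rather than something the slides state: the equal-width intervals are treated as right-closed, i.e. (A, A+W], (A+W, A+2W], ..., with the lowest value placed in the first bin, and any remainder in equal-depth partitioning goes to the earliest bins.

```python
import math

def equal_width_bins(sorted_values, n_bins):
    """Partition into n_bins intervals of width W = (B - A) / n_bins.
    Assumes the highest value is strictly greater than the lowest."""
    low, high = sorted_values[0], sorted_values[-1]
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted_values:
        # Right-closed intervals: a value on a boundary joins the lower bin.
        index = max(0, math.ceil((v - low) / width) - 1)
        bins[min(index, n_bins - 1)].append(v)
    return bins

def equal_depth_bins(sorted_values, n_bins):
    """Partition into n_bins bins holding (almost) the same number of values."""
    n = len(sorted_values)
    bins, start = [], 0
    for i in range(n_bins):
        # The first n % n_bins bins take one extra value.
        end = start + n // n_bins + (1 if i < n % n_bins else 0)
        bins.append(sorted_values[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
# [[4, 8, 9], [15, 21, 21, 24], [25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))
# [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With equal width the bin sizes follow the data density (3, 4, and 5 values here), whereas equal depth keeps every bin at 4 values.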
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins of width w = (34 - 4)/3 = 10, with boundaries A+w, A+2w, …:
- Bin 1: 4, 8, 9
- Bin 2: 15, 21, 21, 24
- Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries:
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest dollar):
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
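The two smoothing rules can be checked with a small sketch applied to the equi-depth bins of the price data. Rounding the means to whole dollars and sending ties to the lower boundary are assumptions; the slides do not state a tie rule.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.
    Bins are assumed sorted; a tie goes to the lower boundary."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

# Equi-depth bins of the sorted price data.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

print(smooth_by_means(bins))
# [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))
# [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```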
Regression Method for smoothing the data
Regression is a technique that conforms data values to a function.
Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
[Figure: data points plotted on x and y axes with the fitted line y = x + 1; the point (X1, Y1) is marked]
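As a rough illustration, the "best" line can be taken to be the ordinary least-squares line, and each observed y is then replaced by the value on that line. The data points below are hypothetical and were chosen to scatter around y = x + 1; the function names are not from the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each observed y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy observations scattered around y = x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)              # close to 1 and 1 for this data
print(smooth_by_regression(xs, ys))  # the smoothed y values lie on the line
```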
Detecting Outliers (Clustering)
Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.
Values which fall outside of the set of clusters may be considered outliers.
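A very small sketch of the idea for a single numeric attribute follows. It uses a simple gap-based grouping as a stand-in for a real clustering algorithm, which is an assumption made to keep the example short; the values and the parameters `max_gap` and `min_size` are hypothetical.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values: a new cluster starts whenever the gap to
    the previous value exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

def find_outliers(values, max_gap, min_size):
    """Values in clusters smaller than min_size are reported as outliers."""
    return [v for cluster in gap_clusters(values, max_gap)
            if len(cluster) < min_size
            for v in cluster]

# Hypothetical measurements: two dense groups and one isolated value.
values = [10, 11, 12, 50, 51, 53, 200]

print(gap_clusters(values, max_gap=5))
# [[10, 11, 12], [50, 51, 53], [200]]
print(find_outliers(values, max_gap=5, min_size=2))
# [200]
```

In practice a general clustering method such as k-means would be used on several attributes at once, and the choice of thresholds decides what counts as "outside" the clusters.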