Data Preprocessing
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Discretization and concept hierarchy
Why Data
Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
Accuracy: correct or wrong, accurate or not Completeness: not recorded, unavailable, …
Timeliness: timely update?
Believability: how trustable the data are correct?
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data Quality
Data quality is a major concern in Data Mining and
Knowledge Discovery tasks.
Why: At most all Data Mining algorithms induce
knowledge strictly from data.
The quality of knowledge extracted highly depends
on the quality of data.
There are two main problems in data
quality:- Missing data: The data not present.
Why can Data be
Incomplete?
Attributes of interest are not available (e.g.,
customer information for sales transaction data)
Data were not considered important at the time
of transactions, so they were not recorded!
Data not recorder because of misunderstanding
or malfunctions
Data may have been recorded and later deleted!
Why can Data be
Noisy/Inconsistent?
Faulty instruments for data collection
Human or computer errors
Errors in data transmission
Technology limitations (e.g., sensor data come at
a faster rate than they can be processed)
Inconsistencies in naming conventions or data
codes (e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002)
Duplicate tuples, which were received twice
Major Tasks in Data
Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains reduced representation in volume but produces the
same or similar analytical results
Data discretization
Part of data reduction but with particular importance,
especially for numerical data
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Data reduction
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
How to Handle Missing
Data?
Age Income Religion Gender
23 24,200 Muslim M
39 ? Christian F
45 45,390 ? F
Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on global value distribution
E.g., put the average income here, or put the most probable income based on the fact that the person is 39 years old
Imputation of Missing Data
(Basic)
Imputation is a term that denotes a procedure that
replaces the missing/invalid/inconsistent values in a dataset by some plausible values
i.e. by considering relationship among correlated values among the attributes of the dataset.
Attribute 1 Attribute 2 Attribute 3 Attribute 4
20 cool high false
cool high true
20 cool high true 20 mild low false 30 cool normal false 10 mild high true
If we consider only
{attribute#2}, then value
“cool” appears in 4 records.
Probability of Imputing
value (20) = 66.67%
Probability of Imputing
Imputation of Missing Data
(Basic)
Attribute 1 Attribute 2 Attribute 3 Attribute 4
20 cool high false
cool high true
20 cool high true 20 mild low false 30 cool normal false 10 mild high true
For {attribute#4} the value “true” appears in 3 records
Probability of Imputing
value (20) = 50%
Probability of Imputing
value (10) = 50%
Attribute 1 Attribute 2 Attribute 3 Attribute 4
20 cool high false
cool high true
20 cool high true 20 mild low false 30 cool normal false 10 mild high true
For {attribute#2,
attribute#3} the value {“cool”, “high”} appears in only 2 records
Probability of Imputing
Methods of Treating
Missing Data
Ignoring and discarding data:- There are two main
ways to discard data with missing values.
Discard all those records which have missing data also
called as discard case analysis.
Discarding only those attributes which have high level of
missing data.
Imputation using Mean/median or Mod:- One of the
most frequently used method (Statistical technique).
Replace (numeric continuous) type “attribute missing
values” using mean/median. (Median robust against noise).
Replace (discrete) type attribute missing values using
Methods of Treating
Missing Data
Replace missing values using
prediction/classification
model:- Advantage:- it considers relationship among the known
attribute values and the missing values, so the imputation accuracy is very high.
Disadvantage:- If there is no correlation exist for some
missing attribute values and known attribute values. The imputation can’t be performed.
(Alternative approach):- Use hybrid combination of
Prediction/Classification model and Mean/MOD. First try to impute missing value using
prediction/classification model, and then Median/MOD.
We will study more about this topic in Association Rules
Methods of Treating
Missing Data
K-Nearest Neighbor (k-NN) approach (Best
approach):- k-NN imputes the missing attribute values on the basis
of nearest K neighbor. Neighbors are determined on the basis of distance measure.
Once K neighbors are determined, missing value are
imputed by taking mean/median or MOD of known attribute values of missing attribute.
Pseudo-code/analysis after studying distance measure.
Missing value record
Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may exist due to
faulty data collection instruments data entry problems
data transmission problems technology limitation
inconsistency in naming convention
Other data problems which requires data cleaning
How to Handle Noisy Data?
Smoothing techniques
Binning method:
first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Clustering
detect and remove outliers
Combined computer and human inspection
computer detects suspicious values, which are then
checked by humans
Regression
Smoothing using Binning
Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries: [4,15],[21,25],[26,34]
- Bin 1: 4, 4, 4, 15
Cluster
Analysis
cluster
outlier
salary
Regression
x y
y = x + 1
X1 Y1
(salary)
(age)
Inconsistent Data
Inconsistent data are handled by:
Manual correction (expensive and tedious)
Use routines designed to detect inconsistencies and manually correct them. E.g., the routine may use the check global constraints (age>10) or
functional dependencies
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Data
Integration
Data integration:
combines data from multiple sources into a coherent store
Schema integration
integrate metadata from different sources
metadata: data about the data (i.e., data descriptors)
Entity identification problem: identify real world entities
from multiple data sources, e.g., A.cust-id B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values from
different sources are different (e.g., J.D.Smith and Jonh Smith may refer to the same person)
possible reasons: different representations, different
Data Integration (Problem
1)
Attribute naming
Problem: Attributes are named differently across different data sources.
CustomerID …
…
CustID …
…
CustomerID …
…
Multiple Sources
Coherent Store ClientID
… …
Extraction,
Transformation, and Loading
• Data Encoding
Problem: Same attribute has same values denoted in different ways
26
Gender Male Female Gender M
F
Gender Male Female
Multiple Sources Coherent Store
Similarly, Islamabad might be denoted as ‘isb’, ‘ISB’ or ‘ISBD’, may be misspelled, or be inconsistently capitalized (some programs are CASE-SENSITIVE)
27
• Measurement Basis
Problem: Same attributes is measured in different units.
27
Weight
(kilograms) 6
10
Multiple Sources Coherent Store
Weight (pounds) 6
10
Weight
(kilograms) 6
10 2.72 4.54
Handling
Redundant Data in
Data Integration
Redundant data occur often when integration of
multiple databases
The same attribute may have different names in different
databases
One attribute may be a “derived” attribute in another
table, e.g., annual revenue
Redundant data may be able to be detected by
correlation analysis
Careful integration of the data from multiple
sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration
Data transformation
Data
Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube
construction
Normalization: scaled to fall within a small,
specified range
min-max normalization z-score normalization
Normalization: Why
normalization?
The objective of normalization is convert all integer type
attributes, so that there values fall within a small specified range, such as 0 to 1.0.
Speeds-up some learning techniques (ex. neural networks) For example: ‘Income’ attribute can dominate the distance
as compared to ‘Weight’ and ‘Age’ attributes.
Normalization is particularly useful for clustering and
distance measure algorithms such as k-nearest-neighbor
Helps prevent attributes with large ranges outweigh ones
with small ranges
Example:
income has range 3000-200000 age has range 10-80
Data Transformation:
Normalization
min-max normalization
Example:- Suppose that the minimum and maximum values for the
attribute income are 12,000 and 98,000. By min-max normalization, a value of 73,600 for income is transformed to
((73,000-12,000)/(98,000-12,000)) (1.0-0) + 0 = 0.716
z-score normalization
normalization by decimal scaling
A A A A A A
min
new
min
new
max
new
min
max
min
v
v
'
(
_
_
)
_
A Adev
stand_
mean
v
v
'
j
v
v
10
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization
reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
Three types of attributes:
Nominal — values from an unordered set Ordinal — values from an ordered set
Continuous — real numbers
Discretization:
divide the range of a continuous attribute into intervals why?
Some classification algorithms only accept categorical attributes.
Reduce data size by discretization
Entropy-Based
Discretization
Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the information gain I(S,T) after partitioning is
The boundary that maximizes the information gain
over all possible boundaries is selected as a binary discretization.
The process is recursively applied to partitions
obtained until some stopping criterion is met, e.g.,
Experiments show that it may reduce data size and
improve classification accuracy
) ( | | | | ) ( | | | | ) ,
( 2 2
1 1
S
S
S
S
Ent S Ent S T SI
(
,
)
)
(
S
I
T
S
Example 1
ID
ID
1
2
3
4
5
6
7
8
9
Age
Age
21
22
24
25
27
27
27
35
41
Grade
Grade
F
F
P
F
P
P
P
P
P
Let Grade be the class attribute. Use
entropy-based discretization to divide the range of ages into different discrete intervals.
There are 6 possible boundaries. They are 21.5,
23, 24.5, 26, 31, and 38.
Let us consider the boundary at T = 21.5.
Let S1 = {21}
Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}
(21+22) / 2 = 21.5
Example 1 (cont’)
The number of elements in S1 and S2 are:
|S1| = 1 |S2| = 8
The entropy of S1 is
The entropy of S2 is
) 0 ( log ) 0 ( ) 1 ( log ) 1 ( ) P ( log ) P ( ) F ( log ) F ( ) ( 2 2 2 2
1 P Grade P Grade P Grade P Grade
S Ent ) 6 ( log ) 6 ( ) 2 ( log ) 2 ( ) P ( log ) P ( ) F ( log ) F ( ) ( 2 2 2 2
2 P Grade P Grade P Grade P Grade
S Ent
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Example 1 (cont’)
Hence, the entropy after partitioning at
T
= 21.5 is