Lecture 4- Data Preprocessing

(1)

Data Preprocessing

(2)

Data Preprocessing



Why preprocess the data?



Data cleaning



Data integration



Data transformation



Discretization and concept hierarchy

(3)

Why Data

Preprocessing?

 Data in the real world is dirty

 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

 noisy: containing errors or outliers

 inconsistent: containing discrepancies in codes or names

 Accuracy: correct or wrong, accurate or not  Completeness: not recorded, unavailable, …

 Timeliness: timely update?

 Believability: how trustable the data are correct?

 No quality data, no quality mining results!

 Quality decisions must be based on quality data

(4)

Data Quality

 Data quality is a major concern in Data Mining and

Knowledge Discovery tasks.

 Why: At most all Data Mining algorithms induce

knowledge strictly from data.

 The quality of knowledge extracted highly depends

on the quality of data.

 There are two main problems in data

quality:- Missing data: The data not present.

(5)

Why can Data be

Incomplete?

 Attributes of interest are not available (e.g.,

customer information for sales transaction data)

 Data were not considered important at the time

of transactions, so they were not recorded!

 Data not recorder because of misunderstanding

or malfunctions

 Data may have been recorded and later deleted!

(6)

Why can Data be

Noisy/Inconsistent?

 Faulty instruments for data collection

 Human or computer errors

 Errors in data transmission

 Technology limitations (e.g., sensor data come at

a faster rate than they can be processed)

 Inconsistencies in naming conventions or data

codes (e.g., 2/5/2002 could be 2 May 2002 or 5 Feb 2002)

 Duplicate tuples, which were received twice

(7)

Major Tasks in Data

Preprocessing

 Data cleaning

 Fill in missing values, smooth noisy data, identify or remove

outliers, and resolve inconsistencies

 Data integration

 Integration of multiple databases, data cubes, or files

 Data transformation

 Normalization and aggregation

 Data reduction

 Obtains reduced representation in volume but produces the

same or similar analytical results

 Data discretization

 Part of data reduction but with particular importance,

especially for numerical data

(8)

(9)

Data Preprocessing



Data cleaning



Data integration



Data transformation



Data reduction

(10)

Data Cleaning



Data cleaning tasks

 Fill in missing values

 Identify outliers and smooth out noisy data

(11)

How to Handle Missing

Data?

Age Income Religion Gender

23 24,200 Muslim M

39 ? Christian F

45 45,390 ? F

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on global value distribution

E.g., put the average income here, or put the most probable income based on the fact that the person is 39 years old

(12)

Imputation of Missing Data

(Basic)

 Imputation is a term that denotes a procedure that

replaces the missing/invalid/inconsistent values in a dataset by some plausible values

 i.e. by considering relationship among correlated values among the attributes of the dataset.

Attribute 1 Attribute 2 Attribute 3 Attribute 4

20 cool high false

cool high true

20 cool high true 20 mild low false 30 cool normal false 10 mild high true

If we consider only

{attribute#2}, then value

“cool” appears in 4 records.

Probability of Imputing

value (20) = 66.67%

(13)

Imputation of Missing Data

(Basic)

Attribute 1 Attribute 2 Attribute 3 Attribute 4

20 cool high false

For {attribute#4} the value “true” appears in 3 records

value (20) = 50%

value (10) = 50%

Attribute 1 Attribute 2 Attribute 3 Attribute 4

20 cool high false

For {attribute#2,

attribute#3} the value {“cool”, “high”} appears in only 2 records

(14)

Methods of Treating

Missing Data

 Ignoring and discarding data:- There are two main

ways to discard data with missing values.

 Discard all those records which have missing data also

called as discard case analysis.

 Discarding only those attributes which have high level of

missing data.

 Imputation using Mean/median or Mod:- One of the

most frequently used method (Statistical technique).

 Replace (numeric continuous) type “attribute missing

values” using mean/median. (Median robust against noise).

 Replace (discrete) type attribute missing values using

(15)

Methods of Treating

Missing Data

Replace missing values using

prediction/classification

model:- Advantage:- it considers relationship among the known

attribute values and the missing values, so the imputation accuracy is very high.

 Disadvantage:- If there is no correlation exist for some

missing attribute values and known attribute values. The imputation can’t be performed.

 (Alternative approach):- Use hybrid combination of

Prediction/Classification model and Mean/MOD.  First try to impute missing value using

prediction/classification model, and then Median/MOD.

 We will study more about this topic in Association Rules

(16)

Methods of Treating

Missing Data

K-Nearest Neighbor (k-NN) approach (Best

approach):- k-NN imputes the missing attribute values on the basis

of nearest K neighbor. Neighbors are determined on the basis of distance measure.

 Once K neighbors are determined, missing value are

imputed by taking mean/median or MOD of known attribute values of missing attribute.

 Pseudo-code/analysis after studying distance measure.

Missing value record

(17)

Noisy Data

 Noise: random error or variance in a measured

variable

 Incorrect attribute values may exist due to

 faulty data collection instruments  data entry problems

 data transmission problems  technology limitation

 inconsistency in naming convention

 Other data problems which requires data cleaning

(18)

How to Handle Noisy Data?

Smoothing techniques

 Binning method:

 first sort data and partition into (equi-depth) bins  then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.

 Clustering

 detect and remove outliers

 Combined computer and human inspection

 computer detects suspicious values, which are then

checked by humans

 Regression

(19)

Smoothing using Binning

Methods

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-depth) bins: - Bin 1: 4, 8, 9, 15

- Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries: [4,15],[21,25],[26,34]

- Bin 1: 4, 4, 4, 15

(20)

Cluster

Analysis

cluster

outlier

salary

(21)

Regression

x y

y = x + 1

X1 Y1

(salary)

(age)

(22)

Inconsistent Data



Inconsistent data are handled by:

 Manual correction (expensive and tedious)

 Use routines designed to detect inconsistencies and manually correct them. E.g., the routine may use the check global constraints (age>10) or

functional dependencies

(23)

Data Preprocessing



Data cleaning



Data integration



Data transformation

(24)

Data

Integration

 Data integration:

 combines data from multiple sources into a coherent store

 Schema integration

 integrate metadata from different sources

 metadata: data about the data (i.e., data descriptors)

 Entity identification problem: identify real world entities

from multiple data sources, e.g., A.cust-id  B.cust-#

 Detecting and resolving data value conflicts

 for the same real world entity, attribute values from

different sources are different (e.g., J.D.Smith and Jonh Smith may refer to the same person)

 possible reasons: different representations, different

(25)

Data Integration (Problem

1)



Attribute naming

Problem: Attributes are named differently across different data sources.

CustomerID …

…

CustID …

…

CustomerID …

…

Multiple Sources

Coherent Store ClientID

… …

Extraction,

Transformation, and Loading

(26)

• Data Encoding

Problem: Same attribute has same values denoted in different ways

26

Gender Male Female Gender M

F

Gender Male Female

Multiple Sources Coherent Store

Similarly, Islamabad might be denoted as ‘isb’, ‘ISB’ or ‘ISBD’, may be misspelled, or be inconsistently capitalized (some programs are CASE-SENSITIVE)

(27)

27

• Measurement Basis

Problem: Same attributes is measured in different units.

27

Weight

(kilograms) 6

10

Multiple Sources Coherent Store

Weight (pounds) 6

10

Weight

(kilograms) 6

10 2.72 4.54

(28)

Handling

Redundant Data in

Data Integration

 Redundant data occur often when integration of

multiple databases

 The same attribute may have different names in different

databases

 One attribute may be a “derived” attribute in another

table, e.g., annual revenue

 Redundant data may be able to be detected by

correlation analysis

 Careful integration of the data from multiple

sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and

(29)

Data Preprocessing



Data cleaning



Data integration



Data transformation

(30)

Data

Transformation

 Smoothing: remove noise from data

 Aggregation: summarization, data cube

construction

 Normalization: scaled to fall within a small,

specified range

 min-max normalization  z-score normalization

(31)

Normalization: Why

normalization?

 The objective of normalization is convert all integer type

attributes, so that there values fall within a small specified range, such as 0 to 1.0.

 Speeds-up some learning techniques (ex. neural networks)  For example: ‘Income’ attribute can dominate the distance

as compared to ‘Weight’ and ‘Age’ attributes.

 Normalization is particularly useful for clustering and

distance measure algorithms such as k-nearest-neighbor

 Helps prevent attributes with large ranges outweigh ones

with small ranges

 Example:

 income has range 3000-200000  age has range 10-80

(32)

Data Transformation:

Normalization

 min-max normalization

 Example:- Suppose that the minimum and maximum values for the

attribute income are 12,000 and 98,000. By min-max normalization, a value of 73,600 for income is transformed to

 ((73,000-12,000)/(98,000-12,000)) (1.0-0) + 0 = 0.716

 z-score normalization

 normalization by decimal scaling

A A A A A A

min

new

min

new

max

new

min

max

min

v

'

(

_



_

)



_





A A

dev

stand_

mean

v

'





j

v

10

(33)

Data Preprocessing



Data cleaning



Data integration and transformation



Data reduction

(34)

Discretization

 reduce the number of values for a given continuous attribute by

dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

 Three types of attributes:

 Nominal — values from an unordered set  Ordinal — values from an ordered set

 Continuous — real numbers

 Discretization:

 divide the range of a continuous attribute into intervals  why?

 Some classification algorithms only accept categorical attributes.

 Reduce data size by discretization

(35)

Entropy-Based

Discretization

 Given a set of samples S, if S is partitioned into two

intervals S1 and S2 using boundary T, the information gain I(S,T) after partitioning is

 The boundary that maximizes the information gain

over all possible boundaries is selected as a binary discretization.

 The process is recursively applied to partitions

obtained until some stopping criterion is met, e.g.,

 Experiments show that it may reduce data size and

improve classification accuracy

) ( | | | | ) ( | | | | ) ,

( 2 ₂

1 1

S

_Ent S Ent S T S

I  







(

,

)

(

S

I

T

S

(36)

Example 1

ID

1

2

3

4

5

6

7

8

9

Age

21

22

24

25

27

35

41

Grade

F

P

F

P

 Let Grade be the class attribute. Use

entropy-based discretization to divide the range of ages into different discrete intervals.

 There are 6 possible boundaries. They are 21.5,

23, 24.5, 26, 31, and 38.

 Let us consider the boundary at T = 21.5.

Let S1 = {21}

Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}

(21+22) / 2 = 21.5

(37)

Example 1 (cont’)

 The number of elements in S1 and S2 are:

|S1| = 1 |S2| = 8

 The entropy of S1 is

 The entropy of S2 is

               ) 0 ( log ) 0 ( ) 1 ( log ) 1 ( ) P ( log ) P ( ) F ( log ) F ( ) ( 2 2 2 2

1 P Grade P Grade P Grade P Grade

S Ent                ) 6 ( log ) 6 ( ) 2 ( log ) 2 ( ) P ( log ) P ( ) F ( log ) F ( ) ( 2 2 2 2

2 P Grade P Grade P Grade P Grade

S Ent

ID 1 2 3 4 5 6 7 8 9

Age 21 22 24 25 27 27 27 35 41

(38)

Example 1 (cont’)



Hence, the entropy after partitioning at

T

= 21.5 is