Data Cleaning

(1)

Data Cleaning

(2)

What is data cleaning ?!

Data in the real world is dirty

◦incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

◦e.g.,Customer_Income=“ ”, Occupation = “ ” (missing data)

◦noisy: containing noise, errors, or outliers

◦e.g., Salary = “−10”, Age = “222”

◦inconsistent: containing discrepancies in codes or names, e.g., ◦Age = “42”, Birthday = “03/07/2010”

◦Was rating “1, 2, 3”, now rating “A, B, C”

(3)

Data cleaning tasks

Data cleaning involves the following tasks:

◦Fill in missing values

◦Identify outliers and smooth out noisy data

(4)

How to Fill Missing Data ?

1. Ignore the tuple.

◦ Not effective method unless several attributes missing values.

2. Fill in the missing value manually.

◦ is time-consuming and may not be feasible given a large data set with many missing values.

3. Use a global constant to fill in the missing value.

◦ e.g., “unknown” or -∞, a new class?! (misunderstanding).

4. Use the attribute mean to fill in the missing value.

5. Use the attribute mean for all samples belonging to the same class as the given tuple.

6. Use the most probable value to fill in the missing value.

(5)

Fill Missing Data

Fill missing values using aggregate functions (e.g., average) or probabilistic estimates on global value distribution

E.g., put the average income here, or put the most probable income based on the fact that the person is 39 years old

(6)

Fill Missing Data: Regression

Regression shows a relationship between the average values of two variables (Attributes) and it is useful in estimating and predicting the average value of one variable for a given value of other variable.

There are two types of variables in regression:

◦ The variable whose value is to be predicted is called dependent variable (response variable).

◦ The variable whose value is used for prediction is called independent variable (predictor variable).

(7)

(8)

Exercise 1

Following table shows a set of paired data where X is the number of years of work experience of a college graduates and Y is the corresponding

salary of the graduate.

◦ Predict the salary of a college graduate with 10 years of experience. X Years Experience Y Salary (in $1000) 3 30 8 57 9 64 13 72 3 36 6 43 11 59 21 90 1 20 16 83 10 ??

(9)

(10)

Outliers

Outliers are values thought to be out of range (unusually small or unusually large value in a data set).

Outlier Detection can be done through Inter Quartile Range or Clustering.

(11)

Detecting Noise(outliers): Clustering

Clustering is used for outlier detection where outliers are values that are “far away” from any cluster.

(12)

Detecting Noise (outliers): Inter

Quartile Range

–IQR-Steps:

1. Arrange data in order.

2. Calculate First quartile (Q1: median of the smaller half of the data), Third quartile (Q3: median of the larger half of the data) and the interquartile range (IQR=Q3-Q1). 3. Compute (Q1 - 1.5*IQR) and (Q3 + 1.5 *IQR), Anything outside this range is an outlier.

4. Draw Boxplot to view data including outliers.

(13)

Detecting Noise (outliers): Boxplot

Steps

A box is drawn with its ends located at the first and third quartiles. A vertical line is drawn in the box at the location of the median.

Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits (range).

375 400 425 450 475 500 525 550 575 600 625

(14)

Exercise 1

The following is the salary of employees stored in the database: 30 57 64 72 36 43 59 90 20 83

Check if there is any outliers in that data and illustrate your results using Boxplot.

Solution

Step1: Sort Data 20 30 36 43 57 59 64 72 83 90

Step2: Compute First (Q1),Third (Q3) Quartile and IQR

20 30 36 43 57 59 64 72 83 90

Median (Second Quartile Q2) = (57+59)/2 = 58

Q1 = 36 Q3 = 72

(15)

Exercise 1 Cont.

Step3: Compute (Q1 - 1.5*IQR) and (Q3 + 1.5 *IQR) (36-1.5*36) and ( 72+1.5*36)

So the range for detecting outliers is -18, 126

All data falls with in the range so there are no outliers

Step4: Draw Boxplot

36 58 72 126

-18 20 90

(16)

Exercise 2

Given the following heights of 5th grade students: 110, 129, 130, 131, 132, 134, 134, 134, 135, 153

Which ones are outliers and why then draw Boxplot to illustrate your solution? Solution Median = 133 First Quartile Q1=130 Third Quartile Q3=134 IQR = 134-130 = 4 Cutoff1=130-1.5(4)=124 Cutoff2=134+1.5(4)=140

Since ‘110’ and ‘153’ don’t fall within range [124,140], they are outliers.

130 133 134 140

124 129 135

Q1 Q2 Q3

153 110

(17)

Handle Noisy Data: Binning

Binning methods smooth a sorted data value by consulting its neighborhood,” that is, the values around it.

The sorted values are distributed into a number of:

I. Equal-depth/frequency bins: bins are of the same size.

II. Equal-width bins : w = (B-A)/N Where ‘B’ is the max value And ‘A’ is the min value and ‘N’ is the

number of bins Smooth by:

I. Bin means “Replace value with the Average of a list of numbers”

II. Bin median “ Replace value with the "middle" value in a list of numbers”.

III. Bin boundaries “ Replace value with the closest boundary value of Min or Max of a list of

numbers” Data: 1,2,3,5,7,8,10,15

• Equal Depth ‘depth=4’: Bin1: 1,2,3,5 Bin2: 7,8,10,15 • Equal-Width “w: 7’: Bin1: 1,2,3,5,7,8 Bin2: 10,15

(18)

Exercise 1

Sorted Data for Price: 4,8,15,21,21,24,25,28,34

Bin depth = 3

Partition into equal-depth bins:

• Bin1: 4,8,15 • Bin2: 21,21,24 • Bin3: 25,28,34 Smooth by means: • Bin1: 9,9,9 • Bin2: 22,22,22 • Bin3: 29,29,29

Mean= 9 , median= 8 , boundaries are: 4,15 Mean= 22 , median= 21 , boundaries are: 21,24

Mean= 29 , median= 28 , boundaries are: 25,34

Smooth by boundaries:

• Bin1: 4,4,15

• Bin2: 21,21,24

• Bin3: 25,25,34

(19)

Exercise 1 Cont.

Sorted Data for Price: 4,8,15,21,21,24,25,28,34

#of bins = 3

Width = (34-4)/3=10

Partition into equal-width bins:

• Bin1: 4 Bin1: 4,8 Bin1: 4,8,15 • Bin1: 4,8 • Bin2: 15,21,21,24,25 • Bin3: 28,34 Then Smooth. 15-4 > 10

(20)

Exercise 2

Suppose data for an Age attribute in the data tuples are:

15,13,16,19,16,21,22,22,20,20,30,25,33,25,33,35,25,35,36,40,35,45,46,52,70,25 ,35

Use binning to smooth the above data by mean, median and boundaries with: - equal-depth partitioning using a bin depth of 9.

(21)

Exercise 2 solution

Step 1 -Sort Data-:

13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70

Step 2 –partition data-:

1- equal-depth number of bins = size of data/bin depth, then I get number of bins = 27/9 = 3

◦ Bin1: 13,15,16,16,19,20,20,21,22 median = 19 , mean = 18,Boundaries= 13,22

◦ Bin2: 22,25,25,25,25,30,33,33,35 median = 25, mean = 28.1,Boundaries= 22,35

◦ Bin3: 35,35,35,36,40,45,46,52,70 median = 40, mean = 43.8,Boundaries= 35,70

2- equal-width width = (max-min)/n = (70-13)/3 = 19

Range: Difference between Min and Max in each group must not exceed width “19”

◦ Bin1:13,15,16,16,19,20,20,21,22,22,25,25,25,25,30 median = 21 , mean = 20.9,Boundaries= 13,30

◦ Bin2: 33,33,35,35,35,35,36,40,45,46,52 median = 70/2 = 35, mean = 38.6,Boundaries= 33,46

(22)

Exercise 2 solution Cont.

Step 3 –smoothing data-:

1- By Mean: ◦ Bin1: 13,15,16,16,19,20,20,21,22 Bin1: 18,18,18,18,18,18,18,18,18 ◦ Bin2: 22,25,25,25,25,30,33,33,35 Bin2: 28.1,28.1, 28.1, 28.1, 28.1, 28.1, 28.1, 28.1, 28.1 ◦ Bin3: 35,35,35,36,40,45,46,52,70 Bin3: 43.8,43.8,43.8,43.8,43.8,43.8,43.8,43.8,43.8 2- By Median: ◦ Bin1: 13,15,16,16,19,20,20,21,22 Bin1: 19, 19, 19, 19, 19, 19, 19, 19, 19 ◦ Bin2: 22,25,25,25,25,30,33,33,35 Bin2: 25, 25, 25, 25, 25, 25, 25, 25, 25 ◦ Bin3: 35,35,35,36,40,45,46,52,70 Bin3: 40, 40, 40, 40, 40, 40, 40, 40, 40 3- By Boundaries: ◦ Bin1: 13,15,16,16,19,20,20,21,22 Bin1: 13,13,13,13,22,22,22,22,22 ◦ Bin2: 22,25,25,25,25,30,33,33,35 Bin2: 22,22,22,22,22,35,35,35,35 ◦ Bin3: 35,35,35,36,40,45,46,52,70 Bin3: 35,35,35,35,35,35,35,35,70

(23)

Assignment

1- Suppose a hospital tested the age and body fat data for 18 randomly selected patients with

the following result:

Draw the box plots for both age and fat then detect if the data contains any outliers.

Age 23 23 27 27 39 41 47 49 50 52 54 54 56 57 58 58 60 61

(24)

Assignment

2- Using Linear Regression find the predicted Post-test grade for

Someone with a score of 43 on the Pre-test.

3- Smooth the Post-test data by bins mean using equal-depth

Binning (depth=2).

4- smooth the Pre-test data by bins boundaries using

(25)

Data Cleaning

Data Cleaning

What is data cleaning ?!

Data in the real world is dirty

Data cleaning tasks

How to Fill Missing Data ?

Fill Missing Data

Fill Missing Data: Regression

Exercise 1

Outliers

Detecting Noise(outliers): Clustering

Detecting Noise (outliers): Inter

Quartile Range

Detecting Noise (outliers): Boxplot

Steps

Exercise 1

Solution

Exercise 1 Cont.

Exercise 2

Handle Noisy Data: Binning

Exercise 1

Exercise 1 Cont.

Exercise 2

Exercise 2 solution

Exercise 2 solution Cont.

Assignment

Assignment

Thank

You