
Lecture 5-Data Preprocessing-M


Academic year: 2020


(1)

CSC479

Data Mining

Lecture # 5

Data Preprocessing

(2)


Data Preprocessing

Why preprocess the data?

Data cleaning

Data integration

Data reduction

Data Transformation and Discretization

(3)

Why Data Preprocessing?

Data in the real world is dirty

• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

• noisy: containing errors or outliers

• inconsistent: containing discrepancies in codes or names

Data quality is a major concern in Data Mining and Knowledge Discovery tasks.

• Why: Almost all Data Mining algorithms induce knowledge strictly from data.

• No quality data, no quality mining results!

• Quality decisions must be based on quality data

• No quality data, inefficient mining process!

• Complete, noise-free, and consistent data means faster algorithms

• The quality of the knowledge extracted depends heavily on the quality of the data

(4)

Effect of Noisy Data on Results Accuracy

Training data

age      income      student   buys_computer
<=30     high        yes       yes
<=30     high        no        yes
>40      medium      yes       no
>40      medium      no        no
>40      low         yes       yes
31…40    (missing)   no        yes
31…40    medium      yes       yes

Data Mining: discover only those rules whose support (frequency) is >= 2

If ‘age <= 30’ and income = ‘high’ then buys_computer = ‘yes’

If ‘age > 40’ and income = ‘medium’ then buys_computer = ‘no’

Data to classify

age      income      student   buys_computer
<=30     high        no        ?
>40      medium      yes       ?
31…40    medium      yes       ?

Due to the missing value in the training dataset, the accuracy of prediction decreases to 66.7%. With the income value missing, the pattern ‘age 31…40’ and income = ‘medium’ has a support of only 1, so no rule is discovered for it and only two of the three tuples can be classified.
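The example above can be reproduced with a minimal sketch. It assumes that rules use only age and income as conditions (as the two rules on the slide do), that records with a missing condition value are skipped, and that the 66.7% figure is the share of tuples for which some rule fires. The function name `mine_rules` is hypothetical.

```python
from collections import Counter

MIN_SUPPORT = 2  # rules must be backed by at least this many training records

# (age, income, student, buys_computer); None marks the missing income value
training = [
    ("<=30", "high", "yes", "yes"),
    ("<=30", "high", "no", "yes"),
    (">40", "medium", "yes", "no"),
    (">40", "medium", "no", "no"),
    (">40", "low", "yes", "yes"),
    ("31…40", None, "no", "yes"),
    ("31…40", "medium", "yes", "yes"),
]

# (age, income, student); the class label is unknown
to_classify = [
    ("<=30", "high", "no"),
    (">40", "medium", "yes"),
    ("31…40", "medium", "yes"),
]


def mine_rules(rows, min_support=MIN_SUPPORT):
    """Return {(age, income): label} for patterns seen at least min_support times."""
    counts = Counter(
        (age, income, label)
        for age, income, _student, label in rows
        if age is not None and income is not None  # skip incomplete records
    )
    # Assumption: no (age, income) pair has two labels that both reach min_support
    return {(age, income): label
            for (age, income, label), n in counts.items()
            if n >= min_support}


rules = mine_rules(training)
predictions = [rules.get((age, income)) for age, income, _student in to_classify]
covered = sum(p is not None for p in predictions) / len(to_classify)
print(rules)
print(predictions, f"{covered:.1%}")
```

The third tuple gets no prediction because the only complete training record matching it has support 1, which is below the threshold.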

(5)

Data cleaning

• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration

• Integration of multiple databases, data cubes, or files

Data reduction

• Obtains reduced representation in volume but produces the same or similar analytical results

Data transformation

• Normalization and aggregation

Data discretization

• Part of data reduction but with particular importance, especially for numerical data


(8)

Data Cleaning

Data cleaning tasks

Fill in missing values

Noisy data

(9)

Data is not always available

• E.g., many tuples have no recorded value for several attributes, such as customer income in sales data

Missing data may be due to

• equipment malfunction

• data that was inconsistent with other recorded data and thus deleted

• data not entered due to misunderstanding

• certain data not being considered important at the time of entry

• history or changes of the data not being registered

Missing data may need to be inferred.

(10)

Methods of Treating Missing Data

• Ignoring and discarding data: there are two main ways to discard data with missing values.

  • Discard all records that have missing data, also called discard case analysis. Usually done when the class label is missing.

  • Discard only those attributes that have a high level of missing data.

• Fill in the missing value manually: tedious + infeasible?

• Use a global constant to fill in the missing value: e.g., “unknown”, a new class.

• Imputation using mean/median or mode: one of the most frequently used methods (statistical technique).

  • Use the attribute mean to fill in the missing value

  • Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter

  • Replace missing values of numeric (continuous) attributes using the mean/median. (The median is robust against noise.)
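The statistical imputation options listed above can be sketched with the Python standard library. The function names `impute` and `impute_class_mean` and the sample values are hypothetical, and missing values are assumed to be represented as `None`.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]


def impute_class_mean(values, labels):
    """Replace None entries with the mean of the known values in the same class.

    Assumes every class has at least one known value.
    """
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]


# Hypothetical incomes with one extreme value (1000) acting as noise
incomes = [30, None, 50, 1000, None]
classes = ["a", "a", "b", "b", "b"]

print(impute(incomes, "mean"))              # fill value is pulled up by the extreme value
print(impute(incomes, "median"))            # fill value stays near the typical incomes
print(impute_class_mean(incomes, classes))  # one fill value per class
print(impute(["cool", None, "cool", "mild"], "mode"))
```

Comparing the mean and median outputs illustrates why the median is described as robust against noise: a single extreme value shifts the mean but not the median.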

(11)

Replace missing values using a prediction/classification model: use the most probable value to fill in the missing value, obtained by inference-based methods such as a Bayesian formula or a decision tree.

• Advantage: it considers the relationship between the known attribute values and the missing values, so the imputation accuracy can be very high.

• Disadvantage: if no correlation exists between the missing attribute values and the known attribute values, the imputation cannot be performed.

• Alternative approach: use a hybrid combination of the prediction/classification model and mean/mode.

  • First try to impute the missing value using the prediction/classification model, and fall back to the median/mode if that fails.

• We will study more about this topic in Association Rules Mining.

(12)

K-Nearest Neighbor (k-NN) approach (best approach):

k-NN imputes the missing attribute values on the basis of the K nearest neighbors. Neighbors are determined on the basis of a distance measure.

Once the K neighbors are determined, the missing value is imputed by taking the mean/median or mode of the neighbors' known values for the missing attribute.

[Figure: the record with the missing value and the other dataset records]
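A minimal sketch of k-NN imputation follows. It assumes numeric attributes, Euclidean distance over the attributes that are present in the incomplete record, complete neighbor records, and the mean as the aggregation; the function name `knn_impute` and the sample records are hypothetical.

```python
import math
from statistics import mean


def knn_impute(target, records, missing_idx, k=3):
    """Impute target[missing_idx] as the mean of that attribute over the
    k complete records nearest to target (Euclidean distance on the
    remaining attributes)."""
    other_idxs = [i for i in range(len(target)) if i != missing_idx]

    def distance(rec):
        return math.sqrt(sum((target[i] - rec[i]) ** 2 for i in other_idxs))

    neighbors = sorted(records, key=distance)[:k]
    return mean(rec[missing_idx] for rec in neighbors)


# Hypothetical complete records: (attribute 1, attribute 2, attribute 3)
records = [
    (1.0, 1.0, 10.0),
    (1.2, 0.9, 12.0),
    (0.9, 1.1, 11.0),
    (8.0, 8.0, 50.0),
    (9.0, 7.5, 55.0),
]
incomplete = (1.1, 1.0, None)  # attribute 3 is missing

print(knn_impute(incomplete, records, missing_idx=2, k=3))
```

For categorical attributes the mean would be replaced by the mode, and the distance measure would need a suitable definition for non-numeric values.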

(13)

Imputation of Missing Data (Basic)

Imputation is a term that denotes a procedure that replaces the missing values in a dataset with plausible values, i.e. by considering the relationships among correlated attribute values in the dataset.

Attribute 1   Attribute 2   Attribute 3   Attribute 4
20            cool          high          false
(missing)     cool          high          true
20            cool          high          true
20            mild          low           false
30            cool          normal        false
10            mild          high          true

If we consider only {attribute#2}, then the value “cool” appears in 3 records with a known attribute#1 value.

Probability of imputing value (20) = 66.7%

Probability of imputing value (30) = 33.3%

(14)

Imputation of Missing Data (Basic)

Attribute 1   Attribute 2   Attribute 3   Attribute 4
20            cool          high          false
(missing)     cool          high          true
20            cool          high          true
20            mild          low           false
30            cool          normal        false
10            mild          high          true

For {attribute#4}, the value “true” appears in 2 records with a known attribute#1 value.

Probability of imputing value (20) = 50%

Probability of imputing value (10) = 50%

For {attribute#2, attribute#3}, the value {“cool”, “high”} appears in only 2 records with a known attribute#1 value.

Probability of imputing value (20) = 100%
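The probabilities in this example can be computed by counting, among the records that agree with the incomplete record on the chosen attributes, how often each value of attribute 1 occurs. The sketch below assumes that reading; the function name `imputation_probabilities` is hypothetical and attributes are indexed from 0.

```python
from collections import Counter

# Rows are (attribute 1, attribute 2, attribute 3, attribute 4); None = missing
table = [
    (20, "cool", "high", "false"),
    (None, "cool", "high", "true"),
    (20, "cool", "high", "true"),
    (20, "mild", "low", "false"),
    (30, "cool", "normal", "false"),
    (10, "mild", "high", "true"),
]
incomplete = table[1]


def imputation_probabilities(rows, record, target_idx, cond_idxs):
    """Relative frequency of each known target value among the rows that
    match `record` on every attribute index in cond_idxs."""
    matches = [
        r[target_idx]
        for r in rows
        if r[target_idx] is not None
        and all(r[i] == record[i] for i in cond_idxs)
    ]
    counts = Counter(matches)
    return {value: n / len(matches) for value, n in counts.items()}


print(imputation_probabilities(table, incomplete, 0, [1]))     # condition on attribute 2
print(imputation_probabilities(table, incomplete, 0, [3]))     # condition on attribute 4
print(imputation_probabilities(table, incomplete, 0, [1, 2]))  # condition on attributes 2 and 3
```

Conditioning on more attributes narrows the set of matching records, which is why the last call leaves a single candidate value.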

(15)

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

• faulty data collection instruments

• data entry problems

• data transmission problems

• technology limitation

• inconsistency in naming convention

Other data problems which require data cleaning

• duplicate records

• incomplete data

• inconsistent data

(16)

Removing Noise

Data Smoothing (rounding, averaging within a window).

Data smoothing by Binning method:

• first sort data and partition into (equi-depth) bins

• then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Smoothing by Regression

• smooth by fitting the data to regression functions

Clustering/merging and Detecting outliers.
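Averaging within a window, the first technique listed on this slide, can be sketched as a centered moving average. The window handling at the edges (truncating the window) is an assumption, and the function name and sample values are hypothetical.

```python
def moving_average(values, window=3):
    """Replace each value by the mean of the values in a centered window.

    At the edges the window is truncated to the values that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical series with one noisy spike (30)
print(moving_average([1, 2, 30, 4, 5], window=3))
```

A larger window dampens the spike more strongly but also blurs genuine changes in the data.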

(17)

Smoothing by Binning Method

• Equal-width (distance) partitioning:

  • It divides the range into N intervals of equal size: uniform grid

  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N, where N is the number of bins.

  • The most straightforward

  • But outliers may dominate presentation

  • Skewed data is not handled well.

• Equal-depth (frequency) partitioning:

  • It divides the range into M intervals, each containing approximately the same number of samples

  • Good data scaling

(18)

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-width) bins with boundaries A+w, A+2w, …:

- Bin 1: 4, 8, 9

- Bin 2: 15, 21, 21, 24

- Bin 3: 25, 26, 28, 29, 34

* Smoothing by bin means:

- Bin 1: 7, 7, 7

- Bin 2: 20, 20, 20, 20

- Bin 3: 28, 28, 28, 28, 28

* Smoothing by bin boundaries:

- Bin 1: 4, 9, 9

- Bin 2: 15, 24, 24, 24

- Bin 3: 25, 25, 25, 25, 34
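The two partitioning schemes can be sketched as follows. The equal-width function assumes right-closed intervals (a value equal to a boundary falls into the lower bin), which is the convention that reproduces the equi-width bins shown for the price data; the function names are hypothetical.

```python
import math


def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width W = (B - A) / n_bins.

    Assumes max(values) > min(values). Intervals are right-closed, and the
    minimum value is placed in the first bin.
    """
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = math.ceil((v - a) / width) - 1
        idx = min(n_bins - 1, max(0, idx))  # clamp the minimum and any rounding overshoot
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of approximately equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))   # width W = (34 - 4) / 3 = 10
print(equal_depth_bins(prices, 3))   # 4 values per bin
```

With equal-width bins a single extreme value stretches the range and can leave most bins nearly empty, which is the outlier problem noted for this scheme; equal-depth bins keep the counts balanced instead.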

(19)

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into (equi-depth) bins:

- Bin 1: 4, 8, 9, 15

- Bin 2: 21, 21, 24, 25

- Bin 3: 26, 28, 29, 34

* Smoothing by bin means:

- Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23

- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15

- Bin 2: 21, 21, 25, 25

- Bin 3: 26, 26, 26, 34
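Both smoothing steps can be sketched for bins that are already sorted. Two assumptions are made here: bin means are rounded to whole dollars, as the slide values appear to be, and a value exactly halfway between the two boundaries is assigned to the lower one. The function names are hypothetical.

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins if b]


def smooth_by_boundaries(bins):
    """Replace every value in a sorted bin by the closer of the bin's
    minimum and maximum; ties go to the minimum."""
    smoothed = []
    for b in bins:
        if not b:
            continue
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


equi_depth = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(equi_depth))
print(smooth_by_boundaries(equi_depth))
```

The same two functions apply unchanged to equi-width bins, since they only depend on the contents of each bin.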

(20)

Regression Method for smoothing the data

• Regression is a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.

[Figure: data points plotted on x and y axes with the fitted line y = x + 1; labels X1 and Y1]
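A minimal sketch of smoothing by linear regression, assuming the "best" line is the ordinary least-squares fit of y on x. The function names and the noisy sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Assumes the x values are not all identical.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y by the value the fitted line predicts for its x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Points lying exactly on y = x + 1 recover slope 1 and intercept 1
print(fit_line([0, 1, 2, 3], [1, 2, 3, 4]))

# Hypothetical noisy measurements around the same line
print(smooth_by_regression([0, 1, 2, 3], [1.2, 1.8, 3.1, 3.9]))
```

The smoothed values all lie on the fitted line, so the random scatter around it is removed while the overall trend between the two attributes is kept.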

(21)

Detecting Outliers (Clustering)

Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.

Values which fall outside of the set of clusters may be considered outliers.
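As an illustration only, the idea can be sketched for one numeric attribute with a very simple clustering rule: sorted values belong to the same cluster while the gap to the previous value stays within a threshold, and clusters with too few members are reported as outliers. This gap-based rule, the parameters `max_gap` and `min_size`, and the sample values are assumptions chosen for brevity; the slides do not name a specific clustering algorithm.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters; a new cluster starts whenever the
    gap to the previous value exceeds max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size=2):
    """Values in clusters with fewer than min_size members are outliers."""
    return [v for c in cluster_1d(values, max_gap) if len(c) < min_size for v in c]


# Hypothetical measurements: two dense groups and one isolated value
data = [10, 11, 12, 50, 51, 52, 53, 200]
print(cluster_1d(data, max_gap=5))
print(find_outliers(data, max_gap=5))
```

The result depends on the chosen threshold and minimum cluster size, so both would need to be tuned to the scale of the attribute.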
