CSC479 Data Mining

(1)

Lecture # 6

Data Preprocessing

(2)

χ2 Correlation Test for Nominal Data

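● A correlation relationship between two nominal attributes can be discovered by a χ2 (chi-square) test:

$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$

where $o_{ij}$ is the observed count in cell $(i, j)$ of the contingency table, and $e_{ij} = \frac{(\text{total of row } i) \times (\text{total of column } j)}{n}$ is the count expected if the attributes were independent, with $n$ the total number of data tuples.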

Example: For the following data, find whether the two attributes are independent or not.

              male   female   Total

Fiction        250      200     450

Non-fiction     50     1000    1050
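The test on this table can be worked through in a few lines. This is a minimal sketch: the variable names are illustrative, and the 0.001 significance level (critical value 10.828 for 1 degree of freedom) is an assumption, since the slide does not state one.

```python
# Observed counts from the contingency table.
# Rows: Fiction, Non-fiction; columns: male, female.
observed = [[250, 200],
            [50, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: e_ij = row_total_i * col_total_j / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (o - e)^2 / e over all cells
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumed significance level 0.001; standard table value for 1 degree of freedom.
CRITICAL_0_001_DOF1 = 10.828

print("expected counts:", expected)
print("chi-square:", round(chi2, 2), "degrees of freedom:", dof)
print("dependent" if chi2 > CRITICAL_0_001_DOF1 else "independent")
```

The statistic comes out near 507.94, far above the critical value, so under these assumptions the hypothesis of independence is rejected: gender and preferred reading are correlated in this data.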

(3)

Data Preprocessing

● Why preprocess the data?

● Data cleaning

● Data integration

● Data reduction

● Data Transformation and Discretization

● Summary

(4)

Data Reduction

● A reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. The different strategies are:

• Dimensionality Reduction

• Numerosity reduction

• Data Compression

(5)

Dimensionality Reduction (DR)

● The process of reducing the number of random variables or attributes under consideration.

● Two very common methods are wavelet transforms and principal components analysis (PCA).

● They transform or project the data onto a smaller space

(6)

Numerosity Reduction

● Replace the original data volume by alternative, smaller forms of data representation.

● Regression and log-linear models (parametric methods)

● Histograms, clustering, sampling and data cube aggregation (nonparametric methods)
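The difference between the two families can be illustrated with a small sketch. The data below is hypothetical (synthetic points near a straight line); the parametric route keeps only two regression parameters, while the nonparametric route keeps a 10-bucket histogram.

```python
import numpy as np

# Hypothetical data: 1,000 (x, y) pairs that roughly follow a straight line.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 1000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Parametric reduction: keep only the two fitted parameters of y = w*x + b.
w, b = np.polyfit(x, y, deg=1)

# Nonparametric reduction: keep a 10-bucket equal-width histogram of y.
counts, bucket_edges = np.histogram(y, bins=10)

print(f"stored parameters: w={w:.3f}, b={b:.3f}")
print("stored histogram counts:", counts)
```

In both cases the 1,000 original values are replaced by a much smaller summary: two numbers for the regression model, or ten counts plus the bucket edges for the histogram.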

(7)

Data Compression

● Transformations are applied to obtain a reduced or compressed representation

● Lossless: The original data can be reconstructed from the compressed data without any information loss.

● Lossy: Only an approximation of the original data can be reconstructed.
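A short sketch of the two cases, using hypothetical readings. Lossless compression is shown with Python's standard `zlib` module; simple rounding to whole numbers stands in for a lossy scheme (an assumed simplification, not a method named in the lecture).

```python
import struct
import zlib

# Hypothetical data: 1,000 sensor-style readings with a repeating pattern.
values = [round(20.0 + 0.1 * (i % 50), 1) for i in range(1000)]
raw = struct.pack(f"{len(values)}d", *values)

# Lossless: zlib (DEFLATE) compression; decompression restores every byte.
packed = zlib.compress(raw)
restored = struct.unpack(f"{len(values)}d", zlib.decompress(packed))
assert list(restored) == values

# Lossy: keep whole numbers only; the fractional part cannot be recovered.
quantized = [round(v) for v in values]
max_error = max(abs(v - q) for v, q in zip(values, quantized))

print(len(raw), "bytes raw ->", len(packed), "bytes compressed")
print("largest error after lossy quantization:", max_error)
```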

(8)

DR: Principal Components Analysis (PCA)

● Why PCA?

● PCA is a useful statistical technique that has found applications in:

● Face recognition

● Image Compression

● Reducing dimension of data

(9)

PCA Goal:

Removing Dimensional Redundancy

● The major goal of PCA in Data Mining is to remove the “dimensional redundancy” from data.

● What does that mean?

● A typical dataset contains several dimensions (variables) that may or may not correlate.

● Dimensions that correlate vary together.

● The information represented by a set of dimensions with high correlation can be extracted by studying just one dimension that represents the whole set.

● Hence the goal is to reduce the dimensions of a dataset to a smaller set of representative dimensions that do not correlate.

(10)

PCA Goal:

[Figure: a data set with 12 dimensions, Dim 1 to Dim 12]

Analyzing 12-dimensional data is challenging!

(11)

PCA Goal:

[Figure: a data set with 12 dimensions, Dim 1 to Dim 12]

But some dimensions represent redundant information. Can we “reduce” these?

(12)

PCA Goal:

Let's assume we have a “PCA black box” that can reduce the correlating dimensions.

Pass the 12-d data set through the black box to get a three-dimensional data set.

(13)

PCA Goal:

Pass the 12-d data set through the black box to get a three-dimensional data set.

[Figure: PCA black box with outputs Dim A, Dim B, Dim C]

Given appropriate reduction, analyzing the reduced data set is much more efficient than analyzing the original “redundant” data.

(14)

Mathematics inside PCA Black box: Bases

● Let's now give the “black box” a mathematical form.

● In linear algebra, the dimensions of a space correspond to a linearly independent set of vectors, called a basis, that spans the space.

i.e., each point in that space is a linear combination of the basis vectors.

e.g., consider the simplest example: the standard basis of $R^n$, consisting of the unit vectors along the coordinate axes.

Every point in $R^3$ is a linear combination of the standard basis of $R^3$.

(15)

● Assume X is the 6-dimensional data set given as input.

• A naïve basis for X is the standard basis B for $R^6$ (the identity matrix), and hence BX = X.

• Here, we want to find a new (reduced) basis P such that PX = Y.

• Y will be the resultant reduced data set.

[Figure: the data matrix X, with axes labeled Dimensions and Data Points]
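The shapes involved in PX = Y can be sketched as follows. The data is random and hypothetical, it is assumed that X stores dimensions in rows and data points in columns, and the P used here is an arbitrary placeholder (two coordinate axes), not the basis PCA would choose.

```python
import numpy as np

# Hypothetical 6-dimensional data set: rows are dimensions, columns are data points.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))

# Naive basis: the standard basis of R^6 is the identity matrix, so B X = X.
B = np.eye(6)
assert np.allclose(B @ X, X)

# A reduced basis P with 2 orthonormal rows (here simply two coordinate axes,
# an arbitrary choice for illustration only).
P = np.eye(6)[:2]

# Change of basis: Y = P X has 2 rows (dimensions) and the same 5 columns.
Y = P @ X
print(X.shape, "->", Y.shape)
```

Each row of P is one basis vector of the reduced space, so the number of rows of P decides how many dimensions Y keeps.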

(16)

PCA Goal

● Change of Basis

• QUESTION: What is a good choice for P ?

– Let's park this question for now and revisit it after studying some related concepts.

(17)

Background Stats/Maths

● Mean and Standard Deviation

● Variance and Covariance

● Covariance Matrix

(18)

Mean and Standard Deviation

● Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

● The mean alone doesn’t tell us a lot about a data set.

● Different data sets can have the same mean.

● The standard deviation (SD) of a data set is a measure of how spread out the data is.

● Variance, the square of the SD, is another measure of the spread of data in a data set.
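A small sketch makes the point that the mean alone is not enough. The two data sets are hypothetical, and the sample form of the variance (dividing by n - 1) is an assumption, since the slide does not show which divisor it uses.

```python
import math

# Two hypothetical data sets with the same mean but different spread.
a = [0, 8, 12, 20]
b = [8, 9, 11, 12]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Sample variance: squared deviations from the mean, divided by n - 1.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def std_dev(xs):
    # Standard deviation is the square root of the variance.
    return math.sqrt(variance(xs))

for name, xs in (("a", a), ("b", b)):
    print(name, mean(xs), round(variance(xs), 4), round(std_dev(xs), 4))
```

Both sets have mean 10, but the first is far more spread out, which only the SD and variance reveal.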

(19)

Covariance

● SD and variance are 1-dimensional measures.

● 1-D data sets could be

● Heights of all the people in the room

● Salaries of employees in a company

● Marks in the quiz

● However, many data sets have more than one dimension.

● Our aim is to find any relationship between the different dimensions.

● E.g., finding the relationship between students’ results and their hours of study.

(20)

Covariance Interpretation

● We have a data set of students’ study hours (H) and marks achieved (M).

● We find cov(H, M).

● The exact value of the covariance is not as important as its sign (i.e., positive or negative).

● +ve: both dimensions increase together

● -ve: as one dimension increases, the other decreases
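The sign interpretation can be tried on a toy example. The five (H, M) pairs below are hypothetical, and the sample covariance (divisor n - 1) is assumed.

```python
# Hypothetical data: hours studied (H) and marks achieved (M) for five students.
H = [2, 4, 6, 8, 10]
M = [50, 55, 65, 70, 85]

def covariance(xs, ys):
    # Sample covariance: sum of products of paired deviations, divided by n - 1.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

cov_hm = covariance(H, M)

if cov_hm > 0:
    meaning = "H and M increase together"
elif cov_hm < 0:
    meaning = "as H increases, M decreases"
else:
    meaning = "no linear relationship between H and M"

print(cov_hm, "->", meaning)
```

Note that the covariance of a dimension with itself, `covariance(H, H)`, is just the variance of that dimension.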

(21)

Covariance Matrix

● Covariance is always measured between 2 dimensions.

● What if we have a data set with more than 2 dimensions?

● We have to calculate more than one covariance measurement.

● E.g., from a 3-dimensional data set (dimensions x, y, z) we could calculate cov(x, y), cov(x, z) and cov(y, z).

(22)

Covariance Matrix

● We can use a covariance matrix to hold the covariances of all the possible pairs of dimensions.

● Since cov(a, b) = cov(b, a), the matrix is symmetric about the main diagonal.
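A covariance matrix for a 3-dimensional data set can be sketched with NumPy. The data values are hypothetical; `np.cov` is used with its defaults, which treat each row as one dimension and divide by n - 1.

```python
import numpy as np

# Hypothetical 3-dimensional data set: one row per dimension (x, y, z).
data = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],    # x
    [2.0, 4.0, 5.0, 4.0, 5.0],    # y
    [9.0, 7.0, 6.0, 3.0, 1.0],    # z
])

# np.cov treats each row as a variable by default and divides by n - 1.
C = np.cov(data)

print(C)
# Diagonal entries are the variances; C[i, j] is cov(dimension i, dimension j).
print("symmetric:", np.allclose(C, C.T))
```

Here C[0, 1] is cov(x, y), C[0, 2] is cov(x, z) and C[1, 2] is cov(y, z); the entries below the diagonal repeat the same values.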

(23)

Eigenvectors

● Consider two multiplications between a matrix and a vector.

● In the first example, the resulting vector is not a scalar multiple of the original vector.

● Whereas in the second example, the resulting vector is 4 times the original vector.

(24)

Eigenvectors and Eigenvalues

● More formally defined

● Let A be an n × n matrix. A nonzero vector v that satisfies

$Av = \lambda v$

● for some scalar $\lambda$ is called an eigenvector of the matrix A, and $\lambda$ is the eigenvalue corresponding to eigenvector v.
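The definition can be checked numerically. The slide's own matrix and vectors are not reproduced in the text, so the values below are assumed for illustration, chosen so that one vector is scaled by 4 as in the example described earlier.

```python
import numpy as np

# Assumed example values (the slide's own figures are not in the text).
A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
u = np.array([1.0, 3.0])   # not an eigenvector of A
v = np.array([3.0, 2.0])   # an eigenvector of A with eigenvalue 4

print(A @ u)   # components 11 and 5: not a scalar multiple of u
print(A @ v)   # components 12 and 8: exactly 4 * v

# np.linalg.eig returns the eigenvalues and unit-length eigenvectors (as columns).
eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, vec in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ vec, lam * vec)   # A v = lambda v

print(eigenvalues)
```

Any nonzero multiple of an eigenvector is again an eigenvector with the same eigenvalue, which is why libraries usually return eigenvectors scaled to unit length.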

(25)

Principal Component Analysis

● PCA is a technique for identifying patterns in data.

● It is also used to express data in such a way as to highlight similarities and differences.

● PCA is used to reduce the dimensionality of data while preserving the integrity of the information.

(26)

Step by Step

● Step 1:

● We need to have some data for PCA

● Step 2:

● Mean normalization and feature scaling

● Subtract the mean from each of the data points

(27)

Step 1 & Step 2

● Mean of x = 18.1/10 = 1.81

● Mean of y = 19.1/10 = 1.91
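Steps 1 and 2 can be sketched in plain Python. The slide's data table is not in the text, so the ten (x, y) points below are an assumed stand-in, chosen so that the sums are 18.1 and 19.1 and the means match the ones stated above.

```python
# Assumed stand-in for the slide's data table (not in the text): ten (x, y)
# points whose sums are 18.1 and 19.1.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Step 2: subtract each dimension's mean so the adjusted data has mean zero.
x_adj = [xi - mean_x for xi in x]
y_adj = [yi - mean_y for yi in y]

print(round(mean_x, 2), round(mean_y, 2))
print([round(v, 2) for v in x_adj])
print([round(v, 2) for v in y_adj])
```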

(28)

Step 3: Calculate the Covariance

● Calculate the covariance matrix

● The off-diagonal elements of the covariance matrix are positive, so x and y increase together
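A sketch of this step with NumPy, again on an assumed ten-point data set (the slide's table is not in the text; these points reproduce the means 1.81 and 1.91 given earlier).

```python
import numpy as np

# Assumed stand-in data: ten (x, y) points with means 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# 2 x 2 covariance matrix [[var(x), cov(x, y)], [cov(y, x), var(y)]].
# np.cov subtracts the means itself and divides by n - 1.
C = np.cov(np.vstack([x, y]))

print(C)
print("off-diagonal positive:", C[0, 1] > 0)
```

For this assumed data the off-diagonal entry is about 0.615, so the two dimensions do increase together.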

(29)

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

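● $|C - \lambda I| = 0$, where C is the covariance matrix, is used to find the eigenvalues $\lambda$

● Then $Bx = 0$, with $B = C - \lambda I$, is solved for each eigenvalue to find the corresponding eigenvector x

● Since the covariance matrix is square, we can calculate the eigenvectors and eigenvalues of the matrix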

eigenvectors =

eigenvalues =
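The eigen-decomposition can be sketched with NumPy. The data is an assumed stand-in (the slide's table and results are not in the text), and `np.linalg.eigh` is used because a covariance matrix is symmetric.

```python
import numpy as np

# Assumed stand-in data: ten (x, y) points with means 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
C = np.cov(np.vstack([x, y]))

# eigh is for symmetric matrices: eigenvalues come back in ascending order,
# and the matching unit-length eigenvectors are the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Each eigenvalue satisfies the characteristic equation |C - lambda I| = 0.
for lam in eigenvalues:
    assert abs(np.linalg.det(C - lam * np.eye(2))) < 1e-9

print("eigenvalues:", eigenvalues)
print("eigenvectors (columns):")
print(eigenvectors)
```

The sign of each eigenvector is arbitrary: v and -v describe the same direction, so different libraries may report opposite signs.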

(30)

What does this all mean?

[Figure: plot of the data points with the eigenvectors overlaid]

(31)

Conclusion

● Eigenvectors give us information about the patterns in the data.

● Looking at the graph on the previous slide, see how one of the eigenvectors goes through the middle of the points.

● The second eigenvector tells us about another, weaker pattern in the data.

● So by finding the eigenvectors of the covariance matrix we are able to extract lines that characterize the data.

(32)

Step 5: Choosing components and forming a feature vector.

● The eigenvector with the highest eigenvalue is the principal component of the data set.

● In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.

● So, once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest.

● This gives the components in order of significance.

(33)

Cont’d

● Now, here comes the idea of dimensionality reduction and data compression.

● You can decide to ignore the components of least significance.

● You do lose some information, but if the eigenvalues are small, you don’t lose much.

(34)

Cont’d

● Suppose the data has n dimensions.

● So we will find n eigenvectors.

● But if we choose only the first p eigenvectors,

● then the final data set has only p dimensions.
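Ordering the components and keeping the first p can be sketched as follows. The data is an assumed stand-in for the slide's table, and the "fraction of variance" printed here is one common way to judge how little is lost, not something the lecture prescribes.

```python
import numpy as np

# Assumed stand-in data: ten (x, y) points with means 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
C = np.cov(np.vstack([x, y]))
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Order components by eigenvalue, highest first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance carried by each component.
explained = eigenvalues / eigenvalues.sum()

# Keep the first p eigenvectors as the columns of the feature vector.
p = 1
feature_vector = eigenvectors[:, :p]

print("fraction of variance per component:", np.round(explained, 4))
print("feature vector shape:", feature_vector.shape)
```

For this assumed data the first component carries about 96% of the variance, so dropping the second one loses little.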

(35)

Step 6: Deriving the new dataset

● Now, we have chosen the components (eigenvectors) that we want to keep.

● We can write them in the form of a matrix of vectors (the feature vector).

● In our example we have two eigenvectors, so we have two choices.

Choice 1 with two eigenvectors

(36)

Cont’d

● To obtain the final data set, we multiply the transpose of the above matrix of vectors by the transpose of the mean-adjusted data matrix.

● The final data set will have data items in columns and dimensions along rows.

● So we have the original data set represented in terms of the vectors we chose.
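The multiplication can be sketched with NumPy for both choices (two eigenvectors or one). The data is an assumed stand-in for the slide's table, and names such as `data_adjust`, `final_two` and `final_one` are illustrative only.

```python
import numpy as np

# Assumed stand-in data: ten (x, y) points with means 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Mean-adjusted data, already transposed: dimensions in rows, items in columns.
data_adjust = np.vstack([x - x.mean(), y - y.mean()])

C = np.cov(data_adjust)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]      # highest eigenvalue first
eigenvectors = eigenvectors[:, order]      # eigenvectors as columns

# Final data = (feature vector transposed) x (mean-adjusted data transposed).
final_two = eigenvectors[:, :2].T @ data_adjust   # keep both components
final_one = eigenvectors[:, :1].T @ data_adjust   # keep the principal one only

print(final_two.shape, final_one.shape)
```

With both eigenvectors kept, the data is only rotated onto the new axes and nothing is lost; with one eigenvector, each data item is reduced to a single coordinate along the principal component.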

(37)

Original data set represented using two eigenvectors.

(38)

Original data set represented using one eigenvector.

(39)

PCA – Mathematical Working

● The naïve basis (I) of the input matrix (X) spans a high-dimensional space.

● A change of basis (P) is required so that X can be projected onto a lower-dimensional space having significant dimensions only.

● A properly selected P will generate a projection Y.

● Use this P to project the correlation matrix. Reduce the number of eigenvectors in P for a reduced-dimension projection.

(40)

PCA Procedures

● Step 1

● Get data

● Step 2

● Subtract the mean

● Step 3

● Calculate the covariance matrix

● Step 4

● Calculate the eigenvectors and eigenvalues of the covariance matrix

● Step 5