CSC479
Data Mining
Lecture # 6
Data Preprocessing
χ2 Correlation Test for Nominal Data
● A correlation relationship between two nominal attributes can be discovered by a χ2 (Chi-square) test.
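$\chi^2 = \sum_{i}\sum_{j} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
where $o_{ij}$ is the observed count of the joint event $(A = a_i, B = b_j)$ and $e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}$ is its expected count if the two attributes were independent.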
Example: For the following data, find whether the two attributes are independent or not.
              male   female   Total
Fiction        250      200     450
Non-fiction     50     1000    1050
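The example above can be worked through numerically. The following is a minimal sketch using NumPy, assuming the standard Pearson χ2 statistic with expected counts computed from the row and column totals; the variable names are illustrative only.

```python
import numpy as np

# Observed counts from the table above
# (rows: Fiction, Non-fiction; columns: male, female).
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # 450 and 1050
col_totals = observed.sum(axis=0, keepdims=True)   # 300 and 1200
n = observed.sum()                                 # 1500

# Expected count of each cell if the two attributes were independent.
expected = row_totals @ col_totals / n

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi2 = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(round(chi2, 2), dof)
```

With 1 degree of freedom, the χ2 value needed to reject independence at the 0.001 significance level is 10.828. The statistic computed here is far larger, so the two attributes appear to be strongly correlated for this data.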
Data Preprocessing
● Why preprocess the data?
● Data cleaning
● Data integration
● Data reduction
● Data Transformation and Discretization
● Summary
Data Reduction
● Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. The main strategies are:
• Dimensionality Reduction
• Numerosity reduction
• Data Compression
Dimensionality Reduction (DR)
● The process of reducing the number of random variables or attributes under consideration.
● Two very common methods are Wavelet
Transforms and Principal Components Analysis
● They transform or project the data onto a smaller space
Numerosity Reduction
● Replace the original data volume by alternative smaller forms of data
representation.
● Regression and log-linear models (parametric methods)
● Histograms, clustering, sampling and data cube aggregation (nonparametric methods)
Data Compression
● Transformations are applied to obtain a reduced or compressed representation
● Lossless: The original data can be
reconstructed from the compressed data without any information loss.
● Lossy: Only an approximation of the original data can be reconstructed.
DR: Principal Components Analysis (PCA)
● Why PCA?
● PCA is a useful statistical technique that has found applications in:
● Face recognition
● Image Compression
● Reducing dimension of data
PCA Goal:
Removing Dimensional Redundancy
● The major goal of PCA in Data Mining is to remove the “dimensional redundancy” from data.
● What does that mean?
● A typical dataset contains several dimensions (variables) that may or may not correlate.
● Dimensions that correlate vary together.
● The information represented by a set of highly correlated dimensions can be extracted by studying just one dimension that represents the whole set.
● Hence the goal is to reduce the dimensions of a dataset to a smaller set of representative dimensions that do not correlate.
[Figure: twelve dimensions, Dim 1 to Dim 12]
Analyzing 12-dimensional data is challenging!
PCA Goal:
But some dimensions represent redundant information. Can we “reduce” these?
PCA Goal:
Let’s assume we have a “PCA black box” that can reduce the correlating dimensions.
Pass the 12-dimensional data set through the black box to get a three-dimensional data set.
PCA Goal:
[Figure: Dim 1 to Dim 12 pass through the PCA black box, which outputs Dim A, Dim B and Dim C]
Given appropriate reduction, analyzing the reduced dataset is much more efficient than analyzing the original “redundant” data.
PCA Goal:
● Let’s now give the “black box” a mathematical form.
● In linear algebra, the dimensions of a space correspond to a linearly independent set of vectors, called a “basis”, that spans the space;
i.e. each point in that space is a linear combination of the basis vectors.
E.g. consider the simplest example: the standard basis of Rn, consisting of the unit vectors along the coordinate axes.
Mathematics inside PCA Black box: Bases
Every point in R3 is a linear combination of the standard basis of R3
● Assume X is the 6-dimensional data set given as input
• A naïve basis for X is the standard basis B of R6 (the identity matrix), and hence BX = X
• Here, we want to find a new (reduced) basis P such that PX = Y
• Y will be the resultant reduced data set.
[Figure: the data matrix X, with one axis labelled “Dimensions” and the other “Data Points”]
PCA Goal
● Change of Basis
• QUESTION: What is a good choice for P ?
– Let’s park this question for now and revisit it after studying some related concepts
Background Stats/Maths
● Mean and Standard Deviation
● Variance and Covariance
● Covariance Matrix
Mean and Standard Deviation
● Mean: the average of the values in a data set.
● On its own, the mean doesn’t tell us a lot about the data set.
● Different data sets can have the same mean.
● The Standard Deviation (SD) of a data set is a measure of how spread out the data is.
● Variance is another measure of the spread of data in a data set; it is the square of the SD.
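The formulas for these three quantities are not shown in the text above. As a reference sketch, the usual sample definitions are given below, assuming the (n − 1) divisor is the intended convention for SD and variance; a population version would divide by n instead.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \qquad
s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}
```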
Covariance
● SD and Variance are 1-dimensional
● 1-D data sets could be
● Heights of all the people in the room
● Salaries of employees in a company
● Marks in the quiz
● However, many data sets have more than one dimension.
● Our aim is to find any relationship between different dimensions.
● E.g. finding the relationship between students’ results and their hours of study.
Covariance Interpretation
● We have a data set of students’ study hours (H) and marks achieved (M)
● We find cov(H,M)
● The exact value of the covariance is not as important as its sign (i.e. positive or negative)
● Positive: both dimensions increase together
● Negative: as one dimension increases, the other decreases
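The sign interpretation above can be illustrated with a small sketch. The study-hours and marks values below are hypothetical numbers made up for illustration (they are not from the lecture), and the helper `covariance` assumes the sample convention with an (n − 1) divisor.

```python
import numpy as np

# Hypothetical data for illustration only (not from the lecture).
hours = np.array([9.0, 15.0, 25.0, 14.0, 10.0, 18.0])   # H: study hours
marks = np.array([39.0, 56.0, 93.0, 61.0, 50.0, 75.0])  # M: marks achieved

def covariance(a, b):
    """Sample covariance: sum((a - mean(a)) * (b - mean(b))) / (n - 1)."""
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

cov_hm = covariance(hours, marks)

# Only the sign is interpreted here: positive means H and M increase together.
print("positive" if cov_hm > 0 else "negative or zero")
```

Negating one of the two arrays flips the sign of the result, which corresponds to the case where one dimension decreases as the other increases.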
Covariance Matrix
● Covariance is always measured between two dimensions.
● What if we have a data set with more than two dimensions?
● We have to calculate more than one covariance measurement.
● E.g. from a 3-dim data set (dimensions x,y,z) we could calculate cov(x,y), cov(x,z) and cov(y,z).
Covariance Matrix
● We can use a covariance matrix to hold the covariances of all the possible pairs of dimensions.
● Since cov(a,b) = cov(b,a), the matrix is symmetric about the main diagonal.
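The matrix itself is not shown in the text. For a 3-dim data set with dimensions x, y, z, the standard layout is sketched below; the symbol C is an assumed name for the matrix.

```latex
C = \begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix}
```

The diagonal entries are the variances of the individual dimensions, since cov(a,a) is the variance of a.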
Eigenvectors
● Consider the two multiplications between a matrix and a vector
● In the first example, the resulting vector is not a scalar multiple of the original vector.
● Whereas in the second example, the resulting vector is 4 times the original vector
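The two multiplications themselves are not reproduced in the text. The sketch below uses an assumed matrix and pair of vectors chosen to match the description: one product is not a multiple of the input, and the other is exactly 4 times the input.

```python
import numpy as np

# Illustrative matrix and vectors (assumed; the slide's own figures
# are not available in the text).
A = np.array([[2, 3],
              [2, 1]])
v1 = np.array([1, 3])
v2 = np.array([3, 2])

r1 = A @ v1   # [11, 5]: not a scalar multiple of v1
r2 = A @ v2   # [12, 8] = 4 * v2, so v2 is an eigenvector with eigenvalue 4

def is_scalar_multiple(r, v):
    """True if the 2-D vectors r and v are parallel (zero cross product)."""
    return r[0] * v[1] - r[1] * v[0] == 0

print(is_scalar_multiple(r1, v1), is_scalar_multiple(r2, v2))
```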
Eigenvectors and Eigenvalues
● More formally defined
● Let A be an n x n matrix. A nonzero vector v that satisfies $Av = \lambda v$ for some scalar λ is called an eigenvector of the matrix A, and λ is the eigenvalue corresponding to the eigenvector v
Principal Component Analysis
● PCA is a technique for identifying patterns in data.
● Also used to express data in such a way as to highlight similarities and
differences.
● PCA is used to reduce the number of dimensions in data while largely preserving the integrity of the information.
Step by Step
● Step 1:
● We need to have some data for PCA
● Step 2:
● Mean normalization and feature scaling
● Subtract the mean of each dimension from every data point in that dimension
Step 1 & Step 2
● Mean of x = 18.1/10 = 1.81
● Mean of y = 19.1/10 = 1.91
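Steps 1 and 2 can be sketched as follows. The slide's data table is not available in the text, so the ten (x, y) points below are an assumption: they were chosen so that the sums match the 18.1 and 19.1 stated above, and they may differ from the values used in the lecture.

```python
import numpy as np

# Assumed ten-point data set (NOT taken from the text); chosen so that
# sum(x) = 18.1 and sum(y) = 19.1 as stated on the slide.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Step 1: arrange the data with one row per data point, one column per dimension.
data = np.column_stack([x, y])      # shape (10, 2)

# Step 2: subtract each dimension's mean so every column has mean zero.
means = data.mean(axis=0)           # approximately [1.81, 1.91]
data_adjusted = data - means

print(means)
```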
Step3: Calculate the Covariance
● Calculate the covariance matrix
● The off-diagonal elements of the covariance matrix are positive, so x and y tend to increase together
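A minimal sketch of Step 3 using `numpy.cov`. The data values are the same assumed ten-point set (matched to the stated means of 1.81 and 1.91, not taken from the text), and the (n − 1) divisor that `numpy.cov` uses by default is assumed to be the intended convention.

```python
import numpy as np

# Assumed data set (not from the text); means are 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])
data_adjusted = data - data.mean(axis=0)

# rowvar=False: each column is a dimension, each row is a data point.
cov = np.cov(data_adjusted, rowvar=False)

# cov[0, 0] and cov[1, 1] are the variances of x and y;
# cov[0, 1] == cov[1, 0] is the covariance between x and y.
print(cov)
```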
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix
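● The characteristic equation |ρ - λI| = 0 is solved to find the eigenvalues λ, where ρ is the matrix computed in Step 3
● Then, for each eigenvalue λ, Bx = 0 is solved for the eigenvector x, where B = ρ - λI
● Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues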
eigenvectors =
eigenvalues =
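Step 4 can be sketched with `numpy.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix. The data values are again the assumed ten-point set (not taken from the text), so the numbers produced are not guaranteed to match the slide's. Note that the sign of each eigenvector is arbitrary.

```python
import numpy as np

# Assumed data set (not from the text); means are 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])
data_adjusted = data - data.mean(axis=0)
cov = np.cov(data_adjusted, rowvar=False)

# eigh returns eigenvalues in ascending order and unit-length
# eigenvectors as the COLUMNS of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

for lam, vec in zip(eigenvalues, eigenvectors.T):
    # Each pair satisfies cov @ vec = lam * vec (up to rounding).
    print(lam, vec, np.allclose(cov @ vec, lam * vec))
```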
What does this all mean?
[Figure: plot of the data points with the eigenvectors drawn over them]
Conclusion
● Eigenvectors give us information about the patterns in the data.
● Looking at the graph on the previous slide, see how one of the eigenvectors goes through the middle of the points.
● The second eigenvector tells us about another, weaker pattern in the data.
● So by finding the eigenvectors of the covariance matrix we are able to extract lines that characterize the data.
Step 5: Choosing components and forming a feature vector
● The eigenvector with the highest eigenvalue is the principal component of the data set.
● In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.
● So, once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest.
● This gives the components in order of significance.
Cont’d
● Now, here comes the idea of
dimensionality reduction and data compression
● You can decide to ignore the components
of least significance.
● You do lose some information, but if the eigenvalues of the ignored components are small you don’t lose much.
Cont’d
● We have n dimensions
● So we will find n eigenvectors
● But if we choose only the first p eigenvectors,
● then the final dataset has only p dimensions
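Ordering the components and keeping the first p of them can be sketched as below. The data values are the assumed ten-point set (not taken from the text). The names `feature_vector` and `retained` are illustrative, and the ratio of kept eigenvalues to their total is one common way to judge how much variance survives, not something stated in the lecture.

```python
import numpy as np

# Assumed data set (not from the text); means are 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])
cov = np.cov(data - data.mean(axis=0), rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Order the eigenpairs by eigenvalue, highest to lowest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep only the first p eigenvectors as the columns of the feature vector.
p = 1
feature_vector = eigenvectors[:, :p]      # shape (n, p)

# Fraction of the total variance carried by the kept components.
retained = eigenvalues[:p].sum() / eigenvalues.sum()
print(feature_vector.shape, retained)
```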
Step 6: Deriving the new dataset
● Now, we have chosen
the components
(eigenvectors) that we want to keep.
● We can write them in
form of a matrix of vectors
● In our example we have two eigenvectors, so we have two choices: keep both, or keep only the one with the larger eigenvalue
Choice 1 with two eigenvectors
Cont’d
● To obtain the final dataset, we multiply the transposed matrix of chosen eigenvectors by the transpose of the (mean-adjusted) data matrix.
● Final dataset will have data items in columns and dimensions along rows.
● So we have original data set
represented in terms of the vectors we chose.
[Figure: original data set represented using two eigenvectors]
[Figure: original data set represented using one eigenvector]
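Step 6 can be sketched for both choices. The data values are the assumed ten-point set (not taken from the text), and the names `P2`, `P1`, `Y2` and `Y1` are illustrative. In the earlier notation this is Y = PX, where the rows of P are the chosen eigenvectors and X holds the mean-adjusted data with dimensions along rows.

```python
import numpy as np

# Assumed data set (not from the text); means are 1.81 and 1.91.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])
data = np.column_stack([x, y])
data_adjusted = data - data.mean(axis=0)

cov = np.cov(data_adjusted, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Transposed feature vectors: one eigenvector per ROW.
P2 = eigenvectors.T            # choice 1: keep both eigenvectors
P1 = eigenvectors[:, :1].T     # choice 2: keep only the principal component

# Multiply by the transposed data: data items end up in columns,
# dimensions along rows.
Y2 = P2 @ data_adjusted.T      # shape (2, 10)
Y1 = P1 @ data_adjusted.T      # shape (1, 10)

print(Y2.shape, Y1.shape)
```

Keeping both eigenvectors only rotates the data onto new axes. Keeping one discards the coordinate along the weaker direction.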
PCA – Mathematical Working
● The naïve basis (I) of the input matrix (X) spans a high-dimensional space.
● A change of basis (P) is required so that X can be projected onto a lower-dimensional space that keeps only the significant dimensions.
● A properly selected P will generate a projection Y.
● Use this P to project the correlation matrix. Reduce the number of eigenvectors in P to obtain a reduced-dimension projection.
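One way to see what “a properly selected P” achieves is the sketch below. It assumes P is built from the eigenvectors of the covariance matrix of X, as in the step-by-step procedure, and checks that the projected data Y = PX has uncorrelated dimensions. The 6-dimensional data set is synthetic and generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 6-dimensional data set (an assumption for illustration):
# rows are dimensions, columns are data points. The six dimensions are
# noisy mixtures of only two underlying signals, so they are redundant.
base = rng.normal(size=(2, 200))
mixing = rng.normal(size=(6, 2))
X = mixing @ base + 0.05 * rng.normal(size=(6, 200))
X = X - X.mean(axis=1, keepdims=True)      # subtract the mean per dimension

C_X = np.cov(X)                            # np.cov treats rows as variables
eigenvalues, eigenvectors = np.linalg.eigh(C_X)
order = np.argsort(eigenvalues)[::-1]

# Rows of P are the eigenvectors, most significant first.
P = eigenvectors[:, order].T
Y = P @ X                                  # full change of basis
C_Y = np.cov(Y)                            # off-diagonal entries are ~0

# Fewer eigenvectors in P gives a reduced-dimension projection.
P_reduced = P[:2]
Y_reduced = P_reduced @ X                  # shape (2, 200)

print(np.round(np.diag(C_Y), 4))
```

The diagonal of C_Y holds the eigenvalues in descending order. The trailing values are small because the synthetic data has only two strong underlying directions.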
PCA Procedures
● Step 1
● Get data
● Step 2
● Subtract the mean
● Step 3
● Calculate the covariance matrix
● Step 4
● Calculate the eigenvectors and eigenvalues of the covariance matrix
● Step 5