CSC479 Data Mining


Academic year: 2020


(1)

CSC479

Data Mining

Lecture # 6

Data Preprocessing

(2)

χ2 Correlation Test for Nominal Data

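● A correlation relationship between two nominal attributes, A and B, can be discovered by a χ2 (chi-square) test:

$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

where $o_{ij}$ is the observed count of the joint event $(A = a_i, B = b_j)$ and $e_{ij}$ is the count expected if A and B were independent:

$$e_{ij} = \frac{\mathrm{count}(A = a_i) \times \mathrm{count}(B = b_j)}{n}$$

with n the total number of data tuples.

Example: For the following data, find whether the two attributes are independent or not.

              male   female   Total
Fiction        250      200     450
Non-fiction     50     1000    1050
Total          300     1200    1500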
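
The test for the table above can be sketched in a few lines of Python. This is a minimal sketch, not a library routine: the function name `chi_square` is made up for this example, and each expected count is taken as (row total × column total) / n, the usual independence assumption.

```python
# Observed counts from the table above: rows are Fiction / Non-fiction,
# columns are male / female.
observed = [[250, 200],
            [50, 1000]]

def chi_square(table):
    """Return (chi2, expected) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count of each cell if the two attributes were independent.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return chi2, expected

chi2, expected = chi_square(observed)
print(expected)        # [[90.0, 360.0], [210.0, 840.0]]
print(round(chi2, 2))  # 507.94
```

A 2 x 2 table has (2 - 1)(2 - 1) = 1 degree of freedom, for which the critical value at the 0.001 significance level is 10.828. The computed value is far larger, so the hypothesis of independence is rejected: the two attributes are strongly correlated for this data.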

(3)

Data Preprocessing

● Why preprocess the data?

● Data cleaning

● Data integration

● Data reduction

● Data Transformation and Discretization

● Summary

(4)

Data Reduction

● Reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. Different strategies are:

• Dimensionality Reduction

• Numerosity reduction

• Data Compression

(5)

Dimensionality Reduction (DR)

● Process of reducing the number of random variables or attributes under consideration.

● Two very common methods are Wavelet Transforms and Principal Components Analysis.

● They transform or project the data onto a smaller space.

(6)

Numerosity Reduction

● Replace the original data volume by alternative, smaller forms of data representation.

● Regression and log-linear models (parametric methods)

● Histograms, clustering, sampling and data cube aggregation (nonparametric methods)
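
As a small illustration of a nonparametric method, the sketch below replaces a list of values by an equal-width histogram, so that only the bucket boundaries and counts need to be stored. The function name, the number of buckets and the sample values are all assumptions made for this example.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as (low, high, count) buckets of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    if width == 0:
        # All values are identical: one bucket holds everything.
        return [(lo, hi, len(values))]
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

# Hypothetical attribute values (20 numbers), used only for illustration.
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10,
          12, 14, 14, 15, 15, 18, 20, 21, 25, 30]

buckets = equal_width_histogram(prices, 3)
print([count for _, _, count in buckets])  # [10, 7, 3]
```

Twenty stored values are reduced to three (range, count) entries, at the cost of no longer knowing the exact values inside each bucket.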

(7)

Data Compression

● Transformations are applied to obtain a reduced or compressed representation.

● Lossless: The original data can be reconstructed from the compressed data without any information loss.

● Lossy: Only an approximation of the original data can be reconstructed.
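
The difference between the two kinds of compression can be seen in a short sketch. Here `zlib` from the Python standard library stands in for a lossless method, and rounding to one decimal place stands in for a lossy one; both choices, and the sample readings, are assumptions made only for illustration.

```python
import zlib

# Hypothetical sensor readings, used only for illustration.
readings = [20.125, 20.375, 20.25, 20.125, 20.5, 20.375, 20.25, 20.125]

# Lossless: zlib round-trips the exact bytes, so the data is fully recovered.
raw = ",".join(str(r) for r in readings).encode("ascii")
packed = zlib.compress(raw)
restored = [float(s) for s in zlib.decompress(packed).decode("ascii").split(",")]

# Lossy: keeping one decimal place discards detail that cannot be recovered.
approx = [round(r, 1) for r in readings]

print(restored == readings)  # True
print(approx == readings)    # False
```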

(8)

DR: Principal Components Analysis (PCA)

● Why PCA?

● PCA is a useful statistical technique that has found applications in:

● Face recognition

● Image Compression

● Reducing dimension of data

(9)

PCA Goal:

Removing Dimensional Redundancy

● The major goal of PCA in Data Mining is to remove the “dimensional redundancy” from data.

● What does that mean?

● A typical dataset contains several dimensions (variables) that may or may not correlate.

● Dimensions that correlate vary together.

● The information represented by a set of dimensions with high correlation can be extracted by studying just one dimension that represents the whole set.

● Hence the goal is to reduce the dimensions of a dataset to a smaller set of representative dimensions that do not correlate.

(10)

PCA Goal:

[Figure: a data set with 12 dimensions, Dim 1 to Dim 12]

Analyzing 12-dimensional data is challenging!

(11)

PCA Goal:

[Figure: a data set with 12 dimensions, Dim 1 to Dim 12]

But some dimensions represent redundant information. Can we “reduce” these?

(12)

PCA Goal:

[Figure: a data set with 12 dimensions, Dim 1 to Dim 12]

Let's assume we have a “PCA black box” that can reduce the correlating dimensions.

Pass the 12-dimensional data set through the black box to get a three-dimensional data set.

(13)

PCA Goal:

[Figure: the 12 dimensions, Dim 1 to Dim 12, go into the PCA black box and come out as Dim A, Dim B and Dim C]

Pass the 12-dimensional data set through the black box to get a three-dimensional data set.

Given appropriate reduction, analyzing the reduced data set is much more efficient than analyzing the original “redundant” data.

(14)

Mathematics inside PCA Black box: Bases

● Let's now give the “black box” a mathematical form.

● In linear algebra, the dimensions of a space correspond to a linearly independent set of vectors, called a “basis”, that spans the space, i.e. each point in that space is a linear combination of the basis vectors.

● E.g. consider the simplest example, the standard basis in $\mathbb{R}^n$, consisting of the coordinate axes. Every point in $\mathbb{R}^3$ is a linear combination of the standard basis of $\mathbb{R}^3$.

(15)

● Assume X is the 6-dimensional data set given as input.

• A naïve basis for X is the standard basis B for $\mathbb{R}^6$ (the identity matrix), and hence BX = X.

• Here, we want to find a new (reduced) basis P such that PX = Y.

• Y will be the resultant reduced data set.

[Figure: the data matrix X, with axes labelled “Dimensions” and “Data Points”]
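
The relations BX = X and PX = Y can be tried out on a tiny example. The sketch below uses a 2-dimensional data set instead of the 6-dimensional one for brevity; the data values, the helper `matmul` and the particular row chosen for P are all assumptions made for illustration, not the P that PCA would compute.

```python
def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Hypothetical data set: one row per dimension, one column per data point.
# All three points lie along the direction (0.6, 0.8).
X = [[3.0, 6.0, -3.0],
     [4.0, 8.0, -4.0]]

# Naive basis: the identity matrix leaves the data unchanged (BX = X).
B = [[1.0, 0.0],
     [0.0, 1.0]]
print(matmul(B, X) == X)  # True

# Reduced basis: a single unit-length row, so Y = PX has only one dimension.
P = [[0.6, 0.8]]
Y = matmul(P, X)
print(Y)  # one row, approximately [5, 10, -5]
```

Because every point lies along the direction of P, the single row of Y keeps all the information in X while using half the dimensions.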

(16)

PCA Goal

● Change of Basis

• QUESTION: What is a good choice for P ?

– Let's park this question for now and revisit it after studying some related concepts.

(17)

Background Stats/Maths

● Mean and Standard Deviation

● Variance and Covariance

● Covariance Matrix

(18)

Mean and Standard Deviation

● Mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

● It doesn’t tell us a lot about the data set.

● Different data sets can have the same mean.

● Standard deviation (SD) of a data set is a measure of how spread out the data is.

● Variance is another measure of the spread of data in a data set; it is the square of the standard deviation.
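
A short sketch makes the point that the mean alone says little. It assumes the sample form of the variance (dividing by n - 1), which is one common convention; the two data sets are hypothetical and chosen so that their means coincide.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Sample variance: average squared distance from the mean, using n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def standard_deviation(values):
    return math.sqrt(variance(values))

# Two hypothetical data sets with the same mean but very different spread.
a = [0, 8, 12, 20]
b = [8, 9, 11, 12]

print(mean(a), mean(b))                   # 10.0 10.0
print(round(standard_deviation(a), 4))    # 8.3267
print(round(standard_deviation(b), 4))    # 1.8257
```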

(19)

Covariance

● SD and Variance are 1-dimensional

● 1-D data sets could be:

● Heights of all the people in the room

● Salaries of employees in a company

● Marks in the quiz

● However, many data sets have more than one dimension.

● Our aim is to find any relationship between different dimensions.

● E.g. finding the relationship between students' results and their hours of study.

(20)

Covariance Interpretation

● We have a data set of students' study hours (H) and marks achieved (M).

● We find cov(H,M).

● The exact value of the covariance is not as important as its sign (i.e. positive or negative).

● +ve: both dimensions increase together.

● -ve: as one dimension increases, the other decreases.
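
The sign rule can be checked with a small sketch. The hours and marks below are hypothetical values, and the covariance is computed in its sample form (dividing by n - 1), which is an assumption about the convention in use.

```python
def cov(a, b):
    """Sample covariance of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)

# Hypothetical data: study hours (H) and marks (M) for five students.
H = [2, 4, 6, 8, 10]
M = [50, 60, 65, 80, 95]

print(cov(H, M))        # 55.0  -> positive: marks rise with study hours
print(cov(H, M[::-1]))  # -55.0 -> negative: marks fall as hours rise
```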

(21)

Covariance Matrix

● Covariance is always measured between two dimensions.

● What if we have a data set with more than two dimensions?

● We have to calculate more than one covariance measurement.

● E.g. from a 3-dimensional data set (dimensions x, y, z) we could calculate cov(x,y), cov(x,z) and cov(y,z).

(22)

Covariance Matrix

● The covariance matrix can be used to hold the covariances of all possible pairs of dimensions.

● Since cov(a,b) = cov(b,a), the matrix is symmetrical about the main diagonal.
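
For a 3-dimensional data set the covariance matrix has entry cov(p, q) in the row of dimension p and the column of dimension q, so the diagonal holds the variances. The sketch below builds it for hypothetical data; the function names and values are assumptions, and the sample (n - 1) form of covariance is used.

```python
def cov(a, b):
    """Sample covariance of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (n - 1)

def covariance_matrix(dimensions):
    """Entry [i][j] is the covariance between dimension i and dimension j."""
    return [[cov(p, q) for q in dimensions] for p in dimensions]

# Hypothetical 3-dimensional data set: y grows with x, z shrinks as x grows.
x = [1, 2, 3, 4]
y = [2, 4, 6, 9]
z = [9, 7, 4, 1]

C = covariance_matrix([x, y, z])
for row in C:
    print([round(value, 3) for value in row])

# Symmetry about the main diagonal: cov(a, b) == cov(b, a).
print(all(C[i][j] == C[j][i] for i in range(3) for j in range(3)))  # True
```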

(23)

Eigenvectors

● Consider the two multiplications between a matrix and a vector.

● In the first example, the resulting vector is not a scalar multiple of the original vector.

● Whereas in the second example, the resulting vector is 4 times the original vector.
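
The slide's two multiplications are not reproduced in the text, so the sketch below uses an assumed matrix and two assumed vectors, chosen so that the second product comes out as 4 times the vector, as described above.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def is_scalar_multiple(w, v):
    """True if the 2-D vectors w and v point along the same line."""
    return w[0] * v[1] - w[1] * v[0] == 0

# Assumed example values, not taken from the slide.
A = [[2, 3],
     [2, 1]]

first = mat_vec(A, [1, 3])
second = mat_vec(A, [3, 2])

print(first, is_scalar_multiple(first, [1, 3]))    # [11, 5] False
print(second, is_scalar_multiple(second, [3, 2]))  # [12, 8] True
```

The vector (3, 2) keeps its direction under A and is only stretched by a factor of 4, which is what makes it an eigenvector of A with eigenvalue 4.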

(24)

Eigenvectors and Eigenvalues

● More formally defined

● Let A be an n × n matrix. A nonzero vector v that satisfies

$$A v = \lambda v$$

● for some scalar λ is called an eigenvector of the matrix A, and λ is the eigenvalue corresponding to eigenvector v.

(25)

Principal Component Analysis

● PCA is a technique for identifying patterns in data.

● It is also used to express data in such a way as to highlight similarities and differences.

● PCA is used to reduce the number of dimensions in data while largely preserving the integrity of the information.

(26)

Step by Step

● Step 1:

● We need to have some data for PCA.

● Step 2:

● Mean normalization and feature scaling

● Subtract the mean from each data point.

(27)

Step1 & Step2

● Mean of x = 18.1/10 = 1.81

● Mean of y = 19.1/10 = 1.91
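
The slide's data table is not reproduced in the text. The sketch below therefore uses an assumed set of ten (x, y) points, chosen so that the sums are 18.1 and 19.1 as in the means above, and shows the mean-subtraction step on it.

```python
# Assumed data set: ten points whose sums match the 18.1 and 19.1 used above.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
print(round(mean_x, 2), round(mean_y, 2))  # 1.81 1.91

# Step 2: subtract the mean of each dimension from every data point.
x_adjusted = [v - mean_x for v in x]
y_adjusted = [v - mean_y for v in y]

# After the subtraction, each dimension has mean zero (up to rounding error).
print(abs(sum(x_adjusted)) < 1e-9, abs(sum(y_adjusted)) < 1e-9)  # True True
```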

(28)

Step3: Calculate the Covariance

● Calculate the covariance matrix

● The off-diagonal elements of the covariance matrix are positive, so x and y increase together.

(29)

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix

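● The characteristic equation $|\rho - \lambda I| = 0$, where ρ is the covariance matrix, is used to find the eigenvalues λ.

● Then, for each eigenvalue, $Bx = 0$ with $B = \rho - \lambda I$ is solved to find the corresponding eigenvector x.

● Since the covariance matrix is square, we can calculate the eigenvectors and eigenvalues of the matrix.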

[The computed eigenvector and eigenvalue matrices from the slide are not reproduced here.]
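
For a symmetric 2 × 2 matrix the two steps above can be written out directly: the characteristic equation is a quadratic in λ, and each eigenvector follows from one row of (matrix - λI)x = 0. The sketch below is a minimal illustration; the function name `eig_sym_2x2` and the test matrix are assumptions, and for larger matrices a numerical library would normally be used instead.

```python
import math

def eig_sym_2x2(m):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    trace, det = a + d, a * d - b * b
    # Characteristic equation: lambda^2 - trace * lambda + det = 0.
    root = math.sqrt(trace * trace - 4 * det)
    values = [(trace + root) / 2, (trace - root) / 2]
    if b == 0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
        return values, vectors
    vectors = []
    for lam in values:
        # First row of (m - lam * I) x = 0 gives x proportional to (b, lam - a).
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        vectors.append((vx / norm, vy / norm))
    return values, vectors

# Hypothetical symmetric matrix, chosen so the result is easy to check by hand.
values, vectors = eig_sym_2x2([[2.0, 1.0], [1.0, 2.0]])
print(values)   # [3.0, 1.0]
print(vectors)  # approximately (0.707, 0.707) and (0.707, -0.707)
```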

(30)

What does this all mean?

[Figure: plot of the data points with the eigenvectors drawn through them]

(31)

Conclusion

● Eigenvectors give us information about the patterns in the data.

● In the graph on the previous slide, see how one of the eigenvectors goes through the middle of the points.

● The second eigenvector tells us about another, weaker pattern in the data.

● So by finding the eigenvectors of the covariance matrix we are able to extract lines that characterize the data.

(32)

Step 5: Choosing components and forming a feature vector

● The eigenvector with the highest eigenvalue is the principal component of the data set.

● In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.

● So, once the eigenvectors are found, the next step is to order them by eigenvalue, highest to lowest.

● This gives the components in order of significance.

(33)

Cont’d

● Now, here comes the idea of dimensionality reduction and data compression.

● You can decide to ignore the components of least significance.

● You do lose some information, but if the eigenvalues are small you don’t lose much.

(34)

Cont’d

● Suppose the data has n dimensions.

● Then we will find n eigenvectors.

● But if we choose only the first p eigenvectors,

● then the final data set has only p dimensions.
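
Choosing the first p eigenvectors can be sketched as a sort followed by a cut. The function name and the rounded eigenvalue and eigenvector figures below are hypothetical inputs for illustration; the share of the total eigenvalue sum that is kept gives a rough measure of how little information is lost.

```python
def top_components(eigenpairs, p):
    """Keep the p eigenvectors with the largest eigenvalues.

    eigenpairs is a list of (eigenvalue, eigenvector) tuples.
    Returns the kept eigenvectors and the fraction of total variance retained.
    """
    ordered = sorted(eigenpairs, key=lambda pair: pair[0], reverse=True)
    total = sum(value for value, _ in ordered)
    kept = ordered[:p]
    retained = sum(value for value, _ in kept) / total
    return [vector for _, vector in kept], retained

# Hypothetical (eigenvalue, eigenvector) pairs for a 2-dimensional data set.
pairs = [(0.049, (-0.735, 0.678)),
         (1.284, (0.678, 0.735))]

feature_vector, retained = top_components(pairs, p=1)
print(feature_vector)      # [(0.678, 0.735)]
print(round(retained, 3))  # 0.963
```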

(35)

Step 6: Deriving the new dataset

● Now, we have chosen the components (eigenvectors) that we want to keep.

● We can write them in the form of a matrix of vectors.

● In our example we have two eigenvectors, so we have two choices: keep both, or keep only the one with the larger eigenvalue.

Choice 1: a feature vector with two eigenvectors

(36)

Cont’d

● To obtain the final data set, we multiply the transposed feature-vector matrix (eigenvectors in rows) by the transposed mean-adjusted data matrix (data items in columns).

● The final data set will have data items in columns and dimensions along rows.

● So we have the original data set represented in terms of the vectors we chose.

(37)

Original data set represented using two eigenvectors.

(38)

Original data set represented using one eigenvector.
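
Steps 2 to 6 can be put together in one short sketch. The ten data points are an assumption (the slide's table is not reproduced in the text; the values are chosen so that the sums are 18.1 and 19.1), the sample (n - 1) covariance is assumed, and the closed-form eigenvector solution is only valid for a 2 × 2 covariance matrix with a nonzero off-diagonal entry, as is the case here.

```python
import math

# Assumed data set (see the note above); one list per dimension.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
n = len(x)

# Step 2: subtract the mean of each dimension.
mean_x, mean_y = sum(x) / n, sum(y) / n
xa = [v - mean_x for v in x]
ya = [v - mean_y for v in y]

# Step 3: sample covariance matrix [[a, b], [b, d]].
a = sum(u * u for u in xa) / (n - 1)
d = sum(v * v for v in ya) / (n - 1)
b = sum(u * v for u, v in zip(xa, ya)) / (n - 1)

# Step 4: eigenvalues from the characteristic equation, largest first.
half_trace = (a + d) / 2
root = math.sqrt(((a - d) / 2) ** 2 + b * b)
lam1, lam2 = half_trace + root, half_trace - root

def unit_eigenvector(lam):
    # Solves (a - lam) * vx + b * vy = 0; valid because b is nonzero here.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

v1, v2 = unit_eigenvector(lam1), unit_eigenvector(lam2)

# Steps 5 and 6: eigenvectors as rows, multiplied by the mean-adjusted data
# with data items in columns.
def project(feature_rows):
    return [[r[0] * u + r[1] * v for u, v in zip(xa, ya)] for r in feature_rows]

two_d = project([v1, v2])  # both eigenvectors: 2 rows x 10 columns
one_d = project([v1])      # principal component only: 1 row x 10 columns

print(round(lam1, 4), round(lam2, 4))
print([round(t, 3) for t in one_d[0]])
```

Keeping both eigenvectors only rotates the data onto new axes, while keeping one reduces each point to a single coordinate along the principal component. The sign of an eigenvector is arbitrary, so another implementation may return the same coordinates with opposite signs.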

(39)

PCA – Mathematical Working

● The naïve basis (I) of the input matrix (X) spans a high-dimensional space.

● A change of basis (P) is required so that X can be projected onto a lower-dimensional space that has the significant dimensions only.

● A properly selected P will generate a projection Y.

● Use this P to project the correlation matrix. Reduce the number of eigenvectors in P to obtain a reduced-dimension projection.

(40)

PCA Procedures

● Step 1

● Get data

● Step 2

● Subtract the mean

● Step 3

● Calculate the covariance matrix

● Step 4

● Calculate the eigenvectors and eigenvalues of the covariance matrix

● Step 5

● Choose components and form a feature vector

● Step 6

● Derive the new data set
