
DIMENSION REDUCTION

4.2 PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) is a popular unsupervised data-processing and dimension reduction technique. PCA attempts to reduce the dimension of the data set by identifying and using informative linear combinations of the original predictors, instead of using the $p$ predictors themselves. In order to reduce the dimension of the original data set, the number of linear combinations must be substantially less than $p$. According to Johnson and Wichern (2007:430) the aim of PCA is to find the linear combinations which maximise the variation in the data set.

Principal components are entirely dependent on the covariance (or correlation) matrix of the original variables. There are several ways to explain principal component analysis; here the singular value decomposition (SVD) of the centred input matrix $X$ is used to express the principal components of the variables in $X$, as explained by Gosh (2002). The SVD of the $N \times p$ matrix $X$ (centred so that each column has mean zero) has the form

$$X = UDV^T. \tag{4.1}$$

Here $U$ is an $N \times p$ matrix with orthonormal columns ($U^TU = I_p$) that span the column space of $X$, and $V$ is a $p \times p$ orthogonal matrix ($V^TV = I_p$) whose columns span the row space of $X$. $D$ is a $p \times p$ diagonal matrix with diagonal entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$, known as the singular values of $X$. Note that if one or more $d_j = 0$, then $X$ is singular, that is, it does not have full column rank.

The sample covariance matrix is given by $S = \frac{1}{N}X^TX$, and from the SVD of $X$ in (4.1) the eigen-decomposition of $X^TX$ can be written as

$$X^TX = VD^2V^T,$$

which implies that

$$(X^TX)V = VD^2.$$

In the above equation, the columns $\mathbf{v}_1, \ldots, \mathbf{v}_p$ of $V$ are the eigenvectors of $X^TX$, with corresponding eigenvalues $d_1^2 \ge d_2^2 \ge \cdots \ge d_p^2 \ge 0$. The ordered sequence of eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_p$ defines the principal component directions of $X$, and $d_1^2 \ge d_2^2 \ge \cdots \ge d_p^2$ represent, up to the factor $1/N$, the variances of the principal components (Hastie et al., 2009:66). For any given data set there are at most $\min(N - 1, p)$ principal components.
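As a hedged numerical illustration of the eigen-decomposition above, the sketch below (Python with NumPy, on hypothetical random data) compares the squared singular values of a centred matrix with the eigenvalues of its cross-product matrix, and shows that centring a matrix with only $N = 3$ rows leaves at most $N - 1$ non-trivial components.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 40, 3

# Hypothetical centred data matrix.
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Cross-product matrix X^T X and both sides of (X^T X) V = V D^2.
G = X.T @ X
lhs = G @ V
rhs = V @ np.diag(d**2)
print(np.allclose(lhs, rhs))

# eigvalsh returns eigenvalues in ascending order, so reverse them.
eigenvalues = np.linalg.eigvalsh(G)[::-1]
print(np.allclose(eigenvalues, d**2))

# With N = 3 rows, centring removes one degree of freedom,
# so the rank cannot exceed min(N - 1, p).
X_wide = rng.normal(size=(3, 5))
X_wide = X_wide - X_wide.mean(axis=0)
rank_wide = np.linalg.matrix_rank(X_wide)
print(rank_wide)
```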

The principal components (or principal component scores) of $X$ are given by

$$\mathbf{z}_j = X\mathbf{v}_j = UDV^T\mathbf{v}_j = d_j\mathbf{u}_j \quad \text{for } j = 1, \ldots, p. \tag{4.2}$$

Here $\mathbf{z}_j$ is a vector of length $N$ and the columns of $U$ are the vectors $\mathbf{u}_j$, which are ordered to ensure that $d_1^2 \ge d_2^2 \ge \cdots \ge d_p^2$.
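The two expressions for the scores in (4.2) can be compared directly. This is a small sketch in Python with NumPy on hypothetical random data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical centred data: N = 30 observations, p = 3 predictors.
X = rng.normal(size=(30, 3))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Scores computed as z_j = X v_j (all columns at once).
Z_from_directions = X @ V

# Scores computed as z_j = d_j u_j; broadcasting scales column j of U by d[j].
Z_from_svd = U * d

print(Z_from_directions.shape)
print(np.allclose(Z_from_directions, Z_from_svd))
```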

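The first principal component of $X$ is the normalised linear combination of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$ that has the largest sample variance,

$$\mathbf{z}_1 = X\mathbf{v}_1 = v_{11}\mathbf{x}_1 + v_{21}\mathbf{x}_2 + \cdots + v_{p1}\mathbf{x}_p.$$

Here $\mathbf{v}_1$ is the first principal component direction, which is normalised so that the sum of squares of its elements equals 1 ($\mathbf{v}_1^T\mathbf{v}_1 = 1$ or $\sum_{j=1}^{p} v_{j1}^2 = 1$). The vector $\mathbf{v}_1$ defines the direction in the feature space along which the data varies the most.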
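The maximising property of the first direction can be illustrated empirically. The sketch below (Python with NumPy, hypothetical correlated data) compares the sample variance along the first principal component direction with the variance along 1000 randomly drawn unit-length directions; the mixing matrix `A` and the helper `sample_variance_along` are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 200, 3

# Hypothetical correlated data: independent normals mixed by a fixed matrix.
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.3]])
X = rng.normal(size=(N, p)) @ A
X = X - X.mean(axis=0)

_, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]  # first principal component direction


def sample_variance_along(direction):
    """Sample variance (divisor N) of the linear combination X @ direction."""
    return np.sum((X @ direction) ** 2) / N


# Random unit-length directions for comparison.
directions = rng.normal(size=(1000, p))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
random_variances = np.array([sample_variance_along(v) for v in directions])

var_v1 = sample_variance_along(v1)
print(np.isclose(np.sum(v1**2), 1.0))      # v1 has unit length
print(var_v1 >= random_variances.max())    # no random direction does better
```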

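Hastie et al. (2009:66) indicate that the sample variance of $\mathbf{z}_1$ is

$$\operatorname{Var}(\mathbf{z}_1) = \operatorname{Var}(X\mathbf{v}_1) = \frac{1}{N}\mathbf{v}_1^TX^TX\mathbf{v}_1 = \frac{d_1^2}{N}.$$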

It is seen that this variance depends directly on the largest eigenvalue, $d_1^2$. The second principal component, $\mathbf{z}_2$, is the linear combination of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$ that has maximal variance out of all the linear combinations that are uncorrelated with $\mathbf{z}_1$. The constraint that $\mathbf{z}_2$ must be uncorrelated with $\mathbf{z}_1$ is equivalent to the constraint that the direction $\mathbf{v}_2$ must be orthogonal (perpendicular) to the direction $\mathbf{v}_1$ ($\mathbf{v}_1^T\mathbf{v}_2 = 0$). Similarly, the subsequent principal components $\mathbf{z}_j$ have maximum variance, $d_j^2/N$, subject to being orthogonal to the previous principal components. Hence, the small eigenvalues $d_j^2$ correspond to directions in the column space of $X$ having small variance.
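The variance and orthogonality statements above can be verified on a small example. This is a sketch in Python with NumPy on hypothetical data; it checks that each score vector has sample variance equal to its squared singular value divided by $N$, that the scores are mutually uncorrelated, and that the first two directions are orthogonal.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 100, 4

# Hypothetical correlated, centred data.
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Z = X @ V  # principal component scores, one column per component

# Sample variances with divisor N (NumPy's default ddof=0).
variances = Z.var(axis=0)
print(np.allclose(variances, d**2 / N))

# The covariance matrix of the scores is diagonal: the scores are uncorrelated.
cov_Z = Z.T @ Z / N
print(np.allclose(cov_Z, np.diag(d**2 / N)))

# The first two directions are orthogonal.
print(np.isclose(V[:, 0] @ V[:, 1], 0.0))
```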

Summarising, the principal components $\mathbf{z}_1, \ldots, \mathbf{z}_M$ ($M < p$) are uncorrelated and have variances proportional to the eigenvalues $d_j^2$ of $X^TX$. The proportion of the total variation in the sample explained by the $a$th principal component can be expressed as

$$\frac{d_a^2}{d_1^2 + \cdots + d_p^2}. \tag{4.3}$$

Summing (4.3) over the first $m$ principal components gives the proportion of the total variance that is explained by the first $m$ principal components. This is denoted by $\omega_m$, i.e.

$$\omega_m = \frac{d_1^2 + d_2^2 + \cdots + d_m^2}{d_1^2 + d_2^2 + \cdots + d_p^2}. \tag{4.4}$$
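The quantities in (4.3) and (4.4) follow directly from the singular values. The sketch below uses Python with NumPy and hypothetical data whose columns are given different scales; `proportion` and `omega` are illustrative names for the individual and cumulative proportions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data: five predictors with decreasing spread.
X = rng.normal(size=(60, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)

# Only the singular values are needed here.
d = np.linalg.svd(X, compute_uv=False)

proportion = d**2 / np.sum(d**2)  # (4.3) for a = 1, ..., p
omega = np.cumsum(proportion)     # (4.4) for m = 1, ..., p

print(np.round(proportion, 3))
print(np.round(omega, 3))
```

By construction the last entry of `omega` equals 1, since all $p$ components together account for the total variance.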

PCA eliminates the problem of collinearity between the variables in a data set, since the principal components are by definition uncorrelated (Johnson and Wichern, 2007:430). It is interesting to note that if the original predictor variables are highly correlated, then the proportion of the total variance explained by the first $m$ principal components is close to 1 even for small values of $m$.
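The effect of correlation on the explained variance can be seen in a small simulation. The following sketch (Python with NumPy) is an illustration under stated assumptions: the correlated predictors are generated as one hypothetical common factor plus a small amount of independent noise, and the helper `omega` is a name introduced here for the cumulative proportion in (4.4).

```python
import numpy as np


def omega(X):
    """Cumulative proportion of variance explained, as in (4.4)."""
    Xc = X - X.mean(axis=0)
    d = np.linalg.svd(Xc, compute_uv=False)
    return np.cumsum(d**2) / np.sum(d**2)


rng = np.random.default_rng(6)
N, p = 200, 10

# Highly correlated predictors: a shared factor plus small independent noise.
common = rng.normal(size=(N, 1))
X_correlated = common + 0.1 * rng.normal(size=(N, p))

# Independent predictors for comparison.
X_independent = rng.normal(size=(N, p))

omega_correlated = omega(X_correlated)
omega_independent = omega(X_independent)

print(np.round(omega_correlated[:3], 3))
print(np.round(omega_independent[:3], 3))
```

In this setting the first component of the correlated data already explains nearly all of the variance, whereas for the independent data the explained proportion grows only gradually with $m$.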

An important step in PCA is to decide how many principal components to retain. Although there is no set rule to aid in this decision, there are several guidelines as proposed by Rencher (2002:434):

1. Retain sufficient components to account for a specified proportion of the total variance (the threshold value), say by letting $\omega_{\text{frac}} = 0.9$.

2. Exclude principal components whose eigenvalues are less than the overall average of the eigenvalues, calculated as

$$\overline{d^2} = \frac{1}{p}\sum_{j=1}^{p} d_j^2.$$

3. Represent the principal components in a scree plot, by plotting $d_j^2$ against $j$. A scree plot indicates the appropriate number of components to retain by an elbow (bend) in the plot which separates the large eigenvalues from the small eigenvalues.

4. Use significance tests to test if the “larger” principal components are indeed significant.

By requiring $\omega_m > \omega_{\text{frac}}$ (where $\omega_{\text{frac}}$ denotes the desired proportion of the total variance to be explained), the number of principal components, $m$, is determined, and these $m$ components are used to replace the original $p$ predictor variables.
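The first two guidelines translate into simple computations on the eigenvalues. The sketch below (Python with NumPy) is one possible reading, not the thesis implementation: the function names are hypothetical, the eigenvalues are made-up numbers assumed to be sorted in decreasing order, the threshold rule is taken to select the smallest $m$ for which the cumulative proportion exceeds the threshold, and the threshold is assumed to be below 1.

```python
import numpy as np


def n_components_by_threshold(eigenvalues, omega_frac=0.9):
    """Smallest m whose cumulative proportion exceeds omega_frac (assumed < 1)."""
    omega = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    # argmax returns the index of the first True entry.
    return int(np.argmax(omega > omega_frac)) + 1


def n_components_by_average(eigenvalues):
    """Number of eigenvalues that are not below the average eigenvalue."""
    return int(np.sum(eigenvalues >= np.mean(eigenvalues)))


# Hypothetical eigenvalues d_j^2, sorted in decreasing order.
eigenvalues = np.array([6.0, 2.5, 1.0, 0.3, 0.2])

m_threshold = n_components_by_threshold(eigenvalues, omega_frac=0.9)
m_average = n_components_by_average(eigenvalues)

print(m_threshold)  # cumulative proportions are 0.60, 0.85, 0.95, ...
print(m_average)    # the average eigenvalue is 2.0
```

The two rules need not agree: for these eigenvalues the threshold rule retains three components and the average rule retains two.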

The first of these guidelines is the method implemented in this thesis to determine how many principal components should be retained to summarise the original data effectively. The main drawback associated with this method is the difficulty of determining the threshold value $\omega_{\text{frac}}$. Setting this value too high runs the risk of retaining principal components that are sample or variable specific (Rencher, 2002:434). Nevertheless, $\omega_{\text{frac}}$ will be set to 0.9 for the data exploration in this thesis.

When comparing dimension reduction by means of PCA to variable selection, the advantage of variable selection is that variables that do not discriminate between groups are removed and need not be observed for future purposes, as they are no longer considered important. In dimension reduction, however, all the variables remain in the model, since linear combinations of all the variables are used. Therefore, variable selection techniques yield an easier interpretation of the important predictors than dimension reduction by means of PCA.

PCA can also be used as a tool for data visualisation of both the observations and the variables, as it provides a low-dimensional representation of the data that captures most of the information in the data, which is typically very complex in high-dimensional settings (James et al., 2013:375).

In the experimental work reported later in the thesis, PCA will be implemented using the preProcess() function in the caret package in R. PCA will be applied to the original high-dimensional data sets as a dimension reduction step, and then KNN and SVM classification procedures will be implemented using the ordinary principal components, $\mathbf{z}_1, \ldots, \mathbf{z}_M$ ($M < p$), extracted from the data as inputs.
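The general workflow of reducing the dimension with PCA and then classifying on the retained components can be sketched as follows. This is an illustration in Python with NumPy only; it does not reproduce the caret preProcess() function or the thesis experiments, it covers KNN but not SVM, and the helpers `fit_pca` and `knn_predict`, the simulated two-class data, and the choice $k = 3$ are all assumptions made for the example. The directions and column means are estimated on the training data and then applied unchanged to the test data.

```python
import numpy as np


def fit_pca(X_train, omega_frac=0.9):
    """Return training column means and the first m principal component directions."""
    mean = X_train.mean(axis=0)
    _, d, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    omega = np.cumsum(d**2) / np.sum(d**2)
    m = int(np.argmax(omega > omega_frac)) + 1  # smallest m exceeding the threshold
    return mean, Vt[:m].T


def knn_predict(Z_train, y_train, Z_test, k=3):
    """Classify each test row by majority vote among its k nearest training rows."""
    # Squared Euclidean distances between every test row and every training row.
    diff = Z_test[:, None, :] - Z_train[None, :, :]
    dist2 = np.sum(diff**2, axis=2)
    nearest = np.argsort(dist2, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])


rng = np.random.default_rng(4)
p = 20


def make_class(n, shift):
    """Hypothetical class: standard normal predictors shifted by a constant."""
    return rng.normal(size=(n, p)) + shift


X_train = np.vstack([make_class(30, 0.0), make_class(30, 3.0)])
y_train = np.repeat([0, 1], 30)
X_test = np.vstack([make_class(10, 0.0), make_class(10, 3.0)])
y_test = np.repeat([0, 1], 10)

# Fit PCA on the training data only, then project both sets.
mean, V_m = fit_pca(X_train, omega_frac=0.9)
Z_train = (X_train - mean) @ V_m
Z_test = (X_test - mean) @ V_m

y_pred = knn_predict(Z_train, y_train, Z_test, k=3)
accuracy = np.mean(y_pred == y_test)
print(V_m.shape[1], accuracy)
```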

This section explained PCA using the SVD of $X = [\mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_p]$; however, a modified weighted PCA could be performed by implementing the SVD of $XD_\theta = [d_1(\theta)\mathbf{x}_1 \; d_2(\theta)\mathbf{x}_2 \; \cdots \; d_p(\theta)\mathbf{x}_p]$. Supervised principal components fall within this framework, using $d_j(\theta) = \operatorname{Ind}(|\hat{\beta}_j| \ge \theta)$. Supervised principal components are explained in detail in the following section.
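The weighted framework can be sketched numerically before the detailed treatment. The example below (Python with NumPy) rests on assumptions that this section does not state: the coefficient for predictor $j$ is taken to be the univariate least-squares coefficient of a centred response `y` on that predictor, the response and data are simulated, and the threshold value 1.5 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)
N, p = 200, 6

# Hypothetical centred predictors; only the first two drive the response.
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)
y = 3.0 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * rng.normal(size=N)
y = y - y.mean()

# ASSUMPTION: beta_hat[j] is the univariate least-squares coefficient of y on x_j.
beta_hat = (X.T @ y) / np.sum(X**2, axis=0)

# Indicator weights: 1 if |beta_hat_j| >= theta, otherwise 0.
theta = 1.5
weights = (np.abs(beta_hat) >= theta).astype(float)

# Multiplying column j of X by its weight gives the weighted matrix.
X_weighted = X * weights

U, d, Vt = np.linalg.svd(X_weighted, full_matrices=False)
z1 = d[0] * U[:, 0]  # first principal component of the weighted matrix

print(weights)
print(np.round(d, 3))
```

Columns with weight zero contribute nothing, so the weighted matrix has at most as many non-zero singular values as there are selected predictors.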