DIMENSION REDUCTION

4.2 PRINCIPAL COMPONENT ANALYSIS

Principal component analysis is a popular unsupervised data-processing and dimension reduction technique. PCA attempts to reduce the dimension of the data set by identifying and using informative linear combinations of the original predictors, instead of using the 𝑝 predictors themselves. In order to reduce the dimension of the original data set, the number of linear combinations must be substantially less than 𝑝. According to Johnson and Wichern (2007:430) the aim of PCA is to find the linear combinations which maximise the variation in the data set.

Principal components are entirely dependent on the covariance (or correlation) matrix of the original variables. There are several ways to explain principal component analysis; here the singular value decomposition (SVD) of the centred input matrix $X$ is used to express the principal components of the variables in $X$, as explained by Gosh (2002). The SVD of the $N \times p$ matrix $X$ (centred so that every column has mean zero) has the form

$$X = U D V^{T}. \tag{4.1}$$

Here $U$ is an $N \times p$ matrix with orthonormal columns ($U^{T}U = I_p$) that span the column space of $X$, and $V$ is a $p \times p$ orthogonal matrix ($V^{T}V = I_p$) whose columns span the row space of $X$. $D$ is a $p \times p$ diagonal matrix with diagonal entries $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$, known as the singular values of $X$. Note that if one or more of the $d_j$ equal zero, then $X$ is singular, that is, it does not have full column rank.
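The decomposition in (4.1) can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a small synthetic data matrix; the data, the seed and all variable names are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)            # synthetic data, purely illustrative
N, p = 50, 4
X_raw = rng.normal(size=(N, p))
X = X_raw - X_raw.mean(axis=0)            # centre every column to mean zero

# Thin SVD: U is N x p, d holds the p singular values, Vt is V transposed
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
D = np.diag(d)

# Properties stated in the text
columns_orthonormal = np.allclose(U.T @ U, np.eye(p))   # U^T U = I_p
V_orthogonal = np.allclose(V.T @ V, np.eye(p))          # V^T V = I_p
reconstructs_X = np.allclose(X, U @ D @ V.T)            # X = U D V^T
sorted_descending = bool(np.all(np.diff(d) <= 0))       # d_1 >= ... >= d_p >= 0

print(U.shape, D.shape, V.shape)
print(columns_orthonormal, V_orthogonal, reconstructs_X, sorted_descending)
```

Passing `full_matrices=False` requests the thin SVD, which matches the $N \times p$ shape of $U$ used in the text.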

The sample covariance matrix is given by $S = \frac{1}{N}X^{T}X$, and from the SVD of $X$ in (4.1) the eigen-decomposition of $X^{T}X$ can be written as

$$X^{T}X = V D^{2} V^{T},$$

which implies that

$$(X^{T}X)V = V D^{2}.$$

In the above equation, the columns $\mathbf{v}_1, \dots, \mathbf{v}_p$ of $V$ are the eigenvectors of $X^{T}X$, with corresponding eigenvalues $d_1^2 \ge d_2^2 \ge \dots \ge d_p^2 \ge 0$. The ordered sequence of eigenvectors $\mathbf{v}_1, \dots, \mathbf{v}_p$ defines the principal component directions of $X$, and the eigenvalues $d_1^2 \ge d_2^2 \ge \dots \ge d_p^2$, divided by $N$, give the variances of the principal components (Hastie et al., 2009:66). For any given data set there are at most $\min(N - 1, p)$ principal components.
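Both the eigenvector relation and the bound of min(N − 1, p) components can be illustrated numerically. The sketch below assumes synthetic data with fewer observations than predictors; the sizes, the seed and the tolerance are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 5, 8                               # fewer observations than predictors
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)                    # centring removes one degree of freedom

# With N < p the thin SVD returns only min(N, p) singular values,
# so V here is p x N rather than p x p.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Every column of V is an eigenvector of X^T X with eigenvalue d_j^2
lhs = (X.T @ X) @ V
rhs = V * d**2                            # same as V @ diag(d^2)
eigen_relation_holds = np.allclose(lhs, rhs)

# Count the components whose singular value is not numerically zero
tol = 1e-10 * d.max()
n_components = int(np.sum(d > tol))

print(eigen_relation_holds, n_components, min(N - 1, p))
```

Because the columns of the centred matrix sum to zero, its rank cannot exceed N − 1, which is why one singular value is numerically zero here.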

The principal components (or principal component scores) of $X$ are given by:

$$\mathbf{z}_j = X\mathbf{v}_j = U D V^{T}\mathbf{v}_j = d_j\mathbf{u}_j \quad \text{for } j = 1, \dots, p. \tag{4.2}$$

Here $\mathbf{z}_j$ is a vector of length $N$ and the columns of $U$ are the vectors $\mathbf{u}_j$, which are ordered to ensure that $d_1^2 \ge d_2^2 \ge \dots \ge d_p^2$.
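The two expressions for the scores in (4.2) can be compared directly. This is a minimal NumPy sketch on synthetic data; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
X = X - X.mean(axis=0)                    # centred input matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

Z_from_V = X @ V                          # column j is z_j = X v_j
Z_from_U = U * d                          # column j is d_j u_j

scores_agree = np.allclose(Z_from_V, Z_from_U)
print(scores_agree)
```

In practice the second form is cheaper, since it reuses the SVD output instead of multiplying by $X$ again.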

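The first principal component of $X$ is the normalised linear combination of $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p$ that has the largest sample variance,

$$\mathbf{z}_1 = X\mathbf{v}_1 = v_{11}\mathbf{x}_1 + v_{21}\mathbf{x}_2 + \dots + v_{p1}\mathbf{x}_p.$$

Here $\mathbf{v}_1$ is the first principal component direction, which is normalised so that its sum of squares equals 1 ($\mathbf{v}_1^{T}\mathbf{v}_1 = 1$, or $\sum_{j=1}^{p} v_{j1}^2 = 1$). The vector $\mathbf{v}_1$ defines the direction in the feature space along which the data varies the most.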

Hastie et al. (2009:66) indicate that the sample variance of $\mathbf{z}_1$ is:

$$\mathrm{Var}(\mathbf{z}_1) = \mathrm{Var}(X\mathbf{v}_1) = \frac{1}{N}\mathbf{v}_1^{T}X^{T}X\mathbf{v}_1 = \frac{d_1^2}{N}.$$

It is seen that this variance depends directly on the largest eigenvalue. The second principal component, $\mathbf{z}_2$, is the linear combination of $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_p$ that has maximal variance out of all the linear combinations that are uncorrelated with $\mathbf{z}_1$. The constraint that $\mathbf{z}_2$ must be uncorrelated with $\mathbf{z}_1$ is equivalent to the constraint that the direction $\mathbf{v}_2$ must be orthogonal (perpendicular) to the direction $\mathbf{v}_1$ ($\mathbf{v}_1^{T}\mathbf{v}_2 = 0$). Similarly, the subsequent principal components $\mathbf{z}_j$ have maximum variance, $d_j^2/N$, subject to being orthogonal to the previous principal components. Hence, the small eigenvalues $d_j^2$ correspond to directions in the column space of $X$ having small variance.
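The link between orthogonal directions and uncorrelated components can be made explicit with a short derivation. Because the columns of $X$ have mean zero, every $\mathbf{z}_j$ also has mean zero, so the sample covariance between two components (using the divisor $N$, as in $S$) follows from the eigen-decomposition of $X^{T}X$:

```latex
\begin{aligned}
\frac{1}{N}\mathbf{z}_j^{T}\mathbf{z}_k
  &= \frac{1}{N}\mathbf{v}_j^{T}X^{T}X\mathbf{v}_k
   = \frac{1}{N}\mathbf{v}_j^{T}\left(V D^{2} V^{T}\right)\mathbf{v}_k \\
  &= \frac{d_k^{2}}{N}\,\mathbf{v}_j^{T}\mathbf{v}_k
   = \begin{cases} d_j^{2}/N, & j = k, \\ 0, & j \neq k. \end{cases}
\end{aligned}
```

The case $j = k$ recovers the variance $d_j^2/N$, and the case $j \neq k$ shows that orthogonality of the directions gives zero covariance between the components.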

Summarising, the principal components $\mathbf{z}_1, \dots, \mathbf{z}_M$ ($M < p$) are uncorrelated and have variances proportional to the eigenvalues $d_1^2, \dots, d_M^2$ of $X^{T}X$. The proportion of the total variation in the sample explained by the $a$th principal component can be expressed as:

$$\frac{d_a^2}{d_1^2 + \dots + d_p^2}. \tag{4.3}$$

Calculating the sum of (4.3) up to principal component $m$ gives an expression for the proportion of the total variance that is explained by the first $m$ principal components. This is denoted by $\omega_m$, i.e.:

$$\omega_m = \frac{d_1^2 + d_2^2 + \dots + d_m^2}{d_1^2 + d_2^2 + \dots + d_p^2}. \tag{4.4}$$
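As a small worked example of (4.3) and (4.4), the sketch below computes the individual and cumulative proportions for a hypothetical set of eigenvalues. The function names and the eigenvalues are assumptions made for illustration.

```python
from itertools import accumulate

def explained_proportions(d_squared):
    """Proportion of total variance per component, as in (4.3)."""
    total = sum(d_squared)
    return [value / total for value in d_squared]

def cumulative_proportions(d_squared):
    """omega_m for m = 1, ..., p, as in (4.4)."""
    total = sum(d_squared)
    return [running / total for running in accumulate(d_squared)]

# Hypothetical eigenvalues d_1^2 >= ... >= d_4^2
d_squared = [4.0, 3.0, 2.0, 1.0]

props = explained_proportions(d_squared)   # [0.4, 0.3, 0.2, 0.1]
omega = cumulative_proportions(d_squared)  # [0.4, 0.7, 0.9, 1.0]
print(props)
print(omega)
```

The cumulative sums are formed before dividing by the total, which keeps floating-point rounding out of the running sum.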

PCA eliminates the problem of collinearity between the variables in a data set, since the principal components are by definition uncorrelated (Johnson and Wichern, 2007:430). If the original predictor variables are highly correlated, the proportion of the total variance explained by the first $m$ principal components is close to 1 even for small values of $m$.

An important step in PCA is to decide how many principal components to retain. Although there is no set rule to aid in this decision, there are several guidelines as proposed by Rencher (2002:434):

1. Retain sufficient components to account for a specified proportion of the total variance (the threshold value), say by letting $\omega_{\mathrm{frac}} = 0.9$.

2. Exclude principal components whose eigenvalues are less than the overall average of the eigenvalues, calculated as:

$$\overline{d^2} = \frac{1}{p}\sum_{j=1}^{p} d_j^2.$$

3. Represent the principal components in a scree plot, by plotting $d_j^2$ against $j$. A scree plot indicates the appropriate number of components to retain by an elbow (bend) in the plot, which separates the large eigenvalues from the small eigenvalues.

4. Use significance tests to test if the “larger” principal components are indeed significant.

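By requiring $\omega_m > \omega_{\mathrm{frac}}$ (where $\omega_{\mathrm{frac}}$ denotes the desired proportion of the total variance to be explained), the number of components $m$ is determined as the smallest value satisfying this inequality, and the first $m$ principal components are used to replace the original $p$ predictor variables.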
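The first two guidelines reduce to simple computations on the eigenvalues. The following sketch assumes hypothetical eigenvalues and function names, and reads the threshold rule as taking the smallest number of components for which the cumulative proportion exceeds the threshold.

```python
from itertools import accumulate

def n_components_by_threshold(d_squared, omega_frac=0.9):
    """Guideline 1: smallest m whose cumulative proportion exceeds omega_frac."""
    total = sum(d_squared)
    for m, running in enumerate(accumulate(d_squared), start=1):
        if running / total > omega_frac:
            return m
    return len(d_squared)                 # threshold never exceeded: keep all

def n_components_by_average(d_squared):
    """Guideline 2: keep components whose eigenvalue is not below the average."""
    average = sum(d_squared) / len(d_squared)
    return sum(1 for value in d_squared if value >= average)

# Hypothetical eigenvalues, sorted in decreasing order
d_squared = [5.0, 3.0, 1.5, 0.5]

m_threshold = n_components_by_threshold(d_squared, omega_frac=0.9)
m_average = n_components_by_average(d_squared)
print(m_threshold, m_average)
```

For these eigenvalues the cumulative proportions are 0.5, 0.8, 0.95 and 1.0, so the threshold rule keeps three components, while the average eigenvalue (2.5) keeps two. The guidelines need not agree.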

The first proposed method for determining how many principal components should be retained to effectively summarise the original data is the method implemented in this thesis. The main drawback associated with this method is the difficulty of determining the threshold value ω_{𝑓𝑟𝑎𝑐}. Setting this value too high runs the risk of retaining principal components that are sample or variable specific (Rencher, 2002:434). However, ω_{𝑓𝑟𝑎𝑐} will be set to 0.9 for the data exploration in this thesis.

When comparing dimension reduction by means of PCA to variable selection, the advantage of variable selection is that variables that do not discriminate between groups are removed and need not be observed for future purposes, as they are no longer considered important. However, in dimension reduction, all the variables remain in the model as linear combinations of all these variables are used. Therefore, variable selection techniques yield easier interpretation of the important predictors compared to dimension reduction by means of PCA.

PCA can also be used as a tool for data visualisation of both the observations and the variables: it yields a low-dimensional representation that captures most of the information in the data, which is typically very complex in high-dimensional settings (James et al., 2013:375).

In the experimental work reported later in the thesis, PCA will be implemented using the preProcess() function in the caret package in R. PCA will be applied to the original high-dimensional data sets as a dimension reduction step, and then KNN and SVM classification procedures will be implemented using the ordinary principal components, $\mathbf{z}_1, \dots, \mathbf{z}_M$ ($M < p$), extracted from the data as inputs.
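The dimension reduction step can be sketched as a fit on training data followed by a projection of new data. The code below is not the caret implementation and makes no claim to reproduce its behaviour; it is a NumPy illustration of the idea described in this section, using centring only. The functions `fit_pca` and `transform_pca`, the synthetic data and the train/test split are all hypothetical.

```python
import numpy as np

def fit_pca(X_train, omega_frac=0.9):
    """Learn the column means and the first m directions from training data."""
    mean = X_train.mean(axis=0)
    _, d, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    omega = np.cumsum(d**2) / np.sum(d**2)          # cumulative proportions
    hits = np.flatnonzero(omega > omega_frac)
    m = int(hits[0]) + 1 if hits.size else len(d)   # smallest m above threshold
    return mean, Vt[:m].T                           # V_m has shape p x m

def transform_pca(X, mean, V_m):
    """Project data onto the retained directions, centring with training means."""
    return (X - mean) @ V_m

# Synthetic data: 10 correlated predictors driven by 2 latent factors
rng = np.random.default_rng(3)
latent = rng.normal(size=(60, 2))
loadings = rng.normal(size=(2, 10))
X_all = latent @ loadings + 0.05 * rng.normal(size=(60, 10))
X_train, X_test = X_all[:40], X_all[40:]

mean, V_m = fit_pca(X_train, omega_frac=0.9)
Z_train = transform_pca(X_train, mean, V_m)   # inputs for a classifier
Z_test = transform_pca(X_test, mean, V_m)     # same centring and directions
print(V_m.shape, Z_train.shape, Z_test.shape)
```

The test observations are centred with the training means and projected onto the training directions, so that no information from the test set enters the dimension reduction step.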

This section explained PCA using the SVD of $X = [\mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_p]$; however, a modified weighted PCA could be performed by implementing the SVD of $XD_\theta = [d_1(\theta)\mathbf{x}_1 \; d_2(\theta)\mathbf{x}_2 \; \cdots \; d_p(\theta)\mathbf{x}_p]$. Supervised principal components fall within this framework, using $d_j(\theta) = \mathrm{Ind}(|\hat{\beta}_j| \ge \theta)$. Supervised principal components are explained in detail in the following section.
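The weighted form can be sketched as follows. The coefficients $\hat{\beta}_j$ are defined in the following section and not here, so the array `beta_hat` below is only a hypothetical placeholder for per-variable scores; the function name, the data and the threshold are likewise assumptions.

```python
import numpy as np

def thresholded_svd(X, beta_hat, theta):
    """SVD of X D_theta, with d_j(theta) = Ind(|beta_hat_j| >= theta)."""
    weights = (np.abs(beta_hat) >= theta).astype(float)   # diagonal of D_theta
    X_weighted = X * weights                 # multiplies column j by d_j(theta)
    U, d, Vt = np.linalg.svd(X_weighted, full_matrices=False)
    return weights, U, d, Vt.T

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 5))
X = X - X.mean(axis=0)

# Placeholder scores, one per predictor (not computed from any response here)
beta_hat = np.array([2.0, -0.1, 1.5, 0.3, -3.0])

weights, U, d, V = thresholded_svd(X, beta_hat, theta=1.0)
n_nonzero = int(np.sum(d > 1e-10 * d.max()))
print(weights, n_nonzero)
```

With an indicator weight, the predictors below the threshold become zero columns, so they receive zero loadings in every direction with a non-zero singular value. The principal components are then built from the retained predictors only.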

In document Statistical classification in high-dimensional scenarios with a focus on microarray data sets (Page 60-64)