DIMENSION REDUCTION
4.2 PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a popular unsupervised data-processing and dimension reduction technique. PCA attempts to reduce the dimension of the data set by identifying and using informative linear combinations of the original predictors, instead of using the $p$ predictors themselves. In order to reduce the dimension of the original data set, the number of linear combinations must be substantially less than $p$. According to Johnson and Wichern (2007:430), the aim of PCA is to find the linear combinations which maximise the variation in the data set.
Principal components are entirely dependent on the covariance (or correlation) matrix of the original variables. There are several ways to explain principal component analysis; here the singular value decomposition (SVD) of the centred input matrix $X$ is used to express the principal components of the variables in $X$, as explained by Gosh (2002). The SVD of the $N \times p$ matrix $X$ (centred so that every column has mean zero) has the form
\[
X = UDV^T. \tag{4.1}
\]
Here $U$ is an $N \times p$ orthogonal matrix ($U^TU = I_p$) whose columns span the column space of $X$, and $V$ is a $p \times p$ orthogonal matrix ($V^TV = I_p$) whose columns span the row space of $X$. $D$ is a $p \times p$ diagonal matrix, with diagonal entries $d_1 \ge d_2 \ge \dots \ge d_p \ge 0$ known as the singular values of $X$. Note that if one or more of the $d_j$ equal zero, then $X$ does not have full column rank and $X^TX$ is singular.
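The properties of the decomposition in (4.1) can be checked numerically. The following is a minimal sketch in Python with NumPy, used here purely for illustration; the toy data and all variable names are assumptions and do not come from the thesis.

```python
import numpy as np

# Toy data (made up for illustration): N = 6 observations, p = 3 predictors.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(6, 3))

# Centre each column so that it has mean zero, as assumed in (4.1).
X = X_raw - X_raw.mean(axis=0)

# Thin SVD: U is N x p, d holds the singular values, Vt is V transposed.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(d)

print(np.allclose(X, U @ D @ Vt))         # X = U D V^T
print(np.allclose(U.T @ U, np.eye(3)))    # U^T U = I_p
print(np.allclose(Vt @ Vt.T, np.eye(3)))  # V^T V = I_p
print(bool(np.all(np.diff(d) <= 0)))      # d_1 >= d_2 >= ... >= d_p
```

NumPy returns the singular values already sorted in decreasing order, which matches the ordering convention used in the text.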
The sample covariance matrix is given by $S = \frac{1}{N}X^TX$, and from the SVD of $X$ in (4.1) the eigen-decomposition of $X^TX$ can be written as
\[
X^TX = VD^2V^T,
\]
which implies that
\[
(X^TX)V = VD^2.
\]
In the above equation, the columns $v_1, \dots, v_p$ of $V$ are the eigenvectors of $X^TX$, with corresponding eigenvalues $d_1^2 \ge d_2^2 \ge \dots \ge d_p^2 \ge 0$. The ordered sequence of eigenvectors, $v_1, \dots, v_p$, defines the principal component directions of $X$, and $d_1^2 \ge d_2^2 \ge \dots \ge d_p^2$ represent, up to the factor $1/N$, the variances of the principal components (Hastie et al., 2009:66). For any given data set there are at most $\min(N - 1, p)$ principal components with non-zero variance, because centring reduces the rank of $X$ to at most $N - 1$.
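As an illustration of the last two statements, the sketch below (Python with NumPy, on made-up data with more predictors than observations) compares the eigenvalues of $X^TX$ with the squared singular values of $X$ and counts the non-zero components. The tolerance used to decide that a singular value is zero is an assumption chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 4, 6                       # more predictors than observations
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)            # centring costs one degree of freedom

# Singular values of X (descending) and eigenvalues of X^T X (descending).
d = np.linalg.svd(X, compute_uv=False)
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]

# The p eigenvalues are the squared singular values, padded with zeros.
d2 = np.concatenate([d**2, np.zeros(p - len(d))])
print(np.allclose(eigvals, d2))

# Number of components with non-zero variance (relative tolerance assumed).
n_components = int(np.sum(d > 1e-10 * d[0]))
print(n_components, min(N - 1, p))
```

For generic data such as these the bound is attained, so the two printed numbers agree.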
The principal components (or principal component scores) of $X$ are given by
\[
z_j = Xv_j = UDV^Tv_j = u_jd_j \quad \text{for } j = 1, \dots, p. \tag{4.2}
\]
Here $z_j$ is a vector of length $N$, and the columns of $Z = XV$ are the vectors $z_j$, which are ordered to ensure that $d_1^2 \ge d_2^2 \ge \dots \ge d_p^2$.
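The two expressions for the scores in (4.2) can be compared directly. This is a small sketch in Python with NumPy on made-up data; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))
X = X - X.mean(axis=0)            # centred input matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# Column j of Z is the score vector z_j = X v_j.
Z = X @ V

# Column j of U * d is u_j d_j (broadcasting scales each column of U).
print(np.allclose(Z, U * d))
```

Computing the scores as `U * d` avoids the matrix product with `V` once the SVD is available.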
The first principal component of $X$ is the normalised linear combination of the columns $x_1, x_2, \dots, x_p$ of $X$ that has the largest sample variance,
\[
z_1 = Xv_1 = v_{11}x_1 + v_{21}x_2 + \dots + v_{p1}x_p.
\]
Here $v_1$ is the first principal component direction, which is normalised so that its sum of squares equals 1 ($v_1^Tv_1 = 1$, or $\sum_{j=1}^{p} v_{j1}^2 = 1$). The vector $v_1$ defines the direction in the feature space along which the data varies the most.
Hastie et al. (2009:66) indicate that the sample variance of $z_1$ is
\[
\mathrm{Var}(z_1) = \mathrm{Var}(Xv_1) = \frac{1}{N}v_1^TX^TXv_1 = \frac{d_1^2}{N}.
\]
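The last equality follows in a few steps from the eigen-decomposition of $X^TX$. The derivation below is a sketch added for completeness; it uses the fact that $z_1 = Xv_1$ has mean zero because the columns of $X$ do, and it writes $e_1$ for the first standard basis vector (notation introduced here, not used elsewhere in the text), so that $V^Tv_1 = e_1$.

```latex
\[
\begin{aligned}
\mathrm{Var}(z_1) &= \frac{1}{N} z_1^T z_1 = \frac{1}{N} v_1^T X^T X v_1 \\
                  &= \frac{1}{N} v_1^T V D^2 V^T v_1
                   = \frac{1}{N} e_1^T D^2 e_1
                   = \frac{d_1^2}{N}.
\end{aligned}
\]
```

The same argument with $v_j$ in place of $v_1$ gives $\mathrm{Var}(z_j) = d_j^2/N$.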
It is seen that this variance depends directly on the largest eigenvalue. The second principal component, $z_2$, is the normalised linear combination of $x_1, x_2, \dots, x_p$ that has maximal variance out of all the linear combinations that are uncorrelated with $z_1$. Requiring $z_2$ to be uncorrelated with $z_1$ is equivalent to requiring the direction $v_2$ to be orthogonal (perpendicular) to the direction $v_1$ ($v_1^Tv_2 = 0$). Similarly, the subsequent principal components $z_j$ have maximum variance, $d_j^2/N$, subject to being orthogonal to the previous principal components. Hence, the small eigenvalues $d_j^2$ correspond to directions in the column space of $X$ having small variance.
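The orthogonality of the directions and the resulting uncorrelatedness of the scores can be verified numerically. The sketch below uses Python with NumPy and made-up data; it is an illustration under those assumptions rather than part of the analysis in the thesis.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 10, 3
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T
Z = X @ V                          # principal component scores

# Directions are orthonormal: v_j^T v_k = 0 for j != k and v_j^T v_j = 1.
print(np.allclose(V.T @ V, np.eye(p)))

# Scores are uncorrelated: Z^T Z is diagonal with entries d_j^2.
print(np.allclose(Z.T @ Z, np.diag(d**2)))

# Sample variance of each score (dividing by N) equals d_j^2 / N.
score_var = Z.var(axis=0)
print(np.allclose(score_var, d**2 / N))
```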
Summarising, the principal components, $z_1, \dots, z_q$ ($q < p$), are uncorrelated and have variances proportional to the eigenvalues $d_j^2$ of $X^TX$. The proportion of the total variation in the sample explained by the $j$-th principal component can be expressed as:
\[
\frac{d_j^2}{d_1^2 + \dots + d_p^2}. \tag{4.3}
\]
Calculating the sum of (4.3) up to principal component $q$ gives an expression for the proportion of the total variance that is explained by the first $q$ principal components. This is denoted by $\psi_q$, i.e.:
\[
\psi_q = \frac{d_1^2 + d_2^2 + \dots + d_q^2}{d_1^2 + d_2^2 + \dots + d_p^2}. \tag{4.4}
\]
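Expressions (4.3) and (4.4) translate directly into code. The following sketch uses Python with NumPy; the function name `explained_proportions` and the singular values in the example are hypothetical and chosen only to give round numbers.

```python
import numpy as np

def explained_proportions(d):
    """Return the per-component (4.3) and cumulative (4.4) proportions
    of total variance for singular values d given in decreasing order."""
    d2 = np.asarray(d, dtype=float) ** 2
    per_component = d2 / d2.sum()
    cumulative = np.cumsum(per_component)
    return per_component, cumulative

# Made-up singular values: squared values are 9, 4, 1, 1, 1 (total 16).
per_component, cumulative = explained_proportions([3.0, 2.0, 1.0, 1.0, 1.0])
print(per_component)  # 9/16, 4/16, 1/16, 1/16, 1/16
print(cumulative)     # 0.5625, 0.8125, 0.875, 0.9375, 1.0
```

The factor $1/N$ in the variances cancels in both ratios, which is why the squared singular values can be used directly.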
PCA eliminates the problem of collinearity between the variables in a data set, since the principal components are by definition uncorrelated (Johnson and Wichern, 2007:430). It is interesting to note that if the original predictor variables are highly correlated, then the proportion of the total variance explained by the first $q$ principal components is close to 1 even for small values of $q$.
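The effect of correlation on the explained proportion can be seen in a small simulation. The sketch below (Python with NumPy) is built on assumptions made for this example only: five predictors that share one common factor plus a little noise are compared with five independent predictors, and the helper `first_pc_share` is a hypothetical name.

```python
import numpy as np

def first_pc_share(X):
    """Proportion of total variance explained by the first component."""
    X = X - X.mean(axis=0)
    d = np.linalg.svd(X, compute_uv=False)
    return d[0] ** 2 / np.sum(d ** 2)

rng = np.random.default_rng(3)
N, p = 200, 5

# Highly correlated predictors: one common factor plus small noise.
common = rng.normal(size=(N, 1))
X_corr = common + 0.05 * rng.normal(size=(N, p))

# Independent predictors for comparison.
X_ind = rng.normal(size=(N, p))

share_corr = first_pc_share(X_corr)
share_ind = first_pc_share(X_ind)
print(share_corr, share_ind)
```

With the correlated predictors a single component already accounts for nearly all of the variance, whereas with independent predictors the variance is spread across the components.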
An important step in PCA is to decide how many principal components to retain. Although there is no set rule to aid in this decision, there are several guidelines as proposed by Rencher (2002:434):
1. Retain sufficient components to account for a specified proportion of the total variance (the threshold value), say by letting $\psi_{\text{prop}} = 0.9$.
2. Exclude principal components whose eigenvalues are less than the overall average of the eigenvalues, calculated as:
\[
\overline{d^2} = \frac{1}{p}\sum_{j=1}^{p} d_j^2.
\]
3. Represent the principal components in a scree plot, by plotting $d_j^2$ against $j$. A scree plot indicates the appropriate number of components to retain by an elbow (bend) in the plot which separates the large eigenvalues from the small eigenvalues.
4. Use significance tests to test if the “larger” principal components are indeed significant.
By requiring $\psi_q > \psi_{\text{prop}}$ (where $\psi_{\text{prop}}$ denotes the desired proportion of the total variance to be explained), the number of principal components $q$ is determined, and these $q$ components are used to replace the original $p$ predictor variables.
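The first two guidelines can be sketched in a few lines of Python with NumPy. The eigenvalues below are made up for illustration, and the variable names (`threshold`, `q_threshold`, `q_average`) are assumptions of this example rather than notation from the thesis.

```python
import numpy as np

# Made-up eigenvalues d_j^2 in decreasing order (total 16).
d2 = np.array([9.0, 4.0, 1.0, 1.0, 1.0])

# Guideline 1: smallest q whose cumulative proportion exceeds the threshold.
threshold = 0.9
cumulative = np.cumsum(d2) / d2.sum()   # 0.5625, 0.8125, 0.875, 0.9375, 1.0
# argmax on a boolean array returns the index of the first True entry.
q_threshold = int(np.argmax(cumulative > threshold)) + 1

# Guideline 2: keep components whose eigenvalue is at least the average.
average = d2.mean()                      # 16 / 5 = 3.2
q_average = int(np.sum(d2 >= average))

print(q_threshold)  # 4
print(q_average)    # 2
```

The two guidelines need not agree, as this example shows: the threshold rule keeps four components while the average-eigenvalue rule keeps two.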
The first of these guidelines is the method implemented in this thesis for determining how many principal components should be retained to summarise the original data effectively. The main drawback associated with this method is the difficulty of determining the threshold value $\psi_{\text{prop}}$. Setting this value too high runs the risk of retaining principal components that are sample or variable specific (Rencher, 2002:434). In this thesis, $\psi_{\text{prop}}$ is set to 0.9 for the data exploration.
When comparing dimension reduction by means of PCA to variable selection, the advantage of variable selection is that variables that do not discriminate between groups are removed and need not be observed in future, as they are no longer considered important. In dimension reduction, however, all the variables remain in the model, because each principal component is a linear combination of all of them. Therefore, variable selection techniques yield easier interpretation of the important predictors than dimension reduction by means of PCA.
PCA can also be used as a tool for data visualisation, for both the observations and the variables, since it yields a low-dimensional representation that captures most of the information in the data, which is typically very complex in high-dimensional settings (James et al., 2013:375).
In the experimental work reported later in the thesis, PCA will be implemented using the preProcess() function in the caret package in R. PCA will be applied to the original high-dimensional data sets as a dimension reduction step, and then KNN and SVM classification procedures will be implemented using the ordinary principal components, $z_1, \dots, z_q$ ($q < p$), extracted from the data as inputs.
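The general workflow of reducing the dimension first and classifying afterwards can be sketched as follows. This is not the caret implementation used in the thesis: it is a Python/NumPy analogue written for illustration, it only centres the data and does not try to reproduce the defaults of preProcess(), and the function names, the simulated two-class data and the choice $k = 3$ are all assumptions of this example.

```python
import numpy as np

def fit_pca(X_train, threshold=0.9):
    """Return the training column means and the first q directions (p x q)."""
    mean = X_train.mean(axis=0)
    _, d, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    cumulative = np.cumsum(d**2) / np.sum(d**2)
    # Smallest q whose cumulative proportion reaches the threshold.
    q = min(int(np.searchsorted(cumulative, threshold)) + 1, len(d))
    return mean, Vt[:q].T

def transform(X, mean, V_q):
    """Centre with the TRAINING means and project onto the kept directions."""
    return (X - mean) @ V_q

def knn_predict(Z_train, y_train, Z_new, k=3):
    """Plain k-nearest-neighbour majority vote with Euclidean distance."""
    predictions = []
    for z in Z_new:
        distances = np.linalg.norm(Z_train - z, axis=1)
        nearest_labels = y_train[np.argsort(distances)[:k]]
        labels, counts = np.unique(nearest_labels, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Simulated two-class data (made up): class 1 is shifted in every predictor.
rng = np.random.default_rng(6)
p = 10
X_train = np.vstack([rng.normal(0.0, 1.0, size=(20, p)),
                     rng.normal(5.0, 1.0, size=(20, p))])
y_train = np.array([0] * 20 + [1] * 20)
X_new = np.vstack([rng.normal(0.0, 1.0, size=(5, p)),
                   rng.normal(5.0, 1.0, size=(5, p))])
y_new = np.array([0] * 5 + [1] * 5)

mean, V_q = fit_pca(X_train, threshold=0.9)
Z_train = transform(X_train, mean, V_q)
Z_new = transform(X_new, mean, V_q)
y_pred = knn_predict(Z_train, y_train, Z_new, k=3)

print(V_q.shape[1], "components retained out of", p)
print("accuracy:", np.mean(y_pred == y_new))
```

The point of the design is that the means and directions are estimated on the training data only and then reused for new observations, so that no information from the new data leaks into the dimension reduction step.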
This section explained PCA using the SVD of $X = [x_1 \; x_2 \; \cdots \; x_p]$; however, a modified weighted PCA could be performed by implementing the SVD of $XD_w = [w_1(\theta)x_1 \; w_2(\theta)x_2 \; \cdots \; w_p(\theta)x_p]$. Supervised principal components fall within this framework, using $w_j(\theta) = \mathrm{Ind}(|\hat{\beta}_j| \ge \theta)$. Supervised principal components are explained in detail in the following section.
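The weighted decomposition can be illustrated with indicator weights. The sketch below (Python with NumPy) rests on assumptions that the text above does not fix: $\hat{\beta}_j$ is taken to be the univariate least-squares coefficient of a response $y$ on $x_j$, the threshold $\theta = 1$ is arbitrary, and the data are simulated so that only the first two predictors are related to $y$.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 2000, 6
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)

# Simulated response (made up): depends on the first two predictors only.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=N)
y = y - y.mean()

# ASSUMPTION: beta_hat_j is the univariate least-squares coefficient of
# y on x_j; the text defers the exact definition to the next section.
beta_hat = (X.T @ y) / np.sum(X**2, axis=0)

theta = 1.0                                    # hypothetical threshold
w = (np.abs(beta_hat) >= theta).astype(float)  # w_j = Ind(|beta_hat_j| >= theta)

# X D_w scales column j of X by w_j; broadcasting avoids forming D_w.
XDw = X * w
U, d, Vt = np.linalg.svd(XDw, full_matrices=False)

print(w)                                        # indicator weights
print(int(np.sum(d > 1e-10 * d[0])))            # non-zero singular values
```

With indicator weights, the columns that fail the threshold are set to zero, so the decomposition is effectively a PCA of the selected predictors alone.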