3.2 Compressed Factorisation
3.2.1 Principal Component Analysis (PCA)
In general, PCA is performed by computing the eigenvalues and eigenvectors of the covariance matrix. Given the reduced dataset $A_{k \times n}$, however, the eigenvalue decomposition of the compressed covariance matrix can be computed instead and the eigenvectors subsequently decompressed. In the literature the eigenvectors are often referred to as PCA coefficients or PCA loadings; the term coefficients is used here, but all three are interchangeable in this context. First, the basic PCA calculation is reviewed.
The use of an orthogonalised random projection has two important consequences. Firstly, it reduces the dimensionality of the problem, so that PCA can be performed on the reduced data matrix $A$ instead of on the full data matrix $X$. On a full dataset, the calculation requires the formation and diagonalisation of the $m \times m$ covariance matrix, with $m \approx 10^5$ for Matrix Assisted Laser Desorption Ionisation (MALDI) datasets. This is intractable by normal means, although special data reduction techniques can be employed in combination with memory-efficient algorithms which construct the principal components (PCs) without requiring the data to be in memory [172]. Here, only the $k \times k$ covariance matrix from the reduced subspace needs to be diagonalised, and no data channels have been fully removed in forming that subspace. Secondly, the orthogonalisation procedure allows the principal component vectors to be projected back into the full high-dimensional measurement space so that they can be analysed in terms of physically meaningful quantities. This is the key advantage of this approach over non-orthogonalised random projection methods.
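To give a sense of scale for the two covariance matrices, the following minimal sketch estimates the storage needed for each. It assumes dense storage in 64-bit floating point (8 bytes per entry), and the value k = 100 is a hypothetical choice made purely for illustration; neither assumption is taken from the text.

```python
def dense_matrix_bytes(rows, cols, bytes_per_entry=8):
    """Bytes needed to store a dense rows x cols matrix (float64 assumed)."""
    return rows * cols * bytes_per_entry

m = 10**5   # number of data channels, as quoted for MALDI datasets
k = 100     # hypothetical reduced dimension, for illustration only

full_bytes = dense_matrix_bytes(m, m)      # m x m covariance matrix
reduced_bytes = dense_matrix_bytes(k, k)   # k x k covariance matrix

print(f"full:    {full_bytes / 1e9:.0f} GB")
print(f"reduced: {reduced_bytes / 1e3:.0f} kB")
```

Under these assumptions the full covariance matrix alone occupies tens of gigabytes before any diagonalisation is attempted, whereas the reduced matrix is negligible in size.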
Algorithm 3.1: Compute the principal component eigenvectors and eigenvalues from a data matrix
Data: A data matrix, $X$
Result: eigenvectors, $W$; eigenvalues, $\Phi$
1 Given a data matrix $X_{m \times n}$, compute the mean of each row (data channel) and subtract it to form the matrix $\bar{X}$, such that each row of $\bar{X}$ has zero mean.
2 Compute the covariance matrix $C = \bar{X}\bar{X}^{T}$.
3 Calculate the eigenvalues and eigenvectors of $C$. Since $C$ is a real symmetric matrix, this is equivalent to performing SVD on $C$: $C = W \Phi W^{T}$. The columns of $W$ are the principal component vectors, and the diagonal of $\Phi$ contains the variance along each of the principal components.
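The three steps of Algorithm 3.1 can be sketched in NumPy as follows. This is an illustrative sketch rather than the implementation used in this work: the function name `pca_covariance` and the toy data are assumptions, and the covariance matrix is left unnormalised, as in step 2.

```python
import numpy as np

def pca_covariance(X):
    """PCA of an m x n matrix X (rows = data channels), after Algorithm 3.1."""
    # Step 1: subtract the mean of each row so every channel has zero mean.
    X_bar = X - X.mean(axis=1, keepdims=True)
    # Step 2: unnormalised m x m covariance matrix.
    C = X_bar @ X_bar.T
    # Step 3: eigendecomposition of the real symmetric matrix C.
    # eigh returns eigenvalues in ascending order, so sort them descending.
    phi, W = np.linalg.eigh(C)
    order = np.argsort(phi)[::-1]
    return W[:, order], phi[order]

# Toy data (hypothetical): 6 channels, 50 observations.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
W, phi = pca_covariance(X)
```

The columns of `W` are orthonormal and ordered by decreasing eigenvalue, so the leading columns correspond to the directions of greatest variance.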
It should be noted that an alternative algorithm exists for obtaining the eigenvectors by SVD factorisation of the data matrix directly, rather than by explicitly forming the covariance matrix [105]; this approach also suffers due to the data size. Identical arguments can be made for applying this method to the compressed data, and the same transformations can be used to recover the principal components in the original space, as detailed in the following Algorithm.
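The equivalence between the two routes can be checked numerically. In the sketch below, which is an illustration under assumed names (`pca_via_svd`) and toy data, the left singular vectors of the zero-mean data matrix serve as the principal component vectors and the squared singular values reproduce the eigenvalues of $\bar{X}\bar{X}^{T}$.

```python
import numpy as np

def pca_via_svd(X):
    """PCA of an m x n matrix X via SVD of the zero-mean data matrix."""
    X_bar = X - X.mean(axis=1, keepdims=True)
    # Thin SVD: X_bar = U S V^T, hence X_bar X_bar^T = U S^2 U^T.
    U, s, _ = np.linalg.svd(X_bar, full_matrices=False)
    # Singular values are returned in descending order.
    return U, s**2

# Toy data (hypothetical): 5 channels, 40 observations.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 40))
U, phi_svd = pca_via_svd(X)

# Reference: eigenvalues of the explicitly formed covariance matrix.
X_bar = X - X.mean(axis=1, keepdims=True)
phi_eig = np.sort(np.linalg.eigvalsh(X_bar @ X_bar.T))[::-1]
```

Comparing `phi_svd` with `phi_eig` confirms that the covariance matrix never has to be formed explicitly, although the SVD still operates on a matrix with $m$ rows.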
Algorithm 3.2: Compute the principal component eigenvectors and eigenvalues from a compressed data matrix
Data: Compressed data matrix, $A$; and its approximate basis, $Q$
Result: eigenvectors, $W$; eigenvalues, $\Phi$
1 Form the zero-mean reduced data matrix $\bar{A}$ using Algorithms 2.4 and 3.1.1.
2 Form the covariance matrix in the reduced subspace, $\tilde{C} = \bar{A}\bar{A}^{T}$.
3 Perform SVD to diagonalise: $\tilde{C} = \tilde{W}\tilde{\Phi}\tilde{W}^{T}$, giving the principal components and variances in the reduced subspace.
4 Since $\bar{A} = Q^{T}\bar{X}$, $\tilde{C} = \bar{A}\bar{A}^{T} = Q^{T}\bar{X}\bar{X}^{T}Q$.
5 It then follows that $Q\tilde{C}Q^{T} = QQ^{T}\bar{X}\bar{X}^{T}QQ^{T} \approx \bar{X}\bar{X}^{T} = C$.
6 As $C = W\Phi W^{T}$ and $\tilde{C} = \tilde{W}\tilde{\Phi}\tilde{W}^{T}$, it follows that, to the accuracy of the approximation in step 5, $W\Phi W^{T} = Q\tilde{W}\tilde{\Phi}\tilde{W}^{T}Q^{T}$.
7 The principal components of $X$ can be computed directly from the principal components of $A$: $W = Q\tilde{W}$ and $\Phi = \tilde{\Phi}$.
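A minimal sketch of Algorithm 3.2 is given below. Several things in it are assumptions made for illustration: the function name `compressed_pca`, the toy low-rank data, and in particular the construction of $Q$ by QR factorisation of a Gaussian random projection of $X$, which stands in for the basis produced by the algorithms of the preceding chapter. The row means are subtracted directly from $A$, which is consistent with step 4 because row-centring commutes with multiplication by $Q^{T}$.

```python
import numpy as np

def compressed_pca(A, Q):
    """Principal components of X recovered from A = Q^T X, after Algorithm 3.2."""
    # Step 1: zero-mean reduced data matrix.
    A_bar = A - A.mean(axis=1, keepdims=True)
    # Step 2: k x k covariance matrix in the reduced subspace.
    C_tilde = A_bar @ A_bar.T
    # Step 3: diagonalise (C_tilde is real symmetric), sorted descending.
    phi_tilde, W_tilde = np.linalg.eigh(C_tilde)
    order = np.argsort(phi_tilde)[::-1]
    W_tilde, phi_tilde = W_tilde[:, order], phi_tilde[order]
    # Step 7: project the eigenvectors back into the full space.
    return Q @ W_tilde, phi_tilde

# Toy data (hypothetical): m channels, n observations, exact rank r.
rng = np.random.default_rng(0)
m, n, r, k = 200, 60, 5, 10
X = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))

# Assumed construction of Q: orthonormalise a random projection of X.
Omega = rng.normal(size=(n, k))
Q, _ = np.linalg.qr(X @ Omega)   # m x k with orthonormal columns
A = Q.T @ X                      # k x n compressed data

W, phi = compressed_pca(A, Q)
```

Because the toy matrix has exact rank $r \le k$, the approximation in step 5 holds to rounding error here and the leading $r$ eigenvalues agree with those of the full covariance matrix; for real data the agreement is only as good as the approximation $QQ^{T}X \approx X$.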
The projections of the data onto the principal component vectors, the 'scores', are denoted by $Y_L = A W_L^{T}$, where the number of components retained, $L$, is determined by the fraction of variance retained. Columns from $Y_L$ can be plotted as an image to examine trends extracted by PCA, and the spectral origin of these trends can be deduced from the principal component vectors.
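The selection of $L$ and the calculation of the scores can be sketched as follows. The function name `choose_num_components`, the 90% variance threshold, the toy data and the 5 x 6 image shape are all hypothetical. The sketch also assumes that $A$ is stored as $k \times n$ and arranges the transposes so that the score matrix is $n \times L$, with one column per component; the scores are computed in the reduced subspace, which gives the same result as projecting $\bar{X}$ onto $W = Q\tilde{W}$ since $\bar{A} = Q^{T}\bar{X}$.

```python
import numpy as np

def choose_num_components(phi, variance_fraction):
    """Smallest L whose leading eigenvalues reach the requested variance fraction."""
    cumulative = np.cumsum(phi) / np.sum(phi)
    # First index at which the cumulative fraction reaches the threshold.
    L = int(np.searchsorted(cumulative, variance_fraction)) + 1
    # Guard against rounding pushing the index past the last component.
    return min(L, len(phi))

# Toy reduced data (hypothetical): k = 4 reduced channels, n = 30 pixels.
rng = np.random.default_rng(2)
k, n = 4, 30
A = rng.normal(size=(k, n))
A_bar = A - A.mean(axis=1, keepdims=True)

# Principal components in the reduced subspace, sorted descending.
phi_tilde, W_tilde = np.linalg.eigh(A_bar @ A_bar.T)
order = np.argsort(phi_tilde)[::-1]
phi_tilde, W_tilde = phi_tilde[order], W_tilde[:, order]

L = choose_num_components(phi_tilde, 0.9)
Y_L = A_bar.T @ W_tilde[:, :L]      # n x L matrix of scores

# One column of scores reshaped to the (assumed) 5 x 6 pixel grid.
image = Y_L[:, 0].reshape(5, 6)
```

Each column of `Y_L` can then be displayed as an image, with the corresponding principal component vector indicating which data channels drive the observed spatial trend.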