Principal Component Analysis (PCA) - Multivariate Data Analysis as a Tool to Explore Data

Chapter 4: Chemometrics: An outline of the development of the

4.4 Multivariate Data Analysis as a Tool to Explore Data

4.4.1 Principal Component Analysis (PCA)

This is a useful multivariate statistical technique to investigate and model the dimensional differences in a multivariable data set by simplifying the data to fewer dimensions and removal of noise for ease of analysis and is a useful application to test for similar groupings of samples in experimental data [8, 12].

Experimental data consists of a number of variables (p) for each sample (n). The data are gathered in a matrix X (Table 4.1), with n rows and p columns [12, 14].

Table 4.1. Example of a data matrix X. Individual objects or samples denoted by n rows and the variables denoted by X columns.

Sample Variable

X

.

X

1 X1

2 X2

. . . X2

3 X3

. . . X3

.

. . .

.

. . .

.

. . .

.

n

Xn

. . . Xn

In general for spectral data, principal components analysis algorithms such as the NIPALS (non-linear iterative partial least squares) algorithm are used to group some given spectra that are defined by a set of numerical properties in such a way that the spectra within a group are more similar than the spectra in different groups.

Each spectrum is represented as a point in the variable space, so that the data set presents as a swarm of sample points in the variable space (CAMO, 2012) (Figure 4.11), the density of the swarm depends on the number of data points.

Principal component analysis finds the eigenvalues from the covariant matrix of the

data variables, the variations in and between data are defined as variance and covariance, where variance is the measure of the spread of variable values. In the case of spectral data the variables are the wavenumbers (cm-1) and the variance is the

relative intensity of the Raman scattering at each wavenumber. The association between two variables, X and Y or X and X, is a measure of their covariance [7]. The covariance of two variables X1 and X2 for example is:

ࢉ࢕࢜൫࢞

࢟

ǡ ࢞

ࢠ

൯ ൌ

σ

࢔࢏ୀ૚

ሺ࢞

࢏࢟

െ࢞ഥ

࢟

ሻሺ࢞

࢏ࢠ

െ࢞ഥ

ࢠ

ሻ

ሺ࢔ െ ૚ሻ

The covariant matrix of a data set with n samples and p variables is denoted by:

࡯ ൌ ቎

࡯

_૚૚

ڮ ࡯

_૚࢖

ڭ

ڰ

ڭ

࡯

_࢖૚

ڮ ࡯

_࢖࢖

቏

Calculated by the sample co-variance between variables.

Eigenvectors (z) and eigenvalues (λ) calculated from the covariance matrix and are

described by

࡯ࢠ ൌ ࣅࢠ

Where Cz is the eigenvector of the covariance matrix Cand λzis the eigenvalue of the corresponding eigenvector z. PCA converts a set of observations of possibly correlated variables into a set of variables called principal components [7, 12]. The number of principal components possible will equal the number of variables (p), PCA does this by picking the direction of the largest elongation of the swarm of data points in the multivariate space and plots a line through the centre point along this direction, this

line is the direction of the first principal component (PC1) and this line points in the

AS, Oslo, Norway). The second principal component will be orthogonal to the first and in the direction of the second most variation in the data set and so on with each following PC orthogonal to the previous PC. If the first PC does not explain 100% of the variation in the data, then further PC’s are transformed depending on the complexity

of variation in the dataset. The maximum number of principal components will be less

than or equal to the number of original samples in the dataset, with each succeeding

PC accounting for the next largest variance.

Figure 4.10. Representation of data points in the variable space. The red line passes through the centre

and direction of the largest variance and is PC1 (an eigenvector). T = the t-score value of each object

point.

Each PC is an eigenvector (Figure 4.10) with an eigenvalue representing the spread of

the variance of the data points along the vector (i.e. comparative length of the vector).

parallel to the PC. The loadings are coefficients of the vector (Figure 4.11) which

represent the values of the parallel distance of the PC vector to the variable axis’ and

plots the vector in the variable space and reflects the relative weight given to that

variable in describing the location of the vector.

Figure 4.11. Representation of data points in the variable space. The red line passes through the centre

and direction of the largest variance and is PC1. T = the t-score value of each object point. The

eigenvalue is the value of the variance along the eigenvector and the associated coefficients are the angle value between the vector and any variable axis.

PCA deconstructs the initial data set matrix (X) into scores (T), loadings (PT_{) and}

residuals (E noise) (Figure 4.12). Scores and loadings (coefficient) values are defined by calculating eigenvector and eigenvalue data from the variable space (figures 4.10 and 4.11) [8, 12]. The residuals in this case are the PC’s that account for very little variance in the data matrix whereas the product of T x PTcontains the main structure of your data.

X

Eigenvector

ࢄ ൌ ࢀ ൈ ࡼ

ࢀ

_{൅ ࡱ}

Where X is the matrix of the data set, T is the scores matrix, PT_{is the loadings matrix}

and E represents the residuals matrix of the data set [11].

Figure 4.12. Example of PCA on spectral data. Figure taken from G.P.S. Smith et. al. 2015 “Raman

imaging of drug delivery systems”.

In PCA non-linear iterative partial least squares (NIPALS) is the most common algorithm used for computing the eigenvector components in a principal component or partial least squares analysis [9, 10, 18].

Herman Wold invented the NIPALS algorithm in 1966. NIPALS iterative PC calculation

calculates one PC component at a time. The main mathematical features of this

algorithm are as follows:

1. Centre and scale the covariancematrix appropriately, as necessary = C. (Index initialization, f: f=1; Cf = C)

2. For tfchoose any column in C (initial proxy t vector)

3. pf =CTtf / |CTtf|

4. tf = Cpf

6. Cf+1 = Cf - tf pfT

7. f = f+1

Repeat steps until f = A the optimum number of principal components i.e. minimum

PC’s required to identify the main structure of your data with removal of residual data.

Where

1. The start of the process with the centred covariance matrix

2. Requirement to start the algorithm with a proxy t vector, any column vector of C will do but it is best to use the column with the largest variance.

3. Calculation of the loading vector, Pf for iteration number f.

4. Calculation of the score vector tffor iteration number f. This step represents a

projection of the vector down onto the fth PC component in variable space. These

projections also correspond to the regression formulae for calculation regression

coefficients used in PLS analysis (section 6).

5. This is the convergence test, NIPALS usually converges to a stable vector solution in

less than 40 iterations, Convergence means that the proxy PC in variable space has

stabilized to the maximum variance sought and the 1st eigenvector has been found.

6. Updates Cf+1 = Cf - tf pfT, i.e. subtraction of component number f (in the first case

=PC1). The principal component model: T*PT is calculated for one component at a time. After convergence the rank one model tfpfTis subtracted from C [8].

In document Influential factors in nectar composition and yield in Leptospermum scoparium : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Plant Science, Institute of Agriculture and the Environment, College of (Page 100-106)