Normalising the samples - Principal Component Analysis Theory

3. Principal Component Analysis Theory

3.4. Normalising the samples

A Practical Investigation into the use of Principal Component Analysis for the Modelling and Scale-up of High Performance Liquid Chromatography, Chapter 3

Samples are sometimes normalised prior to PCA so that they all have the same relative size. In chromatography, peak size is related to the amount of sample injected, and it may be desirable to account for this before analysis. The method, which essentially involves converting chromatograms to a common area, was first recognised by Kvalheim (1985). That study, however, used peak areas as the original variables for the PCA and not, as is the case in this thesis, UV absorbance values which plot out the entire chromatograms. The merits of sample normalisation and other pre-processing techniques will be discussed in future results chapters.
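As an illustration only, converting chromatograms to a common area could be sketched as follows. The function name, the choice of a unit target area, the use of a simple sum of absorbance values as the area, and the example data are all assumptions made for this sketch and are not taken from the thesis.

```python
import numpy as np

def normalise_to_unit_area(X):
    """Scale each row (one chromatogram) so that its summed absorbance is 1.

    Assumption: absorbance is sampled at equal time intervals, so the plain
    sum of a row is proportional to the area under the chromatogram.
    """
    X = np.asarray(X, dtype=float)
    areas = X.sum(axis=1, keepdims=True)
    if np.any(areas == 0):
        raise ValueError("a chromatogram with zero total area cannot be normalised")
    return X / areas

# Hypothetical data: the same peak shape recorded from two injection amounts.
X_raw = np.array([[0.0, 1.0, 2.0, 1.0, 0.0],
                  [0.0, 2.0, 4.0, 2.0, 0.0]])
X_norm = normalise_to_unit_area(X_raw)
```

After this scaling the two rows are identical, which is the sense in which normalisation removes the effect of the amount injected.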

3.5. Mean Centring

The original data set is commonly centred before PCA is applied. In such cases the mean chromatogram of the original data (or a reference chromatogram) is subtracted from every individual chromatogram. The principal components generated from the mean-centred data set are therefore indicative of changes in UV absorbance and not of the absolute values. Mean centring enhances the more subtle differences between chromatograms and, because this improves the ability of the calculation to detect those differences, results in a PCA model that fits the original data set more accurately. Mean centring is a logical step when considered in the context of how PCA calculates the loadings. Since the loadings represent the changes in the data which are common to every chromatogram, removing the mean simply removes the first and most common variation before the data are subjected to the PCA algorithm.
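A minimal sketch of mean centring is given below, assuming each row of the data matrix is one chromatogram and each column one time point. The function name, the optional `reference` argument and the example values are hypothetical.

```python
import numpy as np

def mean_centre(X, reference=None):
    """Subtract the mean chromatogram (or a supplied reference) from every row."""
    X = np.asarray(X, dtype=float)
    if reference is None:
        reference = X.mean(axis=0)  # mean chromatogram over all samples
    return X - reference, reference

# Hypothetical data: three chromatograms that differ only at the second time point.
X_uncentred = np.array([[1.0, 2.0, 3.0],
                        [1.0, 4.0, 3.0],
                        [1.0, 6.0, 3.0]])
X_centred, mean_chromatogram = mean_centre(X_uncentred)
```

In this example the centred matrix is zero everywhere except the second column, so only the variation between chromatograms is left for PCA to model.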

3.6. Calculating the Principal Components

There are two different methods used to calculate the principal components from a data set: the NIPALS algorithm and the Decomposition of Covariance. The following descriptions of these methods assume that the matrices involved have the following dimensions: X is a matrix of m rows × n columns, t is the scores matrix with dimensions m × k (k is the number of principal components extracted), p is the loadings matrix (dimensions k × n) and l is a k × k matrix of eigenvalues. When used, the subscripts on the matrices indicate the matrix row.

NOTE: The eigenvalues matrix l is a diagonal square matrix, i.e. it only has values on its diagonal and is zero everywhere else. The eigenvalues should always descend along the diagonal since the first principal component (eigenvector) represents the largest variation, the second the next largest, and so on.
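The matrix shapes and the descending diagonal can be illustrated with a short sketch. It assumes that the Decomposition of Covariance amounts to an eigendecomposition of the unscaled cross-product matrix X^T X; that reading, the function name and the example data are assumptions rather than the procedure used in the thesis.

```python
import numpy as np

def pca_by_eigendecomposition(X, k):
    """Return scores t (m x k), loadings p (k x n) and eigenvalues l (k x k)."""
    X = np.asarray(X, dtype=float)
    C = X.T @ X                              # n x n cross-product matrix (unscaled)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # keep the k largest, in descending order
    l = np.diag(eigvals[order])              # diagonal eigenvalue matrix
    p = eigvecs[:, order].T                  # one loading per row
    t = X @ p.T                              # scores: projection of X onto the loadings
    return t, p, l

# Hypothetical mean-centred data: 4 samples, 3 variables.
X_eig = np.array([[ 1.0,  2.0,  0.5],
                  [-1.0, -2.0,  0.5],
                  [ 3.0, -1.0, -0.5],
                  [-3.0,  1.0, -0.5]])
t_eig, p_eig, l_eig = pca_by_eigendecomposition(X_eig, k=3)
```

With all three components retained the product of scores and loadings reproduces the data exactly; with fewer components the difference is the residual matrix.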

3.6.1. The NIPALS Algorithm

NIPALS (nonlinear iterative partial least squares) is the most common method for calculating the principal components of a data set. Its advantages over the Decomposition of Covariance method are three-fold: firstly, the covariance matrix $X^T X$ does not need to be determined; secondly, missing data can be handled; and thirdly, PCs are extracted in order of decreasing importance [Wold et al. (1987), MacGregor et al. (1991), Kvalheim (1993)]. The flexibility of the NIPALS algorithm in coping with missing data makes it an ideal choice for PCA of data generated from industrial processes, which may be incomplete. Only a description of the mechanism is presented here, since full details can be found in Geladi and Kowalski (1986).

The NIPALS algorithm gives more numerically accurate results than the Decomposition of Covariance method, but is slower to calculate. The steps are as follows:

1. As a first approximation, set the $\mathrm{PC}_i$ loading to the first sample: $p_i = x_1$.
2. Calculate the eigenvalue: $l_{ii} = (p_i p_i^T)^{1/2}$.
3. Normalise the $\mathrm{PC}_i$ loading: $p_i = p_i / l_{ii}$.
4. Compute the scores values: $t_i = X p_i^T$.
5. Check for convergence by comparing these scores to the scores from the previous pass for this eigenvector. If this is the first pass for the current ($\mathrm{PC}_i$) loading, or the scores are not the same, continue with step 6. If the scores are the same, go to step 8.
6. Recompute the $\mathrm{PC}_i$ loading: $p_i^T = X^T t_i$.
7. Go to step 2.
8. Calculate the residual matrix. Proceed with the next principal component loading ($\mathrm{PC}_{i+1}$). Go to step 1.
9. Stop calculating at $\mathrm{PC}_k$ when the residual matrix reaches a certain degree of sparsity.
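The steps above can be sketched in code as follows. This is a minimal illustration under several assumptions, not the implementation used in the thesis: the eigenvalue in step 2 is taken to be the length of the unnormalised loading, step 6 is taken to be the product of the transposed data with the scores, a fixed number of components k replaces the stopping rule of step 9, and the names `nipals_pca`, `tol` and `max_iter` are hypothetical. The sketch also assumes complete data and does not show the handling of missing values.

```python
import numpy as np

def nipals_pca(X, k, tol=1e-10, max_iter=500):
    """Extract k principal components one at a time from a centred matrix X."""
    E = np.asarray(X, dtype=float).copy()    # residual matrix, initially X itself
    m, n = E.shape
    t = np.zeros((m, k))                     # scores, one column per PC
    p = np.zeros((k, n))                     # loadings, one row per PC
    l = np.zeros((k, k))                     # diagonal eigenvalue matrix
    for i in range(k):
        # Step 1: start from the first sample (assumption: fall back to the
        # largest row if the first row of the residual is exactly zero).
        row_norms = np.linalg.norm(E, axis=1)
        start = 0 if row_norms[0] > 0 else int(np.argmax(row_norms))
        if row_norms[start] == 0:
            break                            # nothing left to model
        p_raw = E[start].copy()
        t_prev = None
        for _ in range(max_iter):
            l_ii = np.sqrt(p_raw @ p_raw)    # step 2: eigenvalue estimate
            p_i = p_raw / l_ii               # step 3: normalise the loading
            t_i = E @ p_i                    # step 4: scores
            # Step 5: converged when the scores stop changing between passes.
            if t_prev is not None and np.linalg.norm(t_i - t_prev) <= tol * np.linalg.norm(t_i):
                break
            t_prev = t_i
            p_raw = E.T @ t_i                # step 6, then step 7: back to step 2
        t[:, i], p[i], l[i, i] = t_i, p_i, l_ii
        E = E - np.outer(t_i, p_i)           # step 8: residual for the next PC
    return t, p, l, E

# Hypothetical mean-centred data of rank 2: 4 samples, 3 variables.
X_nip = np.array([[ 2.0,  1.0,  0.0],
                  [-2.0, -1.0,  0.1],
                  [ 1.0,  0.5, -0.1],
                  [-1.0, -0.5,  0.0]])
t_nip, p_nip, l_nip, E_nip = nipals_pca(X_nip, k=2)
```

Because each component is removed from the residual before the next one is sought, the components come out in order of decreasing eigenvalue, and the scores, loadings and residual always add back up to the input matrix.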

The NIPALS algorithm is summarised diagrammatically in Figure 3.1 for the first 2 PCs. A mean or reference matrix is initially subtracted from the original data, and the remaining matrix X is broken down into the product of two smaller matrices and a residual matrix ($E_1$). The smaller matrices have dimensions (1 × n) and (m × 1) respectively. The (1 × n) matrix, the loadings matrix for PC1, is a common component of all of the experimental samples. The (m × 1) matrix is the scores matrix for PC1 and represents the amount of the PC1 loadings matrix which is present in each experimental sample. Collectively they represent the first principal component (PC1).
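Written out as equations, the decomposition described here for the first two PCs might look as follows. The symbols $t_1$, $p_1$, $t_2$, $p_2$ and $E_2$ are notation assumed for this sketch, with each $t_i$ an (m × 1) scores matrix and each $p_i$ a (1 × n) loadings matrix.

```latex
\begin{aligned}
X   &= t_1 p_1 + E_1, \\
E_1 &= t_2 p_2 + E_2, \\
X   &= t_1 p_1 + t_2 p_2 + E_2 .
\end{aligned}
```

Each product $t_i p_i$ has the same (m × n) dimensions as X, so every principal component is a full-size layer of the data and the residual shrinks as more layers are removed.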

[Figure 3.1. Surviving labels: PC1 loadings matrix; Matrix]

In document A practical investigation into the use of principal component analysis for the modelling and scale-up of high performance liquid chromatography (Page 69-72)