Chemometric data analysis

4.3. Principal component analysis (PCA)

Principal component analysis (PCA) is by far the most common multivariate technique for exploratory data analysis. It seeks a linear combination of the variables that extracts the maximum variance from them. It then removes this variance and seeks a second linear combination that explains the maximum proportion of the remaining variance, and so on. These linear combinations, which define the axes of a new coordinate system, are known as principal components (PCs). The PCs are orthogonal (uncorrelated) and account for the total (common and unique) variance [2,29,34]. PCA is related to FA in that both methods look for a simple structure in a set of variables by reducing dimensionality, but the two differ in several respects. Rencher [29] has pointed out the following differences between PCA and FA:

1. In FA, variables are expressed as linear combinations of factors, whereas in PCA the principal components are linear functions of the variables.

2. FA seeks to explain the covariances among the variables, whereas PCA seeks to explain the total variance.

3. FA makes several key assumptions, whereas PCA requires essentially none.

4. In FA the factors are subject to an arbitrary rotation, whereas the principal components in PCA are unique provided the eigenvalues are distinct.

5. In FA, if the number of factors is changed, the estimated factors are likely to change; this does not happen in PCA.

PCA is one of the oldest multivariate methods and was developed by Pearson [35] in 1901. Since then the technique has grown tremendously in popularity. The first step in PCA is to compute a correlation matrix. This places the measurements of different variables on the same scale, so that each standardized variable has a variance of one. The next step is to decide how many PCs to retain, which can be done in different ways. The first and most commonly used approach considers the cumulative proportion of total variance explained and retains enough PCs to reach a minimum desired or expected percentage. The second approach is based on the variance of each PC, which is given by the corresponding eigenvalue of the correlation matrix. Because each standardized variable has a variance of one, any PC with a variance (eigenvalue) below 1 is not retained, as it carries less information than a single original variable. Since the eigenvalues are standard output of the statistical procedure, this criterion is easy to implement. The third approach uses the scree plot, in a similar fashion to that explained for FA.

The idea behind this plot is that the difference between successive eigenvalues becomes smaller and smaller, so that the point at which the curve levels off separates the important PCs from the rest [1,2,28,29,34].
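The three selection approaches described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data matrix `X` is synthetic, the 90 % cumulative-variance threshold is an arbitrary choice, and the variable names are hypothetical rather than taken from the cited references.

```python
import numpy as np

# Hypothetical data: 50 samples x 5 variables driven by two latent directions plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.3 * rng.normal(size=(50, 5))

# Step 1: correlation matrix (the covariance matrix of the standardized variables).
R = np.corrcoef(X, rowvar=False)

# The eigenvalues of R are the variances of the PCs; sort them in decreasing order.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Approach 1: cumulative proportion of total variance (threshold is an assumption).
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
threshold = 0.90
n_cumulative = int(np.searchsorted(cumulative, threshold) + 1)

# Approach 2: retain PCs whose variance (eigenvalue) exceeds 1.
n_kaiser = int(np.sum(eigvals > 1.0))

# Approach 3: scree inspection; the drops between successive eigenvalues
# are what one looks at (usually graphically) to find where the curve levels off.
drops = -np.diff(eigvals)

print("eigenvalues:", np.round(eigvals, 3))
print("cumulative proportion:", np.round(cumulative, 3))
print("PCs by cumulative criterion:", n_cumulative)
print("PCs by eigenvalue > 1:", n_kaiser)
print("successive drops:", np.round(drops, 3))
```

The eigenvalues of a correlation matrix sum to the number of variables, which is why an eigenvalue of 1 corresponds to the variance of one standardized variable.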

As in the case of FA, PCA results can be presented either as a scores plot of cases (samples) or as a loading plot of variables, using a combination of the chosen principal components. The scores plot is the projection of the objects (cases/samples) as data points onto the PC dimensions, where both the x- and y-axes are user-selected PCs. The points in the plot represent the original data set. The scores plot can be examined for any pair of principal components; commonly the first two PCs are used, as they represent the directions accounting for the highest fraction of the overall variability in the data set. An initial plot of the data points can make it easy to identify outliers or nonlinearity, and outliers can be removed if their contribution to the variability of the PCs is insignificant. However, removing an outlier can sometimes cause loss of vital information, so it should be done with caution and after careful examination. Generally the first two or three PCs are sensitive to outliers that inflate variances or distort covariances. The last few PCs are equally sensitive to outliers that introduce false dimensions or hide singularities. For these reasons, it is advisable to evaluate scores plots of at least the first two and the last two PCs when investigating the presence of outliers. A similar approach can be taken with loading plots of variables (see below), although the effect is more pronounced in scores plots.
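As a minimal sketch of how scores are obtained and screened, the standardized data can be projected onto the eigenvectors of the correlation matrix and the first two and last two PCs inspected. The data below are synthetic, and the cut-off of three standard deviations is an assumption made for illustration, not a rule from the cited references.

```python
import numpy as np

# Hypothetical data: 40 samples x 4 correlated variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4)) @ rng.normal(size=(4, 4))

# Standardize each variable (ddof=1 matches the sample correlation matrix).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the correlation matrix, sorted by decreasing eigenvalue.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: coordinates of each sample on the PC axes.
scores = Z @ eigvecs

# Pairs that would be drawn as scores plots.
first_two = scores[:, :2]   # directions of largest variance
last_two = scores[:, -2:]   # directions of smallest variance

# Crude outlier screen (assumed rule): scale each PC to unit variance and flag
# samples lying more than 3 standard deviations out on any inspected PC.
scaled = scores / np.sqrt(eigvals)
n_pcs = scores.shape[1]
inspected = [0, 1, n_pcs - 2, n_pcs - 1]
flagged = np.where(np.any(np.abs(scaled[:, inspected]) > 3.0, axis=1))[0]

print("candidate outliers (row indices):", flagged)
```

Flagged samples are only candidates for closer examination; whether to remove them remains a judgment call, as noted above.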

Apart from the advantages mentioned above, a scores plot can be very useful in revealing clustering (grouping) of points. This grouping pattern reflects the multivariate distribution of the data set in the new dimensions and can be interpreted from the location of the points in the bivariate plots. For instance, with the first PC on the x-axis, data points (samples) that exhibit high levels of the first principal component and low levels of the second are displayed in the lower right corner of the plot, while those with low levels of the first and high levels of the second appear in the upper left corner. Points exhibiting equal levels of the two components lie along the diagonal of the plot.

PCA results can also be presented using loading plots of variables and interpreted accordingly. In multivariate methods such as PCA, loading plots are regarded as very important for finding the relevant components and the variables significantly associated with them [36]. In the same way as score vectors are plotted against each other in scores plots, loading vectors are also plotted against each other. Loadings provide information on how the original variables are related to each other and to the principal components, constituting a link between the variable space and the PC space. They show the similarities among variables (inter-variable relationships) and how much each variable contributes to each principal component. Variable loadings can be interpreted in the same way as described above for scores plots. Variables that possess a high loading on the first principal component and a low loading on the second are displayed in the lower right corner of the loading plot, and vice versa. Similarly, variables exhibiting equal loadings on the two components lie along the diagonal of the plot (between the two axes) [1,34].
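The link between variables and PCs can be made concrete with a short sketch. It assumes one common convention, in which a loading is the eigenvector element scaled by the square root of the eigenvalue, so that it equals the correlation between a standardized variable and a PC; some software reports the unscaled eigenvectors as loadings instead. The data and names are hypothetical.

```python
import numpy as np

# Hypothetical data: 60 samples x 4 correlated variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))
n_vars = X.shape[1]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Z @ eigvecs

# Loadings (assumed convention): column k of the eigenvectors times sqrt(eigenvalue k).
# Row j holds the loadings of variable j on every PC.
loadings = eigvecs * np.sqrt(eigvals)

# Check against the empirical correlation between each variable and each PC score.
full_corr = np.corrcoef(np.hstack([Z, scores]), rowvar=False)
var_pc_corr = full_corr[:n_vars, n_vars:]

# Squared loadings: the share of each variable's variance carried by each PC.
contribution = loadings ** 2

print("loadings on PC1 and PC2 (coordinates in a loading plot):")
print(np.round(loadings[:, :2], 3))
print("max |loading - correlation|:", np.abs(loadings - var_pc_corr).max())
```

Under this convention the squared loadings of a variable sum to one over all PCs, and the squared loadings on a PC sum to its eigenvalue, which is one way to read how much each variable contributes to each component.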

Loading plots can also be interpreted in comparison with the scores plots, which gives valuable information by combining what is known about the objects (samples) and the variables associated with the same PCs. In addition, both scores and loadings can be plotted together in a PCA bi-plot, providing integrated information. A PCA bi-plot conveys more information than an ordinary scatter plot because scores and loadings are visualized together: objects are displayed as data points in the two-dimensional space and variables as bi-plot axes, with a separate axis for each variable.

These axes are similar to those of ordinary scatter plots and are calibrated in the original scales of measurement. Unlike the axes of ordinary scatter plots they are not perpendicular, but they are used in a similar way to provide information on all variables in a single graph.

The correlation of variables with objects (samples), and among the variables themselves, can be investigated using one or a combination of the following:

1. The trend in the weight (magnitude) of the variables

2. The size of the angles between the axes

3. The distances between axes

4. The distance between the data points (samples).

Variables whose axes lie in close proximity are expected to be considerably correlated, depending on the size of the angle formed by the axes. Regardless of the type of correlation (positive or negative), the smaller the angle between two axes, the higher the correlation, and when the axes are at 90° to each other their correlation is zero [9,28,37].
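The relation between angle and correlation can be checked numerically. The sketch below assumes that each variable is represented by its loading vector (eigenvector elements scaled by the square roots of the eigenvalues), which is one common bi-plot scaling and not necessarily the calibrated-axis construction used in the cited references. With all PCs retained, the cosine of the angle between two such vectors equals the correlation exactly; in a two-dimensional bi-plot it is only an approximation whose quality depends on how much variance the first two PCs explain. The data are synthetic and the helper `cosine_matrix` is hypothetical.

```python
import numpy as np

# Hypothetical data: 80 samples x 4 correlated variables.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4)) @ rng.normal(size=(4, 4))

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# One row per variable, one column per PC (assumed loading convention).
loadings = eigvecs * np.sqrt(eigvals)


def cosine_matrix(vectors):
    """Cosine of the angle between every pair of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


# All PCs retained: cosines reproduce the correlation matrix.
cos_full = cosine_matrix(loadings)

# Two-dimensional bi-plot: cosines only approximate the correlations.
cos_2d = cosine_matrix(loadings[:, :2])
angles_deg = np.degrees(np.arccos(np.clip(cos_2d, -1.0, 1.0)))

print("correlation matrix:")
print(np.round(R, 3))
print("angles between variable vectors in the PC1-PC2 plane (degrees):")
print(np.round(angles_deg, 1))
print("largest 2-D approximation error:", np.abs(cos_2d - R).max())
```

An angle of 90° gives a cosine of zero, matching the statement that perpendicular axes indicate uncorrelated variables; angles near 0° or 180° correspond to strong positive or negative correlation respectively.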

In document Application of modern chromatographic technologies for the analysis of volatile compounds in South African wines (Page 77-80)