Howard Mark
THE ERROR-FREE CASE
8.11 PRINCIPAL COMPONENTS: INTRODUCTION
8.11.2.1 Fourier Analysis
As an example of how Fourier analysis works, consider Figure 8.8a–d. Part a presents a spectrum of hard red spring wheat; this is the target function that we will be reconstructing. Parts b–d present the first three lowest frequency sine and cosine waves. To emphasize the parallelism between Fourier analysis and PCA, we will introduce some nomenclature, and call these sine and cosine waves Fourier components, in analogy to principal components. Thus Figure 8.8b–d presents the Fourier components that we will use to approximate the target spectrum. Each Fourier component consists of two parts: a real (cosine) portion and an imaginary (sine) portion. The first Fourier component consists of one-half cycle each of the sine and cosine portions; the nth Fourier component consists of n half-cycles of each portion.
FIGURE 8.8 Target spectrum of wheat that will be reconstructed and three components of each type that will be used in the reconstruction. (a) Spectrum of wheat. (b–d) First three Fourier components. (e) Spectrum of wheat. (f–h) First three principal components of wheat. (Axes: absorbance versus wavelength, 1100 to 2500 nm.)
FIGURE 8.9 Transforms of the wheat spectrum. (a) Fourier transform. (b) Principal component transform.
Figure 8.9a shows the real and imaginary parts of the initial portion (indeed, the first 25 points) of the Fourier transform of the target wheat spectrum. How are these obtained? There are a number of algorithms that can be used to obtain the Fourier transform of any given function. While the algorithm we present here is not normally the one of choice, for expository purposes the following approach is most useful. Basically each point of the Fourier transform represents the “amount” of the corresponding Fourier component in the target spectrum. Thus, if the Fourier components have been normalized as described in Equation (8.21), the first point of the Fourier transform is computed as
$$ R(1) = \sum_{i=1}^{m} T_i C_{1i} \quad \text{(real)} \tag{8.22a} $$
$$ I(1) = \sum_{i=1}^{m} T_i S_{1i} \quad \text{(imag.)} \tag{8.22b} $$
where $R(1)$ and $I(1)$ represent the real and imaginary parts of the first point of the Fourier transform, $T_i$ represents the value of the target spectrum at the ith wavelength, and $C_{1i}$ and $S_{1i}$ represent values of the cosine and sine waves comprising the real and imaginary portions of the first Fourier component at the ith wavelength. Similarly, the nth point of the Fourier transform is computed as
$$ R(n) = \sum_{i=1}^{m} T_i C_{ni} \quad \text{(real)} \tag{8.23a} $$
$$ I(n) = \sum_{i=1}^{m} T_i S_{ni} \quad \text{(imag.)} \tag{8.23b} $$
Again using the nomenclature of PCA to describe a Fourier transform, the various points constituting the Fourier transform would be called the Fourier scores of the target spectrum. This computation is the same as the one by which principal component scores are obtained from principal components; hence the analogy of the names follows. The value of the ith point of the Fourier transform, in other words, the ith Fourier score, is simultaneously the result of the cross-product between the (normalized) Fourier component and the target spectrum (that is how the “amount” is determined), and the proper value by which to scale the corresponding Fourier component to regenerate the target spectrum (that is how the “amount” is used). Figure 8.10 and Figure 8.11 demonstrate this: from the Fourier transform presented in Figure 8.9a, successive approximations to the target spectrum are made by including successively more Fourier components, scaled by the corresponding values of the Fourier transform. Each approximation is overlaid on the target spectrum for comparison.
Figure 8.10a–c shows the results of including first one, then two, then three Fourier components (shown in Figure 8.8b–d) in the reconstruction. Parts a–d of Figure 8.11 show the results of including 5, 10, 15, and 20 Fourier components. Note that in each case the fit shown is the best possible fit that can be achieved with the given number of trigonometric functions. Thus, in Figure 8.10a, the fit shown, poor as it is, is the best that can be achieved with a single sinusoid. (Note that the use of both the sine and the cosine is what creates the phase shift that produces the fit shown, which is obviously better than either one would produce alone. Hence the definition of Fourier transforms in terms of complex numbers, so as to include both real [cosine] and imaginary [sine] parts.)
For the purpose of further comparisons, Figure 8.12b–d and Figure 8.13a–d present the differences between the target spectrum and the various Fourier approximations, using the same numbers of Fourier components as in Figure 8.10 and Figure 8.11. These differences represent the error of the reconstruction; as more components are included in the reconstruction process, the approximation gets better and the total error decreases.
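The score computation of Equation (8.23) and the reconstruction it supports can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target is a synthetic curve rather than the wheat spectrum, the components are whole-cycle sines and cosines plus a constant term (so that they are mutually orthogonal on the sampled points), and we assume that the normalization of Equation (8.21) amounts to scaling each component to unit sum of squares. The function names are our own.

```python
import math

def unit(v):
    """Scale a vector so that its sum of squares equals 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Cross product in the chapter's sense: the sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def fourier_components(m, pairs):
    """Constant term plus `pairs` normalized cosine/sine pairs sampled at m points.

    Assumption: whole cycles over the m points, so the components are orthogonal.
    """
    comps = [unit([1.0] * m)]
    for n in range(1, pairs + 1):
        comps.append(unit([math.cos(2 * math.pi * n * i / m) for i in range(m)]))
        comps.append(unit([math.sin(2 * math.pi * n * i / m) for i in range(m)]))
    return comps

def reconstruct(target, comps):
    """Return the Fourier scores and the approximation built from them."""
    scores = [dot(target, c) for c in comps]        # how the "amount" is determined
    approx = [0.0] * len(target)
    for s, c in zip(scores, comps):                 # how the "amount" is used
        approx = [a + s * x for a, x in zip(approx, c)]
    return scores, approx

def rss(target, approx):
    """Sum of squared differences between the target and its approximation."""
    return sum((t - a) ** 2 for t, a in zip(target, approx))

m = 21
# Synthetic stand-in for a spectrum: a sloping baseline plus one broad band.
target = [0.2 + 0.01 * i + 0.5 * math.exp(-((i - 13) / 3.0) ** 2) for i in range(m)]

# Reconstruction error using 0, 1, ..., 10 cosine/sine pairs.
errors = [rss(target, reconstruct(target, fourier_components(m, k))[1])
          for k in range(11)]
for k, e in enumerate(errors):
    print(f"{k:2d} pairs: RSS = {e:.6f}")
```

With 21 points, the constant term and ten cosine/sine pairs form a complete orthonormal set, so the error should shrink as pairs are added and essentially vanish at the last step.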
8.11.2.2 Principal Component Analysis
In the sense we have used the Fourier components, to construct an approximation to the target spectrum, principal components are just like the Fourier components. This is also demonstrated in Figure 8.8 to Figure 8.13, where corresponding operations are performed exactly in parallel with those that were done using the Fourier components.
Thus, in parts f–h of Figure 8.8 we present the first three principal components of a set of spectra of hard red spring wheat; these correspond exactly to the first three Fourier components (which, as we remember, are sine and cosine functions, shown in parts b–d of that same figure) except that there is no need to represent separately real and imaginary parts of the principal components. As we are interested in them, principal components are always real. Consequently each of the principal components presented in Figure 8.8 is represented by a single curve, as opposed to the two curves needed to represent one Fourier component.
The “amount” of each principal component in any given target spectrum is determined the same way, and has the same meaning, as the amount of the various Fourier components in each target spectrum: after normalizing each principal component using Equation (8.21), the cross product between the normalized principal component and the target spectrum is computed; in principal component nomenclature, this is called the principal component “score” for that principal component and the target spectrum. Computing the score for each principal component results in a series of numbers corresponding to the target spectrum, just as the cross products of the Fourier components with the spectrum resulted in the series of numbers that we called the Fourier transform of the target spectrum. Since the process of obtaining the principal component scores is identical to the process used to obtain the Fourier scores, it would seem that what we have generated by doing this is a mathematical construct that we are fully justified in calling the principal component transform of the target spectrum, just as earlier we were justified in calling the elements of the Fourier transform the Fourier scores. Having recognized this parallelism between the two transform methodologies, it is clear that we can do the same things with the principal component transform as we did with the Fourier transform. For example, in Figure 8.9b, we plot the first 25 points of the principal component transform of the target wheat spectrum, exactly as we plotted the Fourier transform of that spectrum in Figure 8.9a.
FIGURE 8.10 Reconstructions of the target wheat spectrum. (a–c) Reconstructions using one, two, and all three Fourier components, respectively. (d–f) Reconstructions using one, two, and all three principal components. (Axes: absorbance versus wavelength, 1100 to 2500 nm.)
FIGURE 8.11 Reconstructions of the target wheat spectrum. (a–d) Reconstructions using 5, 10, 15, and 20 Fourier components, respectively. (e–h) Reconstructions using 5, 10, 15, and 20 principal components.
Similarly, in Figure 8.10d–f we plot the result of approximating the target wheat spectrum using only the first, then the first two, then the first three principal components (shown in Figure 8.8f–h) and in Figure 8.11e–h we plot the approximations using 5, 10, 15, and 20 principal components, respectively. These may be compared with the corresponding reconstructions using Fourier components shown in the first group of plots in each figure. Also, the corresponding differences from the target spectrum are presented in Figure 8.12f–h and Figure 8.13e–h, where they may be compared with the errors of the Fourier reconstruction shown in the first column of those figures.
FIGURE 8.12 (a and e) Target wheat spectrum. (b–d) Differences between the target spectrum and the approximations using one, two, and three Fourier components, respectively. (f–h) Differences between the target spectrum and the approximations using one, two, and three principal components, respectively.
FIGURE 8.13 (a–d) Differences between the target spectrum and the approximations using 5, 10, 15, and 20 Fourier components, respectively. (e–h) Differences between the target spectrum and the approximations using 5, 10, 15, and 20 principal components, respectively. Note that the scale for the differences between the target spectrum and the reconstruction using principal components causes them to be expanded 10 times compared to the differences of the target spectrum from the Fourier components.
When comparing these differences, care must be taken to note the scales of each part of the figures. All parts of Figure 8.12, except for the target spectrum shown in parts a and e, use the same scale: −0.4 to +0.6 absorbance. In Figure 8.13, however, while the scale for the differences of the Fourier approximations (parts a–d) from the target spectrum is approximately the same as in Figure 8.12 (−0.5 to +0.5 absorbance), the scale for the differences from the principal component approximation (Figure 8.13e–h) expands the plot by a factor of 10 (−0.05 to +0.05).
At this point the only thing that seems to be missing that prevents us from being able to complete the picture of the parallelism between Fourier transforms (and their corresponding Fourier components) and principal component transforms (and their corresponding principal components) is information regarding the origin of the principal components that are presented in Figure 8.8.
8.11.3 WHAT MAKES PRINCIPAL COMPONENTS UNIQUE?
To begin to understand the origin of the functions that represent the principal components, we turn again to the fact mentioned in Section 8.2.2, that is, that for our current purposes we divide all functions into two classes. We also remind ourselves that the Fourier components, from which we generated the Fourier transform, are members of the class of functions that are defined by analytical mathematical expressions.
Principal components, on the other hand, are members of the other class of functions, the class of functions which contains all those functions that are defined empirically and that represent in some sense functions that are arbitrary in that they are not describable by a priori analytic mathematical expressions. Other such functions also exist. An example which is, perhaps, more familiar, is Gram–Schmidt orthogonalization. This approach requires the use of “basis functions;” these also fall into the category of being nonanalytic functions; often the basis functions are spectra of the pure materials that comprise the specimens to be analyzed, and this approach is sometimes called “curve fitting” in NIR spectroscopy. What distinguishes principal components from all the other possible arbitrary functions that can be used for similar purposes?
For the answer to this question, we go back to the definition of principal components as presented in Section 8.2.1. We noted there that principal components are orthogonal and account for the maximum possible variance of the data.
The next question is, if principal components are orthogonal, what are they orthogonal to? The answer is: to each other. Principal components are determined by computing them from the set of data that they then represent, and when principal components are computed, they are all computed in such a manner that the relation:
$$ X_i X_j = 0, \quad i \neq j \tag{8.24} $$
holds for any pair of principal components in the set. This condition is not unique; many sets of functions obey the relation expressed by Equation (8.24) (e.g., the various sine and cosine waves of the Fourier analysis discussed above all follow this rule), but it is one necessary condition. Essentially, it states that all principal components computed from a given dataset should be uncorrelated.
The condition that does uniquely define principal components, and distinguishes them from all other possible ways of expressing any given function as the sum of other orthogonal functions, is part 6 of the definition in Section 8.11.1: the requirement that each principal component should account for the maximum possible amount of variance in the set of data from which it is computed.
This means that if we compute a principal component from a set of spectra, as Figure 8.8f is a principal component for wheat, then we can fit that principal component to each of the spectra in the dataset from which it was computed. Figure 8.10d, for example, shows the first principal component of wheat being fit to a wheat spectrum; all that need be done is to successively consider each of the spectra in the initial set as the target spectrum. Then, scaling the principal component properly, it can be subtracted from each of the spectra in the initial set, just as we subtracted the first principal component from the target spectrum to produce the difference spectrum shown in Figure 8.12f.
Having calculated all these residual spectra (“residual” referring to the difference: it is what “remains” after the principal component is subtracted) we can then compute the following quantity:
$$ \mathrm{RSS} = \sum_i \sum_j R_{ij}^2 \tag{8.25} $$
where $R_{ij}$ is the residual of the ith spectrum at the jth wavelength, the j subscript refers to all the wavelengths in a spectrum, and the i subscript refers to all the spectra in the dataset; that is, the residual sum of squares (RSS) is the sum of squares of all the residuals in the entire dataset.
Having performed this computation for the principal components, we could perform the same computation for the residuals obtained from any other set of functions used to fit the data, for example, the Fourier component that is displayed in Figure 8.8b, with its fit shown in Figure 8.10a and residual shown in Figure 8.12b.
Having computed and compared all these values of RSS we note the following facts: first, the total sum of squares (TSS, that is, $\sum_i \sum_j (X_{ij} - \bar{X}_j)^2$, where $X_{ij}$ is the absorbance of the ith spectrum at the jth wavelength and $\bar{X}_j$ is the mean absorbance at the jth wavelength) for any given set of data is constant (since it depends only on the data and not on any fitting functions); and second, the mathematical/statistical operation known as ANOVA tells us that the TSS is in fact equal to the sums of squares of each contribution. This means that the sum of squares not accounted for by the fitting function (or functions) is equal to the sum of squares of the residuals, that is, to the error of the approximation. Thus, since by their definition, principal components account for the maximum possible amount of variance (which in this case also holds for the sum of squares), the ANOVA guarantees that principal components are also distinguished by the fact that the RSS obtained from using principal components is smaller than that obtained from any other possible set of fitting functions, or, in other words, a principal component (specifically, the first principal component) will fit the data better than any other possible function can.
Furthermore, having fit the spectra with one principal component, the residual spectra can be used to generate a second principal component which will be orthogonal to the first, and will give the smallest RSS of all possible functions that could be used to fit the set of residual spectra resulting from fitting the first principal component. Thus, in this sense, the principal components fit the target functions better with fewer components than any other means of trying to fit and reconstruct the target spectra, and will give a better reconstruction of the target spectra than any other functions will, using the same number of functions. Thus the first two principal components will fit the data better than any possible pair of other functions, the first three principal components will fit better than any possible triplet of other functions, etc.
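The partition of the total sum of squares described above can be checked numerically. The following Python sketch uses a small invented dataset (five "spectra" at four "wavelengths", not real NIR data) and two hand-picked fitting functions; neither is a computed principal component, and all names are our own. It fits each function to the mean-centered spectra and confirms that the sum of squares accounted for plus the RSS equals the TSS, whichever function is used.

```python
import math

def mean_center(spectra):
    """Subtract the mean spectrum, wavelength by wavelength, from each spectrum."""
    n, m = len(spectra), len(spectra[0])
    means = [sum(s[j] for s in spectra) / n for j in range(m)]
    return [[s[j] - means[j] for j in range(m)] for s in spectra]

def fit(centered, func):
    """Fit one function to every spectrum; return (explained SS, residual SS)."""
    norm = math.sqrt(sum(x * x for x in func))
    f = [x / norm for x in func]                   # normalize to unit sum of squares
    explained = 0.0
    residual = 0.0
    for s in centered:
        score = sum(a * b for a, b in zip(s, f))   # "amount" of f in this spectrum
        explained += score * score
        residual += sum((a - score * b) ** 2 for a, b in zip(s, f))
    return explained, residual

# Invented data: each row is roughly a scaled copy of the same underlying shape.
spectra = [
    [0.10, 0.30, 0.50, 0.20],
    [0.12, 0.36, 0.58, 0.22],
    [0.08, 0.25, 0.41, 0.18],
    [0.11, 0.33, 0.55, 0.21],
    [0.09, 0.27, 0.46, 0.19],
]
centered = mean_center(spectra)
tss = sum(x * x for s in centered for x in s)

# Two hand-picked fitting functions (assumptions for illustration only).
explained_a, rss_a = fit(centered, [1.0, 3.0, 5.0, 1.0])    # follows the main variation
explained_b, rss_b = fit(centered, [1.0, -1.0, 1.0, -1.0])  # an arbitrary alternative

print(f"TSS = {tss:.6f}")
print(f"function a: explained + RSS = {explained_a + rss_a:.6f}, RSS = {rss_a:.6f}")
print(f"function b: explained + RSS = {explained_b + rss_b:.6f}, RSS = {rss_b:.6f}")
```

Because the TSS is fixed by the data, the function that accounts for more of it necessarily leaves the smaller RSS; the first principal component is the function for which that RSS is as small as it can be.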
As an example, we can compare the fits obtained using the Fourier components to those obtained using the principal components. Figure 8.10 to Figure 8.13 are set up so that the left-hand part of each figure is directly comparable to the right-hand part. For example, Figure 8.11b and Figure 8.11f each use ten components to reconstruct the target spectrum; the left-hand side shows the reconstruction using ten Fourier components while the right-hand side shows the result of using ten principal components.
Comparing Figure 8.10 to Figure 8.13, it is clear that in all cases the principal components give a closer fit to the target spectrum than the Fourier components do, as long as we compare the result of using one set to the result of using the same number of the other.
Strictly speaking, there is no guarantee that this condition will hold true for any one single spectrum; Equation (8.25) is guaranteed true only for the entire set of data. However, as Figure 8.10 to Figure 8.13 show, in practice Equation (8.25) often holds true on a spectrum-by-spectrum basis also.
Another word of caution in this regard: this characteristic of principal components, to approximate the data better than any other mathematical function that might be used for that purpose, is defined only for the dataset from which the principal components were computed. It may be expected that it will also hold true for other data of the same type as that from which the principal components were created, but that is not guaranteed, and it is certainly not so for different types of data.
8.11.4 ARE PRINCIPAL COMPONENTS BROUGHT BY THE STORK?
The foregoing is all well and good, and explains what the important characteristics of principal components are, but tells us nothing about how they are generated. Principal components are computed according to the following algorithm (or one of its variations):
Consider a set of n spectra, each containing m wavelengths. First, the m arithmetic means of the data are computed, one for each wavelength, then this mean spectrum is subtracted wavelength by wavelength from each of the spectra in the set. An m × m array of cross products is created from each spectrum, after the mean spectrum has been subtracted, such that the i,jth member of the array is the product of the value of the spectrum at the ith wavelength times the value of the spectrum at the jth wavelength. Corresponding terms of the arrays from each spectrum are added together; the resulting array is called the sum of cross products matrix, for obvious reasons. Mathematically, this can be expressed as
$$ \text{X-prod}_{i,j} = \sum_{k=1}^{n} (X_{i,k} - \bar{X}_i)(X_{j,k} - \bar{X}_j) \tag{8.26} $$
The principal components are the eigenvectors of the sum of cross-products matrix; we shall neither prove this result nor discuss how eigenvectors are obtained. The former requires more advanced mathematics than is appropriate for our level of discussion; the latter is somewhat farther removed from NIRA than would warrant discussion here. Both topics are well-described elsewhere, the former in texts of mathematical statistics [10] and the latter in texts about numerical analysis [12].
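The construction of the sum of cross-products matrix in Equation (8.26) can be sketched as follows, following the verbal description step by step: subtract the mean spectrum, form one m × m array per spectrum, and add corresponding terms. The data are invented for illustration and the function names are our own; obtaining the eigenvectors is a separate step that this sketch does not attempt.

```python
def mean_center(spectra):
    """Subtract the mean spectrum, wavelength by wavelength, from each spectrum."""
    n, m = len(spectra), len(spectra[0])
    means = [sum(s[j] for s in spectra) / n for j in range(m)]
    return [[s[j] - means[j] for j in range(m)] for s in spectra]

def cross_products(spectrum):
    """m x m array whose i,jth member is spectrum[i] * spectrum[j]."""
    return [[a * b for b in spectrum] for a in spectrum]

def sum_of_cross_products(spectra):
    """Equation (8.26): add the per-spectrum arrays term by term."""
    centered = mean_center(spectra)
    m = len(centered[0])
    total = [[0.0] * m for _ in range(m)]
    for c in centered:
        arr = cross_products(c)
        for i in range(m):
            for j in range(m):
                total[i][j] += arr[i][j]
    return total

# Invented data: n = 5 spectra, m = 4 wavelengths.
spectra = [
    [0.10, 0.30, 0.50, 0.20],
    [0.12, 0.36, 0.58, 0.22],
    [0.08, 0.25, 0.41, 0.18],
    [0.11, 0.33, 0.55, 0.21],
    [0.09, 0.27, 0.46, 0.19],
]
xprod = sum_of_cross_products(spectra)
for row in xprod:
    print(" ".join(f"{x:9.6f}" for x in row))
```

The resulting matrix is symmetric, and its diagonal holds the sum of squares about the mean at each wavelength, so the diagonal adds up to the total sum of squares of the dataset.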
However, viewed as an eigenvalue problem, principal components have some features of interest.
First of all, we note that eigenvectors are solutions to the equation:
$$ [V][X] = k[V] \tag{8.27} $$
where [X] is the sum of cross-products matrix, [V] is an eigenvector (written here as a row vector) that satisfies the equality, k is a constant that along with [V] satisfies the equality, and [V][X] represents a matrix multiplication.
The eigenvalue (i.e., the constant in Equation (8.27)) has a useful property: The value obtained by dividing the eigenvalue associated with a given eigenvector (principal component) by the sum of all the eigenvalues is the fraction of the variance that is accounted for by that corresponding eigenvector.
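As a hedged illustration of this property, the sketch below finds the largest eigenvalue and its eigenvector for the sum of cross-products matrix of a small invented dataset. It uses power iteration, which is our choice for a short self-contained example and not necessarily the method used in practice or in the references cited above. It also relies on the standard identity that the sum of all the eigenvalues of a matrix equals the sum of its diagonal elements, so the other eigenvalues never need to be computed. All names are our own.

```python
import math

def sum_of_cross_products(spectra):
    """Sum of cross-products matrix of the mean-centered spectra (Equation 8.26)."""
    n, m = len(spectra), len(spectra[0])
    means = [sum(s[j] for s in spectra) / n for j in range(m)]
    centered = [[s[j] - means[j] for j in range(m)] for s in spectra]
    return [[sum(c[i] * c[j] for c in centered) for j in range(m)] for i in range(m)]

def times(matrix, v):
    """Matrix-vector product; the matrix is symmetric, so row or column form agree."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]

def first_eigenpair(matrix, iterations=200):
    """Power iteration: multiply a trial vector by the matrix and renormalize."""
    m = len(matrix)
    v = [1.0 / math.sqrt(m)] * m                  # arbitrary starting vector
    for _ in range(iterations):
        w = times(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    k = sum(a * b for a, b in zip(v, times(matrix, v)))   # eigenvalue estimate
    return k, v

# Invented data: 5 spectra, 4 wavelengths.
spectra = [
    [0.10, 0.30, 0.50, 0.20],
    [0.12, 0.36, 0.58, 0.22],
    [0.08, 0.25, 0.41, 0.18],
    [0.11, 0.33, 0.55, 0.21],
    [0.09, 0.27, 0.46, 0.19],
]
xprod = sum_of_cross_products(spectra)
k, v = first_eigenpair(xprod)

# Sum of all eigenvalues = trace of the matrix = total sum of squares.
trace = sum(xprod[i][i] for i in range(len(xprod)))
fraction = k / trace
print("first principal component:", [round(x, 4) for x in v])
print(f"eigenvalue = {k:.6f}, fraction of variance = {fraction:.4f}")
```

For data like these, in which every spectrum is nearly a scaled copy of one shape, almost all of the variance is expected to fall on the first principal component.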
Since each principal component as it is computed accounts for the maximum possible variance in the data (a corollary of the fact that the RSS is the smallest possible), the first few principal components account for the maximum variance due to effects of real physical phenomena on the spectra. This is so because real phenomena will have systematic effects on the spectra; the systematic effects