Chapter Eight Self-organization
8.4 Principal component analysis
We have seen that one way of viewing SOM learning is that it accomplishes a dimension reduction. There is another way of performing this, which we first examine in pattern space before seeing its implementation in a network.
Figure 8.27 Clusters for PCA.
Consider the two clusters of two-dimensional patterns shown in Figure 8.27. It is clear in this graphical representation that there are two clusters; this is apparent because we can apprehend the geometry and the two sets have been identified with different marker symbols. However, suppose that we did not have access to this geometric privilege, and that we had to analyze the data on the basis of their (x, y) co-ordinate description alone. One way of detecting structure in the data might be to examine the histograms with respect to each co-ordinate (the number of points lying in a series of groups or bins along the co-ordinate axes), as shown in the top half of Figure 8.28. There is a hint that the histograms are bimodal (have two "humps"), which would indicate two clusters, but this is not really convincing. Further, both histograms have approximately the same width and shape. In order to describe the data more effectively we can take advantage of the fact that the clusters were generated so that their centres lie on a line through the origin at 45° to the x axis.
Now consider a description of the data in another co-ordinate system x', y' obtained from the first by a rotation of the co-ordinate axes through 45°. This is shown schematically in Figure 8.29. The new x' axis lies along the line through the cluster centres, which gives rise to a distinct bimodality in its histogram, as shown in the bottom left of Figure 8.28. Additionally, the new y' histogram is more sharply peaked and is quite distinct from its x' counterpart. We conclude that the structure in the data is now reflected in a more efficient co-ordinate representation.
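The effect of the rotation can be checked numerically. The following is a minimal sketch rather than the procedure used to produce the figures: the cluster centres, spread, sample size and bin count are all assumed values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Two clusters whose centres lie on the line y = x (45 degrees to the x axis).
# Centres, spread and sample size are illustrative assumptions.
centres = np.array([[-1.0, -1.0], [1.0, 1.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centres])

# Describe the same points in axes rotated through 45 degrees:
# x' = x cos(t) + y sin(t),  y' = -x sin(t) + y cos(t).
theta = np.deg2rad(45.0)
rotation = np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])
rotated = points @ rotation.T  # columns hold the x' and y' co-ordinates

var_original = points.var(axis=0)
var_rotated = rotated.var(axis=0)

# Bin counts along the old x axis and the new x' axis.
hist_x, _ = np.histogram(points[:, 0], bins=15)
hist_x_rot, _ = np.histogram(rotated[:, 0], bins=15)

print("variance along x, y:  ", var_original)
print("variance along x', y':", var_rotated)
```

In this sketch the variance along x' should exceed that along either original axis, and the variance along y' should be smaller than both, while the total variance is unchanged by the rotation.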
In a real situation, of course, we are not privy to the way in which the clusters were produced, so that the required angle of rotation is not known. However, there is a simple property of the data in the new co-ordinate system that allows us to find the optimal transformation out of all possible candidates: the variance along one of the new axes is a maximum. In the example this is, of course, the x' axis. A corollary of this is that the variance along the remaining y' axis is reduced. With more dimensions, a "rotation" of the co-ordinate axes is sought such that the bulk of the variance in the data is compressed into as few vector components as possible. Thus, if the axes are labelled in order of decreasing data variability, $x_1$ contains the largest data variance commensurate with a co-ordinate rotation; $x_2$ contains the largest part of the remaining variance; and so forth. The first few vector components in the transformed co-ordinate system are the principal components and the required transformation is obtained by principal component analysis or PCA. This is a standard technique in statistical analysis and details of its operation may be found in any text on multivariate analysis; see, for example, Kendall (1975).
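One standard way of finding the variance-ordered axes, not described in the text itself, is to take the eigenvectors of the data covariance matrix. The sketch below assumes that approach; the function name `principal_axes` and the synthetic two-cluster data are hypothetical and serve only as illustration.

```python
import numpy as np

def principal_axes(data):
    """Return (variances, axes) sorted by decreasing variance.

    `data` holds one pattern per row. Each column of `axes` is a unit
    vector giving one of the new co-ordinate directions.
    """
    centred = data - data.mean(axis=0)
    covariance = np.cov(centred, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    order = np.argsort(eigenvalues)[::-1]                   # largest first
    return eigenvalues[order], eigenvectors[:, order]

# Illustrative data: two clusters centred on the line at 45 degrees.
rng = np.random.default_rng(1)
centres = np.array([[-1.0, -1.0], [1.0, 1.0]])
data = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centres])

variances, axes = principal_axes(data)

# Angle of the first axis to the x axis, folded into [0, 180) because an
# eigenvector's sign is arbitrary.
first_axis = axes[:, 0]
angle = np.degrees(np.arctan2(first_axis[1], first_axis[0])) % 180.0

# Co-ordinates of every pattern in the new system (principal components).
components = (data - data.mean(axis=0)) @ axes

print("variance along each new axis:", variances)
print("angle of first axis (degrees):", angle)
```

For data generated this way the first axis should come out close to 45°, recovering the rotation that was chosen by hand in the example.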
To the extent that the essential information about the dataset is contained in the first few principal components, we have effected a dimension reduction. Thus, in our simple two-dimensional example, the class membership is related directly to the distance along the x' axis. Then, although each point still needs two co-ordinates, the most important feature is contained in the single dimension x'. The situation is reminiscent of the dimension reduction discussed for SOMs. However, the transformation under PCA is linear (since co-ordinate rotation is a linear transformation) whereas dimension reduction on an SOM is not necessarily so. For example, the embedding of the arc in the plane (Fig. 8.22) requires a nonlinear transform to extract the angle around the arc (e.g. $\theta = \tan^{-1}(y/x)$).
Figure 8.29 Rotated co-ordinate axes.
We now attempt to place PCA in a connectionist setting. Figure 8.30 shows a single pattern vector v taken in the context of the two-dimensional example used above. Also shown is the expression for its first principal component $v_{x'}$ along the x' axis, in terms of the angle $\theta$ between v and this axis. Consider now a node whose weight vector w has unit length and is directed along the x' axis. The following now holds for the activation:
$$a = w \cdot v = \|w\|\,\|v\|\cos\theta = \|v\|\cos\theta$$
since $\|w\| = 1$. But
$$v_{x'} = \|v\|\cos\theta$$
so that $a = v_{x'}$. That is, the activation is just the first principal component of the input vector. Thus, if we use a linear node in which the output equals the activation, then the node picks up the most important variation in its input. Further, we may obtain an approximation of v according to
$$v \approx a\,w \qquad (8.15)$$
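The relation between the activation and the first principal component can be verified for a single pattern. In this small sketch the pattern `v` is an arbitrary assumed value, and the weight vector is set by hand along the 45° direction of the running example rather than learned.

```python
import numpy as np

theta = np.deg2rad(45.0)
w = np.array([np.cos(theta), np.sin(theta)])  # unit weight vector along x'
v = np.array([2.0, 1.0])                      # arbitrary example pattern

a = w @ v  # activation of the linear node

# The same quantity from the geometry: ||v|| cos(angle between v and x').
angle_to_axis = theta - np.arctan2(v[1], v[0])
v_x_prime = np.linalg.norm(v) * np.cos(angle_to_axis)

v_approx = a * w         # approximation of v built from the activation alone
residual = v - v_approx  # what is lost: the part of v lying along y'

print("activation a:", a)
print("v_x' from geometry:", v_x_prime)
print("approximation of v:", v_approx)
```

The residual is perpendicular to w, so the approximation discards only the y' co-ordinate and the node's output carries the x' co-ordinate of the pattern.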
This property only becomes useful, however, if we can develop the required weight vector in a neural training regime. This was done by Oja (1982), who showed that it occurs under self-organization with a Hebb-like learning rule
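$$\Delta w = \alpha\,(x y - y^2 w) \qquad (8.16)$$
Here x is the node's input vector (the pattern v above), y is its output and $\alpha$ is the learning rate.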
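A minimal sketch of training a single linear node with this rule follows. The data, learning rate, number of epochs and initial weights are assumed values, and the inputs are centred first because the rule finds the dominant direction of variance only for zero-mean data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: two clusters centred on the line at 45 degrees.
centres = np.array([[-1.0, -1.0], [1.0, 1.0]])
data = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centres])
data -= data.mean(axis=0)  # zero-mean inputs (assumption stated above)

alpha = 0.01                        # learning rate (assumed value)
w = 0.1 * rng.standard_normal(2)    # small random initial weights

for epoch in range(50):
    for x in rng.permutation(data):        # present patterns in random order
        y = w @ x                          # linear node: output = activation
        w += alpha * (y * x - y**2 * w)    # Hebb term minus y^2 w decay

# Compare with the direction known from how the data were generated.
diagonal = np.array([1.0, 1.0]) / np.sqrt(2.0)
alignment = abs(w @ diagonal) / np.linalg.norm(w)

print("learned weight vector:", w)
print("length of w:", np.linalg.norm(w))
print("alignment with 45 degree direction:", alignment)
```

The learned vector may point either way along the 45° line, which is why the comparison uses an absolute value. The $y^2 w$ term in (8.16) is what holds the length of the weight vector near one during training.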
This is similar to (8.5) except that the decay term is $y^2 w$ instead of $y w$. Of course, it is more useful if we can extract principal components other than just the first, and Sanger (1989) has shown how to do this using a single layer of linear units, each one learning one of the component directions as its weight vector. Sanger also gave an application from image compression in which the first eight principal component directions were extracted from image patches of natural scenes. Let the resulting weight vectors be $w_1, \ldots, w_8$ and their outputs in response to an image patch be $y_1, \ldots, y_8$. These outputs are just the principal components and so, under a generalization of (8.15), the original patch may be reconstructed by forming the vector $y_1 w_1 + y_2 w_2 + \cdots + y_8 w_8$. Sanger showed how the components $y_i$ can be efficiently quantized for each patch, without significant loss of image information, so that the number of bits used to store them for the entire image was approximately 4.5 per cent of that used for the original image pixels.
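The text does not state Sanger's learning rule, so the sketch below assumes its usual form (the generalized Hebbian algorithm), in which unit i applies the rule (8.16) to the input after the contributions of units 1 to i have been subtracted. It uses synthetic four-dimensional data and two units rather than image patches and eight units; the data, learning rate and epoch count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic zero-mean data with well separated variances along four hidden
# directions, mixed by a random orthogonal matrix (illustrative assumption).
scales = np.array([3.0, 2.0, 1.0, 0.5])
mixing, _ = np.linalg.qr(rng.standard_normal((4, 4)))
data = (rng.standard_normal((500, 4)) * scales) @ mixing.T
data -= data.mean(axis=0)

n_units = 2
alpha = 0.002                                  # learning rate (assumed)
W = 0.1 * rng.standard_normal((n_units, 4))    # row i is weight vector w_i

for epoch in range(100):
    for x in rng.permutation(data):
        y = W @ x  # outputs of the linear units
        # Row i of the update is y_i * (x - sum over k <= i of y_k w_k).
        W += alpha * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# Reconstruct every pattern as y_1 w_1 + y_2 w_2.
outputs = data @ W.T
reconstruction = outputs @ W
error = np.mean(np.sum((data - reconstruction) ** 2, axis=1))

# Reference directions from the covariance matrix, for comparison only.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))
reference = eigenvectors[:, ::-1][:, :n_units]  # largest variance first
alignment = [abs(W[i] @ reference[:, i]) / np.linalg.norm(W[i])
             for i in range(n_units)]

print("alignment of each unit with its principal direction:", alignment)
print("mean squared reconstruction error:", error)
```

With only two of four components kept, the reconstruction error should be close to the variance carried by the two discarded directions, which is the sense in which a few outputs per pattern can stand in for the full vector.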
Finally, we note that it is possible to extend PCA in a neural setting from a linear to a nonlinear process; see, for example, Karhunen & Joutsensalo (1995).