Principal Component Analysis
3. and the kth principal component projection (vector) is
\[
\begin{pmatrix}
 0.1418 &  0.0314 &  0.0231 & -0.1032 & -0.0185 &  0.0843 \\
 0.0314 &  0.1303 &  0.1084 &  0.2158 &  0.1050 & -0.2093 \\
 0.0231 &  0.1084 &  0.1633 &  0.2841 &  0.1300 & -0.2405 \\
-0.1032 &  0.2158 &  0.2841 &  2.0869 &  0.1645 & -1.0370 \\
-0.0185 &  0.1050 &  0.1300 &  0.1645 &  0.6447 & -0.5496 \\
 0.0843 & -0.2093 & -0.2405 & -1.0370 & -0.5496 &  1.3277
\end{pmatrix}. \tag{2.5}
\]
The eigenvalues and eigenvectors of the covariance matrix in (2.5) are given in Table 2.1, starting with the first (and largest) eigenvalue. The entries for each eigenvector show the contribution or weight of each variable: for example, $\eta_2$ has the entry 0.5631 for the fourth variable $X_4$.
The eigenvalues decrease quickly: the second is less than one-third of the first, and the last two eigenvalues are about 3 and 1 per cent of the first and therefore seem to be negligible.
An inspection of the eigenvectors shows that the first eigenvector has highest absolute weights for variables $X_4$ and $X_6$, but these two weights have opposite signs. The second eigenvector points most strongly in the direction of $X_5$ and also has large weights for $X_4$ and $X_6$. The third eigenvector, again, singles out variables $X_5$ and $X_6$, and the remaining three eigenvectors have large weights for the variables $X_1$, $X_2$ and $X_3$. Because the last three eigenvalues are considerably smaller than the first two, we conclude that variables $X_4$ to $X_6$ contribute more to the variance than the other three.
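The statements about the eigenvalues and the first eigenvector can be checked numerically. The following is a minimal sketch that assumes the matrix in (2.5) is entered with the variables in the order $X_1, \ldots, X_6$; the names `cov`, `ratios` and `top_two` are illustrative only.

```python
import numpy as np

# Covariance matrix as printed in (2.5); rows/columns correspond to X1..X6.
cov = np.array([
    [ 0.1418,  0.0314,  0.0231, -0.1032, -0.0185,  0.0843],
    [ 0.0314,  0.1303,  0.1084,  0.2158,  0.1050, -0.2093],
    [ 0.0231,  0.1084,  0.1633,  0.2841,  0.1300, -0.2405],
    [-0.1032,  0.2158,  0.2841,  2.0869,  0.1645, -1.0370],
    [-0.0185,  0.1050,  0.1300,  0.1645,  0.6447, -0.5496],
    [ 0.0843, -0.2093, -0.2405, -1.0370, -0.5496,  1.3277],
])

# eigh returns eigenvalues in ascending order; reorder so the largest is first.
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Each eigenvalue relative to the first one.
ratios = evals / evals[0]
print("eigenvalues:", np.round(evals, 4))
print("relative to first:", np.round(ratios, 3))

# Indices (1-based) of the two variables with largest absolute weight in eta_1.
top_two = np.argsort(np.abs(evecs[:, 0]))[::-1][:2] + 1
print("dominant variables in first eigenvector:", sorted(top_two.tolist()))
```

The sign of each computed eigenvector is arbitrary, so only absolute weights and relative signs within one eigenvector should be compared with Table 2.1.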
2.3 Sample Principal Components
In this section we consider definitions for samples of independent random vectors. Let $X = [X_1 \; X_2 \; \cdots \; X_n]$ be $d \times n$ data. In general, we do not know the mean and covariance structure of $X$, so we will, instead, use the sample mean $\bar{X}$, the centred data $X_{\mathrm{cent}}$, the sample covariance matrix $S$, and the notation $X \sim \mathrm{Sam}(\bar{X}, S)$ as in (1.6) through (1.10) of Section 1.3.2.
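As an illustration of these sample quantities, the sketch below computes the sample mean, the centred data and the sample covariance matrix for $d \times n$ data whose columns are the observations. It assumes the $1/(n-1)$ normalisation for $S$; the function name `sample_summaries` and the simulated data are hypothetical.

```python
import numpy as np

def sample_summaries(X):
    """Return sample mean, centred data and sample covariance of d x n data X."""
    d, n = X.shape
    x_bar = X.mean(axis=1, keepdims=True)   # d x 1 sample mean
    X_cent = X - x_bar                      # d x n centred data
    S = X_cent @ X_cent.T / (n - 1)         # d x d; assumes the 1/(n-1) convention
    return x_bar, X_cent, S

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 50))                # hypothetical data: d = 3, n = 50
x_bar, X_cent, S = sample_summaries(X)
print(x_bar.shape, X_cent.shape, S.shape)
```

With rows as variables and columns as observations, `np.cov(X)` returns the same matrix, which gives a convenient cross-check.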
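Let $r \le d$ be the rank of $S$, and let
\[
S = \hat{\Gamma} \hat{\Lambda} \hat{\Gamma}^{T}
\]
be the spectral decomposition of $S$, as in (1.26) of Section 1.5.3, with eigenvalue–eigenvector pairs $(\hat{\lambda}_j, \hat{\eta}_j)$. Here $\hat{\Lambda}$ is the diagonal matrix of the eigenvalues, and the columns of $\hat{\Gamma}$ are the eigenvectors. For $q \le d$, we use the submatrix notation $\hat{\Gamma}_q$ and $\hat{\Lambda}_q$, similar to the population case. For details, see (1.21) in Section 1.5.2.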
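The spectral decomposition and the submatrix notation can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed procedure: the eigenvalues are sorted in decreasing order, the rank is determined numerically with a tolerance `tol` chosen for this example, and the helper name `spectral_decomposition` is hypothetical.

```python
import numpy as np

def spectral_decomposition(S, tol=1e-10):
    """Eigenvalues (descending), eigenvectors (as columns) and numerical rank of symmetric S."""
    evals, evecs = np.linalg.eigh(S)            # ascending order for symmetric input
    order = np.argsort(evals)[::-1]             # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    rank = int(np.sum(evals > tol * max(evals[0], 1.0)))
    return evals, evecs, rank

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 30))                    # hypothetical d x n data
S = np.cov(X)
lam, Gamma, r = spectral_decomposition(S)

q = 2
Gamma_q = Gamma[:, :q]                          # d x q: first q eigenvectors
Lambda_q = np.diag(lam[:q])                     # q x q: first q eigenvalues
print(r, Gamma_q.shape, Lambda_q.shape)
```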
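Definition 2.3 Consider the random sample $X = [X_1 \; X_2 \; \cdots \; X_n] \sim \mathrm{Sam}(\bar{X}, S)$ with sample mean $\bar{X}$ and sample covariance matrix $S$ of rank $r$. Let $S = \hat{\Gamma} \hat{\Lambda} \hat{\Gamma}^{T}$ be the spectral decomposition of $S$, and let $\hat{\Gamma}_k$ consist of the first $k$ columns of $\hat{\Gamma}$. Consider $k = 1, \ldots, r$.
1. The kth principal component score of $X$ is the row vector
\[
W_{\bullet k} = \hat{\eta}_k^{T} (X - \bar{X});
\]
2. the principal component data $W^{(k)}$ consist of the first $k$ principal component vectors $W_{\bullet j}$, with $j = 1, \ldots, k$, and
\[
W^{(k)} =
\begin{bmatrix} W_{\bullet 1} \\ \vdots \\ W_{\bullet k} \end{bmatrix}
= \hat{\Gamma}_k^{T} (X - \bar{X}); \tag{2.6}
\]
3. and the $d \times n$ matrix of the kth principal component projections $P_{\bullet k}$ is
\[
P_{\bullet k} = \hat{\eta}_k \hat{\eta}_k^{T} (X - \bar{X}) = \hat{\eta}_k W_{\bullet k}. \tag{2.7}
\]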
The row vector
\[
W_{\bullet k} = [\, W_{1k} \;\; W_{2k} \;\; \cdots \;\; W_{nk} \,] \tag{2.8}
\]
has $n$ entries: the kth scores of all $n$ observations. The first subscript of $W_{\bullet k}$, here written as $\bullet$, runs over all $n$ observations, and the second subscript, $k$, refers to the kth dimension or component. Because $W_{\bullet k}$ is the vector of scores of all observations, we write it as a row vector. The $k \times n$ matrix $W^{(k)}$ follows the same convention as the data $X$: the rows correspond to the variables or dimensions, and each column corresponds to an observation.
Next, we consider $P_{\bullet k}$. For each $k$, $P_{\bullet k} = \hat{\eta}_k W_{\bullet k}$ is a $d \times n$ matrix. The columns of $P_{\bullet k}$ are the kth principal component projections or projection vectors, one column for each observation. The $n$ columns share the same direction, $\hat{\eta}_k$; however, the values of their entries differ as they reflect each observation's contribution to a particular direction.
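The three quantities of Definition 2.3 can be computed directly from their defining formulas. The sketch below is one possible implementation, assuming $d \times n$ data with observations as columns and the $1/(n-1)$ convention for $S$; the function name `sample_pca` and the simulated data are hypothetical.

```python
import numpy as np

def sample_pca(X, k):
    """Scores, PC data and projections for d x n data X (columns are observations)."""
    x_bar = X.mean(axis=1, keepdims=True)
    X_cent = X - x_bar                           # X minus the sample mean
    S = np.cov(X)                                # sample covariance, 1/(n-1) convention
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]              # largest eigenvalue first
    evals, evecs = evals[order], evecs[:, order]
    eta_k = evecs[:, [k - 1]]                    # d x 1: kth eigenvector
    W_k = eta_k.T @ X_cent                       # 1 x n: kth PC score (row vector)
    W_data = evecs[:, :k].T @ X_cent             # k x n: PC data, as in (2.6)
    P_k = eta_k @ W_k                            # d x n: kth PC projections, as in (2.7)
    return evals, W_k, W_data, P_k

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 200))                    # hypothetical data: d = 5, n = 200
evals, W_k, W_data, P_k = sample_pca(X, k=3)
print(W_k.shape, W_data.shape, P_k.shape)        # (1, 200) (3, 200) (5, 200)
```

Because every column of `P_k` is a multiple of the same eigenvector, `P_k` has rank one, and the sample covariance matrix of `W_data` is diagonal with the first $k$ eigenvalues on the diagonal.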
Before we move on, we compare the population case and the random sample case. Table 2.2 summarises related quantities for a single random vector and a sample of size $n$.
We can think of the population case as the ideal, where truth is known. In this case, we establish properties pertaining to the single random vector and its distribution. For data, we generally do not know the truth, and the best we have are estimators such as the sample mean and sample covariance, which are derived from the available data. From the strong law of large numbers, we know that the sample mean converges to the true mean, and the sample covariance matrix converges to the true covariance matrix as the sample size increases.
In Section 2.7 we examine how the behaviour of the sample mean and covariance matrix affects the convergence of the eigenvalues, eigenvectors and principal components.
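The convergence of the sample mean and sample covariance matrix can be illustrated by simulation. The sketch below is only an illustration: the true mean and covariance matrix are hypothetical values chosen for this example, the data are bivariate normal, and the errors are measured in the Euclidean and Frobenius norms.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([2.0, -1.0])                        # hypothetical true mean
Sigma = np.array([[2.0, -0.5], [-0.5, 1.0]])      # hypothetical true covariance matrix

def estimation_errors(n):
    """Errors of the sample mean and sample covariance for a sample of size n."""
    X = rng.multivariate_normal(mu, Sigma, size=n).T    # d x n data
    mean_err = np.linalg.norm(X.mean(axis=1) - mu)      # Euclidean norm
    cov_err = np.linalg.norm(np.cov(X) - Sigma)         # Frobenius norm
    return mean_err, cov_err

errors = {n: estimation_errors(n) for n in (100, 10_000, 1_000_000)}
for n, (m_err, c_err) in errors.items():
    print(f"n={n:>9,d}  mean error={m_err:.4f}  covariance error={c_err:.4f}")
```

For a single simulated sample the errors need not decrease monotonically, but they shrink roughly at the rate $1/\sqrt{n}$ on average.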
The relationship between the population and sample quantities in the table suggests that the population random vector could be regarded as one of the columns of the sample – a reason for writing the data as a d× n matrix.
Table 2.2 Relationships of population and sample principal components

                       Population           Random sample
Random vectors         X        d × 1       X        d × n
kth PC score           W_k      1 × 1       W_•k     1 × n
PC vector/data         W^(k)    k × 1       W^(k)    k × n
kth PC projection      P_k      d × 1       P_•k     d × n
Figure 2.2 Two-dimensional simulated data of Example 2.3 (left panel) and PC data (right panel).
The following two examples show sample PCs for a randomly generated sample with mean and covariance matrix as in Example 2.1 and for five-dimensional flow cytometric measurements.
Example 2.3 The two-dimensional simulated data consist of 250 vectors $X_i$ from the bivariate normal distribution with mean $[2, -1]^{T}$ and covariance matrix as in (2.4). For these data, the sample covariance matrix is
\[
S = \begin{pmatrix} 2.2790 & -0.3661 \\ -0.3661 & 0.8402 \end{pmatrix}.
\]
The two sample eigenvalue–eigenvector pairs of $S$ are
\[
(\hat{\lambda}_1, \hat{\eta}_1) = \left( 2.3668, \begin{bmatrix} -0.9724 \\ 0.2332 \end{bmatrix} \right)
\quad \text{and} \quad
(\hat{\lambda}_2, \hat{\eta}_2) = \left( 0.7524, \begin{bmatrix} -0.2332 \\ -0.9724 \end{bmatrix} \right).
\]
The left panel of Figure 2.2 shows the data, and the right panel shows the PC data $W^{(2)}$, with $W_{\bullet 1}$ on the x-axis and $W_{\bullet 2}$ on the y-axis. We can see that the PC data are centred and rotated, so their first major axis is the x-axis, and the second axis agrees with the y-axis.
The data are simulations based on the population case of Example 2.1. The calculations show that the sample eigenvalues and eigenvectors are close to the population quantities.
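The eigenvalue–eigenvector pairs reported in Example 2.3 can be reproduced from the sample covariance matrix alone. The sketch below uses only the printed values of $S$; since eigenvectors are determined only up to sign, the computed vectors are flipped to match the signs reported in the example.

```python
import numpy as np

# Sample covariance matrix of Example 2.3, as printed in the text.
S = np.array([[ 2.2790, -0.3661],
              [-0.3661,  0.8402]])

evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]                 # largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

# Reported eigenvectors as columns; used only to fix the arbitrary signs.
reported = np.array([[-0.9724, -0.2332],
                     [ 0.2332, -0.9724]])
signs = np.sign(np.sum(evecs * reported, axis=0))
evecs = evecs * signs

print("eigenvalues:", np.round(evals, 4))
print("eigenvectors (columns):")
print(np.round(evecs, 4))

# Covariance matrix of the PC data W^(2): diagonal with the eigenvalues as entries,
# which is why the PC data are aligned with the coordinate axes.
cov_pc = evecs.T @ S @ evecs
print(np.round(cov_pc, 4))
```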
Example 2.4 The HIV flow cytometry data of Rossini, Wan, and Moodie (2005) consist of fourteen subjects: five are HIV+, and the remainder are HIV−. Multiparameter flow cytometry allows the analysis of cell surface markers on white blood cells with the aim of finding cell subpopulations with similar combinations of markers that may be used for diagnostic purposes. Typically, five to twenty quantities, based on the markers, are measured on tens of thousands of blood cells. For an introduction and background, see Givan (2001).

Table 2.3 Eigenvalues and eigenvectors of HIV+ and HIV− data from Example 2.4

HIV+
λ           12,118     8,818     4,760     1,326       786
η    FS     0.1511    0.3689    0.7518   −0.0952   −0.5165
     SS     0.1233    0.1448    0.4886   −0.0041    0.8515
     CD3    0.0223    0.6119   −0.3376   −0.7101    0.0830
     CD8   −0.7173    0.5278   −0.0332    0.4523    0.0353
     CD4    0.6685    0.4358   −0.2845    0.5312   −0.0051

HIV−
λ           13,429     7,114     4,887     1,612       598
η    FS     0.1456    0.5765   −0.6512    0.1522    0.4464
     SS     0.0860    0.2336   −0.3848   −0.0069   −0.8888
     CD3    0.0798    0.4219    0.4961    0.7477   −0.1021
     CD8   −0.6479    0.5770    0.2539   −0.4273   −0.0177
     CD4    0.7384    0.3197    0.3424   −0.4849    0.0110
Each new marker potentially leads to a split of a subpopulation into parts, and the discovery of these new parts may lead to a link between markers and diseases. Of special interest are the number of modes and associated clusters, the location of the modes and the relative size of the clusters. New technologies allow, and will continue to allow, the collection of more parameters, and thus flow cytometry measurements provide a rich source of multidimensional data.
We consider the first and second subjects, who are HIV+ and HIV−, respectively. These two subjects have five measurements on 10,000 blood cells: forward scatter (FS), side scatter (SS), and the three intensity measurements CD4, CD8 and CD3, called colours or parameters, which arise from different antibodies and markers. The colours CD4 and CD8 are particularly important for differentiating between HIV+ and HIV− subjects because it is known that the level of CD8 increases and that of CD4 decreases with the onset of HIV+. Figure 1.1 of Section 1.2 shows plots of the three colours for these two subjects. The two plots look different, but it is difficult to quantify these differences from the plots.
A principal component analysis of these data leads to the eigenvalues and eigenvectors of the two sample covariance matrices in Table 2.3. The second column in the table shows the variable names, and I list the eigenvectors in the same column as their corresponding eigenvalues. The eigenvalues decrease quickly and at about the same rate for the HIV+ and HIV− data. The first eigenvectors of the HIV+ and HIV− data have large weights of opposite signs in the variables CD4 and CD8. For HIV+, the largest contribution to the first principal component is CD8, whereas for HIV−, the largest contribution is CD4. This change reflects the shift from CD4 to CD8 with the onset of HIV+. For PC2 and HIV+, CD3 becomes important, whereas FS and CD8 have about the same high weights for the HIV− data.
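The comparison of the two subjects can be made concrete with the values of Table 2.3. The sketch below copies the table entries, computes each eigenvalue's share of the total variance, and identifies the variable with the largest absolute weight in the first eigenvector; the names `summarise`, `share` and `top` are illustrative only.

```python
import numpy as np

variables = ["FS", "SS", "CD3", "CD8", "CD4"]

# Values copied from Table 2.3; the columns of each eta matrix are the eigenvectors.
lam_pos = np.array([12118, 8818, 4760, 1326, 786], dtype=float)
eta_pos = np.array([
    [ 0.1511, 0.3689,  0.7518, -0.0952, -0.5165],
    [ 0.1233, 0.1448,  0.4886, -0.0041,  0.8515],
    [ 0.0223, 0.6119, -0.3376, -0.7101,  0.0830],
    [-0.7173, 0.5278, -0.0332,  0.4523,  0.0353],
    [ 0.6685, 0.4358, -0.2845,  0.5312, -0.0051],
])
lam_neg = np.array([13429, 7114, 4887, 1612, 598], dtype=float)
eta_neg = np.array([
    [ 0.1456, 0.5765, -0.6512,  0.1522,  0.4464],
    [ 0.0860, 0.2336, -0.3848, -0.0069, -0.8888],
    [ 0.0798, 0.4219,  0.4961,  0.7477, -0.1021],
    [-0.6479, 0.5770,  0.2539, -0.4273, -0.0177],
    [ 0.7384, 0.3197,  0.3424, -0.4849,  0.0110],
])

def summarise(label, lam, eta):
    """Print variance shares and the dominant variable of the first eigenvector."""
    share = lam / lam.sum()                               # proportion of total variance
    top = variables[int(np.argmax(np.abs(eta[:, 0])))]    # largest absolute weight in eta_1
    print(f"{label}: variance shares {np.round(share, 3)}, largest weight in PC1: {top}")
    return share, top

share_pos, top_pos = summarise("HIV+", lam_pos, eta_pos)
share_neg, top_neg = summarise("HIV-", lam_neg, eta_neg)
```

The dominant variable of the first eigenvector is CD8 for the HIV+ subject and CD4 for the HIV− subject, in agreement with the discussion of Table 2.3.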
Figure 2.3 Principal component scores for HIV+ data (top row) and HIV− data (bottom row) of Example 2.4. In each row, PC1 is on the x-axis, plotted against PC2 (left panel) and PC3 (right panel).
Figure 2.3 shows plots of the principal component scores for both data sets. The top row displays plots in blue which relate to the HIV+ data: PC1 on the x-axis against PC2 on the left and against PC3 on the right. The bottom row shows similar plots, but in grey, for the HIV− data. The patterns in the two PC1/PC2 plots are similar; however, the ‘grey’ PC1/PC3 plot exhibits a small fourth cluster in the top right corner which has almost disappeared in the corresponding ‘blue’ HIV+ plot. The PC1/PC3 plots suggest that the cluster configurations of the HIV+ and HIV− data could be different.
A comparison with Figure 1.2 of Section 1.2, which depicts the three-dimensional score plots PC1, PC2 and PC3, shows that the information contained in both sets of plots is similar. In the current figures we can see more easily which principal components are responsible for the extra cluster in the HIV− data, and we also note that the orientation of the main cluster in the PC1/PC3 plot differs between the two subjects. In contrast, the three-dimensional views of Figure 1.2 lend themselves more readily to a spatial interpretation of the clusters.
The example allows an interesting interpretation of the first eigenvectors of the HIV+ and HIV− data: the largest weight of $\eta_1$ of the HIV− data, associated with CD4, has decreased