
Chapter 2 Testing of Mean Size and Shape

2.4 Principal Component Analysis

Principal component analysis (PCA) is used to analyze the variability in a dataset. Originally, PCA was used as a method of fitting planes by orthogonal least squares. Hotelling (1933) proposed principal components analysis as a method for analyzing

Table 2.2: Mean Size and Shape Test Results for First 5 Landmarks (Systematic Sampling). This table contains the results from the tests of mean size and shape for the first five landmarks on the damaged-undamaged pairs of DNA molecules, where the observations used in the tests were systematically selected from the DNA dataset. The numbers in parentheses are the number of observations skipped between systematically selected observations. For example, the p-values in the column labeled “p-value(100)” correspond to the test of equal mean size and shape where 100 observations were skipped between the selected observations. In this setting, the 100th, 200th, . . ., 2500th observations in each dataset were used for conducting the tests.

Damaged   Undamaged   p-value(100)   p-value(50)   p-value(25)
AFA       AGA         0.277          0.406         0.277
AFC       AGC         0.267          0.554         0.594
AFG       AGG         0.614          0.941         0.109
TFA       TGA         0.218          0.020         0.030
TFC       TGC         0.455          0.991         0.495
TFT       TGT         0.030          0.020         0.020
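
The systematic selection described in the caption can be sketched as follows. This is only an illustration: the function name `systematic_indices` is hypothetical, and the dataset length of 2500 observations is an assumption inferred from the "2500th" observation mentioned in the caption, not a value taken from the original analysis code.

```python
def systematic_indices(n_obs, step):
    """Return the 1-based positions step, 2*step, ... that do not exceed n_obs."""
    return list(range(step, n_obs + 1, step))


# Assumed dataset length of 2500, inferred from the caption's "2500th" observation.
idx_100 = systematic_indices(2500, 100)  # 100th, 200th, ..., 2500th
idx_50 = systematic_indices(2500, 50)
idx_25 = systematic_indices(2500, 25)

print(len(idx_100), len(idx_50), len(idx_25))  # 25 50 100
```

Under this assumption, skips of 100, 50 and 25 select 25, 50 and 100 observations respectively, the same sample sizes that head the columns of Table 2.3.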

Table 2.3: Mean Size and Shape Test Results for All 22 Landmarks (Randomly Selected). This table contains the results from the tests of mean size and shape for all 22 landmarks on the damaged-undamaged pairs of DNA molecules where the observations used in the tests were randomly selected from the DNA dataset. The numbers in the parentheses are the number of observations selected for conducting the test. For example, in the column labeled “p-value(25)”, the p-values in this column correspond to the test of equal mean size and shape where 25 observations were randomly selected from the dataset.

Damaged   Undamaged   p-value(25)   p-value(50)   p-value(100)
AFA       AGA         0.901         0.426         0.040
AFC       AGC         0.089         0.010         0.010
AFG       AGG         0.386         0.188         0.010
TFA       TGA         0.574         0.158         0.010
TFC       TGC         0.792         0.396         0.010
TFT       TGT         0.386         0.010         0.010

Table 2.4: Mean Size and Shape Test Results for All 22 Landmarks (Systematic Sampling). This table contains the results from the tests of mean size and shape for all 22 landmarks on the damaged-undamaged pairs of DNA molecules, where the observations used in the tests were systematically selected from the DNA dataset. The numbers in parentheses are the number of observations skipped between systematically selected observations. For example, the p-values in the column labeled “p-value(100)” correspond to the test of equal mean size and shape where 100 observations were skipped between the selected observations. In this setting, the 100th, 200th, . . ., 2500th observations in each dataset were used for conducting the tests.

Damaged   Undamaged   p-value(100)   p-value(50)   p-value(25)
AFA       AGA         0.693          0.416         0.069
AFC       AGC         0.069          0.040         0.010
AFG       AGG         0.455          0.347         0.010
TFA       TGA         0.208          0.010         0.010
TFC       TGC         0.911          0.436         0.010
TFT       TGT         0.455          0.010         0.020

covariance structures and correlation structures (Flury, 1988). The work in this chapter is based on the ideas of Hotelling in that the goal of this work is to analyze the covariance structures of the DNA molecules. The construction of PCA is described by the following procedure (Morrison, 1976). To correct for variation due to rotation and translation, the PCA of the DNA data is carried out on the vectorized registered data, $X^p_1, X^p_2, \ldots, X^p_n$, where $\mathrm{vec}(X^p_i) = (X^{pT}_{ix}, X^{pT}_{iy}, X^{pT}_{iz})^T$ is a $km \times 1$ vector.
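
As a small illustration of the vectorizing step, the sketch below stacks the x, y and z coordinate columns of a configuration into a single vector of length km. The helper name `vec` mirrors the notation in the text, but the implementation and the toy configuration (k = 2 landmarks, m = 3 dimensions) are assumptions made for the example.

```python
def vec(config):
    """Stack the coordinate columns of a k x m configuration into a list of length k*m."""
    k = len(config)
    m = len(config[0])
    # All x coordinates first, then all y coordinates, then all z coordinates.
    return [config[i][c] for c in range(m) for i in range(k)]


# Hypothetical configuration: k = 2 landmarks in m = 3 dimensions.
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
v = vec(X)
print(v)  # [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
```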

The data $\mathrm{vec}(X^p_1), \mathrm{vec}(X^p_2), \ldots, \mathrm{vec}(X^p_n)$ are assumed to follow a multivariate distribution with mean $\mu$ and covariance $\Sigma$. Next, let $\lambda_1 > \lambda_2 > \cdots > \lambda_r$ be the $r$ largest eigenvalues, where $r < q < n$ and $q$ is the rank of $\Sigma$.

Now, the $j$th principal component can be defined as the linear combination $Y_j = \sum_{i=1}^{q} c_{ij} X^p_i$, where the $c_{ij}$ are the ordered elements of the eigenvector of $\hat{\Sigma}$ corresponding to the $j$th largest eigenvalue $\lambda_j$. Defining the principal components in this way leads to the variance of the $j$th component being $\lambda_j$ and to $\sum_{i=1}^{q} \lambda_i = \mathrm{trace}(\hat{\Sigma})$. Another convenient result of this method is that the importance of the $j$th component is measured by $\lambda_j / \mathrm{trace}(\hat{\Sigma})$.
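
The two properties stated above (the variance of the jth component equals the jth eigenvalue, and the eigenvalues sum to the trace of the estimated covariance matrix) can be checked numerically. The following is a minimal sketch using NumPy on simulated data; the data, the dimensions and all variable names are assumptions for illustration and do not reproduce the DNA analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for n vectorized registered configurations (one per row).
n, km = 200, 6
data = rng.normal(size=(n, km)) @ rng.normal(size=(km, km))

# Sample covariance matrix, playing the role of the estimated covariance.
sigma_hat = np.cov(data, rowvar=False)

# eigh returns eigenvalues in ascending order, so sort them into descending order.
eigvals, eigvecs = np.linalg.eigh(sigma_hat)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# PC scores: project the centred data onto the eigenvectors.
scores = (data - data.mean(axis=0)) @ eigvecs

# Sample variance of each score column, and the importance of each component.
score_var = scores.var(axis=0, ddof=1)
importance = eigvals / np.trace(sigma_hat)

print(np.allclose(score_var, eigvals))
print(np.isclose(eigvals.sum(), np.trace(sigma_hat)))
print(np.round(importance, 3))
```

Both checks rely on using the same divisor (n - 1) for the covariance matrix and for the score variances.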

The three-dimensional visual representation of the results, obtained by applying the inverse of the vectorizing operation, is useful when attempting to determine the variation between landmarks. One example from the DNA data is given in Figure 2.1. The 3d PCA plots are quite similar for each molecule: the pattern of the magnitude of variation between landmarks is broadly consistent across molecules. The landmarks at the top and bottom of the molecule seem to vary more than the landmarks in the middle of the molecule. The plot shows this difference in variation by the length of the lines attached to the landmarks.

Figure 2.1: 3d PCA example. This figure shows the first 3 principal components of the DNA molecule. The lengths of the lines coming from each landmark represent the amount of variation in each direction.
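
One way to obtain per-landmark directions from a principal component is to reshape its eigenvector back into a k x 3 matrix, the inverse of the vectorizing step. The sketch below assumes the eigenvector is ordered as all x coordinates, then all y, then all z; the helper `unvec` and the random unit vector are hypothetical, and any scaling used for the lines in Figure 2.1 is not specified in the text.

```python
import numpy as np


def unvec(v, k, m=3):
    """Inverse of stacking coordinate columns: a vector of length k*m becomes a k x m matrix."""
    return np.asarray(v).reshape((k, m), order="F")


k = 4
rng = np.random.default_rng(1)

# Hypothetical unit-length eigenvector of length km = 12.
e = rng.normal(size=3 * k)
e /= np.linalg.norm(e)

directions = unvec(e, k)                      # row i: (x, y, z) displacement of landmark i
lengths = np.linalg.norm(directions, axis=1)  # relative amount of movement per landmark

print(directions.shape)
```

Landmarks with larger entries in `lengths` move more along that component, which is the kind of contrast between the ends and the middle of the molecule described above.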

Using the 3d PCA plots to analyze differences in variation is rather difficult because the DNA dataset contains twelve molecules. To simplify the analysis, it is helpful to examine sets of 2d plots instead. In this analysis, the first three principal components were examined by plotting the PC scores against each other. The plots in Figure 2.2 show the first three principal components plotted against each other for all twelve DNA molecules. The plots are difficult to read due to the number of molecules and observations. Looking carefully at the third graph, PC2 vs PC3, one can see some differences in the variation among the molecules: the points in the foreground of the plot are more concentrated than the more spread-out points in the background.


Figure 2.2: PCA Plots. This figure shows the three comparisons of the PC scores using all of the DNA data. The plot of PC1 vs PC2 is shown in Panel (a). The plot of PC1 vs PC3 is shown in Panel (b). The plot of PC2 vs PC3 is shown in Panel (c).
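
The three panels correspond to the three pairings of the first three PC scores. A minimal sketch of how such score pairs could be computed is given below, using simulated data in place of the vectorized registered DNA data; the variable names and the use of a singular value decomposition are assumptions, not a description of the original code.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# Simulated stand-in for vectorized registered data (50 observations, 9 coordinates).
data = rng.normal(size=(50, 9))
centred = data - data.mean(axis=0)

# The right singular vectors of the centred data are the eigenvectors of the
# sample covariance matrix, ordered by decreasing variance.
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt[:3].T  # scores on PC1, PC2 and PC3

# One entry per panel: (PC1, PC2), (PC1, PC3), (PC2, PC3).
panels = {(a + 1, b + 1): scores[:, [a, b]] for a, b in combinations(range(3), 2)}

print(list(panels))  # [(1, 2), (1, 3), (2, 3)]
```

Each value in `panels` is an n x 2 array that can be passed to any scatter-plot routine.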

As the major aim of this work is to investigate the differences in size and shape variability between the damaged and undamaged molecules, it is of interest to consider pairwise plots of the PC scores. The pairwise plots were obtained by plotting the PC scores for the damaged and undamaged registered molecules against each other. While not all of the pairs show differences, the pairwise plots in Figure 2.3 suggest differences in variability between the PC scores of the damaged and undamaged molecules. The pair shown in the figure consists of the undamaged molecule, AGC, displayed as circles, and the damaged molecule, AFC, displayed as triangles. The plot of PC1 vs PC2 and the plot of PC2 vs PC3 show that the percent of variability explained by these components differs between the two molecules. As part of this dissertation, we will develop methods for testing for differences between covariance matrices.


Figure 2.3: PCA Plots. This figure shows the three comparisons of the PC scores using only the DNA molecules AFC and AGC. The plot of PC1 vs PC2 is shown in Panel (a). The plot of PC1 vs PC3 is shown in Panel (b). The plot of PC2 vs PC3 is shown in Panel (c). Molecule AFC is the damaged version of molecule AGC. The AFC molecule is displayed as red triangles. The AGC molecule is displayed as black circles.
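
A simple numerical companion to such a pairwise plot is to compare the variance of the PC scores within each group. The sketch below makes several assumptions that are not stated in the text: the PCA is fitted to the pooled data of the two groups, the data are simulated, and the "damaged" group is deliberately given a larger spread. It is a descriptive comparison only, not a formal test of equal covariance matrices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-ins; the "damaged" group is given a larger spread on purpose.
undamaged = rng.normal(scale=1.0, size=(100, 6))
damaged = rng.normal(scale=2.0, size=(100, 6))

# Assumed design: fit the principal components to the pooled data.
pooled = np.vstack([undamaged, damaged])
centre = pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled - centre, full_matrices=False)

# Scores of each group on the first three pooled PCs.
s_und = (undamaged - centre) @ vt[:3].T
s_dam = (damaged - centre) @ vt[:3].T

# Per-group variance along each PC; unequal values hint at different variability.
var_und = s_und.var(axis=0, ddof=1)
var_dam = s_dam.var(axis=0, ddof=1)

print(np.round(var_dam / var_und, 2))
```

Ratios far from one along a component correspond to the visibly different spreads of the triangles and circles in a panel.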