Chapter 2 Testing of Mean Size and Shape
2.5 Canonical Variate Analysis
Canonical Variate Analysis (CVA) is a multivariate statistical method that reduces the dimension of a dataset in a way designed to show differences between groups of points. CVA is a special case of Canonical Correlation Analysis (Hawkins, 1982), which was developed by Hotelling (1936) as a generalization of Principal Components Analysis (PCA) (Timm, 2002). The key difference between CVA and PCA is that CVA transforms the variates of interest such that the swarm of points becomes a sphere, whereas PCA simply rotates the coordinate axes to be parallel to the principal axes of the ellipsoid of the swarm of sample means. Under CVA, the within-group random error variation is standardized to be the same in every direction (Falkenhagen and Nash, 1978).
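The standardization described above is often written as a generalized eigenproblem. The following is a sketch in notation introduced here for illustration only and not taken from the cited sources: $W$ denotes the pooled within-group covariance matrix, $B$ the between-group covariance matrix, $a_k$ the $k$th canonical vector, and $\lambda_k$ the corresponding eigenvalue.

```latex
\[
  B\,a_k = \lambda_k\, W\,a_k ,
  \qquad
  a_k^{\top} W\, a_j = \delta_{kj} ,
  \qquad
  z_k = a_k^{\top} x .
\]
```

Under this scaling the canonical variate scores $z_k$ have unit within-group variance and are mutually uncorrelated within groups, which is one way to read the statement that the swarm of points becomes a sphere.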
The method used to analyze the DNA dataset in this section uses Linear Discriminant Analysis (LDA) as the specific form of CVA. LDA was first introduced as a solution to classifying observations into one of two groups (Fisher, 1936). The method is implemented by “shapes.cva” in the R “shapes” package (Dryden, 2013). To make the plots easy to understand, the data used in this section have been thinned from the original 2500 observations to 50 observations by taking every 50th observation in the original datasets. That is, the data used in this section are $X_{\text{thinned}} = (X_{50}, X_{100}, \ldots, X_{2500})$, where $X_i$ is the $i$th observation of the original dataset.
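The thinning rule can be sketched in a few lines of Python. This is only an illustration of the indexing, under the assumption that the observations are stored in order; the list `observations` is a hypothetical stand-in that holds the labels 1 to 2500 rather than the actual DNA configurations.

```python
# Hypothetical stand-in for one original dataset: 2500 observations,
# represented here only by their 1-based labels 1, 2, ..., 2500.
n_obs = 2500
step = 50
observations = list(range(1, n_obs + 1))

# Python lists are 0-based, so observation i sits at index i - 1.
# Starting at index step - 1 and striding by step keeps
# observations 50, 100, ..., 2500.
thinned = observations[step - 1::step]

print(len(thinned), thinned[0], thinned[-1])
```

Starting the slice at index `step - 1` rather than 0 is what makes the first retained observation the 50th, matching the definition of the thinned data in the text.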
In the previous PCA, Figures 2.2 and 2.3 were somewhat difficult to read. The analysis in this section shows CVA results for the thinned data in an effort to make the results more interpretable. Figure 2.4 shows the canonical variate analysis of molecules AFC, AGC, AFA, AGA, AFG, and AGG. The numbers in the plot correspond to the molecules in this list. For example, the “1”s represent the AFC molecules, the “2”s represent the AGC molecules, and so forth. This figure shows a separation of the “1”s and “2”s but no clear separation of the “3”s and “4”s or the “5”s and “6”s. This suggests a pairwise difference for the AFC and AGC molecules but not for the AFA-AGA or AFG-AGG molecule pairs.
Figure 2.4: Canonical Variate Analysis of Axx Molecules. This figure shows the canonical variate analysis of molecules AFC, AGC, AFA, AGA, AFG, and AGG. The numbers in the plot correspond to the molecules in this list. For example, the “1”s represent the AFC molecules, the “2”s represent the AGC molecules, and so forth. This figure shows a separation of the “1”s and “2”s but no clear separation of the “3”s and “4”s or the “5”s and “6”s.
A further investigation of Figure 2.4 is given in Figure 2.5, which shows the canonical variate analysis of the AFC-AGC, AFA-AGA, and AFG-AGG molecule pairs. The “1”s in Figure 2.5 represent the damaged molecules and the “2”s represent the undamaged molecules. The figure shows a clear separation of the groups for the AFC-AGC pair in Panel (a) but not for the AFA-AGA and AFG-AGG pairs in Panels (b) and (c), respectively.
Figure 2.5: Canonical Variate Analysis of the AFC-AGC, AFA-AGA, and AFG-AGG molecule pairs. This figure shows the canonical variate analysis of the AFC-AGC pair in Panel (a), the AFA-AGA pair in Panel (b), and the AFG-AGG pair in Panel (c). The “1”s in each figure represent the damaged molecule and the “2”s represent the undamaged molecule of the pairs. For example, the “1”s in Panel (a) represent the AFC molecules and the “2”s represent the AGC molecules.
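When only two groups are compared, as in the pairwise analyses, there is a single canonical variate, and it coincides with Fisher's two-group discriminant direction. The following is a minimal sketch of that computation in two dimensions using only the Python standard library. It is not the “shapes.cva” implementation; the function names and the toy coordinates are invented for illustration and are not the DNA data.

```python
# Two-group Fisher linear discriminant in two dimensions.
# Toy coordinates and all function names are hypothetical illustrations.

def mean(points):
    """Coordinate-wise mean of a list of 2D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 sum of outer products of deviations from the group mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(group1, group2):
    """Return a = W^{-1}(m1 - m2), rescaled so that a' W a = 1."""
    m1, m2 = mean(group1), mean(group2)
    s1, s2 = scatter(group1, m1), scatter(group2, m2)
    dof = len(group1) + len(group2) - 2
    # Pooled within-group covariance matrix W.
    w = [[(s1[i][j] + s2[i][j]) / dof for j in range(2)] for i in range(2)]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    inv = [[w[1][1] / det, -w[0][1] / det],
           [-w[1][0] / det, w[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    a = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Rescale so the scores have unit pooled within-group variance.
    q = sum(a[i] * w[i][j] * a[j] for i in range(2) for j in range(2))
    return [x / q ** 0.5 for x in a]

def scores(points, a):
    """Project each point onto the discriminant direction a."""
    return [a[0] * p[0] + a[1] * p[1] for p in points]

group1 = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.4), (1.2, 2.1)]
group2 = [(4.0, 4.2), (4.5, 3.9), (5.0, 4.6), (4.2, 4.4)]

a = fisher_direction(group1, group2)
s1 = scores(group1, a)
s2 = scores(group2, a)
print(s1)
print(s2)
```

For these toy groups the two sets of scores do not overlap, which is the one-dimensional analogue of the clear separation described for Panel (a). Real shape data would first need to be registered and converted to suitable coordinates, a step this sketch does not cover.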
Similar to the analysis of the Axx molecules in Figure 2.4, Figure 2.6 shows the canonical variate analysis of molecules TFC, TGC, TFA, TGA, TFT, and TGT. The numbers in this plot correspond to the molecules in this list. For example, the “1”s represent the TFC molecules, the “2”s represent the TGC molecules, and so forth. This figure does not show clear separation in the molecule pairs, but a closer pairwise examination of the molecules suggests pairwise separation for the TFC-TGC pair and the TFA-TGA pair. The pairwise analysis is given in Figure 2.7, which shows the TFC-TGC pair in Panel (a), the TFA-TGA pair in Panel (b), and the TFT-TGT pair in Panel (c). The “1”s in each panel represent the damaged molecule and the “2”s represent the undamaged molecule of the pairs. For example, the “1”s in Panel (a) represent the TFC molecules and the “2”s represent the TGC molecules.
Figure 2.6: Canonical Variate Analysis of Txx Molecules. This figure shows the canonical variate analysis of molecules TFC, TGC, TFA, TGA, TFT, and TGT. The numbers in the plot correspond to the molecules in this list. For example, the “1”s represent the TFC molecules, the “2”s represent the TGC molecules, and so forth. This figure shows a separation of the “1”s and “2”s and the “3”s and “4”s but no clear separation of the “5”s and “6”s. A deeper investigation of the pairwise analyses is given in Figure 2.7.
Figure 2.7: Canonical Variate Analysis of the TFC-TGC, TFA-TGA, and TFT-TGT molecule pairs. This figure shows the canonical variate analysis of the TFC-TGC pair in Panel (a), the TFA-TGA pair in Panel (b), and the TFT-TGT pair in Panel (c). The “1”s in each figure represent the damaged molecule and the “2”s represent the undamaged molecule of the pairs. For example, the “1”s in Panel (a) represent the TFC molecules and the “2”s represent the TGC molecules.
This analysis suggests large differences in mean shape between the damaged and undamaged molecules for the AFC-AGC, TFC-TGC, and TFA-TGA molecule pairs, whereas the differences in mean shape for the AFA-AGA, AFG-AGG, and TFT-TGT molecule pairs are not as large.