Cross-modal Classification on Real Data

CHAPTER 5: Integrating Image and Genomic Features with Task-Driven Deep CCA

5.3 Task-driven Deep CCA

5.5.3 Cross-modal Classification on Real Data

Small Training Set: Carolina Breast Cancer Study (CBCS). This data set consists of image and genomic data for 1003 patient samples from the Carolina Breast Cancer Study, Phase 3 [Troester et al., 2018]. There are typically four TMA cores per patient. Image features were extracted with the CNN VGG16 [Simonyan and Zisserman, 2015] by taking the mean of the 512-dimensional output of the fourth set of convolutional layers across the tissue region and further averaging across all cores for the same patient. For gene expression, I used the set of 50 genes in the PAM50 array [Parker et al., 2009] or a larger set of 163 genes (referred to as GE163). The data set was randomly split into half for training, one quarter for validation, and one quarter for testing. Classification tasks included predicting Basal vs. non-Basal genomic subtype, estrogen receptor (ER) status positive vs. negative, and grade 1 vs. 3. Cross-modal classification accuracy is reported as the mean over four random training/validation/test splits. Table 5.1 compares cross-modal methods. The cross-modal projection was computed using each method, following by training a linear SVM on the projection from the first modality. Testing was done by projecting the second modality and predicting its class using the SVM trained on the first modality.

While the left and right modalities on MNIST had a roughly equal ability to predict the digit class, this CBCS data set is much less symmetric. The ground truth genomic subtype was computed directly from PAM50, so it is expected that this modality is much better able to predict genomic subtype. ER was also better predicted from PAM50, while grade was assessed by a pathologist from histology images and therefore better predicted from image features.

The top performing method on CBCS was consistently DDCCA-W or DDCCA-SD. DDCCA- W was typically the better of the two, indicating that decorrelation by whitening is especially beneficial on this data set. These methods outperformed all unsupervised linear and deep CCA methods.

I also explored CCA methods qualitatively. Figure 5.6 shows t-SNE plots for all methods on both the image features and PAM50. As expected, the PAM50 data more clearly sepa- rated classes than the image features, at least for Basal vs. non-Basal and ER status. On both modalities, the DDCCA methods all produced a better separation of classes than the

Train PAM50, Test Image

Method Basal ER Grade

CCA 0.732 (0.010) 0.637 (0.008) 0.741 (0.005) RCCA 0.815 (0.008) 0.811 (0.003) 0.877 (0.010) PLS-SVD 0.650 (0.016) 0.656 (0.003) 0.797 (0.010) DCCA 0.787 (0.010) 0.785 (0.011) 0.867 (0.012) SoftCCA 0.780 (0.010) 0.769 (0.014) 0.848 (0.015) DDCCA-ED 0.802 (0.015) 0.803 (0.029) 0.852 (0.011) DDCCA-W 0.820 (0.008) 0.828 (0.006) 0.917 (0.019) DDCCA-SD 0.796 (0.004) 0.811 (0.004) 0.874 (0.019) DDCCA-ND 0.766 (0.013) 0.805 (0.007) 0.878 (0.011)

Train Image, Test PAM50

Method Basal ER Grade

CCA 0.943 (0.005) 0.869 (0.006) 0.828 (0.011) RCCA 0.978 (0.003) 0.905 (0.008) 0.836 (0.009) PLS-SVD 0.912 (0.018) 0.865 (0.005) 0.794 (0.012) DCCA 0.976 (0.005) 0.874 (0.010) 0.854 (0.015) SoftCCA 0.978 (0.004) 0.897 (0.010) 0.843 (0.007) DDCCA-ED 0.971 (0.009) 0.889 (0.010) 0.833 (0.010) DDCCA-W 0.983 (0.006) 0.908 (0.009) 0.845 (0.005) DDCCA-SD 0.978 (0.005) 0.908 (0.009) 0.848 (0.004) DDCCA-ND 0.975 (0.004) 0.896 (0.005) 0.817 (0.005)

Table 5.1: Cross-modal classification accuracy for CBCS. The standard error is in brackets. The CCA projection for each modality was first computed. A classifier was then trained on the projection from one modality, and the projection from the other modality was used at test time.

unsupervised CCA methods. The plot for SoftCCA is rather peculiar and will require further investigation. However, it does roughly order points from low-grade, non-Basal to high grade, non-Basal to Basal - an ordering that does make some sense. This “stringy” property was also seen on MNIST for some hyperpameter settings (not shown).

HDLSS Data: TCGA-BRCA. These experiments use genomic, epigenomic, and proteomic data from the breast cancer portion of TCGA. I used the following four modalities of data that are available for 466 tumors: gene expression (GE), miRNA expression, methylation, and Reverse Phase Protein Array (RPPA). While histology images are available for TCGA-BRCA, they are whole slide images rather than TMAs, bringing complications that are not yet solved by standard software and were not studied in this dissertation. Therefore, no imaging modalities were used for this data set. Each data type was obtained from a previous publication that analyzed these and other modalities [Network et al., 2012]. GE is the highest dimensional with 12,749 features, methylation has 574, miRNA has 802, and RPPA has 171.

The data set was randomly split into half for training, one quarter for validation, and one quarter for testing. Classification tasks included predicting Basal vs. non-Basal genomic subtype, estrogen receptor (ER) status positive vs. negative, and progesterone receptor (PR) status positive vs. negative. Cross-modal classification accuracy is reported as the mean over four random training/validation/test splits.

Table 5.2 displays the cross-modal classification accuracy for GE paired with each of the other modalities. DDCCA-W or DDCCA-SD was usually the top performer. When DDCCA was the top performer, it improved over RCCA by as much as 2.6%. The linear method RCCA was remarkably powerful on both CBCS and TCGA-BRCA. We can gain some insight by examining the DCCA and SoftCCA results that use an unsupervised method for deep CCA. DCCA and SoftCCA were typically inferior to RCCA, indicating that a deep version of CCA does not improve upon a linear method on this data set. Therefore, the increase in performance from DCCA and SoftCCA to DDCCA-W and DDCCA-SD comes from the task- driven component. The task-driven component acts as a regularizer and reduces overfitting.

While the proposed DDCCA methods were not always the top performer on TCGA-BRCA, the results do show that a deep method can be successful on HDLSS data. This indicates that

Image features PAM50

CCA DDCCA-ED CCA DDCCA-ED