Chapter 8. Final remarks and future directions
8.1 Summary and conclusions
8.1.1 Newly developed classification methods based on relative expression reversals of biological features demonstrate robust phenotype distinction in binary and multi-class scenarios
TSPL (Top Scoring Pair Lists) is a natural extension of the TSP and k-TSP classification methods, and was developed to address small feature-set limitations by incorporating a broader range of gene-pair classifiers. In contrast to k-TSP, TSPL selects classifiers from a collection of non-disjoint gene-pairs with minimum TSP scores of 0.6, and classifies test samples based on an iterative majority-voting scheme that involves all chosen gene-pairs. TSPL and three other methods (SVM, TSP, and k-TSP) were evaluated on various binary-class transcriptomic datasets from a wide range of clinical phenotypes and measurement platforms. Among the four techniques, TSPL had the highest average leave-one-out cross-validation accuracy, even when the total feature-set size was greatly reduced.
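The pair-selection and voting steps described above can be sketched as follows. This is a minimal illustration rather than the TSPL implementation: it assumes the standard TSP score, |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|, and it substitutes a single-pass majority vote for the iterative majority-voting scheme, whose details are not given in this summary. All function names and the toy data are hypothetical.

```python
def tsp_score(expr_i, expr_j, labels):
    """Return (score, freq) for one gene pair.

    score = |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|
    Assumes labels are 0/1 and that both classes are present.
    """
    freq = {}
    for cls in (0, 1):
        idx = [k for k, y in enumerate(labels) if y == cls]
        freq[cls] = sum(expr_i[k] < expr_j[k] for k in idx) / len(idx)
    return abs(freq[0] - freq[1]), freq


def select_pairs(data, labels, min_score=0.6):
    """Keep every gene pair whose score reaches min_score.

    Pairs are allowed to share genes (non-disjoint), unlike k-TSP.
    """
    genes = sorted(data)
    chosen = []
    for a in range(len(genes)):
        for b in range(a + 1, len(genes)):
            gi, gj = genes[a], genes[b]
            score, freq = tsp_score(data[gi], data[gj], labels)
            if score >= min_score:
                # Class in which the ordering "gi < gj" is more frequent.
                vote_if_less = 0 if freq[0] > freq[1] else 1
                chosen.append((gi, gj, vote_if_less))
    return chosen


def classify(sample, pairs):
    """Assumed simplification: one-pass majority vote; ties go to class 0."""
    votes = [0, 0]
    for gi, gj, vote_if_less in pairs:
        cls = vote_if_less if sample[gi] < sample[gj] else 1 - vote_if_less
        votes[cls] += 1
    return 0 if votes[0] >= votes[1] else 1


# Hypothetical toy data: gene name -> expression values across six samples.
toy_data = {
    "g1": [1, 2, 1, 5, 6, 7],
    "g2": [5, 6, 7, 1, 2, 1],
    "g3": [3, 3, 3, 3, 3, 3],
}
toy_labels = [0, 0, 0, 1, 1, 1]
toy_pairs = select_pairs(toy_data, toy_labels)
```

Because each vote depends only on the ordering of two values within a single sample, the decision rule is unchanged by any monotone per-sample normalization.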
ISSAC (Identification of Structured Signatures And Classifiers) uses a data-driven, hierarchical approach to first organize multiple clinical phenotypes into a global hierarchy, and then learn corresponding (binary) classifiers. The classifiers at each node and edge of the hierarchical structure are then accumulated into a panel of biomarkers, which can then direct classification down the tree to select a particular phenotype. The cumulative expression patterns in the biomarker panel thereby constitute “hierarchically-structured” signatures for a set of classes. Six multi-category classification methods (including ISSAC) were evaluated on various multi-class transcriptomic datasets, and ISSAC had the second-highest average performance in ten-fold cross-validation, behind only SVM.
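The idea of directing classification down a phenotype tree can be illustrated with a short sketch. It is a deliberate simplification under stated assumptions: each internal node holds a single gene-pair rule, whereas ISSAC learns classifiers at both nodes and edges, and the tree layout, gene names, and phenotype labels below are hypothetical.

```python
class Node:
    """One node of a hypothetical phenotype hierarchy."""

    def __init__(self, label=None, pair=None, left=None, right=None):
        self.label = label  # phenotype name at a leaf, None at internal nodes
        self.pair = pair    # (gene_a, gene_b) decision rule at an internal node
        self.left = left    # branch taken when gene_a < gene_b
        self.right = right  # branch taken otherwise


def classify_down_tree(node, sample):
    """Follow relative-expression decisions from the root to a leaf.

    Returns the leaf label and the list of decisions taken, which
    together play the role of the sample's hierarchical signature.
    """
    path = []
    while node.label is None:
        a, b = node.pair
        go_left = sample[a] < sample[b]
        path.append((a, b, go_left))
        node = node.left if go_left else node.right
    return node.label, path


# Hypothetical two-level hierarchy: normal vs. tumor, then tumor_A vs. tumor_B.
tree = Node(
    pair=("g1", "g2"),
    left=Node(label="normal"),
    right=Node(
        pair=("g3", "g4"),
        left=Node(label="tumor_A"),
        right=Node(label="tumor_B"),
    ),
)

label, path = classify_down_tree(tree, {"g1": 5, "g2": 2, "g3": 1, "g4": 4})
```

Returning the path alongside the label makes visible which marker comparisons led to the final call.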
Based on our classification results, we believe TSPL and ISSAC hold great promise for binary and multi-class phenotype distinction, respectively, and we look forward to their use in future omics-based classification problems.
8.1.2 Multi-study integration of brain cancer transcriptomes reveals organ-level diagnostic signatures
We identified comprehensive diagnostic signatures of major cancers of the human brain from a multi-study, integrated transcriptomic dataset. These signatures are based on comparing ranked expression values within gene-pair sets, which are aggregated into a brain cancer marker-panel of 44 unique genes. Several genes in our marker-panel had previously confirmed ties to brain cancers and cancer biology. The hierarchically structured signatures achieved 90% classification accuracy when training and validation sets were drawn from the same population distribution (cross-validation).
In addition, our brain cancer marker-panel obtained an average classification accuracy of 87% when using only genes annotated to encode extracellular products. This suggests that a strong signal for phenotype distinction may persist even when using secreted proteins from disease-afflicted organs.
Our hold-one-lab-in validations on five glioblastoma datasets posed a stringent test: learning a diagnostic signature from a single dataset. In this setting, we found that variation among individual studies often had a larger effect on the transcriptome than did phenotype differences, resulting in dramatically decreased average accuracy. However, we found that learning signatures across multiple datasets significantly improved average accuracy, with a concomitant reduction in performance variance, even when the sample sizes of the training sets were kept consistent. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global phenotype characteristics. Therefore, we found that sufficient dataset integration across multiple studies can provide molecular diagnostic signatures that have strong phenotype-associated signal in comparison with noise from batch effects and other sources of variance.
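The two validation designs contrasted above can be sketched as index splits. This is an illustrative assumption about how such splits might be generated, not the procedure used in this work: the function names, the random subsampling used to hold the training-set size fixed, and the seed are all hypothetical.

```python
import random


def hold_one_lab_in(study_of_sample):
    """Yield (study, train_idx, test_idx): train on one study, test on the rest."""
    for s in sorted(set(study_of_sample)):
        train = [i for i, lab in enumerate(study_of_sample) if lab == s]
        test = [i for i, lab in enumerate(study_of_sample) if lab != s]
        yield s, train, test


def multi_study_training(study_of_sample, train_size, seed=0):
    """Yield (held_out_study, train_idx, test_idx).

    Training samples are drawn across all remaining studies and are
    subsampled to train_size, so that single-study and multi-study
    signatures can be compared at equal training-set size.
    """
    rng = random.Random(seed)
    for s in sorted(set(study_of_sample)):
        pool = [i for i, lab in enumerate(study_of_sample) if lab != s]
        test = [i for i, lab in enumerate(study_of_sample) if lab == s]
        train = sorted(rng.sample(pool, min(train_size, len(pool))))
        yield s, train, test


# Hypothetical study labels for six samples drawn from three labs.
labs = ["A", "A", "B", "B", "C", "C"]
single = list(hold_one_lab_in(labs))
pooled = list(multi_study_training(labs, train_size=2))
```

Holding `train_size` equal to the size of a single study separates the effect of study diversity from the effect of simply having more training samples.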
8.1.3 Conserved expression patterns in mRNA and protein profiles for human intestinal cancers allow prediction of relative feature abundances across heterogeneous data types for in vivo monitoring of disease-perturbed networks
The mechanistic relationship between transcription and translation remains poorly understood, and several studies have shown that, in general, no significant direct correlation exists between gene and protein expression. However, if we are to eventually use blood protein measurements for in vivo monitoring of biomolecular network states within disease-perturbed cells, it is critical to establish a framework for reliably predicting expression levels across heterogeneous data types. To this end, we developed SOMEIRA (Signatures of Matching Expression to Infer Relative Abundances), a novel computational method that allows prediction of relative expression levels of mRNA profiles using protein profiles (and vice versa). This is achieved by first binarizing all mRNA and protein profiles based on pair-wise relative expression feature comparisons. Then, gene-pairs and protein-pairs that display consistent binary patterns across all matching samples are identified to form the basis for predicting relative abundances of expression profiles. Importantly, inference of expression levels is possible even without a complete understanding of the intricate cellular processes that take place between gene and protein expression.
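The binarize-then-match idea described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the SOMEIRA implementation: "consistent binary patterns" is read here as a gene-pair and a protein-pair taking identical bit values in every matched sample, and conflicting predictions are resolved by a simple majority. All function names and the toy profiles are hypothetical.

```python
from itertools import combinations


def binarize(profile):
    """Map each feature pair (a, b) to 1 if a < b in this profile, else 0."""
    return {
        (a, b): int(profile[a] < profile[b])
        for a, b in combinations(sorted(profile), 2)
    }


def learn_links(mrna_profiles, protein_profiles):
    """Return (gene_pair, protein_pair) links whose bits agree in all samples.

    Profiles at the same list position are assumed to come from the same
    biopsy. Pairs that never flip will match each other trivially, which
    is a known limitation of this sketch.
    """
    m_bits = [binarize(p) for p in mrna_profiles]
    p_bits = [binarize(p) for p in protein_profiles]
    links = []
    for gene_pair in m_bits[0]:
        for prot_pair in p_bits[0]:
            if all(m[gene_pair] == p[prot_pair] for m, p in zip(m_bits, p_bits)):
                links.append((gene_pair, prot_pair))
    return links


def predict_gene_pair_order(protein_profile, links):
    """Predict, per linked gene pair, whether the first gene is the lower one."""
    bits = binarize(protein_profile)
    votes = {}
    for gene_pair, prot_pair in links:
        votes.setdefault(gene_pair, []).append(bits[prot_pair])
    # Assumed tie rule: an even split predicts 1.
    return {gp: int(2 * sum(v) >= len(v)) for gp, v in votes.items()}


# Hypothetical matched profiles from two biopsies.
mrna = [{"A": 1, "B": 2}, {"A": 3, "B": 1}]
prot = [{"P": 0.2, "Q": 0.9}, {"P": 0.8, "Q": 0.1}]
links = learn_links(mrna, prot)
predicted = predict_gene_pair_order({"P": 0.1, "Q": 0.5}, links)
```

Note that the gene-pair and protein-pair in a link need not involve the same genes, since matching is done on binary patterns alone.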
The predictive performance of SOMEIRA was evaluated on mRNA and protein profiles from gastrointestinal stromal tumor (GIST) and leiomyosarcoma (LMS) biopsy samples. Each protein profile was composed of 40 protein measurements, while each mRNA profile was limited to the 2,094 features that map to all pathways in the BioCarta database. Training and performance testing of SOMEIRA followed a leave-one-out cross-validation approach, in the sense that conserved expression patterns between pairs of mRNAs and pairs of proteins (which form the framework for predicting relative feature abundances across heterogeneous datasets) were learned on all but one sample. The remaining protein profile was used to predict relative levels of the corresponding mRNA profile from the same biopsy source. Our results showed that SOMEIRA can be highly predictive of relative expression levels for different data types; across all 49 cases of matching mRNA and protein profiles, the average correlation coefficient between actual and inferred mRNA profiles was 0.91, while the average correlation coefficient for protein profile estimation was 0.81. These correlations were significantly higher than those obtained using a linear regression-based method (0.76 and 0.66 for mRNA and protein prediction, respectively). A key advantage of SOMEIRA is that mRNA and protein measurements corresponding to the same gene are not required, since feature pairs within only one data type are used to binarize a dataset.
Our next analysis was to evaluate how well protein profiles can serve as proxies to infer biological information concerning mRNA profiles. In particular, we extended the use of SOMEIRA to assessing states of intracellular networks by applying DIRAC directly on all mRNA profiles inferred from protein profiles. When we compared these results with those from actual mRNA profiles, we found significant overlap (based on hypergeometric tests) in (DIRAC-defined) tightly-regulated pathways in GIST and LMS, and also in differentially-regulated pathways between the two phenotypes. To the best of our knowledge, this work is the first systems-level demonstration of analyzing biomolecular network states of human cancers using protein profiles as proxies to global gene