Classification performance on various transcriptomic datasets

Chapter 3. Molecular diagnostics using relative expression analysis

3.3 Classification performance on various transcriptomic datasets

wide range of clinical phenotypes and measurement platforms. A total of 19 datasets were obtained from the Gene Expression Omnibus (GEO) or from online repositories of previously published studies. Clinical phenotypes, total sample numbers, and GEO references of all datasets are shown in Table 3.1.

Leave-one-out cross-validation was chosen as the performance metric for binary

classification. For comparison with the results from TSPL, Support Vector Machines (SVM) and two other RXA-based methods, TSP and k-TSP, were also tested on every dataset. Cross-

validation accuracies from all four methods are shown in Table 3.2. TSPL displayed the highest average cross-validation accuracy across all datasets (87.3 %), compared to SVM (85.4 %), TSP (79.2 %), and k-TSP (87.3 %). When considering only the RXA-based methods, TSPL

Table 3.1 Description of all 19 binary phenotype datasets used in this study.

‘ψ’: expression data from protein lysate array. ‘-’: Datasets from published studies, but not available on GEO.

GEO dataset (GDS #)

A _{Responsive vs. non-responsive breast tumors dataset 1} ₆₀ ₈₀₇

B Gastrointestinal stromal tumor vs. Leimyosarcoma 52 -

Gastrointestinal stromal tumor vs. Leimyosarcoma (protein lysate array) 52 - D _{Responsive vs. non-responsive breast tumors dataset 2} ₆₀ ₁₆₂₇ E _{Head and neck squamous cell carcinoma vs. control tissue} ₄₄ ₂₅₂₀

F Bipolar disorder vs. control cerebral cortex 61 2190

G Carcinoma-like vs. adenoma-like ovarian tumor 43 2785

H _{Acute myeloid leukemia vs. acute lymphoblastic leukemia dataset 1} ₇₂ _-

I _{Marfan syndrome vs. control subjects} ₁₀₁ ₂₉₆₀

J Tumor vs. control prostate 102 2545

K Glioblastoma multiforme vs. anaplastic oligodendroglioma 29 1813

L _{Melanoma patients vs. control lympocytes} ₄₆ ₂₇₃₅

M _{Parkinson vs. control brain dataset 1} ₄₇ ₃₁₂₉

N Parkinson vs. control brain dataset 2 25 2821

O Tumor vs. normal pancreas epithelial layer 52 -

P _{Tumor vs. control ovarian tissue} ₂₄ ₃₅₉₂

Q _{Adenocarcinoma vs. squamous cell carcinoma NSC lung cancer} ₅₈ ₂₇₇₁ R Acute myeloid leukemia vs. acute lymphoblastic leukemia dataset 2 63 3057 S Diffuse large B-cell lymphoma vs. follicular lymphoma 77 3516

Total sample # Clinical phenotypes

Table 3.2 Leave-one-out cross-validation accuracies from SVM, TSP, k-TSP, and TSPL.

‘ψ_{’: expression data from protein lysate array. ‘}Φ_{’: Highest accuracy obtained for each dataset is} shaded. A B C! D E F G H I J K L M N O P Q R S Average Dataset 60 52 52 60 44 61 43 72 101 102 29 46 47 25 52 24 58 63 77 ! Total sample # _SVM _TSP _k-TSP _TSPL 61.7% 28.3% 55.0% 80.0% 94.2% 86.5% 96.2% 96.2% 92.3% 55.8% 63.5% 88.5% 53.3% 68.3% 66.7% 63.3% 95.5% 84.1% 84.1% 95.5% 47.5% 65.6% 54.1% 60.7% 100.0% 93.0% 100.0% 100.0% 98.6% 87.5% 97.2% 97.2% 94.1% 80.2% 89.1% 92.1% 91.2% 92.2% 89.2% 89.2% 96.6% 72.4% 79.3% 75.9% 67.4% 73.9% 67.4% 78.3% 87.2% 89.4% 83.0% 87.2% 72.0% 68.0% 72.0% 76.0% 86.5% 82.7% 86.5% 92.3% 100.0% 95.8% 100.0% 100.0% 94.8% 87.9% 93.1% 96.6% 93.7% 93.7% 96.8% 98.4% 96.1% 98.7% 97.4% 92.2% 85.4% 79.2% 82.7% 87.3%

Next, cross-validation accuracies of TSPL were directly compared to those of SVM (Fig. 3.1a), TSP (Fig. 3.1b) and k-TSP (Fig. 3.1c). In 4 of 19 datasets, strong classification

performance (≥ 90%) was obtained when either method was used. TSPL performed equally as well as, or better than SVM, TSP and k-TSP in 13, 14, and 16 of 19 total cases, respectively. In the datasets where both methods show moderate to poor performances (both accuracies ≤ 80%), TSPL had higher accuracies in 5 of 5 (vs. SVM, Fig. 3.1a), 4 of 6 (vs. TSP, Fig. 3.1b), and 4 of 6 (k-TSP, Fig. 3.1c) datasets. This shows that, compared to currently available techniques, better results may be expected when TSPL is used in cases where a strong distinction between two phenotypes does not exist at the transcriptomic level. That TSPL performs better than SVM in all 5 cases is highly encouraging, because SVM has been previously shown to perform remarkably well compared to other gene-expression microarray classification strategies (Statnikov, et al., 2005; Tan, et al., 2005). These results clearly indicate that our large-scale approach of utilizing all relative expression reversal signals shows significant promise towards binary phenotype classification.

Figure 3.1 Comparisons of classification performances between (a) SVM vs. TSPL; (b) TSP vs. TSPL; and (c) k-TSP vs. TSPL across all 19 datasets. Each point on the scatter-plots represents a comparison between cross-validation accuracies of two methods for a particular dataset.

Numerical values of points are shown in Table 3.2. In approximately a fifth of all datasets, high accuracy (≥ 90%) is obtained when either method is used, as indicated by points towards the top- right of the scatter-plots. In datasets where both methods show moderate to poor performances (both accuracies ≤ 80%, as indicated by green dashed-line), higher accuracies are obtained from TSPL in (a) 5 of 5; (b) 4 of 6; and (c) 4 of 6 corresponding datasets. The gray line has a slope of 1, and represents a cut-off of accuracies higher in one method than in the other.

In Dataset C, where the measurement platform was a protein lysate array composed of 40 antibody features per sample (Yang, et al., 2010), TSPL, TSP, and k-TSP have cross-validation accuracies of 88.5 %, 55.8 %, and 63.5 %, respectively (Table 3.2). This observation motivated our next analysis, which was to test whether TSPL could outperform other RXA-based methods on datasets of small feature sizes. In contrast to gene-expression microarrays, small feature size is generally the case for antibody-based protein chips, which are also known to be highly expensive and labor-intensive to produce despite their relatively low measurement throughput. (Although there have been recent improvements regarding comprehensiveness (Fan, et al., 2008; Sardiu and Washburn, 2011; Song, et al., 2011; Talapatra, et al., 2002), the current technology is still far from reaching the entire proteome-scale.) It would greatly behoove such strategies if the advantages of using simple decision rules based on relative expression can be fully utilized in protein expression measurements, as long as high classification performance can be consistently maintained.

We tested this hypothesis by performing cross-validation on the same datasets used previously, but on 100 randomly selected features chosen consistently across all samples for a particular dataset, over 100 iterations. Results in Table 3.3 show that, on average, TSPL displayed the best classification performance (in cross-validation) among the four methods (SVM: 79.9 %; TSP: 72.5 %; k-TSP: 77.1 %; TSPL: 80.1 %). In addition, TSPL had cross- validation accuracies equal to, or higher than, the others RXA-based methods in 16 of 18 cases, showing again that TSPL clearly performs better than TSP and k-TSP, even on expression data for which the total feature size was greatly reduced.

Table 3.3 Average leave-one-out cross-validation accuracies from 100 iterations of SVM, TSP,

k-TSP, and TSPL on datasets of 100 features selected randomly.

‘ψ_{’: Protein lysate array data has only 40 measured features per sample, and was therefore} removed from this analysis. ‘Φ’: Highest accuracy obtained for each dataset is shaded.

A B C! D E F G H I J K L M N O P Q R S Average Dataset 60 52 52 60 44 61 43 72 101 102 29 46 47 25 52 24 58 63 77 ! Total sample # _SVM _TSP _k-TSP _TSPL 55.0% 56.7% 63.3% 72.0% 91.4% 82.1% 89.2% 90.9% - - - - 51.1% 47.2% 47.2% 48.8% 90.2% 80.7% 88.9% 88.9% 57.4% 45.9% 45.9% 60.7% 98.2% 91.2% 96.2% 98.2% 80.0% 83.3% 87.8% 89.4% 87.6% 73.6% 79.7% 82.5% 80.0% 74.6% 78.4% 82.7% 96.5% 62.9% 65.0% 70.2% 48.4% 57.8% 56.1% 54.7% 86.8% 76.0% 80.7% 80.7% 72.2% 60.6% 64.3% 70.8% 78.5% 82.1% 85.3% 87.4% 93.5% 85.7% 94.6% 97.0% 90.9% 77.7% 85.6% 87.8% 89.8% 87.1% 92.1% 93.3% 89.8% 79.5% 87.0% 86.3% 79.9% 72.5% 77.1% 80.1%

3.4 Chapter summary

The advent of high-throughput measurement technologies for the comprehensive, rapid, and inexpensive detection of molecular signatures in human cells, tissues, and serum has led to the generation of a tremendous amount of raw, unprocessed information. However, analyzing and interpreting these data in order to enhance our understanding of human health and genetic diseases continues to be a challenge in the scientific community. In the case of gene-expression microarray data, standard statistical learning methods have been shown to produce accurate classifiers to distinguish disease phenotypes, but still lack the convenience and simplicity desired for extracting any underlying biological rationale for the decision rules.

To address this issue, Geman et al. developed the TSP and k-TSP classifiers, two bioinformatics techniques for the molecular classification of gene expression data based on the analysis of relative expression values. Due to the simplicity of the classifier and ease of

biological interpretation, as well as its independence to data normalization and parameter-fitting, the TSP and k-TSP methods have been applied in several studies to perform molecular

classification of various pathologies, primarily to cancer. These methods have been shown to display highly accurate classification performance in distinguishing a broad phenotype range of case (i.e., disease) vs. controls, cancer subclasses, prognostic outcomes, and so forth. As we will see in Chapter 4, these strategies can be applied towards classification of multiple phenotypes based on a decision-tree approach.

In addition to the currently available RXA-based classification methods, we introduce a new approach called Top Scoring Pair Lists (TSPL), which follows a natural extension of the TSP and k-TSP methods by incorporating a wider range of gene-pairs that show moderate to strong diagnostic value. We evaluated the classification performance of TSPL by leave-one-out cross- validation across 19 datasets. On average, TSPL outperformed TSP and k-TSP, and also SVM. TSPL also performed better than these methods on expression data of small feature size, which is

the general case for antibody-based protein detection platforms. Our results suggest that this large-scale approach of utilizing all relative expression reversal signals shows great promise towards the identification of more robust molecular diagnostic signatures.

3.5 References

Alizadeh, A.A., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403, 503-511.

Auffray, C. (2007) Protein subnetwork markers improve prediction of cancer outcome, Mol. Syst.

Biol., 3.

Bicciato, S., et al. (2003) Pattern identification and classification in gene expression data using an autoassociative neural network model, Biotechnol. Bioeng., 81, 594-606.

Bloom, G., et al. (2004) Multi-platform, multi-site, microarray-based human tumor classification,

American Journal of Pathology, 164, 9-16.

Boulesteix, A.L., Tutz, G. and Strimmer, K. (2003) A CART-based approach to discover emerging patterns in microarray data, Bioinformatics, 19, 2465-2472.

Chuang, H.Y., et al. (2007) Network-based classification of breast cancer metastasis, Mol. Syst.

Biol., 3.

Clarke, M.J. (2008) Tamoxifen for early breast cancer, Cochrane Database of Systematic Reviews

(Online).

Clarke, R., et al. (2003) Antiestrogen resistance in breast cancer and the role of estrogen receptor signaling, Oncogene, 22, 7316-7339.

Dettling, M. and Buhlmann, P. (2003) Boosting for tumor classification with gene expression data, Bioinformatics, 19, 1061-1069.

Dudoit, S., et al. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica, 12, 111-140.

Fan, R., et al. (2008) Integrated barcode chips for rapid, multiplexed analysis of proteins in microliter quantities of blood, Nature Biotechnology, 26, 1373-1378.

Geman, D., et al. (2004) Classifying gene expression profiles from pairwise mRNA comparisons,

Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 531-537.

Gordon, G.J., et al. (2002) Translation of microarray data into clinically relevant cancer

diagnostic tests using gene expression ratios in lung cancer and mesothelioma, Cancer Research, 62, 4963-4967.

Hippo, Y., et al. (2001) Differential gene expression profiles of scirrhous gastric cancer cells with high metastatic potential to peritoneum or lymph nodes, Cancer Research, 61, 889-895.

Jordan, C. (2002) Historical perspective on hormonal therapy of advanced breast cancer, Clin.

Ther., 24, 3-16.

Khan, J., et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, 7, 673-679.

LaTulippe, E., et al. (2002) Comprehensive gene expression analysis of prostate cancer reveals distinct transcriptional programs associated with metastatic disease, Cancer Research, 62, 4499- 4506.

Lee, E., et al. (2008) Inferring Pathway Activity toward Precise Disease Classification, PLoS

Comput. Biol., 4.

Lin, X., et al. (2009) The ordering of expression among a few genes can provide simple cancer biomarkers and signal BRCA1 mutations, BMC Bioinformatics, 10, 256.

Ma, X.J., et al. (2004) A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen, Cancer Cell, 5, 607-616.

Michiels, S., Koscielny, S. and Hill, C. (2005) Prediction of cancer outcome with microarrays: a multiple random validation strategy, The Lancet, 365, 488-492.

Nicholson, R.I., et al. (2003) The biology of antihormone failure in breast cancer, Breast Cancer

Res. Treat., 80, 29-34.

Pavlidis, P., Li, Q. and Noble, W.S. (2003) The effect of replication on gene expression microarray experiments, Bioinformatics, 19, 1620-1627.

Peng, S., et al. (2003) Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines, FEBS Lett., 555, 358-362. Qiu, X., et al. (2006) Assessing stability of gene selection in microarray data analysis, BMC

Bioinformatics, 7, 50.

Qu, Y., et al. (2002) Boosted decision tree analysis of surface-enhanced laser

desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients, Clin. Chem., 48, 1835-1843.

Rajeshkumar, N.V., et al. (2009) Antitumor effects and biomarkers of activity of AZD0530, a Src inhibitor, in pancreatic cancer, Clin. Cancer Res., 15, 4138-4146.

Raponi, M., et al. (2008) A 2-gene classifier for predicting response to the farnesyltransferase inhibitor tipifarnib in acute myeloid leukemia, Blood, 111, 2589-2596.

Sardiu, M.E. and Washburn, M.P. (2011) Construction of protein interaction networks based on the label-free quantitative proteomics, Methods Mol Biol, 781, 71-85.

Singh, D., et al. (2002) Gene expression correlates of clinical prostate cancer behavior, Cancer

Cell, 1, 203-209.

Song, Q., et al. (2011) Peptide aptamer microarrays: bridging the bio-detector interface, Faraday

Discussions, 149, 79-92.

Statnikov, A., et al. (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis, Bioinformatics, 21, 631-643.

Subramanian, A., et al. (2007) GSEA-P: a desktop application for Gene Set Enrichment Analysis,

Bioinformatics, 23, 3251-3253.

Subramanian, A., et al. (2005) Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proc. Natl. Acad. Sci. U. S. A., 102, 15545-15550. Talapatra, A., Rouse, R. and Hardiman, G. (2002) Protein microarrays: challenges and promises,

Tan, A.C., et al. (2005) Simple decision rules for classifying human cancers from gene expression profiles, Bioinformatics, 21, 3896-3904.

Xu, L., Geman, D. and Winslow, R.L. (2007) Large-scale integration of cancer microarray data identifies a robust common cancer signature, BMC Bioinformatics, 8, 275.

Xu, L., et al. (2005) Robust prostate cancer marker genes emerge from direct integration of inter- study microarray data, Bioinformatics, 21, 3905.

Xu, L., et al. (2008) Merging microarray data from separate breast cancer studies provides a robust prognostic test, BMC Bioinformatics, 9, 125.

Yang, J., et al. (2010) Integrated proteomics and genomics analysis reveals a novel mesenchymal to epithelial reverting transition in leiomyosarcoma through regulation of slug, Molecular &

cellular proteomics : Molecular and Cellular Proteomics, 9, 2405-2413.

Yeang, C.H., et al. (2001) Molecular classification of multiple tumor types, Bioinformatics, 17 S316-322.

Zhang, H., Yu, C.Y. and Singer, B. (2003) Cell and tumor classification using gene expression data: construction of forests, Proc. Natl. Acad. Sci. U. S. A., 100, 4168-4172.

Chapter 4. Multi-category classification of clinical

In document Systems approaches to identify molecular signatures from high-throughput expression data: towards next generation patient diagnostics (Page 75-88)