We applied VizRank to evaluate scatterplot, radviz, and polyviz projections. We limited radviz and polyviz projections to a maximum of eight features, since projections with more features are harder to interpret. For each dataset and visualization method, VizRank evaluated 100,000 projections as selected by the search heuristic. With these constraints, the runtime for the dataset with the largest number of instances was about half an hour.
Top projections for each dataset (Figures 1(c), 2, and 4) show that VizRank is able to find a projection with a relatively good to excellent class separation using only a fraction of available features. Scores for these projections in Table 2 show that radviz and polyviz projections consistently offer better class separation than scatterplots. This was expected since scatterplots present the data on only two features, while we searched for radviz and polyviz visualizations that included up to eight features. The advantage of using more features is especially evident in datasets with many class values (e.g., lung cancer) where two features alone are clearly insufficient for discrimination between all classes.
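This section does not restate how the score P is computed. As a rough illustration only, the sketch below assumes that P is the average probability that a k-nearest-neighbor classifier, evaluated leave-one-out on the two-dimensional projected points, assigns to each instance's true class. The function name `projection_score`, the default `k`, and the use of plain Euclidean distance are assumptions made for this example, not details taken from the chapter.

```python
import math
from collections import Counter

def projection_score(points, labels, k=3):
    """Average leave-one-out k-NN probability of the true class.

    points: list of (x, y) positions of instances in a projection
    labels: class label of each instance, in the same order
    Assumes at least two points and k >= 1.
    """
    total = 0.0
    for i, (p, true_label) in enumerate(zip(points, labels)):
        # Distances from instance i to every other instance (leave-one-out).
        others = [
            (math.dist(p, q), labels[j])
            for j, q in enumerate(points)
            if j != i
        ]
        others.sort(key=lambda t: t[0])
        neighbors = [label for _, label in others[:k]]
        # Fraction of the nearest neighbors that share the true class.
        total += Counter(neighbors)[true_label] / len(neighbors)
    return total / len(points)

# Hypothetical toy projection: two well-separated classes.
separated = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
separated_labels = ["a", "a", "a", "b", "b", "b"]
```

Under this assumption, a projection in which every instance is surrounded by instances of its own class scores close to 100%, while a projection that mixes the classes scores much lower.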
Table 2
Data set      P for best-ranked projection
              Scatterplot   Radviz    Polyviz
Leukemia      96.54%        99.93%    99.91%
DLBCL         89.34%        99.90%    99.87%
Prostate      87.34%        96.58%    97.23%
MLL           90.12%        99.70%    99.75%
SRBCT         83.52%        99.94%    99.92%
Lung cancer   75.48%        93.49%    93.66%

An important question in our study was whether the features (genes) used in the best projections are also biologically relevant, that is, whether they would be expected to be associated with a particular disease. Since most datasets try to discriminate between different tumor types, we assumed that the most useful genes would mostly be markers of different tissue or cell origin and would not necessarily be related to cancer pathogenesis. However, we found that many of the genes appearing in the best projections are annotated as cancer or cancer-related genes according to the Atlas of Genetics and Cytogenetics in Oncology and Haematology (http://www.infobiogen.fr/services/chromcancer/index.html). On the other hand, for the prostate dataset, where we try to differentiate between tumor and normal tissue samples, one would expect the “marker” genes to be cancer related. This expectation is supported by the fact that six of the eight genes used in the best radviz projection (Figure 4(b)), namely LMO3, RBP1, HSPD1, HPN, MAF, and TGFB3, are cancer related according to the cancer gene atlas.
For brevity, we present here a biological interpretation of the genes used in the best visualizations of a single data set only, and for this we consider the MLL data. The best polyviz projection for this dataset is shown in Figure 2(b) and exhibits a clear separation of instances from different diagnostic classes. In the visualization, class ALL instances lie closer to the anchor points of the MME and POU2AF1 genes. The anchor point of gene CCNA1 most strongly attracts the MLL class samples and, to some degree, also the AML samples. These
findings are consistent with the work of Armstrong et al. (2002), who report that genes MME and POU2AF1 are specifically expressed in ALL and gene CCNA1 in MLL. There is also a well-founded biological explanation for the appearance of these genes in some of the other best projections separating different classes of the MLL dataset. For example, MME (membrane metalloendopeptidase), also known as common acute lymphocytic leukemia antigen (CALLA), is an important cell surface marker in the diagnosis of human acute lymphocytic leukemia (ALL) (http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=120520). It is present on leukemic cells of pre-B phenotype, which represent 85% of cases of ALL, and is not expressed on the surface of AML or MLL cells. Similarly, gene POU2AF1 (POU domain class 2 associating factor 1) is required for appropriate B-cell development (Schubart, Massa, Schubart, Corcoran, Rolink, & Matthias, 2001) and is therefore expressed in ALL samples but not in instances with the AML or MLL class label. On the other hand, gene CCNA1 (cyclin A1) is a myeloid-specific gene, expressed in hematopoietic lineages other than lymphocytes (Liao, Wang, Wei, Li, Merghoub, & Pandolfi, 2001). Overexpression of CCNA1 results in abnormal myelopoiesis, which explains the higher expression in AML samples. According to Armstrong et al. (2002), lymphoblastic leukemias with MLL translocation (MLL class) constitute a distinct disease and are characterized by the expression of myeloid-specific genes such as CCNA1. Because this gene is myeloid-specific, it is not expressed in ALL samples.

Figure 4. Optimal radviz and polyviz projections for the lung cancer, prostate tumor, SRBCT, and DLBCL data sets: (a) lung cancer (93.49%), (b) prostate tumor (97.23%), (c) SRBCT (99.92%), (d) DLBCL (99.87%).
Instead of a single good projection, VizRank can often find a set of projections with high scores. We can then analyze which features are shared among these projections. For example, we may want to know whether a particular gene appears in only one good projection or in several top-ranked projections. Figure 5(a) shows a plot that lists the first 20 genes present in the top-ranked scatterplot projections of the MLL dataset. For each pair of genes (one from the x axis and one from the y axis), a black box indicates that their scatterplot projection is ranked among the best 500. The figure shows that three genes (MME, DYRK3, and POU2AF1) stand out in the number of their appearances in the top-ranked projections. Interestingly, in the original study of this dataset (Armstrong et al., 2002), these three genes were listed as the 1st, 3rd, and 10th gene, respectively, among the top 15 genes most highly correlated with the ALL class compared with the MLL and AML classes.
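The analysis of shared features can be sketched in a few lines. The example below is a minimal illustration, assuming the ranked projections are available as a list of gene tuples ordered from best to worst. The function names `gene_frequencies` and `top_pairs`, and the placeholder gene names `G1` to `G4`, are hypothetical and do not come from the chapter or from any VizRank implementation.

```python
from collections import Counter

def gene_frequencies(ranked_projections, top_n=500):
    """Count how often each gene appears among the top_n projections.

    ranked_projections: list of gene tuples, best projection first.
    """
    counts = Counter()
    for genes in ranked_projections[:top_n]:
        counts.update(set(genes))  # count a gene once per projection
    return counts

def top_pairs(ranked_projections, top_n=500):
    """Unordered gene pairs whose scatterplot is among the top_n."""
    return {frozenset(genes) for genes in ranked_projections[:top_n]}

# Placeholder ranking for illustration only (not results from the chapter).
ranked = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3")]
freq = gene_frequencies(ranked, top_n=3)
pairs = top_pairs(ranked, top_n=3)
```

Here `freq` plays the role of a histogram of gene appearances, and membership of a pair in `pairs` corresponds to a filled cell in a gene-by-gene grid.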
Figure 5. (a) Genes on the x and y axes are the first 20 genes from the list of top-ranked scatterplot projections of the MLL data set. Black boxes indicate that the corresponding pair of genes on the x and y axes forms a scatterplot that is ranked as one of the best 500 scatterplots. (b) Histogram of genes that appear most frequently in the list of the 500 best-ranked scatterplot projections of the leukemia data set.

We performed a similar experiment for the leukemia dataset. For each gene, we counted how often it appears in the top 500 scatterplots. The histogram of the 20 most frequent genes is shown in Figure 5(b). Gene ZYX (zyxin) particularly stands out, as it was present in more than 260 out of 500 projections. ZYX is also one of the anchor genes in the best radviz projection of the leukemia dataset (Figure 2(a)). One can observe in that figure that instances from the AML class lie closer to this anchor; therefore, they have higher expression of this gene. Zyxin has been previously recognized as one of the most important genes in differentiating acute lymphoblastic and acute myeloid leukemia samples. In the original study of this dataset (Golub et al., 1999), zyxin was reported as one of the genes most highly correlated with the ALL-AML class distinction. Wang and Gehan (2005) systematically investigated and compared feature selection algorithms on this dataset, and reported that zyxin was ranked as the most important gene in differentiating the two leukemias by most filter and wrapper feature selection methods. They also give a possible biological explanation for the involvement of zyxin in leukaemogenesis.
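The reasoning that an instance lying closer to an anchor has higher expression of that gene follows from how radviz places points. The sketch below assumes the commonly described radviz layout: anchors are spaced evenly on the unit circle, and each instance sits at the weighted mean of the anchor positions, with weights equal to its feature values scaled to [0, 1]. The function name `radviz_point` and the handling of an all-zero instance are assumptions made for this example.

```python
import math

def radviz_point(values):
    """Position of one instance inside the unit circle.

    values: feature values already scaled to [0, 1], one per anchor.
    Anchors are spaced evenly on the unit circle in the given order.
    """
    n = len(values)
    anchors = [
        (math.cos(2 * math.pi * j / n), math.sin(2 * math.pi * j / n))
        for j in range(n)
    ]
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no anchor pulls the point; keep it at the center
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)

# An instance with a high value on the first feature and low values on the
# others is pulled toward the first anchor, located at (1, 0).
high_first = radviz_point([0.9, 0.1, 0.1])
high_second = radviz_point([0.1, 0.9, 0.1])
```

In this layout, an instance with equal values on all features lands at the center, so only differences in relative expression move a point toward a particular anchor.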
We found similar biological relevance for genes that participated in the best visualizations of the other datasets. Besides finding projections with good class separation, VizRank therefore also pointed to specific important genes that had already been experimentally shown to be relevant in the diagnosis of different cancer types. Most of the visualizations included in this chapter point to nonlinear gene interactions, giving VizRank an advantage over the univariate feature selection algorithms predominantly used in current related work in the area.