Scoring a projection using leave-one-out evaluation and the k-nearest neighbor algorithm has a complexity on the order of O(N^2), where N is the number of instances in the data set. This complexity is high, especially for larger datasets, and can be reduced to O(N log N) using a more efficient implementation of nearest neighbor search (Krauthgamer & Lee, 2004). Note, however, that the data sets considered by our approach are unlikely to be extremely large, as large data sets cannot be nicely visualized using point-based visualization. Typical datasets for which these visualizations are appropriate include from several hundred to (at most) a couple of thousand data items. For such datasets, our implementation of VizRank typically evaluates about 1,000 to 10,000 projections per minute of runtime on a medium-scale desktop PC.
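To make the quadratic cost concrete, the following is a minimal sketch of brute-force leave-one-out k-nearest neighbor scoring of a two-dimensional projection. It is an illustration under stated assumptions, not the VizRank implementation: the function name `loo_knn_accuracy`, the default `k=3`, and the use of plain classification accuracy as the score are all assumptions made for this example.

```python
import heapq
import math
from collections import Counter


def loo_knn_accuracy(points, labels, k=3):
    """Leave-one-out k-NN accuracy of a projection (hypothetical helper).

    points: list of (x, y) positions of the instances in the projection.
    labels: list of class labels, one per instance.
    For each of the N instances we scan the other N - 1 instances,
    which is where the O(N^2) distance computations come from.
    """
    n = len(points)
    correct = 0
    for i in range(n):
        # Candidate neighbors: every instance except the one left out.
        others = (
            (math.dist(points[i], points[j]), labels[j])
            for j in range(n)
            if j != i
        )
        # Keep the k closest instances, comparing by distance only.
        nearest = heapq.nsmallest(k, others, key=lambda pair: pair[0])
        votes = Counter(label for _, label in nearest)
        predicted = votes.most_common(1)[0][0]
        if predicted == labels[i]:
            correct += 1
    return correct / n


if __name__ == "__main__":
    # Two well-separated classes: every left-out point is classified correctly.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    cls = ["a", "a", "a", "b", "b", "b"]
    print(loo_knn_accuracy(pts, cls, k=3))
```

Replacing the inner linear scan with a spatial index is what would bring the cost down toward O(N log N); the sketch deliberately keeps the brute-force form.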
Another major factor that contributes to the complexity of the procedure is the number of projections to evaluate. Consider the radviz visualization, which can include an arbitrary number of features in a single plot. The number of different projections grows exponentially with the number of features included in the visualization. In practice, visualizations with more than ten features are rarely considered, since they are hard to interpret. Still, even with this limitation, it is impossible to evaluate all possible projections. The approach we propose is to consider only the projections that can be evaluated in some limited time. Owing to the proposed heuristic search, and as we demonstrate experimentally in the studies with microarray cancer data, the overall approach can find interesting projections with high predictive power in a few minutes of runtime.
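The growth in the number of candidate projections, and the idea of evaluating only as many as a time budget allows, can be sketched as follows. This is an illustration under assumptions rather than the procedure used by VizRank: the count assumes that a radviz plot with k features placed on a circle has (k-1)!/2 distinct arrangements (rotations and mirror images treated as equivalent), and the names `radviz_projection_count` and `evaluate_within_budget` are hypothetical. The candidate order here is arbitrary; a heuristic search would instead supply the most promising candidates first.

```python
import itertools
import math
import time


def radviz_projection_count(n_features, k):
    """Number of distinct radviz projections using k of n_features.

    Assumption: choose the k features, then count circular orderings
    up to rotation and reflection, i.e. (k - 1)! / 2 per subset.
    """
    if k < 3 or k > n_features:
        return 0
    return math.comb(n_features, k) * math.factorial(k - 1) // 2


def evaluate_within_budget(candidates, score, budget_seconds):
    """Score candidates in the given order until the time budget runs out.

    Returns (score, candidate) pairs sorted from best to worst.
    """
    deadline = time.monotonic() + budget_seconds
    ranked = []
    for candidate in candidates:
        if time.monotonic() >= deadline:
            break
        ranked.append((score(candidate), candidate))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


if __name__ == "__main__":
    # The count explodes as more features are allowed in a single plot.
    for k in range(3, 11):
        print(k, radviz_projection_count(1000, k))

    # Toy scoring function: stands in for a real projection score.
    pairs = itertools.combinations(range(4), 2)
    print(evaluate_within_budget(pairs, score=sum, budget_seconds=1.0)[:3])
```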
Experimental Analysis
We tested VizRank on several microarray gene expression datasets. The analysis of such datasets has recently gained considerable attention in the data mining community. Typically, the datasets include up to a few hundred data instances. Instances (tissue samples from different patients) are represented as sets of gene expression measurements; most often, datasets include measurements of several thousand genes. As tissue samples are labeled, with classes denoting malignant vs. nonmalignant tissue or different cancer types, the task is to determine whether the cancer type can be diagnosed from the set of gene expressions. Another important question arising from cancer microarray datasets is the minimal number of genes whose expression must be measured in order to derive a reliable diagnosis.
In the following, we first describe six datasets used in our analysis and discuss the top-ranked projections. We then present a study in which we used top-ranked projections as simple and understandable prediction models and show that despite their simplicity we achieve high prediction accuracies. We also show how these projections can be used to find important features and identify possible outliers or misclassified instances.
Datasets
Gene expression datasets are obtained by the use of DNA microarray technology, which simultaneously measures the expression of thousands of genes in a biological sample. These datasets can be used to identify specific genes that are differently expressed across different tumor types. Several recent studies of different cancer types (Armstrong, Staunton, Silverman, Pieters, den Boer, & Minden, 2002; Golub et al., 1999; Nutt, Mani, Betensky, Tamayo, Cairncross, & Ladd, 2003; Shipp, Ross, Tamayo, Weng, Kutok, & Aguiar, 2002) have demonstrated the utility of gene expression profiles for cancer classification and reported superior classification performance when compared to standard morphological criteria.
What makes microarray datasets unique and difficult to analyze is that they typically contain thousands of features (genes) and only a small number of data instances (patients). Analysts typically treat them with a combination of methods for feature filtering, subset selection, and modeling. For instance, in the work reported by Khan, Wei, Ringner, Saal, Ladanyi, & Westermann (2001) on the SRBCT dataset, the authors first removed genes with low expression values throughout the data set, and then trained 3,750 feed-forward neural networks on different subsets of genes as determined by principal component analysis. They analyzed the resulting networks for the most informative genes, thus obtaining a subset of 96 genes whose expression clearly separated the different cancer types when used in multidimensional scaling. Other approaches, often similar in their complexity, include k-nearest neighbors, weighted voting of informative genes (Golub et al., 1999), and support vector machines (Statnikov, Aliferis, Tsamardinos, Hardin, & Levy, 2005). In most cases, the resulting prediction models are hard or even impossible to interpret and cannot be communicated to the domain experts in a simple way that would allow reasoning about the roles genes play in separating different cancer types.
In our experimental study we considered six publicly available cancer gene expression datasets with 2 to 5 diagnostic categories, 40 to 203 data instances (patients), and 2,308 to 12,600 features (gene expressions). The basic information on these is summarized in Table 1. Three datasets, leukemia (Golub et al., 1999), diffuse large B-cell lymphoma (DLBCL) (Shipp et al., 2002), and prostate tumor (Singh, Febbo, Ross, Jackson, Manola, & Ladd, 2002), include two diagnostic categories. The leukemia data consists of 72 tissue samples, including 47 acute lymphoblastic leukemia (ALL) samples and 25 acute myeloid leukemia (AML) samples, each with 7,074 gene expression values. The DLBCL data set includes expressions of 7,070 genes for 77 patients, 58 with DLBCL and 19 with follicular lymphoma (FL). The prostate tumor data set includes 12,533 genes measured for 52 prostate tumor and 50 normal tissue samples.
Table 1

Data set      Samples (instances)   Genes (features)   Diagnostic classes   Majority class
Leukemia      72                    7,074              2                    52.8%
DLBCL         77                    7,070              2                    75.3%
Prostate      102                   12,533             2                    51.0%
MLL           72                    12,533             3                    38.9%
SRBCT         83                    2,308              4                    34.9%
Lung cancer   203                   12,600             5                    68.5%

The other three datasets analyzed in this work include more than two class labels. The mixed lineage leukemia (MLL) (Armstrong et al., 2002) data set includes 12,533 gene expression values for 72 samples obtained from the peripheral blood or bone marrow samples of affected individuals. The ALL samples with a chromosomal translocation involving the mixed lineage gene were diagnosed as MLL, so three different leukemia classes were obtained (AML, ALL, and MLL). The small round blue cell tumors (SRBCT) dataset (Khan et al., 2001) consists of four types of childhood tumors, including Ewing’s sarcoma (EWS), rhabdomyosarcoma (RB), neuroblastoma (NB), and Burkitt’s lymphoma (BL). It includes 83 samples derived from both tumor biopsy and cell lines and 2,308 genes. The last dataset is the lung cancer dataset (Bhattacharjee, Richards, Staunton, Li, Monti, & Vasa, 2001), which contains 12,600 gene expression values for 203 lung tumor samples (139 adenocarcinomas (AD), 21 squamous cell lung carcinomas (SQ), 20 pulmonary carcinoids (COID), 6 small cell lung cancers (SMLC), and 17 normal lung samples (NL)).
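The majority class column of Table 1 is the share of samples that belong to the most frequent class, which is also the accuracy of a classifier that always predicts that class. As a small worked check, the sketch below recomputes this share from the class counts given in the text for the prostate and lung cancer datasets. The helper name `majority_class_share` and the dictionary keys are chosen for this example only.

```python
def majority_class_share(class_counts):
    """Fraction of samples in the most frequent class (hypothetical helper)."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


# Class counts as stated in the dataset descriptions.
prostate = {"tumor": 52, "normal": 50}
lung = {"AD": 139, "SQ": 21, "COID": 20, "SMLC": 6, "NL": 17}

if __name__ == "__main__":
    print(f"Prostate: {majority_class_share(prostate):.1%}")    # 52 / 102
    print(f"Lung cancer: {majority_class_share(lung):.1%}")     # 139 / 203
```

Any prediction model built from a projection should be judged against this baseline, since a dataset such as lung cancer already yields 68.5% accuracy from the trivial majority-class rule.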