Related Work - Data Mining Patterns New Methods And Applications Pascal Poncelet (2008) pdf

Several approaches have been developed (or can be adapted) for finding interesting data visualizations. Most of them search for the vectors in the original feature space that contain some interesting information. The two most important vectors are then visualized in a scatterplot, one vector on the x axis and the other on the y axis. Such projections are called linear projections since each axis is a linear combination of the original features. An overview of such methods follows.
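To make the notion of a linear projection concrete, the following minimal Python sketch maps each instance to a scatterplot point whose two coordinates are weighted sums of its feature values. The function name `project` and the weight vectors are illustrative assumptions and do not come from any of the methods reviewed here.

```python
def project(rows, x_weights, y_weights):
    """Map each feature vector to a 2-D point (x, y).

    Each coordinate is a linear combination of the original features,
    which is what makes the projection "linear".
    """
    points = []
    for row in rows:
        x = sum(w * v for w, v in zip(x_weights, row))
        y = sum(w * v for w, v in zip(y_weights, row))
        points.append((x, y))
    return points


# Two instances with three features each (made-up values).
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# x axis: the first feature alone; y axis: the average of the other two.
points = project(rows, [1.0, 0.0, 0.0], [0.0, 0.5, 0.5])
```

The methods below differ only in how they choose the two weight vectors.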

In the area of unsupervised learning, one of the oldest techniques is principal component analysis (PCA). PCA is a dimension reduction technique that uses variance as a measure of interestingness and finds orthogonal vectors (principal components) in the feature space that account for the most variance in the data. Visualizing the two most important vectors can identify some elongated shapes or outliers. A more recent and general technique was developed by Friedman and Tukey (1974), and is known as projection pursuit. Diaconis and Friedman (1984) proved that a randomly selected projection of a high-dimensional dataset would show an approximately Gaussian distribution of data points. Since we are interested in nonrandom patterns, such as clusters or long tails, they proposed measuring interestingness as departure from normality. Several such measures, known as projection pursuit indices, were developed and can be used in a gradient-based approach to search for interesting projections.
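The contrast between the two interestingness measures can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the first principal component is found by simple power iteration, and the absolute excess kurtosis is used as a stand-in projection pursuit index (the published indices are more elaborate). All function names and the toy data are hypothetical.

```python
import math


def covariance(rows):
    """Sample covariance matrix of the feature columns."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_principal_component(rows, iterations=200):
    """Direction of largest variance, found by power iteration.

    Assumes the data are not constant and that the starting vector is
    not orthogonal to the leading eigenvector.
    """
    cov = covariance(rows)
    d = len(cov)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


def project_1d(rows, direction):
    """Project every instance onto a single direction."""
    return [sum(w * v for w, v in zip(direction, row)) for row in rows]


def kurtosis_index(values):
    """Absolute excess kurtosis, an assumed stand-in for a pursuit index.

    It is close to zero for Gaussian-like samples and grows as the
    projected distribution departs from normality (e.g. two clusters).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("projection is constant")
    return abs(m4 / (m2 * m2) - 3.0)


# Toy data: the first feature forms two tight clusters at -1 and +1,
# the second feature has the larger variance but no cluster structure.
rows = [[-1.0, -2.0], [-1.0, 0.0], [-1.0, 2.0],
        [1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]

pc1 = first_principal_component(rows)
index_pc1 = kurtosis_index(project_1d(rows, pc1))
index_x = kurtosis_index(project_1d(rows, [1.0, 0.0]))
```

On this toy data the first principal component follows the second feature, because it has the larger variance, whereas the kurtosis index is higher along the first feature, where the two clusters lie. A gradient-based search would adjust the direction to increase such an index rather than compare two fixed candidates.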

Probably the most popular method for finding projections for labeled data is Fisher’s linear discriminant analysis (LDA) (Duda, Hart & Stork, 2000), which finds a linear combination of features that best discriminates between instances of two classes. When we have more than two classes, we can compute discriminants for each pair of classes and visualize pairs of discriminants in scatterplots. LDA’s drawbacks (sensitivity to outliers, the assumption of an equal covariance matrix for the instances of each class, etc.) gave rise to several modifications of the method. One of the most recent ones is normalized LDA (Koren & Carmel, 2004), which normalizes the distances between instances and makes the method far more robust with respect to outliers. Another method that searches for projections with a good class separation is FreeViz (Demsar, Leban & Zupan, 2005), which could be considered a projection pursuit for supervised learning. FreeViz plots the instances in a two-dimensional projection where the instance’s position in each dimension is computed as a linear combination of feature values. The optimization procedure is based on a physical metaphor in which data instances of the same class attract, and instances of different classes repel each other. The procedure then searches for a configuration with minimal potential energy, which at the same time results in the optimal (as defined by the algorithm) class separation.

Figure 7. Parallel coordinates plot of the leukemia data set
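The two-class Fisher discriminant mentioned above has a closed form: the direction is the inverse of the pooled within-class scatter matrix applied to the difference of the class means. The sketch below is a minimal illustration restricted to two features so that the 2x2 inverse can be written out by hand; the function names and the toy data are assumptions, and none of the robust variants (such as normalized LDA) or the FreeViz optimization are attempted here.

```python
def class_mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, mean):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around `mean`."""
    sxx = sum((p[0] - mean[0]) ** 2 for p in points)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    syy = sum((p[1] - mean[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_direction(class_a, class_b):
    """Fisher discriminant direction w = Sw^-1 (mean_b - mean_a)."""
    mean_a, mean_b = class_mean(class_a), class_mean(class_b)
    axx, axy, ayy = scatter(class_a, mean_a)
    bxx, bxy, byy = scatter(class_b, mean_b)
    # Pooled within-class scatter matrix Sw.
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    return [(syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det]


# Toy data: the classes differ only in the first feature.
class_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
class_b = [(4.0, 0.0), (4.0, 1.0), (5.0, 0.0), (5.0, 1.0)]

w = fisher_direction(class_a, class_b)
proj_a = [w[0] * p[0] + w[1] * p[1] for p in class_a]
proj_b = [w[0] * p[0] + w[1] * p[1] for p in class_b]
```

Because the sums in `scatter` are dominated by any instance far from its class mean, a single outlier can tilt the resulting direction, which is the sensitivity the text attributes to LDA.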

Conclusion

We presented a method called VizRank that can evaluate different projections of class-labeled data and rank them according to their interestingness, defined by the degree of class separation in the projection. Analysts can then focus only on the small subset of highest-ranked projections that contain potentially interesting information regarding the importance of the features, their mutual interactions and their relation with the classes. We have evaluated the proposed approach on a set of cancer microarray datasets, all featuring about a hundred data instances but a large number of features, which, for the largest datasets, ran into several thousand.
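The ranking idea can be sketched as follows. This is not VizRank's implementation: the scoring function is not detailed in this section, so the sketch substitutes leave-one-out k-nearest-neighbor accuracy in the projected plane as an assumed measure of class separation, and it considers only scatterplots of feature pairs. The function names, the parameter `k` and the toy data are all hypothetical.

```python
from itertools import combinations


def knn_loo_accuracy(points, labels, k=3):
    """Leave-one-out accuracy of a k-NN classifier on 2-D points."""
    correct = 0
    for i, p in enumerate(points):
        # Squared distances to all other points, nearest first.
        # Distance ties are broken by label, an arbitrary choice.
        neighbors = sorted(
            ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2, labels[j])
            for j, q in enumerate(points)
            if j != i
        )
        votes = {}
        for _, label in neighbors[:k]:
            votes[label] = votes.get(label, 0) + 1
        predicted = max(votes, key=votes.get)
        if predicted == labels[i]:
            correct += 1
    return correct / len(points)


def rank_projections(rows, labels, feature_names, k=3):
    """Score every feature-pair scatterplot and sort best first."""
    scores = []
    for a, b in combinations(range(len(feature_names)), 2):
        points = [(row[a], row[b]) for row in rows]
        score = knn_loo_accuracy(points, labels, k)
        scores.append(((feature_names[a], feature_names[b]), score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data: only feature f0 separates the two classes.
rows = [
    [0.0, 0.0, 0.0], [0.1, 1.0, 1.0], [0.2, 0.0, 1.0], [0.1, 1.0, 0.0],
    [5.0, 0.0, 0.0], [5.1, 1.0, 1.0], [5.2, 0.0, 1.0], [5.1, 1.0, 0.0],
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

ranked = rank_projections(rows, labels, ["f0", "f1", "f2"])
```

With several thousand features the number of pairs becomes very large, so an exhaustive loop like this one would need a heuristic to decide which projections to evaluate first.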

Perhaps the most striking experimental result reported in this work is that we found simple visualizations that clearly differentiate among cancer types for all cancer gene expression datasets investigated. This finding complements recent related work in the area demonstrating that gene expression cancer data can provide grounds for reliable classification models (Statnikov et al., 2005). However, our “visual” classification models are much simpler, comprise a much smaller number of features, and provide a means for simple interpretation, as was demonstrated throughout the chapter.

The approach presented here is of course not limited to cancer gene expression analysis and can be applied to search for good visualizations on any class-labeled dataset that includes continuous or nominal features. VizRank is freely available within the Orange open-source data mining suite (Demsar, Zupan & Leban, 2004; Demsar, Zupan, Leban & Curk, 2004), and can be found on the Web at www.ailab.si/orange.

References

Armstrong, S. A., Staunton, J. E., Silverman, L. B., Pieters, R., den Boer, M. L., Minden, M. D., Sallan, S. E., Lander, E. S., Golub, T. R., & Korsmeyer, S. J. (2002). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30, 41-47.

Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E. J., Lander, E. S., Wong, W., Johnson, B. E., Golub, T. R., Sugarbaker, D. J., & Meyerson, M. (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. PNAS, 98, 13790-13795.

Brunsdon, C., Fotheringham, A. S., & Charlton, M. (1998). An investigation of methods for visualising highly multivariate datasets. In D. Unwin & P. Fisher (Eds.), Case studies of visualization in the social sciences (pp. 55-80).

Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1, 21-27.

Dasarathy, B. W. (1991). Nearest neighbor (NN) norms: NN pattern classification techniques. Los Alamitos, CA: IEEE Computer Society Press.

Demsar, J., Leban, G., & Zupan, B. (2005). FreeViz - An intelligent visualization approach for class-labeled multidimensional data sets. IDAMAP-05 Workshop Notes. Aberdeen, UK.

Demsar, J., Zupan, B., & Leban, G. (2004). Orange: From experimental machine learning to interactive data mining. White Paper, Faculty of Computer and Information Science, University of Ljubljana.

Demsar, J., Zupan, B., Leban, G., & Curk, T. (2004). Orange: From experimental machine learning to interactive data mining. In Proceedings of the European Conference of Machine Learning. Pisa, Italy.

Diaconis, P., & Friedman, D. (1984). Asymptotics of graphical projection pursuit. Annals of Statistics, 1, 793-815.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. Wiley-Interscience.

Friedman, J. H., & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23, 881-889.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., & Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537.

Harris, R. L. (1999). Information graphics: A comprehensive illustrated reference. New York: Oxford Press.

Hoffman, P., Grinstein, G., & Pinkney, D. (1999). Dimensional anchors: A graphic primitive for multidimensional multivariate information visualizations. In Proceedings of the 1999 Workshop on new paradigms in information visualization and manipulation.

Inselberg, A. (1981). N-dimensional graphics, part I-lines and hyperplanes (Tech. Rep. No. G320-2711). IBM Los Angeles Scientific Center.

Keim, D. A., & Kriegel, H. (1996). Visualization techniques for mining large databases: A comparison. Transactions on Knowledge and Data Engineering, 8, 923-938.

Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., & Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673-679.

Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning (ECML).

Koren, Y., & Carmel, L. (2004). Robust linear dimensionality reduction. IEEE Transactions on Visualization and Computer Graphics, 10, 459-470.

Krauthgamer, R., & Lee, J. (2004). Navigating nets: Simple algorithms for proximity search. In Proceedings of the 15th Annual Symposium on Discrete Algorithms.

Leban, G., Zupan, B., Vidmar, G., & Bratko, I. (2006). VizRank: Data visualization guided by machine learning. Data Mining and Knowledge Discovery, 13, 119-136.

Liao, C., Wang, X. Y., Wei, H. Q., Li, S. Q., Merghoub, T., Pandolfi, P. P., & Wolgemuth, D. J. (2001). Altered myelopoiesis and the development of acute myeloid leukemia in transgenic mice overexpressing cyclin A1. PNAS, 98, 6853-6858.

Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.

Nutt, C. L., Mani, D. R., Betensky, R. A., Tamayo, P., Cairncross, J. G., Ladd, C., Pohl, U., Hartmann, C., McLaughlin, M. E., Batchelor, T. T., Black, P. M., von Deimling, A., Pomeroy, S. L., Golub, T. R., & Louis, D. N. (2003). Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research, 63, 1602-1607.

Quinlan, J. R. (1986). Induction of decision trees.

Schubart, K., Massa, S., Schubart, D., Corcoran, L. M., Rolink, A. G., & Matthias, P. (2001). B cell development and immunoglobulin gene transcription in the absence of Oct-2 and OBF-1. Nature Immunol, 2, 69-74.

Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., Ray, T. S., Koval, M. A., Last, K. W., Norton, A., Lister, T. A., Mesirov, J., Neuberg, D. S., Lander, E. S., Aster, J. C., & Golub, T. R. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8, 68-74.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W., Golub, T. R., & Sellers, W. R. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203-209.

Statnikov, A., Aliferis, C. F., Tsamardinos, I., Hardin, D., & Levy, S. (2005). A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631-643.

Wang, A., & Gehan, E. A. (2005). Gene selection for microarray data analysis using principal component analysis. Stat Med, 24, 2069-2087.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques with Java implementations (2nd ed.). San Francisco:
