Visualizing Clustering Results - Structured clustering representations and methods

Once clusters have been found, by whatever methods are used, the clusters must be presented to the user in an accessible way. For visualizing clustered data matrices in biology, heat maps and ‘cluster-grams’ [34] are ubiquitous. Such views are static, and rows can be arranged in only one order. While a carefully chosen ordering can be used to emphasize relationships between different subsets of columns, heat maps on their own do not provide an effective way to compare alternative blocks and orderings.

StratomeX [74] is a tool developed for visualizing cancer subtypes, in which a set of samples is partitioned using several different types of data. StratomeX has similar motivations to BOMBASTIC and shares the concept of composing analyses by relating blocks of data. Each (fixed) partitioning by some type of data (e.g. RNA expression, miRNA expression, mutation status) is represented as a column, and the partitions within each data type are drawn as blocks. Ribbons are drawn to represent intersections between partitions across blocks, adopting the Parallel Sets idea originally presented in [68]. Earlier applications of the parallel coordinates / parallel sets method to compare multiple partitionings include [144] and [46]. StratomeX also provides ‘dependent’ columns that display a representation of a dependent variable within a selected subset of the data. StratomeX was described as a visualization technique aimed specifically at comparing pre-computed cancer subtype stratifications. A related method, Domino [42], has also recently been proposed to aid in the manipulation of subsets across multiple tabular datasets, reinforcing the emerging recognition of the importance of this class of data analysis problems. BOMBASTIC has been developed contemporaneously with both of these systems [48] and, while it includes a visualization component, aims to provide a more generic methodology for structured clustering, as well as to offer specializations for time-indexed data.
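The quantity behind a Parallel Sets style ribbon can be illustrated with a small sketch: for two partitionings of the same samples, the ribbon joining a pair of blocks is sized by the number of samples the two blocks share. The function name `ribbon_sizes` and the sample data below are hypothetical and chosen only for illustration; this is not taken from StratomeX or BOMBASTIC.

```python
from collections import Counter


def ribbon_sizes(partition_a, partition_b):
    """Count the samples shared by each pair of blocks from two partitionings.

    Each argument maps a sample id to its block label. Only samples that
    appear in both partitionings are counted.
    """
    shared = set(partition_a) & set(partition_b)
    return Counter((partition_a[s], partition_b[s]) for s in shared)


# Hypothetical data: six samples partitioned by two different data types.
by_expression = {"s1": "E1", "s2": "E1", "s3": "E1",
                 "s4": "E2", "s5": "E2", "s6": "E2"}
by_mutation = {"s1": "mut", "s2": "mut", "s3": "wt",
               "s4": "wt", "s5": "wt", "s6": "mut"}

ribbons = ribbon_sizes(by_expression, by_mutation)
for (block_a, block_b), n_samples in sorted(ribbons.items()):
    print(block_a, block_b, n_samples)
```

Each key of `ribbons` names one block from each partitioning, and the counts sum to the number of samples common to both, so the table is simply the contingency table of the two partitionings.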

2.3.2 Faceted Search

An alternative way to frame the problem of analyzing and discovering scientifically interesting subsets of items within a large dataset is as an information retrieval or search task, in which the scientist’s job is to formulate queries. Unfortunately, scientists’ queries are often vague and uncertain. One important technique developed in the field of information retrieval to help users explore complex databases in the face of uncertainty and poorly specified queries is faceted navigation and search [134, 104]. BOMBASTIC may also be considered an attempt to provide a dynamic, faceted navigation system for querying structured quantitative data.

Faceted search extends the notion of a fixed, hierarchical taxonomy to permit dynamic, iterative composition of facets drawn from separate taxonomies that describe different aspects of objects. Each individual facet is itself a “hierarchy formed using a [distinct] characteristic of division” [134] (i.e. a facet is similar to a taxonomic character, though facets may be hierarchical themselves). Specific points within these facet hierarchies can be selected, and choices from multiple facets can be combined to define sets of objects matching all of the chosen predicates. The number of facets used and the order in which selections are made are both flexible, and a faceted classification system is “hospitable” to extension with new facets that do not fit into the existing hierarchies.

The original faceted search system was the “colon classification” library system by Ranganathan in 1933 [134]. More recently, faceted search has become nearly ubiquitous in both e-commerce and document information retrieval systems. Faceted databases can be queried with combinations of boolean predicates on the facets, and this process can be facilitated by providing interfaces that list the possible parameters for each facet. Modern faceted navigation extends the parametric search idea by dynamically updating the interface to show only the allowable remaining parameters as a user iteratively refines a search.
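The refinement loop described above can be sketched in a few lines: selections from several facets are combined conjunctively, and after each selection the interface is offered only those values that still occur among the matching items. The catalogue, facet names, and function names below are hypothetical and serve only as an illustration under those assumptions.

```python
from collections import Counter

# Hypothetical catalogue: each item carries one value per facet.
ITEMS = [
    {"id": 1, "organism": "human", "assay": "RNA-seq", "tissue": "liver"},
    {"id": 2, "organism": "human", "assay": "RNA-seq", "tissue": "brain"},
    {"id": 3, "organism": "mouse", "assay": "RNA-seq", "tissue": "liver"},
    {"id": 4, "organism": "human", "assay": "ChIP-seq", "tissue": "liver"},
]


def matching(items, selections):
    """Return the items that satisfy every selected facet value (logical AND)."""
    return [item for item in items
            if all(item.get(facet) == value
                   for facet, value in selections.items())]


def remaining_options(items, selections):
    """For each facet not yet selected, count the values still reachable."""
    chosen = set(selections)
    options = {}
    for item in matching(items, selections):
        for facet, value in item.items():
            if facet == "id" or facet in chosen:
                continue
            options.setdefault(facet, Counter())[value] += 1
    return options


# First refinement step: restrict to human samples.
step1 = remaining_options(ITEMS, {"organism": "human"})
# Second step: additionally restrict the assay; only tissue remains open.
step2 = remaining_options(ITEMS, {"organism": "human", "assay": "RNA-seq"})
print(step1)
print(step2)
```

Because the counts are recomputed from the current result set, a value that would lead to an empty result never appears among the options, which is the behavior that distinguishes faceted navigation from plain parametric search.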

A facet may be either fixed or dynamic. In the colon classification system, the taxonomies used for each of the facets (describing aspects such as location (Earth -> USA -> Massachusetts -> Boston) or time (20th century -> Late 20th century -> 1980s)) are pre-defined. Modern faceted search systems are often used for semi-structured datasets that may have some rigidly defined facets, but also permit ‘dynamic facets’ to be defined by full-text queries. These ‘search’ facets have typically been unstructured, although a recent extension has been to construct hierarchical facets dynamically [27, 3] using unsupervised methods such as topic models.
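One simple way to represent a pre-defined hierarchical facet is to label each item with its full path from the root, so that selecting a node matches every item whose path begins with that node's path. The sketch below assumes this path-prefix representation; the item names and the helpers `is_under` and `select` are hypothetical.

```python
def is_under(node, path):
    """Return True if `path` lies at or below `node` in the facet hierarchy."""
    return path[:len(node)] == node


# Hypothetical items labelled with full paths in a location facet.
LOCATIONS = {
    "doc1": ("Earth", "USA", "Massachusetts", "Boston"),
    "doc2": ("Earth", "USA", "California"),
    "doc3": ("Earth", "France", "Paris"),
}


def select(items, node):
    """Names of all items filed at or below the chosen facet node."""
    return sorted(name for name, path in items.items() if is_under(node, path))


print(select(LOCATIONS, ("Earth", "USA")))
print(select(LOCATIONS, ("Earth", "USA", "Massachusetts")))
```

Selecting a deeper node always yields a subset of the items matched by its ancestors, which is what allows a user to narrow a search by descending the hierarchy one level at a time.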

2.3.3 Dynamic Queries

Clustering provides a mechanism to assign discrete, hierarchically organized category labels to observations, against which to formulate queries. However, many of the properties on which one might like to filter are continuous variables, such as parameters used in clustering algorithms or any other statistics computed from the data. Dynamic query interfaces [1] provide graphical representations of parametric queries and statistical summarizations of data, allowing users to interactively select subsets of observations by direct interaction to define regions of interest on plots of variables and their distributions, and to see how the distributions of different variables relate to each other. An important design goal of BOMBASTIC will be to facilitate queries over both discrete categories and associated continuous properties simultaneously.
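A query that constrains a discrete category and a continuous property at the same time can be sketched as a single filter, with the range playing the role of a slider or brushed region in a dynamic query interface. The `Observation` type, its fields, and the `query` function are hypothetical names used only for illustration; they are not part of BOMBASTIC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    name: str
    cluster: str   # discrete category label, e.g. from a clustering
    score: float   # hypothetical continuous statistic computed from the data


def query(observations, clusters=None, score_range=None):
    """Select observations by cluster membership and an inclusive score range.

    Passing None for either argument leaves that property unconstrained.
    """
    if score_range is None:
        low, high = float("-inf"), float("inf")
    else:
        low, high = score_range
    return [obs for obs in observations
            if (clusters is None or obs.cluster in clusters)
            and low <= obs.score <= high]


# Hypothetical data set.
DATA = [
    Observation("g1", "A", 0.2),
    Observation("g2", "A", 0.8),
    Observation("g3", "B", 0.9),
    Observation("g4", "B", 0.4),
]

selected = query(DATA, clusters={"A"}, score_range=(0.5, 1.0))
print([obs.name for obs in selected])
```

In an interactive setting the same function would be re-evaluated whenever the user moves a range control or toggles a category, so that the displayed subset always reflects the current combination of constraints.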
