Discussion - Structured clustering representations and methods

The main contribution of BOMBASTIC is to explicate a generalized and modular methodology for block-organized clustering that is relevant to many data analysis problems in biology, and to provide a software implementation that facilitates efficient visualization, filtering, and exploration of a large space of potential analyses and their results.


Overview of BOMBASTIC interface. To construct an analysis, the user drags blocks from a menu of available datasets (1) to assemble a sequence of BlockClusteringGenerators (2). Such a sequence suffices to specify the generation of the rest of the analyses, clustering each block independently. Statistics associated with each resulting BlockClustering are shown in histograms (3), which may be used to interactively select and filter subsets of the data. A cross-filter view of BlockClusteringResults is shown in (4), and the full representation of the clustering tree in (5). A user may select any combination of clusters and their intersections using either the cross-filter or the tree view, and the objects (e.g. genes) comprising that cluster can then be interrogated in detail (6), and additional analyses (e.g. computing over-representation of associated regulatory motifs) applied to annotate the constituents of the currently selected node (7, 8). The entire system is scriptable and can be controlled through an integrated Python console (9).

3.3.1 Benefits over traditional methods

Comparing sizes of subsets with Venn diagrams

Comparison of discretized changes in gene expression between several groups offers one of the simplest examples of a useful application of BOMBASTIC. Genes might be measured in different contexts and classified as being up-regulated, down-regulated, or unchanged. Venn diagrams are very often used to visualize comparisons between 2, 3, or 4 groups. Beyond 5 groups, however, Venn diagrams become so visually complex that they are unhelpful. Venn diagrams also show only the sizes of intersections, and the identity (e.g. time-course patterns) of each set is indicated only by colors or labels. In contrast, the cross-filter and tree visualizations provided by BOMBASTIC can efficiently and comprehensibly display intersections across an arbitrary number of sets, and the identity of each set and intersection can be directly encoded in the visualization.
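
The bookkeeping behind such a many-way comparison can be sketched in a few lines of Python. This is only an illustration, not BOMBASTIC's implementation: the gene names, the six contexts, the single-letter labels, and the helper functions `intersection_sizes` and `members` are all hypothetical. Each gene carries one discretized label per context, so the tuple of labels both identifies an intersection and describes its pattern.

```python
from collections import Counter

# Hypothetical discretized calls: one label per gene per context.
# "U" = up-regulated, "D" = down-regulated, "N" = unchanged.
calls = {
    "geneA": ("U", "U", "N", "D", "U", "N"),
    "geneB": ("U", "U", "N", "D", "U", "N"),
    "geneC": ("D", "N", "N", "U", "U", "N"),
    "geneD": ("N", "N", "N", "N", "N", "N"),
    "geneE": ("U", "U", "N", "D", "U", "U"),
}


def intersection_sizes(calls):
    """Count genes that share the same label pattern across all contexts."""
    return Counter(calls.values())


def members(calls, pattern):
    """Return genes matching a pattern; None acts as a wildcard for a context."""
    return sorted(
        gene
        for gene, labels in calls.items()
        if all(p is None or p == l for p, l in zip(pattern, labels))
    )


sizes = intersection_sizes(calls)
up_in_first_two = members(calls, ("U", "U", None, None, None, None))
```

Here `sizes` plays the role of the region sizes in a Venn diagram, but the key of each entry is the pattern itself, and the number of contexts can grow without changing the code.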

Concatenation and clustering

Instead of clustering blocks independently, one could concatenate data and employ standard algorithms. Doing so immediately presents the choice of which blocks to use, which is part of the problem addressed by BOMBASTIC. Once a desirable collection of columns had been concatenated, one would likely be able to find the same partitioning that would be produced by a combination of independent clusterings, assuming that individual blocks were of comparable sizes and had similar characteristics. If the blocks were of different sizes or contained data with very different distributions, it would be necessary to develop specialized clustering algorithms or objective functions to account for this.
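
A small numerical sketch may help show why block size matters under concatenation. It assumes squared Euclidean distance as the dissimilarity, which the text does not specify, and uses made-up values for two genes. Because the squared distance of a concatenated vector is the sum of the per-block squared distances, a long block with modest differences can outweigh a short block with strong ones.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical measurements for two genes in two blocks.
# Short block (3 columns): the genes differ strongly (by 4.0 per column).
small_x, small_y = [2.0] * 3, [-2.0] * 3
# Long block (60 columns): the genes differ only modestly (by 1.0 per column).
large_x, large_y = [0.5] * 60, [-0.5] * 60

d_small = sq_dist(small_x, small_y)
d_large = sq_dist(large_x, large_y)
d_concat = sq_dist(small_x + large_x, small_y + large_y)

# Fraction of the concatenated distance contributed by the long block.
share_large = d_large / d_concat

# One simple (assumed) adjustment: average the squared difference per column
# within each block so that block length no longer sets the weight.
per_column_small = d_small / len(small_x)
per_column_large = d_large / len(large_x)
```

With these values the long block contributes more than half of the concatenated distance even though its per-column difference is four times smaller. The per-column average is only one possible correction; it does nothing about blocks whose values follow very different distributions.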

3.3.2 Comparisons to related approaches and systems

BOMBASTIC was first presented at VIZBI 2013 [48] in April 2013, and was also described at the NYAS Data Science Learning and Applications to Biomedical and Health Sciences Workshop in January 2016 [47]. A number of earlier and contemporaneously developed systems address the same class of problems, with both similarities to and differences from BOMBASTIC.

Declarative visualization algebras. As was reviewed in the previous chapter, BOMBASTIC is inspired by declarative visualization techniques such as Polaris/Tableau [122] and the Grammar of Graphics [141, 140]. These tools, however, formalize the problem of mapping a fixed tabular dataset (potentially with hierarchically structured dimensions) into graphical representations. Their algebras do not include primitives for clustering or for combining clusterings into taxonomies. Such tools also do not aim to provide an interactive interface that relates the summary visualizations of a clustering (e.g. the cross-filter or tree views of BOMBASTIC) to analyses and visualizations of the constituents of particular clusters (i.e. the individual genes and results of over-representation analysis).

STEM. STEM [36], the short time-series expression miner, was an early and influential tool for analysis of biological time-course data. For time-course clustering, the simple TICQ approach we have proposed makes fewer assumptions than the STEM method and avoids attempting to prune the space of possible patterns, while preserving the possibility of identifying clusters that might be sparsely populated. STEM also offered a tool for comparing cluster membership between clusterings from two contexts. BOMBASTIC generalizes this to an arbitrary number of independent clusterings, and can generate the combined clustering result formed by all of the intersections.

StratomeX and Domino. StratomeX [74] is a visualization tool aimed at the problem of comparing cancer subtype stratifications. BOMBASTIC is distinguished from StratomeX by its goal of providing a formalization of the clustering problem, in which the combination of two block clusterings produces a new partitioning that can be viewed and used as a 'first-class' clustering itself, whereas StratomeX is primarily described as a visualization method to compare and relate fixed, alternative stratifications.

Gratzl and colleagues [42] (from the same group that developed StratomeX) also recently proposed Domino, a system for "extracting, comparing, and manipulating subsets across multiple tabular datasets". Like BOMBASTIC, Domino recognizes that performing comparisons across heterogeneous datasets is an important problem not well served by existing tools. Domino provides a number of relationship operators to connect blocks that are indexed by the same object types, and even supports selection of clusters across multiple partitionings. BOMBASTIC again differs from Domino in providing a cross-product operator that explicitly constructs a new partitioning by combining two independent partitionings, as well as in providing an explicit tree view. Furthermore, BOMBASTIC supports conducting analyses (e.g. over-representation tests) systematically over each subset in the resulting tree, to facilitate searches for interesting paths and cluster combinations, whereas Domino relies more heavily on selection of subsets under the direction of the user. Finally, neither StratomeX nor Domino is specifically intended for clustering of time-course data, which was a major motivation for BOMBASTIC and the simple but effective TIQCC method that it implements.
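
The idea of a cross-product operator whose result is itself a partitioning, followed by a test applied to every resulting subset, can be sketched as follows. This is a minimal illustration under stated assumptions and not BOMBASTIC's actual API: the function names, gene identifiers, and annotation set are hypothetical, and a one-sided hypergeometric tail probability is assumed as the over-representation test because the text does not name a specific one.

```python
from collections import defaultdict
from math import comb


def product(p, q):
    """Combine two partitionings (object -> label) into a new partitioning.

    The result has the same shape as its inputs, with pairs as labels, so it
    can be passed to product() again. Objects missing from either input are
    dropped.
    """
    return {obj: (p[obj], q[obj]) for obj in p if obj in q}


def clusters(partition):
    """Invert object -> label into label -> set of objects."""
    out = defaultdict(set)
    for obj, label in partition.items():
        out[label].add(obj)
    return dict(out)


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n of N objects, K of which are annotated."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Hypothetical independent clusterings of the same eight genes.
direction = {f"g{i}": "up" if i <= 4 else "down" for i in range(1, 9)}
timing = {f"g{i}": "early" if i in (1, 2, 5, 6) else "late" for i in range(1, 9)}
tissue = {f"g{i}": "liver" if i % 2 else "brain" for i in range(1, 9)}

combined = product(direction, timing)   # labels such as ("up", "early")
nested = product(combined, tissue)      # the product is reusable as an input

# Hypothetical annotation, e.g. genes carrying some regulatory motif.
annotated = {"g1", "g2", "g5"}
universe = set(combined)

# Apply the same test to every subset of the combined partitioning.
enrichment = {
    label: hypergeom_tail(
        len(genes & annotated), len(genes), len(annotated), len(universe)
    )
    for label, genes in clusters(combined).items()
}
```

Each label in `nested` records the sequence of choices that led to a subset, which corresponds to a path from the root in a tree view. In practice the tail probabilities computed over many subsets would also need correction for multiple testing, which this sketch omits.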
