Block Clustering Methods - BOMBASTIC Methods

3.1 BOMBASTIC Methods

3.1.3 Block Clustering Methods

BOMBASTIC is intended to be modular and agnostic about the specific clustering algorithms used to produce the independent block clusterings. The major restriction is that clusterings are assumed to be 'hard': each object should be assigned to exactly one cluster. Initially, we implemented several very simple but widely applicable algorithms.

Binary label assignments

The simplest possible 'clustering' is to partition objects based on the value of some binary label (e.g. indicating membership in some set). This is useful for restricting analyses to subsets of objects of interest; for example, when clustering genes, one might wish to investigate only the set of transcription factors. Calling such restrictions a clustering allows us to implement this frequent task within the general BOMBASTIC framework.
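As a minimal sketch of this idea (not the BOMBASTIC implementation), a binary label can be turned into a two-cluster hard partition as follows. The function name `partition_by_label`, the gene identifiers, and the transcription-factor set are all hypothetical and chosen only for illustration.

```python
def partition_by_label(objects, is_member):
    """Split objects into two hard clusters according to a binary label."""
    clusters = {True: [], False: []}
    for obj in objects:
        # Each object lands in exactly one of the two clusters.
        clusters[bool(is_member(obj))].append(obj)
    return clusters


# Hypothetical gene identifiers and a hypothetical set of transcription factors.
genes = ["GATA1", "ACTB", "TP53", "GAPDH"]
transcription_factors = {"GATA1", "TP53"}

clusters = partition_by_label(genes, lambda g: g in transcription_factors)
```

Here `clusters[True]` holds the objects inside the set of interest and `clusters[False]` holds the rest, so the restriction can be treated like any other clustering.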

Real Filters

A slightly more complicated but equally common scenario is that of filtering objects based on the values of some associated real-valued statistics. In gene expression analysis applications, for example, one often wants to filter genes by fold-change or variance. It is therefore useful to be able to define a clustering of objects by specifying a set of ranges for some statistic. Such ranges may be specified interactively by selecting regions on a histogram. While there are well-established tools to filter quantitative data in this manner (e.g. Spotfire), providing partitioning based on such filters allows this task to fit naturally into the BOMBASTIC scheme and to be used in combination with other clustering algorithms.
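A range-based filter of this kind might be sketched as below. This is an illustration under assumptions rather than the implementation described here: the function name, the half-open interval convention, the catch-all `"unassigned"` cluster, and the example values are all hypothetical.

```python
import math


def partition_by_ranges(values, ranges):
    """Assign each object to the first named range that contains its statistic.

    values: dict mapping object -> float statistic
    ranges: dict mapping cluster name -> (low, high), treated as [low, high)
    Objects that match no range go to "unassigned", so the result is still
    a hard clustering (every object is in exactly one cluster).
    """
    clusters = {name: [] for name in ranges}
    clusters["unassigned"] = []
    for obj, value in values.items():
        for name, (low, high) in ranges.items():
            if low <= value < high:
                clusters[name].append(obj)
                break
        else:
            clusters["unassigned"].append(obj)
    return clusters


# Hypothetical log2 fold changes for four genes.
log2_fold_change = {"g1": 2.3, "g2": -1.7, "g3": 0.1, "g4": 1.0}
ranges = {"down": (-math.inf, -1.0), "up": (1.0, math.inf)}

clusters = partition_by_ranges(log2_fold_change, ranges)
```

The ranges passed in could equally come from regions selected on a histogram; the sketch only assumes that they arrive as numeric bounds.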

Testing individual contrasts

Building on the above two clustering types is the common case of partitioning objects based on the results of a statistical test for a single contrast between two conditions. This produces a partitioning of the objects at multiple levels (e.g. up-regulated, down-regulated, unchanged) based on a combination of filters on both statistical significance and data values.
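The combination of a significance filter and a data-value filter can be sketched as follows. The p-values are assumed to have been computed beforehand by whatever test is appropriate; the thresholds `alpha` and `min_abs_log2_fc`, the function name, and the example numbers are hypothetical.

```python
def classify_contrast(log2_fc, p_value, alpha=0.05, min_abs_log2_fc=1.0):
    """Label one object from a single contrast between two conditions.

    An object counts as changed only if it passes both filters:
    statistical significance (p_value < alpha) and effect size
    (|log2_fc| >= min_abs_log2_fc). Both thresholds are illustrative.
    """
    if p_value < alpha and abs(log2_fc) >= min_abs_log2_fc:
        return "up-regulated" if log2_fc > 0 else "down-regulated"
    return "unchanged"


# Hypothetical (log2 fold change, p-value) pairs, assumed precomputed.
results = {
    "g1": (2.1, 0.001),
    "g2": (-1.5, 0.02),
    "g3": (1.8, 0.30),   # large change, but not significant
    "g4": (0.2, 0.001),  # significant, but small change
}

labels = {gene: classify_contrast(fc, p) for gene, (fc, p) in results.items()}
```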

Trivial Indexed Quantized Contrast Clustering (TIQCC)

Time-course data can be represented as a sequence of contrasts between successive time-points. Given an observation vector $x = [x_0, x_1, \ldots, x_n]$, one can construct the vector of contrasts $c = [x_1 - x_0, x_2 - x_1, \ldots, x_n - x_{n-1}]$.

This leads to a very simple way to cluster such data, which we call Trivial Indexed Quantized Contrast Clustering (TIQCC). We define a set of q levels for quantization (e.g. the intervals between log2 fold changes of [−∞, −2, −1, 0, 1, 2, +∞]), and then enumerate all possible quantized contrast vectors. We then quantize each contrast vector and assign it to the matching cluster. (Since many of the possible clusters may be unoccupied, one can start with the data and keep track of only those clusters that have support.)
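The procedure can be sketched in a few lines. This is a minimal illustration under assumptions, not the original implementation: the function names are hypothetical, intervals are treated as half-open (a value equal to a boundary falls in the interval above it), and only clusters with support are stored, as suggested above.

```python
from bisect import bisect_right
from collections import defaultdict

# Interior boundaries of the quantization intervals (log2 fold change).
# The outermost intervals extend to -inf and +inf, giving q = 6 levels.
BOUNDARIES = [-2, -1, 0, 1, 2]


def contrasts(x):
    """Differences between successive time-points."""
    return [b - a for a, b in zip(x, x[1:])]


def quantize(c, boundaries=BOUNDARIES):
    """Map each contrast to the index (0..q-1) of the interval containing it."""
    return tuple(bisect_right(boundaries, v) for v in c)


def tiqcc(profiles, boundaries=BOUNDARIES):
    """Group profiles by quantized contrast vector, keeping only occupied clusters."""
    clusters = defaultdict(list)
    for name, x in profiles.items():
        clusters[quantize(contrasts(x), boundaries)].append(name)
    return dict(clusters)


# Hypothetical log2 expression profiles over three time-points.
profiles = {
    "g1": [0.0, 1.5, 3.0],
    "g2": [5.0, 6.2, 7.9],  # same shape as g1 at a higher absolute level
    "g3": [2.0, 0.5, 0.5],
}

clusters = tiqcc(profiles)
```

In this example `g1` and `g2` share a cluster despite their different absolute levels, because both rise by between 1 and 2 log2 units at each step. Each profile is visited once, so the cost grows linearly with the number of observations.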

This scheme has several useful properties:

• Observations are clustered by shape, rather than absolute level. This is particularly important when analyzing bio-molecular data, since there are wide variations in dynamic range, and biological relevance is not necessarily related to absolute levels. Moreover, many common experimental techniques measure only relative changes.

• Every potential pattern will be represented; one does not have to choose the number of clusters to use, and even rare patterns will be represented by clusters.

• The number of quantization levels can be adjusted to generate more or less granular clusterings.

• The quantization method can easily be extended to incorporate other statistics about the contrasts. For example, if we perform significance tests for each contrast, they can be used as a filter in the quantization and assignment to clusters.

• The algorithm is extremely simple and fast, and scales linearly with the number of observations.

A limitation of TIQCC is that it is only suitable for relatively short time-courses: if q is the number of quantization levels for each delta and t the number of time-points, each profile yields t − 1 contrasts, so the number of possible clusters is $q^{t-1}$.
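To give a sense of how quickly this grows, the following sketch evaluates the count for q = 6 levels (the number of intervals in the example boundaries given earlier) and a few values of t. The function name is hypothetical.

```python
def n_possible_clusters(q, t):
    """Number of distinct quantized contrast vectors for t time-points."""
    return q ** (t - 1)


# q = 6 levels; t = 4, 6 and 10 time-points.
counts = {t: n_possible_clusters(6, t) for t in (4, 6, 10)}
# counts == {4: 216, 6: 7776, 10: 10077696}
```

With ten time-points there are already over ten million possible clusters, which is why tracking only the occupied ones matters in practice.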

Ernst et al. also proposed a time series clustering algorithm based on quantized patterns [36], although the patterns used were not exhaustive, and the choice of patterns to be used as clusters was independent of the observed data.

Another related approach is to cluster time-courses by their derivatives, after transformation to splines. For example, Dejean [29] described an algorithm in which time-course data was smoothed and represented by cubic splines. The profiles were then clustered (using k-means) on the first derivatives of those functions, to cluster the profiles by their shapes rather than absolute levels.

The same approach can also be used for the case in which two conditions are compared over time, such as disease vs. normal. In such an experiment, it is often the contrast between disease and normal that is of the greatest interest, and so the vector of these contrasts can be used directly, without comparing successive time-points. If relevant, one could also cluster by the time-dependent changes in the disease vs. normal contrast.
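The two variants can be sketched as follows, assuming paired measurements at the same time-points in both conditions. The function names and the example values are hypothetical; either resulting vector could then be quantized in the way described for TIQCC.

```python
def condition_contrasts(disease, normal):
    """Per-time-point contrast between two conditions (disease minus normal)."""
    if len(disease) != len(normal):
        raise ValueError("both conditions need the same number of time-points")
    return [d - n for d, n in zip(disease, normal)]


def successive_differences(c):
    """Time-dependent change in the condition contrast."""
    return [b - a for a, b in zip(c, c[1:])]


# Hypothetical log2 expression for one gene at three time-points.
disease = [1.0, 2.5, 4.0]
normal = [1.0, 1.5, 2.0]

c = condition_contrasts(disease, normal)  # contrast at each time-point
dc = successive_differences(c)            # how that contrast changes over time
```

Note that the first variant gives one contrast per time-point, while the second gives one fewer, as with contrasts between successive time-points in a single condition.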

Scaled, centered K-means

Another simple algorithm suitable for clustering short time-course data is k-means. When the shapes of the time-courses are of interest, the data may first be transformed by centering and then scaling to a uniform range. Importantly, the parameters used in these transformations are recorded and propagated to the resulting BlockClusteringResult, so that queries can combine criteria on cluster shape (e.g. a cluster that is monotonically increasing) with additional criteria on the original values (e.g. having expression above a baseline level and spanning a dynamic range of at least a 4-fold change).
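The sketch below illustrates the idea under several assumptions: profiles are centered on their mean and divided by their range, a plain Lloyd's k-means is seeded with the first k profiles (a simplification), and a dictionary of recorded parameters stands in for whatever BlockClusteringResult actually stores. All names and data are hypothetical.

```python
import math


def center_and_scale(x):
    """Center a profile on its mean and scale it to unit range; return shape and parameters."""
    offset = sum(x) / len(x)
    spread = max(x) - min(x)
    scale = spread if spread > 0 else 1.0  # flat profiles stay at zero
    return [(v - offset) / scale for v in x], {"offset": offset, "scale": scale}


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical log2 expression profiles.
profiles = {
    "g1": [1.0, 1.5, 2.0],
    "g3": [8.0, 6.0, 4.0],
    "g2": [4.0, 6.0, 8.0],
    "g4": [3.0, 2.0, 1.0],
}
names = list(profiles)
shapes, params = zip(*(center_and_scale(profiles[n]) for n in names))
labels, centroids = kmeans(list(shapes), k=2)

# Query: members of a monotonically increasing cluster whose original range
# is at least 2 log2 units (a 4-fold change), using the recorded parameters.
increasing = [
    j for j, c in enumerate(centroids) if all(a < b for a, b in zip(c, c[1:]))
]
hits = [
    n for n, lab, p in zip(names, labels, params)
    if lab in increasing and p["scale"] >= 2.0
]
```

Here `g1` and `g2` fall in the same increasing cluster after transformation, but only `g2` satisfies the dynamic-range criterion, which can be checked only because the scaling parameters were kept alongside the cluster assignments.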
