Clustering structured data - Structured clustering representations and methods

In the 1960s and 1970s, biological taxonomists discovered computers and developed the field of numerical taxonomy [118, 32], which applied quantitative similarity metrics and numerical clustering algorithms to the task of classifying organisms. These methods were applicable both to phenetics, in which organisms are classified by phenotypic features, and to the cladistic and phylogenetic methods that have come to dominate modern taxonomy. Classifying organisms by phenotype requires first choosing a set of characters to use. For example, to classify micro-organisms, Sneath [117, 32] proposes using characteristics such as morphology (number of flagella, shape of spores), biochemical properties (anaerobic vs. aerobic; oxidase activity), and drug sensitivity (for example, sensitivity to penicillin). When used to classify objects in a taxonomy, such properties are referred to as taxonomic characters. In the task of classifying or identifying organisms, these characters have typically been chosen by experts to be easily observable and informative.

The distinctions between the different kinds of characters used in phenetics are usually clear-cut: morphology is one thing, and drug sensitivity is another, and there are typically a limited number of easily observable characteristics to work with. When these methods are applied to more abstract and plentiful data, it may be less obvious whether different variables reflect different characters, which variables should be considered together as groups, and which variables and combinations should be used at all to construct a useful taxonomy or partitioning of objects.

2.2.2 Biclustering, 3D biclustering, and Plaid Models

Biclustering methods address the question of finding subsets of variables that are relevant for distinguishing only subsets of objects, usually for the case where there is no prior suggestion of a natural organization of the variables. Biclustering [82] was popularized by [23] as a method that "groups items based on a similarity measure that depends on context", relaxing the assumption of standard clustering methods in which all conditions (columns) are given equal weight. The objective in biclustering is to simultaneously discover subsets of both genes and conditions with similar profiles. This is potentially a much more computationally difficult problem than the one to be addressed by the BOMBASTIC method to be introduced in this dissertation (Chapter 3), which assumes that the column subsets are pre-specified, and that a user explicitly chooses which blocks to use.
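To make the idea of a similarity measure that depends on context concrete, the following is a minimal sketch of one widely used bicluster coherence score, the mean squared residue. It is offered as an illustration only and is not necessarily the criterion used in [82] or [23]; the function name and the toy matrix are assumptions introduced here.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the submatrix selected by rows x cols.

    A value of 0 means every selected row follows the same profile over
    the selected columns, up to an additive row offset.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical toy data: genes 0-2 share a shifted profile on conditions 0-2 only.
expression = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [0.0, 1.0, 2.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]

coherent = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2])
all_columns = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2, 3])
```

The same three genes score as perfectly coherent on the first three conditions and as incoherent once the fourth condition is included, which is the situation that giving all columns equal weight would obscure.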

Either ordinary clustering or biclustering can be extended to data with higher-dimensional structure. For example, if gene expression is measured across both time and multiple conditions, the resulting data set can be imagined as a three-dimensional matrix, or equivalently, as a set of two-dimensional matrices that are aligned on one or both axes.

TRICLUSTER [143] is a graph-based clustering algorithm that extends the idea of biclustering to three-way data, such as that indexed by gene, condition, and time. TRICLUSTER searches for clusters that are homogeneous across two of the dimensions, such as genes that have the same temporal pattern over all of the experimental conditions. While this is one potentially useful objective, note that it might also be biologically interesting to discover clusters that have different patterns in different conditions, or clusters that exist only in a single condition.

Strauch et al. [124] proposed an interesting 'two-step' algorithm for 3-dimensional genes-time-condition data. In an example application, the levels of 23,000 genes were measured under 9 abiotic stress conditions, each at 6 to 9 time points. In the first step, k-means clustering is used to cluster the data for only one of the conditions, the seed condition. In the second step, the profiles of the corresponding genes in each of the other conditions are compared to the cluster assignments learned in the seed condition. The resulting modules are categorized as single-response modules, which cluster in the seed condition but not in the others; coherent-response modules, which cluster together and have the same temporal pattern in all conditions; or independent-response modules, which cluster in other conditions but have distinct dynamic profiles in each condition. This entire process is then repeated using each of the conditions as the seed to learn initial clusters. The Strauch et al. algorithm is thus able to identify clusters across subsets of conditions whether or not the exact profiles are condition-dependent.
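The two steps can be sketched as follows for a single seed condition. This is a simplified illustration under stated assumptions, not the implementation of [124]: the cohesion test (every gene correlating with the module mean), the threshold of 0.8, the caller-supplied starting centroids, the condition names, and the toy data are all hypothetical choices made here.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def kmeans(profiles, centroids, n_iter=20):
    """Step 1: plain Lloyd iterations from caller-supplied centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in profiles
        ]
        for c in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [mean(col) for col in zip(*members)]
    return labels


def mean_profile(profiles):
    return [mean(col) for col in zip(*profiles)]


def is_cohesive(profiles, threshold):
    """Assumed criterion: every gene correlates with the module mean."""
    centre = mean_profile(profiles)
    return all(pearson(p, centre) >= threshold for p in profiles)


def classify_module(genes, data, seed, threshold=0.8):
    """Step 2: compare a seed module with the same genes elsewhere."""
    seed_mean = mean_profile([data[seed][g] for g in genes])
    cohesive, same_shape = [], []
    for cond in data:
        if cond == seed:
            continue
        profiles = [data[cond][g] for g in genes]
        ok = is_cohesive(profiles, threshold)
        cohesive.append(ok)
        same_shape.append(
            ok and pearson(mean_profile(profiles), seed_mean) >= threshold)
    if not any(cohesive):
        return "single-response"
    if all(same_shape):
        return "coherent-response"
    return "independent-response"


# Hypothetical toy data: data[condition][gene] is a 4-point time course.
data = {
    "cold": [[0, 1, 2, 3], [0, 1.1, 2.1, 3.2], [3, 2, 1, 0],
             [3.1, 2.2, 1, 0.1], [0, 3, 3, 0], [0.2, 3.1, 2.9, 0]],
    "heat": [[1, 2, 3, 4], [0, 1, 2, 3.1], [0, 2, 2, 0],
             [0.1, 2.1, 1.9, 0], [0, 1, 2, 3], [3, 2, 1, 0]],
    "salt": [[2, 3, 4, 5.1], [1, 2, 3, 4], [0, 3, 0, 3],
             [0.1, 2.9, 0, 3.1], [1, 0, 1, 0], [0, 1, 0, 1]],
}

seed = "cold"
labels = kmeans(data[seed], centroids=[data[seed][0], data[seed][2], data[seed][4]])
modules = {}
for gene, lab in enumerate(labels):
    modules.setdefault(lab, []).append(gene)
categories = {lab: classify_module(genes, data, seed)
              for lab, genes in modules.items()}
```

Repeating the last block with each condition in turn as `seed` corresponds to the outer loop described above. Note that this sketch lumps modules that are cohesive in only some of the other conditions into the independent-response category, which is a simplification.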

EDISA [127] extends the ISA biclustering algorithm, which performs matrix factorization with some additional thresholding and constraints. EDISA extends to 3-way data by considering a fixed time-course vector for each (object, condition) pair. EDISA iteratively samples observations from the data and assigns them to modules, which as in [124], are categorized as being single-response (clustering in one condition only), coherent-response (similar profiles over all conditions), or independent-response (a common set of genes with potentially different profiles in different conditions).

2.2.3 Functional and time-course clustering

Clustering time-indexed and other functional data has motivated development of specialized algorithms, and many strategies are reviewed in [58]. The simplest approach is to use standard clustering algorithms with common distance metrics such as correlation. This mostly ignores the time-dependent character of the data. A common improvement is to transform the raw observations to a more meaningful basis, for example by fitting splines, and then using the parameters of the spline fits as the inputs to standard algorithms, as in [81].
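The remark that a correlation-based distance mostly ignores the time-dependent character of the data can be checked directly: applying the same arbitrary reordering of time points to two profiles leaves their correlation distance unchanged. The sketch below uses made-up profiles and a hypothetical helper name.

```python
from statistics import mean


def correlation_distance(x, y):
    """1 - Pearson correlation; 0 for perfectly correlated profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / (sxx * syy) ** 0.5


# Two hypothetical expression profiles with an early peak.
early_peak = [0.0, 4.0, 1.0, 0.5, 0.2]
similar = [0.1, 3.8, 1.2, 0.4, 0.3]

d_original = correlation_distance(early_peak, similar)

# Apply the same arbitrary reordering of time points to both profiles.
order = [3, 0, 4, 2, 1]
d_shuffled = correlation_distance([early_peak[i] for i in order],
                                  [similar[i] for i in order])
```

The two distances agree up to floating-point rounding, so any information carried by the ordering of the time points plays no role in the comparison. Basis transformations such as spline fits are one way to bring that ordering back in.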

2.2.4 Time-course clustering for biological data

Several algorithms have been developed specifically for clustering biological time-course data, which tends to be short and noisy. STEM, the short time-series expression miner developed by Ernst et al. [35], is a notable example of gene expression time-course clustering. STEM begins with the idea of enumerating all possible patterns using a fixed, quantized step size between successive time points. To reduce the number of clusters, STEM proposes a greedy algorithm to maximize the diversity of the chosen set of potential cluster profiles for a specified number of clusters. The choice of these patterns is independent of the data, and is determined solely by the quantization scheme and number of clusters specified. Genes are then assigned to profiles based on the correlation coefficient between the measured profile and the cluster pattern. STEM also supports comparing clusterings between two conditions, using the hypergeometric test to assess overlap between the set of genes assigned to each cluster in each condition.
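The enumeration, selection, and assignment steps described above can be sketched as follows. This is an illustration in the spirit of STEM rather than its implementation: the farthest-first rule used for the greedy step, the choice of starting profile, and all function names are assumptions made here.

```python
from itertools import product
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def enumerate_profiles(n_points, max_step):
    """All profiles starting at 0 whose successive changes are integers
    in [-max_step, max_step]."""
    steps = range(-max_step, max_step + 1)
    profiles = []
    for changes in product(steps, repeat=n_points - 1):
        profile = [0]
        for change in changes:
            profile.append(profile[-1] + change)
        profiles.append(tuple(profile))
    return profiles


def distance(p, q):
    return 1.0 - pearson(p, q)


def select_diverse(profiles, n_select):
    """Assumed greedy rule: repeatedly add the profile whose minimum
    distance to the already chosen profiles is largest."""
    chosen = [profiles[-1]]  # arbitrary start: the steadily increasing profile
    while len(chosen) < n_select:
        best = max(
            (p for p in profiles if p not in chosen),
            key=lambda p: min(distance(p, c) for c in chosen),
        )
        chosen.append(best)
    return chosen


def assign(gene_profile, model_profiles):
    """Assign a measured profile to the most correlated model profile."""
    return max(model_profiles, key=lambda m: pearson(gene_profile, m))


candidates = enumerate_profiles(n_points=4, max_step=1)
model_profiles = select_diverse(candidates, n_select=5)
assigned = assign([0.1, 0.9, 2.2, 2.8], model_profiles)
```

Nothing in `enumerate_profiles` or `select_diverse` looks at measured expression values, which reflects the point that the model profiles depend only on the quantization scheme and the requested number of clusters.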

2.2.5 Model-based clustering of multi-factor data

Many standard clustering methods, such as k-means, can also be viewed as fitting probabilistic generative models to data [38]. For multi-factor data such as gene expression measured over conditions and time, hierarchical mixture models can be used to model the effects of the various factors. For example, Jörnsten and Keles [62] proposed using 2-level Gaussian mixture models for clustering, fitting the models using expectation-maximization. The parameters of such models can be interpreted in different ways to explicitly encode alternative scientific questions, such as modeling differential expression between conditions at each time point separately, modeling the trajectories of differential expression between time points, or comparing expression levels at individual time points.
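As a point of reference for the model-based view, the following is a minimal sketch of expectation-maximization for an ordinary single-level Gaussian mixture on one-dimensional data. It is not the 2-level model of [62]; the function names, starting values, variance floor, and data are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=50, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture with one component per initial mean."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: soft assignment (responsibility) of each point to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # guard against collapse
    return weights, means, variances


# Hypothetical expression levels forming two well-separated groups.
levels = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
weights, means, variances = em_gmm_1d(levels, init_means=[0.0, 6.0])
```

Replacing the soft responsibilities with hard 0/1 assignments and holding a common variance fixed reduces the mean update to the k-means centroid update, which is one way to read the connection between k-means and generative models noted above.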

2.3 Visualizing Clustering Results
