3.1 Hierarchical Clustering Analysis
3.1.2 Cutting the Dendrogram
Since clusters in hierarchical clustering are not permitted to overlap, the allocation of variables into clusters is often ambiguous. The analyst therefore has to decide whether several records form one or more clusters, or whether they are all part of a larger cluster (i.e. a nested cluster). This ambiguity is more evident in large and complex data sets, where clusters may exist at different heights of the dendrogram. For data sets that consist of relatively distinct and homogeneous subsets, choosing a single similarity threshold, which cuts the tree at a uniform height, could be sufficient for determining representative clusters. For larger dendrograms, which often consist of heterogeneous and less distinct subsets, a more flexible approach that involves multiple-level cuts would produce more representative results. However, it is not clear how the analyst could explore the data and the different clustering scenarios to identify groups of similar records in large and complex data sets.
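The single-threshold case described above can be sketched in a few lines. The following is a minimal illustration only, assuming one-dimensional records, absolute difference as the dissimilarity and naive single-linkage agglomeration; the function names `single_linkage` and `cut_at_height` and the sample values are hypothetical and are not taken from any tool discussed in this chapter.

```python
from itertools import combinations

def single_linkage(points):
    """Build a dendrogram over 1-D points by naive single-linkage agglomeration.

    Each node is a dict holding its leaf indices, its merge height and its children.
    """
    nodes = [{"leaves": [i], "height": 0.0, "children": []} for i in range(len(points))]
    while len(nodes) > 1:
        # Find the pair of current clusters with the smallest single-linkage distance.
        best = None
        for a, b in combinations(range(len(nodes)), 2):
            d = min(abs(points[i] - points[j])
                    for i in nodes[a]["leaves"] for j in nodes[b]["leaves"])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merged = {"leaves": sorted(nodes[a]["leaves"] + nodes[b]["leaves"]),
                  "height": d, "children": [nodes[a], nodes[b]]}
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]

def cut_at_height(node, threshold):
    """Return the clusters obtained by cutting the tree at one uniform height."""
    if not node["children"] or node["height"] <= threshold:
        return [node["leaves"]]
    clusters = []
    for child in node["children"]:
        clusters.extend(cut_at_height(child, threshold))
    return clusters

# Hypothetical sample: two compact groups and one isolated record.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
root = single_linkage(points)
low_cut = sorted(cut_at_height(root, threshold=1.0))
high_cut = sorted(cut_at_height(root, threshold=4.0))
print(low_cut)
print(high_cut)
```

With a low threshold the isolated record stays on its own, whereas a higher threshold absorbs it into the neighbouring group. The same threshold is applied to every branch, which is exactly the limitation that multiple-level cuts are meant to address.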
Dynamic Tree Cut (DTC) [112] can cut the branches of the dendrogram at different levels automatically, based on their shape, their length or other criteria. However, heuristic criteria are tailored to describe pre-determined shapes and patterns in the dendrogram and they cannot identify new patterns in the data. Thus, heuristic approaches are rarely optimal because they cannot capture all the pattern variations which can be observed in real data sets. Moreover, semi-supervised approaches that integrate prior knowledge into the algorithm to detect clusters, such as the ones presented by Dotan-Cohen et al. [60], Navlakha et al. [137] and in HCsnip [140], require that the data records are first labelled. This configuration cannot be generalised to unstructured data sets for which no background knowledge is assumed. Hence, the clustering results rely on additional information about the system, which is missing from most data sets, and on the assumption that the added labels determine how similar the data records are.
Figure 3.3: Different tree layouts as presented by McGuffin et al. [124]. (a) node-link, (b) a variation on (a) to support long labels, (c) icicle, (d) radial, (e) concentric circles, (f) nested circles, (g) treemap and (h) indented outline.
In the real world, there is no “one-size-fits-all” solution, and the special characteristics of clusters are commonly ignored [101]. Within the same data set, some clusters may be tight (low pairwise dissimilarity), while others may be loose (high pairwise dissimilarity). For instance, biologically associated genes may follow a similar expression pattern either constantly or only for a time period, as reported by Mahanta et al. [120] and Craig et al. [50]. Therefore, a “human in the loop” is needed to visually explore the dendrogram and the data records at different levels of detail and to select potential clusters manually [165].
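The distinction between tight and loose clusters can be made concrete with a small numerical sketch. The example below assumes Euclidean distance between profiles as the dissimilarity measure, which is only one possible choice, and uses made-up three-point profiles; the name `mean_pairwise_dissimilarity` is hypothetical.

```python
from itertools import combinations
from math import dist

def mean_pairwise_dissimilarity(profiles):
    """Average Euclidean distance over all pairs of profiles in one cluster."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return 0.0  # A singleton cluster has no pairs to compare.
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# Made-up profiles: the first group moves together, the second does not.
tight_cluster = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [0.9, 1.9, 2.9]]
loose_cluster = [[1.0, 2.0, 3.0], [3.0, 1.0, 0.0], [0.0, 4.0, 2.0]]

tight_score = mean_pairwise_dissimilarity(tight_cluster)
loose_score = mean_pairwise_dissimilarity(loose_cluster)
print(tight_score, loose_score)
```

A threshold low enough to isolate the tight group would break the loose group apart, which is one way to see why a single cut height may not suit both.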
Visual support tools have long been used in the analysis of biological data. An evaluation of microarray visualisation tools has been presented by Saraiya et al. [156] and a more recent survey has been presented by Pavlopoulos et al. [142]. Specialised tools for microarray data analysis often incorporate visualisation features for different analysis tasks, including hierarchical clustering analysis. Chipster [98] and Mayday [22] are two open source microarray data analysis platforms that support hierarchical clustering. Due to the importance of time-course gene expression data, there are also a number of tools that target the clustering of such data sets. For instance, STEM [66] is a software tool for automatic profiling and clustering of short time-series data. The tool creates profiles for the temporal patterns that can occur and then matches those profiles against what is found in the data set. However, it is difficult to capture all the variation that could possibly exist in a multivariate time-series data set. A flexible and user-driven approach, in terms of statistical analysis capabilities, is provided by PESTS [165]. All of those tools support some visualisation features for hierarchical clustering analysis, but they provide little or no support for interactive exploration of the data.
There are several tools which perform hierarchical clustering analysis, but only a few of them provide visual feedback or support the interactive exploration of the clusters. However, due to the increasing complexity and size of the data, visualisation becomes an important aspect of performing clustering analysis. The relatively new paradigm of visual analysis is founded on the idea that expert users are capable of steering the analysis to produce more successful results [139]. The actions of the users are often driven by tacit knowledge which cannot become part of an algorithm. Therefore, involving a human to make decisions and to guide the analysis is essential.
At the highest level, a view of the clusters as part of the whole dendrogram should be supported, and at the lowest level, the original multidimensional data should be visualised and linked to their clustering assignment to enable the visual comparison of data records. The idea of drilling down to see more detail in the data is common to many visual analysis tools. Similar steerable approaches have been investigated in the past for exploring graph structures, as in Archambault et al. [13] and in Abello et al. [2]. To provide flexibility and control over the clustering process, modellers required a method that would enable them to combine hierarchical clustering results with their own tacit knowledge to take decisions about the allocation of variables into clusters. During the analysis, modellers need to be able to cut branches of the dendrogram at different heights, as shown in Figure 3.4.
Figure 3.4: Multi-level cuts in a heterogeneous dendrogram. The red icons indicate four locally applied similarity thresholds which cut the tree into four branches that form the same number of non-overlapping clusters. This clustering scenario could not be achieved using any single-height similarity threshold.
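The idea of locally applied thresholds can be illustrated with a toy example. The sketch below is not the tree of Figure 3.4 and not the method of any cited tool: it assumes a small hand-built dendrogram with invented node names and heights, and a hypothetical `cut` function in which a threshold attached to a branch overrides the global one for that branch and everything beneath it.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    height: float = 0.0            # Merge height; leaves sit at height 0.
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the leaf names under this node, left to right."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]

def cut(node, threshold, local=None):
    """Cut the tree at `threshold`, except where `local` names a branch override."""
    local = local or {}
    # A locally applied threshold replaces the inherited one from this branch down.
    t = local.get(node.name, threshold)
    if not node.children or node.height <= t:
        return [node.leaves()]
    return [cluster for child in node.children for cluster in cut(child, t, local)]

# Invented tree: the left branch merges at 5, the right branch at 6.
left = Node("L", 5.0, [Node("L1", 2.0, [Node("a"), Node("b")]),
                       Node("L2", 1.0, [Node("c"), Node("d")])])
right = Node("R", 6.0, [Node("e"), Node("R1", 3.0, [Node("f"), Node("g")])])
root = Node("root", 10.0, [left, right])

target = [["a", "b"], ["c", "d"], ["e", "f", "g"]]

# Global threshold 4 splits the left branch; a local threshold keeps the right one whole.
multi_level = cut(root, threshold=4.0, local={"R": 6.0})

# Every distinct merge height in the tree, tried as a uniform threshold.
uniform = [cut(root, h) for h in (0.0, 1.0, 2.0, 3.0, 5.0, 6.0, 10.0)]
print(multi_level)
print(any(result == target for result in uniform))
```

In this toy tree, keeping the right branch whole requires a threshold of at least 6, while splitting the left branch requires a threshold below 5, so no uniform height yields the target grouping. A uniform cut only changes at the merge heights, so checking those values covers all possible single thresholds.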
selected, mainly because the underlying data and the rationale of the clustering algorithm remain hidden to the final user, who only relies on the dendrogram for selecting clusters. However, visualising the original multivariate data in addition to the dendrogram can help to reveal interesting patterns and relationships between data records which are not always obvious in the dendrogram. Thus, enabling the representation and comparison of the original data records is also important for improving and confirming the selection of clusters. Visualising the original data records can nevertheless be complicated, especially when they are large and contain multiple dimensions or time points. In Section 3.1.3, we discuss approaches for visualising data sets of variables that contain multiple dimensions or time points.