Once a non-random structure has been distinguished in the data, a technique that finds the ‘best’ structure is desirable. This is a vague expression, since ‘best’ may refer to novelty, stability, size or suitability, and again depends on the nature of the analysis and the experimental purpose. Some non-trivial considerations include: whether the method is exploratory or predictive (with results used as the foundation for further investigations), whether all samples and genes should be grouped, and whether a novel structure is to be assessed. Objective assessment measures of clustering quality fall into two categories, external and internal, summarized in Table 4.1.
Internal Measures:
These measures use only information from the clustering result and the dataset itself to assess cluster quality. Properties considered include:
• Compactness - intra-cluster homogeneity, e.g. assessment of average or maximum pairwise intra-cluster distances, or of average or maximum centroid-based similarities.
• Separation - inter-cluster distance, e.g. an average weighted distance, where the distance between two clusters can be computed as the distance between their centroids or as the minimum distance between data items of each; the minimum distance between any two clusters can also be used.
• Connectedness - the degree to which data items are grouped with their nearest neighbours in the data space, also known as connectivity (Handl et al., 2005).
• Combinations - as compactness usually improves with the number of clusters while separation usually deteriorates, linear or non-linear combinations of the two can be used, e.g. the SD-validity index (Halkidi et al., 2000), Dunn Indices (Dunn, 1974), Davies-Bouldin Index (Davies and Bouldin, 1979), Silhouette Index (Rousseeuw, 1987), C-Index (Hubert and Schultz, 1976).
An equivalent measure for fuzzy clusterings is the Xie-Beni index (Xie and Beni, 1991).
• Fuzziness - applicable only to fuzzy partitions, these measures assess the sharing of membership between clusters; they include the partition coefficient and partition entropy (Bezdek, 1973, 1974).
• Stability/Predictive Power - based on repeatedly resampling the original data, this measures the consistency of the results, which in turn provides an estimate of the significance of the clusters obtained from the original dataset, e.g. Ben-Hur et al. (2002) and Levine and Domany (2001). The jackknife approach of Yeung et al. (2001) forms clusters based on p − 1 samples (with p = number of samples) and uses the remaining sample to assess the predictive power of the algorithm, i.e. the Figure Of Merit (FOM). Stability can also be assessed by perturbing the data and comparing the clusters found with the original partition using external indices (Bittner et al., 2000; Kerr and Churchill, 2001; Li and Wong, 2001).
• Preservation of distance information - the degree to which the distance information in the original data is preserved in a clustering, typically used for hierarchical clustering. Here, the cophenetic distance matrix is an N × N matrix in which each entry (i, j) records the level at which data items i and j are first grouped in the same cluster. Preservation is usually assessed using the cophenetic correlation coefficient, i.e. the correlation between the entries of the cophenetic distance matrix and those of the original distance matrix (Sokal and Rohlf, 1962).
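Several of the internal properties listed above reduce to short calculations. The following is a minimal sketch rather than the definitive formulation (the formulae used in this work are those of Appendix A). It assumes Euclidean distance, clusters given as lists of coordinate tuples, and fuzzy memberships given as one row per item; the function names and the toy data are illustrative assumptions only. The Dunn index shown is the classical variant (smallest inter-cluster item distance divided by the largest cluster diameter).

```python
from itertools import combinations
from math import dist, log


def compactness(clusters):
    """Average pairwise intra-cluster distance (lower = more compact)."""
    d = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(d) / len(d)


def separation(clusters):
    """Minimum distance between items of different clusters (higher = better)."""
    return min(dist(a, b)
               for c1, c2 in combinations(clusters, 2)
               for a in c1 for b in c2)


def dunn_index(clusters):
    """Smallest inter-cluster distance over the largest cluster diameter."""
    diameter = max(dist(a, b) for c in clusters for a, b in combinations(c, 2))
    return separation(clusters) / diameter


def partition_coefficient(u):
    """Mean of squared memberships; u[i][k] = membership of item i in cluster k."""
    return sum(m * m for row in u for m in row) / len(u)


def partition_entropy(u):
    """Mean membership entropy per item (0 for a crisp partition)."""
    return -sum(m * log(m) for row in u for m in row if m > 0) / len(u)


# Hypothetical toy data: two tight, well-separated clusters in the plane.
clusters = [[(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
            [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]]
# Hypothetical memberships: one crisp item, one item shared equally.
memberships = [[1.0, 0.0], [0.5, 0.5]]

print(compactness(clusters), separation(clusters), dunn_index(clusters))
print(partition_coefficient(memberships), partition_entropy(memberships))
```

A crisp partition gives a partition coefficient of 1 and a partition entropy of 0; sharing membership between clusters lowers the former and raises the latter, which is why the two are maximised and minimised respectively.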
External Measures (Supervised):
These refer to assessments that reference external information (e.g. class labels or clusterings from alternative algorithms). The comparison of clusterings with external class labels is of critical importance, as it allows the clusters found to be interpreted against what is already known. For instance, genes that show similar patterns across clusters do not necessarily share the same pathway or a similar function, although they may do so. Examples of the properties key to external measures include:
• Agreement with metadata - Where biological function information is included in the gene list for each cluster, a more complete picture is provided of the dataset and of the success of the technique. A number of functional annotation databases are available. The Gene Ontology (GO) database (Ashburner et al., 2000), for example, provides a structured vocabulary that describes the role of genes and proteins in all organisms. The database is organised into three hierarchical ontologies: biological process, molecular function and cellular component. Several tools have been developed for batch retrieval of GO annotations for a list of genes, e.g. DAVID (Dennis et al., 2003), Babelomics (Al-Shahrour et al., 2005) and Machaon CVE (Bolshakova et al., 2006). Statistically relevant GO terms can be used to investigate the properties shared by a set of genes. These tools typically use comprehensive measures, such as the F-measure (introduced by Rijsbergen (1975)), or hypergeometric tests (Falcon and Gentleman, 2007), to test the significance of cluster purity (the fraction of the cluster taken up by the predominant class label) and completeness (the fraction of items in a class grouped in the current cluster). This assessment can be adapted for partially annotated datasets by including only the annotated genes in the calculation of the measure. This facilitates the transition from data collection to biological meaning by providing a template of relevant biological patterns in gene lists.
• Agreement between clusterings (cluster runs) - In a simulation dataset, the true partition is known, and the performance of a technique can be assessed in terms of the similarity of its clustering to the true partition. Several indices that measure this exist in the literature (Fridlyand and Dudoit, 2001). The most popular is the Rand Index (RI) (Rand, 1971), of which a number of variations exist, including the adjusted RI (Hubert and Arabie, 1985) and the weighted RI (Thalamuthu et al., 2006). In general, these determine the similarity between two partitions as a function of positive and negative agreement in pairwise cluster assignments. The Jaccard coefficient (Jaccard, 1908) considers similarity as a function of only the positive agreements in pairwise cluster assignments. Most of these can be used only where a single class label is unequivocally assigned to each data item, and are thus inappropriate for fuzzy or overlapping clusterings, although a fuzzy extension of the Rand Index has been proposed (Campello, 2007). Note: these measures can also be used where the gold standard is not known, to assess the relative similarity of two clusterings.
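The pairwise agreement underlying the Rand Index and the Jaccard coefficient, and the purity and completeness of a single cluster, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used by any of the tools cited: partitions are given as flat label lists of equal length, the function names and the toy labels are hypothetical, and the adjusted and weighted variants are not shown.

```python
from collections import Counter
from itertools import combinations


def pair_counts(labels_a, labels_b):
    """Count item pairs as (same/same, same/diff, diff/same, diff/diff)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1          # positive agreement
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1          # negative agreement
    return ss, sd, ds, dd


def rand_index(a, b):
    """Fraction of pairs on which the two partitions agree."""
    ss, sd, ds, dd = pair_counts(a, b)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard(a, b):
    """As the Rand Index, but ignoring negative agreements.

    Undefined (division by zero) if neither partition groups any pair.
    """
    ss, sd, ds, _ = pair_counts(a, b)
    return ss / (ss + sd + ds)


def purity(cluster_classes):
    """Fraction of the cluster taken up by its predominant class label."""
    return Counter(cluster_classes).most_common(1)[0][1] / len(cluster_classes)


def completeness(cluster_classes, all_classes):
    """Fraction of the predominant class that falls in this cluster."""
    label, n = Counter(cluster_classes).most_common(1)[0]
    return n / all_classes.count(label)


# Hypothetical example: item 2 is assigned to the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
second_cluster = [t for t, f in zip(truth, found) if f == 1]

print(rand_index(truth, found), jaccard(truth, found))
print(purity(second_cluster), completeness(second_cluster, truth))
```

In this example 10 of the 15 item pairs agree, giving a Rand Index of 2/3, whereas only 4 of the 9 pairs grouped by at least one partition are grouped by both, giving a Jaccard coefficient of 4/9.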
Table 4.1: Summary of Evaluation Measures, categorised into Internal and External. Assess’ refers to the property that the measure is assessing; Measure refers to the popular name for the measure in the literature; G, L, C, C>1 and F indicate whether the measure is suitable for assessing Global structures, Local structures, Crisp membership, Crisp membership in more than one cluster, and Fuzzy membership, respectively; Max/Min indicates whether the measure should be maximised or minimised; Bounds refers to the [minimum, maximum] possible value of the result.
Assessment Measures

| Category | Assess’ | Measure | G L C C>1 F | Max/Min | Bounds |
|---|---|---|---|---|---|
| Internal | Connectedness | Conn | X X | Min | [0, ∞) |
| | Compactness | Intra-cluster Distance | X X X X | Min | [0, ∞) |
| | Separation | Inter-cluster Distance | X X X | Max | [0, ∞) |
| | Combination | SD-validity Index | X X | Min | [0, ∞) |
| | | Dunn Indices | X X | Max | [0, ∞) |
| | | Davies-Bouldin Index | X X | Min | [0, ∞) |
| | | Silhouette Index | X X | Max | [−1, 1] |
| | | C-Index | X X | Min | [0, 1] |
| | | Xie-Beni Index | X X | Min | [0, ∞) |
| | Fuzziness | Partition Coefficient | X X X | Max | [0, ∞) |
| | | Partition Entropy | X X X | Min | [0, ∞) |
| | Stability | Cluster Overlaps (Average Proportion Non-overlap, Average Distance, Average Distance between Means) | X X X | Min | [0, 1] |
| | | Figure Of Merit | X X X X | Min | [0, ∞) |
| | Distance Preservation | Cophenetic Correlation | X X | Max | [−1, 1] |
| External | Purity | Biological Homogeneity Index | X X X X | Max | [0, 1] |
| | Completeness | Biological Stability Index | X X X | Max | [0, 1] |
| | Reconstruction of structure | Adjusted Rand Index | X X X X | Max | [0, 1] |
| | | Fuzzy Rand Index | X X X X | Max | [0, 1] |
| | | Jaccard Coefficient | X X X X | Max | [0, 1] |
| | | Hubert Γ Statistic | X X X X | Max | [−1, 1] |
These metrics are usually highly dependent on the number of clusters, which is an input parameter (discussed in Chapter 3). The ‘natural’ number of clusters in the data depends on which clustering criterion is used in the algorithm and is not fixed between algorithms. For example, according to the K-Means criterion the optimal number of clusters for a particular dataset may be 5, while for CLICK it may be 10 for the same dataset. The ‘optimal’ number of clusters thus depends on both the dataset and the algorithm, so an absolute choice is difficult. Assessment measures are also biased: the compactness index, for example, is biased towards a large number of clusters, while the separation index is biased towards a small number of clusters. Formulae for each of the assessments can be found in Appendix A.
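The bias of compactness towards a large number of clusters can be illustrated with a small sketch. It assumes one-dimensional data, hand-chosen nested partitions (not the output of K-Means, CLICK or any other algorithm), and the within-cluster sum of squares as a stand-in compactness measure; all names and values are hypothetical.

```python
def within_ss(clusters):
    """Sum of squared deviations from each cluster mean (1-D data)."""
    total = 0.0
    for c in clusters:
        mean = sum(c) / len(c)
        total += sum((x - mean) ** 2 for x in c)
    return total


# Two obvious groups, partitioned into k = 1, 2, 3 and 6 clusters.
partitions = {
    1: [[0, 1, 2, 10, 11, 12]],
    2: [[0, 1, 2], [10, 11, 12]],
    3: [[0, 1, 2], [10, 11], [12]],
    6: [[0], [1], [2], [10], [11], [12]],
}
scores = {k: within_ss(p) for k, p in partitions.items()}
print(scores)  # {1: 154.0, 2: 4.0, 3: 2.5, 6: 0.0}
```

The score keeps improving beyond the two evident groups and reaches its optimum of zero only when every item forms its own cluster, so compactness alone cannot identify the number of clusters.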