Evaluating Clustering Solutions - Quantitative and evolutionary global analysis of enzyme react

Performance of PFClust: The performance of PFClust was evaluated on the following configuration:

• Hardware: 2.2 GHz Intel(R) Core(TM) i5-3470S CPU @ 2.90 GHz, 8.00 GB RAM

• Operating system: Scientific Linux release 6.3 (Carbon)

• JVM: 1.6.0_45-b06

The running time was substantially lower for larger datasets (more than 1000 data points), falling from 35000 seconds to 10 seconds [144].

Limitations of clustering methods: Overall, the clustering algorithms mentioned above share three major limitations: first, defining the number of clusters k; second, choosing the input parameters; and last, validating the results. There is no straightforward ‘best’ way to evaluate clustering methods, as the results depend on the dataset provided. Different techniques often highlight different patterns in the data, so complementary methods may be helpful in analysing a single dataset. This also makes interpretation of the results harder. To evaluate the results, many authors combine different evaluation measures, discussed later in this chapter, to get a clearer interpretation [135, 136].

In the next section, we discuss the workflow for evaluating results from different clustering algorithms using the ‘clValid’ [145] and ‘fpc’ [146] packages in R. We also discuss results from PFClust.

4.5 Evaluating Clustering Solutions

There is more than one definition of a cluster, depending on the dataset used in a particular study [147]. In fact, most authors define different grouping criteria: for example, a cluster or group may be formed on the principle of minimum distance between two items, or of maximum separation between clusters. With so many well-known clustering criteria available, we need a measure that validates the output.

Validity is the degree of confidence that can be placed in the patterns recognised by clustering algorithms [135, 147]. Validation also bears on the limitations discussed in the previous section, since it can be used to define the number of clusters or to optimise the parameters [148]. Broadly, validation methods are grouped into external and internal validation. The fundamental difference between the two is that external validation relies on a reference classification, whereas internal validation requires no external labels. Both types are equally important [149]. In PFClust, we have used both validation criteria.

External Validation: Standard external validation measures take gold-standard class labels and compare them with the labels provided by the clustering algorithm, via a contingency table of the pairwise assignment of data items.

Probably the best-known index is the Rand Index (Rand, 1971) (Equation 4.2), which follows the simple criterion of comparing the gold-standard class labels with the labels provided. The Rand Index is defined as:

$$\mathrm{RAND} = \frac{a + d}{a + b + c + d} \qquad (4.2)$$

where $a$ is the number of pairs of instances that are assigned to the same cluster in clustering $C_1$ and to the same cluster in clustering $C_2$; $b$ is the number of pairs of instances that are in the same cluster in $C_1$, but not in the same cluster in $C_2$; $c$ is the number of pairs of instances that are in the same cluster in $C_2$, but not in the same cluster in $C_1$; and $d$ is the number of pairs of instances that are assigned to different clusters in both $C_1$ and $C_2$. The numerator therefore counts the pairs on which the two clusterings agree ($a$ and $d$), and the denominator counts all pairs.
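As an illustration of the pair counting behind the Rand Index, the following is a minimal Python sketch, not the implementation used in this work. It counts agreements as the pairs placed together in both clusterings (a) and the pairs placed apart in both (d), following the definitions of a, b, c and d given in the text. The function names and the toy label vectors `gold` and `found` are hypothetical.

```python
from itertools import combinations


def pair_counts(labels_1, labels_2):
    """Return (a, b, c, d) for two labelings of the same items.

    a: pair together in both clusterings
    b: pair together in the first only
    c: pair together in the second only
    d: pair apart in both clusterings
    """
    if len(labels_1) != len(labels_2):
        raise ValueError("both clusterings must label the same items")
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_1)), 2):
        same_1 = labels_1[i] == labels_1[j]
        same_2 = labels_2[i] == labels_2[j]
        if same_1 and same_2:
            a += 1
        elif same_1:
            b += 1
        elif same_2:
            c += 1
        else:
            d += 1
    return a, b, c, d


def rand_index(labels_1, labels_2):
    """Fraction of item pairs on which the two clusterings agree."""
    a, b, c, d = pair_counts(labels_1, labels_2)
    total = a + b + c + d
    if total == 0:
        raise ValueError("need at least two items")
    return (a + d) / total


# Hypothetical toy data: six items, gold-standard labels vs. cluster labels.
gold = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 2, 2]
print(pair_counts(gold, found))
print(rand_index(gold, found))
```

The index depends only on which items are grouped together, so the actual label values are irrelevant: relabelling the clusters of either partition leaves the result unchanged.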

Internal Validation: In contrast to external validation, internal validation evaluates the intrinsic quality of the clustering. The qualities we are interested in here are the compactness, connectedness and separation of the clusters.

Here, compactness assesses the homogeneity of the clusters, typically through the intra-cluster variance; connectedness assesses the degree to which a partitioning respects local densities and groups data items together with their nearest neighbours in the data; and separation quantifies the degree of separation between the individual clusters. All these aspects are important in internal validation, both separately and in combination. The most popular combination is that of compactness and separation. Several measures combine compactness and separation, such as the Silhouette width [150] and the Dunn Index [151].

The Silhouette width is a useful measure when one is seeking compact and clearly separated clusters. Once the data are clustered, the distances within and between clusters are quantified with respect to each object $i$. Suppose object $i$ belongs to cluster $A$. The average dissimilarity of object $i$ to the other members of the same cluster is computed and assigned to $a_i$. Next, the average dissimilarity of $i$ to the objects of each cluster other than $A$ is computed. The minimum of these averages is recorded in $b_i$, and the cluster that attains it is known as the neighbour of object $i$. Note that the construction of $b_i$ depends on the other clusters, so it is an underlying assumption that there is more than one cluster. The silhouette value $s_i$ is obtained from Equation 4.3.

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (4.3)$$
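The computation of the silhouette value can be sketched in a few lines of Python. This is an illustrative sketch only, assuming a precomputed symmetric dissimilarity matrix given as a list of lists; the function name `silhouette_widths` and the toy data are hypothetical. Objects in singleton clusters are given a value of 0, which is a common convention and an assumption here rather than something stated in the text.

```python
def silhouette_widths(dist, labels):
    """Return the silhouette value s_i of every object.

    dist:   square symmetric dissimilarity matrix (list of lists)
    labels: cluster label of each object, in the same order as dist
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs more than one cluster")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumed convention: a singleton cluster gets s_i = 0.
            widths.append(0.0)
            continue
        # a_i: average dissimilarity to the other members of i's cluster.
        a_i = sum(dist[i][j] for j in own) / len(own)
        # b_i: smallest average dissimilarity to any other cluster.
        b_i = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a_i, b_i)
        widths.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical toy data: four points on a line, forming two tight groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
widths = silhouette_widths(dist, [0, 0, 1, 1])
print(widths)
print(sum(widths) / len(widths))  # average silhouette width
```

Values close to 1 indicate that an object lies much closer to its own cluster than to its neighbour cluster, and negative values suggest that the object may have been assigned to the wrong cluster.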

Another measure that assesses the compactness and separation of the cluster output is the Dunn Index, defined in Equation 4.4.

$$D_C = \min_{C_k \in C} \left( \min_{C_l \in C,\; l \neq k} \frac{\mathrm{dist}(C_k, C_l)}{\max_{C_m \in C} \mathrm{diam}(C_m)} \right) \qquad (4.4)$$

where $\mathrm{diam}(C_m)$ is the maximum intra-cluster distance within cluster $C_m$, and $\mathrm{dist}(C_k, C_l)$ is the minimal distance between pairs of data items $i$ and $j$ with $i \in C_k$ and $j \in C_l$. The Dunn Index thus measures the ratio between the smallest inter-cluster distance and the largest intra-cluster distance in a partitioning. A higher Dunn Index is better for a given assignment of clusters. One of its limitations is that it is computationally costly.
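The ratio described above can be sketched in Python as follows. This is a minimal illustration under the same assumption of a precomputed symmetric dissimilarity matrix; the function name `dunn_index` and the toy data are hypothetical. The nested loops over all pairs of items also make the computational cost visible, since the work grows quadratically with the number of items.

```python
from itertools import combinations


def dunn_index(dist, labels):
    """Smallest inter-cluster distance divided by largest cluster diameter.

    dist:   square symmetric dissimilarity matrix (list of lists)
    labels: cluster label of each object, in the same order as dist
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Dunn Index needs more than one cluster")
    groups = list(clusters.values())

    # diam(C_m): largest distance between two items of the same cluster.
    max_diam = max(dist[i][j] for g in groups for i in g for j in g)
    if max_diam == 0:
        raise ValueError("all cluster diameters are zero")

    # dist(C_k, C_l): smallest distance between items of distinct clusters.
    min_sep = min(
        dist[i][j]
        for g1, g2 in combinations(groups, 2)
        for i in g1
        for j in g2
    )
    return min_sep / max_diam


# Hypothetical toy data: the same four points, clustered in two ways.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = dunn_index(dist, [0, 0, 1, 1])  # two tight, well-separated groups
poor = dunn_index(dist, [0, 1, 1, 1])  # one cluster spans the gap
print(good, poor)
```

On this toy example the partition into two tight groups receives a much higher value than the partition whose second cluster spans the gap, in line with the statement that a higher Dunn Index is better.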

Statistical analysis: In order to find associations between the clusters and biologically relevant information, we use residual test statistics, which perform cell-by-cell comparisons within the clusters. For this test, the first step is to summarise the data in a contingency table, to get better insight into the clustering results. In this table, each column corresponds to a cluster label (an arbitrary number was assigned to each cluster in order to keep track of it) and each row holds the counts of a mechanistic annotation defined in MACiE. The χ2 test [148, 152] of the mechanism annotations within each

In document Quantitative and evolutionary global analysis of enzyme reaction mechanisms (Page 73-76)