
CHAPTER 2. EFFICIENT SEMI-SUPERVISED k-MEANS CLUSTERING

2.3 Performance Evaluations

2.3.1 Experiments

2.3.1.1 Classic3 dataset

Our first experiment was in the context of identifying groups of documents in the well-known Classic3 dataset. This dataset, available online at ftp://ftp.cs.cornell.edu/pub/smart, is a collection of 3,893 abstracts: 1,460 articles on information retrieval from the CISI database, 1,400 documents on aerodynamics from the CRANFIELD database and 1,033 abstracts on medicine and related topics from the MEDLINE database. Collectively, these abstracts contain over 24,000 unique words. Classic3 is often used to evaluate the performance of text clustering and classification algorithms because it contains a known number of fairly well-separated groups (the sources of the abstracts). We followed Maitra and Ramler (2010) in processing the data to remove words appearing in fewer than 0.02% or more than 15% of the documents. After this preprocessing, 3,302 words remained, and each resulting document vector was transformed to have unit L2-norm. To obtain a semi-supervised framework, we simulated observed labels in three ways. In the first case, we assumed that around 10% of the CISI abstracts were labeled, and then applied ss-k-means++ with K• = 1. In the second case, we assumed 10% representation from each of CISI and CRANFIELD among the labeled observations, i.e. K• = 2. In the third case, we assumed 10% representation from each of the CISI, CRANFIELD and MEDLINE abstracts, i.e. K• = 3. Table 2.1 summarizes the results by means of confusion matrices, and also evaluates performance in terms of the adjusted Rand index.
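The preprocessing described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used for the experiments: the function name preprocess, the tokenized input format and the use of raw term counts as the entries of each document vector are all assumptions, since the term weighting is not stated here.

```python
import math
from collections import Counter


def preprocess(docs, min_df=0.0002, max_df=0.15):
    """Filter words by document frequency and scale rows to unit L2-norm.

    docs   : list of documents, each a list of word tokens (assumed format).
    min_df : words in a smaller fraction of documents are dropped (0.02%).
    max_df : words in a larger fraction of documents are dropped (15%).
    Returns (vocab, vectors), where vectors[i] is the count vector of
    document i over vocab, scaled to unit Euclidean length.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    vocab = sorted(w for w, c in df.items() if min_df <= c / n_docs <= max_df)
    index = {w: j for j, w in enumerate(vocab)}

    vectors = []
    for doc in docs:
        counts = [0.0] * len(vocab)
        for word in doc:
            j = index.get(word)
            if j is not None:
                counts[j] += 1.0  # raw term count (an assumption)
        norm = math.sqrt(sum(c * c for c in counts))
        # Documents with no retained words are left as zero vectors.
        vectors.append([c / norm for c in counts] if norm > 0 else counts)
    return vocab, vectors
```

Documents that lose all of their words in the filtering step are returned as zero vectors here; in practice they would need to be handled separately before clustering.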

Table 2.1: Confusion matrices for the Classic3 dataset.

             CISI   CRANFIELD   MEDLINE
Cluster 1    1415           5         8
Cluster 2      44        1027        51
Cluster 3       2           0      1339
K• = 1, K̂ = 3, R = 0.91

             CISI   CRANFIELD   MEDLINE
Cluster 1    1414           6         8
Cluster 2      44        1028        50
Cluster 3       2           0      1339
K• = 2, K̂ = 3, R = 0.90

             CISI   CRANFIELD   MEDLINE
Cluster 1    1414           6         8
Cluster 2      44        1027        51
Cluster 3       2           0      1339
K• = 3, K̂ = 3, R = 0.92

2.3.1.2 Simulation Experiments

Our large-scale simulations were on datasets obtained using the C package CARP of Melnykov and Maitra (2011), which simulates clustered datasets with pre-specified overlap characteristics as a surrogate for clustering complexity (Maitra and Melnykov, 2010). These overlap measures are summarized in the form of the average (ω̄), the maximum (ω̌) or the generalized (ω̈) overlap, with larger values corresponding to greater clustering difficulty. Because minimizing (2.1) in the context of clustering is most appropriate when the clusters are homogeneous and spherical, our simulation setting was restricted to this case. The combinations of parameters used in the simulations were of the form (K, K•, n, p, ω̌). For each combination we simulated 100 datasets, with K = 6, 11; n = 5×10^5, 10^6; p = 5, 10; K• = 4, 6; and observed proportion of the data ρ ∈ {0.1, 0.15, 0.2, 0.25}. Note that for ρ = 0, the clustering problem is the same as the unsupervised one, and then K• = 0. The maximum overlap values for these simulated datasets were ω̌ = 0.01, 0.05, 0.1, 0.2. For each combination, we report the adjusted Rand index R of Hubert and Arabie (1985), which measures the similarity between two partitions, in this case the true labels and the estimated labels. Values of the adjusted Rand index close to 1 indicate good clustering performance, while values far from 1 indicate poorer performance. We now discuss the performance of our suggested methods.
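For concreteness, the adjusted Rand index of Hubert and Arabie (1985) can be computed from the contingency table of the two partitions. The following is a minimal sketch; the function name adjusted_rand_index and the convention of returning 1 for degenerate inputs are our own choices for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, est_labels):
    """Adjusted Rand index between two partitions of the same observations."""
    if len(true_labels) != len(est_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # convention chosen here: no pairs to disagree on

    # Contingency table cells and its row and column margins.
    cells = Counter(zip(true_labels, est_labels))
    rows = Counter(true_labels)
    cols = Counter(est_labels)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # expected index under chance
    max_index = 0.5 * (sum_rows + sum_cols)      # largest attainable index
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one group
    return (sum_cells - expected) / (max_index - expected)
```

The index is invariant to relabeling of the groups, so the estimated cluster numbers do not need to be matched to the true class labels before it is computed.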

Computational Efficiency of Algorithms. Our first set of evaluations compared the time taken by Lloyd's algorithm as described in Section 2.2.1.1 and its Hartigan-Wong-type counterpart of Section 2.2.1.2. We therefore first computed the time taken by both semi-supervised clustering algorithms on each simulated dataset. In order to eliminate the issue of improper initialization, we started each algorithm at the true means given by the parameters used to simulate each dataset. For this set of experiments, we also assumed that K was known and, for simplicity, that K• = K. Both algorithms were coded in C for computational efficiency. Figure 2.1 summarizes the results by means of linked boxplots, which link each simulated dataset for individualized reference to performance. The figures clearly indicate that our Hartigan-Wong-type algorithm is faster than our Lloyd's algorithm, and that the improvement in computational speed is at least an order of magnitude. This is expected, given that our Lloyd's algorithm re-computes every calculation at each iteration while our Hartigan-Wong-type algorithm only evaluates and updates those groups and points that have had changes in the last quick-transfer stage. Further, Figure 2.2 shows that this speed is not at the cost of clustering performance, with our Hartigan-Wong-type algorithm having similar performance to our Lloyd's algorithm for semi-supervised clustering. For the remainder of this paper, we therefore only evaluate and implement our Hartigan-Wong-type algorithm, with initialization as per Section 2.2.2, which we collectively refer to as the ss-k-means++ algorithm. From this point on, we also assume that K is unknown and optimally determined as per the modified jump statistic developed in Section 2.2.3.1.
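To illustrate why a Lloyd-type iteration is comparatively expensive, the sketch below shows a generic semi-supervised Lloyd iteration in which labeled observations keep their known group and only unlabeled observations are reassigned. It is not the algorithm of Section 2.2.1.1, which is not reproduced here; the function names ss_lloyd and sq_dist, the argument layout and the stopping rule are assumptions made for illustration. Every pass recomputes all point-to-mean distances and all group means, which is the cost a Hartigan-Wong-type update avoids.

```python
def sq_dist(x, m):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def ss_lloyd(points, known, means, max_iter=100):
    """Generic Lloyd iteration with some labels held fixed (illustrative).

    points : list of observations, each a list of floats.
    known  : dict mapping observation index -> fixed group (labeled data).
    means  : list of K initial group means, e.g. the true simulation means.
    Returns (labels, means) after convergence or max_iter passes.
    """
    k = len(means)
    means = [list(m) for m in means]
    labels = [known.get(i, -1) for i in range(len(points))]

    for _ in range(max_iter):
        changed = False
        # Assignment step: every unlabeled point is compared to every mean.
        for i, x in enumerate(points):
            if i in known:
                continue  # labeled observations never move
            best = min(range(k), key=lambda c: sq_dist(x, means[c]))
            if best != labels[i]:
                labels[i] = best
                changed = True
        # Update step: each mean uses its labeled and unlabeled members.
        for c in range(k):
            members = [points[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                means[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:
            break
    return labels, means
```

Starting such an iteration at the true means, as in the timing experiments, removes initialization as a source of variation between the two algorithms.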

Comprehensive Performance Evaluation of ss-k-means++. Our comprehensive performance evaluation is designed to evaluate three aspects of our methodology, namely the performance of the method itself, its initialization, and our modified jump statistic in optimally determining K, as well as how all these aspects work together. Figure 2.3 displays the frequency of the difference between the optimal estimated K̂ and the true K, where K̂ was obtained using the modified jump statistic J_K•. It is clear that, on the whole, the number of groups is well estimated. There seems to be some negative bias in the estimation; however, the bias does not seem to follow a major pattern. Figure 2.4 displays the distribution of R over the different settings. It is clear that the performance is good on the whole, with, as expected, better performance for situations with lower clustering complexity. Note that the adjusted Rand indices are computed using only the unlabeled observations. In summary, whether K is known or not, the results show that the proposed methodology is able to identify the groups quite well.
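The modified jump statistic of Section 2.2.3.1 is not reproduced in this section. For orientation only, the sketch below implements the original, unmodified jump statistic of Sugar and James (2003), on which such methods build: the per-dimension distortion d_K is transformed as d_K^(-p/2), and K̂ is the value of K with the largest jump between successive transformed distortions. The function name jump_estimate and its input format are assumptions; the distortions would come from fitted clusterings for K = 1, ..., K_max.

```python
def jump_estimate(distortions, p):
    """Estimate the number of groups with the original jump statistic.

    distortions : list where distortions[K - 1] is the average within-group
                  squared distance per dimension for the K-group solution.
    p           : dimension of the data; the transformation power is p / 2.
    Returns (k_hat, jumps), where jumps[K - 1] is the jump at K.
    """
    if not distortions or any(d <= 0 for d in distortions):
        raise ValueError("distortions must be a non-empty list of positives")
    y = p / 2.0
    # Transformed distortions, with the convention d_0^(-y) = 0.
    transformed = [0.0] + [d ** (-y) for d in distortions]
    jumps = [transformed[k] - transformed[k - 1]
             for k in range(1, len(transformed))]
    # K-hat is the K with the largest jump (first one in case of ties).
    k_hat = max(range(len(jumps)), key=jumps.__getitem__) + 1
    return k_hat, jumps
```

For example, with p = 2 and distortions 10, 5, 1, 0.9, 0.8 for K = 1, ..., 5, the largest jump occurs at K = 3, where the distortion drops sharply.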
