As one might guess from the name, semi-supervised clustering is related to supervised clustering, but it is typically applied to very different settings. In semi-supervised clustering, a clustering algorithm is likewise parameterized, but usually for application to a single large set of items (rather than multiple smaller sets of items) where information on the clustering structure of the input items is incomplete. As such, there is often no interest in learning a model parameterization that could be applied to new sets of items, whether in the form of a learned metric or other transferable learned knowledge. Such transferability is, by contrast, a key component of supervised clustering.
Semi-supervised clustering methods augment an unsupervised clustering algorithm with information about how some of the items being clustered should relate to each other. In this setting one does not get the complete clustering of the set as described in the purely supervised case. Rather, information is incomplete, usually in the form of pairwise constraints, e.g., “items a and b should be in the same cluster” or “should not be in the same cluster.” When clustering the data, the semi-supervised clusterer attempts to fulfill the constraints as best it can. This is distinct from supervised clustering, since in supervised clustering one has sets of items and complete partition information on these training sets, rather than incomplete information covering only a certain subset of pairs within a single input set.
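To make the notion of pairwise constraints concrete, the following minimal sketch represents "same cluster" and "not same cluster" statements as index pairs and counts how many a given labeling fails to satisfy. The names `must_link`, `cannot_link`, and `count_violations` are illustrative assumptions, not terminology taken from any particular cited method.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # a pair of item indices


def count_violations(labels: Sequence[int],
                     must_link: List[Pair],
                     cannot_link: List[Pair]) -> int:
    """Count pairwise constraints that a cluster labeling fails to satisfy."""
    violated = 0
    for i, j in must_link:
        # "items i and j should be in the same cluster"
        if labels[i] != labels[j]:
            violated += 1
    for i, j in cannot_link:
        # "items i and j should not be in the same cluster"
        if labels[i] == labels[j]:
            violated += 1
    return violated


# Five items; constraints cover only a few of the possible pairs.
labels = [0, 0, 1, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 4), (2, 3)]
print(count_violations(labels, must_link, cannot_link))  # prints 2
```

Note that the constraints mention only four of the ten possible pairs, which is the sense in which the supervision is incomplete.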
Some of these semi-supervised clustering methods modify a clustering algorithm so that it incorporates this supervision information, but do not parameterize a distance or similarity measure. For example, Aggarwal et al. [1] describe a minimal approach based on cluster seeds. The k-means algorithm starts with seed cluster centroids that are iteratively refined in a greedy fashion to minimize intracluster distance (or, alternatively, to maximize intracluster similarity), and it has a strong tendency to find local minima of its objective function. Aggarwal et al. take advantage of this tendency by setting the initial cluster seeds to the centroids seen in the training data, thus leaving the clustering algorithm predisposed to finding clusters close to the initial starting points [1]. (Interestingly, were the k-means algorithm not so prone to falling into local minima, this algorithm would be ineffective.) Wagstaff et al. [110] propose an algorithm that likewise does not modify the distance metric at all, but directly constrains the k-means clustering algorithm to respect constraints about which sets of points should or should not be together; if the clusterer produces a cluster that violates the constraints during a k-means iteration, the items are reassigned to satisfy the constraints, and the algorithm continues until convergence.
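The two ideas in this paragraph, seeding k-means with known centroids and restricting assignments so that pairwise constraints hold, can be sketched together as follows. This is a simplified illustration under stated assumptions, not the algorithm of either cited paper: the function names are hypothetical, items are assigned greedily in index order to the nearest centroid that does not conflict with already-assigned items, and an item with no feasible cluster raises an error.

```python
import numpy as np


def violates(i, cluster, labels, must_link, cannot_link):
    """True if placing item i in `cluster` conflicts with items assigned so far."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner already placed in a different cluster is a conflict.
        if other is not None and labels[other] not in (-1, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner already placed in this cluster is a conflict.
        if other is not None and labels[other] == cluster:
            return True
    return False


def constrained_kmeans(X, seeds, must_link=(), cannot_link=(), n_iter=20):
    """k-means started from given seed centroids, with constrained assignment."""
    centroids = np.array(seeds, dtype=float)
    labels = np.full(len(X), -1)  # -1 marks "not yet assigned"
    for _ in range(n_iter):
        new_labels = np.full(len(X), -1)
        for i, x in enumerate(X):
            # Try centroids from nearest to farthest; take the first feasible one.
            order = np.argsort(((centroids - x) ** 2).sum(axis=1))
            for c in order:
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for item {i}")
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            members = X[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels, centroids


X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [2.4, 2.4]])
seeds = [[0.0, 0.0], [5.0, 5.0]]
free, _ = constrained_kmeans(X, seeds)
tied, _ = constrained_kmeans(X, seeds, must_link=[(4, 2)])
print(free.tolist(), tied.tolist())
```

In this toy data the last item is slightly closer to the first seed, so the unconstrained run places it in cluster 0, while the must-link constraint with item 2 moves it to cluster 1.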
De Bie and Cristianini present a method for learning a metric with the stated purpose of clustering [32]. It defines a metric parameterized by a matrix $W$, where the metric between two points $x_i$ and $x_j$ is
$$(x_i - x_j)^T W W^T (x_i - x_j) \qquad (1.11)$$
where $W$ is derived through an eigenanalysis of the vectors and their cluster constraints. This is remarkably similar to the metric learning techniques described in Section 1.7, both in formulation and in algorithmic process, but the eigenanalysis captures information about the global structure of how the data would cluster, even though clustering is not used directly in the optimization procedure.
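As a small illustration of the form in (1.11), the sketch below evaluates the metric for a given matrix and checks that it equals the squared Euclidean distance between the two points after projection by the transpose of the matrix. The function name `learned_metric` and the example matrices are assumptions made for illustration; in particular, the matrices here are chosen by hand and are not the result of the eigenanalysis described above.

```python
import numpy as np


def learned_metric(x_i, x_j, W):
    """Evaluate (x_i - x_j)^T W W^T (x_i - x_j) for a given matrix W."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    projected = W.T @ diff  # project the difference vector with W^T
    return float(projected @ projected)  # squared norm in the projected space


x_i, x_j = [1.0, 2.0], [4.0, 6.0]

# With W = I the metric reduces to squared Euclidean distance: 9 + 16.
print(learned_metric(x_i, x_j, np.eye(2)))  # prints 25.0

# A single-column W keeps only the first coordinate of the difference.
W = np.array([[1.0], [0.0]])
print(learned_metric(x_i, x_j, W))  # prints 9.0
```

Because the matrix product of W with its transpose is positive semidefinite for any W, the resulting value is never negative, whatever W is used.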
Cohn et al. incorporate user feedback on clusterings of documents in the form “these two documents should (not) be in the same cluster,” and use these constraints to improve the distance metric between pairs of elements [22]. The clustering procedure used is a mixture of distributions, where each cluster corresponds to a different distribution, and each document has a probability of being generated according to that distribution. A document’s probability is modeled as a weighted product of the probabilities of the words. By changing the weights corresponding to each word, one modifies the distribution generating the document. The approach taken in this paper is to choose, through iterative hill climbing, a weighting such that the KL divergence between the distributions for two documents is small or large depending on whether these documents should or should not be clustered together.
Some methods do directly include the clustering procedure in the metric optimization procedure for semi-supervised clustering. Bilenko et al. [10] and Basu et al. [8] produce algorithms called, respectively, MPCK-Means and Hidden Markov Random Field k-Means (HMRF K-Means), which both incorporate must-link and cannot-link constraints through an EM procedure. These procedures first cluster the data, and then modify the distance measure to “fix” any mistakes that occurred during the clustering. The algorithms iterate over these two steps until convergence. The first works through application of a matrix update procedure and the second works through MAP inference on a graphical model, but the two methods appear almost identical in intent. An additional paper by Kulis et al. [60] refines Basu et al. [8] for the case of kernel clustering, where points do not necessarily exist as individual points in an explicit vector space in which they are clustered. This would be the case in, for example, typical representations of noun-phrase coreference [100]. These methods all both modify the clustering procedure to respect constraints and parameterize the distance metric as they perform clustering.
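The general shape of such an alternation, cluster under the current metric and then adjust the metric in response to violated constraints, might be sketched as below. This is a toy scheme under loudly stated assumptions and does not reproduce MPCK-Means or HMRF K-Means: it uses a diagonal metric (one weight per feature), handles only must-link constraints, and sets each weight to the number of items divided by that feature's within-cluster scatter plus a penalty from violated pairs. The function name and the `penalty` parameter are hypothetical.

```python
import numpy as np


def alternate_cluster_and_reweight(X, init_centroids, must_link=(),
                                   n_iter=10, penalty=1.0):
    """Alternate k-means-style clustering with a diagonal metric update."""
    X = np.asarray(X, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    weights = np.ones(X.shape[1])  # start from the Euclidean metric
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 1: cluster the data under the current weighted distance.
        sq = (X[:, None, :] - centroids[None, :, :]) ** 2
        labels = (sq * weights).sum(axis=2).argmin(axis=1)
        for c in range(len(centroids)):
            members = X[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
        # Step 2: update the metric. Features along which violated
        # must-link pairs differ accumulate scatter and lose weight.
        scatter = ((X - centroids[labels]) ** 2).sum(axis=0)
        for i, j in must_link:
            if labels[i] != labels[j]:
                scatter += penalty * (X[i] - X[j]) ** 2
        weights = len(X) / (scatter + 1e-9)  # small constant avoids division by zero
    return labels, weights


# Items 0 and 1 differ only in the second feature but are split apart
# by the initial centroids, so the must-link pair (0, 1) is violated.
X = [[0.0, 0.0], [0.0, 4.0], [3.0, 0.0], [3.0, 4.0]]
labels, weights = alternate_cluster_and_reweight(
    X, init_centroids=[[1.5, 0.0], [1.5, 4.0]], must_link=[(0, 1)])
print(labels.tolist(), weights)
```

In this example the violated pair pushes the weight of the second feature below that of the first, which is the direction of correction the text describes. The toy scheme does not escape the initial local minimum here, so the labeling itself is unchanged; the cited methods use more careful objectives and updates.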
To summarize, semi-supervised clustering methods may seem closely related to supervised clustering methods, but the typical target application (the clustering of a single dataset for which there is incomplete information, with no concern for transfer to new clusterings) leads to very different problem formulations that are often inappropriate for the supervised clustering setting.