
Supervised Clustering Is Not Multiclass Classification

Repeatedly, there has been confusion about the distinction between supervised clustering and simple multiclass classification. Though we were surprised the first few times this confusion arose, it is understandable: clustering and classification appear quite similar. In both cases, when predicting outputs for a set of inputs, the method produces a partition of that input set.

Indeed, in many types of machine learning algorithms, the line between clustering and classification is very thin. Utilizing clustering in classification is the basis for the majority of semi-supervised and transductive classification research. In these settings, the target hypothesis is either a multiclass or binary classification rule, and the training data consists of both labeled and unlabeled points. Informally, the goal of these algorithms is that the learned classification hypothesis should be consistent with the labeled data while also obeying the cluster assumption: if we interpret examples as existing in a vector space, then the decision boundary for the learned hypothesis should pass through regions of the space that have low density (or, depending on the type of hypothesis being learned, class centers should lie in regions of high density), considering both labeled and unlabeled points [19, 18, 88, 96]. In intuitive motivation, mathematical formulation, and algorithmic interpretation, these methods use clustering-like techniques to guide the learning and application of semi-supervised classification schemes [43].

It is worth noting that this technique is applicable to situations other than multiclass and binary classification. Similar techniques have also been applied to cases where the learned hypothesis function produces outputs that are more complex and structured [2, 14, 20, 49]. However, in these settings, the intuitive notion of the cluster assumption of points lying in space becomes less compelling, since there is no longer an easily visualizable “vector space” model for these complex objects, so these efforts often adopt different terminology.

Although the two intersect in some applications, there are important differences in the types of problems they are appropriate for and in the concepts they are capable of representing, which we discuss in detail here.

1.5.1 Dynamic Clusters versus Static Classes

One major difference between classification and clustering is that supervised clustering schemes learn how to partition sets of items, whereas multiclass classification schemes learn to assign items to static, predefined classes.

A simple example might help illustrate this difference. Consider the problem of clustering marbles, and suppose that we have as training data a set consisting of red marbles, green marbles, and blue marbles, as pictured in Figure 1.3(a). The training data consists of the partition of these marbles into those three colors. The features for each marble include the “color angle” of the marble’s hue on a color wheel (red is at 0°, green is at 120°, blue is at 240°), as well as various other features, such as size, weight, and clarity, that turn out to be irrelevant to this task.

Were we to consider this a multiclass classification problem, then such an algorithm would learn how to classify future items as being in the red, green, or blue class learned during the training phase. For example, for a given input marble, the classifier would view red, green, or blue as the most likely classification depending on how close the input marble’s hue is to 0°, 120°, or 240°, respectively. Were we to consider this as a supervised clustering problem, then the learned model would learn how to partition a set of items, perhaps learning that difference in hue angle and likelihood of being in the same cluster are inversely related. The classification learner, by contrast, wants to learn how to partition future items into the groups indicated by the training data, e.g., learn what areas in the space correspond to membership in what class, as indicated in Figure 1.3(b).

[Figure 1.3, panels (a) through (e): marbles plotted along a hue-angle axis, with ticks at 120° and 240°.]

Figure 1.3: This illustration serves as an example of the difference between clustering marbles and classifying marbles.

Suppose that we train either a supervised clustering or a multiclass classification algorithm on the red-green-blue marbles, but at prediction time we are fed marbles that fall roughly into groups of yellow, cyan, and magenta marbles, as shown in Figure 1.3(c). The multiclass classifier, having learned regions corresponding to the red, green, and blue marbles, will have a tendency to “split” the natural yellow, cyan, and magenta groups, since the classification regions have a decision boundary running through each of these groups, as in Figure 1.3(d). The supervised clustering method, however, having learned to partition items according to color but not to put them into any predefined bins, can identify the natural color regions, as shown in Figure 1.3(e).

This is not to say that the partitioning according to Figure 1.3(d) is wrong; in many applications this is precisely what one wants. Nor is it to say that applying any given supervised clustering scheme would result in the scenario of Figure 1.3(e). The example merely illustrates a basic difference between the types of tasks for which the two methods are appropriate.
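The marble example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any particular learning algorithm: the fixed class centers stand in for what a classifier might learn from the red-green-blue training set, the `threshold` of 30 degrees stands in for a learned relationship between hue difference and cluster membership, and the names `classify`, `cluster`, and `circular_diff` are hypothetical.

```python
# Assumed class centers a classifier might learn from red/green/blue marbles.
CLASS_CENTERS = {"red": 0.0, "green": 120.0, "blue": 240.0}


def circular_diff(a, b):
    """Smallest angular distance between two hue angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def classify(hue):
    """Assign a marble to the nearest fixed class center."""
    return min(CLASS_CENTERS, key=lambda c: circular_diff(hue, CLASS_CENTERS[c]))


def cluster(hues, threshold=30.0):
    """Link marbles whose hue difference is below threshold (single link).

    The threshold is an assumption standing in for a learned pairwise model.
    """
    parent = list(range(len(hues)))

    def find(i):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(hues)):
        for j in range(i + 1, len(hues)):
            if circular_diff(hues[i], hues[j]) < threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, hue in enumerate(hues):
        groups.setdefault(find(i), []).append(hue)
    return sorted(sorted(g) for g in groups.values())


# Yellowish, cyanish, and magentaish marbles, as in Figure 1.3(c).
new_marbles = [50.0, 70.0, 170.0, 190.0, 290.0, 310.0]
labels = [classify(h) for h in new_marbles]
clusters = cluster(new_marbles)
```

With these inputs, the two yellowish marbles at 50 and 70 degrees receive the labels red and green respectively, so the fixed classes split that group in the manner of Figure 1.3(d), whereas the pairwise grouping returns three clusters that match the natural color groups in the manner of Figure 1.3(e).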

1.5.2 Large Numbers of Unknown Groups

Another difference between clustering and classification becomes obvious when one considers that classification assumes one knows a priori what classes a set of objects could be partitioned into, which is not always the case.

For instance, consider a task like noun-phrase coreference. Recall that in noun-phrase coreference one takes all of the noun-phrases in a document and partitions them according to which noun-phrases refer to the same entity. If we were to apply multiclass classification to this task, we would need a separate class for each entity. This would be impossible: the number of classes implied by having to produce a class description of every entity that has been or ever could be encountered is prohibitive, and these entities are also unknown in advance. Also consider the problem of clustering news articles according to whether they are about the same news story: it would be impossible to anticipate which “classes” (stories) the news articles of future days are going to fall into, or else it would not be news.

Furthermore, in common noun-phrase tasks, few of the entities referred to in the evaluation set actually occurred in the training set [77], and it is unclear how one can generalize knowledge about how to “group” items in the context of pure classification of individual noun-phrases. If one’s training data talks about Anne, Bob, and Clarence, what is the algorithm to think when it encounters, in application, noun-phrases referring to a previously unknown entity Doug?
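The closed-label-set problem can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than a coreference method from the literature: `classify_mention` uses an arbitrary character-overlap score, and `same_entity` uses a deliberately crude last-token match as a stand-in for a learned, entity-independent pairwise decision.

```python
KNOWN_ENTITIES = ("Anne", "Bob", "Clarence")  # label set fixed at training time


def classify_mention(mention):
    """Toy closed-set classifier: the output is always a known entity."""
    letters = set(mention.lower())
    # Arbitrary illustrative score: number of shared characters.
    return max(KNOWN_ENTITIES, key=lambda e: len(set(e.lower()) & letters))


def same_entity(a, b):
    """Toy pairwise rule: the last tokens match, ignoring case."""
    return a.lower().split()[-1] == b.lower().split()[-1]


def partition(mentions):
    """Greedily group mentions using only the pairwise rule."""
    clusters = []
    for m in mentions:
        for c in clusters:
            if any(same_entity(m, other) for other in c):
                c.append(m)
                break
        else:
            # No existing cluster matched, so this mention starts a new one.
            clusters.append([m])
    return clusters


forced_label = classify_mention("Doug")
groups = partition(["Doug", "Anne", "Mr. Doug", "anne"])
```

Whatever scoring rule is used, the classifier has no way to answer "none of the above", so a mention of Doug is forced onto Anne, Bob, or Clarence. The pairwise partitioner never names an entity at all, so the two Doug mentions simply form a new group alongside the Anne mentions.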

1.5.3 Different Appropriate Choices of Features

In addition to these fundamental differences in purpose, there are practical differences that separate what types of parameterizations are acceptable for supervised classification versus supervised clustering. At issue is that classification learns over features of individual items, whereas supervised clustering, in parameterizing a pairwise similarity score, has features describing two points jointly, that is, pairwise features. To take an example, in a vector space model, consider two points $x_i, x_j \in x$ that are also real-valued vectors $x_i, x_j \in \mathbb{R}^N$. As a classification task, the natural instinct would be to take the vectors as is. In contrast, a pairwise feature vector as used in clustering would involve some synthesis of the two, perhaps their difference $|x_i - x_j|$ or a componentwise product $x_i \circ x_j$. As an example, if we were trying to learn a concept like the marble clustering example of Section 1.5.1, a classifier would find the hue angle a useful feature, but a clusterer would get more use from a pairwise feature of the difference in hue angle.
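As a minimal sketch of the two pairwise syntheses mentioned above, the following Python functions build a componentwise absolute difference and a componentwise product from two item vectors. The function names are hypothetical, and the wrap-around handling in `hue_difference` is an added assumption (hue is an angle, so 10 and 350 degrees are treated as 20 degrees apart); the text itself only speaks of a "difference in hue angle".

```python
def abs_difference(xi, xj):
    """Pairwise feature vector |x_i - x_j|, taken componentwise."""
    return [abs(a - b) for a, b in zip(xi, xj)]


def componentwise_product(xi, xj):
    """Pairwise feature vector formed by the componentwise product."""
    return [a * b for a, b in zip(xi, xj)]


def hue_difference(hue_i, hue_j):
    """Difference in hue angle, wrapped to [0, 180] degrees (assumption)."""
    d = abs(hue_i - hue_j) % 360.0
    return min(d, 360.0 - d)


# Two marbles described by (hue angle in degrees, weight); values are made up.
marble_i = [10.0, 2.0]
marble_j = [350.0, 3.0]

diff_features = abs_difference(marble_i, marble_j)
prod_features = componentwise_product(marble_i, marble_j)
hue_feature = hue_difference(marble_i[0], marble_j[0])
```

A classifier would consume `marble_i` and `marble_j` separately, one item at a time, while a pairwise similarity model would consume something like `diff_features` or `hue_feature`, which describe the two marbles jointly.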

A more subtle difference between the two feature representations becomes clear when one considers what types of features are helpful versus harmful in each setting. Suppose a hypothetical individual is working with noun-phrases and a training set of documents that talk about the entities Anne, Bob, and Clarence. If this individual wishes to use a multiclass classification scheme, i.e., classify new noun-phrases as referring to either Anne, Bob, or Clarence, then binary features indicating that the noun-phrase in question is the character sequence Anne, or Bob, or Clarence become extraordinarily helpful. In the case of supervised clustering, though, features that are this specific can become harmful, since the goal is to be able to group noun-phrases no matter what entity they refer to, whether they refer to these three individuals or to someone completely different; a model that depends heavily on these simple features would be unable to transfer to a new, unseen entity. Features specific to a token that appears in the text would do little to help learn a general model [15]. They would make it easy to fit the training data while being nearly useless for data on unseen entities. This is not to suggest such features could never be useful (as a practical matter, it seems humans must do something like this to connect names and titles to entities), but with very limited training data such features would be harmful for generalization performance.