
This chapter theoretically and empirically analyzed two classes of methods for training structural SVMs on models where exact inference is intractable. Focusing on completely connected Markov random fields, we explored how greedy search, loopy belief propagation, a linear-programming relaxation, and graph cuts can be used as approximate separation oracles in structural SVM training. In addition to a theoretical comparison of the resulting algorithms, we empirically compared performance on multi-label classification problems. Relaxation approximations distinguish themselves as preserving key theoretical properties of structural SVMs, as well as learning robust predictive models. Most significantly, structural SVMs appear to train models to avoid relaxed inference methods’ tendency to yield fractional, ambiguous solutions.

CHAPTER 6

CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS

This chapter presents the conclusions of this work. Many questions about supervised clustering and structural SVMs remain unanswered, so after the conclusions we briefly present some of the more interesting ideas that future work could develop.

6.1 Conclusions

As argued in Chapter 1, many tasks involve partitioning a given item set into related groups. For example, automated news aggregators group news articles that are about the same story. Noun-phrase coreference systems group a document’s noun-phrases that refer to the same entity. In image segmentation, one identifies regions of the image corresponding to the same object. A common practice in these problems and others like them is to employ clustering techniques to find related groups in sets of items. Since manually tuning clustering algorithms to solve these problems is difficult, the common approach is to employ supervised machine learning techniques: given example clusterings y of item sets x, learn how to partition other item sets of the same type. Supervised clustering is the problem of tuning clustering algorithms using supervised learning so they perform well on a task of interest to the practitioner.

How can we learn a clustering function? The goal of nearly all popular clustering methods is to find the clustering maximizing some criterion f(x, y), where f : X × Y → R is commonly called a discriminant function. This discriminant function is typically some formula involving the similarity between pairs of items; for example, in correlation clustering f(x, y) is the sum of all pairwise similarities between items xi, xj ∈ x in the same cluster in y, and in k-means it is the sum of similarities of each item to its cluster’s center. For this reason, nearly all supervised clustering methods learn the item pair similarity measure, thus affecting which clustering y will maximize f(x, y).
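To make the correlation clustering criterion concrete, the following is a minimal sketch of evaluating f(x, y) for a fixed clustering. It assumes the pairwise similarities are already available as a symmetric matrix and that each unordered pair is counted once; the function name, the cluster-label encoding, and the toy numbers are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def correlation_discriminant(similarity, labels):
    """Sum the pairwise similarities of item pairs placed in the same cluster.

    similarity: symmetric matrix (list of lists); similarity[i][j] is the
                similarity between items i and j (assumed precomputed).
    labels:     labels[i] is the cluster id that clustering y assigns to item i.
    """
    total = 0.0
    # Visit each unordered pair (i, j) with i < j exactly once.
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            total += similarity[i][j]
    return total


# Hypothetical toy data: items 0 and 1 are similar, item 2 is dissimilar to both.
similarity = [
    [0.0, 2.0, -1.0],
    [2.0, 0.0, -3.0],
    [-1.0, -3.0, 0.0],
]

one_cluster = correlation_discriminant(similarity, [0, 0, 0])
two_clusters = correlation_discriminant(similarity, [0, 0, 1])
```

In this toy case, separating item 2 scores higher than keeping all three items together, so a correlation clusterer maximizing f would prefer the two-cluster solution. Changing the learned similarity values changes which clustering wins, which is why supervised clustering methods focus on learning the similarity measure.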

In this sense, one may view supervised clustering as a metric or similarity learning task. We argued, however, that general metric learning frameworks are insufficient, since they do not learn metrics optimized for clustering performance. Different clustering methods (k-means, spectral, correlation, single-link, complete-link, average-link, and so on) group items in different ways even over the same pairwise similarity measure, so it is critical that the similarity measure be learned such that the clustering method in question performs well for the task at hand. Other methods might wind up finding parameterizations optimized for the wrong criterion, as argued in Section 1.3 and Section 1.7.

In order to learn this parameterization for our clustering, we employed a structural SVM learning algorithm, which we described in Chapter 2 as a general method for learning parameterizations of functions with complex structured inputs and outputs. Given a training set, the structural SVM learning method’s goal is to find a parameterization such that the discriminant function is larger for the correct output than for every possible incorrect output. Violations of this are punished in proportion to each incorrect output’s “loss” relative to the correct output, where the loss function judges how bad it is to predict that output in place of the correct one. Since different tasks may have different loss functions, structural SVMs can learn parameterizations optimized for specific tasks, an important distinction between the proposed supervised clustering method and those already existing in the literature.
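As a reminder of what this goal looks like formally, one common way to write the structural SVM training problem is the margin-rescaling form below. This is a sketch of the standard formulation rather than a restatement of Chapter 2: it assumes a linear discriminant f(x, y) = w^T Ψ(x, y), and the symbols Ψ (joint feature map), Δ (loss), ξ_i (slack variables), C (regularization trade-off), and n (number of training examples) are notation assumed here. A slack-rescaling variant, which scales the slack by the loss instead of the margin, is equally standard.

```latex
\begin{aligned}
\min_{w,\;\xi \ge 0}\quad & \frac{1}{2}\lVert w \rVert^{2} + \frac{C}{n}\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & w^{\top}\Psi(x_i, y_i) - w^{\top}\Psi(x_i, y) \;\ge\; \Delta(y_i, y) - \xi_i
  \qquad \forall i,\ \forall y \in \mathcal{Y}\setminus\{y_i\}
\end{aligned}
```

Each constraint asks the correct output y_i to outscore an incorrect output y by a margin equal to the loss Δ(y_i, y), so outputs that are worse under the task's loss must be separated more strongly.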

We then derived supervised clustering methods for correlation clustering in Chapter 3 and k-means/spectral clustering in Chapter 4. In particular, we empirically demonstrated the method’s usefulness in being able to optimize to a task-specific loss function, its computational efficiency, and its ability to learn parameterizations of various clustering methods.

Since correlation and k-means style clusterings require the use of approximations to maximize their discriminant function, and structural SVMs incorporate the predictive method into the learning program, the learning method itself becomes approximate. We presented a detailed empirical and theoretical analysis of the use of approximations and structural SVMs in Chapter 5. In short, though some of the theoretical guarantees of the structural SVM learning algorithm no longer hold, we can make new statements for undergenerating approximations (based on some type of local maximization) and overgenerating approximations (based on some type of relaxation). In particular, when using ρ-approximate undergenerating approximations in structural SVMs, the extent to which the original theoretical guarantees are violated can be bounded. When using overgenerating approximations, the important theoretical guarantees hold at the cost of possible suboptimality of the structural SVM parameterization.
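The two kinds of approximation can be contrasted in symbols. The following is a sketch under assumed notation, not a restatement of the results in Chapter 5: H_i denotes the loss-augmented objective that the separation oracle maximizes for training example i (written here in margin-rescaling form), \hat{y} is the output the approximate oracle returns, and \bar{\mathcal{Y}} is a relaxed output space; all of these symbols are introduced here for illustration.

```latex
\begin{aligned}
H_i(y) &= \Delta(y_i, y) + w^{\top}\Psi(x_i, y)
  && \text{(loss-augmented objective, assumed form)} \\[4pt]
\text{undergenerating, } \rho\text{-approximate:}\quad
  & \hat{y} \in \mathcal{Y}, \qquad H_i(\hat{y}) \;\ge\; \rho \max_{y \in \mathcal{Y}} H_i(y),
  && 0 < \rho \le 1 \\[4pt]
\text{overgenerating:}\quad
  & \hat{y} \in \operatorname*{arg\,max}_{y \in \bar{\mathcal{Y}}} H_i(y), \qquad \bar{\mathcal{Y}} \supseteq \mathcal{Y}
  && \Longrightarrow\; H_i(\hat{y}) \;\ge\; \max_{y \in \mathcal{Y}} H_i(y)
\end{aligned}
```

Read this way, an undergenerating oracle may miss the most violated constraint but never by more than the factor ρ, which is what allows the violation of the original guarantees to be bounded. An overgenerating oracle never underestimates the worst violation, so the constraints it enforces are at least as strict as the exact ones; the price is that the solution may be more conservative than necessary.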