The preceding material on learning distance and similarity measures for clustering may bring to mind the field of metric and kernel learning. This work could be, and has been, used to learn distance metrics for the purpose of clustering, but there is a substantial difference between “learning a metric” and “learning a metric so that a clustering algorithm will perform well,” as argued in Section 1.3. Also, the primary application of these metric learning papers is to improve kNN classifiers [24]. Despite these differences, it is nonetheless a closely related field. As we shall see, these metric learning algorithms almost uniformly learn a similar type of distance metric.
Davis et al. [31] describe learning Mahalanobis distances, which generalize Euclidean distances by admitting linear scaling and rotation of the feature space. The algorithm is phrased in terms of learning a metric parameterized by a matrix $A$ so that similar points are close together and dissimilar points are far apart. Two points $x_i, x_j$ in a vector space have distance
\[ (x_i - x_j)^T A (x_i - x_j). \tag{1.12} \]
For example, if $A = I$, then this is the standard squared Euclidean distance between $x_i$ and $x_j$. If $A$ is diagonal, this is a squared Euclidean distance with corresponding feature weights. If $A$ is not diagonal, the measure allows for correlation between features.
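The three cases above can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name `mahalanobis_sq` and the example vectors are illustrative assumptions, not part of [31].

```python
import numpy as np

def mahalanobis_sq(x, y, A):
    """Squared distance (x - y)^T A (x - y) for a positive semidefinite A."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ A @ diff)

# Hypothetical example points; x - y = (-2, -3).
x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])

# A = I: plain squared Euclidean distance, 2^2 + 3^2.
d_identity = mahalanobis_sq(x, y, np.eye(2))

# Diagonal A: per-feature weights, 0.5 * 2^2 + 2.0 * 3^2.
d_diagonal = mahalanobis_sq(x, y, np.diag([0.5, 2.0]))

# Non-diagonal A: the off-diagonal entries couple the two features.
A_full = np.array([[2.0, 1.0], [1.0, 2.0]])
d_full = mahalanobis_sq(x, y, A_full)
```

The off-diagonal entries of `A_full` contribute the cross term $2 \cdot 1 \cdot (-2)(-3)$, which is what lets the measure account for correlated features.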
The matrix $A$ is learned so that pairs of similar points have this distance below a certain threshold and pairs of dissimilar points have it above a different threshold. Furthermore, this is subject to regularization: the KL divergence between $A$ and a certain prior parameterization $A_0$ is minimized, which keeps the learned transformation as close as possible to a priori notions of what the “right” parameterization should look like.
Weinberger et al. [111] propose learning a distance metric through a linear weighting of terms. The linear weights are learned through an optimization problem that simultaneously punishes long distances between points in the same group and short distances between points in different groups. The optimization procedure minimizes the sum of the distances (or inverses of the distances) for similar (or dissimilar) points, where the contribution of the distances is weighted according to some learning meta-parameters that the user of this learning procedure must set.
Lanckriet et al. [65, 66] present a procedure that employs semidefinite programming to maximize the alignment between a learned kernel matrix (really a weighted sum of provided kernel functions, where the weighting is the learned parameterization) and the labels assigned to points. Though phrased for transductive classification, nothing prevents this method from being used for other applications where learning a kernel function would be appropriate.
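To make the notion of alignment concrete, the sketch below evaluates it for a fixed weighting of two kernels. It assumes the common definition of alignment as the Frobenius inner product of two matrices divided by the product of their Frobenius norms, with the label matrix taken as $yy^T$; it does not perform the semidefinite program of [65, 66], and the names `alignment` and `combined_kernel`, the toy data, and the RBF width are all illustrative assumptions.

```python
import numpy as np

def alignment(K1, K2):
    """Frobenius inner product of K1 and K2, normalized by their Frobenius norms."""
    return float(np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2)))

def combined_kernel(kernels, weights):
    """Weighted sum of precomputed kernel matrices."""
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical one-dimensional data with two well-separated groups.
X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
target = np.outer(y, y)  # +1 for same-label pairs, -1 otherwise

K_linear = X @ X.T
K_rbf = np.exp(-((X - X.T) ** 2) / 0.1)  # pairwise squared differences via broadcasting

a_linear = alignment(K_linear, target)
a_rbf = alignment(K_rbf, target)
a_mixed = alignment(combined_kernel([K_linear, K_rbf], [0.5, 0.5]), target)
```

On this toy data the RBF kernel aligns better with the labels than the linear kernel does, so a procedure that learns the weights would be expected to favor it.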
Xing et al. [112] describe an elegant approach. The data consist of a set of points we want to cluster, $\{x_i : i \in 1..n\}$ with $x_i \in \mathbb{R}^N$. As in the typical semi-supervised learning setting, there are two constraint sets $S$ and $D$, where $S$ contains pairs that should be similar and $D$ contains pairs that should be different in the learned metric. The paper considers a distance metric $d_A(x, y)$ parameterized by a positive semidefinite matrix $A$, identical to that shown in [31] up to a squaring.
\[ d_A(x, y) = \|x - y\|_A = \sqrt{(x - y)^T A (x - y)} \tag{1.13} \]
Optimization Problem 1. (Xing et al.’s Distance Learning)
\[ \operatorname*{argmin}_{A} \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_A^2 \tag{1.14} \]
\[ \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_A \geq 1. \tag{1.15} \]
The remainder of the algorithmic description focuses on establishing ways to make this learning problem tractable for the case where A is not diagonal.
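As a rough illustration of what Optimization Problem 1 measures, the sketch below evaluates the objective (1.14) and the left-hand side of the constraint (1.15) for two candidate matrices. It is not the solver from [112]; the function names, the toy points, and the candidate matrices are assumptions made for the example.

```python
import numpy as np

def dist_A(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)), as in (1.13)."""
    diff = x - y
    return float(np.sqrt(diff @ A @ diff))

def xing_objective(X, S, A):
    """Sum of squared distances over similar pairs, as in (1.14)."""
    return sum(dist_A(X[i], X[j], A) ** 2 for i, j in S)

def xing_constraint(X, D, A):
    """Sum of (unsquared) distances over dissimilar pairs, the left side of (1.15)."""
    return sum(dist_A(X[i], X[j], A) for i, j in D)

# Hypothetical data: similar pairs differ in feature 1, dissimilar pairs in feature 0.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
S = [(0, 1), (2, 3)]
D = [(0, 2), (1, 3)]

A_euclid = np.eye(2)
A_weighted = np.diag([1.0, 0.01])  # down-weights the feature that separates similar pairs

obj_euclid = xing_objective(X, S, A_euclid)
obj_weighted = xing_objective(X, S, A_weighted)
con_euclid = xing_constraint(X, D, A_euclid)
con_weighted = xing_constraint(X, D, A_weighted)
```

Both candidates satisfy the constraint, but the weighted one has a much smaller objective because it shrinks the feature along which the similar pairs differ.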
In a similar vein, Tsang and Kwak introduce a kernel learning algorithm [105]. They suppose that for two patterns $x_i, x_j$ in the input space $\mathbb{R}^p$, there is a matrix inner product $\langle x_i, x_j \rangle = s_{ij} = x_i^T M x_j$, where $M \in \mathbb{R}^{p \times p}$ is a positive semi-definite matrix. Since $M$ is positive semi-definite, it can be factored as the product of a matrix and its transpose, so they rewrite $s_{ij}$ as $s_{ij} = x_i^T A A^T x_j$, where $A$ is a $p \times p$ matrix. In their learning framework, $A$ is some learned matrix for a learned metric $\tilde{d}$, whereas $M$ corresponds to an original metric $d$:
\[ d_{ij}^2 = (\phi(x_i) - \phi(x_j))^T M (\phi(x_i) - \phi(x_j)) \tag{1.16} \]
\[ \tilde{d}_{ij}^2 = (\phi(x_i) - \phi(x_j))^T A A^T (\phi(x_i) - \phi(x_j)) \tag{1.17} \]
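The factorization of a positive semi-definite matrix into a matrix times its transpose, used above, can be verified numerically. The sketch below uses an eigendecomposition, which is one standard way to obtain such a factor and is not necessarily the construction used in [105]; `factor_psd` and the example matrix are illustrative assumptions.

```python
import numpy as np

def factor_psd(M):
    """Return A with A @ A.T equal to M (up to rounding) for a symmetric PSD matrix M."""
    eigvals, eigvecs = np.linalg.eigh(M)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative rounding errors
    return eigvecs * np.sqrt(eigvals)      # scales column k by sqrt(eigvals[k])

# Hypothetical PSD matrix and patterns.
M = np.array([[2.0, 1.0], [1.0, 2.0]])
A = factor_psd(M)

x_i = np.array([1.0, 0.0])
x_j = np.array([0.0, 1.0])

s_M = float(x_i @ M @ x_j)        # x_i^T M x_j
s_A = float(x_i @ A @ A.T @ x_j)  # x_i^T A A^T x_j
```

The two inner products agree, which is the sense in which a matrix inner product can be parameterized either by $M$ or by a factor $A$.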
The algorithm tries to learn an A with the following optimization problem, for S and D as sets of pairs of elements that are supposed to be similar or different, respectively:
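Optimization Problem 2. (Tsang and Kwak Distance Learning)
\[ \operatorname*{argmin}_{A, \gamma, \xi_{ij}} \; \frac{1}{2}\|AA^T\|^2 + C_S \frac{1}{|S|} \sum_{(x_i, x_j) \in S} \tilde{d}_{ij}^2 + C_D \left( -\nu\gamma + \frac{1}{|D|} \sum_{(x_i, x_j) \in D} \xi_{ij} \right) \tag{1.18} \]
\[ \text{s.t.} \quad \forall (x_i, x_j) \in D: \quad \tilde{d}_{ij}^2 - d_{ij}^2 \geq \gamma - \xi_{ij} \tag{1.19} \]
\[ \xi_{ij} \geq 0. \tag{1.20} \]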
Here, $C_S$, $C_D$, and $\nu$ are tunable positive-valued parameters. The $\|AA^T\|^2$ term is used to encourage the rank of $A$ to be low for sparsity. The larger $C_S$ is, the more the algorithm attempts to make the learned distance measure for $(x_i, x_j) \in S$ low. The $\xi_{ij}$ serve a similar function to slack variables in a generic SVM in that their minimization penalizes pairs in $D$ for being closer than the threshold $\gamma$ allows, and the larger $C_D$ is, the less the program tolerates large slack. Finally, the larger $\nu$ is, the larger the optimization program tries to make the margin $\gamma$ between distances of pairs in $S$ and distances of pairs in $D$. More intuitively, it chooses an $A$ such that close pairs are close, while distant pairs are far apart.
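The roles of the terms in Optimization Problem 2 can be seen by evaluating the objective for a fixed $A$ and $\gamma$. The sketch below is not the solver from [105]: it assumes $\phi$ is the identity map, reads $\|AA^T\|$ as the Frobenius norm, and sets each slack to the smallest value that satisfies (1.19) and (1.20). The function names and toy data are hypothetical.

```python
import numpy as np

def sq_dist(x, y, B):
    """(x - y)^T B (x - y)."""
    diff = x - y
    return float(diff @ B @ diff)

def tk_objective(X, S, D, A, M, gamma, C_S, C_D, nu):
    """Value of (1.18) for fixed A and gamma, with minimal feasible slacks."""
    B = A @ A.T
    sim_term = np.mean([sq_dist(X[i], X[j], B) for i, j in S])
    # Smallest xi_ij with learned - original >= gamma - xi_ij and xi_ij >= 0.
    slacks = [max(0.0, gamma - (sq_dist(X[i], X[j], B) - sq_dist(X[i], X[j], M)))
              for i, j in D]
    reg = 0.5 * np.linalg.norm(B) ** 2  # Frobenius norm (an assumption)
    value = reg + C_S * sim_term + C_D * (-nu * gamma + np.mean(slacks))
    return float(value), slacks

# Hypothetical data: similar pairs differ in feature 1, dissimilar pairs in feature 0.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
S = [(0, 1), (2, 3)]
D = [(0, 2), (1, 3)]
M = np.eye(2)  # original metric: Euclidean

# Leaving the metric unchanged violates the margin for every pair in D.
val_same, slacks_same = tk_objective(X, S, D, np.eye(2), M, 1.0, 1.0, 1.0, 0.5)

# Stretching feature 0 and shrinking feature 1 removes the slack.
A_new = np.diag([2.0, 0.1])
val_new, slacks_new = tk_objective(X, S, D, A_new, M, 1.0, 1.0, 1.0, 0.5)
```

With these settings the second matrix eliminates the slack but pays a larger regularization term, which illustrates the trade-off that $C_S$, $C_D$, and $\nu$ control.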
Schultz and Joachims describe a different way to think about learning a distance metric [95]. Instead of the S and D sets that say, “these elements are similar/different,” constraints in [95] are of the form “a is closer to b than a is to c.” In this way, the desired closeness is expressed in terms of relative preferences. Relative constraints have been unnecessary for the hard clusterings we have so far considered, but this type of distance learning may be useful in situations where one needs to tune a distance metric with more finesse than is allowed by absolute binary relationships of “similar” and “different.”
The distance metric is parameterized by two matrices, $A$ and $W$:
\[ d_{A,W}(x, y) = \sqrt{(x - y)^T A W A^T (x - y)}. \tag{1.21} \]
$W$ is a diagonal matrix with nonnegative entries, which are learned by this algorithm. $A$ is a real matrix provided a priori. The paper discusses two possible choices for $A$. One is $A = I$, of course. The other is $A = \Phi$ with the $i$th column equal to $\phi(x_i)$, that is, training vector $x_i$ mapped into the feature space; this $A$ allows one to use kernel functions representing inner products within the feature space provided by $\phi$.
An optimization problem to learn this metric is given in OP 3.
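Optimization Problem 3. (Schultz and Joachims Distance Learning)
\[ \min \; \frac{1}{2}\|AWA^T\|_F^2 + C \sum_{i,j,k} \xi_{ijk} \tag{1.22} \]
\[ \text{s.t.} \quad \forall (i, j, k) \in P_{\text{train}}: \quad (x_i - x_k)^T A W A^T (x_i - x_k) - (x_i - x_j)^T A W A^T (x_i - x_j) \geq 1 - \xi_{ijk} \tag{1.23} \]
\[ \xi_{ijk} \geq 0 \tag{1.24} \]
\[ W_{ii} \geq 0 \tag{1.25} \]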
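The relative constraints in Optimization Problem 3 can be illustrated by computing the slack a given weighting incurs on a triplet. The sketch below only evaluates the constraint (1.23) for fixed weights and does not solve the program from [95]; it assumes $A = I$ in the example and reads a triplet $(i, j, k)$ as “$x_i$ is closer to $x_j$ than to $x_k$.” The function names and data are hypothetical.

```python
import numpy as np

def sq_dist_AW(x, y, A, w):
    """(x - y)^T A W A^T (x - y) with W = diag(w)."""
    proj = A.T @ (x - y)
    return float(np.sum(w * proj ** 2))

def triplet_slacks(X, triplets, A, w):
    """Smallest xi_ijk satisfying (1.23) and (1.24) for each triplet (i, j, k)."""
    return [max(0.0, 1.0 - (sq_dist_AW(X[i], X[k], A, w)
                            - sq_dist_AW(X[i], X[j], A, w)))
            for i, j, k in triplets]

# Hypothetical data: x_0 should be closer to x_1 than to x_2,
# although x_2 is the nearer point under the Euclidean metric.
X = np.array([[0.0, 0.0], [0.0, 1.0], [0.5, 0.0]])
triplets = [(0, 1, 2)]
A = np.eye(2)

slack_uniform = triplet_slacks(X, triplets, A, np.array([1.0, 1.0]))
slack_weighted = triplet_slacks(X, triplets, A, np.array([8.0, 0.5]))
```

Uniform weights violate the relative constraint and need positive slack, while up-weighting feature 0 satisfies it with a margin of at least 1, at the cost of a larger regularization term in (1.22).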
In conclusion, methods in this field learn a metric under which similar points are kept close and different points are kept far apart. They all learn some sort of matrix inner product $\langle x, y \rangle = x^T B y$, where the form of $B$ and how it is learned differ from paper to paper. Even in this cursory survey, we have seen a tremendous variety of methods in this area, all with different opinions about the proper optimization criteria. In this way, supervised clustering work could be viewed as another metric learning problem, except that the optimization criterion for a supervised clusterer is that the learned metric or measure be such that a clusterer performs well in partitioning the data when run on the similarity matrix. However, the converse does not hold: metric learning does not by itself constitute a supervised clustering method, since the optimization criteria are typically much different.