3.3 Distance Metric Learning (DML)
3.3.1 Information Theoretic Metric Learning (ITML)
In this research we use an existing supervised DML method, Information Theoretic Metric Learning (ITML) [65], to learn a Mahalanobis distance metric for the original space $X$ using supervisory information (pairwise similarity constraints and class labels) extracted from the privileged space $X^{*}$.
In ITML [65] we are given a set of $n$ points $\{x_1, \ldots, x_n\}$, $x_i \in \mathbb{R}^m$, together with an initial distance function, parameterized by $A_0$, that encodes prior knowledge about interpoint distances. ITML learns a positive definite matrix $A \succ 0$ defining the (squared) distance $d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j)$ that is close to the baseline matrix $A_0$, subject to categorical pairwise similarity information on the data points that should be preserved. Two sets of pairs of data points from $X$ are formed, corresponding to the 'similar' ($S_+$) and 'dissimilar' ($S_-$) data items:
• $S_+ = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are judged to be similar}\}$
• $S_- = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are judged to be dissimilar}\}$
In supervised multi-class settings, constraints are taken directly from the provided labels, i.e. points in the same class are constrained to be 'similar' ($S_+$), and points in different classes are constrained to be 'dissimilar' ($S_-$) [65].
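The label-based construction of $S_+$ and $S_-$, together with the squared distance $d_A$, can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation used in [65]: the function names `mahalanobis_sq` and `build_constraints` and the toy data are hypothetical, and enumerating every pair of points is an assumption made for brevity.

```python
import itertools
import numpy as np

def mahalanobis_sq(A, xi, xj):
    """Squared distance d_A(xi, xj) = (xi - xj)^T A (xi - xj)."""
    diff = xi - xj
    return float(diff @ A @ diff)

def build_constraints(labels):
    """Split all index pairs (i, j), i < j, by label agreement.

    Pairs with equal labels go to the 'similar' list (S+),
    all other pairs go to the 'dissimilar' list (S-).
    """
    similar, dissimilar = [], []
    for i, j in itertools.combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            similar.append((i, j))
        else:
            dissimilar.append((i, j))
    return similar, dissimilar

# Toy data (hypothetical): four points in R^2 with two classes.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
y = [0, 0, 1, 1]
S_plus, S_minus = build_constraints(y)

# With A = I, d_A reduces to the squared Euclidean distance.
d_03 = mahalanobis_sq(np.eye(2), X[0], X[3])
```

Here the constraints are stored as index pairs rather than pairs of points, which keeps them valid when the same data matrix is reused with different matrices $A$.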
The closeness between the original metric and the new one is measured through the Kullback-Leibler (K-L) divergence, also known as relative entropy, between the multivariate zero-mean Gaussians having $A_0$ and $A$ as precision (inverse covariance) matrices. The ITML optimization problem [65] minimizes this K-L divergence. It has been shown that this differential relative entropy is equal to the LogDet divergence between $A_0$ and $A$, that is, to $D_{\mathrm{LogDet}}(A, A_0)$¹. The LogDet divergence², also called the Burg matrix divergence ($D_{\mathrm{Burg}}(A, A_0)$), is a type of Bregman matrix divergence that has been widely used in matrix nearness problems [79]. One advantage of using the Burg divergence is that it preserves the positive definiteness of the matrices while solving the optimization problem.
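As a worked check of this equivalence, consider zero-mean Gaussians with covariances $A_0^{-1}$ and $A^{-1}$, and take the divergence in the order $\mathrm{KL}(p(x; A_0) \,\|\, p(x; A))$ (this ordering is an assumption; the text leaves it implicit). The standard closed form for the K-L divergence between two $m$-dimensional Gaussians then gives:

```latex
\begin{aligned}
\mathrm{KL}\big(p(x; A_0) \,\|\, p(x; A)\big)
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(A A_0^{-1}\big) - m
     + \log\frac{\det A^{-1}}{\det A_0^{-1}}\Big] \\
  &= \tfrac{1}{2}\Big[\operatorname{tr}\big(A A_0^{-1}\big)
     - \log\det\big(A A_0^{-1}\big) - m\Big] \\
  &= \tfrac{1}{2}\, D_{\mathrm{LogDet}}(A, A_0).
\end{aligned}
```

The constant factor $\tfrac{1}{2}$ does not affect the minimizer, so minimizing the K-L divergence and minimizing the LogDet divergence select the same $A$.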
The Problem Statement
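To compute the optimal $A$, the resulting matrix divergence
$$D_{\mathrm{Burg}}(A, A_0) = \operatorname{tr}\big(A A_0^{-1}\big) - \log\det\big(A A_0^{-1}\big) - m$$
is minimized while enforcing the desired constraints:
$$\begin{aligned}
\min_{A \succ 0}\ & D_{\mathrm{Burg}}(A, A_0), \quad \text{subject to} \\
& d_A(x_i, x_j) \le l, \quad \text{if } (x_i, x_j) \in S_+, \text{ and} \\
& d_A(x_i, x_j) \ge u, \quad \text{if } (x_i, x_j) \in S_-,
\end{aligned} \tag{3.1}$$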
where $l$ and $u$ are relatively small and large distance bounds, respectively, $\operatorname{tr}$ denotes the trace operator, and $m$ is the data dimensionality. Note that $A_0$ can be set to the inverse of the sample covariance (when the data are assumed to be Gaussian) or, alternatively, to the identity matrix, which corresponds to the squared Euclidean distance [65].
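The objective and the two choices of $A_0$ mentioned above can be sketched in Python with NumPy. This is an illustrative sketch under stated assumptions: the names `burg_divergence` and `prior_matrix` and the `use_covariance` flag are hypothetical, the data matrix is assumed to hold one point per row, and the sample covariance is assumed to be invertible.

```python
import numpy as np

def burg_divergence(A, A0):
    """D_Burg(A, A0) = tr(A A0^{-1}) - log det(A A0^{-1}) - m."""
    m = A.shape[0]
    M = A @ np.linalg.inv(A0)
    # slogdet returns (sign, log|det|), which avoids overflow in det().
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:
        # A positive determinant is necessary (not sufficient) for
        # A and A0 both being positive definite.
        raise ValueError("det(A A0^{-1}) must be positive")
    return float(np.trace(M) - logdet - m)

def prior_matrix(X, use_covariance=False):
    """Return A0: inverse sample covariance, or identity (Euclidean)."""
    if use_covariance:
        # rowvar=False treats each column of X as one variable.
        return np.linalg.inv(np.cov(X, rowvar=False))
    return np.eye(X.shape[1])

# Toy data (hypothetical): five points in R^2.
X = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.5], [3.0, 1.0], [4.0, 3.0]])
A0_euclid = prior_matrix(X)
A0_cov = prior_matrix(X, use_covariance=True)

d_same = burg_divergence(A0_euclid, A0_euclid)       # identical matrices
d_diff = burg_divergence(2.0 * np.eye(2), np.eye(2))  # 4 - log(4) - 2
```

The divergence is zero when $A = A_0$ and positive otherwise, which is why it can serve as a measure of how far the learned metric moves from the prior.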
In some cases, particularly if the number of constraints is large, it may not be possible to find a feasible solution for the optimization problem in (3.1). Slack variables may therefore be introduced into (3.1), which allow constraints to be violated at the cost of a penalty. For simplicity, this section reviews the ITML algorithm in its original form only; the slack variable formulation will be discussed in the following sections.

¹ According to [65], it is actually equal to $\tfrac{1}{2} D_{\mathrm{LogDet}}(A, A_0)$; however, we drop the factor $\tfrac{1}{2}$ for ease of presentation.
² The LogDet divergence is a Bregman matrix divergence generated by the Burg entropy of the eigenvalues ($\lambda_i$), i.e. $\phi(A) = -\sum_i \log \lambda_i$, which may be expressed as $\phi(A) = -\log\det A$ [65].
The ITML Optimization Algorithm
To solve the optimization problem in (3.1), ITML uses Bregman's method, proposed in [80, 79], which is based on cyclic Bregman projections: in each iteration the algorithm chooses one constraint and performs a projection so that the current solution satisfies that constraint.
The ITML framework is presented in Algorithm 4 [65, 79]. The algorithm begins with the required initializations. Subsequently, following Bregman's method, for the chosen constraint $(i, j)$ with index $s(i, j)$ (from $S_+$ or $S_-$), the algorithm maintains a non-negative dual variable $\zeta_{ij}$. A dual variable correction is needed to guarantee convergence to a globally optimal solution, as proved in [79]. After solving the system of equations for the projection, whose result is denoted here by $\psi_0$, the algorithm sets $\psi = \min(\zeta_{ij}, \psi_0)$ (Eq. (3.2)) and then updates $\zeta_{ij} = \zeta_{ij} - \psi$ (Eq. (3.4)). The projection itself is performed via the update in Eq. (3.5), where the projection parameter is computed via Eq. (3.3). Note that, unlike an orthogonal projection, the Bregman projection is tailored to the particular function being minimized. This process is repeated by cycling through the constraints [79]. Furthermore, according to [65, 79], for the distance constraints considered here with $d_A(x_i, x_j) \neq 0$, elementary arguments show that there is exactly one solution for $\psi_0$ provided that $l \neq 0$ and $u \neq 0$; this unique solution is the second argument of the minimum in Eq. (3.2). For further details on the algorithm, see [65, 79]. The convergence proof of Bregman's method can be found in [80].
Why ITML?
In ITML [65] the learned distance function is used to improve the accuracy of k-NN classification. In this research we use ITML [65] to incorporate the privileged data (which will be explained in the following sections). The reasons for adopting ITML in particular as the supervised DML method in this research are that (a) it can naturally incorporate prior distances; (b) it can be solved through efficient optimization, avoiding costly computations (e.g. semi-definite programming as in [68]); (c) it is flexible in terms of constraint specification (constraints may also be defined in terms of relative distance comparisons, i.e., $d_A(x_i, x_j) < d_A(x_i, x_k)$ [81]); and (d) it has been generalized to work in kernel space¹ [82], and hence can efficiently handle data with a high-dimensional feature space.

Algorithm 4 The Information Theoretic Metric Learning Approach.
input: $X$, $A_0$, $l$ and $u$
output: $A$, the Mahalanobis matrix
initialize $A = A_0$ and $\zeta_{ij} = 0\ \forall i, j$
construct the (dis)similarity constraints $S_{\pm}$
repeat
    select a constraint $(i, j) \in S_+$ or $(i, j) \in S_-$
    $$\psi = \begin{cases} \min\left(\zeta_{ij},\ \dfrac{1}{d_A(x_i, x_j)} - \dfrac{1}{l}\right) & \text{if } (x_i, x_j) \in S_+, \\[2ex] \min\left(\zeta_{ij},\ \dfrac{1}{u} - \dfrac{1}{d_A(x_i, x_j)}\right) & \text{if } (x_i, x_j) \in S_-, \end{cases} \tag{3.2}$$
    $$\beta = \begin{cases} \dfrac{\psi}{1 - \psi\, d_A(x_i, x_j)} & \text{if } (x_i, x_j) \in S_+, \\[2ex] \dfrac{-\psi}{\psi\, d_A(x_i, x_j) + 1} & \text{if } (x_i, x_j) \in S_-, \end{cases} \tag{3.3}$$
    $$\zeta_{ij} = \zeta_{ij} - \psi, \tag{3.4}$$
    where $x_i$ and $x_j$ are data points associated with one of the (dis)similarity constraints from $S_{\pm}$, $\beta$ is a projection parameter computed by the algorithm and $\zeta_{ij}$ is the corresponding dual variable.
    compute the Bregman projection, via the update
    $$A = A + \beta A (x_i - x_j)(x_i - x_j)^T A, \tag{3.5}$$
until convergence
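The cyclic projection loop of Algorithm 4, using the updates in Eqs. (3.2) to (3.5), can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the reference implementation of [65]: the function name `itml_sketch`, the parameters `n_sweeps` and `tol`, and the stopping rule (stop when the total dual correction in one sweep is negligible) are hypothetical, and constraints are assumed to be given as index pairs into the rows of `X`.

```python
import numpy as np

def itml_sketch(X, S_plus, S_minus, A0, l, u, n_sweeps=100, tol=1e-8):
    """Cyclic Bregman projections for problem (3.1), without slack."""
    A = np.array(A0, dtype=float)  # copy, so A0 is left untouched
    constraints = [(i, j, True) for i, j in S_plus]
    constraints += [(i, j, False) for i, j in S_minus]
    zeta = {(i, j): 0.0 for i, j, _ in constraints}  # dual variables

    for _ in range(n_sweeps):
        total_correction = 0.0
        for i, j, similar in constraints:
            v = X[i] - X[j]
            d = float(v @ A @ v)  # d_A(x_i, x_j) under the current A
            if d <= 0.0:
                continue  # Eq. (3.2) requires d_A(x_i, x_j) != 0
            if similar:
                psi = min(zeta[(i, j)], 1.0 / d - 1.0 / l)   # Eq. (3.2)
                beta = psi / (1.0 - psi * d)                 # Eq. (3.3)
            else:
                psi = min(zeta[(i, j)], 1.0 / u - 1.0 / d)   # Eq. (3.2)
                beta = -psi / (psi * d + 1.0)                # Eq. (3.3)
            zeta[(i, j)] -= psi                              # Eq. (3.4)
            Av = A @ v
            A = A + beta * np.outer(Av, Av)                  # Eq. (3.5)
            total_correction += abs(psi)
        if total_correction < tol:  # assumed stopping rule
            break
    return A

# Toy problem (hypothetical): one similar and one dissimilar pair.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 1.0]])
A = itml_sketch(X, S_plus=[(0, 1)], S_minus=[(0, 2)],
                A0=np.eye(2), l=1.0, u=4.0)
d_sim = float((X[0] - X[1]) @ A @ (X[0] - X[1]))
d_dis = float((X[0] - X[2]) @ A @ (X[0] - X[2]))
```

In this toy problem the two difference vectors are orthogonal, so each violated constraint is projected exactly onto its bound in the first sweep: the similar pair is pulled in to distance $l$ and the dissimilar pair is pushed out to distance $u$. Because $A$ is symmetric, the rank-one update in Eq. (3.5) is written as an outer product of $A(x_i - x_j)$ with itself.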