6.2 Multi-Output Regularized Projection
6.2.3 Discussions
MORP defines a general solution for supervised projection, i.e., minimization of an output-regularized cost function. In general, one can go beyond the Frobenius norm and consider more general costs for X and Y:
$$(1-\beta)\, f(X, Z) + \beta\, g(Y, Z),$$
where f and g define the input-specific and output-specific costs, respectively, with respect to the observation (X or Y) and the projection Z. There may be some parameters involved (like A and B in the Frobenius norm case), and in general there is no analytical solution to this optimization problem. For instance, f could be the matrix 1-norm (as in sparse PCA [88]), and g could be the hinge loss for binary classification problems [72]. For simplicity and tractability we stick to the Frobenius norm in this chapter.
As a natural extension of Problem (6.2), we can have different output sets, say Y1 and Y2, associated with all input data. In this case we can add the reconstruction errors of Y1 and Y2 to the cost function, with different weights β1 and β2:
$$(1-\beta_1-\beta_2)\,\|X - ZA\|^2 + \beta_1 \|Y_1 - ZB_1\|^2 + \beta_2 \|Y_2 - ZB_2\|^2,$$
and potentially Y1 and Y2 could have different intra-correlations. Both output sets can be incorporated into MORP by defining possibly different kernels for Y1 and Y2 and including them in the matrix K. MORP therefore offers an elegant way to take various kinds of supervised information into account, and allows great flexibility and generalization ability.
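As a rough illustration of how several output kernels might enter the matrix K, the sketch below forms a weighted sum that mirrors the weights in the cost function above. The function name `combine_kernels`, the linear placeholder kernels, and the choice of a plain weighted sum are assumptions made for this example, not something the text prescribes; kernel centering is omitted for brevity.

```python
import numpy as np

def combine_kernels(K_x, K_ys, betas):
    """Return (1 - sum(betas)) * K_x + sum_i betas[i] * K_ys[i].

    Hypothetical helper: assumes each output set contributes its own
    kernel matrix, weighted like its reconstruction error in the cost.
    """
    betas = np.asarray(betas, dtype=float)
    if len(K_ys) != len(betas):
        raise ValueError("need exactly one weight per output kernel")
    if np.any(betas < 0) or betas.sum() > 1 + 1e-12:
        raise ValueError("weights must be non-negative and sum to at most 1")
    K = (1.0 - betas.sum()) * K_x
    for beta_i, K_y in zip(betas, K_ys):
        K = K + beta_i * K_y
    return K

# Toy data: 6 points, one input set and two output sets.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y1 = rng.normal(size=(6, 3))
Y2 = rng.normal(size=(6, 2))

# Linear kernels are used here only as placeholders; each output set
# could use a different kernel function.
K = combine_kernels(X @ X.T, [Y1 @ Y1.T, Y2 @ Y2.T], betas=[0.3, 0.2])
```

With a single output set and one weight β, this reduces to the combination (1−β)Kx + βKy used in the simplified dual algorithm.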
It is also possible to regularize the normal PCA projection using other types of supervised information, such as a hierarchy of outputs [43]. This information can be viewed as a different kind of cost in the MORP model.
Computational Issues
MORP solves a generalized eigenvalue problem for M × M matrices in the primal form and for N × N matrices in the dual form, so its computational complexity is similar to that of the unsupervised projections PCA and kernel PCA (see [32] for details of generalized eigenvalue problems). Implementation is straightforward and takes only a few lines of Matlab. The most time-consuming parts of the algorithm are the kernel computations and matrix multiplications, as well as the matrix inversion in the kernel form. In general the running time is quite acceptable: the projection takes less than one minute for about 1000 data points with 500 input features (RBF kernel) and 50 output features (linear kernel).
Parameters
MORP has three tuning parameters: K, β and γ. They should be chosen beforehand and are probably best determined by cross-validation.
K is the dimensionality of the latent space and controls the reconstruction capability of the projection. In linear MORP, K ≤ M holds, while in the kernel case K is upper bounded only by the number of data points. The selection of K depends on the applications built on the learned mapping. A small K is sometimes preferred because it may be sufficient to recover the structure behind the input and output data, and it helps to cut down the computational burden.
Algorithm 6.3 Simplified MORP Algorithm in Dual Form
Require: A set of N data points with input features $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times M}$ and outputs $Y = [y_1, \ldots, y_N]^\top \in \mathbb{R}^{N \times L}$.
Require: Kernel functions $\kappa_x(\cdot,\cdot)$ and $\kappa_y(\cdot,\cdot)$ for input space $\mathcal{X}$ and output space $\mathcal{Y}$.
Require: Projection dimension $K > 0$, parameter $0 \le \beta \le 1$.
1: Calculate two $N \times N$ matrices $(K_x)_{ij} = \kappa_x(x_i, x_j)$, $(K_y)_{ij} = \kappa_y(y_i, y_j)$.
2: Centralize the kernel matrices $K_x$ and $K_y$ using (4.3).
3: Calculate $\mathbf{K} = (1-\beta) K_x + \beta K_y$.
4: Solve the eigenvalue problem $\mathbf{K} z = \lambda z$ and obtain eigenvectors $z_1, \ldots, z_K$ with the largest eigenvalues $\lambda_1 \ge \ldots \ge \lambda_K$.
5: Calculate $\alpha_k = K_x^{-1} z_k$, $k = 1, \ldots, K$.
Output: Projection function for the k-th dimension as $\psi_k(x) = \sqrt{\lambda_k}\, k(X, x)^\top \alpha_k$, $k = 1, \ldots, K$, where $k(X, x) := [\kappa_x(x_1, x), \ldots, \kappa_x(x_N, x)]^\top$ is centralized via (4.4).

β satisfies 0 ≤ β ≤ 1 and controls the trade-off between the reconstruction errors of X and Y. The special cases β = 0 and β = 1 will be discussed in Section 6.3. In real-world applications, the value of β should depend on the quality of the input and output with respect to the learning task. If the input data are already sufficient or many of the outputs are missing, β should be relatively small; on the other hand, if the correlations among the multiple outputs are very strong, β is usually large, forcing the mapping to align more with the principal components of Y. β should also take into account the balance of the traces of Kx and Ky, since otherwise the matrix with the larger eigenvalues will dominate the matrix sum. In our experience, β = 0.5 is normally a good choice after we balance the traces of Kx and Ky to be the same. The performance comparison in the next section gives more details.
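The trace balancing mentioned for β can be sketched as follows. The text does not say how the traces are equalized, so rescaling Ky to the trace of Kx is an assumption of this example (normalizing both matrices to a common trace would serve equally well), and `balance_and_mix` is a hypothetical helper name.

```python
import numpy as np

def balance_and_mix(K_x, K_y, beta=0.5):
    """Rescale K_y to the trace of K_x, then mix with weight beta.

    Assumption: balancing is done by a single scalar factor on K_y.
    """
    tr_x, tr_y = np.trace(K_x), np.trace(K_y)
    if tr_x <= 0 or tr_y <= 0:
        raise ValueError("kernel matrices must have positive trace")
    K_y_scaled = K_y * (tr_x / tr_y)   # now trace(K_y_scaled) == trace(K_x)
    return (1.0 - beta) * K_x + beta * K_y_scaled

# Toy kernels on very different scales: without balancing, K_y would
# dominate the sum regardless of a moderate beta.
K_x = np.array([[4.0, 1.0], [1.0, 4.0]])
K_y = np.array([[200.0, 50.0], [50.0, 300.0]])
K = balance_and_mix(K_x, K_y, beta=0.5)
```

After balancing, the trace of the mixed matrix equals the trace of Kx for any β, so β alone determines the relative influence of inputs and outputs.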
The non-negative scalar γ is set to prevent overfitting of the mapping functions. We found in our experiments that the quality of the mappings is insensitive to γ as long as it is not too large. This is especially true for the dual form solution, because for positive definite matrices Kx and K, the matrix Q in Algorithm 6.2 is already very stable. Therefore we fixed γ to 0 for simplicity. In this case the whole learning algorithm can be greatly simplified, and MORP can be viewed as a slight modification of the kernel PCA algorithm. It is summarized in Algorithm 6.3 for clarity.
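A minimal NumPy sketch of the steps of Algorithm 6.3 is given below, under several assumptions that are not taken from the text: Equations (4.3) and (4.4) are assumed to be the usual kernel PCA centering formulas, the input kernel is an RBF kernel and the output kernel is linear, and the inverse in step 5 is replaced by a pseudo-inverse because a centered kernel matrix has a zero eigenvalue. All function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, width=1.0):
    """Gaussian RBF kernel between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * width**2))

def center_train(K):
    """Double-center a training kernel: H K H with H = I - 11^T / N
    (assumed to correspond to Eq. (4.3))."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def center_test(K_test, K_train):
    """Center test-vs-train kernel values, rows indexing test points
    (assumed to correspond to Eq. (4.4)); K_train is the uncentered kernel."""
    col_mean = K_train.mean(axis=0, keepdims=True)  # mean over training points
    row_mean = K_test.mean(axis=1, keepdims=True)   # mean over each test row
    return K_test - col_mean - row_mean + K_train.mean()

def morp_dual_fit(X, Y, n_components, beta=0.5, width=1.0):
    Kx_raw = rbf_kernel(X, X, width)               # step 1 (input kernel)
    Kx = center_train(Kx_raw)                      # step 2
    Ky = center_train(Y @ Y.T)                     # steps 1-2 (linear output kernel)
    K = (1.0 - beta) * Kx + beta * Ky              # step 3
    lam, Z = np.linalg.eigh(K)                     # step 4 (ascending order)
    order = np.argsort(lam)[::-1][:n_components]   # keep the largest eigenvalues
    lam, Z = lam[order], Z[:, order]
    # Step 5, with a pseudo-inverse in place of the inverse (assumption).
    alpha = np.linalg.pinv(Kx, rcond=1e-10, hermitian=True) @ Z
    return {"X": X, "Kx_raw": Kx_raw, "alpha": alpha, "lam": lam, "width": width}

def morp_dual_project(model, X_new):
    """Evaluate psi_k(x) = sqrt(lambda_k) * k(X, x)^T alpha_k for each row of X_new."""
    K_new = rbf_kernel(X_new, model["X"], model["width"])
    K_new = center_test(K_new, model["Kx_raw"])
    return K_new @ model["alpha"] * np.sqrt(np.maximum(model["lam"], 0.0))

# Toy usage on random data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
Y = rng.normal(size=(30, 2))
model = morp_dual_fit(X, Y, n_components=3, beta=0.5)
P_train = morp_dual_project(model, X)                       # shape (30, 3)
P_new = morp_dual_project(model, rng.normal(size=(5, 4)))   # shape (5, 3)
```

The outputs Y are needed only when fitting; projecting a new point uses the input kernel alone, which is what allows the learned mapping to be applied to data without outputs.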