5.4 Sparse Kernel Canonical Correlation Analysis
5.5.4 Content-Based Image Retrieval
Content-based image retrieval (CBIR) is a challenging aspect of multimedia analysis and has become popular in the past few years. Generally, CBIR is the problem of searching for digital images in large databases by their visual content (e.g., color, texture, shape) rather than by metadata such as keywords, labels, and descriptions associated with the images. Kernel CCA has previously been used for image retrieval [72, 73]. In this section, we apply our sparse kernel CCA approach to a content-based image retrieval task by combining image and text data.
We experimented on the Ground Truth Image Database created at the University of Washington, which consists of 21 data sets of outdoor scene images. The data set is available at http://www.cs.washington.edu/research/imagedatabase/groundtruth/. In our experiment we used 852 images from 19 data sets that have been annotated with keywords. We exploited text features and low-level image features, including color and texture, and applied sparse kernel CCA to perform image retrieval from text queries. We used 217 images as training data and the rest as testing data.
Text Features We used the bag-of-words approach, as in the cross-language document retrieval experiment, to represent the text associated with images. Since each image in the data set has been annotated with keywords, we treat the terms associated with an image as a document. After removing stop-words and stemming, we obtain a term-document matrix of size 189 × 852 for the Ground Truth image data set.
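As a rough illustration of the bag-of-words representation, the following Python sketch builds a term-document count matrix from keyword annotations. The function name `term_document_matrix`, the toy annotations, and the one-word stop list are hypothetical, and stemming is omitted; this is not the preprocessing pipeline used in the experiment.

```python
import numpy as np

def term_document_matrix(docs, stop_words=frozenset()):
    """Return (vocab, X) where X[t, d] counts occurrences of term t in document d."""
    tokenized = [[w.lower() for w in doc.split() if w.lower() not in stop_words]
                 for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for d, toks in enumerate(tokenized):
        for w in toks:
            X[index[w], d] += 1
    return vocab, X

# Toy keyword annotations (hypothetical, not taken from the data set).
docs = ["sky tree the building", "tree grass", "sky water the boat"]
vocab, X = term_document_matrix(docs, stop_words={"the"})
```

Each column of `X` plays the role of one image's keyword "document"; with the real annotations the matrix would have one row per stemmed term and one column per image.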
We applied Gabor filters to extract texture features and used the HSV (hue-saturation-value) color representation as color features. To enhance sensitivity to the overall shape, we divided each image into 8 × 8 = 64 patches, from which texture and color features were extracted.
Texture Features The Gabor filter in the spatial domain is given by
\[
g_{\lambda,\theta,\psi,\sigma,\gamma}(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi \frac{x'}{\lambda} + \psi\right), \tag{5.51}
\]
where $x' = x\cos(\theta) + y\sin(\theta)$, $y' = -x\sin(\theta) + y\cos(\theta)$, and x and y specify the position of a light impulse. In this equation, λ represents the wavelength of the cosine factor, θ represents the orientation of the normal to the parallel stripes of a Gabor function in degrees, ψ is the phase offset of the cosine factor in degrees, γ is the spatial aspect ratio and σ is the standard deviation of the Gaussian. Figure 5.3 shows the Gabor filter impulse responses used in this experiment. From each of the 64 image patches, the Gabor filters extract 16 texture features, which results in a total of 64 × 16 = 1024 texture features for each image.
Figure 5.3: Gabor filters used to extract texture features. Four frequencies f = 1/λ = [0.15, 0.2, 0.25, 0.3] and four directions θ = [0, π/4, π/2, 3π/4] are used. The width of the filters is σ = 4.
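To make the filter definition in (5.51) concrete, the sketch below evaluates it on a square pixel grid and builds a bank from the four frequencies and four directions listed for Figure 5.3. The function name `gabor_kernel`, the 25 × 25 support, and the choices γ = 1 and ψ = 0 are assumptions made for illustration only; angles are taken in radians, and how the filter responses are reduced to 16 features per patch is not shown.

```python
import numpy as np

def gabor_kernel(size, lam, theta, psi=0.0, sigma=4.0, gamma=1.0):
    """Sample the Gabor function of (5.51) on a size x size grid centred at 0."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotated coordinates x' and y'.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + gamma ** 2 * y_r ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_r / lam + psi)
    return envelope * carrier

frequencies = [0.15, 0.2, 0.25, 0.3]               # f = 1 / lambda
thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
bank = [gabor_kernel(25, 1.0 / f, t) for f in frequencies for t in thetas]
```

The bank holds 4 × 4 = 16 kernels, one per frequency and direction pair.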
Color Features We used the HSV color representation as color features. Each color component was quantized into 16 bins, and each image patch was represented by 3 normalized color histograms. This gives 48 features for each of the 64 patches, which eventually results in 48 × 64 = 3072 features for each image.
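The feature count can be checked with a small sketch. It assumes the image is already converted to HSV with all three channels scaled to [0, 1], that patches come from an 8 × 8 grid, and that "normalized" means each histogram sums to one; the function name `hsv_patch_histograms` and these conventions are illustrative assumptions, not details confirmed by the experiment.

```python
import numpy as np

def hsv_patch_histograms(hsv, grid=8, bins=16):
    """Concatenate per-patch, per-channel normalized histograms of an HSV image."""
    h, w, _ = hsv.shape
    rows = np.array_split(np.arange(h), grid)
    cols = np.array_split(np.arange(w), grid)
    feats = []
    for r in rows:
        for c in cols:
            patch = hsv[r[0]:r[-1] + 1, c[0]:c[-1] + 1, :]
            for ch in range(3):
                hist, _ = np.histogram(patch[..., ch], bins=bins, range=(0.0, 1.0))
                feats.append(hist / hist.sum())  # each histogram sums to 1
    return np.concatenate(feats)

# Random stand-in for an HSV image (hypothetical data).
rng = np.random.default_rng(0)
hsv = rng.random((64, 96, 3))
features = hsv_patch_histograms(hsv)
```

For any image large enough to be split into 64 non-empty patches, the result has 64 × 3 × 16 = 3072 entries, matching the count above.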
Following previous work [72, 73], we used the Gaussian kernel
\[
k_x(I_i, I_j) = \exp\left(-\frac{\|I_i - I_j\|^2}{2\sigma^2}\right),
\]
where $I_i$ is a vector concatenating the texture features and color features of the ith image and σ is selected as the minimum distance between different images, to compute the kernel matrix $K_x$ for the first view. The linear kernel (5.28) was employed to compute the kernel matrix $K_y$ using text features for the other view.
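A minimal sketch of the Gaussian kernel matrix is given below. It reads "minimum distance between different images" as the smallest Euclidean distance between feature vectors of two distinct images, which is an interpretation rather than a confirmed detail; the function name `gaussian_kernel_matrix` is hypothetical. If two images had identical feature vectors, σ would be zero and this rule would need adjusting.

```python
import numpy as np

def gaussian_kernel_matrix(F):
    """F has one feature vector per row; returns (K, sigma)."""
    sq = np.sum(F ** 2, axis=1)
    # Squared pairwise distances, clipped at 0 to guard against round-off.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * F @ F.T, 0.0)
    off_diag = D2[~np.eye(len(F), dtype=bool)]
    sigma = np.sqrt(off_diag.min())  # smallest distance between distinct rows
    K = np.exp(-D2 / (2 * sigma ** 2))
    return K, sigma

# Three toy feature vectors with pairwise distances 5, 10 and 5.
F = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
K, sigma = gaussian_kernel_matrix(F)
```

In the toy example the two closest points are at distance 5, so σ = 5 and the corresponding kernel entry is exp(−1/2).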
In Table 5.4, we compare the performance of CCA, KCCA, RKCCA and SKCCA. As in the cross-language document retrieval experiments, we used AROC to evaluate the performance of these algorithms. We see from Table 5.4 that both RKCCA and SKCCA outperform CCA and KCCA, and RKCCA achieves the best performance in terms of AROC. The dual projections $W_x$ and $W_y$ computed by SKCCA are highly sparse, which can substantially reduce the time needed to compute the projection of a new data point in practice, as we only need to evaluate kernel functions between the new data point and a small subset of the training data. Moreover, SKCCA also obtains a larger sum of canonical correlations between testing data than the other approaches.
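The computational benefit of sparse dual projections can be sketched as follows: kernel values are evaluated only for training points whose row in the dual projection matrix is nonzero. The names `project_sparse` and `gaussian_kernel`, the random data, and the omission of kernel centering are simplifying assumptions made for this illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def project_sparse(W, X_train, x_new, kernel):
    """Project x_new using only training points with a nonzero row in W."""
    support = np.flatnonzero(np.any(W != 0, axis=1))
    k = np.array([kernel(X_train[i], x_new) for i in support])
    return W[support].T @ k, support

rng = np.random.default_rng(0)
X_train = rng.standard_normal((5, 3))   # 5 training points (hypothetical)
x_new = rng.standard_normal(3)
W = np.zeros((5, 2))                     # sparse dual projections, 2 columns
W[1] = [0.5, -1.0]
W[3] = [2.0, 0.25]
z, support = project_sparse(W, X_train, x_new, gaussian_kernel)
```

Here only two of the five kernel evaluations are needed, and the result agrees with the dense computation that evaluates the kernel against every training point.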
Table 5.4: Content-based image retrieval using CCA, KCCA, RKCCA and SKCCA.
Algorithm  AROC   Corr   Sparsity      Err(Wx)     Err(Wy)     Regularizer
UW ground truth data: 217 training data; l = 124
CCA        73.96  11.53  (0, 7.7)      6.832e-15   7.638e-15   -
KCCA       82.59  19.09  (0, 0)        3.1310e-15  6.8237e-14  -
RKCCA      85.67  21.96  (0, 0)        9.9861e-1   9.9052e-1   (2.7e+3, 1.7e+2)
SKCCA      83.17  25.55  (91.1, 95.1)  9.5920e-1   9.7713e-1   (0.51, 0.54)
In Figure 5.4, we plot the AROC of CCA, KCCA, RKCCA and SKCCA as a function of the number of projections used (i.e., different l). We observe that the AROC of SKCCA is at first smaller than that of kernel CCA and then exceeds it. This indicates that when a suitable number of dual projections is used for retrieval, SKCCA can improve the performance of kernel CCA.
Figure 5.4: Content-based image retrieval using CCA, KCCA, RKCCA and SKCCA on UW ground truth data with 217 training data. Horizontal axis: number of columns of (Wx, Wy) used; vertical axis: average AROC.
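For readers unfamiliar with the measure, the area under the ROC curve for a single query can be computed from retrieval scores and binary relevance labels via the pairwise-ranking (Mann-Whitney) formulation, as in the sketch below. The function name `aroc` and the toy scores are hypothetical, and how scores are derived from the projections in the experiments is not shown here.

```python
import numpy as np

def aroc(scores, relevant):
    """Fraction of (relevant, irrelevant) pairs ranked correctly; ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(relevant, dtype=bool)
    pos, neg = scores[relevant], scores[~relevant]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Four retrieved items; the first and third are relevant.
value = aroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In this toy query three of the four relevant/irrelevant pairs are ordered correctly, so the value is 0.75. Averaging such values over queries gives an average AROC of the kind plotted against l.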
5.6 Conclusions
In this chapter we proposed a novel sparse kernel CCA algorithm called SKCCA. This algorithm is based on a relationship between kernel CCA and least squares problems, which extends a similar relationship between CCA and least squares problems. We incorporated sparsity into kernel CCA by penalizing the $\ell_1$-norm of the dual vectors. The resulting $\ell_1$-regularized minimization problems were solved by a fixed-point continuation (FPC) algorithm. Empirical results show that SKCCA not only performs well in computing sparse dual transformations, but also alleviates the over-fitting problem of kernel CCA.
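The core of such solvers is a fixed-point iteration built on soft-thresholding. The sketch below applies it to a generic problem of the form min 0.5‖Ax − b‖² + λ‖x‖₁; it omits the continuation strategy of FPC, and the problem scaling, the function names, and the step-size rule are assumptions made for illustration rather than the exact formulation used for SKCCA.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise shrinkage: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_least_squares(A, b, lam, n_iter=500):
    """Fixed-point shrinkage iteration for 0.5*||Ax - b||^2 + lam*||x||_1."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2   # step size from the spectral norm
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - tau * grad, tau * lam)
    return x

# With A = I the minimizer is soft_threshold(b, lam).
A = np.eye(2)
b = np.array([3.0, 0.5])
x = l1_least_squares(A, b, lam=1.0)
```

With the identity matrix the minimizer is available in closed form, and the small second component of `b` is set exactly to zero, which is the mechanism that produces sparse dual vectors.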
Although we did not discuss it in this chapter, the relationship between CCA and least squares problems described in Theorem 5.1 can be exploited to design new sparse CCA algorithms by adding sparsity-inducing penalties to the objective functions. This is left for future research.
Several interesting questions and extensions of sparse kernel CCA also remain. In many applications, such as genomic data analysis, CCA is often performed on more than two data sets. It would be helpful to extend sparse kernel CCA to deal with multiple data sets. In the derivation of SKCCA, we did not discuss the choice of kernel function. However, the performance of kernel CCA is believed to depend on the choice of kernel. In future research, we plan to study the problem of finding the optimal kernel for kernel CCA in different applications. Moreover, we plan to generalize the idea of sparse kernel CCA in this chapter to involve multiple kernels.
Chapter 6 Conclusions
6.1 Summary of Contributions
In this thesis, we have considered several sparse dimensionality reduction methods for high-dimensional data analysis and their applications in various fields. Now, we summarize the contributions as follows.
First, we studied a sparse version of uncorrelated linear discriminant analysis, which is an important generalization of LDA. We parameterized all solutions of the generalized ULDA by solving the optimization problem proposed in [160], and then proposed a novel model, named SULDA, for computing a sparse ULDA transformation matrix. The main idea of this model is to select the minimum $\ell_1$-norm solution from the general solution set, which leads to a basis pursuit problem. We applied the accelerated linearized Bregman iterative method to solve this optimization problem. Experimental results demonstrate that SULDA consistently outperforms its competitors in terms of classification accuracy and achieves comparable interpretability (sparsity and number of used variables). The resulting sparse transformation can also be used to visualize observations and inspect class discrimination in the low-dimensional space.
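To give a feel for this class of solvers, the sketch below implements the plain (non-accelerated) linearized Bregman iteration for the vector problem min ‖x‖₁ subject to Ax = b. It is a simplified stand-in for the accelerated method used for SULDA, where the unknown is a matrix; the function names and the parameter values μ = 10 and δ = 0.1 are assumptions chosen for the toy example.

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise shrinkage: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def linearized_bregman(A, b, mu, delta, n_iter=200):
    """Plain linearized Bregman iteration for min ||x||_1 s.t. Ax = b."""
    x = np.zeros(A.shape[1])
    v = np.zeros(A.shape[1])
    for _ in range(n_iter):
        v = v + A.T @ (b - A @ x)            # accumulate the residual
        x = delta * soft_threshold(v, mu)    # shrink to promote sparsity
    return x

# Toy constraint x1 + 2*x2 = 2, whose minimum l1-norm solution is (0, 1).
A = np.array([[1.0, 2.0]])
b = np.array([2.0])
x = linearized_bregman(A, b, mu=10.0, delta=0.1)
```

Among all points on the line x₁ + 2x₂ = 2, the iteration settles on (0, 1), which has the smallest $\ell_1$-norm and uses only one variable.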
In our second contribution, we made a new and systematic study of CCA. We first revealed the equivalence between the recursive formulation and the trace formulation of the multiple-projection CCA problem. Based on this equivalence, we adopted the trace formulation as the criterion of CCA and obtained an explicit characterization of all solutions of the multiple-projection CCA problem, even when the sample covariance matrices are singular. We also established the equivalence between uncorrelated linear discriminant analysis and the CCA problem.
Based on the explicit characterization of general solutions of CCA, we have developed a novel sparse CCA algorithm, named $\mathrm{SCCA}_{\ell_1}$. Compared with existing sparse CCA algorithms, our proposed method has the following properties:
1. Sparse projections are computed by solving two $\ell_1$-minimization problems, one for each set of variables. The accelerated linearized Bregman iterative method can be applied to these optimization problems, which makes our algorithm easy to implement.
2. The orthogonality constraints on canonical variables are well satisfied, which implies that the canonical variables are mutually uncorrelated. Multiple sparse projections are computed simultaneously, whereas other approaches have to compute multiple sparse projections recursively. As a consequence, for those approaches the orthogonality constraints on canonical variables may not be satisfied even when normalization is adopted, and the associated canonical variables are not mutually uncorrelated.
Finally, we focused on designing an efficient algorithm for sparse kernel CCA. We studied sparse kernel CCA by using established results on CCA, aiming at computing sparse dual transformations and alleviating the over-fitting problem of kernel CCA simultaneously. We first established a relationship between CCA and least squares problems and extended this relationship to kernel CCA. Then, based on this relationship, we incorporated sparsity into kernel CCA by penalizing the least squares term with the $\ell_1$-norm and proposed a novel sparse kernel CCA algorithm, named SKCCA. Numerical results of applying the newly proposed algorithm to various applications and comparative results with kernel CCA and regularized kernel CCA were also presented. Empirical results show that SKCCA not only performs well in computing sparse dual transformations, but also alleviates the over-fitting problem of kernel CCA.
6.2 Future Work
There are still many interesting problems that will lead to further research in the field of sparse dimensionality reduction. The first is to design efficient algorithms for $\ell_1$-minimization problems over the manifold of column-orthogonal matrices (the Stiefel manifold). When we selected the minimum $\ell_1$-norm solution from the solution set, the parameter denoting an arbitrary orthogonal matrix was fixed at the identity matrix. Thus, the optimal solution was computed over a subset and may not be globally optimal. Minimizing an objective function with orthogonality constraints is a challenging problem that has attracted increasing attention. A further study of $\ell_1$-minimization problems with orthogonality constraints should be helpful for designing efficient algorithms in high-dimensional data analysis.
We used the accelerated linearized Bregman iterative method to solve the $\ell_1$-minimization problems in sparse ULDA and sparse CCA, and the FPC method to solve the penalized least squares problems in sparse kernel CCA. There may exist faster methods, but since our sparse algorithms (SULDA, $\mathrm{SCCA}_{\ell_1}$ and SKCCA) are frameworks, the accelerated linearized Bregman iterative method and FPC can be replaced by other solvers. In addition, in this thesis we only considered sparse CCA and sparse kernel CCA for two sets of data; however, as can be seen from the derivation of our algorithms, the same idea can be generalized to handle multiple sets of data. Given the efficiency and robustness of SKCCA in finding nonlinear relations among high-dimensional data, it would be fruitful to apply it in other areas such as fMRI data analysis and automatic image annotation.
Another important direction is to study structured sparsity in sparse dimensionality reduction methods. As can be seen from the experimental results of SULDA, although the computed transformation matrix is highly sparse, the number of used variables is relatively high. This can still cause problems when interpreting the extracted features. In some applications, such as genome-wide association studies, structural information (e.g., group structure) can greatly facilitate the interpretation of the results obtained. Thus, it is worthwhile to study structured sparse dimensionality reduction methods.
Bibliography
[1] S. Akaho, A kernel method for canonical correlation analysis, in Proceedings of the International Meeting of the Psychometric Society, 2001.
[2] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, Wiley, 3 ed., 2003.
[3] F. R. Bach, R. Jenatton, J. Mairal, and G. Obozinski, Optimization with sparsity-inducing penalties, Foundations and Trends in Machine Learning, 4 (2011), pp. 1–106.
[4] F. R. Bach and M. I. Jordan, Kernel independent component analysis, Journal of Machine Learning Research, 3 (2003), pp. 1–48.
[5] F. R. Bach and M. I. Jordan, A probabilistic interpretation of canonical correlation analysis, tech. report, University of California, Berkeley, 2005.
[6] S. Balakrishnan, K. Puniyani, and J. D. Lafferty, Sparse additive functional and kernel CCA, in Proceedings of the 29th International Conference on Machine Learning, 2012.
[7] P. Baldi and G. Hatfield, DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, Cambridge University Press, 2002.
[8] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202.
[9] S. Becker, J. Bobin, and E. J. Candès, NESTA: A fast and accurate first-order method for sparse recovery, SIAM Journal on Imaging Sciences, 4 (2011), pp. 1–39.
[10] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19 (1997), pp. 711–720.
[11] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation, 15 (2003), pp. 1373–1396.
[12] M. W. Berry, S. T. Dumais, and G. W. O’Brien, Using linear algebra for intelligent information retrieval, SIAM Review, 37 (1995), pp. 573–595.
[13] P. J. Bickel and E. Levina, Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations, Bernoulli, 10 (2004), pp. 989–1010.
[14] T. D. Bie, N. Cristianini, and R. Rosipal, Eigenproblems in pattern recognition, in Handbook of Geometric Computing: Applications in Pattern Recognition, Computer Vision, Neuralcomputing, and Robotics, Springer, 2005, pp. 129–170.
[15] C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.
[16] Å. Björck and G. H. Golub, Numerical methods for computing angles between linear subspaces, Mathematics of Computation, 27 (1973), pp. 579–594.
[17] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30 (1997), pp. 1145–1159.
[18] L. M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Computational Mathematics and Mathematical Physics, 7 (1967), pp. 200–217.
[19] C. J. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2 (1998), pp. 121–167.
[20] T. Cacoullos, Estimation of a multivariate density, Annals of the Institute of Statistical Mathematics, 18 (1966), pp. 179–189.
[21] J. Cai, S. Osher, and Z. Shen, Convergence of the linearized Bregman iteration for $\ell_1$-norm minimization, Mathematics of Computation, 78 (2009), pp. 2127–2136.
[22] J. Cai, S. Osher, and Z. Shen, Linearized Bregman iterations for compressed sensing, Mathematics of Computation, 78 (2009), pp. 1515–1536.
[23] E. Candès and B. Recht, Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9 (2009), pp. 717–772.
[24] E. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory, 52 (2006), pp. 489–509.
[25] E. Candès, J. Romberg, and T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics, 59 (2006), pp. 1207–1223.
[26] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning, MIT Press, 2006.
[27] L. Chen, H. M. Liao, M. Ko, J. Lin, and G. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognition, 33 (2000), pp. 1713–1726.
[28] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM Review, 43 (2001), pp. 129–159.
[29] X. Chen and H. Liu, An efficient optimization algorithm for structured sparse CCA, with applications to eQTL mapping, Statistics in Biosciences, 4 (2012), pp. 3–26.
[30] K. Chin, S. DeVries, J. Fridlyand, et al., Genomic and transcriptional aberrations linked to breast cancer pathophysiologies, Cancer Cell, 10 (2006), pp. 529–541.
[31] D. Chu and S. T. Goh, A new and fast implementation for null space based linear discriminant analysis, Pattern Recognition, 43 (2010), pp. 1373–1379.
[32] D. Chu and S. T. Goh, A new and fast orthogonal linear discriminant analysis on undersampled problems, SIAM Journal on Scientific Computing, 32 (2010), pp. 2274–2297.
[33] D. Chu, S. T. Goh, and Y. S. Hung, Characterization of all solutions for undersampled uncorrelated linear discriminant analysis problems, SIAM Journal on Matrix Analysis and Applications, 32 (2011), pp. 820–844.
[34] D. Chu, L.-Z. Liao, and M. K. Ng, Sparse orthogonal linear discriminant analysis, SIAM Journal on Scientific Computing, 34 (2012), pp. A2421–A2443.
[35] D. Chu, L.-Z. Liao, M. K. Ng, and X. Zhang, Sparse canonical correlation analysis: New formulation and algorithm. Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
[36] D. Chu, L.-Z. Liao, M. K. Ng, and X. Zhang, Sparse kernel canonical correlation analysis, in International MultiConference of Engineers and Computer Scientists, 2013.
[37] D. Chu and X. Zhang, Sparse uncorrelated linear discriminant analysis, in the 30th International Conference on Machine Learning, 2013.
[38] L. Clemmensen, T. Hastie, D. Witten, and B. Ersbøll, Sparse discriminant analysis, Technometrics, 53 (2011), pp. 406–413.
[39] A. d’Aspremont, F. R. Bach, and L. El Ghaoui, Optimal solutions for sparse principal component analysis, Journal of Machine Learning Research, 9 (2008), pp. 1269–1294.
[40] A. d’Aspremont, L. El Ghaoui, M. Jordan, and G. Lanckriet, A direct formulation of sparse PCA using semidefinite programming, SIAM Review, 49 (2007), pp. 434–448.
[41] R. Datta, D. Joshi, J. Li, and J. Z. Wang, Image retrieval: Ideas, influences, and trends of the new age, ACM Computing Surveys, 40 (2008).
[42] J. Dauxois and G. M. Nkiet, Nonlinear canonical analysis and independence tests, The Annals of Statistics, 26 (1998), pp. 1254–1278.
[43] T. De Bie and B. De Moor, On the regularization of canonical correlation analysis, in 4th International Symposium on Independent Component Analysis and Blind Signal Separation, 2003.
[44] M. Dettling, BagBoosting for tumor classification with gene expression data, Bioinformatics, 20 (2004), pp. 3583–3593.
[45] C. Dhanjal, Sparse Kernel Feature Extraction, PhD thesis, University of Southampton, 2008.
[46] C. Dhanjal, S. Gunn, and J. Shawe-Taylor, Efficient sparse kernel feature extraction based on partial least squares, IEEE Transactions on Pattern Analysis and Machine Intelligence, 99 (2008), pp. 1347–1361.
[47] D. L. Donoho, High-dimensional data analysis: the curses and blessings of dimensionality, in Mathematical Challenges of the 21st Century, American Mathematical Society, 2000.
[48] D. L. Donoho, Compressed sensing, IEEE Transactions on Information Theory, 52 (2006), pp. 1289–1306.
[49] D. L. Donoho, For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution, Communications on Pure and Applied Mathematics, 59 (2006), pp. 797–829.
[50] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley Interscience, 2 ed., 2000.
[51] S. Dudoit, J. Fridlyand, and T. P. Speed, Comparison of discriminant methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97 (2002), pp. 77–87.
[52] M. Dundar, G. Fung, J. Bi, S. Sathyakama, and B. Rao, Sparse