THE FLORIDA STATE UNIVERSITY
COLLEGE OF ENGINEERING
KERNEL METHODS AND COMPONENT ANALYSIS FOR PATTERN RECOGNITION
By
JASON C. ISAACS
A Dissertation submitted to the
Department of Electrical and Computer Engineering in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
Degree Awarded:
Spring Semester, 2007
Copyright © 2007 Jason C. Isaacs All Rights Reserved
The members of the Committee approve the dissertation of Jason C. Isaacs defended on April 12, 2007.
______________________________
Simon Y. Foo
Professor Co-Directing Dissertation
______________________________
Anke Meyer-Baese
Professor Co-Directing Dissertation
______________________________
Xiuwen Liu
Outside Committee Member
______________________________
Amy Chan-Hilton
Outside Committee Member
Approved:
_____________________________________________
Victor DeBrunner, Chair, Electrical and Computer Engineering
_____________________________________________
C. J. Chen, Dean, College of Engineering
The Office of Graduate Studies has verified and approved the above named committee members.
For my parents and Melissa and Dylan, thank you for the support.
ACKNOWLEDGEMENT
I would like to thank Dr. Simon Foo and Dr. Anke Meyer-Baese for their help. I would also like to thank the people in my lab for all of the great discussions; they were most helpful.
TABLE OF CONTENTS
LIST OF TABLES... xii
LIST OF FIGURES ... xvii
ABSTRACT... xxvii
CHAPTER 1 ... 1
INTRODUCTION ... 1
1. Motivations ... 1
2. Thesis Overview ... 2
CHAPTER 2 ... 4
KERNELS FOR PATTERN RECOGNITION ... 4
1. Introduction... 4
2. Kernel Methods for Pattern Classification... 5
2.1 Kernel Methods... 5
2.2 Inner Product Space ... 8
2.2.1 Definitions... 8
2.2.2 Norms on Inner Product Spaces... 12
2.2.3 Operators on Inner Product Spaces... 15
2.3 Degenerate Inner Products... 16
2.4 Mercer’s Theorem... 16
2.5 Continuous Distribution Function Kernels ... 18
2.5.1 Lévy Distribution and Kernel Function ... 18
2.5.2 Log-normal Distribution and Kernel Function ... 19
2.5.3 Kumaraswamy Distribution and Kernel Function ... 20
2.5.4 Rice Distribution and Kernel Function ... 21
2.5.5 Rayleigh Distribution and Kernel Function... 21
2.5.6 Erlang Distribution and Kernel Function... 22
2.5.7 Chi-squared Distribution and Kernel Function... 23
2.5.8 Von Mises Distribution and Kernel Function... 24
2.5.9 Bessel Distribution and Kernel Function... 25
2.5.10 Maxwell-Boltzmann Distribution and Kernel Function ... 26
2.5.11 Gumbel and Fisher-Tippett Distributions and Kernel Functions... 26
2.5.12 Laplace Distribution and Kernel Function... 27
2.5.13 Fermi-Dirac Distribution and Kernel Function... 28
2.6 Lattice Oscillations Model Kernel Functions ... 29
2.6.1 Einstein Functions... 29
2.6.2 Debye Functions ... 30
2.7 Orthogonal Polynomial Kernel Functions ... 31
2.7.1 Chebyshev Polynomials... 32
2.7.2 Gegenbauer Polynomials ... 33
2.7.3 Legendre Polynomials ... 34
2.7.4 Laguerre Polynomials ... 35
2.8 Kelvin Function Kernels ... 36
CHAPTER 3 ... 39
KERNEL PROJECTION ANALYSIS... 39
1. Introduction... 39
2. Component Analysis... 40
2.1 Principal Component Analysis ... 40
2.1.1 Methodology... 41
2.2 Independent Component Analysis ... 45
2.2.1 Methodology... 47
2.3 Fisher Linear Discriminant Analysis ... 49
3. Kernel Projection Analysis ... 51
3.1 Introduction... 51
3.2 Kernel Principal Component Analysis... 51
3.3 Kernel Independent Component Analysis ... 53
CHAPTER 4 ... 55
EXPERIMENTAL INVESTIGATIONS... 55
1. Experiments In Pattern Classification... 55
1.1 Cleveland Heart Disease Databases... 55
1.2 Lymphography Domain Database... 57
1.3 E.coli Protein Localization Sites Database ... 60
1.4 BUPA Liver Disorders Database ... 63
1.5 Glass Identification Database ... 66
1.6 Image Segmentation Data ... 69
1.7 Balance Scale Weight & Distance Database ... 71
1.8 Johns Hopkins University Ionosphere Database... 74
1.9 Yeast Protein Site Localization Database ... 77
1.10 Border Crossing/Entry Data... 78
1.11 Wine Recognition Data ... 80
1.12 Iris Plants Database... 83
1.13 Pima Indian Database ... 85
CHAPTER 5 ... 90
GENETIC ALGORITHM OPTIMAL PROJECTION ANALYSIS... 90
1. Introduction... 90
2. Optimization Methods ... 91
2.1 Manifold Learning Mathematical Background... 91
2.1.1 Flag Manifold... 91
2.1.2 Grassmann Manifold ... 91
2.1.3 Stiefel Manifold... 92
2.1.4 Skew-Symmetric Matrices... 92
2.1.5 Lie Groups ... 93
2.1.6 Orthogonal Group ... 93
2.2 Linear and Non-Linear Optimization Techniques ... 95
2.2.1 Techniques ... 96
2.3 Genetic Algorithms... 100
2.3.1 Evolutionary and Biological Background... 100
2.3.2 Selection Strategies For Genetic Algorithms... 103
2.3.3 Boltzmann Selection ... 107
2.3.4 Rank Selection ... 108
2.3.5 Tournament Selection ... 109
2.4 Crossover And Mutation Methods... 109
2.5 Types of Genetic Algorithms... 110
2.6 Optimization with Genetic Algorithms... 111
2.7 Problem Formulation ... 112
2.8 Selection and Reproduction ... 112
2.9 Fitness Function... 113
2.10 Algorithm... 114
3. Experimental Results... 114
3.1 Cleveland Heart Disease Database ... 115
3.2 E.Coli Database Results... 118
3.3 Ionosphere Database Results... 121
3.4 Image Segmentation Database Results ... 124
3.5 Lymphoma Database Results... 126
3.6 Balance Scale Database Results... 129
3.7 Pima Indian Diabetes Database Results ... 132
4. Experimental Analysis... 135
CHAPTER 6 ... 136
KERNEL EIGENFACE ANALYSIS... 136
1. Introduction... 136
2. Wavelet Decomposition... 136
2.1 The Continuous Wavelet Transform... 137
2.2 The Discrete Wavelet Transform... 138
2.3 Wavelet Families ... 138
3. Experiments ... 139
CHAPTER 7 ... 144
STATISTICAL PREDICTIVE ANALYSIS FOR KERNEL USAGE... 144
1. Optimal Selection Preprocessing ... 144
2. Statistical Descriptors ... 144
3. Analysis... 153
CHAPTER 8 ... 154
CONCLUSION AND FUTURE WORK ... 154
1. Conclusion ... 154
2. Contributions of Dissertation... 155
3. Future Work... 155
4. Concluding Remarks... 156
APPENDIX A... 157
1. GAOKPA Convergence Graphs For the Ionosphere Data ... 157
2. GAOKPA Convergence Graphs For the Image Segmentation Data ... 160
3. GAOKPA Convergence Graphs For the Lymphoma Data... 163
4. GAOKPA Convergence Graphs For the Balance Data ... 166
5. GAOKPA Convergence Graphs For the Pima Indian Data... 169
6. GAOKPA Convergence Graphs For the Cleveland Heart Disease Data... 171
7. GAOKPA Convergence Graphs For the E.coli Protein Data ... 174
REFERENCES ... 178
BIOGRAPHICAL SKETCH ... 182
LIST OF TABLES
Table 1. A Comparison Between Classification Rate and Eigenvalue Energy for Principal Component Analysis ... 45
Table 2. KPCA Testing Results for the Distribution and Gas Model Kernels on the Cleveland Heart Disease Data Set (Classification Rate out of 100%) ... 56
Table 3. KICA Testing Results for the Distribution and Gas Model Kernels on the E.coli Protein Localization Sites Data Set (Classification Rate out of 100%) ... 56
Table 4. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Cleveland Heart Data Set (Classification Rate out of 100%) ... 57
Table 5. KPCA Testing Results for the Distribution and Gas Model Kernels on the Lymphoma Recognition Data Set (Classification Rate out of 100%) ... 58
Table 6. KICA Testing Results for the Distribution and Gas Model Kernels on the Lymphoma Recognition Data Set (Classification Rate out of 100%) ... 58
Table 7. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Lymphoma Data Set (Classification Rate out of 100%) ... 59
Table 8. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Lymphoma Data Set (Classification Rate out of 100%) ... 59
Table 9. KPCA Testing Results for the Distribution and Gas Model Kernels on the E.coli Protein Localization Sites Data Set (Classification Rate out of 100%) ... 61
Table 10. KICA Testing Results for the Distribution and Gas Model Kernels on the E.coli Protein Localization Sites Data Set (Classification Rate out of 100%) ... 62
Table 11. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the E.coli Data Set (Classification Rate out of 100%) ... 62
Table 12. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the E.coli Data Set (Classification Rate out of 100%) ... 63
Table 13. KPCA Testing Results for the Distribution and Gas Model Kernels on the Bupa Data Set (Classification Rate out of 100%) ... 64
Table 14. KICA Testing Results for the Distribution and Gas Model Kernels on the Bupa Data Set (Classification Rate out of 100%) ... 65
Table 15. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Bupa Data Set (Classification Rate out of 100%) ... 65
Table 16. KPCA Testing Results for the Distribution and Gas Model Kernels on the Glass Recognition Data Set (Classification Rate out of 100%) ... 67
Table 17. KICA Testing Results for the Distribution and Gas Model Kernels on the Glass Recognition Data Set (Classification Rate out of 100%) ... 67
Table 18. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Glass Data Set (Classification Rate out of 100%) ... 68
Table 19. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Glass Data Set (Classification Rate out of 100%) ... 68
Table 20. Glass Data Set Previously Published Results ... 69
Table 21. Segmentation Data Set Previously Published Results ... 70
Table 22. KPCA Testing Results for the Distribution and Gas Model Kernels on the Image Segmentation Data Set (Classification Rate out of 100%) ... 70
Table 23. KICA Testing Results for the Distribution and Gas Model Kernels on the Image Segmentation Data Set (Classification Rate out of 100%) ... 71
Table 24. KPCA Testing Results for the Distribution and Gas Model Kernels on the Balance Scale Data Set (Classification Rate out of 100%) ... 72
Table 25. KICA Testing Results for the Distribution Kernels on the Balance Scale Data Set (Classification Rate out of 100%) ... 73
Table 26. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Balance Data Set (Classification Rate out of 100%) ... 73
Table 27. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Balance Data Set (Classification Rate out of 100%) ... 74
Table 28. KPCA Testing Results for the Distribution and Gas Model Kernels on the Ionosphere Data Set (Classification Rate out of 100%) ... 75
Table 29. KICA Testing Results for the Distribution Kernels on the Ionosphere Data Set (Classification Rate out of 100%) ... 76
Table 30. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Ionosphere Data Set (Classification Rate out of 100%) ... 76
Table 31. KPCA Testing Results for the Distribution and Gas Model Kernels on the Yeast Protein Localization Data Set (Classification Rate out of 100%) ... 77
Table 32. KICA Testing Results for the Distribution Kernels on the Yeast Protein Localization Data Set (Classification Rate out of 100%) ... 78
Table 33. KPCA Testing Results for the Distribution and Gas Model Kernels on the Border Crossing/Entry Data Set (Classification Rate out of 100%) ... 79
Table 34. KPCA Testing Results for the Distribution and Gas Model Kernels on the Wine Recognition Data Set (Classification Rate out of 100%) ... 80
Table 35. KICA Testing Results for the Distribution Kernels on the Wine Recognition Data Set (Classification Rate out of 100%) ... 81
Table 36. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Wine Recognition Data Set (Classification Rate out of 100%) ... 82
Table 37. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Wine Recognition Data Set (Classification Rate out of 100%) ... 82
Table 38. KPCA Testing Results for the Distribution and Gas Model Kernels on the Iris Plants Data Set (Classification Rate out of 100%) ... 83
Table 39. KICA Testing Results for the Distribution and Gas Model Kernels on the Iris Plants Data Set (Classification Rate out of 100%) ... 84
Table 40. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Iris Plants Data Set (Classification Rate out of 100%) ... 84
Table 41. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Iris Plants Data Set (Classification Rate out of 100%) ... 85
Table 42. Pima Indian Previously Published Results ... 86
Table 43. KPCA Testing Results for the Distribution and Gas Model Kernels on the Pima Indian Diabetes Data Set (Classification Rate out of 100%) ... 86
Table 44. KICA Testing Results for the Distribution and Gas Model Kernels on the Pima Indian Diabetes Data Set (Classification Rate out of 100%) ... 87
Table 45. KPCA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Pima Indian Diabetes Data Set (Classification Rate out of 100%) ... 88
Table 46. KICA Testing Results for the Kelvin and Orthogonal Polynomial Kernels on the Pima Indian Diabetes Data Set (Classification Rate out of 100%) ... 88
Table 47. Roulette Wheel Selection Example for a Population of Ten Chromosomes. ... 105
Table 48. GA-KOPA Experimental Results For The Cleveland Heart Data Set (Classification Rate out of 100% on the Nearest Means-Neighbor Classifier) ... 115
Table 49. GA-KOPA Experimental Results For The E.Coli Data Set (Classification Rate out of 100% on the Nearest Means-Neighbor Classifier) ... 118
Table 50. GA-KOPA Experimental Results For The Ionosphere Data Set (Classification Rate out of 100% on the Nearest Means-Neighbor Classifier) ... 121
Table 51. GA-KOPA Experimental Results For The Image Segmentation Data Set (Classification Rate out of 100% on the Nearest Means-Neighbor Classifier) ... 124
Table 52. GA-KOPA Experimental Results For The Lymphoma Data Set (Classification Rate out of 100% on the Nearest Means-Neighbor Classifier) ... 126
Table 53. GA-KOPA Experimental Results For The Balance Scale Data Set (Classification Rate out of 100% on the Nearest Means-Neighbor Classifier) ... 129
Table 54. GA-KOPA Experimental Results For The Pima Indian Diabetes Data Set (Classification Rate out of 100% on the Nearest Means-Neighbor Classifier) ... 132
Table 55. KPCA Testing Results for the Distribution and Gas Model Kernels on the AT&T Face Data Set (Classification Rate out of 100%) ... 141
Table 56. KPCA Testing Results for the Distribution and Gas Model Kernels on the Yale Face Data Set (Classification Rate out of 100%) ... 142
Table 57. Statistical Moments of Known Distributions ... 150
Table 58. Statistical Moments of Databases ... 151
Table 59. Best Fitting Distribution for Each Database (Based on Euclidean Distance) ... 152
Table 60. Best Performing Kernels for Each Database ... 152
LIST OF FIGURES
Figure 1. The Plot Of A Simple 2-Dimensional Non-Linearly Separable Data Set Used To Demonstrate The Usefulness Of Kernel Methods. ... 7
Figure 2. The Resulting 3-Dimensional Kernel Projected Data from Figure 1. This Shows That The Data Is Now Linearly Separable. ... 8
Figure 3. Plot of the Levy Distribution Function with Range (0,1]. ... 18
Figure 4. Plot of the Log-normal Distribution Function with Range (0,1]. ... 19
Figure 5. Plot of the Kumaraswamy Distribution Function with Range (0,1]. ... 20
Figure 6. Plot of the Rice Distribution Function with Range (0,1]. ... 21
Figure 7. Plot of the Rayleigh Distribution Function with Range (0,1]... 22
Figure 8. Plot of the Erlang Distribution Function with Range (0,1]... 23
Figure 9. Plot of the Chi-squared Distribution Function with Range (0,1]... 24
Figure 10. Plot of the von Mises Distribution Function with Range (0,1]... 25
Figure 11. Plot of the Bessel Distribution Function with Range (0,1]... 25
Figure 12. Plot of the Maxwell-Boltzmann Distribution Function with Range (0,1]. ... 26
Figure 13. Plots of the Gumbel (left) and Fisher-Tippett (right) Distribution Functions with Range (0,1]. ... 27
Figure 14. Plot of the Laplace Distribution Function with Range (0,1]... 28
Figure 15. Plot of the Fermi-Dirac Distribution Function with Range (0,1]... 29
Figure 16. Plots of the Four Einstein Functions with range (0,1]. Clockwise from Top Left Einstein-1, Einstein-2, Einstein-3, and Einstein-4. ... 30
Figure 17. Plots of Four Debye Functions with range (0.2,1]. Clockwise from Top Left Debye-2, Debye-3, Debye-4, and Debye-5... 31
Figure 18. Plot of the Second Chebyshev Polynomial Function with Range (0,1]... 33
Figure 19. Plot of the Second Gegenbauer Polynomial Function with Range (0,1]. ... 34
Figure 20. Plot of the Second Legendre Polynomial Function with Range (0,1]. ... 35
Figure 21. Plot of the Second Laguerre Polynomial Function with Range (0,1]. ... 36
Figure 22. Plots of the First Four Bei Functions with Range (0,1]. Clockwise from Top Left Bei-0, Bei-1, Bei-2, Bei-3... 37
Figure 23. Plots of the First Four Ber Functions with Range (0,1]. Clockwise from Top Left Ber-0, Ber-1, Ber-2, Ber-3. ... 37
Figure 24. Plots showing the transformation and separation of the first two dimensions of the simple iris flower database as the data is subjected to A) nothing, B) PCA, C) ICA, and D) PCA- ICA. ... 48
Figure 25. Surface plot representation of the field of gene combinations in three dimensions, with the lines representing contours with respect to adaptiveness and the peaks signifying highly fit solutions in the search space. ... 102
Figure 26. Results for three selection cycles each consisting of ten spins of the wheel. ... 106
Figure 27. Chart Showing the Increased Classification Performance (out of 100%) of the Kernel Projection Matrix for each Kernel over the Cleveland Heart Disease Database. The Average Improvement is 12.0% ... 116
Figure 28. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Bessel Kernel on the Cleveland Heart Database... 117
Figure 29. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Levy Kernel on the Cleveland Heart Database... 117
Figure 30. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-4 Kernel on the Cleveland Heart Database... 117
Figure 31. Chart Showing the Increased Classification Performance (out of 100%) of the Kernel Projection Matrix for each Kernel over the E.Coli Database. The Average Improvement is 7.4%. ... 119
Figure 32. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gumbel Kernel on the E.coli Database. ... 119
Figure 33. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Fisher-Tippett Kernel on the E.coli Database. ... 120
Figure 34. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-3 Kernel on the E.coli Database. ... 120
Figure 35. Chart Showing the Increased Classification Performance (out of 100%) of the Kernel Projection Matrix for each Kernel over the Ionosphere Database. The Average Improvement is 14.12%. ... 122
Figure 36. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gaussian Kernel on the Ionosphere Database. ... 122
Figure 37. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Erlang Kernel on the Ionosphere Database. ... 123
Figure 38. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gumbel Kernel on the Ionosphere Database. ... 123
Figure 39. Chart Showing the Increased Classification Performance (out of 100%) of the Kernel Projection Matrix for each Kernel over the Image Segmentation Database. The Average Improvement is 7.6%. ... 125
Figure 40. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Log-norm Kernel on the Image Segmentation Database. ... 125
Figure 41. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Von Mises Kernel on the Image Segmentation Database. ... 126
Figure 42. Chart Showing the Increased Classification Performance (out of 100%) of the Kernel Projection Matrix for each Kernel over the Lymphoma Database. The Average Improvement is 14.5%. ... 127
Figure 43. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Rice Kernel on the Lymphoma Database. ... 128
Figure 44. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gaussian Kernel on the Lymphoma Database. ... 128
Figure 45. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-1 Kernel on the Lymphoma Database. ... 129
Figure 46. Chart Showing the Increased Classification Performance (out of 100%) of the Kernel Projection Matrix for each Kernel over the Balance Scale Database. The Average Improvement is 14.0%. ... 130
Figure 47. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-2 Kernel on the Balance Scale Database. ... 131
Figure 48. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Von Mises Kernel on the Balance Scale Database. ... 131
Figure 49. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Rice Kernel on the Balance Scale Database. ... 132
Figure 50. Chart Showing the Increased Classification Performance (out of 100%) of the Kernel Projection Matrix for each Kernel over the Pima Indian Database. The Average Improvement is 7.2%. ... 133
Figure 51. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Rayleigh Kernel on the Pima Indian Database. ... 133
Figure 52. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Erlang Kernel on the Pima Indian Database. ... 134
Figure 53. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-1 Kernel on the Pima Indian Database. ... 134
Figure 54. Pictures of a simple sinusoidal wave (a) and a wavelet (b). ... 137
Figure 55. Wavelets from left to right: Haar, Daubechies4, Coiflet1, Symlet2, Meyer, Morlet, and Mexican Hat. ... 139
Figure 56. Example wavelet decomposition of an AT&T face using a reverse bi-orthogonal wavelet 2.2 and periodic padding to level two. (Left) Original face. (Right) Face after decomposition. ... 140
Figure 57. Example wavelet decomposition of a Yale face using a reverse bi-orthogonal wavelet 2.2 and periodic padding to level two. (Left) Original face. (Right) Face after decomposition. ... 140
Figure 59. Plot for the Levy Distribution Function... 146
Figure 60. Plot for the Gumbel Distribution Function. ... 146
Figure 61. Plot for the Fisher-Tippett Distribution Function. ... 147
Figure 62. Plot for the Gaussian Distribution Function. ... 147
Figure 63. Plot for the Maxwell-Boltzmann Distribution Function... 147
Figure 64. Plot for the Bessel Distribution Function... 148
Figure 65. Plot for the Rice Distribution Function... 148
Figure 66. Plot for the Von Mises Distribution Function... 148
Figure 67. Plot for the Kumaraswamy Distribution Function... 149
Figure 68. Plot for the Erlang Distribution Function. ... 149
Figure 69. Plot for the Log-normal Distribution Function... 149
Figure 70. Plot for the Chi-squared Distribution Function. ... 150
Figure 71. Plot for the Rayleigh Distribution Function... 150
Figure 72. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Levy Kernel (Left) and the Rayleigh Kernel (Right) on the Ionosphere Database. ... 157
Figure 73. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-3 Kernel (Left) and the Chi-squared Kernel (Right) on the Ionosphere Database. ... 157
Figure 74. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Lognorm Kernel (Left) and the Kumaraswamy Kernel (Right) on the Ionosphere Database. ... 158
Figure 75. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Von Mises Kernel (Left) and the Rice Kernel (Right) on the Ionosphere Database. ... 158
Figure 76. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Bessel Kernel (Left) and the Max-Boltzmann Kernel (Right) on the Ionosphere Database. ... 158
Figure 77. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Fisher-Tippett Kernel (Left) and the Laplace Kernel (Right) on the Ionosphere Database. ... 159
Figure 78. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Poly-3 Kernel (Left) and the Einstein-1 Kernel (Right) on the Ionosphere Database. ... 159
Figure 79. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-2 Kernel (Left) and the Einstein-4 Kernel (Right) on the Ionosphere Database. ... 159
Figure 80. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Fermi-Dirac Kernel on the Ionosphere Database. ... 160
Figure 81. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Levy Kernel (Left) and the Rayleigh Kernel (Right) on the Image Segmentation Database. ... 160
Figure 82. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Chi-squared Kernel (Left) and the Erlang Kernel (Right) on the Image Segmentation Database. ... 161
Figure 83. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Kumaraswamy Kernel (Left) and the Rice Kernel (Right) on the Image Segmentation Database. ... 161
Figure 84. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Bessel Kernel (Left) and the Max-Boltzmann Kernel (Right) on the Image Segmentation Database. ... 161
Figure 85. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gaussian Kernel (Left) and the Fisher-Tippett Kernel (Right) on the Image Segmentation Database. ... 162
Figure 86. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gumbel Kernel (Left) and the Laplace Kernel (Right) on the Image Segmentation Database. ... 162
Figure 87. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Poly-3 Kernel (Left) and the Einstein-1 Kernel (Right) on the Image Segmentation Database. ... 162
Figure 88. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-2 Kernel (Left) and the Einstein-3 Kernel (Right) on the Image Segmentation Database.
Figure 89. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-4 Kernel (Left) and the Fermi-Dirac Kernel (Right) on the Image Segmentation Database. ... 163
Figure 90. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Fermi-Dirac Kernel (Left) and the Levy Kernel (Right) on the Lymphoma Database. ... 163
Figure 91. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Rayleigh Kernel (Left) and the Chi-squared Kernel (Right) on the Lymphoma Database. ... 164
Figure 92. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Lognorm Kernel (Left) and the Erlang Kernel (Right) on the Lymphoma Database. ... 164
Figure 93. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Kumaraswamy Kernel (Left) and the von Mises Kernel (Right) on the Lymphoma Database. ... 164
Figure 94. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Bessel Kernel (Left) and the Maxwell-Boltzmann Kernel (Right) on the Lymphoma Database. ... 165
Figure 95. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Fisher-Tippett Kernel (Left) and the Gumbel Kernel (Right) on the Lymphoma Database. ... 165
Figure 96. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Laplace Kernel (Left) and the Poly-3 Kernel (Right) on the Lymphoma Database. ... 165
Figure 97. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-2 Kernel (Left) and the Einstein-3 Kernel (Right) on the Lymphoma Database. ... 166
Figure 98. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-4 Kernel on the Lymphoma Database. ... 166
Figure 99. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Levy Kernel (Left) and the Rayleigh Kernel (Right) on the Balance Scale Database. ... 166
Figure 100. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Chi-squared Kernel (Left) and the Log-normal Kernel (Right) on the Balance Scale Database. ... 167
Figure 101. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Erlang Kernel (Left) and the Kumaraswamy Kernel (Right) on the Balance Scale Database. ... 167
Figure 102. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Bessel Kernel (Left) and the Max-Boltzmann Kernel (Right) on the Balance Scale Database. ... 167
Figure 103. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gaussian Kernel (Left) and the Fisher-Tippett Kernel (Right) on the Balance Scale Database. ... 168
Figure 104. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gumbel Kernel (Left) and the Laplace Kernel (Right) on the Balance Scale Database. ... 168
Figure 105. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Poly-3 Kernel (Left) and the Einstein-1 Kernel (Right) on the Balance Scale Database. ... 168
Figure 106. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-3 Kernel (Left) and the Einstein-4 Kernel (Right) on the Balance Scale Database. ... 169
Figure 107. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Fermi-Dirac Kernel (Left) and the Debye-1 Kernel (Right) on the Balance Scale Database. ... 169
Figure 108. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Chi-squared Kernel (Left) and the Kumaraswamy Kernel (Right) on the Pima Indian Database. ... 169
Figure 109. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Von Mises Kernel (Left) and the Rice Kernel (Right) on the Pima Indian Database. ... 170
Figure 110. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Bessel Kernel (Left) and the Maxwell-Boltzmann Kernel (Right) on the Pima Indian Database. ... 170
Figure 111. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gaussian Kernel (Left) and the Fisher-Tippett Kernel (Right) on the Pima Indian Database. ... 170
Figure 112. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Polynomial-3 Kernel (Left) and the Einstein-2 Kernel (Right) on the Pima Indian Database. ... 171
Figure 113. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-3 Kernel (Left) and the Einstein-4 Kernel (Right) on the Pima Indian Database. ... 171
Figure 114. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Debye Kernel (Left) and the Fermi-Dirac Kernel (Right) on the Cleveland Heart Database. ... 171
Figure 115. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-3 Kernel (Left) and the Einstein-2 Kernel (Right) on the Cleveland Heart Database. ... 172
Figure 116. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-1 Kernel (Left) and the Poly-3 Kernel (Right) on the Cleveland Heart Database. ... 172
Figure 117. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Laplace Kernel (Left) and the Gumbel Kernel (Right) on the Cleveland Heart Database. ... 172
Figure 118. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Fisher-Tippett Kernel (Left) and the Gaussian Kernel (Right) on the Cleveland Heart Database. ... 173
Figure 119. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Maxwell-Boltzmann Kernel (Left) and the Rice Kernel (Right) on the Cleveland Heart Database. ... 173
Figure 120. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Von Mises Kernel (Left) and the Kumaraswamy Kernel (Right) on the Cleveland Heart Database. ... 173
Figure 121. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Erlang Kernel (Left) and the Log-normal Kernel (Right) on the Cleveland Heart Database. ... 174
Figure 122. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Chi-squared Kernel (Left) and the Rayleigh Kernel (Right) on the Cleveland Heart Database. ... 174
Figure 123. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Gaussian Kernel (Left) and the Fermi-Dirac Kernel (Right) on the E.coli Database. ... 174
Figure 124. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Erlang Kernel (Left) and the Einstein-4 Kernel (Right) on the E.coli Database. ... 175
Figure 125. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Einstein-1 Kernel (Left) and the Einstein-2 Kernel (Right) on the E.coli Database. ... 175
Figure 126. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Debye-1 Kernel (Left) and the Bessel Kernel (Right) on the E.coli Database. ... 175
Figure 127. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Rayleigh Kernel (Left) and the Chi-squared Kernel (Right) on the E.coli Database. ... 176
Figure 128. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the von Mises Kernel (Left) and the Rice Kernel (Right) on the E.coli Database. ... 176
Figure 129. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Poly-3 Kernel (Left) and the Maxwell-Boltzmann Kernel (Right) on the E.coli Database. ... 176
Figure 130. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Log-normal Kernel (Left) and the Levy Kernel (Right) on the E.coli Database. ... 177
Figure 131. Genetic Algorithm Based Kernel Optimal Projection Analysis Training Curve for the Laplace Kernel (Left) and the Kumaraswamy Kernel (Right) on the E.coli Database. ... 177
ABSTRACT
Kernel methods, as alternatives to component analysis, are mathematical tools that provide a higher-dimensional representation for feature recognition and image analysis problems. In machine learning, the kernel trick is a method for converting a linear classification learning algorithm into a non-linear one by mapping the original observations into a higher-dimensional space, so that the use of a linear classifier in the new space is equivalent to a non-linear classifier in the original space. In this dissertation we present the performance results of several continuous distribution function kernels, lattice oscillation model kernels, Kelvin function kernels, and orthogonal polynomial kernels on select benchmarking databases. In addition, we develop methods to analyze the use of these kernels for projection analysis applications: principal component analysis, independent component analysis, and optimal projection analysis. We compare the performance results with known kernel methods on several benchmarks. Empirical results show that several of these kernels outperform other previously suggested kernels on these data sets.
Additionally, we develop a genetic algorithm-based kernel optimal projection analysis method which, through extensive testing, demonstrates a ten percent average improvement in performance on all data sets over the kernel principal component analysis projection. We also compare our kernel methods for kernel eigenface representations with previous techniques.
Finally, we analyze the benchmark databases used here to determine whether we can aid in the selection of a particular kernel that would perform optimally based on the statistical characteristics of each database.
CHAPTER 1
INTRODUCTION
1. MOTIVATIONS
Pattern recognition aims to classify data (patterns) based either on a priori knowledge or on statistical information extracted from the patterns. The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space. Typically, classification in the original space is either impossible or very difficult.
Therefore, research on methods to map the data to a new, more classification-friendly space is quite prevalent. Typically these methods entail rotating or translating the data, or compressing the data to fit some hyperplane in a lower-dimensional space. Most of these techniques work to find linear separations; however, if the patterns in the data are separable along some nonlinear manifold, then most linear methods break down or perform rather poorly. A common alternative is to use nonlinear classifiers; however, these can be computationally or functionally complex. That is not to say that they do not work; rather, they may be user-unfriendly. A friendlier alternative is to project the data into a new space in a nonlinear way and then use linear techniques to classify the new data. It is on the value and utility of some novel techniques in projection methods that we will focus this work. We will do so by focusing on the following questions:
• Why are only a few kernel functions mentioned in the literature, mainly Gaussian and polynomial?
• Can we find a cheaper method to optimize projection matrices against a classification criterion?
• Can we improve the classification rate on certain databases?
These are the questions that have driven this research. We will look for answers and hopefully provide results that demonstrate practicable solutions to these questions.
Therefore, this work aims to further advance research in kernel methods and optimal projection analysis techniques. The impetus for this research comes from work done by Schölkopf and Smola [47-49], Cristianini and Shawe-Taylor [50], Bach and Jordan [6], Gallivan et al. [27] and Liu et al. [39]. Through our investigations this research will introduce the following advancements:
1. The use of continuous distribution function kernels, lattice oscillation model function kernels, Kelvin function kernels, and orthogonal polynomial kernels.
2. An investigation of the applicability of these kernel functions to pattern recognition problems.
3. The use of these kernel functions for kernel principal component analysis and independent component analysis.
4. The use of genetic algorithms for the optimization of kernel projection matrices.
5. An improved alternative kernel eigenface representation.
6. A statistical analysis of the properties of the employed data sets and the possibility for a preprocessing kernel selection algorithm.
7. A delineation of further research tasks, and the formation of a research agenda for a more thorough exploration of kernel projection analysis methods.
To this end, this research introduces new techniques in kernel methods and projection analysis as applied to pattern classification.
2. THESIS OVERVIEW
Chapter 2 discusses the details surrounding kernel methods and introduces the continuous distribution function kernels, the lattice oscillation model function kernels, the Kelvin function kernels, and the orthogonal polynomial function kernels. Chapter 3 introduces principal component analysis, independent component analysis, and the implementation required to kernelize these methods. Chapter 4 details the experimental analysis and results. Chapter 5 discusses general optimization methods, genetic algorithms, and the genetic algorithm-based optimal kernel projection analysis method and shows the experimental results. Chapter 6 presents work done to improve kernel eigenface performance. Chapter 7 discusses experiments performed to aid in kernel selection based on the statistical analysis of the input data. Chapter 8 ends with a conclusion and a discussion on future work.
CHAPTER 2
KERNELS FOR PATTERN RECOGNITION
1. INTRODUCTION
A complete pattern recognition system consists of a sensor that gathers the observations to be classified or described; a feature extraction mechanism that computes numeric or symbolic information from the observations; and a classification or description scheme that does the actual job of classifying or describing observations, relying on the extracted features. The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described. This set of patterns is termed the training set and the resulting learning strategy is characterized as supervised learning. Learning can also be unsupervised, in the sense that the system is not given an a priori labeling of patterns; instead, it establishes the classes itself based on the statistical regularities of the patterns.
The classification or description scheme usually uses one of two approaches: statistical (or decision-theoretic) or syntactic (or structural). Statistical pattern recognition is based on statistical characterizations of patterns, assuming that the patterns are generated by a probabilistic system. Structural pattern recognition is based on the structural interrelationships of features. A wide range of algorithms can be applied for pattern recognition, from very simple Bayesian classifiers to much more powerful neural networks.
Principal Component Analysis (PCA) has been called one of the most valuable results from applied linear algebra. It is used abundantly in all forms of analysis, from neuroscience to computer graphics, because it is a simple, non-parametric method of extracting relevant information from confusing data sets. With minimal additional effort PCA provides a roadmap for how to reduce a complex data set to a lower dimension to reveal the sometimes hidden, simplified structure that often underlies it. This technique searches for the directions in the data that have the largest variance and subsequently projects the data onto them. In this way, we obtain a lower-dimensional representation of the data that removes some of the noisy directions. Fisher discriminant analysis (FDA) is another popular method for linear supervised dimensionality reduction. As a widely used technique for pattern classification, it is able to separate multivariate data with different classes nicely in the linear projection. In two-class separation, FDA tries to find a linear discriminant that yields optimal discrimination between the two classes, such that the between-class scatter is maximized and the within-class scatter is minimized.
Recently, the problem of reducing the dimensionality of a data set has received renewed attention [39, 45, 47, 55]. The underlying idea, due to Hotelling [33], is that most of the variation in many high dimensional data sets can often be explained by a few latent variables.
Alternatively, we say that rather than filling the whole space, the data lie on a lower dimensional manifold. The dimensionality of this manifold is the dimensionality of the latent space and the coordinate system on this manifold provides the latent variables.
Traditional tools of principal component analysis and factor analysis (FA) are still the most widely used methods in data analysis. They project the data onto a hyperplane, so the reduced coordinates are easy to interpret. However, these methods are unable to deal with nonlinear correlations in a data set. Nonlinear correlations in data can also be accommodated implicitly, without constructing an actual low dimensional manifold. By mapping the data from the original space to an even higher dimensional feature space, we may hope that the correlations will become linear and PCA will apply. Kernel methods [47] allow us to do this without actually constructing an explicit map to feature space. They introduce nonlinearity through an a priori nonlinear kernel.
2. KERNEL METHODS FOR PATTERN CLASSIFICATION
2.1 KERNEL METHODS
The kernel trick was first published in 1964 [4], and it has only recently come into mainstream use [47-50]. Essentially, the kernel trick is a method for easily converting a linear classification learning algorithm into a non-linear one, by mapping the original observations into a higher-dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space. This is done using Mercer's condition [47], which states that any positive semi-definite kernel $k(x, y)$ can be expressed as a dot product in a high-dimensional space. More specifically, if the arguments to the kernel are in a measurable space $X$, and if the kernel is positive semi-definite, i.e.
$$\sum_{i,j} k(x_i, x_j)\, c_i c_j \ge 0 \tag{1}$$
for any finite subset $\{x_1, \ldots, x_n\}$ of $X$ and any set $\{c_1, \ldots, c_n\}$ of real numbers, then there exists a function $\phi(x)$ whose range is in an inner product space of possibly high dimension, such that
$$k(x, y) = \phi(x) \cdot \phi(y). \tag{2}$$
The kernel trick transforms any algorithm that solely depends on the dot product between two vectors. Wherever a dot product is used, it is replaced with the kernel function. Thus, a linear algorithm can easily be transformed into a non-linear algorithm. This non-linear algorithm is equivalent to the linear algorithm operating in the range space of $\phi$. However, because kernels are used, the $\phi$ function is never explicitly computed. This is desirable, because the high-dimensional space may be infinite-dimensional (as is the case when the kernel is a Gaussian).
The kernel representation of data amounts to a nonlinear projection of data into a high-dimensional space where it is easier to separate into classes.
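As a minimal sketch of this substitution (an illustration, not code from this work), note that the squared distance between two mapped points can be written entirely in terms of dot products. Replacing each dot product with a kernel evaluation then gives the distance in the feature space without ever computing the map. The function names and the width `sigma` are assumptions made for the example.

```python
import math

def dot(x, y):
    """Ordinary dot product in the input space."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel; sigma = 1.0 is an arbitrary illustrative width."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def squared_distance(x, y, k=dot):
    """||phi(x) - phi(y)||^2 = k(x, x) - 2 k(x, y) + k(y, y)."""
    return k(x, x) - 2.0 * k(x, y) + k(y, y)

x, y = (1.0, 2.0), (3.0, 0.0)
linear_d2 = squared_distance(x, y)                    # plain Euclidean case
kernel_d2 = squared_distance(x, y, gaussian_kernel)   # same formula, kernelized
```

With the plain dot product the routine reduces to the usual squared Euclidean distance; passing a kernel changes only the `k` argument, which is the sense in which a linear algorithm is "kernelized".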
Let $\mathbb{R}^f$ be the feature space induced by a nonlinear mapping $\phi: \mathbb{R}^n \to \mathbb{R}^f$, where $f > n$ and $f$ could be infinite. To compute the value of the dot product in the feature space $\mathbb{R}^f$ without explicitly using the map, we use the kernel trick [50].
A choice of kernel function corresponds to some implicitly defined map $\phi$. Mercer's theorem guarantees the existence of such kernel functions. Three popular choices for $k(\cdot, \cdot)$ are
dth degree polynomial: $k(x, y) = (\kappa_1 + \kappa_2 \langle x, y \rangle)^d$,
Gaussian: $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$,
Neural network: $k(x, y) = \tanh(\kappa_1 \langle x, y \rangle + \kappa_2)$,
where the data is now represented by the Gram matrix $K$ such that $K_{ij} = k(x_i, x_j)$, and $K$ is positive semi-definite.
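The three kernels and the Gram matrix can be sketched as follows. This is an illustrative sketch only: the parameter defaults, the random test points, and the helper names are assumptions. The quadratic-form check at the end is a spot check of condition (1) for the Gaussian kernel on random coefficient vectors, not a proof of positive semi-definiteness.

```python
import math
import random

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, k1=1.0, k2=1.0, d=3):
    """(k1 + k2 <x, y>)^d; the parameter values are illustrative."""
    return (k1 + k2 * inner(x, y)) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def neural_network_kernel(x, y, k1=1.0, k2=0.0):
    """tanh(k1 <x, y> + k2)."""
    return math.tanh(k1 * inner(x, y) + k2)

def gram_matrix(points, kernel):
    """K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

def quadratic_form(K, c):
    """sum_{i,j} c_i c_j K[i][j], the left-hand side of condition (1)."""
    n = len(c)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(6)]
K = gram_matrix(points, gaussian_kernel)
forms = [quadratic_form(K, [rng.uniform(-1, 1) for _ in points])
         for _ in range(200)]
```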
Let us start with an example to demonstrate the power of kernels. Given the 2-dimensional data in Figure 1, which is not linearly separable, we wish to project it to a 3-dimensional space using
$$\varphi: \mathbb{R}^2 \to \mathbb{R}^3, \tag{3}$$
that is,
$$X = (x_1, x_2) \to Z = (z_1, z_2, z_3), \tag{4}$$
where
$$Z = (z_1, z_2, z_3) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2). \tag{5}$$
After projecting the data into this new space we can see that the two classes have now become linearly separable, as shown in Figure 2.
Figure 1. The Plot Of A Simple 2-Dimensional Non-Linearly Separable Data Set Used To Demonstrate The Usefulness Of Kernel Methods.
Figure 2. The Resulting 3-Dimensional Kernel Projected Data from Figure 1. This Shows That The Data Is Now Linearly Separable.
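The projection in this example can be sketched numerically. The sketch assumes that the middle coordinate of the map is $\sqrt{2}\,x_1 x_2$, and it uses two synthetic concentric rings as a stand-in for the data of Figure 1 (the actual data set is not reproduced here). The separating plane $z_1 + z_3 = 2.5$ is likewise an illustrative choice.

```python
import math
import random

def feature_map(x):
    """Map (x1, x2) to (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ring(radius, count, rng):
    """Sample points on a circle; an assumed stand-in for Figure 1."""
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(count)]
    return [(radius * math.cos(t), radius * math.sin(t)) for t in angles]

rng = random.Random(1)
inner_class = ring(1.0, 50, rng)   # not linearly separable from outer_class
outer_class = ring(2.0, 50, rng)

def plane_side(z, offset=2.5):
    """Signed side of the plane z1 + z3 = offset in the projected space."""
    return z[0] + z[2] - offset

inner_sides = [plane_side(feature_map(p)) for p in inner_class]
outer_sides = [plane_side(feature_map(p)) for p in outer_class]

# The same map reproduces a degree-2 polynomial kernel:
# phi(x) . phi(y) = (x . y)^2, so phi never has to be formed explicitly.
x, y = (0.3, -1.2), (2.0, 0.7)
kernel_gap = dot(feature_map(x), feature_map(y)) - dot(x, y) ** 2
```

Because $z_1 + z_3 = x_1^2 + x_2^2$, every point of the inner ring falls on one side of the plane and every point of the outer ring on the other.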
We will now discuss some of the mathematical ideas behind these kernel methods and follow this with the introduction of our new kernels.
2.2 INNER PRODUCT SPACE
In mathematics, an inner product space is a vector space with additional structure, an inner product (also called scalar product), which allows us to introduce geometrical notions such as angles and lengths of vectors. Inner product spaces generalize Euclidean spaces (with the dot product as the inner product) and are studied in functional analysis. An inner product space is sometimes also called a pre-Hilbert space, since its completion with respect to the metric induced by its inner product is a Hilbert space. Inner product spaces were referred to as unitary spaces in earlier work, although this terminology is now rarely used.
2.2.1 DEFINITIONS
In the following, the field of scalars denoted F is either the field of real numbers R or the field of complex numbers C. An inner product is a generalization of the dot product. In a vector space, it is a way to multiply vectors together, with the result of this multiplication being a scalar. More precisely, for a real vector space, an inner product $\langle \cdot, \cdot \rangle$ satisfies the following four properties. Let $u$, $v$, and $w$ be vectors and $\alpha$ be some scalar, then:
$$\langle u + v, w \rangle = \langle u, w \rangle + \langle v, w \rangle \tag{6}$$
$$\langle \alpha v, w \rangle = \alpha \langle v, w \rangle \tag{7}$$
$$\langle v, w \rangle = \langle w, v \rangle \tag{8}$$
$$\langle v, v \rangle \ge 0 \tag{9}$$
where equality holds in (9) if and only if $v = 0$. A vector space together with an inner product on it is called an inner product space. This definition also applies to an abstract vector space over any field. Examples of inner product spaces include:
1. The real numbers $\mathbb{R}$, where the inner product is given by
$$\langle x, y \rangle = xy \tag{10}$$
2. The Euclidean space $\mathbb{R}^n$, where the inner product is given by the dot product
$$\langle (x_1, x_2, \ldots, x_n), (y_1, y_2, \ldots, y_n) \rangle = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \tag{11}$$
3. The vector space of real functions whose domain is a closed interval $[a, b]$, with inner product
$$\langle f, g \rangle = \int_a^b f g \, dx \tag{12}$$
When given a complex vector space, the third property above is usually replaced by
$$\langle v, w \rangle = \overline{\langle w, v \rangle} \tag{13}$$
where $\overline{z}$ refers to complex conjugation. With this property, the inner product is called a Hermitian inner product and a complex vector space with a Hermitian inner product is called a Hermitian inner product space. Every inner product space is a metric space. The metric is given by
$$g(v, w) = \sqrt{\langle v - w, v - w \rangle} \tag{14}$$
If this metric space is complete, the inner product space is called a Hilbert space.
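The real-valued examples above can be checked numerically. The following is a sketch under stated assumptions: the function inner product is approximated with a midpoint rule (the step count `n` is arbitrary), the metric is taken as the square root of $\langle v - w, v - w \rangle$, and all names are illustrative.

```python
import math

def dot(x, y):
    """Inner product on R^n, as in example 2."""
    return sum(a * b for a, b in zip(x, y))

def function_inner(f, g, a, b, n=10000):
    """Midpoint-rule approximation of the integral of f*g over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) * g(a + (i + 0.5) * h)
                   for i in range(n))

def metric(v, w, inner=dot):
    """Distance induced by the inner product."""
    diff = [a - b for a, b in zip(v, w)]
    return math.sqrt(inner(diff, diff))

u, v, w = (1.0, 2.0, 3.0), (-1.0, 0.5, 2.0), (4.0, 0.0, -2.0)
alpha = 2.5

# Residuals of properties (6)-(8); each should vanish up to rounding.
additivity_gap = dot([a + b for a, b in zip(u, v)], w) - (dot(u, w) + dot(v, w))
scaling_gap = dot([alpha * a for a in v], w) - alpha * dot(v, w)
symmetry_gap = dot(v, w) - dot(w, v)

# Function-space example: sin and cos are orthogonal on [-pi, pi].
sin_cos = function_inner(math.sin, math.cos, -math.pi, math.pi)
sin_sin = function_inner(math.sin, math.sin, -math.pi, math.pi)

distance = metric((0.0, 0.0), (3.0, 4.0))
```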
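More formally, an inner product space is a vector space V over the field F together with a positive-definite form. The inner product is a map
$$\langle \cdot, \cdot \rangle : V \times V \to F \tag{15}$$
satisfying the following axioms:
1. Conjugate symmetry:
$$\forall x, y \in V, \quad \langle x, y \rangle = \overline{\langle y, x \rangle}. \tag{16}$$
This condition implies that $\langle x, x \rangle \in \mathbb{R}$ for all $x \in V$, because $\langle x, x \rangle = \overline{\langle x, x \rangle}$.
2. Sesquilinearity:
$$\forall b \in F, \ \forall x, y \in V, \quad \langle x, by \rangle = b \langle x, y \rangle \tag{17}$$
$$\forall x, y, z \in V, \quad \langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle \tag{18}$$
By combining these with conjugate symmetry, we get:
$$\forall a \in F, \ \forall x, y \in V, \quad \langle ax, y \rangle = \overline{a} \langle x, y \rangle \tag{19}$$
$$\forall x, y, z \in V, \quad \langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle \tag{20}$$
3. Nonnegativity:
$$\forall x \in V, \quad \langle x, x \rangle \ge 0 \tag{21}$$
(This makes sense because $\langle x, x \rangle \in \mathbb{R}$ for all $x \in V$.)
4. Nondegeneracy:
The map from V to the dual space V* is an isomorphism. For a finite-dimensional vector space, it suffices to check injectivity:
$$\langle x, y \rangle = 0 \ \ \forall y \in V \quad \text{iff} \quad x = 0. \tag{22}$$
Hence, the inner product is a Hermitian form.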
The property of an inner product space V that $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$ and $\langle x, y + z \rangle = \langle x, y \rangle + \langle x, z \rangle$ for all $x, y, z \in V$ is known as additivity.
Note that if F = R, then the conjugate symmetry property is simply symmetry of the inner product, i.e. $\langle x, y \rangle = \langle y, x \rangle$. In this case, sesquilinearity becomes standard linearity.
There are various technical reasons why it is necessary to restrict the base field to R and C in the definition. Briefly, the base field has to contain an ordered subfield (in order for non-negativity to make sense) and therefore has to have characteristic equal to 0. This immediately excludes finite fields. The base field has to have additional structure, such as a distinguished automorphism. In some cases we need to consider non-negative semi-definite sesquilinear forms. This means that $\langle x, x \rangle$ is only required to be non-negative. We show how to treat these below.
The general form of an inner product on $\mathbb{C}^n$ is given by:
$$\langle x, y \rangle := x^* M y \tag{23}$$
with $M$ any positive-definite matrix, and $x^*$ the conjugate transpose of $x$. For the real case this corresponds to the dot product of the results of directionally differential scaling of the two vectors, with positive scale factors and orthogonal directions of scaling. Apart from an orthogonal transformation it is a weighted-sum version of the dot product, with positive weights.
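A small sketch of this general form follows, assuming the convention that the product is conjugate-linear in the first argument and linear in the second. The Hermitian matrix `M` below is an arbitrary positive-definite example (its leading principal minors are 2 and 5), and the test vectors are likewise illustrative.

```python
def weighted_inner(x, y, M):
    """<x, y> = x* M y, with x* the conjugate transpose of x."""
    n = len(x)
    return sum(complex(x[i]).conjugate() * M[i][j] * y[j]
               for i in range(n) for j in range(n))

# Hermitian, positive-definite example matrix (an assumption for illustration).
M = [[2.0, 1j],
     [-1j, 3.0]]

x = (1 + 1j, 2.0)
y = (0.5, -1j)
b = 2 - 3j

xx = weighted_inner(x, x, M)   # real and positive for x != 0

# Residual of conjugate symmetry: <x, y> - conj(<y, x>).
conj_sym_gap = weighted_inner(x, y, M) - weighted_inner(y, x, M).conjugate()

# Residual of linearity in the second argument: <x, b y> - b <x, y>.
linear_gap = (weighted_inner(x, [b * yi for yi in y], M)
              - b * weighted_inner(x, y, M))
```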
There are several examples of Hilbert spaces, that is, inner product spaces wherein the metric induced by the inner product yields a complete metric space. An example of an inner product which induces an incomplete metric occurs with the space $C[a, b]$ of continuous complex-valued functions on the interval $[a, b]$. The inner product is
$$\langle f, g \rangle := \int_a^b \overline{f(t)}\, g(t)\, dt. \tag{24}$$
This space is not complete; consider, for example, for the interval [0, 1] the sequence of functions $\{f_k\}_k$ where
• $f_k(t)$ is 1 for $t$ in the subinterval [0, 1/2]
• $f_k(t)$ is 0 for $t$ in the subinterval [1/2 + 1/k, 1]
• $f_k$ is affine in [1/2, 1/2 + 1/k]
This sequence is a Cauchy sequence which does not converge to a continuous function.
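The Cauchy behavior of this sequence can be illustrated numerically. The sketch below approximates $\|f_k - f_{2k}\|^2$ with a midpoint rule (the step count `n` is an arbitrary choice, and $k \ge 2$ is assumed so that the breakpoints lie in [0, 1]). A direct calculation gives $\|f_k - f_{2k}\|^2 = 1/(12k)$, which shrinks to zero, while the pointwise limit is a step function at $t = 1/2$ and therefore not continuous.

```python
def f(k, t):
    """Piecewise-affine function f_k on [0, 1], for k >= 2."""
    if t <= 0.5:
        return 1.0
    if t >= 0.5 + 1.0 / k:
        return 0.0
    return 1.0 - k * (t - 0.5)   # affine ramp from 1 down to 0

def dist_sq(k, m, n=20000):
    """Midpoint-rule approximation of ||f_k - f_m||^2 on [0, 1]."""
    h = 1.0 / n
    return h * sum((f(k, (i + 0.5) * h) - f(m, (i + 0.5) * h)) ** 2
                   for i in range(n))

ks = (2, 4, 8, 16, 32)
gaps = [dist_sq(k, 2 * k) for k in ks]   # decreasing toward zero
```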
2.2.2 NORMS ON INNER PRODUCT SPACES
Inner product spaces have a naturally defined norm
$$\|x\| = \sqrt{\langle x, x \rangle}. \tag{25}$$
This is well defined by the nonnegativity axiom of the definition of inner product space. The norm is thought of as the length of the vector x. Therefore, directly from the axioms, we can prove the following:
Cauchy-Schwarz inequality: for x, y elements of V,
$$|\langle x, y \rangle| \le \|x\| \cdot \|y\| \tag{26}$$
with equality if and only if x and y are linearly dependent. This is one of the most important inequalities in mathematics. It is also known as the Cauchy-Bunyakowski-Schwarz inequality.
The geometric interpretation of the inner product in terms of angle and length motivates much of the geometric terminology we use in regard to these spaces. Indeed, an immediate consequence of the Cauchy-Schwarz inequality is that it justifies defining the angle between two non-zero vectors x and y (at least in the case F = R) by the identity
$$\angle(x, y) = \cos^{-1} \frac{\langle x, y \rangle}{\|x\| \cdot \|y\|}. \tag{27}$$
We assume the value of the angle is chosen to be in the interval [0, π]. This is in analogy to the familiar situation in two-dimensional Euclidean space. Correspondingly, we will say that non-zero vectors x, y of V are orthogonal if and only if their inner product is zero.
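As a brief illustration for the real case, the Cauchy-Schwarz inequality and the angle definition can be evaluated directly. The vectors are arbitrary examples; the clamp before `math.acos` guards against rounding slightly outside [-1, 1], and `math.acos` returns values in [0, π].

```python
import math

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    return math.sqrt(inner(x, x))

def angle(x, y):
    """Angle between two non-zero real vectors."""
    c = inner(x, y) / (norm(x) * norm(y))
    return math.acos(max(-1.0, min(1.0, c)))   # clamp against rounding

x, y = (1.0, 0.0), (1.0, 1.0)
cs_slack = norm(x) * norm(y) - abs(inner(x, y))   # non-negative by Cauchy-Schwarz
quarter_turn = angle((1.0, 0.0), (0.0, 1.0))      # orthogonal vectors
eighth_turn = angle(x, y)

# Equality in Cauchy-Schwarz for linearly dependent vectors.
z = (2.0, -4.0)
dependent_slack = norm(z) * norm((-1.0, 2.0)) - abs(inner(z, (-1.0, 2.0)))
```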
Homogeneity: for x an element of V and r a scalar,
$$\|r \cdot x\| = |r| \cdot \|x\|. \tag{28}$$
The homogeneity property is completely trivial to prove.
Triangle inequality: for x, y elements of V,
$$\|x + y\| \le \|x\| + \|y\|. \tag{29}$$
These last two properties show the function defined is indeed a norm. Because of the triangle inequality and because of axiom 2, we see that ||·|| is a norm which turns V into a normed vector space and hence also into a metric space. The most important inner product spaces are the ones which are complete with respect to this metric; they are called Hilbert spaces. Every inner product space V is a dense subspace of some Hilbert space. This Hilbert space is essentially uniquely determined by V and is constructed by completing V.
Parallelogram law:
$$\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2. \tag{30}$$
Pythagorean Theorem: Whenever x, y are in V and <x, y> = 0, then
$$\|x\|^2 + \|y\|^2 = \|x + y\|^2. \tag{31}$$
The proofs of both of these identities require only expressing the definition of norm in terms of the inner product and multiplying out, using the property of additivity of each component. The name Pythagorean Theorem arises from the geometric interpretation of this result as an analogue of the theorem in synthetic geometry. Note that the proof of the Pythagorean Theorem in synthetic geometry is considerably more elaborate because of the paucity of underlying structure. In this sense, the synthetic Pythagorean Theorem, if correctly demonstrated, is deeper than the version given above.
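Both identities are easy to confirm numerically for the dot product on $\mathbb{R}^3$. The vectors below are arbitrary examples, with `p` and `q` chosen to be orthogonal; the names are illustrative.

```python
def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def norm_sq(x):
    return inner(x, x)

def add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def sub(x, y):
    return tuple(a - b for a, b in zip(x, y))

# Parallelogram law, equation (30): holds for any pair of vectors.
x, y = (3.0, -1.0, 2.0), (0.5, 4.0, 1.0)
parallelogram_gap = (norm_sq(add(x, y)) + norm_sq(sub(x, y))
                     - 2.0 * norm_sq(x) - 2.0 * norm_sq(y))

# Pythagorean Theorem, equation (31): requires <p, q> = 0.
p, q = (1.0, 2.0, 0.0), (-2.0, 1.0, 5.0)
orthogonality = inner(p, q)
pythagoras_gap = norm_sq(p) + norm_sq(q) - norm_sq(add(p, q))
```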
An easy induction on the Pythagorean Theorem yields: If $x_1, \ldots, x_n$ are orthogonal vectors, that is, $\langle x_j, x_k \rangle = 0$ for distinct indices $j$, $k$, then
$$\sum_{i=1}^{n} \|x_i\|^2 = \left\| \sum_{i=1}^{n} x_i \right\|^2.$$
In view of the Cauchy-Schwarz inequality, we also note that $\langle \cdot, \cdot \rangle$ is continuous from V × V to F. This allows us to extend Pythagoras' theorem to infinitely many summands.
Parseval's identity: Suppose V is a complete inner product space. If $\{x_k\}$ are mutually orthogonal vectors in V then
$$\sum_{i=1}^{\infty} \|x_i\|^2 = \left\| \sum_{i=1}^{\infty} x_i \right\|^2 \tag{32}$$
provided the infinite series on the left is convergent. Completeness of the space is needed to ensure that the sequence of partial sums
$$S_k = \sum_{i=1}^{k} x_i,$$
which is easily shown to be a Cauchy sequence, is convergent.
Orthonormal sequences
A sequence $\{e_k\}_k$ is orthonormal if and only if it is orthogonal and each $e_k$ has norm 1. An orthonormal basis for an inner product space V is an orthonormal sequence whose algebraic span is dense in V. The Gram-Schmidt process is a canonical procedure that takes a linearly independent sequence $\{v_k\}_k$ in an inner product space and produces an orthonormal sequence $\{e_k\}_k$ such that for each $n$, $\operatorname{span}\{v_1, \ldots, v_n\} = \operatorname{span}\{e_1, \ldots, e_n\}$. By the Gram-Schmidt orthonormalization process, one shows that any separable inner product space V has an orthonormal basis. Then Parseval's identity leads immediately to the following theorem:
Theorem: Let V be a separable inner product space and $\{e_k\}_k$ an orthonormal basis of V. Then the map
$$x \to \{\langle e_k, x \rangle\}_{k \in \mathbb{N}} \tag{33}$$
is an isometric linear map $V \to \ell^2$ with a dense image. This theorem can be regarded as an abstract form of Fourier series, in which an arbitrary orthonormal basis plays the role of the sequence of trigonometric polynomials. Note that the underlying index set can be taken to be any countable set (and in fact any set whatsoever, provided $\ell^2$ is defined appropriately). In particular, we obtain the following result in the theory of Fourier series:
Theorem: Let V be the inner product space $C[-\pi, \pi]$. Then the sequence (indexed on the set of all integers) of continuous functions
$$e_k(t) = (2\pi)^{-1/2} e^{ikt} \tag{34}$$
is an orthonormal basis of the space $C[-\pi, \pi]$ with the $L^2$ inner product. The mapping
$$f \to \left\{ \frac{1}{\sqrt{2\pi}} \int_{-\pi}^{\pi} f(t)\, e^{-ikt}\, dt \right\}_{k \in \mathbb{Z}} \tag{35}$$
is an isometric linear map with dense image. Orthogonality of the sequence $\{e_k\}_k$ follows immediately from the fact that if $k \ne j$, then
$$\int_{-\pi}^{\pi} e^{-i(j-k)t}\, dt = 0.$$
Normality of the sequence is by design, that is, the coefficients are chosen so that the norm comes out to 1. Finally, the fact that the sequence has a dense algebraic span, in the inner product norm, follows from the fact that the sequence has a dense algebraic span, this time in the space of continuous periodic functions on $[-\pi, \pi]$ with the uniform norm.
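Two of the constructions in this subsection lend themselves to a short numerical sketch: the Gram-Schmidt process on a few vectors of $\mathbb{R}^3$, and the orthonormality of the functions $e_k(t) = (2\pi)^{-1/2} e^{ikt}$. The input vectors, the midpoint-rule step count, and the indices 3 and -2 are illustrative assumptions, and the $L^2$ inner product is taken with the conjugate on the first argument.

```python
import cmath
import math

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(vectors):
    """Orthonormalize a linearly independent list of real vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for e in basis:
            coeff = inner(e, w)                 # component of w along e
            w = [wi - coeff * ei for wi, ei in zip(w, e)]
        length = math.sqrt(inner(w, w))         # non-zero by independence
        basis.append([wi / length for wi in w])
    return basis

def e_k(k, t):
    """The function e_k(t) = (2 pi)^(-1/2) exp(i k t)."""
    return cmath.exp(1j * k * t) / math.sqrt(2.0 * math.pi)

def l2_inner(f, g, n=2000):
    """Midpoint-rule approximation of the integral of conj(f) g on [-pi, pi]."""
    h = 2.0 * math.pi / n
    ts = (-math.pi + (i + 0.5) * h for i in range(n))
    return h * sum(f(t).conjugate() * g(t) for t in ts)

E = gram_schmidt([(1.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)])
same = l2_inner(lambda t: e_k(3, t), lambda t: e_k(3, t))        # close to 1
different = l2_inner(lambda t: e_k(3, t), lambda t: e_k(-2, t))  # close to 0
```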
2.2.3 OPERATORS ON INNER PRODUCT SPACES
Several types of linear maps A from an inner product space V to an inner product space W are of relevance:
Continuous linear maps, i.e. A is linear and continuous with respect to the metric defined above, or equivalently, A is linear and the set of non-negative reals {||Ax||}, where x ranges over the closed unit ball of V, is bounded.
Symmetric linear operators, i.e. A is linear and <Ax, y> = <x, Ay> for all x, y in V.
Isometries, i.e. A is linear and <Ax, Ay> = <x, y> for all x, y in V, or equivalently, A is linear and ||Ax|| = ||x|| for all x in V. All isometries are injective. Isometries are morphisms between inner product spaces, and morphisms of real inner product spaces are orthogonal transformations (compare with orthogonal matrix). This essentially states that maps that are isometric preserve distances.
From the point of view of inner product space theory, there is no need to distinguish between two spaces which are isometrically isomorphic. The spectral theorem provides a canonical form for symmetric, unitary and more generally normal operators on finite dimensional inner product spaces. A generalization of the spectral theorem holds for continuous normal operators in Hilbert spaces.
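For the finite-dimensional real case, two of these operator types can be illustrated with 2 × 2 matrices. In this sketch the rotation angle, the symmetric matrix, and the test vectors are arbitrary choices: a rotation serves as an example of an isometry and a symmetric matrix as an example of a symmetric operator.

```python
import math

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def apply(A, x):
    """Matrix-vector product A x for a matrix given as a list of rows."""
    return tuple(sum(a * xi for a, xi in zip(row, x)) for row in A)

theta = 0.7
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
symmetric = [[2.0, 1.0],
             [1.0, 3.0]]

x, y = (1.0, -2.0), (0.5, 4.0)

# Isometry: <Ax, Ay> = <x, y>, and hence ||Ax|| = ||x||.
isometry_gap = inner(apply(rotation, x), apply(rotation, y)) - inner(x, y)
norm_gap = (math.sqrt(inner(apply(rotation, x), apply(rotation, x)))
            - math.sqrt(inner(x, x)))

# Symmetric operator: <Ax, y> = <x, Ay>.
symmetry_gap = inner(apply(symmetric, x), y) - inner(x, apply(symmetric, y))
```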
2.3 DEGENERATE INNER PRODUCTS
If V is a vector space and < , > a semi-definite sesquilinear form, then the function ||x|| = <x, x>^{1/2} makes sense and satisfies all the properties of a norm except that ||x|| = 0 does not imply x = 0. (Such a functional is then called a semi-norm.) We can produce an inner product space by considering the quotient W = V/{ x : ||x|| = 0}. The sesquilinear form < , > factors through W.
This construction is used in numerous contexts. The Gelfand-Naimark-Segal construction is a particularly important example of the use of this technique. Another example is the representation of semi-definite kernels on arbitrary sets.
2.4 MERCER’S THEOREM
Mercer's theorem provides a characterization of when a function $K(x, z)$ is a kernel. Suppose we are given a finite input space $X = \{x_1, x_2, \ldots, x_n\}$ and a symmetric function $K(x, z)$ on $X$. Consider, then, the matrix $\mathbf{K} = (K(x_i, x_j))_{i,j=1}^{n}$. Since $\mathbf{K}$ is symmetric there is an orthogonal matrix $V$ such that $\mathbf{K} = V \Lambda V'$, where $\Lambda$ is a diagonal matrix containing the eigenvalues $\lambda_t$ of $\mathbf{K}$, with corresponding eigenvectors $v_t = (v_{ti})_{i=1}^{n}$ the columns of $V$. Now assume that all the eigenvalues are non-negative and consider the feature mapping
$$\varphi: x_i \mapsto \left( \sqrt{\lambda_t}\, v_{ti} \right)_{t=1}^{n} \in \mathbb{R}^n, \quad i = 1, \ldots, n. \tag{36}$$
We now have that
$$\langle \varphi(x_i), \varphi(x_j) \rangle = \sum_{t=1}^{n} \lambda_t v_{ti} v_{tj} = (V \Lambda V')_{ij} = \mathbf{K}_{ij} = K(x_i, x_j),$$