RAMANATH, RAJEEV. A Framework for Object Characterization and Matching in Multi– and Hyperspectral Imaging Systems. (under the supervision of Dr. Wesley Snyder)
The idea of shape has been a field of scientific study since the time of Galileo. Most shapes that have been studied until now have been those that are “conceivable” by the human mind. This has restricted the study of shape by the image processing community to the visible range of the spectrum (an otherwise very small range). Perception of shape in the realm of the spectrum outside of the visible range has not received much attention. However, with the recent advancement in imaging systems (multi- and hyperspectral) that can capture images over a wide spectral range, it is only natural to expect this field to receive notice by the imaging community. In this work, the idea of “shape” in multi- and hyperspectral imaging scenarios is studied and its paradigms explored. Notions of the hyperspectral cube are borrowed from the remote sensing community as a means of representing this high-dimensional data.
In this work, edges of two types are used: one that makes use of the vector-valued data in the image and another that treats each spectral band individually. The edge-sets are used to extract spatio-spectral shape signatures of objects, which are in turn used for extracting canonical views of objects and for classification using three dimensionality reduction techniques: Principal Component Analysis, Independent Component Analysis and Non-negative Matrix Factorization. As an extension to edge-based decompositions, we also use view-based techniques for classification. The results obtained by using a combination of spatial and spectral information are compared with those resulting from conventional single-band techniques, showing considerable improvement.
A FRAMEWORK FOR OBJECT CHARACTERIZATION AND
MATCHING IN MULTI– AND HYPERSPECTRAL IMAGING
SYSTEMS
by
Rajeev Ramanath
a dissertation submitted to the graduate faculty of north carolina state university
in partial fulfillment of the requirements for the degree of
doctor of philosophy
department of electrical and computer engineering
raleigh August 13 2003
approved by:
Biography
Rajeev Ramanath was born in February 1977 in New Delhi, India. Most of his childhood was spent in a quiet town called Gorakhpur, doing essentially nothing – which is what every child needs to do in his/her childhood. Seeing that he was not getting anywhere, his parents decided to move to a big city – Madras! Moving from northern India to southern India was a big change – given that he hardly knew any Tamil (the fact that it is his mother-tongue is a whole different issue). Adapting took some time and before he knew it he went back to northern India for his collegiate education. Four years of “education” and he received the Bachelor of Engineering degree in Electrical and Electronics Engineering from Birla Institute of Technology and Science, Pilani, India in 1998. Having shunted up and down so much, he decided to move due west to Raleigh, across seven seas (ok, so it is two oceans and not seven seas)! Life took a strange turn with him actually taking an amazing amount of interest in research. He obtained the Master of Science degree in Electrical Engineering from North Carolina State University in 2000. In the process, he wrote a thesis titled “Interpolation Methods for Bayer Color Arrays.” Ever since then, he has been working on various research topics that have led to a few journal and conference papers, which come with the associated lifestyle of “more on one’s plate than one can eat.” However, this has taught him one thing (apart from all the obvious things he learnt in the process of obtaining graduate education and training) – life never gets simpler. The academic freedom that he has had over the past few years has led to a varied set of research interests that span computer vision and image processing problems, digital color cameras, color science and automatic target recognition.
Acknowledgments
One of the toughest things that accompany any accomplishment is the part called “acknowledgments.” It is unjust to just about everyone who is not mentioned here – which leads me to the obvious deduction – include all those who have made a difference. Although logically sound and conceivably simple, this is far from possible. The next obvious solution is to include those whom I can remember at this point in time. So to all those who did make a difference and did not get mentioned here, my humblest apologies and heartfelt gratitude.
Merely saying thanks to Dr. Snyder would be unfair. Through the years, he has been an excellent mentor, a very accommodating boss and above all a good friend. It is very simple with him – bring a complex problem down to its simplest form – the one-dimensional case – if you can figure out a solution, generalize. If not, there is something fundamentally wrong and needs reevaluation. Iterate till you do not reach this branch of the decision tree! Dr. Bilbro is yet another person who I find hard to thank in words – a brief description of our relationship may be given as follows: “Have a technical problem? Approach Dr. Bilbro. He will think of your problem for about 10 minutes and have a dozen ideas, none of which you will understand at that time. You make notes and go back and go through an extensive period of ‘studying’ and voila, you have reached a solution.” It is always a pleasure working with him. My sincere thanks to Drs. Khorram, Krim, Smith and Trussell, who have been wonderful mentors and have greatly supported my academic development. All these faculty members have given me the ability to ‘look at someone else’s problem in context of the things that I know,’ and I am greatly indebted to them for this foresight. Although not directly related to this work, the motivation and confidence provided by Prof. Kuehni has helped me a great deal in realizing my current potential.
My parents, Ramanath and Meera, my sister, Ramya and my grandparents, although need to be included by default, are sincerely thanked simply for their unconditional love and support through the years. My aunt, Margaret and her mother, who have literally taken
a ton.
Luckily, through the process of my graduate education, I had the support of my (now) fiancée, Sumathi. It is one thing to bear with someone’s idiosyncrasies and completely another to bear with mine. She has done the more difficult of the two by being by my side through this entire ‘process’ – thanks a lot.
Thanks to my friends and office-mates, who have made this journey an enjoyable and memorable one. The one thing you learn by studying in a setup as that in this school, is diversity. Of course, I would not be in this country if not for the efforts of the International Student Office – my thanks to them. Humble thanks to all the administrative folk whose efforts go unnoticed, until it is crisis time! I have to thank my high school teachers, for having instilled in me the fear of failure and replaced it with hard work and self-confidence.
List of Figures
List of Tables
1 Introduction
1.1 Object Recognition Philosophies
1.2 Shape in tone?
2 Notations
3 Background
3.1 Object Recognition Using Edges
3.2 Object Recognition Using Views
3.2.1 Object Detection Using Grayscale Images
3.2.2 Object Detection using Spectral Information
3.2.3 Canonical Views
3.2.4 Eigenviews
4 View-based Techniques
4.1 Eigenviews
4.2 Eigenviews
4.2.1 Principal Component Analysis
4.2.2 Independent Component Analysis
4.2.3 Non-negative Matrix Factorization
4.2.4 Classification
4.3 Noise equivalent Dimensions
4.4 Perturbation Theory and Eigenviews
5 Edge-based Techniques
5.1 Edge Pixels
5.1.1 Two-dimensional edges
5.1.2 Three-dimensional edges
5.2 Comparing Edge Pixels
5.2.1 Shape Signature
5.2.2 Other shape encodings vs. Shape Signatures
5.4 Eigenviews using edge information
6 Image Database
6.1 Synthetic Image database
6.1.1 Reflection
6.1.2 Emission
6.1.3 Combining Reflection and Emission
6.2 Real-world database
6.3 COIL-100 database
6.4 Normalizing the images
7 Experiments and Results
7.1 Edge-Based Techniques: 2D edges
7.1.1 Canonical views
7.1.2 Eigenviews
7.2 Edge-Based Techniques: 3D edges
7.2.1 Canonical views
7.2.2 Eigenviews
7.3 View-Based Techniques
7.3.1 Eigenviews
7.4 How does multispectral data performance compare to grayscale performance?
7.5 Noise Equivalent Dimensions
7.6 Perturbation Theory
8 Conclusions and Future Work
List of References
A First Order Perturbation Theory
B Comparative Analysis of SRM by SVM and Nearest Neighbor Rule
B.1 Introduction
B.2 Support Vector Machine Classification and Nearest Neighbor Rule
B.3 NNSRM Classification
B.4 Experiments and Results
C Proof of Properties of the Shape Signature
List of Figures
1.1 Various approaches to object detection
1.2 A hypothetical solid object (a) a solid cube (b) the temperature profile of the object, top to bottom, cold to hot (c) a visible band image (d)-(f) different infrared band images of the same object
1.3 Two images taken from the Fort Carson dataset [29] showing a M133-901 tank in visible-band camouflaged and occluded (a) in the visible band, notice that the tank is not discernible from the background (b) in the long-wave IR clearly highlighting the tank
2.1 Viewing Hemisphere illustrating the three spherical coordinates that describe the camera location with respect to the object (at the origin)
4.1 Plot of NED for various values of M′ for a subset of the multispectral dataset with additive noise of SNR = 20 dB
4.2 Original noise free data (in black) with Gaussian distribution with variance 1 along the first dimension and variance 16 along the second. Gaussian noise (in red) of variance 9 along the first dimension and 1 along the second is added to this data.
5.1 (a) Rendering of a normalized seven-band multispectral image of a synthetic DC-10 airplane (b) quiver plot showing the directions of the eigenvectors corresponding to the larger eigenvalue at each pixel (c) a plot of the larger eigenvalue at each pixel (d) quiver plot showing the directions of the eigenvectors corresponding to the smaller eigenvalue at each pixel (e) a plot of the smaller eigenvalue at each pixel (f) Edge image generated by thresholding the intensity image, F(θ)
5.2 (a) Image I, of a cup from an arbitrary location on the viewing hemisphere (b) corresponding edge image E (c) the shape signature S obtained from E
5.3 Illustration of the technique used to find the best in-plane rotation (a) multispectral montage of the reference image edges (b) the ρ−ψ signature of the first band (c) multispectral montage of the second image edges (d) the corresponding ρ−ψ signature (e) Second image rotated by the angle that “best” fits the first spectral bands of the above two images.
5.4 […] canonical using the ρ−ψ representation (c) the most canonical view chosen by the algorithm (d) the least canonical view chosen by the algorithm.
5.5 Sample image from the database (a) showing a montage of the normalized seven bands for an arbitrary view of Tank 2 (b) the corresponding edges in each spectral band (c) the result of blurring these binary edge images to obtain a continuous valued image.
6.1 Tank-1 shown in the 7-band reflectance image captured using the simulator. The wavelength increases from left to right. Observe that there is no reflected energy outside of the visible band.
6.2 Tank-1 shown in the 7-band emission image generated using the simulator. The wavelength increases from left to right. Observe that there is little emitted energy outside of the thermal bands.
6.3 Objects used in the database shown in a visible band rendering. Left to right, the objects will be referred to as DC-10, 747, f-15, biplane, jeep, tank-0, tank-1 and tank-2.
6.4 Spectral Power Distributions of the D55 illuminant superimposed upon the relative reflectance functions of the various materials used.
6.5 Relative sensitivities of the filters used in the data capture experiment
6.6 The camera setup used to capture 6-band real data. The larger camera pointed to by the green arrow is the infrared camera which has the capability of mounting 1” filters, captured data through an RS-170 cable to a frame-grabber on a PC. The smaller camera pointed to by the red arrow is the RGB camera that has a 10-bit digital readout interfaced to a PC frame-grabber.
6.7 The four objects used in the first real-world database. The objects are all 90° elbow pipes made of different materials, left to right, black iron steel, Galvanized steel, PVC plastic and rubber. The montage shown here consists of a six-band image, from left to right, in increasing wavelength.
6.8 Eight of the 100 objects in the COIL-100 database shown at the 0 azimuth.
7.1 Montage of the normalized multispectral image chosen as canonical for the objects in the synthetic dataset using 2D edge information and processed using the ρ−ψ representation.
7.2 Color composite images of the views chosen as canonical for COIL dataset using 2D edge information and processed using the ρ−ψ representation.
7.3 Montage of the multispectral images of the views chosen as canonical for pipes dataset using 2D edge information and processed using the ρ−ψ representation.
7.4 […] using the ρ−ψ−ζ representation.
7.5 Color composite images of the views chosen as canonical for COIL dataset using 3D edge information and processed using the ρ−ψ−ζ representation.
7.6 Montage of the multispectral images of the views chosen as canonical for pipes dataset using 3D edge information and processed using the ρ−ψ−ζ representation.
7.7 Scatter plot of the training data (projected onto three eigenviews) used for the classification of images in the COIL dataset (a) using PCA (b) using ICA (c) using NMF as dimensionality reduction techniques on the object views.
7.8 Performance of the eigenview system with varying number of eigenviews (a) using grayscale images (b) using 2D edges (c) using 3D edges (d) using views of objects in the Synthetic database
7.9 Performance of the eigenview system with varying number of eigenviews (a) using grayscale images (b) using 2D edges (c) using 3D edges (d) using views of objects in the COIL database
7.10 Performance of the eigenview system with varying number of eigenviews (a) using grayscale images (b) using 2D edges (c) using 3D edges (d) using views of objects in the Pipes database
7.11 Plot of NED for various values of SNR and various choices of the original dimensionality of the eigenview system (M′).
B.1 Chessboard pattern used to generate synthetic data. Black squares correspond to class label −1 and white squares correspond to label +1
C.1 Two edge-sets differing only by a scale parameter
C.2 Two edge-sets differing only by an in-plane rotation
List of Tables
4.1 Error Matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using PCA on the synthetic database consisting of eight objects. OE and CE stand for Omission and Commission Error, respectively. The lower-rightmost cell indicates the overall Level-2 classifier accuracy.
4.2 Eigenvalues in the presence of additive white Gaussian noise. See text for details.
5.1 Error Matrix tabulating the performance of a Nearest Neighbor classifier on an eigenview system with 78 eigenimages on the synthetic database (3D-edges only) consisting of eight objects. OE and CE stand for Omission and Commission Error, respectively. The lower-rightmost cell indicates the overall Level-2 classifier accuracy.
6.1 Spectra assignment to the surfaces of the objects used in the database
7.1 Error Matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using PCA on the 2D edges of object views in the synthetic database consisting of eight objects.
7.2 Error Matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using ICA on the 2D edges of object views in the synthetic database consisting of eight objects.
7.3 Error Matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using NMF on the 2D edges of object views in the synthetic database consisting of eight objects.
7.4 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using PCA on the 2D edges of object views in the synthetic database consisting of eight objects.
7.5 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using ICA on the 2D edges of object views in the synthetic database consisting of eight objects.
7.6 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using NMF on the 2D edges of object views in the synthetic database consisting of eight objects.
7.7 […] views in the COIL database consisting of eight objects.
7.8 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using ICA on the 2D edges of object views in the COIL database consisting of eight objects.
7.9 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using NMF on the 2D edges of object views in the COIL database consisting of eight objects.
7.10 Classification accuracies using the LOO procedure on 2D edges of object views in the Pipes database consisting of four objects with 64 eigenviews.
7.11 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using PCA on the 3D edges of object views in the synthetic database consisting of eight objects.
7.12 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using ICA on the 3D edges of object views in the synthetic database consisting of eight objects.
7.13 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using NMF on the 3D edges of object views in the synthetic database consisting of eight objects.
7.14 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using PCA on the 3D edges of object views in the synthetic database consisting of eight objects.
7.15 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using ICA on the 3D edges of object views in the synthetic database consisting of eight objects.
7.16 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using NMF on the 3D edges of object views in the synthetic database consisting of eight objects.
7.17 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using PCA on the 3D edges of object views in the COIL database consisting of eight objects.
7.18 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using ICA on the 3D edges of object views in the COIL database consisting of eight objects.
7.19 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using NMF on the 3D edges of object views in the COIL database consisting of eight objects.
7.20 Classification accuracies using the LOO procedure on 3D edges of object […]
7.21 […] eigenview system with 78 eigenviews generated using PCA on the object views in the synthetic database consisting of eight objects.
7.22 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using ICA on the object views in the synthetic database consisting of eight objects.
7.23 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using NMF on the object views in the synthetic database consisting of eight objects.
7.24 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using PCA on the object views in the synthetic database consisting of eight objects.
7.25 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using ICA on the object views in the synthetic database consisting of eight objects.
7.26 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using NMF on the object views in the synthetic database consisting of eight objects.
7.27 Error matrix tabulating the performance of a NN classifier on an eigenview system with 3 eigenviews extracted using PCA on the object views in the COIL database consisting of eight objects.
7.28 Error matrix tabulating the performance of a NN classifier on an eigenview system with 3 eigenviews extracted using ICA on the object views in the COIL database consisting of eight objects.
7.29 Error matrix tabulating the performance of a NN classifier on an eigenview system with 3 eigenviews extracted using NMF on the object views in the COIL database consisting of eight objects.
7.30 Classification accuracies using the LOO procedure on object views in the Pipes database consisting of four objects with 64 eigenviews.
7.31 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using PCA on the grayscale object views in the synthetic database consisting of eight objects.
7.32 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using ICA on the grayscale object views in the synthetic database consisting of eight objects.
7.33 Level-1 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews generated using NMF on the grayscale object views in the synthetic database consisting of eight objects.
7.34 […] views in the synthetic database consisting of eight objects.
7.35 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using ICA on the grayscale object views in the synthetic database consisting of eight objects.
7.36 Level-2 error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using NMF on the grayscale object views in the synthetic database consisting of eight objects.
7.37 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using PCA on the grayscale object views in the COIL database consisting of eight objects.
7.38 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using ICA on the grayscale object views in the COIL database consisting of eight objects.
7.39 Error matrix tabulating the performance of a NN classifier on an eigenview system with 5 eigenviews extracted using NMF on the grayscale object views in the COIL database consisting of eight objects.
7.40 Classification accuracies using the LOO procedure on grayscale object views in the Pipes database consisting of four objects with 64 eigenviews.
7.41 Classification accuracies summarized for the various datasets
7.42 Reconstruction Error for the noisy synthetic database eigensystem with and without estimates using first-order perturbation theory for M′ = 64.
7.43 Reconstruction Error for the noisy COIL database eigensystem with and without estimates using first-order perturbation theory for M′ = 64.
7.44 Reconstruction Error for the noisy Pipes database eigensystem with and without estimates using first-order perturbation theory for M′ = 64.
B.1 Performance results for the polynomial kernel applied to synthetic data. Note: time in seconds, obtained using Win32 performance counters, truncated to four decimal places.
B.2 Performance results for the polynomial kernel applied to the Wisconsin Breast Cancer data. Note: time in seconds, obtained using Win32 performance counters, truncated to four decimal places.
B.3 Performance results for the polynomial kernel applied to the Monks data. Note: time in seconds, obtained using Win32 performance counters, truncated to four decimal places.
B.4 Performance results for the RBF kernel applied to the Wisconsin Breast Cancer data. Note: time in seconds, obtained using Win32 performance counters, truncated to four decimal places.
Introduction
Classical object recognition techniques operate on either single-band or, more recently, three-band imagery. They do not extend their philosophies to a higher number of spectral bands. The primary reason inhibiting such a development in the past has been the lack of a system that would deliver more than a few bands of spectral data. With the increasing popularity of multi-spectral and hyper-spectral imaging systems like the AVIRIS sensor, the IKONOS satellite and HYDICE imagery and the recently announced single-aperture multi-channel camera by Electro-Optics Industries, to name a few, there is a compelling need to make use of both spatial and spectral information in recognizing objects.
1.1 Object Recognition Philosophies
The number of published articles and books on object recognition is so large that it is crucial to distinguish the philosophies used in these works. The graph in Fig. 1.1 puts the various approaches to object detection in existing literature in the context of the philosophies used. Most of the literature points in two distinct directions – one that uses shape information and the other that uses spectral information. There is very little published literature on the combination of these two otherwise independent approaches in a common framework. Under edge-based techniques, there are two philosophies, one based on the views of objects (using landmark points) and the other on the features of objects (reconstruction of the object’s 3D shape). These two philosophies have a lot in common, but the intent of each of their sub-categories is different. Landmark-based techniques are used mostly under controlled settings where the objects are defined by their salient points [35]. Landmark-free approaches are applicable under imaging scenarios wherein we use a view of an object to “compare” it with what we know about the object. The reconstruction of shape is important mostly in machine vision based applications (Marr’s 2½D representations [74]) where the intent is to reconstruct the shape of the object. Reconstruction-free techniques are used in situations wherein the intent is comparison of views of objects without the computational burden of reconstruction. In landmark-based approaches the object is a set of landmark (important) points – this approach is well-studied and understood [124, 130, 35, 37]. Landmark-free approaches are gaining popularity because of their applicability to shapes in images [2, 49, 96, 83]. All of these techniques may be further divided into model-based (we store a database of known objects for comparison) and model-free approaches.

Figure 1.1: Various approaches to object detection. [Tree diagram; node labels: Object Recognition, View-based, Edge-based, Single Band, Multi Spectral, Hyper Spectral, Landmark based, Landmark free, Reconstruct shape, Reconstruction free.]
be focussed upon in this work.
1.2 Shape in tone?
When considering information in one spectral band, the notion of shape is well understood – we deal with this notion in our everyday lives. When considering information from a multitude of spectral bands, it is not clear what “shape” means. As a matter of fact, it is not clear if there even is one common notion for shape in multiple bands. Fig. 1.2 illustrates
Figure 1.2: A hypothetical solid object (a) a solid cube (b) the temperature profile of the object, top to bottom, cold to hot (c) a visible band image (d)-(f) different infrared band images of the same object
the spatial and spectral information? Shape analysis with regard to the visible part of the spectrum (and individual bands) has undergone extensive scrutiny in the past years, but combining the information available from different parts of the spectrum has not received much study. In the example shown here, the parts of the object seen in IR images were contained within the expanse of the object seen in the visible band. This, however, need not be the case – an object may be camouflaged in the visible band but still be seen in the IR bands. There is the possibility of conflicting sources of “information” – one sensor may say that there is no object in the scene while the other sensor may say otherwise (see Fig. 1.3). In the case of visible band color imaging systems (using red, green and blue channels), this issue does not arise because of the high correlation between the three color channels (unless we have monochromatic wavelengths in the scene, which is hardly ever the case). The location of an edge in one channel is the same for all three channels (pathological cases always do exist though). However, other parts of the spectrum may not be so well-behaved.

Figure 1.3: Two images taken from the Fort Carson dataset [29] showing a M133-901 tank, camouflaged and occluded (a) in the visible band, notice that the tank is not discernible from the background (b) in the long-wave IR, clearly highlighting the tank
Notations
To aid in the reading of this document, this chapter presents the mathematical notation that will be used in this work.
All images of an object will be taken from the viewing hemisphere as shown in Fig. 2.1. The object, $O$, is centered in the hemisphere and the camera (imaging system) is positioned at various locations on this hemisphere, looking along radial lines at the object. This reduces the number of degrees of freedom from six to two (the parameters that change are $\theta$ and $\phi$) – $r$ is kept constant. Let $I : \mathbb{R}^2 \to \mathbb{R}^K$ be a continuous hyperspectral image that has $K$ bands defined at each pixel location. Denote by $x_l$, $l = 1, 2$, the spatial coordinates, which will be referred to as the vector $\bar{x}$ for notational simplicity. The images resulting from all possible views of an object $O$ may be denoted by a set of images

$$Z = \left\{ I_O(\theta, \phi) \right\}_{\theta=0,\,\phi=0}^{\theta=\pi,\,\phi=2\pi}. \tag{2.1}$$

Figure 2.1: Viewing Hemisphere illustrating the three spherical coordinates that describe the camera location with respect to the object (at the origin). [Panel (a) is labeled with $\theta$ and $r$; panel (b) with $\phi$ and $r$.]
Denote by $Y$ a finite subset of $Z$. Sampled images are of size $R \times C \times K$, where $R$ and $C$ denote the spatial coordinates and $K$ denotes the spectral slice ($R$, $C$, $K$ are integers). Denote by $I_k$ the $k$th band of the image. A forward model for the image formation is given by

$$I_O(\theta, \phi) = H \otimes I_O^{(0)}(\theta, \phi) \odot N, \quad \theta \in [0, \pi],\ \phi \in [0, 2\pi] \tag{2.2}$$

where $H$ denotes the point spread function of the system, the superscript $(0)$ denotes a “true” image, $N$ denotes a random noise process; the operator $\otimes$ denotes a spatial convolution operation and $\odot$ denotes either an additive (for signal-independent noise) or a multiplicative process (for signal-dependent noise). For notational simplicity, let us drop the subscript $O$. Without loss of generality, let us assume that we have $M$ such images that are elements of $Y$, taken from $M$ unique views (let the hemisphere be sampled uniformly). Each of the sampled images, $I_m$, $m = 1 \ldots M$, may be considered to be a multi- or hyperspectral cube with the spatial dimensions forming two axes and the spectral dimension forming the third axis. It is common in the remote sensing community to call this representation a hyperspectral cube.¹
The sampled images, Im, m = 1. . . M may be represented in a lexicographic form as a
vector ¯Im2. Eqn. 2.2 may now be written as
¯
Im =HI¯m(0) ¯ N,¯ m= 1. . . M (2.3)
whereHis an RCK×RCK matrix which is derived from H. In this work, we shall restrict ourselves to signal independent additive noise and an identity point-spread function, which replaces H with an identity matrix and ¯ with a + sign. The above equation may now be written as
¯
Im = ¯Im(0)+ ¯N, m= 1. . . M. (2.4)
¹ Note that the subscript k denotes the kth spectral band of the image, and an added subscript m will denote the kth spectral band of the mth image in the set.
The M column vectors of the images may be stacked into an RCK×M matrix denoted by $\mathbf{I}$. Denote by $\hat{I}$ the lexical representation of the mean image calculated from these M images, and by $\bar{I}_{m0}$ the zero-mean images obtained by subtracting $\hat{I}$ from each of the $\bar{I}_m$ images (vectors). Denote by $\mathbf{I}_0$ the matrix formed by stacking all the zero-mean images in the form of column vectors.
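The stacking described above can be illustrated with a small NumPy sketch. The array sizes, the random "true" cubes and the noise level are hypothetical values chosen only for illustration; the variable names (`cubes`, `I`, `I_hat`, `I0`) are not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: R x C pixels, K bands, M views.
R, C, K, M = 4, 5, 3, 6

# M "true" cubes plus signal-independent additive noise, as in Eqn. 2.4.
true_cubes = rng.random((M, R, C, K))
noise = 0.01 * rng.standard_normal((M, R, C, K))
cubes = true_cubes + noise

# Lexicographic form: each cube becomes a column vector of length R*C*K.
I = cubes.reshape(M, R * C * K).T          # shape (RCK, M)

# Mean image and zero-mean images stacked as columns.
I_hat = I.mean(axis=1, keepdims=True)      # shape (RCK, 1)
I0 = I - I_hat                             # shape (RCK, M)
```

Any fixed ordering of the R×C×K entries would serve as a lexicographic form; row-major order is simply NumPy's default here.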
Image bases will be denoted by $\bar{v}_m$, the components of each image $\bar{I}_m$ will be denoted by $I_{im}$, $i = 1 \ldots RCK$, and the components of a noise image will be denoted by $N_i$, $i =$
Background
A suitable starting point for looking into object recognition techniques in literature is to subdivide them based on their philosophies as mentioned earlier in Chapter 1.
3.1 Object Recognition Using Edges
The earliest writings on shape were by Galileo in 1638 [39], who was attempting to describe the relationship between bones in larger animals and those of smaller animals. This was later further refined by Thompson from a biological point of view in 1917 [114]. However, one of the earliest works on rigorously matching shapes was reported by Kendall in 1977 [59]. He defined shape as the only piece of geometric information left when location, scale and rotational effects are filtered out from an object. A plethora of work has been done in the years since in utilizing this formalism to perform shape matching. The work performed in this aspect of formalizing shape matching may be found in reference [35].
The Automatic Target Recognition (ATR) community has an increasing need for object recognition. As a matter of fact, the driving force for this work is the need to develop a “smart” ATR technique that takes both spatial and spectral information into account in its decision-making process. In [8, 9], Bhanu defines the problems facing ATR and tabulates the variety of sensors and platforms in terms of ATR applications. A concise overview of the problems involved in developing real-world ATR systems is given from an application perspective rather than a mathematical one. The key issues facing ATR are robustness, validation, software/hardware issues, computational needs and man-machine interfaces. As of 1993, almost all work was directed towards model-based ATR, with models in the sense of sensor models, target models (spectral and geometric), and clutter and background models.
In [99], Sadjadi outlines the critical issues and current approaches used in the ATR community. The article partitions the research into model-based object recognition techniques and multisensor fusion techniques. The basis for the model-based recognition paradigm is the observation that, by preserving as much information received about an object as possible and by using intelligent reasoning, a better interpretation of the scene containing the objects of interest can be made [98]. Thus, by better understanding the direct problem, the inverse problem (the task of recognizing objects from their received signatures) becomes more feasible.
Understanding the direct problem involves understanding the physics of wave interactions among the objects, the atmosphere (which is the transmission path), the background vegetation, and the sensors. Then, by translating this understanding into a set of models and uncovering the rules of model interaction, the direct problem can become manageable. The inverse problem, however, involves not only these models and their interaction rules, but also the reasoning mechanism, the search problem, and the problem of dealing with uncertain and partial information. Thus, the issues of signature variability, occlusion and countermeasures, and high false alarm rates are addressed. However, what is lacking is the integration of these models into a cohesive end-to-end model by considering the effects of their interactions. The work in this domain is still preliminary.
Another automatic object recognition paradigm is based on the premise that the more sensory data that is available from the target of interest, the better the system performance. This is intuitively obvious for sensors that have complementary properties. Due to numerous limitations of single-sensor ATR systems, there has been a move toward multisensor targeting systems and hence toward the problem of correlating and combining data generated from multiple sensors. This problem has four components: establishing correspondence among data (popularly known as image registration) relating to common targets; deciding which sensor should be more “trustworthy” than the others based on scene, scenario, and environmental conditions (degrees of confidence of each sensor); combining data at various levels in the processing hierarchy, each level with a different degree of uncertainty; and finding the optimum fusion set.
silhouettes of targets in a military scenario was proposed by Sadjadi [100]. He introduced the concept of a chain code histogram. The silhouette of a target is broken down into a chain code that is subsequently binned into a histogram. This histogram is considered as a “signature” of the overall shape of the object. This histogram has two very useful properties: scale variation in the image domain is equivalent to a vertical “DC” shift in the chain code histogram domain and changes in orientation are equivalent to horizontal cyclic shifts in the histogram. The philosophy is used later in a different paper [101] in which Sadjadi considers the scenario of a chain code histogram in the case of occluded objects.
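As a rough illustration of the cyclic-shift property mentioned above, the following sketch builds a chain code histogram for a small closed boundary and for the same boundary rotated by 90 degrees. It assumes an 8-direction Freeman code and a hand-made rectangular boundary; both are illustrative choices and not details taken from [100]. With an 8-direction code the shift is exact only for rotations by multiples of 45 degrees.

```python
import numpy as np

# Freeman 8-connected directions: code k corresponds to the step (dx, dy).
STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def chain_code_histogram(points):
    """Histogram of Freeman codes along a closed 8-connected boundary."""
    pts = np.asarray(points)
    steps = np.roll(pts, -1, axis=0) - pts     # step from each point to the next
    hist = np.zeros(8, dtype=int)
    for dx, dy in steps:
        hist[STEPS.index((int(dx), int(dy)))] += 1
    return hist

# Boundary of a 3 x 2 rectangle traversed counter-clockwise (hypothetical data).
rect = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1),
        (3, 2), (2, 2), (1, 2), (0, 2), (0, 1)]
# The same boundary rotated by 90 degrees: (x, y) -> (-y, x).
rect_rot = [(-y, x) for x, y in rect]

h = chain_code_histogram(rect)
h_rot = chain_code_histogram(rect_rot)
# A 90-degree rotation moves every code by two bins, cyclically.
shift_matches = np.array_equal(h_rot, np.roll(h, 2))
```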
One of the most popular techniques used today to match binary edge maps to edge maps in an image is the Hausdorff distance measure [49, 83, 82]. For two finite point sets A and B, the Hausdorff distance H(A, B) is given by
$$H(A, B) = \max\{h(A, B),\, h(B, A)\} \tag{3.1}$$
where
$$h(A, B) = \max_{a \in A} \min_{b \in B} \|a - b\| \tag{3.2}$$
and $\|\cdot\|$ is the L2 norm (although other norms could certainly be used). $h(\cdot)$ is called the “directed Hausdorff distance.” Clearly, this distance is sensitive to outliers; to circumvent this problem, the quantile version of the directed distance is used,
$$h(A, B) = f^{\mathrm{th}}_{a \in A} \min_{b \in B} \|a - b\| \tag{3.3}$$
where $f^{\mathrm{th}}_{a \in A}$ gives the fth quantile value of the operand over the set A. This is much more robust to outlier points that do not form a part of the general “shape” of the point sets being matched. Clearly, this is a computationally intensive search process. To this end, Rucklidge [96] proposes a solution based on the distance transform and on rasterizing the search process.
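A brute-force sketch of Eqns. 3.1 to 3.3 is given below for small point sets. It computes all pairwise distances directly rather than using the distance-transform approach of [96]. The function names and the test point sets are hypothetical, and `np.quantile` (with its default linear interpolation) is used as a stand-in for the ranked-value quantile.

```python
import numpy as np

def directed_hausdorff(A, B, f=1.0):
    """f-th quantile over a in A of the distance from a to its nearest b in B.

    f = 1.0 gives the classical directed distance of Eqn. 3.2.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise L2 distances, shape (len(A), len(B)).
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return float(np.quantile(nearest, f))

def hausdorff(A, B, f=1.0):
    """Symmetric distance of Eqn. 3.1 built from the two directed distances."""
    return max(directed_hausdorff(A, B, f), directed_hausdorff(B, A, f))

square = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
with_outlier = np.vstack([square, [[10.0, 10.0]]])

d_max = hausdorff(square, with_outlier)            # dominated by the outlier
d_q75 = hausdorff(square, with_outlier, f=0.75)    # outlier ignored
```

The single outlier at (10, 10) determines the classical distance, while the 0.75 quantile version reports the two sets as coincident.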
$(x_2, y_2)$ defined as $|x_2 - x_1| + |y_2 - y_1|$, the measure incorporates the gray-level values in the measure. The new measure for distance between edge pixels is given by $|x_2 - x_1| + |y_2 - y_1| + |f_2 - f_1|$, where $f_i$ is the gray-level value in the original image.
The spatial dependence of color experiences in humans has been well established since the 1960s when Edwin Land described what is now popularly known as the “retinex theory” [68] and it is easy to demonstrate that the same applies for the spatial dependency of signals in multi-spectral images. The idea of combining geometry and color into one framework is something that is still not well-resolved in published literature.
3.2 Object Recognition Using Views
3.2.1 Object Detection Using Grayscale Images
Rucklidge [97] provides a framework for robust matching of grayscale images. In his setup, a known pattern (template) is matched to regions in the image by searching the transformation space efficiently using a tree search. A grayscale image of the template is matched using a sum of squared differences measure to find the best possible match under geometric transformations. This could be easily extended to a measure that incorporates vector-valued pixels (multi– or hyperspectral) for an image matching technique. This work is closely related to Rucklidge’s past work on using the Hausdorff distance [49, 96] as a measure for matching binary images composed of edge pixel locations. The novelty of the approach in [97] is that it searches the space of geometric transformations more efficiently than earlier methods when finding the best transformation.
3.2.2 Object Detection using Spectral Information
target in each channel is the feature considered. This is relatively more robust, as it is impossible to accurately extract absolute spectral characteristics. Even if this were done, the characteristics are subject to change for a variety of reasons: weather changes, better enemy camouflage, etc. Much of the published literature uses spectral information at a sub-pixel or a pixel level [111, 47, 89, 108, 129]. At the object level, spectral information has been used mostly for segmentation purposes and the recognition phase has been left to the user [48, 36, 128, 123, 75].
3.2.3 Canonical Views
Much of the empirical work on perception of shapes by the human visual system has been based on two conflicting philosophies: one that representations of shapes are viewpoint independent [74, 23], and the other that the representations of shape are viewpoint specific [63, 64, 113]. In the first system, regardless of orientation, size and location of the shape with respect to the viewer, the visual system constructs the same underlying representation for recognizing shapes. In the second system, the visual system stores multiple representations of shapes and recognizes shapes through a normalization process.
In the case of the visual part of the spectrum, it is rather “easy to describe” what a canonical view is. Researchers in many fields, mostly psychology, have attempted to answer this question with considerable success [26, 27]. Another way of looking at canonical views is the degeneracy of the views of an object when viewed in the hemisphere model as described in Chapter 2. There are many views that do not provide as much information about the 3D object as others. Computational approaches to such a model of an object have also been addressed [20, 5, 6, 11, 16, 33].
• the viewpoint that is assigned the highest “goodness” rating by the observers or,
• the viewpoint that is first imagined in visual imagery or,
• the viewpoint that is subjectively selected as the “best” photograph taken with a camera or,
• the viewpoint found to have the lowest response time and error rate in recognition or naming experiments or,
• the viewpoint inspected for the longest period of time in a free exploration task.”
These two “definitions” are extracted from the literature to illustrate the variety in the approaches used in defining a canonical view. The former has a “mathematical” notion and is clearly from a computer vision perspective, while the latter is highly subjective and is clearly from a psychology perspective.
The analysis of Palmer et al. [85] indicates that observers preferred the off-axis views, that is, viewpoints that make a large number of surfaces visible. After a sequence of involved experiments, Blanz et al. [14] conclude that canonical views are a result of complex interactions between experience, task and the geometry of the object. This result, although disheartening at first, does provide an important fact that needs to be borne in mind throughout this work: there is no ONE answer, and the complexity of the problem results in many possible solutions. This work will present one such answer by choosing as a canonical view the view that is least different from all other views. From a pattern classification perspective, one may call this a prototypical view, as it has a likeness to prototypical feature vectors.
View likelihood (canonicity of a view) of angles was computed using numerical simulations by Ben-Arie [5] and Burns et al. [16]. The general problem requires the numerical estimation of likelihood when the image measurements change continuously with the viewing parameters; this computation is harder, as it requires the numerical estimation of limits. Thus, the simulation work described in [5, 16] cannot be readily generalized to compute view likelihood of general objects. To this end, Werman and Weinshall [122] propose a new measure for view likelihood and stability.
Aspect graphs explore the space of possible views and relate them based on their similarity. They rely on the mathematical tools of differential geometry and singularity theory to group images that share the same topological structure into equivalence classes.
3.2.4 Eigenviews
One of the most widely known pieces of work on eigenimages was that done by Sirovich and Kirby [109], revisited by Turk and Pentland [118]. The intent of their work was to find the best encoding of a face database (with, say, M images) with the objective being recognition. Zero-mean images $\bar{I}_{m0}$ are used to compute the covariance matrix, given by
$$\Sigma = \frac{1}{M} \sum_{m=1}^{M} \bar{I}_{m0}\,\bar{I}_{m0}^{T} = \mathbf{I}_0 \mathbf{I}_0^{T}$$
(the constant factor 1/M is taken to be absorbed into $\mathbf{I}_0$; it does not affect the eigenvectors). This leads to an RCK × RCK matrix Σ. Factorizing this matrix is a prohibitively expensive task for typically sized images. To this end, Turk and Pentland propose using instead the M × M matrix $\Sigma' = \mathbf{I}_0^{T}\mathbf{I}_0$, whose (m, n) entry is $\frac{1}{M}\,\bar{I}_{m0}^{T}\,\bar{I}_{n0}$.
Since the number of data points in the image space is much less than the dimensionality of the space (M ≪ RCK), there are only M − 1 meaningful eigenvectors, rather than RCK; the remaining eigenvectors have associated eigenvalues of zero. This is easily seen if we consider an eigenvector $\bar{v}_m$ of Σ′ along with its eigenvalue $\lambda_m$, so that $\mathbf{I}_0^{T}\mathbf{I}_0\,\bar{v}_m = \lambda_m \bar{v}_m$. Premultiplying both sides by $\mathbf{I}_0$, we have $\mathbf{I}_0\mathbf{I}_0^{T}(\mathbf{I}_0\bar{v}_m) = \lambda_m(\mathbf{I}_0\bar{v}_m)$, which shows us that the eigenvectors of Σ are given by $\mathbf{I}_0\bar{v}_m$. This gives us the required eigenvectors at a fraction of the computational cost. The eigenvectors $\mathbf{I}_0\bar{v}_m$ corresponding to the M′ most
labels from the projections of a given training set with considerable accuracy [79, 80]. The idea of using eigen-representations of image databases has been explored by many authors in a variety of publications for object recognition and pose estimation [78, 77]. There is currently no standard database of images of objects taken from different locations on the viewing hemisphere; the only database that is treated as “standard” is the one by Nayar, located at http://www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html [81], which is a database of visible color images of 100 objects taken at a fixed elevation and varying azimuths. This database will be used for comparison purposes.
The above algorithm used the Karhunen-Loeve transform to determine the “directions” with most variation (the principal components). This works best if the data has a Gaussian distribution, as the cost function that is minimized is the squared error. This, however, need not be the case. Independent Component Analysis (ICA) [54, 53] is known for its ability to perform source separation and represent the original data without making the Gaussian assumption. Using the notation above, the original images $\bar{I}_m$ may be considered to be derived from a collection of independent components $\bar{J}_1, \bar{J}_2, \ldots, \bar{J}_L$, given by $\bar{I}_m = a_{m1}\bar{J}_1 + a_{m2}\bar{J}_2 + \ldots + a_{mL}\bar{J}_L$. This set of linear combinations may be written in matrix notation as $\mathbf{I} = \mathbf{A}\mathbf{J}$, where $\mathbf{A}$ is called the mixing matrix and is composed of rows, each representing the weights used to construct $\{\bar{I}_m\}_{m=1}^{M}$ from the independent components $\{\bar{J}_l\}_{l=1}^{L}$. The objective of ICA is to find the entries in A and J such that the functions $\bar{J}_l$ are statistically independent (this independence is obtained by maximizing the non-Gaussian nature of these signals). ICA has not been used to decompose image datasets as was done by Turk and Pentland using PCA. The use of ICA in decomposing image datasets for recognition purposes is another contribution of this work.
Examining all the work on Principal Component Analysis and Independent Component Analysis is beyond the scope of this document. The intent of presenting them here is to show the various means of decomposing a large dataset into a smaller, more manageable one.
Lee and Seung apply a new technique called Non-Negative Matrix Factorization (NMF) to decompose a matrix into factors that are all non-negative. It extracts bases that have only non-negative entries. Each image is a linear combination of component images (just as in ICA), with the restriction that these combinations be only additions with non-negative scalars.
NMF poses the problem of “representation” by factoring the matrix of the original data into two non-negative matrices [69, 70]. Non-negative data is “optimally” encoded as a non-negative combination of non-negative basis functions, in contrast to the other techniques described above. A decomposition given by I ≈ EW is sought. Lee and Seung applied NMF to a database of face images and showed that the resulting basis functions represent a decomposition of the images into conceptually meaningful parts of faces: nose, mouth, eyes, etc. This is a convenient representation as it breaks down images into additive parts, much unlike the case with PCA and ICA, which involve complex additions and subtractions. In this work, this formalism is extended to multi– and hyperspectral systems.
View-based Techniques
As alluded to earlier, we shall explore two routes to characterizing and matching objects; the first route is that of using image intensities, while the second uses edge information. It is to be noted that although there are disadvantages to using images in their original form due to variations in imaging conditions, under controlled environments, such techniques have unmatched performance.
4.1 Eigenviews
All possible views of an object $O_1$ may be denoted by a set of multispectral images
$$Z_{O_1} = \left\{ I_{O_1}(\theta, \phi) \right\}_{\theta=0,\,\phi=0}^{\theta=\pi,\,\phi=2\pi}. \tag{4.1}$$
Consider a countable subset $Y \subset \bigcup_{i=1}^{O} Z_{O_i}$ (we have M views of O objects). Denote the cardinality of this set (of images corresponding to different views of various objects) as MO and the size of each image I as R×C×K.
As traditionally used, the term eigenfaces refers to the use of a smaller, more representative set of images extracted from linear combinations of face images. Eigenfaces were proposed by Sirovich and Kirby [109] and extended by Turk and Pentland over a decade ago to represent a set of images of faces [118]. In the work presented here, we shall use the term “eigen” without the restriction of the traditionally used reference to the Karhunen-Loeve transform or to Principal Component Analysis [109]. Instead, in this work, the term eigenimages is used to denote “representative” images (either faces or objects) that encode information in its broadest sense.
The work by Turk and Pentland was later extended to generate eigenviews of images of objects [78]. Turk and Pentland proposed using Principal Component Analysis (PCA) to generate eigenfaces, which essentially accounted for the variance in the data. The system they developed projected a given set of face images onto a feature space that spans the variations in the images. The projection operation characterizes a face by a weighted sum of eigenface features and does not recognize the details (individual components of a face). This has been known to perform with good accuracy on face images [118, 86]. Murase and Nayar’s eigenviews are generated by linear combinations of elements of Y. Here, we extend this work to perform recognition of objects viewed from arbitrary locations on the viewing hemisphere using their multispectral images.
4.2 Eigenviews
We generate a lexical representation (column vectors of length RCK) for each image I ∈ Y. These vectors are now stacked into a matrix $\mathbf{I}$ of size RCK×MO. Approximate factorizations for this input data matrix are now found such that
$$\mathbf{I} \approx \mathbf{V}\mathbf{W} \tag{4.2}$$
where V is a matrix of size RCK×M′ and W is a matrix of size M′×MO. The columns of V are the lexical representations of the M′ < MO eigenviews, and W may be regarded as a weight matrix that stores the weights used in a linear combination of the eigenviews to generate the original images (in I). Let us denote by $\mathcal{I}'$ the space spanned by these eigenviews.
4.2.1 Principal Component Analysis
As $\mathbf{I}$ is not a full-rank matrix, Singular Value Decomposition is used to extract the eigenvalues and eigenvectors. In traditional notation of SVD, we may factorize¹ the RCK×MO zero-mean matrix $\mathbf{I}_0$ as
$$\mathbf{I}_0 = \mathbf{U}_1 \Lambda \mathbf{U}_2^{T} \tag{4.3}$$
where Λ is an RCK×MO diagonal matrix with diagonal elements $\lambda_i$ such that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_{\min(RCK,\,MO)} \geq 0$, and $\mathbf{U}_1$ and $\mathbf{U}_2$ are orthogonal matrices of size RCK×RCK and MO×MO, respectively. The $\lambda_i$ are called singular values and the first min(RCK, MO) columns of $\mathbf{U}_1$ and $\mathbf{U}_2$ are called the left and right singular vectors. Putting this in the framework of Eqn. 4.2, we have $\mathbf{V} = \mathbf{U}_1\Lambda$ and $\mathbf{W} = \mathbf{U}_2^{T}$. This, however, is an exact decomposition. What we are after is a condensed representation of $\mathbf{I}$. We hence pick the largest M′ < MO singular values and vectors and now have
$$\mathbf{I}_0 \approx \mathbf{U}'_1 \Lambda' \mathbf{U}_2'^{\,T} \tag{4.4}$$
where $\mathbf{U}'_1$ is of size RCK×RCK, $\Lambda'$ is of size RCK×M′ and $\mathbf{U}'_2$ is of size MO×M′ (so that the product is again RCK×MO). The columns of $\mathbf{V} = \mathbf{U}'_1\Lambda'$ are now the eigenviews and those of $\mathbf{U}_2'^{\,T}$ are the encodings. Eigenviews generated in such a manner account for directions that have the largest variance. The eigenviews generated using PCA have the property that they involve linear combinations with complex cancellations and do not correspond to views that have intuitive meaning (although this is not the goal of the problem at hand).
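The truncation in Eqn. 4.4 can be sketched with NumPy's thin SVD. The matrix sizes and the random data are hypothetical, and the thin (economy) SVD is used here in place of the full-size factors described in the text; the retained columns are the same.

```python
import numpy as np

rng = np.random.default_rng(1)
RCK, MO, M_prime = 60, 10, 4          # hypothetical sizes

I = rng.random((RCK, MO))             # columns are lexicographic images
I0 = I - I.mean(axis=1, keepdims=True)

# Thin SVD: U1 is (RCK, MO); s holds singular values in descending order.
U1, s, U2T = np.linalg.svd(I0, full_matrices=False)

# Keep the M' largest singular values and vectors (Eqn. 4.4).
V = U1[:, :M_prime] * s[:M_prime]     # eigenviews, scaled by singular values
W = U2T[:M_prime, :]                  # encodings, one column per image
I0_approx = V @ W

# The squared Frobenius error equals the sum of the discarded squared
# singular values.
err = np.linalg.norm(I0 - I0_approx) ** 2
```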
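Computing the singular values and vectors for an RCK×MO matrix comes at a very high computational cost. To reduce the computational cost of such a process, the following setup is often used. Instead of computing the eigenvectors and eigenvalues of $\Sigma = \mathbf{I}_0\mathbf{I}_0^{T}$, consider the matrix $\Sigma' = \mathbf{I}_0^{T}\mathbf{I}_0$. The eigenvalues $\lambda'$ and eigenvectors $\psi'$ of $\Sigma'$ satisfy the condition
$$\Sigma' \psi' = \lambda' \psi'. \tag{4.5}$$
Pre-multiplying both sides by $\mathbf{I}_0$, we get
$$\begin{aligned}
\mathbf{I}_0 \Sigma' \psi' &= \mathbf{I}_0 \lambda' \psi' \\
\text{i.e.}\quad \mathbf{I}_0 \left( \mathbf{I}_0^{T}\mathbf{I}_0 \right) \psi' &= \mathbf{I}_0 \lambda' \psi' \\
\text{i.e.}\quad \left( \mathbf{I}_0\mathbf{I}_0^{T} \right) \mathbf{I}_0 \psi' &= \lambda' \mathbf{I}_0 \psi' \\
\text{i.e.}\quad \Sigma\, \mathbf{I}_0 \psi' &= \lambda' \mathbf{I}_0 \psi'
\end{aligned} \tag{4.6}$$
which shows us that the eigenvalues for the two matrices are the same, while the eigenvectors of Σ may be calculated from those of Σ′, which is computationally much more feasible. The remaining RCK − M′ eigenvalues (or min(RCK, MO) singular values) are zero.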
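The relationship in Eqn. 4.6 can be checked numerically with a short sketch. The sizes and random data are hypothetical, and the large matrix Σ is formed here only to verify the result; in practice the point of the method is to avoid forming it.

```python
import numpy as np

rng = np.random.default_rng(2)
RCK, MO = 200, 8                           # hypothetical sizes
I0 = rng.standard_normal((RCK, MO))
I0 -= I0.mean(axis=1, keepdims=True)       # zero-mean columns set

# Small MO x MO problem: Sigma' = I0^T I0.
lam, psi = np.linalg.eigh(I0.T @ I0)       # eigenvalues in ascending order
lam, psi = lam[::-1], psi[:, ::-1]         # reorder to descending

# Map to eigenvectors of Sigma = I0 I0^T and normalise to unit length.
keep = lam > 1e-10 * lam[0]                # drop numerically zero eigenvalues
vecs = I0 @ psi[:, keep]
vecs /= np.linalg.norm(vecs, axis=0)

Sigma = I0 @ I0.T                          # formed only to check the result
residual = np.linalg.norm(Sigma @ vecs - vecs * lam[keep])
```

Because the mean image has been subtracted, one of the MO eigenvalues of Σ′ is numerically zero and only MO − 1 eigenvectors are retained.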
4.2.2 Independent Component Analysis
In contrast to decorrelation techniques such as PCA, which ensure that output pairs are uncorrelated, Independent Component Analysis (ICA) imposes the much stronger criterion that the output variables be statistically independent, that is, that their multivariate probability density function can be factored [51]. ICA is essentially a linear formation model that describes the data by a linear combination of a collection of independent “sources” through a mixing matrix, given by
$$\mathbf{I} = \mathbf{A}\mathbf{S} \tag{4.7}$$
where A denotes the mixing matrix, which is a collection of scalars, and S is a collection of independent sources. The problem of ICA is to estimate not only A but also S. The enforcement of independence is done so as to remove redundant information (as is also the case with PCA). The fundamental restriction in ICA is that the independent components must be non-Gaussian for ICA to be possible. In [53], Hyvarinen et al. provide a simple explanation as to why Gaussian variables are forbidden.
described by its moments or, more conveniently, by its cumulants C(Z). Cumulants form tensors of varying ranks based upon the order of the cumulant, and the diagonal elements of the tensor characterize the distribution of single components. For example, $C(Z)_i$ is the mean, $C(Z)_{ii}$ is the variance, $C(Z)_{iii}$ is the skewness and $C(Z)_{iiii}$ is the kurtosis of the ith column of Z. The off-diagonal elements characterize the statistical dependencies between components. Clearly, the columns of Z are statistically independent if and only if the off-diagonal elements are zero (assuming an infinite amount of data). Thus, ICA is equivalent to finding an unmixing matrix that diagonalizes the cumulant tensors. The conventional KL transform may be considered as an ICA process operating on the second-order cumulant.
Kurtosis is commonly used as a measure of non-Gaussianity and is an often-used objective function. For a zero-mean random variable x, it is given by $E\{x^4\} - 3\,(E\{x^2\})^2$. For a Gaussian random variable it is zero; it is typically positive for densities with heavy tails and a peak at zero, and negative for flatter densities with lighter tails. There are researchers who do not support the use of kurtosis as an objective function, due to the sheer nature of the random variables and the fact that we have a finite sample of the random variables and are hence limited in the estimate of the kurtosis [50]. However, due to the strong global convergence properties of using such an objective function and the mathematical simplicity of such a formulation [31, 52], in this work, the kurtosis is used as a measure of non-Gaussianity. Other measures include negentropy and mutual information [51].
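The sign behaviour of the kurtosis described above can be illustrated with sample estimates for three standard distributions. This is only a sketch: the sample size and the choice of Laplace and uniform distributions as examples of heavy-tailed and flat densities are assumptions made for illustration.

```python
import numpy as np

def kurtosis(x):
    """Sample estimate of E{x^4} - 3 (E{x^2})^2 after removing the mean."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.mean(x ** 4) - 3.0 * np.mean(x ** 2) ** 2

rng = np.random.default_rng(3)
n = 200_000
k_gauss = kurtosis(rng.standard_normal(n))        # close to zero
k_laplace = kurtosis(rng.laplace(size=n))         # heavy tails: positive
k_uniform = kurtosis(rng.uniform(-1, 1, size=n))  # flat density: negative
```

With a finite sample the Gaussian estimate is only approximately zero, which is the estimation limitation noted in the text.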
In this work, the ICA problem is solved using a popular algorithm called Joint Approximate Diagonalization of Eigenmatrices (JADE). The advantage of this approach is that it requires no knowledge of the probability densities of the independent components. This algorithm works by jointly diagonalizing the maximal set of cumulant matrices (fourth-order cumulants are used, as the Givens rotations can then be computed in closed form). The interested reader is referred to the work reported by Cardoso and Souloumiac [19].
4.2.3 Non-negative Matrix Factorization
non-negative by making the assumption that an image pixel is generated by adding Poisson noise to the factored image. In other words, the pixels $I_{ij}$ are obtained from a Poisson process of mean $(\mathbf{VW})_{ij}$. An objective function related to the likelihood of generating the pixels in $\mathbf{I}$ from V and W is given by
$$\log P(\mathbf{I} \mid \mathbf{VW}) = \sum_{i=1}^{RCK} \sum_{j=1}^{MO} \Big( I_{ij} \log (\mathbf{VW})_{ij} - (\mathbf{VW})_{ij} - \log (I_{ij}!) \Big). \tag{4.8}$$
Clearly, the term $\log(I_{ij}!)$ does not depend on V or W and can be dropped, giving us a cost function (the negative of the remaining log-likelihood) that is iteratively minimized, usually using gradient descent techniques or multiplicative update rules [70]. In [70], Lee and Seung prove convergence properties of such a minimization and also illustrate that this cost function is identical to the Kullback-Leibler distance between two distributions if the matrices are normalized to unity. This technique has been shown to be powerful and at the same time to extract meaningful representations of data. Various other authors have used this technique to extract additive “parts” of datasets that make intuitive sense for reflectivity functions [17, 90], face image datasets [41, 43] and gene expression analysis [7], to name a few. Note though that V is of size RCK×M′ and W is of size M′×MO; the loss of dimensions accounts for the approximation in Eqn. 4.2.
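A minimal sketch of multiplicative updates for the Poisson-type objective is given below, written in the V (basis) and W (weights) notation of Eqn. 4.2. The update form follows the well-known Lee and Seung rules for the divergence objective; the function name `nmf_kl`, the iteration count, the small constant `eps` and the synthetic data are assumptions made for illustration and are not taken from the original work.

```python
import numpy as np

def nmf_kl(I, M_prime, n_iter=200, seed=0, eps=1e-12):
    """Multiplicative updates for I ~ V W under the Poisson-type objective."""
    rng = np.random.default_rng(seed)
    n, m = I.shape
    V = rng.random((n, M_prime)) + eps         # non-negative basis images
    W = rng.random((M_prime, m)) + eps         # non-negative weights
    history = []
    for _ in range(n_iter):
        ratio = I / (V @ W + eps)
        W *= (V.T @ ratio) / (V.sum(axis=0)[:, None] + eps)
        ratio = I / (V @ W + eps)
        V *= (ratio @ W.T) / (W.sum(axis=1)[None, :] + eps)
        VW = V @ W + eps
        # Log-likelihood with the log(I_ij!) term dropped (to be maximised).
        history.append(np.sum(I * np.log(VW) - VW))
    return V, W, history

rng = np.random.default_rng(4)
I = rng.random((30, 5)) @ rng.random((5, 12))  # non-negative synthetic data
V, W, history = nmf_kl(I, M_prime=5)
```

Because every update multiplies by a non-negative ratio, V and W stay non-negative without any explicit projection step.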
4.2.4 Classification
We call the space spanned by these M′ eigenviews (images) an eigenspace. The eigenspaces represent the image data in the “best” possible manner depending upon the technique used. The images used for generating these M′ eigenviews are projected into the eigenspace and serve as a training set which is later used for classification. Given a new (earlier unseen) view of one of these O objects, the image may now be projected into the eigenspace. This projection describes the contribution of each eigenview in representing the new view. A simple means of classifying this new image (its projection) is to assign it to the class that its projection is “closest” to. Due to the non-parametric nature of this space, an obvious choice for a classifier is the Nearest Neighbor (NN) classifier that uses the L2 norm as a distance
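| Label | 11 | 12 | 2 | 3 | 4 | 51 | 52 | 53 | Total | CE | % CE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 42 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 8 | 16.00 |
| 12 | 10 | 42 | 3 | 0 | 0 | 0 | 0 | 0 | 55 | 13 | 23.64 |
| 2 | 0 | 0 | 46 | 1 | 0 | 0 | 0 | 0 | 47 | 1 | 2.13 |
| 3 | 0 | 1 | 3 | 49 | 0 | 0 | 0 | 0 | 53 | 4 | 7.55 |
| 4 | 0 | 1 | 0 | 2 | 52 | 4 | 0 | 0 | 59 | 7 | 11.86 |
| 51 | 0 | 0 | 0 | 0 | 0 | 41 | 0 | 1 | 42 | 1 | 2.38 |
| 52 | 0 | 0 | 0 | 0 | 0 | 5 | 52 | 0 | 57 | 5 | 8.77 |
| 53 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 51 | 53 | 2 | 3.77 |
| Total | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 52 | 375 | | |
| OE | 10 | 10 | 6 | 3 | 0 | 11 | 0 | 1 | | | |
| % OE | 19.23 | 19.23 | 11.54 | 5.77 | 0.00 | 21.15 | 0.00 | 1.92 | | | 90.14 |
Table 4.1: Error matrix tabulating the performance of a NN classifier on an eigenview system with 78 eigenviews extracted using PCA on the synthetic database consisting of eight objects. OE and CE stand for Omission and Commission Error, respectively. The lower-rightmost cell indicates the overall Level-2 classifier accuracy.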
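The summary figures in Table 4.1 can be reproduced from the body of the table with a few lines of NumPy. This sketch assumes that rows correspond to assigned labels and columns to true labels, which is consistent with every column summing to the same number of test views but is not stated explicitly in the caption.

```python
import numpy as np

labels = ["11", "12", "2", "3", "4", "51", "52", "53"]
# Counts copied from Table 4.1 (rows and columns in the order of `labels`).
cm = np.array([
    [42,  8,  0,  0,  0,  0,  0,  0],
    [10, 42,  3,  0,  0,  0,  0,  0],
    [ 0,  0, 46,  1,  0,  0,  0,  0],
    [ 0,  1,  3, 49,  0,  0,  0,  0],
    [ 0,  1,  0,  2, 52,  4,  0,  0],
    [ 0,  0,  0,  0,  0, 41,  0,  1],
    [ 0,  0,  0,  0,  0,  5, 52,  0],
    [ 0,  0,  0,  0,  0,  2,  0, 51],
])

correct = np.diag(cm)
row_total = cm.sum(axis=1)
col_total = cm.sum(axis=0)
commission = row_total - correct           # CE: off-diagonal counts per row
omission = col_total - correct             # OE: off-diagonal counts per column
pct_ce = 100 * commission / row_total
pct_oe = 100 * omission / col_total
accuracy = 100 * correct.sum() / cm.sum()  # overall accuracy in percent
```

The value 375 in the totals row of the table is the number of correctly classified test views (the sum of the diagonal), which gives the overall accuracy when divided by the 416 test views.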
In other words, given a database of images as a matrix $\mathbf{I}$ of size RCK×MO, we compute eigenimages $\mathbf{V}$, a matrix of size RCK×M′. The columns of V span the eigenspace. Projecting $\mathbf{I}$ onto this space will form a training set. When presented with a new image $I'$, we project it onto the eigenspace and assign class labels according to the following
$$\mathrm{Label}(I') = \mathrm{Label}\Big( \arg\min_{l = 1 \ldots MO} \big\| \mathbf{V}^{T} \bar{I}' - \mathbf{V}^{T} \bar{I}_l \big\|^2 \Big) \tag{4.9}$$
which is the naive Nearest Neighbor (NN) rule. In the case of PCA, however, the mean image needs to be subtracted. As an illustration of this technique, a dataset of eight objects (as described in later chapters) is used for classification purposes. The dataset consists of 128 views of eight objects. It is partitioned into a training and a test set in a 60-40 ratio. Eigenviews are generated using the training data and the test data is classified using the NN rule. A confusion matrix is shown in Table 4.1. Clearly, from the table, one can see that classes 11 and 12 are similar while the others are highly separable. Label 11 corresponds to a DC-10 aircraft and label 12 corresponds to a Boeing-747 aircraft, two very similar objects. One may visualize these as clusters of feature vectors, with those of 11 and 12 overlapping and hence giving lower classification accuracies.
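The nearest neighbor rule of Eqn. 4.9 can be sketched as follows for the PCA case. The two synthetic "objects", the noise level, the number of retained eigenviews and the helper name `nn_label` are all hypothetical and chosen only to make the example self-contained; they do not reproduce the experiment reported in Table 4.1.

```python
import numpy as np

def nn_label(V, train, train_labels, new_image, mean=None):
    """Project onto the columns of V and return the nearest training label."""
    if mean is not None:                      # PCA: subtract the mean image first
        train = train - mean
        new_image = new_image - mean.ravel()
    train_proj = V.T @ train                  # shape (M', number of training images)
    new_proj = V.T @ new_image                # shape (M',)
    dists = np.linalg.norm(train_proj - new_proj[:, None], axis=0)
    return train_labels[int(np.argmin(dists))]

rng = np.random.default_rng(5)
RCK, per_class = 50, 10
# Two hypothetical objects: noisy views around two different prototype vectors.
protos = rng.random((RCK, 2))
train = np.hstack([protos[:, [c]] + 0.05 * rng.standard_normal((RCK, per_class))
                   for c in range(2)])
train_labels = np.repeat([0, 1], per_class)

mean = train.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(train - mean, full_matrices=False)
V = U[:, :3]                                  # three PCA eigenviews

test_view = protos[:, 1] + 0.05 * rng.standard_normal(RCK)
predicted = nn_label(V, train, train_labels, test_view, mean)
```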
A few questions need to be addressed
2. How many eigenviews does one need to obtain a certain accuracy?
These questions will be addressed experimentally in a later chapter; however, the issues regarding noise in the data are dealt with in the next section.
4.3 Noise equivalent Dimensions
In this section, we concern ourselves with the behavior of an eigenview system in the presence of noise in the test data. In other words, consider a scenario where eigenviews are extracted from “near-noiseless” data, but the data and the resulting eigenviews are to be used for classifying images corrupted by noise. Consider a dataset of “noiseless” images, $I_l^{(0)}$, $l = 1 \ldots MO$, and a basis of M′ vectors written in matrix notation as $\Phi_{M'}$, where the subscript denotes the dimensionality of the space spanned by these bases. For notational ease, we had avoided introducing the superscript (0) for noiseless data, but with noise behavior being the primary concern here, we denote noiseless data with the superscript (0). These images may be represented in lexical notation by $\bar{I}_l^{(0)}$. The columns of $\Phi_{M'}$ may be the same as those of V. The original data may be represented in this M′-dimensional subspace by $\Phi_{M'}^{T}\bar{I}_l^{(0)}$. The L2 error in reconstructing the original dataset from this reduced-dimensional data is given by
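$$\varepsilon^2\big(\bar{I}^{(0)}, M'\big) = \sum_{l=1}^{MO} \varepsilon_l^2\big(\bar{I}^{(0)}, M'\big) \tag{4.10}$$
where
$$\varepsilon_l^2\big(\bar{I}^{(0)}, M'\big) = \big\|\bar{I}_l^{(0)}\big\|^2 - \big\|\Phi_{M'}^{T}\bar{I}_l^{(0)}\big\|^2 \tag{4.11}$$
$$\phantom{\varepsilon_l^2\big(\bar{I}^{(0)}, M'\big)} = \big(\bar{I}_l^{(0)}\big)^{T}\bar{I}_l^{(0)} - \big(\Phi_{M'}^{T}\bar{I}_l^{(0)}\big)^{T}\,\Phi_{M'}^{T}\bar{I}_l^{(0)}. \tag{4.12}$$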
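Eqns. 4.10 to 4.12 can be evaluated directly once a basis is available. The sketch below assumes that the columns of Φ are orthonormal (under that assumption the expression in Eqn. 4.11 equals the squared norm of the reconstruction residual); the sizes, the random data and the QR-based basis are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
RCK, MO, M_prime = 40, 12, 5                          # hypothetical sizes

I_clean = rng.random((RCK, MO))                       # columns: noiseless images
# Orthonormal basis Phi with M' columns (here from a QR factorisation).
Phi, _ = np.linalg.qr(rng.standard_normal((RCK, M_prime)))

coeffs = Phi.T @ I_clean                              # shape (M', MO)
# Eqn. 4.11 per image, then Eqn. 4.10 summed over images.
eps_l = np.sum(I_clean ** 2, axis=0) - np.sum(coeffs ** 2, axis=0)
eps_total = eps_l.sum()

# Direct check: squared norm of the reconstruction residual per image.
residual = I_clean - Phi @ coeffs
eps_direct = np.sum(residual ** 2, axis=0)
```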
For a given $\bar{I}^{(0)}$ and M′, the reconstruction error is given by $\varepsilon^2\big(\bar{I}^{(0)}, M'\big)$. The question being addressed here is: Given a basis Φ and noisy data $\bar{I}_l = \bar{I}_l^{(0)} + \bar{n}_l$ ∀ $l = 1 \ldots MO$, how