An Accumulative Framework for Object Recognition


which are suited for recognizing objects in cluttered scenes using shape, color and texture. This dissertation provides a unified framework which can be applied not only to recognize simple shapes such as silhouettes but also recognize real objects in cluttered environments with occlusion.


by Karthik Krish

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Electrical Engineering

Raleigh, North Carolina

2009

APPROVED BY:

Dr. Wesley E. Snyder Chair of Advisory Committee

Dr. Hamid Krim Dr. Griff Bilbro


BIOGRAPHY

Karthik Krish was born in Bangalore, India on June 1st, 1982. After spending a year of preschool in Mumbai (Bombay back then), India he moved south to Chennai in his native state of Tamil Nadu, India to do his primary schooling at Vidya Mandir. He moved north again to the capital New Delhi to complete his high school at the Mothers International School in 1999. He then went on to obtain a Bachelor of Engineering Degree in Electronics and Communication at the University Of Madras, Chennai, India in 2003.

In August 2003, he enrolled in the Master of Science Program of the Electrical and Computer Engineering Department at North Carolina State University. As part of his Masters thesis, he worked on writing image display and visualization software. He completed his Masters Degree in December 2005 and continued on for a PhD. His primary research focus during his doctoral degree was in the area of object and shape recognition even though he also worked in the related areas of image registration and image segmentation.


problems extremely easy to understand and comprehend. His research ethics and style have influenced me a lot and will always be my guiding principles throughout my life. I owe the knowledge and skills gained during my days in graduate school to him. I also want to thank Dr. Griff Bilbro for his insightful discussions and thoughts and for providing me with new perspectives about my research. My thanks also go out to Dr. Siamak Khorram for his frank and candid reviews of my papers and work. I also want to thank my committee members Dr. Hamid Krim and Dr. Ben Watson for their comments and patience. I also want to thank my current and past colleagues in the Imaging Analysis Lab for all their thoughts and discussions about research.


LIST OF TABLES

Table 3.1 Example R-Table

Table 4.1 Invariance to Similarity Transforms


Figure 1.1 Example binary image with straight lines (left) and the corresponding accumulator array (right). Image reproduced from www.planetmath.org.

Figure 3.1 (Left) The image shows the accumulator array, A(x), for a good match (matching one image to a rotated and scaled version of itself). (Right) A poor match, the result of matching the image with the model of a different tank. The sharp peak in the left image is at the reference point in the model.

Figure 3.2 Feature used by the Generalized Hough Transform (GHT)

Figure 3.3 Example feature space for a two dimensional feature vector. The black points indicate the feature vectors in feature space. The model function for a sample feature vector (shown in red) calculates the shortest distance to the set of all feature vectors.

Figure 3.4 Test image used for the SNR comparison test from the MIT Books dataset

Figure 3.5 SNR of the Accumulator comparing the max and summation operators

Figure 4.1 Feature Vectors: κ, curvature at the point; φ, angle between the vector from the reference point to the feature point and the tangent frame of the reference point; r is the distance from the salient point (Ck) to the reference point (Ω). All are translation and rotation invariant.

Figure 4.2 Database of 12 tanks

Figure 4.3 The 31 randomly picked silhouettes from the SQUID database. Used with permission.

Figure 4.4 A fish contour occluded by 40% with different occluded regions

Figure 4.5 Classification accuracy at various levels of occlusion.

Figure 4.6 Example Queries

Figure 4.7 Example Queries


Figure 5.1 Feature Vector at a salient point (shown in grey) in the neighborhood of a reference point (shown in blue). φ is the gradient direction with respect to the dominant orientation; r is the distance to the reference point; θ is the polar angle of the salient point with respect to the reference point.

Figure 5.2 Two images from the dataset where one is a rotated and scaled version of the other

Figure 5.3 The features matched between the two images. The total number of correspondences found was 121. The number of correct correspondences was 120 with one false correspondence.

Figure 5.4 Features matched between images where one is a rotated, scaled and noisy version of the other. The Gaussian noise was artificially added. The corresponding matches are marked using the same numbers (drawn in red) in both images. The white circle indicates one of the many correct matches.

Figure 5.5 Precision - Recall Curve for Scale and Rotation Changes

Figure 5.6 Two images of the same scene taken at different viewpoints. The images were generated by taking an aerial photo of an urban environment and extruding 3D buildings. Rendering was done using Mental Ray 3.5 with physically simulated sky, atmospherics and shadows.

Figure 5.7 Precision versus number of correct correspondences for extreme viewpoint changes

Figure 5.8 Registration Results

Figure 6.1 The computation of the feature vector vj at a salient point Cj (shown in

Figure 6.4 Example False Matches

Figure 6.5 Example images from the Coil-100 Dataset


Chapter 1

Introduction

Object recognition, a task that is trivial for humans, remains an unsolved problem in the field of computer vision. We are nowhere close to a solution that performs as well as the human brain, which remains the benchmark for object recognition algorithms (at least for the foreseeable future).

As humans, we tend to recognize objects by looking for cues (or evidence) in the world around us which best match a particular template (or model) stored in our brain. The more cues we observe about a particular object, the more certain we are about its identity. This dissertation attempts to tackle the fundamental problem of object recognition using such an evidence accumulative framework. We try to match objects by accumulating features (which can be viewed as evidence) which best match a given model. The more features that match, the higher the confidence in the match.

1.1 Contributions


Figure 1.1: Example binary image with straight lines (left) and the corresponding accumulator array (right). Image reproduced from www.planetmath.org

basis for higher level cognitive tasks such as learning and knowledge representation.

1.2 Introduction to Accumulators

Accumulator-based methods make use of an accumulator array to combine local information to make a global decision. The Hough transform [1] for finding straight lines is one of the most popular applications of an accumulator array. Given a set of N points (xi, yi) in an image, the algorithm tries to determine which of them lie on straight lines.

The equation of a straight line in normal form is given by:

$$\rho = x\cos(\theta) + y\sin(\theta) \tag{1.1}$$

where (ρ, θ) are the parameters of the straight line.

One can now consider a mapping from the (x, y) space into the (ρ, θ) space by solving equation 1.1 for ρ for all values of θ for each point (xi, yi). The space (ρ, θ) can be quantized and treated as a two dimensional image called an accumulator array. Thus, each value of (ρ, θ) obtained by solving equation 1.1 can be used to increment (or vote) at the appropriate location in the accumulator. Therefore, points which lie on a straight line will increment one location in the (ρ, θ) accumulator array many times, resulting in a peak. Figure 1.1 shows an example binary image with straight lines and the corresponding accumulator array.
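The voting procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `hough_lines`, the 1-degree angular step, and the unit ρ resolution are all assumptions chosen for the example.

```python
import numpy as np

def hough_lines(points, n_theta=180, rho_res=1.0):
    """Vote in a quantized (rho, theta) accumulator using equation 1.1."""
    points = np.asarray(points, dtype=float)
    thetas = np.deg2rad(np.arange(n_theta) * (180.0 / n_theta))  # [0, pi)
    # Largest possible |rho| for these points; it fixes the accumulator height.
    rho_max = np.ceil(np.hypot(points[:, 0], points[:, 1]).max())
    n_rho = int(2 * rho_max / rho_res) + 1
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)       # equation 1.1
        rows = np.round((rhos + rho_max) / rho_res).astype(int)
        acc[rows, np.arange(n_theta)] += 1                    # one vote per theta
    return acc, thetas, rho_max

# Ten points on the vertical line x = 5 plus two stray points.
pts = [(5, y) for y in range(10)] + [(0, 3), (9, 1)]
acc, thetas, rho_max = hough_lines(pts)
votes_on_line = acc[int(round(5 + rho_max)), 0]   # cell for rho = 5, theta = 0
```

All ten collinear points vote for the cell (ρ = 5, θ = 0), while each stray point votes elsewhere. With this coarse quantization a few neighboring θ bins collect the same number of votes, so a practical peak detector would usually smooth or cluster the accumulator before picking a maximum.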


essentially combined the local information (the points in an image) to help make a global decision on which of them lie on straight lines. This dissertation extends this idea further and shows one can recognize not only straight lines but also complex objects in images using accumulator arrays with excellent robustness and performance.

1.3 Organization


Chapter 2

Background

The problem of object recognition has been studied exhaustively in the literature. Some of the earliest work in object recognition was done by Marr[2]. Marr’s primal sketch involves extraction of local features which are combined in a hierarchical way into more complex descriptions. He believed that the goal of vision is to reconstruct the 3D scene. Marr and Nishihara [3] proposed that objects should be represented using an object-centered approach using 3D parts or volumes. These representations are completely viewpoint invariant. This led to the theory of “Recognition by Components” (also called the structural description model) [4] by Biederman, who built on Marr’s work of object-centered representations. He proposed that objects were represented by simple primitives and described syntactic ways to describe the relationships between them.


2.1 Shape Recognition

Recognition involves many features including color, shape, motion and context[13]. However, according to Edelman[8], an algorithm which is good at recognizing shapes is expected to perform equally well in recognizing objects. The importance of silhouettes in human shape perception is shown in [13],[8].

A comprehensive review of shape representation and description is presented in a paper by Zhang and Lu[14]. They identify two major approaches to shape representation and description. One is using a contour-based approach and the other is using a region-based approach. These are further divided into global and structural methods. Global approaches use the entire shape to generate a vector or a feature, which are then used to distinguish between shapes. Structural approaches break the shape down into primitives or parts and use those for matching.

Contour-based approaches include methods like boundary signatures, chain codes, curve decomposition, generic Fourier descriptors[15], correspondence based methods (like the Hausdorff distance[16],[17]) and syntactic analysis[14]. Global contour based approaches include the popular Shape Context[18] and Curvature Scale Space(CSS) matching[19].

Region-based methods include geometric[20], algebraic[21] and orthogonal moments[22], medial axis and shape matrix methods[14].

2.2 Object Recognition


Object categorization involves identifying the objects present in a scene. There has been a lot of work done in this area based on the bags-of-features (BOF) method[12], which categorizes objects using an orderless collection of features, completely discarding their spatial positions in the scene. BOF has shown surprisingly good performance[29][30] in this task. Other methods [31] use simple image patches with Support Vector Machines for classification. Methods based on correspondence of local features (or distinguished regions) [32] have also been used with considerable success in object recognition.

The features used for object recognition are generally local and involve the use of interest points (or salient points), neighborhoods, or regions. Interest points (particularly points of high curvature) have generally received a lot of attention in the neuroscience community. [33],[34] show evidence for sensitivity to curvature and radial position by the neurons in the V4 of primates. A survey of the literature shows many approaches, including the Harris Corner detector[35] and the scale-invariant keypoints introduced by Lowe[36]. A complete review of affine invariant region detectors can be found in [37].

Local feature vectors (or descriptors) may be built in the neighborhood of these interest points and then be used for matching. Some popular local descriptors include the Scale Invariant Feature Transform (SIFT) [36],[38]. An exhaustive review of local descriptors for feature matching can be found in [39].

2.3 Accumulator-Based Methods


collections of local features.

Several variations of the Hough transform have been proposed to reduce computational complexity and memory usage. The Probabilistic Hough Transform (PHT) [42] uses a small subset of the data points for voting, thereby decreasing the total computation required. The Randomized Hough Transform [43] is similar to the PHT, but uses random pairs of points for voting. Aguado et al. [44] proposed an adaptive version of the Hough Transform which iteratively determines the peak location but keeps the size of the accumulator array small.


Chapter 3

The Accumulative Framework

This chapter develops a general approach to object recognition using accumulators by providing a robust solution to the problem of making global decisions from a collection of local measurements.

The algorithm, which we will henceforth refer to as the Simple K-Space (SKS) algorithm, is invariant to translation, rotation, and scale changes, and very robust to partial occlusion and local variations in image brightness. The algorithm is able to perform a wide variety of recognition tasks, from the recognition of simple shapes like silhouettes to objects in cluttered real images. The application of the concept in our algorithm is most similar to that described in the Generalized Hough Transform, but the algorithm differs in several ways, including its use of feature vectors and the table lookup model. We will show that the Generalized Hough Transform is a special case of the SKS algorithm.

Before we move on, let us define the types of points (or pixels) which will be used throughout this dissertation.

• salient points: points of “interest”.

• reference points: points which serve as the origin of the coordinate frame defining the object.


3.1 Recognition

The recognition process begins with identification of salient points in an image $I$. This process produces an unordered collection of $(x, y)$ pairs, denoted by $C_i = (x_i, y_i)$, each indicating the coordinates of a particular salient point. From the neighborhood of each salient point, say $C_i$, one may then extract a feature vector $\boldsymbol{\theta}_i = [\theta_{i1}, \theta_{i2}, \ldots, \theta_{id}]$ and a special feature $r_i$, which is the distance to some particular reference point, $\Omega$. Note that there might also be features $\theta_{ij}$ computed with respect to the reference point, $\Omega$. Let $\mathbf{v}_i = [r_i, \boldsymbol{\theta}_i]$ be the overall feature vector extracted at a salient point $C_i$. The exact nature of these feature vectors and the choice of salient points are application dependent and will be discussed further in later chapters.

Let $\psi$ be an object model also defined with respect to a reference point. $\psi$ is a function, $\psi : \mathbb{R}^{d+1} \to [0, 1]$, such that $\psi_m(\mathbf{v})$, $\mathbf{v} = [r, \boldsymbol{\theta}]$, reports the confidence that the feature $\boldsymbol{\theta}$ occurs in the $m$th model at a distance $r$ from the reference point. Construction of a model for an object is discussed in section 3.4.

Given a collection of models, $\{\psi_1, \psi_2, \ldots, \psi_M\}$, we seek a method to determine if there are instances of a particular model in an image $I$. Let $\mathbf{x} \in \mathbb{R}^n$ be a parameter vector which describes the instance of a model in the image, such as the location of the reference point $(x, y)$ and other parameters such as the scale and orientation of the model instance. The objective of the recognition process is to find all instances of this reference point in the observed image, along with any other parameters that exist.

Let an image $I$ be described by a set of salient points and their corresponding feature vectors. To determine the quality of possible matches between image $I$, which has $J$ salient points, and model $m$, construct an $n$-dimensional accumulator array $A_m(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^n$. Then for the point at coordinates $\mathbf{x}$, compute:

$$A_m(\mathbf{x}) = \sum_{j=1}^{J} \psi_m(\mathbf{v}_j(\mathbf{x})) \tag{3.1}$$

where vj(x) is the feature vector vj measured at the jth salient point, normalized by the elements of x. One can think of this normalization as normalizing the feature vector for translation, rotation, scale etc. The exact nature of the normalization is application dependent and will be explored further in later chapters.


Figure 3.1: (Left) The image shows the accumulator array, A(x), for a good match (matching one image to a rotated and scaled version of itself). (Right) A poor match, the result of matching the image with the model of a different tank. The sharp peak in the left image is at the reference point in the model.

Thus, at a point x in the accumulator, we accumulate the confidence that the parameter vector x describes the instance of the model m in the image I by summing the model function computed using the normalized feature vector extracted at each salient point in the image. Observe that ψ is usually precomputed and J is small (typically a few hundred), which makes the matching process very fast.
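As a rough illustration of equation 3.1, the sketch below fills a two dimensional accumulator in which the parameter vector is only the candidate reference point location. Everything specific here is an assumption made for the example: the helper names `make_psi` and `accumulate`, the use of an integer label as the only local feature, the Gaussian tolerance `sigma`, and the absence of any rotation or scale normalization.

```python
import numpy as np

def make_psi(model_xy, model_labels, ref, sigma=0.5):
    """Toy model function: confidence that a feature with this label lies at distance r."""
    model_r = np.hypot(*(np.asarray(model_xy, dtype=float) - ref).T)
    model_labels = np.asarray(model_labels)

    def psi(r, label):
        same = model_labels == label
        if not same.any():
            return 0.0
        return float(np.exp(-((r - model_r[same]) ** 2) / sigma**2).max())

    return psi

def accumulate(salient_xy, labels, psi, grid_shape):
    """Equation 3.1 for x = (x, y): A(x) = sum_j psi(v_j(x))."""
    salient_xy = np.asarray(salient_xy, dtype=float)
    acc = np.zeros(grid_shape)
    for gy in range(grid_shape[0]):
        for gx in range(grid_shape[1]):
            # r_j(x): distance from every salient point to the candidate reference point.
            r = np.hypot(salient_xy[:, 0] - gx, salient_xy[:, 1] - gy)
            acc[gy, gx] = sum(psi(r_j, lab) for r_j, lab in zip(r, labels))
    return acc

# Model: four labeled salient points around the reference point (2, 1).
psi = make_psi([(0, 0), (4, 0), (4, 3), (0, 3)], [0, 1, 2, 3], ref=(2, 1))
# Image: the same points shifted by (3, 2), plus one clutter point.
image_xy = [(3, 2), (7, 2), (7, 5), (3, 5), (1, 6)]
acc = accumulate(image_xy, [0, 1, 2, 3, 2], psi, grid_shape=(8, 10))
peak = np.unravel_index(acc.argmax(), acc.shape)   # (row, col) = (y, x)
```

The accumulator peaks at row 3, column 5, which is the model reference point (2, 1) shifted by (3, 2). The clutter point contributes almost nothing there because its distance to that location does not agree with any model entry carrying the same label.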

The best matching model ($\hat{\psi}$) is given by the model function which gives the highest peak value in the accumulator.

$$\hat{\psi} = \arg\max_{m,\mathbf{x}} A_m(\mathbf{x}) \tag{3.2}$$


Figure 3.2: Feature used by the Generalized Hough Transform (GHT)

Table 3.1: Example R-Table

| Gradient Angle | Vector to reference point |
|---|---|
| φ1 | r11, r21, ... |
| φ2 | r12, r22, ... |
| ... | ... |
| φm | r1m, r2m, ... |

3.2 The Generalized Hough Transform (GHT) and SKS

The Generalized Hough Transform (GHT)[40] was introduced by Ballard as a generalization of the Hough transform[1] to recognize arbitrary shapes. In this section, we will show that the GHT is a specialized case of the SKS algorithm.

Consider a model shape with salient points $P = \{P_1, P_2, \ldots, P_K\}$, where $P_k = (x, y)$ is an ordered pair of the coordinates of each point. The GHT first involves picking a reference point $O = (x_{ref}, y_{ref})$. Now, at each point $P_k$ on the shape, we construct a vector $\mathbf{r}_k = (d, \alpha)$ to the reference point ($O$), as shown in figure 3.2. This, along with the gradient angle at the point (${}^{P}\phi_j$), is then used to construct an R-table which is essentially


R-table. The maxima of the accumulator then gives the possible locations of the target shape.

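Mathematically, the total increment at every point in the accumulator is given by:

$$A(\mathbf{x}) = \sum_{j=1}^{J} \sum_{k=1}^{K} \delta\left({}^{C}\phi_j - {}^{P}\phi_k,\; \|(\mathbf{x} - C_j) - \mathbf{r}_k\|\right) \tag{3.3}$$

where $\delta(u, v)$ is the Kronecker delta defined as:

$$\delta(u, v) = \begin{cases} 1 & u = 0 \text{ and } v = 0 \\ 0 & \text{otherwise} \end{cases} \tag{3.4}$$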

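In order to prove that the GHT is a special case of the SKS algorithm, let us define feature vectors ${}^{P}\mathbf{v}_k = [{}^{P}d_k, {}^{P}\alpha_k, {}^{P}\phi_k]$ and ${}^{C}\mathbf{v}_j = [{}^{C}d_j(\mathbf{x}), {}^{C}\alpha_j(\mathbf{x}), {}^{C}\phi_j]$. $[{}^{P}d_k, {}^{P}\alpha_k]$ is the vector from the $k$th salient point in the model to the model reference point and $[{}^{C}d_j(\mathbf{x}), {}^{C}\alpha_j(\mathbf{x})]$ is the vector from the $j$th salient point in the target shape to the point $\mathbf{x}$.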

Equation 3.3 can now be rewritten as:

$$A(\mathbf{x}) = \sum_{j=1}^{J} \sum_{k=1}^{K} \delta\left({}^{C}\mathbf{v}_j(\mathbf{x}) - {}^{P}\mathbf{v}_k\right) \tag{3.5}$$

The SKS model can now be defined from equation 3.5 as:

$$\psi(\mathbf{v}) = \sum_{k=1}^{K} \delta\left(\mathbf{v} - {}^{P}\mathbf{v}_k\right) \tag{3.6}$$

Substituting equation 3.6 into equation 3.5, we get the standard SKS matching equation defined in equation 3.1:

$$A(\mathbf{x}) = \sum_{j=1}^{J} \psi\left({}^{C}\mathbf{v}_j(\mathbf{x})\right) \tag{3.7}$$
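The equivalence between the double sum of equation 3.5 and the table lookup of equations 3.6 and 3.7 can be checked numerically at a single accumulator location. The following is a small sketch under stated assumptions: feature vectors are taken to be already quantized (d, α, φ) triples evaluated at one fixed x, the numeric values are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

def double_sum(target_vecs, model_vecs):
    """Equation 3.5: count exact agreements over all (j, k) pairs."""
    return sum(1 for cv in target_vecs for pv in model_vecs if cv == pv)

def make_psi(model_vecs):
    """Equation 3.6: psi(v) is the number of model vectors equal to v."""
    counts = Counter(model_vecs)
    return lambda v: counts[v]

def lookup_sum(target_vecs, psi):
    """Equation 3.7: one table lookup per target salient point."""
    return sum(psi(cv) for cv in target_vecs)

# Quantized (d, alpha, phi) triples; the values are illustrative only.
model = [(5, 30, 90), (5, 30, 90), (7, 120, 45), (3, 200, 10)]
target = [(5, 30, 90), (7, 120, 45), (9, 10, 10)]
psi = make_psi(model)
votes_pairs = double_sum(target, model)    # 3
votes_lookup = lookup_sum(target, psi)     # 3
```

Both routes give the same vote total. The lookup form does the work over the model points once, when ψ is built, which is the reason the matching step reduces to J table lookups per accumulator location.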


3.3 Effect of Accumulator dimensions on the SNR

Consider a scene with $N$ salient points, where $N_p$ salient points belong to the object and $N_c$ are clutter (or noise) points, such that:

$$N = N_p + N_c \tag{3.8}$$

Let the total accumulator contribution from points on the object be $A_p$ and the total accumulator contribution from clutter be $A_c$. Since the increments from points on the object would ideally happen at a single point in the accumulator, we can assume $A_p$ to be the peak of the accumulator. $A_c$, however, will be distributed across the accumulator.

The average accumulator contribution by the clutter (assuming that the clutter response is uniformly distributed) at each accumulator bin except at the peak will be given by:

$$\mathcal{N} = \frac{A_c}{M - 1} \tag{3.9}$$

where $M$ is the total number of accumulator bins across all dimensions. Ideally, we would want the average contribution by the clutter to be as small as possible. Note that the total power $A_c + A_p$ will always be a constant irrespective of the number of bins.

Thus, one can define the signal-to-noise ratio of the accumulator as:

$$SNR = \frac{A_p}{\mathcal{N}} = (M - 1)\,\frac{A_p}{A_c} \tag{3.10}$$

Thus, adding more dimensions would increase the total number of bins in the accumulator which would translate to higher SNR. This, of course, comes at the cost of higher computational complexity.
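A quick numerical check makes the dependence on the number of bins concrete. The vote totals and accumulator sizes below are invented for illustration, and `accumulator_snr` is a hypothetical helper that evaluates the signal-to-noise ratio defined above.

```python
def accumulator_snr(a_p, a_c, n_bins):
    """SNR = A_p divided by the average clutter per non-peak bin, A_c / (M - 1)."""
    noise = a_c / (n_bins - 1)
    return a_p / noise

# Illustrative numbers: 100 units of object votes, 400 units of clutter votes.
snr_2d = accumulator_snr(100, 400, 64 * 64)        # (x, y) accumulator
snr_3d = accumulator_snr(100, 400, 64 * 64 * 8)    # (x, y, scale) accumulator
```

With the same vote totals, adding an eight-bin third dimension raises the SNR from 1023.75 to 8191.75, at the price of an accumulator that is eight times larger.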

3.4 Model Function Construction

In this section, we show how to construct the function ψ which is used as the model for a particular object.

Consider an image containing a single instance of an object, which consists, in turn, of a set of L salient points {P1, P2, ..., PL}. L could be infinite, in which case a functional


Figure 3.3: Example feature space for a two dimensional feature vector. The black points indicate the feature vectors in feature space. The model function for a sample feature vector (shown in red) calculates the shortest distance to the set of all feature vectors.

point, compute the set of local features $\mathbf{v} = [r, \boldsymbol{\theta}]$, $\mathbf{v} \in \mathbb{R}^{d+1}$, where $r$ is the distance to the reference point. Again, the exact nature of these features is application dependent.

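Then the function $\psi$, $\psi : \mathbb{R}^{d+1} \to \mathbb{R}$, is built by computing:

$$\psi(\mathbf{v}) = \max_{l} \frac{1}{Z_l} \exp\left(-(\mathbf{v} - \mathbf{v}_l)\,\Sigma\,(\mathbf{v} - \mathbf{v}_l)^T\right) \tag{3.11}$$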

where $Z_l$ is a normalization, $\Sigma$ is a diagonal matrix of weights for each dimension of $\mathbf{v}$, and $l$ ranges over the set of salient points in the image from which the model was built. Note that the max operator actually constructs a Voronoi partition on the parameter space $\mathbf{v}$. The use of the maximum operator associates every point in parameter space, $\mathbf{v}$, with the closest (in parameter space) salient point $\mathbf{v}_l$. The function $\psi$ can be stored as a table, which makes its usage in equation 3.1 a simple lookup. Figure 3.3 shows the model function calculation for a two dimensional feature space.
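The construction of equation 3.11 can be sketched directly. This is a simplified illustration: `make_model` is a hypothetical helper, a single normalization constant `z` stands in for the per-point $Z_l$, and ψ is evaluated on demand instead of being precomputed into a lookup table.

```python
import numpy as np

def make_model(model_vecs, weights, z=1.0):
    """Return psi(v) = max_l (1/Z) exp(-(v - v_l) Sigma (v - v_l)^T)."""
    model_vecs = np.asarray(model_vecs, dtype=float)   # shape (L, d + 1)
    weights = np.asarray(weights, dtype=float)         # diagonal of Sigma

    def psi(v):
        diff = np.asarray(v, dtype=float) - model_vecs
        quad = (diff * diff * weights).sum(axis=1)     # weighted squared distance to each v_l
        return float(np.exp(-quad).max() / z)

    return psi

# Two model feature vectors [r, theta]; the numbers are illustrative.
psi = make_model([[2.0, 0.1], [5.0, 0.7]], weights=[1.0, 4.0])
on_model = psi([2.0, 0.1])     # exactly on a model vector
near_model = psi([2.5, 0.1])   # 0.5 away in r from the first model vector
```

A query that coincides with a model vector scores 1, and the score decays with the weighted distance to the nearest model vector only, since the max discards every other model point.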


Figure 3.4: Test image used for the SNR comparison test from the MIT Books dataset

recognition in clutter when compared to using the linear summation operator, such as the one used in the Generalized Hough Transform and its derivatives, where each feature vector may contribute a vote value of more than one in the accumulator. The max operator has also been found to be extremely robust to clutter in the standard model by Poggio et al. [11] and has also been used to successfully characterize the 2D and 3D shape and feature tuning response of neurons in the V4 and IT[47][48].

3.4.1 SNR Comparison


Figure 3.5: SNR of the Accumulator comparing the max and summation operators

3.5 Template Matching as an Accumulative Process

In order to show that template matching is a specialized case of an accumulative process, let us consider the one-dimensional case with two discrete signals $F = \{f_{x_1}, f_{x_2}, \ldots, f_{x_n}, \ldots, f_{x_N}\}$, $x_n \in \mathbb{Z}$, $|f_{x_n}| \le 1$ and $G = \{g_{y_1}, g_{y_2}, g_{y_3}, \ldots, g_{y_k}, \ldots, g_{y_K}\}$, $y_k \in \mathbb{Z}$, $|g_{y_k}| \le 1$.

Template matching is an approach to find instances of $G$ in $F$. This is achieved by sliding $G$ over $F$ and calculating the cross-correlation at every point.

$$A(\bar{x}) = \sum_{n=1}^{N} f_{x_n}\, g_{x_n - \bar{x}} \tag{3.12}$$

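The best match or location of $G$ in $F$ is then given by the maximum of $A(\bar{x})$. Equation 3.12 can be rewritten as:

$$A(\bar{x}) = \sum_{n=1}^{N} f_{x_n} \sum_{k=1}^{K} g_{y_k}\, \delta(x_n - \bar{x} - y_k) \tag{3.13}$$

$$= \sum_{n=1}^{N} \sum_{k=1}^{K} f_{x_n}\, g_{y_k}\, \delta(x_n - \bar{x} - y_k) \tag{3.14}$$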

where $\delta(\cdot)$ is the Kronecker delta. In order to describe template matching as an accumulative process, let us define a distance $v_n(\bar{x})$ from a point $x_n$ to $\bar{x}$ such that:

$$v_n(\bar{x}) = x_n - \bar{x} \tag{3.15}$$

Let $\bar{y}$ be a reference point in $G$. $\bar{y}$ will be assumed to be at the origin for simplicity. We can now define a distance $v_k$ from each point $y_k$ in $G$ to the reference point $\bar{y}$ ($\bar{y} = 0$):

$$v_k = y_k - \bar{y} \tag{3.16}$$

Equation 3.14 can now be rewritten as:

$$A(\bar{x}) = \sum_{n=1}^{N} \sum_{k=1}^{K} f_{x_n}\, g_{y_k}\, \delta(v_n(\bar{x}) - v_k) \tag{3.17}$$

Let us now define a function $\psi$, $\psi : \mathbb{Z} \to \mathbb{R}$, such that:

$$\psi(v_n) = \sum_{k=1}^{K} f_{x_n}\, g_{y_k}\, \delta(v_n - v_k) \tag{3.18}$$

Substituting equation 3.18 into equation 3.17, we now obtain the familiar SKS matching equation defined in equation 3.1:

$$A(\bar{x}) = \sum_{n=1}^{N} \psi(v_n(\bar{x})) \tag{3.19}$$

Thus, template matching is a specialized case of an accumulative process. This can be easily extended to higher dimensions by considering the location of each point as a vector instead of a scalar.
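The derivation can be verified on a small one-dimensional example. In this sketch the sample positions are assumed to be x_n = n and y_k = k with the template reference point at 0, the template is treated as zero outside its support, and the signal values and function names are invented for illustration.

```python
def correlate_direct(f, g, shift):
    """Equation 3.12, treating g as zero outside its support."""
    return sum(f[n] * g[n - shift] for n in range(len(f)) if n - shift in range(len(g)))

def correlate_accumulate(f, g, shift):
    """Equations 3.15 to 3.19 with x_n = n, y_k = k and reference point 0."""
    total = 0.0
    for n in range(len(f)):
        v_n = n - shift                                   # equation 3.15
        # Equation 3.18: only template samples with v_k == v_n contribute.
        psi = sum(f[n] * g[k] for k in range(len(g)) if k == v_n)
        total += psi                                      # equation 3.19
    return total

f = [0.1, 0.9, 0.5, 0.9, 0.2, 0.0]
g = [0.9, 0.5, 0.9]
shifts = range(len(f) - len(g) + 1)
scores = [correlate_accumulate(f, g, s) for s in shifts]
best_shift = max(shifts, key=lambda s: scores[s])
```

Both formulations give identical scores at every shift, and the maximum falls at shift 1, where the template lines up with the matching segment of f.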

3.6 Conclusion


Chapter 4

Shape Recognition

Recognizing silhouettes is probably the simplest of shape recognition tasks. This chapter looks at the application of the SKS algorithm described in chapter 3 to recognizing silhouettes. The silhouettes are assumed to be simple curves which may or may not be closed. A form of the algorithm suitable for recognizing silhouettes will be derived using a continuous functional notation. We will also evaluate the performance of the algorithm under rotation (in the plane), scale (zoom) and partial occlusion.

4.1 Continuous Representation

Let $C(s)$ be a curve in the plane, $C : U \to \mathbb{R} \times \mathbb{R}$, parametrized by arc length, $s$, where $U = [0, 1]$ and $s \in U$. Every point on the curve $C(s)$ is considered a salient point. Let $\boldsymbol{\theta}(s)$ be a vector of features of dimension $d$ measured at each point $s$ on the curve. Let $r$ be the distance to a reference point, $\Omega$. Note that if each of the components of $\boldsymbol{\theta}(s)$ is continuous and differentiable, the curve $\upsilon(s)$ (denoting $\upsilon \doteq [r, \boldsymbol{\theta}]$), embedded in $\mathbb{R}^{d+1}$, will be continuous and differentiable; and if $C$ is closed, $\upsilon(s)$ is periodic. In this section, the components are assumed to be continuously differentiable.

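Let $\Upsilon \subset \mathbb{R}^{d+1}$ be the set of all possible values of the parameter vector $\upsilon$. If $\mathbf{u} \in \Upsilon$ is a point in the parameter space, define a function $\psi : \mathbb{R}^{d+1} \to \mathbb{R}$ on $\mathbf{u}$ by:

$$\psi(\mathbf{u}) = \max_{s} \exp\left(-\frac{\|\mathbf{u} - \mathbf{v}\|^2}{\sigma^2}\right), \quad \mathbf{v} \in (\upsilon \circ C)(s) \tag{4.1}$$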


This function ψ will be our model for a shape, and we recognize it to be a particular instance of Eq. 3.11.

Analysis: The 1-level set of $\psi$, $\partial = \{\mathbf{v} \in \mathbb{R}^{d+1} \mid \psi(\mathbf{v}) = 1\}$, is the collection of points in the parameter space corresponding to values of $\upsilon$ which actually occur on $C$. When we think about the function $\psi$, the first observation is that the maximum operator associates, in some sense, an arbitrary point in the feature space, $\Upsilon$, with the closest point in the feature space corresponding to a point on the boundary of $C$. If $C$ is sampled (as is always true in a computer implementation), this association creates a Voronoi partition of $\Upsilon$.
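The Voronoi interpretation is easy to see on a sampled curve: the sample that attains the maximum in equation 4.1 is always the nearest one. The sketch below assumes a plain Euclidean distance in feature space; the name `psi_and_owner` and the sample values are hypothetical.

```python
import numpy as np

def psi_and_owner(u, samples, sigma=1.0):
    """Equation 4.1 on a sampled curve, plus the index of the sample attaining the max."""
    d2 = ((np.asarray(samples, dtype=float) - np.asarray(u, dtype=float)) ** 2).sum(axis=1)
    values = np.exp(-d2 / sigma**2)
    return float(values.max()), int(values.argmax()), int(d2.argmin())

# Three sampled feature vectors and one query point; the numbers are illustrative.
samples = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
value, owner, nearest = psi_and_owner([3.0, 1.0], samples)
```

Because the exponential is monotone in the distance, `owner` and `nearest` always agree, so each sample owns exactly the region of feature space that is closer to it than to any other sample.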

Shape Matching

In order to match a target shape $C_T(s)$ with a model shape $C_m(s)$, we first compute the features $\upsilon(C_T(s))$. Then a simple table lookup gives us a measure of how often a feature found in $C_T$ occurs in $C_m$. That is, we calculate:

$$A_{mT}(\mathbf{x}) = \int_s \psi_m(\upsilon(C_T(s))) = \int_s (\psi_m \circ \upsilon \circ C_T)(s) \tag{4.2}$$

Going around curve $C_T$, simply compute the feature vector $\upsilon$ at every point with respect to the reference point $\mathbf{x} = (x, y)$. Then look each value up in $\psi_m$ and add the results. The peak of the accumulator $A_{mT}(\mathbf{x})$ gives a measure of the match between the two shapes. Note that equation 4.2 is a continuous formulation of equation 3.1.

This explanation has ignored the length of the boundary of either curve, assuming $s \in [0, 1]$. This is easily corrected by normalization. However, such normalizations do depend on scale, which will be discussed later in this chapter.

If, instead of a single model image, $C_m$, we have many models $m \in \{1, 2, \ldots, M\}$ in a database, we compute all matches, $A_{mT}(\mathbf{x})$, and select the best model, whose index will be:

$$j = \arg\max_{m,\mathbf{x}} A_{mT}(\mathbf{x}) \tag{4.3}$$

The composition of three functions in equation 4.2 may seem daunting, but the computations are straightforward and (since a table lookup is used) quite fast.
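Selecting the best model from a database then amounts to comparing accumulator peaks. The sketch below assumes the accumulators have already been computed and are held in a dictionary keyed by model name; `best_model`, the model names, and the array values are placeholders.

```python
import numpy as np

def best_model(accumulators):
    """Equation 4.3: pick the model (and location) with the highest accumulator peak."""
    best = max(accumulators, key=lambda m: accumulators[m].max())
    acc = accumulators[best]
    return best, np.unravel_index(acc.argmax(), acc.shape)

accs = {
    "model_a": np.array([[0.0, 1.0], [2.0, 0.5]]),
    "model_b": np.array([[0.0, 3.5], [1.0, 0.0]]),
}
name, location = best_model(accs)
```

Here `model_b` wins with a peak of 3.5 at row 0, column 1. Comparing raw peaks across models presumes the accumulators are on a common scale, which is where the boundary-length normalization mentioned above would come in.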


Figure 4.1: Feature Vectors: κ, curvature at the point; φ, angle between the vector from the reference point to the feature point and the tangent frame of the reference point; r is the distance from the salient point (Ck) to the reference point (Ω). All are translation and rotation invariant.

4.1.1 Feature Selection

The choices for feature vector θ(s) are numerous. However, ideally we would want features which have sufficient generality and possess invariance to similarity transforms. Figure 4.1 shows the feature vectors used. We pick the reference point Ω to be on the contour at the extrema of curvature. This enables us to use the Frenet frame at the reference point as a rotation invariant coordinate system. Curvature (κ) and the polar angle (φ) are both rotation invariant, which makes them good choices as features. In the past, curvature has often been dismissed as a useful feature because of its sensitivity to noise (it involves second derivatives). However, when computed using digital straight segments [49, 50, 51, 52], we found that curvature provided a very useful and robust feature.


4.1.2 Scale Estimation

Since many features, such as distances (e.g. r) and curvature, depend on scale, a global characteristic of a shape, scale must be estimated. In order to maintain robustness to partial occlusion however, we must estimate scale using local features.

Consider the digital contour $C = \{C_1, C_2, \ldots, C_K\}$ with $K$ points. We use a set of reference points $\Omega_j$, $j = 1, \ldots, J$, on the contour. The Frenet frames at those points are rotationally invariant reference coordinate systems. The polar coordinates, $(r_i, \phi_i)$, of any point $C_i$ on the contour are calculated with respect to the reference point $\Omega_j$.

With respect to a reference point, $\Omega_j$, the “distance map”, $D_j : \mathbb{R} \to \mathbb{R}$, is the distance from $\Omega_j$ to the furthest point on the curve at a particular polar angle $\phi$ in the local Frenet frame.

$$D_j(\phi) = \max_{(r, \phi) \in C} r \tag{4.4}$$

Let ${}^{S}D_i(\phi)$ and ${}^{T}D_j(\phi)$ be the distance maps of the $i$th reference point on the model and the $j$th reference point on the target shape respectively. The scale at each $\phi$ can then be estimated as:

$$s_{ji}(\phi) = {}^{T}D_j(\phi) \,/\, {}^{S}D_i(\phi) \tag{4.5}$$

The scale is estimated as the median of all the scales ($s_{ji}$).
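A minimal sketch of this estimate for one pair of reference points follows. It assumes the polar coordinates are already expressed in the local Frenet frames, quantizes the polar angle into 36 bins (an arbitrary choice), and uses a synthetic contour. The names `distance_map` and `estimate_scale` are hypothetical, and the text's median over all reference point pairs is reduced to a single pair for brevity.

```python
import numpy as np

def distance_map(r, phi, n_bins=36):
    """Equation 4.4: furthest contour point in each polar-angle bin (NaN if empty)."""
    r, phi = np.asarray(r, dtype=float), np.asarray(phi, dtype=float)
    bins = (np.mod(phi, 2 * np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    dmap = np.full(n_bins, np.nan)
    for b, dist in zip(bins, r):
        if np.isnan(dmap[b]) or dist > dmap[b]:
            dmap[b] = dist
    return dmap

def estimate_scale(model_map, target_map):
    """Equation 4.5, then the median over the angles present in both maps."""
    valid = ~np.isnan(model_map) & ~np.isnan(target_map) & (model_map > 0)
    return float(np.median(target_map[valid] / model_map[valid]))

# Synthetic model contour and a target that is 2.5 times larger but partly occluded.
phi = np.linspace(0.0, 2 * np.pi, 72, endpoint=False)
model_r = 1.0 + 0.3 * np.cos(2 * phi)
visible = slice(0, 40)                          # the rest of the target is missing
scale = estimate_scale(distance_map(model_r, phi),
                       distance_map(2.5 * model_r[visible], phi[visible]))
```

Only the angle bins seen in both maps take part, so the estimate stays close to 2.5 even though nearly half of the target contour is missing. The median also absorbs the occasional bin at the occlusion boundary whose ratio is off.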

4.2 Experimental Results

This section compares the performance of the SKS algorithm with three other algorithms (Hu Moments[20], Shape Context[18] and CSS matching[19]) in terms of invariance to similarity transforms (translation, rotation and zoom) and robustness to partial occlusion. Furthermore, we also show the applicability of the algorithms to similarity-based shape retrieval.

4.2.1 Datasets


Figure 4.2: Database of 12 tanks

Table 4.1: Invariance to Similarity Transforms

| Algorithm | SKS | Shape Context | Hu Moments | CSS |
|---|---|---|---|---|
| Correct Retrievals (%) | 98.26 | 99.13 | 77.60 | 75.11 |

of tanks look very similar and are mostly made up of straight lines and thus, have zero curvature at most points which makes them a slightly harder dataset to work with than the SQUID database.

4.2.2 Invariance to Similarity Transforms

In this experiment, we built models using all the tanks (rotated and scaled) and matched each tank contour with every model. The number of correct matches in the top 12 retrieved shapes was determined. Since there are 12 tanks and each tank has 6 rotated and scaled versions of itself, the total number of correct matches is 1728.
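The retrieval score for a single query can be sketched as follows. The function name, the score values, and the class labels are invented for illustration. A cutoff of 3 is used here to keep the example small, whereas the experiment uses the top 12.

```python
def correct_in_top_k(scores, labels, query_label, k=12):
    """Count retrieved shapes among the k best scores that share the query's class."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(1 for i in ranked[:k] if labels[i] == query_label)

# Peak accumulator values against five models and the class of each model.
scores = [0.9, 0.2, 0.8, 0.7, 0.1]
labels = ["a", "b", "a", "b", "a"]
hits = correct_in_top_k(scores, labels, query_label="a", k=3)   # 2
```

Summing this count over every query and dividing by the total number of possible correct matches gives a percentage of the kind reported in Table 4.1.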


Figure 4.3: The 31 randomly picked silhouettes from the SQUID database. Used with permission.

4.2.3 Robustness to Occlusion

In this experiment, we randomly picked 31 fishes from the SQUID database. These are shown in figure 4.3.

We then partially occluded each fish by retaining 10-90% of the points. To produce a partial occlusion of, say, $\delta$ percent, the following algorithm was used: Consider a digital contour with points $C = \{C_1, C_2, \ldots, C_N\}$. Starting at point $C_1$, $K$ sequential points are chosen such that $K/N = \delta/100$. Those $K$ points are removed from the boundary. This produces an occluded boundary with $\delta$ percent occlusion. Since it can be reasonably anticipated that some areas of the boundary will be more sensitive to occlusion than others, the starting point was moved from 1 to 2, then 3, etc., and more occluded boundaries were generated, resulting in a total of 8300 fishes. An example of this is shown in figure 4.4.
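The occlusion procedure can be sketched as below. Two details are assumptions not stated in the text: the removed run wraps around the end of the point list (natural for a closed contour), and K is rounded to the nearest integer. The function names are hypothetical.

```python
def occlude(contour, percent, start=0):
    """Remove K sequential points, K/N = percent/100, beginning at index `start`."""
    n = len(contour)
    k = round(n * percent / 100)
    removed = {(start + i) % n for i in range(k)}      # wraps around a closed contour
    return [p for idx, p in enumerate(contour) if idx not in removed]

def all_occlusions(contour, percent):
    """One occluded contour per starting point."""
    return [occlude(contour, percent, s) for s in range(len(contour))]

contour = list(range(10))                  # stand-in for ten contour points
kept = occlude(contour, 40, start=8)       # [2, 3, 4, 5, 6, 7]
variants = all_occlusions(contour, 40)
```

Each starting point yields a different occluded region at the same occlusion level, which is how a single contour produces many test cases.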

At each occlusion level, the occluded fish generated using all possible starting points were matched with the unoccluded original set of 31 fish and classified. The results of the occlusion experiment for the four algorithms are shown in figure 4.5.


(a) Original Contour

(b) Contour at 40% occlusion

(c) Same contour at 40% occlusion but with a different region occluded

Figure 4.4: A fish contour occluded by 40% with different occluded regions

at 60% occlusion, the classification is essentially perfect.

4.2.4 Application to Content-based Image Retrieval


Figure 4.5: Classification accuracy at various levels of occlusion.

4.3 Discussion


(a) kk732 - Query shape.

(b) Top 5 matches in random order for the shape kk732 using CSS.

(c) Top 5 matches in random order for the shape kk732 using Hu Moments.

(d) Top 5 matches in random order for the shape kk732 using Shape Context.

(e) Top 5 matches in random order for the shape kk732 using the SKS algorithm.


(a) kk942 - Query shape.

(b) Top 5 matches for the shape kk942 using CSS.

(c) Top 5 matches for the shape kk942 using Hu Moments.

(d) Top 5 matches for the shape kk942 using Shape Context.

(e) Top 5 matches for the shape kk942 using the SKS algorithm.


large occlusions occur.

If we think of older literature in Computer Vision, we observe the opposite philosophy being supported, based on cognitive psychology. The point was made that humans make substantial use of key points, and figure 4.8, or similar, was used to make this point. The figure consists of only straight lines connecting key points, that is, points of high curvature. We propose that both types of operations occur: identification of key points and gestalt shape recognition. In this version of the SKS algorithm, only a single key point (or reference point) correspondence is required. Although this is still a problem of complexity O(nm) (where n is the number of salient points and m is the number of reference points), it is still less than the complexity of the correspondence problem required for the shape context algorithm (which is of the order O(n³)).

Figure 4.8: A line drawing, without color or texture information, and consisting of only straight lines connecting salient points.


Chapter 5

Feature Matching

This chapter derives a version of the SKS algorithm described in chapter 3 to match local features between real images. Local feature matching is the first step in many object recognition algorithms and in feature-based image registration algorithms, where it is used to estimate point correspondences. We consider images that are usually taken from different viewpoints and have significant variations in local brightness. The performance of the algorithm is evaluated by comparing it against a SIFT-based feature matcher and by applying it to register aerial images.

5.1 Feature Extraction

The goal of a feature matcher is to identify similar local regions between two images. In order to accomplish this, we extract multiple reference points from the image. The points in the local neighborhood around each of these reference points are then considered potential salient points. This enables us to describe a local region around each reference point (independent of one another) by extracting feature vectors from each salient point in the local neighborhood.


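The matrix C(x, σD, σI) is defined as:

$$C(\mathbf{x}, \sigma_D, \sigma_I) = \sigma_D^2 \, G(\sigma_I) * \begin{bmatrix} I_x^2(\mathbf{x}, \sigma_D) & I_x(\mathbf{x}, \sigma_D)\, I_y(\mathbf{x}, \sigma_D) \\ I_x(\mathbf{x}, \sigma_D)\, I_y(\mathbf{x}, \sigma_D) & I_y^2(\mathbf{x}, \sigma_D) \end{bmatrix} \tag{5.2}$$

where Ix(x, σD) and Iy(x, σD) are the first partial derivatives of the image I(x) along x and y. σD is the scale used to compute the derivatives and σI is the size of the region around x used to integrate the derivatives.

$$I_x(\mathbf{x}, \sigma_D) = I(\mathbf{x}) * \frac{\partial G(\sigma_D)}{\partial x} \tag{5.3}$$

$$I_y(\mathbf{x}, \sigma_D) = I(\mathbf{x}) * \frac{\partial G(\sigma_D)}{\partial y} \tag{5.4}$$

G(σD) is a Gaussian with standard deviation σD and ∗ denotes convolution.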

The Multi-scale Harris function R(x) is positive in corner regions, negative in edge regions and small in flat or uniform regions. α is usually set in the range 0.04-0.06 [35]. The spatial derivatives at a point will in general decrease with increasing scale. Therefore, the Harris function is normalized for scale to achieve scale invariance. This is done by normalizing the derivatives with the derivative scale σD. More information on scale normalization can be found in [56, 57].
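The sign behavior of R(x) can be illustrated with a small sketch. The defining equation of R is not reproduced in this section, so the code assumes the standard Harris and Stephens form R = det(C) − α·trace(C)²; the function name `harris_response` and the sample matrix entries are hypothetical.

```python
def harris_response(c_xx: float, c_xy: float, c_yy: float, alpha: float = 0.04) -> float:
    """Corner measure for the symmetric 2x2 matrix [[c_xx, c_xy], [c_xy, c_yy]].

    Assumption: R = det(C) - alpha * trace(C)^2, the standard Harris-Stephens form.
    """
    det = c_xx * c_yy - c_xy * c_xy
    trace = c_xx + c_yy
    return det - alpha * trace * trace


corner = harris_response(1.0, 0.0, 1.0)   # two strong gradient directions
edge = harris_response(1.0, 0.0, 0.0)     # one strong gradient direction
flat = harris_response(0.01, 0.0, 0.01)   # weak gradients everywhere
```

With these toy inputs `corner` is positive, `edge` is negative and `flat` is close to zero, which matches the qualitative description above.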

Once the corner points are determined at a particular scale σI, we check if the point is at a characteristic scale. The characteristic scale is the scale at which an interest point is a local extremum across scale. Characteristic scales were extensively studied by Lindeberg [57]. Theoretically, the ratio of the scales for two corresponding points at the characteristic scales is equal to the scale factor of the two images. Mikolajczyk et al. [56] analyzed four different functions to determine the local maxima of a feature point across scales. These include the Laplacian, Difference of Gaussian (DOG), simple gradient and the Harris function. They concluded that the Laplacian detects the highest percentage of correct points which are at the characteristic scale.


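$$L(\mathbf{x}, \sigma_n) > L(\mathbf{x}, \sigma_{n+1}) \;\wedge\; L(\mathbf{x}, \sigma_n) > L(\mathbf{x}, \sigma_{n-1}), \qquad L(\mathbf{x}, \sigma_n) > T \tag{5.5}$$

where T is a threshold and L(x, σ) is the Laplacian defined as:

$$L(\mathbf{x}, \sigma) = \sigma^2 \left| I_{xx}(\mathbf{x}, \sigma) + I_{yy}(\mathbf{x}, \sigma) \right| \tag{5.6}$$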

By reducing the set of reference points to characteristic scale Harris corner points, we reduce the computational complexity of the problem without loss of scale invariance.
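The characteristic scale test of equation 5.5 can be sketched as below. The function name `is_characteristic_scale` and the sample responses are hypothetical; the list is assumed to hold the scale-normalized Laplacian of equation 5.6 for a single image point, sampled at consecutive scales.

```python
from typing import Sequence


def is_characteristic_scale(laplacian: Sequence[float], n: int, threshold: float) -> bool:
    """Test equation 5.5 at scale index n.

    `laplacian[i]` holds L(x, sigma_i) for one image point x.
    """
    if n <= 0 or n >= len(laplacian) - 1:
        return False  # assumption: the two end scales cannot be tested
    return (laplacian[n] > laplacian[n + 1]
            and laplacian[n] > laplacian[n - 1]
            and laplacian[n] > threshold)


responses = [0.2, 0.5, 0.9, 0.6, 0.3]
selected = [n for n in range(len(responses)) if is_characteristic_scale(responses, n, 0.4)]
```

Only the scale index whose response exceeds both neighbors and the threshold T is kept, so a corner point that never peaks across scale is discarded.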

5.2 Feature Matching

5.2.1 Model Building

Once reference points have been detected in the images, the next step is to build a model at each one of them. Consider a set of J reference points in the source image, SF = {SF1, SF2, ..., SFj, ..., SFJ}, where SFj is an ordered pair of coordinates (x, y). We first establish a rotation invariant coordinate system at each reference point SFj by finding the dominant orientation (φj) in the circular neighborhood of radius σj, where σj is the scale at which the reference point was found. See section 5.2.3 for more information on finding the dominant orientation.

Every point in the circular neighborhood of a reference point is considered a salient point. Assume there are K salient points in each neighborhood. We now build a model at each reference point (SFj) as follows: We formulate a set of K feature vectors vjk = (rjk, θjk, φjk), computed at each salient point k in a circular neighborhood of radius σj around SFj, as shown in figure 5.1. rjk is the distance of the salient point to the reference point, θjk is the polar angle of the salient point with respect to the invariant coordinate system at the reference point SFj, and φjk is the difference between the gradient direction at the salient point k (φk) and the dominant orientation (φj):

$$\phi_{jk} = \phi_k - \phi_j \tag{5.7}$$

The model ψ, ψ: R³ → R, at the reference point SFj is given by:

$${}^{S}\psi_j(v) = \max_{k=1}^{K} \exp\left(-\frac{\|v - v_{jk}\|^2}{\cdots}\right)$$

Figure 5.1: Feature vector at a salient point (shown in grey) in the neighborhood of a reference point (shown in blue). φ is the gradient direction with respect to the dominant orientation of the reference point; r is the distance to the reference point; θ is the polar angle of the salient point with respect to the reference point.

The model function (Sψj(v)) can be viewed as a function which estimates the closeness of a given feature vector (v) to all the feature vectors around the reference point SFj. Note that the model function is essentially a particular instance of equation 3.1, which implies that it can be precomputed and stored as a lookup table, which considerably speeds up the matching process.
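A model function of this kind can be sketched as a closure over the stored feature vectors. This is only an illustration under stated assumptions: the normalizer inside the exponential is taken to be 2σ² with a hypothetical width `sigma`, angles are compared without wraparound, and the lookup-table precomputation is omitted. `build_model` is a hypothetical name.

```python
import math
from typing import Callable, Sequence, Tuple

Feature = Tuple[float, float, float]  # (r, theta, phi)


def build_model(features: Sequence[Feature], sigma: float = 1.0) -> Callable[[Feature], float]:
    """Return psi_j(v) = max_k exp(-||v - v_jk||^2 / (2 * sigma^2)).

    Assumption: the 2 * sigma^2 normalizer and its value are illustrative only.
    Angles are compared without wraparound to keep the sketch short.
    """
    def psi(v: Feature) -> float:
        best = 0.0
        for vk in features:
            d2 = sum((a - b) ** 2 for a, b in zip(v, vk))
            best = max(best, math.exp(-d2 / (2.0 * sigma ** 2)))
        return best
    return psi


psi = build_model([(1.0, 0.0, 0.0), (2.0, 1.5, -0.5)])
exact = psi((2.0, 1.5, -0.5))   # coincides with a stored feature vector
far = psi((50.0, 0.0, 0.0))     # far from every stored feature vector
```

The response is 1 when the query coincides with a stored feature vector and decays toward 0 as it moves away from all of them, which is the "closeness" interpretation given above.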

5.2.2 Matching

The matching process uses evidence accumulation based on the SKS algorithm described in chapter 3 to determine the correspondence between two sets of reference points. Assume the source image (or reference image) has a set of J reference points SF = {SF1, SF2, ..., SFj, ..., SFJ}, SFj = (x, y), and a corresponding set of models, Sψj, one for each reference point (SFj). Let the target image have M reference points TF = {TF1, TF2, ..., TFm, ..., TFM}.

Consider a reference point TFm in the target image. To find the best point in the


vectors vml = (rml, θml, φml) computed at each salient point l in a circular neighborhood of radius σm around the reference point. rml is the distance of the salient point to the reference point, θml is the polar angle of the salient point with respect to the invariant coordinate system at the reference point TFm. φml is the difference between the gradient direction at the salient point l and the dominant orientation (φm).

Now, we compute a match between the reference point TFm in the target image and the model (Sψj(v)) of the feature point SFj in the source image,

$$A_{mj} = \frac{1}{L} \sum_{l=1}^{L} {}^{S}\psi_j(v_{ml}) \tag{5.9}$$

Note that equation 5.9 is identical to equation 3.1 except for the fact that accumulation happens only at the reference point locations rather than at every possible location in the target image.

The best match for the point TFm in the source image is given by:

$${}^{S}F_i = \arg\max_{j} A_{mj} \quad \forall\, A_{mj} > T \tag{5.10}$$

where T is a user-defined threshold which determines the minimum match value between two points to consider them to be similar. If no match is found, then it is assumed the interest point (TFm) does not exist in the source image.
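Equations 5.9 and 5.10 can be sketched together as follows. The names `match_score` and `best_match` are hypothetical, and the toy lambdas merely stand in for precomputed model functions; the sketch assumes at least one model and at least one target feature vector.

```python
from typing import Callable, Optional, Sequence, Tuple

Feature = Tuple[float, float, float]
Model = Callable[[Feature], float]


def match_score(model: Model, target_features: Sequence[Feature]) -> float:
    """Equation 5.9: mean model response over the L target feature vectors."""
    return sum(model(v) for v in target_features) / len(target_features)


def best_match(models: Sequence[Model], target_features: Sequence[Feature],
               threshold: float) -> Optional[int]:
    """Equation 5.10: index j of the best-scoring model, or None if no score exceeds T."""
    scores = [match_score(m, target_features) for m in models]
    j = max(range(len(scores)), key=scores.__getitem__)
    return j if scores[j] > threshold else None


# Toy models standing in for precomputed model functions (hypothetical values).
models = [lambda v: 0.2, lambda v: 0.9 if v[0] < 5 else 0.1, lambda v: 0.4]
target = [(1.0, 0.0, 0.0), (2.0, 0.3, 0.1), (9.0, 1.0, 0.2)]
```

Returning `None` mirrors the rule above: when no accumulated score exceeds T, the target reference point is assumed to have no counterpart in the source image.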

5.2.3 Dominant Orientation Estimation

To achieve invariance to rotation, we establish a reference coordinate system at each reference point SFj by estimating the dominant orientation vector. This is accomplished by computing the eigenvalues and eigenvectors of the Harris matrix given by equation 5.2 at each reference point. The eigenvector, ej = (ejx, ejy), corresponding to the highest eigenvalue of the Harris matrix points in the direction of maximum gradient strength in the local neighborhood and thus, intuitively, can be used to assign a dominant orientation vector to the region. The sign of the eigenvector is not unique and is corrected for by computing the histogram of the number of positive and negative projections of the gradient orientations at each point in the neighborhood on the eigenvector and assigning the sign corresponding to the histogram maximum.
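The eigenvector computation and sign correction described above can be sketched for a single 2x2 Harris matrix. This is a hedged illustration: the closed-form principal axis angle ½·atan2(2c_xy, c_xx − c_yy) is used instead of a general eigen-solver, the final orientation is assumed to be atan2(e_y, e_x), and all names and sample gradients are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float]


def principal_eigenvector(c_xx: float, c_xy: float, c_yy: float) -> Vec:
    """Unit eigenvector for the largest eigenvalue of [[c_xx, c_xy], [c_xy, c_yy]]."""
    angle = 0.5 * math.atan2(2.0 * c_xy, c_xx - c_yy)
    return (math.cos(angle), math.sin(angle))


def dominant_orientation(c_xx: float, c_xy: float, c_yy: float,
                         gradients: Sequence[Vec]) -> float:
    """Resolve the eigenvector sign by majority vote, then return its angle in radians.

    Assumption: the orientation is reported as atan2(e_y, e_x).
    """
    ex, ey = principal_eigenvector(c_xx, c_xy, c_yy)
    positive = sum(1 for gx, gy in gradients if gx * ex + gy * ey > 0)
    negative = sum(1 for gx, gy in gradients if gx * ex + gy * ey < 0)
    if negative > positive:
        ex, ey = -ex, -ey
    return math.atan2(ey, ex)


# Most gradients point toward (-1, -1), so the eigenvector (1, 1)/sqrt(2) is flipped.
grads = [(-1.0, -1.0), (-0.8, -1.1), (0.9, 1.0)]
phi = dominant_orientation(1.0, 1.0, 1.0, grads)
```

The two-bin vote plays the role of the histogram of positive and negative projections: whichever sign collects more gradient projections decides the direction of the eigenvector.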

The dominant orientation at each reference point SFj, therefore, is given by:


Figure 5.2: Two images from the dataset where one is a rotated and scaled version of the other

5.3 Results

We evaluate the performance of the algorithm in two steps. First, we look at the performance of the feature matcher by comparing it with the state of the art. Then, we look at the application of the algorithm to register images from a dataset of eight aerial image pairs.

5.3.1 Feature Matching Performance

To evaluate the performance, one might use ROC curves, but ROC is not really suited to this problem as the concepts of “positive and negative” are not really defined. Instead, we look at the performance of the feature matcher using precision-recall curves. We use the dataset¹ used in [39]. Figure 5.2 shows a reference and a target image from the dataset, where the target image is a scaled and rotated version of the reference image². Figure 5.3 shows the features matched between the two images. Figure 5.4 shows an example match with Gaussian noise artificially added to one of the images. The images clearly show that many correct correspondences are identified despite the high noise level.

Recall is defined as:

$$\text{recall} = \frac{\text{Number of correct matches found}}{\text{Actual number of correct matches}} \tag{5.12}$$

¹ The data set used is available at http://www.robots.ox.ac.uk/~vgg/research/affine.

2


Figure 5.4: Features matched between images where one is a rotated, scaled and noisy version of the other. The Gaussian noise was artificially added. The corresponding matches are marked using the same numbers (drawn in red) in both images. The white circle indicates one of the many correct matches.

In order to determine the actual number of correct matches between the two images and to verify the correctness of a correspondence, we use the method described in [39]. Let A and B respectively be the two regions in the reference and the target images, whose correspondence is to be checked. Let H be the known homography provided between the reference and the target images. We calculate the overlap error between the two regions by computing the areas of the union and intersection of the regions. This is done by first projecting the region B from the transformed image to the reference image using the homography H. The overlap error is defined as:

$$\text{overlap error} = 1 - \frac{\mathrm{Area}(A \cap H^{T} B H)}{\mathrm{Area}(A \cup H^{T} B H)} \tag{5.13}$$

A match is assumed to be correct if the overlap error is less than 0.5. The number of correct matches is estimated by finding, for each region around the corner points in the target image, the region around the corner point in the reference image with the lowest overlap error. The region size is proportional to the scale of the corner point.

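Precision is defined as:

$$\text{precision} = \frac{\text{Number of correct matches found}}{\text{Total number of matches found}} \tag{5.14}$$

We may plot recall versus imprecision, which is defined as:

$$\text{imprecision} = 1 - \text{precision} = \frac{\text{Number of false matches found}}{\text{Total number of matches found}} \tag{5.15}$$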

The goal of any feature matcher is to maximize the number of correct matches and minimize the number of false matches. Recall-precision curves are plotted by varying a threshold which increases the total number of matches. In general, recall increases with increasing imprecision. If the precision is reduced to zero, then everything is considered a match. So, a recall of 1 is achieved only at precisions close to zero.
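How the points of such a curve are obtained from match counts can be sketched as below. The counts are invented purely for illustration and are not results from this work; `recall_and_imprecision` is a hypothetical helper that applies the definitions of recall and precision given above.

```python
from typing import Tuple


def recall_and_imprecision(correct_found: int, total_found: int,
                           actual_correct: int) -> Tuple[float, float]:
    """Return (recall, 1 - precision) for one threshold setting."""
    recall = correct_found / actual_correct
    precision = correct_found / total_found
    return recall, 1.0 - precision


# Hypothetical counts for three increasingly permissive thresholds:
# (correct matches found, total matches found), with 200 true correspondences.
curve = [recall_and_imprecision(c, t, 200) for c, t in [(60, 70), (120, 180), (180, 600)]]
```

In this toy series both recall and imprecision grow as the threshold admits more matches, which is the general trend described above.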

Figure 5.5 shows the performance of the SKS feature matcher when compared with SIFT-based matching. SIFT-based descriptors with Euclidean distance for feature matching have been shown to outperform other methods in [39]. Ideally, we want a feature matching algorithm to detect a lot more correct matches at lower imprecision (<0.4). This is because at higher imprecision, we have a lot more false matches despite having higher recall in general. Notice that SKS clearly has higher recall rates at low imprecision than SIFT-based matchers. This translates to higher percent correct matches when compared to false matches in that region. Most real world feature matchers would operate in this region and SKS clearly demonstrates its superiority here.

SIFT-based matchers, however, get higher recall rates at high imprecision, but this comes at the cost of more false matches. The next step after feature matching is usually, in the case of image registration, a transformation model estimator, where high false positive rates are clearly not acceptable. Therefore, the SKS feature matcher is clearly the better choice of the two.

In order to test the robustness of the algorithm to extreme viewpoint changes (projective transforms), we plotted the precision versus the number of correct correspondences using the images shown in figure 5.6, comparing the SKS feature matcher and SIFT-based matching. Recall is not used here because estimating the actual number of correct correspondences is difficult: circular regions need not map to circular regions under projective transformations, which would make the computation of the overlap error unreliable and cumbersome. A match is, therefore, estimated to be correct if the Euclidean distance between the matching points is less than a threshold (10 pixels in this case).


Figure 5.5: Precision - Recall Curve for Scale and Rotation Changes


Figure 5.7: Precision versus number of correct correspondences for extreme viewpoint changes

5.3.2 Registration Performance

We test the performance of the algorithm on registering images by using the feature matcher to determine correspondence points and then estimating the similarity transform using non-linear least squares regression (NLLS) and RANSAC [58]. We use a database of 8 synthetic aerial image pairs (generated from real images). Each image pair contains one target image with extreme rotation, scale and brightness changes which we attempt to register with a reference image.
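To make the transform estimation step concrete, the sketch below fits a similarity transform to point correspondences. Note that it is not the NLLS and RANSAC procedure used here: it substitutes a simpler closed-form linear least-squares fit with no outlier rejection, representing each point as a complex number x + iy. The function name `fit_similarity` and the sample points are hypothetical.

```python
import cmath
from typing import Sequence, Tuple


def fit_similarity(src: Sequence[complex], dst: Sequence[complex]) -> Tuple[complex, complex]:
    """Least-squares fit of dst ~ a * src + b, with points stored as x + 1j * y.

    abs(a) is the scale and cmath.phase(a) the rotation angle; b is the translation.
    """
    n = len(src)
    src_mean = sum(src) / n
    dst_mean = sum(dst) / n
    num = sum((w - dst_mean) * (z - src_mean).conjugate() for z, w in zip(src, dst))
    den = sum(abs(z - src_mean) ** 2 for z in src)
    a = num / den
    b = dst_mean - a * src_mean
    return a, b


true_a = cmath.rect(2.0, cmath.pi / 2)   # scale 2, rotation 90 degrees
true_b = complex(1.0, 1.0)
src = [0j, 1 + 0j, 1 + 1j, 3 + 2j]
dst = [true_a * z + true_b for z in src]
a, b = fit_similarity(src, dst)
```

Because every correspondence contributes equally, a single false match can bias this fit, which is why a robust estimator such as RANSAC is normally wrapped around the estimation step.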

The results of the registration are shown in figures 5.8 through 5.15. Each figure shows the reference image, the target image and the difference image after registration. The results are close to perfect and show no registration anomalies. Notice that the difference image is not all black. This is because of the extreme brightness changes present in the target image.

5.4 Conclusion


(a) Reference Image (b) Target Image

(c) Difference image after registration


Chapter 6

Object Recognition

This chapter further extends the SKS algorithm introduced in chapter 3 to recognize and localize objects in a cluttered scene. The goal here is to recognize and localize an instance of the object rather than to identify the object category. The algorithm builds on the feature matcher introduced in chapter 5 and extends it further by adding features which describe the geometry of an object. In a sense, we accumulate the local photometric information combined using global geometric features to make a decision about the presence and location of a given object model. The algorithm is tested against the current state of the art by evaluating its performance using a database of books. Robustness to viewpoint changes is evaluated using the standard Coil-100 dataset.

6.1 Object Model

Consider an image I of the object whose model is to be built. The first step is to pick a set of salient points C = {C1, C2, . . . , CJ}, Cj = (xj, yj), in the image using the Harris-Laplace [56] interest point detector described in section 5.1. Each salient point also has an associated scale σj and dominant orientation φj (see section 5.2.3). We then pick a reference point Ω = (x, y), which is usually the object center, though it could be any point (even one of the salient points). The reference point Ω is assigned an arbitrary orientation (usually parallel to the x-axis of the image for convenience).

At every salient point Cj, we compute a feature vector vj = [rj, θj, φj], where rj is the distance to the reference point Ω, θj is the orientation of the direction vector to the reference point with respect to the dominant orientation at the salient point and φj is the dominant orientation of the salient point. This is shown in figure 6.1. (rj, θj) essentially describes a vector from each salient point to the reference point in a rotation invariant manner using the dominant orientation direction φj as basis.

Figure 6.1: The computation of the feature vector vj at a salient point Cj (shown in black). rj is the distance to the reference point Ω (shown in red); θj is the orientation of the direction vector to the reference point with respect to the dominant orientation at the salient point; φj is the dominant orientation of the salient point.

The SKS model function ψ, ψ: R³ → R, for an object model can now be constructed as follows:

$$\psi(v_i) = \max_{j} \; d_{ij} \exp\left(-(v_i - v_j)\, \Sigma\, (v_i - v_j)^{T}\right) \tag{6.1}$$

where j ranges over all the salient points in the model and dij is the local neighborhood


6.2 Object Matching

Consider a scene with M salient points P = {P1, P2, . . . , PM}, Pm = (Pxm, Pym), extracted using the Harris-Laplace interest point detector. Each salient point also has an associated scale Pσm and dominant orientation Pφm. Let the model with reference point Ω have J salient points C = {C1, C2, . . . , CJ}, Cj = (Cxj, Cyj), with each point having an associated scale Cσj and dominant orientation Cφj. Let Cvj be the associated feature vector at each salient point in the model. The goal of the matching algorithm is not only to recognize the presence of the model but also to localize it. Therefore, the matching algorithm tries to estimate not only the presence of the model reference point Ω in the scene but also its location, scale and orientation.

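We construct a 4-dimensional accumulator A(x), x = [x, y, s, φ], where s is the scale and φ is the orientation. Now, for each pair of salient points (Pm, Cj), we calculate the accumulator location to increment as follows:

$$s_{ref} = \frac{{}^{P}\sigma_m}{{}^{C}\sigma_j} \tag{6.2}$$

$$(x_{ref}, y_{ref}) = P_m + s_{ref}\, {}^{C}r_j \left(\cos({}^{C}\theta_j + {}^{P}\phi_m),\; \sin({}^{C}\theta_j + {}^{P}\phi_m)\right) \tag{6.3}$$

$$\phi_{ref} = {}^{P}\phi_m - {}^{C}\phi_j \tag{6.4}$$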

The accumulator is then incremented at all the points in the neighborhood of xref = [xref, yref, sref, φref]. The radius of the neighborhood (δ) is proportional to the scale sref. The increment value at a point x in the neighborhood of xref because of a salient point Pm is given by ψ(Pvm(x)). Pv(x), x = [x, y, s, φ], is the feature vector computed at each salient point in the scene assuming (x, y) as the reference point, normalized for scale using s and for rotation using φ.
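The vote location of equations 6.2 to 6.4 can be checked with a small sketch. It computes only the central cell xref, not the weighted increments over its neighborhood, and it uses the (rj, θj) definition from section 6.1. The names `Salient`, `model_feature` and `vote`, and all numeric values, are hypothetical.

```python
import math
from typing import NamedTuple, Tuple


class Salient(NamedTuple):
    x: float
    y: float
    scale: float
    orientation: float  # dominant orientation in radians


def model_feature(point: Salient, ref: Tuple[float, float]) -> Tuple[float, float]:
    """(r_j, theta_j): offset to the reference point, relative to the dominant orientation."""
    dx, dy = ref[0] - point.x, ref[1] - point.y
    return math.hypot(dx, dy), math.atan2(dy, dx) - point.orientation


def vote(scene: Salient, model: Salient, r: float, theta: float) -> Tuple[float, float, float, float]:
    """Accumulator cell [x_ref, y_ref, s_ref, phi_ref] from equations 6.2 to 6.4."""
    s_ref = scene.scale / model.scale
    x_ref = scene.x + s_ref * r * math.cos(theta + scene.orientation)
    y_ref = scene.y + s_ref * r * math.sin(theta + scene.orientation)
    return x_ref, y_ref, s_ref, scene.orientation - model.orientation


# Model point and reference point, then the same point seen in a scene that is
# scaled by 2, rotated by 90 degrees and translated by (10, 5) (hypothetical values).
c = Salient(3.0, 0.0, 1.5, 0.3)
r, theta = model_feature(c, (0.0, 0.0))
p = Salient(10.0, 11.0, 3.0, 0.3 + math.pi / 2)
x_ref, y_ref, s_ref, phi_ref = vote(p, c, r, theta)
```

In this example the model reference point (0, 0) maps to (10, 5) under the stated transform, and the vote cast by the matching scene point lands there with scale 2 and rotation π/2.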

The total accumulation, therefore, at every location in the accumulator, assuming we have a group of N models ψ1, ψ2, . . . , ψN, is given by:

$$A_n(\mathbf{x}) = \sum_{m=1}^{M} \psi_n\left({}^{P}v_m(\mathbf{x})\right) \tag{6.5}$$

Even though equation 6.5 is identical to equation 3.1, we do not actually compute the value at each and every point x in the 4-dimensional accumulator as this is of the order O(N⁴) where


Figure 6.2: Example Training Images from the MIT Books dataset.

prohibitive. Therefore, for each pair of salient points (one from the model and one from the scene), we calculate the accumulator position to increment using equations 6.2-6.4. This reduces the computational complexity to O(N²), where N is now the number of salient points.

The location of the peak of the accumulator gives the location of the model in the scene and the peak value indicates the confidence. The best matching model for a scene is given by:

$$\hat{\psi} = \arg\max_{n,\,\mathbf{x}} A_n(\mathbf{x}) \tag{6.6}$$
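Since votes are cast only at computed locations, the accumulator can be held sparsely. The sketch below is one possible arrangement, not the implementation used here: votes are quantized into cells with hypothetical bin sizes and stored in a dictionary, and equation 6.6 becomes a maximum over all models and occupied cells. The names `accumulate` and `best_model` and the vote values are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

Cell = Tuple[int, ...]


def accumulate(votes: Iterable[Tuple[Sequence[float], float]],
               bin_sizes: Sequence[float]) -> Dict[Cell, float]:
    """Sparse accumulator: add each vote's weight to its quantized cell."""
    acc: Dict[Cell, float] = defaultdict(float)
    for position, weight in votes:
        cell = tuple(int(p // b) for p, b in zip(position, bin_sizes))
        acc[cell] += weight
    return acc


def best_model(accumulators: Sequence[Dict[Cell, float]]) -> Tuple[int, Cell, float]:
    """Equation 6.6: the (model index, cell, value) of the highest peak over all models."""
    return max(((n, cell, value) for n, acc in enumerate(accumulators)
                for cell, value in acc.items()), key=lambda item: item[2])


bins = (4.0, 4.0, 0.5, 0.5)  # hypothetical cell sizes for x, y, s and phi
votes_a = [((10.0, 5.0, 2.0, 1.57), 0.9), ((10.5, 5.2, 2.1, 1.6), 0.8), ((40.0, 3.0, 1.0, 0.0), 0.7)]
votes_b = [((1.0, 1.0, 1.0, 0.0), 0.6)]
winner = best_model([accumulate(votes_b, bins), accumulate(votes_a, bins)])
```

Two consistent votes for the second model fall into the same cell and outweigh any single vote, so that model and cell are reported as the peak.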

6.3 Experiments and Results

In order to test the algorithm in recognition and localization tasks, we evaluated the performance using a standard database of books [26]. The dataset has 120 training images (shown in figure 6.2) of different book covers and 130 test images of these books placed in cluttered indoor scenes with drastic lighting, rotation and scale changes. The test was divided into two tasks. The first task was to just recognize the presence or absence of a book correctly in each of the test images. The second task was to not only recognize but also localize the position of the book in the scene. The best published results on this dataset can be found in [26].


Table 6.1: Recognition and Localization Accuracy using the MIT Books dataset. BOF=Bag-of-Features [59], ESS=Efficient Subwindow Search [25], ISM=Implicit Shape Model [23], CCL=Concurrent Classification and Localization [26]. Some of the results are reproduced from [26].

Algorithm    Recognition Accuracy (%)    Localization Accuracy (%)
SKS          97.69                       97.69
BOF          59.2                        -
ESS          85.3                        80
ISM          83.8                        82.3
CCL          85.3                        80


Figure 6.4: Example False Matches.

6.3.1 Robustness to Viewpoint Changes

The SKS algorithm is not designed to tolerate viewpoint changes. There is no concept of three dimensionality in the algorithm. The significant robustness to occlusion, however, does provide some tolerance to self-occlusion. Therefore, even though the algorithm is ill-suited to this application, we tested its robustness to viewpoint changes using the standard Coil-100¹ data set. The data set contains uncluttered images of 100 household objects. Each object has 72 views, with each view taken 5° apart, giving a total of 7200 images. Figure 6.5 shows some example images from the data set.

We used the frontal view (0°) of each object to build the model and the rest were used for testing. We, therefore, have a total of 100 models and 7100 test images. The goal was to measure the recognition accuracy at each viewing angle by matching each test view with all the models. A classification was considered to be correct if the highest match score was obtained when a test view matched with its corresponding model.

Figure 6.6 shows the variation of the classification accuracy with the viewing angle. We get around 70% accuracy even at ±45°, which indicates that the algorithm is quite robust to viewpoint changes. The overall recognition rate was 56.85%. The overall recognition rates vary from 50% to 87% in the literature [31][32]. The performance in this test lends credence to a view-centered representation, at least for non-articulated objects.

1


Figure 6.5: Example images from the Coil-100 Dataset


Chapter 7

Conclusion and Future Work

7.1 Conclusion

This dissertation has described a novel accumulative framework called the Simple K-Space (SKS) algorithm for representing, recognizing and localizing objects in a cluttered scene. The method was first used to describe simple shape contours using geometric features such as curvature and tangent angle. The algorithm was shown to perform as well as the current state of the art in translation, rotation and scale (zoom) invariance tests using a database of tanks, and to outperform it in occlusion robustness testing using a database of marine animals.

The algorithm was then extended to find feature correspondences between two images by building descriptors in small neighborhoods around the Harris-Laplace corner points. The method was shown to outperform SIFT-based methods. The algorithm was also successfully used to solve for a transformation matrix (both similarity and projective transforms) using non-linear least squares regression and RANSAC which were used to register two images taken from different viewpoints.


backs. This section enumerates some of the potential problems and drawbacks of the SKS algorithm which need to be addressed in the future.

1. One of the problems with the SKS algorithm now is that it does not perform well with objects having large intra-class variations and extreme rotations out-of-the-plane. The current state-of-the-art algorithms address this using some form of clustering where view-dependent features are clustered from multiple viewpoints or from models from the same object category (giving credence to a structural representation at the higher levels). One could potentially define the SKS model function using such an approach and thus, obtain better performance and robustness to viewpoint and intra-class variations.

2. The performance of the SKS algorithm primarily relies on the number of features extracted on the object of interest. The higher the number of features extracted, the better the performance, and vice versa. Thus, there is a need to understand when the algorithm will break down, particularly as the object to be detected gets smaller and smaller, since this reduces the total number of features extracted and therefore affects the reliability of the algorithm. In other words, we need to determine the minimum size of the object below which there are not enough features extracted for reliable recognition.


Bibliography

[1] P. V. C. Hough. Method and means for recognizing complex patterns. U.S. Patent 3069654, 1962.

[2] David Marr. Vision. W.H. Freeman and Co., San Francisco, 1982.

[3] D. Marr and H. K. Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B, Biological Sciences, 200(1140):269–294, Feb 1978.

[4] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115–147, 1987.

[5] HH Bulthoff and S Edelman. Psychophysical Support for a Two-Dimensional View Interpolation Theory of Object Recognition. Proceedings of the National Academy of Sciences, 89(1):60–64, 1992.

[6] S. Edelman and H.H. Bulthoff. Orientation dependence in the recognition of familiar and novel views of 3d objects. Vision Research, 32:2385–2400, 1992.

[7] Nikos K. Logothetis, Jon Pauls, and Tomaso Poggio. Shape representation in the inferior temporal cortex of monkeys. Current Biology, 5(5):552–563, May 1995.

[8] Shimon Edelman. Representation and Recognition in Vision. The MIT Press, Cambridge, Massachusetts, 1999.

[12] Chris Dance, Jutta Willamowski, Lixin Fan, Cedric Bray, and Gabriela Csurka. Visual categorization with bags of keypoints. In ECCV International Workshop on Statistical Learning in Computer Vision, 2004.

[13] Donald D. Hoffman and Manish Singh. Salience of visual parts. Cognition, 63(1):29–78, 1997.

[14] Dengsheng Zhang and Goujun Lu. Review of shape representation and description techniques. Pattern Recognition, 37(1):1–19, January 2004.

[15] K. Arbter, W.E. Snyder, H. Burkhardt, and G. Hirzinger. Application of affine-invariant fourier descriptors to recognition of 3-d objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):640–647, 1990.

[16] E. Belogay, C. Cabrelli, U. Molter, and R. Shonkwiler. Calculating the hausdorff distance between curves. Information Processing Letters, 1997.

[17] William J. Rucklidge. Efficiently locating objects using the hausdorff distance. International Journal of Computer Vision, 24(3):251–270, 1997.

[18] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE PAMI, 24(4), april 2002.

[19] Sadegh Abbasi, Farzin Mokhtarian, and Josef Kittler. Curvature scale space image in shape similarity retrieval. Multimedia Syst., 7(6):467–476, 1999.

[20] M K Hu. Visual pattern recognition by moment invariants. IRE Trans. Information Theory, 1962.


[22] Whoi-Yul Kim and Yong-Sung Kim. A region-based shape descriptor using zernike moments. Signal Processing: Image Communication, 16:95–102, 2000.

[23] Bastian Leibe, Ales Leonardis, and Bernt Schiele. Robust object detection with interleaved categorization and segmentation. International Journal of Computer Vision, 77(1-3):259–289, 2008.

[24] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 1:886–893 vol. 1, June 2005.

[25] Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.

[26] Tom Yeh, John J. Lee, and Trevor Darrell. Fast concurrent object classification and localization. Technical report, MIT Dspace [http://dspace.mit.edu/dspace-oai/request] (United States), 2008.

[27] A. C. Berg, T. L. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondences. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR, volume 1, pages 26–33 vol. 1, 2005.

[28] Jamie Shotton, Andrew Blake, and Roberto Cipolla. Multi-scale categorical object recognition using contour fragments. IEEE Transactions on Pattern Analysis and Machine Intelligence, Sept 2007.

[29] Yu-Gang Jiang, Chong-Wah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR ’07: Proceedings of the 6th ACM international conference on Image and video retrieval, pages 494–501, New York, NY, USA, 2007. ACM.

[30] M. Marszałek and C. Schmid. Spatial weighting for bag-of-features. Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, 2:2118–2125, 2006.

[33] Anita Pasupathy and Charles E. Connor. Responses to contour features in macaque area v4. Journal of Neurophysiology, 82(5):2490–2502, 1999.

[34] Anita Pasupathy and Charles E. Connor. Population coding of shape in area v4. Nature Neuroscience, 5:1332–1338, 2002.

[35] C. Harris and M.J. Stephens. A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference, pages 147–152, 1988.

[36] David G. Lowe. Object recognition from local scale-invariant features. In Proc. of the International Conference on Computer Vision ICCV, Corfu, pages 1150–1157, 1999.

[37] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. Int. J. Comput. Vision, 65(1-2):43–72, 2005.

[38] D. Lowe. Distinctive image features from scale-invariant keypoints. In International Journal of Computer Vision, volume 20, pages 91–110, 2003.

[39] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 27(10):1615–1630, 2005.

[40] D.H Ballard. Generalizing the hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.

[41] D. I. Perrett, M. W. Oram, and E. Ashbridge. Evidence accumulation in cell populations responsive to faces: an account of generalisation of recognition without mental transformations. Cognition, 67(1-2):111–145, July 1998.

[43] L. Xu, E. Oja, and P. Kultanen. A new curve detection method: randomized hough transform (rht). Pattern Recogn. Lett., 11(5):331–338, 1990.

[44] Olivier Ecabert and Jean-Philippe Thiran. Adaptive hough transform for the detection of natural shapes under weak affine transformations. Pattern Recognition Letters, 25(12):1411–1419, 2004.

[45] W. E. L. Grimson and D. P. Huttenlocher. On the sensitivity of the hough transform for object recognition. IEEE Trans. Pattern Anal. Mach. Intell., 12(3):255–274, 1990.

[46] Alberto S. Aguado, Eugenia Montiel, and Mark S. Nixon. Bias error analysis of the generalised hough transform. Journal of Mathematical Imaging and Vision, 12(1):25–42, Feb 2000.

[4
