Imperial College of Science, Technology and Medicine Department of Computing
Generative Methods
for Scene Association
with 2D Pairwise Constraints
Edward David Johns
Supervised by Prof. Guang-Zhong Yang
Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of the University of London and
Statement of Authorship
This thesis is submitted to the Department of Computing, Imperial College London, in fulfilment of the requirements for the degree of Doctor of Philosophy. This thesis is entirely my own work, and except where otherwise stated, describes my own research.
Edward Johns
Copyright
The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.
Acknowledgements
I would like to thank a few people who have been wonderfully supportive in making this thesis possible. Firstly, thank you to Prof. Yang for inviting me to join The Hamlyn Centre, and for funding my studies. I am very grateful for the freedom which I have been granted, to address my research with interesting approaches as I have seen fit myself. Secondly, thank you to all members of The Hamlyn Centre with whom I have worked and socialised, with particular mention to those who have worked on the mobile robot project with me: Jindong, Stephen, Javier, Charence, Alex, James and Salman. Finally, a special thank you to my parents, for encouraging me to pursue my interests, my passions, and this academic career, despite their minimal understanding of anything remotely related to my work. If I never did become a rock star, then achieving this PhD is a close second...
Abstract
This thesis is concerned with the task of efficiently recognising the particular instance of a scene depicted in a query image, with applications in robot navigation including loop closure, global localisation and topological navigation. Three novel frameworks are proposed, each based on learning scene models by tracking local features to form sets of landmarks. Recognition then proceeds by considering 2D constraints between pairs of local feature correspondences to efficiently approximate global scene geometry.
First, the inter-image and intra-image pairwise geometries are considered to reduce feature correspondences to a more succinct set for a RANSAC-based 3D geometry constraint. A Hough-transform voting scheme based on inter-image correspondences weakly prunes the set of correspondences, after which intra-image geometries constrain the relative image positions of correspondences to eliminate unrealistic configurations. This idea is first proposed in an image retrieval application, and then extended to scene recognition where relative landmark positions are learned explicitly per scene.
Second, a method is introduced to embed 2D pairwise geometry directly in an inverted index, to allow for fast scene recognition without 3D estimations. A set of discrete geometric words are extracted for a query image, and passed through the index to find examples of such pairwise configurations in the database. A global geometry constraint is then proposed by considering a maximum-clique approach to an adjacency matrix of correspondences.
Third, a global topological localisation system is investigated which learns a naive Bayesian network for each landmark, to efficiently approximate global geometry without a fully-connected model. Long-term robot navigation is then addressed by learning scene models in an incremental manner, and updating the dynamic properties of landmarks accordingly. Experiments were performed on a new challenging dataset obtained by manually walking along a 7km path in a park and urban district, to capture long-term effects over an 8 month period.
Acronyms
2D 2-Dimensional
3D 3-Dimensional
AP Average Precision
BOW Bag Of Words
GC Geometric Cliques
GPS Global Positioning System
IDF Inverse Document Frequency
MAP Mean Average Precision
PCA Principal Components Analysis
PPV Pairwise Probabilistic Voting
PLV Probabilistic Landmark Voting
R@1 Recall at 100% precision
RANSAC RANdom SAmple Consensus
RR Recognition Rate
SIFT Scale-Invariant Feature Transform
SLAM Simultaneous Localisation and Mapping
SVM Support Vector Machine
TF Term Frequency
Nomenclature
q A query image
s A database scene
u A local image feature
v A different local image feature appearing in the same image as u
x A scene landmark
y A different scene landmark appearing in the same scene as x
wuv The feature co-occurrence of u and v
zxy The landmark co-occurrence of x and y
πu The visual word of feature u
φu The geometric word of feature u
ρu The position of feature u
σu The scale of feature u
θu The orientation of feature u
δuv The distance between features u and v
ψuv The angle between features u and v
σuv The scale ratio between features u and v
θuv The orientation difference between features u and v
Πx The visual wordset of landmark x
Φxy The geometric wordset of landmark co-occurrence zxy
Px The position range of landmark x
Σx The scale range of landmark x
Θx The orientation range of landmark x
Ψxy The angle range between landmarks in landmark co-occurrence zxy
∆xy The distance range between landmarks in landmark co-occurrence zxy
Contents
Statement of Authorship
Copyright
Acknowledgements
Abstract
Acronyms
Nomenclature
1 Introduction
1.1 Motivation
1.2 Scene Association
1.3 Contributions
1.4 Summary of Results
1.5 Thesis Outline
2 Background
2.1 Image Features
2.1.1 Global Features
2.1.2 Local Features
2.2 Bag Of Words
2.3 Geometric Constraints
2.4 Scene Recognition
2.4.1 Image Clustering
2.4.2 Topological Localisation
2.5 Evaluation Metrics
3 Image Retrieval with 2D Geometric Constraints
3.1 Introduction
3.1.1 Generating Candidate Feature Correspondences
3.1.2 Geometric Verification of Feature Correspondences
3.2 Dataset
3.3 The Visual Dictionary
3.4 Inter-Image Geometry
3.5 Intra-Image Geometry
3.5.1 Affine Transformation
3.5.2 Epipolar Geometry
3.5.4 Adjacency Matrix for Outlier Detection
3.5.5 Weighted Adjacency Matrix for Biased Sampling
3.6 Experiments
3.6.1 Experimental Procedure
3.6.2 Precision-Recall
3.6.3 Computational Time
3.7 Conclusions
4 From Image Retrieval to Scene Recognition
4.1 Introduction
4.2 Subscenes and Compound Images
4.2.1 Localising Landmarks
4.3 Generative Intra-Image Geometry
4.3.1 Avoiding Overfitting
4.4 Clustering subscene images
4.5 Recognition
4.6 Experiments
4.6.1 Clustering
4.6.2 Experimental Procedure
4.6.3 Competing Methods
4.6.4 Results
4.7 Conclusions
5 Embedding Geometry in the Inverted Index
5.1 Introduction
5.2 A Geometric Dictionary
5.3 Pairwise Probabilistic Voting
5.4 The Index Structure
5.5 Parameter Learning
5.6 Geometric Cliques for Global Consistency
5.7 Informative Triplet Selection
5.8 Min-Hash
5.9 Experiments
5.9.1 Experimental Procedure
5.9.2 Results
5.10 Conclusions
6 Global Topological Localisation and Incremental Learning
6.1 Introduction
6.2 Topological Localisation
6.2.1 Probabilistic Localisation
6.2.2 Incremental Learning in Dynamic Environments
6.3 The Dataset
6.4 The Scene Model
6.5.1 Landmark observation probability
6.5.2 Defining the image evidence
6.6 Implementation
6.7 Incremental Learning
6.7.1 Static Scene Parameters
6.7.2 Dynamic Scene Parameters
6.8 Experiments
6.8.1 Experimental Procedure
6.8.2 Results
6.9 Conclusions
7 Conclusions
7.1 Future Work
Bibliography
List of Tables
4.1 Summary of results for all four competing methods
5.1 Summary of recognition results for all implementations
6.1 Summary of results for Global Localisation with 5 training tours
List of Figures
1.1 Different levels of granularity in scene recognition, ranging from high-level classification to low-level instance recognition. This thesis focuses on the lowest level, i.e. identifying “My House”.
1.2 Scene Association uses generative methods to learn pairwise relationships between landmarks
1.3 Challenges for scene recognition
2.1 SIFT features
2.2 Candidate local feature matches based on visual word assignments
2.3 The cosine similarity between a query image and a database image is an efficient way to weakly determine image similarity
2.4 The dictionary can also be used to efficiently generate candidate feature correspondences between two images
2.5 Generating feature correspondences between images starts by finding candidate correspondences based on feature descriptors, and proceeds through a 3D relationship via either a homography or epipolar geometry
3.1 Illustration of inter-image and intra-image geometries. Circles represent local features, and lines represent geometric relationships between features. The image pair of (a) and (b) both contain a set of local features which form correspondences between the images. (c) shows the inter-image geometries as the difference in feature location of correspondences across the two images. (d) and (e) show the intra-image geometries as the difference in feature locations of all features within each image. The inter-image geometries are much faster to compute, but the intra-image geometries offer a more rigid constraint on acceptable feature configurations.
3.2 The image dataset used in Chapters 3, 4 and 5. 50 scenes of famous buildings are each represented by 500 images. Here, one of these images is shown per scene
3.3 A random sample of 50 “White House” images, out of a total of 500, present in the dataset used in Chapters 3, 4 and 5.
3.4 5 query images and 15 database images for the “Eiffel Tower” scene, as part of the dataset used in Chapters 3, 4 and 5. In total, each of the 50 scenes has 500 images, from which 5 query and 15 database images of the same rigid body were selected for evaluation.
3.5 Training images for the visual dictionary
3.6 The top 5 most likely visual words, with associated prior assignment probabilities
3.7 A sample of visual words across the whole dictionary with associated prior assignment probabilities
3.8 Fixed parameters for transformation hypothesis voting often fail when there exists a large scale or rotation between two images. In each row, the two images on the left show two feature correspondences, and the image on the right shows the transformation hypotheses. The black square represents a hypothesis of zero translation, whilst the red and green squares represent the hypotheses based on inter-image translation of the red and green correspondences.
3.9 A parameter-free solution to generating inter-image constraints
3.10 The effect of the inter-image geometry stage is to reduce the feature correspondences to a more consistent set, whose correspondences all agree in x- and y-translation, scale ratio and orientation difference.
3.11 The affine transformation can be problematic when the scene is non-planar. Here, the red correspondences form the model, and the green correspondences are inliers to this model, although the correspondences are false.
3.12 The epipolar constraint can accept false positive correspondences as the constraint to the entire length of an epipolar line is somewhat weak. Here, red correspondences form the model, whilst green correspondences fit the model but are in fact false positives.
3.13 Intra-image geometries
3.14 The affine transformation constraint between correspondences across two images is helped by use of the intra-image geometries. In (a), the 3 blue correspondences are the samples for the best estimation of the model parameters, and the green correspondences are all those which agree with the samples based on the binary adjacency matrix B. In (b), those correspondences which do not agree with the samples are shown in red. Finally, the returned set of inlier correspondences is shown in (c), all of which are consistent with the samples based on the intra-image geometries.
3.15 The epipolar constraint between correspondences across two images is helped by use of the intra-image geometries. In (a), the 8 blue correspondences are the samples for the best estimation of the model parameters, and the green correspondences are all those which agree with the samples based on the binary adjacency matrix B. In (b), those correspondences which do not agree with the samples are shown in red. Finally, the returned set of inlier correspondences is shown in (c), all of which are consistent with the samples based on the intra-image geometries.
3.16 The correspondence scores defined by β are a much better reflection of geometric consistency than those defined by α. In (a), some true correspondences are assigned low scores, and some false correspondences are assigned high scores. In (b), due to the stronger consideration of global consistency, the correspondences have been more accurately divided into a set of inliers and outliers. Green represents a high score, and red a low score.
3.17 Mean average precision for our method with a range of intra-image thresholds, together with two baselines.
3.18 Precision-Recall curves for varying thresholds on the intra-image constraints. Each curve represents a different percentile of the geometries measured from the inlier dataset.
3.19 Precision-Recall curves for our method (with the intra-geometry threshold set at 0.95) and the two baselines.
3.20 Average query time for each scene, for our method with a range of intra-image thresholds, together with the two baselines.
3.21 Overall average query time over the three implementations.
4.2 Estimating landmark positions within the central image q̂ (= q5). Circles in the query images qi represent features that have been tracked across the subscene’s training images. Internal landmarks (red and blue) are positioned at their original location in the central image. External landmark (green and yellow) positions are estimated via a homography Hi5 between each image and the central image, and the mean is taken across the feature track. The resulting compound image qr consists of the set of all landmarks, shown as squares, and their positions.
4.3 A typical compound image for a subscene, reflecting landmark positions and observation probabilities.
4.4 In order to learn the generative intra-image geometries for a subscene, all training images must be aligned with the subscene’s central image. This is achieved by first scaling and rotating the training images accordingly, and then “pivoting” the resulting image with respect to one of the features in the central image, denoted the pivot feature. (a) shows the central image, and (b), (c) and (d) demonstrate the alignment process. In (b) and (c), the pivot feature is the red feature, whereas in (d), the training image does not contain this feature, and thus pivoting is via the blue feature. In (e), the resulting range of landmark positions are shown based on these alignments.
4.5 Generative intra-image landmark geometries are represented by x−y ranges in the compound image. The original ranges in (a) are adjusted in (b) and (c) to avoid overfitting.
4.6 The proposed clustering algorithm to generate a set of subscenes. (a) and (b) represent the initial and final configuration, respectively. In (c), image E is selected as a central image for the red subscene, and (d) shows all training images for that subscene greyed out. From the remaining images, a central image is chosen at image G in (d), together with its neighbouring images. In (f), the only remaining image to incorporate in the model is I, which is assigned to a central image in (g), and forms a subscene with image F. Finally, (h) shows that all images, and hence the full range of viewpoints expressed in the dataset, have now been included in the model.
4.7 When considering the geometric compatibility of correspondences, the x−y distance between query features must be satisfied by the distance between landmark positions. The distance between landmark positions itself is a range specified by the individual x−y ranges of each landmark.
4.8 The 11 central images, one for each subscene, in the “Trevi Fountain” scene.
4.9 One subscene in the “Trevi Fountain” scene, with the central image surrounded by all other images that form the subscene
4.10 The effect of scaling the x−y landmark image ranges on average precision
4.11 Precision-recall performance for all four competing methods
5.1 A comparison of the pairwise voting method with the standard single feature voting method, both using an inverted index.
5.2 Notation for pairwise geometries.
5.3 The index structure used to search for instances of word triplets in the database.
5.4 Learning a deep probabilistic model of word triplets. The dark shade is the observed word, and the lighter shades are the associated alternative words. Computing the likelihood of the triplets represented by the yellow shades involves consideration of both the alternative words, and the dependencies between each word in the triplet.
5.5 Pairwise matches between the two images may be locally consistent, but when compared against all others, there may be inconsistency. The red and blue pairs are globally consistent with each other across the two images, whereas the green pair is only locally consistent with itself.
5.6 Global consistency of geometric pairs is addressed by considering a fast maximum-clique search of matrix B. In (a), three training images are presented, with sets of coloured pairwise features. In (b), a query image is presented, whose pairs are coloured to represent pairwise correspondences with those in the training images. The matrix B is then formed based on compatibility of each pair. In (c), the evolution of B to a maximum clique matrix is shown by recursively eliminating the worst pair until the matrix is devoid of any 0’s. Here, a black element represents a 0 in the matrix, and white represents a 1.
5.7 A simple example of the min-hash algorithm.
5.8 The effect of the number of geometric words per geometry on average precision and recognition time
5.9 The effect of the percentage of word triplets stored in the index
5.10 Precision-recall performance of all methods
6.1 Topological localisation divides the map into discrete locations. This chapter deals with a topology with a single path, as above, whereby a robot navigates by moving between adjacent scenes.
6.2 Different graphical models for scene recognition. Circles represent landmarks, and lines represent relationships between landmarks that are built into the model.
6.3 Half of the path represents scenes from a park, where strong long-term dynamics are present due to seasonal effects on foliage. Even over a short time period, the natural deformability of leaves and branches can cause problems when matching a query image directly to a database image.
6.4 Half of the path represents scenes from an urban environment. Long-term dynamics exist due to building renovations, together with short-term dynamics and occlusions from cars and pedestrians, and dramatic illumination variations.
6.5 The dataset contains significant lengths of repeatable scenes which can cause problems due to perceptual aliasing.
6.6 A sample of adjacent images from the path demonstrating the map density. Each row shows the three images representing a single location in the map.
6.7 Geometries that represent each landmark and landmark co-occurrence in a scene, learnt in a generative manner from a set of training images.
6.8 The progression of a scene model as images from each tour are incorporated into the scene’s training set. The three small images are the new training images from the latest tour, and the large image is the scene model, with landmarks represented by their observation probability p(x|s).
6.10 The proposed graphical model for calculating the probability of observing evidence Eux given that feature u is an observation of landmark x. Blue nodes represent observed variables (features), red nodes represent the associated underlying latent variables (landmarks), and green nodes represent auxiliary latent variables used in the probabilistic model (evidences).
6.11 When matching a given query feature u to the database of landmarks, there are four types of evidence variables. eu is the evidence provided by u alone. Similarly, for all co-occurring features v, ev is the evidence provided by each v alone. euv is then the evidence provided by co-occurrence wuv. Finally, Eu is the combination of eu, all ev’s, and all euv’s
6.12 Determining the likelihood of observing evidence for landmark x̄, given that the observation is of landmark x. This example illustrates calculation of the scale variable in Equation 6.14, but the same technique is applied for all appearance and geometry terms based on Equation 6.13.
6.13 Implementation of feature-to-landmark correspondences
6.14 The likelihood of observing co-occurrence data D given a prior on the co-occurrence probability θ, where D describes n co-occurrence observations of both x and y, out of k total observations of x.
6.15 The prior distribution over co-occurrence probabilities, modelled as a beta distribution, and computed by considering a separate training set and counting landmark co-occurrence rates.
6.16 The posterior co-occurrence probability θ given the observation D, where D describes n co-occurrence observations of both x and y, out of k total observations of x.
6.17 Calculation of landmark presence α, the likelihood that a landmark is still in the map, is a function of both the landmark observability and the number of sequential images for which the landmark was not observed.
6.19 Global localisation with RANSAC
6.20 Global localisation without RANSAC
6.21 Global localisation with RANSAC
6.22 Example case when our method correctly recognises a query image, but the Chow-Liu method fails. Our method is able to filter out the long-term dynamic features on the trees.
6.23 Example case when our method fails due to the entire scene being dynamic, with few stable local features detected across the training images. The baseline image retrieval method also fails with this query.
Chapter 1
Introduction
Computer vision has become one of the fastest-growing areas of computer science research over the last thirty years. Our sense of familiarity with vision and images, and their relationship with our own understanding of the world around us, lends a natural attraction to developing computer vision systems that can interpret images with human-level understanding. Whilst many applications of such research are drawn from emulating our own uses of visual information, such as navigation, object detection and recognition, and interaction with the environment, many more uses are continually being developed that extend even beyond typical human capacities, such as 3-dimensional (3D) reconstruction, image restoration, and rapid organisation of image collections. With the arrival of powerful machine learning methods in recent years, computer vision is set to continue its stature as a highly-valuable area of research in years to come as we push towards artificial systems with a believable sense of intelligence.
This thesis focuses on the computer vision area of scene instance recognition. Given a query image, the task is to identify the particular instance of the scene depicted in the image, with labels such as “Tower Bridge”, “Hyde Park”, or “My Bedroom”. This is closely related to, and a sub-category of, the broader field of object recognition, but in our case the entire image represents the object of interest, rather than the object occupying a small localised window within the image. Scene instance recognition is a distinctly different challenge to that of scene classification, which aims to identify the semantic category of the scene, with labels such as “Bridge”, “Park” or “Bedroom”. In fact, there is a continuous scale in the granularity of scene recognition, ranging from very high-level labels of general categories, right through to labels of individual instances, and even particular viewpoints of those instances. The challenge with the higher-level cases is that scenes within one category may vary greatly in appearance whilst still being representative of the same class of scene, whilst the challenge with the lower-level case is that the number of classes to match a query image to can be vast. Figure 1.1 illustrates the varying types of recognition tasks, with this thesis focusing on the lowest level, that of instance recognition.
1.1 Motivation
One of the key challenges for any autonomous mobile robot is that of self-localisation. At any given time, the robot must be able to determine its location within a map to enable it to make appropriate decisions about how to navigate to a target destination. When this map has not already been provided, the robot must be able to build the map at the same time as localising itself within that map, a task known as Simultaneous Localisation and Mapping (SLAM). Given the imperfect nature of sensors in practice, a probabilistic approach is typically adopted, whereby both the robot’s location and the locations of observed points in the environment are estimated with a degree of uncertainty. Within this SLAM framework, there are several components which all interact to yield the overall system, and one of these components is known as loop closure. When the robot revisits a location in space that it has already built into the map, we can use that constraint to decrease the uncertainty of the locations of all environment points in the map, in a process known as bundle adjustment. If the map is already built, and the structure is topological with locations in discrete rather than continuous space, then loop closure can also be considered as a global localisation task. Here, qualitative appearance-based methods are used to recognise the location, rather than geometric methods as used in metric SLAM to estimate the location within a continuous coordinate system. Appearance-based localisation can then be used to navigate within the pre-built topological map, or assist in the localisation of the robot when it is first initialised and has no knowledge of its global location. It is this global topological localisation challenge, using appearance-based methods, which motivates the work in this thesis. Whilst many developments in SLAM research deal with the map building itself, or navigation strategies within a map, this thesis focuses largely on scene recognition and localisation within that map, and as such the work is largely a computer vision study rather than a robotics study, but with a consideration of practical applications.
Indoor scene Urban scene Rural scene
Street Building Car park
Offices House Shop
Terraced house Detached house Semi-detached house
47, Park Road My house 12, Green Lane
Figure 1.1: Different levels of granularity in scene recognition, ranging from high-level classification to low-level instance recognition. This thesis focuses on the lowest level, i.e. identifying “My House”.
Within the field of robotics, there are several examples of systems that require scene recognition capacities, each with their own individual challenges and goals. At the smaller scale, we have the challenge of localising surgical robots and endoscopes within the body. At a medium scale, we have mobile assistive robots for carrying out tasks in hospitals, warehouses and factories. Then at the larger scale, we have autonomous vehicles navigating along roads of thousands of kilometres. Whilst the motivation for this thesis is from a robotics perspective with these systems in mind, the applications of scene recognition are broad. For example, in recent years, the growing sophistication of smart phones has led to great interest in recognition of buildings with a focus on the tourism and consumer industries. The advancement of computational power has also led to sophisticated 3D reconstruction engines that require scene recognition components to construct a 3D model of that scene. All of these applications have their own uses of the work in this thesis, and the methods that will be discussed have a wide variety of uses across several fields of computer vision.
1.2 Scene Association
We can now define scene association as the task of learning a model for the visual properties of a particular scene, from a given set of training images representing that scene, and subsequently recognising the scene given a new query image, from a large database of candidate scenes. In this thesis, we propose to address this problem by building generative models of scenes from the set of training images, and incorporating 2D pairwise constraints into those models to enable fast and reliable recognition. Whilst discriminative methods have proved to be popular for object classification, generative methods are much more suited to tasks where the underlying scene is easily modelled in terms of rigid geometric relationships. Discriminative methods typically require a feature vector, and as such imposing constraints based on geometric relationships is not possible with abstract vectors. Furthermore, generative methods allow for new scene models to be easily introduced into a database without having to retrain the entire system.
One of the core concepts that will be introduced is that of modelling real-world landmarks as the underlying causes of feature observations in an image. By tracking local features across several images, we show that generative models can be learned of these landmarks, and the relative geometries of pairs of landmarks can be learned to constrain the relationship between observed features in a query image, and modelled landmarks in the database. Figure 1.2 illustrates the notion of pairwise constraints that we will be referring to regularly. Figure 1.3 then highlights some of the core challenges that will be dealt with in this thesis.
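As a minimal sketch of the pairwise relationships referred to above, the snippet below computes the four pairwise quantities listed in the Nomenclature (distance, angle, scale ratio and orientation difference) for two features in the same image. The `Feature` class, the function names, and the choice to measure the angle of the line from u to v against the image x-axis are illustrative assumptions, not definitions taken from this thesis.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Hypothetical local feature: image position, scale and orientation (radians)."""
    x: float
    y: float
    scale: float
    orientation: float


def wrap_angle(a: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def pairwise_geometry(u: Feature, v: Feature) -> dict:
    """Return the pairwise geometry between features u and v.

    Assumption: the angle is that of the line from u to v,
    measured against the image x-axis.
    """
    dx, dy = v.x - u.x, v.y - u.y
    return {
        "distance": math.hypot(dx, dy),
        "angle": math.atan2(dy, dx),
        "scale_ratio": v.scale / u.scale,
        "orientation_diff": wrap_angle(v.orientation - u.orientation),
    }


# Usage: two features in one image
u = Feature(x=0.0, y=0.0, scale=2.0, orientation=0.0)
v = Feature(x=3.0, y=4.0, scale=4.0, orientation=math.pi / 2)
geometry = pairwise_geometry(u, v)
```

A constraint on such a pair would then amount to checking whether each of these values falls inside a range learned for the corresponding pair of landmarks.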
On a contextual note, the use of local features is not the only available ideology for image matching and scene recognition. Whilst local features offer well-localised portions of an image to be extracted which are stable over wide viewpoints, employing them typically discards the majority of information embedded in an image. Furthermore, this approach seems unnatural with respect to the human capacity for visual recognition and navigation, whereby it seems natural that more global signatures in the image are combined to yield an output. Such an ideology in computer vision typically includes computing global colour and texture statistics across the entire image, and passing them through complex machine learning algorithms to model each scene. However, their performance is as yet dramatically inferior, both in speed and recognition performance, to those methods using local features. In years to come, it is conceivable that the advancement of machine learning will drive a shift in mentality and such global features will be employed with encouraging results. However, for now, and in this thesis, the use of local features offers the best performance, and in particular, instance recognition benefits greatly from their repeatable, discriminative nature.
Figure 1.2: Scene Association uses generative methods to learn pairwise relationships between landmarks
1.3 Contributions
This thesis provides three key technical contributions. First, in Chapters 3 and 4, the inter-image and intra-image pairwise geometries are considered to reduce the correspondences to a more succinct set for a RANSAC-based global 3D geometry constraint. A Hough-transform voting scheme based on inter-image correspondences allows for fast estimation of image scale and orientation relationships, and intra-image geometries then constrain the relative image positions of correspondence pairs to eliminate unrealistic 2D configurations. This idea is first proposed in an image retrieval application, and then extended to scene recognition whereby training images are clustered into groups of similar viewpoints, and pairwise constraints between landmarks are learned explicitly for each cluster. Furthermore, feature appearances in the Bag Of Words framework are learned by considering a generative approach to soft assignment in discrete feature space. Experiments are carried out on a dataset of images acquired from online image-sharing websites to represent scenes from a wide range of viewpoint, illumination and occlusion conditions. Second, in Chapter 5, a method is proposed to embed 2D pairwise geometry directly in an inverted index, to allow for very fast scene recognition without the need for costly 3D estimations. By discretising landmark pairs in both appearance and geometry, a set of discrete spatial words is extracted for a query image, and passed directly through an inverted index tree to find examples of such pairwise configurations in the database. A global geometry constraint is then approximated by considering a maximum-clique approach to an adjacency matrix of correspondence pairs for each scene, to find a set of correspondences which agree with all others in the set in terms of pairwise relationships. Third, in Chapter 6, a global topological localisation system is investigated which learns a naive Bayesian network for each landmark, to efficiently approximate global geometry without the complications of a fully-connected graphical model. Long-term robot navigation is then addressed by learning scene models in an incremental manner as data from further tours of a path is acquired, and the dynamic properties of landmarks are updated accordingly. Filtering is then included which allows for a probabilistic localisation model incorporating both appearance and geometry, by tracking individual landmarks and accumulating votes for each landmark independently. Experiments are performed on a new challenging dataset obtained by manually walking along a 7 km path in a park and urban district, with several tours over a period of 8 months to capture long-term dynamic changes in scene appearance.
(a) Illumination
(b) Viewpoint
(c) Scale
(d) Occlusion
1.4 Summary of Results
Each of the four main chapters presents work that improves on baselines and relevant state-of-the-art methods. In Chapter 3, we show that the use of intra- and inter-image geometries allows RANSAC algorithms to converge more efficiently, with fewer false positive feature correspondences. Then, in Chapter 4, applying this theory to generative models of scenes allows for more accurate scene modelling than competing methods and hence better recognition performance. Chapter 5 shows that embedding word triplets in an inverted index is far faster than BOW plus RANSAC approaches to recognition. Finally, and perhaps most notably of all, the topological localisation in Chapter 6 introduces the first method to incorporate both appearance and geometry in a probabilistic model for localisation, improving localisation performance relative to appearance-only approaches.
1.5 Thesis Outline
The outline of the thesis is as follows. In Chapter 2, a literature review is presented, together with background theory on image features, image retrieval, instance recognition and appearance-based localisation. Chapter 3 introduces a new approach to image retrieval, with improvements in both the appearance-based filtering and geometric verification. Chapter 4 then extends this work to the task of scene instance recognition, whereby training images are fused to form a single model for each scene of interest. In Chapter 5, the embedding of geometry into an inverted index is investigated to speed up the recognition algorithm, foregoing 3D estimations and focussing on fast 2D geometry. Chapter 6 then specialises the case of scene recognition in a topological, appearance-based localisation application, for long-term navigation in dynamic environments.
Background
In this chapter, the key concepts in scene instance recognition are introduced, together with a literature review of the main contributions to this and closely-related fields. We begin by considering how image features are extracted from an image to describe its appearance and structure, before it can be processed by any recognition engine. Then, we discuss how two images can be matched very efficiently using statistics purely based on an image’s appearance. Following this, we introduce image structure and discuss efficient methods for applying strong geometric constraints between images. The process of learning scene models from a set of training images is then discussed. We then discuss how a qualitative robot localisation framework applies scene recognition for practical navigation applications. Finally, the evaluation metrics to be used in this thesis are presented.
2.1 Image Features
The first stage in most computer vision techniques, after image pre-processing, is the representation of an image in a more useful form than simply the raw array of pixels, capturing the important elements of the image and discarding those which offer little information. For the task of scene recognition, this is important both for understanding the image content and hence the similarity between two images, and for allowing for an efficient recognition algorithm which operates on the minimum data necessary to achieve reliable results. Image features can be defined as either global or local in nature, depending on the region within an image of which the feature is representative.
2.1.1 Global Features
Global features are those which consider the entire image holistically, without particular focus on any one image region. The two most commonly used global properties are colour, and edge direction, with both being very easy to extract from an image and revealing a great deal of information about the image’s content. In essence, it can be considered that every image is simply an arrangement of edges of varying intensity, with the gaps filled by uniform regions of varying shades of colour, and hence these two image properties have proved popular for decades due to their semantic simplicity.
Colour histograms [139] and histograms of edge orientations [12,157] are a popular representation of image content due to their ease of implementation and conceptual simplicity. Edge or colour histograms of two images are then compared to yield a notion of global image similarity. In [10], colour histograms were used to recognise outdoor scenes, using normalisation techniques to provide invariance to dramatic illumination changes. Combining both colour and edge into a histogram then provides a multi-domain approach, such as applying a colour histogram to detected edges [129] or modifying the spatial binning for each colour to better reflect the underlying texture [119]. Studies have also been undertaken to improve the efficiency of image-based histogram matching techniques [38] and to more effectively reflect the human-level understanding of colour when extracting colour from an image [137,152].
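As an illustration of the histogram comparison described above, the following minimal sketch builds a normalised joint colour histogram and compares two images by histogram intersection. The function names, the number of bins per channel and the choice of intersection as the similarity measure are assumptions made for this example only; the cited works use a variety of colour spaces and distance measures.

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Normalised joint RGB histogram of an H x W x 3 uint8 image.

    The number of bins per channel is an assumed, illustrative parameter.
    """
    pixels = image.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalised histograms."""
    return float(np.minimum(h1, h2).sum())
```

Two images with identical colour statistics score 1, and images sharing no occupied bins score 0, regardless of where in the image the colours appear.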
Biologically-inspired approaches have also been popular in combining colour and texture into a single image descriptor [132,154]. Comparing the relative intensities of neighbouring pixels has been shown to be effective in describing local image structure with invariance to global illumination effects [150]. The spectral properties of images, reflecting a summary of texture in the frequency domain, can also aid general classification of a scene at high semantic levels [105].
One of the key strengths of global features is their efficiency in extraction. It is far quicker to extract colour and texture information holistically and describe an image with a single descriptor, than having to compose individual descriptors for local regions, of which there can be thousands in any one image. Another strength is the tolerance of global features to image noise and resolution [142] due to the effective smoothing out of high-frequency signals over the entire image.
However, the ability to differentiate between images weakens dramatically when a query image is compared to a large database, because the features are high-level representations and do not focus on the low-level, discriminative details that make each individual image unique. Whilst a high-level representation can be sufficient for category classification [151], instance recognition requires more rigid constraints to differentiate between objects or scenes within the same class, and global features are typically restricted to small-scale tasks [86]. A second problem faced by global descriptors is their sensitivity to occlusions. If a part of a scene is covered by a body not representative of the scene, then the entire descriptor will be affected; for two scenes to be assigned a similar descriptor, typically the entire scene needs to be free of occlusions. However, if several localised features are extracted from the image, then occlusions will only affect some of these features, and matching each remaining local feature is still possible. Due to these two issues, in recent years, large-scale instance recognition tasks have typically been addressed by the use of local features, which offer more discriminative descriptors and tolerance to occlusions.
2.1.2 Local Features
Local features are those which are localised within the image, and hence in addition to the feature descriptor, local features can be assigned geometric properties such as location and scale. Together with their discriminative power and tolerance to occlusions as discussed, their localised nature means that many such features can be matched between two images, and consequently the spatial relationships between these local features add an informative cue that is unavailable with global features. Furthermore, locating specific elements in an image generates greater semantic awareness of particular points in 3D space, which is a useful cue for tasks such as robot interaction with objects [123] and 3D reconstruction of a scene [2].
If the feature itself is a general shape or texture, and the aim is to detect that feature within an image, then the use of sliding windows has proved popular for object detection. Here, the feature descriptor is computed for a set of windows, each representing a different position and scale within the image, and the descriptor is compared to a template model that is being searched for. For example, human body detection was addressed in [33] by computing histograms of orientated edges within the sliding window, and comparing the overall descriptor to that of a typical body learned from training images. By building up statistics of primitive pixel configurations, and learning discriminative classifiers using boosting techniques, the sliding window approach has achieved state-of-the-art results in human face recognition [145] and object classification [130,42].
Image edges have historically been a popular local feature due to their intuitive nature and speed of detection, and their invariance across viewpoint. A straight edge will still appear straight from whatever angle it is viewed, whereas edges of an arbitrary geometry are highly sensitive to such effects. The seminal work in [15] presented a simple example of edge detection by passing a filter through an image, a technique which still forms the foundations for many modern edge-detection algorithms. This was taken a stage further in [80] by incorporating an automatic technique for detecting the scale of an edge, such that an image could be represented as a number of edges, all at varying scales. In [71], a scene was described by its number of, and distance between, vertical edges, creating a one-dimensional string across the image. Error-tolerant string-matching algorithms were then applied to find the closest match in a database. [146] grouped together lines in clusters and matched images captured from wide angle differences by finding similar clusters across two images. [37] applied more rigorous constraints, by considering the geometric transitions of lines from one image to the next.
Keypoint-based invariant features
The use of primitive features such as edges is limited by both the descriptive power of the feature, and the ease by which two features can be matched over varying conditions. As such, the dominant local feature type in recent years has become the keypoint-based invariant feature, which offers very descriptive information over a small image region, and a well-localised image position to allow for image matching or alignment based on the relative geometries of feature correspondences. The first stage is to detect a set of keypoints in an image [93], each representing a well-localised and repeatable shape, typically a corner [46]. For applications such as metric SLAM [34], an image patch surrounding the keypoint is extracted [128], but for greater robustness in general recognition tasks, keypoints are assigned a scale [80] depending on the size of the detected corner. Once a keypoint is detected, the second stage is then to assign a descriptor to the feature [91]. The choice of keypoint detector and feature descriptor is important when designing a recognition system with an appropriate compromise between speed and robustness, and each has its own strengths and weaknesses [95].
Perhaps the most widely used local invariant feature in recent years is the Scale-Invariant Feature Transform (SIFT) feature proposed by David Lowe in 1999 [82] and extended in 2004 [84], which offers robustness to scale, orientation, illumination, and small viewpoint changes. The feature detection technique involves blurring the image at a range of scales, and subtracting adjacent scales, to form a sequence of Difference of Gaussian (DOG) images. Intensity peaks in these images then represent keypoints at the scale of the particular DOG image. Keypoints which are located on an edge, and hence are unstable and poorly localised, are eliminated. Describing each keypoint then commences by calculating the dominant orientation of the feature, and assigning a window around the feature in line with this orientation, and of size proportional to the feature’s scale. Histograms representing the local edge directions are then computed, to form a 128-dimensional vector to describe the feature’s structure, with smoothing applied to the histogram to enable tolerance to viewpoint and imperfect keypoint localisation when pixels may move between adjacent bins. Finally, the vector is normalised to allow for illumination invariance. Figure 2.1(a) demonstrates the detected keypoints and scales for a set of SIFT features, and Figure 2.1(b) illustrates the feature descriptor calculated with respect to the keypoint’s scale and orientation.
(a) Keypoints localised in an image, each with an associated scale and orientation
(b) SIFT descriptor based on image gradients and keypoint scale and orientation
Figure 2.1: SIFT features
Since the introduction of SIFT, there have also been a number of extensions of both the keypoint detection and feature extraction stages [93,91,95]. In [65], Principal Components Analysis (PCA) was applied to reduce the length of the descriptor vector, which reduced the memory requirements, allowed for faster feature matching, and focused the descriptor on the more discriminative aspects of the gradient histogram. A similar technique to SIFT, called Speeded Up Robust Features (SURF) [9], was proposed by describing features by their responses to Haar wavelets and using integral images to speed up descriptor computation. The resulting process demonstrated accuracy comparable to SIFT, with a far more efficient implementation. In [100], an extension to the SIFT descriptor was proposed which incorporates an additional global descriptor for each feature, together with the local SIFT descriptor. As such, matching can be performed with a two-stage approach, whereby features that match to the local descriptor are then candidates for a match to the global descriptor. This solves problems that arise when several features representing different objects have similar local appearances, and matching using the standard SIFT descriptor alone is ineffective. In recent years, the speed of keypoint detection itself has also seen dramatic improvements [120, 122], and feature matching has been made very efficient with the use of binary descriptors that enable fast comparisons of binary strings [14,74]. Biologically-inspired approaches to feature design have also been investigated, including statistical modelling of eye movement with respect to image regions of interesting textures [138] and emulation of the arrangement of sensory cells on the retina [3].
Whilst the fastest feature extraction algorithms are based on circular features centred on a keypoint, the shape of the feature itself can be adjusted in an attempt to more naturally fit the surrounding image texture. Affine covariant features [92] were introduced to describe features across much larger affine viewpoint angles than standard circular features. Following detection of a keypoint, an iterative process is applied which modifies the feature’s location, scale and neighbourhood, ultimately converging to an affine invariant region, bounded by an ellipse. This ellipse then contains a region whose content is invariant across a restricted range of viewpoints. Maximally Stable Extremal Regions were introduced [89] to fit larger stable regions in the image, typically regions of uniform colour, adding power to the feature due to the unique nature of the shape describing the region.
Direct feature matching
Early work in recognition directly matched features in one image to features in all other database images. In Lowe’s original SIFT contribution [84], a method was provided to find the closest match of an image, against a database of images, based on the Euclidean distance between feature descriptors. If the distance to the closest feature match is less than k (≈ 0.6) times the distance to the next closest match, a vote is given to the image to which the database feature belongs. Finally, the image with the greatest number of votes is labelled as the closest match. The use of the ratio to verify a match ensures that the feature itself is distinctive enough, and confidence is sufficient that a correct match has been made. This is necessary because some features will only be detected in one of the two images of the same scene, due to occlusion or illumination effects. Also, if two different features have similar descriptors, then small viewpoint changes may actually result in the closest match appearing to be this other similar feature. This approach therefore only matches features which are unique relative to all other features. The ratio test eliminates 90% of false matches, while discarding less than 5% of the correct matches. This approach is very effective for images captured from very similar viewpoints, because the features in two images will have very similar descriptors. However, across larger viewpoint changes, it is simply not effective enough to rely on the descriptors alone. Using these ratio tests, effective structuring of image databases in advance of querying has enabled more effective matching at runtime [144]. In [76], a method was presented to reduce the feature set to include only those which were most distinctive. For each feature, the posterior probabilities for each location, given that feature, were calculated, and the top 10% of features which gave the greatest values were retained. These are the features which give the most distinctive representation of the location within which the feature is present.
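The ratio test and voting scheme described above can be sketched as follows. This is a brute-force illustration rather than the implementation used in [84]: the function names and the layout of the descriptor arrays are assumptions, the default k = 0.6 follows the value quoted in the text, and at least two database features are assumed to be present.

```python
from collections import Counter

import numpy as np

def ratio_test_matches(query_desc, db_desc, k=0.6):
    """Return (query_index, db_index) pairs that pass the ratio test.

    query_desc: (n_query, d) array; db_desc: (n_db, d) array, n_db >= 2.
    """
    # Pairwise Euclidean distances, shape (n_query, n_db).
    dists = np.linalg.norm(query_desc[:, None, :] - db_desc[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    matches = []
    for qi in range(dists.shape[0]):
        nearest, second = order[qi, 0], order[qi, 1]
        # Accept only if the nearest neighbour is clearly closer than the second.
        if dists[qi, nearest] < k * dists[qi, second]:
            matches.append((qi, int(nearest)))
    return matches

def vote_for_image(matches, db_image_ids):
    """Each accepted match votes for the image that owns the database feature."""
    votes = Counter(db_image_ids[di] for _, di in matches)
    return votes.most_common(1)[0][0] if votes else None
```

A query feature lying midway between two database features is rejected as ambiguous, whereas a feature with a single clear nearest neighbour contributes a vote.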
Rather than simply computing the Euclidean distance between feature descriptors, improvements in speed and recognition performance have been gained in recent years by addressing either the descriptor space itself, or the matching methodology. In [94,7], the SIFT descriptor was manipulated to improve direct feature matching using new distance functions. In [56], descriptor space was mapped to a set of lower-dimensional spaces to improve nearest-neighbour modelling. The distance between SIFT descriptors of feature correspondences was modelled as Gaussian rather than isotropic in [94]. For matching to large databases, optimised k-d trees have been addressed to efficiently search for nearest-neighbour descriptors [133,60], and in [79] a prioritised approach to matching features was presented by considering the stability and observability of each feature over a wide range of viewpoints. In [17], a technique drawing from the human vision system was proposed. By segmenting images into uniform regions that may correspond to objects, each region itself was then represented by a number of SIFT features. Location recognition was then based on recognising the overall objects, rather than treating each feature independently. Direct feature matching has generated promising results for robot localisation in small environments [69,127], and is often the only technique required when the database consists of only a few images. Of course, this is rarely the reality, and as the scale of the database increases, the technique degrades, for two reasons. First, the introduction of more features into the image database generates a greater number of false positive feature matches. Second, each feature becomes less distinct within the larger feature set, and so the number of feature matches passing distance threshold tests [84] drops rapidly. As such, more robust techniques than a simple image voting scheme are necessary for practical robot localisation.
2.2 Bag Of Words
Whilst direct feature matching can provide excellent recognition performance, it scales very poorly with the size of the database and becomes impractical for any large-scale recognition tasks, particularly within the practical constraints of robot localisation. In 2003, Josef Sivic and Andrew Zisserman’s seminal work [135] proposed a new approach to image matching, the notorious Bag Of Words (BOW) framework, which is related to earlier work in document classification and text retrieval [75]. In the domain of documents and text, each document is represented by a bag of words, containing a distribution of words which ignores the order of those words. By extracting the frequency of occurrence of certain keywords, it is possible to classify the document as one of a distinct number of types.
(a) Clustering of features in descriptor space
(b) Partitioned descriptor space and associated image patches
Figure 2.2: Candidate local feature matches based on visual word assignments
In a similar manner, images can be represented by a bag of visual words, with each word representing a discretised portion of descriptor space. Whilst in text documents the number of word types is fixed by the dictionary, the quantising resolution of image features is more of an open-ended problem, often with a trade-off between discriminative power and generalisation. Choosing a small number of visual words reduces the discriminative power of each word; however, increasing the number of words introduces a lack of generalisability, as features can “jump” between words if a small amount of noise is present. The standard method for comparing two images is by computing a normalised histogram of visual word occurrences for each image, denoted the BOW vector, and calculating the cosine similarity between two such vectors. Typically, weighting based on the inverse-document-frequency is included [135] to downweight those visual words which occur regularly over a number of images, and as such offer lower discrimination. Figure 2.2 illustrates the clustering process by which visual dictionaries are typically constructed, and Figure 2.3 demonstrates how a query image is rapidly compared to a database of images via its BOW vector of visual word frequencies. Together with allowing for fast vector-based image matching, the BOW framework facilitates rapid generation of candidate feature correspondences via use of an inverted index. Each visual word in the dictionary stores a list of all features in the database which have been assigned to this word, such that a query feature can rapidly be linked to those database features of similar appearance. This concept is illustrated in Figure 2.4.
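To make the BOW comparison and the inverted index concrete, the following sketch assumes that every feature has already been quantised to an integer visual word identifier. The function names, the logarithmic form of the inverse-document-frequency weight and the data layout (one array of word identifiers per database image) are illustrative assumptions, not a description of the implementation in [135].

```python
from collections import defaultdict

import numpy as np

def inverse_document_frequency(db_word_ids, n_words):
    """idf_w = log(N / n_w), with n_w the number of images containing word w."""
    doc_freq = np.zeros(n_words)
    for words in db_word_ids:
        doc_freq[np.unique(words)] += 1
    # Words never observed are given n_w = 1 to avoid division by zero (assumption).
    return np.log(len(db_word_ids) / np.maximum(doc_freq, 1))

def bow_vector(word_ids, n_words, idf=None):
    """L2-normalised histogram of visual word occurrences, optionally idf-weighted."""
    hist = np.bincount(word_ids, minlength=n_words).astype(np.float64)
    if idf is not None:
        hist *= idf
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def cosine_similarity(v1, v2):
    """For L2-normalised BOW vectors the cosine similarity is a dot product."""
    return float(v1 @ v2)

def build_inverted_index(db_word_ids):
    """Map each visual word to the (image, feature) pairs assigned to it."""
    index = defaultdict(list)
    for image_id, words in enumerate(db_word_ids):
        for feature_id, word in enumerate(words):
            index[int(word)].append((image_id, feature_id))
    return index
```

Looking up a query feature's word in the index returns all database features of similar appearance directly, without comparing the query against every database descriptor.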
(Figure labels: Query BOW vector, Database BOW vectors, Cosine similarity)
Figure 2.3: The cosine similarity between a query image and a database image is an efficient way to weakly determine image similarity
(Figure labels: Query image, Visual dictionary, Database images)
Figure 2.4: The dictionary can also be used to efficiently generate candidate feature correspondences between two images
Whilst the original BOW contribution proposed a simple k-means clustering approach to dictionary generation [135], several alternatives have been proposed in recent years [16]. David Nistér and Henrik Stewénius proposed the hierarchical vocabulary tree [104] (later optimised in [59]) to speed up both dictionary construction and feature quantisation, which was shown to offer significant scalability in [126]. James Philbin et al. [112] approximated feature quantisation in a flat dictionary, which was shown to outperform exact search in the vocabulary tree. A deeper analysis of the underlying feature space has yielded promising results with Gaussian mixture model approaches to clustering [8] and Fisher encoding [110]. In [4], an incremental approach to dictionary learning was proposed for a robot navigating through an environment, and a fixed-radius approach to dictionary construction was presented in [32].
One of the issues associated with the BOW framework is that discretisation of feature space means that a feature assigned to one visual word may “jump” to another visual word on a subsequent observation. James Philbin and Andrew Zisserman [111] introduced the principle of ‘soft assignment’ to address this issue, whereby each feature is assigned to a number of words, with a Gaussian weighting based on the distance between the feature and each word’s centroid. In [96], a more rigorous analysis was conducted by explicitly observing the likely alternative assignments of each visual word. A coarse dictionary with finer discrimination based on the Hamming distance was proposed in [55] to address the same problem.
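A minimal sketch of the soft assignment described above is given below. The number of words retained per feature and the bandwidth sigma are assumed, illustrative parameters, and the function name is hypothetical; the weights follow a Gaussian of the distance to each centroid and are normalised to sum to one.

```python
import numpy as np

def soft_assign(descriptor, centroids, n_nearest=3, sigma=1.0):
    """Assign a descriptor to its n_nearest visual words with Gaussian weights.

    Returns (word_ids, weights), ordered from the nearest word outwards.
    sigma should be comparable to typical descriptor-to-centroid distances,
    otherwise all weights can underflow to zero.
    """
    dists = np.linalg.norm(centroids - descriptor, axis=1)
    nearest = np.argsort(dists)[:n_nearest]
    weights = np.exp(-dists[nearest] ** 2 / (2.0 * sigma ** 2))
    return nearest, weights / weights.sum()
```

A descriptor lying on the boundary between two words then contributes equally to both, rather than being forced into one word and potentially “jumping” to the other on the next observation.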
Efforts to provide lossless compression of a BOW database were discussed in [55]. The comparison of BOW vectors itself has seen much work in recent years [113,57]. In [101] it was proposed to learn the PCA structure over several images, and in [58] the absence of visual word observations was allowed to carry significance. Modelling of visual word co-occurrences was proposed in [20], and the issue of one visual word occurring very frequently in an image, and hence distorting the BOW vector, was discussed in [54,148]. Those visual words that are likely to represent background structure, rather than an object of interest, can be downweighted by the method in [67].
2.3 Geometric Constraints
Whilst weak geometric constraints have been presented in parallel with the BOW framework [72, 53], typically geometric verification is reserved as a second stage in the image retrieval pipeline, by re-ranking the top images returned from a BOW vector comparison to the database. 2D geometric constraints have been shown to be very efficient, and effective when the viewpoints on a scene are similar. In [148] groups of neighbouring features were matched in bulk, and in [155], sets of visual words and their geometric arrangements were matched between images. In [140], fast Hough-based voting based on feature geometries was proposed and showed comparable results to much slower, more rigid methods. The strongest geometric constraints however are those offered by an estimation of a 3D geometric relationship, as this reflects the true underlying scene structure. Richard Hartley and Andrew Zisserman formalised many of the theories of modern multiple-view geometry for computer vision in [48]. The affine transformation and epipolar geometry constraints are perhaps the two most popular methods for registering two or more images, and they constrain a point in one image to a point, or a line, in another image, respectively. Whereas the affine transformation is suited to planar scenes with little perspective, the epipolar constraint can be degenerate in this scenario and is often better suited to structures with large depth [141]. Figure 2.5 illustrates the constraints imposed by these 3D relationships.
(a) Candidate local feature matches based on visual word assignments
(b) Affine transformation using an estimation of a homography matrix, where the red points represent those upon which the homography matrix is estimated
(c) Epipolar geometry based on the fundamental matrix, where the red points represent those upon which the fundamental matrix is estimated
Figure 2.5: Generating feature correspondences between images starts by finding candidate correspondences based on feature descriptors, and proceeds through a 3D relationship via either a homography or epipolar geometry.
For a point $u_1$ in one image and $u_2$ in the other, the affine transformation is represented by a homography matrix $H$, and constrains the following:
$u_2 = H u_1$ (2.1)
The epipolar constraint is represented by the fundamental matrix $F$ (or the essential matrix for calibrated cameras), and applies the following constraint:
$u_2^{T} F u_1 = 0$ (2.2)
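As a numerical check of Equations 2.1 and 2.2, the sketch below evaluates both constraints for points expressed in homogeneous coordinates, so that Equation 2.1 holds up to a scale factor which is removed by dividing by the third coordinate. The function names, the example homography and the example camera motion (unit translation along the x-axis with identity calibration, for which the fundamental and essential matrices coincide) are assumptions chosen purely for illustration.

```python
import numpy as np

def apply_homography(H, u1):
    """Map a 2D point through H (Equation 2.1), returning inhomogeneous coordinates."""
    v = H @ np.append(u1, 1.0)
    return v[:2] / v[2]

def epipolar_residual(F, u1, u2):
    """Algebraic residual u2^T F u1 (Equation 2.2); zero for a noise-free match."""
    return float(np.append(u2, 1.0) @ F @ np.append(u1, 1.0))

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ x == np.cross(t, x)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed example: scaling by 2 plus a translation, written as a homography.
H = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, -1.0],
              [0.0, 0.0, 1.0]])

# Assumed example: second camera translated by t relative to the first, no rotation.
t = np.array([1.0, 0.0, 0.0])
F = skew(t)                       # E = [t]_x R with R = I and identity calibration
X = np.array([0.5, -0.2, 4.0])    # a 3D point in the first camera frame
u1 = X[:2] / X[2]                 # projection into the first image
u2 = (X + t)[:2] / (X + t)[2]     # projection into the second image
```

In practice the residual is never exactly zero for detected features, and a correspondence is accepted as an inlier when a residual of this kind falls below a chosen threshold.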
There is a wide range of estimation techniques for the associated matrices, which usually require point correspondences to be established in the two images. Estimation of the essential matrix typically requires a 5-point algorithm [103,77]. The fundamental matrix has a greater number of degrees of freedom due to the lack of constraints from camera calibration, and can be solved with 8 point correspondences [47,81] using linear equations, or with the 7-point algorithm using non-linear equations [48], although this is not guaranteed to return a unique solution. James Philbin et al. greatly increased the speed of RANSAC for homography estimations, by estimating the transformation between two images based on the transformation of a single point correspondence. Similarly, in [6] it was proposed to use the explicit feature shape to estimate the fundamental matrix. If the camera motion is restricted to planar motion [45, 68], or further sensors are available [41, 62], then the number of point correspondences required can be further reduced.
Estimation of these matrices is difficult when some of the point correspondences are false matches, and as such the RANSAC approach is commonly adopted [40]. Here, an iterative loop begins by randomly sampling the minimum number of correspondences needed to estimate the model, and the model is estimated from this sample. Inliers are then identified from the full set of correspondences, and the process repeats until the probability that at least one sample was free of false correspondences reaches a required threshold. Several modifications to the algorithm have been proposed in recent years. Chum et al. [21] proposed to locally optimise the model by resampling from the set of inliers. This was able to combat the issue of aliasing and image noise distorting the apparent underlying model, whereby the observed locations of features were marginally distorted from their true theoretical image locations. In [18], an optimal strategy for choosing samples was presented, given a required confidence probability. In [28], it was proposed to bias the sampling towards those correspondences which were a likely match based on feature descriptor similarities. In [117], the entire RANSAC algorithm was made adaptive towards the particular distribution of correspondences, removing the constraints imposed by heuristically-defined thresholds. Sampling strategies to deal with structural constraints, such as the effect of dominant planes [28,70], have also been investigated.
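The RANSAC loop described above can be sketched as follows. To keep the example self-contained, a 2D line fitted from two points is swapped in for the fundamental matrix, so this is not the estimator used in the cited works; the stopping rule is the standard one, where the number of iterations is chosen so that at least one all-inlier sample is drawn with the required confidence. The function names, the threshold and the toy data are assumptions made for illustration.

```python
import math
import random

def required_iterations(inlier_ratio, sample_size, confidence):
    """Iterations needed so that, with probability `confidence`,
    at least one sample contains only inliers."""
    p_clean = inlier_ratio ** sample_size
    if p_clean <= 0.0:
        return float("inf")
    if p_clean >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_clean))

def ransac_line(points, threshold=0.05, confidence=0.99, max_iters=1000, seed=0):
    """Fit y = slope * x + intercept to points contaminated by outliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    needed, done = max_iters, 0
    while done < min(needed, max_iters):
        # 1. Draw a minimal sample (two points define a line).
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        done += 1
        if x1 == x2:
            continue  # degenerate sample for this parameterisation
        # 2. Estimate the model from the sample.
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # 3. Identify inliers over the full set of points.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) < threshold]
        # 4. Keep the best model and update the stopping criterion.
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
            needed = required_iterations(len(inliers) / len(points), 2, confidence)
    return best_model, best_inliers

# Toy data: 30 points on y = 2x + 1 and 10 gross outliers.
points = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(30)]
points += [(i / 2, -3.0 - i * i) for i in range(10)]
model, inliers = ransac_line(points)
print(model, len(inliers))
```

Because the required number of iterations grows rapidly with the sample size, methods that reduce the number of correspondences per sample can reduce the running time considerably.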
2.4 Scene Recognition
The preceding discussions on BOW similarity and geometric verification form the basis of a scene recognition framework, for which a wide range of designs and applications has been proposed. One of the simplest methods is a feature voting approach [127], whereby scenes or locations are represented by images, with one image per scene, and the image with the greatest number of inliers from a geometric constraint represents the returned scene. Databases can then be structured such that only the most informative database features need to be considered [144], and probabilistic feature matching can be employed using relative feature descriptor distances [76]. Whilst these approaches can yield promising results in small environments with carefully-acquired image databases, most modern recognition engines must deal with large-scale tasks with noisy databases.
2.4.1 Image Clustering
Image clustering to form sets of images representing a distinct viewpoint has proved to be a popular approach. In [156], clustering was based on GPS-tagged data together with image similarities, and in [51] a sophisticated hypergraph structure was proposed. In [63], a maximum intra-cluster distance was imposed to ensure that images representing a single cluster were not overly varied in appearance. [116] allowed for scalable clustering by dividing the geographic space into regular grids and performing clustering on each grid cell, and state-of-the-art 3D reconstruction engines have addressed efficient pipelines for very large-scale clustering and feature matching [2]. [19] proposed large-scale image matching of this nature by use of hashing. Given a set of images structured in these ways, the global location of a captured image can be estimated by considering the Global Positioning System (GPS) coordinates of similar images [49]. Visualisation of tourist photographs from online image sources has also been made possible by use of large-scale graph-based image matching techniques [136,134].
Raguram et al. [118] proposed to match a query image to these clusters by selecting a few iconic images from each cluster that sufficiently represent the cluster diversity, reducing the need to store or match to near-identical, and hence redundant, images in the database. Kalantidis et al. [63] generated synthetic models of each scene by mapping features from the clustered images onto a single image, via an affine transformation, and matching to the resulting augmented image. Structuring databases in this way draws parallels to the query expansion method in image retrieval [25,22,7], whereby the returned database image from a query is then itself sent through as a query, in an effort to increase recall. A set of synthetic views was generated in [52] in order to allow for matching to a point cloud from previously unobserved viewpoints, and point cloud matching was also investigated in [125] by combining visual word and feature descriptor data for efficient yet precise feature matching. Feature matching was made faster and more reliable in [79] by considering which features in the environment are stable and observable from a wide range of viewpoints. Whilst most recognition engines employ a geometric feature-matching approach, alternatives have been proposed such as the use of Support Vector Machine (SVM) classifiers [78] or combining multiple feature types [131].
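The query expansion idea mentioned above, in which a returned database image is itself issued as a new query, can be sketched with a toy example. Images are represented here as sets of visual word identifiers and compared with Jaccard similarity; this similarity measure, the threshold, the single round of expansion and the example database are all assumptions made for illustration, and the cited methods additionally rely on geometric verification before expanding.

```python
def jaccard(a, b):
    """Overlap between two sets of visual word identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve(query_words, database, min_similarity):
    """Return database image names ranked by similarity to the query."""
    scored = [(jaccard(query_words, words), name)
              for name, words in database.items()]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= min_similarity]

def retrieve_with_expansion(query_words, database, min_similarity):
    """Re-issue each first-pass result as a query and merge the results."""
    first_pass = retrieve(query_words, database, min_similarity)
    results = list(first_pass)
    for name in first_pass:
        for extra in retrieve(database[name], database, min_similarity):
            if extra not in results:
                results.append(extra)
    return results

# Hypothetical database: "side" shares few words with the query itself,
# but enough with "front" to be recovered once "front" is re-issued.
database = {
    "front": {1, 2, 3, 4},
    "side": {3, 4, 5, 6},
    "back": {5, 6, 7, 8},
    "other": {20, 21, 22},
}
query = {1, 2, 3}
plain = retrieve(query, database, 0.3)
expanded = retrieve_with_expansion(query, database, 0.3)
print(plain, expanded)
```

The example shows the intended gain in recall: the "side" view is missed by the original query but recovered after expansion, at the cost of additional queries and a risk of drift if a first-pass result is incorrect.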
2.4.2 Topological Localisation
Topological localisation for mobile robot navigation is a special case of scene recognition, in that the range of viewpoints on a scene is typically restricted. Furthermore, filtering