Inter-Image Geometry - Generative methods for scene association with 2D pairwise constraints

3.4 Inter-Image Geometry

After the BOW filtering stage of an image retrieval system returns a shortlist of candidate images, geometric re-ranking is necessary to verify that these images are geometrically consistent with the query. Because the RANSAC algorithm typically employed for this is time-consuming, it is desirable to first reduce the number of feature correspondences to a more manageable level by applying geometric constraints that may be weaker than those from a full 3D transformation, but are much faster to process. This can be achieved by considering rough hypotheses of an image transform using inter-image geometries, with each correspondence voting for one hypothesis in a method similar to the Hough transform [36]. From a single correspondence, it is possible to estimate a 4 degree-of-freedom (DOF) transformation (x-translation, y-translation, scale and rotation) using only the differences in inter-image geometries [84]. If the local feature used encodes further geometric information, such as in the form of an ellipse [92], then further degrees of freedom can be included, such as anisotropic scale and shear [112]. However, this thesis is concerned only with the SIFT feature [84], and as such we restrict the method to 4 DOF hypothesis generation.
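As a minimal sketch of how one correspondence yields a 4 DOF hypothesis, the following computes the four inter-image differences from a pair of matched keypoints. The `Keypoint` container and its field names are hypothetical and chosen only for illustration; orientations are assumed to be in degrees.

```python
from dataclasses import dataclass


@dataclass
class Keypoint:
    """Hypothetical container for a SIFT-style keypoint (illustrative only)."""
    x: float
    y: float
    scale: float        # characteristic scale, assumed > 0
    orientation: float  # dominant orientation in degrees


def hypothesis(k1: Keypoint, k2: Keypoint):
    """Return (dx, dy, scale_ratio, dtheta) implied by one correspondence."""
    dx = k2.x - k1.x
    dy = k2.y - k1.y
    scale_ratio = k2.scale / k1.scale
    # Wrap the orientation difference into the range [0, 360).
    dtheta = (k2.orientation - k1.orientation) % 360.0
    return dx, dy, scale_ratio, dtheta


h = hypothesis(Keypoint(10.0, 20.0, 2.0, 350.0),
               Keypoint(40.0, 25.0, 4.0, 20.0))
print(h)
```

Each correspondence therefore contributes one point in a 4-dimensional hypothesis space, which is what the voting scheme described in this section quantises into bins.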

The size of the bins used for hypothesis voting is important for discriminating between different hypotheses, whilst also ensuring that all inlier correspondences are assigned to the same bin. Typically, the bin size for each geometry is set to the maximum expected discrepancy across all inliers, and votes are added to the two closest bins to eliminate quantisation errors. One simple approach is to determine the bin sizes heuristically [83], at a level that offers good empirical performance. However, this approach does not generalise and can fail when there is a significant scale change or out-of-plane rotation between the images. For example, Figure 3.8 shows two feature correspondences, together with their x−y transformation hypotheses. These hypotheses are inconsistent in (a) and (b), where a scale change or a rotation is present, and dramatically so in (c), where both occur simultaneously.
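The inconsistency of translation hypotheses under a scale change can be illustrated with a small numeric sketch. It assumes the simplest possible model, in which a point p in the first image maps to s·p + t in the second; the point coordinates, the scale s and the shift t are made-up values, not taken from Figure 3.8.

```python
def translation_hypothesis(p, s, t):
    """x-y translation voted by a point p that maps to s * p + t."""
    qx, qy = s * p[0] + t[0], s * p[1] + t[1]
    return qx - p[0], qy - p[1]


# Two points on the same object, 100 pixels apart in x (illustrative values).
p_red, p_green = (100.0, 100.0), (200.0, 100.0)
shift = (10.0, 0.0)

# No scale change: both correspondences vote for the same translation.
same_scale = [translation_hypothesis(p, 1.0, shift) for p in (p_red, p_green)]

# Scale change of 2: the votes now differ by (s - 1) * 100 pixels in x.
double_scale = [translation_hypothesis(p, 2.0, shift) for p in (p_red, p_green)]

print(same_scale)
print(double_scale)
```

With a fixed translation bin smaller than the 100-pixel gap, the two inlier correspondences in the second case would fall into different bins, which is the failure that fixed bin sizes are prone to.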

(a) Scale change between images

(b) Rotation between images

Figure 3.8: Fixed parameters for transformation hypothesis voting often fail when there is a large scale change or rotation between two images. In each row, the two images on the left show two feature correspondences, and the image on the right shows the transformation hypotheses. The black square represents a hypothesis of zero translation, whilst the red and green squares represent the hypotheses based on the inter-image translation of the red and green correspondences.


Figure 3.9: A parameter-free solution to generating inter-image constraints. (Figure labels: width w and height h, marked in each of the two images.)

[...] shift in x−y translation across the set of inlier correspondences from that object, we need to know the width and height of the object as it appears in each image. From this, the maximum x-translation discrepancy is simply the difference between the widths of the object in the two images, and similarly for the y-translation discrepancy with respect to the height. Given the object sizes, an appropriate bin size can then be determined, independently of the two camera locations.
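The claim that the maximum x-translation discrepancy equals the difference between the object widths can be checked in a simplified setting. The derivation below assumes a pure scale change s with translation t_x and no rotation; the symbols s, t_x and x_0 are introduced here only for illustration.

```latex
% A point at x in the first image maps to x' in the second image.
x' = s\,x + t_x
\quad\Longrightarrow\quad
\delta_x(x) = x' - x = (s - 1)\,x + t_x .

% Over an object spanning x \in [x_0,\; x_0 + w_1] in the first image:
\max_x \delta_x - \min_x \delta_x = |s - 1|\,w_1 .

% The object width in the second image is w_2 = s\,w_1, hence
|s - 1|\,w_1 = |s\,w_1 - w_1| = |w_2 - w_1| .
```

Under these assumptions the spread of translation votes from a single object depends only on its observed widths, which is why a bin size derived from them does not require knowledge of the camera positions.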

One method to determine the object sizes is to find the width and height that the correspondences span in the two images, as shown in Figure 3.9. Note that we are only interested in the size of the object with respect to the feature correspondences it forms, not the size of the underlying structure. However, because many false positive correspondences are likely to be present, this method would tend to yield very large object sizes that span almost the entire image. To eliminate some of these outliers, we use an affine model and assume that the scale ratio between inlier feature correspondences is equivalent to the overall scale ratio of the two images. Similarly, the orientation difference between inlier correspondences is roughly equivalent to the in-plane rotation between the images. All true feature correspondences can therefore be assumed to have similar scale ratios and orientation differences, independently of the viewpoint and scale change between images. We thus propose a two-stage strategy: first narrowing down the set of correspondences using scale and orientation, and then using the resulting hypothesis of the image rotation and scale ratio to estimate the observed object sizes. For the scale and orientation bins, we can set a fixed bin size due to the invariance of correspondence scale ratios and orientation differences to camera position. These bin sizes were therefore determined empirically from the set of inlier feature correspondences that were learned earlier when determining the inlier descriptor distances. For the set of correspondences of each image pair, the disagreement in scale ratio and orientation difference was calculated between each pair of correspondences, and ranked in order of magnitude. From these distributions, we assigned the size of each bin to the value at the 95th percentile, which we denote λ_σ and λ_θ for the scale ratio and orientation difference, respectively. Given two images with sets of correspondences, each correspondence then votes for the two closest bins according to its scale ratio and orientation difference.
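The first stage can be sketched as follows. This is an illustrative reading rather than the thesis implementation: the nearest-rank percentile, the use of log scale ratios at this stage, the lack of wrap-around handling for orientation, and all numeric values are assumptions made for the example.

```python
import math
from collections import defaultdict
from itertools import combinations


def percentile_95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def pairwise_disagreement(values):
    """Absolute difference between every pair of values."""
    return [abs(a - b) for a, b in combinations(values, 2)]


def two_closest_bins(value, bin_size):
    """The two integer bins on either side of value / bin_size."""
    lo = math.floor(value / bin_size)
    return (lo, lo + 1)


def vote_scale_orientation(corrs, lam_sigma, lam_theta):
    """corrs: list of (log scale ratio, orientation difference in degrees).
    Returns the indices of the correspondences in the most-voted bin."""
    members = defaultdict(list)
    for idx, (log_sr, dtheta) in enumerate(corrs):
        for bs in two_closest_bins(log_sr, lam_sigma):
            for bt in two_closest_bins(dtheta, lam_theta):
                members[(bs, bt)].append(idx)
    best = max(members, key=lambda b: len(members[b]))
    return members[best]


# Bin size from hypothetical training orientation differences (degrees).
train_dtheta = [30.0, 31.0, 29.5, 33.0, 30.5]
lam_theta = percentile_95(pairwise_disagreement(train_dtheta))
lam_sigma = 0.2  # hypothetical bin size for the log scale ratio

corrs = [(0.70, 30.0), (0.72, 31.0), (0.69, 29.0), (-0.40, 200.0), (1.50, 95.0)]
survivors = vote_scale_orientation(corrs, lam_sigma, lam_theta)
print(lam_theta, survivors)
```

Voting for the two closest bins in each dimension means each correspondence casts four votes here, so that inliers lying close to a bin boundary still share at least one bin.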

The correspondences in the bin with the greatest number of votes then form a significantly reduced set of inliers. However, this inlier set can be reduced further by considering that the observed object size in the second image should be no greater than the observed object size in the first image, multiplied by the scale ratio between the first and second image. Similarly, the observed object size in the first image should be no greater than the observed object size in the second image, divided by the image scale ratio. Here, the image scale ratio is taken as the median scale ratio of all correspondences in the maximum bin from the first stage. Therefore, rather than calculating the observed differences in object width and height to define the maximum x−y discrepancy, we take the observed size in the first image and multiply it by the image scale ratio, which is the equivalent of finding the difference in observed object sizes. Then, we do the inverse by dividing the observed object size in the second image by the image scale ratio. In theory, both values should be equivalent, and so the larger of the two values must be due to outlier correspondences. As such, the x−y bin size is set to the minimum of the two values for the width and height respectively. For an image scale ratio of σ_12 and observed object widths w_1 and w_2 in the two images, the x bin size λ_x is defined as:

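\[ \lambda_x = \min\left( w_1 \times \sigma_{12},\ \frac{w_2}{\sigma_{12}} \right) \qquad (3.1) \]

\[ \lambda_y = \min\left( h_1 \times \sigma_{12},\ \frac{h_2}{\sigma_{12}} \right) \qquad (3.2) \]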
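A minimal sketch of Equations 3.1 and 3.2 is given below, assuming the observed object size is measured as the axis-aligned span of the first-stage inlier positions in each image. The function names and the sample coordinates are hypothetical.

```python
from statistics import median


def span(values):
    """Extent covered by a list of coordinates."""
    return max(values) - min(values)


def xy_bin_sizes(pts1, pts2, scale_ratios):
    """Bin sizes (lambda_x, lambda_y) following Equations 3.1 and 3.2.

    pts1, pts2: (x, y) positions of the first-stage inliers in each image.
    scale_ratios: per-correspondence scale ratios of those inliers.
    """
    sigma12 = median(scale_ratios)  # image scale ratio
    w1, h1 = span([x for x, _ in pts1]), span([y for _, y in pts1])
    w2, h2 = span([x for x, _ in pts2]), span([y for _, y in pts2])
    lam_x = min(w1 * sigma12, w2 / sigma12)
    lam_y = min(h1 * sigma12, h2 / sigma12)
    return lam_x, lam_y


pts1 = [(0.0, 0.0), (100.0, 50.0), (40.0, 20.0)]
pts2 = [(10.0, 10.0), (210.0, 110.0), (90.0, 50.0)]
lam_x, lam_y = xy_bin_sizes(pts1, pts2, [2.0, 2.0, 2.1])
print(lam_x, lam_y)
```

Taking the minimum of the two estimates means that a single outlier inflating the span in one image does not by itself inflate the bin size.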

Finally, we can define the bin assignments for each correspondence m. We define δ_x^m and δ_y^m as the inter-image distances in x and y image position for m, and σ_m and θ_m as the scale ratio and orientation difference of m. We then define the minimum possible discrepancy for each geometry as x_min and y_min for the x and y translations, σ_min for the scale ratio, and θ_min for the orientation difference. x_min and y_min are set to twice the image length in the respective dimension, σ_min is set to 1/3 due to the SIFT feature's instability over greater scale ratios [84], and θ_min is set to 0 as we are considering the full range of orientations from 0° to 360°. Each geometry k is now assigned a bin value b_k:

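\[ b_x(m) = \frac{\delta_x^m - x_{\min}}{\lambda_x} \qquad (3.3) \]

\[ b_y(m) = \frac{\delta_y^m - y_{\min}}{\lambda_y} \qquad (3.4) \]

\[ b_\sigma(m) = \frac{\log(\sigma_m) - \log(\sigma_{\min})}{\sigma_{\mathrm{bin}}} \qquad (3.5) \]

\[ b_\theta(m) = \frac{\theta_m - \theta_{\min}}{\theta_{\mathrm{bin}}} \qquad (3.6) \]

where σ_bin and θ_bin are the scale-ratio and orientation bin sizes (λ_σ and λ_θ above).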

Here, we use a logarithmic scale for the scale ratio so that ratios less than 1 are given the same importance as those greater than 1. For each geometry, the two integers closest to the bin value are each assigned a vote by m, and the votes are combined into a 4-dimensional transformation space. Finally, all correspondences from the bin with the greatest number of votes are determined to be inliers to the inter-image geometry constraint, and passed on to the next phase. Figure 3.10 shows the dramatic reduction of correspondences to a more consistent inlier set by use of this inter-image constraint.
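The 4-dimensional voting can be sketched as below, following Equations 3.3 to 3.6. Combining the per-geometry votes as a Cartesian product (16 cells per correspondence) is one possible interpretation and is an assumption here, as are the function names, the minimum values and the bin sizes used in the example.

```python
import math
from collections import defaultdict
from itertools import product


def two_closest(b):
    """The two integers on either side of a bin value."""
    lo = math.floor(b)
    return (lo, lo + 1)


def vote_4d(corrs, mins, sizes):
    """Return indices of the correspondences in the most-voted 4-D cell.

    corrs: list of (dx, dy, scale_ratio, dtheta) per correspondence.
    mins:  (x_min, y_min, sigma_min, theta_min).
    sizes: (lambda_x, lambda_y, sigma_bin, theta_bin).
    """
    cells = defaultdict(list)
    for idx, (dx, dy, sr, dth) in enumerate(corrs):
        values = (
            (dx - mins[0]) / sizes[0],                       # Eq. 3.3
            (dy - mins[1]) / sizes[1],                       # Eq. 3.4
            (math.log(sr) - math.log(mins[2])) / sizes[2],   # Eq. 3.5
            (dth - mins[3]) / sizes[3],                      # Eq. 3.6
        )
        # Assumed combination rule: vote in every cell formed by the
        # two closest bins of each of the four geometries.
        for cell in product(*(two_closest(v) for v in values)):
            cells[cell].append(idx)
    best = max(cells, key=lambda c: len(cells[c]))
    return cells[best]


# Illustrative values only; the minimum offsets are not the thesis settings.
mins = (-640.0, -480.0, 1.0 / 3.0, 0.0)
sizes = (100.0, 50.0, 0.2, 10.0)
corrs = [
    (110.0, 60.0, 2.0, 30.0),
    (150.0, 80.0, 2.1, 32.0),
    (130.0, 70.0, 1.9, 28.0),
    (-300.0, 200.0, 0.5, 200.0),  # an outlier
]
inliers = vote_4d(corrs, mins, sizes)
print(inliers)
```

The three mutually consistent correspondences share at least one cell even though their individual bin values straddle bin boundaries, while the outlier votes elsewhere.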

(a) Correspondences based only on visual words

(b) Correspondences based on inter-image geometries

Figure 3.10: The effect of the inter-image geometry stage is to reduce the feature correspondences to a more consistent set, whose correspondences all agree in x- and y-translation, scale ratio and orientation difference.
