4.4.1 Datasets
We used two domain specific datesets Oxford [96] and Paris [97] buildings, and the more generic TRECVID Subset [116] dataset, to evaluated the proposed BLCF. For detailed detail of the datasets see Section 2.6. Figure 4.5 includes three examples of query objects from the three datasets.
4.4.2 Preliminary experiments
We extracted features from the last three convolutional layers (conv5 1, conv5 2 and conv5 3) and compared their performance on the Oxford 5k dataset. We experimented with di↵erent image input sizes: 1/3 and 2/3 of the original image. Following several other authors [8? , 125] as discussed in Chapter 3, we L2-normalize all local features, followed by PCA, whitening, and a second round of L2-normalization. The PCA models were fit on the same dataset as the test data in all cases.
Unless stated otherwise, all experiments used a visual codebook of 25,000 centroids fit using the (L2-PCA-L2 transformed) local CNN features of all images in the same
Table 4.1: Mean average precision (mAP) on Oxford 5k using di↵erent convolutional layers of VGG16, comparing the performance of di↵erent feature map resolutions (both raw and interpolated). The size of the codebook is 25,000 in all experiments.
N⇥ M denotes the spatial resolution of the input image.
conv5 1 conv5 2 conv5 3
N ⇥ M raw 0.641 0.626 0.498
2N ⇥ 2M interpolated 0.653 0.638 0.536
2N ⇥ 2M raw 0.620 0.660 0.540
di↵erent codebook sizes (25,000, 50,000, and 100,000) on the Oxford 5k dataset, and chose the 25,000 centroids one because of its higher performance.
Inspired by the boost in performance of the Gaussian centre prior in SPoC features [8], we apply a weighting scheme on the visual words of an image to provide more importance to those belonging to the central part of the image. The weighting scheme w(i, j) is described as:
w(i, j) =p 1
(i c1)2+ (j c2)2
, (4.1)
where (i, j) represents the position of a visual word within the assignment map and (c1, c2) corresponds to the centre of the assignment map. The weights w(i, j) are
min-max normalized to provide scores between 0 and 1. Table 4.1 shows the mean average precision on Oxford 5k for the three di↵erent layers and image sizes. All results are obtained using this weighting criteria; for conv5 1 in Oxford 5k, this increases mAP from 0.626 to 0.653. The combination of the layers by concatenation did not provide any gain.
4.4.3 Query augmentation
Previous works [3, 126] have demonstrated how simple data augmentation strategies can improve the performance of an instance search system. Data augmentation consists in generating di↵erent versions of an image (by typically applying rotations, translations, and/or variations in the scale) with the purpose of obtaining a more
database side, which can be prohibitively costly for large datasets. For this reason, we use data augmentation on the query side only. We explore two di↵erent strategies to enrich the query before visual search: a horizontal flip (or mirroring) and a zoomed central crop (ZCC) on an image enlarged by 50%.
Figure 4.6 shows an example of the transformations, which give rise to 4 di↵erent versions of the query image. The feature vectors they produce are added together to form a single BoW descriptor by encoding their corresponding visual words into a single representation. Table 4.2 shows the impact of incrementally augmenting the query with each one of these transformations.
Figure 4.6: The four query images after augmentation.
Table 4.2: mAP on Oxford 5k for the two di↵erent types of query augmentation: the flip and the zoomed central crop (ZCC). 2⇥ interpolated conv5 1 features are used in all cases.
Query + Flip + ZCC + Flip
+ ZCC
Global Search (GS) 0.653 0.662 0.695 0.697
Weighted Search (WS) 0.706 0.717 0.735 0.743
Local Search (LS) 0.738 0.746 0.758 0.758
4.4.4 Re-ranking and query expansion
We apply the local re-ranking (R) stage on the top-100 images in the initial ranking, using the sliding window approach described in Section 4.3.
We scan di↵erent aspect ratio score thresholds th and evaluate the precision at the first 10 images (P @10) on the top-10 retrieved images. P @10 does not change for most of the thresholds, but the selected windows sometimes do and a lower threshold
Table 4.3: mAP on Oxford 5k and Paris 5k for the di↵erent stages in the pipeline introduced in Section 4.3. The Qaug additional columns indicate the results when
the query is augmented with the transformations introduced in Section 4.4.3.
Oxford 5k Paris 6k +Qaug +Qaug GS 0.653 0.697 0.699 0.754 LS 0.738 0.758 0.820 0.832 GS + R 0.701 0.713 0.719 0.752 LS + R 0.734 0.760 0.815 0.828 GS + GQE 0.702 0.730 0.774 0.792 LS + GQE 0.773 0.780 0.814 0.832 GS + R + GQE 0.771 0.772 0.801 0.798 LS + R + GQE 0.769 0.793 0.807 0.828 GS + R + LQE 0.782 0.757 0.835 0.795 LS + R + LQE 0.788 0.786 0.848 0.833
value means faster computation (more candidate windows are discarded). We set th = 0.4 since it provides a small improvement from 0.80 to 0.84 in P @10 for the Oxford 5k dataset.
Query expansion is later applied considering the top-10 images of the resulting ranking. This section evaluates the impact in performance of both re-ranking and query expansion stages. Table 4.3 contains the results for the di↵erent stages in the pipeline for both simple and augmented queries (referred to as Qaug in the table).
The results indicate that the local re-ranking is only beneficial when applied to a ranking obtained from a search using the global BoW descriptor of the query image (GS). This is consistent with the work by Tolias et al. [125], who also apply a spatial re-ranking followed by query expansion to a ranking obtained with a search using descriptors of full images. They achieve a mAP of 0.66 in Oxford 5k, which is increased to 0.77 after spatial re-ranking and query expansion, while we reach similar results (e.g. from 0.652 to 0.769). However, our results indicate that a ranking originating from a local search (LS) does not benefit from local re-ranking. Since the BoW representation allows us to e↵ectively perform a local search (LS) in a
Figure 4.7: Examples of the top-ranked images and localizations based on local CNN features encoded with BoW. Top row: Christ Church from the Oxford Buildings dataset; middle row: Sacre Coeur from Paris Buildings; bottom row: query 9098 (a parking sign) from TRECVID INS 2013.
be redundant in terms of the achieved quality of the ranking. However, the local re-ranking stage does provide a rough localization of the object in the images of the ranking, as depicted in Figure 4.7. We use this information to perform query expansion based on local features (LQE).
Results indicate that query expansion stages greatly improve performance in Oxford 5k. We do not observe significant gains after re-ranking and QE in the Paris 6k dataset, although we achieve our best result with LS + R + LQE.
In the case of augmented queries (+Qaug), we find query expansion to be less
helpful in all cases, which suggests that the information gained with query aug- mentation and the one obtained by means of query expansion strategies are not complementary.