Evaluation on PASCAL VOC 2007 - Robust aggregation of local image descriptors for visual search

6.2 Experiments

6.2.1 Evaluation on PASCAL VOC 2007

We perform experiments on the PASCAL VOC 2007 dataset [32] to optimise our classification pipeline. The dataset consists of about 10k images covering twenty object classes. We follow the standard experimental procedure, which consists of training and validating on the 5011 training images and testing on the 4952 test images. The parameters of our method (dictionaries, cluster-level PCA and whitening matrix) are learned using the canonical training subset. For each category, a linear one-vs-all SVM classifier [89] is trained using the default VLFEAT hyper-parameters (C = 10 and e = 100 epochs), and performance is measured as mean Average Precision (mAP) over the 20 classes.
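The evaluation measure can be sketched as follows. This is a minimal illustration, assuming the 11-point interpolated average precision conventionally used with VOC 2007; the text does not state which AP variant was used, and the function names are hypothetical.

```python
def average_precision_11pt(scores, labels):
    """11-point interpolated average precision for one class (assumed AP variant).

    scores: classifier outputs, higher means more confident.
    labels: 1 for positive images, 0 for negative images.
    """
    # Rank images by decreasing classifier score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0

    precisions, recalls = [], []
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        precisions.append(true_pos / rank)
        recalls.append(true_pos / n_pos)

    # Average the best precision reachable at recall >= t for t = 0.0, 0.1, ..., 1.0.
    total = 0.0
    for i in range(11):
        t = i / 10
        candidates = [p for p, r in zip(precisions, recalls) if r >= t]
        total += max(candidates) if candidates else 0.0
    return total / 11


def mean_average_precision(per_class_scores, per_class_labels):
    """Mean of the per-class APs (20 classes for PASCAL VOC 2007)."""
    aps = [average_precision_11pt(s, l)
           for s, l in zip(per_class_scores, per_class_labels)]
    return sum(aps) / len(aps)
```

Each one-vs-all classifier contributes one list of scores over the test images, so the two arguments of `mean_average_precision` would each hold 20 lists in this setting.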

SIFT vs RootSIFT

In our first experiment we investigate the benefits provided by the RootSIFT operation. Table 6.3 demonstrates that the conversion improves the classification performance of both the RVD-W and FV representations. SIFT descriptors are transformed to RootSIFT in two steps: the SIFT vector is first L1-normalized, and the square root is then applied to each element individually.
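The two-step conversion can be sketched as below. This is an illustrative plain-Python version, not the implementation used in the experiments; the function name and the small `eps` guard against all-zero descriptors are assumptions.

```python
import math


def root_sift(descriptor, eps=1e-12):
    """Convert one SIFT descriptor (non-negative values) to RootSIFT."""
    # Step 1: L1-normalize so that the elements sum to (approximately) one.
    l1 = sum(descriptor) + eps
    # Step 2: take the square root of each element individually.
    return [math.sqrt(v / l1) for v in descriptor]


example = root_sift([1.0, 3.0, 0.0, 12.0])
```

Because the squared elements of the output sum to the L1-normalized total, the resulting vector has (up to `eps`) unit L2 norm without a separate normalization step.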

Figure 6.4: Classification performance (mAP, %) of RVD-W and FV as a function of (a) SIFT dimensionality after reduction (d) and (b) vocabulary size (n).

PCA on SIFT descriptors

In previous chapters, we have shown the benefit of SIFT dimensionality reduction, via the PCA transform, in the context of visual search. In this section we investigate the impact of dimensionality reduction on classification performance. The pipeline for the experiment is as follows. First, the dimensionality of the RootSIFT descriptors is reduced from 128 to d′ dimensions and a codebook of 256 cluster centres is learned. Second, the RVD-W and FV representations are computed and power+L2 normalization is applied. Finally, a linear one-vs-all SVM classifier is trained for each category and the performance is measured as mean Average Precision (mAP) over all categories. We vary the dimensionality of the descriptors after PCA (d′) and observe the effect on classification performance.

The results presented in Figure 6.4(a) show that dimensionality reduction is essential to obtain good classification performance for both representations. For RVD-W, the mAP is 56.5% without dimensionality reduction, whereas an mAP of 58.8% is achieved using only the 32 most energetic dimensions. The optimum performance of 62.3% is reached when the top 80 dimensions are retained. Similar behaviour is observed for the FV representation: the mAP is only 55.1% on full SIFT vectors (i.e. without applying the PCA transformation) and increases to 59.9% when the descriptor dimensionality is reduced to 80 dimensions. Compared to FV, RVD-W brings a consistent benefit of about 2% in mAP.

Figure 6.5: Performance (mAP, %) of RVD, RVD-P, RVD-W and FV on the PASCAL VOC 2007 dataset: (a) first order global representations, (b) first+second order global representations.

Impact of the codebook size

Another important study investigates the impact of codebook size. Figure 6.4(b) shows that the classification performance increases with the size of the codebook. For n = 512, RVD-W and FV obtain mAPs of 63.9% and 61.5% respectively. The slope of the curve indicates that further gains could be achieved by increasing the number of clusters even further. However, for high values of n, the dimensionality of the signature becomes prohibitively high. In all the following experiments, the size of the visual vocabulary is fixed to 256, as this value is considered a good trade-off between performance and complexity.

First order global descriptor

We now compare the performance of first order RVD, RVD-P, RVD-W and FV on the PASCAL VOC dataset. The first order FV for each image is computed using the VLFEAT toolbox [116]. All global representations undergo power normalization (α = 0.5) followed by L2-normalization. The dimensionality (D) of the global representations is thus 80 × 256 = 20480. It can be observed from Figure 6.5(a) that all representations based on the RVD framework perform significantly better than FV. Compared to FV, RVD-W offers a gain of 2.8% in mAP.
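The normalization step applied to every global representation can be sketched as follows. The code assumes the signed form of power normalization, sign(v)·|v|^α, which is the usual convention for Fisher vectors; the text only gives α = 0.5, and the function name and `eps` guard are hypothetical.

```python
import math


def power_l2_normalize(vector, alpha=0.5, eps=1e-12):
    """Signed power normalization followed by L2 normalization."""
    # Power normalization: keep the sign, raise the magnitude to alpha.
    powered = [math.copysign(abs(v) ** alpha, v) for v in vector]
    # L2 normalization: scale the vector to unit Euclidean length.
    norm = math.sqrt(sum(p * p for p in powered)) + eps
    return [p / norm for p in powered]


normalized = power_l2_normalize([4.0, -9.0, 0.0])
```

With α = 0.5 the large components are compressed relative to the small ones before the vector is scaled to unit length.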

Figure 6.6: Impact of Spatial Pyramid Matching (SPM) on the mAP (%) of RVD-W and FV on the PASCAL VOC dataset: (a) SPM2 with configuration 1×1, 3×1, (b) SPM3 with configuration 1×1, 3×1, 2×2.

Extending RVD with second order statistics

Here we extend the descriptors by adding second order statistics, as explained in section 6.1.3. The global representation is computed by concatenating the vectors ζ_j and then the vectors ζ_jc for each cluster, and the global descriptor is power normalized (α = 0.5). The dimensionality (D) of RVD, RVD-P, RVD-W and FV is equal to 2 × 80 × 256 = 40960. Results are shown in Figure 6.5(b), where it can be observed that second order statistics bring a significant gain in classification performance for all global representations. It can also be seen that RVD-W obtains an mAP of 62.3% compared to 59.9% for FV, a gain of 2.4%.
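The signature sizes quoted in this section follow from simple arithmetic, sketched below. The helper name is hypothetical; it only restates the products given in the text (d dimensions per cluster, n clusters, doubled when second order statistics are concatenated).

```python
def signature_dim(d, n, second_order=False):
    """Length of the global descriptor.

    d: local descriptor dimension after PCA
    n: number of codebook clusters
    second_order: whether second order statistics are concatenated
    """
    orders = 2 if second_order else 1
    return orders * d * n


# Growth of the first+second order signature with the codebook size, for d = 80.
for n in (32, 64, 128, 256, 512):
    print(n, signature_dim(80, n, second_order=True))
```

The linear growth in n is what makes large codebooks costly: doubling the vocabulary from 256 to 512 also doubles the length of every image signature.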

Spatial Pyramid Matching (SPM)

We discussed in section 6.1.4 that Spatial Pyramid Matching benefits RVD-W and FV by incorporating weak geometric information. In this section we evaluate the performance of RVD-W and FV using two spatial pyramid configurations, SPM2 and SPM3. In SPM2, an image is partitioned into 1×1 and 3×1 sub-regions, and the corresponding RVD-Ws are computed and concatenated. This creates a descriptor with an overall dimension of 4×D = 163840 elements. In SPM3, the pyramid divides each image into 1×1, 3×1 and 2×2 sub-regions (illustrated in Figure 6.7), resulting in an 8×D = 327680 dimensional RVD-W vector.

Figure 6.7: Spatial Pyramid Matching with configuration [1×1, 3×1, 2×2]
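The region counts and descriptor sizes of the two pyramids can be checked with a short sketch. It assumes that each level is written as rows × columns, so that 3×1 denotes three horizontal bands; the text does not state the orientation, and all names below are hypothetical.

```python
# Pyramid levels as (rows, columns); 3x1 is read here as three horizontal bands.
SPM2 = [(1, 1), (3, 1)]
SPM3 = [(1, 1), (3, 1), (2, 2)]

D = 2 * 80 * 256  # first+second order signature length given in the text


def num_regions(levels):
    """Total number of sub-regions over all pyramid levels."""
    return sum(rows * cols for rows, cols in levels)


def regions_for_point(x, y, width, height, levels):
    """Indices of every pyramid region that contains the keypoint (x, y)."""
    indices = []
    offset = 0
    for rows, cols in levels:
        # Clamp so that points on the bottom or right border stay inside the grid.
        row = min(int(y * rows / height), rows - 1)
        col = min(int(x * cols / width), cols - 1)
        indices.append(offset + row * cols + col)
        offset += rows * cols
    return indices


spm2_dim = num_regions(SPM2) * D
spm3_dim = num_regions(SPM3) * D
```

Each local descriptor falls into exactly one region per level, so it contributes to one signature per level; the per-region signatures are then concatenated into the final vector.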

Figure 6.6 shows that RVD-W performs significantly better than FV, bringing an improvement of 2.1% for the SPM2 configuration and 1.3% for the SPM3 configuration. A detailed class-by-class comparison of the RVD-W and FV descriptors is presented in Figure 6.8. For the majority of classes the best performance is achieved by RVD-W+SPM3; however, its representation is twice the size of that used by RVD-W+SPM2.
