the final image signature. We may thus include this work in the third category, of “approaches that make use of saliency to segment the image into foreground and background, and separately process the features in the two regions”. Within this category, the proposed approach is to the best of our knowledge the first to make use of bottom-up saliency operators to define the foreground and background regions.
A saliency function operating directly in the space of the local features extracted from the image was also proposed in Walker et al. [1998]. Unlike that method, our approach makes use of the AIM framework [Bruce and Tsotsos, 2005], turning the multi-dimensional joint distribution estimation problem into a set of independent one-dimensional estimation problems.
This Chapter is based on the work first presented in Fornoni and Caputo [2012].
3.3 The proposed approach
Let us assume that an image (of height $h$ and width $w$) is represented by a matrix $X = [x_1, \dots, x_r]^T \in \mathbb{R}^{r \times d}$ of $d$-dimensional local descriptors. Let us also assume to have a visual vocabulary $V \in \mathbb{R}^{d \times k}$ (where $k$ is the number of visual words), used to encode $X$ into an intermediate representation $C = [c_1, \dots, c_r]^T \in \mathbb{R}^{r \times k}$. A histogram of visual words over a region $R \subseteq \{1, 2, \dots, r\}$ can then be computed as the average code $\bar{c}_R = \frac{1}{|R|} \sum_{i \in R} c_i$ (assuming $c_i \geq 0$ and $\|c_i\|_1 = 1$). This corresponds to applying the average pooling approach discussed in Section 2.3.3 to the region $R$.
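The average pooling step described above can be illustrated with a minimal NumPy sketch. The function name `average_pool` and the toy code matrix are hypothetical and chosen only for illustration; indices are 0-based here, whereas the text uses 1-based indices.

```python
import numpy as np

def average_pool(C, region):
    """Average the codes c_i over the descriptor indices listed in `region`."""
    region = np.asarray(region, dtype=int)
    if region.size == 0:
        raise ValueError("pooling region must be non-empty")
    return C[region].mean(axis=0)

# Toy codes (illustrative only): r = 4 descriptors, k = 3 visual words.
# Every row is non-negative and has unit L1 norm, as assumed in the text.
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])

# Region R = {1, 3} in the 1-based notation of the text.
hist = average_pool(C, [0, 2])
```

Because every code has unit L1 norm, the pooled histogram also sums to one, whatever the size of the region.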
Here we focus on ways to define sets of pooling regions $(R_1, R_2, \dots, R_l)$ that are non-overlapping and span the full image:

$$|R_i| = \lambda_i r, \quad \sum_{i=1}^{l} \lambda_i = 1, \quad 0 < \lambda_i < 1 \quad \text{and} \quad \forall i, j \in \{1, \dots, l\},\ i \neq j: \ R_i \cap R_j = \emptyset, \qquad (3.1)$$

so that $\sum_{i=1}^{l} |R_i| = r$ and $\bigcup_{i=1}^{l} R_i = \{1, 2, \dots, r\}$. We call $\lambda_i$ the mass coefficients, as they define how the image area is divided into the $l$ regions. For example, if $l = 2$ and $\lambda_1 = \lambda_2 = \frac{1}{2}$, the image area is equally split between two regions $R_1$ and $R_2$. If $\lambda_1 < \frac{1}{2}$ then $R_2$ covers a larger part of the image, and vice versa, if $\lambda_1 > \frac{1}{2}$, then the larger part is reserved for $R_1$.
Our strategy to obtain robust image descriptors consists of two different approaches, expected to be complementary:
1. Saliency-driven Perceptual Pooling (SPP). We define the pooling regions by using a saliency operator. This approach captures the perceptual regularities in the scenes, regardless of their exact position in the scene.
2. Task-driven Spatial Pooling (TSP). We define pooling regions with a fixed relative position in the image. This approach captures the spatial regularities of the scenes.
3.3.1 Saliency-driven Perceptual Pooling (SPP)
Traditional spatial encodings, designed to capture the spatial regularities in the scenes, partition the image using a regular grid and pool the features in the resulting patches. Instead of imposing an a-priori segmentation, we would like to let visual structures emerge from the images, regardless of their exact positions in the scene. Specifically, we aim to obtain a segmentation of the image into two regions $(R_1, R_2)$, such that $R_2$ captures the area of the image with a richer informative content, leaving to $R_1$ the task of collecting the statistics of the remaining part. To this end, we propose to compute a saliency map $z \in \mathbb{R}^r$ for each image and use a threshold $\bar{z}$ to segment the image into two regions:

1. $R_1 = \{1 \leq i \leq r : z(x_i) \leq \bar{z}\}$,
2. $R_2 = \{1 \leq i \leq r : z(x_i) > \bar{z}\}$,

where $z(x_i)$ is the value of the saliency map at the local descriptor $x_i$. We propose to select $\bar{z}$ so that $R_1$ and $R_2$ satisfy the conditions in equation (3.1) for a given value of the mass coefficient $\lambda_1$. Note that, due to the conditions in equation (3.1), $\lambda_2$ is constrained to take the value $\lambda_2 = 1 - \lambda_1$. For example, if $\lambda_1 = \frac{1}{2}$, then $\bar{z}$ is the median saliency value of the image, while if $\lambda_1 \neq \frac{1}{2}$, the image is asymmetrically split between $R_1$ and $R_2$. To compute the saliency map, we tested two different approaches.
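The threshold selection described above can be sketched as follows. This is only an illustrative reading of the rule: the function name `saliency_split` is hypothetical, and choosing the threshold by rank (so that ties in the saliency values cannot unbalance the two regions) is an assumption of this sketch, not a detail stated in the text.

```python
import numpy as np

def saliency_split(z, lam1=0.5):
    """Split descriptor indices into a non-salient region R1 and a salient region R2.

    R1 receives the round(lam1 * r) least salient descriptors, so that
    |R1| is (up to rounding) lam1 * r and |R2| = r - |R1|.
    """
    z = np.asarray(z, dtype=float)
    r = z.size
    n1 = int(round(lam1 * r))
    if not 0 < n1 < r:
        raise ValueError("lam1 must leave both regions non-empty")
    # Ascending saliency; the stable sort breaks ties by descriptor index
    # (an assumption of this sketch).
    order = np.argsort(z, kind="stable")
    R1 = np.sort(order[:n1])
    R2 = np.sort(order[n1:])
    z_bar = z[order[n1 - 1]]  # largest saliency value assigned to R1
    return R1, R2, z_bar

# Toy saliency values for r = 6 descriptors (illustrative only).
z = np.array([0.1, 0.9, 0.4, 0.7, 0.2, 0.8])
R1, R2, z_bar = saliency_split(z, lam1=0.5)
```

With distinct saliency values this coincides with thresholding at $\bar{z}$: every descriptor with $z(x_i) \leq \bar{z}$ falls in $R_1$ and all the others in $R_2$.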
Itti’s Saliency
One of the most established and most widely known saliency operators is the one proposed by Itti et al. [1998]. In this model the saliency map $z$ of a given image is computed by performing center-surround operations $O_i(c, s) = |C_i(c) \ominus C_i(s)|$, where $c$ and $s = c + \delta$ are two different scales in a Gaussian pyramid, while $\{C_i\}_{i=1}^{3}$ are three different image channels: an intensity channel $C_1$, a color channel $C_2$ and an orientation channel $C_3$. The responses $O_i$ from the different channels are then normalized and averaged, to get the final saliency score for each pixel. In our experiments we made use of the implementation of Harel [2006].
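To make the center-surround operation more concrete, the following is a heavily simplified sketch for a single channel. It is not the implementation used in the experiments: it assumes 2x2 block averaging in place of a true Gaussian pyramid, nearest-neighbour upsampling for the across-scale difference, and it omits the normalization and the averaging over channels. All function names are hypothetical.

```python
import numpy as np

def downsample2(img):
    """Halve the resolution by averaging 2x2 blocks (stand-in for one Gaussian pyramid step)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def pyramid(img, levels):
    """Build `levels` scales; the image sides must be divisible by 2**(levels - 1)."""
    img = np.asarray(img, dtype=float)
    step = 2 ** (levels - 1)
    if img.shape[0] % step or img.shape[1] % step:
        raise ValueError("image sides must be divisible by 2**(levels - 1)")
    out = [img]
    for _ in range(levels - 1):
        out.append(downsample2(out[-1]))
    return out

def center_surround(pyr, c, delta):
    """|C(c) (-) C(s)| with s = c + delta: upsample the coarse scale, subtract point by point."""
    fine, coarse = pyr[c], pyr[c + delta]
    factor = 2 ** delta
    up = np.kron(coarse, np.ones((factor, factor)))  # nearest-neighbour upsampling
    return np.abs(fine - up)

# Toy intensity channel (illustrative only): a small bright patch on a dark background.
img = np.zeros((16, 16))
img[8:10, 8:10] = 1.0
pyr = pyramid(img, levels=3)
O = center_surround(pyr, c=0, delta=2)
```

In this toy example the response is largest on the bright patch, which differs most from its coarse-scale surround.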
SIFT Saliency
Instead of using a saliency operator on the raw pixel data, it could be desirable to design a saliency function able to make use of the rich information already encoded in the pre-computed local descriptors. In this way, the salient / non-salient discrimination could be performed directly on the local descriptors that are to be pooled, ensuring a higher consistency between the segmentation and the actual image representation used in the pooling step. A saliency operator that can enable a feature-based saliency estimation is the AIM model (Attention based on Information Maximization) [Bruce and Tsotsos, 2005]. Here, the probability of each pixel is locally estimated by non-parametrically fitting a distribution over the RGB values of the image. Since there is not enough data in an image to reliably estimate the joint distribution of the RGB values, the authors propose to make use of Independent Component Analysis (ICA) [Hyvärinen and Oja, 2000] to turn the three-dimensional joint distribution estimation problem into a set of three independent one-dimensional estimation problems. Specifically, let $\{x_1, x_2, \dots, x_t\}$ be a set of $t$ image descriptors with dimensionality $d$ (e.g. $d = 3$ for image pixels), sampled from a training set. The goal of ICA is to find a basis $A \in \mathbb{R}^{d \times d}$ and a matrix of components $S = [s_1, \dots, s_t]^T \in \mathbb{R}^{t \times d}$ such that $\forall i \in \{1, \dots, t\}$, $x_i = A s_i$ and the coefficients $(s_i)_j$ are as statistically independent as possible. Once the basis $A$ has been computed, its inverse $W = A^{-1}$ can be used to project new data into the independent components space.
In this work we propose to apply the AIM technique to the image descriptors extracted from the original images. Specifically, we propose to compute the AIM saliency of the low-level SIFT local descriptors that are to be pooled, and use it to output a low-resolution saliency map. Similarly to AIM, after computing the ICA projection $\tilde{X} = [\tilde{x}_1, \dots, \tilde{x}_r]^T = X W^T$ of an image $X$ (in our case a matrix of SIFT local descriptors), we make use of the independence assumption to estimate the local density of the $j$-th dimension of a descriptor $i$ as

$$p\big((\tilde{x}_i)_j\big) = \frac{1}{r} \sum_{k=1}^{r} K\big((\tilde{x}_i)_j - (\tilde{x}_k)_j\big), \qquad (3.2)$$

where $K(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right)$ is a one-dimensional standard normal probability density function. The saliency of the (projected) local descriptor $\tilde{x}_i$ is then computed as:

$$z(\tilde{x}_i) = -\sum_{j=1}^{d} \log p\big((\tilde{x}_i)_j\big) \qquad (3.3)$$
and a first saliency map is obtained by computing the responses for all the $r$ SIFT descriptors of the image. Since the SIFT descriptors are computed on a regular grid with a large spacing (e.g., 8 pixels), this procedure results in a low-resolution¹ saliency map, with sharp variations between neighboring points. A smoother map is finally obtained by convolving the initial response with a Gaussian filter, with $\sigma = 0.04 \cdot \max(h, w)$. This value has recently been shown to provide the best results when predicting human fixations with the original AIM model (Figure 8 of [Hou et al., 2012]), and preliminary experiments confirmed it to be a reasonable choice with our setup as well.
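The density estimate of equation (3.2) and the saliency score of equation (3.3) can be sketched in NumPy as below. The sketch assumes that an unmixing matrix $W$ is already available (learning it with ICA is not shown, and the identity matrix in the toy example is only a placeholder); the function name `aim_saliency` is hypothetical, and the final Gaussian smoothing of the map is omitted.

```python
import numpy as np

def aim_saliency(X, W):
    """Saliency of each descriptor as in Eqs. (3.2)-(3.3), given an unmixing matrix W."""
    Xt = X @ W.T                      # projected descriptors, shape (r, d)
    r, d = Xt.shape
    z = np.zeros(r)
    for j in range(d):                # one dimension at a time keeps memory at O(r^2)
        col = Xt[:, j]
        diff = col[:, None] - col[None, :]                     # pairwise differences, (r, r)
        K = np.exp(-0.5 * diff ** 2) / np.sqrt(2.0 * np.pi)    # standard normal kernel
        p = K.mean(axis=1)            # Eq. (3.2): density of dimension j for each descriptor
        z -= np.log(p)                # Eq. (3.3): accumulate the negative log-density
    return z

# Toy data (illustrative only): four similar descriptors and one outlier.
X = np.array([[0.0, 0.0],
              [0.1, -0.1],
              [-0.1, 0.1],
              [0.05, 0.0],
              [5.0, 5.0]])
W = np.eye(2)                         # placeholder for a learned ICA unmixing matrix
z = aim_saliency(X, W)
```

In this toy example the outlier lies in a low-density area of both dimensions and therefore receives the highest saliency score.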
In Figure 3.4 we visualize the 128 SIFT Independent Components (as computed using SIFT patches sampled from one training split of the MIT-Indoor-67 [Quattoni and Torralba, 2009] dataset), together with an example of how a SIFT Saliency map is formed. As expected, this saliency operator takes into account only the textural information provided by the SIFT features, while disregarding other channels, like color and intensity. For example, the grating on the window turns out to be as salient as the lit lamp on the night table. While this may not be a problem for our scene recognition goal (i.e. a light source might not be more discriminative than a window grating), it might be of limited use for other tasks, like human fixation prediction, and we do not claim any biological plausibility for it.

1. For our segmentation and pooling goal we don’t need a higher resolution map, since the local descriptors are computed with the same resolution (e.g., one every 8 pixels).

Figure 3.4 – Top: visualization of the 128 SIFT Independent Components, summed over the 8 orientations; white pixels correspond to high ICA (rectified) weights for the gradients in the corresponding area of the SIFT patches. Bottom: Computation of a SIFT saliency map and resulting segmentation using $l = 2$ regions and $\lambda_1 = \lambda_2 = \frac{1}{2}$.
A comparison of the histograms obtained using SPP with Itti’s and SIFT saliency is shown in Figure 3.5. In the same Figure we also plot the average number of non-zero visual words in the histograms computed over the salient and the non-salient regions. As can be seen, for both Itti’s and SIFT saliency, the histograms computed over the salient regions are less peaked and contain a high number of non-zero visual words. In contrast, the histograms computed over the non-salient regions are peaked around a few active visual words. In other words, the salient regions have a high visual complexity, while the non-salient ones capture more uniform areas that are well described by only a few visual words. In Section 3.4.2 this observation will be empirically confirmed for all the scene recognition datasets.
Figure 3.5 – Histograms obtained (with $l = 2$ regions and $\lambda_1 = \lambda_2 = \frac{1}{2}$) using different pooling techniques and number of non-zero visual words in each of the two halves of the histograms: non-salient (NS) and salient (S), left (L) and right (R), up (U) and down (D). [Panels: Salient Pooling (Itti), Salient Pooling (SIFT), Vertical Pooling, Horizontal Pooling; plotted quantities: Histogram, # of non-zero words.]
3.3.2 Task-driven Spatial Pooling (TSP)
In the previous Section we defined a pooling strategy conceived to capture perceptually cohesive structures in the images, regardless of their exact position. In this Section we discuss a simple spatial pooling scheme suitable for scene recognition problems.
Indoor scenes, for example, are designed to support human actions and humans have a limited range of spatial mobility. Indeed, humans can usually walk around a room, use objects and appliances within reach, sit on chairs, etc., but they cannot easily move from the floor to the ceiling, or access facilities if they are placed too low or too high in the room. This restricts the spatial variability of indoor scenes mostly to the horizontal axis. Due to the effect of gravity, similar considerations may also be drawn for outdoor scenes.
Given this prior, we expect that by pooling features in horizontal bands we will be able to capture the most consistent spatial patterns in scene recognition problems. We instead expect much less robust results by pooling descriptors in vertical bands. To verify this intuition we performed a first set of experiments using only $l = 2$ regions, $R_1$ and $R_2$, with $\lambda_1 = \lambda_2 = \frac{1}{2}$:

1. Horizontal-bands Pooling (Horizontal). In this setting $R_1$ is the set of local descriptors lying in the upper 50% of the image, and $R_2$ is its complement.
2. Vertical-bands Pooling (Vertical). In this case $R_1$ consists of the left-side 50% of the descriptors, and $R_2$ is its complement.
A visualization of these pooling strategies, together with a comparison of the resulting histograms with the ones obtained using SPP, is shown in Figure 3.5. The experimental results are presented in Section 3.4.3. Results using $l = 3$ horizontal bands are also provided in Sections 3.4.2 and 3.4.3.
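The two band-pooling schemes can be sketched as an assignment of descriptor positions to bands. The sketch assumes equal-width bands and descriptors sampled on a regular grid, so that the bands hold approximately equal numbers of descriptors; the function name `band_regions` and the toy grid are hypothetical.

```python
import numpy as np

def band_regions(coords, extent, n_bands=2):
    """Assign each descriptor to one of `n_bands` equal-width bands along one axis.

    coords: positions of the descriptors along the axis (row coordinates for
            horizontal bands, column coordinates for vertical bands).
    extent: image size along that axis (h or w).
    Returns a list of index arrays, one per band.
    """
    coords = np.asarray(coords, dtype=float)
    band = np.minimum((coords * n_bands / extent).astype(int), n_bands - 1)
    return [np.flatnonzero(band == b) for b in range(n_bands)]

# Toy layout (illustrative only): a 32x32 image with descriptors every 8 pixels.
h = w = 32
grid = np.arange(4, 32, 8)                        # positions 4, 12, 20, 28
ys, xs = np.meshgrid(grid, grid, indexing="ij")
ys, xs = ys.ravel(), xs.ravel()

upper, lower = band_regions(ys, h)                # Horizontal: R1 = upper half
left, right = band_regions(xs, w)                 # Vertical:   R1 = left half
```

Calling `band_regions(ys, h, n_bands=3)` would give three horizontal bands in the same way.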
3.3.3 Integrating Saliency-driven and Task-driven pooling
Once the saliency-driven and the task-driven image descriptors have been computed, we concatenate them to create a compact image signature that exploits both the perceptual and the spatial consistencies of the scenes. Since our main candidate for the spatial representation is the Horizontal scheme, we only integrate SPP with the Horizontal image descriptor. A multiresolution [Hadjidemetriou et al., 2004] version of our image descriptor is also formed by down-sampling each image by a factor of two, and concatenating the histograms obtained at the two resolutions.
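The concatenation step can be sketched as follows, under the assumption that each descriptor (SPP or Horizontal) is the concatenation of the average codes of its regions. The function names, the toy codes and the toy regions are hypothetical.

```python
import numpy as np

def pooled_signature(C, regions):
    """Concatenate the average codes of the given regions into one vector."""
    return np.concatenate([C[np.asarray(R, dtype=int)].mean(axis=0) for R in regions])

def combined_signature(C, spp_regions, tsp_regions):
    """Concatenate the saliency-driven and the task-driven descriptors."""
    return np.concatenate([pooled_signature(C, spp_regions),
                           pooled_signature(C, tsp_regions)])

# Toy codes (illustrative only): r = 4 descriptors, k = 3 visual words.
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])
spp_regions = [[0, 1], [2, 3]]    # hypothetical non-salient / salient split
tsp_regions = [[0, 2], [1, 3]]    # hypothetical upper / lower split
signature = combined_signature(C, spp_regions, tsp_regions)
```

With $l = 2$ regions per scheme the signature has $4k$ entries; a multiresolution variant would apply the same function to the codes of the down-sampled image and concatenate the two results.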