Descriptor Blocks - Implementation and Performance Study

4.3 Implementation and Performance Study

4.3.5 Descriptor Blocks

Section 4.1 presented two classes of block geometries – square or rectangular R-HOGs partitioned into grids of square or rectangular spatial cells, and circular C-HOG blocks partitioned into cells in log-polar fashion – and two kinds of normalisation: the block normalisation of R-HOG or C-HOG and the centre-surround normalisation of centre-surround HOG. This section studies the performance of R-HOG, C-HOG and centre-surround HOG as the descriptor parameters vary.

R-HOG.

Figure 4.5(a) plots the miss rate at 10⁻⁴ FPPW with respect to the cell size in pixels and the block size in cells. For human detection, 3×3 cell blocks of 6×6 pixel cells perform best, with a 10.4% miss rate at 10⁻⁴ FPPW. Our standard 2×2 cell blocks of 8×8 pixel cells are a close second. In fact, 6–8 pixel wide cells do best irrespective of the block size – an interesting coincidence, as human limbs are about 6–8 pixels across in our images (see Fig. 4.5(b)). We find that 2×2 and 3×3 cell blocks work best. Adaptivity to local imaging conditions is weakened when the block becomes too big, and when it is too small (a 1×1 cell block, i.e. normalisation over the cell orientation histogram alone) valuable spatial information is suppressed.
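The loss of spatial information with 1×1 blocks can be seen on a toy example. The sketch below is only an illustration: the L2 normalisation with a small constant `eps`, the dense placement of blocks at every cell offset, and the helper names `l2_normalise` and `normalise_blocks` are assumptions made for the example, not the exact scheme evaluated here.

```python
import math

def l2_normalise(vec, eps=1e-6):
    """Divide a vector by its L2 norm; eps (an assumed constant) avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]

def normalise_blocks(cells, block):
    """cells: 2-D grid (list of rows) of per-cell orientation histograms.
    block: block side in cells. Blocks are placed at every cell offset, and the
    concatenated histograms of each block are normalised together."""
    rows, cols = len(cells), len(cells[0])
    out = []
    for r in range(rows - block + 1):
        for c in range(cols - block + 1):
            vec = [v for dr in range(block) for dc in range(block)
                   for v in cells[r + dr][c + dc]]
            out.append(l2_normalise(vec))
    return out

# Toy 2x2 grid of 2-bin histograms: same edge orientation, different contrast.
cells = [[[8.0, 0.0], [2.0, 0.0]],
         [[4.0, 0.0], [1.0, 0.0]]]

per_cell = normalise_blocks(cells, block=1)   # four 1x1 blocks
per_block = normalise_blocks(cells, block=2)  # one 2x2 block
```

With 1×1 blocks every cell collapses to approximately the same unit vector, so the relative contrast between neighbouring cells is lost. The single 2×2 block keeps the 8:2:4:1 ratios between its cells while still removing the overall scale.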

As in Lowe [2004], down-weighting pixels near the edges of the block by applying a Gaussian spatial window to each pixel before accumulating orientation votes into cells improves performance – here by 1% at 10⁻⁴ FPPW for a Gaussian with σ = 0.5 × block width.
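A minimal sketch of such a window follows, assuming an isotropic, unnormalised Gaussian centred on the block centre. The function name `gaussian_block_window` and the choice of pixel-centre coordinates are illustrative assumptions.

```python
import math

def gaussian_block_window(block_width, sigma_factor=0.5):
    """Return a block_width x block_width grid of weights
    w(x, y) = exp(-((x - c)^2 + (y - c)^2) / (2 sigma^2)),
    with sigma = sigma_factor * block_width and c the block centre."""
    sigma = sigma_factor * block_width
    centre = (block_width - 1) / 2.0
    return [[math.exp(-((x - centre) ** 2 + (y - centre) ** 2) / (2.0 * sigma ** 2))
             for x in range(block_width)]
            for y in range(block_width)]

# A 16x16 pixel block (2x2 cells of 8x8 pixels), so sigma = 8 pixels.
window = gaussian_block_window(16)
corner, middle = window[0][0], window[8][8]
```

Each pixel's orientation vote would then be its gradient magnitude multiplied by `window[y][x]`, so that pixels near the block border contribute less than pixels near the centre.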

Multiple block types with different cell and block sizes can be included to provide encoding at multiple scales. Augmenting the feature space by including 3×3 cell blocks in addition to the 2×2 cell ones (cell size η = 8×8) improves performance only marginally. Including cells and blocks at different scales (η = 8×8, ς = 2×2 and η = 4×4, ς = 3×3) improves performance by around 3% at 10⁻⁴ FPPW, cf. Fig. 4.6(a). However, the combination of η = 8×8, ς = 2×2 and η = 4×4, ς = 4×4 – with a block size of 16×16 pixels in both cases – brings only slight improvement. This suggests that multiple block encoding should target both different cell (spatial pooling) and block (normalisation) sizes. However, such multiple encodings greatly increase the descriptor size, so it might be preferable to perform multilevel encoding using a feature selection mechanism such as AdaBoost in order to avoid explicit encoding of excessively large feature vectors.
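To give a feel for how quickly the descriptor grows, the following sketch counts feature dimensions for a 64×128 window. It assumes 9 orientation bins, a block stride of half the block size (the setting stated for Fig. 4.5) and no window padding; whether the multi-scale experiments used exactly this stride is an assumption, and `rhog_descriptor_size` is a hypothetical helper.

```python
def rhog_descriptor_size(cell, block, bins=9, window=(64, 128)):
    """Number of feature dimensions for one R-HOG block type.
    cell: cell side in pixels; block: block side in cells.
    Assumes a block stride of half the block size and no padding."""
    block_px = cell * block
    stride = block_px // 2
    nx = (window[0] - block_px) // stride + 1
    ny = (window[1] - block_px) // stride + 1
    return nx * ny * block * block * bins

single = rhog_descriptor_size(cell=8, block=2)  # 7 * 15 blocks * 36 = 3780
extra = rhog_descriptor_size(cell=4, block=3)   # 9 * 20 blocks * 81 = 14580
combined = single + extra                       # 18360
```

Under these assumptions, adding the second block type makes the descriptor almost five times larger, which is the kind of growth that motivates selecting features rather than encoding all of them explicitly.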

Besides the square R-HOG blocks, we also tested vertical (2×1 cell) and horizontal (1×2 cell) blocks and a combined descriptor including both vertical and horizontal pairs. Figure 4.6(b) presents the results. Vertical and vertical+horizontal pairs are significantly better than horizontal pairs alone, but not as good as 2×2 or 3×3 cell blocks: performance drops by 1% at 10⁻⁴ FPPW.

[Figure 4.5: (a) miss rate (%) as a function of cell size (4×4 to 12×12 pixels) and block size (1×1 to 4×4 cells); (b) annotated example image.]

Fig. 4.5. Effect of cell size and number of cells in the block on detection performance. (a) The miss rate at 10⁻⁴ FPPW as the cell and block sizes change. The stride (block overlap) is fixed at half of the block size. 3×3 blocks of 6×6 pixel cells perform best, with 10.4% miss rate. (b) Interestingly, human limbs in 64×128 pixel normalised images are also around 6 pixels wide and 18 pixels long.

[Figure 4.6: DET curves, miss rate versus false positives per window (FPPW). (a) Effect of multiple cell blocks: η=8×8, ς=2×2; η=8×8, ς=2×2,3×3; η=8×8,4×4, ς=2×2,3×3; η=8×8,4×4, ς=2×2,4×4. (b) Effect of horizontal and vertical cells: ς=2×2; ς=2×1; ς=1×2; ς=2×1,1×2.]

Fig. 4.6. The effect of multiple-scale and rectangular blocks on detection performance. (a) Multiple block types with different cell and block sizes improve performance. (b) Vertical (2×1) blocks are better than horizontal (1×2) blocks.

C-HOG.

The two variants of the C-HOG geometry, those whose central cell is divided into angular sectors (Fig. 4.1(b)) and those with a single circular central cell (Fig. 4.1(c)), perform equally well. We use the single circular-centre variant as our default C-HOG, as it has fewer spatial cells than the divided-centre one. At least two radial bins (a centre and a surround) and four angular bins (quartering) are needed for good performance. Including additional radial bins does not change the performance much, cf. Fig. 4.7(a), while increasing the number of angular bins decreases performance (by 1.3% at 10⁻⁴ FPPW when going from 4 to 12 angular bins), cf. Fig. 4.7(b). Figure 4.7(c) shows that a central bin of 4 pixels radius gives the best results, but 3 and 5 pixel radii give similar results. Increasing the expansion factor (log space radial increment) from 2 to 3 leaves the performance essentially unchanged. With these parameters, neither Gaussian spatial weighting (as in R-HOG) nor inverse weighting of cell votes by cell area changes the performance, but combining the two reduces it slightly. These values assume fine orientation sampling (we used 9 bins as default). Shape contexts (1 orientation bin) require much finer spatial subdivision to work well, usually 3 radial bins and as many as 12 angular bins, but their performance is still much lower than that of C-HOG. Figure 4.9 shows these results. Another variant of C-HOG, EC-HOG, uses binarised edges (i.e. thresholded gradients) to vote into the orientation histogram. Section 4.4 shows that such thresholding decreases performance.

[Figure 4.7: DET curves, miss rate versus false positives per window (FPPW). (a) Effect of number of distance bins: R=3, R=2, R=1. (b) Effect of number of angular bins: 4, 6, 8, 12. (c) Effect of central bin radius: 3, 4, 5, 6.]

Fig. 4.7. Performance variations as a function of the C-HOG descriptor parameters. All results are with 9 orientation bins. (a) The effect of the number of radial bins (R) on performance. (b) Increasing the number of angular bins decreases performance marginally. It is best to use 4 bins (quartering). (c) The optimal central bin radius is 4 pixels.
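The C-HOG parameters discussed above (central bin radius, expansion factor, number of radial and angular bins) can be made concrete with a small sketch of a log-polar cell lookup for the single circular-centre variant. The cell numbering, the angular origin along the positive x axis and the function name `chog_cell_index` are assumptions made for illustration; the defaults follow the values reported as working well.

```python
import math

def chog_cell_index(dx, dy, centre_radius=4.0, expansion=2.0,
                    n_radial=2, n_angular=4):
    """Map a pixel offset (dx, dy) from the block centre to a cell index.
    Cell 0 is the single circular central cell. Ring k (1 <= k < n_radial)
    spans radii [centre_radius * expansion**(k-1), centre_radius * expansion**k)
    and is split into n_angular equal sectors. Returns None outside the block."""
    r = math.hypot(dx, dy)
    if r < centre_radius:
        return 0
    outer = centre_radius
    for ring in range(1, n_radial):
        outer *= expansion  # radii grow geometrically (log-polar spacing)
        if r < outer:
            theta = math.atan2(dy, dx) % (2.0 * math.pi)
            sector = int(theta / (2.0 * math.pi) * n_angular) % n_angular
            return 1 + (ring - 1) * n_angular + sector
    return None

def chog_num_cells(n_radial=2, n_angular=4):
    """Cells per block: one central cell plus n_angular cells per outer ring."""
    return 1 + (n_radial - 1) * n_angular
```

With the defaults this gives 5 spatial cells per block, hence 45 values per block with 9 orientation bins, and a block radius of 8 pixels.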

Centre-surround HOG.

Figure 4.4(c) (“window norm”) shows that using centre-surround HOG decreases performance relative to the corresponding block based scheme (by 2% at 10⁻⁴ FPPW, for Gaussian pooling with σ = 1 cell width). The reason is that there are no longer any overlapping blocks, so each cell is coded only once in the final descriptor. Including several normalisations for each cell based on different pooling scales σ, but each centred on the cell location, provides no perceptible change in performance. So it seems that it is the existence of several pooling regions with different spatial offsets relative to the cell that is important in R-HOG, not the pooling scale. This point is further clarified in Sect. 4.5.
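A rough sketch of a centre-surround normalisation of this kind is given below. It assumes that each cell is divided by the square root of the Gaussian-weighted energy (sum of squared histogram entries) of the cells around it; the energy measure, the truncation radius, the constant `eps` and the function name `centre_surround_normalise` are all illustrative assumptions rather than the exact definition from Section 4.1.

```python
import math

def centre_surround_normalise(cells, sigma=1.0, radius=2, eps=1e-6):
    """cells: 2-D grid (list of rows) of per-cell orientation histograms.
    Each cell is divided by the square root of the Gaussian-weighted energy
    of its neighbourhood. sigma and radius are in cell widths."""
    rows, cols = len(cells), len(cells[0])
    energy = [[sum(v * v for v in cells[r][c]) for c in range(cols)]
              for r in range(rows)]
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            pooled = 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        w = math.exp(-(dr * dr + dc * dc) / (2.0 * sigma ** 2))
                        pooled += w * energy[rr][cc]
            scale = math.sqrt(pooled + eps)
            out_row.append([v / scale for v in cells[r][c]])
        out.append(out_row)
    return out

cells = [[[3.0, 1.0], [2.0, 2.0], [0.0, 4.0]],
         [[1.0, 1.0], [5.0, 0.0], [2.0, 1.0]],
         [[0.0, 2.0], [1.0, 3.0], [4.0, 4.0]]]
normalised = centre_surround_normalise(cells)
```

The output has exactly one histogram per input cell, in contrast to overlapping blocks, where each cell contributes to several differently normalised blocks.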

The centre-surround scheme has other advantages, notably that it can be optimised for much faster run-time. Recently, Zhu et al. [2006] showed that centre-surround HOG, in conjunction with integral histograms [Porikli 2005] and AdaBoost [Freund and Schapire 1996a,b, Schapire 2002], can be used to build a near real-time filter cascade style detector.
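The speed-up from integral histograms comes from precomputing one cumulative table per orientation bin, after which the histogram of any rectangle costs four lookups per bin. The sketch below illustrates the idea under simplifying assumptions: each pixel votes into a single bin with no interpolation, and the names `build_integral_histogram` and `region_histogram` are hypothetical.

```python
def build_integral_histogram(bin_index, magnitude, n_bins):
    """bin_index[y][x]: orientation bin of pixel (x, y); magnitude[y][x]: its vote.
    Returns ih with ih[y][x][b] = sum of votes for bin b over all pixels in
    rows < y and columns < x (one extra leading row and column of zeros)."""
    h, w = len(bin_index), len(bin_index[0])
    ih = [[[0.0] * n_bins for _ in range(w + 1)] for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            cur = ih[y + 1][x + 1]
            for b in range(n_bins):
                cur[b] = ih[y][x + 1][b] + ih[y + 1][x][b] - ih[y][x][b]
            cur[bin_index[y][x]] += magnitude[y][x]
    return ih

def region_histogram(ih, x0, y0, x1, y1):
    """Histogram of the rectangle [x0, x1) x [y0, y1) from four lookups per bin."""
    n_bins = len(ih[0][0])
    return [ih[y1][x1][b] - ih[y0][x1][b] - ih[y1][x0][b] + ih[y0][x0][b]
            for b in range(n_bins)]

# Toy 4x4 image with 3 orientation bins and unit gradient magnitudes.
bin_index = [[0, 1, 0, 1],
             [1, 1, 0, 0],
             [2, 2, 1, 0],
             [0, 2, 2, 1]]
magnitude = [[1.0] * 4 for _ in range(4)]
ih = build_integral_histogram(bin_index, magnitude, n_bins=3)
centre_cell = region_histogram(ih, 1, 1, 3, 3)  # 2x2 cell in the middle
```

Because the cost of a cell histogram no longer depends on the cell area, cells of many sizes and positions can be evaluated cheaply, which suits a cascade that tests a large pool of candidate features.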
