Developing an effective image representation is a well-known problem in several areas of computer vision. Simple applications often make use of global representations such as colour histograms, but demanding tasks such as reconstructing pose or recognizing objects require more robust representations. A powerful way of achieving robustness is to combine the information contained in many local regions of an image in order to summarize its complete contents. Effective human pose estimation, as in many cases of object recognition, now relies on the ability of a method to key on only a subset of the constituent regions in an image and to successfully identify the rest of the image as being ‘irrelevant’ to the problem.¹
¹ Note that in some situations, an understanding of the background content may actually be useful for providing contextual information, but accommodating this is outside the scope of this thesis.
Limb detectors. Perhaps the most intuitive way to think of encoding an image of a person is as a collection of body limbs. These are natural constituent parts of the human body, and its pose directly depends on the relative configuration of these parts. So trying to deduce pose from an image given the location and orientation of each body part or limb is one possible approach, even though finer details of limb shape and appearance are extremely important in many cases. Detecting the presence of these limbs in an image, however, is a hard problem that involves learning a general appearance model for each body limb, and it is still an active area of research [120, 94, 137, 113]. The problem remains very difficult due to large appearance variations caused by clothing, lighting and occlusions (among other factors), and most state-of-the-art human part detectors have to be supplemented with a human body model to either constrain the search or refine the detections by incorporating spatial priors and inter-part interactions. In a model-free approach like ours, we prefer to explore lower-level image features that might be used directly to predict pose.
Figure 6.2: An overview of our method of pose estimation from cluttered images. (a) Original image; (b) a grid of fixed points where the descriptors are computed (each descriptor block covers an array of 4x4 cells, giving a 50% overlap with its neighbouring blocks); (c) SIFT descriptors computed at these points, the intensity of each line representing the weight of the corresponding orientation bin in that cell; (d) background suppression using a sparse set of learned (NMF) bases encoding human-like parts; (e) final pose obtained by regression.
Local patches. A very effective and widely used representation for images, a collection of image patches² can be used to include much more information than the locations and orientations of different limbs. Appropriately located patches at the right scale could encode the appearance of human body parts such as elbow joints, shoulder contours and possibly the outlines of parts like the head and hands, all of which contain important information concerning pose. In many problems, the contents of an image have been well summarized by describing the properties of a small set of patches centered at salient points of interest in the image [54, 70]. Several object recognition and scene classification methods, for instance, key on corners and blob-like regions for this purpose. Recently it has been shown that computing patch descriptors densely over an entire image, rather than sparsely at these interest points, actually provides a more powerful encoding (e.g. [69, 32]). The key to successfully taking advantage of such a representation, however, is to develop a learning method that automatically identifies the ‘salient’ patches (or features) from amongst this dense set. We usually make use of a large collection of labeled training data for this purpose.
6.2.1 Dense patches
Densely sampling patches from an image (for instance, every few pixels) will in general give quite a large number of patches per image. At first thought, this may seem to be a redundant representation, especially if these patches overlap (as is often the case). However, using such overlapping patches and robustly encoding the information in each patch with an appropriate descriptor has recently been shown to be an effective strategy.
2
Figure 6.3: Sample clusters obtained by a k-means clustering on patches represented as their 128D SIFT descriptors appended with suitably scaled image coordinates. Each cluster includes patches with similar appearance and pose information in a localized image region, but using the centers of several such clusters to encode human images as a ‘bag of features’ is found to perform poorly with respect to encoding 3D pose information.
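The clustering and ‘bag of features’ encoding mentioned in the caption of Figure 6.3 can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the implementation used in this work: the function names (`augment`, `nearest`, `kmeans`, `bag_of_features`), the coordinate scale of 0.1, the deterministic initialization and the toy 2D "descriptors" (standing in for 128D SIFT vectors) are all assumptions made for the example.

```python
def augment(descriptor, x, y, coord_scale):
    """Append scaled image coordinates so that clusters stay spatially localized."""
    return list(descriptor) + [coord_scale * x, coord_scale * y]

def nearest(point, centres):
    """Index of the centre closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centre)) for centre in centres]
    return dists.index(min(dists))

def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations with a deterministic, evenly spaced initialization."""
    step = len(points) // k
    centres = [list(points[i * step]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centre if a cluster becomes empty
                centres[j] = [sum(col) / len(group) for col in zip(*group)]
    return centres

def bag_of_features(points, centres):
    """Normalized histogram of how often each vocabulary entry occurs."""
    counts = [0] * len(centres)
    for p in points:
        counts[nearest(p, centres)] += 1
    return [c / len(points) for c in counts]

# Toy data: two appearance types at two different image heights (hypothetical values).
patches = ([augment([1.0, 0.0], x, 0, 0.1) for x in range(4)] +
           [augment([0.0, 1.0], x, 10, 0.1) for x in range(4)])
vocabulary = kmeans(patches, k=2)
histogram = bag_of_features(patches, vocabulary)
print(histogram)
```

Note that the histogram discards which patch produced which vocabulary entry, so most of the spatial layout is lost apart from what the appended coordinates retain within each cluster.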
Patch information can be encoded in many different ways. Given the variability of clothing and the fact that we want to be able to use black and white images, we do not use colour information. To allow the method to key on important body contours, we base our representation on local image gradients and use histograms of gradient orientations for effective encoding. The descriptor underlying SIFT [90] proves to be a useful representation in this regard, as it quantizes gradient orientations into discrete values in small spatial cells and normalizes these distributions over local blocks of cells to achieve insensitivity to illumination changes. The relative coarseness of the spatial coding provides some robustness to small position variations, while still capturing the essential spatial position and limb orientation information. Note that owing to loose clothing, the positions of limb contours do not in any case have a very precise relation to the pose, whereas the orientation of body edges is a much more reliable cue. We thus use SIFT-like histograms to obtain a 128D feature vector for each patch.³
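As a rough illustration of this encoding, the following Python sketch computes a SIFT-like descriptor for a single grey-level patch: gradient orientations are quantized into 8 bins within each cell of a 4x4 spatial grid (4 x 4 x 8 = 128 values), and the result is L2-normalized over the block. The function name `patch_descriptor`, the central-difference gradients and the hard (non-interpolated) binning are assumptions made for brevity; the full SIFT descriptor [90] additionally uses Gaussian weighting and interpolation between bins, which are omitted here.

```python
import math

def patch_descriptor(patch, cells=4, bins=8, eps=1e-8):
    """SIFT-like descriptor of a 2D grey-level patch (list of rows of floats)."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * (cells * cells * bins)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central-difference image gradient at (x, y).
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Quantize the gradient orientation into one of `bins` values.
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = int(angle / (2 * math.pi) * bins) % bins
            # Spatial cell containing this pixel.
            cy = y * cells // h
            cx = x * cells // w
            hist[(cy * cells + cx) * bins + b] += mag
    # Normalize over the whole block of cells for illumination insensitivity.
    norm = math.sqrt(sum(v * v for v in hist)) + eps
    return [v / norm for v in hist]

# Example: a 16x16 patch whose intensity increases from left to right.
ramp = [[float(x) for x in range(16)] for _ in range(16)]
descriptor = patch_descriptor(ramp)
print(len(descriptor))
```

Because only orientation histograms are kept within each cell, a small shift of a contour inside a cell leaves the descriptor largely unchanged, which corresponds to the robustness to small position variations noted above.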
To retain the information about the image location of each patch that is indispensable for pose estimation, the descriptors are computed at fixed grid locations in the image window. This gives an array of 128D feature vectors for the image. Figure 6.2(c) shows the features extracted from a sample image, where the value in each orientation bin in each cell is represented by the intensity of the line drawn in that cell at the corresponding orientation. (Overlapping cells contribute to the intensities of the same set of lines, but these are normalized accordingly.) We denote the descriptor vectors at each of these $L$ locations on the grid as $v_l$, $l \in \{1, \ldots, L\}$, and simply raster-scan the array to represent the complete image as a large vector $z$, a concatenation of the individual descriptors:
$$z \equiv (v_1^\top, v_2^\top, \ldots, v_L^\top)^\top \qquad (6.1)$$
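The construction of z in (6.1) can be sketched as below. This is a minimal illustration under stated assumptions rather than the actual implementation: the names `grid_origins` and `image_vector`, the block size of 16 pixels and the stride of 8 pixels (chosen only so that neighbouring blocks overlap by 50%) are hypothetical, and `describe` stands for any per-patch descriptor function, for which a trivial two-value stand-in is used in the example.

```python
def grid_origins(height, width, block=16, stride=8):
    """Top-left corners of the blocks in raster-scan order.

    With stride = block / 2, neighbouring blocks overlap by 50%.
    """
    return [(y, x)
            for y in range(0, height - block + 1, stride)
            for x in range(0, width - block + 1, stride)]

def image_vector(image, describe, block=16, stride=8):
    """Concatenate the descriptors v_1 ... v_L of all grid blocks into z."""
    z = []
    for y, x in grid_origins(len(image), len(image[0]), block, stride):
        patch = [row[x:x + block] for row in image[y:y + block]]
        z.extend(describe(patch))
    return z

# Example on a 32x48 synthetic image with a stand-in descriptor that
# returns only the first and last pixel of each patch.
image = [[float(x + y) for x in range(48)] for y in range(32)]
z = image_vector(image, lambda patch: [patch[0][0], patch[-1][-1]])
print(len(grid_origins(32, 48)), len(z))
```

Since the grid is fixed, the position of each descriptor within z identifies its image location, so no explicit coordinates need to be stored.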
As an alternative to maintaining a large vector of the descriptor values, we also study a bag of features [31] style of image encoding. This is a common scheme in object recognition that involves identifying a representative set of parts (features) as a vocabulary (generally obtained by clustering the set of patch descriptors using k-means or some other similar algorithm), and then representing each image in terms of the statistics of the occurrence of each vocabulary part in that image. In an analogous manner, a human body image can be represented as a collection of representative
³ Note that other similar descriptors could also be useful for this purpose, e.g. the generalized shape context [99]. However, we have not tried these in this work.