Detecting people in images essentially means finding all instances of people in a given image. Depending on the approach adopted, the problem may be viewed as the complete opposite of pose estimation (an ideal human detector would detect people in all possible poses and thus be completely insensitive to pose information), or as very much the same problem, because detecting a person in an image involves detecting his or her body segments and hence requires some sort of pose estimation. Which view applies depends on the resolution at which a person is detected. We first outline some approaches that detect people by detecting their parts. These are very similar in methodology to the model-based bottom-up pose estimation methods discussed in §2.1.2. Separate detectors for different body parts are scanned over an image, generally at various scales and orientations, and their responses are post-processed with a spatial prior model, e.g. [120, 94]. These part detectors normally use a variety of robust static image features, but motion and colour consistency information from several frames can also be exploited [113]. Enforcing temporal consistency among detected parts has been shown to be very effective and has recently also been used to detect people in unrestricted poses by starting with an easy-to-detect lateral walking pose, that is, ‘striking’ a particular pose [114]. This allows the construction of a simple pose-specific detector (the one implemented being based on edges), followed by the use of a pictorial structure model to track the pose in successive video frames.
The other approach to human detection is to detect people without explicitly inferring part configurations, e.g. pedestrian detection [106, 96, 131, 161]. Here, the main challenge is to handle the variability in pose and appearance of people while remaining discriminative against the background. Several feature types are possible for this, e.g. wavelet-based transforms have commonly been used [106, 131]. Recently, a ‘histogram of oriented gradients’ [32] representation has been shown to be very effective when used to train a linear Support Vector Machine [159] classifier. The framework has been used to detect people by scanning a detector window over the image at multiple scales. Although the classifier in this system learns to discriminate a person from background irrespective of pose, the dense orientation histogram representation itself is quite powerful at capturing pose information. In chapter 6, it is shown that such features can successfully be used to regress 3D pose from images, cf. [7].
Somewhere ‘between’ methods that detect pedestrians without pose information and those that look for precise pose is another class of methods, which make use of weak shape priors or appearance exemplars for detection, e.g. [152, 39]. These do not output body pose but can nevertheless often recognize the action or classify the kind of pose by comparison against a set of labeled images. A recent interesting approach that uses shape priors combines local and global cues to detect pedestrians via a probabilistic top-down segmentation [88]. Some other detection methods also give a detailed segmentation mask for images containing people at a reasonable resolution [100, 118]. These often make use of stick body models to exploit the articulated structure of the body during detection.
The human detection literature also has a large overlap with methods used in the field of object recognition. A major goal in both these areas is to develop robust and informative image representations that can discriminate people or objects from the background. We give a brief review of existing methods in this area in chapter 8.
3 Learning 3D Pose: Regression on Silhouettes
3.1 Introduction
This chapter describes a learning-based method for recovering 3D human body pose from single images and monocular image sequences. Most existing methods in this domain (example-based methods) explicitly store a set of training examples whose 3D poses are known, and estimate pose by searching for training image(s) similar to the given input image and interpolating from their poses [14, 98, 132, 144]. In contrast, the method developed here aims to learn a direct mapping from an image representation space to a human body ‘pose space’ and makes use of sparse nonlinear regression to distill a large training database into a single compact model that generalizes well to unseen examples. The regression framework optionally makes use of kernel functions to measure similarity between image pairs and implicitly encode locality. This allows the method to retain the advantage of example-based methods. Despite the fact that full human pose recovery is very ill-conditioned and nonlinear, we find that the method obtains enough information to compute reasonably accurate pose estimates via regression.
Given the high dimensionality and intrinsic ambiguity of the monocular pose estimation problem, active selection of appropriate image features and good control of over-fitting are critical for success. We have chosen to base our system on taking image silhouettes as input, which we encode using robust silhouette shape descriptors. (Other image representations are discussed in chapters 6 and 8.) To learn the mapping from silhouettes to human body pose, we take advantage of the sparsification and generalization properties of Relevance Vector Machine (RVM) [150] regression, allowing pose to be obtained from a new image using only a fraction of the training database. This avoids the need to store extremely large databases and allows for very fast pose estimation at run time.
3.1.1 Overview of the Approach
We represent 3D body pose by 55D vectors $x$ including 3 joint angles for each of the 18 major body joints. Not all of these degrees of freedom are independent, but they correspond to the motion capture data that we use to train the system (see §3.3), and we retain this format so that our regression output is directly compatible with standard rendering packages for motion capture data. The input images are reduced to 100D observation vectors $z$ that robustly encode the shape of a human image silhouette (§3.2). Given a set of labeled training examples $\{(z_i, x_i) \mid i = 1 \ldots n\}$,
Figure 3.1: A step by step illustration of our silhouette-to-pose regression method: (a) input silhouette extracted using background subtraction; (b) sampled edge points; (c) local shape contexts computed on edge points; (d) distribution of these contexts in shape context space; (e) soft vector quantization of the distribution to obtain a single histogram; (f) 3D pose obtained by regressing on this histogram.
the RVM learns a smooth reconstruction function $x = r(z) = \sum_k a_k \phi_k(z)$ that is valid over the region spanned by the training points. $r(z)$ is a weighted linear combination of a prespecified set of scalar basis functions $\{\phi_k(z) \mid k = 1 \ldots p\}$.
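As a minimal sketch of how such a reconstruction function is evaluated, the following pure-Python snippet computes the weighted sum of scalar basis responses, one weight vector per basis function. The helper names (`reconstruct`, `linear_basis`), the toy dimensions and the weight values are illustrative assumptions, not part of the system described here; the weights would in practice come from RVM training, which is not shown.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Basis = Callable[[Sequence[float]], float]


def reconstruct(z: Sequence[float],
                weights: Sequence[Sequence[float]],
                bases: Sequence[Basis]) -> Vector:
    """Evaluate r(z) = sum_k a_k * phi_k(z).

    weights[k] is the weight vector a_k (one entry per pose dimension);
    bases[k] is the scalar basis function phi_k.
    """
    if not weights or len(weights) != len(bases):
        raise ValueError("need exactly one weight vector per basis function")
    pose = [0.0] * len(weights[0])
    for a_k, phi_k in zip(weights, bases):
        response = phi_k(z)  # scalar value phi_k(z)
        for d, a_kd in enumerate(a_k):
            pose[d] += a_kd * response
    return pose


def linear_basis(k: int) -> Basis:
    """One possible basis choice: phi_k(z) = k-th component of z."""
    return lambda z: z[k]


# Toy sizes (assumed): 3-D descriptor and 2-D pose instead of 100-D and 55-D.
bases = [linear_basis(k) for k in range(3)]
weights = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]  # hypothetical a_k values
pose = reconstruct([2.0, 3.0, 4.0], weights, bases)
print(pose)
```

Because the basis functions are scalar and the weights are vectors, all pose dimensions share the same set of basis responses, so a basis function whose weight vector is zero can be skipped entirely at run time.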
Our solutions are well-regularized in the sense that the weight vectors $a_k$ are damped to control over-fitting, and sparse in the sense that many of them are zero. Sparsity occurs because the RVM actively selects only the ‘most relevant’ basis functions: the ones that really need to have nonzero coefficients to complete the regression successfully. For a linear basis ($\phi_k(z)$ = $k$th component of $z$), the sparse solution obtained by the RVM allows the system to select relevant input features (components). For a kernel basis ($\phi_k(z) \equiv K(z, z_k)$ for some kernel function $K(z, z')$ and centres $z_k$), relevant training examples are selected, allowing us to prune a large training dataset and retain only a minimal subset.
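The pruning effect of a sparse kernel-basis solution can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian form of the kernel, its width `beta`, the helper names and the weight values are all hypothetical, and the RVM training step that would actually produce the sparse weights is not shown.

```python
import math
from typing import List, Sequence, Tuple

Model = List[Tuple[Sequence[float], Sequence[float]]]


def gaussian_kernel(z: Sequence[float], centre: Sequence[float],
                    beta: float = 1.0) -> float:
    """K(z, z_k) = exp(-beta * ||z - z_k||^2); the form and beta are assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, centre))
    return math.exp(-beta * sq_dist)


def prune(centres: Sequence[Sequence[float]],
          weights: Sequence[Sequence[float]],
          tol: float = 0.0) -> Model:
    """Keep only the centres whose weight vector is not (numerically) zero."""
    return [(c, a) for c, a in zip(centres, weights)
            if any(abs(v) > tol for v in a)]


def predict(z: Sequence[float], model: Model, beta: float = 1.0) -> List[float]:
    """Evaluate r(z) = sum_k a_k K(z, z_k) over the centres in the model."""
    if not model:
        raise ValueError("model has no basis functions")
    pose = [0.0] * len(model[0][1])
    for centre, a_k in model:
        k_val = gaussian_kernel(z, centre, beta)
        for d, a_kd in enumerate(a_k):
            pose[d] += a_kd * k_val
    return pose


# Four training examples as kernel centres; two have zero weight vectors.
centres = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[2.0], [0.0], [0.0], [-1.0]]  # hypothetical sparse solution
sparse_model = prune(centres, weights)
full = predict([0.0, 0.0], list(zip(centres, weights)))
sparse = predict([0.0, 0.0], sparse_model)
print(len(sparse_model), full, sparse)
```

The pruned model gives the same prediction as the full one while storing only the training examples with nonzero weights, which is the sense in which a large training set is reduced to a small subset.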
The complete process is illustrated in figure 3.1. We discuss our representations of the input and output spaces in §3.2 and §3.3, and the regression methods used in §3.4. The framework is applied to estimating pose from individual images in §3.5.
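To make the soft vector quantization step of figure 3.1(e) more concrete, here is a rough sketch of how a set of local descriptors might be turned into a single fixed-length histogram. The Gaussian voting scheme, the width `sigma`, the per-descriptor normalization and the function name `soft_histogram` are assumptions made for illustration; the actual descriptor encoding is the one specified in §3.2.

```python
import math
from typing import List, Sequence


def soft_histogram(descriptors: Sequence[Sequence[float]],
                   centres: Sequence[Sequence[float]],
                   sigma: float = 1.0) -> List[float]:
    """Soft-assign each local descriptor to all centres and accumulate votes.

    Each descriptor spreads one unit of vote over the centres with Gaussian
    weights (an assumed scheme); the result is averaged over descriptors.
    """
    hist = [0.0] * len(centres)
    for d in descriptors:
        votes = []
        for c in centres:
            sq_dist = sum((a - b) ** 2 for a, b in zip(d, c))
            votes.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        total = sum(votes)
        if total == 0.0:
            continue  # descriptor too far from every centre to register
        for j, v in enumerate(votes):
            hist[j] += v / total
    n = len(descriptors)
    return [h / n for h in hist] if n else hist


# Toy example: 2-D descriptors and 3 centres (the text uses 100-D histograms).
descriptors = [[0.0, 0.0], [1.0, 1.0]]
centres = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
hist = soft_histogram(descriptors, centres)
print(hist)
```

Soft voting of this kind means that a small shift in a descriptor changes the histogram smoothly rather than moving a whole vote from one bin to another, which is one plausible reason to prefer it over hard assignment.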