Overview of Approach - Machine Learning for Image Based Motion Capture

Figure 1.3: A projection of the manifold of human silhouettes in a feature space that encodes silhouettes using robust shape descriptors. The encoding is used to map silhouettes into a high dimensional space where Euclidean distance measures the similarity between two silhouettes. Such an encoding allows 3D body pose to be recovered by using direct regression on the descriptors.

other hand, statistical models can easily capture the temporal dependencies and the correlations between movements of different body parts. Machine learning techniques have begun to be used to synthesize natural-looking motion for human animation, e.g. [86, 124, 41].

1.4 Overview of Approach

In this thesis, we take a learning-based approach to motion capture, using regression as a basic tool to distill a large training database of 3D poses and corresponding images into a compact model that has good generalization to unseen examples. We use a bottom-up approach in which the underlying pose is predicted directly from a feature-based image representation, without directly modeling the generative process of image formation from the body configuration. The method is purely data-driven and does not make use of any explicit human body model or prior labeling of body parts in the image.

Figure 1.4: Silhouettes are an effective representation for estimating body pose from an image, but add to the problem of ambiguities in the solution because left and right limbs are sometimes indistinguishable. Here we see some examples of multiple 3D pose solutions that are obtained from such confusing silhouettes using a mixture of regressors method developed in this thesis. Cases of forward/backward ambiguity, kinematic flipping of the legs and interchanging labels between them are seen here. In the last example, the method misestimates the pose in one of the solutions.

We represent the target 3D body pose by a vector, denoted $x$. In our experiments, we simply use native motion capture pose descriptions. These are in the form of joint angles or 3D coordinates of the body joints, but any other representation is applicable. The input image is also represented in a vectorized form, denoted $z$. Given the high dimensionality and intrinsic ambiguity of the monocular pose estimation problem, active selection of appropriate image features is critical for success. We use the training images to learn suitable image representations specific to capturing human body shape and appearance. Two kinds of representation have been studied. In cases where foreground-background information is available, we use background subtraction based segmentation to obtain the human silhouette, and encode this in terms of the distribution of its softly vector quantized local shape context descriptors [16], with the vector quantization centres being learned from a representative set of human body shapes. This transforms each silhouette to a point in a 100D space of characteristic silhouette shapes. A 3D projection of this space is shown in Figure 1.3. In cases where background subtraction is not available, we use the input image directly, computing histograms of gradient orientations on local patches densely over the entire image (cf. the SIFT [90] and HOG [32] descriptors). These are then re-encoded to suppress the contributions of background clutter using a basis learned using Non-negative Matrix Factorization [85] on training data. In our implementation this gives a 720D vector for the image.
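The soft vector quantization step for silhouettes can be illustrated with a minimal sketch. It is not the thesis implementation: the Gaussian voting rule, the width `sigma`, the 60D local descriptor size and the random data are all assumptions chosen for illustration. Only the idea of softly voting local descriptors into a histogram over 100 learned centres comes from the text.

```python
import numpy as np

def soft_vq_histogram(descriptors, centres, sigma=1.0):
    """Softly vote local descriptors into a histogram over VQ centres.

    descriptors: (m, d) local shape descriptors from one silhouette.
    centres:     (k, d) vector quantization centres.
    sigma:       assumed Gaussian width controlling how soft the voting is.
    """
    # Squared Euclidean distance from every descriptor to every centre: (m, k).
    diff = descriptors[:, None, :] - centres[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Gaussian weights; subtracting the row minimum only improves numerical stability.
    shifted = sq_dist - sq_dist.min(axis=1, keepdims=True)
    weights = np.exp(-shifted / (2.0 * sigma ** 2))
    # Each descriptor distributes exactly one unit of vote over the centres.
    weights /= weights.sum(axis=1, keepdims=True)
    hist = weights.sum(axis=0)
    return hist / hist.sum()  # normalise so the histogram sums to 1


rng = np.random.default_rng(0)
centres = rng.normal(size=(100, 60))      # 100 centres give a 100D silhouette vector
descriptors = rng.normal(size=(400, 60))  # synthetic stand-ins for shape contexts
z = soft_vq_histogram(descriptors, centres)
```

Compared with hard assignment to the nearest centre, soft voting changes the histogram smoothly when a descriptor moves slightly, which is one plausible reason to prefer it for noisy silhouettes.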

The pose recovery problem reduces to estimating the pose $x$ from the vectorized image representation $z$. We formulate several different models of regression for this. Given a set of labeled training examples $\{(z_i, x_i) \mid i = 1 \ldots n\}$, we use the Relevance Vector Machine [151] to learn a smooth reconstruction function¹ $x = r(z)$, valid over the region spanned by the training points. The function is a weighted linear combination $r(z) \equiv \sum_k a_k \phi_k(z)$ of a prespecified set of scalar basis functions $\{\phi_k(z) \mid k = 1 \ldots p\}$.
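As a rough illustration of the functional form $r(z) = \sum_k a_k \phi_k(z)$, the sketch below evaluates Gaussian basis functions and fits the weight vectors $a_k$. Several things here are assumptions rather than the thesis method: the Gaussian kernel and its width `beta`, the choice of centres, the toy data, and above all the use of damped (ridge) least squares as a simple stand-in for the Relevance Vector Machine, which would additionally drive many $a_k$ to zero.

```python
import numpy as np

def gaussian_basis(Z, centres, beta=0.5):
    """Evaluate phi_k(z) = exp(-beta * ||z - c_k||^2) for every row of Z: (n, p)."""
    sq_dist = np.sum((Z[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    return np.exp(-beta * sq_dist)

def fit_weights(Phi, X, lam=1e-3):
    """Damped least squares: minimise ||Phi A - X||^2 + lam * ||A||^2.

    Returns A of shape (p, d_x); row k holds the weight vector a_k.
    This is a stand-in for RVM training, not an RVM.
    """
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ X)

def r(z, centres, A, beta=0.5):
    """r(z) = sum_k a_k phi_k(z), for one descriptor or a batch of them."""
    return gaussian_basis(np.atleast_2d(z), centres, beta) @ A


rng = np.random.default_rng(0)
Z_train = rng.uniform(-1, 1, size=(200, 3))            # toy 'image descriptors'
X_train = np.column_stack([np.sin(Z_train[:, 0]),      # toy 'poses'
                           Z_train[:, 1] * Z_train[:, 2]])
centres = Z_train[:50]                                  # hypothetical basis centres
A = fit_weights(gaussian_basis(Z_train, centres), X_train)
X_pred = r(Z_train, centres, A)
```

Because the model is linear in the weights, fitting reduces to a linear solve once the basis functions are fixed. The nonlinearity of the image-to-pose mapping is carried entirely by the $\phi_k$.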

When we use the method in a tracking framework, we can extend the functional form to incorporate an approximate preliminary pose estimate $\check{x}$, giving $x = r(\check{x}, z)$. This helps to maintain temporal continuity and to disambiguate pose in cases where there are several possible reconstructions. At each time step $t$, a state estimate $\check{x}_t$ is obtained from the previous two pose vectors using an autoregressive dynamical model, and this is used to compute the basis functions, which now take the form $\{\phi_k(\check{x}, z) \mid k = 1 \ldots p\}$.
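The preliminary estimate $\check{x}_t$ can be sketched as a second-order autoregressive prediction. The passage only states that the estimate is obtained from the previous two pose vectors. The linear form $\check{x}_t = A x_{t-1} + B x_{t-2}$, the least-squares fit and the toy sinusoidal sequence below are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def fit_ar2(X):
    """Least-squares fit of x_t ~ A x_{t-1} + B x_{t-2} from a pose sequence X of shape (T, d)."""
    past = np.hstack([X[1:-1], X[:-2]])   # rows [x_{t-1}, x_{t-2}], shape (T-2, 2d)
    target = X[2:]                         # rows x_t, shape (T-2, d)
    W, *_ = np.linalg.lstsq(past, target, rcond=None)  # (2d, d)
    d = X.shape[1]
    return W[:d].T, W[d:].T                # A and B, each (d, d)

def predict_pose(x_prev, x_prev2, A, B):
    """Preliminary estimate for time t from the poses at t-1 and t-2."""
    return A @ x_prev + B @ x_prev2


# Toy 2D 'pose' trajectory: a smooth periodic motion.
t = np.arange(50)
X = np.column_stack([np.sin(0.1 * t), np.cos(0.1 * t)])

A, B = fit_ar2(X[:-1])                       # fit on all frames but the last
x_check = predict_pose(X[-2], X[-3], A, B)   # predict the held-out last frame
```

In a tracker, `x_check` would be passed together with the image descriptor $z$ to the basis functions $\phi_k(\check{x}, z)$, so that the regressor can favour the reconstruction closest to the predicted motion.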

Our regression solutions are well-regularized in the sense that the weight vectors $a_k$ are damped to control over-fitting, and sparse in the sense that many of them are zero. Sparsity is ensured by the use of the Relevance Vector Machine, a learning method that actively selects the most ‘relevant’ basis functions, the ones that really need to have nonzero coefficients for the successful

¹ I.e. a function that directly encodes the inverse mapping from image to body pose. The forward mapping from body pose to image observations can be more easily explained by projecting a human body model or learning image likelihoods. In this thesis, we avoid the use of such a forward mapping.
