Gesture recognition using the Mixture Components

An interesting by-product of using a mixture of regressors to track human motion is the posterior probability of each regressor class for each image, p(l = k| z). These values can be used to deduce the label of the mixture component that was used to reconstruct the pose at any given time instant. The mixture components thus not only help to resolve ambiguities by separating regions of multivaluedness, but also softly partition the space into small regions having consistently similar appearance and pose. In this section, we exploit this fact to use the model for labeling different gestures in a video sequence.

The system is retrained on a set of images from a sequence of well-defined arm gestures, as shown in figure 5.9. For this experiment, we use the algorithm defined in § 5.3, but initialize the clusters manually on the basis of the ground truth gesture label associated with each image-pose pair. The final components obtained after EM thus contain probabilistic information about the gesture label associated with each data point.

Figure 5.9: Sample frames from 7 basketball referee signals that are used as training gestures for learning a mixture of regressors capable of inferring gesture along with reconstructing pose. Panels: Traveling, Illegal dribble, Illegal defense, No score, Stop clock, Hand check, Technical foul, (Neutral).

Figure 5.10: Tracking 3D pose and recognizing gesture (from a predefined set of gestures) on a test sequence, using a mixture of regressors. Each state is associated with the gesture label corresponding to the class with the maximum posterior conditional likelihood at that point in time.

Figure 5.11: A comparison of the estimated gesture labels with hand-labeled ground truth for the sequence shown in fig 5.10. The numbers 1 - 7 on the y axis correspond to the seven gestures shown in fig 5.9, with the output 0 corresponding to a neutral pose. The predicted labels are mostly the true ones, with errors occurring mostly at class transitions. (Plot legend: True action label, Estimated label; horizontal axis ticks from 0 to 700, vertical axis ticks from 0 to 7.)

On a test sequence of similar gestures, we now assign a gesture label k∗ to each reconstructed pose by reading off the label of the component with maximum posterior probability:

    k^*_t = \arg\max_k \; p(l = k \mid z_t)    (5.9)

where p(l = k| z) is computed from (5.8). Figure 5.10 shows snapshots from a test video sequence where each reconstructed pose is labeled with its most likely gesture. No explicit smoothing is applied to this label across neighbouring images, but the Condensation-based tracking helps to ensure that the predicted gesture label is reasonably consistent over time. Except for a few cases of confusion (e.g. at t = 227, 431, where the algorithm outputs a wrong gesture label), the method recognizes gestures with a very high level of accuracy. A plot of the estimated gesture labels compared with ground truth for the complete sequence is shown in figure 5.11. We see that most of the errors actually occur at class transitions. One likely reason for this is that we simply take the maximum of the posterior probability; a more principled scheme, e.g. combining the probability values with an HMM over gesture transitions, may help to reduce these errors.
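As a rough illustration of the labeling rule (5.9) and of the HMM idea mentioned above, the following Python sketch takes a list of per-frame posterior vectors and returns gesture labels. It is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names, the `stay_prob` parameter, and the "sticky" transition model (stay in the current gesture with a fixed probability, move uniformly otherwise) are all hypothetical, and the posteriors are assumed to have already been computed from (5.8).

```python
import math


def map_labels(posteriors):
    """Per-frame labels k*_t = argmax_k p(l = k | z_t), as in (5.9)."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


def viterbi_labels(posteriors, stay_prob=0.9):
    """Most likely label sequence under a simple HMM over gestures.

    Assumptions (not from the text): transitions keep the current gesture
    with probability `stay_prob` and spread the remainder uniformly over
    the other classes; the per-frame posteriors act as emission scores.
    Requires at least two classes.
    """
    n_classes = len(posteriors[0])
    eps = 1e-12  # avoids log(0) for zero posteriors
    log_stay = math.log(stay_prob)
    log_move = math.log((1.0 - stay_prob) / (n_classes - 1))

    def trans(j, k):
        return log_stay if j == k else log_move

    score = [math.log(p + eps) for p in posteriors[0]]
    back = []  # back[t][k] = best predecessor of class k at frame t + 1
    for frame in posteriors[1:]:
        new_score, pointers = [], []
        for k in range(n_classes):
            best_j = max(range(n_classes), key=lambda j: score[j] + trans(j, k))
            new_score.append(score[best_j] + trans(best_j, k)
                             + math.log(frame[k] + eps))
            pointers.append(best_j)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final class.
    path = [max(range(n_classes), key=score.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With a sticky transition model of this kind, an isolated frame whose posterior only narrowly favours a different class tends to be absorbed into the surrounding gesture, whereas the plain maximum in `map_labels` switches label for that frame.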

5.7 Discussion

We have developed a method for multiple hypothesis estimation of 3D human pose from silhouettes, based on mixtures of regressors. The mixture is learned as a generative model on the combined density of the input and output spaces and the regressors are estimated within an EM framework by constraining the covariance matrix to take a special structured form. Accurate pose reconstruction results that correctly identify the ambiguities are obtained on a variety of real unseen silhouettes, demonstrating the method’s ability to generalize across inter-person variations and imperfect silhouette extraction. When used in a multiple hypothesis tracker, the method is capable of tracking stably over time with robustness to occasional tracking failures.

In contrast to the previous two chapters, the regression framework developed here is probabilistic and more robust, but this comes at the cost of more computation at runtime. While the regressors themselves are linear and fast to apply, projecting a silhouette to the KPCA-reduced manifold requires evaluating a kernel function centred at each of the training points, thus losing the advantage of the sparse solutions obtained by the RVM in chapters 3 and 4. A possible way around this would be to compute a sparse approximation of the embedding and apply suitable priors on the regressor parameters to learn a sparse mixture of regressors. Another possibility is to extend the cost function of the Relevance Vector Machine to incorporate the latent variable within it and directly deal with multimodal output solutions that are computed sparsely. However, we leave these possibilities for future work and focus attention on using the regression models developed so far in cases where the images cannot be represented using silhouettes, e.g. in cases of unknown or cluttered backgrounds. The next chapter develops an image representation that allows for regression-based pose recognition in the presence of clutter.
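To make the runtime cost concrete, the Python sketch below contrasts a dense kernel PCA projection, which evaluates the kernel against every training point, with a projection whose expansion coefficients are sparse. It is only an illustration under assumptions: the Gaussian kernel, the `gamma` parameter, the function names, and the omission of kernel centring are choices made here for brevity, and the text does not say how a sparse approximation of the embedding would actually be obtained.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel between two descriptor vectors (assumed kernel choice)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def kpca_project(z, train_points, alphas, gamma=1.0):
    """Dense projection: one kernel evaluation per training point.

    alphas[m][i] is the expansion coefficient of training point i in
    component m. Kernel centring is omitted for brevity.
    """
    k = [rbf_kernel(z, t, gamma) for t in train_points]  # N evaluations
    return [sum(a_i * k_i for a_i, k_i in zip(alpha, k)) for alpha in alphas]


def kpca_project_sparse(z, train_points, sparse_alphas, gamma=1.0):
    """Sparse projection: kernels are evaluated only at retained points.

    sparse_alphas[m] maps a training-point index to its non-zero
    coefficient in component m (a hypothetical sparse approximation).
    """
    needed = sorted({i for alpha in sparse_alphas for i in alpha})
    k = {i: rbf_kernel(z, train_points[i], gamma) for i in needed}
    return [sum(c * k[i] for i, c in alpha.items()) for alpha in sparse_alphas]
```

The dense version costs a number of kernel evaluations equal to the training set size for every test image, while the sparse version costs only as many as there are retained training points, which is the saving that the RVM provides in chapters 3 and 4.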

6 Estimating Pose in Cluttered Images

6.1 Introduction

In a general scene, it is often not known a priori what the background consists of. Obtaining a human body silhouette representing the shape of the foreground may not be straightforward in such a case. Reconstructing the pose of a person without segmentation information thus becomes a significantly harder problem; on the other hand, it is not evident that a precise segmentation is actually needed to perform (pose) recognition. Several existing pose estimation methods work without any segmentation information and in the presence of background clutter, but most of these adopt a model-based approach, i.e. they rely on a predefined kinematic body model. Top-down methods obtain pose by minimizing the image projection error of an articulated model, using techniques such as optimization [142] or by generating a large number of pose hypotheses [87]. They therefore automatically avoid clutter by only attempting to explain the part of the image that is covered by the hypothesized pose projection. This is an effective approach, but does not account for unexplained portions of the image. Moreover, both these techniques can be quite expensive due to the repeated evaluations of the image likelihood involved. Bottom-up methods, on the other hand, use weak limb detectors to find human body segments in the image (e.g. [120, 94]) and then combine independent detections from several detectors with spatial priors on the relative arrangement of the limbs to infer body pose (e.g. [137, 113]). Current bottom-up methods based on monocular images obtain very coarse pose information that is usually not sufficient for motion capture. In fact, only a few of the existing methods in this category actually attempt to recover 3D pose.

In chapter 3, we introduced a very effective model-free approach that estimates 3D body pose by learning a regression-based mapping from monocular image observations to the space of body poses. The method is completely bottom-up and extends gracefully to incorporate temporal information and support multimodal estimates, as we have seen in chapters 4 and 5. In all these chapters, we have made use of a robust shape descriptor to encode the input image by segmenting out the human figure to obtain a silhouette. However, the regression framework itself remains a valid approach to inferring pose from any suitable representation of the input image. In the absence of a segmented shape, the regressor could be allowed to cue on other features in an image.

In this chapter we extend the regression-based approach to work on general images containing cluttered backgrounds. This calls for an appropriate encoding of image features. Unlike top-down methods, which need to explain only the part of the image that is covered by a projection of a body model for a particular pose hypothesis, a bottom-up method requires either the ability to explicitly ‘detect’ body parts in an image, or otherwise a representation of the image in a manner that would allow a learning algorithm to implicitly cue on relevant features that encode pose information while being robust to irrelevant clutter.

Figure 6.1: The presence of background clutter in images makes the problem of pose estimation significantly harder. An important issue is to cue on useful parts of the image while not being confused by objects in the background.

6.1.1 Overview of the Approach

We base our method on the model-free approach developed in the previous chapters and use a large collection of pose-labeled images (from motion capture) to learn a system that directly predicts a set of body pose parameters from image descriptors. We side-step the problem of detecting people in a scene and focus on extracting pose from image windows that are known to contain people. To encode the input, local gradient orientation histograms such as those underlying the SIFT descriptor [90] are computed over a dense grid of patches on the image window to give a large vector z of concatenated descriptors. This is followed by a Non-negative Matrix Factorization (NMF) step that learns a set of sparse bases for the descriptors at each of the grid locations. The basis vectors correspond to local features on the human body and allow the patches to be re-encoded to selectively retain human-like features (that occur frequently and consistently in the data) while suppressing background clutter (which is highly varied and inconsistent). This gives a representation φ(z) of the image which is reasonably invariant to changes in background. Pose is then recovered by direct regression, x = A φ(z) + ε. The complete sequence of steps involved is summarized in figure 6.2.
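The steps just described can be sketched in a few lines of Python. This is a simplified illustration under several assumptions, not the system developed in this chapter: each patch gets a single magnitude-weighted orientation histogram (the SIFT descriptor uses a grid of such histograms), the NMF bases are assumed to have been learned already, φ(z) is taken to be the concatenated non-negative coefficients, and all function names and parameters are hypothetical.

```python
import math


def orientation_histogram(patch, n_bins=8):
    """Magnitude-weighted histogram of gradient orientations for one patch."""
    hist = [0.0] * n_bins
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]  # central differences
            gy = patch[r + 1][c] - patch[r - 1][c]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            hist[int(angle / (2.0 * math.pi) * n_bins) % n_bins] += mag
    return hist


def nmf_encode(v, basis, n_iter=200, eps=1e-9):
    """Non-negative coefficients h such that v ~ sum_j h[j] * basis[j].

    The basis is held fixed (assumed already learned); h is found with the
    multiplicative update h <- h * (W^T v) / (W^T W h) for squared error.
    """
    n = len(basis)
    h = [1.0] * n
    wtv = [sum(b * x for b, x in zip(bj, v)) for bj in basis]
    wtw = [[sum(x * y for x, y in zip(bi, bj)) for bj in basis] for bi in basis]
    for _ in range(n_iter):
        wtwh = [sum(wtw[i][j] * h[j] for j in range(n)) for i in range(n)]
        h = [h[i] * wtv[i] / (wtwh[i] + eps) for i in range(n)]
    return h


def encode_window(patches, bases):
    """phi(z): concatenated NMF coefficients, one basis set per grid location."""
    phi = []
    for patch, basis in zip(patches, bases):
        phi.extend(nmf_encode(orientation_histogram(patch), basis))
    return phi


def predict_pose(A, phi):
    """Linear regression output x = A phi(z) (the noise term is not modelled)."""
    return [sum(a * p for a, p in zip(row, phi)) for row in A]
```

Because the coefficients are constrained to be non-negative and the bases only span structure seen in training, descriptor energy that no basis vector explains is simply left out of the encoding, which is the sense in which clutter is suppressed in this sketch.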