After analyzing the factors affecting the results of the different methods, we conduct a detailed analysis on a smaller set: the 15 largest activity classes from Separate people.
| Method | yoga, power | bicycling, mountain | skiing, downhill | cooking or food prep. | skateboarding | rope skipping | softball, general | forestry |
|---|---|---|---|---|---|---|---|---|
| Dense trajectories (DT) | 10.6 | 14.5 | 51.9 | 0.5 | 11.4 | 36.0 | 12.7 | 8.4 |
| GT single pose (GT) | 22.3 | 26.5 | 7.5 | 1.8 | 3.4 | 51.2 | 2.2 | 1.4 |
| GT single pose + track (GT-T) | 37.0 | 28.0 | 10.9 | 2.6 | 4.6 | 69.2 | 3.6 | 1.2 |
| PS single pose + track (PS-T) | 8.8 | 6.6 | 6.0 | 1.3 | 1.7 | 63.1 | 1.6 | 1.8 |
| PS multi-pose (PS-M) | 18.3 | 34.0 | 27.3 | 2.6 | 17.2 | 90.5 | 3.0 | 5.2 |
| PS-M + DT (features) | 19.6 | 40.7 | 32.9 | 2.2 | 19.5 | 88.7 | 3.9 | 7.2 |
| PS-M filter DT | 16.1 | 20.4 | 52.2 | 0.8 | 13.5 | 55.7 | 4.2 | 10.6 |

| Method | carpentry, general | bicycling, racing | golf | rock climbing | ballet, modern | aerobic step | resistance training | total |
|---|---|---|---|---|---|---|---|---|
| Dense trajectories (DT) | 5.5 | 5.5 | 33.0 | 41.5 | 12.7 | 24.5 | 16.5 | 19.0 |
| GT single pose (GT) | 2.7 | 7.1 | 36.1 | 2.3 | 1.0 | 1.1 | 1.4 | 11.2 |
| GT single pose + track (GT-T) | 2.8 | 8.7 | 25.3 | 8.9 | 1.7 | 3.3 | 1.3 | 13.9 |
| PS single pose + track (PS-T) | 5.3 | 0.5 | 14.7 | 1.2 | 2.8 | 11.1 | 1.6 | 8.5 |
| PS multi-pose (PS-M) | 3.4 | 8.6 | 47.9 | 4.7 | 22.9 | 10.4 | 7.2 | 20.2 |
| PS-M + DT (features) | 5.0 | 12.1 | 51.9 | 14.4 | 23.7 | 17.1 | 14.4 | 23.5 |
| PS-M filter DT | 6.1 | 15.5 | 15.9 | 38.6 | 7.1 | 25.8 | 9.6 | 19.5 |

Table 10.1: Activity recognition results (mAP) on the 15 largest classes from Separate people.
The results are shown in Tab. 10.1. No method outperforms all others on every activity; different approaches are better on different activities. On average, methods perform well on activities with simple poses and motions, e.g. “rope skipping”, “skiing, downhill” and “golf”, which are typical cases in most current activity recognition benchmarks. However, the performance of all methods is low for activities with more variability in motion and poses, e.g. “cooking”, “carpentry, general” and “forestry”. This leaves room for improvement for all competing methods. Analyzing the performance on individual activities, we observe that for the “yoga, power” activity GT outperforms the holistic DT and PS-M filter DT methods (22.3% vs. 10.6% and 16.1% mAP, respectively) and is better than the pose based PS-M (22.3% vs. 18.3% mAP). This is interesting, as GT does not use any motion and relies on static body features only. The explanation is that the “yoga, power” activity contains distinctive body poses and can thus be reliably captured by GT, while PS-M fails due to unreliable pose estimates. In many cases the combination PS-M + DT (features) noticeably outperforms both PS-M and DT alone. The differences are most pronounced for “bicycling, mountain”, “bicycling, racing” and “skateboarding”, which exhibit characteristic motions, and for “golf”, which has distinctive body motion and poses. Overall, PS-M + DT (features) achieves the best performance of 23.5% mAP. We visualize several successful and failure cases of the methods in Fig. 10.3.
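The comparison between the combination and its two components can be checked directly from the numbers in Tab. 10.1. The following is a minimal sketch, not part of the original evaluation code: the names `CLASSES`, `AP`, `mean_ap` and `combination_gains` are hypothetical, and the values are the rounded per-class mAP entries copied from the table. Because the entries are rounded, a recomputed mean may differ from the "total" column by up to 0.1.

```python
# Per-class mAP values (%) copied from Tab. 10.1.
# Class order follows the table columns; all names here are illustrative.
CLASSES = [
    "yoga, power", "bicycling, mountain", "skiing, downhill", "cooking",
    "skateboarding", "rope skipping", "softball, general", "forestry",
    "carpentry, general", "bicycling, racing", "golf", "rock climbing",
    "ballet, modern", "aerobic step", "resistance training",
]
AP = {
    "DT": [10.6, 14.5, 51.9, 0.5, 11.4, 36.0, 12.7, 8.4,
           5.5, 5.5, 33.0, 41.5, 12.7, 24.5, 16.5],
    "PS-M": [18.3, 34.0, 27.3, 2.6, 17.2, 90.5, 3.0, 5.2,
             3.4, 8.6, 47.9, 4.7, 22.9, 10.4, 7.2],
    "PS-M + DT": [19.6, 40.7, 32.9, 2.2, 19.5, 88.7, 3.9, 7.2,
                  5.0, 12.1, 51.9, 14.4, 23.7, 17.1, 14.4],
}


def mean_ap(values):
    """Unweighted mean over classes (assumed to match the 'total' column)."""
    return sum(values) / len(values)


def combination_gains(ap, combo="PS-M + DT", parts=("DT", "PS-M")):
    """Per class: combination mAP minus the better of its two components."""
    gains = {}
    for k, name in enumerate(CLASSES):
        best_part = max(ap[p][k] for p in parts)
        gains[name] = ap[combo][k] - best_part
    return gains


gains = combination_gains(AP)
# Classes where the combination beats both components, largest gain first.
improved = sorted((c for c, g in gains.items() if g > 0),
                  key=gains.get, reverse=True)
means = {method: mean_ap(values) for method, values in AP.items()}

print(improved)
print({method: round(value, 1) for method, value in means.items()})
```

On these table values the combination beats both components on six of the 15 classes, and the four largest gains fall on “bicycling, mountain”, “golf”, “bicycling, racing” and “skateboarding”, in line with the discussion above.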
Figure 10.3: Successful and failure cases on several activity classes for DT, PS-M, and PS-M + DT. Shown is the most confident prediction per class. False positives are highlighted in red.
10.5 Conclusion
In this work we address the challenging task of fine-grained human activity recognition on a recent comprehensive dataset with hundreds of activity classes. We study holistic and pose based representations and analyze the factors responsible for their performance. We reveal that holistic and pose based methods are complementary, and that their performance varies significantly depending on the activity. We find that both methods are strongly affected by the speed of trajectories. While the holistic method is also strongly influenced by the number of trajectories, pose based methods are strongly affected by human pose and viewpoint. We observe striking performance differences across activities and experimentally show that the combination of both methods performs best.
11 DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation
Contents
11.1 Introduction
11.2 Problem Formulation
11.2.1 Feasible Solutions
11.2.2 Objective Function
11.2.3 Optimization
11.3 Pairwise Probabilities
11.3.1 Probability Estimation
11.4 Body Part Detectors
11.4.1 Adapted Fast R-CNN (AFR-CNN)
11.4.2 Dense architecture (Dense-CNN)
11.4.3 Evaluation of part detectors
11.4.4 Using detections in DeepCut models
11.5 DeepCut Results
11.5.1 Single person pose estimation
11.5.2 Multi-person pose estimation
11.6 Conclusion
11.7 Appendix: Additional Results on LSP dataset
11.7.1 LSP Person-Centric (PC)
11.7.2 LSP Observer-Centric (OC)
In this chapter we switch our attention back to developing expressive body models for human pose estimation. In particular, we consider the task of articulated human pose estimation of multiple people in real-world images. We propose an approach that jointly solves the tasks of detection and pose estimation: it infers the number of persons in a scene, identifies occluded body parts, and disambiguates body parts between people in close proximity to each other. This joint formulation is in contrast to previous strategies that address the problem by first detecting people and subsequently estimating their body pose. We propose a partitioning and labeling formulation of a set of body-part hypotheses generated with CNN-based part detectors. Our formulation, an instance of an integer linear program, implicitly performs non-maximum suppression on the set of part candidates and groups them to form configurations of body parts respecting geometric and appearance constraints. Experiments on four different datasets demonstrate state-of-the-art results for both single person and multi person pose estimation.
11.1 Introduction
Human body pose estimation methods have become increasingly reliable. Powerful body part detectors (Tompson et al., 2015) in combination with tree-structured body models (Tompson et al., 2014; Chen and Yuille, 2014) show impressive results on diverse datasets (Johnson and Everingham, 2011; Andriluka et al., 2014; Sapp and Taskar, 2013). These benchmarks promote pose estimation of single pre-localized persons but exclude scenes with multiple persons. This problem definition has been a driver for progress, but it falls short of representing a realistic sample of real-world images. Many photographs contain multiple people of interest (see Fig. 11.1), and it is unclear whether single pose approaches generalize directly. We argue that the multi person case deserves more attention since it is an important real-world task.
Key challenges inherent to multi person pose estimation are the partial visibility of some people, significant overlap of bounding box regions of people, and the a-priori unknown number of people in an image. The problem thus is to infer the number of persons and to assign part detections to person instances while respecting geometric and appearance constraints. Most strategies use a two-stage inference process (Gkioxari et al., 2014; Sun and Savarese, 2011, Chapter 5) to first detect people and then independently estimate their poses. Such strategies are unsuited to cases where people are in close proximity, since they permit simultaneous assignment of the same body-part candidates to multiple people hypotheses.
As a principled solution for multi person pose estimation, a model is proposed that jointly estimates poses of all people present in an image by minimizing a joint objective. The formulation is based on partitioning and labeling an initial pool of body part candidates into subsets that correspond to sets of mutually consistent body-part candidates and abide by mutual consistency and exclusion constraints. The proposed method has a number of appealing properties. (1) The formulation is able to deal with an unknown number of people, and also infers this number by linking part hypotheses. (2) The formulation allows part hypotheses in the initial set of part candidates to be either deactivated or merged, hence effectively performing non-maximum suppression (NMS). In contrast to NMS performed on individual part candidates, the model incorporates evidence from all other parts, making the process more reliable. (3) The problem is cast in the form of an Integer Linear Program (ILP). Although the problem is NP-hard, the ILP formulation facilitates the computation of bounds and feasible solutions with a certified optimality gap.
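Property (1) can be illustrated with a small sketch of how the number of people follows from link decisions once they are given. This is not the ILP itself and not the optimization described later in the chapter: it assumes that some solver has already decided which candidates are active and which pairs are linked as belonging to the same person, and it only reads off the resulting clusters with a union-find structure. The function name `group_candidates` and its arguments are hypothetical.

```python
def group_candidates(active, links):
    """Group active part candidates into persons.

    active: list of bools, one per part candidate (False = deactivated).
    links:  iterable of (i, j) index pairs judged to belong to one person.
    Returns a list of clusters, each a sorted list of candidate indices.
    """
    parent = list(range(len(active)))

    def find(i):
        # Follow parent pointers to the cluster root, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in links:
        # A link only counts if both of its endpoints are still active.
        if active[i] and active[j]:
            parent[find(i)] = find(j)

    clusters = {}
    for i, is_active in enumerate(active):
        if is_active:
            clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy example: six candidates, candidate 3 was deactivated by the solver.
active = [True, True, True, False, True, True]
links = [(0, 1), (1, 2), (2, 3), (4, 5)]
people = group_candidates(active, links)
print(len(people), people)
```

In this toy input the link (2, 3) is ignored because candidate 3 is inactive, so the active candidates fall into two clusters. The number of people is therefore not an input but a by-product of the link decisions, which is the point made in property (1).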
This work makes the following contributions. The main contribution is the derivation of a joint detection and pose estimation formulation cast as an integer linear program. Further, two CNN variants are proposed to generate representative sets of body part candidates. Combined with the model, these obtain state-of-the-art results for both single-person and multi-person pose estimation on different datasets.
Figure 11.1: Method overview: (a) initial detections (= part candidates) and pairwise terms (graph) between all detections that (b) are jointly clustered belonging to one person (one colored subgraph = one person) and each part is labeled corresponding to its part class (different colors and symbols correspond to different body parts); (c) shows the predicted pose sticks.