Articulated 3D Pose Estimation - Monocular 3d Object Recognition

Considerable research has addressed the challenge of 3D human motion capture from video (Brubaker et al., 2010; Moeslund et al., 2006; Sminchisescu, 2007). Early research on monocular 3D pose estimation in videos largely centred on incremental frame-to-frame pose tracking, e.g., (Bregler and Malik, 1998; Sigal et al., 2012; Sminchisescu and Triggs, 2003). These approaches rely on a given pose and dynamic model to constrain the pose search space. Their notable drawbacks are that the initialization must be provided and that they cannot recover from tracking failures. To address these limitations, more recent approaches have cast the tracking problem as one of data association across frames, i.e., “tracking-by-detection”, e.g., (Andriluka et al., 2010). Here, candidate poses are first detected in each frame, and a subsequent linking process attempts to establish temporally consistent poses.
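The linking step of tracking-by-detection can be illustrated with a small dynamic program. The following is only a minimal sketch under stated assumptions, not the method of any of the cited works: the function name `link_poses`, the squared-distance smoothness term and the `smooth_weight` parameter are all hypothetical choices made for illustration.

```python
import numpy as np

def link_poses(candidates, scores, smooth_weight=1.0):
    """Select one candidate pose per frame with a Viterbi-style dynamic program.

    candidates: list over frames; candidates[t] has shape (K_t, D) and holds
                K_t candidate poses as flattened joint coordinates.
    scores:     list over frames; scores[t] has shape (K_t,) with the
                detector confidence of each candidate.
    Returns a list with the index of the selected candidate in every frame.
    """
    n_frames = len(candidates)
    # best[t][k]: best accumulated score of any path ending at candidate k of frame t
    best = [np.asarray(scores[0], dtype=float)]
    back = []  # back[t][k]: best predecessor in frame t for candidate k of frame t + 1
    for t in range(1, n_frames):
        prev = np.asarray(candidates[t - 1], dtype=float)
        cur = np.asarray(candidates[t], dtype=float)
        # Squared distance between every pair of consecutive candidates: (K_{t-1}, K_t)
        dist = ((prev[:, None, :] - cur[None, :, :]) ** 2).sum(axis=2)
        total = best[-1][:, None] - smooth_weight * dist
        back.append(total.argmax(axis=0))
        best.append(total.max(axis=0) + np.asarray(scores[t], dtype=float))
    # Backtrack from the best final candidate
    path = [int(best[-1].argmax())]
    for t in range(n_frames - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

With `smooth_weight=0` the program simply picks the highest-scoring detection in each frame; larger values trade detector confidence for temporal consistency.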

Another strand of research has focused on methods that predict 3D poses by searching a database of exemplars (Jiang, 2010; Mori and Malik, 2006; Shakhnarovich et al., 2003) or via a discriminatively learned mapping from the image, or from image features, to human joint locations (Agarwal and Triggs, 2006; Ionescu et al., 2014; Salzmann and Urtasun, 2010; Tekin et al., 2015; Yu et al., 2013). Recently, deep convolutional networks (CNNs) have emerged as a common element behind many state-of-the-art approaches, including those for human pose estimation, e.g., Li and Chan (2014); Li et al. (2015); Tompson et al. (2014); Toshev and Szegedy (2014). Here, two general approaches can be distinguished. The first casts the pose estimation task as a joint location regression problem from the input image (Li and Chan, 2014; Li et al., 2015; Toshev and Szegedy, 2014). The second uses a CNN architecture for body part detection (Chen and Yuille, 2014; Jain et al., 2014; Pfister et al., 2015; Tompson et al., 2014) and then typically enforces the 2D spatial relationships between body parts as a subsequent processing step. Similar to the latter approaches, the proposed approach uses a CNN-based architecture to regress confidence heat maps of 2D joint position predictions.
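To make the notion of a confidence heat map concrete, the sketch below reads a point estimate for each joint off a stack of heat maps by taking the per-joint maximum. This is one common readout, shown here as an assumption for illustration; it is not necessarily how the proposed approach uses the heat maps, and the function name `joints_from_heatmaps` and the `(J, H, W)` array layout are hypothetical.

```python
import numpy as np

def joints_from_heatmaps(heatmaps):
    """Convert per-joint heat maps into 2D joint locations.

    heatmaps: array of shape (J, H, W), one confidence map per joint.
    Returns (coords, conf): coords has shape (J, 2) with (x, y) pixel
    positions of each map's maximum, conf has shape (J,) with the peak values.
    """
    n_joints, height, width = heatmaps.shape
    flat = heatmaps.reshape(n_joints, -1)
    idx = flat.argmax(axis=1)                      # flat index of each peak
    ys, xs = np.unravel_index(idx, (height, width))
    coords = np.stack([xs, ys], axis=1)
    conf = flat[np.arange(n_joints), idx]
    return coords, conf
```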

recovering 3D non-rigid shapes from image sequences captured with a single camera (Akhter et al., 2011; Bregler et al., 2000; Cho et al., 2015; Dai et al., 2012; Zhu et al., 2014c), i.e., non-rigid structure from motion (NRSFM), and human pose recovery models based on known skeletons (Lee and Chen, 1985; Park and Sheikh, 2011; Taylor, 2000; Valmadre and Lucey, 2010) or sparse representations (Akhter and Black, 2015; Fan et al., 2014; Ramakrishna et al., 2012; Zhou et al., 2015b,c). Much of this work has been realized by assuming manually labeled 2D joint locations; however, some recent work has used a 2D pose detector to automatically provide the input joints (Wang et al., 2014) or has solved the correspondence problem by matching a spatio-temporal pose model to candidate trajectories extracted from a video (Zhou and la Torre, 2014).

Part II

Chapter 3 Discriminative Learning

In this chapter, we briefly cover the basics of discriminative learning and introduce two successful methods on which this thesis is based.

In the machine learning literature, classifiers generally fall into two categories: generative classifiers and discriminative classifiers. Generative classifiers learn the joint distribution $p(y, x)$ of the input $x$ and the label $y$. Combined with the marginal distribution $p(x)$ of the input, the posterior distribution $p(y \mid x)$ is then derived using Bayes' rule. Discriminative classifiers learn the posterior distribution $p(y \mid x)$ directly by modeling it with a parametric model and optimizing the parameters on a training set. In the seminal work of Ng and Jordan (2002), the two types of classifiers are compared, and the results suggest that discriminative models have lower asymptotic error given large training data.
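For reference, the route a generative classifier takes from the joint distribution to a prediction can be written out as follows. The factorization into a class-conditional term $p(x \mid y)$ and a label prior $p(y)$, the sum over a discrete label set, and the decision rule $\hat{y}$ are standard textbook forms added here for illustration; they are not notation taken from the original text.

```latex
p(y \mid x) \;=\; \frac{p(y, x)}{p(x)}
          \;=\; \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')},
\qquad
\hat{y} \;=\; \arg\max_{y}\; p(y \mid x).
```

A discriminative classifier skips the numerator and denominator entirely and parameterizes the left-hand side directly.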

In recent years, discriminative models have taken large strides in computer vision as more large-scale datasets annotated with class labels, object bounding boxes and even detailed segmentation masks have become available. Several discriminative models have quickly become very successful in object recognition. Among them are Support Vector Machines (Cortes and Vapnik, 1995; Vapnik and Kotz, 1982) and Deep Neural Networks (Krizhevsky et al., 2012; LeCun et al., 1998), which underpin the latest advances in the field.

3.1 Support Vector Machine

One particularly successful discriminative model is the Support Vector Machine (SVM), with many applications such as document classification and object classification. The SVM originated in Vapnik's development of statistical learning theory (Vapnik and Kotz, 1982) and was later extended to a form close to its current one (Cortes and Vapnik, 1995). For simplicity, we only discuss the linear SVM. The non-linear SVM extends the linear case and builds on the idea of kernel methods, which is beyond the scope of this thesis.

The linear SVM learns a separating hyperplane $w^T x + b = 0$ from labeled examples $D = \{\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle\}$, where $y_i \in \{-1, 1\}$. In order to achieve robustness to noise and gain better generalization to unseen data, the SVM maximizes the margin of the separating hyperplane to the examples. Formally, the linear SVM can be formulated as the following optimization problem,

$$\min_{w, b, \xi \ge 0} \; \frac{1}{2} w^T w + C \sum_i \xi_i \qquad (3.1)$$

$$\text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n, \qquad (3.2)$$

where $\xi_i$ is the slack variable, introduced to penalize examples that violate the margin (including incorrectly classified ones), and $C$ is the weight on the penalty cost.
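As a rough illustration of the primal problem, the slack variables can be eliminated by noting that at the optimum $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$, which leaves an unconstrained hinge-loss objective. The sketch below minimizes it with plain subgradient descent. This solver, the function names and the step size `lr` and `epochs` defaults are assumptions chosen for illustration; they are not the solvers discussed in this chapter.

```python
import numpy as np

def svm_objective(X, y, w, b, C=1.0):
    """Value of (3.1) with each slack set to its optimal value, the hinge loss."""
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * slack.sum()

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize the primal linear SVM objective by subgradient descent.

    X: (n, d) array of examples; y: (n,) array of labels in {-1, +1}.
    Returns the hyperplane parameters (w, b).
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violated = margins < 1  # examples with non-zero slack
        # Subgradient of 0.5 * w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))
        grad_w = w - C * (X[violated].T @ y[violated])
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

A new example `x` is then classified by the sign of `x @ w + b`.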

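By setting the gradient of the corresponding Lagrangian with respect to $w$, $b$ and $\xi$ to zero and substituting the result back, we can derive the equivalent dual problem,

$$\max_{\alpha \ge 0} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (3.3)$$

$$\text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \le C, \quad i = 1, \ldots, n, \qquad (3.4)$$

where $\alpha_i$ is the Lagrange multiplier for the inequality constraint (3.2) in the primal form.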
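The step from the primal to the dual can be sketched as follows. The multipliers $\mu_i \ge 0$ for the constraints $\xi_i \ge 0$ are introduced here for illustration only and do not appear in the original text.

```latex
\begin{aligned}
L(w, b, \xi, \alpha, \mu)
  &= \tfrac{1}{2} w^T w + C \sum_i \xi_i
     - \sum_i \alpha_i \bigl[ y_i (w^T x_i + b) - 1 + \xi_i \bigr]
     - \sum_i \mu_i \xi_i, \\
\frac{\partial L}{\partial w} = 0
  &\;\Rightarrow\; w = \sum_i \alpha_i y_i x_i, \\
\frac{\partial L}{\partial b} = 0
  &\;\Rightarrow\; \sum_i \alpha_i y_i = 0, \\
\frac{\partial L}{\partial \xi_i} = 0
  &\;\Rightarrow\; C - \alpha_i - \mu_i = 0
   \;\Rightarrow\; 0 \le \alpha_i \le C .
\end{aligned}
```

Substituting $w = \sum_i \alpha_i y_i x_i$ back into $L$ cancels the terms in $b$ and $\xi$ and leaves the dual objective, which depends on the data only through the inner products $x_i^T x_j$.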

Figure 3.1: GoogLeNet, designed by Szegedy et al. (2015): an example of a modern deep convolutional neural network with many layers of convolutions, where the blue boxes are the convolution layers.

The resulting dual form of the SVM is a quadratic optimization problem that is easier to solve than the primal problem. The optimal solution can be characterized by the KKT conditions (Boyd and Vandenberghe, 2004). Both specialized solvers, e.g., SMO (Platt et al., 1998), and generic primal-dual optimization algorithms, e.g., ADMM (Boyd et al., 2011), have been designed to tackle the problem, and we refer curious readers to the excellent tutorial by Burges (1998) for more details.
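To show how the KKT conditions connect a dual solution back to the hyperplane, the sketch below recovers $w$ and $b$ from a given vector of multipliers. It assumes the dual has already been solved by some other means; the function name `primal_from_dual` and the tolerance `tol` are hypothetical and used only for illustration.

```python
import numpy as np

def primal_from_dual(alpha, X, y, C, tol=1e-8):
    """Recover (w, b) of a linear SVM from a dual solution alpha.

    alpha: (n,) Lagrange multipliers; X: (n, d) examples;
    y: (n,) labels in {-1, +1}; C: penalty weight of the primal problem.
    """
    alpha = np.asarray(alpha, dtype=float)
    # Stationarity: w = sum_i alpha_i * y_i * x_i
    w = X.T @ (alpha * y)
    # Complementary slackness: examples with 0 < alpha_i < C lie exactly on
    # the margin, so y_i (w.x_i + b) = 1, i.e. b = y_i - w.x_i
    on_margin = (alpha > tol) & (alpha < C - tol)
    if not on_margin.any():
        raise ValueError("no example with 0 < alpha_i < C to determine b")
    b = float(np.mean(y[on_margin] - X[on_margin] @ w))
    return w, b
```

For the two-point problem with $x_1 = (1, 0)$, $y_1 = 1$ and $x_2 = (-1, 0)$, $y_2 = -1$, the dual objective reduces to $2\alpha - 2\alpha^2$ with $\alpha_1 = \alpha_2 = \alpha$, so $\alpha = 0.5$ and the recovered hyperplane is $w = (1, 0)$, $b = 0$.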
