Regression Methods - Machine Learning for Image Based Motion Capture

This section describes the regression methods that have been evaluated for recovering 3D human body pose from the silhouette shape descriptors described in § 3.2. The output pose is written as a real vector $\mathbf{x} \in \mathbb{R}^m$ and the input shape as a descriptor vector $\mathbf{z} \in \mathbb{R}^d$.

Adopting a standard regression framework, $\mathbf{x}$ is expressed as a function of $\mathbf{z}$. Note that due to the ambiguities of pose recovery from monocular silhouettes (i.e. a given silhouette may be produced by more than one underlying pose), the relationship between $\mathbf{z}$ and $\mathbf{x}$ may actually be non-functional. This issue, however, is postponed to chapters 4 and 5. For the moment, we assume that the relationship can be approximated functionally as a linear combination of a prespecified set of basis functions:

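$$\mathbf{x} = \sum_{k=1}^{p} \mathbf{a}_k\, \phi_k(\mathbf{z}) + \boldsymbol{\epsilon} \;\equiv\; \mathbf{A}\, \mathbf{f}(\mathbf{z}) + \boldsymbol{\epsilon} \tag{3.1}$$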

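Here, $\{\phi_k(\mathbf{z}) \mid k = 1 \ldots p\}$ are the basis functions, $\mathbf{a}_k$ are $\mathbb{R}^m$-valued weight vectors, and $\boldsymbol{\epsilon}$ is a residual error vector. For compactness, we gather the weight vectors into an $m \times p$ weight matrix $\mathbf{A} \equiv (\mathbf{a}_1\; \mathbf{a}_2\; \cdots\; \mathbf{a}_p)$ and the basis functions into an $\mathbb{R}^p$-valued function $\mathbf{f}(\mathbf{z}) = (\phi_1(\mathbf{z})\; \phi_2(\mathbf{z})\; \cdots\; \phi_p(\mathbf{z}))^\top$.

To allow for a constant offset $\mathbf{x} = \mathbf{A}\mathbf{f} + \mathbf{b}$, we can include $\phi(\mathbf{z}) \equiv 1$ in $\mathbf{f}$.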
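As a minimal sketch of the model (3.1), the following NumPy code evaluates $\mathbf{x} = \mathbf{A}\,\mathbf{f}(\mathbf{z})$ for one hypothetical choice of basis functions. The Gaussian kernel basis, the centre locations, the width `sigma` and all function names are assumptions made for illustration; this section does not fix a particular basis.

```python
import numpy as np

def feature_vector(z, centres, sigma=1.0):
    """Return f(z) = (phi_1(z), ..., phi_p(z), 1).

    Hypothetical basis: Gaussian bumps phi_k(z) = exp(-||z - c_k||^2 / (2 sigma^2)),
    plus a final entry fixed to 1 so that the last column of A acts as the offset b.
    """
    sq_dist = np.sum((centres - z) ** 2, axis=1)   # squared distance to each centre
    phi = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return np.append(phi, 1.0)

def predict(A, z, centres, sigma=1.0):
    """Evaluate the noise-free model x = A f(z)."""
    return A @ feature_vector(z, centres, sigma)

rng = np.random.default_rng(0)
d, m, p = 4, 3, 5                       # descriptor dim, pose dim, number of kernels
centres = rng.standard_normal((p, d))   # one centre per Gaussian basis function
A = rng.standard_normal((m, p + 1))     # p weight vectors plus one offset column
z = rng.standard_normal(d)
x = predict(A, z, centres)              # predicted pose vector of length m
```

Appending the constant entry keeps the offset inside the same matrix product, so the training procedures that follow need no special case for $\mathbf{b}$ apart from the damping issue noted for ridge regression.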

To train the model (estimate $\mathbf{A}$), we are given a set of training pairs $\{(\mathbf{x}_i, \mathbf{z}_i) \mid i = 1 \ldots n\}$. We use the Euclidean norm to measure x-space prediction errors, so the estimation problem is of the following form:

The motion capture data used in this chapter is in the ‘BioVision Hierarchy’ format and was taken from the public website www.ict.usc.edu/graphics/animWeb/humanoid.

$$\mathbf{A} := \arg\min_{\mathbf{A}} \left\{ \sum_{i=1}^{n} \|\mathbf{A}\, \mathbf{f}(\mathbf{z}_i) - \mathbf{x}_i\|^2 + R(\mathbf{A}) \right\} \tag{3.2}$$

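where $R(\cdot)$ is a regularizer on $\mathbf{A}$ that prevents overfitting. Gathering the training points into an $m \times n$ output matrix $\mathbf{X} \equiv (\mathbf{x}_1\; \mathbf{x}_2\; \cdots\; \mathbf{x}_n)$ and a $p \times n$ feature matrix $\mathbf{F} \equiv (\mathbf{f}(\mathbf{z}_1)\; \mathbf{f}(\mathbf{z}_2)\; \cdots\; \mathbf{f}(\mathbf{z}_n))$, the estimation problem takes the form: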

$$\mathbf{A} := \arg\min_{\mathbf{A}} \left\{ \|\mathbf{A}\,\mathbf{F} - \mathbf{X}\|^2 + R(\mathbf{A}) \right\} \tag{3.3}$$

where $\|\cdot\|$ denotes the Frobenius norm. Note that the dependence on $\{\phi_k(\cdot)\}$ and $\{\mathbf{z}_i\}$ is encoded entirely in the numerical matrix $\mathbf{F}$.
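The equivalence between the summed data term of (3.2) and the matrix form in (3.3) can be checked numerically. The sketch below uses random stand-in matrices instead of real pose data and leaves out the regularizer; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, n = 3, 6, 50
A = rng.standard_normal((m, p))   # weight matrix
F = rng.standard_normal((p, n))   # column i plays the role of f(z_i)
X = rng.standard_normal((m, n))   # column i plays the role of x_i

# Sum over training pairs of squared Euclidean prediction errors, as in (3.2).
per_sample = sum(np.sum((A @ F[:, i] - X[:, i]) ** 2) for i in range(n))

# Squared Frobenius norm of the residual matrix, as in (3.3).
frobenius = np.linalg.norm(A @ F - X, ord="fro") ** 2
```

Both quantities sum the same squared residual entries, once column by column and once over the whole matrix.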

3.4.1 Ridge Regression

Pose estimation is a high dimensional and intrinsically ill-conditioned problem, so simple least squares estimation (setting $R(\mathbf{A}) \equiv 0$ and solving for $\mathbf{A}$ in least squares) typically produces severe overfitting and hence poor generalization. To reduce this, we need to add a smoothness constraint on the learned mapping, for example by including a damping or regularization term $R(\mathbf{A})$ that penalizes large values in the coefficient matrix $\mathbf{A}$. Consider the simplest choice, $R(\mathbf{A}) \equiv \lambda \|\mathbf{A}\|^2$, where $\lambda$ is a regularization parameter. This gives the damped least squares or ridge regressor which minimizes

$$\|\mathbf{A}\,\tilde{\mathbf{F}} - \tilde{\mathbf{X}}\|^2 := \|\mathbf{A}\,\mathbf{F} - \mathbf{X}\|^2 + \lambda \|\mathbf{A}\|^2 \tag{3.4}$$

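where $\tilde{\mathbf{F}} \equiv (\mathbf{F} \;\; \sqrt{\lambda}\,\mathbf{I})$ and $\tilde{\mathbf{X}} \equiv (\mathbf{X} \;\; \mathbf{0})$; the damping block carries the factor $\sqrt{\lambda}$ so that its squared norm gives exactly the penalty $\lambda \|\mathbf{A}\|^2$ of (3.4). The solution can be obtained by solving the linear system $\mathbf{A}\,\tilde{\mathbf{F}} = \tilde{\mathbf{X}}$ (i.e. $\tilde{\mathbf{F}}^\top \mathbf{A}^\top = \tilde{\mathbf{X}}^\top$) for $\mathbf{A}$ in least squares⁵, using QR decomposition or the normal equations. Ridge solutions are not equivariant under relative scaling of input dimensions, so we usually scale the inputs to have unit variance before solving. $\lambda$ must be set large enough to control ill-conditioning and overfitting, but not so large as to cause overdamping (forcing $\mathbf{A}$ towards $\mathbf{0}$ so that the regressor systematically underestimates the solution). In practice, a suitable value of $\lambda$ is usually determined by cross validation.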
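A minimal NumPy sketch of the damped least squares solve follows. It uses a damping block of $\sqrt{\lambda}\,\mathbf{I}$ so that the penalty is exactly $\lambda\|\mathbf{A}\|^2$, and it compares the result with the normal-equations solution $\mathbf{A} = \mathbf{X}\mathbf{F}^\top(\mathbf{F}\mathbf{F}^\top + \lambda\mathbf{I})^{-1}$. The function names and the synthetic data are assumptions for illustration; input scaling, the undamped offset and the cross validation of $\lambda$ are left out.

```python
import numpy as np

def ridge_augmented(F, X, lam):
    """Minimise ||A F - X||^2 + lam * ||A||^2 via the augmented least squares system."""
    p = F.shape[0]
    m = X.shape[0]
    F_tilde = np.hstack([F, np.sqrt(lam) * np.eye(p)])   # p x (n + p)
    X_tilde = np.hstack([X, np.zeros((m, p))])           # m x (n + p)
    # Solve F_tilde^T A^T = X_tilde^T for A^T in the least squares sense.
    A_T, *_ = np.linalg.lstsq(F_tilde.T, X_tilde.T, rcond=None)
    return A_T.T

def ridge_normal_equations(F, X, lam):
    """Same minimiser obtained from the normal equations A (F F^T + lam I) = X F^T."""
    p = F.shape[0]
    G = F @ F.T + lam * np.eye(p)        # symmetric positive definite for lam > 0
    return np.linalg.solve(G, F @ X.T).T

rng = np.random.default_rng(2)
m, p, n = 3, 8, 40
F = rng.standard_normal((p, n))
X = rng.standard_normal((m, n))
A_aug = ridge_augmented(F, X, lam=0.5)
A_ne = ridge_normal_equations(F, X, lam=0.5)
```

The two routes agree up to rounding error. The augmented form avoids forming $\mathbf{F}\mathbf{F}^\top$ explicitly, which is generally the better conditioned option when $\mathbf{F}$ is close to rank deficient.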

3.4.2 Relevance Vector Regression

Relevance Vector Machines (RVMs) [150, 151] are a sparse Bayesian approach to classification and regression. They introduce Gaussian priors on each parameter or group of parameters, each prior being controlled by its own individual scale hyperparameter, and perform inference by integrating over the set of parameters A. Here we keep to the estimation form of (3.3) and adopt an alternate (MAP) approach.

⁵ If a constant offset $\mathbf{x} = \mathbf{A}\mathbf{f} + \mathbf{b}$ is included, $\mathbf{b}$ must not be damped, so the system takes the form $(\mathbf{A} \;\; \mathbf{b})\,\tilde{\mathbf{F}} = \tilde{\mathbf{X}}$ where $\tilde{\mathbf{F}} \equiv \begin{pmatrix} \mathbf{F} & \sqrt{\lambda}\,\mathbf{I} \\ \mathbf{1} & \mathbf{0} \end{pmatrix}$, with $\mathbf{1}$ and $\mathbf{0}$ denoting a row of ones and a row of zeros.

Figure 3.4: (a) The quadratic loss function used by ridge regression and our RVM algorithm, and (b) the ε-insensitive linear loss function used by the SVM.

Integrating out the hyperpriors (which can be done analytically) gives singular, highly nonconvex total priors of the form $p(\mathbf{a}) \sim \|\mathbf{a}\|^{-\nu}$ for each parameter or parameter group $\mathbf{a}$, where $\nu$ is a hyperprior parameter. Taking log likelihoods gives an equivalent regularization penalty of the form $R(\mathbf{a}) = \nu \log \|\mathbf{a}\|$. The gradient of this logarithmic penalty has magnitude $\nu / \|\mathbf{a}\|$, which grows without bound as $\mathbf{a}$ shrinks, so parameters that are already small are pushed strongly towards zero and unnecessary parameters are eliminated. The model produced is thus sparse and the RVM automatically selects the most ‘relevant’ basis functions to describe the problem. To solve for the complete matrix $\mathbf{A}$, we minimize the functional form given by

$$\|\mathbf{A}\,\mathbf{F} - \mathbf{X}\|^2 + \nu \sum_{k} \log \|\mathbf{a}_k\| \tag{3.5}$$

where $\mathbf{a}_k$ are the columns of $\mathbf{A}$ and $\nu$ is a regularization parameter that also controls sparsity. The minimization algorithm that we use for this is based on successively approximating the logarithmic term with quadratics, in effect solving a series of linear systems. The details of the algorithm and a discussion of its sparseness properties are given in appendix A. This is different from the original algorithm proposed in [151] and was not developed as a part of the work done in this thesis.
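The algorithm of appendix A is not reproduced here. As a rough illustration of the general idea of replacing the logarithmic term with a sequence of quadratics, the sketch below uses a standard majorize-minimize scheme: since $\log\|\mathbf{a}\|$ is concave in $\|\mathbf{a}\|^2$, it is bounded above by $\|\mathbf{a}\|^2 / (2\|\mathbf{a}_{\text{old}}\|^2)$ plus a constant, so each iteration reduces to a ridge-like linear system with one damping weight per column. This is an assumed stand-in and may differ from the thesis algorithm; the function name, the floor `eps`, the initialisation and the synthetic data are all hypothetical.

```python
import numpy as np

def sparse_log_penalty_fit(F, X, nu, n_iter=20, eps=1e-12):
    """Approximately minimise ||A F - X||^2 + nu * sum_k log ||a_k|| (columns a_k of A).

    Each pass replaces log ||a_k|| by the quadratic ||a_k||^2 / (2 s_k^2), where s_k is
    the current column norm, and solves the resulting linear system exactly.
    """
    p = F.shape[0]
    FFt = F @ F.T
    FXt = F @ X.T
    # Start from a lightly damped least squares solution.
    A = np.linalg.solve(FFt + 1e-3 * np.eye(p), FXt).T
    for _ in range(n_iter):
        sq_norms = np.sum(A ** 2, axis=0)                  # ||a_k||^2 for each column
        weights = nu / (2.0 * np.maximum(sq_norms, eps))   # heavy damping for small columns
        # Stationarity condition of the quadratic surrogate: A (F F^T + diag(w)) = X F^T.
        A = np.linalg.solve(FFt + np.diag(weights), FXt).T
    return A

rng = np.random.default_rng(3)
m, p, n = 3, 10, 200
A_true = np.zeros((m, p))
A_true[:, 0] = [1.0, 2.0, -1.0]        # only the first two basis functions matter
A_true[:, 1] = [0.5, -1.5, 2.0]
F = rng.standard_normal((p, n))
X = A_true @ F + 0.01 * rng.standard_normal((m, n))

A_hat = sparse_log_penalty_fit(F, X, nu=0.1)
column_norms = np.linalg.norm(A_hat, axis=0)   # near zero for the pruned columns
```

On this toy problem the columns belonging to the eight irrelevant basis functions are driven towards zero while the two informative columns are left almost undamped, which is the sparsifying behaviour described above.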

3.4.3 Support Vector Regression

A third method for regularized regression that we have tested uses the Support Vector Machine (SVM) [159], which is well known for its use in maximum-margin based classification.

The goal in support vector regression is to find a function that has at most $\epsilon$ deviation from the actually obtained targets $x_i$ for all the training data, and at the same time, is as flat as possible. In its standard formulation, the SVM assumes scalar outputs and hence works on individual components $x$ of the complete vector $\mathbf{x}$. As in ridge regression, flatness is ensured by minimizing the Euclidean norm of the weight matrix, but now separately for each row $\mathbf{a}$ of $\mathbf{A}$. Since the existence of an $\epsilon$-precision function is not guaranteed and some errors must be allowed, extra slack variables are introduced to ‘soften’ the constraints. The final formulation, as stated in [159], takes the following form for each output component $x$:

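$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\|\mathbf{a}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \\ \text{subject to} \quad & \begin{cases} x_i - \mathbf{a}\,\mathbf{f}(\mathbf{z}_i) \le \epsilon + \xi_i \\ \mathbf{a}\,\mathbf{f}(\mathbf{z}_i) - x_i \le \epsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases} \end{aligned} \tag{3.6}$$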


where $\mathbf{a}$ is the row corresponding to the scalar component $x$, $\xi_i, \xi_i^*$ are the slack variables and the constant $C > 0$ determines the trade-off between the function flatness and the amount up to which deviations larger than $\epsilon$ are tolerated. This formulation corresponds to dealing with a so-called ε-insensitive loss function $|\xi|_\epsilon$ described by

$$|\xi|_\epsilon := \begin{cases} 0 & \text{if } |\xi| \le \epsilon \\ |\xi| - \epsilon & \text{otherwise} \end{cases} \tag{3.7}$$

Figure 3.4 illustrates this function in comparison to the quadratic loss functions used by ridge regression and our approximated RVM algorithm. While points with a deviation in prediction less than $\epsilon$ do not contribute to the cost, deviations greater than $\epsilon$ are penalized in a linear fashion. This gives the SVM a greater degree of robustness to outliers than ridge regression and the RVM. The optimization problem (3.6) is usually solved in its dual form. We make use of the standard algorithm, details of which are available in [143].
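The ε-insensitive loss (3.7) and its relation to the constrained problem (3.6) can be made concrete with a few lines of NumPy. For a fixed row $\mathbf{a}$, the smallest feasible slacks are $\xi_i = \max(0,\, x_i - \mathbf{a}\,\mathbf{f}(\mathbf{z}_i) - \epsilon)$ and $\xi_i^* = \max(0,\, \mathbf{a}\,\mathbf{f}(\mathbf{z}_i) - x_i - \epsilon)$, so the objective of (3.6) equals $\tfrac{1}{2}\|\mathbf{a}\|^2 + C \sum_i |x_i - \mathbf{a}\,\mathbf{f}(\mathbf{z}_i)|_\epsilon$. The sketch below only evaluates these quantities; it is not a solver and is unrelated to the algorithm of [143]. The function names and toy numbers are assumptions for illustration.

```python
import numpy as np

def eps_insensitive(residual, eps):
    """|r|_eps: zero inside the tube |r| <= eps, linear outside it."""
    return np.maximum(np.abs(residual) - eps, 0.0)

def svr_primal_objective(a, F, x, C, eps):
    """Objective of (3.6) for one output component, using the smallest feasible slacks."""
    residual = x - a @ F                         # x_i - a f(z_i) for every training point
    xi = np.maximum(residual - eps, 0.0)         # slack where the prediction is too low
    xi_star = np.maximum(-residual - eps, 0.0)   # slack where the prediction is too high
    return 0.5 * a @ a + C * np.sum(xi + xi_star)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
linear_loss = eps_insensitive(residuals, eps=1.0)   # [2, 0, 0, 0, 2]
quadratic_loss = residuals ** 2                     # [9, 0.25, 0, 0.25, 9]

a = np.array([1.0, -1.0])
F = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])                     # columns play the role of f(z_i)
x = np.array([3.0, -1.0, 1.2])
objective = svr_primal_objective(a, F, x, C=10.0, eps=0.5)   # 1 + 10 * 1.5 = 16
```

A residual of 3 costs 2 under the ε-insensitive loss but 9 under the quadratic loss, which is the source of the robustness to outliers noted above.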
