Learning the Regression Models - Machine Learning for Image Based Motion Capture

The regression framework from the previous chapter that estimates 3D pose from silhouette shape descriptors is now extended to include a dynamical (autoregressive) model for handling ambiguities and producing smooth reconstructions over a stream of images. The new model, which we call discriminative tracking because it is fully conditional and avoids the construction of a generative model of image likelihoods, now involves two levels of regression — a dynamical model and an observation model.

The ambiguities persist for several frames so regressing the pose $x_t$ against a sequence of the last few silhouettes


4.2.1 Dynamical (Prediction) Model

Dynamical models, as the name suggests, are used to model the dynamics of a system. They are widely used in a variety of domains involving time-varying systems [83, 103] and several forms of such models have been proposed in the human tracking literature, e.g. [135, 108]. In this work, we model human body dynamics with a second order linear autoregressive process which assumes that the current state (pose in this case) can be expressed as a linear function of the states from the two previous time steps. The state at time $t$ is modeled as $x_t = \check{x}_t + \epsilon$, where $\check{x}_t \equiv \tilde{A}\, x_{t-1} + \tilde{B}\, x_{t-2}$ is the second order dynamical estimate of $x_t$ and $\epsilon$ is a residual error vector. Learning the regression directly in this form with regularization on the parameters $\tilde{A}$ and $\tilde{B}$ (see § 3.4), however, forces $\check{x}_t$ to 0 in the case of overdamping. To avoid such a situation, the solution can be forced to converge to a default linear prediction if the parameters are overdamped by learning the autoregression for $\check{x}_t$ in the following form:

$$\check{x}_t \equiv (I + A)(2x_{t-1} - x_{t-2}) + B\, x_{t-1} \qquad (4.1)$$

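where $I$ is the $m \times m$ identity matrix. $A$ and $B$ are estimated by regularized least squares regression against $x_t$, minimizing $\|\epsilon\|^2 + \lambda(\|A\|_{\mathrm{Frob}}^2 + \|B\|_{\mathrm{Frob}}^2)$ over the training set as described in section 3.4.1, and the regularization parameter $\lambda$ is set by cross-validation to give a well-damped solution with good generalization. When the regularization drives $A$ and $B$ towards zero, (4.1) reduces to the constant-velocity extrapolation $2x_{t-1} - x_{t-2}$, which is the default linear prediction referred to above.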
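The estimation of $A$ and $B$ in (4.1) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `fit_dynamics` and `predict_dynamics`, the one-row-per-frame layout of `poses`, and the closed-form ridge solve are choices made here, not the implementation used in this work. The sketch regresses the residual $x_t - (2x_{t-1} - x_{t-2})$, so that shrinking $A$ and $B$ pulls the prediction towards the constant-velocity extrapolation rather than towards zero.

```python
import numpy as np

def fit_dynamics(poses, lam):
    """Ridge estimate of A, B in x_t ~ (I + A)(2 x_{t-1} - x_{t-2}) + B x_{t-1}.

    poses: array of shape (T, m), one pose vector per frame (assumed layout).
    lam:   regularization weight lambda (must be > 0 in this sketch).
    """
    poses = np.asarray(poses, dtype=float)
    m = poses.shape[1]
    x_prev1 = poses[1:-1]              # x_{t-1}
    x_prev2 = poses[:-2]               # x_{t-2}
    x_curr = poses[2:]                 # x_t
    u = 2.0 * x_prev1 - x_prev2        # constant-velocity extrapolation
    S = np.hstack([u, x_prev1])        # regressors, shape (T-2, 2m)
    R = x_curr - u                     # what A and B still have to explain
    # Closed-form ridge solution; W stacks A^T on top of B^T.
    W = np.linalg.solve(S.T @ S + lam * np.eye(2 * m), S.T @ R)
    return W[:m].T, W[m:].T

def predict_dynamics(A, B, x_prev1, x_prev2):
    """Dynamical estimate of the current pose from the two previous poses."""
    m = A.shape[0]
    return (np.eye(m) + A) @ (2.0 * x_prev1 - x_prev2) + B @ x_prev1
```

For a training sequence that moves at exactly constant velocity, the residual is zero and the fit returns $A = B = 0$, so `predict_dynamics` falls back to the extrapolation alone.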

4.2.2 Observation (Correction) Model

Now consider the observation model. As discussed in § 4.1, the underlying density $p(x_t \mid z_t)$ is highly multimodal owing to the pervasive ambiguities in reconstructing 3D pose from monocular images, so no single-valued regression function $x_t = x_t(z_t)$ can give completely acceptable point estimates for $x_t$. However, much of the ‘glitchiness’ and jitter observed in the static reconstructions can be removed by feeding $\check{x}_t$ along with $z_t$ into the regression model.

A combined regressor $x_t = x_t(z_t, \check{x}_t)$ could be formulated in several ways. Linearly combining $\check{x}_t$ with an observation based estimate $x_t(z_t)$ such as that in (3.1) would only smooth the results, reducing jitter while still continuing to give wrong solutions when the original regressor returns a wrong estimate of $x_t$. So we build a state sensitive observation update by including a non-linear dependence on $\check{x}_t$ with $z_t$ in the observation-based regressor, i.e. we construct basis functions of the form $\phi(\check{x}_t, z_t)$. Our full regression model also includes an explicit linear $\check{x}_t$ term to represent the direct contribution of the dynamics to the overall state estimate, so the final model becomes $x_t \equiv \hat{x}_t + \epsilon'$ where $\epsilon'$ is a residual error to be minimized, and:

$$\hat{x}_t = C\,\check{x}_t + \sum_{k=1}^{p} d_k\, \phi_k(\check{x}_t, z_t) \equiv \begin{pmatrix} C & D \end{pmatrix} \begin{pmatrix} \check{x}_t \\ f(\check{x}_t, z_t) \end{pmatrix} \qquad (4.2)$$

Here, $\{\phi_k(x, z) \mid k = 1 \ldots p\}$ is a set of scalar-valued nonlinear basis functions for the regression, and $d_k$ are the corresponding $\mathbb{R}^m$-valued weight vectors. For compactness, we gather these into an $\mathbb{R}^p$-valued feature vector $f(x, z) \equiv (\phi_1(x, z), \ldots, \phi_p(x, z))^\top$ and an $m \times p$ weight matrix $D \equiv (d_1, \ldots, d_p)$. $C$ is an $m \times m$ coefficient matrix that controls the weight of the dynamical prediction term. The final minimization of $\epsilon'$ involves regularization terms for both $C$ and $D$.
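The stacked form on the right of (4.2) means that $C$ and $D$ can be fitted in a single linear solve once the dynamical predictions and basis-function values are tabulated. The sketch below uses plain ridge regression for this. The function names, the row-per-frame array layout, and the use of two separate weights `lam_c` and `lam_d` are assumptions for illustration; the text states only that both $C$ and $D$ are regularized.

```python
import numpy as np

def fit_observation_model(x_check, feats, x_true, lam_c, lam_d):
    """Ridge fit of C (m x m) and D (m x p) in x_hat = C x_check + D f.

    x_check: (T, m) dynamical predictions, one row per frame.
    feats:   (T, p) basis-function values f(x_check_t, z_t).
    x_true:  (T, m) training poses.
    lam_c, lam_d: hypothetical separate regularization weights for C and D.
    """
    m, p = x_check.shape[1], feats.shape[1]
    G = np.hstack([x_check, feats])                     # stacked regressors
    reg = np.diag(np.concatenate([np.full(m, lam_c),    # penalty on C
                                  np.full(p, lam_d)]))  # penalty on D
    W = np.linalg.solve(G.T @ G + reg, G.T @ x_true)    # shape (m + p, m)
    return W[:m].T, W[m:].T

def correct_pose(C, D, x_check_t, f_t):
    """Evaluate (4.2) for one frame."""
    return C @ x_check_t + D @ f_t
```

With negligible regularization and noise-free synthetic data generated from known matrices, this fit recovers those matrices, which is a convenient sanity check before using real features.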

We find that a global model in the form of a single second-order autoregressive process suffices for the kind of motions studied here. A more sophisticated dynamical model based on a mixture of such processes that is capable of tracking through changing aspects of motion and appearance is described in chapter 7.

[Figure 4.2 plots. Left panel: tracking results for the left hip angle (in degrees) and the true value of this angle, plotted against time. Right panel: kernel bases plotted against time.]

Figure 4.2: An example of mistracking caused by an over-narrow pose kernel $K_x$. The kernel width is set to 1/10 of the optimal value, causing the tracker to lose track from about t=120, after which the state estimate drifts away from the training region and all kernels stop firing by about t=200. (Left) The variation of a left hip angle parameter for a test sequence of a person walking in a spiral. (Right) The temporal activity of the 120 kernels (training examples) during this track. The banded pattern occurs because the kernels are samples taken from along a similar 2.5 cycle spiral walking sequence, each circuit involving about 8 steps. The similarity between adjacent steps and between different circuits is clearly visible, showing that the regressor can locally still generalize well.

For the experiments, we use instantiated-kernel bases that measure similarity in both components $x$ and $z$:

$$\phi_k(x, z) = K_x(x, x_k) \cdot K_z(z, z_k) \qquad (4.3)$$

where $(x_k, z_k)$ is a training example and $K_x$, $K_z$ are independent Gaussian kernels on x-space and z-space, $K_x(x, x_k) = e^{-\beta_x \|x - x_k\|^2}$ and $K_z(z, z_k) = e^{-\beta_z \|z - z_k\|^2}$. Using Gaussian kernels in the combined $(x, z)$ space makes examples relevant only if they have similar image silhouettes and similar underlying poses to training examples. This overcomes the weakness of the original model (3.1) by preventing an ambiguous silhouette from being matched to a similar looking silhouette with a different underlying 3D pose, and thus is able to resolve the ambiguities.
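As an illustration of (4.3), the following sketch evaluates the feature vector $f(x, z)$ for one query against a set of training examples. The function name `kernel_features` and the array layout are assumptions made here. The kernels are taken to be Gaussians in the squared Euclidean distance, as the term "Gaussian kernel" implies.

```python
import numpy as np

def kernel_features(x, z, x_train, z_train, beta_x, beta_z):
    """Feature vector f(x, z) with phi_k = K_x(x, x_k) * K_z(z, z_k).

    x:       (m,) pose, e.g. the dynamical prediction.
    z:       (d,) silhouette shape descriptor.
    x_train: (p, m) training poses x_k.
    z_train: (p, d) training descriptors z_k.
    """
    dx2 = np.sum((x_train - x) ** 2, axis=1)   # squared pose distances
    dz2 = np.sum((z_train - z) ** 2, axis=1)   # squared descriptor distances
    return np.exp(-beta_x * dx2) * np.exp(-beta_z * dz2)
```

If two training examples share the same descriptor but have different poses, a query with that descriptor and a pose close to the first example activates mainly the first basis function. This is the mechanism by which the joint kernel selects one branch of an ambiguous silhouette.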

4.2.3 Parameter Settings

The matrices $C$ and $D$ in the model (4.2) are normally estimated using ridge or Relevance Vector Machine regression, and the kernel width parameters $\beta_x$ and $\beta_z$ in $\phi_k$ are set empirically using cross validation. The parameter $\beta_x$, however, has a marked influence on the system: if it is set so that the pose kernel $K_x$ becomes too narrow, the tracker suffers ‘extinction’ (Figure 4.2). Also, analysing how performance changes with the dynamical model coefficient $C$ gives useful insight into the role played by the linear dynamical term in the model. Both cases are discussed individually below.

A. Mistracking due to extinction

Kernelization in joint $(x, z)$ space allows the relevant branch of the inverse solution to be chosen, but it is essential to choose the relative widths of the kernels appropriately. If the x-kernel is chosen too wide, the method tends to average over (or zig-zag between) several alternative pose-from-observation solutions, which defeats the purpose of including $\check{x}$ in the observation regression.
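The effect of the pose kernel width can be seen in a toy example. This is a hypothetical one-dimensional setup, not data from the experiments: two training poses share the same silhouette, so the z-kernel contributes the same factor to both and only $K_x$ can tell them apart. The name `pose_kernel` and all numerical values are assumptions, and $K_x$ is taken as $e^{-\beta_x \|x - x_k\|^2}$, so a larger $\beta_x$ means a narrower kernel.

```python
import numpy as np

# Hypothetical 1-D poses of two training examples with identical silhouettes.
x_train = np.array([[0.0], [4.0]])
# Dynamical prediction, slightly off the first training pose.
x_query = np.array([0.5])

def pose_kernel(beta_x):
    """Pose-kernel activations K_x(x_query, x_k) for both training examples."""
    d2 = np.sum((x_train - x_query) ** 2, axis=1)
    return np.exp(-beta_x * d2)

# Sweep from a very wide kernel to a very narrow one.
for beta_x in (0.001, 0.5, 100.0):
    k = pose_kernel(beta_x)
    print(f"beta_x={beta_x:g}  activations={k}  total={k.sum():.3g}")
```

With the widest kernel both examples fire almost equally, so the regressor would average over the two branches. With the intermediate setting the nearby branch dominates. With the narrowest kernel both activations are effectively zero even though the query is close to a training pose, which corresponds to the situation in Figure 4.2 where all kernels stop firing.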

