Model Formulation - Machine Learning for Image Based Motion Capture

So far we have seen the form of the conditional density p(x|z) for multimodal pose estimation. To work in a completely probabilistic setting, we would also need to estimate the density for p(z) to allow us to measure the reliability of an observation. In this section, we see how both of these can actually be modeled using a single density estimation algorithm.

Besides the small number of typically possible reconstructions, there are other attributes that can potentially be captured by latent variables. For example, inter-person variations may also be considered to be discrete, in the form of a finite number of ‘person classes’.

Figure 5.1: (Left) Initial clusters in Ψ(z) obtained by running k-means with k=12. The plot shows a projection onto the first 3 kernel principal components, with the different clusters colour-coded. (Right) 3 connected components are obtained for one of these clusters, as seen on the neighbourhood graph of the corresponding points in x. This cluster is thus split into 3 sub-clusters to separate the different pose subclasses that it contains. Of the 12 initial clusters in Ψ(z), we find that 3 get split into 2 sub-clusters each and 2 into 3 sub-clusters each based on this connectivity analysis. A few of these merge into others during the EM process, giving a final model consisting of ∼20 clusters.

5.3.1 Manifold learning and Clustering

Given the nonlinearities in the mapping from z to x, we first identify a reduced manifold within the input feature space z on which the local mappings can be approximated with linear functions. Any nonlinear dimensionality reduction technique may be used for this purpose, e.g. [122, 148]. Here, we perform a kernel PCA [126] to obtain a reduced representation Ψ(z) for the input z (see footnote 2). We can imagine Ψ(z) as the coordinates of the silhouette descriptor on a manifold that is folded over onto itself due to many-to-one projection mappings. To allow for multimodal output distributions, the mapping to the output space is now learned as a mixture of linear regressors on the reduced space Ψ(z). Each of the regressors is thus modeled as

$$ x = r_k(z) + \epsilon_k \equiv A_k \Psi(z) + b_k + \epsilon_k \qquad (5.4) $$

where A_k and b_k are coefficients to be estimated and ε_k is the uncertainty associated with the regressor, having a constant covariance Λ_k independent of z.
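The reduction to Ψ(z) and the fit of a single regressor of the form (5.4) can be sketched as follows. This is a minimal illustration and not the method of the thesis: the Gaussian kernel, its width `gamma`, the plain least-squares fit (used here in place of the regression methods of chapter 3) and the function names `kernel_pca` and `fit_linear_regressor` are all assumptions made for the example.

```python
import numpy as np

def kernel_pca(Z, n_components, gamma):
    """Embed the rows of Z with kernel PCA (Gaussian kernel assumed).

    Returns an array Psi of shape (n_samples, n_components).
    """
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    K = np.exp(-gamma * np.maximum(d2, 0.0))
    n = K.shape[0]
    # Centre the kernel matrix in feature space.
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    Kc = J @ K @ J
    evals, evecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:n_components]
    evals, evecs = evals[order], evecs[:, order]
    # Projections of the training points onto the leading components.
    return evecs * np.sqrt(np.maximum(evals, 0.0))

def fit_linear_regressor(Psi, X):
    """Least-squares fit of x = A Psi + b + eps for one cluster.

    Returns A, b and the residual covariance (playing the role of Lambda_k).
    """
    n = Psi.shape[0]
    design = np.hstack([Psi, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    A, b = coef[:-1].T, coef[-1]
    resid = X - (Psi @ A.T + b)
    Lam = resid.T @ resid / n
    return A, b, Lam
```

In a mixture setting, `fit_linear_regressor` would be called once per cluster on the rows of Ψ(z) and x assigned to that cluster, so that each component gets its own A_k, b_k and Λ_k.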

The complete learning process takes place in an iterative framework based on the Expectation Maximization (EM) algorithm [35], which guarantees convergence to a local optimum of the likelihood but relies on good initialization for attaining a globally optimal solution. The key to successful learning is thus to clearly separate the ambiguous cases into different mixture components (clusters) at initialization. Otherwise the individual regressors tend to average over several possible solutions. For this, we first use k-means to divide the KPCA-reduced space Ψ(z) into several clusters. (This corresponds to performing a spectral clustering in the original space z [102].) Each of these clusters is then split into sub-clusters by making use of the corresponding x values (which we assume to encode the true distance between points), exploiting the fact that silhouettes appearing similar in Ψ(z) can be disambiguated based on the distance between their corresponding 3D poses. This is achieved by constructing a neighbourhood graph in x that has an edge between all points within a thresholded distance from one another, and robustly identifying connected components in this graph for each cluster in Ψ(z). An example illustrating the process is shown in figure 5.1. We find that this two-step clustering separates most ambiguous cases and gives better performance than the other initialization methods that we tested. For example, in terms of final reconstruction errors on a test set after EM-based learning (see below), clustering in either x alone or jointly in (x, Ψ(z)) is found to give reconstruction errors higher by 0.3 degrees on average, while clustering in Ψ(z) alone shows several instances of averaging across multiple solutions owing to the inability to resolve the ambiguities, also increasing the average error.

Footnote 2: The manifold projection shown in chapter 1 (figure 1.3) was actually obtained by using KPCA to embed z into a 3-dimensional space. In practice, the dimensionality is much larger than 3, as will be seen later in this chapter.

Figure 5.2: An illustration of the density estimation / regression mixture model used to estimate the conditional density p(x | z). (Labels in the figure: the joint density p(x, z), a regressor x = A_k z + b_k with its covariance, and the conditional density p(x | z = z_t) at an observation z = z_t.)
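The second step of the initialization in section 5.3.1, splitting each cluster of Ψ(z) according to connectivity in x, can be sketched as below. This is only an illustration under stated assumptions: the function names and the `threshold` parameter are hypothetical, the initial `labels` are taken as given (for instance from any k-means implementation), and plain connected components are used, without whatever robustness measures the original procedure applies.

```python
import numpy as np

def connected_components(X, threshold):
    """Label the connected components of the graph that links every pair
    of rows of X lying within `threshold` (Euclidean distance)."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    adj = np.sqrt((diff ** 2).sum(axis=-1)) <= threshold
    comp = np.full(n, -1)
    current = 0
    for start in range(n):
        if comp[start] != -1:
            continue
        comp[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            # Unvisited neighbours of i join the current component.
            for j in np.flatnonzero(adj[i] & (comp == -1)):
                comp[j] = current
                stack.append(j)
        current += 1
    return comp

def split_clusters(labels, X, threshold):
    """Split each cluster (given by integer `labels`, e.g. from k-means in
    Psi(z)) into sub-clusters that are connected in the pose space x."""
    new_labels = np.empty_like(labels)
    next_id = 0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        comp = connected_components(X[idx], threshold)
        new_labels[idx] = next_id + comp
        next_id += comp.max() + 1
    return new_labels
```

For example, a cluster whose poses form two groups separated by more than `threshold` receives two distinct sub-cluster labels, while a cluster whose poses are all mutually reachable keeps a single label.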

5.3.2 Expectation-Maximization

Having obtained a set of clusters, each of which is known to be free of multivaluedness, the individual regressors can be learned directly using the methods described in chapter 3. For the outputs of these regressors to be combined probabilistically, however, several other components need to be learned: the likelihood of a given observation p(z), the probability p(l = k | z) that the solution from the kth regressor is correct, and the uncertainty Λ_k associated with each regressor. All these are obtained by using the initial clusters to fit a mixture of Gaussians to the joint density of (Ψ(z), x):

$$ \begin{pmatrix} \Psi(z) \\ x \end{pmatrix} \simeq \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mu_k, \Gamma_k) \qquad (5.5) $$

where π_k are the gating probabilities p(l = k) of the respective classes and µ_k, Γ_k are their means and covariances. Combining the regression model defined in (5.4) with this density model now gives the following relations for these quantities:

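$$ \mu_k = \begin{pmatrix} \Psi(\bar{z}_k) \\ r_k(\bar{z}_k) \end{pmatrix}, \qquad \Gamma_k = \begin{pmatrix} \Sigma_k & \Sigma_k A_k^\top \\ A_k \Sigma_k & A_k \Sigma_k A_k^\top + \Lambda_k \end{pmatrix} \qquad (5.6) $$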
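The structure of (5.6) can be checked numerically: conditioning one joint Gaussian component on Ψ(z) should return the regressor A_k Ψ(z) + b_k as the mean and Λ_k as the covariance. The sketch below does this and also evaluates the mixture density of an observation and the posterior p(l = k | z) from the Ψ(z)-marginals, using the standard Gaussian mixture identities. It assumes that Σ_k is the covariance of Ψ(z) within cluster k and that Ψ(z̄_k) is the cluster mean in the reduced space, which is our reading of the notation rather than something stated in this section. All function names are hypothetical, and the density is computed over Ψ(z) rather than over z itself.

```python
import numpy as np

def joint_component(psi_bar, Sigma, A, b, Lam):
    """Mean and covariance of one joint Gaussian over (Psi(z), x), as in (5.6)."""
    mu = np.concatenate([psi_bar, A @ psi_bar + b])
    cross = A @ Sigma
    Gamma = np.block([[Sigma, cross.T],
                      [cross, A @ Sigma @ A.T + Lam]])
    return mu, Gamma

def condition_on_psi(mu, Gamma, psi):
    """Mean and covariance of x given Psi(z) = psi for one joint Gaussian."""
    d = psi.shape[0]
    mu_p, mu_x = mu[:d], mu[d:]
    G_pp, G_xp, G_xx = Gamma[:d, :d], Gamma[d:, :d], Gamma[d:, d:]
    gain = np.linalg.solve(G_pp, G_xp.T).T  # equals G_xp @ inv(G_pp)
    return mu_x + gain @ (psi - mu_p), G_xx - gain @ G_xp.T

def gaussian_pdf(v, mean, cov):
    """Density of a multivariate Gaussian evaluated at v."""
    diff = v - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** v.shape[0] * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def gating(psi, pis, psi_bars, Sigmas):
    """Mixture density at psi and the posterior probability of each component."""
    w = np.array([p * gaussian_pdf(psi, m, S)
                  for p, m, S in zip(pis, psi_bars, Sigmas)])
    total = w.sum()
    return total, w / total
```

With these pieces, a multimodal prediction for an observation would be the set of per-component conditional means, each weighted by its posterior probability from `gating`.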
