2.3 Graph Based Image Segmentation and Tracking
2.3.3 Laplacian Matrix and Tracking
The Laplacian matrix built from a finite sample of data points is generally believed to approach a continuous Laplacian operator asymptotically, provided the sample size increases. The Laplace filter, on the other hand, uses finite differences to approximate the Laplacian; it is often used to detect large variations in an image and serves as an edge detector in image processing. Depending on the definition of the neighbourhood, both Laplacians capture the similarity or dissimilarity within neighbourhoods.
By definition, the Laplacian matrix is equal to the degree matrix minus the similarity matrix, $L = D - W$, so each of its rows sums to zero. This has a meaningful interpretation in the image segmentation literature. If $F$ denotes an estimated segmentation label and $L$ denotes the Laplacian matrix, then the regularisation term $F^T L F$ requires any segmentation label $F$ to be consistent with the neighbourhood smoothness encoded in $L$. Otherwise, each element $i$ incurs the residual cost $F_i d_{ii} F_i - \sum_{k \in \mathcal{N}(i)} F_i w_{ik} F_k$, where $\mathcal{N}(i)$ denotes the neighbours of $i$. If all elements of $F$ have the same label, the residual cost is minimised and equal to zero. However, this is not the desired image segmentation result. Hence, prior knowledge is added via an extra term, called the data term, to constrain the segmentation result. Image segmentation thus becomes a problem that uses prior knowledge as a clue while conforming to the naturally grouped regions in the image. The optimal segmentation is found when the sum of the data term and the smoothness term is minimised.
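The properties above can be checked numerically on a small graph. The following is a minimal sketch, assuming a hypothetical four-node chain with a weak link in the middle; the weights and the helper names `graph_laplacian` and `smoothness_cost` are illustrative choices, not taken from the original text.

```python
import numpy as np

def graph_laplacian(W):
    """Return L = D - W for a symmetric similarity matrix W."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))  # degree matrix: row sums of W on the diagonal
    return D - W

def smoothness_cost(F, L):
    """Regularisation term F^T L F for a label vector F."""
    return float(F @ L @ F)

# Hypothetical chain graph 0-1-2-3 with a weak link between nodes 1 and 2.
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
L = graph_laplacian(W)

row_sums = L.sum(axis=1)                      # every row of L sums to zero

constant = np.ones(4)                         # all elements share one label
natural_cut = np.array([0.0, 0.0, 1.0, 1.0])  # cuts only the weak link
bad_cut = np.array([0.0, 1.0, 0.0, 1.0])      # cuts the strong links as well

cost_constant = smoothness_cost(constant, L)
cost_natural = smoothness_cost(natural_cut, L)
cost_bad = smoothness_cost(bad_cut, L)
```

In this toy case the constant labelling has zero cost, which is why the smoothness term alone cannot produce a useful segmentation, and the labelling that cuts only the weak link is much cheaper than one that cuts the strong links.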
For human motion capture, the objective is slightly different from that of image segmentation. It requires a more accurate (and usually more difficult) estimate of the human pose $\theta$. Therefore, it often incorporates stronger prior knowledge (such as foreground/background subtraction) to achieve better tracking performance than image segmentation alone. This involves strong human judgement as a prior, for example that the foreground pixels belong to the tracking subject. This assumption is simple for a human to make, but a computer cannot arrive at it by itself. Thus, considering the neighbourhood smoothness and the foreground assumption, an appropriate objective function for human motion capture can be given by:
$$\min_{\theta \in \mathbb{R}^n} \|F(\theta) - Y\|^2 + F(\theta)^T L F(\theta)$$
The equation means that the segmentation generated from the estimated pose $\theta$ should obey the prior judgement $Y$ (foreground subtraction) as well as conform to the natural image regions.
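As an illustration of how the data term and the smoothness term combine, the sketch below evaluates the objective on a one-dimensional toy image. Everything in it is an assumption made for the example: the pose $\theta$ is reduced to a single number (the centre of a foreground bar), `render_labels` is a hypothetical stand-in for $F(\theta)$, the similarity weights are a Gaussian of the intensity difference, and a plain grid search replaces a real optimiser.

```python
import numpy as np

def chain_laplacian(intensity, sigma=0.1):
    """L = D - W for a 1-D pixel chain; similar neighbouring pixels get weights near 1."""
    diff = np.diff(intensity)
    w = np.exp(-(diff ** 2) / (2.0 * sigma ** 2))
    n = intensity.size
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = w
    W[idx + 1, idx] = w
    return np.diag(W.sum(axis=1)) - W

def render_labels(theta, n_pixels, half_width=5.5, softness=1.0):
    """Hypothetical F(theta): soft foreground labels of a bar centred at theta."""
    x = np.arange(n_pixels)
    return 1.0 / (1.0 + np.exp((np.abs(x - theta) - half_width) / softness))

def objective(theta, Y, L):
    """||F(theta) - Y||^2 + F(theta)^T L F(theta)."""
    F = render_labels(theta, Y.size)
    data_term = np.sum((F - Y) ** 2)   # agreement with the foreground prior Y
    smooth_term = F @ L @ F            # agreement with the natural image regions
    return float(data_term + smooth_term)

n_pixels = 40
true_centre = 22
x = np.arange(n_pixels)
Y = (np.abs(x - true_centre) <= 5).astype(float)  # toy foreground mask
intensity = 0.2 + 0.6 * Y                          # toy image whose regions match Y
L = chain_laplacian(intensity)

# Grid search over the scalar pose; a stand-in for a proper optimiser.
candidates = np.arange(5.0, 35.0, 0.5)
costs = np.array([objective(t, Y, L) for t in candidates])
best_theta = float(candidates[np.argmin(costs)])
```

In this constructed example both terms favour the same pose, so the minimiser coincides with the true centre. With real images the two terms generally pull in different directions and the minimum is a compromise between them.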
Chapter 3
Architecture Overview and Sequential Tracking Pipeline
This work adopts the generative approach rather than the discriminative approach, because it makes it possible to generate synthetic data points and to approximate posterior distributions sequentially over time. This behaves better than recovering pose configurations directly from the image observation space. This chapter presents the big picture of the entire architecture and introduces the functionality of the major components, including template modelling and automatic initialisation, observation likelihood evaluation, pose estimation and the sequential Bayesian tracking pipeline. The relationships and interactions between the basic building blocks are explained within the sequential Bayesian filtering framework. Subsequent chapters elaborate on each component separately and address more technical details.
3.1 Architecture of Human Motion Capture
As with many other tracking systems, markerless motion capture can be regarded as a dynamic system, in which the current event has very strong temporal connections to preceding and successive events. The Contextualised Dynamical Architecture in Figure 3.1 visually captures the framework architecture in the temporal domain, together with the functional components listed below:
1. True posture and observation: An actor/subject is performing. At any given instant $t$, the actor has a posture $x_t^*$ (a vector comprising a position and joint angles), and there is an observation $y_t$ (e.g. multiview images) of that posture. The ultimate goal is to estimate the true posture $x_t^*$ for every instant.
2. Digital Acquisition: The actor’s performance is captured by multiple distributed digital cameras and stored as a sequence of digital images or videos.
3. Skeleton Template Model: To describe the posture of the subject, we adopt the standard articulated skeleton and kinematics routines from computer graphics. The posture estimate $x_t$ at time $t$ is described by a template skeleton associated with a series of joint angles. Ideally, we hope to find the $x_t$ that best matches $y_t$. The details of how to build the generic skeleton are described in Chapter 4.
4. Subject Specific Modelling: To improve tracking accuracy and robustness, a more advanced template body model is built according to the appearance of the real subject. This technique captures appearance features such as gender, height, weight, shape and muscular tone, and incorporates them into the pose deformable model. As a result, given a pose $x_t$, we can render a virtual character in the corresponding posture. This provides much richer information about the subject and helps reduce ambiguities in tracking. The technical details can be found in Chapter 4 and Section 6.2.
5. Observation Likelihood: Directly observing or obtaining the true pose is not possible. The core of the framework is to find the best possible estimate $x_t$ for $y_t$. In other words, we want to find the estimate that maximises the observation likelihood $p(y_t \mid x_t)$ in the sense of the Bayesian paradigm (more analytic details are given in the following sections). The observation likelihood typically takes the observations $y_t$ (information related to the true pose) and a hypothesis estimate $x_t$ as inputs, evaluates their similarity and outputs a similarity score. This usually requires that the observations and the hypothesis estimate have a comparable form. The evaluation of the observation likelihood is a crucial part of markerless motion capture: it is closely tied to the optimisation process and ultimately determines the tracking quality and computational performance. We propose several novel strategies in Chapters 6 and 7 to boost accuracy, robustness and performance.
6. Feature Extraction: Using the raw digital images directly is usually not very effective. Feature extraction can be used to retrieve pose-relevant information and to remove irrelevant information and noise. For instance, the silhouette feature is often extracted and used in human tracking applications. Some techniques are introduced in Chapters 6 and 7 along with the algorithms.
7. Synthesis: To make the observation and the hypothesis estimate comparable, a virtual character must be synthesised. A common approach is to perform perspective projection, using the camera calibration parameters, to generate images of the character. This is described in Section A.1.
8. Optimisation on Pose Estimation: Optimisation is performed to maximise the posterior probability. It uses the pose from the previous time step as an initial position, iteratively evaluates the observation likelihood and ideally converges to the global optimum. The converged result is then regarded as the pose estimate for the current time $t$. However, because of the high dimensionality of the skeleton parameterisation and the ambiguities associated with the limited number of cameras and self-occlusions, this is a multimodal, high-dimensional optimisation problem. As conventional gradient-based methods have difficulty solving this problem, a stochastic approach is often used instead. In Chapter 5, we describe several nature-inspired algorithms to tackle this problem.
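The interplay of synthesis, observation likelihood and pose optimisation listed above can be sketched on a toy problem. The following is a minimal illustration under strong assumptions and is not the system described in this work: the "pose" is just the 2-D centre of a disc, `synthesise` renders a binary silhouette instead of projecting a body model through calibrated cameras, the likelihood is a hypothetical exponential of silhouette overlap, and a simple shrinking random search stands in for the nature-inspired optimisers of Chapter 5.

```python
import numpy as np

def synthesise(pose, size=48, radius=6.0):
    """Render a binary silhouette of a disc centred at pose = (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return (xs - pose[0]) ** 2 + (ys - pose[1]) ** 2 <= radius ** 2

def likelihood(observation, pose, beta=5.0):
    """Toy p(y_t | x_t): larger when rendered and observed silhouettes overlap more."""
    hypothesis = synthesise(pose, size=observation.shape[0])
    union = np.logical_or(observation, hypothesis).sum()
    inter = np.logical_and(observation, hypothesis).sum()
    iou = inter / union if union > 0 else 0.0
    return float(np.exp(-beta * (1.0 - iou)))

def estimate_pose(observation, previous_pose, rng,
                  n_iters=30, n_samples=40, spread=4.0):
    """Stochastic search seeded at the previous pose; keeps the best sample found."""
    best = np.asarray(previous_pose, dtype=float)
    best_score = likelihood(observation, best)
    for _ in range(n_iters):
        # Perturb the current best pose and score every hypothesis.
        samples = best + rng.normal(scale=spread, size=(n_samples, 2))
        scores = [likelihood(observation, s) for s in samples]
        k = int(np.argmax(scores))
        if scores[k] > best_score:
            best, best_score = samples[k], scores[k]
        spread *= 0.9  # gradually narrow the search
    return best, best_score

rng = np.random.default_rng(0)

# Hypothetical ground-truth trajectory and the silhouettes observed from it.
true_poses = [np.array([12.0 + 2.0 * t, 20.0 + 1.0 * t]) for t in range(8)]
observations = [synthesise(p) for p in true_poses]

estimate = true_poses[0].copy()  # assume the initial pose is given
track = []
for y_t in observations:
    # Sequential tracking: the previous estimate seeds the current search.
    estimate, score = estimate_pose(y_t, estimate, rng)
    track.append(estimate)

errors = [float(np.linalg.norm(e - p)) for e, p in zip(track, true_poses)]
```

The toy likelihood surface here has a single peak, so even a crude random search tracks the disc. The real problem is multimodal and high-dimensional, which is the reason more capable stochastic optimisers are needed.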