Discussion and Future Work - A collaborative approach to image segmentation and behavior recogn

In the formulation of our general segmentation/recognition framework, we regard behavior as a succession of simple actions. We describe these actions in terms of object attributes that are emitted with a certain probability given a particular action class. Furthermore, we characterize the succession of action classes by a Markov chain. This kind of behavior description corresponds to a Hidden Markov Model. The temporal dependency between successive attributes is modeled in terms of transition probabilities between the discrete hidden action classes that produce the attributes. As future work, we can imagine extensions of our model that incorporate more complex temporal dependencies between attributes. One example would be the inclusion of an auto-regressive dependency between successive attributes, whose parameters would depend on the action class. Such an extension would facilitate the application of the model to scenarios where the attributes change continuously, but in a predictable way depending on the action class, for example when discriminating between activities such as walking and running.
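As a rough illustration of the Hidden Markov Model view described above, the following sketch evaluates the log-likelihood of an attribute sequence with the scaled forward algorithm. It is a minimal sketch under simplifying assumptions: attributes are treated as discrete symbols, and the two action classes and all probability values are hypothetical, chosen only for illustration; they are not the attribute models or parameters of our framework.

```python
import math

def forward_log_likelihood(pi, A, B, observations):
    """Log-likelihood of a discrete observation sequence under an HMM.

    pi[i]   : probability of starting in action class i
    A[i][j] : probability of moving from action class i to class j
    B[i][o] : probability that action class i emits attribute symbol o
    """
    n = len(pi)
    log_lik = 0.0
    alpha = None
    for t, obs in enumerate(observations):
        if t == 0:
            alpha = [pi[i] * B[i][obs] for i in range(n)]
        else:
            # Propagate through the Markov chain, then weight by the emission.
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
                     for j in range(n)]
        scale = sum(alpha)                  # P(obs_t | obs_0 .. obs_{t-1})
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

# Hypothetical model: two action classes, two attribute symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward_log_likelihood(pi, A, B, [0, 1, 0]))
```

An auto-regressive extension of the kind mentioned above would replace the emission term `B[j][obs]` by a density that also depends on the previous attribute, with parameters indexed by the action class `j`.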

Let us now turn to applications of our framework such as finger-spelling recognition. As our experimental results have shown, our model is able to cope with substantial amounts of background clutter thanks to the infusion of prior knowledge from the recognition process. Nevertheless, the model can still be sidetracked from the correct segmentation if objects in the background are too similar to the hand in average color. This aspect could be improved by incorporating more complex image-based segmentation models, including richer color models (e.g. histogram-based), texture models, or a piecewise-smooth formulation instead of the piecewise-constant model that we have employed. Another idea would be to incorporate a form of background modeling, which could be made adaptive in time so as not to constrain the application to a fixed background. However, we should mention that our choice of a rather simple model was partly motivated by computation time, which would increase with the use of more complicated models.

Indeed, computation time is one of the sensitive points of our framework. This is mainly because it relies on a variational method for image segmentation. The numerous advantages of variational segmentation methods, among them their rigorous mathematical formulation and the flexible inclusion of various criteria, were explained in Chapter 2. However, typical numerical implementations of these methods require the iterative evolution of the segmentation contour until convergence, using evolution time steps whose size is limited by the stability of the numerical schemes. This translates into relatively long computation times per image. In our case, to speed up computation, we used the narrow-band method [1] for updating the level set function representing our segmentation contour, and the fast-marching method [1] for the re-initialization of the level set function to a signed distance function. Additional computation time could be gained by considering a multi-grid numerical implementation. Another option would be to replace the level set contour representation by a parametric B-spline one, which would drastically reduce the dimension of our problem. However, we would lose the ability to capture interior object contours (without additional complications), which arise for instance when representing the hand contour for the letter O in our finger-spelling application. Last but not least, the code could be optimized through a C-only implementation and processor-specific optimizations.

Regarding the testing of our framework, it would be interesting to extend our applications (in particular the finger-spelling one) to several gesturing persons, and also to extend the testing scenarios to different background and lighting configurations, as well as cases of missing frames from the test image sequences.

Appendix A

A.1 The Minimization of Functionals Using the Calculus of Variations and Gradient Descent

In the following, we briefly outline a classical method used for the minimization of typical functionals encountered in image processing problems. This method is based on the calculus of variations and gradient descent (cf. [130]).

We begin by presenting the one-dimensional (1D) case. Given a 1D function $u(x) : [0, 1] \to \mathbb{R}$, we wish to minimize a given energy functional

$$E(u) = \int_0^1 F(u, u')\,dx, \tag{A.1}$$

subject to given boundary conditions $u(0) = a$ and $u(1) = b$. Here $F : \mathbb{R}^2 \to \mathbb{R}$ is dictated by the particular application to solve and depends on the function $u$ and on its derivative $u'$.

In classical calculus, the extrema of a function $f(x) : \mathbb{R} \to \mathbb{R}$ are reached in those points of the domain where $f'(x) = 0$. Likewise, in the calculus of variations we can attain the extrema of the functional $E(u)$ in those points where $E' = 0$, where $E' = \frac{\partial E}{\partial u}$ is the first variation of $E(u)$. As shown in [130], this leads to the following necessary condition in order for $u$ to be an extremum of $E(u)$:

$$\frac{\partial F}{\partial u} - \frac{d}{dx}\frac{\partial F}{\partial u'} = 0. \tag{A.2}$$

This is the Euler-Lagrange equation for the 1D case. Similarly, for an energy of the form

$$E(u) = \int_0^1 F(u, u', u'')\,dx, \tag{A.3}$$

the Euler-Lagrange equation is given by

$$\frac{\partial F}{\partial u} - \frac{d}{dx}\frac{\partial F}{\partial u'} + \frac{d^2}{dx^2}\frac{\partial F}{\partial u''} = 0. \tag{A.4}$$
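To make the 1D Euler-Lagrange equations concrete, here are two short worked examples. Both integrands are hypothetical choices made for illustration only (with $f$ a given data function and $\lambda > 0$ a weight); they are not the energies used in our framework.

```latex
% Example 1: first-order smoothness term plus a data term, using (A.2)
\[
F(u, u') = \tfrac{1}{2}(u')^2 + \tfrac{\lambda}{2}(u - f)^2
\;\Longrightarrow\;
\frac{\partial F}{\partial u} = \lambda (u - f), \quad
\frac{\partial F}{\partial u'} = u'
\;\Longrightarrow\;
\lambda (u - f) - u'' = 0 .
\]
% Example 2: second-order smoothness term, using (A.4)
\[
F(u, u', u'') = \tfrac{1}{2}(u'')^2
\;\Longrightarrow\;
\frac{\partial F}{\partial u} = 0, \quad
\frac{\partial F}{\partial u'} = 0, \quad
\frac{\partial F}{\partial u''} = u''
\;\Longrightarrow\;
\frac{d^2}{dx^2}\, u'' = u'''' = 0 .
\]
```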

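For the 2D case, the equations are analogous. Given a function $u(x, y) : \Omega \subset \mathbb{R}^2 \to \mathbb{R}$, we wish to minimize the following energy with respect to $u$:

$$E(u) = \iint_{\Omega} F(u, u_x, u_y, u_{xx}, u_{yy})\,dx\,dy. \tag{A.5}$$

The necessary condition for $u$ to be an extremum point for $E$ is given by the Euler-Lagrange equation:

$$\frac{\partial F}{\partial u} - \frac{d}{dx}\frac{\partial F}{\partial u_x} - \frac{d}{dy}\frac{\partial F}{\partial u_y} + \frac{d^2}{dx^2}\frac{\partial F}{\partial u_{xx}} + \frac{d^2}{dy^2}\frac{\partial F}{\partial u_{yy}} = 0. \tag{A.6}$$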
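As a brief worked example of the 2D equation, consider the total variation integrand, a standard choice in image processing that is used here purely as an illustration. The derivation assumes $\nabla u \neq 0$, so that the integrand is differentiable.

```latex
\[
F(u_x, u_y) = \sqrt{u_x^2 + u_y^2} = |\nabla u|
\;\Longrightarrow\;
\frac{\partial F}{\partial u} = 0, \quad
\frac{\partial F}{\partial u_x} = \frac{u_x}{|\nabla u|}, \quad
\frac{\partial F}{\partial u_y} = \frac{u_y}{|\nabla u|},
\]
\[
\text{so that (A.6) becomes} \quad
-\frac{d}{dx}\!\left(\frac{u_x}{|\nabla u|}\right)
-\frac{d}{dy}\!\left(\frac{u_y}{|\nabla u|}\right)
= -\operatorname{div}\!\left(\frac{\nabla u}{|\nabla u|}\right) = 0 .
\]
```

The second-order terms of (A.6) vanish here because this $F$ does not depend on $u_{xx}$ or $u_{yy}$.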

The remaining problem is now to find a solution of the Euler-Lagrange equation, which we denote by

$$L(u) = 0,$$

where $L(u)$ designates the left-hand side of equations such as (A.6). Generally, in image processing tasks this equation is impossible to solve analytically. Therefore, numerical solutions are usually preferred. One of the most commonly used methods is gradient descent. The basic idea is that in order to find a solution of $L(u) = 0$, we numerically solve the PDE

$$\frac{\partial u}{\partial t} = -L(u), \tag{A.7}$$

starting from the initial condition $u(0) = u_0$, where $u_0$ is the given initial data and $t$ is an artificial time-marching parameter. Since $L(u)$ is the first variation of $E$, the minus sign makes $u$ evolve along the negative gradient of the energy, so that $E(u)$ decreases over time. Once we reach the steady state of this equation, that is, when

$$\frac{\partial u}{\partial t} = 0, \tag{A.8}$$

then we have found the solution $u^* = u$ to the Euler-Lagrange equation: $L(u^*) = 0$.
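The time-marching procedure can be sketched on a simple 1D problem. The example below is a minimal sketch, not our segmentation code: it assumes the illustrative energy $E(u) = \int_0^1 \tfrac{1}{2}(u')^2\,dx$ with boundary conditions $u(0) = a$, $u(1) = b$, for which (A.2) gives $L(u) = -u''$. The function name, grid size, step size and tolerance are all hypothetical choices. Note that the update steps against $L(u)$ so that the energy decreases.

```python
def gradient_descent_1d(a, b, n=21, max_steps=20000, tol=1e-10):
    """Minimize E(u) = integral of 0.5 * u'(x)^2 over [0, 1], u(0)=a, u(1)=b."""
    dx = 1.0 / (n - 1)
    dt = 0.4 * dx * dx          # explicit scheme: stable only for dt <= dx^2 / 2
    u = [0.0] * n               # arbitrary initial data u0 ...
    u[0], u[-1] = a, b          # ... satisfying the boundary conditions
    for _ in range(max_steps):
        # L(u) = dF/du - d/dx(dF/du') = -u'' at interior nodes (central differences)
        L = [-(u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)
             for i in range(1, n - 1)]
        if max(abs(v) for v in L) < tol:   # steady state: L(u) is numerically zero
            break
        for i, v in enumerate(L, start=1):
            u[i] -= dt * v      # step against L(u) so that the energy decreases
    return u

u = gradient_descent_1d(a=1.0, b=3.0)
print(u[0], u[len(u) // 2], u[-1])
```

For this energy the steady state is the straight line joining the two boundary values. The bound on `dt` is an instance of the stability restriction on evolution time steps: halving `dx` forces `dt` to shrink by a factor of four, which is one reason such schemes are slow on fine grids.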

This gradient descent method is not guaranteed to reach the optimal solution. If the energy to minimize is not convex, the solution to the PDE (A.7) may not be unique or may vary depending on the initial condition which is used. Its use is nonetheless widespread, since in many cases a local minimum of the energy functional constitutes an acceptable solution to the given problem.
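The dependence on the initial condition can be seen on a toy non-convex energy. The sketch below assumes a hypothetical tilted double-well integrand $F(u) = (u^2 - 1)^2 + 0.3\,u$ that does not depend on derivatives of $u$, so that $L(u) = \partial F / \partial u$ and the descent can be followed for a single value of $u$. All names and constants are illustrative assumptions.

```python
def F(u):
    """Hypothetical non-convex integrand: a tilted double well."""
    return (u * u - 1.0) ** 2 + 0.3 * u

def L(u):
    """Left-hand side of the Euler-Lagrange equation; here simply dF/du."""
    return 4.0 * u * (u * u - 1.0) + 0.3

def descend(u0, dt=0.01, steps=5000):
    """Explicit time marching, stepping against L(u) from initial value u0."""
    u = u0
    for _ in range(steps):
        u -= dt * L(u)
    return u

u_from_positive = descend(0.5)    # settles in the well near u = +1
u_from_negative = descend(-0.5)   # settles in the well near u = -1
print(u_from_positive, F(u_from_positive))
print(u_from_negative, F(u_from_negative))
```

Both runs end at points where $L(u) = 0$, but only the run started from the negative value reaches the lower of the two energies; the other stops at a local minimum.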

A.2 Image Segmentation Using the Gaussian Prior Model,
