
3.3 Motion tracking

3.3.4 Determining search segments on the texture map

The motion tracking procedure works frame to frame, so we have to determine how the location of the face surface changes from one frame to the next. This step and the following one establish the conditions for comparing two consecutive video frames. Unfortunately, the continuous character of the facial surface implies that there are no rigid objects to track. Even the jaw, which clearly is a rigid object and whose motion can be described with the usual six rigid-motion parameters, appears in the video frame to move non-rigidly because of the layers of muscle and skin above the bone structure. The eyeballs are an exception, but blinking makes them a less than favourable candidate, and they are not very important for speech anyway. The bridge of the nose might be another exception, but even here some people show wrinkles when wrinkling their nose.

Since we were not interested in tracking features, the only remaining solution is to partition the face surface into small parts. The general way of accomplishing this with a multiresolution approach was already discussed in previous sections. But what about the practical details? It has been suggested by many studies and authors concerned with animating faces in computer graphics (see Parke and Waters,

Figure 3.16: Section of the ellipsoid mesh with two search segments marked, for (a) diagonal neighbours and (b) horizontal neighbours. The search segment is defined by the four neighbouring nodes surrounding the centre node.

the animation of different parts of the face, i.e., high resolution in the mouth and eye regions and lower resolution in the cheek area. The same is sometimes claimed, though not proved, for face motion tracking. One of the long-term goals of our work is to investigate whether this claim holds in general and, if not, whether there are circumstances in which it nevertheless holds. Do differences perhaps depend on the kind of face motion, and what is the situation for speech face motion in particular? Answering these questions requires looking at the covariations or correlations between measurements distributed globally over the face. We therefore chose as the tracking ’atom’ on the final, fine level of our tracking a relatively small area that is distributed globally over the face and has more or less the same size everywhere. The shape and size of the search areas on the coarser levels are merely a consequence of this choice.

Of course, the ’search segment’, as we will consistently call the search area from now on, has to be well defined everywhere. Our algorithm achieves this by defining it as the area enclosed by the quadrilateral whose vertices are the four neighbouring nodes surrounding a centre node. Figure 3.16 shows an example.
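The definition of the search segment can be illustrated with a minimal Python sketch. It assumes that the mesh nodes are stored as a two-dimensional grid of (x, y) texture-map coordinates; the function name search_segment, the storage layout, and the rejection of border nodes are illustrative assumptions and are not taken from the original MATLAB implementation.

```python
def search_segment(nodes, row, col):
    """Return the four vertices of the search segment around nodes[row][col].

    nodes is a 2-D list of (x, y) texture-map coordinates (an assumed layout).
    The vertices are returned in the order upper, right, lower, left
    neighbour, so that they trace the outline of the quadrilateral.
    """
    n_rows, n_cols = len(nodes), len(nodes[0])
    # Assumption: only nodes with all four neighbours define a segment.
    if not (0 < row < n_rows - 1 and 0 < col < n_cols - 1):
        raise ValueError("centre node needs four neighbours")
    return [
        nodes[row - 1][col],  # neighbour in the previous row
        nodes[row][col + 1],  # neighbour in the next column
        nodes[row + 1][col],  # neighbour in the next row
        nodes[row][col - 1],  # neighbour in the previous column
    ]


# Regular 3 x 3 grid with unit spacing; node (row, col) sits at (col, row).
grid = [[(float(c), float(r)) for c in range(3)] for r in range(3)]
segment = search_segment(grid, 1, 1)
print(segment)  # [(1.0, 0.0), (2.0, 1.0), (1.0, 2.0), (0.0, 1.0)]
```

On a regular grid the segment is a diamond centred on the node; on the ellipsoid mesh the quadrilateral is generally irregular.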

The search segments of diagonally neighbouring nodes share a border but do not overlap (Fig. 3.16(a)), whereas those of vertically or horizontally neighbouring nodes share about one fourth of their area (Fig. 3.16(b)). A closer look at all surrounding segments in the vertical and horizontal directions reveals that each pixel of the texture map is used twice in the tracking. This redundancy, however, is intentional, and its important role in the tracking will become evident later on.
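The double use of each pixel can be checked on an idealised case. The following Python sketch assumes a regular planar grid with unit node spacing, which only approximates the ellipsoid mesh, so that every search segment is the diamond |x - cx| + |y - cy| < 1 around its centre node. The helper names and the sample positions are illustrative assumptions; exact fractions are used so that no sample falls on a segment border.

```python
from fractions import Fraction


def in_segment(point, centre):
    """True if point lies strictly inside the diamond-shaped search segment
    spanned by the four unit-spaced neighbours of centre."""
    px, py = point
    cx, cy = centre
    return abs(px - cx) + abs(py - cy) < 1


def coverage(point, grid_size):
    """Count the search segments containing point. Only interior nodes of a
    grid_size x grid_size grid have four neighbours and define a segment."""
    interior = range(1, grid_size - 1)
    return sum(in_segment(point, (cx, cy)) for cx in interior for cy in interior)


# Sample positions inside the grid cell between nodes (2, 2) and (3, 3).
# The offsets are chosen so that no sample lies exactly on a segment border.
samples = [
    (2 + Fraction(2 * i + 1, 20), 2 + Fraction(3 * j + 1, 30))
    for i in range(10)
    for j in range(10)
]
counts = {coverage(p, grid_size=6) for p in samples}
print(counts)  # {2}
```

Under these assumptions every sampled position away from the mesh border lies in exactly two search segments, in line with the statement above.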

To determine whether a pixel really lies within the quadrilateral, the MATLAB function inpolygon is used.
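For readers without MATLAB, the kind of test that inpolygon performs can be sketched in Python with the standard even-odd (ray casting) rule. This is a stand-in under stated assumptions, not the routine used in the thesis: the function names are hypothetical, pixel centres are assumed to lie at integer coordinates, and points exactly on an edge are not treated specially, whereas inpolygon counts them as inside.

```python
import math


def point_in_polygon(x, y, vertices):
    """Even-odd rule: cast a ray from (x, y) in the +x direction and count
    how many polygon edges it crosses."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be hit.
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def pixels_in_segment(vertices):
    """List the integer pixel positions inside the quadrilateral, testing
    only the pixels within its bounding box."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return [
        (px, py)
        for py in range(math.floor(min(ys)), math.ceil(max(ys)) + 1)
        for px in range(math.floor(min(xs)), math.ceil(max(xs)) + 1)
        if point_in_polygon(px, py, vertices)
    ]


# Hypothetical search segment in texture-map pixel coordinates.
segment = [(10, 0), (20, 10), (10, 20), (0, 10)]
pixels = pixels_in_segment(segment)
print((10, 10) in pixels, (1, 1) in pixels)  # True False
```

Restricting the test to the bounding box keeps the number of point-in-polygon evaluations proportional to the size of the segment rather than to the size of the texture map.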

Some segments, however, must be excluded: if head motion causes one or more segments to be ’occluded’ by the remainder of the ellipsoid, their location can no longer be determined. If the mesh were a solid ’real world’ object, the occluded part would simply not be visible in the video frame. But the two-dimensional

nodes - they are just wrapped around the curve of intersection between the ellipsoid and an arbitrary plane parallel to the image plane. It goes without saying that this would not only produce wrong values for the occluded segments but would also interfere with the whole motion tracking. Fortunately, those segments can easily be recognised by probing the angle between the optical axis and the vertex normals of the mesh nodes that define the search segment. If the absolute value of the angle is greater than 90 degrees for any limiting mesh node, the search segment must be excluded from the tracking for the time being (i.e., for this particular frame-to-frame transition). Since the limiting mesh nodes form the vertices of the search quadrilateral, the vertices are the extremal points of the quadrilateral, and the intersection curve of the half ellipsoid with any plane parallel to the image plane is convex, no pixel that would be invisible if the ellipsoid were the ’real world’ facial surface is included in the tracking.
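The exclusion criterion can be sketched in Python as follows. An angle greater than 90 degrees between two non-zero vectors is equivalent to a negative dot product, so no explicit angle has to be computed. The function name is_occluded and the orientation of the optical-axis vector (pointing from the scene towards the camera, so that visible normals enclose an angle below 90 degrees with it) are assumptions made for this illustration.

```python
def is_occluded(vertex_normals, optical_axis):
    """Return True if any of the limiting mesh nodes faces away from the
    camera, i.e. its vertex normal makes an angle greater than 90 degrees
    with optical_axis (equivalently, the dot product is negative)."""
    for normal in vertex_normals:
        dot = sum(n * a for n, a in zip(normal, optical_axis))
        if dot < 0:
            return True
    return False


# Assumed convention: the optical axis points from the scene to the camera.
axis = (0.0, 0.0, 1.0)

# All four limiting nodes face the camera: the segment is kept.
facing = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (0.0, 0.6, 0.8), (0.0, -0.6, 0.8)]
print(is_occluded(facing, axis))  # False

# One limiting node faces away: the segment is excluded for this transition.
turned = facing[:3] + [(0.8, 0.0, -0.6)]
print(is_occluded(turned, axis))  # True
```

A single limiting node that faces away is enough to exclude the segment, which matches the "any limiting mesh node" condition stated above.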

Figure 3.17: Adaptation of the search segment (light gray area) of two successive frames, shown for one ’quadrant’ (dark gray area). Figure labels: nodes A to E and A’ to E’; Frame n−1; Frame n (prediction); Adaptation.

In document Kroos, Christian (2004): A system for video-based analysis of face motion during speech. Dissertation, LMU München: Fakultät für Sprach- und Literaturwissenschaften (Page 85-87)