Comparison with human auditory-visual speech perception

Beyond direct evaluation, Munhall et al. (in press) provide interesting insight into which spatial frequencies human observers use in auditory-visual speech perception, and thereby a convincing argument for the wavelet-based approach to machine tracking of face motion during speech. Munhall and colleagues tested spatial-frequency band-pass filtered image sequences of a talker in an audiovisual speech-in-noise task in two experiments.

As already mentioned in the introduction, seeing the speaker's face enhances intelligibility in acoustically unfavourable environments. The first experiment tested how well subjects recognised keywords of the CID corpus when the original audio track was severely degraded by superimposed multi-speaker babble. The original video sequence was converted to grayscale and filtered with five different one-octave-wide band-pass filters. Together with the audio-only condition and the full (grayscale) video condition, this yielded seven test conditions. Table 4.2 shows the bandwidths and the centre frequencies.

Table 4.3 shows the mean percentage of correct answers over all 42 subjects.

Presentation condition    Percent correct keywords

Full face                 66.38
F1                        49.52
F2                        46.76
F3                        55.9
F4                        53.81
F5                        31.9
Auditory only             36.67

Munhall et al. (in press) summarise:

Experiment 1 showed that all but the lowest spatial frequency band that we tested enhanced auditory speech perception, however, none of the individual frequency bands reached the accuracy level of the unfiltered images. The band-pass conditions showed a quadratic intelligibility pattern with the peak intelligibility occurring in the mid-range filter band with center frequency of 11 c/face. Experiment 2 showed that this pattern did not vary as a function of viewing distance and thus that object-based spatial frequency best characterized the data.

Note that the spatial frequency conditions denoted as F2, F3, and F4 correspond exactly to the wavelet levels 5, 4, and 3, respectively, used in the motion tracking. In fact, the stimulus images were produced with a (slightly modified) version of the routine implemented for the tracking procedure. Therefore we can be sure that most of the phonetic information humans can infer from image sequences can be extracted from the spatial frequency range we rely on for the tracking (see also MacDonald, Andersen, and Bachmann, 2000).

Conclusion

5.1 Summary

We presented a system for video-based analysis of face motion during speech. Its core is an algorithm that measures face motion from image sequences. Additional features ensure that the audio track, i.e. the acoustic speech signal, is synchronised with the face motion measurement, that external head motion data can be integrated, and that the measurement data itself can be accessed at will for further analysis.

The tracking algorithm has two stages - an initialisation phase and the actual frame-to-frame image motion tracking. The initialisation procedure generates a parametrised ellipsoid mesh, scales it to the size of the subject's face in a user-chosen reference frame and places it so that it covers the face area. In the subsequent motion tracking, the mesh provides anchor points for the tracking and records location changes of small parts of the facial surface. To size and place the ellipsoid, the user is required to mark a few points on the outline of the face and the inner or outer eye corners. An ellipse is then fitted to the outline points, with its orientation constrained to the slope angle of the line connecting the eye corner points. The ellipsoid parameters are derived from this ellipse. The initialisation phase also instantiates a camera model, an adapted version of the ideal pinhole camera model.
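The constrained ellipse fit can be illustrated with a minimal sketch. The summary does not state which fitting method was used, so the algebraic least-squares fit below is an assumption, as are all function and parameter names. The idea is to rotate the outline points by the negative eye-line angle, so that the ellipse becomes axis-aligned, and then to solve a small linear system. At least four well-spread outline points are needed.

```python
import math

def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ellipse_fixed_orientation(points, eye_left, eye_right):
    """Hypothetical helper: fit an ellipse whose orientation equals the eye-line angle.

    Returns (cx, cy, a, b, theta): centre, semi-axis along the eye line,
    semi-axis perpendicular to it, and the orientation angle in radians.
    """
    theta = math.atan2(eye_right[1] - eye_left[1], eye_right[0] - eye_left[0])
    c, s = math.cos(theta), math.sin(theta)
    mx = sum(p[0] for p in points) / len(points)
    my = sum(p[1] for p in points) / len(points)
    # Centre the points and rotate them by -theta: the ellipse is now axis-aligned.
    uv = [((x - mx) * c + (y - my) * s, -(x - mx) * s + (y - my) * c)
          for x, y in points]
    # Least squares for A u^2 + B v^2 + C u + D v = 1 via the normal equations.
    rows = [[u * u, v * v, u, v] for u, v in uv]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    atb = [sum(r[i] for r in rows) for i in range(4)]
    A, B, C, D = solve_linear(ata, atb)
    # Complete the squares to obtain centre and semi-axes in the rotated frame.
    u0, v0 = -C / (2 * A), -D / (2 * B)
    k = 1 + A * u0 * u0 + B * v0 * v0
    a, b = math.sqrt(k / A), math.sqrt(k / B)
    # Rotate the centre back into image coordinates.
    cx = mx + u0 * c - v0 * s
    cy = my + u0 * s + v0 * c
    return cx, cy, a, b, theta
```

Fixing the orientation beforehand leaves only four unknowns, which is why a handful of user-marked points is enough for a stable fit.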

The motion tracking procedure uses a multiresolution analysis in the strict sense for the image data and - adapted to it, but formulated in a less strict sense - for the mesh resolution: a set of ellipsoid meshes with varying node density is applied in a coarse-to-fine strategy, and the tracking results are refined at each step. The goal of the multiresolution decomposition of the video images is to obtain spatial-frequency band-limited subbands that are mutually orthogonal and orientation sensitive in three major directions (horizontal, vertical, and diagonal). This is accomplished by a discrete-space wavelet transform of the image data, realised as a cascaded filter bank with pairs of low-pass and high-pass half-band filters. The algorithm then loops through selected levels of the wavelet transform, projecting the ellipsoid mesh onto the subband 'images' using the camera model and following head movements by taking external head motion data into account.
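The structure of such a cascaded filter bank can be sketched in a few lines. The example uses the Haar filter pair purely because it is the shortest orthogonal half-band pair; the summary does not name the wavelet actually used, so this choice and all function names are assumptions. Image sides are assumed to be divisible by two at every level.

```python
def haar_step_1d(signal):
    """Split an even-length sequence into low-pass and high-pass halves (Haar)."""
    r = 2 ** -0.5
    low = [(signal[i] + signal[i + 1]) * r for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * r for i in range(0, len(signal), 2)]
    return low, high

def transpose(m):
    return [list(col) for col in zip(*m)]

def haar_step_2d(image):
    """One filter-bank stage on a 2D list of rows.

    Returns (approx, horizontal, vertical, diagonal), each half the size
    of the input in both directions.
    """
    # Filter and downsample along each row.
    lows, highs = zip(*(haar_step_1d(row) for row in image))

    def along_columns(m):
        lo, hi = zip(*(haar_step_1d(col) for col in transpose(m)))
        return transpose(lo), transpose(hi)

    # Low along rows, then low/high along columns:
    # approximation and the subband responding to horizontal edges.
    approx, horizontal = along_columns(lows)
    # High along rows, then low/high along columns:
    # the subbands responding to vertical and diagonal structure.
    vertical, diagonal = along_columns(highs)
    return approx, horizontal, vertical, diagonal

def haar_pyramid(image, levels):
    """Cascade: re-apply the stage to the approximation of the previous level."""
    details = []
    approx = image
    for _ in range(levels):
        approx, h, v, d = haar_step_2d(approx)
        details.append((h, v, d))  # details[0] is the finest level
    return approx, details
```

Each pass halves the resolution, so the detail subbands at higher levels contain successively lower spatial frequencies. This is the property the coarse-to-fine tracking relies on.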

The mesh resolution is reduced at the beginning of the tracking of a frame-to-frame transition, and the mesh is superimposed onto a higher wavelet level decomposition of each of the two frames under investigation. Search segments are defined as the partial face texture map contained in quadrilaterals created by the four neighbouring nodes surrounding an arbitrary centre node. The set of all search segments covers the entire facial surface with overlap, which ensures on the one hand that each segment includes enough area and on the other hand that a relatively high density of measurement points is maintained across the whole face surface. Because of the reduced mesh node density, the size of the search segment area corresponds to the low spatial frequencies remaining in the higher level subband.

The search segment is then warped to integrate information already known about its appearance in the next frame (e.g., the effect of head movements). After that, correspondence is established using normalised cross-correlation. This yields a motion vector characterising the change of location of the search segment from one frame to the next, which is assigned to the centre node of the search segment. Once motion vectors have been determined for all mesh nodes in a two-step procedure, the whole mesh is deformed accordingly.
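The correspondence step can be illustrated with a simplified sketch of matching by normalised cross-correlation. It departs from the procedure described above in several ways, all of them assumptions made for brevity: the search segment is a rectangular patch rather than a warped quadrilateral, displacements are whole pixels, and the search is exhaustive within a fixed radius. All function names are hypothetical.

```python
import math

def ncc(patch_a, patch_b):
    """Normalised cross-correlation of two equally sized 2D patches, in [-1, 1]."""
    a = [v for row in patch_a for v in row]
    b = [v for row in patch_b for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # flat patch: correlation undefined, treated as no match
    return sum(x * y for x, y in zip(da, db)) / denom

def crop(image, top, left, height, width):
    return [row[left:left + width] for row in image[top:top + height]]

def best_displacement(frame_a, frame_b, top, left, height, width, radius):
    """Find the shift (dx, dy) of a patch from frame_a within frame_b.

    Returns ((dx, dy), score) for the candidate with the highest correlation.
    """
    template = crop(frame_a, top, left, height, width)
    best_score, best_shift = -2.0, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            t, l = top + dy, left + dx
            # Skip candidates that would fall outside frame_b.
            if t < 0 or l < 0 or t + height > len(frame_b) or l + width > len(frame_b[0]):
                continue
            score = ncc(template, crop(frame_b, t, l, height, width))
            if score > best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift, best_score
```

Subtracting the patch means and dividing by the patch energies makes the score insensitive to uniform brightness and contrast changes between the two frames.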

Moving to the next finer tracking level, the coordinates of intermediate nodes not yet tracked are interpolated, and the tracking procedure described above is repeated using the wavelet subbands of a lower scale (higher spatial frequencies). After this run through the updating loop, the entire refinement step is repeated once more to yield the final fine-grained result.

In order to allow comparison and analysis of the measured face motion independently of the original video sequence, a stabilised version of the mesh must be produced. To this end, the effect of projection is reversed and the translation and rotation of the mesh due to head movements are inverted. Once the tracking of a video sequence is finished, the resulting sequence of ellipsoid meshes (one per frame) represents the intrinsic face motion.
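The inversion of the head movement can be sketched as follows, assuming the head motion is modelled as a rigid transformation p = R x + t with a rotation matrix R and a translation vector t. The reversal of the camera projection is left out, and the function names are hypothetical. Since a rotation matrix is orthogonal, its inverse is its transpose, so the stabilised node position is x = R^T (p - t).

```python
import math

def rotation_z(angle):
    """Rotation matrix for a rotation about the z axis (example head rotation)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def apply_rigid(rotation, translation, point):
    """Move a mesh node with the head: p = R x + t."""
    return [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)]

def undo_rigid(rotation, translation, point):
    """Remove the head movement from a mesh node: x = R^T (p - t)."""
    shifted = [point[i] - translation[i] for i in range(3)]
    # Multiplying by the transpose: index the rotation matrix as [j][i].
    return [sum(rotation[j][i] * shifted[j] for j in range(3)) for i in range(3)]
```

Applying `undo_rigid` to every node of every frame, with that frame's head pose, leaves only the motion of the facial surface relative to the head.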

The evaluation with several different methods showed that our speech face motion tracking system picked up the essential speech-related facial behaviour and the tracking error remained within acceptable limits so long as the video sequence did not become too long.

In document Kroos, Christian (2004): A system for video-based analysis of face motion during speech. Dissertation, LMU München: Fakultät für Sprach- und Literaturwissenschaften (Page 120-124)