Heterogeneous Data Sources for Signed Language Analysis and Synthesis: The SignCom Project

(1)

Heterogeneous Data Sources for Signed Language Analysis and Synthesis: The

SignCom Project

Kyle Duarte, Sylvie Gibet

Universit´e de Bretagne-Sud, Laboratoire VALORIA, Vannes, France [email protected], [email protected]

Abstract

This paper describes how heterogeneous data sources captured in the SignCom project may be used for the analysis and synthesis of French Sign Language (LSF) utterances. The captured data combine video data and multimodal motion capture (mocap) data, including body and hand movements as well as facial expressions. These data are pre-processed, synchronized, and enriched by text annotations of signed language elicitation sessions. The addition of mocap data to traditional data structures provides additional phonetic data to linguists who desire to better understand the various parts of signs (handshape, movement, orientation, etc.) to very exacting levels, as well as their interactions and relative timings. We show how the phonologies of hand configurations and articulator movements may be studied using signal processing and statistical analysis tools to highlight regularities or temporal schemata between the different modalities. Finally, mocap data allows us to replay signs using a computer animation engine, specifically editing and rearranging movements and configurations in order to create novel utterances.

1. Introduction

As researchers develop expanded studies to analyze the function of signed languages within the larger context of human language, it becomes increasingly apparent that cur-rent architectures and technologies provide insufficient res-olution for capturing and understanding the movement of the hands, face, and body throughout language production. To respond, the SignCom Project uses both video data and motion capture data as signals for linguistic annotation, for later use in linguistic analysis studies, and for re-synthesis of signed language sequences from a stored set of signs. We believe that this approach is novel in our field, and will provide researchers the data necessary to carry out finer-grained studies on the nature of signed languages.

2. Previous Research

2.1. Signed Language Corpora

Currently, corpora of signed languages exist for Auslan (Australian Sign Language), BSL (British Sign Language), DGS (German Sign Language), NGT (Sign Language of the Netherlands), and SSL (Swedish Sign Language), and researchers have proposed or initiated a number of other signed language corpora as well (Sign Linguistics Corpora Network, 2009).

Though these corpora’s linguistic content typically reflects spoken language corpora, their technical devices do not, as data is largely stored in video files, requiring almost all data processing to be done by a human (Johnston, 1998; Cras-born and Zwitserlood, 2008). Among these corpora, dif-fering types and numbers of video cameras have been used (between 2 and 10 standard- and/or high-definition mod-els), offering better or worse data sources for annotators and researchers.

The goals of these corpora have often centered around syntactic analysis, variation studies, and lexicon building, which are important inclusions for most linguistic studies at a corpus level. However, without specialized technologi-cal inclusions and narrowed discourse themes, such corpora

are not suited for the work of fine-grained signed language phonetic analysis or for signed language generation.

2.2. Motion Capture

Motion capture (mocap) is the process of storing motion as digital information, often by replicating the actions of the bones and joints of the body into a digital skeletal model (Figure 1). Such data can then be manipulated by computer systems to retrieve, analyze, and re-synthesize motion for a variety of research or application scenarios.

Figure 1: Motion capture technology recreates the skele-ton of the subject for quantitative analysis. Breaks in the lines above represent joints, and the lines themselves rep-resent bones. The face is reprep-resented with imaginary bones that each control single points on the face, such as parts of eyelids, cheeks, etc.

Currently mocap is used in the fields of movement anal-ysis (sports, music, dance, etc.), biomechanics, charac-ter animation in the encharac-tertainment industry, and defense sector training simulations (Furniss, 2010). A large mo-tion capture data repository is available at the Graphics Lab at Carnegie Mellon University1_{. Even in such rich}

1

(2)

(3)

(4)

(5)

Figure 5: Annotation of an ELAN file in current use in the SignCom Project. The tiers are, from top to bottom, Gloss-esR, a timing tier, HC R, PL R, GlossesL, EN US Trans-lation, and Gaze Target. The timing tier and the HC R and PL R tiers are annotated per the Johnson and Liddell model in progress.

4. Research Application

4.1. Phonological Studies

When considering the simultaneous phonology of signed languages, and the speed with which such phonological bundles are produced, it is immediately evident that cur-rent two-dimensional video frame resolutions and their rate of capture (25 to 30 Hz) are too low to accurately capture all the information in a sign stream. Worse, annotating sign production based on such low-resolution data leaves researchers to either leave an annotation entirely blank, or to guess at annotations from either surrounding configura-tions or personal intuition, thus sacrificing the integrity of corpus annotations.

Motion capture data is our proposed solution to the gap in understanding left by existing video capture technology, providing high-density quantitative data to supplement re-searchers’ already qualitative decisions on annotation and analysis. Motion data can be displayed in ELAN as a time-aligned waveform that gives numerical value to hand posi-tion, joint angle, etc. (Figure 6). Motion can be also re-played with a 3D model, allowing the annotator to rotate the figure and see otherwise visually-obscured configura-tions (Figure 7).

Figure 6: ELAN can be configured to show waveform data that is synchronized with the annotation timeline. The up-per waveform represents the hand’s X position, and the lower waveform represents its Y position. The annotated tier visible in this figure shows glosses for signs and Ts that represent intersign transitions.

Mocap data can also serve to validate and enforce existing theories on the phonologies of signed languages. Below are studies that are either possible or in progress with mocap-paired language data.

Figure 7: 3D models of the human body (top) can be rotated to see configurations that might otherwise be hidden from view in traditional video camera setups (bottom).

It should be noted that the examples provided below are manually processed; to develop more convincing theories, it will be necessary to process the data statistically, sidering exponentially more data points. Also, at the con-clusion of the SignCom project, our video, mocap, and an-notation data will be made available publicly for other re-searchers to study.

4.1.1. Hand Configuration Studies

Considering phonological or phonetic notation systems cur-rently in use or development (Johnson and Liddell, in progress; Prillwitz et al., 1989), motion data can serve to validate physical features that a system attempts to encode, or can aid in automatically annotating sign motion with such systems. For the purposes of our project, we have be-gun work on determining phonological targets for French Sign Language (LSF), specifically in the domains of hand configuration and hand movement.

As is evident in the learning materials published for many signed languages, the concept of having a limited amount of handshapes used in a language is useful for learning signs (Moody et al., 1998). Classically, Stokoe quantified these handshapes in his research on ASL, claiming that there are 19 dez(“designator”) possibilities that could be modified with any of three diacritics (Stokoe, 2005); in France, too, there has been a history of attempting to quantify hand-shapes, notably by Cuxac. Learning handshape possibil-ities is still an integral part of many formal and informal methods of teaching signed languages, just as one might learn the sounds of a new spoken language before learning its words or grammar.

(6)

(7)

(8)

International Gesture Workshop, Bielefeld, Germany,

September 1997. Proceedings, volume 1371 of

Lec-ture Notes in Computer Science, pages 23–36. Springer,

Berlin.

Scott K. Liddell and Robert E. Johnson. 1989. American Sign Language: The phonological base. Sign Language

Studies, 64:195–278.

Gaurav Mathur and Catherine Best. 2007. Three experi-mental techniques for investigating sign language pro-cessing. InCUNY Conference on Human Sentence

Pro-cessing.

Max Planck Institute for Psycholinguistics. 2010. Elan lin-guistic annotator. http://www.lat-mpi.eu/tools/elan/. Bill Moody, Agn`es Vourc’h, Michel Girod, and

Anne-Catherine Dufour. 1998. La Langue des Signes: Intro-duction `a l’Histoire et `a la Grammaire de la Langue des

Signes. Entre les Mains des Sourds, volume 1.

Interna-tional Visual Theatre, Paris, new edition.

Steven Parkhurst. 2006. Spanish Sign Language animation presentation. In Proceedings of 9th Theoretical Issues in Sign Language Research Conference, Florianopolis,

Brazil, December.

Justus Piater. 2009. Video and computer recognition. In

Proceedings of the Sign Linguistics Corpora Network’s

First Workshop on Data Collection. Lecture.

Siegmund Prillwitz, Regina Leven, Heiko Zienert, Thomas Hanke, and Jan Henning. 1989. HamNoSys. version 2.0; Hamburg Notation System for Sign Languages. an introductory guide. International Studies on Sign

Lan-guage and Communication of the Deaf, 5:46.

Sign Linguistics Corpora Network. 2009. Sign Language Corpus Wiki. http://www.signlanguagecorpora.org/. William C. Stokoe. 2005. Sign language structure: an

out-line of the communication systems of the American deaf.

Journal of Deaf Studies and Deaf Education, 10(1):3–