
FG'98, April 14-16, 1998 in Nara, Japan.

A Gesture Interface for Human-Robot-Interaction

Jochen Triesch and Christoph von der Malsburg¹

Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany
Jochen.Triesch@neuroinformatik.ruhr-uni-bochum.de

¹ also at: University of Southern California, Dept. of Computer Science and Section for Neurobiology, Los Angeles, CA, USA

Abstract

We present a person-independent gesture interface implemented on a real robot which allows the user to give simple commands, e.g., how to grasp an object and where to put it. The gesture analysis relies on realtime tracking of the user's hand and a refined analysis of the hand's shape in the presence of varying complex backgrounds.

1. Introduction

Robots of the future will interact with humans in a natural way. They will understand spoken and gestural commands and will articulate themselves by speech and gesture.

We are especially interested in gestural interfaces for robots operating in uncontrolled real world environments. This imposes several constraints on human-robot-interaction as a special case of human-computer-interaction:

1. The robot's visual system must cope with variable and possibly complex backgrounds. A system requiring a uniform background is not flexible enough for real world applications.

2. The system must be person independent. Many users should be able to operate it without the necessity for retraining.

3. The system must not require the user to wear markers or colored gloves, as this would be too tedious.

4. The lighting conditions are uncontrolled. The robot must cope with various lighting situations.

5. The robot must be capable of real time performance.

Figure 1. The robot.

Automatic visual gesture recognition holds the promise of making man-machine interfaces more natural. Therefore, it has received much attention recently; for a review see [9].

However, hardly any published work fulfills all the requirements stated above.

In the work of Franklin et al. [4] an attempt to build a robot waiter is presented, a domain which indeed poses all the above challenges. The robot's gesture analysis, however, so far only distinguishes between an empty hand and a hand holding an object, based on how much skin color is visible.

Figure 2. The user points to an object with a particular hand posture indicating that the robot is to grasp this object in a particular way. A second gesture then tells the robot where to place the object.

The system presented by Cui and Weng [3] recognizes different hand gestures in front of complex backgrounds. It reaches 93.1% correct recognition for 28 different gestures, but is not person independent and relies on a rather slow segmentation scheme taking 58.3 seconds per image.

Heap and Hogg [5] present a method for tracking a hand using a deformable model, which also works in the presence of complex backgrounds. The deformable model describes one hand posture and certain variations of it and is not aimed at recognizing different postures.

The work presented by Campbell et al. [2] is an example of a system recognizing two-handed gestures. It allows only for motion-based gestures, because it does not analyze the shape of the user's hands.

In Kjeldsen's and Kender's work [6] a realtime gesture system for controlling a window-based computer user interface is presented, which is quite fast (2 Hz). A minor drawback of the system is that its hand tracking has to be specifically adapted for each user.

The system presented by Maggioni [8] has a similar setup. It requires a constant background behind the gesturing hand.

2. Task

Our robot has a kinematically redundant arm with seven degrees of freedom, which allows it to grasp an object from various directions. On top of the arm a stereo camera head with three degrees of freedom is mounted, allowing for pan, tilt and vergence motion (figure 1). The cameras yield images 768 × 572 pixels in size.

We have designed an example application where the robot recognizes gestures of the human operator and accordingly grasps objects and moves them around (figure 2).

The robot is located in front of a table with various objects on it. The user points to an object, which the robot then has to grasp. The hand posture performed by the user during pointing tells the robot from which direction to grasp the object. With a second gesture the user indicates where the object shall be placed. The robot recognizes the gestures by tracking the hand until it comes to rest and then analyzing the hand's posture.

All the other skills needed by the robot, e.g., recognition of shape and orientation of the object pointed at, grip planning and grip execution, are discussed elsewhere [1].

3. Tracking of the User's Hand

The tracking of the user's hand relies on a combination of motion and color cues. An additional stereo cue rules out targets which are not in the plane of fixation. We use images downsampled to a size of 96 × 71 pixels in HSI (hue, saturation, intensity) color format. The tracking currently runs at a speed of 8 Hz. When the hand stops moving for a while, the tracking is ended, returning the last position of the hand in both images; then, the hand posture classification is activated.

3.1. Motion Cue

We compute a thresholded version of the absolute difference images of the intensity (I) components of consecutive images according to:

M_{l,r}(x, y, t) = \Theta\left( \, |I_{l,r}(x, y, t) - I_{l,r}(x, y, t-1)| - \vartheta \, \right),

where \vartheta is a threshold and \Theta is the step function. Afterwards we apply a local regularization, which switches on inner pixels that have a high number of direct on-neighbours and which switches off isolated on-pixels. The motion cue responds to all moving image regions, i.e., to moving objects and to some extent also to their shadows. A typical result of the motion cue in one camera image is depicted in figure 3.

Figure 3. The motion cue: All moving image areas are selected. Note that this cue is prone to also detecting the shadows of moving objects. Figures 3 to 5 all refer to the same physical scene, which is different from that depicted in figure 2.
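A minimal sketch of this motion cue in Python is given below; it assumes 8-bit greyscale frames as numpy arrays, and the threshold value and neighbour counts are illustrative placeholders rather than the exact values used here.

```python
# Minimal sketch of the motion cue: thresholded absolute difference of two
# consecutive intensity images, followed by a simple local regularization.
# Threshold and neighbour counts are illustrative assumptions.
import numpy as np
from scipy.ndimage import convolve

def motion_cue(intensity_prev, intensity_curr, theta=15):
    # Step function applied to |I(t) - I(t-1)| - theta
    diff = np.abs(intensity_curr.astype(np.int16) - intensity_prev.astype(np.int16))
    mask = (diff > theta).astype(np.uint8)

    # Count the 4-connected "on" neighbours of every pixel.
    kernel = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
    neighbours = convolve(mask, kernel, mode="constant", cval=0)

    regularized = mask.copy()
    regularized[(mask == 0) & (neighbours >= 3)] = 1  # switch on enclosed inner pixels
    regularized[(mask == 1) & (neighbours == 0)] = 0  # switch off isolated on-pixels
    return regularized
```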

3.2. Color Cue

Skin color detection is based on the hue (H) and saturation (S) components of the image. We define a prototypical skin color point in the HS plane. For each pixel of the input image we compute its Euclidean distance to this point, where we have scaled the two axes differently. The closer the pixel is to the prototype, the higher is its skin color similarity. This cue selects the hand but also other approximately skin-colored objects. It is depicted in figure 4. This cue proved very useful in spite of the drawbacks of being prone to noise and sensitive to drastic changes in the color of the illumination, as produced for example by the spotlights of a TV team. An online recalibration of the skin color prototype for such cases (e.g., as done in [8]) would be desirable and we are currently pursuing this point.
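A minimal sketch of such a color cue follows; the prototype coordinates and the axis weights are illustrative assumptions, only the differently scaled Euclidean distance in the HS plane is taken from the description above.

```python
# Minimal sketch of the skin color cue: similarity to a prototypical skin
# color point in the hue/saturation plane with differently scaled axes.
# The prototype and the weights are illustrative assumptions.
import numpy as np

def skin_color_cue(hsi, proto_hue=0.05, proto_sat=0.45, w_hue=2.0, w_sat=1.0):
    hue, sat = hsi[..., 0], hsi[..., 1]          # channels assumed in [0, 1]
    dist = np.sqrt((w_hue * (hue - proto_hue)) ** 2 +
                   (w_sat * (sat - proto_sat)) ** 2)
    return 1.0 / (1.0 + dist)                    # closer to prototype -> higher similarity
```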

3.3. Attention Maps and Stereo Cue

For each camera the result of the color cue and the motion cue are simply added with appropriate weighting factors, the stronger weight being put on the color cue. This ensures that the system will keep working (although with lower reliability) if one of the cues breaks down. Attention maps are then computed by convolving the summation results with a Gaussian kernel in order to smooth them and stress larger blobs of activity. The attention maps of the left and right image are simply added pixelwise (figure 5).

Thus, only if the activity blobs of an object fall on corresponding locations in both attention maps will there be a strong response. This is only the case for objects in the plane of fixation. False conjunctions, which are in principle possible with this technique, turn out to be no problem in practice. Finally, the global maximum in the sum of the attention maps is computed and from its position a gradient ascent is performed in the two attention maps to find the nearest local maxima there. Triangulation now yields an estimate of the hand's position in three dimensions. Due to the low resolution, however, this estimate is rather rough.
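A minimal sketch of this cue fusion is given below; the weighting factors and the width of the Gaussian kernel are illustrative assumptions, and the gradient ascent to the nearest local maxima in the individual maps is omitted.

```python
# Minimal sketch of the cue fusion: weighted sum of color and motion cues,
# Gaussian smoothing, and pixelwise addition of the two attention maps.
# Weights and kernel width are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map(color_map, motion_map, w_color=0.7, w_motion=0.3, sigma=2.0):
    return gaussian_filter(w_color * color_map + w_motion * motion_map, sigma)

def stereo_target(att_left, att_right):
    combined = att_left + att_right   # strong peaks only near the plane of fixation
    y, x = np.unravel_index(np.argmax(combined), combined.shape)
    return x, y
```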

Figure 4. The color cue: Areas of approximately skin color are highlighted, but localization of the pointing hand using this cue alone is often ambiguous.

For the more challenging task of tracking several skin-colored objects simultaneously, they can be followed by searching for local maxima in the sum of the attention maps in the vicinity of local maxima found in the previous frame, as long as the objects do not move too fast.

3.4. Active Tracking

The tracking can also run actively, which means that the camera head's degrees of freedom are used to keep the target fixated, i.e., centered in the left and right camera images.

Our scheme for active tracking controls the vergence angle independently of the pan and tilt angles.

For the control of the vergence angle between the cameras, we add the attention maps of the left and right image with three different disparities, i.e., different relative horizontal displacements of a positive, zero and negative number of pixels. The resulting images stress features in front of, in, and behind the plane of fixation, respectively. We compute the global maxima in those three images. If the highest response was for the image focused in front of the plane of fixation, the vergence angle is increased, i.e., the cameras converge. Conversely, if it was highest behind the plane of fixation, the vergence angle is decreased. The distance of the hand may be computed directly from the current vergence angle. We reach an accuracy of about 8 cm for distances around 1 m and of about 40 cm for distances around 2.5 m.
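A minimal sketch of this vergence decision follows; the pixel displacement and the sign convention of the shift are illustrative assumptions.

```python
# Minimal sketch of the vergence decision: add the attention maps with a
# positive, zero and negative horizontal displacement and compare the
# resulting global maxima. The displacement value is an illustrative assumption.
import numpy as np

def vergence_step(att_left, att_right, shift=2):
    peaks = {}
    for label, d in (("near", +shift), ("fixation", 0), ("far", -shift)):
        peaks[label] = (att_left + np.roll(att_right, d, axis=1)).max()
    best = max(peaks, key=peaks.get)
    if best == "near":
        return +1    # increase vergence angle: cameras converge
    if best == "far":
        return -1    # decrease vergence angle: cameras diverge
    return 0         # target lies in the plane of fixation
```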

If the highest of the three global maxima comes close to a vertical or horizontal border of the image, indicating that the hand is about to leave the field of view, an appropriate saccade with the other degrees of freedom (pan and tilt) is made to bring it back to the image centers.

During movements of the camera head image acquisition is stopped, because the motion cue of the tracking would sense motion almost everywhere. Thus, during active tracking the effective frame rate depends on the rate and sizes of saccades and is generally lower than the 8 Hz mentioned above.

Figure 5. Result of the tracking: Attention maps were extracted from both cameras and added up. This suppresses targets which are not in the plane of fixation, like the second hand in the upper right of figures 3 to 5.

Figure 6. The six postures tell the robot whether to grasp an object from above, the front, the side and so on.

4. Hand Posture Classification

When the hand comes to rest, its posture is analyzed and its precise three-dimensional position is computed. For this purpose, regions of interest of 256 × 256 pixels are selected around the points yielded by the tracking for the two cameras. Greyscale images of these regions are downsampled by a factor of two and serve as the input for the hand posture classification. We currently use the six different postures depicted in figure 6, which tell the robot whether to grasp the object from, e.g., the front, the side, above and so on.
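A minimal sketch of this input preparation is given below; border clipping and plain decimation are implementation assumptions.

```python
# Minimal sketch of the input preparation for posture classification:
# a 256 x 256 region of interest around the tracked hand position,
# downsampled by a factor of two. Border clipping and plain decimation
# are implementation assumptions; `grey` is a 2-D numpy array.
def posture_input(grey, cx, cy, size=256, factor=2):
    h, w = grey.shape
    half = size // 2
    y0, y1 = max(0, cy - half), min(h, cy + half)
    x0, x1 = max(0, cx - half), min(w, cx + half)
    return grey[y0:y1:factor, x0:x1:factor]
```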

The posture recognition is based on elastic graph matching, which has already been successfully applied to object and face recognition, e.g., [7, 12]. Processing is done on grey scale images and works in a person-independent way and in the presence of varying complex backgrounds. We use an adaptation of our earlier system [10].

In elastic graph matching, objects are represented as labeled graphs, where the nodes carry local image information and the edges contain information about the geometry. One model graph is created for each posture. The graphs used in this study have about 25 nodes and 20 edges. The local image information at each node is represented by a vector of responses to Gabor based kernels called a jet. The kernels are DC-free and defined by:

\psi_{\vec{k}}(\vec{x}) = \frac{\vec{k}^2}{\sigma^2} \, \exp\!\left(-\frac{\vec{k}^2 \vec{x}^2}{2\sigma^2}\right) \left[ \exp\!\left(i\,\vec{k}\cdot\vec{x}\right) - \exp\!\left(-\frac{\sigma^2}{2}\right) \right]
The orientation and size (wavelength) of the kernels is parameterized by the wave vector \vec{k}. A sample for a specific \vec{k} value is depicted in figure 7. The jets are the responses of convolutions with kernels of three different sizes and eight different orientations and give a local image description. We fuse the graphs obtained from two persons performing the posture into a single bunch graph for each posture; for details on the bunch graph concept see [12].
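A minimal sketch of the jet computation is given below; the kernel size, σ, the maximal frequency and the spacing of the frequency levels are illustrative assumptions, and only the kernel formula and the 3 × 8 sampling of sizes and orientations follow the description above.

```python
# Minimal sketch of a jet: responses of the DC-free Gabor kernels defined
# above at one image point, for three sizes and eight orientations.
# Kernel size, sigma, maximal frequency and frequency spacing are
# illustrative assumptions; the point is assumed to lie far enough from
# the image border for the kernel patch to fit.
import numpy as np

def gabor_kernel(kx, ky, sigma=2.0 * np.pi, size=33):
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    k2 = kx ** 2 + ky ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * (np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2))

def jet_at(image, px, py, n_sizes=3, n_orient=8, k_max=np.pi / 2):
    jet = []
    for s in range(n_sizes):
        k = k_max / (np.sqrt(2) ** s)            # coarser kernels for larger s
        for o in range(n_orient):
            phi = o * np.pi / n_orient
            g = gabor_kernel(k * np.cos(phi), k * np.sin(phi))
            half = g.shape[0] // 2
            patch = image[py - half:py + half + 1, px - half:px + half + 1]
            jet.append(np.sum(patch * g))        # complex filter response at (px, py)
    return np.array(jet)
```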

A complete description of the matching process of a bunch graph onto an image is given in [10]. Here, we only qualitatively describe the procedure. A graph representing a particular posture is matched to an image by moving it across the image until the jets at each node fit best to the regions in the image they come to lie on. During the matching process we allow for certain geometrical transformations of the graph:

- Scaling: the graph may change its size by up to 15%.

- Rotation in plane: the graph may rotate in the image plane by up to 10 degrees.

During both kinds of transformations the jets are not changed, since the Gabor responses are robust with respect to small geometrical transformations. In contrast to previous versions of our system we do not use a local diffusion of single nodes here, since this was the computationally most expensive part of the matching procedure and experiments have shown that recognition is still reliable without it.
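A minimal sketch of this rigid matching is given below; the scan step, the discrete scale and rotation grid, the image margin and the jet similarity measure are illustrative assumptions, and `jet_fn` stands for a jet extractor such as the one sketched above.

```python
# Minimal sketch of the rigid graph matching: the whole graph is translated,
# globally scaled (up to 15%) and rotated in the plane (up to 10 degrees),
# and the placement with the highest mean jet similarity wins. Scan step,
# scale/rotation grid, margin and similarity measure are illustrative
# assumptions; jet_fn(image, x, y) returns a jet at the given pixel.
import numpy as np

def jet_similarity(j1, j2):
    a, b = np.abs(j1), np.abs(j2)                # compare jet magnitudes only
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def match_graph(image, node_offsets, model_jets, jet_fn,
                scales=(0.85, 1.0, 1.15), angles_deg=(-10, 0, 10),
                step=4, margin=20):
    h, w = image.shape
    best_score, best_pose = -1.0, None
    for s in scales:
        for a in np.deg2rad(angles_deg):
            rot = np.array([[np.cos(a), -np.sin(a)],
                            [np.sin(a),  np.cos(a)]])
            offsets = (s * np.asarray(node_offsets)) @ rot.T
            for cy in range(margin, h - margin, step):
                for cx in range(margin, w - margin, step):
                    sims = [jet_similarity(jet_fn(image, int(cx + dx), int(cy + dy)), mj)
                            for (dx, dy), mj in zip(offsets, model_jets)]
                    score = float(np.mean(sims))
                    if score > best_score:
                        best_score, best_pose = score, (cx, cy, s, np.degrees(a))
    return best_score, best_pose
```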

For recognition of a posture in a single image, the graphs of all postures are sequentially matched to the image and the posture whose graph obtains the highest similarity is selected as the winner. In contrast to this scheme, we are dealing with stereo image pairs here. For the purpose of posture recognition we match the graph for each posture in both images sequentially and then add the similarities of the same posture for both images. The posture with the highest total is chosen as the recognition result. Current research deals with more intelligent strategies for Elastic Graph Matching on stereo image pairs, where the matching is made subject to geometric constraints. For instance, the objects may be assumed to have a similar scaling in both images.

Figure 7. Nodes of the graphs are labeled with the responses to Gabor based kernels. They have the form of a plane wave restricted by a Gaussian envelope function. Left: real part, right: imaginary part.

Figure 8. Example of a graph of the correct posture being matched to the input image. Due to the graph's rigidity, the match cannot be perfect but is good enough for recognition.
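A minimal sketch of this stereo decision is given below; the dictionary-based interface is an assumption.

```python
# Minimal sketch of the stereo decision: the best similarity of each posture
# in the left and right image is added and the posture with the highest
# total is selected. The dictionary interface is an assumption.
def recognize_posture(sims_left, sims_right):
    totals = {p: sims_left[p] + sims_right[p] for p in sims_left}
    return max(totals, key=totals.get)
```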

The result of a matching process on the image of one camera is depicted in figure 8. As we do not use the local diffusion of nodes but keep the graph rigid apart from global scaling and rotation, the match does not fit particularly well, but it is reliable enough for posture recognition.

Experiments with new users show a correct classification in four out of five cases against realistic, moderately complex backgrounds. Our previous system [10] reached about 86% correct recognition for 10 different postures in front of highly complex backgrounds as depicted in figure 9, but there the lighting was more tightly controlled. By using fewer allowed postures and coarser model graphs, we managed to reduce the previous recognition time of 16 seconds significantly. The Gabor transformation of the images takes 2.95 seconds on a conventional Sun UltraSparc workstation. The matching of the model graphs adds another 1.88 seconds.

4.1. First and Second Gesture

We distinguish between a first and a second gesture during the task. The position of the first gesture indicates the object to be grasped and the posture indicates the direction from which it is to be grasped. For the first gesture, the precision of the positioning of the graphs on the images is not an important issue (as long as the different postures can be distinguished), because the following object localization and recognition steps can compensate for small errors.

For the second gesture, which determines the point in space where the object is to be placed, accuracy is vital, since with the image resolution we currently use, an error of only one pixel in graph position leads to an error of approximately 1 cm in the hand's position in space (the depth component is particularly susceptible). On the other hand, we do not have to distinguish different postures here, so that we can require the operator to use a particular posture when performing the second gesture, which speeds up the matching process by a factor equal to the number of postures tested for the first gesture.


Figure 9. One of the postures performed by nine subjects against complex backgrounds. The pictures give a good impression of the variability in size and posture of the hands and the complexity of backgrounds with which systems of our kind can cope.

5. Discussion and Outlook

We have presented a person-independent gesture interface implemented on a real robot. The system has proved its robustness in demonstrations at our lab, where visitors test it frequently. Apart from the earlier mentioned problems of tracking under drastically changed lighting conditions, which we hope to solve using online color recalibration, the system meets all the requirements posed in the introduction for robust human-robot-interaction: it works in the presence of varying complex backgrounds; it is person-independent; no retraining is necessary for new users; the user is not required to wear markers or gloves; and the tracking is capable of real time performance. For the computationally more extensive posture analysis, there is a complex tradeoff between the allowed number of postures, the accuracy and the speed of the matching process.

For the future, the introduction of color recalibration would make the system more robust with respect to different illuminations. Beyond that, employing color information in the recognition process seems promising. We are currently working on adding color information to the jet at each node of a model graph. Preliminary results of this attempt appear in [11]. We also intend to close the gestural communications loop by letting the robot perform gestures itself, e.g., by pointing to unfamiliar objects whose names the user then supplies to the robot. Combination with a speech based interface is also a medium-term project to achieve more natural interaction.

Acknowledgements

This work was supported by a grant from the German Federal Ministry for Science and Technology (01 IN 504 E9).

References

[1] M. Becker, E. Kefalea, E. Maël, C. v.d. Malsburg, M. Pagel, J. Triesch, J. C. Vorbrüggen, R. P. Würtz, and S. Zadel. GripSee: a robot for visually-guided grasping. In preparation, 1998.

[2] L. W. Campbell, D. A. Becker, A. Azarbayejani, A. F. Bobick, and A. Pentland. Invariant features for 3-D gesture recognition. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, USA, October 14-16, 1996.

[3] Y. Cui and J. J. Weng. Hand sign recognition from intensity image sequences with complex backgrounds. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, USA, October 14-16, 1996.

[4] D. Franklin, R. E. Kahn, M. J. Swain, and R. J. Firby. Happy patrons make better tippers — creating a robot waiter using Perseus and the animate agent architecture. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, USA, October 14-16, 1996.

[5] T. Heap and D. Hogg. Towards 3D hand tracking using a deformable model. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, USA, October 14-16, 1996.

[6] R. Kjeldsen and J. Kender. Toward the use of gesture in traditional user interfaces. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, USA, October 14-16, 1996.

[7] M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. v.d. Malsburg, R. P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42:300–311, 1993.

[8] C. Maggioni. Gesturecomputer — new ways of operating a computer. In Proceedings of the International Workshop on Automatic Face- and Gesture Recognition, Zürich, Switzerland, June 26-28, 1995.

[9] V. I. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-computer interaction: A review. IEEE Trans. PAMI, 19(7), 1997.

[10] J. Triesch and C. v.d. Malsburg. Robust classification of hand postures against complex backgrounds. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Killington, Vermont, USA, October 14-16, 1996.

[11] J. Triesch and C. v.d. Malsburg. Robotic gesture recognition. In Gesture Workshop (GW'97), Bielefeld, Germany, 1997.

[12] L. Wiskott, J.-M. Fellous, N. Krüger, and C. v.d. Malsburg. Face recognition by elastic graph matching. IEEE Trans. PAMI, 19(7), 1997.
