• No results found

Determination of the Region-Of-Interest

4.5 Perception and Representation of Objects

4.5.2 Determination of the Region-Of-Interest

The alignment of the camera on the referenced Region-Of-Interest which ideally contains the object is a very challenging task. In order to solve this task, two dif- ferent algorithms were implemented in the Object Attention System, both mainly using the output of the gesture recognition system. The first one for the princi- pally 2D-gesture recognition in combination with the depth data provided by the laser range finder of the robot BIRON. The second algorithm is able to deal with the data provided by the 3D-Body Model Tracker in combination with the gesture recognition, cf. section3.1 on page36. Thus, the latter one is a great deal more precise than the first algorithm based on the gesture recognition using skin-color region tracking. Nevertheless, the goal for both approaches consists in the align- ment of the object camera which uses the robot coordinate system depicted in Figure 4.10. This is of special interest, as the robot and the object camera, re- spectively, use a completely different coordinate system than the gesture recogni- tion, although all positions, finally, need to be transformed into the object camera coordinate system. ~ zr ~ xr d h ~r ϑ ~ yr B

(a) Robot’s cylindric coordinate system, adapted from [Lan05]. (b) View from top

on robot’s coordi- nate system.

Figure 4.10: Robot coordinate system used for the Object Attention System. As Figure 4.10(a)shows is the object camera coordinate system a cylindric one with its originBfor the height ~zr on the bottom of the robot. In that plane, the axis

~

yr pointing to the front origins in the middle of the robot, while the orthogonal axis ~

4.5. Perception and Representation of Objects 65 [rad] has its 0◦angle pointed to the front of the robot, measured counterclockwise as illustrated in Figure4.10(b). The remaining components height h [m], distance d [m], and radius r [m] have the same origin point B as the Cartesian ~xr, ~yr, ~zr coordinate system as shown in Figure4.10(a).

The computation of the position for the Region-Of-Interest starts in both cases as soon as the dialog component sends the order to align the camera on an object. For efficiency reasons, the following computation is only performed for the relevant gesture and not for all stored gestures in the Short-Term Memory. This becomes available by the analysis of the temporal correlation between speech and gesture, as it is described in section 4.1. Based on the content of the XML data sent by the gesture recognition module, the Object Attention System decides how to proceed. In particular, two cases are distinguished, one for 2D-data, and one for 3D-data.

Dealing with 2D-Gesture Data for the Region-Of-Interest

If the 3D-Body Model Tracking System is not used, the gesture recognition itself supports the Object Attention System with the 2D-hand position and an estimated 2D-pointing direction α (0◦. . . 360). Thus, it is necessary for the Object Attention System to extrapolate the pointing direction as otherwise the center of mass of the hand would be interpreted as the center of the Region-Of-Interest. The following paragraph, therefore, describes the basic proceeding in a simplified manner of the algorithm applied to overcome this restriction.

To get an appropriate coordinate for the Region-Of-Interest, trigonometric func- tions for right triangles are used to cover the various pointing directions, see Fig- ure4.11.

Figure 4.11: Illustration of geometric 2D-approximation for the Region-Of-Interest.

In particular, the slope m of the red-colored hypotenuse ~h is calculated which runs parallel to the pointing direction, cf. equation4.1.

m = ~o ~

66 4. Development of an Object Attention System The hypotenuse is then, starting from Point A’, continued in pointing direction for approximately 30 cm. In this way, the center for the Region-Of-Interest (B’) is determined. The shift of the Region-Of-Interest for about 30 cm is indeed a very coarse approximation, but in real experiments it has been proven as accurate enough.

The Object Attention System so far does not have an estimated distance (Robot ⇔ Hand) for the gesture position. Thus, the distance value of the person pro- vided by the Person Tracking and Attention System [Kle05, Lan05, Spe05] is used. However, it has been shown that if the user’s hand is in the upper half of the image captured by the gesture camera, the referenced location is a great deal nearer to the camera than the legs. This is caused by the circumstance that in such a case the user usually points to a location on a tabletop in front of him. In order to compensate this discrepancy, the distance value is then simply divided by the factor 2.

As a result of the calculation of the Region-Of-Interest, a gesture-based Attention Map, like it is depicted in Figure4.12is generated.

(a) Gaussian distribution. (b) Gesture-based Attention Map (isometric view)

modeled with a Gaussian distribution.

Figure 4.12: Visualization of a gesture-based Attention Map.

In particular, the Figure shows the input image provided by the object camera, which is overlayed by a black hidden layer and the actual gesture-based Attention Map modeled as Gaussian distribution. It illustrates the effect of the gesture map. In the center of the distribution peak (marked by the black arrow), the black hidden layer is completely penetrated while with decreasing amplitude of the Gaussian distribution G(x , y), the penetration decreases as well. The mathematical rela- tion for G(x , y) is described in equation 4.2. The variance σ2 is regulated with the mean size value provided by the Modality Converter. The black area of the Attention Map causes a fade out of the image parts not belonging to the Region- Of-Interest. G(x , y) = 1 2πσ2e − „ x 2+y2 2σ2 « (4.2) This method to calculate the location for the Region-Of-Interest is indeed only a coarse estimation. Therefore, the 3D-based estimation of the Region-Of-Interest

4.5. Perception and Representation of Objects 67 has been implemented as well which in turn applies the integrated 3D-Body Model Tracking System. The following section, hence, describes the algorithm devel- oped for the Object Attention System that allows a more precise localization for the Region-Of-Interest.

Dealing with 3D-Gesture Data for the Region-Of-Interest

The preliminary evaluation of the 3D-Body Model Tracking-based gesture infor- mation has shown that it provides a great deal more accurate positions of the es- timated Regions-Of-Interest than the 2D-based gesture recognition does. There- fore, the localization algorithm for the Region-Of-Interest has been adapted in order to be able to automatically deal with the changed preconditions. Neverthe- less, it has to be noted that the usage of the 3D-Body Model Tracking System is currently not capable to process the image streams in an online Human-Robot Interaction scenario, as the 3D-Body Tracker is still not optimized.

However, before the actual algorithm is explained, an overview of the different coordinate systems used for the localization task is given first. The destination coordinate system still remains the one for the object camera, as depicted in Fig- ure4.10on page64. Furthermore, the gesture recognition and the body tracker, respectively, use an orthogonal 3D-Cartesian coordinate system as the right half of Figure4.13 illustrates. Besides these two already known coordinate systems, an additional spherical user coordinate system is used. Although it would not be necessary to introduce a third coordinate system from a mathematical point of view, it is helpful for an easier illustration and implementation of the algorithm. This user coordinate system has its originCat the HEADPOS location provided by the 3D-Body Tracking System, described in section 3.1 on page 36. While the main directions are represented by the vectors ~ξ, ~ψ, and ~ζ, are the horizontal and the vertical angle denoted by ϕ and ϑ, respectively. This additional coordi- nate system is primarily used to enable an easy extrapolation of the 3D-pointing direction which is finally used to localize the Region-Of-Interest. In the following paragraph the algorithm is explained in detail.

In a first processing step, the positions for the RIGHTFINGERTIP D (marked as red square in Figure 4.13) are extracted. If they are not available, the also red marked RIGHTHANDPOS E tags are evaluated instead. As the latter one marks the wrist, a correction distance of 15 cm is added in motion direction of the pointing gesture. Usually three different locations of the RIGHTHAND- POS, RIGHTFINGERTIP, and the HEADPOS are present, due to the last three timesteps of the Condensation-based Trajectory Recognition (CTR) (cf. page38). Consequently, the last 200 ms (3 · 66,6 ms, related to the 15 fps used for the CTR) are considered and if they are sent to the Object Attention System, their mean value is calculated in order to suppress too large variations for the positions. This becomes necessary, as the 3D-Body Tracker uses non-deterministic probabilis- tic methods which result in recognition errors, especially in the depth values, as described by Schmidt in [SKF06]. These errors are mainly caused by the monoc- ular camera used, because with only one point of view, a lot ambiguities occur for different poses. A motion model that could reduce these ambiguities is, currently, not yet implemented in the 3D-Body Tracking System.

The second processing step transforms the extracted values related to the body tracker coordinate system which is spanned by the axes ~x , ~y, and ~z into the cylin-

68 4. Development of an Object Attention System

~

z

~x

User coordinate system

~

ψ

A

C

F

D

Body tracker coordinate system

~

ξ

~

ζ

ϑ

̺

−~y

E

ϕ

Figure 4.13: 3D-Coordinate Systems used for the Object Attention System. dric object camera/robot coordinate system (Figure 4.10). This transformation considers the horizontal and vertical offset as well as the tilt of the gesture cam- era related to the object camera coordinate system.

Next, the different user-dependent positions are transformed into the spherical user coordinate system. This enables a simple distance adaption between the user’s hand position and the center of the referenced Region-Of-Interest. This distance, however, is not fixed and depends on the object size verbally specified by the user. The corresponding numeric size value is returned by the Modality Converter as described in section 4.5.1 on page 62. The same size value is used for two other aspects. First, it is used to generate an Attention Map in a similar calculation as it is described in the section before for the 2D-gesture processing. Secondly, the size value is used to adjust the zoom factor of the object camera in order to capture a more detailed view of the Region-Of-Interest. Finally, after the position and the size of the referenced location has been determined, a last transformation outgoing from the user coordinate system into the object camera coordinate system is calculated. Consequently, the object camera is then aligned to the computed location of the Region-Of-Interest while the calculated zoom position is also transmitted to the camera. During the alignment of position and zoom, the actual camera settings are continuously evaluated with a frequency of approximately 10 Hz. Thus, the Object Attention System is able to initiate further processing as it is ensured that the camera has reached its final position. In the case of an error, the Object Attention System sends a message to the dialog component that an alignment on the object is not possible.

4.5. Perception and Representation of Objects 69 For reasons of an easy evaluation, another unit test has been developed. This is discussed later on in the evaluation chapter. Assuming that the camera has successfully been aligned towards the Region-Of-Interest, a visual scene analysis can be performed. The methods of analysis used by the Object Attention System, therefore, are the topic of the next paragraph.