Perception and Communication - A Mutual Benefit
5.3. A Probabilistic Model
5.3.5. Application in Human-Robot Interaction
To enable the robotic system BIRON to use this approach for grounding descriptions of spatial relations, a few prerequisites must be fulfilled. The robot needs advanced perception to pre-process the signals from two modalities so that they can be provided as input for the described approach. The visual analysis, which supplies the initial configuration of the furniture graph, was already described in Section 5.3.1. It analyzes the 3D information from an ASUS Xtion Pro depth sensor mounted on top of the robot in order to find furniture items in the field of view. It is also possible to register multiple scans, obtained by turning the robot in place, to get a wider view of the scene before starting the analysis.
As described in Section 3.3.2, the registration profits from the localization abilities of the robotic system, which provides an initial guess for the correct matching. The ICP algorithm then only needs to perform the fine adjustment of the point clouds.
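The division of labor between localization and ICP can be illustrated with a minimal sketch. It is not the registration code used on BIRON: it works on 2D points instead of 3D point clouds for brevity, uses brute-force nearest-neighbor matching, and all function names (`apply_pose`, `fit_rigid_2d`, `icp_refine`) are hypothetical. The only idea taken from the text is that the pose estimate from localization serves as the starting point, so that ICP merely refines it.

```python
import math

def apply_pose(pose, p):
    """Rotate point p by pose[0] (radians) and translate by (pose[1], pose[2])."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)

def fit_rigid_2d(src, dst):
    """Closed-form least-squares rigid transform mapping src[i] onto dst[i]."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(p[0] for p in dst) / n
    dy = sum(p[1] for p in dst) / n
    dot = cross = 0.0
    for (ax, ay), (bx, by) in zip(src, dst):
        ax, ay, bx, by = ax - sx, ay - sy, bx - dx, by - dy
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation aligns the rotated source centroid with the target centroid.
    return (theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy))

def icp_refine(source, target, initial_pose, iterations=10):
    """Point-to-point ICP that starts from an externally supplied pose guess."""
    pose = initial_pose  # assumption: provided by the robot's localization
    for _ in range(iterations):
        moved = [apply_pose(pose, p) for p in source]
        # Match every moved source point to its nearest target point.
        matches = [min(target, key=lambda q: math.dist(m, q)) for m in moved]
        pose = fit_rigid_2d(source, matches)
    return pose
```

With a good initial guess, nearly all nearest-neighbor matches are already correct in the first iteration, which is why only a fine adjustment remains. A poor guess can let the same loop converge to a wrong local minimum.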
The robot is also equipped with a directional microphone, which allows it to perceive utterances containing the spatial descriptions. For interpretation of the auditory signal, the system uses the SPHINX-4 speech recognition toolkit (Lamere et al., 2003). It is supplied with multiple task-specific grammars, which are defined manually for the situations the robot is expected to face. For every recognized utterance, a grammar tree containing the matched non-terminal and terminal symbols is created. As seen above, the grounding approach expects descriptions in the form located object → spatial relation → reference object. These can easily be created from the matched grammar trees using a few simple rules. The input is therefore not limited to canonical utterances that describe the relation. The descriptions can also be obtained from complex constructions like “please bring the chair which you put in front of the shelf yesterday” or from utterances using a different order than expected, like “in front of the shelf you can see the chair”.
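How such rules might operate on a grammar tree can be sketched as follows. The tree encoding (nested tuples), the non-terminal names `LOCATED_OBJECT`, `RELATION`, and `REFERENCE_OBJECT`, and the function names are assumptions made for illustration. They do not reproduce the grammars or the SPHINX-4 data structures used on the robot.

```python
from typing import NamedTuple, Optional

class Description(NamedTuple):
    located: str
    relation: str
    reference: str

def words(tree):
    """Collect the terminal words below a node, in utterance order."""
    if isinstance(tree, str):
        return [tree]
    collected = []
    for child in tree[1:]:
        collected.extend(words(child))
    return collected

def find(tree, symbol):
    """Return the first subtree labelled with the given non-terminal."""
    if isinstance(tree, str):
        return None
    if tree[0] == symbol:
        return tree
    for child in tree[1:]:
        hit = find(child, symbol)
        if hit is not None:
            return hit
    return None

def extract_description(tree) -> Optional[Description]:
    """Build 'located object -> relation -> reference object' from a tree."""
    parts = []
    # Hypothetical non-terminal names; the word order in the utterance is irrelevant.
    for symbol in ("LOCATED_OBJECT", "RELATION", "REFERENCE_OBJECT"):
        node = find(tree, symbol)
        if node is None:
            return None  # utterance contains no complete spatial description
        parts.append(" ".join(words(node)))
    return Description(*parts)

# "in front of the shelf you can see the chair" as a hand-written example tree
tree = ("UTTERANCE",
        ("RELATION", "in", "front", "of"), "the",
        ("REFERENCE_OBJECT", "shelf"),
        "you", "can", "see", "the",
        ("LOCATED_OBJECT", "chair"))
description = extract_description(tree)
```

Because the lookup goes by non-terminal label and not by position, the non-canonical word order in the example yields the same triple as the canonical form would.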
However, the approach currently assumes that the perspectives of the agent generating the model and the agent describing the scene are the same.
This means that the model is only valid if the person describing the furniture layout is located in the same spot where the initial furniture graph was perceived by the robot. The current state of the approach is not designed to handle perspective change. This is partly based on the findings of Moratz et al. (2003), who found in their experiments that humans mostly take the robot’s perspective; a future extension of this feature is nevertheless desirable. It is, however, possible to use the perceived furniture instances with their respective locations and classification results for establishing new models from different perspectives. The initial graph just has to be initialized with different initial assumptions about spatial relations.
An approach similar to the egocentric models in the ASM system would be conceivable here as well (see Section 3.3.1). Once the robot has identified a small set of typical interaction locations within the apartment, it can establish a corresponding set of disambiguation models for grounding descriptions. The same set of identified furniture in the environment could be used to initialize all of the models. Even the probability distributions for the furniture’s categories can be shared across the models, because they are independent of the perspective. The furniture locations and the viewpoint for each model are anchored in the allocentric representation, while the models themselves represent independent egocentric representations of the scenes. When descriptions of spatial relations occur, they can be matched using the model corresponding to the person’s location.
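The proposed organization can be sketched in a few lines. This is only an illustration of the idea, not an existing component of the system: the class names, the 2D map coordinates, and the nearest-viewpoint selection rule are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Furniture:
    name: str
    position: tuple        # (x, y) in the allocentric map
    category_probs: dict   # perspective-independent, shared by all models

@dataclass
class EgocentricModel:
    viewpoint: tuple       # (x, y) of the interaction location in the map
    heading: float         # viewing direction in radians
    furniture: list        # references to the shared Furniture objects

    def egocentric_position(self, item):
        """Item position in the viewer frame: x points forward, y to the left."""
        dx = item.position[0] - self.viewpoint[0]
        dy = item.position[1] - self.viewpoint[1]
        c, s = math.cos(self.heading), math.sin(self.heading)
        return (c * dx + s * dy, -s * dx + c * dy)

def select_model(models, person_position):
    """Pick the model whose viewpoint is closest to the describing person."""
    return min(models, key=lambda m: math.dist(m.viewpoint, person_position))

# Hypothetical scene: one shelf, seen from two typical interaction locations.
shelf = Furniture("shelf", (3.0, 1.0), {"shelf": 0.8, "cupboard": 0.2})
shared = [shelf]
model_a = EgocentricModel((0.0, 0.0), 0.0, shared)
model_b = EgocentricModel((2.0, -2.0), math.pi / 2, shared)
```

In this example the shelf lies to the left from the first viewpoint and to the right from the second, while both models refer to the same `Furniture` object and therefore to the same category distribution.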
In the interaction with a human interlocutor, the robot should not only understand the human’s utterances; it should also be able to answer or proactively formulate requests. The model for disambiguation can also be used for speech production. The verified labels of the furniture ensure the correct naming of the objects, and the correct choice of spatial relations in the formulation additionally supports the alignment with the interlocutor and therefore successful communication. From the information about the relations in the graph, the system can choose the most frequently used spatial relation for describing two objects. The formulation can additionally be influenced by the statistics about the usage of relative or intrinsic Reference Frames.
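A minimal sketch of this selection step is given below. The usage counts, the frame weights, and the function names are invented for illustration. How the graph actually stores relation statistics is not specified here, so the `(relation, frame) -> count` layout is an assumption.

```python
from collections import Counter

def choose_relation(usage, frame_weights=None):
    """Pick the relation with the highest (optionally frame-weighted) count.

    usage maps (relation, reference_frame) -> number of observed uses for
    one pair of objects; frame_weights optionally biases a Reference Frame.
    """
    frame_weights = frame_weights or {}
    scores = {}
    for (relation, frame), count in usage.items():
        weight = frame_weights.get(frame, 1.0)
        scores[relation] = scores.get(relation, 0.0) + count * weight
    return max(scores, key=scores.get)

def formulate(located, reference, usage, frame_weights=None):
    """Produce a simple referring phrase from the chosen relation."""
    relation = choose_relation(usage, frame_weights)
    return f"the {located} {relation} the {reference}"

# Hypothetical statistics for the pair (chair, shelf)
usage = Counter({
    ("in front of", "relative"): 5,
    ("left of", "intrinsic"): 4,
    ("left of", "relative"): 2,
})
plain = formulate("chair", "shelf", usage)
biased = formulate("chair", "shelf", usage, {"relative": 2.0, "intrinsic": 0.5})
```

Without weights, the raw counts favor "left of" (6 uses against 5). A preference for the relative Reference Frame shifts the choice to "in front of".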
5.4. Evaluation
The algorithm for grounding spatial descriptions was evaluated in two different studies. Johannsen and De Ruiter (2013) found that the scene type has a significant influence on the choice of Reference Frames when humans describe spatial relations. Consistent with this finding, Levinson (1996) claims that “relative systems of spatial description build in a viewpoint”, which implies that using the relative RF demands an embodied viewer in order to establish this viewpoint. Accordingly, whether the describing person sees a picture of the scene or is actually situated in the scene might also affect the RF selection. Furthermore, describing the scene to a virtual, not specifically named entity or to an actual robot might influence the selection process as well. Therefore, in a preliminary study, an online survey was conducted in which the participants had to describe a depicted scene using gapped sentences. In a second study, the participants were invited to a real apartment to describe the furniture to a real robot.