Method Pipeline - Understanding the World: High Level Perception Based on Spatial and Intuitive

In [Li et al., 2017], a simulation-based method was proposed to infer stability during robotic manipulation of cuboid objects. This method includes a learning process that uses a large set of generated simulation scenes as a training set.

However, people look at things from different angles to gather complete information for a better understanding. For example, before a snooker player takes a shot, the player will usually look around the table from several critical views to gain an overall understanding of the situation. The same applies to robots that take images as their input source. A single input image provides incomplete information. Even when several images of the same static scene are taken from different views, the information may still be inadequate for understanding detailed physical and spatial properties of the scene with quantitative models, which require precise inputs. Qualitative reasoning has been demonstrated to be suitable for modelling incomplete knowledge [Kuipers, 1989]. Various qualitative calculi have been devised to represent entities in a space, each focusing on different properties of the entities [Randell et al., 1992; Liu et al., 2009a; Ligozat, 1998; Guesgen, 1989; Mukerjee and Joe, 1990; Lee et al., 2013]. Reasoning with the combination of rectangle algebra and cardinal direction relations, two well-known examples of qualitative spatial calculi, has also been studied [Navarrete and Sciavicco, 2006]. In our previous work [Zhang and Renz, 2014], we developed an Extended Rectangle Algebra (ERA), which simplifies the idea in [Ge and Renz, 2013] to infer the stability of 2D rectangular objects. It is possible to combine ERA with an extended version of the basic cardinal direction relations to qualitatively represent detailed spatial relations between objects, which helps to infer the transformation between two views. It is worth mentioning that [Panda et al., 2016] proposed a framework to analyze the support order of objects from multiple views of a static scene, yet this method requires relatively accurate image segmentation and a known order of the images for object matching.
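To make the idea of a qualitative rectangle relation concrete, the following minimal sketch computes the basic rectangle algebra relation between two axis-aligned rectangles as a pair of Allen interval relations, one per axis. It illustrates only the standard rectangle algebra, not the ERA of [Zhang and Renz, 2014]; the function names, the `(xmin, ymin, xmax, ymax)` tuple layout, and the relation labels are assumptions made for this example.

```python
def allen_relation(a, b):
    """Return the Allen interval relation of interval a with respect to b.

    Intervals are (start, end) pairs with start < end. Exact equality is
    used for endpoints; noisy measurements would need a tolerance instead.
    """
    (a1, a2), (b1, b2) = a, b
    if a1 >= a2 or b1 >= b2:
        raise ValueError("intervals must have positive length")
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"
    # From here on the two intervals share an open sub-interval.
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    return "overlaps" if a1 < b1 else "overlapped-by"


def rectangle_relation(r, s):
    """Return the (x-axis, y-axis) pair of Allen relations of r w.r.t. s.

    Rectangles are (xmin, ymin, xmax, ymax) tuples (assumed layout).
    """
    rx, ry = (r[0], r[2]), (r[1], r[3])
    sx, sy = (s[0], s[2]), (s[1], s[3])
    return allen_relation(rx, sx), allen_relation(ry, sy)


# A small block resting on top of a wider block.
lower = (0, 0, 4, 1)
upper = (1, 1, 3, 2)
print(rectangle_relation(upper, lower))
```

In this example the upper block lies within the horizontal extent of the lower block and touches it from above, which is the kind of configuration a qualitative stability analysis has to distinguish from, say, an overhanging block.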

Models for predicting the stability of a structure have been studied over the past several decades. Fahlman [1974] proposed a model to analyze system stability based on Newton’s Laws. A few simulation-based models have also been presented in recent years [Cholewiak et al., 2013; Li et al., 2017]. However, Davis and Marcus [2016] argue that probabilistic simulation-based methods are not suitable for automatic physical reasoning because of limitations that include the inability to handle imprecise input information. Thus, in our approach, we aim to apply qualitative reasoning to combine raw information from multiple views and extract understandable and more precise relations between objects in the environment.

6.3 Method Pipeline

In this section, the overall pipeline of the support relation extraction method will be described. The method consists of three modules, namely image segmentation, view registration and stability analysis.

The image segmentation module takes a set of RGB-D images taken from different views of a static scene as input. To retain the generality of our method, we do not assume any pre-known shapes of the objects in the scene (which is why we do not use template matching methods, even though they can provide more accurate segmentation results). This setting makes our method applicable in unknown environments. In the implementation, we use LCCP [Stein et al., 2014] for point cloud segmentation. LCCP first represents the point cloud as a set of connected supervoxels [Papon et al., 2013]. The supervoxels are then segmented into larger regions by merging convexly connected supervoxels. We identify and correct segmentation errors by comparing the segmentations from different views (see Section 6.4.3). The point cloud of each view is segmented into individual regions. We use a connected graph to represent the relations between the regions: each graph node is a segmented region, and there is an edge between two nodes if the two regions are connected. We use the Manhattan world assumption [Furukawa et al., 2009] to find the ground plane. The entire scene is then rotated such that the ground plane is parallel to the flat plane. Details of segmentation and ground plane detection are not discussed in this chapter, as we use these methods with little change. Figure 6.1 shows a typical output from this module.

Figure 6.1: Segmentation of aligned images.
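The rotation step mentioned above can be illustrated with a short sketch. Assuming the ground-plane normal has already been estimated, and assuming the target orientation is the z-axis (both are assumptions of this example, as are all function names), Rodrigues' rotation formula gives a matrix that turns the normal onto that axis, so that the ground plane becomes horizontal. This is a generic illustration, not the implementation used in this work.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return [c / n for c in v]


def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def apply(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def rotation_aligning(normal, target=(0.0, 0.0, 1.0)):
    """Return a 3x3 rotation matrix R such that R * normal is parallel to target."""
    a, b = normalize(normal), normalize(target)
    v = cross(a, b)                          # rotation axis scaled by sin(angle)
    c = sum(x * y for x, y in zip(a, b))     # cos(angle)
    if c < -1.0 + 1e-9:
        # Antiparallel case: rotate by 180 degrees about any axis
        # perpendicular to a, using R = 2 u u^T - I.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        u = normalize(cross(a, helper))
        return [[2.0 * u[i] * u[j] - (1.0 if i == j else 0.0)
                 for j in range(3)] for i in range(3)]
    # Skew-symmetric matrix K of v, then R = I + K + K^2 / (1 + c).
    k = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    k2 = [[sum(k[i][m] * k[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    f = 1.0 / (1.0 + c)
    return [[(1.0 if i == j else 0.0) + k[i][j] + f * k2[i][j]
             for j in range(3)] for i in range(3)]


# Hypothetical ground normal of a slightly tilted camera frame.
ground_normal = [0.1, -0.2, 0.97]
R = rotation_aligning(ground_normal)
aligned = apply(R, normalize(ground_normal))
print([round(c, 6) for c in aligned])
```

Applying the same matrix to every point of the cloud (after translating a point of the ground plane to the origin, if a zero ground height is wanted) rotates the whole scene consistently.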

In the view registration module, we use an object matching algorithm to find an initial match between objects from two views. Panda et al. [2016] used a similar process to match objects across multiple views. However, their method assumes that the geometries of the objects are known, so that template-matching segmentation can be applied. In addition, the input views are assumed to be in either clockwise or anti-clockwise order and to be taken with a camera at the same height. While these settings lead to more accurate object matching, they make the method less applicable to real-world situations in which the robot cannot take pictures freely or the inputs come from multiple agents. To handle these situations, we develop an efficient method that can register objects from different views without those restrictions. In this chapter, we assume no prior knowledge, to ensure the generality of the method.

To achieve this, we need to solve some problems from a cognitive aspect. For example, due to occlusion, a large object may be segmented into different regions. Thus the match between objects in two views may not be a one-to-one match. Also, as the order of the input images is unknown, we need a method to determine if a match is acceptable by detecting conflicts regarding spatial relations among all
