Conclusion - Visual Perception For Robotic Spatial Understanding

We conclude this chapter with a summary of the various approaches, and a direction for the future.

5.8.1 On joint optimization

The joint DVO-ICP optimization approach improved on each of the individual methods, and allowed the system to generate the accurate models shown in section 6.3. However, it is certainly not the final answer. As I mentioned previously, because DVO is based on the photometric constancy assumption, it is sensitive to properties of the camera as well as to the scene illumination: changes in illumination can cause the camera to adjust the exposure and white balance between frames enough to violate the assumption. This was the largest source of failure for the algorithm, especially in scenes with low lighting or large variations in illumination intensity. In recent visual odometry papers using the direct approach, greater emphasis has been placed on characterizing the camera lighting response and correcting for or controlling exposure variations between frames, leading to greater accuracy in challenging conditions [12], [50], [234]. This trend should continue, although it makes the process more complex, and it should bring additional methods into the fold as well: for example, simple-to-compute indirect methods that may be more robust to lighting, as well as the use of intermediate or semantic features like contours and objects.

Another consideration, as mentioned in the discussion of Fig. 6.4, is that loop closing processes for mapping need a secondary alignment algorithm to handle the general case of matching two scenes from not only arbitrary viewpoints, but unknown poses. This would likely require an indirect approach, or something like spherical harmonics, to provide an approximate alignment at the minimum, followed up by the joint optimization.

5.8.2 On spherical harmonics

Ego-motion estimation using spherical harmonics has the potential to improve on the performance of feature-based algorithms, especially in challenging environments where there is geometry present but few features detected. This is similar to the benefits of combining ICP with DVO, except that spherical harmonics can find the rotation even when the near-frame projective assumption is violated. While the spherical harmonics approach also tended to be more efficient than some of the algorithms we compared in section 5.7.4, it is not as efficient as the joint DVO-ICP algorithm we optimized on the GPU. However, work such as that done by Schaeffer [183] to implement fast spherical harmonics transforms on the GPU could provide a route to integrating the approach with the joint algorithm.

5.8.3 Looking to the future

I do not think it is contentious to assert that robots need highly accurate ego-motion estimation, if only to integrate local views of their environment into a coherent model (this is discussed in the next chapter). While the algorithms discussed in this chapter present viable options, they each have benefits and drawbacks, and not one provides a robust enough solution in and of itself. Instead, the way forward will be to continue to integrate: specifically, we must find ways to efficiently make use of all the information available to us, including environment appearance and structure as well as proprioception (e.g., IMUs and wheel or limb actuation). We cannot tolerate failure in this task; the system must be robust to any condition, and operate to at least human-level standards in challenging environments. I propose the following requirements for ego-motion estimation: in any 4×4 meter environment, the system should maintain a localization error of less than 1 cm and 1 degree given at least 1 landmark in view. These constraints should make it possible to move and manipulate objects within reasonable error bounds, and prevent catastrophic motion estimation and mapping failures such as those from GICP in Fig. 5.8.
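As an illustration of how the proposed requirement could be checked against ground truth, the following is a minimal sketch rather than an evaluation procedure used in this work. The function names, the nested-list rotation matrices, and the choice of the geodesic angle between rotations as the rotational error are all assumptions made for the example.

```python
import math

def yaw_rotation(deg):
    """Hypothetical helper: 3x3 rotation about the z axis, as nested lists."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def relative_rotation(R_est, R_true):
    """Return R_est^T * R_true, the rotation separating estimate and truth."""
    return [[sum(R_est[k][i] * R_true[k][j] for k in range(3))
             for j in range(3)] for i in range(3)]

def rotation_angle_deg(R):
    """Angle of a rotation matrix in degrees, from its trace."""
    trace = R[0][0] + R[1][1] + R[2][2]
    # Clamp so that round-off cannot push the cosine outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def meets_requirement(t_est, R_est, t_true, R_true,
                      max_trans_m=0.01, max_rot_deg=1.0):
    """True if the pose error is below 1 cm and 1 degree (assumed thresholds
    taken from the requirement stated in the text)."""
    trans_err = math.dist(t_est, t_true)  # Euclidean distance in meters
    rot_err = rotation_angle_deg(relative_rotation(R_est, R_true))
    return trans_err < max_trans_m and rot_err < max_rot_deg
```

For example, an estimate that is off by 5 mm and 0.5 degrees of yaw passes the check, while one that is off by 2 cm does not.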

Chapter 6 High-resolution Local Mapping

6.1 Introduction

For a robot, mapping is the process by which an agent generates a spatial¹ model of its world, based on the information it gains from a collection of sensor measurements and a method of representation chosen to best support the goal of the model. For example, we know that humans generate representations of the world in the form of cognitive maps [30], [102], [220], [227] (although what specifically constitutes those maps and how they are generated is still an area of research). In addition, humans have been generating maps for thousands of years as external artifacts that document what we know about the planet and the places in which we live. In everyday experience, we use map artifacts posted in shopping malls, subways, and rail cars to determine how we will navigate within the relevant environment. They provide a representation that we can use to suitably plan our behavior, such as route following.

¹ I emphasize “spatial” here, since there are other ways a robot may model the world; e.g., causal relations between actions and effects, physical properties of motion like gravity and inertia, social behaviors of humans, and many others.

Mapping is a key component for understanding and operating within the world since it enables navigation through planning. Any embodied intelligent agent needs to be able to find its way in the environment, and without some model of the surroundings, it would be very difficult. Imagine not having a model (i.e., map) of your own living space; how would you be able to function effectively? Every time you needed a bowl to eat or needed to use the restroom, you would have to search randomly. It is actually hard to imagine this scenario, since this kind of spatial and semantic modeling is such an inherent human capability. We argue that building these models is also a requirement for any embodied system that must carry out arbitrary tasks over long periods of time. Experience should allow the system to build up effective spatial models of the world so it can use them to plan to navigate as efficiently as possible with respect to the current goals of the system.

Mapping, or environment modeling², has another more fundamental purpose as well: providing a basis for understanding the spatial structure of the world. We conceive of these models in the following terms based on this thought experiment: imagine you are an intelligent autonomous system (without access to human intelligence) and you possess a set of sensors that provide a snapshot of information about the world³. Let us interpret a single snapshot as a model of the world. This model may be useful in itself, depending on a variety of factors. However, its utility may be directly proportional to how much reasoning and planning you can do with it. If this snapshot happens to be an organized set of colored pixels (i.e., an image), then it may be very hard to do anything with it, without some significant experience and learning algorithms that allow you to interpret the image in a way that enables you to achieve your current goal (e.g., exit the room). If you are looking towards a doorway already, and you have an algorithm that can recognize doorways, then perhaps you can generate a motion command that would move you towards the doorway. However, did you recognize the heavy object (say, a desk) that happened to be in the way? A more informative sensing modality (such as a depth camera or laser range finder) may have given different information that is more directly usable (i.e., obstacles to avoid, whether recognized or not), and therefore you may have been more successful with this single snapshot.

² Throughout this chapter, I shall use these terms interchangeably, as they arguably mean the same thing. However, we tend to think of environment modeling as a special kind of mapping that focuses on more accurate representations of the (local) environment.

³ Please allow me to use this discretized metaphor, as it is easy to talk about and relate to the [...]

In the same vein, consider the ability to merge snapshot after snapshot (say, at a rate of 15 Hz) into a single coherent representation that respects your ego-motion and the actual structure of the world. It would then be possible to “look around” and build up a larger representation that is more useful than any individual part on its own. It is easy to imagine the utility of this kind of representation at scale, providing enough information to plan a path to a goal through an entire room, building, or city. It would simply not be possible to do so without this model. In contrast, if we imagine a robotic insect that cannot build these representational models and can only make choices based on the incoming snapshots, one at a time, then we can see how difficult it might be to reason about future behavior. Yet, reasoning about future behavior in order to achieve goals is one part of intelligence.
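The idea of merging snapshots so that they respect ego-motion can be sketched in a few lines. This is a deliberately simplified planar example, not the mapping approach developed in this chapter: each snapshot is assumed to be a list of 2D points in the sensor frame, paired with an assumed known pose (x, y, θ) of the sensor in the world frame. The names `to_world` and `merge_snapshots` are hypothetical.

```python
import math

def to_world(points, pose):
    """Transform sensor-frame 2D points into the world frame.

    pose is (x, y, theta): sensor position and heading in the world frame.
    """
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    # Rotate each point by theta, then translate by the sensor position.
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in points]

def merge_snapshots(snapshots):
    """Accumulate (points, pose) snapshots into one world-frame point list."""
    world = []
    for points, pose in snapshots:
        world.extend(to_world(points, pose))
    return world

# The same landmark at world position (2, 0), seen from two different poses.
snapshots = [
    ([(2.0, 0.0)], (0.0, 0.0, 0.0)),           # looking along +x from the origin
    ([(0.0, -1.0)], (1.0, 0.0, math.pi / 2)),  # at (1, 0), rotated 90 degrees
]
merged = merge_snapshots(snapshots)
```

When the poses are accurate, both observations land on the same world coordinate. When they are not, the copies drift apart, which is why the representation depends on accurate ego-motion estimation.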

6.1.1 Kinds of maps

There are several different kinds of maps (and many different kinds of algorithms) that have been used for motion planning. In this section, we give a brief overview of the map types useful for understanding the place our mapping algorithm fits in the overall picture.

How many dimensions?

Maps can be represented in either 2 or 3 dimensions. In the 2D case, occupancy grids are often the default. The environment is assumed to be flat (or at least, it is acceptable for the representation of the environment to be perceived as flat), and is subdivided into equally spaced regions (also called cells) represented by one or more values stored at each (x, y) coordinate. In the typical case, the value stored at a location represents the likelihood of that cell being occupied. If a cell is occupied, it contains some kind of object that would prevent the robot from moving through that cell. In this 2D planar world, motion planning usually occurs in 3 dimensions: (x, y, θ), representing the (x, y) translation relative to some origin and the yaw θ representing the robot’s heading.
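A 2D occupancy grid of this kind might be sketched as follows. The log-odds update is a common way to accumulate occupancy evidence, but it is an assumption here, not something the text prescribes. The class name, the increments `l_hit` and `l_miss`, and the convention that the grid origin sits at world coordinate (0, 0) are all hypothetical.

```python
import math

class OccupancyGrid:
    """Minimal 2D occupancy grid storing one log-odds value per cell."""

    def __init__(self, width_m, height_m, cell_size_m):
        self.cell_size = cell_size_m
        self.cols = math.ceil(width_m / cell_size_m)
        self.rows = math.ceil(height_m / cell_size_m)
        # Log-odds of 0 corresponds to an occupancy probability of 0.5.
        self.log_odds = [[0.0] * self.cols for _ in range(self.rows)]

    def cell_of(self, x, y):
        """Map a world coordinate in meters to a (row, col) index.

        No bounds checking is done; callers must stay inside the grid.
        """
        return int(y // self.cell_size), int(x // self.cell_size)

    def update(self, x, y, hit, l_hit=0.85, l_miss=-0.4):
        """Add evidence that the cell containing (x, y) is occupied or free."""
        row, col = self.cell_of(x, y)
        self.log_odds[row][col] += l_hit if hit else l_miss

    def probability(self, x, y):
        """Convert the stored log-odds back to an occupancy probability."""
        row, col = self.cell_of(x, y)
        return 1.0 / (1.0 + math.exp(-self.log_odds[row][col]))

grid = OccupancyGrid(width_m=4.0, height_m=4.0, cell_size_m=0.5)
grid.update(1.2, 3.1, hit=True)
grid.update(1.2, 3.1, hit=True)
```

Repeated hits push the probability of a cell towards 1 and repeated misses towards 0, so a planner can threshold the value to decide whether the robot may pass through.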

[...] navigate; however, it does have some limitations. Our world is, in fact, not 2D but 3D. This means it is not possible to represent the height of obstacles and objects in the 2D occupancy grid, and it is not directly possible to represent situations such as the floors in a building or the hills in a field. We have even encountered environments that are mostly flat (another person would generally agree), but still cause problems with sensors such as 2D laser range finders because they are not actually flat: specifically, the imperfections of the surface caused the robot to pitch enough that the plane of the laser sensor intersected the ground meters in front of the robot, generating false obstacles in the occupancy grid [160]. Additionally, the accuracy of the model depends on the resolution of the grid. However, increasing the resolution requires quadratically more memory, and quadratically more time for many planning algorithms, since halving the cell size generates 4 times as many cells. One method that may be used to account for this involves representing the occupancy grid in a more efficient data structure, such as a quadtree [58].
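The scaling can be made concrete with a small calculation. This is only a worked illustration: `cell_count` is a hypothetical helper, and the 4 m environment size is chosen for the example.

```python
import math

def cell_count(side_m, cell_size_m, dims=2):
    """Number of cells in a uniform grid covering a square (or cubic) region."""
    cells_per_side = math.ceil(side_m / cell_size_m)
    return cells_per_side ** dims

# Halving the cell size of a 4 m x 4 m grid quadruples the number of cells.
coarse_2d = cell_count(4.0, 0.5)   # 8 x 8 cells
fine_2d = cell_count(4.0, 0.25)    # 16 x 16 cells

# In a uniform 3D grid the same change multiplies the cell count by 8.
coarse_3d = cell_count(4.0, 0.5, dims=3)
fine_3d = cell_count(4.0, 0.25, dims=3)
```

The cost grows faster still in 3D, which is part of the motivation for hierarchical structures such as quadtrees and octrees that subdivide only where detail is needed.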

In 3 dimensions, there are a variety of ways to represent a map. The key element, however, is that all three dimensions are present, and therefore any environment can be modeled as accurately as the representation, sensors, and algorithm will allow. In addition, in 3 dimensions, navigation can be performed through all 6 degrees of freedom, (x, y, z, θ, φ, ψ). In practice, whether on a ground or air robot, planning and motion are still relative to the local ground surface, which can be considered a 2D manifold embedded in 3D space.

What is modeled?

In the previous section, we went into some detail about a particular 2D map representation. There are others, and the representations are relevant to both the 2D and the 3D case. The 2D occupancy grid has a 3D analog: the 3D occupancy grid, usually stored in a multi-resolution spatial data structure such as an octree [140] (also, see section 7.2.1).

[...] cloud-based representations and sparse, landmark-based representations. The 2D and 3D occupancy grids are examples of dense grid representations. A point cloud-based representation can also be considered dense, but does not explicitly represent the “free space”, although it can usually be computed based on the viewpoint of the agent. In addition, point cloud representations don’t depend on a grid-based resolution, as they are data structures containing points, and therefore can be more efficient than an octree-based occupancy volume for the same information. Another dense representation, popularized originally in KinectFusion [99] but used extensively thereafter [38], [122], [128], [153], [214], [216], [217], is a TSDF volume [36]. Similar to an occupancy grid, this representation is particularly amenable to merging multiple depth maps (R_k) interpreted as rays cast into a volume. Each depth reading represents a noisy measurement we assume can be truncated by some value µ, such that cells along the ray at r < λR_k(p) − µ are considered free space, and similarly cells along the ray at r > λR_k(p) + µ are considered occupied space. The cells along r between these two bounds are uncertain, but are stored as a signed distance function where positive values are outside the surface, the point at 0 represents the surface, and negative values are inside the surface. This allows an uncertain representation of the surface that can be updated by new readings, allowing the 0 crossing to change as appropriate based on weighted averaging of the cell values.
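The weighted-average update can be sketched along a single ray. This is a simplified one-dimensional illustration under several assumptions, not the KinectFusion implementation: `measured_range` stands for the already scaled reading λR_k(p), every cell on the ray is updated with a distance clamped to [−µ, µ], and each reading carries a weight of 1. Practical implementations may instead skip cells that lie far behind the measured surface. All function names are hypothetical.

```python
def update_tsdf_ray(tsdf, weights, cell_ranges, measured_range, mu,
                    new_weight=1.0):
    """Fuse one range reading into the cells along a ray, in place."""
    for i, r in enumerate(cell_ranges):
        # Positive in front of (outside) the surface, negative behind it.
        sdf = measured_range - r
        # Truncate: cells far in front saturate at +mu, far behind at -mu.
        sdf = max(-mu, min(mu, sdf))
        w = weights[i]
        # Running weighted average of all readings seen by this cell.
        tsdf[i] = (w * tsdf[i] + new_weight * sdf) / (w + new_weight)
        weights[i] = w + new_weight

def zero_crossing(tsdf, cell_ranges):
    """Interpolate the range where the stored function changes sign."""
    for i in range(len(tsdf) - 1):
        a, b = tsdf[i], tsdf[i + 1]
        if a > 0.0 >= b:
            t = a / (a - b)  # fraction of the way from cell i to cell i + 1
            return cell_ranges[i] + t * (cell_ranges[i + 1] - cell_ranges[i])
    return None

cell_ranges = [i * 0.1 for i in range(21)]  # cells every 10 cm out to 2 m
tsdf = [0.0] * len(cell_ranges)
weights = [0.0] * len(cell_ranges)

# Two noisy readings of the same surface.
update_tsdf_ray(tsdf, weights, cell_ranges, measured_range=1.0, mu=0.3)
update_tsdf_ray(tsdf, weights, cell_ranges, measured_range=1.1, mu=0.3)
surface = zero_crossing(tsdf, cell_ranges)
```

With equal weights, the recovered zero crossing falls midway between the two readings, which is the averaging behavior that lets the estimated surface move as new depth maps arrive.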

At the other end of the spectrum, landmark-based representations allow sparsity because they only store points or objects in the environment that have been detected, and then, only those points or objects that are likely to be robust to different viewpoints or lighting conditions. While these representations can still be useful for pose estimation and mapping, it is much harder to generate path plans, since the actual structure of the environment is not explicit.

6.1.2 High-resolution 3D map

We choose to generate a high-resolution 3D map, but this choice comes with benefits and compromises. A high-resolution cloud of surfels provides enough information for segmentation algorithms that may be involved in online object learning; the fact that multiple frames often contribute to single surfels means the overall noise is reduced, generating better boundaries between objects and fewer holes in the model. High-resolution models provide more detail, meaning smaller objects can be resolved. Relationships between objects provide information about semantic associations and typical spatial organization of related objects.

On the other hand, high-resolution models make it difficult to manage the size of the objects in memory (whether CPU or GPU, although the latter is usually more constrained). Also, while sufficient information is present to support navigation, working with such a dense point cloud is difficult, and still requires processing into something simpler before it can be used effectively for path planning (e.g., generating a low-polygon mesh for collision detection).

In the rest of this chapter, we document our approach to dense, high-resolution 3D mapping using RGB-D sensors, which we call the surfel modeler. We chose this approach to support learning online object recognition tasks (i.e., by capturing partial object models) and to support learning about spatial relationships between objects and regions⁴. Our process relies directly on the incremental processing of dense depth maps provided by an RGB-D sensor, and also requires locally accurate ego-motion estimation.
