Approach - Visual Perception For Robotic Spatial Understanding

It is fairly clear that mapping is inherently tied to ego-motion estimation. Consider the phrase “merge snapshot after snapshot into a single coherent representation that respects ego-motion and the actual structure of the world.” The first part is “single coherent representation,” and the second part describes the two constraints. The first constraint means that the way we integrate each snapshot into the whole must respect the information from the system’s motion: this is metric information that describes the sensor pose at individual time instants and how the pose changes from one instant to the next. This is critical information and makes the task of merging snapshots easier; if we have it, then we know how we have moved with respect to the previous snapshot, and can use the knowledge of the motion to better align the snapshots for mapping. The second constraint means that we want the whole to be a good representation of the world, and to allow for reasoning algorithms that can directly translate into action within the world. This depends heavily on the ultimate goals for the representation. For example, if the only goal is to avoid obstacles and you are a ground robot (in one of many guises), you may not need a representation that models the full 3D structure of the world, or the height of objects, since you just need to see where there is space to move on the ground; in other words, a 2D occupancy grid would suffice. On the other hand, if you are an aerial robot, or need to understand and manipulate human-scale objects (as in the DARPA Robotics Challenge [39]), then a 3D model of the world is required.
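As a rough illustration of the 2D case, the sketch below collapses obstacle points onto the ground plane and marks the grid cells they fall into. The function name, the cell size, and the assumption that the points have already been filtered to obstacles (floor removed) are ours, not part of the system described in this chapter.

```python
def occupancy_grid(points, cell_size, width, height):
    """Mark grid cells that contain at least one obstacle point.

    points    : iterable of (x, y, z) in metres, already in the map frame
                and already filtered to obstacles (assumption).
    cell_size : edge length of one square cell, in metres.
    width     : number of cells along x.
    height    : number of cells along y.
    """
    grid = [[False] * width for _ in range(height)]
    for x, y, _z in points:
        # The height of the point is deliberately ignored: a ground robot
        # only needs to know which floor cells are blocked.
        col = int(x // cell_size)
        row = int(y // cell_size)
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = True
    return grid


# Two points stacked at different heights land in the same cell.
example_points = [(0.1, 0.1, 0.5), (0.1, 0.1, 1.7), (1.6, 0.4, 0.2)]
example_grid = occupancy_grid(example_points, cell_size=0.5, width=4, height=2)
```

Two points at different heights over the same floor location occupy a single cell, which is exactly the information a full 3D model would keep and a 2D grid discards.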

SLAM is the typical algorithmic technique used to generate maps when exploring unknown environments. It relies explicitly on receiving ego-motion updates from an odometry component (likely visual odometry, as discussed in Chapter 5) and then uses this information, along with the information in the sensor snapshots, to incrementally build up a model of the world. Two key aspects of SLAM are the iterative nature of the task and the ability to correct for ego-motion errors by recognizing previously visited places through the process of loop closing. Loop closing creates a connection, or constraint, between places, and allows drift in the model to be corrected through graph optimization.

6.2.1 Ego-motion and Localization

Ego-motion and localization are related but separate concepts. Ego-motion simply refers to a description (or computation, in the case of ego-motion estimation) of how a sensor or agent is moving, while localization refers to the task of determining a pose given a map and sensor data. Ego-motion, in fact, can be used to help determine and maintain the localization within the map, since knowing the local motion can be useful to compare the sensed observations with the expected change in the observations of the map features based on the estimated location.

Figure 6.1: Our SLAM system diagram. While some SLAM systems can accept different kinds of sensor input, our algorithm focuses on RGB-D frames, that is, a pair of RGB and depth images.

In the previous chapter, we described an ego-motion algorithm that computed the incremental pose between two sensor frames. In this chapter, we describe how we can use this technique in concert with building maps to perform both ego-motion estimation and localization within the map we are building (implementing a version of a SLAM algorithm).
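To make the relationship between incremental ego-motion and a pose within the map concrete, the following minimal sketch chains incremental motions into global poses. It is deliberately simplified to planar poses (x, y, heading), whereas the system described here estimates full 3D poses; the function names are hypothetical.

```python
import math


def compose(pose, delta):
    """Apply an incremental motion, expressed in the sensor frame, to a global pose.

    pose  : (x, y, theta) of the sensor in the map frame.
    delta : (dx, dy, dtheta) measured between two consecutive frames.
    """
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        theta + dtheta,
    )


def integrate(deltas, start=(0.0, 0.0, 0.0)):
    """Chain incremental ego-motion estimates into a trajectory of global poses."""
    poses = [start]
    for delta in deltas:
        poses.append(compose(poses[-1], delta))
    return poses


# Driving one unit forward and turning left 90 degrees, four times, traces a
# square and ends at the starting position.
square = integrate([(1.0, 0.0, math.pi / 2)] * 4)
```

Because each pose is built on the previous one, any error in a single increment is carried into every later pose. This accumulated drift is what localization against the map, and later loop closing, is meant to correct.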

6.2.2 SLAM Architecture

Our approach follows the typical structure of modern SLAM algorithms: the front end component computes the ego-motion as frames enter the system, and integrates new depth maps for each frame into the currently active portion of the map. The back end does loop detection, and optimizes the pose graph upon loop closures.

Figure 6.1 illustrates the overall architecture of our SLAM system. The following subsections will describe the function of each block in the diagram.

Front end

The front end contains many components we described in Chapter 5. RGB-D frames are submitted to the algorithm as they are received from the sensor, and first undergo image processing (see section 5.6.2) to prepare them for the alignment step (see section 5.6). As mentioned in section 5.6.1, incoming frames are aligned not necessarily with the most recent frame, but with keyframes instead. This tends to improve the accuracy of the ego-motion estimation, and directly contributes to the maintenance of the active model using integration, described in a following section. First, however, we describe our modeling element, the surface element.

Surface Elements (Surfels)

A surface element (shortened to surfel, like pixel for picture element) is a generalization of a (usually) small region of a surface, represented as the set $s = \{p, n_p, r, c, v_h, \kappa, C\}$, where $p$ is the point indicating the centroid of the region, $n_p$ is the normal of the surface at the point $p$, $r$ is the radius of the region disk, $c$ is the color of the surface, $v_h$ is the histogram of viewpoints the surfel has been observed from, $\kappa$ is the curvature of the surface at $p$, and $C$ is the confidence in the existence of the surfel.
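One possible in-memory layout for such a surfel is sketched below. The field names, the types, and the eight-bin viewpoint histogram are illustrative assumptions; the text does not specify how the attributes are stored, and a GPU implementation would more likely use packed arrays than per-surfel objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Surfel:
    position: Vec3               # p: centroid of the surface region
    normal: Vec3                 # n_p: surface normal at p
    radius: float                # r: radius of the region disk
    color: Tuple[int, int, int]  # c: surface color (RGB assumed)
    curvature: float             # kappa: surface curvature at p
    confidence: float = 1.0      # C: confidence that the surfel exists
    # v_h: histogram of viewpoints; the bin count of 8 is an assumption.
    viewpoint_hist: List[int] = field(default_factory=lambda: [0] * 8)


example_surfel = Surfel(
    position=(0.0, 0.0, 1.0),
    normal=(0.0, 0.0, -1.0),
    radius=0.01,
    color=(255, 0, 0),
    curvature=0.0,
)
```

Using `default_factory` gives every surfel its own histogram list, so updating the viewpoint counts of one surfel does not affect another.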

Why are surfels interesting for high resolution 3D modeling? A surfel is a surface sample (and as such, has area), which is semantically more pleasing than a point, and better represents the imaging process of the camera. In addition, surfels allow for generating surfaces at varying resolutions, depending on how far the camera is from the observed surface. Surfel clouds, like point clouds, do not require volumetric storage representing all the empty space, as opposed to popular dense representations such as TSDF volumes (see section 6.1.1).

On the other hand, surfels do take up more space and, like point clouds, do not have a direct way to represent neighborhood connectivity when represented in a typical unorganized cloud.

Integration

Integration is the process of using the ego-motion estimate to adapt the incoming point clouds (derived from the depth image) into the existing active environment model. We attempt to reduce error accumulation in the model by continuously integrating incoming frames into a single model composed of surfels, and subsequently use that model for the ICP-based motion estimation in the following frames through the extraction of a model-based keyframe. The model is created and updated on the GPU, to maximize throughput when visiting individual surfels.

The algorithm is initialized with the first frame from the camera set as the initial model. Subsequent query frames are aligned using the method described in the previous section, where the GPU-based joint DVO/ICP algorithm aligns every visible point-normal pair $(p, n_p)$ derived from the surfels in the model to the query frame using the GPU’s bilinear interpolation texture hardware. Once an estimate is generated, the query frame is integrated into the model.

Surfel integration. Integrating the query frame into the model is performed in two steps: update and addition. Update is performed first, followed by the addition of points in the query frame that were not processed during the update procedure. This surfel integration procedure was adapted from the work of Weise et al. [211].

Update: Given an estimated transform $T$ that maps the model frame to the query frame, we can iterate over every visible surfel $s_i \in M$ and project it onto the query image plane, $s'_i = T s_i$. Since the camera is moving and previous observations may be noisy, a query point may be in front of, behind, or unobserved with respect to a projected surfel. Part of the decision to update is made based on the distance between the model surfel and the query point, as well as the difference in their normals. The following conditions determine the outcome. Let $d = z(s'_i) - z(p)$ and $\alpha = n_{s'_i} \cdot n_p$, where $p$ here denotes the query point, $z(\cdot)$ its depth, and $\alpha$ the dot product of the surfel and query-point normals. We also have the parameters $D_{max}$, representing the maximum surface offset allowed, and $A$, the minimum dot product between normals.

$|d| < D_{max}$: The query point is within the maximum offset, and therefore the query point will be used to update the surfel values.

$d > D_{max}$: The query point is in front of the model surface. This can be due to an extreme outlier, or the motion of the camera bringing another surface into view that is truly obscuring the view of the existing surfel. In this case, we do not update the surfel, and instead mark this query point to be added independently to the model in the next step.

$d < -D_{max}$: The query point is behind the model surface. Since we are unsure whether the query point is an outlier or the model point is an outlier, we let our confidence in the surfel determine which action to take. If the surfel has been seen and updated multiple times (i.e., in other frames it has been seen at roughly the same location), then we assume the query point is an outlier and do not update the surfel. On the other hand, if the surfel has been viewed fewer than the minimum number of times (currently set to 3), we replace the existing surfel by generating a new surfel based on the query point.
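The three depth cases above can be sketched as a small decision function. This is an illustration only: the names are hypothetical, the normal test against the minimum dot product is left out because the text does not spell out how it combines with the depth test, and the handling of the exact boundary where the offset equals the maximum is an assumption.

```python
from enum import Enum


class Action(Enum):
    UPDATE = "update surfel with query point"
    ADD_NEW = "keep surfel, mark query point for addition"
    KEEP = "keep surfel, ignore query point"
    REPLACE = "replace surfel with one built from the query point"


def classify(z_surfel, z_query, d_max, times_seen, min_seen=3):
    """Decide what to do with a projected surfel and its matching query point.

    z_surfel   : depth of the projected surfel in the query frame.
    z_query    : depth of the query point at the same pixel.
    d_max      : maximum surface offset allowed.
    times_seen : how often the surfel has been observed so far.
    min_seen   : minimum observation count (3 in the text).
    """
    d = z_surfel - z_query
    if abs(d) < d_max:
        return Action.UPDATE
    if d > d_max:
        # Query point lies in front of the model surface.
        return Action.ADD_NEW
    if d < -d_max:
        # Query point lies behind the model surface: trust whichever
        # side has more supporting evidence.
        return Action.REPLACE if times_seen < min_seen else Action.KEEP
    # Exactly on the boundary: not covered by the text; keeping the
    # surfel unchanged is an assumption.
    return Action.KEEP
```

For example, with a maximum offset of 0.05, a surfel at depth 2.0 and a query point at depth 2.5 yields `Action.KEEP` if the surfel has been seen five times, and `Action.REPLACE` if it has been seen only once.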

Addition: For every marked query point, the second step simply adds a surfel derived from the point to the model. The point, normal and curvature are passed through unchanged (the latter two estimated by the PCL routine). The viewpoint is stored as the dot product of the normalized eye vector and point normal. The radius is estimated with the following equation:

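$$\mathrm{radius}(d, f, z) = \frac{1}{\sqrt{2}} \, \frac{d / f}{z}, \qquad (6.1)$$

where $d$ is the depth of the point, $f$ is the focal length of the sensor, and $z = n \cdot e$ is the dot product of the point normal $n$ and the normalized eye vector $e$, so that the radius is larger the farther the viewpoint is from the normal direction.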
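A minimal sketch of the radius computation follows, reading Equation 6.1 as the depth-to-focal-length ratio scaled by one over the square root of two and divided by the viewpoint term. The function names are hypothetical, and the eye vector is assumed to be oriented so that its dot product with the normal is positive for visible surfaces.

```python
import math


def view_cosine(normal, eye):
    """Dot product of the point normal and the eye vector, both normalized."""
    n_len = math.sqrt(sum(c * c for c in normal))
    e_len = math.sqrt(sum(c * c for c in eye))
    return sum(n * e for n, e in zip(normal, eye)) / (n_len * e_len)


def surfel_radius(depth, focal_length, z):
    """Radius of a new surfel, following our reading of Equation 6.1.

    depth        : depth of the point (same unit as the returned radius).
    focal_length : focal length of the sensor, in pixels.
    z            : viewpoint term from view_cosine(); must be positive.
    """
    if z <= 0.0:
        raise ValueError("surface is edge-on or facing away from the camera")
    return (1.0 / math.sqrt(2.0)) * (depth / focal_length) / z


head_on = surfel_radius(2.0, 500.0, view_cosine((0.0, 0.0, 1.0), (0.0, 0.0, 2.0)))
oblique = surfel_radius(2.0, 500.0, 0.5)
```

With the same depth and focal length, a surface seen at a viewpoint term of 0.5 receives twice the radius of one seen head-on, which matches the statement that the radius grows as the viewpoint moves away from the normal direction.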

Model Maintenance. GPU memory is often more constrained than CPU memory, and over time the surfel model can grow larger than can be held in GPU RAM. This necessitates a swapping procedure, where a portion of the model is swapped out to be held in CPU RAM. As part of the model generation, we maintain a list of keyframe locations (determined by the magnitude of the camera motion) and associate the keyframe locations with the surfels seen at each keyframe.

We currently hold a fixed number of keyframes in memory, $K_{mem}$. At every keyframe trigger, the surfels that haven’t been seen since keyframe $K - K_{mem}$ are swapped out.
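The swapping rule can be sketched with plain dictionaries standing in for the GPU and CPU surfel stores. Everything here is illustrative: the function name, the mapping from surfel id to last-seen keyframe index, and the reading of "not seen since keyframe K minus K_mem" as a last-seen index at or before that keyframe are assumptions.

```python
def swap_out_stale(gpu_surfels, cpu_surfels, current_keyframe, k_mem):
    """Move surfels that have not been seen recently from the GPU store to the CPU store.

    gpu_surfels, cpu_surfels : dicts mapping surfel id -> index of the
                               keyframe at which the surfel was last seen.
    current_keyframe         : index K of the keyframe that was just triggered.
    k_mem                    : number of keyframes held in GPU memory.
    Returns the ids that were swapped out.
    """
    cutoff = current_keyframe - k_mem
    stale = [sid for sid, last_seen in gpu_surfels.items() if last_seen <= cutoff]
    for sid in stale:
        cpu_surfels[sid] = gpu_surfels.pop(sid)
    return stale


gpu_store = {1: 0, 2: 3, 3: 5}
cpu_store = {}
swapped = swap_out_stale(gpu_store, cpu_store, current_keyframe=5, k_mem=2)
```

In this example the cutoff is keyframe 3, so surfels 1 and 2 move to the CPU store while surfel 3, seen at the current keyframe, stays on the GPU.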

Back end

For the back end, we made use of the real-time appearance-based mapping approach of Labbe et al. [120] for detecting loop closures. They utilize hierarchical memory for storing place descriptors based on a bag-of-words approach for describing places. Features (i.e., words) are dynamically added to the vocabulary when they are sufficiently far away from existing words.
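The idea of growing a vocabulary dynamically can be illustrated with the toy sketch below. It is not the implementation of Labbe et al.: the Euclidean distance, the linear search over existing words, and the function name are all assumptions made for illustration.

```python
import math


def add_if_novel(vocabulary, descriptor, min_distance):
    """Add a feature descriptor as a new word unless a similar word already exists.

    vocabulary   : list of descriptors (tuples of floats), modified in place.
    descriptor   : candidate feature descriptor.
    min_distance : a candidate closer than this to any existing word is
                   treated as an instance of that word and not added.
    Returns True if the descriptor was added as a new word.
    """
    for word in vocabulary:
        if math.dist(word, descriptor) < min_distance:
            return False
    vocabulary.append(tuple(descriptor))
    return True


words = []
first = add_if_novel(words, (0.0, 0.0), min_distance=0.5)
near = add_if_novel(words, (0.1, 0.1), min_distance=0.5)
far = add_if_novel(words, (2.0, 0.0), min_distance=0.5)
```

Here the second descriptor is close to the first and is not added, while the third is far from every existing word and becomes a new vocabulary entry.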

When the loop detection module discovers a loop, a new constraint is added to the graph. This new constraint triggers a graph optimization over the existing graph to bring the keyframes into alignment, and deforms both the camera poses and select control points embedded in the surfel cloud, based on the approach from Weise et al. [211]. We use the graph optimization framework GTSAM [105] in this work, as it comes with many features amenable to implementing SLAM optimizations.
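The effect of adding a loop constraint and re-optimizing can be seen in a one-dimensional toy problem. The sketch below does not use GTSAM and is not the optimization used in this work: it treats poses as scalars, weights all constraints equally, and solves the resulting least-squares problem with simple Gauss-Seidel sweeps. All names are hypothetical.

```python
def optimize_1d(n_poses, constraints, iterations=200):
    """Least-squares estimate of scalar poses from relative constraints.

    constraints : list of (i, j, measured), each stating x[j] - x[i] = measured.
    Pose 0 is held fixed at 0 to anchor the solution.
    """
    x = [0.0] * n_poses
    for _ in range(iterations):
        for k in range(1, n_poses):
            # Each incident constraint predicts a value for x[k]; with
            # equal weights, the least-squares update is their mean.
            predictions = []
            for i, j, measured in constraints:
                if j == k:
                    predictions.append(x[i] + measured)
                elif i == k:
                    predictions.append(x[j] - measured)
            if predictions:
                x[k] = sum(predictions) / len(predictions)
    return x


# Odometry overestimates each unit step as 1.1; a loop constraint states
# that pose 3 is 3.0 away from pose 0.
odometry = [(0, 1, 1.1), (1, 2, 1.1), (2, 3, 1.1)]
drifted = optimize_1d(4, odometry)
corrected = optimize_1d(4, odometry + [(0, 3, 3.0)])
```

Without the loop constraint, the final pose sits at 3.3. With it, the 0.3 disagreement is spread evenly over the four constraints, pulling the final pose to 3.075 and shifting the intermediate poses along with it, which is the one-dimensional analogue of bringing keyframes into alignment after a loop closure.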
