Summary - Visual Perception For Robotic Spatial Understanding

We have presented an end-to-end system for calibrating a set of minimally constrained multi-modal sensors rigidly mounted to a robot. Although many of the individual concepts have been presented in previous works, as far as we are aware this is the first algorithm to bring them together into a cohesive graph optimization framework paired with a simplified data association process.

We have many plans to improve the capability of the MSG-Cal system as presented. In particular, some mobile robots may have sensors on dynamic joints (e.g., mounted on pan/tilt units) or wish to ensure a manipulator end effector is calibrated with one or more sensors. Therefore, we are currently in the process of adding dynamic

global error pairwise error 0.002 0.004 0.006 0.008 0.010 0.012 0.014 0.016 0.018 Average error (m)

Simulation: Pose translation benchmark error

(a)

global error pairwise error 0.000 0.002 0.004 0.006 0.008 0.010 0.012

Average error (radians)

Simulation: Pose rotation benchmark error

(b)

global error pairwise error

0.00345 0.00350 0.00355 0.00360 0.00365 0.00370 0.00375 0.00380 0.00385 RMS error (m)

Simulation global error vs pairwise error

(c)

global error pairwise error

0.0064 0.0066 0.0068 0.0070 0.0072 0.0074 0.0076 RMS error (m)

Husky global error vs pairwise error

(d) trials 0.0034 0.0035 0.0036 0.0037 0.0038 0.0039 0.0040 0.0041 0.0042 RMS error (m)

Simulation outlier filter vs no filter SAC filter No filter (e) trials 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 RMS error (m)

Husky outlier filter vs no filter

SAC filter No filter

(f )

Figure 4.5 (a) Simulation mean translation benchmark error (poses computed against ground truth). (b) Simulation mean rotational benchmark error (poses computed against ground truth). (c) Simulation graph error comparing the best pairwise error result vs. global error. (d) Husky graph error comparing the best pairwise error result vs. global error. (e) Simulation global graph error comparing results with and without filtering the data using the sampling consensus. (f ) Husky global graph error comparing results with and without filtering the data using the sampling consensus.

(a) Indoor, open area scan.

(b) Indoor, cluttered area scan.

Figure 4.6 Fused sensor output using the global calibration results for the Husky sensor data collection. The Ladybug images are back-projected to the point cloud, given the relative transform between the two sensors. Also note the blue, purple, and cyan points that show the transformed 2D laser scan output. In particular, (b) shows how well the laser scans align to the camera and 3D LRF, e.g. by their intersection on the handcart on the left side of the image.

joints to the graph and modifying the data capture process accordingly. Additionally the current system is based completely on visual sensors; while certain processes may have to change to support proprioceptive sensors such as IMUs, adding this func- tionality to the system would be worth some of the compromises for the robots that would benefit from it.

While we’ve already begun to evaluate the algorithm on a wider array of data sets, we also plan to explore improvements to the camera plane estimation algorithm and provide specific improvements to camera-camera pair calibrations.

Finally, my colleagues have begun the necessary work to make the code available as an Open Source distribution for ROS so others can benefit and improve upon the work.

Chapter 5 Ego-motion Estimation Using

Vision

5.1 Introduction

A robot must know where it is and how it is moving for a variety of tasks, including localization with respect to a known map, building a new map while localizing, and determining its relationship to an object for manipulation e.g., opening a door, picking up a tool, and shoving an object out of the way. The task of estimating the motion of the robot’s body is called ego-motion estimation.

There are a variety of ways for a robot to estimate pose. First, a wheeled robot may use knowledge of how the wheels turn to estimate how it is moving. This is called dead reckoning, or odometry (odos - way, metron - measurement). This is a very useful way to estimate motion, especially if the wheels have little slip and the surfaces the robot drives on are not very complex. However, some robots may slip on different surfaces, bump around, or use techniques for turning like skid steer that cause motion estimation based on measuring wheel motion to fail or, at the minimum, be inaccurate due to slippage and other un-sensed interactions with the ground (by accumulating enough non-recoverable error the estimate pose no longer represents the actual pose).

changes with respect to the environment, assuming large parts of the environment do not themselves change while trying to take the measurement. Using vision is one way to do this. Cameras are a relatively low cost sensor and they provide a lot of information about the environment that the robot may use to estimate motion. The basic operation of visual odometry (the approach of computing odometry using visual sensors) is simple to explain but is considerably more complicated to implement.

Thinking generally, how could we reliably determine ego-motion? Imagine we place a human or robot in the “construct,” the infinitely white artificial loading area for the characters in the movie The Matrix. How would they recognize their motion? Since there are no features or structures in any way (maybe a floor, maybe not), there

is no way to measure their motion with respect to the environment, because it doesn’t

really exist. While humans and robots can detect acceleration using their inertial measurement units (i.e. accelerometers and gyroscopes for a robot and inner ear for a human), without some kind of observable features in the environment, something to “measure” distances by, there would be no motion.

Thus, computing the motion of a camera simply from the data it provides is not a new concept. It has been called many things, often depending on the approach taken, but the general term of visual odometry is apt here since we wish to measure the motion of the robot system (which includes the camera) using the data produced by the camera. In this dissertation, we use an RGB-D camera, because it has an additional sensor and emitter that allow it to sense both color (red, green, blue: RGB) and depth (D). The way it computes depth is through a process called stereo disparity which computes the difference in location of an observed feature seen from two different, but known, viewpoints. See section 3.3.2 for a slightly more in-depth discussion.

When the color and IR camera are extrinsically calibrated with respect to each

other (which we include in the intrinsic calibration of the RGB-D sensor system),

it is possible to map a color to every depth pixel, and back-project the point into the world to generate a point cloud for every frame seen by the camera. Sensors such as the Microsoft Kinect and Asus Xtion Pro Live provide this capability over

a USB connection, and can be operated using open source software from OpenNI (conveniently integrated into ROS).

Point cloud generation is an ability most cameras do not possess, but in this case it affords us with a straight-forward frame to frame alignment methodology, shown in high-level pseudocode in Algorithm 1 [44].

Algorithm 1 Feature-based rigid motion estimation

1: compute features in query and model frame

2: compute correspondences between the features

3: Pq = point cloud of corresponding feature points in the query frame

4: Pm = point cloud of corresponding feature points in the model frame

5: cq =centroid(Pq) 6: cm =centroid(Pm) 7: C =P q∈Pq,m∈Pm(q−cq)(m−cm) T 8: (U, s, VT_{) =}_SVD₍_C₎ 9: S = [1,1,det(V UT_)]T 10: R=Vdiag(S)UT 11: t=cq−Rcm

This algorithm utilizes image feature correspondences between frames to compute a rigid motion estimate that transforms the query cloud to the model frame. One aspect to note about this formulation is the requirement for image features: distinctive locations in the image that are robust to small changes in viewpoint and therefore can be reliably detected from a different viewpoint. The best features are expensive to compute yet still rely on having “interesting” image content. This can be problematic when the image has repetitive textures, mildly textured regions, or high framerate processing is required (e.g. for robot motion estimation).

The purpose of the descriptor is to provide a hopefully unique description of that point within the image; usually this requires looking at a small area surrounding that particular point in the image in order to derive enough context to provide a usable and comparable description. The computation of correspondences, line 2 in Algorithm 1, will not work well without eliminating outliers by embedding it in a robust estimation framework like RANSAC. This implies evaluating many iterations of the algorithm with random corresponding samples and choosing the correspondence set that maximizes the number of inliers.

While this algorithm utilizes both visual and structural features in turn, if enough features are not found to generate sufficient correspondence between successive frames, then the algorithm fails and loses tracking. For a purely visual system, this is unavoid- able in some cases. For example, attempting to track motion against a blank wall with no discernable features (often found indoors) will cause almost every algorithm to fail; only systems that can rely on proprioceptive information like an IMU can continue estimation during these failures. However, if we consider an algorithm that uses both structural and visual information, we can handle many, if not all, situations that would cause a feature-based algorithm trouble.

Following Whelan et al. [213], our approach merges two techniques that combine structural alignment with photometric alignment in order to estimate the small camera motion between frames. We combine an iterative closest point algorithm using the point cloud data with a dense visual odometry algorithm that uses both the point cloud and image intensities.

Each algorithm is independently executed on the frame pair to generate inter- mediate constraints which are then combined to form a single set of weighted nor- mal equations. We exploit the massively parallel SIMT (single instruction, multiple threads) capability of the GPU to compute both the ICP and DVO steps. The output

of this portion of the algorithm is an estimate of the camera motion between frame t

and t+ 1.

In document Visual Perception For Robotic Spatial Understanding (Page 99-106)