The incremental motion is estimated by matching sequential video frames and computing the motion between them. This is referred to as visual odometry (VO) [79].
There have been many implementations of VO in recent years. We use a publicly available implementation called Fast Odometry from VISion (FOVIS) [44]. It supports both RGB-D and stereo cameras. In the remainder of the section we describe this particular algorithm and how we integrated it into the overall visual SLAM system.
The main steps for a VO algorithm are:
1. Detect keypoints
2. Extract features
3. Match features to previous frame
4. Compute motion between frames
The following describes how FOVIS implements these steps. With a stereo camera, the input to the motion estimation algorithm is the pair of grayscale frames from the left and right cameras; with an RGB-D camera, the input is a grayscale image and a depth image. The output of the algorithm is the motion estimate from the previous keyframe to the frame passed in. As a side effect, the keyframe may have been changed to the most recent frame passed in. In addition, a covariance estimate and the detected features can be retrieved.
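The keyframe behaviour described above can be illustrated with a minimal sketch. It assumes the two inlier thresholds listed in Table 4.1 (15 and 100); the names `VOStatus` and `classify_frame` are hypothetical and are not part of the FOVIS API.

```python
from enum import Enum


class VOStatus(Enum):
    FAILED = "failed"                    # too few inliers, no motion estimate
    OK = "ok"                            # estimate valid, keyframe kept
    OK_NEW_KEYFRAME = "ok_new_keyframe"  # estimate valid, keyframe replaced


def classify_frame(num_inliers, min_inliers_motion=15, min_inliers_keyframe=100):
    """Decide what to do with a frame from its inlier count.

    Thresholds default to the typical values from Table 4.1; the function
    itself is an illustrative assumption, not library code.
    """
    if num_inliers < min_inliers_motion:
        return VOStatus.FAILED
    if num_inliers < min_inliers_keyframe:
        return VOStatus.OK_NEW_KEYFRAME
    return VOStatus.OK
```

For example, `classify_frame(50)` reports a valid estimate together with a keyframe change, because 50 lies between the two thresholds.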
Parameter | Description | Typical value
FAST threshold | Determines which keypoints are used | 10
# inliers for keyframe change | A keyframe is changed if the number of inliers drops below this threshold | 100
# inliers for motion estimate | If the number of inliers is below this threshold, the motion estimate is not computed and the VO is assumed to have failed | 15
Minimum reprojection error | After the optimization, all points within this threshold are classified as inliers | 1.0
Grid size | Used to ensure an even distribution of features across the image. A fixed number of features with the highest FAST score are selected from each grid cell | 80x80
Clique inlier threshold | Maximum difference in the distance between a pair of features, measured in each of the two frames, for the pair to be considered valid | 0.15
Table 4.1: Parameters for the visual odometry module.
Parameters
There are many parameters that affect the performance of the visual odometry. A description of the parameters and typical settings is provided in Table 4.1.
Additionally, there are optional features to choose from: the grid can be enabled or disabled, the FAST threshold can be fixed or adaptive, and sub-pixel refinement can be enabled. The parameters given in Table 4.1 are those that worked well for our scenario. We used a fixed FAST threshold rather than the supported adaptive threshold because in some cases the adaptive threshold would drop too low, allowing many spurious features to be selected and decreasing the overall accuracy. Imposing a lower bound on the adaptive threshold might mitigate that problem.
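The grid option can be sketched as follows. This is a minimal illustration, assuming that the grid size of 80x80 refers to the cell size in pixels and that keypoints are given as `(x, y, score)` tuples; the function name `bucket_keypoints` and the parameter `max_per_cell` are hypothetical.

```python
def bucket_keypoints(keypoints, cell_size=80, max_per_cell=2):
    """Keep only the strongest keypoints in each grid cell.

    keypoints: iterable of (x, y, score) tuples, where score is the
    detector response. cell_size is assumed to be in pixels.
    """
    cells = {}
    for x, y, score in keypoints:
        key = (int(x) // cell_size, int(y) // cell_size)
        cells.setdefault(key, []).append((x, y, score))

    kept = []
    for key in sorted(cells):
        # Highest score first, then truncate to the per-cell budget.
        strongest = sorted(cells[key], key=lambda kp: kp[2], reverse=True)
        kept.extend(strongest[:max_per_cell])
    return kept
```

Capping the number of keypoints per cell prevents a single highly textured region from dominating the feature set.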
Detect Keypoints
First a Gaussian pyramid is constructed from the input image. A FAST [90] detector is used to find keypoints on each level of the image pyramid. The FAST detector examines a circle of pixels around the pixel being tested and looks for a contiguous segment whose intensities are all brighter or all darker than the tested pixel by more than a given threshold. If the length of this segment is at least 9 pixels, the pixel is classified as a keypoint. For increased stability, non-maximal suppression is also performed in the neighborhood of each detected keypoint. This algorithm is extremely fast but is sensitive to noise in the image.
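The segment test can be sketched in a few lines. This is a simplified illustration, assuming the common FAST-9 convention (a 16-pixel circle of radius 3 and a run of at least 9 contiguous pixels) and an image stored as a list of rows; it omits the pyramid, the corner score, and non-maximal suppression, and the names are hypothetical.

```python
# Offsets (dx, dy) of the 16-pixel circle of radius 3, in circular order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def longest_circular_run(flags):
    """Length of the longest run of True values, allowing wrap-around."""
    n = len(flags)
    if all(flags):
        return n
    best = run = 0
    for flag in flags + flags:  # doubling the list handles wrap-around
        run = run + 1 if flag else 0
        best = max(best, run)
    return min(best, n)


def is_fast_corner(img, x, y, threshold=10, min_run=9):
    """Segment test at pixel (x, y); the caller keeps (x, y) 3 px from the border."""
    center = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [p > center + threshold for p in ring]
    darker = [p < center - threshold for p in ring]
    return max(longest_circular_run(brighter),
               longest_circular_run(darker)) >= min_run
```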
Feature Extraction and Matching
After the keypoints have been detected, each one is matched to the most similar keypoint in its neighborhood in the other frame; the neighborhood is defined as all keypoints within a fixed radius in the image. The image patch around the keypoint is used as the descriptor, and the similarity is determined by the sum of absolute differences (SAD). A match is accepted only if the match from reference frame to target frame and the match from target frame to reference frame are mutually consistent. After the match, a sub-pixel refinement is performed on the target keypoint by aligning the gray image patches. The patches are aligned by minimizing the sum of squared errors between them.
An example of frame-to-frame feature tracking is given in Figure 4-2(a).
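The mutual-consistency check can be sketched as below. This is a minimal illustration, assuming each keypoint is a tuple `(x, y, patch)` with the patch flattened to a list of intensities; the search radius of 20 pixels and all function names are assumptions, and sub-pixel refinement is omitted.

```python
def sad(patch_a, patch_b):
    """Sum of absolute differences between two equally sized patches."""
    return sum(abs(p - q) for p, q in zip(patch_a, patch_b))


def best_match(query, candidates, radius):
    """Index of the candidate within `radius` with the lowest SAD, or None."""
    qx, qy, qpatch = query
    best_idx, best_cost = None, None
    for i, (cx, cy, cpatch) in enumerate(candidates):
        if (cx - qx) ** 2 + (cy - qy) ** 2 > radius ** 2:
            continue  # outside the search neighborhood
        cost = sad(qpatch, cpatch)
        if best_cost is None or cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx


def mutual_matches(ref, tgt, radius=20.0):
    """Pairs (i, j) where ref[i] and tgt[j] choose each other as best match."""
    forward = [best_match(k, tgt, radius) for k in ref]
    backward = [best_match(k, ref, radius) for k in tgt]
    return [(i, j) for i, j in enumerate(forward)
            if j is not None and backward[j] == i]
```

A target keypoint that prefers a reference keypoint which does not prefer it back is left unmatched, which removes many ambiguous correspondences at little cost.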
Compute Motion Between Frames
The cameras supported by FOVIS are either RGB-D or stereo cameras, and the only keypoints considered are those that have depth. Having the depth information is particularly useful both for determining inliers and for initializing the motion. To determine the inliers, the distance between a pair of keypoints in the reference frame is compared to the distance between the corresponding keypoints in the target frame. The transformation being sought is a rigid body transformation, which preserves distances. Thus, if the two distances differ by less than a predetermined threshold, the pair of correspondences is marked as consistent. This forms a graph whose nodes are keypoint correspondences and whose edges connect consistent pairs. A maximal clique is then determined in the graph. This clique forms the inlier set.
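The inlier selection can be sketched as follows. The clique is grown with a simple greedy heuristic, which yields a maximal (not necessarily maximum) clique; this heuristic is an assumption for illustration, as is the use of 3D points given as coordinate tuples. The default threshold is the typical value from Table 4.1, and the function names are hypothetical.

```python
import math
from itertools import combinations


def consistency_graph(ref_pts, tgt_pts, threshold=0.15):
    """Adjacency sets: i and j are linked if their mutual distance is preserved."""
    n = len(ref_pts)
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d_ref = math.dist(ref_pts[i], ref_pts[j])
        d_tgt = math.dist(tgt_pts[i], tgt_pts[j])
        if abs(d_ref - d_tgt) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def greedy_clique(adj):
    """Grow a clique from the best-connected node (greedy heuristic)."""
    if not adj:
        return []
    start = max(range(len(adj)), key=lambda v: len(adj[v]))
    clique = [start]
    candidates = set(adj[start])  # nodes adjacent to every clique member
    while candidates:
        v = max(sorted(candidates), key=lambda u: len(adj[u]))
        clique.append(v)
        candidates &= adj[v]
    return sorted(clique)
```

Because only pairwise distances are compared, this step needs no motion estimate at all, which is what makes it suitable for rejecting outliers before the motion is initialized.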
Given a collection of 3D points and consistent correspondences, a direct method such as Horn’s absolute orientation algorithm can be used to compute the initial motion, as is done in this work.
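For illustration, the closed-form quaternion solution of the absolute orientation problem can be sketched with NumPy. The sketch assumes matched, noise-free or mildly noisy 3D points with no scale change, and finds R and t such that each target point is approximately R times the reference point plus t; the function name is hypothetical and this is not the FOVIS implementation.

```python
import numpy as np


def horn_absolute_orientation(ref_pts, tgt_pts):
    """Return (R, t) with tgt ~= R @ ref + t, via the quaternion method."""
    ref = np.asarray(ref_pts, dtype=float)
    tgt = np.asarray(tgt_pts, dtype=float)
    ref_mean, tgt_mean = ref.mean(axis=0), tgt.mean(axis=0)

    # Cross-covariance of the centered point sets: S[a, b] = sum ref_a * tgt_b.
    S = (ref - ref_mean).T @ (tgt - tgt_mean)
    Sxx, Sxy, Sxz = S[0]
    Syx, Syy, Syz = S[1]
    Szx, Szy, Szz = S[2]

    # Symmetric 4x4 matrix whose top eigenvector is the optimal quaternion.
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz],
    ])
    _, vecs = np.linalg.eigh(N)   # eigenvalues in ascending order
    w, x, y, z = vecs[:, -1]      # unit quaternion (w, x, y, z)

    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    t = tgt_mean - R @ ref_mean
    return R, t
```

The solution is direct, requiring only a 4x4 eigendecomposition, so it provides a cheap starting point for the subsequent nonlinear refinement.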
This initial estimate is further refined by minimizing the bi-directional reprojection error. This is done by solving a nonlinear least squares problem using Levenberg-Marquardt (LM). After the optimization, the reprojection error is computed for each point. If the reprojection error exceeds a given threshold, the point is removed from the inlier set. After the outlier points have been removed, the optimization is run once more; this gives the final estimate. The covariance is reported as the approximated Hessian J^T J, where J is the Jacobian of the cost function.
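The error computation and outlier removal between the two optimization passes can be sketched as below. The sketch assumes a pinhole camera with hypothetical intrinsics `(fx, fy, cx, cy)`, a threshold expressed in pixels, and a per-point error taken as the larger of the two directional errors; these choices and the function names are assumptions, and the LM optimization itself is not shown.

```python
import numpy as np


def project(points, fx, fy, cx, cy):
    """Pinhole projection of an (N, 3) array of camera-frame points to pixels."""
    pts = np.asarray(points, dtype=float)
    return np.column_stack((fx * pts[:, 0] / pts[:, 2] + cx,
                            fy * pts[:, 1] / pts[:, 2] + cy))


def bidirectional_errors(R, t, ref_xyz, tgt_xyz, ref_uv, tgt_uv, intrinsics):
    """Per-point reprojection error, checked in both directions.

    Forward: reference points moved into the target frame vs. target pixels.
    Backward: target points moved into the reference frame vs. reference pixels.
    """
    forward = project(ref_xyz @ R.T + t, *intrinsics) - tgt_uv
    backward = project((tgt_xyz - t) @ R, *intrinsics) - ref_uv
    return np.maximum(np.linalg.norm(forward, axis=1),
                      np.linalg.norm(backward, axis=1))


def filter_inliers(errors, max_error=1.0):
    """Indices of points whose reprojection error is within the threshold."""
    return np.flatnonzero(errors <= max_error)
```

The indices returned by `filter_inliers` would then select the correspondences passed to the second optimization run.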