Extracting salient visual information from local image regions (feature extraction) is regarded as one of the most important procedures for a wide range of image processing and computer vision applications. These applications include camera calibration, image matching and registration, object recognition and classification, structure from motion and camera tracking, place recognition, and many more. Ideally, a feature extraction method would be able to obtain stable and repeatable features from images that are subject to image transformations, such as viewpoint, rotation, lighting variation, and scale changes.
Generally, the main aim of feature extraction techniques is to reduce the amount of resources required to describe an image by sampling it into a subset of points, while still describing the image with sufficient accuracy. However, extracting a sparse set of features can discard valuable information that may be useful for many applications. For instance, a 3D model generated using a feature-based structure from motion (SfM) method consists of a subset of the matched features (the inliers of the matching process between different images), resulting in a sparse-looking model.
For many years, SIFT [33] has been widely regarded by the robotics and computer vision communities as the gold standard for feature extraction and description, due to its distinctiveness and invariance to a variety of image transformations. The number of features extracted by SIFT usually ranges from a few hundred to a few thousand. While this number may be sufficient for many applications, such as visual odometry (VO), which is usually required to perform camera pose estimation in real time, other applications, such as the structure from motion example described above, may benefit from additional features to obtain a denser-looking model and possibly improve the estimation accuracy by using a larger number of inliers.
In this chapter, we propose an image-based feature extraction method that is able to extract a large number of highly repeatable features (ranging from a few thousand to tens or even hundreds of thousands). When paired with robust image descriptors such as SIFT, the proposed features are highly invariant to viewpoint, rotation, blurring, lighting and scale change. Similar to the 3D feature extraction methods presented in Chapters 4 and 5, the proposed method utilizes a robust segmentation method based on rank order statistics (MSSE) to segment the image into uniform regions and regions containing high intensity variations. In the 3D case (extracting features from a point cloud), a single metric scale is used. As such, 3D features are inherently scale invariant. In the 2D case, by contrast, images may be subject to optical zooming and scale change, so the features must be computed at multiple scales to achieve scale invariance.
Another difference between our 2D and 3D feature extraction approaches is that in the 3D case, our aim was to obtain the smallest number of features needed to accurately register two RGB-D frames, whereas in the 2D case, we aim to obtain a large number of high-quality features. This difference stems from the fact that RGB-D sensors provide a dense 3D point cloud, so by accurately aligning the frames using a small sample, one is able to obtain a dense-looking model using the dense 3D information provided by the sensor. In the 2D case, by contrast, 3D points are triangulated using the inliers (correct matches) that result from matching the features between two frames. As a result, a higher number of features is key to obtaining a denser model. Another application that could benefit from a higher number of features is monocular SLAM, which generally consists of two main steps. In the first step, the first two images are matched, their relative pose is estimated (e.g. using a 5-point algorithm), and the inliers are triangulated. The second step involves matching newly arrived images with previous ones using a Perspective-n-Point (PnP) algorithm. PnP methods rely heavily on the availability of 3D information associated with features in the previous frames. For example, if 100 correct matches were obtained from matching a newly arrived image to the previous one, but only 10 of the features from the previous image were associated with 3D information, only 10 matches are used to estimate the transformation between the two images. As such, it is very important to have as much 3D information as possible to accurately estimate the motion between images using PnP methods. Moreover, the quantity of features is important for object recognition tasks, since the ability to detect small objects in noisy backgrounds requires at least 3 features to be correctly matched from each object for reliable identification [33].
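The 100-versus-10 example above can be made concrete with a small sketch. It only illustrates the filtering step that precedes a PnP solve, not the pipeline used in this work; the function name `select_pnp_correspondences`, the dictionary `landmarks_prev`, and the toy data are all hypothetical and introduced purely for illustration.

```python
from typing import Dict, List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def select_pnp_correspondences(
    matches: List[Tuple[int, int]],
    landmarks_prev: Dict[int, Point3D],
    keypoints_new: List[Point2D],
) -> Tuple[List[Point3D], List[Point2D]]:
    """Keep only the matches whose previous-frame feature has a 3D point.

    matches        : (index in previous frame, index in new frame) pairs
    landmarks_prev : previous-frame feature index -> triangulated 3D point
    keypoints_new  : 2D keypoint locations in the new frame
    """
    object_points: List[Point3D] = []
    image_points: List[Point2D] = []
    for prev_idx, new_idx in matches:
        point3d = landmarks_prev.get(prev_idx)
        if point3d is None:
            # A correct 2D-2D match without 3D information cannot be used by PnP.
            continue
        object_points.append(point3d)
        image_points.append(keypoints_new[new_idx])
    return object_points, image_points


# Toy data (assumed): 100 correct matches, but only every tenth
# previous-frame feature has been triangulated.
matches = [(i, i) for i in range(100)]
keypoints_new = [(float(i), float(2 * i)) for i in range(100)]
landmarks_prev = {i: (float(i), 0.0, 5.0) for i in range(0, 100, 10)}

object_points, image_points = select_pnp_correspondences(
    matches, landmarks_prev, keypoints_new
)
print(len(matches), len(object_points))  # 100 10
```

The two returned lists are the 3D-2D correspondences that a PnP solver (for example OpenCV's `cv2.solvePnPRansac`) would consume; the more previous-frame features carry 3D information, the more of the correct matches survive this filter.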
7.2 Related Work
The Harris corner detector [52] is one of the earliest and best-known feature detectors. It defines a corner as a point at which the image intensity varies strongly between adjacent regions in all directions. Mikolajczyk and Schmid [190] extended the Harris corner detector to be scale invariant. Rosten and Drummond proposed an efficient corner detector called FAST [58]. FAST corners are found by comparing the neighboring pixels (a ring of 16 pixels around the center) to the center pixel. A region is classified as uniform, an edge or a corner based on the percentage of neighboring pixels with intensities similar to that of the center pixel. Rublee et al. [63] extended FAST by adding an orientation component to the features. BRISK [191] is another feature detector, which searches for maxima in both the image plane and the scale-space using the FAST scores as a measure of saliency.
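The percentage-based classification described above can be sketched as follows. This is a simplified illustration of that description only, not the FAST implementation of [58], which tests for a contiguous arc of brighter or darker pixels. The function name `classify_pixel` and the thresholds `intensity_tol`, `uniform_frac` and `corner_frac` are assumptions chosen for the example.

```python
from typing import List, Tuple

# Offsets (dx, dy) of the 16 pixels on a discrete circle of radius 3
# around the center pixel.
CIRCLE_16: List[Tuple[int, int]] = [
    (0, -3), (1, -3), (2, -2), (3, -1),
    (3, 0), (3, 1), (2, 2), (1, 3),
    (0, 3), (-1, 3), (-2, 2), (-3, 1),
    (-3, 0), (-3, -1), (-2, -2), (-1, -3),
]


def classify_pixel(
    image: List[List[int]],
    x: int,
    y: int,
    intensity_tol: int = 10,      # assumed similarity tolerance
    uniform_frac: float = 0.75,   # assumed threshold for "uniform"
    corner_frac: float = 0.4,     # assumed threshold below which we call a corner
) -> str:
    """Label pixel (x, y) as 'uniform', 'edge' or 'corner'.

    The label depends on the fraction of the 16 ring pixels whose intensity
    is within intensity_tol of the center pixel. The caller must ensure that
    (x, y) lies at least 3 pixels away from the image border.
    """
    center = image[y][x]
    similar = sum(
        1
        for dx, dy in CIRCLE_16
        if abs(image[y + dy][x + dx] - center) <= intensity_tol
    )
    fraction = similar / len(CIRCLE_16)
    if fraction >= uniform_frac:
        return "uniform"
    if fraction >= corner_frac:
        return "edge"
    return "corner"


# Three 7x7 toy images: flat, a vertical step edge, and a bright quadrant.
flat = [[100] * 7 for _ in range(7)]
edge = [[200 if col >= 3 else 0 for col in range(7)] for _ in range(7)]
corner = [
    [200 if (col >= 3 and row >= 3) else 0 for col in range(7)]
    for row in range(7)
]

for name, img in (("flat", flat), ("edge", edge), ("corner", corner)):
    print(name, classify_pixel(img, 3, 3))
```

On these toy images the center pixel sees 16, 9 and 5 similar ring pixels respectively, so the three cases fall on different sides of the assumed thresholds.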
Lowe [33] proposed SIFT, a method that is widely regarded as one of the most robust feature detectors available because of its invariance to scale, rotation and viewpoint changes, and its partial invariance to illumination changes. SIFT features are computed by analyzing the Difference of Gaussians (DoG) between images at different scales. One of the main downsides of SIFT is that it is computationally expensive. Bay et al. [62] addressed this issue and proposed SURF, a feature detector that is similar to SIFT in that it is invariant to multiple image transformations, but is faster. As opposed to SIFT, which analyzes the DoG, SURF analyzes the determinant of the approximated Hessian matrix in order to find local maxima across all scales.
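For reference, the two detector responses contrasted above can be written compactly. The notation follows the usual formulations of SIFT and SURF rather than anything defined in this chapter, so the symbols should be read as assumptions: $I$ is the input image, $G$ a Gaussian kernel of scale $\sigma$, $L = G * I$ the smoothed image, $k$ the constant factor between neighboring scales, and $\widehat{L}_{xx}$, $\widehat{L}_{yy}$, $\widehat{L}_{xy}$ the box-filter approximations of the second-order Gaussian derivatives. The weight $w \approx 0.9$ is the value commonly quoted for SURF.

```latex
\begin{align}
D(x, y, \sigma) &= \bigl(G(x, y, k\sigma) - G(x, y, \sigma)\bigr) * I(x, y)
                 = L(x, y, k\sigma) - L(x, y, \sigma), \\
\det\bigl(\mathcal{H}_{\text{approx}}\bigr)
                &= \widehat{L}_{xx}\,\widehat{L}_{yy} - \bigl(w\,\widehat{L}_{xy}\bigr)^{2},
                \qquad w \approx 0.9 .
\end{align}
```

In both cases, candidate features are the local extrema of the response over position and scale. The speed advantage of SURF comes largely from evaluating the box filters on an integral image, which makes their cost independent of the filter size.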