Stereo Matching—State-of-the-Art and Research Challenges
7.2 Scene Reconstruction
7.2.2 Sparse Structure-from-Motion Modeling
For sparse 3D model reconstruction, we rely on a Structure from Motion (SfM) approach that is able to reconstruct a scene from unorganized image sets. Our solution
7.2.2.1 Establishing the Epipolar Graph
First, salient features are extracted from each frame. Our method uses the combination of the DoG keypoint detector and the SIFT descriptor [39], which achieves excellent repeatability for wide-baseline image matching [43].
In particular we rely on the publicly available SiftGPU [72] software.
We then match keypoint descriptors between each pair of images. A variety of approaches have been proposed to speed up nearest neighbor matching in high-dimensional spaces (like the 128-dimensional SIFT descriptor space). Among the most promising methods are randomized kd-trees [57] with priority search and hierarchical k-means trees [14]. These algorithms are in general designed to run on a single CPU and are known to provide speedups of about one or two orders of magnitude over linear search, but the speedup comes at the cost of a potential loss in accuracy [44]. On the other hand, given that the number of features is limited to a few thousand, nearest neighbor search implemented as a dense matrix multiplication on recent graphics hardware can achieve an equivalent speedup while delivering the exact solution. Hence, we employ a GPU-accelerated feature matching approach based on the CUBLAS library, with subsequent instructions that apply the distance ratio test and report the established correspondences.
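The idea of exact nearest neighbor search through a single dense matrix product, followed by the distance ratio test, can be sketched on the CPU as follows. This is a minimal NumPy illustration and not the CUBLAS implementation described above; the function name `match_ratio_test` and the ratio threshold of 0.8 are assumptions made for the example, since the text does not state the threshold used.

```python
import numpy as np

def match_ratio_test(desc_a, desc_b, ratio=0.8):
    """Exact nearest-neighbor matching using one dense matrix product.

    desc_a: (n, d) descriptors of the query image.
    desc_b: (m, d) descriptors of the other image, with m >= 2.
    Returns a list of (index_a, index_b) pairs that pass the ratio test.
    The default ratio of 0.8 is an assumed value for illustration.
    """
    a = np.asarray(desc_a, dtype=np.float64)
    b = np.asarray(desc_b, dtype=np.float64)
    # Squared Euclidean distances: |a|^2 + |b|^2 - 2 a.b
    # The a @ b.T term is the dense matrix multiplication.
    sq = (a * a).sum(axis=1)[:, None] + (b * b).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    sq = np.maximum(sq, 0.0)  # clamp tiny negatives caused by round-off
    # Indices of the two closest candidates for every query descriptor.
    order = np.argsort(sq, axis=1)[:, :2]
    rows = np.arange(a.shape[0])
    best = np.sqrt(sq[rows, order[:, 0]])
    second = np.sqrt(sq[rows, order[:, 1]])
    # Keep a match only if the best distance is clearly smaller than the second best.
    keep = best < ratio * second
    return [(int(i), int(order[i, 0])) for i in np.nonzero(keep)[0]]
```

Because every pairwise distance is evaluated, the result is exact; the cost grows with the product of the two feature counts, which is acceptable only because that count is limited to a few thousand per image.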
After matching relevant images to each query view, geometric verification based on the Five-Point algorithm [47] is performed. Since matches that arise from descriptor comparisons are often highly contaminated by outliers, we employ the random sample consensus (RANSAC) [11] algorithm for robust estimation. In its basic implementation, RANSAC acts as a hypothesize-and-verify approach. From a minimal set of samples, several hypotheses are generated and the consensus of the model is evaluated on the full set of observations. The RANSAC termination confidence,
$$p = 1 - \exp\left(N \log\left(1 - (1 - \varepsilon)^{s}\right)\right) \tag{7.1}$$
is used to decide whether two images satisfy the epipolar geometry. Here, N is the total number of evaluated models, w = 1 − ε is the probability that any selected data point is an inlier, and s = 5 is the cardinality of the sample point set used to compute a minimal model. We require p > 0.999 in order to accept an epipolar geometric relation. In our experiments, we use up to N = 2000 models, which corresponds to a maximal outlier fraction of ε = 0.67.
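The relation between N, ε and p in Eq. (7.1) can be checked numerically. The sketch below evaluates the confidence and inverts the formula to obtain the largest outlier fraction that still reaches a required confidence; the function names are hypothetical and chosen only for this example.

```python
import math

def ransac_confidence(n_models, outlier_fraction, sample_size=5):
    """Confidence p of Eq. (7.1) after evaluating n_models hypotheses."""
    inlier_prob = 1.0 - outlier_fraction          # w = 1 - epsilon
    all_inlier_prob = inlier_prob ** sample_size  # probability of an outlier-free sample
    if all_inlier_prob >= 1.0:
        return 1.0  # no outliers: every sample is good, avoid log(0)
    return 1.0 - math.exp(n_models * math.log(1.0 - all_inlier_prob))

def max_outlier_fraction(n_models, p_min=0.999, sample_size=5):
    """Largest epsilon for which n_models hypotheses still give confidence p_min.

    Obtained by solving Eq. (7.1) for w and returning 1 - w.
    """
    w = (1.0 - (1.0 - p_min) ** (1.0 / n_models)) ** (1.0 / sample_size)
    return 1.0 - w
```

With N = 2000, s = 5 and p > 0.999, the inverted formula gives a bound slightly above 0.67, which agrees with the maximal outlier fraction quoted in the text.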
The matching output is a graph structure, referred to as the epipolar graph, that consists of a set of vertices $V = \{I_1, \ldots, I_N\}$ corresponding to the images and a set of edges $E = \{e_{ij} \mid i, j \in V\}$ that are pairwise reconstructions, that is, relative orientations between views i and j, $e_{ij} = \langle P_i, P_j \rangle$,
$$P_0 = K_0 [I \mid 0] \quad \text{and} \quad P_1 = K_1 [R \mid t] \tag{7.2}$$
and a set of triangulated points with respective image measurements. Since the cameras are calibrated, a linear triangulation method [18] is sufficient to accurately estimate the 3D point location. This procedure is followed by a pruning step that discards ill-conditioned 3D points (points that do not satisfy the cheirality criterion) and points that have a high depth uncertainty (points where the roundness of the confidence ellipsoid [3] is larger than a given threshold).
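As an illustration of linear triangulation and the cheirality test, the following sketch uses the common DLT formulation solved by SVD. Whether this matches the exact variant of [18] is an assumption; the function names are hypothetical, and the depth test assumes projection matrices of the form K[R | t] with the last row of K equal to (0, 0, 1). The uncertainty-based pruning is not shown.

```python
import numpy as np

def triangulate_linear(P0, P1, x0, x1):
    """Linear (DLT) triangulation of one 3D point from two views.

    P0, P1: 3x4 projection matrices; x0, x1: 2D measurements in pixels.
    Returns the 3D point in inhomogeneous coordinates.
    """
    # Each measurement contributes two linear constraints on the homogeneous point.
    A = np.vstack([
        x0[0] * P0[2] - P0[0],
        x0[1] * P0[2] - P0[1],
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
    ])
    # The solution is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def in_front_of(P, X):
    """Cheirality test: True if X has positive depth in the camera P = K[R | t].

    Assumes the last row of K is (0, 0, 1), so the third row of P yields the depth.
    """
    return float(P[2] @ np.append(X, 1.0)) > 0.0
```

A point would be kept only if `in_front_of` holds for both cameras of the pair.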
7.2.2.2 Structure Initialization
Our SfM method follows an incremental approach [59] based on the epipolar graph.
In order to reconstruct a consistent 3D model, a robust and reliable start configuration is required. When the initial structure is prone to errors, a subsequent iterative optimization procedure will eventually end up in a wrong local minimum, hence good initialization is critical. As proposed in [30], we initialize the geometry in the most connected part of the graph: we determine the vertex $I^*$ with the highest degree, that is, the node with the largest number of edges. We start from the vertex $I^*$ and determine all connected neighbors. Next, all pairwise measurements are linked into point tracks and the global scale factor of the initial structure is estimated. Then, bundle adjustment [64] is used to optimize camera orientations $P_i$ and 3D points $X_j$ by minimizing the reprojection error,
$$C(P, X) = \sum_{i} \sum_{j} v_{ij} \, d(P_i X_j, x_{ij})^2 \tag{7.3}$$
where $x_{ij}$ are 2D point measurements of observed 3D points and $v_{ij}$ is an indicator variable that is 1 if the point $X_j$ is visible in camera $P_i$ and 0 otherwise. To limit the impact of outliers, we use a robust Huber M-Estimator [23] to compute the costs d.
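To make the cost in Eq. (7.3) concrete, the sketch below evaluates a robustified reprojection cost for a given set of cameras and points. It only evaluates the objective and does not perform the optimization; the Huber threshold `delta`, the function names, and the representation of visibility as a dictionary keyed by (i, j) are assumptions made for this example.

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber cost of a residual magnitude: quadratic near zero, linear in the tails."""
    r = abs(r)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

def reprojection_cost(cameras, points, observations, delta=1.0):
    """Robust reprojection cost over all visible (camera, point) pairs.

    cameras: sequence of 3x4 projection matrices P_i.
    points: sequence of 3D points X_j.
    observations: dict mapping (i, j) to the measured 2D point x_ij.
        A missing key corresponds to v_ij = 0 (point j not visible in camera i).
    delta: Huber threshold in pixels (assumed value, not given in the text).
    """
    total = 0.0
    for (i, j), x_ij in observations.items():
        proj = np.asarray(cameras[i]) @ np.append(points[j], 1.0)
        residual = np.linalg.norm(proj[:2] / proj[2] - np.asarray(x_ij, dtype=float))
        total += huber(residual, delta)
    return total
```

Small residuals contribute quadratically, as in Eq. (7.3), while large residuals grow only linearly, which is what limits the influence of outlier measurements.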
Given the initial optimized structure, each 3D point is back-projected and searched for in every image. We utilize a 2D kd-tree for efficient search and restrict the search radius to a constant factor $r_t$. Again, given the new measurements, bundle adjustment is used to optimize 3D points and camera parameters. This method ensures strong connections within the current reconstruction.
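The seed selection that starts this initialization, choosing the vertex with the highest degree in the epipolar graph and collecting its neighbors, can be sketched as follows. The edge-list representation and the tie-breaking rule (smallest image index wins) are assumptions made for this example.

```python
from collections import defaultdict

def select_seed(edges):
    """Pick the most connected image of the epipolar graph as the seed.

    edges: iterable of (i, j) image-index pairs with a verified epipolar relation.
    Returns (seed_vertex, sorted list of its neighbors).
    Ties are broken in favor of the smallest image index (an assumption).
    """
    neighbors = defaultdict(set)
    for i, j in edges:
        if i != j:
            neighbors[i].add(j)
            neighbors[j].add(i)
    if not neighbors:
        raise ValueError("epipolar graph has no edges")
    # max() returns the first maximum, so iterating in sorted order fixes the tie-break.
    seed = max(sorted(neighbors), key=lambda v: len(neighbors[v]))
    return seed, sorted(neighbors[seed])
```

For example, `select_seed([(0, 1), (1, 2), (1, 3), (2, 3)])` selects image 1, whose three neighbors form the initial structure together with it.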
7.2.2.3 Incremental Reconstruction
Next, for every image I that is not yet reconstructed and has a potential overlap with the current 3D scene (estimated from the epipolar graph), 2D-to-3D correspondences are established. A three-point pose estimation algorithm [17,32] inside a RANSAC loop is used to insert the position of a new image. When a pose can be determined (i.e., a sufficient inlier confidence is achieved), the structure is updated with the new camera and all measurements visible therein. A subsequent procedure expands the current 3D structure by triangulation of new correspondences. We follow the approach of Snavely et al. [59] and use a priority queue to guide the insertion order.
Our insertion order is based on a saliency measure that favors early insertion of images that have a strong overlap with the given 3D structure. Rather than using the raw number of potential 2D-to-3D matches, we compute an effective matching score that further takes the spatial match distribution into account. This idea is depicted in Fig. 7.3. While the number of features is equal in (a) and (b), the uniform spatial distribution of point features in (a) can be regarded as more reliable than the one shown in (b). Hence, we weight the raw number of features by an estimate of the covered image fraction, yielding the effective inlier count.
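One possible way to estimate the covered image fraction is to overlay a regular grid on the image and count the occupied cells. The text does not specify how the coverage is estimated, so the grid-based estimate, the grid resolution, and the function name below are assumptions made purely for illustration.

```python
def effective_inlier_count(points, width, height, grid=8):
    """Weight the raw match count by an estimate of the covered image fraction.

    points: iterable of (x, y) pixel positions of the 2D-to-3D matches.
    width, height: image size in pixels.
    grid: number of cells per image axis (assumed coverage estimator).
    """
    points = list(points)
    if not points:
        return 0.0
    occupied = set()
    for x, y in points:
        # Map the pixel position to a grid cell and clamp to the valid range.
        cx = min(max(int(x * grid / width), 0), grid - 1)
        cy = min(max(int(y * grid / height), 0), grid - 1)
        occupied.add((cx, cy))
    coverage = len(occupied) / (grid * grid)
    return len(points) * coverage
```

Under this estimate, sixteen matches spread over the whole image score higher than sixteen matches clustered in one corner, which mirrors the distinction between cases (a) and (b).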
Whenever N images have been added (we use N = 10), bundle adjustment is used to simultaneously optimize the structure and all camera poses. The iterative view insertion procedure is repeated until the priority queue is empty. The sparse reconstruction result can be seen in Fig. 7.4(b).
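The control flow of the insertion loop, a priority queue ordered by saliency with bundle adjustment triggered after every N inserted images, can be sketched as below. This is a simplified skeleton under several assumptions: the scores are treated as fixed, whereas in practice they would change as the structure grows; `try_insert` and `bundle_adjust` are hypothetical callbacks standing in for pose estimation and optimization; and the final bundle adjustment over a leftover partial batch is an addition of this example.

```python
import heapq

def incremental_insertion(scores, try_insert, bundle_adjust, batch_size=10):
    """Insert images in order of decreasing saliency score.

    scores: dict mapping image id to its saliency score (higher is inserted earlier).
    try_insert: callback(image_id) -> bool, True if a pose was found (hypothetical).
    bundle_adjust: callback(list_of_inserted_ids) (hypothetical).
    batch_size: number of insertions between bundle adjustments (N in the text).
    """
    # heapq is a min-heap, so negate the score to pop the best image first.
    queue = [(-score, image_id) for image_id, score in scores.items()]
    heapq.heapify(queue)
    inserted = []
    since_last_ba = 0
    while queue:
        _, image_id = heapq.heappop(queue)
        if not try_insert(image_id):
            continue  # no pose with sufficient confidence; the image is skipped
        inserted.append(image_id)
        since_last_ba += 1
        if since_last_ba == batch_size:
            bundle_adjust(list(inserted))
            since_last_ba = 0
    if since_last_ba:
        bundle_adjust(list(inserted))  # optimize the remaining partial batch
    return inserted
```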