Global Structure-from-Motion and Its Application
by
Zhaopeng Cui
M.Sc., Xidian University, 2012
B.Sc., Xidian University, 2009
Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
in the
School of Computing Science
Faculty of Applied Sciences
© Zhaopeng Cui 2017
SIMON FRASER UNIVERSITY
Summer 2017
All rights reserved.
However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, education, satire, parody, criticism, review and news reporting is likely to be in accordance with the law,
Approval
Name: Zhaopeng Cui
Degree: Doctor of Philosophy
Title: Global Structure-from-Motion and Its Application
Examining Committee:
Chair: Dr. Ze-Nian Li, Professor
Dr. Ping Tan, Senior Supervisor, Associate Professor
Dr. Greg Mori, Supervisor, Professor
Dr. Hao (Richard) Zhang, Internal Examiner, Professor
Dr. Marc Pollefeys, External Examiner, Professor, Department of Computer Science, ETH Zurich
Abstract
Structure-from-motion (SfM) is a fundamental problem in 3D computer vision that aims to recover camera poses and 3D scene structure simultaneously from a set of 2D images. SfM methods can be broadly divided into incremental and global methods according to how they register cameras. Incremental methods register cameras one by one, while global SfM methods solve for all cameras simultaneously from all available relative motions. As a result, global SfM has better potential in both reconstruction accuracy and computational efficiency than incremental SfM. In this thesis, we address two challenges of global SfM. Our goal is to propose a robust and efficient global SfM system which is applicable to all kinds of motions and datasets.
The first challenge is that translation averaging in global SfM is difficult, since the input relative motion between two cameras does not encode scale information. Therefore, many existing global SfM methods do not work for data whose measurement graph is not parallel rigid, e.g. when all cameras lie on the same line. To tackle this challenge, we propose a global SfM method based on a novel linear relationship within camera triplets. Our formulation encodes the scale information by the baseline length ratios within the camera triplet, which helps deal with collinear camera motion. We further extend the linear relationship within camera triplets to linear constraints for cameras seeing a common scene point, which can improve the global translation estimation for data with weak image association.
The second challenge is that global SfM methods are fragile on noisy data: because global SfM considers all relative relationships together, a single incorrect pairwise relationship may greatly distort the result. To deal with this challenge, we propose a novel global SfM pipeline where camera registration is formulated as a well-posed similarity averaging problem solved robustly with $L_1$ optimization. Moreover, the novel pipeline makes the filtering of noisy relative poses simple and effective, which further improves the robustness of global SfM.
We show the effectiveness of our global SfM system by applying it to the video alignment problem, which aims to find per-pixel correspondences between two video sequences in both spatial and temporal dimensions. Guided by the 3D information from global SfM, the proposed video registration method can align videos taken at different times with substantially different appearances, in the presence of moving objects and moving cameras with slightly different trajectories.
Acknowledgements
I would like to take this opportunity to thank all the people who have helped me during my Ph.D. study and in making this thesis possible. First of all, I would like to thank my senior supervisor Dr. Ping Tan. I have learned a great deal from him about both research and life. Ping was always ready to help me whenever I met a problem, and I deeply cherish the time I have spent working with him. I also owe sincere thanks to Dr. Jue Wang and Dr. Oliver Wang for their mentorship during my internship at Adobe Research, and to Dr. Jinwei Gu and Dr. Jan Kautz for their mentorship during my internship at NVIDIA Research. It was a great honor for me to work closely with all these world-class researchers. I learned a lot from their brilliant insights and suggestions on the research projects.
I would also like to express my sincere gratitude to my committee members. I am grateful to Dr. Greg Mori for his constructive suggestions throughout my Ph.D. study at Simon Fraser University. I thank Dr. Hao Zhang for his insightful comments and feedback. It is a great honor to have Dr. Marc Pollefeys as my external examiner. His excellent work on 3D reconstruction has always inspired my own research. I am also thankful to Dr. Ze-Nian Li for serving as the chair of my committee.
I would like to acknowledge a group of colleagues at Simon Fraser University: Rui Huang, Renjiao Yi, Chengzhou Tang, Luwei Yang, Min Li, Feitong Tan, Sicong Tang, Honghua Li, Rui Ma, Ruizhen Hu, Ibraheem Alhashim, Shuyang Sun, Guangtong Zhou, Zhiwei Deng, Lili Wan, Changqing Zou, Dong Wang, Chengying Gao, Han Liu, Shuhua Li, Kangxue Yin, Chenyang Zhu, Zeinab Sadeghipour, Warunika Ranaweera and too many others to list individually. I really enjoyed the collaborations and discussions with them.
I would also like to thank an inspiring group of colleagues at National University of Singapore where I spent two years during my Ph.D. study: Nianjuan Jiang, Yinda Zhang, Shuaicheng Liu, Kaimo Lin, Xiaoming Deng, Boxin Shi, Zhenglong Zhou, Zhe Wu, Peilin Wang, Zhuwen Li, Ye Luo, Jiaming Guo, Qiang Zhou, Canyi Lu, Wei Xia, Luoqi Liu, etc. I thank them for having provided an enjoyable and stimulating lab environment when I studied at NUS.
Finally, I owe my deepest thanks to my parents and elder brother for their lifelong love, support and encouragement.
Table of Contents
Approval
Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
1 Introduction
1.1 Challenges
1.2 Contributions
1.3 Thesis organization
2 Background
2.1 Two view geometry
2.2 Incremental SfM
2.3 Global SfM
2.3.1 Factorization based methods
2.3.2 Motion averaging based methods
2.4 Video Alignment
3 Linear Global SfM based on Camera Triplets
3.1 Introduction
3.2 Overview
3.3 Translation registration
3.3.1 Triplet translation registration
3.3.2 Multi-view translation registration
3.4 Generalization to EG outliers
3.5.1 Trifocal tensor estimation
3.5.2 Multi-view reconstruction
3.6 Summary
4 Linear Global Translation Estimation with Feature Tracks
4.1 Introduction
4.2 Overview
4.3 Global translation averaging via feature tracks
4.3.1 Constraints from a triangle
4.3.2 Constraints from a feature track
4.3.3 Feature tracks selection
4.4 Robust estimation by $L_1$ norm
4.5 Experiments
4.5.1 Evaluation on benchmark data
4.5.2 Experiment on sequential data
4.5.3 Experiment on unordered Internet data
4.6 Summary
5 Global SfM based on Similarity Averaging
5.1 Introduction
5.2 Overview
5.3 Sparse depth image construction
5.4 Similarity averaging
5.4.1 Robust scale averaging
5.4.2 Robust scale-aware translation averaging
5.5 Experiment
5.5.1 Evaluation on sequential data
5.5.2 Evaluation on Internet data
5.5.3 Evaluation on ambiguous data
5.5.4 Discussions
5.6 Summary
6 Application with Global SfM
6.1 Introduction
6.2 Related work
6.3 Method
6.3.1 Frame-level 3D registration
6.3.2 Pixel-level 2D registration
6.3.3 Video synthesis
6.3.4 Blending
6.3.5 Multiple videos
6.4 Results
6.5 Discussion
7 Conclusion
7.1 Future work
Bibliography
List of Tables
Table 3.1 Reconstruction accuracy of the three benchmark datasets. The terms R and c denote the absolute camera rotation error (in degrees) and camera location error (in meters) after final bundle adjustment, respectively.
Table 4.1 Reconstruction accuracy comparison on benchmark data with ground truth (GT) camera intrinsics.
Table 4.2 Reconstruction accuracy comparison on benchmark data with approximate intrinsics from EXIF. The results by Moulon [90] are not available.
Table 4.3 Comparison with [135] on challenging data. $N_i$ denotes the number of cameras in the largest connected component of our EG graph, and $N_c$ denotes the number of reconstructed cameras. $\tilde{x}$ denotes the median error before BA. $\tilde{x}_{BA}$ and $\bar{x}_{BA}$ denote the median error and the average error after BA, respectively. The errors are the distances in meters to corresponding cameras computed by an incremental SfM method [124].
Table 4.4 Running times in seconds for the Internet data. $T_{BA}$ and $T_{\Sigma}$ denote the final bundle adjustment time and total running time, respectively.
Table 5.1 Comparison on Internet data. $\tilde{x}$ and $\bar{x}$ denote the median and mean position errors in meters for different methods, taking the result of [124] as a reference. $\tilde{x}^*$ and $\bar{x}^*$ denote the median and mean position errors for our method after the final bundle adjustment. $N_i$ is the number of cameras in the largest connected component of our input EG graph, and $N_c$ is the number of reconstructed cameras. For [60], the model with the largest number of cameras is considered. The bold font highlights the best result in each row.
Table 5.2 Running times in seconds for Internet data. We report time spent on each step of our method, including depth image reconstruction ($T_D$), local bundle adjustment ($T_{LBA}$), missing correspondence analysis ($T_{MC}$), rotation averaging ($T_R$), scale averaging ($T_s$), scale-aware translation averaging ($T_c$), final bundle adjustment ($T_{BA}$), and total running time ($T_{\Sigma}$). We cite the final bundle adjustment time and total running time from [135], [100], and [124] for comparison. Ceres [1] is adopted to solve the final BA for all methods except [100].
Table 6.1 Runtime of different components of our system in seconds per frame.
Table 6.2 Quantitative evaluation on alignment error in pixels. See text for details.
List of Figures
Figure 1.1 Illustration of structure from motion. Given a set of input images (left), SfM aims to recover camera poses and sparse 3D scene structure (right) simultaneously. The blue triangles on the right represent cameras.
Figure 1.2 Pipeline of incremental SfM.
Figure 1.3 Comparison of reconstruction results generated from incremental and global SfM methods. (a) Sample input images; (b) Result generated by an incremental SfM system [136]; (c) Result generated by a global SfM method [27]. We can see that the cameras at the top of (b) have noticeable drifting, and the reconstruction is also visibly distorted.
Figure 1.4 Pipeline of global SfM.
Figure 1.5 Previous global SfM methods (e.g. [49, 9, 6, 100]) usually degenerate with collinear motions. (a) shows an example of true collinear camera configuration; (b) and (c) show two examples of recovered camera positions. As the relative translation between two views doesn’t encode scale information, there are infinitely many solutions for collinear camera motion even after removing the gauge ambiguity by fixing the baseline length between $c_1$ and $c_4$.
Figure 1.6 Example images of Internet data for matching. As there may be large illumination changes (a), view point changes and occlusions (b), the numbers of good correspondences between these image pairs are quite small, which leads to poor relative motion estimation. Global SfM methods are usually fragile on this kind of data because they solve large systems involving all relative relationships together.
Figure 1.7 Our robust spatio-temporal video alignment enables the blending of multiple videos recorded with different appearances.
Figure 3.1 Geometric explanation of Equation (3.4).
Figure 3.2 (a) A connected component of the match graph. (b) The two corresponding connected triplet graphs.
Figure 3.3 The test geometry used in comparison with the four-point algorithm [95].
Figure 3.4 Comparison with the four-point algorithm [95] (3V4P). Our method
Figure 3.5 Comparison with Arie-Nachimson et al. [6]. Our method is much more stable in translation estimation for near collinear camera motions.
Figure 3.6 Input images and reconstructed point clouds of (a) fountain-P11, (b) Herz-Jesu-P25, (c) castle-P30.
Figure 3.7 Reconstruction results for relatively large scale datasets. (a) Building. (b) Trevi Fountain. (c) Pisa. (d) Notre Dame.
Figure 4.1 1DSfM [135] and triplet-based methods (e.g. [60]) require strong association among images. As shown on the left, they fail for images with weak association. In comparison, as shown on the right, the results of the method in Chapter 4 do not suffer from such problems.
Figure 4.2 (a) The positions of a scene point $p$ and two camera centers $c_i$ and $c_j$ satisfy a linear constraint detailed in Section 4.3.1. (b) The positions of four cameras seeing the same scene point satisfy a linear constraint detailed in Section 4.3.2.
Figure 4.3 Evaluation on sequential data. From top to bottom, each row shows sample input images, 3D reconstructions generated by our method, VisualSFM [136], and the least unsquared deviations (LUD) method [100], respectively.
Figure 5.1 Left: an EG graph where each camera is a vertex and two cameras are connected if the essential matrix between them is known. Right: a stellate graph includes all vertices and edges directly linked to a center vertex $i$.
Figure 5.2 Pipeline of the proposed method. See text for more details.
Figure 5.3 The cumulative distribution function (CDF) of relative motion errors for the Gendarmenmarkt data in Section 5.5.2. The input EGs contain significant errors in both rotation and translation directions. Our depth consistency, local BA, and missing correspondence analysis improve local relative motions for robust global SfM.
Figure 5.4 Missing correspondence analysis. The blue frame indicates the field-of-view (FOV) of the camera. Green and red dots are matched and missing features. See text for more details.
Figure 5.5 Evaluation on sequential data. From top to bottom, each row shows sample input images, 3D reconstructions generated by our method, VisualSFM [136], and the least unsquared deviations (LUD) method [100], respectively.
Figure 5.6 A failure case of 1DSfM [135] on the Herz-Jesu-P25 data.
Figure 5.7 Our reconstruction results on the data in [135].
Figure 5.8 Results on two large-scale Internet datasets published in [135]. (a) and (b) are the results on the Piccadilly and Trafalgar data with 2276 and 4945 images reconstructed, respectively.
Figure 5.9 Results on the challenging Gendarmenmarkt data. The image in (a) shows the bilaterally symmetric architecture layout. The results cited from 1DSfM [135] and our method are shown in (b) and (c). Our method succeeds on this data thanks to the EG filtering in local BA and ‘missing correspondence’ analysis.
Figure 5.10 Our results on challenging pathological data with large repetitive structures.
Figure 5.11 Our result on the Quad data. The camera orientations are computed from Bundler [124].
Figure 6.1 The LOF caption
Figure 6.2 Pipeline of the proposed method highlighting the main steps: joint 3D reconstruction, temporal registration, semi-dense 2D registration, mesh-based warping, and finally video synthesis. By working in 3D, we can optionally select regions to blend directly in world-space.
Figure 6.3 This figure shows the projections of 3D points (red dots). We can see that they are not well distributed: the upper part of the images has few points.
Figure 6.4 Illustration of loop consistency check.
Figure 6.5 This figure shows the computed 2D masks (right) guided by the 3D mask (left).
Figure 6.6 This figure shows the selected reliable points after our feature refinement (blue and green dots), and the result of the mesh-warping in the source image.
Figure 6.7 Example time slice configurations showing the world-space (3D) slices. Please see supplemental result videos [28]. From top to bottom: (a) ALLEY, (b) GARDEN, (c) BEAR, (d) SNOW and (e) DRONE.
Figure 6.8 Result of WALKING with 2D slices. (a) is the image-space (2D) slice selection. (b), (c) and (d) are sample frames from the three input videos, and (e) is a frame from the synthesized video. Please see the supplemental video [28].
Figure 6.9 Example of other applications, including compositing (left), and clean-plate extraction (right). Top two rows show the sample frames from two input videos, and the bottom row is the frame from the synthesized video. Please see the supplemental video [28].
Figure 6.10 3D selection for the clean-plate example.
Figure 6.11 Time-lapse video. We transfer the person in the first sequence (top row) to
Figure 6.12 Alignment error as related to parallax. (a) The camera paths of six videos in 3D space. The blue one is the path for the reference video. (b) The first frame of all input videos with increasing parallax. The color of the frame boundary corresponds to the color of its camera path. The number on the frame is the average alignment error in pixels from the current video to the reference video.
Figure 6.13 Example of failure cases. As 3D reconstruction is not successful due to severe motion blur (top row), our final alignment is not accurate (bottom row).
Chapter 1
Introduction
With the recent development of augmented reality, virtual reality and autonomous driving, 3D computer vision has again attracted a lot of attention as the bridge between the virtual and real worlds. As an important problem in 3D vision, structure from motion (SfM) aims to recover camera poses and 3D scene structure simultaneously given a set of 2D images, as shown in Figure 1.1.
Conventional SfM systems often consist of three steps. In the first step, relative poses between camera pairs or triplets are computed from matched image feature points, e.g. by the five-point [94, 77] or six-point [110, 131] algorithm. This step also yields a set of reliable feature correspondences which fit the computed relative poses well. In the second step, all camera poses (including orientations and positions) and scene point coordinates are recovered in a global coordinate system according to these relative poses. If camera intrinsic parameters are unknown, self-calibration algorithms, e.g. [108], should be applied. In the third step, a global non-linear optimization algorithm, e.g. bundle adjustment (BA) [132], is applied to optimize both camera poses and 3D points by minimizing the reprojection error, which corresponds to a maximum likelihood estimate under Gaussian image noise.
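The reprojection error minimized in the third step can be made concrete with a minimal sketch. It assumes a pinhole camera with zero skew that maps a world point X to R (X - c) before applying the intrinsics K; this convention, the function names, and the numbers are illustrative assumptions, not taken from a specific SfM system.

```python
import math

def project(K, R, c, X):
    """Project world point X into a pinhole camera (intrinsics K, rotation R, center c)."""
    # Camera-frame coordinates: x_cam = R (X - c)
    d = [X[k] - c[k] for k in range(3)]
    x_cam = [sum(R[r][k] * d[k] for k in range(3)) for r in range(3)]
    # Perspective division followed by the intrinsics (zero skew assumed)
    u = K[0][0] * x_cam[0] / x_cam[2] + K[0][2]
    v = K[1][1] * x_cam[1] / x_cam[2] + K[1][2]
    return (u, v)

def reprojection_error(K, R, c, X, observed):
    """Pixel distance between the projection of X and its observed image position."""
    u, v = project(K, R, c, X)
    return math.hypot(u - observed[0], v - observed[1])

def total_squared_error(K, cameras, points, observations):
    """Bundle adjustment objective: sum of squared reprojection errors.

    observations is a list of (camera_index, point_index, (u, v)) tuples.
    """
    return sum(
        reprojection_error(K, cameras[ci][0], cameras[ci][1], points[pi], uv) ** 2
        for ci, pi, uv in observations
    )

# Hypothetical data: one camera at the origin looking down the z axis.
K = [[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [(I3, (0.0, 0.0, 0.0))]
points = [(0.5, -0.25, 2.0)]
observations = [(0, 0, (521.0, 140.0))]
cost = total_squared_error(K, cameras, points, observations)
```

Bundle adjustment would minimize this cost over all camera poses and point coordinates with a non-linear least-squares solver; the sketch only evaluates the objective.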
While there are well established theories for the first and the third steps, the second step in existing systems is often ad hoc and heuristic. Based on the strategy used for the second step, previous methods can be broadly divided into incremental and global methods.
Incremental methods register cameras one by one [124, 136]. A typical incremental pipeline is shown in Figure 1.2. After computing the relative motions between any two images, it usually selects two cameras for the initial reconstruction. It then adds more cameras one by one using resectioning algorithms [75, 68]. However, it cannot simply keep adding cameras until the last one, because computation errors accumulate. In order to reduce these drifting errors, it usually runs an intermediate bundle adjustment after adding several cameras. As the intermediate bundle adjustment involves a small set of cameras, it is also known as local bundle adjustment. The steps of adding cameras and running local BAs are repeated until all cameras are added into the system. Although the local BAs help improve the robustness to noisy and incorrect relative poses, their frequent use is computationally expensive. Some methods [76, 56] take a hierarchical approach and gradually merge short sequences or partial reconstructions, but they still need local BAs to ensure successful reconstruction. Moreover, both incremental and hierarchical methods fix some camera poses before computing others. As a result, even with extensive local BAs, these methods still suffer from drifting errors, especially for long image sequences, as shown in Figure 1.3 (b).
Figure 1.1: Illustration of structure from motion. Given a set of input images (left), SfM aims to recover camera poses and sparse 3D scene structure (right) simultaneously. The blue triangles on the right represent cameras.
[Flowchart boxes: Input Images, Two View Geometry, Initial Reconstruction (2 Cameras), Add Cameras, Bundle Adjustment, More Cameras (Yes/No), Bundle Adjustment, 3D Model]
Figure 1.2: Pipeline of incremental SfM.
Global SfM methods try to register all cameras simultaneously. A typical global pipeline is shown in Figure 1.4. Given the relative poses, it usually first recovers the global rotations of all cameras together, which is known as rotation averaging, and then recovers the global translations of all cameras at once, which is known as translation averaging. As global methods consider all pairwise constraints between cameras together and do not require local BAs, they usually have better potential in both accuracy and efficiency, as shown in Figure 1.3 (c).
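For reference, the constraints behind the two averaging steps can be written compactly. The notation here is an assumption chosen for illustration and may differ from the convention used in later chapters: camera $i$ has rotation $R_i$ and center $c_i$ and maps a world point $X$ to $x_i = R_i (X - c_i)$, while $R_{ij}$ and $t_{ij}$ denote the relative rotation and translation taking camera-$i$ coordinates to camera-$j$ coordinates.

```latex
\begin{align}
  R_{ij} &= R_j R_i^{\top}, \\
  t_{ij} &\propto R_j \, (c_i - c_j).
\end{align}
```

Rotation averaging looks for global rotations $R_i$ that best satisfy the first relation over all measured pairs. Translation averaging then looks for centers $c_i$ that best satisfy the second, where the proportionality sign reflects that two-view geometry provides only the direction of $t_{ij}$ and not its length.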
In this thesis, we will focus on global SfM methods. Our goal is to propose a robust and efficient global SfM system which is applicable to all kinds of motions and datasets. We will also show one application of our global SfM method.
1.1 Challenges
Previous global SfM methods [49, 9, 6] only work on small sequential datasets and cannot be applied to large datasets, especially Internet datasets. To summarize, global methods face three challenges:
Figure 1.3: Comparison of reconstruction results generated from incremental and global SfM methods. (a) Sample input images; (b) Result generated by an incremental SfM system [136]; (c) Result generated by a global SfM method [27]. We can see that the cameras at the top of (b) have noticeable drifting, and the reconstruction is also visibly distorted.
[Flowchart boxes: Input Images, Two View Geometry, Rotation Averaging, Translation Averaging, Bundle Adjustment, 3D Model]
Figure 1.4: Pipeline of global SfM.
Firstly, rotation averaging in SfM is complicated. This is because the camera rotations belong to a product of manifolds ($SO(3)^n$, with $n$ the number of cameras) which has a nontrivial topology [15]. Although it is possible to derive exact closed-form solutions for the relatively simple 2D rotation averaging problem, no exact closed-form solution is known for the 3D case on $SO(3)^n$ [16].
Secondly, translation averaging is hard. As the intrinsic geometry between two views, the epipolar geometry (EG) encodes the relative translation direction between two cameras without the scale information (i.e. the baseline length). So the relationship between the global and relative translations is known only up to a scale. Most existing translation estimation methods [49, 9, 6, 100] degenerate under collinear camera motion, as shown in Figure 1.5. In fact, according to [100], methods based only on relative translation directions work only for data with a parallel rigid measurement graph.
Thirdly, global SfM methods are fragile on noisy data, e.g. Internet images as shown in Figure 1.6, due to poor relative motion estimation caused by feature matching failures. As global methods consider all relative constraints together, they have to carefully filter out wrong EGs before motion averaging. For some datasets which have weak associations between cameras, one bad EG can heavily distort the reconstruction result. In comparison, incremental methods can benefit from the RANSAC process and local BAs to decrease the influence of bad EGs.
[Figure panels (a), (b), (c): each shows four cameras $c_1$, $c_2$, $c_3$, $c_4$]
Figure 1.5: Previous global SfM methods (e.g. [49, 9, 6, 100]) usually degenerate with collinear motions. (a) shows an example of true collinear camera configuration; (b) and (c) show two examples of recovered camera positions. As the relative translation between two views doesn’t encode scale information, there are infinitely many solutions for collinear camera motion even after removing the gauge ambiguity by fixing the baseline length between $c_1$ and $c_4$.
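The ambiguity described in Figure 1.5 can be checked numerically. The sketch below is an illustration with made-up coordinates: it assumes camera rotations are already known, so relative translation directions can be expressed in the world frame, and it compares two different collinear configurations that share the same end cameras.

```python
import math

def unit_direction(a, b):
    """Unit vector pointing from camera center a to camera center b."""
    d = [bk - ak for ak, bk in zip(a, b)]
    n = math.sqrt(sum(x * x for x in d))
    return tuple(x / n for x in d)

def pairwise_directions(centers):
    """All pairwise translation directions, which is what two-view geometry provides."""
    n = len(centers)
    return {(i, j): unit_direction(centers[i], centers[j])
            for i in range(n) for j in range(i + 1, n)}

# Hypothetical camera centers on a line; c1 and c4 are identical in both
# configurations, so the gauge (origin and overall scale) is already fixed.
true_centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
wrong_centers = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (2.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

# Both configurations produce exactly the same direction measurements.
indistinguishable = pairwise_directions(true_centers) == pairwise_directions(wrong_centers)
```

Because every pairwise direction is the same in both configurations, no method that uses only these directions can tell them apart. Extra information, such as baseline length ratios, is needed to fix the positions of $c_2$ and $c_3$.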
Figure 1.6: Example images of Internet data for matching. As there may be large illumination changes (a), view point changes and occlusions (b), the numbers of good correspondences between these image pairs are quite small, which leads to poor relative motion estimation. Global SfM methods are usually fragile on this kind of data because they solve large systems involving all relative relationships together.
1.2 Contributions
In this thesis, we propose a range of new algorithms and techniques that mainly address two of these challenges for global SfM (translation averaging and robustness), and apply global SfM to the video registration problem. For rotation averaging in global SfM, there are already some very efficient and robust algorithms [54, 19], which have been shown to work well on most existing datasets, even noisy large-scale ones [135, 100].
Global SfM based on camera triplets. To address the degeneracy of translation averaging under collinear motion, we propose a global SfM method that minimizes an approximate geometric error to enforce the triangular relationship in camera triplets. As the scale information is encoded by the baseline length ratios within the triplets, our formulation does not degenerate with collinear motions. Moreover, by minimizing the approximate geometric error, our formulation does not suffer from the typical ‘unbalanced scale’ problem of linear methods that rely on pairwise translation direction constraints, i.e. on an algebraic error. This work has been reported in [60].
Feature track based translation averaging. We extend the linear relationship within camera triplets to linear constraints for cameras seeing a common scene point, and propose feature track based translation averaging. As we can construct a relationship between two cameras as long as they share a visible scene point, this formulation can deal with more general nonrigid camera configurations where the association between cameras is weak. Moreover, the final linear formulation does not involve the coordinates of scene points, which keeps it as efficient on large-scale data as previous global methods that rely only on camera relationships. This work has been reported in [26].
[Figure labels: Input Sequences; Time Slice Video; time stamps 10:00, 13:00, 15:00, 17:00, 18:00]
Figure 1.7: Our robust spatio-temporal video alignment enables the blending of multiple videos recorded with different appearances.
Global SfM based on similarity averaging. Based on the analysis of previous global SfM methods, we find that it is hard to deal with all challenges under the traditional global SfM pipeline. So we propose a novel global SfM pipeline. We compute a sparse depth image at each camera, and these depth images help to upgrade an essential matrix between two cameras to a similarity transformation, which determines the scale of the relative translation. Thus, camera registration is formulated as a well-posed similarity averaging problem. Depth images also make the filtering of noisy relative poses simple and effective. In this way, translation averaging can be solved robustly as two convex $L_1$ optimization problems, which reach the global optimum rapidly. This work has been reported in [27].
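The robustness attributed to $L_1$ optimization can be illustrated on a toy problem with a single unknown. This sketch is not the pipeline of the thesis: it assumes, purely for illustration, that a relative scale is estimated from several noisy ratio measurements (hypothetical values), one of which is a gross outlier. For one unknown, the minimizer of the sum of absolute residuals is the median, while the minimizer of the sum of squared residuals is the mean.

```python
import statistics

def l1_estimate(samples):
    """Minimizer of sum_k |s - r_k| over s, i.e. the median of the samples."""
    return statistics.median(samples)

def l2_estimate(samples):
    """Minimizer of sum_k (s - r_k)^2 over s, i.e. the mean of the samples."""
    return statistics.mean(samples)

# Hypothetical scale measurements; the last one is an outlier,
# e.g. caused by a wrong feature match.
ratios = [2.0, 2.1, 1.9, 2.0, 10.0]

robust_scale = l1_estimate(ratios)     # stays at the consensus value
sensitive_scale = l2_estimate(ratios)  # pulled towards the outlier
```

The same effect carries over to problems with many unknowns, where the $L_1$ cost is still convex but has to be minimized with a numerical solver rather than a median.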
Robust video alignment guided by global SfM. Video alignment is known as a challenging problem which aims to find per-pixel correspondences between two video sequences in both spatial and temporal dimensions. Based on our global SfM, we propose a robust spatio-temporal video alignment method which enables the blending of multiple videos recorded with different appearances, as shown in Figure 1.7. On one hand, the computed camera poses from global SfM can help the frame-level registration; on the other hand, the sparse 3D points from global SfM can also help guide pixel-level registration. With the guidance of 3D information from global SfM, our video registration method has better performance than previous methods. This work has been reported in [29].
1.3 Thesis organization
This thesis is organized as follows. In Chapter 2, we survey some of the major related works on SfM and video alignment. We then present our triplet based global SfM and feature track based translation averaging methods in Chapters 3 and 4. In Chapter 5, we describe our new global SfM pipeline based on similarity averaging. In Chapter 6, we introduce the robust video alignment method guided by global SfM. Finally, Chapter 7 concludes this thesis and discusses some promising directions for future work.
Chapter 2
Background
In this chapter we review previous work related to the problems studied in this thesis. We start by looking at methods for two view geometry, which is the fundamental step for all SfM methods. We then review incremental and global SfM methods in detail. Lastly, we briefly review previous work on video alignment.
2.1 Two view geometry
The simplest case for 3D reconstruction is to reconstruct two view geometry, which is also known as the epipolar geometry. As a foundation for structure-from-motion with multiple images, it has been extensively studied.
When the intrinsic parameters are unknown, the scene and cameras can be reconstructed up to a projective transformation, and this geometry is encapsulated by the fundamental matrix $F$ [55]. As a result, self-calibration is usually needed to upgrade the projective reconstruction to a metric reconstruction [108, 55].
When the intrinsic parameters of the cameras (e.g. focal length) are known, Kruppa [71] proved that the camera poses and 3D points can be determined up to a similarity transform for two views given five point correspondences, a result that was further proved and studied in [32, 39, 58]. In this case, the two view geometry can be encapsulated by the essential matrix $E$ [55]. Although Kruppa gave an algebraic algorithm, it was hard to implement efficiently in practice. Philip proposed a practical non-iterative algorithm for this problem. More recently, Nistér [94] proposed an efficient algorithm using a modified Gauss-Jordan elimination procedure based on [103]. This method was further improved by Li and Hartley [77], and advanced Gröbner basis techniques were used to make it more numerically stable [126, 72]. Compared to 6-, 7- and 8-point algorithms [104, 55], 5-point algorithms suffer from fewer types of critical surfaces and are more efficient and accurate. As a result, 5-point algorithms are widely used in current SfM systems when the intrinsic parameters are known.
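As a small illustration of what the essential matrix encodes, the sketch below builds $E = [t]_\times R$ and evaluates the epipolar constraint $x_2^\top E x_1 = 0$ on a synthetic correspondence. The convention $X_2 = R X_1 + t$ between the two camera frames, the helper names, and the numbers are assumptions made for this example.

```python
def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v equals the cross product t x v."""
    return [[0.0, -t[2], t[1]],
            [t[2], 0.0, -t[0]],
            [-t[1], t[0], 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_matrix(R, t):
    """E = [t]_x R for the assumed convention X2 = R X1 + t."""
    return matmul(cross_matrix(t), R)

def epipolar_residual(E, x1, x2):
    """x2^T E x1 for normalized homogeneous image points x1 and x2."""
    Ex1 = matvec(E, x1)
    return sum(x2[k] * Ex1[k] for k in range(3))

# Hypothetical relative pose: 90 degree rotation about the z axis, translation along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)

# A scene point in camera-1 coordinates and the same point in camera-2 coordinates.
X1 = (1.0, 2.0, 4.0)
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

E = essential_matrix(R, t)
residual_true = epipolar_residual(E, x1, x2)

# Scaling t scales E, so the same correspondence still satisfies the constraint:
# the baseline length cannot be recovered from two views alone.
E_scaled = essential_matrix(R, tuple(5.0 * v for v in t))
residual_scaled = epipolar_residual(E_scaled, x1, x2)

# A mismatched point generally violates the constraint.
residual_wrong = epipolar_residual(E, x1, [-0.25, 0.5, 1.0])
```

A five-point solver works in the opposite direction: it estimates $E$ from correspondences, typically inside a RANSAC loop, and correspondences with a small residual are kept as inliers.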
When there is only rotation between two views without translation, or when all feature points lie on the same plane, we can only compute a homography $H$ to describe the geometric relationship between the two views. In such cases, we cannot recover the camera poses and scene structure, but we can still use $H$ to filter unreliable feature matches [136, 120].
When there are hundreds or thousands of images, especially for unordered collections, we need to reduce the number of image pairs before computing the two view geometry. Image retrieval has been extensively used for this task [4, 43, 85]. In these methods, vocabulary trees [96], an instance of bag-of-words (BoW) models, are normally adopted to describe images as a whole. Local features (e.g. SIFT [87]) are hierarchically quantized in the vocabulary tree, then the similarity between images is measured with TF-IDF scoring. Recently some novel methods were proposed for pairwise image matching based on preemptive matching [136] or learned pairwise geometric attributes [119].
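The retrieval step can be sketched as follows. This is a simplified illustration rather than the vocabulary tree of [96]: it assumes a flat vocabulary in which each local feature has already been quantized to an integer visual word id, and it scores image pairs by the cosine similarity of their TF-IDF vectors. The image names, word ids, and threshold are made up.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(images):
    """images: dict mapping image name to a list of visual word ids.

    Returns a dict mapping image name to an L2-normalized sparse TF-IDF vector.
    """
    n = len(images)
    # Document frequency: number of images in which each word occurs
    df = Counter()
    for words in images.values():
        df.update(set(words))
    vectors = {}
    for name, words in images.items():
        tf = Counter(words)
        vec = {w: (cnt / len(words)) * math.log(n / df[w]) for w, cnt in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors[name] = {w: v / norm for w, v in vec.items()} if norm > 0 else vec
    return vectors

def similarity(u, v):
    """Cosine similarity between two normalized sparse vectors."""
    return sum(val * v.get(w, 0.0) for w, val in u.items())

# Hypothetical quantized features for three images.
images = {
    "A": [1, 1, 2, 3],
    "B": [1, 2, 2, 4],
    "C": [5, 6, 7, 7],
}
vecs = tfidf_vectors(images)

# Only pairs above a (made-up) threshold are passed on to two view geometry estimation.
threshold = 0.1
candidates = [(a, b) for a, b in combinations(sorted(vecs), 2)
              if similarity(vecs[a], vecs[b]) > threshold]
```

Here images A and B share visual words and are kept as a candidate pair, while C shares none with either and is never matched against them, which is how retrieval avoids exhaustive pairwise matching.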
2.2 Incremental SfM
After computing the two-view geometry and reliable feature correspondences, one straightforward way to register cameras is to first select two cameras to perform an initial reconstruction and then add cameras one by one into a global coordinate system by resectioning.
Early incremental methods concentrated on reconstruction from videos based on feature tracking techniques. Pollefeys et al. [107] proposed a complete incremental pipeline for uncalibrated image sequences captured with a hand-held camera. With multiple cameras, GPS and INS measurements, a system for automatic, geo-registered 3D reconstruction from videos of urban scenes was proposed in [106]. Visual simultaneous localisation and mapping (SLAM) was developed in this direction, and can be considered as real-time incremental SfM. Visual SLAM has been extensively studied since the first real-time application of BA in visual odometry [91]. Many successful visual SLAM systems have been proposed and can be roughly divided into feature-based methods (e.g. monoSLAM [31], PTAM [65] and ORB-SLAM [93]) and direct methods (e.g. LSD-SLAM [35] and DSO [34]).
With the development of robust feature descriptors like SIFT [87] or SURF [7], features can be matched under larger view differences, and SfM techniques were successfully extended to very large-scale and unordered sets of images (e.g. Internet images). The first approach for organizing unordered image sets was proposed by Schaffalitzky and Zisserman [117]. Then Snavely et al. [124, 125] proposed the first successful system, Bundler, for 3D reconstruction with Internet photo collections, which could reconstruct hundreds of images downloaded from the Internet within several days. In order to speed up the process, Agarwal et al. [4] parallelized the matching process and adopted approximate nearest neighbor search and query expansion [23] instead of exhaustive pairwise image matching. It was possible to reconstruct thousands of Internet images in a day with a cluster of 62 machines [4] or even with a single PC [43]. In order to reduce the cost of the most time-consuming part, bundle adjustment, several advanced BA algorithms were proposed based on Preconditioned Conjugate Gradient [2, 12, 137]. With the advanced BA and matching strategy, it was possible to reduce the time complexity from $O(n^4)$ to $O(n)$ [136].
The reconstruction quality of incremental SfM methods usually depends on the initial pair of cameras and the order in which the other cameras are added. To handle this drawback, some methods adopted ‘next-best-view’ algorithms (e.g. [33, 53, 120]). Another solution was to hierarchically merge small reconstructions into larger ones [76, 47, 57]. The state-of-the-art system [57] is now able to process tens of millions of input Internet images for 3D reconstruction within several days.
2.3 Global SfM
Given the pairwise geometries between cameras, we can also compute all the global camera poses at once, which is known as global SfM. The global SfM problem can be represented as a graph, where the nodes represent cameras and two cameras are linked if the pairwise geometry can be computed between them. This graph is also known as the EG graph or measurement graph [100]. There are two categories of global SfM methods: factorization based methods and motion averaging based methods.
2.3.1 Factorization based methods
Factorization based 3D reconstruction was originally proposed by Tomasi and Kanade [130] to recover all camera poses and 3D points simultaneously for orthographic cameras. An image sequence was represented as a $2n \times m$ measurement matrix $W$, which was made up of the coordinates of $m$ points tracked through $n$ cameras. It was proved that with zero-mean coordinates, $W$ can be factored into the product of two matrices $R$ and $S$, where $R$ is a $2n \times 3$ matrix that represents camera rotation, and $S$ is a $3 \times m$ matrix that represents the 3D positions of the points. This was further extended to the paraperspective camera model [105] and the perspective camera model [128, 22].
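The rank-3 structure behind this factorization can be sketched in a few lines. The snippet below is only an illustration: it assumes a complete, noise-free measurement matrix, splits the singular values evenly between the two factors, and omits the metric upgrade that resolves the remaining affine ambiguity. The name `factorize` is hypothetical.

```python
import numpy as np

def factorize(W):
    """Rank-3 factorization of a 2n x m measurement matrix (orthographic model).

    Returns (R_hat, S_hat, W0), where W0 is W with zero-mean rows and
    R_hat @ S_hat is the best rank-3 approximation of W0. The factors are
    defined only up to an invertible 3x3 matrix (no metric upgrade here).
    """
    W0 = W - W.mean(axis=1, keepdims=True)       # zero-mean image coordinates
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    root = np.sqrt(s[:3])
    R_hat = U[:, :3] * root                       # 2n x 3 motion factor
    S_hat = root[:, None] * Vt[:3]                # 3 x m shape factor
    return R_hat, S_hat, W0
```

With missing entries in $W$ the SVD step is no longer applicable, which is the first of the two problems discussed next.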
Factorization based methods face two main problems: missing data and outliers. First, if there is missing data in $W$, there is no closed-form solution to the factorization. Second, as it is hard to adopt robust error functions in factorization, these methods are usually sensitive to incorrect correspondences. Several methods were proposed to handle these challenges using advanced matrix factorization algorithms [10, 97, 64] or low-rank approximation [21, 20]. However, there is no practical factorization based method for reconstruction from large-scale and noisy image sequences.
2.3.2 Motion averaging based methods
The motion averaging based methods normally solve all camera poses together in two steps. Typically, they first compute camera rotations and then solve translations in the next step.
The computation of global camera rotations is also known as multiple rotation averaging. Its input is a set of relative rotations $\{R_{ij}\}$ between two cameras, and the output is the global rotation matrix $R_i$ for each camera. This problem is complicated due to the nontrivial topology of the rotation manifold [54]. Govindu proposed several methods to solve the problem. He first proposed a linear least squares method using quaternions as the representation for rotations in [49], where a closed-form solution was given using the Singular Value Decomposition (SVD). Then the Lie-algebra representation of rotations was discussed in [50]. In order to address the robustness issue, Govindu proposed to use a RANSAC strategy in [51]. In [89], Martinec and Pajdla proposed a linear algorithm for rotation averaging by ignoring the manifold constraint. Recently, Chatterjee and Govindu [19] proposed an efficient and robust method for large-scale rotation averaging based on modern $L_1$ optimization and iteratively reweighted least squares approaches. The method performs well on very noisy Internet images, and has been adopted in many global SfM methods. Rotation averaging was also intensively studied in robotics and control, with an up-to-date survey in [16].
The computation of global camera translations is referred to as translation averaging. The input is a set of relative translations $\{t_{ij}\}$ between two cameras and the global rotations $\{R_i\}$, and the output is the global translation $c_i$ for each camera. Translation averaging is difficult because the pairwise relative translation $t_{ij}$ is only known up to a scale. Govindu [49] proposed a simple linear least squares method derived from pairwise relative translation directions. This linear equation constrains the local translation direction $t_{ij}$ to be consistent with the direction computed as $-R_j(c_j - c_i)$. Brand et al. [9] formulated translation averaging as a graph embedding problem. Arie-Nachimson et al. [6] derived a highly efficient linear solution of translations from a novel decomposition of the essential matrix. This method was more robust to different baseline lengths between cameras. However, all these methods based on pairwise translation directions have unique solutions only for a parallel rigid measurement graph, as proved in [100]. For other cases, e.g. the collinear motion setup, these methods degenerate. What’s more, these methods usually minimize an algebraic error, which makes them sensitive to large noise.
Some global methods solve all camera poses and 3D scene points at once. Kahl [62] used the $L_\infty$-norm to measure the reprojection error of a reconstruction, which leads to a quasi-convex optimization problem. Later work along this direction tried to speed up the computation by selecting only representative points from image pairs [89], using fast optimization algorithms [99, 3], or using a customized cost function and optimization procedure [140]. It is well known that the $L_\infty$-norm is highly sensitive to outliers. Therefore, careful outlier removal is required for the stability of the optimization [30, 98].
Other global methods tried to use the trifocal tensors between three cameras. Sim and Hartley [122] utilized the trifocal tensors to avoid the degeneracy of collinear motions and formulated the problem as an $L_\infty$-norm minimization. Courchay et al. [24] exploited the loops in the graph of trifocal tensors to handle drifting errors, where each loop was encoded as a non-linear constraint on the unknown camera poses. Because of the non-linear constraint, this method relied on a good initialization. There are also some methods exploiting rough or partial 3D information as initialization. For instance, with the aid of GPS, city-scale SfM can be solved under the MRF framework [25].
2.4 Video Alignment
Video alignment aims to find per-pixel correspondences between two video sequences in both spatial and temporal dimensions, and it is usually considered to be more difficult than the classical spatial image alignment [129].
Caspi et al. [17, 18] proposed spatio-temporal alignment based on feature trajectories. Instead of matching features from every frame directly, they transformed video alignment into a trajectory matching problem between two sequences. Ukrainitz and Irani [133] used dense space-time intensity information and proposed an algorithm based on maximizing local space-time correlations. Ravichandran and Vidal [111] modeled each video sequence as the output of a linear dynamical system, and treated the alignment of two videos as the alignment of the parameters of two linear dynamical systems. A single homography (or fundamental matrix) and a temporal lag factor were jointly optimized in these methods. As a result, they can only handle videos captured by stationary or jointly moving cameras which have fixed internal and relative external parameters.
Sand and Teller [116] proposed a video matching method based on robust feature matching and dense interpolation. This method can be applied to videos recorded with hand-held cameras. However, it relied on a good initial guess for frame-level registration due to the local regression method used for this step. Evangelidis et al. [38, 37] adopted an information retrieval framework for rough frame-level registration, and then used the Enhanced Correlation Coefficient (ECC) algorithm for both temporal refinement and spatial alignment.
Video alignment is a fundamental step in stitching together wide-angle (incl. 360°) video. The video stitching methods [102, 52, 74, 79] usually don’t deal with frame registration and require synchronization for all videos. As a result, they use methods similar to single-image alignment techniques such as feature matching and mesh warping [52, 74], optical flow [102], or joint 3D reconstruction [79].
Chapter 3
Linear Global SfM based on Camera Triplets
This chapter introduces a linear triplet based global structure-from-motion (SfM) method. Our method minimizes an approximate geometric error to enforce the triangular relationship in camera triplets. By encoding the scale information in the baseline length ratios within camera triplets, this formulation suffers neither from the typical ‘unbalanced scale’ problem of linear methods relying on pairwise translation direction constraints (i.e. an algebraic error), nor from the system degeneracy caused by collinear motion. In the case of three cameras, our method provides a good linear approximation of the trifocal tensor. It can be directly scaled up to register multiple cameras. The results obtained are accurate for point triangulation and can serve as a good initialization for final bundle adjustment. We evaluate the algorithm performance with different types of data and demonstrate its effectiveness. Our method produces good accuracy and robustness, and outperforms some well-known systems in efficiency.
3.1 Introduction
As introduced in Chapter 1, conventional SfM methods often consist of three steps. First, the pairwise poses and reliable feature correspondences are computed between two images. Second, all camera poses (including orientations and positions) and scene point coordinates are recovered in a global coordinate system according to these relative poses. Third, bundle adjustment (BA) [132] is applied to minimize the reprojection error, which guarantees a maximum likelihood estimation of the result.
While there are well established theories for the first and the third steps, the second step in existing systems is often ad hoc and heuristic. Some well-known systems, such as [124, 4], compute camera poses in an incremental fashion, where cameras are added one by one to the global coordinate system. Other successful systems, e.g. [42, 76, 56], take a hierarchical approach to gradually merge short sequences or partial reconstructions. In either case, intermediate BA is necessary to ensure successful reconstruction. However, frequent intermediate BA makes the reconstruction inefficient, and the incremental approach often suffers from large drifting error. Thus, it is highly desirable that all camera poses are solved simultaneously for efficiency and accuracy. There are several interesting pioneering works in this direction, e.g. [49, 62, 89, 140]. More recently, Sinha et al. [123] designed a robust multi-stage linear algorithm to register pairwise reconstructions with some compromise in accuracy. Arie-Nachimson et al. [6] derived a novel linear algorithm that is robust to different camera baseline lengths. Yet it still suffers from the same degeneracy as [49] for collinear cameras (e.g. cameras along a street).
In this chapter, we present a novel robust linear method. Like most solutions, we first calculate the camera orientations (rotations), e.g. using the method described in [89]. Unlike earlier algebraic methods, we compute the camera positions (translations) by minimizing a geometric error: the Euclidean distance between the camera centers and the lines collinear with their corresponding baselines. This novel approach generates more precise results, and does not degenerate with collinear camera motion. We want to stress that the robustness to collinear motion is an important advantage, since collinear motion is common (e.g. streetview images). Furthermore, our estimation of camera poses does not involve reconstructing any 3D point. Effectively, we first solve the ‘motion’ (camera poses), and then solve the ‘structure’ (scene points). This separation is advantageous, because there are far fewer unknowns in the camera poses. Our algorithm is highly efficient and can be easily scaled up as a result of this separation. Once the camera poses are recovered, the scene points can be reconstructed from nearby cameras.
In the special case of three cameras, our algorithm effectively computes the trifocal tensor from three essential matrices. In our experiments, we find that our method is more robust than the four-point algorithm [95], which solves for the trifocal tensor from three calibrated images.
3.2 Overview
We first derive our algorithm under the assumption of known EGs without gross error. Later, this assumption is relaxed to deal with incorrect EGs with large error in Section 3.4.
The inputs to our system are essential matrices between image pairs, which are computed by the five-point algorithm [94]. An essential matrix $E_{ij}$ between two images $i, j$ provides the relative rotation $R_{ij}$ and the translation direction $t_{ij}$. Here, $R_{ij}$ is a $3 \times 3$ orthonormal matrix and $t_{ij}$ is a $3 \times 1$ unit vector. Our goal is to recover all the absolute camera poses in a global coordinate system. We use a rotation matrix $R_i$ and a translation vector $c_i$ to denote the orientation and position of the $i$-th camera ($1 \le i \le N$). Ideally, the following equations should hold

$$R_j = R_{ij} R_i, \qquad R_j (c_i - c_j) \simeq t_{ij}. \tag{3.1}$$

Here, $\simeq$ means equality up to a scale. In real data, these equations will not hold precisely and we need to find a set of $R_i$, $c_i$ that best satisfy these equations.
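As a small illustration of Equation (3.1), the sketch below synthesizes the relative rotation and translation direction implied by two known absolute poses. The helper names `relative_pose` and `rot_z` and the numeric poses are assumptions made for this example only.

```python
import numpy as np

def relative_pose(R_i, c_i, R_j, c_j):
    """Relative rotation R_ij and unit translation direction t_ij from Eq. (3.1)."""
    R_ij = R_j @ R_i.T                 # so that R_j = R_ij R_i
    t = R_j @ (c_i - c_j)              # known only up to scale in practice
    return R_ij, t / np.linalg.norm(t)

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees (helper for the example)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

R_i, c_i = rot_z(10.0), np.array([0.0, 0.0, 0.0])
R_j, c_j = rot_z(40.0), np.array([2.0, 1.0, 0.5])
R_ij, t_ij = relative_pose(R_i, c_i, R_j, c_j)

# Rotating t_ij back into the global frame gives the baseline direction
# from camera i to camera j.
c_ij = -R_j.T @ t_ij
```

Note that the baseline length is lost in `t_ij`; only its direction survives, which is why scale has to be recovered by other means.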
Figure 3.1: Geometric explanation of Equation (3.4).
We design our method based on two criteria. Firstly, the solution should be simple and efficient. Approximate solutions are acceptable, since a final BA will be applied. Secondly, the camera poses should be solved separately from the scene points. There are often many more scene points than cameras, so solving camera poses without scene points significantly reduces the number of unknowns.
We first apply the linear method described in [89] to compute the global camera rotations $R_i$. We find it provides good results in experiments, though a more sophisticated method [54] might be used. Basically, it over-parameterizes $R_i$ by ignoring the orthonormal constraint on its column vectors and solves all the rotation matrices at once from the linear equations $R_j = R_{ij} R_i$. Once all rotations are fixed, we then solve all camera centers ($c_i$, $1 \le i \le N$) without reconstructing any 3D point.
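The over-parameterized rotation step can be sketched as follows. This is an illustration in the spirit of the linear approach described above, not a reproduction of [89]: it assumes a connected graph, unweighted equations, a dense SVD, the gauge $R_0 = I$, and a final SVD projection back to a rotation. The function names are hypothetical.

```python
import numpy as np

def project_to_so3(M):
    """Closest rotation matrix to M in the Frobenius sense, via SVD."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])   # enforce det = +1
    return U @ D @ Vt

def linear_rotation_averaging(n_cams, rel_rots):
    """Solve R_j = R_ij R_i for all cameras, ignoring orthonormality.

    rel_rots maps an edge (i, j) to R_ij. Returns a list of rotations
    expressed relative to camera 0 (i.e. with R_0 = I).
    """
    B = np.zeros((3 * len(rel_rots), 3 * n_cams))
    for r, ((i, j), R_ij) in enumerate(rel_rots.items()):
        B[3*r:3*r+3, 3*j:3*j+3] = np.eye(3)          # + R_j
        B[3*r:3*r+3, 3*i:3*i+3] = -R_ij              # - R_ij R_i
    # The stacked rotations span the 3-dimensional (near) null space of B.
    _, _, Vt = np.linalg.svd(B)
    V = Vt[-3:].T                                    # 3N x 3
    gauge = np.linalg.inv(V[:3])                     # fixes R_0 = I
    return [project_to_so3(V[3*i:3*i+3] @ gauge) for i in range(n_cams)]
```

For large problems a sparse eigen-solver would replace the dense SVD; the dense version is used here only to keep the sketch short.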
3.3 Translation registration
Given the global camera rotations computed in the previous section, we first transform each $t_{ij}$ to the global rotation reference frame as $c_{ij} = -R_j^\top t_{ij}$. The constraint on camera centers in Equation (3.1) can be written as in [49],

$$c_{ij} \times (c_j - c_i) = 0. \tag{3.2}$$

Here, $\times$ is the cross product. This is a linear equation in the unknown camera centers. However, equations obtained this way degenerate for collinear camera motion. Furthermore, as discussed in [49], equations for image pairs with larger baseline lengths are given larger weights. Careful iterative re-weighting is required for good results. In fact, Equation (3.2) minimizes the cross product between $c_{ij}$ and the baseline direction $c_j - c_i$. Minimizing such an algebraic error [55] is known to be sub-optimal in many 3D vision problems. In the following, we derive a linear algorithm that minimizes an approximate geometric error.
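The collinear degeneracy of Equation (3.2) can be observed numerically by counting the null-space dimension of the stacked pairwise constraints. The sketch below uses noise-free directions generated from hypothetical camera centers; `pairwise_system` and `nullity` are illustrative names.

```python
import numpy as np

def skew(v):
    """Matrix [v]_x such that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def pairwise_system(centers, edges):
    """Stack c_ij x (c_j - c_i) = 0 over all edges into one matrix."""
    centers = np.asarray(centers, dtype=float)
    A = np.zeros((3 * len(edges), 3 * len(centers)))
    for r, (i, j) in enumerate(edges):
        c_ij = centers[j] - centers[i]
        S = skew(c_ij / np.linalg.norm(c_ij))
        A[3*r:3*r+3, 3*j:3*j+3] = S
        A[3*r:3*r+3, 3*i:3*i+3] = -S
    return A

def nullity(A):
    """Dimension of the null space of A."""
    return A.shape[1] - np.linalg.matrix_rank(A)

edges = [(0, 1), (0, 2), (1, 2)]
triangle = [[0, 0, 0], [1, 0, 0], [0.3, 1, 0]]
collinear = [[0, 0, 0], [1, 0, 0], [3, 0, 0]]
n_triangle = nullity(pairwise_system(triangle, edges))    # translation + scale
n_collinear = nullity(pairwise_system(collinear, edges))  # extra freedom along the line
```

For the triangle the only remaining freedoms are the global translation and scale (four dimensions). For the collinear cameras one more dimension appears, because each camera can slide along the line independently, so the relative baseline lengths cannot be recovered from directions alone.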
3.3.1 Triplet translation registration
We begin with the special case of three cameras. The relative translations $c_{ij}$, $c_{ik}$, and $c_{jk}$ between camera pairs are known. We need to estimate the camera centers $c_i$, $c_j$, and $c_k$. Ideally, the three unit vectors $c_{ij}$, $c_{ik}$, and $c_{jk}$ should be coplanar. However, various measurement noises often make them non-coplanar in real data, i.e. $(c_{ij}, c_{ik}, c_{jk}) \neq 0$. Here, $(\cdot, \cdot, \cdot)$ is the scalar triple product.
We first consider $c_{ij}$ as perfect and minimize the Euclidean distance between $c_k$ and the two lines $l(c_i, c_{ik})$ and $l(c_j, c_{jk})$. Here, $l(p, q)$ is the line passing through a point $p$ with the orientation $q$. Due to measurement noise, $l(c_i, c_{ik})$ and $l(c_j, c_{jk})$ are generally non-coplanar. Thus, the optimal solution $c_k$ lies at the midpoint of their common perpendicular $AB$, as shown in Figure 3.1. Because the three vectors $c_{ij}$, $c_{ik}$ and $c_{jk}$ should be close to coplanar, the angle $\angle A c_i c_k$ is close to zero, and the length of $c_i A$ is close to that of $c_i c_k$. We can calculate the length of $c_i c_k$ as

$$\frac{\sin\theta_j}{\sin\theta_k}\,\|c_i - c_j\| \approx \frac{\sin\theta'_j}{\sin\theta'_k}\,\|c_i - c_j\| = s^{ik}_{ij}\,\|c_i - c_j\|. \tag{3.3}$$

Here, $\|c_i - c_j\|$ is the distance between $c_i$ and $c_j$, and $s^{ik}_{ij} = \sin\theta'_j / \sin\theta'_k = \|c_i - c_k\| / \|c_i - c_j\|$ is the baseline length ratio. The angles are depicted in Figure 3.1. Note that $\theta'_j \approx \theta_j$ and $\theta'_k \approx \theta_k$ because the three vectors $c_{ij}$, $c_{ik}$ and $c_{jk}$ are close to coplanar. The 3D coordinate of $A$ is then approximated by $c_i + s^{ik}_{ij}\,\|c_i - c_j\|\,c_{ik}$. Similarly, we can obtain the coordinate of $B$ as $c_j + s^{jk}_{ij}\,\|c_i - c_j\|\,c_{jk}$, where $s^{jk}_{ij} = \sin\theta'_i / \sin\theta'_k = \|c_j - c_k\| / \|c_i - c_j\|$. As a result, the optimal position $c_k$, which is the midpoint of $AB$, can be calculated as

$$c_k \approx \frac{1}{2}\left( c_i + s^{ik}_{ij}\,\|c_i - c_j\|\,c_{ik} + c_j + s^{jk}_{ij}\,\|c_i - c_j\|\,c_{jk} \right). \tag{3.4}$$
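The geometric target of this derivation, the midpoint of the common perpendicular of two skew lines, has a closed form. The sketch below computes it exactly for two lines given by a point and a direction; it is the quantity that Equation (3.4) approximates linearly, and the function name is an assumption for illustration.

```python
import numpy as np

def midpoint_of_common_perpendicular(p1, d1, p2, d2):
    """Point closest (in the least-squares sense) to two 3D lines.

    Each line is given by a point p and a direction d. Returns the midpoint
    of the common perpendicular together with its end points A and B.
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    d1 = np.asarray(d1, float) / np.linalg.norm(d1)
    d2 = np.asarray(d2, float) / np.linalg.norm(d2)
    w = p1 - p2
    b = d1 @ d2
    denom = 1.0 - b * b
    if denom < 1e-12:
        raise ValueError("lines are (nearly) parallel")
    # Minimize |p1 + s*d1 - (p2 + t*d2)|^2 over s and t.
    s = (b * (d2 @ w) - (d1 @ w)) / denom
    t = ((d2 @ w) - b * (d1 @ w)) / denom
    A = p1 + s * d1
    B = p2 + t * d2
    return 0.5 * (A + B), A, B

# Two skew lines: the x axis, and a line parallel to the y axis at height z = 1.
mid, A, B = midpoint_of_common_perpendicular([0, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0])
```

When the two lines intersect, $A$ and $B$ coincide and the midpoint is the intersection itself, which corresponds to the noise-free, coplanar case.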
Equation (3.4) is nonlinear in the unknown camera centers. To linearize it, we observe that

$$\|c_i - c_j\|\,c_{ik} = \|c_i - c_j\|\,R_i(\theta'_i)\,c_{ij} = R_i(\theta'_i)(c_j - c_i). \tag{3.5}$$

Here, $R_i(\phi)$ is the rotation matrix around the axis $c_{ij} \times c_{ik}$ by an angle $\phi$ (counter-clockwise). Thus we obtain the following linear equation,

$$2c_k - c_i - c_j = R_i(\theta'_i)\,s^{ik}_{ij}\,(c_j - c_i) + R_j(-\theta'_j)\,s^{jk}_{ij}\,(c_i - c_j). \tag{3.6}$$

Note that $R_j(\cdot)$ is a rotation matrix around the direction $c_{ij} \times c_{jk}$. Similarly, we can obtain the following two linear equations of camera centers by assuming $c_{ik}$ and $c_{jk}$ are free from error, respectively:

$$2c_j - c_i - c_k = R_i(-\theta'_i)\,s^{ij}_{ik}\,(c_k - c_i) + R_k(\theta'_k)\,s^{jk}_{ik}\,(c_i - c_k), \tag{3.7}$$

$$2c_i - c_j - c_k = R_j(\theta'_j)\,s^{ij}_{jk}\,(c_k - c_j) + R_k(-\theta'_k)\,s^{ik}_{jk}\,(c_j - c_k). \tag{3.8}$$

Solving these three linear equations determines the camera centers up to a similarity transformation. Note that Equation (3.6) does not require the orientation of $c_j - c_i$ to be the same as $c_{ij}$. This introduces a rotation ambiguity in the plane defined by the camera centers. We can resolve it after the initial registration by computing the average rotation that aligns $c_j - c_i$, $c_k - c_i$ and $c_k - c_j$ with the projections of $c_{ij}$, $c_{ik}$ and $c_{jk}$ in the camera plane, respectively.
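As a consistency check, one can verify that Equation (3.6) holds exactly for noise-free data, where $\theta'_i = \theta_i$, $\theta'_j = \theta_j$ and all three directions are coplanar. The short derivation below uses only the definitions of the baseline length ratios and of the two rotations given above.

```latex
\begin{aligned}
s^{ik}_{ij}\, R_i(\theta_i)\,(c_j - c_i)
  &= \frac{\|c_i - c_k\|}{\|c_i - c_j\|}\,\|c_i - c_j\|\; R_i(\theta_i)\, c_{ij}
   = \|c_i - c_k\|\; c_{ik} = c_k - c_i, \\
s^{jk}_{ij}\, R_j(-\theta_j)\,(c_i - c_j)
  &= \frac{\|c_j - c_k\|}{\|c_i - c_j\|}\,\|c_i - c_j\|\; R_j(-\theta_j)\,(-c_{ij})
   = \|c_j - c_k\|\; c_{jk} = c_k - c_j, \\
\text{so the right-hand side of (3.6) equals}\quad
  & (c_k - c_i) + (c_k - c_j) = 2c_k - c_i - c_j .
\end{aligned}
```

The same argument applies to Equations (3.7) and (3.8) after permuting the roles of the three cameras.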
Collinear Camera Motion. Calculating baseline length ratios by the sines of the angles as described earlier is only valid when $c_{ij}$, $c_{ik}$ and $c_{jk}$ are not collinear. In order to be robust regardless of the type of camera motion, we compute all baseline length ratios from locally reconstructed scene points. Suppose a 3D scene point $X$ is visible in all three images. From the pairwise reconstruction of images $i, j$, we compute its depth $d^{ij}_j$ in image $j$ while assuming unit baseline length. Similarly, we can calculate $d^{jk}_j$, which is the depth of $X$ in image $j$ from the reconstruction of images $j, k$. The ratio $s^{ij}_{jk}$ is then estimated as $d^{jk}_j / d^{ij}_j$. In general, more than one scene point is visible in all three images. We discard distant points and use RANSAC [41] to compute an average ratio. Note that we only require local pairwise reconstructions to obtain baseline length ratios. The translation registration does not involve reconstructing any scene point in the global coordinate system.
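The depth-based ratio estimate can be sketched with a one-dimensional RANSAC loop. The details below are assumptions chosen for illustration, not taken from the thesis: the relative inlier tolerance, the number of iterations, the optional depth cut-off used to discard distant points, and the function name `baseline_ratio`.

```python
import numpy as np

def baseline_ratio(depth_ij, depth_jk, max_depth=None, rel_tol=0.05, iters=100, seed=0):
    """Estimate s^{ij}_{jk} = d^{jk}_j / d^{ij}_j from points seen in all three images.

    depth_ij[n] and depth_jk[n] are the depths of point n in image j, computed
    from the (i, j) and (j, k) pairwise reconstructions with unit baselines.
    """
    d_ij = np.asarray(depth_ij, dtype=float)
    d_jk = np.asarray(depth_jk, dtype=float)
    keep = (d_ij > 0) & (d_jk > 0)
    if max_depth is not None:                    # discard distant points
        keep &= (d_ij < max_depth) & (d_jk < max_depth)
    ratios = d_jk[keep] / d_ij[keep]
    if ratios.size == 0:
        raise ValueError("no usable points for this triplet")
    rng = np.random.default_rng(seed)
    best = np.zeros(ratios.size, dtype=bool)
    for _ in range(iters):
        candidate = ratios[rng.integers(ratios.size)]      # minimal sample: one point
        inliers = np.abs(ratios - candidate) <= rel_tol * candidate
        if inliers.sum() > best.sum():
            best = inliers
    return float(ratios[best].mean())            # average over the consensus set

# 20 consistent points with ratio 2.0 plus three gross outliers.
d_ij = np.concatenate([np.linspace(2.0, 6.0, 20), [3.0, 4.0, 5.0]])
d_jk = np.concatenate([2.0 * np.linspace(2.0, 6.0, 20), [15.0, 1.2, 45.0]])
ratio = baseline_ratio(d_ij, d_jk)
```

Because the ratio is computed from depths rather than from angles, it remains well defined when the three cameras lie on a line.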
3.3.2 Multi-view translation registration
Our method can be applied directly to register multiple cameras. Given a triplet graph (see definition in Section 3.4), we collect all equations (i.e. Equations (3.6)–(3.8)) from its triplets and solve the resulting sparse linear system $AC = 0$. Here, $C$ is a vector formed by concatenating all camera centers, and $A$ is the matrix formed by collecting all the linear equations. The solution is a non-trivial null vector of the matrix $A$, and is given by the eigenvector associated with the fourth smallest eigenvalue of $A^\top A$. The three smaller eigenvalues are zero, since the equations only involve differences of camera centers and are therefore unaffected by a common translation of all cameras. In general, camera centers are determined up to a scale and translation. In the special case where all cameras are coplanar (i.e. the rotation ambiguities in all triplets share the same rotation axis), there is a global in-plane rotation ambiguity similar to the three-camera case. We can use the same method described before to compute this rotation.
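The assembly and solution of $AC = 0$ can be sketched on a small noise-free example. Several choices below are assumptions made for illustration: baseline ratios come from the law of sines (so the cameras must not be collinear), the translation gauge is removed by fixing camera 0 at the origin instead of selecting the fourth eigenvector, and a dense SVD stands in for a sparse eigen-solver. All function names are hypothetical.

```python
import itertools
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def rot_between(a, b):
    """Rotation taking unit vector a onto unit vector b (Rodrigues formula)."""
    axis = np.cross(a, b)
    s, c = np.linalg.norm(axis), float(a @ b)
    if s < 1e-12:                       # (anti)parallel input: not handled in this sketch
        return np.eye(3)
    k = axis / s
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + s * K + (1 - c) * (K @ K)

def triplet_rows(tri, d, n_cams):
    """Rows of A for one triplet, one 3-row block per Equation (3.6)-(3.8)."""
    angle = lambda u, v: np.arccos(np.clip(u @ v, -1.0, 1.0))
    rel_len = {}                        # law of sines: edge length ~ sin(opposite angle)
    for p in tri:
        a, b = [q for q in tri if q != p]
        rel_len[frozenset((a, b))] = np.sin(angle(d[p, a], d[p, b]))
    blocks = []
    for p in tri:                       # p is the camera solved for; (a, b) is the trusted edge
        a, b = [q for q in tri if q != p]
        base = rel_len[frozenset((a, b))]
        s_a = rel_len[frozenset((a, p))] / base
        s_b = rel_len[frozenset((b, p))] / base
        R_a = rot_between(d[a, b], d[a, p])
        R_b = rot_between(d[b, a], d[b, p])
        blk = np.zeros((3, 3 * n_cams))
        blk[:, 3*p:3*p+3] = 2 * np.eye(3)
        blk[:, 3*a:3*a+3] = -np.eye(3) + s_a * R_a - s_b * R_b
        blk[:, 3*b:3*b+3] = -np.eye(3) - s_a * R_a + s_b * R_b
        blocks.append(blk)
    return np.vstack(blocks)

def solve_centers(triplets, d, n_cams):
    """Camera centers up to scale, with camera 0 fixed at the origin."""
    A = np.vstack([triplet_rows(t, d, n_cams) for t in triplets])
    _, _, Vt = np.linalg.svd(A[:, 3:])  # drop camera 0 columns to fix the translation gauge
    return np.vstack([np.zeros(3), Vt[-1].reshape(-1, 3)])

centers = np.array([[0.3, -0.2, 0.1], [1.0, 0.1, 0.0], [0.4, 1.2, 0.2], [0.5, 0.4, 1.1]])
d = {(a, b): unit(centers[b] - centers[a]) for a in range(4) for b in range(4) if a != b}
est = solve_centers(list(itertools.combinations(range(4), 3)), d, 4)
```

The four cameras in this example are not coplanar, so the in-plane rotation ambiguity of the individual triplets does not survive in the combined system and the estimate matches the true centers up to a single scale factor.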
Scene Point Constraints. In practice, some images participate in fewer triplets. In their involved triplets, we introduce additional constraints derived from scene points to enhance the robustness and accuracy of the reconstruction. Suppose a 3D point $X$ is visible at $x_i$ in the image $i$. Here, $x_i$ is a homogeneous image coordinate. We should have

$$d_i x_i = R_i (X - c_i) \;\Rightarrow\; d_i R_i^\top x_i + c_i = X,$$

where $d_i$ is the depth of $X$ in image $i$. If $X$ is also visible at $x_j$, $x_k$ in the images $j, k$, we can obtain $d_j R_j^\top x_j + c_j = X$ and $d_k R_k^\top x_k + c_k = X$.
Figure 3.2: (a) A connected component of the match graph. (b) The two corresponding connected triplet graphs.
After eliminating $X$, we obtain three equations as follows

$$\begin{aligned}
d_i R_i^\top x_i + c_i &= d_j R_j^\top x_j + c_j, \\
d_i R_i^\top x_i + c_i &= d_k R_k^\top x_k + c_k, \\
d_j R_j^\top x_j + c_j &= d_k R_k^\top x_k + c_k.
\end{aligned} \tag{3.9}$$

In a triplet, each 3D point introduces 3 unknown depths. There are 9 unknowns for the three camera centers. Equation (3.9) includes 6 independent linear constraints. Therefore, two points visible in a triplet can solve both camera centers and point locations. In our system, we include one 3D point for a triplet if one of its vertices (an image) participates in fewer than 4 valid triplets.
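The count behind the statement that two points suffice can be spelled out as follows. The only assumption added here is that the four gauge freedoms (a global translation and a global scale, since the rotations are already fixed) are subtracted from the number of unknowns; $P$ denotes the number of scene points used in the triplet.

```latex
\begin{aligned}
\text{unknowns:}    &\quad \underbrace{9}_{c_i,\,c_j,\,c_k} + \underbrace{3P}_{\text{depths}}
                      - \underbrace{4}_{\text{translation and scale}} = 5 + 3P, \\
\text{constraints:} &\quad 6P, \\
6P \ge 5 + 3P       &\;\Longleftrightarrow\; P \ge \tfrac{5}{3}
                      \;\Longrightarrow\; P = 2 \text{ is the smallest sufficient number.}
\end{aligned}
```

With $P = 1$ there are 8 effective unknowns against 6 constraints, which is why a single point can only supplement, not replace, the triplet equations.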
3.4 Generalization to EG outliers
The method described in Section 3.2 and Section 3.3 is applicable when there is no gross error in the pairwise epipolar geometries (EGs). However, many image sets, especially unordered Internet images, can generate incorrect EGs with large error due to unreliable feature matching, especially for scenes with repetitive structures. Incorrect EGs result in wrong estimates of rotations and translations. We take the following steps to build a robust system.
Match Graph Construction. For each input image, we find its 80 nearest neighbors by the method described in [96]. EGs between these images are then computed with the five-point algorithm [94]. We only reconstruct the largest connected component of the match graph.
EG Verification. We perform several verifications to identify incorrect EGs. 1) We verify every triplet in the match graph, and remove EGs which participate in no triplet that passes the verification. Specifically, we apply our translation registration to each triplet and calculate the average difference between the relative translation directions before and after the registration. If this average difference is larger than 3°, we consider the verification failed. We further require that at least one good point (with reprojection error smaller than 4 pixels) can be triangulated by the registered triplet cameras. 2) Among the edges of the match graph, we extract a subset of ‘reliable edges’ to compute the global camera orientations as described in Section 3.2. We first weight each edge by its number of correspondences and take the maximum spanning tree. We then go through all the valid triplets. If two edges of a triplet are in the selected set of ‘reliable edges’, we insert its third edge as well. We iterate this insertion to include as many reliable edges as possible. 3) We further use these camera orientations to verify the match graph edges, and discard an edge if the geodesic distance [54] between the loop rotation matrix [139] and the identity matrix is greater than 5 degrees. Here, the loop rotation matrix in our case is simply $R_{ij}^\top R_j R_i^\top$. 4) Finally, we only consider the largest connected component of the remaining match graph.
Connected Triplet Graph. We further extract connected triplet graphs from the match graph, where each triplet is represented by a vertex. Two vertices are connected if their triplets are neighboring triangles with a common edge in the match graph. A single connected component of the match graph could generate multiple connected triplet graphs, as illustrated in Figure 3.2. We then apply our method in Section 3.3 to compute the positions of cameras in each triplet graph separately. We triangulate 3D scene points from feature tracks after solving the camera positions. When there are multiple triplet graphs, their reconstructions are merged to obtain the final result. Specifically, we take their matched features to perform a 3D-to-3D registration for this merge.
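The rotation consistency test of step 3 amounts to measuring the rotation angle of the loop matrix $R_{ij}^\top R_j R_i^\top$. A minimal sketch follows; the function names and the example rotations are assumptions for illustration, while the 5 degree threshold is the one stated above.

```python
import numpy as np

def geodesic_deg(R):
    """Rotation angle of R in degrees, i.e. its geodesic distance to the identity."""
    cos_angle = (np.trace(R) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def edge_is_consistent(R_ij, R_i, R_j, thresh_deg=5.0):
    """Keep edge (i, j) if its relative rotation agrees with the global rotations."""
    loop = R_ij.T @ R_j @ R_i.T
    return geodesic_deg(loop) <= thresh_deg

def rot_z(deg):
    """Rotation about the z axis by `deg` degrees (helper for the example)."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

R_i, R_j = rot_z(0.0), rot_z(30.0)
good = edge_is_consistent(rot_z(30.0), R_i, R_j)   # loop rotation is the identity
bad = edge_is_consistent(rot_z(40.0), R_i, R_j)    # loop rotation is 10 degrees
```

The clipping of the cosine guards against values slightly outside $[-1, 1]$ caused by rounding.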
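The extraction of connected triplet graphs can be sketched with a union-find structure over triangles of the match graph. This is a simplified illustration: it enumerates every triangle by brute force and treats all of them as valid, whereas the system described above only keeps triplets that pass verification. The function name is hypothetical.

```python
from itertools import combinations

def triplet_graph_components(edges):
    """Group the triangles of a match graph into connected triplet graphs.

    Two triplets are connected when they share a common edge. Returns a
    sorted list of components, each a list of (i, j, k) tuples.
    """
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({v for e in edge_set for v in e})
    triplets = [t for t in combinations(nodes, 3)
                if all(frozenset(p) in edge_set for p in combinations(t, 2))]

    parent = list(range(len(triplets)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    # Triplets that contain the same match-graph edge belong together.
    by_edge = {}
    for idx, t in enumerate(triplets):
        for p in combinations(t, 2):
            by_edge.setdefault(frozenset(p), []).append(idx)
    for members in by_edge.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    components = {}
    for idx, t in enumerate(triplets):
        components.setdefault(find(idx), []).append(t)
    return sorted(components.values())

# A connected match graph whose triangles (0,1,2) and (1,2,3) share edge (1,2),
# while triangle (3,4,5) only shares the vertex 3 with them.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
components = triplet_graph_components(edges)
```

In this example a single connected match graph yields two triplet graphs, which would be reconstructed separately and then merged.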
3.5 Experiments
We verify our algorithm with a variety of experiments. We conduct our experiments on a 64-bit Windows platform with a 3.07 GHz CPU. ARPACK is used to solve the sparse eigenvalue problem.
3.5.1 Trifocal tensor estimation
We first evaluate our method quantitatively on three synthetic input images with known ground truth. We use a similar test geometry as in [95] (shown in Figure 3.3). Camera 0 is placed at the world origin and camera 2 is placed at a random location 0.2 unit away from camera 0. The location of camera 1 is sampled randomly in the sphere centered at the midpoint between cameras 0 and 2 and passing through their camera centers. We further require the distance between any two cameras to be greater than 0.05 unit (which ensures the baseline length between any two cameras is not too small with respect to the scene distance, which is 1 unit here). The scene points are generated randomly within the viewing volume of the first camera, and the distance between the nearest scene point and the furthest scene point is about 0.5 unit. The dimension of the synthetic image is 352 × 288 pixels and the field of view is 45°. Pairwise EG is computed using the five-point algorithm [94]. Zero-mean Gaussian noise is added to the image coordinates of the projected 3D points.

Figure 3.3: The test geometry used in comparison with the four-point algorithm [95].
Figure 3.4: Comparison with the four-point algorithm [95] (3V4P). Our method generates better results in all the three metrics.
Figure 3.5: Comparison with Arie-Nachimson et al. [6]. Our method is much more stable in translation estimation for near collinear camera motions.
We evaluate the reconstruction accuracy with three metrics. The error of camera orientations $R_{err}$ is the mean geodesic distance (in degrees) between the estimated and the true camera rotation matrices. The translation angular error $t_{err}$ is the mean angular difference between the estimated and the true baseline directions. The absolute camera location error $c_{err}$ is the mean Euclidean distance between the estimated and the true camera center positions. All the metrics reported below are averaged over 50 random experiments.
Comparison with [95]. We compare with the four-point algorithm [95], which is the only practical algorithm to compute the trifocal tensor from three calibrated images as far as we know. The reconstruction accuracy of both methods under different amounts of noise is shown in Figure 3.4, where the horizontal axis shows the standard deviation of the Gaussian noise. Our linear algorithm outperforms the four-point algorithm in all metrics under various noise levels.
Comparison with [6]. We also compare with the recent method [6] to demonstrate the robustness of our method on near collinear camera motions. Here, we generate $c_0$ and $c_2$ as described before. We sample $c_1$ along a random direction spanning an angle of 0.1 to 5 degrees with the line $c_0 c_2$. Its location along that direction is randomly sampled while ensuring that the angle $\angle c_1 c_0 c_2$ is the smallest angle in the triangle $c_0 c_1 c_2$. Gaussian noise with a standard deviation of 0.5 pixels is used. The reconstruction accuracy is reported in Figure 3.5. It is clear that our method produces more stable results for near collinear motion.
Figure 3.6: Input images and reconstructed point clouds of (a) fountain-P11, (b) Herz-Jesu-P25, (c) castle-P30.
3.5.2 Multi-view reconstruction
We test the performance of our method with some standard benchmark datasets with known ground-truth camera motion to quantitatively evaluate the reconstruction accuracy. We also experiment with some relatively large scale image collections (sequential and unordered) to evaluate its scalability and robustness.
Evaluation on Benchmark Dataset. We compare our method with some well-known and recent works² on the benchmark datasets provided in [127]. All results reported are computed using calibration information extracted from the EXIF tags unless stated otherwise. With our linear method, the average reprojection error is about 2 pixels for fountain-P11 and Herz-Jesu-P25, and 4 pixels for castle-P30. After the final bundle adjustment, it is reduced to below 0.3 pixels for all three datasets. To provide a visual validation, we apply the CMVS algorithm [45, 46] to reconstruct dense point clouds with our reconstructed camera parameters (after the final bundle adjustment). The results are visualized in Figure 3.6.
Table 3.1 summarizes the quantitative results of using both EXIF and ground truth calibration information. The absolute errors in camera locations are measured in meters. On average, our method produces an error in $c_i$ of about 0.3% of the distance between the two farthest cameras. The results of our linear solution before bundle adjustment are provided as ‘Ours (L)’. Our method provides good initialization for the bundle adjustment, and it achieves higher accuracy than [123]. It also outperforms [6] and VisualSFM [136] on the fountain-P11 example, and achieves similar results on the Herz-Jesu-P25 example. It also performs similarly to VisualSFM on the castle-P30 example. Bundler [124] produces similar or slightly inferior results as compared to VisualSFM on these datasets.

² The results by the method [6] are kindly provided by its authors. The results by the method [123] are cited from [6]. We use the code shared by the authors of [136] to generate their results.

fountain-P11                   c (EXIF)   R (EXIF)   c (GT cal.)   R (GT cal.)
Ours (L)                       0.0528     0.5172     0.0336        0.3572
Ours                           0.0139     0.1954     0.0039        0.0307
Arie-Nachimson et al. [6]      0.0226     0.4211     0.0029        0.0288
Sinha et al. [123]             0.1317     -          -             -
VisualSFM [136]                0.0364     0.2794     0.0029        0.0301

Herz-Jesu-P25                  c (EXIF)   R (EXIF)   c (GT cal.)   R (GT cal.)
Ours (L)                       0.106      0.5732     0.082         0.3939
Ours                           0.0636     0.188      0.0093        0.0432
Arie-Nachimson et al. [6]      0.0479     0.3125     0.0053        0.0308
Sinha et al. [123]             0.2538     -          -             -
VisualSFM [136]                0.0551     0.2868     0.0071        0.0405

castle-P30                     c (EXIF)   R (EXIF)   c (GT cal.)   R (GT cal.)
Ours (L)                       1.15832    1.6513     0.8735        0.7674
Ours                           0.2345     0.48       0.0719        0.1265
Arie-Nachimson et al. [6]      -          -          -             -
Sinha et al. [123]             -          -          -             -
VisualSFM [136]                0.2639     0.398      0.0711        0.1458

Table 3.1: Reconstruction accuracy of the three benchmark datasets. The terms R and c denote the absolute camera rotation error (in degrees) and camera location error (in meters) after final bundle adjustment, respectively.

Scalability and Time Efficiency. We evaluate the scalability and efficiency of our method with four relatively large-scale image collections. The Building³ example consists of 128 sequentially captured images. Our method recovers the cameras correctly despite the presence of a small fraction of erroneous epipolar geometries arising from symmetric scene structures. The Trevi Fountain and Pisa examples consist of 954 and 481 images downloaded from Flickr.com. Interestingly, the largest two connected triplet graphs in the Trevi Fountain example correspond to the daytime and the nighttime images respectively. The reconstructions from these two triplet graphs are merged in the final result. For the Trevi Fountain and Pisa examples, there are 222663 and 90081
scene points reconstructed, respectively. We also test our method on the publicly available Notre Dame example. We use the 568 images from which we can extract EXIF tags, and the largest connected component of the match graph consists of 371 views. Our method took 165 seconds to compute the initial reconstruction from pairwise epipolar geometries. The final bundle adjustment⁴ took 107 seconds with 115214 reconstructed scene points. Of the 165 seconds spent on the initial reconstruction, only 65 seconds were used for camera registration. The EG verification took 64 seconds, and the triangulation of 3D points took another 30 seconds. VisualSFM [136] took 756 seconds (excluding pairwise matching and EG computation) to reconstruct the same 371
views using four CPUs. Note that all the modules in our current implementation use a single CPU except for bundle adjustment. Typically, the average reprojection error is about 5 pixels after our linear initialization, and is reduced to 1 pixel after bundle adjustment. To provide a visual validation, we feed our reconstructed cameras to CMVS [45, 46] and visualize the dense reconstruction in Figure 3.7.

³ The dataset is obtained from http://www.inf.ethz.ch/personal/chzach/opensource.html
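The Notre Dame experiment above restricts the reconstruction to the largest connected component of the match graph. As a minimal sketch of that selection step, the code below runs a breadth-first search over an undirected graph whose nodes are views and whose edges are view pairs with a verified epipolar geometry. The function name, the edge-list input format and the toy graph are hypothetical; the thesis does not describe how this step is implemented.

```python
from collections import defaultdict, deque

def largest_connected_component(num_views, matches):
    """Return the sorted view indices of the largest connected component.

    `matches` is a list of (i, j) view pairs treated as undirected edges.
    """
    adjacency = defaultdict(list)
    for i, j in matches:
        adjacency[i].append(j)
        adjacency[j].append(i)

    seen = set()
    best = []
    for start in range(num_views):
        if start in seen:
            continue
        # Breadth-first search collects every view reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        if len(component) > len(best):
            best = component
    return sorted(best)

# Toy match graph (made-up): views 0-2 are linked, 3-4 are linked, 5 is isolated.
toy_matches = [(0, 1), (1, 2), (3, 4)]
component = largest_connected_component(6, toy_matches)
print(component)  # [0, 1, 2]
```

The same traversal applies to a triplet graph if the nodes are camera triplets instead of views, for instance to separate daytime and nighttime groups like those observed in the Trevi Fountain example.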
3.6 Summary
In this chapter, we present a novel linear solution for the global camera pose registration problem. Our method is derived by minimizing an approximate geometric error. It is free from the degeneracy that commonly affects linear methods on collinear motion, and is robust to different baseline lengths between cameras. For the case of three cameras, it produces more accurate results than prior trifocal tensor estimation methods on calibrated images. For general multiple cameras, it outperforms prior works in accuracy, robustness, or efficiency.