Comparison with State-of-the-Art - Semantic Match Consistency for Long-Term Visual

Paper II Semantic Match Consistency for Long-Term Visual

4.2 Comparison with State-of-the-Art

After demonstrating the benefit of our proposed semantic consistency scoring, we compare our localization pipeline against state-of-the-art approaches on both datasets, using the results reported in [18]. More concretely, we compare against ActiveSearch (AS) [19] and the City-Scale Localization (CSL) [8] methods, which represent the state-of-the-art in efficient and scalable localization, respectively. In addition, we compare against two image retrieval-based baselines, namely DenseVLAD [28] and NetVLAD [23], when their results are available in [18]. We omitted results for the methods LocalSfM, DenseSfM, ActiveSearch+Generalized Camera, and FAB-MAP [60] present in [18], since these use either a sequence of images (the latter two), costly SfM approaches coupled with a strong location prior (the former two), or use ground truth information (the former three), and are thus not directly comparable. For a fair comparison with AS and CSL, we use the variant of our localization pipeline that uses semantic consistency scoring and the P3P solver.

Tables 3 and 4 show the results of our comparison. As can be seen, our approach significantly outperforms both AS and CSL, especially in the high-precision regime. The comparison with CSL is particularly interesting, as our pose generation stage is based on its geometric outlier filtering strategy. The clear improvements over CSL validate our idea of incorporating scene semantics into the pose estimation stage in general and the idea of using non-matching 3D points to score matches in particular.

On the CMU dataset, both DenseVLAD and NetVLAD can localize more query images in the coarse-precision regime (5 m, 10°). Both approaches represent images using a compact image-level descriptor and approximate the pose of the query image using the pose of the top-retrieved database image. Neither method uses any feature matching between images. As shown in Fig. 6, this allows DenseVLAD and NetVLAD to handle scenarios with very strong appearance changes in which feature matching completely fails. Note that either DenseVLAD or NetVLAD could be used as a fallback option for our approach.

Interestingly, the P3P RANSAC baseline outperforms AS and CSL in several instances. This is likely due to differing feature matching strategies and different numbers of RANSAC iterations. Active Search uses a very strict ratio test, which causes problems in challenging scenes. CSL was evaluated on CMU Seasons by keeping all detected features (no ratio test), resulting in several thousand matches per image. CSL may have yielded better results with a ratio test.

In addition, we compare our approach to two methods based on P3P RANSAC. The first is PROSAC [59], a RANSAC variant that uses a deterministic sampling strategy, where correspondences deemed more likely to be correct are given higher priority during sampling. In our experiments, the quality measure used was the Euclidean distance between the descriptors of the observed 2D point and the corresponding matched 3D point.
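The ranking by descriptor distance can be illustrated with a minimal sketch. It assumes descriptors are given as plain tuples of floats, and the helper names `rank_by_descriptor_distance` and `progressive_samples` are hypothetical. The sampling loop is a deliberate simplification: it grows the candidate pool by one match per draw and does not reproduce the growth function of PROSAC [59].

```python
import math
import random

def rank_by_descriptor_distance(desc_2d, desc_3d):
    """Order match indices from smallest to largest descriptor distance."""
    dists = [math.dist(a, b) for a, b in zip(desc_2d, desc_3d)]
    order = sorted(range(len(dists)), key=lambda i: dists[i])
    return order, dists

def progressive_samples(order, sample_size, num_samples, rng):
    """Yield minimal samples drawn from a growing prefix of the ranked matches.

    Simplification: the prefix grows by one match per draw, which is NOT
    the growth schedule used by PROSAC itself.
    """
    pool_size = sample_size
    for _ in range(num_samples):
        yield rng.sample(order[:pool_size], sample_size)
        pool_size = min(pool_size + 1, len(order))

# Toy 2D descriptors for four matches (2D feature vs. matched 3D point).
desc_2d = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
desc_3d = [(0.0, 3.0), (1.0, 1.5), (2.0, 2.0), (3.0, 4.0)]
order, dists = rank_by_descriptor_distance(desc_2d, desc_3d)
samples = list(progressive_samples(order, 3, 2, random.Random(0)))
```

The first sample is always drawn from the three best-ranked matches, so matches with small descriptor distances are tried before the rest of the set is considered.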

The second RANSAC variant employs a very simple single-match semantic outlier rejection strategy: all 2D-3D matches for which the semantic labels of the 2D feature and 3D point do not match are discarded before pose estimation.
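As a rough illustration of this baseline, the following sketch discards label-inconsistent matches before pose estimation. The `Match` container and its field names are assumptions made for the example and are not taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Match:
    feature_id: int     # index of the 2D feature in the query image
    point_id: int       # index of the matched 3D point in the map
    feature_label: str  # semantic class observed at the 2D feature
    point_label: str    # semantic class stored with the 3D point

def reject_label_mismatches(matches):
    """Keep only matches whose 2D and 3D semantic labels agree."""
    return [m for m in matches if m.feature_label == m.point_label]

matches = [
    Match(0, 10, "building", "building"),
    Match(1, 11, "vegetation", "building"),
    Match(2, 12, "road", "road"),
]
kept = reject_label_mismatches(matches)
```

Only the surviving matches would then be passed to the P3P RANSAC loop. Because the decision is made per match and is binary, a single mislabelled pixel removes an otherwise correct correspondence.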

As can be seen in Tables 3 and 4, all three methods perform similarly well on the relatively easy daytime queries of the RobotCar Seasons dataset. However, our approach significantly outperforms the other two methods under all other conditions. This clearly validates our idea of semantic consistency scoring.

5 Conclusion

Figure 6: Illustrations of the results of our method on the CMU Seasons dataset. Rows 1 and 3 show query images that our method successfully localizes (error < 0.25 m) while DenseVLAD and AS fail (error > 10 m); rows 2 and 4 show the opposite case. Green boxes indicate true correspondences, while gray circles indicate false correspondences. White/red crosses indicate correctly/incorrectly detected inliers, respectively.

In this paper, we have presented a method for soft outlier filtering by using the semantic content of a query image. Our method ranks the 2D-3D matches found by feature-based localization pipelines depending on how well they agree with the scene semantics. Provided that the gravity direction and camera height are (roughly) known, the camera is constrained to lie on a circle for a given match. Traversing this circle and projecting the semantically labelled scene geometry into the query image, we calculate a semantic consistency score for this match based on the fit between the projected and observed semantic labels. The scores are then used to bias sampling during RANSAC-based pose estimation.
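To make the circle constraint concrete, here is a small geometric sketch under stated assumptions: the world frame is gravity-aligned with the z-axis pointing up, the camera height is the z-coordinate of the camera center, and the up direction is known in camera coordinates. The function names and the y-down camera convention in the example are illustrative choices, not the paper's notation. With these assumptions, the viewing ray of a feature has a known elevation angle, so the horizontal distance between camera and 3D point is fixed, which yields a circle around the point.

```python
import math

def camera_circle(point, ray_cam, up_cam, cam_height):
    """Return ((cx, cy), radius) of feasible camera centers, or None.

    point:      matched 3D point (x, y, z) in the gravity-aligned world frame
    ray_cam:    viewing ray of the 2D feature in camera coordinates
    up_cam:     world up direction (opposite to gravity) in camera coordinates
    cam_height: assumed z-coordinate of the camera center
    """
    dot = sum(r * u for r, u in zip(ray_cam, up_cam))
    norm = math.hypot(*ray_cam) * math.hypot(*up_cam)
    # Elevation of the viewing ray above the horizontal plane.
    elevation = math.asin(max(-1.0, min(1.0, dot / norm)))
    tan_elev = math.tan(elevation)
    if abs(tan_elev) < 1e-12:
        return None  # horizontal ray: the horizontal distance is unconstrained
    radius = (point[2] - cam_height) / tan_elev
    if radius <= 0.0:
        return None  # ray direction is inconsistent with the height difference
    return (point[0], point[1]), radius

def camera_center(circle, cam_height, phi):
    """Camera center on the circle at angle phi (radians)."""
    (cx, cy), radius = circle
    return (cx + radius * math.cos(phi), cy + radius * math.sin(phi), cam_height)

# Example: the feature is seen 45 degrees above the horizon, and the point
# lies 3 m above the camera, so the camera is 3 m away horizontally.
point = (10.0, 5.0, 4.0)
up_cam = (0.0, -1.0, 0.0)   # assumed y-down camera convention
ray_cam = (0.0, -1.0, 1.0)
circle = camera_circle(point, ray_cam, up_cam, cam_height=1.0)
```

Each angle `phi` on the circle corresponds to one candidate camera position. The yaw follows from requiring the feature's viewing ray to point at the 3D point, so traversing the circle enumerates the candidate poses for a single match.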

Experiments on two challenging benchmarks for long-term visual localization show that our approach outperforms state-of-the-art methods. This validates our idea of using scene semantics to distinguish correct and wrong matches and shows the usefulness of semantic information in the context of visual localization.

Acknowledgements. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, the Swedish Research Council (grant no. 2016-04445), the Swedish Foundation for Strategic Research (Semantic Mapping and Visual Navigation for Smart Robots), and Vinnova/FFI (Perceptron, grant no. 2017-01942).

Bibliography

[1] J. L. Schönberger and J.-M. Frahm, “Structure-From-Motion Revisited”, in CVPR, 2016.

[2] R. O. Castle, G. Klein, and D. W. Murray, “Video-rate localization in multiple maps for wearable augmented reality”, in ISWC, 2008.

[3] S. Lynen et al., “Get Out of My Lab: Large-scale, Real-Time Visual-Inertial Localization”, in RSS, 2015.

[4] Y. Li, N. Snavely, and D. P. Huttenlocher, “Location Recognition using Prioritized Feature Matching”, in ECCV, 2010.

[5] Y. Li et al., “Worldwide Pose Estimation Using 3D Point Clouds”, in ECCV, 2012.

[6] L. Liu, H. Li, and Y. Dai, “Efficient Global 2D-3D Matching for Camera Localization in a Large-Scale 3D Map”, in ICCV, 2017.

[7] T. Sattler et al., “Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization?”, in CVPR, 2017.

[8] L. Svärm et al., “City-Scale Localization for Cameras with Known Vertical Direction”, PAMI, vol. 39, no. 7, pp. 1455–1461, 2017.

[9] B. Zeisl, T. Sattler, and M. Pollefeys, “Camera Pose Voting for Large-Scale Image-Based Localization”, in ICCV, 2015.

[10] T. Sattler et al., “Hyperpoints and Fine Vocabularies for Large-Scale Location Recognition”, in ICCV, 2015.

[11] L. Kneip, D. Scaramuzza, and R. Siegwart, “A Novel Parametrization of the Perspective-Three-Point Problem for a Direct Computation of Absolute Camera Position and Orientation”, in CVPR, 2011.

[12] M. Fischler and R. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography”, Communications of the ACM, 1981.

[13] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization”, in ICCV, 2015.


[14] A. Kendall and R. Cipolla, “Geometric loss functions for camera pose regression with deep learning”, in CVPR, 2017.

[15] E. Brachmann et al., “DSAC - Differentiable RANSAC for Camera Localization”, in CVPR, 2017.

[16] E. Brachmann and C. Rother, “Learning Less is More - 6D Camera Localization via 3D Surface Regression”, in CVPR, 2018.

[17] F. Walch et al., “Image-Based Localization Using LSTMs for Structured Feature Correlation”, in ICCV, 2017.

[18] T. Sattler et al., “Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions”, in CVPR, 2018.

[19] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization”, PAMI, vol. 39, no. 9, pp. 1744–1756, 2017.

[20] H. Badino, D. Huber, and T. Kanade, “Visual topometric localization”, in IEEE Intelligent Vehicles Symposium (IV), 2011, pp. 794–799.

[21] C. Toft, C. Olsson, and F. Kahl, “Long-term 3D Localization and Pose from Semantic Labellings”, in ICCV Workshops, 2017.

[22] J. L. Schönberger et al., “Semantic Visual Localization”, in CVPR, 2018.

[23] R. Arandjelović et al., “NetVLAD: CNN architecture for weakly supervised place recognition”, in CVPR, 2016.

[24] D. M. Chen et al., “City-Scale Landmark Identification on Mobile Devices”, in CVPR, 2011.

[25] J. Knopp, J. Sivic, and T. Pajdla, “Avoiding Confusing Features in Place Recognition”, in ECCV, 2010.

[26] T. Sattler et al., “Large-Scale Location Recognition and the Geometric Burstiness Problem”, in CVPR, 2016.

[27] G. Schindler, M. Brown, and R. Szeliski, “City-Scale Location Recognition”, in CVPR, 2007.

[28] A. Torii et al., “24/7 Place Recognition by View Synthesis”, in CVPR, 2015.

[29] A. R. Zamir and M. Shah, “Accurate Image Localization Based on Google Maps Street View”, in ECCV, 2010.

[30] A. R. Zamir and M. Shah, “Image Geo-Localization Based on Multiple Nearest Neighbor Feature Matching Using Generalized Graphs”, PAMI, vol. 36, no. 8, pp. 1546–1558, 2014.


[31] W. Zhang and J. Kosecka, “Image based Localization in Urban Environments”, in 3DPVT, 2006.

[32] T. Weyand, I. Kostrikov, and J. Philbin, “PlaNet - Photo Geolocation with Convolutional Neural Networks”, in ECCV, 2016.

[33] D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, IJCV, vol. 60, no. 2, 2004.

[34] K. M. Yi et al., “LIFT: Learned Invariant Feature Transform”, in ECCV, 2016.

[35] S. Choudhary and P. J. Narayanan, “Visibility Probability Structure from SfM Datasets and Applications”, in ECCV, 2012.

[36] F. Camposeco et al., “Toroidal Constraints for Two-Point Localization under High Outlier Ratios”, in CVPR, 2017.

[37] A. Irschara et al., “From Structure-from-Motion Point Clouds to Fast Location Recognition”, in CVPR, 2009.

[38] T. Cavallari et al., “On-The-Fly Adaptation of Regression Forests for Online Camera Relocalisation”, in CVPR, 2017.

[39] D. Massiceti et al., “Random Forests versus Neural Networks - What’s Best for Camera Relocalization?”, in ICRA, 2017.

[40] J. Shotton et al., “Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images”, in CVPR, 2013.

[41] J. Valentin et al., “Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization”, in CVPR, 2015.

[42] S. Ardeshir et al., “GIS-Assisted Object Detection and Geospatial Localization”, in ECCV, 2014.

[43] N. Atanasov et al., “Localization from semantic observations via the matrix permanent”, IJRR, vol. 35, no. 1-3, pp. 73–99, 2016.

[44] A. Cohen et al., “Indoor-Outdoor 3D Reconstruction Alignment”, in ECCV, 2016.

[45] R. F. Salas-Moreno et al., “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects”, in CVPR, 2013.

[46] M. Schreiber, C. Knöppel, and U. Franke, “LaneLoc: Lane marking based localization using highly accurate maps”, in IV, 2013.

[47] F. Yu, J. Xiao, and T. A. Funkhouser, “Semantic alignment of LiDAR data at city scale”, in CVPR, 2015.

[48] R. Arandjelović and A. Zisserman, “Visual Vocabulary with a Semantic Twist”, in ACCV, 2014.


[49] N. Kobyshev, H. Riemenschneider, and L. Van Gool, “Matching Features Correctly through Semantic Understanding”, in 3DV, 2014.

[50] G. Singh and J. Košecká, “Semantically Guided Geo-location and Modeling in Urban Environments”, in Large-Scale Visual Geo-Localization, 2016.

[51] A. Cohen, T. Sattler, and M. Pollefeys, “Merging the Unmatchable: Stitching Visually Disconnected SfM Models”, in ICCV, 2015.

[52] F. Yu and V. Koltun, “Multi-Scale Context Aggregation by Dilated Convolutions”, in ICLR, 2016.

[53] M. Cordts et al., “The Cityscapes Dataset for Semantic Urban Scene Understanding”, in CVPR, 2016.

[54] W. Maddern et al., “1 Year, 1000km: The Oxford RobotCar Dataset”, IJRR, vol. 36, no. 1, pp. 3–15, 2017.

[55] H. Zhao et al., “Pyramid Scene Parsing Network”, in CVPR, 2017.

[56] G. Neuhold et al., “The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes”, in ICCV, 2017.

[57] R. Haralick et al., “Review and analysis of solutions of the three point perspective pose estimation problem”, IJCV, vol. 13, no. 3, pp. 331–356, 1994.

[58] Z. Kukelova, M. Bujnak, and T. Pajdla, “Closed-form Solutions to Minimal Absolute Pose Problems with Known Vertical Direction”, in ACCV, 2011.

[59] O. Chum and J. Matas, “Matching with PROSAC - progressive sample consensus”, in CVPR, 2005.

[60] M. Cummins and P. Newman, “Appearance-only SLAM at large scale with FAB-MAP 2.0”, IJRR, vol. 30, no. 9, pp. 1100–1123, 2011.

Supplementary Material


This supplementary material presents additional material not included in the main paper: Sec. 6 shows more detailed results for the RobotCar Seasons dataset. Sec. 7 shows example images for the RobotCar Seasons dataset. Finally, Sec. 8 provides information about the run time of the method.

6 Detailed Results for the RobotCar Seasons Dataset

In the main article, we provided localization results for the day-all and night-all conditions of the RobotCar Seasons dataset [18, 54]. Here, we present a more detailed breakdown of the day and night conditions into the different sub-conditions defined in [18]. Due to the large size of the table, we have divided it into two tables, Tabs. 5 and 6. The different conditions are: Dawn, Dusk, Overcast-summer, Overcast-winter, Rain, Snow, Sun, Night and Night-rain. The last two make up the night-all category in the main article, and the rest make up the day-all category.

Note that for most day conditions (where good correspondences are generally present), the semantic consistency ranking gives no significant increase in performance. For the more challenging conditions (such as Sun, Night and Night-rain), ranking the correspondences by their semantic consistency allows the RANSAC procedure to find a better inlier set by making it less likely to sample outlier correspondences. For these conditions, we observe a significant improvement in localization performance for our approach.
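The effect of score-biased sampling can be sketched as follows. This is an illustrative example only: it assumes that each match carries a non-negative consistency score and that a minimal sample is drawn without replacement with probability proportional to those scores. The function name `sample_minimal_set` is hypothetical, and the exact weighting used in the paper's RANSAC loop may differ.

```python
import random

def sample_minimal_set(scores, sample_size, rng):
    """Draw distinct match indices with probability proportional to score.

    Assumes at least `sample_size` matches have a strictly positive score.
    """
    remaining = list(range(len(scores)))
    weights = [scores[i] for i in remaining]
    chosen = []
    for _ in range(sample_size):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen

# Matches 2 and 4 have score zero and are therefore never sampled.
scores = [0.9, 0.8, 0.0, 0.7, 0.0]
sample = sample_minimal_set(scores, sample_size=3, rng=random.Random(1))
```

With uniform sampling, the probability of drawing an all-inlier minimal sample drops quickly as the outlier ratio grows. Down-weighting semantically inconsistent matches raises that probability without discarding any match outright.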

Table 5: Additional localization results on the RobotCar Seasons dataset, showing results for the conditions Dawn, Dusk, Overcast-summer (OC-summer), Overcast-winter (OC-winter) and Rain. Results for the reference methods are taken from the benchmark article [18].

Method              Dawn                 Dusk                 OC-summer            OC-winter            Rain
Thresholds (m)      0.25 / 0.5 / 5.0     0.25 / 0.5 / 5.0     0.25 / 0.5 / 5.0     0.25 / 0.5 / 5.0     0.25 / 0.5 / 5.0
Thresholds (deg)    2 / 5 / 10           2 / 5 / 10           2 / 5 / 10           2 / 5 / 10           2 / 5 / 10
ActiveSearch [19]   36.2 / 68.7 / 89.4   44.7 / 74.6 / 95.9   24.8 / 63.9 / 95.5   33.1 / 71.5 / 93.8   51.3 / 79.8 / 96.9
CSL [8]             47.2 / 73.3 / 90.1   56.6 / 82.7 / 95.9   34.1 / 71.1 / 93.5   39.5 / 75.9 / 92.3   59.6 / 83.1 / 97.6
DenseVLAD [28]      8.9 / 36.9 / 92.5    10.2 / 38.8 / 94.2   6.0 / 29.8 / 92.0    4.4 / 26.7 / 93.3    10.2 / 40.6 / 96.9
NetVLAD [23]        6.2 / 22.8 / 82.6    7.4 / 29.7 / 92.9    6.5 / 29.6 / 95.2    3.1 / 25.9 / 92.6    9.0 / 35.9 / 96.0
PROSAC [59]         53.6 / 79.9 / 94.4   55.3 / 83.2 / 95.9   40.6 / 76.5 / 99.1   43.1 / 78.7 / 97.4   61.2 / 82.1 / 98.1
Single-match        53.8 / 80.5 / 95.5   57.1 / 82.5 / 97.5   37.6 / 75.4 / 98.3   43.3 / 78.7 / 97.4   62.5 / 82.7 / 98.8
Weighted, P3P       53.4 / 81.0 / 97.1   53.8 / 83.0 / 97.7   39.5 / 75.6 / 92.4   39.5 / 72.3 / 85.1   62.0 / 82.4 / 99.0
Unweighted, P3P     52.4 / 77.4 / 95.4   58.9 / 83.8 / 97.7   36.7 / 69.3 / 89.2   36.2 / 70.3 / 81.3   61.8 / 82.9 / 98.8
Weighted, P2P       47.4 / 79.5 / 94.8   47.5 / 81 / 95.9     21.6 / 66.1 / 91.6   32.3 / 66.9 / 85.1   31.1 / 74.8 / 95.2
Unweighted, P2P     48.2 / 79.1 / 94.2   44.7 / 82.2 / 95.4   18.6 / 60.5 / 91.8   30.8 / 65.1 / 85.1   30.4 / 75.0 / 94.8


Table 6: Additional localization results on the RobotCar Seasons dataset, showing results for the conditions Snow, Sun, Night, and Night-rain. Results for the reference methods are taken from the benchmark article [18].

Method              Snow                 Sun                  Night                Night-rain
Thresholds (m)      0.25 / 0.5 / 5.0     0.25 / 0.5 / 5.0     0.25 / 0.5 / 5.0     0.25 / 0.5 / 5.0
Thresholds (deg)    2 / 5 / 10           2 / 5 / 10           2 / 5 / 10           2 / 5 / 10
ActiveSearch [19]   36.4 / 72.2 / 93.7   25.0 / 46.5 / 69.1   0.5 / 1.1 / 3.4      1.4 / 3.0 / 5.2
CSL [8]             53.2 / 83.6 / 92.4   28.0 / 47.0 / 70.4   0.2 / 0.9 / 5.3      0.9 / 4.3 / 9.1
DenseVLAD [28]      8.6 / 30.1 / 90.2    5.7 / 16.3 / 80.2    0.9 / 3.4 / 19.9     1.1 / 5.5 / 25.5
NetVLAD [23]        7.0 / 25.2 / 91.8    5.7 / 16.5 / 86.7    0.2 / 1.8 / 15.5     0.5 / 2.7 / 16.4
PROSAC [59]         56.6 / 85.9 / 96.7   41.2 / 68.0 / 93.3   3.2 / 8.9 / 29.2     4.5 / 19.3 / 39.5
Single-match        58.1 / 86.1 / 97.1   42.6 / 69.6 / 95.2   2.7 / 6.8 / 18.5     2.2 / 6.0 / 15
Weighted, P3P       56.4 / 85.5 / 98     46.5 / 74.6 / 95.9   6.2 / 18.5 / 44.3    8.0 / 26.4 / 46.4
Unweighted, P3P     54.8 / 85.5 / 96.9   29.6 / 54.8 / 83.5   0.2 / 4.1 / 15.8     0.7 / 4.3 / 16.4
Weighted, P2P       46 / 81.4 / 96.3     21.7 / 62.6 / 94.1   10 / 25.8 / 61       15.9 / 42.3 / 65.2
Unweighted, P2P     44.8 / 80.8 / 96.3   20.7 / 56.1 / 94.3   4.1 / 16.0 / 44.7    6.1 / 25.7 / 48.9
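The three numbers in each table cell can be read with the help of a short sketch. It assumes, following the threshold rows of the tables, that a query counts as localized at a given precision level when both its position error (in metres) and its orientation error (in degrees) are within the corresponding thresholds. The function name and the toy error values are made up for illustration; the exact error definitions are those of the benchmark [18].

```python
# (position threshold in m, orientation threshold in deg) per precision level
THRESHOLDS = [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)]

def localized_percentages(errors, thresholds=THRESHOLDS):
    """Percentage of queries within each (position, orientation) threshold pair.

    errors: list of (position_error_m, orientation_error_deg), one per query.
    """
    total = len(errors)
    result = []
    for max_m, max_deg in thresholds:
        hits = sum(1 for pos, rot in errors if pos <= max_m and rot <= max_deg)
        result.append(100.0 * hits / total)
    return result

# Toy errors for four hypothetical queries.
errors = [(0.1, 1.0), (0.4, 3.0), (0.3, 8.0), (20.0, 1.0)]
percentages = localized_percentages(errors)
```

The third toy query is close in position but has a large orientation error, so it only counts at the coarsest level, while the fourth is far off in position and counts at none.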

7 RobotCar Seasons examples

Most of the daytime images from the RobotCar Seasons dataset are fairly easy to localize correctly due to an abundance of buildings in the images. The visual appearance of these buildings stays fairly constant, and these buildings thus provide good, stable interest points to localize with. Most failure cases can be found in the nighttime images. Fig. 7 shows examples of these failure cases. We can see that the semantic classification fails for large parts of the nighttime images, as buildings and even sky are mislabelled. However, this is not particularly surprising given the limited number of nighttime examples that the semantic segmentation algorithm has seen during training.

8 Timing

In this section, we present some information about the run time of the presented algorithm. Fig. 8 shows histograms of the time required to calculate the semantic consistency score per correspondence for all images in the CMU Seasons and RobotCar Seasons datasets. Note that the semantic scoring is perfectly parallel: the scores can be calculated completely independently of one another. The algorithm is thus very well suited for a parallel implementation. The histograms show the time taken by an unoptimized MATLAB implementation of the algorithm to calculate the semantic consistency score for one correspondence.
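Because the scores are independent, distributing them over workers is straightforward. The sketch below illustrates the pattern in Python rather than MATLAB. The scoring function is a hypothetical stand-in (the fraction of projected labels that agree with the observed labels) and not the paper's actual consistency score.

```python
from concurrent.futures import ThreadPoolExecutor

def consistency_score(match):
    """Hypothetical stand-in for the per-match semantic consistency score."""
    projected_labels, observed_labels = match
    agree = sum(1 for p, o in zip(projected_labels, observed_labels) if p == o)
    return agree / len(projected_labels)

def score_all(matches, workers=4):
    """Score every match independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(consistency_score, matches))

# Each toy match pairs projected labels with the labels observed in the image.
matches = [
    (["sky", "road"], ["sky", "road"]),
    (["sky", "road"], ["building", "road"]),
]
scores = score_all(matches)
```

No synchronization between workers is needed, since no score depends on another. For CPU-bound scoring in Python, a process pool would likely be the better choice; threads are used here only to keep the example short.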

Since the calculation of the consistency score mostly requires matrix-vector multiplications (projections and angle calculations), the algorithm could, due to its parallel nature, be implemented on a GPU for a significant speedup if desired.


[...] in general higher than for CMU Seasons correspondences since more points are generally visible at each camera position.

In our implementation, the most time-consuming part is checking which points are visible from each camera position, i.e., checking whether $\vec{C} \in V_i$ for each $i$. This part of our approach could be accelerated by pre-computing a covisibility graph for the 3D points in the map.


[Figure 7 panels. Columns: Database image, Query image, Segmentation. Rows: AS fails, Ours fails, Both fail, Both succeed.]

Figure 7: Illustrations of the results of our method on the RobotCar Seasons dataset. Row 1 shows an example where Active Search fails, but our method succeeds. Row 2 shows an example of the opposite case. Here we can notice that all four correct correspondences are on the buildings, but we also see that buildings have been misclassified in the segmentation (they should be gray). The two bottom rows show examples where both algorithms perform similarly. Left: Example images used to construct the database model. Middle: Query images with feature correspondences. Green boxes indicate true correspondences, while gray circles indicate false correspondences.


Figure 8: Histogram over the time required to calculate the semantic consistency score per correspondence for all images in the CMU Seasons and the RobotCar Seasons datasets.
