Visual Localization Using Sparse Semantic 3D Map
Tianxin Shi⋆†, Shuhan Shen⋆†, Xiang Gao⋆†, Lingjie Zhu⋆†

⋆ NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
† University of Chinese Academy of Sciences, Beijing 100049, China

This work was supported by the Natural Science Foundation of China under Grants 61632003 and 61873265.
ABSTRACT
Accurate and robust visual localization under a wide range of viewing condition variations, including season and illumination changes as well as weather and day-night variations, is a key component for many computer vision and robotics applications. Under such conditions, most traditional methods fail to locate the camera. In this paper we present a visual localization algorithm that combines a structure-based method and an image-based method with semantic information. Given semantic segmentations of the query and database images, the retrieved database images are scored according to the semantic consistency between the 3D model and the query image. The semantic matching score is then used as a weight for RANSAC's sampling, and the pose is solved by a standard PnP solver. Experiments on the challenging long-term visual localization benchmark dataset demonstrate that our method significantly improves upon the state of the art.
Index Terms— Visual localization, semantic segmentation, image retrieval, camera pose estimation
1. INTRODUCTION
Visual localization plays a central role in many computer vision and robotics applications, such as loop closure detection [1] and re-localization [2] in SLAM [3], Structure-from-Motion (SfM) [4], augmented reality [5], and autonomous vehicles, and has recently drawn a lot of attention. Currently, there are three main types of approaches to the localization problem.
3D structure-based methods [6, 7, 8, 9] establish matches between 2D local features in the query image and 3D points in the model, and use these correspondences to recover the 6DoF camera pose.
2D image-based localization methods [10, 11, 12, 13] cast the localization problem as an image retrieval problem. The camera pose of the most relevant retrieved database image is used as an approximation of the query image position.
Learning-based methods [14, 15, 16] use a learned model for descriptor learning [14] or to regress the camera pose directly [16]. All three types of approaches have advantages and disadvantages. The structure-based approach
is more accurate than the other two, but it becomes very time-consuming as the scale of the scene increases. Although image-based methods are faster, they are often considered inaccurate since they only approximate the positions of query images. Combining these two types of approaches, [17, 18] improve structure-based methods by restricting feature matching to the 3D points that are visible in the top retrieved images. In this paper, we also combine these two types of methods to ensure both accuracy and efficiency.

When query and database images are taken under similar scenes and conditions, existing structure-based methods for visual localization tend to work well. However, they rely heavily on local feature descriptors, which are vulnerable to changes in illumination, weather, etc., and this has become a bottleneck restricting the robustness of existing localization algorithms. When images are taken under significantly different conditions or far apart in time, existing methods often fail to recover camera poses because the feature descriptors change drastically. Fortunately, the high-level semantic information of an image is more robust and invariant than the underlying local image features. It is therefore reasonable to fuse low-level textural information and high-level semantic information for visual localization.

In this paper, we propose a robust method for visual localization that utilizes semantic information. As semantic information is comparatively invariant under different conditions, it can be used as a supervisor to distinguish correctly retrieved images from all retrieved database images. At first, we obtain a 3D model by running an SfM algorithm on all database images and assign each 3D model point a semantic label. For a query image, we compute its semantic segmentation and search for its top-k similar database images. Then, we compute the query image's temporary pose using the 2D-3D matches produced by matches between the query image and each one of the retrieved images, and assign semantic consistency scores to the 2D-3D matches belonging to the selected retrieved image. Finally, we put all 2D-3D matches together with their semantic consistency scores as weights and run a weighted RANSAC-based PnP solver to recover the final query pose. Thanks to the use of high-level semantic information in both the 3D model and the query image, the proposed method achieves robust and accurate visual localization results on the benchmark dataset [19].
Fig. 1. Flowchart of the localization pipeline proposed in this paper.

Compared to recent semantic localization approaches [20, 21, 22], our paper makes the following contributions: 1) We propose a new localization pipeline that incorporates a structure-based method and an image-based method while utilizing semantic information at the same time. 2) We do not need any additional restrictions (known camera height and gravity direction from ground truth) compared with the state-of-the-art semantic visual localization method [22].
2. RELATED WORK

Localization with semantics.
Recently, several visual localization methods using semantic information have been proposed. Schönberger et al. [20] use a generative model to learn descriptors; the model is trained to complete semantic scenes. The learned descriptors are used to establish 3D-3D matches and estimate an alignment that defines the query image pose. Toft et al. [21] use an optimization method to refine pose estimates by improving the semantic consistency between curve segments and projected 3D points. The work most similar to ours is [22], in which Toft et al. measure semantic consistency for each 2D-3D correspondence by projecting semantically labelled 3D points into the query semantic image. However, they use the gravity direction and camera height as prior knowledge, which is not required in our method.
Localization benchmarks.
Although notable datasets such as North Campus Long-Term (NCLT) [23] and KITTI [24] provide visual localization benchmarks with images captured over long periods, they either do not contain large viewing condition changes or only visit a few scenes multiple times. Recently, Sattler et al. [19] created the RobotCar Seasons benchmark dataset for long-term visual localization under all kinds of challenging conditions, including day-night changes, illumination changes (dawn/noon), as well as weather (dusk/sun/rain/snow) and seasonal (winter/summer) variations. They manually labelled 2D-3D matches in some difficult cases to overcome the impact of large condition variations when obtaining the ground truth. We therefore evaluate our method on this dataset.
3. LOCALIZATION USING SEMANTIC 3D MAP
The main bottleneck of feature-based visual localization methods is that they are fragile under large condition variations in lighting, weather, season, etc. By contrast, semantic information is comparatively invariant under different conditions. In this paper, we propose a new localization pipeline that does not need any strict priors as [22] does. The pipeline of our method is illustrated in Fig. 1: 1) We first run a standard SfM algorithm to construct a sparse 3D model of the scene. 2) Given a semantic segmentation of each database image, every 3D point can be assigned a semantic label, so that the 3D model becomes a sparse semantic 3D map M_S. 3) We then use an image retrieval method to get a set of candidate database images I_R = {I_R^i, i = 1, ..., k} for the query image I_Q. 4) For each pair of I_Q and I_R^i, we establish 2D-3D matches between I_Q and M_S indirectly, through 2D-2D feature matches. Using these matches we can recover a temporary camera pose for I_Q (related to I_R^i). 5) Given the estimated pose and the semantic segmentation of I_Q, all 3D model points are projected into I_Q. We measure the semantic consistency between the 3D points and their projections on I_Q, and use it as the weight for all 2D-3D matches related to image I_R^i. 6) Finally, we use the 2D-3D matches related to all retrieved images, together with their consistency weights, to bias sampling during RANSAC-based pose estimation.

We run a regular SfM pipeline using all database images to construct a sparse 3D model of the scene. After SfM, the location of each 3D point and the images in which it is visible are obtained. Given semantic segmentations for all database images computed by an off-the-shelf segmentation CNN (the DeepLabv3+ network [25] in this paper), we assign each 3D point a semantic label by majority voting over the labels of its reprojected pixels in all of its visible images. By exploiting semantics, we can remove dynamic objects such as persons, cars, bicycles, riders, and buses from the 3D point cloud. The result is a cleaner sparse semantic 3D map M_S than the original model.

Fig. 2. Feature matching. The blue dotted lines are the feature matches between I_Q and I_R^i. The green solid lines are the 2D-3D correspondences between I_R^i and M_S. The red solid lines are the 2D-3D matches between I_Q and M_S.

Unlike [22], which measures the quality of 2D-3D matches directly through strict restrictions used as priors, the main idea of our method is to measure the quality of each I_R^i without using any prior information. We use NetVLAD [10] to obtain the top-k ranked database images I_R for each query image I_Q. The value of k depends on the required efficiency and the available computing resources, and is set to 20 to 50 in our experiments. Due to environmental variation, erroneously retrieved images are inevitable. Fortunately, with the help of semantic information, we can alleviate their negative effect in most cases.

We use one I_R^i at a time for feature matching with I_Q. Using KNN search and Lowe's ratio test [26], 2D-2D matches (blue dotted lines in Fig. 2) are computed. We use a relaxed ratio threshold of 0.9 to avoid rejecting possibly correct matches. The SfM model provides the 2D-3D correspondences (green solid lines in Fig. 2) between the image points on I_R^i and the 3D points in M_S. Therefore, we can obtain the 2D-3D matches (red solid lines in Fig. 2) between I_Q and M_S through the 2D-2D matches and the SfM model. These 2D-3D matches are then used to recover a temporary query pose by applying a PnP solver.
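To make the map-building and match-chaining steps concrete, the following is a minimal Python sketch, not the paper's actual code: the data layouts (`points3d`, `seg_maps`, `kp_to_point3d`) and the dynamic-class list are our own assumptions.

```python
from collections import Counter

# Cityscapes classes treated as dynamic; points voted into these
# classes are dropped from the sparse semantic 3D map (illustrative).
DYNAMIC_CLASSES = {"person", "rider", "car", "truck", "bus",
                   "train", "motorcycle", "bicycle"}

def label_3d_points(points3d, seg_maps):
    """Assign each SfM point a label by majority voting over the
    segmentation labels of its reprojected pixels in all visible images.

    points3d: dict point_id -> list of (image_id, (u, v)) observations
    seg_maps: dict image_id -> 2D array of per-pixel class names
    """
    semantic_map = {}
    for pid, obs in points3d.items():
        votes = Counter(seg_maps[img][int(v), int(u)]
                        for img, (u, v) in obs)
        label = votes.most_common(1)[0][0]
        if label not in DYNAMIC_CLASSES:
            semantic_map[pid] = label  # keep only static points
    return semantic_map

def chain_2d3d_matches(matches_q_to_ret, kp_to_point3d):
    """Chain 2D-2D matches (query <-> retrieved image) with the SfM
    tracks of the retrieved image into 2D-3D matches for the query.

    matches_q_to_ret: list of (query_kp_idx, retrieved_kp_idx)
    kp_to_point3d:    dict retrieved_kp_idx -> 3D point id (if any)
    """
    return [(q_idx, kp_to_point3d[r_idx])
            for q_idx, r_idx in matches_q_to_ret
            if r_idx in kp_to_point3d]
```

Voting across all observing views makes single-image segmentation errors less harmful, and dropping dynamic classes yields the cleaner map M_S described above.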
Given the semantic segmentation of I_Q, we project all 3D points into I_Q using the estimated temporary query pose and check the number of semantically consistent points. Before projecting, we need to handle occlusions. Clearly, not all 3D points are visible to a camera, so we should only project the visible 3D points into the query image I_Q. Similar to [22], we only consider a 3D point X that satisfies

d_l < ||v|| < d_u,   ∠(v, v_m) < θ,   (1)

where v = C_Q − X and C_Q is the estimated camera center of I_Q. d_l denotes the minimum distance between the 3D point X and the positions of all database cameras that observe it, while d_u denotes the maximum such distance. θ is the angle between the two extreme viewpoints from which the 3D point X was triangulated, and the unit vector v_m points in the middle direction between those two extreme viewpoints. This means that the 3D points used for projection should be seen by the query image from a similar distance and direction as they were viewed from the database images during SfM.

We count the number of 3D points whose labels agree with those of their projections in the query image, and use this number as the semantic score of I_R^i. Since we use all 3D model points, not only the 2D-3D matches, the quality of the semantic scores can be ensured. A high semantic score means the pose estimated from I_R^i is more likely to be correct from the perspective of projective semantic consistency. In other words, retrieved database images with high semantic scores can be considered correctly retrieved, while those with low scores are, to some extent, erroneously retrieved. In this way, each I_R^i obtains a semantic score independently through the above procedure.

Finally, we put the 2D-3D matches produced by all pairs of I_R^i and I_Q together and run a final PnP solver inside a RANSAC loop. The 2D-3D matches produced by the same I_R^i are assigned the same score, which equals the semantic score of I_R^i. We normalize each score by the sum of the scores of all 2D-3D matches and use the normalized score as a weight p for RANSAC's sampling: a 2D-3D match is selected with probability p inside the RANSAC loop. Thus, if I_R^i has a high semantic score (a correctly retrieved image), the 2D-3D matches produced by I_R^i will be selected with correspondingly high probability. Compared with directly removing 2D-3D matches with low semantic scores, this semantically weighted RANSAC strategy uses semantic information only as a soft constraint and makes our approach more robust in semantically ambiguous situations.
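As an illustration of steps 5) and 6), the sketch below scores a temporary pose by projective semantic consistency and then biases RANSAC's minimal-sample selection by normalized semantic weights. It is a minimal sketch under our own assumptions (OpenCV's projectPoints/solvePnP, a hypothetical 4-pixel inlier threshold, and array layouts of our choosing), not the paper's exact implementation.

```python
import numpy as np
import cv2

def semantic_score(points3d, labels3d, rvec, tvec, K, query_seg,
                   visible_mask):
    """Count 3D points whose label agrees with the query segmentation
    at their projected pixel. `visible_mask` pre-selects points that
    pass the distance/angle test of Eq. (1)."""
    pts = points3d[visible_mask]
    proj, _ = cv2.projectPoints(pts, rvec, tvec, K, None)
    proj = proj.reshape(-1, 2).astype(int)
    h, w = query_seg.shape
    score = 0
    for (u, v), lbl in zip(proj, labels3d[visible_mask]):
        if 0 <= u < w and 0 <= v < h and query_seg[v, u] == lbl:
            score += 1
    return score

def weighted_ransac_pnp(pts3d, pts2d, weights, K, iters=1000, thresh=4.0):
    """RANSAC PnP where minimal samples are drawn with probability
    proportional to the normalized semantic weights (float arrays:
    pts3d is Nx3, pts2d is Nx2, weights has length N)."""
    p = np.asarray(weights, dtype=float)
    p /= p.sum()                     # normalized sampling probabilities
    best_inliers, best_pose = 0, None
    for _ in range(iters):
        # Weighted draw of a minimal 4-point sample for P3P.
        idx = np.random.choice(len(pts3d), size=4, replace=False, p=p)
        ok, rvec, tvec = cv2.solvePnP(pts3d[idx], pts2d[idx], K, None,
                                      flags=cv2.SOLVEPNP_P3P)
        if not ok:
            continue
        # Score the hypothesis by its reprojection inliers.
        proj, _ = cv2.projectPoints(pts3d, rvec, tvec, K, None)
        err = np.linalg.norm(proj.reshape(-1, 2) - pts2d, axis=1)
        inliers = int((err < thresh).sum())
        if inliers > best_inliers:
            best_inliers, best_pose = inliers, (rvec, tvec)
    return best_pose, best_inliers
```

Sampling minimal sets with these probabilities, rather than discarding low-score matches outright, reflects the soft-constraint behaviour described above.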
4. EXPERIMENTAL EVALUATION
We evaluate the proposed method on the benchmark visual localization dataset RobotCar Seasons [19]. In the following, we introduce the dataset and make a detailed comparison with existing approaches.

The long-term visual localization RobotCar Seasons dataset [19] is based on a subset of the Oxford RobotCar dataset [27]. It contains 20,862 database images and 11,934 query images, covering a wide range of environmental condition variations including season changes, weather variations, and even day-night changes. Besides, some images contain motion blur, which makes accurate visual localization more difficult.

The SfM model of the dataset is provided by [19] and was produced by the state-of-the-art open-source 3D reconstruction system COLMAP [4].
To augment this SfM model with semantics, we use the DeepLabv3+ network [25] to segment all dataset images and assign a label to each 3D point by majority voting over the labels of its reprojected pixels in all of its visible database images. The DeepLabv3+ network is pre-trained on the Cityscapes dataset [28]. Additionally, we manually annotate 20 night-condition images from the original RobotCar dataset [27] and use them to fine-tune the pre-trained model in order to improve segmentation performance under all conditions. The semantic classes we use are the same as in Cityscapes.

In the image retrieval step, we use NetVLAD [10] with the pre-trained Pitts30K model to generate a 4096-dimensional descriptor vector for each query and database image. Then, normalized L2 distances between the descriptors are computed, and the top-k closest database images are chosen as candidates. We set the retrieval number k to 30 for day conditions and 50 for night conditions. We record the percentage of query images that are localized within X m and Y° of the ground truth (see the sketch at the end of this section). Following [19], we use three thresholds: (0.25 m, 2°), (0.5 m, 5°), and (5 m, 10°), representing high, medium, and coarse precision, respectively.

Table 1. Evaluation on the long-term visual localization dataset (RobotCar Seasons). Each cell gives the percentage of queries localized within (0.25 m, 2°) / (0.50 m, 5°) / (5.0 m, 10°); "–" marks values that are not available.

Method       | dawn           | dusk           | OC-summer      | OC-winter      | rain           | snow           | sun            | night       | night-rain
ActiveSearch | 36.2/68.9/89.4 | 44.7/74.6/95.9 | 24.8/63.9/95.5 | 33.1/71.5/93.8 | 51.3/79.8/96.9 | 36.6/72.2/93.7 | 25.0/46.5/69.1 | 0.5/1.1/3.4 | 1.4/3.0/5.2
CSL          | 47.2/73.3/90.1 | 56.6/82.7/–    | –              | –              | –              | –              | –              | –           | –

To verify the effectiveness of the semantic information, we first run a comparative experiment. Non-semantics in Table 1 denotes our approach without semantics: it uses the 2D-3D matches produced by all retrieved database images to run a PnP solver, with identical selection probabilities in the RANSAC loop. As Table 1 shows, exploiting semantics leads to a significant improvement in localization performance over the variant that does not use semantics.

We then compare our method against several state-of-the-art approaches, using the results from [19, 22] directly. We compare with two VLAD-based image retrieval methods, namely DenseVLAD [13] and NetVLAD [10], and with FAB-MAP [29], an image retrieval approach based on the Bag-of-Words (BoW) paradigm [30]. As for structure-based algorithms, we compare with Active Search [6] and City-Scale Localization (CSL) [9]. Since current learning-based methods cannot achieve competitive performance [19], we do not include this type of method in the comparison. As shown in Table 1, our method significantly outperforms the state-of-the-art approaches, which indicates that semantics can substantially improve the robustness of visual localization. Our approach is only slightly inferior to CSL at coarse precision, which is likely due to the image retrieval results: in some cases, the environment in which a query image was taken differs greatly from that of the database images, so few correctly retrieved database images appear among the top-k and the camera cannot be located accurately.

Table 2. Comparison with [22] under all day conditions, reported as the percentage of queries within (0.25 m, 2°) / (0.50 m, 5°) / (5.0 m, 10°); "–" marks values that are not available.

Method                          | all day
Semantic Match Consistency [22] | 50.6/79.8/95.1
Ours                            | –

In addition, we compare our method with [22], which also uses semantics for localization. [22] uses the camera height and gravity direction as priors: the gravity direction is extracted from the ground-truth pose, and the camera height is obtained from the intersection of the database trajectory with the cone of possible poses. In contrast, our method does not require any priors. As shown in Table 2, even without prior information our method still outperforms [22] under all conditions.
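For clarity, the sketch below shows how the reported percentages can be computed: the translation error is the distance between estimated and ground-truth camera centers, the rotation error is the angle of the relative rotation, and a query counts as localized at a threshold if both errors fall below it. This is a generic re-implementation of the threshold metric from [19]; the function and variable names are our own.

```python
import numpy as np

# Precision thresholds from [19]: (max translation in m, max rotation in deg)
THRESHOLDS = [(0.25, 2.0), (0.50, 5.0), (5.0, 10.0)]

def pose_errors(R_est, c_est, R_gt, c_gt):
    """Translation error (m) and rotation error (deg) for one query."""
    t_err = np.linalg.norm(c_est - c_gt)
    # Angle of the relative rotation R_est^T @ R_gt, via its trace.
    cos = np.clip((np.trace(R_est.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos))
    return t_err, r_err

def localized_percentages(errors):
    """errors: list of (t_err, r_err); returns one percentage per threshold."""
    n = len(errors)
    return [100.0 * sum(t <= mt and r <= mr for t, r in errors) / n
            for mt, mr in THRESHOLDS]
```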
5. CONCLUSION
In this paper, we proposed a 6DoF visual localization method based on a sparse 3D semantic model. By exploiting semantic information, we score each retrieved image by projecting all visible 3D points into the query image and measuring the projective semantic consistency. The semantic consistency score is used as a weight for all 2D-3D matches produced by the corresponding retrieved image. We put the 2D-3D matches related to all retrieved images together, run a final PnP solver inside a weighted RANSAC loop, and consequently obtain the final query image pose. Experiments on the benchmark visual localization dataset show that our method outperforms the state of the art at high and medium precision. Moreover, compared with the state-of-the-art semantic visual localization method [22], the proposed method does not require any prior information, which makes it more practical. In the future, we would like to improve the image retrieval algorithm and attempt to replace the sparse model with a dense 3D semantic model.

6. REFERENCES

[1] R. Dubé, D. Dugas, E. Stumm, J. Nieto, R. Siegwart, and C. Cadena, "SegMatch: Segment based place recognition in 3D point clouds," in IEEE International Conference on Robotics and Automation, May 2017, pp. 5266–5272.
[2] Yunpeng Li, Noah Snavely, and Daniel P. Huttenlocher, "Location recognition using prioritized feature matching," in European Conference on Computer Vision. Springer, 2010, pp. 791–804.
[3] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[4] Johannes L. Schonberger and Jan-Michael Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.
[5] Robert Castle, Georg Klein, and David W. Murray, "Video-rate localization in multiple maps for wearable augmented reality," in 12th IEEE International Symposium on Wearable Computers (ISWC). IEEE, 2008, pp. 15–22.
[6] T. Sattler, B. Leibe, and L. Kobbelt, "Efficient & effective prioritized matching for large-scale image-based localization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 9, pp. 1744–1756, Sept. 2017.
[7] L. Liu, H. Li, and Y. Dai, "Efficient global 2D-3D matching for camera localization in a large-scale 3D map," in Proceedings of the IEEE International Conference on Computer Vision, Oct. 2017, pp. 2391–2400.
[8] Bernhard Zeisl, Torsten Sattler, and Marc Pollefeys, "Camera pose voting for large-scale image-based localization," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2704–2712.
[9] Linus Svärm, Olof Enqvist, Fredrik Kahl, and Magnus Oskarsson, "City-scale localization for cameras with known vertical direction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1455–1461, 2017.
[10] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307.
[11] Stephanie Lowry, Niko Sünderhauf, Paul Newman, John J. Leonard, David Cox, Peter Corke, and Michael J. Milford, "Visual place recognition: A survey," IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2016.
[12] Torsten Sattler, Michal Havlena, Konrad Schindler, and Marc Pollefeys, "Large-scale location recognition and the geometric burstiness problem," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1582–1590.
[13] Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla, "24/7 place recognition by view synthesis," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1808–1817.
[14] Zetao Chen, Adam Jacobson, Niko Sünderhauf, Ben Upcroft, Lingqiao Liu, Chunhua Shen, Ian Reid, and Michael Milford, "Deep learning features at scale for visual place recognition," in IEEE International Conference on Robotics and Automation. IEEE, 2017, pp. 3223–3230.
[15] Song Cao and Noah Snavely, "Graph-based discriminative learning for location recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 700–707.
[16] Alex Kendall, Matthew Grimes, and Roberto Cipolla, "PoseNet: A convolutional network for real-time 6-DoF camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2938–2946.
[17] Torsten Sattler, Michal Havlena, Filip Radenovic, Konrad Schindler, and Marc Pollefeys, "Hyperpoints and fine vocabularies for large-scale location recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2102–2110.
[18] Torsten Sattler, Tobias Weyand, Bastian Leibe, and Leif Kobbelt, "Image retrieval for image-based localization revisited," in BMVC, 2012, vol. 1, p. 4.
[19] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al., "Benchmarking 6DOF outdoor visual localization in changing conditions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8601–8610.
[20] J. L. Schönberger, M. Pollefeys, A. Geiger, and T. Sattler, "Semantic visual localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018, pp. 6896–6906.
[21] Carl Toft, Carl Olsson, and Fredrik Kahl, "Long-term 3D localization and pose from semantic labellings," in IEEE International Conference on Computer Vision Workshops, 2017, vol. 2, p. 3.
[22] Carl Toft, Erik Stenborg, Lars Hammarstrand, Lucas Brynte, Marc Pollefeys, Torsten Sattler, and Fredrik Kahl, "Semantic match consistency for long-term visual localization," in European Conference on Computer Vision. Springer, 2018, pp. 391–408.
[23] Nicholas Carlevaris-Bianco, Arash K. Ushani, and Ryan M. Eustice, "University of Michigan North Campus long-term vision and lidar dataset," The International Journal of Robotics Research, vol. 35, no. 9, pp. 1023–1035, 2016.
[24] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[25] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in European Conference on Computer Vision, 2018.
[26] David G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[27] Will Maddern, Geoffrey Pascoe, Chris Linegar, and Paul Newman, "1 year, 1000 km: The Oxford RobotCar dataset," The International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
[28] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
[29] Mark Cummins and Paul Newman, "FAB-MAP: Probabilistic localization and mapping in the space of appearance," The International Journal of Robotics Research, vol. 27, no. 6, pp. 647–665, 2008.
[30] Josef Sivic and Andrew Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proceedings of the IEEE International Conference on Computer Vision, 2003, pp. 1470–1477.