2-Entity RANSAC for robust visual localization in changing environment
Yanmei Jiao, Yue Wang, Bo Fu, Xiaqing Ding, Qimeng Tan, Lei Chen, Rong Xiong
Abstract — Visual localization has attracted considerable attention due to its low-cost and stable sensor, which is desired in many applications, such as autonomous driving, inspection robots and unmanned aerial vehicles. However, current visual localization methods still struggle with environmental changes across weathers and seasons, as there is significant appearance variation between the map and the query image. The crucial challenge in this situation is that the percentage of outliers, i.e. incorrect feature matches, is high. In this paper, we derive minimal closed form solutions for 3D-2D localization with the aid of inertial measurements, using only 2 point matches or 1 point match and 1 line match. These solutions are further utilized in the proposed 2-entity RANSAC, which is more robust to outliers as both line and point features can be used simultaneously and the number of matches required for pose calculation is reduced. Furthermore, we introduce three feature sampling strategies with different advantages, enabling an automatic selection mechanism. With this mechanism, our 2-entity RANSAC can adapt to environments with different distributions of feature types in different segments. Finally, we evaluate the method on both synthetic and real-world datasets, validating its performance and effectiveness in inter-session scenarios.
I. INTRODUCTION
Localization is essential for autonomous robot navigation. Compared to LiDAR-based localization methods, visual localization is favorable in many applications as low-cost and stable cameras are widely available. A typical workflow of the visual localization task is to detect the feature points in the query image (FAST [1], SIFT [2]), match the image feature points to the map points (ORB [3], SIFT [2]), and estimate the pose based on a set of matches. Match outliers are hardly avoidable in this setting, leading to inaccurate pose estimation. Random sample consensus (RANSAC) [4] is a popular method to achieve robust estimation by randomly sampling the matching set and voting for the inliers. However, RANSAC is limited by serious appearance changes in the environment, in which the percentage of outliers may grow significantly [5] [6]. Therefore, reliable visual localization robust to weather, illumination or seasonal changes remains a challenging problem.

Two factors are crucial for RANSAC to find inliers, i.e. the number of inliers in the set (higher is better), and the minimal number of matches required to estimate a pose
Yanmei Jiao, Yue Wang, Bo Fu, Xiaqing Ding and Rong Xiong are with the State Key Laboratory of Industrial Control and Technology, Zhejiang University, Hangzhou, P.R. China. Qimeng Tan and Lei Chen are with the Beijing Key Laboratory of Intelligent Space Robotic System Technology and Applications, Beijing Institute of Spacecraft System Engineering, Beijing, P.R. China. Yue Wang is the corresponding author, [email protected]. Rong Xiong is the co-corresponding author.
Fig. 1: The green line indicates the current session localizing in the black pre-built 3D map with the 1 point 1 line or 2 points minimal solution. The map was built in summer, while the query sessions are under different weather or seasons.

(lower is better). For the first factor, some previous works propose to increase the number of inliers by utilizing multiple types of features (e.g. points, lines and planes) [7] [8], as a feature defined on a larger image region usually leads to higher robustness against illuminative variation. In this paper, we also consider multiple types of features following the practice of previous methods [9]. For the second factor, the well-known 3D-2D localization methods typically require 3 or 4 feature matches, such as perspective-3-points (P3P) [10], efficient perspective-n-points (EPnP) [11], the minimal solutions of 3 lines [12] [13] or a mixture of 3 entities (i.e. points and lines) [9]. Our key observation in this paper is that the number of matches can be further reduced for both point and line features in the robotics navigation scenario, leading to higher robustness against outliers.

Specifically, with the growing trend of visual inertial navigation in robots, the direction of gravity between the pre-built map and the current query image can be easily aligned. Therefore, we aim to reduce the number of point feature matches from 3 to 2 for closed form pose estimation, by making use of the directly observed pitch and roll angles. Furthermore, the closed form solution is also derived for the case of 1 point and 1 line, so that line feature matches can be considered. As a result, we propose a 2-entity RANSAC for persistent visual localization as illustrated in Fig. 1. The method is named 2-entity because only two feature matches among points and lines are used for pose estimation. In summary, we present the contributions of the paper as follows:
• We propose a general framework to derive the closed form solutions to pose estimation which can deal with both point and line features, namely 2 point matches or 1 point match and 1 line match.
• We propose a 2-entity RANSAC by embedding the minimal closed form solutions into the RANSAC framework, and propose three sampling strategies in RANSAC that select different types of feature matches for pose estimation.
• We analyze the success probability of the different sampling strategies in the 2-entity RANSAC, and propose a selection mechanism that adaptively selects the strategy depending on environmental characteristics, i.e. structured or unstructured.
• The proposed method is evaluated on both synthetic and multiple real world data sessions. The results validate the effectiveness and efficiency of our method on visual localization with seasonal and illuminative changes.

The remainder of this paper is organized as follows: In Section II we review relevant visual pose estimation methods. Section III gives the overview of the proposed method and the derivation of the closed form solutions. The analysis of the criterion to select the sampling strategy in RANSAC is addressed in Section IV. We then evaluate our method in Section V on synthetic data and challenging real world data. Finally, the conclusion is drawn in Section VI, which completes this paper.

II. RELATED WORKS
The minimal solution with efficient 2D-2D point matches for camera pose estimation has been studied extensively for decades. There are many well-known methods, such as the 8-point method for fundamental matrix estimation, which yields a unique solution [14]. In addition, there are also 7-point [15], 6-point [16] and 5-point [17] methods which focus on further reducing the minimal number of point matches required for pose estimation. In recent years, with the development of visual inertial navigation, this number has been further reduced by considering the measurements of angular velocity and gravity. Based on this, a 3-point method was proposed in [18]. More recently, the 2-point RANSAC [19] [20] was proposed by employing the gyro to calculate the relative rotation within a short duration, and it has been employed in several visual inertial odometry systems. By further constraining the motion pattern of the camera, only 1 point match is sufficient to derive the pose [21].

In localization, 3D-2D matches are available but there is no measurement of the relative yaw angle. In this case, the minimal number of required feature matches is reduced compared with the 2D-2D scenario. Specifically, there are several 3-point methods [10] [22] [23] [24] [25] that give up to four solutions in closed form, and 3-line solvers [12] [13] that give up to eight solutions and require slower iterative methods. As significant change can appear in the long term, it is difficult to find enough point matches or line matches alone to guarantee the success rate of localization. Thus, methods utilizing both point and line features are studied in [9], including two possible cases: the 2 points 1 line case and the 1 point 2 lines case.
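The payoff of a smaller minimal set can be quantified with the standard RANSAC iteration-count formula (a textbook result, not specific to this paper): the number of iterations needed to draw at least one all-inlier sample with a given confidence grows quickly with the sample size when the inlier rate is low. A minimal sketch (function name is ours):

```python
import math

def ransac_iterations(inlier_rate, sample_size, confidence=0.99):
    """Smallest N with 1 - (1 - w**s)**N >= confidence, where w is the
    inlier rate and s the number of matches per hypothesis."""
    p_good = inlier_rate ** sample_size        # P(sample is all inliers)
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

# At 20% inliers (severe appearance change), reducing the sample size
# from 3 (e.g. P3P) to 2 (a 2-entity solver) cuts the required number
# of iterations by roughly a factor of five:
print(ransac_iterations(0.2, 2), ransac_iterations(0.2, 3))
```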
With the aid of inertial measurements, the 2-point RANSAC solution is investigated in [20] and [26]. However, the former utilizes the relative measurement of the yaw angle, which is only available between two consecutive cameras in the odometry problem and thus cannot be applied in localization, and the latter only considers point feature matches. To the best of our knowledge, the minimal solution for localization using both point and line features with the aid of inertial measurements has not been studied yet, which is the focus of this paper.

III. MINIMAL SOLUTIONS
The definition of coordinates in this paper is shown in Fig. 1. A 3D map is built by running visual inertial simultaneous localization and mapping (VI-SLAM), of which the world reference frame is denoted as $W_m$. In the current query session, a visual inertial navigation system (VINS) is utilized with its origin defined as $W_q$. Denoting the coordinates of the current query camera as $C_q$, the localization problem is stated as the estimation of the query image pose with respect to the world, i.e. $T_{W_m C_q}$. Thanks to the observability of the pitch and roll angles, $\tilde{\beta}$ and $\tilde{\gamma}$, given by the inertial measurements, the direction of gravity can be aligned between $W_m$ and $W_q$ [27], leading to a reduction of the degrees of freedom (DoF) in the pose as $T_{W_m C_q} = [R_z(\alpha) R_y(\tilde{\beta}) R_x(\tilde{\gamma}) \mid (T_1\ T_2\ T_3)^T]$, where $\alpha$ and $(T_1, T_2, T_3)$ denote the yaw angle and the three translations to be estimated. In the sequel, for clarity, we estimate the inverse of the pose, i.e. $T_{C_q W_m}$. Note that the number of unknown DoFs in $T_{C_q W_m}$ remains the same.

To solve the 4DoF localization problem, we refer to two geometric constraints: collinearity of 3D-2D point matches and coplanarity of 3D-2D line matches. In an image, each pixel encodes a 3D projection ray. As shown in Fig. 2, according to the projection geometry, the optical center $C$ of the camera, the map point $P$ and its projection point $D_1$ on the image lie on the same 3D line, which is denoted as $\{C, D_1, R_{C_q W_m} P + t_{C_q W_m}\}_L$. By solving the line equation with the first two points $C$ and $D_1$, the third point can be substituted into the line equation to form two independent constraints. In addition, the camera center $C$, the map line $L_2 L_3$ and its projection line segment $D_2 D_3$ lie on a single plane $\pi$, which is denoted as $\{C, D_2, D_3, R_{C_q W_m} L_2 + t_{C_q W_m}\}_P$ and $\{C, D_2, D_3, R_{C_q W_m} L_3 + t_{C_q W_m}\}_P$.
Similarly, by solving the plane equation from the first three points $C$, $D_2$ and $D_3$, the two end points of the map line can be substituted into it to form another two independent constraints. As a result, two non-degenerate feature matches should be sufficient to solve the 4DoF localization problem with three possible combinations in theory: 1 point 1 line, 2 points and 2 lines.

A. 1 point 1 line
In this section, we solve the 4DoF localization problem given one 3D-2D point match and one 3D-2D line match. The solution is denoted as 1P1L. Given the 2D coordinates of the detected image feature point, we can compute the corresponding 3D projection ray according to the calibrated intrinsic parameters. The same process is applied to the two end points of the line segment to obtain two 3D projection rays. The 3D projection rays are expressed in the camera reference frame $C_q$, and the corresponding 3D coordinates of the features are expressed in the world reference frame $W_m$. To simplify the form of the unsolved equations, we introduce two intermediate reference frames, denoted as $C$ and $W$, for the camera and the map respectively.

Fig. 2: The illustration of the intermediate reference frames $C$ and $W$ for (a) the 1 point 1 line case, (b) the 2 lines case.

The choice of $C$: As shown in Fig. 2 (a), in the original camera reference frame $C_q$, the origin is the camera center $C_1$, the camera projection ray associated with the 2D point is given by its normalized direction vector $\vec{d}_1$, and the projection rays of the two end points of the 2D line are given by the normalized directions $\vec{d}_2$, $\vec{d}_3$. In the intermediate camera frame $C$, the projection ray associated with the 2D point is denoted by $CD_1$, and those of the line by $CD_2$ and $CD_3$. Specifically, the intermediate reference frame $C$ should satisfy the following conditions:
• The new camera center $C$ is $(0, 0, -1)$.
• One of the projection rays of the line end points, $CD_2$, lies on the $z$ axis such that $D_2 = (0, 0, 0)$.
• The other projection ray of the line end points, $CD_3$, lies on the $xz$ plane and the point $D_3$ is the intersection point between the $x$ axis and the ray.
• The point $D_1$ on the projection ray of the point feature lies on the $xy$ plane.

After the transformation, we have the following results:
$C = (0, 0, -1)^T$, $D_2 = (0, 0, 0)^T$, $D_3 = (\tan(\arccos(\vec{d}_2 \cdot \vec{d}_3)), 0, 0)^T$
The corresponding points in the original camera frame are
$C_1 = 0_{3 \times 1}$, $D_2' = C_1 + \vec{d}_2$, $D_3' = C_1 + \vec{d}_3 / (\vec{d}_2 \cdot \vec{d}_3)$
The transformation $T_{C C_q}$ can be computed by transforming the three points $(C_1, D_2', D_3')$ to $(C, D_2, D_3)$. After that, the point $D_1 \triangleq (a_1, b_1, 0)$ can also be computed.

The choice of $W$: The transformation of the world reference is a translation which transforms the 3D point to the origin of $W$:
$T_{W W_m} = \begin{bmatrix} I_{3 \times 3} & -P \\ 0_{1 \times 3} & 1 \end{bmatrix}$
Thus in $W$ we have:
$P = 0_{3 \times 1}$, $L_{i = \{2, 3\}} \triangleq (X_i\ Y_i\ Z_i)^T$
Note that these points are all known.
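The transformation $T_{C C_q}$ above is a rigid alignment of two point triplets. The paper does not prescribe a specific algorithm for this step; one common choice is the SVD-based Kabsch method, sketched below under the assumption that the three points are not collinear (function name is ours):

```python
import numpy as np

def rigid_from_triplets(src, dst):
    """Rigid transform (R, t) with dst[i] ~= R @ src[i] + t, computed
    by the SVD-based Kabsch method. src, dst: (3, 3) arrays, one point
    per row; the points must not be collinear."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Applied to $(C_1, D_2', D_3')$ as `src` and $(C, D_2, D_3)$ as `dst`, this yields the rotation and translation of $T_{C C_q}$.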
Pose estimation between $C$ and $W$: Let us denote the rotation matrix and the translation vector to be solved as $R$ and $t$, that is $R \triangleq R_{C W}$, $t \triangleq t_{C W}$. According to the collinearity of $\{C, D_1, RP + t\}_L$, the following equations are derived:
$a_1 T_2 - b_1 T_1 = 0$ (1)
$b_1 T_3 - T_2 = -b_1$ (2)
As for the coplanarity of $\{C, D_2, D_3, RL_2 + t\}_P$:
$R_{21} X_2 + R_{22} Y_2 + R_{23} Z_2 + T_2 = 0$ (3)
And the coplanarity of $\{C, D_2, D_3, RL_3 + t\}_P$:
$R_{21} X_3 + R_{22} Y_3 + R_{23} Z_3 + T_2 = 0$ (4)
where $R_{mn}$ denotes the $m$-th row and $n$-th column entry of $R$, and $T_m$ denotes the $m$-th entry of $t$.

One should notice that the matrix $R$ is determined only by $\alpha$. Combining (3) and (4), we can solve for $\alpha$, which is then substituted into (1) - (4) for the translation $t$. Putting all the equations together, the closed form localization problem is solved as
$T_{C_q W_m} = T_{C C_q}^{-1} \cdot T_{C W} \cdot T_{W W_m}$
Degenerate cases: For the 1 point 1 line case, if the point lies on the line, the corresponding case is degenerate.
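The core steps of the 1P1L solution, solving the yaw from the coplanarity constraints (3) and (4) and then the translation from (1) and (2), can be sketched numerically. The function name is ours, and returning both yaw candidates (disambiguation, e.g. by cheirality, is left out) is our own choice for this sketch; it assumes $b_1 \neq 0$:

```python
import numpy as np

def solve_1p1l_core(a1, b1, L2, L3, pitch, roll):
    """1P1L in the intermediate frames: yaw alpha from (3)-(4), then
    the translation from (1)-(2). Returns both yaw candidates."""
    cb, sb = np.cos(pitch), np.sin(pitch)
    cg, sg = np.cos(roll), np.sin(roll)
    # The second row of Rz(alpha) @ Ry(pitch) @ Rx(roll) equals
    # sin(alpha) * u + cos(alpha) * v with:
    u = np.array([cb, sb * sg, sb * cg])
    v = np.array([0.0, cg, -sg])
    d = np.asarray(L2) - np.asarray(L3)
    # (3) minus (4): sin(a) * (u @ d) + cos(a) * (v @ d) = 0
    alpha0 = np.arctan2(-(v @ d), u @ d)
    poses = []
    for alpha in (alpha0, alpha0 + np.pi):
        ca, sa = np.cos(alpha), np.sin(alpha)
        Rz = np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]])
        Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
        Rx = np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]])
        R = Rz @ Ry @ Rx
        T2 = -(R[1] @ np.asarray(L2))   # eq. (3)
        T1 = a1 * T2 / b1               # eq. (1), assumes b1 != 0
        T3 = T2 / b1 - 1.0              # eq. (2)
        poses.append((R, np.array([T1, T2, T3])))
    return poses
```

On noise-free synthetic data built to satisfy the intermediate-frame conditions, one of the two returned candidates coincides with the ground-truth pose.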
B. 2 points
In this section, we are given two 3D-2D point matches, and the solution is denoted as 2P. In this case, the intermediate camera reference frame remains the same as the original one; thus, following the notations in 1P1L, the points in $C_q$ and $C$ are
$C = 0_{3 \times 1}$, $D_{i = \{1, 2\}} \triangleq (a_i\ b_i\ 1)^T$
where $a_1, a_2, b_1, b_2$ are known parameters computed using the intrinsic parameters and normalized depth. The transformation of the world reference is a translation which transforms one of the two 3D points to the origin of the intermediate world frame. Thus in $W$ we have:
$P_1 = 0_{3 \times 1}$, $P_2 \triangleq (X_2\ Y_2\ Z_2)^T$
Pose estimation between $C$ and $W$: Following the notations in 1P1L, according to the collinearity of $\{C, D_1, RP_1 + t\}_L$, the following equations are derived:
$a_1 T_2 - b_1 T_1 = 0$ (5)
$a_1 T_3 - T_1 = 0$ (6)
As for the collinearity of $\{C, D_2, RP_2 + t\}_L$, we have:
$a_2 (R_{21} X_2 + R_{22} Y_2 + R_{23} Z_2 + T_2) - b_2 (R_{11} X_2 + R_{12} Y_2 + R_{13} Z_2 + T_1) = 0$ (7)
$a_2 (R_{31} X_2 + R_{32} Y_2 + R_{33} Z_2 + T_3) - (R_{11} X_2 + R_{12} Y_2 + R_{13} Z_2 + T_1) = 0$ (8)
Combining (5) - (8), we can solve for $T_{C W}$. Thus the localization result is obtained by
$T_{C_q W_m} = T_{C W} \cdot T_{W W_m}$

C. 2 lines
In this section, we are given two 3D-2D line matches as shown in Fig. 2 (b).
The choice of $C$: In $C_q$, the camera projection rays associated with the two 2D lines are given by the pairs $(\vec{d}_1, \vec{d}_2)$ and $(\vec{d}_3, \vec{d}_4)$, respectively. In $C$, the 3D projection rays associated with the two 2D lines are represented by $(CD_1, CD_2)$ and $(CD_3, CD_4)$. Specifically, $C$ should satisfy the following conditions:
• The new camera center $C$ is $(0, 0, -1)$.
• The intersection line of the two interpretation planes, represented by the projection ray $CD_2$, lies on the $z$ axis such that $D_2 = (0, 0, 0)$.
• The projection ray $CD_3$ lies on the $xz$ plane and the point $D_3$ is the intersection point between the $x$ axis and the ray.
• The point $D_1$ on the projection ray of one line lies on the $xy$ plane.

The unit normal vectors of the two planes formed by $(C_1, \vec{d}_1, \vec{d}_2)$ and $(C_1, \vec{d}_3, \vec{d}_4)$ can be computed as follows:
$\vec{n}_1 = \vec{d}_1 \times \vec{d}_2$, $\vec{n}_2 = \vec{d}_3 \times \vec{d}_4$, $\vec{d} = \vec{n}_1 \times \vec{n}_2$
where $\vec{d}$ is the direction vector of the intersection line $CD_2$. After such a transformation, we have the coordinates of $C$ and $D_2$, and $D_3$ can be computed as follows:
$D_3 = (\tan(\arccos(\vec{d} \cdot \vec{d}_3))\ \ 0\ \ 0)^T$
Then the corresponding points in $C_q$ are:
$C_1 = 0_{3 \times 1}$, $D_3' = C_1 + \vec{d}_3 / (\vec{d} \cdot \vec{d}_3)$, $D_2' = C_1 + \vec{d}$
The transformation $T_{C C_q}$ can be computed by transforming $(C_1, D_3', D_2')$ to $(C, D_3, D_2)$. After that, the point $D_1 \triangleq (a_1, b_1, 0)$ can also be computed.

The choice of $W$: $T_{W W_m}$ is a translation which transforms one end point of the 3D line to the origin of the intermediate world frame.
Thus in $W$ we have:
$L_1 = 0_{3 \times 1}$, $L_{i = \{2, 3, 4\}} \triangleq (X_i\ Y_i\ Z_i)^T$
Pose estimation between $C$ and $W$: Following the notations in 1P1L, according to the coplanarity of $\{C, D_2, D_3, RL_1 + t\}_P$ and $\{C, D_2, D_3, RL_2 + t\}_P$, the following equations are derived:
$T_2 = 0$ (9)
$R_{21} X_2 + R_{22} Y_2 + R_{23} Z_2 + T_2 = 0$ (10)
As for the coplanarity of $\{C, D_1, D_2, RL_3 + t\}_P$ and $\{C, D_1, D_2, RL_4 + t\}_P$:
$a_1 (R_{21} X_3 + R_{22} Y_3 + R_{23} Z_3 + T_2) - b_1 (R_{11} X_3 + R_{12} Y_3 + R_{13} Z_3 + T_1) = 0$ (11)
$a_1 (R_{21} X_4 + R_{22} Y_4 + R_{23} Z_4 + T_2) - b_1 (R_{11} X_4 + R_{12} Y_4 + R_{13} Z_4 + T_1) = 0$ (12)
From (9) - (12), we can easily find that the constraints are not sufficient to solve for $T_3$, which appears in none of them. In fact, one of the constraints provided by the IMU coincides with one of the constraints provided by coplanarity; thus the 2 lines case cannot solve the 4DoF localization problem.

IV. MODEL SELECTION
In the last section, we discussed the possible minimal closed form solutions of the 4DoF localization problem when inertial measurements exist, and showed that 1P1L and 2P are solvable. Based on the two minimal solutions, we propose the 2-entity RANSAC, in which there are three possible sampling strategies according to the corresponding solution: 1P1L, 2P and mixed. The mixed strategy refers to using both the 1P1L and 2P solutions according to the randomly selected features. Specifically, we first select one point match and then randomly select another feature match from the points and lines. If it is a point match, we perform 2P; if it is a line match, we perform 1P1L. There are some works that discuss sampling points from multiple combinations of views [28] [29], while we focus on sampling from multiple types of features (points and lines). In this section, we derive the success probability of the three sampling strategies to provide insights into sampling in the 2-entity RANSAC.

Let us denote $p$ as the number of point matches, $l$ as the number of line matches, $\lambda$ as the point inlier rate, and $\gamma$ as the line inlier rate:
$m / p = \lambda$, $n / l = \gamma$ $(0 \le \lambda, \gamma \le 1)$, $l / p = \varepsilon$ $(\varepsilon \ge 0)$
where $m$, $n$ denote the number of point inliers and line inliers respectively. Then the success probability of the different sampling strategies during one iteration of RANSAC is derived as follows:
$P_{1p1l} = \frac{m}{p} \cdot \frac{n}{l} = \lambda \gamma$
$P_{2p} = \frac{m}{p} \cdot \frac{m-1}{p-1} = \lambda \cdot \frac{\lambda p - 1}{p - 1}$
$P_{mixed} = \frac{m}{p} \cdot \frac{m+n-1}{p+l-1} = \lambda \cdot \frac{\lambda p + \gamma l - 1}{p + l - 1}$
Then we have
$P_{1p1l} - P_{mixed} = \lambda \left( \gamma - \frac{\lambda p + \gamma l - 1}{p + l - 1} \right) \propto \gamma - (\lambda - a)$
$P_{mixed} - P_{2p} = \lambda \left( \frac{\lambda p + \gamma l - 1}{p + l - 1} - \frac{\lambda p - 1}{p - 1} \right) \propto \gamma - (\lambda - a)$
where $a = \frac{1 - \lambda}{p - 1}$. Generally, $p - 1 \gg 1 - \lambda$ holds for most real world applications, which means $a$ is a small positive number close to 0. Thus, the following conclusion can be derived:
$\gamma \ge \lambda \Rightarrow P_{1p1l} > P_{mixed} > P_{2p}$
Furthermore, with proper scaling, the following conclusion also holds:
$\gamma < \lambda \Rightarrow P_{1p1l} \le P_{mixed} \le P_{2p}$

The percentage of line features and point features is highly related to the environment. According to the above discussion, the mixed sampling strategy is the more robust choice under environmental variations, as its performance is always in between. Empirically, when localizing in an unstructured scene, there may be more point feature matches, which indicates a higher point inlier rate than that of line features, leading to the choice of the 2P sampling strategy. In a structured scene, although there are more line features than in an unstructured scene, there are also many good point features. Considering that the reconstruction error of 3D lines can be higher than that of 3D points, the performance of 1P1L may not be significantly better than that of the 2P method in the structured scene. In order to be adaptive to the changing environment, an automatic selection mechanism is proposed based on statistical estimation of the inlier rates using historical data. Specifically, the map is divided into structured and unstructured segments according to the inlier rates of the point and line features. Then, when the robot is in a segment, the corresponding sampling strategy is selected, e.g. 2P in an unstructured environment.

Fig. 3: Accuracy simulation results: the translation and rotation error for (a) 10, (b) 5, (c) 4, (d) 3 feature matches of different methods. Notice that EPnP and P3P are not computed in the 3 matches case, for they need at least four point matches.
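The one-iteration success probabilities of the three sampling strategies can be checked numerically; a minimal sketch (function name is ours):

```python
def success_prob(p, l, m, n):
    """One-iteration success probabilities of the three sampling
    strategies, given p point / l line matches with m / n inliers."""
    lam = m / p                                  # point inlier rate
    P_1p1l = lam * (n / l)                       # one point + one line
    P_2p = lam * (m - 1) / (p - 1)               # two points, no replacement
    P_mixed = lam * (m + n - 1) / (p + l - 1)    # one point + any other entity
    return P_1p1l, P_2p, P_mixed

# gamma > lambda (reliable lines): 1P1L > mixed > 2P
print(success_prob(100, 50, 30, 20))
# gamma < lambda (reliable points): 2P > mixed > 1P1L
print(success_prob(100, 50, 40, 10))
```

The mixed strategy sits between the other two in both regimes, which is the robustness argument made above.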
V. EXPERIMENTAL RESULTS
The proposed methods are evaluated on both synthetic and real data to demonstrate their advantages over other baseline methods in visual localization. We start with simulation experiments to illustrate the accuracy when the image features are noisy, the sensitivity with respect to inaccurate pitch and roll angles, and the robustness against outliers. Then, on real world data, we demonstrate the success rate comparison to show the effectiveness of the proposed methods and the model selection mechanism.
A. Results with synthetic data
In the simulation experiments, we generated 3D points and lines in the cube [−, ] and computed 2D projections for varying camera poses to get 3D-2D feature matches. For each method, 100 iterations of RANSAC are performed. The final identified inliers are sent to a nonlinear minimization for high accuracy. In the optimization, there are two possible ways: 4DoF or 6DoF optimization. The 4DoF optimization means we fix the pitch and roll angles provided by the IMU, and only optimize the other four variables in the pose. This can be useful when there are few inliers to optimize.

The minimal solutions evaluated in the simulation experiments are EPnP [11], P3P [10], 2P1L [9] and our proposed solutions, including 1P1L-6DoF, 1P1L-4DoF, 2P-6DoF and 2P-4DoF. The mixed-6DoF and mixed-4DoF are also evaluated in the robustness-against-outliers experiment. We use the translation and rotation error to evaluate the estimated result $[R \mid t]$ with respect to the ground truth $[R_{gt} \mid t_{gt}]$. Following [9], for the translation error, we computed $\|t - t_{gt}\| / \|t_{gt}\|$ in percentage. For the rotation error, we first computed the relative rotation $\Delta R = R R_{gt}^T$ and represented it in degrees.

Fig. 4: Sensitivity simulation results: the translation and rotation error for 10, 5, 3 feature matches, respectively.
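The error metrics used in the evaluation can be sketched as follows (our reading: the translation error is taken as a fraction of $\|t_{gt}\|$, and the rotation error as the angle of $\Delta R$):

```python
import numpy as np

def pose_errors(R, t, R_gt, t_gt):
    """Translation error as a fraction of ||t_gt||, and rotation error
    in degrees from the angle of dR = R @ R_gt.T."""
    trans_err = np.linalg.norm(t - t_gt) / np.linalg.norm(t_gt)
    dR = R @ R_gt.T
    cos_angle = np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    return trans_err, rot_err_deg
```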
1) Accuracy:
To quantify the accuracy of the different minimal solutions, we added Gaussian noise with zero mean and varying standard deviations to the 2D projections, and changed the number of feature matches over four levels: 10, 5, 4, 3 (the 10 case means there are 10 point matches and 10 line matches in the scene). The results are shown in Fig. 3. We find that when the feature matches are sufficient (see the 10 case), the proposed methods achieve the same accuracy as the other baseline methods as the standard deviation of the noise increases. As the number of feature matches decreases, the advantage of our 2-entity methods over 2P1L, EPnP and P3P becomes larger. One can notice that when the number of feature matches decreases to 4, the error of our methods is only slightly larger than in the 10 matches case, while that of the others grows severely.
2) Sensitivity:
With the inertial sensor's aid, we reduce the localization problem to 4DoF by leveraging the pitch and roll angles provided by the sensor. Therefore, it is necessary to investigate the influence of the quality of these two angles on the final accuracy. We added Gaussian noise with zero mean and varying standard deviations to the pitch and roll angles and studied the performance of the proposed methods at three feature match levels: 10, 5, 3. The results are shown in Fig. 4.

When there are enough feature matches (see the 10 case), the proposed methods can tolerate almost 25 degrees of noise on both the pitch and roll angles. Empirically, the noise on the pitch and roll angles in real world applications is far smaller, so it should have limited impact on the accuracy. We also note that the 6DoF optimization outperforms the 4DoF one in this case. However, as the number of feature matches decreases, the difference between the 6DoF and 4DoF methods becomes smaller. This is reasonable because when the number of inliers is small, the error caused by the additional DoFs in the optimization may be larger than that caused by noisy attitude estimation. Therefore, if there are very limited reliable feature matches when long term change occurs, 4DoF optimization is a good choice.
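For reference, the 4DoF pose parameterization used throughout, with pitch and roll fixed to the IMU values so that only the yaw and translation remain free, can be composed as follows (a minimal sketch, function name ours):

```python
import numpy as np

def pose_4dof(yaw, pitch_imu, roll_imu, t):
    """4x4 pose with R = Rz(yaw) @ Ry(pitch) @ Rx(roll); in the 4DoF
    setting only yaw and t are optimized, pitch/roll come from the IMU."""
    ca, sa = np.cos(yaw), np.sin(yaw)
    cb, sb = np.cos(pitch_imu), np.sin(pitch_imu)
    cg, sg = np.cos(roll_imu), np.sin(roll_imu)
    R = (np.array([[ca, -sa, 0], [sa, ca, 0], [0, 0, 1]]) @
         np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]]) @
         np.array([[1, 0, 0], [0, cg, -sg], [0, sg, cg]]))
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T
```

Note that only the first two rows of the rotation depend on the yaw; the third row is fixed once pitch and roll are measured, which is what makes the closed form reductions in Section III possible.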
3) Robustness:
To validate the robustness of the proposed methods, we designed experiments varying the outlier rate from 0 to 80% at different inlier levels: 10, 5, 4, 3. We achieve the desired outlier rate by adding outliers, where an outlier is generated by incorrectly associating the features in the original data. When the translation error is less than 10% and the rotation error is smaller than 5 degrees, the localization is considered successful. We ran 200 trials for each method to average the success rate. The results can be seen in Fig. 5. As expected, the proposed 2-entity RANSAC outperforms the other minimal solutions as the outlier rate increases, and the gap is more obvious when the number of inliers decreases. Note that with sufficient inliers, the proposed methods can achieve a success rate of more than 90% even when the outlier rate is 80%.
B. Results with real data
For the real world experiments, the YQ-Dataset is utilized. Three of the sessions were collected during three days in summer 2017, denoted as 2017-0823, 2017-0827 and 2017-0828, and the other was collected in winter 2018, denoted as 2018-0129. We select the 2017-0823 session to build the 3D map and utilize the other three sessions to evaluate the localization performance over different weathers and seasons. The ground truth relative pose is provided by aligning the synchronized 3D LiDAR scans. The evaluated methods are EPnP, P3P, and our methods including 1P1L-4/6DoF, 2P-4/6DoF and mixed-4/6DoF. To get the 3D-2D feature matches between the query image and the map, we perform the following steps:
• Obtain the camera poses and the 3D-2D point matches in the map using visual inertial SLAM software [30].
• Run the Line3D++ algorithm [31] to get the 3D-2D line matches in the map.
• For the query session, we get the 3D-2D point/line matches based on the descriptors of LibVISO2 [32] and
Fig. 5: Robustness simulation results: the success rate for the (a) 10, (b) 5, (c) 4, (d) 3 inliers cases of different methods.
TABLE I: Evaluation of the localization on the selected scenes of YQ-Dataset.
LBD [33], which are fed to the RANSAC with the different solutions.
• For the pitch and roll angles of the query image, we directly use the corresponding values measured by the IMU.

We first counted the correctly localized query images out of all keyframes in the query session using the different methods, and report the successful localization rate in percentage under four translation and rotation error thresholds, as can be seen in Tab. II. As expected, the performance of the methods with 6DoF optimization is still better than with 4DoF optimization. Besides, the 2-entity RANSAC is better than RANSAC with P3P or EPnP, which call for a larger number of matches. The proposed 2P and mixed are obviously better than EPnP and P3P, which indicates the robustness of the proposed methods when dealing with different weathers and seasons. Note that 1P1L is not as good as the point feature based methods; the reason is that there are many trees on both sides of the road in the map environment, so point features are far more abundant than line features.

In order to make a fair comparison between the 1P1L and 2P methods, we manually picked some structured and unstructured segments from the map, of which some images are shown in Fig. 6. There are around 100 places selected in unstructured segments, and 50 places in structured segments. The success rates of localization with the various methods on the selected segments are shown in Tab. I. The results show that the 2P method performs best in the unstructured scene, and the performance of 1P1L is similar to 2P in the structured scene but obviously worse than that of 2P in unstructured segments, confirming that the feature distribution in the environment has an unavoidable effect on the method selection. Besides, from the results of the structured scene in 2018-0129, we can find that when the outlier percentage grows due to the changing season, the better robustness of line feature matches promotes the performance of the 1P1L method.
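The segment-wise strategy selection described in Section IV can be sketched as follows (the segment names and the history format are hypothetical):

```python
def label_segments(history):
    """history: {segment_id: (point_inlier_rate, line_inlier_rate)},
    estimated from previous localization runs on the same map.
    Returns the sampling strategy to use in each segment: prefer 1P1L
    where line matches are more reliable, 2P otherwise."""
    return {seg: ("1P1L" if gam > lam else "2P")
            for seg, (lam, gam) in history.items()}

# e.g. a tree-lined road segment vs. a building segment:
print(label_segments({"road": (0.35, 0.10), "building": (0.20, 0.30)}))
# {'road': '2P', 'building': '1P1L'}
```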
The mixed-6DoF, as derived in theory, gives relatively stable performance compared with 2P and 1P1L in all segments.

To further verify the correctness of the model selection mechanism, we performed an extra leave-one-out experiment on the 2017-0828 session. According to the localization results of the 2017-0827 and 2018-0129 sessions on the 2017-0823 map, we automatically labeled the whole dataset with structured and unstructured segments using the inlier rates of the point and line features acquired from the pose estimated by 2P-6DoF. If the inlier rate of lines was higher than that of points, 1P1L is utilized in that segment; otherwise, 2P. After that, we counted the success rate on the whole 2017-0828 session again over different localization error thresholds and calculated the area under the curve for each method, which is shown in Tab. III. The results show that with the selection mechanism, the performance is further improved as expected.

VI. CONCLUSIONS
In this paper, the visual localization problem is reduced to 4DoF with the aid of inertial measurements. Based on this, the unified 2-entity RANSAC pipeline is developed and validated on both synthetic and real data to demonstrate the robustness and efficiency of the methods. Furthermore, we have investigated the proper mechanism to select the appropriate sampling strategy and showed its practicability in real world applications.

ACKNOWLEDGMENT
This work was supported in part by the National Key R&D Program of China (2017YFC0806501), in part by the Science and Technology Project of Zhejiang Province (2019C01043), and in part by the Science and Technology on Space Intelligent Control Laboratory (HTKJ2019KL502002).

TABLE II: Evaluation on the whole YQ-Dataset.
TABLE III: The automatic model selection experiment.
Method | 1P1L-4DoF | 1P1L-6DoF | 2P-4DoF | 2P-6DoF | mixed-4DoF | mixed-6DoF | selection
AUC | 0.51 | 0.58 | 0.78 | 0.81 | 0.71 | 0.72 | 0.84

REFERENCES
[1] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, pp. 430–443, Springer, 2006.
[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[3] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," 2011.
[4] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[5] X. Ding, Y. Wang, D. Li, L. Tang, H. Yin, and R. Xiong, "Laser map aided visual inertial localization in changing environment," in , pp. 4794–4801, IEEE, 2018.
[6] L. Tang, Y. Wang, X. Ding, H. Yin, R. Xiong, and S. Huang, "Topological local-metric framework for mobile robots navigation: a long term perspective," Autonomous Robots, pp. 1–15, 2018.
[7] F. Dornaika and C. Garcia, "Pose estimation using point and line correspondences," Real-Time Imaging, vol. 5, no. 3, pp. 215–230, 1999.
[8] C. Raposo, M. Lourenço, M. Antunes, and J. P. Barreto, "Plane-based odometry using an RGB-D camera," in BMVC, 2013.
[9] S. Ramalingam, S. Bouaziz, and P. Sturm, "Pose estimation using both points and lines for geo-localization," in ICRA 2011 - IEEE International Conference on Robotics and Automation, pp. 4716–4723, IEEE Computer Society, 2011.
[10] X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng, "Complete solution classification for the perspective-three-point problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 930–943, 2003.
[11] V. Lepetit, F. Moreno-Noguer, and P. Fua, "EPnP: An accurate O(n) solution to the PnP problem," International Journal of Computer Vision, vol. 81, no. 2, p. 155, 2009.
[12] M. Dhome, M. Richetin, J.-T. Lapreste, and G. Rives, "Determination of the attitude of 3D objects from a single perspective view," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 12, pp. 1265–1278, 1989.
[13] H. H. Chen, "Pose determination from line-to-plane correspondences: Existence condition and closed-form solutions," in Proceedings, Third International Conference on Computer Vision, pp. 374–378, IEEE, 1990.
[14] A. Gruen and T. S. Huang, Calibration and Orientation of Cameras in Computer Vision, vol. 34. Springer Science & Business Media, 2013.
[15] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[16] J. Philip, "A non-iterative algorithm for determining all essential matrices corresponding to five point pairs," The Photogrammetric Record, vol. 15, no. 88, pp. 589–599, 1996.
[17] D. Nistér, "An efficient solution to the five-point relative pose problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756–770, 2004.

Fig. 6: Examples of unstructured and structured scenes.

[18] F. Fraundorfer, P. Tanskanen, and M. Pollefeys, "A minimal case solution to the calibrated relative pose problem for the case of two known orientation angles," in
European Conference on ComputerVision , pp. 269–282, Springer, 2010.[19] C. Troiani, A. Martinelli, C. Laugier, and D. Scaramuzza, “2-point-based outlier rejection for camera-imu systems with applications tomicro aerial vehicles,” in
Robotics and Automation (ICRA), 2014 IEEEInternational Conference on , pp. 5530–5536, IEEE, 2014.[20] L. Kneip, M. Chli, and R. Y. Siegwart, “Robust real-time visualodometry with a single camera and an imu,” in
Proceedings of theBritish Machine Vision Conference 2011 , British Machine VisionAssociation, 2011.[21] D. Scaramuzza, F. Fraundorfer, and R. Siegwart, “Real-time monoc-ular visual odometry for on-road vehicles with 1-point ransac,” in ,pp. 4293–4299, Ieee, 2009.[22] B. M. Haralick, C.-N. Lee, K. Ottenberg, and M. N¨olle, “Review andanalysis of solutions of the three point perspective pose estimationproblem,”
International journal of computer vision , vol. 13, no. 3,pp. 331–356, 1994.[23] L. Kneip, D. Scaramuzza, and R. Siegwart, “A novel parametrizationof the perspective-three-point problem for a direct computation ofabsolute camera position and orientation,” 2011.[24] P. Wang, G. Xu, Z. Wang, and Y. Cheng, “An efficient solution to theperspective-three-point pose problem,”
Computer Vision and ImageUnderstanding , vol. 166, pp. 81–87, 2018.[25] T. Ke and S. I. Roumeliotis, “An efficient algebraic solution to theperspective-three-point problem,” arXiv preprint arXiv:1701.08237 ,2017.[26] Z. Kukelova, M. Bujnak, and T. Pajdla, “Closed-form solutions tominimal absolute pose problems with known vertical direction,” in
Asian Conference on Computer Vision , pp. 216–229, Springer, 2010.[27] M. Li and A. I. Mourikis, “High-precision, consistent ekf-based visual-inertial odometry,”
The International Journal of Robotics Research ,vol. 32, no. 6, pp. 690–711, 2013.[28] B. Clipp, C. Zach, J.-M. Frahm, and M. Pollefeys, “A new minimalsolution to the relative pose of a calibrated stereo camera with smallfield of view overlap,” in , pp. 1725–1732, IEEE, 2009.[29] F. Vasconcelos, J. P. Barreto, and E. Boyer, “Automatic cameracalibration using multiple sets of pairwise correspondences,”
IEEEtransactions on pattern analysis and machine intelligence , vol. 40,no. 4, pp. 791–803, 2017.[30] R. Mur-Artal and J. D. Tard´os, “Visual-inertial monocular slam withmap reuse,”
IEEE Robotics and Automation Letters , vol. 2, no. 2,pp. 796–803, 2017.[31] M. Hofer, M. Maurer, and H. Bischof, “Efficient 3d scene abstractionusing line segments,”
Computer vision and image understanding ,vol. 157, pp. 167–178, 2017.[32] A. Geiger, J. Ziegler, and C. Stiller, “Stereoscan: Dense 3d reconstruc-tion in real-time,” in
Intelligent Vehicles Symposium (IV) , 2011.[33] L. Zhang and R. Koch, “An efficient and robust line segment matchingapproach based on lbd descriptor and pairwise geometric consistency,”