Accurate Visual-Inertial SLAM by Feature Re-identification

Xiongfeng Peng, Zhihua Liu, Qiang Wang, Yun-Tae Kim, Myungjae Jeon

Abstract — We propose a novel feature re-identification method for real-time visual-inertial SLAM. The front-end module of state-of-the-art visual-inertial SLAM methods (e.g., the visual feature extraction and matching schemes) relies on feature tracks across image frames, which are easily broken in challenging scenarios, resulting in insufficient visual measurements and accumulated error in pose estimation. In this paper, we propose an efficient drift-less SLAM method that re-identifies existing features from a spatial-temporal sensitive sub-global map. The re-identified features over a long time span serve as augmented visual measurements and are incorporated into the optimization module, which gradually decreases the accumulative error in the long run and further builds a drift-less global map in the system. Extensive experiments show that our feature re-identification method is both effective and efficient. Specifically, when combined with the state-of-the-art SLAM method [11], our approach achieves 67.3% and 87.5% absolute translation error reduction, at only a small additional computational cost, on the two public SLAM benchmarks EuRoC and TUM-VI respectively.
I. INTRODUCTION

Accurate 3D pose estimation of a moving camera is an important task in computer vision and has attracted increasing attention in recent years. It provides a fundamental capability for many applications, such as augmented reality (AR) on smartphones or glasses, robot navigation and autonomous driving.

Visual-inertial simultaneous localization and mapping (VI-SLAM) is one of the most promising approaches to precise navigation in a 3D world through the fusion of cameras and inertial sensors. Compared with visual-inertial odometry (VIO) methods [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], VI-SLAM methods [11], [12], [13], [14], [15], [16], [17], [18] have the advantage of building a global map of 3D sparse features of the surrounding environment, which can effectively bound tracking drift. Patrick G. et al. [12] propose a Schmidt-EKF based VI-SLAM method which employs keyframe-aided 2D-2D feature matching to find reliable correspondences between current 2D visual measurements and 3D map features. Qin T. et al. [13] propose a monocular visual-inertial SLAM which merges the current map with previous maps by loop detection and relocalizes the camera. Campos C. et al. [15] propose a multiple-map system that relies on a new place recognition method with improved recall. The above methods rely on feature matching between two images or image matching with similar poses, i.e.
the baseline and orientation difference between two images cannot be very large, so that direct matching based on feature descriptors such as Bag-of-Words [29] can yield reliable results. There are also methods which try to establish feature correspondence under relatively larger viewpoint changes by multi-frame feature integration [25], [34], or by 2D-3D feature matching between the current frame and a local map constructed from a set of neighboring keyframes on the co-visibility graph [17]. Typically, two neighboring co-visibility graph nodes share a number of common features obtained through either feature tracking or matching. However, such methods heavily rely on feature tracking or matching across image frames in a temporal neighborhood and cannot build sufficient visual measurements, which results in significant camera drift in challenging scenarios such as occlusion or large viewpoint changes. To obtain drift-less camera poses with more effective visual measurements, in this paper we propose a novel visual-inertial SLAM framework with a feature re-identification (ReID) method.

Xiongfeng Peng, Zhihua Liu and Qiang Wang are with SAIT-China Lab, Samsung R&D Institute China-Beijing, China {xf.peng, zhihua.liu, qiang.w}@samsung.com. Yun-Tae Kim and Myungjae Jeon are with Multimedia Processing Lab, Samsung Advanced Institute of Technology, South Korea {ytae.kim, myungje.jeon}@samsung.com.

Fig. 1. Comparison of loop closure with our proposed feature re-identification method. For query frame 776 on the EuRoC DB, the DBoW2 method is used to detect loops, and the top-k retrieved frames are temporally close to the query frame (k=7). However, our proposed feature ReID method can successfully retrieve the early frame 185 and build constraints between them.
The method re-identifies newly detected features in the current frame whose corresponding 3D map points have already been reconstructed previously. Note that previous works on loop closure detection also build connections between the current frame and historical maps/frames. However, our feature ReID method exploits a fused IMU-aided, geometry and appearance constraint which enables re-identifying reliable features over a long time span, while loop closure methods only focus on neighboring co-visibility keyframes before the closed loop is found. Fig. 1 presents a detailed comparison of our feature ReID and the loop closure method.

Fig. 2. Feature ReID in the SLAM system. The left two columns of images show the feature ReID process. The feature pointed to by the red arrow is first detected in the 185th frame and continuously tracked to the 244th frame; the feature is then lost in the 245th frame. In the 776th frame, the feature is re-identified. The right graph is a part of the camera tracking trajectory and positions.

Our proposed feature ReID method is not straightforward, since the features to be re-identified may be subject to large viewpoint and appearance changes. Furthermore, the re-identification method should be computationally efficient, so that it can be performed on every frame to bound the tracking drift. To re-identify global features with computational efficiency, we first build a spatial-temporal sensitive (STS) sub-global map, then re-identify features with pose guidance over a long time span. Finally, the re-identified features provide augmented visual measurements and are incorporated into local and global bundle adjustment (BA) optimization modules for accurate pose estimation. Fig. 2 shows our proposed feature ReID process. Taking the public EuRoC DB V2_03 sequence as an example, a red dot indexed by 668 in the simulated camera motion represents a map point. The point is reconstructed in the 185th frame and tracked to the 244th frame.
The feature track fails in the 245th frame because the feature approaches the image boundary and cannot be tracked correctly. With our feature ReID method, the feature is successfully re-identified in the 776th frame. The view-angle difference between the detection frame and the re-identification frame is 49.6° and the spatial distance between the two frames is about 3.3 meters. See the supplementary material for more details.

In summary, our main contributions are as follows:

• We propose to reconstruct a spatial-temporal sensitive (STS) sub-global map to re-identify features for every frame with high efficiency. The STS sub-global map is reconstructed from the map points in the early keyframes which satisfy multi-view geometry constraints with the current frame.

• A pose guided feature matching method is proposed to establish feature matching. By using a fused IMU-aided, geometry and appearance consistent method, our solution incorporates effective visual measurements into the energy function to bound camera drift.

• Combining our proposed feature ReID method with the baseline method [11], the absolute translation error (ATE) is 3.2 cm and 1.1 cm on the two public DBs EuRoC [26] and TUM VI [27], achieving 67.3% and 87.5% error reduction respectively.

II. RELATED WORK

Tremendous research on visual-inertial SLAM has appeared in the last few years. These methods exploit monocular/stereo vision and IMU information to track the camera pose and map the environment at the same time. Graph-optimization based tightly-coupled visual-inertial SLAM methods jointly optimize camera and IMU measurements with BA or incremental bundle adjustment (IBA) [35] from the raw measurements and have achieved state-of-the-art performance in recent years. This work focuses on graph-optimization based SLAM and closely relates to the following research topics: feature matching, pose graph, loop closure and re-localization. The following sections elaborate on the related works.
A. Feature matching
Feature matching is an important part of a visual SLAM system and it directly impacts both localization accuracy and robustness. There are two types of feature matching methods in state-of-the-art SLAM systems. One is pose-free feature matching, which establishes 2D-2D matches by either KNN search or optical flow tracking. Liu et al. [11] and Qin et al. [2] use the KLT tracker [24] to build 2D-2D correspondences, while recent optical flow methods based on deep learning are also popular [19], [20], [21], [22], [23]. SEVIS [12] employs keyframe-aided 2D-2D feature matching to find reliable correspondences between current 2D visual measurements and 3D map features. Recent deep learning based methods [28], [41], [42] focus on learning better sparse detectors and local descriptors from data using convolutional neural networks (CNNs). With the deep features, Sarlin et al. [40] learn feature matching and outlier filtering by solving a partial assignment problem. The other type of feature matching is pose-assisted 3D-2D matching. The initial pose estimate of the query image comes from a motion model or other sensor measurements, e.g. IMU integration [37]. The extracted 2D features in the query image establish 2D-3D matches with 3D points in the local map, and a 6-DoF pose is estimated with a PnP [43] geometric consistency check within a RANSAC scheme [44].
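For concreteness, the pose-free 2D-2D matching described above can be sketched as follows. This is an illustrative sketch of ours, not code from any cited system: a brute-force KNN search over packed binary descriptors (ORB-like) using Hamming distance, with Lowe's ratio test to reject ambiguous matches.

```python
import numpy as np

def knn_match(desc_a, desc_b, ratio=0.8):
    """Pose-free 2D-2D matching: for each binary descriptor in desc_a,
    find its two nearest neighbours in desc_b by Hamming distance and
    keep the match only if it passes Lowe's ratio test."""
    # Hamming distance table between packed uint8 descriptors.
    dists = np.array([[np.unpackbits(a ^ b).sum() for b in desc_b]
                      for a in desc_a])
    matches = []
    for i, row in enumerate(dists):
        j1, j2 = np.argsort(row)[:2]
        if row[j1] < ratio * row[j2]:  # best match clearly better than 2nd best
            matches.append((i, j1))
    return matches
```

The ratio test is what makes this usable in practice: repeated texture produces two near-equal distances and the match is discarded rather than risked.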
B. Pose graph
A pose graph is defined by co-visibility: two poses are connected to each other if they share enough common features [39]. The co-visibility graph in [17] is represented as an undirected weighted graph. Each node is a keyframe, and an edge between two keyframes exists if they share enough observations of the same map points. Our proposed STS sub-global map is related to the co-visibility graph in that both are reconstructed/built from keyframes or their map points. However, the co-visibility graph is reconstructed from the local neighboring keyframes which have a strong co-visibility relationship with the current frame, which restricts its capability to identify frames over a long time span.

Fig. 3. Our visual-inertial SLAM method pipeline.
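The co-visibility edge construction described above can be sketched as follows. This is illustrative only, not the implementation of [17]; the dictionary representation and the `min_shared` threshold are our assumptions.

```python
def covisibility_edges(observations, min_shared=15):
    """Build co-visibility edges: keyframes are nodes, and an edge links
    two keyframes when they observe at least `min_shared` common map points.
    `observations` maps keyframe id -> set of observed map-point ids."""
    kfs = sorted(observations)
    edges = {}
    for a_idx, a in enumerate(kfs):
        for b in kfs[a_idx + 1:]:
            shared = len(observations[a] & observations[b])
            if shared >= min_shared:
                edges[(a, b)] = shared  # edge weight = number of common points
    return edges
```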
C. Loop closure and Re-localization
Loop closure detection is a key technique in SLAM and is used to reduce camera drift when the camera returns to a previously explored environment. Re-localization recovers from camera tracking failure in a SLAM system with a previously built map. The two tasks are related and many techniques have been proposed in the literature. Approaches based on bag of words (BoW), such as DBoW2 [29], are the most popular for real-time visual SLAM systems [2], [12], [17]. Lynen et al. [32] propose a visual localization method for large-scale scenes and server-side deployment. Recently, ConvNet-based approaches have risen in popularity. Merril et al. [30] learn an auto-encoder from the common HOG descriptors of the whole image. Kuse et al. [33] learn a whole-image descriptor in a weakly supervised manner based on NetVLAD [31].

III. OUR METHOD

The structure of our proposed visual-inertial SLAM system pipeline is shown in Fig. 3. The input streams come from two different kinds of sensors: an IMU and a stereo camera. In the front-end, local features are detected and tracked, and IMU instant measurements are pre-integrated. In the back-end, local BA optimizes the temporally latest frames in a local sliding window, while global BA optimizes all the keyframes and map points to maintain a globally consistent map. Our proposed feature ReID module builds constraints between the front-end and the back-end of the SLAM system to further refine the map and control camera drift.
A. Visual Inertial SLAM
In our method, a stereo camera is used to guarantee that the system launches with a true-scale motion and map. Similar to most visual-inertial SLAM methods, our objective is to estimate the unknown camera-rate state, which includes the camera pose, velocity and IMU bias, as well as the positions of 3D map points in the environment.

Suppose the camera pose is described by T = (R, p). Each 3D point X_j is observed from multiple image frames, and its corresponding 2D measurement in the i-th frame is denoted by z_ij; the visual constraint is then represented by the re-projection error

    E^{vis}_{ij}(T_i, X_j) = \pi(T_i \circ X_j) - z_{ij},

where \pi is the transform from the camera coordinate system to the image coordinate system. Assuming X_j is reconstructed from the s_j-th frame and parameterized by its inverse depth \rho_j, then X_j = T_{s_j}^{-1} \circ \rho_j^{-1} z_{s_j j}. IMU measurements are also important to provide relative motion constraints and are usually pre-integrated [37] to estimate the IMU state (T, M), where M = (v, b) denotes velocity and bias respectively. Following tightly-coupled visual-inertial localization methods [2], [11], a local sliding-window based nonlinear optimization framework processes visual and inertial measurements, and the cost function E_L is defined as:

    E_L = \arg\min_{T_i, M_i, \rho_j} \sum_{i=t}^{t_n} \sum_{j \in V_i} \| E^{vis}_{ij}(T_i, T_{s_j}, \rho_j) \| + \| E^{prior}_t(M_t, T_t) \| + \sum_{i=t}^{t_n-1} \| E^{imu}_{i,i+1}(M_i, M_{i+1}, T_i, T_{i+1}) \|    (1)

where t is the first frame and there are t_n - t + 1 frames in the sliding window. When the oldest frame moves out of the sliding window, its corresponding visual and inertial measurements turn into a prior in local BA. If the frame is a keyframe, the prior becomes relative constraints, which are added to global BA together with the visual and inertial measurements of the keyframe. Global BA runs in parallel to local BA at a relatively lower frequency.
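The inverse-depth re-projection residual E^{vis}_{ij} can be sketched as follows. This is a minimal illustration of ours, not the actual implementation of [11]; it assumes normalized pinhole coordinates and a world-to-camera convention for T = (R, p).

```python
import numpy as np

def transform(T, X):
    """Apply T = (R, p), mapping a world point into the camera frame."""
    R, p = T
    return R @ X + p

def inv(T):
    """Inverse of a rigid transform."""
    R, p = T
    return (R.T, -R.T @ p)

def project(X_cam):
    """Pinhole projection pi(.) to normalized image coordinates."""
    return X_cam[:2] / X_cam[2]

def reprojection_residual(T_i, T_sj, rho_j, z_sj, z_ij):
    """E_vis: back-project the anchor observation z_sj (normalized
    coordinates, inverse depth rho_j) through the source pose T_sj,
    transform the resulting world point into frame i, and compare
    with the measurement z_ij."""
    X_src = np.array([z_sj[0], z_sj[1], 1.0]) / rho_j  # point in source camera
    X_world = transform(inv(T_sj), X_src)              # anchor frame -> world
    return project(transform(T_i, X_world)) - z_ij
```

With a perfectly consistent pose/point/measurement triple the residual is zero; BA minimizes the norm of this residual over all observations.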
The cost function E_G is defined as:

    E_G = \arg\min_{T_i, M_i, \rho_j} \sum_{i=k_1}^{k_m} \sum_{j \in V_i} \| E^{vis}_{ij}(T_i, T_{s_j}, \rho_j) \| + \sum_i \| E^{rel}_i(\{T_k \in L_i\}) \| + \sum_{i=k_1}^{k_m-1} \| E^{imu}_{i,i+1}(M_i, M_{i+1}, T_i, T_{i+1}) \|    (2)

where k_1, k_2, ..., k_m are keyframe indexes. For the IMU term E^{imu}_{i,i+1}, the prior term E^{prior}_t and the relative constraint E^{rel}_i, please refer to [11] and [37] for details. Global BA optimizes all keyframes and map points to maintain a globally consistent map. For state-of-the-art SLAM methods, the visual constraints in (1) heavily rely on feature tracking or matching in a temporal neighborhood and cannot build sufficient visual measurements. In other words, state-of-the-art SLAM methods ignore inter-frame feature matching over long, discontinuous time intervals, which leads to larger accumulative error in the long run. To reduce this error and obtain drift-less camera poses, we propose a feature ReID method to build more reliable visual constraints over a long time interval.

B. Feature ReID
To re-identify features at each frame, an efficient STS sub-global map and pose guided feature matching are proposed in this part. For computational efficiency, only newly detected features in the current frame are re-identified. The process terminates when all points in the STS sub-global map are examined or a certain number of top-ranked features are successfully re-identified.
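The budgeted per-frame loop described above can be sketched as follows. This is illustrative only; `try_reidentify` is a hypothetical stand-in for the pose guided matching test, and the `max_matches` budget is our assumption.

```python
def reidentify_new_features(new_features, sts_map_points, try_reidentify,
                            max_matches=30):
    """Attempt ReID only for features newly detected in the current frame.
    Stops once every STS map point has been examined or `max_matches`
    top-ranked features have been re-identified, bounding per-frame cost."""
    matches = []
    for point in sts_map_points:          # candidate points, oldest first
        if len(matches) >= max_matches:   # early exit keeps cost bounded
            break
        for feat in new_features:
            if try_reidentify(point, feat):
                matches.append((point, feat))
                break
    return matches
```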
STS sub-global map
In a SLAM system, the global map is a map of BA-optimized consistent states, and the map size gradually increases as new scenes are explored. Feature ReID on the whole global map is infeasible due to unbounded map complexity and computational cost. On the other hand, not all keyframes associated with the global map contribute equally to BA optimization. For example, keyframes which are adjacent to the current frame in the temporal domain are less important for providing novel constraints. Among keyframes which spatially overlap with the current frame, those with old timestamps are superior to those with new timestamps, since ReID from an old map point provides a longer time span of the feature track (not necessarily continuous) and less drift. So we propose to construct an STS sub-global map in consideration of both spatial and temporal aspects.

Given the set of all keyframes in the global map, these keyframes are sorted in temporally increasing order. For each keyframe, there are a number of associated 3D map points which are visible from its own viewpoint. We compute the spatial distribution of these 3D map points, which approximately represents the camera view zone. In the same way, we compute the current frame's view zone and calculate the spatially overlapped area with these keyframe view zones. If the area is non-null, the keyframe is a candidate for the STS sub-global map. Then we take temporal information into consideration. Candidate keyframes with older timestamps have top priority in the STS sub-global map, while temporally neighboring keyframes which have more spatial overlap are not in our scope. To prevent redundant 3D map points in the STS sub-global map and bound the complexity, we experimentally define a maximum STS sub-global map size T_map (e.g. T_map = 1000).

Fig. 4 simulates the STS sub-global map reconstruction process in a bird's-eye view. Suppose ψ = {F_{k_i}, F_{k_{i+1}}, ..., F_{k_{i+m}}, ..., F_{rk}} is the set of all keyframes and F_c is the current frame.
The spatial distribution of the keyframe F_{k_i} is represented by a 3D cube [C^-_{k_i}, C^+_{k_i}] = [U_{k_i} - \sqrt{S_{k_i}}, U_{k_i} + \sqrt{S_{k_i}}], shown as a dashed rectangle in the bird's-eye view, where U_{k_i} and S_{k_i} are the mean and variance of the 3D points respectively. For simplicity, the spatial distribution of the current frame F_c is replaced by that of its reference keyframe F_{rk}. If [C^-_{k_i}, C^+_{k_i}] ∩ [C^-_{rk}, C^+_{rk}] ≠ ∅, then the keyframe F_{k_i} has a common view zone with the current frame F_c. As the shaded area in Fig. 4 shows, the keyframes F_{k_i}, F_{k_{i+1}}, F_{k_{i+m+1}}, F_{k_{i+m+2}}, F_{k_{i+m+3}} have a common view zone with the reference keyframe F_{rk} and are considered spatially overlapped keyframes (SOKF) of the current frame F_c. The STS sub-global map points come from the SOKFs with older timestamps, i.e. F_{k_i}, F_{k_{i+1}}, F_{k_{i+m+1}}, which are denoted by dashed circles without a red cross.

Pose guided matching
Based on the reconstructed STS sub-global map, the newly detected features in the current frame are re-identified with a pose guided feature matching method. Compared with the local map tracking module
in [17], our method additionally utilizes IMU pre-integration, geometry filtering and warping modules to guarantee accurate and robust matching.

Fig. 4. The STS sub-global map reconstruction process in a bird's-eye view. The camera tracks in a clockwise direction. The spatial distribution of each keyframe is visualized by a dotted rectangle. Keyframes F_{k_i}, F_{k_{i+1}}, F_{k_{i+m+1}}, F_{k_{i+m+2}}, F_{k_{i+m+3}} spatially overlap (shaded area) with the reference keyframe F_{rk} of the current frame F_c and are considered SOKFs of the current frame. The STS sub-global map points come from the early SOKFs, i.e. F_{k_i}, F_{k_{i+1}}, F_{k_{i+m+1}}.

We exploit IMU-aided camera pose prediction [37] to obtain a relatively accurate pose prediction for finding the correspondence between 3D map points and 2D features. The IMU measurements {(\tilde{ω}_k, \tilde{α}_k) | k = k_1, ..., k_n} in the [i, i+1] time interval, combined with the latest optimized camera pose (R_i, p_i), IMU biases (b_{g_k}, b_{a_k}), velocity v_i and noise terms (η_{g_k}, η_{a_k}), are used to predict the current camera pose (R_{i+1}, p_{i+1}) and velocity v_{i+1} via

    R_{i+1} = R_i \prod_{k=k_1}^{k_n} \exp((\tilde{ω}_k - b_{g_k} - η_{gd_k}) Δt)
    v_{i+1} = v_i + g Δt_{i,i+1} + \sum_{k=k_1}^{k_n} R_k (\tilde{α}_k - b_{a_k} - η_{ad_k}) Δt
    p_{i+1} = p_i + \sum_{k=k_1}^{k_n} [v_k Δt + \frac{1}{2} g Δt^2 + \frac{1}{2} R_k (\tilde{α}_k - b_{a_k} - η_{ad_k}) Δt^2].    (3)

With the reliable camera pose prediction at the (i+1)-th moment, the STS sub-global map points are projected onto the current frame to re-identify features. Then, each keyframe in the STS sub-global map is checked for geometric consistency to avoid mismatches. The relative pose between each keyframe and the current frame is computed, and the map points of each keyframe are re-projected onto the current frame.
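The propagation in (3) can be sketched discretely as follows. This is a sketch of ours with the noise terms dropped and the SO(3) exponential map written out via Rodrigues' formula; it is not the actual implementation of [37].

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    th = np.linalg.norm(w)
    if th < 1e-12:
        return np.eye(3)
    k = w / th
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * (K @ K)

def propagate(R, v, p, gyro, accel, b_g, b_a, dt, g=np.array([0.0, 0.0, -9.81])):
    """Predict rotation/velocity/position at i+1 from the IMU samples over
    [i, i+1], following (3) with the noise terms dropped. `gyro` and `accel`
    are lists of per-sample measurements; biases b_g, b_a are held constant."""
    for w_k, a_k in zip(gyro, accel):
        a_world = R @ (a_k - b_a)  # bias-corrected acceleration in world frame
        # position uses the pre-update velocity v_k and rotation R_k, as in (3)
        p = p + v * dt + 0.5 * g * dt**2 + 0.5 * a_world * dt**2
        v = v + g * dt + a_world * dt
        R = R @ so3_exp((w_k - b_g) * dt)
    return R, v, p
```

A stationary sensor measuring exactly the reaction to gravity stays put, which is a quick sanity check on the sign conventions.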
The keyframe is considered drift-less if most of the projected features are spatially close to the detected feature points of the current frame; otherwise its map points are treated as outliers. Finally, to deal with larger viewpoint and scale changes in feature matching, we first warp the image patches with the assistance of the predicted pose, then compute ORB [38] descriptors and the appearance distance. A feature is successfully re-identified if the minimum distance is smaller than a given threshold T_dist (e.g. T_dist = 50).

C. Visual Constraint Augment
Once the features are correctly re-identified, the measurements are first constrained in the temporally latest sliding window to be optimized in local BA. Suppose AugV_i is the set of correctly re-identified map point indexes for the i-th frame. For all j ∈ AugV_i, the map point ρ_j and its measurements z_ij are constrained by the re-projection error

    E'_L = \arg\min_{\{T_i, \rho_j \mid s_j \notin [t, t_n]\}} \sum_{i=t}^{t_n} \sum_{j \in AugV_i} \| E^{vis}_{ij}(T_i, T_{s_j}, \rho_j) \|.    (4)

The new visual constraints E'_L, together with E_L in (1), are optimized in the sliding window. In (4), the camera states T_i and the points ρ_j reconstructed during [t, t_n] are optimized. For map points ρ_j whose reconstruction frame s_j is outside the sliding window, i.e. s_j ∉ [t, t_n], the points ρ_j are not optimized. These map points anchor the sliding window and ensure that the camera poses and local points in the window are consistent with the global map.

When the oldest frame in the sliding window is a keyframe and it contains newly re-identified features, the visual measurements in the global BA cost function are also augmented, denoted by E'_G in (5). Together with E_G in (2), the cost is optimized:

    E'_G = \arg\min_{\{T_i, \rho_j\}} \sum_{i=k_1}^{k_m} \sum_{j \in AugV_i} \| E^{vis}_{ij}(T_i, T_{s_j}, \rho_j) \|    (5)

where k_1, k_2, ..., k_m are all keyframe indexes. Different from local BA, all points in AugV_i are updated in global BA, which helps maintain a globally consistent map.

D. Feature ReID Verification
To verify the geometric consistency of the newly re-identified features, the average re-projection error E^{vis}(ρ_j) of the map point ρ_j is calculated by

    E^{vis}(\rho_j) = \frac{1}{|Z_j|} \sum_{i=t}^{t_n} \| E^{vis}_{ij}(T_i, T_{s_j}, \rho_j) \|    (6)

where Z_j includes all 2D measurements of the map point ρ_j, i.e. z_{s_j j} ∈ Z_j and z_ij ∈ Z_j, and the operator |·| denotes the set cardinality of Z_j. If E^{vis}(ρ_j) is larger than a given threshold T_rep (e.g. T_rep = 10), the map point ρ_j is discarded, together with all its related measurements.

IV. EXPERIMENTS

TABLE I
QUANTITATIVE COMPARISON OF AVERAGE TIME SPAN (ATS) AND AVERAGE TRACKING LENGTH (ATL) ON EUROC AND TUM VI

Methods                  | EuRoC ATS | EuRoC ATL | TUM VI ATS | TUM VI ATL
Baseline (frames)        | 20.96     | 21.96     | 15.78      | 16.78
Baseline + ReID (frames) | –         | –         | –          | –
Extension Rate           | 106%      | 3.5%      | 401%       | 16.6%

TABLE II
ABLATION STUDY WITH DIFFERENT CONFIGURATIONS. ① DENOTES THE STS SUB-GLOBAL MAP AND ② DENOTES POSE GUIDED MATCHING

DBs    | Configuration       | ATE (cm)
EuRoC  | Baseline            | 9.8
EuRoC  | Baseline + ①        | –
EuRoC  | Baseline + ②        | –
EuRoC  | Baseline + ① + ②    | 3.2
TUM VI | Baseline            | 8.8
TUM VI | Baseline + ①        | –
TUM VI | Baseline + ②        | –
TUM VI | Baseline + ① + ②    | 1.1

To validate our proposed feature ReID method, we calculate the average tracking length and time span and compare with the baseline method. Furthermore, localization accuracy is quantitatively compared with the state-of-the-art SLAM methods on two public DBs: the EuRoC dataset [26] and the TUM VI benchmark [27].

A. Public Datasets and Measurements
EuRoC Dataset
The EuRoC dataset contains 11 sequences recorded by a small-scale hexacopter UAV equipped with a visual-inertial (VI) sensor unit. The VI sensor unit provides WVGA stereo grayscale images with a baseline of 11 cm at a rate of 20 Hz, and inertial data at 200 Hz. The dataset is captured in two different scenes: a large machine hall with 5 sequences, and a small-scale room with 6 sequences. The ground truth pose is obtained from a motion capture system. Depending on texture, brightness and UAV dynamics, the sequences are classified as easy, medium and difficult.
TUM VI Benchmark
The TUM VI benchmark provides several types of sequences, such as rooms, corridors, magistrales, outdoors and slides. The room sequences are representative of typical AR/VR applications, where the user moves wearing a head-mounted device in a small environment. We evaluate the state-of-the-art methods on this scene because it provides ground truth data over whole trajectories. The image resolution is 512x512 at 20 Hz and the IMU measures at a rate of 200 Hz. An OptiTrack motion capture system records accurate ground-truth poses at a high frame rate of 120 Hz. In these sequences, the camera moves in circular motions in a small-scale room.
Measurements
To validate the effectiveness of feature re-identification, we calculate the time span (TS) and tracking length (TL) of each map point.
TS(ρ_j) = i − s_j is defined as the difference between the newest measurement frame index i and the reconstruction frame index s_j. TL(ρ_j) = |Z_j| counts all 2D measurements z_ij ∈ Z_j corresponding to a 3D point ρ_j. To evaluate SLAM localization accuracy, the absolute translation error (ATE) in [36] is used to compare our method with the state-of-the-art methods. When testing all DBs, we use the same parameter settings.

TABLE III
AVERAGE TIME COST COMPARISON OF EACH SLAM MODULE FROM ORB-SLAM3, BASELINE AND OURS ON EUROC AND TUM VI DB

DBs    | ORB-SLAM3 Tracking(ms) | Baseline Tracking(ms) (Front-end+LBA) | Baseline GBA(ms) | Ours Tracking(ms) (Front-end+ReID+LBA) | Ours GBA(ms)
EuRoC  | 51.46                  | –                                     | –                | –                                      | –
TUM VI | –                      | –                                     | –                | –                                      | –
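The TS and TL measurements defined above can be computed per map point as follows. This is a toy sketch of ours with hypothetical data structures: each map point is a pair of its reconstruction frame index s_j and the list of frame indices holding a measurement of it.

```python
def time_span_and_track_length(obs_frames, s_j):
    """TS = newest observation frame index minus the reconstruction frame
    index s_j; TL = number of 2D measurements of the map point."""
    return max(obs_frames) - s_j, len(obs_frames)

def average_ts_tl(map_points):
    """ATS/ATL over a map. `map_points` is a list of
    (s_j, [frame indices with a measurement]) pairs."""
    spans, lengths = zip(*(time_span_and_track_length(f, s) for s, f in map_points))
    return sum(spans) / len(spans), sum(lengths) / len(lengths)
```

For the re-identified point of Fig. 2, reconstructed at frame 185 and re-observed at frame 776, TS would be 591 frames even though the track was broken in between.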
B. Ablation Study
TS and TL
We compute the average time span (ATS) and average tracking length (ATL) on EuRoC and TUM VI respectively. As shown in Table I, with our proposed feature ReID method, ATS increases by about 1x and 4x compared with the baseline method [11] on the two DBs. ATL also increases by about 1 frame and 3 frames on average for each map point. We observe that the extension rate differs between the two DBs and closely relates to the environment and camera movements. Compared with the EuRoC DB, the TUM VI benchmark is captured by a loopy camera motion in a relatively small indoor office, where it is easier to re-identify features.
Feature ReID Validation
In this part, we validate the feature ReID method with different configurations and compare with the baseline method. Table II lists the ATE on the two public DBs, and the results show that our feature ReID method achieves 67.3% and 87.5% error reduction compared with the baseline method [11]. Table II also validates that both the STS sub-global map and pose guided feature matching are important components for improving localization accuracy in feature ReID. The STS sub-global map module preserves the map points most relevant to the features in the current frame, which guarantees fast speed and makes correct feature matching possible. The pose guided feature matching module re-identifies more features, which are combined into BA optimization to further decrease the accumulative error and improve localization accuracy.
Time Analysis
The experiments are carried out on a desktop PC with an i7 3.4 GHz CPU and 8 GB memory. The computational time comparison is shown in Table III. We notice that our feature ReID module costs an additional 4 ms ∼

C. Comparison with the state-of-the-art
In this section, we compare our proposed feature ReID method with the state-of-the-art methods on the EuRoC and TUM VI DBs.

In Table IV, we show a quantitative comparison of our feature ReID method with the state-of-the-art methods on the public EuRoC dataset. For SOFT-SLAM [34] and ICE-BA with loop [11], we copy results from their published papers; for VINS [2] and ORB-SLAM3 [15], we run the released source code three times and average the results. Among the 11 sequences, our method dominates 6 sequences and is
comparable with the state-of-the-art methods for the other sequences. Note that all methods are equipped with a loop closure module except ours.

TABLE IV
COMPARISON WITH THE STATE-OF-THE-ART METHODS ON EUROC (ATE, cm)

EuRoC  | VINS w/ loop | SOFT-SLAM w/ loop | ICE-BA w/ loop | ORB-SLAM3 w/ loop | Ours w/o loop
MH_01  | 15.9         | 11                | 3.9            | 3.4               | –
MH_02  | 15.6         | 4.2               | 8              | 3.8               | –
MH_03  | 12.8         | 3.8               | 5              | 2.8               | –
MH_04  | 28.0         | 9.6               | 13             | –                 | –
MH_05  | –            | –                 | –              | –                 | –
V1_01  | 7.8          | 4.2               | 7              | 3.6               | –
V1_02  | 6.8          | 3.4               | 8              | –                 | –
V1_03  | –            | –                 | –              | –                 | –
V2_01  | 5.9          | 7.2               | 6              | 3.6               | –
V2_02  | 9.0          | 6.9               | 4              | –                 | –
V2_03  | –            | –                 | –              | –                 | –

Table V shows the comparison of our feature ReID method with the state-of-the-art methods on the TUM VI Room benchmark. We run the ICE-BA without loop [11] and ORB-SLAM3 [15] source codes three times and average the results. The results of the VINS [2] method are copied from [27]. It can be seen that our method exceeds VINS and ICE-BA, and is comparable with ORB-SLAM3 with loop.

TABLE V
COMPARISON WITH THE STATE-OF-THE-ART METHODS ON TUM VI (ATE, cm)

TUM VI | VINS w/ loop | ICE-BA w/o loop | ORB-SLAM3 w/ loop | Ours w/o loop
Room1  | 7.0          | 13.9            | –                 | –
Room2  | 7.0          | 10.2            | –                 | –
Room3  | 11.0         | 10.0            | 1.2               | –
Room4  | 4.0          | 6.8             | –                 | –
Room5  | –            | –               | –                 | –
Room6  | 8.0          | 3.3             | –                 | –

V. CONCLUSIONS

In this paper, we propose a feature re-identification method to recognize new features from the existing map, and the experimental results validate our method. More importantly, our feature ReID method can be effortlessly applied to existing visual SLAM methods to improve ego-camera pose accuracy, which is of great value for practical applications.