OV^{2}SLAM : A Fully Online and Versatile Visual SLAM for Real-Time Applications
Maxime Ferrera, Alexandre Eudes, Julien Moras, Martial Sanfourche, Guy Le Besnerais
Abstract — Many applications of Visual SLAM, such as augmented reality, virtual reality, robotics or autonomous driving, require versatile, robust and precise solutions, most often with real-time capability. In this work, we describe OV²SLAM, a fully online algorithm, handling both monocular and stereo camera setups, various map scales and frame-rates ranging from a few Hertz up to several hundreds. It combines numerous recent contributions in visual localization within an efficient multi-threaded architecture. Extensive comparisons with competing algorithms show the state-of-the-art accuracy and real-time performance of the resulting algorithm. For the benefit of the community, we release the source code: https://github.com/ov2slam/ov2slam.

I. INTRODUCTION
Nowadays Visual SLAM (VSLAM) is getting more and more mature. Yet, state-of-the-art methods still struggle to simultaneously achieve accuracy, robustness and real-time (RT) capability. In the context of VSLAM, the RT constraint is related to the camera's frame-rate and, in practice, the problem comes from frame losses related to peaks in the processing time. Even if an algorithm processes images faster than or at the camera's frame-rate on average, those peaks imply that information is lost in "RT forced conditions" where, at each time, the most recently received image is processed. As illustrated in Fig. 1 for trajectory MH05 of the EuRoC dataset [1], when tested in such conditions, ORB-SLAM [2] sees its accuracy decrease significantly with respect to "not RT" conditions, where successive frames are all processed whatever the processing delay of a particular frame.

In this paper, we describe OV²SLAM, a fully online VSLAM algorithm able to process stereo or monocular streams, which aims at closing the gap between RT capability, accuracy and robustness. As shown in Fig. 1, OV²SLAM outperforms both ORB-SLAM versions while running at 20 Hz. Besides, a fast configuration of OV²SLAM running at 200 Hz remains highly accurate on this dataset.

It has been well known since PTAM [3] that multi-threading is the key to an efficient VSLAM implementation. We push this principle further by proposing a four-thread architecture (front-end, mapping, state optimization and loop closing). While most of the operations within each thread stem from previously published methods, our contribution mainly lies in a very careful management of these operations so as to minimize drift and save runtime. Besides, the loop closing (LC) thread is based on the online Bag-of-Words (BoW) algorithm iBoW-LCD [4]. The interest of iBoW-LCD, which, to the best of our knowledge, has not been included in a VSLAM implementation before, is that the vocabulary tree employed for loop detection is built incrementally, making it always suited to the current environment.

OV²SLAM has been compared with several competing algorithms on various publicly available datasets, showing state-of-the-art accuracy while being fully RT. We further see OV²SLAM as an open research platform, hence we release our code for the benefit of the community.

∗ This work was supported by the ANR/DGA project MALIN. DTIS, ONERA, Université Paris-Saclay, F-91123, Palaiseau, France. Contact emails: {[email protected]}. IFREMER, Ctr. Méditerranée, Underwater System Unit, CS20330, F-83507, La-Seyne-Sur-Mer, France. {[email protected]}. † Work done while at ONERA.

Fig. 1: Stereo VSLAM results on the MH05 trajectory of the EuRoC dataset. When RT is enforced, the mean ATE of ORB-SLAM is multiplied by 3 and the dispersion of its results increases significantly. While running at 20 Hz, OV²SLAM outperforms both ORB-SLAM results. Besides, a fast version compatible with operation at 200 Hz still leads to a high accuracy.

II. RELATED WORK
VSLAM has been extensively studied in the past decades, mainly because of the affordability of camera systems but also because of the great amount of information provided by imaging systems. PTAM [3] has been one of the first monocular VSLAM algorithms able to employ Bundle Adjustment (BA) while ensuring RT capability on CPU, by cleverly decomposing the VSLAM motion and structure problems, allowing the use of a multi-threaded architecture. At the time of writing, ORB-SLAM [2] is the state-of-the-art VSLAM. We must also consider visual odometry (VO) methods, such as SVO [5] and DSO [6], which do not perform LC. All these algorithms run on CPU and have been publicly released, greatly impacting the robotics and computer vision fields.

The aforementioned algorithms can be divided into two categories: feature-based and direct. Direct approaches jointly estimate the tracking of pixels and the pose of the camera, usually by computing an image alignment through the minimization of a photometric error. On the other hand, feature-based methods (also called indirect methods) separate the tracking and the pose estimation parts, solving them sequentially and relying on a geometric error minimization for pose estimation. We refer the interested readers to [6] for more details on the difference between both approaches.

DSO falls in the direct VO category. It is a monocular method based on the tracking of a sparse set of pixels, carefully taking into account image formation parameters such as gain and lens vignetting. While the proposed OV²SLAM also relies on photometric tracking, it follows an indirect approach, still more robust than direct ones on most of the existing datasets. Indeed, while DSO shows impressive results when used with photometrically calibrated cameras, its accuracy drops quickly when used with regular cameras not providing this information [7].
SVO is a hybrid VO method, initializing its estimated pose in a direct way but then minimizing a re-projection error to refine the estimated pose. It handles both monocular and stereo setups and focuses on speed, but at the expense of robustness, failing quite often in difficult sequences [5].

Finally, ORB-SLAM is probably the most complete open-sourced VSLAM algorithm at the time of writing. It proposes a feature-based method suited for either monocular, stereo or RGB-D camera setups. It relies on the extraction of ORB [8] features and uses them for all the algorithm's tasks. ORB-SLAM is one of the most accurate VSLAM methods to date. Its impressive accuracy comes from its very Structure-from-Motion (SfM) like pipeline, limiting as much as possible the triangulation of new map points by heavily trying to match extracted features to already existing map points. This way, it highly reduces the amount of noise included in the 3D map and relies on the induced covisibility constraints between these map points and their related keyframes for optimal optimization through BA. Furthermore, its LC capability combined with its local map tracking allows it to close both small and large loops, keeping the drift extremely low whenever re-visiting an already mapped area. Yet, this accuracy comes at the price of potentially high run-time requirements, leading to wide drops in terms of accuracy when enforcing RT (see Sec. VIII and [5]). A common issue with feature-based methods is their need for extracting features in every acquired image. In OV²SLAM, we heavily reduce the computational load by limiting the extraction of features to keyframes and tracking them in subsequent frames by minimization of a photometric error. Yet, in opposition to pure direct methods, we use the extracted descriptors for local map tracking as in ORB-SLAM, but only perform this step for keyframes.
This way, we manage to close small loops by re-tracking lost or temporarily occluded map points without impacting the front-end speed, while gaining in robustness and accuracy.

Furthermore, to the best of our knowledge, all the proposed VSLAM algorithms that integrate a LC feature rely on an offline BoW, pre-trained on a given database. In opposition, for the first time we propose to integrate an online BoW approach that incrementally builds its vocabulary tree from the provided descriptors, making it always suited to the current environment and less impacted by biases related to an offline training database [9], [10]. Thus, OV²SLAM benefits from the state-of-the-art performance of iBoW-LCD [4], making it highly efficient in very diverse environments.

III. ARCHITECTURE, ASSUMPTIONS, NOTATIONS
Architecture.
The architecture of OV²SLAM is displayed in Fig. 2: it is based on a careful segmentation of critical and non-critical functions, either for ensuring RT processing or accuracy. The next sections detail each thread and explain how computation is optimized to ensure high performance while respecting RT constraints.
Assumptions.
In this work, we consider calibrated optical systems, where both the intrinsic and the extrinsic parameters are known. The pinhole camera model is used and both radial-tangential and fisheye distortion models are supported. Unlike most current VSLAM methods, we do not apply any rectification or undistortion to the images, as those tend to crop the available field of view. This choice simply leads to storing additional variables, as explained below, but, in return, allows us to fully exploit the acquired images, proving useful for wide field of view cameras.
Notations.
In the following, the camera's pose at timestamp $i$ is represented as a rigid body transformation $T_{wc_i} \in SE(3)$, which can transform the $k$-th 3D map point $\lambda_k^w \in \mathbb{R}^3$, expressed in the world frame $w$, to the current camera's frame: $\lambda_k^i = T_{wc_i}^{-1} \odot \lambda_k^w = T_{c_i w} \odot \lambda_k^w$.

We denote the undistorted camera projection model as $\pi : \mathbb{R}^3 \mapsto \mathbb{R}^2$. The projection of a 3D map point to its undistorted pixel position in the camera's image plane at instant $i$ writes: $x_k^i = \pi(T_{iw} \odot \lambda_k^w)$. In the stereo case, $\pi'$ denotes the right camera projection and the undistorted right image observation is then given by: $x_k'^i = \pi'(T_{rl} \, T_{iw} \odot \lambda_k^w)$, where $T_{rl} \in SE(3)$ represents the rigid body transformation between the left and right camera. We further consider the inverse projection, $\pi^{-1} : \mathbb{R}^2 \mapsto \mathbb{R}^3$, which back-projects undistorted keypoints to normalized coordinates. When detecting or tracking a keypoint at some raw (distorted) pixel position, its undistorted position is immediately computed and stored together with the raw coordinates.

IV. THE VISUAL FRONT-END

The front-end thread is responsible for estimating the pose of the camera in real-time, i.e. at the camera's frame rate.

Fig. 2: Architecture of OV²SLAM. Critical operations in terms of RT processing are highlighted in red. Functions related to the processing of keyframes are grouped inside the magenta frame: while not critical, the faster they run, the better we ensure accuracy.
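As a concrete illustration of the projection notation defined above, the transform-then-project chain can be sketched in a few lines (a minimal sketch: the calibration matrix and point values are made up, and lens distortion is omitted):

```python
import numpy as np

def se3_apply(T, pts):
    """Apply a 4x4 rigid-body transform T in SE(3) to Nx3 points (the odot operator)."""
    return pts @ T[:3, :3].T + T[:3, 3]

def project_pinhole(K, pts_cam):
    """Undistorted pinhole projection pi: R^3 -> R^2, points given in the camera frame."""
    uv = pts_cam[:, :2] / pts_cam[:, 2:3]   # normalized coordinates
    return uv @ K[:2, :2].T + K[:2, 2]      # apply fx, fy, cx, cy

# hypothetical calibration and pose
K = np.array([[450.0, 0.0, 320.0],
              [0.0, 450.0, 240.0],
              [0.0, 0.0, 1.0]])
T_cw = np.eye(4)                        # camera frame coincides with world frame here
lam_w = np.array([[0.5, -0.2, 4.0]])    # one 3D map point in the world frame
x = project_pinhole(K, se3_apply(T_cw, lam_w))
```

With the identity pose, the point at depth 4 m projects to pixel (376.25, 217.5) for this made-up calibration.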
More precisely, we define the front-end tasks as follows: image pre-processing, keypoint tracking, outlier filtering, pose estimation and keyframe creation triggering. The front-end pipeline is fully monocular, limiting all its operations to the frames provided by the left camera in stereo setups.
Image Pre-Processing.
At the reception of a new image, we apply contrast enhancement by means of CLAHE [11], which both increases the dynamic range and limits the intensity changes due to exposure adaptation.
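CLAHE's core mechanism, limiting contrast amplification by clipping the histogram before equalization, can be illustrated with a simplified single-tile variant (real CLAHE, e.g. OpenCV's cv2.createCLAHE, applies this per tile with bilinear blending; the clip fraction below is an arbitrary choice):

```python
import numpy as np

def clipped_hist_equalize(img, clip_frac=0.02):
    """Simplified, single-tile sketch of CLAHE's clipped histogram equalization:
    clip the histogram, redistribute the excess uniformly, then equalize."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    clip = clip_frac * img.size
    excess = np.maximum(hist - clip, 0.0).sum()
    hist = np.minimum(hist, clip) + excess / 256.0  # redistribute clipped mass
    cdf = np.cumsum(hist)
    lut = np.round(255.0 * cdf / cdf[-1]).astype(np.uint8)
    return lut[img]

# a low-contrast image: values confined to [100, 110]
img = (100 + (np.arange(10000) % 11)).reshape(100, 100).astype(np.uint8)
out = clipped_hist_equalize(img)
```

The output spreads the narrow input range over a wider dynamic range while the clipping keeps flat regions from being over-amplified.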
Keypoint Tracking.
Keypoint tracking is performed by means of a guided coarse-to-fine optical flow method. Keypoints are tracked individually using a pyramidal implementation of the inverse compositional Lucas-Kanade (LK) algorithm [12], with a × pixel window and a pyramid scale factor of 2. The tracking process depends on the nature of the keypoint. For 2D keypoints (i.e. those with no prior information on their real 3D position), the initial position is set to their position in the previous frame and we use a four-level pyramid. For a 3D keypoint $k$ (i.e. one already triangulated), its initial position is computed thanks to the projection model $\pi$, its 3D position $\lambda_k^w$ and the prediction of the current pose $T_{wc_i}$, using a constant velocity motion model. We limit the minimization process to the first two levels of the image pyramid (i.e. the ones with highest resolution). Keypoints that we fail to track in this step are then searched for on a four-level pyramid, along with 2D keypoints. This two-stage process reduces the tracking run time while being robust to pose prediction errors or inaccurate 3D map points. We further avoid tracking errors by applying backward tracking, only on the first level of the pyramid, and removing keypoints that are more than 0.5 pixels away from their original position. Tracked keypoints are finally updated and their undistorted coordinates $x_k^i$ are computed from the known camera's intrinsic calibration.

Outlier filtering.
To eliminate outliers that can still occur in the tracking process, we apply a RANSAC filtering based on the epipolar constraint. While many works use the Fundamental Matrix (FM) for this operation [13], we prefer to use the Essential Matrix (EM) here as, in the case of VSLAM, the camera's calibration is known. This allows a faster RANSAC, as only 5 correspondences are required in this case instead of the 7 or 8 usually required for the FM. Furthermore, the estimation of an EM is more robust to planar scenes, which can lead to a degenerate FM [14]. To improve the RANSAC filtering efficiency, we estimate the EM from 3D keypoints only (which are more likely to be reliable than 2D ones) and then use it to filter non-consistent 2D keypoints. This filtering step ensures that there are virtually no outliers left before estimating the camera pose.
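The epipolar test behind this filtering step can be sketched as follows (a toy check using a known relative pose instead of the 5-point RANSAC solver; all values are illustrative):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_pose(R, t):
    """E = [t]_x R, relating normalized coordinates by x2^T E x1 = 0."""
    return skew(t) @ R

def epipolar_inliers(E, x1, x2, thresh=1e-3):
    """Keep correspondences (Nx3 homogeneous normalized coords) whose algebraic
    epipolar residual |x2^T E x1| stays below thresh."""
    res = np.abs(np.einsum('ni,ij,nj->n', x2, E, x1))
    return res < thresh

# two views separated by a pure lateral translation (toy setup)
R = np.eye(3)
t = np.array([0.2, 0.0, 0.0])
E = essential_from_pose(R, t)
P1 = np.array([[0.5, 0.3, 3.0], [-0.4, 0.1, 2.0], [0.2, -0.6, 5.0]])
P2 = P1 @ R.T + t                                  # same points in the second camera frame
x1 = np.hstack([P1[:, :2] / P1[:, 2:3], np.ones((3, 1))])
x2 = np.hstack([P2[:, :2] / P2[:, 2:3], np.ones((3, 1))])
x2[2, 1] += 0.05                                   # corrupt one match (tracking outlier)
mask = epipolar_inliers(E, x1, x2)
```

The two consistent matches pass the constraint exactly, while the corrupted one is rejected.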
Pose Estimation.
Pose estimation is then performed by minimization of the 3D keypoints' reprojection errors using a robust Huber cost function [13] $\|\cdot\|_\varphi$:

$$T_{wc_i}^* = \arg\min_{T_{wc_i}} \sum_k \| x_k^i - \pi(T_{c_i w} \odot \lambda_k^w) \|_{\varphi, \Sigma_{ik}} \quad (1)$$

where $\Sigma_{ik}$ is the covariance associated to $x_k^i$.

This nonlinear optimization is performed with the Levenberg-Marquardt algorithm. The initial guess is of primary importance. We usually start from the pose predicted by the constant velocity motion model. However, it can happen that this prediction is wrong because the motion model we are using is inadequate. Yet, the 3D keypoint tracking offers us a means to detect such situations. More precisely, if the first step of the tracking, which starts from positions predicted thanks to the predicted pose, results in less than half of the keypoints being successfully tracked, we reject the predicted pose. In such a case, we perform a new pose prediction using a P3P RANSAC [15] before applying eq. (1). At the end of the optimization, outliers are filtered using a $\chi^2$ test at 95% on the reprojection errors [2].

Keyframe Creation.
Finally, the front-end thread is in charge of new keyframe creation. Mainly, a new keyframe is created if the number of tracked 3D keypoints w.r.t. the last keyframe gets under a threshold (less than 85% of keypoints tracked) or if a significant parallax is detected (an average of 15 pixels of unrotated keypoint motion). Detection of new keypoints is performed with a grid strategy, using cells of × pixels. We process empty cells by selecting the point with the highest Shi-Tomasi score [16] within the cell, then computing a subpixel refinement of its position. BRIEF descriptors [17] are then computed for all (previously tracked or newly detected) keypoints. At this point, the front-end triggers the mapping thread to further process the created keyframe and continues its operations over the next frame.

Initialization.
Initialization is straightforward in stereo VSLAM, as keypoints can be triangulated thanks to the known extrinsic calibration of the stereo rig. In the monocular case, we robustly compute an Essential Matrix $E$ between the first two created keyframes. We then extract a relative pose from $E$, choose an arbitrary scale and set the pose of the current keyframe. The extracted pose is then used by the mapping thread to initialize the 3D map.

V. MAPPING THREAD
The mapping thread is in charge of processing every new keyframe to create new 3D map points by triangulation and to track the current local map in order to minimize drift. These two tasks do not have the same priority though. We give a higher priority to triangulation, which is critical for keeping accurate pose estimation in the front-end. In practice, triangulation is applied first, then the local map tracking operation is executed, and aborted if a new keyframe is available.
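This priority scheme can be sketched with a simple queue check (a deliberate simplification: the actual implementation aborts local map tracking mid-operation rather than skipping it up front, and the keyframe names below are placeholders):

```python
import queue

def mapping_step(kf_queue, triangulate, track_local_map):
    """One iteration of the mapping loop: triangulation always runs; local map
    tracking only runs when no newer keyframe is already waiting."""
    kf = kf_queue.get()
    triangulate(kf)
    if kf_queue.empty():   # skip the lower-priority task if we are running late
        track_local_map(kf)

kf_queue = queue.Queue()
kf_queue.put("KF1")
kf_queue.put("KF2")
triangulated, tracked = [], []
mapping_step(kf_queue, triangulated.append, tracked.append)  # KF2 waiting: tracking skipped
mapping_step(kf_queue, triangulated.append, tracked.append)  # queue empty: tracking runs
```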
Stereo Matching.
With stereo setups, a stereo matching step is applied. Here, we follow the same coarse-to-fine optical flow strategy already used for keypoint tracking. Both 2D and 3D keypoints are tracked in the right view (i.e. non-triangulated and already triangulated keypoints), both for triangulation and for the creation of additional stereo constraints in future BA steps. This method is efficient but, due to the small basin of convergence of the LK equations, the initial guess is of primary importance. For 3D keypoints, we simply project them into the right view using the estimated keyframe's pose. For 2D keypoints, we initialize our guess using the depth of neighboring 3D keypoints if at least 3 can be found in the surrounding cells, or use the left image pixel position otherwise. Found stereo matches are then filtered by removing matches whose undistorted coordinates are more than 2 pixels away from their corresponding epipolar line.
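The depth-based initial guess boils down to predicting a disparity. A minimal sketch, written for a rectified pair for readability (the paper keeps unrectified images and goes through the full projection model instead; calibration values are made up):

```python
import numpy as np

def right_view_guess(u_left, depth, fx, baseline):
    """Initial horizontal position of a keypoint in the right image given a
    depth prior, via the rectified-stereo disparity d = fx * b / z."""
    return u_left - fx * baseline / np.asarray(depth)

# hypothetical calibration: fx = 450 px, 11 cm baseline
u_guess = right_view_guess(u_left=300.0, depth=4.95, fx=450.0, baseline=0.11)
```

Here a 4.95 m deep point yields a 10-pixel disparity, placing the LK search close enough to its small basin of convergence.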
Temporal Triangulation.
Temporal triangulation is of course essential in the monocular case, as it is the only way to initialize a 3D map. However, we have found it useful to also apply it in the stereo case, for keypoints correctly tracked until the current keyframe but for which no stereo match could be found. The benefit is twofold. First, the 3D points created here will be useful for pose estimation as they are very likely to be tracked correctly in the following images. Second, correct tracking will lead to an accurate 3D position, hence a good initial estimate for the next stereo matching step. All the map points successfully triangulated are then immediately used by the front-end for localization purposes, while their 3D positions will later be refined through BA.
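The textbook counterpart of this step is linear (DLT) two-view triangulation; a minimal sketch with synthetic projection matrices and observations (the real pipeline additionally refines such points through BA):

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two 3x4 projection matrices and its
    observations in normalized image coordinates (homogeneous DLT)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # null-space of A = homogeneous solution
    return X[:3] / X[3]

# first camera at the origin, second one translated 1 m along x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = X_true[:2] / X_true[2]
q = X_true + np.array([-1.0, 0.0, 0.0])   # the point in the second camera frame
x2 = q[:2] / q[2]
X = triangulate_dlt(P1, P2, x1, x2)
```

With noise-free observations the recovered point matches the ground truth exactly.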
Local Map Tracking.
The local map denotes the set of 3D map points observed either by the current keyframe $K_i$ or by one of its covisible keyframes $K_j$ (i.e. a keyframe with at least one shared observation). The goal of the local map tracking is to find out whether 3D map points belonging to this local map but not observed in $K_i$ can be matched to keypoints of $K_i$. Such "re-tracking" operations can be considered as elementary loop closures, limiting the accumulation of drift.

Any such 3D point whose projection onto $K_i$ is less than 2 pixels away from a keypoint defines a candidate match. As a 3D map point might be associated to several descriptors $\{d_j\}$, we compute a distance between the keypoint candidate's descriptor $d_i$ and every descriptor $d \in \{d_j\}$. The candidate is finally accepted if the lowest computed distance is under a threshold. This strategy increases the chance of correctly matching a 3D point to a keypoint, as the descriptors associated to a 3D map point are the ones extracted in the keyframes observing it, thus providing more robustness to appearance and viewpoint changes while still being extremely fast to compute thanks to the binary nature of BRIEF. Successful matches lead to the addition of new links in the covisibility graphs of the impacted keyframes, which will benefit the upcoming local BA.

VI. STATE OPTIMIZATION THREAD
The state optimization thread is in charge of running a local Bundle Adjustment to refine selected keyframes' poses and 3D map point positions. It additionally filters redundant keyframes to limit future local BA runtime.
Local Bundle Adjustment.
The local BA limits the drift coming from the visual measurements' noise by applying a multi-view optimization over the local map. A classical choice is to refine only the poses of the most recent keyframes. In contrast, we follow the ORB-SLAM approach by taking advantage of the covisibility graphs and include in the BA every keyframe that shares at least 25 common observations with the most recent keyframe $K_i$. Every 3D map point observed by any of those keyframes is also included in the BA. We denote the full set of parameters to optimize (poses and 3D map points) as $\zeta_i$. Keyframes not in $\zeta_i$ but observing a map point in $\zeta_i$ are further added as fixed constraints for better conditioning.

3D map points are parameterized as anchored points with an inverse depth [18], [19], that is, a 3D map point $\lambda_k^w$ is defined by its undistorted pixel position $x_k^\alpha$ in its first observing keyframe $T_{w\alpha}$ and its inverse depth $\gamma_k$ relative to this keyframe:

$$\lambda_k^\alpha = \frac{1}{\gamma_k} \pi^{-1}(x_k^\alpha) \quad (2)$$

$$\lambda_k^w = T_{w\alpha} \odot \lambda_k^\alpha \quad (3)$$

Therefore, we limit the complexity of the BA by only having one d.o.f. to optimize per 3D map point.

The BA cost function is made of a collection of 2D reprojection errors, with, in the stereo case, additional terms corresponding to the right camera observations:

$$\zeta_i^* = \arg\min_{\zeta_i} \sum_{j \in K_i} \sum_{k \in L_j} \| x_k^j - \pi(T_{jw} \cdot T_{w\alpha} \odot \lambda_k^\alpha) \|_{\varphi, \Sigma_{jk}} + \beta_j \cdot \| x_k'^j - \pi'(T_{rl} \cdot T_{jw} \cdot T_{w\alpha} \odot \lambda_k^\alpha) \|_{\varphi, \Sigma'_{jk}} \quad (4)$$

where $K_i$ is the full set of keyframes contributing to the BA, $L_j$ is the set of map points in $\zeta_i$ observed by $K_j$, and $\beta_j$ is set to 1 if a map point is also observed in the right camera and to 0 otherwise.

This nonlinear least-squares optimization is solved on-manifold [20] with the Levenberg-Marquardt algorithm. Outliers marked by the robust Huber function are then removed, as done after solving eq. (1), and the optimized states are updated.

Keyframes Filtering.
Next, we apply a filtering step to remove any redundant keyframe. Exploring the current keyframe's covisibility graph, we remove any keyframe for which at least 95% of the observed 3D map points are already observed by at least 4 other keyframes. This limits the growth in the number of states included in the next BA without giving up on accuracy, as removal is only applied to merely informative keyframes. For 3D points anchored in a keyframe getting filtered, we simply switch their anchor to their second observing keyframe.

VII. ONLINE BAG-OF-WORDS BASED LOOP CLOSER
While already being beneficial in small and medium environments, Loop Closing (LC) becomes a key feature for accurate long-term localization in large-scale maps. The LC thread is hence responsible for detecting such loops and conducting relocalization, that is, correcting both the current pose estimates and the estimated trajectory between the current frame and the past keyframe where the LC has been detected. The challenge is to do these operations as often (therefore as quickly) as possible, while rejecting false detections that would significantly corrupt the trajectory.
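The trajectory-correction idea, keeping relative poses while re-anchoring the chain on a corrected pose, can be sketched as follows (a simplification of the actual pose graph optimization, which jointly minimizes relative pose errors rather than propagating a single correction):

```python
import numpy as np

def reanchor_trajectory(poses, T0_corrected):
    """Given a chain of world-from-camera poses (4x4 SE(3) matrices) and a
    loop-corrected pose for the first one, rebuild the chain so that all
    relative poses between consecutive keyframes are preserved."""
    rel = [np.linalg.inv(poses[i]) @ poses[i + 1] for i in range(len(poses) - 1)]
    out = [T0_corrected]
    for T_rel in rel:
        out.append(out[-1] @ T_rel)
    return out

def translation(t):
    """Helper: pure-translation SE(3) matrix."""
    T = np.eye(4)
    T[:3, 3] = t
    return T

# a drifting trajectory along x, corrected by a small loop-closure offset
poses = [translation([0, 0, 0]), translation([1, 0, 0]), translation([2, 0, 0])]
corrected = reanchor_trajectory(poses, translation([0.1, 0.0, 0.0]))
```

The whole chain shifts rigidly with the correction while every inter-keyframe relative pose stays unchanged.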
Online Bag-of-Words.
Bag-of-Words (BoW) approaches have proved to be very efficient for fast LC detection. The vocabulary tree defined within the BoW, along with the "term frequency - inverse document frequency" (tf-idf) weighting and the inverted index, allows fast computation of similarity scores between different keyframes. Yet, most SLAM algorithms make use of offline trained BoW, hence dependent on and biased by the training database. Another strategy is to build the vocabulary tree online [9], [10], using the images acquired so far. The idea here is to create a vocabulary tree that fits the current environment, hence avoiding the potential under-fitting issues of offline trained ones. This is the approach chosen in OV²SLAM, using a modified version of iBoW-LCD [4] to perform LC detection. We stick to the original implementation to find LC candidates but perform our own processing to assess the correctness of these candidates.
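The tf-idf scoring underlying such BoW schemes can be sketched as follows (visual words are stand-in integers here; a real vocabulary tree quantizes binary descriptors into such word ids):

```python
import math
from collections import Counter

def tfidf_vector(words, doc_freq, n_docs):
    """BoW signature: term frequency weighted by inverse document frequency,
    so words seen in many keyframes contribute little to the score."""
    tf = Counter(words)
    n = len(words)
    return {w: (c / n) * math.log(n_docs / doc_freq[w]) for w, c in tf.items()}

def cosine_similarity(u, v):
    """Cosine similarity between two sparse tf-idf vectors (dicts)."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_freq = {1: 10, 2: 5, 3: 2, 4: 2}   # word 1 appears in every keyframe
va = tfidf_vector([1, 1, 2, 3], doc_freq, n_docs=10)
vb = tfidf_vector([1, 2, 2, 4], doc_freq, n_docs=10)
score = cosine_similarity(va, vb)
self_score = cosine_similarity(va, va)
```

Note how word 1, present in every keyframe, receives zero idf weight: ubiquitous words carry no loop-closure evidence.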
Keyframe’s Pre-Processing.
Upon reception of a new keyframe, we perform a feature extraction step. As we do not track many keypoints for the purpose of localization (usually around 200), we extract additional features before updating the vocabulary tree and searching for a LC candidate. Features are extracted with the FAST detector [21] and the BRIEF descriptor, keeping the best 300 features. The vocabulary tree is then updated using both the SLAM features and these new features, and the resulting BoW signature is used to compute a similarity score with previous keyframes.
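Binary descriptors such as BRIEF are compared by Hamming distance, which keeps the downstream matching cheap. A minimal matcher with a ratio test might look like this (the descriptor size and the 0.8 ratio are illustrative choices, not the paper's settings):

```python
import numpy as np

def hamming(d1, d2):
    """Hamming distance between binary descriptors stored as uint8 arrays."""
    return int(np.unpackbits(np.bitwise_xor(d1, d2)).sum())

def ratio_test_match(query, candidates, ratio=0.8):
    """Return the index of the best candidate if it beats the second best by
    the given ratio (Lowe's ratio test), else None (ambiguous match)."""
    order = sorted(range(len(candidates)), key=lambda i: hamming(query, candidates[i]))
    best, second = order[0], order[1]
    if hamming(query, candidates[best]) < ratio * hamming(query, candidates[second]):
        return best
    return None

q = np.zeros(32, dtype=np.uint8)          # a 256-bit descriptor
c0 = np.full(32, 0xFF, dtype=np.uint8)    # 256 bits away from q
c1 = q.copy()
c1[0] = 0x0F                              # only 4 bits away from q
idx = ratio_test_match(q, [c0, c1])
```

The close candidate (index 1) is unambiguously accepted since it is far better than the runner-up.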
Loop Candidate Processing.
If a good LC candidate is found, we first ensure that it is not a false positive. Given the current keyframe $K_i$ and the candidate keyframe $K_{lc}$, we first apply a k-Nearest-Neighbor brute-force matching between all the descriptors in both keyframes. Ambiguous matches are first filtered by a classical ratio test [22]. We then compute an Essential Matrix within a RANSAC scheme in order to keep only the matches satisfying the epipolar geometry. Using the remaining inliers, we compute a pose hypothesis for $K_i$ through a P3P-RANSAC method [15] with the remaining 3D map points observed by $K_{lc}$. If a reliable pose is found, determined by the resulting number of inliers, we get the local map of $K_{lc}$ and search for additional matches in $K_i$, projecting the 3D map points using the P3P-computed pose. The hypothesis pose is then refined based on all the matches found so far using eq. (1), and a final filtering step is performed based on the outliers detected by the Huber robust cost function.

Pose Graph Optimization.
Finally, if at least 30 inliers remain, we validate the LC detection and perform a Pose Graph Optimization (PGO) to correct the full trajectory. PGO is performed over the trajectory's part starting from $K_{lc}$ up to $K_i$ and seeks to minimize the relative pose errors between consecutive keyframes, given their initial relative poses (i.e. the ones before LC detection) and the newly estimated pose for $K_i$. A Levenberg-Marquardt optimization is performed over the $SE(3)$ keyframes' poses and we use a second-order approximation of the Baker-Campbell-Hausdorff formula [23] to compute their respective Jacobians on $\mathfrak{se}(3)$.

As new keyframes might have been created since the start of the current LC detection, we propagate the pose graph corrections to the most recent part of the trajectory, updating keyframes using their previous relative poses and the corrected pose of $K_i$. 3D map points matched between $K_i$ and $K_{lc}$ are then merged, and all the 3D map points first observed by any corrected keyframe are updated through a forward-backward projection to align the 3D map with the LC corrections:

$$\lambda_k^{w*} = T_{w\alpha}^{new} \cdot T_{\alpha w}^{old} \odot \lambda_k^w \quad (5)$$

Loose Bundle Adjustment.
Once the PGO is done, we apply a "loose-BA", which means that we apply a BA optimization only to the part of the map impacted by the LC corrections (i.e. all the 3D map points observed by corrected keyframes and all the keyframes observing these 3D points). Thus, we limit the overhead of applying a full-BA (as done in ORB-SLAM) by optimizing over a subset of keyframes and 3D map points. This strategy leads to a largely decreased run-time for LC involving recent keyframes. As the loose-BA optimization might still take a few seconds to be performed, new keyframes and map points added meanwhile have to be updated as well to keep the trajectory consistent. This is done in the same way as for the PGO, propagating the correction on the keyframe poses using their previous relative poses and updating the 3D map points they observe through eq. (5).

VIII. EXPERIMENTS
Implementation.
The proposed method has been developed in C++, runs on CPU only and relies on the ROS middleware for data reading and visualization. Feature-related operations are performed using the OpenCV library, multi-view geometry computations are run with OpenGV [24] and nonlinear optimization is applied using the Ceres library. In all experiments, the covariance associated to the visual measurements $x_k^i$ is set to the identity. In addition to the settings detailed throughout the paper, we propose a lightweight configuration called OV²SLAM-Fast, able to process sequences at hundreds of Hertz. To reach such performance, we modify the settings of OV²SLAM as follows: we disable LC, we use FAST [21] for keypoint detection and we increase the grid's cell size to 50×50 pixels.

We run all our experiments on a high-end laptop equipped with an Intel Xeon (8 threads @3.00 GHz / 32 GB RAM), except for the EuRoC dataset in the monocular setup, for which an i5 (4 threads @2.20 GHz / 8 GB RAM) laptop is used. A run time analysis is performed on sequence MH03 of EuRoC and Table I shows the average timings of the most important functions of the algorithm, obtained on both architectures with both OV²SLAM and OV²SLAM-Fast. The full front-end thread computation time is upper bounded by the sum of "Front-End Tracking" and "Keyframe Creation", showing that the usual acquisition rates of 20-30 Hz are easily supported by either architecture, and that processing can even go up to several hundreds of Hertz when running OV²SLAM-Fast. The configuration files used for running all the following experiments are included in the open-source repository.
Timings (ms) | Front-End Tracking | Keyframe Creation | Mapping Thread | Local BA | LC Detection
Intel Xeon, 8 threads @3.00 GHz, 32 GB RAM, OV²SLAM w. LC | 8.01 | 6.82 | 24.90 | 73.36 | 42.76
Intel Xeon, 8 threads @3.00 GHz, 32 GB RAM, OV²SLAM-Fast | 6.05 | 5.22 | 11.37 | 18.68 | -
Intel i5, 4 threads @2.20 GHz, 8 GB RAM, OV²SLAM w. LC | 16.49 | 14.02 | 43.87 | 113.35 | 80.16
Intel i5, 4 threads @2.20 GHz, 8 GB RAM, OV²SLAM-Fast | 7.61 | 7.07 | 12.63 | 27.10 | -

TABLE I: Timings of the main functions of OV²SLAM in stereo mode.
Datasets.
We evaluate OV²SLAM on the widely used benchmarking datasets EuRoC [1] and KITTI [25], as well as on the very recent TartanAir dataset [26]. The EuRoC dataset is dedicated to the localization of Micro-Aerial-Vehicles (MAV) in small to medium scale environments, 6 sequences being acquired in two small rooms (3 seq. per room) and 5 sequences acquired in a factory hall. For each of these environments, the sequences were acquired with increasing difficulty, by performing more and more aggressive motions with the MAV. The KITTI dataset is dedicated to autonomous car driving and is the main dataset used to evaluate the capacity of SLAM methods to handle large-scale environments, some trajectories being several kilometers long. In contrast to these real-world datasets, the TartanAir dataset is a photo-realistic one, created with the Unreal engine. It is meant to push the limits of VSLAM by proposing sequences with extremely diverse environments, motion patterns and lighting conditions (day / night, sunny / rainy, ...). Either the ATE or the RPE metric [27] is used for evaluation on these datasets, depending on what has classically been used in the literature. Figures of the resulting trajectories are available in the supplementary material.

A. EuRoC MAV Dataset
We first compare the stereo version of OV SLAM with thestereo VSLAM methods ORB-SLAM and SVO as well as thestereo VI-SLAM algorithms OKVIS [28], Vins-Fusion [29]and Basalt [30] on EuRoC [1] in Table II. The results for theVI-SLAM methods are reported from [30] and were obtainedenforcing RT processing. Results for SVO are reported fromthe authors’ evaluation (Table I in [5]) which were obtainedin RT. For ORB-SLAM, we both report the non RT resultsand the ones that we obtained when enforcing RT processing,keeping the same settings as those proposed by the authors.All the reported results are the median accuracy obtainedover 5 runs and we do not use the sequence V2 03 becausehundreds of images from the right camera are missing. Asthe reader can see, running the stereo version of ORB-SLAM when enforcing RT leads to big drops in terms of accuracy –being even more significant given the high-end laptop usedfor running the experiments. Taking the sequences acquiredin the factory hall (MH XX), we can see that OV SLAMoutperforms all the RT methods, including the VI-SLAMones. Furthermore, even OV SLAM without LC manages tooutperform all the methods on all but one sequence. One canalso notice that the performances in terms of accuracy arevery close to ORB-SLAM not running in RT. On the smallroom sequences (VX XX), both version of OV SLAM havecomparable results to OKVIS and Vins-Fusion but SVO andBasalt mostly get better results. This is mainly due to the factthat these rooms are extremely low-textured, making directVSLAM methods more efficient and the use of an IMU canbe beneficial on these sequences.In order to highlight the RT performances of OV SLAM-Fast, we run it on EuRoC, playing the sequences at higherrates than the real one. More specifically, we run thesequences from 5 to 20 times the original rate. In thisexperiment, the factory sequences, we start running thesequences right after the aggressive motions performed forthe means of VI-SLAM initialization . 
The obtained results are reported in Table III. As one can see, we get impressive results, reaching almost the same accuracy as when processing the sequences in real-time (i.e. at 20 Hz) on the easiest sequences at rates up to 400 Hz. Moreover, comparing Table III to Table II, we can observe that the results obtained up to 200 Hz are very close to the ones obtained with ORB-SLAM at 20 Hz. Finally, we compare the monocular version of OV SLAM to ORB-SLAM, SVO and DSO. Neither DSO nor SVO implements LC, so we compare to the results reported in [5], obtained without LC for ORB-SLAM, both with and without enforcing RT processing. The reported results from [5] were obtained on an i7 @2.80 GHz architecture. To be fair, we use the consumer-grade laptop ([email protected] GHz) here to run OV SLAM. The results are reported in Table IV. Once again, OV SLAM mostly outperforms the other methods on the factory sequences. It also shows high robustness, handling almost all the sequences in the low-textured rooms with competitive accuracy. DSO is here the closest method to OV SLAM in terms of performance.
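For reference, the ATE rmse used in these tables can be computed as a closed-form least-squares alignment (Umeyama's method) of the estimated positions onto the ground truth, rigid for stereo and with an additional scale factor for monocular, followed by the RMSE of the translational residuals. A minimal numpy sketch; the function name and array conventions are ours, not from the paper:

```python
import numpy as np

def ate_rmse(gt, est, with_scale=False):
    """ATE rmse between ground-truth and estimated positions (N x 3 arrays),
    after least-squares rigid (or, with with_scale=True, similarity)
    alignment following Umeyama's closed-form solution."""
    mu_g, mu_e = gt.mean(0), est.mean(0)
    G, E = gt - mu_g, est - mu_e                 # centered point clouds
    U, D, Vt = np.linalg.svd(G.T @ E / len(gt))  # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                             # avoid reflections
    R = U @ S @ Vt                               # optimal rotation
    s = (D * S.diagonal()).sum() * len(gt) / (E ** 2).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e                      # optimal translation
    residuals = gt - (s * (R @ est.T).T + t)     # per-frame alignment error
    return np.sqrt((residuals ** 2).sum(1).mean())
```

For monocular evaluation, `with_scale=True` resolves the scale ambiguity before the residuals are computed, so only the unrecoverable drift is measured.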
B. KITTI Dataset
In this section, we evaluate OV SLAM on KITTI [25] and compare to ORB-SLAM on KITTI's training set. Results are reported in Table V and highlight the fact that ORB-SLAM is highly impacted by the run-time requirement of a sequence, its accuracy decreasing significantly. In contrast, we show that OV SLAM is barely impacted by the run-time requirements here and clearly outperforms ORB-SLAM when real-time is enforced. We further report the results of open-sourced VSLAM methods on KITTI's online benchmark in Table VI and show that, at the time of writing, OV SLAM is the best open-sourced method on this dataset.

TABLE II: Comparison of ATE rmse (m) for stereo methods on EuRoC (20 Hz). For RT VSLAM methods, best results are shown in bold blue and second best results in bold. If best, ORB-SLAM not-RT results are shown in bold italic and VI-SLAM ones in blue. * indicates frequent failures.

            OV SLAM-Fast                                 OV SLAM no LC
Seq.    100 Hz   150 Hz   200 Hz   300 Hz   400 Hz      20 Hz
MH 01   0.061    0.059    0.070    0.060    0.066       0.054
MH 02   0.056    0.068    0.061    0.051    0.056       0.041
MH 03   0.113    0.121    0.337    x        x           0.058
MH 04   0.256    0.214    0.263    0.233*   1.840*      0.116
MH 05   0.165    0.165    0.192    0.223*   1.420*      0.140
V1 01   0.186    0.257    0.179    0.498    0.353       0.093
V1 02   0.590    0.578    0.624    x        x           0.088
V1 03   x        x        x        x        x           0.290
V2 01   0.117    0.126    0.103    0.121    0.780       0.092
V2 02   0.271    0.336    x        x        x           0.098

TABLE III: Comparison of ATE rmse (m) for stereo OV SLAM-Fast when run on EuRoC at high rates. * indicates frequent failures.
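The translational error t_rel reported on KITTI (Table VI) averages end-point drift over sub-trajectories of 100 m to 800 m. A simplified numpy sketch is given below; the official KITTI devkit additionally uses the full relative poses (including rotation, which also yields r_rel) and subsamples the start frames, so this translation-only version is only an approximation for illustration:

```python
import numpy as np

def kitti_t_rel(gt_xyz, est_xyz, lengths=(100, 200, 300, 400, 500, 600, 700, 800)):
    """Simplified sketch of KITTI's translational error t_rel (%): for every
    start frame and every segment length, compare the end-point drift of the
    estimate against the ground-truth distance travelled over that segment."""
    # cumulative ground-truth path length at each frame
    dist = np.concatenate(([0.0], np.cumsum(
        np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1))))
    errors = []
    for first in range(len(gt_xyz)):
        for length in lengths:
            # first frame whose travelled distance reaches `length` meters
            last = np.searchsorted(dist, dist[first] + length)
            if last >= len(gt_xyz):
                continue  # segment runs past the end of the sequence
            gt_seg = gt_xyz[last] - gt_xyz[first]
            est_seg = est_xyz[last] - est_xyz[first]
            errors.append(np.linalg.norm(est_seg - gt_seg) / length * 100.0)
    return float(np.mean(errors)) if errors else float("nan")
```

On a trajectory whose estimate is uniformly 1% too long, this sketch returns a t_rel of about 1%, which matches the intuition behind the numbers in Table VI.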
C. TartanAir Dataset
This section details the results obtained on the TartanAir dataset [26]. More specifically, we report the results obtained on the sequences used for the SLAM Challenge organized within the Visual SLAM workshop at CVPR 2020. This challenge was divided into two tracks, one for monocular VSLAM and one for stereo VSLAM. The results were averaged over 16 sequences, half considered easy and the other half hard. The results are shown in Table VII and come both from the official benchmark website (https://tinyurl.com/y3ggvvta) and from the post-challenge technical discussion (https://tinyurl.com/y3nh6yuo). As can be seen, the best performing method used Colmap [33], a Structure-from-Motion (SfM) library, along with deep features [34], [35] for matching. The second best performing method was based on Voldor [36], a dense VSLAM method that computes residual flow for pose estimation, along with Colmap to scale the estimated translation. OV SLAM ranks 3rd on both the monocular and stereo tracks of this challenge and is actually 1st if we only consider online methods, i.e. methods that only use past and present information to perform their estimations. The challenge's organizers ran ORB-SLAM on both tracks to provide a baseline (without enforcing RT) and, as one can see, the performance obtained with OV SLAM is dramatically better, highlighting its robustness to challenging and very diverse environments, even without enabling the LC feature. Furthermore, we ran OV SLAM at 20 Hz before submitting the estimated trajectories on these sequences, and it compares very competitively to offline SfM methods while not having their run-time burden (the team ranked 1st reports half an hour to process a sequence of 1000 images).

TABLE IV: Comparison of ATE rmse (m) for monocular methods on EuRoC (20 Hz). For RT methods, best results are shown in bold blue and second best results in bold. ORB-SLAM not-RT results are shown in bold italic if best. * indicates frequent failures.

TABLE V: Comparison of ATE rmse (m) for the stereo versions of ORB-SLAM and OV SLAM on KITTI training sequences played RT (10 Hz) and half RT (5 Hz). Best results are shown in bold and ORB-SLAM not-RT results are shown in bold italic if best. * indicates frequent failures.

Method              t_rel (%)   r_rel (deg/m)
OV SLAM w. LC       0.94        0.0023
OV SLAM no LC       0.98        0.0023
Vins-Fusion [29]    1.09        0.0033
ORB-SLAM [2]        1.15        0.0027
S-PTAM [31]         1.19        0.0025
RTAB-Map [32]       1.26        0.0026

TABLE VI: Comparison of translational rmse (%) and rotational rmse (deg/m) on KITTI's online benchmark for open-sourced stereo methods.

                                   Monocular Track       Stereo Track
Method                             ATE (m)   RPE (m)     ATE (m)   RPE (m)   Online
Colmap + SuperPoint + SuperGlue    -         -           0.119     0.484     no
Colmap + SuperPoint + SIFT         0.34      0.449       -         -         no
Colmap + Voldor                    0.44      1.273       0.177     0.570     no
OV SLAM w. LC                      -         -           0.182     1.025     yes
OV SLAM no LC                      0.51      0.889       0.199     0.804     yes
ORB-SLAM                           3.57      17.700      1.640     2.900     yes

TABLE VII: Comparison of ATE and RPE (m) on TartanAir's online challenge for submitted methods in both the monocular and stereo tracks. Methods are tagged as online if they process sequences sequentially.
IX. CONCLUSIONS

In this work, we have presented OV SLAM, a complete VSLAM algorithm that aims at closing the gap between accuracy, robustness and RT capability. OV SLAM is also designed to be versatile, and we have successfully used it with terrestrial, aerial and pedestrian setups in very different environments (indoor, outdoor). We have detailed the careful design of its architecture, which allows it to respect the RT constraints required by real-world applications without sacrificing accuracy. It further integrates an online BoW method for very efficient loop-closure detection. OV SLAM has been evaluated on several datasets, both in monocular and stereo setups, and shows state-of-the-art performance. By releasing its source code, we hope that it can make an interesting ready-to-use VSLAM research platform.

REFERENCES

[1] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, vol. 35, pp. 1157-1163, 2016.
[2] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, pp. 1255-1262, 2017.
[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in IEEE/ACM International Symposium on Mixed and Augmented Reality, pp. 225-234, 2007.
[4] E. Garcia-Fidalgo and A. Ortiz, "iBoW-LCD: An appearance-based loop-closure detection approach using incremental bags of binary words," IEEE Robotics and Automation Letters, vol. 3, pp. 3051-3057, 2018.
[5] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect visual odometry for monocular and multicamera systems," IEEE Transactions on Robotics, vol. 33, pp. 249-265, 2016.
[6] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 611-625, 2017.
[7] N. Yang, R. Wang, X. Gao, and D. Cremers, "Challenges in monocular visual odometry: Photometric calibration, motion bias, and rolling shutter effect," IEEE Robotics and Automation Letters, vol. 3, pp. 2878-2885, 2018.
[8] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in IEEE International Conference on Computer Vision, pp. 2564-2571, 2011.
[9] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer, "Fast and incremental method for loop-closure detection using bags of visual words," IEEE Transactions on Robotics, vol. 24, pp. 1027-1037, 2008.
[10] T. Nicosevici and R. Garcia, "Automatic visual bag-of-words for online robot navigation and mapping," IEEE Transactions on Robotics, vol. 28, pp. 886-898, 2012.
[11] K. Zuiderveld, "Contrast limited adaptive histogram equalization," in Graphics Gems IV, 1994.
[12] S. Baker and I. Matthews, "Lucas-Kanade 20 years on: A unifying framework," International Journal of Computer Vision, vol. 56, 2004.
[13] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. 2003.
[14] D. Nistér, "An efficient solution to the five-point relative pose problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, pp. 756-770, 2004.
[15] L. Kneip, D. Scaramuzza, and R. Siegwart, "A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2969-2976, 2011.
[16] J. Shi and C. Tomasi, "Good features to track," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 593-600, 1994.
[17] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua, "BRIEF: Computing a local binary descriptor very fast," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 1281-1298, 2011.
[18] J. Civera, A. J. Davison, and J. M. Montiel, "Inverse depth parametrization for monocular SLAM," IEEE Transactions on Robotics, vol. 24, pp. 932-945, 2008.
[19] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, pp. 1004-1020, 2018.
[20] P.-A. Absil, C. G. Baker, and K. A. Gallivan, "Trust-region methods on Riemannian manifolds," Foundations of Computational Mathematics, vol. 7, pp. 303-330, 2007.
[21] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, pp. 430-443, 2006.
[22] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, 2004.
[23] T. D. Barfoot, State Estimation for Robotics. 2017.
[24] L. Kneip and P. Furgale, "OpenGV: A unified and generalized approach to real-time calibrated geometric vision," in IEEE International Conference on Robotics and Automation, pp. 1-8, 2014.
[25] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354-3361, 2012.
[26] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020.
[27] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573-580, 2012.
[28] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," The International Journal of Robotics Research, vol. 34, pp. 314-334, 2015.
[29] T. Qin, J. Pan, S. Cao, and S. Shen, "A general optimization-based framework for local odometry estimation with multiple sensors," arXiv preprint, 2019.
[30] V. Usenko, N. Demmel, D. Schubert, J. Stückler, and D. Cremers, "Visual-inertial mapping with non-linear factor recovery," IEEE Robotics and Automation Letters, vol. 5, pp. 422-429, 2019.
[31] T. Pire, T. Fischer, G. Castro, P. De Cristóforis, J. Civera, and J. J. Berlles, "S-PTAM: Stereo parallel tracking and mapping," Robotics and Autonomous Systems, vol. 93, pp. 27-42, 2017.
[32] M. Labbé and F. Michaud, "RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation," Journal of Field Robotics, vol. 36, pp. 416-446, 2019.
[33] J. L. Schonberger and J.-M. Frahm, "Structure-from-motion revisited," in IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104-4113, 2016.
[34] D. DeTone, T. Malisiewicz, and A. Rabinovich, "SuperPoint: Self-supervised interest point detection and description," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224-236, 2018.
[35] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, "SuperGlue: Learning feature matching with graph neural networks," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4938-4947, 2020.
[36] Z. Min, Y. Yang, and E. Dunn, "VOLDOR: Visual odometry from log-logistic dense optical flow residuals," in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4898-4909, 2020.

X. SUPPLEMENTARY MATERIAL
This report contains figures of the trajectories estimated in the experiment section of the OV SLAM paper. We provide both the results obtained on the training sequences of the KITTI dataset [25] and those obtained on the EuRoC dataset [1].

XI. EUROC EXPERIMENTS
We compare the stereo version of ORB-SLAM [2] and OV SLAM on the EuRoC dataset with real-time enforced. We show the trajectories obtained on the Machine Hall (MH XX) sequences in Figure 3 and on the Vicon Room (VX XX) sequences in Figure 4. We further display the trajectories estimated with the monocular version of OV SLAM with real-time processing enforced in Figure 5 and Figure 6.

XII. KITTI EXPERIMENTS

We display the trajectories obtained with both ORB-SLAM and OV SLAM while enforcing real-time on the KITTI dataset in Figure 7.
Fig. 3: Trajectories estimated in real-time with stereo OV SLAM and ORB-SLAM on EuRoC Machine Hall (MH XX) sequences: (a) MH01, (b) MH02, (c) MH03, (d) MH04, (e) MH05.

Fig. 4: Trajectories estimated in real-time with stereo OV SLAM and ORB-SLAM on EuRoC Vicon Room (VX XX) sequences: (a) V1 01, (b) V1 02, (c) V1 03, (d) V2 01, (e) V2 02.

Fig. 5: Trajectories estimated in real-time with monocular OV SLAM without LC on EuRoC Machine Hall (MH XX) sequences: (a) MH01, (b) MH02, (c) MH03, (d) MH04, (e) MH05.

Fig. 6: Trajectories estimated in real-time with monocular OV SLAM without LC on EuRoC Vicon Room (VX XX) sequences: (a) V1 01, (b) V1 02, (c) V1 03, (d) V2 01.

Fig. 7: Trajectories estimated in real-time with stereo OV SLAM and ORB-SLAM on the KITTI training sequences: (a) 00, (b) 01, (c) 02, (d) 03, (e) 04, (f) 05, (g) 06, (h) 07, (i) 08, (j) 09, (k) 10.