Panoramic annular SLAM with loop closure and global optimization
Hao Chen, Weijian Hu, Kailun Yang, Jian Bai, and Kaiwei Wang

National Engineering Research Center of Optical Instrumentation, Zhejiang University, Hangzhou 310058, China
Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany
State Key Laboratory of Modern Optical Instrumentation, Zhejiang University, Hangzhou 310027, China
* [email protected]

Abstract:
In this paper, we propose PA-SLAM, a monocular panoramic annular visual SLAM system with loop closure and global optimization. A hybrid point selection strategy is put forward in the tracking front-end, which ensures the repeatability of keypoints and enables loop closure detection based on the bag-of-words approach. Every detected loop candidate is verified geometrically, and the Sim(3) relative pose constraint is estimated to perform pose graph optimization and global bundle adjustment in the back-end. A comprehensive set of experiments on real-world datasets demonstrates that the hybrid point selection strategy allows reliable loop closure detection, and that the accumulated error and scale drift are significantly reduced via global optimization, enabling PA-SLAM to reach state-of-the-art accuracy while maintaining high robustness and efficiency.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement
1. Introduction
Pose estimation is a prerequisite for many applications, e.g., self-driving cars, autonomous robots and augmented/virtual reality. Various sensors can be utilized for pose estimation, such as GPS, IMU, LIDAR and cameras. Among them, the camera is especially favored by researchers due to its small size, low cost and abundant perceived information. Pose estimation using only the continuous images captured by a single camera is called monocular visual odometry (VO). A multitude of VO systems have been presented so far, such as SVO [1] and DSO [2]. They are normally designed for conventional pinhole cameras with a limited field of view (FOV). The PALVO [3] proposed in our previous work is a monocular VO based on a panoramic annular lens (PAL). A PAL can transform the cylindrical side view onto a planar annular image and obtain panoramic perception of a 360° FOV in a single shot [4], as shown in Fig. 1. Benefiting from the panoramic imaging, PALVO can handle some challenging scenarios that are difficult for conventional VO based on pinhole cameras. For example, conventional VO produces unreliable results when rotating with a fast angular velocity, due to the rapid reduction of overlap between adjacent frames, and is greatly affected by dynamic components in the environment because of the limited FOV. Compared with traditional monocular VO, PALVO greatly improves the robustness of pose estimation in real application scenarios [3].

However, there still exist some problems when PALVO runs on a large scale and for a long time. The first one is error accumulation [5]. Since PALVO only maintains a local map consisting of the most recent several keyframes, the pose of each new frame is calculated by tracking the previous frame and the local map. As a result, the errors introduced by each new frame-to-frame motion accumulate over time and cause the estimated trajectory to deviate from the actual path.
Secondly, it is impossible for PALVO to recover the absolute scale, because only bearing information is available from a single camera, i.e., for a monocular VO, the motion and 3D map can only be recovered up to a scale factor. Due to the inevitable errors of pose estimation, the scale of the motion estimated later may differ from that determined at the beginning, which is known as scale drift [6]. As a consequence, although the camera actually revisits a certain place, this cannot be inferred from the estimated trajectory, i.e., PALVO cannot "close the loop".

Fig. 1. (a) Schematic optical path in the PAL block. P and P' represent the object and image points, respectively. (b) A sample image captured by the PAL camera. (c) The cylindrical object space of PAL.

To solve the above problems, we propose Panoramic Annular SLAM (PA-SLAM), which extends the previous PALVO by adapting it as the SLAM front-end to estimate camera poses with local localization consistency, and corrects error accumulation as well as scale drift with loop closure detection and global optimization in the back-end.

Compared with existing monocular visual SLAM systems based on pinhole cameras with a narrow FOV, the proposed PA-SLAM has the following advantages. Firstly, benefiting from the large FOV brought by the PAL, PA-SLAM is less affected by dynamic components in the environment when performing loop closure detection; for pinhole cameras, dynamic objects have a significant influence on the image appearance, which affects loop closure detection [7]. Secondly, the large FOV of the PAL ensures that enough visual features can be extracted in a single shot, so pose estimation and loop closure detection are not affected by a lack of features. Thirdly, due to the cylindrical object space of the PAL (Fig. 1(c)), loop closure detection in PA-SLAM is insensitive to the travel direction, i.e., loop closures can be detected not only when the camera revisits a certain place in the same travel direction, but also in the perpendicular and reverse directions. In contrast, pinhole cameras are mostly forward-looking, and conventional visual SLAM can only detect loop closures in the same travel direction.

The remainder of the paper is organized as follows.
Section 2 reviews the related work. Our algorithms are described in detail in Section 3. In Section 4, extensive experiments are conducted to evaluate the proposed PA-SLAM. Finally, we draw our conclusions in Section 5.
2. Related work
Many visual SLAM systems have been proposed during the last decade. One of the most influential visual SLAM approaches is ORB-SLAM2 [8]. It uses the same ORB features for the tracking, mapping and place recognition tasks. A bag-of-words (BoW) [9] place recognizer built on DBoW2 with ORB features is embedded for loop closure detection. As a feature-based method, ORB-SLAM2 needs to extract ORB features on both keyframes and non-keyframes, and relies on feature matching to obtain data association, which is time-consuming.

Another famous visual SLAM system is LSD-SLAM [10], which utilizes FAB-MAP [11], an appearance-based loop detection algorithm, to detect large-scale loop closures. However, FAB-MAP needs to extract its own features, so no information from the VO front-end can be reused in loop detection. Besides, the relative pose calculation relies on direct image alignment, which means that all the images of past keyframes need to be kept in memory, resulting in large memory costs for long-term operation.

Some researchers have also worked on extending VO to SLAM. For example, LDSO [12] extends DSO [2] by adding loop closure detection and pose graph optimization. As a VO based on the direct method, DSO tracks the pixels with a high gradient in the image through direct alignment in the front-end, and the back-end makes use of a keyframe-based sliding window. LDSO proposed to gear point selection towards repeatable features, which makes it possible to apply a BoW method similar to ORB-SLAM2 for loop closure detection, and to estimate constraints using geometric techniques. Similarly, VINS-Mono [13] also calculates additional feature point descriptors on keyframes and utilizes BoW for loop closure detection. However, LDSO and VINS-Mono only conduct pose graph optimization, and do not perform global bundle adjustment (BA).

Inspired by LDSO and VINS-Mono, we extract additional features and make use of BoW to detect loop closures.
Compared to them, PA-SLAM has three main advantages: (1) the extracted feature points are not all involved in the tracking front-end; only a part of them participate in pose estimation and structure reconstruction, which enables reliable loop closure detection while ensuring computational efficiency; (2) the global BA can be carried out flexibly after pose graph optimization, further improving localization accuracy and global mapping consistency; (3) the loop closure detection of PA-SLAM is direction-insensitive, while visual SLAM based on pinhole cameras can only handle loop closures when traveling in the same direction.
In recent years, many researchers have been exploring the application of panoramic images in positioning tasks, including visual place recognition (VPR), VO and SLAM. For the VPR task, Murillo and Kosecka [14] proposed place recognition utilizing GIST descriptors, which achieved satisfactory performance on large-scale datasets. Cheng et al. [15] presented a panoramic image retrieval method based on NetVLAD [16] to tackle the challenges of various appearance variations between query and database images. Oishi et al. [17] proposed to use panoramic images as one of the multi-modal data sources for robot localization and navigation, during which the panoramic images are matched using hand-crafted features and a sliding window scheme.

For VO and SLAM, some researchers have studied the advantages of a large FOV. For example, SVO, DSO, VINS-Fusion and ORB-SLAM3 have been extended to support fisheye lenses [18–21]. Lin et al. [22] proposed PVO based on the Ricoh Theta V panoramic camera, a multi-camera system composed of two fisheye lenses that produces a 360° FOV through image stitching. Seok et al. presented ROVO [23] and OmniSLAM [24] for a wide-baseline multiview stereo setup with wide-FOV fisheye cameras. Gutierrez et al. [25] developed a real-time EKF-based visual SLAM system for catadioptric cameras. Compared to these works with wide-FOV imaging systems (fisheye lenses, catadioptric cameras and multi-camera panoramic imaging systems), we exploit a PAL in the proposed PA-SLAM, which has the significant advantages of relatively small distortion, single-shot panoramic perception and a compact structure [26]. These advantages make the PAL camera an ideal sensor for localization and perception tasks [27–29].
3. Algorithm
Before going into PA-SLAM in more detail, we briefly review the pipeline of PALVO [3], the previous work that this paper builds upon.
Fig. 2. Framework of PA-SLAM. Keyframes moved out from the local map are managed by the global map, and a BoW database is constructed. Loop candidates are proposed by querying the BoW database and verified geometrically. Once a loop closure is detected successfully, the Sim(3) relative pose constraint between the candidate keyframe (KF) and the current keyframe will be calculated.

PALVO makes use of a sparse direct method, meaning that feature correspondences are not explicitly calculated. During the initialization process, feature points are tracked from frame to frame using Lucas-Kanade feature tracking (KLT) [30], and the essential matrix is calculated to recover the poses and 3D map points of the first two keyframes. In the tracking thread, a coarse-to-fine strategy is adopted to estimate the camera pose for each new frame: firstly, the previous frame is tracked to obtain a coarse pose estimate through photometric error minimization; secondly, the local map is tracked by projecting keypoints into the current frame and optimizing the projection positions; finally, the camera pose is fine-tuned by minimizing the reprojection error. In the mapping thread, a fixed-size local map is maintained, and the depths of keypoints in the local map are updated through a depth filter. When the number of keyframes in the local map exceeds a threshold, the furthest keyframe is discarded.

In this paper, we adapt PALVO as the front-end of PA-SLAM to estimate frame-to-frame camera poses, and correct error accumulation as well as scale drift with loop closure detection and global optimization in the back-end. The tracking and mapping threads are inherited from PALVO. The difference lies in that each keyframe moved out from the local map is not simply discarded but added to the global map with a BoW database, as shown in Fig. 2. Loop closure detection is carried out by querying the BoW database, and the loop candidates are verified geometrically. Once a loop closure is successfully detected, the
Sim(3) transformation between the candidate keyframe and the current keyframe is calculated and added to the pose graph as a constraint. Then, all the keyframe poses in the global map are adjusted by pose graph optimization, followed by global BA.

As mentioned above, the front-end of PA-SLAM is a VO based on a sparse direct method, which features pose estimation via sparse image alignment rather than explicit feature matching. There exist several open challenges in adapting such a direct visual odometry to reuse the existing map. First of all, PALVO does not care about the repeatability of the tracked pixels (keypoints). Thus, if we simply attempt to reuse the tracked keypoints in the front-end and compute descriptors for them, it is likely to result in poor loop closure detection. Secondly, when a loop closure is detected and the inter-frame
Sim(3) transformation computation is carried out, the actual transformation matrix may be quite different from the identity matrix (the initial guess of the optimization process). In this case, sparse image alignment will be invalid.

Fig. 3. The hybrid point selection strategy. (a) ORB features are extracted from new keyframes for loop closure detection, but only a part of them are selected for tracking in the front-end. Some supplementary keypoints with a high image gradient are also involved in tracking if necessary. (b) Keypoint selection. The image is divided into a grid, and for each cell only the ORB keypoint with the highest Harris response is selected for tracking in the front-end. For the cells without ORB feature points, the image gradient in the cell is computed and the pixel with the highest gradient is selected as a supplementary keypoint.

Fig. 4. Upper row: keypoints tracked in the front-end (drawn in green). Lower row: ORB features extracted for loop closure detection (drawn in blue).

Therefore, we propose a hybrid point selection strategy in PA-SLAM. When a frame is selected as a keyframe, new keypoint extraction is carried out before it is sent into the depth filter. The hybrid point selection strategy means that when extracting new keypoints, ORB feature points are preferentially considered, i.e., more ORB feature points are used as keypoints for tracking in the front-end. In areas with insufficient features, pixels with a high gradient are used as supplements.
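As a concrete illustration, the grid-based selection can be sketched as follows. This is a minimal numpy sketch with hypothetical inputs, not the PA-SLAM implementation: the ORB candidates are assumed to be given as (x, y, response) tuples and the gradient magnitude as a precomputed image; in the actual system both come from the detector and the image pyramid.

```python
import numpy as np

def select_keypoints(orb_candidates, grad, cell=40):
    """Hybrid point selection sketch: per grid cell, keep the ORB candidate
    with the highest response (reused ORB keypoint); cells without any ORB
    candidate fall back to the strongest-gradient pixel (supplementary)."""
    h, w = grad.shape
    best = {}  # (row, col) cell -> best ORB candidate in that cell
    for x, y, resp in orb_candidates:
        key = (int(y // cell), int(x // cell))
        if key not in best or resp > best[key][2]:
            best[key] = (x, y, resp)
    reused, supplementary = [], []
    for r in range(h // cell):
        for c in range(w // cell):
            if (r, c) in best:
                x, y, _ = best[(r, c)]
                reused.append((x, y))
            else:  # no ORB candidate here: pick the strongest-gradient pixel
                patch = grad[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
                i, j = np.unravel_index(np.argmax(patch), patch.shape)
                supplementary.append((c * cell + j, r * cell + i))
    return reused, supplementary
```

Every cell thus contributes exactly one tracked keypoint, which keeps the front-end workload bounded while biasing the selection towards repeatable ORB corners.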
This strategy has the following advantages: firstly, ORB feature points are actually FAST corners with good repeatability, and have been proven to be an effective feature for loop closure detection in visual SLAM [8]; secondly, once a loop closure is detected, feature matches can be easily obtained, which is convenient for the geometric check and the inter-frame
Sim(3) transformation computation.

In the implementation, redundant ORB features are extracted from new keyframes so as to ensure the performance of loop closure detection, and all the features are involved in generating BoW image descriptors, as shown in Fig. 3(a). However, not all ORB feature points are picked as depth filter seeds, considering real-time performance. The image is divided into a grid, and for each cell only the ORB keypoint with the highest Harris response is selected for depth recovery. For the cells without ORB feature points, the image gradient in the cell is computed and the pixel with the highest gradient is selected as a supplementary keypoint (same as the original strategy in PALVO) and fed into the depth filter, as shown in Fig. 3(b). Fig. 4 depicts the extracted ORB features for loop closure detection (lower row) and the tracked keypoints in the front-end (including reused ORB keypoints and the supplementary keypoints with a high image gradient, upper row) during one run.

As mentioned above, redundant ORB features are extracted from new keyframes, and then DBoW3 [31] is utilized to transform the ORB feature descriptors into BoW vectors and build a BoW database, which is queried to propose loop candidates for the current keyframe. It is worth noting that loop closures are only retrieved outside the local map, i.e., only the historical keyframes in the global map can be picked.

There may be false positives in loop closure detection via BoW database retrieval. Therefore, a geometric check must be performed for each loop candidate. Here, the geometric check is done via verifying epipolar constraints.
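The essence of the check can be sketched numerically. The snippet below is a hedged illustration rather than the PA-SLAM implementation: it builds an essential matrix E = [t]xR from a known relative pose, back-projects synthetic points into unit bearing vectors, and counts the matches that satisfy the epipolar constraint. In practice E is unknown and is estimated from the matches themselves with the 8-point method inside RANSAC.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def count_epipolar_inliers(x_c, x_r, E, thresh=1e-6):
    """Inliers of the constraint x_c^T E x_r = 0 (Nx3 bearing vectors)."""
    residuals = np.abs(np.einsum('ni,ij,nj->n', x_c, E, x_r))
    return int(np.sum(residuals < thresh))

# synthetic loop pair: points seen from two frames related by p_c = R p_r + t
rng = np.random.default_rng(1)
P_r = rng.uniform(-1.0, 1.0, (50, 3)) + np.array([0.0, 0.0, 4.0])
a = 0.1
R = np.array([[np.cos(a), 0.0, np.sin(a)],
              [0.0, 1.0, 0.0],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([0.5, 0.0, 0.1])
P_c = P_r @ R.T + t
x_r = P_r / np.linalg.norm(P_r, axis=1, keepdims=True)  # back-projected bearings
x_c = P_c / np.linalg.norm(P_c, axis=1, keepdims=True)
E = skew(t) @ R                                          # essential matrix
inliers = count_epipolar_inliers(x_c, x_r, E)            # all 50 matches fit
```

Mismatched pairs leave residuals orders of magnitude above the threshold, which is why a simple inlier count is an effective veto against false loop candidates.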
For each pair of ideally matching feature points $\mathbf{u}_r$ and $\mathbf{u}_c$, the epipolar constraint

$$\pi^{-1}(\mathbf{u}_r)^T \cdot \mathbf{E} \cdot \pi^{-1}(\mathbf{u}_c) = 0 \quad (1)$$

should be satisfied, where $\pi^{-1}(\cdot)$ is the back-projection function, $\mathbf{u}_r$ and $\mathbf{u}_c$ are the pixel coordinates of the matching ORB feature points on the reference frame (candidate) and the current keyframe respectively, and $\mathbf{E}$ is the essential matrix.

Specifically, feature matching is first carried out between the candidate keyframe and the current keyframe, and good matches are selected according to the matching distance. Based on the good matches, the essential matrix is computed using the 8-point method with a random sample consensus (RANSAC) scheme, and the number of inliers is counted. Only if the inlier number is greater than a threshold is the geometric check considered successful.

The same technique is also used in the initialization process of PALVO. The difference lies in that KLT is used to obtain the correspondences between pixels in initialization, while ORB feature matching is used here. This is because the parallax between the loop candidate frame and the current keyframe may be large, so the optical flow cannot be calculated effectively.

Sim(3) computation

If a loop closure is successfully detected, the
Sim(3) relative pose from the loop candidate frame to the current keyframe will be calculated, for which the 3D coordinates of the matching points are required. As mentioned above, not all extracted ORB feature points are fed into the depth filter to recover depth, so we cannot guarantee that every matching point has corresponding 3D coordinates. In view of this, we propose an approximate strategy to obtain the depths of the feature points.

Specifically, for each feature point, if there exists a 3D map point in the same grid cell, the depth of this map point is regarded as the depth of the feature point. Otherwise, we search its 3 × 3 neighboring cells for the nearest map point. With the 3D coordinates of the matching points, the relative pose is then solved as a Sim(3) transformation. To ensure the robustness of the solution, a RANSAC scheme is adopted. The
Sim(3) relative pose indicates the rotation, translation and scale constraints between the loop candidate frame and the current keyframe. By adding this constraint during pose graph optimization, the error accumulation and scale drift in this period of time can be reduced.

In general, the relative pose estimation between adjacent frames in the local map is reliable, but due to error accumulation and scale drift, the error of the global pose gradually increases over time. Pose graph optimization optimizes the pose of each keyframe under the constraints of the relative pose transformations between keyframes. Since the pose estimated in the front-end is in SE(3), it is upgraded to Sim(3) during optimization so as to adjust its scale, and the initial scale is set to 1. The error term in pose graph optimization is

$$\mathbf{e}_{ij} = \ln\left(\mathbf{S}_{ij}\,\hat{\mathbf{S}}_j^{-1}\,\hat{\mathbf{S}}_i\right)^{\vee}, \quad (2)$$

where $\mathbf{S}_i$ represents the Sim(3) pose of keyframe $i$, $\mathbf{S}_{ij}$ denotes the Sim(3) relative pose between keyframes $i$ and $j$, and $\hat{\cdot}$ denotes the estimated value of a variable.

After pose graph optimization, a global BA is performed to fine-tune the 3D coordinates of all map points and the poses of all keyframes in the global map by minimizing the reprojection error. The error term is

$$\mathbf{e}_{mi} = \mathbf{u}_{mi} - \pi\left(\hat{\mathbf{T}}_i \cdot \hat{\mathbf{P}}_m\right), \quad (3)$$

where $\mathbf{u}_{mi}$ represents the observed projection of the 3D map point $m$ in keyframe $i$, $\pi(\cdot)$ is the projection function, $\mathbf{T}_i$ is the pose of keyframe $i$, and $\mathbf{P}_m$ is the 3D coordinate of map point $m$.

It is also important to note that, in order not to interfere with the pose estimation process in the front-end, the estimated poses of active keyframes in the local map are all fixed during pose graph optimization and global BA; only the global poses of the older part of the trajectory will be modified. We utilize g2o, a graph optimization library proposed in [33], for the optimization tasks.
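To make the Sim(3) machinery concrete, the sketch below (plain numpy; the 4x4 matrix form [[sR, t], [0, 1]] is an assumed representation, not the paper's implementation) composes and inverts Sim(3) transforms and evaluates the consistency underlying the residual of Eq. (2): a relative constraint built from consistent poses yields the identity residual, while a scale perturbation of one pose shows up directly in the scale of the residual transform.

```python
import numpy as np

def sim3(s, R, t):
    """Sim(3) as a 4x4 matrix [[s*R, t], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3] = s * R
    T[:3, 3] = t
    return T

def sim3_scale(T):
    """Recover the scale from the sR block (det(sR) = s^3)."""
    return np.cbrt(np.linalg.det(T[:3, :3]))

def sim3_inv(T):
    s = sim3_scale(T)
    R = T[:3, :3] / s
    Ti = np.eye(4)
    Ti[:3, :3] = R.T / s
    Ti[:3, 3] = -(R.T / s) @ T[:3, 3]
    return Ti

def rot_z(a):
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a), np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# two keyframe poses and the relative constraint measured between them
S_i = sim3(1.0, rot_z(0.2), np.array([1.0, 0.0, 0.0]))
S_j = sim3(1.1, rot_z(0.5), np.array([2.0, 1.0, 0.0]))
S_ij = sim3_inv(S_i) @ S_j

# consistent estimates -> the residual transform is the identity (zero error)
res = S_ij @ sim3_inv(S_j) @ S_i

# a 20% scale drift on S_j appears as the scale of the residual transform
S_j_drift = sim3(1.1 * 1.2, rot_z(0.5), np.array([2.0, 1.0, 0.0]))
res_drift = S_ij @ sim3_inv(S_j_drift) @ S_i
scale_err = np.log(sim3_scale(res_drift))  # nonzero, ~ log(1/1.2)
```

This is exactly the quantity a Sim(3) pose graph drives to zero; a full solver additionally needs the logarithm map of Sim(3) for the rotational and translational residual components, which g2o provides.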
Fig. 5. (a) The remote control vehicle equipped with a PAL camera for dataset collection. (b) Perspective images used for comparative experiments are synthesized using a virtual pinhole camera and the reprojection method.

4. Experiments
In the following experiments, we use a self-designed PAL with a 70° × 360° FOV [34] and a global shutter camera, of which the sensor size is 2/3 inch and the imaging resolution is 2048 × 2048. The images are downsampled to 720 ×
720 before being fed into the PA-SLAM system. The PAL videos of real scenarios used in the following experiments are captured using a remote control vehicle equipped with a PAL camera, as shown in Fig. 5(a).

In order to compare with SLAM systems based on the conventional pinhole camera, we use a virtual pinhole camera and the reprojection method to obtain perspective images. This is perfectly feasible as the PAL imaging model follows a clear f-theta law [35]. As shown in Fig. 5(b), the PAL image is first back-projected into 3D space using a calibrated PAL camera model, and then re-projected into a perspective image using a virtual pinhole camera model with a 90° horizontal FOV. In this way, the PAL and perspective image sequences share the same FPS and timestamps, ensuring the fairness of the comparison to the maximum extent. The OmniCalib calibration tool [36] is used to calibrate the PAL camera. Evo [37] and the method proposed in [38] are used to evaluate the trajectories.

In this section, the relationship between the performance of loop closure detection based on PAL images and the number of ORB features is studied. Videos captured by the remote control vehicle are used, and the total length of the trajectory is about 500 meters. We select one image as a keyframe every fixed number of images (set to 30 in this paper), and take all the keyframes as the database to be queried. Then, for each query frame, we use the algorithm described in Section 3.2 for loop closure detection. For each detected loop closure candidate, if the difference of index between the candidate frame and the current query frame is less than the interval number of keyframes (30 in this paper), it is considered a true positive (TP) loop closure; otherwise, it is treated as a false positive (FP). In addition, all the query frames that fail to detect a loop closure are counted as false negatives (FN).

The precision-recall curve is used to characterize the performance of the loop closure detection algorithm.
Precision (P) and recall (R) are calculated as follows:

$$P = \frac{TP}{TP + FP}, \quad (4)$$

$$R = \frac{TP}{TP + FN}. \quad (5)$$

The higher the curve, the higher the recall at the same precision, i.e., the better the performance of the algorithm.

The loop closure detection results with respect to different numbers of ORB features are shown in Fig. 6(a). It can be seen that as the number of ORB features increases from 100 to 3200, the recall rate at 100% precision increases gradually, which indicates that the performance of loop closure detection is positively correlated with the number of ORB features to a certain extent. In order to hold a good trade-off between performance and speed, we set the ORB feature number to 1600 when running PA-SLAM.

Additionally, Fig. 6(b) shows the total number of extracted ORB features, the number of reused ORB keypoints fed into the depth filter, and the number of all tracked keypoints (including supplementary keypoints) when running PA-SLAM on this dataset. It can be seen that the ORB keypoints actually involved in the tracking front-end only account for about 15% of all ORB features, which ensures the running efficiency of PA-SLAM.
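The evaluation protocol above can be sketched as follows (hypothetical helper names; the keyframe-interval convention for TP/FP/FN mirrors the description above):

```python
def classify_loop_detections(detections, n_queries, kf_interval=30):
    """Label detections and count failures.
    detections: list of (query_index, candidate_index) pairs. A candidate
    whose index differs from the query index by less than kf_interval is
    counted as a true positive, otherwise as a false positive; queries that
    return no candidate at all are false negatives."""
    tp = fp = 0
    answered = set()
    for q, c in detections:
        answered.add(q)
        if abs(q - c) < kf_interval:
            tp += 1
        else:
            fp += 1
    fn = n_queries - len(answered)
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    """Eqs. (4)-(5); sweeping a detection threshold traces the P-R curve."""
    p = tp / (tp + fp) if (tp + fp) else 1.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r
```

Varying the BoW score threshold changes the split between TP, FP and FN, and re-evaluating these two functions at each threshold produces the curves in Fig. 6(a).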
Fig. 6. (a) The precision-recall curves of loop closure detection with respect to different numbers of ORB features. The higher the curve, the better the performance of the algorithm. (b) Statistics of extracted ORB features (blue), reused ORB features (red) and all the keypoints tracked in the front-end (yellow) in one run.
Fig. 7. (a) Schematic diagram of the path when collecting the dataset for verifying the direction insensitivity of loop closure detection based on PAL. The path contains a part traveled in the same direction (the green area) and a part traveled in the opposite direction (the blue area). (b) The estimated trajectory and the loop closure detection results produced by PA-SLAM.
In order to verify the insensitivity to the travel direction, we collect another video, whose path is shown in Fig. 7(a). The dataset contains a part traveled in the same direction (the green area) and a part traveled in the opposite direction (the blue area). The estimated trajectory and the successfully detected loop closures (plotted as red line segments) are shown in Fig. 7(b). It can be seen that whether the travel direction is the same or opposite, loop closures can always be correctly detected based on the PAL images, proving the direction insensitivity of loop closure detection in PA-SLAM.
Fig. 8. Accuracy test results. We run PA-SLAM, ORB-SLAM2 and PALVO on the datasets with ground truth, and calculate the (a) absolute trajectory error, (b) accumulated error and (c) scale drift separately.
In this part, we evaluate the accuracy of PA-SLAM and compare it with the previous PALVO [3] as well as ORB-SLAM2 [8], which is a state-of-the-art implementation of visual SLAM. We use ArUco to obtain the ground truth of the 6-degree-of-freedom (DOF) camera pose. ArUco is an open-source library for camera pose estimation using squared markers [39, 40]. The pixel correspondences necessary for pose estimation can be obtained from a single marker. Thus, the camera pose can be calculated separately for each frame, and there is no error accumulation or scale drift over time.

The image sequences used in this test are captured in an office, with paths ranging from 3 meters to 50 meters in length. It is impossible to capture the ArUco marker in all images in the case of large-scale camera movement. Thus, only part of the frames are assigned ground truth. When collecting the datasets, we take the ArUco marker as the start point and the end point of the trajectory, ensuring that frames in the beginning segment and the end segment have ground truth.

The absolute trajectory error (ATE) is utilized as the criterion for accuracy evaluation. Additionally, the accumulated error and the scale drift are evaluated separately. Specifically, we align the tracked trajectory with the beginning segment (B) and the end segment (E) independently, providing two Sim(3) transformations:

$$S^{gt}_b = \mathop{\mathrm{argmin}}_{S \in Sim(3)} \sum_{i \in B} \left\| S p_i - p'_i \right\|^2, \quad (6)$$

$$S^{gt}_e = \mathop{\mathrm{argmin}}_{S \in Sim(3)} \sum_{i \in E} \left\| S p_i - p'_i \right\|^2. \quad (7)$$

The accumulated error ($e_{accu}$) and the scale drift ($e_s$) are then defined as

$$e_{accu} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| S^{gt}_b p_i - S^{gt}_e p_i \right\|^2}, \quad (8)$$

$$e_s = \left| \log\left( \mathrm{scale}\left( S^{gt}_b \left( S^{gt}_e \right)^{-1} \right) \right) \right|. \quad (9)$$

Fig. 8 presents the experimental results, from which one can see that our algorithm achieves the lowest ATE on sequences (2), (3) and (5). On sequence (4), PA-SLAM is slightly inferior to ORB-SLAM2. The same pattern also exists for the accumulated error. As for the scale drift, PA-SLAM achieves the best performance among the three algorithms on sequences (2)-(5). The exception is sequence (1), on which PALVO performs better. This is because the movement scale of this sequence is quite small (the path length of sequence (1) is about 3 meters). Under this circumstance, PALVO maintains good local consistency of the trajectory, and the error accumulation and scale drift are not significant.

The experimental results indicate that the proposed PA-SLAM achieves equivalent or even better accuracy in comparison with ORB-SLAM2, and is greatly improved compared to the previous PALVO. It becomes clear that loop closure and global optimization significantly decrease the error accumulation and scale drift in large-scale and long-term running.

In addition, we also run our algorithm on the dataset used in the accuracy test of PALVO to collect and compare the overall numerical performance. As described in the PALVO paper [3], this dataset was collected in an indoor corridor and contains a total of 5 videos (r1-r5), with paths ranging from 20 meters to 50 meters in length. The start and end points are exactly in the same position.
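The alignments in Eqs. (6)-(7) are closed-form Sim(3) fits. A minimal sketch using the Umeyama method (plain numpy; the trajectory data here are synthetic, not from the paper's datasets) recovers s, R, t from point correspondences; with two such alignments that differ only in scale, the drift of Eq. (9) reduces to |log(s_b / s_e)|.

```python
import numpy as np

def align_sim3(p, q):
    """Closed-form (Umeyama) fit of s, R, t minimising sum ||s*R@p_i + t - q_i||^2."""
    mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)
    P, Q = p - mu_p, q - mu_q
    U, D, Vt = np.linalg.svd(Q.T @ P / len(p))   # cross-covariance
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                           # guard against a reflection
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / ((P ** 2).sum() / len(p))
    t = mu_q - s * R @ mu_p
    return s, R, t

# sanity check: recover a known Sim(3) from a synthetic trajectory segment
rng = np.random.default_rng(2)
p = rng.normal(size=(40, 3))
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
q = 1.5 * p @ R_true.T + np.array([2.0, -1.0, 0.5])
s, R, t = align_sim3(p, q)  # recovers s ~ 1.5, R ~ R_true, t ~ (2, -1, 0.5)
```

Applying this fit to the beginning segment and the end segment separately yields $S^{gt}_b$ and $S^{gt}_e$, after which Eqs. (8)-(9) are direct arithmetic on the two results.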
The loop closure error in percentage is utilized as the criterion for accuracy evaluation. Table 1 presents the quantitative results. As can be seen, the proposed PA-SLAM achieves the lowest loop closure error on r1-r4. On r5, it is inferior to ORB-SLAM2 but still better than SVO, and shows a great improvement compared to PALVO. These experimental results further support our conclusion that PA-SLAM reaches state-of-the-art performance.

Moreover, we also evaluate the frame rate of our algorithm. With loop closing and global optimization, the proposed PA-SLAM is capable of processing frames at 99.1 frames per second (FPS), which is much faster than ORB-SLAM2.
Table 1. Accuracy test results: frame rate (FPS) and loop closure error (%) on sequences r1-r5 for PA-SLAM (99.1 FPS), SVO [1], ORB-SLAM2 [8] and PALVO [3]. (The remaining numerical entries are not recoverable from the extracted text.)
In order to further verify our algorithm and validate its effectiveness and reliability in real applications, field tests are conducted in an outdoor area. We collect a number of videos on the campus, ranging from 190 to 450 meters in length. In these videos, there are a large number of pedestrians, vehicles and other dynamic components, which is challenging for conventional visual SLAM systems. Similarly, the start and end points are kept in the same place, and the loop closure error in percentage is calculated. Table 2 displays the experimental results, and the estimated trajectories are shown in Fig. 9.

As can be seen in Table 2, PA-SLAM achieves lower loop closure errors than PALVO on all of the datasets. As depicted in Fig. 9, the orientation of the remote control vehicle at the start point is approximately perpendicular to that at the end point in S1, and opposite to it in S3. In spite of this, PA-SLAM can still close the loop, further proving the direction insensitivity of loop closure in PA-SLAM. Fig. 10 depicts the ORB feature matching when the vehicle revisits a certain place (a loop closure occurs) with its orientation perpendicular to, opposite to, and the same as that of the first visit, demonstrating the robustness of PA-SLAM in real-world unconstrained scenarios.

Fig. 9. Trajectories produced by PA-SLAM and PALVO on datasets (a) S1 - (d) S4. The start and end points of each trajectory are indicated by crosses. In (a) the orientation of the remote control vehicle at the start point is approximately perpendicular to that at the end point; it is opposite to the end point in (c) and the same as the end point in (b)(d). It is worth noting that the trajectories are plotted up to a scale factor, because the absolute scale cannot be derived from a single camera.

Table 2. Field test results: loop closure error (%).

Method | S1 (190 m) | S2 (450 m) | S3 (200 m) | S4 (250 m)
PA-SLAM | - | - | - | -
PALVO | 2.5719 | 4.2268 | 4.2288 | 3.6310

(The PA-SLAM entries are not recoverable from the extracted text.)
Fig. 10. Feature matching when the vehicle revisits a certain place (a loop closure occurs) with its orientation (a) perpendicular to, (c) opposite to, and (b)(d) the same as that of the first visit.
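The BoW-based place recognition underlying these loop closures can be illustrated with a small sketch. The actual system builds on DBoW3 [31]; the functions below (our own simplified naming) show the core idea: score L1-normalized visual-word histograms and shortlist keyframes above a similarity threshold, independent of the revisit direction:

```python
import numpy as np

def bow_vector(word_ids, vocab_size):
    """L1-normalized bag-of-words histogram for one image, given the
    visual-word id assigned to each keypoint descriptor."""
    v = np.bincount(word_ids, minlength=vocab_size).astype(float)
    s = v.sum()
    return v / s if s > 0 else v

def bow_score(v1, v2):
    """L1 similarity in [0, 1] for L1-normalized BoW vectors,
    as used by DBoW-style systems: 1 - 0.5 * |v1 - v2|_1."""
    return 1.0 - 0.5 * np.abs(v1 - v2).sum()

def loop_candidates(query, keyframes, threshold=0.3):
    """Indices of keyframes whose BoW similarity to the query exceeds
    the threshold; each candidate would then be verified geometrically
    before a relative pose constraint is added to the pose graph."""
    return [i for i, kf in enumerate(keyframes)
            if bow_score(query, kf) >= threshold]
```

The threshold value here is a hypothetical placeholder; in practice it is chosen relative to the score of neighboring keyframes.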
5. Conclusion
In this paper, we have proposed PA-SLAM, which extends the sparse-direct-method-based PALVO with loop closure detection and global optimization. A hybrid point selection strategy is presented to enable reliable BoW-based loop closure detection while ensuring computational efficiency. When a loop closure is successfully detected, pose graph optimization is performed, followed by global BA. Experiments demonstrate that PA-SLAM significantly reduces the error accumulation and scale drift of PALVO, reaching state-of-the-art accuracy while maintaining the original robustness and high efficiency. Meanwhile, PA-SLAM can handle loop closures in different travel directions, which greatly improves its performance in practical application scenarios.
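The Sim(3) relative pose used to seed the pose graph optimization summarized above can be estimated in closed form from matched 3D points (cf. Horn [32]). A minimal numpy sketch of the similarity alignment (Umeyama-style; illustrative, not the paper's implementation):

```python
import numpy as np

def align_sim3(src, dst):
    """Closed-form estimate of scale s, rotation R, translation t
    minimizing ||dst - (s * R @ src + t)||^2 over matched 3D points.
    src, dst are (N, 3) arrays of corresponding points."""
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    # Cross-covariance between the two centered point sets.
    H = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(H)
    # Reflection guard: keep the result a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_d - s * R @ mu_s
    return s, R, t
```

Given noiseless correspondences related by a known similarity transform, this recovers s, R, and t exactly; the recovered scale is what allows the back-end to correct monocular scale drift around a loop.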
Funding.
This research was supported by the ZJU-Sunny Photonics Innovation Center (No. 2020-03). This research was also funded in part through the AccessibleMaps project by the Federal Ministry of Labor and Social Affairs (BMAS) under Grant No. 01KM151112.
Acknowledgments.
This research was supported in part by Hangzhou SurImage Technology Company Ltd.
Disclosures.
The authors declare no conflicts of interest.
References
1. C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in (IEEE, 2014), pp. 15–22.
2. J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, 611–625 (2017).
3. H. Chen, K. Wang, W. Hu, K. Yang, R. Cheng, X. Huang, and J. Bai, "PALVO: visual odometry based on panoramic annular lens," Opt. Express, 24481–24497 (2019).
4. Y. Luo, X. Huang, J. Bai, and R. Liang, "Compact polarization-based dual-view panoramic lens," Appl. Opt., 6283–6287 (2017).
5. F. Fraundorfer and D. Scaramuzza, "Visual odometry: Part II: Matching, robustness, optimization, and applications," IEEE Robotics & Autom. Mag., 78–90 (2012).
6. H. Strasdat, J. Montiel, and A. J. Davison, "Scale drift-aware large scale monocular SLAM," Robotics: Sci. Syst. VI, 7 (2010).
7. C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in (IEEE, 2018), pp. 1168–1174.
8. R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, 1255–1262 (2017).
9. D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Transactions on Robotics, 1188–1197 (2012).
10. J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision (ECCV), (Springer, 2014), pp. 834–849.
11. A. Glover, W. Maddern, M. Warren, S. Reid, M. Milford, and G. Wyeth, "OpenFABMAP: An open source toolbox for appearance-based loop closure detection," in (IEEE, 2012), pp. 4730–4735.
12. X. Gao, R. Wang, N. Demmel, and D. Cremers, "LDSO: Direct sparse odometry with loop closure," in (IEEE, 2018), pp. 2198–2204.
13. T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, 1004–1020 (2018).
14. A. C. Murillo and J. Kosecka, "Experiments in place recognition using gist panoramas," in (IEEE, 2009), pp. 2196–2203.
15. R. Cheng, K. Wang, S. Lin, W. Hu, K. Yang, X. Huang, H. Li, D. Sun, and J. Bai, "Panoramic annular localizer: Tackling the variation challenges of outdoor localization using panoramic annular images and active deep descriptors," in (IEEE, 2019), pp. 920–925.
16. R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2016), pp. 5297–5307.
17. S. Oishi, Y. Inoue, J. Miura, and S. Tanaka, "SeqSLAM++: View-based robot localization and navigation," Robotics Auton. Syst., 13–21 (2019).
18. C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect visual odometry for monocular and multicamera systems," IEEE Transactions on Robotics, 249–265 (2016).
19. H. Matsuki, L. von Stumberg, V. Usenko, J. Stückler, and D. Cremers, "Omnidirectional DSO: Direct sparse odometry with fisheye cameras," IEEE Robotics Autom. Lett., 3693–3700 (2018).
20. T. Qin, S. Cao, J. Pan, and S. Shen, "A general optimization-based framework for global pose estimation with multiple sensors," arXiv:1901.03642 (2019).
21. C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM," arXiv:2007.11898 (2020).
22. M. Lin, Q. Cao, and H. Zhang, "PVO: Panoramic visual odometry," in (IEEE, 2018), pp. 491–496.
23. H. Seok and J. Lim, "ROVO: Robust omnidirectional visual odometry for wide-baseline wide-FOV camera systems," in (IEEE, 2019), pp. 6344–6350.
24. C. Won, H. Seok, Z. Cui, M. Pollefeys, and J. Lim, "OmniSLAM: Omnidirectional localization and dense mapping for wide-baseline multi-camera systems," in (IEEE, 2020), pp. 559–566.
25. D. Gutierrez, A. Rituerto, J. Montiel, and J. J. Guerrero, "Adapting a real-time monocular visual SLAM from conventional to omnidirectional cameras," in (IEEE, 2011), pp. 343–350.
26. Z. Huang, J. Bai, T. X. Lu, and X. Y. Hou, "Stray light analysis and suppression of panoramic annular lens," Opt. Express, 10810–10820 (2013).
27. W. Hu, K. Wang, H. Chen, R. Cheng, and K. Yang, "An indoor positioning framework based on panoramic visual odometry for visually impaired people," Meas. Sci. Technol., 014006 (2019).
28. K. Yang, X. Hu, H. Chen, K. Xiang, K. Wang, and R. Stiefelhagen, "DS-PASS: Detail-sensitive panoramic annular semantic segmentation through SwaftNet for surrounding sensing," in (IEEE, 2020), pp. 457–464.
29. Y. Fang, K. Wang, R. Cheng, and K. Yang, "CFVL: A coarse-to-fine vehicle localizer with omnidirectional perception across severe appearance variations," in (IEEE, 2020), pp. 1885–1891.
30. J.-Y. Bouguet, "Pyramidal implementation of the affine Lucas Kanade feature tracker: description of the algorithm," Intel Corp., 4 (2001).
31. "DBoW3," https://github.com/rmsalinas/DBow3 (2017).
32. B. K. P. Horn, "Closed-form solution of absolute orientation using unit quaternions," J. Opt. Soc. Am. A, 629 (1987).
33. G. Grisetti, R. Kümmerle, H. Strasdat, and K. Konolige, "g2o: A general framework for (hyper) graph optimization," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), (IEEE, 2011), pp. 9–13.
34. D. Sun, X. Huang, and K. Yang, "A multimodal vision sensor for autonomous driving," in Counterterrorism, Crime Fighting, Forensics, and Surveillance Technologies III, vol. 11166 (International Society for Optics and Photonics, 2019), p. 111660L.
35. X. Zhou, J. Bai, C. Wang, X. Hou, and K. Wang, "Comparison of two panoramic front unit arrangements in design of a super wide angle panoramic annular lens," Appl. Opt., 3219–3225 (2016).
36. D. Scaramuzza, A. Martinelli, and R. Siegwart, "A toolbox for easily calibrating omnidirectional cameras," in (IEEE, 2006), pp. 5695–5701.
37. M. Grupp, "evo: Python package for the evaluation of odometry and SLAM," https://github.com/MichaelGrupp/evo (2017).
38. J. Engel, V. Usenko, and D. Cremers, "A photometrically calibrated benchmark for monocular visual odometry," arXiv:1607.02555 (2016).
39. S. Garrido-Jurado, R. Munoz-Salinas, F. J. Madrid-Cuevas, and R. Medina-Carnicer, "Generation of fiducial marker dictionaries using mixed integer linear programming," Pattern Recognit., 481–491 (2016).
40. F. J. Romero-Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer, "Speeded up detection of squared fiducial markers," Image Vis. Comput. 76.