Datasets and Evaluation for Simultaneous Localization and Mapping Related Problems: A Comprehensive Survey
YUANZHI LIU, (Student Member, IEEE), YUJIA FU, FENGDONG CHEN, BART GOOSSENS, (Member, IEEE), WEI TAO, (Member, IEEE), AND HUI ZHAO, (Member, IEEE)
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
Department of Telecommunications and Information Processing (TELIN), Ghent University, 9000 Ghent, Belgium
Interuniversity Microelectronics Center (imec), 3001 Leuven, Belgium
Corresponding authors: Wei Tao (e-mail: [email protected]) and Hui Zhao (e-mail: [email protected]).
This work was supported in part by the National Key R&D Program of China under Grant 2018YFB1305005.
ABSTRACT
Simultaneous Localization and Mapping (SLAM) has lately found increasingly wide utilization in self-driving cars, robot navigation, 3D mapping, virtual reality (VR), augmented reality (AR), and beyond, empowering both industry and daily life. Although the state-of-the-art algorithms, on which developers have spared no effort, are the source of intelligence, it is the datasets working behind them that raise us higher. The employment of datasets is essentially a kind of simulation, yet it profits many aspects: algorithms can be drilled around the clock, costly hardware and ground truth systems become unnecessary, and evaluation gains an equitable benchmark. However, despite their great significance, datasets have neither drawn wide attention nor been reviewed thoroughly. Hence, in this article, we strive to give a comprehensive and open-access review of SLAM related datasets and evaluation, a topic scarcely surveyed yet highly demanded by researchers and engineers, looking forward to serving not only as a dictionary but also as a development proposal. The paper starts with the methodology of dataset collection and a taxonomy of SLAM related tasks. Then follows the main portion: a comprehensive survey of the existing SLAM related datasets by category, with considered introductions and insights. Furthermore, we discuss the evaluation criteria, which are necessary to quantify algorithm performance on a dataset and to inspect its defects. At the end, we summarize the weaknesses of datasets and evaluation, which may well underlie the weaknesses of topical algorithms, to promote bridging the gap fundamentally.
INDEX TERMS
Dataset, evaluation, localization, mapping, review, SLAM, survey.

I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) is the problem of achieving both "internal and external cultivation" in a mobile robot: externally knowing the world by constructing a map of an unknown environment, and internally knowing itself by tracking its location continuously [1], [2], [3]. Serving as a sensory perception system like that of a human, SLAM has attracted wide attention and has long been regarded as a decisive technology in the mobile robotics community. In self-driving cars [4], [5], [6], navigating robots [7], [8], [9], and unmanned aerial vehicles (UAVs) [10], [11], [12], SLAM depicts the surrounding environment and figures out the ego-motion inside the map. In mobile VR/AR systems [13], [14], [15], SLAM provides accurate positions and orientations of the agent, thus helping users intuitively manipulate virtual objects. In 3D mapping tasks [16], [17], [18], SLAM estimates transformations across adjacent visual or laser frames and merges 3D points into a consistent model. The advent of these amazing applications (as shown in Fig. 1) indicates the fast development of SLAM related technologies in the past decades, offering a glimpse of the huge potential for driving industrial change. However, these visions have remained in our imagination for too long and have yet to achieve a breakthrough in the transition to mature productivity [19], [20].
FIGURE 1. Some typical SLAM related applications: (a) a self-driving car from Waymo [21]; (b) an AR game implemented via the PTAM system [22]; (c) mapping a 3D urban scene by merging LiDAR scans [23].

If we deem the state-of-the-art algorithms the source of intelligence, the datasets should be the powerhouse of improvement and breakthrough. Before the release of ImageNet [24] in 2009, no one could have imagined that vision-based Artificial Intelligence (AI) would reach today's level within only a ten-year trip. The accuracy of object classification rose from 71.8% to 97.3%, surpassing human abilities and effectively proving that bigger data leads to better decisions [25]. There is a widely circulated consensus: data and features determine the upper limit of machine learning, while the algorithms and models merely approach this bound [26]. This concept plainly manifests the crucial role datasets play in the machine learning field, and it fits the situation in the SLAM domain even better. Unlike object detection and semantic segmentation datasets, which comprise arbitrary individual images from any source, SLAM datasets may involve multimodal inputs (e.g., camera frames, LiDAR [27] scans, and inertial measurements). Necessarily, the sensor setup within each independent sequence must remain unchanged, and the multi-channel data streams must be strictly synchronized. As it is not practical for every researcher to build such a tangible mobile robotic system, the concrete experimental process has long been regarded as a barrier to SLAM related research. As for the reference data, the two main outputs of SLAM, pose (position and orientation) estimation and scene mapping, are exceptionally tough to measure ground truth for [28], pushing individual enthusiasts away. Moreover, for hardworking developers, testing every draft experimentally is never efficient. To this end, datasets serve as a convenient form of simulation and hold a vital position. With the fast growth of publicly available datasets, some additional values are gradually taking shape. As researchers worldwide drill their algorithms on the datasets, the providers have defined specific evaluation metrics for comparison against the reference data. Online platforms such as the best-known KITTI benchmark [29] and TUM benchmark [30] naturally serve as equitable baselines open to all researchers, and have gradually become consensual quantitative indicators for algorithm evaluation. More importantly, just like the ImageNet challenge [31], demanding competitions hosted on specific datasets (e.g., the IROS 2019 Lifelong Robotic Vision Challenge [32], the ICRA 2020 FPV Drone Racing VIO Competition [33], and the CVPR 2020 Visual SLAM Challenge [34]) can significantly call public attention to common key issues, thus promoting technological breakthroughs. Quite evidently, research on algorithms and research on datasets share such a mutuality on the way forward. To date, quite a few works have surveyed and reviewed the efforts on SLAM algorithms [35], [36], [37]. However, to the best of our knowledge, no open literature has specialized in reviewing SLAM related datasets, although there are already some reports on autonomous driving [38], [39] and RGB-D [40], [41] datasets that come close.
As such, a straightforward problem appears: researchers and engineers cannot access a comprehensive understanding of the datasets and evaluation, and thus sometimes struggle with the validation of their research. After all, it has long been taken for granted to drill algorithms on several commonly used datasets [29], [30], [42], which are sometimes found insufficient in variety and unable to meet challenging requirements [43]. Another hidden worry is that, so far, the research on datasets and evaluation has not been formalized into a mature, self-contained research branch. Deep analyses and conclusions are of great significance for exposing contemporary drawbacks and forecasting future directions, as the weaknesses of algorithms may well result from the weaknesses of datasets. Hence, in this article, we spare no effort to give a comprehensive and open-access survey focusing on SLAM related datasets and evaluation, which are urgently demanded by the community, looking forward to serving as both a dictionary and, moreover, a development proposal. This paper promises to bridge the gap by:
1) Presenting the methodology of dataset collection. SLAM datasets involve complicated procedures and techniques; some common and key aspects (e.g., hardware setup, sensor calibration, and ground truth generation) are introduced to give a deep understanding and pave the way for the following sections.
2) Clearing the taxonomy of SLAM related problems. Varying by application and sensor setup, there can be several hierarchical directions within the SLAM domain, and the optimal usage of datasets is also ensured by classification. A unified and clear taxonomy is given to support both dataset archiving and SLAM research.
3) Comprehensively surveying the existing datasets based on the taxonomy. For each category, an overview and comparison table is given, covering as many of the existing public datasets as possible. Out of them, some popular and featured works are introduced in detail.
4) Reviewing the evaluation criteria of positioning and mapping. Evaluation is a vital link in the closed-loop chain of "development-inspection-improvement". Some fundamental indicators and principles are introduced to give a sound guideline on evaluation.
5) Concluding the development situation and proposing future directions for the datasets and evaluation. The weaknesses of algorithms can often be traced to the weaknesses of datasets, so inspecting the data perspective can effectively trigger technological breakthroughs.
The remainder of this paper is structured into five sections (the paper structure is shown in Fig. 2). Section II presents the methodology of dataset collection. Section III clears the taxonomy of SLAM related problems. Section IV gives a comprehensive survey of the existing datasets based on the taxonomy. Furthermore, Section V reviews the evaluation criteria of positioning and mapping. In the end, Section VI summarizes the current research situation of contemporary datasets and evaluation and looks ahead to future research.
FIGURE 2. The schematic diagram of the overall structure of the paper.
II. METHODOLOGY OF DATASET COLLECTION
The datasets should consist of data streams and ground truth. For SLAM related tasks, the construction of datasets mainly involves three parts: hardware configuration, sensor calibration, and ground truth generation.

A. HARDWARE CONFIGURATION
The hardware configuration consists of the mobile platform and the sensor setup. For the mobile platform, depending on the specific application and scenario, there are several commonly used options (as shown in Fig. 3): for AR/VR performance testing, a handheld carrier recovers a realistic situation [44], [45]; for autonomous driving validation, there is no alternative to an automobile [29]; for industry and field applications, a ground robot is no doubt preferable [30], [46]; for aerial applications, a UAV is the logical choice [42], [47]. Each platform has its unique motion pattern, which is important for validating the algorithms' generalization.
As for the sensor setup, to meet the demands of different applications, there are several commonly used modalities for SLAM related problems, including camera (grayscale, color, and event-based [52]), RGB-D sensor, LiDAR (single-beam and multi-beam), Inertial Measurement Unit (IMU), Global Navigation Satellite System (GNSS), wheel odometry, etc.
Generally, it is logical to record a dataset using only the essential sensor (e.g., recording a VO dataset using a single camera). However, to meet the trend of data fusion, collection systems are typically equipped with multiple sensors and redundant modalities. Under this circumstance, the spatial-temporal alignment among the different sensors must be calibrated.
FIGURE 3. Some common mobile platforms for SLAM implementations: handheld [48], vehicle [49], wheeled robot [50], and UAV [51], from top left to bottom right respectively.
FIGURE 4. The stereo camera calibration procedure of the MATLAB Computer Vision Toolbox.

B. SENSOR CALIBRATION
The spatial calibration aims to find a rigid transformation (also called the extrinsic parameters) between different sensor frames, enabling all the data to be unified into a single coordinate system. Usually, extrinsic calibration appears in pairs of camera-to-camera, camera-to-IMU, and camera-to-LiDAR, and can be extended group by group in multi-sensor situations. For stereo camera calibration, the most widely used method in the machine vision field is Zhang's method [53], proposed in 2000. Although it originated from monocular vision, the principle can easily be extended to a stereo setup. The technique is quite flexible, as it only requires freely taking images of a printed planar checkerboard at several different orientations. The corner points of the board detected in each captured image serve as the constraints for solving a closed-form solution, and the result is further refined by minimizing the reprojection error. In this way, the intrinsic parameters, lens distortion, and extrinsic parameters are all calibrated (the procedure is shown in Fig. 4). The algorithm has been implemented in MATLAB (https://ww2.mathworks.cn/help/vision/ref/cameracalibrator-app.html), OpenCV (https://docs.opencv.org/master/dc/dbb/tutorial_py_calibration.html), and ROS (http://wiki.ros.org/camera_calibration), and has proved highly accurate and reliable, significantly lowering the barrier to 3D vision research. Another well-known calibration framework in the robotics field is the Kalibr toolbox [54], [55], [56] (https://github.com/ethz-asl/kalibr). Sharing the same theory as Zhang's method, Kalibr additionally supports AprilTag [57] as the visual pattern, bringing about more robust performance. For camera-to-IMU calibration, Kalibr is likewise a mature tool, widely applied in both academia and industry. Calibrating only requires waving the visual-inertial sensor in front of the calibration pattern. Since the framework parameterizes both the transformation and the time-offset in a unified, principled maximum-likelihood estimator, the Levenberg-Marquardt (LM) algorithm estimates all unknown parameters at once by minimizing the total objective error. For camera-to-LiDAR calibration, the general methodology is to find mutual correspondences between the 2D image and the 3D point cloud, and thus compute a rigid transformation between them (Fig. 5 shows a calibration example). The Camera and Range Calibration Toolbox proposed by KIT [58] and the Calibration Toolbox of Autoware [59] present a fully automatic and an interactive implementation, respectively.

FIGURE 5. An example implemented by the KIT Range Calibration Toolbox: the top image shows the arrangement of the markers and the bottom shows the result of camera-LiDAR calibration [58].
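To make the stereo calibration pipeline above concrete, the following is a minimal sketch using OpenCV's implementation of Zhang's method. The checkerboard geometry, square size, and image paths are hypothetical placeholders, not values from any dataset discussed here.

# Minimal stereo calibration sketch (Zhang's method via OpenCV).
# Assumptions: synchronized left/right checkerboard images in
# left/*.png and right/*.png; a 9x6 inner-corner board with 25 mm squares.
import glob
import cv2
import numpy as np

BOARD = (9, 6)     # inner corners per row/column (assumed)
SQUARE = 0.025     # square edge length in meters (assumed)

# 3D corner coordinates of the planar board (Z = 0 by construction)
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_pts, left_pts, right_pts = [], [], []
pairs = zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png")))
for fl, fr in pairs:
    iml = cv2.imread(fl, cv2.IMREAD_GRAYSCALE)
    imr = cv2.imread(fr, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(iml, BOARD)
    okr, cr = cv2.findChessboardCorners(imr, BOARD)
    if okl and okr:                 # keep only views seen by both cameras
        obj_pts.append(objp)
        left_pts.append(cl)
        right_pts.append(cr)

size = iml.shape[::-1]              # (width, height)
# Per-camera intrinsics and lens distortion (closed-form initialization,
# then refinement by minimizing the reprojection error)
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)

# Extrinsics: the rigid transform (R, T) from the left to the right camera
rms, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
print("RMS reprojection error:", rms)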
The temporal alignment aims to timestamp all the sensors with a single reference clock and to calibrate the time-offsets between different timing systems. The most desirable solution is to synchronize with hardware support, e.g., employing cameras equipped with an external trigger, timing with GNSS, or synchronizing through the Precision Time Protocol [60] (PTP, also known as the IEEE-1588 standard). However, these all require specialized hardware, which can hardly be found on consumer-level products. To this end, for the majority of robotic systems whose hardware cannot meet these demands, a feasible approach is to estimate the time-offset among the different sensors. The general methodology for this kind of problem is to parameterize the time-offset and then optimize it, either by maximizing the cross-correlation between the two processes or by minimizing a certain error function (a minimal sketch of the cross-correlation approach is given below). As mentioned above, the Kalibr toolbox provides an open implementation of unified temporal and spatial calibration for camera-to-IMU. Overall, the methodology and techniques for spatial-temporal alignment among different sensors are quite common, hence we will not detail the calibration process of each dataset in the following sections.
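The sketch below illustrates the cross-correlation approach under simple assumptions: the angular-rate magnitude reported by the gyroscope and the one derived from camera tracking are resampled onto a common timeline, and the lag that maximizes their correlation is taken as the time-offset. The function name, resampling rate, and sign convention are assumptions to be checked against the specific sensor pair.

# Hypothetical time-offset estimation by maximizing cross-correlation.
import numpy as np

def estimate_time_offset(t_imu, w_imu, t_cam, w_cam, rate=200.0):
    """t_imu/t_cam: timestamps (s); w_imu/w_cam: angular-rate magnitudes
    from the gyroscope and from camera-based tracking. Returns the offset
    in seconds (verify the sign convention for your setup)."""
    t0, t1 = max(t_imu[0], t_cam[0]), min(t_imu[-1], t_cam[-1])
    t = np.arange(t0, t1, 1.0 / rate)           # common uniform timeline
    a = np.interp(t, t_imu, w_imu)
    b = np.interp(t, t_cam, w_cam)
    a = (a - a.mean()) / (a.std() + 1e-12)      # normalize before correlating
    b = (b - b.mean()) / (b.std() + 1e-12)
    xcorr = np.correlate(a, b, mode="full")     # correlation over all lags
    lag = int(np.argmax(xcorr)) - (len(b) - 1)  # peak position in samples
    return lag / rate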
C. GROUND TRUTH GENERATION

Highly accurate ground truth may be the most decisive component of a dataset, while its measurement also requires the most advanced techniques of the whole construction procedure. For SLAM related problems, the generation of ground truth includes measuring the position and orientation internally, and the 3D scene structure externally.

Position and Orientation: The technique employed for pose measurement differs across scenarios. For small-scale and indoor scenes, the most effective solution is a Motion Capture System (e.g., VICON, OptiTrack (https://optitrack.com/), and MotionAnalysis) (as shown in Fig. 6), which works at a rather high rate and meanwhile provides remarkable accuracy in both position and orientation. But it still has drawbacks. On the one hand, it places a requirement on the illumination condition: there should not be excessive interference from infrared light. Because detection relies on the camera-emitted infrared light reflected by the reflective markers, excessive interference can confuse the CMOS sensors, resulting in the failure of marker tracking. On the other hand, deploying such a system in large-scale scenarios is not feasible at all: it can consume 4 tracking cameras for a 10 m area, while the price of a single camera is already thousands of dollars. For large-scale outdoor scenarios, GNSS with RTK correction and a dual antenna is probably the most mature solution [61]. However, the applicable conditions are also strict: the mobile agent should be under an open sky, otherwise the positioning accuracy will drift significantly. So, quite commonly, a single data sequence may contain segments that cannot be trusted, which dramatically influences the quality of the dataset (a case is shown in Fig. 7). Another high-precision solution supporting both indoor and outdoor environments is the laser tracker (e.g., FARO and Leica). It provides superior positioning accuracy (up to sub-mm level), but an occlusion-free line of sight must be ensured for the ranging laser emitted from the instrument to the target, which is hard to achieve in many situations. There are also some alternatives for localization, such as Fiducial Markers [57], GSM [62], UWB [63], Wi-Fi [64], and Bluetooth [65], all of which are widely deployed in industry. However, although much cheaper and more lightweight, these solutions either offer insufficient accuracy or struggle in certain conditions. Moreover, sometimes the environment is too complex to measure ground truth in (e.g., urban canyons, large-scale scenes, and indoor-outdoor connection scenarios); then it is a compromise to use odometry or SLAM as the baseline [66], or to directly measure the start-to-end drift [67] for evaluation.

FIGURE 6. The motion capture system from OptiTrack [68]. The high-rate cameras mounted around the room track the location and posture of the performers by detecting the reflective markers.
FIGURE 7. The GPS reception rate during a data collection route of the Complex Urban dataset. The color of the trajectory represents the number of received satellites: yellow means more, red fewer, and green no satellites. In complex urban areas, GNSS is challenged by the signal interruption caused by high-rise buildings [71].
3D Scene Structure: Historically, it has been a bit unusual to provide 3D-structure ground truth in SLAM datasets [30], [69], [70], although mapping is an integral part of the research. One primary reason could be the hardware burden: high-accuracy 3D scanners (as shown in Fig. 8) that provide dense scene models are hard to afford for many institutes, let alone individual researchers. For objects or small-scale scenes, a portable 3D scanner powered by structured light or laser could be the best choice (e.g., GOM ATOS, Artec 3D, and Shining 3D), providing accuracy at the 10 μm level. For large-scale scenes, there is no alternative to a surveying and mapping instrument (e.g., FARO, Leica (https://leica-geosystems.com/products/laser-scanners), and Trimble), with mm-level accuracy over hundreds of meters. These scanners are capable of covering very large areas by registering multiple stations, with each individual scan consuming several to dozens of minutes.

FIGURE 8. The FARO X130 large-scale 3D scanner that we used to get the ground truth 3D model of a football court at Shanghai Jiao Tong University. It scans the scene geometry by rotating its laser emitter globally and measuring the return wave to compute the distance of each point.
III. TAXONOMY OF SLAM RELATED PROBLEMS
The proposed taxonomy of SLAM related tasks is generally based on two principles: one is sensor-oriented, the other is functionality-oriented. Oriented by sensor setup, there are mainly five branches of SLAM related tasks: LiDAR-based, vision-based, visual-LiDAR fusion, RGB-D-based, and image-based. Benefiting from the high-accuracy ranging of laser sensors, LiDAR-based methods [72] are widely employed, acting as a mature solution. However, the price of LiDAR is high, with multi-beam models costing at least thousands of dollars [73]. As visual sensors have become ubiquitous, along with the improvements in computing hardware, vision-based methods have greatly attracted the efforts of the SLAM community [22], [74], [75], [76]. Although not as robust or accurate as LiDAR-based methods [72], [77], [78], vision methods have proved to have huge potential and are steadily moving forward [79]. To combine the advantages of both modalities, visual-LiDAR fusion has become a hot topic attracting significant attention due to the superior performance profiting from their complementarity [80]. RGB-D is a novel modality that also combines imaging and ranging sensors. It collects color images along with a depth channel, which makes 3D perception much easier. Since the Microsoft Kinect [81] was first introduced, RGB-D sensors have been widely adopted in SLAM research [82], [83] due to their low cost and complementary nature. Additionally, there is another branch, named image-based, which is popular in 3D computer vision. It does not require a specifically arranged sensor setup; even photos collected from the Internet can be used to recover the 3D model of a scene or an object. The other modalities mentioned in Section II (e.g., IMU, GNSS, and wheel odometry) can provide priors and be fused into the optimization, thus enhancing the accuracy and robustness of motion estimation. Since they are adopted mainly in supporting roles, the related fusions are categorized into the above five branches. In the context of the above sensor setups, oriented by functionality, there are two main branches of SLAM related problems: one is mobile localization and mapping, the other is 3D reconstruction. Within the range of mobile localization and mapping, there are three specific problems: odometry, mapping (here we refer to a real-time or quasi-real-time mobile manner), and SLAM. The three terms have a hierarchical relationship at the technical level, as shown in Fig. 9. Odometry means tracking the position and orientation of an agent incrementally over time [84]. Such systems are usually abbreviated with an "O", as in VO, VIO, and LO, which refer to Visual Odometry, Visual-Inertial Odometry, and LiDAR Odometry respectively. Odometry was proposed to serve as an alternative or supplement to wheel odometry due to its better performance, especially on uneven or other adverse terrains [85], [86], [87]. In the strict sense, besides tracking the path, odometry also recovers 3D points, but these landmarks are only used for local optimization rather than for building a map. Mapping means building a map of the environment in a particular format through mobile collection.
Normally, mapping works along with odometry, and is thus named odometry and mapping (OAM) (we avoid the term mobile mapping since it usually refers to a more professional procedure in the surveying and mapping domain [88]), since it relies on pose estimation to recover 3D points and merge scene structures together. As for SLAM, intuitively, it has the functions of both odometry and mapping, but it is not a simple combination of the two. SLAM means keeping track of an agent's location while at the same time building a globally consistent and reusable map [84], [89]. This implies two core functions of SLAM: loop closure and relocalization, both of which rely on the map database to query previously visited places. Loop closure means adding new constraints around revisited places to perform global optimization, thus eliminating the accumulated drift of the path and ensuring a globally consistent map. Relocalization means recovering positioning within the prebuilt map after the robot loses its track. As can be seen, the "SLAM map" is not only a model but also a database for querying, which supports the map-reuse capability that draws a line between SLAM and the other two. It should be noted that the map does not have to be dense or visually good-looking; the point is whether the robot can recognize it. After all, sparse map points can also store features and descriptors [90], [91].

FIGURE 9. The technical architecture and hierarchical relationship among odometry, mapping, and SLAM.

FIGURE 10. The overall taxonomy of SLAM related problems. Here the abbreviation "I" means the IMU sensor is fused into the system.
3D reconstruction means recovering a high-quality 3D map or model of an environment or an object from sequential data or data from different locations and perspectives. Unlike mobile localization and mapping, which tracks the sensor motion in real time while building a map, 3D reconstruction ultimately cares more about the reconstruction quality and thus does not require real-time processing speed. It can be achieved with many sensor modalities (e.g., laser, structured light, and RGB-D sensors), but we mainly refer to the image-based approach here, which estimates the camera poses and recovers 3D structures from
2D sequential or individual images. Technically speaking, 3D reconstruction shares similar methodologies with visual mobile localization and mapping (e.g., epipolar geometry [92], triangulation, bundle adjustment [93], and global optimization), whereas the former adds as many constraints as possible between co-visible images and does much more optimization, and thus no doubt achieves better accuracy and density. Overall, comprehensively considering the above principles and the consensus of this field, we propose a classification with four major branches, which are respectively vision based mobile localization and mapping, LiDAR related mobile localization and mapping, RGB-D based mobile localization and mapping, and image based 3D reconstruction, with 26 specific tasks (as shown in Fig. 10). For the convenience of dataset classification, we categorize LiDAR based and visual-LiDAR fusion into the same branch, because LiDAR datasets usually also contain camera data for fusion research.

A. VISION BASED MOBILE LOCALIZATION AND MAPPING
Benefiting from the low cost, compactness, and high frequency of cameras, vision-based methods have been extensively studied. However, limited by the 2D perception characteristic of the camera, vision-based methods are weak in real-time mapping, especially lacking in accuracy and density. Therefore, vision-based methods are mainly used for motion estimation, especially in lightweight, limited-scale, and high-speed situations (e.g., UAVs and VR/AR [94]). It should be noted that, besides odometry, a full SLAM is also preferred for pose estimation even when the scene map is not required, as the trajectory can be more precise. Moreover, as a low-cost, compact, and internally sensing device, the IMU is widely adopted to complement the camera as "VI-" [95], [96], which is regarded as the "minimum fusion system" [97], to enhance the accuracy and robustness of tracking (it can handle some tough scenarios, e.g., dark environments and white walls). Overall, there are mainly 6 specific tasks in this branch, namely V-O, VI-O, V-OAM, VI-OAM, V-SLAM, and VI-SLAM, as shown in the first column of Fig. 10.

B. LIDAR RELATED MOBILE LOCALIZATION AND MAPPING
Due to the long perception distance, accurate ranging, and all-day operation of LiDAR, the related methods can achieve high accuracy in both tracking and mapping, and have thus been widely used in autonomous driving [98] and mobile mapping [99]. Even for odometry, which cannot correct the path via loop closure, the translation error has already reached around 0.5% [72]; thus, many systems only perform "O" [100] or "OAM" [72] rather than a full SLAM. Likewise, it is also quite common to fuse LiDAR with an IMU and camera, which not only enhances the optimization but also benefits other aspects, such as motion distortion correction and map reuse. Overall, there are mainly 12 specific tasks in this branch, namely L-O, LI-O, L-OAM, LI-OAM, L-SLAM, LI-SLAM, VL-O, VLI-O, VL-OAM, VLI-OAM, VL-SLAM, and VLI-SLAM, as shown in the second column of Fig. 10.

C. RGB-D BASED MOBILE LOCALIZATION AND MAPPING
As mentioned above, the RGB-D sensor incorporates the advantages of both camera and LiDAR, and the related SLAM research has thus attracted wide attention. However, the ranging distance of the sensor only reaches around 10 m, which limits its application to small-scale or indoor environments. In addition, the mainstream RGB-D sensor brands (such as Microsoft Kinect, ASUS Xtion, and Intel RealSense) also provide IMU signals, which enables research on inertial fusion algorithms [101]. Overall, there are mainly 6 specific tasks in this branch, namely RGBD-O, RGBDI-O, RGBD-OAM, RGBDI-OAM, RGBD-SLAM, and RGBDI-SLAM, as shown in the third column of Fig. 10.

D. IMAGE BASED 3D RECONSTRUCTION
Recovering a 3D model from 2D images is quite an effective approach, as the data collection can be much easier. There are mainly two specific tasks in this field: Structure from Motion (SfM) and Multiple View Stereo (MVS) (as shown in the last column of Fig. 10). SfM is the process of recovering the 3D structure of a scene from 2D images taken at different locations and views [102], by estimating the camera motions among co-visible views. Usually, the output models are composed of sparse point clouds. MVS can be regarded as a generalization of two-view stereo vision; it assumes known camera poses for each view (just like the extrinsic parameters between binocular cameras) to recover dense 3D models. Since SfM estimates the camera poses beforehand, MVS can directly serve as a post-process to build an accurate and dense model [103].
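To make the shared geometry concrete, the following is a minimal two-view sketch assuming known camera intrinsics K: feature matching, relative pose from the essential matrix, and triangulation of a sparse point cloud. A full SfM pipeline adds multi-view matching, incremental registration, and bundle adjustment on top of this; the function name and parameter values are illustrative only.

# Minimal two-view structure-from-motion sketch with OpenCV.
import cv2
import numpy as np

def two_view_reconstruction(img1, img2, K):
    """img1/img2: grayscale images; K: 3x3 intrinsic matrix.
    Returns the relative pose (R, t) and a sparse 3D point cloud."""
    orb = cv2.ORB_create(4000)
    k1, d1 = orb.detectAndCompute(img1, None)
    k2, d2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    p1 = np.float32([k1[m.queryIdx].pt for m in matches])
    p2 = np.float32([k2[m.trainIdx].pt for m in matches])

    # Epipolar geometry: essential matrix with RANSAC, then relative pose
    E, inliers = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                                      threshold=1.0)
    _, R, t, pose_mask = cv2.recoverPose(E, p1, p2, K, mask=inliers)

    # Triangulation in the first camera's frame; the translation is only
    # recovered up to scale, as usual in monocular SfM
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    good = pose_mask.ravel() > 0
    X_h = cv2.triangulatePoints(P1, P2, p1[good].T, p2[good].T)
    return R, t, (X_h[:3] / X_h[3]).T    # Nx3 sparse points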
IV. SURVEY OF EXISTING DATASETS BY CATEGORY
In this section, we introduce a total of 84 existing public SLAM related datasets, following the above 4 branches in a systematic way. For each branch, an overview and comparison table is summarized in chronological order (for datasets proposed in the same year, we follow alphabetical order), serving as a dictionary. We mainly choose collection scene, mobile platform, sensor setup, and ground truth to structure the tables, as these aspects are the fundamental components of a dataset. The collection scene shows where the dataset was captured, such as indoor or outdoor, or, more specifically, urban, agricultural, underwater, and so on. The sensor setup includes the number and modality of the sensors, which may enable multi-sensor and fusion research, such as stereo vision, visual-inertial fusion, visual-LiDAR fusion, and so on. The ground truth reveals the type of reference data (pose and 3D structure) and the truth generation technique (this decides how strongly the evaluation results can be trusted). This is not exhaustively detailed, but concrete enough for a general comparison and a suitable choice. As for deeper descriptions, we cannot cover all the datasets within the available space; hence, some popular or featured ones are selected for detailed introduction. We cover the aspects of platform, sensor system, scene, sequence, ground truth, and highlights. However, we will not discuss the drawbacks of each individual dataset, as it is almost impossible to gather all the advantages in a single dataset (for example, one cannot cover all kinds of scenes under all conditions). Meanwhile, the creation of SLAM related datasets involves complicated procedures, high-end instruments, and advanced techniques. Therefore, we respect and encourage any qualified work, as such works complement each other and form the dataset community. As for the summary and discussion, we treat all the datasets as a whole and discuss them in Section VI.

TABLE 1. Overview and comparison of Vision-Based Mobile Localization and Mapping datasets.
Name | Year | Scene | Platform | Camera | C-IMU | LiDAR | Pose GT | 3D GT
Rawseeds [104] | 2009 | Indoor/Outdoor | Wheeled Robot | 3 × gray (stereo), 1 × color, 1 × omni-color | Y | 2 × Hokuyo-2D, 2 × Sick-2D | VisualTag, LiDAR-ICP, RTK-GPS | N/A
UTIAS MultiRobot [105] | 2011 | Indoor | Wheeled Robot | 1 × color (per robot) | N/A | N/A | MoCap | N/A
Devon Island [106] | 2012 | Outdoor (planetary) | Rover | 2 × color (stereo), 1 × panor-color | N/A | 1 × Optech-3D | DGPS | N/A
Gravel Pit [107] | 2012 | Outdoor (planetary) | Rover | 1 × intensity | N/A | 1 × Autonosys-3D | DGPS | N/A
TUM Omni [108] | 2015 | Indoor | Handheld | 1 × omni-gray | N/A | N/A | MoCap (P) | N/A
EuRoC MAV [42] | 2016 | Indoor | UAV | 2 × gray (stereo) | Y | N/A | MoCap, LasTrack | Y
SYNTHIA [109] (V) | 2016 | Outdoor (urban) | Car | 2 × color (stereo), 1 × depth | N/A | N/A | Simulated | N/A
TUM monoVO [67] | 2016 | Indoor/Outdoor | Handheld | 1 × NA-gray, 1 × WA-gray | N/A | N/A | Loop Drift | N/A
Virtual KITTI [110] (V) | 2016 | Outdoor (urban) | Car | 1 × color, 1 × depth | N/A | N/A | Simulated | N/A
PennCOSYVIO [111] | 2017 | Indoor/Outdoor | Handheld | 4 × color (4-orien), 2 × gray (stereo), 1 × fisheye-gray | Y | N/A | Fiducial Marker | N/A
UZH Event [112] | 2017 | Indoor/Outdoor | Handheld | 1 × event | Y | N/A | MoCap (P) | N/A
Zurich Urban [47] | 2017 | Outdoor (urban) | UAV | 1 × color | Y | N/A | GPS+Pix4D | N/A
ADVIO [113] | 2018 | Indoor/Outdoor | Handheld | 1 × color, 1 × gray | Y | N/A | Tango+IMU+Fixed-points | N/A
Blackbird [114] (V) | 2018 | Indoor | UAV | 2 × gray (stereo), 1 × gray | Y | N/A | MoCap | N/A
SPO [115] | 2018 | Indoor/Outdoor | Handheld | 2 × gray (stereo), 1 × plenoptic | N/A | N/A | Loop Drift | N/A
TUM VI [45] | 2018 | Indoor/Outdoor | Handheld | 2 × gray (stereo) | Y | N/A | MoCap (P) | N/A
UPenn Fast Flight [116] | 2018 | Outdoor | UAV | 2 × gray (stereo) | Y | N/A | GPS | N/A
AQUALOC [117] | 2019 | Underwater | Underwater Vehicle | 2 × gray (different scenes) | Y | N/A | COLMAP | N/A
OpenLORIS [118] | 2019 | Indoor/Outdoor | Wheeled Robot | 1 × RGB-D, 2 × fisheye-RGB (stereo) | Y | 1 × Hokuyo-2D, 1 × Robosense-16 | MoCap / LiDAR SLAM | N/A
Rosario [119] | 2019 | Outdoor (agriculture) | Wheeled Robot | 2 × color (stereo) | Y | N/A | RTK-GPS | N/A
TUM RS [120] | 2019 | Indoor | Handheld | 1 × GS-gray, 1 × RS-gray | Y | N/A | MoCap | N/A
UZH-FPV [121] | 2019 | Indoor/Outdoor | UAV | 1 × event, 2 × gray (stereo) | Y | N/A | LasTrack | N/A
ViViD [122] | 2019 | Indoor/Outdoor | Handheld | 1 × thermal, 1 × RGB-D, 1 × event | Y | 1 × Velodyne-16 | MoCap / LiDAR SLAM | N/A
ZJU&SenseTime [44] | 2019 | Indoor | Handheld | 2 × color (different scenes) | Y | N/A | MoCap | N/A
TartanAir [43] (V) | 2020 | Indoor/Outdoor | Multiple | 2 × color (stereo), 1 × depth | Y | 1 × Simulated-32 | Simulated | Y
Virtual KITTI 2 [123] (V) | 2020 | Outdoor (urban) | Car | 2 × color (stereo), 1 × depth | N/A | N/A | Simulated | N/A
Abbreviations: C-IMU = consumer-grade IMU; omni = omnidirectional; panor = panoramic; MoCap = Motion Capture System; P = partial coverage; LasTrack = laser tracker; NA = narrow angle; WA = wide angle; orien = camera orientations; GS = global shutter; RS = rolling shutter; V = virtual/synthetic dataset.
A. VISION BASED MOBILE LOCALIZATION AND MAPPING
This branch collects 26 datasets that originated from or are best fitted to vision based mobile localization and mapping usage. To begin with, an overview and comparison table of all the related datasets is given in Table 1. Out of them, we select 6 popular or featured works for detailed description.

1) EUROC MAV
The EuRoC MAV Dataset was collected in the context of the European Robotics Challenge (EuRoC), in particular for the Micro Aerial Vehicle (MAV) competitions on visual-inertial algorithms [42]. Since its publication in 2016, it has been tested by a great many teams and cited by a massive body of literature, becoming one of the most widely used datasets in the SLAM scope.

Platform and Sensor Setup
The dataset was collected using an MAV platform equipped with two grayscale cameras (stereo setup) and an IMU. The properties of the sensors can be accessed in Table 2.
TABLE 2. The properties of the sensors and ground truth instruments.
Sensor/Instrument | Type | Rate | Characteristics
Camera | 2 × MT9V034 | 20 Hz | WVGA, monochrome, global shutter
IMU | ADIS16448 | 200 Hz | MEMS, 3D accelerometer, 3D gyroscope
Laser Tracker | Leica MS50 | 20 Hz | 3D position GT
Motion Capture | Vicon | 100 Hz | 6D pose GT
3D Scanner | Leica MS50 | 30 kHz | 3 mm accuracy, 3D map GT

Scenes and Sequences
The dataset consists of two portions, machine hall and room, containing 5 and 6 sequences respectively. The collection environments are shown in Fig. 11. Varying in texture, motion, and illumination conditions, the sequences were recorded at three difficulty levels: easy, medium, and difficult. This has proved challenging for some algorithms [76], [125], [126], enabling researchers to locate weaknesses and enhance their algorithms effectively.
FIGURE 11. The collection environments of the dataset: a representative image of the machine hall (left) and the ground truth 3D scan of the room (right) [124].

Ground Truth
The dataset has reliable ground truth for both positioning and 3D mapping, thus supporting both kinds of evaluation. For the first batch, collected in the machine hall, the 3D position ground truth was provided by a laser tracker with around 1 mm accuracy. For the second batch, collected in the room, the 6D pose and 3D map ground truth were provided by a motion capture system and a laser scanner respectively. The properties of the ground truth instruments are shown in Table 2.

Highlights
Vision-IMU sensor setup, multiple complexity levels, highly accurate ground truth, 3D map ground truth.

2) TUM MONOVO
The TUM monoVO Dataset was published in 2016 [67], specifically for testing the tracking accuracy of monocular odometry and SLAM algorithms. It is quite widely used, as it provides many long sequences across different environments, covering both indoor and outdoor scenes. Remarkably, all the sequences have been photometrically calibrated.

Platform and Sensor Setup
The dataset was collected in a handheld manner. Two grayscale cameras (non-stereo) with different lenses were used to acquire narrow-angle and wide-angle sequences in parallel. The properties of the sensors are shown in Table 3.
TABLE 3. The properties of the sensors.
Sensor | Type | Rate | Characteristics
Camera (narrow-angle) | uEye UI-3241LE-M-GL | up to 60 Hz | 1280 × 1024, narrow FOV
Camera (wide-angle) | uEye UI-3241LE-M-GL | up to 60 Hz | 1280 × 1024, wide FOV
FIGURE 12. Some representative images in the dataset, including the original and rectified views [127].

Scenes and Sequences
The dataset provides 50 sequences recorded in a wide variety of scenes from indoor to outdoor, containing mostly camera motion. The indoor scenes were mainly recorded in a school building, covering an office, corridors, a large hall, and so on. The outdoor scenes were mainly recorded in a campus area, including buildings, squares, parking lots, and the like. Some representative images are shown in Fig. 12. Many sequences have extremely long recording distances, which greatly benefits inspecting the performance of long-range algorithms.

Ground Truth
Given the massive number of sequences collected in different indoor and outdoor environments, it was almost impossible to measure ground truth using GNSS/INS or a motion capture system. So this dataset was designed such that each sequence starts and ends at the same position, allowing the tracking accuracy to be evaluated based on the loop drift. To enable quantitative evaluation, LSD-SLAM was used to generate a set of "ground truth" poses for the start segment and end segment of each sequence.
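A simplified sketch of this loop-drift evaluation follows: the estimated trajectory is rigidly aligned to the ground-truth poses of the start segment, and the remaining error over the end segment measures the accumulated drift. The original benchmark uses Sim(3) alignments to handle monocular scale; the rigid alignment and segment length here are simplifying assumptions.

# Simplified loop-drift evaluation (rigid alignment instead of Sim(3)).
import numpy as np

def umeyama_rigid(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst (both Nx3)."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # no reflections
    R = U @ S @ Vt
    return R, mu_d - R @ mu_s

def loop_drift(est, gt_start, gt_end, n=100):
    """est: estimated positions (Nx3); gt_start/gt_end: ground-truth
    positions of the first/last n frames. Aligns on the start segment
    and returns the RMSE over the end segment, i.e. the accumulated drift."""
    R, t = umeyama_rigid(est[:n], gt_start)
    est_aligned = est @ R.T + t
    err = est_aligned[-n:] - gt_end
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))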
Highlights

Handheld motion pattern, various sequences, diverse scenes, photometric calibration.

3) TUM VI
The TUM VI Benchmark is a high-quality visual-inertial dataset published in 2018 [45] for evaluating odometry and SLAM algorithms. It contains diverse sequences in different scenes, along with challenging situations (e.g., long range and bad illumination). The images have a high dynamic range and were also photometrically calibrated, as in TUM monoVO.

Platform and Sensor Setup
The dataset was collected using a handheld carrier equipped with two grayscale cameras (stereo setup) and an IMU. The properties of the sensors are listed in Table 4.
TABLE 4. The properties of the sensors and ground truth instruments.
Sensor/Instrument | Type | Rate | Characteristics
Camera | 2 × IDS uEye UI-3241LE-M-GL | 20 Hz | 1024 × 1024, global shutter
IMU | Bosch BMI160 | 200 Hz | 3D accelerometer, 3D gyroscope, temperature
Motion Capture | OptiTrack Flex13 | 120 Hz | 6D pose GT, sub-mm/degree accuracy
Light Sensor | TAOS TSL2561 | 200 Hz | scalar luminance

Scenes and Sequences
The dataset was recorded in 5 scenes, including room, corridor, magistrale, outdoors, and slides, forming a total of 28 sequences. The sequences that go through the slides are challenging, as the view is dark with almost no features for tracking. Some representative images are shown in Fig. 13.

FIGURE 13. Some representative images in the dataset [45].

Ground Truth
An OptiTrack motion capture system was used to generate accurate ground truth poses. But due to the limited area coverage, ground truth could only be recorded inside the single room where the motion capture system was installed. Therefore, all sequences were arranged to start and end in that room, enabling evaluation by investigating the loop drift. The properties of the ground truth instruments are shown in Table 4.

Highlights
Vision-IMU sensor setup, diverse scenes, various sequences, challenging conditions.

4) TARTANAIR
The TartanAir dataset, published in 2020 [43], is the official dataset of the CVPR 2020 Visual SLAM Challenge. As the title shows, it was designed to push the limits of Visual SLAM. It is a synthetic dataset collected in simulation environments, covering a wide range of scenes, styles, and motion patterns with challenging visual effects.

Platform and Sensor Setup
The dataset was collected in a simulated manner, so there is no physical platform or sensor setup. However, the dataset simulates many motion patterns with multimodal sensors: two color cameras (stereo setup), one depth camera, and one 3D LiDAR. The properties are shown in Table 5.

Scenes and Sequences
The dataset is very realistic, with diverse scenes including 30 environments in 6 categories: Urban, Rural, Nature, Domestic, Public, and Scifi. As shown in Fig. 14, even within the same category, the environments have large diversity. The scenes were designed with challenging effects, such as dynamic lighting conditions, bad illumination, adverse weather, time and season changes, and dynamic objects. The dataset has a total of 1037 long sequences with different motion patterns, resulting in more than 1 million frames, which is much larger than existing datasets.
TABLE 5. The properties of the sensor setup.
Sensor | Type | Rate | Characteristics
Camera | 2 × Virtual | 10 Hz (tunable) | 640 × 480

FIGURE 14. Some representative images of different categories: Urban, Rural, Nature, Domestic, Public, and Scifi from left column to right [43].

Ground Truth
Both camera poses and 3D map ground truth are provided. They are extremely precise thanks to the simulation. Beyond these, some other labels are also provided, such as semantic segmentation, stereo disparity, depth images, and optical flow.

Highlights
Multimodal sensors, various motion patterns, diverse scenes, wealthy sequences, challenging effects, multimodal ground truth, accurate ground truth.

5) UZH-FPV DRONE RACING
The UZH-FPV Drone Racing Dataset, published in 2019 [121], is the official dataset of the IROS 2020 Drone Racing Competition. Among existing VI datasets, it contains the most aggressive motions, such as large accelerations and rapid rotations, which are far beyond the capabilities of existing algorithms.

Platform and Sensor Setup
The dataset was collected using a quadrotor equipped with an event camera (with built-in IMU) and a grayscale binocular stereo camera (with built-in IMU). The properties of the sensors are listed in Table 6.
TABLE 6. The properties of the sensors and ground truth instrument.
Sensor/Instrument | Type | Rate | Characteristics
Event Camera | mDAVIS | 50 Hz | 346 × 260, built-in IMU
Stereo Camera | – | – | built-in IMU
Laser Tracker | Leica MS60 | 20 Hz | 3D position GT

Scenes and Sequences
The dataset was collected in two scenes: an indoor airplane hangar and a large outdoor field. It consists of 27 sequences in total, with max distances of 340.1/923.5 m and top speeds of 12.8/23.4 m/s for the indoor/outdoor scenes (both top speeds are higher than in any existing indoor/outdoor sequences). The collection environments are shown in Fig. 15.

FIGURE 15. Indoor and outdoor environments used for dataset collection [121].

Ground Truth
The ground truth data were generated by measuring the 3D position using a Leica MS60 total station, with accuracy below 1 mm. Due to the aggressive flight during collection, laser tracking loss could frequently occur; thus, only the sequences with full trajectories were selected for publication. The properties of the instrument are shown in Table 6.

Highlights
Aggressive motion, highly accurate ground truth, event data.

6) OPENLORIS-SCENE
The OpenLORIS-Scene Dataset, published in 2019 [118], is the official dataset of the IROS 2019 Lifelong SLAM Challenge. It is a real-world collection in dynamic and daily-changing environments, specialized for long-term robot navigation.

Platform and Sensor Setup
The dataset was collected using a mobile robot (with wheel odometry) equipped with an RGB-D camera (with built-in IMU), a color binocular stereo camera (with built-in IMU), and a 2D/3D LiDAR (to generate ground truth poses). Note that the LiDAR and mobile robot differ between scenes: a Gaussian robot with a 3D LiDAR for the market scene, and a Segway robot with a 2D LiDAR for the other scenes. The properties of the sensors are listed in Table 7.
TABLE 7. The properties of the sensors and ground truth instruments.
Sensor/Instrument | Type | Rate | Characteristics
RGB-D Camera | RealSense D435i | 30 Hz | 848 × 480, 6-axis IMU
Stereo Camera (binocular) | RealSense T265 | 30 Hz | fisheye, built-in IMU
2D LiDAR | Hokuyo | – | 270° Horizontal, 30 m range, 5 cm accuracy
3D LiDAR | RoboSense-16 | – | 360° H × 30° V, 150 m range, 2 cm accuracy
Motion Capture | OptiTrack | 240 Hz | 6D pose GT

Scenes and Sequences
The dataset was collected in 5 scenes: Office, Corridor, Home, Café, and Market. There are 22 sequences in total, accumulating 2244 seconds of data. Since the recording was performed in a real-life context, many dynamic people and objects appear in the data. Within each scene, there are 2-7 sequences recorded at different time slots to support testing in ever-changing environments, including changes in illumination, viewpoint, and those caused by human activities. Example color images are shown in Fig. 16.

FIGURE 16. The 5 different environments used for dataset collection: Office, Corridor, Home, Café, and Market from left to right. As can be seen from the two rows, the data were recorded in the same scenes at different time slots, which could benefit long-term SLAM research [118].

Ground Truth
The ground truth was generated by either a motion capture system or LiDAR SLAM, depending on the scene. For the Office scene, the ground truth was provided by the motion capture system at a rate of 240 Hz. For the other scenes, the ground truth was generated by LiDAR SLAM methods: in Corridor and Café, a variant of the Hector SLAM algorithm [129] was used, while in Home and Market, another LiDAR SLAM method combined with multi-sensor fusion was used to obtain the ground truth.

Highlights
Service robot platform, multimodal sensors, ever-changing conditions, diverse scenes, repeated exploration.

B. LIDAR RELATED MOBILE LOCALIZATION AND MAPPING
This branch collects 20 datasets that originated from or are best fitted to LiDAR and visual-LiDAR based mobile localization and mapping usage. To begin with, an overview of all the related datasets is provided in Table 9. Note that the datasets supporting laser-based methods are usually also compatible with vision-based methods, since LiDARs are almost always accompanied by cameras.

1) KITTI ODOMETRY
The KITTI Odometry Benchmark, released by the Karlsruhe Institute of Technology and the Toyota Technological Institute in 2012, was designed to promote research in the autonomous driving field [29]. Indeed, the original focus of KITTI was to promote vision-based methods. Thanks to the high-quality LiDAR data acquired by a Velodyne HDL-64E, it has also become one of the most widely used datasets in the LiDAR SLAM community [72], [80], [147].

Platform and Sensor Setup
The dataset was collected using a car equipped with two monochrome cameras (binocular stereo), two color cameras (binocular stereo), a LiDAR, and a GPS&INS device (to generate ground truth poses). All the sensors are top-level and provide high-quality data. The properties of the sensors are listed in Table 8.
TABLE 8. The properties of the sensors and ground truth instruments.
Sensor/Device | Type | Rate | Characteristics
Camera (monochrome) | 2 × PointGrey Flea2 FL2-14S3M-C | 10 Hz | 1392 × 512, 90° × 35° FOV
Camera (color) | 2 × PointGrey Flea2 FL2-14S3C-C | 10 Hz | 1392 × 512, 90° × 35° FOV
LiDAR | Velodyne HDL-64E | 10 Hz | 64-beam, 360° H × 26.8° V, 120 m range, 2 cm accuracy
GPS&INS | OXTS RT3003 | 100 Hz | L1/L2 RTK, <5 cm error (under open sky), 6D pose GT
TABLE 9. Overview and comparison of LiDAR related mobile localization and mapping datasets.
Name | Year | Scene | Platform | Camera | C-IMU | LiDAR | Pose GT | 3D GT
DARPA [130] | 2007 | Outdoor (urban) | Car | 4 × color (stereo), 1 × WA-color | N/A | 12 × Sick-2D, 1 × Velodyne-64 | DGPS+IMU+DMI | N/A
New College [131] | 2009 | Outdoor | Wheeled Robot | 2 × gray (stereo), 5 × color (panor) | Y | 2 × Sick-2D | Segway Odometry | N/A
Marulan [132] | 2010 | Outdoor | UGV | 1 × color, 1 × IR-thermal | N/A | 4 × Sick-2D | RTK-DGPS/INS | N/A
Ford Campus [133] | 2011 | Outdoor | Car | 6 × color (omni) | Y | 1 × Velodyne-64, 2 × Riegl-2D | DGPS | N/A
ASL Challenging [134] | 2012 | Indoor/Outdoor | Custom Base | N/A | Y | 1 × Hokuyo-2D (tilting) | Total Station | Y
KITTI [29] | 2012 | Outdoor | Car | 2 × color (stereo), 2 × gray (stereo) | N/A | 1 × Velodyne-64 | RTK-GPS/INS | N/A
Canadian Planetary [135] | 2013 | Outdoor (planetary) | UGV | 3 × color (stereo) | Y (P) | 1 × Sick-2D | DGPS+IMU / DGPS+VO | N/A
Malaga Urban [136] | 2014 | Outdoor (urban) | Car | 2 × color (stereo) | Y | 2 × Sick-2D, 3 × Hokuyo-2D | GPS | N/A
NCLT [46] | 2016 | Indoor/Outdoor | Wheeled Robot | 6 × color (omni) | N/A | 1 × Velodyne-32, 1 × Hokuyo-2D | FOG+RTK-GPS+EKF-Odometry | N/A
Sugar Beets 2016 [137] | 2016 | Outdoor (agricultural) | Field Robot | 1 × RGB/NIR, 1 × RGB-D | N/A | 2 × Velodyne-16, 1 × Nippon-2D | RTK-GPS | Y
Chilean Underground [138] | 2017 | Underground (mine) | UGV | 3 × color (stereo) | N/A | 1 × Riegl-3D | Registration | Y
Oxford RobotCar [69] | 2017 | Outdoor (urban) | Car | 3 × color (stereo), 3 × color (3-orien) | N/A | 2 × Sick-2D, 1 × Sick-4 | RTK-GPS/INS | N/A
KAIST Day/Night [139] | 2018 | Outdoor | Car | 2 × color (stereo), 1 × thermal | N/A | 1 × Velodyne-32 | RTK-GPS/INS | N/A
Katwijk Beach [140] | 2018 | Outdoor (planetary) | Rover | 2 × color (stereo), 2 × color (stereo) | N/A | 1 × Velodyne-16 | RTK-DGPS | N/A
MVSEC [141] | 2018 | Indoor/Outdoor | Multiple | 2 × event (stereo), 2 × gray (stereo) | Y | 1 × Velodyne-16 | MoCap / LiDAR+IMU+GPS | N/A
MI3DMAP [142], [143] | 2018 | Indoor | Backpack | 1 × RGB, 2 × fisheye-RGB | Y | 2 × Velodyne-16 | SLAM | Y
Urban@CRAS [144] | 2018 | Outdoor (urban) | Car | 2 × color (stereo) | Y | 1 × Velodyne-16 | RTK-GPS | N/A
Complex Urban [66] | 2019 | Outdoor (urban) | Car | 2 × color (stereo) | Y | 2 × Velodyne-16, 2 × Sick-2D | RTK-GPS+FOG+SLAM | Y
KAIST Radar [145] | 2019 | Outdoor (urban) | Car | 1 × radar | N/A | 1 × Sick-2D, 1 × Velodyne-16 | RTK-GPS+FOG+SLAM | Y
Newer College [146] | 2020 | Outdoor | Handheld | 1 × RGB-D | Y | 1 × Ouster-64 | ICP | Y
Abbreviations: C-IMU = consumer-grade IMU; WA = wide angle; MoCap = Motion Capture System; panor = panoramic; IR = infrared; omni = omnidirectional; P = partial coverage; orien = camera orientations.

Scenes and Sequences
The dataset was collected by driving around a city, in rural areas, and on highways. It covers very large-scale scenes; two example images are shown in Fig. 17. Considering long trajectories, varying speeds, and high-accuracy GPS signal, a total of 39.2 km of driving distance with frequent loop closures was selected, forming 22 sequences.

Ground Truth
A high-end GPS&INS instrument was used to generate the pose ground truth. To ensure reliable accuracy over each whole sequence, the trajectories were carefully selected to retain segments with good GPS signal. With the aid of RTK, the device can achieve better than 5 cm positioning accuracy under an open sky. Since the GPS&INS data cannot be synchronized with the cameras and LiDAR via hardware, the closest readouts on the timeline were recorded, and to enable per-frame evaluation, interpolation was performed to provide ground truth data at any timestamp. The properties of the ground truth instruments are listed in Table 8.
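A minimal sketch of this interpolation step, assuming timestamps and poses given as numpy arrays: linear interpolation for position and spherical linear interpolation (SLERP) for orientation between the nearest GPS/INS readouts. The function name and array layout are assumptions, not KITTI's actual tooling.

# Hypothetical pose interpolation between GPS/INS readouts.
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_poses(t_gps, positions, quaternions, t_query):
    """t_gps: sorted GPS/INS timestamps (N,); positions: (N, 3);
    quaternions: (N, 4) in xyzw order; t_query: timestamps at which
    ground truth is needed (e.g., camera or LiDAR frame times)."""
    t_query = np.clip(t_query, t_gps[0], t_gps[-1])    # stay inside the log
    pos = np.column_stack([np.interp(t_query, t_gps, positions[:, i])
                           for i in range(3)])         # per-axis linear interp
    slerp = Slerp(t_gps, Rotation.from_quat(quaternions))
    rot = slerp(t_query)                               # SLERP orientations
    return pos, rot.as_quat()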
FIGURE 17. Two example images of the collection scenes: Residential (top) and Road (bottom) [148].

Highlights
High-quality data, vision-LiDAR setup, driving platform, wealthy sequences, loop trajectories.

2) OXFORD ROBOTCAR
The Oxford RobotCar Dataset, presented in 2016, was designed for long-term autonomous driving research in real-world, dynamic urban environments. The recording traversed the same route twice a week for around one year, resulting in over 1000 km of trajectories. Consequently, it captures significant scene appearance changes, enabling all-round investigation [69].

Platform and Sensor Setup
A NISSAN LEAF vehicle was equipped with six cameras (3 of them forming a trinocular stereo camera and the other 3 being monocular cameras at different orientations), two 2D LiDARs, a 3D LiDAR, and a GPS&INS device (to generate ground truth poses). The properties of the sensors are listed in Table 10.
TABLE 10. The properties of the sensors and ground truth instruments.
Sensor/Device | Type | Rate | Characteristics
Stereo Camera (trinocular) | FLIR Bumblebee XB3 | 16 Hz | 1280 × 960, 66° HFOV, 12/24 cm baseline
Camera | 3 × FLIR GS2-FW-14S5C-C | 12 Hz | 1024 × 1024, 180° HFOV
2D LiDAR | SICK LMS-151 | 50 Hz | 270° Horizontal, 50 m range, 3 cm accuracy
3D LiDAR | SICK LD-MRS | 12.5 Hz | 85° H × 3.2° V, 50 m range, 3 cm accuracy
GPS&INS | NovAtel SPAN-CPT | 50 Hz | Dual antenna, 6D pose

Scenes and Sequences
The dataset was collected in central Oxford along the same route twice a week, forming a total of 1000 km of recordings. A wide variety of scene appearances and structures caused by illumination, weather, dynamic objects, seasonal effects, and construction works was captured, covering conditions of pedestrian, cyclist, and vehicle traffic; light and heavy rain; direct sun; snow; and dawn, dusk, and night. Several example images that illustrate the changing of the environment are shown in Fig. 18.

Ground Truth
Separate from the publication of the dataset, the ground truth was released in 2020 [150]. A carefully selected subset of 72 traversals was provided with 6D-pose ground truth at 10 Hz, computed by a Real-Time Kinematic (RTK) post-processing solution using tightly coupled observations. The estimated position errors are less than 15 cm in latitude and longitude and less than 25 cm in altitude, and the orientation errors are less than 0.01° in pitch and roll and 0.1° in yaw. The properties of the ground truth instruments are listed in Table 10. FIGURE 18. The variation in driving conditions during dataset collection, including changes due to lighting, weather, and occlusions by other road users (vehicles, pedestrians, cyclists) [149]. Highlights
Long-term dataset, ever-changing conditions, vision-LiDAR setup, diverse scenes, repeated traversals, driving platform. NCLT
The NCLT Dataset, published in 2016 [46], was collected biweekly over 15 months on the University of Michigan's North Campus, focusing on long-term autonomous operation. Owing to the repeated explorations along different routes and at different times, many challenging effects and changing elements are included. Platform and Sensor Setup
The dataset was collected using a Segway robot equipped with a color omnidirectional camera, two 2D-LiDARs, one 3D-LiDAR, one IMU, one consumer-grade GPS, one RTK-GPS (to generate ground truth poses), and a single-axis fibre-optic gyroscope (FOG). The properties of the sensors are listed in Table 11. Scenes and Sequences
The dataset was recorded on the University of Michigan's North Campus in 27 discrete mapping sessions, containing 34.9 hours of logs over 147.4 km of trajectory. The collections repeatedly traverse the campus along diverse trajectories (as can be seen in Fig. 19), covering both indoor and outdoor environments at different times of day throughout all seasons. Many challenging elements are covered, including moving objects, varying viewpoints, and scene changes caused by seasons, weather, and construction. Some sample images and point clouds of the dataset are shown in Fig. 20. TABLE 11. The properties of the sensors and ground truth instruments.
Sensor/Device | Type | Rate | Characteristics
Omnidirectional Camera | Pointgrey LD3-20S4C (Ladybug3) | 5 Hz | 1600 × 1200
2D LiDAR | Hokuyo UTM-30LX | 40 Hz | 270° horizontal, 30m range, 5cm accuracy
2D LiDAR | Hokuyo URG-04LX | 10 Hz | 240° horizontal, 4m range, 12cm accuracy
3D LiDAR | Velodyne HDL-32E | 10 Hz | 32-beam LiDAR, 360° H × 41.3° V, 100m range, 2cm accuracy
IMU | Microstrain 3DM-GX3-45 | 100 Hz | 3D accelerometer, 3D gyroscope
FOG | KVH DSP-3000 | 1000 Hz | 1-axis, accurate yaw
GPS | Garmin 18x | 5 Hz | consumer-grade
RTK GPS | NovAtel DL-4 plus | 1 Hz | RTK correction, accurate position
FIGURE 19. Left: a sample trajectory of dataset collection. Right: all the trajectories are aligned together, each layer represents a session [151].
FIGURE 20. Sample images (forward camera) and point clouds of the dataset [151]. Ground Truth
The pose ground truth was generated using LiDAR scan matching and high-accuracy RTK-GPS. LiDAR scan-matching constraints were added within and between sessions, enabling accurate pose generation even where GPS was inaccurate or denied. To make the data easier to use, interpolation between pose nodes (aligned with the imagery and LiDAR scans) was performed based on odometry precomputed via an extended Kalman filter (EKF). The properties of the ground truth instruments are listed in Table 11. Highlights
Long-term dataset, vision-LiDAR setup, repeated traversals, robot platform, indoor and outdoor coverage, ever-changing conditions. KAIST MULTI-SPECTRAL ALL-DAY
The KAIST Multi-spectral All-day Dataset [139], published in 2018, was recorded over a wide range of drivable regions using a multi-spectral sensor platform for autonomous systems research. It was collected at different time slots of the day, and thus can facilitate various tasks and all-day perception. Platform and Sensor Setup
The dataset was collected using an SUV equipped with two RGB cameras (stereo setup), one thermal camera, one 3D-LiDAR, and one GPS&INS system (to generate pose ground truth). The properties of the sensors are listed in Table 12. Scenes and Sequences
The dataset was collected in general traffic situations in urban, campus, and residential areas, covering many static and dynamic targets. Remarkably, the data were collected at different time slots (day and night) and under good and bad illumination, enabling research under ever-changing conditions. The campus scenario was captured at finer time slots (sunrise, morning, afternoon, sunset, night, and dawn). Some sample images captured in the campus scenario are shown in Fig. 21. Ground Truth
The ground truth poses were generated by the OXTS GPS&INS device. With the aid of RTK correction and under reliable satellite signal, it can achieve centimeter-level accuracy in position and sub-degree-level accuracy in orientation. The properties of the ground truth instruments are listed in Table 12. Highlights
Multi-spectral sensors, ever-changing conditions, driving platform, diverse scenes, challenging environments.
TABLE 12. The properties of the sensors and ground truth instruments.
Sensor/Device | Type | Rate | Characteristics
Camera | 2 × PointGrey Flea3 FL3-GE-13S2C-C | 25 Hz | 1280 × 960
Thermal Camera | FLIR A655Sc | 25 Hz | 640 × 480
3D LiDAR | Velodyne HDL-32E | 10 Hz | 32-beam LiDAR, 360° H × 41.3° V, 100m range, 2cm accuracy
GPS&INS | OXTS RT2002 | 100 Hz | L1/L2 RTK, 0.02m/0.1° resolution, 6D pose GT
FIGURE 21. Sample images captured in the campus scenario at different time slots [139]. COMPLEX URBAN
The Complex Urban Dataset, released in 2019, was recorded in diverse and complex urban environments suffering from denied or inaccurate GPS, using multi-modal sensors of both consumer and high-end grade [66]. Platform and Sensor Setup
The dataset was collected using a car equipped with two color cameras (stereo setup), two 2D-LiDARs, two 3D-LiDARs, one consumer-level GPS, one VRS-RTK GPS (to generate ground truth), one 3-axis FOG (to generate ground truth), one consumer-level IMU, and two wheel encoders (to generate ground truth). The properties of the sensors are listed in Table 13.
TABLE 13. The properties of the sensors and ground truth instruments.
Sensor/Device | Type | Rate | Characteristics
Camera | 2 × FLIR FL3-U3-20E4C-C | 10 Hz | 1280 × 560
2D LiDAR | 2 × SICK LMS-511 | 100 Hz | 190° horizontal, 80m range, 5cm accuracy
3D LiDAR | 2 × Velodyne VLP-16 | 10 Hz | 16-beam LiDAR, 360° H × 30° V, 100m range, 3cm accuracy
GPS | U-Blox EVK-7P | 10 Hz | 2.5m accuracy, consumer-grade
FOG | KVH DSP-1760 | 1000 Hz | 3-axis, 0.05°/h bias
IMU | Xsens MTi-300 | 200 Hz | 9-axis AHRS, 10°/h bias
Encoder | 2 × RLS LM13 | 100 Hz | 4096 resolution
RTK GPS | SOKKIA GRX2 | 1 Hz | H: 10mm, V: 15mm accuracy
Scenes and Sequences
The dataset was collected in diverse urban environments including metropolitan areas, residential areas, complex apartments, highways, tunnels, bridges, and campuses. All sequences were recorded over kilometers of trajectory, along with many challenging elements long missing from existing datasets, such as GPS-denied areas, multi-lane roads, complex building structures, and highly dynamic objects. Some sample images and point cloud data from the dataset are shown in Fig. 22.
FIGURE 22. Some sample images and 3D point cloud data in various environments [66]. Ground Truth
The trajectory baseline here was generated using the high-precision sensors, LiDAR data, and SLAM algorithms. The incremental smoothing and mapping (iSAM) method [152] was employed, with measurements from the FOG (rotation) and encoders (translation) used to add constraints to the pose graph and enhance the accuracy. Additionally, to improve global consistency, many loop paths were designed for performing ICP-based loop closure. Although this baseline can be used for comparison, its accuracy may be lower than a normal "ground truth" due to the complexity of the environments.
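As a toy illustration of this pose-graph construction (not the authors' actual pipeline), the sketch below uses the GTSAM library, which implements iSAM2: odometry-style constraints, such as those built from FOG rotations and encoder translations, become between-factors, and an ICP-verified loop closure adds one more. All poses and noise values here are hypothetical, and the real system works in SE(3) rather than the planar SE(2) used for brevity.

```python
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

# Anchor the first pose, then chain odometry constraints (hypothetical values).
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0, 0, 0), prior_noise))
for i in range(3):
    graph.add(gtsam.BetweenFactorPose2(i, i + 1, gtsam.Pose2(1.0, 0, 0), odom_noise))
    initial.insert(i, gtsam.Pose2(float(i), 0, 0))
initial.insert(3, gtsam.Pose2(3.0, 0, 0))

# A loop closure (e.g., verified by ICP) between pose 3 and pose 0.
graph.add(gtsam.BetweenFactorPose2(3, 0, gtsam.Pose2(-3.0, 0, 0), odom_noise))

# Incrementally solve the graph with iSAM2.
isam = gtsam.ISAM2()
isam.update(graph, initial)
print(isam.calculateEstimate().atPose2(3))
```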
Highlights
Driving platform, multi-modal sensors, multi-grade sensors, complex environments, diverse scenes. NEWER COLLEGE
The Newer College Dataset (the upgraded version of the 2009 New College Dataset [131]), released in 2020, was collected at typical walking speeds using a handheld carrier moving through New College, Oxford [146]. It is a particularly valuable dataset, as it contains both trajectory and 3D mapping ground truth at large scale.
FIGURE 23. Top: the pre-built 3D map of the environment. Bottom: some images and LiDAR scans from the Quad, Mid-Section, and Parkland [146]. Platform and Sensor Setup
The dataset was collected using a handheld carrier equipped with an RGB-D sensor (with embedded IMU) and a 3D-LiDAR (with embedded IMU). The properties of the sensors are listed in Table 14. Scenes and Sequences
The dataset was collected in New College, Oxford, UK, covering 3 main sections: Quad (Q), Mid-Section (M), and Parkland (P). The sequences were recorded at walking speeds of around 1 m/s along designed revisiting paths. The 3D map of the environment and some example images and LiDAR scans are shown in Fig. 23.
TABLE 14. The properties of the sensors and ground truth instruments.
Sensor/Instrument | Type | Rate | Characteristics
RGB-D Camera | RealSense D435i | 30 Hz | 848 × 480, 87° × 58° FOV, 6-axis IMU
LiDAR | Ouster OS1-64 | 10 Hz | 64-beam LiDAR, 360° H × 33.2° V, 120m range, 5cm accuracy, 6-axis IMU
3D Scanner | Leica BLK360 | 360 kHz | 7mm accuracy, 3D map GT
Ground Truth
First, a millimeter-accurate ground truth 3D map was obtained using a survey-grade 3D laser scanner. Then, an ICP-based approach was used to generate the ground truth poses: it registers each reading point cloud from the 3D-LiDAR against the ground truth 3D map from the Leica scanner, inferring centimeter-accurate 6D poses.
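As a rough sketch of this map-based localization scheme (file names, parameters, and the initial guess are hypothetical, and the dataset's actual pipeline is more elaborate), one could use Open3D's point-to-plane ICP to register a LiDAR scan against the prior map:

```python
import numpy as np
import open3d as o3d

# Hypothetical file names: the prior map comes from the survey-grade scanner.
prior_map = o3d.io.read_point_cloud("leica_prior_map.pcd")
lidar_scan = o3d.io.read_point_cloud("ouster_scan.pcd")
# Point-to-plane ICP needs normals on the target cloud.
prior_map.estimate_normals(
    o3d.geometry.KDTreeSearchParamHybrid(radius=0.5, max_nn=30))

init_T = np.eye(4)  # e.g., the previous frame's pose as the initial guess
result = o3d.pipelines.registration.registration_icp(
    lidar_scan, prior_map, max_correspondence_distance=0.5, init=init_T,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPlane())
print("6D pose of the scan in the map frame:\n", result.transformation)
```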
Highlights
Handheld platform, large scene, accurate ground truth, loop trajectories. C. RGB-D BASED MOBILE LOCALIZATION AND MAPPING
This branch collects 20 datasets that originate from, or are best suited to, RGB-D based mobile localization and mapping usage. To begin with, an overview of all the related datasets is provided in Table 15. Note that RGB-D datasets can normally also be used for pure vision-based methods, since the 2D images from the RGB-D sensors can be accessed directly apart from the depth information. TUM RGB-D
The TUM RGB-D Dataset, released by the Computer Vision Group of the Technical University of Munich in 2012 [30], has become one of the most popular SLAM related datasets of the past decade. Though captured using an RGB-D sensor, the dataset has also drawn significant attention from pure vision based algorithm researchers [172], [173], [174]. Platform and Sensor Setup
There are two recording platforms: handheld and robot, both of which were mounted with an RGB-D camera, attached with a set of reflective markers for ground truth measurement. The properties of the sensors are listed in Table 16. TABLE 15. Overview and comparison of RGB-D based mobile localization and mapping.
Name | Year | Scene | Platform | Camera | C-IMU | LiDAR | Pose GT | 3D GT
RGB-D Objects [153] | 2011 | Objects | Turntable | 1 × RGB-D, 1 × color | N/A | N/A | Marker Tracking | SLAM
KinectFusion [154] | 2012 | Indoor | Handheld | 1 × RGB-D | N/A | N/A | SLAM | SLAM / 3D Scanner
NYU v2 [155] | 2012 | Indoor | Handheld | 1 × RGB-D | N/A | N/A | N/A | Alignment
TUM RGB-D [30] | 2012 | Indoor | Handheld, Wheeled Robot | 1 × RGB-D | N/A | N/A | MoCap | N/A
7-Scenes [156] | 2013 | Indoor | Handheld | 1 × RGB-D | N/A | N/A | SLAM | SLAM
SUN 3D [157] | 2013 | Indoor | Handheld, Wheeled Robot | 1 × RGB-D | N/A | N/A | SfM | N/A
Stanford 3D Scene [158] | 2013 | Indoor/Outdoor | Handheld | 1 × RGB-D | N/A | N/A | SLAM | SLAM
(V) ICL-NUIM [159] | 2014 | Indoor | Handheld | 1 × RGB-D | N/A | N/A | SLAM | Simulated
RGB-D Scenes [160] | 2014 | Indoor | Handheld | 1 × RGB-D | N/A | N/A | SLAM | SLAM
CoRBS [161] | 2016 | Indoor | Handheld | 1 × RGB-D | N/A | N/A | MoCap | 3D Scanner
SceneNN [162] | 2016 | Indoor | Handheld | 2 × RGB-D (different scenes) | N/A | N/A | SLAM | SLAM
ETH RGB-D [163] | 2017 | Indoor | MAV | 1 × RGB-D | N/A | N/A | MoCap | 3D Scanner
Robot@Home [164] | 2017 | Indoor | Wheeled Robot | 4 × RGB-D (180° assembly) | N/A | 1 × Hokuyo-2D | SLAM | N/A
(V) SceneNet [165] | 2017 | Indoor | Random | 1 × RGB-D | N/A | N/A | Simulated | Simulated
SUNCG [166] | 2017 | Indoor | Custom | 1 × RGB-D | N/A | N/A | Simulated | Simulated
ScanNet [167] | 2017 | Indoor | Handheld | 1 × RGB-D (custom) | Y | N/A | SLAM | SLAM
(V) InteriorNet [168] | 2018 | Indoor | Custom | 1 × RGB-D | Y | N/A | Simulated | Simulated
Bonn [169] | 2019 | Indoor | Handheld | 1 × RGB-D | N/A | N/A | MoCap | 3D Scanner
ETH3D SLAM [170] | 2019 | Indoor/Outdoor | Handheld | 1 × color, 1 × depth | Y | N/A | MoCap+SfM | N/A
(V) Replica [171] | 2019 | Indoor | Custom | 1 × RGB-D | N/A | N/A | Simulated | Simulated
Notes: C-IMU = consumer-grade IMU; MoCap = motion capture system; (V) marks virtual (synthetic) datasets. Scenes and Sequences
The dataset was collected in two different indoor environments: an office (≈6 × 6 m²) and an industrial hall (≈10 × 12 m²). A total of 39 sequences were captured with highly accurate ground truth poses. The collection environments are depicted in Fig. 24. TABLE 16. The properties of the sensors and ground truth instrument.
Sensor/Instrument | Type | Rate | Characteristics
RGB-D Camera | Kinect v1 | 30 Hz | 640 × 480, 57° × 43° FOV, 3-axis accelerometer
Motion Capture | MotionAnalysis Raptor-E | 100 Hz | 6D pose GT
FIGURE 24. The collection environments: office (left), and industrial hall (right) [172]. Ground Truth
Both environments were equipped with a MotionAnalysis motion capture system, which provides accurate 6D pose ground truth. The properties of the ground truth instruments are listed in Table 16.
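The ground truth trajectories are distributed in the now widely adopted TUM format, where each non-comment line stores "timestamp tx ty tz qx qy qz qw"; a minimal parser (function name ours) might look as follows:

```python
import numpy as np

def load_tum_trajectory(path):
    """Parse a TUM-format trajectory file.

    Each line is 'timestamp tx ty tz qx qy qz qw'; '#' starts a comment.
    Returns timestamps (N,), positions (N, 3), and quaternions (N, 4, xyzw).
    """
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                rows.append([float(v) for v in line.split()])
    arr = np.asarray(rows)
    return arr[:, 0], arr[:, 1:4], arr[:, 4:8]
```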
Highlights
Handheld platform, robot platform, accurate ground truth. ICL-NUIM
The ICL-NUIM Dataset, released in 2014 [159], is one of the most widely used datasets in the SLAM related field [175], [176], [177]. It provides both camera pose and 3D structure ground truth. Though generated in a synthetic manner, the data are remarkably realistic. Platform and Sensor Setup
The dataset was generated via a synthetic approach, so there was no physical platform or sensor system. As an alternative, the data generation process followed realistic handheld trajectories within 3D models in the POV-Ray software, simulating an RGB-D sensor capturing the data. The intrinsic calibration matrix of the simulated RGB-D sensor was provided, enabling algorithms to run realistically. The properties of the virtual sensor setup are listed in Table 17. TABLE 17. The properties of the virtual sensor setup.
Sensor | Type | Rate | Characteristics
RGB-D Camera | Virtual (POV-Ray) | 30 Hz | 640 × 480
FIGURE 25. Two example images: living room (left), and office room (right) [159]. Scenes and Sequences
The dataset was collected synthetically in two different scenes: a living room and an office room. Each virtual scene contains 4 sequences, one of which contains a small loop trajectory. To simulate real-world data, noise can be added to the RGB and depth channels. Two example images are shown in Fig. 25.
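As a rough illustration only (the dataset's actual noise model is more elaborate and physically motivated), clean synthetic frames could be perturbed as follows, with all parameter values hypothetical:

```python
import numpy as np

def add_simple_sensor_noise(rgb, depth, rgb_sigma=2.0, depth_sigma_rel=0.01, rng=None):
    """Perturb clean synthetic frames with zero-mean Gaussian noise.

    rgb: uint8 HxWx3; depth: float HxW in meters. The depth noise grows
    linearly with range here, a crude stand-in for a real sensor model.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy_rgb = np.clip(rgb + rng.normal(0.0, rgb_sigma, rgb.shape), 0, 255).astype(np.uint8)
    noisy_depth = depth + rng.normal(0.0, 1.0, depth.shape) * depth_sigma_rel * depth
    return noisy_rgb, noisy_depth
```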
Ground Truth
The dataset provides two kinds of ground truth: camera trajectories and 3D mapping. To achieve high reliability, the trajectories were obtained by running Kintinuous [178] in a real living room, and the estimated trajectories were then used as the ground truth routes for collecting the dataset. For the office room scene, the designed model is provided as the 3D structure ground truth. Highlights
Synthetic dataset, realistic data, accurate ground truth, 3D scene ground truth. BONN RGB-D DYNAMIC
The Bonn RGB-D Dynamic Dataset, released in 2019 [169], is a highly dynamic RGB-D dataset focusing on SLAM related tasks. Both camera poses and 3D mapping ground truth are provided for evaluation.
TABLE 18. The properties of the sensors and ground truth instruments.
Sensor/Instrument | Type | Rate | Characteristics
RGB-D Camera | ASUS Xtion Pro LIVE | 30 Hz | 1280 × 1024, 58° × 45° FOV
Motion Capture | OptiTrack Prime 13 | 240 Hz | 6D pose GT, <0.2cm/0.5° error
3D Scanner | Leica BLK360 | 360 kHz | 7mm accuracy, 3D map GT
FIGURE 26. Two example dynamic frames in the dataset: manipulating boxes (left), and playing with balloons (right) [169]. Platform and Sensor System
The sequences were captured using an RGB-D sensor on a handheld carrier. A motion capture system and a 3D laser scanner were used to measure the ground truth data. The properties of the sensors are listed in Table 18. Scenes and Sequences
The dataset was collected in both dynamic and static scenes. A total of 24 sequences were captured in dynamic situations, such as manipulating boxes and playing with balloons, which can be challenging for algorithms, and 2 sequences were captured in static situations. Two example dynamic frames are shown in Fig. 26. Ground Truth
All the sequences are provided with ground truth poses measured by an OptiTrack motion capture system with high accuracy. Additionally, the 3D mapping ground truth is provided by a Leica laser scanner, containing only the static parts of the environment (as shown in Fig. 27), enabling the evaluation of reconstruction algorithms. FIGURE 27. The ground truth 3D point cloud of the scene [169]. Highlights
Highly dynamic environments, abundant sequences, accurate ground truth, 3D scene ground truth. SUN3D
The SUN3D Dataset, published in 2013 [157], is a large-scale dataset containing various RGB-D videos covering a wide range of indoor places. Ground truth camera poses and 3D models are provided for all sequences. Platform and Sensor Setup
There are two recording configurations: handheld and robot, both mounted with an RGB-D camera attached with a set of reflective markers for ground truth generation. The properties of the sensor are listed in Table 19.
TABLE 19. The properties of the sensor.
Sensor/Instrument | Type | Rate | Characteristics
RGB-D Camera | ASUS Xtion Pro LIVE | 30 Hz | 640 × 480, 58° × 45° FOV
Scenes and Sequences
The dataset contains 415 sequences captured in 254 different places, such as apartments, conference rooms, restrooms, classrooms, and lounges, in 41 different buildings. While capturing, the operator was instructed to mimic human exploration and walk through the entire space with several deliberate loops. Some sample collection environments are depicted in Fig. 28. FIGURE 28. Some sample collection environments: apartment, conference room, conference hall, and restroom from left to right respectively [157]. Ground Truth
Ground truth poses are provided for all sequences. A system first performed RGB-D SfM to obtain initial camera poses, followed by some manual adjustment. Then a generalized bundle adjustment that leverages object annotations as constraints was used to refine the poses and produce the ground truth. Highlights
Handheld platform, abundant sequences, diverse scenes, high-quality data. CORBS
The CoRBS Dataset, presented in 2016 [161], was designed as a comprehensive RGB-D benchmark for SLAM. As the name suggests, it provides both depth and color information, along with not only camera poses but also 3D scene ground truth, outperforming many contemporary datasets. Platform and Sensor Setup
The dataset was recorded using a handheld carrier, which was mounted with an RGB-D camera, attached with a set of reflective markers, enabling the external motion capture system to measure the trajectory ground truth. The properties of the sensors are listed in Table 20.
TABLE 20. The properties of the sensors and ground truth instruments.
Sensor/Instrument | Type | Rate | Characteristics
RGB-D Camera | Kinect v2 | 30 Hz | 1920 × 1080, 70° × 60° FOV, 3-axis accelerometer
Motion Capture | OptiTrack Flex13 | 120 Hz | 6D pose GT, sub-mm accuracy
3D Scanner | fringe projection scanner | — | 3D map GT, 0.2mm accuracy
FIGURE 29. The example data and ground truth models: human, desk, electrical cabinet, and racing car from left to right respectively [161]. Scenes and Sequences
The dataset contains 20 sequences captured in 4 different environments: human, desk, electrical cabinet, and racing car. The camera exploration incorporates different characteristics, such as loop closures, twisting, slow and fast motion, and short and long distances. The example data and ground truth models are depicted in Fig. 29. Ground Truth
Both the camera poses and the 3D scene ground truth are provided with the dataset. Thanks to the highly accurate OptiTrack motion capture system, the camera pose accuracy reaches 0.39 mm in position and 0.15° in rotation. Equally excellent, the 3D scene ground truth reconstructed by a fringe projection 3D scanner reaches an accuracy of 0.2 mm. The properties of the ground truth instruments are listed in Table 20.
Highlights
Handheld platform, abundant sequences, accurate ground truth, 3D scene ground truth. INTERIORNET
The InteriorNet Dataset, presented in 2018 [168], is a photo-realistic, large-scale, highly varied synthetic dataset of interior scenes, serving a wide range of research purposes. The dataset was rendered within designed CAD models following realistic trajectories, at high resolution and high frame rates.
TABLE 21. The properties of the virtual sensors.
Sensor/Instrument | Type | Rate | Characteristics
RGB-D Camera | Virtual | 25 Hz | 640 × 480
IMU | Virtual | 800 Hz | 3D accelerometer, 3D gyroscope, ground truth/noisy
FIGURE 30. The collection scenes: rendering results (the left column), and real-world decorations (the right column) [168]. Platform and Sensor Setup
The dataset was produced by an end-to-end rendering pipeline, so no mobile platform or physical sensor was used. Instead, an interactive simulator named ViSim was proposed to create trajectories manually or randomly, simulating an RGB-D sensor capturing the sequences. Additionally, virtual fisheye camera, panorama camera, stereo camera, IMU, and event camera data are provided. The properties of the virtual sensor setup are listed in Table 21. Scenes and Sequences
The dataset was generated synthetically within various interior scenes and layouts collected from world-leading furniture manufacturers. A total of 15k sequences with 1k images each are provided, along with many challenging conditions including object movements, lighting changes, and motion blur. The rendered collection scenes are depicted in Fig. 30; as can be seen, the images are highly photorealistic. Ground Truth
The 3D scene ground truth is provided by the predesigned CAD models. For the camera pose ground truth, thanks to the simulator, the rendering trajectory can be accessed with absolute accuracy. Highlights
Realistic rendering, abundant sequences, complex conditions, diverse scenes, accurate ground truth, 3D scene ground truth. D. IMAGE BASED 3D RECONSTRUCTION
This branch collects 18 datasets that mainly focus on image based 3D reconstruction usage. To begin with, an overview of all the related datasets is provided in Table 22. Note that many datasets of this category were collected without control in large-scale environments (e.g., web photos from Flickr and Google Images), making it difficult to generate ground truth camera poses or scene geometry. To enable easier usage and better evaluation, the detailed descriptions below give priority to the state-of-the-art datasets that contain better ground truth data. MIDDLEBURY
The Middlebury Dataset, published in 2006 [179], is one of the most famous multi-view image datasets with known 3D shape ground truth [180], [181], [182]. Novel evaluation metrics were proposed to enable quantitative comparison of mapping accuracy and completeness, and they have been widely used by the community ever since. Platform and Sensor Setup
The dataset was collected using a CCD camera mounted on a Stanford spherical gantry robotic arm. The arm is flexible enough to shoot from any viewpoint on a one-meter sphere, with a rather high positioning accuracy of within 0.2 mm and 0.02°. The properties of the sensors are listed in Table 23.
TABLE 23. The properties of the sensors and ground truth instrument.
Sensor/Instrument | Type | Resolution | Characteristics
Camera | CCD | 640 × 480 | RGB
3D Scanner | Cyberware Model-15 | 0.25mm | 3D map GT, 0.2mm accuracy
Scenes and Sequences
The dataset carefully selected two target objects for study: Temple and Dino, shown in the left part of Fig. 31 together with the sampled camera viewpoints (right). TABLE 22. Overview and comparison of image based 3D reconstruction datasets.
Name | Year | Scene | Platform | Camera | C-IMU | Pose GT | 3D GT
Middlebury [179] | 2006 | Objects | Robot Arm | professional | N/A | Robot Arm | 3D Scanner
(W) Notre Dame [183] | 2006 | Outdoor | Google | random | N/A | N/A | N/A
EPFL [184] | 2008 | Outdoor | N/A | professional | N/A | 2D-3D Alignment | 3D Scanner
(W) Rome [187] | 2009 | Outdoor | Flickr | random | N/A | N/A | N/A
SAMANTHA [186] | 2009 | Outdoor | N/A | professional | N/A | Control Points (P) | 3D Scanner (P)
TUM multiview [185] | 2009 | Objects | N/A | professional | N/A | N/A | 3D Scanner
(W) Dubrovnik [188] | 2010 | Outdoor | Flickr | random | N/A | GPS | N/A
Quad [190] | 2011 | Outdoor | N/A | professional/consumer | N/A | GPS+DGPS | N/A
San Francisco [191] | 2011 | Outdoor | Vehicle | 6 × color (panor) | N/A | GPS+INS | N/A
Tsinghua University [189] | 2011 | Outdoor | N/A | professional | N/A | N/A | 3D Scanner
(W) Landmarks [192] | 2012 | Outdoor | Flickr | random | N/A | GPS+INS | N/A
(W) Large-Scale Collections [193] | 2012 | Outdoor | Flickr | random | N/A | SIFT matching | N/A
(W) SfM-Disambig [194] | 2013 | Outdoor | Web | random | N/A | N/A | N/A
DTU Robot [195] | 2014 | Objects | Robot Arm | professional | N/A | Robot Arm | 3D Scanner
1DSfM [196] | 2014 | Outdoor | Flickr | random | N/A | N/A | N/A
COLMAP [197] | 2016 | Indoor/Outdoor | N/A | professional | N/A | N/A | N/A
ETH3D MVS [198] | 2017 | Outdoor | Tripod/Handheld | 1 × Digital, 2 × NA-gray (stereo), 2 × WA-gray (stereo) | Y | 2D-3D Alignment | 3D Scanner
Tanks and Temples [199] | 2017 | Indoor/Outdoor | Stabilizer | 2 × color (different scenes) | N/A | 2D-3D Registration | 3D Scanner
Notes: C-IMU = consumer-grade IMU; (W) = web collection photos; P = partial coverage; panor = panoramic; NA = narrow angle; WA = wide angle.
FIGURE 31. The object models and viewpoints of the dataset: Temple and Dino (left), and different viewpoints (right) [179]. Ground Truth
The camera pose ground truth was provided by the high-precision robotic arm. Based on the accurate camera poses, about 200 individual scans acquired by a Cyberware Model-15 3D scanner were merged into a whole model for each object. A single scan reaches 0.25 mm resolution and 0.2 mm accuracy, and by merging many scans the final model reaches even greater precision and accuracy. The properties of the ground truth instruments are shown in Table 23. Highlights
3D model ground truth, challenging model, novel evaluation metrics. TANKS AND TEMPLES
The Tanks and Temples Benchmark, published in 2017 [199], is a large-scale 3D reconstruction dataset covering both outdoor and indoor scenes. The most significant contribution of this work is its massive, large-scale, high-quality ground truth 3D models, which were long missing from previous collections [186], [187], [197]. Platform and Sensor Setup
There are two capture setups: a DJI Zenmuse X5R stabilized by a DJI Osmo gimbal, and a Sony a7S II stabilized by a Pilotfly H2 gimbal. Both cameras captured ultra-high-resolution video using a rolling shutter (RS); it was verified that the RS does not degrade the reconstruction quality compared with a global shutter (GS). The properties of the sensors are listed in Table 24.
TABLE 24. The properties of the sensors and ground truth instrument.
Sensor/Instrument | Type | Rate | Characteristics
Camera | DJI Zenmuse X5R | 30 Hz | 4K video
Camera | Sony a7S II | 30 Hz | 4K video
3D Scanner | FARO Focus 3D X330 | 976 kHz | 3D map GT, 330m range, 2mm accuracy, HDR image
Scenes and Sequences
A total of 21 sequences are provided, categorized into three groups: Intermediate, Advanced, and Training. The Intermediate group contains 8 scenes: Family, Francis, Horse, Lighthouse, M60, Panther, Playground, and Train; the Advanced group contains 6 scenes: Auditorium, Ballroom, Courtroom, Museum, Palace, and Temple; and the Training group contains 7 scenes: Barn, Caterpillar, Church, Courthouse, Ignatius, Meeting room, and Truck. Four representative scene models are shown in Fig. 32. The Advanced group is more challenging due to scale, complexity, and other factors, as can be seen from the Auditorium model. FIGURE 32. Four representative scene models of the dataset: Panther and Playground from the Intermediate group (top), and Auditorium and Temple from the Advanced group (bottom) [201]. Ground Truth
The large-scale 3D scene ground truth was acquired by a FARO Focus 3D X330 industrial-grade laser scanner (its properties are shown in Table 24), with multiple individual scans registered together using overlapping areas. Furthermore, the ground truth camera poses were estimated using the ground truth point cloud [199]. Highlights
3D scene ground truth, challenging data, abundant sequences, diverse scenes. ETH3D STEREO
The ETH3D Stereo Benchmark, presented in 2017 [198], is a multi-view dataset with both high-resolution and low-resolution images covering diverse indoor and outdoor scenes. Similar to the Tanks and Temples Benchmark [199], a high-end laser scanner was used for capturing the ground truth 3D scene models.
TABLE 25. The properties of the sensors and ground truth instrument.
Sensor/Instrument | Type | Rate | Characteristics
Camera | 1 × high-resolution DSLR | — | —
Camera | 2 × narrow-angle gray (stereo) | — | —
Camera | 2 × wide-angle gray (stereo) | — | —
3D Scanner | FARO Focus 3D X330 | 976 kHz | 3D map GT, 330m range, 2mm accuracy, HDR image
Platform and Sensor Setup
There are two setups for capturing the data: 1 high-resolution DSLR camera mounted on a tripod, and 4 low-resolution cameras (two stereo pairs) mounted on a rig. The properties of the sensors are listed in Table 25.
FIGURE 33. Some representative scene images of the dataset: Courtyard, Office, and Playground from top to bottom respectively [202]. Scenes and Sequences
There are three groups of data provided: high-resolution multi-view group, low-resolution many-view group, and low-resolution two-view group. A diverse range of indoor and outdoor scenes are included in the dataset, such as courtyard, office, playground, etc. Several representative scene images are shown in Fig. 33. Ground Truth
The scene geometry ground truth was recorded by a FARO Focus 3D X330 laser scanner (its properties are shown in Table 25). One or more individual scans were performed to avoid occlusions and merged into a full 3D scene model. Furthermore, the camera images were aligned against the ground truth point cloud, providing ground truth image poses and depths. Highlights
3D scene ground truth, high and low resolution, abundant sequences, diverse scenes. DTU ROBOT
The DTU Robot Dataset, published in 2014 [195], is a large-scale multi-view image dataset containing massive and diverse scenes. Accurate camera poses and scene geometry ground truth are provided for quantitative comparison. Platform and Sensor Setup
The dataset was collected using a 6-axis high-precision industrial robotic arm carrying a structured light scanner. The scanner records the 3D geometry, with one of its cameras also capturing images aligned with the scanned model. Depending on the scene, the camera kept a distance of between 50 cm and 65 cm from the objects, and 18 LEDs were used to control the illumination. The properties of the sensors are listed in Table 26. Scenes and Views
The dataset contains 80 scene models of large variability, such as groceries, vegetables, and buildings. Each model was scanned from 49 or 64 positions, with the corresponding RGB images captured under varying light settings. Some representative images of the diverse scenes are shown in Fig. 34. TABLE 26. The properties of the sensors and ground truth instrument.
Sensor/Instrument | Type | Resolution | Characteristics
Camera | unknown | 1600 × 1200 | RGB
3D Scanner | binocular structured light | 0.25mm | 3D map GT, 0.14mm/0.6pix accuracy
FIGURE 34. Some representative images of the diverse scenes [195]. Ground Truth
The camera pose ground truth was provided by the 6-axis robotic arm. For the 3D ground truth, based on the accurate camera locations of the arm, many individual scans from the binocular structured light scanner (its properties are shown in Table 26) were merged into a whole 3D geometry model. Testing on a bowling ball of known geometry showed that the surface point accuracy reaches around 0.14 mm (standard deviation). Highlights
3D model ground truth, diverse scenes. BIGSFM
The BigSFM Project is a gathering of massive internet photo datasets that has stimulated the growth of many large-scale reconstruction methods in recent years. It offers a seminal way to obtain huge data resources to support 3D study of scenes. Due to their unorganized nature (the photos may be collected with arbitrary cameras, by different people, and under different conditions), it is normally impractical to provide ground truth 3D models for evaluation. As an alternative, the GPS tags attached to the photos (when available) can be used to verify the camera poses, indicating the reconstruction quality. Platform and Sensor Setup
These datasets were normally collected from the internet, e.g., Flickr. Most photos were captured with handheld smartphones or digital cameras, so there is no specific resolution or platform in use. Scenes and Views
The BigSFM Project (https://research.cs.cornell.edu/bigsfm/) contains a variety of community photo collections; some famous ones are Quad [190], Dubrovnik [188], Rome [187], Venice [203], Notre Dame [183], San Francisco [191], etc. For each scene, there can be hundreds to tens of thousands of images shot from massive numbers of viewpoints. Some representative images and reconstructed models are shown in Fig. 35. FIGURE 35. Some representative images and reconstructed models of the scenes: Colosseum (Rome), Dubrovnik, and San Marco Square (Venice) from top to bottom [187]. Ground Truth
Due to the collection nature of community photos, there is normally no ground truth 3D model accompanying these datasets. Sometimes the cameras have built-in GPS receivers, so positioning data are available, although not very accurate. Highlights
Large-scale scenes, massive photos. EPFL MULTI-VIEW
The EPFL Multi-View Dataset, released in 2008 [184], was a high-resolution outdoor multi-view image collection with 3D scene ground truth from LiDAR acquisition. Platform and Sensor Setup
The dataset was collected using a Canon D60 camera with a fixed focal length. The properties of the sensors are listed in Table 27.
TABLE 27. The properties of the sensor and ground truth instrument.
Sensor/Instrument | Type | Resolution | Characteristics
Camera | Canon D60 | 3072 × 2048 | RGB
3D Scanner | Z+F Imager 5003 | 3.1mm@10m | 3D map GT, mm-level accuracy, HDR image
Scenes and Views
The dataset selected 4 outdoor scenes as target environments: Fountain, Entry, Herz-Jesu, and Ettlingen-castle. Depending on the research purpose, the data are organized into three groups: full camera calibration and dense multi-view stereo; pose estimation and dense multi-view stereo; and dense multi-view stereo. Some representative images from the Fountain scene are shown in Fig. 36. FIGURE 36. Some representative images from the Fountain scene [184]. Ground Truth
The 3D geometry ground truth was acquired by a Zoller+Fröhlich laser scanner (its properties are shown in Table 27). As indicated by the test results, the scanned models reach an accuracy of around 2 mm. Highlights
3D model ground truth, high-resolution imagery. V. EVALUATION CRITERIA
Although reliable ground truth data are the decisive premise, criteria and metrics are also indispensable for benchmarks. Evaluation criteria play an important role in correctly quantifying and fairly comparing the performance of algorithms, identifying potential problems, and looking for breakthroughs, thus promoting technological progress. In this section, we focus on two types of indicators, positioning and mapping, which correspond to the main functionalities of SLAM related problems. Some other non-functional concerns, such as runtime, CPU usage, RAM usage, and GPU requirements, have been investigated by some efforts [204], [205], [206], but will not be covered here due to their immaturity and limited adoption. A. POSITIONING EVALUATION
Due to the limitation of large-scale surveying techniques in earlier years, it was rather difficult to obtain ground truth 3D maps. Therefore, SLAM related algorithms have long been evaluated by examining their positioning performance. The most widely used indicators are the two proposed with the TUM RGB-D Benchmark: the relative pose error (RPE) and the absolute trajectory error (ATE) [30].
The relative pose error (RPE) investigates the local accuracy of a motion estimation system across a fixed time interval $\Delta$. If the estimated trajectory poses are defined as $\mathbf{P}_1, \dots, \mathbf{P}_n \in SE(3)$ and the ground truth trajectory poses are defined as $\mathbf{Q}_1, \dots, \mathbf{Q}_n \in SE(3)$, then the RPE at time step $i$ is defined as
$$\mathbf{E}_i := \left(\mathbf{Q}_i^{-1}\mathbf{Q}_{i+\Delta}\right)^{-1}\left(\mathbf{P}_i^{-1}\mathbf{P}_{i+\Delta}\right).$$
For a sequence of $n$ camera poses there are $m = n - \Delta$ individual RPEs. To measure the overall performance on the whole trajectory, the TUM RGB-D Benchmark proposed computing the root mean square error (RMSE) of the translational component over all time steps:
$$\mathrm{RMSE}(\mathbf{E}_{1:n}, \Delta) = \left(\frac{1}{m}\sum_{i=1}^{m}\left\lVert \operatorname{trans}(\mathbf{E}_i)\right\rVert^2\right)^{1/2},$$
where $\operatorname{trans}(\mathbf{E}_i)$ refers to the translational component of the RPE. Note that, due to the amplification effect of the Euclidean norm, the RMSE is sensitive to outliers; if desired, the mean or median error is a reasonable alternative. Although only the translational errors are measured here, the rotational errors are also reflected in them, since wrong rotations propagate into wrong translations. Additionally, the selection of the time interval $\Delta$ varies with the system and the conditions. For instance, $\Delta$ can be set to 1 when evaluating frame-by-frame visual odometry, yielding the per-frame drift; for systems that infer from several previous frames, perform local optimization, or focus on large-scale navigation, it is neither necessary nor reasonable to rely on the estimate of each individual frame. However, it is also inappropriate to set $\Delta = n$ directly: although comparing the start and end points is an intuitive idea, it penalizes early rotational errors much more than those occurring near the end. Instead, it is recommended to average the RMSEs over a wide range of time intervals; to avoid excessive computation, an approximation that relies on a fixed number of relative pose samples is an acceptable compromise.
The absolute trajectory error (ATE) measures the global accuracy of an odometry/SLAM system by comparing the estimated trajectory against the ground truth trajectory to obtain absolute distances. As the two trajectories may lie in different coordinate frames, a rigid-body alignment $\mathbf{S}$ that maps the estimated poses $\mathbf{P}_i$ onto the ground truth poses $\mathbf{Q}_i$ is a prerequisite. The ATE at time step $i$ can then be computed as
$$\mathbf{F}_i := \mathbf{Q}_i^{-1}\,\mathbf{S}\,\mathbf{P}_i.$$
Similar to the RPE, the RMSE of the translational component over all time indices is proposed as one possible metric for the ATE:
$$\mathrm{RMSE}(\mathbf{F}_{1:n}) = \left(\frac{1}{n}\sum_{i=1}^{n}\left\lVert \operatorname{trans}(\mathbf{F}_i)\right\rVert^2\right)^{1/2}.$$
It should be noted that the ATE considers only the translational errors, although rotational errors also result in wrong translations. Importantly, compared with the RPE, the ATE has an intuitive visualization for conveniently inspecting where a system actually fails. An example visualization of the ATE and a comparison between ATE and RPE are shown in Fig. 37. As can be seen from the comparison, the RPE can be slightly larger than the ATE due to the larger impact of rotational errors.
FIGURE 37. An example visualization of the absolute trajectory error on the "fr1/desk2" sequence of the TUM RGB-D Dataset (left), and a comparison between the relative pose error and the absolute trajectory error (right). Both plots were generated by the RGB-D SLAM system [30].
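Both metrics are straightforward to implement. The sketch below (ours, not the benchmark's official tool) computes the translational ATE RMSE, using a least-squares rigid alignment in place of $\mathbf{S}$, and the translational RPE RMSE from 4 × 4 pose matrices:

```python
import numpy as np

def umeyama_alignment(P, Q):
    """Least-squares rigid transform (R, t) mapping positions P (N,3) onto Q (N,3)."""
    mu_P, mu_Q = P.mean(0), Q.mean(0)
    U, _, Vt = np.linalg.svd((Q - mu_Q).T @ (P - mu_P))
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard against reflections
    R = U @ D @ Vt
    return R, mu_Q - R @ mu_P

def ate_rmse(P, Q):
    """ATE: RMSE of translational residuals after aligning estimate P to ground truth Q."""
    R, t = umeyama_alignment(P, Q)
    F = Q - (P @ R.T + t)
    return np.sqrt((F ** 2).sum(axis=1).mean())

def rpe_rmse(T_est, T_gt, delta=1):
    """RPE: translational RMSE over relative motions at a fixed interval delta.

    T_est, T_gt: lists of 4x4 homogeneous pose matrices of equal length.
    """
    errs = []
    for i in range(len(T_est) - delta):
        dQ = np.linalg.inv(T_gt[i]) @ T_gt[i + delta]
        dP = np.linalg.inv(T_est[i]) @ T_est[i + delta]
        E = np.linalg.inv(dQ) @ dP
        errs.append(E[:3, 3])
    return np.sqrt((np.asarray(errs) ** 2).sum(axis=1).mean())
```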
Many mainstream datasets share this same set of evaluation criteria [45], [70], [120], and extended versions based on the same mechanism have also emerged. For instance, the most famous ranking platform in the community, the KITTI Odometry Benchmark, builds on the RPE metric but treats rotational and translational errors separately to enable deeper insights. However, not all datasets are capable of providing a complete and accurate ground truth trajectory [45], [67]. Under this circumstance, an intuitive approach is to design the collection with closed-loop trajectories and measure the start-to-end drift. For example, the TUM monoVO Dataset [67] leveraged the LSD-SLAM algorithm to generate highly accurate camera poses for the start and end segments to serve as ground truth. An explicit alignment between these two segments then brings all camera poses into a single coordinate frame, so the loop drift can be measured and the tracking performance becomes clear at a glance. B. MAPPING EVALUATION
In existing SLAM algorithm publications, it is quite unusual to come across evaluations of mapping performance. Yet evaluation criteria for 3D reconstruction have existed for a long time, mainly originating from the Middlebury Dataset [179]. Denoting the ground truth geometry as G and the reconstruction result as R, there are generally two indicators to consider: accuracy and completeness. The accuracy measures how close R is to G by computing the distances between corresponding points, which are usually determined by finding the nearest match. If the reconstructions or ground truth models are in other formats, such as triangle meshes, the vertices can be used for comparison. One issue arises where G is incomplete, causing the nearest reference points to fall on a boundary or a distant part. In this case, a hole-filled version G' is used for computing the distances, and points matched to the repaired regions are discounted. The completeness investigates how much of G is modeled by R. Opposite to the accuracy indicator's comparison of R against G, the completeness measures the distances from G to R. Intuitively, if the matching distance runs beyond a threshold, we can deem that there is no corresponding point on R, and the point on G can be regarded as "not covered". The whole methodology is illustrated in Fig. 38. FIGURE 38. Evaluation of the reconstruction result relative to the ground truth model. (a) The representation of the two models, each with an incomplete surface. (b) To measure accuracy, the nearest matches are found, and matches falling in the hole-filled area are not included. (c) To measure completeness, the distances between point matches are computed, and those beyond the threshold are regarded as "not covered" [179].
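A simplified version of the two indicators can be computed with nearest-neighbor queries; the sketch below uses SciPy k-d trees and, for brevity, omits the hole-filling treatment of incomplete ground truth (the threshold and percentile values are illustrative only):

```python
import numpy as np
from scipy.spatial import cKDTree

def accuracy_and_completeness(R_pts, G_pts, threshold=0.05):
    """Middlebury-style mapping metrics for two point sets (N, 3) in meters.

    Accuracy: distance from each reconstructed point to its nearest ground
    truth point, summarized here by the 90th percentile.
    Completeness: fraction of ground truth points whose nearest reconstructed
    point lies within the threshold ("covered" points).
    """
    d_R_to_G = cKDTree(G_pts).query(R_pts)[0]
    d_G_to_R = cKDTree(R_pts).query(G_pts)[0]
    return np.percentile(d_R_to_G, 90), (d_G_to_R < threshold).mean()
```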
For convenience of evaluation, the ICL-NUIM Dataset [159] proposed exploiting the open-source software CloudCompare to compute the distances between the reconstruction and the ground truth. Five standard statistics can be obtained: mean, median, standard deviation, minimum, and maximum. An example of distance computation is shown in Fig. 39. FIGURE 39. An example of the cloud/mesh computation module investigating reconstruction accuracy in the CloudCompare software [207]; different colors represent different distance values.
VI. CONCLUSION AND PROSPECT
This paper is the first comprehensive survey in the field to investigate SLAM related datasets and evaluation. It covers a wide range of topics, including a novel taxonomy of SLAM related tasks, a full-set methodology of dataset collection, a comprehensive survey of the existing datasets, and a guideline on evaluation principles and indicators. Looking back on the development of SLAM related datasets in past years, we are delighted to see quantity increasing and quality improving (Fig. 40 depicts the uptrend of annual new releases). However, we must be aware that the existing efforts on datasets and evaluation are still insufficient in many aspects, which delays the revolutionary breakthrough of related algorithms. Our analysis and prospects are given below.
FIGURE 40. The uptrend of annual SLAM related dataset releases in the past two decades.
A. DATASETS
Platform and Sensor. In Section IV, we comprehensively surveyed the existing SLAM related datasets, discovering that most of them employed a single hardware platform for collection [30], [42], [44], [67]. However, in real life we have to deal with various motion behaviors (e.g., fast motion, sharp turns, and violent shaking), which are reflected by different platforms. Hence it is highly recommended to deploy various platforms in collection, especially within each single scenario and route, in the hope of enhancing the practicality and generalization of algorithms. For the perception system, we are pleased to observe more and more datasets [46], [111], [66], [168] using redundant and complementary sensors (e.g., camera, LiDAR, and IMU). Making full use of fused sensors can improve system accuracy and reliability, and may unearth new technological patterns. Evidently, this is especially feasible on larger platforms such as ground robots and full-size vehicles.
Scene Type and Scale. The collection scenario is the key point researchers are most concerned about, as practical implementations rely significantly on the specific validation environment. The existing efforts already cover a range of scenes, including urban [69], [66], underground [43], [66], indoor [30], [164], underwater [117], [208], and so on. But we still expect more diverse and more specific scene types, such as shopping malls, ruins, industrial areas, communities, and office buildings. In another aspect, for large-scale research, most off-the-shelf datasets were collected outdoors for autonomous driving purposes. Hence there is still a deficiency of large-scale datasets in indoor and indoor-outdoor integrated scenarios, which are crucial for seamless navigation across adjacent areas.
Collection Condition. With the rapid development in recent years, the state of the art can already achieve satisfying performance in undemanding conditions, as can be seen on the KITTI Benchmark. However, such less challenging datasets potentially hide the defects of the algorithms and constrain their evolution. For example, visual algorithms are susceptible to bad illumination, weak texture, motion blur, and dynamic objects (Fig. 41 shows some failure cases of the state-of-the-art ORB-SLAM-Stereo visual SLAM algorithm),
FIGURE 41. Some failure cases of the state-of-the-art ORB-SLAM-Stereo visual SLAM algorithm in challenging conditions like weak texture, bad illumination, and complex environments [214].
and LiDAR-based algorithms suffer from weak structure and rapid motion. Also, ordeals like different time slots, weather changes, and seasonal changes can easily shake the performance. To push the current limits and advance robustness, the above situations should be incorporated into future datasets.
Ground Truth. As described in Section II, there are arresting restrictions on the mainstream positioning approaches, including GNSS-denied areas outdoors and large complex regions indoors, which can cause inaccurate, incomplete, or infeasible measurements. Accordingly, novel techniques with high reliability and accuracy are urgently demanded, especially those versatile in both indoor and outdoor scenes. Besides, multimodal ground truth labels are required, including but not limited to 3D maps, depth images, and semantic annotations [209].
Synthesis and Simulation. Real-world datasets have long been struggling with complicated collection procedures, limited scenario types and scales, and inaccurate ground truth. In this context, synthetic datasets are opening a new route to handle these problems and have recently attracted significant interest [43], [110], [159]. In a simulation manner, a synthetic engine can render boundless sequences under customization, and especially with perfectly accurate ground truth.
B. EVALUATION
Localization. With the wide implementation and the test of time in past years [210], [211], [212], the evaluation criteria for localization are approaching maturity. More diversified and refined indicators are expected in the future, e.g., the accuracy and speed of initialization, the accuracy and success rate of relocalization, the detection rate of loop closure [70], and the localization accuracy under challenging conditions [213]. These indicators will definitely be of great significance for inspecting an algorithm all-roundly.
Mapping. As reflected in the literature and in experience, mapping performance is rarely measured for SLAM algorithms, although it has long been an indispensable and mature part of SfM/MVS evaluations [179], [198], [199]. Moreover, the existing principles only investigate the 3D geometry [159], [179], [199], overlooking texture correctness. Hence a new principle that combines color, texture, 3D geometry, and possibly semantic labels is desired by the community.
Hardware Consumption. At present, the evaluation criteria mainly focus on the functional level, e.g., RPE, ATE, mapping accuracy, and mapping completeness. Lately, with the increasing transition from technology to application, hardware consumption, as a sharply decisive factor, should also be counted: indicators such as CPU and RAM usage, GPU requirements, and time consumption [204] determine whether an algorithm can be embedded in a lightweight manner, whether the system can work in real time, and whether the total cost is reasonable.
ACKNOWLEDGMENT
To achieve a better illustration, this survey has reused many photos from the internet, dataset holders, and journals. The authors would like to acknowledge them all for the public licenses and copyright permissions. The authors would also like to acknowledge Dr. Yassine Selami for the refinement on English writing. Yuanzhi Liu would like to thank Prof. Bart Goossens and Prof. Wilfried Philips for the guidance and suggestions on the research when he served as a guest researcher in Ghent University, Belgium.
REFERENCES [1]
C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,"
IEEE Trans. Robot., vol. 32, no. 6, pp. 1309-1332, 2016, doi: 10.1109/TRO.2016.2624754. [2]
H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: Part I,"
IEEE Robot. Automat. Mag., vol. 13, no. 2, pp. 99-110, 2006, doi: 10.1109/MRA.2006.1638022. [3]
T. Bailey and H. Durrant-Whyte, "Simultaneous localization and mapping (SLAM): Part II,"
IEEE Robot. Automat. Mag., vol. 13, no. 3, pp. 108-117, 2006, doi: 10.1109/MRA.2006.1678144. [4]
C. Urmson, J. Anhalt, D. Bagnell, C. Baker, R. Bittner, M. N. Clark, J. Dolan, D. Duggins, T. Galatali, C. Geyer, M. Gittleman, S. Harbaugh, M. Hebert, T. M. Howard, S. Kolski, A. Kelly, M. Likhachev, M. McNaughton, N. Miller, K. Peterson, B. Pilnick, R. Rajkumar, P. Rybski, B. Salesky, Y.-W. Seo, S. Singh, J. Snider, A. Stentz, W. R. Whittaker, Z. Wolkowicki, J. Ziglar, H. Bae, T. Brown, D. Demitrish, B. Litkouhi, J. Nickolaou, V. Sadekar, W. Zhang, J. Struble, M. Taylor, M. Darms, and D. Ferguson, "Autonomous driving in urban environments: Boss and the Urban Challenge,"
J. Field Robot., vol. 25, no. 8, pp. 425-466, 2008, doi: 10.1002/rob.20255. [5]
Singandhupe and H. M. La, "A Review of SLAM Techniques and Security in Autonomous Driving," in , Naples, Italy, 2019, pp. 602-607, doi: 10.1109/irc.2019.00122. [6]
H. Lategahn, A. Geiger, and B. Kitt, "Visual SLAM for autonomous ground vehicles," in , Shanghai, China, 2011, pp. 1732-1737, doi: 10.1109/icra.2011. 5979711. [7]
Stasse, A. J. Davison, R. Sellaouti, and K. Yokoi, "Real-time 3D SLAM for humanoid robot considering pattern generator information," in , Beijing, China, 2006, pp. 348-355, doi: 10.1109/IROS. 2006.281645. [8]
M. Milford and G. Wyeth, "Persistent Navigation and Mapping using a Biologically Inspired SLAM System,"
Int. J. Robot. Res., vol. 29, no. 9, pp. 1131-1153, 2010, doi: 10.1177/027836490934 0592. [9]
Oliver, S. Kang, B. C. Wünsche, and B. Macdonald, "Using the Kinect as a navigation sensor for mobile robotics," in , Dunedin, New Zealand, 2012, pp. 509-514, doi: 10.1145/2425836.2425932. [10]
M. BlöSch, S. Weiss, D. Scaramuzza, and R. Siegwart, "Vision based MAV navigation in unknown and unstructured environments," in , Anchorage, AK, USA, 2010, pp. 21-28, doi: 10.1109/robot.2010. 5509920.
Y. Liu et al : Datasets and Evaluation for Simultaneous Localization and Mapping related problems: A Comprehensive Survey [11] J. Huai, G. J óźkó w, C. Toth, and D. A. Grejner-Brzezinska, "Collaborative monocular SLAM with crowdsourced data,"
Navigation, vol. 65, no. 4, pp. 501-515, 2018, doi: 10.1002/navi.266. [12]
Cvišić, J. Ćesić, I. Marković, and I. Petrović, "SOFT -SLAM: Computationally efficient stereo visual simultaneous localization and mapping for autonomous unmanned aerial vehicles,"
J. Field Robot., vol. 35, no. 4, pp. 578-595, 2018, doi: 10.1002/rob.21762. [13]
D. Chekhlov, A. P. Gee, A. Calway, and W. Mayol-Cuevas, "Ninja on a Plane: automatic discovery of physical planes for augmented reality using visual SLAM," in , Nara, Japan, 2007, pp. 153-156, doi: 10.1109/ismar.2007.4538840. [14]
H. Liu, G. Zhang, and H. Bao, "Robust Keyframe-based Monocular SLAM for Augmented Reality," in , Merida, Mexico, 2016, pp. 1-10, doi: 10.1109/ismar.2016.24. [15]
R. F. Salas-Moreno, B. Glocken, P. H. J. Kelly, and A. J. Davison, "Dense planar SLAM," in , Munich, Germany, 2014, pp. 157-164, doi: 10.1109/ismar.2014.6948422. [16]
S. Li, G. Li, L. Wang, and Y. Qin, "SLAM integrated mobile mapping system in complex urban environments,"
ISPRS-J. Photogramm. Remote Sens., vol. 166, pp. 316-332, 2020, doi: 10.1016/j.isprsjprs.2020.05.012. [17]
S. Karam, G. Vosselman, M. Peter, S. Hosseinyalamdary, and V. Lehtola, "Design, Calibration, and Evaluation of a Backpack Indoor Mobile Mapping System,"
Remote Sens., vol. 11, no. 8, p. 905, 2019, doi: 10.3390/rs11080905. [18]
H. A. Lauterbach, D. Borrmann, R. Heß, D. Eck, K. Schilling, and A. Nüchter, "Evaluation of a backpack-mounted 3D mobile scanning system,"
Remote Sens., vol. 7, no. 10, pp. 13753-13781, 2015, doi: 10.3390/rs71013753. [19]
D. L. Lu. "Autonomous Waymo Chrysler Pacifica Hybrid minivan undergoing testing in Los Altos, California." Wikimedia Commons, the free media repository. https://commons.wikimedia.org/wiki/ File:Waymo_Chrysler_Pacifica_in_Los_Altos,_2017.jpg (accessed May. 31st, 2020). [22]
[22] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. IEEE Int. Symp. Mixed Augmented Reality (ISMAR), Nara, Japan, 2007, pp. 1-10, doi: 10.1109/ISMAR.2007.4538852.
[23] D. L. Lu. "A point cloud generated from a moving car using a single Ouster OS1 lidar." Wikimedia Commons, the free media repository. https://commons.wikimedia.org/wiki/File:Ouster_OS1-64_lidar_point_cloud_of_intersection_of_Folsom_and_Dore_St,_San_Francisco.png (accessed Oct. 10, 2020).
[24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Miami, FL, USA, 2009, pp. 248-255, doi: 10.1109/cvpr.2009.5206848.
[25] D. Gershgorn. "The data that transformed AI research—and possibly the world." Quartz. https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/ (accessed Apr. 16, 2020).
[26] Patel. "Chapter-6 How to learn feature engineering?" Medium. https://medium.com/ml-research-lab/chapter-6-how-to-learn-feature-engineering-49f4246f0d41 (accessed May 14, 2020).
[27] B. Schwarz, "Mapping the world in 3D," Nat. Photon., vol. 4, no. 7, pp. 429-430, 2010, doi: 10.1038/nphoton.2010.148.
[28] J.-L. Blanco, F.-A. Moreno, and J. Gonzalez, "A collection of outdoor robotic datasets with centimeter-accuracy ground truth," Auton. Robot., vol. 27, no. 4, pp. 327-351, 2009, doi: 10.1007/s10514-009-9138-7.
[29] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2012, pp. 3354-3361, doi: 10.1109/CVPR.2012.6248074.
[30] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Vilamoura, Portugal, 2012, pp. 573-580, doi: 10.1109/IROS.2012.6385773.
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, 2015, doi: 10.1007/s11263-015-0816-y.
[32] IROS2019. "Lifelong Robotic Vision Competition." IROS2019 Competition. https://lifelong-robotic-vision.github.io/ (accessed Aug. 8, 2020).
[33] ICRA2020. "FPV Drone Racing VIO Competition." The UZH FPV Dataset. https://fpv.ifi.uzh.ch/?page_id=151 (accessed Aug. 8, 2020).
[34] CVPR2020. "Visual SLAM Challenge." CVPR 2020 Visual Localization and SLAM Challenges. https://sites.google.com/view/vislocslamcvpr2020/slam-challenge (accessed Aug. 8, 2020).
[35] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: a survey," Artif. Intell. Rev., vol. 43, no. 1, pp. 55-81, 2015, doi: 10.1007/s10462-012-9365-8.
[36] G. Huang, "Visual-Inertial Navigation: A Concise Review," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Montreal, QC, Canada, 2019, pp. 9572-9582, doi: 10.1109/icra.2019.8793604.
[37] Y. Cui, R. Chen, W. Chu, L. Chen, D. Tian, and D. Cao, "Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review," 2020, arXiv:2004.05224. [Online]. Available: https://arxiv.org/abs/2004.05224
[38] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A Survey of Autonomous Driving: Common Practices and Emerging Technologies," IEEE Access, vol. 8, pp. 58443-58469, 2020, doi: 10.1109/access.2020.2983149.
[39] G. Bresson, Z. Alsayed, L. Yu, and S. Glaser, "Simultaneous Localization and Mapping: A Survey of Current Trends in Autonomous Driving," IEEE Trans. Intell. Veh., vol. 2, no. 3, pp. 194-220, 2017, doi: 10.1109/tiv.2017.2749181.
[40] M. Firman, "RGBD datasets: Past, present and future," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Las Vegas, NV, USA, 2016, pp. 19-31, doi: 10.1109/CVPRW.2016.88.
[41] Z. Cai, J. Han, L. Liu, and L. Shao, "RGB-D datasets using microsoft kinect or similar sensors: a survey," Multimed. Tools Appl., vol. 76, no. 3, pp. 4313-4355, 2017, doi: 10.1007/s11042-016-3374-6.
[42] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The EuRoC micro aerial vehicle datasets," Int. J. Robot. Res., vol. 35, no. 10, pp. 1157-1163, 2016, doi: 10.1177/0278364915620033.
[43] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A Dataset to Push the Limits of Visual SLAM," 2020, arXiv:2003.14338. [Online]. Available: https://arxiv.org/abs/2003.14338
[44] L. Jinyu, Y. Bangbang, C. Danpeng, W. Nan, Z. Guofeng, and B. Hujun, "Survey and evaluation of monocular visual-inertial SLAM algorithms for augmented reality," Virtual Real. Intell. Hardw., vol. 1, no. 4, pp. 386-410, 2019, doi: 10.1016/j.vrih.2019.07.002.
[45] D. Schubert, T. Goll, N. Demmel, V. Usenko, J. Stuckler, and D. Cremers, "The TUM VI Benchmark for Evaluating Visual-Inertial Odometry," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Madrid, Spain, 2018, pp. 1680-1687, doi: 10.1109/IROS.2018.8593419.
[46] N. Carlevaris-Bianco, A. K. Ushani, and R. M. Eustice, "University of Michigan North Campus long-term vision and lidar dataset," Int. J. Robot. Res., vol. 35, no. 9, pp. 1023-1035, 2016, doi: 10.1177/0278364915614638.
[47] A. L. Majdik, C. Till, and D. Scaramuzza, "The Zurich urban micro aerial vehicle dataset," Int. J. Robot. Res., vol. 36, no. 3, pp. 269-273, 2017, doi: 10.1177/0278364917702237.
[48] ChristinaC. "Augmented Reality for mobile commerce platforms." Wikimedia Commons, the free media repository. https://commons.wikimedia.org/wiki/File:Augmented_Reality_for_eCommerce.jpg (accessed Oct. 15, 2020).
[49] S. Jurvetson. "Hands-free Driving: the passengers are being chauffeured by computer." Wikimedia Commons, the free media repository. https://commons.wikimedia.org/wiki/File:Hands-free_Driving.jpg (accessed Aug. 8, 2020).
[50] PattyK33. "Waypoint Robotics Vector 3D HD omnidirectional autonomous mobile robot and EnZone wireless charging station." Wikimedia Commons, the free media repository. https://commons.wikimedia.org/wiki/File:Vector3DHD_autonomous_mobile_robot_and_EnZone_wireless_charging.jpg (accessed Aug. 21, 2020).
[51] Pena. "A drone flies over a training area while capturing aerial intelligence." Wikimedia Commons, the free media repository. https://commons.wikimedia.org/wiki/File:Autumn_Drone_(cropped).jpg (accessed Aug. 17, 2020).
[52] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128×128 120 dB 15 μs Latency Asynchronous Temporal Contrast Vision Sensor," IEEE J. Solid-State Circuit, vol. 43, no. 2, pp. 566-576, 2008, doi: 10.1109/jssc.2007.914337.
[53] Z. Zhang, "A flexible new technique for camera calibration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 11, pp. 1330-1334, 2000, doi: 10.1109/34.888718.
[54] P. Furgale, J. Rehder, and R. Siegwart, "Unified temporal and spatial calibration for multi-sensor systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Tokyo, Japan, 2013, pp. 1280-1286, doi: 10.1109/iros.2013.6696514.
[55] P. Furgale, T. D. Barfoot, and G. Sibley, "Continuous-time batch estimation using temporal basis functions," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Saint Paul, MN, USA, 2012, pp. 2088-2095, doi: 10.1109/icra.2012.6225005.
[56] J. Maye, P. Furgale, and R. Siegwart, "Self-supervised calibration for robotic systems," in Proc. IEEE Intell. Veh. Symp. (IV), Gold Coast, QLD, Australia, 2013, pp. 473-480, doi: 10.1109/ivs.2013.6629513.
[57] E. Olson, "AprilTag: A robust and flexible visual fiducial system," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Shanghai, China, 2011, pp. 3400-3407, doi: 10.1109/icra.2011.5979561.
[58] A. Geiger, F. Moosmann, O. Car, and B. Schuster, "Automatic camera and range sensor calibration using a single shot," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Saint Paul, MN, USA, 2012, pp. 3936-3943, doi: 10.1109/icra.2012.6224570.
[59] S. Kato, E. Takeuchi, Y. Ishiguro, Y. Ninomiya, K. Takeda, and T. Hamada, "An Open Approach to Autonomous Vehicles," IEEE Micro, vol. 35, no. 6, pp. 60-68, 2015, doi: 10.1109/mm.2015.133.
[60] J. Eidson, "IEEE-1588 standard for a precision clock synchronization protocol for networked measurement and control systems—A tutorial," 2005.
[61] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231-1237, 2013, doi: 10.1177/0278364913491297.
[62] V. Otsason, A. Varshavsky, A. Lamarca, and E. De Lara, "Accurate GSM Indoor Localization," in UbiComp 2005: Ubiquitous Computing. Heidelberg, Berlin: Springer, 2005, pp. 141-158, doi: 10.1007/11551201_9.
[63] A. Alarifi, A. Al-Salman, M. Alsaleh, A. Alnafessah, S. Al-Hadhrami, M. Al-Ammar, and H. Al-Khalifa, "Ultra Wideband Indoor Positioning Technologies: Analysis and Recent Advances," Sensors, vol. 16, no. 5, p. 707, 2016, doi: 10.3390/s16050707.
[64] C. Yang and H.-R. Shao, "WiFi-based indoor positioning," IEEE Commun. Mag., vol. 53, no. 3, pp. 150-157, 2015, doi: 10.1109/mcom.2015.7060497.
[65] J. Hallberg, M. Nilsson, and K. Synnes, "Positioning with Bluetooth," in Proc. Int. Conf. Telecommun. (ICT), Papeete, Tahiti, French Polynesia, 2003, pp. 954-958, doi: 10.1109/ictel.2003.1191568.
[66] J. Jeong, Y. Cho, Y.-S. Shin, H. Roh, and A. Kim, "Complex urban dataset with multi-level sensors from highly diverse urban environments," Int. J. Robot. Res., vol. 38, no. 6, pp. 642-657, 2019, doi: 10.1177/0278364919843996.
[67] J. Engel, V. Usenko, and D. Cremers, "A photometrically calibrated benchmark for monocular visual odometry," 2016, arXiv:1607.02555. [Online]. Available: https://arxiv.org/abs/1607.02555
[68] OptiTrack. "OptiTrack Studio." Wikimedia Commons, the free media repository. https://commons.m.wikimedia.org/wiki/File:Optitrack_Studio.jpg (accessed Aug. 20, 2020).
[69] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "1 year, 1000 km: The Oxford RobotCar dataset," Int. J. Robot. Res., vol. 36, no. 1, pp. 3-15, 2017, doi: 10.1177/0278364916679498.
[70] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, and L. Song, "Are we ready for service robots? The OpenLORIS-scene datasets for lifelong SLAM," 2019, arXiv:1911.05603. [Online]. Available: https://arxiv.org/abs/1911.05603
[71] IRAP-Lab. "Complex Urban Data Set." Intelligent Robotic Autonomy and Perception (IRAP) Lab. https://irap.kaist.ac.kr/dataset/system.html (accessed Oct. 10, 2020).
[72] J. Zhang and S. Singh, "LOAM: Lidar Odometry and Mapping in Real-time," in Proc. Robot.: Sci. Syst. (RSS), Berkeley, CA, USA, 2014, doi: 10.15607/RSS.2014.X.007.
[73] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1052-1067, 2007, doi: 10.1109/TPAMI.2007.1049.
[75] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in Proc. Eur. Conf. Comput. Vis. (ECCV), Zürich, Switzerland, 2014, pp. 834-849, doi: 10.1007/978-3-319-10605-2_54.
[76] R. Mur-Artal and J. D. Tardos, "ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255-1262, 2017, doi: 10.1109/tro.2017.2705103.
[77] T. Shan and B. Englot, "LeGO-LOAM: Lightweight and Ground-Optimized Lidar Odometry and Mapping on Variable Terrain," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Madrid, Spain, 2018, pp. 4758-4765, doi: 10.1109/iros.2018.8594299.
[78] J.-E. Deschaud, "IMLS-SLAM: scan-to-model matching based on 3D data," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Brisbane, QLD, Australia, 2018, pp. 2480-2485, doi: 10.1109/ICRA.2018.8460653.
[79] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM," 2020, arXiv:2007.11898. [Online]. Available: https://arxiv.org/abs/2007.11898
[80] J. Zhang and S. Singh, "Visual-lidar odometry and mapping: Low-drift, robust, and fast," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Seattle, WA, USA, 2015, pp. 2174-2181.
Z. Zhang, "Microsoft Kinect Sensor and Its Effect,"
IEEE Multimedia, vol. 19, no. 2, pp. 4-10, 2012, doi: 10.1109/mmul.2012. 24. [82]
R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in , Basel, Switzerland, 2011, pp. 127-136, doi: 10.1109/ismar.2011.6092378. [83]
T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison, "ElasticFusion: Dense SLAM without a pose graph," in , Rome, Italy, 2015, doi: 10.15607/RSS.2015.XI.001. [84]
D. Nistér, O. Naroditsky, and J. Bergen, "Visual odometry," in , Washington, DC, USA, 2004, pp. I-I, doi: 10.1109/CVPR.2004. 1315094. [85]
D. Fernandez and A. Price, "Visual odometry for an outdoor mobile robot," in
IEEE Conf. Robot. Automat. Mechatron. , Singapore, Singapore, 2004, pp. 816-821, doi: 10.1109/ramech.2004.1438023.
[86] D. Scaramuzza and F. Fraundorfer, "Visual Odometry [Tutorial]," IEEE Robot. Automat. Mag., vol. 18, no. 4, pp. 80-92, 2011, doi: 10.1109/mra.2011.943233.
[87] L. Murphy, T. Morris, U. Fabrizi, M. Warren, M. Milford, B. Upcroft, M. Bosse, and P. Corke, "Experimental comparison of odometry approaches," in Exp. Robot., 2013, pp. 877-890, doi: 10.1007/978-3-319-00065-7_58.
[88] Wikipedia. "Mobile Mapping." Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Mobile_mapping (accessed Aug. 8, 2020).
[89] Wikipedia. "Simultaneous localization and mapping." Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Simultaneous_localization_and_mapping (accessed Nov. 16, 2019).
[90] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91-110, 2004, doi: 10.1023/B:VISI.0000029664.99615.94.
[91] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Barcelona, Spain, 2011, pp. 2564-2571, doi: 10.1109/ICCV.2011.6126544.
[92] Wikipedia. "Epipolar geometry." Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/Epipolar_geometry (accessed Aug. 9, 2020).
[93] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle Adjustment — A Modern Synthesis," in Proc. Int. Workshop Vis. Algorithms, Corfu, Greece, 2000, pp. 298-372, doi: 10.1007/3-540-44480-7_21.
[94] T. Schops, J. Engel, and D. Cremers, "Semi-dense visual odometry for AR on a smartphone," in Proc. IEEE Int. Symp. Mixed Augmented Reality (ISMAR), Munich, Germany, 2014, pp. 145-150, doi: 10.1109/ismar.2014.6948420.
Wikipedia. "Visual inertial odometry." Wikipedia, the free encyclop edia. https://en.wikipedia.org/wiki/Visual_odometry
M. Li and A. I. Mourikis, "High-precision, consistent EKF-based visual-inertial odometry,"
Int. J. Robot. Res., vol. 32, no. 6, pp. 690-711, 2013, doi: 10.1177/0278364913481251. [97]
T. Qin, P. Li, and S. Shen, "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator,"
IEEE Trans. Robot., vol. 34, no. 4, pp. 1004-1020, 2018, doi: 10.1109/tro.2018.2853729. [98]
W. Wen, L. Hsu, and G. Zhang, "Performance analysis of NDT-based graph SLAM for autonomous vehicle in diverse typical driving scenarios of Hong Kong,"
Sensors , vol. 18, no. 11, pp. 3928-3948, 2018, doi: 10.3390/s18113928. [99]
Z. Gong, J. Li, Z. Luo, C. Wen, C. Wang, and J. Zelek, "Mapping and Semantic Modeling of Underground Parking Lots Using a Backpack LiDAR System,"
IEEE Trans. Intell. Transp. Syst., pp. 1-13, 2019, doi: 10.1109/tits.2019.2955734. [100]
Q. Li, S. Chen, C. Wang, X. Li, C. Wen, M. Chen, and J. Li, "LO-Net: Deep Real-Time Lidar Odometry," in , Long Beach, CA, USA, 2019, pp. 8465-8474, doi: 10.1109/CVPR.2019.00867. [101]
[101] T. Laidlow, M. Bloesch, W. Li, and S. Leutenegger, "Dense RGB-D-inertial SLAM with map deformations," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Vancouver, BC, Canada, 2017, pp. 6741-6748, doi: 10.1109/IROS.2017.8206591.
[102] O. Ozyesil, V. Voroninski, R. Basri, and A. Singer, "A survey of structure from motion," 2017, arXiv:1701.08493. [Online]. Available: https://arxiv.org/abs/1701.08493
[103] G. Verhoeven, C. Sevara, W. Karel, C. Ressl, M. Doneus, and C. Briese, "Undistorting the past: New techniques for orthorectification of archaeological aerial frame imagery," in Good Practice in Archaeological Diagnostics. Cham, Switzerland: Springer, 2013, pp. 31-67, doi: 10.1007/978-3-319-01784-6_3.
[104] S. Ceriani, G. Fontana, A. Giusti, D. Marzorati, M. Matteucci, D. Migliore, D. Rizzi, D. G. Sorrenti, and P. Taddei, "Rawseeds ground truth collection systems for indoor self-localization and mapping," Auton. Robot., vol. 27, no. 4, pp. 353-371, 2009, doi: 10.1007/s10514-009-9156-5.
[105] K. Y. Leung, Y. Halpern, T. D. Barfoot, and H. H. Liu, "The UTIAS multi-robot cooperative localization and mapping dataset," Int. J. Robot. Res., vol. 30, no. 8, pp. 969-974, 2011, doi: 10.1177/0278364911398404.
[106] P. Furgale, P. Carle, J. Enright, and T. D. Barfoot, "The Devon Island rover navigation dataset," Int. J. Robot. Res., vol. 31, no. 6, pp. 707-713, 2012, doi: 10.1177/0278364911433135.
[107] S. Anderson, C. McManus, H. Dong, E. Beerepoot, and T. D. Barfoot, "The gravel pit lidar-intensity imagery dataset," UTIAS, Tech. Rep. ASRL-2012-ABL001.
[108] D. Caruso, J. Engel, and D. Cremers, "Large-Scale Direct SLAM for Omnidirectional Cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Hamburg, Germany, 2015, pp. 141-148, doi: 10.1109/IROS.2015.7353366.
[109] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, "The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 3234-3243, doi: 10.1109/cvpr.2016.352.
[110] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 4340-4349, doi: 10.1109/CVPR.2016.470.
[111] B. Pfrommer, N. Sanket, K. Daniilidis, and J. Cleveland, "PennCOSYVIO: A challenging Visual Inertial Odometry benchmark," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Singapore, 2017, pp. 3847-3854, doi: 10.1109/icra.2017.7989443.
[112] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza, "The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and SLAM," Int. J. Robot. Res., vol. 36, no. 2, pp. 142-149, 2017, doi: 10.1177/0278364917691115.
[113] S. Cortés, A. Solin, E. Rahtu, and J. Kannala, "ADVIO: An Authentic Dataset for Visual-Inertial Odometry," in Proc. Eur. Conf. Comput. Vis. (ECCV), Munich, Germany, 2018, pp. 425-440, doi: 10.1007/978-3-030-01249-6_26.
[114] A. Antonini, W. Guerra, V. Murali, T. Sayre-Mccord, and S. Karaman, "The Blackbird Dataset: A Large-Scale Dataset for UAV Perception in Aggressive Flight," in Proc. Int. Symp. Exp. Robot. (ISER), Buenos Aires, Argentina, 2018, pp. 130-139, doi: 10.1007/978-3-030-33950-0_12.
[115] N. Zeller, F. Quint, and U. Stilla, "A synchronized stereo and plenoptic visual odometry dataset," 2018, arXiv:1807.09372. [Online]. Available: https://arxiv.org/abs/1807.09372
[116] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, "Robust Stereo Visual Inertial Odometry for Fast Autonomous Flight," IEEE Robot. Automat. Lett., vol. 3, no. 2, pp. 965-972, 2018, doi: 10.1109/lra.2018.2793349.
[117] M. Ferrera, V. Creuze, J. Moras, and P. Trouvé-Peloux, "AQUALOC: An underwater dataset for visual–inertial–pressure localization," Int. J. Robot. Res., vol. 38, no. 14, pp. 1549-1559, 2019, doi: 10.1177/0278364919883346.
[118] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song, Y. Guo, Z. Wang, Y. Zhang, B. Qin, W. Yang, F. Wang, R. H. M. Chan, and Q. She, "Are We Ready for Service Robots? The OpenLORIS-Scene Datasets for Lifelong SLAM," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Paris, France, 2020, pp. 3139-3145, doi: 10.1109/ICRA40945.2020.9196638.
[119] T. Pire, M. Mujica, J. Civera, and E. Kofman, "The Rosario dataset: Multisensor data for localization and mapping in agricultural environments," Int. J. Robot. Res., vol. 38, no. 6, pp. 633-641, 2019, doi: 10.1177/0278364919841437.
[120] D. Schubert, N. Demmel, L. v. Stumberg, V. Usenko, and D. Cremers, "Rolling-Shutter Modelling for Direct Visual-Inertial Odometry," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Macau, China, 2019, pp. 2462-2469, doi: 10.1109/IROS40897.2019.8968539.
[121] J. Delmerico, T. Cieslewski, H. Rebecq, M. Faessler, and D. Scaramuzza, "Are we ready for autonomous drone racing? The UZH-FPV drone racing dataset," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Montreal, QC, Canada, 2019, pp. 6713-6719, doi: 10.1109/ICRA.2019.8793887.
[122] Y. S. Park, J. Jeong, Y. Shin, and A. Kim, "ViViD: Vision for Visibility Dataset," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA) Workshop, Montreal, QC, Canada, 2019.
[123] Y. Cabon, N. Murray, and M. Humenberger, "Virtual KITTI 2," 2020, arXiv:2001.10773. [Online]. Available: https://arxiv.org/abs/2001.10773
[124] ETH-Zurich. "The EuRoC MAV Dataset." Autonomous Systems Lab. https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets (accessed May 15, 2020).
[125] J. Engel, J. Stuckler, and D. Cremers, "Large-scale direct SLAM with stereo cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Hamburg, Germany, 2015, pp. 1935-1942, doi: 10.1109/iros.2015.7353631.
[126] J.-C. Piao and S.-D. Kim, "Adaptive Monocular Visual–Inertial SLAM for Real-Time Augmented Reality Applications in Mobile Devices," Sensors, vol. 17, no. 11, pp. 2567-2591, 2017, doi: 10.3390/s17112567.
[127] Computer-Vision-Group. "Monocular Visual Odometry Dataset." Computer Vision Group, TUM. https://vision.in.tum.de/data/datasets/mono-dataset (accessed May 5, 2020).
[128] Q. She. "OpenLORIS-Scene Dataset." IROS2019 Competition. https://lifelong-robotic-vision.github.io/dataset/scene (accessed Aug. 10, 2020).
[129] S. Kohlbrecher, O. Von Stryk, J. Meyer, and U. Klingauf, "A flexible and scalable SLAM system with full 3D motion estimation," in Proc. IEEE Int. Symp. Safety, Security, Rescue Robot. (SSRR), Kyoto, Japan, 2011, pp. 155-160, doi: 10.1109/ssrr.2011.6106777.
[130] A. S. Huang, M. Antone, E. Olson, L. Fletcher, D. Moore, S. Teller, and J. Leonard, "A High-rate, Heterogeneous Data Set From The DARPA Urban Challenge," Int. J. Robot. Res., vol. 29, no. 13, pp. 1595-1601, 2010, doi: 10.1177/0278364910384295.
[131] M. Smith, I. Baldwin, W. Churchill, R. Paul, and P. Newman, "The New College Vision and Laser Data Set," Int. J. Robot. Res., vol. 28, no. 5, pp. 595-599, 2009, doi: 10.1177/0278364909103911.
[132] T. Peynot, S. Scheding, and S. Terho, "The Marulan Data Sets: Multi-sensor Perception in a Natural Environment with Challenging Conditions," Int. J. Robot. Res., vol. 29, no. 13, pp. 1602-1607, 2010, doi: 10.1177/0278364910384638.
[133] G. Pandey, J. R. McBride, and R. M. Eustice, "Ford Campus vision and lidar data set," Int. J. Robot. Res., vol. 30, no. 13, pp. 1543-1552, 2011, doi: 10.1177/0278364911400640.
[134] F. Pomerleau, M. Liu, F. Colas, and R. Siegwart, "Challenging data sets for point cloud registration algorithms," Int. J. Robot. Res., vol. 31, no. 14, pp. 1705-1711, 2012, doi: 10.1177/0278364912458814.
[135] C. H. Tong, D. Gingras, K. Larose, T. D. Barfoot, and É. Dupuis, "The Canadian planetary emulation terrain 3D mapping dataset," Int. J. Robot. Res., vol. 32, no. 4, pp. 389-395, 2013, doi: 10.1177/0278364913478897.
[136] J.-L. Blanco-Claraco, F.-Á. Moreno-Dueñas, and J. González-Jiménez, "The Málaga urban dataset: High-rate stereo and LiDAR in a realistic urban scenario," Int. J. Robot. Res., vol. 33, no. 2, pp. 207-214, 2014, doi: 10.1177/0278364913507326.
[137] N. Chebrolu, P. Lottes, A. Schaefer, W. Winterhalter, W. Burgard, and C. Stachniss, "Agricultural robot dataset for plant classification, localization and mapping on sugar beet fields," Int. J. Robot. Res., vol. 36, no. 10, pp. 1045-1052, 2017, doi: 10.1177/0278364917720510.
[138] K. Leung, D. Lühr, H. Houshiar, F. Inostroza, D. Borrmann, M. Adams, A. Nüchter, and J. Ruiz del Solar, "Chilean underground mine dataset," Int. J. Robot. Res., vol. 36, no. 1, pp. 16-23, 2017.
[139] Y. Choi, N. Kim, S. Hwang, K. Park, J. S. Yoon, K. An, and I. S. Kweon, "KAIST Multi-Spectral Day/Night Data Set for Autonomous and Assisted Driving," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 3, pp. 934-948, 2018, doi: 10.1109/tits.2018.2791533.
[140] R. A. Hewitt, E. Boukas, M. Azkarate, M. Pagnamenta, J. A. Marshall, A. Gasteratos, and G. Visentin, "The Katwijk beach planetary rover dataset," Int. J. Robot. Res., vol. 37, no. 1, pp. 3-12, 2018, doi: 10.1177/0278364917737153.
[141] Z. Zhu, D. Thakur, T. Ozaslan, B. Pfrommer, V. Kumar, and K. Daniilidis, "The multivehicle stereo event camera dataset: An event camera dataset for 3D perception," IEEE Robot. Automat. Lett., vol. 3, no. 3, pp. 2032-2039, 2018, doi: 10.1109/lra.2018.2800793.
[142] C. Wang, S. Hou, C. Wen, Z. Gong, Q. Li, X. Sun, and J. Li, "Semantic line framework-based indoor building modeling using backpacked laser scanning point cloud," ISPRS-J. Photogramm. Remote Sens., vol. 143, pp. 150-166, 2018, doi: 10.1016/j.isprsjprs.2018.03.025.
[143] C. Wen, Y. Dai, Y. Xia, Y. Lian, J. Tan, C. Wang, and J. Li, "Toward Efficient 3-D Colored Mapping in GPS-/GNSS-Denied Environments," IEEE Geosci. Remote Sens. Lett., vol. 17, pp. 147-151, 2020, doi: 10.1109/lgrs.2019.2916844.
[144] R. Gaspar, A. Nunes, A. M. Pinto, and A. Matos, "Urban@CRAS dataset: Benchmarking of visual odometry and SLAM techniques," Robot. Auton. Syst., vol. 109, pp. 59-67, 2018, doi: 10.1016/j.robot.2018.08.004.
[145] Y. S. Park, J. Jeong, Y. Shin, and A. Kim, "Radar Dataset for Robust Localization and Mapping in Urban Environment," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA) Workshop, Montreal, QC, Canada, 2019.
[146] M. Ramezani, Y. Wang, M. Camurri, D. Wisth, M. Mattamala, and M. Fallon, "The Newer College Dataset: Handheld LiDAR, Inertial and Vision with Ground Truth," 2020, arXiv:2003.05691. [Online]. Available: https://arxiv.org/abs/2003.05691
[147] J. Graeter, A. Wilczynski, and M. Lauer, "LIMO: Lidar-Monocular Visual Odometry," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Madrid, Spain, 2018, pp. 7872-7879, doi: 10.1109/iros.2018.8594394.
[148] Oxford-Robotics-Institute. "Oxford RobotCar Dataset." Oxford Robotics Institute. https://robotcar-dataset.robots.ox.ac.uk/ (accessed Aug. 16, 2020).
[150] W. Maddern, G. Pascoe, M. Gadd, D. Barnes, B. Yeomans, and P. Newman, "Real-time Kinematic Ground Truth for the Oxford RobotCar Dataset," 2020, arXiv:2002.10152. [Online]. Available: https://arxiv.org/abs/2002.10152
[151] Perceptual-Robotics-Laboratory. "University of Michigan North Campus Long-Term Vision and LIDAR Dataset." Perceptual Robotics Laboratory. http://robots.engin.umich.edu/nclt/ (accessed Aug. 15, 2020).
[152] M. Kaess, A. Ranganathan, and F. Dellaert, "iSAM: Incremental smoothing and mapping," IEEE Trans. Robot., vol. 24, no. 6, pp. 1365-1378, 2008, doi: 10.1109/TRO.2008.2006706.
[153] K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Shanghai, China, 2011, pp. 1817-1824, doi: 10.1109/icra.2011.5980382.
[154] S. Meister, S. Izadi, P. Kohli, M. Hämmerle, C. Rother, and D. Kondermann, "When can we use KinectFusion for ground truth acquisition," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS) Workshop, Vilamoura, Portugal, 2012, pp. 3-8.
[155] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor Segmentation and Support Inference from RGBD Images," in Proc. Eur. Conf. Comput. Vis. (ECCV), Florence, Italy, 2012, pp. 746-760, doi: 10.1007/978-3-642-33715-4_54.
[156] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, "Scene coordinate regression forests for camera relocalization in RGB-D images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Portland, OR, USA, 2013, pp. 2930-2937, doi: 10.1109/CVPR.2013.377.
[157] J. Xiao, A. Owens, and A. Torralba, "SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Sydney, NSW, Australia, 2013, pp. 1625-1632, doi: 10.1109/iccv.2013.458.
[158] Q.-Y. Zhou and V. Koltun, "Color map optimization for 3D reconstruction with consumer depth cameras," ACM Trans. Graph., vol. 33, no. 4, pp. 1-10, 2014, doi: 10.1145/2601097.2601134.
[159] A. Handa, T. Whelan, J. McDonald, and A. J. Davison, "A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Hong Kong, China, 2014, pp. 1524-1531.
[160] K. Lai, L. Bo, and D. Fox, "Unsupervised feature learning for 3D scene labeling," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Hong Kong, China, 2014, pp. 3050-3057, doi: 10.1109/icra.2014.6907298.
[161] O. Wasenmüller, M. Meyer, and D. Stricker, "CoRBS: Comprehensive RGB-D benchmark for SLAM using Kinect v2," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Lake Placid, NY, USA, 2016, pp. 1-7, doi: 10.1109/WACV.2016.7477636.
[162] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung, "SceneNN: A Scene Meshes Dataset with aNNotations," in Proc. Int. Conf. 3D Vis. (3DV), Stanford, CA, USA, 2016, pp. 92-101, doi: 10.1109/3dv.2016.18.
[163] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, "Voxblox: Incremental 3D Euclidean Signed Distance Fields for on-board MAV planning," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Vancouver, BC, Canada, 2017, pp. 1366-1373, doi: 10.1109/iros.2017.8202315.
[164] J. R. Ruiz-Sarmiento, C. Galindo, and J. Gonzalez-Jimenez, "Robot@Home, a robotic dataset for semantic mapping of home environments," Int. J. Robot. Res., vol. 36, no. 2, pp. 131-141, 2017, doi: 10.1177/0278364917695640.
[165] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, "SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation?," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, 2017, pp. 2678-2687, doi: 10.1109/iccv.2017.292.
[166] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, "Semantic Scene Completion from a Single Depth Image," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, 2017, pp. 1746-1754, doi: 10.1109/cvpr.2017.28.
[167] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, "ScanNet: Richly-annotated 3D reconstructions of indoor scenes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, 2017, pp. 5828-5839, doi: 10.1109/CVPR.2017.261.
[168] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger, "InteriorNet: Mega-scale multi-sensor photo-realistic indoor scenes dataset," 2018, arXiv:1809.00716. [Online]. Available: https://arxiv.org/abs/1809.00716
[169] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss, "ReFusion: 3D Reconstruction in Dynamic Environments for RGB-D Cameras Exploiting Residuals," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Macau, China, 2019, pp. 7855-7862, doi: 10.1109/IROS40897.2019.8967590.
[170] T. Schops, T. Sattler, and M. Pollefeys, "BAD SLAM: Bundle Adjusted Direct RGB-D SLAM," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Long Beach, CA, USA, 2019, pp. 134-144, doi: 10.1109/cvpr.2019.00022.
[171] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, and S. Verma, "The Replica Dataset: A digital replica of indoor spaces," 2019, arXiv:1906.05797. [Online]. Available: https://arxiv.org/abs/1906.05797
[172] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A Versatile and Accurate Monocular SLAM System," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147-1163, 2015, doi: 10.1109/tro.2015.2463671.
[173] K. Tateno, F. Tombari, I. Laina, and N. Navab, "CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, 2017, pp. 6243-6252, doi: 10.1109/CVPR.2017.695.
[174] J. Engel, J. Sturm, and D. Cremers, "Semi-dense Visual Odometry for a Monocular Camera," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Sydney, NSW, Australia, 2013, pp. 1449-1456, doi: 10.1109/iccv.2013.183.
[175] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems," IEEE Trans. Robot., vol. 33, no. 2, pp. 249-265, 2017, doi: 10.1109/tro.2016.2623335.
[176] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt, "BundleFusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration," ACM Trans. Graph., vol. 36, no. 4, p. 1, 2017, doi: 10.1145/3072959.3054739.
[177] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, "ElasticFusion: Real-time dense SLAM and light source estimation," Int. J. Robot. Res., vol. 35, no. 14, pp. 1697-1716, 2016, doi: 10.1177/0278364916669237.
[178] T. Whelan, M. Kaess, H. Johannsson, M. Fallon, J. J. Leonard, and J. McDonald, "Real-time large-scale dense RGB-D SLAM with volumetric fusion," Int. J. Robot. Res., vol. 34, no. 4-5, pp. 598-626, 2015, doi: 10.1177/0278364914551008.
[179] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, "A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), New York, NY, USA, 2006, pp. 519-528, doi: 10.1109/cvpr.2006.19.
[180] Y. Furukawa and J. Ponce, "Accurate, Dense, and Robust Multiview Stereopsis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 8, pp. 1362-1376, 2010, doi: 10.1109/tpami.2009.161.
[181] A. Zaharescu, E. Boyer, K. Varanasi, and R. Horaud, "Surface feature detection and description with applications to mesh matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Miami, FL, USA, 2009, pp. 373-380, doi: 10.1109/cvpr.2009.5206748.
[182] D. Bradley, T. Boubekeur, and W. Heidrich, "Accurate multi-view reconstruction using robust binocular stereo and surface meshing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Anchorage, AK, USA, 2008, pp. 1-8, doi: 10.1109/cvpr.2008.4587792.
[183] N. Snavely, S. M. Seitz, and R. Szeliski, "Photo tourism: Exploring Photo Collections in 3D," in Proc. ACM SIGGRAPH, New York, NY, USA, 2006, pp. 835-846, doi: 10.1145/1179352.1141964.
[184] C. Strecha, W. Von Hansen, L. Van Gool, P. Fua, and U. Thoennessen, "On benchmarking camera calibration and multi-view stereo for high resolution imagery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Anchorage, AK, USA, 2008, pp. 1-8, doi: 10.1109/cvpr.2008.4587706.
[185] K. Kolev, M. Klodt, T. Brox, and D. Cremers, "Continuous Global Optimization in Multiview 3D Reconstruction," Int. J. Comput. Vis., vol. 84, no. 1, pp. 80-96, 2009, doi: 10.1007/s11263-009-0233-1.
[186] M. Farenzena, A. Fusiello, and R. Gherardi, "Structure-and-motion pipeline on a hierarchical cluster tree," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Kyoto, Japan, 2009, pp. 1489-1496, doi: 10.1109/iccvw.2009.5457435.
[187] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski, "Building Rome in a day," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Kyoto, Japan, 2009, pp. 72-79, doi: 10.1109/iccv.2009.5459148.
[188] Y. Li, N. Snavely, and D. P. Huttenlocher, "Location Recognition Using Prioritized Feature Matching," in Proc. Eur. Conf. Comput. Vis. (ECCV), Heraklion, Crete, Greece, 2010, pp. 791-804, doi: 10.1007/978-3-642-15552-9_57.
[189] S. Shen, "Accurate Multiple View 3D Reconstruction Using Patch-Based Stereo for Large-Scale Scenes," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1901-1914, 2013, doi: 10.1109/tip.2013.2237921.
[190] D. Crandall, A. Owens, N. Snavely, and D. Huttenlocher, "Discrete-continuous optimization for large-scale structure from motion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2011, pp. 3001-3008, doi: 10.1109/cvpr.2011.5995626.
[191] D. M. Chen, G. Baatz, K. Koser, S. S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, B. Girod, and R. Grzeszczuk, "City-scale landmark identification on mobile devices," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2011, pp. 737-744, doi: 10.1109/cvpr.2011.5995610.
[192] Y. Li, N. Snavely, D. Huttenlocher, and P. Fua, "Worldwide Pose Estimation Using 3D Point Clouds," in Proc. Eur. Conf. Comput. Vis. (ECCV), Florence, Italy, 2012, pp. 15-29, doi: 10.1007/978-3-642-33718-5_2.
[193] S. Cao and N. Snavely, "Learning to Match Images in Large-Scale Collections," in Proc. Eur. Conf. Comput. Vis. Workshops (ECCVW), Florence, Italy, 2012, pp. 259-270, doi: 10.1007/978-3-642-33863-2_26.
[194] K. Wilson and N. Snavely, "Network Principles for SfM: Disambiguating Repeated Structures with Local Context," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Sydney, NSW, Australia, 2013, pp. 513-520, doi: 10.1109/iccv.2013.69.
[195] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanaes, "Large Scale Multi-view Stereopsis Evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Columbus, OH, USA, 2014, pp. 406-413, doi: 10.1109/cvpr.2014.59.
[196] K. Wilson and N. Snavely, "Robust Global Translations with 1DSfM," in Proc. Eur. Conf. Comput. Vis. (ECCV), Zurich, Switzerland, 2014, pp. 61-75, doi: 10.1007/978-3-319-10578-9_5.
[197] J. L. Schonberger and J.-M. Frahm, "Structure-from-Motion Revisited," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 4104-4113, doi: 10.1109/cvpr.2016.445.
[198] T. Schops, J. L. Schonberger, S. Galliani, T. Sattler, K. Schindler, M. Pollefeys, and A. Geiger, "A Multi-view Stereo Benchmark with High-Resolution Images and Multi-camera Videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Honolulu, HI, USA, 2017, pp. 3260-3269, doi: 10.1109/cvpr.2017.272.
[199] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, "Tanks and temples: Benchmarking Large-Scale Scene Reconstruction," ACM Trans. Graph., vol. 36, no. 4, pp. 1-13, 2017, doi: 10.1145/3072959.3073599.
[200] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, "Pixelwise View Selection for Unstructured Multi-View Stereo," in Proc. Eur. Conf. Comput. Vis. (ECCV), Amsterdam, The Netherlands, 2016, pp. 501-518, doi: 10.1007/978-3-319-46487-9_31.
[201] Tanks-and-Temples. "Tanks and Temples Benchmark." tanksandtemples.org. https://tanksandtemples.org/ (accessed Apr. 4, 2020).
[202] S. Agarwal, Y. Furukawa, N. Snavely, B. Curless, S. M. Seitz, and R. Szeliski, "Reconstructing Rome," Computer, vol. 43, no. 6, pp. 40-47, 2010, doi: 10.1109/mc.2010.175.
[204] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Lujan, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber, "Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Seattle, WA, USA, 2015, pp. 5783-5790, doi: 10.1109/icra.2015.7140009.
[205] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. Mawer, A. Nisbet, M. Lujan, S. Furber, A. J. Davison, P. H. J. Kelly, and M. F. P. O'Boyle, "SLAMBench2: Multi-Objective Head-to-Head Benchmarking for Visual SLAM," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Brisbane, QLD, Australia, 2018, pp. 1-8, doi: 10.1109/icra.2018.8460558.
[206] M. Bujanca, P. Gafton, S. Saeedi, A. Nisbet, B. Bodin, M. F. P. O'Boyle, A. J. Davison, P. H. J. Kelly, G. Riley, B. Lennox, M. Lujan, and S. Furber, "SLAMBench 3.0: Systematic Automated Reproducible Evaluation of SLAM Systems for Robot Vision Challenges and Scene Understanding," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Montreal, QC, Canada, 2019, pp. 6351-6358, doi: 10.1109/icra.2019.8794369.
[207] A. Mallios, E. Vidal, R. Campos, and M. Carreras, "Underwater caves sonar data set," Int. J. Robot. Res., vol. 36, no. 12, pp. 1247-1251, 2017, doi: 10.1177/0278364917732838.
[209] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, 2016, pp. 3213-3223, doi: 10.1109/CVPR.2016.350.
[210] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Tokyo, Japan, 2013, pp. 2100-2106, doi: 10.1109/IROS.2013.6696650.
[211] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard, "3-D Mapping With an RGB-D Camera," IEEE Trans. Robot., vol. 30, no. 1, pp. 177-187, 2014, doi: 10.1109/tro.2013.2279412.
[212] R. Mur-Artal and J. D. Tardos, "Visual-Inertial Monocular SLAM With Map Reuse," IEEE Robot. Automat. Lett., vol. 2, no. 2, pp. 796-803, 2017, doi: 10.1109/lra.2017.2653359.
[213] S. Saeedi, L. Nardi, E. Johns, B. Bodin, P. H. J. Kelly, and A. J. Davison, "Application-oriented design space exploration for SLAM algorithms," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), Singapore, 2017, pp. 5716-5723, doi: 10.1109/icra.2017.7989673.
[214] AirLab. "TartanAir: A Dataset to Push the Limits of Visual SLAM." AirLab. https://theairlab.org/tartanair-dataset/ (accessed Mar. 5, 2020).
YUANZHI LIU was born in Huludao, Liaoning Province, China, in 1995. He received the B.S. degree in measurement and control technology from Harbin Institute of Technology (HIT), Harbin, Heilongjiang Province, China, in 2017. He is currently a Ph.D. candidate specializing in robotic simultaneous localization and mapping (SLAM) at the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. From 2015 to 2017, he was a member of the HIT Robot Team, serving as the Captain of the Humanoid Robot Group. From 2019 to 2020, he was a Visiting Ph.D. Researcher with the Image Processing and Interpretation (IPI) group at Ghent University. His research interests fall mainly in robotics, including SLAM, visual odometry, 3D mapping, autonomous driving, and dataset collection. He holds two granted patents and has several patents under review on robotics and intelligent devices. Mr. Liu is a student member of the China Society of Image and Graphics. He became a student member of IEEE in 2016. He won the Champion in the National Robot Championship in 2016. He was awarded Excellent Student Cadre in 2016 and Merit Student in 2017. He won the Scholarship of Excellent Ph.D. Students in 2019.
YUJIA FU was born in Anshan, Liaoning Province, China, in 1996. He received the B.S. degree in marine technology from the Dalian University of Technology, Dalian, Liaoning Province, in 2019. From 2020 to 2021, he was a student member of the China Computer Federation. His research interests include the design and development of marine robots, robot visual localization, and simultaneous localization and mapping. Mr. Fu won the first prize in the University Physics Academic Competition of Liaoning in 2017, and the honor of Merit Student three years in a row. He was awarded Outstanding Graduate of Dalian in 2019.
FENGDONG CHEN was born in Tonghua, Jilin Province, China, in 1977. He received the Ph.D. degree in computer science and engineering from Harbin Institute of Technology, Harbin, Heilongjiang Province, China, in 2009. Since 2000, he has served as an Associate Professor in the Department of Instrument Science and Engineering, Harbin Institute of Technology. His research interests include computer vision and precision instruments. He is the author of 20 articles and more than 10 inventions. Dr. Chen received the invention award from the Ministry of National Defense in 2016. He is a reviewer for Acta Optica Sinica, Robot, Acta Electronica Sinica, and the Journal of Harbin Institute of Technology.
BART GOOSSENS received the M.S. degree in computer science and the Ph.D. degree in engineering from Ghent University in 2006 and 2010, respectively. Since October 2013, he has served as a professor in the Image Processing and Interpretation (IPI) group of Ghent University, where he currently supervises research on image/video processing, computer vision, AI, and heterogeneous platform mapping tools. He is also a core principal investigator at imec. His interest in the efficient mapping of image processing and computer vision techniques onto hardware architectures such as GPUs has resulted in the design and development of Quasar (gepura.io/quasar), a high-performance CPU/GPU programming solution that offers greatly reduced development time. He is the author of more than 100 scientific papers in journals and conference proceedings.
WEI TAO (M'2004) was born in Dalian, Liaoning Province, China, in 1975. She received the B.S., M.S., and Ph.D. degrees in instrument science and technology from Harbin Institute of Technology, Heilongjiang Province, China, in 1997, 1999, and 2003, respectively. From 2003 to 2018, she was a Research Assistant and an Associate Professor with Shanghai Jiao Tong University, where she became a Professor in 2018. She is the author of three books, more than 100 articles, and more than 40 inventions. Her research interests include opto-electronic measurement technology and applications, methods and algorithms in the vision measurement process, and laser sensors and measurement instruments. Dr. Tao became an IEEE member in 2004 and is now also an OSA member. She received invention awards from the government of Shanghai and the China Instrument and Control Society in 2007 and 2009, respectively.