A Collaborative Visual SLAM Framework for Service Robots
Ming Ouyang, Xuesong Shi, Yujie Wang, Yuxin Tian, Yingzhe Shen, Dawei Wang, Peng Wang, Zhiqiang Cao
Abstract — With the rapid deployment of service robots, a method should be established to allow multiple robots to work in the same place to collaborate and share spatial information. To this end, we present a collaborative visual simultaneous localization and mapping (SLAM) framework particularly designed for service robot scenarios. With an edge server maintaining a map database and performing global optimization, each robot can register to an existing map, update the map, or build new maps, all with a unified interface and low computation and memory cost. To enable real-time information sharing, an efficient landmark retrieval method is proposed to allow each robot to get nearby landmarks observed by others. The framework is general enough to support both RGB-D and monocular cameras, as well as robots with multiple cameras, taking the rigid constraints between cameras into consideration. The proposed framework has been fully implemented and verified with public datasets and live experiments.
I. INTRODUCTION
Very recently, we have seen a rapidly increasing number of service robots being deployed in various public places, such as hospitals, hotels, restaurants and malls, for contactless delivery, cleaning, disinfection and inspection, among other tasks. Most robots are expected to be able to move autonomously, so generally they need a pre-built or evolving map of the place of interest for localization and planning. An evolving map is usually preferred, since most service robots work in dynamic and daily changing environments. For a single robot, its localization and map update can be performed with a Simultaneous Localization And Mapping (SLAM) system [1]. However, as there are likely to be multiple robots serving in one place, each of which may be turned on and off from time to time, ideally there should be a centralized map database of this place that is always up-to-date and can be used by any robot in this place at any time. For service robot scenarios, the map database can be maintained by an on-premise edge server [2]. While edge server-assisted SLAM systems have been proposed recently [3], there lacks a framework to allow multiple service robots to simultaneously work with a centralized map database on the edge server.

Collaborative SLAM algorithms aim to enable multiple independently moving agents to exchange spatial information for better robustness, efficiency and accuracy in localization and mapping [4]. Many recent works adopt a client-server architecture [5][6][7][8][9][10]. Although they are originally designed for Augmented Reality (AR) or Micro Aerial Vehicle (MAV) applications, many of the techniques can be applied to service robot scenarios. In this paper, we give a careful review of the literature, and present a multi-agent visual SLAM system dedicated for service robot applications. We elaborate our design with the hope of pushing forward a standard interface for different kinds of service robots to benefit from a unified SLAM service powered by edge servers. There are many open problems that we could imagine, such as privacy preservation and security assurance, but we believe that this is the right direction to build smart spaces where people can be better served.

The first problem in designing a multi-agent SLAM system is the partition of modules between the server and clients, to balance between a powerful centralized system and fully autonomous agents, with also consideration to the computation cost and communication bandwidth. For the targeted service robot scenarios, we assume that an edge server is always on-site with good (but non-perfect) communications with each client. Therefore, the maps are mainly maintained by the server, while each client keeps only a small local map for real-time pose tracking. The clients also optimize their local maps with new observations, so that they can retain autonomy even if the communication is temporarily unavailable. Instead of sending images to the server, each client reports only keyframe features and optimized landmarks. Thus the bandwidth usage is affordable, and so is the computation overhead at the server side for handling each client. Both are necessary for a server to support many clients.

For each client to benefit from others' and previous observations, we propose a novel method for the clients to efficiently query the map database at server side.

The authors are with Intel Corporation, Beijing, 100190 China. M.O. and Y.T. are graduate interns from the Institute of Automation, Chinese Academy of Sciences and Beihang University, China, respectively, advised by X.S. Correspondence should be addressed to [email protected].
Instead of downloading the full map, only landmarks within the client's current view are retrieved. This is efficiently performed by a grid map-based landmark searching method. The retrieved landmarks may originate from other clients' observations, or from other sessions. For each new keyframe, the server will try to detect loops with existing keyframes and perform a map merging or map optimization whenever possible.

A service robot can be equipped with multiple cameras to enlarge its view for more robustness in narrow or crowded places. The proposed framework can naturally accommodate this setting by treating each camera as a separate client and adding inter-camera constraints for global optimization at the server side. The system can keep working when one of the cameras temporarily fails.

In summary, the contributions of this work include
• a collaborative SLAM framework particularly designed for service robot usages;
• an efficient landmark retrieval method to share spatial information between clients in real time;
• a fully-implemented and verified system that can work with many robots, each with one or more RGB-D or monocular cameras.

II. RELATED WORKS
A. Visual SLAM
Visual SLAM algorithms, by the kind of visual information they use, can be categorized into direct methods and feature-based ones. The former use pixel intensities in the images to estimate camera motion and environmental structures by minimizing a photometric error, and can generate a dense or semi-dense reconstruction of the scene. Feature-based methods, despite giving only a sparse map, are generally more accurate and robust because they select the most reliable features in the scene, such as keypoints, lines, planes or objects. They are also more flexible, as the features from different trajectories can be matched and optimized together, which is crucial for generalizing to a multi-session or multi-agent setting. Representative works of keypoint and optimization based SLAM systems include PTAM [11], ORB-SLAM [12], ORB-SLAM2 [13] and OpenVSLAM [14]. Our system in this work is developed based on OpenVSLAM.
B. Multi-agent SLAM
The works on multi-agent SLAM have been motivated by the purpose of building, managing and re-using maps from and for multiple devices. Castle et al. [15] extend PTAM to a multi-camera multi-map setting, showing that multiple trackers can work within the same map, and each tracker can be efficiently re-localized on multiple maps. There is no map merging mechanism because constructing a global map is not mandatory in the targeted wearable AR applications. In robotic applications, however, maps with overlapped areas are preferably merged together, so as to support robot autonomy over a large area. Forster et al. [5] propose Collaborative Structure from Motion (CSfM) for MAVs. Each MAV runs a visual-inertial odometry system, and sends image features and keyframe poses to a base station for centralized mapping. A more recent work is MOARSLAM [7], where maps are constructed by each agent, but can be synchronized with a remote server. The server can connect two maps from different agents once a loop between them is reliably detected.

Collaborative localization is also considered in the literature. A seminal work is CoSLAM [16], which presents several ways to enhance localization by utilizing common observations among multiple independently moving cameras. In particular, cameras with overlapped views are dynamically grouped together, and inter-camera pose estimation and mapping can be performed within the same group. However, despite the obvious merits, such methods are less re-used in later works because they not only require all the cameras to be synchronized, but also imply a tightly coupled pipeline between agents. Instead, the client-server architecture has been enhanced to support multi-agent collaboration not only in mapping, but also in localization, by real-time map synchronization between the server and clients [6][8][9][10]. The Cloud framework for Cooperative Tracking And Mapping (C2TAM) [6] is a multi-agent SLAM system with a modern client-server infrastructure.
The modules of its baseline SLAM algorithm, Parallel Tracking And Mapping (PTAM) [11], are partitioned into server-side and client-side. The major consideration is to leverage the storage and computational resources in the cloud server while maintaining the real-time pose estimation capability at client side. To this end, the client performs only tracking and relocation (local re-localization), while the server provides mapping, place recognition (global re-localization) and map fusion as services. To save bandwidth, each client sends only keyframes to the server. The server has to send the updated map to the client every time the map is optimized, which can be costly when the map gets large.

In [8] and CCM-SLAM [10], each client runs tracking and local mapping, maintaining a local map consisting of a fixed number of nearest keyframes and related map points. The local map is synchronized with the server, allowing bi-directional updates of keyframes and map points, thus also enabling indirect information sharing between agents through the server. The system can also be extended to support a visual-inertial sensor setup as shown in [9].

The proposed system in this paper has a similar architecture to CCM-SLAM, but has been particularly optimized for service robots. We propose an efficient grid map-based landmark searching method using the ground plane. Unlike CCM-SLAM with monocular devices, we support both monocular and RGB-D cameras. We also show the use case of robust visual SLAM with multiple heterogeneous sensors on one robot.

Besides these multi-agent SLAM algorithms with a base station or a cloud server, there are also works on decentralized multi-agent SLAM algorithms. In such cases the agents leverage local communication to reach consensus on a common estimate. Such algorithms are generally more costly in computation and communication, and more fragile to spurious measurements and matches, as each agent only has access to partial and local information [1].
Thus decentralized SLAM algorithms are less practical for service robot scenarios, in which high reliability is required and server deployment is often possible.
C. Multi-camera SLAM
SLAM with multiple cameras mounted on one rigid body can be viewed as a special case of multi-agent SLAM. A pair of synchronized and calibrated cameras with largely overlapped views is usually considered as one stereo camera setup, where frames from one camera are tracked and mapped, while frames from the other help triangulate the keypoints [13]. With more cameras on one device, its field of view can be greatly enlarged, leading to better mapping efficiency and localization robustness. Kaess and Dellaert [17] design a rig with 8 cameras distributed equally along a circle, and show that it can efficiently and accurately map a large scene. For a general multi-camera setting, Kuo et al. [18] present several techniques to build a more adaptive SLAM system with fewer hyper-parameters. One of the techniques is to organize all the landmarks in a voxel grid, so that for an arbitrary camera pose, landmarks within its view can be efficiently retrieved by sampling the camera frustum, rather than relying on keyframe covisibility, which has several drawbacks [19]. We take a step further from this work and propose a grid map-based landmark retrieval method that is very efficient for collaborative service robot scenarios.

Fig. 1. The proposed collaborative visual SLAM framework features a server to maintain all the maps. Each client performs visual odometry with a small local map, which is synchronized with the server's map database in real-time.
D. Multi-session SLAM
Multi-session SLAM, or lifelong SLAM in some contexts, aims to reuse previous map information for later trajectories in the same place. It can be viewed as a sub-topic of multi-agent SLAM, and the proposed methods can usually be directly incorporated into the server side of a client-server architecture. The concept of Atlas is introduced in [20], showing that a global topological map can be efficiently built and reused by multiple local SLAM sessions. Several open-source SLAM systems support inter-session loop detection (also known as place recognition) and map merging, including Maplab [21], RTABMap [22] and ORB-SLAM3 [23]. For place recognition, recent works show that deep convolutional neural network based feature extraction can greatly outperform conventional methods and can be well integrated into keypoint-based SLAM pipelines [24].

III. MULTI-AGENT SLAM SYSTEM
Our design goal is to build a ready-to-use visual SLAM system for multiple ground robots working in one place. In particular, the system should
• allow multiple robots to build and reuse one or more global maps of the place, without any assumption of each robot's movement,
• enable timely information sharing between robots; for example, an area visited by one robot should be a known area for another robot following a similar trajectory,
• allow robots with multiple cameras to take full advantage of the overall field of view,
with the constraints of a) no hardware synchronization of cameras, b) bounded on-board computation and memory cost for each robot, and c) non-perfect communication channels with possible message loss.

For this goal, it is natural to employ an on-premise edge server to manage the global maps and communicate with each robot via wireless connections. Instead of treating each robot as a client, we choose to make each individual sensor setup (monocular/stereo/RGB-D) a client that communicates with the server. Thus, a robot equipped with multiple cameras may have multiple client-side software instances running on-board. Then if needed, a separate fusion module can be designed to give the final localization of this robot by merging results from these clients with a filter and a policy to deal with individual tracking failures.

A. Module Partition
OpenVSLAM [14] is an open-source single-agent SLAM system with a modular design and state-of-the-art performance. There are three major modules in the system, for tracking, mapping, and global optimization, respectively. We re-use these modules in our system, enhance some of their functionalities, and introduce new modules for client-server communication and map fusion. The overall architecture is shown in Fig. 1.

The tracking module estimates the pose of each frame by matching the extracted keypoints to the local map. It should be running at client side to ensure real-time response.

The mapping module, taking care of keyframe creation, landmark creation, and local bundle adjustment, should also be running at client side. Otherwise, if each client relies on the server to update the map, its autonomy would be greatly constrained as the communication may fail or lag.

The global optimization module performs loop detection and pose graph optimization. It has no real-time requirement and may require a large memory, making it reasonable to run on a server machine. In our system, it can also detect inter-client and inter-session loops and merge corresponding maps, as well as addressing the prior rigid constraints between cameras on the same robot.

The map database at server side stores all the keyframes and landmarks. Each keyframe or landmark has a unique ID, and can be efficiently retrieved based on a hash table index. They also have a map ID to mark which map they are on.

The communication modules make connections between the server and each client. At either side, the module transmits messages queued by other modules, and invokes the proper callback when receiving messages. All the messages are in a unified format, containing keyframes and landmarks.

For each alive client, there is a client handler at server side to process messages from this client, fuse the information into the map database, and send back messages to the client with nearby landmarks or an updated local map.
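As a rough illustration of the map database described above (a minimal sketch, not the authors' implementation; all names are hypothetical), each keyframe carries a unique ID and a map ID, and a hash table gives constant-time retrieval by ID:

```python
from dataclasses import dataclass, field

@dataclass
class Keyframe:
    kf_id: int    # globally unique keyframe ID
    map_id: int   # which map this keyframe belongs to
    pose: tuple   # world pose (placeholder representation)

@dataclass
class MapDatabase:
    # Hash-table indices: ID -> object, giving O(1) retrieval.
    keyframes: dict = field(default_factory=dict)
    landmarks: dict = field(default_factory=dict)

    def add_keyframe(self, kf: Keyframe) -> None:
        self.keyframes[kf.kf_id] = kf

    def keyframes_of_map(self, map_id: int):
        # The map ID marks which map each keyframe is on.
        return [kf for kf in self.keyframes.values() if kf.map_id == map_id]

db = MapDatabase()
db.add_keyframe(Keyframe(kf_id=1, map_id=0, pose=(0.0, 0.0, 0.0)))
db.add_keyframe(Keyframe(kf_id=2, map_id=1, pose=(1.0, 0.0, 0.0)))
assert db.keyframes[2].map_id == 1
assert len(db.keyframes_of_map(0)) == 1
```

Landmarks would follow the same ID-indexed pattern, with the grid index of Sec. III-D layered on top.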
B. Communication Overview
When a client starts up, it establishes a connection with the server, and sends to the server its camera intrinsics and extrinsics. All the following communications are based on an asynchronous message delivering mechanism with a unified message definition. Each message has a keyframe array, a landmark array, and a few metadata and control commands. Each keyframe and landmark has a unique ID within the whole system. For each newly generated keyframe, the client will send the extracted keypoints to the server. In other cases only pose estimates and other variables are updated. For each landmark, its location, observations and descriptor will be synchronized between server and clients. The client will send the new, updated and removed keyframes and landmarks to the server after each local mapping process.

The server, after receiving a new keyframe, will retrieve the landmarks near this keyframe and send them to the corresponding client to augment its local map. If a global optimization has occurred on the server, it will send to the affected clients the updated keyframes and landmarks in their corresponding local maps.
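The unified message described above can be sketched as follows (our own simplification with hypothetical field names; the actual wire format is not specified in the paper):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KeyframeMsg:
    kf_id: int                    # unique within the whole system
    pose: List[float]             # pose estimate, always synchronized
    keypoints: List[List[float]]  # sent only for newly created keyframes

@dataclass
class LandmarkMsg:
    lm_id: int
    position: List[float]    # 3D location
    descriptor: bytes        # binary feature descriptor
    observations: List[int]  # IDs of observing keyframes

@dataclass
class MapMessage:
    # Unified format: keyframe array + landmark array + metadata/commands.
    keyframes: List[KeyframeMsg] = field(default_factory=list)
    landmarks: List[LandmarkMsg] = field(default_factory=list)
    client_id: int = 0
    command: str = ""        # control command, e.g. a place recognition request

msg = MapMessage(client_id=3)
msg.keyframes.append(KeyframeMsg(kf_id=42, pose=[0.0] * 7, keypoints=[[10.0, 20.0]]))
assert msg.keyframes[0].kf_id == 42
```

Both directions of the client-server exchange reuse this one structure, which keeps the protocol simple.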
C. Client Initialization, Tracking and Mapping
RGB-D and stereo cameras are usually preferred in service robot design, yet monocular cameras can also be used as auxiliary cameras, or by small-sized devices to track pose with a known map. Our system currently supports RGB-D and monocular cameras as the input of each client.

A client with RGB-D input will initialize a new map at the beginning of each session or after tracking is lost. If the map is within a known area, it will soon be merged into an existing map by the server. For clients with monocular cameras, however, an explicit place recognition request containing the current frame features will be sent to the server, and the client will not start tracking until getting a successful initial pose estimate and nearby landmarks from the server. The reason for this choice is that the initialization of monocular SLAM can be difficult and time-consuming, and so can be the merging of maps without an absolute scale. With our method, on the contrary, the initialization of monocular clients is very fast and reliable if it is in a mapped area, which we assume should often be the case for service robot applications. If the monocular client then explores into a new area, the tracking and mapping will continue and the server's map database will also be extended, as the client has already been working with the correct map scale.

Fig. 2. The grid map-based method to retrieve nearby landmarks for a given camera pose. All the landmarks are indexed by a 2D grid (grey) which is coarsely aligned with the ground plane. Given a camera pose and intrinsics, we project the eight vertices (green dots) of its frustum (green area) onto the grid, calculate the grid cells that fall into the convex hull of the projected area (blue), and retrieve landmarks within these cells with additional checking.

For each image, the ORB features [25] are extracted for further processing. Although it has been shown that features from a deep neural network can significantly outperform the ORB features [24], their descriptors' dimensions are often much larger than the binary descriptors in ORB, making them much more bandwidth-consuming in a multi-agent system. We leave the development of more suitable feature extractors to future works.

The tracking and mapping pipeline, shown in Fig. 1, is mostly the same as in OpenVSLAM [14]. One difference is that the client keeps only a local map (those keyframes and landmarks needed for the local bundle adjustment) and will remove all the other keyframes and landmarks from its memory. So the client's memory usage is small and bounded even in a long session. Another difference is that the local map is augmented by additional landmarks sent from the server, which may be observed by other clients, from previous sessions, or from the same session but had been shifted out of the local map.
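The bounded local-map policy above can be sketched as follows. This is a deliberately simplified illustration with made-up names: the real client selects keyframes for local BA by covisibility, whereas this sketch uses a plain recency window.

```python
from collections import OrderedDict

class LocalMap:
    """Keep only the most recent keyframes; evict everything else."""

    def __init__(self, max_keyframes: int = 10):
        self.max_keyframes = max_keyframes
        self.keyframes = OrderedDict()   # kf_id -> list of observed landmark IDs
        self.landmarks = {}              # lm_id -> set of observing kf_ids

    def add_keyframe(self, kf_id, observed_landmarks):
        self.keyframes[kf_id] = observed_landmarks
        for lm_id in observed_landmarks:
            self.landmarks.setdefault(lm_id, set()).add(kf_id)
        # Evict the oldest keyframe once the window is full,
        # so memory stays bounded even in a long session.
        while len(self.keyframes) > self.max_keyframes:
            old_id, old_lms = self.keyframes.popitem(last=False)
            for lm_id in old_lms:
                obs = self.landmarks[lm_id]
                obs.discard(old_id)
                if not obs:              # landmark no longer observed locally
                    del self.landmarks[lm_id]

lm = LocalMap(max_keyframes=2)
lm.add_keyframe(1, [100, 101])
lm.add_keyframe(2, [101, 102])
lm.add_keyframe(3, [102, 103])   # evicts keyframe 1
assert 1 not in lm.keyframes
assert 100 not in lm.landmarks   # only seen by the evicted keyframe
assert 101 in lm.landmarks       # still observed by keyframe 2
```

Landmarks sent by the server would be inserted into `landmarks` the same way, augmenting the window.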
D. Global Landmark Retrieval
One of the core problems in collaborative SLAM is to share spatial information between agents. In a client-server system, this can be done by the server sending the global map to each client [6], or more efficiently, sending only selected keyframes and landmarks that are close to the client's current keyframe [10]. However, finding close keyframes and landmarks is non-trivial. CCM-SLAM retrieves keyframes with the strongest covisibility, which is a powerful heuristic, but would fail to retrieve nearby landmarks observed from a largely different view angle. A more principled method would be to retrieve all the landmarks within the client's current view. To realize it, we use a grid map to index all the landmarks. This is inspired by [19] where a voxel map is employed, but we improve the method so that all the landmarks within the camera's view can be retrieved, as opposed to only part of the voxels being sampled in [19].

A 2D grid map is sufficient to index landmarks for service robot scenarios: since the robots move mostly on the ground of indoor environments, landmarks are widely distributed along the two axes parallel to the ground, while clustered along the height direction. Nevertheless, the method can be directly extended to a 3D grid if needed.

In our system, when a map is initialized, its coordinates are defined as originating from the current pose of the robot base (the transform from the camera frame to the base frame is usually available for service robots). Then the z-axis of the world coordinates will be perpendicular to the ground (this need not be very precise). By discretizing the x- and y-coordinates of each landmark, a 2D grid map is implicitly built. A hash table indexes all the landmarks with their grid coordinates as the key, so that given a map coordinate, the landmarks in the corresponding grid cell can be efficiently retrieved.

When a camera pose is given, we calculate the map positions of the eight vertices of its view frustum, as illustrated in Fig. 2.
A 2D convex hull is constructed with the x- and y-coordinates of the eight points. Then for all the grid cells within the axis-aligned bounding box of the hull, a 2D point-in-polygon test is performed to check if the cell center lies within the hull. If yes, then the landmarks in this cell will be retrieved, and a re-projection check is performed for each landmark to filter out those outside the camera's view. The whole operation can be executed in less than 1 millisecond in our implementation.

Whenever the server receives a new keyframe, the corresponding client handler will retrieve all the landmarks within the view of this keyframe, and send them to the client to augment its local map. To save bandwidth, the landmarks that were last updated by the same client are excluded from the message.
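The retrieval procedure above can be illustrated with the following sketch (our own simplification, not the paper's implementation: the cell size and helper names are made up, only four projected frustum vertices are used for brevity, and the final per-landmark re-projection check is omitted):

```python
import math
from collections import defaultdict

CELL = 1.0  # grid cell size in meters (a hypothetical choice)

grid = defaultdict(list)  # (gx, gy) -> list of (landmark_id, x, y, z)

def insert_landmark(lm_id, x, y, z):
    """Index a landmark by the grid cell of its x/y map coordinates."""
    grid[(math.floor(x / CELL), math.floor(y / CELL))].append((lm_id, x, y, z))

def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def _inside(hull, p):
    """Point-in-convex-polygon test for a CCW hull."""
    n = len(hull)
    return all(_cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def query_landmarks(frustum_xy):
    """Retrieve landmarks whose grid cell center lies in the convex hull
    of the frustum vertices projected onto the ground plane."""
    hull = convex_hull(frustum_xy)
    xs, ys = [p[0] for p in hull], [p[1] for p in hull]
    found = []
    # Scan only cells inside the hull's axis-aligned bounding box.
    for gx in range(math.floor(min(xs) / CELL), math.floor(max(xs) / CELL) + 1):
        for gy in range(math.floor(min(ys) / CELL), math.floor(max(ys) / CELL) + 1):
            if _inside(hull, ((gx + 0.5) * CELL, (gy + 0.5) * CELL)):
                found.extend(grid.get((gx, gy), []))
    return found

insert_landmark(1, 0.5, 0.5, 1.2)   # inside the query region below
insert_landmark(2, 5.5, 0.5, 1.2)   # outside
res = query_landmarks([(-1.0, -1.0), (2.0, -1.0), (2.0, 2.0), (-1.0, 2.0)])
assert [lm[0] for lm in res] == [1]
```

Because the hash lookup per cell is O(1) and the number of candidate cells is bounded by the frustum footprint, the query cost is independent of the total map size.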
E. Global Optimization and Map Merging
A unified approach is used for intra-session and inter-session loop detection at the server side. When a client handler receives a newly generated keyframe, after adding it into the map database, it will put the keyframe ID into a queue shared between all the handlers. The global optimization module will retrieve each new keyframe from this queue and perform loop detection against all existing keyframes in the database. If a loop is found and verified between two keyframes in different maps, the smaller map will be merged into the larger one, by updating the map ID and world pose for the keyframes and landmarks in the former. A pose graph optimization is performed for each loop closure. During the optimization, the message processing of client handlers working on the related map(s) is paused. After the optimization, those handlers will process the accumulated messages, aligning the world poses onto the updated map coordinates. They will also help update the clients' local maps by sending updated keyframes and landmarks.
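The merge step above (smaller map absorbed into the larger one) can be sketched as follows. This is an illustrative simplification with hypothetical names: poses are reduced to 2D positions with an SE(2) alignment, whereas the real system works with full SE(3) poses and landmarks as well.

```python
import math

def make_se2(tx, ty, theta):
    """Rigid 2D transform mapping src-map coordinates into dst-map coordinates."""
    def apply(p):
        x, y = p
        c, s = math.cos(theta), math.sin(theta)
        return (c * x - s * y + tx, s * x + c * y + ty)
    return apply

def merge_maps(keyframes, src_map, dst_map, transform):
    """Move every keyframe of src_map into dst_map.

    keyframes: dict kf_id -> {"map_id": int, "pos": (x, y)}
    transform: alignment estimated from the verified loop closure
    """
    for kf in keyframes.values():
        if kf["map_id"] == src_map:
            kf["map_id"] = dst_map            # re-label the map membership
            kf["pos"] = transform(kf["pos"])  # re-express the world pose

kfs = {
    1: {"map_id": 0, "pos": (0.0, 0.0)},
    2: {"map_id": 1, "pos": (1.0, 0.0)},
}
# Suppose the loop closure places map 1's origin at (5, 0) in map 0, no rotation.
merge_maps(kfs, src_map=1, dst_map=0, transform=make_se2(5.0, 0.0, 0.0))
assert kfs[2]["map_id"] == 0
assert kfs[2]["pos"] == (6.0, 0.0)
```

After the re-labeling, a pose graph optimization over the combined map refines the initial alignment.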
F. Rigid Constraints for Multi-camera Robots
Service robots often face challenging visual conditions like featureless walls or serious occlusion, where visual SLAM is likely to fail [26]. This can be mitigated by equipping multiple cameras with different views, e.g. one front-facing and one rear-facing. In such cases, the rigid constraint between cameras shall be considered in order to a) improve accuracy when multiple cameras are working and b) keep tracking in the current map whenever at least one camera is working. Both goals are achieved in our system. We treat each camera as an independent client, and fuse their maps at the server side by connecting concurrent keyframes from any pair of clients that have a known rigid constraint. Whenever one client has lost tracking, it can easily get back into the previous map when re-initialized, with the help of the rigid constraints with other cameras and map merging.

Because the cameras are not synchronized and the keyframe selection is independent between clients, it may be rare that two keyframes are generated at a very close time. To relax this condition, we generate a virtual keyframe pose by interpolating between consecutive keyframes of one client, and add the constraint between this virtual keyframe and the keyframe from the other client. With this strategy, virtually concurrent keyframes can often be detected. A global optimization is invoked only if the relative pose between these two keyframes diverges from the pre-defined constraint beyond a certain threshold, or if they are on different maps.

IV. EXPERIMENTAL RESULTS
The proposed system is verified with both public datasets and live experiments. Firstly, we test the system with multiple data sequences from the OpenLORIS-Scene datasets [26]. The two largest scenes from the dataset are market and corridor, while cafe is much smaller. In the RGB-D sequences in these scenes, there are featureless areas and occluded cases where any visual odometry system is likely to fail. In our experiment, eight data sequences from the three scenes are concurrently played in real time, with eight clients subscribing to the RGB-D stream of each sequence. They are distributed on three machines: an Intel NUC mini-PC with an Intel Core i7-8809G, a desktop with an Intel Core i7-7820X, and a Dell laptop with an Intel Core i5-6300HQ. The server runs on another Intel NUC. The four machines are connected with an office network. Fig. 3 shows the intermediate and final maps at the server side, where keyframes and landmarks are visualized. It can be seen that while each client has initialized a separate map, those in the same scene are eventually merged together. It is worth noting that there are several more maps of corridor in the final state that are not shown in the figure. The reason is that some clients have built multiple maps after tracking failures, and some of the maps are not looped with the largest map due to significant viewpoint changes. We also note that the CPU usage on the server machine is only 100%-150% when handling all the eight clients, indicating that this quad-core machine can support even more clients.

To further verify the system, we test it with a real robot with two RealSense D455 RGB-D cameras. The two cameras are facing front and rear respectively, with no view overlap. Their relative pose is obtained by calibrating each with a third camera on a robot arm on this robot that can move either to the front or to the rear. The robot operates four sessions on a large office floor, exploring different areas in each session, while two clients are running on-board with the two cameras. An additional session is made to verify the system's capability with monocular cameras, where only the RGB stream of the front camera is used. The final server map after the five sessions is shown in Fig. 4. All the visited areas in the five sessions have been mapped together, even though occasionally one client gets lost when facing a white wall. The monocular camera also works well in the experiment and contributes to the map building.

Fig. 3. The visualized keyframes and landmarks at server side when it simultaneously handles 8 clients with different RGB-D sequences from the OpenLORIS-Scene datasets. Left: at an early time, 8 maps are built. Right: at a later time, they have been merged into 3 major maps. Best viewed in color.

Fig. 4. The map of an office area (approx. 90 x 60 meters) built with a dual-camera robot with 4 sessions and 1 monocular camera session (9 clients in total, shown in different colors). The zoomed-in trajectories in the lower left show a case where one camera gets lost (red circle) and then gets back to the map with the rigid constraint from the other camera (green circle). The zoomed-in trajectories in the middle show the monocular camera's keyframes in brown.

V. CONCLUSION
We present a collaborative visual SLAM framework particularly designed for service robot scenarios. With a client-server architecture and a simple communication protocol, the system is efficient and flexible enough to support multiple robots in collaboratively building and re-using maps. A novel landmark retrieval method is introduced to allow real-time information sharing between clients. The implemented system supports RGB-D cameras, monocular cameras, and multi-camera robots. Experimental results show that it can manage one or more large scenes, simultaneously support eight clients exploring different scenes, and work with real robots with multiple cameras. We hope that this research can push a step further towards a unified edge server interface to facilitate different service robots working together.

REFERENCES
[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[2] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, "Edge computing: Vision and challenges," IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[3] A. J. B. Ali, Z. S. Hashemifar, and K. Dantu, "Edge-SLAM: edge-assisted visual simultaneous localization and mapping," in Proceedings of the 18th International Conference on Mobile Systems, Applications, and Services, 2020, pp. 325–337.
[4] D. Zou, P. Tan, and W. Yu, "Collaborative visual SLAM for multiple agents: A brief survey," Virtual Reality & Intelligent Hardware, vol. 1, no. 5, pp. 461–482, 2019.
[5] C. Forster, S. Lynen, L. Kneip, and D. Scaramuzza, "Collaborative monocular SLAM with multiple micro aerial vehicles," IEEE, 2013, pp. 3962–3970.
[6] L. Riazuelo, J. Civera, and J. M. Montiel, "C2TAM: A cloud framework for cooperative tracking and mapping," Robotics and Autonomous Systems, vol. 62, no. 4, pp. 401–413, 2014.
[7] J. G. Morrison, D. Gálvez-López, and G. Sibley, "MOARSLAM: Multiple operator augmented RSLAM," in Distributed Autonomous Robotic Systems. Springer, 2016, pp. 119–132.
[8] P. Schmuck and M. Chli, "Multi-UAV collaborative monocular SLAM," IEEE, 2017, pp. 3863–3870.
[9] M. Karrer, P. Schmuck, and M. Chli, "CVI-SLAM - collaborative visual-inertial SLAM," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 2762–2769, 2018.
[10] P. Schmuck and M. Chli, "CCM-SLAM: Robust and efficient centralized collaborative monocular simultaneous localization and mapping for robotic teams," Journal of Field Robotics, vol. 36, no. 4, pp. 763–781, 2019.
[11] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," IEEE, 2007, pp. 225–234.
[12] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: a versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[13] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[14] S. Sumikura, M. Shibuya, and K. Sakurada, "OpenVSLAM: a versatile visual SLAM framework," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 2292–2295.
[15] R. Castle, G. Klein, and D. W. Murray, "Video-rate localization in multiple maps for wearable augmented reality," IEEE, 2008, pp. 15–22.
[16] D. Zou and P. Tan, "CoSLAM: Collaborative visual SLAM in dynamic environments," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 2, pp. 354–366, 2012.
[17] M. Kaess and F. Dellaert, "Visual SLAM with a multi-camera rig," Georgia Institute of Technology, Tech. Rep., 2006.
[18] J. Kuo, M. Muglikar, Z. Zhang, and D. Scaramuzza, "Redesigning SLAM for arbitrary multi-camera systems," arXiv preprint arXiv:2003.02014, 2020.
[19] M. Muglikar, Z. Zhang, and D. Scaramuzza, "Voxel map for visual SLAM," arXiv preprint arXiv:2003.02247, 2020.
[20] M. Bosse, P. Newman, J. Leonard, and S. Teller, "Simultaneous localization and map building in large-scale cyclic environments using the Atlas framework," The International Journal of Robotics Research, vol. 23, no. 12, pp. 1113–1139, 2004.
[21] T. Schneider, M. Dymczyk, M. Fehr, K. Egger, S. Lynen, I. Gilitschenski, and R. Siegwart, "Maplab: An open framework for research in visual-inertial mapping and localization," IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1418–1425, 2018.
[22] M. Labbé and F. Michaud, "RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation," Journal of Field Robotics, vol. 36, no. 2, pp. 416–446, 2019.
[23] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. Montiel, and J. D. Tardós, "ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM," arXiv preprint arXiv:2007.11898, 2020.
[24] D. Li, X. Shi, Q. Long, S. Liu, W. Yang, F. Wang, Q. Wei, and F. Qiao, "DXSLAM: A robust and efficient visual SLAM system with deep features," Oct 2020, pp. 4958–4965.
[25] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," IEEE, 2011, pp. 2564–2571.
[26] X. Shi, D. Li, P. Zhao, Q. Tian, Y. Tian, Q. Long, C. Zhu, J. Song, F. Qiao, L. Song et al., "Are we ready for service robots? The OpenLORIS-Scene datasets for lifelong SLAM," in 2020 IEEE International Conference on Robotics and Automation (ICRA)