Communication constrained cloud-based long-term visual localization in real time
Xiaqing Ding, Yue Wang, Li Tang, Huan Yin, Rong Xiong

Abstract — Visual localization is one of the primary capabilities for mobile robots. Long-term visual localization in real time is particularly challenging, in which the robot is required to efficiently localize itself using visual data whose appearance may change significantly over time. In this paper, we propose a cloud-based visual localization system targeting long-term localization in real time. On the robot, we employ two estimators to achieve accurate and real-time performance. One is a sliding-window based visual inertial odometry, which integrates constraints from consecutive observations and self-motion measurements, as well as the constraints induced by localization on the cloud. This estimator builds a local visual submap as the virtual observation, which is then sent to the cloud as new localization constraints. The other is a delayed-state Extended Kalman Filter that fuses the pose of the robot localized from the cloud, the local odometry and the high-frequency inertial measurements. On the cloud, we propose a longer sliding-window based localization method to aggregate the virtual observations for a larger field of view, leading to more robust alignment between virtual observations and the map. Under this architecture, the robot can achieve drift-free and real-time localization using onboard resources, even on a network with limited bandwidth, high latency and packet loss, which enables autonomous navigation in real-world environments. We evaluate the effectiveness of our system on a dataset with challenging seasonal and illuminative variations. We further validate the robustness of the system under challenging network conditions.
I. INTRODUCTION
Localization is one of the fundamental requirements for autonomous robots such as driverless cars, inspection robots and service robots. Visual localization is of great interest as cameras are mature consumer-market sensors, which is especially favorable in places where GPS is unavailable. The main challenge that prevents visual localization from deployment on robots is robustness against changing appearance. Specifically, as the environment is not static in the long term, the appearance can be affected by many factors, such as illumination, season and weather. Some methods have been proposed to address this challenge by improving the feature matching [1], promoting the search effectiveness [2], and introducing invariant high-level semantics [3]. Unfortunately, most methods are computationally intensive for an onboard processing unit carried by the robot. Furthermore, storing and loading the whole map also consume considerable resources.

Xiaqing Ding, Yue Wang, Li Tang, Huan Yin, Rong Xiong are with the State Key Laboratory of Industrial Control and Technology, Zhejiang University, Hangzhou, P.R. China. Yue Wang is the corresponding author [email protected]. Rong Xiong is the co-corresponding author [email protected].

Fig. 1: The framework of the cloud-based visual inertial localization method. The method consists of a client subsystem which runs on the robot for real-time pose estimation and a cloud subsystem designed for localization with a prior laser map. "KFs" is short for keyframes and "MPs" is short for map points. "comm" represents the communication module that is used for message transportation.

With the development of cloud computing, part of the computational burden on the robot can be transferred to the cloud to achieve robust localization in real time with limited onboard resources. Cloud robotics has demonstrated its potential in wide applications such as assembly, delivery and planning. Visual simultaneous localization and mapping (VSLAM) on the cloud has also gained popularity recently [4], [5]. In these works, the main focus is to build a consistent map on the cloud, while the drifted position feedback on the robots provided by visual odometry might not satisfy the requirement of autonomous navigation. There are also some works dealing with visual localization on the cloud [6], [7]. For example, in [6], the cloud receives a keyframe from the robot, estimates the global pose and sends it back to the robot. The robot then integrates the global pose into its local visual inertial odometry (VIO) to eliminate the drift. However, when this method is applied to long-term localization, pose estimation from only one keyframe might be unreliable due to insufficient inliers, leading to unstable local estimation. In addition, these works rarely discuss the challenges of data transmission such as packet loss, network delay and limited communication bandwidth, which could be a bottleneck of the system in real scenarios. In order to improve the robustness of localization, there is another line of methods that aggregates the local observations to generate a larger field of view for localization [8], [9]. Although this is a promising direction for long-term applications, it may not be directly applicable to cloud robotics, as these methods use a sliding-window based estimator that requires multiple frames, leading to more data transmitted through the network.
Therefore, the core problem for long-term localization on the cloud is the balance between the estimation quality and the network communication. In this paper, we propose a long-term localization system for cloud robotics. To improve the robustness of localization, we consider two sliding windows in the loop to overcome the instability of the single-frame estimator. To address the bottleneck of data transmission, we propose an information sparsification method to effectively transmit compressed data through the limited-bandwidth communication. We also adopt a delayed-state filter to compensate for network latency and packet drop. As a result, the local estimate of the robot, fusing the VIO and the localization information, can be drift-free in real time, which is important for autonomous navigation. In summary, the contributions of the paper are as follows:
• We divide the double sliding-window estimator between the robot and the cloud to aggregate the local observations into larger virtual submap observations, which achieves more robust alignment against the map.
• We present an information reduction method based on sparsification to control the data sent through the network, so that the requirement of the limited bandwidth can be satisfied.
• We build a system for long-term visual localization on the cloud that achieves onboard drift-free pose estimation in real time, which is indispensable for autonomous navigation.
• We conduct experiments on real-world data across seasons to demonstrate the performance of the system in long-term localization, even under challenging network conditions.
We organize the remainder of this paper as follows. In Section II related works on long-term visual localization and cloud-based architectures are introduced. We give an overview of our framework in Section III. The details of the cloud-based visual localization system are presented in Section IV. In Section V, the experimental results are analyzed to validate the effectiveness of the proposed method. Conclusions and some thoughts about future work are given in Section VI.

II. RELATED WORKS
VIO uses low-cost visual sensors aided by inertial instruments to provide precise and high-frequency relative pose estimation along the robot trajectory. With the development of VIO using onboard resources, real-time positional feedback for robot autonomous navigation becomes possible. Generally, there are two branches of methods in this area. The first branch utilizes a nonlinear filter to estimate the pose, which is very efficient and lightweight, and thus suitable for mobile platforms with limited computational resources [10], [11]. The other branch leverages nonlinear optimization techniques based on local keyframes, i.e. local bundle adjustment [12], [13]. The optimization based methods can achieve higher performance than the filter based solutions, but require more computational resources when the sliding window is long. However, the positional feedback of both branches is only reliable over short periods, as VIO drifts over long durations, calling for correction from localization.

Visual localization in the long term remains an open question in the community, as the feature matching involves lots of outliers. Some researchers resort to the more robust laser map for localization. In [14], multi-session laser and visual data are used to optimize the laser map and extract a salient and stable subset for visual localization. [15] leverages stable depth information recovered from stereo cameras for registration with a prior laser map. Another way to improve localization performance in changing environments is to aggregate the variation over the long term. In [16]–[19], a topological graph is proposed to manage experiences from multiple sessions of data, where nodes encode sensory data and edges encode relative transformations. Whenever localization fails, measurements are added to the map to enrich diversity, which helps future localization. Note that both classes of methods call for intensive computation to match features in a large map and to solve sliding-window based optimization for more robust pose estimation.

Instead of local features, global image features have been shown to be more robust to appearance change. [8], [20] find the topological localization through place recognition and give satisfactory performance. In spite of the faster computation, this class of methods fails to provide metric localization, and is thus insufficient as positional feedback for navigation.

Cloud computing provides a potential solution to real-time resource-aware SLAM and localization systems, where intensive computation can be transferred to high-end servers, leaving lightweight algorithms running on the robots. [6] uses keyframes to track the pose of the camera locally and sends keyframes to the server for global localization. The single-frame localization suffers from a large percentage of outliers when appearance variation exists, thus bringing unstable estimation to the robot. [5] also utilizes single-image data to infer loop closures, and is thus affected by a similar problem. In addition, the images are sent over the network in these two methods, which may require high bandwidth. Some methods designed for cooperative systems provide insights into the resource-aware problem. [21] proposes an efficient solution for the cooperative mapping problem by introducing augmented variables to parallelize the computation. To decrease the bandwidth requirement, [4] employs sparsification methods to compress the edges, which is also utilized in our system.

III. SYSTEM OVERVIEW
To achieve robustness against long-term variations, we employ a laser map to align the visual observations so that the robot can be localized [14]. The reason for using a laser map is that the 3D geometric structure is invariant to illuminative and perspective changes. Therefore the alignment is designed based on the geometric error between the local visual map points and the laser map points. This error is incorporated into the sliding window of the local VIO to achieve tightly coupled optimization, so that the accuracy of the laser map can be utilized to constrain the odometry. In this process, the intensive computation lies in the nearest-neighbor search for data association and the nonlinear system optimization. In addition, the laser map requires memory for storage. Based on this analysis, we transfer the data association and the map to the cloud, and approximate the nonlinear system optimization with two sub-problems for the robot and the cloud. Note that the system presented in this paper can be applied to other types of observation models.
A. Notation
We first introduce some notation that will be used throughout the paper. We denote the inertial measurement unit (IMU) frame as $B$ and the camera frame as $C$. For each client robot, the coordinate frame of its VIO is represented as $O$, and the coordinate frame of the laser map saved on the cloud is denoted as $W$. We utilize $T_{ij} \in SE(3)$ to denote the transformation between frames $i$ and $j$, in which the rotation matrix is $R_{ij}$ and the translation vector is $t_{ij}$. The transformation $T_{CB}$ between the IMU and the camera is assumed known. The localization output at timestamp $k$ is defined as the pose of the IMU represented in the laser map, $T_{WB}(k)$.

During VIO, the state of the robot at timestamp $k$ is defined as the combination of the pose $T_{OB}(k)$, the velocity $v_{OB}(k)$, the accelerometer bias $b_a(k) \in \mathbb{R}^3$ and the gyroscope bias $b_g(k) \in \mathbb{R}^3$. In this paper the visual map points are denoted as $M_v = \{p_{v_i}\}$ and the laser map points as $M_m = \{p_{m_i}\}$.
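For concreteness, the per-timestamp state defined above can be captured in a small data structure. The following C++ sketch uses Eigen; all field names are our own and merely mirror the notation of this section, not the released implementation.

```cpp
#include <Eigen/Core>
#include <Eigen/Geometry>

// State of the robot at timestamp k, expressed in the VIO frame O.
// Field names are illustrative; they follow the notation of Sec. III-A.
struct RobotState {
  Eigen::Quaterniond R_OB;   // rotation part of T_OB(k)
  Eigen::Vector3d    t_OB;   // translation part of T_OB(k)
  Eigen::Vector3d    v_OB;   // velocity v_OB(k)
  Eigen::Vector3d    b_a;    // accelerometer bias b_a(k), in R^3
  Eigen::Vector3d    b_g;    // gyroscope bias b_g(k), in R^3
  double             stamp;  // timestamp k, in seconds
};
```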
B. Cloud-based visual inertial localization system

The whole system is divided into the client subsystem deployed on the robot and the cloud subsystem, as shown in Fig. 1. A wireless network is used to transmit data between the two subsystems: from the client subsystem the selected states and information are sent to the cloud, and the localization result as well as the optimized states after map-based optimization are sent back to the client.
1) Client subsystem:
The client subsystem consists of a sliding-window based VIO and a delayed-state Extended Kalman Filter (EKF) based client-cloud state fusion estimator.

The sliding-window based VIO calculates the pose $T_{OB}(t)$. When a new keyframe is created, its state and the other related keyframes in the sliding window, as well as their observed visual map points, are optimized. If information sent from the cloud is related to the states in the sliding window, we also include this information in the optimization. After optimization we build a set of new edges between the poses of keyframes and select a reliable subset of local visual points to form a virtual observation, the submap, which is aggregated from keyframes in the sliding window for localization on the cloud.

Fig. 2: The coordinate frames and their relationships as used in this paper.

The delayed-state EKF based client-cloud state fusion estimator utilizes the IMU information for integration, and uses the localization results from the cloud as well as the local odometry results for update. The result of this filter is regarded as the output of the whole localization system on the robot. This architecture makes the localization system less sensitive to the packet loss and latency occurring in the network, and the frequency of the output is high enough to serve as real-time positional feedback.
2) Cloud subsystem:
The cloud subsystem performs the laser map aided localization optimization after receiving the packets from the client, utilizing the cloud resources to deal with the data association and the numerical optimization of the large nonlinear system. The whole map is pre-loaded in memory for faster computation, which may not be practical with the limited resources on the robot. In addition, the virtual observations sent from the robot are further aggregated in a sliding window on the cloud, so that the alignment against the map can be more robust. After optimization, the localization result as well as the optimized states that might still exist in the current sliding window of the robot are sent back from the cloud, subject to the computational time and network latency.
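The cloud side keeps a database of received keyframes and map points; when a new submap arrives, already-registered entries are updated and new ones inserted (Section IV-B describes this step in detail). A minimal C++ sketch of that bookkeeping follows; all types and member names are our own illustrative assumptions.

```cpp
#include <unordered_map>
#include <vector>
#include <Eigen/Geometry>

// Illustrative record for one keyframe in the cloud-side map database.
struct KeyframeRecord {
  Eigen::Isometry3d T_WB;            // latest state estimate
  std::vector<int> observed_points;  // ids of observed visual map points
};

// Update-or-insert step performed when a new submap arrives on the cloud.
class CloudDatabase {
 public:
  void integrate(int kf_id, const KeyframeRecord& incoming) {
    auto it = keyframes_.find(kf_id);
    if (it != keyframes_.end()) {
      // Already registered: refresh the state and append new observations.
      it->second.T_WB = incoming.T_WB;
      it->second.observed_points.insert(it->second.observed_points.end(),
                                        incoming.observed_points.begin(),
                                        incoming.observed_points.end());
    } else {
      keyframes_.emplace(kf_id, incoming);  // first time seen: insert
    }
  }
 private:
  std::unordered_map<int, KeyframeRecord> keyframes_;
};
```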
IV. SYSTEM MODULES
A. Sliding-window based visual inertial odometry
On the client robot, the sliding-window based VIO is implemented based on the architecture in [22]. The robot state is tracked in the front end. When a frame is selected as a new keyframe, the inertial measurement aided local bundle adjustment is executed, which minimizes the reprojection, preintegration and bias error terms related to the sliding window. The graph model is described in Fig. 3(a), in which the reprojection error for the $i$th visual point $p_{v_i}$ and the $j$th keyframe is formulated as

$e_{reproj}(i,j) = \rho((\pi(p_{v_i}, T_{OC_j}) - u_{i,j})^T \, \Omega^{ba}_{i,j} \, (\pi(p_{v_i}, T_{OC_j}) - u_{i,j}))$  (1)

where $\rho(\cdot)$ is the robust kernel and $\Omega$ represents the corresponding information matrix. $T_{OC_j}$ denotes the camera pose of the $j$th keyframe, which can be deduced from the corresponding IMU pose $T_{OB_j}$ and the calibration term $T_{CB}$. $u_{i,j}$ represents the 2D image feature point of $p_{v_i}$ observed by keyframe $j$, and $\pi(\cdot,\cdot)$ denotes the reprojection function.

Fig. 3: The graph models used in this paper. (a) denotes the original graph model of the inertial measurement aided bundle adjustment; the length of the sliding window is 4 in this case. (b) demonstrates the marginalization results within the Markov blanket (marked out with a dashed line) after the nodes of velocity and bias are removed from the graph. (c) denotes the graph model with the virtual observations built from the information in (b). (d) is the model used in our VIO system, which merges the constraints sent from the cloud. (e) relates to the graph model in the localization optimization.

The preintegration error term $e_{preint}$ and the bias error term $e_{bias}$ for keyframes $i$ and $k$ are defined as

$e_{preint}(i,k) = \rho([e_R^T, e_p^T, e_v^T] \, \Omega^{pre}_{i,k} \, [e_R^T, e_p^T, e_v^T]^T)$  (2)

$e_{bias}(i,k) = \rho(e_b^T \, \Omega^{bias}_{i,k} \, e_b)$  (3)

where the error terms $e_R$, $e_p$, $e_v$ and $e_b$ are formulated as the rotational, positional, velocity and bias errors following [13]. The cost function $E_{iba}$ is then

$E_{iba} = \sum e_{reproj} + \sum e_{preint} + \sum e_{bias}$  (4)

After optimization, the graph information is organized as a submap and sent to the cloud for localization. To decrease the demand on bandwidth, the velocity and bias nodes as well as some unstable visual points are removed from the graph.
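As an illustration of the reprojection term (1), the following C++ sketch evaluates the robustified error for one point-keyframe pair. A pinhole camera model and a Huber kernel are our assumptions here; the paper specifies neither.

```cpp
#include <Eigen/Core>
#include <Eigen/Geometry>
#include <cmath>

// Huber robust kernel rho(.) applied to a squared error; a common choice,
// assumed here since the paper does not name its kernel.
double huber(double squared_err, double delta) {
  const double e = std::sqrt(squared_err);
  return (e <= delta) ? squared_err : 2.0 * delta * e - delta * delta;
}

// Reprojection error of visual point p_v_i observed by keyframe j, eq. (1).
// T_OC_j maps camera coordinates into the odometry frame O, so the point is
// first pulled back into the camera frame and then projected.
// fx, fy, cx, cy: pinhole intrinsics (assumed model; point is in front of
// the camera, i.e. pc.z() > 0).
double reprojError(const Eigen::Isometry3d& T_OC_j,
                   const Eigen::Vector3d& p_v_i,
                   const Eigen::Vector2d& u_ij,
                   const Eigen::Matrix2d& Omega,
                   double fx, double fy, double cx, double cy) {
  const Eigen::Vector3d pc = T_OC_j.inverse() * p_v_i;  // camera frame
  const Eigen::Vector2d proj(fx * pc.x() / pc.z() + cx,
                             fy * pc.y() / pc.z() + cy); // pi(p_v_i, T_OC_j)
  const Eigen::Vector2d r = proj - u_ij;
  return huber(r.dot(Omega * r), 1.0 /* delta: tuning parameter */);
}
```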
1) Marginalize the velocity and bias nodes:
The velocity and bias nodes have strong correlations with the other states in the graph. Directly marginalizing these nodes would produce a dense information matrix that requires more bandwidth for transmission and a customized optimizer on the cloud to deal with the n-ary factor. Nevertheless, the Markov blanket of the velocity and bias nodes does not include the visual map points. We build a set of new binary edges between the poses based on our previous work [23], which calculates the optimal representation of the marginal distribution under the Kullback-Leibler divergence.

After local bundle adjustment, the optimal local linearization points of the states in the Markov blanket have been estimated. We first compute the information matrix $\Lambda_m$ for the graph in the Markov blanket. For readability, we denote $z_{i,j} \in Z$ as the edge between nodes $i$ and $j$ in the graph, and $J_{i,j}$ as the corresponding Jacobian matrix at the linearization points. The information matrix $\Lambda_m$ can be derived as

$\Lambda_m = \sum_{z_{i,j} \in Z} J_{i,j}^T \Lambda_{i,j} J_{i,j}$  (5)

Note that we set the oldest state in the Markov blanket as fixed so that the information matrix is invertible. By permuting the entries, we define $\Lambda'_m$ as

$\Lambda'_m = \begin{bmatrix} \Lambda_{ss} & \Lambda_{sr} \\ \Lambda_{sr}^T & \Lambda_{rr} \end{bmatrix}$

where the subblock related to the kept poses is $\Lambda_{ss}$, and the subblock related to the removed nodes of velocity and bias is $\Lambda_{rr}$. The subblock $\Lambda_{sr}$ encodes the correlations between the kept poses and the removed nodes. The Schur complement $\Lambda_t$ of the removed velocity and bias nodes is

$\Lambda_t = \Lambda_{ss} - \Lambda_{sr} \Lambda_{rr}^{-1} \Lambda_{sr}^T$  (6)

This information matrix is dense and can be regarded as an n-ary edge, as shown in Fig. 3(b). To keep the sparsity of the graph, we utilize the method proposed in [23] to build a new set of binary edges, the virtual odometry, between the consecutive poses, as shown in Fig. 3(c). We define the Jacobian matrix of the virtual odometry measurement model linearized at the current points as $A$, and the corresponding block-diagonal information matrix we want to recover as $X$:

$X = \mathrm{diag}(X_1, \ldots, X_l)$  (7)

in which $l$ denotes the length of the sliding window. The matrix $A$ should be invertible as we have fixed the oldest pose in this Markov blanket, so each subblock $X_i$ in the desired information matrix $X$ can be calculated as

$X_i = (\{A \Lambda_t^{-1} A^T\}_i)^{-1}$  (8)

where $\{\cdot\}_i$ indicates the $i$th diagonal block with the proper dimension.

2) Select the robust visual map points: The visual map points in the sliding window might have large depth uncertainty, making them unreliable to match with the laser map in 3D space. Considering the robustness of the cloud-based localization, and to further decrease the bandwidth requirement, we only keep those visual points that are observed more than $\theta$ times.
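Equations (5)-(8) above translate into a few lines of dense linear algebra. The sketch below computes the Schur complement of the removed velocity and bias blocks and recovers the block-diagonal virtual-odometry information; the 6-DoF block size and all variable names are our assumptions, and this is only our reading of the sparsification step of [23], not its released code.

```cpp
#include <Eigen/Dense>
#include <vector>

// Marginalize the velocity/bias blocks out of the Markov-blanket
// information matrix (eq. (6)) and recover a block-diagonal approximation
// of the resulting dense prior as per-edge virtual-odometry information
// (eq. (8)).
std::vector<Eigen::MatrixXd> sparsify(
    const Eigen::MatrixXd& L_ss,  // kept-pose block of permuted Lambda'_m
    const Eigen::MatrixXd& L_sr,  // cross block
    const Eigen::MatrixXd& L_rr,  // removed velocity/bias block
    const Eigen::MatrixXd& A)     // Jacobian of the virtual-odometry model
{
  // Schur complement: dense marginal information over the kept poses.
  const Eigen::MatrixXd L_t =
      L_ss - L_sr * L_rr.ldlt().solve(L_sr.transpose());

  // Project the marginal covariance through the virtual-odometry model,
  // then keep only the diagonal blocks: X_i = ({A L_t^{-1} A^T}_i)^{-1}.
  const Eigen::MatrixXd cov = A * L_t.ldlt().solve(A.transpose());

  std::vector<Eigen::MatrixXd> X;
  const int block = 6;  // one SE(3) edge between consecutive poses
  for (int i = 0; i + block <= cov.rows(); i += block)
    X.push_back(cov.block(i, i, block, block).inverse());
  return X;
}
```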
3) Merge information from cloud-based optimization:
After the localization optimization on the cloud side, besides the localization results we also send the optimized relative transformations between keyframes and the positions of visual map points back to the client robot. These measurements are defined as constraints for the VIO, as shown in Fig. 3(d). During bundle adjustment, if a new packet of information from the cloud is received, the robot checks whether it contains information related to the states in the current sliding window. With this step, dynamic network latency can be accommodated when updating the states. If a visual map point $\tilde{p}_{v_i}$ sent from the cloud still exists in the sliding window, its optimized position is used as a measurement for $p_{v_i}$ in the sliding window, and we denote it as the prior point error

$e_{pp}(i) = \rho((\tilde{p}_{v_i} - p_{v_i})^T \, \Omega^{pt}_i \, (\tilde{p}_{v_i} - p_{v_i}))$  (9)

If keyframes $i$ and $j$ related to a relative transformation $\tilde{T}_{ij}$ sent by the cloud exist in the current sliding window, $\tilde{T}_{ij}$ is utilized as a constraint between the poses of keyframes $i$ and $j$, and we denote it as the relative transformation error

$e_{rt} = \rho(\mathrm{Log}((\tilde{T}_{ij})^{-1} T_{ij})^T \, \Omega^{rt} \, \mathrm{Log}((\tilde{T}_{ij})^{-1} T_{ij}))$  (10)

where $\mathrm{Log}(\cdot)$ follows the definition in [24]. During bundle adjustment, the error terms (9) and (10) as well as the original error terms (1), (2) and (3) are optimized together, through which the precise geometric information of the laser map can be used to promote the performance of the VIO. The cost function of our cloud-aided bundle adjustment is now formulated as

$E_{iba} = \sum e_{reproj} + \sum e_{preint} + \sum e_{bias} + \sum e_{pp} + \sum e_{rt}$  (11)

Note that the constraints (9) and (10) are not used to construct the information matrix $\Lambda_m$ in (5), to avoid information reuse.
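A minimal sketch of how the returned points could be merged, assuming integer point ids and simple containers (both our own choices): only constraints whose states are still inside the sliding window contribute a prior point error (9); stale ones are dropped.

```cpp
#include <Eigen/Dense>
#include <unordered_map>
#include <vector>

// One optimized point received from the cloud (illustrative type).
struct CloudPoint { int id; Eigen::Vector3d p_tilde; Eigen::Matrix3d Omega; };

// Accumulate e_pp over a received packet; points that have already left
// the sliding window are skipped, which is how dynamic network latency is
// tolerated. Robust kernel omitted for brevity.
double mergePriorPoints(
    const std::vector<CloudPoint>& packet,
    const std::unordered_map<int, Eigen::Vector3d>& window_points) {
  double cost = 0.0;
  for (const CloudPoint& cp : packet) {
    auto it = window_points.find(cp.id);
    if (it == window_points.end()) continue;  // point left the window
    const Eigen::Vector3d r = cp.p_tilde - it->second;
    cost += r.dot(cp.Omega * r);              // e_pp(i), eq. (9)
  }
  return cost;
}
```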
B. Laser map aided localization on the cloud

The constraints of the VIO are sparsified and sent from the client robot after the bundle adjustment optimization. The information within the transmitted submap is defined with respect to the VIO frame. The transformation $T_{WO}$ between the origins of the VIO and the laser map is initialized at the beginning of localization as the first pose of the robot with respect to the laser map. This term is used to align the states within the transmitted submap with the laser map frame:

$p^W_v = T_{WO} \, p^O_v, \qquad T_{WB} = T_{WO} \, T_{OB}, \qquad X = R_{WO} \, X \, (R_{WO})^T$  (12)

Here we do not include the variable $T_{WO}$ as a node for optimization. Instead, the oldest state in the newly arrived sliding window is set as adjustable. The point-to-point and point-to-plane constraints introduced from the laser map are formulated as

$e_{ptp}(k,j) = \rho((p^W_{m_j} - p^W_{v_k})^T \, \Omega^{ptp}_{k,j} \, (p^W_{m_j} - p^W_{v_k}))$  (13)

$e_{ptl}(k,j) = \rho((r_n(k,j) \, n_{p_{m_j}})^T \, \Omega^{ptl}_{k,j} \, (r_n(k,j) \, n_{p_{m_j}}))$  (14)

where

$r_n(k,j) = (p^W_{m_j} - p^W_{v_k})^T \, n_{p^W_{m_j}}$  (15)

and $n_{p^W_{m_j}}$ denotes the normal vector of the laser map point $p^W_{m_j}$. Combined with the reprojection measurements and the virtual odometry measurements received from the robot, the cost function for localization optimization is formulated as

$E_{loc} = \sum e_{reproj} + \sum e_{rt} + \sum e_{ptp} + \sum e_{ptl}$  (16)

which is represented as the graph model shown in Fig. 3(e). After optimization, $T_{WO}$ is updated and saved for each client robot. Since this variable does not drift quickly, the localization can survive even if it cannot receive information from the client robot frequently due to network latency or packet loss.

The length of the sliding window for VIO cannot be long, in order to satisfy the real-time requirement. For localization, however, the result might be trapped in local minima if the environment has changed a lot. In this cloud-based framework, we keep a longer sliding window on the cloud side to merge more environmental information for more accurate and robust localization. When a new submap is received, we check whether its visual map points and keyframes have already been registered in the map database. If a point or keyframe has been registered, the corresponding item in the database is updated by applying the new state estimate and appending new observations if possible; otherwise a new item is inserted into the database. After the update, a long sliding window related to the latest keyframe $k$ is established for the localization optimization. Then the localization result $T_{WB_k}$ as well as the optimized points and relative transformations are sent back to the robot. Note that to decrease the bandwidth burden, only the transformations among the latest $l$ keyframes ($l$ is the length of the sliding window in the VIO) and their related visual map points are sent to the robot.
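The map-alignment residuals (13)-(15) reduce to a few vector operations. In the sketch below, scalar weights stand in for the information matrices $\Omega$, which is a simplification of ours, and the robust kernel is omitted.

```cpp
#include <Eigen/Dense>

// Point-to-point residual (eq. (13)) between a visual map point, already
// transformed into the laser-map frame W via eq. (12), and its associated
// laser map point.
double pointToPoint(const Eigen::Vector3d& p_m, const Eigen::Vector3d& p_v,
                    double w) {
  const Eigen::Vector3d r = p_m - p_v;
  return w * r.squaredNorm();
}

// Point-to-plane residual (eqs. (14)-(15)); n is the unit normal at the
// laser map point p_m. With a scalar weight, the weighted 3-vector
// residual r_n * n collapses to w * r_n^2.
double pointToPlane(const Eigen::Vector3d& p_m, const Eigen::Vector3d& p_v,
                    const Eigen::Vector3d& n, double w) {
  const double r_n = (p_m - p_v).dot(n);  // eq. (15)
  return w * r_n * r_n;
}
```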
C. Client-cloud state fusion

After optimization on the cloud, the localization result $T_{WB}$ is sent to the client robot at a low frequency, which cannot satisfy the navigation requirement. The delayed-state EKF based client-cloud state fusion estimator is implemented on the robot side to merge the localization results with the local odometry at a higher frequency. We design the state fusion module based on the multi-sensor fusion framework [25]. This module outputs the fused estimate at the frequency of the IMU. When the client robot receives the first localization result from the cloud, an initialization is performed to update the information of $T_{WO}$. During operation, the current state $T_{WB}(t)$ of the robot at timestamp $t$ is integrated from the IMU messages, and the state at timestamp $t$ is updated upon receiving odometry data $T_{OB}(t - t_o)$ from the client robot or localization data $T_{WB}(t - t_l)$ from the cloud. Note that $T_{WO}$ is also formulated as a variable for estimation, which aims to compensate for the drift of the visual inertial odometry.
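A minimal sketch of the delayed-state bookkeeping, assuming a bounded buffer of past filter states (the buffer size and maximum accepted delay are our own tuning placeholders): a cloud result stamped in the past is matched to the closest buffered state, the update is applied there, and the IMU propagation is then replayed up to the current time.

```cpp
#include <deque>
#include <cmath>
#include <Eigen/Dense>

// Simplified placeholder for one filter state (mean and covariance).
struct FilterState { double stamp; Eigen::VectorXd x; Eigen::MatrixXd P; };

class DelayedStateBuffer {
 public:
  void push(const FilterState& s) {
    buf_.push_back(s);
    while (buf_.size() > kMaxStates) buf_.pop_front();  // bounded history
  }

  // Find the buffered state closest to the (delayed) measurement stamp.
  // Returns nullptr if the measurement is older than the accepted delay,
  // in which case it is discarded: it can no longer be fused consistently.
  // After a successful update, propagation is replayed from this state
  // to the newest one using the buffered IMU data (not shown).
  FilterState* findClosest(double stamp) {
    FilterState* best = nullptr;
    for (FilterState& s : buf_)
      if (!best || std::abs(s.stamp - stamp) < std::abs(best->stamp - stamp))
        best = &s;
    return (best && best->stamp >= stamp - kMaxDelay) ? best : nullptr;
  }

 private:
  static constexpr std::size_t kMaxStates = 1000;
  static constexpr double kMaxDelay = 10.0;  // seconds, tuning parameter
  std::deque<FilterState> buf_;
};
```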
V. EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed cloud-based visual inertial localization method, we test it on the real-world YQ dataset that was also used for the experiments in our previous work [14]. The dataset was collected on a campus which includes many dynamics, such as pedestrians and cyclists, and significant appearance change across seasons and weather, as shown in Fig. 4. We utilize the same laser map as in [14] for visual localization, which was built from 21 sessions of data collected along almost the same route within three days in early spring 2017. The data used for localization was collected in summer and winter with an MTi 100 IMU and a pair of Pointgrey stereo cameras. We also collected laser data from a VLP-16 Velodyne LiDAR for groundtruth generation, following other visual localization methods such as [26]. We list the starting time and duration of the data collection in TABLE I. The session "2017/08/24" was collected along the opposite direction of the route used for map building, while the others follow the same direction. The session "2018/01/29" was collected after snow, in which the environment had changed a lot compared with the one used for map building.

With the development of wireless communication, many low-power, wide-area (LPWA) technologies have been studied to enable communications among things over a wide area [27], which also provides powerful tools for the expansion of mobile robots. Among them, promising technologies such as narrowband internet of things (NB-IoT) have large network latency of up to 10 s and low bandwidth of about 250 KB/s. In our experiments, besides the performance under long-term changing environments, we also test the robustness under such network conditions.

We utilize a low-power laptop with an Intel i7-7500U CPU as the client, and a computer with an Intel i7-6700 CPU at 3.40 GHz and 16 GB RAM as the cloud. All of the code is implemented in C++.

TABLE I: The overview of the YQ dataset
Map Building Dataset (early spring)

Day        | Starting Times
2017/03/03 | 07:52, 09:20, 10:23, 11:48, 12:59, 14:34, 16:05, 17:38
2017/03/07 | 07:43, 09:06, 10:19, 12:40, 14:35, 16:28, 17:25, 18:07
2017/03/09 | 09:06, 10:03, 11:25, 15:06, 16:31

Localization Testing Dataset (summer and winter)

Starting Time       | Duration
2017/08/23 09:40:13 | 16:31
2017/08/24 09:21:41 | 13:21
2017/08/27 15:22:11 | 17:03
2017/08/28 17:06:06 | 17:15
2018/01/29 11:09:15 | 14:59
Fig. 4: The satellite imagery of the YQ dataset and some scenes captured at different times.

TABLE II: The ATE results (m) of the localization method tested on the wireless network
Sequence   | ATE (median) | ATE (std)
2017/08/23 | 0.600        | 0.334
2017/08/24 | 0.587        | 0.528
2017/08/27 | 0.564        | 0.439
2017/08/28 | 0.623        | 0.354
2018/01/29 | 0.460        | 0.318
average    | 0.567        | 0.395
A. Evaluation in the real-world situation
We first test the effectiveness of the proposed cloud-based visual localization method in the real-world situation. The communication between the client and the server is set up using the wireless network on the campus. The Absolute Trajectory Error (ATE) [28] is utilized to evaluate the performance of the localization method. We show the results of each session of data in TABLE II and draw the localization trajectories in the first row of Fig. 5. For comparison, we also draw the trajectories of the VIO in the second row of Fig. 5. As the results show, the proposed cloud-based long-term visual localization method gives satisfactory results in every localization session, bearing illumination, season or even heavy viewpoint changes. Compared with the VIO, our method gives the robot a long-term drift-free pose estimate, making navigation possible. During the whole localization process, the bandwidth usage is around 40 KB/s for transmitting all messages from the robot to the cloud, and around 20 KB/s for transmitting from the cloud to the robot, both of which are much lower than the wireless connection standard.
B. Evaluations under different network conditions
A cloud-based localization method should be robust under different network conditions. In this subsection we analyze the performance of the proposed localization method when facing network latency and packet loss.

Fig. 5: The ATE errors mapped onto the trajectories for each session of data. The subfigures in the first row correspond to the localization results of our method, and the subfigures in the second row correspond to the results of the VIO.

Fig. 6: The localization results of session "2018/01/29" under different packet loss probabilities and network latencies.

To eliminate the uncertainty of the wireless network, in this subsection we utilize a cable to connect the client and server subsystems and change the network conditions manually. We evaluate the localization performance under different conditions of packet loss and network latency using the session "2018/01/29". The localization results are shown in Fig. 6 using boxplots. As we can see from the first plot in Fig. 6, the localization errors do not increase much when the packet loss probability is under 50%, and the localization method can survive even when the packet loss probability rises to 70%, which effectively validates that the proposed cloud-based visual localization method is robust to packet loss.

As for network latency, we intentionally generate a network latency of the specified length every time the localization results are ready for transmission, which should be the worst situation the localization system would meet under conditions of network latency. From the results we can see that the localization performance gradually deteriorates with the delay time. When the network latency is large, the drifted states of the EKF based state fusion localizer cannot be adjusted in time, which makes the localization output drift heavily and become difficult to converge. Also, a large part of the optimized information sent back from the cloud cannot be applied to the VIO, as the related keyframes and map points have left the sliding window. But as the results show, our system can still give acceptable localization results when the delay time is less than 3 s, and rough localization results when the delay time is no more than 5 s, which is more severe than most network conditions in real-world environments, validating the robustness of the proposed localization method to network latency.

Fig. 7: The localization results tested with different lengths of the sliding window on the cloud.

C. Effectiveness of the divided double sliding-window architecture
We set different lengths of the sliding window on the cloud side to discuss the necessity of, and some issues about, the divided double sliding-window architecture. Since the length of the sliding window is related to the observation of the environment, here we utilize two sessions of data with different route directions ("2017/08/24" and "2018/01/29") for the experiments. The length of the sliding window on the robot is 10 in our work, so the test begins at a length of 10 on the cloud side. We evaluate the localization performance with lengths of 10, 20, 30, 40 and 50. The results of each session are drawn in
In this paper a cloud-based visual localization methodis proposed, which is designed for robots with limitedresources to achieve long-term and real-time localization.A novel robot-cloud localization architecture is proposed,which could provide high-frequency and drift-free localiza-tion estimation even in a low-bandwidth network with unpre-dictable conditions such as high latency and packet loss. Theexperimental results validate the long-term effectiveness andalso the performances under different network conditions arediscussed with quantitative analysis.In the future, we would like to study on mapping the net-work conditions and automatically selecting the coefficientsaccording to the network condition.A
CKNOWLEDGEMENT
This work was supported in part by the National Key R&D Program of China (2017YFB1300400) and in part by the National Nature Science Foundation of China (U1609210).
REFERENCES

[1] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proc. CVPR, 2010, pp. 3304–3311.
[2] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
[3] T. Naseer, G. L. Oliveira, T. Brox, and W. Burgard, "Semantics-aware visual localization under challenging perceptual conditions," in IEEE International Conference on Robotics and Automation, 2017.
[4] L. Paull, G. Huang, M. Seto, and J. J. Leonard, "Communication-constrained multi-AUV cooperative SLAM," IEEE, 2015, pp. 509–516.
[5] G. Mohanarajah, V. Usenko, M. Singh, R. D'Andrea, and M. Waibel, "Cloud-based collaborative 3D mapping in real-time with low-cost robots," IEEE Transactions on Automation Science and Engineering, vol. 12, no. 2, pp. 423–431, 2015.
[6] S. Middelberg, T. Sattler, O. Untzelmann, and L. Kobbelt, "Scalable 6-DoF localization on mobile devices," in European Conference on Computer Vision. Springer, 2014, pp. 268–283.
[7] X. Zhu, C. Qiu, F. Deng, S. Pang, and Y. Ou, "Cloud-based real-time outsourcing localization for a ground mobile robot in large-scale outdoor environments," Journal of Field Robotics, vol. 34, 2017.
[8] M. J. Milford and G. F. Wyeth, "SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights," in IEEE International Conference on Robotics and Automation, 2012, pp. 1643–1649.
[9] W. Maddern, M. Milford, and G. Wyeth, "CAT-SLAM: Probabilistic localisation and mapping using a continuous appearance-based trajectory," International Journal of Robotics Research, vol. 31, no. 4, pp. 429–451, 2012.
[10] K. J. Wu, A. M. Ahmed, G. A. Georgiou, and S. I. Roumeliotis, "A square root inverse filter for efficient vision-aided inertial navigation on mobile devices," MIT Press Journals, 2015.
[11] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Robotics and Automation, 2007 IEEE International Conference on. IEEE, 2007, pp. 3565–3572.
[12] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
[13] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "On-manifold preintegration for real-time visual-inertial odometry," IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2017.
[14] X. Ding, Y. Wang, D. Li, L. Tang, and R. Xiong, "Laser map aided visual inertial localization in changing environment," in Intelligent Robots and Systems (IROS), 2018 IEEE/RSJ International Conference on. IEEE, 2018, pp. 4794–4801.
[15] Y. Kim, J. Jeong, and A. Kim, "Stereo camera localization in 3D LiDAR maps," IEEE, 2018, pp. 5826–5833.
[16] M. Paton, K. Mactavish, M. Warren, and T. D. Barfoot, "Bridging the appearance gap: Multi-experience localization for long-term visual teach and repeat," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2016, pp. 1918–1925.
[17] L. Tang, Y. Wang, X. Ding, H. Yin, R. Xiong, and S. Huang, "Topological local-metric framework for mobile robots navigation: a long term perspective," Autonomous Robots, vol. 43, no. 1, pp. 197–211, 2019.
[18] W. Churchill and P. Newman, "Experience-based navigation for long-term localisation," The International Journal of Robotics Research, vol. 32, no. 14, pp. 1645–1661, 2013.
[19] M. Bürki, I. Gilitschenski, E. Stumm, R. Siegwart, and J. Nieto, "Appearance-based landmark selection for efficient long-term visual localization," IEEE, 2016, pp. 4137–4143.
[20] S. Lowry, N. Sünderhauf, P. Newman, and J. J. Leonard, "Visual place recognition: A survey," IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2016.
[21] C. X. Guo, K. Sartipi, R. C. DuToit, G. A. Georgiou, R. Li, J. O'Leary, E. D. Nerurkar, J. A. Hesch, and S. I. Roumeliotis, "Resource-aware large-scale cooperative three-dimensional mapping using multiple mobile devices," IEEE Transactions on Robotics, no. 99, pp. 1–21, 2018.
[22] R. Mur-Artal and J. D. Tardós, "Visual-inertial monocular SLAM with map reuse," IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796–803, 2017.
[23] Y. Wang, R. Xiong, and S. Huang, "A pose pruning driven solution to pose feature GraphSLAM," Advanced Robotics, vol. 29, no. 10, pp. 683–698, 2015.
[24] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, "On-manifold preintegration for real-time visual-inertial odometry," IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2015.
[25] S. Lynen, M. Achtelik, S. Weiss, M. Chli, and R. Siegwart, "A robust and modular multi-sensor fusion approach applied to MAV navigation," in Proc. of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2013.
[26] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic et al., "Benchmarking 6DoF outdoor visual localization in changing conditions," in Proc. CVPR, vol. 1, 2018.
[27] R. S. Sinha, Y. Wei, and S. H. Hwang, "A survey on LPWA technology: LoRa and NB-IoT," ICT Express, vol. 3, no. 1, 2017.
[28] J. Sturm, W. Burgard, and D. Cremers, "Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark," in