Deep Reinforcement Learning based Automatic Exploration for Navigation in Unknown Environment
Haoran Li, Qichao Zhang, Dongbin Zhao
Senior Member, IEEE
Abstract—This paper investigates the automatic exploration problem in an unknown environment, which is a key capability for applying robotic systems to social tasks. Solving this problem by stacking decision rules cannot cover the variety of environments and sensor properties. Learning-based control methods adapt to these scenarios, but they suffer from low learning efficiency and poor transferability from simulation to reality. In this paper, we construct a general exploration framework by decomposing the exploration process into decision, planning, and mapping modules, which increases the modularity of the robotic system. Based on this framework, we propose a deep reinforcement learning based decision algorithm that uses a deep neural network to learn the exploration strategy from the partial map. The results show that the proposed algorithm has better learning efficiency and adaptability in unknown environments. In addition, we conduct experiments on a physical robot, and the results suggest that the learned policy can be transferred well from simulation to the real robot.
Index Terms—automatic exploration, deep reinforcement learning, optimal decision, partial observation
This work is supported by the Beijing Science and Technology Plan under Grant Z181100004618003, and the National Natural Science Foundation of China (NSFC) under Grants 61573353, 61803371, 61533017 and 61603268. This work is also supported by Noah's Ark Lab, Huawei Technologies.

Haoran Li, Qichao Zhang, and Dongbin Zhao are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (email: [email protected], [email protected], [email protected]).

I. INTRODUCTION

As an important subproblem of robot autonomous navigation, automatic exploration means that a robot moves ceaselessly to build the entire environmental map in a new environment without any prior knowledge, which is a hot topic in the field of robotics. It has a wide range of practical applications, such as the search work of rescue robots and the sweeping work of robot sweepers in unknown environments. As the robot has to complete the exploration and mapping task in completely different and unknown environments, automatic exploration also reflects the adaptability of the robotic system.

With the spreading attention from various fields, some noticeable works have been produced, such as frontier-based [1] and information-based [2] methods. The frontier-based method decides the next move of the robot by searching the frontier points that lie between free points and unknown points. The information-based method applies Shannon entropy to describe the uncertainty of the environmental map and constructs an optimization problem, from which the optimal control variable of the robot can be obtained during the automatic exploration process. Unfortunately, in most cases there is no closed-form optimal solution to the optimization problems constructed by this method. Some researchers therefore combined these two methods [3]: Shannon entropy is employed to evaluate candidate points selected by the frontier method, and the optimal point for the next move is then obtained. Furthermore, to make the exploration more efficient and the map more precise, an extra target term was added to the objective function of the optimization problem [4]. However, the computational burden of solving these multi-objective optimization problems increases rapidly with the size of the explored area. In general, the key to the traditional methods mentioned above is to find the optimal next point according to the current point and the explored map. However, their adaptability is limited because the exploration strategy relies heavily on expert features of maps. It is also difficult to design a general termination mechanism that balances exploration efficiency and computational burden.

Gradually, learning-based exploration methods have been considered to tackle the problems above. Krizhevsky et al. [5] adopted deep convolutional neural networks (CNN) and designed an efficient classification network for large-scale data, which triggered a wave of research on neural networks. CNN has spurred many breakthroughs in image recognition [6]. Meanwhile, reinforcement learning (RL) has also achieved considerable development [7]. Since Mnih et al. [7] applied a deep reinforcement learning (DRL) algorithm and achieved better performance than human players in video games such as Atari, algorithms combining deep learning (DL) and RL have become effective tools for complex decision-making problems. Based on Q-learning [8] and the strong feature representation ability of CNN, the Deep Q-network (DQN) has shown tremendous potential in robot control and game decision making. In the robot control field, DRL methods in continuous action spaces can establish a concise mapping from image inputs to the control policy [9] [10]. They have many applications in manipulator control [11], intelligent driving [12] and games [13] [14]. In game decision making, DRL methods in discrete action spaces, such as AlphaGo [15], show strong search capability within high-dimensional decision spaces. Considering these merits of DRL, some researchers attempt to apply it to the exploration and navigation of intelligent agents in unknown environments.

Currently, DL or DRL based methods are mostly adopted in robotic visual control or navigation tasks, which build the mapping relationship between raw sensor data and the control policy. In this way, the intelligent agents can move autonomously relying on sensor data. However, there are two disadvantages of this end-to-end approach. On the one hand, the information provided by mature navigation methods in robotics is ignored, which means the agents need to learn how to move and explore effectively from raw data. This definitely increases the training difficulty. On the other hand, the reality gap between synthetic and real sensory data poses a major challenge when moving from simulation to reality. To the best of our knowledge, the reality gap for DRL-based automatic exploration in unknown environments has not been addressed in existing work.

Motivated by these observations, we construct an automatic exploration framework based on map building, decision making and planning modules. This framework makes it easy to combine the fairly complete map building and navigation methods in robotics. Since the visual features of the built grid map can guide the robot to explore the unknown environment efficiently, we propose a DRL-based decision method that selects the next target location with the grid map of the partial environment as input. Compared with the existing literature, the main contributions are twofold.

1) In contrast to the end-to-end approach with raw sensor data as input and a control policy as output, our exploration framework combined with traditional navigation approaches can bridge the reality gap from simulation to physical robots. Meanwhile, the training difficulty is reduced under the proposed framework.

2) Compared with traditional methods [3] [4], we design a value network for deep reinforcement learning and combine the DRL-based decision algorithm with classical robotic methods, which improves exploration efficiency and adaptability in unknown environments.

This paper is organized as follows. Section II describes related work on robot automatic exploration and intelligent agent navigation with DRL. Section III presents our definition and analysis of robot automatic exploration. Then, we give the details of the proposed algorithm and network architecture in Section IV. The whole system is described in Section V, and the simulation experiments and the physical experiments, together with a comparison among different methods, are presented in Section VI.
Finally, in Section VII, we summarize our work and discuss future work.

II. RELATED WORK
In general, automatic exploration schemes can be divided into two categories: traditional methods and intelligent methods. The former, which have been attracting considerable interest since the 1990s, employ expert features of maps for selecting the next goal point. The latter learn the control or the strategy directly from observations based on learning approaches.

The widely used traditional method for selecting exploration points was suggested by Yamauchi [1]. This method detects the edges between free space and unexplored space based on the occupancy grid map and subsequently calculates the central point of each edge as the next candidate point. Then, the robot moves to the nearest point by performing a depth-first-search algorithm. However, the safety of the next point is uncertain, since the distance between the point and an obstacle may be too small. González-Baños et al. [3] defined safe regions and built a Next-Best-View algorithm to decide which region the robot should move to. The simultaneous localization and mapping (SLAM) algorithm and path planning were embedded in this algorithm, where the former was applied to reconstruct the environment and the latter was utilized to generate the actions of the robot.

Another class of traditional methods is connected with information theory. Bourgault et al. [2] took advantage of Shannon entropy to evaluate the uncertainty of the robot positions and the occupancy grid map, and exploited this information metric as the utility of the robot control. The control was derived from a multi-objective problem based on this utility. The utility function for localization stemmed from filter-based SLAM algorithms [16]. There is a large volume of existing studies describing the role of graph-based SLAM algorithms [17]. Different from filter-based algorithms, the graph-based methods construct a pose graph which uses the poses of robots and landmarks as nodes and the controls and measurements as edges. Note that a new pose caused by robot movement influences the covariance matrix of the pose graph. To evaluate the effect of the new pose, Ila et al. [18] defined a mutual information gain for a new edge in the pose graph. With a rapidly-exploring random tree, they searched the paths to explore the unknown environment and used the information gain to evaluate the path entropy; afterward, they chose the path that minimized the map entropy.

Due to the significant recent development of DRL, a number of researchers in the intelligent control area have attempted to regard robot exploration as an optimal control problem constructed by information-based methods and to employ reinforcement learning to settle the sequential decision problem. Martinez et al. [4] modeled exploration as a partially observable Markov decision process (POMDP) [19] and employed direct policy search to solve the POMDP. For its great perception ability, DL has been widely employed to build powerful methods for learning the mapping from sensor data to robot control. In 2016, Tai et al. [20] proposed a perception network with RGB-D image inputs and a control network trained by DQN. Note that this algorithm only ensures that the robot wanders without collision. Bai et al. [21] constructed a supervised learning problem to reduce the Shannon entropy of the map. They fed a local patch of the map into neural networks and predicted in which direction the robot should move.
This algorithm often fell into a corner, since it only perceived the local environment, which was already completely explored.

Much of the current literature on intelligent methods pays particular attention to game agent navigation and exploration in virtual environments. Mirowski et al. [22] trained an Asynchronous Advantage Actor-Critic (A3C) agent in a 3D maze. They embedded long short-term memory (LSTM) into the agent to give it memory. To improve learning efficiency, they added a depth prediction task and a loop closure detection task. This attempt presented a remarkable performance in the virtual environment. Instead of adding external tasks, Pathak et al. proposed an intrinsic curiosity module (ICM) [23]. A deep neural network was utilized to encode state features and to construct a world model that predicts the next-step observation in feature space. The difference between the encoded features of the next observation and the world model's prediction of those features was employed to drive the agent to move around. Then Zhelo et al. [24] applied this approach to robot navigation, combining policy entropy and A3C to calculate the robot control.

III. PROBLEM STATEMENT
In order to achieve autonomous navigation, the robot receives sensor data and simultaneously builds a map of its surroundings while moving in the unknown environment. To create a map as close as possible to the actual situation, a robot exploration algorithm is essential. Under traditional navigation frameworks, the robot has to generate the environment map and locate itself. With the development of DL and DRL, several end-to-end control methods based on DRL have appeared over the years. Those methods feed the sensor data and the surroundings as the input of deep neural networks and output the control for the robot.

A major advantage of end-to-end methods is their conciseness. However, there is a wide gap when applying the technique to a real robot. The end-to-end learning methods have to conduct massive trial-and-error experiences, so most researchers put this learning process in a simulation environment. Consequently, this practice raises three problems: 1) The algorithm needs a long time to converge, since the robot starts from learning basic movement before it can explore the unknown environment. 2) Since the real environment is not exactly the same as the simulation environment, there is no guarantee that the mapping represented by the deep neural network learned in simulation generalizes well to the real environment. 3) The mechanical and sensor errors of the physical robot may make the learned control law inapplicable to the physical robot.

Based on the above issues, we separate robot exploration into three modules: the decision module, the mapping module, and the planning module. Since each module is independent of the others, we can design the planning algorithms for different robots with the same decision module. The planning algorithms guarantee that the robots can move safely without learning in a known environment. With the help of traditional navigation algorithms, the decision module focuses on an efficient exploration strategy. Figure 1 describes the relationship between the three modules. In the following, we formulate each module in a general form for extension and unfold each module to explain its function.

Fig. 1: The relationship between the modules of the exploration framework. The decision module receives data from the mapping module and sends the next goal point to the planning module. The planning module calculates the potential trajectories from the current position to the goal position and sends the control data to the mapping module.
A. Decision module
According to the historical observations, this module chooses the goal point to which the robot should move. In this paper,
the module input is the map of the surroundings built by the mapping module. The decision module is modeled as

g_t = f_{decision}(l_t, m_t),   (1)

where g_t is the goal point, l_t denotes the robot positions from time 0 to t, and m_t is the built map of the surroundings at time t. Several efforts have focused on figuring out the function f_{decision}; most methods are based on frontier methods with heuristic search. Those methods require establishing massive rules for complex environments, which results in heavy dependence on expertise.

In this paper, we mainly focus on finding a more concise and general function f_{decision}. We propose a DRL-based decision method in the next section. Compared with the frontier-based method, our method learns the visual features for exploration automatically. Instead of stacking rules for exploration, our method learns a feasible strategy from trial and error and covers more complex scenes.

B. Planning module
This module aims to ascertain a feasible trajectory from the current position to the goal point. The planning module is modeled as

\tau_{t:t+T} = f_{planning}(l_t, g, m_t),   (2)

where \tau_{t:t+T} is the planned trajectory for robot control. The widely investigated framework implementing the planning function includes a global planning method and a local planning method. The aim of the former is to determine the shortest path from the current position to the goal position on the map, while the latter converts the path into a trajectory and adjusts the trajectory based on the real-time sensor data. It should be mentioned that recent studies related to the planning module have shown different frameworks: those methods, modeled as deep neural networks, take the sensor data and the goal point as input and build the mapping from the input to the control.

Considering that A* [25] is a simple and efficient search method, we apply A* as the global planner. The result of the A* search is a path composed of discrete points, and this path cannot control the robot directly. We employ the timed-elastic-band (TEB) method [26] as the local planner, which converts the path into robot velocities. Compared with pure tracking, this method can avoid obstacles which are not in the built map (such as moving objects) according to the real-time point cloud from the light detection and ranging (LiDAR) sensor.
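To make the global planning step concrete, the sketch below implements a minimal A* search on a 2D occupancy grid. It is an illustrative stand-in for the A* global planner described above rather than the implementation used in the paper; the grid encoding (0 = free, 1 = occupied) and the 4-connected neighborhood are assumptions.

```python
import heapq
import numpy as np

def astar(grid, start, goal):
    """Minimal A* on a 2D occupancy grid (0 = free, 1 = occupied).

    Returns a list of (row, col) cells from start to goal, or None if no path exists.
    """
    rows, cols = grid.shape
    h = lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1])   # Manhattan heuristic
    open_heap = [(h(start), start)]
    g_cost, parent, closed = {start: 0}, {start: None}, set()
    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:                                      # reconstruct the path
            path = [node]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            return path[::-1]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):     # 4-connected moves
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt] == 0:
                ng = g_cost[node] + 1
                if ng < g_cost.get(nxt, float("inf")):
                    g_cost[nxt], parent[nxt] = ng, node
                    heapq.heappush(open_heap, (ng + h(nxt), nxt))
    return None

# Example: plan around a wall segment on a small empty map.
grid = np.zeros((10, 10), dtype=int)
grid[4, 2:8] = 1
print(astar(grid, (0, 0), (9, 9)))
```

In the full system, the resulting waypoint path would then be handed to the local planner (TEB in the text) to be turned into velocity commands.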
C. Mapping module

The mapping module processes the sensor data in sequence and produces the robot pose and the surroundings map at each timestamp, which is also known as SLAM. The function of SLAM can be written as

m_t, l_t = f_{mapping}(l_{t-1}, m_{t-1}, u_t, z_t).   (3)

Here u_t and z_t are the control and the observation at time t, respectively. The major SLAM methods can be divided into two categories: filter-based methods and graph-based methods. Filter-based methods make use of the Kalman filter or the extended Kalman filter to update the estimation of the robot and landmark (map) poses. Graph-based methods collect the controls and the observations to construct a pose graph and take advantage of nonlinear optimization to estimate the robot pose. Moreover, Parisotto et al. [27] proposed a neural SLAM method which exploits a deep neural network to fit the function f_{mapping}. We utilize a graph-based method named Karto SLAM. It is simple to implement and works well in most scenes. Compared with filter-based methods, graph-based methods can diminish the error and obtain the exact solution consistently. Therefore, more and more papers focus on graph-based methods recently [17].

In this paper, we place great emphasis on constructing the DRL algorithm for the decision module to implement efficient exploration. Compared with traditional decision methods, the DRL-based decision algorithm can eliminate the trivial details of complicated rules. Different from end-to-end methods, the approaches based on this framework have a higher degree of modularity, and they are more flexible and interpretable.
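The three module models in Eqs. (1)-(3) can be read as a narrow interface between otherwise independent components. The sketch below captures that decomposition as plain Python signatures; the type names and signatures are illustrative assumptions, not the paper's implementation, and each stub would be filled by the concrete method chosen for that module.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

Pose = Tuple[float, float, float]          # (x, y, yaw) of the robot


@dataclass
class PartialMap:
    grid: np.ndarray                       # occupancy probabilities in [0, 1], 0.5 = unknown
    resolution: float                      # meters per cell


def f_decision(poses: List[Pose], m: PartialMap) -> Optional[Tuple[int, int]]:
    """Eq. (1): choose the next goal cell from the pose history and the partial map.

    Returning None stands for the terminal action (stop exploring).
    This is where the DRL policy of Section IV plugs in.
    """
    raise NotImplementedError


def f_planning(pose: Pose, goal: Tuple[int, int], m: PartialMap) -> List[Pose]:
    """Eq. (2): plan a trajectory from the current pose to the goal on the map
    (e.g., A* for the global path, TEB for the local trajectory)."""
    raise NotImplementedError


def f_mapping(prev_pose: Pose, prev_map: PartialMap,
              control: np.ndarray, scan: np.ndarray) -> Tuple[PartialMap, Pose]:
    """Eq. (3): SLAM update; fuse the control u_t and observation z_t into a new
    map estimate m_t and pose l_t."""
    raise NotImplementedError
```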
IV. DRL-BASED DECISION METHOD

A. Objective function
The objective of robot exploration is to build an environment map which is as similar as possible to the real environment. Considering the battery capacity, the robot is also designed to select a shorter path to explore the environment. We therefore use the following objective function

c(\hat{m}, x_{t=0:T}) = \min_{u_{t=0:T}} \| m - \hat{m} \| + L(x_{t=0:T}).   (4)

Here \hat{m} is the estimated map, m is the real map, and L(\cdot) is the path length during exploration from t = 0 to t = T.

However, since the actual map is unavailable in an unknown environment, it is hard to handle this objective function and find the minimum solution. Inspired by the Shannon entropy used for evaluating exploration [2], we exploit the occupancy grid map to represent the environment. Note that p(m_{i,j}) is the occupancy probability of the grid cell in the i-th column and j-th row. Each grid cell has three possible states: unknown, free and occupied. The entropy of the map is

H(m) = -\sum_i \sum_j p(m_{i,j}) \log p(m_{i,j}).   (5)

This objective function evaluates the uncertainty of the built map.
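As a concrete illustration of Eq. (5), the snippet below computes the entropy of an occupancy grid in which each cell stores its occupancy probability (0.5 for unknown cells). Treating each cell as a Bernoulli variable, i.e. adding the (1 - p) term, is a common convention assumed here beyond the compressed form printed above.

```python
import numpy as np

def map_entropy(prob_grid: np.ndarray) -> float:
    """Shannon entropy of an occupancy grid map, in the spirit of Eq. (5).

    prob_grid holds occupancy probabilities p(m_ij); unknown cells are 0.5.
    Each cell is treated as a Bernoulli variable, so fully unknown cells
    contribute the maximum entropy and decided cells contribute almost zero.
    """
    p = np.clip(prob_grid, 1e-6, 1 - 1e-6)        # avoid log(0)
    cell_entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return float(cell_entropy.sum())

# A 100x100 map that is entirely unknown has the largest possible entropy;
# declaring a quarter of it free (p close to 0) reduces the entropy accordingly.
m = np.full((100, 100), 0.5)
print(map_entropy(m))                              # 10000 bits
m[:50, :50] = 0.001
print(map_entropy(m))                              # roughly 7500 bits
```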
B. Deep Reinforcement Learning

We take exploration as a sequential decision-making task. A Markov decision process, defined as a tuple <S, A, T, R, \gamma>, is a framework for decision making. S denotes a finite set of states. A is the set of actions that the agent can take. T(s'|s, a) is the transition distribution over the next state s' given that the agent took action a in state s. R(s, a) is the reward the agent receives after taking action a in state s. \gamma is the discount factor. The goal of the agent is to maximize the expectation of the long-term cumulative reward

V(s) = \max \sum_{t=1}^{T} E_{(s_t, a_t) \sim p_\theta(s_t, a_t)}[r(s_t, a_t)].   (6)

RL is an intelligent method for robot control [28]. Q-learning is a popular method to learn an optimal policy from samples generated by interactions with the environment. The key to Q-learning is calculating the Q-value function, which evaluates the action taken in a state. The optimal Q-value obeys the Bellman optimality equation

Q^*(s, a) = E_{s'}[r + \gamma \max_{a'} Q^*(s', a') | s, a].   (7)

Since this optimization problem is nonlinear, it is difficult to obtain an analytic solution. The common iterative method for this problem is value iteration:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \max_{a' \in A} Q(S_{t+1}, a') - Q(S_t, A_t)].   (8)

Note that recent advances in DL have facilitated the investigation of feature extraction. Mnih et al. [7] took a deep neural network as a function approximator for the Q-value function and trained the network by minimizing the mean squared error (MSE)

L_{DQN} = E[(R_{t+1} + \gamma \max_{a' \in A} Q(S_{t+1}, a'; \theta^-) - Q(S_t, A_t; \theta))^2],   (9)

where (S_t, A_t, R_{t+1}, S_{t+1}) are sampled from the experience replay buffer collected by the agent. This DQN algorithm achieved huge success in many games such as Go, Atari, etc.
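As a concrete reference point for Eqs. (8)-(9), the following sketch computes the temporal-difference target and the mean squared loss on a mini-batch with a target network. It is a generic DQN update in PyTorch, written under the assumption of a network that outputs one Q-value per discrete action, not the exact training code of the paper.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One DQN update step: TD target from Eq. (8), MSE loss of Eq. (9).

    batch: tuple of tensors (states, actions, rewards, next_states, dones)
    sampled from the experience replay buffer.
    """
    states, actions, rewards, next_states, dones = batch
    # Q(S_t, A_t; theta) for the actions that were actually taken.
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' Q(S_{t+1}, a'; theta^-) from the frozen target network.
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_taken, target)

# Usage (illustrative):
#   loss = dqn_loss(q_net, target_net, batch); loss.backward(); optimizer.step()
```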
Recently, many researchers have introduced the DQN algorithm for robot navigation. Most of them apply the original sensor data as the algorithm input, such as point clouds from LiDAR [29] or images [30] [31] [32] [33]. With the powerful representation of deep neural networks, these methods can extract key features without expert experience. Nevertheless, compared with empirical features, the original sensor data has a higher dimension, which enlarges the state space sharply. From this perspective, the agent has to take massive trial and error to cover all the potential state-action combinations, which takes a long time. In view of the safety and lifetime of real robots, it is impossible to take such extensive trials in the real environment. Hence, most studies train the agent in a simulated environment. However, there are two common problems in transferring the virtual agent to a real agent. First, since the error functions of the sensors differ between the virtual and real environments, it is hard to guarantee good generalization of the DRL algorithm from virtual to real. Second, due to mechanical errors, it is uncertain whether the real robot will perform well with a control law learned in a virtual environment.

As discussed above, we introduced the hierarchical decision framework in the last section. We separate the decision module from the planning module: the decision module provides the next goal point, and the planning module plans the path from the current position to the goal point. In this way, the control policy derived from traditional optimal control matches the real robot well.

In this section, we aim to design a novel DRL algorithm for the decision module. The inputs of the decision module are the map and the current and historical robot positions obtained from the mapping module. We obtain the optimal decision sequence

x^*_{t=0:T} = \arg\min H(m_T) + L(x_{t=0:T})
            = \arg\min H(m_T) - H(m_0) + L(x_{t=0:T})
            = \arg\min \sum_{t=1}^{T} [H(m_t) - H(m_{t-1})] + \sum_{t=1}^{T} L(x_{t-1}, x_t)
            = \arg\max \sum_{t=1}^{T} [H(m_{t-1}) - H(m_t) - L(x_{t-1}, x_t)].   (10)

Therefore, we can define the reward function for exploration as

r_t = \alpha (H(m_{t-1}) - H(m_t) - L(x_{t-1}, x_t)),   (11)

where \alpha is a coefficient that makes the training process more stable. In view of safety, we attach a heuristic penalty: a negative reward is given if the next point is in unknown space, and a negative reward is given if the next point is too close to an obstacle. To stop the exploration in time, we introduce a terminal action and define its reward: a positive reward if the ratio of the explored region in the map satisfies \rho > 0.85, and a negative reward otherwise.
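A minimal sketch of this reward signal follows, combining the entropy-reduction term of Eq. (11) with the safety and terminal cases described above. The numeric penalty and bonus constants and the scaling coefficient are placeholders, since the text does not spell out those values.

```python
def exploration_reward(prev_entropy, curr_entropy, path_length,
                       next_point_unknown, too_close_to_obstacle,
                       terminal_action=False, explored_ratio=0.0,
                       alpha=0.01, penalty=-1.0, terminal_bonus=1.0,
                       ratio_threshold=0.85):
    """Reward for one decision step.

    Eq. (11): alpha * (H(m_{t-1}) - H(m_t) - L(x_{t-1}, x_t)), i.e. the entropy
    reduced minus the distance travelled, scaled for training stability.
    The safety penalties and the terminal bonus mirror the heuristic terms in
    the text; their magnitudes here are illustrative placeholders.
    """
    if terminal_action:
        return terminal_bonus if explored_ratio > ratio_threshold else penalty
    if next_point_unknown or too_close_to_obstacle:
        return penalty
    return alpha * (prev_entropy - curr_entropy - path_length)
```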
Remark 1: The target of automatic exploration is to build the map of the unknown environment based on navigation and mapping techniques. Note that the logic of the reward signal is designed based on safety and efficiency. The robot cannot confirm the dangerous status of the unknown space, which may contain obstacles, walls and so on; therefore, a penalty signal is given to encourage the robot to choose the next goal point in free space. Besides, we require the distance from the goal point to the obstacles to be larger than the robot's geometric radius in consideration of collision avoidance. Combined with the designed efficient exploration logic, the robot can find a safe and efficient policy in the unknown environment. It should be mentioned that the ratio \rho is a hyperparameter that balances the explored region rate and the exploration efficiency. Here we choose it as 0.85 based on experiments.

Fig. 2: The action space based on the occupancy grid map. In the grid map, the white and gray areas are the free and unknown space, respectively. The black edges are the occupied grids, which form the contours of the obstacles. The points are the action positions, which are sampled regularly from the grid map. The green points are safe actions, and the others are dangerous.

C. Action Space and Network Architecture
The action of exploration is selecting a point from the built map. Since we utilize the occupancy grid map, the action space is discrete. To simplify the problem and reduce the search space, we exploit rasterized sampling on the grid map, and the action space is composed of these sampling points. Figure 2 describes the occupancy grid map and the action space for the proposed algorithm.

Since the points in the action space are obtained by regular sampling, we can design a convolutional neural network to estimate the value of each point in the map. The designed network architecture is shown in Figure 3. The lower layers are encoder layers whose output dimension is the same as the action space.

To estimate the Q-values, we construct a network named the fully convolutional Q-network (FCQN) with two convolutional branches after the encoder layers. One branch applies a convolution to calculate a score map which has the same size as the encoder feature maps; each element of the score map represents the advantage function of the corresponding point in the action space. The other branch exploits a max-pooling operation to collect a feature vector of the whole map, and a subsequent fully connected layer estimates the advantage value of the terminal action and the state value.
Fig. 3: Fully convolutional Q-network with an auxiliary task. The lower layers of this network are convolutional, as in the original DQNs.

The Q-value of each point action is derived from [34]

Q(s, a; \alpha, \beta) = V(s; \beta) + (A(s, a; \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \alpha)).   (12)

The Q-value of the terminal action is

Q(s, a; \beta) = V(s; \beta) + (A(s, a; \beta) - \frac{1}{|A|} \sum_{a'} A(s, a'; \beta)).   (13)

Compared with the original DQN architecture, our network makes several significant improvements. On the one hand, the FCQN is easier to train and has a lower overfitting risk due to its fewer parameters: the original DQN employs a fully connected layer after flattening the feature maps, which introduces massive trainable parameters, whereas the FCQN only has several convolutional kernel parameters. On the other hand, the fully connected layer in the original DQN requires inputs of the same dimension, which restricts its adaptability to environments of various sizes; since a fully convolutional network accepts inputs of different sizes, the FCQN adapts better to changing map sizes. These properties make our network more convenient to train continuously in new environments.
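To make the head structure concrete, the sketch below builds a small fully convolutional Q-network with a dueling combination in the spirit of Eqs. (12)-(13): a 1x1 convolution produces a per-cell advantage map over the downsampled action grid, and a globally max-pooled feature vector feeds a fully connected layer that outputs the state value and the terminal-action advantage. Channel widths, strides and layer counts are illustrative guesses, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FCQN(nn.Module):
    """Fully convolutional Q-network with a dueling head (cf. Eqs. (12)-(13)).

    Input:  map stack of shape (B, C, H, W) (grid map + position maps).
    Output: q_map  - Q-value for every sampled point action, shape (B, H/4, W/4)
            q_term - Q-value of the terminal action, shape (B,)
    """

    def __init__(self, in_channels=3, feat=32):
        super().__init__()
        # Encoder: two stride-2 convolutions give a 4x downsampled feature map,
        # so each spatial cell of the output corresponds to one sampled action.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.advantage = nn.Conv2d(feat, 1, kernel_size=1)   # per-point advantage map
        self.head = nn.Linear(feat, 2)                        # [state value, terminal advantage]

    def forward(self, x):
        f = self.encoder(x)                                   # (B, feat, H/4, W/4)
        adv_map = self.advantage(f).squeeze(1)                # (B, H/4, W/4)
        pooled = torch.amax(f, dim=(2, 3))                    # global max-pooling, (B, feat)
        value, adv_term = self.head(pooled).unbind(dim=1)     # (B,), (B,)
        # Dueling combination: subtract the mean advantage over all actions
        # (the point actions plus the terminal action).
        b = adv_map.shape[0]
        all_adv = torch.cat([adv_map.reshape(b, -1), adv_term.unsqueeze(1)], dim=1)
        mean_adv = all_adv.mean(dim=1)
        q_map = value[:, None, None] + adv_map - mean_adv[:, None, None]
        q_term = value + adv_term - mean_adv
        return q_map, q_term

# Fully convolutional layers let the same network run on maps of different sizes.
net = FCQN()
q_map, q_term = net(torch.zeros(1, 3, 128, 160))
print(q_map.shape, q_term.shape)            # torch.Size([1, 32, 40]) torch.Size([1])
```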
D. Auxiliary task

Compared with empirical features, taking the map image as the input of DRL increases the search space rapidly. The agent needs more time to explore every possible state-action combination, which increases the difficulty of the training process.

Some researchers have proposed auxiliary tasks that exploit prior knowledge to improve training efficiency. Mirowski et al. [22] implemented depth prediction and loop closure prediction as auxiliary tasks to improve the performance and training efficiency of DRL algorithms. Tai et al. [20] trained a deep convolutional network on a classification task and then exploited this network to extract features for RL. The feature input drastically reduces the state dimension and exploration time compared with the original image input. Unfortunately, these features are usually not task-oriented, so there is no guarantee of the quality of the features.

Considering the crucial role of map edges in navigation and exploration, we present edge segmentation as the auxiliary task for the FCQN. Note that the edges consist of two elements. One is the contour of obstacles, which is essential for avoiding collisions during navigation. The other is the boundary between free space and unknown space; this boundary is named the frontier and is employed to produce the candidate points.

A third branch for segmentation is appended to the network architecture, as shown in Figure 3, and we call the whole network with the segmentation branch AFCQN (fully convolutional Q-network with an auxiliary task). We apply a decoder block, which consists of two deconvolutions, after the lower feature maps. The output of this branch has the same dimension as the input map. The loss function for this branch is

L_s = -\sum_{p} \sum_{c=1}^{M} y_{o,c} \log p_{o,c},   (14)

where p represents a pixel in the map and M is the set of possible classes. In this case, M includes 3 classes: the contours of obstacles, frontiers, and others.
Here y_{o,c} is the binary indicator of whether the class label c is the correct classification for observation o, and p_{o,c} is the predicted probability that observation o is of class c.

Therefore, the lower convolutional layers update their parameters with a two-part back-propagation error:

\Delta w = -\epsilon (\frac{\partial L_{DQN}}{\partial w} - \lambda \frac{\partial L_s}{\partial w}),   (15)

where \epsilon and \lambda are the learning rate and the balance parameter for the segmentation task, respectively, and w denotes the parameters of the lower convolutional layers. These analytical procedures and the results are described in Section VI.
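A sketch of the joint update is given below: the segmentation branch is trained with the pixel-wise cross entropy of Eq. (14), and its gradient flows into the shared encoder together with the DQN loss, in the spirit of Eq. (15). Summing the two losses with a balance weight is a common multi-task convention adopted here as an assumption; the class count (obstacle contour, frontier, other) follows the text.

```python
import torch
import torch.nn.functional as F

def afcqn_losses(q_map, q_term, td_target, action_idx, is_terminal,
                 seg_logits, seg_labels, seg_weight=0.1):
    """Combined objective for AFCQN (illustrative).

    q_map      : (B, h, w) per-point Q-values, q_term: (B,) terminal Q-values
    td_target  : (B,) targets computed as in Eq. (9)
    action_idx : (B,) flattened index of the chosen point action
    is_terminal: (B,) 1 if the terminal action was chosen
    seg_logits : (B, 3, H, W) logits for {obstacle contour, frontier, other}
    seg_labels : (B, H, W) integer class labels
    """
    b = q_map.shape[0]
    q_points = q_map.reshape(b, -1).gather(1, action_idx.unsqueeze(1)).squeeze(1)
    q_taken = torch.where(is_terminal.bool(), q_term, q_points)
    loss_dqn = F.mse_loss(q_taken, td_target)                 # Eq. (9)
    loss_seg = F.cross_entropy(seg_logits, seg_labels)        # Eq. (14)
    # One backward pass propagates both gradients into the shared encoder,
    # which plays the role of the two-part update in Eq. (15).
    return loss_dqn + seg_weight * loss_seg
```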
V. WHOLE SYSTEM

Fig. 4: The diagram of the automatic exploration system.

The workflow of the whole system is shown in Figure 4. The robot collects odometry data and point cloud data by scanning the environment and sends the data to the mapping module. In this paper, we apply the gmapping method as the mapping module, which builds the map and obtains the pose in the built map. Based on the results of the mapping module, the proposed DRL-based decision algorithm (FCQN or AFCQN; we use AFCQN as the example) utilizes the built map and the localization as input and gives the next goal point. As the global planner, the A* algorithm receives the goal point from the decision module and the built map from the mapping module, and it searches for a feasible path composed of discrete points. The local planner translates the path into velocities and sends them to the robot. The robot moves around in the environment, receives new sensor data and builds a new map. Then the process loop begins again.
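The workflow in Figure 4 can be summarized as the loop below. The four callables are placeholders for the mapping, decision and planning components described above, so this is a structural sketch rather than executable system code.

```python
def explore(build_map, select_action, plan_path, follow_path, max_steps=50):
    """Top-level exploration loop mirroring Figure 4.

    build_map()            -> (occupancy grid, robot pose)        # mapping module
    select_action(m, pose) -> goal cell or None (terminal action)  # decision module
    plan_path(m, pose, g)  -> list of waypoints (A* + TEB)         # planning module
    follow_path(path)      -> executes the trajectory on the robot

    All four callables are placeholders for the modules described in the text.
    """
    for _ in range(max_steps):
        grid, pose = build_map()                    # SLAM: map + pose from sensor data
        goal = select_action(grid, pose)            # greedy action from the Q-network
        if goal is None:                            # terminal action: stop exploring
            break
        path = plan_path(grid, pose, goal)          # global search + local planning
        follow_path(path)                           # robot moves, new scans arrive
    return build_map()[0]                           # final map estimate
```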
VI. EXPERIMENTS

In this section, we conduct experiments in the simulated world and the physical world to compare the performance of different algorithms. In our exploration framework, we adopt the DRL-based method as the decision algorithm. For the mapping module, we make use of a graph-based method, where the front-end is based on a scan matching method [35] and the back-end is based on graph optimization implemented with the g2o package [36]. The planning module is composed of the A* search method as the global planner and TEB as the local planner. First, we compare the different networks, including the original DQN and the proposed FCQN and AFCQN, with the same state space and action space. Then we conduct experiments to find the similarities and differences between AFCQN and the frontier-based method. Finally, we deploy the algorithms on the real robot.

A. Evaluation Metrics
To compare the performance of the different algorithms, we define the explored region rate, the average path length and the exploration efficiency.

The explored region rate embodies the completeness of the map built during exploration. It is defined as

Explored region rate = (Number of explored free cells) / (Number of free cells in the real map).   (16)

Along with a high explored region rate, we hope the agent moves efficiently. Therefore, the average path length and the exploration efficiency are introduced:

Average path length = \sum_i L_{\tau_i} / (Number of episodes),   (17)

Exploration efficiency = \sum_i (H(0) - H(T))_i / \sum_i L_{\tau_i},   (18)

where L_{\tau_i} is the path length of the trajectory \tau_i during the i-th episode.

In general, a larger explored region rate leads to a longer path, so a larger explored region rate and a shorter path are conflicting goals. The exploration efficiency represents the entropy reduced per unit length, and it is a compromise metric for exploration.
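The three metrics can be computed directly from logged episodes, as sketched below; the per-episode record fields are assumptions about how such logs might be organized.

```python
def evaluation_metrics(episodes, free_cells_in_real_map):
    """Average explored region rate, average path length and exploration
    efficiency, following Eqs. (16)-(18).

    Each episode record is assumed to provide:
      explored_free_cells - number of free cells discovered in the built map
      path_length         - total length of the trajectory tau_i
      entropy_start/end   - map entropy at the start and end of the episode
    """
    n = len(episodes)
    rate = sum(e["explored_free_cells"] for e in episodes) / (n * free_cells_in_real_map)
    avg_path = sum(e["path_length"] for e in episodes) / n
    efficiency = (sum(e["entropy_start"] - e["entropy_end"] for e in episodes)
                  / sum(e["path_length"] for e in episodes))
    return rate, avg_path, efficiency
```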
B. Simulation Setup

We construct the simulation platform based on the Stage package [37] of the Robot Operating System (ROS) for agent training. We create a virtual map for algorithm training and design maps with different layouts and sizes for testing. Figure 5 and Table I show the details of the maps. The kinematic model of the agent is omnidirectional, so it can move left or right without rotation.
C. Training Analysis
We conduct the training process on the training map. The inputs of the algorithms are combined images including the grid map, the current position map, and the last position map. A position map draws the robot position on a blank map which has the same size as the built map.
Fig. 5: The maps for training and test.

TABLE I: The details of the training and test maps.
Map name        Resolution (cells)   Real size (meters)
Training map    161 × 201            5 × –
Test map 1      – × 201              5 × –
Test map 2      – × 201              5 × –
Test map 3      – × 277              13.65 × –
Test map 4      – × 267              9.5 × –

Figure 6 shows the training processes. The returns grow during the training episodes and converge to a stable value. We can see that FCQN learns faster than DQN, but its final return is slightly smaller than that of DQN. A possible reason is that DQN has more parameters than FCQN and a better ability to remember complex decision sequences. Since AFCQN uses more information to find the optimal parameters, AFCQN performs better in both convergence speed and final return.

Figure 7 describes the number of explored grids and the path length during the training episodes.
Fig. 6: The return during training.
Fig. 7: The explored grids and path length during training. The bold curves are average values over episodes, and the shadow of each curve corresponds to the variance.
TABLE II: Results on the training map
Methods            Explored region rate   Path length   Exploration efficiency
AFCQN    mean      –                      –             –
         std       0.0414                 2.8107        297.4689
FCQN     mean      –                      –             –
         std       –                      –             –
DQN      mean      0.84                   23.19         1108.51
         std       0.2190                 5.9091        190.8171

From the figures, we find that the edge segmentation task speeds up the exploration. AFCQN learns a safe strategy for movement and increases the path length quickly. However, the segmentation task also brings redundant movements which do not increase the number of new free cells. With the iteration of the Q-values, the algorithm prunes the redundant movements and achieves a better return.

To compare the trained algorithms on the training map, we conduct 50 trials by selecting initial points randomly from free space. The results are shown in Table II. All algorithms finish the exploration completely on the training map. AFCQN has the highest explored region rate while FCQN has the highest exploration efficiency. The reason for the high exploration efficiency of FCQN is that it selects the terminal action in time, which avoids redundant movement. AFCQN pays more attention to the completeness of the built map, which leads to a longer path. In most trials, DQN has a higher explored region rate than FCQN, but it fails in some cases. Figure 8 shows the comprehensive performance: the curves show the proportion of episodes which achieve a specified explored region rate within 25 meters. According to the figure, we see that AFCQN performs better than the others. FCQN has a stable performance, with its explored region rate distributed around 0.8. Except for failed cases, the explored region rate of DQN is distributed around 0.9.
Fig. 8: The proportion of episodes in the trials that reach a given explored region rate with an exploration path length shorter than 25 meters.
Fig. 9: The statistical results of the different algorithms on the 4 test maps. The histogram shows the average value of each metric, and the error bar is the standard deviation.
D. Generalization Analysis
In order to analyze the generalization of the algorithms, we design test maps that differ from the training map. Considering the constraint of the standard DQN caused by its fully connected layers, we design two maps (test map 1 and test map 2), shown in Figure 5, with the same size as the training map but different layouts. Our algorithms, based on fully convolutional layers, adapt well to variant map sizes; in addition, we design two maps (test map 3 and test map 4) with sizes different from the training map. We test our algorithms on these maps, and the results are shown in Figure 9.

According to Figure 9, all algorithms generalize well to the test maps which have layouts different from the training map. AFCQN has the highest explored region rate in test map 1, and FCQN follows. AFCQN and FCQN have the same explored region rate in test map 2, while FCQN has a smaller standard deviation. In addition, FCQN has a shorter path length and performs better than DQN. Compared with the results on the training map, AFCQN has a higher explored region rate and a smaller standard deviation of the explored region rate than the others on the test maps.

Since the fully connected layers of DQN restrict its output dimension, DQN has poor adaptability to maps of different sizes. Note that AFCQN and FCQN make use of fully convolutional layers instead of fully connected layers: they can handle maps of different sizes and adjust the output dimension adaptively while keeping a fixed downsampling rate for generating the action space.

In this part, we also compare the difference between AFCQN and FCQN. Although the layout and size are different from the training map, our proposed algorithms work well in the new environments. AFCQN has a higher explored region rate, but FCQN has a higher exploration efficiency. The changes of layout and scale increase the standard deviation of the explored region rate.

From the above two groups of experiments, AFCQN and FCQN are better than DQN on all maps. The layout of the map has an influence on the algorithms. Comparing test map 1 and test map 2, we can find a common feature between the training map and test map 2: they both have many rooms and "doors", while the layouts of test maps 1 and 3 are relatively spacious. With this feature, FCQN performs better in exploration efficiency; the reason is that FCQN can end the exploration in time and has a shorter exploration path. Since we add edge segmentation as the auxiliary task, AFCQN focuses on the completeness of the built map (as its higher explored region rate shows) and has a longer exploration path, which leads to lower exploration efficiency. In spacious environments such as test maps 1 and 3, the exploration strategy is simpler than in crowded environments such as test map 2 and the training map, which means it is easier to complete the exploration. AFCQN can detect the completeness of the built map and end the process in time, whereas FCQN needs more steps to confirm the finish. These results show that AFCQN has a better ability to detect the completeness of the built map.
E. Relationship to Frontier-based Methods
Since AFCQN takes edge segmentation as the auxiliary task, we further analyze how this task affects the decisions of AFCQN. In the frontier-based methods, the frontiers, which are the boundaries between free space and unknown space, are the decision candidates, so we compare the decision process of AFCQN with that of the frontier-based method. Figure 10 describes the process.

Fig. 10: The decision process of AFCQN during exploration. The whole process is composed of 8 steps. Each step has 3 layers: the built occupancy grid map (bottom), the frontiers (middle), and the Q-value map (top). After the 8th step, AFCQN selects the terminal action and stops the exploration.
In the early decision stage, the valuable decision positions are distributed around the frontiers. As the map expands, the frontiers spread out. AFCQN learns an intuition for maximizing the increase in free cells. In step 5, the algorithm selects the center of the frontiers in the left part of the map, which is more efficient than selecting the center of any single frontier. In step 6, there are two frontiers on the map. The upper-right frontier is the candidate which increases more free grids; the upper-left frontier will increase fewer free grids but is closer to the current position. It can be seen that AFCQN selects the upper-left point first and the upper-right point second. This strategy is better than selecting the upper-right point first when considering both the completeness and the whole path length. After step 8, AFCQN selects the terminal action to stop the exploration.

Fig. 11: The statistical results of AFCQN and the frontier-based method on the 4 test maps. The histogram shows the average value of each metric, and the error bar is the standard deviation.
We also compare AFCQN with the frontier-based method. Figure 11 shows the results on the 4 test maps. The frontier-based method achieves a higher explored region rate: since its decision point is selected from the frontier centers, it explores the environment continually until there are no reachable frontier points left. In other words, it usually has a long path. Although AFCQN is inspired by the edges of the built map, it estimates the status of the whole map in order to select the terminal action in time. Therefore, AFCQN has a higher exploration efficiency on the first three test maps.

Fig. 12: The proportion of episodes in the trials that reach a given explored region rate.

Figure 12 presents the stability of the algorithms. Under the distance constraint, the frontier-based method is better than AFCQN. However, the frontier-based method also faces some special cases which make the exploration fail. We explain one case in Figure 13. The LiDAR receives the point cloud and renders it into the grid map. The frontier-based method detects the frontiers, shown as red grids, and calculates their center, represented by a green ball. Unfortunately, the robot is standing exactly at this center point, so it stops the exploration. Since the inputs of AFCQN include the robot's current position and the last decision position, this case never happens when running the AFCQN method.

Fig. 13: A failure case of the frontier-based method. The blue polygon is the base of the robot. The red grids are frontiers. The green ball is the center of the frontiers. In this case, since the robot position and the frontier center overlap, the exploration stops.
F. Physical World Experiments
Since the output of our algorithm is a decision point instead of robot controls, it is easy to transfer to the physical robot. We employ the DJI Robomaster robot platform shown in Figure 14. The major components include the gimbal and the chassis. The chassis is equipped with mecanum wheels for omnidirectional movement. We fixed an RPLidar, which has properties similar to the simulated LiDAR, on the chassis for mapping. Then we test our algorithm with this platform in the real environment.

Fig. 14: The hardware of the DJI Robomaster platform.

Here, we compare the results of the physical-world experiments with those of the simulated world. The simulated map and the built real-world map are shown in Figure 15. Due to the differences in sensor errors and environments, the observations of the two experiments are different. Table III shows the comparison results. The frontier-based method has the highest explored region rate in both the physical world and the simulated world. However, our proposed method AFCQN has the best exploration efficiency, since it monitors the completeness of the built map and stops the exploration in time. It is obvious that the distinction between the two observations impacts the explored region rate. DQN is sensitive to the inputs and failed to explore the physical world several times. For the other methods, the variances of the explored region rate are similar. These differences have no effect on the exploration path length; the variance of the path length is determined by the local planner, and we use different local planner parameters for the simulation and the real experiments in order to adapt to the different kinematic models. To explain the behaviour of AFCQN, the exploration trajectories of AFCQN in the trials are shown in Figure 16. Most trajectories of the real experiments overlap with those of the simulation experiments, and the results imply that the decision processes of the two experiments are similar.

Fig. 15: The simulation map (upper left) and the real built map (upper right). The lower-left image is the observation (built map) coming from the simulation environment, and the lower-right image comes from the real-world environment.
TABLE III: Results of real-world and simulation experiments
Environment   Methods           Explored region rate   Path length   Exploration efficiency
Real          AFCQN    mean     0.84                    –             –
                       std      0.0344                  3.9634        227.0628
              FCQN     mean     0.78                    16.18         796.37
                       std      0.0543                  4.2317        214.7032
              DQN      mean     0.54                    6.87          410.21
                       std      0.4173                  5.8931        340.1092
              Frontier mean     –                       –             –
                       std      0.0325                  6.2216        347.7612
Simulation    AFCQN    mean     –                       –             –
                       std      –                       –             –
              FCQN     mean     0.87                    16.58         925.74
                       std      0.0473                  8.3158        314.8210
              DQN      mean     0.81                    17.23         806.94
                       std      0.1917                  9.0315        425.1479
              Frontier mean     –                       –             –
                       std      –                       –             –

Fig. 16: The exploration trajectories of AFCQN in the trials. The red trajectories come from the simulation experiments while the blue trajectories come from the real experiments.
VII. CONCLUSION
In this paper, we construct a novel framework for robot exploration based on mapping, decision, and planning modules, and we give a general model for each module. Under this framework, the modules are mutually independent and each can be implemented by various methods.

The key contribution of this paper is a decision algorithm based on DRL which takes the partial map as input. In order to improve the training efficiency and the generalization performance, we attach an edge segmentation task to train the feature-extracting network in an auxiliary manner.

Since the proposed framework is combined with existing navigation algorithms, the algorithms based on this framework learn faster than the end-to-end methods [20] [22], which undergo thousands of episodes. The experiments show that AFCQN has better generalization performance on different maps compared with end-to-end methods and the standard DQN. Compared with the frontier-based method, AFCQN can avoid the failure case and make use of the properties of the sensor to explore the environment more efficiently, which makes the decision module look more intelligent. The experiments in the physical world show that our algorithm is easy to transfer from the simulator to the physical world with tolerable performance.

However, several directions remain for further study. Since the gridded map is used to generate discrete actions, the final decision sequence based on the discrete action space is suboptimal. The Hilbert occupancy map [38] is a potential substitute, which can be used to find the optimal exploration path in a given map; how to combine this map representation with policy gradient methods is a valuable topic. Deep recurrent neural networks can be adopted to improve the online learning performance [39], and they can be introduced into this task to augment the memory of the algorithm. In addition, how to embed the available auxiliary information and domain knowledge into the DRL method is also a very meaningful research topic for automatic exploration.
REFERENCES

[1] B. Yamauchi, "A frontier-based approach for autonomous exploration," in Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA). IEEE, 1997, pp. 146–151.
[2] F. Bourgault, A. A. Makarenko, S. B. Williams, B. Grocholsky, and H. F. Durrant-Whyte, "Information based adaptive robotic exploration," in Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), vol. 1. IEEE, 2002, pp. 540–545.
[3] H. H. González-Baños and J.-C. Latombe, "Navigation strategies for exploring indoor environments," The International Journal of Robotics Research, vol. 21, no. 10-11, pp. 829–848, 2002.
[4] M. Juliá, A. Gil, and O. Reinoso, "A comparison of path planning strategies for autonomous exploration and mapping of unknown environments," Autonomous Robots, vol. 33, no. 4, pp. 427–444, 2012.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[8] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998, vol. 1.
[9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[10] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis, "Optimal and autonomous control using reinforcement learning: A survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2042–2062, 2018.
[11] Y. Zhu, D. Zhao, H. He, and J. Ji, "Event-triggered optimal control for partially unknown constrained-input systems via adaptive dynamic programming," IEEE Transactions on Industrial Electronics, vol. 64, no. 5, pp. 4101–4109, 2017.
[12] D. Zhao, Z. Xia, and Q. Zhang, "Model-free optimal control based intelligent cruise control with hardware-in-the-loop demonstration [research frontier]," IEEE Computational Intelligence Magazine, vol. 12, no. 2, pp. 56–69, 2017.
[13] Y. Zhu, D. Zhao, and X. Li, "Iterative adaptive dynamic programming for solving unknown nonlinear zero-sum game based on online data," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 714–725, 2017.
[14] Q. Zhang, D. Zhao, D. Wang, and Y. Zhu, "Experience replay for optimal control of nonzero-sum game systems with unknown dynamics," IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 854–865, 2016.
[15] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[16] A. Stentz, D. Fox, and M. Montemerlo, "FastSLAM: A factored solution to the simultaneous localization and mapping problem with unknown data association," in Proceedings of the AAAI National Conference on Artificial Intelligence (AAAI). Citeseer, 2003.
[17] G. Grisetti, R. Kummerle, C. Stachniss, and W. Burgard, "A tutorial on graph-based SLAM," IEEE Intelligent Transportation Systems Magazine, vol. 2, no. 4, pp. 31–43, 2010.
[18] V. Ila, J. M. Porta, and J. Andrade-Cetto, "Information-based compact pose SLAM," IEEE Transactions on Robotics, vol. 26, no. 1, pp. 78–93, 2010.
[19] G. E. Monahan, "State of the art – a survey of partially observable Markov decision processes: theory, models, and algorithms," Management Science, vol. 28, no. 1, pp. 1–16, 1982.
[20] L. Tai and M. Liu, "Mobile robots exploration through cnn-based reinforcement learning," Robotics and Biomimetics, vol. 3, no. 1, p. 24, 2016.
[21] S. Bai, F. Chen, and B. Englot, "Toward autonomous mapping and exploration for mobile robots through deep supervised learning," in Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 2379–2384.
[22] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., "Learning to navigate in complex environments," in International Conference on Learning Representations (ICLR), 2016.
[23] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in Proceedings of the IEEE International Conference on Machine Learning (ICML), vol. 2017. IEEE, 2017.
[24] O. Zhelo, J. Zhang, L. Tai, M. Liu, and W. Burgard, "Curiosity-driven exploration for mapless navigation with deep reinforcement learning," arXiv preprint arXiv:1804.00456, 2018.
[25] J. Yao, C. Lin, X. Xie, A. J. Wang, and C.-C. Hung, "Path planning for virtual human motion using improved A* star algorithm," in Proceedings of the IEEE International Conference on Information Technology: New Generations (ITNG). IEEE, 2010, pp. 1154–1158.
[26] C. Rösmann, F. Hoffmann, and T. Bertram, "Kinodynamic trajectory optimization and control for car-like robots," in Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 5681–5686.
[27] E. Parisotto and R. Salakhutdinov, "Neural map: Structured memory for deep reinforcement learning," arXiv preprint arXiv:1702.08360, 2017.
[28] Q. Zhang, D. Zhao, and D. Wang, "Event-based robust control for uncertain nonlinear systems using adaptive dynamic programming," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 1, pp. 37–50, 2018.
[29] Y. F. Chen, M. Liu, M. Everett, and J. P. How, "Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 285–292.
[30] D. Li, D. Zhao, Q. Zhang, and Y. Chen, "Reinforcement learning and deep learning based lateral control for autonomous driving," IEEE Computational Intelligence Magazine, vol. 14, no. 2, pp. 83–98, 2019.
[31] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "DeepDriving: Learning affordance for direct perception in autonomous driving," in Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2015, pp. 2722–2730.
[32] D. Zhao, Y. Chen, and L. Lv, "Deep reinforcement learning with visual attention for vehicle classification," IEEE Transactions on Cognitive and Developmental Systems, vol. 9, no. 4, pp. 356–367, 2017.
[33] Y. Chen, D. Zhao, L. Lv, and Q. Zhang, "Multi-task learning for dangerous object detection in autonomous driving," Information Sciences, vol. 432, pp. 559–571, 2018.
[34] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," arXiv preprint arXiv:1511.06581, 2015.
[35] E. B. Olson, "Real-time correlative scan matching," Ann Arbor, vol. 1001, p. 48109, 2009.
[36] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A general framework for graph optimization," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 3607–3613.
[37] B. Gerkey, R. T. Vaughan, and A. Howard, "The Player/Stage project: Tools for multi-robot and distributed sensor systems," in Proceedings of the IEEE International Conference on Advanced Robotics (ICAR), vol. 1. IEEE, 2003, pp. 317–323.
[38] F. Ramos and L. Ott, "Hilbert maps: scalable continuous occupancy mapping with stochastic gradient descent," The International Journal of Robotics Research, vol. 35, no. 14, pp. 1717–1730, 2016.
[39] T. Ergen and S. S. Kozat, "Efficient online learning algorithms based on LSTM neural networks,"