Accelerated Sim-to-Real Deep Reinforcement Learning: Learning Collision Avoidance from Human Player
Hanlin Niu, Ze Ji, Farshad Arvin, Barry Lennox, Hujun Yin, Joaquin Carrasco
Abstract — This paper presents a sensor-level mapless collision avoidance algorithm for use in mobile robots that maps raw sensor data to linear and angular velocities and navigates in an unknown environment without a map. An efficient training strategy is proposed to allow a robot to learn from both human experience data and self-exploratory data. A game-format simulation framework is designed to allow the human player to tele-operate the mobile robot to a goal, and the human actions are also scored using the reward function. Both human player data and self-playing data are sampled using the prioritized experience replay algorithm. The proposed algorithm and training strategy were evaluated in two different experimental configurations: Environment 1, a simulated cluttered environment, and Environment 2, a simulated corridor environment, to investigate the performance. It was demonstrated that the proposed method achieved the same level of reward using only 16% of the training steps required by the standard Deep Deterministic Policy Gradient (DDPG) method in Environment 1 and 20% of that in Environment 2. In the evaluation of 20 random missions, the proposed method achieved no collision with less than 2 h and 2.5 h of training time in the two Gazebo environments, respectively. The method also generated smoother trajectories than DDPG. The proposed method has also been implemented on a real robot in a real-world environment for performance evaluation. We can confirm that the model trained in simulation can be applied directly to the real-world scenario without further fine-tuning, further demonstrating its higher robustness than DDPG. The video and code are available at https://youtu.be/BmwxevgsdGc and https://github.com/hanlinniu/turtlebot3_ddpg_collision_avoidance
I. INTRODUCTION
Navigation is one of the key problems in autonomous mobile robots, involving a combination of algorithms for localisation, path planning and collision avoidance. To perform reactive behaviours in the presence of dynamic obstacles, computational efficiency is an important criterion for evaluating navigation algorithms. Robot localisation can be achieved via various methods, including beacon-based methods, such as WiFi localisation, visible-light methods and Ultra-wideband (UWB) solutions, allowing navigation for robots without pre-calibrating the maps. More popularly, autonomous navigation usually involves Simultaneous Localisation and Mapping (SLAM), which allows a robot to build the map of an unknown environment beforehand or simultaneously, using sensors such as laser range finders or vision cameras.

*This work was supported by EPSRC project No. EP/S03286X/1 and EPSRC RAIN project No. EP/R026084/1. (Corresponding author: Hanlin Niu)
H. Niu, F. Arvin, B. Lennox, H. Yin and J. Carrasco are with the Department of Electrical & Electronic Engineering, The University of Manchester, Manchester, UK. {[email protected]}
Z. Ji is with the School of Engineering, Cardiff University, Cardiff, UK. [email protected]
Fig. 1: A learning-based mapless collision avoidance algorithm and training strategy were proposed by utilizing human demonstration data and simulated data.

Path planning and collision avoidance can then be performed using a combination of global and local planners with the full or partial map information. However, it is time-consuming to build a complex and precise map using sensory data that is measured by tele-operating the robot to navigate around the environment. In addition, autonomous navigation in highly dynamic and unstructured environments with moving obstacles remains challenging.

The goal of this paper is to design a learning-based collision avoidance algorithm that generates continuous commands without a map, using only range data and the relative goal position to produce the reactive behaviour, as shown in Fig. 1. Most deep reinforcement learning based algorithms are trained in simulation environments, including V-REP, PyGame, OpenAI, Gazebo, etc. The gap between the simulation environment and the highly complicated real-world environment is the main challenge in transferring the learned experience to the real mobile robot directly. In this paper, we use the laser data to abstract the distance between the mobile robot and its surroundings. We also configure the simulation as a game-like environment, allowing human players to tele-operate the mobile robot to avoid the obstacles and navigate to the goal. A reward function is designed to score the human actions. A human player data logger is designed to record the human action data, robot state data and reward data. The human experience and robot experience are then mixed. The prioritised experience replay approach [1] is applied to make the robot learn from the human data and the robot data more efficiently.

The fast learning capability of the proposed approach is validated using a TurtleBot3 mobile robot equipped with a laser sensor in both simulation and real-world environments. The method is compared with the standard DDPG policy, which only learns from the simulated data. The main contribution of this paper is three-fold: 1) a deep reinforcement learning-based mapless collision avoidance algorithm is proposed; 2) an efficient learning strategy is proposed by enabling the autonomous agent to learn from both a simulated agent and a human player, so that the number of training steps is only 16% and 20% of that of standard DDPG in the two simulation environments, respectively; and 3) the trained policy from the simulated environment can be deployed onto the real mobile robot without further fine-tuning. In addition, the code of the proposed algorithm and the framework for training the robot is released as open source.

II. RELATED WORK
With the development of computational hardware and access to large amounts of training data, deep learning has shown great potential in solving computer vision and navigation problems. A decentralized collision avoidance policy is proposed in [2], where a multi-scenario multi-stage training framework is adopted to train the policy. It is demonstrated that the learned policy can be used for navigating a large number of robots in the simulated environment. A successor-feature-based deep reinforcement learning algorithm is proposed in [3] that can reduce the training time when transferring knowledge from previously learned experience to new scenarios. Tai et al. [4] proposed an end-to-end learning-based mapless motion planner by using laser data and the relative target information as input data. The policy outputs command data directly and its efficiency is demonstrated in unknown simulated and real scenarios. A novel deep reinforcement learning method, integrating an LSTM agent and a Local-Map Critic, is proposed by Choi et al. [5]. This method enables the mobile robot to navigate in a complex environment with a limited field of view. The mobile robot, using an NVIDIA Jetson TX2 as a processor and an Intel RealSense D435 with a FOV of 90°, outperforms the methods having a wider FOV. A value-based network is applied to generate velocity commands in real time while also considering the uncertainty of other agents. Compared with optimal reciprocal collision avoidance, the proposed method shows more than 26% improvement in terms of the required time to reach the goal. Deep reinforcement learning is not only applied to avoid collisions in the near field; a long-term path planning algorithm is proposed by Zhu et al. [6] to navigate the agent to visit unexplored environments. It is realized by integrating the deep reinforcement learning model with a greedy strategy.

Deep reinforcement learning has also been implemented on social mobile robots. A social-norms-based deep reinforcement learning algorithm is proposed by Chen et al. [7] that models subtle human behaviors and navigation rules for robot navigation, including passing, crossing and overtaking. The mobile robot is demonstrated to navigate through a pedestrian-rich environment at a human walking speed. Everett et al. [8] propose a strategy using an LSTM to enable the autonomous agent to navigate through an arbitrary number of agents, extending previous work that could only handle a fixed number of agents. An attention-based motion planner and a deep reinforcement learning algorithm are integrated by Chen et al. [9], which enables the agent to learn from human-human interactions in crowds. It is demonstrated that the agent can anticipate human behaviours and navigate itself through the crowds. A composite reinforcement learning framework is proposed by Ciou et al. [10]. The proposed algorithm makes it possible for the system to keep collecting human feedback to adjust the rewards to the social norms, which enables the agent to modify its velocity and perform human-robot interaction in social environments. Kohari et al. [11] applied the deep deterministic policy gradient (DDPG) algorithm on an attendant robot to generate adaptive behaviors based on the classified states of humans, including walking, standing, sitting and talking. Deep reinforcement learning has also demonstrated its superiority over traditional control algorithms, such as MPC (Model Predictive Control) [12], in reducing computational time, as validated on unmanned aerial vehicles for fast maneuvers.
A fast reactive navigation algorithm is proposed by Sampedro et al. [13]; the algorithm applies 2D laser range measurements and the relative goal position as the input to the deep neural network. A DJI Matrice 100 quadcopter, equipped with a Hokuyo UTM-30LX laser rangefinder, is trained in a Gazebo-based 3D simulation environment. The learned experience is carried over to an unknown indoor environment, with the presence of static and dynamic obstacles. The DDPG algorithm is applied by Rodriguez-Ramos et al. [14] to enable a Parrot Bebop 2 drone to land on a moving platform. The drone model is trained in the Gazebo 3D environment with the RotorS UAV simulator and then applied in the real-world environment. Haksar and Schwager [15] proposed a computationally efficient algorithm using a Multi-Agent Deep Q Network (MADQN) to control a swarm of UAVs to fight forest fires. Its scalability is demonstrated by carrying out missions with various forest sizes and different numbers of UAVs with various model parameters.

The algorithms mentioned above are mainly trained using data from a simulated agent. However, learning from a non-experienced agent makes the learning process inefficient. This research enables the proposed mapless collision avoidance algorithm to learn not only from a simulated agent but also from a human player, making the learning process faster and better.

III. PROBLEM FORMULATION
The collision avoidance problem for mobile robots is defined in the context of an autonomous agent moving on the Euclidean plane in the presence of obstacles. At each time step $t$, the robot has access to the surrounding observation state $s_t$ and generates the action, a velocity command $v_t$, that allows collision-free navigation of the robot to the goal. The relative position of the goal is denoted by $g_t$. Instead of a perfect observation of the whole map, we assume that our robot has only a partial observation within the range of the laser sensor, together with the relative goal position. This assumption makes our method more practical and robust in real-world environments. The velocity $v_{t-1}$ of the last time step is also considered. Therefore, the problem can be formulated as finding the translation function:

$$v_t = f(s_t, g_t, v_{t-1}) \tag{1}$$

IV. APPROACH
A. Reinforcement learning setup

1) Observation space and action space:
The observation state $s_t$ consists of laser data $l_t$, velocity $v_t$ and the relative target position $g_t$. $l_t$ is a 1D array of laser readings and it is normalized by the maximum sensor range. The velocity data $v_t$ represents the linear velocity and the rotational velocity of the agent. The relative target position $g_t$ is calculated in polar coordinates (relative distance and angle with respect to the current position). The neural network outputs continuous velocity commands to the robot. The action space of the neural network consists of the linear velocity $l_v$ and the angular velocity $a_v$.
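As a concrete illustration of how such an observation could be assembled, the sketch below builds the 28-dimensional state vector from 24 normalized laser readings, the polar goal coordinates and the previous velocity command. The helper name, argument layout and normalisation constant are illustrative assumptions rather than the released implementation.

```python
import numpy as np

def build_state(laser_ranges, robot_pose, goal_xy, prev_cmd, max_range=3.5):
    """Assemble the 28-D observation: 24 laser beams + 2-D goal + 2-D velocity.

    laser_ranges: 24 range readings (m), already restricted to the front 180 deg.
    robot_pose:   (x, y, yaw) of the robot in the world frame.
    goal_xy:      (x, y) of the goal in the world frame.
    prev_cmd:     (linear_v, angular_v) executed at the previous step.
    All names and constants here are illustrative assumptions.
    """
    # Normalise laser readings by the maximum sensor range.
    laser = np.clip(np.asarray(laser_ranges, dtype=np.float32), 0.0, max_range) / max_range

    # Relative goal position in polar coordinates (distance, heading error).
    dx, dy = goal_xy[0] - robot_pose[0], goal_xy[1] - robot_pose[1]
    goal_dist = np.hypot(dx, dy)
    goal_angle = np.arctan2(dy, dx) - robot_pose[2]
    goal_angle = np.arctan2(np.sin(goal_angle), np.cos(goal_angle))  # wrap to [-pi, pi]

    # Concatenate into the 24 + 2 + 2 = 28-dimensional state vector.
    return np.concatenate([laser, [goal_dist, goal_angle], prev_cmd]).astype(np.float32)
```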
2) Reward space:
As the mobile robot should navigate to the goal while keeping a clear distance from obstacles, the reward is set with consideration of the clearance distance and the step distance travelled towards the goal. The reward function is given as:

$$r = r_g + r_c + r_{av} + r_{lv} \tag{2}$$

where $r$ stands for the total reward, $r_g$ denotes the distance reward, $r_c$ represents the collision reward, and $r_{av}$ and $r_{lv}$ represent the angular velocity punishment (negative reward) and the linear velocity punishment, respectively. $r_g$ is calculated using:

$$r_g = \begin{cases} r_{arrival} & \text{if } d_g < d_{g\min} \\ \Delta d_g & \text{otherwise} \end{cases} \tag{3}$$

where $d_g$ denotes the distance to the goal. When $d_g$ is less than a threshold $d_{g\min}$, the robot is considered to have arrived at the goal and receives a reward $r_{arrival}$; otherwise it receives a reward equal to the distance it travelled towards the goal during the last time step. The collision reward $r_c$ can be calculated by:

$$r_c = \begin{cases} k_c\, r_{collision} & \text{if } d_{ro\min} \le d_{ro} < k_d\, d_{ro\min} \\ r_{collision} & \text{if } d_{ro} < d_{ro\min} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $d_{ro}$ denotes the distance between the robot and obstacles, $d_{ro\min}$ represents the threshold value of $d_{ro}$, $r_{collision}$ denotes the collision reward, and $k_c$ and $k_d$ are constant coefficients that apply a reduced penalty within a clearance band just outside $d_{ro\min}$. The negative rewards $r_{av}$ and $r_{lv}$ are introduced to ensure that the algorithm produces smooth trajectories, and they can be calculated by:

$$r_{av} = \begin{cases} r_{ap} & \text{if } |a_v| > k_a\, |a_{v\max}| \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

$$r_{lv} = \begin{cases} r_{lp} & \text{if } l_v < l_{v\min} \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $a_{v\max}$ denotes the maximum angular velocity, $k_a$ is a constant threshold fraction, and $l_{v\min}$ represents the minimum linear velocity threshold. The robot receives the punishment reward $r_{ap}$ or $r_{lp}$ when its angular velocity $|a_v|$ is larger than the threshold $k_a |a_{v\max}|$ or when its linear velocity $l_v$ is lower than the threshold $l_{v\min}$. Note that although a sigmoid-shaped function, which has an asymptotic limit, is used for generating the angular velocity command, we still add a second threshold to limit the angular velocity to a smaller range; this makes it possible to maneuver with a fast angular velocity, although it is not encouraged.
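A minimal sketch of how this reward could be evaluated at each step is given below. The numeric constants are placeholder assumptions chosen for readability, not the tuned values used in the paper.

```python
def step_reward(d_goal, d_goal_prev, d_obstacle, lin_v, ang_v,
                d_goal_min=0.3, d_obs_min=0.2, band=2.0, band_scale=0.5,
                r_arrival=100.0, r_collision=-100.0, r_ap=-1.0, r_lp=-1.0,
                lin_v_min=0.05, ang_v_max=1.5, ang_frac=0.8):
    """Compute r = r_g + r_c + r_av + r_lv (Eqs. 2-6).
    All numeric defaults are illustrative assumptions, not the paper's values."""
    # Distance reward: arrival bonus, otherwise progress made during the last step.
    if d_goal < d_goal_min:
        r_g = r_arrival
    else:
        r_g = d_goal_prev - d_goal  # Delta d_g

    # Collision reward: full penalty on collision, scaled penalty in the clearance band.
    if d_obstacle < d_obs_min:
        r_c = r_collision
    elif d_obstacle < band * d_obs_min:
        r_c = band_scale * r_collision
    else:
        r_c = 0.0

    # Angular velocity punishment for sharp turns.
    r_av = r_ap if abs(ang_v) > ang_frac * ang_v_max else 0.0
    # Linear velocity punishment for crawling or stopping.
    r_lv = r_lp if lin_v < lin_v_min else 0.0

    return r_g + r_c + r_av + r_lv
```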
B. Network architecture

In this work, the DDPG algorithm is applied as the base of the proposed method. As shown in Fig. 2, the input of the actor network is a 28-dimensional vector, comprising the 24-dimensional laser data merged with the 2-dimensional relative goal information and the 2-dimensional robot action information. The input data is connected to three dense layers (500 nodes each) and the actor network produces the linear velocity and the angular velocity by using a sigmoid function and a hyperbolic tangent function, respectively. These two velocity commands are merged as the action input of the critic network. The state input of the critic network is the same as the input of the actor network and it is processed by three dense layers (500 nodes each); the action input is integrated at the second dense layer. Finally, the critic network generates the Q-value through a linear activation function.
Fig. 2: The architecture of the actor network
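For concreteness, the actor and critic described above can be sketched as follows (written here in PyTorch; the released implementation may use a different framework, and the layer names are illustrative).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """28-D state -> (linear velocity, angular velocity)."""
    def __init__(self, state_dim=28, hidden=500):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.lin_head = nn.Linear(hidden, 1)   # sigmoid -> forward speed in [0, 1]
        self.ang_head = nn.Linear(hidden, 1)   # tanh    -> turn rate in [-1, 1]

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        lin_v = torch.sigmoid(self.lin_head(x))
        ang_v = torch.tanh(self.ang_head(x))
        return torch.cat([lin_v, ang_v], dim=-1)

class Critic(nn.Module):
    """Q(s, a): the 2-D action is injected at the second dense layer."""
    def __init__(self, state_dim=28, action_dim=2, hidden=500):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden + action_dim, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.q_head = nn.Linear(hidden, 1)     # linear activation for the Q-value

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(torch.cat([x, action], dim=-1)))
        x = torch.relu(self.fc3(x))
        return self.q_head(x)
```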
C. Human player data collection
The demonstration data is collected through tele-operation of the robot by human players, as shown in Fig. 4. The laser data $l_t$, velocity data $v_t$ and relative target position data $g_t$ are merged as the current state $s_t$ and stored. The velocity $v_t$ is the action performed by the human. The next time-step state $s_{t+1}$ is recorded by reading the robot state from the Gazebo environment directly. $s_t$, $v_t$ and $s_{t+1}$ are used to evaluate the human player's score $r_t$, using the same reward function designed for the autonomous missions. Finally, $s_t$, $v_t$, $s_{t+1}$ and $r_t$ are stored in the experience replay buffer, allowing the agent to maintain these transitions to propagate the rewards throughout the value function.

Fig. 3: The architecture of the critic network

Fig. 4: Human player data collection
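A simple sketch of the human-player data logger described above might look as follows; the class interface, field ordering and the demonstration flag are assumptions for illustration.

```python
from collections import deque

class DemoLogger:
    """Record tele-operation transitions (s_t, v_t, r_t, s_{t+1}) for the replay buffer.
    The class and method names here are illustrative assumptions."""
    def __init__(self, maxlen=1000):
        self.buffer = deque(maxlen=maxlen)

    def log_step(self, state, human_action, next_state, reward_fn, done):
        # Score the human action with the same reward function used in autonomous training.
        reward = reward_fn(state, human_action, next_state)
        # Flag the transition as a demonstration so PER can boost its sampling priority.
        self.buffer.append((state, human_action, reward, next_state, done, True))
        return reward
```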
D. Prioritized experience replay

Instead of sampling historical data uniformly, the Prioritized Experience Replay (PER) algorithm [1] is applied to ensure that important historical data is sampled more frequently. Along with the experience buffer, the priority information of each transition is calculated by (7) and stored in the priority tree:

$$P_i = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}} \tag{7}$$

where the sampling probability of the $i$-th transition is denoted by $P_i$, $\alpha$ denotes the distribution factor, and $p_i$ stands for the priority of a set of transition data, formulated in (8):

$$p_i = \delta_i + \lambda \left| \nabla_a Q(s_i, a_i \mid \theta^Q) \right| + \varepsilon + \varepsilon_D \tag{8}$$

where $\delta_i$ stands for the TD (temporal difference) error, the second term $\lambda \left| \nabla_a Q(s_i, a_i \mid \theta^Q) \right|$ denotes the actor loss, $\lambda$ is a contribution factor, and $\varepsilon$ represents a small positive sampling probability for each transition, which ensures that every transition can still be sampled. $\varepsilon_D$ is applied to increase the sampling probability of the demonstration data. Each transition is weighted by:

$$\omega_i = (N \cdot P_i)^{-\beta} \tag{9}$$

where $\omega_i$ denotes the importance-sampling weight used for updating the network, indicating the importance of each transition, $N$ stands for the batch size, and $\beta$ is a constant value for adjusting the sampling weight.
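The priority and importance-sampling computations in (7)–(9) can be sketched as follows. This is a simplified NumPy illustration: a practical implementation would store priorities in a sum-tree for efficient sampling, and the parameter values shown are assumptions.

```python
import numpy as np

def transition_priority(td_error, actor_grad_norm, is_demo,
                        lam=1.0, eps=1e-3, eps_demo=1.0):
    """Eq. (8): priority from the TD error, the actor loss term and a demonstration bonus.
    abs() keeps the priority non-negative."""
    return abs(td_error) + lam * abs(actor_grad_norm) + eps + (eps_demo if is_demo else 0.0)

def sample_batch(priorities, batch_size, alpha=0.6, beta=0.4, rng=np.random):
    """Eqs. (7) and (9): sample indices by priority and return importance weights."""
    p_alpha = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p_alpha / p_alpha.sum()                      # Eq. (7)
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)       # Eq. (9)
    weights /= weights.max()                             # normalise for stable updates
    return idx, weights
```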
Fig. 5: TurtleBot3 Waffle Pi robotic platform.

V. EXPERIMENTS AND RESULTS

A. Training and experiment setup
The TurtleBot3 Waffle Pi mobile robot was used in both the simulation and the real-world environment, as shown in Fig. 5. The training process was undertaken in the Gazebo environment, which minimises the effort to transfer the learned experience from simulation to the real-world, real-robot experiment, as both the simulated and real models use the same ROS interface and the Gazebo world also simulates the dynamics of the robot. The robot model was trained in a cluttered environment (Environment 1) and a narrow corridor environment (Environment 2), as shown in Fig. 6a and Fig. 7a. A small red ball was used to represent the target and was randomly placed in each episode. The blue lines visualize the laser information around the robot. The robot subscribes to the laser data from a 360° Laser Distance Sensor LDS-01, which has a 360° field of view and a range of 0.12 m to 3.5 m. In this work, we only use 24 laser readings from (−90°, 90°), as we assume the robot will only move forward.

The training was carried out on a computational platform with an NVIDIA TITAN RTX GPU. The model was trained using the Adam optimiser [16]. The learning rates of the actor and critic networks were set to 0.0001.
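As an illustration of the beam selection described above, the sketch below reduces a 360-beam LDS-01 scan to 24 readings covering the front half of the robot; the indexing convention and the handling of invalid returns are assumptions rather than the released code.

```python
import numpy as np

def front_24_beams(scan_ranges, max_range=3.5):
    """Reduce a 360-beam scan (one reading per degree, 0 deg = straight ahead)
    to 24 evenly spaced readings covering (-90 deg, 90 deg).
    The indexing convention here is an assumption for illustration."""
    scan = np.asarray(scan_ranges, dtype=np.float32)
    # Treat invalid (inf/NaN/zero) returns as free space at the maximum range.
    scan = np.where(np.isfinite(scan) & (scan > 0.0), scan, max_range)
    angles = np.linspace(-90, 90, 24, endpoint=False).astype(int) % 360  # front half only
    return np.clip(scan[angles], 0.0, max_range)
```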
B. Training evaluation

The standard DDPG algorithm was implemented for comparison. The standard DDPG algorithm used the same actor and critic networks as the proposed method. The difference is that the standard DDPG method learns from scratch and samples uniformly from the history data without taking priority into consideration. The proposed method not only uses human demonstration data but also implements the PER algorithm. In this research, 1000 sets of transition data from human demonstrators were recorded for both environments.
Fig. 6: (a) Environment 1 in Gazebo, (b) Q-value comparison and (c) reward comparison of the proposed method with DDPG in Environment 1.
Fig. 7: (a) Environment 2 in Gazebo, (b) Q-value comparison and (c) reward comparison of the proposed method with DDPG in Environment 2.
1) Evaluation in simulated environment:
The Q-value and reward of both the DDPG and the proposed method during the training procedure were recorded for comparison, as shown in Fig. 6 and Fig. 7. The mean values of the Q-value and reward over 4000 consecutive steps are used to reduce the effect of the randomly placed missions. In Environment 1, the Q-value and reward of the proposed method increase much faster than those of the standard DDPG. The Q-value of the proposed method reaches and maintains a level in the range from 80 to 100 in fewer than 10000 steps, while the standard DDPG achieves 80 only after around 60000 steps and its Q-value stays between 60 and 80 afterwards. The trend of the reward is similar to that of the Q-value. The reward of the proposed method stays between 5 and 10 after only about 10000 steps, whereas the standard DDPG requires 60000 steps to reach a range of only 4 to 8. Fig. 6b and Fig. 6c indicate that the proposed method requires only 16% of the training steps of the standard DDPG and achieves better performance. In Environment 2, the proposed method still achieves a higher Q-value and reward faster than DDPG. It is found that the Q-value and reward are lower than those in Environment 1, because the robot has to navigate through narrow obstacles and ends with a lower collision reward. As shown in Fig. 7b, the Q-values of the proposed method and DDPG reach 20 within about 10000 steps and 50000 steps, respectively. The average Q-value of the subsequent steps of the proposed method is also higher than that of DDPG. The reward follows the same trend as the Q-value, as shown in Fig. 7c. It can be concluded that the proposed method uses only 20% of the training steps required by the standard DDPG and achieves a higher Q-value and reward in Environment 2.

Twenty random missions were used to evaluate the performance of the two algorithms. In Environment 1, it was found that the proposed method achieved no collision in 20 missions after only 25000 steps, which cost only 1 hour and 53 minutes of training. The standard DDPG still encountered 7 collisions after 60000 training steps and 1 collision after 70000 training steps. It was also noted that the proposed method generated smoother trajectories and navigated the robot to the goal in a shorter time. In Environment 2, the proposed method achieved no collision in 20 missions after 35000 steps, which cost 2 hours and 25 minutes of training, while the standard DDPG still had 3 collisions in 20 trials after 80000 steps and also exhibited unnecessary turning manoeuvres.
2) Evaluation in real-world environment:
To test the performance of transferring the learned knowledge directly from simulation to the real world, the models trained in the Gazebo Environment 2 were tested in the corridor of the Sackville Street Building at the University of Manchester, as shown in Fig. 8a. Four bins were placed in the middle of the corridor to test the collision avoidance capability of both algorithms. The final goal was set behind the fourth bin. In Fig. 8b and Fig. 8c, the yellow lines represent the trajectories of the robot. The proposed method navigated the robot past all four obstacles, while using the standard DDPG, the robot collided with the wall at the first turn.
Fig. 8: Comparison of the proposed method and DDPG in the real-world environment: (a) corridor of the Sackville Street Building, (b) trajectory of the proposed method, (c) trajectory of standard DDPG.
Five sets of the same task were repeated; the proposed method completed all missions, while none of the missions could be completed by DDPG. Note that the robot position and the goal position were estimated using the TurtleBot3's built-in AMCL algorithm, and the map was only used for demonstration purposes. The robot only senses the obstacles within its laser detection range and the decision is made from its current sensor reading. In Fig. 8b, the yellow trajectory crosses the first bin, which indicates that our robot could maintain its collision avoidance capability in the presence of inaccuracy in the localisation algorithm. We also tested placing a sudden obstacle in front of the robot to evaluate its obstacle avoidance capability in a dynamic environment, and the robot could still avoid the obstacle.

VI. CONCLUSION AND FUTURE WORK
In this paper, a deep reinforcement learning based collision avoidance algorithm was proposed. The proposed method uses human demonstration data and the prioritized experience replay algorithm to improve the performance of the standard DDPG algorithm. In simulated Environment 1, the number of training steps of the proposed method was only 16% of that required by the DDPG algorithm, while achieving a higher reward and Q-value overall. In Environment 2, the proposed method needed only 20% of the training steps of standard DDPG. A quantitative comparison was also given by comparing the two algorithms in 20 random missions. It was found that the proposed method needs less than 2 h of training time to achieve no collision in Environment 1 and 2.5 h in Environment 2, while still generating smoother trajectories than standard DDPG. The proposed method was also tested in an unknown real-world environment without further fine-tuning. It was found that the proposed method was more robust than DDPG in real-world applications. In future work, RGB images and depth information will be considered to analyse and predict the behaviours of pedestrians, so that the behaviour of the robot can be adjusted to human preference. Also, the LSTM algorithm can be applied to improve the long-term navigation capability of the current method.
REFERENCES
[1] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards," arXiv preprint arXiv:1707.08817, 2017.
[2] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, "Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 6252–6259.
[3] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, "Deep reinforcement learning with successor features for navigation across similar environments," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 2371–2378.
[4] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 31–36.
[5] J. Choi, K. Park, M. Kim, and S. Seok, "Deep reinforcement learning of navigation in a complex and crowded environment with a limited field of view," in International Conference on Robotics and Automation (ICRA), 2019, pp. 5993–6000.
[6] D. Zhu, T. Li, D. Ho, C. Wang, and M. Q.-H. Meng, "Deep reinforcement learning supervised autonomous exploration in office environments," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 7548–7555.
[7] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1343–1350.
[8] M. Everett, Y. F. Chen, and J. P. How, "Motion planning among dynamic, decision-making agents with deep reinforcement learning," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 3052–3059.
[9] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," in International Conference on Robotics and Automation (ICRA), 2019, pp. 6015–6022.
[10] P.-H. Ciou, Y.-T. Hsiao, Z.-Z. Wu, S.-H. Tseng, and L.-C. Fu, "Composite reinforcement learning for social robot navigation," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 2553–2558.
[11] Y. Kohari, J. Miura, and S. Oishi, "Generating adaptive attending behaviors using user state classification and deep reinforcement learning," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 548–555.
[12] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search," in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 528–535.
[13] C. Sampedro, H. Bavle, A. Rodriguez-Ramos, P. de la Puente, and P. Campoy, "Laser-based reactive navigation for multirotor aerial robots using deep reinforcement learning," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1024–1031.
[14] A. Rodriguez-Ramos, C. Sampedro, H. Bavle, I. G. Moreno, and P. Campoy, "A deep reinforcement learning technique for vision-based autonomous multirotor landing on a moving platform," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1010–1017.
[15] R. N. Haksar and M. Schwager, "Distributed deep reinforcement learning for fighting forest fires with a network of aerial robots," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1067–1074.
[16] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.