Autonomous Obstacle Avoidance by Learning Policies for Reference Modification
Benjamin Evans
Dept. of E&E Engineering, Stellenbosch University
Stellenbosch, South Africa
[email protected]

Hendrik W. Jordaan
Dept. of E&E Engineering, Stellenbosch University
Stellenbosch, South Africa
[email protected]

Herman A. Engelbrecht
Dept. of E&E Engineering, Stellenbosch University
Stellenbosch, South Africa
[email protected]
Abstract—The key problem for autonomous robots is how to navigate through complex, obstacle-filled environments. The navigation problem is broken up into generating a reference path to the goal and then using a local planner to track the reference path and avoid obstacles by generating velocity and steering references for a control system to execute. This paper presents a novel local planner architecture for path following and obstacle avoidance by proposing a hybrid system that uses a classic path following algorithm in parallel with a neural network. Our solution, called the modification architecture, uses a path following algorithm to follow a reference path and a neural network to modify the references generated by the path follower in order to prevent collisions. The neural network is trained using reinforcement learning to deviate from the reference trajectory just enough to avoid crashing. We present our local planner in a hierarchical planning framework, nested between a global reference planner and a low-level control system. The system is evaluated in the context of F1/10th scale autonomous racing in random forests and on a race track. The results demonstrate that our solution is able to follow a reference trajectory while avoiding obstacles using only 10 sparse laser range finder readings. Our hybrid system overcomes previous limitations by not requiring an obstacle map, human demonstrations, or an instrumented training setup.
Index Terms—Motion and Path Planning; Reinforcement Learning; Collision Avoidance
I. INTRODUCTION
The general robot navigation problem is to plan and execute a trajectory that moves a robot from a starting location to a goal location without crashing into obstacles.

Fig. 1 shows how the problem of autonomous navigation has been broken up into global and local planning, perception and control [1], [2]. Before a mission begins, a global planner generates a reference path linking the start and end points. During the mission, the loop of collecting relevant data from the environment (perception), re-planning to avoid obstacles, and then executing the plan (control) is repeated.

Fig. 1. Overview of Planning Hierarchy (block diagram with the components: Environment Map, Global Planner, Local Planner, Control System, Perception, Vehicle).

This project focuses on the local planning problem of tracking a reference path while avoiding obstacles. Our contribution is a method of local planning that enables autonomous robots to follow a global plan while avoiding collisions, without the need for maintaining an obstacle map or a special training setup. Our novel local planner features a path following algorithm which produces navigation references that follow the global plan and a neural network that is able to modify those references to ensure that obstacles are avoided. We use reinforcement learning to train the neural network to deviate to avoid collisions while keeping close to the trajectory of the global planner. The proposed solution is evaluated using the F1/10th autonomous racing framework in a random forest and on a classic race track.

This paper is laid out as follows. Previous attempts to solve the local planning problem are studied in Section II. Section III presents a theoretical overview of the proposed modification architecture. A mathematical description of the path following problem and an explanation of the pure pursuit algorithm are given in Section IV. In Section V, the learning formulation and reward signal used to train the neural network are detailed and motivated. The system is evaluated in Section VI and the results are explained and discussed.

II. RELATED WORK
The problem of local planning for autonomous navigation is to generate navigation references (velocity and steering) that cause the vehicle to approximately track the reference path while avoiding obstacles. Current solutions to the local planning problem can be broadly grouped into the categories of map-based and mapless methods.
A. Map-based Obstacle Avoidance
Obstacle avoidance has been achieved by maintaining an updated map of the environment which is used to plan a collision-free trajectory [3]. The map is continuously updated as the vehicle moves around the environment and the updated map is used to re-plan local trajectories. Maps are generally built using Simultaneous Localisation and Mapping (SLAM) algorithms, which use high dimensional sensors, such as Light Detection and Ranging (LIDAR), to construct a map of the environment [4].

Building maps requires a high dimensional sensor such as a camera or a full LIDAR scan, is computationally expensive, and may take too long to be useful for high speed applications [5]. While SLAM-based systems have the advantage of being able to operate in unknown environments and robustly avoid obstacles, they are limited to systems that have high dimensional sensors, the required hardware, and complex algorithms to build a map in real time. [5] also identifies that SLAM algorithms can be "easily induced to fail" when the robot motion or environment is too complex or changes too quickly.

B. Mapless Obstacle Avoidance
As a result of the limitations of map-based local planning, many attempts have been made to generate navigation references without maintaining an explicit obstacle map. Neural networks have been used to associate raw sensory data (a LIDAR scan) directly with velocity and steering references for mapless autonomous racing [6]. The two main methods which have been used to train neural networks to navigate are reinforcement learning (RL), where the agent learns from experience, and imitation learning (IL), which uses a teacher (either an expert demonstration or an optimal control strategy) to train the network [7], [8].

Tai et al. [9] used the deep deterministic policy gradient (DDPG) RL algorithm to map laser range finder readings directly to wheel velocities, and showed that neural networks were able to learn how to control a vehicle towards a target destination with no pre-planned path. The work done in [10] improves upon Tai's work with an improved reward signal. Despite improved performance, [10] shows that there are many complex environments, such as mazes, where an RL-trained agent is unable to find a path towards the goal, and therefore concludes that RL agents should rather be used as a low-level component in conjunction with a global planner. Wei et al. implement an RL agent as a path follower, in conjunction with a high-level A* path finder [11].

While imitation learning has been shown to be able to learn navigation policies from human expert demonstrations [12], it is limited to the performance of experts in a task and requires large data sets of demonstrations. Guided Policy Search (GPS) aims to improve upon learning from humans by using IL to teach a neural network to imitate demonstrations provided by an algorithmic teacher [13]. The teacher is set up as an optimal control strategy that has access to the full state of the environment. GPS algorithms have been applied to the problem of obstacle avoidance and shown to be effective for generating navigation references [14], [15]. The authors of [14] identify that while GPS methods are effective in some domains, they are limited to contexts where full state estimation of the system is available via an instrumented training setup.
III. THEORETICAL SOLUTION OVERVIEW
We address the problem of generating navigation references for autonomous vehicles that result in a reference trajectory being followed and obstacles being avoided. Fig. 2 shows how an ideal local planner will approximately follow a reference path when possible and deviate from it the minimum amount in order to smoothly avoid obstacles. The solution should be able to generate the desired behaviour without requiring an updated obstacle map, human demonstrations, or an instrumented training setup.
Fig. 2. Ideal Obstacle Avoidance Behaviour: the green dashed line represents the reference trajectory and the red line shows the ideal path around the blue obstacle.
We present a novel local planner which features a path following algorithm and a neural network in parallel as an attempt to generate the behaviour shown in Fig. 2. Fig. 3 shows a schematic of how the neural network interacts with the path follower. The path following algorithm is responsible for following the global plan and the neural network is able to modify the plan in order to prevent collisions.
Fig. 3. Proposed Local Planner Architecture: a path follower is used to generate navigation references, which the neural network is able to modify. The local planner receives the global plan and the vehicle state and outputs the modified navigation references.
The action which the neural network outputs is scaled and then added to the path follower references, thus allowing the neural network to modify the path being followed. The path following algorithm generates steering and velocity references that result in the vehicle tracking the reference path. The path follower's velocity and steering references, the vehicle's current velocity and steering, and the laser range finder readings are fed into the neural network as the state vector. Finally, the modified references are returned to the low-level control systems to be executed.

The neural network is trained using RL. At each time step, the agent receives a state vector, selects an action and then receives a reward. This architecture provides for an intuitive reward function where the agent can be easily punished for deviating from the local plan or colliding with an obstacle.

The path follower and neural network are combined in order to utilise the reference tracking ability of the path follower and the flexibility of the neural network in responding to obstacles. The architecture will maintain the preference for the global plan in obstacle-free zones and modify the plan when obstacles are detected. Importantly, our system does not require an updated obstacle map, unlike map-based methods for obstacle avoidance. Our system can learn from its own experience without the need for human demonstrations or an instrumented training setup.
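To make this data flow concrete, the sketch below shows what one step of the proposed local planner could look like. It is an illustrative reconstruction, not the authors' implementation: the function and argument names are hypothetical, and the scaling of all inputs to $[-1, 1]$ is assumed to have been applied to the range readings already.

```python
import numpy as np

def planner_step(pure_pursuit, actor, vehicle_state, global_plan, scans,
                 d_max=0.4):
    """One step of the modification architecture (illustrative sketch).

    pure_pursuit  : callable returning (v_ref, delta_ref) for the global plan
    actor         : trained policy mapping the state vector to an action
                    in [-1, 1]
    vehicle_state : (x, y, theta, v, delta) of the vehicle
    scans         : m sparse range finder readings (assumed pre-scaled)
    d_max         : maximum steering angle used to scale the action (assumed)
    """
    v_ref, delta_ref = pure_pursuit(vehicle_state, global_plan)

    # State vector: current velocity and steering, path follower references,
    # and the range readings, as described in Section V-A.
    _, _, _, v, delta = vehicle_state
    state = np.concatenate(([v, delta, v_ref, delta_ref], scans))

    # The network outputs a single continuous action, scaled by the maximum
    # steering angle and added to the path follower's steering reference.
    action = float(actor(state))            # in [-1, 1]
    delta_mod = delta_ref + action * d_max

    return v_ref, delta_mod
```

Because the network only adds a bounded offset to the path follower's reference, the planner defaults to pure path following whenever the network outputs an action near zero.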
IV. PRELIMINARIES: PATH FOLLOWING
Path following is the task of generating navigation references that result in a path being followed. Mathematically, a path follower generates velocity, $V$, and steering, $\delta$, references based on the current state of the vehicle, the global reference plan and the vehicle kinematics.

We develop our systems in the context of non-holonomic vehicles which are equipped with several range finders and the ability to localise themselves on a map. The pure pursuit method [16], which is used as the path following algorithm, generates a steering command that will direct the vehicle to a way-point a certain look-ahead distance in front of the vehicle. The pure pursuit method calculates the steering angle according to

$$\delta = \arctan\left(\frac{2L\sin(\alpha)}{l_d}\right), \quad (1)$$

where $L$ is the length of the vehicle body, $l_d$ is the look-ahead distance to the way-point being pursued and $\alpha$ is the angle between the vehicle's current heading and the heading to the way-point.

The velocity is calculated as the maximum velocity that keeps the vehicle within the friction limit of the tyres on the road. The maximum force which the vehicle can sustain without the tyres slipping is

$$F_{max} = bmg, \quad (2)$$

where $m$ is the mass of the vehicle, $b$ is the coefficient of friction, and $g$ is the acceleration due to gravity. The lateral force on the vehicle due to steering is

$$F_{steer,lateral} = ma_{lateral} = m\dot{\theta}V. \quad (3)$$

In order to plan for the future and take into account the fact that the vehicle has momentum and cannot change its velocity instantaneously, a look-ahead is done for the maximum expected value of delta, $\delta_{max}$, for a receding horizon. Therefore, the velocity for a certain $\delta_{max}$ can be calculated as

$$V = \frac{F_{max}}{m\dot{\theta}} = \frac{bmg}{\tan(\delta_{max})\,mV/L}, \quad (4)$$

which, using $\dot{\theta} = V\tan(\delta_{max})/L$ and solving for $V$, gives $V = \sqrt{bgL/\tan(\delta_{max})}$. The steering and velocity references are executed using a simple, proportional, low-level control system.
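The sketch below translates Equations 1 and 4 into code. It is a minimal illustration, assuming standard pure pursuit with the conventional factor of two in the steering law and representative F1/10th parameter values; none of the numeric values are taken from the paper.

```python
import numpy as np

# Illustrative constants (assumptions, not the paper's values).
L = 0.33   # vehicle length / wheelbase [m], typical F1/10th scale
B = 0.8    # tyre-road friction coefficient
G = 9.81   # gravitational acceleration [m/s^2]

def pure_pursuit_steering(alpha, lookahead):
    """Steering angle towards a way-point at `lookahead` metres,
    offset by angle `alpha` from the current heading (Eq. 1)."""
    return np.arctan2(2.0 * L * np.sin(alpha), lookahead)

def friction_limited_velocity(delta_max):
    """Maximum velocity keeping the lateral force m*theta_dot*V below
    F_max = b*m*g. With theta_dot = V*tan(delta)/L this solves Eq. 4
    to V = sqrt(b*g*L / tan(delta_max))."""
    return np.sqrt(B * G * L / max(np.tan(abs(delta_max)), 1e-6))
```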
V. REFERENCE MODIFICATION LOCAL PLANNING

The proposed solution, called the modification architecture, combines a pure pursuit path follower and a neural network inside the local planner. The neural network is trained with reinforcement learning to minimise deviation from the global trajectory while avoiding obstacles.

The local planner receives the global trajectory, $w$, the current state of the vehicle, $x_t, y_t, \theta_t, V_t, \delta_t$, and the range finder readings, $r_t = [r_1, r_2, ..., r_m]$. The local planner must output the velocity magnitude $V_{ref}$ and steering $\delta_{ref}$ navigation references.

A. Learning Formulation
The neural network outputs a policy which consists of an action that is used to modify the steering reference generated by the path follower. The ideal policy will result in the trajectory being modified the minimal amount that results in an obstacle-free trajectory.

RL refers to a collection of algorithms that use experience to train a neural network to learn a policy that maximises the expectation of future reward [17]. At each step, the network receives a state, $x$, and outputs an action, $a$, which is given to the environment, which in turn returns a reward, $r$, and a new state. The network is trained to take actions which will produce the highest rewards. The input to an RL agent, which comprises a neural network, is a state vector which represents the conditions of the environment. A neural network is an algorithm that aims to recognise the underlying relationships between a set of input and output data.

Our neural network is trained using the twin delayed deep deterministic policy gradient (TD3) algorithm. TD3 is a state-of-the-art algorithm and is able to output continuous actions [18]. TD3 belongs to a family of algorithms called actor-critic algorithms, which use an actor to generate actions and a critic to train the actor.

The agent in our solution observes the state of the vehicle, $x$, at each discrete time-step, $t$, from the state space $S$. The state comprises the current steering and velocity of the vehicle, the navigation references determined by the pure pursuit path follower, and the range finder readings. Mathematically, the state is $x = [V_t, \delta_t, V_{ref}, \delta_{ref}, r_1, r_2, ..., r_m]$. All of the values are scaled to the range $[-1, 1]$ to improve training performance. The RL agent outputs a single continuous action which is used to modify the steering. The action is in the range $[-1, 1]$ and is scaled according to $\delta_{max}$.

B. Reward Signal

The aim of the local planner is to track the reference path as closely as possible, yet deviate when needed to avoid obstacles. The ideal reward signal will lead to the behaviour shown in Fig. 2, where the local planner detects an obstacle, smoothly avoids it and then returns to the reference path. The path follower will track the reference path in all situations. Therefore, the aim of the neural network is to know when and how to deviate from the reference path by modifying the steering reference.

The primary aim of the neural network is to avoid collisions. Therefore, the agent must receive the worst reward possible if the vehicle crashes. In order to increase the severity of the punishment for crashing, a small positive reward is given to the agent for each step that does not result in a collision. Additionally, the vehicle should only deviate the minimum necessary amount to avoid obstacles. Therefore, the agent is punished according to the magnitude of the deviation from the reference path. The reward signal is thus defined as

$$r(s_t, a_t) = \begin{cases} r_{crash} & \text{if } d_{obs} = 0 \\ \beta_1 - \beta_2|\Delta\delta_t| & \text{otherwise,} \end{cases} \quad (5)$$

where $\beta_1$ and $\beta_2$ are hyper-parameters which are tuned and $\Delta\delta_t$ is the amount that the steering is modified by.
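A minimal sketch of the reward signal in Equation 5 is given below. The hyper-parameter values are illustrative placeholders; the paper only states that $\beta_1$ and $\beta_2$ are tuned.

```python
def reward(crashed, delta_modification, r_crash=-1.0, beta1=0.01, beta2=0.05):
    """Reward signal of Eq. 5 (all numeric values are assumptions).

    crashed            : True if this step ended in a collision
    delta_modification : amount the steering reference was modified by
    """
    if crashed:
        return r_crash  # worst possible reward for a collision
    # Small positive reward for surviving the step, minus a penalty
    # proportional to the magnitude of the deviation from the reference.
    return beta1 - beta2 * abs(delta_modification)
```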
C. Reference Modification

The key part of our architecture is how the neural network is combined with the path follower. The steering command is what determines the path that the vehicle will follow. The path follower (PF) produces a reference, $\delta_{PF}$, and the neural network returns a scaled steering modification value, $\delta_{NN}$. Both these values are in the range $[-\delta_{max}, \delta_{max}]$, and are combined through simple addition:

$$\delta_{ref} = \delta_{PF} + \delta_{NN}. \quad (6)$$

In order to ensure that the friction limit of the vehicle is not exceeded, the velocity reference from the pure pursuit follower is recalculated based on Equation 4. Due to the fact that the velocity of the vehicle cannot change instantaneously, the combined steering reference is clipped based on the current velocity to ensure the vehicle's safety.
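The sketch below illustrates this combination and clipping step, assuming the friction-limit relation of Equation 4 is inverted to bound the steering angle at the current velocity. The function name and all parameter values are assumptions for illustration.

```python
import numpy as np

def combine_references(delta_pf, delta_nn, v_current, d_max=0.4):
    """Combine path follower and network steering (Eq. 6) and clip the
    result for safety (illustrative sketch).

    The combined angle is first limited to the physical steering range,
    then restricted so that the lateral force at the current velocity
    stays within the friction limit of Eq. 4.
    """
    delta_ref = delta_pf + delta_nn                        # Eq. 6
    delta_ref = np.clip(delta_ref, -d_max, d_max)

    # From Eq. 4, tan(delta) <= b*g*L / V^2 at the current velocity.
    B, G, L = 0.8, 9.81, 0.33                              # assumed values
    delta_safe = np.arctan(B * G * L / max(v_current**2, 1e-3))
    return float(np.clip(delta_ref, -delta_safe, delta_safe))
```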
VI. EVALUATION

The aim of the evaluation is to assess the ability of the proposed modification architecture to follow a reference trajectory while avoiding obstacles. The architecture is first evaluated by following a straight line reference through a dense random forest of randomly placed square obstacles, and then by following a minimum curvature trajectory through a race track. The experiments are conducted using F1/10th scale autonomous racing cars. Autonomous racing has a performance metric of time, numerous relevant benchmarks, and allows for new algorithms to be implemented in both simulation and in practice at low cost.
A. Experimental Setup
A simple simulator of an F1/10th car on a track is built based on a Robot Operating System (ROS) based F1/10th simulator [19]. The simulator represents the vehicle according to the bicycle model as described in [20], with standard F1/10th parameters [19].

The first environment, which is used for testing, is a random forest similar to what has been used to evaluate comparable algorithms [14], [15]. The reference trajectory is set to be a straight line and random obstacles are placed in front of the vehicle. The aim for the vehicle is to follow the straight line reference while avoiding the obstacles.

The second environment which is used is an F1/10th race track with a minimum curvature reference trajectory. The race track evaluation is derived from the benchmark for F1/10th cars presented in [21] and shows how planning architectures can be applied to real-life situations. A standard race track is used for training and testing and random obstacles are placed on the track.

The proposed solution is implemented on an F1/10th race car (called the hybrid vehicle) with 10 sparse laser range finders. A neural network with two hidden layers of 300 neurons each is used in the implementation. The neural networks are trained for 100,000 steps on mini-batches of 100 samples. The experiential training data is collected in the same environment used for testing, with new obstacles randomly spawned after every episode. The experimental tests were performed by running 100 episodes in each environment and taking an average of the results.

The hybrid planner is compared to a vehicle (known as the benchmark vehicle) that uses a full map (including obstacles) to plan a minimum curvature trajectory around the obstacles. A standard pure pursuit local planner is used to follow the obstacle-free reference path. This benchmark vehicle represents a theoretical best possible time to avoid obstacles.
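For concreteness, an actor network matching the stated dimensions (a 14-dimensional state of four scaled vehicle and reference values plus 10 range readings, two hidden layers of 300 neurons, and one continuous action) could be defined as below. This is a sketch in PyTorch; the framework choice and every detail beyond the layer sizes and action range are assumptions.

```python
import torch.nn as nn

class Actor(nn.Module):
    """TD3 actor sketch for the modification architecture: two hidden
    layers of 300 neurons (as stated in the paper) and a tanh output
    producing one steering-modification action in [-1, 1]."""

    def __init__(self, state_dim=14):  # 4 scaled values + 10 range readings
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)
```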
B. Minimum Curvature Trajectory Generation
The second evaluation environment uses a minimum curvature path around the race track as the reference to follow. Additionally, the benchmark solution follows a minimum curvature path which is planned with full knowledge of the obstacles. A minimum curvature trajectory is generated by representing the track as shown in Fig. 4, and then using a path optimisation algorithm. This paper uses a path optimisation method similar to that presented in [22].

The track is modelled as a set of centre-line locations $x_k, y_k$, normal vectors, and track widths, $w_{p,k}, w_{n,k}$. Each point on the race line is defined as a distance, $n_k$, from the centre line along the normal vector. Modelling the track in this way allows for a simple optimisation equation to be set up with the only path constraint being $-w_{n,k} < n_k < w_{p,k} \; \forall k$, which can be easily reduced to take obstacles into account. The vector, $n$, of the normal distances is optimised according to a cost function which represents the curvature of the path.

Fig. 4. Track Model used for Generating Minimum Curvature Path: the solid black lines are the race track boundaries and the dashed black line is the track centre line. The black dots are the points at which the track is discretised and the red dots and line show the optimal path which is calculated.
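A compact sketch of such an optimisation is shown below, using a second-difference approximation of curvature as the cost; this cost and solver are assumptions, and the exact formulation used in [22] and in this paper may differ.

```python
import numpy as np
from scipy.optimize import minimize

def min_curvature_path(centre, normals, w_n, w_p):
    """Sketch of the minimum curvature optimisation described above.

    centre  : (K, 2) centre-line points (x_k, y_k)
    normals : (K, 2) unit normal vectors at each point
    w_n, w_p: (K,) track widths on either side of the centre line

    The decision variables are the offsets n_k along the normals; the
    cost approximates path curvature with second differences. Shrinking
    the bounds [-w_n, w_p] at points near obstacles excludes them.
    """
    K = len(centre)

    def cost(n):
        path = centre + n[:, None] * normals
        # Second differences approximate curvature (closed-loop track).
        d2 = np.roll(path, -1, axis=0) - 2 * path + np.roll(path, 1, axis=0)
        return np.sum(d2 ** 2)

    bounds = list(zip(-w_n, w_p))  # -w_{n,k} < n_k < w_{p,k} for all k
    res = minimize(cost, np.zeros(K), bounds=bounds)
    return centre + res.x[:, None] * normals
```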
C. Random Forest Experimental Results
Table I shows the results from the experiments that were performed on a 25 m random forest. The experiments were performed on an empty forest with no obstacles, to evaluate the ability to track the reference path, and on a forest with 6 randomly located obstacles, to evaluate the ability to avoid obstacles.
TABLE I. Comparison of Benchmark and Hybrid Planner Results Through Random Forest: times are shown in seconds for completion of the forest with and without obstacles. The episode was unsuccessful if the vehicle crashed.

Vehicle   | No Obstacles | Obstacles | Success
Benchmark | 3.8 s        | 5.7 s     | 100%
Hybrid    | 6.8 s        | 10.6 s    | 78%
Fig. 5 shows an example path taken through an empty forest strip and the corresponding neural network output. In all the trajectory images, the green dashed line represents the reference path, and the red line represents the trajectory that was taken. The light grey is the track area, the blue blocks are un-mapped obstacles and the yellow stars are the start and end locations.

Fig. 5. Neural Network Output Compared to Path Taken by Vehicle.

In the tests through the empty random forest, our hybrid architecture performed significantly slower, with an average time of 6.8 seconds as opposed to 3.8 seconds for the benchmark solution. While the vehicle follows the reference path closely, the neural network still modifies the steering reference, causing the vehicle to swerve from side to side. The swerving caused by the neural network forces the vehicle to travel at a slower speed and explains why the hybrid architecture's average time is so much slower than the benchmark's.

The tests with obstacles show that the modification architecture is able to guide the vehicle through the forest with a 78% success rate. The success rate shows that the vehicle is able to avoid a significant number, but not all, of the obstacles. Fig. 6 shows typical trajectories that the benchmark and hybrid vehicles took. The example trajectories show that the neural network learns a policy that avoids obstacles as narrowly as possible. It is assumed that this narrow obstacle avoidance policy is the reason for the several crashes which took place. The assumed reason for the success rate on the race track being higher than in the forest is that the obstacles on the race track were less densely spaced.

On the left hand side of the hybrid example trajectory, it can be seen that the hybrid planner brings the vehicle back to the reference in between closely spaced obstacles, while the benchmark planner is able to smoothly plan around both obstacles. The lack of smoothness in the hybrid planner's path results in the vehicle travelling at slower speeds and thus taking an increased amount of time to complete the course.
Fig. 6. Random Forest Example Trajectories: benchmark planner (top) and hybrid planner (bottom).
D. Race Track Experimental Results
Table II shows the results from the experiments performed on the race track.

TABLE II. Comparison of Benchmark and Hybrid Planner Results on Race Track: times are shown in seconds for lap completion on the track with and without obstacles. The lap was unsuccessful if the vehicle crashed.

Vehicle   | No Obstacles | Obstacles | Success
Benchmark | 33.2 s       | 34.7 s    | 100%
Hybrid    | 41.6 s       | 43.2 s    | 89%

The test with no obstacles on the race track produces very similar results to those seen in the empty random forest. The hybrid planner is able to follow the reference trajectory, however the time taken is much longer due to the slower speeds caused by the constant swerving.

Fig. 7 shows a typical example of how the hybrid agent modifies the path follower generated references to avoid obstacles. The vehicle can clearly be seen to leave the reference trajectory when an obstacle is detected and then return smoothly once the obstacle is past. The hybrid planner is able to successfully navigate around the track with obstacles in 89% of the test laps. This shows that the neural network is able to learn a policy to modify the references sufficiently to avoid a significant number of obstacles.

Fig. 7. Example Trajectory of Vehicle on the Experimental Race Track.
On the left hand side of the track, where the reference trajectory comes close to the boundary, the hybrid agent deviates slightly from the reference, away from the boundary. This behaviour shows that the agent will make small deviations from the reference in order to maintain distance from obstacles and boundaries alike.

At certain times, the agent avoids obstacles in more elegant and optimal ways than at others. In the bottom right corner of the track, the hybrid planner smoothly avoids the obstacle by turning slightly more. On the right side of the track, the vehicle takes a jerky path around the obstacle. In the midsection of the track, it appears that the agent is confused between swerving left or right to avoid the obstacles. This shows that the agent avoids obstacles in different ways based on the situation.
E. Discussion & Future Work
The results show that the hybrid planner is able to avoid a significant number of obstacles without maintaining an obstacle map or requiring a special training setup. Our solution is able to track a reference trajectory and deviate when necessary using only 10 sparse laser range finder readings. The example trajectories show that the hybrid planner outputs jerky trajectories which prevent the vehicle from travelling at high speeds. Future work must be done to improve the smoothness of the trajectory around obstacles. While the hybrid architecture is effective in avoiding a significant number of obstacles, it still occasionally crashes. Future work must investigate safety guarantees that can be integrated into the system to ensure that it does not crash in application. The work done in this paper is limited to simulation and should be verified by implementing a hybrid planning architecture in practice on a physical setup. An open research area is avoiding dynamic obstacles, as it requires reasoning about the motion of the obstacle. Additional work should extend the proposed hybrid architecture to avoid dynamic obstacles.
VII. CONCLUSIONS
In this paper we address the problem of autonomously generating navigation references to track a reference path and avoid obstacles. We present a hybrid local planner which combines the path follower's ability to track a reference path with a neural network's ability to flexibly avoid obstacles. The neural network operates in parallel with the path follower and is able to adjust the references which the path follower has generated. The neural network is trained with RL to prevent collisions by modifying the references from the path follower. The architecture improves upon previous work by removing the need to maintain an obstacle map (as with SLAM-based systems), and not requiring expert demonstrations (like imitation learning) or instrumented training setups (like guided policy search). The architecture was evaluated in the context of F1/10th autonomous race cars in a random forest and on a classic race track. The results show that our architecture is able to use 10 sparse laser range finders to track a reference path and successfully avoid a significant number of obstacles. Future work can optimise the design for operation at high speeds and explore solutions for using a similar architecture to avoid dynamic obstacles.
REFERENCES
[1] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, and M. H. Ang, "Perception, planning, control, and coordination for autonomous vehicles," Machines, vol. 5, no. 1, p. 6, 2017.
[2] P. Falcone, F. Borrelli, H. E. Tseng, J. Asgari, and D. Hrovat, "A hierarchical model predictive control framework for autonomous ground vehicles," IEEE, 2008, pp. 3719–3724.
[3] L. Heng, L. Meier, P. Tanskanen, F. Fraundorfer, and M. Pollefeys, "Autonomous obstacle avoidance and maneuvering on a vision-guided MAV using on-board processing," IEEE, 2011, pp. 2472–2477.
[4] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: part I," IEEE Robotics & Automation Magazine, vol. 13, no. 2, pp. 99–110, 2006.
[5] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[6] R. Ivanov, T. J. Carpenter, J. Weimer, R. Alur, G. J. Pappas, and I. Lee, "Case study: verifying the safety of an autonomous racing car with a neural network controller," in Proceedings of the 23rd International Conference on Hybrid Systems: Computation and Control, 2020, pp. 1–7.
[7] L. Tai, J. Zhang, M. Liu, J. Boedecker, and W. Burgard, "A survey of deep network solutions for learning control in robotics: From reinforcement to imitation," arXiv preprint arXiv:1612.07139, 2016.
[8] M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto, "Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4423–4430, 2018.
[9] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," 2017, pp. 31–36.
[10] F. de Villiers and W. Brink, "Learning fine-grained control for mapless navigation," IEEE, 2020, pp. 1–6.
[11] M. Wei, S. Wang, J. Zheng, and D. Chen, "UGV navigation optimization aided by reinforcement learning-based path tracking," IEEE Access, vol. 6, pp. 57814–57825, 2018.
[12] L. Tai, J. Zhang, M. Liu, and W. Burgard, "Socially compliant navigation through raw depth inputs with generative adversarial imitation learning," IEEE, 2018, pp. 1111–1117.
[13] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning, 2013, pp. 1–9.
[14] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search," IEEE, 2016, pp. 528–535.
[15] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, "PLATO: Policy learning using adaptive trajectory optimization," IEEE, 2017, pp. 3342–3349.
[16] R. C. Coulter, "Implementation of the pure pursuit path tracking algorithm," Carnegie Mellon University, Pittsburgh, PA, Robotics Institute, Tech. Rep., 1992.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[18] S. Fujimoto, H. Van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv preprint arXiv:1802.09477, 2018.
[19] V. S. Babu and M. Behl, "ROS F1/10 autonomous racecar simulator," October 2019.
[20] P. Polack, F. Altché, B. d'Andréa Novel, and A. de La Fortelle, "The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?" IEEE, 2017, pp. 812–818.
[21] M. O'Kelly, H. Zheng, D. Karthik, and R. Mangharam, "F1TENTH: An open-source evaluation environment for continuous control and reinforcement learning," in NeurIPS 2019 Competition and Demonstration Track. PMLR, 2020, pp. 77–89.
[22] F. Christ, A. Wischnewski, A. Heilmeier, and B. Lohmann, "Time-optimal trajectory planning for a race car considering variable tyre-road friction coefficients."