Autonomous Obstacle Avoidance by Learning Policies for Reference Modification
Benjamin Evans
Dept. of E&E Engineering, Stellenbosch University
Stellenbosch, South Africa
[email protected]

Hendrik W. Jordaan
Dept. of E&E Engineering, Stellenbosch University
Stellenbosch, South Africa
[email protected]

Herman A. Engelbrecht
Dept. of E&E Engineering, Stellenbosch University
Stellenbosch, South Africa
[email protected]
Abstract—The key problem for autonomous robots is how to navigate through complex, obstacle-filled environments. The navigation problem is broken up into generating a reference path to the goal and then using a local planner to track the reference path and avoid obstacles by generating velocity and steering references for a control system to execute. This paper presents a novel local planner architecture for path following and obstacle avoidance by proposing a hybrid system that uses a classic path following algorithm in parallel with a neural network. Our solution, called the modification architecture, uses a path following algorithm to follow a reference path and a neural network to modify the references generated by the path follower in order to prevent collisions. The neural network is trained using reinforcement learning to deviate from the reference trajectory just enough to avoid crashing. We present our local planner in a hierarchical planning framework, nested between a global reference planner and a low-level control system. The system is evaluated in the context of F1/10th scale autonomous racing in random forests and on a race track. The results demonstrate that our solution is able to follow a reference trajectory while avoiding obstacles using only 10 sparse laser range finder readings. Our hybrid system overcomes previous limitations by not requiring an obstacle map, human demonstrations, or an instrumented training setup.
Index Terms—Motion and Path Planning; Reinforcement Learning; Collision Avoidance
I. INTRODUCTION
The general robot navigation problem is to plan and execute a trajectory that moves a robot from a starting location to a goal location without crashing into obstacles.

Fig. 1 shows how the problem of autonomous navigation has been broken up into global and local planning, perception and control [1], [2]. Before a mission begins, a global planner generates a reference path linking the start and end points. During the mission, the loop of collecting relevant data from the environment (perception), re-planning to avoid obstacles, and then executing the plan (control) is repeated.

Fig. 1. Overview of Planning Hierarchy (block diagram with the components: Environment Map, Global Planner, Local Planner, Control System, Perception, Vehicle).

This project focuses on the local planning problem of tracking a reference path while avoiding obstacles. Our contribution is a method of local planning that enables autonomous robots to follow a global plan while avoiding collisions, without the need for maintaining an obstacle map or a special training setup. Our novel local planner features a path following algorithm which produces navigation references that follow the global plan and a neural network that is able to modify those references to ensure that obstacles are avoided. We use reinforcement learning to train the neural network to deviate to avoid collisions while keeping close to the trajectory of the global planner. The proposed solution is evaluated using the F1/10th autonomous racing framework in a random forest and on a classic race track.

This paper is laid out as follows. Previous attempts to solve the local planning problem are studied in Section II. Section III presents a theoretical overview of the proposed modification architecture. A mathematical description of the path following problem and an explanation of the pure pursuit algorithm are given in Section IV. In Section V, the learning formulation and reward signal used to train the neural network are detailed and motivated. The system is evaluated in Section VI and the results are explained and discussed.

II. RELATED WORK
The problem of local planning for autonomous navigation is to generate navigation references (velocity and steering) that cause the vehicle to approximately track the reference path while avoiding obstacles. Current solutions to the local planning problem can be broadly grouped into the categories of map-based and mapless methods.
A. Map-based Obstacle Avoidance
Obstacle avoidance has been achieved by maintaining an updated map of the environment which is used to plan a collision-free trajectory [3]. The map is continuously updated as the vehicle moves around the environment and the updated map is used to re-plan local trajectories. Maps are generally built using Simultaneous Localisation and Mapping (SLAM) algorithms, which use high dimensional sensors, such as Light Detection and Ranging (LIDAR), to construct a map of the environment [4].

Building maps requires a high dimensional sensor such as a camera or a full LIDAR scan, is computationally expensive, and may take too long to be useful for high speed applications [5]. While SLAM-based systems have the advantage of being able to operate in unknown environments and robustly avoid obstacles, they are limited to systems that have high dimensional sensors, the required hardware, and complex algorithms to build a map in real time. [5] also identifies that SLAM algorithms can be "easily induced to fail" when the robot motion or environment is too complex or changes too quickly.

B. Mapless Obstacle Avoidance
As a result of the limitations of map-based local planning, many attempts have been made to generate navigation references without maintaining an explicit obstacle map. Neural networks have been used to associate raw sensory data (a LIDAR scan) directly with velocity and steering references for mapless autonomous racing [6]. The two main methods which have been used to train neural networks to navigate are reinforcement learning (RL), where the agent learns from experience, and imitation learning (IL), which uses a teacher (either an expert demonstration or an optimal control strategy) to train the network [7], [8].

Tai et al. [9] used the deep deterministic policy gradient (DDPG) RL algorithm to map laser range finder readings directly to wheel velocities, and showed that neural networks were able to learn how to control a vehicle towards a target destination with no pre-planned path. The work done in [10] improves upon Tai's work with an improved reward signal. Despite improved performance, [10] shows that there are many complex environments, such as mazes, where an RL-trained agent is unable to find a path towards the goal, and therefore concludes that RL agents should rather be used as a low-level component in conjunction with a global planner. Wei et al. implement an RL agent as a path follower, in conjunction with a high-level A* path finder [11].

While imitation learning has been shown to be able to learn navigation policies from human expert demonstrations [12], it is limited to the performance of experts in a task and requires large data sets of demonstrations. Guided Policy Search (GPS) aims to improve upon learning from humans by using IL to teach a neural network to imitate demonstrations provided by an algorithmic teacher [13]. The teacher is set up as an optimal control strategy that has access to the full state of the environment. GPS algorithms have been applied to the problem of obstacle avoidance and shown to be effective for generating navigation references [14], [15]. The authors of [14] identify that while GPS methods are effective in some domains, they are limited to contexts where full state estimation of the system is available via an instrumented training setup.
III. THEORETICAL SOLUTION OVERVIEW
We address the problem of generating navigation references for autonomous vehicles that result in a reference trajectory being followed and obstacles being avoided. Fig. 2 shows how an ideal local planner will approximately follow a reference path when possible and deviate from it the minimum amount in order to smoothly avoid obstacles. The solution should be able to generate the desired behaviour without requiring an updated obstacle map, human demonstrations, or an instrumented training setup.
Fig. 2. Ideal Obstacle Avoidance Behaviour: the green dashed line represents the reference trajectory and the red line shows the ideal path around the blue obstacle.
We present a novel local planner which features a path following algorithm and a neural network in parallel as an attempt to generate the behaviour shown in Fig. 2. Fig. 3 shows a schematic of how the neural network interacts with the path follower. The path following algorithm is responsible for following the global plan and the neural network is able to modify the plan in order to prevent collisions.
Fig. 3. Proposed Local Planner Architecture: a path follower is used to generate navigation references, which the neural network is able to modify. The local planner receives the global plan and the vehicle state and outputs the modified navigation references.
The action which the neural network outputs is scaled and then added to the path follower references, thus allowing the neural network to modify the path being followed. The path following algorithm generates steering and velocity references that result in the vehicle tracking the reference path. The path follower's velocity and steering references, the vehicle's current velocity and steering, and the laser range finder readings are fed into the neural network as the state vector. Finally, the modified references are returned to the low-level control systems to be executed.

The neural network is trained using RL. At each time step, the agent receives a state vector, selects an action and then receives a reward. This architecture provides for an intuitive reward function where the agent can be easily punished for deviating from the local plan or colliding with an obstacle.

The path follower and neural network are combined in order to utilise the reference tracking ability of the path follower and the flexibility of the neural network in responding to obstacles. The architecture will maintain the preference for the global plan in obstacle-free zones and modify the plan when obstacles are detected. Importantly, our system does not require an updated obstacle map, unlike map-based methods for obstacle avoidance. Our system can learn from its own experience without the need for human demonstrations or an instrumented training setup.
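To make this data flow concrete, the sketch below shows what one step of the proposed local planner could look like. It is an illustrative reconstruction, not the authors' implementation: the function and argument names are hypothetical, and the scaling of all inputs to $[-1, 1]$ is assumed to have been applied to the range readings already.

```python
import numpy as np

def planner_step(pure_pursuit, actor, vehicle_state, global_plan, scans,
                 d_max=0.4):
    """One step of the modification architecture (illustrative sketch).

    pure_pursuit  : callable returning (v_ref, delta_ref) for the global plan
    actor         : trained policy mapping the state vector to an action
                    in [-1, 1]
    vehicle_state : (x, y, theta, v, delta) of the vehicle
    scans         : m sparse range finder readings (assumed pre-scaled)
    d_max         : maximum steering angle used to scale the action (assumed)
    """
    v_ref, delta_ref = pure_pursuit(vehicle_state, global_plan)

    # State vector: current velocity and steering, path follower references,
    # and the range readings, as described in Section V-A.
    _, _, _, v, delta = vehicle_state
    state = np.concatenate(([v, delta, v_ref, delta_ref], scans))

    # The network outputs a single continuous action, scaled by the maximum
    # steering angle and added to the path follower's steering reference.
    action = float(actor(state))            # in [-1, 1]
    delta_mod = delta_ref + action * d_max

    return v_ref, delta_mod
```

Because the network only adds a bounded offset to the path follower's reference, the planner defaults to pure path following whenever the network outputs an action near zero.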
IV. PRELIMINARIES: PATH FOLLOWING
Path following is the task of generating navigation references that result in a path being followed. Mathematically, a path follower generates velocity, $V$, and steering, $\delta$, references based on the current state of the vehicle, the global reference plan and the vehicle kinematics.

We develop our systems in the context of non-holonomic vehicles which are equipped with several range finders and the ability to localise themselves on a map. The pure pursuit method [16], which is used as the path following algorithm, generates a steering command that will direct the vehicle to a way-point a certain look-ahead distance in front of the vehicle. The pure pursuit method calculates the steering angle according to

$$\delta = \arctan\left(\frac{2L\sin(\alpha)}{l_d}\right), \quad (1)$$

where $L$ is the length of the vehicle body, $l_d$ is the look-ahead distance to the way-point being pursued and $\alpha$ is the angle between the vehicle's current heading and the heading to the way-point.

The velocity is calculated as the maximum velocity that keeps the vehicle within the friction limit of the tyres on the road. The maximum force which the vehicle can sustain without the tyres slipping is

$$F_{max} = bmg, \quad (2)$$

where $m$ is the mass of the vehicle, $b$ is the coefficient of friction, and $g$ is the acceleration due to gravity. The lateral force on the vehicle due to steering is

$$F_{steer,lateral} = ma_{lateral} = m\dot{\theta}V. \quad (3)$$

In order to plan for the future and take into account the fact that the vehicle has momentum and cannot change its velocity instantaneously, a look-ahead is done for the maximum expected value of delta, $\delta_{max}$, for a receding horizon. Therefore, the velocity for a certain $\delta_{max}$ can be calculated as

$$V = \frac{F_{max}}{m\dot{\theta}} = \frac{bmg}{\tan(\delta_{max})\,mV/L}, \quad (4)$$

which, using $\dot{\theta} = V\tan(\delta_{max})/L$ and solving for $V$, gives $V = \sqrt{bgL/\tan(\delta_{max})}$. The steering and velocity references are executed using a simple, proportional, low-level control system.
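The sketch below translates Equations 1 and 4 into code. It is a minimal illustration, assuming standard pure pursuit with the conventional factor of two in the steering law and representative F1/10th parameter values; none of the numeric values are taken from the paper.

```python
import numpy as np

# Illustrative constants (assumptions, not the paper's values).
L = 0.33   # vehicle length / wheelbase [m], typical F1/10th scale
B = 0.8    # tyre-road friction coefficient
G = 9.81   # gravitational acceleration [m/s^2]

def pure_pursuit_steering(alpha, lookahead):
    """Steering angle towards a way-point at `lookahead` metres,
    offset by angle `alpha` from the current heading (Eq. 1)."""
    return np.arctan2(2.0 * L * np.sin(alpha), lookahead)

def friction_limited_velocity(delta_max):
    """Maximum velocity keeping the lateral force m*theta_dot*V below
    F_max = b*m*g. With theta_dot = V*tan(delta)/L this solves Eq. 4
    to V = sqrt(b*g*L / tan(delta_max))."""
    return np.sqrt(B * G * L / max(np.tan(abs(delta_max)), 1e-6))
```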
V. REFERENCE MODIFICATION LOCAL PLANNING

The proposed solution, called the modification architecture, combines a pure pursuit path follower and a neural network inside the local planner. The neural network is trained with reinforcement learning to minimise deviation from the global trajectory while avoiding obstacles.

The local planner receives the global trajectory, $w$, the current state of the vehicle, $x_t, y_t, \theta_t, V_t, \delta_t$, and the range finder readings, $r_t = [r_1, r_2, ..., r_m]$. The local planner must output the velocity magnitude $V_{ref}$ and steering $\delta_{ref}$ navigation references.

A. Learning Formulation
The neural network outputs a policy which consists of an action that is used to modify the steering reference generated by the path follower. The ideal policy will result in the trajectory being modified the minimal amount that results in an obstacle-free trajectory.

RL refers to a collection of algorithms that use experience to train a neural network to learn a policy that maximises the expectation of future reward [17]. At each step, the network receives a state, $x$, and outputs an action, $a$, which is given to the environment, which in turn returns a reward, $r$, and a new state. The network is trained to take actions which will produce the highest rewards. The input to an RL agent, which comprises a neural network, is a state vector which represents the conditions of the environment. A neural network is an algorithm that aims to recognise the underlying relationships between a set of input and output data.

Our neural network is trained using the twin delayed deep deterministic policy gradient (TD3) algorithm. TD3 is a state-of-the-art algorithm and is able to output continuous actions [18]. TD3 belongs to a family of algorithms called actor-critic algorithms, which use an actor to generate actions and a critic to train the actor.

The agent in our solution observes the state of the vehicle, $x$, at each discrete time-step, $t$, from the state space $S$. The state comprises the current steering and velocity of the vehicle, the navigation references determined by the pure pursuit path follower, and the range finder readings. Mathematically, the state is $x = [V_t, \delta_t, V_{ref}, \delta_{ref}, r_1, r_2, ..., r_m]$. All of the values are scaled to the range $[-1, 1]$ to improve training performance. The RL agent outputs a single continuous action which is used to modify the steering. The action is in the range $[-1, 1]$ and is scaled according to $\delta_{max}$.

B. Reward Signal

The aim of the local planner is to track the reference path as closely as possible, yet deviate when needed to avoid obstacles. The ideal reward signal will lead to the behaviour shown in Fig. 2, where the local planner detects an obstacle, smoothly avoids it and then returns to the reference path. The path follower will track the reference path in all situations. Therefore, the aim of the neural network is to know when and how to deviate from the reference path by modifying the steering reference.

The primary aim of the neural network is to avoid collisions. Therefore, the agent must receive the worst reward possible if the vehicle crashes. In order to increase the severity of the punishment for crashing, a small positive reward is given to the agent for each step that does not result in a collision. Additionally, the vehicle should only deviate the minimum necessary amount to avoid obstacles. Therefore, the agent is punished according to the magnitude of the deviation from the reference path. The reward signal is thus defined as

$$r(s_t, a_t) = \begin{cases} r_{crash} & \text{if } d_{obs} = 0 \\ \beta_1 - \beta_2|\Delta\delta_t| & \text{otherwise,} \end{cases} \quad (5)$$

where $\beta_1$ and $\beta_2$ are hyper-parameters which are tuned and $\Delta\delta_t$ is the amount that the steering is modified by.
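A minimal sketch of the reward signal in Equation 5 is given below. The hyper-parameter values are illustrative placeholders; the paper only states that $\beta_1$ and $\beta_2$ are tuned.

```python
def reward(crashed, delta_modification, r_crash=-1.0, beta1=0.01, beta2=0.05):
    """Reward signal of Eq. 5 (all numeric values are assumptions).

    crashed            : True if this step ended in a collision
    delta_modification : amount the steering reference was modified by
    """
    if crashed:
        return r_crash  # worst possible reward for a collision
    # Small positive reward for surviving the step, minus a penalty
    # proportional to the magnitude of the deviation from the reference.
    return beta1 - beta2 * abs(delta_modification)
```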
C. Reference Modification

The key part of our architecture is how the neural network is combined with the path follower. The steering command is what determines the path that the vehicle will follow. The path follower (PF) produces a reference, $\delta_{PF}$, and the neural network returns a scaled steering modification value, $\delta_{NN}$. Both these values are in the range $[-\delta_{max}, \delta_{max}]$, and are combined through simple addition:

$$\delta_{ref} = \delta_{PF} + \delta_{NN}. \quad (6)$$

In order to ensure that the friction limit of the vehicle is not exceeded, the velocity reference from the pure pursuit follower is recalculated based on Equation 4. Due to the fact that the velocity of the vehicle cannot change instantaneously, the combined steering reference is clipped based on the current velocity to ensure the vehicle's safety.
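The sketch below illustrates this combination and clipping step, assuming the friction-limit relation of Equation 4 is inverted to bound the steering angle at the current velocity. The function name and all parameter values are assumptions for illustration.

```python
import numpy as np

def combine_references(delta_pf, delta_nn, v_current, d_max=0.4):
    """Combine path follower and network steering (Eq. 6) and clip the
    result for safety (illustrative sketch).

    The combined angle is first limited to the physical steering range,
    then restricted so that the lateral force at the current velocity
    stays within the friction limit of Eq. 4.
    """
    delta_ref = delta_pf + delta_nn                        # Eq. 6
    delta_ref = np.clip(delta_ref, -d_max, d_max)

    # From Eq. 4, tan(delta) <= b*g*L / V^2 at the current velocity.
    B, G, L = 0.8, 9.81, 0.33                              # assumed values
    delta_safe = np.arctan(B * G * L / max(v_current**2, 1e-3))
    return float(np.clip(delta_ref, -delta_safe, delta_safe))
```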
VI. EVALUATION

The aim of the evaluation is to assess the ability of the proposed modification architecture to follow a reference trajectory while avoiding obstacles. The architecture is first evaluated by following a straight line reference through a dense random forest of randomly placed square obstacles, and then by following a minimum curvature trajectory through a race track. The experiments are conducted using F1/10th scale autonomous racing cars. Autonomous racing has a performance metric of time, numerous relevant benchmarks, and allows for new algorithms to be implemented in both simulation and in practice at low cost.
A. Experimental Setup
A simple simulator of an F1/10th car on a track is built based on a Robot Operating System (ROS) based F1/10th simulator [19]. The simulator represents the vehicle according to the bicycle model as described in [20], with standard F1/10th parameters [19].

The first environment, which is used for testing, is a random forest similar to what has been used to evaluate comparable algorithms [14], [15]. The reference trajectory is set to be a straight line and random obstacles are placed in front of the vehicle. The aim for the vehicle is to follow the straight line reference while avoiding the obstacles.

The second environment which is used is an F1/10th race track with a minimum curvature reference trajectory. The race track evaluation is derived from the benchmark for F1/10th cars presented in [21] and shows how planning architectures can be applied to real-life situations. A standard race track is used for training and testing and random obstacles are placed on the track.

The proposed solution is implemented on an F1/10th race car (called the hybrid vehicle) with 10 sparse laser range finders. A neural network with two hidden layers of 300 neurons each is used in the implementation. The neural networks are trained for 100,000 steps on mini-batches of 100 samples. The experiential training data is collected in the same environment used for testing, with new obstacles randomly spawned after every episode. The experimental tests were performed by running 100 episodes in each environment and taking an average of the results.

The hybrid planner is compared to a vehicle (known as the benchmark vehicle) that uses a full map (including obstacles) to plan a minimum curvature trajectory around the obstacles. A standard pure pursuit local planner is used to follow the obstacle-free reference path. This benchmark vehicle represents a theoretical best possible time to avoid obstacles.
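For concreteness, an actor network matching the stated dimensions (a 14-dimensional state of four scaled vehicle and reference values plus 10 range readings, two hidden layers of 300 neurons, and one continuous action) could be defined as below. This is a sketch in PyTorch; the framework choice and every detail beyond the layer sizes and action range are assumptions.

```python
import torch.nn as nn

class Actor(nn.Module):
    """TD3 actor sketch for the modification architecture: two hidden
    layers of 300 neurons (as stated in the paper) and a tanh output
    producing one steering-modification action in [-1, 1]."""

    def __init__(self, state_dim=14):  # 4 scaled values + 10 range readings
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)
```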
B. Minimum Curvature Trajectory Generation
The second evaluation environment uses a minimum curvature path around the race track as the reference to follow. Additionally, the benchmark solution follows a minimum curvature path which is planned with full knowledge of the obstacles. A minimum curvature trajectory is generated by representing the track as shown in Fig. 4, and then using a path optimisation algorithm. This paper uses a path optimisation method similar to that presented in [22].

The track is modelled as a set of centre-line locations $x_k, y_k$, normal vectors, and track widths, $w_{p,k}, w_{n,k}$. Each point on the race line is defined as a distance, $n_k$, from the centre line along the normal vector. Modelling the track in this way allows for a simple optimisation equation to be set up with the only path constraint being $-w_{n,k} < n_k < w_{p,k} \; \forall k$, which can be easily reduced to take obstacles into account. The vector, $n$, of the normal distances is optimised according to a cost function which represents the curvature of the path.

Fig. 4. Track Model used for Generating Minimum Curvature Path: the solid black lines are the race track boundaries and the dashed black line is the track centre line. The black dots are the points at which the track is discretised and the red dots and line show the optimal path which is calculated.
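A compact sketch of such an optimisation is shown below, using a second-difference approximation of curvature as the cost; this cost and solver are assumptions, and the exact formulation used in [22] and in this paper may differ.

```python
import numpy as np
from scipy.optimize import minimize

def min_curvature_path(centre, normals, w_n, w_p):
    """Sketch of the minimum curvature optimisation described above.

    centre  : (K, 2) centre-line points (x_k, y_k)
    normals : (K, 2) unit normal vectors at each point
    w_n, w_p: (K,) track widths on either side of the centre line

    The decision variables are the offsets n_k along the normals; the
    cost approximates path curvature with second differences. Shrinking
    the bounds [-w_n, w_p] at points near obstacles excludes them.
    """
    K = len(centre)

    def cost(n):
        path = centre + n[:, None] * normals
        # Second differences approximate curvature (closed-loop track).
        d2 = np.roll(path, -1, axis=0) - 2 * path + np.roll(path, 1, axis=0)
        return np.sum(d2 ** 2)

    bounds = list(zip(-w_n, w_p))  # -w_{n,k} < n_k < w_{p,k} for all k
    res = minimize(cost, np.zeros(K), bounds=bounds)
    return centre + res.x[:, None] * normals
```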
C. Random Forest Experimental Results
Table I shows the results from the experiments that were performed on a 25 m random forest. The experiments were performed on an empty forest with no obstacles, to evaluate the ability to track the reference path, and on a forest with 6 randomly located obstacles, to evaluate the ability to avoid obstacles.
TABLE I. Comparison of Benchmark and Hybrid Planner Results Through Random Forest: times are shown in seconds for completion of the forest with and without obstacles. The episode was unsuccessful if the vehicle crashed.

Vehicle   | No Obstacles | Obstacles | Success
Benchmark | 3.8 s        | 5.7 s     | 100%
Hybrid    | 6.8 s        | 10.6 s    | 78%
Fig. 5 shows an example path taken through an empty forest strip and the corresponding neural network output. In all the trajectory images, the green dashed line represents the reference path, and the red line represents the trajectory that was taken. The light grey is the track area, the blue blocks are un-mapped obstacles and the yellow stars are the start and end locations.

Fig. 5. Neural Network Output Compared to Path Taken by Vehicle.

In the tests through the empty random forest, our hybrid architecture performed significantly slower, with an average time of 6.8 seconds as opposed to 3.8 seconds for the benchmark solution. While the vehicle follows the reference path closely, the neural network still modifies the steering reference, causing the vehicle to swerve from side to side. The swerving caused by the neural network forces the vehicle to travel at a slower speed and explains why the hybrid architecture's average time is so much slower than the benchmark's.

The tests with obstacles show that the modification architecture is able to guide the vehicle through the forest with a 78% success rate. The success rate shows that the vehicle is able to avoid a significant number, but not all, of the obstacles. Fig. 6 shows typical trajectories that the benchmark and hybrid vehicles took. The example trajectories show that the neural network learns a policy that avoids obstacles as narrowly as possible. It is assumed that this narrow obstacle avoidance policy is the reason for the several crashes which took place. The assumed reason for the success rate on the race track being higher than in the forest is that the obstacles on the race track were less densely spaced.

On the left hand side of the hybrid example trajectory, it can be seen that the hybrid planner brings the vehicle back to the reference in between closely spaced obstacles, while the benchmark planner is able to smoothly plan around both obstacles. The lack of smoothness in the hybrid planner's path results in the vehicle travelling at slower speeds and thus taking an increased amount of time to complete the course.
Fig. 6. Random Forest Example Trajectories: benchmark planner (top) and hybrid planner (bottom).
D. Race Track Experimental Results
Table II shows the results from the experiments performed on the race track.

TABLE II. Comparison of Benchmark and Hybrid Planner Results on Race Track: times are shown in seconds for lap completion on the track with and without obstacles. The lap was unsuccessful if the vehicle crashed.

Vehicle   | No Obstacles | Obstacles | Success
Benchmark | 33.2 s       | 34.7 s    | 100%
Hybrid    | 41.6 s       | 43.2 s    | 89%

The test with no obstacles on the race track produces very similar results to those seen in the empty random forest. The hybrid planner is able to follow the reference trajectory, however the time taken is much longer due to the slower speeds caused by the constant swerving.

Fig. 7 shows a typical example of how the hybrid agent modifies the path follower generated references to avoid obstacles. The vehicle can clearly be seen to leave the reference trajectory when an obstacle is detected and then return smoothly once the obstacle is past. The hybrid planner is able to successfully navigate around the track with obstacles in 89% of the test laps. This shows that the neural network is able to learn a policy to modify the references sufficiently to avoid a significant number of obstacles.

Fig. 7. Example Trajectory of Vehicle on the Experimental Race Track.
On the left hand side of the track, where the reference trajectory comes close to the boundary, the hybrid agent deviates slightly from the reference, away from the boundary. This behaviour shows that the agent will make small deviations from the reference in order to maintain distance from obstacles and boundaries alike.

At certain times, the agent avoids obstacles in more elegant and optimal ways than at others. In the bottom right corner of the track, the hybrid planner smoothly avoids the obstacle by turning slightly more. On the right side of the track, the vehicle takes a jerky path around the obstacle. In the midsection of the track, it appears that the agent is confused between swerving left or right to avoid the obstacles. This shows that the agent avoids obstacles in different ways based on the situation.
E. Discussion & Future Work
The results show that the hybrid planner is able to avoid a significant number of obstacles without maintaining an obstacle map or requiring a special training setup. Our solution is able to track a reference trajectory and deviate when necessary using only 10 sparse laser range finder readings. The example trajectories show that the hybrid planner outputs jerky trajectories which prevent the vehicle from travelling at high speeds. Future work must be done to improve the smoothness of the trajectory around obstacles. While the hybrid architecture is effective in avoiding a significant number of obstacles, it still occasionally crashes. Future work must investigate safety guarantees that can be integrated into the system to ensure that it does not crash in application. The work done in this paper is limited to simulation and should be verified by implementing a hybrid planning architecture in practice on a physical setup. An open research area is avoiding dynamic obstacles, as it requires reasoning about the motion of the obstacle. Additional work should extend the proposed hybrid architecture to avoid dynamic obstacles.
VII. CONCLUSIONS
In this paper we address the problem of autonomously generating navigation references to track a reference path and avoid obstacles. We present a hybrid local planner which combines the path follower's ability to track a reference path with a neural network's ability to flexibly avoid obstacles. The neural network operates in parallel with the path follower and is able to adjust the references which the path follower has generated. The neural network is trained with RL to prevent collisions by modifying the references from the path follower. The architecture improves upon previous work by removing the need to maintain an obstacle map (as with SLAM-based systems), and not requiring expert demonstrations (like imitation learning) or instrumented training setups (like guided policy search). The architecture was evaluated in the context of F1/10th autonomous race cars in a random forest and on a classic race track. The results show that our architecture is able to use 10 sparse laser range finders to track a reference path and successfully avoid a significant number of obstacles. Future work can optimise the design for operation at high speeds and explore solutions for using a similar architecture to avoid dynamic obstacles.
REFERENCES
[1] S. D. Pendleton, H. Andersen, X. Du, X. Shen, M. Meghjani, Y. H. Eng, D. Rus, and M. H. Ang, "Perception, planning, control, and coordination for autonomous vehicles," Machines, vol. 5, no. 1, p. 6, 2017.
[2] P. Falcone, F. Borrelli, H. E. Tseng, J. Asgari, and D. Hrovat, "A hierarchical model predictive control framework for autonomous ground vehicles," IEEE, 2008, pp. 3719–3724.
[3] L. Heng, L. Meier, P. Tanskanen, F. Fraundorfer, and M. Pollefeys, "Autonomous obstacle avoidance and maneuvering on a vision-guided MAV using on-board processing," IEEE, 2011, pp. 2472–2477.
[4] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: part I," IEEE Robotics & Automation Magazine, vol. 13, no. 2, pp. 99–110, 2006.
[5] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[6] R. Ivanov, T. J. Carpenter, J. Weimer, R. Alur, G. J. Pappas, and I. Lee, "Case study: verifying the safety of an autonomous racing car with a neural network controller," in Proceedings of the 23rd International Conference on Hybrid Systems: Computation and Control, 2020, pp. 1–7.
[7] L. Tai, J. Zhang, M. Liu, J. Boedecker, and W. Burgard, "A survey of deep network solutions for learning control in robotics: From reinforcement to imitation," arXiv preprint arXiv:1612.07139, 2016.
[8] M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto, "Reinforced imitation: Sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4423–4430, 2018.
[9] L. Tai, G. Paolo, and M. Liu, "Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation," 2017, pp. 31–36.
[10] F. de Villiers and W. Brink, "Learning fine-grained control for mapless navigation," IEEE, 2020, pp. 1–6.
[11] M. Wei, S. Wang, J. Zheng, and D. Chen, "UGV navigation optimization aided by reinforcement learning-based path tracking," IEEE Access, vol. 6, pp. 57814–57825, 2018.
[12] L. Tai, J. Zhang, M. Liu, and W. Burgard, "Socially compliant navigation through raw depth inputs with generative adversarial imitation learning," IEEE, 2018, pp. 1111–1117.
[13] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning, 2013, pp. 1–9.
[14] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search," IEEE, 2016, pp. 528–535.
[15] G. Kahn, T. Zhang, S. Levine, and P. Abbeel, "PLATO: Policy learning using adaptive trajectory optimization," IEEE, 2017, pp. 3342–3349.
[16] R. C. Coulter, "Implementation of the pure pursuit path tracking algorithm," Carnegie Mellon University, Pittsburgh, PA, Robotics Institute, Tech. Rep., 1992.
[17] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[18] S. Fujimoto, H. Van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv preprint arXiv:1802.09477, 2018.
[19] V. S. Babu and M. Behl, "ROS F1/10 autonomous racecar simulator," October 2019.
[20] P. Polack, F. Altché, B. d'Andréa Novel, and A. de La Fortelle, "The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?" IEEE, 2017, pp. 812–818.
[21] M. O'Kelly, H. Zheng, D. Karthik, and R. Mangharam, "F1TENTH: An open-source evaluation environment for continuous control and reinforcement learning," in NeurIPS 2019 Competition and Demonstration Track. PMLR, 2020, pp. 77–89.
[22] F. Christ, A. Wischnewski, A. Heilmeier, and B. Lohmann, "Time-optimal trajectory planning for a race car considering variable tyre-road friction coefficients."