An advantage actor-critic algorithm for robotic motion planning in dense and dynamic scenarios
Chengmin Zhou, Bingding Huang, Pasi Fränti

University of Eastern Finland, Joensuu, Finland
Shenzhen Technology University, Shenzhen, China
* Correspondence: [email protected]
Abstract:
Intelligent robots provide new insight into efficiency improvement in industrial and service scenarios, where they can replace human labor. However, these scenarios include dense and dynamic obstacles that make the motion planning of robots challenging. Traditional algorithms such as A* can plan collision-free trajectories in static environments, but their performance degrades and their computational cost increases steeply in dense and dynamic scenarios. Optimal-value reinforcement learning (RL) algorithms can address these problems but suffer from slow and unstable network convergence. Networks of policy gradient RL converge fast in Atari games, where actions are discrete and finite, but little work has addressed problems that require continuous actions and a large action space. In this paper, we modify the existing advantage actor-critic algorithm and adapt it to complex motion planning, so that optimal speeds and directions of the robot are generated. Experimental results demonstrate that our algorithm converges faster and more stably than optimal-value RL. It achieves a higher success rate in motion planning with less processing time for the robot to reach its goal.
Key words:
Motion Planning, Path Planning, Reinforcement Learning, Deep Learning.
I. Introduction
Intelligent robots are playing a crucial role in various scenarios of daily life. For example, luggage delivery in airports or food delivery in restaurants can use robots to replace human labor and improve efficiency. These complex scenarios include many dynamic obstacles (e.g. pedestrians). The speed and moving direction of these obstacles are unpredictable, so it is challenging for a robot to move safely and efficiently towards its destination.

Figure 1. Circle-crossing scenario with dense and dynamic obstacles. The robot and all obstacles move at different velocities, and they are all generated at random positions within the circle.

Traditional graph search algorithms (e.g. A* [1], Dijkstra's algorithm [2]) and sampling-based algorithms (e.g. the dynamic window approach [3], rapidly-exploring random tree [4]) can plan optimal and suboptimal collision-free trajectories to their destinations. These algorithms work well in static environments, but it is challenging for robots to react properly in scenarios that include many dynamic obstacles, because these algorithms lack speed adjustment. Reciprocal velocity obstacles [5] handle speed adjustment well by computing a collision-free velocity towards the goal. However, the traditional algorithms mentioned above are all computationally expensive due to (1) the large search space and (2) frequent updates of dynamic scenarios.

Recent reinforcement learning (RL) can address the problems of traditional algorithms by using a trained network (model) to directly plan speed and direction simultaneously. RL algorithms mainly include optimal-value RL (e.g. Q-learning [6], deep Q-network [7] and deep V-learning [8]) and policy gradient RL (e.g. asynchronous advantage actor-critic [9], deep deterministic policy gradient [10]). Their performance, however, varies. The network for outputting actions needs to be trained, but optimal-value algorithms suffer from slow and unstable network convergence because of over-estimation of action values in training [11], while policy gradient RL converges better because of its actor-critic architecture [12]. However, actor-critic algorithms (e.g. asynchronous actor-critic) still converge slowly in early-stage training, although faster than optimal-value algorithms, because the low-accuracy policy network at the beginning generates a low-quality training set.

Most policy gradient RL algorithms are widely used in games (e.g. Atari games) [13][22], in which the outputted actions are finite and discrete, but few algorithms can be found for scenarios where continuous actions and large action choices (action space) are required. This paper modifies an existing high-performance actor-critic algorithm to fit such scenarios that require continuous actions and a large action space. This is achieved by (i) applying imitation learning [13][23][25] to enhance the speed and stability of network convergence; (ii) using selected experiences and an experience replay memory [7] for offline training to improve convergence speed; (iii) applying a novel network framework to the advantage actor-critic algorithm.

II. Problem formulation
Recent work [14] introduced a competent simulative environment (Figure 1) that includes a dynamic robot and obstacles in a fixed-size 2D indoor area. The robot and obstacles move towards their own goals simultaneously and avoid collisions with each other. The robot and obstacles obey the same or different motion-planning policies to avoid collisions. This simulative environment creates a circle-crossing scenario that adds predictable complexity to the environment. It is therefore a good platform to evaluate the performance of the algorithms adopted by the robot or obstacles.

The robot and obstacles plan motions towards their goals and avoid collisions by sequential decision making. Let $s_t$ represent the state of the robot at time step $t$. Let $a_t$ and $v_t$ represent the action and velocity of the robot, with $a_t = v_t = [v_x, v_y]$. Let $p_t = [p_x, p_y]$ represent the position of the robot. The state is composed of an observable part and a hidden part, $s_t = [s_t^o, s_t^h]$. The observable part refers to factors of the state that can be measured by others and is composed of position, velocity, and radius, $s_t^o = [p_x, p_y, v_x, v_y, r]$. The hidden part refers to factors of the state that cannot be seen by others and is composed of the planned goal position, preferred speed and heading angle, $s_t^h = [p_{gx}, p_{gy}, v_{pref}, \theta]$. The state, position, and radius of an obstacle are denoted by $\tilde{s}_t$, $\tilde{p}_t$ and $\tilde{r}$.

To analyze this decision-making process, we first introduce the one-robot one-obstacle case as in [8], and then extend it to the one-robot multi-obstacle case. The robot plans its motion by obeying policy $\pi$, while the obstacle obeys policy $\tilde{\pi}$. The objective of the robot is to minimize the time to its goal (Eq. 1) under its policy without colliding with the obstacle. The constraints of the robot's motion planning are formulated in Equations 2-5, which represent the collision avoidance constraint, the goal constraint, the kinematics of the robot and the kinematics of the obstacle respectively:

$$\operatorname*{argmin}_{\pi} \; \mathbb{E}\big[\,t_g \mid s_0, \tilde{s}_0, \pi, \tilde{\pi}\,\big] \qquad (1)$$
$$\text{s.t.} \quad \lVert p_t - \tilde{p}_t \rVert_2 \ge r + \tilde{r} \quad \forall t \qquad (2)$$
$$p_{t_g} = p_g \qquad (3)$$
$$p_t = p_{t-1} + \Delta t \cdot \pi(s_{0:t}, \tilde{s}^{\,o}_{0:t}) \qquad (4)$$
$$\tilde{p}_t = \tilde{p}_{t-1} + \Delta t \cdot \tilde{\pi}(\tilde{s}_{0:t}, s^{o}_{0:t}) \qquad (5)$$

where $t_g$ is the time to reach the goal and $p_g$ is the goal position.

The constraints of the one-robot one-obstacle case can be easily extended to the one-robot $N$-obstacle case, where the objective (Eq. 1) is replaced by $\operatorname*{argmin}_{\pi} \mathbb{E}[\,t_g \mid s_0, \tilde{s}_{1:N,0}, \pi, \tilde{\pi}\,]$ (assuming that all obstacles use the same policy $\tilde{\pi}$). The collision avoidance constraint (Eq. 2) is replaced by $\lVert p_t - \tilde{p}_{i,t} \rVert_2 \ge r + \tilde{r}$ for every obstacle $i$, assuming that all obstacles have the same radius $\tilde{r}$. The kinematics of the robot is replaced by $p_t = p_{t-1} + \Delta t \cdot \pi(s_{0:t}, \tilde{s}^{\,o}_{1:N,0:t})$, and the kinematics of obstacle $i$ is replaced by $\tilde{p}_{i,t} = \tilde{p}_{i,t-1} + \Delta t \cdot \tilde{\pi}(\tilde{s}_{i,0:t}, s^{o}_{0:t}, \tilde{s}^{\,o}_{-i,0:t})$.
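To make the formulation concrete, the following is a minimal Python sketch of the kinematic update (Eqs. 4-5) and the constraint checks (Eqs. 2-3); the time step, radii, start/goal positions and helper names are illustrative assumptions drawn from the experimental setup, not part of the formulation itself.

```python
import numpy as np

# Illustrative constants: radii and preferred speed follow the training setup,
# the time step DT is an assumed value.
DT = 0.25          # assumed time step (s)
RADIUS = 0.3       # robot radius r
OBS_RADIUS = 0.3   # obstacle radius r~

def step(p, v, dt=DT):
    """Kinematic update p_t = p_{t-1} + dt * v (Eqs. 4-5), where v is the chosen velocity."""
    return p + dt * v

def collision_free(p_robot, p_obstacles, r=RADIUS, r_obs=OBS_RADIUS):
    """Eq. 2: the robot keeps at least r + r~ distance from every obstacle."""
    dists = np.linalg.norm(p_obstacles - p_robot, axis=1)
    return bool(np.all(dists >= r + r_obs))

def reached_goal(p_robot, p_goal, tol=1e-2):
    """Eq. 3: goal constraint, the position equals the goal (within a tolerance)."""
    return np.linalg.norm(p_robot - p_goal) < tol

# Example: robot at (-4, 0) moving right at the preferred speed of 1 m/s
p = np.array([-4.0, 0.0])
v = np.array([1.0, 0.0])
obstacles = np.array([[0.0, 0.5], [1.0, -0.5]])
p = step(p, v)
print(collision_free(p, obstacles), reached_goal(p, np.array([4.0, 0.0])))
```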
III. Methods

A. Related works
Traditional methods: Graph search algorithms (e.g. A* [1] and Dijkstra's algorithm [2]) satisfy the constraints (Equations 2-5) by searching the working space with a search tree or by exhaustive search (the working space here refers to the indoor space), but they cannot satisfy Eq. 1 because velocity cannot be regulated in these algorithms, so the expected time to the goal is unpredictable. Additionally, the search process in graph search algorithms is computationally expensive, since the working space is searched repeatedly to compute an optimal trajectory.

Sampling-based algorithms (e.g. the dynamic window approach [3] and rapidly-exploring random tree [4]) sample the working space and compute a suboptimal trajectory. Like graph search algorithms, they satisfy the constraints (Equations 2-5) but not Eq. 1, and they are also computationally expensive, especially in dynamic environments, because the working space is sampled frequently and the maps used for trajectory planning must be updated constantly.

Reaction-based methods (e.g. optimal reciprocal collision avoidance (ORCA) [14]) satisfy the expected time to goal and the constraints (Equations 1-5) well by computing a one-step cost and selecting a suboptimal collision-free velocity towards the goal. However, they are computationally expensive because frequent updates are required to avoid dynamic obstacles.
Optimal value RL: Action-value-based RL algorithms (e.g. deep Q-network [7]) and state-value-based RL algorithms (e.g. deep V-learning [8][21]) satisfy Equations 1-5 completely. Both are optimal-value RL algorithms based on the Markov decision process (MDP). The sequential decision problem defined by Equations 1-5 can be easily formulated as an MDP, described as a tuple $(S, A, P, R, \gamma)$. $S$ represents the state and here refers to the states of the robot and obstacles. $A$ represents an action taken by the robot. State $S$ transits into another state under a state-transition probability $P$, and a reward $R$ is obtained from the environment. $\gamma$ is a discount factor used for normalization of the state value ($V$ value) or state-action value ($Q$ value). The objective of optimal-value RL is to find the optimal $Q$ or $V$ values

$$Q^*(s_t, a_t) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\Big|\, s_t, a_t, \pi^*\Big] \qquad (6)$$
$$V^*(s_t) = \mathbb{E}\Big[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\Big|\, s_t, \pi^*\Big] \qquad (7)$$

where $\pi^*$ refers to the optimal policy, obtained from the $Q$ or $V$ function by

$$\pi^*(s_t) = \operatorname*{argmax}_{a_t} Q^*(s_t, a_t) \qquad (8)$$
$$\pi^*(s_t) = \operatorname*{argmax}_{a_t} R(s_t, a_t) + \gamma \int_{s_{t+\Delta t}} P(s_{t+\Delta t} \mid s_t, a_t)\, V^*(s_{t+\Delta t})\, ds_{t+\Delta t} \qquad (9)$$

where $P(s_{t+\Delta t} \mid s_t, a_t)$ refers to the state-transition probability from time step $t$ to time step $t + \Delta t$. The state-transition probability relates to the kinematics of the robot and obstacles, but it is unknown because the motions of the obstacles are unpredictable. The optimal policy can nevertheless be represented using the $Q$ or $V$ functions (Equations 8-9) by using a neural network as approximator, so the optimal action $a_t$ can be selected, for instance, by

$$a_t = \operatorname*{argmax}_{a_t} R(s_t, a_t) + \gamma^{\Delta t \cdot v_{pref}}\, V^*(s_{t+\Delta t}) \qquad (10)$$

as in line 7 of Algorithm 1 [15]. The state value $V$ is more suitable than the action value $Q$ for action selection, because the action space here is continuous and the set of permissible velocity vectors depends on the states of the robot and obstacles (preferred speed) [8]. Note that the action space here refers to all actions that can possibly be selected by the robot or obstacles.

Optimal-value RL algorithms, whether state-value-based or action-value-based, rely on a neural network to approximate the $Q$ value or $V$ value. This process inevitably introduces errors in value prediction [13], leading to over-estimation of the $Q$ or $V$ values. Over-estimation leads to slow and unstable network convergence in training, so optimal-value RL is not the best choice for the sequential decision problem defined by Equations 1-5.
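For illustration, here is a sketch of the value-based action selection of Eq. 10 over a discrete set of candidate velocities; `propagate`, `reward_fn` and `value_net` are assumed interfaces rather than the implementation of [8].

```python
import numpy as np

def select_action(state, candidate_actions, value_net, reward_fn, propagate,
                  dt, v_pref, gamma=0.9):
    """Eq. 10: pick the action maximizing one-step reward plus the discounted value
    of the propagated next state. gamma is an assumed discount factor; propagate,
    reward_fn and value_net are assumed helper interfaces."""
    best_action, best_score = None, -np.inf
    for a in candidate_actions:
        next_state = propagate(state, a, dt)           # estimate s_{t+dt} under action a
        score = reward_fn(state, a) + gamma ** (dt * v_pref) * value_net(next_state)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```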
Algorithm 1: Deep V-learning
1: Initialize value network V with demonstration D
2: Initialize target value network V' ← V
3: Initialize experience replay memory E
4: For episode = 1, M do
5:   Initialize random sequence s_0
6:   Repeat
7:     a_t ← argmax_{a_t} R(s_t, a_t) + γ^{Δt·v_pref} V(s_{t+Δt})
8:     Store tuple (s_t, a_t, r_t, s_{t+1}) in E
9:     Sample random minibatch tuples from E
10:    Set target y = r_t + γ^{Δt·v_pref} V'(s_{t+1})
11:    Update value network V by gradient descent
12:  Until terminal state or t ≥ t_max
13:  Update target network V' ← V
End for
Return V

Policy gradient RL: Optimal value RL uses a neural network to approximate the Q or V value and selects actions indirectly via Eq. 10. The policy gradient method [26] uses a neural network as policy π_θ(a_t|s_t) to select actions directly. The policy gradient method relies on the trajectory reward R(τ) in Equation 11, instead of the one-step reward used in optimal value RL (τ refers to the robot's interactive trajectory with the environment):

$$R(\tau) = \sum_{t=1}^{T} r(s_t, a_t) \qquad (11)$$

This direct, trajectory-reward-based approach reduces the instability of network convergence seen in optimal-value RL. The objective of the policy gradient method is defined as

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[R(\tau)\big] \qquad (12)$$

The policy network can be updated with the objective gradient ∇J(θ) by

$$\theta \leftarrow \theta + \alpha \nabla J(\theta) \qquad (13)$$

where $\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[\nabla \log \pi_\theta(\tau)\, R(\tau)\big]$. The trajectory τ can be divided into T steps, so the gradient can be rewritten as

$$\nabla J(\theta) = \mathbb{E}\Big[\Big(\sum_{t=1}^{T} \nabla \log \pi_\theta(a_t \mid s_t)\Big)\Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big] \qquad (14)$$

The first part $\sum_{t=1}^{T} \nabla \log \pi_\theta(a_t \mid s_t)$ is known because a neural network is used as approximator, so the gradient value relies on the second part $\sum_{t=1}^{T} r(s_t, a_t)$. However, using the trajectory reward $\sum_{t=1}^{T} r(s_t, a_t)$ to update the policy network inevitably slows down network convergence, although stability is improved.

The actor-critic algorithm [12] is an optimized version of the policy gradient method that uses an actor-critic architecture, in which one network (the policy network) works as the actor that selects actions, and another network (the critic network) works as the critic that evaluates the selected action and guides the update of the policy network. The policy network approximates the policy π_θ(a_t|s_t), while the critic network approximates the state value V_w(s_t). The objectives of the policy network and critic network are defined as

$$J(\theta) = \mathbb{E}\big[\log \pi_\theta(a_t \mid s_t)\, \delta\big] \qquad (15)$$
$$J(w) = \mathbb{E}\big[\delta^2\big] \qquad (16)$$

where $\delta = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$ is the temporal-difference error that measures the difference between the unbiased estimate of the return and the predicted state value V_w(s_t). Updates of the policy and critic networks are obtained via Equations 17-18, analogous to the policy gradient update (Eq. 13):

$$\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \delta \qquad (17)$$
$$w \leftarrow w - \alpha_w \nabla_w \delta^2 \qquad (18)$$

The actor-critic algorithm updates its networks (θ and w) using the one-step reward or N-step rewards in the interactive trajectory τ, so the speed of network convergence is improved. Note that there is a trade-off in selecting N: a larger N leads to slower convergence but better stability, as in the policy gradient method.
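As a concrete reading of Equations 15-18, the following is a PyTorch-style sketch of one actor-critic update from a single transition; the network and optimizer objects, the discrete-action policy head and the discount factor are assumed interfaces, not any specific implementation.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(policy_net, critic_net, policy_opt, critic_opt,
                        s, a, r, s_next, gamma=0.9):
    """One-step actor-critic update (Eqs. 15-18): the TD-error delta weights
    log pi(a|s) for the actor, and the critic minimizes the squared TD-error.
    gamma is an assumed discount factor; a is the index of the taken discrete action."""
    v = critic_net(s)                                  # V_w(s_t)
    v_next = critic_net(s_next).detach()               # V_w(s_{t+1}), treated as constant
    delta = (r + gamma * v_next - v).detach()          # TD-error delta (Eqs. 15-16)

    # Critic update (Eqs. 16, 18): gradient step on the squared TD-error
    critic_loss = F.mse_loss(v, r + gamma * v_next)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update (Eqs. 15, 17): ascend log pi_theta(a|s) * delta
    log_prob = torch.log(policy_net(s)[a])             # policy_net(s) assumed to return action probabilities
    actor_loss = -(log_prob * delta)
    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()
```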
The recent asynchronous advantage actor-critic algorithm (A3C) [9] and GPU-based A3C (GA3C) [16][20] go further in improving the speed of network convergence by using multiple threads [9] to obtain more interactive trajectories and update the networks in each thread, as described in Algorithm 2 [9]. However, existing work on the advantage actor-critic algorithm (A2C) [9][17] shows that the multi-thread method contributes only marginally to the speed of network convergence. A large portion of the trajectories collected by either A3C or GA3C lacks quality. For example, collected trajectories have high quality if the robot reaches the goal or collides with obstacles, while other trajectories collected under the maximum time t_max contribute less to the speed of network convergence. Additionally, A3C and GA3C are online-learning algorithms. This means these algorithms cannot reuse collected high-quality trajectories in training, which also lowers the efficiency of network convergence. Moreover, these algorithms lack a proper method for network initialization, so the network converges slowly in early-stage training.

Algorithm 2: A3C for each actor-learner thread
// assume global shared parameter vectors θ and θ_v, and global shared counter T = 0
// assume thread-specific parameter vectors θ' and θ'_v
1: Initialize thread step counter t ← 1
2: Repeat
3:   Reset gradients: dθ ← 0 and dθ_v ← 0
4:   Synchronize thread-specific parameters: θ' = θ and θ'_v = θ_v
5:   t_start = t
6:   Get state s_t
7:   Repeat
8:     Perform a_t according to policy π(a_t | s_t; θ')
9:     Receive reward r_t and new state s_{t+1}
10:    t ← t + 1
11:    T ← T + 1
12:  Until terminal s_t or t − t_start == t_max
13:  R = 0 for terminal s_t, otherwise R = V(s_t; θ'_v)
14:  For i ∈ {t − 1, ..., t_start} do
15:    R ← r_i + γR
16:    Accumulate gradients wrt θ': dθ ← dθ + ∇_{θ'} log π(a_i | s_i; θ')(R − V(s_i; θ'_v))
17:    Accumulate gradients wrt θ'_v: dθ_v ← dθ_v + ∂(R − V(s_i; θ'_v))² / ∂θ'_v
18:  End for
19:  Perform asynchronous update of θ using dθ and of θ_v using dθ_v
20: Until T > T_max
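The backward accumulation of bootstrapped returns in lines 13-15 of Algorithm 2 can be written as a short sketch; `value_net` is an assumed callable and the discount factor is an assumed default.

```python
def n_step_returns(rewards, last_state, terminal, value_net, gamma=0.9):
    """Bootstrapped n-step returns used by A3C (Algorithm 2, lines 13-15).
    rewards: list [r_{t_start}, ..., r_{t-1}] collected in the rollout."""
    R = 0.0 if terminal else float(value_net(last_state))   # line 13
    returns = []
    for r in reversed(rewards):                              # i = t-1, ..., t_start
        R = r + gamma * R                                    # line 15: R <- r_i + gamma * R
        returns.append(R)
    returns.reverse()                                        # back to chronological order
    return returns
```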
B. Our optimized A2C for motion planning

Inspired by deep V-learning [8], A3C [9] and A2C [17], we modify A2C to solve the problem defined by Equations 1-5. We introduce an A2C algorithm for motion planning (A2CMP), shown in Algorithm 3.
The speed of network convergence in early-stage training is improved by imitation learning with high-quality demonstrations (interactive trajectories) D [13], which are generated by running a pretrained deep V-learning model [8]. Hence, the policy network, value network (critic network) and their target networks are initialized from demonstrations (lines 1-3). Unlike online-learning algorithms (e.g. A3C), an offline-learning method with experience replay memory E is used in A2CMP to reuse collected high-quality demonstrations. The replay memory E is initialized by running the initialized policy network (line 4), which is achieved by lines 6-15. An experience is the tuple (s^{jn}, a, y), where s^{jn} represents the joint state composed of the state of the robot and the observable states of the obstacles, $s_t^{jn} = [s_t, \tilde{s}_{1,t}^{\,o}, ..., \tilde{s}_{N,t}^{\,o}]$. The action a_t is the velocity vector [v_x, v_y]. The target state value y_t is composed of the reward r_t and the predicted state value generated by the target critic network V_{w'} (line 12):

$$y_t = r_t + \gamma^{\Delta t \cdot v_{pref}}\, V_{w'}(s_{t+1}^{jn}) \qquad (19)$$

where $\gamma^{\Delta t \cdot v_{pref}}$ relates to the time step and the preferred speed of the robot. The reward follows [8] and is defined by

$$R(s_t^{jn}, a_t) = \begin{cases} -0.25 & \text{if } d_{min} < 0 \\ -0.1 + d_{min}/2 & \text{else if } d_{min} < 0.2 \\ 1 & \text{else if } p_t = p_g \\ 0 & \text{otherwise} \end{cases} \qquad (20)$$

where $d_{min}$ represents the minimum distance between the robot and the obstacles. Negative rewards penalize the robot if it moves too close to, or collides with, an obstacle [8]. Note that not all experiences are qualified for training: qualified experiences are saved into the replay memory E only when the robot reaches the goal or collides with an obstacle (line 13). The interaction (episode) of the robot with the environment terminates when the robot reaches the goal, collides with an obstacle, or the maximum interaction time is reached, t ≥ t_max (line 15).

The experience replay memory E is updated in the same way as in the initialization (lines 6-15). N random minibatches of tuples (s^{jn}, a, y) with length L are sampled from the replay memory (line 16) for the forthcoming training (lines 17-26). The temporal-difference error (TD-error) is used as the advantage A:

$$A = \delta = y - V_{w'}(s^{jn}) \qquad (21)$$

where y is the target state value obtained by Equation 19 and $V_{w'}(s^{jn})$ is the state value predicted by the target critic network.
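A sketch of the reward in Eq. 20 as a plain function; the numerical constants follow the reward of [8], and the inputs `d_min` and `reached_goal` are assumed to be provided by the simulator rather than computed here.

```python
def reward(d_min: float, reached_goal: bool) -> float:
    """Reward of Eq. 20: penalize collisions and near-collisions, reward reaching the goal.
    d_min is the minimum robot-obstacle separation (negative means collision)."""
    if d_min < 0:                 # collision with an obstacle
        return -0.25
    elif d_min < 0.2:             # uncomfortably close to an obstacle
        return -0.1 + d_min / 2
    elif reached_goal:            # robot position equals the goal position
        return 1.0
    return 0.0
```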
The objective of the critic network is defined as

$$J(w') = \mathbb{E}\big[A^2\big] \qquad (22)$$

The objective of the policy network is defined as

$$J(\theta') = \mathbb{E}\big[\log \pi_{\theta'}(a_t \mid s_t^{jn})\, A\big] + \beta\, H\big(\pi_{\theta'}(\cdot \mid s_t^{jn})\big) \qquad (23)$$

where the first part $\mathbb{E}[\log \pi_{\theta'}(a_t \mid s_t^{jn})\, A]$ is the same as the objective of the policy network in A3C [9], while the second part $H(\pi_{\theta'}(\cdot \mid s_t^{jn}))$ is the policy entropy [18], which encourages exploration of better actions in early-stage training to speed up network convergence. Here $\beta \in (0, 1)$ is a coefficient that balances the exploration. According to the objectives of the critic and policy networks, the losses of the critic and policy networks are defined as

$$L_v = \mathbb{E}\big[A^2\big] \qquad (24)$$
$$L_\pi = -\mathbb{E}\big[\log \pi_{\theta'}(a_t \mid s_t^{jn})\, A\big] - \beta\, H\big(\pi_{\theta'}(\cdot \mid s_t^{jn})\big) \qquad (25)$$

The target policy and critic networks are updated by minimizing the total loss

$$L_{total} = L_\pi + c\, L_v \qquad (26)$$

where $c \in (0, 1)$ is a coefficient used to balance the update speed of the target critic network. The critic network and policy network are updated after every K episodes of training (line 27), and the policy network is saved after M episodes of training (line 29).
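A minimal PyTorch-style sketch of how the losses of Eqs. 21-26 can be assembled for one minibatch; the coefficients β = 0.01 and c = 0.5 come from Table 2, while the network objects, the 80-way discrete action head and the tensor shapes are illustrative assumptions.

```python
import torch

def a2cmp_loss(policy_net, critic_net, states, action_idx, targets, beta=0.01, c=0.5):
    """Losses of Eqs. 21-26 for one minibatch.
    states:     tensor (B, 9 + 5N) of joint states
    action_idx: long tensor (B,) of indices into the 80-action space (16 directions x 5 speeds)
    targets:    tensor (B,) of target values y from Eq. 19"""
    values = critic_net(states).squeeze(-1)               # V_w'(s)
    advantage = (targets - values).detach()                # Eq. 21: A = y - V_w'(s)

    probs = policy_net(states)                             # (B, 80) softmax action probabilities
    log_probs = torch.log(probs + 1e-8)
    chosen_log_prob = log_probs.gather(1, action_idx.unsqueeze(1)).squeeze(1)
    entropy = -(probs * log_probs).sum(dim=1)              # policy entropy H

    critic_loss = ((targets - values) ** 2).mean()                                 # Eq. 24
    policy_loss = -(chosen_log_prob * advantage).mean() - beta * entropy.mean()    # Eq. 25
    return policy_loss + c * critic_loss                                           # Eq. 26
```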
Algorithm 3: A2C for motion planning (A2CMP)
1: Collect demonstration D by a pretrained deep V-learning model
2: Initialize policy network π_θ and value network V_w by imitation learning with demonstration D
3: Initialize target policy network and target value network: θ' ← θ, w' ← w
4: Initialize experience replay memory E
5: For episode = 1, M do
6:   Initialize random sequence s_0
7:   Repeat
8:     Perform a_t according to policy network π_θ'(a_t | s_t^{jn})
9:     Receive reward r_t and new state s_{t+1}
10:    Save experience tuple (s_t^{jn}, a_t, r_t, s_{t+1})
11:    Get predicted state value V_{w'}(s_{t+1}^{jn})
12:    Set target state value: y_t = r_t + γ^{Δt·v_pref} V_{w'}(s_{t+1}^{jn})
13:    Update E with (s_t^{jn}, a_t, y_t) if terminal state
14:    t ← t + 1
15:  Until terminal state (reaching goal or collision) or t ≥ t_max
16:  Sample N random minibatches (s^{jn}, a, y) with size L from E
17:  For each minibatch do
18:    Get predicted state value V_{w'}(s^{jn})
19:    Compute advantage value: A = δ = y − V_{w'}(s^{jn})
20:    Set critic objective: J(w') = E[A²]
21:    Set policy objective: J(θ') = E[log π_θ'(a | s^{jn}) A] + β H(π_θ')
22:    Get critic loss L_v (Eq. 24)
23:    Get policy loss L_π (Eq. 25)
24:    Get total loss: L_total = L_π + c L_v
25:    Update target networks θ' and w' by minimizing the total loss L_total
26:  End for
27:  Update networks θ ← θ' and w ← w' every K episodes
28: End for
29: Return policy network π_θ
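The selective storage of experiences in lines 10-16 of Algorithm 3 (only episodes ending in goal reaching or collision are kept) can be sketched as a small helper class; the class and method names are illustrative, with the capacity and batch size taken from Table 2.

```python
import random
from collections import deque

class SelectiveReplayMemory:
    """Stores only experiences from qualified episodes (goal reached or collision),
    as in Algorithm 3, lines 10-16."""
    def __init__(self, capacity=100000):
        self.memory = deque(maxlen=capacity)   # capacity from Table 2
        self.episode = []                      # buffer for the current episode

    def save(self, state, action, target_value):
        self.episode.append((state, action, target_value))

    def end_episode(self, reached_goal, collided):
        if reached_goal or collided:           # qualified episode: keep its experiences
            self.memory.extend(self.episode)
        self.episode = []                      # timed-out episodes are discarded

    def sample(self, batch_size=100):
        return random.sample(self.memory, min(batch_size, len(self.memory)))
```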
IV. Experiment

A. Design of neural network

Four linear layers are used in the framework of A2CMP: linear 1 as the input layer, linear 2 as a hidden layer, linear critic as the output layer of the critic network and linear actor as the output layer of the policy network. Two ReLU layers are used as activation after the linear 1 and linear 2 layers. One softmax layer is used for activation and normalization, so that features are mapped to a probability distribution. The framework of the neural network is shown in Figure 2.
Figure 2. Network framework of A2CMP. It shows the one-robot one-obstacle case, where the joint state is composed of the full state of the robot and the observable state of the obstacle. The experiment includes five obstacles, so the joint state is $s^{jn} = [s, \tilde{s}_1^{\,o}, ..., \tilde{s}_5^{\,o}]$.

The input is the joint state, composed of the full state of the robot $s = [p_x, p_y, v_x, v_y, r, p_{gx}, p_{gy}, v_{pref}, \theta]$ and the observable states of the obstacles $\tilde{s}_i^{\,o} = [\tilde{p}_x, \tilde{p}_y, \tilde{v}_x, \tilde{v}_y, \tilde{r}]$, $i = 1, ..., N$, where N is the number of obstacles. Here we choose N = 5 for our experiments. Hence, the input size is 9 + 5N. The number of directions is set to 16, and the number of sampled speeds is set to 5, so the action space is of size 16 × 5 = 80. The sizes of the four linear layers are listed in Table 1. The outputted actions are velocity vectors that cannot be directly normalized by the softmax layer. Following one-hot encoding, we use the label (index) of each action in the action space to represent the outputted action for normalization. Hence, the outputs of the softmax layer are action labels that are parsed to actions by mapping labels back to actions.

Table 1. Size of network layers.

Name            Size (input size, output size)
Linear 1        (9 + 5N, 128)
Linear 2        (128, 256)
Linear critic   (256, 1)
Linear actor    (256, 16 × 5)
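A sketch of the network of Figure 2 and Table 1 in PyTorch; the layer sizes follow Table 1 and the 80-way actor head follows the 16 × 5 action space, while the class name and forward interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

N_OBSTACLES = 5
N_ACTIONS = 16 * 5      # 16 directions x 5 speeds

class A2CMPNet(nn.Module):
    """Shared trunk (linear 1, linear 2 with ReLU) with a critic head and a softmax actor head."""
    def __init__(self, n_obstacles=N_OBSTACLES):
        super().__init__()
        in_dim = 9 + 5 * n_obstacles             # full robot state + observable obstacle states
        self.linear1 = nn.Linear(in_dim, 128)
        self.linear2 = nn.Linear(128, 256)
        self.linear_critic = nn.Linear(256, 1)    # state value V(s_jn)
        self.linear_actor = nn.Linear(256, N_ACTIONS)

    def forward(self, joint_state):
        h = torch.relu(self.linear1(joint_state))
        h = torch.relu(self.linear2(h))
        value = self.linear_critic(h)                                   # critic output
        action_probs = torch.softmax(self.linear_actor(h), dim=-1)      # actor output over action labels
        return action_probs, value

# Example forward pass with a batch of two joint states
net = A2CMPNet()
probs, value = net(torch.zeros(2, 9 + 5 * N_OBSTACLES))
print(probs.shape, value.shape)   # torch.Size([2, 80]) torch.Size([2, 1])
```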
B. Training

Experiences for training are tuples (s^{jn}, a, y) from qualified trajectories collected in the simulative environment shown in Figures 1 and 3. The simulative environment is the circle-crossing simulator from RVO2 [14]. The width of the circle is set to 8 m and the position of the circle's centre is set to (0, 0). The radii r of the robot and obstacles are set to 0.3-0.5 m, and all 5 obstacles have the same radius. The preferred speed v_pref is set to 1.0 m/s. The goal position of the robot is randomly set within the circle, and the starting position of the robot is at the opposite position within the circle. The goal positions of the obstacles are also randomly set, and their starting positions are likewise at the opposite positions within the circle.

Figure 3. Circle-crossing simulator in the experiment.

The capacity of the experience replay memory E is set to 100000. The number of demonstrations D collected from deep V-learning [8] for imitation learning is set to 3000 episodes. The learning rate and number of epochs of imitation learning are set to 0.01 and 50 respectively. The learning rate, batch size and training episodes M of A2CMP are set to 0.001, 100 and 10000 respectively. The critic network and policy network are updated every 50 episodes. The number of sampled minibatches N is set to 1. The entropy coefficient β and critic loss coefficient c are set to 0.01 and 0.5 respectively. The values of all parameters are summarized in Table 2.

Table 2. Values of parameters used for training.

Parameter                              Value
Capacity of replay memory E            100000
Number of demonstrations D             3000 episodes
Imitation learning rate                0.01
Imitation learning epochs              50
A2CMP learning rate                    0.001
Batch size L                           100
Training episodes M                    10000
Network update interval K              50 episodes
Number of sampled minibatches N        1
Entropy coefficient β                  0.01
Critic loss coefficient c              0.5
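For reference, the training settings of Table 2 can be collected into a plain configuration dictionary (the key names are illustrative, the values are those stated above).

```python
A2CMP_CONFIG = {
    "replay_capacity": 100_000,       # capacity of experience replay memory E
    "num_demonstrations": 3_000,      # episodes of deep V-learning demonstrations D
    "il_learning_rate": 0.01,         # imitation learning rate
    "il_epochs": 50,                  # imitation learning epochs
    "learning_rate": 0.001,           # A2CMP learning rate
    "batch_size_L": 100,              # minibatch size L
    "train_episodes_M": 10_000,       # training episodes M
    "update_every_K": 50,             # network update interval (episodes)
    "num_minibatches_N": 1,           # sampled minibatches per update
    "entropy_coef_beta": 0.01,        # entropy coefficient in Eqs. 23/25
    "critic_loss_coef_c": 0.5,        # critic loss coefficient in Eq. 26
}
```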
C. Evaluation

Deep V-learning and A2CMP without imitation learning are used as control groups to compare the speed of convergence and stability of training with A2CMP with imitation learning. The networks of these algorithms are all trained for 10000 episodes of experience. The network of deep V-learning converges after about 8000 episodes. It converges slowly, with small average rewards in the early-stage training. The A2CMP network without imitation learning converges after about 6000 episodes, and it receives higher average rewards than deep V-learning. The A2CMP network with imitation learning converges near 4000 episodes, and the average rewards received by A2CMP are higher than those of deep V-learning and of A2CMP without imitation learning. The A2CMP network also converges more stably, while the networks of deep V-learning and A2CMP without imitation learning lack stability, according to their fluctuating training curves shown in Figure 4. Hence, we conclude that A2CMP is more stable than deep V-learning in training and needs fewer training experiences than deep V-learning for network convergence. We also see that imitation learning for network initialization contributes substantially to the stability and speed of convergence in training.

Figure 4. Training curves. (a) Average rewards received by deep V-learning. (b) Average rewards received by A2CMP without imitation learning. (c) Average rewards received by A2CMP.

The performance of A2CMP is evaluated in the circle-crossing simulative environment, where the starting and goal positions of the robot are fixed at (-4, 0) and (4, 0) respectively. The positions of the obstacles are randomly set within the circle, while their goal positions are always at the opposite position within the circle. ORCA [14] and deep V-learning are used as control groups for comparison. The trained networks of these algorithms are all evaluated over 100 testing episodes. Each episode terminates once the robot reaches the goal or the experimental time reaches the maximum testing time t_max. Trajectories and positions of the robot and obstacles at each time step are shown in Figure 5. The success rate of ORCA after 100 testing episodes is 0.99, which is higher than that of deep V-learning (0.95) and A2CMP (0.96). The average time to reach the goal with A2CMP (10.2 s) is slightly better than that of deep V-learning (10.4 s), and it is about 17% shorter than the time spent by ORCA (12.5 s).

The rates of collision and goal missing of deep V-learning and A2CMP are higher than those of ORCA. This may relate to the defective expression of the joint state, which simply concatenates the robot's state and the obstacles' states. We cannot completely represent the relationships between the robot and dynamic obstacles in this way. It could be fixed by using a new state expression (e.g. attention weights [15] or a relation graph [19]), but that is beyond the scope of this paper. Detailed comparisons of these algorithms are summarized in Table 3.

Figure 5. Trajectory examples in evaluation. The figure shows four cases from the 100 evaluations. The yellow circle represents the positions of the robot, and the other colors denote those of the obstacles at each time step. The time to reach the goal varies because the complexity of each case differs. The average time to reach the goal is therefore used to evaluate the performance of the algorithms.
Table 3. Performance comparisons in evaluations.
Algorithm   Success rate   Collision rate   Goal missing*   Average time (s)
ORCA        0.99           0.00             0.01            12.5
DVL         0.95           0.02             0.03            10.4
A2CMP       0.96           0.02             0.02            10.2

*Goal missing means that the robot fails to reach the goal before the experimental time reaches the maximum time.
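A sketch of how the metrics of Table 3 could be accumulated over the 100 test episodes; `run_episode` is an assumed rollout helper returning the episode outcome and the time to goal, not part of the paper's code.

```python
def evaluate(policy, run_episode, n_episodes=100):
    """Accumulate the metrics reported in Table 3. run_episode(policy) is assumed to
    return (outcome, time_to_goal) with outcome in {'success', 'collision', 'timeout'}."""
    counts = {"success": 0, "collision": 0, "timeout": 0}
    times = []
    for _ in range(n_episodes):
        outcome, t_goal = run_episode(policy)
        counts[outcome] += 1
        if outcome == "success":
            times.append(t_goal)
    return {
        "success_rate": counts["success"] / n_episodes,
        "collision_rate": counts["collision"] / n_episodes,
        "goal_missing": counts["timeout"] / n_episodes,
        "avg_time_to_goal": sum(times) / len(times) if times else float("nan"),
    }
```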
V. Conclusion and future works
Our A2CMP algorithm is a modified version of the advantage actor-critic algorithm, in which the actor-critic architecture contributes to the speed of network convergence compared with the deep V-learning algorithm. The A2CMP network learns from experience offline by using an experience replay memory that stores high-quality experiences, which speeds up network convergence. Imitation learning is also used for network initialization, and it contributes to convergence stability and convergence speed simultaneously in training. The results of training and evaluation demonstrate that our A2CMP algorithm converges faster and is more stable than deep V-learning. A2CMP needs less time to reach the goal compared with ORCA and deep V-learning, but a few collisions and missed goals still occur.

Future work will concentrate on better ways to express the state for a better description of the robot and dynamic obstacles. For example, future motions of dynamic obstacles should be predicted from their previous trajectories and used as features in a new state expression. In this way, algorithms can learn more from the state, and cases such as collision or goal missing can be better avoided.

Acknowledgment
We thank the authors of [14][24] and [15] for the open-source implementations of RVO2 and deep V-learning.

References

[1] P. E. Hart, N. J. Nilsson and B. Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100-107.
[2] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271.
[3] D. Fox, W. Burgard and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine, vol. 4, no. 1, pp. 23-33.
[4] A. Bry and N. Roy. Rapidly-exploring random belief trees for motion planning under uncertainty. IEEE International Conference on Robotics and Automation, pp. 723-730.
[5] J. van den Berg, M. Lin and D. Manocha. Reciprocal velocity obstacles for real-time multi-agent navigation. IEEE International Conference on Robotics and Automation, pp. 1928-1935.
[6] R. S. Sutton and A. G. Barto. Reinforcement learning: an introduction. MIT Press.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv:1312.5602 [cs.LG].
[8] Y. Chen, M. Liu, M. Everett and J. P. How. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. arXiv:1609.07845v2 [cs.MA].
[9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv:1602.01783v2 [cs.LG].
[10] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra. Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs.LG].
[11] H. van Hasselt, A. Guez and D. Silver. Deep reinforcement learning with double Q-learning. arXiv:1509.06461v3 [cs.LG].
[12] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. Society for Industrial and Applied Mathematics, vol. 42.
[13] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. D. Arnold, I. Osband, J. Agapiou, J. Z. Leibo and A. Gruslys. Deep Q-learning from demonstrations. arXiv:1704.03732v4 [cs.AI].
[14] J. van den Berg, S. J. Guy, M. Lin and D. Manocha. Reciprocal n-body collision avoidance. Springer Tracts in Advanced Robotics, vol. 70.
[15] C. Chen, Y. Liu, S. Kreiss and A. Alahi. Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. International Conference on Robotics and Automation (ICRA), pp. 6015-6022.
[16] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons and J. Kautz. Reinforcement learning through asynchronous advantage actor-critic on a GPU. arXiv:1611.06256 [cs.LG].
[17] OpenAI Baselines: ACKTR and A2C. Web, August 18, 2017. https://openai.com/blog/baselines-acktr-a2c
[18] Z. Ahmed, N. L. Roux, M. Norouzi and D. Schuurmans. Understanding the impact of entropy on policy optimization. International Conference on Machine Learning, 97:151-160.
[19] C. Chen, S. Hu, P. Nikdel, G. Mori and M. Savva. Relational graph learning for crowd navigation. arXiv:1909.13165v3 [cs.RO].
[20] M. Everett, Y. Chen and J. P. How. Motion planning among dynamic, decision-making agents with deep reinforcement learning. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3052-3059.
[21] Y. Chen, M. Everett, M. Liu and J. P. How. Socially aware motion planning with deep reinforcement learning. arXiv:1703.08862v2 [cs.RO].
[22] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu and N. de Freitas. Sample efficient actor-critic with experience replay. arXiv:1611.01224v2 [cs.LG].
[23] F. Codevilla, M. Müller, A. López, V. Koltun and A. Dosovitskiy. End-to-end driving via conditional imitation learning. IEEE International Conference on Robotics and Automation (ICRA), pp. 4693-4700.
[24] S. J. Guy, J. Chhugani, C. Kim, N. Satish, M. Lin, D. Manocha and P. Dubey. ClearPath: highly parallel collision avoidance for multi-agent simulation. ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 177-187.
[25] J. Oh, Y. Guo, S. Singh and H. Lee. Self-imitation learning. arXiv:1806.05635v1 [cs.LG].
[26] R. Sutton, D. A. McAllester, S. Singh and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Proceedings of the 12th International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, 1999.