A review of motion planning algorithms for intelligent robotics
Chengmin Zhou {[email protected]}Bingding Huang {[email protected]}Pasi Fränti {[email protected]} University of Eastern Finland, Joensuu, Finland. Shenzhen Technology University, Shenzhen, China.* Correspondence: Pasi Fränti
Abstract:
We investigate and analyze the principles of typical motion planning algorithms. These include traditional planning algorithms, supervised learning, optimal value reinforcement learning, and policy gradient reinforcement learning. The traditional planning algorithms we investigate include graph search algorithms, sampling-based algorithms, and interpolating curve algorithms. Supervised learning algorithms include MSVM, LSTM, MCTS and CNN. Optimal value reinforcement learning algorithms include Q learning, DQN, double DQN and dueling DQN. Policy gradient algorithms include the policy gradient method, actor-critic algorithm, A3C, A2C, DPG, DDPG, TRPO and PPO. New general criteria are also introduced to evaluate the performance and application of motion planning algorithms through analytical comparisons. The convergence speed and stability of optimal value and policy gradient algorithms are specially analyzed. Future directions are presented analytically according to the principles and analytical comparisons of motion planning algorithms. This paper provides researchers with a clear and comprehensive understanding of the advantages, disadvantages, relationships, and future of motion planning algorithms in robotics, and paves the way for better motion planning algorithms.
Keywords:
Motion Planning, Path Planning, Intelligent Robotics, Reinforcement Learning, Deep Learning.
I. Introduction
Intelligent robots nowadays serve people from different backgrounds in complex and dynamic shopping malls, train stations and airports [1], such as Daxing in Beijing and Changi in Singapore. Intelligent robots guide pedestrians to coffee houses, departure gates and exits via accurate motion planning, and assist pedestrians in luggage delivery. Another example of intelligent robotics is the parcel delivery robots from e-commerce tech giants like JD in China and Amazon in the US. Researchers at these companies make it possible for robots to navigate autonomously and avoid dynamic and complex obstacles by applying motion planning algorithms to accomplish parcel delivery tasks. In short, intelligent robots gradually play a significant role in the service industry, agricultural production, manufacturing, and dangerous scenarios like nuclear radiation environments, replacing human manipulation so that the risk of injury is reduced and efficiency is improved.
Research on motion planning is going through a flourishing period, due to the development and popularity of deep learning (DL) and reinforcement learning (RL), which perform better in coping with non-linear and complex problems. Many universities, tech giants, and research groups all over the world therefore devote much importance, time, and energy to developing new motion planning techniques by applying DL algorithms or integrating traditional motion
planning algorithms with advanced machine learning (ML) algorithms. The autonomous vehicle is an example. Among tech giants, Google initiated its self-driving project named Waymo in 2016. In 2017, Tesla pledged a fully self-driving capable vehicle. An autonomous car from Baidu was successfully tested on highways near Beijing in 2017, and human-driven buses from Huawei have already been replaced by automated buses in some specific areas of Shenzhen. Other companies in traditional vehicle manufacturing, like Audi and Toyota, also have their own experimental autonomous vehicles. Among research institutes and universities, Navlab (navigation lab) at Carnegie Mellon, Oxford University and MIT are leading research institutes. Up to 2020, European countries like Belgium, France, Italy, and the UK are planning to operate transport systems for autonomous vehicles. Twenty-nine US states have passed laws permitting autonomous vehicles. Autonomous vehicles are therefore expected to spread widely in the near future with the improvement of traffic laws.
Figure 1. Three types of robotic platform. The first and second figures represent wheel-based chassis [37]. The first figure represents a car-like chassis, while the second figure represents a differential-wheel chassis. The third and fourth figures represent the four-legged dog "SpotMini" from Boston Dynamics and a robotic arm [54].
Motion planning and robotic platform:
Robots use motion planning algorithms to plan their trajectories at both the global and the local level. Human-like and dog-like robots from Boston Dynamics and the autonomous robotic car from MIT [2] are good examples. All of them leverage motion planning algorithms to enable robots to move freely in complex and dynamic scenarios, both indoor and outdoor.
The chassis of a robot has two types of wheels: car-like wheels and differential wheels (Figure 1). In robots with car-like wheels, the front two wheels are used for steering, while the rear two wheels are used for forwarding. The car-like chassis has two servos. The front two wheels share the same servo, which means these two wheels steer with the same steering angle or range φ (Fig. 1). The rear two wheels share another servo that controls the speed of the robot. Robots using differential wheels, however, are completely different from car-like robots in the functions of the servos. The chassis with differential wheels generally has two servos, and each wheel is controlled by one servo for forwarding; steering is realized by giving different speeds to each wheel. The steering range of car-like robots is limited because the two front wheels steer with the same angle φ. The car-like wheel is therefore suitable for high-speed outdoor scenarios because of its stability. Robots with differential wheels, however, can steer in an angle of 2π, which means they can change their pose arbitrarily without moving forward. A robot with differential wheels is also sensitive to the speed difference of its two wheels, which makes it flexible for low-speed indoor scenarios but dangerous in high-speed situations if something goes wrong in the speed control of the two wheels, because small speed changes of the two wheels of a differential chassis can be exaggerated and accidents follow.
It has become popular to use legs in the chassis of robots in recent years. Typical examples are the human-like and animal-like (dog-like, Fig. 1) robots from Boston Dynamics. Robotic arms (Fig. 1) are also a popular platform to deploy motion planning algorithms. In summary, wheels, arms, and legs are the choices of chassis on which motion planning algorithms are implemented, and they are widely used in academic and industrial scenarios including commercial autonomous driving, service robots, surgery robots and industrial arms.
Architecture of robotics:
The basic architecture of an automated robot can be divided into modules that include data collection, environment perception and understanding, decision making and decision execution (Fig. 2). First, data are collected from sensors like light detection and ranging (LiDAR) and cameras. Data are processed and interpreted by advanced algorithms like motion planning, path planning, lane detection and tracking algorithms in the environment perception and understanding processes. Then, decision messages are generated according to the outputs of these algorithms. Finally, these messages are parsed from digital format into analog format that can be recognized and executed by the hardware.
Figure 2. Basic architecture of robotics.
Motion planning and path planning:
The performance of motion planning directly decides whether the task of the robot succeeds or not, therefore requiring much more attention than the other functional modules in robotics.
Motion planning is an extension of path planning. They are almost the same term, but a few differences exist. For example, path planning aims at finding the path between origin and destination in the workspace by strategies like shortest distance or shortest time (Fig. 3); the path is therefore planned at the topological level. Motion planning, however, aims at generating interactive trajectories in the workspace when robots interact with a dynamic environment; therefore motion planning needs to consider the kinetic features, velocities and poses of the robot and the dynamic objects nearby (Fig. 3). In short, motion planning considers short-term optimal or suboptimal strategies where robots interact with the environment to achieve a long-term optimal motion planning strategy. Note that the workspace is the area in which an algorithm works or the task exists.
Figure 3. Path planning and motion planning. The left figure represents a path based on shortest distance and time; the path is generated at the topological level. The right figure represents the famous piano mover's problem, which not only considers planning a path at the topological level, but also considers the kinetic features, speeds and poses of the piano.
Classification of planning algorithms:
We divide motion planning algorithms into two categories: traditional algorithms and ML-based algorithms, according to their principles and the era in which they were invented. Traditional algorithms are composed of three groups: graph search algorithms, sampling-based algorithms and interpolating curve algorithms. ML-based planning algorithms are based on ML approaches that include supervised learning (e.g. support vector machine (SVM) [53]), optimal value RL and policy gradient RL. Categories of planning algorithms are summarized in Fig. 4.
Figure 4. Classification of planning algorithms.
Development of ML-based algorithms:
The development of ML-based motion planning algorithms is shown in Fig. 5. Researchers used supervised learning, like SVM, to implement simple motion planning at an earlier stage, but its performance is poor because SVM is short-sighted (one-step prediction) and it requires well-prepared vector inputs that cannot fully represent the features of image-based datasets. Significant improvement in extracting high-level features from images was made after the invention of the convolutional neural network (CNN) [34]. CNN is widely used in many image-related tasks including motion planning, but it cannot cope with complex time-sequential motion planning problems. These better suit the Markov chain [60] and long short-term memory (LSTM) [4]. Many researchers combine neural networks with LSTM or algorithms that are based on the Markov chain (e.g. Q learning [28]) to implement time-sequential motion planning tasks. However, their efficiency is limited. A breakthrough was made when Google DeepMind introduced the nature deep Q-learning network (DQN) [38-39], in which the replay buffer reuses old data to improve efficiency. Its robustness, however, is limited because of noise that impacts the estimation of the Q value. Double DQN [40][42] and dueling DQN [5] were therefore invented to cope with the noise in DQN. Double DQN utilizes another network to evaluate the estimation of the Q value in DQN to reduce noise, while the advantage value (A value) is utilized in dueling DQN to obtain a better Q value, and noise is mostly reduced. Q learning, DQN, double DQN and dueling DQN are all based on optimal values (Q value and A value) to select time-sequential actions; these algorithms are therefore called optimal value algorithms. The implementation of optimal value algorithms, however, is computationally expensive.
Optimal value algorithms were later replaced by the policy gradient method [43], in which a gradient approach [59] is directly utilized to update the policy that is used to generate optimal actions. The policy gradient method is more stable in network convergence, but it lacks efficiency in the speed of network convergence. The actor-critic algorithm [6][44] improves the speed of convergence with its actor-critic architecture. However, the improvement in convergence speed is achieved by sacrificing the stability of convergence, and it is hard to converge in earlier-stage training.
Asynchronous advantage actor-critic (A3C) [33][45], advantage actor-critic (A2C) [29][36], trust region policy optimization (TRPO) [69] and proximal policy optimization (PPO) [70] algorithms were then invented to cope with this shortcoming.
The multi-thread technique [45] is utilized in A3C and A2C to accelerate the speed of convergence, while TRPO and PPO improve the policy of the actor-critic algorithm by introducing a trust region constraint in TRPO, and a "surrogate" objective and adaptive penalty in PPO, to improve the speed and stability of convergence. Data, however, are dropped after training, and new data must therefore be collected to train the network until the network converges.
Off-policy gradient algorithms including the deterministic policy gradient (DPG) [47] and deep DPG (DDPG) [46][67] were invented to reuse data via the replay buffer of DQN. DDPG fuses the actor-critic architecture and a deterministic policy to enhance the performance of network convergence. In summary, supervised learning, optimal value RL, and policy gradient RL are the typical ML algorithms in motion planning.
Figure 5. Development of ML-based motion planning algorithms. ML-based motion planning algorithms evolve from supervised learning to optimal value RL and policy gradient RL. Supervised learning cannot address the time-sequential planning problem, but RL addresses it well. Optimal value RL suffers from slow and unstable convergence, but policy gradient RL performs better in convergence. Note that Monte-Carlo tree search (MCTS) is a traditional RL algorithm, but in this paper we place it in the group of supervised learning for convenient and clear comparisons.
In this paper, we investigate and analyze state-of-the-art ML-based algorithms to provide researchers with a comprehensive and clear understanding of the functions, structures, advantages, and disadvantages of planning algorithms. We also introduce new criteria to evaluate the performance of planning algorithms. Potential directions for making practical optimizations in motion planning algorithms are discussed as well. The contributions of this paper include (1) a general survey of traditional planning algorithms; (2) a detailed survey of supervised learning, optimal value RL and policy gradient RL for robotic motion planning; (3) analytical comparisons of these algorithms according to new evaluation criteria; (4) analysis of future directions.
This paper is organized as follows: sections II, III, IV and V describe the principles and applications of traditional planning algorithms, supervised learning, optimal value RL and policy gradient RL in robotic motion planning; section VI describes analytical comparisons of these algorithms and the criteria for performance evaluation; section VII analyzes future directions of robotic motion planning.
II. Traditional planning algorithms
Traditional planning algorithms can be divided into three groups: graph-search, sampling-based and interpolating curve algorithms. They will be described in detail in the following sections.
Graph-search-based algorithms can be divided into depth-first search, breadth-first search, and best-first search [7]. The depth-first search algorithm builds a search tree as deep and as fast as possible from origin to destination until a proper path is found. The breadth-first search algorithm shares similarities with the depth-first search algorithm by building a search tree. The search tree in the breadth-first search algorithm, however, is built by extending the tree as broadly and as quickly as possible until a proper path is found. The best-first search algorithm adds a numerical criterion (value or cost) to each node and edge in the search tree. According to that, the search process is guided by the calculation of values in the search tree to decide: (1) whether the search tree should be expanded; (2) which branch in the search tree should be extended. The process of building search trees repeats until a proper path is found. Graph search algorithms comprise many algorithms. The most popular are Dijkstra's algorithm [7] and the
A* algorithm [8].
Dijkstra's algorithm is one of the earliest optimal algorithms based on the best-first search technique to find the shortest paths among nodes in a graph. Finding the shortest paths in a road network is a typical example. The steps of the Dijkstra algorithm (Fig. 6) are as follows: (1) convert the road network to a graph, in which the distances between nodes are to be found by exploration; (2) pick the unvisited node with the lowest distance from the source node; (3) calculate the distance from the picked node to each unvisited neighbor and update the distance of all neighbor nodes if the distance through the picked node is smaller than the previous distance; (4) mark the node as visited when the calculation of distances to all its neighbors is done. The previous steps repeat until the shortest distance between origin and destination is found. Dijkstra's algorithm has two versions: the forward version and the backward version. The calculation of the overall cost in the backward version, called cost-to-come, is accomplished by estimating the minimum distance from the selected node to the destination, while the estimation of the overall cost in the forward version, called cost-to-go, is realized by estimating the minimum distance from the selected node to the initial node. In most cases, nodes are expanded based on the cost-to-go.
Figure 6. Steps of the Dijkstra algorithm (a) and road networks in web maps (b) [64-65]. Web maps are based on GPS data. The road network is mapped into a graph composed of nodes and edges, therefore graph search algorithms like the A* and Dijkstra's algorithms can be used on these graphs.
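To make the steps above concrete, the following is a minimal sketch of Dijkstra's algorithm on a weighted graph, assuming the graph is given as an adjacency dictionary; the toy road network and node names are illustrative only, not from the cited maps:

```python
import heapq

def dijkstra(graph, source, target):
    """Minimal Dijkstra sketch: graph is {node: [(neighbor, edge_cost), ...]}."""
    dist = {source: 0.0}          # best known distance from source
    parent = {source: None}       # predecessor for path reconstruction
    visited = set()
    queue = [(0.0, source)]       # priority queue ordered by distance

    while queue:
        d, node = heapq.heappop(queue)          # unvisited node with the lowest distance
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, cost in graph.get(node, []):
            new_d = d + cost
            if new_d < dist.get(neighbor, float("inf")):   # relax the edge
                dist[neighbor] = new_d
                parent[neighbor] = node
                heapq.heappush(queue, (new_d, neighbor))

    if target not in parent:                    # target unreachable
        return [], float("inf")
    path, node = [], target                     # reconstruct the shortest path
    while node is not None:
        path.append(node)
        node = parent[node]
    return list(reversed(path)), dist[target]

# toy road network: nodes A-D with symmetric edge costs
road = {"A": [("B", 1.0), ("C", 4.0)],
        "B": [("A", 1.0), ("C", 2.0), ("D", 5.0)],
        "C": [("A", 4.0), ("B", 2.0), ("D", 1.0)],
        "D": [("B", 5.0), ("C", 1.0)]}
print(dijkstra(road, "A", "D"))   # (['A', 'B', 'C', 'D'], 4.0)
```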
The A* algorithm is based on the best-first search, and it utilizes a heuristic function to find the shortest path by estimating the overall cost. The algorithm differs from Dijkstra's algorithm in the estimation of the path cost. The cost estimation of a node i in a graph by A* is as follows: (1) estimate the distance between the initial node and node i; (2) find the nearest neighbor j of node i, and estimate the distance between nodes i and j; (3) estimate the distance between node j and the goal node. The overall estimated cost is the sum of these three factors:

f(i) = g(i) + d(i, j) + h(j), (1)

where f(i) represents the overall estimated cost of node i, g(i) the estimated cost from the origin to node i, d(i, j) the estimated distance from node i to its nearest node j, and h(j) the estimated distance from node j to the goal node. The A* algorithm has a long history in path planning in robotics. A common application of the A* algorithm is mobile rover planning via an occupancy grid map (Fig. 7) using the Euclidean distance [9]; a minimal grid-based sketch is given after Fig. 7 below. There are many variants of the A* algorithm, like dynamic A* and dynamic D* [10], Field D* [11],
Theta* [12],
Anytime Repairing A* (ARA*) and
Anytime D* [13], hybrid A* [14], and
AD* [15]. Other graph search algorithms differ from the common robotic grid map in the grid they use. For example, the state lattice algorithm [16] uses a grid map with a specific shape (Fig. 7), while the grid in a common robotic map has a square-grid shape (Fig. 7).
Figure 7. The left figure represents the specific grid map of the state lattice algorithm, while the right figure represents a common square-grid (occupancy grid) map in the robot operating system (ROS).
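As a complement to Eq. (1), the following is a minimal sketch of A* on an occupancy grid, assuming a 4-connected square grid and the Euclidean distance as heuristic; the grid layout and helper names are assumptions for illustration:

```python
import heapq
import math

def astar_grid(grid, start, goal):
    """A* sketch on an occupancy grid: grid[r][c] == 1 means the cell is blocked."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])   # Euclidean heuristic
    g = {start: 0.0}                                           # cost from the start cell
    parent = {start: None}
    open_set = [(h(start), start)]                             # ordered by f = g + h

    while open_set:
        _, cell = heapq.heappop(open_set)
        if cell == goal:                                       # goal reached: backtrack
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return list(reversed(path))
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):  # 4-connected moves
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_g = g[cell] + 1.0
                if new_g < g.get((nr, nc), float("inf")):
                    g[(nr, nc)] = new_g
                    parent[(nr, nc)] = cell
                    heapq.heappush(open_set, (new_g + h((nr, nc)), (nr, nc)))
    return []   # no collision-free path found

# toy 4x4 occupancy grid (0 = free, 1 = obstacle)
occupancy = [[0, 0, 0, 0],
             [1, 1, 0, 1],
             [0, 0, 0, 0],
             [0, 1, 1, 0]]
print(astar_grid(occupancy, (0, 0), (3, 3)))
```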
Sampling-based algorithms randomly sample a fixed workspace to generate sub-optimal paths. The rapidly-exploring random tree (RRT) and the probabilistic roadmap method (PRM) are two algorithms that are commonly utilized in motion planning. The RRT algorithm is more popular and widely used for commercial and industrial purposes. It constructs a tree that attempts to explore the workspace rapidly and uniformly via a random search [17]. The RRT algorithm can consider non-holonomic constraints, such as the maximum turning radius and momentum of the vehicle [18]. An example of trajectories generated by RRT is shown in Fig. 8. The PRM algorithm [20] is normally used in a static scenario. It is divided into two phases: the learning phase and the query phase. In the learning phase, a collision-free probabilistic roadmap is constructed and stored as a graph. In the query phase, a path that connects the original and targeted nodes is searched from the probabilistic roadmap. An example of a trajectory generated by PRM is shown in Fig. 8.
Figure 8. Trajectories of RRT and PRM. The left figure represents trajectories of the RRT algorithm [19], and the right figure represents the trajectory of the PRM algorithm [20].
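A minimal 2D RRT sketch follows, assuming a holonomic point robot, a rectangular workspace and circular obstacles; all names and parameters are illustrative and not from the cited implementations:

```python
import math
import random

def rrt(start, goal, obstacles, step=0.5, iters=2000, goal_tol=0.5):
    """RRT sketch: grow a tree by steering from the nearest node toward random samples."""
    def collision_free(p):
        return all(math.hypot(p[0] - ox, p[1] - oy) > r for ox, oy, r in obstacles)

    nodes = [start]
    parent = {start: None}
    for _ in range(iters):
        sample = (random.uniform(0, 10), random.uniform(0, 10))   # uniform sample of the workspace
        nearest = min(nodes, key=lambda n: math.hypot(n[0] - sample[0], n[1] - sample[1]))
        theta = math.atan2(sample[1] - nearest[1], sample[0] - nearest[0])
        new = (nearest[0] + step * math.cos(theta), nearest[1] + step * math.sin(theta))
        if not collision_free(new):
            continue                                              # discard colliding extensions
        nodes.append(new)
        parent[new] = nearest
        if math.hypot(new[0] - goal[0], new[1] - goal[1]) < goal_tol:
            path, n = [], new                                     # goal reached: backtrack
            while n is not None:
                path.append(n)
                n = parent[n]
            return list(reversed(path))
    return []                                                     # no path within the budget

# workspace 10x10 with two circular obstacles (x, y, radius)
path = rrt(start=(1.0, 1.0), goal=(9.0, 9.0), obstacles=[(5.0, 5.0, 1.5), (7.0, 3.0, 1.0)])
print(len(path), "waypoints")
```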
The interpolating curve algorithm is defined as a process that constructs or inserts a set of mathematical rules to draw trajectories. The interpolating curve algorithm is based on techniques (e.g. computer aided geometric design (CAGD)) to draw a smooth path. Mathematical rules are used for path smoothing and curve generation. Typical path smoothing and curve generation rules include the line and circle [21], clothoid curves [22], polynomial curves [23],
Bezier curves [24] and spline curves [25]. Examples of trajectories are shown in Fig. 9.
Figure 9. Interpolating curve algorithms generated by mathematical rules [21-25].
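As an illustration of curve generation by a mathematical rule, here is a small sketch that evaluates a cubic Bezier curve from four control points; the control points are invented for the example:

```python
def cubic_bezier(p0, p1, p2, p3, samples=20):
    """Evaluate B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3 at discrete t."""
    points = []
    for k in range(samples + 1):
        t = k / samples
        b0, b1 = (1 - t) ** 3, 3 * (1 - t) ** 2 * t
        b2, b3 = 3 * (1 - t) * t ** 2, t ** 3
        x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
        y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
        points.append((x, y))
    return points

# smooth a sharp corner between two straight segments with a Bezier arc
waypoints = cubic_bezier((0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 3.0))
print(waypoints[0], waypoints[-1])   # the curve starts at p0 and ends at p3
```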
III. Supervised learning
Here we present the basic principles of four pervasive supervised learning algorithms for motion planning. These include SVM, LSTM, MCTS and CNN.
SVM [53] is a well-known supervised learning algorithm for classification. The basic principle of SVM is to draw an optimal separating hyperplane between inputted data by training a maximum margin classifier [53]. Inputted data are in the form of vectors that are mapped into a high-dimensional space where classified vectors are obtained by the trained classifier. SVM is used in 2-class classification, which does not suit real-world tasks, but its variant multiclass SVM (MSVM) [71] works.
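A minimal multiclass-SVM sketch using scikit-learn is given below; the steering classes and toy feature vectors are invented for illustration and are not from the cited work:

```python
from sklearn.svm import SVC

# toy feature vectors (e.g. distances to obstacles on the left/front/right)
# and steering labels: 0 = turn left, 1 = go straight, 2 = turn right
features = [[0.2, 2.0, 1.5], [1.8, 2.5, 1.7], [1.6, 2.2, 0.3],
            [0.4, 1.9, 1.4], [1.5, 2.8, 1.6], [1.7, 2.0, 0.2]]
labels = [2, 1, 0, 2, 1, 0]

# RBF-kernel SVM; scikit-learn handles the multiclass case internally (one-vs-one scheme)
classifier = SVC(kernel="rbf", C=1.0, gamma="scale")
classifier.fit(features, labels)

# one-step prediction for a new observation (this is what makes SVM "short-sighted")
print(classifier.predict([[0.3, 2.1, 1.6]]))   # expected: turn right (2)
```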
LSTM [72][4] is a variant of the recurrent neural network (RNN). LSTM can remember inputted data (vectors) in its cells. Because of the limited storage capacity of a cell, part of the data is dropped when cells are updated with past and new data, and the remaining part is remembered and transferred to the next time step. These functions in the cells are achieved by neural networks as described in Fig. 10. In robotic motion planning, the robot's features and labels in each time step are fed into the neural networks in the cells for training, and decisions for motion planning are then made by running the trained network.
Figure 10. Cells of LSTM that are implemented using neural networks [73]. C_t denotes the cell's state in time step t. h_t denotes the output that will be transferred to the next state as its input, therefore the format of the input is the vector [h_{t-1}, x_t]. Cell states are controlled and updated by 3 gates (forget gate, input gate and output gate) that are implemented using neural networks with weights W_f, W_i, and W_o respectively.
MCTS is the combination of the Monte-Carlo method [75] and the search tree [76]. MCTS is widely used in games (e.g. Go and chess) for motion prediction [74][3]. The mechanism of MCTS is composed of 4 processes: selection, expansion, simulation, and backpropagation (Fig. 11). In robotic motion planning, a node of MCTS represents a possible state of the robot and stores the state value of the robot in each step. First, selection is made to choose some possible nodes in the tree based on known state values. Second, the tree expands to an unknown state by the tree policy (e.g. random search). Third, a simulation is made from the newly expanded node by the default policy (e.g. random search) until the terminal state of the robot, and a reward R is obtained. Finally, backpropagation is made from the newly expanded node to the root node, and the state values in these nodes are updated according to the received reward. These four processes repeat until the state values in the tree converge, and the robot can then plan its motion according to the state values in the tree. MCTS fits discrete-action tasks (e.g. AlphaGo [74]), and it also fits time-sequential tasks like autonomous driving.
Figure 11. 4 processes of MCTS. These processes repeat until the convergence of the state values in the tree.
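The four MCTS processes can be sketched as follows; this is a generic UCT-style skeleton under an assumed environment interface (env.actions, env.step, env.is_terminal, env.reward), not the implementation used in [74]:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # action -> child Node
        self.visits, self.value = 0, 0.0

def uct_select(node, c=1.4):
    """Selection: pick the child maximizing the UCB score."""
    return max(node.children.values(),
               key=lambda ch: ch.value / (ch.visits + 1e-9)
               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)))

def mcts(env, root_state, iterations=500):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. selection: descend while the node is fully expanded
        while node.children and len(node.children) == len(env.actions(node.state)):
            node = uct_select(node)
        # 2. expansion: add one untried action as a new child
        untried = [a for a in env.actions(node.state) if a not in node.children]
        if untried and not env.is_terminal(node.state):
            action = random.choice(untried)
            node.children[action] = Node(env.step(node.state, action), parent=node)
            node = node.children[action]
        # 3. simulation: roll out with a random default policy until a terminal state
        state = node.state
        while not env.is_terminal(state):
            state = env.step(state, random.choice(env.actions(state)))
        reward = env.reward(state)
        # 4. backpropagation: update state values from the new node up to the root
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # act greedily with respect to the visit counts of the root's children
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```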
CNN [34] became a research focus of ML after LeNet5 [34] was introduced and successfully applied to handwritten digit recognition. CNN is one of the essential types of neural network because it is good at extracting high-level features from high-dimensional, high-resolution images via convolutional layers. CNN makes the robot avoid obstacles and plans the motions of the robot according to human experience with models trained in the forward propagation and back propagation processes, especially the back propagation. In the back propagation, a model with a weight matrix/vector θ is updated to record the features of obstacles. Note that θ = {(w_i, b_i)}, i = 1, ..., L, where w and b represent the weight and bias, i represents the serial number of the w-b pairs, and L represents the number of w-b pairs (the length of the weight).
The training steps of CNN are shown in Fig. 12. Images of objects (obstacles) are used as the inputs of CNN. Outputs are probability distributions obtained by the Softmax function [58].
The loss value is the cross-entropy (CE), which is obtained by

L_CE = − Σ_i q_i log p_i, (2)

where p denotes the probability distribution of the output (observed real value), q represents the probability distribution of the expectation, and i represents the serial number of each batch of images in training. The loss function measures the difference (distance) between the observed real value p and the expected value q. The mean-square error (MSE) is an alternative to CE, and MSE is defined by L_MSE = (1/n) Σ_i (y_i − ŷ_i)², where y_i represents the observed values and ŷ_i the predicted values or expectations. The weight is updated in the optimizer by minimizing the loss value using the gradient descent approach [59], therefore the new weight is obtained by

w_{i+1} = w_i − η∇_w L, (3)

where w represents the weight, η represents a learning rate (0 < η < 1) and i represents the serial number of each batch of images in training. Improved variants of CNN are also widely used in motion planning, e.g. residual networks [35][49].
Figure 12. Training steps of CNN. The trajectory is planned by a human during data collection, in which the steering angles of the robot are recorded as the labels of the training data. The robot learns behavior strategies in training and moves along the planned trajectory in the test. The Softmax function maps feature values to probabilities between 0 and 1. The optimizer represents a gradient descent approach, e.g. stochastic gradient descent (SGD) [59].
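A minimal PyTorch sketch of this training step (Eqs. 2-3) is shown below; the network size, image resolution and steering classes are assumptions for illustration, not the architecture of the cited works:

```python
import torch
import torch.nn as nn

# small CNN that maps a camera image to a distribution over steering classes
class SteeringCNN(nn.Module):
    def __init__(self, num_actions=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(32 * 4 * 4, num_actions)

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(1))           # logits; Softmax is applied inside the loss

model = SteeringCNN()
criterion = nn.CrossEntropyLoss()                # cross-entropy loss of Eq. (2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent of Eq. (3)

# one training step on a dummy batch (8 RGB images of 60x60, labels = human steering classes)
images = torch.randn(8, 3, 60, 60)
labels = torch.randint(0, 3, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()                                  # back propagation of the loss gradient
optimizer.step()                                 # weight update w <- w - eta * grad
print(float(loss))
```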
IV. Optimal value RL
Here we first introduce basic concepts in RL, and then introduce the principles of Q learning, nature DQN, double DQN and dueling DQN.
Supervised learning algorithms like CNN are competent only in static obstacle avoidance by one-step prediction, and therefore cannot cope with time-sequential obstacle avoidance. RL algorithms, e.g. optimal value RL, fit time-sequential tasks. Typical examples of these algorithms include Q learning, nature DQN, double DQN and dueling DQN. Motion planning is realized by attaching a big reward (numerical value) to the destination and safe paths, while obstacles are attached with penalties (negative reward). The optimal path is found according to the total rewards from the initial place to the destination. To better understand optimal value RL, it is necessary to recall several fundamental concepts: the Markov chain, Markov decision process (MDP), model-based dynamic programming, model-free RL, Monte-Carlo method (MC), temporal difference method (TD), and state-action-reward-state-action (SARSA). MDP is based on the Markov chain [60], and it can be divided into two categories: model-based dynamic programming and model-free RL. Model-free RL can be divided into MC and TD, the latter of which includes the SARSA and Q learning algorithms. The relationship of these concepts is shown in Fig. 13.
Markov chain:
A variable set X = {X_0, X_1, ..., X_t} is called a Markov chain [60] if X meets

P(X_{t+1} | X_0, X_1, ..., X_t) = P(X_{t+1} | X_t). (4)

This means the occurrence of event X_{t+1} depends only on event X_t and has no correlation with any earlier events.
Markov decision process:
MDP [60] is a sequential decision process based on the Markov chain. This means the state and action of the next step depend only on the state and action of the current step. MDP is described as a tuple (S, A, P, R). S represents the state and here refers to the states of the robot and obstacles. A represents an action taken by the robot. State S transits into another state under a state-transition probability P, and a reward R from the environment is obtained. The principle of MDP is shown in Fig. 13. First, the robot in state s interacts with the environment and generates an action a based on the policy π: s → a. The robot then obtains the reward r from the environment, and the state transits into the next state s'. Reaching the next state s' marks the end of one loop and the start of the next loop.
Model-free RL and model-based dynamic programming:
Problems in MDP can be solved using model-based dynamic programming and model-free RL methods. Model-based dynamic programming is used in a known environment, while model-free RL is utilized to solve problems in an unknown environment.
Temporal difference and Monte Carlo methods:
The model-free RL includes MC and TD. A sequence of actions is called an episode. Given an episode <S_0, A_0, R_1, S_1, A_1, R_2, ..., S_t, A_t, R_{t+1}, ..., S_T>, the state value V(s) in time step t is defined as the expectation of the overall rewards G_t by

V(s) = E[G_t | S_t = s] = E[R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s], (5)

where γ represents a discount factor (0 ≤ γ ≤ 1). MC uses G_t to update its state value V_MC(s) by

V(S_t) ← V(S_t) + α(G_t − V(S_t)), (6)

where "←" represents the update process in which the new value replaces the previous value and α is a learning rate. TD uses R_{t+1} + γV(S_{t+1}) to update its state value by

V(S_t) ← V(S_t) + α(R_{t+1} + γV(S_{t+1}) − V(S_t)), (7)

where α is a learning rate and R_{t+1} + γV(S_{t+1}) is the TD target, in which the estimated state value V(S_{t+1}) is obtained by the bootstrapping method [56]. This means MC updates its state value after the termination of an episode, while TD updates its state value in every step. The TD method is therefore more efficient than MC in the state value update.
Figure 13. (a) represents the relationship of the basic concepts. (b) represents the principle of MDP.
Q learning: TD includes SARSA [55] and Q learning [28][66]. Given an episode <S_0, A_0, R_1, S_1, A_1, R_2, ..., S_t, A_t, R_{t+1}, ..., S_T>, SARSA and Q learning use the ε-greedy method [57] to select an action A_t at time step t. There are two differences between SARSA and Q learning: (1) SARSA uses ε-greedy again to select an estimated action value Q(S_{t+1}, A_{t+1}) at time step t+1 to update its action value by

Q(S_t, A_t) ← Q(S_t, A_t) + α(R_{t+1} + γQ(S_{t+1}, A_{t+1}) − Q(S_t, A_t)), (8)

while Q learning directly uses the maximum estimated action value max_a Q(S_{t+1}, a) at time step t+1 to update its action value by

Q(S_t, A_t) ← Q(S_t, A_t) + α(R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)); (9)

(2) SARSA adopts the selected action A_{t+1} directly to update its next action value, but the Q learning algorithm uses ε-greedy to select a new action to update its next action value. SARSA uses the ε-greedy method to sample all potential action values of the next step and eventually selects a "safe" action, while Q learning pays attention to the maximum estimated action value of the next step and eventually selects optimal actions. The steps of SARSA are shown in Algorithm 1 [66], and the Q learning algorithm in Algorithm 2 [66] and Fig. 14. Implementations of robotic motion planning by Q learning can be found in [28][30][50].
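A minimal tabular Q-learning sketch of the update rule in Eq. (9) follows, assuming a small environment interface (env.reset, env.step returning (next_state, reward, done), env.num_actions) that is an assumption for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning sketch implementing the update rule of Eq. (9)."""
    q = defaultdict(lambda: [0.0] * env.num_actions)   # Q table: state -> action values

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.randrange(env.num_actions)
            else:
                action = max(range(env.num_actions), key=lambda a: q[state][a])

            next_state, reward, done = env.step(action)

            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            td_target = reward + gamma * max(q[next_state]) * (not done)
            q[state][action] += alpha * (td_target - q[state][action])
            state = next_state
    return q
```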
Algorithm 1:
SARSA
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Choose a from s using policy derived from Q (e.g. ε-greedy)
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        Q(s, a) ← Q(s, a) + α[r + γQ(s', a') − Q(s, a)]
        s ← s'; a ← a'
    until s is terminal
Algorithm 2:
Q-learning
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g. ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
Note that s' and a' denote S_{t+1} and A_{t+1} respectively.
Figure 14. Steps of the Q learning algorithm. The input of Q learning is normally in vector format. The Q value is obtained via a Q value table or a network as approximator. Extra preprocessing is needed to extract features from the image if the input is in image format.
DQN [38] is a combination of Q learning and a deep neural network (e.g. CNN). DQN uses CNN to approximate Q values with its weight θ. Hence, the Q table in Q learning changes to a Q value network that can converge faster for complex motion planning. DQN became a research focus when it was invented by Google DeepMind [38][39], and the performance of DQN approximates or even surpasses the performance of human beings in Atari games (e.g. Pac-man and Enduro in Fig. 15) and real-world motion planning tasks [31][51]. DQN utilizes CNN to approximate Q values (Fig. 16) by

Q(s, a; θ) ≈ Q*(s, a). (10)

In contrast with Q learning, DQN features 3 components: CNN, the replay buffer [41] and the targeted network. CNN extracts features from the images used as its inputs. The outputs can be the Q value of the current state Q(s, a) and the Q value of the next state Q(s', a'), therefore experiences <s, a, r, s'> are obtained and temporarily stored in the replay buffer. This is followed by training DQN using the experiences in the replay buffer. In this process, a targeted network is leveraged to minimize the loss value

L(θ) = E[(r + γ max_{a'} Q(s', a'; θ') − Q(s, a; θ))²]. (11)

The loss value measures the distance between the expected value and the real value. In DQN, the expected value is (r + γ max_{a'} Q(s', a'; θ')), which is similar to the labels in supervised learning, while Q(s, a; θ) is the observed real value. The targeted network and the Q value network initially share the same weight θ. The difference is that the weight of the Q value network θ is updated in each step, while the weight of the targeted network θ' is updated over a long period of time. Hence, θ is updated frequently and θ' is more stable. It is necessary to keep the targeted network stable, otherwise the Q value network will be hard to converge. Detailed steps of DQN are shown in Algorithm 3 [38] and Fig. 17.
Algorithm 3:
Deep Q-learning with experience replay
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    for t = 1, T do
        With probability ε select a random action a_t, otherwise select a_t = max_a Q(φ(s_t), a; θ)
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j for terminal φ_{j+1}, or y_j = r_j + γ max_{a'} Q(φ_{j+1}, a'; θ) for non-terminal φ_{j+1}
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))²
    end for
end for
Figure 15. Two examples of motion planning in early-stage arcade games: Enduro (left) and Pac-man (right).
Figure 16. Q(s, a_1), Q(s, a_2), Q(s, a_3) and Q(s, a_t) denote the Q values of all potential actions.
Figure 17. Steps of the DQN algorithm.
Noise in DQN leads to bias, and false selection of the next action follows, therefore leading to over-estimation of the next action value Q(s', a'). To reduce the over-estimation caused by noise, researchers invented the double DQN [40], in which another independent targeted network with weight θ' is introduced to evaluate the selected action a'. Hence, the target of the targeted network changes from

y = r + γ max_{a'} Q(s', a'; θ') to y = r + γ Q(s', argmax_{a'} Q(s', a'; θ); θ'). (12)

The steps of double DQN are the same as those of DQN. Examples of application are [26][42][48], in which double DQN is used in games and the robot operating system (ROS).
The state value V^π(s) measures "how good the robot is" in the state s, where π denotes the policy π: s → a, while the action value Q^π(s, a) denotes "how good the robot is" after the robot takes action a in state s using policy π. The advantage value (A value) denotes the difference between Q^π(s, a) and V^π(s) by

A^π(s, a) = Q^π(s, a) − V^π(s), (13)

therefore the A value measures "how good the action a is" in state s if the robot takes action a. In the neural network case (Fig. 18), weights α, β, θ are added, therefore

Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α), (14)

where θ is the weight of the neural network, and it is the shared weight of the Q, V and A values. Here α denotes the weight of the A value, and β the weight of the V value. V(s; θ, β) is a scalar, and A(s, a; θ, α) is a vector. There are, however, too many V-A value pairs if the Q value is simply divided into two components, and only one V-A pair is qualified. Thus, it is necessary to constrain the V value or the A value to obtain a fixed V-A pair. According to the relationship of V^π(s) and Q^π(s, a), where V^π(s) = E_{a~π}[Q^π(s, a)], the expectation of the A value is

E_{a~π}[A^π(s, a)] = 0. (15)

Eq. 15 can be used as a rule to constrain the A value for obtaining a stable V-A pair. The expectation-constrained advantage value is obtained by subtracting from A(s, a; θ, α) the mean A value over all actions, therefore

A'(s, a; θ, α) = A(s, a; θ, α) − (1/|A_t|) Σ_{a'∈A_t} A(s, a'; θ, α), (16)

where A_t represents the action space in time step t, |A_t| the number of actions, and a' one of the actions in A_t in time step t. The expectation of the A value keeps zero, E_{a~π}[A'(s, a)] = 0, despite the fluctuation of A(s, a; θ, α) over different action choices. Researchers use this expectation-constrained A value to replace the current A value by

Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A_t|) Σ_{a'∈A_t} A(s, a'; θ, α)). (17)

Thus, a stable V-A pair is obtained, although the original semantic definition of the A value (Eq. 13) is changed [5]. In other words: (1) the advantage constraint E_{a~π}[A'(s, a)] = 0 is used to find the state advantage value; (2) the action value is obtained by Q = V + A' (this can be looked at as a constraint) under E_{a~π}[A'(s, a)] = 0. Hence, the 2 constraints lead to a better estimation of the action value.
DQN obtains the action value Q(s, a; θ) directly by using a network to approximate the action value. This process introduces over-estimation of the action value. Dueling DQN obtains the action value by constraining the advantage value A(s, a; θ, α). Finally, 3 weights (θ, α, β) are obtained after training, and the Q value network has less bias; the advantage value is better than the action value at representing "how good the action is" (Fig. 19).
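A small PyTorch sketch of the dueling aggregation in Eq. (17) is given below; the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling DQN head: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)), as in Eq. (17)."""
    def __init__(self, feature_dim=128, num_actions=4):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)                 # scalar V(s; theta, beta)
        self.advantage = nn.Linear(feature_dim, num_actions)   # vector A(s, a; theta, alpha)

    def forward(self, features):
        v = self.value(features)                       # shape (batch, 1)
        a = self.advantage(features)                   # shape (batch, num_actions)
        return v + (a - a.mean(dim=1, keepdim=True))   # broadcast V over the actions

# features would normally come from a shared convolutional trunk with weight theta
q_values = DuelingHead()(torch.randn(2, 128))
print(q_values.shape)   # torch.Size([2, 4])
```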
Further optimizations are the distributional DQN [61], noise network [62], dueling double DQN [77] and rainbow model [63]. Distributional DQN is like the dueling DQN, as noise is reduced by optimizing the architecture of DQN. The noise network improves the ability in exploration with a more exquisite and smooth approach. Dueling double DQN and the rainbow model are hybrid algorithms. The rainbow model fuses several high-performance algorithms as components, including double networks, replay buffer, dueling network, multi-step learning, distributional network, and noise network.
Figure 18. The architecture of dueling DQN, in which the Q value Q(s, a) is decoupled into two parts: the V value V(s) and the A value A(s, a).
Figure 19. Q(s, a) and A(s, a) saliency maps (red-tinted overlay) on the Atari game (Enduro). Q(s, a) learns to pay attention to the road, but pays less attention to obstacles in front. A(s, a) learns to pay much attention to dynamic obstacles in front [5].
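To close this section, the following PyTorch sketch shows the two DQN components discussed above (replay buffer and target network) in one training step; buffer size, batch size and network shapes are assumptions for illustration:

```python
import random
from collections import deque

import torch
import torch.nn as nn

buffer = deque(maxlen=10000)                    # replay buffer of <s, a, r, s', done> tuples
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # theta' starts as a copy of theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# during interaction, experiences are appended as:
# buffer.append((state, action, reward, next_state, done))

def train_step(batch_size=32):
    """One DQN update: sample from the replay buffer and regress onto the target of Eq. (11)."""
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():                                          # target uses the frozen theta'
        target = r + gamma * target_net(s2).max(dim=1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """The target network is synchronized only occasionally, keeping theta' stable."""
    target_net.load_state_dict(q_net.state_dict())
```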
V. Policy gradient RL
Here we first introduce the policy gradient method and the actor-critic algorithm, and then introduce their optimized variants: (1) A3C and A2C; (2) DPG and DDPG; (3) TRPO and PPO.
Optimal value RL uses a neural network to approximate optimal values, which are then used to select actions indirectly. This process can be simplified as s → Q(s, a; θ) → a. Noise leads to over-estimation of Q(s, a; θ), therefore the selected actions are suboptimal, and the network is hard to converge. The policy gradient algorithm uses a neural network as the policy π: s → a to directly select actions and avoid this problem. Brief steps of the policy gradient algorithm are shown in Fig. 20.
Figure 20. Training and test steps of policy gradient algorithms. In the training, trajectories are generated by the behavior policy. Note that the policy is divided into the behavior policy and the target policy. The behavior policy selects actions for training and will not be updated, while the target policy is also used to select actions but will be updated in training. "Policy" normally refers to the target policy. Robots learn trajectories via the target policy (neural network as approximator) and a trained policy is obtained. In the test, optimal actions are generated directly by the trained policy until the destination is reached.
The policy is a probability distribution P{a|s, θ} = π_θ(a|s) = π(a|s, θ) that is used to select action a in state s, where the weight θ is a parameter matrix that is used as an approximation of the policy π(a|s). The policy gradient method (PG) [43] seeks an optimal policy and uses it to find optimal actions. How is this optimal policy found? Given a trajectory τ = (s_1, a_1, ..., s_T, a_T), the probability of outputting the actions in τ is π_θ(τ) = p(s_1) Π_t π_θ(a_t|s_t) p(s_{t+1}|s_t, a_t). The aim of PG is to find the optimal parameter θ* = argmax_θ E_{τ~π_θ(τ)}[R(τ)], where the trajectory reward R(τ) = Σ_t r_t is the overall reward in the trajectory τ. The objective of PG is defined as the expectation over trajectories by

J(θ) = E_{τ~π_θ(τ)}[R(τ)]. (18)

To find a higher expectation of reward, the gradient operation is used on the objective to find the increment of the network that leads to a better policy. The increment of the network is the gradient of the objective, that is

∇J(θ) = E_{τ~π_θ(τ)}[∇ log π_θ(τ) R(τ)] = E_{τ~π_θ(τ)}[(Σ_t ∇ log π_θ(a_t|s_t)) R(τ)]. (19)

An example of PG is Monte-Carlo reinforce [68]. Data for training are generated from simulation by the stochastic policy π_θ(a|s). The previous objective and its gradient (Eq. 18-19) are replaced by

J(θ) ≈ (1/N) Σ_{n=1}^{N} Σ_{t=1}^{T} r(s_{n,t}, a_{n,t}), (20)

∇J(θ) ≈ (1/N) Σ_{n=1}^{N} (Σ_{t=1}^{T} ∇ log π_θ(a_{n,t}|s_{n,t})) (Σ_{t=1}^{T} r(s_{n,t}, a_{n,t})), (21)

where N is the number of trajectories and T the length of a trajectory. A target policy is used to generate trajectories for training. For example, the Gaussian distribution function is used as the target policy to select actions by a ~ N(f(s), σ²). A network is then used to approximate the expectation of the Gaussian distribution, f(s), where the network consists of the weights w and biases b of its L layers. Its objective uses log π_θ(a_t|s_t) = −(f(s_t) − a_t)²/(2σ²) + const, therefore the gradient of the log-policy is

∇ log π_θ(a_t|s_t) = ((a_t − f(s_t))/σ²) ∇f(s_t), (22)

where ∇f(s_t) is obtained by backward propagation. According to Eq. 21-22, the objective gradient is

∇J(θ) ≈ (1/N) Σ_{n=1}^{N} (Σ_{t=1}^{T} ((a_{n,t} − f(s_{n,t}))/σ²) ∇f(s_{n,t})) (Σ_{t=1}^{T} r(s_{n,t}, a_{n,t})). (23)

Once the objective gradient is obtained, the network is updated by the gradient ascent method. That is

θ ← θ + α∇J(θ). (24)

The update of the policy in PG is based on the expectation of multi-step rewards in the trajectory, E_{τ~π_θ(τ)}[R(τ)]. This leads to high variance that causes a low speed of network convergence, although the convergence stability is improved. The actor-critic algorithm (AC) [6][32][44] reduces the variance by using the one-step reward in the TD-error δ for the network update. The TD-error is defined by

δ = r + γV(s') − V(s). (25)

To enhance the convergence speed, AC uses the actor-critic architecture that includes an actor network (policy network) and a critic network. The critic network with weight v is used in the TD-error to approximate the state value by

V_v(s) ≈ V^π(s). (26)

The objective of the critic network is defined by

J(v) = E[(r + γV_v(s') − V_v(s))²]. (27)

The objective gradient is obtained from this mean-square error as the semi-gradient

∇J(v) = E[δ ∇_v V_v(s)], (28)

and the critic network is updated by

v ← v + β∇J(v), (29)

where β represents the learning rate of the critic network. The objective of the policy network is defined by

J(θ) = E[log π_θ(a|s) δ]. (30)

Hence, the objective gradient of the policy network is obtained by

∇J(θ) = E[∇ log π_θ(a|s) δ], (31)

and the policy network is updated by

θ ← θ + α∇J(θ), (32)

where α is the learning rate of the actor network. The detailed steps of AC are as in Fig. 21: (1) the action at time step t is selected by the policy network; (2) the selected action is executed, the reward is obtained, and the state transits into the next state; (3) the state value is obtained by the critic network and the TD error is obtained; (4) the policy network is updated according to its objective gradient (Eq. 31); (5) the critic network is updated according to the objective gradient of the critic network. This process repeats until the convergence of the policy and critic networks.
Figure 21. Training steps of AC.
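A compact PyTorch sketch of one actor-critic update (Eqs. 25-32) for a discrete action space follows; the network sizes and the state/action shapes are assumptions for illustration:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits of pi(a|s)
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # V(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def ac_update(state, action, reward, next_state, done):
    """One actor-critic step: the TD error (Eq. 25) drives both the critic and the actor."""
    s = torch.tensor(state, dtype=torch.float32)
    s2 = torch.tensor(next_state, dtype=torch.float32)

    v, v2 = critic(s), critic(s2).detach()
    td_error = reward + gamma * v2 * (1.0 - done) - v       # delta = r + gamma*V(s') - V(s)

    # critic: minimize the squared TD error (Eqs. 27-29)
    critic_opt.zero_grad()
    (td_error ** 2).mean().backward()
    critic_opt.step()

    # actor: ascend log pi(a|s) * delta (Eqs. 30-32); delta is treated as a constant
    log_prob = torch.log_softmax(actor(s), dim=-1)[action]
    actor_opt.zero_grad()
    (-log_prob * td_error.detach()).mean().backward()
    actor_opt.step()

ac_update(state=[0.1, 0.0, -0.2, 0.3], action=1, reward=1.0,
          next_state=[0.2, 0.1, -0.1, 0.2], done=0.0)
```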
In contrast to AC, A3C [2] has three features: (1) multi-thread computing; (2) multi-step rewards; (3) policy entropy. Multi-thread computing means multiple interactions with the environment to collect data and update the networks. Multi-step rewards are used in the critic network, therefore the TD-error δ of A3C is obtained by

δ = Σ_{i=0}^{n−1} γ^i r_{t+i} + γ^n V(s_{t+n}) − V(s_t), (33)

and the speed of convergence is thus improved. Here γ is a discount factor, and n is the number of steps. Data collection by the policy π_θ(a|s) causes over-concentration, because the initial policy has poor performance and actions are therefore selected from a small area of the workspace. This causes poor-quality inputs, and the convergence speed of the network is therefore poor. Policy entropy increases the ability of the policy in action exploration to reduce over-concentration. The objective gradient of A3C therefore changes to

∇J(θ) = E[∇ log π_θ(a|s) A(s, a)] + β∇H(π_θ), (34)

where β is a discount factor and H(π_θ) is the policy entropy.
A2C:
A2C [29] is an alternative to the A3C algorithm. Each thread in the A3C algorithm can be utilized to collect data, train the critic and policy networks, and send the updated weights to the global model. Each thread in A2C, however, can only be used to collect data. Weights in A2C are updated synchronously, in contrast to the asynchronous update of A3C, and experiments demonstrate that the synchronous update of weights is better than the asynchronous one [36][45]. Their mechanisms of weight update are shown in Fig. 22.
Figure 22. The weight update processes of A3C and A2C.
Here we first introduce the prerequisites: on-policy algorithm, off-policy algorithm, importance-sampling ratio, and stochastic policy gradient algorithm, and then introduce DPG and DDPG.
Prerequisites: in the data generation and training processes, if the behavior policy and the target policy are the same policy π_θ, these algorithms are called on-policy algorithms. On-policy algorithms, however, may lead to low-quality data in data generation and a slow speed of network convergence. This problem can be reduced by using one policy (behavior policy) β_θ(a|s) for data generation and another policy (target policy) π_θ(a|s) for learning and making decisions. Algorithms using different policies for data generation and learning are therefore called off-policy algorithms. Although the policies in an off-policy algorithm are different, their relationship can still be measured by the importance-sampling ratio, defined by

ρ(s, a) = π_θ(a|s) / β_θ(a|s). (35)

The importance-sampling ratio measures the similarity of the two policies. These policies must have a large similarity by the definition of importance sampling. Particularly, the behavior policy β_θ is the same as the policy π_θ in on-policy algorithms. This means β_θ(a|s) = π_θ(a|s) and ρ(s, a) = 1.
In the on-policy policy gradient algorithm (e.g. PG), the objective is defined as

J(θ) = E_{s~ρ^π, a~π_θ}[r(s, a)] = ∫_S ρ^π(s) ∫_A π_θ(a|s) r(s, a) da ds, (36)

where ρ^π(s) is the distribution of the states under policy π. The objective gradient of PG, ∇J(θ) = E_{τ~π_θ(τ)}[Σ_t ∇ log π_θ(a_t|s_t) R(τ)], includes a vector C = ∇ log π_θ(a_t|s_t) and a scalar R = R(τ). The vector C is the direction of the policy update, while the scalar R is the range of this update. Hence, the scalar R acts as a critic that decides how the policy is updated. The action value is defined as the expectation of discounted rewards by

Q^π(s, a) = E[Σ_{k=0}^{∞} γ^k r_{t+k} | S_t = s, A_t = a]. (37)

Q^π(s, a) is an alternative to the scalar R, and it is better than R as a critic. Therefore, the objective gradient of PG changes to

∇J(θ) = E_{s~ρ^π, a~π_θ}[∇ log π_θ(a|s) Q^π(s, a)], (38)

and the policy is updated using the objective gradient with the action value Q^π(s, a). Hence, algorithms are called stochastic policy gradient algorithms if the action value Q^π(s, a) is used as the critic.
DPG:
DPG comprises algorithms that train a deterministic policy μ_θ(s) to select actions, instead of the stochastic policy π_θ(a|s) in AC. A policy is a deterministic policy if it maps the state s to an action a directly, μ: s → a, while a stochastic policy maps the state and action to a probability π(a|s) [47]. The update of the deterministic policy is defined as

μ^{k+1}(s) = argmax_a Q^{μ^k}(s, a). (39)

If a network θ is used as the approximator of the deterministic policy, the update of the network changes to

θ^{k+1} = θ^k + α E_{s~ρ^μ}[∇_θ Q^{μ^k}(s, μ_θ(s))] = θ^k + α E_{s~ρ^μ}[∇_θ μ_θ(s) ∇_a Q^{μ^k}(s, a)|_{a=μ_θ(s)}]. (40)

There are small changes in the state distribution ρ^μ of the deterministic policy during the update of the network, but this change does not impact the update of the network. Hence, the network of the deterministic policy is updated by

θ^{k+1} = θ^k + α E_{s~ρ^μ}[∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)}], (41)

because

∇J(θ) = E_{s~ρ^μ}[∇_θ μ_θ(s) ∇_a Q^μ(s, a)|_{a=μ_θ(s)}]. (42)

Once Q^μ(s, a) is obtained, μ_θ(s) can be updated after obtaining the objective gradient. How is Q^μ(s, a) found? Note that the discounted reward Q^π(s, a) is the critic in the stochastic policy gradient mentioned before. If a network w is used as the approximator, Q^μ(s, a) is obtained by

Q^w(s, a) ≈ Q^μ(s, a). (43)

The stochastic policy gradient algorithm here includes 2 networks, in which w is the critic that approximates the action value and θ is the actor used to select actions in the test (actions are selected by the behavior policy β in training). The stochastic policy gradient in this case is called off-policy deterministic actor-critic (OPDAC) or OPDAC-Q. The objective gradient of OPDAC therefore changes from Eq. 42 to

∇J_β(θ) = E_{s~ρ^β}[∇_θ μ_θ(s) ∇_a Q^w(s, a)|_{a=μ_θ(s)}], (44)

where β represents the behavior policy. The 2 networks are updated by

δ_t = r_t + γQ^w(s_{t+1}, μ_θ(s_{t+1})) − Q^w(s_t, a_t), (45)
w_{t+1} = w_t + α_w δ_t ∇_w Q^w(s_t, a_t), (46)
θ_{t+1} = θ_t + α_θ ∇_θ μ_θ(s_t) ∇_a Q^w(s_t, a_t)|_{a=μ_θ(s)}. (47)

However, if no constraint is applied to the network w in the approximation process, a large bias results. How can a Q^w(s, a) without bias be obtained? Compatible function approximation (CFA) can eliminate the bias by adding two requirements on w (the proof is given in [47]): (1) ∇_a Q^w(s, a)|_{a=μ_θ(s)} = ∇_θ μ_θ(s)^T w; (2) w minimizes the mean-square error between ∇_a Q^w(s, a)|_{a=μ_θ(s)} and ∇_a Q^μ(s, a)|_{a=μ_θ(s)}. In other words, Q^w(s, a) should meet

Q^w(s, a) = (a − μ_θ(s))^T ∇_θ μ_θ(s)^T w + V^v(s), (48)

where the state value V^v(s) may be any differentiable baseline function [47]. Here φ(s) and v are the feature and parameter of the state value (V^v(s) = v^T φ(s)). The feature of the advantage function (A^w(s, a) = φ(s, a)^T w) is defined as φ(s, a) ≝ ∇_θ μ_θ(s)(a − μ_θ(s)). Hence, a low-bias Q^w(s, a) is obtained using OPDAC-Q and CFA. This new algorithm with less bias is called compatible OPDAC-Q (COPDAC-Q) [47], in which the weights are updated as Eq. 49-51:

v_{t+1} = v_t + α_v δ_t φ(s_t), (49)
w_{t+1} = w_t + α_w δ_t φ(s_t, a_t), (50)
θ_{t+1} = θ_t + α_θ ∇_θ μ_θ(s_t)(∇_θ μ_θ(s_t)^T w_t), (51)

where δ_t is the same as in Eq. 45. Here α_v, α_w and α_θ are learning rates. Note that the linear function approximation method [47] is used to obtain the advantage function A^w(s, a), which is used to replace the value function Q^w(s, a), because A^w(s, a) is more efficient than Q^w(s, a) in the weight update. Linear function approximation, however, may lead to divergence of Q^w(s, a) in the critic δ. The critic δ can be replaced by the gradient Q-learning critic [52] to reduce divergence. The algorithm that combines COPDAC-Q and the gradient Q-learning critic is called COPDAC gradient Q-learning (COPDAC-GQ). Details of the gradient Q-learning critic and the COPDAC-GQ algorithm can be found in [47][52].
By the analytical illustration above, 2 examples (COPDAC-Q and COPDAC-GQ) of the DPG algorithm are obtained. In short, the key points of DPG are to (1) find an unbiased Q^w(s, a) as the critic; (2) train a deterministic policy μ_θ(s) to select actions. The networks of DPG are updated as in AC. Brief steps of DPG are shown in Fig. 23.
Figure 23. Brief steps of the DPG algorithm.
DDPG [67] is the combination of the replay buffer, the deterministic policy μ(s) and the actor-critic architecture. Q(s, a; θ^Q) is used as the critic network to approximate the action value. μ(s; θ^μ) is used as the policy network to approximate the deterministic policy. The TD target y of DDPG is defined by

y = r + γQ'(s', μ'(s'; θ^{μ'}); θ^{Q'}), (52)

where Q' and μ' are copies of Q and μ used as target networks that update with low frequency. The objective of the critic network is defined by
J(θ^Q) = E[(y − Q(s, a; θ^Q))²]. (53)

The critic network is updated by minimizing the loss value (MSE loss)

L = (1/N) Σ_i (y_i − Q(s_i, a_i; θ^Q))², (54)

where N is the number of tuples <s, a, r, s'> sampled from the replay buffer. The target function of the policy network is defined by

J(θ^μ) = E[Q(s, μ(s; θ^μ); θ^Q)], (55)

and the objective gradient is obtained by

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a; θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s; θ^μ)|_{s=s_i}. (56)

Hence, the policy network is updated according to the gradient ascent method by

θ^μ ← θ^μ + α∇_{θ^μ} J, (57)

where α is a learning rate. New target networks

θ^{Q'} ← τθ^Q + (1 − τ)θ^{Q'}, (58)
θ^{μ'} ← τθ^μ + (1 − τ)θ^{μ'}, (59)

where τ is a learning rate, are obtained by the "soft" update method that improves the stability of network convergence. Detailed steps of DDPG are shown in Algorithm 4 [67] and Fig. 24. Examples can be found in [27][46], in which DDPG is used in robotic arms.
Algorithm 4:
DDPG
Randomly initialize the critic network Q(s, a; θ^Q) and actor μ(s; θ^μ) with weights θ^Q and θ^μ
Initialize the target networks Q' and μ' with weights θ^{Q'} ← θ^Q, θ^{μ'} ← θ^μ
Initialize the replay buffer R
for episode = 1, M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1, T do
        Select action a_t = μ(s_t; θ^μ) + N_t according to the current policy and exploration noise
        Execute action a_t and observe reward r_t and observe new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γQ'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'})
        Update the critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i; θ^Q))²
        Update the actor policy using the sampled policy gradient:
            ∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a; θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s; θ^μ)|_{s=s_i}
        Update the target networks:
            θ^{Q'} ← τθ^Q + (1 − τ)θ^{Q'}
            θ^{μ'} ← τθ^μ + (1 − τ)θ^{μ'}
    end for
end for
Figure 24. Steps of DDPG. DDPG combines the replay buffer, actor-critic architecture, and deterministic policy. First, an action is selected by the policy network and the reward is obtained, and the state transits to the next state. Second, the experience tuple is saved in the replay buffer. Third, experiences are sampled from the replay buffer for training. Fourth, the critic network is updated. Finally, the policy network is updated.
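A short PyTorch sketch of the DDPG-specific pieces, the deterministic actor update (Eqs. 56-57) and the soft target update (Eqs. 58-59), is given below; the network shapes and τ are assumptions:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())
target_critic = nn.Sequential(nn.Linear(3 + 1, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_update(states):
    """Deterministic actor update: ascend Q(s, mu(s)) by back-propagating through the critic."""
    actions = actor(states)
    loss = -critic(torch.cat([states, actions], dim=1)).mean()   # maximize Q => minimize -Q
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

def soft_update(target, source, tau=0.005):
    """theta' <- tau * theta + (1 - tau) * theta' (Eqs. 58-59)."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

actor_update(torch.randn(16, 3))     # dummy batch of 16 states with 3 features each
soft_update(target_actor, actor)
soft_update(target_critic, critic)
```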
PPO [70][85] is an optimized version of TRPO [69]. Hence, we first introduce TRPO, and then introduce PPO.
TRPO:
Previous policy gradient algorithms update their policies by θ ← θ + α∇J(θ). However, the new policy is improved unstably. The goal of TRPO is to improve its policy monotonically; the stability of convergence is therefore improved by finding a new policy with the objective

maximize_θ L_{θ_old}(θ), subject to D_KL(θ_old, θ) ≤ δ, (60)

where L_{θ_old}(θ) is the approximation of the new policy's expectation, D_KL the KL divergence, and δ a trust region constraint on the KL divergence. The objective gradient ∇L_{θ_old}(θ) is obtained by maximizing the objective.
η(π) and η(π̃) denote the expectations of the old and new policies respectively. Their relationship is defined by η(π̃) = η(π) + E_{τ~π̃}[Σ_t γ^t A_π(s_t, a_t)], where γ is a discount factor, and A_π(s, a) is the advantage value that is defined by A_π(s, a) = Q_π(s, a) − V_π(s). Thus, η(π̃) = η(π) + Σ_s ρ_{π̃}(s) Σ_a π̃(a|s) A_π(s, a), where ρ_{π̃}(s) is the state distribution of the new policy; but ρ_{π̃}(s) is unknown, therefore it is impossible to obtain the new policy's expectation η(π̃) directly. The approximation of the new policy's expectation is defined by

L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a), (61)

where ρ_π(s) is known. The relationship of η(π̃) and L_π(π̃) [78] is proved to be

η(π̃) ≥ L_π(π̃) − C·D_KL^max(π, π̃), (62)

where the penalty coefficient C = 4εγ/(1 − γ)², and ε is the maximum advantage. Hence, it is possible to improve η(π̃) by maximizing

L_π(π̃) − C·D_KL^max(π, π̃). (63)

However, the penalty coefficient C (the constraint on the KL divergence) leads to a small step size in the policy update. A trust region constraint is used to constrain the KL divergence instead,

D_KL(θ_old, θ) ≤ δ, (64)

therefore the step size in the policy update is enlarged robustly. The new improved policy is obtained in the trust region by maximizing the objective L_{θ_old}(θ), s.t. D_KL(θ_old, θ) ≤ δ. This objective can be simplified further [69] and the new policy is obtained by

maximize_θ E_t[(π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t)) Â_t], s.t. E_t[D_KL(π_{θ_old}(·|s_t), π_θ(·|s_t))] ≤ δ. (65)

PPO: we know the objective of TRPO is E_t[(π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t)) Â_t], in which a fixed trust region constraint is used to constrain the KL divergence instead of the penalty coefficient C. The fixed trust region constraint leads to a reasonable step size in the policy update, therefore stability in convergence is improved and the convergence speed is acceptable. However, the objective of TRPO is optimized in implementation by the conjugate gradient method [70], which is computationally expensive.
PPO optimizes the objective E_t[(π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t)) Â_t] from 2 aspects: (1) the probability ratio r_t(θ) = π_θ(a_t|s_t)/π_{θ_old}(a_t|s_t) in the objective is constrained in the interval [1 − ε, 1 + ε] by introducing the "surrogate" objective

L^{CLIP}(θ) = E_t[min(r_t(θ)Â_t, clip(r_t(θ), 1 − ε, 1 + ε)Â_t)], (66)

where ε is a hyperparameter, to penalize changes of the policy that move r_t(θ) away from 1 [70]; (2) the penalty coefficient C is replaced by an adaptive penalty coefficient β that increases or decreases according to the expectation of the KL divergence in the new update. To be exact,

β ← β/2 if d < d_targ/1.5, and β ← 2β if d > 1.5·d_targ, (67)

where d = E_t[D_KL(π_{θ_old}(·|s_t), π_θ(·|s_t))] and d_targ denotes the target value of the KL divergence in each policy update, therefore the KL-penalized objective is obtained by

L^{KLPEN}(θ) = E_t[r_t(θ)Â_t − β·D_KL(π_{θ_old}(·|s_t), π_θ(·|s_t))]. (68)

In the implementation with a neural network, the loss function is required to combine the policy surrogate and a value function error [70], and an entropy bonus is also used in the objective to encourage exploration. Hence, the combined surrogate objective is obtained by

L^{CLIP+VF+S}(θ) = E_t[L^{CLIP}(θ) − c_1·L^{VF}(θ) + c_2·S[π_θ](s_t)], (69)

where c_1 and c_2 denote 2 coefficients, S the entropy bonus and L^{VF} the square-error loss of the value function. The objectives (L^{CLIP+VF+S}(θ) and L^{KLPEN}(θ)) of PPO are optimized by SGD, which costs less computing resources than the conjugate gradient method. PPO is implemented with the actor-critic architecture, therefore it converges faster than TRPO.
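A minimal PyTorch sketch of the clipped surrogate objective in Eq. (66) is shown below; the tensors stand for quantities collected with the old policy and are assumptions for illustration:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate loss of Eq. (66); returns a value to minimize (negative objective)."""
    ratio = torch.exp(new_log_probs - old_log_probs)    # r_t(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()        # maximize the objective => minimize -L

# toy batch: log-probabilities under the new and old policies and estimated advantages
new_lp = torch.tensor([-0.9, -1.2, -0.3], requires_grad=True)
old_lp = torch.tensor([-1.0, -1.0, -0.5])
adv = torch.tensor([0.5, -0.2, 1.0])
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()
print(float(loss))
```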
VI. Analytical comparisons
To provide a clear understanding of the advantages and disadvantages of different motion planning algorithms, we divide them into 4 groups: traditional algorithms, supervised learning algorithms, optimal value RL and policy gradient RL, and comparisons are made according to their principles described in sections II, III, IV and V. First, direct comparisons of the algorithms in each group are made to provide a clear understanding of the input, output, and key features of these algorithms. Second, analytical comparisons of all motion planning algorithms are made to provide a comprehensive understanding of the performance and application of the algorithms, according to general criteria. Third, analytical comparisons of the convergence of RL-based motion planning algorithms are specially made, because RL-based algorithms have recently been the research focus.
Traditional planning algorithms: this group includes graph search algorithms, sampling-based algorithms, and interpolating curve algorithms. Table 1 lists their input, output and key features: (1) these algorithms use the graph or map of the workspace as input, and output the trajectory directly; (2) graph search algorithms find the shortest, collision-free trajectory by search methods (e.g. best-first search). For example, Dijkstra's algorithm is based on the best-first search. However, the search process is computationally expensive because the search space is large, therefore a heuristic function is used to reduce the search space and the shortest path is found by estimating the overall cost (e.g. A*); (3) sampling-based algorithms randomly sample a collision-free trajectory in the search space (e.g. PRM), and constraints (e.g. non-holonomic constraints) are needed for some algorithms (e.g. RRT) in the sampling process; (4) interpolating curve algorithms plan their path by mathematical rules, and the planned path is then smoothed by CAGD.
Table 1. Comparison of traditional planning algorithms.
Classification | Example | Input | Key features | Output
Graph search algorithm | Dijkstra's, A* | Graph or map | Best-first search (large search space); heuristic function in cost estimation (A*) | Trajectory
Sampling-based algorithm | PRM, RRT | Graph or map | 1. Random search (suboptimal path); 2. Non-holonomic constraint (RRT) | Trajectory
Interpolating curve algorithm | Line and circle, clothoid curves, polynomial curves, Bezier curves, spline curves | Graph or map | Mathematical rules; path smoothing using CAGD | Trajectory
Supervised learning algorithms: this group includes MSVM, LSTM, MCTS and CNN. These algorithms are listed in Table 2: (1) MSVM, LSTM and MCTS use well-prepared vectors as input, while CNN can directly use images as input; (2) LSTM and MCTS can output time-sequential actions because of their structures (e.g. the tree) that can store and learn time-sequential features. MSVM and CNN cannot output time-sequential actions because they output a one-step prediction by performing the trained classifier; (3) MSVM plans the motion of the robot by training a maximum margin classifier. LSTM stores and processes inputs in its cell, which is a stack structure, and actions are then outputted by performing the trained LSTM model. MCTS is the combination of the Monte-Carlo method and the search tree. Environmental states and values are stored and updated in the nodes of its tree, therefore actions are outputted by performing the trained MCTS model. CNN converts high-dimensional images to low-dimensional features by convolutional layers. These low-dimensional features are used to train a CNN model, therefore actions are outputted by performing the trained CNN model.
Table 2. Comparison of supervised learning algorithms.
| Algorithm | Input | Key features | Output |
| MSVM | Vector | Maximum-margin classifier | Non-sequential actions |
| LSTM | Vector | Cell (stack structure) | Time-sequential actions |
| MCTS | Vector | Monte Carlo method; tree structure | Time-sequential actions |
| CNN | Image | Convolutional layers; weight matrix | Non-sequential actions |
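As an illustration of how an LSTM maps a sequence of state vectors to actions, the following PyTorch sketch is a hypothetical minimal model; the dimensions, the discrete action head and all identifiers are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class LSTMPlanner(nn.Module):
    """Toy sequence-to-action model: a state-vector sequence in, one action per step out."""
    def __init__(self, state_dim=6, hidden_dim=64, n_actions=5):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)    # scores over discrete actions

    def forward(self, states):
        # states: (batch, time, state_dim); the hidden state carries time-sequential features
        hidden_seq, _ = self.lstm(states)
        return self.head(hidden_seq)                    # (batch, time, n_actions) logits

# Supervised training against human-labeled actions (cross-entropy), as described above:
model = LSTMPlanner()
states = torch.randn(8, 20, 6)                          # 8 dummy trajectories of 20 steps
labels = torch.randint(0, 5, (8, 20))                   # dummy action labels
loss = nn.CrossEntropyLoss()(model(states).reshape(-1, 5), labels.reshape(-1))
loss.backward()
```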
Optimal value RL: this group includes Q learning, DQN, double DQN, and dueling DQN. Features of these algorithms include the replay buffer, the objective of the algorithm, and the weight update method. Comparisons are listed in Table 3: (1) Q learning normally uses well-prepared vectors as input, while DQN, double DQN and dueling DQN use images as input because they use convolutional layers to process high-dimensional images; (2) the outputs of these algorithms are time-sequential actions obtained by performing the trained model; (3) DQN, double DQN and dueling DQN use a replay buffer to reuse experience, while Q learning collects experiences and learns from them online; (4) DQN, double DQN and dueling DQN use the MSE of the temporal-difference target as their objective. Their differences are: first, DQN obtains the action value with a neural network $Q(s,a;\theta)$, while Q learning obtains the action value by querying a Q-table; second, double DQN uses another network $\theta'$ to evaluate the selected action and obtain a better (less over-estimated) action value by $Q(s', \arg\max_{a} Q(s',a;\theta); \theta')$; third, dueling DQN obtains the action value by dividing it into an advantage value $A(s,a)$ and a state value $V(s)$. The constraint $\mathbb{E}_{a}[A(s,a)] = 0$ is imposed on the advantage value, so a better action value is obtained by $Q(s,a) = V(s) + \big(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a')\big)$. The networks that approximate action values in these algorithms are updated by minimizing the MSE with a gradient descent approach; a comparison of the DQN and double DQN targets is sketched after Table 3.

Table 3. Comparison of optimal-value RL.

| Algorithm | Input | Output | Replay buffer | Objective | Weight update method |
| Q learning | Vector | Time-sequential actions | No | $[y - Q(s,a)]^2$ with $y = r + \gamma\max_{a'}Q(s',a')$ | Gradient descent |
| DQN | Image | Time-sequential actions | Yes | MSE with $y = r + \gamma\max_{a'}Q(s',a';\theta^{-})$ | Gradient descent |
| Double DQN | Image | Time-sequential actions | Yes | MSE with $y = r + \gamma Q(s', \arg\max_{a}Q(s',a;\theta); \theta')$ | Gradient descent |
| Dueling DQN | Image | Time-sequential actions | Yes | MSE with $Q(s,a) = V(s) + (A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a'))$ | Gradient descent |
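The following PyTorch fragment is a hypothetical sketch contrasting the DQN and double DQN bootstrap targets discussed above; `q_net`, `target_net` and the batch tensors are illustrative placeholders rather than code from any cited work.

```python
import torch

def td_targets(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Contrast the DQN and double DQN bootstrap targets for a batch of transitions."""
    with torch.no_grad():
        # DQN: the target network both selects and evaluates the next action,
        # which tends to over-estimate the action value.
        dqn = rewards + gamma * (1 - dones) * target_net(next_states).max(dim=1).values
        # Double DQN: the online network selects the action, the target network evaluates it.
        best = q_net(next_states).argmax(dim=1, keepdim=True)
        ddqn = rewards + gamma * (1 - dones) * target_net(next_states).gather(1, best).squeeze(1)
    return dqn, ddqn

# Dummy usage; either target is then regressed against Q(s, a; theta) with the MSE loss.
q_net, target_net = torch.nn.Linear(4, 3), torch.nn.Linear(4, 3)
rewards, next_states, dones = torch.zeros(8), torch.randn(8, 4), torch.zeros(8)
dqn_y, ddqn_y = td_targets(q_net, target_net, rewards, next_states, dones)
```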
Policy gradient RL: this group includes PG, AC, A3C, A2C, DPG, DDPG, TRPO, and PPO. Features of these algorithms include the actor-critic architecture, the multi-thread method, the replay buffer, the objective of the algorithm, and the weight update method. Comparisons are listed in Table 4: (1) the input of policy gradient RL can be an image or a vector; an image is used as input under the condition that a convolutional layer is used as a preprocessing component to convert the high-dimensional image into low-dimensional features; (2) the outputs of policy gradient RL are time-sequential actions obtained by performing the trained policy $\pi: s \rightarrow a$; (3) the actor-critic architecture is not used in PG, while the other policy gradient algorithms are implemented with the actor-critic architecture; (4) A3C and A2C use the multi-thread method to collect data and update their networks, while the other policy gradient algorithms are based on a single thread in data collection and network update; (5) DPG and DDPG use a replay buffer to reuse data in an offline way, while the other policy gradient algorithms learn online; (6) the objective of PG is defined as the expectation of the accumulated reward along a trajectory, $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[\sum_t r_t]$. The critic objectives of AC, A3C, A2C, DPG and DDPG are defined as the MSE of the temporal-difference error, and their critic networks are updated by minimizing this MSE. Their actor objectives differ: first, the actor objective of AC is defined as $J(\theta) = \log\pi_\theta(a|s)\,Q_w(s,a)$; second, a policy entropy term is added to encourage exploration, so the actor objectives of A3C and A2C are defined by $J(\theta) = \log\pi_\theta(a|s)\,A(s,a) + \beta H(\pi_\theta(\cdot|s))$; third, DPG and DDPG use the action value as their actor objective, $J(\theta) = Q_w(s, \mu_\theta(s))$, which is approximated by a neural network, and their policy (actor) networks are updated with the objective gradient $\nabla_\theta J = \mathbb{E}[\nabla_a Q_w(s,a)|_{a=\mu_\theta(s)}\,\nabla_\theta\mu_\theta(s)]$. The critic objectives of TRPO and PPO are defined through the advantage value $\hat{A}_t = Q(s_t,a_t) - V(s_t)$, and their critic (value) networks are updated by minimizing the corresponding squared value error.

The actor objectives of TRPO and PPO are different: the objective of TRPO is $\mathbb{E}_t\big[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\big]$ with a fixed trust region constraint $\mathbb{E}_t[D_{KL}] \le \delta$ to ensure a monotonic update of the policy network, while PPO uses the "surrogate" objective and an adaptive KL penalty to ensure a better monotonic update of the policy network. PPO therefore has two objectives: the KL-penalized objective $L^{KLPEN}(\theta)$ (68) and the surrogate objective $L^{CLIP+VF+S}(\theta)$ (69); (7) the policy networks of all policy gradient algorithms are updated by the gradient ascent approach $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$. The policy networks of A3C and A2C are updated in asynchronous and synchronous ways respectively, and the networks of DDPG are updated in a "soft" way by $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ and $w' \leftarrow \tau w + (1-\tau)w'$. A minimal actor-critic update sketch follows Table 4.

Table 4. Comparison of policy gradient RL.

| Algorithm | Input | Output | Actor-critic architecture | Multi-thread method | Replay buffer | Objective | Weight update method |
| PG | Image/vector | Time-sequential actions | ---* | --- | --- | 1 | Gradient ascent |
| AC | Image/vector | Time-sequential actions | Yes | --- | --- | Critic: 2; Actor: 3 | Gradient ascent |
| A3C | Image/vector | Time-sequential actions | Yes | Yes | --- | Critic: 4; Actor: 5 | Gradient ascent; asynchronous update |
| A2C | Image/vector | Time-sequential actions | Yes | Yes | --- | Critic: 6; Actor: 7 | Gradient ascent; synchronous update |
| DPG | Image/vector | Time-sequential actions | Yes | --- | Yes | Critic: 8; Actor: 9 | Gradient ascent |
| DDPG | Image/vector | Time-sequential actions | Yes | --- | Yes | Critic: 10; Actor: 11 | Gradient ascent; soft update |
| TRPO | Image/vector | Time-sequential actions | Yes | --- | --- | Critic: 12; Actor: 13 | Gradient ascent |
| PPO | Image/vector | Time-sequential actions | Yes | --- | --- | Critic: 14; Actor: 15 | Gradient ascent |
*Here the mark "---" denotes "No".
1. $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[\sum_t r_t]$
2, 4, 6, 8, 10. Critic MSE of the temporal-difference error (value-based for AC/A3C/A2C, action-value-based for DPG/DDPG).
3. $J(\theta) = \log\pi_\theta(a|s)\,Q_w(s,a)$
5, 7. $J(\theta) = \log\pi_\theta(a|s)\,A(s,a) + \beta H(\pi_\theta(\cdot|s))$
9, 11. $J(\theta) = Q_w(s, \mu_\theta(s))$, objective gradient $\nabla_\theta J = \mathbb{E}[\nabla_a Q_w(s,a)|_{a=\mu_\theta(s)}\,\nabla_\theta\mu_\theta(s)]$
12, 14. Advantage-based value error with $\hat{A}_t = Q(s_t,a_t) - V(s_t)$
13. $\mathbb{E}_t\big[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\big]$, s.t. $\mathbb{E}_t[D_{KL}(\pi_{\theta_{old}}, \pi_\theta)] \le \delta$
15. (1) $L^{KLPEN}(\theta) = \mathbb{E}_t[r_t(\theta)\hat{A}_t - \beta D_{KL}(\pi_{\theta_{old}}, \pi_\theta)]$; (2) $L^{CLIP+VF+S}(\theta) = \mathbb{E}_t[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)]$ where $L^{CLIP}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]$
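To make the actor and critic objectives above concrete, here is a hypothetical single-step advantage actor-critic loss in PyTorch; the network shapes, coefficients and variable names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Toy shared-trunk actor-critic: policy logits (actor) and state value (critic)."""
    def __init__(self, state_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.actor = nn.Linear(hidden, n_actions)
        self.critic = nn.Linear(hidden, 1)

    def forward(self, s):
        z = self.trunk(s)
        return self.actor(z), self.critic(z).squeeze(-1)

def a2c_loss(model, s, a, r, s_next, done, gamma=0.99, c1=0.5, c2=0.01):
    logits, v = model(s)
    with torch.no_grad():
        _, v_next = model(s_next)
        target = r + gamma * (1 - done) * v_next                    # bootstrapped critic target
    advantage = target - v                                          # A(s, a) estimate
    dist = torch.distributions.Categorical(logits=logits)
    actor_loss = -(dist.log_prob(a) * advantage.detach()).mean()    # gradient ascent on J(theta)
    critic_loss = advantage.pow(2).mean()                           # MSE critic objective
    entropy = dist.entropy().mean()                                 # exploration bonus
    return actor_loss + c1 * critic_loss - c2 * entropy

# Dummy usage:
model = ActorCritic()
s, s2 = torch.randn(16, 8), torch.randn(16, 8)
a = torch.randint(0, 4, (16,)); r = torch.zeros(16); done = torch.zeros(16)
a2c_loss(model, s, a, r, s2, done).backward()
```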
Here, analytical comparisons of motion planning algorithms are made according to the general criteria we summarize. These criteria include (1) local or global planning; (2) path length; (3) optimal velocity; (4) reaction speed; (5) safe distance; (6) time-sequential path. The speed and stability of network convergence for optimal value RL and policy gradient RL are then compared analytically, because the convergence speed and stability of RL in motion planning are a recent research focus.
A. Comparisons according to general criteria

Local or global planning: this criterion denotes the area in which an algorithm is used in most cases. Table 5 lists the planning algorithms and which criteria they fit: (1) graph search algorithms plan their path globally by search methods (e.g. depth-first search, best-first search) to obtain a collision-free trajectory on a graph or map; (2) sampling-based algorithms sample the local or global workspace by sampling methods (e.g. a random tree) to find a collision-free trajectory; (3) interpolating curve algorithms draw a fixed, short trajectory by mathematical rules to avoid local obstacles; (4) MSVM and CNN make one-step predictions by trained classifiers to decide their local motion; (5) LSTM, MCTS, optimal value RL and policy gradient RL can make time-sequential motion plans from the start to the destination by performing their trained models. These models include the stack-structured model of LSTM, the tree model of MCTS and the weight-matrix models of RL. These algorithms fit global motion planning tasks in theory if the workspace is not large, because it is hard to train a converged model in a large workspace. In most cases, the models of these algorithms are trained in a local workspace to make time-sequential predictions by performing the trained model or policy $\pi: s \rightarrow a$.

Path length: this criterion denotes the length of the planned path, described as "optimal path", "suboptimal path", or "fixed path". The path length of the algorithms is listed in Table 5: (1) graph search algorithms can find the shortest path by performing search methods (e.g. best-first search) on a graph or map; (2) sampling-based algorithms plan a suboptimal path, because their sampling methods (e.g. a random tree) lead to insufficient sampling that only covers part of the cases; (3) interpolating curve algorithms plan their path according to mathematical rules that lead to a fixed path length; (4) supervised learning algorithms (MSVM, LSTM, MCTS and CNN) plan their path by performing models trained with human-labeled datasets, so a suboptimal path is obtained; (5) RL algorithms (optimal value RL and policy gradient RL) can generate an optimal path under the condition that a reasonable penalty is used to punish the number of moved steps during training, so an optimal path is obtained by performing the trained RL policy.
Optimal velocity: this criterion denotes the ability to tune the velocity while the algorithm plans its path, so that the robot reaches the destination in minimum time along the planned path. This criterion is described as "optimal velocity" or "suboptimal velocity". Table 5 lists the performance of the algorithms: (1) the performance of graph search algorithms, sampling-based algorithms and interpolating curve algorithms in velocity tuning cannot be evaluated, because these algorithms are only designed for path planning to find a collision-free trajectory; (2) supervised learning algorithms (MSVM, LSTM, MCTS and CNN) can output actions in the format $[v_x, v_y]$, where $v_x$ and $v_y$ are the velocities along the x and y axes, if the algorithms are trained with such vector labels. However, these velocity-related labels are all hard-coded by hand, so the time to reach the destination relies heavily on human factors, and supervised learning algorithms therefore cannot realize optimal velocity; (3) optimal value RL and policy gradient RL can realize optimal velocity by attaching a penalty to the consumed time during training. These algorithms automatically learn how to choose the best velocity so as to cost as little time as possible, so robots can realize optimal velocity by performing the trained policy (a sketch of such reward shaping is given after the safe-distance criterion). Note that in this case, actions in optimal value RL and policy gradient RL must be in the format $[v_x, v_y]$, and an action space containing many velocity choices must be defined.

Reaction speed: this criterion denotes the speed of reaction to dynamic obstacles. Reaction speed is described by three levels: "slow", "medium" and "fast". Table 5 lists the reaction speed of the algorithms: (1) graph search algorithms and sampling-based algorithms rely on the planned trajectory in the graph or map to avoid obstacles. However, the graph or map is normally updated at a low frequency, so the reaction speed of these algorithms is slow; (2) interpolating curve algorithms plan their path according to mathematical rules that cost limited and predictable computation time, so their reaction speed is medium; (3) supervised learning algorithms, optimal value RL and policy gradient RL react to obstacles by performing a trained model or policy $\pi: s \rightarrow a$ that maps the state of the environment to a probability distribution $\pi(a|s)$. This process is fast and its time cost can be ignored, so the reaction speed of these algorithms is fast.
Safe distance: this criterion denotes the ability to keep a safe distance from obstacles. Safe distance is described by three levels: "fixed distance", "suboptimal distance" and "optimal distance". Table 5 lists the performance of the algorithms: (1) graph search algorithms and sampling-based algorithms keep a fixed distance from static obstacles through hard-coded settings in the robotic application. However, a high collision rate is inevitable in dynamic environments because of the slow update frequency of the graph or map; (2) interpolating curve algorithms keep a fixed distance from static and dynamic obstacles according to mathematical rules; (3) supervised learning algorithms keep a suboptimal distance from static and dynamic obstacles; this suboptimal distance is obtained by performing a model trained with human-labeled datasets; (4) optimal value RL and policy gradient RL keep an optimal distance from static and dynamic obstacles by performing a trained policy $\pi: s \rightarrow a$. This policy is trained under the condition that a penalty is used to punish close distances between the robot and obstacles during training, so the algorithms automatically learn how to keep an optimal distance from obstacles while the robot moves towards its destination.
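The following is a hypothetical reward-shaping sketch showing how a time (step) penalty and a proximity penalty encode the "optimal velocity"/"optimal path" and "safe distance" criteria in an RL reward; all constants and names are illustrative assumptions rather than values from the reviewed works.

```python
import math

def shaped_reward(robot_xy, goal_xy, nearest_obstacle_dist, reached, collided,
                  step_penalty=0.01, safe_dist=0.5, proximity_weight=0.25):
    """Per-step reward combining goal progress, a time penalty and a safety penalty."""
    if collided:
        return -1.0                        # heavy penalty; collision ends the episode
    if reached:
        return 1.0                         # success bonus
    reward = -step_penalty                 # every step costs time -> encourages short, fast paths
    if nearest_obstacle_dist < safe_dist:  # discourage getting closer than the safety margin
        reward -= proximity_weight * (safe_dist - nearest_obstacle_dist)
    # optional: small shaping toward the goal to speed up convergence
    reward -= 0.001 * math.dist(robot_xy, goal_xy)
    return reward
```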
Time-sequential path: this criterion denotes whether an algorithm fits time-sequential tasks or not. Table 5 lists the algorithms that fit time-sequential planning: (1) graph search algorithms, sampling-based algorithms and interpolating curve algorithms plan their path according to a graph, a map or mathematical rules, regardless of the environment state at each time step; hence these algorithms do not fit time-sequential tasks; (2) MSVM and CNN output actions by one-step prediction that has no relation to the environment state at each time step; (3) LSTM and MCTS store the environment state of each time step in their cells and nodes respectively, and their models are updated by learning from these time-related experiences. Time-sequential actions are output by performing the trained models, so these algorithms fit time-sequential tasks; (4) optimal value RL and policy gradient RL train their policy networks by learning from the environment state at each time step. Time-sequential actions are output by performing the trained policy, so these algorithms fit time-sequential tasks.

Table 5. Analytical comparisons according to general criteria.
| Algorithm | Local/global | Path length | Optimal velocity | Reaction speed | Safe distance | Time-sequential path |
| Graph search alg. | Global | Optimal path | --* | Slow | Fixed distance; high collision rate | No |
| Sampling-based alg. | Local; global | Suboptimal path | -- | Slow | Fixed distance; high collision rate | No |
| Interpolating curve alg. | Local | Fixed path | -- | Medium | Fixed distance | No |
| MSVM | Local | Suboptimal path | Suboptimal velocity | Fast | Suboptimal distance | No |
| LSTM | Local; global | Suboptimal path | Suboptimal velocity | Fast | Suboptimal distance | Yes |
| MCTS | Local; global | Suboptimal path | Suboptimal velocity | Fast | Suboptimal distance | Yes |
| CNN | Local | Suboptimal path | Suboptimal velocity | Fast | Suboptimal distance | No |
| Q learning | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| DQN | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| Double DQN | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| Dueling DQN | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| PG | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| AC | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| A3C | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| A2C | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| DPG | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| DDPG | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| TRPO | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
| PPO | Local; global | Optimal path | Optimal velocity | Fast | Optimal distance | Yes |
*The mark "--" denotes performance that cannot be evaluated.
B. Comparisons of convergence speed and stability

Convergence speed: here we use "poor", "reasonable", "good", and "excellent" to describe convergence speed. Table 6 lists the performance of optimal value RL and policy gradient RL: (1) Q learning only fits simple motion planning with a small Q-table; it is hard for Q learning to converge with a large Q-table in a complex environment, and over-estimation of the Q value also leads to poor performance if a neural network is used to approximate the Q value; (2) DQN suffers from over-estimation of the action value, but DQN learns from experience in a replay buffer, which lets the network reuse high-quality experience efficiently; hence the convergence speed of DQN is reasonable; (3) double DQN uses another network $\theta'$ to evaluate the actions selected by $\theta$. A new action value with less over-estimation is obtained by $Q(s', \arg\max_a Q(s',a;\theta); \theta')$, so the convergence speed is improved; (4) dueling DQN finds a better action value by, first, dividing the action value into a state value and an advantage value, $Q(s,a) = V(s) + A(s,a)$, and second, constraining the advantage value by $\mathbb{E}_a[A(s,a)] = 0$, so a better action value is obtained by $Q(s,a) = V(s) + (A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'}A(s,a'))$; hence the performance of dueling DQN in convergence speed is good; (5) PG updates its policy according to trajectory rewards $J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}[\sum_t r_t]$, so poor performance in convergence speed is inevitable; (6) AC uses a critic network to evaluate the actions selected by the actor network, thereby speeding up convergence; (7) A3C and A2C use the multi-thread method to improve convergence speed directly, and policy entropy is also used to encourage exploration; these methods further enhance convergence speed; (8) the performance of DPG and DDPG in convergence speed is good because: first, their critics are unbiased critic networks obtained by compatible function approximation (CFA) and gradient Q learning; second, their policies are deterministic policies $\mu: s \rightarrow a$ that converge faster than stochastic policies; third, they update their networks offline with a replay buffer; fourth, noise is used in DDPG to encourage exploration; (9) TRPO makes a great improvement in convergence speed by adding a trust region constraint $\mathbb{E}_t[D_{KL}] \le \delta$ to policy updates, so its networks are updated monotonically by maximizing the objective $\mathbb{E}_t\big[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\big]$; (10) PPO goes further in improving convergence speed by introducing the "surrogate" objective $L^{CLIP+VF+S}$ and the KL-penalized objective $L^{KLPEN}$.

Table 6. Comparison of convergence speed for RL.

| Algorithm | Convergence speed | Reasons |
| Q learning | Poor | Over-estimation of action value |
| DQN | Reasonable | Replay buffer |
| Double DQN | Good | Replay buffer; another network for evaluation |
| Dueling DQN | Good | Replay buffer; division of action value into $V(s)$ and $A(s,a)$ |
| PG | Poor | High variance of $J(\theta) = \mathbb{E}_\tau[\sum_t r_t]$ |
| AC | Reasonable | Actor-critic architecture |
| A3C | Good | Actor-critic architecture; multi-thread method; policy entropy |
| A2C | Good | Actor-critic architecture; multi-thread method; policy entropy |
| DPG | Good | Replay buffer; actor-critic architecture; deterministic policy; unbiased critic network |
| DDPG | Good | Replay buffer; actor-critic architecture; deterministic policy; unbiased critic network; exploration noise |
| TRPO | Good | Actor-critic architecture; fixed trust region constraint |
| PPO | Excellent | Actor-critic architecture; "surrogate" objective; KL-penalized objective |
Convergence stability: Table 7 lists the convergence stability of optimal value RL and policy gradient RL: (1) Q learning updates its action value at every step, so bias is introduced, and over-estimation of the action value leads to a suboptimal update direction when a network is used as the approximator; hence the convergence stability of Q learning is poor; (2) DQN improves convergence stability with a replay buffer, in which a batch of experiences is sampled and the network is updated according to the batch loss; (3) double DQN and dueling DQN find a better action value than DQN via an evaluation network and an advantage decomposition respectively, so the networks of these algorithms are updated in a better direction; (4) PG updates its network according to the trajectory reward. This reduces the bias caused by one-step rewards but introduces high variance; hence the network of PG is updated stably but is still hard to converge; (5) the actor and critic networks of AC perform poorly in early-stage training, which leads to fluctuating updates at the beginning, although the network is updated by the gradient ascent approach $\theta \leftarrow \theta + \alpha\nabla_\theta J(\theta)$; (6) A3C and A2C update their networks with multi-step rewards, which reduces the bias and improves convergence stability, although it introduces some variance; the gradient ascent approach also helps, so their convergence stability is reasonable; (7) the unbiased critic, the gradient ascent approach and the replay buffer contribute to the good convergence stability of DPG and DDPG. Additionally, the networks of DDPG are updated in a "soft" way by $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ and $w' \leftarrow \tau w + (1-\tau)w'$, which also contributes to convergence stability (a minimal sketch of this soft update follows Table 7); (8) the networks of TRPO and PPO are updated monotonically. TRPO achieves this with the trust region constraint $\mathbb{E}_t[D_{KL}] \le \delta$, while PPO uses the "surrogate" objective $L^{CLIP+VF+S}$ and the KL-penalized objective $L^{KLPEN}$. Hence the convergence stability of their networks is good.

Table 7. Comparison of convergence stability for RL.

| Algorithm | Convergence stability | Reasons |
| Q learning | Poor | Update at each step; over-estimation of action value |
| DQN | Reasonable | Replay buffer |
| Double DQN | Reasonable | Replay buffer; evaluation network |
| Dueling DQN | Reasonable | Replay buffer; advantage network |
| PG | Good | Update by trajectory rewards; gradient ascent update |
| AC | Poor (in the beginning) | Update at each step; gradient ascent update |
| A3C | Reasonable | Update by multi-step rewards; gradient ascent update |
| A2C | Reasonable | Update by multi-step rewards; gradient ascent update |
| DPG | Good | Unbiased critic; gradient ascent update; replay buffer |
| DDPG | Good | Unbiased critic; gradient ascent update; replay buffer; soft update |
| TRPO | Good | Monotonic update; gradient ascent update |
| PPO | Good | Monotonic update; gradient ascent update |
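As a small illustration of the "soft" target update mentioned above, the following PyTorch snippet is a hypothetical sketch; the parameter names and the value of tau are illustrative assumptions.

```python
import torch

@torch.no_grad()
def soft_update(online_net, target_net, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
        p_targ.mul_(1.0 - tau).add_(tau * p)

# Dummy usage (DDPG applies the same update to both actor and critic target networks):
actor, actor_target = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)
actor_target.load_state_dict(actor.state_dict())   # targets start as copies
soft_update(actor, actor_target)
```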
VII. Future directions of robotic motion planning
Here we first introduce a common but complex real-world motion planning task: how to realize long-distance motion planning with safety and efficiency (e.g. long-distance luggage delivery by robots)? Research questions and directions are then obtained by analyzing this task according to the processing steps of data collection, data preprocessing, motion planning and decision making (Fig. 25).

Figure 25. Processing steps for a motion planning task.
Data collection: to realize the mentioned task, we first consider: (1) how to collect enough data? (2) how to collect high-quality data? To collect enough data in a short time, data can be collected with the multi-thread method or cloud technology; existing techniques seem sufficient to solve this question well. To collect high-quality data, existing works use a prioritized replay buffer [80] to reuse high-quality data for network training (a minimal sketch follows). Imitation learning [79-80] is also used to collect high-quality data for network initialization so that the network converges faster (e.g. deep V learning [81-82]). Existing methods in data collection already work well, so it is hard to make further optimizations here.
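The following is a hypothetical minimal proportional prioritized replay buffer in Python, illustrating the idea of reusing high-quality (high TD-error) experience; the data layout and hyperparameters are assumptions for illustration and are not taken from the cited works.

```python
import random

class PrioritizedReplayBuffer:
    """Minimal proportional prioritized replay: sampling probability ~ priority^alpha."""
    def __init__(self, capacity=10000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        priority = (abs(td_error) + 1e-6) ** self.alpha
        if len(self.data) >= self.capacity:          # drop the oldest transition
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size=32):
        idx = random.choices(range(len(self.data)), weights=self.priorities, k=batch_size)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha
```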
Data preprocessing: data fusion and data translation should be considered after the data are obtained. Multi-sensor data fusion algorithms [84] fuse data collected from sensors of the same or different types. Data fusion is realized at the pixel, feature, and decision levels, so a partial understanding of the environment is avoided. Another way to avoid a partial understanding of the environment is data translation, which interprets data into a new format so that algorithms gain a better understanding of the relationship between the robot and surrounding obstacles (e.g. attention weights [82] and relation graphs [83]). However, existing data fusion and translation algorithms do not fit all cases, so further work is needed according to the application environment.
Motion planning: in this step, the selection and optimization of motion planning algorithms should be considered. (1) If traditional motion planning algorithms (e.g. A*, RRT) are selected for the task mentioned above, a topological or global trajectory from the start to the destination can be obtained, but this process is computationally expensive because of the large search space. To solve this problem, the combination of traditional algorithms with other ML algorithms (e.g. CNN, DQN) may be a good choice. For example, RRT can be combined with DQN (Fig. 26) by using the action value to predict the direction of tree expansion instead of heuristic or random search (a hedged sketch of this idea follows). (2) It seems impossible to accomplish the task above safely and quickly with supervised learning alone, because a topological path cannot be obtained from one-step predictions. (3) A topological path also cannot be obtained by optimal value RL or policy gradient RL, but their local performance in safety and efficiency is good, since a trained RL policy gives quick reactions, a safe distance from obstacles, and a short path or time. However, it is time-consuming to train an RL policy because of deficiencies in network convergence. Existing works have made some optimizations to improve convergence (e.g. DDPG, PPO) in games and thereby shorten the training time of RL, but there is still a long way to go in real-world applications. A recent trend to improve convergence is to create hybrid architectures that fuse high-performance components (e.g. replay buffer, actor-critic architecture, policy entropy, multi-thread method).
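As one possible reading of the RRT+DQN idea above, the following Python sketch replaces the random steering direction of RRT with the arg-max action of a Q-network; everything here (the state encoding, the discrete direction set, `q_net`) is a speculative assumption used only to illustrate the fusion, not an implementation from the literature.

```python
import math, random
import torch

DIRECTIONS = [i * math.pi / 4 for i in range(8)]   # 8 discrete expansion headings

def expand_tree_once(tree, goal, q_net, step=0.5, epsilon=0.1, collides=lambda p: False):
    """One RRT-style expansion where a Q-network (instead of random search)
    proposes the heading from the node nearest to a sampled point."""
    sample = (random.uniform(0, 10), random.uniform(0, 10))     # random workspace sample
    nearest = min(tree, key=lambda n: math.dist(n, sample))     # classic nearest-node step
    state = torch.tensor([*nearest, *sample, *goal], dtype=torch.float32)
    if random.random() < epsilon:                               # keep some exploration
        heading = random.choice(DIRECTIONS)
    else:
        heading = DIRECTIONS[int(q_net(state).argmax())]        # value-guided direction
    new_node = (nearest[0] + step * math.cos(heading),
                nearest[1] + step * math.sin(heading))
    if not collides(new_node):
        tree.append(new_node)
    return tree

# Dummy usage with an untrained Q-network over the 6-dimensional state above:
q_net = torch.nn.Linear(6, len(DIRECTIONS))
tree = [(0.0, 0.0)]
for _ in range(100):
    expand_tree_once(tree, goal=(9.0, 9.0), q_net=q_net)
```

In a full system the Q-network would be trained so that its action values reward expansions that shorten the path and avoid obstacles, which is the role the paragraph above assigns to DQN within RRT.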
Decision: traditional algorithms (e.g. A*) feature topological trajectory planning, while optimal value RL and policy gradient RL feature safe and quick local motion planning. Combining a traditional algorithm with RL is therefore a good match for the task mentioned above: the overall robotic path can be expected to approximate the shortest path, while safety and efficiency are ensured simultaneously. However, this is engineering work rather than research work.

To conclude, Fig. 25 lists possible research directions, but attention for improving the performance of robotic motion planning is expected to focus on: (1) data fusion and translation of input features; (2) optimization of traditional planning algorithms to reduce the search space by combining them with supervised learning or RL; (3) optimization of network convergence for RL.

Figure 26. Fusion of DQN or CNN with RRT.
VIII. Conclusion
This paper carefully analyzes the principles of motion planning algorithms in sections II-V. These algorithms include traditional planning algorithms, supervised learning, optimal value RL and policy gradient RL. Direct comparisons of these algorithms are made in section VI according to their principles, providing a clear understanding of the mechanisms of motion planning algorithms. Analytical comparisons of these algorithms are also made in section VI according to new criteria that include local or global planning, path length, optimal velocity, reaction speed, safe distance, and time-sequential path; hence the general performance of these algorithms and their potential application domains are obtained. We specially compare the convergence speed and stability of optimal value RL and policy gradient RL in section VI because they are the recent research focus of robotic motion planning, providing a detailed and clear understanding of these algorithms in terms of network convergence. Finally, we analyze a common motion planning task, long-distance motion planning with safety and efficiency (e.g. long-distance luggage delivery by robots), according to the processing steps of data collection, data preprocessing, motion planning and decision making. Potential research directions are thereby obtained, and we hope they are useful in paving the way for further improvements of motion planning algorithms and motion planning systems.
References [1] H. Bai, S. Cai, N. Ye, D. Hsu, W. S. Lee. Intention-aware online POMDP planning for autonomousdriving in a crowd. 2015
IEEE International Conference on Robotics and Automation (ICRA), Seattle,WA, , pp. 454-460.[2] M. Everett, Y. Chen, J. P. How. Motion planning among dynamic, decision-making robots withdeep reinforcement learning. (IROS), Madrid, , pp. 3052-3059.[3] C. Paxton, V. Raman, G. D. Hager and M. Kobilarov. Combining neural networks and tree searchfor task and motion planning in challenging environments. , Vancouver, BC, , pp. 6059-6066.[4] M. Inoue, T. Yamashita, T. Nishida. Robot path planning by LSTM network under changingenvironment. In: Bhatia S., Tiwari S., Mishra K., Trivedi M. (eds) Advances in ComputerCommunication and Computational Sciences.
Advances in Intelligent Systems and Computing . ,vol 759.[5] Z. Wang, N. Freitas, M. Lanctot. Dueling network architectures for deep reinforcement learning.arXiv , arXiv:1511.06581 [cs.LG].[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to algorithms, third edition. The MITPress,
[7] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271.[8] P. E. Hart, N. J. Nilsson, B. Raphael. A formal basis for the heuristic determination of minimum cost paths.
IEEE Transactions on Systems Science and Cybernetics, , 4(2):100-107.[9] L. Wang. On the euclidean distance of image.
IEEE Transactions on Pattern Analysis and MachineIntelligence, , 27(8):1334-1339.[10] A. Stentz. Optimal and efficient path planning for partially-known environments.
Robotics andAutomation, .[11] D. Ferguson, A. Stentz. Using interpolation to improve path planning: the field D* algorithm.
Journal of Field Robotics, , 23(2):79-101.[12] K. Daniel, A. Nash, S. Koenig, A. Felner. Theta*: any-angle path planning on grids.
Journal ofArtificial Intelligence Research, , 39(1):533-579.[13] M. Likhachev, D. Ferguson, G. Gordon, A. Stentz, S. Thrun. Anytime search in dynamic graphs.
Artificial Intelligence, , 172(14):1613-1643.[14] M. Montemerlo, J. Becker, S. Bhat, et al. Junior: the stanford entry in the urban challenge.
Journalof Field Robotics, .[15] D. Ferguson, T. M. Howard, M. Likhachev. Motion planning in urban environments.
Journal ofField Robotics, , 25(11-12):939-960.[16] J. Ziegler, C. Stiller. Spatiotemporal state lattices for fast trajectory planning in dynamic on-roaddriving scenarios.
IEEE/RSJ International Conference on Intelligent Robots and Systems, .[17] S. M. LaValle, J. J. Kuffner. Randomized kinodynamic planning.
The International Journal ofRobotics Research, . [18] G. D. Bautista, J. Perez, V. Milanés, F. Nashashibi. A review of motion planning techniques forautomated vehicles. IEEE Transactions on Intelligent Transportation Systems, , 17(4):1-11.[19] J. H. Jeon, R. V. Cowlagi, S. C. Peter, et al. Optimal motion planning with the half-car dynamicalmodel for autonomous high-speed driving.
American Control Conference (ACC), .[20] L. E. Kavraki, P. Svestka, J. C. Latombe, , M. H. Overmars. Probabilistic roadmaps for pathplanning in high-dimensional configuration spaces.
IEEE Transactions on Robotics and Automation, , 12(4):566-580.[21] J. A. Reeds, L. A. Shepp. Optimal paths for a car that goes both forward and backward.
PacificJournal of Math, .[22] J. Funke, P. Theodosis, R. Hindiyeh, et al. Up to the limits: Autonomous Audi TTS. 2012 IEEEIntelligent Vehicles Symposium (IV), Alcala de Henares, 2012, pp. 541-547.[23] W. Xu, J. Wei, J. M. Dolan, H. Zhao, H. Zha. A real-time motion planner with trajectoryoptimization for autonomous vehicles.
IEEE International Conference on Robotics and Automation, .[24] D. G. Bautista, J. Perez, R. A. Lattarulo, V. Milanes, F. Nashashibi. Continuous curvatureplanning with obstacle avoidance capabilities in urban scenarios.
IEEE International Conference onIntelligent Transportation Systems, .[25] R. T. Farouki, T. Sakkalis. Pythagorean-hodograph space curves.
Advances in ComputationalMathematics, , 2(1):41-66.[26] X. Lei, Z. Zhang, P. Dong. Dynamic path planning of unknown environment based on deepreinforcement learning.
Journal of Robotics, , 2018:1-10.[27] T. Jorgensen, A. Tamar. Harnessing reinforcement learning for neural motion planning. arXiv , arXiv:1906.00214 [cs.RO].[28] W. D. Smart, L. P. Kaelbling. Effective reinforcement learning for mobile robots.
IEEEInternational Conference on Robotics and Automation, , vol 4.[29] OpenAI Baselines: ACKTR and A2C. Web. August 18, .https://openai.com/blog/baselines-acktr-a2c.[30] A. I. Panov, K. S. Yakovlev, R. Suvorov. Grid path planning with deep reinforcement learning:preliminary results.
Procedia Computer Science, , 123:347-353.[31] D. Isele, A. Cosgun, K. Subramanian, K. Ffjimura. Navigating occluded intersections withautonomous vehicles using deep reinforcement learning. arXiv , arXiv:1705.01196 [cs.AI].[32] B. Kim, L. P. Kaelbling, T. Lozano-Perez. Adversarial actor-critic method for task and motionplanning problems using planning experience.
AAAI Conference on Artificial Intelligence (AAAI),
IEEE Robotics and Automation Society, .[34] Y. Lecun, L. Bottou, Y. Bengio, H. Patrick. Gradient-based learning applied to documentrecognition.
Proceedings of the IEEE, , 86(11):2278-2324.[35] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition.
Las Vegas, NV, 2016, pp. 770-778.[36] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, J. Kautz. Reinforcement learning throughasynchronous advantage Actor-Critic on a GPU. arXiv , arXiv:1611.06256 [cs.LG].[37] J. Minguez, F. Lamiraux, J. P. Laumond.
Motion planning and obstacle avoidance. Springer Handbook of Robotics. Springer International Publishing. [38] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Playing Atari with deep reinforcement learning. arXiv, arXiv:1312.5602 [cs.LG]. [39] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning.
Nature, , 518, 529-533.[40] V. H. Hasselt, A. Guez, D. Silver. Deep reinforcement learning with double Q-learning.
ComputerScience, . [41] T. Schaul, J. Quan, I. Antonoglou, D. Silver. Prioritized experience replay. International Conferenceon Learning Representations (ICLR) , .[42] Z. Sui, Z. Pu, J. Yi, X. Tian. Path Planning of multiagent constrained formation through deepreinforcement Learning. , .[43] R. Sutton, D. A. Mcallester, S. Singh, Y. Mansour. Policy gradient methods for reinforcementlearning with function approximation. In Proceedings of the 12th International Conference on NeuralInformation Processing Systems , MIT Press, Cambridge, MA, USA, , 1057–1063.[44] V. R. Konda, J. N. Tsitsiklis. Actor-critic algorithms.
Society for Industrial and Applied Mathematics, , vol 42.[45] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu.Asynchronous methods for deep reinforcement learning. arXiv , arXiv:1602.01783 [cs.LG].[46] R. Munos, T. Stepleton, A. Harutyunyan, M. G. Bellemare. Safe and efficient off-policyreinforcement learning. arXiv , arXiv:1606.02647 [cs.LG].[47] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller. Deterministic policy gradientalgorithms.
In Proceedings of the 31st International Conference on International Conference on MachineLearning, , 32:387-395.[48] Y. Chao, X. Xiang, C. Wang. Towards real-time path planning through deep reinforcementlearning for UAV in dynamic environment.
Journal of Intelligent and Robotic Systems, , 98,297-309.[49] W. Gao, D. Hus, W. S. Lee, S. Shen, K. Subramanian. Intention-net: integrating planning and deeplearning for goal-directed autonomous navigation. arXiv , arXiv:1710.05627 [cs.AI].[50] A. H. Qureshi, A. Simeonov, M. J. Bency, M. C. Yip. Motion planning networks. arXiv ,arXiv:1806.05767 [cs.RO].[51] H. Bae, G. Kim, J. Kim, D. Qian, S. Lee. Multi-robot path planning method using reinforcementlearning.
Applied Science. , 9, 3057.[52] R. S. Sutton., H. R. Maei, D. Precup, et al. Fast gradient-descent methods for temporal-differencelearning with linear function approximation.
In 26th International Conference on Machine Learning ,Montreal, Canada, .[53] T. Evgeniou, M. Pontil.
Support vector machines: theory and applications . .[54] R. Meyes, H. Tercan, S. Roggendorf, T. Thomas, B. Christian, O. Markus, B. Christian, J. Sabina,M. Tobias. Motion planning for industrial robots using reinforcement learning. In 50th CIRPConference on Manufacturing Systems , , 63:107-112.[55] G. A. Rummery, M. Niranjan. On-line Q-learning using connectionist systems. Department ofEngineering, University of Cambridge, Cambridge, .[56] J. N. Tsitsiklis. On the convergence of optimistic policy iteration. Journal of Machine LearningResearch , , 3(1):59-72.[57] J. D. McCaffrey. The epsilon-greedy Algorithm. Web. 30 November .https://jamesmccaffrey.wordpress.com/2017/11/30/the-epsilon-greedy-algorithm.[58] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, withrelationships to statistical pattern recognition. In: F.Fogleman Soulie and J.Herault (eds.), Neurocomputing: Algorithms, Architectures and Applications , Berlin: Springer-Verlag, , 227-236.[59] P. Parul. Understanding the mathematics behind gradient descent. Web. 18 March .https://towardsdatascience.com/understanding-the-mathematics-behind-gradient-descent-dde5dc9be06e.[60] К. C. Chan, C. T. Lenard, T. M. Mills. An introduction to Markov chains.
In 49th AnnualConference of Mathematical Association of Victoria , Melbourne, , 40-47.[61] M. G. Bellemare, W. Dabney, M. Rémi. A distributional perspective on reinforcement learning.arXiv , arXiv:1707.06887 [cs.LG].[62] M. Fortunato, M. G. Azar, B. Piot, et al. Noisy networks for exploration. arXiv ,arXiv:1706.10295 [cs.LG]. [63] M. Hessel, J. Modayil, H. H. Van, et al. Rainbow: combining improvements in deep reinforcementlearning. arXiv , arXiv:1710.02298 [cs.AI].[64] R. Mariescu. I., P. Franti. CellNet: inferring road networks from GPS Trajectories.
ACMTransactions on Spatial Algorithms and Systems, , 4, 3.[65] T. A. Nugraha. Pathfinding through urban traffic using Dijkstra’s Algorithm. .[66] R. S. Sutton, A.G. Barto.
Reinforcement learning: an introduction . MIT Press, .[67] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess,T. Erez, Y. Tassa, D. Silver, D. Wierstra. Continuouscontrol with deep reinforcement learning. arXiv , arXiv:1509.02971 [cs.LG].[68] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcementlearning.
Machine Learning. [69] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, P. Abbeel. Trust region policy optimization. arXiv, arXiv:1502.05477v5 [cs.LG]. [70] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. Proximal policy optimization algorithms. arXiv, arXiv:1707.06347v2 [cs.LG]. [71] J. Weston, C. Watkins. Multi-class support vector machines. Technical report, Department of Computer Science, Royal Holloway, University of London, May 20. [72] S. Hochreiter, J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780. [73] N. Arbel. How LSTM networks solve the problem of vanishing gradients. Web. Dec 21, 2018. https://medium.com/datadriveninvestor/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients. [74] D. Silver, A. Huang, C. J. Maddison, et al. Mastering the game of Go with deep neural networks and tree search. Nature,
529 (7587): 484–489.[75] M. H. Kalos and P. A. Whitlock. Monte Carlo Methods. Wiley-VCH. ISBN 978-3-527-40760-6, .[76] R. Coulom. Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search.
Computers andGames, 5th International Conference,
CG 2006, Turin, Italy, May 29–31, .[77] A. Suran. Dueling Double Deep Q Learning using Tensorflow 2.x. Web. Jul 10, .https://towardsdatascience.com/dueling-double-deep-q-learning-using-tensorflow-2-x-7bbbcec06a2a.[78] S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In
ICML , , vol 2, pp. 267–274.[79] F. Codevilla, M. Müller, A. López, V. Koltun and A. Dosovitskiy. End-to-end driving viaconditional imitation learning. IEEE International Conference on Robotics and Automation (ICRA ), , pp. 4693-4700.[80] J. Oh, Y. Guo, S. Singh and H. Lee. Self-imitation learning. arXiv , arXiv:1806.05635v1 [cs.LG].[81] Y. Chen, M. Liu, M. Everett and J. P. How. Decentralized non-communicating multiagentcollision avoidance with deep reinforcement learning. arXiv , arXiv: 1609.07845v2 [cs.MA].[82] C. Chen, Y. Liu, S. Kreiss and A. Alahi. Crowd-robot interaction: crowd-aware robot navigationwith attention-based deep reinforcement learning. International Conference on Robotics andAutomation (ICRA), , pp. 6015-6022.[83] C. Chen, S. Hu, P. Nikdel, G. Mori and M. Savva. Relational graph learning for crowd navigation.arXiv , arXiv:1909.13165v3 [cs.RO].[84] H. Durrant-Whyte and T. C. Henderson. Multi-sensor data fusion. In: Siciliano B., Khatib O. (eds)Springer Handbook of Robotics. Springer Handbooks. Springer, Cham, .https://doi.org/10.1007/978-3-319-32552-1_35.[85] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang and J. Pan. Towards optimally decentralizedmulti-robot collision avoidance via deep reinforcement learning. arXiv2018