Multi-Agent Path Planning based on MPC and DDPG
Junxiao Xue, Xiangyan Kong, Bowei Dong, and Mingliang Xu
Abstract—The problem of mixed static and dynamic obstacle avoidance is essential for path planning in highly dynamic environments. However, the paths formed by grid edges can be longer than the true shortest paths in the terrain since their headings are artificially constrained, and existing methods can hardly deal with dynamic obstacles. To address this problem, we propose a new algorithm combining Model Predictive Control (MPC) with Deep Deterministic Policy Gradient (DDPG). Firstly, we apply the MPC algorithm to predict the trajectories of dynamic obstacles. Secondly, a DDPG with a continuous action space is designed to provide learning and autonomous decision-making capability for robots. Finally, we introduce the idea of the Artificial Potential Field to design the reward function, which improves convergence speed and accuracy. We employ Unity 3D to perform simulation experiments in highly uncertain environments such as aircraft carrier decks and squares. The results show that our method improves accuracy by 7%-30% compared with the other methods, and reduces the path length by about 100 units and the turning angle by 400-450 degrees compared with DQN (Deep Q Network).
Index Terms—path planning, obstacle avoidance, MPC, Artificial Potential Field, Unity 3D
I. INTRODUCTION

In many dangerous and complex environments, path planning and obstacle avoidance still rely on manual decision-making. This approach consumes a lot of human and material resources and plans paths inefficiently. Moreover, in environments such as carrier aircraft moving on a carrier deck or post-disaster rescue, a single human mistake is likely to cause immeasurable losses and injuries. Hence, agents need the ability to plan paths and avoid obstacles autonomously in complex environments.

The core idea of path planning is to explore an optimal or sub-optimal collision-free path according to certain evaluation criteria in an obstacle environment [1], [2]. Early path planning is mostly graph-based [3]–[6], e.g., the visibility graph (V-graph) and configuration space (C-space) [3]. However, graph-based methods cannot deal with dynamic obstacles. The Artificial Potential Field [7], Rapidly-exploring Random Tree [8] and D* [9] can be used in dynamic environments. However, the Artificial Potential Field suffers from local minima, and the Rapidly-exploring Random Tree cannot handle continuous, highly dynamic environments and usually produces sub-optimal, non-smooth paths.
J. Xue and X. Kong are with the School of Software, Zhengzhou University. B. Dong and M. Xu are with the School of Information Engineering, Zhengzhou University.

The D* algorithm is effective at finding a path in a dynamic environment. However, while moving toward the target it only checks changes of the next node or of neighboring nodes on the shortest path, so D* cannot handle changes at nodes far away from the agent.

Reinforcement Learning (RL) has emerged as a new research direction that can grant the agent sufficient intelligence to make local decisions and accomplish necessary tasks. In [10] and [11], the authors presented a Q-learning algorithm [12] to solve the autonomous navigation problem of UAVs; afterwards, Q-learning was also employed to establish paths while avoiding static or dynamic obstacles in [13] and [14]. Q-learning performed well in these environments. However, in complex and highly dynamic environments, Q-learning hardly converges due to the curse of dimensionality.

In recent years, artificial intelligence technology has been rapidly developed and applied. DRL [15] combines the advantages of Deep Learning (DL) [16] and Reinforcement Learning (RL) [17], and provides a solution to the perceptual decision-making problem of complex systems. Liu et al. [18] utilized the Deep Q Network (DQN) [19] algorithm combined with a greedy strategy to support agent trajectory planning and optimization in a large-scale complex environment. Using the value-based Dueling Double DQN (DDQN) [20], Zeng [21] introduced a high-precision navigation technique for UAVs, which improved the efficiency of target tracking. However, these authors used discrete actions (i.e., the environment is modeled as a grid world with a limited agent action space and limited degrees of freedom), which may reduce efficiency in real-world environments, where the agent operates in a continuous action space.

To address these problems, this paper proposes a new algorithm combining MPC [22] (Model Predictive Control) with DDPG [23], which is based on the policy gradient [24]. The main contributions of this research are as follows:

• A DDPG-based obstacle avoidance method is proposed, which enables the agent to keep trial and error during the training process and learn a strategy to reach the destination without collision. To improve learning efficiency, the idea of the Artificial Potential Field (the obstacles and the target impose repulsion and attraction on the agent, respectively) is used to design the reward function.

• Considering the motion constraints of the agent in reality, the agent is modeled according to a kinematic model. The MPC algorithm is then used to predict the trajectory u(t).

• The experimental results show that our algorithm improves accuracy by 7%-30% compared with the other algorithms, and reduces the path length by about 100 units and the turning angle by 400-450 degrees compared with DQN (Deep Q Network).

The structure of this paper is as follows. Section 2 introduces the path planning algorithm based on DDPG and MPC, including background knowledge, the algorithm update, the reward function, the state space and the action space. The simulation is presented in Section 3, where Unity3D is used to simulate an aircraft carrier deck and a square. Section 4 gives the experimental results, comparing the model with DDPG, DQN and A2C on accuracy, path length, reward and path smoothness. The summary is given in Section 5.
II. PATH PLANNING ALGORITHM BASED ON DDPG AND MPC

In this section, we present a new algorithm based on DDPG and MPC. First, we introduce the background knowledge, including MPC, DRL and DDPG. Then we analyze the algorithmic principle according to the algorithmic framework, which is divided into three parts: trajectory prediction, action selection and update. Finally, the state space, action space and reward function under this framework are defined and explained.
A. Background knowledge

1) MPC:
Model Predictive Control (MPC) is a type of control algorithm developed in the 1970s. The algorithm adopts control strategies such as multi-step prediction, rolling optimization and feedback correction. Its purpose is to minimize a cost function so that the prediction becomes more accurate. More precisely, in discrete time, model predictive control can be formulated as follows:
$$
\begin{cases}
x(t+1) = f\bigl(x(t), u(t), w(t)\bigr) \\
y(t) = g\bigl(x(t), u(t)\bigr)
\end{cases}
\quad (1)
$$
where $u(t)$ is the control input, $w(t)$ is the control error caused by noise or disturbance, $x(t)$ is the current state, and $\{y(k)\}_{t}^{t+N}$ is the prediction information calculated by the predictive model.

To bring each prediction module and the predicted result closer to the true value, we perform rolling optimization on the model. The cost (objective) function is defined as $J\bigl(\{x_k\}_{t}^{t+N}, \{y_k\}_{t}^{t+N}, \{u_k\}_{t}^{t+N}\bigr)$, which involves the future state trajectory $\{x_k\}_{t}^{t+N}$, the output trajectory $\{y_k\}_{t}^{t+N}$ and the control effort $\{u_k\}_{t}^{t+N}$. The optimal control input is generated by minimizing the cost function:
$$
\{u^{*}_k\}_{t}^{t+N} = \arg\min_{\{u_k\}_{t}^{t+N}} J\bigl(\{x_k\}_{t}^{t+N}, \{y_k\}_{t}^{t+N}, \{u_k\}_{t}^{t+N}\bigr) \quad (2)
$$
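To make the rolling-horizon idea of Eqs. (1)-(2) concrete, the sketch below shows one possible way to predict an obstacle's short-term trajectory by minimizing a quadratic cost over a finite horizon. It is only an illustrative sketch, not the implementation used in this paper: the constant-velocity kinematic model, the cost terms and all names (predict_obstacle_trajectory, N, DT) are our own assumptions.

```python
# A minimal rolling-horizon sketch of Eqs. (1)-(2): roll a kinematic model
# forward and pick the control sequence {u_k} that minimizes a quadratic
# cost J over the horizon. The model, cost and all names/values here are
# illustrative assumptions, not the paper's exact setup.
import numpy as np
from scipy.optimize import minimize

DT = 0.1  # discretization step (assumed)
N = 5     # prediction horizon (assumed)

def step(state, u):
    """x(t+1) = f(x(t), u(t)) with state (px, py, vx, vy) and control (ax, ay)."""
    px, py, vx, vy = state
    ax, ay = u
    return np.array([px + vx * DT, py + vy * DT, vx + ax * DT, vy + ay * DT])

def rollout(state0, controls):
    """Predicted positions {y_k}_t^{t+N} obtained by applying the controls."""
    positions, s = [], np.asarray(state0, dtype=float)
    for u in controls.reshape(N, 2):
        s = step(s, u)
        positions.append(s[:2])
    return np.array(positions)

def predict_obstacle_trajectory(history):
    """Predict the next N positions of a dynamic obstacle from its history."""
    history = np.asarray(history, dtype=float)
    v0 = (history[-1] - history[-2]) / DT                  # velocity estimate
    state0 = np.concatenate([history[-1], v0])
    reference = history[-1] + np.outer(np.arange(1, N + 1) * DT, v0)

    def cost(controls):                                    # J({x_k},{y_k},{u_k})
        pred = rollout(state0, controls)
        return np.sum((pred - reference) ** 2) + 1e-2 * np.sum(controls ** 2)

    best = minimize(cost, np.zeros(2 * N), method="L-BFGS-B")  # Eq. (2): argmin J
    return rollout(state0, best.x)

if __name__ == "__main__":
    print(predict_obstacle_trajectory([(0.0, 0.0), (0.1, 0.05), (0.2, 0.1)]))
```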
2) Deep Reinforcement Learning:
DRL [15] combines the perceptual ability of DL [16] with the decision-making ability of RL [17] and can solve high-dimensional decision-making problems. It is currently an extremely popular research direction in the field of machine learning. As shown in Fig. 1, the agent interacts with the environment to obtain a high-dimensional observation; deep learning evaluates the value function of each action based on the expected return and completes the state-to-action mapping through a certain strategy. The environment then reacts to this action and returns the next observation.

Fig. 1: The framework of DRL.
3) Deep Deterministic Policy Gradient:
DDPG is an extension of DQN based on the Actor-Critic framework, and it uses the experience replay method to collect samples. It solves the problems that DQN cannot deal with continuous actions and that the Actor-Critic algorithm is difficult to converge. It adopts a four-network structure combining the policy gradient and the value function: the policy gradient is used to select an action from the continuous action space, and the value function is used to evaluate the selected action.

DDPG adopts a delayed update method; however, unlike the hard update of DQN, it uses soft updates for the target networks. More precisely, the online Critic network is updated by minimizing the mean square error, the online Actor (policy) network is updated by maximizing the Q value, and the parameters of the target networks are then updated by a soft update strategy.
B. Algorithmic Framework

1) Trajectory prediction:
Trajectory prediction refers to predicting short-term future coordinates from historical coordinates. In our algorithm, MPC (Model Predictive Control) is used to predict the future coordinates, which has the advantages of good control performance and strong robustness. As shown in Fig. 2, the framework of our algorithm, the prediction model of MPC takes the current state $s$ and the historical trajectory $\{x_k\}_{t}^{t+N}$ of the dynamic obstacles as input, and the output $\{y_k\}_{t}^{t+N+1}$ is calculated by the MPC prediction model. At each time step, the output $y_k$ of the prediction model and the historical trajectory $x_k$ are used to optimize the model. Finally, the predicted trajectory vector $\{y_k\}_{t}^{t+N+1}$ and the current environment state vector $s$ are concatenated as the final output.
2) Action selection:
Fig. 2: The framework of our algorithm.

Action selection is the decision-making essence of reinforcement learning. As shown in Fig. 2, the framework of our algorithm, the DDPG takes the predicted environmental state $s(t)$ as input, and the output action is calculated by the online policy network of the Actor. After the agent performs the action, the reward is given by our reward function. Afterwards, the current predicted environment state, the action, the reward and the next predicted environment state $(s(t), a_t, r_t, s(t+1))$ are stored in the experience replay memory.
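As a rough illustration of how the transitions $(s(t), a_t, r_t, s(t+1))$ could be stored and sampled, a minimal experience replay memory might look as follows; the class name, capacity and batch size are our own illustrative choices, not the paper's code.

```python
# A small experience replay memory sketch: it stores transitions
# (s, a, r, s_next, done) and samples random mini-batches for DDPG updates.
# Capacity and batch size below are illustrative assumptions.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions dropped first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))  # lists of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)
```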
3) Update:
DDPG is an extension of DQN [19] based on the Actor-Critic framework. It retains the experience replay memory and the dual-network structure of DQN. However, unlike the hard update of DQN, it uses the soft update method to update the target networks.

When updating the Actor and Critic networks, a mini-batch of N transitions is sampled from the experience replay memory. The loss function $L(\theta^{Q})$ of the Critic network is then calculated from the current target Q value $y_t$, while the Actor network is updated by the policy gradient method:
$$
y_t =
\begin{cases}
R_t & is\_end_t \ \text{is true} \\
R_t + \gamma\, Q'\bigl(\phi(S'_t), \pi_{\theta'}(\phi(S'_t)) \mid \omega'\bigr) & is\_end_t \ \text{is false}
\end{cases}
\quad (3)
$$
$$
L(\theta^{Q}) = \frac{1}{N} \sum_{t=1}^{N} \bigl(y_t - Q(s_t, a_t \mid \theta^{Q})\bigr)^{2} \quad (4)
$$
$$
\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{t=1}^{N} \nabla_{a_t} Q(s_t, a_t \mid \theta^{Q})\, \nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu}) \quad (5)
$$
Finally, the two target network parameters $\theta^{\mu'}$ and $\theta^{Q'}$ are updated by the soft update strategy:
$$
\begin{cases}
\theta^{Q'} = \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'} \\
\theta^{\mu'} = \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'}
\end{cases}
\quad (6)
$$
where $\tau$ is a configurable constant coefficient used to regulate the soft update.
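The following PyTorch sketch illustrates one update step corresponding to Eqs. (3)-(6): the target value, the critic's mean-square-error loss, the actor's policy-gradient step and the soft update of the target networks. Network sizes, learning rates and the hyperparameters GAMMA and TAU are illustrative assumptions, not the paper's exact settings.

```python
# Minimal DDPG update-step sketch (PyTorch) for Eqs. (3)-(6).
# All dimensions, names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2   # assumed sizes
GAMMA, TAU = 0.99, 0.005       # assumed discount and soft-update factors

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM), nn.Tanh())  # bounded actions
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

actor, critic = Actor(), Critic()
target_actor, target_critic = Actor(), Critic()
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def soft_update(target, online, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta'  (Eq. 6)."""
    for tp, p in zip(target.parameters(), online.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def ddpg_update(s, a, r, s_next, done):
    """One update on a mini-batch; s, a, r, s_next, done are batched tensors."""
    # Eq. (3): target value y_t from the target networks.
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * target_critic(s_next, target_actor(s_next))
    # Eq. (4): critic updated by minimizing the mean-square error.
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Eq. (5): actor updated by maximizing Q (gradient ascent on Q).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Eq. (6): soft update of both target networks.
    soft_update(target_critic, critic)
    soft_update(target_actor, actor)
```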
C. State space

The state space of the agent represents the valuable information the agent can obtain before decision-making, which is used to help the agent assess the situation. For the agent to better understand the changing environment, we divide the state into the current environmental state and the predicted environmental state. The state value can be written as $s(t): \{s, u(t)\}$.
1) The current environmental state:
$$
s : \{(d_{px}, d_{py}), (d_{o_1 x}, d_{o_1 y}), \ldots, (d_{o_m x}, d_{o_m y})\} \quad (7)
$$
where $(d_{px}, d_{py})$ represents the distance between the coordinates of the agent and the target, and $(d_{o_1 x}, d_{o_1 y})$ to $(d_{o_m x}, d_{o_m y})$ represent the distances between the coordinates of the agent and the static obstacles.
2) The predicted environmental state:
$$
u(t) : \{(d_{o'_1 x}, d_{o'_1 y}), (d_{o'_2 x}, d_{o'_2 y}), \ldots, (d_{o'_n x}, d_{o'_n y})\} \quad (8)
$$
where $(d_{o'_1 x}, d_{o'_1 y})$ to $(d_{o'_n x}, d_{o'_n y})$ represent the distances between the predicted coordinates of the dynamic obstacles and the coordinates of the agent.
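For illustration, the state vector $s(t): \{s, u(t)\}$ of Eqs. (7)-(8) could be assembled as follows; the function and variable names (build_state, predicted_dynamic, etc.) are our own and only sketch the idea of concatenating the current and predicted distance components.

```python
# Illustrative sketch of assembling the state vector s(t) = {s, u(t)}:
# per-axis offsets from the agent to the target, to the static obstacles,
# and to the MPC-predicted positions of the dynamic obstacles.
# All names here are our own assumptions.
import numpy as np

def build_state(agent_pos, target_pos, static_obstacles, predicted_dynamic):
    """Return the flat state vector fed to the DDPG actor/critic."""
    agent = np.asarray(agent_pos, dtype=float)
    parts = [np.asarray(target_pos, dtype=float) - agent]                     # (d_px, d_py)
    parts += [np.asarray(o, dtype=float) - agent for o in static_obstacles]   # current state s
    parts += [np.asarray(o, dtype=float) - agent for o in predicted_dynamic]  # predicted state u(t)
    return np.concatenate(parts)

# Example with one static obstacle and one predicted dynamic obstacle.
print(build_state((0, 0), (10, 5), [(3, 1)], [(2, -1)]))
```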
D. Agent action space

The action space represents the set of actions that the agent can perform. We set the action space as:
$$
A : (x, y), \quad x, y \in (-1, 1) \quad (9)
$$
The distances that the agent moves in the x and y directions, $X$ and $Y$, are obtained by multiplying $x$ and $y$ by a fixed scaling factor, respectively.
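A minimal sketch of mapping a continuous action $(x, y)$ to a displacement is shown below; since the paper's scaling constant is not given here, STEP is a purely illustrative placeholder.

```python
# Sketch of mapping the continuous DDPG action a = (x, y) to a planar
# displacement (X, Y). STEP is an assumed placeholder for the paper's
# scaling constant, which is not specified here.
import numpy as np

STEP = 1.0  # assumed scaling factor (not the paper's value)

def action_to_displacement(action):
    """Clip the raw actor output to the action space and scale it."""
    x, y = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    return x * STEP, y * STEP   # X = x * STEP, Y = y * STEP

print(action_to_displacement((0.4, -0.9)))
```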
E. Reward function

The reward in DRL, which is the feedback signal available for the agent's learning, is used to evaluate the action taken by the agent. A simple approach is to use a sparse reward, in which the agent gets a positive return only when the task is accomplished. However, this method is inefficient at collecting useful experience data to help the agent learn; accordingly, the network converges slowly and the agent may not learn an optimal strategy. In this paper, we introduce the idea of the Artificial Potential Field (the obstacles and the target impose repulsion and attraction on the agent, respectively) into the design of the reward function. Four types of reward are considered: (1) the reward of target attraction; (2) the reward of obstacle repulsion; (3) the collision reward; (4) the reward for reaching the target point. More precisely, these rewards are described as follows.

Fig. 3: Scene 1. Fig. 4: Scene 2. Fig. 5: Trajectory of Scene 1. Fig. 6: Trajectory of Scene 2. Fig. 7: Scene 3 simulation diagram. Fig. 8: Scene 3 simulation diagram.
1) The reward of target attraction:
The magnitude of the attraction is proportional to the distance between the agent and the target. As shown in Eqs. (10) and (11), we simplify the attraction of the Artificial Potential Field:
$$
r_1 =
\begin{cases}
L & D^{t}_{i,e} - D^{t+1}_{i,e} \ge L \\
D^{t}_{i,e} - D^{t+1}_{i,e} & l < D^{t}_{i,e} - D^{t+1}_{i,e} < L \\
l & D^{t}_{i,e} - D^{t+1}_{i,e} \le l
\end{cases}
\quad (10)
$$
$$
r_1 =
\begin{cases}
-l & D^{t}_{i,e} - D^{t+1}_{i,e} \ge -l \\
D^{t}_{i,e} - D^{t+1}_{i,e} & -L < D^{t}_{i,e} - D^{t+1}_{i,e} < -l \\
-L & D^{t}_{i,e} - D^{t+1}_{i,e} \le -L
\end{cases}
\quad (11)
$$
The reward of target attraction $r_1$ is usually proportional to the difference between $D^{t}_{i,e}$ (the distance between the agent and the target at time $t$) and $D^{t+1}_{i,e}$ (the distance between the agent and the target at time $t+1$). However, when the difference is too large or too small, $L$, $l$ or $-l$, $-L$ replace the difference as the reward. Equations (10) and (11) correspond to the agent approaching or moving away from the target, respectively.
2) The reward of obstacle repulsion:
Similar to the reward of target attraction, we also simplify the repulsion value of the Artificial Potential Field, as shown in Eqs. (12) and (13):
$$
r_2 =
\begin{cases}
H & D^{t}_{i,ob} - D^{t+1}_{i,ob} \ge H \\
D^{t}_{i,ob} - D^{t+1}_{i,ob} & h < D^{t}_{i,ob} - D^{t+1}_{i,ob} < H \\
h & D^{t}_{i,ob} - D^{t+1}_{i,ob} \le h
\end{cases}
\quad (12)
$$
$$
r_2 =
\begin{cases}
-h & D^{t}_{i,ob} - D^{t+1}_{i,ob} \ge -h \\
D^{t}_{i,ob} - D^{t+1}_{i,ob} & -H < D^{t}_{i,ob} - D^{t+1}_{i,ob} < -h \\
-H & D^{t}_{i,ob} - D^{t+1}_{i,ob} \le -H
\end{cases}
\quad (13)
$$
where $D^{t}_{i,ob}$ and $D^{t+1}_{i,ob}$ represent the distance between the agent and the obstacle at time $t$ and $t+1$, respectively, and $H$, $h$ or $-h$, $-H$ are the upper and lower limits of the reward value when the agent approaches or moves away from the obstacles.
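The clipping logic shared by Eqs. (10)-(13) can be sketched with a single helper that clamps the change in distance between two consecutive time steps; the bound names lo/hi stand in for the paper's l/L (attraction) and h/H (repulsion), and the numeric values in the example are illustrative only.

```python
# Minimal sketch of the clipped-difference reward of Eqs. (10)-(13):
# the change in distance between two consecutive time steps is used as the
# reward, clamped to [lo, hi] when positive and to [-hi, -lo] when negative.
import numpy as np

def clipped_difference(d_prev, d_curr, lo, hi):
    """Reward proportional to d_prev - d_curr, clamped as in Eqs. (10)-(13)."""
    diff = d_prev - d_curr
    if diff >= 0.0:                          # Eqs. (10)/(12): distance decreased
        return float(np.clip(diff, lo, hi))
    return float(np.clip(diff, -hi, -lo))    # Eqs. (11)/(13): distance increased

# Attraction term r1 (distance to target) and repulsion term r2 (distance to
# the nearest obstacle); the bound values below are illustrative only.
r1 = clipped_difference(d_prev=5.0, d_curr=4.2, lo=0.1, hi=1.0)
r2 = clipped_difference(d_prev=2.0, d_curr=2.5, lo=0.1, hi=1.0)
print(r1, r2)
```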
3) Collision reward:
As shown in Eq. (14), a fixed negative reward $r_3$ is given by the environment when the agent collides with an obstacle.
4) Reward for reaching the target point:
As shown in Eq. (15), a fixed positive reward $r_4$ is given by the environment when the agent reaches the target.
Fig. 9: The accuracy rate of scene 1: (a) DQN, (b) A2C, (c) DDPG, (d) our algorithm.

Finally, the total reward is defined in Eq. (16):
$$
R = r_1 \lambda_1 + r_2 \lambda_2 + r_3 \lambda_3 + r_4 \lambda_4 \quad (16)
$$
where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ represent the weights of the four kinds of reward values, respectively.
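For completeness, a sketch of composing the total reward of Eq. (16) from the four terms is given below; the weights and the collision/goal constants are illustrative placeholders, since their numeric values are not specified here.

```python
# Sketch of composing the total reward of Eq. (16) from the four terms.
# The weights and the collision/goal constants are illustrative placeholders.
def total_reward(r1, r2, collided, reached_goal,
                 weights=(1.0, 1.0, 1.0, 1.0),
                 collision_penalty=-1.0, goal_bonus=1.0):
    """R = r1*l1 + r2*l2 + r3*l3 + r4*l4 (Eq. 16)."""
    r3 = collision_penalty if collided else 0.0   # Eq. (14): collision term
    r4 = goal_bonus if reached_goal else 0.0      # Eq. (15): goal term
    l1, l2, l3, l4 = weights
    return r1 * l1 + r2 * l2 + r3 * l3 + r4 * l4

print(total_reward(r1=0.8, r2=-0.1, collided=False, reached_goal=False))
```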
III. SIMULATION

The path planning of agents and dynamic obstacle avoidance are important problems in computer simulation research, and they are widely used in crowd animation, computer games and scheme deduction. Among these scenarios, a pedestrian square has a large area, high crowd density, poor regularity of crowd movement and high mobility, while the deck space of an aircraft carrier is narrow, tasks are numerous, and the uninterrupted operation of carrier-based aircraft increases the uncertainty of space usage. It is a challenge to find a collision-free path from the starting point to the target in such highly uncertain environments. Therefore, in this section, we chose the aircraft carrier deck and the square as the objects of simulation to verify the efficiency of our algorithm, and Unity3D is used to model the simulation. Meanwhile, we randomly sampled an episode of data to describe the trajectories of the carrier-based aircraft and the dynamic obstacles in every scene.
A. Aircraft carrier deck simulation
As shown in Fig. 3 and Fig. 4, on the aircraft carrier deck the blue carrier-based aircraft needs to avoid dynamic and static obstacles (the green and yellow carrier-based aircraft represent dynamic and static obstacles, respectively) in real time to find a collision-free path from the start point to the target (the position of the white board). In scene 1 we set 5 dynamic obstacles and 15 static obstacles. As shown in Fig. 4, in order to verify the effectiveness of our algorithm in a highly dynamic environment, 4 dynamic obstacles were added randomly to scene 1 to obtain scene 2.

To simulate the movement of dynamic obstacles in a real scene, three assumptions were made:

Assumption 1: The movement of a dynamic obstacle is purposeful.

Assumption 2: Special locations in the scene (for example, a gas station or the command room) are set as the targets of the dynamic obstacles.

Assumption 3: Dynamic obstacles are given initial speeds. When an obstacle moves to a special position, it stays there for some time to perform operations; the operation time is given randomly. After that, a new speed is given randomly to the dynamic obstacle.

For the two scenes, we plotted the trajectories of the carrier-based aircraft and the dynamic obstacles in one episode. As shown in Fig. 5 and Fig. 6, the trajectories of the carrier-based aircraft and the dynamic obstacles are represented by the blue and green dashed lines, respectively. The final locations of the carrier-based aircraft and the dynamic obstacles are marked with red rectangles and numbers to identify them. In scene 1,
when the carrier-based aircraft predicted that it would collide with obstacle 3, an upward action was taken to avoid the obstacle successfully in real time. After that, the carrier-based aircraft predicted that a collision would occur with obstacle 4 and adjusted its direction to avoid it. In scene 2, when obstacles 2 and 5 appeared around the carrier-based aircraft, the carrier-based aircraft moved upward to avoid the two dynamic obstacles while ensuring that it kept moving closer to the target.

Fig. 10: The accuracy rate of scene 2: (a) DQN, (b) A2C, (c) DDPG, (d) our algorithm.
B. Crowd square simulation
As shown in Figs. 7 and 8, in the pedestrian square the person represented by the blue square moves toward the target to withdraw money at the position of the green board. In this scene, we set 28 static obstacles and 13 dynamic obstacles. It is worth noting that 9 dynamic obstacles and a large number of static obstacles were placed around the agent to increase the difficulty of obstacle avoidance.

The trajectories of the pedestrians and the agent in one episode are depicted on the square by the blue and yellow dashed lines, respectively. The agent first moved along trajectory 1 to avoid the static crowd and pedestrian 1. It then predicted that if it continued in the original direction it would collide with pedestrian 2 and deviate from the target, so it adjusted its direction and moved along trajectory 2 to avoid the collision and move toward the target. When the agent arrived near the target, pedestrian 5 finished withdrawing money and moved outward. After the agent predicted that it would collide with pedestrian 5, it continuously adjusted its direction of movement to avoid the obstacle and finally reached the target without collision.
IV. RESULTS

To avoid repetition while verifying the efficiency of the model, in this section we adopt comparative experiments only on the aircraft carrier deck scenes, and systematically compare the model with DDPG, DQN and A2C on the accuracy rate, the length of the path, the reward and the smoothness of the path. In particular, the smoothness of the path is expressed by the total turning angle of the aircraft reaching the target point without collision.

First of all, we compared the accuracy rate of the different algorithms in each scene. In scene 1, as shown in Fig. 9, the accuracy rate of our algorithm reached 100% most easily among the four algorithms. Due to the difficulty of convergence of A2C, its accuracy rate was the most unstable. Although the accuracy rate of DDPG could also reach 100%, it fluctuated greatly. DQN performed well in accuracy rate, but it still has disadvantages, which are discussed later. To compare the results more clearly, we calculated the average accuracy rate, shown in Table I: 100 sets of accuracy rates were sampled randomly to compute the averages, and our algorithm
improved the accuracy rate by 8%, 31% and 6% compared with DQN, A2C and DDPG, respectively. In scene 2, as the complexity of the environment increases, the accuracy rates of the four algorithms decrease, but our algorithm can still reach about 90% and remains the highest of the four algorithms. In summary, in terms of accuracy rate, although both DDPG and A2C can solve problems in a continuous action space, DDPG has a better learning effect, and our algorithm adds the prediction module on the basis of DDPG, so that it can achieve a higher accuracy rate in complex environments.

Fig. 11: The average reward of scene 1: (a) DQN, (b) A2C, (c) DDPG, (d) our algorithm.

TABLE I: Algorithm comparison table (average accuracy rate)
          Scene 1   Scene 2
DQN       90%       84%
A2C       67%       64%
DDPG      92%       82%
MDDPG     98%       91%
To verify the learning effect of the algorithm, we compared the average reward of our algorithm with those of the other algorithms. The average rewards of the four algorithms in the two scenes are presented in Fig. 11 and Fig. 12; the average reward indicates the efficiency of learning. As shown in Fig. 11, we compared the average reward of the four algorithms in scene 1. Due to the difficulty of convergence of the A2C algorithm, its maximum average reward only stabilized at about 100, whereas DQN, DDPG and our algorithm reached 150. The average reward of DDPG was clearly unstable, which shows that it is difficult for DDPG to deal with dynamic obstacles without prediction. Although the average reward curves of DQN and our algorithm are similar, our algorithm has higher stability and is more in line with the kinematic model of the real scene. It can be seen from the average reward that our algorithm has the best learning effect among the four algorithms.

TABLE II: Scene 1 Algorithm comparison table
          Length of path (units)   Turning angle (degrees)
DQN       438.22                   586.73
A2C       352.60                   142.03
DDPG      328.97                   139.96
MDDPG     328.95                   139.80
TABLE III: Scene 2 Algorithm comparison table
          Length of path (units)   Turning angle (degrees)
DQN       465.45                   522.62
A2C       324.16                   158.00
DDPG      320.61                   134.28
MDDPG     315.68                   128.41
To take into account the planning effect of the algorithm in a real scene as much as possible, we compared the average length of the path and the average turning angle of the algorithms. In a real scene, finding a smooth, optimal or sub-optimal
path not only reduces resource consumption but also completes the task in the shortest time, improving the real-time performance of the task. Table II and Table III list the average length of the path and the average turning angle of the four algorithms in the two scenes, respectively. Although DQN performed well in accuracy rate, its average path length and average turning angle were the longest of the four algorithms, which would consume more time and resources in a real environment. Comparing the two scenes, although the accuracy rates of the four algorithms dropped after the complexity of the environment was increased, the average path length and average turning angle showed only small fluctuations.

In summary, our algorithm can effectively overcome the difficult convergence of A2C and the difficulty of DDPG in dealing with dynamic obstacles. Compared with the other three algorithms, our algorithm shows obvious improvements in terms of average path length, path smoothness, accuracy and average reward.

Fig. 12: The average reward of scene 2: (a) DQN, (b) A2C, (c) DDPG, (d) our algorithm.
V. CONCLUSION

In this paper, a new path planning algorithm based on DDPG and MPC is designed. It combines the perception and decision-making capabilities of deep reinforcement learning with the predictive capability of MPC, and can be applied to highly uncertain scenes. The trajectory prediction of dynamic obstacles is achieved by MPC, which greatly reduces the uncertainty of the environment and effectively alleviates the problems faced by traditional algorithms in dynamic environments, such as slow convergence and poor generalization. DDPG is suitable for problems with continuous action spaces, which is more in line with real scenes. To verify the efficiency of the algorithm, we used Unity3D to simulate a complex aircraft carrier deck and a square, and conducted a detailed analysis of the agent's trajectory. The final results show that our algorithm improves accuracy by 7%-30% compared with the other algorithms, and reduces the path length by about 100 units and the turning angle by 400-450 degrees compared with DQN (Deep Q Network).
REFERENCES

[1] D. Wang, "Indoor mobile-robot path planning based on an improved A* algorithm," Journal of Tsinghua University Science and Technology, vol. 52, no. 8, pp. 1085–1089, 2012.
[2] L. Li, T. Ye, M. Tan, and X.-j. Chen, "Present state and future development of mobile robot technology research," Robot, vol. 24, no. 5, pp. 475–480, 2002.
[3] T. Lozano-Pérez and M. A. Wesley, "An algorithm for planning collision-free paths among polyhedral obstacles," Communications of the ACM, vol. 22, no. 10, pp. 560–570, 1979.
[4] J. Barraquand and J.-C. Latombe, "Robot motion planning: A distributed representation approach," The International Journal of Robotics Research, vol. 10, no. 6, pp. 628–649, 1991.
[5] G. Hoffmann, S. Waslander, and C. Tomlin, "Quadrotor helicopter trajectory tracking control," in AIAA Guidance, Navigation and Control Conference and Exhibit, 2008, p. 7410.
[6] R. Bohlin and L. E. Kavraki, "Path planning using lazy PRM," in Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), vol. 1, IEEE, 2000, pp. 521–528.
[7] Y. K. Hwang, N. Ahuja et al., "A potential field approach to path planning," IEEE Transactions on Robotics and Automation, vol. 8, no. 1, pp. 23–32, 1992.
[8] J. Bruce and M. M. Veloso, "Real-time randomized path planning for robot navigation," in Robot Soccer World Cup, Springer, 2002, pp. 288–295.
[9] A. Stentz, "Optimal and efficient path planning for partially known environments," in Intelligent Unmanned Ground Vehicles, Springer, 1997, pp. 203–220.
[10] C. Yan and X. Xiang, "A path planning algorithm for UAV based on improved Q-learning," IEEE, 2018, pp. 1–5.
[11] O. Bouhamed, H. Ghazzai, H. Besbes, and Y. Massoud, "Q-learning based routing scheduling for a multi-task autonomous agent," IEEE, 2019, pp. 634–637.
[12] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[13] V. N. Sichkar, "Reinforcement learning algorithms in global path planning for mobile robot," IEEE, 2019, pp. 1–5.
[14] M. A. K. Jaradat, M. Al-Rousan, and L. Quadan, "Reinforcement based mobile robot navigation in dynamic environment," Robotics and Computer-Integrated Manufacturing, vol. 27, no. 1, pp. 135–149, 2011.
[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[16] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[17] R. S. Sutton, A. G. Barto et al., Introduction to Reinforcement Learning. Cambridge: MIT Press, 1998, vol. 135.
[18] Q. Liu, L. Shi, L. Sun, J. Li, M. Ding, and F. Shu, "Path planning for UAV-mounted mobile edge computing with deep reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 69, no. 5, pp. 5723–5728, 2020.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[20] M. Sewak, "Deep Q Network (DQN), Double DQN, and Dueling DQN," in Deep Reinforcement Learning, Springer, 2019, pp. 95–108.
[21] Y. Zeng, X. Xu, S. Jin, and R. Zhang, "Simultaneous navigation and radio mapping for cellular-connected UAV with deep reinforcement learning," arXiv preprint arXiv:2003.07574, 2020.
[22] E. F. Camacho and C. B. Alba, Model Predictive Control. Springer Science & Business Media, 2013.
[23] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[24] J. Peters and S. Schaal, "Policy gradient methods for robotics," in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.