Deep Reinforcement Learning with a Stage Incentive Mechanism of Dense Reward for Robotic Trajectory Planning
Jin Yang a,b,*, Gang Peng a,b. a Key Laboratory of Image Processing and Intelligent Control, Ministry of Education; b School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China.
ABSTRACT
To improve the efficiency of deep reinforcement learning (DRL) based methods for robot manipulator trajectory planning in random working environments, we present three dense reward functions in this paper, in contrast to the traditional sparse reward function. First, a posture reward function is proposed to accelerate the learning process and produce a more reasonable trajectory by modeling distance and direction constraints, which reduces the blindness of exploration. Second, a stride reward function is proposed by modeling the distance and the joint movement distance constraints, which makes the learning process more stable. To further improve learning efficiency, we draw inspiration from the cognitive process of human behavior and propose a stage incentive mechanism, including a hard stage incentive reward function and a soft stage incentive reward function. Extensive experiments show that the proposed soft stage incentive reward function improves the convergence rate by up to 46.9% with state-of-the-art DRL methods. The convergence mean reward increases by 4.4%~15.5% and the standard deviation decreases by 21.9%~63.2%. In the evaluation, the success rate of trajectory planning for the robot manipulator reaches up to 99.6%.
Keywords: deep reinforcement learning; trajectory planning; dense reward function; stage incentive mechanism
* Corresponding author. Jin Yang (Corresponding Author), Master graduate student, Email: [email protected]; Gang Peng (Co-First Author), PhD, Assoc. Prof., Email: [email protected].
I. INTRODUCTION
Trajectory planning is one of the fundamental problems in the motion control of robot manipulators. The result of trajectory planning directly determines the quality of the task carried out by the robot manipulator. Traditional trajectory planning for robot manipulators mainly includes artificial potential field methods [1-3], polynomial interpolation [4-6], etc. These methods have low intelligence, poor dynamic planning ability and no self-learning ability. In recent years, deep reinforcement learning (DRL) has provided a new idea for the trajectory planning of robot manipulators [7-10]. It enables the robot manipulator to learn autonomously and plan an optimal path in a complex and random environment. As shown in Fig. 1, DRL has three elements: the environment, the agent and the reward function. The agent combines exploration to generate possible actions according to the current state of the robot manipulator. The robot manipulator executes the action in the environment, which feeds a reward value back to the agent according to the defined reward function. Through iterative updates, the agent learns a better trajectory planning policy.
FIGURE 1. Framework of deep reinforcement learning.
In the history of the development of deep reinforcement learning, a typical method is the Deep Q-learning Network (DQN) [11-12]. However, its output action space is discrete, so it is difficult to apply to continuous action spaces such as the trajectory planning of a robot manipulator. Subsequently, Deep Deterministic Policy Gradient (DDPG) [13] based on the Actor-Critic (AC) architecture, Asynchronous Advantage Actor-Critic (A3C) [14], Proximal Policy Optimization (PPO) [15] and Soft Actor-Critic (SAC) [16] have been proposed, which can be applied to tasks with continuous action spaces. Nevertheless, randomness and blindness are still problems in DRL methods. The core of this problem is the reward function, which is an important part of DRL. To the best of our knowledge, the reward functions used in robot manipulator trajectory planning tasks are sparse reward functions. They can lead to many ineffective explorations, which decreases the efficiency of the algorithm [17-19]. To solve this problem, we present a stage incentive mechanism based on human behavior cognition for robotic trajectory planning in DRL. The primary contributions of this paper are summarized as follows:
1) Combining the characteristics of trajectory planning and the working environment, three brand-new dense reward functions are proposed. A dense reward function provides non-zero rewards, which is different from the sparse reward function. It provides more information after each action, which reduces invalid and blind exploration of DRL in trajectory planning for robot manipulators.
2) First, the posture reward function and the stride reward function are proposed. The posture reward function includes a position reward function and a direction reward function. The position reward function is composed of a task status item (whether the task is completed) and a distance guide item (the Euclidean distance between the end of the robot manipulator and the random target); the direction reward function is modeled by the angle between the expected direction vector and the actual direction vector. The stride reward function is built from the position reward and the movement distance reward. The position reward is the same as above, and the movement distance reward is composed of the average movement distance of each joint of the robot manipulator. The posture reward function and the stride reward function make the robot manipulator explore more efficiently under reasonable constraints on position, direction and movement distance, and reduce invalid and blind exploration.
3) To further improve learning efficiency, we draw inspiration from the cognitive process of human behavior and propose a stage incentive mechanism. First, a hard stage incentive mechanism is established by combining the posture reward function and the stride reward function. To remove its potential stability hazards, a soft stage incentive mechanism is further proposed. With this structure, we increase the expected return obtained by the algorithm while ensuring its stability, which improves the overall efficiency of the algorithm.
The rest of this paper is organized as follows. The posture reward function and the stride reward function are presented in Section II and Section III. In Section IV, the stage incentive mechanism is introduced, including the hard stage incentive reward function and the soft stage incentive reward function. The implementation of the reward functions is illustrated in Section V, which mainly discusses how to implement the proposed reward functions with the current mainstream DRL methods. Experimental results are then demonstrated and discussed in Section VI. Finally, conclusions are drawn in Section VII.
II. POSTURE REWARD FUNCTION
For DRL based methods, the robot manipulator performs many ineffective explorations in a complex random environment, which is the main reason for the reduced efficiency of the algorithm. To cope with this problem, we replace the traditional sparse reward function with dense reward functions. Dense reward functions give more information after each action, but are more difficult to construct than sparse reward functions. The posture reward function reasonably restricts the relative position and relative direction of the end point of the robot manipulator and the target, referred to as the position reward function and the direction reward function, respectively. The posture reward function enables the algorithm to generate more reasonable actions executable by the robot manipulator and improves the efficiency of the algorithm.
A. POSITION REWARD FUNCTION
In a random environment, the Euclidean distance between the end of the robot manipulator and the target reflects the current state of the robot. The position reward function designed in this paper consists of two parts: a task status item and a distance guide item. The task status item reflects the result of the trajectory planning of the robot manipulator, that is, whether it reaches the position of the target that appears randomly in space. The purpose of the distance guide item is to motivate the robot manipulator to approach the target point quickly.
Distance guide item: To motivate the robot manipulator to approach the target point T quickly, the distance guide item is represented by the Euclidean distance $D_{PT}$ between the end of the robot manipulator P and the target T.
Task status item:
The task status item is modeled by $D_{PT}$: the smaller $D_{PT}$ is, the more likely the robot manipulator is to reach the target. The task status item is represented by the parameter $\beta_{reach}$, as shown in (1):
$\beta_{reach} = \begin{cases} 0, & D_{PT} > \beta \\ 1, & D_{PT} < \beta \end{cases}$ (1)
where $\beta$ is adjustable according to the actual requirements of the environment; the value of $\beta$ is set to 0.01 in this paper. By combining the task status item and the distance guide item, the position reward function is designed as shown in (2):
$r_{position}(D_{PT}) = \beta_{reach} - D_{PT}$. (2)
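As a concrete illustration, a minimal Python sketch of the position reward function follows. The function name and the NumPy-based implementation are our own for illustration and are not taken from the authors' code.

```python
import numpy as np

def position_reward(p_end, p_target, beta=0.01):
    """Position reward r_position = beta_reach - D_PT, Eqs. (1)-(2).

    p_end, p_target: 3-D coordinates of the manipulator end point P and the target T.
    beta: distance threshold deciding whether the target has been reached.
    """
    d_pt = np.linalg.norm(np.asarray(p_end, float) - np.asarray(p_target, float))  # Euclidean distance D_PT
    beta_reach = 1.0 if d_pt < beta else 0.0                                        # task status item, Eq. (1)
    return beta_reach - d_pt                                                        # distance guide item, Eq. (2)
```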
B. DIRECTION REWARD FUNCTION
On the basis of the guidance of the position reward function, by adding a direction guide, the robot manipulator can obtain more information and reach the target faster. The direction reward function is modeled by the relationship between two vectors in three-dimensional space: the expected direction and the actual direction of the end of the robot manipulator. As shown in Fig. 2, $\vec{PT}$ is the expected motion direction and $\vec{PP'}$ is the actual motion direction. The expressions of $\vec{PT}$ and $\vec{PP'}$ are formulated in (3) and (4):
$\vec{PT} = \langle (T_x - P_x), (T_y - P_y), (T_z - P_z) \rangle$ (3)
$\vec{PP'} = \langle P'_x / t_{norm}, \; P'_y / t_{norm}, \; P'_z / t_{norm} \rangle, \quad t_{norm} = \sin(\cos^{-1}(P'_w))$ (4)
where $(T_x, T_y, T_z)$ are the coordinates of the target, $(P_x, P_y, P_z)$ are the coordinates of the end of the robot manipulator at the current state, and $(P'_x, P'_y, P'_z, P'_w)$ is the quaternion of the end of the robot manipulator at the current state. $\omega$ represents the angle between $\vec{PT}$ and $\vec{PP'}$; it measures the deviation between the motion vector planned by the algorithm and the expected motion vector. The smaller $\omega$, the lower the deviation. The expression of $\omega$ is formulated in (5):
$\omega = \left| \cos^{-1} \dfrac{\vec{PT} \cdot \vec{PP'}}{\|\vec{PT}\| \, \|\vec{PP'}\|} \right|, \quad \omega \in [0, \pi]$. (5)
FIGURE 2. Scheme of the direction reward function.
The direction reward function designed in this paper is shown in (6):
$r_{direction}(\omega) = -\lfloor \omega \rfloor$ (6)
where $\lfloor \cdot \rfloor$ represents an operation whose result is $\omega$ when $\omega$ is less than $\pi/2$; otherwise, the result is $\pi - \omega$.
C. MODELING OF POSTURE REWARD FUNCTION
In the process of trajectory planning of the robot manipulator, position and direction are two key factors to be considered comprehensively. Using only the position reward function or only the direction reward function, the performance of the algorithm is poor. Therefore, the position reward function and the direction reward function are combined to form the posture reward function $r_{posture}$, as shown in (7):
$r_{posture}(D_{PT}, \omega) = r_{position}(D_{PT}) + r_{direction}(\omega)$. (7)
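The direction and posture terms can be sketched in the same way. The sketch below assumes the quaternion is given in (x, y, z, w) order and adds small numerical guards that are not part of Eqs. (3)-(7); it is an illustrative implementation only.

```python
import numpy as np

def direction_reward(p_end, p_target, quat_xyzw):
    """Direction reward r_direction = -floor_op(omega), Eqs. (3)-(6)."""
    pt = np.asarray(p_target, float) - np.asarray(p_end, float)     # expected direction PT, Eq. (3)
    x, y, z, w = quat_xyzw
    t_norm = np.sin(np.arccos(np.clip(w, -1.0, 1.0)))               # t_norm = sin(cos^-1(P'_w)), Eq. (4)
    pp = np.array([x, y, z]) / max(t_norm, 1e-8)                     # actual direction PP', Eq. (4)
    cos_omega = pt @ pp / (np.linalg.norm(pt) * np.linalg.norm(pp) + 1e-8)
    omega = abs(np.arccos(np.clip(cos_omega, -1.0, 1.0)))            # deviation angle in [0, pi], Eq. (5)
    omega = omega if omega < np.pi / 2 else np.pi - omega            # clamp operation from Eq. (6)
    return -omega

def posture_reward(r_position_val, r_direction_val):
    """Posture reward r_posture = r_position + r_direction, Eq. (7)."""
    return r_position_val + r_direction_val
```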
III. STRIDE REWARD FUNCTION
For the optimal trajectory we expect, the robot manipulator should not only reach the target accurately, but also keep the movement distance of its joints as small as possible. This not only encourages the robot manipulator to reach the target quickly, but also reduces the energy consumption during its operation. The stride reward function is modeled by the position reward function and the movement distance reward function. The position reward function remains the same as in Section II-A.
A. MOVEMENT DISTANCE REWARD FUNCTION
In this paper, we take the average movement distance of each joint while the robot manipulator is running as a constraint, and use it to model the movement distance reward function. It is difficult to obtain the movement distance of each joint directly during the operation of the robot manipulator, so we calculate it from the speed of each joint. We define the joint velocity vector of the robot manipulator as in (8):
$\vec{v} = [v_1, v_2, \ldots, v_N], \quad N = \text{number of joints}$. (8)
The movement distance reward function $r_{move}$ is shown in (9):
$r_{move}(\vec{v}) = -\Delta t \cdot (\vec{v} \cdot \vec{v}) / N$ (9)
where $\Delta t$ represents the control period of the robot manipulator, that is, the robot manipulator executes a speed command every $\Delta t$, and $N$ is the number of joints of the robot manipulator. In this paper, we set $\Delta t$ to 0.05 and $N$ to 6.
B. MODELING OF STRIDE REWARD FUNCTION
The stride reward function is formed by combining the position reward function and the movement distance reward function. We use the position and the movement distance of each joint of the robot manipulator as constraints to shape the trajectory planning policy learned by the algorithm, which ensures that the target is reached while the movement distance of each joint of the robot manipulator is reduced. The stride reward function designed in this paper is shown in (10):
$r_{stride}(D_{PT}, \vec{v}) = r_{position}(D_{PT}) + r_{move}(\vec{v})$. (10)
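A hedged sketch of the movement distance reward and its combination into the stride reward follows, under the settings above ($\Delta t$ = 0.05, $N$ inferred from the length of the joint velocity vector); names and signatures are illustrative only.

```python
import numpy as np

def movement_distance_reward(joint_velocities, dt=0.05):
    """Movement distance reward r_move = -dt * (v . v) / N, Eqs. (8)-(9)."""
    v = np.asarray(joint_velocities, dtype=float)   # joint velocity vector, Eq. (8)
    n = len(v)                                      # number of joints (6 in this paper)
    return -dt * float(v @ v) / n

def stride_reward(r_position_val, r_move_val):
    """Stride reward r_stride = r_position + r_move, Eq. (10)."""
    return r_position_val + r_move_val
```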
IV. STAGE INCENTIVE MECHANISM
A. INSPIRATION SOURCE
First, let us make an analogy. When we were in elementary school, our parents told us that we could get a big toy if we got good grades, so we studied hard for the toy. In university, our parents again told us that we could get a prize for good grades. The question is: what prize would motivate us to study hard in university? As shown in Fig. 3, if it is still a simple toy, it is not enough to motivate us to study hard to get good grades in university. From a psychological point of view, this is actually a process of human behavior cognition, and it conforms to Maslow's hierarchy of needs (a representative cognitive motivational theory).
FIGURE 3. Scheme of the inspiration source of the stage incentive mechanism.
B. HARD STAGE INCENTIVE REWARD FUNCTION
We use an adjustable coefficient $\gamma$ to apply different reward functions at different stages of the task during the operation of the robot manipulator. The hard stage incentive mechanism divides the trajectory planning task of the robot manipulator into two stages, the fast approach area and the slow adjustable area, as shown in Fig. 4. In the fast approach area, the robot manipulator uses the posture reward function to prompt it to approach the target quickly. In the slow adjustable area, the stride reward function is used as the incentive.
FIGURE 4. Scheme of the hard stage incentive reward function.
In this paper, we use $D_{PT} = 0.5$ as the boundary between the fast approach area and the slow adjustable area. The relationship between the adjustable coefficient $\gamma$ and the motion area of the robot manipulator is shown in Fig. 5.
FIGURE 5. Diagram of the adjustable coefficient $\gamma$.
$\gamma$ can be calculated by (11):
$\gamma = \begin{cases} [\gamma_{posture} = 1, \; \gamma_{stride} = 0]^T, & P \in \text{fast approach area} \\ [\gamma_{posture} = 0, \; \gamma_{stride} = 1]^T, & P \in \text{slow adjustable area} \end{cases}$ (11)
The hard stage incentive reward function $r_{HAR}$ we propose is shown in (12):
$r_{HAR} = \gamma^T [r_{posture}(D_{PT}, \omega), \; r_{stride}(D_{PT}, \vec{v})]^T = [\gamma_{posture}, \gamma_{stride}] \, [r_{posture}(D_{PT}, \omega), \; r_{stride}(D_{PT}, \vec{v})]^T$ (12)
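A minimal sketch of the hard stage incentive switching logic is given below, assuming the posture and stride reward values have already been computed and using the $D_{PT}$ = 0.5 boundary from Fig. 5; it is an illustrative implementation, not the authors' code.

```python
def hard_stage_incentive_reward(d_pt, r_posture_val, r_stride_val, boundary=0.5):
    """Hard stage incentive reward, Eqs. (11)-(12).

    Uses the posture reward in the fast approach area (D_PT > boundary)
    and the stride reward in the slow adjustable area (D_PT <= boundary).
    """
    if d_pt > boundary:          # fast approach area: gamma = [1, 0]^T
        gamma_posture, gamma_stride = 1.0, 0.0
    else:                        # slow adjustable area: gamma = [0, 1]^T
        gamma_posture, gamma_stride = 0.0, 1.0
    return gamma_posture * r_posture_val + gamma_stride * r_stride_val
```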
C. SOFT STAGE INCENTIVE REWARD FUNCTION
Although the hard stage incentive reward function achieves good results in experiments, we found that it has a potential stability problem: the adjustment is rough, and switching the reward function easily causes fluctuations in the reward curve, which introduces instability into the algorithm. The switching process of the hard stage incentive reward function is similar to bang-bang control in classical control, and such a method is bound to affect the stability of the algorithm. To cope with this problem, we further propose the soft stage incentive reward function. The weight coefficient $\alpha = [\alpha_1, \alpha_2]$ is introduced to model the soft stage incentive reward function, as shown in (13) and (14):
$\alpha_1 = f_1(D_{PT}) = 1 - \lfloor D_{PT} \rfloor^{\lambda_1}$ (13)
$\alpha_2 = f_2(D_{PT}) = \lfloor D_{PT} \rfloor^{\lambda_2}$ (14)
where $\lfloor \cdot \rfloor$ represents an operation that constrains the value of $D_{PT}$ to $[0, 1]$. $\lambda_1$ and $\lambda_2$ can be adjusted according to the actual situation of the task; in this paper, we set $\lambda_1 = \lambda_2 = 1$ according to experimental experience. Combining the weight coefficient $\alpha$, the final expression of the soft stage incentive reward function is defined as (15):
$r_{SAR} = \alpha_1 \, r_{stride}(D_{PT}, \vec{v}) + \alpha_2 \, r_{posture}(D_{PT}, \omega)$. (15)
The soft stage incentive reward function does not need to divide the working space of the robot manipulator; the reward function is adjusted dynamically and continuously according to the real-time change of the weight coefficient $\alpha$.
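A minimal sketch of the soft stage incentive reward under the settings above ($\lambda_1 = \lambda_2 = 1$) is shown below; the clamp is written with plain min/max, and the signature is illustrative only.

```python
def soft_stage_incentive_reward(d_pt, r_posture_val, r_stride_val, lam1=1.0, lam2=1.0):
    """Soft stage incentive reward, Eqs. (13)-(15).

    The clamp constrains D_PT to [0, 1]; with lam1 = lam2 = 1 the two weights
    vary linearly with distance and always sum to one.
    """
    d = min(max(float(d_pt), 0.0), 1.0)     # clamped D_PT
    alpha1 = 1.0 - d ** lam1                 # weight of the stride reward, Eq. (13)
    alpha2 = d ** lam2                       # weight of the posture reward, Eq. (14)
    return alpha1 * r_stride_val + alpha2 * r_posture_val   # Eq. (15)
```

Near the target the stride reward dominates, and far from the target the posture reward dominates, which reproduces the behavior of the hard stage incentive without a discrete switch.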
V. IMPLEMENTATION OF REWARD FUNCTION
In this section, we introduce how to implement the proposed reward functions in mainstream DRL methods, mainly those with the AC architecture. As shown in Fig. 6, the learning process of the robot manipulator consists of four stages: initialization, action generation, reward calculation and network training. In the initialization stage, the parameters of the actor network $\theta^{\mu}$ and the critic network $\theta^{Q}$ are initialized randomly. The actor network is used to predict the action to be performed, and the critic network judges the value of the action generated by the actor network. The actor network and the critic network are represented as $\mu(s|\theta^{\mu})$ and $Q(s|\theta^{Q})$. In the action generation stage, the environment state $s$ includes the relative distance between the robot manipulator and the target. Combining the environment state $s$ and the value given by the critic network, the actor network generates the action, which is the speed of the joints, and puts it into effect. In the reward calculation stage, the reward for the current action is computed by the soft stage incentive reward function, and the result is sent to the critic network for training. In the network training stage, the weights of the networks are updated.
FIGURE 6. Diagram of the training process for DRL with the AC framework.
The overall process is summarized in Algorithm 1, where M is the maximum number of episodes and T is the maximum number of training steps in each episode.
Algorithm 1: Trajectory planning algorithm with soft stage incentive reward function
Input: Environment state space S.
Output: Action a.
1: Initialize the actor network $\mu(s|\theta^{\mu})$ and the critic network $Q(s|\theta^{Q})$
2: for episode = 1 to M do
3:   for t = 1 to T do
4:     $a_t \leftarrow \mu(s|\theta^{\mu})$
5:     $r_{SAR} \leftarrow F(s, a_t)$
6:     reward = $r_{SAR}$
7:     Update the weights of the actor network $\theta^{\mu}$
8:     Update the weights of the critic network $\theta^{Q}$
9:   end for
10: end for
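For illustration only, the sketch below shows how Algorithm 1 could be organized around a generic actor-critic agent in Python. The `env` and `agent` interfaces (reset, step, select_action, store, update) and the reward-function signature are assumed placeholders rather than the authors' implementation or any specific library API.

```python
def train(env, agent, reward_fn, max_episodes, max_steps):
    """Generic actor-critic training loop following Algorithm 1 (sketch).

    env       : environment exposing reset() and step(action) -> (next_state, done)
    agent     : actor-critic agent exposing select_action(), store() and update()
    reward_fn : soft stage incentive reward F(s, a), e.g. built from the sketches above
    """
    for episode in range(max_episodes):                       # episode = 1 .. M
        state = env.reset()
        for t in range(max_steps):                            # t = 1 .. T
            action = agent.select_action(state)               # a_t <- mu(s|theta_mu)
            next_state, done = env.step(action)               # joint speed command executed for dt
            reward = reward_fn(state, action, next_state)     # r_SAR <- F(s, a)
            agent.store(state, action, reward, next_state, done)
            agent.update()                                     # update theta_mu and theta_Q
            state = next_state
            if done:                                           # target reached: stop the episode
                break
```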
VI. EXPERIMENTAL RESULTS AND DISCUSSIONS
In this section, we set up three sets of experiments to test the performance of the proposed reward functions. The performance of our method is evaluated with four indicators: the convergence rate, the mean reward of each episode, the average number of steps to complete the task, and the standard deviation. (In each episode, as soon as the trajectory planning is completed at any time step, the current episode is stopped and the next episode begins; therefore, the total reward is accumulated over the time steps before the trajectory planning is completed. The robot manipulator cannot complete the trajectory planning task in one step and usually requires multiple steps.) The first three indicators are used to verify learning efficiency, and the standard deviation is used to judge the stability and robustness of the algorithm. In the first set of experiments, we apply the posture reward function and the stride reward function to state-of-the-art DRL methods, including Deep Deterministic Policy Gradient (DDPG) [13] and Soft Actor-Critic (SAC) [16]. In the second set of experiments, the hard stage incentive reward function is applied; the effectiveness of the stage incentive mechanism is verified by comparing the four evaluation indicators. In the last set of experiments, the soft stage incentive reward function is used; we further discuss the deficiencies of the hard stage incentive reward function and verify the performance of the soft stage incentive reward function. Simulation experiments are conducted in V-REP [20][21]. A random environment is initialized as shown in Fig. 7. The red ball is the target, which appears randomly in the workspace. An additional reward of +20 is given after each completed task.
FIGURE 7. Simulation environment for the robot manipulator.
The computer configuration used in the experiments is summarized in TABLE 1. The parameter settings of DDPG and SAC are summarized in TABLE 2. SAC is adopted with an entropy constraint enforced by varying the entropy regularization coefficient over the course of training.
TABLE 1. Configuration used in the experiments.
index | name | information
1 | OS | Ubuntu 16.04 (64-bit)
2 | simulation environment | V-REP
3 | robot manipulator | JAKA
4 | programming language | Python
5 | CPU | Intel(R) i7-10750H CPU @ 2.6 GHz
6 | RAM | 16 GB
TABLE 2. Key configuration parameters in DDPG and SAC.
index | name | value
1 | Actor Network Learning Rate | 5e-4
2 | Actor Network Hidden Units | (256, 128)
3 | Actor Network Weight Decay | 1e-5
4 | Critic Network Learning Rate | 1e-3
5 | Critic Network Hidden Units | (256, 128)
6 | Critic Network Weight Decay | 1e-5
7 | Memory Buffer Size | 1e6
8 | Batch Size | 64
9 | Max Episodes | 1e4
10 | Max Steps | 50
11 | Update Weight τ | ----
A. POSTURE AND STRIDE REWARD FUNCTION
In this part, three kinds of reward functions, basic (a sparse reward function), posture and stride, are applied to two mainstream DRL methods. (Experimental verification showed that DDPG and SAC still cannot converge after a long period of training with the basic sparse reward function, so it is not discussed further.) During the experiments, we initialize the same working environment 20 times. After all the methods converge, we calculate the convergence rate, the mean reward of each episode, the average number of steps to complete the task and the standard deviation, as summarized in TABLE 3. The reward and the average number of steps to complete the task during training are visualized for each method in Fig. 8, and the corresponding curves in the evaluation are visualized in Fig. 9.
TABLE 3. Results with the posture and stride reward functions.
Method | Reward function | Episode (train) | Reward (train) | Step (train) | Standard deviation (train) | Reward (eval) | Step (eval) | Success rate (eval)
SAC | Basic | ---- | ---- | ---- | ---- | ---- | ---- | ----
SAC | Posture | 5244 | 9.956±1.352 | 18.161±1.430 | 15 | 13.902 | 10 | 89.2%
SAC | Stride | 6773 | 16.125±0.833 | 9.728±1.049 | 12 | 16.584 | 12 | 90.8%
DDPG | Basic | ---- | ---- | ---- | ---- | ---- | ---- | ----
DDPG | Posture | 5879 | 15.644±0.518 | 11.625±3.073 | 9 | 14.871 | 11 | 90.4%
DDPG | Stride | 7988 | 16.142±0.827 | 8.872±2.505 | 11 | 16.163 | 11 | 88.6%
FIGURE 8. Diagram of the convergence process with the posture and stride reward functions in the training: (a) SAC, (b) DDPG.
FIGURE 9. Diagram of the convergence process with the posture and stride reward functions in the evaluation: (a) SAC, (b) DDPG.
From TABLE 3, we can observe that SAC converges faster in general. The convergence rate of SAC with the posture reward function is 10.8% faster than that of DDPG with the posture reward function, and the convergence rate of SAC with the stride reward function is 17.9% faster than that of DDPG with the stride reward function. Compared with the stride reward function, the convergence rate of the posture reward function increases by 22.6%~26.4%. However, the standard deviation of the stride reward function is 23.7%~46.4% lower than that of the posture reward function, and the mean reward of the stride reward function is 3.2% higher than that of the posture reward function in DDPG.
Analyzing the reward curves of the different reward functions in Fig. 8, a pattern can be found. During training, the posture reward function fluctuates to a large extent after convergence; the stride reward function converges slowly and fluctuates greatly before convergence, but it is stable after convergence. The reason is not difficult to explain. The posture reward function guides the robot manipulator closer to the target with distance and direction constraints, which is more oriented toward completing the task. But its stability is poor, and a little interference can make it move away from the target quickly, resulting in mission failure. The stride reward function takes the distance and the movement distance of each joint of the robot manipulator as constraints to guide the robot manipulator toward the target. Because of this, it will not suddenly move significantly; compared with the posture reward function, it is more cautious. The fluctuations in training are caused by the agent's exploration. Under the dual effects of exploration and the stride reward function's own characteristics, the robot manipulator cannot reach the target quickly.
During the evaluation, we performed 500 random trials, used the trained model to realize trajectory planning with the robot manipulator, and calculated the success rate. From TABLE 3, we can observe that the success rates of DDPG and SAC based on the posture reward function are 90.4% and 89.2% respectively, and the success rates of DDPG and SAC based on the stride reward function are 88.6% and 90.8% respectively. Generally speaking, the posture reward function completes trajectory planning with fewer average steps, and the stride reward function obtains more reward. The posture reward function shows more advantages in improving the convergence rate, while the stride reward function plays a more important role in improving the stability of the algorithm. Thus, the posture reward function can make the robot manipulator get rid of blind exploration at the early stage of exploration, and the stride reward function can ensure stable convergence.
B. HARD STAGE INCENTIVE REWARD FUNCTION
It can be seen from the above experiments that in a complex environment the dense reward functions achieve better results, but some defects remain. We therefore apply the hard stage incentive reward function (referred to as the HAR reward function hereinafter) to SAC and DDPG; the convergence results are shown in TABLE 4. The reward and the average number of steps to complete the task during training are visualized for each method in Fig. 10, and the corresponding curves in the evaluation are visualized in Fig. 11.
TABLE 4. Results with the hard stage incentive reward function.
Method | Reward function | Episode (train) | Reward (train) | Step (train) | Standard deviation (train) | Reward (eval) | Step (eval) | Success rate (eval)
SAC | Basic | ---- | ---- | ---- | ---- | ---- | ---- | ----
SAC | Posture | 5244 | 9.956±1.352 | 18.161±1.430 | 15 | 13.902 | 10 | 89.2%
SAC | Stride | 6773 | 16.125±0.833 | 9.728±1.049 | 12 | 16.584 | 12 | 90.8%
SAC | HAR | 5369 | 15.366±0.403 | 10.358±0.348 | 12 | 17.349 | 10 | 93.2%
DDPG | Basic | ---- | ---- | ---- | ---- | ---- | ---- | ----
DDPG | Posture | 5879 | 15.644±0.518 | 11.625±3.073 | 9 | 14.871 | 11 | 90.4%
DDPG | Stride | 7988 | 16.142±0.827 | 8.872±2.505 | 11 | 16.163 | 11 | 88.6%
DDPG | HAR | 6385 | 17.266±0.276 | 6.688±0.549 | 9 | 17.361 | 8 | 93.8%
From TABLE 4, in the training process, the standard deviation is reduced by about 42.6% compared with the posture reward function, which indicates better robustness. The convergence rate of the HAR reward function is about 20.4% faster than that of the stride reward function, but slower than that of the posture reward function, which is caused by the characteristics of the HAR reward function itself. When the hard stage incentive mechanism switches the reward function used in different stages, a switch-type adjustment is adopted without a smooth transition process. This is also one of the reasons why SAC with the HAR reward function fluctuates greatly in the training in Fig. 10(a), although the effect is not obvious for DDPG. In the evaluation, both methods achieve clear improvements in convergence performance and robustness with the HAR reward function: the reward rises by up to 24.7%, the average number of steps decreases by 16.7%~27.2%, and the success rate of trajectory planning rises by 2.4%~5.2%.
FIGURE 10. Diagram of the convergence process with the hard stage incentive reward function in the training: (a) SAC, (b) DDPG.
FIGURE 11. Diagram of the convergence process with the hard stage incentive reward function in the evaluation: (a) SAC, (b) DDPG.
C. SOFT STAGE INCENTIVE REWARD FUNCTION
Although the method with the HAR reward function achieves improvements in both the training and evaluation processes, its learning efficiency and robustness still need to be improved because of the limitations of the HAR reward function's own characteristics. In the last set of experiments, the experimental group uses the soft stage incentive reward function (referred to as the SAR reward function hereinafter). As shown in TABLE 5, the results of the SAR reward function are superior to the others in all cases. The reward and the average number of steps to complete the task during training are visualized for each method in Fig. 12, and the corresponding curves in the evaluation are visualized in Fig. 13.
TABLE 5. Results with the soft stage incentive reward function.
Method | Reward function | Episode (train) | Reward (train) | Step (train) | Standard deviation (train) | Reward (eval) | Step (eval) | Success rate (eval)
SAC | Basic | ---- | ---- | ---- | ---- | ---- | ---- | ----
SAC | Posture | 5244 | 9.956±1.352 | 18.161±1.430 | 15 | 13.902 | 10 | 89.2%
SAC | Stride | 6773 | 16.125±0.833 | 9.728±1.049 | 12 | 16.584 | 12 | 90.8%
SAC | HAR | 5369 | 15.366±0.403 | 10.358±0.348 | 12 | 17.349 | 10 | 93.2%
SAC | SAR | ---- | ---- | ---- | ---- | ---- | ---- | ----
DDPG | Basic | ---- | ---- | ---- | ---- | ---- | ---- | ----
DDPG | Posture | 5879 | 15.644±0.518 | 11.625±3.073 | 9 | 14.871 | 11 | 90.4%
DDPG | Stride | 7988 | 16.142±0.827 | 8.872±2.505 | 11 | 16.163 | 11 | 88.6%
DDPG | HAR | 6385 | 17.266±0.276 | 6.688±0.549 | 9 | 17.361 | 8 | 93.8%
DDPG | SAR | ---- | ---- | ---- | ---- | ---- | 5 | 99.6%
In the training, compared with the above three reward functions, the convergence rate is accelerated by 18.7%~40.1% in DDPG and by 31.4%~46.9% in SAC. The convergent mean reward is improved by 4.6%~15.5% in DDPG and by 4.4%~9.5% in SAC. The robustness is also excellent: the standard deviation is decreased by 35.9%~63.2% in DDPG and by 21.9%~26.7% in SAC. This shows that our SAR reward function has a good convergence rate, stability and robustness. Why does it work so well? On the one hand, it combines the advantages of the posture reward function and the stride reward function, which not only ensures fast convergence at the early stage of exploration, but also ensures stable convergence. On the other hand, it eliminates the switch-type adjustment of the HAR reward function, which makes the transition between reward functions in different stages smooth. The convergence rate of the SAR reward function in DDPG is slower than in SAC, but the other indicators are better than those of SAC. Using the model obtained by DDPG with the SAR reward function for evaluation, the success rate of trajectory planning reaches 99.6%, and the average number of steps to complete the trajectory planning is 5. From Fig. 13, compared with the other three reward functions, we can observe that the SAR reward function needs fewer steps to realize the trajectory planning of the robot manipulator, obtains more reward, and is more stable.
FIGURE 12. Diagram of the convergence process with the soft stage incentive reward function in the training: (a) SAC, (b) DDPG.
FIGURE 13. Diagram of the convergence process with the soft stage incentive reward function in the evaluation: (a) SAC, (b) DDPG.
VII. CONCLUSIONS
To cope with the inefficiency, instability and blindness of DRL based methods in trajectory planning tasks for robot manipulators, this paper proposed three dense reward functions: the posture reward function, the stride reward function, and the stage incentive mechanism. The posture reward function reduces the blindness of exploration and accelerates the learning process, and the stride reward function makes the learning process more stable. The soft stage incentive reward function absorbs the advantages of both and achieves a better convergence rate, stability and robustness at the same time.
Experimental results demonstrate that state-of-the-art DRL methods using the proposed reward functions can improve the convergence rate and the trajectory planning quality with respect to accuracy and robustness. In future work, we will further explore the mechanism of reward shaping and plan to extend this method to more complex robotic trajectory planning tasks, which will further increase its universality. Experiments with a real robot manipulator will be performed at the same time.
REFERENCES
[1] N. Zhang, Y. Zhang, C. Ma and B. Wang, "Path planning of six-DOF serial robots based on improved artificial potential field method," 2017 IEEE International Conference on Robotics and Biomimetics (ROBIO), Macau, 2017, pp. 617-621, doi: 10.1109/ROBIO.2017.8324485.
[2] H. Lin and M. Hsieh, "Robotic Arm Path Planning Based on Three-Dimensional Artificial Potential Field," 2018 18th International Conference on Control, Automation and Systems (ICCAS), Daegwallyeong, 2018, pp. 740-745.
[3] S. N. Gai, R. Sun, S. J. Chen and S. Ji, "6-DOF Robotic Obstacle Avoidance Path Planning Based on Artificial Potential Field Method," 2019 16th International Conference on Ubiquitous Robots (UR), Jeju, Korea (South), 2019, pp. 165-168, doi: 10.1109/URAI.2019.8768792.
[4] S. Fang, X. Ma, Y. Zhao, Q. Zhang and Y. Li, "Trajectory Planning for Seven-DOF Robotic Arm Based on Quintic Polynormial," 2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), Hangzhou, China, 2019, pp. 198-201, doi: 10.1109/IHMSC.2019.10142.
[5] S. Fang, X. Ma, J. Qu, S. Zhang, N. Lu and X. Zhao, "Trajectory Planning for Seven-DOF Robotic Arm Based on Seventh Degree Polynomial," in Proceedings of 2019 Chinese Intelligent Systems Conference (CISC 2019), Lecture Notes in Electrical Engineering, vol. 593, Springer, Singapore, 2020. https://doi.org/10.1007/978-981-32-9686-2_34
[6] A. T. Sadiq, F. A. Raheem and N. A. F. Abbas, "Optimal Trajectory Planning of 2-DOF Robot Arm Using the Integration of PSO Based on D* Algorithm and Cubic Polynomial Equation," The First International Conference for Engineering Researches, March 2017.
[7] K. Katyal, I. Wang and P. Burlina, "Leveraging Deep Reinforcement Learning for Reaching Robotic Tasks," 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, 2017, pp. 490-491, doi: 10.1109/CVPRW.2017.71.
[8] K. Kamali, I. A. Bonev and C. Desrosiers, "Real-time Motion Planning for Robotic Teleoperation Using Dynamic-goal Deep Reinforcement Learning," 2020 17th Conference on Computer and Robot Vision (CRV), Ottawa, ON, Canada, 2020, pp. 182-189, doi: 10.1109/CRV50864.2020.00032.
[9] Z. Li, H. Ma, Y. Ding, C. Wang and Y. Jin, "Motion Planning of Six-DOF Arm Robot Based on Improved DDPG Algorithm," 2020 39th Chinese Control Conference (CCC), Shenyang, China, 2020, pp. 3954-3959, doi: 10.23919/CCC50068.2020.9188521.
[10] S. Wen, J. Chen, S. Wang, H. Zhang and X. Hu, "Path Planning of Humanoid Arm Based on Deep Deterministic Policy Gradient," 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO), Kuala Lumpur, Malaysia, 2018, pp. 1755-1760, doi: 10.1109/ROBIO.2018.8665248.
[11] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature 518, 529-533 (2015).
[12] T. D. Le, A. T. Le and D. T. Nguyen, "Model-based Q-learning for humanoid robots," 2017 18th International Conference on Advanced Robotics (ICAR), Hong Kong, 2017, pp. 608-613, doi: 10.1109/ICAR.2017.8023674.
[13] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in Proc. ICML, New York, USA, Jun. 2016, pp. 1928-1937.
[15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, "Proximal Policy Optimization Algorithms," arXiv preprint arXiv:1707.06347, 2017.
[16] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor," arXiv preprint arXiv:1801.01290, 2018.
[17]