Robot Navigation in a Crowd by Integrating Deep Reinforcement Learning and Online Planning
Zhiqian Zhou, Pengming Zhu, Zhiwen Zeng, Junhao Xiao, Huimin Lu*, Zongtan Zhou

*This work was supported by the National Natural Science Foundation of China [61773393, U1913202, U1813205]. All the authors are with the Robotics Research Center, College of Intelligence Science and Technology, National University of Defense Technology. [email protected], [email protected], [email protected], [email protected], [email protected], @nudt.edu.cn and [email protected]
Abstract — Navigating a mobile robot along time-efficient and collision-free paths in a crowd is still an open and challenging problem. The main challenge comes from the complex and sophisticated interaction mechanism, which requires the robot to understand the crowd and perform proactive and foresighted behaviors. Deep reinforcement learning is a promising solution to this problem. However, most previous learning methods incur a tremendous computational burden. To address these problems, we propose a graph-based deep reinforcement learning method, SG-DQN, that (i) introduces a social attention mechanism to extract an efficient graph representation for the crowd-robot state; (ii) directly evaluates the coarse q-values of the raw state with a learned dueling deep Q network (DQN); and then (iii) refines the coarse q-values via online planning on possible future trajectories. The experimental results indicate that our model can help the robot better understand the crowd and achieve a high success rate of more than 0.99 in the crowd navigation task. Compared against previous state-of-the-art algorithms, our algorithm achieves an equivalent, if not better, performance while requiring less than half of the computational cost.
I. INTRODUCTION

In the last decade, a significant number of mobile service robots have been introduced into households, offices, and various public places. They share living and social space with humans and have varying degrees of interaction with humans (e.g., carrying food, cleaning rooms, and guiding visitors). Because humans are dynamic decision-making agents, a robot needs to understand interactions among humans and construct a proactive and foresighted collision avoidance strategy. However, this is still an open and challenging problem.

Navigating robots among dynamic obstacles has been thoroughly studied in robotics and related areas. There is a large body of work on classic navigation techniques, such as social force models [1], [2] and velocity obstacles [3]–[6]. These approaches often treat humans as simple dynamic obstacles, focusing only on reaction-based collision avoidance rules and ignoring pedestrians' decision-making processes. Therefore, such approaches cannot understand social scenarios well and result in shortsighted and occasionally unnatural behaviors. To address this problem, some researchers separate the crowd navigation task into two disjoint subtasks: first predicting pedestrians' trajectories and then planning collision-free paths [7]–[12].
Fig. 1: Illustration of our SG-DQN. When the robot navigates in a crowd, it selectively aggregates pedestrians' information with social attention weights, coarsely evaluates the state-action values of all actions with the learned dueling DQN, and quickly generates the best candidate actions. By performing rollouts on the current state, the robot refines the coarse state-action values of the best candidate actions and makes a more foresighted decision.

However, with increasing crowd size and density, trajectory-based approaches suffer from high computational cost and the freezing robot problem [13], in which the robot cannot find a plausible path. The key to addressing this issue is to consider crowd interactions.

With the rapid development of machine learning algorithms, some other studies have focused on deep reinforcement learning (DRL) because of its ability to model varying interactions in a crowd and generate a more foresighted path. Previous works have focused on value-based and model-based DRL methods [14]–[17]. Their framework includes two disjoint modules: a value function to estimate the state value of a given robot-crowd configuration and a state transition function to propagate agents' dynamics forward in time. Furthermore, Chen et al. [18] combined the learned model with
Monte-Carlo tree search (MCTS), selecting the action with the maximum d-step return. However, since they must traverse all successor states when estimating the state value, these methods require numerous computational resources, especially when performing a look-ahead rollout. Meanwhile, model-free DRL methods, such as the deep Q network (DQN) [19], [20], are widely used to learn control policies directly from the raw state with end-to-end training. However, most of them take raw sensor readings as input (e.g., point clouds and images) [21], [22]. This leads to a large state space, which makes training more challenging. Additionally, it is hard to learn a good understanding of a crowd scenario from raw sensor readings, which is the key to a proactive and foresighted navigation policy.

A key challenge for learning-based methods in crowd navigation is to obtain an extensible and efficient state representation, which involves two subproblems. The first subproblem is to model crowd interactions. Most previous models consider only pairwise interactions between the robot and each pedestrian while ignoring human-human interactions in the crowd, which are more common and important [17], [23], [24]. LM-SARL, proposed in [16], captures human-human interactions via a coarse-grained local map. However, the neighborhood of each human is reduced to a coarse grid, which greatly simplifies the human-human interactions in a crowd. The second subproblem is to aggregate the neighbors' information. The simplest method is to apply a pooling module [14], [25], treating all neighbors equally. Another method is to apply a long short-term memory (LSTM) module to combine the pedestrians' information sequentially according to their distances to the robot [23], [24]. Neither of them considers the crowd as a whole, losing important structural information. The social attention mechanism [7], [8], which models the relative influences among pedestrians in a crowd, is very suitable for selectively aggregating the neighbors' information. In LM-SARL, a social attentive pooling module is built to reason about the collective importance of neighboring humans [16]. Similar work can be found in [17], where a separate attention network was proposed to infer the relative importance of neighboring humans. After being trained with human gaze data, it can accurately predict human-like attention weights in crowd navigation scenarios.

Since a crowd typically produces non-Euclidean data, graph neural networks (GNNs) can be used to extract efficient representations in crowd navigation. In [18], Chen et al. proposed a relational graph model with a two-layer graph convolutional network (GCN) to reason about the relations among all agents. However, this algorithm does not show good performance in the crowd navigation task and encounters convergence problems. Even in the given simple simulated environment, the training process is always divergent. Another crowd navigation algorithm, G-GCNRL, utilizes two GCNs to learn human-like attention weights and integrate information about the crowd scenario [17]. The experimental results show that the introduction of GCNs greatly enhances the performance of G-GCNRL.

This work focuses on three improvements to address the above issues and proposes a graph-based reinforcement learning method named SG-DQN.
First, the social attention mechanism is introduced into the graph attention network (GAT) [26] to extract an efficient graph representation. Second, a dueling DQN [27] is utilized to coarsely evaluate the state-action values, which greatly shortens the time of value estimation. Third, the dueling DQN is combined with online planning, which fine-tunes the coarse evaluation with a simple and rapid rollout performance. Additionally, the reward function is redesigned based on current navigation scenarios to enhance training convergence. Finally, two types of scenarios, one simple and one complex, are designed to evaluate the proposed approach. The experimental results show that the proposed method helps the robot to better understand the crowd, navigates the robot along time-efficient and collision-free paths, and outperforms state-of-the-art methods. The code and the demonstration video are available at github.com/nubot-nudt/SG-DQN.

II. PROBLEM FORMULATION
A. Crowd Navigation Modeling
As a typical sequential decision-making problem, the crowd navigation task can be formulated in a reinforcement learning framework [14]. Suppose that a mobile robot navigates in a crowd of N pedestrians over discrete time steps. The natural number i is used to number the agents: 0 for the robot and i (i > 0) for the i-th pedestrian. The agents' configuration is described in robot-centric coordinates, where the robot is located at the origin and the positive direction of the x-axis points from its initial position to its goal position. For each agent, its configuration w can be divided into two parts, the observable state and the hidden state. In this work, the observable state consists of the agent's position p = [p_x, p_y], velocity v = [v_x, v_y] and radius ρ, and the hidden state includes the agent's intended goal position g = [g_x, g_y], preferred velocity v_p and heading angle θ. The robot cannot observe the pedestrians' hidden states. Therefore, the robot's input state s can be represented as:

s = [w_0, w_1, ..., w_N]
w_0 = [p_x, p_y, v_x, v_y, ρ, g_x, g_y, v_p, θ]
w_i = [p_ix, p_iy, v_ix, v_iy, ρ_i], i > 0     (1)

where w_0 is the full state of the robot and w_i is the observable state of the i-th pedestrian.

At time step t, the robot observes a state s_t, chooses an action from its discrete action set a_t ∈ A and then receives an immediate reward signal r_t = R(s_t, a_t) at time step t + 1. For holonomic robots, the action space consists of the stop action and 80 discrete actions: 5 speeds evenly spaced in (0, v_p] and 16 headings evenly spaced in [0, 2π). This work can easily be extended to car-like robots by limiting their headings.
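For concreteness, the 81-action discretization described above can be enumerated as in the following sketch. This is only an illustration of the stated discretization (stop action plus 5 speeds times 16 headings for a holonomic robot); the function name and data layout are not taken from the released code.

```python
import numpy as np

def build_action_space(v_pref):
    """Return the 81 discrete actions (vx, vy): one stop action plus
    5 speeds evenly spaced in (0, v_pref] x 16 headings in [0, 2*pi)."""
    actions = [(0.0, 0.0)]  # stop action
    speeds = [v_pref * (k + 1) / 5 for k in range(5)]    # 5 values in (0, v_pref]
    headings = [2 * np.pi * k / 16 for k in range(16)]   # 16 values in [0, 2*pi)
    for v in speeds:
        for theta in headings:
            actions.append((v * np.cos(theta), v * np.sin(theta)))
    return actions

actions = build_action_space(v_pref=1.0)
assert len(actions) == 81
```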
B. Reinforcement Learning Based on the Q-Value

In this work, a dueling DQN is utilized to directly evaluate the state-action values (q-values for simplicity) and select the best action, taking the agent-level state as input. There are two main reasons for this. First, the agent-level state can help the robot to better understand crowd scenarios. Second, the dimension of the agent-level state is much smaller than that of raw sensor data, which greatly reduces the computational cost. The objective is to develop the optimal deterministic policy, π*: s_t → a_t, that maximizes the expected discounted return of the robot in reaching its goal:

π*(s_t) = argmax_{a_t} Q*(s_t, a_t)     (2)

Q*(s, a) is the corresponding optimal state-action function, which satisfies the Bellman equation:

Q*(s, a) = Σ_{s', r} P(s', r | s, a) [r + γ^(Δt · v_p) max_{a'} Q*(s', a')]     (3)

where s' and r are the successor state and the immediate reward, respectively. γ ∈ (0, 1) is the discount factor that balances the immediate and future rewards; its exponent is normalized by the preferred velocity v_p and the time step Δt [14], [15]. The transition probability is described by P(s', r | s, a).
C. Reward Shaping

The reward function is also a highly essential point in deep reinforcement learning. However, previous works ignore this point and apply the reward function from [14], which was originally designed to resolve the noncommunicating two-agent collision avoidance problem. As the scene continues to expand, the mismatched reward makes the training process challenging and results in poor training convergence [18]. In this work, the reward function is redesigned with three parts: R_g, R_c and R_s. R_g is designed to navigate the robot towards its goal, R_c is built to penalize collision cases, and R_s is designed to reward the robot for maintaining a safe distance from all pedestrians. Formally, the reward function at time step t is given as:

R^t = R_g^t + R_c^t + R_s^t     (4)

where R_g^t, R_c^t and R_s^t are given by:

R_g^t = { r_g, if the robot reaches its goal; ||p^{t-1} − g|| − ||p^t − g||, otherwise }     (5)

R_c^t = { r_c, if ||p^t − p_i^t|| < ρ + ρ_i; 0, otherwise }     (6)

R_s^t = Σ_{i=1}^{N} f(d_i^t, d_s),  f(d_i^t, d_s) = { Δt · (d_i^t − d_s), if d_i^t < d_s; 0, otherwise }     (7)

Here, p and g are the position and goal of the robot, d_s denotes the threshold distance that the robot needs to maintain from pedestrians at all times, and d_i^t is the actual minimum separation distance between the robot and the i-th pedestrian during the time step. Here, r_g = 10, r_c is a negative collision penalty, and d_s is a small positive safety distance.
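To make the shaping concrete, the sketch below implements the structure of Eqs. (4)–(7). It is a minimal illustration, not the released implementation: the robot radius `rho`, the goal tolerance `goal_radius`, and the values of `r_c` and `d_s` are placeholder assumptions, and `d_min` is assumed to hold the per-pedestrian minimum separations observed during the step.

```python
import numpy as np

def reward(p_prev, p, goal, ped_positions, ped_radii, d_min, dt,
           rho=0.3, r_g=10.0, r_c=-0.25, d_s=0.2, goal_radius=0.3):
    """Shaped reward R = R_g + R_c + R_s of Eqs. (4)-(7); constants are illustrative."""
    # R_g: terminal goal reward, otherwise progress made towards the goal
    if np.linalg.norm(p - goal) < goal_radius:
        r_goal = r_g
    else:
        r_goal = np.linalg.norm(p_prev - goal) - np.linalg.norm(p - goal)
    # R_c: fixed penalty if the robot overlaps any pedestrian
    dists = np.linalg.norm(ped_positions - p, axis=1)
    r_col = r_c if np.any(dists < rho + ped_radii) else 0.0
    # R_s: sum of safety-distance terms over pedestrians (negative when d_i < d_s)
    r_safe = sum(dt * (d - d_s) for d in d_min if d < d_s)
    return r_goal + r_col + r_safe
```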
III. METHODOLOGY

As shown in Fig. 2, a new framework is proposed to navigate social robots in a crowd, which can be divided into three parts. The first part is a two-layer GAT that extracts an efficient graph representation from the crowd-robot state, which is the basis of the q-value estimators. The second part is the dueling DQN, which coarsely estimates the q-values of the current state. The third part is the online planner, which performs rollouts in the near and short-term future based on the dueling DQN and the environment model.

Fig. 2: The framework of SG-DQN. The crowd-robot state is described in the form of graph data. Then, the dueling DQN utilizes a two-layer GAT to extract a high-level representation and directly evaluates the q-values of the current state. The core of the online planner is the rollout performance based on the learned dueling DQN and the environment model, both of which are optimized with simulated trajectories.
A. Graph Representation with Social Attention
In this work, a social attention mechanism is introduced into the GAT to learn a good graph representation of the crowd-robot state. Unlike [16], attention weights are computed for all agents, regardless of whether they are the robot or pedestrians. Both robot-human interactions and human-human interactions can be modeled via the same graph convolution operation.

In the graph representation, the nodes and edges represent agents and connections between agents. Since agents often have different types, the nodes' dimensions differ (e.g., w_0 ∈ R^9 and w_i ∈ R^5, i > 0). Therefore, multilayer perceptrons (MLPs) are utilized to extract fixed-length latent states, which are fed into the subsequent graph convolutional layers. The latent state of the i-th node is given as:

h_0 = Ψ_r(w_0; W_r)
h_i = Ψ_p(w_i; W_p), i = 1, 2, ..., N     (8)

where Ψ_r and Ψ_p are MLPs with ReLU activations, and their network weights are represented by W_r and W_p, respectively. The superscript l is used to denote the layer number. The input of the first layer is denoted by H^1 = [h_0^1, h_1^1, ..., h_N^1], where h_i^1 = h_i for all nodes.

Fig. 3: The layerwise graph convolution operation on the i-th node and its neighborhood (e.g., h_0 and h_N). l is the layer number, Q is the query matrix and K is the key matrix.

Then, a social attention mechanism is introduced into the spatial graph convolution operation to model interactions in a crowd. First, a query matrix Q and a key matrix K are built to transform the input features into higher-level features. Then, these features are concatenated to compute attention coefficients e_ij via a fully connected network (FCN). The FCN is followed by a LeakyReLU nonlinearity with a small negative input slope α. Finally, the attention coefficients are normalized via a softmax function. A sketch of the layerwise graph convolution operation on the i-th node and its neighborhood is shown in Fig. 3, and the corresponding layerwise graph convolution rule is given as:

e_ij = LeakyReLU(a(q_i || k_j))
α_ij = softmax_j(e_ij) = exp(e_ij) / Σ_{k ∈ N(i)} exp(e_ik)
h_i^{l+1} = σ(Σ_{j ∈ N(i)} α_ij^l h_j^l)     (9)

where q_i = Ψ_q(h_i; Q), k_j = Ψ_k(h_j; K), a(·) is the attention function and || is the concatenation operation. A layerwise FCN Ψ_a(·; A) is utilized to map concatenated states to attention weights. α_ij is the normalized attention weight, indicating the importance of the j-th node to the i-th node, and N(i) is the neighborhood of the i-th node in the graph. Similar to previous work [14]–[18], it is assumed that the robot can obtain accurate observable states of all pedestrians; therefore, the neighborhood of the i-th agent includes all pedestrians. Considering that there are indirect interactions in a crowd, the GAT is equipped with two graph convolutional layers to model both direct and indirect interactions. The final output of the two-layer GAT can be described by:

H = H^1 + H^2 + H^3     (10)
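The single-layer update of Eq. (9) can be sketched numerically as below. This is a minimal NumPy illustration under explicit assumptions: Ψ_q, Ψ_k and the attention FCN a(·) are reduced to single linear maps, the neighborhood is fully connected, σ is taken as ReLU, and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, Wq, Wk, a_vec):
    """One social-attention graph convolution (Eq. 9) over node features H (N x F)."""
    Q, K = H @ Wq, H @ Wk                   # queries q_i and keys k_j
    n = H.shape[0]
    E = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # attention coefficient e_ij from the concatenated (q_i || k_j)
            E[i, j] = leaky_relu(np.concatenate([Q[i], K[j]]) @ a_vec)
    E = E - E.max(axis=1, keepdims=True)                    # numerical stabilization
    A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)    # row-wise softmax
    return np.maximum(A @ H, 0.0), A                        # sigma = ReLU, attention weights

# toy example: robot + 5 pedestrians, 32-d latent states, random placeholder weights
rng = np.random.default_rng(0)
H1 = rng.normal(size=(6, 32))
Wq, Wk = rng.normal(size=(32, 32)) * 0.05, rng.normal(size=(32, 32)) * 0.05
a_vec = rng.normal(size=(64,)) * 0.05
H2, attn = gat_layer(H1, Wq, Wk, a_vec)
H3, _ = gat_layer(H2, Wq, Wk, a_vec)   # second layer models indirect interactions
```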
B. Graph-Based Deep Q-Learning

Encouraged by the great success of DQNs [19], [20] and their variants in reinforcement learning, a dueling DQN is built to estimate q-values in this work.
Algorithm 1: Deep Q-Learning
Output: Q-value network Q(·; θ)
 1: Initialize empty experience replay memory E
 2: Initialize the dueling DQN Q(·; θ) with random weights θ
 3: Duplicate the target network Q'(·; θ') ← Q(·; θ)
 4: for episode ← 1 to num_of_episodes do
 5:     Initialize a random crowd-robot state s_0
 6:     while not reach goal, collide or timeout do
 7:         a_t ← argmax_a Q(s_t, a; θ)
 8:         Execute a_t and obtain r_t and s_{t+1}
 9:         s_t ← s_{t+1}
10:         Assimilate tuple (s_t, a_t, r_t, s_{t+1}) into E
11:     end
12:     Sample a random subset M from E
13:     Update Q(·; θ) with M and Q'(·; θ')
14:     for every C episodes do
15:         Evaluate the value network Q(·; θ)
16:         Q'(·; θ') ← Q(·; θ)
17:     end
18: end

The significant difference from the previous state value network is that the dueling DQN requires only the current state as input. It is a model-free RL method and does not need to evaluate subsequent states.

As shown at the bottom of Fig. 2, the dueling DQN includes three separate modules: MLPs that extract fixed-length latent states, a two-layer GAT that obtains the graph representation H, and a dueling architecture that estimates q-values. All these parameters are trained in an end-to-end manner.

The dueling architecture consists of two streams that represent the state value and state-dependent advantage functions [27]. In this work, a two-layer MLP, Ψ_c(H; α), is built as a common feature learning module, followed by two fully connected layers, Ψ_v(·; β) and Ψ_d(·; η), to obtain a scalar V(H; α, β) and an |A|-dimensional vector D(H, a; α, η). Here, H is the final graph representation in Eq. 10, and α, β and η are the parameters of the dueling architecture. Finally, the state-action function Q(H, a; ·) can be described by:

Q(H, a; α, β, η) = V(H; α, β) + D(H, a; α, η)     (11)

The learning algorithm is shown in Algorithm 1, where θ denotes all parameters of the dueling DQN. In each episode, the crowd-robot state s is initialized randomly (line 5). Afterward, the robot is simulated to navigate in the crowd using an ε-greedy policy (line 7). The experience replay technique [20], [28] is also used to increase data efficiency and reduce the correlation among samples. At every step, the newly generated transitions are assimilated into an experience replay memory E (line 10). At every training step, a minibatch of experiences M is sampled from E to update the network (line 12). To promote convergence, the network Q(·; θ) is cloned to build a target network Q'(·; θ') (line 16), which generates the Q-learning target values for the next C = 500 episodes. The loss function is defined as:

L(θ) = Σ (r + γ^(Δt · v_p) max_{a'} Q'(s', a'; θ') − Q(s, a; θ))²     (12)
C. Online Planning Based on Rollout Performance

Previous works have proven that a DQN has the potential to partly address the shortsightedness problem in crowd navigation. However, with unknown human behavior policies, it is difficult to learn a perfect network model. Inspired by [29]–[32], the learned DQN is combined with online planning to refine the coarse q-values by performing rollouts on the current state and reasoning about the coming future. The depth and width of the rollout performance are denoted by d and k. A smaller k means that the navigation policy relies more on model-free RL and has a lower computational burden. When k = 1, the algorithm degenerates into a pure model-free RL policy. d is designed to balance the importance of the current state and future states, and has a similar effect on the navigation policy as k. A two-step rollout in the look-ahead tree search diagram is shown in Fig. 4.

Fig. 4: A two-step rollout on state s_t. For simplicity, a blue arrow and a red arrow are used to denote the learned DQN and the environment model, respectively.

In the rollout performance, the learned dueling DQN and the environment model are applied recursively to build a look-ahead tree. In particular, at each expansion, the dueling DQN is used to generate q-values and the best candidate actions (e.g., the actions with the top-k q-values). The environment model is utilized to predict the corresponding rewards and subsequent states based on the current states and candidate actions. Then, the q-values can be refined with d-step returns, which can be described as:

Q_d(s, a; θ) = { Q(s, a; θ), if d = 0;
                 (d / (d + 1)) Q(s, a; θ) + (1 / (d + 1)) (r_{d-1} + γ^(Δt · v_p) max_{a'} Q_{d-1}(s', a'; θ)), otherwise }     (13)

In addition, the environment model and the RL model are separated here; the environment model can be any model for crowd state prediction or human trajectory prediction. In this work, the crowd prediction framework proposed in [18] is used to learn an environment model from simulated trajectories. There are three main reasons. First, an environment model is necessary to perform rollouts, but it is impossible and impractical for the robot to query the real environment model. Second, with R_s encouraging the robot to keep a safe distance from pedestrians, model fidelity is less important in the rollout performance. Third, crowd prediction has shown good performance in previous work.
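The d-step refinement of Eq. (13) can be written as a short recursion. The sketch below is illustrative rather than the released implementation: `q_net` stands for the learned dueling DQN, `env_model` for a learned crowd prediction model returning (reward, next state) for a state-action pair, and the discount constants and the depth/width values (d = 1, k = 10, Sec. III-D) are assumptions stated in the code.

```python
import numpy as np

GAMMA, DT, V_PREF = 0.9, 0.25, 1.0   # illustrative discount, time step, preferred speed

def refined_q(state, action, d, q_net, env_model, k=10):
    """d-step refined q-value (Eq. 13), expanding only the top-k candidate actions."""
    q_coarse = q_net(state)[action]
    if d == 0:
        return q_coarse
    reward, next_state = env_model(state, action)    # one-step model prediction
    next_q = q_net(next_state)
    candidates = np.argsort(next_q)[-k:]             # best candidate actions
    best_future = max(refined_q(next_state, a, d - 1, q_net, env_model, k)
                      for a in candidates)
    discount = GAMMA ** (DT * V_PREF)
    return (d / (d + 1)) * q_coarse + (1 / (d + 1)) * (reward + discount * best_future)
```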
D. Implementation Details

In this work, the hidden units of Ψ_r(·), Ψ_p(·), Ψ_c, Ψ_v and Ψ_d have dimensions (64, 32), (64, 32), (128), (128) and (128), respectively. In every layer, the output dimensions of Ψ_q(·) and Ψ_k(·) are 32. In the dueling architecture, Ψ_c, Ψ_v and Ψ_d have output dimensions of 128, 1, and 81, respectively. All the parameters are trained via reinforcement learning and updated by Adam [33]. The learning rate and discount factor are 0.0005 and 0.9, respectively. In addition, the exploration rate of the ε-greedy policy decays linearly from 0.5 to 0.1 in the first 5000 episodes and remains 0.1 for the remaining 5000 episodes. In the rollout performance, the planning depth and planning width are set to 1 and 10, respectively.
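These settings can be collected into a small training configuration; the sketch below mirrors the hyperparameters listed above and implements the stated linear ε-decay schedule. The dictionary keys and function name are illustrative, not taken from the released code.

```python
CONFIG = {
    "learning_rate": 5e-4,    # Adam step size
    "gamma": 0.9,             # discount factor
    "target_update": 500,     # C: episodes between target-network updates
    "rollout_depth": 1,       # d
    "rollout_width": 10,      # k
    "num_episodes": 10000,
}

def epsilon(episode, eps_start=0.5, eps_end=0.1, decay_episodes=5000):
    """Exploration rate: linear decay from 0.5 to 0.1 over the first 5000
    episodes, then held constant at 0.1."""
    if episode >= decay_episodes:
        return eps_end
    return eps_start + (eps_end - eps_start) * episode / decay_episodes
```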
IV. EXPERIMENTS

A. Simulation Setup
SG-DQN has been evaluated in both simple and complex scenarios. The simple scenario is built based on CrowdNav [16], in which only five pedestrians, placed on a circle of radius 4 m, are controlled by ORCA [5]. For each pedestrian, the initial position and the goal position are set on the circle, symmetric about its center. Therefore, the pedestrians are likely to interact near the center of the circle at approximately 4.0 s. In addition to the five circle-crossing pedestrians, the complex scenario introduces another five square-crossing pedestrians, whose initial positions and goal positions are sampled randomly in a square with a side length of 10 m. The square shares the same center as the circle. In either scenario, once a pedestrian arrives at its goal position, a new goal position is reset randomly within the square, and the former goal position is recorded as a turning position. The initial position, the turning position, and the final goal position of pedestrian 0 are marked in Fig. 6(a). Note that the robot is invisible in both scenarios. As a result, the simulated pedestrians never give way to the robot, which requires the robot to have a more proactive and foresighted collision avoidance policy.
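The circle-crossing part of this setup can be generated as in the following sketch. It assumes the stated geometry (radius 4 m, start and goal antipodal on the circle) plus a small random perturbation of the kind commonly used in CrowdNav-style simulators, which is an assumption here rather than a detail given in the text.

```python
import numpy as np

def circle_crossing_positions(n_peds=5, radius=4.0, noise=0.5, seed=0):
    """Start/goal pairs on a circle of the given radius, symmetric about its
    center, with small random jitter to avoid exact overlaps."""
    rng = np.random.default_rng(seed)
    starts, goals = [], []
    for _ in range(n_peds):
        angle = rng.uniform(0.0, 2 * np.pi)
        jitter = rng.uniform(-noise, noise, size=2)
        start = radius * np.array([np.cos(angle), np.sin(angle)]) + jitter
        starts.append(start)
        goals.append(-start)   # goal is antipodal, symmetric about the center
    return np.array(starts), np.array(goals)

starts, goals = circle_crossing_positions()
```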
B. Qualitative Evaluation

1) Training Process: SG-DQN is trained in both simple scenarios and complex scenarios. The resulting training curves are shown in Fig. 5. At the beginning of training, it is difficult to complete the crowd navigation task with a randomly initialized model, and most termination states are timeouts. As the training continues, the robot quickly learns to keep a safe distance from pedestrians and slowly learns to understand the crowd. In the last period of training, SG-DQN achieves relatively stable performance. The difference between the two scenarios is predictable: there are more interactions in complex scenarios, which makes the environment more challenging and difficult. More detailed quantitative results are described in Sec. IV-C.

Fig. 5: Training curves in simple scenarios (Fig. 5(a)) and complex scenarios (Fig. 5(b)). From top to bottom, the y-axis corresponds to the total reward, average return, navigation time and discomfort rate per episode, averaged over the last 100 episodes.
2) Collision Avoidance Behaviors:
With the learned policy, the robot is able to reach its goal position safely and quickly in both simple and complex scenarios. The resulting trajectory diagrams are shown in Fig. 6. In the complex scenario, the robot has to pay more attention to avoiding pedestrians, resulting in a rougher trajectory and a longer navigation time. In both simple and complex scenarios, the robot performs proactive and foresighted collision avoidance behaviors. The robot can always recognize and avoid the approaching interaction center of the crowd. For example, in the simple scenario, the robot turns right suddenly at approximately 1.5 s to avoid the potential encirclement at 4.0 s. In addition, as shown in Fig. 6(b), even if the robot is trapped in an encirclement of pedestrians, it is able to escape safely. Here, the encirclement of three pedestrians starts at 0.0 s and lasts approximately 4.0 s; the encirclements are indicated by lines with arrows.
Fig. 6: Trajectory diagrams for a simple scenario (Fig. 6(a)) and a complex scenario (Fig. 6(b)). Here, the discs represent agents, black for the robot and other colors for pedestrians. The numbers near the discs indicate the time; the time interval between two consecutive discs is 1.0 s. The initial positions, the turning positions and the final goal positions are marked with triangles, squares and five-pointed stars, respectively.
Fig. 7: Robot attention weights for agents in a simple scenario (Fig. 7(a)) and a complex scenario (Fig. 7(b)). The natural numbers are the agents' serial numbers and the two-digit decimals are the attention weights the robot assigns to the agents. The red arrows indicate the velocity directions of the agents. The initial position and goal position of the robot are marked by a black disc and a red five-pointed star, respectively.
3) Attention Modeling:
Fig. 7 shows the attention weights in two test crowd scenes. In both simple and complex scenarios, the robot pays more attention to pedestrians who are close to it or moving towards it, e.g., pedestrian 3 in Fig. 7(a) and pedestrians 0, 3, and 8 in Fig. 7(b). Additionally, the robot also pays more attention to pedestrians who may interact with it. An example is the attention weight given to pedestrian 2 in the complex scenario, which shows that the robot's understanding of the crowd is foresighted. Another interesting point is the self-attention weight of the robot: the fewer pedestrians there are around the robot, the greater the self-attention weight. This means that the robot is able to balance its navigation task and its collision avoidance behavior.

TABLE I: Quantitative results in simple scenarios.
Methods            Success   Collision   Nav. Time (s)   Disc. Rate   Avg. Return   Run Time (ms)
ORCA [5]           0.824     0.176       12.07           0.053        4.476
dueling DQN        0.986     0.013       11.25           0.011        6.120         2.67
LM-SARL [16]

TABLE II: Quantitative results in complex scenarios.

Methods            Success   Collision   Nav. Time (s)   Disc. Rate   Avg. Return   Run Time (ms)
ORCA [5]           0.769     0.222       13.88           0.095        3.689
dueling DQN        0.950     0.048       13.66           0.029        5.283         2.76
LM-SARL [16]       0.993     0.007       12.47
C. Quantitative Evaluation
Three existing state-of-the-art methods, ORCA [5], LM-SARL [16] and MP-RGL-Onestep [18], are implemented as baseline methods. In addition, to verify the effect of the rollout performance, a dueling DQN version of SG-DQN is also developed as a contrast algorithm by setting the planning depth to 0. In the implementation of ORCA, the pedestrians' radii are enlarged by d_s so that the robot maintains a safe distance from pedestrians. For a fair comparison, all RL methods apply the same reward function. The implementation of MP-RGL-Onestep is different from the original version proposed in [18]: with two independent graph models, the state value prediction and the human motion prediction are two separate modules. In addition, the training process has been repeated six times, and the resulting models are evaluated on 1000 random test cases. The random seeds in the six training processes are 7, 17, 27, 37, 47, and 57. The statistical results are shown in Table I (for simple scenarios) and Table II (for complex scenarios).

In the quantitative evaluation, the metrics include: "Success", the rate of the robot reaching its goal safely; "Collision", the rate of the robot colliding with pedestrians; "Nav. Time", the navigation time to reach the goal in seconds; "Disc. Rate", the ratio of the number of steps causing discomfort (i.e., the robot violates the safety distance of the pedestrians) to the total number of steps; "Avg. Return", the discounted cumulative return averaged over steps; and "Run Time", the running time per iteration in milliseconds. All metrics are averaged across all episodes and test cases.

As expected, the ORCA method has the highest collision rate, the longest navigation time and the best real-time performance in both simple and complex scenarios. This is because ORCA is a purely reactive policy, which easily leads the robot into inevitable collision states (ICS) [34] and causes the freezing robot problem [13]. The results of ORCA show the necessity of a foresighted policy.

The dueling DQN achieves better performance than ORCA but worse performance than the other three RL algorithms. On the one hand, this shows that even though the dueling DQN is a reactive policy, it can help the robot make foresighted decisions after being trained with sufficient experience. On the other hand, it illustrates that as the scenario becomes increasingly complex, it becomes increasingly challenging to learn a simple control policy mapping the raw state directly to the best action. It is therefore necessary to integrate the dueling DQN with online planning.

LM-SARL, MP-RGL-Onestep and SG-DQN all achieve performance far superior to that of the dueling DQN, with success rates higher than 0.99. In simple scenarios, SG-DQN performs better than all the others, achieving the shortest navigation time and the largest discounted cumulative return. This can be attributed to the dueling DQN, which stores some shortcuts and is independent of the learned environment model. In complex scenarios, the performance of SG-DQN is at least equivalent to, if not better than, the others. The increased crowd interaction requires the robot to pay more attention to avoiding pedestrians and narrows the gap between SG-DQN and the other two algorithms. Next, consider the Run Time metric. Regardless of whether the scenario is simple or complex, SG-DQN requires a much lower computational cost, taking 10.18 ms and 14.12 ms per iteration in simple and complex scenarios, respectively.
This is approximately half the run time of MP-RGL-Onestep and a quarter that of LM-SARL. In addition, SG-DQN causes a slight decrease in Success and a slight increase in Disc. Rate. Considering the halved computational cost, this trade-off is quite acceptable.

V. CONCLUSION
In this paper, we propose SG-DQN, a graph-based reinforcement learning method for mobile robot navigation in a crowd, which achieves a high success rate of more than 0.99. Compared against the state-of-the-art methods, SG-DQN achieves equivalent, if not better, performance in both simple and complex scenarios while requiring less than half of the computational cost. Its success can be attributed to three innovations proposed in this work. The first is the introduction of the social attention mechanism into the spatial graph convolution operation; with the improved two-layer GAT, an efficient graph representation of the crowd-robot state can be extracted. The second is the application of a dueling DQN, which directly evaluates the coarse q-values of the current state and quickly generates the best candidate actions, greatly reducing the computational cost. The third is the integration of the learning method with online planning: by performing rollouts on the current state, the coarse q-values generated by the dueling DQN are refined with a tree search. These innovations may also be useful in other similar applications.
REFERENCES

[1] D. Helbing and P. Molnár, "Social force model for pedestrian dynamics," Phys. Rev. E, vol. 51, pp. 4282–4286, May 1995.
[2] G. Ferrer, A. Garrell, and A. Sanfeliu, "Robot companion: A social-force based approach with human awareness-navigation in crowded environments," pp. 1688–1694, 2013.
[3] P. Fiorini and Z. Shiller, "Motion planning in dynamic environments using velocity obstacles," The International Journal of Robotics Research, vol. 17, no. 7, pp. 760–772, 1998.
[4] J. van den Berg, M. Lin, and D. Manocha, "Reciprocal velocity obstacles for real-time multi-agent navigation," pp. 1928–1935, 2008.
[5] J. van den Berg, S. J. Guy, M. Lin, and D. Manocha, "Reciprocal n-body collision avoidance," in Robotics Research, pp. 3–19, 2011.
[6] S. Kim, S. J. Guy, W. Liu, D. Wilkie, R. W. Lau, M. C. Lin, and D. Manocha, "BRVO: Predicting pedestrian trajectories using velocity-space reasoning," The International Journal of Robotics Research, vol. 34, no. 2, pp. 201–217, 2015.
[7] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, "Social LSTM: Human trajectory prediction in crowded spaces," pp. 961–971, 2016.
[8] A. Vemula, K. Muelling, and J. Oh, "Social attention: Modeling attention in human crowds," pp. 4601–4607, 2018.
[9] A. Rudenko, L. Palmieri, S. Herman, K. M. Kitani, D. M. Gavrila, and K. Arras, "Human motion trajectory prediction: a survey," The International Journal of Robotics Research, vol. 39, pp. 895–935, 2019.
[10] K. D. Katyal, G. D. Hager, and C. M. Huang, "Intent-aware pedestrian prediction for adaptive crowd navigation," pp. 3277–3283, 2020.
[11] J. Sun, Q. Jiang, and C. Lu, "Recursive social behavior graph for trajectory prediction," pp. 657–666, 2020.
[12] T. Salzmann, B. Ivanovic, P. Chakravarty, and M. Pavone, "Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data," 2021, arXiv:2001.03093.
[13] P. Trautman and A. Krause, "Unfreezing the robot: Navigation in dense, interacting crowds," pp. 797–803, 2010.
[14] Y. F. Chen, M. Liu, M. Everett, and J. P. How, "Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning," pp. 285–292, 2017.
[15] Y. F. Chen, M. Liu, M. Everett, and J. P. How, "Socially aware motion planning with deep reinforcement learning," pp. 1343–1350, 2017.
[16] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," pp. 6015–6022, 2019.
[17] Y. Chen, C. Liu, B. E. Shi, and M. Liu, "Robot navigation in crowds by graph convolutional networks with attention learned from human gaze," IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2754–2761, 2020.
[18] C. Chen, S. Hu, P. Nikdel, G. Mori, and M. Savva, "Relational graph learning for crowd navigation," pp. 10007–10013, 2020.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," in NIPS Deep Learning Workshop, 2013.
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[21] A. J. Sathyamoorthy, J. Liang, U. Patel, T. Guan, R. Chandra, and D. Manocha, "DenseCAvoid: Real-time navigation in dense crowds using anticipatory behaviors," pp. 11345–11352, 2020.
[22] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, "Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning," pp. 6252–6259, 2018.
[23] M. Everett, Y. F. Chen, and J. P. How, "Motion planning among dynamic, decision-making agents with deep reinforcement learning," pp. 3052–3059, 2018.
[24] M. Everett, Y. F. Chen, and J. P. How, "Collision avoidance in pedestrian-rich environments with deep reinforcement learning," IEEE Access, vol. 9, pp. 10357–10377, 2021.
[25] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, "Social GAN: Socially acceptable trajectories with generative adversarial networks," pp. 2255–2264, 2018.
[26] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, "Graph attention networks," 2018.
[27] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, "Dueling network architectures for deep reinforcement learning," in Proceedings of The 33rd International Conference on Machine Learning, vol. 48, pp. 1995–2003, 2016.
[28] L. J. Lin, Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, January 1993.
[29] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh, "Action-conditional video prediction using deep networks in Atari games," in Proceedings of the 28th International Conference on Neural Information Processing Systems, vol. 2, pp. 2863–2871, 2015.
[30] J. Oh, S. Singh, and H. Lee, "Value prediction network," in Advances in Neural Information Processing Systems 30, pp. 6118–6128, 2017.
[31] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354–359, 2017.
[32] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, "Mastering Atari, Go, chess and shogi by planning with a learned model," Nature, vol. 588, pp. 604–609, 2020.
[33] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2015.
[34] T. Fraichard and H. Asama, "Inevitable collision states — a step towards safer robots?,"