Where to go next: Learning a Subgoal Recommendation Policy for Navigation Among Pedestrians
Bruno Brito, Michael Everett, Jonathan P. How, Javier Alonso-Mora
Abstract— Robotic navigation in environments shared with other robots or humans remains challenging because the intentions of the surrounding agents are not directly observable and the environment conditions are continuously changing. Local trajectory optimization methods, such as model predictive control (MPC), can deal with those changes but require global guidance, which is not trivial to obtain in crowded scenarios. This paper proposes to learn, via deep Reinforcement Learning (RL), an interaction-aware policy that provides long-term guidance to the local planner. In particular, in simulations with cooperative and non-cooperative agents, we train a deep network to recommend a subgoal for the MPC planner. The recommended subgoal is expected to help the robot in making progress towards its goal and accounts for the expected interaction with other agents. Based on the recommended subgoal, the MPC planner then optimizes the inputs for the robot satisfying its kinodynamic and collision avoidance constraints. Our approach is shown to substantially improve the navigation performance in terms of number of collisions as compared to prior MPC frameworks, and in terms of both travel time and number of collisions compared to deep RL methods in cooperative, competitive and mixed multiagent scenarios.
I. INTRODUCTION

Autonomous robot navigation in crowds remains difficult due to the interaction effects among navigating agents. Unlike multi-robot environments, robots operating among pedestrians require decentralized algorithms that can handle a mixture of other agents' behaviors without depending on explicit communication between agents.

Several state-of-the-art collision avoidance methods employ model predictive control (MPC) with online optimization to compute motion plans that are guaranteed to respect important constraints [1]. These constraints could include the robot's nonlinear kino-dynamics model or collision avoidance of static obstacles and other dynamic, decision-making agents (e.g., pedestrians). Although online optimization becomes less computationally practical for extremely dense scenarios, modern solvers enable real-time motion planning in many situations of interest [2].
This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 101017008, the Amsterdam Institute for Advanced Metropolitan Solutions, the Netherlands Organisation for Scientific Research (NWO) domain Applied Sciences (Veni 15916), and Ford Motor Company.
The authors are with the Cognitive Robotics (CoR) department, Delft University of Technology, 2628 CD Delft, The Netherlands {bruno.debrito, j.alonsomora}@tudelft.nl
The authors are with the Massachusetts Institute of Technology, Aerospace Controls Laboratory, Cambridge, MA, USA. {mfe, jhow}@mit.edu
Code: https://github.com/tud-amr/go-mpc.git
Video: https://youtu.be/sZBbWMnwle8
Fig. 1: Proposed navigation architecture. The subgoal planner observes the environment and suggests the next subgoal position to the local motion planner, the MPC. The MPC then computes a local trajectory and the robot executes the next optimal control command, which minimizes the distance to the provided position reference while respecting collision and kinodynamic constraints.

A key challenge is that the robot's global goal is often located far beyond the planning horizon, meaning that a local subgoal or cost-to-go heuristic must be specified instead. This is straightforward in a static environment (e.g., using Euclidean/diffusion [3] distance), but the presence of interactive agents makes it difficult to quantify which subgoals will lead to the global goal quickest. A body of work addresses this challenge with deep reinforcement learning (RL), in which agents learn a model of the long-term cost of actions in an offline training phase (usually in simulation) [4]–[7]. The learned model is fast to query during online execution, but the way learned costs/policies have been used to date does not provide guarantees on collision avoidance or feasibility with respect to the robot dynamics.

In this paper, we introduce Goal Oriented Model Predictive Control (GO-MPC), which enhances state-of-the-art online optimization-based planners with a learned global guidance policy. In an offline RL training phase, an agent learns a policy that uses the current world configuration (the states of the robot and other agents, and a global goal) to recommend a local subgoal for the MPC, as depicted in Fig. 1. Then, the MPC generates control commands ensuring that the robot dynamics and collision avoidance constraints are satisfied (if a feasible solution is found) while making progress towards the suggested subgoal. Our approach maintains the kino-dynamic feasibility and collision avoidance guarantees inherent in an MPC formulation, while improving the average time-to-goal and success rate by leveraging past experience in crowded situations.

The main contributions of this work are:
• A goal-oriented Model Predictive Control method (GO-MPC) for navigation among interacting agents, which utilizes a learned global guidance policy (recommended subgoal) in the cost function and ensures that dynamic feasibility and collision avoidance constraints are satisfied when a feasible solution to the optimization problem is found;
• An algorithm to train an RL agent jointly with an optimization-based controller in mixed environments, which is directly applicable to real hardware, reducing the sim-to-real gap.

Finally, we present simulation results demonstrating an improvement over several state-of-the-art methods in challenging scenarios with realistic robot dynamics and a mixture of cooperative and non-cooperative neighboring agents. Our approach shows different navigation behaviors: navigating through the crowd when interacting with cooperative agents, avoiding congestion areas when non-cooperative agents are present, and enabling communication-free decentralized multi-robot collision avoidance.
A. Related Work

1) Navigation Among Crowds:
Past work on navigation in cluttered environments often focuses on interaction models using geometry [8], [9], physics [10], topologies [11], [12], handcrafted functions [13], cost functions [14], or joint probability distributions [15] learned from data. While accurate interaction models are critical for collision avoidance, this work emphasizes that the robot's performance (time-to-goal) is highly dependent on the quality of its cost-to-go model (i.e., the module that recommends a subgoal for the local planner). Designing a useful cost-to-go model in this problem remains challenging, as it requires quantifying how "good" a robot's configuration is with respect to dynamic, decision-making agents. In [4], deep RL was introduced as a way of modeling cost-to-go through an offline training phase; the online execution used simple vehicle and interaction models for collision-checking. Subsequent works incorporated other interactions to generate more socially compliant behavior within the same framework [5], [16]. To relax the need for simple online models, [6] moved the collision-checking to the offline training phase. While these approaches use pre-processed information typically available from perception pipelines (e.g., pedestrian detection, tracking systems), other works proposed to learn end-to-end policies [7], [17]. Although all of these RL-based approaches learn to estimate the cost-to-go, the online implementations do not provide guarantees that the recommended actions will satisfy realistic vehicle dynamics or collision avoidance constraints. Thus, this work builds on the promising idea of learning a cost-to-go model, but we start from an inherently safe MPC formulation.
2) Learning-Enhanced MPC:
Outside the context of crowd navigation, numerous recent works have proposed learning-based solutions to overcome some of the known limitations of optimization-based methods (e.g., nonlinear MPC) [18]. For example, solvers are often sensitive to the quality of the initial guess; hence, [19] proposes to learn a policy from data that efficiently "warm-starts" an MPC.
Model inaccuracies can lead to sub-optimal MPC solution quality; [20] proposes to learn a policy by choosing between two actions with the best expected reward at each timestep: one from model-free RL and one from a model-based trajectory optimizer. Alternatively, RL can be used to optimize the weights of an MPC-based Q-function approximator or to update a robust MPC parametrization [21]. When the model is completely unknown, [22] shows a way of learning a dynamics model to be used in MPC.
Computation time is another key challenge: [23] learns a cost-to-go estimator to enable shortening of the planning horizon without sacrificing much solution quality, although their approach differs from this work as it uses local and linear function approximators, which limits its applicability to high-dimensional state spaces. The closest related works address cost tuning with learning. The MPC's cost function is replaced with a value function learned offline via RL in [24] (terminal cost) and [25] (stage cost). [26] deployed value function learning on a real robot, outperforming an expert-tuned MPC. While these ideas also use RL to obtain a better cost-to-go model, this work focuses on the technical challenge of learning a subgoal policy required for navigation through crowds, avoiding approximation and extrapolation issues for unseen events. Moreover, this work learns to set terminal constraints rather than setting a cost with a value function.
3) Combining MPC with RL:
Recently, there has been increasing interest in approaches combining the strengths of MPC and RL, as suggested in [27]. For instance, optimization-based planning has been used to explore high-reward regions and distill the knowledge into a policy neural network, rather than using a neural network policy to improve an optimization [28]–[30]. Similar to our approach, [31] utilizes the RL policy during training to ensure exploration and employs an MPC to optimize sampled trajectories from the learned policy at test time. Moreover, policy networks have been used to generate proposals for a sampling-based MPC [32], or to select goal positions from a predefined set [33].

Nevertheless, to the extent of our knowledge, approaches combining the benefits of both optimization and learning-based methods have not been explored in the context of crowd navigation. Moreover, the works exploring the similar idea of learning a cost-to-go model do not allow one to explicitly define collision constraints and ensure safety. To overcome these issues, in this paper we explore the idea of learning a cost-to-go model that directly generates subgoal positions leading to higher long-term rewards, and give the role of local collision avoidance and kinematic constraints to an optimization-based planner. Such cost-to-go information can be formulated as learning a value function for the ego-agent state space, providing information about which states are more valuable [25]. In contrast, we propose to learn a policy that directly informs which actions lead to higher rewards, allowing us to directly incorporate the MPC controller in the training phase.

II. PRELIMINARIES

Throughout this paper, vectors are denoted by bold lowercase letters, $\mathbf{x}$, matrices by capital letters, $M$, and sets by calligraphic uppercase letters, $\mathcal{S}$. $\|\mathbf{x}\|$ denotes the Euclidean norm of $\mathbf{x}$ and $\|\mathbf{x}\|_Q^2 = \mathbf{x}^T Q \mathbf{x}$ denotes the weighted squared norm. The variables $\{\mathbf{s}, \mathbf{a}\}$ denote the state and action variables used in the RL formulation, and $\{\mathbf{x}, \mathbf{u}\}$ denote the control states and control commands used in the optimization problem.

A. Problem Formulation
Consider a scenario where a robot must navigate from an initial position $\mathbf{p}_0$ to a goal position $\mathbf{g}$ on the plane $\mathbb{R}^2$, surrounded by $n$ non-communicating agents. At each time-step $t$, the robot first observes its state $\mathbf{s}_t$ (defined in Sec. III-A) and the set of the other agents' states $\mathcal{S}_t = \bigcup_{i \in \{1,\dots,n\}} \mathbf{s}_t^i$, then takes action $\mathbf{a}_t$, leading to the immediate reward $R(\mathbf{s}_t, \mathbf{a}_t)$ and next state $\mathbf{s}_{t+1} = h(\mathbf{s}_t, \mathbf{a}_t)$, under the transition model $h$.

We use the superscript $i \in \{1, \dots, n\}$ to denote the $i$-th nearby agent and omit the superscript when referring to the robot. For each agent $i \in \{1, \dots, n\}$, $\mathbf{p}_t^i \in \mathbb{R}^2$ denotes its position and $\mathbf{v}_t^i \in \mathbb{R}^2$ its velocity at step $t$ relative to an inertial frame, and $r^i$ the agent radius. We assume that each agent's current position and velocity are observed (e.g., with on-board sensing), while other agents' motion intentions (e.g., goal positions) are unknown. Finally, $\mathcal{O}_t$ denotes the area occupied by the robot and $\mathcal{O}_t^i$ the area occupied by each surrounding agent, at time-step $t$.

The goal is to learn a policy $\pi$ for the robot that minimizes time to goal while ensuring collision-free motions, defined as:

$$\pi^* = \arg\max_{\pi} \ \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t R(\mathbf{s}_t, \pi(\mathbf{s}_t, \mathcal{S}_t))\right]$$
$$\text{s.t.}\quad \mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t), \quad (1a)$$
$$\mathbf{s}_T = \mathbf{g}, \quad (1b)$$
$$\mathcal{O}_t(\mathbf{x}_t) \cap \mathcal{O}_t^i = \emptyset, \quad (1c)$$
$$\mathbf{u}_t \in \mathcal{U},\ \mathbf{s}_t \in \mathcal{S},\ \mathbf{x}_t \in \mathcal{X}, \quad (1d)$$
$$\forall t \in [0, T],\ \forall i \in \{1, \dots, n\}$$

where (1a) are the transition dynamic constraints considering the dynamic model $f$, (1b) is the terminal constraint, (1c) are the collision avoidance constraints, and $\mathcal{S}$, $\mathcal{U}$ and $\mathcal{X}$ are the sets of admissible states, inputs (e.g., to limit the robot's maximum speed), and admissible control states, respectively. Note that we only constrain the control states of the robot. Moreover, we assume other agents can have various behaviors (e.g., cooperative or non-cooperative): each agent samples a policy from a closed set $\mathcal{P} = \{\pi_1, \dots, \pi_m\}$ (defined in Sec. II-C) at the beginning of each episode.

B. Agent Dynamics
Real robotic systems' inertia imposes limits on linear and angular acceleration. Thus, we assume a second-order unicycle model for the robot [34]:

$$\dot{x} = v \cos\psi, \qquad \dot{y} = v \sin\psi, \qquad \dot{\psi} = \omega, \qquad \dot{v} = u_a, \qquad \dot{\omega} = u_\alpha \quad (2)$$

where $x$ and $y$ are the agent position coordinates and $\psi$ is the heading angle in a global frame. $v$ is the agent forward velocity, $\omega$ denotes the angular velocity, and $u_a$ and $u_\alpha$ are the linear and angular acceleration, respectively.
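For concreteness, the sketch below integrates the model of Eq. (2) with a simple forward-Euler step; the function name, state ordering, and the step size `dt` are illustrative assumptions, not the exact simulator used in the paper.

```python
import numpy as np

def unicycle_step(state, control, dt=0.1):
    """One forward-Euler step of the second-order unicycle model in Eq. (2).

    state   = [x, y, psi, v, omega]  (position, heading, forward and angular velocity)
    control = [u_a, u_alpha]         (linear and angular acceleration)
    """
    x, y, psi, v, omega = state
    u_a, u_alpha = control
    state_dot = np.array([
        v * np.cos(psi),  # x-position rate
        v * np.sin(psi),  # y-position rate
        omega,            # heading rate
        u_a,              # forward acceleration
        u_alpha,          # angular acceleration
    ])
    return np.asarray(state) + dt * state_dot
```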
C. Modeling Other Agents' Behaviors

In a real scenario, agents may follow different policies and show different levels of cooperation. Hence, in contrast to previous approaches, we do not consider all the agents to follow the same policy [6], [35]. At the beginning of an episode, each non-ego agent either follows a cooperative or a non-cooperative policy. For the cooperative policy, we employ the Reciprocal Velocity Obstacle (RVO) [36] model with a random cooperation coefficient $c^i$ (denoted as $\alpha^{BA}$ in [8]) sampled at the beginning of the episode. The "reciprocal" in RVO means that all agents follow the same policy and use the cooperation coefficient to split the collision avoidance effort among the agents (e.g., a coefficient of 0.5 means that each agent will apply half of the effort to avoid the other). In this work, for the non-cooperative agents, we consider both constant-velocity (CV) and non-CV policies. The agents following a CV model drive straight in the direction of their goal position with constant velocity. The agents following a non-CV policy either move in sinusoids towards their final goal position or in a circular motion around their initial position.

III. METHOD

Learning a sequence of intermediate goal states that lead an agent toward a final goal destination can be formulated as a single-agent sequential decision making problem. Because parts of the environment can be difficult to model explicitly, the problem can be solved with a reinforcement learning framework. Hence, we propose a two-level planning architecture, as depicted in Fig. 1, consisting of a subgoal recommender (Sec. III-A) and an optimization-based motion planner (Sec. III-B). We start by defining the RL framework and our policy architecture (Sec. III-A). Then, we formulate the MPC that executes the policy's actions and ensures local collision avoidance (Sec. III-B).

A. Learning a Subgoal Recommender Policy
We aim to develop a decision-making algorithm that provides an estimate of the cost-to-go in dynamic environments with mixed agents. In this paper, we propose to learn a policy that directly informs which actions lead to higher rewards.
1) RL Formulation:
As in [4], the observation vector is composed of the ego-agent's and the surrounding agents' states, defined as:

$$\mathbf{s}_t = [d_g,\ \mathbf{p}_t - \mathbf{g},\ v_{\text{ref}},\ \psi,\ r] \quad \text{(Ego-agent)}$$
$$\mathbf{s}_t^i = \left[\mathbf{p}_t^i,\ \mathbf{v}_t^i,\ r^i,\ d_t^i,\ r^i + r\right] \quad \forall i \in \{1, \dots, n\} \quad \text{(Other agents)} \quad (3)$$

where $\mathbf{s}_t$ is the ego-agent state and $\mathbf{s}_t^i$ the $i$-th agent state at step $t$. Moreover, $d_g = \|\mathbf{p}_t - \mathbf{g}\|$ is the ego-agent's distance to the goal and $d_t^i = \|\mathbf{p}_t - \mathbf{p}_t^i\|$ is the distance to the $i$-th agent.

Fig. 2: Proposed network policy architecture.

Here, we seek to learn the optimal policy for the ego-agent, $\pi: (\mathbf{s}_t, \mathcal{S}_t) \rightarrow \mathbf{a}_t$, mapping the ego-agent's observation of the environment to a probability distribution over actions. We consider a continuous action space $\mathcal{A} \subset \mathbb{R}^2$ and define an action as a position increment providing the direction that maximizes the ego-agent's rewards, defined as:

$$\mathbf{p}_t^{\text{ref}} = \mathbf{p}_t + \boldsymbol{\delta}_t \quad (4a)$$
$$\pi_{\theta_\pi}(\mathbf{s}_t, \mathcal{S}_t) = \boldsymbol{\delta}_t = [\delta_{t,x}, \delta_{t,y}] \quad (4b)$$
$$\|\boldsymbol{\delta}_t\| \leq N v_{\max} \quad (4c)$$

where $\delta_{t,x}, \delta_{t,y}$ are the $(x, y)$ position increments, $v_{\max}$ is the maximum linear velocity, and $\theta_\pi$ are the network policy parameters. Moreover, to ensure that the next subgoal position is within the planning horizon of the ego-agent, we bound the action space according to the planning horizon $N$ of the optimization-based planner and its dynamic constraints, as represented in Eq. (4c).

We design the reward function to motivate the ego-agent to reach the goal position while penalizing collisions:

$$R(\mathbf{s}, \mathbf{a}) = \begin{cases} r_{\text{goal}} & \text{if } \mathbf{p} = \mathbf{g} \\ r_{\text{collision}} & \text{if } d_{\min} < r + r^i,\ \forall i \in \{1, \dots, n\} \\ r_t & \text{otherwise} \end{cases} \quad (5)$$

where $d_{\min} = \min_i \|\mathbf{p} - \mathbf{p}^i\|$ is the distance to the closest surrounding agent. $r_t$ allows adapting the reward function, as shown in the ablation study (Sec. IV-C); $r_{\text{goal}}$ rewards the agent if it reaches the goal and $r_{\text{collision}}$ penalizes it if it collides with any other agent. In Sec. IV-C we analyze its influence on the behavior of the learned policy.
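To make Eqs. (3) and (5) concrete, a minimal sketch of how the observation vector and reward could be assembled is given below; the function names, the goal tolerance, and passing the reward constants as arguments are assumptions for illustration (the paper only fixes $r_{\text{collision}} = -10$ in Table I).

```python
import numpy as np

def ego_observation(p, psi, v_ref, r, goal):
    """Ego-agent observation s_t = [d_g, p - g, v_ref, psi, r] from Eq. (3)."""
    d_g = np.linalg.norm(p - goal)
    return np.concatenate(([d_g], p - goal, [v_ref, psi, r]))

def agent_observation(p, p_i, v_i, r, r_i):
    """Other-agent observation s_t^i = [p^i, v^i, r^i, d^i, r^i + r] from Eq. (3)."""
    d_i = np.linalg.norm(p - p_i)
    return np.concatenate((p_i, v_i, [r_i, d_i, r_i + r]))

def reward(p, goal, d_min, r, r_closest, r_goal, r_collision, r_t, goal_tol=1e-2):
    """Reward of Eq. (5): goal bonus, collision penalty, otherwise a step term r_t."""
    if np.linalg.norm(p - goal) < goal_tol:   # reached the goal (tolerance is an assumption)
        return r_goal
    if d_min < r + r_closest:                 # overlapping with the closest agent
        return r_collision
    return r_t                                # shaping term studied in the ablation (Sec. IV-C)
```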
2) Policy Network Architecture:
A key challenge in collision avoidance among pedestrians is that the number of nearby agents can vary between timesteps. Because feed-forward NNs require a fixed input vector size, prior work [6] proposed the use of Recurrent Neural Networks (RNNs) to compress the $n$ agent states into a fixed-size vector at each time-step. Yet, that approach discarded time-dependencies of successive observations (i.e., hidden states of recurrent cells). Here, we use the "store-state" strategy, as proposed in [37]. During the rollout phase, at each time-step we store the hidden state of the RNN together with the current state, the other agents' states, the immediate reward and the next state, $(\mathbf{s}_k, \mathcal{S}_k, \mathbf{h}_k, r_k, \mathbf{s}_{k+1})$. Moreover, the previous hidden state is fed back to warm-start the RNN in the next step, as depicted in Fig. 2. During the training phase, we use the stored hidden states to initialize the network. Our policy architecture is depicted in Fig. 2. We employ an RNN to encode a variable-length sequence of the other agents' states $\mathcal{S}_k$ and model the existing time-dependencies. Then, we concatenate the fixed-length representation of the other agents' states with the ego-agent's state to create a joint state representation. This representation vector is fed to two fully-connected layers (FCL). The network has two output heads: one estimates the probability distribution parameters $\pi_{\theta_\pi}(\mathbf{s}, \mathcal{S}) \sim \mathcal{N}(\mu, \sigma)$ of the policy's action space and the other estimates the state-value function $V^\pi(\mathbf{s}_t) := \mathbb{E}_{\mathbf{s}_{t+1:\infty}}\left[\sum_{l=0}^{\infty} r_{t+l}\right]$. $\mu$ and $\sigma$ are the mean and variance of the policy's distribution, respectively.
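A minimal PyTorch sketch of this architecture is shown below; the choice of an LSTM encoder, the hidden size, and the state-independent log-std are assumptions for illustration, while the input dimensions follow the observation definition in Eq. (3).

```python
import torch
import torch.nn as nn

class SubgoalPolicy(nn.Module):
    """Sketch of the policy in Fig. 2: an RNN encodes a variable number of
    other-agent states, the encoding is concatenated with the ego state, and
    two heads output the subgoal-increment distribution and the state value."""

    def __init__(self, ego_dim=6, agent_dim=7, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(agent_dim, hidden, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(ego_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, 2)           # mean of [delta_x, delta_y]
        self.log_std = nn.Parameter(torch.zeros(2))   # state-independent std (assumption)
        self.value_head = nn.Linear(hidden, 1)        # state-value estimate

    def forward(self, ego_state, other_agents, hidden_state=None):
        # other_agents: (batch, n_agents, agent_dim); hidden_state warm-starts the RNN
        _, (h_n, c_n) = self.encoder(other_agents, hidden_state)
        joint = torch.cat([ego_state, h_n[-1]], dim=-1)   # joint state representation
        features = self.fc(joint)
        return self.mu_head(features), self.log_std.exp(), self.value_head(features), (h_n, c_n)
```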
B. Local Collision Avoidance: Model Predictive Control

Here, we employ MPC to generate locally optimal commands respecting the kino-dynamic and collision avoidance constraints. To simplify the notation, hereafter we assume the current time-step $t$ to be zero.
1) State and Control Inputs:
We define the ego-agent's control input vector as $\mathbf{u} = [u_a, u_\alpha]$ and the control state as $\mathbf{x} = [x, y, \psi, v, \omega] \in \mathbb{R}^5$, following the dynamics model defined in Section II-B.
2) Dynamic Collision Avoidance:
We define a set of nonlinear constraints to ensure that the MPC generates collision-free control commands for the ego-agent (if a feasible solution exists). To limit the problem complexity and ensure that a solution is found in real-time, we consider a limited number of surrounding agents $\mathcal{X}_m$, with $m \leq n$. Consider $\mathcal{X}_n = \{\mathbf{x}^1, \dots, \mathbf{x}^n\}$ as the set of all surrounding agent states; then the set of the $m$ closest agents is defined as follows.

Definition: A set $\mathcal{X}_m \subseteq \mathcal{X}_n$ is the set of the $m$ closest agents if, in Euclidean distance, $\forall \mathbf{x}^j \in \mathcal{X}_m,\ \forall \mathbf{x}^i \in \mathcal{X}_n \setminus \mathcal{X}_m:\ \|\mathbf{x}^j - \mathbf{x}\| \leq \|\mathbf{x}^i - \mathbf{x}\|$.

We represent the area occupied by each agent $\mathcal{O}^i$ as a circle with radius $r^i$. To ensure collision-free motions, we impose that no circle $i \in \{1, \dots, n\}$ intersects with the area occupied by the ego-agent, resulting in the following set of inequality constraints:

$$c_k^i(\mathbf{x}_k, \mathbf{x}_k^i) = \left\|\mathbf{p}_k - \mathbf{p}_k^i\right\| \geq r + r^i, \quad (6)$$

for each planning step $k$. This formulation can be extended to agents with general quadratic shapes, as in [2].
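The selection of the $m$ closest agents and the resulting inequality checks of Eq. (6) can be sketched as follows; the function names and the default $m = 6$ (the value used in the experiments, Sec. IV-A) are illustrative.

```python
import numpy as np

def m_closest_agents(p_robot, agent_positions, m):
    """Indices of the m agents closest to the robot (the set X_m in the definition above)."""
    dists = np.linalg.norm(agent_positions - p_robot, axis=1)
    return np.argsort(dists)[:m]

def collision_free(p_robot, agent_positions, agent_radii, r, m=6):
    """Check the inequality constraints of Eq. (6) against the m closest agents."""
    idx = m_closest_agents(p_robot, agent_positions, m)
    dists = np.linalg.norm(agent_positions[idx] - p_robot, axis=1)
    return bool(np.all(dists >= r + agent_radii[idx]))
```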
3) Cost function:
The subgoal recommender provides a reference position $\mathbf{p}^{\text{ref}}$ guiding the ego-agent toward the final goal position $\mathbf{g}$ and minimizing the cost-to-go while accounting for the other agents. The terminal cost is defined as the normalized distance between the ego-agent's terminal position (after a planning horizon $N$) and the reference position (with weight coefficient $Q_N$):

$$J^N(\mathbf{p}_N, \pi(\mathbf{x}, \mathcal{X})) = \left\|\frac{\mathbf{p}_N - \mathbf{p}^{\text{ref}}}{\mathbf{p}_0 - \mathbf{p}^{\text{ref}}}\right\|_{Q_N}. \quad (7)$$

To ensure smooth trajectories, we define the stage cost as a quadratic penalty on the ego-agent control commands:

$$J_k^u(\mathbf{u}_k) = \|\mathbf{u}_k\|_{Q_u}, \quad k = \{0, 1, \dots, N-1\}, \quad (8)$$

where $Q_u$ is the weight coefficient.
4) MPC Formulation:
The MPC is then defined as the following non-convex optimization problem:

$$\min_{\mathbf{x}_{1:N},\, \mathbf{u}_{0:N-1}} \ J^N(\mathbf{x}_N, \mathbf{p}^{\text{ref}}) + \sum_{k=0}^{N-1} J_k^u(\mathbf{u}_k)$$
$$\text{s.t.}\quad \mathbf{x}_0 = \mathbf{x}(0),\ \text{(1d)},\ \text{(2)},$$
$$c_k^i(\mathbf{x}_k, \mathbf{x}_k^i) > r + r^i,$$
$$\mathbf{u}_k \in \mathcal{U},\ \mathbf{x}_k \in \mathcal{X},\quad \forall i \in \{1, \dots, n\};\ \forall k \in \{0, \dots, N-1\}. \quad (9)$$

In this work, we assume a constant-velocity model estimate of the other agents' future positions, as in [2].
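A compact sketch of the optimization in Eq. (9) using CasADi is given below; the paper solves this problem with ForcesPro, so the solver choice (IPOPT), the Euler discretization, the input bounds, and the default weights are placeholders, while the costs and constraints follow Eqs. (2) and (6)–(8) with constant-velocity predictions of the other agents.

```python
import casadi as ca

def solve_go_mpc(x0, p_ref, agents, N=20, dt=0.1, r=0.3, Q_N=10.0, Q_u=0.1):
    """Sketch of the MPC in Eq. (9). agents: list of (position, velocity, radius)."""
    opti = ca.Opti()
    X = opti.variable(5, N + 1)   # states [x, y, psi, v, omega]
    U = opti.variable(2, N)       # inputs [u_a, u_alpha]

    opti.subject_to(X[:, 0] == x0)
    cost = 0
    denom = ca.norm_2(x0[:2] - p_ref) + 1e-6            # normalization of the terminal cost
    for k in range(N):
        # forward-Euler discretization of the second-order unicycle model (Eq. 2)
        x, y, psi, v, w = X[0, k], X[1, k], X[2, k], X[3, k], X[4, k]
        x_next = ca.vertcat(x + dt * v * ca.cos(psi),
                            y + dt * v * ca.sin(psi),
                            psi + dt * w,
                            v + dt * U[0, k],
                            w + dt * U[1, k])
        opti.subject_to(X[:, k + 1] == x_next)
        cost += Q_u * ca.sumsqr(U[:, k])                 # stage cost, Eq. (8)
        for p_i, v_i, r_i in agents:                     # constant-velocity prediction
            p_pred = p_i + (k + 1) * dt * v_i
            opti.subject_to(ca.norm_2(X[:2, k + 1] - p_pred) >= r + r_i)  # Eq. (6)
        opti.subject_to(opti.bounded(-2.0, U[:, k], 2.0))  # input limits (placeholder values)

    cost += Q_N * ca.sumsqr((X[:2, N] - p_ref) / denom)  # terminal cost, Eq. (7), scalar weight
    opti.minimize(cost)
    opti.solver("ipopt")
    sol = opti.solve()
    return sol.value(U[:, 0]), sol.value(X)
```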
C. PPO-MPC

In this work, we train the policy using a state-of-the-art method, Proximal Policy Optimization (PPO) [38], but the overall framework is agnostic to the specific RL training algorithm. We propose to jointly train the guidance policy $\pi_{\theta_\pi}$ and value function $V_{\theta_V}(\mathbf{s})$ with the MPC, as opposed to prior works [6] that use an idealized low-level controller during policy training (which cannot be implemented on a real robot). Algorithm 1 describes the proposed training strategy, which has two main phases: supervised and RL training. First, we randomly initialize the policy and value function parameters $\{\theta_\pi, \theta_V\}$. Then, at the beginning of each episode we randomly select the number of surrounding agents between $[1, n_{\text{agents}}]$, the training scenario, and the surrounding agents' policies. More details about the different training scenarios and $n_{\text{agents}}$ are given in Sec. IV-B.

An initial RL policy is unlikely to lead an agent to a goal position. Hence, during the warm-start phase, we use the MPC as an expert and perform supervised training of the policy and value function parameters for $n_{\text{MPC}}$ steps. By setting the MPC goal state to the ego-agent's final goal state, $\mathbf{p}^{\text{ref}} = \mathbf{g}$, and solving the MPC problem, we obtain a locally optimal sequence of control states $\mathbf{x}^*_{1:N}$. For each step, we define $\mathbf{a}^*_t = \mathbf{x}^*_{t,N}$ and store the tuple containing the network hidden state, state, next state, and reward in a buffer $\mathcal{B} \leftarrow \{\mathbf{s}_k, \mathbf{a}^*_t, r_k, \mathbf{h}_k, \mathbf{s}_{k+1}\}$. Then, we compute advantage estimates [39] and perform a supervised training step:

$$\theta^V_{k+1} = \arg\min_{\theta_V} \ \mathbb{E}_{(\mathbf{a}_k, \mathbf{s}_k, r_k) \sim \mathcal{D}_{\text{MPC}}}\left[\left\|V_\theta(\mathbf{s}_k) - V_k^{\text{targ}}\right\|\right] \quad (10)$$
$$\theta^\pi_{k+1} = \arg\min_{\theta} \ \mathbb{E}_{(\mathbf{a}^*_k, \mathbf{s}_k) \sim \mathcal{D}_{\text{MPC}}}\left[\left\|\mathbf{a}^*_k - \pi_\theta(\mathbf{s}_k)\right\|\right] \quad (11)$$

where $\theta_V$, $\theta_\pi$ are the value function and policy parameters, respectively. Note that $\theta_V$ and $\theta_\pi$ share the same parameters except for the final layer, as depicted in Fig. 2. Afterwards, we use Proximal Policy Optimization (PPO) [38] with clipped gradients for training the policy. PPO is an on-policy method addressing the high-variance issue of policy gradient methods for continuous control problems. We refer the reader to [38] for more details about the method.
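Assuming a policy module with the interface of the sketch in Sec. III-A.2 (returning the action mean, the value estimate, and the RNN hidden state), the supervised warm-start step of Eqs. (10)–(11) could look as follows; the batch keys, the equal loss weighting, and the use of plain MSE losses are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def supervised_warmstart_step(policy, optimizer, batch):
    """One supervised update of Eqs. (10)-(11): regress the policy mean onto the
    MPC expert actions a* and the value head onto the return targets, using data
    collected while the MPC tracked the final goal (p_ref = g)."""
    mu, _, value, _ = policy(batch["ego_state"], batch["other_agents"], batch["hidden_state"])
    policy_loss = F.mse_loss(mu, batch["a_star"])                    # Eq. (11)
    value_loss = F.mse_loss(value.squeeze(-1), batch["v_target"])    # Eq. (10)
    loss = policy_loss + value_loss                                  # equal weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```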
Algorithm 1: PPO-MPC Training
Inputs: planning horizon H, value function and policy parameters {θ_V, θ_π}, number of supervised and RL training episodes {n_MPC, n_episodes}, number of agents n, mini-batch size n_mini-batch, and reward function R(s_t, a_t, a_{t+1})
Initialize states: {s^1, ..., s^n} ~ S, goals {g^1, ..., g^n} ~ S
while episode < n_episodes do
  Initialize B ← ∅ and h_0 ← ∅
  for k = 0, ..., n_mini-batch do
    if episode ≤ n_MPC then
      Solve Eq. (9) considering p^ref = g; set a*_t = x*_N
    else
      p^ref = π_θ(s_t, S_t)
    end if
    {s_k, a_k, r_k, h_{k+1}, s_{k+1}, done} = Step(s*_t, a*_t, h_t)
    Store B ← {s_k, a_k, r_k, h_{k+1}, s_{k+1}, done}
    if done then
      episode += 1; reset hidden state: h_t ← ∅
      Initialize: {s^1, ..., s^n} ~ S, {g^1, ..., g^n} ~ S
    end if
  end for
  if episode ≤ n_MPC then
    Supervised training: Eq. (10) and Eq. (11)
  else
    PPO training [38]
  end if
end while
return {θ_V, θ_π}

Please note that our approach is agnostic to which RL algorithm we use. Moreover, to increase the learning speed during training, we gradually increase the number of agents in the training environments (curriculum learning [40]).

IV. RESULTS

This section quantifies the performance throughout the training procedure, provides an ablation study, and compares the proposed method (sample trajectories and numerically) against the following baseline approaches:
• MPC: the Model Predictive Controller from Section III-B with the final goal position as position reference, $\mathbf{p}^{\text{ref}} = \mathbf{g}$;
• DRL [6]: a state-of-the-art Deep Reinforcement Learning approach for multi-agent collision avoidance.
To analyze the impact of a realistic kinematic model during training, we consider two variants of the DRL method [6]: the same RL algorithm [6] was used to train a policy under a first-order unicycle model, referred to as DRL, and under a second-order unicycle model (Eq. 2), referred to as DRL-2. All experiments use a second-order unicycle model (Eq. 2) in environments with cooperative and non-cooperative agents to represent realistic robot/pedestrian behavior. Animations of sample trajectories accompany the paper.

TABLE I: Hyper-parameters.
Planning horizon $N$ | $r_{\text{goal}}$ | discount factor $\gamma$ | $r_{\text{collision}} = -10$ | clip factor $= 0.1$ | learning rate

A. Experimental setup

The proposed training algorithm builds upon the open-source PPO implementation provided in the Stable-Baselines [41] package. We used a laptop with an Intel Core i7 and 32 GB of RAM for training. To solve the non-linear and non-convex MPC problem of Eq. (9), we used the ForcesPro [42] solver. If no feasible solution is found within the maximum number of iterations, then the robot decelerates. All MPC methods used in this work consider collision constraints with up to the closest six agents so that the optimization problem can be solved in less than 20 ms. Moreover, our policy's network has an average computation time of 2 ms with a variance of 0.4 ms across all experiments. Hyperparameter values are summarized in Table I.
B. Training procedure
To train and evaluate our method we have selected four navigation scenarios, similar to [5]–[7]:
• Symmetric swapping: Each agent's position is randomly initialized in a different quadrant of the x–y plane, with all agents at the same distance from the origin. Each agent's goal is to swap positions with the agent from the opposite quadrant.
• Asymmetric swapping: As before, but the agents are located at different distances from the origin.
• Pair-wise swapping: Random initial positions; pairs of agents' goals are each other's initial positions.
• Random: Random initial and goal positions.
Each training episode consists of a random number of agents and a random scenario. At the start of each episode, each other agent's policy is sampled from a binomial distribution (80% cooperative, 20% non-cooperative). Moreover, for the cooperative agents we randomly sample a cooperation coefficient $c^i$ from a uniform distribution, and each non-cooperative agent is randomly assigned a CV or non-CV policy (i.e., sinusoidal or circular). Fig. 3 shows the evolution of the robot's average reward and the percentage of failure episodes. The top sub-plot compares our method's average reward with the two baseline methods: DRL (with pre-trained weights) and MPC. The average reward of the baseline methods (orange, yellow) drops as the number of agents increases (each vertical bar). In contrast, our method (blue) improves with training and eventually achieves a higher average reward for 10-agent scenarios than the baseline methods achieve for 2-agent scenarios. The bottom plot demonstrates that the percentage of collisions decreases throughout training despite the increasing number of agents.
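A minimal sketch of the per-episode randomization described above is given below; the maximum number of agents, the bounds of the cooperation coefficient (truncated in the extracted text), and the policy labels are placeholders rather than the paper's exact values.

```python
import numpy as np

def sample_episode_config(n_max=10, p_cooperative=0.8, coop_range=(0.1, 1.0), rng=None):
    """Sample the number of agents, the scenario, and each neighbor's policy for one episode."""
    rng = rng or np.random.default_rng()
    n_agents = int(rng.integers(1, n_max + 1))
    scenario = rng.choice(["symmetric_swap", "asymmetric_swap", "pairwise_swap", "random"])
    neighbors = []
    for _ in range(n_agents):
        if rng.random() < p_cooperative:   # cooperative RVO agent with a random cooperation coefficient
            neighbors.append(("rvo", {"coop_coeff": float(rng.uniform(*coop_range))}))
        else:                              # non-cooperative agent: constant-velocity, sinusoidal, or circular
            neighbors.append((str(rng.choice(["constant_velocity", "sinusoid", "circular"])), {}))
    return scenario, neighbors
```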
Fig. 3: Moving average rewards and percentage of failure episodes during training. The top plot shows our method's average episode reward vs. DRL [6] and simple MPC.

Fig. 4: Two agents swapping scenario. In blue is depicted the trajectory of the robot, in red the non-cooperative agent, in purple the DRL agent, and in orange the MPC.

C. Ablation Study

A key design choice in RL is the reward function; here, we study the impact on policy performance of three variants of the reward. The sparse reward uses $r_t = 0$ (only non-zero reward upon reaching the goal or colliding). The time reward uses a small constant negative $r_t$ (penalizing every step until the goal is reached). The progress reward uses $r_t \propto \left(\|\mathbf{s}_t - \mathbf{g}\| - \|\mathbf{s}_{t+1} - \mathbf{g}\|\right)$ (encouraging motion toward the goal). Aggregated results in Table II show that the policy trained with the time reward function allows the robot to reach the goal in minimum time, to travel the smallest distance, and to achieve the lowest percentage of failure cases. Based on these results, we selected the policy trained with the time reward function for the subsequent experiments.

TABLE II: Ablation study: a discrete reward function leads to a better policy than sparse, dense reward functions. Results are aggregated over 200 random scenarios with varying numbers of agents. Metrics: time to goal [s], % failures (% collisions / % timeout), and mean traveled distance [m].

D. Qualitative Analysis

This section compares and analyzes trajectories for different scenarios. Fig. 4 shows that our method resolves a failure mode of both the RL and MPC baselines. The robot has to swap position with a non-cooperative agent (red, moving right-to-left) and avoid a collision. We overlap the trajectories (moving left-to-right) performed by the robot following our method (blue) versus the baseline policies (orange, magenta). The MPC policy (orange) causes a collision due to the dynamic constraints and limited planning horizon. The DRL policy avoids the non-cooperative agent, but due to its reactive nature it only does so when very close, resulting in a larger travel time. Finally, when using our approach, the robot initiates a collision avoidance maneuver early enough to lead to a smooth trajectory and faster arrival at the goal.

We present results for mixed settings in Fig. 5 and homogeneous settings in Fig. 6 with varying numbers of agents.
Fig. 5: Sample trajectories with mixed agent policies (robot: blue, cooperative: green, non-cooperative: red). In (a), all agents are cooperative; in (b), two are cooperative and five non-cooperative (const. vel.); in (c), three are cooperative and two non-cooperative (sinusoidal). The GO-MPC agent avoids non-cooperative agents differently than cooperative agents.
Fig. 6: 8 agents swapping positions: (a) DRL [6], (b) DRL-2 (extension of [6]), (c) GO-MPC. To simulate a multi-robot environment, all agents follow the same policy.

In mixed settings, the robot follows our proposed policy while the other agents either follow an RVO [36] or a non-cooperative policy (same distribution as in training). Fig. 5 demonstrates that our navigation policy behaves differently when dealing with only cooperative agents or with both cooperative and non-cooperative agents. Whereas in Fig. 5a the robot navigates through the crowd, Fig. 5b shows that the robot takes a longer path to avoid the congestion.

In the homogeneous setting, all agents follow our proposed policy. Fig. 6 shows that our method achieves a faster time-to-goal than the two DRL baselines. Note that this scenario was never introduced during the training phase, nor have the agents ever experienced other agents with the same policy before. Following the DRL policy (Fig. 6a), all agents navigate straight to their goal positions, leading to congestion in the center with reactive avoidance. The trajectories from the DRL-2 approach (Fig. 6b) are more conservative, due to the limited acceleration available. In contrast, the trajectories generated by our approach (Fig. 6c) present a balance between going straight to the goal and avoiding congestion in the center, allowing the agents to reach their goals faster and with a smaller distance traveled.
E. Performance Results
This section aggregates the performance of the various methods across 200 random scenarios. Performance is quantified by the average time to reach the goal position, the percentage of episodes that end in failure (either collision or timeout), and the average distance traveled.

TABLE III: Statistics for 200 runs of the proposed method (GO-MPC) compared to the baselines (MPC, DRL [6] and DRL-2, an extension of [6]): time to goal and traveled distance for the successful episodes, and the number of episodes resulting in collision, for varying numbers of agents. Metrics: time to goal (mean ± std) [s], % failures (% collisions / % deadlocks), and traveled distance (mean ± std) [m]. For the mixed setting, 80% of agents are cooperative and 20% are non-cooperative.

The numerical results are summarized in Table III. Our method outperforms each baseline for both mixed and homogeneous scenarios. To evaluate statistical significance, we performed pairwise Mann–Whitney U-tests between GO-MPC and each baseline (95% confidence). GO-MPC shows statistically significant performance improvements over the DRL-2 baseline in terms of travel time and distance, and over the DRL baseline in terms of travel time for six agents and travel distance for ten agents. For homogeneous scenarios, GO-MPC is more conservative than the DRL and MPC baselines, resulting in a larger average traveled distance. Nevertheless, GO-MPC reaches the goals faster than each baseline and is less conservative than DRL-2, as measured by a significantly lower average distance traveled.

Finally, considering higher-order dynamics when training DRL agents (DRL-2) improves the collision avoidance performance. However, it also increases the average time to goal and traveled distance, meaning a more conservative policy that still under-performs GO-MPC in each metric.

V. CONCLUSIONS & FUTURE WORK

This paper introduced a subgoal planning policy for guiding a local optimization planner. We employed DRL methods to learn a subgoal policy accounting for the interaction effects among the agents. Then, we used an MPC to compute locally optimal motion plans respecting the robot dynamics and collision avoidance constraints. Learning a subgoal policy improved the collision avoidance performance among cooperative and non-cooperative agents as well as in multi-robot environments. Moreover, our approach can reduce travel time and distance in cluttered environments. Future work could account for environment constraints.

REFERENCES

[1] B. Paden, M. Čáp, S. Z. Yong, D. Yershov, and E. Frazzoli, "A survey of motion planning and control techniques for self-driving urban vehicles," IEEE Transactions on Intelligent Vehicles, 2016.
[2] B. Brito, B. Floor, L. Ferranti, and J. Alonso-Mora, "Model predictive contouring control for collision avoidance in unstructured dynamic environments," IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 4459–4466, 2019.
[3] Y. F. Chen, S.-Y. Liu, M. Liu, J. Miller, and J. P. How, "Motion planning with diffusion maps," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016.
[4] Y. F. Chen, M. Liu, M. Everett, and J. P. How, "Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 285–292.
[5] C. Chen, Y. Liu, S. Kreiss, and A. Alahi, "Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning," in IEEE International Conference on Robotics and Automation (ICRA), 2019, pp. 6015–6022.
[6] M. Everett, Y. F. Chen, and J. P. How, "Collision avoidance in pedestrian-rich environments with deep reinforcement learning," IEEE Access, vol. 9, pp. 10357–10377, 2021.
[7] T. Fan, P. Long, W. Liu, and J. Pan, "Distributed multi-robot collision avoidance via deep reinforcement learning for navigation in complex scenarios," The International Journal of Robotics Research, 2020.
[8] J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha, "Reciprocal n-body collision avoidance," in Robotics Research. Springer, 2011.
[9] J. Van Den Berg, J. Snape, S. J. Guy, and D. Manocha, "Reciprocal collision avoidance with acceleration-velocity obstacles," in IEEE International Conference on Robotics and Automation (ICRA), 2011.
[10] D. Helbing and P. Molnar, "Social force model for pedestrian dynamics," Physical Review E, vol. 51, no. 5, p. 4282, 1995.
[11] C. I. Mavrogiannis and R. A. Knepper, "Multi-agent path topology in support of socially competent navigation planning," The International Journal of Robotics Research, vol. 38, no. 2-3, pp. 338–356, 2019.
[12] C. I. Mavrogiannis, W. B. Thomason, and R. A. Knepper, "Social momentum: A framework for legible navigation in dynamic multi-agent environments," in Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction, 2018, pp. 361–369.
[13] P. Trautman, J. Ma, R. M. Murray, and A. Krause, "Robot navigation in dense human crowds: Statistical models and experimental studies of human–robot cooperation," The International Journal of Robotics Research, vol. 34, no. 3, pp. 335–356, 2015.
[14] B. Kim and J. Pineau, "Socially adaptive path planning in human environments using inverse reinforcement learning," International Journal of Social Robotics, vol. 8, no. 1, pp. 51–66, 2016.
[15] A. Vemula, K. Muelling, and J. Oh, "Modeling cooperative navigation in dense human crowds," in IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 1685–1692.
[16] Y. F. Chen, M. Everett, M. Liu, and J. P. How, "Socially aware motion planning with deep reinforcement learning," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[17] L. Tai, J. Zhang, M. Liu, and W. Burgard, "Socially compliant navigation through raw depth inputs with generative adversarial imitation learning," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 1111–1117.
[18] L. Hewing, K. P. Wabersich, M. Menner, and M. N. Zeilinger, "Learning-based model predictive control: Toward safe learning in control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 3, pp. 269–296, 2020.
[19] N. Mansard, A. DelPrete, M. Geisert, S. Tonneau, and O. Stasse, "Using a memory of motion to efficiently warm-start a nonlinear predictive controller," in IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2986–2993.
[20] G. Bellegarda and K. Byl, "Combining benefits from trajectory optimization and deep reinforcement learning," 2019.
[21] M. Zanon and S. Gros, "Safe reinforcement learning using robust MPC," IEEE Transactions on Automatic Control, 2020.
[22] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," 2017.
[23] M. Zhong, M. Johnson, Y. Tassa, T. Erez, and E. Todorov, "Value function approximation and model predictive control," in IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013, pp. 100–107.
[24] K. Lowrey, A. Rajeswaran, S. Kakade, E. Todorov, and I. Mordatch, "Plan online, learn offline: Efficient learning and exploration via model-based control," arXiv preprint arXiv:1811.01848, 2018.
[25] F. Farshidian, D. Hoeller, and M. Hutter, "Deep value model predictive control," arXiv preprint arXiv:1910.03358, 2019.
[26] N. Karnchanachari, M. I. Valls, D. Hoeller, and M. Hutter, "Practical reinforcement learning for MPC: Learning from sparse objectives in under an hour on a real robot," 2020.
[27] D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel, "Reinforcement learning versus model predictive control: a comparison on a power system problem," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 2, pp. 517–529, 2008.
[28] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning, 2013, pp. 1–9.
[29] ——, "Variational policy search via trajectory optimization," in Advances in Neural Information Processing Systems, 2013.
[30] I. Mordatch and E. Todorov, "Combining the benefits of function approximation and trajectory optimization," in Robotics: Science and Systems, vol. 4, 2014.
[31] Z.-W. Hong, J. Pajarinen, and J. Peters, "Model-based lookahead reinforcement learning," 2019.
[32] T. Wang and J. Ba, "Exploring model-based planning with policy networks," in International Conference on Learning Representations.
[33] C. Greatwood and A. G. Richards, "Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control," Autonomous Robots, vol. 43, no. 7, pp. 1681–1693, 2019.
[34] S. M. LaValle, Planning Algorithms. Cambridge University Press, 2006.
[35] T. Fan, P. Long, W. Liu, and J. Pan, "Fully distributed multi-robot collision avoidance via deep reinforcement learning for safe and efficient navigation in complex scenarios," arXiv preprint arXiv:1808.03841, 2018.
[36] J. Van den Berg, M. Lin, and D. Manocha, "Reciprocal velocity obstacles for real-time multi-agent navigation," in IEEE International Conference on Robotics and Automation (ICRA), 2008, pp. 1928–1935.
[37] S. Kapturowski, G. Ostrovski, W. Dabney, J. Quan, and R. Munos, "Recurrent experience replay in distributed reinforcement learning," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=r1lyTjAqYX
[38] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017.
[39] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[40] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in International Conference on Machine Learning, 2009.