Learning Safe Multi-Agent Control with Decentralized Neural Barrier Certificates
Zengyi Qin, Kaiqing Zhang, Yuxiao Chen, Jingkai Chen, Chuchu Fan
Massachusetts Institute of Technology; University of Illinois Urbana-Champaign; California Institute of Technology
{qinzy, chuchu}@mit.edu

Abstract
We study the multi-agent safe control problem, where agents must avoid collisions with static obstacles and with each other while reaching their goals. Our core idea is to learn the multi-agent control policy jointly with learning the control barrier functions as safety certificates. We propose a novel joint-learning framework that can be implemented in a decentralized fashion, with generalization guarantees for certain function classes. Such a decentralized framework can adapt to an arbitrarily large number of agents. Building upon this framework, we further improve the scalability by incorporating neural network architectures that are invariant to the quantity and permutation of neighboring agents. In addition, we propose a new spontaneous policy refinement method to further enforce the certificate condition during testing. We provide extensive experiments to demonstrate that our method significantly outperforms other leading multi-agent control approaches in terms of maintaining safety and completing the original tasks. Our approach also shows exceptional generalization capability: the control policy can be trained with 8 agents in one scenario and used in other scenarios with up to 1024 agents, in complex multi-agent environments and dynamics.
1 Introduction
Machine learning (ML) has created unprecedented opportunities for achieving full autonomy. However, learning-based methods in autonomous systems (AS) can and do fail, due to the lack of formal guarantees and limited generalization capability, which poses significant challenges for developing safety-critical AS, especially large-scale multi-agent AS, that are provably dependable.

On the other side, safety certificates (Chang et al. (2019); Jin et al. (2020); Choi et al. (2020)), which widely exist in control theory and formal methods, serve as proofs that a system satisfies its desired properties under certain control policies. For example, once found, a Control Barrier Function (CBF) ensures that the closed-loop system always stays inside some safe set (Wieland & Allgöwer, 2007; Ames et al., 2014) with a CBF Quadratic Programming (QP) supervisory controller. However, it is extremely difficult to synthesize CBFs by hand for complex dynamic systems, which has spurred a growing interest in learning-based CBFs (Saveriano & Lee, 2020; Srinivasan et al., 2020; Jin et al., 2020; Boffi et al., 2020; Taylor et al., 2020; Robey et al., 2020). However, all of these studies concern only single-agent systems. How to develop learning-based approaches for safe multi-agent control that are both provably dependable and scalable remains open.

In multi-agent control, there is a constant dilemma: centralized control strategies can hardly scale to a large number of agents, while decentralized control without coordination often misses safety and performance guarantees. In this work, we propose a novel learning framework that jointly designs multi-agent control policies and safety certificates from data, which can be implemented in a decentralized fashion and scales to an arbitrary number of agents. Specifically, we first introduce the notion of decentralized CBFs as safety certificates, then propose the framework of learning decentralized CBFs, with generalization error guarantees. The decentralized CBF can be seen as a contract among agents, which allows agents to learn a mutual agreement with each other on how to avoid collisions. Once such a controller is achieved through the joint-learning framework, it can be applied to an arbitrary number of agents and in scenarios that differ from the training scenarios, which resolves the fundamental scalability issue in multi-agent control. We also propose several effective techniques in Section 4 to make such a learning process even more scalable and practical, which are then validated extensively in Section 5.

Experimental results are indeed promising. We study both 2D and 3D safe multi-agent control problems, each with several distinct environments and complex nonholonomic dynamics. Our joint-learning framework performs exceptionally well: our control policies trained on scenarios with 8 agents can be used on up to 1024 agents while maintaining low collision rates, which has notably pushed the boundary of learning-based safe multi-agent control. Notably, 1024 is not the limit of our approach but rather of the computational capability of the laptop used for the experiments. We also compare our approach with both leading learning-based methods (Lowe et al., 2017; Zhang & Bastani, 2019; Liu et al., 2020) and traditional planning methods (Ma et al., 2019; Fan et al., 2020). Our approach outperforms all the other approaches in terms of both completing the tasks and maintaining safety.
Contributions.
Our main contributions are three-fold: 1) We propose the first framework for jointly learning safe multi-agent control policies and CBF certificates in a decentralized fashion. 2) We present several techniques that make the learning framework more effective and scalable for practical multi-agent systems, including quantity-permutation invariant neural network architectures that handle the permutation of neighboring agents. 3) We demonstrate via extensive experiments that our method significantly outperforms other leading methods and has exceptional generalization capability to unseen scenarios and an arbitrary number of agents, even in quite complex multi-agent environments such as ground robots and drones. A video demonstrating the performance of our method can be found in the supplementary material.
Related Work.
Learning-Based Safe Control via CBF.
Barrier certificates (Prajna et al., 2007) and CBFs (Wieland & Allgöwer, 2007) are well-known and effective tools for guaranteeing the safety of nonlinear dynamic systems. However, existing methods for constructing CBFs either rely on specific problem structures (Chen et al., 2017b) or do not scale well (Mitchell et al., 2005). Recently, there has been an increasing interest in learning-based and data-driven safe control via CBFs, which primarily falls into two categories: learning CBFs from data (Saveriano & Lee, 2020; Srinivasan et al., 2020; Jin et al., 2020; Boffi et al., 2020), and
CBF-based approaches for controlling unknown systems (Wang et al., 2017; 2018; Cheng et al., 2019; Taylor et al., 2020). Our work is more pertinent to the former and is complementary to the latter, which usually assumes that the CBF is provided. None of these learning-enabled approaches, however, has addressed the multi-agent setting.
Multi-Agent Safety Certificates and Collision Avoidance.
Restricted to holonomic systems, guaranteeing safety in multi-agent systems has been approached by limiting the velocities of the agents (Van den Berg et al., 2008; Alonso-Mora et al., 2013). Later, Borrmann et al. (2015) and Wang et al. (2017) proposed the framework of multi-agent CBFs to generate collision-free controllers, with either perfectly known system dynamics (Borrmann et al., 2015) or worst-case uncertainty bounds (Wang et al., 2017). Recently, Chen et al. (2020) proposed a decentralized controller synthesis approach under this CBF framework, which is scalable to an arbitrary number of agents. However, in Chen et al. (2020) the CBF controller relies on online integration of the dynamics under the backup strategy, which can be computationally challenging for complex systems. Due to space limits, we omit other non-learning multi-agent control methods but acknowledge their importance.
Safe Multi-Agent (Reinforcement) Learning (MARL).
Safety concerns have drawn increasing attention in MARL, especially in applications to safety-critical multi-agent systems (Zhang & Bastani, 2019; Qie et al., 2019; Shalev-Shwartz et al., 2016). Under the CBF framework, Cheng et al. (2020) considered the setting with unknown system dynamics and proposed to design robust multi-agent CBFs based on the learned dynamics. This mirrors the second category mentioned above in single-agent learning-based safe control, which is orthogonal to our focus. RL approaches have also been applied to multi-agent collision avoidance (Chen et al., 2017a; Lowe et al., 2017; Everett et al., 2018; Zhang et al., 2018). Nonetheless, no formal guarantees of safety were established in these works. One exception is Zhang & Bastani (2019), which proposed a multi-agent model predictive shielding algorithm that provably guarantees safety for any policy learned from MARL, and which differs from our multi-agent CBF-based approach. More importantly, none of these MARL-based approaches scales to a massive number of, e.g., thousands of agents, as our approach does. The most scalable MARL platform, to the best of our knowledge, is Zheng et al. (2017), which may handle a comparable scale of agents as ours, but with discrete state-action spaces. This is in contrast to our continuous-space models, which can model practical control systems such as robots and drones.
2 Preliminaries
2.1 Control Barrier Functions as Safety Certificates
One common approach to (single-agent) safety certificates is via control barrier functions (Ames et al., 2014), which can enforce the states of a dynamic system to stay in the safe set. Specifically, let
$\mathcal{S} \subset \mathbb{R}^n$ be the state space, $\mathcal{S}_d \subset \mathcal{S}$ the dangerous set, and $\mathcal{S}_s = \mathcal{S} \setminus \mathcal{S}_d$ the safe set, which contains the set of initial conditions $\mathcal{S}_0 \subset \mathcal{S}_s$. Also define the space of control actions $\mathcal{U} \subset \mathbb{R}^m$. For a dynamic system $\dot{s}(t) = f(s(t), u(t))$, a control barrier function $h: \mathbb{R}^n \mapsto \mathbb{R}$ satisfies:

$$\Big(\forall s \in \mathcal{S}_0,\ h(s) \geq 0\Big) \bigwedge \Big(\forall s \in \mathcal{S}_d,\ h(s) < 0\Big) \bigwedge \Big(\forall s \in \{s \mid h(s) \geq 0\},\ \nabla_s h \cdot f(s, u) + \alpha(h) \geq 0\Big), \tag{1}$$

where $\alpha(\cdot)$ is a class-$\mathcal{K}$ function, i.e., $\alpha(\cdot)$ is strictly increasing and satisfies $\alpha(0) = 0$. For a control policy $\pi: \mathcal{S} \to \mathcal{U}$ and CBF $h$, it is proved in Ames et al. (2014) that if $s(0) \in \{s \mid h(s) \geq 0\}$ and the three conditions in (1) are satisfied with $u = \pi(s)$, then $s(t) \in \{s \mid h(s) \geq 0\}$ for all $t \in [0, \infty)$, which means the state never enters the dangerous set $\mathcal{S}_d$ under $\pi$.
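As a concrete illustration, the following minimal sketch checks the conditions in (1) numerically for a toy 1D single-integrator with a hand-written barrier; the dynamics, barrier, class-$\mathcal{K}$ function, and braking policy here are our own illustrative choices, not from the paper.

```python
import numpy as np

# Toy 1D system s_dot = u with safe set {s >= kappa} and barrier h(s) = s - kappa.
kappa = 1.0
alpha = lambda h: 0.5 * h          # a linear class-K function: alpha(0) = 0

def h(s):
    return s - kappa               # h >= 0 on the safe set, h < 0 when s < kappa

def cbf_condition(s, u):
    """Third condition in (1): grad_s h . f(s, u) + alpha(h(s)) >= 0."""
    grad_h, f = 1.0, u             # grad of h is 1; single-integrator dynamics
    return grad_h * f + alpha(h(s))

# A policy that decelerates toward the boundary satisfies the condition,
# so the state can approach but never cross into {s < kappa}.
for s in np.linspace(kappa, 5.0, 9):
    u = -0.2 * (s - kappa)
    assert cbf_condition(s, u) >= 0.0
```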
2.2 Safety of Multi-Agent Dynamic Systems

Consider a multi-agent system with $N$ agents, whose joint state at time $t$ is denoted by $s(t) = \{s_1(t), s_2(t), \cdots, s_N(t)\}$, where $s_i(t) \in \mathcal{S}_i \subset \mathbb{R}^n$ denotes the state of agent $i$ at time $t$. The dynamics of agent $i$ are $\dot{s}_i(t) = f_i(s_i(t), u_i(t))$, where $u_i(t) \in \mathcal{U}_i \subset \mathbb{R}^m$ is the control action of agent $i$. The overall state and input spaces are $\mathcal{S} \doteq \bigotimes_{i=1}^{N} \mathcal{S}_i$ and $\mathcal{U} \doteq \bigotimes_{i=1}^{N} \mathcal{U}_i$. For each agent $i$, we define $\mathcal{N}_i(t)$ as the set of its neighboring agents at time $t$. Let $o_i(t) \in \mathbb{R}^{n \times |\mathcal{N}_i(t)|}$ be the local observation of agent $i$, i.e., the states of the $|\mathcal{N}_i(t)|$ neighboring agents. Notice that the dimension of $o_i(t)$ is not fixed and depends on the number of neighboring agents. We assume that the safety of agent $i$ is jointly determined by $s_i$ and $o_i$. Let $\mathcal{O}_i$ be the set of all possible observations and $\mathcal{X}_i := \mathcal{S}_i \times \mathcal{O}_i$ the state-observation space, which contains the safe set $\mathcal{X}_{i,s}$, the dangerous set $\mathcal{X}_{i,d}$, and the initial conditions $\mathcal{X}_{i,0} \subset \mathcal{X}_{i,s}$. Let $d: \mathcal{X}_i \to \mathbb{R}$ describe the minimum distance from agent $i$ to the other agents that it observes; $d(s_i, o_i) < \kappa_s$ implies collision. Then $\mathcal{X}_{i,s} = \{(s_i, o_i) \mid d(s_i, o_i) \geq \kappa_s\}$ and $\mathcal{X}_{i,d} = \{(s_i, o_i) \mid d(s_i, o_i) < \kappa_s\}$. Let $\bar{d}_i: \mathcal{S} \to \mathbb{R}$ be the lifting of $d$ from $\mathcal{X}_i$ to $\mathcal{S}$, which is well-defined since there is a surjection from $\mathcal{S}$ to $\mathcal{X}_i$. Then define $\mathcal{S}_s \doteq \{s \in \mathcal{S} \mid \forall i = 1, \dots, N,\ \bar{d}_i(s) \geq \kappa_s\}$. The safety of a multi-agent system can be formally defined as follows:

Definition 1 (Safety of Multi-Agent Systems). If the state-observation pair satisfies $d(s_i, o_i) \geq \kappa_s$ for agent $i$ at time $t$, then agent $i$ is safe at time $t$. If every agent $i$ is safe at time $t$, then the multi-agent system is safe at time $t$, and $s \in \mathcal{S}_s$.

A main objective of this paper is to learn control policies $\pi_i(s_i(t), o_i(t))$ for all $i$ such that the multi-agent system is safe. The control policies are decentralized (i.e., each agent has its own control policy and there is no central controller coordinating all the agents). In this way, our decentralized approach has the hope to scale to a very large number of agents.
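For concreteness, a small sketch of Definition 1's safety test is given below: it classifies a state-observation pair by the minimum distance $d(s_i, o_i)$ to observed neighbors. The assumption that the first two state dimensions are planar positions, and all names, are ours.

```python
import numpy as np

def min_neighbor_distance(s_i: np.ndarray, o_i: np.ndarray) -> float:
    """d(s_i, o_i): minimum distance from agent i to its observed neighbors.
    o_i stacks neighbor states column-wise, matching the n x |N_i(t)| layout."""
    diffs = o_i[:2, :] - s_i[:2, None]            # positional differences
    return float(np.linalg.norm(diffs, axis=0).min())

def is_safe(s_i: np.ndarray, o_i: np.ndarray, kappa_s: float = 0.5) -> bool:
    """Agent i is safe at this instant iff d(s_i, o_i) >= kappa_s (Definition 1)."""
    return min_neighbor_distance(s_i, o_i) >= kappa_s
```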
3 Learning Framework for Multi-Agent Decentralized CBF

3.1 Decentralized Control Barrier Functions
For a multi-agent dynamic system, the most naïve CBF would be a centralized function taking into account the cross product of all agents' states, which leads to an exponential blow-up in the state space and difficulties in modeling systems with an arbitrary number of agents. Instead, we consider a decentralized control barrier function $h_i: \mathcal{X}_i \mapsto \mathbb{R}$:

$$\Big(\forall (s_i, o_i) \in \mathcal{X}_{i,0},\ h_i(s_i, o_i) \geq 0\Big) \bigwedge \Big(\forall (s_i, o_i) \in \mathcal{X}_{i,d},\ h_i(s_i, o_i) < 0\Big) \bigwedge \Big(\forall (s_i, o_i) \in \{(s_i, o_i) \mid h_i(s_i, o_i) \geq 0\},\ \nabla_{s_i} h_i \cdot f_i(s_i, u_i) + \nabla_{o_i} h_i \cdot \dot{o}_i(t) + \alpha(h_i) \geq 0\Big) \tag{2}$$

where $\dot{o}_i(t)$ is the time derivative of the observation, which depends on the behavior of other agents. Although there is no explicit expression for this term, it can be evaluated and incorporated in the learning process. Note that the CBF $h_i(s_i, o_i)$ is local in the sense that it only depends on the local state $s_i$ and observation $o_i$. We refer to the three conditions in (2) as the decentralized CBF conditions. The following proposition shows that satisfying (2) guarantees the safety of the multi-agent system.

Proposition 1 (Multi-Agent Safety Certificates with Decentralized CBF). If for every agent $i$, the initial state-observation $(s_i(0), o_i(0)) \in \{(s_i, o_i) \mid h_i(s_i, o_i) \geq 0\}$ and the decentralized CBF conditions in (2) are satisfied, then for all $i$ and all $t$, $(s_i(t), o_i(t)) \in \{(s_i, o_i) \mid h_i(s_i, o_i) \geq 0\}$, which implies the state never enters $\mathcal{X}_{i,d}$ for any agent $i$. Thus, by Definition 1, the multi-agent system is safe.

The proof of Proposition 1 is provided in the supplementary material. The key insight of Proposition 1 is that for the whole multi-agent system, the CBFs can be applied in a decentralized fashion for each agent. Since $h_i(s_i, o_i) \geq 0$ is invariant, by definition of $h_i$, $h_i(s_i, o_i) \geq 0 \Rightarrow \bar{d}_i(s) \geq \kappa_s$, which means agent $i$ never gets closer than $\kappa_s$ to any of its neighboring agents. Therefore, $\forall i,\ h_i(s_i, o_i) \geq 0$ implies $\forall i,\ \bar{d}_i(s) \geq \kappa_s$, which by definition also means $s \in \mathcal{S}_s$, and the multi-agent system is safe as defined in Definition 1.

Notice that an agent only needs to care about its local information, and if all agents respect the same form of contract (i.e., the decentralized CBF conditions), the whole multi-agent system will be safe. The fact that global safety can be guaranteed by decentralized CBFs is of great importance, since it reveals that a centralized controller coordinating all agents is not necessary for safety. A centralized control policy has to deal with the explosion of dimensions as the number of agents grows, while a decentralized design can significantly improve the scalability to an arbitrarily large number of agents.
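Since $\dot{o}_i$ has no explicit expression, a practical way to monitor the third condition in (2) along a trajectory is the finite-difference surrogate for $\dot{h}_i$ that the paper later adopts in Section 4.1. A minimal sketch (function names and the sampling interval are our assumptions):

```python
def h_dot_numerical(h_i, s_now, o_now, s_next, o_next, dt):
    """Finite-difference surrogate for grad_s h_i . f_i + grad_o h_i . o_dot,
    computed from two consecutive samples dt apart along the trajectory."""
    return (h_i(s_next, o_next) - h_i(s_now, o_now)) / dt

def violates_third_condition(h_i, alpha, s_now, o_now, s_next, o_next, dt):
    """True if h_i >= 0 here but h_dot + alpha(h_i) < 0, i.e., (2) is violated."""
    h_val = h_i(s_now, o_now)
    h_dot = h_dot_numerical(h_i, s_now, o_now, s_next, o_next, dt)
    return h_val >= 0 and h_dot + alpha(h_val) < 0
```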
3.2 Learning Framework and Generalization Guarantee

From Proposition 1, we know that if we can jointly learn the control policy $\pi_i(s_i, o_i)$ and control barrier function $h_i(s_i, o_i)$ such that the decentralized CBF conditions in (2) are satisfied, then the multi-agent system is guaranteed to be safe. Next we formulate the optimization objective for the joint learning of $\pi_i(s_i, o_i)$ and $h_i(s_i, o_i)$. To answer how well the learned $\pi_i(s_i, o_i)$ and $h_i(s_i, o_i)$ generalize to unseen scenarios, we will provide a generalization bound with probabilistic guarantees. Let $T \subset \mathbb{R}_+$ be the time interval and $\tau_i = \{s_i(t), o_i(t)\}_{t \in T}$ a trajectory of states and observations of agent $i$. Let $\mathcal{T}_i$ be the set of all possible trajectories of agent $i$, and let $\mathcal{H}_i$ and $\mathcal{V}_i$ be the function classes of $h_i$ and $\pi_i$. Define the function $y_i: \mathcal{T}_i \times \mathcal{H}_i \times \mathcal{V}_i \mapsto \mathbb{R}$ as:

$$y_i(\tau_i, h_i, \pi_i) := \min\Big\{\inf_{\mathcal{X}_{i,0} \cap \tau_i} h_i(s_i, o_i),\ \inf_{\mathcal{X}_{i,d} \cap \tau_i} -h_i(s_i, o_i),\ \inf_{\mathcal{X}_{i,h} \cap \tau_i} \big(\dot{h}_i + \alpha(h_i)\big)\Big\}. \tag{3}$$

The set $\mathcal{X}_{i,h} := \{(s_i, o_i) \mid h_i(s_i, o_i) \geq 0\}$. Notice that the third term on the right side of Equation (3) depends on both the control policy and the CBF, since $\dot{h}_i = \nabla_{s_i} h_i \cdot f_i(s_i, u_i) + \nabla_{o_i} h_i \cdot \dot{o}_i(t)$ with $u_i = \pi_i(s_i, o_i)$. It is clear that if we can find $h_i$ and $\pi_i$ such that $y_i(\tau_i, h_i, \pi_i) > 0$ for all $\tau_i \in \mathcal{T}_i$ and all $i$, then the conditions in (2) are satisfied. For each agent $i$, assume that we are given $z_i$ i.i.d. trajectories $\{\tau_i^1, \tau_i^2, \cdots, \tau_i^{z_i}\}$ drawn from distribution $\mathcal{D}_i$ during training. We solve the objective:

$$\text{For all } i, \text{ find } h_i \in \mathcal{H}_i \text{ and } \pi_i \in \mathcal{V}_i, \text{ s.t. } y_i(\tau_i^j, h_i, \pi_i) \geq \gamma, \quad \forall j = 1, 2, \cdots, z_i, \tag{4}$$

where $\gamma > 0$ is a margin that allows us to derive probabilistic guarantees later. We denote the solutions to (4) as $\hat{h}_i$ and $\hat{\pi}_i$. Denote the Rademacher complexity of the function class of $y_i$ as $\mathcal{R}_{z_i}(\mathcal{Y}_i)$, whose definition can be found in Appendix B. We also define $\epsilon_i$ as the probability that the decentralized CBF conditions are violated for agent $i$ over randomly sampled trajectories (not necessarily the samples encountered in training). Under this definition, $\epsilon_i$ measures the generalization error and can be expressed as $\epsilon_i = \mathbb{P}_{\tau_i \sim \mathcal{D}_i}\big[y_i(\tau_i, \hat{h}_i, \hat{\pi}_i) \leq 0\big]$. Then we have Proposition 2, which provides generalization guarantees for all the learned $\hat{h}_i$ and $\hat{\pi}_i$.

Proposition 2 (Generalization Error Bound of Learning Decentralized CBF). Assume that $|y| \leq b$ and (4) is feasible. Let $\hat{h}_i$ and $\hat{\pi}_i$ be the solutions to (4) and $\mu$ a universal positive constant vector. Recall that $N$ is the number of agents. Then, for any $\delta \in (0, 1)$, the following statement holds:

$$\mathbb{P}\left[\bigcap_{i=1}^{N}\left(\epsilon_i \leq \frac{\mu_i \log z_i}{\gamma}\,\mathcal{R}_{z_i}(\mathcal{Y}_i) + \frac{\mu_i \log\big(N \log(4b/\gamma)/\delta\big)}{z_i}\right)\right] \geq 1 - \delta. \tag{6}$$
Figure 1: The computational graph of the control-certificate joint learning framework in multi-agent systems. Only the graph for agent $i$ is shown, because all agents have the same graph and the computation is decentralized.

The proof is provided in the supplementary material. The left side of Equation (6) is the probability that the generalization error $\epsilon_i$ is upper bounded for all $N$ agents: Equation (6) claims that the generalization error is bounded for all agents with high probability $1 - \delta$. Similar to the discussion in Section 4 of Boffi et al. (2020), for specific function classes $\mathcal{H}_i$ and $\mathcal{V}_i$, such as Lipschitz parametric function classes or reproducing kernel Hilbert space function classes, the Rademacher complexity can be further bounded, leading to vanishing generalization errors as the number of samples $z_i$ increases. Such derivations are standard and are thus omitted, as they are not the focus of the present paper. Note that the Lipschitz function class includes some neural networks with differentiable activation functions, which will be used in our experiments in Section 4.

Although we have presented a generalization guarantee for learning decentralized CBFs, there still exists a gap between the theory and practical implementation. First, the theory does not provide a concrete way of designing loss functions to realize the optimization objectives in (4). Second, in theory there are still $N$ pairs of functions $(h_i, \pi_i)$ to be learned. Unfortunately, the dimension of the input $o_i$ to the functions $h_i, \pi_i$ differs across agents, and will even change over time in practice, since the proximity of other agents is time-varying, leading to time-varying local observations. To scale to an arbitrary number of agents, $h_i$ and $\pi_i$ should be invariant to the quantity and permutation of neighboring agents. Third, the theory does not provide ways to deal with scenarios where the decentralized CBF conditions are not (strictly) satisfied, i.e., where problem (4) is not feasible, which may very likely occur when the system becomes too complex or the function classes are not rich enough. To this end, we propose effective approaches to these issues, facilitating the scalable learning of safe multi-agent control in practice, as introduced next.
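To make the margin in (3)-(4) concrete, the sketch below evaluates the empirical margin $y_i$ on one sampled trajectory, given the CBF values, their numerical time derivatives, and a set label for each sample; all names and the linear class-$\mathcal{K}$ default are our assumptions.

```python
import numpy as np

def trajectory_margin(h_vals, hdot_vals, labels, alpha=lambda h: 1.0 * h):
    """y_i(tau_i, h_i, pi_i) from (3): the worst slack of the three decentralized
    CBF conditions over a trajectory. labels is a numpy array of strings with
    entries in {"init", "dangerous", "other"}."""
    terms = []
    init = h_vals[labels == "init"]            # samples in X_{i,0}
    danger = h_vals[labels == "dangerous"]     # samples in X_{i,d}
    if init.size:
        terms.append(init.min())               # want h_i >= 0 here
    if danger.size:
        terms.append((-danger).min())          # want h_i < 0 here
    in_zero_superlevel = h_vals >= 0           # samples in X_{i,h}
    if in_zero_superlevel.any():
        terms.append((hdot_vals + alpha(h_vals))[in_zero_superlevel].min())
    return min(terms)                          # objective (4) requires this >= gamma
```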
4 Scalable Learning of Decentralized CBF in Practice
Following the theory in Section 3, we consider the practical learning of safe multi-agent control with neural barrier certificates, i.e., using neural networks for $\mathcal{H}$ and $\mathcal{V}$. We present the formulation of the loss functions in Section 4.1, which corresponds to the objective in (4). Section 4.2 presents the neural network architectures of $h_i$ and $\pi_i$, which are invariant to the quantity and permutation of neighboring agents. Section 4.3 describes a spontaneous policy refinement method that pushes the control policy to satisfy the decentralized CBF conditions as much as possible during testing.

4.1 Loss Functions of Jointly Learning Controllers and Barrier Certificates
Based on Section 3.2, the main idea is to jointly learn the control policies and control barrier functions in multi-agent systems. During training, the CBFs regulate the control policies to satisfy the decentralized CBF conditions (2), so that the learned policies are safe. All agents are put into a single environment to generate experiences, which are combined to minimize the empirical loss function $\mathcal{L}_c = \sum_i \mathcal{L}_c^i$, where $\mathcal{L}_c^i$ is the loss function for agent $i$:

$$\mathcal{L}_c^i(\theta_i, \omega_i) = \sum_{s_i \in \mathcal{X}_{i,0}} \max\Big(0,\ \gamma - h_i^{\theta_i}(s_i, o_i)\Big) + \sum_{s_i \in \mathcal{X}_{i,d}} \max\Big(0,\ \gamma + h_i^{\theta_i}(s_i, o_i)\Big) + \sum_{s_i \in \mathcal{X}_{i,h}} \max\Big(0,\ \gamma - \nabla_{s_i} h_i^{\theta_i} \cdot f_i(s_i, \pi_i^{\omega_i}(s_i, o_i)) - \nabla_{o_i} h_i^{\theta_i} \cdot \dot{o}_i - \alpha(h_i^{\theta_i})\Big), \tag{7}$$

where $\gamma$ is the margin defined in Section 3.2, set to a small positive value in our implementation, and $\theta_i$ and $\omega_i$ are the neural network parameters. On the right side of Equation (7), the three terms enforce the three CBF conditions, respectively. Directly computing the third term can be challenging, since we need to evaluate $\dot{o}_i$, the time derivative of the observation. Instead, we approximate $\dot{h}(s_i, o_i) = \nabla_{s_i} h_i^{\theta_i} \cdot f_i(s_i, \pi_i^{\omega_i}(s_i, o_i)) + \nabla_{o_i} h_i^{\theta_i} \cdot \dot{o}_i$ numerically by $\dot{h}(s_i, o_i) \approx [h(s_i(t+\Delta t), o_i(t+\Delta t)) - h(s_i(t), o_i(t))]/\Delta t$. For the class-$\mathcal{K}$ function $\alpha(\cdot)$, we simply choose a linear function $\alpha(h) = \lambda h$.
Figure 2: Neural network architecture of the control policy. The blue part indicates the quantity-permutation invariant observation encoder, which maps $o_i(t) \in \mathbb{R}^{n \times |\mathcal{N}_i(t)|}$, whose dimension is time-varying, to a fixed-length vector. The network takes the state $s_i$ and local observation $o_i$ as input to compute a control action $u_i$. The neural network of the decentralized CBF $h_i$ has a similar architecture, except that the output is a scalar.

Note that $\mathcal{L}_c$ mainly considers safety rather than goal reaching. To train a safe control policy $\pi_i(s_i, o_i)$ that can also drive the agent to its goal state, we additionally minimize the distance between $u_i$ and $u_i^g$, where $u_i^g$ is the reference control input computed by classical approaches (e.g., LQR and PID controllers) to reach the goal. The goal-reaching loss is $\mathcal{L}_g = \sum_i \mathcal{L}_g^i$, where $\mathcal{L}_g^i(\omega_i) = \sum_{s_i \in \mathcal{X}} \|\pi_i^{\omega_i}(s_i, o_i) - u_i^g(s_i)\|$. The final loss function is $\mathcal{L} = \mathcal{L}_c + \eta \mathcal{L}_g$, where $\eta$ is a balance weight fixed to a small constant in our experiments. We present the computational graph in Figure 1 to help understand the information flow.

In training, all agents are put into a specific environment, which is not necessarily the same as the testing environment, to collect state-observation pairs $(s_i, o_i)$ under their current policies with probability $1 - \iota$ and random policies with probability $\iota$, where $\iota$ is a small exploration constant in our experiments. The collected $(s_i, o_i)$ are stored in a temporary dataset, and in every step of policy update, 128 pairs $(s_i, o_i)$ are randomly sampled from the temporary dataset to calculate the total loss $\mathcal{L}$. We minimize $\mathcal{L}$ by applying stochastic gradient descent, with fixed learning rate and weight decay, to $\theta_i$ and $\omega_i$, the parameters of the CBFs and control policies. Note that the gradients are computed by back-propagation rather than policy gradients, because $\mathcal{L}$ is differentiable w.r.t. $\theta_i$ and $\omega_i$.
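A minimal PyTorch sketch of the per-agent loss in (7) is given below, using the finite-difference surrogate for $\dot{h}_i$ described above. Here h_now and h_next are the CBF evaluated on consecutive state-observation pairs (so the policy parameters enter through the next state it produces), and the masks flag samples in $\mathcal{X}_{i,0}$ and $\mathcal{X}_{i,d}$; the hyperparameter values are our assumptions.

```python
import torch

def cbf_loss(h_now, h_next, init_mask, danger_mask,
             gamma=1e-2, alpha_coef=1.0, dt=0.05):
    """Per-agent empirical loss L_c^i from (7), with h_dot ~ (h_next - h_now) / dt."""
    loss_init = torch.relu(gamma - h_now)[init_mask].sum()       # 1st condition
    loss_danger = torch.relu(gamma + h_now)[danger_mask].sum()   # 2nd condition
    h_dot = (h_next - h_now) / dt                                # numerical h_dot
    in_zero_superlevel = h_now >= 0                              # samples in X_{i,h}
    loss_deriv = torch.relu(
        gamma - h_dot - alpha_coef * h_now)[in_zero_superlevel].sum()  # 3rd condition
    return loss_init + loss_danger + loss_deriv
```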
4.2 Quantity-Permutation Invariant Observation Encoder

Recall that in Section 3.1 we defined $o_i$ as the local observation of agent $i$: $o_i$ contains the states of neighboring agents, and its dimension can change dynamically. In order to scale to an arbitrary number of agents, there are two pivotal principles for designing the neural network architectures of $h_i(s_i, o_i)$ and $\pi_i(s_i, o_i)$. First, the architecture should dynamically adapt to the changing number of observed agents, which affects the dimension of $o_i$. Second, the architecture should be invariant to the permutation of observed agents, which should not affect the output of $h_i$ or $\pi_i$. Both challenges arise from encoding the local observation $o_i$. Inspired by PointNet (Qi et al., 2017), we leverage a max-pooling layer to build the quantity-permutation invariant observation encoder.

Let us start with a simple example with input observation $o_i(t) \in \mathbb{R}^{n \times |\mathcal{N}_i(t)|}$, where $n$ is the dimension of the state and $\mathcal{N}_i(t)$ is the set of neighboring agents at time $t$; $n$ is fixed, while $|\mathcal{N}_i(t)|$ can change from time to time. The permutation of the columns of $o_i$ is also dynamic. Denote the weight matrix by $W \in \mathbb{R}^{p \times n}$ and the element-wise ReLU activation by $\sigma(\cdot)$. Define the row-wise max-pooling operation $\mathrm{RowMax}(\cdot)$, which takes a matrix as input and outputs the maximum value of each row. Consider the following mapping $\rho: \mathbb{R}^{n \times |\mathcal{N}_i(t)|} \mapsto \mathbb{R}^p$:

$$\rho(o_i) = \mathrm{RowMax}\big(\sigma(W o_i)\big), \tag{8}$$

where $\rho$ maps a matrix $o_i$ whose columns have dynamic quantity and permutation to a fixed-length feature vector $\rho(o_i) \in \mathbb{R}^p$. The dimension of $\rho(o_i)$ remains the same even if the number of columns of $o_i(t)$, i.e., $|\mathcal{N}_i(t)|$, changes over time. The network architecture of the control policy, which uses the $\mathrm{RowMax}(\cdot)$ operation, is shown in Figure 2. The network of the control barrier function is similar, except that the output is a scalar instead of a vector.
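A compact PyTorch rendering of the encoder in (8) follows; it applies a shared linear map to each neighbor column, then a ReLU and a row-wise max, so the output is invariant to both the number and the ordering of neighbors. Layer sizes here are our assumptions.

```python
import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """rho(o_i) = RowMax(sigma(W o_i)) from (8)."""
    def __init__(self, n: int = 4, p: int = 128):
        super().__init__()
        self.W = nn.Linear(n, p, bias=False)   # weights shared across neighbors

    def forward(self, o_i: torch.Tensor) -> torch.Tensor:
        # o_i: (n, |N_i(t)|) with a time-varying number of columns
        feats = torch.relu(self.W(o_i.T))      # (|N_i(t)|, p)
        return feats.max(dim=0).values         # row-wise max -> fixed length p

enc = ObservationEncoder()
o = torch.randn(4, 7)                          # 7 neighbors observed
perm = o[:, torch.randperm(7)]                 # reorder the neighbors
assert torch.allclose(enc(o), enc(perm))       # permutation invariance holds
```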
4.3 Spontaneous Online Policy Refinement

We propose a spontaneous online policy refinement approach that produces even safer control policies in testing than the neural network has actually learned during training. When the model dynamics or environment settings are too complex and exceed the capability of the control policy, the decentralized CBF conditions can be violated at some points along the trajectories. Thanks to the control barrier function jointly learned with the control policy, we can refine the control input $u_i$ online by minimizing the violation of the decentralized CBF conditions. That is, the learned CBF can serve as guidance for generating an updated $u_i$ in unseen scenarios to guarantee safety. This is also a standard technique in (non-learning) CBF control, where the CBF $h$ is usually computed first using optimization methods like Sum-of-Squares, and the control inputs $u$ are then computed online from $h$ by solving quadratic programming problems (Xu et al., 2017; Ames et al., 2017). In the experiments, we also study the effects of this online policy refinement step.

Given the state $s_i$, local observation $o_i$, and action $u_i$ computed by the control policy, consider the scenario where the third CBF condition is violated, i.e., $\nabla_{s_i} h_i \cdot f_i(s_i, u_i) + \nabla_{o_i} h_i \cdot \dot{o}_i + \alpha(h_i) < 0$ when $h_i \geq 0$. Let $e_i \in \mathbb{R}^m$ be an increment to the action $u_i$. Define $\phi: \mathbb{R}^m \mapsto \mathbb{R}$ as:

$$\phi(e_i) = \max\big(0,\ -\nabla_{s_i} h_i \cdot f_i(s_i, u_i + e_i) - \nabla_{o_i} h_i \cdot \dot{o}_i - \alpha(h_i)\big) + \mu \|e_i\|. \tag{9}$$

If the first term on the right side of Equation (9) is 0, then the third CBF condition is satisfied. We can enforce this satisfaction at every timestep of testing (after $u_i$ is computed by the neural network controller) by finding an $e_i$ that minimizes $\phi(e_i)$; $\mu$ is a regularization factor that penalizes large $e_i$. We set $\mu = 1$ in the implementation and observed in our experiments that a fixed $\mu$ is sufficient to keep $\|u_i + e_i\|$ within the constraint on the control input bound. When evaluating on new scenarios where the constraint is violated, one can dynamically increase $\mu$ to strengthen the penalty. At every timestep during testing, we initialize $e_i$ to zero and check the value of $\phi(e_i)$: $\phi(e_i) > 0$ indicates that the control policy is not good enough to satisfy the decentralized CBF conditions. We then iteratively refine $e_i$ by $e_i \leftarrow e_i - \nabla_e \phi(e_i)$ until $\phi(e_i) - \mu \|e_i\| = 0$ or the maximum allowed number of iterations is exceeded. The final control input is $u_i + e_i$. Such a procedure flexibly refines the control input to satisfy the decentralized CBF conditions as much as possible.
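A sketch of this refinement loop is shown below, again using the finite-difference surrogate for $\dot{h}_i$ and plain gradient descent on $\phi(e_i)$; the differentiable one-step rollout, step size, iteration budget, and the smoothed norm are our assumptions rather than details from the paper.

```python
import torch

def refine_action(h_i, step, s_i, o_i, u_i, alpha_coef=1.0, mu=1.0,
                  dt=0.05, lr=0.1, max_iters=20):
    """Gradient descent on phi(e_i) from (9), with h_dot evaluated by a
    finite difference through an (assumed) differentiable one-step rollout."""
    e_i = torch.zeros_like(u_i, requires_grad=True)
    h_val = h_i(s_i, o_i).detach()
    for _ in range(max_iters):
        s_next, o_next = step(s_i, o_i, u_i + e_i)    # rollout under u_i + e_i
        h_dot = (h_i(s_next, o_next) - h_val) / dt
        violation = torch.relu(-h_dot - alpha_coef * h_val)
        if violation.item() == 0.0:                   # third condition satisfied
            break
        phi = violation + mu * torch.sqrt((e_i ** 2).sum() + 1e-12)  # smoothed ||e_i||
        phi.backward()
        with torch.no_grad():
            e_i -= lr * e_i.grad
        e_i.grad.zero_()
    return (u_i + e_i).detach()
```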
Figure 3: Illustrations of the 2D environments used in the experiments. The Navigation and Predator-Prey environments are adopted from the multi-agent particle environment (Lowe et al., 2017). The Nested-Rings environment is adopted from Rodríguez-Seda et al. (2014).

Figure 4: Safety rate and reward in the 2D tasks. Results are taken after each method converged and are averaged over 10 independent trials.

Figure 5: Environments and results of the 3D tasks. In Maze and Tunnel, the initial and target locations of each drone are randomly chosen. The drones start from the initial locations and aim to reach the targets without collision. The results are taken after each method converged and are averaged over 10 independent trials.
5 Experimental Results
Baseline Approaches.
The baseline approaches we compare with include MAMPS (Zhang & Bastani, 2019), PIC (Liu et al., 2020), and MADDPG (Lowe et al., 2017). For the drone tasks, we also compare with the model-based planning method S2M2 (Chen et al., 2021). A brief description of each method follows. MAMPS leverages the model dynamics to iteratively switch to safe control policies when the learned policies are unsafe. PIC proposes a permutation-invariant critic to enhance the performance of multi-agent RL; we incorporate a safety reward into its reward function and denote this safe version of PIC as PIC-Safe. The safety reward is -1 when the agent enters the dangerous set. MADDPG is a pioneering work on multi-agent RL, and MADDPG-Safe is obtained by adding a safety reward to the reward function, similar to PIC-Safe. S2M2 is a state-of-the-art model-based multi-agent safe motion planner: when directly planning for all agents fails, S2M2 evenly divides the agent group into smaller partitions for replanning, until collision-free paths are found for each partition; the agents then follow the generated paths using PID or LQR controllers.

For each task, the environment model is the same for all methods. The exact model dynamics are visible to the model-based methods, including MAMPS, S2M2, and ours, and invisible to the model-free MADDPG and PIC. Since the model-free methods have access only to simulators rather than the model dynamics, they are more data-demanding: the number of state-observation pairs used to train MADDPG and PIC is many times that of the model-based learning methods, to make sure they converge to their best performance. When training the RL-based methods, the control action computed by LQR for goal reaching is also fed to the agent as one of the inputs to the actor network, so the RL agents can learn to use LQR as a reference for goal reaching.
Evaluation Criteria.

Since the primary focus of this paper is the safety of multi-agent systems, we use the safety rate as a criterion when evaluating the methods. The safety rate is calculated as $\frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{t \in T}\big[\mathbb{I}\big((s_i(t), o_i(t)) \in \mathcal{X}_{i,s}\big)\big]$, where $\mathbb{I}(\cdot)$ is the indicator function that is 1 when its argument is true and 0 otherwise. The observation $o_i$ contains the states of other agents within the observation radius, which is 10 times the safe distance; the safe distance is set to the diagonal length of the bounding box of the agent. In addition to the safety rate, we also calculate the average reward, which measures how well the task is accomplished. The agent is given a +10 reward if it reaches the goal and a -1 reward each time it enters the dangerous set. Note that the agent might enter the dangerous set many times before reaching the goal. The upper bound of the total reward for an agent is +10, which is attained when the agent successfully reaches the goal and always stays in the safe set.
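A sketch of the safety-rate computation, assuming the per-step safety indicators have already been logged into an $N \times T$ boolean array (the indicator itself is environment-specific and omitted):

```python
import numpy as np

def safety_rate(safe_flags: np.ndarray) -> float:
    """(1/N) sum_i E_{t in T}[ I((s_i(t), o_i(t)) in X_{i,s}) ] for an (N, T) log."""
    per_agent = safe_flags.mean(axis=1)   # time average of the indicator per agent
    return float(per_agent.mean())        # average over the N agents
```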
Ground Robots.

We consider three tasks, illustrated in Figure 3. In the Navigation task, each agent starts from a random location and aims to reach a random goal. In the Predator-Prey task, the preys aim to gather the food while avoiding being caught by the predators chasing them; we only consider the safety of the preys, not the predators. In the Nested-Rings task, the agents aim to follow reference trajectories while avoiding collisions. In order for the RL-based agents to follow the ring trajectories, we also give the agents a negative reward proportional to the distance to the nearest point on the rings. When adding more agents to an environment, we also enlarge the area of the environment to ensure the overall density of agents remains similar.

Figure 4 demonstrates that when the number of agents grows (e.g., 32 agents), our approach (MDBC) can still maintain a high safety rate and average reward, while the other methods perform much worse. However, our method still cannot guarantee that the agents are always safe. The failure is mainly because we cannot make sure the decentralized CBF conditions are satisfied for every state-observation pair in testing, even if they are satisfied on all training samples, due to the generalization error identified in Proposition 2. We also show the generalization capability of MDBC with up to 1024 agents in the appendix, and visualization results in the supplementary material.

Figure 6: Generalization capability of MDBC in the 3D tasks. MDBC can be trained with 8 agents in one environment and generalize to 1024 agents in another environment in testing.
Figure 7: Illustration of the Maze environment with 1024 drones. Videos can be found in the supplementary material.
Drones.
We experiment with 3D drones, whose dynamics are even more complex. Figure 5 shows the environments and the results of each approach. Similar to the results for ground robots, when there is a large number of agents (e.g., 32), our method can still maintain a high reward and safety rate, while the other methods perform worse. Figure 6 shows the generalization capability of our method across different environments and numbers of agents. For each experiment, we train with 8 agents but test with up to 1024 agents; the extra agents are added by copying the neural network parameters of the trained 8 agents. The results show that our method has remarkable generalization capability to diverse scenarios. The related work of Chen et al. (2020) can also handle the safe multi-drone control problem via CBFs, but their CBF is handcrafted and relies on quadratic programming to solve for $u_i$. Their paper only reported results on two agents, and for 32 agents it would take more than 70 hours for a single run of evaluation (7000 steps and 36 seconds per step). By contrast, our method takes only seconds for a single run of evaluation with 32 agents, showing a significant advantage in computational efficiency. For both the ground robot and drone experiments, we provide video demonstrations in the supplementary material.
6 Conclusion

This paper presents a novel approach to learning safe multi-agent control by jointly learning decentralized control barrier functions as safety certificates. We provide a theoretical generalization bound, as well as effective techniques to realize the learning framework in practice. Experiments show that our method significantly outperforms previous methods by scaling to an arbitrary number of agents, and demonstrates remarkable generalization capability to unseen and complex multi-agent environments.

References
Javier Alonso-Mora, Andreas Breitenmoser, Martin Rufli, Paul Beardsley, and Roland Siegwart. Optimal reciprocal collision avoidance for multiple non-holonomic robots. In Distributed Autonomous Robotic Systems, pp. 203–216. Springer, 2013.

Aaron D Ames, Jessy W Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs with application to adaptive cruise control. In Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 6271–6278. IEEE, 2014.

Aaron D Ames, Xiangru Xu, Jessy W Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2017.

Nicholas M Boffi, Stephen Tu, Nikolai Matni, Jean-Jacques E Slotine, and Vikas Sindhwani. Learning stability certificates from data. arXiv preprint arXiv:2008.05952, 2020.

Urs Borrmann, Li Wang, Aaron D Ames, and Magnus Egerstedt. Control barrier certificates for safe swarm behavior. IFAC-PapersOnLine, 48(27):68–73, 2015.

Ya-Chien Chang, Nima Roohi, and Sicun Gao. Neural Lyapunov control. In Advances in Neural Information Processing Systems, pp. 3245–3254, 2019.

Jingkai Chen, Jiaoyang Li, Chuchu Fan, and Brian C. Williams. Scalable and safe multi-agent motion planning with nonlinear dynamics and bounded disturbances. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), 2021.

Yu Fan Chen, Michael Everett, Miao Liu, and Jonathan P How. Socially aware motion planning with deep reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350. IEEE, 2017a.

Yuxiao Chen, Huei Peng, and Jessy Grizzle. Obstacle avoidance for low-speed autonomous vehicles with barrier function. IEEE Transactions on Control Systems Technology, 26(1):194–206, 2017b.

Yuxiao Chen, Andrew Singletary, and Aaron D Ames. Guaranteed obstacle avoidance for multi-robot operations with limited actuation: a control barrier function approach. IEEE Control Systems Letters, 5(1):127–132, 2020.

Richard Cheng, Gábor Orosz, Richard M Murray, and Joel W Burdick. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In AAAI Conference on Artificial Intelligence, volume 33, pp. 3387–3395, 2019.

Richard Cheng, Mohammad Javad Khojasteh, Aaron D Ames, and Joel W Burdick. Safe multi-agent interaction through robust control barrier functions with learned uncertainties. arXiv preprint arXiv:2004.05273, 2020.

Jason Choi, Fernando Castañeda, Claire J Tomlin, and Koushil Sreenath. Reinforcement learning for safety-critical control under model uncertainty, using control Lyapunov functions and control barrier functions. arXiv preprint arXiv:2004.07584, 2020.

Michael Everett, Yu Fan Chen, and Jonathan P How. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3052–3059. IEEE, 2018.

Chuchu Fan, Kristina Miller, and Sayan Mitra. Fast and guaranteed safe controller synthesis for nonlinear vehicle models. In Shuvendu K. Lahiri and Chao Wang (eds.), Computer Aided Verification, pp. 629–652, Cham, 2020. Springer International Publishing.

Paul Glotfelter, Jorge Cortés, and Magnus Egerstedt. Nonsmooth barrier functions with applications to multi-robot systems. IEEE Control Systems Letters, 1(2):310–315, 2017.

Wanxin Jin, Zhaoran Wang, Zhuoran Yang, and Shaoshuai Mou. Neural certificates for safe control policies. arXiv preprint arXiv:2006.08465, 2020.

Iou-Jen Liu, Raymond A Yeh, and Alexander G Schwing. PIC: Permutation invariant critic for multi-agent deep reinforcement learning. In Conference on Robot Learning, pp. 590–602, 2020.

Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390, 2017.

Hang Ma, Daniel Harabor, Peter J Stuckey, Jiaoyang Li, and Sven Koenig. Searching with consistent prioritization for multi-agent path finding. AAAI 2019: Thirty-Third AAAI Conference on Artificial Intelligence, 33(1):7643–7650, 2019.

Ian M Mitchell, Alexandre M Bayen, and Claire J Tomlin. A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. IEEE Transactions on Automatic Control, 50(7):947–957, 2005.

Stephen Prajna, Ali Jadbabaie, and George J Pappas. A framework for worst-case and stochastic safety verification using barrier certificates. IEEE Transactions on Automatic Control, 52(8):1415–1428, 2007.

Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Han Qie, Dianxi Shi, Tianlong Shen, Xinhai Xu, Yuan Li, and Liujing Wang. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access, 7:146264–146272, 2019.

Alexander Robey, Haimin Hu, Lars Lindemann, Hanwen Zhang, Dimos V Dimarogonas, Stephen Tu, and Nikolai Matni. Learning control barrier functions from expert demonstrations. arXiv preprint arXiv:2004.03315, 2020.

Erick J. Rodríguez-Seda, Chinpei Tang, Mark W. Spong, and Dušan M. Stipanović. Trajectory tracking with collision avoidance for nonholonomic vehicles with acceleration constraints and limited sensing. The International Journal of Robotics Research, 33(12):1569–1592, 2014.

Matteo Saveriano and Dongheui Lee. Learning barrier functions for constrained motion planning with dynamical systems. arXiv preprint arXiv:2003.11500, 2020.

Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.

Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 23, pp. 2199–2207, 2010.

Mohit Srinivasan, Amogh Dabholkar, Samuel Coogan, and Patricio Vela. Synthesis of control barrier functions using a supervised machine learning approach. arXiv preprint arXiv:2003.04950, 2020.

Andrew Taylor, Andrew Singletary, Yisong Yue, and Aaron Ames. Learning for safety-critical control with control barrier functions. In Learning for Dynamics and Control, pp. 708–717, 2020.

Jur Van den Berg, Ming Lin, and Dinesh Manocha. Reciprocal velocity obstacles for real-time multi-agent navigation. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1928–1935. IEEE, 2008.

Li Wang, Aaron D Ames, and Magnus Egerstedt. Safety barrier certificates for collisions-free multirobot systems. IEEE Transactions on Robotics, 33(3):661–674, 2017.

Li Wang, Evangelos A Theodorou, and Magnus Egerstedt. Safe learning of quadrotor dynamics using barrier certificates. In IEEE International Conference on Robotics and Automation (ICRA), pp. 2460–2465. IEEE, 2018.

Peter Wieland and Frank Allgöwer. Constructive safety using control barrier functions. IFAC Proceedings Volumes, 40(12):462–467, 2007.

Xiangru Xu, Jessy W Grizzle, Paulo Tabuada, and Aaron D Ames. Correctness guarantees for the composition of lane keeping and adaptive cruise control. IEEE Transactions on Automation Science and Engineering, 15(3):1216–1229, 2017.

Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. Fully decentralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pp. 5872–5881, 2018.

Wenbo Zhang and Osbert Bastani. MAMPS: Safe multi-agent reinforcement learning via model predictive shielding. arXiv preprint arXiv:1910.12639, 2019.

Lianmin Zheng, Jiacheng Yang, Han Cai, Weinan Zhang, Jun Wang, and Yong Yu. MAgent: A many-agent reinforcement learning platform for artificial collective intelligence. arXiv preprint arXiv:1712.00600, 2017.
A Proof of Proposition 1

Since $\dot{h}_i = \nabla_{s_i} h_i \cdot f_i(s_i, u_i) + \nabla_{o_i} h_i \cdot \dot{o}_i(t)$, the satisfaction of (2) implies:

$$\forall (s_i, o_i) \in \mathcal{X}_{i,0},\quad h_i(s_i, o_i) \geq 0$$
$$\forall (s_i, o_i) \in \mathcal{X}_{i,d},\quad h_i(s_i, o_i) < 0$$
$$\forall (s_i, o_i) \in \{(s_i, o_i) \mid h_i(s_i, o_i) \geq 0\},\quad \dot{h}_i + \alpha(h_i) \geq 0. \tag{10}$$

The initial condition $(s_i(0), o_i(0)) \in \{(s_i, o_i) \mid h_i(s_i, o_i) \geq 0\}$ means that $h_i \geq 0$ at time $t = 0$. Since $\dot{h}_i + \alpha(h_i) \geq 0$, $h_i$ will stay non-negative, which is proved in Section 2 of Ames et al. (2014). This means that $(s_i(t), o_i(t)) \notin \mathcal{X}_{i,d}$ for all $t > 0$. Thus, for every agent $i$ and all $t > 0$, agent $i$ never enters the dangerous set, and the whole multi-agent system is safe by Definition 1.

Remark.
Since the input dimension and permutation of $h_i$ can change with time, the time derivative of $h_i$ does not exist everywhere, but it does exist almost everywhere. In fact, the safety guarantee provided by Proposition 1 does not require the time derivative of $h_i$ to exist everywhere, and $h_i$ can also be non-smooth. Based on (2) of Glotfelter et al. (2017), we can define $\dot{h}_i$ via the generalized gradient, which always exists when $h_i$ is non-smooth and whose time derivative exists almost everywhere. Then, based on Theorem 2 of Glotfelter et al. (2017), as long as the CBF conditions are satisfied under the generalized gradient, $h_i$ is a valid CBF and safety is guaranteed.

Global CBF from decentralized CBFs.
In addition to Proposition 1, another way to prove the global safety of the multi-agent system under the decentralized CBFs is to construct a global CBF $h_g: \mathcal{S} \mapsto \mathbb{R}$ from the individual CBFs by taking the minimum:

$$h_g(s) := \min\{h_1(s \downarrow s_1, s \downarrow o_1),\ h_2(s \downarrow s_2, s \downarrow o_2),\ \cdots,\ h_N(s \downarrow s_N, s \downarrow o_N)\}, \tag{11}$$

where $s \downarrow s_i$ is the projection of the global state onto the state of agent $i$ and $s \downarrow o_i$ is the projection of the global state onto the observation of agent $i$. Then the following proposition guarantees the global safety of the multi-agent system.
Proposition 3. If (10) is satisfied for every agent $i$, then the global CBF $h_g(s)$ satisfies:

$$\forall s \in \mathcal{S}_0,\quad h_g(s) \geq 0$$
$$\forall s \in \mathcal{S}_d,\quad h_g(s) < 0$$
$$\forall s \in \{s \mid h_g(s) \geq 0\},\quad \dot{h}_g + \alpha(h_g) \geq 0, \tag{12}$$

where $\mathcal{S}_0 = \{s \in \mathcal{S} \mid \forall i,\ (s \downarrow s_i, s \downarrow o_i) \in \mathcal{X}_{i,0}\}$ and $\mathcal{S}_d = \{s \in \mathcal{S} \mid \exists i,\ (s \downarrow s_i, s \downarrow o_i) \in \mathcal{X}_{i,d}\}$. Then for all $t > 0$, $h_g(s(t)) \geq 0$ and $s \notin \mathcal{S}_d$, which means the multi-agent system is globally safe.
Proof. Let us first prove that the satisfaction of (10) implies the satisfaction of (12). By definition of $\mathcal{S}_0$, when $s \in \mathcal{S}_0$ we have $\forall i,\ (s \downarrow s_i, s \downarrow o_i) \in \mathcal{X}_{i,0}$, which means $\forall i,\ h_i(s \downarrow s_i, s \downarrow o_i) \geq 0$. Thus $h_g(s) = \min_i\{h_i(s \downarrow s_i, s \downarrow o_i)\} \geq 0$. When $s \in \mathcal{S}_d$, we have $\exists i,\ (s \downarrow s_i, s \downarrow o_i) \in \mathcal{X}_{i,d}$, which means $\exists i,\ h_i(s \downarrow s_i, s \downarrow o_i) < 0$. Thus $h_g(s) = \min_i\{h_i(s \downarrow s_i, s \downarrow o_i)\} < 0$. When $s \in \{s \mid h_g(s) \geq 0\}$, we have $\forall i,\ (s \downarrow s_i, s \downarrow o_i) \in \{(s \downarrow s_i, s \downarrow o_i) \mid h_i(s \downarrow s_i, s \downarrow o_i) \geq 0\}$, so $\forall i,\ \dot{h}_i + \alpha(h_i) \geq 0$. Let $i^* = \arg\min_i\{h_i(s \downarrow s_i, s \downarrow o_i)\}$. Then $h_g(s) = h_{i^*}(s \downarrow s_{i^*}, s \downarrow o_{i^*})$ and $\dot{h}_g + \alpha(h_g) = \dot{h}_{i^*} + \alpha(h_{i^*}) \geq 0$. Hence the satisfaction of (10) implies the satisfaction of (12). Then, based on Section 2 of Ames et al. (2014), we have $h_g(s(t)) \geq 0$ for all $t > 0$. This means $s(t) \notin \mathcal{S}_d$ for all $t > 0$, and the multi-agent system is globally safe.
B Proof of Proposition 2

Note that $\epsilon_i = \mathbb{P}_{\tau_i \sim \mathcal{D}_i}\big[y_i(\tau_i, \hat{h}_i, \hat{\pi}_i) \leq 0\big] = \mathbb{E}_{\tau_i \sim \mathcal{D}_i}\big[\mathbb{I}\big(y_i(\tau_i, \hat{h}_i, \hat{\pi}_i) \leq 0\big)\big]$. Under zero empirical loss, using Theorem 5 in Srebro et al. (2010), for any $\delta/N > 0$ the following statement holds with probability at least $1 - \delta/N$:

$$\epsilon_i \leq \frac{\mu_i \log z_i}{\gamma}\,\mathcal{R}_{z_i}(\mathcal{Y}_i) + \frac{\mu_i \log\big(N \log(4b/\gamma)/\delta\big)}{z_i}, \tag{13}$$

where $\mu_i > 0$ is some universal constant. By taking the union bound over all $N$ agents, the following statement holds with probability at least $(1 - \delta/N)^N$:

$$\bigcap_{i=1}^{N}\left(\epsilon_i \leq \frac{\mu_i \log z_i}{\gamma}\,\mathcal{R}_{z_i}(\mathcal{Y}_i) + \frac{\mu_i \log\big(N \log(4b/\gamma)/\delta\big)}{z_i}\right). \tag{14}$$

Since $(1 - \delta/N)^N \geq 1 - \delta$ for $\delta \in (0, 1)$, we have:

$$\mathbb{P}\left[\bigcap_{i=1}^{N}\left(\epsilon_i \leq \frac{\mu_i \log z_i}{\gamma}\,\mathcal{R}_{z_i}(\mathcal{Y}_i) + \frac{\mu_i \log\big(N \log(4b/\gamma)/\delta\big)}{z_i}\right)\right] \geq 1 - \delta, \tag{15}$$

which completes the proof.

The Rademacher complexity $\mathcal{R}_{z_i}(\mathcal{Y}_i)$ is defined as:

$$\mathcal{R}_{z_i}(\mathcal{Y}_i) := \sup_{\tau_i^1, \cdots, \tau_i^{z_i} \sim \mathcal{D}_i} \mathbb{E}_{\xi \sim \mathrm{Unif}(\{\pm 1\}^{z_i})} \sup_{h_i \in \mathcal{H}_i, \pi_i \in \mathcal{V}_i} \frac{1}{z_i}\left|\sum_{j=1}^{z_i} \xi_j\, y_i(\tau_i^j, h_i, \pi_i)\right|,$$

where $\xi \in \mathbb{R}^{z_i}$ is a random vector and $\xi_j$ denotes its $j$-th element. $\mathcal{R}_{z_i}(\mathcal{Y}_i)$ characterizes the richness of the function class $\mathcal{Y}_i$.
C Model Dynamics
In the experiment section of the main paper, we used 2D ground robots and 3D drones. The model dynamics of the ground robot are adopted from Rodríguez-Seda et al. (2014). For the drones, we use the following dynamics:

$$s_i = \begin{bmatrix} x_i \\ y_i \\ z_i \\ v_{x,i} \\ v_{y,i} \\ v_{z,i} \\ \theta_{x,i} \\ \theta_{y,i} \end{bmatrix}, \qquad \frac{ds_i}{dt} = \begin{bmatrix} v_{x,i} \\ v_{y,i} \\ v_{z,i} \\ g \tan(\theta_{x,i}) \\ g \tan(\theta_{y,i}) \\ a_{z,i} \\ \omega_{x,i} \\ \omega_{y,i} \end{bmatrix}, \qquad u_i = \begin{bmatrix} \omega_{x,i} \\ \omega_{y,i} \\ a_{z,i} \end{bmatrix}. \tag{16}$$
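A direct transcription of (16) as a forward-Euler step, for illustration; the gravity constant and step size are our assumptions:

```python
import numpy as np

G, DT = 9.8, 0.02   # gravity and integration step (illustrative values)

def drone_step(s: np.ndarray, u: np.ndarray, dt: float = DT) -> np.ndarray:
    """One Euler step of (16). s = [x, y, z, vx, vy, vz, theta_x, theta_y],
    u = [omega_x, omega_y, a_z]."""
    x, y, z, vx, vy, vz, tx, ty = s
    wx, wy, az = u
    ds = np.array([vx, vy, vz,
                   G * np.tan(tx), G * np.tan(ty), az,
                   wx, wy])
    return s + dt * ds
```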
D Supplementary Experiment
For the 3D drones, we have shown the generalization capability to 1024 agents even though our method is trained with 8 agents (Figure 6). For the 2D ground robots, we have similar results, which were omitted from the main paper due to space limitations; we present them in Figure 8 below. Our method demonstrates exceptional generalization capability to testing scenarios where the number of agents is significantly greater than in training: the safety rate and average reward remain high even when the number of agents grows exponentially.
Ablation Study on Online Policy Refinement.
In Section 4.3, we introduced a test-time policy refinement method. Here we study the effect of this method on our performance and present the results in Table 1. Even without OPR, the safety rate is still promising; OPR further improves it. The steps requiring OPR in testing account for only a small proportion of the total steps (see Table 1), and this proportion gradually saturates rather than increasing significantly as the number of agents grows.

Visualization of the Learned CBF.
To better understand the learned decentralized CBF, we provide a visualization in Figure 9. The CBF is learned in the Maze environment with two agents.

Figure 8: Generalization capability of our method in the 2D tasks. Our method is trained with 8 agents and tested with up to 1024 agents.
Table 1: Effect of online policy refinement (OPR). Proportion of OPR stands for the proportion of testing steps in which OPR is performed.
| Environment | Config  | Safety Rate (4 / 8 / 32 / 1024 agents) | Proportion of OPR (4 / 8 / 32 / 1024 agents) |
|-------------|---------|----------------------------------------|----------------------------------------------|
| Maze        | w/ OPR  | 0.9999 / 0.9999 / 0.9987 / 0.9956      | 0.0149 / 0.0958 / 0.1423 / 0.1655            |
| Maze        | w/o OPR | 0.9999 / 0.9999 / 0.9869 / 0.9741      | 0 / 0 / 0 / 0                                |
| Tunnel      | w/ OPR  | 0.9999 / 0.9998 / 0.9988 / 0.9946      | 0.0117 / 0.0729 / 0.1271 / 0.1493            |
| Tunnel      | w/o OPR | 0.9999 / 0.9992 / 0.9866 / 0.9727      | 0 / 0 / 0 / 0                                |
Figure 9: Visualization of the learned CBF in the Maze environment with 2 agents. The red area is where the distance between agents is less than the safe threshold.